The purpose of writing Domainatrix was to build an OS specifically for serving, i.e. one centred on I/O performance, while introducing a couple of new features specifically for the Web. The architecture I had in mind was a kernel built around high-performance networking (i.e. the TCP/IP stack) while offering as near hardware-level access to storage devices as possible, to shorten the path between them and the network. A large amount of thinking has gone, and is still going, into its design to make this possible while keeping other features intact, such as hotswapping and the Merced port (which has provisions only for memory-mapped I/O, requiring a *severe* redesign of our memory management subsystem). While we do currently offer abstractions, that may change later, with the OS content to do only *multiplexing*, i.e. restricting its duties to checking whether a given hardware access is _safe_ (see [1]).
We have decided to break up the project into separate codestreams for the 32-bit and 64-bit versions. (As of this writing the future of the Itanium processor seems to be somewhat in doubt; we may presume that it will be out *sometime*, but I don't expect to have an actual box on my desk till fall 2001 at the very least, and the simulators will have to do till then.) As such, we've decided to use the ideas I originally had to get a working Dx-32 out the door, while Dx-64 will be a thorough redesign, less experimental and more robust.
Dx-32 uses *only* a segmented memory management scheme; it was decided that its simplicity, given our objectives, outweighs the disadvantages. A simple linked list keeps track of the segments in memory. When a new one needs to be allocated, we simply search through all of them for a hole in memory big enough; failing that, segments are moved up in memory, in order of the holes, until a sufficiently big hole has been created.
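
The allocation scheme just described could be sketched roughly as follows. This is only an illustration in C (Dx itself is written in assembly), and every name in it (struct segment, find_hole, compact, seg_alloc) is hypothetical. It also makes two assumptions the text does not settle: that the list is kept sorted by address, and that "moving segments up" means sliding them toward the start of memory. For brevity it closes every hole in one pass rather than stopping as soon as one hole is big enough.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical segment record; the field names are illustrative only. */
struct segment {
    uint32_t base;          /* start address of the segment */
    uint32_t size;          /* length in bytes */
    struct segment *next;   /* next segment, in ascending address order */
};

#define NO_HOLE UINT32_MAX  /* sentinel: no hole large enough was found */

/* First-fit search over [mem_start, mem_end). On success returns the base
 * of the hole and stores in *prev_out the segment to link the new one
 * after (NULL means "at the head of the list"). */
uint32_t find_hole(struct segment *head, uint32_t mem_start, uint32_t mem_end,
                   uint32_t want, struct segment **prev_out)
{
    uint32_t cursor = mem_start;
    struct segment *prev = NULL;
    for (struct segment *s = head; s != NULL; prev = s, s = s->next) {
        if (s->base - cursor >= want) {     /* gap in front of s */
            *prev_out = prev;
            return cursor;
        }
        cursor = s->base + s->size;
    }
    if (mem_end - cursor >= want) {         /* gap after the last segment */
        *prev_out = prev;
        return cursor;
    }
    return NO_HOLE;
}

/* Slide every segment toward mem_start so all holes merge into one at the
 * end. A real kernel would also copy the memory and patch descriptors. */
void compact(struct segment *head, uint32_t mem_start)
{
    uint32_t cursor = mem_start;
    for (struct segment *s = head; s != NULL; s = s->next) {
        s->base = cursor;
        cursor += s->size;
    }
}

/* Allocate `want` bytes for `fresh`: first fit, else compact and retry.
 * Returns 0 on success, -1 if total free memory is insufficient. */
int seg_alloc(struct segment **head, struct segment *fresh, uint32_t want,
              uint32_t mem_start, uint32_t mem_end)
{
    struct segment *prev = NULL;
    assert(want > 0);
    uint32_t base = find_hole(*head, mem_start, mem_end, want, &prev);
    if (base == NO_HOLE) {
        compact(*head, mem_start);
        base = find_hole(*head, mem_start, mem_end, want, &prev);
        if (base == NO_HOLE)
            return -1;
    }
    fresh->base = base;
    fresh->size = want;
    if (prev != NULL) {
        fresh->next = prev->next;
        prev->next = fresh;
    } else {
        fresh->next = *head;
        *head = fresh;
    }
    return 0;
}
```

The linear search costs time proportional to the number of segments, which is the simplicity-for-speed trade the paragraph above accepts.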
As it is specifically not a desktop OS, the fundamental expectation I kept in mind throughout its design and implementation was that it would also be *programmed for* in assembly language only, leading to the fastest possible code in all cases. Now while that would probably be possible for all applications, it is simply not a feasible proposition for the day-to-day running of the OS. As such, we have started to gather ideas for a systemwide scripting language that will produce optimized machine code. Anybody well versed in compiler design or language implementation, please feel free to contact us.
The function calls currently programmed or coming soon are:
(1) PROCS.
Exec(path);
Call(path);
TSR();
AllocDMABuffer(Amt.);
DeAllocDMABuffer(Amt.);
Flush();
SetPriority(PID,No.);
Alarm(Time);
Hotswap(path to kernel image);
Exit();
(2) DEVICES
HookInterrupt(Vector no.,Offset into program);
DeHookInterrupt(vector no.);
Send(DeviceName,Amt.,offset);
Recv(DeviceName,Amt.,offset);
(3) SIGNALS
Signal(PID,No.);
HandleSignal(No.,Offset into program);
(4) TERMINALS
ReadCurrKeyP();
ReadLastKeyP();
ChangeTTY(No.);
CreateTerm(TTY No.,total chars);
WriteString(TTY,offset,length);
(5) NETWORKING
Netable(Device,Local IP);
Socket(Name,Protocol,Local IP,Local Port);
Connect(Name,Remote IP,Remote Port);
Close(Name);
Write(Name,offset,amt.);
Read(Name,offset,amt.);
ARCHITECTURE
A process in Domainatrix can only be created by the Exec() system call, and consists of 7 segments:
1) Code
2) Stack
3) what I call a "Write" Data Segment
4) a "Read" Data Segment
5) DMA buffer
6) Sockets
7) Temp. (needed by the OS)
These 7, in addition to the TSS and LDT, make up the 9 actual regions of memory needed for a process. Their names are mostly self-explanatory; however, the "Read" and "Write" segments bear special mention. By convention these two segments are used for just that: when a syscall is executed, for example, its parameters are placed in the "Write" segment, and the program can safely expect any values returned by the OS in the "Read" segment (i.e. they are Write and Read from the point of view of the *process*). Note that this has nothing to do with the actual write permissions set on those segments; these are merely conventions that the OS follows, with the obvious expectation that programs will too.
The Temp. segment is guaranteed to be available to the process at all times, i.e. the OS works in it only during system calls, where it needs an LDT entry for fast operation (e.g. IP reassembly buffers).
Call(), as outlined above, is probably the second most important syscall in Domainatrix, and is the basis of how the STDIO socket redirection mechanism works. The system call loads a new program from disk and runs it as a *second Code Segment* of the current proc., i.e. when that program Exit()s, control returns to the program that Call()ed it. When a web server program gets a request for a CGI page, it will Call() the program referenced in the page with its output redirected to that particular socket, and has to do *nothing further* itself. This works because the second program shares all the data segments of the first, including the one holding socket data. It also makes pipes tremendously easy: a shell could Call() a program with the input in its "Write" segment, and upon return have the output in its "Read" segment.
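
To make the pipe idea concrete, here is a toy model in C of a Call() that runs a second program against the caller's own data segments. It is a sketch under heavy assumptions: programs are modelled as plain C functions rather than code segments loaded from disk, Exit() is modelled as a function return, and the names struct process, call_program and demo_upcase are invented for the example.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

#define SEG_BYTES 64   /* hypothetical segment size for the demo */

/* Only the two data segments relevant to the pipe convention are modelled. */
struct process {
    char write_seg[SEG_BYTES];  /* caller places input/parameters here */
    char read_seg[SEG_BYTES];   /* results are expected back here */
    int depth;                  /* number of Call()s currently nested */
};

typedef void (*program_fn)(struct process *self);

/* Run `callee` as a second code segment of the same process. It sees the
 * very same data segments, and control comes back to the caller when it
 * "Exit()s" (modelled here as returning). */
void call_program(struct process *p, program_fn callee)
{
    assert(callee != NULL);
    p->depth++;
    callee(p);
    p->depth--;
}

/* Toy callee: an upper-casing filter that reads the "Write" segment and
 * leaves its output in the "Read" segment. */
void demo_upcase(struct process *self)
{
    size_t i;
    for (i = 0; i + 1 < SEG_BYTES && self->write_seg[i] != '\0'; i++)
        self->read_seg[i] = (char)toupper((unsigned char)self->write_seg[i]);
    self->read_seg[i] = '\0';
}
```

A shell in this model would copy its input into write_seg, call call_program(&p, demo_upcase), and find the filtered text in read_seg on return, with no copying between address spaces.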
Process priorities are done in Domainatrix using an extremely primitive method: a syscall sets the priority of a process to a number in the range 1-FF (hex), which is used as a counter. When the process is executing and a timer interrupt occurs, the counter is decremented by one. If it reaches 0, a context switch to the next task is made; otherwise control simply returns to the process. This ensures that, with all processes running, each one gets a total share of CPU time proportional to its priority.
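
A minimal sketch of that counter scheme in C follows. The structure and function names (struct proc, struct sched, timer_tick) are hypothetical, and the sketch assumes two things the description leaves open: that the counter is reloaded from the priority when a process is switched out, and that "the next task" means simple round-robin order.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_PROCS 8   /* hypothetical process table size */

struct proc {
    uint8_t priority;    /* 0x01..0xFF, as set by SetPriority() */
    uint8_t ticks_left;  /* counts down on each timer interrupt */
};

struct sched {
    struct proc procs[MAX_PROCS];
    int nprocs;
    int current;         /* index of the running process */
};

/* Called on every timer interrupt. Returns the index of the process that
 * should run after the interrupt. */
int timer_tick(struct sched *s)
{
    struct proc *p = &s->procs[s->current];
    assert(p->ticks_left > 0);
    p->ticks_left--;
    if (p->ticks_left != 0)
        return s->current;                      /* keep running */
    p->ticks_left = p->priority;                /* assumption: reload */
    s->current = (s->current + 1) % s->nprocs;  /* assumption: round robin */
    return s->current;
}
```

With two processes of priority 3 and 1, every cycle of four timer ticks gives three to the first and one to the second, which is the proportional share the text describes.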
Signal handling works by a program registering a handler for a particular signal (currently 1 of 9), giving the handler's address inside the executable. When another process sends that signal, the former's EIP is simply changed to point to the handler, and control is passed to it.
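
In C, the mechanism might look something like the sketch below. The names struct task, handle_signal and send_signal are stand-ins for the kernel side of HandleSignal() and Signal(). Two details are assumptions made for the example: an offset of 0 is treated as "no handler registered", and a signal with no handler is simply dropped. How the interrupted EIP is resumed afterwards is not specified above, so it is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NSIGNALS 9   /* "currently 1 of 9" */

struct task {
    uint32_t eip;                     /* saved instruction pointer */
    uint32_t handlers[NSIGNALS + 1];  /* indexed 1..9; 0 = none (assumption) */
};

/* Register `offset` (an address inside the executable) for `signo`. */
int handle_signal(struct task *t, int signo, uint32_t offset)
{
    assert(t != NULL);
    if (signo < 1 || signo > NSIGNALS)
        return -1;
    t->handlers[signo] = offset;
    return 0;
}

/* Deliver `signo` to `target` by redirecting its saved EIP to the handler.
 * The target starts executing the handler the next time it is scheduled. */
int send_signal(struct task *target, int signo)
{
    assert(target != NULL);
    if (signo < 1 || signo > NSIGNALS)
        return -1;
    if (target->handlers[signo] == 0)
        return -1;                    /* assumption: unhandled = dropped */
    target->eip = target->handlers[signo];
    return 0;
}
```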
TSR handling is done by registering a key combination, i.e. a mix of scancodes, with the kernel. For every keystroke, the keyboard handler simply checks for a match and revives the proc. if it was suspended. This means that the OS has to keep a list of all the scancodes registered by any processes running on the system, which could slow things down. Hence the need for Event Abstraction: in an OS like this, generally speaking, only two types of events occur that we're interested in: mouse/keyboard input and network messages. It should be possible in the future to activate processes, or have specific signals sent to them, when a particular request comes in over the network; that "particular request" would have to be defined and registered with the kernel.
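
The keyboard side of TSR handling can be illustrated with a small table of registered combinations that the keyboard handler scans on each keystroke. The C sketch below is only a guess at the shape of such a table: all names (struct tsr_entry, tsr_register, tsr_on_key) and the limits are hypothetical, and it assumes PC scancode set 1, where a key release is the make code with bit 7 set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_COMBO 4    /* hypothetical: keys per combination */
#define MAX_TSRS  16   /* hypothetical: registrations in the system */

struct tsr_entry {
    int pid;
    uint8_t codes[MAX_COMBO];  /* make codes that must all be held */
    int ncodes;
    bool suspended;
};

struct tsr_table {
    struct tsr_entry entries[MAX_TSRS];
    int count;
    bool held[128];            /* which make codes are currently down */
};

/* Register a key combination for a (suspended) TSR process. */
int tsr_register(struct tsr_table *t, int pid, const uint8_t *codes, int ncodes)
{
    assert(t != NULL && codes != NULL);
    if (t->count == MAX_TSRS || ncodes < 1 || ncodes > MAX_COMBO)
        return -1;
    struct tsr_entry *e = &t->entries[t->count++];
    e->pid = pid;
    e->ncodes = ncodes;
    e->suspended = true;
    for (int i = 0; i < ncodes; i++)
        e->codes[i] = codes[i] & 0x7F;
    return 0;
}

/* Keyboard handler hook: record the key, then scan every registration.
 * Returns the pid of the process revived, or -1 if none matched. */
int tsr_on_key(struct tsr_table *t, uint8_t scancode)
{
    t->held[scancode & 0x7F] = (scancode & 0x80) == 0;
    for (int i = 0; i < t->count; i++) {
        struct tsr_entry *e = &t->entries[i];
        bool all_down = true;
        for (int k = 0; k < e->ncodes; k++) {
            if (!t->held[e->codes[k]]) {
                all_down = false;
                break;
            }
        }
        if (all_down && e->suspended) {
            e->suspended = false;
            return e->pid;
        }
    }
    return -1;
}
```

The scan inside tsr_on_key runs on every keystroke and grows with the number of registrations, which is exactly the slowdown that motivates the Event Abstraction idea.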
DEVICE DRIVERS
Much thought has gone into the device driver model in Domainatrix and, as the first page says, we have decided to use UDI; however, this choice was mainly due to the promised vendor support. We are unclear where to go on this at the moment, but it's likely that UDI will only be implemented in Dx-64 (implementing the given model, which obviously *firmly* assumes a HLL, in 100% assembly is a *gargantuan* task, and as such we will probably only want to do it once), with the original model I had thought of, given below, used in Dx-32. It will require a bit more work, however, since as it stands now it would have a bit of a problem dealing with things like video cards.
The model itself is quite simple: a device driver is made up of 3 parts of code, an Initialisation routine, a Send routine and a Recv routine, plus a header pointing to each of them. The first is called when the driver is loaded into memory (a special area at the start of RAM is kept aside for drivers, before programs begin, since relocating a driver in memory would force us to revector the interrupts it is handling). The Send routine is called when a proc makes a Send() request to that device, and the Recv routine when an interrupt occurs FOR THE IRQ THAT DRIVER IS HANDLING.
The kernel keeps track of both sets of addresses to jump to in Layers, i.e. the Send entries for a device could number 3, each being the address of the Send() routine of one of 3 different drivers that have registered handlers for that device, and similarly for Recv(). This facilitates IRQ sharing in PCI systems in the simplest way possible. It also facilitates *network* layering, i.e. a new layer such as PPP could easily be added by adding a new entry to the Recv() entries. This is precisely how the Netable() call brings up an interface, i.e. makes it available to the kernel TCP/IP. In some cases the device driver is all the layering needed, e.g. Ethernet; if it's a modem requiring PPP (or some hitherto unheard-of protocol), simply add the code to the kernel to handle it, and hotswap in a new kernel after having added a new Recv() entry. In a pinch this could be done without missing a hit!
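
As a rough picture of the model, the C sketch below shows a driver header with its three routine pointers and a per-device table of layered Send and Recv entries. The type and function names, the layer limit, and the rule that a layer returning nonzero stops the chain are all assumptions made for illustration; the three demo_ routines are toy stand-ins for real driver code.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LAYERS 4   /* hypothetical limit on layers per device */

typedef int (*drv_init_fn)(void);
typedef int (*drv_io_fn)(void *buf, size_t len);

/* The header at the front of each driver, pointing at its 3 routines. */
struct driver_header {
    drv_init_fn init;   /* run once when the driver is loaded (may be NULL) */
    drv_io_fn send;     /* run when a proc does Send() on the device */
    drv_io_fn recv;     /* run when the device's IRQ fires */
};

/* Kernel-side record for one device: its Send and Recv layers in order. */
struct device_entry {
    const char *name;
    drv_io_fn send_layers[MAX_LAYERS];
    drv_io_fn recv_layers[MAX_LAYERS];
    int nsend, nrecv;
};

/* Load a driver onto a device: run its init, then append both routines. */
int device_attach(struct device_entry *dev, const struct driver_header *drv)
{
    assert(drv->send != NULL && drv->recv != NULL);
    if (dev->nsend == MAX_LAYERS || dev->nrecv == MAX_LAYERS)
        return -1;
    if (drv->init != NULL && drv->init() != 0)
        return -1;
    dev->send_layers[dev->nsend++] = drv->send;
    dev->recv_layers[dev->nrecv++] = drv->recv;
    return 0;
}

/* IRQ path: call each Recv layer in registration order. Returns how many
 * layers ran; a layer returning nonzero stops the chain (assumption). */
int device_dispatch_recv(struct device_entry *dev, void *buf, size_t len)
{
    int ran = 0;
    for (int i = 0; i < dev->nrecv; i++) {
        ran++;
        if (dev->recv_layers[i](buf, len) != 0)
            break;
    }
    return ran;
}

/* Toy layers for demonstration: each one marks a byte of the buffer. */
int demo_noop_send(void *buf, size_t len) { (void)buf; (void)len; return 0; }
int demo_eth_recv(void *buf, size_t len)
{
    if (len > 0) ((char *)buf)[0] = 'E';
    return 0;
}
int demo_ppp_recv(void *buf, size_t len)
{
    if (len > 1) ((char *)buf)[1] = 'P';
    return 0;
}
```

Attaching a second driver to the same device is the whole of "adding a layer" in this sketch: the dispatch loop does not care whether the extra entry is a PPP handler or another driver sharing the IRQ.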
DISK CACHE
The caching scheme will concentrate on caching by FILE, since we're serving webpages, which, presumably, are no use in part :) When an fread() is done without an fseek() before it, i.e. the entire file is being read, we assume it's webpage data (*.html, *.gif, *.jpg) and copy it to 2 regions of memory simultaneously: the actual buffer, and wherever in the cache it's going to go. The cache kernel data segment just contains a list of files, i.e. full pathnames like "/webserver/texture.jpg", each together with its start and end addresses in the cache, in RAM. THERE IS NO WRITE CACHING.
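
A small C sketch of such a per-file cache is given below. The names (struct cache_entry, struct file_cache, cache_lookup, cache_fill) and the fixed sizes are invented for the example, and start/end are stored as offsets into one arena rather than raw RAM addresses. No eviction policy is shown because none is described: a file that does not fit is simply delivered without being cached.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CACHE_BYTES 4096   /* hypothetical arena size */
#define MAX_FILES   32     /* hypothetical limit on cached files */

/* One list entry: a full pathname plus where the file sits in the cache. */
struct cache_entry {
    char path[64];
    size_t start, end;     /* file occupies arena[start .. end) */
};

struct file_cache {
    unsigned char arena[CACHE_BYTES];
    struct cache_entry files[MAX_FILES];
    int nfiles;
    size_t used;           /* bytes of the arena already taken */
};

/* Find a cached file by its full pathname, or return NULL. */
const struct cache_entry *cache_lookup(const struct file_cache *c,
                                       const char *path)
{
    for (int i = 0; i < c->nfiles; i++)
        if (strcmp(c->files[i].path, path) == 0)
            return &c->files[i];
    return NULL;
}

/* Whole-file read: copy the data to the caller's buffer and, if it fits,
 * into the cache as well. Returns 1 if cached, 0 if only delivered. */
int cache_fill(struct file_cache *c, const char *path,
               const unsigned char *data, size_t len, unsigned char *user_buf)
{
    assert(c != NULL && path != NULL);
    memcpy(user_buf, data, len);
    if (c->nfiles == MAX_FILES || len > CACHE_BYTES - c->used ||
        strlen(path) >= sizeof c->files[0].path ||
        cache_lookup(c, path) != NULL)
        return 0;
    struct cache_entry *e = &c->files[c->nfiles++];
    strcpy(e->path, path);
    e->start = c->used;
    e->end = c->used + len;
    memcpy(c->arena + e->start, data, len);
    c->used = e->end;
    return 1;
}
```

Since nothing is ever written back through the cache, a hit is just a lookup followed by a copy out of the arena.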
*UPDATE* While the above is true, as is mentioned elsewhere on the site, there has been talk of video serving, which is nowhere near my original intentions for this OS. As far as the caching scheme is concerned, this would mean that files above a certain size would *not* be cached in their entirety, but only as far as the maximum memory available allows.
FILESYSTEMS
The actual design of the FS we are going to use is being worked on right now. There was some talk of using temporary ones to get our OS working, and in the end we decided to use FAT-32, for obvious reasons. As far as Dx's own FS is concerned, however, some of the original thoughts I'd had about it still stand, mainly regarding the namespace. Storage devices in Dx. will be accessible by name, very much as in Unix (e.g. ide0); however, instead of using mount points they will be hardwired into the root directory namespace, i.e. the first drive will be accessible under /ide0. This is slightly more rigid than Unix, but I think it solves more problems than it creates. Since, in any case, symbolic links have by no means been scrapped, the above scheme is extremely workable.
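
With device names hardwired into the root, resolving a path starts by peeling off its first component. A possible C sketch is shown below; split_device is a hypothetical helper, not part of any existing Dx code, and it does no checking that the device actually exists.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Split "/ide0/www/index.html" into the device name "ide0" and the
 * remainder "/www/index.html". Returns a pointer to the remainder (or to
 * the constant "/" when the path names only the device), or NULL if the
 * path is not absolute, has an empty first component, or the name does
 * not fit in `dev`. */
const char *split_device(const char *path, char *dev, size_t devsz)
{
    assert(path != NULL && dev != NULL && devsz > 0);
    if (path[0] != '/')
        return NULL;
    const char *name = path + 1;
    size_t n = strcspn(name, "/");   /* length of the first component */
    if (n == 0 || n >= devsz)
        return NULL;
    memcpy(dev, name, n);
    dev[n] = '\0';
    return name[n] != '\0' ? name + n : "/";
}
```

Because the device is always the first path component, no mount table has to be consulted before the lookup can be handed to the right driver.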
THE SOURCE:
We are currently building the cross-web development system that Dx. is going to use; look for it soon on the "Tech." link. If you're trying to get a picture of what it will look like, think of SourceForge, but a low-level, assembly language version.
TO DO:
Well, tons of things obviously: Extensibility, Executable File Format decision, Virtual Memory, etc. Please join our mailing list to find out more and contribute. Mail: domainator@flashmail.com
REFERENCES
[1] The Exokernel approach to Operating System extensibility - Dawson R. Engler, M. Frans Kaashoek, James W. O'Toole Jr., MIT Laboratory for Computer Science.
[2] Extensibility, Safety and Performance in the SPIN Operating System - Brian N. Bershad, Stefan Savage, Przemyslaw Pardyak, Emin Gun Sirer, Marc E. Fiuczynski, David Becker, Craig Chambers, Susan Eggers, University of Washington.
[3] Server Operating Systems - M. Frans Kaashoek, Dawson R. Engler, Gregory R. Ganger, Deborah Wallach, MIT Laboratory for Computer Science.
[4] Exterminate all Operating System Abstractions - Dawson R. Engler, M. Frans Kaashoek, MIT Laboratory for Computer Science.
[5] Efficient, Portable, and Robust Extension of Operating System Functionality - Amin Vahdat, Douglas Ghormley and Thomas Anderson, UC Berkeley.