UML Projects

Here's a list of interesting things that need doing, but which I'm not going to have time for any time soon. These are all going to happen after UML V1.0 hits the streets. If you're interested in working on any of these and want pointers, help, etc, contact the UML devel list.

Architecture Ports

UML needs to be ported to other Linux architectures besides i386. This isn't very hard. The ppc port is nearly working at this point, and that has provided a lot of information that other ports will need. This is written up here.

OS Ports

It also needs to be ported to other operating systems. Other Unixes would probably be easiest. An OS needs some way of intercepting Linux system calls in order for a port to be possible. There are other needs, but they can probably be worked around if they're not met. A Windows port would be most interesting. NT apparently has the basic capabilities to run UML. Because UML provides a secure virtual machine, running it on Windows would provide a sandbox in which users could run untrusted code (like the email virus du jour) safely. It would be very amusing to have Linux provide a capability for Windows that Windows can't provide for itself.
It turns out that Michael Vines has written what amounts to the core of a Windows port. It's a Windows app which intercepts Linux system calls and interprets them itself. This is the critical part of UML. If this works, everything else can probably be made to work. So, if you're interested in working on this and want a head start, see his page for the details.

Native driver development

It is possible to allow direct access to hardware. This would allow native drivers to run (and be debugged) in userspace. This requires that the host provide access to io space (there is currently the ability to map io memory into a process address space, which I think satisfies this), and provide something to sit on the device irq and notify the userspace driver of interrupts. I think this can be done with a stub driver which probes the device at boot time, provides a file that the userspace driver can open, and sends a SIGIO down the file descriptor whenever the device interrupts. The thing I'm not sure about is timing. If the driver needs to do stuff with interrupts disabled, then that would make a userspace version hard, since processes can't run without interrupts on.
As of 15 Apr 2001, it is possible to do USB development under UML. Johan Verrept posted a patch which allows a USB device driver in UML to control a physical USB device on the host.
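The notification half of the stub-driver scheme described above can be tried out from userspace without any hardware. What follows is only a minimal sketch, assuming a Linux host. A socket pair stands in for the file that the stub driver would provide, and the names watch_for_interrupts and simulate_interrupt are made up for illustration; nothing here is UML code.

```python
import fcntl
import os
import signal
import socket
import time


def watch_for_interrupts(fd, on_interrupt):
    """Ask the kernel to send SIGIO to this process when fd has activity."""
    signal.signal(signal.SIGIO, lambda signum, frame: on_interrupt())
    # Direct the signal at this process.
    fcntl.fcntl(fd, fcntl.F_SETOWN, os.getpid())
    # O_ASYNC turns on signal-driven I/O for the descriptor.
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    fcntl.fcntl(fd, fcntl.F_SETFL, flags | os.O_ASYNC)


def simulate_interrupt(timeout=2.0):
    """One end of a socket pair plays the (hypothetical) stub driver."""
    device_end, driver_end = socket.socketpair()
    seen = []
    watch_for_interrupts(driver_end.fileno(),
                         lambda: seen.append(time.monotonic()))
    device_end.send(b"\x01")              # the "device" interrupts
    deadline = time.monotonic() + timeout
    while not seen and time.monotonic() < deadline:
        time.sleep(0.01)                  # handler runs in the main thread
    # Close the watched end first so no further SIGIO can be generated,
    # then ignore the signal rather than restoring the fatal default.
    driver_end.close()
    device_end.close()
    signal.signal(signal.SIGIO, signal.SIG_IGN)
    return len(seen)
```

Note that the handler runs at some point after the event, whenever the process is next scheduled, which is exactly the timing concern raised above: a userspace driver gets no guarantee about how soon after the interrupt it will run.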

hostfs extensions

hostfs is now used to access filesystems on the host, but it's potentially much more general than that. It's separated cleanly into kernel and userspace pieces, with the userspace code dealing with the external data.
The simplest extension would be to move the user code to a remote machine and put a simple RPC mechanism between the kernel and userspace pieces. This would allow mounting of a remote directory into a UML, similar to the userspace nfs daemon.
A more interesting possibility is to make the userspace host code deal with non-filesystems that can be made to look like filesystems. There are lots of databases which might be interesting to represent as filesystems:

SQL databases - each row is a directory, with each column a file containing the entry for that row and column

File databases like /etc/passwd and /etc/group - each entry is a directory with columns as files like the SQL scheme above

Network databases like NIS and BIND

Other possibilities include

networks - with each machine getting a directory containing information about that box

people and users

And there are probably lots of other possibilities. Anything that can at all be reasonably made to look like a file or filesystem could be mounted inside UML.
To make this work, all that's needed is a rewrite of arch/um/fs/hostfs/hostfs_user.c. It implements the operations on the underlying objects, which currently are host files. If you can implement those same operations for some other kind of object, then that object, or a set of them, can be mounted inside a UML and operated on as files.
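To make the SQL case concrete, here is a rough sketch of the mapping: rows become directories and columns become files. It is written in Python against an in-memory SQLite database purely for illustration. The TableFS class and its listdir and read methods are hypothetical stand-ins for the operations the userspace side would have to implement; they are not the actual interface in hostfs_user.c.

```python
import sqlite3


class TableFS:
    """Read-only view of one SQL table: /<row key>/<column> -> cell value."""

    def __init__(self, conn, table, key):
        # table and key are trusted identifiers supplied by the caller.
        self.conn, self.table, self.key = conn, table, key
        info = conn.execute(f"PRAGMA table_info({table})")
        self.columns = [row[1] for row in info]   # row[1] is the column name

    def _parts(self, path):
        return [p for p in path.split("/") if p]

    def _row(self, key_value):
        cur = self.conn.execute(
            f"SELECT * FROM {self.table} WHERE {self.key} = ?", (key_value,))
        return cur.fetchone()

    def listdir(self, path):
        parts = self._parts(path)
        if not parts:                     # root: one directory per row
            rows = self.conn.execute(
                f"SELECT {self.key} FROM {self.table} ORDER BY {self.key}")
            return [str(r[0]) for r in rows]
        if len(parts) == 1 and self._row(parts[0]) is not None:
            return list(self.columns)     # row directory: one file per column
        raise FileNotFoundError(path)

    def read(self, path):
        parts = self._parts(path)
        if len(parts) != 2 or parts[1] not in self.columns:
            raise FileNotFoundError(path)
        row = self._row(parts[0])
        if row is None:
            raise FileNotFoundError(path)
        return str(row[self.columns.index(parts[1])])


def demo():
    """Build a small passwd-like table and expose it as a tree."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users (name TEXT PRIMARY KEY, uid INTEGER, shell TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?, ?)",
                     [("root", 0, "/bin/sh"), ("alice", 500, "/bin/bash")])
    return TableFS(conn, "users", "name")
```

With this layout, demo().read("/alice/shell") plays the role of reading the file shell in the directory alice. A real implementation would also need writes, attributes, and so on.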
Storing the same data in multiple places and mounting them jointly as a single hostfs filesystem offers other possibilities as well:

Dumping a directory tree from a filesystem into a database and having hostfs provide access to both would give normal file access to the filesystem plus the ability to run database queries on it, through a mechanism such as an ioctl or a virtual filesystem on the side.

Mounting many almost-identical directories from multiple machines on a single hostfs mount point would allow software to be installed on all the machines simultaneously with a single install command inside the virtual machine.
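The multi-machine install idea boils down to fanning each write out to every backing directory. The sketch below illustrates that with local temporary directories standing in for the remote machines. The FanOutDir class is a made-up name for illustration, and nothing here deals with the hard parts such as partial failures or directories that have drifted apart.

```python
import os
import tempfile


class FanOutDir:
    """Treat several backing directories as one mount point for writes."""

    def __init__(self, backing_dirs):
        self.backing_dirs = list(backing_dirs)

    def write(self, relpath, data):
        # Apply the same write to every backing directory.
        for root in self.backing_dirs:
            target = os.path.join(root, relpath)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with open(target, "wb") as f:
                f.write(data)

    def read(self, relpath):
        # The copies are assumed identical, so the first one will do.
        with open(os.path.join(self.backing_dirs[0], relpath), "rb") as f:
            return f.read()


def demo():
    """'Install' one file onto three stand-in machines with a single write."""
    with tempfile.TemporaryDirectory() as top:
        machines = [os.path.join(top, name)
                    for name in ("hostA", "hostB", "hostC")]
        for machine in machines:
            os.mkdir(machine)
        mount = FanOutDir(machines)
        mount.write("usr/local/bin/tool", b"#!/bin/sh\n")
        return [os.path.exists(os.path.join(m, "usr/local/bin/tool"))
                for m in machines]
```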

SMP support

Enabling SMP in UML would allow it to emulate a multiprocessor box. Getting this working should be fairly straightforward. I did have it working in early 2000, but turned it off to work on the UP side of things.
Here's what needs to be done:

Uncomment the SMP config in arch/um/config.in, run your favorite config, and turn it on

Get UML to compile - there's been some bitrot since this last worked, so some of the kernel headers don't agree any more with what's in UML, and I've put in some #error directives where I've noticed there needs to be some thought about SMP.

Audit the UML arch code, looking for global data that needs locking. There isn't very much of it. I did the audit in a couple of days, and didn't miss very much.

Debug it. Bang on it until you stop hitting races.

Send in the patch.

UML Clustering

One way to cluster UML is with DSM, or Distributed Shared Memory, where the nodes of a cluster share a portion of their memory through some kind of special interconnect.
This can be done with UML because UML's physical memory is really virtual memory on the host, so it can be mapped and unmapped. So, the idea is to spread a single UML instance over multiple hosts by running a virtual processor on each host and partitioning UML physical memory between them. A page that's present on one node will be unmapped from the others. If one of the other nodes accesses it, it will fault, and a new low-level fault handler will figure out what node currently has it, and request that it be sent over. The other node will unmap it, and copy it over to the requesting node, which will map it in to the appropriate location and continue running.
This will get a UML-based Single-System Image (SSI) cluster up and running. It will be very inefficient because there are data structures that the kernel references very frequently. The pages that contain this data will constantly be copied from node to node, and the nodes will spend much of their time waiting for that to happen.
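The page-migration protocol, and the reason it thrashes, can be seen in a toy model. This is a sketch only: it assumes a single global owner table for finding the node that holds a page, which is just one way the lookup might be done, and the Cluster class and ping_pong function are invented for illustration.

```python
class Cluster:
    """Toy model: each page of UML 'physical' memory lives on one node."""

    def __init__(self, n_nodes, n_pages):
        # mapped[node] holds the pages currently mapped on that node.
        self.mapped = [dict() for _ in range(n_nodes)]
        self.owner = {}                   # assumed global page -> node table
        self.transfers = 0
        for page in range(n_pages):       # partition pages round-robin
            node = page % n_nodes
            self.mapped[node][page] = bytearray(4)
            self.owner[page] = node

    def _fault(self, node, page):
        """Fault handler: find the current owner and pull the page over."""
        src = self.owner[page]
        data = self.mapped[src].pop(page)     # owner unmaps the page...
        self.mapped[node][page] = data        # ...requester maps it in
        self.owner[page] = node
        self.transfers += 1

    def access(self, node, page):
        if page not in self.mapped[node]:     # not mapped here: fault
            self._fault(node, page)
        return self.mapped[node][page]


def ping_pong(rounds):
    """Two nodes take turns incrementing a counter on the same hot page."""
    cluster = Cluster(n_nodes=2, n_pages=4)
    for _ in range(rounds):
        cluster.access(0, 0)[0] += 1
        cluster.access(1, 0)[0] += 1
    return cluster.transfers, cluster.access(1, 0)[0]
```

In this model ping_pong(10) makes 20 accesses to the hot page and needs 19 page transfers to do it, which is the behavior that makes frequently referenced kernel data so expensive.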
There are two avenues through which this can be fixed. One follows from the observation that this cluster is an extreme form of NUMA, with no global memory and very expensive access to other nodes' local memory. So, the ongoing NUMA work going into Linux will help this. Plus, UML will effectively make NUMA hardware available to everyone who runs Linux, which will hopefully pull more people into the effort and speed it up.
The other is to start replacing the shared memory communication with an RPC-based communication mechanism. While the NUMA work will reduce the amount of inter-node chatter, an RPC mechanism will make the remaining communication much more efficient. The ultimate outcome of this effort will be a standard SSI cluster implementation which can run with either physical nodes or virtual ones.

UML as a normal userspace library

UML normally runs as a standalone executable, but there's no reason that it can't be packaged slightly differently, as a library which other applications could link against. Those apps would gain access to all the functionality in the kernel, including

memory allocation - the kernel allocators are specialized to be fast and scalable. In addition, the virtual memory allocator is able to defragment memory efficiently with a buddy system algorithm.

memory management - built-in swapping and cache size control. With some work, the userspace swapping code could be used to tell the host kernel what memory the application doesn't really need, allowing userspace and kernelspace to cooperate in getting the best performance from the system.

virtual memory and multiple address spaces - these are things which libc doesn't provide, except by calling into the kernel to get them. They can be used to jail some of the application's threads, if it is running something it doesn't trust. With the addition of highmem support to UML, this could be used to transparently provide access to more than 4G of address space.

threads - these are specialized to be low-overhead and fast, as is the scheduler.

filesystems - the filesystems can be thought of as hierarchical storage systems for the application's internal data. They can be mounted on an internal ramdisk, making them private and non-persistent, or on a host file, making them persistent and possibly shared.

the full network stack and network devices - this is interesting when combined with the presence of a filesystem inside the application. It could export the filesystem via NFS or another remote filesystem protocol to the outside world.

Here are some examples of how these capabilities can be used:

All of the Apaches in a server farm could store their configurations in internal filesystems and export them to the outside world over the network. A central server could mount them all and use information about the load on each machine to tweak the settings of that machine's Apache, so that each server does as much work as possible without being overloaded.

Alternatively, the Apaches could all import their configurations from a central location so that when a change is made on the central server, all of the servers immediately see the change. This would be useful for things like virtual server configuration. Create the new virtual server configuration by creating the appropriate directories and files in the configuration filesystem and all of the web servers importing that configuration would immediately start serving the new virtual domain.

Combine the two previous examples, with each server exporting performance-related configuration items so they can be individually tweaked on the fly, but importing the global configuration items that need to be shared across the entire server farm.

These servers could use the virtual memory and jailing abilities of UML to accept code from web clients and run it. These pieces of code would be run as UML processes, jailing them inside restricted environments with their own private address spaces, allowing them to change their own data, but providing no access to any other data. To prevent these threads from consuming too much CPU and preventing the servers from doing their normal work, they could be given a lower priority than the normal Apache threads, so that the UML scheduler will only run a client's thread when the server would otherwise be idle.

An interactive application could store its UI in an internal filesystem and export it to the host. External scripts could then navigate around that filesystem, activating parts of the interface, which would effectively provide scripting to applications which don't have it built-in.

A script could watch the exported filesystem for events such as a dialog box being opened and fill in the fields with defaults from information sources that the application knows nothing about. The script could simply remember the last set of values for that dialog box and fill them in the next time. Or it could be watching a number of applications, copying information from one to the other. For example, if the user has just looked up a particular person's calendar and is now starting to compose an email message, the script could look up that person's email address in its own database, and put it in the 'To' field of the new message composition window.
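The "remember the last set of values" script from the last example is easy to sketch once the UI is visible as files. The layout assumed here, one directory per dialog and one file per field, is invented for illustration, as are the autofill function and the field names. A scratch directory stands in for the exported filesystem.

```python
import os
import tempfile


def autofill(dialog_dir, remembered):
    """Fill empty field files from remembered values; learn non-empty ones."""
    filled = []
    for field in sorted(os.listdir(dialog_dir)):
        path = os.path.join(dialog_dir, field)
        with open(path) as f:
            value = f.read()
        if value:
            remembered[field] = value     # user-entered value: remember it
        elif field in remembered:
            with open(path, "w") as f:    # empty field: supply the last value
                f.write(remembered[field])
            filled.append(field)
    return filled


def demo():
    """Open a 'compose' dialog twice; the second time 'to' is prefilled."""
    remembered = {}
    with tempfile.TemporaryDirectory() as ui:
        dialog = os.path.join(ui, "compose")
        os.mkdir(dialog)
        # First visit: the user typed an address but no subject.
        for field, value in (("to", "someone@example.com"), ("subject", "")):
            with open(os.path.join(dialog, field), "w") as f:
                f.write(value)
        first = autofill(dialog, remembered)
        # Second visit: the application presents empty fields again.
        for field in ("to", "subject"):
            with open(os.path.join(dialog, field), "w") as f:
                f.write("")
        second = autofill(dialog, remembered)
        with open(os.path.join(dialog, "to")) as f:
            to_value = f.read()
    return first, second, to_value
```

A real script would have to notice the dialog appearing, by polling or some notification mechanism, rather than being called by hand as it is here.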

Clusterized applications

Combine the ideas of using UML as a cluster and using it as an application library and you get an application which can spread itself over multiple machines and effectively remain a single process. Since the kernel's data structures and algorithms will presumably be suited for clustering already, the more extensively the application uses them for its own purposes, the more prepared it will be to spread itself transparently over a virtual cluster.
Servers would be able to automatically add virtual nodes on new physical hosts to themselves as their load increases and shut them down as their load goes back down. And as they do that, some other clusterized server may be starting to run on those hosts as its load goes up. This will make it possible to run one server for each service on a server farm, rather than one server per machine.
