
Dear Diary

This page contains information about what's currently happening with the project. I may update it once in a while if I feel like it.

29 Feb 2008

I've put out a few more releases of the SKAS4 patch, including a couple to LKML. No comments as yet, except for one from Andrew about style and a minor coding issue.

I decided to add another idea to SKAS4 since the switch_mm bit, although it greatly improves performance on x86_64, still isn't close to i386. oprofile says I'm getting killed in the scheduler, so I revived an old idea which has come up a few times, and which was actually implemented by Ingo a few years ago.

That is to allow a process to put itself into an "unprivileged" mode, where it can't make system calls or receive signals. That is, the process would make a call like vcpu(mm, registers); which would switch to the given address space with the given registers and run there until it receives a signal or makes a system call. At that point, the vcpu call returns with information about what happened. This is essentially self-ptracing, except there's only one process involved. It also returns all of the information needed to deal with the event, without needing to make more calls to ptrace in order to get the current registers or signal information.

Making this work was surprisingly easy, except for saving and restoring TLS state. Despite my best effort, I'm getting UML process TLS segments active when vcpu returns back to the UML kernel.

When this works and is tidied up, vcpu will be in the next skas4 patch.

1 Feb 2008
The big news recently is that I've figured out a SKAS interface that might get into mainline. What's in it:
  • Two new system calls
    • new_mm - returns file descriptor referring to a new address space
    • switch_mm - switches the process to a new address space
  • /proc/<pid>/mm - opening this returns a file descriptor referring to the process' address space
  • PTRACE_SWITCH_MM - switches the child to a new address space
Manipulation of a remote address space (as when handling a page fault or swapping out a page) is done by switching to the address space, in the stub page mapped there by UML, and executing the mmap/munmap/mprotect calls directly.

A user of switch_mm would normally open /proc/self/mm in order to get a handle to its original address space. switch_mm requires a file descriptor telling it what address space to switch to, so this provides the means to return to the original address space. This descriptor also holds a reference to the address space, ensuring that it isn't freed when the process visits another address space temporarily.

I've made several releases of SKAS4 to the UML lists, and one to LKML. A full rolled-up patch can be found here.

It also includes a siginfo_t extension which adds the CPU error code and trap number to the SIGSEGV case. These are needed in order to figure out what sort of operation faulted, and thus how to fix it.

SKAS4 currently supports 32 and 64-bit UML and x86, plus 32-bit compatibility on 64-bit x86.

With this on the host, and a UML close to what's currently in -mm, I'm getting 82-83% of native performance on a kernel build on i386.

29 Oct 2007
In the three months since the last entry here, the whole 2.6.23 cycle came and went. It wasn't a huge UML release - I was saving up changes for the 2.6.24 cycle:
  • 4K stacks and IRQ stacks - The old, larger stacks were a reliability problem on x86_64 since there often wasn't enough contiguous memory to allocate a kernel stack, even if there was plenty of memory overall. You'd see forks start to fail when there was lots of free memory. I implemented IRQ stacks in order to cut down on stack usage and cut i386 kernel stacks to 4K and x86_64 kernel stacks to 8K, the same as the host. As well as improving reliability and saving memory, it gave a noticeable speedup to a kernel build.
  • Core dumping now works on x86_64.
  • A fair amount of code cleanup.

I spent the 2.6.23 cycle accumulating things in -mm for 2.6.24. There's a fair amount, which is now in 2.6.24-rc1:

  • tt mode is gone. This was prompted by Adrian Bunk looking around for unused and defunct config options, and spotting CONFIG_MODE_TT. Since it no longer worked, and I'm not sure if it built, I just got rid of it. This was a fairly major undertaking, resulting in a long patchset. I was fairly careful about it and it seems to have gone smoothly. I think between 5000 and 10000 lines of code were deleted. This exposed a number of opportunities to simplify the remaining code, which I am doing on a case-by-case basis. Perhaps surprisingly, this resulted in a measurable speed boost on a kernel build.
  • Tickless support - this resulted in a noticeable performance boost by itself, and it should help out hosts that are running many UMLs. They will no longer have to deliver 100 timer interrupts per second to each UML.
  • Performance enhancements - UML doesn't save and restore FP state from processes and the tt mode removal made it possible to streamline the page fault path some more.
  • VDE (Virtual Distributed Ethernet - http://vde.sourceforge.net/) support - this is something like UML's uml_switch. The VDE driver was contributed by the VDE developers.
13 Jul 2007
The issue of valgrinding the kernel by way of UML came up again. Unfortunately, valgrind is just as broken as ever when it comes to UML. It caused a cloned subprocess to segfault while UML was checking the ptrace capabilities of the host during startup. I ifdef-ed all that code out to see what would happen. valgrind then died because of an instruction it doesn't know how to emulate. It supposedly can handle everything emitted by gcc, but this instruction came from some hand-written assembly in include/asm-i386. After seeing this, I gave up on valgrind again.

I've been playing with KVM some. After much futzing around, I got a trivial guest to run. The issue was the initialization of the guest's physical memory. You end up getting several file descriptors from KVM as part of setting up the guest. The guest's physical memory is created by mmapping one of these descriptors. After mapping it, you fill it in with whatever data you want the guest to have when it starts. I was calling mmap with MAP_PRIVATE - a decision made somewhat randomly without any particular thought. It turns out that, with respect to this buffer, the guest and the process initializing it are different processes. When I initialized the buffer, the kernel privatized the pages, leaving the guest with zero pages.
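
The MAP_PRIVATE effect described above can be reproduced in userspace without KVM. The sketch below is only an illustration: it uses an ordinary scratch file in /tmp in place of the KVM memory descriptor, and the function name and file name are made up. It writes one byte through a mapping and then reads the file itself, which stands in for what the "other side" of the descriptor sees.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Map a zero-filled scratch file with the given flags (MAP_PRIVATE or
 * MAP_SHARED), store 0x42 through the mapping, then read byte 0 back
 * from the file descriptor. Returns the byte the file holds, or -1 on
 * error. The scratch file stands in for the KVM memory descriptor.
 */
int byte_seen_through_file(int map_flags)
{
    char path[] = "/tmp/mapdemoXXXXXX";  /* hypothetical scratch file */
    long page = sysconf(_SC_PAGESIZE);
    unsigned char *mem;
    unsigned char seen = 0;
    int result = -1;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);                        /* file lives only as long as fd */
    if (page <= 0 || ftruncate(fd, page) < 0)
        goto out;

    mem = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE, map_flags, fd, 0);
    if (mem == MAP_FAILED)
        goto out;

    mem[0] = 0x42;                       /* private: copy-on-write page */
    msync(mem, (size_t)page, MS_SYNC);   /* flushes only shared mappings */
    if (pread(fd, &seen, 1, 0) == 1)
        result = seen;
    munmap(mem, (size_t)page);
out:
    close(fd);
    return result;
}
```

On Linux, byte_seen_through_file(MAP_PRIVATE) returns 0 because the write landed in a private copy of the page, while byte_seen_through_file(MAP_SHARED) returns 0x42.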

The guest was running on virtual 8086 mode, complete with 16-bit segment limits and everything. I read the specs some to see how to initialize the registers so as to have the guest start in protected 64-bit mode, without much success. KVM does come with a bunch of little demo guests, one of which puts the virtual CPU into 64-bit mode. I will use that and see if I can figure out the appropriate register initializations.

I also spent some time playing with Ingo's and Zach Brown's syslets. The attraction there is that they allow synchronous system calls to be turned into asynchronous calls if they block. In this case, the system call will return to userspace in a different thread, and the status from the original, blocked thread can be collected later, when it finishes. This is near to optimal as far as CPU consumption and code cleanliness are concerned. You don't need to make everything asynchronous in case something might occasionally block, with the switching and event collection that implies. If the data is available, the call returns immediately, and if it's not, you collect the status later.

I found a few bugs, for which I sent patches. I couldn't get UML to boot using syslets, for reasons I didn't fully debug. A problem is the fact that the asynchronous threads may return to userspace if they get a signal, and a stack and an entry point need to be provided to each thread. This is a big wart on the design, and it causes problems when a process that thinks it's single-threaded receives a signal in 16 or 32 threads. This feature seems not to be completely needed, and I'm hopeful that it will disappear.

In other news, 2.6.22 is out, with all the changes I mentioned previously.

25 May 2007
My pile of patches (50-60 or so) is now in 2.6.22-rc2. Among the changes that'll be in 2.6.22:
  • Better hot-plug and hot-unplug of block devices and network interfaces - I fixed some misbehavior which could cause a block device to be neither pluggable nor unpluggable and fixed crashes in the network interface hotplug code.
  • Various performance improvements already mentioned.
  • IRQ stacks and smaller kernel stacks (4K on i386 and 8K on x86_64, matching the host) - this knocked a bit more time off a kernel build.
  • Lots of other code cleanups and bug fixes.
I'm working on getting the ubd readv/writev support ready for mainline. The per-device thread patch, which leads to it, has been fixed so that all device I/O threads get killed on shutdown, so it's ready. The readv/writev patch itself doesn't do the right thing with COW files, so it still needs some work. The problems here are similar to those encountered in the AIO work in my tree, so I might end up pulling that forward and getting it into mainline and out of my tree finally.

I figured out something about I/O performance. The important thing is not having lots of I/O going to the host at the same time, although I imagine that doesn't hurt. During a kernel build in UML, the host CPU is pegged, so anything that cuts down on CPU consumption directly cuts down on the kernel build time. So, readv and writev are nice, not because they send a lot of I/O to the host at once, but because they allow the UML driver to be notified of a bunch of I/O with a single interrupt. Linux signals are fairly expensive, so cutting down on their number helps performance.
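
As a rough illustration of the batching idea, here is a userspace sketch that hands three separate buffers to the host in one writev call instead of three write calls. It is not the ubd driver code; the function names and the "req" payloads are invented for the example, and a pipe stands in for the disk image.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Submit three separate buffers with a single writev() system call.
 * Returns the total number of bytes written, or -1 on error. */
ssize_t write_batch(int fd)
{
    static char a[] = "req0 ", b[] = "req1 ", c[] = "req2";
    struct iovec iov[3] = {
        { .iov_base = a, .iov_len = sizeof(a) - 1 },
        { .iov_base = b, .iov_len = sizeof(b) - 1 },
        { .iov_base = c, .iov_len = sizeof(c) - 1 },
    };

    return writev(fd, iov, 3);
}

/* Push the batch through a pipe and read it back into out as a
 * NUL-terminated string. Returns the number of bytes read, or -1. */
int batch_roundtrip(char *out, size_t outlen)
{
    int fds[2];
    ssize_t sent, got;

    if (outlen == 0 || pipe(fds) < 0)
        return -1;
    sent = write_batch(fds[1]);
    close(fds[1]);
    got = sent < 0 ? -1 : read(fds[0], out, outlen - 1);
    close(fds[0]);
    if (got < 0)
        return -1;
    out[got] = '\0';
    return (int)got;
}
```

The three buffers arrive as one contiguous stream, so whoever is waiting on the other end needs only one wakeup for the whole batch.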

There was still another example of a boot hang after mounting the root filesystem that I hadn't figured out. I did finally, and it turns out to be that PTRACE_SYSEMU is broken on FC6 on i386. I let Roland McGrath know, and added some more checking to UML, so that it now will look at PTRACE_SYSEMU more carefully, and fall back to PTRACE_SYSCALL if anything is wrong.

27 Apr 2007
2.6.21 came out yesterday, so the changes I have pending in -mm should start flowing in. I spent the afternoon playing with the UML network layer. Antoine Martin found that you couldn't assign a MAC to a pcap device (which he wanted to do in order to not get a different device every time his distro saw a new MAC). In the course of fixing this, I found and fixed a pile of other bugs. I was making no attempt to have the assignment of a MAC succeed, so I got to see how the error cases worked. I found that
  • the free wrapper wasn't detecting pointers that had been malloced, so was passing them to kfree, which blew up
  • the failure path of the ethernet device configuration was causing double-frees because sysfs was helpfully freeing things
  • the pcap backend wasn't printing a proper initialization string, checking that it was passing a valid pointer to pcap_close, or returning an error when it saw an invalid option
In other news, I love UML's new behavior of dumping core on panic. It gives me much better debugging when I don't already have a gdb on it, and it panics.
20 Apr 2007
So, what's been going on in the last two months? Lots of patches either in mainline or on their way:
  • UML will now dump core on a panic (and it prints the current core dump limits on boot) to give me a better chance of debugging problems that I can't reproduce. To allow this to work, the core dump rlimit must be unlimited (or at least very large) - see ulimit -c for your limits. It seems that most distros set it to zero in order to inhibit core dumping.
  • Device hotplug fixes - if you hotplug a disk with a bogus filename, that device is no longer unremoveable and unrepluggable. Also, you don't get nasty messages from sysfs (or wherever) about a lack of a release method when you unplug a device.
  • Yet another x86_64 TLS fix. I think TLS is fine now - I know of no failures at this point.
  • UML in -mm compiles now in the presence of utrace. It's non-functional, but enough is there to let it build.
  • Lots of code cleanup and fixes of a few miscellaneous crashes.
  • Performance improvements - kernel builds are now ~20% faster than before and are about 2/3 native speed. I found a horrible mess which was causing a random userspace page to be faulted in whenever UML did a file read or write.

    The reason for this was that, a long long time ago, in the tt era, I was having problems with userspace addresses being passed into read or write on the host and having that return -EFAULT because the page had not yet been faulted in. So, I added some code which faulted in any such pages (in a totally bogus way). However, in the skas era, when a kernel address gets passed to read or write on the host, it doesn't need to be faulted in at all and doing a copy_user to it to make sure it's present just faults in the process page, if any, at the corresponding process address. Removing this gave me ~10% on a kernel build.

    Improving the page fault path gave me another ~10%.

  • I/O improvements - I've done a number of things to the ubd driver to speed things up:
    • Stuff as many requests at the I/O thread as possible to allow them to possibly be handled sooner, and pass pointers instead of entire structures to increase that number and reduce the amount of data going through the pipe.
    • Each device now gets its own I/O thread. This should improve throughput when multiple devices are active. It also paves the way for using readv/writev.
    • The I/O threads now use readv and writev to get some parallelism on I/O. This doesn't help kernel builds (which is all I've looked at so far) too much - the times do seem more consistent and at the low end of the range that I'm used to seeing.
  • More locking fixes.
  • All critical fixes have gone into 2.6.20-stable, so the UML there is in pretty good shape.
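
The core-dump-on-panic item above depends on the core dump rlimit being large enough. From a shell that is ulimit -c; a launcher program could do the equivalent with getrlimit and setrlimit before starting UML. The following is a minimal sketch of that, with made-up function names; it is not something UML itself is known to do.

```c
#include <sys/resource.h>

/* Raise the soft core-dump limit as far as the hard limit allows.
 * Returns 0 on success, -1 on failure. */
int raise_core_limit(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_CORE, &lim) < 0)
        return -1;
    lim.rlim_cur = lim.rlim_max;   /* an unprivileged process may go this far */
    return setrlimit(RLIMIT_CORE, &lim);
}

/* Returns 1 if the current soft limit permits a core file at all,
 * 0 if core dumps are disabled, -1 on error. */
int core_dumps_enabled(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_CORE, &lim) < 0)
        return -1;
    return lim.rlim_cur != 0;
}
```

If the hard limit is itself zero, as some distros configure it, raising the soft limit changes nothing and core_dumps_enabled() still returns 0.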
20 Feb 2007
I've been spending time tracking down bugs in both UML and the host, with the following results:
  • Found a bug in 32-bit ptrace on x86_64 which mangled the 6th system call argument. I fixed it to my satisfaction, but when Andi sent it to LKML for review, it turns out that this bug had been seen before and fixed, and the fix, which had never been merged, was better than mine. I'm going to push that patch if the author (Chuck Ebbert) doesn't.
  • BB and I added PTRACE_OLDSETOPTIONS to 32-bit ptrace on x86_64 on the same day. His patch will be the one to go in.
  • I found and fixed a bug where kernelspace faults were trashing the segfault information stored in the task structure by a previous userspace fault. When the segfault was finally delivered to UML, it decided that the fault was fatal because of the bogus information that the host had saved, and killed the process. My fix went through three iterations at the behest of Jan Beulich, who kept spotting problems with it.
  • I found a signal frame alignment bug in UML which caused a few processes to segfault. They were executing MMX instructions, which expect data to be 16-byte aligned, inside a signal handler, and the misalignment was causing them to segfault.
  • After a few days of debugging, I fixed a TLS problem on x86_64 which caused host to segfault. I wasn't implementing CLONE_SETTID properly there. host still isn't totally happy, so there's more work that needs to be done.
  • Yesterday, I found the problem which causes UML to hang early in boot with some more recent host kernels. UML was mishandling the host VDSO information, with the result that init tried to branch to an empty part of memory when it wanted to use the host's VDSO.
  • It turns out that an earlier cleanup suddenly started causing UML to hang in 2.6.20 with a couple of threads sending an infinite stream of single characters back and forth. With the help of someone on #uml who was seeing the problem, I diagnosed and fixed it. It turned out to be a badly designed interface being used for something it wasn't intended to be used for. I had implemented a little growable array abstraction which didn't bother preserving its contents when it needed to allocate more memory, leaving that up to its callers. I reused it with something that wasn't prepared to fill in the contents, with the result that the array was filled with garbage when it got reallocated.
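
For contrast with the growable array in the last item, here is a sketch of one whose grow step preserves its contents, so callers never have to refill it. This is a generic userspace illustration built on realloc, not the UML abstraction; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* A growable array of ints. Zero-initialize it before first use. */
struct growable {
    int *items;
    size_t used;
    size_t capacity;
};

/* Append a value, doubling the storage when it is full. realloc()
 * copies the existing elements, so earlier contents survive a grow.
 * Returns 0 on success, -1 if memory could not be allocated. */
int growable_push(struct growable *g, int value)
{
    assert(g != NULL);
    if (g->used == g->capacity) {
        size_t new_cap = g->capacity ? g->capacity * 2 : 4;
        int *bigger = realloc(g->items, new_cap * sizeof(*bigger));

        if (bigger == NULL)
            return -1;             /* old storage is still valid */
        g->items = bigger;
        g->capacity = new_cap;
    }
    g->items[g->used++] = value;
    return 0;
}

/* Release the storage and return the array to its empty state. */
void growable_free(struct growable *g)
{
    assert(g != NULL);
    free(g->items);
    g->items = NULL;
    g->used = g->capacity = 0;
}
```

Keeping the copy inside the abstraction means a new caller cannot forget it, which is the mistake described above.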

I am continuing to send SMP cleanups to Andrew. When 2.6.20 opened up, he sent a bunch of them to Linus. I will queue up a bunch more after 2.6.20 closes for inclusion in 2.6.21.

utrace, which is a ptrace replacement, made its debut in -mm with 2.6.20-mm1. ptrace requires a fair amount of architecture support, so any replacement of it will require some work on the part of the architecture maintainers. Fortunately, Roland McGrath wrote a document on updating architectures, so it was fairly easy to get UML compiling and booting again. However, ptrace doesn't work - that will be the goal of the rest of the work.

Brian Ducharme of Virtual Strategy Magazine did a podcast with me and Chris Aker of Linode.com. Linode is a large UML ISP (and Chris is a long-time supporter of UML), and happens to host virtual-strategy.com. The podcast is located here.

26 Jan 2007
With the SMP mechanisms basically working, I've been making an SMP cleanliness pass over UML. I had previously gone over the code and made a list of things that needed to be looked at. Now, I'm going over that list and fixing things. I've sent a pile of patches to Andrew for inclusion in 2.6.21, and I have a lot more which I haven't sent in yet.

I spent last week in Sydney at LCA 2007. The last LCA I was at was Brisbane in 2002. That was an awesome show, and it has gotten more awesome in the meantime.

I gave a talk at the virtualization miniconf on my plans on making a UML KVM client. This is to take advantage of the virtualization support in current Intel and AMD chips. This work was done by a couple of Intel engineers in Moscow. The UML side is fine, but the host side is a bit scary, and prompted me to work on more pressing things. KVM is essentially the same thing, and I can make the UML work fit within it pretty easily.

A couple of articles flowed from that. One was an interview with me by Joe Brockmeier here. The other was an article about my talk in ComputerWorld.

Since I got back (with minimal jet lag (!)), I've been on a bug hunt. There have been a number of reports of various host-related UML problems - UML works on one host, but not another. I've got access to a couple of such hosts and have been tracking down the bugs. It turns out that the 32-bit ptrace support on x86_64 is buggy. There was no support for PTRACE_OLDSETOPTIONS, which is easily fixed. I'm currently looking at a problem where the sixth system call argument is trashed when read from ptrace. This is causing process bus errors inside UML because that's the offset argument to mmap and when you try to touch memory which is mapped from a very negative file offset, you get a SIGBUS.

15 Dec 2006
After much argument with ptrace and how signals get delivered to ptraced processes, I got a 2-CPU UML to boot. After playing with it some, I haven't found any unknown problems. The two problems that did turn up were already known. The locking in the console driver is screwed up, and we've known that for a long time. Aside from that, this thing seems reasonably healthy. I'm going to beat on it, and I'm sure more problems will turn up then.

I think I'm going to redo the detach/attach nonsense that I now do on every context switch. The problem is that when a CPU switches from one process to another, it needs to be able to attach to the associated host process. I did this by having each CPU only attach to whatever process it is running, and so all non-running processes are detached, and can be attached by a CPU without any trouble later. The alternative to this is to leave processes attached to their CPU when they are switched out. If they are on a different CPU when they are next run, the new CPU has to send an IPI to the old one to detach it. That sounded nasty and painful to me, but what I did instead was extremely nasty and painful. So, the alternative has been looking somewhat more attractive of late.

4 Dec 2006
I've made some amount of SMP progress. Moving that wait to inside an attach/detach pair still doesn't work, but UML at least runs longer before crashing.

On the bug-hunting side of things, it's looking like the as-iosched crash is caused by an interrupt happening when it shouldn't. My debugging stuff is showing that a pointer is tested as not being NULL, and immediately inside the test, it is NULL. This is the sort of thing that says that we have an interrupt problem. Specifically, my theory right now is that an interrupt comes in while the I/O scheduler is playing with the request queue (and interrupts are disabled for this reason) and does something to change the queue, such as pull a request off it and run it. This messes up the request structure that the scheduler is in the process of looking at, causing the crash. That's what it looks like right now, but I don't have any idea how this might be happening.

I can usually reproduce this in a few days of a make -j 64 kernel build, but the last time, it took a week to happen. This slows down the debugging effort some.

What also slows down the debugging process is discovering another serious bug that needs to be fixed. UML processes on x86_64 sometimes segfault. I haven't been able to reproduce this reliably enough to track it down. Until now. I discovered that running two UMLs (one of which was running the 64-way kernel build) was enough to make processes segfault in one or the other of them. After trying to figure out how they could be interfering with each other (i.e. by trashing each other's tmpfs through a tmpfs bug), I found out that the faults are almost legitimate. The faulting instruction is consistent with the fault address passed in by the host kernel. What's wrong is the CPU trap number and error code passed in along with the fault address. In particular, the trap number is 13 (a general protection fault) rather than the 14 (page fault) expected. A trap of 13 is usually caused by a segment being set up wrong, and is not a fixable page fault, so the UML process is segfaulted.

I spent a while trying to figure out how this can be happening, and did, somewhat, although I'm still unsure of the details. In arch/x86_64/kernel/traps.c:do_general_protection, we have this code:

                
	tsk->thread.error_code = error_code;
	tsk->thread.trap_no = 13;

	if (user_mode(regs)) {
		...
		force_sig(SIGSEGV, tsk);
		return;
	} 

	/* kernel gp */
	{
		...
	}

              
Whether the fault is caused by userspace code or kernel code, the task's error_code and trap_no (which are ultimately passed to the process that is running, if it catches SIGSEGV) are set. If there was already a SIGSEGV pending, with its own trap_no and error_code, those will be lost.

I tried to catch this in the act of happening, by adding some code there which looked for an already-queued SIGSEGV, and failed. However, when I applied the obvious fix, which is to only set the error_code and trap_no in the case of a userspace fault, the UML process segfaults disappeared. So, this theory appears to be basically correct, although I'm still missing one or two critical details.
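
The difference between the original code and the fix can be shown with a toy userspace model. Nothing below is kernel code: the struct and function names are invented, and the model only tracks the two fields that get overwritten.

```c
#include <stdbool.h>

/* Toy stand-in for the two fields of tsk->thread discussed above. */
struct thread_model {
    long error_code;
    int trap_no;
};

/* Model of the original behavior: every general protection fault,
 * userspace or kernel, overwrites the saved fault information. */
void gp_fault_original(struct thread_model *t, bool user_mode, long error_code)
{
    (void)user_mode;
    t->error_code = error_code;
    t->trap_no = 13;
}

/* Model of the fix: only a userspace fault records its information,
 * so a kernel fault cannot clobber what a pending SIGSEGV will report. */
void gp_fault_fixed(struct thread_model *t, bool user_mode, long error_code)
{
    if (user_mode) {
        t->error_code = error_code;
        t->trap_no = 13;
    }
}
```

Starting from a saved page fault (trap_no 14), a kernel-mode call to gp_fault_original leaves trap_no at 13, which is the bogus value UML was seeing, while gp_fault_fixed leaves the 14 in place.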

I've sent this patch off for comment, and if there are no objections, I'll be sending it to mainline.

20 Nov 2006
The ptrace detaching and reattaching works with ncpus == 1. I had some nasty flags floating around saying when wait should be called and when it shouldn't. Once I had UML booting, I looked at this nastiness and it turns out that it could all be simplified away.

The next thing to do is try booting with ncpus == 2, and here it breaks. The reason is that there is some waiting on a process that hasn't been attached. With multiple CPUs, this is a problem because when that process was detached, it went back to being the child of its original parent, which may be a different CPU than the one trying to wait for it. Since you must wait for your own children, this is a problem.
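
The rule that you can only wait for your own children is easy to see from userspace. In this sketch, which does not involve ptrace at all and uses made-up function names, a process tries to wait for a pid that is not its child (its own pid, for convenience) and then for a child it really did fork.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Try to wait for a process that is not our child (ourselves).
 * Returns the errno from waitpid(), or 0 if it unexpectedly succeeded. */
int wait_for_non_child(void)
{
    int status;

    if (waitpid(getpid(), &status, WNOHANG) == -1)
        return errno;
    return 0;
}

/* Fork a child that exits with status 7 and wait for it.
 * Returns the child's exit status, or -1 on failure. */
int wait_for_own_child(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(7);
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The first call fails with ECHILD, which is the situation a UML CPU is in when it waits for a detached process that has gone back to being the child of a different CPU.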

The solution is to move all waiting to between attaching and detaching. There is one wait which needs to be moved. It can be moved to just before an attach (from just after a detach), but taking it one step further and moving it to after the attach causes a strange crash later.

In other news, I'm chasing a sporadic crash that I see every few days with a make -j 64 kernel build loop. The crash is a NULL dereference in the AS I/O scheduler, and at this point, it looks like a generic bug. I see no UML involvement right now. I'm putting instrumentation in the block layer to try to track this problem back to the source. I've gone a couple of steps back from the crash, but have no real idea what the problem is yet.

10 Nov 2006
My metadata-mashing workload is still failing. The removed-working-directory fix was wrong. It turns out that the directory may not be removed on the host in d_drop before something else recreates a directory of the same name. The host directory lookup will produce the inode number of the not-yet-removed directory, and that will resurrect the old inode. So, we'll have two UML directories, the removed and the recreated, referring to the same host directory.

It appears that the host rmdir must be done at the same time as the UML rmdir. In order to fix the inode reuse bug, we must hold a reference to the host directory so that the host inode stays around even though the directory is no longer in the host namespace.

What I think I'll do is open the directory and rmdir it. The open file descriptor will hold onto the inode, which will go away when the descriptor is closed, and that will happen when the UML dentry is freed. This complicates things because host file operations require file names. When a UML process whose working directory has been removed creates a file, what name on the host do you use? The only solution I see is openat(), where you open a file whose name is specified relative to an open file descriptor. One trouble is that the infrastructure under hostfs isn't equipped for this. It expects full absolute path names. Another trouble is that openat (and the other *at system calls) are fairly new. For example, FC6 has no man page for openat, although there appears to be libc support for it. I can't count on them existing on any system that might run UML. Even if it exists in the kernel, it may not be in libc, or vice-versa.
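
Two of the building blocks mentioned here can be tried out in userspace: naming a file relative to an open directory descriptor with openat, and a descriptor keeping a directory's inode alive after rmdir. The sketch below assumes a host with the *at calls and a writable /tmp; the function and file names are made up, and it says nothing about how hostfs would be restructured around this.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create a scratch directory, hold it open, create and remove a file
 * in it by descriptor-relative name, then rmdir the directory and
 * check that the held descriptor still refers to a directory inode.
 * Returns 0 if all of that worked, -1 otherwise.
 */
int held_directory_demo(void)
{
    char path[] = "/tmp/openatdemoXXXXXX";  /* hypothetical scratch dir */
    struct stat st;
    int dir_fd, fd, result = -1;

    if (mkdtemp(path) == NULL)
        return -1;
    dir_fd = open(path, O_RDONLY | O_DIRECTORY);
    if (dir_fd < 0) {
        rmdir(path);
        return -1;
    }

    /* No absolute path needed: the name is relative to dir_fd. */
    fd = openat(dir_fd, "scratch", O_CREAT | O_EXCL | O_WRONLY, 0600);
    if (fd >= 0) {
        close(fd);
        unlinkat(dir_fd, "scratch", 0);
    }

    /* The name goes away, but the open descriptor pins the inode. */
    if (fd >= 0 && rmdir(path) == 0 && fstat(dir_fd, &st) == 0)
        result = S_ISDIR(st.st_mode) ? 0 : -1;
    else
        rmdir(path);

    close(dir_fd);
    return result;
}
```

Once the descriptor is closed, the host is free to release the inode and reuse its number.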

While I think about that, I decided to go back to SMP support. The (unforeseen) complication here is that processes must be ptrace detached and reattached when they are scheduled out and back in. The reason is that there will be one host process per UML processor. When a process is switched out, it may be switched back in on a different UML CPU. That CPU must be able to ptrace the process, so the old CPU must have detached it. The default behavior when a process is detached is that it is continued. This is obviously wrong for UML which needs the process to just sit there until it is reattached. So, SIGSTOP is specified as the signal to be delivered at the detach. This works, but it complicates reattaching. Attach also delivers a SIGSTOP, and this stacks on the detach SIGSTOP in the sense that one is stored in task->exit_code and the other in the task's pending signal mask. Both must be cleared out before the process is good for anything, and there must be a PTRACE_CONT in there to get the last SIGSTOP. This nastiness interacts with new thread creation, where the new thread has been fully waited for, to make it very non-obvious where to wait in order to get any thread scheduled properly.

3 Nov 2006
There have been sporadic reports of UML just freezing, but waking up as soon as there is any I/O, such as whacking the keyboard. There wasn't a lot I could do about it as long as there was nothing consistent about it and I couldn't reproduce it myself.

However, I found a workload which would make it happen maybe once a week, so I started chasing it. Over the weekend, it started happening every few hours with this workload. This made things much easier, and I tracked it down to the soft interrupt code. Nothing was wrong with the code, exactly, but there were two things that should have happened in a particular order, and gcc was rearranging the code so that they were reversed. This left (in the assembly I looked at) a one-instruction window in which an interrupt could happen, and get lost.

If the UML was doing only disk I/O, then this would freeze the system, since no more disk requests would get processed until the one outstanding was finished, and its interrupt was lost. It would wake up on any other input because the interrupt handler will look at all active file descriptors, discover the finished disk request, and process it.

I added a memory barrier to fix this problem, and added a couple more, plus made a couple of variables volatile in order to guard against future compiler misbehavior. This patch is now in mainline, and should be in rc5.

30 Oct 2006
I spent the last two weeks chasing externfs bugs. This was prompted by deciding to beat on it before sending it to -mm. I settled on a make -j 16 kernel build and a set of commands which perform lots of conflicting metadata operations - file creation, appends, removals, and mode changes.

The kernel build exposed a file corruption bug which was easily fixed once I figured out what was happening.

The metadata workload was significantly more troublesome. The symptom was that some command would start emitting errors like "Not a directory". These were caused by the in-kernel inode structures getting out of sync with the host filesystem. When a file or directory on the host is removed, the corresponding UML inode should also be freed. If it's not, when the host reuses the inode number for a different file, the old UML inode will be reused, and it will contain the data for the removed file. The "Not a directory" errors come from the removed file being a directory and the new file, with the reused inode number, being a file. Things like find and chmod -R will stat the file and see that it's a directory, since that's what the inode says, try to readdir it, and get a failure when that operation fails on the host.

The underlying problem is that the UML inode structure wasn't being freed when the corresponding host file goes away. There turn out to be a variety of ways to make that happen.

  • The inode may be dirty. In this case, it won't be freed until it has been flushed out to the host. If the file or directory was removed on the host, this opened a window in which the host could reuse an inode number before the inode structure is freed.

    This shouldn't be a problem since externfs performs operations immediately, so the host filesystem is synced with the inode, and the inode shouldn't be dirty. So, I had to chase down all the ways I was dirtying inodes. A common case was i_nlink manipulation. externfs was changing the inode link count in order to keep it in sync with the host, sometimes correctly and sometimes incorrectly. The incorrect changes usually shouldn't have been there in the first place, which alleviated the next cause. The correct ones were calling an interface (inode_{inc,dec}_link_count) which marks the inode dirty. It turns out there's another interface ({inc,drop}_nlink) which doesn't. Switching from one to the other fixed most of these problems.

    Any operation which accessed the file would update the inode's atime, marking it dirty. I fixed this by setting S_NOATIME on all externfs inodes. Any such operation will eventually be reflected out to the host, and the host atime will be updated at that point.

    That left externfs_setattr. This is called on most metadata changes - chmod, chown, truncate (which is also a data operation), etc. The filesystem needs to update the inode as well as handle the operation in whatever way it needs to, and there's a helper, inode_setattr, which does this. Unfortunately, it calls mark_inode_dirty, and there's no way to persuade it not to. Here, I resorted to explicitly syncing the inode after calling inode_setattr. Nasty, but it works.

  • The inode link count may be wrong. If the inode use count is zero and the link count inside UML is greater than zero, the UML VFS layer will hang onto the inode even though the file has been removed on the host, in case some process later reopens the file. Then, the old inode will be reused, saving the trouble of reading it back from disk. Of course, if there's memory pressure, that inode is a prime candidate for being freed, but until then, it will hang around. Again, this opens up a window for the host to reuse the old inode number.

    For the most part, this was caused by me slavishly copying the ext2 i_nlink manipulations. The problem was that some operations, like creat and mkdir, cause a stat() on the host, which fills the inode with a bunch of stuff, including the link count. Any further changes are just wrong. Of course, there are operations which do affect an inode's link count without causing a stat() on the host. rmdir and unlink change the parent directory's link count. These must be done by hand by the filesystem.

    With these fixed, I had one more instance to track down, and this took almost a week.

  • The inode may be a removed directory, but a UML process has it as its working directory. In this case, the process will be holding a reference to the dentry belonging to that directory, which in turn will hold a reference to the directory's inode. Again, this opens the same window of host inode reuse. The solution to this was to postpone the actual host rmdir() to the dentry d_delete operation. This is called when the last reference to a dentry is dropped. At this point, it is safe to remove the directory. I'm only postponing rmdirs, not unlinks, because I can't think of any way that a file dentry could be held after it is removed without the file being opened. In this case, UML will hold the host inode by having the file open.
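The postponed rmdir in the last item can be modelled as a reference count with a pending action. This is a hypothetical Python sketch, not the externfs code: the Dentry class here only borrows the kernel's name, and the host is reduced to a set of directory names.

```python
# Hypothetical model of deferring a host rmdir until the last reference drops.

class Dentry:
    def __init__(self, name, host_dirs):
        self.name = name
        self.host_dirs = host_dirs    # set of directory names present on the host
        self.refs = 0
        self.rmdir_pending = False

    def get(self):
        """Take a reference, e.g. a process using this as its working directory."""
        self.refs += 1

    def put(self):
        """Drop a reference; the last drop plays the role of d_delete."""
        self.refs -= 1
        if self.refs == 0 and self.rmdir_pending:
            self._host_rmdir()

    def rmdir(self):
        """Record the removal; only touch the host if nobody holds the dentry."""
        self.rmdir_pending = True
        if self.refs == 0:
            self._host_rmdir()

    def _host_rmdir(self):
        self.host_dirs.discard(self.name)
        self.rmdir_pending = False
```

As long as some process holds the dentry, the directory stays on the host, so the host cannot reuse its inode number while the guest inode is still alive.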
22 Sep 2006
2.6.18 is out finally. I have a pile of patches stuck in -mm which will finally get flushed out to Linus. I have another pile stuck here which I will finally flush out to Andrew.

I implemented ethernet MAC randomization yesterday. This automatically assigns a random MAC to any interfaces which didn't get one on the command line. This eliminates the practice of assigning a MAC based on the first IP address assigned to the interface. That stopped working when distributions started bringing interfaces up before assigning IP addresses to them.
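For illustration, here is one common way to generate a random MAC, written as a Python sketch. The function name random_mac is invented, and the choice to mark the address as locally administered and unicast is an assumption about what a sensible scheme looks like, not a description of the UML patch.

```python
import random

def random_mac(rng=random):
    """Return a random locally administered, unicast MAC address as a string."""
    octets = [rng.randrange(256) for _ in range(6)]
    # Bit 0 of the first octet clear: unicast rather than multicast.
    # Bit 1 of the first octet set: locally administered, so it cannot
    # collide with a vendor-assigned address.
    octets[0] = (octets[0] & 0xFE) | 0x02
    return ":".join(f"{o:02x}" for o in octets)
```

Passing a seeded random.Random instance as rng makes the result repeatable, which is handy for testing.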

I'm also going to start sending out SMP patches. These won't enable SMP, exactly. They will do things like fix and document locking, so that the locking will largely be ready by the time SMP support itself is there. I also have a patch which allows an SMP kernel to boot with one CPU. This goes through the SMP boot process, and has locking compiled in, but, obviously, there is no concurrency to actually exercise the locking. I'm debating whether that should be sent in.

14 Sep 2006
Another long interval between diary entries, sorry.

We are now up to 2.6.18-rc7 and 2.6.18-rc6-mm2. UML builds and runs fine in both trees, except that the jmpbuf problem with newer libcs is still not fixed in -linus. There's a patch in -mm for this, but Andrew hasn't sent it along to Linus yet. There was recent -mm breakage caused by Rusty fiddling the i386 ptrace.h. This provoked me into splitting it into a userspace-usable part (ptrace-abi.h, which UML uses rather than ptrace.h) and a non-userspace-usable part, in which i386 people can do whatever they want, and I don't care. This resulted in a nice tidying of the UML ptrace.h, eliminating all of the cpp tricks to rename i386 ptrace symbols out of the way of UML ones.

I've started making SMP work in skas mode. What I have so far allows UML to boot in SMP mode with a single CPU. This goes through the mechanics of an SMP boot and locking (without any contention, obviously). The next step is, of course, multiple CPUs. UML currently gets into userspace with two CPUs, but dies. The skas0 mode address space handling needs some generalization. Currently, there is one host process per UML address space. This breaks when there are two threads in the same address space, both wanting to run at the same time. They obviously can't both use the same host address space, and this manifests itself when two virtual CPUs try to ptrace the same host process simultaneously. This happens more often than you might think, between a vfork and an exec, for example. You don't need an explicitly multi-threaded process in order to make this happen.

I was somewhat out of commission for a couple of weeks (although I did get some useful work done on trains and boats) on a trip to Europe which included Linux-Kongress. There, I gave a talk on my view of the state of Linux as a hypervisor. This was essentially the same talk as I did at OLS.

1 Jun 2006
I spent a couple of days tracking down a timer bug which caused sleeps to be about one second short. This was particularly noticeable with one second sleeps, which returned immediately. I tracked it down (I thought) to an optimization in the ktime implementation, which operated on a structure with two 32-bit elements as though it were a 64-bit integer. I wrote a fix and posted that to the world, plus the maintainer, who told me that I had passed a non-normalized time into his API. It turns out that the entire problem was caused by my initialization of wall_to_monotonic, which is a negative time, and is used to convert wall time to the time since the system booted.
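To make "non-normalized" concrete: the usual convention for a seconds/nanoseconds pair is that the nanosecond field lies in [0, 1e9), even when the time as a whole is negative. The Python sketch below shows that convention only. The helper names are made up, and it does not reproduce the ktime code or the actual wall_to_monotonic initialization.

```python
NSEC_PER_SEC = 1_000_000_000

def normalize(sec, nsec):
    """Return an equivalent (sec, nsec) pair with 0 <= nsec < NSEC_PER_SEC."""
    carry, nsec = divmod(nsec, NSEC_PER_SEC)   # floor division handles negatives
    return sec + carry, nsec

def to_ns(sec, nsec):
    """Collapse a (sec, nsec) pair into a single nanosecond count."""
    return sec * NSEC_PER_SEC + nsec
```

So a negative time of -1.5 seconds, written naively as (-1, -500000000), normalizes to (-2, 500000000). Both pairs denote the same instant, but code that assumes the normalized form can go wrong on the first one.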

It turns out that BB had found this bug a long time ago, and had a patch (which fixed a bunch of other things) which he never finished. He sent it out, and I merged everything that was still applicable.

26 May 2006
My x86_64 box rebooted after a long uptime, and got a newer kernel than it had been running before. This new kernel broke UML. This was caused by a bug which had been noted previously on the lists. x86_64 system call tracing returned two exit notifications for each system call. Needless to say, this will badly mess up anything which pays attention to system call notifications.

Ironically, this was introduced as part of another bug fix which potentially could have affected UML. All system call returns on x86_64 are done through sysret, which is a relatively new, low-overhead, system call mechanism. It reserves a couple of registers, in particular, %RCX, into which you load the userspace address to which you want to return. There is the potential for %RCX corruption during sigreturn, which must preserve all registers. This is because UML converts this, like all system calls, into getpid(). There is a special return path in x86_64 for sigreturn, but UML evades this by changing it to getpid. Thus, there is potential for %RCX to be corrupted on return from sigreturn. It looks like that is now fixed.

I figured out what was happening yesterday, and sent a patch to Andi Kleen, plus the x86_64 and kernel worlds. Andi sent back a different, and presumably better patch. However, today, akpm picked up my patch and dropped it into -mm. I immediately requested it be dropped and replaced with Andi's patch.

In other news, Al Viro noticed the warning about strcpy being undefined at the end of the UML build. This was caused by a combination of a sprintf in nfs being converted to strcpy by gcc, and UML following i386 in having an inline-only strcpy. This was fixed by following the i386 CFLAGS more closely, and adding -ffreestanding to them.

23 May 2006
Some more beating exposed a humfs memory leak which killed the UML after about a day of a kernel build loop. I uncovered it after a day of instrumenting humfs. This process convinced me that that was the only humfs memory leak. Every byte allocated was freed and I have the trace to prove it.
11 May 2006
Another week, another batch of patches off to Andrew. These are small fixes, no major features at this point.

I started banging on humfs again. Last I played with it, I could boot a humfs filesystem, but couldn't do a kernel build on one. The build failed with errors that suggested file corruption. Now, that behavior just vanished. I've done a number of kernel builds without seeing that problem. I've seen other problems, but they are now fixed.

The major one was that each user of the AIO subsystem handled -EAGAIN from the host on its own. This happens when the buffer used to send AIO requests to the host is full of already-submitted requests which have not yet finished. When this happens, the caller was expected to handle this in whatever way makes sense. The ubd driver would just stick the queue on a restart list and return back to the block layer. When the ubd interrupt routine next handled a finished request, it would rerun any queue on this list.

This is fine when it's the only user of AIO, but it fails when there are other users. If humfs had filled up the host with requests, then this behavior of the ubd driver will fail because the ubd interrupt routine will never be called again since there are no pending ubd requests.

This prompted the centralization of some code into the AIO subsystem. In particular, AIO users no longer create their own pipes to receive finished requests. Now, there is a single pipe, created by the AIO subsystem, interrupts from which are handled by a single handler. This handler takes the finished requests and hands them off to the AIO user. This is conceptually a fairly simple change, but it resulted in a lot of changed code. The humfs and ubd interrupt handlers were restructured, their -EAGAIN handling is different, structures changed, etc.
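The shape of the change can be sketched in a few lines of Python. This is a hypothetical model, not the UML AIO code: AioCore and User are invented names, and having the central handler restart any user that previously got -EAGAIN is an assumption about one reasonable way to handle it.

```python
from collections import deque

class AioCore:
    """Hypothetical model: one shared request buffer, one completion handler."""

    def __init__(self, slots):
        self.slots = slots
        self.in_flight = deque()   # (user, request) pairs handed to the "host"
        self.waiting = deque()     # users that were refused and want a restart

    def submit(self, user, request):
        if len(self.in_flight) >= self.slots:
            if user not in self.waiting:
                self.waiting.append(user)
            return False           # stands in for -EAGAIN
        self.in_flight.append((user, request))
        return True

    def complete_one(self):
        """The single handler: finish the oldest request, then restart waiters."""
        user, request = self.in_flight.popleft()
        user.finished(request)
        waiters, self.waiting = self.waiting, deque()
        for waiter in waiters:
            waiter.restart(self)


class User:
    """An AIO user such as a block driver or a filesystem."""

    def __init__(self, name):
        self.name = name
        self.queue = deque()       # requests not yet accepted by the core
        self.done = []

    def enqueue(self, core, request):
        self.queue.append(request)
        self.restart(core)

    def restart(self, core):
        while self.queue and core.submit(self, self.queue[0]):
            self.queue.popleft()

    def finished(self, request):
        self.done.append(request)
```

In this model a driver whose request was refused gets restarted when anybody's request completes, so it no longer depends on one of its own requests being in flight.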

At this point, humfs seems healthy. It boots and does kernel builds. As long as I don't discover any major problems, I'm going to send it in for 2.6.18.

26 Apr 2006
The time namespace patches I sent out just before leaving for Brazil got a bunch of comments pointing out things I did wrong. The PTRACE_SYSCALL_MASK interface is wrong because strace -e (which is the other possible user) has opposite requirements from UML. It wants to selectively trace system calls rather than selectively not trace them. There were also some bugs in the implementation.

Eric Biederman pointed out some failings in the actual time virtualization code. What I did won't work so well when you want to migrate a container from one host to another. This will require some thought.

LWN picked up on it and Jon Corbet wrote a better summary of these patches than I managed.

I spent last week in Porto Alegre, Brazil for FISL 7.0. It's a pretty good show, though somewhat less technical than the likes of OLS and LCA. It's larger, with ~4000 attendees registered, and it had a small exhibition floor. There were a good number of international speakers, and we all seemed to be in one track, in the largest room, with simultaneous translation into English, Spanish, and Portuguese. My own talk was somewhat sparsely attended. The user-level, eye-candy-type talks were much better attended.

The UML Book is now out! There was a bit of a delay at the printer, but it is now available, and people have it in their hot little hands. It's available from Prentice-Hall and from Amazon.

13 Apr 2006
I've been slowly redoing the UML web site. The idea is to make it more friendly to newbies, by having more step-by-step instructions for the normal ways of doing things and less reference-type dumps of information. The reference stuff will still be there, but it won't be the first thing you see. What I have so far can be seen here.

I implemented time namespaces in the host and support for them in UML. The idea is to virtualize time in the host by creating a partition which has its own independent clock. This partition takes the form of a namespace which is created by the new unshare system call. Within this partition, a settimeofday call will just change the offset inside the namespace without changing the system time. gettimeofday reads the system time and adds the offset.
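The offset arithmetic is simple enough to show directly. The following Python sketch models only the bookkeeping described above. TimeNamespace is an invented class, and the injectable system_clock parameter exists just so the behaviour can be checked with a fake clock.

```python
import time

class TimeNamespace:
    """Toy model of a namespace with its own clock offset."""

    def __init__(self, system_clock=time.time):
        self.system_clock = system_clock
        self.offset = 0.0

    def settimeofday(self, new_time):
        # Only the namespace-local offset changes; the system clock is untouched.
        self.offset = new_time - self.system_clock()

    def gettimeofday(self):
        # Read the system time and add the namespace's offset.
        return self.system_clock() + self.offset
```

After a settimeofday inside the namespace, the namespace clock keeps ticking at the system rate, just shifted by the stored offset.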

The advantage to UML is that this allows gettimeofday to run on the host as though it were running inside UML. The results are the same, except that the system call doesn't need to be traced. In order to make this work, I also had to add a mechanism for system call tracing to be selective - I want to turn off tracing of gettimeofday while retaining interception of everything else. This is done with the new PTRACE_SYSCALL_MASK, which takes a bitmask saying which system calls will be traced and which won't.
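A per-system-call bitmask of this kind might look like the Python sketch below. The encoding (one bit per system call number, set meaning "trace") and the numbers used are assumptions for illustration. The actual layout of the PTRACE_SYSCALL_MASK argument is not described here.

```python
# Illustrative values only; real numbers depend on the architecture.
NR_SYSCALLS = 320
NR_GETTIMEOFDAY = 78

def trace_all(nr_syscalls=NR_SYSCALLS):
    """Return a mask with every system call marked as traced."""
    return (1 << nr_syscalls) - 1

def untrace(mask, nr):
    """Return a copy of mask with system call nr left untraced."""
    return mask & ~(1 << nr)

def is_traced(mask, nr):
    """Report whether system call nr would still be intercepted."""
    return bool((mask >> nr) & 1)
```

UML's use would be to start from trace_all() and clear the single bit for gettimeofday, whereas something like strace -e would start from an empty mask and set a few bits.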

With this stuff working, you'd expect that gettimeofday would be pretty close to native speed, since it runs on the host without UML doing anything. The two measurements I've done, with a loop of a million calls, are 98.8% and 99.2% of native.

I need to get the host side of settimeofday working, and then I will send this off to LKML as an RFC.

31 Mar 2006
All of the patches that BB and I sent to akpm are now on their way to mainline. So, 2.6.17 will have TLS and hotplug memory support. The TLS support, in particular, was a long time coming, but it works very well and has received a lot of testing, so I don't expect many problems. It would be too optimistic to expect no problems when exposing something new to a much larger user base, but I don't expect any problems to be very large.

OLS papers are due tomorrow (April 1), so I've been working on mine for the last couple of days. Eric Biederman is giving a talk on extending namespaces to cover the entire kernel, so there's going to be some overlap there, since that's also relevant to my hypervisor talk.

28 Mar 2006
A bunch of UML patches have gone to Andrew in preparation for 2.6.17. I sent in 10 more today, which were mostly Al Viro's UML cleanup patches. The one that was mine was a cleanup of the earlier printf patch.

BB sent in his TLS patches. They've been tested for a long time, and have no known bugs - the last one was spotted and fixed a few days ago.

This eliminates a whole lot of patches from my development tree. With -mm2, I should be into the low 40s. There will be three major (i.e. multipatch) projects left to be merged:

  • externfs, hostfs, and humfs - humfs can't currently do a kernel build, and all of that code is in one directory, where it should probably be three. Also, we have questions about the structure of the underlying code, mainly the stuff that deals with filehandles.
  • UML/S390 - Most of the preparatory patches are merged. The signal handler restructuring is still sitting there, as are a couple of patches which I really don't like. The add-gate-vmas patch works, but it's fairly gross. The fix-jiffies patch is small but it seems like the wrong way to go about it. Also, there has been some bit-rotting, as the skas0 interfaces have changed since I merged Bodo's patches into my tree.
  • The ubd driver rework - I have a report, which I can't reproduce, that this hangs under heavy I/O. Also, I have yet to sort out dealing with the early I/O on COW files with O_DIRECT in the picture.
In other news, the sanitized kernel headers project was restarted. This is of interest because that can, if done right, greatly simplify UML's use of the host arch's headers. Now, most of them can just be reused by UML. However, some of them are mostly usable, but have stuff which is wrong for UML. I either copy these into asm-um, leaving out the objectionable bits or include them, but use various nasty tricks to get rid of the parts I don't want. Usually, this means using defines before including them to rename the things I don't want.

With a clean set of kernel headers, that won't be necessary any more, assuming that they include all of the userspace-usable things, rather than just the things that make up the kernel ABI. Kyle Moffett has seemed amenable to accommodating UML, so we might get some nice UML header cleanups from his KABI work.

25 Mar 2006
I fixed the humfs hang, so I consider it to be stable at this point. However, I haven't yet tried a kernel build loop on it.

BB sent out his current TLS patchset for comment and review. I dropped it into my akpm tree, which contains the patches destined for -mm. It works fine, and fixes a problem that I had been chasing. So, it looks good for 2.6.17, except that the patchset itself is a bit disorganized.

This morning, I sent out 16 patches to Andrew for routing to mainline. They include

  • the rest of Gennady Sharapov's isolation of libc code
  • fixing the get_user warnings that popped up on current gcc
  • memory hotplug
  • allowing a ubd device to be shared among clustered UMLs
  • a handful of smallish bug fixes
22 Mar 2006
With hostfs mostly out of the way, I started looking at humfs. This was fairly easy. The bugs were mostly due to bit rot, and easily fixed. The one exception was a bug in the symlink handling in humfsify. It turns out that both -d and -l are true in Perl for a symlink that points to a directory. This was faking humfsify into treating such symlinks as directories.
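The same trap exists outside Perl, because a plain directory test follows symlinks. Here is a Python analogue, offered as an illustration rather than as the humfsify fix. The classify and demo helpers are invented for this sketch.

```python
import os
import tempfile

def classify(path):
    """Classify a path without following a final symlink."""
    if os.path.islink(path):       # must come before the directory test
        return "symlink"
    if os.path.isdir(path):        # follows symlinks, like Perl's -d
        return "directory"
    return "other"

def demo():
    """Build a symlink to a directory and report how the tests see it."""
    with tempfile.TemporaryDirectory() as top:
        target = os.path.join(top, "real_dir")
        link = os.path.join(top, "link_to_dir")
        os.mkdir(target)
        os.symlink(target, link)
        return os.path.isdir(link), os.path.islink(link), classify(link)
```

Both tests report true for the link, so the order of the checks decides whether the link is copied as a link or walked as a directory.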

On an intensive I/O load on humfs, I get hangs after a while. The AIO thread is somehow faked into calling io_getevents when there are none to get. My investigation into this uncovered some weaknesses in AIO handling of -EAGAIN, when no requests can be queued until some have been retrieved with io_getevents. The ubd driver assumes that, if it gets -EAGAIN from an AIO submission, some pending requests are ubd requests, and that the queue will be restarted from the ubd interrupt handler.

However, if the AIO queue is filled with humfs requests, there will be no ubd request to wake up the queue, so the ubd driver will stall. This isn't hard to fix, but there are some subtleties, mostly in avoiding races.

19 Mar 2006
I found a couple of inode refcounting bugs in externfs. With these fixed, I can do kernel builds on hostfs. What was happening was that I was giving inodes extra references. When the files were deleted, they were deleted on the host, but the inodes within UML remained. At some point, a new file on the host would get the same inode number as the deleted file. When externfs looked up this inode number, it got the inode structure belonging to the old file since that was never thrown out due to the extra reference counts. This led to directories looking like files (when the old inode was a file) and to normal files having the wrong length (because the length was taken from the old file).

There are some things which still don't work. Loop-mounting an image from a hostfs filesystem doesn't work (and never did, even with the old hostfs) because it doesn't support sendfile. I added that, and it turns out that the commit_write method is broken in a few ways. It returns a byte count, rather than 0/-errno, which is wrong. It also supports only mmap, rather than read/write to the host.

17 Mar 2006
I spent this week banging on the new, externfs-based hostfs. The book is going to be out in a few weeks, and there are some things it promises which I need to get working. The new hostfs and humfs are among them. I've killed a bunch of bugs, and am currently chasing one more. Everything I've found has been in the externfs layer, so I'm debugging humfs at the same time. Hopefully that will work without too much trouble once I'm happy with externfs.
8 Mar 2006
I implemented open, read, and close for umlfs, so now you can actually look at the files you've exported from your UML to the host. I spent most of the day tracking down a stupid vfsmount and dentry refcounting bug. Once that was fixed, umlfs started working nicely again.

I found and fixed a bug in my FUSE async support. It turns out I was referencing something right after it was freed.

1 Mar 2006
FUSE will fully support async operation when Miklos sends my patches to mainline. I did a second round of the O_ASYNC patch and another patch to enable O_NONBLOCK. Miklos queued them both, so we'll probably be seeing them in mainline.

I'm also finding inconveniences in the FUSE library, which I am complaining about and sending patches for, starting with a receive routine which doesn't let me know if the /dev/fuse read returned -EAGAIN. Miklos is being receptive, and fixing things.

In actual UML work, I implemented readlink today, so you can cd and ls around inside the UML filesystem on the host without many problems. There are still some problems on filesystem boundaries which I don't understand yet.

BB made a release of the uml_utilities which contains a good number of cleanups, including some 64-bit fixes in the COW file utilities. The header wasn't specified correctly for 64-bit boxes, with the result that a 64-bit system would misread a COW header produced by a 32-bit system. I merged his changes into my tree, which contains the umlfs utility, plus some mconsole work. I'll be making an official release once I get some data off a laptop disk. Among that data is the Makefile which I use to actually upload new releases of the utilities and other UML-related files.

27 Feb 2006
Miklos (the FUSE guy) announced mountlo today, which is very similar to the FUSE thing that I've been doing. He wrote this for a very different purpose - to allow user-level mounting of host filesystems, where UML is just an enabling tool. His design is also very different - he creates a device inside UML and has a UML process doing the file operations. In related news, I polished the fuse-async patch a little and sent it to Miklos.

I finished my review of the page proofs of the UML book, fixing a number of errors, some of which were pretty embarrassing. I think this is the last thing I have to do with it until it's published. At this point, it's on automatic pilot inside the publisher.

24 Feb 2006
I got the umlfs FUSE filesystem working a little. lookup and readdir now work, so you can now cd and ls in a UML filesystem on the host. There is a dentry corruption problem, which causes crashes later (on shutdown usually). I think I'm either double-freeing a dentry somehow or dputting it too often. Today I looked at adding asynchronous notification to FUSE, so that the /dev/fuse file descriptor can generate a SIGIO when there's something the userspace server has to do. Lack of this is annoying since I have to generate other types of interrupts to UML in order to get it to service FUSE requests. Normally, this means whacking the keyboard. I am building the O_ASYNC-enabled FUSE now, and we'll see how well it does.
20 Feb 2006
I started seriously working on putting a FUSE server inside UML to export the UML filesystem to the host. It was easier than I expected, but needed some effort because the FUSE library wants to do things during initialization that can't easily be done inside UML, such as setting signals, forking, and execing. What I ended up doing is putting that stuff in a separate helper process, which gets a file descriptor to /dev/fuse and passes that to UML through mconsole. Inside UML, the driver does the rest of the initialization, like creating a fuse session and registering an operations vector.

I copied the "hello" filesystem into my driver for testing purposes, and that mostly works. One wart is that /dev/fuse doesn't support SIGIO, so UML can't easily get an interrupt when a FUSE operation needs to be handled. I'm planning on fixing this and sending the patch to Miklos.

The UML book is nearing print. I'm currently reviewing a PDF containing page proofs. This gives the thing a reality that it didn't have before. Amazon knows about it already, and it is currently 2,382,397 in sales rank. The official release date is April 7.

8 Feb 2006
There is a virtualization infrastructure thread happening on LKML right now which looks like it fits nicely with my thinking on having UML do less system call tracing. The OpenVZ and CKRM folks are interested in this, as is Eric Biederman, who wants to migrate workloads around a compute cluster. The thinking is to introduce namespaces for the various resources that the kernel controls and make compartments by creating new namespaces and grouping them together. For UML, this would mean creating a compartment which contains enough UML data that system calls execute directly on the host, and do the right thing.

Two of the eight patches I sent to mainline got dinged yesterday. Ulrich Drepper didn't like my resurrection of the internal jmp_buf defines that he got rid of, and suggested that I just reimplement setjmp and longjmp myself. Linus didn't like the uaccess warning patch since I apparently got rid of the warnings by making changes to the declarations which didn't make any sense.

2.6.16 seems to be winding down now, so I think I'm through with pushing patches into it.

6 Feb 2006
Last week, I discovered that rc1-mm4 didn't boot on x86_64, while everything up to -mm3 did. A UML process would die strangely after having a page fault handled. This is an invitation to do a bisection search on the patches between the two releases in order to find the culprit patch. I just did the search on the UML patches rather than everything, and that turned up a patch which did nothing but move code from one file to another. Further poking at that patch revealed that moving a jmp_buf from one file to another was enough to cause the crash.
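For reference, a bisection over an ordered patch series can be written as below. This is a generic Python sketch with invented names. It assumes the tree is good with no patches applied, bad with all of them applied, and stays bad once the culprit is in.

```python
def first_bad(patches, is_bad):
    """Return the index of the patch that first makes the tree bad.

    is_bad(n) reports whether the tree with the first n patches applied fails.
    Assumes is_bad(0) is False, is_bad(len(patches)) is True, and that the
    failure persists once introduced.
    """
    good, bad = 0, len(patches)
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(mid):
            bad = mid
        else:
            good = mid
    return bad - 1
```

Each step halves the range, so a series of 64 patches needs only six builds and boots.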

Needless to say, this was uninformative. There was another clue, in the form of a complaint by the host kernel about a bogus signal frame whenever UML crashed like this. I pursued this by comparing "good" signal frames to "bad" ones, and decided that the floating-point register values must be to blame. It turns out that these weren't initialized to any sort of sane values when a new UML process was created. When I fixed this, the crash went away. I still have no idea what the connection is with the jmp_buf moving around.

Last week, Andrew released rc1-mm4 and Linus released -rc2. After tracking down that bug this morning, I updated my trees, throwing out patches that had made it into mainline. After this, I had 67 patches in both trees. Then, I sent 8 more to Andrew. These are small fixes and cleanups that had accumulated over the last couple of weeks.

30 Jan 2006
With FUSE now in mainline, I started thinking about umlfs, which would export, through FUSE, a UML filesystem to the host. I think this would provide pretty much everything that the mconsole exec fans want. If you chroot yourself to this mount, you will be running the UML binaries, looking at the UML /proc, etc. So, you'd be able to fix passwords and look at running processes. The things that wouldn't work as expected would be things that don't go through the UML filesystem. For example, ifconfig creates a socket and calls ioctl on it to get interface information. This is a host socket, so you'd just be looking at the host's interfaces. Anything that looks at the UML filesystem will report on the UML.

This started me thinking. We would have processes that are acting almost exactly like they are running inside a UML, except they are not being system call traced like normal UML processes. Instead, they are coupled to the UML at the subsystem, i.e. the filesystem layer, level. This provides a whole new way of thinking about how to do virtualization.

This is essentially migrating a UML process to the host. When you migrate a process from one system to another, the process can't know that it has moved. The new system has to proxy the old system's information to the process. This can be done by redirecting every system call back to the home system, which is how UML works. However, that's not necessary. When you do file accesses, there are perfectly good network filesystems which will readahead and cache the remote system's data so that a process doesn't have to call back to its home system on every file operation. This is what the FUSE does (or could do) with the umlfs data. File operations would run at full speed as long as FUSE has the required UML data cached already.

The question this raises is whether it's possible to do this in general - have nearly all UML process system calls run untraced on the host, with the host proxying the UML data such that those system calls run at full speed. This would have to include

  • memory - all cached data would have to be in memory owned by UML in order to maintain memory jailing
  • processes and networking - the host would have to show UML processes the UML network and the other UML processes. This information would need to be passed from the UML to the host and kept up to date there in order to minimize callbacks to the UML.
  • kernel version skew would be an issue - a process using a new flag that the host didn't implement would get a -EINVAL if the system call runs natively on the host. So, there would need to be some selective system call tracing when this is a problem.
So, this would require some significant support in the host kernel in order to run, but raises the possibility that UML at some point could mostly do away with system call tracing and run at nearly native speed. How close it gets is determined by how often the host would have to make UML processes wait for information from UML.

Since this is really implementing process migration, this will be of interest outside the virtualization world. The same support would be usable for migrating processes between systems, except that communication between those systems would be over the network rather than over memory as with a host and a UML running on it.

And once you have process migration, you start to think about full-blown clustering. With migration, you are mirroring one system's data on another. Clustering takes that one step further and does away with the idea that data belongs to a particular system. The data belongs to the cluster and lives wherever it is most convenient.

So, implementing virtualization as process migration makes a not-quite-complete migration system useful as a performance boost for UML. Finishing it by implementing the network support needed for two physical systems to migrate processes would provide full process migration to Linux. And that will get people thinking about implementing full SSI clustering.

27 Jan 2006
I have a bunch of patches in my queue which together redo how the ubd driver works. I've been wanting to send it to Andrew for a while, except that they broke COW. This week, I tracked down the problem. I wasn't specifying the length of bitmap I/O requests properly, causing some parts of the COW bitmap not to be written out. The next time the filesystem was mounted, these missing parts of the bitmap would cause data to be read from the backing file rather than the COW file, causing file corruption.
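The effect of a partly written bitmap is easy to see in a toy model. The Python sketch below is hypothetical (CowDevice and remount are invented names) and keeps everything in memory. It only shows why a lost bitmap bit turns into stale data from the backing file.

```python
class CowDevice:
    """Hypothetical model: one bit per block says whether the COW file owns it."""

    def __init__(self, backing):
        self.backing = list(backing)          # read-only backing file blocks
        self.cow = [None] * len(backing)      # blocks written since the COW was made
        self.bitmap = [False] * len(backing)  # True means "read from the COW file"

    def write(self, block, data):
        self.cow[block] = data
        self.bitmap[block] = True

    def read(self, block):
        return self.cow[block] if self.bitmap[block] else self.backing[block]

    def remount(self, bits_written):
        """Simulate a remount where only the first bits_written bits reached disk."""
        for i in range(bits_written, len(self.bitmap)):
            self.bitmap[i] = False
```

The data is still sitting in the COW file after the remount, but with the bit missing the read goes to the backing file and returns the old contents.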

With that fixed, I upgraded my FC4 filesystem to FC5 Test, and it seems to be working pretty well. The most obvious difference is that udev is a lot faster in FC5.

18 Jan 2006
-mm4 broke the UML build in a couple of ways. There is a new thread flag, TIF_RESTORE_SIGMASK, which, among other things, allows some signal code to be pulled from the arches and made generic. UML didn't define this, or support it, so the build broke. Adding support was fairly simple, so I did it.

There was a __iowrite32_copy function introduced which is built in to the kernel unconditionally, even though UML is most unlikely to use it. It calls __raw_writel, which UML didn't define. I stole the definition for this (and the "b" and "w" variants while I was at it) from x86.

My asm-offsets patch had a bug in it which broke the build. This is fixed, but not yet on its way to Andrew.

After these, UML builds and runs again.

The soft interrupts patch is now in -mm, finally. It hasn't been sent to Linus yet, but I told Andrew that it is 2.6.16 material.

Linus released 2.6.16-rc1 and Andrew did -mm1 in quick succession. I duly updated my trees, throwing out a number of patches (which had been merged) in the process. My -mm tree has 58 patches in it, which is a nice decrease from the ~90 I had before. When I push out the ubd cleanup and the TLS stuff, that looks like it will go under 40, which is much more manageable.

BB took back his TLS patches which I had cleaned up some, and cleaned them up some more, so that they build on all combinations of CONFIG_MODE_TT and CONFIG_MODE_SKAS. I dropped those back in my tree, replacing my old ones. They seem to work OK, except that his system seems to define a struct user_desc and mine doesn't. I have a struct modify_ldt_t_s instead. We'll need to sort out which one we can count on. I also deleted an annoying error message. These need some more looking at, but they're closer to being sent to Andrew.

8 Jan 2006
The aforementioned half-dozen patches are now on their way to Andrew. The nasty compilation fix is now a lot nicer. It turns out that I could use the existing asm-offsets.h mechanism, which fixed the problem and let me delete some Makefile crud.

All of my previous patches are now in the hands of Linus, so they should be in his tree soon, if they're not there already.

6 Jan 2006
I sent Andrew a smallish set of four patches - three libc code isolation patches and the futex.h consolidation. The futex.h patch came about because I needed to revert the UML futex.h back to the original version, which just returned -ENOSYS. It turns out that most of the architectures had their own identical copies of this. Rather than have UML create yet another copy, I stuck it in asm-generic and made the other arches use that.

I spent today cleaning up the debris from the earlier patches that came out in -mm1. It turned out that the boot output was printed twice due to the mconsole stack patch. This isn't happening in my main 2.6.15 tree for some reason, so I didn't notice until I saw it with -mm1. There was also a nasty compilation problem which I have a nasty fix for, plus reverting my earlier patch that checks at compile-time that either MODE_TT or MODE_SKAS is enabled. Adrian Bunk sent in a Kconfig way of doing it.

Beyond that, I have about a dozen more patches ready to go. These are the ones that lead to softints. This has been in my tree forever, and it speeds up UML quite nicely, so it's about time it went to mainline.

With these in mainline, my own patchset goes from the low 90s down to the 50s, which is a much more manageable number. To get more patches out of my tree, I'm thinking of cleaning up the ubd driver series and the TLS/NPTL series and sending them to Andrew for inclusion in 2.6.17. That's about 15 more. If that happens, I will be down to around 40 patches, which is less than half of what I started with.

3 Jan 2006
Linus released 2.6.15 yesterday. It contained a last-minute set of fixes from Paolo which fixed segfaults caused by running printk on the wrong stack, cleaned up some code, and cleaned up the compilation. I pulled it, and it seems healthy.

That release was my cue to send in a bunch of patches to Andrew. I sent twelve, which were the bunch that I did in the last couple of weeks, cleaning up the console code, and a bunch of other cleanups that I did along the way.

Next on the list is the umid OS abstraction and the mconsole printk interception that I did last week. Then, I'm looking at the ubd patch series that I've had for a while.

31 Dec 2005
I generalized the mconsole printk thing a bit, and made sysrq use it. Now, when you invoke sysrq, you get any output back in the mconsole client. This is especially nice with sysrq t since that gives you the stack of every process on the system. It's much better to get it from the mconsole client than to have to go looking through the UML's dmesg in order to see it.

29 Dec 2005
While playing with the UML consoles last week, I noticed that it is possible to register consoles at runtime, and this makes it possible to send printk output back to an mconsole client. This is done by registering a console from the mconsole driver. This console is called whenever there is output. Normally, it ignores the output and just returns. However, when a "stack" command is active, it collects the output and sends it to the client. This will capture the stack and registers nicely. It will also capture any other printk that happen to occur at the same time. I don't know of a way of ignoring any such output, but that shouldn't happen very often.

Now, I need to hook sysrq up to this mechanism because it has had the same problem. It is also possible to monitor all printk output from the host. I would do this by adding another mconsole notification for the kernel log. Whenever there is printk output, it would be sent to the mconsole notification socket and whatever is listening to it.

In doing this, I learned something about Unix sockets. There is a limited number of packets that can be pending at once, and it is easy to fill it up. I was trying to figure out why the mconsole client printed only a couple of lines of stack and then hung. It turns out that the sendto from the mconsole driver was returning -EAGAIN and I wasn't checking for it. This happened after only a few hundred characters of output, which seemed a bit thin. After looking at the code, it turns out that there is a limit of 10 (by default - this is tunable from /proc or sysctl) packets pending on a Unix socket. Send another and you get -EAGAIN.

I was sending one packet per printk call, some of which were pretty small. I changed things so that the output gets accumulated into a buffer, which is sent out when it fills up. This is more complicated, but it makes things work a lot better.
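
The accumulation idea can be sketched roughly as follows. This is a toy model, not the mconsole driver: the class name `OutputBatcher`, the `send` callback, and the 512-byte default capacity are all made up for illustration.

```python
class OutputBatcher:
    """Accumulate small writes and hand them to `send` in larger packets."""

    def __init__(self, send, capacity=512):
        self.send = send          # callable taking one bytes object
        self.capacity = capacity  # maximum packet size in bytes
        self.buf = bytearray()

    def write(self, data):
        # Fill the buffer piece by piece; send a packet each time it fills.
        while data:
            room = self.capacity - len(self.buf)
            self.buf += data[:room]
            data = data[room:]
            if len(self.buf) == self.capacity:
                self.flush()

    def flush(self):
        # Send whatever is pending, e.g. when the command finishes.
        if self.buf:
            self.send(bytes(self.buf))
            self.buf.clear()


packets = []
batcher = OutputBatcher(packets.append, capacity=8)
for _ in range(10):
    batcher.write(b"abc")   # ten small "printk" calls
batcher.flush()
print(len(packets))         # far fewer packets than writes
```

Fewer, larger packets make it much less likely that the pending-packet limit is hit, although a real sender would still need to check for -EAGAIN.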

25 Dec 2005
Last week was console week. I decided to figure out what was causing the -EBADFs in deactivate_all_fds on shutdown sometimes. This is a consequence of closing a file descriptor and forgetting to shut down the associated IRQ. In the console code, it is not obvious that when the close happens, the IRQ is also freed. So, I decided to restructure the code to make it so. In doing so, I discovered a bunch of other things that needed fixing and cleaning up.

I ended up with 10 new patches, which do things ranging from code reformatting with no functional changes to fixing the console behavior when pasting a large amount of data into one.

That last problem has been a long-standing one. It turned out to be two problems. One is that the console driver never implemented throttling, which is how the tty driver tells the hardware driver to stop sending it data. Adding this caused large pastes to stop losing data.

The other problem was that the process receiving the large paste would see an EOF in the middle and exit, and the shell would receive the rest. This turned out to be caused by a bug in the driver exercising a bug in the tty driver. The tty driver bug was that it was updating a counter before queuing a character, assuming that the enqueuing would succeed. However, it can fail, and the counter being updated for a nonexistent character can cause premature EOFs to be emitted by the tty driver.
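
The counter-before-enqueue pattern is easy to reproduce in miniature. The sketch below is a toy bounded queue, not the tty code; `BoundedQueue`, `put_buggy`, and `put_fixed` are hypothetical names.

```python
class BoundedQueue:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []
        self.count = 0  # how many characters the consumer believes are queued

    def put_buggy(self, ch):
        # Bug: the counter is bumped before we know the enqueue will succeed.
        self.count += 1
        if len(self.items) >= self.capacity:
            return False
        self.items.append(ch)
        return True

    def put_fixed(self, ch):
        # Fix: only count characters that were actually queued.
        if len(self.items) >= self.capacity:
            return False
        self.items.append(ch)
        self.count += 1
        return True


q = BoundedQueue(capacity=2)
for ch in "abc":
    q.put_buggy(ch)
print(q.count, len(q.items))  # counter is ahead of the real queue length
```

Once the counter runs ahead of the queue, anything that trusts it will act on characters that were never delivered.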

The UML bug was that, while it detected a full tty buffer and rescheduled the processing of the current interrupt, it delayed it only long enough to return from the IRQ handler, because the tasklet was scheduled immediately. Delaying that for one jiffy fixed that bug.

14 Dec 2005
There has been a problem with the TLS patches since they've been in my patchset - they don't build on x86_64. This is now fixed. As part of my cleanup of the TLS stuff, x86_64 now builds and runs with the full patchset. I did implement a bunch of do-nothing stubs for x86_64, which is worrisome since it should have the same requirements as i386. However, my x86_64 filesystem boots fine, and always has. This will need looking into later.

9 Dec 2005
The Book has pretty much been handed over to the publisher now. I finished some last-minute items like the acknowledgements, bio, and artwork earlier this week. I still need to reread the thing to see if there's anything that needs fixing, but aside from that, it is done.

Another book showed up on Linux Journal. It's not really entirely about UML - it's about debugging and performance tuning on Linux - but there's a chapter on UML, and Linux Journal chose that chapter to excerpt (or the publisher chose it to make available for excerpting).

On actual UML work, Blaisorblade's NPTL patches broke in an interesting way with 2.6.15-rc5-mm1. I started getting lots of complaints about PTRACE_SET_THREAD_AREA calls failing, and failing with a bogus errno. After looking at this, it turns out that those calls had always been failing, but failing silently because

  • ptrace was called from kernel code
  • in kernel files, errno is renamed to kernel_errno to avoid conflicting with the libc errno
  • so uses of errno in kernel files will refer to kernel_errno, not the real libc errno
Since the PTRACE_SET_THREAD_AREA calls and references to errno were in kernel files, they tested kernel_errno, which is little-used, and had some random value in it. Until -rc5-mm1, that random value happened to be 0, and it looked like the calls were succeeding. In -rc5-mm1, kernel_errno contained 2, so I started seeing all these nasty error messages.

I fixed it so that the constant failures were recognized as such, then tracked them down. It turns out that TLS entries weren't being copied into the child process during a fork, and bogus, empty entries were given to the child instead. Fixing this allowed me to remove a patch which I made without understanding it, but which made NPTL and the TLS stuff work. With that patch gone, I think TLS support in UML is understood, and we can start cleaning it up in preparation for mainline.

15 Nov 2005
Hotplug memory is now a semi-reality. What made it so was a patch from Badari Pulavarty which allows the punching of holes in mmapped tmpfs files. Even as limited as this, it is exactly what I need for hotplug memory. The way this currently works is that you use uml_mconsole to add or remove memory in the same way that you'd plug or unplug a device. The driver inside UML will try to allocate the requisite memory. If it fails to allocate it all, then you won't have pulled out the full amount. You can try again a bit later after the kernel has had a chance to free up some memory.

You can plug in memory in the same way. The current restriction is that you can only plug in memory that had been previously unplugged. You can make this somewhat less onerous by giving the UML a generous amount of memory at boot time and immediately unplugging a lot of it.

This much has always been possible. What hasn't been possible up to now is actually freeing the memory to the host. This is what Badari's patch does. Once it's freed on the host, it can be plugged into a different UML or just left on the host.

The next step is a memory management daemon on the host which watches the memory pressure on the UMLs and the host, shuffling memory around as needed. One thing that's fairly important is to keep the host from swapping. This makes UML performance much more predictable, as it won't need to be swapped in to be woken up. It also avoids some pathological swap conditions where the host and a UML swap the same page to their respective swap devices.

The Book is nearing completion. I've sent two revisions to the publisher in the last couple of weeks. The final manuscript is due at the end of the month.

28 Oct 2005
So, it's been a long time since the last entry. Some of that was laziness, some was enforced by a stolen laptop. A lot has happened in UML-land over the last six months, and I'll mention the highlights.

I'm about done writing a book about UML. This has been taking a great deal of my time, and it is just about done. My deadline for the final manuscript is Nov 27. The first draft is done, and I'm currently going through it, and the reviewers' comments, to polish it up.

It is a how-to-use-UML book, from getting started for the first time to setting up and running large UML servers. There's also some history and my prognostications about the future. The publisher is Prentice Hall, and it's due out in the spring.

In actual UML development, 2.6.13 and 2.6.14 (as of yesterday) are out. A great number of UML patches are now in mainline, making it much more robust than it had been. skas0 is now in there, along with a bunch of associated performance improvements. More are coming in 2.6.15, notably Bodo's ldt patches.

The host AIO support went in during 2.6.14, but I yanked the ubd driver's use of it at the last minute. There were problems in the driver that were exposed by its use of AIO. I fixed them, but didn't feel comfortable with them going in so late in the 2.6.14 cycle. So, they will be introduced early in the 2.6.15 cycle in order to give them some testing.

x86_64 has been working well for a while and now seems reasonably mature. We are still smoking out bugs in skas0 once in a while. The most recent one is the skas0 assembly stubs being assembled in an unexpected way. It turns out the way I had written them (mostly with each register assignment in a separate asm statement) doesn't guarantee that the registers will stay that way up to the actual system call. It also turns out that there are asm idioms for doing this right.

As for the lost laptop, this happened during my trip to the FSM (Free Software Meeting) in Dijon, France. On the train from the airport to Paris, someone sat down opposite us, and distracted us with some coins he dumped on the floor. When we were looking under the seats, he lifted my laptop from the overhead rack and left. I didn't discover this until we got into Paris. The UML web site hadn't been checked into CVS for a while (or backed up), so the only up-to-date copy of the XML I had disappeared with the laptop. So, I grabbed a copy of the site and gradually reproduced the XML from that HTML. This was done a week or so ago, and I was able to start updating my patches page.

6 May 2005
I went back to UML/x86_64 and got it working. The problem that I was stuck on was that some processes would segfault in a way that I couldn't diagnose. It turns out there were a few bugs in the x86_64 signal delivery code. They all were involved in extending the stack downward when constructing a signal frame. One of them just tried to extend the stack too far. Another, which was a generic bug, failed to exempt signal frame generation from a stack extension check. And the last totally mishandled a failure to do a virtual to physical translation, resulting in signal frame data being written to physical address zero.

With these fixed, UML/x86_64 seems healthy. I finished building a 64-bit LFS filesystem with it. Aside from some difficulties with the packages themselves, that went fine.

Since my 2.6 test box is my x86_64 box, and since AIO appeared in 2.6, testing the UML AIO support has been held up on getting UML/x86_64 working. Since that is done, I went back to AIO and made it work. Considering that the host AIO support had never been tested before, it was surprisingly easy. A couple of bug fixes later, and it was working. I instrumented it to see that multiple requests were being sent to the host, and they were. I saw up to 16 outstanding requests during a boot.

I have a ton of stuff to merge into mainline, so I started working on that. Al Viro helped out by breaking his big cross-build patch into manageable chunks. I sent those along, as well as about 12 others that I had pending. I messed up a couple of them. I just forgot one of Al's patches, on which future ones depended, so they didn't merge well for Andrew. Another didn't go at all because I diffed it against a built tree, and the result didn't patch into a clean tree. In the end, those all ended up in 2.6.12-rc3-mm3, and Andrew queued them to Linus for -rc4.

Today, I sent out 12 more patches. This time, I tested them in a clean tree, fresh from being untarred from kernel.org tarballs and patches. With those in, my patchset should be down to around 60. There are more easy, independent fixes in there, so I'm going to start doing them.

22 Apr 2005
I got my virtualized scheduler working somewhat, and announced it to LKML. Dead silence from LKML, and a bit of reaction from my group at Intel, where I forwarded the announcement. The LKML announcement can be seen here.

The gist of it is this:

  • The virtual schedulers form sched groups, each of which is a CPU container that competes as a single process on its host scheduler. The processes within the guest compete against each other for whatever CPU time the container process gets from the host scheduler.
  • The sched groups are visible in /proc as /proc/sched-groups/pid, where pid is the process id of the process that made the sched-group. These directories now contain the former /proc/pid directories, and symlinks have been left behind for compatibility. Initially, all processes are in sched group 0, /proc/sched-groups/0, which is the host scheduler.
  • The available schedulers are visible in /proc/schedulers. A process becomes a guest scheduler by opening one of those files.
  • You move processes from one scheduler to another just by moving the pid directory from one /proc/sched-groups directory to another.
The example in the announcement is three CPU hogs, one on the host scheduler and two in a guest. They should get a 50-25-25 CPU split because the two in the guest are competing for the 50% of the CPU that the container process gets from the host scheduler. This is how it actually looks:
                
root       292 49.1  0.7  2324  996 tty0     R    21:51  14:40 bash -c 
root       293 24.7  0.7  2324  996 tty0     R    21:51   7:23 bash -c 
root       294 24.7  0.7  2324  996 tty0     R    21:51   7:23 bash -c 
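
The expected split follows from dividing each scheduler's share equally among its competitors. A minimal sketch of that arithmetic, assuming equal weights and treating a sched group as a nested dictionary (the function `cpu_shares` and the data layout are invented for illustration):

```python
def cpu_shares(group, share=1.0):
    """Map each process name to its fraction of the CPU.

    `group` maps a name either to None (a plain process) or to a nested
    dict (a sched group competing as a single process on its host).
    Assumption: every competitor at a given level gets an equal share.
    """
    result = {}
    if not group:
        return result
    each = share / len(group)
    for name, member in group.items():
        if isinstance(member, dict):
            result.update(cpu_shares(member, each))
        else:
            result[name] = each
    return result


host = {"292": None, "guest": {"293": None, "294": None}}
print(cpu_shares(host))  # {'292': 0.5, '293': 0.25, '294': 0.25}
```

The measured 49.1 / 24.7 / 24.7 is close to this ideal.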

              
Currently, it's for UML only - there are a couple of minor things which I suspect will cause it not to build on x86 or anywhere else.

In other news, I brought UML up to 2.6.12-rc3. This required some nasty hackery to get around an interaction between skas0, stack randomization, and a consistency check in exit_mmap. Basically, the stub data page added by skas0 was causing the check to possibly fail, depending on where the process stack ended up in memory. Bodo and I worked out a nasty solution this afternoon on #uml. I've been working on getting skas0 in shape for mainline, and I pretty much had it, but this is going to require some more work to do cleanly.

Bodo's S/390 port seems to be coming along nicely. He reports UML working in TT, SKAS0, and SKAS3 modes. I've been merging his non-S/390-specific patches, and waiting for him to bless the S/390 bits so I can merge those, too.

11 Mar 2005
Along with normal UML stuff, I've spent the last couple of weeks virtualizing the Linux scheduler. What this means is that you have a guest scheduler which looks to the host scheduler like a single process. This is done currently by having a process open a magic /proc file:
cat /proc/schedulers/guest-26
This causes the cat process to turn into a new instance of the scheduler running on top of the original. This scheduler gets whatever cycles it can from the host scheduler and uses them to run its own processes. This means that the group of processes within the guest scheduler compete for cycles from the host scheduler as a single process, making this a CPU compartment.

I gave /proc a bit of an overhaul, with all the processes on the system starting in CPU compartment 0, which is represented by /proc/sched-groups/0. So, all the former /proc/pid directories start off in /proc/sched-groups/0, with symbolic links pointing there from /proc, so that ps and related utilities continue to work.

When a process opens /proc/schedulers/guest-26 and becomes a guest scheduler, a new entry in /proc/sched-groups is created:

                
usermode:~# cat /proc/schedulers/guest-26 &
Created sched_group 177 ('guest-26')
[1] 177
usermode:~# ls -l /proc/sched-groups/
total 0
dr-xr-xr-x    2 root     root            0 Mar 11 17:20 0
dr-xr-xr-x    2 root     root            0 Mar 11 17:20 177

              
This is initially empty, except for itself, because there are no processes in this scheduling group. We fix this by moving one there - in this case, the shell that we are typing at:
                
usermode:~# mv /proc/sched-groups/0/158 /proc/sched-groups/177/
usermode:~# ls -l /proc/sched-groups/177/
total 0
dr-xr-xr-x    3 root     root            0 Mar 11 17:20 158
dr-xr-xr-x    3 root     root            0 Mar 11 17:23 177
dr-xr-xr-x    3 root     root            0 Mar 11 17:23 185

              
So, you can see that pid 158 is now in this group, along with 185, which is the ls.

The next thing to get working is to have a couple of infinite loops in a sched-group and one outside, and to see that the two inside each get 25% of the CPU and the other gets 50%. This doesn't work right now because I don't have the timer interacting with the guest scheduler correctly.

What does this have to do with UML? UML is going to provide the structure for virtualizing the scheduler and other kernel subsystems. You can make UML (or any part of it, such as the scheduler) run inside the kernel as a guest by treating this as a new "OS". You would port it to internal kernel interfaces rather than the libc system call interface. This makes the "U" part of "UML" something of a misnomer, but that's OK.

In the longer run, I'd like to be able to run a userspace guest scheduler which would have the same properties as the current in-kernel guest. This would use its cycles to run the (otherwise unjailed) processes under its control. A tt-mode-like implementation would have normal processes being run one at a time, while a skas-mode-like implementation would have a single process constructing the confined processes from pieces provided by the host. So, for each confined process, this scheduler would have an address space, a set of register values, a set of open files, etc, and activate them all at the same time when the corresponding process is supposed to run.

The confined processes would no longer exist in the host kernel, except as disassociated parts. Thus, moving a process from the host scheduler to a guest scheduler is really a process migration. I still want the guest scheduler and its processes to be visible on the host as they are now, so this means that the host will have to have "stub" processes which are representatives of processes which are owned by something else. Operations on these stubs will be passed along to the scheduler that owns them.

This is starting to resemble a cluster, with processes migrating from node to node, but visible across the cluster. I think that this process of virtualizing kernel subsystems one at a time can lead to something resembling a cluster.

16 Feb 2005
The x86_64 problem was that I was trying to use 64-bit constants and not noticing the assembler telling me they were being truncated. I now compute them (by or-ing two 32-bit constants together). Now, I have to start hitting it with some I/O to see how well the AIO support works.
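
To make the truncation concrete, here is a small sketch of the arithmetic. The constant is an arbitrary example, not one of the values from the UML source, and `truncate32` and `join64` are hypothetical helpers.

```python
MASK32 = 0xFFFFFFFF


def truncate32(value):
    """What is left of a constant once only the low 32 bits survive."""
    return value & MASK32


def join64(high, low):
    """Build a 64-bit value by or-ing two 32-bit halves together."""
    return ((high & MASK32) << 32) | (low & MASK32)


wanted = 0x00007FFFFFFFF000           # example 64-bit constant
print(hex(truncate32(wanted)))        # 0xfffff000: the high half is lost
print(hex(join64(0x7FFF, 0xFFFFF000)))  # 0x7ffffffff000: rebuilt from halves
```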

15 Feb 2005
I finished the first pass of the ubd AIO support. The driver now issues as many requests as it can, leaves them for the AIO layer to deal with, and tells the block layer when a request is finished. It's only tested on 2.4, which does one request at a time to the host, even though a bunch have been issued by the driver. This is still a reasonable test - the one thing that's different from the 2.6 AIO is that the requests are guaranteed to finish in the order they are issued. I think there are no issue-order dependencies, but I need to test this on 2.6 to be sure.

However, my 2.6 box is x86_64, on which UML is currently broken. A possible culprit is Bodo's skas0-clone patch, which makes sure that child processes get a copy of the parent's segment registers by having the stub in the parent actually create the child. Bodo gave me a mass of assembly, which I gradually translated into C (with a few assembly helpers) over the course of a couple days.

Another possible culprit is the VM op batching which I added recently, which has the process stub perform a number of VM changes at once instead of switching back and forth to UML for each one. I discovered two bugs in the x86_64 support for that (one of which was also present in the x86_64 skas0-clone support). These haven't changed the symptom greatly, which is that the whole thing hangs on the exit of the first non-init process. I'm seeing the two processes sharing a stack, which is very bad, and I haven't yet tracked the cause down.

4 Feb 2005
In preparation for 2.6.11 having a good, buildable-out-of-the-box UML, BlaisorBlade and I have been feeding select patches from my queue to Andrew. In addition, I sent a patch to Linus (directly, not through Andrew, which he took (!)) which fixed some mainline build problems that didn't exist in -mm. In the meantime, BlaisorBlade pulled some stability patches from my patchset and sent them to Andrew, who pretty promptly sent them on to Linus.

In the meantime, the sense of urgency we feel about this has been lessened somewhat by UML in a different way - its ability to expose ptrace bugs. Bodo was seeing segment register corruption, and tracked it down to a race in the host kernel where there was a window during which a parent could ptrace some values into the process, and the process would just overwrite them due to being in the middle of a context switch. He explained this race to me in #uml, and I got it, somewhat dimly, after a while. He then brought it up on LKML, got Andrew and Nick Piggin interested in it, with Andrew saying that this was a bug to hold up 2.6.11 for.

I've started going through my patch backlog and integrating stuff from it. The most interesting piece I did was Gerd Knorr's X11 framebuffer driver. This gives UML a real framebuffer to use as a console, which is pretty slick. It has to be configured somewhat carefully - until I disabled some things and enabled others, UML first wouldn't build, and then it built and ran, but just popped up an empty black window for a console. This patch will fix defconfig so that the configuration is right, and it will build out of the box.

There are some oddities with it. If you have two consoles with gettys on them, they will both appear in the framebuffer window. This seems to me like we need to enable virtual console switching, although it may also be a symptom of my /dev/tty0 botch. I haven't looked at it closely enough to tell yet.

1 Feb 2005
I spent a good part of the last week playing with AIO support in the ubd driver. I'm doing this very incrementally in order to be able to track down breakage easily when it happens. The first step, which went well, was to remove the ubd I/O thread and use the existing UML aio support, which will run an I/O thread of its own on 2.4 hosts.

Next, I tried dequeuing a request and handling it totally by myself instead of using elv_next_request to give me a piece of it at a time. This went less well, due to my not understanding the 2.6 block layer changes. Once I figured out the relationship between the struct request sector, the bio sector, and the SG offset, things went rather better.

Now, I'm pulling a full request off the queue, turning it into a set of scatter-gather structs, and issuing them to the host. Right now, it's one at a time like before, but the infrastructure is closer to being able to issue them all at once, and letting the lower layers issue them to the host in whatever way they can.

I fell a little behind Andrew over the last couple of weeks, so I caught up today. Merging in rc2-mm2 was fairly easy, and I'll be pushing the current incremental patches out today.

21 Jan 2005
Bodo's segfault stub cleanups accidentally broke x86_64 by reintroducing the original bug that had me stuck for a while. Those changes brought back the property that UML re-enters userspace after a page fault from inside a system call. To recap, this is bad because system calls return using SYSRET, which returns to the address stored in %rcx. This means that the system call wrappers can guarantee that they don't care about %rcx, but a fault that can happen at any time where %rcx might contain something important can't return to userspace using SYSRET.

I complained at him about this, and he came up with the neat suggestion of signalling the end of the address space fixups with an int3, which generates a SIGTRAP, but does it with a processor interrupt rather than a system call. This returns to userspace using IRET, not SYSRET, which solves the problem nicely.

I had my patchset down to 10 patches at one point after sending another batch to Andrew. It's now more than that as I merge more stuff from BlaisorBlade and keep working off my backlog. Most of this stuff is cleanups. I split the AIO support out of externfs and started to add AIO support to the ubd driver. Also, I'm getting UML help from other people at Intel, and those patches have started going into my tree. We're separating out the userspace code from the kernel code in preparation for a ring 0 port of UML.

14 Jan 2005
I sent 28 patches to Andrew the other day. I had 50 patches in my patchset and it was getting a bit much. This included most of the x86_64 stuff. Amusingly, Andrew's -mm3 announcement got eaten by vger's spam blocker, and it was my fault:
The 2.6.10-mm3 announcement was munched by the vger filters, sorry. One of the uml patches had an inopportune substring in its name (oh pee tee hyphen oh you tee). Nice trick if you meant it ;)
I'm not actually sure what he's talking about. I grepped my patches for the suspect string and it wasn't there. So, maybe it was one of his names for my patches that did the trick.

The largest things I'm still holding onto are skas0, Bodo's faultinfo rework, and the tlb flushing improvements.

skas0 is coming along, but it still needs work. Bodo made a good suggestion about how to pass page fault information back from the segfault handler. This simplified the code and made it more efficient. The original implementation was

  • the stub handles the segfault, and copies the fault information from its sigcontext to registers
  • it stops itself
  • the kernel reads the values from those registers with PTRACE_GETREGS
  • the kernel continues the process by putting the original registers back and PTRACE_SYSCALLing it.
This didn't work on x86_64 because of the way SYSRET works. If you try to continue from a signal from the inside of a system call, you corrupt %rcx. So, on x86_64, I had some special code which continued the stub by having it return fairly normally from the signal with sigreturn.

Bodo made the following suggestion

  • instead of putting the fault information in registers, just make the stub's stack available to the kernel and have the stub write the address of the sigcontext in a known location
  • have the stub hit itself with a signal that's blocked in the handler, but unblocked outside, causing it to stop with a signal right after the sigreturn
  • the kernel reads the page fault information from the sigcontext, handles the page fault, and continues the process
This unifies the x86 and x86_64 implementations, removes some architecture-specific definitions that used to be needed for pulling the page fault information from the register set, and makes the x86_64 implementation cleaner and faster.
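The second step of Bodo's scheme depends on ordinary signal-mask behaviour: a signal raised while it is blocked stays pending and is delivered as soon as it is unblocked. A minimal, Unix-only sketch of just that behaviour follows. It assumes it runs in the main thread of a Python process, there is no ptrace or stub involved, and the function name and the choice of SIGUSR1 are made up for illustration.

```python
import signal
import threading


def raise_blocked_then_unblock(signum=signal.SIGUSR1):
    """Raise a signal while it is blocked, then unblock it.

    Returns (pending_while_blocked, delivered_before_unblock,
    delivered_after_unblock).
    """
    delivered = []
    old_handler = signal.signal(signum, lambda num, frame: delivered.append(num))
    # Block the signal, standing in for "blocked in the handler".
    signal.pthread_sigmask(signal.SIG_BLOCK, {signum})
    try:
        # The thread hits itself with the signal.
        signal.pthread_kill(threading.get_ident(), signum)
        pending = signum in signal.sigpending()  # queued, not delivered
        before = list(delivered)
    finally:
        # Standing in for the return from the handler, which restores
        # the mask: the pending signal is delivered at this point.
        signal.pthread_sigmask(signal.SIG_UNBLOCK, {signum})
    after = list(delivered)
    if old_handler is not None:
        signal.signal(signum, old_handler)
    return pending, before, after
```

In skas0 the delivery at the unblock point is what makes the traced stub stop where the kernel can read the sigcontext. Here the only observable effect is that the handler runs at that point and not earlier.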

Bodo also noticed that skas0 doesn't copy ldt entries properly from parent to child across a fork. This is not hard to fix, but does point out that skas0 is not ready for prime time yet. I'm also concerned about page fault speed. skas0 is noticeably faster than tt mode for everything else, but it is a couple of context switches slower when mapping in a single page during a page fault. He made a suggestion for this, which should speed it up a little, but still won't touch those extra context switches. Rather than having the stub run the system call and signal itself, it's a little quicker to just single-step the system call. This saves the system call exit from the single-stepped mmap and the entry for the kill(). Not a lot, but I'll take anything I can get.

I also have patches for improving tlb flushing performance by merging adjacent operations wherever possible. The next step, which I had implemented, and then somehow lost, is to have skas0 batch them up so the stub does a lot of them without switching back and forth to the kernel on each one.

Also in my tree now is a patchbombing from Blaisorblade, so I am back up to 40 patches. He's doing a lot of good cleaning work, and these patches make large improvements in some areas of the code.

I need to get 2.4 back on track, so I've been figuring out what patches need to be merged from 2.6. Sadly, I'm not organized enough to have held onto copies of the patches I have sent to Andrew. So, I pulled all his broken-out patchsets, and grepped out the UML-related ones. There are 166 of them, not all of which are backportable to 2.4. However, probably most of them are, so that's what you'll be seeing in the 2.4 part of my patches page soon.

7 Jan 2005
I finally got the x86_64 port merged into my tree. See the incremental patches page for the patches. I will be sending them to Andrew when I think they're stable, which right now, they're not. I have a busybox filesystem which boots nicely, but when I try anything fancy, I get crashes.

To test this, and to get a 64-bit filesystem, I've been doing the Linux From Scratch thing. This is a hugely useful project. I started off by naively building gcc, then realizing I was going to need libc, and trying to build that, and failing. Starting a toolchain and libc from scratch is subtle, and you'll waste a lot of time if you want to figure it out on your own. So, I now have a temporary set of tools in my filesystem, and I'm going to continue building it from inside UML to exercise it.

A while back, BlaisorBlade started asking questions about why something very much like skas mode can't be implemented in stock, unpatched hosts. It turns out that the answer is that it can. The result is skas0 mode, which offers the security of skas mode, and a good part of the performance, without needing to patch the host. The two things that skas mode gives you can be implemented on stock hosts

  • /proc/mm - This lets you create new address spaces without needing a new process for each one. This can't be done without patching the host, but the benefits of this are somewhat lower memory consumption in the host kernel. It's easy enough to get new address spaces by creating new processes.

    It also lets you modify those address spaces, i.e. mapping and unmapping pages, and changing page protections. This can be done by inserting code into the process, and making the process call it whenever you need address space changes. I've done this by taking away the top two pages of the process address space, and using one for code that the process will run, and the other for handling SIGSEGV. More about that later.

    The code page has two pieces of UML code mapped into it. The first just executes a system call. UML sets up the process registers for the system call and sets the IP to the start of this bit of code, and continues the process. The child just executes the system call and signals itself so UML knows that it is done. This is used for mmap, munmap, and mprotect.

  • PTRACE_FAULTINFO - This allows UML to get page fault information from the child when it gets a SIGSEGV. This is done in skas0 with the second piece of code inserted into the child. This is a segfault handler, which reads the page fault information out of its sigcontext, puts them in registers, and signals itself. UML then reads the registers, and gets the page fault information from there. The data page that is mapped into the child is just the stack for the SIGSEGV handler to run on.
I'm trying to convince myself that this obsoletes tt mode so completely that it can just be thrown out. From a security point of view, it's a no-brainer. From a performance point of view, skas0 is generally a win, but there are specific spots where it performs worse. In particular, page fault handling is slower. tt mode can get the page fault information directly from its stack, and fix its address space just by calling mmap/munmap/mprotect directly. skas0 has to make two context switches in order to get the page fault information, and at least two more to fix it. This aside, it is noticeably faster on my favorite benchmark, the kernel build. UML can do a build in just under three minutes on my laptop.

I've been doing some other performance improvements in order to get skas0 performance above that of tt mode wherever possible. I've removed some unneeded code from the system call path. I've also made tlb flushing more efficient by minimizing the number of system calls needed. This is done by merging adjacent operations wherever possible. For example, a large munmap, which used to be done page-by-page, is now coalesced and done with a single munmap. Similarly, mmaps which are contiguous in memory and the backing file and have the same permissions are done in one operation, rather than one operation per page.
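The merging rule can be sketched as follows. This is a toy model and not the UML code: the function names, the tuple layouts and the 4096-byte page size are assumptions made for illustration, and an operation is only merged with the one queued immediately before it.

```python
PAGE_SIZE = 4096  # assumed page size for the example


def coalesce_unmaps(page_addrs, page_size=PAGE_SIZE):
    """Merge per-page unmap requests into (start, length) ranges."""
    ranges = []
    for addr in page_addrs:
        if ranges and ranges[-1][0] + ranges[-1][1] == addr:
            start, length = ranges[-1]
            ranges[-1] = (start, length + page_size)
        else:
            ranges.append((addr, page_size))
    return ranges


def coalesce_maps(ops, page_size=PAGE_SIZE):
    """Merge per-page map requests given as (addr, file_offset, prot).

    Two requests merge only if they are contiguous in memory, contiguous
    in the backing file, and have the same permissions. Returns a list of
    (addr, file_offset, length, prot) tuples.
    """
    merged = []
    for addr, offset, prot in ops:
        if merged:
            m_addr, m_off, m_len, m_prot = merged[-1]
            if (m_addr + m_len == addr
                    and m_off + m_len == offset
                    and m_prot == prot):
                merged[-1] = (m_addr, m_off, m_len + page_size, m_prot)
                continue
        merged.append((addr, offset, page_size, prot))
    return merged
```

Each tuple that comes out corresponds to one host system call, so three adjacent pages cost one munmap instead of three.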

Having started using evolution to keep track of deadlines and to-do lists, I've started thinking again about embedded UML as a way of adding things that I would like, but which are specialized enough that they probably never will be.

To recap, embedded UML is the idea of making UML into a library which can be linked into other apps. Having done this, you would then implement a little filesystem which would be mounted inside the UML and give you access to internal application state. This would be exactly like /proc, which provides access to internal kernel state. Many entries in /proc allow reading and writing kernel variables, but many have more complicated semantics. This "appfs" would be exactly the same thing, except it would be specific for the application that you're embedding UML into.

Getting back to evolution, a very simple thing I would like is some statistics on the average age of my todo items. Are they getting older or younger? And maybe a pretty graph showing that. I would like to do this by making an "evofs", mounting it (on /evolution in the embedded UML, say), looking at the items in /evolution/tasks, pulling out their ages and doing the calculation. Let's say that each task is a directory, with each attribute of the task in its own file within that directory. So, I could write a script that looks at /evolution/tasks/*/start-date, calculates the age of each one, and averages them.
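Such a script might look like the sketch below. No such filesystem exists, so the layout is the hypothetical one just described, and the assumption that each start-date file holds a single ISO date (YYYY-MM-DD) is mine.

```python
import glob
import os
from datetime import date


def average_task_age_days(tasks_dir, today):
    """Average age, in days, of the tasks under tasks_dir/*/start-date.

    Assumes each start-date file contains one ISO-format date.
    Returns None if there are no tasks.
    """
    ages = []
    for path in glob.glob(os.path.join(tasks_dir, "*", "start-date")):
        with open(path) as f:
            start = date.fromisoformat(f.read().strip())
        ages.append((today - start).days)
    return sum(ages) / len(ages) if ages else None
```

Inside the embedded UML this would be called as average_task_age_days("/evolution/tasks", date.today()), and running it from cron would give the data points for the graph.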

I realize that there's a config file with a reasonable format under ~/.evolution which can be parsed to provide the same information, but there are other things I want which can't be had by parsing config files.

For example, there are scheduled things which have to happen on the same day each month, but which I'd like evolution to schedule on a weekday if that happens to be Saturday or Sunday on a given month. I don't necessarily fire evolution up during the weekend, so the alert will show up on Monday in those cases, which would be non-optimal. If it alerts me on Friday, I make a note on a Post-It and have some chance of remembering it during the weekend.

Needless to say, there is no "put this on the 15th of each month, except if that's a weekend, in which case put it on the previous Friday" button. So, what I want to do is have a little daemon in my UML inside evolution which watches for new appointments by watching /evolution/appointments via dnotify or inotify. When one shows up on a weekend, it would just change the date field to the previous Friday. There are obvious problems with appointments which are scheduled forever into the future. One of these can't make an infinite number of /evolution/appointments directories for this daemon to look at. Evolution could just make directory entries for appointments that are visible in the interface. So, if you go scrolling through the rest of the year, everything you see will be on a weekday.

Another example - I'm starting to write a book about UML, and just finished the schedule for it, i.e. when each chapter will be ready. I had a similar problem with this. Each deadline is some number of days later than the previous one, and some of those will end up on a weekend. I wanted each deadline to be on a weekday, except the nearest one this time, so a Sunday deadline gets moved to Monday. A similar daemon could be used to do this - the logic would be quite similar.
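The date logic for the two rules is small. A sketch, with function names of my own choosing, covering only the calculation and not the watching of /evolution/appointments:

```python
from datetime import date, timedelta


def previous_friday(d):
    """Move a weekend date back to the Friday before it."""
    weekday = d.weekday()  # Monday is 0, Saturday is 5, Sunday is 6
    if weekday == 5:
        return d - timedelta(days=1)
    if weekday == 6:
        return d - timedelta(days=2)
    return d


def nearest_weekday(d):
    """Move a weekend date to the closest weekday.

    Saturday goes back to Friday, Sunday goes forward to Monday.
    """
    weekday = d.weekday()
    if weekday == 5:
        return d - timedelta(days=1)
    if weekday == 6:
        return d + timedelta(days=1)
    return d
```

The two functions differ only in what they do with a Sunday, which is why the two daemons would be nearly identical.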

However, we can't have two daemons battling over the same appointments, so we need some way to specify which private rule will be applied to an appointment. The best thing would be some button in the appointment dialog that says "If this appointment falls on a weekend, it should be moved to the previous Friday" or "If this appointment falls on a weekend, it should be moved to the nearest weekday". Evolution can export its internal state to an internal UML via a filesystem, so it can export its UI in the same way. So, somewhere in /evolution/ui, there'd be a directory for the appointment dialog box. Fiddling that in the appropriate way, i.e. creating directories or files for the new labels and checkboxes with the information needed to position it correctly, would cause additions to the UI when you next create an appointment.

However, if you do check those boxes, evolution won't do anything about it because it knows nothing about those widgets except that you told it to make them. But part of the information that you plugged into the UI could tell evolution to include the final state of those widgets in a particular place within the appointment's directory. So, when it set up the checkboxes, my script could tell it to put their state in the nearest-weekday and previous-friday files and could look at /evolution/appointments/00123/nearest-weekday and /evolution/appointments/00123/previous-friday to tell whether I selected either one and move it appropriately.

Admittedly, these are minor things to use to justify a non-trivial project such as making UML embeddable, and then modifying applications to embed it. However, there are much more substantial uses for this that I can imagine. For example, the embedded UML can export application filesystems to the outside world via NFS or some other network filesystem. In a group environment, this could mean exporting your task list to your boss's evolution, where his UML will make a big list by glomming together all the underlings' lists. The individual lists could be mounted under /tasks/alice, /tasks/bob, etc, and then symlinks made from /evolution/tasks to tasks within those mounts. The "taskfs" running that filesystem would see the symlinks being created, read the contents, and create the datastructures inside the application to make all those tasks appear within the boss's UI.

As described above, the UI could also contain additions, such as the owner of a task, that the original UI doesn't have. Then, there could be an interface for changing it. Again, this would be completely implemented by a script within the embedded UML. Changing the owner of a task would involve moving it from one person's list to another's. From the point of view of the script, this is moving the task's directory from /tasks/bob to /tasks/alice. Then, Bob's and Alice's UMLs will find out, via NFS, about the moved file, make the changes locally, and by doing so, inform their respective evolutions so that one will show one more task in its UI and the other will show one fewer.

I don't know if evolution has this sort of thing already, or if it's planned. If not, this shows how embedding a UML into an application can make it much easier to extend, possibly in fundamental ways, such as making an isolated application group-aware. You get a standardized interface, the Unix file interface, and all the Linux tools for using that interface in all applications that embed UMLs. You don't need to worry about the application's source, or building it, or the language that you need to use. The one place where you're dependent on the application is what information it exports via a filesystem and what you can do through it.

18 Oct 2004
UML updates have been going nicely into 2.6 via Andrew, so much so that 2.6.9 won't be that different from my own tree. Thanks to Blaisorblade for pushing patches to Andrew as well. He pushed the initial set that got UML back into -mm, plus a batch just under the wire for 2.6.9-final.

This leaves the question of what happens to the separate 2.6 UML patches. Right now, my plan is to stop producing them, and let people use the UML in -mm or -linus. People who want the latest bleeding-edge stuff or a bug fix that hasn't made it to Andrew or Linus yet can just grab it from my patches page.

The other major UML happening is getting the x86_64 port merged into my tree. The results of this can also be seen on my patches page. I've got most of it merged now - the exception is the 32-bit compatibility code, a lot of which is gross, and it's optional, so I'll merge that after the main body of the patch is merged and working.

So, I'm currently trying to get it to build. It looked ugly at first, but fixing the most common compilation errors made the rest look a lot more manageable.

23 Sep 2004
My patches are continuing to flow nicely through Andrew into 2.6. At this point, 2.6.9-rc2-mm2 is almost the same as my 2.6 tree. There are still some patches to go, but it's pretty close.

I'm working on getting the other major outstanding patches merged. BlaisorBlade sent me a good number of his patches, and I've got them mostly merged now. He did a bunch of good work on the build. make -j now works nicely, plus UML does its final link in the same way as the other arches. His patch did away with linux being the default target, but I added that back. You now get a vmlinux and a linux, which are the same thing. I sent about half of these to Andrew tonight, so they should be appearing in the mm tree shortly.

Another of the outstanding patches is the x86_64 port. I spent yesterday breaking it up into smaller patches. So, now I have 29 patches that I will be merging in. I have three of the smaller ones in now. They are available from the new -mm section of the patches page.

I also revived the skas4 patch. I'm having confusing problems with UML host processes getting signals that I think they shouldn't. Debugging on this is continuing.

After all these are done, or maybe in parallel, I'll start looking at Gerd Knorr's patchset, plus smaller things like meta_tdb. This will knock the patch backlog down nicely, and then it will be possible to start looking at the small things that people have sent in, and which have been languishing.

11 Sep 2004
This month's big news is that Andrew has started feeding the UML patches he has been keeping in his tree on to Linus. BlaisorBlade sent the big patch plus some patches of his own to Andrew, and I sent in some more which made it build and work. At that point, which was sooner than I expected, Andrew sent the whole thing to Linus, who, miraculously, took it.

There were some glitches. Among them were my patch changing the initial value of jiffies from -5 minutes to 0, which I put in because it was getting a strange value in UML, another which added a UML patch number to EXTRAVERSION, and, of course, ghash.h. The first two were easily fixed, and I ripped ghash out of UML the day before yesterday.

I've also started feeding him the patches that make up the difference between what he has and what I have here. So, UML should be reasonably up to date in -mm, and also 2.6, pretty soon.

BlaisorBlade made a suggestion which I think is reasonable, which is that -mm and 2.6 should be the UML stable tree, and my tree should be the development tree. Thus, patches would be forwarded from my tree to Andrew when we think they're stable, and my tree would consist of so-far-inadequately-tested patches. I'm also considering stopping releasing 2.6 patches. You'll have the 2.6 tree with a good UML in it, and my ongoing development will be available as patches from my incremental patches page.

I've also got externfs/humfs ported to 2.6 and working reasonably well, except for some internal glitches, which will be fixed. With some more work to enable mmap in humfs, this will let us start eliminating the excess memory consumption caused by double-caching on the host and UML.

19 Aug 2004
I've been keeping up with 2.6. I put out a second 2.6.7 patch which syncs up my 2.4 and 2.6 trees. Then, I put out a 2.6.8.1 patch which just updates to 2.6.8.1. I'm also tracking -mm in hopes of cleaning up the UML in Andrew's tree enough that he can send it to Linus, and I can finally get something current into stock 2.6.

I spent the last couple of days writing my views of where UML is headed in the future. It is somewhat grandiosely called the Road Map. It doesn't contain anything that I haven't talked about before, but that is all buried in the papers and slides on the site, and I think that very few people read them. This puts it all in a much more conspicuous spot on the site.

Most people, understandably, consider UML to be just a virtual machine technology. I'm trying to make clear that virtualization in general, and UML in particular, are potentially much more than that. I talk about porting UML into the kernel, and the possibilities that creates. Also, UML can be linked into other applications and used as a captive virtual machine and I describe what I see as some of the possibilities there.

12 Aug 2004
The Kernel Summit and Ottawa Linux Symposium have come and gone. Virtualization was a pretty consistent theme in both, and UML came up often. At KS, Chris Wright gave a talk on virtualization on Linux in which he described the various technologies that are currently available. Then he talked about what Linux could do to better support virtualized guests. Since I had given him most of the material for this part, he more or less turned it over to me, and I went through my laundry list of things that the host could do to make life easier for UML. None of what I said seemed particularly controversial, which is nice. Wim Coekaerts of Oracle was particularly interested in UML's need for AIO, since Oracle uses it, and Wim perceives a lack of interest in it on the part of Linus.

At OLS, there were a number of virtualization talks. An IBM Power guy, Dave Boutcher, gave a talk on hot-plug CPUs and memory. The CPU stuff is going to be used by UML in exactly the same way that the other arches will. However, his memory hot-plug plans were aimed at being able to remove a particular piece of physical memory from the machine. This requires moving all kernel data structures from that memory (or not putting them there in the first place). UML's memory hotplug plans are much simpler. It is sufficient to be able to grab a number of pages and free them to the host, and it doesn't matter where they come from, or that they are contiguous.

There was a talk by the Xen people, which I missed because I went to a different talk. Chris gave his Linux virtualization talk again, which I punted on this time because I had heard it already. I don't know what he did with the section that I pretty much did during KS.

On returning from Ottawa, it was time to get back to UML, which I can now do pretty much full-time, thanks to Intel. Since then, I've got UML pretty much up to date with all the trees I keep up with. I fixed a bunch of problems in the latest 2.4.26 release, released a 2.6.7 patch, and, last night, got 2.6.8-rc4-mm1 running and sent the requisite patches to Andrew.

I've started using quilt to manage patches, and started posting my current patchsets to the UML site. This has gotten good reviews from my users, and has helped me by giving me some assurance that I'm not going to majorly break anything when those patches are released for real.

I currently have my 2.6.7 tree synced up with 2.4.26, except that humfs and hostfs aren't all that stable yet after being ported over. The mm tree has a pile of bb patches in it which I need to sync up with, and merge any applicable ones into 2.4.

15 Jul 2004
Recent UML work has focussed on the host-based filesystems, mainly humfs. The filesystem metadata has been reworked to fix some bugs in the first version, namely to add modes to the metadata, handle hard links correctly, and to correctly handle the case of where to put the metadata of the parent directory containing a "metadata" subdirectory. I had a plan for files named "metadata", but forgot to check that it handled directories as well. The fix is to create two metadata directories, one for files and one for directories. In the directory metadata, there can be only one plain file in a given directory. This will normally be called "metadata", but if there is a directory by that name, it will be called something else. Whatever it is called, it can be found by looking for a non-directory.

This required bumping the metadata version and updating humfsify. Since I don't want to increase version numbers very often, I also fixed the mode problem, which is also a version-affecting bug, at the same time.

Humfs seems healthy at this point. I can boot from it and do a kernel build on it. Hostfs was also affected by the externfs restructuring, and got little attention, so it was broken. So, I've been working on it. It is better, but there are still some bugs. It doesn't survive a kernel build yet, but it's doing a lot better than it was yesterday.

I've moved UML development onto my new laptop, which means I'm running UML in tt mode until I put a skas kernel on it. This exposed a problem with the file descriptor management that I wouldn't have seen otherwise. In tt mode, a pipe is used as part of context switching. The outgoing process writes a byte into the pipe of the incoming process and then reads its own pipe. The write will bump the incoming process out of its own read, waking it up, and the outgoing process will sleep in its read until something writes to its pipe.
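The handoff can be modelled with two threads standing in for two processes. This is only an illustration of the write-then-read pattern, not tt-mode code, and the Task class and function names are invented for the example.

```python
import os
import threading


class Task:
    """One schedulable task with its own wake-up pipe."""

    def __init__(self, name):
        self.name = name
        self.read_fd, self.write_fd = os.pipe()

    def close(self):
        os.close(self.read_fd)
        os.close(self.write_fd)


def switch_to(outgoing, incoming):
    os.write(incoming.write_fd, b"x")  # bump the incoming task out of its read
    os.read(outgoing.read_fd, 1)       # sleep until someone switches back


def run_ping_pong(rounds=3):
    """Let two tasks hand control back and forth; return who ran when."""
    a, b = Task("a"), Task("b")
    trace = []

    def body(me, other, starts_running):
        if not starts_running:
            os.read(me.read_fd, 1)      # wait to be switched to the first time
        for _ in range(rounds):
            trace.append(me.name)       # only the running task gets here
            switch_to(me, other)
        os.write(other.write_fd, b"x")  # release the peer so it can finish

    threads = [
        threading.Thread(target=body, args=(a, b, True), daemon=True),
        threading.Thread(target=body, args=(b, a, False), daemon=True),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=5)
    a.close()
    b.close()
    return trace
```

Every task needs a pipe, and so two file descriptors, for as long as it exists. That is why running out of descriptors in the pipe call matters.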

The problem was that the pipe file descriptors weren't under the control of the filehandle code, so it couldn't free up file descriptors if the pipe call failed due to no descriptors being available. This required a bit of surgery on some tt-specific code, which seems to be working well.

30 Jun 2004
I've had precious little free time in the last few weeks, but what there was I spent banging on humfs. It now stores file modes in the metadata rather than leaving them on the original files. humfsify now has a -r option to merge the metadata back into the data. This is how you upgrade from one metadata version to another (or change metadata types) - revert the metadata and build the new metadata from scratch.

The major development, and what sucked up a fair bit of that free time, was a nice offer from Intel, which I accepted. I get to spend most of my time working on UML, so this means that it is no longer a free-time project. We should see UML development speed up noticeably as a result of this.

However, that probably won't happen until the end of July. I'm taking off next week and touristing around Iceland. Then, two weeks later, it's KS and OLS. There is that week in between, but I'm expecting that to be largely spent catching up on what I missed the previous week. So, that means that it'll be the last week of July before I really get cranking on UML again.

I am trying to get a release out before I leave. The last patch is noticeably less stable than the previous ones, so I'd like to get a bunch of bug fixes in.

13 May 2004
After much banging on hostfs and humfs, I got them working well enough to release them. It turns out that there were a bunch of bugs in hostfs conspiring to make it completely synchronous. This is undesirable for something like humfs, so I made it behave like the disk-based filesystems. That exposed a pile of bugs, which I fixed, and now humfs is noticeably faster. I also added a patch from Piotr Neuman which makes it possible to plug new metadata types into humfs. So, it will be able to support all the other metadata formats that people have been suggesting, like tdb and xattrs. So, that made up the bulk of 2.4.24-3, which I released yesterday.

In somewhat older news, there is now an x86-64 port of UML, which was sponsored by PathScale. This is on 2.6 only right now, and it's available as a separate patch to be applied on top of the 2.6.4 UML patch. I'm going to be merging it into my tree bit by bit. Currently, it's not very clean in a bunch of ways, and there will need to be some work to make it cleanly mergeable. So, as this happens, the separate patch will shrink, and when the port is fully merged, it will go away totally.

Now that the humfs work is out, I need to catch up with 2.4 and 2.6. So, first up is 2.4.25, which turns out not to need any UML changes. That will be released today. Then will be 2.4.26, which I think just needs the cmpxchg patch which has been floating around for a while. Then, it's on to 2.6, where I am currently two releases behind.

19 Apr 2004
It turns out that humfs was fairly buggy. I stomped out enough of the bugs that you can now do a kernel build on it. There are also a couple of design problems with the current metadata layout. One was pointed out to me by Paul Wagland; the other I actually figured out on my own.

The first is that permissions are kept on the original file. The problem is that if it is chmod-ed 000, then even the owner can't read it. However, root can, so this would lead to a situation in which root inside the UML couldn't read some files that it should be able to. The obvious solution is to have humfsify chmod all of the files 777, and move the permissions into the metadata file.

The other bug is that the current humfs metadata maintains one file for each name in the filesystem, so for files with multiple links, there is an independent metadata file for each link. This is wrong, since if you change the ownership of a file through one link, the ownerships of the other links should change as well. humfs will currently change the ownership of the one file, but leave the others as they were.

My current plan on fixing this is to designate one metadata file for each file as the primary metadata holder. If the file has more than one link, the other metadata files will be symlinks to the primary. This means that humfsify will need to keep track of what names link to what files so that it knows when it needs to start making symlinks, and where they should point to. This sounds nasty, since the easiest implementation involves keeping track of every file it deals with, but a simple optimization is to just keep track of the files with more than one link. This will be a small minority of the files. Some of them will involve links from outside the hierarchy being copied, and won't result in any metadata symlinks. The other side of this is that the metadata file will need a link count and a deletion flag in it. These are needed because someone might remove the name corresponding to the primary metadata file. That file can't actually be removed because the symlinks count on it, and we don't want to move the data because that would require searching for every affected symlink. So, when the file is deleted, the metadata file will have its link count decremented, and the deletion flag set. Then, it will appear not to exist. It will be really deleted when all of the links to it are gone. Deleting one of the other links will just involve decrementing the link count in the primary, and removing the symlink.
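To check that the bookkeeping hangs together, here is a small in-memory model of the plan. It is not humfs code: the class, the dictionary layout, and the use of a plain string to stand for a symlink to the primary are all assumptions made for illustration, and it ignores the case of a new file reusing the name of a deleted primary.

```python
class HumfsMetadata:
    """Toy model: each name maps to a primary record (a dict) or to the
    name of its primary (a str, standing in for a metadata symlink)."""

    def __init__(self):
        self.entries = {}

    def _primary(self, name):
        entry = self.entries[name]
        return entry if isinstance(entry, str) else name

    def exists(self, name):
        entry = self.entries.get(name)
        if entry is None:
            return False
        # A primary with its deletion flag set appears not to exist.
        return isinstance(entry, str) or not entry["deleted"]

    def create(self, name, owner):
        self.entries[name] = {"owner": owner, "links": 1, "deleted": False}

    def link(self, existing, new):
        if not self.exists(existing):
            raise FileNotFoundError(existing)
        primary = self._primary(existing)
        self.entries[primary]["links"] += 1
        self.entries[new] = primary

    def owner(self, name):
        if not self.exists(name):
            raise FileNotFoundError(name)
        return self.entries[self._primary(name)]["owner"]

    def chown(self, name, owner):
        if not self.exists(name):
            raise FileNotFoundError(name)
        # One record per file, so every link sees the change.
        self.entries[self._primary(name)]["owner"] = owner

    def unlink(self, name):
        if not self.exists(name):
            raise FileNotFoundError(name)
        primary = self._primary(name)
        record = self.entries[primary]
        record["links"] -= 1
        if name == primary:
            record["deleted"] = True   # hide it, but keep it for the symlinks
        else:
            del self.entries[name]     # just remove the symlink
        if record["links"] == 0:
            del self.entries[primary]  # really deleted once all links are gone
```

A chown through any name changes the one shared record, and unlinking the primary name only hides it until the last symlink goes away.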

Also in the not-too-distant future is O_DIRECT and mmap support for humfs. Both will eliminate the double-caching problem that prompted this filesystem in the first place. O_DIRECT is useful when the data isn't shared with any other UMLs, which will be the case until humfs has COW support. It reads data from the disk directly into the buffer provided by the process without going into the kernel's page cache. The other attraction of O_DIRECT is that it is supported by the current 2.6 AIO. This means that UML can have many I/O requests in flight on a 2.6 host using O_DIRECT, which should help its I/O performance.

mmap support will do the same for shared data. In this case, the data is in the host's page cache, and it is mapped directly into the UML address spaces. This will need COW support in humfs before it becomes useful. It will also need AIO support for normal, cached, I/O before UML's use of AIO can come into play.

I'm planning on releasing a new UML patch in the near future. It will have a fixed humfs, plus maybe a restructuring on humfs to allow multiple forms of metadata to be supported. There will also be a utilities release for a rewritten version of humfsify.

7 Apr 2004
It's been a while since the last entry. Since then, 2.6.4 came out, and I released a 2.6.4 UML. I also released 2.4.24-2 today, which contains some major changes.

I became convinced (by Al Viro) that ubd-mmap wasn't viable. It's vulnerable to the filesystem making changes to data that it intends never to reach the disk. With buffers mmapped from the host, those changes automatically reach the underlying device. He suggested a filesystem instead, which gives me the control that I need in order to do mmap correctly.

So, humfs is out. humfs stands (roughly) for "host uid mapping filesystem". This is the main difference between it and hostfs. The main problem with hostfs is that any files created are owned by the uid that is running UML. This is a problem when there are multiple users inside the UML creating files. The files want to preserve their ownerships, but can't.

humfs separates the ownerships from the actual file. A humfs mount has a root with two subdirectories, "data" and "metadata". "data" contains the actual files. "metadata" mirrors "data", except that the files under it contain just ownership information as their contents. humfs file accesses cause the file contents to be retrieved from under the "data" directory, and ownerships to be retrieved from under "metadata".
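As a sketch of that split, the lookup below maps a path inside the mount onto the two host-side files. The function names are invented, and the assumption that a metadata file holds "uid gid" as two whitespace-separated integers is mine. The real on-disk format is not described here.

```python
import os


def humfs_paths(mount_root, path):
    """Return (data_path, metadata_path) on the host for a path in the mount."""
    rel = path.lstrip("/")
    return (os.path.join(mount_root, "data", rel),
            os.path.join(mount_root, "metadata", rel))


def humfs_stat(mount_root, path):
    """Combine the size from the data file with ownership from the metadata.

    Assumes the metadata file contains "uid gid" (hypothetical format).
    """
    data_path, metadata_path = humfs_paths(mount_root, path)
    with open(metadata_path) as f:
        uid, gid = (int(field) for field in f.read().split())
    return {"size": os.path.getsize(data_path), "uid": uid, "gid": gid}
```

The data file on the host can be owned by whoever runs UML, since the ownership seen inside the UML comes only from the metadata side.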

This has a number of beneficial side-effects:

  • As already mentioned, mmap can be done correctly. This will allow files to be mmapped from the host rather than being copied, allowing UMLs to share the host's page cache. In the tests I've done with ubd-mmap, this dramatically reduces the host memory usage of a UML.
  • The UML's files are visible on the host. This makes management easier, since passwords can be more easily reset and the like. There are some privacy concerns here in hosting environments. Some people don't like it to be too easy for the host admin to poke around their files. So, this will have to be balanced against the management conveniences.
  • Space is more flexibly assigned. There is a "superblock" file at the root which says how much space this mount is allowed to have. It's read at mount time currently. I think that with dnotify, it will be possible for humfs to notice changes to it and react immediately. So, on-the-fly disk allocation changes will be possible.
  • File-level COWing will be possible. This will allow UMLs to share filesystems, with modified files being copied and made private. A special case of this is booting from the host's root filesystem, with a tiny COW hierarchy containing the files which need to be different.
  • Block-level COWing within a file will also be possible. Some people need block devices underneath their UML filesystems, and here, they will want the usual block-level COWing. I think it will be possible to do this while retaining the memory usage advantages of mmap by having the disk image in a humfs filesystem and loop-mounting that as the root filesystem within the UML. I was concerned about the lack of partitioning support in the loopback device, but I recently saw a patch on LKML which fixes that.
  • It will even be possible for UMLs to share a writeable filesystem, with communication allowing one UML to change a file and cause the others to invalidate their cached copies of it. I'm planning on figuring out how to make it possible to quickly bring up UMLs in response to a load spike of some sort, like a slashdotting on a web server. Coherent filesystem sharing could play a role here. With the filesystem on a SAN of some sort, the new UMLs could serve the data from a common source, and write things, like log files to a common place, eliminating the need to collect them from the UMLs as they are shut down when the spike passes.
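To make the file-level COWing idea a bit more concrete, here is a rough Python sketch. It assumes a private COW directory layered over a shared, read-only base directory. The names cow_resolve and cow_prepare_write are hypothetical, and none of this exists in humfs yet.

```python
import os
import shutil
import tempfile

def cow_resolve(cow_dir, base_dir, relpath):
    """Path to read from: the private copy if there is one, else the shared file."""
    private = os.path.join(cow_dir, relpath)
    if os.path.exists(private):
        return private
    return os.path.join(base_dir, relpath)

def cow_prepare_write(cow_dir, base_dir, relpath):
    """Make sure a private copy exists before a write, and return its path."""
    private = os.path.join(cow_dir, relpath)
    if not os.path.exists(private):
        os.makedirs(os.path.dirname(private), exist_ok=True)
        shared = os.path.join(base_dir, relpath)
        if os.path.exists(shared):
            # First write to a shared file: copy it, then modify the copy.
            shutil.copy2(shared, private)
    return private

def demo():
    """Read a shared file, modify it, and check that the read now goes private."""
    base, cow = tempfile.mkdtemp(), tempfile.mkdtemp()
    os.makedirs(os.path.join(base, "etc"))
    with open(os.path.join(base, "etc", "hostname"), "w") as f:
        f.write("shared\n")
    before = cow_resolve(cow, base, "etc/hostname")
    with open(cow_prepare_write(cow, base, "etc/hostname"), "w") as f:
        f.write("private\n")
    after = cow_resolve(cow, base, "etc/hostname")
    return before.startswith(base), after.startswith(cow)
```

The shared file is never written, which is what would let several UMLs boot from the same base tree.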

As a side-effect of this, I put an abstract interface between hostfs_kern and hostfs_user which allows other userspace modules to be plugged into hostfs_kern. humfs is obviously the first new user of this. Now that there's a pluggable interface, it's not too hard to make other host resources look like UML filesystems. Some examples:

  • sqlfs - mount a SQL database as a UML filesystem. There are lots of choices for mapping part of the database onto Linux directories and files. Tables could be top-level directories, rows could be subdirectories, and columns could be files within those subdirectories. Or rows could be files. Or both, I think, depending on whether you cat the row or cd into it. Plus, there'd be a /query directory in there somewhere which would let you treat a query as a directory, and the search result would appear under it. Consider
    cd /sql/query/'select * from people where first-name = "Bob"'
    Obviously, any other sort of database could get the same treatment. Other people have mentioned ldap, for example. An interesting variation on this would be to dump a Linux filesystem into a database, and boot from it. This would give you the standard Linux file semantics, but would also let you do database-specific searches on the filesystem. For example,
    cat /sql/"select filename from * where uid = 0 and setuid = 1"
    A possible use of this would be for an application which needs a filesystem to store stuff, but whose needs are otherwise badly served by any existing filesystems. Maybe it needs a few huge files, or a huge number of tiny ones, or huge directories, and current filesystems don't supply the searching that's needed. You'd put those files in a database, mount the database inside UML as a filesystem, and provide a query interface to directly do searches on the database.
  • difffs - mount a directory diff as a filesystem. The two host directories to be diffed would be arguments to the mount command. The files in the resulting filesystem would be those that were different, and the contents would be the diff of the file.
  • googlefs - mount Google as a filesystem.
    cd /google/"user-mode linux"
    could produce a directory in which Google's results for the search "user-mode linux" are somehow represented.
Some of these examples are far-fetched, but I mention them to show the range of possibilities there are with this. I'm particularly interested in the filesystem-in-a-database. I imagine that there are new things that you can do when you can put an arbitrary database underneath your filesystem, and be able to query the database in whatever way it allows.
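One of the sqlfs mappings mentioned above (tables as top-level directories, rows as subdirectories, columns as files) can be sketched against SQLite. This is a toy under stated assumptions: sqlfs does not exist, the function names are invented, rows are identified by SQLite's rowid, and the demo table uses first_name rather than first-name to keep the column name simple.

```python
import sqlite3

def _tables(conn):
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    return [r[0] for r in rows]

def _columns(conn, table):
    # The cursor description lists the column names even when no rows come back.
    cursor = conn.execute(f'SELECT * FROM "{table}" LIMIT 0')
    return [d[0] for d in cursor.description]

def sqlfs_listdir(conn, path):
    """List /, /<table>, or /<table>/<rowid> as if they were directories."""
    parts = [p for p in path.strip("/").split("/") if p]
    if not parts:
        return _tables(conn)
    table = parts[0]
    if table not in _tables(conn):
        raise FileNotFoundError(path)
    if len(parts) == 1:
        return [str(r[0]) for r in conn.execute(f'SELECT rowid FROM "{table}"')]
    return _columns(conn, table)

def sqlfs_read(conn, path):
    """Return the 'file contents' of /<table>/<rowid>/<column>."""
    table, rowid, column = path.strip("/").split("/")
    # Table and column names cannot be bound as SQL parameters, so they are
    # checked against the schema before being placed in the statement.
    if table not in _tables(conn) or column not in _columns(conn, table):
        raise FileNotFoundError(path)
    row = conn.execute(
        f'SELECT "{column}" FROM "{table}" WHERE rowid = ?', (int(rowid),)
    ).fetchone()
    if row is None:
        raise FileNotFoundError(path)
    return str(row[0])

def make_demo_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (first_name TEXT, last_name TEXT)")
    conn.execute("INSERT INTO people VALUES ('Bob', 'Smith')")
    return conn
```

With this mapping, reading /people/1/first_name on the demo database plays the role of cat on a column file.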

I also added aio support to the os interface. The existing aio was in the ubd driver, and consisted of a separate thread which did synchronous IO, one request at a time. This is present in the new aio. What is new is support for the 2.6 aio interfaces. This allows any number of IO requests to be started, and for one thread to handle their completions. When this is debugged, this should help UML's IO performance.

With mmap IO in sight, the next step will be to introduce the active UML memory management that I've been working towards for a while. In conjunction with the /dev/anon patch, mmap support will allow pages to be freed from a UML to the host, and conversely, be plugged into a UML to increase its available memory. With a daemon on the host monitoring the memory usage of the UMLs and the host, it will be possible to use memory more intensively by giving it to UMLs that need it and taking it away from those that don't. This should increase the density of UML servers since memory is often the bottleneck.

15 Feb 2004
As of today, I'm caught up on both the 2.4 and 2.6 fronts. I released the 2.4.24 patch today. It was trivial, as the 2.4.24 patch was trivial - just apply the patch and rebuild. I also had a bunch of accumulated bug fixes, including some time problems that I fixed in the last few days. These are the ones which people have noticed the most. They included
  • process start times as shown by ps drifting away from clock time
  • /proc showing a modification time of 1970
  • at reportedly not working
The other user-visible fixes were a couple of mconsole bugs.

I also caught up with 2.6 - more completely than I had intended. I pulled his tree, not noticing that it contained 2.6.3-rc2 rather than just 2.6.2. So, I ended up updating UML to that rather than trying to back down.

I've also been busy on usermodelinux.org . There were a number of things from UML users that the wider community needs to know about, from new filesystems and installation HOWTOs, to an automated UML network setup tool, to a neat Knoppix image which lets you download and boot UML from a web browser. I also added some new FAQ entries and UML hosting providers. Hopefully the FAQ entries will cut down on duplicate list traffic (and frustrated users who just quit rather than ask a list).

16 Jan 2004
I forgot to mention that I also released my /dev/anon patch. This is a special device that UML can use to map in its physical memory which has the semantics needed to free memory back to the host when UML isn't using it any more. In some quick testing, it reduced UML memory usage on the host by about a quarter. It's used in conjunction with ubd-mmap, but it's also usable without it. ubd-mmap still has some bugs that need rooting out, including at least one file corruption bug, so it shouldn't be used for production yet. See this page for more details.

15 Jan 2004
I was somewhat behind in getting 2.6.0 working and released. But, that's done now, and I also caught up with 2.6.1 reasonably quickly. 2.6.1 was causing process segfaults with the 2.6.0 patch, and the reason was somewhat interesting. I had a couple of inline functions which looked at the frame pointer and stack pointer. In order for these to work correctly, they really needed to be inlined because they needed the frame and stack pointers of the caller. Something changed in how gcc is invoked which caused it to stop inlining these functions. So, the fix was to turn them into macros, which gives the compiler no choice about inlining them. This may also explain some of the weird behavior other people have been seeing, and which I've been attributing to the lack of get_thread_area and set_thread_area.

Andrew included my 2.6.1 patch in 2.6.1-mm4. Hopefully it will then make it into 2.6.2. There will still need to be a separate UML patch, but it will be much smaller. There are some difficulties in initcall ordering between the UML console and serial drivers, and the generic tty driver which require some kludging. So, at least that will need to be separate for a while until something cleaner comes up.

I also made the first utilities release since the fall. A number of small changes and fixes had accumulated. There were two sizable changes - bridging support in the jail.pl script and support for dumping tty log output into a SQL database rather than to the terminal. These both stem from my UML honeypot work at Dartmouth.

15 Dec 2003
Richard announced his implementation of swsusp for UML a couple of days ago. Reaction is less than I expected, but still positive, as I expected. It'll be more widely used once it is updated to the latest UML and no longer requires a 200M download.

I've been hacking away on /dev/anon recently. This is a new driver that supports mmap/munmap, and which releases memory that is no longer mapped. The purpose is to make ubd-mmap useful by freeing UML physical memory which has been over-mapped with pages from a disk. With a simple test of booting my Debian testing image to a login prompt, this consumes about 25% less memory than using the usual /tmp file. Accordingly, I can boot about 25% more UMLs before the host starts swapping. When I release it for general consumption, you can all thank memset for the reduced memory usage (and swapping) on your hosts.

I'm also releasing 2.4.22-7. This is a big patch, with a bunch of new stuff in it. The big items are partial support for skas SMP and highmem, and Sapan's real-time clock patch. There are lots of smaller changes including bug fixes and a bunch of code cleanup and restructuring.

4 Dec 2003
I spent a week in Japan, courtesy of Richard Potter at the University of Tokyo. He and others in Japan are doing some interesting things with UML and virtual machines. At dinner on the first night in Japan, I discovered that Richard had implemented swsusp for UML. This is obviously a welcome surprise, given that swsusp is probably the most asked-for feature for UML. The downside is that he is working with an old UML (2.4.18). Updating it to the latest shouldn't be too hard, though.

He also had some other interesting ideas, including the idea of a UML-wide fork, which would be a cloning of the entire state of a UML into a new UML, with shared resources being COW-ed as necessary. This is exactly analogous to a process fork, except you get a whole new UML rather than a new process. I'm not sure what I would use this for if I had it, but I think it's worth doing, just so the rest of the world can figure out how to use it.

I also visited the University of Tsukuba to give a talk to the group led by Prof. Kato, and to listen to a set of presentations on their work. My talk was a stripped down version of my UMich talk. Their work included

  • a "Software Pot", which is a controlled environment for executing arbitrary, untrusted applications with the ability to import resources from the host
  • a very minimal virtual machine environment which can boot slightly modified Linux and *BSD kernels
  • A Knoppix environment which loaded off the network, rather than a CD, and booted inside a UML.
I repeated this talk the next day at the University of Tokyo (and a number of the Tsukuba people came to hear it again), plus I gave a presentation to the Yokohama LUG.

10 Nov 2003
After much delay and items accumulating in my todo list, I released 2.4.22-6 today. This contains a large number of fixes and cleanups, mostly sent in by users over the last few months. One notable bug which I think is fixed is the "Process nnnnn exited with signal 11" that Oleg has been seeing. It was a longstanding, stupid bug, and I'm amazed it hadn't been seen sooner.
6 Oct 2003
Last week, I visited the University of Michigan at the invitation of Peter Chen, who has a research group doing interesting things with virtual machines. The one thing that I'm particularly interested in is a tool called ReVirt, which can log everything happening in a VM and replay the log so that the replay causes exactly the same instruction stream to be executed. This would make it possible for someone with a bug that I can't reproduce to send me the ReVirt log, and I would replay it (over and over if necessary) until I tracked down the problem.

There was another tool whose name I forgot (written by Sam King) whose job it is to analyze these logs. The work of this group is focussed on security, so ReVirt is used to record attacks and exploits, while the log analyzer pulls out the essential details of those attacks and generates a nice picture of them. The analyzer works by defining a small set of basic objects, such as files and processes, and a set of actions by which they can affect each other (a process creating a file, or a process being created by execing a file, for example), and taking the endpoint of an attack (such as a running backdoor process, or a modified passwd file), and backtracking through the set of events and objects involved in creating that endpoint until it identifies how the exploit entered the system. Then it generates a little diagram which serves as a picture of the exploit.

The use of ReVirt in UML development is obvious. I'm wondering whether the analyzer could be repurposed with a different set of basic objects and actions in order to analyze kernel bugs. For example, when chasing a deadlock, we could make locks be the basic object and lock and unlock be the basic actions, and get a diagram showing where the lock in question was taken and released. Hopefully, there would be a glaring mismatch which would identify the bug at a glance.

These two tools were originally done using FAUMachine (formerly UMLinux). At the time, there were some reasons that FAUMachine was preferable to UML (UML's use of helper threads complicated things) but the group is now porting them to UML. UML is now seen as preferable because of its greater stability and user base. FAUMachine's goals are apparently less of a good match to their research than UML's goals. Plus, the threads issues need to be dealt with at some point anyway.

I also spent a day at CITI, whose research doesn't directly involve virtual machines. They are into research and development which can benefit from using them as their development platform. At this point, they are heavily into NFS V4 development. They are producing the reference implementation at the behest of Sun and Network Appliances. They are also doing research on the replication and load balancing aspects of the NFS V4 protocol. In some of these projects, a virtual UML network would make an ideal platform, since it would eliminate a lot of logistics in setting up and running a physical test network.

18 Sep 2003
I released 2.4.22-5 today. I've been doing lots of code cleanup lately. That was pretty much all of 2.4.22-4 since Steve Schmidtke sent me a large cleanup patch. A lot of -5 is dealing with the aftermath of that. There were a few bugs in Steve's original patch, plus I merged a couple of chunks badly. In addition, there were some more fixes from him and BlaisorBlade merged. I also fixed a tt mode bug which caused signals to be disabled in userspace.
11 Sep 2003
HP sent me a nice IA64 machine, which showed up yesterday. Dual processor, 10G memory, 70G disk. It's a very fast machine, at least compared to my existing hardware, which is all ~3 years old, so I don't have a feel for how fast PCs are currently. I spent the afternoon turning it into my main workstation. The current holdup is that pppd doesn't work. It gets EFAULT whenever it does the PPP get-unit-number ioctl.

I guess I have no excuse for not at least thinking about porting UML to IA64. I have little free time, but I'll do the port a bit at a time in my spare moments.

8 Sep 2003
I released 2.4.22-2 today. It contains a bunch of bug fixes and a new mconsole command. New to mconsole is the 'proc' command. This was triggered by a patch I got from Steve Benson which implemented 'mem' and 'load' commands, which sent back something that resembled the contents of the UML's /proc/meminfo and /proc/loadavg. The patch implemented them by hand, which I didn't like, plus I think it is likely that it would have triggered requests for other specific stuff from /proc. So, I just added a general 'proc' command which will read from any file in the UML's /proc.

I fixed a few nasty-looking bugs such as a crash when running a UML that had been linked against a libc with the new thread support and a crash caused by a process unexpectedly segfaulting. It also turns out that ltrace never worked in skas mode. I fixed all of these, plus a couple more minor things.

3 Sep 2003
I've made a bunch of UML releases in the last few days. I added mmap support to the ubd driver, which allows it to use mmap instead of read and write. This trades data copying for TLB operations, gaining some CPU and cache improvements from not having to copy all data coming in from disk in exchange for the added expense of TLB flushing. An upcoming benefit, and the real reason for this, is that this will enable the UML and host to share page cache rather than having separate copies of the data read in by the UML, reducing the host memory usage by having the memory overlaid by mmap freed on the host. I realized fairly recently that the host mechanisms that can be used to implement UML physical memory won't free memory when it's unmapped. The reason is that the memory will be considered dirty by the host, even though the UML has no use for whatever is in it, so it will be kept, and possibly swapped. What I need, and what isn't there yet, is something that will just free the memory when UML unmaps it. So, I'm going to be hacking on a little host driver specifically to implement UML memory.

This required a rework of the UML low-level VM layer, including the ability for physical memory to fault and for those faults to be handled. This is something needed by a clustering scheme for UML that I thought up a long time ago and never did anything about. That is taking an SMP UML and spreading the virtual processors and physical memory over multiple hosts. Any physical page would be resident on only one host and accessible only by the processor(s) running on that host. If another processor tried to access it, the access would fault, and a low-level handler would figure out where that page was located and arrange for the contents to be copied and the page to be mapped on the new host and unmapped on the old one. This would fairly easily implement an SSI cluster with UML. The downside is that it would be horrendously slow because of all the faulting. The attraction of it to me is that, even in its slow form, it would be a neat capability and fun to play with. Further, since this starts as a fully-functional SSI cluster, it would be fairly easy to start working on making the inter-node communication more sane. It would be done incrementally, with a functioning (and hopefully faster) cluster at each stage. It wouldn't be necessary to implement (say) 90% of it before anything works at all.

I also implemented the COW V3 format, which fixes a number of problems that had cropped up with V2. The most painful one was the rounding bug which has been killing UMLs for some time. David has been maintaining a patch for it, but it apparently is not a 100% fix. The V3 fix should be. The various sections of the COW file are now nicely aligned. This will allow COW files to be stored on devices with restrictive alignments, such as /dev/raw devices. ubd mmap also requires this because it needs the data to be page aligned.
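The alignment idea is just rounding each section's starting offset up to a boundary. Here is a small Python sketch of that arithmetic. The layout (a header, then a one-bit-per-block bitmap, then the data), the function names, and the 4096-byte default are assumptions for illustration, not the actual V3 on-disk definition.

```python
def round_up(value, alignment):
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment

def cow_layout(header_len, n_blocks, alignment=4096):
    """Return (bitmap_offset, bitmap_len, data_offset) for an aligned file.

    Hypothetical layout: header, then a bitmap with one bit per block,
    then the data, with each section starting on an alignment boundary.
    """
    bitmap_len = (n_blocks + 7) // 8
    bitmap_offset = round_up(header_len, alignment)
    data_offset = round_up(bitmap_offset + bitmap_len, alignment)
    return bitmap_offset, bitmap_len, data_offset
```

For a 100-byte header and 10000 blocks, this puts the bitmap at offset 4096 and the data at offset 8192, so both sections start on a page boundary.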

After this, I decided to catch up to Marcelo, and updated to 2.4.21 and 2.4.22. These were both easy. I also included a couple of small bug fixes in the 2.4.22 release.

17 Aug 2003
I spent the afternoon yesterday tracking down the task_struct leak in 2.6. Oleg had tracked the leak down to where it was happening by identifying the put_task_struct call that should have been called but wasn't. He got that right, but his analysis of the leak was wrong. So, I finally tracked it down to what appears to be a bug in the scheduler combined with my low-level switch_to() being slightly different from the i386 switch_to. context_switch calls the arch switch_to: switch_to(prev, next, prev); return prev; and this, contrary to appearances, is supposed to reassign prev, which then gets returned to schedule(). In the i386 scheduler, I don't see prev getting reassigned. It does branch to __switch_to, which returns the correct value for prev. This is left in %eax, and accidentally becomes the return value of context_switch because that's the last thing it does, so nothing munges %eax afterwards.

So, the UML bug was that my switch_to wasn't assigning prev. Once this was fixed, the task_struct leak disappeared.

I had a couple of brainstorms yesterday, too. I was looking over a patch that someone sent me which made load and memory statistics available through mconsole. I don't like it because it duplicates a bunch of code from elsewhere in the kernel into UML. It occurred to me that it would be easy to make all of /proc available through mconsole in a much cleaner way.

The other thought relates to UML's low-level memory management. It's reasonably easy to pull memory out of a UML. It can just allocate pages, and do nothing with them. Those pages can be added back later just by freeing them, making them available to the rest of the UML. So, a UML's host memory usage can be controlled by having it pull pages out of service and freeing them back to the host. This would make a good way of taking pages from an idle UML and giving them to a UML that's busy.

So, combining these two ideas, we have a daemon on the host which is monitoring the host memory usage and the memory usage of the UMLs via the mconsole /proc interface. When the host starts swapping, this daemon can figure out what UMLs are occupying memory, but not using it, and which UMLs need memory. It can then take memory from idle UMLs by telling them to allocate it and free it to the host, and give it to others by allowing them to take back memory that they had previously released or by giving them extra memory.
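The decision such a daemon would make can be sketched as a small planning function. Everything here is hypothetical: the function name, the per-UML (assigned, used) figures in megabytes, the reserve that each UML keeps, and the greedy policy are illustrative choices, and the sketch says nothing about how the memory would actually be moved.

```python
def plan_transfers(umls, reserve_mb=16):
    """Plan moves of memory from idle UMLs to busy ones.

    umls maps a UML name to (assigned_mb, used_mb). A UML with more than
    reserve_mb of unused memory is a donor; one with less is needy.
    Returns a list of (donor, receiver, mb) tuples.
    """
    spare = {name: assigned - used - reserve_mb
             for name, (assigned, used) in umls.items()}
    # Largest donors and largest needs are handled first.
    donors = sorted(([mb, name] for name, mb in spare.items() if mb > 0),
                    reverse=True)
    needy = sorted(((-mb, name) for name, mb in spare.items() if mb < 0),
                   reverse=True)
    transfers = []
    for need, receiver in needy:
        for donor in donors:
            take = min(need, donor[0])
            if take > 0:
                transfers.append((donor[1], receiver, take))
                donor[0] -= take
                need -= take
    return transfers
```

With an idle UML at (128, 32) and a busy one at (64, 60), the plan is a single move of 12 MB from the idle UML to the busy one.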

However, freeing memory back to the host requires that the memory be released from the file that it is mmapped from. In effect, this means being able to create a hole in the file, and there are currently no mechanisms for doing this. There have been discussions in the past about sys_punch or sys_fclear, since there are multiple uses for this, but it has never been done. One major reason is that this resembles truncate, except that this doesn't change the length of the file, and truncate is famous for the number of complicated and subtle races that it's involved with. Al Viro has given sys_punch a chilly reception for this reason. So, until there's a way of releasing file-backed memory back to the system, this plan is going to stay unimplementable.

15 Aug 2003
I'm doing a somewhat better job at keeping up with Linus these days. I have test3 working, and I'm in the process of updating BK pools and generating the patch. I merged in all outstanding changes from my 2.4 tree while I was at it.

I released another 2.4 patch with my accumulated changes. These were mostly small bug fixes. I'm continuing to bang away at my 2.4 tree, fixing some compilation warnings today.

On the conference front, I submitted an abstract to Linux-Kongress. It's been a couple of years since I was there, and a lot has happened to UML in the meantime. So, the abstract describes the major things that have happened to UML in the last year or two, plus a few of the things I'm planning for its future.

29 Jul 2003
I'm caught up with Linus now. I released the 2.6.0-test1 UML from OLS last week. I jumped ahead instead of going release by release on the advice of Oleg Drokin, who had already done it and found no problems. The patch is broken, unfortunately. I tested from my BK repo, and generated the patch from it, and didn't bother testing the patch. It turns out I hadn't fully checked all the changes back in. I'm going to release 2.6.0-test2 today with a working (and tested) patch.

I was at the Kernel Summit and OLS in Ottawa last week. A good time was had by all, and all were pretty tired out by the end. The most interesting aspect of it from my humble point of view was the interest in UML. There was a panel of people from large companies who talked about what large businesses wanted from Linux. Two of them (HP and Merrill Lynch) mentioned UML.

Bdale Garbee (the HP guy) was seeing demand within HP for UML on IA64. I talked to him later, and he said that there was some possibility that they would do UML/IA64.

Robert Lefkowitz (Merrill) mentioned two things about UML. The first is that IT infrastructures in the financial industry are somewhat fragmented due to "Chinese walls" between divisions that are required by the SEC. Large financial companies have conflicts of interest due to the number of different things that they do. One division might be doing business with a company, and want to keep that company happy as a customer. Another division might be making stock recommendations to investors and might be telling them that this company sucks and to sell its stock. To keep one division from influencing the other to the detriment of its customers, the SEC requires that the divisions be structured in such a way that they don't communicate with each other too much. These structures are called Chinese walls, and they apparently extend to the IT infrastructure. You can't host functions belonging to different divisions on the same systems. Except that virtual machines on the same host are OK, which is where UML comes in. So, having multiple functions on the same host is legally OK, as long as they are separated by being in different virtual machines.

The other thing he mentioned was an attempt to package a VPN client for their employees inside UML for packaging and support convenience. The attraction of UML is that it is a single known environment, in contrast to the multitudes of OS versions run by the employees. Reducing that to one environment makes installation and support much easier. He complained that UML didn't boot off readonly media, and said that he had told the UML maintainers about it. At this, I stuck my little hand in the air and said, "Uhhh, I don't remember anything about this". Talking to him later, he said he thought the person who had done the complaining did it in such a way that it didn't appear to be a complaint, just a "does this work?" sort of a question. My opinion, and that of others, is that this person complained to Red Hat.

Moving on to OLS, there was Werner Almesberger's umlsim talk, in which he described a network simulator he built on top of UML. Russell Coker was planning on letting people log in to SELinux UMLs during his tutorial, but that didn't come off because of a non-UML-related catastrophe he suffered beforehand. The clusterfs person talked about how they're using UML for debugging. The after-dinner speaker at the IBM dinner talked about grid computing and mentioned UML as a possible technology that it could be based on. So, even though there was no official UML content at either KS or OLS, there were plenty of people talking about it.

17 Jul 2003
After much delay, and a bunch of mangled BK repositories, I decided to get my act together with 2.5 again. I somehow ended up with all my repos containing essentially the same stuff, with minor variations. So, I merged them into a single tree, produced a diff, split it out, and applied the pieces by hand to the different repos. So, my BK situation is sane again, and I've updated UML to 2.5.70. I'll be catching up to Linus again, and when I do, I'll see if I can get him to take UML updates again.
22 May 2003
I released the 2.4.20-5 patch today. There's not too much in it. I tracked down a memory leak which would eventually exhaust the /tmp filesystem if some skas UMLs were rebooted enough times. This was due to some /proc/mm descriptors not being closed across exec, causing them to continue to hold down mmapped disk space. I also added chroot and append mode options to hostfs. These are helpful for making hostfs somewhat more secure. The chroot option lets you confine hostfs mounts to a specified directory tree on the host and the append option disallows any destruction of data, whether it be truncating or deleting files.
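The intent of the two hostfs options can be illustrated with a pair of checks. This is a sketch of the idea only, not how hostfs implements it: the function names, the operation names, and the use of realpath-based confinement are all assumptions.

```python
import os

def confine(root, path):
    """Resolve path under root, refusing anything that escapes root."""
    root = os.path.realpath(root)
    full = os.path.realpath(os.path.join(root, path.lstrip("/")))
    # After resolving ".." and symlinks, the result must still be under root.
    if os.path.commonpath([root, full]) != root:
        raise PermissionError(f"{path} escapes {root}")
    return full

# Hypothetical names for the operations that destroy data.
DESTRUCTIVE = {"truncate", "unlink"}

def allowed(operation, append_only):
    """In append mode, refuse truncation and deletion; allow everything else."""
    return not (append_only and operation in DESTRUCTIVE)
```

A request for "../../etc/passwd" relative to the confined directory raises PermissionError, and with append mode on, allowed("unlink", True) is False while allowed("write", True) is True.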

In 2.5 news, I released the 2.5.69 patch a few days ago, and asked Linus to pull the changes. He hasn't pulled UML in ages, so I'm not expecting much from this. At least, it makes it clear that the old crufty UML in the official 2.5 tree isn't my fault since I've been keeping up with 2.5 and releasing patches on a regular basis.

12 May 2003
I just got back from Columbia (the one in South America, not the one in New York), where I was attending the II Congreso Internacional De Software Libre Colombia in Manizales. I was invited there to give a couple of talks. The first, on the afternoon of the second day, was on UML, and I talked generally about UML. It was a high-level overview of UML and didn't get into the code at all. The second talk was at 8 AM on the last day, and was about kernel hacking in general. I gave an overview of how kernel hacking works, how to participate in it, and a quick tour of the kernel source tree, with some recommended reading in the code.

The second talk went better than the first. We (me and Gustavo, who was translating for me) translated most of the first talk's slides into Spanish. This was a mistake. It took time away from actually working on the talk's substance. My Spanish is weak enough that I looked at a couple of the slides and couldn't tell what they said, so I got lost in my own talk a couple of times. The slides for the kernel hacking talk the next morning were entirely in English. This gave me more time to work on it, plus I was more comfortable during the talk because I could understand the slides.

My talks aside, the conference was a great deal of fun. The organizers (who included the Dean of the Engineering Faculty of the Universidad de Manizales, where it was held) were amazingly concerned about the comfort and happiness of the international speakers.

Among the other speakers was a Colombian Congressman who apparently was a member of the terrorist group M-19 that eventually made a deal with the government and became a legitimate political party. One of the others was a Peruvian Congressman by the name of Edgar Villanueva. You might recognize him as the author of a devastating rebuttal to some Microsoft FUD that was circulating around the Peruvian Congress last year.

Here are some pictures

  • Speakers and organizers in the President's office - actually, his official title is "Rector". The Rector and Villanueva are to my left and right, respectively.
  • The main room - the university used to be a convent, and it's obvious in this picture. This room was obviously a church at one point. I figure it held ~750 people, and it was full for the main sessions. It was slightly less full for my 8 AM, final day talk on kernel hacking... Go figure.
  • Me and Gustavo - Gustavo is an English Professor at the university who was acting as my translator. We are in the little lab set aside for decompression, talk preparation, and network access. Probably we are working on translating the slides for my first talk.
  • My first slide - Anyone who has attended any of my talks will notice some familiarity here, except for the different language
  • Me and Gustavo on my second talk - We sat next to my laptop and projector on the floor rather than on the stage. So we are here, each with a microphone, doing the talk.

The conference was a lot more political than I'm used to. I usually go to technical conferences with primarily technical content. This one had a number of talks which seemed to be concerned with the politics of Open Source, politics of getting Open Source software accepted in government, business, academia, etc. As far as I can remember, mine were the most technical talks, and may have been the only technical ones. At the end, everyone seemed to consider that it had been a great success.

24 Apr 2003
I got around to releasing 2.4.20-4 today. There were a good number of accumulated fixes, including the RH9 fix, a couple of file timestamp bugs, and cleanup of multi-line strings, which new gccs were complaining about.

I also added exec logging to the tty logging facility. It turns out that an intruder to a honeypot can arrange to run commands without anything ever allocating a terminal. This makes those commands invisible to tty logging.

The 2.5.67 UML was released a few days after Linus released his 2.5.67. 2.5.68 is now out, and I'll be dealing with that next week.

27 Mar 2003
After building gcc 3.2.2, it turns out that the 2.5.65 UML works OK. Then it turned out that gdb couldn't read the new object files, so I built a newer gdb, and everything started being fine again.

With that fixed, I pulled 2.5.66 and started looking at it. This looks reasonably straightforward. The only tricky bit is that file offsets are now stored in ptes. This requires that the offset bits be arranged around the reserved bits. UML has complete control over the pte format, so all the reserved bits are at the low end of the pte, and the offset gets stuck in the upper end.
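
As a rough illustration of packing a file offset above the reserved bits of a pte, here is a minimal sketch. The number of reserved bits, the flag value, and the 32-bit pte type are assumptions made for the example, not UML's actual layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: the low PTE_RESERVED_BITS bits hold flags, and
 * everything above them is free to hold a file offset. */
#define PTE_RESERVED_BITS 4u
#define PTE_FILE_FLAG     0x2u   /* hypothetical "this pte holds an offset" bit */

typedef uint32_t pte_t;

/* Pack a page offset into the upper bits, leaving the low bits for flags. */
pte_t pgoff_to_pte(uint32_t pgoff)
{
    /* The offset must fit in the bits left over after the reserved ones. */
    assert(pgoff < (1u << (32 - PTE_RESERVED_BITS)));
    return (pgoff << PTE_RESERVED_BITS) | PTE_FILE_FLAG;
}

/* Recover the page offset by discarding the reserved low bits. */
uint32_t pte_to_pgoff(pte_t pte)
{
    return pte >> PTE_RESERVED_BITS;
}

/* A pte holds a file offset when the (hypothetical) file flag is set. */
int pte_is_file(pte_t pte)
{
    return (pte & PTE_FILE_FLAG) != 0;
}
```

Because all the reserved bits sit together at the low end, encoding and decoding are a single shift each, with no need to split the offset around flag bits in the middle of the word.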

Some more minor fixes, and 2.5.66 boots. The patch is next, once I get this pushed out to my public BK repository, followed by some more pull requests for Linus to ignore.

24 Mar 2003
It's been a quiet few weeks for UML development. I've been keeping up with Linus' 2.5 releases, but not necessarily releasing UML patches or pushing changes to Linus. The current hold-up with 2.5.65 is that Linus declared gcc 2.96 to be evil when frame pointers are enabled. Of course, 2.96 is the version of gcc on my laptop, and seems to be the only version available for RH 7.1. So, the 2.5.65 UML is on hold until I grab a tarball of something newer (and all I can find from gnu.org is 3.x) and build it.

I have a new 2.4.20 patch out. This fixes some minor bugs and applies some small patches. Nothing major, but I wanted to clear the decks before tackling some larger problems.

There was a new utilities release which fixed a uml_switch segfault. This is highly recommended for anyone having problems with uml_switch crashing. This is obviously very disruptive to any UML network which relies on the switch, since the UMLs need to be rebooted after restarting it.

27 Feb 2003
I decided to finally get a 2.4.20 UML out. So, I pushed out the remaining 2.4.19 changes as 2.4.19-51. I updated my pools to 2.4.20, and put out 2.4.20-1. The reason I waited so long on this one (as opposed to almost every other 2.4 release when I released UML within a few days of Marcelo's release) is that there was a bunch of stuff in progress that I wanted to get settled down. The main one was skas last fall, which is nicely stable now. I have had no complaints about it at all in the last few months. Then, there were occasional reports of nasty crashes (like the 'tracing myself' ones). I wanted to get those knocked off before 2.4.20. That appears to have happened.

I'm going to wait to see whether the update has created or exposed any new bugs and fix some problems in the utilities. When everything looks nicely stable, I'll make the real 2.4.20 UML release.

2.4.21 looks like it's not too far away, so the 2.4.21 UML should be released pretty soon afterwards. 2.4.20 was a special case because of all the restructuring that was happening. Hopefully, that won't happen again.

26 Feb 2003
I am releasing 2.5.63 today, and also sending the changes to Linus. He takes my stuff about every 5-6 releases, so I'm not holding my breath on him taking this right now, as I got a bunch of stuff in two tries ago.

The big change is that the UML filesystems, hostfs and hppfs are in 2.5, thanks to Petr Baudis figuring out the changes in 2.5 VFS needed to forward port them. He did hostfs, and I used those changes to do hppfs. hostfs seems to work reasonably, but hppfs pretty much only mounts at this point - the shadowing from the host doesn't work yet.

19 Feb 2003
I got 2.5.62 working and released. The major pain here was the kernel introducing its own sigprocmask (and Linus saying that the resulting clash with libc's sigprocmask was totally my problem and he wasn't about to rename it). After perusing the ld man page and info looking for some way of isolating the two symbols and finding nothing, I fell back on renaming the kernel's sigprocmask with an appropriate -D on the compilation of kernel files. This is essentially the same as Oleg's fix, and sucks just as much.

6 Feb 2003
Linus finally took my pending UML updates. This seems to happen every couple of months or so, at which point the changes have started becoming fairly large. This gives me a clean slate to work from, which is handy since the existing changes repositories had become large enough that they started conflicting on a regular basis. I had to merge several of them so they would apply cleanly for Linus.

I've started knocking items off my todo list. Mostly small patches that have been sent in and have been languishing in my todo mail folder.

David has set up a UML mirror on usermodelinux.org, and I've added it to the UML download page. The LinuxVDS mirror is coming - I'm syncing it up and will make it visible when that is finished.

31 Jan 2003
I more or less didn't do anything, UML-wise, for the last week or so. With things accumulating, I decided to get back on the stick again. I looked at the console driver locking because mconsole can hang when it tries to get the configuration of a device that's on a host port. All the things I thought of were too ugly to implement right now. There's also the problem that it acquires a semaphore in an interrupt handler, which leads to a panic if it sleeps there. I don't see a good clean solution for that, either.

I did fix a few bugs though. Roger Binns found a couple of good ones, which are now gone. A couple people noticed that early printfs don't actually appear on the terminal until UML shuts down. An fflush(stdout) should fix that.
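
A minimal sketch of that kind of fix, assuming the early messages go through a printf-style wrapper. The name early_printf is hypothetical.

```c
#include <stdarg.h>
#include <stdio.h>

/* Print a message and push it out immediately, so it is not left sitting
 * in the stdio buffer until exit. Returns the number of characters
 * written, as vprintf does. */
int early_printf(const char *fmt, ...)
{
    va_list ap;
    int n;

    va_start(ap, fmt);
    n = vprintf(fmt, ap);
    va_end(ap);

    fflush(stdout);   /* without this, buffered output may appear only at shutdown */
    return n;
}
```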

I also have a bunch of tty logging changes on the way. Upi Tamminen added timestamps and a direction flag to logging records, which let you replay the log at its original pace, and to tell what data is output and what data is input. He also wrote a little python script to replay the log. I rewrote it in perl, and added flags to allow effectively 'tail -f' of a log and to print out all data, input and output. This shows you stuff that didn't appear on the terminal originally, but which you might want to see anyway, like passwords.
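
To show how a timestamp and a direction flag are enough to replay a log at its original pace, here is a small sketch. The record layout, field names, and the rule that input is hidden unless everything is requested are assumptions for the example, not the actual log format.

```c
#include <stdint.h>

/* Hypothetical direction values for a logged chunk of tty data. */
enum tty_dir { TTY_INPUT = 0, TTY_OUTPUT = 1 };

/* Hypothetical log record header: when the data was seen, which way it
 * was going, and how many bytes of data follow the header. */
struct tty_log_rec {
    uint32_t sec;        /* timestamp, seconds */
    uint32_t usec;       /* timestamp, microseconds */
    uint8_t  direction;  /* TTY_INPUT or TTY_OUTPUT */
    uint32_t len;        /* length of the data that follows */
};

/* Microseconds to wait after showing prev before showing cur, so that the
 * replay runs at the pace of the original session. */
int64_t replay_delay_usec(const struct tty_log_rec *prev,
                          const struct tty_log_rec *cur)
{
    int64_t d = ((int64_t)cur->sec - (int64_t)prev->sec) * 1000000
              + ((int64_t)cur->usec - (int64_t)prev->usec);

    return d > 0 ? d : 0;   /* never sleep a negative amount */
}

/* Output is always shown. Input (which may include things that were never
 * echoed, such as passwords) is shown only when show_all is set. */
int replay_shows(const struct tty_log_rec *rec, int show_all)
{
    return rec->direction == TTY_OUTPUT || show_all;
}
```

A replay loop would read one record at a time, sleep for replay_delay_usec() relative to the previous record, and print the data if replay_shows() says so. A 'tail -f' style mode would keep waiting for more records at end of file instead of stopping.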

On Tuesday, I was interviewed by the History Channel. They are doing a show on network security, and were at ISTS to talk about honeypots and honeynets. So, George Bakos and I told them all they wanted to know about UML honeypots and a bunch of other things. It'll be interesting to see how much of this stuff makes the final show.

On a side note, as for why the History Channel was doing this rather than yet another WWII show, we decided that these guys were from the future. That makes today history for them. It also means that UML figures prominently enough that far in the future to send someone back to talk about it. I guess that's some incentive to keep whacking away on it...

17 Jan 2003
I've been playing with 2.5 lately. I released 2.5.58 on Thursday, announcing it yesterday. Linus promptly released 2.5.59, so I pulled that and updated UML to it. I fixed some problems which were also in 2.4, so I ported them back.

While I was at it, I looked at SMP on 2.5. After some work, I got it to build. It doesn't run. The reason appears to be the xtime locking bug that was fixed in 2.4. So, I decided to release 2.4.19-47 to give myself two releases to diff so I can apply that diff to 2.5.

So, 2.4.19-47 is also out, with a bunch of miscellaneous fixes and cleanups in it.

I have accumulated some tools changes, so I'm releasing them today, as well. The trigger for this was the 64-bit uncleanliness bugs. I want to release the tools at the same time as the drivers so the fixes match. This won't matter for anyone running UML on 32-bit boxes. The other notable change was a bug in uml_moo which caused it not to write out to the end of the output file sometimes, leaving it shorter than it should be.

9 Jan 2003
I've been taking it easy on UML development over the last week or so. I got busy doing some other stuff. UML is in pretty decent shape now anyway, so a bit of a break likely won't bother anyone. People have again started saying that they are having a hard time breaking it. This happened last in the 2.4.19-6 to -13 range, before the skas rework. So, it seems that I've reached that level of stability again.

I had accumulated a number of changes since -45, so I released -46 yesterday. The big news is that the network hang that a number of people were complaining about is fixed. mistral had already found this and diagnosed it. It just took me a while to realize that it was the cause of the network hang.

29 Dec 2002
I finished merging the 2.4 stuff into my 2.5 tree. That all went rather well, so now I have five more trees that Linus needs to pull. I'm currently running diffs and bk stuff in order to prepare the diffstat and changelogs that Linus likes in pull requests, plus getting the 2.5.53-2 patch ready.

With 2.5 caught up to 2.4, it's time to start taking a serious look at my todo list and start knocking things off it.

28 Dec 2002
I started merging the 2.4 changes into my 2.5 pool. I put a small fix in early, then updated it to 2.5.53. That turned out to be pretty much a no-op, so I released the patch. I also asked Linus (yet again) to take my existing 2.5 changes. Doing BK stuff had started becoming inconvenient because I had enough unmerged repositories that new changes started crossing them, so that they couldn't be added cleanly to an existing repo. So, I merged them all together and started using that as my starting pool. Any repos based off that pool would have to wait until Linus pulled the existing ones. So, having prepared to have Linus ignore this set, I watched him promptly pull the whole lot.

He wasn't really enthusiastic about the host skas patch (which I purposely didn't send him). His description of /proc/mm was "crap". He's such a tactful person. This led to a discussion of what would be better, and it turns out he would prefer a system call indirection system call which would allow an arbitrary system call to be executed in the context of a particular address space. You would pass it a file descriptor for the address space (which you'd get by calling a get_mm system call) and a block containing the number and arguments of the system call you want to have run in that address space. Very general, and a neat idea.

22 Dec 2002
I made a whole bunch of small releases as I fiddled the linker scripts and discovered a bit late on each one that I had broken a build or produced a UML that just crashes on boot. The exception to this is -45, which I did in response to a couple of nasty bugs being tracked down to the point where I could either just fix them or reproduce and fix them. Thanks to Jan Hudek and Barry Silverman for their work. -45 also fixes a linker script make-a-UML-that-just-segfaults bug. Hopefully, that's the last of them.

I'm beginning to feel better about UML stability now. The parade of crash reports that followed the skas integration seems to have died down. The two bugs fixed in -45 are a load off my mind. With 2.4 looking reasonable, I think it's time to start merging this stuff into my 2.5 tree. That's been left alone, except for keeping up with Linus, for a while. There are also some 2.5-specific bugs that need fixing, and Oleg seems to be getting impatient about them.

20 Dec 2002
Today brings the release of -41. The main feature of this is that the kernel stack size is now configurable. This is of benefit to no one except people wanting to valgrind UML, which is only me because no one else has my valgrind fixes yet.

Speaking of valgrind, it can now run UML. It produces reams and reams of errors which are almost entirely noise. We're trying to figure out how to get valgrind to be more selective about detecting only real errors. So far, it looks like UML is hitting valgrind with code which is too optimized, and contains code sequences which it doesn't recognize as initializing things. It also appears to me that it is not considering static data to be initialized.

18 Dec 2002
I released -39 and -40 yesterday. They contain mostly small bug fixes and cleanups. The one big change is that I converted all initializers over to C99 syntax. Then I decided to get caught up with 2.5. So, I updated my 2.5 tree to 2.5.52, which was pretty easy, updated my BK repositories and the patch, and sent it all out. So far, Linus hasn't taken any of it.
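
For anyone who hasn't seen the difference, here is a small sketch of the C99 designated-initializer syntax next to the older GNU form. The struct, its fields, and the function names are made up for the example.

```c
/* A made-up operations table of the kind that drivers tend to define. */
struct demo_ops {
    const char *type;
    int (*open)(int fd);
    void (*close)(int fd);
};

static int demo_open(int fd)
{
    return fd;   /* trivial stand-in implementation */
}

/* Old GNU extension syntax, which the conversion replaces:
 *
 *     struct demo_ops demo_ops = {
 *         type: "demo",
 *         open: demo_open,
 *     };
 *
 * C99 designated initializers say the same thing in standard C. Fields
 * that are not named, such as close here, are zero-initialized. */
struct demo_ops demo_ops = {
    .type = "demo",
    .open = demo_open,
};
```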

15 Dec 2002
Continuing to play with valgrind. I fixed its clone problems by ensuring that valgrind doesn't gain control of the child. This stops the valgrind child and parent from stomping on the same data structures and crashing each other. With that problem out of the way, I started hitting problems with valgrind's signal delivery. Its signal frames were bogus, which prevented UML's SIGSEGV handler from getting fault information. UML hit a hole in valgrind's repe handling. Then, it turns out that valgrind doesn't save and restore signal masks across signal handlers correctly. This is where we stand now. UML can run far enough that it panics when it's not given a filesystem to boot on. The signal mask problem hits when UML starts trying to do disk IO. Jeremy Fitzhardinge has been great in helping diagnose these problems and providing fixes.

I've been having limited success in diagnosing and reproducing the bugs that people have been reporting, so I decided to see if I could rustle some up myself. I started up four UMLs (2 tt and 2 skas), ran infinite kernel building loops on them, and also periodically hit them with short ping floods and ab runs. They all ran fine for a while, then they all ran out of tmpfs space on /tmp. This is fine for the tt UMLs, but I didn't implement the recovery code in skas mode, so this was fatal for them. I was hoping for bugs that I didn't already know about though.

I ran the surviving tt UMLs this way all afternoon without any problems. I cleaned up some code, fixed the skas stack consumption bug, and decided to call that -38.

10 Dec 2002
I had a fairly fruitless day chasing bugs yesterday. Some I couldn't reproduce, another I made no headway on. That one was what appears to be memory corruption in netfilter.

I decided to dust off valgrind and see how close it is to handling UML. clone() is a problem, but I realized something that I didn't last time I played with valgrind, and got it handling the !CLONE_VM case. It repaid that effort by spotting a minor buglet. It segfaults on the CLONE_VM case still. I think it's because the two threads mess with valgrind internal data after the call, and one messes up the other. So, my current theory is that I need to have the child thread immediately leave valgrind control and things will work better.

As a side-effect of that, I made UML build as a normal dynamically linked binary when CONFIG_MODE_TT is off. This will be in the next patch. Not a big deal, but it's one more step along the path of UML becoming a completely normal app.

8 Dec 2002
Thanks to David Coulson telling me how to reproduce it, I fixed the 'tracing myself' bug seen under heavy network load. It turned out to be a stack overflow caused by CONFIG_MODE_SKAS increasing the size of a data structure which tt mode put on the stack. So, the problem would go away by disabling CONFIG_MODE_SKAS, even though no skas code ever ran.

I also took the opportunity to do a whole bunch of cleanup of the uml_pt_regs struct (the bloated structure in question), and associated macros and code. This turned a few-line fix into more than 1000 lines of patch. This change is almost all of -36, which is now out.

7 Dec 2002
I finally got around to getting my BK repos hosted on bkbits.net. I had set up a project there when I first got set up with BK, but had never cloned my repos there. Larry McVoy had pinged me once or twice about whether I was intending to use bkbits so he could clean it up if not. So, I pulled all my changes over there.

I also sent everything to Linus again. Hopefully he'll pull it this time.

Since 2.5.50 is corrupting data (and I really hope it's not my fault :-), I'll not do anything more with 2.5 until 2.5.51. So, it's back to bug fixing on 2.4. I've got some claims of ways of reproducing some of the most-wanted bugs, plus a nice stack trace for another, so I think I can knock off some good bugs.

6 Dec 2002
I released -35 with another bunch of small changes. It turned out that signal delivery when libc didn't provide a restorer was broken. This caused Tom's boot/root to segfault because it has an older libc than the other filesystems. This is fixed, plus some large memory crashes. There were also some cleanups.

I'm going to catch up with Linus now. I've been madly massaging BK to get my repos updated. I looked at the 2.5.50 patch and spotted only a couple of things that needed changing. One of them was the deletion of the sys_security system call. I went to delete it from UML and was mystified that it wasn't there. I thought I had totally missed it until I happened to see the names of some of my files go by during a BK pull. I took a closer look at Linus' patch, and it turned out that other people had already made those changes for me. This is a major reason that it's nice to be in the official tree. Other people start doing your work for you.

So, I'll get 2.5.50 up and running and ask Linus to pull everything. Then I'll start merging my recent 2.4 changes into my 2.5 pool.

4 Dec 2002
I only tested Tomcat in skas mode before I released -33. This was a mistake because Tomcat could hang UML in tt mode. This uncovered a fpstate size calculation bug, which is now fixed.

I also ran into a 'sleeping process nnnnn got unexpected signal : 29' crash, which other people have been complaining about. Getting my hands on it ensured its swift demise.

With those two bugs fixed, I decided to release -34. Not a huge amount of change, but those are two major bugs, and it's good to be rid of them.

2 Dec 2002
I figured out what was happening with Java inside UML. With SA_SIGINFO set on a signal, UML didn't produce the same format stack frame as the host. The reason this broke JVMs is that Java can induce segfaults in the JVM. The JVM looks at the information on the signal frame to figure out what happened. If it figures out that the fault was caused by Java code, it converts it into a Java exception, which the Java may or may not catch. Since UML wasn't putting information in the stack frame in the same format as the host, the JVM was confused about the origin of the segfault, and crashed.
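
The mechanism a JVM relies on can be sketched with an ordinary SA_SIGINFO handler that reads the fault address out of the information delivered with the signal. This is a minimal Linux sketch, not JVM or UML code. The function names and the 4096-byte page size are assumptions.

```c
#define _DEFAULT_SOURCE
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

static sigjmp_buf recover;          /* where to resume after the fault */
static void *volatile fault_addr;   /* address reported by the kernel */

/* SA_SIGINFO handlers receive a siginfo_t describing the fault. If the
 * kernel (or UML) fills this in differently from what the program
 * expects, the program misreads what happened. */
static void on_segv(int sig, siginfo_t *info, void *ucontext)
{
    (void)sig;
    (void)ucontext;
    fault_addr = info->si_addr;
    siglongjmp(recover, 1);
}

/* Touch an inaccessible page on purpose and check that the handler was
 * told the right address. Returns 1 on a match, 0 on a mismatch, and -1
 * if the test page could not be set up. */
int fault_address_is_reported(void)
{
    struct sigaction sa, old;
    volatile char *page;
    int ok;

    page = mmap(NULL, 4096, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return -1;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, &old);

    if (sigsetjmp(recover, 1) == 0)
        page[0] = 1;   /* faults: the page has no access permissions */

    sigaction(SIGSEGV, &old, NULL);
    ok = (fault_addr == (void *)page);
    munmap((void *)page, 4096);
    return ok;
}
```

A JVM does something similar on a larger scale: it decides from the fault information whether the segfault came from Java code, and if so turns it into a Java exception instead of crashing.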

I released -33 today with the Java fix, plus a bunch of smaller fixes. I was hoping to have dealt with the 'tracing myself' crashes that David is seeing, but I haven't figured them out yet. Oh well. I also need to get 2.5.50 going, and get all my changes pushed out to Linus.

25 Nov 2002
After far too much trouble, I got a 2.5.49 patch put together, compiling, and running. This uncovered a nasty bug in skas mode which caused signals to be blocked after returning from an interrupt. I started tripping a check in the filesystem which was making sure that interrupts were enabled. After a fair amount of time chasing the problem, I tracked it down to my handler forgetting to re-enable signals before it returned to userspace.

I also have a bunch of BK repositories all set to go to Linus. Tomorrow, I'll probably ask him to pull them. Hopefully, he will this time. The changes are getting rather large and it will be convenient to have them merged into his tree.

22 Nov 2002
I finished the skas merge into 2.5 and got UML booting both in tt and skas modes. So, 2.5 is all caught up to 2.4. Now, I have to push this stuff out to Linus, and then he has to take it. And, I have to merge 2.5.49 and release that.

At this point, I'm in bug-fixing and janitoring mode for the foreseeable future. Despite my attempts to make the skas changes a no-op for the tt code, the latest 2.4 UML is breaking in strange ways in tt mode. So, it looks like I've introduced some bugs there, and they will have to be rooted out.

I also pulled the 2.5.49 changes and tried them out. It turns out the only change that affects UML is an extra argument to do_fork. Fixing that produces a UML that works. So, look for the 2.5.49 patch soon.

21 Nov 2002
I'm busy merging the 2.4 changes into 2.5. Mostly, this is the skas changes, but there's a fair amount of other stuff in there as well. My 19000 line patch is down to ~3000 lines right now. I had to do a huge merge which got rid of most of those diffs before UML would compile again. Once I got it building, I had to fix three bugs, and it booted again, which was nice.

With that out of the way, it looks like I can merge the rest in much smaller chunks. This makes life much easier, since I can build and test after each one, and if UML breaks, there is a relatively small amount of code that I have to search for the bug.

I pushed the first working bunch of code out to my BK repository and generated a 2.5.48 patch. When SF decides to start working again, I'll announce it.

16 Nov 2002
I went back to work on the 2.5 tree. At this point, I'm making the current stuff work again. This wasn't a big deal. There were some small interface changes. The biggest problem was the initramfs stuff. My objcopy didn't support --rename-section, so I grabbed the latest binutils tarball, and built and installed it. Then, UML wouldn't boot. I resorted to looking for anything helpful that Jeff Garzik may have posted to lkml. What I found was that I needed to add a couple of definitions to the arch Makefile to tell objcopy how to jam a piece of arbitrary data (the initramfs image) into the kernel binary. With that, the 2.5.46 UML boots.

While pulling 2.5.47 into my BK tree in my umlcoop UML (which failed once with an error from deep inside BK), I decided to generate the patch from my 2.4 tree that I'm going to have to merge into my 2.5 tree. It turns out, as best as I can figure, that my 2.5 tree is up to 2.4.19-14. So, I have to merge the changes from -14 to -31. Ouch. The diff is almost 19000 lines. Ouch.

15 Nov 2002
I released 2.4.19-31 today with lots of bug fixes. I'm starting to work on the backlog of bug reports, patches, and other things that I've accumulated since I started the skas work. I also redid a lot of the get_config code and extended it to the ubd driver. The network drivers still don't support it, plus the console and serial line drivers don't support plugging and unplugging devices at run time. I also need to support listing of all devices of a particular type, which is needed for people who want to find a spare slot into which to plug a new device.

-31 doesn't compile with CONFIG_MODE_SKAS disabled. It was a silly oversight - I put some skas-specific code outside CONFIG_MODE_SKAS.

I'm starting work on 2.5 again. The last UML release was 2.5.44. Linus is up to 2.5.47. I'm going to take the existing code up to that (or .48 if Linus releases that before I'm done). Then comes the job of merging all the changes I've made in the 2.4 pool. The largest piece of this is the skas stuff, but there have been a lot of bug fixes, cleanups, features, and other changes.

13 Nov 2002
I finally figured out the FP problem that was preventing RH7.2 from booting on -29. I wasn't initializing the FP state correctly in new processes. Copying in a known good set of values fixed that.

I also fixed the segments problem cleanly. The host skas3 patch included a copy-segments /proc/mm operation, which is needed to copy arch-specific address space information from one address space to another. This wasn't used before, but is now in order to copy x86 segment information between address spaces.

With that, plus a bit of code cleanup, I released -30. At this point, it's time to settle things down, fix bugs, clean code, and get 2.5 up to date again. Basically, I'd like to spend a good amount of time on maintenance rather than new code. We'll see how well I resist the call of new functionality...

11 Nov 2002
I'm releasing -29 and another version of the host patch today. The big news here is that, in skas mode, UML no longer creates one host process for each UML process. That used to be done in order to create new address spaces on the host. Since only one of those processes could be running at any given time (in a UP UML), they were a waste of kernel memory.

Now, I've added /proc/mm, which provides a way to deal with address spaces independently of processes. Opening it creates a new, empty address space, and returns a file descriptor which can be used to manipulate it. Closing the descriptor frees the address space if no process is running in it. The address space can be populated by writing to the descriptor. You write a request to it, which can cause an mmap, munmap, or mprotect to happen. The contents of the request are basically the arguments to the corresponding system call.
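
A request of this kind might look roughly like the following. The struct layout, the operation codes, and the helper name are all hypothetical, chosen only to show a request carrying the same arguments as the corresponding system call. The real format is defined by the host patch.

```c
#include <string.h>

/* Hypothetical operation codes for requests written to the descriptor. */
enum mm_op { MM_MMAP, MM_MUNMAP, MM_MPROTECT };

/* Hypothetical request: an operation plus the arguments that the
 * matching system call would take. */
struct mm_request {
    enum mm_op op;
    union {
        struct { unsigned long addr, len, prot, flags, fd, offset; } mmap;
        struct { unsigned long addr, len; } munmap;
        struct { unsigned long addr, len, prot; } mprotect;
    } u;
};

/* Build a request that changes the protection of a range in the target
 * address space. */
struct mm_request mm_mprotect_request(unsigned long addr, unsigned long len,
                                      unsigned long prot)
{
    struct mm_request req;

    memset(&req, 0, sizeof(req));
    req.op = MM_MPROTECT;
    req.u.mprotect.addr = addr;
    req.u.mprotect.len = len;
    req.u.mprotect.prot = prot;
    return req;
}

/* On a patched host, the caller would open /proc/mm to get a descriptor
 * for a new address space and then write() the request to it. That step
 * is left out here because it cannot run on an unpatched kernel. */
```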

This allows a process, i.e. the UML kernel process, to have handles to many address spaces without processes running in them. When UML creates a new process, it creates a new address space for it. When that process is scheduled, the UML userspace process on the host is switched into that address space with PTRACE_SWITCH_MM. That causes it to jump from its current address space into the new one. That, plus a full register restore, is now a UML context switch.

There are some glitches with it. Floating point is broken. This doesn't affect older distributions such as Debian potato, but it breaks newer ones, such as RH7.2. The problem is that, while I do a full save and restore of the floating-point registers across a context switch, that's not good enough. What I need to do is tell the FPU (via fxsave) to dump its state out to memory. That will give UML the real FP register state, which would be restored when that process is next scheduled.

With these problems fixed, I've reached the level of host support I've wanted for a while. The number of processes on the host is close to the absolute minimum. This causes UML to consume fewer host resources. It also seems to be faster. I ran some kernel builds, and it now takes 8:22. I didn't run tt or skas2 builds, but if my previous results still hold, which they should, that's 45 seconds faster than skas2, which came in at 9:07. That brings the skas3 kernel build time to 60% of the tt build time.

The one major thing that I could still ask from the host is some way to merge the current kernel and userspace host processes into one. This would involve some way for the kernel process to switch itself to the userspace address space, with a register switch at the same time. It would also require a mechanism to switch back (address space plus registers) on any sort of signal. This essentially would mean it was ptracing itself, plus intercepting its own signals. It's not clear to me how to do that, so I'll just let this sit for a while.

There are a couple minor host changes which I'll probably add in at some point. ptrace currently requires that system call intercepting processes see both the start and end of the child's system calls. This causes four host context switches per UML system call. What I would like is to just see the start of a system call, at which point UML would read it out, run it, and stick the result back in the child. It would be told just to return immediately to userspace. This would eliminate two of those context switches, and bring it down to the theoretical minimum. This should also noticeably improve performance.

The second minor thing I am planning is some way of filling in the start and end addresses of the command line and environment of new address spaces. Those values would be the same as the UML values. This would have the nice effect of making ps on the host show the same process name for the UML userspace process as ps inside UML. So, that one host process would change its name according to what is currently running inside UML.

I got rid of the ugly asm that I used to set up %gs correctly. There is support in the host patch to copy segments from one address space to another, but UML doesn't currently use it. This is straightforward, and should be the correct fix for the segment problems I was having last week.

There were a bunch of other changes in this patch, including

  • /proc/mm support, since I tested it in UML before porting it to the host
  • I fixed the segfault caused by querying the configuration of a device that had never been opened
  • UML now compiles with CONFIG_MODE_SKAS off.
  • There was a fair bit of code cleanup.
  • Fixed the behavior of the network driver when it gets an error.

5 Nov 2002
After much wailing, gnashing of teeth, rending of garments, and asm nastiness, I fixed the problems with my RH 7.2 filesystem not booting in skas mode. The problem is with the segment register gs, which is used somehow by newer pthreads libraries. UML was ptracing the correct value into new processes, but it was vanishing immediately for reasons that I couldn't figure out for a while.

It eventually dawned on me that it would work if it was inherited from the parent process. In tt mode, the parent process is the host process of the parent, so that works automatically. In skas mode, the parent process is the kernel process, so there is no relationship between the host processes that hold UML parent and child processes.

With this level of understanding, I went about setting the value of gs in the kernel process temporarily to the correct value so it could be inherited by the child. This consistently segfaulted, which I didn't understand till this morning.

It turns out that loading a value into a segment register will fail if the corresponding segment hasn't been loaded with sane data. Now, modify_ldt is there, but it operates on the userspace process, not the kernel process. So, I added a kludge to make modify_ldt also operate on the kernel process. This allows the segment register to be assigned and then inherited properly by the child.

After staring at the i386 code, I have since realized why the value of gs can't be ptraced in. It is added to the thread structure properly, so if you write it in and immediately read it back out, you will get the expected result. However, in the i386 switch_to, it zeros out fs and gs if they don't have good segments already. I'm not sure exactly why they don't, but this looks like the reason ptrace isn't working the way I expect.

So, with that fix in, plus a bunch of others, -27 is out. If you want to use skas mode with it, you'll need the host skas2 patch. The first patch won't work because I moved the new ptrace operation numbers so that -27 won't see it even if it is applied to the host.

3 Nov 2002
-26 is out. This contains a few bug fixes which are enough to get my RH 7.2 filesystem to boot. It needed a working modify_ldt and a signal delivery fix. I started trying to run UML inside itself in skas mode, and that turned up a timer bug. With those bugs fixed, RH 7.2 boots, but a few of the daemons are running continuously. This needs fixing as well.

To take advantage of this, you'll need the second host patch, which I haven't released yet. The one I'm running on my laptop has PTRACE_LDT in it. I'm going to add PTRACE_JOIN_MM to it before releasing it. I played with that a bit today, with a little test program that is made to switch address spaces. It seems fine, and I'll port it to i386 soon. I also need to port PTRACE_LDT into UML since that's only in my i386 pool at the moment.

I haven't talked about where I'm going with the host support, so this would be a good time to do it. PTRACE_JOIN_MM is a stepping stone towards greater things. Currently, you give it a pid, and it will locate that process and make the ptraced child join its address space. The next step is to eliminate that pid, since it's only used to locate an address space.

I'm going to make address spaces first class objects which are visible from user space. This will probably take the form of /proc/mm, which when opened, gives the caller a file descriptor whose underlying object is a brand-new, empty address space. Then, PTRACE_JOIN_MM can dispense with the pid, and use that file descriptor instead.

What this gives us is the ability to have an address space without a process running in it. In turn, this lets us eliminate the host processes which each hold a UML process's userspace. In their place, we will have one host process per UML processor to hold the userspace contexts. The one host process per UML processor kernel context will remain. So, the number of host processes will go from num_UML_processors + num_UML_processes to 2 * num_UML_processors, which is much smaller. The userspace context processes will get bumped from address space to address space as needed.
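The process-count arithmetic above can be written out as a small sketch. The function and parameter names are made up for illustration and are not taken from UML.

```c
/* Host process counts for the two layouts described above.
 * Function and parameter names are illustrative, not UML code. */

/* Current skas layout: one kernel-context process per UML processor,
 * plus one host process holding each UML process's userspace. */
unsigned long host_procs_per_process_layout(unsigned long uml_cpus,
                                            unsigned long uml_procs)
{
    return uml_cpus + uml_procs;
}

/* Planned layout: per UML processor, one kernel-context process and one
 * userspace-context process that is moved between address spaces. */
unsigned long host_procs_per_cpu_layout(unsigned long uml_cpus)
{
    return 2 * uml_cpus;
}
```

For a two-processor UML running 100 processes, the first layout needs 102 host processes and the second needs 4.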

That much is easy. It leaves us with each UML processor having two host processes, one for its kernel context and one for its userspace context. The next step will be to merge those, and have one process per processor bouncing between kernel and userspace address spaces. This is harder, because it implies these processes ptracing themselves, with an automatic address space switch every time there's a ptrace event which causes a switch to kernel space.

For system calls, this seems straightforward. The system call tracing code can do the address space switch, and replace the registers to be restored on return to userspace. Signals are harder because that code is deeper down the stack, so more code would have to be fiddled in order to make that work.

Assuming that's doable, that's the ultimate goal - one host process per UML processor, with one host address space per UML process. This would eliminate a bunch of the ptrace additions, which would have served temporarily as scaffolding. PTRACE_{MMAP,MUNMAP,MPROTECT} and PTRACE_LDT would become operations on the /proc/mm file descriptors. PTRACE_SIGPENDING becomes unnecessary since that's needed to avoid races when switching between processes. PTRACE_FAULTINFO would still be needed, unfortunately, since PTRACE_GETREGS doesn't give it to you.

2 Nov 2002
I released -24 and -25 over the last few days. They are mostly bug fixes to skas mode. The network now runs reasonably, thanks to fixes to the checksumming code, although I do have sshd's hanging around after copies. There is now thread support, which allows me to run kernel builds - make uses vfork, which is a cheesy form of threading.

So, I did some kernel build runs to see what performance difference there is between tt and skas modes. Two runs each in skas and tt modes produce times of 9:07 and 13:50 respectively, identical to the second in both cases. That is 547 seconds against 830, which makes the skas time about 66% of the tt time.

For some micro-benchmarking action, I ran a loop of 100000 calls to getpid. In tt mode, it takes 15 sec; in skas mode, it takes 7.
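A loop of this kind might look like the sketch below. It is not the program used for the numbers above; the function name is made up, and it goes through the generic syscall() entry point on the assumption that a libc getpid wrapper could cache the result and skip the kernel entry.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for syscall() */
#endif
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Time `calls` raw getpid system calls and return the elapsed
 * wall-clock time in seconds. */
double time_getpid_loop(long calls)
{
    struct timespec start, end;
    long i;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < calls; i++)
        syscall(SYS_getpid);
    clock_gettime(CLOCK_MONOTONIC, &end);

    return (double)(end.tv_sec - start.tv_sec) +
           (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

Calling time_getpid_loop(100000) inside a tt-mode UML and again inside a skas-mode UML gives the kind of comparison quoted above.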

There are still some improvements to be had. Currently, skas mode does a system call with four host context switches, while tt mode does one with four host context switches and a host signal delivery and return. The speedup comes from losing that signal delivery, plus maybe the context switches are faster, due to the UML process address spaces being smaller because they don't contain the kernel. Two of those four context switches can be eliminated, which should cut the system call overhead down by a factor of two from where it is currently.

In the continuing quest to get my RH filesystem to boot in skas mode, I fixed modify_ldt. This is another address-space-changing function, which, before I fixed it, was modifying the kernel address space, rather than the process'. So this required another ptrace kludge on the host, which when implemented, made modify_ldt work better. It still doesn't boot. The current hang-up is something called getkey, which I've never heard of, has no documentation that I can find, and has no interesting strings in its binary. It is sitting in a poll forever, and I haven't figured out why yet.

30 Oct 2002
I released -23 last night. UML now runs in skas mode on the host. I ported the host ptrace patch from UML to i386 and have been running it on my laptop for the last few days. It is available here.

There are some glitches still - see this usermodelinux.org story for the details.

Once skas mode is working well, it's time for some stabilizing. With highmem, SMP, and skas support going in over the last couple of months, I'm going to concentrate on killing bugs for a while. Marcelo has started the 2.4.20-rc series, so I'm thinking that I will start new UML development projects after 2.4.20 is released, and concentrate on stability until then.

27 Oct 2002
skas mode works now. The host has to be a recent UML, and the guest needs to be built with CONFIG_MODE_SKAS enabled and CONFIG_NEST_LEVEL set to 1. See the story on usermodelinux.org about the 2.4.19-21 release. That contains a good description of what skas mode is and what the benefits are.

I released -22 today with more fixes and cleanups. UML now builds and runs with either mode configured out. I fixed the SMP build, although I punted on SMP on skas mode for the time being. This involved a fair bit of code cleanup and movement.

I persuaded UML to build as a normal dynamically linked binary. This took some surgery on the linker script which I didn't include in this patch. The point of that exercise was to see if valgrind would work on UML now. It starts to, but blows up on the first call to clone, which happens fairly early. After some correspondence with Julian Seward, the valgrind author, it appears that this is easy to fix. So, we may soon have valgrind working on the kernel, which is timely considering that we are entering the stabilization phase of 2.6.

24 Oct 2002
Most of the last week was spent merging the skas work done back in September into my main pool. I created a couple of pools on the side so that I didn't have to stall the main line of development. Those two pools are now merged into my main pool and are in CVS and the patch. I released 2.4.19-18 yesterday with everything finally merged. The skas changes are present, but latent for now. You can enable CONFIG_MODE_SKAS (it's hardwired to on currently anyway) and the code will be compiled in, but you can't run it. -18 had some build problems. I forgot to update a clean rule, with the effect that there's a binary in the patch (rm arch/um/util/mk_constants if you want to build that patch). The link will fail if you enable CONFIG_HIGHMEM.

I released -19 with those bugs fixed and with some code movement, and detection of the host support needed for skas mode. This gives me a clean platform from which to start debugging the skas code, and everyone else a cleanly building patch.

14 Oct 2002
I spent the last few days debugging the merge of the 2.4 SMP support into my 2.5 pool. That was more painful than I expected. The actual debugging took longer than the 2.4 debugging, notwithstanding the fact that I had the benefit of that earlier debugging. The SMP infrastructure had changed in 2.5, and I didn't get UML into userspace until I stared at the i386 SMP boot process and made the UML boot do the same things at the same time.

I had to change the locking in the ubd driver. I started with the request queue lock being the same as the ubd device lock, which deadlocks unfixably when a disk is added to the system with mconsole. ubd_config holds the ubd_lock and calls add_disk, which does IO to read the partition table, which tries to grab the ubd_lock again. I added a ubd_io_lock, which pretty much restored the 2.4 situation, which protects IO with the global io_request_lock.

On 2.4, I had all interrupts handled on CPU 0. This turns out to be wrong - all processors have to have timers in order to do their local process accounting. However, only one processor, CPU 0 in the case of UML, actually calls the timer IRQ. This matters on 2.5; it doesn't seem to on 2.4. I also found an unfixable race with the timer interrupt caused by UML never blocking SIGALRM and SIGVTALRM and instead relying on flags in order to call into the kernel when it shouldn't. I decided to just get rid of those flags and treat the timer interrupts like any other signal.

The end result is a bunch of bug fixes and cleanups which need to be carried back to 2.4.

I'm currently merging 2.5.42, and watching an SMP 2.5.42 do a -j5 kernel build. So far, so good. However, there are some strange crashes which I haven't figured out yet. They were pretty reproducible for a while, then they seemed to disappear and I got 3 -j5 kernel builds in a row.

The evil thought of the day is to port UML back into the kernel. The internal kernel interfaces can be thought of as another OS, which would make this an OS port of UML. Why dump UML back in the kernel after I've spent all this time pulling it out into userspace? There are a few reasons that come to mind:

  • It's whacked, therefore it appeals to me and must be done
  • It could perform better since it has access to the full host kernel and is not restricted to the system call interface. This is another way of looking at UML performance and could lead to better ideas on making normal userspace UML perform better.
  • It would go in the direction of allowing Linus to partition a machine and run separate OSes on the different partitions. The host could hand devices to the UML, which would access them using the normal hardware drivers, something it can't do in userspace. Over time, the "host" kernel could be trimmed down to the point where it's nothing but a little executive which starts the partition kernels and splits the hardware between them. Then the "guests" would be the real machine OSes.

8 Oct 2002
Several 2.4.19 patches later, the SMP audit is done. UML/SMP seems to be OK. My test before releasing -12 was a -j5 kernel build on a four processor UML. That went fine. mistral is reporting a hang, but I want to see it myself or else I'll consider that he's just making it up :-)

I released the 2.5.41 patch today, and pushed everything out to Linus again. I accidentally left out port numbers on the URLs I gave Linus, so I quickly put a bkd on port 80 of my UML so that he would be able to pull everything.

The next task is to merge the SMP stuff from 2.4 into 2.5. That's the reason I cleared the decks on 2.5 today.

4 Oct 2002
I've started the SMP audit. The first step was to go through the code marking all global data that took longer than .5 seconds to tell that it didn't need to be locked. The next step will be to go through those and add locking where necessary. I've done some of this, also adding comments where locking is not needed.

With this, I released 2.4.19-9. I also turned off CONFIG_UML_NET_PCAP in defconfig since the default build would break on any system that doesn't have libpcap installed.

1 Oct 2002
Some Boston University students seem to have some intellectual integrity problems. Apparently some class was assigned the task of describing the differences between the virtualization approaches of UML and VMWare. So, what do these fine students do in response? They send me mail such as
Would you like to tell me that what are the fundamental differences in the way virtualizition is performed in UML vs. VMware?

I am eager to know about it.

and
I'm hoping you can help me with this. I'm doing a small research project and am trying to figure out some of the fundamental differences in the way virtualization is performed in User-mode Linux vs. VMware.
Ummm, right.

In actual news, I released 2.4.19-8 today. I removed the limit on the number of network interfaces a UML can have. In related work, I cleaned up and simplified the network transport interface. This release contains a good number of small bug fixes.

Linus merged my changes into 2.5.40, so that's up-to-date with respect to generic kernel changes and highmem. I still have a bunch of bug fixes waiting. I'll be pushing them plus the networking changes.

30 Sep 2002
Highmem is now working in 2.5. UML is also updated to 2.5.39, so I sent both the 2.5.39 and highmem updates to Linus. I released the 2.5.39 patch as well.

I now have a lot of pent-up small stuff to deal with. I'll clean out some of that, and then get on with getting SMP working.

23 Sep 2002
I finally released the 2.4.19 UML after fixing some final build bugs. With that out of the way, I can start working on more intrusive stuff.

In the intrusive stuff category, I got highmem support working. So, you can specify any amount of memory you want, up to 4G, and the stuff that can't be mapped directly will become highmem.

I released 2.4.19-6 with highmem and fixes for a couple of crashes. One of them was a subtle timer bug that shows up with an idle UML on a loaded host, namely umlcoop.org.

I think I've got BitKeeper figured out. I'm using it for 2.5 development, and have got UML updated to 2.5.38, which patch I'm releasing now. I have to update the repositories on umlcoop.org so other people can pull from them. After that, the next task will be to get my 2.4 stuff into it.

13 Sep 2002
Lots of people are happy about UML finally getting into 2.5. I've gotten email from all over the place. It was also fairly newsworthy, apparently. It was first picked up by kerneltrap. A pointer to that story was submitted anonymously to usermodelinux.org, which I posted, notwithstanding the fact that I had just posted a story of my own about it. An almost identical submission was made to Slashdot, who ran it. This caused more traffic to usermodelinux.org than it had ever seen before. To my knowledge, this is the first time that a UML had been slashdotted. There were some hiccups, but it seems to have fared reasonably well. It helped that it wasn't a full-bore Slashdotting.

It was also picked up tangentially by LWN. Linux Today ran a story that just pointed off to kerneltrap.

12 Sep 2002
I released 2.4.19-3 yesterday and announced it today. I also released 2.5.34, and sent the patch to Linus.

I think I'm getting close to deciding that 2.4.19 is done. The problems that have cropped up that I don't have fixes for don't seem to be as serious as they first looked. The problems that I do have fixes for will be fixed, of course.

The one remaining fix that I have is the tracing thread crash in kmalloc. I'll put that in 2.4.19-4 and see how that looks as the official 2.4.19 UML.

Late breaking news : Linus finally merged UML. It will be in 2.5.35.

9 Sep 2002
2.5.34 is out and UML isn't in it.

I've been playing with a UML on a colocated server bought by Bill Stearns and a bunch of friends, and set up by Bill over the weekend. Everyone who contributed gets a private UML with its own IP. I've turned mine into another mirror of the UML downloads and web site. I also put CVS on it with a view towards moving my CVS off SourceForge. If I can get the server side of BitKeeper (which I've downloaded, but not looked at to see whether I got only the client side), I might also run my BitKeeper pools from there.

After this, I had an evil thought. A root filesystem with

  • Apache
  • PHP
  • mod_perl
  • MySQL or Postgresql
  • CVS
  • Mailman
  • A web site, or other documentation, explaining what's there and how to use it
installed would make a fine project hosting platform. Couple it with a host and an administration that can keep it running, and you have a good mini-SourceForge. I would certainly prefer that to SF, mainly because of the control and flexibility I get.

In other news, I spent today fiddling the web site. A lot of time was spent cleaning up the XML. I fixed a bunch of mistakes and got rid of some obsolete information.

6 Sep 2002
I've spent the last few days catching up with bug fixing and releases. I rolled a bunch of fixes in and released 2.4.19-2. I also made a tools release with a few changes, chiefly fixing uml_moo so that it spits out a sparse file and cleaning it up so that it's much more understandable.

I also got 2.5.33 going. None too soon, since the natives were starting to get restless. James McMechan and Mike Anderson both popped up on uml-devel with the changes needed to get UML up to 2.5.33. I'm sending UML to Linus again. We'll see how it goes.

2 Sep 2002
Today's problem is SA_RESTORER. This was a hidden same-address-space dependency. UML signal frames are basically copies of host signal frames with all of the signal-specific information replaced. Since the libc in UML provides a restorer that's inside UML, UML processes use it since that information is not changed when UML constructs its own signal frames. This works fine as long as UML and the process are in the same address space. When they aren't, and the process' libc doesn't provide a restorer, the process will segfault trying to call UML's restorer.

The fix is to get a frame from the host which has the kernel's built-in restorer, which you get by turning off SA_RESTORER in sigaction. However, this is easier said than done, since the sigaction entry point provided by libc specifically provides a restorer and disables any attempt to say that there isn't. So, I ended up calling sigaction by going through the generic syscall() entry point rather than libc's sigaction. Doing this, plus using the old sigaction structure that the kernel uses, got me a signal frame with a built-in restorer.

Theoretically, this should get the new UML up to a login prompt. The signal frame code that I changed in the host UML also needs to be applied to the guest UML. And I decided to do that as part of the merge of the two pools. And that is going to wait a bit.

I have a bunch of bug fixes to get out, plus I need to catch up to Linus, plus I need to get 2.4.19 out. So, I'm going to take care of all that, and then get on with merging the re-architecting into the main UML pool.

30 Aug 2002
Today was Signal Delivery Day. That took me a while to get right, mostly because it was a nice day and I felt like doing stuff outside.

That done, a bug in fork reared its ugly head. It turned out that fork wasn't returning 0 in the child, making it believe that it was the parent. Since init is doing the double fork trick, where it forks, the child forks and exits, and the grandchild execs whatever needs execing, this meant that all children just exited. init understandably got rather upset with this.
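The double fork trick can be sketched as follows, with the function name made up for illustration. Each branch depends on fork returning 0 in the new process, which is why a fork that returns nonzero in the child sends every descendant down the parent's path.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Start `path` as a grandchild of the caller. Returns 0 if the
 * intermediate child was created and exited cleanly, -1 otherwise.
 * The grandchild's own exit status is not collected here. */
int spawn_double_fork(const char *path)
{
    int status;
    pid_t child = fork();

    if (child < 0)
        return -1;

    if (child == 0) {
        /* fork returned 0, so this is the child */
        pid_t grandchild = fork();

        if (grandchild < 0)
            _exit(1);
        if (grandchild == 0) {
            /* fork returned 0 again, so this is the grandchild */
            execl(path, path, (char *)NULL);
            _exit(127);     /* only reached if exec failed */
        }
        _exit(0);           /* child exits at once, orphaning the grandchild */
    }

    /* fork returned the child's pid, so this is the parent */
    if (waitpid(child, &status, 0) != child)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

The parent only ever waits for the short-lived child, so it never has to reap the long-running grandchild.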

exec then crapped out because of a buglet in strnlen_user. I fixed that, plus a few instances of another bug that I happened to notice.

Now, I get a getty running. It hangs in sigsuspend. I think that's because it's sleeping for a few seconds and I haven't hooked in the timer yet.

29 Aug 2002
Decent progress today. Last night's infinite segfault loop turned out to be caused by clear_user not really clearing anything. So, init's bss, which was supposed to be zeroed, contained garbage, and it crapped out when it tried to dereference some of it.

Fixing that and one or two other user access bugs made things work a lot better. init now runs, and starts firing off the rc scripts. The current hold-up is signal delivery. init won't boot the system unless it sees SIGCHLDs from exited scripts. There are only minor problems here, so this shouldn't be a big deal.

28 Aug 2002
Today's installment of As the Codebase Churns stars the user access macros. I implemented the rest of them when it turned out init was getting to the first open, but bombing on a copy_from_user. That took little time, but I spent the rest of the day chasing bugs in them.

The problem is that when you mess them up, the kernel doesn't crap out right there. It bombs at some later point, in an apparently unrelated way. For example, I was chasing an infinite segfault loop. More or less by accident, I discovered that init was being started with an sp of zero. That turned out to be because a function which sets up the initial stack for an exec was silently failing. And that was failing because of a bug in strnlen_user.

I *always* get strnlen_user wrong because it's defined differently from the libc version. The libc strnlen normally returns strlen(str). The kernel version returns strlen(str) + 1. Believe it or not, this makes a big difference in how well (or if) your kernel boots.
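The off-by-one can be modeled in userspace as below. The helper name is made up, and the sketch covers only the return-value convention: the real strnlen_user also returns 0 when the user address faults, which is not modeled here.

```c
#include <stddef.h>
#include <string.h>

/* Userspace model of the kernel convention: the result counts the
 * terminating NUL, and a string with no NUL in the first `max` bytes
 * yields a value greater than `max`. */
size_t strnlen_kernel_style(const char *str, size_t max)
{
    /* libc strnlen returns strlen(str), capped at max */
    size_t len = strnlen(str, max);

    return len + 1;
}
```

For the string "init" with a limit of 16, libc strnlen gives 4 and the kernel-style version gives 5. A caller that mixes up the two conventions either drops the last character or runs one byte past the end.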

After chasing this and a few other entertaining bugs, init is opening files, making system calls, and all those good things. Unfortunately, it also goes into an infinite segfault loop on a ridiculous address. This is something to chase in the morn.

27 Aug 2002
I fixed last night's hang by having UML collect a representative set of registers, including a set of good segment register values, from a subprocess. This is used as the starting point for new processes, rather than the array of zeros I was using before.

With this fix in place, init gets to the point of starting to make system calls. The new system call handling required redefining the pt_regs structure, which required a fair amount of hacking and slashing before it would compile again.

It now seems to be doing system calls OK, although with some flakiness. I've currently got it up to system call number 6.

26 Aug 2002
The Great Redesign continues.

The major task of the day was redoing the copy_user macros. When the kernel is moved from the process address space to its own address space, the macros which copy data back and forth between userspace and kernelspace get more complicated. Before, with kernel and process sharing the address space, data can just be copied back and forth, taking care to make sure that faults are handled properly, since the userspace address may be bogus, or it may have been swapped out.

That's no longer possible when they don't share an address space, so what you have to do is do a process virtual to physical address mapping, and then copy the data in or out of the physical memory. When the page isn't present, there won't be a mapping for it, and you have to fault it in to the process address space. Then, the mapping will be created and the data is accessible.
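The translate-then-copy idea can be shown with a toy model. Everything in the sketch below is an assumption made for illustration (a byte array standing in for physical memory, a single flat page table, a fault handler that just hands out the next free frame); none of it is UML's implementation.

```c
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE   4096UL
#define NPAGES      8           /* size of the toy address space, in pages */
#define NOT_PRESENT (-1)

/* Toy "physical memory" and one process page table (vpn -> frame). */
static unsigned char phys_mem[NPAGES * PAGE_SIZE];
static int page_table[NPAGES] = { -1, -1, -1, -1, -1, -1, -1, -1 };
static int next_free_frame;

/* Stand-in for the page fault path: give the page a frame and map it. */
static int fault_in(unsigned long vpn)
{
    if (vpn >= NPAGES || next_free_frame >= NPAGES)
        return -1;
    page_table[vpn] = next_free_frame++;
    return 0;
}

/* Copy into a process virtual address, translating one page at a time.
 * Returns the number of bytes NOT copied, like the kernel's copy_to_user. */
size_t toy_copy_to_user(unsigned long uaddr, const void *from, size_t n)
{
    const unsigned char *src = from;

    while (n > 0) {
        unsigned long vpn = uaddr / PAGE_SIZE;
        unsigned long off = uaddr % PAGE_SIZE;
        size_t chunk = PAGE_SIZE - off;

        if (chunk > n)
            chunk = n;
        if (vpn >= NPAGES)
            break;      /* bogus address */
        if (page_table[vpn] == NOT_PRESENT && fault_in(vpn) != 0)
            break;      /* fault could not be resolved */

        memcpy(&phys_mem[page_table[vpn] * PAGE_SIZE + off], src, chunk);
        uaddr += chunk;
        src += chunk;
        n -= chunk;
    }
    return n;
}

/* Read one byte back through the mapping, or -1 if the page is unmapped. */
int toy_peek_user(unsigned long uaddr)
{
    unsigned long vpn = uaddr / PAGE_SIZE;

    if (vpn >= NPAGES || page_table[vpn] == NOT_PRESENT)
        return -1;
    return phys_mem[page_table[vpn] * PAGE_SIZE + uaddr % PAGE_SIZE];
}
```

A copy that straddles a page boundary is split into two chunks, and each page is faulted in separately before its part of the data is written.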

With copy_to_user and clear_user in place, it's possible to exec init and enter userspace. I've got to the point where it has faulted in a few pages. init then spins because I gave it a set of registers which is all zeros, except for the ip and sp. For the normal registers, this is fine, but things work very badly when the segment registers are zero. In this case, it's spinning on the first data reference because it has a ds of zero.

In other news, it turns out that I pretty much have to totally work inside UML. Since I'm running UML inside another UML with the kernel pool on a hostfs mount, and because hostfs doesn't consider that files can change underneath it, I have to edit and compile through hostfs, rather than doing that on the host. I finally installed emacs in UML after getting sick of @!$!@%$ vi, and I'm fairly impressed by how quick it is. I can see a little bit of interactive slowness sometimes, but it pretty much feels the same as emacs on the host. Builds are another question. They are noticeably slower than on the host. This might be due to hostfs and its synchronous IO, but I'm not sure about that.

25 Aug 2002
I'm in the throes of redoing parts of UML so that the kernel lives in its own address space. There would be one thread per processor running in this address space. Every UML user process would have a separate host process. Transfers into the kernel therefore cause a context switch from the user process to its kernel process.

It's being done this way because of Ingo's revelation on LKML that context switches are much faster than signal deliveries. Previously, I was thinking that the best performance would come from intercepting system calls without context switching. I was planning on having the kernel optionally be put in its own address space, but was thinking that there would be a performance penalty for doing that. Now, it seems that this is totally the best way to go.

To support this, I've added a bunch of new stuff to the host's ptrace:

  • PTRACE_FAULTINFO - get the fault type and address of the child's most recent segfault. This is needed for page fault handling.
  • PTRACE_SIGPENDING - get the child's pending signal mask. This will be used to work around a race between a child executing a system call and a SIGIO being queued to it. If they happen at the same time, the system call causes control to be transferred into the kernel, leaving the SIGIO pending on the child. If the kernel then switches to another process, that IO notification will be lost for an indefinite period of time, which can cause UML to hang.
  • PTRACE_MMAP, PTRACE_MUNMAP, and PTRACE_MPROTECT - these manipulate the child's address space.
  • PTRACE_CLONE - not added yet, but this will cause the child to call clone. The reason for this is a little subtle. CLONE_VM threads in UML need to be CLONE_VM on the host as well. So, the kernel process needs to be able to create a clone of one of its children.
When a user process is running, the kernel process will be in a loop calling wait and ptrace in order to intercept system calls. When a process system call bumps the kernel process out of wait, it will read the system call and execute it itself. When it returns, it will switch back to the user process. So, a UML system call will have an overhead of two context switches rather than the current four context switches, one signal delivery, and one signal return.
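The wait-and-ptrace loop can be sketched with the standard ptrace requests that already exist on a stock host (PTRACE_TRACEME, PTRACE_SETOPTIONS, PTRACE_SYSCALL), not the new requests listed above. The function name is made up, and this tracer only counts the stops. A kernel process as described would read the child's registers at each stop and execute the call itself.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for syscall() */
#endif
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that makes `ncalls` getpid calls and trace it.
 * Returns the number of syscall stops seen, or -1 on error. */
long count_syscall_stops(int ncalls)
{
    int status, i, sig = 0;
    long stops = 0;
    pid_t child = fork();

    if (child < 0)
        return -1;

    if (child == 0) {
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) < 0)
            _exit(1);
        raise(SIGSTOP);     /* park until the tracer starts its loop */
        for (i = 0; i < ncalls; i++)
            syscall(SYS_getpid);
        _exit(0);
    }

    /* The first stop is the SIGSTOP raised by the child. */
    if (waitpid(child, &status, 0) != child || !WIFSTOPPED(status))
        return -1;
    /* Make syscall stops report SIGTRAP | 0x80 so they can be told
     * apart from ordinary signal stops. */
    ptrace(PTRACE_SETOPTIONS, child, NULL,
           (void *)(long)PTRACE_O_TRACESYSGOOD);

    for (;;) {
        /* Run the child until its next syscall entry or exit. */
        if (ptrace(PTRACE_SYSCALL, child, NULL, (void *)(long)sig) < 0)
            return -1;
        if (waitpid(child, &status, 0) != child)
            return -1;
        if (WIFEXITED(status) || WIFSIGNALED(status))
            break;
        if (WSTOPSIG(status) == (SIGTRAP | 0x80)) {
            stops++;        /* a real tracer would handle the call here */
            sig = 0;
        } else {
            sig = WSTOPSIG(status);     /* pass other signals through */
        }
    }
    return stops;
}
```

Each traced system call normally produces two stops, one on entry and one on exit, so the count is at least twice ncalls.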

Kernel threads don't have an associated host process.

Context switches are done with a longjmp from one kernel stack to another.

Traps into the kernel are done in the same way as system calls. The interrupt to the user process will be intercepted and cancelled by the kernel process, which will execute the appropriate handler. The performance increase will be about the same as for system calls.

The benefits of this are many:

  • UML should be noticeably faster.
  • jail mode is now automatic, so kernel memory protection is always on, and is very much faster than it is currently.
  • processes now have the normal full 3G of address space. This makes honeypot mode automatic (and much more believable since the UML kernel will no longer be visible to processes). Applications which require a lot of address space and which may have bombed out on UML as a result will now run.
  • The UML kernel now has a full 3G of address space. This makes it possible for UML to conveniently have lots of physical and/or virtual memory since it can all be mapped at the same time.
  • UML can now be a normal process, rather than the strange, statically linked, oddly loading process it is now. UML can be debugged with 'gdb linux', rather than the ptrace proxy arrangement it uses now. gprof and gcov should work much better as well.
  • Many, many kludges just vanish. The shared remapping of kernel text and data is gone, as is switcheroo. Context switching is dead simple, and obviously race-free. Signal handling is much simpler, since there is no more special handling of the startup of a new process. As already mentioned, the ptrace proxy and all its nastiness goes away.
Since this requires changes in the host, and the world may not patch all of its kernels immediately, I'm planning for UML to be dual-mode for the foreseeable future. It will detect the support and use it if it's there. Otherwise, it will fall back to the traditional tracing thread. In the code, I'm planning on separating the code that supports the two modes into separate subdirectories. This provides a reasonable way of splitting out code that's been somewhat messy for a while, and separates the two pieces of code from each other. It will also make it easy to configure a single-mode UML, in case you know that you will be running in one mode, and you don't want the code for the other mode compiled in.

So, what's the current status? I've added the new PTRACE_* options, except for PTRACE_CLONE, to the kernel. The arch-specific support is only in UML right now. I'll add it to i386 when UML is up and running on it. As a result, I'm debugging the new UML inside an old UML with the ptrace additions. When it's working, I'll un-nest UML, fix the i386 kernel, and see how much faster it is.

The new UML has gotten to the point of starting to fault in init. Right now, exec is setting up init's stack and some of its data, forcing a few of its pages to be faulted. It has not yet entered userspace. It has created the usual collection of kernel threads, initialized them, and switched between them. So, the in-kernel portions of this code appear to be working well.

22 Aug 2002
I decided to flush out all the pending changes I had, so I also released new versions of the test suite and tools yesterday.

With 2.4.19 more or less out of the way, I decided to start into a comparatively large and risky project. So, I decided to start working on the changes needed to move the UML kernel into its own address space. This was discussed on the kernel list a few weeks ago. As things stand now, this can't be done without some extra support from the host. I'm implementing this stuff in UML now, and will move it into the x86 kernel later.

21 Aug 2002
Not too much to report recently. I released -53 today, which contains a few small fixes. 'jail' mode works better now with a couple of crashes fixed. I also cleaned up the ubd driver's error reporting.

With no complaints about the larger changes that I made recently, I think it's time to release the 2.4.19 UML. I was holding off on this until those changes got banged on some and I was happy that they didn't break anything. This now seems to be the case.

OK, I updated UML to 2.4.19 and released the patch. The full release will come after I've had a chance to bang on it some.

12 Aug 2002
Linus released 2.5.31 yesterday, again without UML. However, in stark contrast to earlier releases, the two patches to generic code were applied. This gives me some hope that Linus will merge the main piece of UML at some point.

I tried getting UML running on my 2.2 box on Friday before leaving for a camping trip. The intent was that I'd leave some stress tests running on UML for a couple of days and see what new and interesting bugs popped up. It turned out that the bugs popped up in the process of building and running UML. I abandoned that plan until I fixed the problems I saw.

I knocked those bugs off today, and released the results in -52. I'll be trying the 2.2 thing again.

Then, it'll be off to the 2.5.32 races, with yet another patch going to Linus.

I've got 2.4.19 sitting in the background. I want the current UML to get banged on a bit before I upgrade it to 2.4.19.

9 Aug 2002
UML development has been cleanup lately. I found and fixed an "I'm tracing myself and I can't get out" panic. This induced me to understand the timer flags, which let me simplify them enough that I can understand what's going on without lots of deep thinking.

Having done that, I merged sig_handler and irq_handler_common, which were basically the same, except for one line. Then, I merged in syscall_handler, so I went from three copies of the kernel entry and exit code to one.

I also discovered a few more bugs in my stress testing. These are fixed, along with some cleanups caused by me trying to build UML on 2.2.

So, this is all released in -51.

5 Aug 2002
There is hope on the front of getting UML into the 2.5 tree. Linus has merged the changes to generic code, and most of the linkage.h patch. The stringification I sent him apparently breaks on some compilers, so he backed that out. I copied the syntax I used from some gcc documentation, which I strangely can't find any more. What I do find is somewhat different.

However, he hasn't merged the main body of UML yet. So, we'll see what's there when 2.5.31 comes out.

There's been some discussion on the kernel list about making UML faster. Alan floated an idea for making jail mode faster. I pointed out some flaws in it. I floated my ideas about making address spaces accessible from userspace, and Alan dinged them in turn. Read all about it over at usermodelinux.org.

In a separate development, Ingo pointed out that process context switches are much faster than signal deliveries. This opens up the possibility of implementing jail mode by having the tracing thread also run the kernel side of things. This will give us a fast jail mode and speed up system calls. The only thing that's missing is the ability of the kernel process to change the address spaces of the other processes.

In development news, I discovered a race which produced the dreaded "I'm tracing myself and I can't get out". This prompted a cleanup of the timer code. 2.4.19 having just come out, I was nervous about releasing the 2.4.19 UML with something with potentially subtle consequences like that. So, I've been testing more heavily than usual. I added a test to the test suite which runs the rest of the suite with jail mode enabled. The reason for that is that jail is sensitive to bugs in how the timer is handled.

2 Aug 2002
Linus released 2.5.30 and UML again got dropped on the floor. So, I'm in for another round of patch generation.

2.5.30 was fairly easy. There were some block layer changes which broke the compile, but it was simple enough to figure out what had happened and fix the ubd driver.

30 Jul 2002
-48 is out. I reproduced the crash that mistral saw when he killed console xterms. I turned on slab poisoning and it became 100% reproducible. It took me a surprisingly large number of attempts to fix it. So, that fix is in, plus the fix for the hostfs compilation failure that everyone seemed to think was a typo.

28 Jul 2002
I announced UML 2.5.29 to the kernel list and sent UML to Linus again for him to ignore. I had redone include/linux/linkage.h in order to clean it up and remove it from the list of generic files that UML changes. Keith Owens noticed that and suggested an improvement to it, which I will add the next time I send UML in.

I got hppfs to the point of allowing an outside script to generate dynamic /proc content and to filter the real /proc file. That's the bulk of the functionality that's needed. Then, it will be time to flesh out the remaining file and inode operations so that everything works as expected.

27 Jul 2002
Linus released 2.5.29 last night and guess what's not in it. Right.

The patch itself contains no particular arch changes, so it looks fairly simple. I'll probably release the UML 2.5.29 today. I released 2.4.18-46 today. The main feature is the start of hppfs, the Honeypot procfs. It's far from done, but works enough to allow proc files inside UML to be replaced with versions on the host. This by itself is enough to make a fairly convincing honeypot. I still have to complete the file operations, so that everything works as expected. I also have to add support for allowing something on the host to generate files, either from scratch or from the contents of the UML proc file.

26 Jul 2002
I sent the latest UML to Linus. We'll see if it gets in 2.5.29. If not, I'll just keep sending it until he gets sick of the bandwidth that it's consuming...

In actual work, I started implementing the Honeypot procfs filesystem. This is designed to cut the heart out of the problem of making a UML honeypot look like a physical system. The largest part of the problem is stuff in /proc. Look in places like /proc/cmdline, /proc/interrupts, and /proc/cpuinfo to see why.

My plan is to implement another filesystem which creates a poor-man's overlay over the real /proc. It will have two sources of information - the real /proc and a shadow /proc on the host. When the user inside UML looks at a file in /proc, this new filesystem will check for the corresponding file in the shadow hierarchy on the host and use that if it's there. Otherwise, it will just call into the UML /proc. So, by sticking stuff in the hierarchy on the host, the admin will be able to override selected pieces of the UML /proc.
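
The lookup order described above can be sketched in ordinary userspace C. This is only an illustration of the "shadow first, real /proc otherwise" rule, not the hppfs code: the helper name resolve_proc_path, its parameters, and its return convention are all assumptions made for the example.

```c
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: pick the file that should answer a read of
 * /proc/<name>. If shadow_root/<name> exists on the host, use it;
 * otherwise fall back to real_root/<name>.
 * Returns 1 if the shadow copy was chosen, 0 for the fallback,
 * and -1 if the resulting path did not fit in out.
 */
static int resolve_proc_path(const char *shadow_root, const char *real_root,
                             const char *name, char *out, size_t out_len)
{
    int n = snprintf(out, out_len, "%s/%s", shadow_root, name);

    if (n < 0 || (size_t)n >= out_len)
        return -1;
    if (access(out, F_OK) == 0)
        return 1;   /* the admin supplied an override */

    n = snprintf(out, out_len, "%s/%s", real_root, name);
    if (n < 0 || (size_t)n >= out_len)
        return -1;
    return 0;       /* no override: use the real /proc entry */
}
```

Called with a shadow directory that has no cpuinfo file, the helper fills out with the real /proc/cpuinfo path; dropping a file of that name into the shadow directory flips the result without touching anything inside UML.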

25 Jul 2002
2.5.28 is out. The big news is that UML again got dropped. So, I get to update and send it in again. The big change this time around is all the irq changes. They are going to cause this UML update to be more troublesome than usual. Global irq disabling and enabling is gone, which overall is a good thing. I would hate to have to implement them in UML. On an SMP UML, that would require sending an IPI around to all the other processors, and not continuing until they had all confirmed that they had fiddled their interrupts appropriately. That would absolutely suck, performance-wise.

Rather than implement that, I would have just never allowed interrupts to be handled by any processor other than processor 0. Then local interrupt enabling and disabling would be equivalent to global enabling and disabling. With the global versions gone, it will be possible to distribute interrupts around the processors of an SMP UML.

In other news, hostfs in 2.5 is broken. I fixed a few bugs, but there are more remaining.

23 Jul 2002
-43 is out. The big news here is that it has SCSI support. Currently, there is only the scsi_debug driver, which operates in memory. I'm scheming to split the file I/O code from the ubd driver, leaving behind an interface to plug it back into, and doing the same to scsi_debug. This would allow the ubd file and COW code to be plugged into the SCSI subsystem, and allow the in-memory device from scsi_debug to be used as a ubd device.

Another cute thing is /proc/mconsole. It is created if UML is booted with an mconsole notification socket. Anything written to it will be sent out to whatever is listening to that socket as a notification.

22 Jul 2002
I sent the latest UML to Linus. We'll see how this one goes.

21 Jul 2002
Linus released 2.5.27 yesterday and it did not contain UML. Oh well.

So, I'm having another go at it. I released -42 today, which is mostly cleanups and bugfixes to -41. There are also some driver build changes from Henrik.

I dropped this into 2.5.27 without too much trouble. So, I'll package it up and send it in to Linus again tomorrow.

18 Jul 2002
I released 2.4.18-41 today. This included a new way of setting up xterms which gets UML out of the business of allocating pseudo-terminals for them, which has been a source of trouble for a while. As a side-effect, the terminal emulator is now configurable. As another side-effect, it should now be possible to easily run and control the UML debugger from a script, like the UML test suite.

I also merged a bunch of the build cleanups from 2.5. Some of my Makefiles were seriously obsolete, and now they are slightly less so.

I made a tools release, which adds a jail kit. This contains the tools needed to run UML as a non-privileged user inside a tight chroot jail.

17 Jul 2002
I announced 2.5.26 to the world today. Having done that, I split the patch into two pieces, one containing changes to generic files and one containing everything under arch/um and include/asm-um. This is the way Linus said he liked to get ports when I asked him about it in Ottawa.

Missing from this for now are hostfs, the tty logging patch, and the page validation patch. I'm going to send hostfs in separately. The tty logging patch will require advice and consent from whoever owns the tty driver (and is willing to admit to it). And the page validation patch will have to go in separately as well, considering the reaction to it when I brought the topic up on LKML.

It's all off to Linus (and the generic piece was cc-d to LKML). We'll see if this fares any better than my previous attempts.

16 Jul 2002
Linus released 2.5.26 today, so I updated. No problems with it. There weren't any arch changes in this version.

15 Jul 2002
2.5.17 and 2.5.18 required some include tweaking because the tlb flushing stuff moved out of pgalloc.h. Other than that, there were no problems.

2.5.19 moved some stuff from arch code into generic code, which is usually a good thing.

2.5.20 renamed the swap entry access macros, which was no big deal. 2.5.21 also posed no problems.

2.5.22 introduced some large build changes which broke some of my more antiquated Makefiles. After some modernizing, they started working again. There were also some kdev_t changes which broke hostfs.

The hotplug CPU changes in 2.5.23 required some minor changes. 2.5.24 did nothing but move sys_pause from arch code to generic code. 2.5.25 rearranged page fault handling a little, made a small build change that had spectacular results, and changed sys_sched_yield to plain old yield. After I got it to build, there turned out to be a division by zero problem caused by the HZ changes. jiffies_to_clock_t is defined as

                # define jiffies_to_clock_t(x) ((x) / (HZ / USER_HZ))

which blows up when HZ < USER_HZ, because the integer division HZ / USER_HZ then truncates to zero. UML HZ was 52, while USER_HZ is 100. I contemplated fixing this by turning the nested division into a multiplication, but that would have just made the calculation vulnerable to overflows. So, I just bumped the UML HZ up to 100.
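
The arithmetic is easy to check in userspace. The sketch below is not kernel code: it passes HZ and USER_HZ in as parameters, and the helper names are made up for the example. It shows the inner quotient collapsing to zero for HZ = 52, and the multiply-first form that avoids the zero divisor at the cost of a product that can wrap around.

```c
#include <limits.h>

/* Inner quotient of the macro: HZ / USER_HZ in integer arithmetic. */
static unsigned long ticks_ratio(unsigned long hz, unsigned long user_hz)
{
    return hz / user_hz;
}

/*
 * Multiply-first alternative: (x * USER_HZ) / HZ.
 * No zero divisor when HZ < USER_HZ, but x * USER_HZ can wrap.
 */
static unsigned long jiffies_to_clock_t_mul(unsigned long x,
                                            unsigned long hz,
                                            unsigned long user_hz)
{
    return (x * user_hz) / hz;
}

/* True if x * user_hz would not fit in an unsigned long. */
static int mul_would_overflow(unsigned long x, unsigned long user_hz)
{
    return user_hz != 0 && x > ULONG_MAX / user_hz;
}
```

With hz = 52 and user_hz = 100, ticks_ratio() yields 0, which is the divisor the original macro ends up using. Setting HZ equal to USER_HZ makes the ratio exactly 1 and sidesteps both problems.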

14 Jul 2002
On to 2.5.8. Nothing major, a couple new system calls, and some header file rearrangement. It compiled and booted without too much trouble.

2.5.9 was no problem at all. I fixed a glitch or two, built it, and booted it. So were 2.5.10 through 2.5.13.

2.5.14 was more interesting. The 2.5 scheduler bug showed up here with a vengeance while it hadn't with anything previous. The problem is that the O(1) scheduler calls the arch switch_to with interrupts blocked. With UML, this means that a SIGIO can arrive before SIGIO is forwarded to the incoming process, and it will be pending in the outgoing process until it is scheduled again. This may never happen because that SIGIO could be a disk IO completion that is necessary for anything to run again. So, I applied my old fix, which makes the problem disappear. This checks for pending SIGIO after the forwarding, and does an explicit kill(next_pid, SIGIO) if there is one pending.

tmpfs fails to mount because the superblock allocation is somehow failing with -ENOMEM. And the whole boot is a mess, with -EIO appearing all over the place and /proc failing to mount for some reason.

It turns out that the filesystem was messed up. And it turns out that a little message that I was seeing was one I put in to make sure I checked some code that had never run. It started running now because of changes elsewhere in the kernel, and it turns out to be buggy in such a way as to cause data corruption. So, with that fixed, I get nice clean boots, except for a panic caused by the failed tmpfs mount.

And that turns out to have been caused by an incomplete merge of the memory stats change. Those stats are now kept in generic code rather than in the arch. tmpfs calculates the number of available inodes from the number of pages of available memory. I forgot to delete the UML declaration of totalram_pages, which meant that the generic value stayed at zero. Fixing this eliminates the mount failure and the panic.

With that all settled, I move on to 2.5.15.

2.5.15 is uneventful. The only thing of interest was a little signal delivery bug fix which I had spotted when it was first sent to LKML and which I already had in my pool. After the patch went in, UML just compiled and booted.

2.5.16 made jiffies go away as a normal variable. The other arch link scripts all create jiffies as a symbol aliased to jiffies_64. This booted after adding that to the UML link script and removing the CONFIG_SMP from around the definition of mmu_gathers.

13 Jul 2002
Continuing whacking away on 2.5.5. After fixing page table things that I had messed up, it finally booted.

2.5.6 adds a new system call and changes the interface to blk_ioctl. Fixing these up, and disabling jffs2, which was broken, resulted in a working UML.

2.5.7 added sys_futex and applied the fs.h crapectomy to a bunch of filesystems. nfsservctl is now apparently an optional system call, so the system call table needed fixing for that. With those fixes, it builds and boots.

12 Jul 2002
On to 2.5.5. This one was mostly page table changes. There were interface changes and some new interfaces added. Someone decided it was a good idea to get rid of the little caches for pgdirs and page tables. I agree, since specialized little caches like that hold on to memory that should be available to the rest of the system.

I also discovered that hostfs needed to be updated because of the fs.h crapectomy that happened in 2.5.3. I don't know why I didn't see this before. Basically, in order to eliminate the header file horror show in fs.h caused by the inode union needing to have an entry for every possible filesystem, the filesystem-specific data now includes the inode rather than the inode containing the filesystem-specific stuff.
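
The inversion can be sketched with two toy structures. These are simplified stand-ins, not the kernel's definitions: the fields, the host_fd member, and the accessor are illustrative assumptions. The point is only that the filesystem's private structure embeds the generic inode, and the private data is recovered by pointer arithmetic from the embedded member.

```c
#include <stddef.h>

/* Simplified stand-in for the generic inode; the real one is much larger. */
struct inode {
    unsigned long i_ino;
};

/* Filesystem-private data now embeds the generic inode (fields illustrative). */
struct hostfs_inode_info {
    int host_fd;
    struct inode vfs_inode;
};

/* Recover the enclosing structure from a pointer to one of its members. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical accessor: generic inode in, hostfs private data out. */
static struct hostfs_inode_info *HOSTFS_I(struct inode *inode)
{
    return container_of(inode, struct hostfs_inode_info, vfs_inode);
}
```

Generic code only ever sees struct inode, so fs.h no longer needs to know about any filesystem's private layout.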

11 Jul 2002
I decided to get on the stick and start getting UML going with 2.5. My 2.5 pool had 2.5.3-pre5 in it, so that's where I started. I grabbed all the patches from there to 2.5.25 and started whacking away.

I got 2.5.3 and 2.5.4 compiled and booting. I put off further testing in the interest of making progress through 2.5. I am generating UML patches along the way, and I'll exercise them more heavily later on.

The major work needed for 2.5.3 was the block driver. It needed updating for the bio changes. Nothing major, except it confused read and write requests, resulting in a trashed COW file when it tried reading the superblock. I didn't realize this for a while, and was trying to figure out why it wasn't reading the superblock even after I fixed the read/write confusion.

2.5.4 changed how task_structs are allocated. They used to be at the bottom of the kernel stack. Now, there is a minimal thread_info struct there, and the main task structure is allocated by kmalloc just like everything else. This took a fair amount of time to fix enough so that UML would compile. Once that happened, I had to chase down a few bugs that assumed that the current task_struct was at the bottom of the stack.
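
The assumption that broke can be illustrated as follows. This is a sketch, not the kernel's code: STACK_SIZE and the function name are assumptions, and it presumes the stack is aligned to its own size. Masking any address inside the stack gives the base of the stack, which used to be the task_struct itself and now holds only a small thread_info pointing at it.

```c
#include <stdint.h>

#define STACK_SIZE 8192UL   /* assumed: two 4K pages, aligned to its size */

struct task_struct;         /* opaque here; allocated elsewhere */

/* Minimal structure that now lives at the bottom of the stack. */
struct thread_info {
    struct task_struct *task;
};

/* Round any address within a stack down to the stack's base. */
static struct thread_info *thread_info_from_sp(void *sp)
{
    return (struct thread_info *)((uintptr_t)sp & ~(STACK_SIZE - 1));
}
```

Code that cast the masked pointer straight to a task_struct has to go through the task field instead, which is the kind of bug that had to be chased down.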

10 Jul 2002
I started releasing rapid-fire patches. I just finished with -40. There's been some more stuff moved under the OS interface. Also a bunch of fixed bugs. I discovered that some gdb features that I thought were working weren't. So, I fixed them. I also discovered that closing a terminal at one end doesn't cause a SIGIO at the other. So, I added that to the SIGIO emulation.

6 Jul 2002
Harald Welte is now winging his way back to Germany. He had been staying with me for the week after KS and OLS. I fear his most lasting impressions of lovely New England will be the insects... Oh well.

I decided to start thinking about making UML OS-portable so that Chandan Kudige can start merging bits of the Windows port. I created a new directory to hide Linux specific code in and defined an OS-independent interface for it (and other OS ports) to implement. It's fairly rudimentary right now, containing a handful of file and process operations, but the overall intent should be fairly clear.

So, with that, I released -37. It consists almost solely of the code reorganization (the exception being an updated config.release).

2 Jul 2002
I'm back from the Kernel Summit and OLS. In KS news, it turns out that Linus likes UML. Alan apparently clued him in that people are using UML for real work, and they care about performance. He has no problem with exposing address spaces to processes, nor with having UML doing an address space switch in conjunction with a signal delivery. This would essentially be sigaltstack_mm, with an mm switch as well as a stack switch. This would have a number of advantages, including making 'jail' mode trivial, cleaning up UML, speeding up UML context switches, and giving UML a larger virtual address space.

In OLS news, my talk went pretty well. I described the major things that I want in order to make UML run better, and there were no arguments with them. They were all non-controversial, so it looks like they're going to happen. The downside is that there is a 2.5 function freeze for Halloween this year, so it all needs doing by then.

I departed from tradition slightly and annotated the slides before the talk. Since this was a brand-new talk (so brand-new that I finished it 15 minutes before the talk, which gave me just enough time to get over to the conference from the hotel), I did the notes so I'd have some idea what I should cover on each slide. I couldn't look at the notes during the talk, but just doing them helped me remember what I wanted to say for a given slide. This also has the advantage that they can be put on the UML site immediately, rather than when I get around to annotating them. So, here they are.

A few other UML tidbits:

  • There was lots of demand for a 2.5 UML. This, combined with the Halloween deadline, means I need to get moving on this somewhat soon.
  • The UML swsusp patch apparently sort of works. This surprised me, since the last news I had was that it didn't resume properly. So, I probably should look at it and see about integrating it into UML.
  • Michael Richardson of the FreeS/WAN project did a talk on using UML as a testbed for regression testing. It was fairly interesting and well-received. He demoed it with four UMLs running on a FreeS/WAN server someplace outside the conference. It went well, although the connection was a bit slow. Richard Briggs was in the audience and was running a six UML testbed on his laptop, which he offered as a demo when the connection to the external demo was looking a bit iffy.
  • Bert Hubert almost used UML to demo something during his talk. He ended up borrowing another laptop and using it instead, since that setup would be somewhat more authentic than using UMLs. Oh well.

21 Jun 2002
-33 turned out not to build unless CONFIG_TTY_LOG was enabled. So, I released -34 a few hours later with the fix.

Today, I'm going to release -35 with gdb stack switching implemented. This lets you detach from the thread that's currently in context, attach to an out-of-context thread and look at its stack. This makes it a lot easier to debug deadlocks since you don't have to manually reconstruct stack traces from hex dumps of the stacks. I, of course, have gotten used to the hex dump approach, so I didn't see what the problem was. Other people did have problems with it, including Peter Braam and Cluster File Systems, who offered a small bounty for this feature. And, now that it's here, I do have to admit that it's pretty handy.

This will be the last UML release for a couple of weeks since I'm getting ready to head up to Ottawa for the Kernel Summit and OLS. Those will be next week, and I've got Harald Welte visiting the week after, which means that work might be thin for that week as well.

18 Jun 2002
Last night, I added /proc/exitcode, which allows a process to set UML's exit status. This is useful for one-shot UMLs, which run one thing and exit. It lets the thing running inside UML export its exit status to the outside caller.

So this will be -33. It'll be available as soon as I get the web site build process all put together again.

17 Jun 2002
I spent a couple fruitless days looking for whatever is causing JVMs to crash under UML. It appears to be related to signals somehow, since the thread that segfaults is always returning from sending another thread a SIGRTMIN. It first appeared that the other thread was provoking the crash, since there was always a context switch from the sending thread to the receiving thread before the sending thread came back into context and returned from the kill. However, I suppressed context switching in that case, and it still died on return from the kill.

The code that's segfaulting appears to be generated by the JVM rather than being the JVM itself. It is one of very many small pieces of code pointed to by a large table. Each of these is about 6 instructions long, starting by dereferencing a pointer, doing one or two instructions worth of work, calculating an index, and jumping to whatever code is pointed to by that entry in the table. The crash is coming at the very start of one of these blocks. The register that it's dereferencing contains zero, which is bad.

I decided to put that off for the moment and do some other things. So, I redid the web site build. My major complaint was the speed at which it rebuilt. The big culprit was all of the changelogs that have accumulated. They are all generated from a large and growing XML file, with each entry getting its own web page, and requiring the processing of the entire file to generate. Needless to say, this is an O(n^2) operation, and I was well within the quadratic regime.

So, I redid it, adding some infrastructure that allows rebuilding of files only when they've really changed. So, the fact that changelog.xml changed does not cause the rebuilding of all of the changelog-*.html files, even though their source file changed.

I also tidied up the dependencies and reorganized the pool itself to make it a bit cleaner.

In actual development news, I integrated the honeypot tty logging patch into the main pool, along with a little patch I got from Geoffrey Hing. I also added the ability to log to a preconfigured file descriptor. This is for the benefit of chrooted UMLs, so they can log to a file outside the jail.

11 Jun 2002
Hmmm, long time no diary entries. Well, I've been busy. I've started getting regular offers of contracts to do UML-related things and I've been accepting some of them. This slows down the pace of UML development, unfortunately, but it does tend to fatten up the old bank account, which has to be considered a good thing.

Recent development has featured a couple common threads. One is moving UML into its own memory. Before, UML would allocate a physical memory region which did not include its own binary. This effectively meant that UML text, static data, and heap were not in physical memory. This caused problems for the swsusp effort on UML because it wants to copy physical memory out to disk. Fixing that took several patches to get right, but it seems to be fine now.

Another theme has been openpty. This started with some mysterious segfaults seen by a few people which ultimately were tracked down to calls to openpty that UML was making. openpty has a larger than usual stack frame, so when it runs on a kernel stack, it overflows it and corrupts whatever lies just below.

My first attempt to fix it put it in a separate thread so that the pages it used would be COWed and it wouldn't change any UML memory. This was stupid because UML memory is mapped MAP_SHARED exactly to defeat COWing. Attempt number two involved having the tracing thread run openpty. This was good because the tracing thread has a proper stack. Unfortunately, openpty can call malloc, which gets converted into kmalloc, which is a very bad thing to call from the tracing thread. So, my current code goes back to the original mechanism, except that a larger than usual stack is allocated in this case. This should be OK now.

With these things settling down, I'm about ready to release -32, and I'm thinking that a full release would be good to do soon, as well. Maybe later this week or over the weekend.

17 May 2002
A week or so ago, Steve Freitas kindly assembled and sent me a nice little SMP box so I could chase the host SMP bug that UML is exercising. I got around to fitting it into my environment - it caused my network to outgrow my crossover cable, so I invested in a switch and a bunch of cables. I've got it on the net with a serial line console running to my desktop box, which is all running fine.

UML also runs fine on it, which is disappointing. It'll be hard to find the host bug if I can't reproduce it.

In development news, I did a bunch of work on ptrace, which forms the bulk of -26, which is released today. Watchpoints now work in gdb inside UML. Kernel watchpoints don't work yet, but they will soon. I fixed a couple other ptrace bugs, including one which could be used to break out of UML.

11 May 2002
The big UML news is that it is now self-hosting. This is more a demonstration of UML maturity than something that's very useful. It's still nice to be able to do, though. There's a description of how to do it here.

UML development has been concentrating on fixing bugs, as usual. The last few releases have mostly consisted of small fixes.

The test suite received an overhaul. It is now willing to build UMLs according to the needs of the tests. So, tests can now specify how they want UML to be configured, and the suite will build a UML if necessary.

28 Apr 2002
I'm back to knocking items off my todo lists. At this point, I'm down to about 75. The recent victims are mostly small items that had accumulated with some other bugs that people confirmed had already been fixed.

I released -21 today after getting rid of most uses of tracing_cb, which UML threads use to request the tracing thread to create processes for them. Having UML threads do it themselves exposes that code to the UML gdb, as well as getting one step closer to having miscommunication with a helper hang only the thread that started it rather than the whole UML. To get there also requires that input from helpers be handled asynchronously rather than synchronously as is the case now.

25 Apr 2002
This was a fairly lazy week. I went down to West Virginia at the invitation of David Krovich to give a talk at WVU. That went fairly well - the chairman of the CSEE department was apparently impressed by the number of students who attended. After dinner, I gave another talk at a MORLUG meeting (MORLUG == Morgantown LUG) which consisted of me firing up UMLs and demonstrating various neat things you can do with it. This also seemed to go well.

In extra-curricular activities, we tried to go hiking on Sunday, but got rained on heavily. It looks like nice country if only the clouds would get out of the way so you could see something.

In UML news, I spent some time fixing the iomem support. It was broken in such a way that I had a hard time believing it ever worked. What I now think is more likely is that the VM system changed in a way that broke iomem, but no one noticed. The problem was that the VM system deals largely with page structs instead of raw page addresses and the iomem regions had no sane mem_maps, so they had no page structs.

I fixed this by changing the infrastructure to allow for segmented physical memory consisting of regions which have their own separate mem_maps. This will allow for plugging and unplugging of iomem regions. I was hoping this would work for physical memory regions as well (plugging anyway; unplugging is harder), and after a bunch of failed experiments, decided that this wasn't going to be.

After getting this working, I decided to release 2.4.18-19. It also contains James McMechan's partitioned device support and a bunch of smaller bug fixes that were noticed by various people.

15 Apr 2002
More bug-bashing. My lists contain a total of 93 items, but a large number of them are about to die. The umlgdb expect script from Chandan Kudige will knock off the two related to reloading module symbols. When I verify that UML can boot as a diskless client, that will get rid of five more. James McMechan's ubd partition patch is currently accounting for six items. I should be able to knock all those off in the next couple of days.

I released an RPM last Tuesday to get the accumulated changes out to a wider audience. I think I'm going to start releasing an RPM every two or three weeks from now on. 2.4.x releases are too far apart for me now.

6 Apr 2002
I spent the last few days on a bug-smashing spree. To-do items had accumulated at an alarming pace over the last couple weeks, so I decided to whack away at them a bit. I had over 110 items on my various lists, so I knocked off as many as were easy to kill. I now have 99 items, so I got rid of more than 10 of them.

Prominent among them are

  • floating point registers not being available to gdb inside UML or stored in core files
  • hostfs not being able to create unix sockets
  • the daemon transport now gets its MAC from uml_switch
  • if the umid is set on the command line, it is put into host process names and into xterm title bars

There were some small patches from mulix, Sapan, and Daniel Phillips, which all did useful things.

With that, I released 2.4.18-14 and uml_utilities_20020406. This may be the basis of another full release (with an RPM instead of just a patch). I need to do that soon since the 2.4.18 UML is getting fairly old at this point.

3 Apr 2002
In the nearly two weeks since the last entry, I've pretty much knocked off the console flow control bug (the one remaining piece is to make sure that it works correctly when ptys deliver output SIGIO) and the init hang on older machines. These are two of the oldest UML bugs.

The init hang is caused by init executing a cmov on a processor that doesn't support it. What I ended up doing (at the suggestion of Alan Cox who saw this on one of his boxes) is to detect cmov support by looking at /proc/cpuinfo, then panic if init gets a SIGILL on a cmov.
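
A sketch of what such a check might look like, assuming it works by scanning the "flags" line of /proc/cpuinfo for a whole-word match. The function name is made up for the example, and the actual UML code may well do this differently.

```c
#include <string.h>

/*
 * Return 1 if a /proc/cpuinfo-style "flags : ..." line lists the
 * given feature as a whole word, 0 otherwise.
 */
static int flags_line_has(const char *line, const char *feature)
{
    size_t len = strlen(feature);
    const char *p = strchr(line, ':');

    if (p == NULL || len == 0)
        return 0;
    p++;
    while (*p != '\0') {
        /* Skip separators, then measure the next token. */
        while (*p == ' ' || *p == '\t' || *p == '\n')
            p++;
        size_t tok = strcspn(p, " \t\n");
        if (tok == len && strncmp(p, feature, len) == 0)
            return 1;
        p += tok;
    }
    return 0;
}
```

Matching whole tokens rather than using a plain substring search avoids false positives from longer flag names that happen to contain the string being looked for.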

In other news, I got all of Bill Stearns' bootable filesystems over to SF, so the links to them are likely to work. They're not on ftp.nl.linux.org yet though.

22 Mar 2002
I've decided that I'm out of major design changes I need to make and can now concentrate on knocking items off the todo list. The one published on the site has 44 things on it. I also keep a todo mail folder containing pieces of mail that describe something well enough that I can keep track of it. That folder contains 54 messages right now, so I've got almost 100 things to do. Some are duplicates, and some are already fixed and I haven't figured it out yet, so the actual number is somewhat smaller.

My latest run of the test suite succeeded in booting all of the filesystems available from the UML site, which is probably the first time that has ever been true.

15 Mar 2002
I discovered a remnant bug. 'strace -p' didn't work. Fixing that was easy enough, but I decided to clean up some code while I was in it. That took more work than I expected, since it involved saving state in the thread structure in signal handlers, and when you forget to restore the old state, you get very obscure misbehaviors. I ended up backing out the changes and putting them back in one line at a time before it dawned on me that perhaps restoring the old stuff would be a good plan.

So, I released 2.4.18-7 today. I also released another version of the test suite. This one allows tests to be interactive and for the perl test driver to interact with them.

In separate news, my paper proposal for OLS was accepted. So, it looks like I'm on the hook for another paper.

14 Mar 2002
After much surgery, I finished the pt_regs to sigcontext work and put UML back together. There were surprisingly few bugs to chase once it compiled again. I fixed around three bugs, and haven't found any more since.

I did some more work on the test suite. There is now a kernel build test, for which I wrote a perl mconsole client. This prompted me to add the option of the mconsole driver sending the name of its socket to a socket specified on the command line. This allows scripts to find out where to send mconsole commands without having to parse the boot log. It also tells them when UML has booted to that point.

This test also exposed a bug in the mconsole driver which caused commands not to be NULL-terminated.
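
A minimal sketch of the kind of fix this implies, assuming the command arrives as a plain byte string on a descriptor. The name read_command is hypothetical and is not the mconsole driver's actual function; the point is only that the reader has to leave room for the terminator and add it itself.

```c
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read one command into buf and make sure it is
 * terminated.  Returns the command length, or -1 on error. */
static ssize_t read_command(int fd, char *buf, size_t size)
{
    ssize_t n;

    if (size == 0)
        return -1;

    n = read(fd, buf, size - 1);   /* leave room for the terminator */
    if (n < 0)
        return -1;

    buf[n] = '\0';                 /* the sender is not trusted to do this */
    return n;
}
```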

7 Mar 2002
Bad news. I decided to do a UML-sucks/rocks-ometer on Google. Well, "sucks" beats "rocks" 125-83. To make things worse, the first few "sucks" entries are me.
4 Mar 2002
I made the 2.4.18 announcement on freshmeat today. That is the final piece of this release.

I have known for a long time that UML is susceptible to bus errors in random places in the code if it touches memory that it had previously mmapped, but the host can't back with physical memory. This turned out to be a lot easier to trigger than I expected. I run with tmpfs mounted on /tmp for speed reasons. It has a maximum size of 1/2 RAM by default, which is 128M for me. I got a UML to hit that limit and crash with a bus error.

So, I posted an RFC to lkml asking for comments on my proposed solution which was to add a hook to __alloc_pages to allow the architecture to touch pages before they're returned to the caller. Physical architectures have no need of this because they have a known amount of physical memory and they know it's not going anywhere. UML doesn't, so it needs to assure itself that an allocated page is real and that accesses to it won't fault.
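
As a userspace illustration of what "touching" pages means here, the following sketch reads and rewrites one byte in every page of a region, which forces the host to back each page. The name touch_pages is hypothetical, the region is assumed to start on a page boundary, and the sketch makes no attempt to catch the bus error that a real hook would have to deal with when the host can't supply the page.

```c
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: touch one byte per page in [addr, addr + len).
 * Assumes addr is page-aligned.  Returns the number of pages touched.
 * If the host cannot back a page, the write faults; handling that
 * fault is outside this sketch. */
static size_t touch_pages(void *addr, size_t len)
{
    volatile unsigned char *p = addr;
    size_t page = (size_t) sysconf(_SC_PAGESIZE);
    size_t off, touched = 0;

    for (off = 0; off < len; off += page) {
        p[off] = p[off];   /* rewrite the byte in place, contents unchanged */
        touched++;
    }
    return touched;
}
```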

What I got was an argument with Alan Cox who persistently doesn't understand what I'm talking about. He has appeared to believe that I'm trying to get good behavior when the system is out of memory and good behavior is impossible, that I want the host kernel to allocate memory as soon as the address space is allocated, and a bunch of other things that I don't begin to fathom. Peter Anvin hopped in briefly with what look like similar problems.

So, I implemented what I wanted, and sent the patch in. Hopefully that will clear out the confusion.

In actual development news, I'm converting UML from storing register information in pt_regs structs to storing it in sigcontext structs. What I'm doing is making UML return to userspace by way of host signal returns rather than having the tracing thread teleport it back with PTRACE_SETREGS. The main advantage is that it is guaranteed to correctly restore floating point state, which has worried me for quite a while. It has other advantages, like removing code from the tracing thread, which will make the tracing thread easier to finally eliminate at some point.

1 Mar 2002
I finished the release of 2.4.18 today with the bare kernel and the RPM. No deb since Matt is taking care of that for me now.

I cleaned up the test driver some more. The local configuration information is now stored in ~/.umltest, and it is possible to pass in configuration options on the command line. This is in preparation for automatically running the tests whenever a new patch is ready.

25 Feb 2002
I released the 2.4.18 UML patch today. It was a no-effort update, except that there were a bunch of placeholder entries for the extended attribute system calls.

I'm going to sit on the full release for a few days while I work up a new test suite and harness. Hopefully, I can get something automated set up so that whenever I release a patch, it will get banged on without me having to do it by hand.

24 Feb 2002
I think I'm emerging from signal delivery hell. Everything seems to work again with the exception of a memory corruption problem which may not be new. I've discovered new and interesting ways to screw up.

It turns out that when you leave -ERESTARTSYS or -ERESTARTNOHAND in %eax when leaving a system call, the host kernel will helpfully subtract 2 from eip. This is because a system call instruction is 2 bytes long and the host restarts a system call by executing that instruction over again. This turns out to be a problem when UML itself is restarting one of its own system calls. It delivers the signal first, so eip points to the beginning of the handler, and if you don't put a zero or something in eax, then you will fake the host into subtracting two from eip. That puts it at the very end of the previous procedure, which will try to return from a stack frame that never really existed in the first place. This leads to very interesting debugging sessions.
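
The arithmetic can be shown with a toy model. This is not host kernel code; host_syscall_exit and enter_handler are hypothetical names, the register struct is cut down to two fields, and the sketch assumes the register that has to be cleared is eax, since that is where the restart code lives. The two constants are the kernel-internal values, which userspace headers don't export.

```c
/* Kernel-internal restart codes, not visible in userspace <errno.h>. */
#define ERESTARTSYS      512
#define ERESTARTNOHAND   514
#define SYSCALL_INSN_LEN 2      /* "int $0x80" is two bytes long */

struct regs {
    long eax;
    unsigned long eip;
};

/* Toy model of the host's restart step: a restart code in eax backs
 * eip up over the system call instruction so it runs again. */
static void host_syscall_exit(struct regs *r)
{
    if (r->eax == -ERESTARTSYS || r->eax == -ERESTARTNOHAND)
        r->eip -= SYSCALL_INSN_LEN;
}

/* Pointing eip at a handler while a restart code is still in eax would
 * land two bytes before the handler, so clear eax as well. */
static void enter_handler(struct regs *r, unsigned long handler)
{
    r->eip = handler;
    r->eax = 0;
}
```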

It also turns out that ptrace "knows" that when it has intercepted the start of a system call, eax contains -ENOSYS. It depends on this. When I changed how UML gets the process state before a system call, I changed the value in eax to something else, and ptrace stopped working. It took a couple of days to figure this out.

But things seem back to normal now. My signal delivery exercisers work OK, my stress tests work OK, UML works when compiled on 2.4 and run on 2.2, and vice versa, which -13 didn't. So, I'm releasing -14 now and maybe this signal delivery rewrite will be done.

17 Feb 2002
I submitted a paper proposal for OLS yesterday. This time, I decided to rant and rave about how the host kernel needs to be fixed to better support virtual machines. We'll see how that goes. At least it's different from the standard UML song and dance I've been flogging for the last couple of years.

I redid the UML signal delivery code in order to fix the pthreads hang with the newer pthreads library. I prototyped this in a small standalone process in Incheon Airport (Seoul's international airport) on my way back from Brisbane and just got around to integrating it into UML. It massively cleaned up a bunch of code and opens the way for getting rid of some other problems that have existed for a while.

11 Feb 2002
My flight leaves tonight, so I spent the day wandering around central Brisbane. Ben LaHaise happened to take the same CityCat ferry (which is a system of catamarans used to ferry people up and down the Brisbane River) as me down to the city. He tried to blow up the boat with his umbrella, but, fortunately, an alert crew member stopped him.

Back to the Uni at the end of the afternoon, collect my bags, call a taxi to the airport, and wait for the flight to Seoul. Then it's another 14 hours to JFK and another couple back to CT and then I will be hopelessly confused about what time it is for a couple of days.

10 Feb 2002
LCA is over as of yesterday, and I'm heading home tomorrow.

The slides from my talk (with notes) are available here. It was pretty well attended and it seemed to be received fairly well. LCA always seems to do some innovative things, one of which is to rerun talks that a large number of people regretted missing. The three talks that were chosen this year were a virtual reality talk that a huge number of people wanted to see, Andrew van der Stock's talk on code auditing, and one other that I can't remember. This was good because I was one of the huge number of people that wanted to see the virtual reality talk. However, it turned out that my talk was the number four vote-getter, and this turned out to be relevant when Andrew was nowhere to be seen when the reruns were about to happen.

So, I was happily debugging the ppc UML build with Anton Blanchard when one of the organizers ran up and asked me if I could do mine again. I did so, except I skipped over some of the heavier parts of the talk to leave a good bit of time for a demo at the end.

This went well, except for the panics I got when I tried to have two UMLs mount the same filesystem. I demoed three UMLs running (one Debian, two Slackwares) with most of them (plus the host, I think) displaying on the X server of one of the Slackware UMLs. I showed various other aspects of UML like what it looks like from the host side.

1 Feb 2002
I'm off to Australia tomorrow for LCA 2002. This caps a fairly productive week of UML bug hunting:
  • mistral and blinky started seeing a panic in fork. mistral figured out how to reproduce it and tracked it down to the point where it had something to do with kernel threads reference counting their mm's. This was enough information for me to fix the bug, by giving kernel threads NULL mm's.
  • While I was at ISTS giving a talk, I spent some free time looking at the pthreads problem that people have been seeing for a while. It turns out to be a problem with UML signal delivery. I had assumed that the registers going into a signal handler didn't matter (except for the IP and SP, of course) and that only the stack frame mattered. So, the process registers at the start of a signal delivery are initialized with a set that was captured from a UML thread at boot time. This works fine usually, except that recent pthreads libraries store some thread-private data in %gs, and the %gs value has to be preserved in signal handlers. So, the UML signal delivery mechanism needs to be reworked again.
  • The UML IO hangs that were reported this week were tracked down and found to be a bug in the host's handling of SIGIO. It turns out to be possible, on an SMP host, for SIGIO to be queued to a process after that process has returned from the fcntl that registered a different process as the SIGIO recipient. This breaks UML badly because SIGIOs end up queued, but not delivered, to a process which is out of context and sleeping.
30 Jan 2002
I gave a talk at ISTS yesterday on the UML security work that I did last week. It's available both as html-ized slides and as the original Star Office presentation. These are intended to provide a starting point for anyone wanting to probe this for exploitable holes as well as anyone who's curious about what was done.

I released 2.4.17-10 today. It contains a pile of bug fixes and a bunch of changes which allow a UML patch to come close to compiling in both 2.4 and 2.5 pools. I reverted a change which is causing problems on SMP hosts. I decided that using sockets rather than pts devices to communicate between the IO thread and UML was a good idea because sockets are lighter weight and they're pretty much guaranteed to be supported on the host, whereas there are lots of systems without pts devices. However, there is some difference in how SIGIO is delivered which causes UML to lose interrupts once in a while. The effect is that it seems to hang on boot, but can be made to continue by banging on the keyboard.

MTD is in the configuration now. So, UML supports MTD devices and creating JFFS2 filesystems and mounting them seems to work, although there are some nasty-looking error messages along the way. They don't seem obviously related to UML though.

25 Jan 2002
The security work is largely done. The exception is the lcall prevention fix that's needed on the host. So, I released 2.4.17-9 today. It also contains a number of fixes and patches from other people, the largest being the latest set of James McMechan's ubd changes.
23 Jan 2002
I released 2.4.17-8 yesterday (and 2.4.17-7 earlier this week without dignifying it with a diary entry). I spent a fair amount of time tracking down some old debugging problems. The strace recipe on the debugging page hasn't worked for a while, so I mostly fixed it. strace still doesn't see system calls from new processes until they receive a signal.

I also figured out what was happening with using the gdb under ddd as an external debugger. ddd periodically calls wait on gdb, and when UML attaches it, gdb gets reparented away from ddd. wait starts returning ECHILD, and ddd reacts by shutting down gdb's input, and gdb, in turn, exits. To fix this, I think I'm going to have to have a 'gdb-parent=' switch that will make UML attach to the parent and fake normal return values from wait.

In other news, there have been a couple of articles about UML recently. NewsForge ran one yesterday. This is a followup on the article last week about Linux virtual machines which completely failed to mention UML. Bill Stearns also noticed this article on using UML as the basis of a honeynet.

I've started finishing off the security work needed to make UML a secure root jail. The 'jail' switch now checks for config options which would make the UML inherently insecure and refuses to run if any of them are enabled. Currently, the proscribed options are CONFIG_MODULES, CONFIG_HOSTFS, and CONFIG_SMP. CONFIG_MODULES is fairly obvious. If modules are enabled, then root can insert any code at all into UML, and a nasty root would insert code that execs a shell or something on the host. CONFIG_HOSTFS is forbidden to prevent accidentally providing access to the host filesystem. This is a bit dubious because hostfs is not inherently insecure, and I may relax this one at some point. CONFIG_SMP is non-obvious. Lennert Buytenhek noticed the relationship between SMP and security. 'jail' is implemented by unprotecting kernel memory (by making it writable) on entering the UML kernel, and write-protecting it on kernel exit. If a process were to have two threads, one busy-waiting in userspace, and the other sleeping in the kernel, kernel memory would be writable because the sleeping thread would be in the kernel. So, the spinning thread would wait for that to happen and write on whatever part of kernel memory would let it escape. This will be fixed when UML gets a separate address space for the kernel.

I've stared at /proc and /dev to find devices that provide access to kernel memory. The only two that I spotted were /dev/mem and /dev/kmem. These are disabled with a trick that someone on #kernelnewbies told me about. Access to them is controlled by CAP_SYS_RAWIO, so removing that capability from the bounding capability set makes it impossible for any process to ever get it. So, no process can ever open those devices. Even better, in my limited testing, nothing seems to break badly as a result.
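
The entry doesn't say how the bit gets cleared, so the following is only a sketch of the bit arithmetic, assuming the bounding set is a 32-bit mask with one bit per capability, as it is in 2.4 kernels. The helper names are hypothetical, and the constant is defined locally with the value that <linux/capability.h> gives CAP_SYS_RAWIO.

```c
/* Same value as CAP_SYS_RAWIO in <linux/capability.h>; defined here so
 * the sketch does not depend on kernel headers. */
enum { EX_CAP_SYS_RAWIO = 17 };

/* Return the bounding set with one capability removed. */
static unsigned int drop_cap(unsigned int bound, int cap)
{
    return bound & ~(1u << cap);
}

/* Nonzero if the capability is still in the set. */
static int has_cap(unsigned int bound, int cap)
{
    return (bound & (1u << cap)) != 0;
}
```

Once the bit is gone from the bounding set, no process can acquire the capability later, which is why opening /dev/mem and /dev/kmem stays impossible.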

/proc/kcore looks suspicious, but it's a readonly file that seems to spit out a memory image wrapped in an ELF header, so it's OK, security-wise.

18 Jan 2002
2.4.17-6 is out. There are a lot of driver cleanups and bug fixes in this patch. The IRQ hang is fixed. The default console and serial line channel initialization strings are now configurable.

In other news, NewsForge ran an article about Linux virtual machines which covered everything relevant, including some things that weren't virtual machines, except for UML. They got some complaints about that, and later that day, I got a piece of email from the guy who wrote the article wanting to write a followup about UML. So, it looks like UML will be getting another nice bit of publicity.

13 Jan 2002
I made another patch against the 2.5.2-pre tree today. This one is against 2.5.2-pre11. It has been sent off to Linus so he can drop it in his bit bucket.

In another development, it turns out that the O(1) scheduler breaks UML by holding IRQs disabled across context switches. This results in SIGIO (i.e. from disk IO completions) being trapped in a process that has gone out of context and that can't be woken up until something else notices that the IO has completed, and of course nothing will, because the SIGIO has been delivered to the wrong process.

11 Jan 2002
After spending five days tracking down a swap corruption bug, I discovered that Rodrigo de Castro had explained to me exactly what it was about a month ago. Unfortunately, at the time, I didn't know enough about the swap code to decide whether he was making sense. Of course he was, and I discovered that at the end of the great bug hunt.

So, with that fix, Lennert's latest SMP changes, and a bunch of smaller stuff, I'm releasing 2.4.17-5 today.

Linus saw fit to silently drop UML into the bit bucket again, so I'll make another patch soon and send it in.

5 Jan 2002
I released the 2.4.17-3 and 2.4.17-4 patches this week. The biggest change has been the merging of Lennert's SMP fixes.

I made another attempt to get UML into the Linus tree. This patch is against 2.5.2-pre9 and is the 2.4.17-4 patch. We'll see how well this attempt fares.

30 Dec 2001
I announced the full 2.4.17 release and the 2.4.17-2 patch today. The patch is largely changes to allow UML to calculate current from the stack. This is to make life easier for Lennert, who's trying to get SMP working. It also contains a bunch of fixes for bugs that crop up when host devices get closed from under UML consoles or serial lines.

Iain Young is making a decent stab at a UML/sparc64 port. He's got the boilerplate filled in (albeit with some skeptical comments about their correctness...) and he's trying to get the whole thing to compile.

Linus released 2.5.2-pre4 today with no sign of UML in the changelog. I grabbed the patch to make sure it wasn't there. It wasn't. Grrr. I'll give him another patch and then I'll spam him with it again.

28 Dec 2001
OK, so the diary has taken a bit of a holiday break. I released the 2.4.17 UML patch yesterday. The full release will be forthcoming. I also ported that patch into 2.5.1, created the 2.5.1 patch, and sent it to Linus. Hopefully, he will put it in without my having to resend it too many times. I sent a little note off to LKML announcing this, which got some favorable reaction, both on and off the list. Alan, being his usual taciturn self, sent a reply which read, in its entirety, "Cool".

This release had a lot of accumulated stuff in it. The biggest item is the port channel, which lets you attach any number of UML consoles and serial lines to a host port, at which point you can access them by telnetting to that port. I also redid the context switching mechanism after thinking of a much simpler way of doing it. This should also be much faster since it doesn't involve signals flying around, so context switches are now invisible to the tracing thread.

8 Dec 2001
I released 2.4.15-3 today. It contains some fixes to previous patches and a lot of changes to the gdb support. gdb now sees ^C immediately, rather than an arbitrary amount of time after it's typed. I also cleaned up that code quite a bit. This knocks a couple of items off the todo list. It also sets me up to fix the gdb shell hang, but I'll let these changes gel for a bit before dealing with that.
7 Dec 2001
I got the Sysadmin Disaster of the Month contest going about a week later than I should have. The thing that happened a week earlier was the publication of an article that I wrote for O'Reilly on using UML to simulate and recover from disasters.

This month's disaster is a trashed root superblock. It involves booting UML, zeroing out the superblock, and figuring out how to fix the filesystem. I had some trouble coming up with a good example to use for the contest. So, if I have similar troubles at the end of the month, I might just trash a filesystem, make it available for download, and the contest will be to figure out what's wrong with it and fix it.

4 Dec 2001
Back from Linux-Kongress. Back on US/Eastern. I think I never left it, which made life (the staying awake part of it) difficult in Holland.

As for highlights of the conference, we have:

  • I and the UML project seem to have some name recognition. Everyone I talked to seemed to have heard of both me and UML, which is very cool.
  • The talk went reasonably well. I gave the same talk as I did at ALS. I had forgotten that it was somewhat tailored for ALS (i.e. I tried to avoid talking about stuff that I had talked about at the previous ALS), and it would have been somewhat different if I had prepared a new talk for Linux-Kongress.
  • I met a pile of cool people, like
    • Lennert Buytenhek (who, in recognition of his contributions to both projects, was embarrassed by both me and Rusty by being asked to stand up for some audience appreciation, which must be a new Linux-Kongress record)
    • Roman Zippel (who told me about a couple of ubd driver bugs, one of which I knew of (a subtle rounding error), and one of which I didn't (Greg Lonnon and I went to some trouble to put the COW header in network byte order, but forgot to do the same for the block bitmap (grrrr)))
    • Bruce Walker (who's in charge of the Compaq SSI project, and who asked me clustering questions during my talk, at which point, I (correctly) guessed who he worked for and why he was asking)
    • Fabio Olive Leite (a Conectivite who gave me a nice Conectiva filesystem image a while back, and whose name I unaccented to get it through my XSL processor)
    • Philipp Reisner (who threatened an Alpha UML port a while back, but gave up, so he's less cool than the others :-)
  • The organizers took the speakers back to Amsterdam after the conference to spend the day bumming around the city. We broke up into small groups and went our separate ways. Our group spent much of our time in two smallish cafe-type places just talking about random stuff.
  • The train trip from Schiphol airport (Amsterdam) to Enschede (near the German border) was interesting for a number of people. It was exactly as described for me (2+ hours, direct train, no problem), but some track maintenance in Amsterdam's central station caused subsequent trains to be cancelled, so other people had nightmares involving 5 trains and a bus totalling 5 hours.
26 Nov 2001
I released 2.4.14-6 today in the interest of clearing the decks for the 2.4.15 release. Before leaving for Thanksgiving, I had redone the mconsole protocol to be packet-oriented. This allows a lot more flexibility in what can be done with the protocol. As a result, you'll need the new mconsole client for this release.

While down in CT, I redid the host channel support. It is all much cleaner now, and makes it a lot simpler to knock a bunch of related items off the todo list.

I decided to knock off the 2.4.15 patch today as well. It went cleanly, aside from a ptrace cleanup and a new way of generating /proc/cpuinfo which I had to support. I also put in the file corruption fix after forgetting to and discovering that a boot/halt caused fsck to complain a little.

18 Nov 2001
While I'm waiting for the meteors to arrive, I'm chasing and stomping UML bugs. I cleaned up and released the proxy arp fixes that I did on planes and in airports on my way to Oakland. Before, uml_net would blindly add an arp entry to eth0 and nothing else. This is wrong if there is no eth0, and it's also wrong if eth0 doesn't connect to the local net or if there are other interfaces also attached to the local net. uml_net now looks at the routing table and puts an arp entry on every interface that talks to the local net.
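
The selection step can be sketched as a netmask comparison over the routing table. This is not uml_net's code; the route struct and local_ifaces are hypothetical, addresses are in host byte order, and the default route is skipped on the assumption that it says nothing about which net is local.

```c
#include <stddef.h>
#include <stdint.h>

/* One routing table entry: destination network, its mask, and the
 * interface it goes out on.  All addresses are in host byte order. */
struct route {
    const char *iface;
    uint32_t dest;
    uint32_t mask;
};

/* Collect the interfaces whose directly attached network contains addr.
 * Returns the number of names stored in out (at most max).  An interface
 * listed on several matching routes would appear more than once. */
static size_t local_ifaces(const struct route *rt, size_t n, uint32_t addr,
                           const char **out, size_t max)
{
    size_t i, found = 0;

    for (i = 0; i < n && found < max; i++) {
        if (rt[i].mask == 0)
            continue;                        /* default route: not local */
        if ((addr & rt[i].mask) == rt[i].dest)
            out[found++] = rt[i].iface;      /* addr is on this net */
    }
    return found;
}
```

With no eth0 in the table, nothing matches and no arp entry is attempted, which is the behavior the old hard-coded version lacked.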

I also noticed that slip support wasn't up to date, so I modernized it and cleaned up the code while I was at it. You can now change the IP address of a slip-based interface and the host configuration will be updated just like the other transports.

I added some RT signal support. SA_SIGINFO is now supported, which will hopefully fix some of the strange process behaviors that have cropped up lately. If this fix doesn't do it, I chased down another bug which was causing rt_sigsuspend and sigsuspend to return incorrect values. This was causing the libc sigsuspend to hang, and its process with it. This fixes the pthread_create hang that Greg Lonnon noticed, plus the gdb hang, I think. I haven't checked that yet.

Those fixes are in 2.4.14-3um which I just released. You'll need the latest utilities in order to use the network, since I bumped the uml_net version again.

14 Nov 2001
OK, I'm back from ALS. My talk was on the first day, and it was reasonably well attended considering the somewhat dismal overall number of attendees. It was a half-hour talk, so about an hour beforehand, I took my OLS slides, threw out more than half of them, and updated the rest. That worked out reasonably well, but 30 minutes makes for a very short talk without much detail.

Daniel Phillips' talk was the last of the conference, and was somewhat interesting. He had a pile of raw data that he needed to turn into slides, and all of the KDE presentation tools blew up on him in one way or another. So, in the break before his slot, he grabbed Stephen Tweedie, and they plus me and another guy went off to a local dim-sum place. Daniel and Stephen sweated over Star Office on Stephen's laptop making slides. In the event, they turned out rather well.

I left just after Daniel's talk, so I missed out on some of the socializing that afternoon and evening. It turned out that Daniel and Larry McVoy were talking about his clustering ideas (MetaLinux, or ML), and it occurred to them that UML was not only a good simulation tool for ML, but that it actually implements a good part of what Larry has in mind. I found out about this later, and had a long talk with Larry on Monday, in which he explained his plans. I had heard various mumblings about it, and saw a slide show that Larry has, and remained unenlightened. It turns out that, as far as I can tell, the only way to find out what he's thinking is to have him explain it in person. Anyway, I became enlightened after our chat, and it looks like this could be a whole new area that UML could branch into.

In actual development news, I fully released 2.4.14 today. En route to Oakland, I fixed uml_net so it's smarter about doing proxy arp. It figures out what devices are connected to the local net and only sets proxy arp on those. As a side-effect, if the host is totally isolated, then you don't get scary-looking error messages when it tries to set proxy arp on eth0 and it turns out not to exist.

This happened to me at OLS when I tried to demo it after my talk. I got this nasty message which convinced me that the network all of a sudden didn't work, and I was all apologetic and had no idea what happened. In reality, the network was fine, and I could have demoed it if I had retained a bit more presence of mind.

This isn't in the 2.4.14 release because I'm not happy about the cleanliness of the change. I'll probably clean it up for the next 2.4.14 patch.

On a much-delayed train from San Francisco to Mountain View (a supposedly 1:13 hour trip that in reality required about 1:50 and two trains), I also figured out why you can't talk to eth1 from the host if you configure both an eth0 and eth1. It turned out to be the same bug that other people had noticed causing dropped packets. I was checking errno incorrectly. I had code that did this:

    n = read(...);
    if(errno == EAGAIN) return(0);

forgetting that successful system calls don't necessarily set errno to zero. So, the eth1 read was succeeding, but errno was still EAGAIN from the eth0 read.
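
A minimal sketch of the corrected check, assuming a non-blocking descriptor. The name read_packet is hypothetical and is not the actual UML function; the only point is that errno is looked at after read has reported a failure, never before.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: returns the number of bytes read, 0 if nothing
 * was available (end-of-file also yields 0 in this sketch), or -1 on a
 * real error. */
static int read_packet(int fd, void *buf, size_t len)
{
    ssize_t n = read(fd, buf, len);

    if (n < 0) {
        /* errno is only meaningful here, after the call has failed */
        if (errno == EAGAIN)
            return 0;
        return -1;
    }
    return (int) n;
}
```

With this ordering, a stale EAGAIN left over from the eth0 read can't make a successful eth1 read look empty.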

In other news, beware of kernels built with gcc 3.0.2. I got a complaint from Jens Axboe today about UML leaving all kinds of not-quite-zombie processes lying around. I looked at it a bit and guessed that the host kernel was messed up somehow. He looked at that, decided I was right, and that the culprit was the latest gcc. The interesting thing was that, until he ran UML on that kernel, it looked just fine to him.

6 Nov 2001
In preparation for fixing the problem of the console driver losing output, I ported the SIGIO handler to use poll instead of select. This was mostly what 2.4.13-4 was. I later discovered a bug in it, which is fixed in -5.
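
For reference, a poll-based scan of a set of descriptors looks roughly like the sketch below. It is not the UML SIGIO handler; readable_fds and the fixed limit of 16 descriptors are assumptions made to keep the example short.

```c
#include <poll.h>

#define MAX_FDS 16   /* arbitrary limit for this sketch */

/* Hypothetical helper: store in ready[] the descriptors from fds[] that
 * have input pending right now.  Returns how many there are, or -1 on
 * error.  The timeout of 0 makes poll return immediately. */
static int readable_fds(const int *fds, int n, int *ready)
{
    struct pollfd pfd[MAX_FDS];
    int i, count = 0;

    if (n < 0 || n > MAX_FDS)
        return -1;

    for (i = 0; i < n; i++) {
        pfd[i].fd = fds[i];
        pfd[i].events = POLLIN;
        pfd[i].revents = 0;
    }

    if (poll(pfd, (nfds_t) n, 0) < 0)
        return -1;

    for (i = 0; i < n; i++)
        if (pfd[i].revents & POLLIN)
            ready[count++] = pfd[i].fd;

    return count;
}
```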

I then decided to fix the problem of UML not being able to be interrupted and backgrounded. The cause was that all UML processes are in the same process group, with all of them stopped except for the one that's actually running. When UML is backgrounded, the shell sends a SIGCONT to the process group, which wakes up every UML process, which is very bad.

I did some failed experiments with setpgrp/setsid and friends, and discovered that a separate process group wouldn't work because then those threads can't write to the terminal because they're in the wrong process group.

So, I decided that out-of-context processes should be asleep rather than stopped. This required redoing the task switching code. They were stopped because the tracing thread intercepted a signal from them when they went out of context and never continued them. Having them sleep would require that the tracing thread stop doing that and that the threads involved in a context switch arrange the transfer themselves.

So, what is now done is that non-running processes are asleep in sigsuspend, and they are woken up by the going-out-of-context process sending a SIGTERM. Races are avoided by having the SIGTERM sent inside a section of code that has blocked SIGTERM. SIGTERM is re-enabled atomically with the sleep with sigsuspend.

So, that plus the poll fix is the contents of -5.

3 Nov 2001
Time to patch-bomb Alan again. I sent in ten patches to get the ac tree current with CVS. Here they are:
2 Nov 2001
That last patch went into -ac6, so the ac UML builds and works again. The next job is to get the ac tree up to date.

I released a new utilities tarball today. uml_net should now do proxy arp correctly. uml_mconsole is now able to take a command on its command line and execute it, rather than being strictly an interactive tool.

30 Oct 2001
I decided to make the -ac UML build again, so I made this patch and sent it off to Alan. The rest of the updates will be forthcoming.
29 Oct 2001
Today is 2.4.14-3 day. I decided to remove the code in fix_range which unmaps pages whose ptes say they're not present. That basically caused it to try to uselessly unmap all of its unused address space. So, I did that and it uncovered a bug. It turns out that swapped-out pages weren't marked as needing to be remapped. Everything worked a lot better with that fixed, and context switching should be a bit faster now.
28 Oct 2001
I released 2.4.14-2 today. This contains the fix for the process segfaults and the gdb problems people have been having. It also turns on morlock's context switch optimization which I disabled until I figured out the segfaults.
26 Oct 2001
I finished releasing 2.4.13 today.

After some prodding from Greg Lonnon, and after he did some investigation, I figured out what the problem with gdb inside UML is. The signal handlers don't save their registers in the thread struct. This means that when a SIGTRAP from a breakpoint comes in and it gets forwarded to gdb, when it gets the registers to find out what the ip is, it gets an old, bogus value. So, it doesn't recognize that as a breakpoint and complains about a spurious SIGTRAP instead.

25 Oct 2001
I spent the last few days chasing a process segfault problem. I finally tracked it down today. It turns out that my rewrite of the process signal delivery code was broken in the case of a signal being delivered from an interrupt handler rather than a system call. It grabs the process registers from the thread structure, saves them away on the stack, and then restores them to the process when the handler finishes.

However, interrupts don't save their registers in the thread structure, so those registers represent the last system call, which has already finished. And restoring those causes great confusion in the process.

20 Oct 2001
I released 2.4.12-3um and 2.4.12-4um over the last week. -3 fixed a couple of problems with -2, and -4 adds some miscellaneous fixes to that. The major ones are that physical memory protection is optional (controlled by the 'jail' switch) and that the network driver backends now collect uml_net commands and output and nicely printk them instead of having the output just dumped to the terminal. To support this, uml_net now hangs on to the commands it runs and the output they produce and sends them back to UML. This required that the uml_net interface version be incremented, so it's now at 3. The new drivers require the new uml_net, so if you grab the UML patch, get the latest utilities tarball too.
13 Oct 2001
I released 2.4.12-2um today. It's almost entirely changes sent in by other people, dominated by Adam Heath's cleanups. There were also some ppc fixes from Chris Emerson, and small fixes from other people.

I also released a new utilities tarball. The one change was to uml_net, which does proxy arp in a different, and apparently more robust way than it used to.

11 Oct 2001
Linus released 2.4.11 and 2.4.12 two days apart. I had the 2.4.11 patch uploaded, and had started releasing packages when 2.4.12 came out. So, 2.4.12 is out there and I'm doing the packages again.

8 Oct 2001
I should have mentioned the latest -ac patches already since they've been in Alan's tree for a few days, but I didn't, so here they are.

In other news, with the help of Paul, I tracked down an ancient console driver bug that held on to a struct tty after it had been freed and subsequently caused panics.

I released 2.4.10-7um today with that fix and some other minor changes.

5 Oct 2001
Paul Larson found a test case for the signal problems that was reproducible for me. So, with that in hand, I tracked down the bug, and released 2.4.10-6um.

The bug turned out to be a result of moving where state is saved before a signal is delivered to a process. The process registers and some other things need to be saved on the process stack so they can be restored later. The way it used to work is that

  • handle_signal would figure out what the interrupted system call eventually returns
  • that value is passed up the stack and stored in the process registers stored in its task structure
  • the process would be sent a signal so it starts running on its process stack
  • the UML signal handler copies the register state from the task structure to its own stack
  • it calls the process signal handler
  • and restores the registers back to the task struct

What I implemented did this:
  • handle_signal figures out what the interrupted system call eventually returns and constructs the process stack frame, copying the registers from the task struct onto the stack
  • that value is passed up the stack and stored in the process registers stored in its task structure

The bug is that the second step happened too late. The registers saved on the stack hold a bogus return value, and it's that value which the system call eventually returns.
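
The ordering problem can be shown with a small model. This is only a sketch and not the UML code: struct regs, its eax field, and both function names are made up for illustration, and the "signal handler" is reduced to a comment.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the registers kept in the task structure. */
struct regs {
    long eax;   /* carries the system call return value */
    long eip;
};

/* Broken order: the frame is copied before the return value is stored. */
long deliver_broken(struct regs *task, long syscall_ret)
{
    struct regs frame;

    assert(task != NULL);
    memcpy(&frame, task, sizeof(frame));   /* frame gets the stale eax */
    task->eax = syscall_ret;               /* stored too late for the frame */
    /* ... the process signal handler would run here ... */
    memcpy(task, &frame, sizeof(frame));   /* restore clobbers the return value */
    return task->eax;
}

/* Fixed order: store the return value first, then build the frame. */
long deliver_fixed(struct regs *task, long syscall_ret)
{
    struct regs frame;

    assert(task != NULL);
    task->eax = syscall_ret;
    memcpy(&frame, task, sizeof(frame));
    /* ... the process signal handler would run here ... */
    memcpy(task, &frame, sizeof(frame));
    return task->eax;
}
```

With a stale eax of 1234 and a real return value of -4, the broken order hands 1234 back to the process, while the fixed order hands back -4.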

3 Oct 2001
I decided to profile a stretch of UML thrashing. So, I took the 2.4.10-2ac UML (which I updated to the latest stuff, and which I'll be sending to Alan shortly), gave it 128M of memory and 1G of swap, and let a 'make -j' kernel build run for a couple of hours. These are the results. All of the system calls show up as <spontaneous>. Somehow it wasn't linked against the profiling libc. I'll try to figure out why not.

Some highlights:

  • Protecting kernel memory from userspace seems to be expensive - mprotect is the top item on the list.
  • wait4 is number two, which I don't entirely understand. That's the tracing thread. It sleeps in wait, and wakes up when there's something that needs doing, so I don't understand why it shows up, unless it's somehow being charged for all of the context switching that UML causes.
  • Then we have other low-level VM things, fix_range and flush_tlb_kernel_vm. These manually walk address spaces to update them. These two are unnecessarily inefficient and can probably be knocked far down the list pretty easily.
  • Finally, we get into generic kernel things, which show clear signs of heavy swapping - page_launder, swap_out_pmd, do_anonymous_page.
  • The first system call which shows up is sys_brk, way down the list, followed distantly by sys_read and sys_close.
  • There were 312175 system calls total. The most frequently called were sys_brk, sys_read, sys_open, sys_newfstat, and sys_stat64, not unexpected for a kernel build.
  • kmalloc was called most often from load_elf_binary, select_bits_alloc, and load_elf_interp. __get_free_pages was called most often from handle_mm_fault, pipe_poll, and do_fork. It called _alloc_pages, which was called most frequently from read_swap_cache_async, do_anonymous_page, and do_wp_page.

1 Oct 2001
I discovered a new way of breaking UML. A 'make -j' kernel build drives the load above 150, and on 2.4.10 causes essentially a livelock. I eventually regained control by sending SIGILL to all the processes from the host. Plus, I got all kinds of interesting illegal instruction and bus error deaths. These were absent on -ac2, probably because that wasn't a totally up-to-date UML, so it was missing the most recent bugs that I added. I'm going to track those bugs down by updating UML in the -ac tree bit by bit and seeing which bit causes these nasty little problems.

Speaking of -ac2, it needs a little fix to ptrace in order to build.

25 Sep 2001
Today I redid the signal delivery code. Now all the saving and restoring of state happens in kernelspace rather than on the process stack like before. This allows the task structure to be protected from processes. Since that was the only hole in the protection of physical memory, that is now fully protected against being changed from userspace.

24 Sep 2001
I made a .deb and an RPM in preparation for releasing 2.4.10, and Jacques Nilo reported that yesterday's fix wasn't enough. I had forgotten about an instance of MAP_SHARED | MAP_ANONYMOUS. So, I fixed that, and that is 2.4.10-3um. And that is the basis of the official 2.4.10 UML release.

24 Sep 2001
I thought of an easy fix for the stack capturing problem that prevented UML from booting on 2.2 hosts. Basically, a new process is created which stops itself, and when that happens, the parent grabs a copy of the stack and uses it to create a context for future threads to run in. On 2.2, the parent used ptrace to extract the contents of the stack from the child word by word. I looked at that code and decided it would be much easier to map the stack MAP_SHARED so it would be shared between parent and child and the parent could just memcpy it to a safe place rather than ptracing it out.

What I forgot was that, while 2.4 supports MAP_SHARED | MAP_ANONYMOUS, 2.2 doesn't. So, on 2.2 hosts, UML wouldn't even begin to boot.

The easy solution was to go back to MAP_PRIVATE | MAP_ANONYMOUS, but clone the new process with CLONE_VM, making it a thread, which allows the parent to copy the stack directly, since they're both in the same address space.

This fix makes 2.4.10 usable, so I've released another patch and updated CVS.

23 Sep 2001
Linus released 2.4.10 today, so I updated UML as well. I decided not to base this on the latest UML patch, since that is not entirely healthy at the moment. My sigaltstack fixes broke UML totally on 2.2 hosts. So, 2.4.10-1um is 2.4.9-8um updated to 2.4.10.

CVS is not updated, but will be once I have the sigaltstack thing fixed and that pool updated to 2.4.10.

22 Sep 2001
The last set of -ac patches went into -ac14. That brings Alan's tree reasonably up-to-date.

I released 2.4.9-8um yesterday and 2.4.9-9um today. -8 was some bug fixes and cleanup. -9 was fixing sigaltstack and doing a lot of cleanup and rearrangement of the signal delivery code. This sets me up to redo the entire signal delivery mechanism so I can finish protecting all of the kernel's physical memory from userspace.

18 Sep 2001
That last batch of patches went into -ac12. So, the next batch is off to him, plus one from Andrea Arcangeli which fixes a declaration which is needed to compile UML successfully.

Once these are in, the -ac tree is almost up to date. It'll be one CVS release behind, which is OK because there are some tweaks I want to make to the address space reorg. So, I'll get that right and send it in rather than sending it in two pieces.

15 Sep 2001
I released 2.4.9-6um last night. It contains the already-mentioned COW header changes. It also occurred to me that I can fix the mlockall bug by sticking UML at the top of the address space where it's supposed to be anyway. So, I went ahead and did that. This allowed me to get rid of the vmas that UML needed to stick in each mm to prevent mmap from reallocating areas of virtual memory that UML is living in. This, plus the fact that these vmas had no ptes, caused mlockall to cause major damage to UML by trying to unmap it. Putting UML above TASK_SIZE causes it to be ignored by mmap, and the problem just disappears. This also let me get rid of the nasty address space reservation code that was needed in order to prevent libc from mapping stuff in where UML wanted to put stuff.

In other news, I'm back in the ac patch business. First is a patch that I've been sitting on all week which defined hz_to_std and allows UML to build again. Then, we have

These are now all off to Alan. I've got some more which will wait till those have gone in, in order to minimize conflicts:

14 Sep 2001
Greg Lonnon and I have been fiddling with the COW file header format. I had already discovered that blindly copying the backing file path provided by the user into the header is a problem when it is a relative path. That COW file won't be usable by a UML run in a different level of the directory hierarchy because, from there, the relative path stored in the header doesn't refer to the backing file. The fix is to write an absolute pathname into the header.

Greg had a couple of other good ideas which we thought should be implemented earlier rather than later

  • The header should be able to hold a MAXPATHLEN-sized backing file name rather than the current 256 bytes.
  • It should be in network byte order. This will allow COW files to be moved between big-endian and little-endian hosts. Whether the underlying filesystem can be mounted in UML after the move depends on whether the filesystem has its metadata byte-swapped correctly. But, at least the COW header won't prevent it from working.

These two are not backward compatible, so we bumped the COW header version and made these changes in the version 2 header. The driver can read both V1 and V2 headers but it will only write V2 headers.
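
The three header changes can be sketched together as follows. The struct layout, the magic value, and the function names are assumptions made up for this example; the real COW header has its own layout and more fields.

```c
#include <arpa/inet.h>   /* htonl, ntohl */
#include <assert.h>
#include <limits.h>      /* PATH_MAX */
#include <stdint.h>
#include <stdlib.h>      /* realpath */
#include <string.h>
#include <sys/param.h>   /* MAXPATHLEN */

#define SKETCH_MAGIC   0x434f5721u   /* arbitrary value for this sketch */
#define SKETCH_VERSION 2u

/* Hypothetical V2-style header: integers are stored in network byte order. */
struct cow_header_sketch {
    uint32_t magic;
    uint32_t version;
    char backing_file[MAXPATHLEN];   /* room for a full-length path */
};

/* Fill the header, storing an absolute backing file path.
 * Returns 0 on success, -1 if the path cannot be resolved or does not fit. */
int fill_header(struct cow_header_sketch *h, const char *backing)
{
    char resolved[PATH_MAX];

    assert(h != NULL && backing != NULL);
    if (realpath(backing, resolved) == NULL)
        return -1;
    if (strlen(resolved) >= sizeof(h->backing_file))
        return -1;

    memset(h, 0, sizeof(*h));
    h->magic = htonl(SKETCH_MAGIC);
    h->version = htonl(SKETCH_VERSION);
    strcpy(h->backing_file, resolved);
    return 0;
}

/* Read the version back in host byte order. */
uint32_t header_version(const struct cow_header_sketch *h)
{
    assert(h != NULL);
    return ntohl(h->version);
}
```

Because every integer passes through htonl on the way in and ntohl on the way out, the bytes on disk are the same whichever kind of host wrote them.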

The absolute pathname change is in 2.4.9-5um since it was small and backward compatible. The other two will be introduced in 2.4.9-6um.

The uml-user list had a couple of interesting posts from UML users today

  • Martin Volf did a Slackware 8.0 installation inside UML and wrote a page describing how he did it.
  • Tim Robinson had some problems with the TUN/TAP transport and posted a nice diagnosis of them.

10 Sep 2001
Been playing with the tools and website lately. I added a bunch of new features to the mconsole client (and promptly had to fix it), and fixed uml_net building on 2.2.

I also restructured the web site build somewhat to make it more manageable.

6 Sep 2001
I tracked down the process segfault problem. It was caused by a newly forked child inheriting some pages that were swapped out, but hadn't been unmapped. The code that it ran on its first quantum didn't update its address space correctly, so those pages remained mapped.

Having chased that problem down, I'm releasing 2.4.9-4um with that fix plus Chris Emerson's latest ppc changes.

1 Sep 2001
After much ado, I revamped the UML download page. It essentially replaces the Sourceforge project download page. I did this in order to be able to let people select the mirror they want to download from and to be able to put explanatory information on the same page as the download link. If it is missing stuff that you'd like to see, regardless of whether it's on the SF download page, I'd like to know about it.

There are a couple things that aren't working right now - the 'Changelog's don't link to anything, and most of the SourceForge root filesystem links don't work. I'm in the process of copying the filesystems over to SF to fix this.

It's now pretty trivial for me to add mirrors, so if you have a box available (particularly if it's in a part of the world not well-covered by the UML global mirror system), let me know.

30 Aug 2001
Thanks to what looks like an all-night debugging session on the part of Yon Uriarte, the TUN/TAP backend now works. You'll need the latest uml_net for this. It wasn't setting IFF_NO_PI, which was causing extra cruft to be stuck on the front of the packet, which probably required the broken nastiness I had to add to the driver. Adding that and backing out all the skbuff fiddling made everything work a lot better.
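
For anyone setting up a TAP device by hand, the flag goes into the TUNSETIFF request. The following is a minimal sketch, not the uml_net source; tap_request and tap_open are names invented here. Without IFF_NO_PI the kernel puts a small packet information header (struct tun_pi) in front of every frame, which is the extra cruft mentioned above.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/if.h>
#include <linux/if_tun.h>

/* Fill in a TUNSETIFF request for a TAP device with no packet info header. */
void tap_request(struct ifreq *ifr, const char *name)
{
    assert(ifr != NULL && name != NULL);
    memset(ifr, 0, sizeof(*ifr));
    ifr->ifr_flags = IFF_TAP | IFF_NO_PI;
    strncpy(ifr->ifr_name, name, IFNAMSIZ - 1);   /* memset left the terminator */
}

/* Open the named TAP device. Returns a descriptor, or -1 on failure.
 * This normally needs root or a device set up in advance. */
int tap_open(const char *name)
{
    struct ifreq ifr;
    int fd = open("/dev/net/tun", O_RDWR);

    if (fd < 0)
        return -1;
    tap_request(&ifr, name);
    if (ioctl(fd, TUNSETIFF, &ifr) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

With the flag set, each read from the descriptor returns a bare ethernet frame, so the driver has no prefix to strip.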

So, I released 2.4.9-3um with the fixed driver in it, plus new entries in config.release, defconfig, and Configure.help.

28 Aug 2001
I implemented a TUN/TAP backend for the network driver. It involved more work than I expected. A lot of it was due to restructuring other code in order to keep the code relatively clean.

I haven't done any stressing or timing of it, but I did happen to notice that pings over TUN/TAP are about 10x faster than pings over ethertap. The absence of the helper handling each packet on the way to the kernel is no doubt a big piece of that. At some point, I'll do some bandwidth measurements against ethertap to see how much better it is. Hopefully a lot.

26 Aug 2001
UML development took a bit of a break while I got busy with other stuff.

In UML news, I started work on my ALS paper, got a first draft ready, and sent it off for review. I also did a bunch of web site work. I've been letting things fall behind for lack of time to deal with them, so I decided to swallow my pride and start asking for help. This necessarily involves describing what needs doing, so I wrote most of it up, and the results are here, here, here, here, and here.

I also made a pass over the site, fixing a bunch of hopelessly outdated and wrong things, and probably leaving some things which are only moderately outdated and wrong.

16 Aug 2001
I've been having fun playing with crashme. It's a great little tool. It generates buffers full of random data and then executes them. It runs differently on UML than on the host, which it shouldn't. The problems I've tracked down so far are signal handling bugs. UML wasn't handling write faults correctly when the accessed memory was readonly, and it wasn't properly segfaulting processes to which signals couldn't be delivered (because their stack pointers were garbage). This last was the bug I was chasing a couple days ago. There are still problems. The first process (crashme +2000 666 100) runs just as it does on the host, but the next one (crashme +2000 667 100) doesn't. On the host, the segfault handler somehow gets bus errors in libc, which I don't understand, and that doesn't happen under UML.

On IRC yesterday, Lennert Buytenhek clued me in on how to reliably segfault processes and crash UML. He was running 8 "du /". That didn't work for me, but 16 of them does. The segfaults are on pages that are mapped in but shouldn't be (their ptes say that they should be mapped out, and somehow that didn't happen). So, those pages were presumably allocated for something else, and contain garbage from the perspective of the process that should have unmapped them, and so it segfaults.

The panic looks like memory corruption. I turned on slab debugging, and it looks like that makes the panic go away.

Well, Linus released 2.4.9 today, so it's time for me to go into my UML release routine. When I did the obligatory kernel build on 2.4.9, one of the crashme fixes turned out to be bogus. It was doing the segfault-during-signal-delivery check too early, so it caught fixable segfaults that happened because the stack needed extending or was readonly.

14 Aug 2001
2.4.8-2um is out as of yesterday. I made the freshmeat announcement of 2.4.8 this morning.

I chased the crashme bug a little. Somehow, a signal is marked as being pending, but it's never actually delivered and reset. So, no further signals can be delivered to that process from then on. This makes it unkillable and unstoppable.

I also took the first step towards making UML secure against nasty users. UML physical memory, except for the task structure and kernel stack, is protected from userspace access. I still need to protect the task structure and kernel virtual memory. The task structure is a bit tricky because of the signal delivery code. It runs on the process stack and is considered to be userspace code. However, it needs to be able to modify the task structure to restore state that it saved before the signal delivery. So, if the task structure isn't writable, this isn't possible. Further thought on the subject is necessary.

13 Aug 2001
Well, Linus released 2.4.8 just as I was heading up north for a weekend of camping and climbing mountains. He does this on purpose. He released 2.4.3 when I had just arrived in San Jose for the Kernel Summit.

Anyway, this was a relatively simple patch. It just dropped in and worked, except that hostfs was already broken. My calculation of the stat64 inode field was wrong. It looked at the kernel version to decide what was in the userspace headers. I discovered the error of my ways when I booted up a Debian UML to produce the 2.4.8 .deb. This is a 2.2 filesystem (with .st_ino in stat64) with a 2.4 kernel (which implied .__st_ino in stat64). hostfs did not build. I changed the Makefile to just grep the appropriate header instead.

So, this fix will be the substance of 2.4.8-2um.

9 Aug 2001
The remaining differences between my pool and the ac tree are a couple of patches that didn't go in for some reason, cleanups of printks and some includes.

Daemonizing UML does work. I just checked it, and the only case where it does something strange is if you background it without nohupping it and log out. The tracing thread dies from the SIGHUP, but all the other threads survive.

I released 2.4.7-5um today. It contains a few recent patches from other people. I figured out how to turn -fno-common back on. I tried all kinds of linker tricks to throw errno.o out of the binary. Then I discovered that the linking that had already taken place had destroyed any notion of what objects anything originally came from. So, instead, I added -Derrno=kernel_errno to all the kernelspace gcc lines, which translates all the kernel uses of errno to kernel_errno, and leaves the libc errno alone. That's actually a better solution than throwing out one of the errnos because that would leave open the possibility that the kernel and userspace uses of libc could step on each other. Now that they're using different symbols, that's not a problem.

7 Aug 2001
Yesterday's patches are off to Alan.

In other news, daemonizing UML seems to be broken again. Grrr. That seems to break now and then for no apparent reason.

ac9 is out with my patches in it. So, time to make the final diff between Alan's stuff and mine to get him totally caught up with me.

6 Aug 2001
Yesterday's patches are in ac8. So, two more patches will bring the ac tree completely up to date:
  • A network driver update which adds the ability for the drivers to tell the helper about any IP address changes. This allows the host configuration (routing and proxy arp) to stay in sync with the interface address changing inside UML. If you're in the habit of getting UML from the ac tree, you'll need the latest uml_net in order to use the network when this patch goes in because it makes an incompatible change in the helper interface.
  • Another batch of (surprise!) miscellaneous fixes, including some cleanup of stack permission setting, the apparently gratuitous locals that are needed to persuade -pg to work properly, a couple of symbol exports for GFS, a fix that ensures that the pid file contains the correct pid, and yet another squashed warning.

With these in, I'll be able to diff the ac tree against mine to see what divergences there are. I know there are some, because I occasionally see patches fail to apply because of context conflicts which shouldn't be there. So, there will be one more patch to clean those up, and Alan will be completely in sync with me.

5 Aug 2001
Patch time again. This time I'm making them up ahead of ac7 coming out. So, we have
  • A hostfs update, which brings the ac tree completely up-to-date. Normally, I bundle a couple of cvs updates into a small number of patches and send them off to Alan. With hostfs, I decided to give him the latest stuff, since there have been a bunch of changes spread over a number of cvs updates. This is fairly easy since hostfs is a completely self-contained piece of code.
  • A network driver update, which fixes a crash and makes net devices pluggable via the mconsole. There's some restructuring and cleanup in this patch. Also, mconsole actions move into keventd context from softirq context. This is because alloc_netdevice does a GFP_KERNEL kmalloc, which has to be done in process context.
  • Yet another batch of miscellaneous fixes, including renaming CONFIG_IOMEM to CONFIG_MMAPPER, some cleanup in the ubd driver, and removal of a number of warnings.
  • The complete merge of the ppc port, which reorganizes the headers somewhat. For some headers, there is now a header.h, which is a symlink to header-$(SUBARCH).h, which includes header-generic.h and is allowed to do whatever it wants before and after. This provides the flexibility needed to do things like undef stuff after the include and rename things beforehand.

These all will bring Alan up to my 2.4.7 release, except for hostfs, which will be completely up to date. Since I'm up to 2.4.7-4um, and 2.4.7-2um was just a hostfs fix, I might be able to bring the ac tree up to date with one more set of patches.

Alan released ac7 this afternoon, as I prophesied, so those patches are off to him. I'll be looking for them in ac8.

I failed to resist temptation. I looked at the diffs between the ac tree once those patches are in and my current stuff and I noticed a big wad of documentation. So, I rolled that up and sent it to Alan.

4 Aug 2001
OK, I'm back in the business of sending Alan patches. I sent in a small patch which fixes the things that broke when 2.4.7 came out. So, UML now builds and works in the -ac tree again. It made 2.4.7-ac6 an hour or so after I sent it over.

Also, in the interest of getting the ac tree more caught up with my CVS, I sent Alan a batch of fixes which bring him up to 2.4.6-4um:

  • umid fixes from Henrik Nordstrom which create a directory based on the umid rather than having that be the pid file. The pid file and the mconsole socket are now in that directory.
  • Another batch of small fixes - a Makefile fix, mconsole cleanups and an update to create the socket in the umid directory.
  • Some config changes, also from Henrik Nordstrom. These change the network config names to be more explicitly UML-specific. The config.in is also cleaned up so that it resembles the i386 config more closely.
  • Greg Lonnon's example iomem driver, plus a couple of generic UML fixes that were needed in order to make it work.
  • A uaccess fix which required a surprising amount of surgery. The copy_{to,from}_macros previously regarded a fault location of 0 as meaning that the copy has succeeded without faulting. When the address passed into the kernel was NULL, this of course broke badly. It had a very interesting side-effect in the case I saw. After running the command that exercised the bug, every command on the system started failing to start because libc was corrupted. This was something of a head-scratcher. I eventually figured out that I was causing the command to open NULL, the fault went undetected, and the buffer that was supposed to have had the filename copied into it had the filename of libc in it from a previous use. So, libc was opened for writing with fairly severe results.
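
The uaccess ambiguity in the last item is easy to reproduce in a toy model. This is a sketch under simplifying assumptions, not the kernel macros: copy_old and copy_fixed are invented names, and a NULL source is the only address that "faults".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Old convention: return the faulting address, with 0 meaning "no fault".
 * A fault at address 0 is therefore indistinguishable from success. */
unsigned long copy_old(char *dst, const char *src, size_t n)
{
    assert(dst != NULL);
    if (src == NULL)
        return (unsigned long) src;   /* fault at 0 reads as success */
    memcpy(dst, src, n);
    return 0;
}

/* Fixed convention: report success or failure separately from the address. */
int copy_fixed(char *dst, const char *src, size_t n, unsigned long *fault_addr)
{
    assert(dst != NULL && fault_addr != NULL);
    if (src == NULL) {
        *fault_addr = 0;
        return -1;
    }
    memcpy(dst, src, n);
    return 0;
}
```

Under the old convention the caller carries on with whatever the destination buffer held before, which is how a stale libc filename ended up being opened for writing.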

3 Aug 2001
The deb build problem turned out to be me accidentally redefining VERSION in the upper layers of the build process. That value overrode a VERSION in the kernel build, which resulted in a totally bogus KERNELRELEASE, which confused a macro which tested it badly enough that it broke the build. Simple to fix once I figured it out.

I discovered another hostfs bug on my way back from OLS. ls didn't work and I found two bugs as a result. The easy one was that hostfs_readdir was filling in the directory inode rather than the file inode for every directory entry it passed back to vfs. This was fixed by having read_dir pass the inode back up so it could be used to fill in the entry properly.

The more interesting one is that there was a source-incompatible change made in the stat64 struct between 2.2 and 2.4. The st_ino field changed its name to __st_ino and a new st_ino field was added at the end. The inode appears in the same place (the st_ino/__st_ino field) making it binary compatible. So, after changing to use the 2.4 field name (and breaking hostfs on 2.2), I changed the hostfs build to figure out what name to use and pass that in on the compile line to hostfs_user.c.

In other news, we (me, Rodrigo de Castro, and Livio Baldini Soares) have decided that -pg support in gcc is broken in multiple ways. rcastro and livio complained a couple weeks ago about UML's gprof support not working. I finally had a look at it, and found that it was broken, but not in the way they described.

UML crashed in a very inconvenient place, and when I finally got in there enough to figure out what was happening, it turned out that mcount was segfaulting when it dereferenced ebp because ebp was NULL. The reason for that turned out to be that in some procedures, mcount is absolutely the first thing they do. Everything else calls mcount after the new stack frame has been set up and ebp has a valid value in it (the old esp). When the procedure is the main procedure for a thread, then ebp turns out to be NULL.

The difference between the two sets of procedures seems to be that the good ones have local variables and the bad ones don't. So, to work around this bug, I added a useless, but non-optimizable, local to the affected trampolines.

Having done that, rcastro and livio were still complaining about UML crashing. So, I looked at it with rcastro using gdbbot (and livio did so later and discovered the same thing). -pg was trashing edx for some reason. A constant (which varies from procedure to procedure) is dumped into it. This suggests that it's used for the profiling bookkeeping somehow, but looking at the assembly, we don't see how. mcount carefully pushes it and restores it, which is not typical of something that is going to be used for something. The problem is that FASTCALL procedures (which are regparm(3)) pass arguments in eax, edx, and ecx. So, dumping this constant into edx trashes the second argument to the procedure. A workaround for this bug would seem to be to disable FASTCALL (and I guess that gprof support stopped working when I enabled FASTCALL to fix a different bug).

I released 2.4.7-4um today. The main new thing is that you can change the IP address of an ethertap eth0 device and the host configuration will change to match. This required a bit of infrastructure which I wanted for other reasons. The uml_net interface is now versioned, which I've been meaning to do for a while. uml_net now goes away cleanly when UML is killed messily. Before, it would hang around, occupying the tap device, and when UML was rerun, the new uml_net would emit non-intuitive error messages.

I also made hostfs build and run again on 2.2 with a bit of Makefile hackery.

28 Jul 2001
That hostfs problem turned out to be different than I thought. Livio Soares started chasing the problem and found that the hostfs_user close_file didn't actually close anything. It took a pointer to a file descriptor and closed the pointer (or at least tried to) rather than the descriptor that it pointed to. Fixing that made hostfs behave a lot better.
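
The corrected function amounts to one dereference. A minimal sketch, assuming a signature that takes a pointer to the descriptor (the real hostfs_user code may differ):

```c
#include <assert.h>
#include <unistd.h>

/* Close the descriptor that fd points to, not the pointer itself. */
int close_file(int *fd)
{
    assert(fd != NULL);
    /* The broken version did the moral equivalent of close((int) (long) fd),
     * which hands close() a pointer value and leaves *fd open. */
    return close(*fd);
}
```

Since the bad call never touched the real descriptor, every file hostfs opened stayed open, which fits the misbehavior described above.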

Having fixed that, I finished the page cache work for UML and it can now successfully do the deb build through hostfs without getting the md5sum mismatches it was getting before. Having said that, I've started seeing a compilation problem when building UML through hostfs that I don't get on the host.

On to OLS. My talk was in the second slot of the first day, which was nice. It's good to get your talk over early so you can do the rest of the conference without worrying about it. It went pretty well. I had hoped to fit a demo in at the end, but the talk basically went the full 90 minutes. So I did a real short watch-it-boot-up demo afterwards while most of the crowd was filing out of the room.

There was a talk on porting Linux to the i-series IBM boxes (aka AS-400) which was fairly interesting. They ported Linux/ppc to a hypervisor running on OS-400, making it fairly similar to the UML port, being a port to an OS rather than to bare hardware. Dave Boucher, who gave the talk, made a number of comments comparing it to UML, which was nice. He also grabbed me during lunch today to quiz me about the COW ubd driver. It turns out that he can't do that so easily because OS-400 doesn't have sparse files, so he can't drop blocks down in the same location in the COW file as in the backing file because that would allocate space. I suggested a block directory instead of a bitmap at the beginning of the COW file and dropping changed blocks down sequentially, but he seemed unconvinced for some reason.

A number of people told me either they or people they knew were using UML for various things. The FreeS/WAN project as a whole seems extremely interested in UML for running tests on their stuff over a virtual network. A PPPoE maintainer complained about the ethertap transport not being intuitively obvious on 2.4. And there were a bunch of other people who were less specific about what their interest in UML was who were either using it or were intending to.

In other news, I discovered that the mcast network transport didn't work when the box had no ethernet card in it. Being at OLS, I showed this to Harald Welte and we stared at the code a bit, then asked Andi Kleen about it. The underlying problem turned out to be that there was no route to any multicast address because there was no interface on the system that supported multicast. The fix seems to be to add multicast support to the loopback device, preferably, and if that's not possible for some reason, to the dummy device.

22 Jul 2001
2.4.7-1um is released. A change which made kernel threads synchronize with the parent at startup caused a hang at boot. The cause was a long-standing bug which caused initdata not to be shared between processes. Andrea noticed the problem as well, and found the fix.

That bug was fixed and I released everything. I released it with a fairly big hostfs problem that I didn't notice until the middle of the release process. I changed how it opens and closes files, with the result that it closes them later than it used to. So, it isn't too hard to get hostfs very confused by running UML out of file descriptors.

21 Jul 2001
2.4.7 appeared yesterday. I'm looking it over to see what's new. One interesting thing is that Alan is sending over some bits of UML which change the generic kernel. These don't affect anything besides UML, so they're harmless. On the other hand, they eliminate some generic files from my patch, which is nice. It makes the UML patch appear purer.

16 Jul 2001
Yesterday's patches are in 2.4.6-ac5. So, time to send in another batch. This will get him up to my 2.4.6-2um. Today's batch contains another batch of random fixes and Greg Lonnon's ubd COW patch. See this page for more information on the ubd COW driver.

I also checked in all the userspace stuff, including the deb builder, recent changes to the tools, and the website, which I hadn't checked in for quite a while.

15 Jul 2001
Those two pesky patches finally made it to Alan OK and were included in 2.4.6-ac4. This gets the ac tree up to 2.4.5-8um. The next batch will bring him up to 2.4.5-10um. It includes a bunch of miscellaneous fixes, the first merge of the iomem patch, and an mconsole update which makes gdb and the ubd driver hot-pluggable and runs mconsole stuff from a tasklet rather than inside the interrupt.

With some more symlink abuse, I merged the last of Chris Emerson's ppc port patch.

14 Jul 2001
Two of the three patches I sent to Alan were broken again. However, I figured out why. My devious little mail reader was breaking lines when it sent out the mail, which was way too late for me to eyeball it to make sure it wasn't messing up. Turning off this behavior results in much better patches at the other end of the line.

I played with the UML .deb builder and got a workable .deb out of it. I think I figured out why the process gets a checksum error at the very end - it's on hostfs, and hostfs reads through the page cache, but doesn't write through it. I'll have to check with a filesystem guru on this, but it sounds right to me. Putting the process on a normal block device results in good checksums.

13 Jul 2001
I put out a couple more patches. Highlights include
9 Jul 2001
Thanks to Simon Blake, I tracked down an interesting bug last night. The UML build turns off __i386__ in order to throw out some very hardware-specific code that UML definitely doesn't want. This also turns off the i386 definition of FASTCALL, which invokes an in-register parameter passing convention that gcc supports. This wouldn't be a problem, except that UML borrows code from the i386 port which assumes that this convention is being used.

In the case that I was looking at, rw_down_write_failed() was getting its semaphore address from the wrong place and using a random userspace address as its semaphore. This could cause all kinds of interesting side-effects, like kernel corruption from two threads using two different random addresses as the same semaphore or process memory corruption from the kernel writing semaphore stuff into its memory. Hopefully, this fix will eliminate some of the strange crashes that people occasionally see with UML.

Here is a more detailed description of the bug and its side-effects.

I sent the two broken patches (the mconsole and 64-bit patches, see the 30 Jun 2001 entry for descriptions) to Alan again. Hopefully they aren't broken this time. Also, I fixed a few build problems that turned up lately.

7 Jul 2001
I integrated Greg Lonnon's ubd COW patch today. It allows multiple UMLs to share a filesystem read-write by storing the changes in a private file. This private file can be considered to overlay the read-only shared file. All writes go into the private file, and reads come from the private file if it has a valid block and from the shared file if not.
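The read and write rule described above can be modelled in a few lines. This is a toy sketch in Python, not the driver's code: the `CowDevice` class, its block-indexed dictionaries, and the method names are all assumptions made for illustration.

```python
class CowDevice:
    """Toy model of a copy-on-write block device (hypothetical, not the ubd driver)."""

    def __init__(self, shared_blocks):
        # The shared backing file is treated as read-only.
        self.shared = shared_blocks
        # The private overlay starts empty and only holds blocks that were written.
        self.private = {}

    def read(self, block_no):
        # Prefer the private copy if this block has been written before.
        if block_no in self.private:
            return self.private[block_no]
        # Otherwise fall through to the shared file.
        return self.shared[block_no]

    def write(self, block_no, data):
        # Writes never touch the shared file.
        self.private[block_no] = data


# Two "UMLs" sharing one filesystem image, each with its own overlay.
shared = {0: b"boot", 1: b"root"}
a = CowDevice(shared)
b = CowDevice(shared)
a.write(1, b"changed by a")
```

In this model, `a` sees its own change while `b` and the shared image are untouched, which is where the disk space saving comes from.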

This allows a huge savings in disk space for people running many UMLs with large filesystems. It probably will help performance, since the caching requirements on the host are similarly reduced.

4 Jul 2001
Two of the last three patches I sent Alan somehow got corrupted. I suspect that what happened was I added spaces accidentally while reading the patch in my mail composition window by trying to page it with the space bar, then messed up the patch when deleting the spaces. So, I'll send them in again.

Rik van Riel has been visiting for the last couple of days. He was in Boston for Usenix, and was visiting EMC and MCLX (where a number of my former coworkers from DEC now work) after the show. Since I live a couple of hours north of Boston, I invited him up. In doing so, I acquired the responsibility of getting him to Logan airport at the same time that 2M people were going into Boston to see the 4th of July fireworks and concert. I ended up putting him on a bus that ran from outside the city straight to the airport. I haven't heard anything from him since, so I suppose that's good news.

2.4.6 was released last night. It turns out to be a piece of cake. The well-known softirq fix is the only thing that needed changing. I stuck that in, and it built and ran through my tests without a problem. The patch is released, and I'll probably finish the rest tomorrow.

30 Jun 2001
I sent Alan patches which will bring him up to 2.4.5-8um, which include:
  • a collection of small fixes, including ^S/^Q support for the console, some ubd driver cleanups, and the TASK_UNINTERRUPTIBLE fix
  • Lennert's reimplementation of the 64-bit file support - the first try used libc's magic support for popping the 64-bit interfaces under the 32-bit names. That broke UML modules badly. This version explicitly uses the 64-bit interfaces and seems a lot healthier.
  • Lennert's management console patch. This version has support for getting the kernel version, halting and rebooting the system, and turning the debugger on and off.

Last night, the f00f bug was bugging me, so I fixed it. It turned out that the tracing thread was routing SIGILL and SIGBUS incorrectly. Fixing that causes f00f to SIGILL properly.

29 Jun 2001
I found and fixed the TASK_UNINTERRUPTIBLE hang last night. It turned out to be caused by an interrupted write in the block driver. The driver didn't check the return value, so didn't notice that an IO request it sent to the IO thread didn't go anywhere. That shut down the disk IO system, which ultimately results in the whole system being deadlocked waiting for IO that's never going to happen.
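The general shape of such a fix is to look at what the write actually did instead of assuming it succeeded. A minimal sketch in Python, assuming a pipe stands in for the channel to the IO thread; `send_request` is a hypothetical helper, not a UML function.

```python
import os


def send_request(fd, payload):
    """Write the whole payload to fd, or raise if it cannot be delivered.

    Hypothetical helper: it checks the result of every write and
    retries interrupted or short writes.
    """
    sent = 0
    while sent < len(payload):
        try:
            n = os.write(fd, payload[sent:])
        except InterruptedError:
            # Interrupted by a signal before anything was sent; try again.
            # (Python 3.5+ normally retries this itself; kept to show the check.)
            continue
        if n == 0:
            raise OSError("request was not delivered")
        sent += n
    return sent


# A pipe plays the role of the request channel in this sketch.
read_end, write_end = os.pipe()
count = send_request(write_end, b"read block 7")
received = os.read(read_end, 64)
os.close(read_end)
os.close(write_end)
```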

That, plus a few other things, are checked in as 2.4.5-11um.

In other news, Bill Stearns, who's always looking for more devious things to inflict on UML, happened across the Linux Test Project and decided to run it on UML. UML did pretty well. There were three failures, two of which also fail on the host. The other is the f00f test, which causes UML to hang. I applied the obvious fix of relaying SIGILL from UML to the process. That fixed the hang, but after a long pause, the test's SIGILL handler apparently gets called twice.

26 Jun 2001
Those last two patches made it into ac19. Time to start thinking about bringing the ac tree up to -7um.

I finally got Greg Lonnon's iomem patch into UML. This allows a process outside UML to communicate with one inside (or with a UML driver) through a mmapped file.

I've also been chasing the TASK_UNINTERRUPTIBLE hang that a few people have been seeing. It happens most easily under UML, apparently. I'm using a recipe discovered by mistral to reproduce it (two infinite loops each diffing two kernel pools). The longest it's taken to reproduce is about 30 minutes. It hung on boot once. The others have been in the 5-10 minute range.

I had a long chat with Al Viro last night with him telling me what he wanted to see from gdb and me providing it. He ended up being puzzled about what was happening. Following a suggestion from Daniel Phillips, I've started instrumenting buffer_heads and pages to see what happened to the ones involved in the hang.

25 Jun 2001
It's -ac patch time again. I boiled the -5um to -6um changes down to two patches:
  • a miscellaneous fixes patch which adds some IP address sanity checking to the ethertap backend, fixes a couple of process signal delivery races, cleans up the associated thread data a little, fixes a swap bug (which caused swapped-out pages to never be unmapped from their processes), and gets rid of the last vestige of the mm_changes code.
  • a timer patch which attempts to eliminate missing clock ticks by never disabling the timer and keeps track of ticks which happen when it's not safe to call the timer IRQ. This improves things, but it doesn't eliminate missing ticks under load.
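The tick bookkeeping in the timer patch can be pictured with a small model. This is a sketch under assumptions: the `TimerModel` class and its field names are invented here, and it only shows the idea of counting ticks that arrive at an unsafe moment and replaying them later. It does not try to reproduce the ticks that still go missing under load.

```python
class TimerModel:
    """Toy model of deferred timer ticks (hypothetical names throughout)."""

    def __init__(self):
        self.safe = True     # whether the timer IRQ may be called right now
        self.pending = 0     # ticks that arrived while it was not safe
        self.jiffies = 0     # ticks that have actually been delivered

    def tick(self):
        # The timer itself is never disabled; unsafe ticks are only counted.
        if self.safe:
            self.jiffies += 1
        else:
            self.pending += 1

    def block(self):
        self.safe = False

    def unblock(self):
        self.safe = True
        # Replay the ticks that were deferred while it was unsafe.
        self.jiffies += self.pending
        self.pending = 0


timer = TimerModel()
timer.tick()
timer.block()
timer.tick()
timer.tick()
delivered_while_blocked = timer.jiffies
timer.unblock()
```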
22 Jun 2001
Some time around ac16 or ac17, someone added a call to linux_booted_ok() which the ports have to implement. So, I sent the patch to Alan today.

And a short bit later, I got a reply saying not to bother. The linux_booted_ok thing was a temporary test that's going to be removed. So, it won't appear in my pool, but if you absolutely want to run the ac16/ac17 UML, apply that patch.

I spent the better part of the afternoon in IRC trying to figure out the hang that mistral is seeing. No joy, but I did learn more about the problem. I'll attack it again later.

21 Jun 2001
gdbbot got its first test yesterday when I looked at the problem that Chris Emerson is having with UML/ppc hanging during boot. I didn't find the problem, but was able to check that signal delivery (which was what I thought was broken) was working fine. The next step will be to do a post-mortem on the hang.
20 Jun 2001
I wrote an IRC gateway for gdb. This allows a gdb (like the UML kernel debugger) to be controlled from an IRC channel. The intent is that if someone sees a bug that I can't reproduce, but want to look at, that person's UML gdb can be attached to an IRC channel where I can poke around and see what's going on.

I also integrated Lennert's management console patch. This is a very low-level interface to the kernel (like the i386 SysRq interface). The main use for it right now is to hot-plug devices. At this point, only the ubd driver and gdb support this. So, you can add and remove block devices from your UML without having to reboot it. You can switch gdb in and out the same way. I will also do the consoles, serial lines, and network interfaces at some point as well.

15 Jun 2001
The two patches I sent to Alan yesterday are in 2.4.5-ac15. Alan horribly mangled Harald Welte's name, unfortunately.

Today was a patch bashing day. I merged in a good number of the patches in my queue.

Today was also the (extended) deadline for abstracts for ALS2001. So, I sent one in. This is the most explicit that I've been so far about my future development plans for UML. So, if you want to see how weird things are going to get, read all about it here.

14 Jun 2001
IBM put out a Linux security whitepaper in which UML gets a pretty lame mention (down towards the bottom, there's some prose which is basically lifted from my site). Thanks to Bill Stearns for spotting it.

I'm finally getting around to sending off the latest stuff to Alan. The ac tree is now two cvs updates behind. The first set will be the -5um update, which is basically

  • the mcast transport plus some other network cleanup
  • some random fixes, including an updated defconfig, making the console xterms go away when the machine shuts down, making a read-only hostfs really read-only, hooking up a couple of new system calls, allowing UML to boot on hosts with a 2G/2G address space split
12 Jun 2001
Banged out a bunch of bugs. I started booting UML with 24 megs and plenty of swap, and running a whole bunch of stuff on it to overload it and put it heavily into swap. This turned up a couple of signal delivery races and a swapping bug. The signal races would cause various strange behavior. Mostly what I saw was hangs with an infinite sequence of sigreturns. The swap bug caused pages not to be unmapped when they were swapped out. Obviously, this is very bad. With the help of rcastro, I fiddled my page table macros to fix this. I'm still seeing process segfaults. It looks like pages are being swapped out and swapped back in with the wrong data.
8 Jun 2001
I fixed a bunch of buglets, like the console xterms not going away, readonly hostfs not being readonly, merged Harald Welte's mcast network transport, and a few other things, and checked them into CVS. I also checked in the tools, so everything ought to be up-to-date and consistent at this point.

I also have the .deb build procedure working, I think. The uncertainty is due to the fact that I think there's a hostfs data corruption problem. My development box runs Red Hat, and I couldn't find RPMs for the Debian tools, so I just installed them in my Debian filesystem (apt-get rocks, BTW :-), mount the source pool inside a Debian UML via hostfs, and run the debian build procedure there. The problem is that the gzipped source tarball has its md5sum recorded at the beginning of the build and checked again at the end, and they don't match. I also ran md5sum three times in a row on that file while the builder was running, and got three different answers. So, it looks like I have some debugging to do there.

3 Jun 2001
True to yesterday's promise, I sent Alan three more patches
2 Jun 2001
From the changelog, it looks like yesterday's patches are in ac7. Time to start generating more...
1 Jun 2001
Another day, another set of patches for Alan. Today, the lucky winners are
30 May 2001
Alan put yesterday's patch into ac5, so UML should build and run again. Thanks to Arjan van de Ven for telling me about that.
29 May 2001
It turns out that I messed up the patches somewhat. So, this is the patch for ac4. With it, UML will build and run again.
28 May 2001
Alan apparently put all 10 of yesterday's patches in 2.4.5-ac3 (but seems not to have dignified the added comment with a changelog entry).

I wrote up the new networking. Check it out here.

27 May 2001
I got 2.4.5 merged in, and mostly released. I'll probably finish it and announce it tomorrow.

I also synced up with Alan by sending him a whole pack of patches, to wit:

26 May 2001
I got the networking cleaned up enough that I'm happy for the general public to use it. There are three host transports, ethertap, the routing daemon, and slip. You can have the helper do the host setup for you or not. If you do, then getting the network running is a matter of a command line switch, ifconfiging the device, and setting routes inside UML. This is a huge usability improvement over the previous situation.

This is all checked in, and I'm currently building 2.4.5, which I'll release in the next day or two.

18 May 2001
I fixed the slip interface, cleaned out some unused code which had become a portability problem, and fixed the fix for the crash caused by someone typing at the console too soon. It is all checked in to CVS.
17 May 2001
I grabbed 2.4.4-ac11 to see if Henrik's patch was in there, and it was. So I don't have to worry about it any more. I guess it made ac9, but Henrik didn't get credit for it in the ac changelog.

In other news, the ethertap interface is working reasonably well. It couldn't do HTTP until I figured out that the mtu on host tap device needed to be 16 bytes less than the UML eth0 mtu. The helper is now more helpful. In order to talk to the rest of the world through it, you basically just have to ifconfig the device inside UML and add a route to the outside world, and you're done. Much better than what we had before.

13 May 2001
Five of yesterday's six patches made it into 2.4.4-ac9. The lonely exception was Henrik's hostfs blocksize fix.
12 May 2001
I decided to clean out my patch backlog a bit. So, I merged and sent to Alan the following patches:
11 May 2001
Chris Emerson got UML/ppc booting to a shell prompt! His uml-devel post is here. This is the first UML port, and it showed me how to make UML portable. There aren't really all that many non-portable things in UML, so a port doesn't take all that much code. Based on his work, I'm going to write up a UML porting guide, which will be found here when it's done. If that link is dead, keep trying until I have something to put there.

In other news, I fiddled the ethertap driver backend so that the read hang has gone. With some help from Bill Stearns, I also figured out how to talk to the rest of my network through the ethertap device.

9 May 2001
I got the ethertap backend to the network driver working today and I submitted it to CVS. I haven't been able to get it to talk to anything but the host over the tap device, but it communicates with the host just fine.
4 May 2001
My 2.4.4 fixes, except for Andrew Morton's exitcall fix, are in 2.4.4-ac3.

I wrote and submitted my OLS paper yesterday, two days late. It's also posted on this site, as TeX and HTML.

On the network driver front, I've got the unified front-end plus the slip back-end working. I've started working on the ethertap back-end. After that will come the socket and TUN/TAP back-ends. This stuff is in CVS, but I haven't updated the patch because the ethernet driver is broken, and I don't want a bunch of complaints from people who grabbed the latest patch without knowing what was in it.

Update: Andrew's patch made it into 2.4.4-ac5. I was beginning to wonder. That cleans out my pending ac patches.

28 Apr 2001
Linus released 2.4.4 yesterday, so I released the 2.4.4 UML today. No major changes - I dropped in the semaphore changes I was keeping in my ac tree, and I added an mm argument to pgd_alloc(). When I built the RPM (which uses a different configuration), I noticed that hostfs didn't compile any more, and UML didn't compile with CONFIG_PT_PROXY turned off. These fixes aren't in CVS or the patch yet.

Those changes, plus Andrew Morton's exitcall fix, are off to Alan.

27 Apr 2001
The last batch of patches I sent to Alan made it into ac14.

I started looking at the two network drivers today. I think it won't be too hard to merge them. They're pretty similar, since they're both derived from the same code base, and the differences seem to be orthogonal. They don't seem to have done the same things in fundamentally different ways. I posted my impressions for the devel list to comment on.

Andrew Morton looked at the shutdown crash that people started seeing lately and figured out that it was caused by /proc being unregistered before something else tried to remove its proc entries when it was unregistered. He sent in a patch which reversed the __exitcall order, and Henrik Nordstrom reported that it fixed the crash for him.

22 Apr 2001
The fixes I made on Thursday were broken. The initrd fix introduced a name clash with a function in hostfs, and the sleep fix made sleep always hang. I didn't notice because I was fixated on getting UML to boot from an initrd image, and that wasn't obviously showing the problem.

Anyhow, I made the fixes, submitted them to CVS, updated the patch, and sent fixes off to Alan. I hadn't sent in Thursday's changes to Alan, so the patches are the real thing, not just patches to the patches.

The patches sent off to Alan today add initrd support, fix the sleep bug, and make UML build and work with the generic rw semaphores.

18 Apr 2001
The patches I sent in a couple days ago are all in ac10 by the looks of Alan's change log.

I figured out how initrd support is supposed to work, and implemented the necessary stuff in UML. I booted a RH initrd image far enough to convince myself that it works.

I also figured out the sleep hang. It turns out to be a race between the registration of the timer irq and the first time the timer interrupt calls do_IRQ. The timer was enabled before the registration, so if an interrupt happened in that window, do_IRQ would bail out early, leaving the irq permanently marked as in progress and pending. This locked out all future timer interrupts from going through the irq system, so counters would never be decremented, and sleeps would never wake up.
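The lockout is easier to see in a toy model. The sketch below is an assumption-laden simplification in Python, not the real do_IRQ: the `IrqModel` class and its flag handling are invented to mirror the description above, where an early bail-out leaves the in-progress and pending marks set forever.

```python
class IrqModel:
    """Toy model of one irq line (hypothetical, heavily simplified)."""

    def __init__(self):
        self.handler = None
        self.in_progress = False
        self.pending = False

    def register(self, handler):
        self.handler = handler

    def do_irq(self):
        if self.in_progress:
            # Someone else is assumed to be handling it; just note the interrupt.
            self.pending = True
            return
        self.in_progress = True
        self.pending = True
        if self.handler is None:
            # Early bail-out: the flags are never cleared (the modelled bug).
            return
        while self.pending:
            self.pending = False
            self.handler()
        self.in_progress = False


def run(register_first):
    """Deliver four timer interrupts and count how many reach the handler."""
    irq = IrqModel()
    ticks = []
    if register_first:
        irq.register(lambda: ticks.append(1))
    irq.do_irq()  # first timer interrupt
    if not register_first:
        # Registration happens only after the timer has already fired once.
        irq.register(lambda: ticks.append(1))
    for _ in range(3):
        irq.do_irq()
    return len(ticks)
```

With the handler registered first, all four interrupts are handled; with the timer enabled first, none are, which is the "sleeps never wake up" symptom.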

17 Apr 2001
The UML build fixes made ac6. However, the rw semaphores in ac7 broke UML again. I sent Alan the fix for that yesterday, and it made ac9 later in the afternoon. I also sent a few other patches which fix the gcov and gprof support, add support for external debuggers, and clean up the umn driver a bit.
12 Apr 2001
They didn't make -ac5. Oh well.

I had a chat on #kernelnewbies with Rodrigo de Castro, who's using UML for his compressed caching project. He understands swapping better than I, and told me why my new pte bits were breaking it. So, I fixed it, and swapping now seems to work.

11 Apr 2001
Sent Alan the patches necessary for UML to build and run in his tree. I got back a reply which said in its entirety, "ok", which I think is good. Maybe they will make -ac5.
10 Apr 2001
UML is now in 2.4.3-ac4. I was on IRC with Alan and a bunch of other hackers when he merged it. He looked like he was going to start asking a bunch of embarrassing questions about my locking, but he was concerned only about one thing, and that was a special case that didn't need locking.

Too bad it doesn't build. The patch that Alan merged was against the Linus 2.4.3 tree, which differs in a few respects from the current -ac tree.

8 Apr 2001
Released 2.4.3 a week or so late. Blame Linus for releasing it the night before the kernel summit officially started. We were all in San Jose and not able to react.
4 Apr 2001
A couple more summit tidbits that I forgot to mention in my last entry:
  • Willy is thinking about using UML as a testbed for NUMA support. He wants to fire up a number of virtual machines and have them hook themselves together so they can access each other's memory through device files. This would allow people who don't have access to the fancy hardware to develop and debug Linux support for these boxes.
  • UML may appear in the -ac trees at some point. He wanted to include it, but I had sounded fairly negative towards that in the past. What I don't want just yet is for UML to hit the Linus tree. Alan said he doesn't send stuff to Linus if the author doesn't want it sent, which is fine by me.
2 Apr 2001
Back from the kernel summit. I wanted to get a feel for whether four things that I wanted from the host kernel were reasonable. I got two OKs and two dings. That's fine, since the OKs were the important ones. Here's the run-down:
  • Userspace manipulation of address spaces: I want to be able to create, populate, release, and switch between mm_structs. This will speed up UML context switches, and greatly clean up that code. I asked Linus, and he said OK to the fairly static things that I want to do. Apparently, there are serious complications when fiddling with the address space of another process, but that's not what I want to do.
  • System call interception via signals: In order to avoid the context switching between threads involved in virtualizing a system call, I want to have a process intercept its own system calls by having the host kernel deliver a signal whenever it makes a system call. The handler would be the current syscall_handler, which would read the arguments from its sigcontext_struct. This would change a system call virtualization from four context switches to a signal delivery and return. I infer Alan's OK on this from my describing it in his presence and him not objecting.
  • Notification when a UML thread sleeps in the kernel due to a page fault: For the sake of cleanliness and completeness, I want to be able to have UML know when a thread is sleeping in the kernel and be able to call schedule when that happens. This would let UML do as much work as possible given its state of memory residence. Alan rejected this on the grounds that UML would be the only sane user of this mechanism.
  • Full kernel preemption: This was implicitly rejected as a UML need by Alan's rejection of the previous item. If UML is to call schedule whenever it sleeps, the whole kernel needs to be preemptible because the swapped-out page might be a kernel page. This doesn't at all mean that preemption isn't going to happen. Rather, it means that UML doesn't have a particular need for it.
Other tidbits:
  • A number of people consider UML a very neat hack, including Ben LaHaise, Andrea Arcangeli, and Eric Raymond.
  • Alan turns out to be a UML user. For the last month or so, he's been booting his kernels as UML kernels before booting them as native kernels. This is in part because recovery from a totally messed up kernel is a lot easier with UML than with a native kernel. UML is also his ptrace test case. It apparently does things with ptrace that nothing else tries.
  • Al Viro is thinking about porting UML to Plan 9. He asked me about what it would take. He had thought through the ptrace requirement, and I told him about the mmap requirement, which is the next hurdle. Plan 9 apparently doesn't have mmap. He's going to think about how to do that.

On the trip over, I did some debugging, and I also threw in some patches. There is now a "umid=<name>" switch for providing a virtual machine with an identifier. This causes a pid file to be created using that name, which is something that makes controlling multiple UMLs through a nice UI a lot easier. This file will be replaced with a socket to the machine console that Lennert is working on.

I also implemented __exitcall, which declares a procedure which is to be called on machine shutdown. This was prompted by the need to remove the pid files when the virtual machine goes away. I also converted other existing cleanups to use this mechanism.
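The pid file and its cleanup can be sketched together. This is a userspace analogy in Python, not kernel code: the `exitcall` decorator and `shutdown` function are stand-ins for the real mechanism, and the `<umid>.pid` file name and the `debian1` identifier are assumptions made for the example.

```python
import os
import tempfile

_exitcalls = []


def exitcall(func):
    """Register func to be called at (simulated) machine shutdown."""
    _exitcalls.append(func)
    return func


def shutdown():
    # Run every registered cleanup when the virtual machine goes away.
    for func in _exitcalls:
        func()


def create_pid_file(directory, umid):
    # The "<umid>.pid" naming is a guess used only for this sketch.
    path = os.path.join(directory, umid + ".pid")
    with open(path, "w") as f:
        f.write(str(os.getpid()) + "\n")
    return path


workdir = tempfile.mkdtemp()
pid_file = create_pid_file(workdir, "debian1")


@exitcall
def remove_pid_file():
    os.remove(pid_file)


existed_before = os.path.exists(pid_file)
shutdown()
exists_after = os.path.exists(pid_file)
os.rmdir(workdir)
```

A control UI could then find a machine by name by looking for its pid file, and a clean shutdown leaves nothing behind.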

25 Mar 2001
I checked in a bunch of changes again. Henrik Nordstrom provoked me into making it possible to use hostfs as a root directory by sending me a patch that did it, but which was wrong (IMHO). He did it in the same way that nfsroot and initrd support is done, which is by adding a special block of code to fs/super.c inside CONFIG_HOSTFS_ROOT. That works fine, but I didn't want to annoy Al Viro. What I did instead was to add a second registration of hostfs as a device (not a virtual) filesystem and change the ubd driver to support being given a directory rather than a file or block device. What happens is that when a read request comes in to the ubd driver, it is guaranteed to be a request for the superblock. The driver constructs a fake superblock with the directory name in it. hostfs recognizes that and claims the mount as its own. After that, it goes back to being a normal virtual filesystem and doesn't bother the block driver again. The involvement of the ubd driver is a bit of a kludge, but it works well on the command line, and I can't think of anything better besides some kind of general support for virtual root filesystems that cover nfs and initrd as well as hostfs.

There were also a number of patches from other people: Lennert Buytenhek's modify_ldt patch, a bunch from Greg Lonnon, one from Gordon McNutt, and a buffer overrun patch from Henrik Nordstrom.

23 Mar 2001
I spent a few days fixing the infinite recursive context switch bug. That was a lot more complicated than I expected. The fix involved replacing the shadow page tables that represent the mappings on the host for each process with bits in the pte that say whether it is up-to-date or not. These bits are set in the little functions that change ptes and cleared in fix_range after it's updated the mappings. Since the process page tables are per-mm and not per-process, a mapping that was changed in a multi-threaded process would only be updated for one of the threads. This meant that UML processes that share a memory context also need to share a memory context on the host. This in turn complicated exec, since it now needs to create a new host process in order to get out of a shared UML address space. I implemented this a few times, and on about the third try, I got something that works.

So, aside from eliminating a nasty bug, this also makes the modify_ldt fix more useful, since it now should work properly without any extra code, and opens the way to more efficient context switching between threads, since they won't have to go through the remapping that processes need.
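The pte-bit scheme can be modelled roughly as follows. This is a sketch in Python under assumptions: `Pte`, `Mm`, the `changed` bit, and the dictionary standing in for host mappings are all invented here. Only the names `fix_range` and the idea of setting a bit when a pte changes and clearing it after remapping come from the description above.

```python
class Pte:
    def __init__(self):
        self.page = None
        self.changed = False  # hypothetical "host mapping is out of date" bit


class Mm:
    """Toy address space; page tables are per-mm, as in the text."""

    def __init__(self):
        self.ptes = {}       # virtual page number -> Pte
        self.host_map = {}   # what the host currently has mapped
        self.remaps = 0      # how many host mappings were redone

    def set_pte(self, vpn, page):
        # Every function that changes a pte also marks it as changed.
        pte = self.ptes.setdefault(vpn, Pte())
        pte.page = page
        pte.changed = True

    def fix_range(self, start, end):
        # Update only the host mappings whose ptes are marked, then clear the bit.
        for vpn in range(start, end):
            pte = self.ptes.get(vpn)
            if pte is None or not pte.changed:
                continue
            self.host_map[vpn] = pte.page
            pte.changed = False
            self.remaps += 1


mm = Mm()
mm.set_pte(1, "A")
mm.set_pte(2, "B")
mm.fix_range(0, 4)
first_pass = mm.remaps
mm.fix_range(0, 4)      # nothing changed, so nothing is remapped
second_pass = mm.remaps
mm.set_pte(2, "C")
mm.fix_range(0, 4)
```

Because the bit lives in the pte itself, no separate table has to be allocated during a context switch.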

18 Mar 2001
Lots of bugs have been fixed. I got a little list of hostfs complaints from Al Viro, which I think I fixed. hostfs is now pretty solid. I fixed the naming problem which cropped up if you held a file open, then moved its directory and accessed that file by its new name. You'd get 'file not found'. This is because I stored the full host pathname in the inode, and when you changed its name while holding it open by the old name, the inode continued to contain the old bogus name. This was fixed by having anything that needs a filename walk the dentry tree back up to the root, constructing the current filename. The other major problem was that readdir didn't work, resulting in missing files when a directory was copied. These are fixed. What remains is to get rid of some interfaces which will complain about not being implemented.
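The naming fix amounts to computing the host path on demand instead of caching it. A minimal sketch in Python, assuming a simplified `Dentry` with only a name and a parent; `host_path` and the example paths are hypothetical.

```python
class Dentry:
    """Simplified directory entry: just a name and a parent link."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent  # None marks the root of the mount


def host_path(dentry, host_root):
    # Walk back up to the root, collecting names, then reverse them.
    parts = []
    while dentry.parent is not None:
        parts.append(dentry.name)
        dentry = dentry.parent
    parts.reverse()
    return "/".join([host_root.rstrip("/")] + parts)


root = Dentry("")
directory = Dentry("old", root)
open_file = Dentry("notes.txt", directory)

before = host_path(open_file, "/home/user")
directory.name = "new"   # the directory is renamed while the file is held open
after = host_path(open_file, "/home/user")
```

Since nothing stores the full path, a rename higher up in the tree is picked up the next time the name is needed.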

The signal delivery race is fixed. That induced me to clean up a lot of old, crufty code in the kernel entry and exit paths. That's sensitive code, and a few bugs in it caused some very selective and very strange behavior.

I've put together an RPM for UML just in time for the April Linux Magazine to hit the streets with my article in it. This is good, because the article claims that RPMs are available, which they weren't at the time that I wrote it. This also goes some way towards simplifying the network mess. The RPM installs the umn_helper, which lets the umn device run without any help from the user. It also installs the eth tools, which are otherwise hard to find unless you pull them from cvs.

25 Feb 2001
CVS update today. I fixed a few bugs and cleaned up a bunch of things.

I've started keeping an up-to-date TODO list. This will help me not forget anything important. I post it to the -devel list occasionally to prompt people to send in whatever gripes they have.

24 Feb 2001
I'm releasing 2.4.2 today. It has a number of bug fixes and no significant functionality changes.

A number of bugs have cropped up lately. The most significant is a race when a process signal handler returns. There is a narrow window in which an interrupt can cause a crash. The fix is to implement sigreturn like the other arches and run almost all of the kernel code on the kernel stack rather than the process stack as I'm doing now.

8 Feb 2001
I managed to reproduce a number of panics and fixed all but one of them. The key was hitting UML with a high-concurrency ab run with requests that fire off perl scripts which make MySQL requests, with not too much memory, so that it is at least starting to swap.

This reproduced two bugs. One was caused by a failed memory allocation in the middle of setting up a tracing thread request. The failure caused a schedule, which caused a switch request, which blew away the first, partially-set-up request. When the process was rescheduled, its request was garbage, confusing the tracing thread into detaching it. This was fixed by moving the allocation to before the request started being set up.

The other one, which isn't fixed yet, is caused by the shadow page tables maintained by arch/um/kernel/tlb.c. It occasionally needs to allocate a page table when it sets up ptes for a new range of memory. However, if the context switch that it's dealing with was forced by low memory, then that allocation will fail, causing a recursive context switch, and recursion continues until either the stack guard page is hit, or, in the case of a kernel thread, the task structure is polluted. I'm going to fix this by following a suggestion by prumpf, which is to use some spare bits in the pte rather than a separate page table to figure out what parts of the address space need updating.

And, panic number three, which is also fixed, was caused by a faulty notion of when a thread is in kernel space. The old way was to look at whether the thread is being traced. That fails when a breakpoint was put in a signal handler before it requested that tracing be turned off. The fix is to look at the current stack pointer. However, that causes problems when a signal is being delivered to a process. In this case, there is kernel code running on the process stack. So, a flag was added to the thread structure when this is happening.
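The resulting check can be written as a small predicate. This is a sketch in Python, not the UML code: the function name, its parameters, and the addresses in the example are all hypothetical.

```python
def in_kernel(sp, kstack_start, kstack_size, signal_on_process_stack):
    """Decide whether a thread is running kernel code (toy version).

    sp                       current stack pointer
    kstack_start/size        bounds of the thread's kernel stack
    signal_on_process_stack  flag set while signal delivery runs kernel
                             code on the process stack
    """
    on_kernel_stack = kstack_start <= sp < kstack_start + kstack_size
    return on_kernel_stack or signal_on_process_stack


KSTACK = 0x50000000   # made-up address
KSIZE = 0x2000        # made-up size

on_stack = in_kernel(KSTACK + 0x100, KSTACK, KSIZE, False)
user_code = in_kernel(0x08048000, KSTACK, KSIZE, False)
delivering = in_kernel(0x08048000, KSTACK, KSIZE, True)
```

The flag covers the one case where the stack pointer alone gives the wrong answer.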

29 Jan 2001
Back from Sydney. The talk went pretty well, it was well attended, and there was a fair amount of interest in UML there. Rik van Riel and I wandered around Sydney until the following Friday.

The slides from my talk are available here.

While I was in .au, my OLS paper proposal was accepted, so it looks like I'll be doing my song and dance in Ottawa this summer.

3 Jan 2001
Updated the web site with a couple pages describing hostfs and the new console/serial line input specification.

The hostfs memory corruption problems are fixed. slab debug found them for me. They turned out to be two string buffer overrun bugs. I'll release a patch with the fixes pretty soon.

1 Jan 2001
I released the uml patch for 2.4.0-prerelease. You can find it here. I'm going to make the full release tomorrow, hopefully after fixing the hostfs crash and getting socket inputs to work.

Today is the deadline for a first draft of my Linux Magazine article and for OLS paper proposals. I sent them both in last night. We'll see what happens.

27 Dec 2000
hostfs now pretty much works. I built UML from inside itself on hostfs. I fixed some bugs in the write code, added enough mmap support to run binaries from hostfs, and implemented statfs. However, there is some as-yet unexplained memory corruption going on.

In an attempt to reproduce the MySQL problems that a couple people are seeing, I moved some of my work, which is heavy on MySQL and perl, into UML. I've seen no problems, which is disappointing because I'm not any closer to finding the bug, but also nice because it shows that it's possible to do real work inside it.

I've also been banging on the ethernet driver trying to reproduce the server buffer overflow that I saw earlier. No dice there either.

The swapoff bug is now fixed. It turned out to be a bad idea to give kernel threads both a non-NULL mm and active_mm. That code has been that way for ages. I have no idea when or why it became a bad thing.

9 Dec 2000
hostfs is now almost all working. mknod doesn't work, and you can't run binaries out of a hostfs filesystem.

I also fixed that pesky linking failure that people have been seeing sporadically for a while. I noticed that profiling was turned on in the latest case that showed up in my inbox. I did a profiling build of my own and lo! it failed to link. Since I could reproduce it, I was out of excuses for not fixing it, and so I did. You can see a full explanation of the problem here.

Those changes plus a couple of smaller ones are now in CVS. They aren't in the latest patch because the SourceForge upload system has been seriously b0rked. I'll update the patch when I can.

7 Dec 2000
I fixed the known bugs in the block driver. The
UML# dd if=/dev/ubd/0 of=/dev/null
hang was due to the driver returning to the block layer rather than continuing to process the queue when it found an out-of-range I/O request. The dbench corruption was due to the elevator rearranging the request queue while a request was in flight. When that request finished, the interrupt handler was supposed to retire it by removing it from the head of the queue. The problem is that the elevator put some other request at the head, and that request was retired without ever being done. Meanwhile, the original request was pushed back in the queue somewhere, and it got done twice.
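
One way to avoid that kind of mix-up is for the driver to remember which request it actually started and retire that one, wherever the elevator has since moved it. The sketch below illustrates the idea with a toy singly linked queue; the types and function names are hypothetical, and the diary does not say that this is how the ubd driver was actually fixed.

```c
#include <assert.h>
#include <stddef.h>

/* Toy request and queue types; not the kernel's structures. */
struct request_sketch {
    int id;
    int done;
    struct request_sketch *next;
};

struct queue_sketch {
    struct request_sketch *head;
    struct request_sketch *in_flight;  /* the request the I/O was started for */
};

/* Start I/O on the current head and remember which request it was. */
static struct request_sketch *start_io(struct queue_sketch *q)
{
    q->in_flight = q->head;
    return q->in_flight;
}

/* Completion handler: unlink the remembered request rather than
 * whatever happens to be at the head of the queue by now. */
static void retire_in_flight(struct queue_sketch *q)
{
    struct request_sketch **link = &q->head;

    assert(q->in_flight != NULL);
    while (*link != NULL && *link != q->in_flight)
        link = &(*link)->next;
    if (*link != NULL) {
        (*link)->done = 1;
        *link = (*link)->next;
    }
    q->in_flight = NULL;
}
```

With this shape, a request inserted ahead of the in-flight one stays queued and undone, and the in-flight request is completed exactly once.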

Dan Aloni has started the Windows port. He got most of the kernel to compile. There are a number of undefined symbols from files that don't compile yet. Overall, though, it's looking pretty good.

30 Nov 2000
I updated the site a little. The major changes involved the "ARCH=um" build change. The compilation page is now very explicit about that and there's a FAQ entry for it.

I fixed up the block driver a little. In the past, if you did

UML# dd if=/dev/ubd/0 of=/dev/null
when it ran off the end of the device, it could apparently hang. This is fixed. The problem with dbench is not fixed, but I made the driver's synchronous mode accessible from the command line with the "ubd=sync" switch. In synchronous mode, the driver has no problems with dbench.
18 Nov 2000
I've got hostfs starting to work reasonably. ls now works, you can cd around and cat things. You can't write anything, create files, or execute them yet.
17 Nov 2000
After a bit of a hiatus, I did a CVS update. A number of buglets relating to running UML as a daemon were fixed. The build was cleaned up - I had hard-coded "gcc" instead of "$(CC)" in my Makefiles, the top-level Makefile is now able to do native and user-mode builds, and I cleaned up the drivers and fs Makefiles so that they let Rules.mk do all the hard work.

I'm also back on hostfs. I fixed the mm problems that it uncovered. It can now do ls on the top-level directory.

1 Nov 2000
Linus finally released the final test10 yesterday, so I made my release last night, with a freshmeat announcement this morning. The stack overflow problems in test9 are fixed by doubling the stack size. There is also an inaccessible page between the two stack pages and the task structure, so there shouldn't be any task structure corruption.

There were a number of other fixes. At the last minute, I found and fixed a nasty race which led to the kernel tracing its own system calls, causing stack corruption which made it hard to figure out what happened. UML can now run when its main console is not a terminal (e.g. /dev/null). That didn't work because it flipped the terminal between raw and cooked mode, complaining via printk if the ioctls failed. That led to an infinite recursion of printk error messages which ultimately resulted in a segfault. I also made it possible to mount host devices again. That was broken when I made the block driver check IO requests against the device size so it could report errors for out of bounds IO. It turns out not to be possible to get the size of the media behind a block special file, as far as I can tell. So, as far as the block driver was concerned, block devices had zero size, and all IO was out of bounds.
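
For the console case, a guard of the following shape avoids issuing terminal ioctls on something that is not a terminal. This is only a sketch using the standard termios calls; the function name and return convention are made up here, and the diary does not describe the actual fix in this much detail.

```c
#include <assert.h>
#include <termios.h>
#include <unistd.h>

/* Put fd into a raw-ish mode (no line buffering, no echo).
 * Returns 1 if the mode was changed, 0 if fd is not a terminal and was
 * left alone, -1 on error. The old settings are stored in *saved. */
static int console_set_raw(int fd, struct termios *saved)
{
    struct termios raw;

    assert(saved != NULL);
    if (!isatty(fd))
        return 0;               /* e.g. /dev/null or a pipe: nothing to do */
    if (tcgetattr(fd, saved) != 0)
        return -1;
    raw = *saved;
    raw.c_lflag &= ~(tcflag_t)(ICANON | ECHO);
    if (tcsetattr(fd, TCSANOW, &raw) != 0)
        return -1;
    return 1;
}
```

Treating "not a terminal" as a quiet no-op rather than an error means there is nothing to complain about on the console, which is what set off the recursion in the first place.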

I also started work on the hostfs filesystem. This is a virtual filesystem which provides access to the host filesystem. The theory is straightforward - vfs calls are converted into the equivalent system calls on the host - but this uncovered a subtle memory management bug. If a libc routine which mallocs memory is called, and the break is increased, that extra memory only exists in that process. If the kernel in another process tries using that memory (or tries calling malloc at all), it will fault. What needs to happen is for the context switching code to see if malloc has increased the size of the data segment and map the new memory into the newly running process. This also raises some SMP issues because when the new memory is mapped in, the other processors will need to be told about it so they can also map it. The same is true of the kernel's virtual memory.
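
The check that the context switching code needs can be sketched in user space with sbrk(0), which reports the current break. The structure and function names below are hypothetical, and the sketch only detects the newly grown range; mapping it into the incoming process is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

/* Address range by which the data segment has grown. */
struct brk_range {
    uintptr_t start;
    uintptr_t end;
};

/* Current program break as an integer. */
static uintptr_t current_break(void)
{
    return (uintptr_t)sbrk(0);
}

/* Compare the break against the value seen at the last switch.
 * Returns 1 and fills *grown if it increased, 0 otherwise. */
static int break_grew(uintptr_t *last_seen, struct brk_range *grown)
{
    uintptr_t now = current_break();

    assert(last_seen != NULL && grown != NULL);
    if (now <= *last_seen) {
        *last_seen = now;       /* track shrinks too */
        return 0;
    }
    grown->start = *last_seen;
    grown->end = now;
    *last_seen = now;
    return 1;
}
```

A context switch would call break_grew() with the value recorded at the previous switch and, when it returns 1, map the reported range into the process about to run.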

20 Oct 2000
At long last, I added a page for related projects and other interesting links.

In other news, it turns out that Michael Vines wrote a Linux executable runner for Windows that does what a UML port to Windows would have to do. He has GPL-ed it and made it available for anyone who wants to incorporate it into a UML Windows port. See the todo page for a link to his stuff.

17 Oct 2000
Back from ALS. The talk went pretty well. I'll put the slides up on the site at some point.

I fixed the stack overflow problems that people were seeing. The stack is now two pages long, with an inaccessible third page protecting the task structure, which is on the fourth. Now, any stack overflows will segfault rather than polluting the task structure, making them a lot easier to debug. This is in CVS along with a few other changes.

2 Oct 2000
Bill Stearns decided to go overboard on root_fs production. He's been fiddling with the mkrootfs script so that it can handle distros other than Red Hat 6.x. He's done Red Hat 7.0, Mandrake, and Immunix. These are all now available from the project download page. Caldera, Conectiva, and SuSE are in the works.
26 Sep 2000

SGI released a new version of XFS for test5 and I tried to apply it to my test8 um pool, the idea being that I could play with xfs in userspace. The patch went in ok, with some rejects that were not too hard to figure out. After some work, I got it to build. It didn't boot, though. There were some changes in ll_blk_rw.c that I didn't understand, and it looks like they are what resulted in the block device getting a NULL buffer to do I/O into.

So, maybe I'll give XFS another try when SGI gets it slightly more up-to-date.

25 Sep 2000

I found out why the kernel debugging interface doesn't handle breakpoints very well. Setting breakpoints results in process segfaults, floating point exceptions, and other strange behavior. It turns out that do_syscall stored the current register state in the thread structure while determining whether the process was doing a system call. If the process hit a breakpoint in the kernel instead, then that overwrote the state that was stored when the system call was called. When the system call returned, that bogus state was restored, and the process was essentially teleported back into the kernel just after the breakpoint, leading to all kinds of strange behavior. With that problem fixed, things work much better. The kernel debugger seems to be basically healthy, and works just like on a normal process.

While I was fixing breakpoints, I decided to see why gdb inside a virtual machine crashes it whenever it sets a breakpoint. There turn out to be a number of problems. First, SIGTRAP wasn't being delivered to the debuggee when it hit a breakpoint. This made it hard for gdb to find out that the breakpoint had been hit, and to remove it temporarily so the debuggee could get by it. Then, it turned out that PTRACE_SINGLESTEP wasn't implemented. This is used by gdb to execute the instruction which had the breakpoint and stop on the next one. There were one or two other buglets, but now that they are fixed, gdb seems happy with breakpoints.

23 Sep 2000
So, I've been a little lax. Here's what's happened in the last few weeks: two bugs were fixed, the reboot bug and the shell segfault bug. That's it.
1 Sep 2000

I realized that I am starting to lose track of bugs and functionality requests, so I dusted off the project's bug tracking system and put everything that I know of in it. I'm also using the patch manager to store the fixes. The idea is that I'll put fixes there and close them when I make a release that contains the fix.

I fixed a context-switching bug noticed by Lennert Buytenhek. The problem turned out to be a race while updating the address space of the process being restarted. If the interrupt handler needed data from the kernel's vm area, and that area hadn't yet been updated, then the kernel would crash. The fix was to disable signals during that period of the context switch.

31 Aug 2000

I put in Andrea's LFS patch. While I was in there, I cleaned that code up somewhat. That is some of the oldest code still remaining, and it really needed some work. I also put in the fix for the crash caused by a module creating a kernel thread. No word yet on whether it's the right fix, though.

Also, Laurent Bonnaud volunteered to update the ancient filesystem in the Debian package to potato. This is very cool. It is something that I've been wanting to do for a long time.

25 Aug 2000

I'm releasing test7 today.

My ALS2000 paper is now available as HTML and TeX.

I redid the RH mkrootfs script. It now prompts for the info it needs. It also works for RH6.1 and probably RH6.2, although I didn't test that.

23 Aug 2000

Finished my ALS paper and sent it in. That's a load off my mind. Made some more CVS checkins. This makes the various debugging options configurable, although I haven't tested the gprof and gcov configurations. I also added some compatibility code to make the Debian install happier. It now recognizes that uml can have disks, but the disk recognizer gets stuck in state 'D' for reasons I haven't figured out yet.

Linus just released test7, so I am building it right now (right now as I'm writing this, and not right now as you're reading it, because I've got no idea when you're reading this). A quick check of the patch shows no changes that I need to worry about, so this looks like a drop-it-in-and-it-just-works patch. We'll see...

Things look good. I booted it up, and ran a few things, and they all worked. So, I'll run the stress testers on it tomorrow, and if that checks out, I'll release it.

21 Aug 2000
I've made the ptrace proxy, gprof, and gcov support configurable. Also started playing with the Debian 2.2 install. It starts up ok, which the 2.1 install doesn't. It looks like I'll have to fake some /proc/ide entries before it will deign to admit that the virtual machine has disks. Right now, it's punting me into the diskless install.
17 Aug 2000
Fixed a network driver bug which caused a crash when ab was run against it. This might also fix the ping flood problem. The fix is in CVS. It will appear for real in test7.
15 Aug 2000
I found out what was causing uml not to boot. It turned out to be a casting bug which was making the compiler do pointer arithmetic rather than integer arithmetic. This was a long-standing bug, and test6 changed things so it got hit more heavily. So, assuming that it's not too badly broken now, I'll release test6 for real.
8 Aug 2000
Revamped the website. It will be put up as soon as Linus releases test6 and I've integrated the changes in. This is because this site talks about stuff which isn't really going to be released until then.
7 Aug 2000

Checked in changes which make the new debugging interface more or less work. I also added a 'debug' command line switch which starts the kernel in the debugger, so you have control of it from the start.

There are some problems with it. Commands attached to breakpoints cause segfaults for some reason. It also can't step across a context switch.

I also put in Rusty's patches. They completely revamp the config mechanism. For some reason, there also seems to be very complete networking/netfilter coverage.

5 Aug 2000
The ptrace proxy is more or less working. I've checked it in to CVS and announced it on my devel list.
4 Aug 2000

Got a couple of patches from Rusty. I'm apparently going to graduate to a complete port once I've applied one of them :-) It is nice. It gives me what looks like a complete config process rather than the one I've kludged together. He also sent in enough exports to allow his stuff to be modular inside a UML.

I'm integrating Lars Brinkhoff's ptrace proxy. It's partially working - enough that I can attach to the running thread, poke around it, set breakpoints, etc. This works without needing to detach it from the uml tracing thread. I can't yet ^C gdb and have it stop the kernel wherever it happens to be. It also doesn't seem to be following threads as one goes to sleep and another starts up. Once these work, this will be a huge improvement in uml kernel debugging.

2 Aug 2000

Bill Stearns pointed out a reproducible way of crashing the kernel yesterday. It turns out that irq_save/irq_restore were completely wrong. irq_save was enabling signals when it should have been disabling them. This could explain a lot of the problems I saw in test4.

I checked the fix into CVS today.

28 Jul 2000

Linus released test5 last night, so I'm putting out the user-mode version today. There's nothing new in it. The virtual ethernet is in the patch, but not enabled in the binary kernels and off by default in defconfig.

The stress testing of this kernel produced no strange happenings. Maybe the segfaults and other stuff in the last release weren't my fault (heh).

27 Jul 2000
Started integrating Jim Leu's virtual ethernet driver. It basically works, but it misbehaves a fair bit. It's unclear whether that's the kernel's fault or the driver's.
18 Jul 2000

I/O, I/O, off to OLS I go

I'm leaving OLS a bit early because I've got a hiking weekend coming up - Carter Dome on Saturday and possibly Moriah on Sunday. So, no one had better expect any work from me until at least next week...

Plus I'm getting Kije next week.

17 Jul 2000

Announced test4 on freshmeat today.

Also discovered a few more problems which I didn't see on test2 or test3:

  • the occasional process segfault which I've mentioned already
  • a devfs segfault - I did this by displaying an xterm out to the host; when I logged out, the kernel panicked with memory corruption in devfs
  • X clients sometimes can't display against a local X server - strace says that they're stuck in select
  • strace also displayed their read masks as '[?]', which doesn't seem right

Patches for these will be forthcoming when I find fixes.

    14 Jul 2000

    Back from SF. Not only did Linus release test3 on Tuesday (as I discovered when I was checking things out just before leaving for the airport), but he also released test4 yesterday. So, it looks like I'm going to be skipping test3 and going straight to test4.

    The test3 changes are pretty minor, but enough to prevent the um test2 patch from going in. task.priority changed to task.nice, there were some minor locking changes, devfs_mk_dir lost a parameter, and kernel/timer.c doesn't compile because of that field change. With those things fixed, the kernel boots fine.

    I also decided to get rid of the

    pid 16 (mount) - segv changing ip to 0x10025ff2 for address 0x8064000

    messages that appear at boot time. They were debugging messages meant to convince me that the new uaccess macros were working. But they looked abnormal and worried people, so they are now gone. Don't worry, be happy.

    The test3 kernel runs my stress tests (lmbench and a kernel build) fine, so I'm checking it in to CVS and announcing it on the devel list.

    On to test4. The timer.c bug got fixed. Otherwise, the patch went in cleanly. It compiles and the resulting kernel boots cleanly. Unfortunately, lmbench segfaults. I put in some debugging code, and lmbench stops segfaulting. On to the kernel build. That works fine. I try a couple more lmbench runs. They work fine. Oh well.

    I'll consider this releasable. Maybe someone else can find a better way to reproduce the problem. Check this stuff into CVS, and out goes the announcement.

    I also updated all of the downloadable stuff and announced it.

    3 Jul 2000
    I fixed the double panic bug. That was caused by a stacksize limit that was not a multiple of 4 meg. The reason that matters is that check_range (in arch/um/kernel/tlb.c), which is used to remap address spaces during a context switch, assumes that remappable areas and non-remappable areas are under different pgdirs, which represent 4 meg apiece. Non-remappable areas are areas of address space which don't belong to the process. Kernel text, data, and physical and virtual memory, plus the original stack, fall into this category. They are represented by vmas in the process mm_struct, but don't have page table entries. If check_range runs into one of these areas in the course of looking at something else, the lack of ptes for it will cause it to be unmapped. Since the process stack is placed just outside the stacksize limit, if that limit is (say) less than 4 meg, when check_range checks it for remapping, it will also run into the main stack provided by the host kernel and unmap it. The panic happened when a process tried to change its name, which is stored in that initial stack.

    If you see this problem, you can change your stacksize limit to a multiple of 4 meg, or apply this patch to the kernel.
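
The arithmetic behind the workaround is simple enough to write down. The helpers below are a hedged illustration with hypothetical names, assuming the 4 meg per pgdir figure given above; they are not taken from arch/um/kernel/tlb.c.

```c
#include <assert.h>
#include <limits.h>

/* Address space covered by one pgdir entry, per the entry above. */
#define PGDIR_SPAN_SKETCH (4UL * 1024UL * 1024UL)

/* Is a stack size limit a whole number of pgdir spans? */
static int is_pgdir_multiple(unsigned long bytes)
{
    return bytes % PGDIR_SPAN_SKETCH == 0;
}

/* Round a limit up to the next multiple of the pgdir span. */
static unsigned long round_up_to_pgdir(unsigned long bytes)
{
    assert(bytes <= ULONG_MAX - (PGDIR_SPAN_SKETCH - 1));
    return (bytes + PGDIR_SPAN_SKETCH - 1)
           / PGDIR_SPAN_SKETCH * PGDIR_SPAN_SKETCH;
}

/* Do two addresses fall under the same pgdir entry? */
static int same_pgdir(unsigned long a, unsigned long b)
{
    return a / PGDIR_SPAN_SKETCH == b / PGDIR_SPAN_SKETCH;
}
```

With a 2 meg limit, is_pgdir_multiple() fails and a process stack placed just below the host-provided stack shares a pgdir entry with it, which is the situation in which check_range unmaps the wrong area.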

    Those two fixes are now checked into CVS. Here's the devel list post describing the changes.

    2 Jul 2000
    UML doesn't run on recent 2.3/2.4 kernels and I figured out why. The signal frame size increased due to some extra x86 state that needed to be saved. UML is responsible for making sure that there is enough stack available when it asks the host kernel to send a signal to one of its processes. To do this, it pokes the stack (by reading and writing a word) a little below the current stack pointer. If there is nothing mapped there, the seg fault handler will map a page in and all will be well. The offset that it used to poke was a hard-coded 512 bytes, which I got by looking at the amount of stack state the syscall handler needed (312 bytes) and adding a bit. However, it turns out that the new stack frames are much bigger than that, so the 512 bytes wasn't enough. Fixing this makes UML run on new kernels. If you are seeing this problem, apply this patch to the 2.4.0-test2 pool.
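
The poke itself amounts to very little code. The sketch below uses hypothetical names and takes the frame size as a parameter instead of hard-coding it, which is the point of the fix; in UML the access would fault on an unmapped page and the seg fault handler would map one in, whereas here it is only meant to be run against memory that is already mapped.

```c
#include <assert.h>
#include <stddef.h>

/* Address to touch: far enough below sp to cover the signal frame
 * plus some slack. Both sizes are supplied by the caller. */
static char *poke_target(char *sp, size_t frame_size, size_t slack)
{
    assert(sp != NULL);
    return sp - (frame_size + slack);
}

/* Read a word and write it back, so that both a read fault and a
 * write fault would be triggered if the page were missing. */
static void poke(volatile unsigned long *addr)
{
    unsigned long word = *addr;

    *addr = word;
}
```

With the old numbers, poke_target(sp, 312, 200) lands 512 bytes below the stack pointer; a larger signal frame just means passing a larger frame_size.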

    I'm also chasing a bug which causes a panic like this:

                    
    Kernel panic: Double fault on 0xbffff874 - panicing because it wasn't
    fixed the first time
    
                  