This page contains information about what's currently happening with the
project. I may update it once in a while if I feel like it.
29 Feb 2008
I've put out a few more releases of the SKAS4 patch, including a
couple to LKML. No comments as yet, except for one from Andrew about
style and a minor coding issue.
I decided to add another idea to SKAS4 since the switch_mm bit,
although it greatly improves performance on x86_64, still isn't close
to i386. oprofile says I'm getting killed in the scheduler, so I
revived an old idea which has come up a few times, and which was
actually implemented by Ingo a few years ago.
That is to allow a process to put itself into an "unprivileged" mode,
where it can't make system calls or receive signals. That is, the
process would make a call like
vcpu(mm, registers);
which would switch to the given address space with the given registers
and run there until it receives a signal or makes a system call. At
that point, the vcpu call returns with information about what
happened. This is essentially self-ptracing, except there's only one
process involved. It also returns all of the information needed to
deal with the event, without needing to make more calls to ptrace in
order to get the current registers or signal information.
Making this work was surprisingly easy, except for saving and
restoring TLS state. Despite my best effort, I'm getting UML process
TLS segments active when vcpu returns back to the UML kernel.
When this works and is tidied up, vcpu will be in the next skas4
patch.
1 Feb 2008
The big news recently is that I've figured out a SKAS interface that
might get into mainline. What's in it:
Two new system calls:
  new_mm - returns a file descriptor referring to a new address space
  switch_mm - switches the process to a new address space
/proc/<pid>/mm - opening this returns a file descriptor referring
to the process' address space
PTRACE_SWITCH_MM - switches the child to a new address space
Manipulation of a remote address space (as when handling a page fault
or swapping out a page) is done by switching to the address space, in
the stub page mapped there by UML, and executing the
mmap/munmap/mprotect calls directly.
A user of switch_mm would normally open /proc/self/mm in order to get
a handle to its original address space. switch_mm requires a file
descriptor telling it what address space to switch to, so this
provides the means to return to the original address space. This
descriptor also holds a reference to the address space, ensuring that
it isn't freed when the process visits another address space
temporarily.
I've made several releases of SKAS4 to the UML lists, and one to
LKML. A full rolled-up patch can be found here.
It also includes a siginfo_t extension which adds the CPU error code
and trap number to the SIGSEGV case. These are needed in order to
figure out what sort of operation faulted, and thus how to fix it.
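As a rough illustration of why those two values matter, here is a minimal sketch of classifying a fault from them. The trap numbers (13 for a general protection fault, 14 for a page fault) and the low bits of the x86 page-fault error code are architectural; the enum names and classify_fault itself are made up for this example and are not taken from the SKAS4 patch or from UML.

```c
/* x86 exception vectors relevant here. */
enum { VEC_GP_FAULT = 13, VEC_PAGE_FAULT = 14 };

/* Low bits of the x86 page-fault error code. */
enum {
    PFERR_PROT  = 1 << 0,  /* 0: page not present, 1: protection violation */
    PFERR_WRITE = 1 << 1,  /* 0: read access,      1: write access */
    PFERR_USER  = 1 << 2   /* 0: kernel mode,      1: user mode */
};

/* Hypothetical outcomes a fault handler might distinguish. */
enum fault_kind {
    FAULT_NOT_FIXABLE,     /* not a page fault at all */
    FAULT_MAP_PAGE,        /* page is missing and could be mapped in */
    FAULT_WRITE_PROTECT,   /* write to a page that is mapped read-only */
    FAULT_ACCESS_DENIED    /* read that hit a protection violation */
};

enum fault_kind classify_fault(int trap_no, unsigned long error_code)
{
    if (trap_no != VEC_PAGE_FAULT)
        return FAULT_NOT_FIXABLE;
    if (!(error_code & PFERR_PROT))
        return FAULT_MAP_PAGE;
    if (error_code & PFERR_WRITE)
        return FAULT_WRITE_PROTECT;
    return FAULT_ACCESS_DENIED;
}
```

Without the trap number and error code, the fault address alone doesn't say which of these cases applies.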
SKAS4 currently supports 32 and 64-bit UML and x86, plus 32-bit
compatibility on 64-bit x86.
With this on the host, and a UML close to what's currently in -mm, I'm
getting 82-83% of native performance on a kernel build on i386.
29 Oct 2007
In the three months since the last entry here, the whole 2.6.23 cycle
came and went. It wasn't a huge UML release - I was saving up changes
for the 2.6.24 cycle:
4K stacks and IRQ stacks - The old, larger stacks were a reliability
problem on x86_64 since there often wasn't enough contiguous memory in
order to allocate a kernel stack, even if there was plenty of memory
overall. You'd see forks start to fail when there was lots of free
memory. I implemented IRQ stacks in order to cut down on stack usage
and cut i386 kernel stacks to 4K and x86_64 kernel stacks to 8K, the
same as the host. As well as improving reliability and saving memory,
it gave a noticeable speedup to a kernel build.
Core dumping now works on x86_64.
A fair amount of code cleanup.
I spent the 2.6.23 cycle accumulating things in -mm for 2.6.24.
There's a fair amount, which is now in 2.6.24-rc1:
tt mode is gone. This was prompted by Adrian Bunk looking around for
unused and defunct config options, and spotting CONFIG_MODE_TT. Since
it no longer worked, and I'm not sure if it built, I just got rid of
it. This was a fairly major undertaking, resulting in a long
patchset. I was fairly careful about it and it seems to have gone
smoothly. I think between 5000 and 10000 lines of code were deleted.
This exposed a number of opportunities to simplify the remaining code,
which I am doing on a case-by-case basis. Perhaps surprisingly, this
resulted in a measurable speed boost on a kernel build.
Tickless support - this resulted in a noticeable performance boost by
itself, and it should help out hosts that are running many UMLs. They
will no longer have to deliver 100 timer interrupts per second to each
UML.
Performance enhancements - UML doesn't save and restore FP state from
processes and the tt mode removal made it possible to streamline the
page fault path some more.
VDE (Virtual Distributed Ethernet - http://vde.sourceforge.net/)
support - this is something like UML's uml_switch. The VDE driver was
contributed by the VDE developers.
13 Jul 2007
The issue of valgrinding the kernel by way of UML came up again.
Unfortunately, valgrind is just as broken as ever when it comes to
UML. It caused a cloned subprocess to segfault when UML is checking
the ptrace capabilities of the host during startup. I ifdef-ed all
that code out to see what would happen. valgrind then dies because of
an instruction it doesn't know how to emulate. It supposedly can
handle everything emitted by gcc, but this instruction came from some
hand-written assembly in include/asm-i386. After seeing this, I gave
up on valgrind again.
I've been playing with KVM some. After much futzing around, I got a
trivial guest to run. The issue was the initialization of the guest's
physical memory. You end up getting several file descriptors from KVM
as part of setting up the guest. The guest's physical memory is
created by mmapping one of these descriptors. After mapping it, you
fill it in with whatever data you want the guest to have when it
starts. I was calling mmap with MAP_PRIVATE - a decision made
somewhat randomly without any particular thought. It turns out that,
with respect to this buffer, the guest and the process initializing it
are different processes. When I initialized the buffer, the kernel
privatized the pages, leaving the guest with zero pages.
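The MAP_PRIVATE effect can be reproduced without KVM. The sketch below is only an analogy: it uses an ordinary temporary file in /tmp in place of the KVM descriptor, and the function name store_reaches_file is invented for the example. It stores a byte through the mapping and then checks whether the byte can be read back through the descriptor, which stands in for "what the other side sees".

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map a fresh one-page temporary file with the given flags
 * (MAP_SHARED or MAP_PRIVATE), store a byte through the mapping, and
 * report whether that byte can be read back through the descriptor.
 * Returns 1 if it is visible, 0 if not, -1 on error.
 */
int store_reaches_file(int map_flags)
{
    char path[] = "/tmp/map-demo-XXXXXX";
    long page = sysconf(_SC_PAGESIZE);
    char seen = 0;
    int result = -1;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);                     /* the file lives on via the descriptor */

    if (page > 0 && ftruncate(fd, page) == 0) {
        char *mem = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                         map_flags, fd, 0);
        if (mem != MAP_FAILED) {
            mem[0] = 'G';             /* "initialize guest memory" */
            msync(mem, (size_t)page, MS_SYNC);
            if (pread(fd, &seen, 1, 0) == 1)
                result = (seen == 'G');
            munmap(mem, (size_t)page);
        }
    }
    close(fd);
    return result;
}
```

With MAP_SHARED the store lands in the file's pages. With MAP_PRIVATE the first store gives the writer its own copy of the page, so the file, and anyone else looking at it, still sees zeros.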
The guest was running in virtual 8086 mode, complete with 16-bit
segment limits and everything. I read the specs some to see how to
initialize the registers so as to have the guest start in protected
64-bit mode, without much success. KVM does come with a bunch of
little demo guests, one of which puts the virtual CPU into 64-bit
mode. I will use that and see if I can figure out the appropriate
register initializations.
I also spent some time playing with Ingo's and Zach Brown's syslets.
The attraction there is that they allow synchronous system calls to be
turned into asynchronous calls if they block. In this case, the
system call will return to userspace in a different thread, and the
status from the original, blocked thread can be collected later, when
it finishes. This is near to optimal as far as CPU consumption and
code cleanliness are concerned. You don't need to make everything
asynchronous in case something might occasionally block, with the
switching and event collection that implies. If the data is
available, the call returns immediately, and if it's not, you collect
the status later.
I found a few bugs, for which I sent patches. I couldn't get UML to
boot using syslets, for reasons I didn't fully debug. A problem is
the fact that the asynchronous threads may return to userspace if they
get a signal, and a stack and an entry point need to be provided to
each thread. This is a big wart on the design, and it causes problems
when a process that thinks it's single-threaded receives a signal in
16 or 32 threads. This feature seems not to be completely needed, and
I'm hopeful that it will disappear.
In other news, 2.6.22 is out, with all the changes I mentioned
previously.
25 May 2007
My pile of patches (50-60 or so) is now in 2.6.22-rc2. Among the
changes that'll be in 2.6.22:
Better hot-plug and hot-unplug of block devices and network interfaces
- I fixed some misbehavior which could cause a block device to be
neither pluggable nor unpluggable and fixed crashes in the network
interface hotplug code.
Various performance improvements already mentioned.
IRQ stacks and smaller kernel stacks (4K on i386 and 8K on x86_64,
matching the host) - this knocked a bit more time off a kernel build.
Lots of other code cleanups and bug fixes.
I'm working on getting the ubd readv/writev support ready for
mainline. The per-device thread patch, which leads to it, has been
fixed so that all device I/O threads get killed on shutdown, so it's
ready. The readv/writev patch itself doesn't do the right thing with
COW files, so it still needs some work. The problems here are similar
to those encountered in the AIO work in my tree, so I might end up
pulling that forward and getting it into mainline and out of my tree
finally.
I figured out something about I/O performance. Having lots of I/O
going to the host at the same time is not the important thing,
although I imagine it doesn't hurt. During a kernel build in UML, the host CPU
is pegged, so anything that cuts down on CPU consumption directly cuts
down on the kernel build time. So, readv and writev are nice, not
because they send a lot of I/O to the host at once, but because they
allow the UML driver to be notified of a bunch of I/O with a single
interrupt. Linux signals are fairly expensive, so cutting down on
their number helps performance.
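The batching idea can be sketched with plain writev. This is not the ubd driver code; submit_batch, batch_demo, and MAX_PARTS are names invented for the example, and a pipe stands in for the backing file. The point is only that several separate buffers travel in one call, so the per-request overhead (here a system call, in UML's case a signal) is paid once.

```c
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#define MAX_PARTS 8

/*
 * Submit up to MAX_PARTS separate buffers with a single writev() call.
 * Returns the number of bytes written, or -1 on error.
 */
ssize_t submit_batch(int fd, const char *const parts[], int nparts)
{
    struct iovec iov[MAX_PARTS];
    int i;

    if (nparts < 0 || nparts > MAX_PARTS)
        return -1;
    for (i = 0; i < nparts; i++) {
        iov[i].iov_base = (void *)parts[i];   /* writev only reads from it */
        iov[i].iov_len = strlen(parts[i]);
    }
    return writev(fd, iov, nparts);
}

/* Round trip through a pipe: returns 1 if the three pieces arrive joined. */
int batch_demo(void)
{
    static const char *const parts[] = { "req1;", "req2;", "req3;" };
    char buf[32] = { 0 };
    int fds[2];
    int ok = 0;

    if (pipe(fds) < 0)
        return -1;
    if (submit_batch(fds[1], parts, 3) == 15 &&
        read(fds[0], buf, sizeof(buf) - 1) == 15)
        ok = (strcmp(buf, "req1;req2;req3;") == 0);
    close(fds[0]);
    close(fds[1]);
    return ok;
}
```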
There was still another example of a boot hang after mounting the root
filesystem that I hadn't figured out. I did finally, and it turns out
to be that PTRACE_SYSEMU is broken on FC6 on i386. I let Roland
McGrath know, and added some more checking to UML, so that it now will
look at PTRACE_SYSEMU more carefully, and fall back to PTRACE_SYSCALL
if anything is wrong.
27 Apr 2007
2.6.21 came out yesterday, so the changes I have pending in -mm should
start flowing in. I spent the afternoon playing with the UML network
layer. Antoine Martin found that you couldn't assign a MAC to a pcap
device (which he wanted to do in order to not get a different device
every time his distro saw a new MAC). In the course of fixing this, I
found and fixed a pile of other bugs. I was making no attempt to have
the assignment of a MAC succeed, to see how the error cases worked. I
found that
the free wrapper wasn't detecting pointers that had been malloced, so
was passing them to kfree, which blew up
the failure path of the ethernet device configuration was causing
double-frees because sysfs was helpfully freeing things
the pcap backend wasn't printing a proper initialization string,
checking that it was passing a valid pointer to pcap_close, or
returning an error when it saw an invalid option
In other news, I love UML's new behavior of dumping core on panic. It
gives me much better debugging when I don't already have a gdb on it,
and it panics.
20 Apr 2007
So, what's been going on in the last two months? Lots of patches
either in mainline or on their way:
UML will now dump core on a panic (and it prints the current core dump
limits on boot) to give me a better chance of debugging problems that
I can't reproduce. To allow this to work, the core dump rlimit must
be unlimited (or at least very large) - see ulimit -c for your
limits. It seems that most distros set it to zero in order to inhibit
core dumping.
Device hotplug fixes - if you hotplug a disk with a bogus filename,
that device is no longer unremovable and unrepluggable. Also, you
don't get nasty messages from sysfs (or wherever) about a lack of a
release method when you unplug a device.
Yet another x86_64 TLS fix. I think TLS is fine now - I know of no
failures at this point.
UML in -mm compiles now in the presence of utrace. It's
non-functional, but enough is there to let it build.
Lots of code cleanup and fixes of a few miscellaneous crashes.
Performance improvements - kernel builds are now ~20% faster than
before and are about 2/3 native speed. I found a horrible mess which
was causing a random userspace page to be faulted in whenever UML did
a file read or write.
The reason for this was that, a long long time ago, in the tt era, I
was having problems with userspace addresses being passed into read or
write on the host and having that return -EFAULT because the page had
not yet been faulted in. So, I added some code which faulted in any
such pages (in a totally bogus way). However, in the skas era, when a
kernel address gets passed to read or write on the host, it doesn't
need to be faulted in at all and doing a copy_user to it to make sure
it's present just faults in the process page, if any, at the
corresponding process address. Removing this gave me ~10% on a kernel
build.
Improving the page fault path gave me another ~10%.
I/O improvements - I've done a number of things to the ubd driver to
speed things up:
Stuff as many requests at the I/O thread as possible to allow them to
possibly be handled sooner, and pass pointers instead of entire
structures to increase that number and reduce the amount of data going
through the pipe.
Each device now gets its own I/O thread. This should improve
throughput when multiple devices are active. It also paves the way
for using readv/writev.
The I/O threads now use readv and writev to get some parallelism on
I/O. This doesn't help kernel builds (which is all I've looked at so
far) too much - the times do seem more consistent and at the low end
of the range that I'm used to seeing.
More locking fixes.
All critical fixes have gone into 2.6.20-stable, so the UML there is
in pretty good shape.
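For the core-dump item in this entry, the limit that ulimit -c reports is the RLIMIT_CORE soft limit, and a process can raise it as far as its hard limit before it starts UML. A minimal sketch using the standard getrlimit/setrlimit calls follows; the function names are invented for the example and nothing here is taken from UML itself.

```c
#include <sys/resource.h>

/*
 * Raise the soft core-file limit to the hard limit.
 * Returns 0 on success, -1 on error.
 */
int raise_core_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) < 0)
        return -1;
    rl.rlim_cur = rl.rlim_max;
    return setrlimit(RLIMIT_CORE, &rl);
}

/*
 * Returns 1 if a core file of nonzero size may be written,
 * 0 if core dumps are inhibited, -1 on error.
 */
int core_dumps_enabled(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) < 0)
        return -1;
    return rl.rlim_cur != 0;
}
```

If the hard limit itself is zero, raising the soft limit changes nothing, and core_dumps_enabled will still report 0.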
20 Feb 2007
I've been spending time tracking down bugs in both UML and the host,
with the following results:
Found a bug in 32-bit ptrace on x86_64 which mangled the 6th system
call argument. I fixed it to my satisfaction, but when Andi sent it
to LKML for review, it turns out that this bug had been seen before
and fixed, and the fix, which had never been merged, was better than
mine. I'm going to push that patch if the author (Chuck Ebbert)
doesn't.
BB and I added PTRACE_OLDSETOPTIONS to 32-bit ptrace on x86_64 on the
same day. His patch will be the one to go in.
I found and fixed a bug where kernelspace faults were trashing the
segfault information stored in the task structure by a previous
userspace fault. When the segfault was finally delivered to UML, it
decided that the fault was fatal because of the bogus information that
the host had saved, and killed the process. My fix went through three
iterations at the behest of Jan Beulich, who kept spotting problems
with it.
I found a signal frame alignment bug in UML which caused a few
processes to segfault. They were executing MMX instructions, which
expect data to be 16-byte aligned, inside a signal handler, and the
misalignment was causing them to segfault.
After a few days of debugging, I fixed a TLS problem on x86_64 which
caused host to segfault. I wasn't implementing CLONE_SETTID properly
there. host still isn't totally happy, so there's more work that
needs to be done.
Yesterday, I found the problem which causes UML to hang early in boot
with some more recent host kernels. UML was mishandling the host
VDSO information, with the result that init tried to branch to an empty
part of memory when it wanted to use the host's VDSO.
It turns out that an earlier cleanup suddenly started causing UML to
hang in 2.6.20 with a couple of threads sending an infinite stream of
single characters back and forth. With the help of someone on #uml
who was seeing the problem, I diagnosed and fixed it. It turned out
to be a badly designed interface being used for something it wasn't
intended to be used for. I had implemented a little growable array
abstraction which didn't bother preserving its contents when it needed
to allocate more memory, leaving that up to its callers. I reused it
with something that wasn't prepared to fill in the contents, with the
result that the array was filled with garbage when it got reallocated.
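For contrast, here is a minimal sketch of a growable array that does preserve its contents when it grows, so no caller has to refill it. It is not the UML code; struct garray and the garray_* functions are invented for the example. It relies on realloc, which copies the old contents into the new allocation.

```c
#include <stdlib.h>

struct garray {
    int *data;
    size_t len;   /* elements in use */
    size_t cap;   /* elements allocated */
};

/*
 * Ensure room for at least `need` elements. realloc() carries the old
 * contents over, so callers never have to fill the array in again.
 * Returns 0 on success, -1 if memory is exhausted (array unchanged).
 */
int garray_reserve(struct garray *a, size_t need)
{
    size_t cap = a->cap ? a->cap : 4;
    int *p;

    if (need <= a->cap)
        return 0;
    while (cap < need)
        cap *= 2;
    p = realloc(a->data, cap * sizeof(*p));
    if (p == NULL)
        return -1;
    a->data = p;
    a->cap = cap;
    return 0;
}

int garray_push(struct garray *a, int value)
{
    if (garray_reserve(a, a->len + 1) < 0)
        return -1;
    a->data[a->len++] = value;
    return 0;
}

/* Fill past several reallocations, then check every element survived. */
int garray_demo(void)
{
    struct garray a = { NULL, 0, 0 };
    int i, ok = 1;

    for (i = 0; i < 100; i++) {
        if (garray_push(&a, i * i) < 0) {
            free(a.data);
            return -1;
        }
    }
    for (i = 0; i < 100; i++)
        ok = ok && (a.data[i] == i * i);
    free(a.data);
    return ok;
}
```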
I am continuing to send SMP cleanups to Andrew. When 2.6.20 opened
up, he sent a bunch of them to Linus. I will queue up a bunch more
after 2.6.20 closes for inclusion in 2.6.21.
utrace, which is a ptrace replacement, made its debut in -mm with
2.6.20-mm1. ptrace requires a fair amount of architecture support, so
any replacement of it will require some work on the part of the
architecture maintainers. Fortunately, Roland McGrath wrote a
document on updating architectures, so it was fairly easy to get UML
compiling and booting again. However, ptrace doesn't work - that will
be the goal of the rest of the work.
Brian Ducharme of Virtual Strategy
Magazine did a podcast with me and Chris Aker of Linode.com.
Linode is a large UML ISP (and Chris is a long-time supporter of UML),
and happens to host virtual-strategy.com. The podcast is located
here.
26 Jan 2007
With the SMP mechanisms basically working, I've been making an SMP
cleanliness pass over UML. I had previously gone over the code and
made a list of things that needed to be looked at. Now, I'm going
over that list and fixing things. I've sent a pile of patches to
Andrew for inclusion in 2.6.21, and I have a lot more which I haven't
sent in yet.
I spent last week in Sydney at LCA 2007. The last LCA I was at was
Brisbane in 2002. That was an awesome show, and it has gotten more
awesome in the meantime.
I gave a talk at the virtualization miniconf on my plans on making a
UML KVM client. This is to take advantage of the virtualization
support in current Intel and AMD chips. This work was done by a
couple Intel engineers in Moscow. The UML side is fine, but the host
side is a bit scary, and prompted me to work on more pressing things.
KVM is essentially the same thing, and I can make the UML work fit
within it pretty easily.
A couple of articles flowed from that. One was an interview with me by
Joe Brockmeier here.
The other was an article about my talk in ComputerWorld.
Since I got back (with minimal jet lag (!)), I've been on a bug hunt.
There have been a number of reports of various host-related UML
problems - UML works on one host, but not another. I've got access to
a couple of such hosts and have been tracking down the bugs. It turns
out that the 32-bit ptrace support on x86_64 is buggy. There was no
support for PTRACE_OLDSETOPTIONS, which is easily fixed. I'm
currently looking at a problem where the sixth system call argument is
trashed when read from ptrace. This is causing process bus errors
inside UML because that's the offset argument to mmap and when you try
to touch memory which is mapped from a very negative file offset,
you get a SIGBUS.
15 Dec 2006
After much argument with ptrace and how signals get delivered to
ptraced processes, I got a 2-CPU UML to boot. After playing with it
some, I haven't found any unknown problems. The two problems that did
turn up were already known. The locking in the console driver is
screwed up, and we've known that for a long time. Aside from that,
this thing seems reasonably healthy. I'm going to beat on it, and I'm
sure more problems will turn up then.
I think I'm going to redo the detach/attach nonsense that I now do on
every context switch. The problem is that when a CPU switches from
one process to another, it needs to be able to attach to the
associated host process. I did this by having each CPU only attach to
whatever process it is running, and so all non-running processes are
detached, and can be attached by a CPU without any trouble later. The
alternative to this is to leave processes attached to their CPU when
they are switched out. If they are on a different CPU when they are
next run, the new CPU has to send an IPI to the old one to detach it.
That sounded nasty and painful to me, but what I did instead was
extremely nasty and painful. So, the alternative has been looking
somewhat more attractive of late.
4 Dec 2006
I've made some amount of SMP progress. Moving that wait to inside an
attach/detach pair still doesn't work, but UML at least runs longer
before crashing.
On the bug-hunting side of things, it's looking like the as-iosched
crash is caused by an interrupt happening when it shouldn't. My
debugging stuff is showing that a pointer is tested as not being NULL,
and immediately inside the test, it is NULL. This is the sort of
thing that says that we have an interrupt problem. Specifically, my
theory right now is that an interrupt comes in while the I/O scheduler
is playing with the request queue (and interrupts are disabled for
this reason) and does something to change the queue, such as pull a
request off it and run it. This messes up the request structure that
the scheduler is in the process of looking at, causing the crash.
That's what it looks like right now, but I don't have any idea how
this might be happening.
I can usually reproduce this in a few days of a make -j 64 kernel
build, but the last time, it took a week to happen. This slows down
the debugging effort some.
What also slows down the debugging process is discovering another
serious bug that needs to be fixed. UML processes on x86_64 sometimes
segfault. I haven't been able to reproduce this reliably enough to
track it down. Until now. I discovered that running two UMLs (one of
which was running the 64-way kernel build) was enough to make
processes segfault in one or the other of them. After trying to
figure out how they could be interfering with each other (i.e. by
trashing each other's tmpfs through a tmpfs bug), I found out that the
faults are almost legitimate. The faulting instruction is consistent
with the fault address passed in by the host kernel. What's wrong is
the CPU trap number and error code passed in along with the fault
address. In particular, the trap number is 13 (a general protection
fault) rather than the 14 (page fault) expected. A trap of 13 is
usually caused by a segment being set up wrong, and is not a fixable
page fault, so the UML process is segfaulted.
I spent a while trying to figure out how this can be happening, and
did, somewhat, although I'm still unsure of the details. In
arch/x86_64/kernel/traps.c:do_general_protection, we have this code:
Whether the fault is caused by userspace code or kernel code, the
task's error_code and trap_no (which are ultimately passed to the
process that is running, if it catches SIGSEGV) are set. If there was
already a SIGSEGV pending, with its own trap_no and error_code, those
will be lost.
I tried to catch this in the act of happening, by adding some code
there which looked for an already-queued SIGSEGV, and failed.
However, when I applied the obvious fix, which is to only set the
error_code and trap_no in the case of a userspace fault, the UML
process segfaults disappeared. So, this theory appears to be
basically correct, although I'm still missing one or two critical
details.
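The theory and the fix can be modelled in userspace. The sketch below is not the kernel source: struct thread_model, record_fault, and delivered_trap are invented for illustration, and the only_user flag simply selects between the old behaviour (every fault overwrites the record) and the fixed one (only userspace faults do).

```c
/* Stand-in for the per-task fault record kept by the host kernel. */
struct thread_model {
    int trap_no;
    unsigned long error_code;
};

/*
 * Model of a fault handler's bookkeeping. `user_mode` says whether the
 * faulting code was userspace; `only_user` selects the fixed behaviour.
 */
void record_fault(struct thread_model *t, int trap_no,
                  unsigned long error_code, int user_mode, int only_user)
{
    if (only_user && !user_mode)
        return;               /* fixed: kernel faults leave the record alone */
    t->trap_no = trap_no;
    t->error_code = error_code;
}

/*
 * A userspace page fault (trap 14) is recorded, then a kernel-mode
 * general protection fault (trap 13) arrives before the SIGSEGV is
 * delivered. Returns the trap number the process would finally see.
 */
int delivered_trap(int only_user)
{
    struct thread_model t = { 0, 0 };

    record_fault(&t, 14, 0x6, 1, only_user);  /* user write to a missing page */
    record_fault(&t, 13, 0x0, 0, only_user);  /* later kernel-mode fault */
    return t.trap_no;
}
```

In the old behaviour the process sees trap 13 and UML treats the fault as unfixable. With the userspace-only rule it sees the original trap 14.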
I've sent this patch off for comment, and if there are no objections,
I'll be sending it to mainline.
20 Nov 2006
The ptrace detaching and reattaching works with ncpus == 1. I had
some nasty flags floating around saying when wait should be called and
when it shouldn't. Once I had UML booting, I looked at this nastiness
and it turns out that it could all be simplified away.
The next thing to do is try booting with ncpus == 2, and here it
breaks. The reason is that there is some waiting on a process that
hasn't been attached. With multiple CPUs, this is a problem because
when that process was detached, it went back to being the child of its
original parent, which may be a different CPU than the one trying to
wait for it. Since you must wait for your own children, this is a
problem.
The solution is to move all waiting to between attaching and
detaching. There is one wait which needs to be moved. It can be
moved to just before an attach (from just after a detach), but taking
it one step further and moving it to after the attach causes a strange
crash later.
In other news, I'm chasing a sporadic crash that I see every few days
with a make -j 64 kernel build loop. The crash is a NULL dereference
in the AS I/O scheduler, and at this point, it looks like a generic
bug. I see no UML involvement right now. I'm putting instrumentation
in the block layer to try to track this problem back to the source.
I've gone a couple of steps back from the crash, but have no real idea
what the problem is yet.
10 Nov 2006
My metadata-mashing workload is still failing. The
removed-working-directory fix was wrong. It turns out that the
directory may not be removed on the host in d_drop before something
else recreates a directory of the same name. The host directory
lookup will produce the inode number of the not-yet-removed directory,
and that will resurrect the old inode. So, we'll have two UML
directories, the removed and the recreated, referring to the same host
directory.
It appears that the host rmdir must be done at the same time as the
UML rmdir. In order to fix the inode reuse bug, we must hold a
reference to the host directory so that the host inode stays around
even though the directory is no longer in the host namespace.
What I think I'll do is open the directory and rmdir it. The open
file descriptor will hold onto the inode, which will go away when the
descriptor is closed, and that will happen when the UML dentry is
freed. This complicates things because host file operations require
file names. When a UML process whose working directory has been
removed creates a file, what name on the host do you use? The only
solution I see is openat(), where you open a file whose name is
specified relative to an open file descriptor. One trouble is that
the infrastructure under hostfs isn't equipped for this. It expects
full absolute path names. Another trouble is that openat (and the
other *at system calls) are fairly new. For example, FC6 has no man
page for openat, although there appears to be libc support for it. I
can't count on them existing on any system that might run UML. Even
if it exists in the kernel, it may not be in libc, or vice-versa.
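The two host-side ingredients of this plan can be tried out in isolation: openat resolving a name relative to a directory descriptor, and an open descriptor keeping a directory's inode alive after rmdir. The sketch below is not hostfs code; held_directory_demo and the /tmp scratch path are invented for the example, and it assumes a host where openat and unlinkat are available.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create a scratch directory, open it, create a file by a name
 * relative to the descriptor, then remove the directory by path and
 * check that the descriptor still refers to the same inode.
 * Returns 1 if every step behaved that way, 0 if not, -1 on setup error.
 */
int held_directory_demo(void)
{
    char dir[] = "/tmp/hostfs-demo-XXXXXX";
    struct stat before, after;
    int dfd, fd, ok = 0;

    if (mkdtemp(dir) == NULL)
        return -1;
    dfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dfd < 0) {
        rmdir(dir);
        return -1;
    }

    /* The name "file" is resolved relative to dfd, not a full path. */
    fd = openat(dfd, "file", O_CREAT | O_EXCL | O_WRONLY, 0600);
    if (fd >= 0) {
        close(fd);
        unlinkat(dfd, "file", 0);     /* empty the directory again */
    }

    if (fd >= 0 && fstat(dfd, &before) == 0 && rmdir(dir) == 0 &&
        fstat(dfd, &after) == 0)
        ok = (before.st_ino == after.st_ino);
    else
        rmdir(dir);                   /* best-effort cleanup on failure */

    close(dfd);
    return ok;
}
```

While the descriptor stays open, the host cannot hand that inode number to a new file, which is the property the rmdir handling needs.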
While I think about that, I decided to go back to SMP support. The
(unforeseen) complication here is that processes must be ptrace
detached and reattached when they are scheduled out and back in. The
reason is that there will be one host process per UML processor. When
a process is switched out, it may be switched back in on a different
UML CPU. That CPU must be able to ptrace the process, so the old CPU
must have detached it. The default behavior when a process is
detached is that it is continued. This is obviously wrong for UML
which needs the process to just sit there until it is reattached. So,
SIGSTOP is specified as the signal to be delivered at the detach.
This works, but it complicates reattaching. Attach also delivers a
SIGSTOP, and this stacks on the detach SIGSTOP in the sense that one
is stored in task->exit_code and the other in the task's pending
signal mask. Both must be cleared out before the process is good for
anything, and there must be a PTRACE_CONT in there to get the last
SIGSTOP. This nastiness interacts with new thread creation, where the
new thread has been fully waited for, to make it very non-obvious
where to wait in order to get any thread scheduled properly.
3 Nov 2006
There have been sporadic reports of UML just freezing, but waking up
as soon as there is any I/O, such as whacking the keyboard. There
wasn't a lot I could do about it as long as there was nothing
consistent about it and I couldn't reproduce it myself.
However, I found a workload which would make it happen maybe once a
week, so I started chasing it. Over the weekend, it started happening
every few hours with this workload. This made things much easier, and
I tracked it down to the soft interrupt code. Nothing was wrong with
the code, exactly, but there were two things that should have happened
in a particular order, and gcc was rearranging the code so that they
were reversed. This left (in the assembly I looked at) a
one-instruction window in which an interrupt could happen, and get
lost.
If the UML was doing only disk I/O, then this would freeze the system,
since no more disk requests would get processed until the one
outstanding was finished, and its interrupt was lost. It would wake
up on any other input because the interrupt handler will look at all
active file descriptors, discover the finished disk request, and
process it.
I added a memory barrier to fix this problem, and added a couple more,
plus made a couple of variables volatile in order to guard against
future compiler misbehavior. This patch is now in mainline, and
should be in rc5.
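The shape of the fix can be sketched as a single-threaded model. This is not the UML source: the variable and function names are invented, the "interrupt" is an ordinary function call, and barrier() is written in the GCC/Clang inline-assembly form of a compiler barrier. The model cannot show the race itself; it only shows where the ordering constraint sits.

```c
/*
 * Compiler barrier: the compiler may not move memory accesses across
 * this point. It emits no CPU instruction (GCC/Clang syntax).
 */
#define barrier() __asm__ __volatile__("" ::: "memory")

static volatile int signals_enabled;  /* 1: interrupts are handled at once */
static volatile int pending;          /* 1: one arrived while blocked */
static int handled;                   /* interrupts processed so far */

/* Stand-in for an interrupt arriving. */
void deliver_interrupt(void)
{
    if (!signals_enabled) {
        pending = 1;                  /* defer until interrupts are unblocked */
        return;
    }
    handled++;
}

void block_interrupts(void)
{
    signals_enabled = 0;
    barrier();
}

void unblock_interrupts(void)
{
    signals_enabled = 1;
    /*
     * The store above must happen before the read of `pending` below.
     * If the two are swapped, an interrupt arriving in between is
     * marked pending after the check has already passed, and is lost
     * until something else triggers another look.
     */
    barrier();
    if (pending) {
        pending = 0;
        handled++;
    }
}

int interrupts_handled(void)
{
    return handled;
}
```

A deferred interrupt is picked up as soon as interrupts are unblocked, provided the enable and the check stay in that order.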
30 Oct 2006
I spent the last two weeks chasing externfs bugs. This was prompted
by deciding to beat on it before sending it to -mm. I settled on a
make -j 16 kernel build and a set of commands which perform lots of
conflicting metadata operations - file creation, appends, removals,
and mode changes.
The kernel build exposed a file corruption bug which was easily fixed
once I figured out what was happening.
The metadata workload was significantly more troublesome. The symptom
was that some command would start emitting errors like "Not a
directory". These were caused by the in-kernel inode structures
getting out of sync with the host filesystem. When a file or
directory on the host is removed, the corresponding UML inode should
also be freed. If it's not, when the host reuses the inode number for
a different file, the old UML inode will be reused, and it will
contain the data for the removed file. The "Not a directory" errors
come from the removed file being a directory and the new file, with
the reused inode number, being a file. Things like find and chmod -R
will stat the file and see that it's a directory, since that's what
the inode says, try to readdir it, and get a failure when that
operation fails on the host.
The underlying problem is that the UML inode structure wasn't being
freed when the corresponding host file goes away. There turn out to
be a variety of ways to make that happen.
The inode may be dirty. In this case, it won't be freed until
it has been flushed out to the host. If the file or directory was
removed on the host, this opened a window in which the host could
reuse an inode number before the inode structure is freed.
This shouldn't be a problem since externfs performs operations
immediately, so the host filesystem is synced with the inode, and the
inode shouldn't be dirty. So, I had to chase down all the ways I was
dirtying inodes. A common case was i_nlink manipulation. externfs
was changing the inode link count in order to keep it in sync with the
host, sometimes correctly and sometimes incorrectly. The incorrect
changes usually shouldn't have been there in the first place, which
alleviated the next cause. The correct ones were calling an interface
(inode_{inc,dec}_link_count) which marks the inode dirty. It turns
out there's another interface ({inc,drop}_nlink) which doesn't.
Switching from one to the other fixed most of these problems.
Any operation which accessed the file would update the inode's atime,
marking it dirty. I fixed this by setting S_NOATIME on all externfs
inodes. Any such operation will eventually be reflected out to the
host, and the host atime will be updated at that point.
That left externfs_setattr. This is called on most metadata changes -
chmod, chown, truncate (which is also a data operation), etc. The
filesystem needs to update the inode as well as handle the operation
in whatever way it needs to, and there's a helper, inode_setattr,
which does this. Unfortunately, it calls make_inode_dirty, and
there's no way to pursuade it not to. Here, I resorted to explicitly
syncing the inode after calling inode_setattr. Nasty, but it works.
The inode link count may be wrong. If so, if the inode use
count is zero and the link count inside UML is greater than zero but
the file has been removed on the host, the UML VFS layer will still
hang onto the inode in case some process later reopens the file.
Then, the old inode will be reused, saving the trouble of reading it
back from disk. Of course, if there's memory pressure, that inode is
a prime candidate for being freed, but until then, it will hang
around. Again, this opens up a window for the host to reuse the old
inode number.
For the most part, this was caused by me slavishly copying the ext2
i_nlink manipulations. The problem was that some operations, like
creat and mkdir, cause a stat() on the host, which fills the inode
with a bunch of stuff, including the link count. Any further changes
are just wrong. Of course, there are operations which do affect an
inode's link count without causing a stat() on the host. rmdir and
unlink change the parent directory's link count. These must be done
by hand by the filesystem.
With these fixed, I had one more instance to track down, and this took
almost a week.
The inode may be a removed directory, but a UML process has it
as its working directory. In this case, the process will be holding a
reference to the dentry belonging to that directory, which in turn
will hold a reference to the directory's inode. Again, this opens the
same window of host inode reuse. The solution to this was to postpone
the actual host rmdir() to the dentry d_delete operation. This is
called when the last reference to a dentry is dropped. At this point,
it is safe to remove the directory. I'm only postponing rmdirs, not
unlinks, because I can't think of any way that a file dentry could be
held after it is removed without the file being opened. In this case,
UML will hold the host inode by having the file open.
22 Sep 2006
2.6.18 is out finally. I have a pile of patches stuck in -mm which
will finally get flushed out to Linus. I have another pile stuck
here which I will finally flush out to Andrew.
I implemented ethernet MAC randomization yesterday. This
automatically assigns a random MAC to any interfaces which didn't get
one on the command line. This eliminates the practice of assigning a
MAC based on the first IP address assigned to the interface. That
stopped working when distributions started bringing interfaces up
before assigning IP addresses to them.
I'm also going to start sending out SMP patches. These won't enable
SMP, exactly. They will do things like fix and document locking, so
that will largely be ready by the time SMP support itself is there. I
also have a patch which allows an SMP kernel to boot with one CPU.
This goes through the SMP boot process, and has locking compiled in,
but, obviously, there is no concurrency to actually exercise the
locking. I'm debating whether that should be sent in.
14 Sep 2006
Another long interval between diary entries, sorry.
We are now up to 2.6.18-rc7 and 2.6.18-rc6-mm2. UML builds and runs
fine in both trees, except that the jmpbuf problem with newer libcs is
still not fixed in -linus. There's a patch in -mm for this, but
Andrew hasn't sent it along to Linus yet. There was recent -mm
breakage caused by Rusty fiddling the i386 ptrace.h. This provoked me
into splitting it into a userspace-usable part (ptrace-abi.h, which
UML uses rather than ptrace.h) and a non-userspace-usable part, in
which i386 people can do whatever they want, and I don't care. This
resulted in a nice tidying of the UML ptrace.h, eliminating all of the
cpp tricks to rename i386 ptrace symbols out of the way of UML ones.
I've started making SMP work in skas mode. What I have so far allows
UML to boot in SMP mode with a single CPU. This goes through the
mechanics of an SMP boot and locking (without any contention,
obviously). The next step is, of course, multiple CPUs. UML
currently gets into userspace with two CPUs, but dies. The skas0 mode
address space handling needs some generalization. Currently, there is
one host process per UML address space. This breaks when there are
two threads in the same address space, both wanting to run at the same
time. They obviously can't both use the same host address space, and
this manifests itself when two virtual CPUs try to ptrace the same
host process simultaneously. This happens more often than you might
think, between a vfork and an exec, for example. You don't need an
explicitly multi-threaded process in order to make this happen.
I was somewhat out of commission for a couple of weeks (although I did
get some useful work done on trains and boats) on a trip to Europe
which included Linux-Kongress. There, I gave a talk on my view of the
state of Linux as a hypervisor. This was essentially the same talk as
I did at OLS.
1 Jun 2006
I spent a couple of days tracking down a timer bug which caused sleeps
to be about one second short. This was particularly noticeable with
one second sleeps, which returned immediately. I tracked it down (I
thought) to an optimization in the ktime implementation, which
operated on a structure with two 32-bit elements as though it were a
64-bit integer. I wrote a fix and posted that to the world, plus the
maintainer, who told me that I had passed a non-normalized time into
his API. It turns out that the entire problem was caused by my
initialization of wall_to_monotonic, which is a negative time, and is
used to convert wall time to the time since the system booted.
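For reference, "normalized" here means that the nanosecond field lies between 0 and one second, even when the time as a whole is negative. The sketch below shows that convention in Python. The exact bad value that was passed in isn't recorded above, so the numbers are only illustrative.

```python
NSEC_PER_SEC = 1_000_000_000


def normalize(sec, nsec):
    """Return (sec, nsec) with 0 <= nsec < NSEC_PER_SEC and the same total."""
    # divmod floors toward negative infinity, so a negative nsec borrows
    # from the seconds field and the remainder is never negative.
    carry, nsec = divmod(nsec, NSEC_PER_SEC)
    return sec + carry, nsec


def total_ns(sec, nsec):
    """Total time in nanoseconds for a (sec, nsec) pair."""
    return sec * NSEC_PER_SEC + nsec
```

So a time of -1.5 seconds is carried as (-2, 500000000), not (-1, -500000000). Both pairs add up to the same total, but only the first is safe to hand to code that assumes a normalized value.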
It turns out that BB had found this bug a long time ago, and had a
patch (which fixed a bunch of other things) which he never finished.
He sent it out, and I merged everything that was still applicable.
26 May 2006
My x86_64 box rebooted after a long uptime, and got a newer kernel
than it had been running before. This new kernel broke UML. This was
caused by a bug which had been noted previously on the lists. x86_64
system call tracing returned two exit notifications for each system
call. Needless to say, this will badly mess up anything which pays
attention to system call notifications.
Ironically, this was introduced as part of another bug fix which
potentially could have affected UML. All system call returns on
x86_64 are done through sysret, which is a relatively new,
low-overhead, system call mechanism. It reserves a couple of
registers, in particular, %RCX, into which you load the userspace
address to which you want to return. There is the potential for %RCX
corruption during sigreturn, which must preserve all registers. This
is because UML converts this, like all system calls, into getpid().
There is a special return path in x86_64 for sigreturn, but UML evades
this by changing it to getpid. Thus, there is potential for %RCX to
be corrupted on return from sigreturn. It looks like that is now
fixed.
I figured out what was happening yesterday, and sent a patch to Andi
Kleen, plus the x86_64 and kernel worlds. Andi sent back a different,
and presumably better patch. However, today, akpm picked up my patch
and dropped it into -mm. I immediately requested it be dropped and
replaced with Andi's patch.
In other news, Al Viro noticed the warning about strcpy being
undefined at the end of the UML build. This was caused by a
combination of a sprintf in nfs being converted to strcpy by gcc, and
UML following i386 in having an inline-only strcpy. This was fixed by
following the i386 CFLAGS more closely, and adding -ffreestanding to
them.
23 May 2006
Some more beating exposed a humfs memory leak which killed the UML
after about a day of a kernel build loop. I uncovered it after a day
of instrumenting humfs. This process convinced me that that was the
only humfs memory leak. Every byte allocated was freed and I have
the trace to prove it.
11 May 2006
Another week, another batch of patches off to Andrew. These are small
fixes, no major features at this point.
I started banging on humfs again. Last I played with it, I could boot
a humfs filesystem, but couldn't do a kernel build on one. The build
failed with errors that suggested file corruption. Now, that behavior
just vanished. I've done a number of kernel builds without seeing that
problem. I've seen other problems, but they are now fixed.
The major one was that each user of the AIO subsystem handled -EAGAIN
from the host on its own. This happens when the buffer used to send
AIO requests to the host is full of already-submitted requests which
have not yet finished. When this happens, the caller was expected to
handle this in whatever way makes sense. The ubd driver would just
stick the queue on a restart list and return back to the block layer.
When the ubd interrupt routine next handled a finished request, it
would rerun any queue on this list.
This is fine when it's the only user of AIO, but it fails when there
are other users. If humfs had filled up the host with requests, then
this behavior of the ubd driver will fail because the ubd interrupt
routine will never be called again since there are no pending ubd
requests.
This prompted the centralization of some code into the AIO subsystem.
In particular, AIO users no longer create their own pipes to receive
finished requests. Now, there is a single pipe, created by the AIO
subsystem, interrupts from which are handled by a single handler. This
handler takes the finished requests and hands them off to the AIO
user. This is conceptually a fairly simple change, but it resulted in
a lot of changed code. The humfs and ubd interrupt handlers were
restructured, their -EAGAIN handling is different, structures changed,
etc.
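A toy model of the single-handler arrangement, in Python. How stalled submissions are retried after the change isn't spelled out above, so the central stalled queue is an assumption of this sketch, as are the class and method names.

```python
from collections import deque


class AioSubsystem:
    """Toy model: one completion path shared by every AIO user."""

    def __init__(self, capacity):
        self.capacity = capacity   # slots in the host submission buffer
        self.in_flight = deque()   # (user, request) pairs handed to the host
        self.stalled = deque()     # submissions that hit the -EAGAIN case
        self.finished = {}         # user -> list of completed requests

    def submit(self, user, request):
        """Queue a request; return False when the host buffer is full."""
        if len(self.in_flight) >= self.capacity:
            # Remember the request centrally (assumed design) instead of
            # leaving each user to arrange its own restart.
            self.stalled.append((user, request))
            return False
        self.in_flight.append((user, request))
        return True

    def complete_one(self):
        """Single handler: deliver one finished request, retry a stalled one."""
        user, request = self.in_flight.popleft()
        self.finished.setdefault(user, []).append(request)
        if self.stalled:
            self.in_flight.append(self.stalled.popleft())
```

With a capacity of two, two humfs requests fill the buffer and a ubd request stalls. The first humfs completion frees a slot and restarts the ubd request, which is exactly the case that hung when ubd waited for its own interrupt.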
At this point, humfs seems healthy. It boots and does kernel builds.
As long as I don't discover any major problems, I'm going to send it in
for 2.6.18.
26 Apr 2006
The time namespace patches I sent out just before leaving for Brazil
got a bunch of comments pointing out things I did wrong. The
PTRACE_SYSCALL_MASK interface is wrong because strace -e (which is the
other possible user) has opposite requirements from UML. It wants to
selectively trace system calls rather than selectively not trace
them. There were also some bugs in the implementation.
Eric Biederman pointed out some failings in the actual time
virtualization code. What I did won't work so well when you want to
migrate a container from one host to another. This will require some
thought.
LWN picked up on it and Jon Corbet wrote a better summary
of these patches than I managed.
I spent last week in Porto Alegre, Brazil for FISL 7.0.
It's a pretty good show, though somewhat less technical than the
likes of OLS and LCA. It's larger, with ~4000 attendees registered,
and it had a small exhibition floor. There were a good number of
international speakers, and we all seemed to be in one track, in the
largest room, with simultaneous translation into English, Spanish, and
Portuguese. My own talk was somewhat sparsely attended. The
user-level, eye-candy-type talks were much better attended.
The UML Book is now out! There was a bit of a delay at the printer,
but it is now available, and people have it in their hot little
hands. It's available from Prentice-Hall and from Amazon.
13 Apr 2006
I've been slowly redoing the UML web site. The idea is to make it
more friendly to newbies, by having more step-by-step instructions for
the normal ways of doing things and less reference-type dumps of
information. The reference stuff will still be there, but it won't be
the first thing you see. What I have so far can be seen
here.
I implemented time namespaces in the host and support for them in
UML. The idea is to virtualize time in the host by creating a
partition which has its own independent clock. This partition takes
the form of a namespace which is created by the new unshare system
call. Within this partition, a settimeofday call will just change the
offset inside the namespace without changing the system time.
gettimeofday reads the system time and adds the offset.
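The offset arithmetic can be sketched in a few lines of Python. This is a userspace model of the behavior described, not the kernel patch. The class name and the injectable clock argument are made up for the example.

```python
import time


class TimeNamespace:
    """Model of a namespace with its own clock offset."""

    def __init__(self, clock=time.time):
        self._clock = clock   # stands in for the host's system time
        self.offset = 0.0     # private to this namespace

    def gettimeofday(self):
        # Read the system time and add the namespace's offset.
        return self._clock() + self.offset

    def settimeofday(self, new_time):
        # Only the offset changes; the system clock is left alone.
        self.offset = new_time - self._clock()
```

Two namespaces sharing one system clock can then disagree about the time, and setting the time in one has no effect on the other.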
The advantage to UML is that this allows gettimeofday to run on the
host as though it were running inside UML. The results are the same,
except that the system call doesn't need to be traced. In order to
make this work, I also had to add a mechanism for system call tracing
to be selective - I want to turn off tracing of gettimeofday while
retaining interception of everything else. This is done with the new
PTRACE_SYSCALL_MASK, which takes a bitmask saying which system calls
will be traced and which won't.
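A sketch of how such a mask could be built, in Python. The layout (one bit per system call number, 1 meaning "trace it") is an assumption made for illustration and may not match the patch. 78 is the i386 number for gettimeofday.

```python
NR_GETTIMEOFDAY = 78  # i386 system call number, used only as an example


def make_mask(nsyscalls, untraced):
    """Build a mask with one bit per system call; a set bit means 'trace'."""
    mask = (1 << nsyscalls) - 1      # start with everything traced
    for nr in untraced:
        mask &= ~(1 << nr)           # clear the bit to let the call run untraced
    return mask


def is_traced(mask, nr):
    """Report whether system call number nr is traced under this mask."""
    return bool((mask >> nr) & 1)
```

Calling make_mask(320, [NR_GETTIMEOFDAY]) leaves every call intercepted except gettimeofday. The 320 is an arbitrary table size picked for the example.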
With this stuff working, you'd expect that gettimeofday would be
pretty close to native speed, since it runs on the host without UML
doing anything. The two measurements I've done, with a loop of a
million calls, are 98.8% and 99.2% of native.
I need to get the host side of settimeofday working, and then I will
send this off to LKML as an RFC.
31 Mar 2006
All of the patches that BB and I sent to akpm are now on their way to
mainline. So, 2.6.17 will have TLS and hotplug memory support. The
TLS support, in particular, was a long time coming, but it works very
well and has received a lot of testing, so I don't expect many
problems. It would be too optimistic to expect no problems when
exposing something new to a much larger user base, but I don't expect
any problems to be very large.
OLS papers are due tomorrow (April 1), so I've been working on
mine for the last couple of days. Eric Biederman is giving a
talk on extending namespaces to cover the entire kernel, so there's
going to be some overlap there, since that's also relevant to my
hypervisor talk.
28 Mar 2006
A bunch of UML patches have gone to Andrew in preparation for 2.6.17.
I sent in 10 more today, which were mostly Al Viro's UML cleanup
patches. The one that was mine was a cleanup of the earlier printf
patch.
BB sent in his TLS patches. They've been tested for a long time, and
have no known bugs - the last one was spotted and fixed a few days
ago.
This eliminates a whole lot of patches from my development tree. With
-mm2, I should be into the low 40s. There will be three major
(i.e. multipatch) projects left to be merged:
externfs, hostfs, and humfs - humfs can't currently do a kernel build,
and all of that code is in one directory, where it should probably be
three. Also, we have questions about the structure of the underlying
code, mainly the stuff that deals with filehandles.
UML/S390 - Most of the preparatory patches are merged. The signal
handler restructuring is still sitting there, as are a couple of
patches which I really don't like. The add-gate-vmas patch works, but
it's fairly gross. The fix-jiffies patch is small but it seems like
the wrong way to go about it. Also, there has been some bit-rotting,
as the skas0 interfaces have changed since I merged Bodo's patches
into my tree.
The ubd driver rework - I have a report, which I can't reproduce, that
this hangs under heavy I/O. Also, I have yet to sort out dealing with
the early I/O on COW files with O_DIRECT in the picture.
In other news, the sanitized kernel headers project was restarted.
This is of interest because that can, if done right, greatly simplify
UML's use of the host arch's headers. Now, most of them can just be
reused by UML. However, some of them are mostly usable, but have
stuff which is wrong for UML. I either copy these into asm-um,
leaving out the objectionable bits or include them, but use various
nasty tricks to get rid of the parts I don't want. Usually, this
means using defines before including them to rename the things I don't
want.
With a clean set of kernel headers, that won't be necessary any more,
assuming that they include all of the userspace-usable things, rather
than just the things that make up the kernel ABI. Kyle Moffett has
seemed amenable to accommodating UML, so we might get some nice UML
header cleanups from his KABI work.
25 Mar 2006
I fixed the humfs hang, so I consider it to be stable at this point.
However, I haven't yet tried a kernel build loop on it.
BB sent out his current TLS patchset for comment and review. I
dropped it into my akpm tree, which contains the patches destined for
-mm. It works fine, and fixes a problem that I had been chasing. So,
it looks good for 2.6.17, except that the patchset itself is a bit
disorganized.
This morning, I sent out 16 patches to Andrew for routing to
mainline. They include:
the rest of Gennady Sharapov's isolation of libc code
fixing the get_user warnings that popped up on current gcc
memory hotplug
allowing a ubd device to be shared among clustered UMLs
a handful of smallish bug fixes
22 Mar 2006
With hostfs mostly out of the way, I started looking at humfs. This
was fairly easy. The bugs were mostly due to bit rot, and easily
fixed. The one exception was a bug in the symlink handling in
humfsify. It turns out that both -d and -l are true in Perl for a
symlink that points to a directory. This was faking humfsify into
treating such symlinks as directories.
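The same trap exists outside Perl. In Python, os.path.isdir follows symlinks just as -d does, while os.path.islink, like -l, looks at the link itself. The sketch below shows the two tests agreeing on a symlink to a directory, and the usual cure of testing for a link first. The classify helper is made up for the example and is not humfsify code.

```python
import os
import tempfile


def classify(path):
    """Classify a path without following a final symlink."""
    # Test for a symlink first: os.path.isdir follows links, so it is
    # also true for a symlink that points at a directory.
    if os.path.islink(path):
        return "symlink"
    if os.path.isdir(path):
        return "directory"
    return "file"


with tempfile.TemporaryDirectory() as top:
    target = os.path.join(top, "realdir")
    link = os.path.join(top, "link")
    os.mkdir(target)
    os.symlink(target, link)
    # Both tests succeed on the link, which is what fooled humfsify.
    both_true = os.path.isdir(link) and os.path.islink(link)
    kinds = (classify(target), classify(link))
```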
On an intensive I/O load on humfs, I get hangs after a while. The AIO
thread is somehow faked into calling io_getevents when there are none
to get. My investigation into this uncovered some weaknesses in AIO
handling of -EAGAIN, when no requests can be queued until some have
been retrieved with io_getevents. The ubd driver assumes that, if it
gets -EAGAIN from an aio submission, that some pending requests are
ubd requests, and the queue will be restarted from the ubd interrupt
handler.
However, if the AIO queue is filled with humfs requests, there will be
no ubd request to wake up the queue, so the ubd driver will stall.
This isn't hard to fix, but there are some subtleties, mostly in
avoiding races.
19 Mar 2006
I found a couple of inode refcounting bugs in externfs. With these
fixed, I can do kernel builds on hostfs. What was happening was
that I was giving inodes extra references. When the files were
deleted, they were deleted on the host, but the inodes within UML
remained. At some point, a new file on the host would get the same
inode number as the deleted file. When externfs looked up this inode
number, it got the inode structure belonging to the old file since
that was never thrown out due to the extra reference counts. This led
to directories looking like files (when the old inode was a file) and
to normal files having the wrong length (because the length was taken
from the old file).
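The failure is easy to reproduce in a toy model. The Python sketch below keeps an inode cache keyed by host inode number and shows how one leaked reference makes a later lookup return the old file's inode. The classes are invented for this illustration and have nothing to do with the real VFS structures.

```python
class Inode:
    def __init__(self, ino, kind, size):
        self.ino, self.kind, self.size = ino, kind, size
        self.refs = 0


class InodeCache:
    """Toy inode cache keyed by host inode number."""

    def __init__(self):
        self.by_ino = {}

    def lookup(self, ino, kind, size):
        # A cached inode wins; host attributes are only used on a miss.
        inode = self.by_ino.get(ino)
        if inode is None:
            inode = self.by_ino[ino] = Inode(ino, kind, size)
        inode.refs += 1
        return inode

    def put(self, inode):
        inode.refs -= 1
        if inode.refs == 0:
            del self.by_ino[inode.ino]   # freed with the last reference


cache = InodeCache()
old = cache.lookup(42, "file", 100)
old.refs += 1     # the bug: an extra reference that is never dropped
cache.put(old)    # file removed on the host; the inode should go away here
# The host reuses inode number 42 for a new directory.
stale = cache.lookup(42, "dir", 4096)
```

Here stale is the old file inode, so the new directory looks like a 100-byte file. Without the extra reference, put() would have emptied the cache and the lookup would have built a fresh inode from the host's attributes.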
There are some things which still don't work. Loop-mounting an image
from a hostfs filesystem doesn't work (and never did, even with the
old hostfs) because it doesn't support sendfile. I added that, and it
turns out that the commit_write method is broken in a few ways. It
returns a byte count, rather than 0/-errno, which is wrong. It also
supports only mmap, rather than read/write to the host.
17 Mar 2006
I spent this week banging on the new, externfs-based hostfs. The book
is going to be out in a few weeks, and there are some things it
promises which I need to get working. The new hostfs and humfs are
among them. I've killed a bunch of bugs, and am currently chasing one
more. Everything I've found has been in the externfs layer, so I'm
debugging humfs at the same time. Hopefully that will work without
too much trouble once I'm happy with externfs.
8 Mar 2006
I implemented open, read, and close for umlfs, so now you can actually
look at the files you've exported from your UML to the host. I spent
most of the day tracking down a stupid vfsmount and dentry refcounting
bug. Once that was fixed, umlfs started working nicely again.
I found and fixed a bug in my FUSE async support. It turns out I was
referencing something right after it was freed.
1 Mar 2006
FUSE will fully support async operation when Miklos sends my patches
to mainline. I did a second round of the O_ASYNC patch and another
patch to enable O_NONBLOCK. Miklos queued them both, so we'll
probably be seeing them in mainline.
I'm also finding inconveniences in the FUSE library, which I am
complaining about and sending patches for, starting with a receive
routine which doesn't let me know if the /dev/fuse read returned
-EAGAIN. Miklos is being receptive, and fixing things.
In actual UML work, I implemented readlink today, so you can cd and ls
around inside the UML filesystem on the host without many problems.
There are still some problems on filesystem boundaries which I don't
understand yet.
BB made a release of the uml_utilities which contains a good number of
cleanups, including some 64-bit fixes in the COW file utilities. The
header wasn't specified correctly for 64-bit boxes, with the result
that a 64-bit system would misread a COW header produced by a 32-bit
system. I merged his changes into my tree, which contains the umlfs
utility, plus some mconsole work. I'll be making an official release
once I get some data off a laptop disk. Among that data is the
Makefile which I use to actually upload new releases of the utilities
and other UML-related files.
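The COW header problem mentioned above is the usual hazard of writing an on-disk structure with native C types. The Python sketch below shows the general cure, which is to spell out field widths and byte order so that every machine sees the same layout. The three fields are hypothetical and are not the real COW header.

```python
import struct

# Hypothetical three-field header: magic, size, sector size.
NATIVE = "@IlI"   # 'l' is the platform's C long, with native alignment
FIXED = ">IqI"    # explicit widths and byte order, no padding


def pack_header(magic, size, sectorsize):
    """Encode the header in the fixed, machine-independent layout."""
    return struct.pack(FIXED, magic, size, sectorsize)


def unpack_header(data):
    """Decode a header written by pack_header on any machine."""
    return struct.unpack(FIXED, data)
```

struct.calcsize(FIXED) is 16 everywhere, while struct.calcsize(NATIVE) depends on the machine it runs on. That difference is the kind of thing that lets a 64-bit reader misparse a header written by a 32-bit system.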
27 Feb 2006
Miklos (the FUSE guy) announced mountlo today, which is very similar
to the FUSE thing that I've been doing. He wrote this for a very
different purpose - to allow user-level mounting of host filesystems,
where UML is just an enabling tool. His design is also very different
- he creates a device inside UML and has a UML process doing the file
operations. In related news, I polished the fuse-async patch a little
and sent it to Miklos.
I finished my review of the page proofs of the UML book, fixing a
number of errors, some of which were pretty embarrassing. I think
this is the last thing I have to do with it until it's published. At
this point, it's on automatic pilot inside the publisher.
24 Feb 2006
I got the umlfs FUSE filesystem working a little. lookup and readdir
now work, so you can now cd and ls in a UML filesystem on the host.
There is a dentry corruption problem, which causes crashes later (on
shutdown usually). I think I'm either double-freeing a dentry somehow
or dputting it too often. Today I looked at adding asynchronous
notification to FUSE, so that the /dev/fuse file descriptor can
generate a SIGIO when there's something the userspace server has to
do. Lack of this is annoying since I have to generate other types of
interrupts to UML in order to get it to service FUSE requests.
Normally, this means whacking the keyboard. I am building the
O_ASYNC-enabled FUSE now, and we'll see how well it does.
20 Feb 2006
I started seriously working on putting a FUSE server inside UML to
export the UML filesystem to the host. It was easier than I expected,
but needed some effort because the FUSE library wants to do things
during initialization that can't easily be done inside UML, such as
setting signals, forking, and execing. What I ended up doing is
putting that stuff in a separate helper process, which gets a file
descriptor to /dev/fuse and passes that to UML through mconsole.
Inside UML, the driver does the rest of the initialization, like
creating a fuse session and registering an operations vector.
I copied the "hello" filesystem into my driver for testing purposes,
and that mostly works. One wart is that /dev/fuse doesn't support
SIGIO, so UML can't easily get an interrupt when a FUSE operation
needs to be handled. I'm planning on fixing this and sending the
patch to Miklos.
The UML book is nearing print. I'm currently reviewing a PDF
containing page proofs. This gives the thing a reality that it didn't
have before. Amazon knows about it already, and it is currently
2,382,397 in sales rank. The official release date is April 7.
8 Feb 2006
There is a virtualization infrastructure thread happening on LKML
right now which looks like it fits nicely with my thinking on having
UML do less system call tracing. The OpenVZ and CKRM folks are
interested in this, as is Eric Biederman, who wants to migrate
workloads around a compute cluster. The thinking is to introduce
namespaces for the various resources that the kernel controls and make
compartments by creating new namespaces and grouping them together.
For UML, this would mean creating a compartment which contains enough
UML data that system calls execute directly on the host, and do the
right thing.
Two of the eight patches I sent to mainline got dinged yesterday.
Ulrich Drepper didn't like my resurrection of the internal jmp_buf
defines that he got rid of, and suggested that I just reimplement
setjmp and longjmp myself. Linus didn't like the uaccess warning
patch since I apparently got rid of the warnings by making changes to
the declarations which didn't make any sense.
2.6.16 seems to be winding down now, so I think I'm through with
pushing patches into it.
6 Feb 2006
Last week, I discovered that rc1-mm4 didn't boot on x86_64, while
everything up to -mm3 did. A UML process would die strangely after
having a page fault handled. This is an invitation to do a bisection
search on the patches between the two releases in order to find the
culprit patch. I just did the search on the UML patches rather than
everything, and that turned up a patch which did nothing but move code
from one file to another. Further poking at that patch revealed that
moving a jmp_buf from one file to another was enough to cause the
crash.
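The search itself is mechanical. A minimal Python version, assuming the patches apply in order and that the tree stays broken once the culprit is in, looks like the following. The is_good callback stands in for "build and boot with the first n patches applied".

```python
def first_bad(patches, is_good):
    """Return the index of the patch that introduced the breakage.

    `is_good(n)` reports whether the tree works with the first n patches
    applied. Assumes zero patches is good, all patches is bad, and that
    the tree stays broken once the bad patch is applied.
    """
    lo, hi = 0, len(patches)   # invariant: lo applied is good, hi is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(mid):
            lo = mid
        else:
            hi = mid
    return hi - 1              # applying patch number hi broke the tree
```

Each test halves the range, so even a few dozen patches need only a handful of builds.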
Needless to say, this was uninformative. There was another clue, in
the form of a complaint by the host kernel about a bogus signal frame
whenever UML crashed like this. I pursued this by comparing "good"
signal frames to "bad" ones, and decided that the floating-point
register values must be to blame. It turns out that these weren't
initialized to any sort of sane values when a new UML process was
created. When I fixed this, the crash went away. I still have no
idea what the connection is with the jmp_buf moving around.
Last week, Andrew released rc1-mm4 and Linus released -rc2. After
tracking down that bug this morning, I updated my trees, throwing out
patches that had made it into mainline. After this, I had 67 patches
in both trees. Then, I sent 8 more to Andrew. These are small fixes
and cleanups that had accumulated over the last couple of weeks.
30 Jan 2006
With FUSE now in mainline, I started thinking about umlfs, which would
export, through FUSE, a UML filesystem to the host. I think this
would provide pretty much everything that the mconsole exec fans
want. If you chroot yourself to this mount, you will be running the
UML binaries, looking at the UML /proc, etc. So, you'd be able to fix
passwords and look at running processes. The things that wouldn't
work as expected would be things that don't go through the UML
filesystem. For example, ifconfig creates a socket and calls ioctl on
it to get interface information. This is a host socket, so you'd just
be looking at the host's interfaces. Anything that looks at the UML
filesystem will report on the UML.
This started me thinking. We would have processes that are acting
almost exactly like they are running inside a UML, except they are not
being system call traced like normal UML processes. Instead, they
are coupled to the UML at the subsystem, i.e. the filesystem layer,
level. This provides a whole new way of thinking about how to do
virtualization.
This is essentially migrating a UML process to the host. When you
migrate a process from one system to another, the process can't know
that it has moved. The new system has to proxy the old system's
information to the process. This can be done by redirecting every
system call back to the home system, which is how UML works. However,
that's not necessary. When you do file accesses, there are perfectly
good network filesystems which will readahead and cache the remote
system's data so that a process doesn't have to call back to its home
system on every file operation. This is what the FUSE does (or could
do) with the umlfs data. File operations would run at full speed as
long as FUSE has the required UML data cached already.
The question this raises is whether it's possible to do this in
general - have nearly all UML process system calls run untraced on the
host, with the host proxying the UML data such that those system calls
run at full speed. This would have to include:
memory - all cached data would have to be in memory owned by UML in
order to maintain memory jailing
processes and networking - the host would have to show UML processes
the UML network and the other UML processes. This information would
need to be passed from the UML to the host and kept up to date there
in order to minimize callbacks to the UML.
kernel version skew would be an issue - a process using a new flag
that the host didn't implement would get a -EINVAL if the system call
runs natively on the host. So, there would need to be some selective
system call tracing when this is a problem.
So, this would require some significant support in the host kernel in
order to run, but raises the possibility that UML at some point could
mostly do away with system call tracing and run at nearly native
speed. How close it gets is determined by how often the host would
have to make UML processes wait for information from UML.
Since this is really implementing process migration, this will be of
interest outside the virtualization world. The same support would be
usable for migrating processes between systems, except that
communication between those systems would be over the network rather
than over memory as with a host and a UML running on it.
And once you have process migration, you start to think about
full-blown clustering. With migration, you are mirroring one system's
data on another. Clustering takes that one step further and does away
with the idea that data belongs to a particular system. The data
belongs to the cluster and lives wherever it is most convenient.
So, implementing virtualization as process migration makes a
not-quite-complete migration system useful as a performance boost for
UML. Finishing it by implementing the network support needed for two
physical systems to migrate processes would provide full process
migration to Linux. And that will get people thinking about
implementing full SSI clustering.
27 Jan 2006
I have a bunch of patches in my queue which together redo how the ubd
driver works. I've been wanting to send it to Andrew for a while,
except that they broke COW. This week, I tracked down the problem. I
wasn't specifying the length of bitmap I/O requests properly, causing
some parts of the COW bitmap not to be written out. The next time the
filesystem was mounted, these missing parts of the bitmap would cause
data to be read from the backing file rather than the COW file,
causing file corruption.
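To make the length issue concrete, here is a sketch of the arithmetic in Python, assuming a bitmap with one bit per sector. The exact mistake in the driver isn't given above. The example only shows how a too-short length leaves bits unwritten.

```python
def bitmap_span(first_sector, nsectors):
    """Return (byte_offset, byte_length) of the bitmap bytes that hold
    the bits for sectors first_sector .. first_sector + nsectors - 1.

    Assumes one bit per sector, eight sectors per bitmap byte.
    """
    first_byte = first_sector // 8
    last_byte = (first_sector + nsectors - 1) // 8
    return first_byte, last_byte - first_byte + 1
```

Four sectors starting at sector 6 straddle two bitmap bytes, so the write has to cover two bytes. A length computed as nsectors // 8 would come out as zero, and those bits would never reach the COW file.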
With that fixed, I upgraded my FC4 filesystem to FC5 Test, and it
seems to be working pretty well. The most obvious difference is that
udev is a lot faster in FC5.
18 Jan 2006
-mm4 broke the UML build in a couple of ways. There is a new thread
flag, TIF_RESTORE_SIGMASK, which, among other things, allows some
signal code to be pulled from the arches and made generic. UML
didn't define this, or support it, so the build broke. Adding
support was fairly simple, so I did it.
There was a __iowrite32_copy function introduced which is built in to the
kernel unconditionally, even though UML is most unlikely to use it.
It calls __raw_writel, which UML didn't define. I stole the
definition for this (and the "b" and "w" variants while I was at it)
from x86.
My asm-offsets patch had a bug in it which broke the build. This is
fixed, but not yet on its way to Andrew.
After these, UML builds and runs again.
The soft interrupts patch is now in -mm, finally. It hasn't been sent
to Linus yet, but I told Andrew that it is 2.6.16 material.
Linus released 2.6.16-rc1 and Andrew did -mm1 in quick succession. I
duly updated my trees, throwing out a number of patches (which had
been merged) in the process. My -mm tree has 58 patches in it, which
is a nice decrease from the ~90 I had before. When I push out the
ubd cleanup and the TLS stuff, that looks like it will go under 40,
which is much more manageable.
BB took back his TLS patches which I had cleaned up some, and cleaned
them up some more, so that they build on all combinations of
CONFIG_MODE_TT and CONFIG_MODE_SKAS. I dropped those back in my
tree, replacing my old ones. They seem to work OK, except that his
system seems to define a struct user_desc and mine doesn't. I have a
struct modify_ldt_ldt_s instead. We'll need to sort out which one we
can count on. I also deleted an annoying error message. These need
some more looking at, but they're closer to being sent to Andrew.
8 Jan 2006
The aforementioned half-dozen patches are now on their way to Andrew.
The nasty compilation fix is now a lot nicer. It turns out that I
could use the existing asm-offsets.h mechanism, which fixed the
problem and let me delete some Makefile crud.
All of my previous patches are now in the hands of Linus, so they
should be in his tree soon, if they're not there already.
6 Jan 2006
I sent Andrew a smallish set of four patches - three libc code
isolation patches and the futex.h consolidation. The futex.h patch
came about because I needed to revert the UML futex.h back to the
original version, which just returned -ENOSYS. It turns out that most
of the architectures had their own identical copies of this. Rather
than have UML create yet another copy, I stuck it in asm-generic and
made the other arches use that.
I spent today cleaning up the debris from the earlier patches that
came out in -mm1. It turned out that the boot output was printed
twice due to the mconsole stack patch. This isn't happening in my
main 2.6.15 tree for some reason, so I didn't notice until I saw it
with -mm1. There was also a nasty compilation problem which I have a
nasty fix for, plus reverting my earlier patch that checks at
compile-time that either MODE_TT or MODE_SKAS is enabled. Adrian Bunk
sent in a Kconfig way of doing it.
Beyond that, I have about a dozen more patches ready to go. These are
the ones that lead to softints. This has been in my tree forever, and
it speeds up UML quite nicely, so it's about time it went to mainline.
With these in mainline, my own patchset goes from the low 90s down to
the 50s, which is a much more manageable number. To get more patches
out of my tree, I'm thinking to clean up the ubd driver series and the
TLS/NPTL series and send them to Andrew for inclusion in 2.6.17.
That's about 15 more. If that happens, I will be down to around 40
patches, which is less than half of what I started with.
3 Jan 2006
Linus released 2.6.15 yesterday. It contained a last-minute set of
fixes from Paolo which fixed segfaults caused by running printk on the
wrong stack, cleaned up some code, and cleaned up the compilation. I
pulled it, and it seems healthy.
That release was my cue to send in a bunch of patches to Andrew. I
sent twelve, which were the bunch that I did in the last couple of
weeks, cleaning up the console code, and a bunch of other cleanups
that I did along the way.
Next on the list is the umid OS abstraction and the mconsole printk
interception that I did last week. Then, I'm looking at the ubd patch
series that I've had for a while.
31 Dec 2005
I generalized the mconsole printk thing a bit, and made sysrq use it.
Now, when you invoke sysrq, you get any output back in the mconsole
client. This is especially nice with sysrq t since that gives you the
stack of every process on the system. It's much better to get it from
the mconsole client than to have to go looking through the UML's dmesg
in order to see it.
29 Dec 2005
While playing with the UML consoles last week, I noticed that it is
possible to register consoles at runtime, and this makes it possible
to send printk output back to an mconsole client. This is done by
registering a console from the mconsole driver. This console is
called whenever there is output. Normally, it ignores the output and
just returns. However, when a "stack" command is active, it collects
the output and sends it to the client. This will capture the stack
and registers nicely. It will also capture any other printks that
happen to occur at the same time. I don't know of a way of ignoring
any such output, but that shouldn't happen very often.
Now, I need to hook sysrq up to this mechanism because it has had the
same problem. It is also possible to monitor all printk output from
the host. I would do this by adding another mconsole notification for
the kernel log. Whenever there is printk output, it would be sent to
the mconsole notification socket and whatever is listening to it.
In doing this, I learned something about Unix sockets. There is a
limited number of packets that can be pending at once, and it is easy
to fill it up. I was trying to figure out why the mconsole client
printed only a couple of lines of stack and then hung. It turns out
that the sendto from the mconsole driver was returning -EAGAIN and I
wasn't checking for it. This happened after only a few hundred
characters of output, which seemed a bit thin. After looking at the
code, it turns out that there is a limit of 10 (by default - this is
tunable from /proc or sysctl) packets pending on a Unix socket. Send
another and you get -EAGAIN.
I was sending one packet per printk call, some of which were pretty
small. I changed things so that the output gets accumulated into a
buffer, which is sent out when it fills up. This is more complicated,
but it makes things work a lot better.
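The accumulate-then-send idea can be sketched like this. It is only an illustration under assumptions: the buffer size, the struct, and the function names are hypothetical, and the send callback stands in for the real sendto on the mconsole socket.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define OUT_BUF_SIZE 512	/* hypothetical packet size */

struct out_buf {
	char data[OUT_BUF_SIZE];
	size_t used;
	/* Sends one packet; returns 0 on success, -1 if the queue is full. */
	int (*send)(const char *data, size_t len);
};

/* Send whatever has accumulated; keep it if the socket would block. */
static int out_buf_flush(struct out_buf *buf)
{
	assert(buf->send != NULL);
	if (buf->used == 0)
		return 0;
	if (buf->send(buf->data, buf->used) < 0)
		return -1;
	buf->used = 0;
	return 0;
}

/* Append printk-style output, sending a packet only when the buffer fills. */
static int out_buf_write(struct out_buf *buf, const char *s, size_t len)
{
	while (len > 0) {
		size_t room = OUT_BUF_SIZE - buf->used;
		size_t n = len < room ? len : room;

		memcpy(buf->data + buf->used, s, n);
		buf->used += n;
		s += n;
		len -= n;

		if (buf->used == OUT_BUF_SIZE && out_buf_flush(buf) < 0)
			return -1;
	}
	return 0;
}

/* Stand-in for the real socket send: it just counts packets. */
static int packets_sent;

static int count_packet(const char *data, size_t len)
{
	(void) data;
	(void) len;
	packets_sent++;
	return 0;
}
```

With this, a hundred ten-byte writes turn into one full packet plus a partial one that goes out on the final flush, rather than a hundred small packets.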
25 Dec 2005
Last week was console week. I decided to figure out what was causing
the -EBADFs in deactivate_all_fds on shutdown sometimes. This is a
consequence of closing a file descriptor and forgetting to shut down
the associated IRQ. In the console code, it is not obvious that when
the close happens, the IRQ is also freed. So, I decided to restructure
the code to make it so. In doing so, I discovered a bunch of other
things that needed fixing and cleaning up.
I ended up with 10 new patches, which do things ranging from code
reformatting with no functional changes to fixing the console behavior
when pasting a large amount of data into one.
That last problem has been a long-standing one. It turned out to be
two problems. One is that the console driver never implemented
throttling, which is how the tty driver tells the hardware driver to
stop sending it data. Adding this caused large pastes to stop losing
data.
The other problem was that the process receiving the large paste would
see an EOF in the middle and exit, and the shell would receive the
rest. This turned out to be caused by a bug in the driver exercising
a bug in the tty driver. The tty driver bug was that it was updating a
counter before queuing a character, assuming that the enqueuing would
succeed. However, it can fail, and the counter being updated for a
nonexistent character can cause premature EOFs to be emitted by the
tty driver.
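The safe ordering is to queue first and count afterwards. A minimal sketch, with a hypothetical queue and a line counter standing in for whatever the tty layer actually tracks:

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_SIZE 4	/* hypothetical, tiny so that it fills quickly */

struct char_queue {
	char buf[QUEUE_SIZE];
	size_t len;
	unsigned int newlines;	/* how many complete lines are queued */
};

/* Returns 0 if the character was queued, -1 if the queue was full. */
static int queue_char(struct char_queue *q, char c)
{
	assert(q != NULL);
	if (q->len == QUEUE_SIZE)
		return -1;
	q->buf[q->len++] = c;

	/* Only count the character once it is really in the queue. */
	if (c == '\n')
		q->newlines++;
	return 0;
}
```

If the counter were bumped before the fullness check, a rejected newline would leave the count claiming a line that was never queued.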
The UML bug was that, while it detected a full tty buffer and
rescheduled the processing of the current interrupt, it delayed it
only long enough to return from the IRQ handler, because the tasklet
was scheduled immediately. Delaying that for one jiffy fixed that
bug.
14 Dec 2005
There has been a problem with the TLS patches since they've been in my
patchset - they don't build on x86_64. This is now fixed. As part of
my cleanup of the TLS
stuff, x86_64 now builds and runs with the full patchset. I did
implement a bunch of do-nothing stubs for x86_64, which is worrisome
since it should have the same requirements as i386. However, my
x86_64 filesystem boots fine, and always has. This will need looking
into later.
9 Dec 2005
The Book has pretty much been handed over to the publisher now. I
finished some last-minute items like the acknowledgements, bio, and
artwork earlier this week. I still need to reread the thing to see if
there's anything that needs fixing, but aside from that, it is done.
Another book showed up on
Linux Journal. It's not really entirely about UML - it's about
debugging and performance tuning on Linux - but there's a chapter on
UML, and Linux Journal chose that chapter to excerpt (or the publisher
chose it to make available for excerpting).
On actual UML work, Blaisorblade's NPTL patches broke in an interesting way
with 2.6.15-rc5-mm1. I started getting lots of complaints
about PTRACE_SET_THREAD_AREA calls failing, and failing with a bogus
errno. After looking at this, it turns out that those calls had
always been failing, but failing silently because:
ptrace was called from kernel code
in kernel files, errno is renamed to kernel_errno to avoid conflicting
with the libc errno
so uses of errno in kernel files will refer to kernel_errno, not the
real libc errno
Since the PTRACE_SET_THREAD_AREA calls and references to errno were in
kernel files, they tested kernel_errno, which is little-used, and had
some random value in it. Until -rc5-mm1, that random value happened
to be 0, and it looked like the calls were succeeding. In -rc5-mm1,
kernel_errno contained 2, so I started seeing all these nasty error
messages.
I fixed it so that the constant failures were recognized as such, then
tracked them down. It turns out that TLS entries weren't being copied
into the child process during a fork, and bogus, empty entries were
given to the child instead. Fixing this allowed me to remove a patch
which I made without understanding it, but which made NPTL and the TLS
stuff work. With that patch gone, I think TLS support in UML is
understood, and we can start cleaning it up in preparation for mainline.
15 Nov 2005
Hotplug memory is now a semi-reality. What made it so was a patch
from Badari Pulavarty which allows the punching of holes in mmapped
tmpfs files. Even as limited as this, it is exactly what I need for
hotplug memory. The way this currently works is that you use
uml_mconsole to add or remove memory in the same way that you'd plug
or unplug a device. The driver inside UML will try to allocate the
requisite memory. If it fails to allocate it all, then you won't have
pulled out the full amount. You can try again a bit later after the
kernel has had a chance to free up some memory.
You can plug in memory in the same way. The current restriction is
that you can only plug in memory that had been previously unplugged.
You can make this somewhat less onerous by giving the UML a generous
amount of memory at boot time and immediately unplugging a lot of it.
This much has always been possible. What hasn't been possible up to
now is actually freeing the memory to the host. This is what Badari's
patch does. Once it's freed on the host, it can be plugged into a
different UML or just left on the host.
The next step is a memory management daemon on the host which watches
the memory pressure on the UMLs and the host, shuffling memory around
as needed. One thing that's fairly important is to keep the host from
swapping. This makes UML performance much more predictable, as it
won't need to be swapped in to be woken up. It also avoids some
pathological swap conditions where the host and a UML swap the same
page to their respective swap devices.
The Book is nearing completion. I've sent two revisions to the
publisher in the last couple of weeks. The final manuscript is due at
the end of the month.
28 Oct 2005
So, it's been a long time since the last entry. Some of that was
laziness, some was enforced by a stolen laptop. A lot has happened in
UML-land over the last six months, and I'll mention the highlights.
I'm about done writing a book about UML. This has been taking a great
deal of my time, and it is just about done. My deadline for the final
manuscript is Nov 27. The first draft is done, and I'm currently
going through it, and the reviewers' comments, to polish it up.
It is a how-to-use-UML book, from getting started for the first
time to setting up and running large UML servers. There's also some
history and my prognostications about the future. The publisher is
Prentice Hall, and it's due out in the spring.
In actual UML development, 2.6.13 and 2.6.14 (as of yesterday) are
out. A great number of UML patches are now in mainline, making it
much more robust than it had been. skas0 is now in there, along with
a bunch of associated performance improvements. More are coming in
2.6.15, notably Bodo's ldt patches.
The host AIO support went in during 2.6.14, but I yanked the ubd
driver's use of it at the last minute. There were problems in the
driver that were exposed by its use of AIO. I fixed them, but didn't
feel comfortable with them going in so late in the 2.6.14 cycle. So,
they will be introduced early in the 2.6.15 cycle in order to give
them some testing.
x86_64 has been working well for a while and now seems reasonably
mature. We are still smoking out bugs in skas0 once in a while. The
most recent one is the skas0 assembly stubs being assembled in an
unexpected way. It turns out the way I had written them (mostly with
each register assignment in a separate asm statement) doesn't
guarantee that the registers will stay that way up to the actual
system call. It also turns out that there are asm idioms for
doing this right.
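One such idiom is to put the whole register setup into a single asm statement and let the compiler do the loading through operand constraints. The sketch below is generic GCC extended asm for x86_64, not the actual skas0 stub code; the function name is made up, and other architectures fall back to the libc syscall() wrapper.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Make a three-argument system call.  All register assignments are
 * operands of the one asm statement, so the compiler cannot reuse those
 * registers between the setup and the syscall instruction.
 */
static long stub_syscall3(long nr, long a1, long a2, long a3)
{
	assert(nr >= 0);
#if defined(__x86_64__) && !defined(__ILP32__)
	long ret;

	__asm__ volatile ("syscall"
			  : "=a" (ret)
			  : "0" (nr), "D" (a1), "S" (a2), "d" (a3)
			  : "rcx", "r11", "memory");
	return ret;	/* raw kernel return value, -errno on failure */
#else
	return syscall(nr, a1, a2, a3);
#endif
}
```

The clobber list names %rcx and %r11 because the syscall instruction overwrites both.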
As for the lost laptop, this happened during my trip to the FSM (Free
Software Meeting) in Dijon, France. On the train from the airport to
Paris, someone sat down opposite us, and distracted us with some coins
he dumped on the floor. When we were looking under the seats, he
lifted my laptop from the overhead rack and left. I didn't discover
this until we got into Paris. The UML web site hadn't been checked
into CVS for a while (or backed up), so the only up-to-date copy of
the XML I had disappeared with the laptop. So, I grabbed a copy of
the site and gradually reproduced the XML from that HTML. This was
done a week or so ago, and I was able to start updating my patches
page.
6 May 2005
I went back to UML/x86_64 and got it working. The problem that I was
stuck on was that some processes would segfault in a way that I
couldn't diagnose. It turns out there were a few bugs in the x86_64
signal delivery code. They all were involved in extending the stack
downward when constructing a signal frame. One of them just tried to
extend the stack too far. Another, which was a generic bug, failed to
exempt signal frame generation from a stack extension check. And the
last totally mishandled a failure to do a virtual to physical
translation, resulting in signal frame data being written to physical
address zero.
With these fixed, UML/x86_64 seems healthy. I finished building a
64-bit LFS
filesystem with it. Aside from some difficulties with the packages
themselves, that went fine.
Since my 2.6 test box is my x86_64 box, and since AIO appeared in 2.6,
testing the UML AIO support has been held up on getting UML/x86_64
working. Since that is done, I went back to AIO and made it work.
Considering that the host AIO support had never been tested before, it
was surprisingly easy. A couple of bug fixes later, and it was
working. I instrumented it to see that multiple requests were being
sent to the host, and they were. I saw up to 16 outstanding requests
during a boot.
I have a ton of stuff to merge into mainline, so I started working on
that. Al Viro helped out by breaking his big cross-build patch into
manageable chunks. I sent those along, as well as about 12 others
that I had pending. I messed up a couple of them. I just forgot one
of Al's patches, on which future ones depended, so they didn't merge
well for Andrew. Another didn't go at all because I diffed it against
a built tree, and the result didn't patch into a clean tree. In the
end, those all ended up in 2.6.12-rc3-mm3, and Andrew queued them to
Linus for -rc4.
Today, I sent out 12 more patches. This time, I tested them in a
clean tree, fresh from being untarred from kernel.org tarballs and
patches. With those in, my patchset should be down to around 60.
There are more easy, independent fixes in there, so I'm going to start
doing them.
22 Apr 2005
I got my virtualized scheduler working somewhat, and announced it to
LKML. Dead silence from LKML, and a bit of reaction from my group at
Intel, where I forwarded the announcement. The LKML announcement can
be seen
here.
The gist of it is this:
The virtual schedulers form sched groups, each of which is a CPU
container that competes as a single process on its host scheduler.
The processes within the guest compete against each other for whatever
CPU time the container process gets from the host scheduler.
The sched groups are visible in /proc as /proc/sched-groups/pid,
where pid is the process id of the process that made the
sched-group. These directories now contain the former
/proc/pid directories, and symlinks have been left behind for
compatibility. Initially, all processes are in sched group 0,
/proc/sched-groups/0, which is the host scheduler.
The available schedulers are visible in /proc/schedulers. A process
becomes a guest scheduler by opening one of those files.
You move processes from one scheduler to another just by moving the
pid directory from one /proc/sched-groups directory to another.
The example in the announcement is three CPU hogs, one on the host
scheduler and two in a guest. They should get a 50-25-25 CPU split
because the two in the guest are competing for the 50% of the CPU that
the container process gets from the host scheduler. This is how it
actually looks:
Currently, it's for UML only - there are a couple of minor things
which I suspect will cause it not to build on x86 or anywhere else.
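The expected 50-25-25 split in the example is just equal sharing applied at two levels. A small sketch of the arithmetic, assuming every scheduler divides its time evenly among its runnable entries (the function names are illustrative only):

```c
#include <assert.h>

/*
 * host_entries counts everything the host scheduler sees, including the
 * guest's container process.
 */
static double host_proc_share(int host_entries)
{
	assert(host_entries > 0);
	return 1.0 / host_entries;
}

/* A guest process gets an equal slice of the container's share. */
static double guest_proc_share(int host_entries, int guest_procs)
{
	assert(guest_procs > 0);
	return host_proc_share(host_entries) / guest_procs;
}
```

With one hog on the host plus the container, host_proc_share(2) is 0.5, and each of the two guest hogs gets guest_proc_share(2, 2), which is 0.25.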
In other news, I brought UML up to 2.6.12-rc3. This required some
nasty hackery to get around an interaction between skas0, stack
randomization, and a consistency check in exit_mmap. Basically, the
stub data page added by skas0 was causing the check to possibly fail,
depending on where the process stack ended up in memory. Bodo and I
worked out a nasty solution this afternoon on #uml. I've been working
on getting skas0 in shape for mainline, and I pretty much had it, but
this is going to require some more work to do cleanly.
Bodo's S/390 port seems to be coming along nicely. He reports UML
working in TT, SKAS0, and SKAS3 modes. I've been merging his
non-S/390-specific patches, and waiting for him to bless the S/390
bits so I can merge those, too.
11 Mar 2005
Along with normal UML stuff, I've spent the last couple of weeks
virtualizing the Linux scheduler. What this means is that you
have a guest scheduler which looks to the host scheduler like a single
process. This is done currently by having a process open a magic
/proc file:
cat /proc/schedulers/guest-26
This causes the cat process to turn into a new instance of the
scheduler running on top of the original. This scheduler gets
whatever cycles it can from the host scheduler and uses them to run
its own processes. This means that the group of processes within the
guest scheduler compete for cycles from the host scheduler as a single
process, making this a CPU compartment.
I gave /proc a bit of an overhaul, with all the processes on the
system starting in CPU compartment 0, which is represented by
/proc/sched-groups/0. So, all the former /proc/pid directories
start off in /proc/sched-groups/0, with symbolic links pointing there
from /proc, so that ps and related utilities continue to work.
When a process opens /proc/schedulers/guest-26 and becomes a guest
scheduler, a new entry in /proc/sched-groups is created:
usermode:~# cat /proc/schedulers/guest-26 &
Created sched_group 177 ('guest-26')
[1] 177
usermode:~# ls -l /proc/sched-groups/
total 0
dr-xr-xr-x 2 root root 0 Mar 11 17:20 0
dr-xr-xr-x 2 root root 0 Mar 11 17:20 177
This is initially empty, except for itself, because there are no
processes in this scheduling group. We fix this by moving one there -
in this case, the shell that we are typing at:
usermode:~# mv /proc/sched-groups/0/158 /proc/sched-groups/177/
usermode:~# ls -l /proc/sched-groups/177/
total 0
dr-xr-xr-x 3 root root 0 Mar 11 17:20 158
dr-xr-xr-x 3 root root 0 Mar 11 17:23 177
dr-xr-xr-x 3 root root 0 Mar 11 17:23 185
So, you can see that pid 158 is now in this group, along with 185,
which is the ls.
The next thing to get working is to have a couple of infinite loops in
a sched-group and one outside, and to see that the two inside each get
25% of the CPU and the other gets 50%. This doesn't work right now
because I don't have the timer interacting with the guest scheduler
correctly.
What does this have to do with UML? UML is going to provide the
structure for virtualizing the scheduler and other kernel subsystems.
You can make UML (or any part of it, such as the scheduler) run inside
the kernel as a guest by treating this as a new "OS". You would port
it to internal kernel interfaces rather than the libc system call
interface. This makes the "U" part of "UML" something of a misnomer,
but that's OK.
In the longer run, I'd like to be able to run a userspace guest
scheduler which would have the same properties of the current
in-kernel guest. This would use its cycles to run the (otherwise
unjailed) processes under its control. A tt-mode-like implementation
would have normal processes being run one at a time, while a
skas-mode-like implementation would have a single process constructing
the confined processes from pieces provided by the host. So, for each
confined process, this scheduler would have an address space, a set of
register values, a set of open files, etc, and activate them all at
the same time when the corresponding process is supposed to run.
The confined processes would no longer exist in the host kernel,
except as disassociated parts. Thus, moving a process from the host
scheduler to a guest scheduler is really a process migration. I still
want the guest scheduler and its processes to be visible on the host
as they are now, so this means that the host will have to have "stub"
processes which are representatives of processes which are owned by
something else. Operations on these stubs will be passed along to the
scheduler that owns them.
This is starting to resemble a cluster, with processes migrating
from node to node, but visible across the cluster. I think that this
process of virtualizing kernel subsystems one at a time can lead to
something resembling a cluster.
16 Feb 2005
The x86_64 problem was that I was trying to use 64-bit constants and
not noticing the assembler telling me they were being truncated. I
now compute them (by or-ing two 32-bit constants together). Now, I
have to start hitting it with some I/O to see how well the AIO support
works.
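The real fix was in assembly, but the same idea expressed in C looks like this. It is a sketch only, with made-up helper names:

```c
#include <assert.h>
#include <stdint.h>

/* Build a 64-bit value from two 32-bit halves instead of one wide literal. */
static uint64_t make_u64(uint32_t high, uint32_t low)
{
	return ((uint64_t) high << 32) | low;
}

/* Split a 64-bit value back into halves; the inverse of make_u64(). */
static void split_u64(uint64_t v, uint32_t *high, uint32_t *low)
{
	*high = (uint32_t) (v >> 32);
	*low = (uint32_t) v;
	assert(make_u64(*high, *low) == v);
}
```

The cast before the shift matters: shifting a 32-bit value left by 32 would not give the intended result.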
15 Feb 2005
I finished the first pass of the ubd AIO support. The driver now
issues as many requests as it can, leaves them for the AIO layer to
deal with, and tells the block layer when a request is finished. It's
only tested on 2.4, which does one request at a time to the host, even
though a bunch have been issued by the driver. This is still a
reasonable test - the one thing that's different from the 2.6 AIO is
that the requests are guaranteed to finish in the order they are
issued. I think there are no issue-order dependencies, but I need to
test this on 2.6 to be sure.
However, my 2.6 box is x86_64, on which UML is currently broken. A
possible culprit is Bodo's skas0-clone patch, which makes sure that
child processes get a copy of the parent's segment registers by having
the stub in the parent actually create the child. Bodo gave me a mass
of assembly, which I gradually translated into C (with a few assembly
helpers) over the course of a couple days.
Another possible culprit is the VM op batching which I added recently,
which has the process stub perform a number of VM changes at once
instead of switching back and forth to UML for each one. I discovered
two bugs in the x86_64 support for that (one of which was also present
in the x86_64 skas0-clone support). These haven't changed the symptom
greatly, which is that the whole thing hangs on the exit of the first
non-init process. I'm seeing the two processes sharing a stack, which
is very bad, and I haven't yet tracked the cause down.
4 Feb 2005
In preparation for 2.6.11 having a good, buildable-out-of-the-box UML,
BlaisorBlade and I have been feeding select patches from my queue to
Andrew. In addition, I sent a patch to Linus (directly, not through
Andrew), which he took (!), and which fixed some mainline build problems
that didn't exist in -mm. In the meantime, BlaisorBlade pulled some
stability patches from my patchset and sent them to Andrew, who pretty
promptly sent them on to Linus.
In the meantime, the sense of urgency we feel about this has been
lessened somewhat by UML in a different way - its ability to expose
ptrace bugs. Bodo was seeing segment register corruption, and tracked
it down to a race in the host kernel where there was a window during
which a parent could ptrace some values into the process, and the
process would just overwrite them due to being in the middle of a
context switch. He explained this race to me in #uml, and I got it,
somewhat dimly, after a while. He then brought it up on LKML, got
Andrew and Nick Piggin interested in it, with Andrew saying that this
was a bug to hold up 2.6.11 for.
I've started going through my patch backlog and integrating stuff from
it. The most interesting piece I did was Gerd Knorr's X11 framebuffer
driver. This gives UML a real framebuffer to use as a console, which
is pretty slick. It has to be configured somewhat carefully - until I
disabled some things and enabled others, UML first wouldn't build, and
then it built and ran, but just popped up an empty black window for a
console. This patch will fix defconfig so that the configuration is
right, and it will build out of the box.
There are some oddities with it. If you have two consoles with gettys
on them, they will both appear in the framebuffer window. This seems
to me like we need to enable virtual console switching, although it
may also be a symptom of my /dev/tty0 botch. I haven't looked at it
closely enough to tell yet.
1 Feb 2005
I spent a good part of the last week playing with AIO support in the
ubd driver. I'm doing this very incrementally in order to be able to
track down breakage easily when it happens. The first step, which
went well, was to remove the ubd I/O thread and use the existing UML
aio support, which will run an I/O thread of its own on 2.4 hosts.
Next, I tried dequeuing a request and handling it totally by myself
instead of using elv_next_request to give me a piece of it at a time.
This went less well, due to my not understanding the 2.6 block layer
changes. Once I figured out the relationship between the struct
request sector, the bio sector, and the SG offset, things went rather
better.
Now, I'm pulling a full request off the queue, turning it into a set
of scatter-gather structs, and issuing them to the host. Right now,
it's one at a time like before, but the infrastructure is closer to
being able to issue them all at once, and letting the lower layers
issue them to the host in whatever way they can.
I fell a little behind Andrew over the last couple of weeks, so I
caught up today. Merging in rc2-mm2 was fairly easy, and I'll be
pushing the current incremental patches out today.
21 Jan 2005
Bodo's segfault stub cleanups accidentally broke x86_64 by
reintroducing the original bug that had me stuck for a while. Those
changes brought back the property that UML re-enters userspace after a
page fault from inside a system call. To recap, this is bad because
system calls return using SYSRET, which returns to the address stored
in %rcx. This means that the system call wrappers can guarantee that
they don't care about %rcx, but a fault that can happen at any time
where %rcx might contain something important can't return to userspace
using SYSRET.
I complained at him about this, and he came up with the neat
suggestion of signalling the end of the address space fixups with an
int3, which generates a SIGTRAP, but does it with a processor
interrupt rather than a system call. This returns to userspace using
IRET, not SYSRET, which solves the problem nicely.
I had my patchset down to 10 patches at one point after sending
another batch to Andrew. It's now more than that as I merge more
stuff from BlaisorBlade and keep working off my backlog. Most of this
stuff is cleanups. I split the AIO support out of externfs and
started to add AIO support to the ubd driver. Also, I'm getting UML
help from other people at Intel, and those patches have started going
into my tree. We're separating out the userspace code from the kernel
code in preparation for a ring 0 port of UML.
14 Jan 2005
I sent 28 patches to Andrew the other day. I had 50 patches in my
patchset and it was getting a bit much. This included most of the
x86_64 stuff. Amusingly, Andrew's -mm3 announcement got eaten by
vger's spam blocker, and it was my fault:
The 2.6.10-mm3 announcement was munched by the vger filters, sorry. One of
the uml patches had an inopportune substring in its name (oh pee tee hyphen
oh you tee). Nice trick if you meant it ;)
I'm not actually sure what he's talking about. I grepped my patches
for the suspect string and it wasn't there. So, maybe it was one of
his names for my patches that did the trick.
The largest things I'm still holding onto are skas0, Bodo's faultinfo
rework, and the tlb flushing improvements.
skas0 is coming along, but it still needs work. Bodo made a good
suggestion about how to pass page fault information back from the
segfault handler. This simplified the code and made it more
efficient. The original implementation was:
the stub handles the segfault, and copies the fault information from
its sigcontext to registers
it stops itself
the kernel reads the values from those registers with PTRACE_GETREGS
the kernel continues the process by putting the original registers
back and PTRACE_SYSCALLing it.
This didn't work on x86_64 because of the way SYSRET works. If you
try to continue from a signal from the inside of a system call, you
corrupt %rcx. So, on x86_64, I had some special code which continued
the stub by having it return fairly normally from the signal with
sigreturn.
Bodo made the following suggestion:
instead of putting the fault information in registers, just make the
stub's stack available to the kernel and have the stub write the
address of the sigcontext in a known location
have the stub hit itself with a signal that's blocked in the handler,
but unblocked outside, causing it to stop with a signal right after
the sigreturn
the kernel reads the page fault information from the sigcontext,
handles the page fault, and continues the process
This unifies the x86 and x86_64 implementations, removes some
architecture-specific definitions that used to be needed for pulling
the page fault information from the register set, and makes the x86_64
implementation cleaner and faster.
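The "blocked in the handler, but unblocked outside" trick is plain POSIX signal behaviour and can be demonstrated in isolation. The sketch below is not the skas0 stub: it uses SIGUSR1 in place of the segfault and a second handler in place of the ptrace stop, purely to show that the second signal is delivered only after the first handler has returned.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t event_log[4];
static volatile sig_atomic_t n_events;

static void log_event(int what)
{
	if (n_events < 4)
		event_log[n_events++] = what;
}

/* Runs only once SIGUSR2 is unblocked, i.e. after fault_handler returns. */
static void stop_handler(int sig)
{
	(void) sig;
	log_event(3);
}

/* Stands in for the stub's segfault handler. */
static void fault_handler(int sig)
{
	(void) sig;
	log_event(1);
	raise(SIGUSR2);		/* blocked here, so it only becomes pending */
	log_event(2);
}

static int run_demo(void)
{
	struct sigaction fault, stop;
	sigset_t both;

	assert(n_events == 0);	/* the demo is meant to be run once */

	/* Make sure neither signal is blocked outside the handler. */
	sigemptyset(&both);
	sigaddset(&both, SIGUSR1);
	sigaddset(&both, SIGUSR2);
	sigprocmask(SIG_UNBLOCK, &both, NULL);

	memset(&fault, 0, sizeof(fault));
	fault.sa_handler = fault_handler;
	sigemptyset(&fault.sa_mask);
	sigaddset(&fault.sa_mask, SIGUSR2);	/* block it inside the handler */

	memset(&stop, 0, sizeof(stop));
	stop.sa_handler = stop_handler;
	sigemptyset(&stop.sa_mask);

	if (sigaction(SIGUSR1, &fault, NULL) < 0 ||
	    sigaction(SIGUSR2, &stop, NULL) < 0)
		return -1;

	return raise(SIGUSR1);
}
```

After run_demo() returns, event_log holds 1, 2, 3: the pending SIGUSR2 fires once the handler's signal mask has been restored on return from the first handler.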
Bodo also noticed that skas0 doesn't copy ldt entries properly from
parent to child across a fork. This is not hard to fix, but does
point out that skas0 is not ready for prime time yet. I'm also
concerned about page fault speed. skas0 is noticeably faster than tt
mode for everything else, but it is a couple of context switches
slower when mapping in a single page during a page fault. He made a
suggestion for this, which should speed it up a little, but still
won't touch those extra context switches. Rather than having the stub
run the system call and signal itself, it's a little quicker to just
single-step the system call. This saves the system call exit from the
single-stepped mmap and the entry for the kill(). Not a lot, but I'll
take anything I can get.
I also have patches for improving tlb flushing performance by merging
adjacent operations wherever possible. The next step, which I had
implemented, and then somehow lost, is to have skas0 batch them up so
the stub does a lot of them without switching back and forth to the
kernel on each one.
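Merging adjacent operations can be sketched as follows, for the simple case of protection changes on contiguous ranges. The struct and function are hypothetical and are not taken from the UML tlb code.

```c
#include <assert.h>
#include <stddef.h>

/* One pending address space change; the fields are illustrative only. */
struct vm_op {
	unsigned long addr;
	unsigned long len;
	int prot;
};

/*
 * Append an operation to the pending list, extending the previous entry
 * when the new range starts where it ends and has the same protection.
 * Returns the new number of entries.
 */
static size_t add_vm_op(struct vm_op *ops, size_t n, size_t max,
			unsigned long addr, unsigned long len, int prot)
{
	if (n > 0) {
		struct vm_op *last = &ops[n - 1];

		if (last->addr + last->len == addr && last->prot == prot) {
			last->len += len;
			return n;
		}
	}

	assert(n < max);	/* a real version would flush the list here */
	ops[n].addr = addr;
	ops[n].len = len;
	ops[n].prot = prot;
	return n + 1;
}
```

Two back-to-back pages with the same protection collapse into one entry, so the host sees one call instead of two. Batching would then hand the whole list to the stub in one go.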
Also in my tree now is a patchbombing from Blaisorblade, so I am back
up to 40 patches. He's doing a lot of good cleaning work, and these
patches make large improvements in some areas of the code.
I need to get 2.4 back on track, so I've been figuring out what
patches need to be merged from 2.6. Sadly, I'm not organized enough
to have held onto copies of the patches I have sent to Andrew. So, I
pulled all his broken-out patchsets, and grepped out the UML-related
ones. There are 166 of them, not all of which are backportable to
2.4. However, probably most of them are, so that's what you'll be
seeing in the 2.4 part of my
patches page soon.
7 Jan 2005
I finally got the x86_64 port merged into my tree. See the
incremental patches
page for the patches. I will be sending them to Andrew when I
think they're stable, which right now, they're not. I have a busybox
filesystem which boots nicely, but when I try anything fancy, I get
crashes.
To test this, and to get a 64-bit filesystem, I've been doing the
Linux From Scratch thing. This is a hugely useful project. I started off
by naively building gcc, then realizing I was going to need libc, and
trying to build that, and failing. Starting a toolchain and libc from
scratch is subtle, and you'll waste a lot of time if you want to
figure it out on your own. So, I now have a temporary set of tools in
my filesystem, and I'm going to continue building it from inside UML
to exercise it.
A while back, BlaisorBlade started asking questions about why
something very much like skas mode can't be implemented in stock,
unpatched hosts. It turns out that the answer is that it can. The
result is skas0 mode, which offers the security of skas mode, and a
good part of the performance, without needing to patch the host. The
two things that skas mode gives you can be implemented on stock hosts:
/proc/mm - This lets you create new address spaces without needing a
new process for each one. This can't be done without patching the
host, but the benefits of this are somewhat lower memory consumption
in the host kernel. It's easy enough to get new address spaces by
creating new processes.
It also lets you modify those address spaces, i.e. mapping and
unmapping pages, and changing page protections. This can be done by
inserting code into the process, and making the process call it
whenever you need address space changes. I've done this by taking
away the top two pages of the process address space, and using one for
code that the process will run, and the other for handling SIGSEGV.
More about that later.
The code page has two pieces of UML code mapped into it. The first just
executes a system call. UML sets up the process registers for the system
call and sets the IP to the start of this bit of code, and continues
the process. The child just executes the system call and signals
itself so UML knows that it is done. This is used for mmap, munmap,
and mprotect.
PTRACE_FAULTINFO - This allows UML to get page fault information from
the child when it gets a SIGSEGV. This is done in skas0 with the
second piece of code inserted into the child. This is a segfault
handler, which reads the page fault information out of its sigcontext,
puts it in registers, and signals itself. UML then reads the
registers, and gets the page fault information from there. The data
page that is mapped into the child is just the stack for the SIGSEGV
handler to run on.
I'm trying to convince myself that this obsoletes tt mode so
completely that it can just be thrown out. From a security point of
view, it's a no-brainer. From a performance point of view, skas0 is
generally a win, but there are specific spots where it performs
worse. In particular, page fault handling is slower. tt mode can get
the page fault information directly from its stack, and fix its
address space just by calling mmap/munmap/mprotect directly. skas0
has to make two context switches in order to get the page fault
information, and at least two more to fix it. This aside, it is
noticeably faster on my favorite benchmark, the kernel build. UML can
do a build in just under three minutes on my laptop.
I've been doing some other performance improvements in order to get
skas0 performance above that of tt mode wherever possible. I've
removed some unneeded code from the system call path. I've also made
tlb flushing more efficient by minimizing the number of system calls
needed. This is done by merging adjacent operations wherever
possible. For example, a large munmap, which used to be done
page-by-page, is now coalesced and done with a single munmap.
Similarly, mmaps which are contiguous in memory and the backing file
and have the same permissions are done in one operation, rather than
one operation per page.
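
The merging described above can be sketched roughly like this. It is
only an illustration in Python, not the UML code; MapOp, coalesce and
the 4096-byte page size are made-up names and values for the example.

```python
from dataclasses import dataclass
from typing import List

PAGE_SIZE = 4096  # assumed page size for the illustration


@dataclass
class MapOp:
    """One pending mmap of a single range (hypothetical structure)."""
    addr: int      # start address in the process
    length: int    # length in bytes
    prot: int      # protection bits
    offset: int    # offset into the backing file


def coalesce(ops: List[MapOp]) -> List[MapOp]:
    """Merge operations that are contiguous in memory and in the
    backing file and that share the same protection."""
    merged: List[MapOp] = []
    for op in sorted(ops, key=lambda o: o.addr):
        last = merged[-1] if merged else None
        if (last is not None
                and last.addr + last.length == op.addr
                and last.offset + last.length == op.offset
                and last.prot == op.prot):
            # Extend the previous operation instead of issuing a new one.
            last.length += op.length
        else:
            merged.append(MapOp(op.addr, op.length, op.prot, op.offset))
    return merged
```

Each entry left in the merged list would then cost one system call
in the child, rather than one per page.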
Having started using evolution to keep track of deadlines and to-do
lists, I've started thinking again about embedded UML as a way of
adding things that I would like, but which are specialized enough that
they probably never will be.
To recap, embedded UML is the idea of making UML into a library which
can be linked into other apps. Having done this, you would then
implement a little filesystem which would be mounted inside the UML
and give you access to internal application state. This would be
exactly like /proc, which provides access to internal kernel state.
Many entries in /proc allow reading and writing kernel variables, but
many have more complicated semantics. This "appfs" would be exactly
the same thing, except it would be specific for the application that
you're embedding UML into.
Getting back to evolution, a very simple thing I would like is some
statistics on the average age of my todo items. Are they getting
older or younger? And maybe a pretty graph showing that. I would
like to do this by making an "evofs", mounting it (on /evolution in
the embedded UML, say), looking at the items in /evolution/tasks,
pulling out their ages and doing the calculation. Let's say that each
task is a directory, with each attribute of the task in its own file
within that directory. So, I could write a script that looks at
/evolution/tasks/*/start-date, calculate the age of each one, and
average them.
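
Such a script might look something like the following sketch. The
evofs doesn't exist, so the layout and the contents of start-date
(an ISO date like 2005-01-07) are assumptions made for the example,
as is the function name.

```python
import glob
import os
from datetime import date


def average_task_age(tasks_dir: str, today: date) -> float:
    """Average age in days of the tasks under tasks_dir.

    Assumes each task is a directory holding a start-date file that
    contains an ISO date (a made-up format for this sketch).
    """
    ages = []
    for path in glob.glob(os.path.join(tasks_dir, "*", "start-date")):
        with open(path) as f:
            started = date.fromisoformat(f.read().strip())
        ages.append((today - started).days)
    if not ages:
        raise ValueError("no tasks found")
    return sum(ages) / len(ages)
```

Running this once a day against /evolution/tasks and logging the
result would give the data for the graph.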
I realize that there's a config file with a reasonable format under
~/.evolution which can be parsed to provide the same information, but
there are other things I want which can't be had by parsing config
files.
For example, there are scheduled things which have to happen on the
same day each month, but which I'd like evolution to schedule on a
weekday if that happens to be Saturday or Sunday on a given month. I
don't necessarily fire evolution up during the weekend, so the alert
will show up on Monday in those cases, which would be non-optimal. If
it alerts me on Friday, I make a note on a Post-It and have some
chance of remembering it during the weekend.
Needless to say, there is no "put this on the 15th of each month,
except if that's a weekend, in which case put it on the previous
Friday" button. So, what I want to do is have a little daemon in my
UML inside evolution which watches for new appointments by watching
/evolution/appointments via dnotify or inotify. When one shows up on
a weekend, it would just change the date field to the previous
Friday. There are obvious problems with appointments which are
scheduled forever into the future. One of these can't make an
infinite number of /evolution/appointments directories for this daemon
to look at. Evolution could just make directory entries for
appointments that are visible in the interface. So, if you go
scrolling through the rest of the year, everything you see will be on
a weekday.
Another example - I'm starting to write a book about UML, and just
finished the schedule for it, i.e. when each chapter will be ready. I
had a similar problem with this. Each deadline is some number of days
later than the previous one, and some of those will end up on a
weekend. I wanted each deadline to be on a weekday, except the
nearest one this time, so a Sunday deadline gets moved to Monday. A
similar daemon could be used to do this - the logic would be quite
similar.
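
The two rules are small enough to sketch. This only shows the date
logic that each daemon would apply; the function names are made up,
and the watching of /evolution/appointments is left out.

```python
from datetime import date, timedelta


def previous_friday(d: date) -> date:
    """Move a Saturday or Sunday back to the Friday before it."""
    wd = d.weekday()          # Monday is 0, Saturday is 5, Sunday is 6
    if wd >= 5:
        return d - timedelta(days=wd - 4)
    return d


def nearest_weekday(d: date) -> date:
    """Move Saturday back to Friday and Sunday forward to Monday."""
    wd = d.weekday()
    if wd == 5:
        return d - timedelta(days=1)
    if wd == 6:
        return d + timedelta(days=1)
    return d
```

Dates that already fall on a weekday pass through both functions
unchanged, so the daemons can be run over every appointment.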
However, we can't have two daemons battling over the same
appointments, so we need some way to specify which private rule will
be applied to an appointment. The best thing would be some button in
the appointment dialog that says "If this appointment falls on a
weekend, it should be moved to the previous Friday" or "If this
appointment falls on a weekend, it should be moved to the nearest
weekday". Evolution can export its internal state to an internal UML
via a filesystem, so it can export its UI in the same way. So,
somewhere in /evolution/ui, there'd be a directory for the appointment
dialog box. Fiddling that in the appropriate way, i.e. creating
directories or files for the new labels and checkboxes with the
information needed to position it correctly, would cause additions to
the UI when you next create an appointment.
However, if you do check those boxes, evolution won't do anything
about it because it knows nothing about those widgets except that you
told it to make them. But part of the information that you plugged
into the UI could tell evolution to include the final state of those
widgets in a particular place within the appointment's directory. So,
when it set up the checkboxes, my script could tell it to put their
state in the nearest-weekday and previous-friday files and could look
at /evolution/appointments/00123/nearest-weekday and
/evolution/appointments/00123/previous-friday to tell whether I
selected either one and move it appropriately.
Admittedly, these are minor things to use to justify a non-trivial
project such as making UML embeddable, and then modifying applications
to embed it. However, there are much more substantial uses for this
that I can imagine. For example, the embedded UML can export
application filesystems to the outside world via NFS or some other
network filesystem. In a group environment, this could mean exporting
your task list to your boss's evolution, where his UML will make a big
list by glomming together all the underlings' lists. The individual
lists could be mounted under /tasks/alice, /tasks/bob, etc, and then
symlinks made from /evolution/tasks to tasks within those
mounts. The "taskfs" running that filesystem would see the symlinks
being created, read the contents, and create the datastructures inside
the application to make all those tasks appear within the boss's UI.
As described above, the UI could also contain additions, such as the
owner of a task, that the original UI doesn't have. Then, there could
be an interface for changing it. Again, this would be completely
implemented by a script within the embedded UML. Changing the owner
of a task would involve moving it from one person's list to
another's. From the point of view of the script, this is moving the
task's directory from /tasks/bob to /tasks/alice. Then, Bob's and
Alice's UMLs will find out, via NFS, about the moved file, make the
changes locally, and by doing so, inform their respective evolutions
so that one will show one more task in its UI and the other will show
one fewer.
I don't know if evolution has this sort of thing already, or if it's
planned. If not, this shows how embedding a UML into an
application can make it much easier to extend, possibly in fundamental
ways, such as making an isolated application group-aware. You get a
standardized interface, the Unix file interface, and all the Linux
tools for using that interface in all applications that embed UMLs.
You don't need to worry about the application's source, or building
it, or the language that you need to use. The one place where you're
dependent on the application is what information it exports via a
filesystem and what you can do through it.
18 Oct 2004
UML updates have been going nicely into 2.6 via Andrew, so much so
that 2.6.9 won't be that different from my own tree. Thanks to
Blaisorblade for pushing patches to Andrew as well. He pushed the
initial set that got UML back into -mm, plus a batch just under the
wire for 2.6.9-final.
This leaves the question of what happens to the separate 2.6 UML
patches. Right now, my plan is to stop producing them, and let people
use the UML in -mm or -linus. People who want the latest
bleeding-edge stuff or a bug fix that hasn't made it to Andrew or
Linus yet can just grab it from my
patches page.
The other major UML happening is getting the x86_64 port merged into
my tree. The results of this can also be seen on my
patches page. I've
got most of it merged now - the exception is the 32-bit compatibility
code, a lot of which is gross, and it's optional, so I'll merge that
after the main body of the patch is merged and working.
So, I'm currently trying to get it to build. It looked ugly at first,
but fixing the most common compilation errors made the rest look a lot
more manageable.
23 Sep 2004
My patches are continuing to flow nicely through Andrew into 2.6. At
this point, 2.6.9-rc2-mm2 is almost the same as my 2.6 tree. There
are still some patches to go, but it's pretty close.
I'm working on getting the other major outstanding patches merged.
BlaisorBlade sent me a good number of his patches, and I've got them
mostly merged now. He did a bunch of good work on the build. make -j
now works nicely, plus UML does its final link in the same way as the
other arches. His patch did away with linux being the default target,
but I added that back. You now get a vmlinux and a linux, which are
the same thing. I sent about half of these to Andrew tonight, so they
should be appearing in the mm tree shortly.
Another of the outstanding patches is the x86_64 port. I spent
yesterday breaking it up into smaller patches. So, now I have 29
patches that I will be merging in. I have three of the smaller ones
in now. They are available from the new -mm section of the
patches page.
I also revived the skas4 patch. I'm having confusing problems with
UML host processes getting signals that I think they shouldn't.
Debugging on this is continuing.
After all these are done, or maybe in parallel, I'll start looking at
Gerd Knorr's patchset, plus smaller things like meta_tdb. This will
knock the patch backlog down nicely, and then it will be possible to
start looking at the small things that people have sent in, and which
have been languishing.
11 Sep 2004
This month's big news is that Andrew has started feeding the UML
patches he has been keeping in his tree on to Linus. BlaisorBlade
sent the big patch plus some patches of his own to Andrew, and I sent
in some more which made it build and work. At that point, which was
sooner than I expected, Andrew sent the whole thing to Linus, who,
miraculously, took it.
There were some glitches. Among them were my patch changing the
initial value of jiffies from -5 minutes to 0, which I put in because
it was getting a strange value in UML, another which added a UML
patch number to EXTRAVERSION, and, of course, ghash.h. The first two
were easily fixed, and I ripped ghash out of UML the day before
yesterday.
I've also started feeding him the patches that make up the difference
between what he has and what I have here. So, UML should be
reasonably up to date in -mm, and also 2.6, pretty soon.
BlaisorBlade made a suggestion which I think is reasonable, which is
that -mm and 2.6 should be the UML stable tree, and my tree should be
the development tree. Thus, patches would be forwarded from my tree
to Andrew when we think they're stable, and my tree would consist of
so-far-inadequately-tested patches. I'm also considering stopping
releasing 2.6 patches. You'll have the 2.6 tree with a good UML in
it, and my ongoing development will be available as patches from my
incremental patches
page.
I've also got externfs/humfs ported to 2.6 and working reasonably
well, except for some internal glitches, which will be fixed. With
some more work to enable mmap in humfs, this will let us start
eliminating the excess memory consumption caused by double-caching on
the host and UML.
19 Aug 2004
I've been keeping up with 2.6. I put out a second 2.6.7 patch which
syncs up my 2.4 and 2.6 trees. Then, I put out a 2.6.8.1 patch which
just updates to 2.6.8.1. I'm also tracking -mm in hopes of cleaning
up the UML in Andrew's tree enough that he can send it to Linus, and I
can finally get something current into stock 2.6.
I spent the last couple of days writing my views of where UML is
headed in the future. It is somewhat grandiosely called the Road Map. It doesn't
contain anything that I haven't talked about before, but that is all
buried in the papers and slides on the site, and I think that very few
people read them. This puts it all in a much more conspicuous spot on
the site.
Most people, understandably, consider UML to be just a virtual machine
technology. I'm trying to make clear that virtualization in general,
and UML in particular, are potentially much more than that. I talk
about porting UML into the kernel, and the possibilities that
creates. Also, UML can be linked into other applications and used as
a captive virtual machine and I describe what I see as some of the
possibilities there.
12 Aug 2004
The Kernel Summit and Ottawa Linux Symposium have come and gone.
Virtualization was a pretty consistent theme in both, and UML came up
often. At KS, Chris Wright gave a talk on virtualization on Linux in
which he described the various technologies that are currently
available. Then he talked about what Linux could do to better support
virtualized guests. Since I had given him most of the material for
this part, he more or less turned it over to me, and I went through my
laundry list of things that the host could do to make life easier for
UML. None of what I said seemed particularly controversial, which is
nice. Wim Coekaerts of Oracle was particularly interested in UML's
need for AIO, since Oracle uses it, and Wim perceives a lack of
interest in it on the part of Linus.
At OLS, there were a number of virtualization talks. An IBM Power
guy, Dave Boutcher, gave a talk on hot-plug CPUs and memory. The CPU
stuff is going to be used by UML in exactly the same way that the
other arches will. However, his memory hot-plug plans were aimed at
being able to remove a particular piece of physical memory from the
machine. This requires moving all kernel data structures from that
memory (or not putting them there in the first place). UML's memory
hotplug plans are much simpler. It is sufficient to be able to grab a
number of pages and free them to the host, and it doesn't matter where
they come from, or that they are contiguous.
There was a talk by the Xen people, which I missed because I went to a
different talk. Chris gave his Linux virtualization talk again, which
I punted on this time because I had heard it already. I don't know
what he did with the section that I pretty much did during KS.
On returning from Ottawa, it was time to get back to UML, which I can
now do pretty much full-time, thanks to Intel. Since then, I've got
UML pretty much up to date with all the trees I keep up with. I fixed
a bunch of problems in the latest 2.4.26 release, released a 2.6.7
patch, and, last night, got 2.6.8-rc4-mm1 running and sent the
requisite patches to Andrew.
I've started using quilt to manage patches, and started posting my
current patchsets to the
UML site. This has gotten good reviews from my users, and has helped
me by giving me some assurance that I'm not going to majorly break
anything when those patches are released for real.
I currently have my 2.6.7 tree synced up with 2.4.26, except that
humfs and hostfs aren't all that stable yet after being ported over.
The mm tree has a pile of bb patches in it which I need to sync up
with, and merge any applicable ones into 2.4.
15 Jul 2004
Recent UML work has focussed on the host-based filesystems, mainly
humfs. The filesystem metadata has been reworked to fix some bugs in
the first version, namely to add modes to the metadata, handle hard
links correctly, and to correctly handle the case of where to put the
metadata of the parent directory containing a "metadata"
subdirectory. I had a plan for files named "metadata", but forgot to
check that it handled directories as well. The fix is to create two
metadata directories, one for files and one for directories. In the
directory metadata, there can be only one plain file in a given
directory. This will normally be called "metadata", but if there is a
directory by that name, it will be called something else. Whatever it
is called, it can be found by looking for a non-directory.
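
The lookup rule in that last sentence can be sketched as below. This
is a Python illustration rather than the humfs code, and the function
name is made up.

```python
import os


def find_dir_metadata(meta_dir: str) -> str:
    """Return the path of the single plain file in meta_dir.

    Everything else in a directory's metadata directory is expected
    to be a subdirectory, so the metadata file is whatever entry is
    not a directory, regardless of its name.
    """
    plain = [entry.path for entry in os.scandir(meta_dir)
             if not entry.is_dir(follow_symlinks=False)]
    if len(plain) != 1:
        raise ValueError(
            f"expected one plain file in {meta_dir}, found {len(plain)}")
    return plain[0]
```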
This required bumping the metadata version and updating humfsify.
Since I don't want to increase version numbers very often, I also
fixed the mode problem, which is also a version-affecting bug, at the
same time.
Humfs seems healthy at this point. I can boot from it and do a kernel
build on it. Hostfs was also affected by the externfs restructuring,
and got little attention, so it was broken. So, I've been working on
it. It is better, but there are still some bugs. It doesn't survive
a kernel build yet, but it's doing a lot better than it was yesterday.
I've moved UML development onto my new laptop, which means I'm running
UML in tt mode until I put a skas kernel on it.  This exposed a
problem with the file descriptor management that I wouldn't have seen
otherwise. In tt mode, a pipe is used as part of context switching.
The outgoing process writes a byte into the pipe of the incoming
process and then reads its own pipe. The write will bump the incoming
process out of its own read, waking it up, and the outgoing process
will sleep in its read until something writes to its pipe.
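
Here is a small sketch of that handoff. It uses Python threads to
stand in for the tt-mode processes, and Context, switch_to and
run_demo are names invented for the example, so treat it as a model
of the idea rather than the real code.

```python
import os
import threading


class Context:
    """A schedulable context with its own wakeup pipe."""

    def __init__(self, name: str):
        self.name = name
        self.read_fd, self.write_fd = os.pipe()


def switch_to(outgoing: Context, incoming: Context) -> None:
    """Wake the incoming context, then sleep until someone wakes us."""
    os.write(incoming.write_fd, b"x")   # bumps incoming out of its read
    os.read(outgoing.read_fd, 1)        # blocks until our pipe is written


def run_demo() -> list:
    a, b = Context("a"), Context("b")
    trace = []

    def body_b() -> None:
        os.read(b.read_fd, 1)           # b starts out asleep in its read
        trace.append("b runs")
        os.write(a.write_fd, b"x")      # wake a again, then exit

    worker = threading.Thread(target=body_b)
    worker.start()
    trace.append("a runs")
    switch_to(a, b)                     # a sleeps until b writes to a's pipe
    trace.append("a runs again")
    worker.join()
    for ctx in (a, b):
        os.close(ctx.read_fd)
        os.close(ctx.write_fd)
    return trace
```

Only one side is ever outside its read at a time, which is what makes
the pipes work as a context switch.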
The problem was that the pipe file descriptors weren't under the
control of the filehandle code, so it couldn't free up file
descriptors if the pipe call failed due to no descriptors being
available. This required a bit of surgery on some tt-specific code,
which seems to be working well.
30 Jun 2004
I've had precious little free time in the last few weeks, but what
there was I spent banging on humfs. It now stores file modes in the
metadata rather than leaving them on the original files. humfsify now
has a -r option to merge the metadata back into the data.  This is how
you upgrade from one metadata version to another (or change metadata
types) - revert the metadata and build the new metadata from scratch.
The major development, and what sucked up a fair bit of that free
time, was a nice offer from Intel, which I accepted. I get to spend
most of my time working on UML, so this means that it is no longer a
free-time project.  We should see UML development speed up noticeably
as a result of this.
However, that probably won't happen until the end of July. I'm taking
off next week and touristing around Iceland. Then, two weeks later,
it's KS and OLS.  There is that week in between, but I'm expecting that
to be largely spent catching up on what I missed the previous week.
So, that means that it'll be the last week of July before I really get
cranking on UML again.
I am trying to get a release out before I leave. The last patch is
noticeably less stable than the previous ones, so I'd like to get a
bunch of bug fixes in.
13 May 2004
After much banging on hostfs and humfs, I got them working well enough
to release them. It turns out that there were a bunch of bugs in
hostfs conspiring to make it completely synchronous. This is
undesirable for something like humfs, so I made it behave like the
disk-based filesystems. That exposed a pile of bugs, which I fixed, and
now humfs is noticeably faster. I also added a patch from Piotr Neuman
which makes it possible to plug new metadata types into humfs. So, it
will be able to support all the other metadata formats that people
have been suggesting, like tdb and xattrs. So, that made up the bulk
of 2.4.24-3, which I released yesterday.
In somewhat older news, there is now a x86-64 port of UML, which was
sponsored by PathScale. This is on 2.6
only right now, and it's available
as a separate patch to be applied on top of the 2.6.4 UML patch. I'm
going to be merging it into my tree bit by bit. Currently, it's not
very clean in a bunch of ways, and there will need to be some work to
make it cleanly mergeable. So, as this happens, the separate patch
will shrink, and when the port is fully merged, it will go away totally.
Now that the humfs work is out, I need to catch up with 2.4 and 2.6.
So, first up is 2.4.25, which turns out not to need any UML changes.
That will be released today. Then will be 2.4.26, which I think just
needs the cmpxchg patch which has been floating around for a while.
Then, it's on to 2.6, where I am currently two releases behind.
19 Apr 2004
It turns out that humfs was fairly buggy. I stomped out enough of
the bugs that you can now do a kernel build on it. There are also a
couple of design problems with the current metadata layout. One was
pointed out to me by Paul
Wagland; the other I actually figured out on my own.
The first is that permissions are kept on the original file. The
problem is that if it is chmod-ed 000, then even the owner can't read
it. However, root can, so this would lead to a situation in which
root inside the UML couldn't read some files that it should be able
to. The obvious solution is to have humfsify chmod all of the files
777, and move the permissions into the metadata file.
The other bug is that the current humfs metadata maintains one file
for each name in the filesystem, so for files with multiple links,
there is an independent metadata file for each link. This is wrong,
since if you change the ownership of a file through one link, the
ownerships of the other links should change as well. humfs will
currently change the ownership of the one file, but leave the others
as they were.
My current plan on fixing this is to designate one metadata file for
each file as the primary metadata holder. If the file has more than
one link, the other metadata files will be symlinks to the primary.
This means that humfsify will need to keep track of what names link to
what files so that it knows when it needs to start making symlinks,
and where they should point to. This sounds nasty, since the easiest
implementation involves keeping track of every file it deals with, but
a simple optimization is to just keep track of the files with more
than one link. This will be a small minority of the files. Some of
them will involve links from outside the hierarchy being copied, and
won't result in any metadata symlinks. The other side of this is that
the metadata file will need a link count and a deletion flag in it.
These are needed because someone might remove the name corresponding
to the primary metadata file. That file can't actually be removed
because the symlinks count on it, and we don't want to move the data
because that would require searching for every affected symlink. So,
when the file is deleted, the metadata file will have its link count
decremented, and the deletion flag set. Then, it will appear not to
exist. It will be really deleted when all of the links to it are
gone. Deleting one of the other links will just involve decrementing
the link count in the primary, and removing the symlink.
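
To make the bookkeeping concrete, here is an in-memory sketch of the
plan. It models the primary metadata files and the symlinks as Python
dictionaries; MetaStore, PrimaryMeta and their fields are made up for
the illustration and say nothing about the on-disk format.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class PrimaryMeta:
    """Primary metadata record for one file (hypothetical fields)."""
    owner: int
    links: int = 1
    deleted: bool = False   # set when the primary's own name is removed


@dataclass
class MetaStore:
    primaries: Dict[str, PrimaryMeta] = field(default_factory=dict)
    symlinks: Dict[str, str] = field(default_factory=dict)  # name -> primary

    def exists(self, name: str) -> bool:
        if name in self.symlinks:
            return True
        meta = self.primaries.get(name)
        return meta is not None and not meta.deleted

    def _primary(self, name: str) -> str:
        if not self.exists(name):
            raise FileNotFoundError(name)
        return self.symlinks.get(name, name)

    def create(self, name: str, owner: int) -> None:
        self.primaries[name] = PrimaryMeta(owner)

    def link(self, existing: str, new: str) -> None:
        primary = self._primary(existing)
        self.primaries[primary].links += 1
        self.symlinks[new] = primary

    def chown(self, name: str, owner: int) -> None:
        # One record per file, so every link sees the new owner.
        self.primaries[self._primary(name)].owner = owner

    def owner(self, name: str) -> int:
        return self.primaries[self._primary(name)].owner

    def unlink(self, name: str) -> None:
        primary = self._primary(name)
        if name in self.symlinks:
            del self.symlinks[name]
        else:
            # Removing the primary's name: hide it, but keep the record
            # because the remaining symlinks still point at it.
            self.primaries[primary].deleted = True
        meta = self.primaries[primary]
        meta.links -= 1
        if meta.links == 0:
            del self.primaries[primary]   # really deleted now
```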
Also in the not-too-distant future is O_DIRECT and mmap support for
humfs. Both will eliminate the double-caching problem that prompted
this filesystem in the first place. O_DIRECT is useful when the data
isn't shared with any other UMLs, which will be the case until humfs
has COW support. It reads data from the disk directly into the buffer
provided by the process without going into the kernel's page cache.
The other attraction of O_DIRECT is that it is supported by the
current 2.6 AIO. This means that UML can have many I/O requests in
flight on a 2.6 host using O_DIRECT, which should help its I/O
performance.
mmap support will do the same for shared data. In this case, the data
is in the host's page cache, and it is mapped directly into the UML
address spaces. This will need COW support in humfs before it becomes
useful. It will also need AIO support for normal, cached, I/O before
UML's use of AIO can come into play.
I'm planning on releasing a new UML patch in the near future. It will
have a fixed humfs, plus maybe a restructuring on humfs to allow
multiple forms of metadata to be supported. There will also be a
utilities release for a rewritten version of humfsify.
7 Apr 2004
It's been a while since the last entry. Since then, 2.6.4 came out,
and I released a 2.6.4 UML. I also released 2.4.24-2 today, which
contains some major changes.
I became convinced (by Al Viro) that ubd-mmap wasn't viable. It's
vulnerable to the filesystem making changes to data that it intends
never to reach the disk. With buffers mmapped from the host, those
changes automatically reach the underlying device.  He suggested a
filesystem instead, which gives me the control that I need in order to
do mmap correctly.
So, humfs is out. humfs stands (roughly) for "host uid mapping
filesystem". This is the main difference between it and hostfs. The
main problem with hostfs is that any files created are owned by the
uid that is running UML. This is a problem when there are multiple
users inside the UML creating files. The files want to preserve their
ownerships, but can't.
humfs separates the ownerships from the actual file. A humfs mount
has a root with two subdirectories, "data" and "metadata". "data"
contains the actual files. "metadata" mirrors "data", except that the
files under it contain just ownership information as their contents.
humfs file accesses cause the file contents to be retrieved from under
the "data" directory, and ownerships to be retrieved from under
"metadata".
This has a number of beneficial side-effects:
As already mentioned, mmap can be done correctly. This will allow
files to be mmapped from the host rather than being copied, allowing
UMLs to share the host's page cache. In the tests I've done with
ubd-mmap, this dramatically reduces the host memory usage of a UML.
The UML's files are visible on the host. This makes management
easier, since passwords can be more easily reset and the like. There
are some privacy concerns here in hosting environments. Some people
don't like it to be too easy for the host admin to poke around their
files. So, this will have to be balanced against the management
conveniences.
Space is more flexibly assigned. There is a "superblock" file at the
root which says how much space this mount is allowed to have. It's
read at mount time currently. I think that with dnotify, it will be
possible for humfs to notice changes to it and react immediately. So,
on-the-fly disk allocation changes will be possible.
File-level COWing will be possible. This will allow UMLs to share
filesystems, with modified files being copied and made private. A
special case of this is booting from the host's root filesystem, with
a tiny COW hierarchy containing the files which need to be different.
Block-level COWing within a file will also be possible. Some people
need block devices underneath their UML filesystems, and here, they
will want the usual block-level COWing. I think it will be possible
to do this while retaining the memory usage advantages of mmap by
having the disk image in a humfs filesystem and loop-mounting that as
the root filesystem within the UML. I was concerned about the lack of
partitioning support in the loopback device, but I recently saw a
patch on LKML which fixes that.
It will even be possible for UMLs to share a writeable filesystem,
with communication allowing one UML to change a file and cause the
others to invalidate their cached copies of it. I'm planning on
figuring out how to make it possible to quickly bring up UMLs in
response to a load spike of some sort, like a slashdotting on a web
server. Coherent filesystem sharing could play a role here. With the
filesystem on a SAN of some sort, the new UMLs could serve the data
from a common source, and write things, like log files, to a common
place, eliminating the need to collect them from the UMLs as they are
shut down when the spike passes.
As a side-effect of this, I put an abstract interface between
hostfs_kern and hostfs_user which allows other userspace modules to be
plugged into hostfs_kern. humfs is obviously the first new user of
this. Now that there's a pluggable interface, it's not too hard to
make other host resources look like UML filesystems. Some examples:
sqlfs - mount a SQL database as a UML filesystem. There are lots of
choices for mapping part of the database onto Linux directories and
files. Tables could be top-level directories, rows could be
subdirectories, and columns could be files within those
subdirectories. Or rows could be files. Or both, I think, depending
on whether you cat the row or cd into it. Plus, there'd be a /query
directory in there somewhere which would let you treat a query as a
directory, and the search result would appear under it. Consider
cd /sql/query/'select * from people where first-name = "Bob"'
Obviously, any other sort of database could get the same treatment.
Other people have mentioned ldap, for example. An interesting
variation on this would be to dump a Linux filesystem into a database,
and boot from it. This would give you the standard Linux file
semantics, but would also let you do database-specific searches on the
filesystem. For example,
cat /sql/"select filename from * where uid = 0 and setuid = 1"
A possible use of this would be for an application which needs a
filesystem to store stuff, but whose needs are otherwise badly served
by any existing filesystems. Maybe it needs a few huge files, or a
huge number of tiny ones, or huge directories, and current filesystems
don't supply the searching that's needed. You'd put those files in a
database, mount the database inside UML as a filesystem, and provide a
query interface to directly do searches on the database.
difffs - mount a directory diff as a filesystem. The two host
directories to be diffed would be arguments to the mount command. The
files in the resulting filesystem would be those that were different,
and the contents would be the diff of the file.
googlefs - mount Google as a filesystem.
cd /google/"user-mode linux"
could produce a directory in which Google's results for the search
"user-mode linux" are somehow represented.
Some of these examples are far-fetched, but I mention them to show the
range of possibilities there are with this. I'm particularly
interested in the filesystem-in-a-database. I imagine that there are
new things that you can do when you can put an arbitrary database
underneath your filesystem, and be able to query the database in
whatever way it allows.
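To make the sqlfs mapping a bit more concrete, here is a rough sketch of
the "tables are directories, rows are subdirectories, columns are files"
idea. Everything in it is an assumption made up for illustration: the
function name path_to_query, the three-component path layout, and the
existence of a numeric "id" column that identifies a row. It is not
code from hostfs or humfs.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: translate "/table/row/column" into a query.
 * Assumes every table has a numeric "id" column naming the row.
 * Returns 0 on success, -1 if the path is malformed or the query
 * does not fit in out. */
static int path_to_query(const char *path, char *out, size_t len)
{
	char table[64], row[64], column[64];
	int end = 0, n;

	/* Only identifier characters are accepted, so nothing from the
	 * path can smuggle extra SQL into the query. */
	if (sscanf(path, "/%63[A-Za-z0-9_]/%63[0-9]/%63[A-Za-z0-9_]%n",
		   table, row, column, &end) != 3)
		return -1;
	if (path[end] != '\0')
		return -1;	/* trailing components are not handled */

	n = snprintf(out, len, "SELECT %s FROM %s WHERE id = %s",
		     column, table, row);
	return (n < 0 || (size_t) n >= len) ? -1 : 0;
}
```

With this layout, reading /people/42/first_name would run a query for
that one column of that one row. The /query directory described above
would need a different translation, since there the path component is
the query itself.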
I also added aio support to the os interface. The existing aio was in
the ubd driver, and consisted of a separate thread which did
synchronous IO, one request at a time. This is present in the new
aio. What is new is support for the 2.6 aio interfaces. This allows
any number of IO requests to be started, and for one thread to handle
their completions. When this is debugged, this should help UML's IO
performance.
With mmap IO in sight, the next step will be to introduce the active
UML memory management that I've been working towards for a while. In
conjunction with the /dev/anon patch, mmap support will allow pages to
be freed from a UML to the host, and conversely, be plugged into a UML
to increase its available memory. With a daemon on the host
monitoring the memory usage of the UMLs and the host, it will be
possible to use memory more intensively by giving it to UMLs that need
it and taking it away from those that don't. This should increase the
density of UML servers since memory is often the bottleneck.
15 Feb 2004
As of today, I'm caught up on both the 2.4 and 2.6 fronts. I released
the 2.4.24 patch today. It was trivial, as the 2.4.24 patch was
trivial - just apply the patch and rebuild. I also had a bunch of
accumulated bug fixes, including some time problems that I fixed in
the last few days. These are the ones which people have noticed the
most. They included
process start times as shown by ps drifting away from clock time
/proc showing a modification time of 1970
at reportedly not working
The other user-visible fixes were a couple of mconsole bugs.
I also caught up with 2.6 - more completely than I had intended. I
pulled Linus' tree, not noticing that it contained 2.6.3-rc2 rather than
just 2.6.2. So, I ended up updating UML to that rather than trying to
back down.
I've also been busy on usermodelinux.org. There were a number of
things from UML users that the wider
community needs to know about, from new filesystems and installation
HOWTOs, to an automated UML network setup tool, to a neat Knoppix
image which lets you download and boot UML from a web browser. I also
added some new FAQ entries and UML hosting providers. Hopefully the
FAQ entries will cut down on duplicate list traffic (and frustrated
users who just quit rather than ask a list).
16 Jan 2004
I forgot to mention that I also released my /dev/anon patch. This is
a special device that UML can use to map in its physical memory which
has the semantics needed to free memory back to the host when UML
isn't using it any more. In some quick testing, it reduced UML memory
usage on the host by about a quarter. It's used in conjunction with
ubd-mmap, but it's also usable without it. ubd-mmap still has some
bugs that need rooting out, including at least one file corruption
bug, so it shouldn't be used for production yet. See
this page for more details.
15 Jan 2004
I was somewhat behind in getting 2.6.0 working and released. But,
that's done now, and I also caught up with 2.6.1 reasonably quickly.
2.6.1 was causing process segfaults with the 2.6.0 patch, and the
reason was somewhat interesting. I had a couple of inline functions
which looked at the frame pointer and stack pointer. In order for
these to work correctly, they really needed to be inlined because they
needed the frame and stack pointers of the caller. Something changed
in how gcc is invoked which caused it to stop inlining these
functions. So, the fix was to turn them into macros, which gives the
compiler no choice about inlining them. This may also explain some of
the weird behavior other people have been seeing, and which I've been
attributing to the lack of get_thread_area and set_thread_area.
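As a minimal sketch of why the macro form is safer, assume GCC (or a
compatible compiler) and its __builtin_frame_address builtin. The names
CALLER_FRAME and frame_via_function are invented for this example and
are not the ones in the UML source.

```c
/* Illustrative only: not the actual UML code. */

/* A macro is expanded textually in the caller, so the builtin is
 * evaluated in the caller's own stack frame no matter what the
 * compiler decides about inlining. */
#define CALLER_FRAME() __builtin_frame_address(0)

/* If the compiler declines to inline a helper like this one, it
 * reports the frame of the helper itself rather than the caller's.
 * The noinline attribute forces that case for demonstration. */
__attribute__((noinline)) static void *frame_via_function(void)
{
	return __builtin_frame_address(0);
}
```

A caller that uses CALLER_FRAME() always gets its own frame address,
while frame_via_function() only gives the caller's frame if the
compiler happens to inline it, which is exactly the thing that can
change when compiler flags change.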
Andrew included my 2.6.1 patch in 2.6.1-mm4. Hopefully it will then
make it into 2.6.2. There will still need to be a separate UML patch,
but it will be much smaller. There are some difficulties in initcall
ordering between the UML console and serial drivers, and the generic
tty driver which require some kludging. So, at least that will need
to be separate for a while until something cleaner comes up.
I also made the first utilities release since the fall. A number of
small changes and fixes had accumulated. There were two sizable
changes - bridging support in the jail.pl script and support for
dumping tty log output into a SQL database rather than to the
terminal. These both stem from my UML honeypot work at Dartmouth.
15 Dec 2003
Richard announced his implementation of swsusp for UML a couple of
days ago. Reaction is less than I expected, but still positive, as I
expected. It'll be more widely used once it is updated to the latest
UML, and doesn't require a 200M download in order.
I've been hacking away on /dev/anon recently. This is a new driver
that supports mmap/munmap, and which releases memory that is no longer
mapped. The purpose is to make ubd-mmap useful by freeing UML physical
memory which has been over-mapped with pages from a disk. With a
simple test of booting my Debian testing image to a login prompt, this
consumes about 25% less memory than using the usual /tmp file.
Accordingly, I can boot about 25% more UMLs before the host starts
swapping. When I release it for general consumption, you can all
thank memset for the reduced memory usage (and swapping) on your hosts.
I'm also releasing 2.4.22-7. This is a big patch, with a bunch of new
stuff in it. The big items are partial support for skas SMP and
highmem, and Sapan's real-time clock patch. There are lots of smaller
changes including bug fixes and a bunch of code cleanup and
restructuring.
4 Dec 2003
I spent a week in Japan, courtesy of Richard Potter at the University
of Tokyo. He and others in Japan are doing some interesting things
with UML and virtual machines. At dinner on the first night in Japan,
I discovered that Richard had implemented swsusp for UML. This is
obviously a welcome surprise, given that swsusp is probably the most
asked-for feature for UML. The downside is that he is working with an
old UML (2.4.18). Updating it to the latest shouldn't be too hard,
though.
He also had some other interesting ideas, including the idea of a
UML-wide fork, which would be a cloning of the entire state of a UML
into a new UML, with shared resources being COW-ed as necessary. This
is exactly analogous to a process fork, except you get a whole new UML
rather than a new process. I'm not sure what I would use this for if
I had it, but I think it's worth doing, just so the rest of the world
can figure out how to use it.
I also visited the University of Tsukuba to give a talk to the group
led by Prof. Kato, and to listen to a set of presentations on their
work. My talk was a stripped down version of my UMich talk. Their
work included
a "Software Pot", which is a controlled environment for executing
arbitrary, untrusted applications with the ability to import resources
from the host
a very minimal virtual machine environment which can boot slightly
modified Linux and *BSD kernels
A Knoppix environment which loaded off the network, rather than a CD,
and booted inside a UML.
I repeated this talk the next day at the University of Tokyo (and a
number of the Tsukuba people came to hear it again), plus I gave a
presentation to the Yokohama LUG.
10 Nov 2003
After much delay and items accumulating in my todo list, I released
2.4.22-6 today. This contains a large number of fixes and cleanups,
mostly sent in by users over the last few months. One notable bug
which I think is fixed is the "Process nnnnn exited with signal 11"
that Oleg has been seeing. It was a longstanding, stupid bug, and I'm
amazed it hadn't been seen sooner.
6 Oct 2003
Last week, I visited the University of Michigan at the invitation of
Peter Chen, who has a research group doing interesting things with
virtual machines. The one thing that I'm particularly interested in
is a tool called ReVirt, which can log everything happening in a VM
and replay the log so that the replay causes exactly the same
instruction stream to be executed. This would make it possible for
someone with a bug that I can't reproduce to send me the ReVirt log,
and I would replay it (over and over if necessary) until I tracked
down the problem.
There was another tool whose name I forgot (written by Sam King)
whose job it is to analyze these logs. The work of this
group is focussed on security, so ReVirt is used to record attacks and
exploits, while the log analyzer pulls out the essential details of
those attacks and generates a nice picture of them. The analyzer
works by defining a small set of basic objects, such as files and
processes, and a set of actions by which they can affect each other (a
process creating a file, or a process being created by execing a file,
for example), and taking the endpoint of an attack (such as a running
backdoor process, or a modified passwd file), and backtracking through
the set of events and objects involved in creating that endpoint until
it identifies how the exploit entered the system. Then it generates a
little diagram which serves as a picture of the exploit.
The use of ReVirt in UML development is obvious. I'm wondering
whether the analyzer could be repurposed with a different set of basic
objects and actions in order to analyze kernel bugs. For example,
when chasing a deadlock, we could make locks be the basic object and
lock and unlock be the basic actions, and get a diagram showing where
the lock in question was taken and released. Hopefully, there would
be a glaring mismatch which would identify the bug at a glance.
These two tools were originally done using FAUMachine (formerly
UMLinux). At the time, there were some
reasons that FAUMachine was preferable to UML (UML's use of helper
threads complicated things) but the group is now porting them to UML.
UML is now seen as preferable because of its greater stability and user base.
FAUMachine's goals are apparently less of a good match to their
research than UML's goals. Plus, the threads issues need to be dealt
with at some point anyway.
I also spent a day at CITI, whose research
doesn't directly involve virtual machines. They are into research and
development which can benefit from using them as their development
platform. At this point, they are heavily into NFS V4 development.
They are producing the reference implementation at the behest of Sun
and Network Appliances. They are also doing research on the
replication and load balancing aspects of the NFS V4 protocol. In
some of these projects, a virtual UML network would make an ideal
platform, since it would eliminate a lot of logistics in setting up
and running a physical test network.
18 Sep 2003
I released 2.4.22-5 today. I've been doing lots of code cleanup
lately. That was pretty much all of 2.4.22-4 since Steve Schmidtke
sent me a large cleanup patch. A lot of -5 is dealing with the
aftermath of that. There were a few bugs in Steve's original patch,
plus I merged a couple of chunks badly. In addition, there were some
more fixes from him and BlaisorBlade merged. I also fixed a tt mode
bug which caused signals to be disabled in userspace.
11 Sep 2003
HP sent me a nice IA64 machine, which showed up yesterday. Dual
processor, 10G memory, 70G disk. It's a very fast machine, at
least compared to my existing hardware, which is all ~3 years old, so
I don't have a feel for how fast PCs are currently. I spent the
afternoon turning it into my main workstation. The current holdup is
that pppd doesn't work. It gets EFAULT whenever it does the PPP
get-unit-number ioctl.
I guess I have no excuse for not at least thinking about porting UML
to IA64. I have little free time, but I'll do the port a bit at a
time in my spare moments.
8 Sep 2003
I released 2.4.22-2 today. It contains a bunch of bug fixes and a new
mconsole command. New to mconsole is the 'proc' command. This was
triggered by a patch I got from Steve Benson which implemented 'mem'
and 'load' commands, which sent back something that resembled the
contents of the UML's /proc/meminfo and /proc/loadavg. The patch
implemented them by hand, which I didn't like, plus I think it is
likely that it would have triggered requests for other specific stuff
from /proc. So, I just added a general 'proc' command which will read
from any file in the UML's /proc.
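The shape of a general 'proc' command can be sketched in ordinary
userspace C. This is only an illustration of the idea of taking a
relative name and returning the file's contents. The real command runs
inside the UML kernel and would not use stdio. The function name
read_under and the refusal of ".." in names are assumptions added
here, not something taken from the mconsole code.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: read the file `name`, relative to `root`
 * (for example "/proc"), into buf as a NUL-terminated string.
 * Returns the number of bytes read, or -1 on error or if the name
 * looks like an attempt to climb out of root. */
static long read_under(const char *root, const char *name,
		       char *buf, size_t len)
{
	char path[4096];
	FILE *f;
	size_t n;
	int r;

	/* Conservative check: reject any name containing "..". */
	if (len == 0 || strstr(name, "..") != NULL)
		return -1;

	r = snprintf(path, sizeof(path), "%s/%s", root, name);
	if (r < 0 || (size_t) r >= sizeof(path))
		return -1;

	f = fopen(path, "r");
	if (f == NULL)
		return -1;
	n = fread(buf, 1, len - 1, f);
	buf[n] = '\0';
	fclose(f);
	return (long) n;
}
```

With something like this, 'mem' and 'load' stop being special cases:
they are just requests for "meminfo" and "loadavg" under the same root.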
I fixed a few nasty-looking bugs such as a crash when running a UML
that had been linked against a libc with the new thread support and a
crash caused by a process unexpectedly segfaulting. It also turns out
that ltrace never worked in skas mode. I fixed all of these, plus a
couple more minor things.
3 Sep 2003
I've made a bunch of UML releases in the last few days. I added mmap
support to the ubd driver, which allows it to use mmap instead of read
and write. This trades data copying for TLB operations, gaining some
CPU and cache improvements from not having to copy all data coming in
from disk in exchange for the added expense of TLB flushing. An
upcoming benefit, and the real reason for this, is that this will
enable the UML and host to share page cache rather than having
separate copies of the data read in by the UML, reducing the host
memory usage by having the memory overlaid by mmap freed on the host.
I realized fairly recently that the host mechanisms that can be used
to implement UML physical memory won't free memory when it's
unmapped. The reason is that the memory will be considered dirty by
the host, even though the UML has no use for whatever is in it, so it
will be kept, and possibly swapped. What I need, and what isn't there
yet, is something that will just free the memory when UML unmaps it.
So, I'm going to be hacking on a little host driver specifically to
implement UML memory.
This required a rework of the UML low-level VM layer, including the
ability for physical memory to fault and for those faults to be
handled. This is something needed by a clustering scheme for UML that
I thought up a long time ago and never did anything about. That is
taking an SMP UML and spreading the virtual processors and physical
memory over multiple hosts. Any physical page would be resident on
only one host and accessible only by the processor(s) running on that
host. If another processor tried to access it, the access would
fault, and a low-level handler would figure out where that page was
located and arrange for the contents to be copied and the page to be
mapped on the new host and unmapped on the old one. This would fairly
easily implement a SSI cluster with UML. The downside is that it
would be horrendously slow because of all the faulting. The
attraction of it to me is that, even in its slow form, it would be a
neat capability and fun to play with. Further, since this starts as a
fully-functional SSI cluster, it would be fairly easy to start working
on making the inter-node communication more sane. It would be done
incrementally, with a functioning (and hopefully faster) cluster at
each stage. It wouldn't be necessary to implement (say) 90% of it
before anything works at all.
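The page-ownership rule behind this clustering idea is simple enough to
model in a few lines. The sketch below is a toy simulation, not UML
code: the struct, the fixed page count, and the function name
access_page are all invented, and the "migration" is just an ownership
change where a real implementation would copy the page contents and
remap it.

```c
#include <stddef.h>

#define NPAGES 8	/* arbitrary size for the toy model */

struct cluster {
	int owner[NPAGES];	/* which node each physical page lives on */
	unsigned long faults;	/* how many accesses had to migrate a page */
};

/* Model an access to `page` by a processor running on `node`.
 * Returns 0 if the page was already local, 1 if the access faulted
 * and the page had to be moved to the accessing node. */
static int access_page(struct cluster *c, size_t page, int node)
{
	if (c->owner[page] == node)
		return 0;

	/* Fault path: a real handler would locate the page, copy its
	 * contents, map it on the new host and unmap it on the old. */
	c->owner[page] = node;
	c->faults++;
	return 1;
}
```

Two nodes alternately touching the same page fault on every access in
this model, which is the "horrendously slow" case, and is also where
smarter inter-node communication would pay off first.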
I also implemented the COW V3 format, which fixes a number of problems
that had cropped up with V2. The most painful one was the rounding
bug which has been killing UMLs for some time. David has been
maintaining a patch for it, but it apparently is not a 100% fix. The
V3 fix should be. The various sections of the COW file are now nicely
aligned. This will allow COW files to be stored on devices with
restrictive alignments, such as /dev/raw devices. ubd mmap also
requires this because it needs the data to be page aligned.
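For illustration, this is the kind of rounding that section alignment
involves. The layout below (header, then a one-bit-per-block bitmap,
then data) and all of the names are assumptions made for the sketch.
It is not the actual V3 on-disk format, and it does not claim to
reproduce the V2 rounding bug.

```c
/* Round x up to the next multiple of align (a power of two). */
static unsigned long long round_up(unsigned long long x,
				   unsigned long long align)
{
	return (x + align - 1) & ~(align - 1);
}

/* Hypothetical layout of a COW file, offsets in bytes. */
struct cow_layout {
	unsigned long long bitmap_offset;
	unsigned long long bitmap_len;
	unsigned long long data_offset;
};

/* Place the bitmap and data sections so that each starts on an
 * `align` boundary, e.g. the page size for mmap. */
static struct cow_layout cow_compute_layout(unsigned long long header_len,
					    unsigned long long dev_size,
					    unsigned long long block_size,
					    unsigned long long align)
{
	struct cow_layout l;
	/* A partial final block still needs a bit, so round up. */
	unsigned long long blocks = (dev_size + block_size - 1) / block_size;

	l.bitmap_offset = round_up(header_len, align);
	l.bitmap_len = (blocks + 7) / 8;
	l.data_offset = round_up(l.bitmap_offset + l.bitmap_len, align);
	return l;
}
```

The point of rounding every section start is that the data area can
then be handed straight to mmap or to a device that insists on aligned
IO, without any per-request offset fixups.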
After this, I decided to catch up to Marcelo, and updated to 2.4.21
and 2.4.22. These were both easy. I also included a couple of small bug
fixes in the 2.4.22 release.
17 Aug 2003
I spent the afternoon yesterday tracking down the task_struct leak in
2.6. Oleg had tracked the leak down to where it was happening by
identifying the put_task_struct call that should have been called but
wasn't. He got that right, but his analysis of the leak was wrong.
So, I finally tracked it down to what appears to be a bug in the
scheduler combined with my low-level switch_to() being slightly
different from the i386 switch_to. context_switch calls the arch
switch_to:
switch_to(prev, next, prev);
return prev;
and this, contrary to appearances, is supposed to reassign prev, which
then gets returned to schedule(). In the i386 scheduler, I don't see
prev getting reassigned. It does branch to __switch_to, which returns
the correct value for prev. This is left in %eax, and accidentally
becomes the return value of context_switch because that's the last
thing it does, so nothing munges %eax afterwards.
So, the UML bug was that my switch_to wasn't assigning prev. Once
this was fixed, the task_struct leak disappeared.
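Here is a small userspace model of the contract that the three-argument
switch_to is expected to honor. It is a sketch, not the UML fix itself:
the struct, the really_ran_last variable, and the body of __switch_to
are stand-ins invented for the example. The only point is that the
macro assigns through its third argument.

```c
struct task {
	int pid;
};

/* Stand-in for what the low-level switch code knows: the task that
 * was actually running just before control came back here. In a real
 * kernel this differs from the caller's stale local `prev`. */
static struct task *really_ran_last;

static struct task *__switch_to(struct task *prev, struct task *next)
{
	(void) prev;
	(void) next;
	return really_ran_last;
}

/* The third argument is an lvalue that the macro must assign. */
#define switch_to(prev, next, last) \
	do { (last) = __switch_to((prev), (next)); } while (0)

static struct task *context_switch(struct task *prev, struct task *next)
{
	switch_to(prev, next, prev);
	return prev;	/* now the task we really switched away from */
}
```

If the macro ignored its third argument, context_switch would return
the stale prev, and the scheduler would drop its reference on the wrong
task, which is how a task_struct can end up never being freed.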
I had a couple of brainstorms yesterday, too. I was looking over a
patch that someone sent me which made load and memory statistics
available through mconsole. I don't like it because it duplicates a
bunch of code from elsewhere in the kernel into UML. It occurred to
me that it would be easy to make all of /proc available through
mconsole in a much cleaner way.
The other thought relates to UML's low-level memory management. It's
reasonably easy to pull memory out of a UML. It can just allocate
pages, and do nothing with them. Those pages can be added back later
just by freeing them, making them available to the rest of the UML.
So, a UML's host memory usage can be controlled by having it pull
pages out of service and freeing them back to the host. This would
make a good way of taking pages from an idle UML and giving them to a
UML that's busy.
So, combining these two ideas, we have a daemon on the host which is
monitoring the host memory usage and the memory usage of the UMLs via
the mconsole /proc interface. When the host starts swapping, this
daemon can figure out what UMLs are occupying memory, but not using
it, and which UMLs need memory. It can then take memory from idle
UMLs by telling them to allocate it and free it to the host, and give it to
others by allowing them to take back memory that they had previously
released or by giving them extra memory.
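The policy half of such a daemon could be as simple as the following
sketch. The struct, the function names, and the single reserve_kb
threshold are all assumptions for illustration. The numbers would
presumably come from each UML's /proc/meminfo, and a real daemon would
need a less naive policy.

```c
#include <stddef.h>

/* Per-UML numbers the host daemon might collect. Names are made up. */
struct uml_mem {
	unsigned long total_kb;
	unsigned long free_kb;
};

/* Index of the UML with the most free memory above reserve_kb,
 * or -1 if no UML can spare anything. */
static int pick_donor(const struct uml_mem *umls, size_t n,
		      unsigned long reserve_kb)
{
	int best = -1;
	size_t i;

	for (i = 0; i < n; i++) {
		if (umls[i].free_kb <= reserve_kb)
			continue;
		if (best < 0 || umls[i].free_kb > umls[best].free_kb)
			best = (int) i;
	}
	return best;
}

/* Index of the UML with the least free memory below reserve_kb,
 * or -1 if nobody is short. */
static int pick_recipient(const struct uml_mem *umls, size_t n,
			  unsigned long reserve_kb)
{
	int best = -1;
	size_t i;

	for (i = 0; i < n; i++) {
		if (umls[i].free_kb >= reserve_kb)
			continue;
		if (best < 0 || umls[i].free_kb < umls[best].free_kb)
			best = (int) i;
	}
	return best;
}
```

The daemon would then ask the donor to allocate pages and hand them
back to the host, and let the recipient take memory in.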
However, freeing memory back to the host requires that the memory be
released from the file that it is mmapped from. In effect, this means
being able to create a hole in the file, and there are currently no
mechanisms for doing this. There have been discussions in the past
about sys_punch or sys_fclear, since there are multiple uses for this,
but it has never been done. One major reason is that this resembles
truncate, except that this doesn't change the length of the file, and
truncate is famous for the number of complicated and subtle races that
it's involved with. Al Viro has given sys_punch a chilly reception
for this reason. So, until there's a way of releasing file-backed
memory back to the system, this plan is going to stay unimplementable.
15 Aug 2003
I'm doing a somewhat better job at keeping up with Linus these days.
I have test3 working, and I'm in the process of updating BK pools and
generating the patch. I merged in all outstanding changes from my 2.4
tree while I was at it.
I released another 2.4 patch with my accumulated changes. These were
mostly small bug fixes. I'm continuing to bang away at my 2.4 tree,
fixing some compilation warnings today.
On the conference front, I submitted an abstract to Linux-Kongress. It's been a
couple of years since I was there, and a lot has happened to UML in
the meantime. So, the abstract describes the major things that have
happened to UML in the last year or two, plus a few of the things I'm
planning for its future.
29 Jul 2003
I'm caught up with Linus now. I released the 2.6.0-test1 UML from OLS
last week. I jumped ahead instead of going release by release on the
advice of Oleg Drokin, who had already done it and found no problems.
The patch is broken, unfortunately. I tested from my BK repo, and
generated the patch from it, and didn't bother testing the patch. It
turns out I hadn't fully checked all the changes back in. I'm going
to release 2.6.0-test2 today with a working (and tested) patch.
I was at the Kernel Summit and OLS in Ottawa last week. A good time
was had by all, and all were pretty tired out by the end. The most
interesting aspect of it from my humble point of view was the interest
in UML. There was a panel of people from large companies who talked
about what large businesses wanted from Linux. Two of them (HP and
Merrill Lynch) mentioned UML.
Bdale Garbee (the HP guy) was seeing demand within HP for UML on
IA64. I talked to him later, and he said that there was some
possibility that they would do UML/IA64.
Robert Lefkowitz (Merrill) mentioned two things about UML. The first
is that IT infrastructures in the financial industry are somewhat
fragmented due to "Chinese walls" between divisions that are required
by the SEC. Large financial companies have conflicts of interest due
to the number of different things that they do. One division might be
doing business with a company, and want to keep that company happy as
a customer. Another division might be making stock recommendations to
investors and might be telling them that this company sucks and to
sell its stock. To limit the ability of one division to influence
the other to the detriment of its customers, the SEC requires that the
divisions be structured in such a way that they don't communicate with
each other too much. These structures are called Chinese walls, and
they apparently extend to the IT infrastructure. You can't host
functions belonging to different divisions on the same systems.
Except that virtual machines on the same host are OK, which is where
UML comes in. So, having multiple functions on the same host is
legally OK, as long as they are separated by being in different
virtual machines.
The other thing he mentioned was an attempt to package a VPN client
for their employees inside UML for packaging and support convenience.
The attraction of UML is that it is a single known environment, in
contrast to the multitudes of OS versions run by the employees.
Reducing that to one environment makes installation and support much
easier. He complained that UML didn't boot off readonly media, and
said that he had told the UML maintainers about it. At this, I stuck
my little hand in the air and said, "Uhhh, I don't remember anything
about this". Talking to him later, he said he thought the person who
had done the complaining did it in such a way that it didn't appear to
be a complaint, just a "does this work?" sort of a question. My
opinion, and that of others, is that this person complained to Red
Hat.
Moving on to OLS, there was Werner Almesberger's umlsim talk, in which
he described a network simulator he built on top of UML. Russell
Coker was planning on letting people log in to SELinux UMLs during his
tutorial, but that didn't come off because of a non-UML-related
catastrophe he suffered beforehand. The clusterfs person talked about
how they're using UML for debugging. The after-dinner speaker at the
IBM dinner talked about grid computing and mentioned UML as a
possible technology that it could be based on. So, even though there
was no official UML content at either KS or OLS, there were plenty of
people talking about it.
17 Jul 2003
After much delay, and a bunch of mangled BK repositories, I decided to
get my act together with 2.5 again. I somehow ended up with all my
repos containing essentially the same stuff, with minor variations.
So, I merged them into a single tree, produced a diff, split it out,
and applied the pieces by hand to the different repos. So, my BK
situation is sane again, and I've updated UML to 2.5.70. I'll be
catching up to Linus again, and when I do, I'll see if I can get him
to take UML updates again.
22 May 2003
I released the 2.4.20-5 patch today. There's not too much in it. I
tracked down a memory leak which would eventually exhaust the /tmp
filesystem if some skas UMLs were rebooted enough times. This was due
to some /proc/mm descriptors not being closed across exec, causing
them to continue to hold down mmapped disk space. I also added chroot
and append mode options to hostfs. These are helpful for making
hostfs somewhat more secure. The chroot option lets you confine
hostfs mounts to a specified directory tree on the host and the append
option disallows any destruction of data, whether it be truncating
or deleting files.
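For the descriptor leak mentioned above, one common way to keep a
descriptor from surviving an exec is to mark it close-on-exec. This is
a generic POSIX sketch, and not necessarily how the fix was done in the
patch. The helper name set_cloexec is made up.

```c
#include <fcntl.h>

/* Mark fd close-on-exec so that it is closed automatically when the
 * process execs. Returns 0 on success, -1 on error. */
static int set_cloexec(int fd)
{
	int flags = fcntl(fd, F_GETFD);

	if (flags < 0)
		return -1;
	/* Preserve any other descriptor flags that are already set. */
	return fcntl(fd, F_SETFD, flags | FD_CLOEXEC) < 0 ? -1 : 0;
}
```

The alternative is to close such descriptors by hand just before the
exec, which works but is easy to miss on one code path.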
In 2.5 news, I released the 2.5.69 patch a few days ago, and asked
Linus to pull the changes. He hasn't pulled UML in ages, so I'm not
expecting much from this. At least, it makes it clear that the old
crufty UML in the official 2.5 tree isn't my fault since I've been
keeping up with 2.5 and releasing patches on a regular basis.
12 May 2003
I just got back from Colombia (the one in South America, not the one
in New York), where I was attending the II Congreso Internacional De
Software Libre Colombia in Manizales. I was invited there to give a
couple of talks. The first, on the afternoon of the second day, was
on UML, and I talked generally about UML. It was a high-level
overview of UML and didn't get into the code at all. The second talk
was at 8 AM on the last day, and was about kernel hacking in general.
I gave an overview of how kernel hacking works, how to participate in
it, and a quick tour of the kernel source tree, with some recommended
reading in the code.
The second talk went better than the first. We (me and Gustavo, who
was translating for me) translated most of the first talk's slides
into Spanish. This was a mistake. It took time away from actually
working on the talk's substance. My Spanish is weak enough that I
looked at a couple of the slides and couldn't tell what they said, so
I got lost in my own talk a couple of times. The slides for the
kernel hacking talk the next morning were entirely in English. This
gave me more time to work on it, plus I was more comfortable during
the talk because I could understand the slides.
My talks aside, the conference was a great deal of fun. The
organizers (who included the Dean of the Engineering Faculty of the
Universidad de Manizales, where it was held) were amazingly concerned
about the comfort and happiness of the international speakers.
Among the other speakers was a Colombian Congressman who apparently
was a member of the terrorist group M-19 that eventually made a deal
with the government and became a legitimate political party. One of
the others was a Peruvian Congressman by the name of Edgar
Villanueva. You might recognize him as the author of a devastating
rebuttal to some Microsoft FUD that was circulating around the
Peruvian Congress last year.
The main room
- the university used to be a convent, and it's obvious in
this picture. This room was obviously a church at one point. I
figure it held ~750 people, and it was full for the main sessions. It
was slightly less full for my 8 AM, final day talk on kernel
hacking... Go figure.
Me and Gustavo
- Gustavo is an English Professor at the university who was
acting as my translator. We are in the little lab set aside for
decompression, talk preparation, and network access. Probably we are
working on translating the slides for my first talk.
My first slide
- Anyone who has attended any of my talks will notice some
familiarity here, except for the different language.
Me and Gustavo on my second talk
- We sat next to my laptop and projector on the floor rather
than on the stage. So we are here, each with a microphone, doing the talk.
The conference was a lot more political than I'm used to. I usually
go to technical conferences with primarily technical content. This
one had a number of talks which seemed to be concerned with the
politics of Open Source, politics of getting Open Source software
accepted in government, business, academia, etc. As far as I can
remember, mine were the most technical talks, and may have been the
only technical ones. At the end, everyone seemed to consider that it
had been a great success.
24 Apr 2003
I got around to releasing 2.4.20-4 today. There were a good number of
accumulated fixes, including the RH9 fix, a couple of file timestamp
bugs, and cleanup of multi-line strings, which new gccs were
complaining about.
I also added exec logging to the tty logging facility. It turns out
that an intruder to a honeypot can arrange to run commands without
anything ever allocating a terminal. This makes those commands
invisible to tty logging.
The 2.5.67 UML was released a few days after Linus released his
2.5.67. 2.5.68 is now out, and I'll be dealing with that next week.
27 Mar 2003
After building gcc 3.2.2, it turns out that the 2.5.65 UML works OK.
Then it turned out that gdb couldn't read the new object files, so I
built a newer gdb, and everything started being fine again.
With that fixed, I pulled 2.5.66 and started looking at it. This
looks reasonably straightforward. The only tricky bit is that file
offsets are now stored in ptes. This requires that the offset bits be
arranged around the reserved bits. UML has complete control over the
pte format, so all the reserved bits are at the low end of the pte,
and the offset gets stuck in the upper end.
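A sketch of what "reserved bits at the low end, offset in the upper
end" looks like in code. The helper names follow the kernel's
pgoff_to_pte and pte_to_pgoff, but the bit positions, the number of
reserved bits, and the PTE_FILE flag value here are invented for the
example and are not UML's real pte definitions.

```c
/* Hypothetical pte layout: NOT the real UML definitions. */
#define PTE_RESERVED_BITS 4UL		/* low bits kept for flags */
#define PTE_FILE	  (1UL << 2)	/* assumed "holds a file offset" flag */

/* Store a file page offset in the bits above the reserved ones. */
static unsigned long pgoff_to_pte(unsigned long pgoff)
{
	return (pgoff << PTE_RESERVED_BITS) | PTE_FILE;
}

/* Recover the file page offset from such a pte. */
static unsigned long pte_to_pgoff(unsigned long pte)
{
	return pte >> PTE_RESERVED_BITS;
}

/* Does this pte hold a file offset rather than a page address? */
static int pte_is_file(unsigned long pte)
{
	return (pte & PTE_FILE) != 0;
}
```

Because all the reserved bits sit together at the bottom, encoding and
decoding are each a single shift. An architecture whose reserved bits
are scattered through the pte has to split the offset around them.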
Some more minor fixes, and 2.5.66 boots. The patch is next, once I
get this pushed out to my public BK repository, followed by some more
pull requests for Linus to ignore.
24 Mar 2003
It's been a quiet few weeks for UML development. I've been keeping up
with Linus' 2.5 releases, but not necessarily releasing UML patches or
pushing changes to Linus. The current hold-up with 2.5.65 is that
Linus declared gcc 2.96 to be evil when frame pointers are enabled.
Of course, 2.96 is the version of gcc on my laptop, and seems to be
the only version available for RH 7.1. So, the 2.5.65 UML is on hold
until I grab a tarball of something newer (and all I can find from
gnu.org is 3.x) and build it.
I have a new 2.4.20 patch out. This fixes some minor bugs and applies
some small patches. Nothing major, but I wanted to clear the decks
before tackling some larger problems.
There was a new utilities release which fixed a uml_switch segfault.
This is highly recommended for anyone having problems with uml_switch
crashing. This is obviously very disruptive to any UML network which
relies on the switch, since the UMLs need to be rebooted after
restarting it.
27 Feb 2003
I decided to finally get a 2.4.20 UML out. So, I pushed out the
remaining 2.4.19 changes as 2.4.19-51. I updated my pools to 2.4.20,
and put out 2.4.20-1. The reason I waited so long on this one (as
opposed to almost every other 2.4 release when I released UML within a
few days of Marcelo's release) is that there was a bunch of stuff in
progress that I wanted to get settled down. The main one was skas
last fall, which is nicely stable now. I have had no complaints about
it at all in the last few months. Then, there were occasional reports
of nasty crashes (like the 'tracing myself' ones). I wanted to get
those knocked off before 2.4.20. That appears to have happened.
I'm going to wait to see whether the update has created or exposed any
new bugs and fix some problems in the utilities. When everything
looks nicely stable, I'll make the real 2.4.20 UML release.
2.4.21 looks like it's not too far away, so the 2.4.21 UML should
be released pretty soon afterwards. 2.4.20 was a special case because
of all the restructuring that was happening. Hopefully, that won't
happen again.
26 Feb 2003
I am releasing 2.5.63 today, and also sending the changes to Linus.
He takes my stuff about every 5-6 releases, so I'm not holding my
breath on him taking this right now, as I got a bunch of stuff in two
tries ago.
The big change is that the UML filesystems, hostfs and hppfs, are in
2.5, thanks
to Petr Baudis figuring out the changes in 2.5 VFS needed to forward
port them. He did hostfs, and I used those changes to do hppfs.
hostfs seems to work reasonably, but hppfs pretty much only mounts at
this point - the shadowing from the host doesn't work yet.
19 Feb 2003
I got 2.5.62 working and released. The major pain here was the kernel
introducing its own sigprocmask (and Linus saying that the resulting
clash with libc's sigprocmask was totally my problem and he wasn't
about to rename it). After perusing the ld man page and info looking
for some way of isolating the two symbols and finding nothing, I fell
back on renaming the kernel's sigprocmask with an appropriate -D on
the compilation of kernel files. This is essentially the same as Oleg's
fix, and sucks just as much.
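A rough single-file sketch of what the -D trick does. The replacement name kernel_sigprocmask and the stand-in function body are assumptions for illustration; in the real build the define is passed on the compiler command line for kernel files only.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>

/* Same effect as compiling this part with -Dsigprocmask=kernel_sigprocmask.
 * The name kernel_sigprocmask is an assumption made for this sketch. */
#define sigprocmask kernel_sigprocmask

/* Stand-in for the kernel's function. It is written as sigprocmask, but the
 * macro makes the compiler emit it as kernel_sigprocmask. */
static int sigprocmask(int how, sigset_t *set, sigset_t *oldset)
{
    (void)how;
    (void)set;
    (void)oldset;
    return 42;   /* sentinel so the two functions can be told apart */
}

#undef sigprocmask   /* from here on, the name means libc's function again */

/* Returns 1 if both functions can be called from the same program. */
static int both_sigprocmasks_usable(void)
{
    sigset_t current;

    if (kernel_sigprocmask(0, NULL, NULL) != 42)
        return 0;
    /* With a NULL new set, libc's sigprocmask only reports the current mask. */
    return sigprocmask(SIG_BLOCK, NULL, &current) == 0;
}
```

The rename happens in the preprocessor, so the linker never sees two symbols with the same name.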
6 Feb 2003
Linus finally took my pending UML updates. This seems to happen every
couple of months or so, at which point the changes have started
becoming fairly large. This gives me a clean slate to work from,
which is handy since the existing changes repositories had become
large enough that they started conflicting on a regular basis. I had
to merge several of them so they would apply cleanly for Linus.
I've started knocking items off my todo list. Mostly small patches
that have been sent in and have been languishing in my todo mail
folder.
David has set up a UML mirror on usermodelinux.org, and I've added it
to the UML
download page. The LinuxVDS mirror is coming - I'm syncing it up and
will make it visible when that is finished.
31 Jan 2003
I more or less didn't do anything, UML-wise, for the last week or so.
With things accumulating, I decided to get back on the stick again. I
looked at the console driver locking because mconsole can hang when it
tries to get the configuration of a device that's on a host port. All
the things I thought of were too ugly to implement right now. There's
also the problem that it acquires a semaphore in an interrupt handler,
which leads to a panic if it sleeps there. I don't see a good clean
solution for that, either.
I did fix a few bugs though. Roger Binns found a couple of good ones,
which are now gone. A couple people noticed that early printfs don't
actually appear on the terminal until UML shuts down. An
fflush(stdout) should fix that.
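A minimal sketch of that fix, using a hypothetical helper name (early_printf is not a UML function):

```c
#include <stdio.h>

/* Print an early message and push it out of stdio's buffer right away,
 * rather than leaving it there until the buffer fills or the process exits.
 * Returns the number of characters printed, or a negative value on error. */
static int early_printf(const char *msg)
{
    int n = printf("%s", msg);

    if (n < 0)
        return n;
    return fflush(stdout) == 0 ? n : -1;
}
```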
I also have a bunch of tty logging changes on the way. Upi Tamminen
added timestamps and a direction flag to logging records, which let
you replay the log at its original pace, and tell what data is
output and what data is input. He also wrote a little python script
to replay the log. I rewrote it in perl, and added flags to allow
effectively 'tail -f' of a log and to print out all data, input and
output. This shows you stuff that didn't appear on the terminal
originally, but which you might want to see anyway, like passwords.
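To make the role of the two new fields concrete, here is a sketch of how a replay tool might use them. The record layout and all names are hypothetical; the actual log format may differ.

```c
#include <stddef.h>

/* Hypothetical record header; field names are illustrative only. */
enum tty_log_dir { TTY_LOG_INPUT = 0, TTY_LOG_OUTPUT = 1 };

struct tty_log_record {
    long sec;               /* timestamp, seconds */
    long usec;              /* timestamp, microseconds */
    enum tty_log_dir dir;   /* typed by the user, or written to the tty */
    size_t len;             /* number of data bytes in this record */
};

/* Microseconds to wait before replaying 'next', so that the replay keeps
 * the pace of the original session. Never negative. */
static long replay_delay_usec(const struct tty_log_record *prev,
                              const struct tty_log_record *next)
{
    long d = (next->sec - prev->sec) * 1000000L + (next->usec - prev->usec);

    return d > 0 ? d : 0;
}

/* Output is always shown; input is shown only on request. Showing input
 * is what reveals data that was never echoed, such as passwords. */
static int should_print(const struct tty_log_record *r, int show_input)
{
    return r->dir == TTY_LOG_OUTPUT || show_input;
}
```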
On Tuesday, I was interviewed by the History Channel. They are doing
a show on network security, and were at ISTS to talk about honeypots
and honeynets. So, George Bakos and I told them
all they wanted to know about UML honeypots and a bunch of other
things. It'll be interesting to see how much of this stuff makes the
final show.
On a side note - as to why the History Channel was doing this rather
than yet another WWII show, we decided that these guys were from the future.
That makes today history for them. It also means that UML figures
prominently enough that far in the future to send someone back to talk
about it. I guess that's some incentive to keep whacking away on it...
17 Jan 2003
I've been playing with 2.5 lately. I released 2.5.58 on Thursday,
announcing it yesterday. Linus promptly released 2.5.59, so I pulled
that and updated UML to it. I fixed some problems which were also in
2.4, so I ported them back.
While I was at it, I looked at SMP on 2.5. After some work, I got it
to build. It doesn't run. The reason appears to be the xtime locking
bug that was fixed in 2.4. So, I decided to release 2.4.19-47 to give
myself two releases to diff so I can apply that diff to 2.5.
So, 2.4.19-47 is also out, with a bunch of miscellaneous fixes and
cleanups in it.
I have accumulated some tools changes, so I'm releasing them today, as
well. The trigger for this was the 64-bit uncleanliness bugs. I
want to release the tools at the same time as the drivers so the fixes
match. This won't matter for anyone running UML on 32-bit boxes. The
other notable change was a bug in uml_moo which caused it not to write
out to the end of the output file sometimes, leaving it shorter than
it should be.
9 Jan 2003
I've been taking it easy on UML development over the last week or so.
I got busy doing some other stuff. UML is in pretty decent shape now
anyway, so a bit of a break likely won't bother anyone. People have
again started saying that they are having a hard time breaking it.
This happened last in the 2.4.19-6 to -13 range, before the skas
rework. So, it seems that I've reached that level of stability again.
I had accumulated a number of changes since -45, so I released -46
yesterday. The big news is that the network hang that a number of
people were complaining about is fixed. mistral had already found
this and diagnosed it. It just took me a while to realize that it was
the cause of the network hang.
29 Dec 2002
I finished merging the 2.4 stuff into my 2.5 tree. That all went
rather well, so now I have five more trees that Linus needs to pull.
I'm currently running diffs and bk stuff in order to prepare the
diffstat and changelogs that Linus likes in pull requests, plus
getting the 2.5.53-2 patch ready.
With 2.5 caught up to 2.4, it's time to start taking a serious look at
my todo list and start knocking things off it.
28 Dec 2002
I started merging the 2.4 changes into my 2.5 pool. I put a small fix
in early, then updated it to 2.5.53. That turned out to be pretty
much a no-op, so I released the patch. I also asked Linus (yet again)
to take my existing 2.5 changes. Doing BK stuff had started becoming
inconvenient because I had enough unmerged repositories that new
changes started crossing them, so that they couldn't be added cleanly
to an existing repo. So, I merged them all together and started using
that as my starting pool. Any repos based off that pool would have to
wait until Linus pulled the existing ones. So, having prepared to
have Linus ignore this set, I watched him promptly pull the whole lot.
He wasn't really enthusiastic about the host skas patch (which
I purposely didn't send him). His description of /proc/mm was
"crap". He's such a tactful person. This led to a discussion of
what would be better, and it turns out he would prefer a system call
indirection system call which would allow an arbitrary system call to
be executed in the context of a particular address space. You would
pass it a file descriptor for the address space (which you'd get by
calling a get_mm system call) and a block containing the number and
arguments of the system call you want to have run in that address
space. Very general, and a neat idea.
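A sketch of what such a request block could look like. Neither get_mm nor the indirection call exists, so every name here is hypothetical, and the sketch only fills in the block without invoking anything.

```c
#include <string.h>

#define MM_SYSCALL_MAX_ARGS 6

/* Hypothetical request block: which system call to run in the target
 * address space, and the arguments to run it with. */
struct mm_syscall_req {
    long nr;
    long args[MM_SYSCALL_MAX_ARGS];
};

/* Fill in a request. Returns 0 on success, -1 if there are too many
 * (or a negative number of) arguments. Unused slots are zeroed. */
static int mm_syscall_req_init(struct mm_syscall_req *req, long nr,
                               const long *args, int nargs)
{
    if (nargs < 0 || nargs > MM_SYSCALL_MAX_ARGS)
        return -1;
    memset(req, 0, sizeof(*req));
    req->nr = nr;
    if (nargs > 0)
        memcpy(req->args, args, (size_t)nargs * sizeof(long));
    return 0;
}
```

The block would then be handed to the indirection call together with the address space's file descriptor. The generality comes from the block carrying any system call number, so no per-operation interface is needed.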
22 Dec 2002
I made a whole bunch of small releases as I fiddled the linker scripts
and discovered a bit late on each one that I had broken a build or
produced a UML that just crashes on boot. The exception to this is
-45, which I did in response to a couple of nasty bugs being tracked
down to the point where I could either just fix them or reproduce and
fix them. Thanks to Jan Hudek and Barry Silverman for their work.
-45 also fixes a linker script make-a-UML-that-just-segfaults bug.
Hopefully, that's the last of them.
I'm beginning to feel better about UML stability now. The parade of
crash reports that followed the skas integration seems to have died
down. The two bugs fixed in -45 are a load off my mind. With 2.4
looking reasonable, I think it's time to start merging this stuff into
my 2.5 tree. That's been left alone, except for keeping up with
Linus, for a while. There are also some 2.5-specific bugs that need
fixing, and Oleg seems to be getting impatient about them.
20 Dec 2002
Today brings the release of -41. The main feature of this is that the
kernel stack size is now configurable. This is of benefit to no one
except people wanting to valgrind UML, which is only me because no one
else has my valgrind fixes yet.
Speaking of valgrind, it can now run UML. It produces reams and reams
of errors which are almost entirely noise. We're trying to figure out
how to get valgrind to be more selective about detecting only real
errors. So far, it looks like UML is hitting valgrind with code which
is too optimized, and contains code sequences which it doesn't
recognize as initializing things. It also appears to me that it is
not considering static data to be initialized.
18 Dec 2002
I released -39 and -40 yesterday. They contain mostly small bug fixes
and cleanups. The one big change is that I converted all initializers
over to C99 syntax. Then I decided to get caught up with 2.5. So, I
updated my 2.5 tree to 2.5.52, which was pretty easy, updated my BK
repositories and the patch, and sent it all out. So far, Linus hasn't
taken any of it.
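For anyone who hasn't seen the difference, here is a small example of the C99 syntax next to the older GNU form it replaces. The struct and its fields are invented for the example, not taken from UML.

```c
/* Hypothetical ops table, only to show the initializer syntax. */
struct chan_ops {
    const char *type;
    int (*open)(int unit);
    int winch;
};

static int null_open(int unit)
{
    return unit;
}

/* The old GNU extension spelled the same initializer as:
 *     type: "null", open: null_open, winch: 0
 * C99 designated initializers use ".field = value" instead. */
static const struct chan_ops null_ops = {
    .type  = "null",
    .open  = null_open,
    .winch = 0,
};
```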
15 Dec 2002
Continuing to play with valgrind. I fixed its clone problems by
ensuring that valgrind doesn't gain control of the child. This stops
the valgrind child and parent from stomping on the same data
structures and crashing each other. With that problem out of the way,
I started hitting problems with valgrind's signal delivery. Its
signal frames were bogus, which prevented UML's SIGSEGV handler from
getting fault information. UML hit a hole in valgrind's repe
handling. Then, it turns out that valgrind doesn't save and restore
signal masks across signal handlers correctly. This is where we stand
now. UML can run far enough that it panics when it's not given a
filesystem to boot on. The signal mask problem hits when UML starts
trying to do disk IO. Jeremy Fitzhardinge has been great in helping
diagnose these problems and providing fixes.
I've been having limited success in diagnosing and reproducing the
bugs that people have been reporting, so I decided to see if I could
rustle some up myself. I started up four UMLs (2 tt and 2 skas), ran
infinite kernel building loops on them, and also periodically hit them
with short ping floods and ab runs. They all ran fine for a while,
then they all ran out of tmpfs space on /tmp. This is fine for the tt
UMLs, but I didn't implement the recovery code in skas mode, so this
was fatal for them. I was hoping for bugs that I didn't already know
about though.
I ran the surviving tt UMLs this way all afternoon without any
problems. I cleaned up some code, fixed the skas stack consumption
bug, and decided to call that -38.
10 Dec 2002
I had a fairly fruitless day chasing bugs yesterday. Some I couldn't
reproduce, another I made no headway on. That one was what appears to
be memory corruption in netfilter.
I decided to dust off valgrind and see how close it is to handling
UML. clone() is a problem, but I realized something that I didn't
last time I played with valgrind, and got it handling the !CLONE_VM
case. It repaid that effort by spotting a minor buglet. It segfaults
on the CLONE_VM case still. I think it's because the two threads mess
with valgrind internal data after the call, and one messes up the
other. So, my current theory is that I need to have the child thread
immediately leave valgrind control and things will work better.
As a side-effect of that, I made UML build as a normal dynamically
linked binary when CONFIG_MODE_TT is off. This will be in the next
patch. Not a big deal, but it's one more step along the path of UML
becoming a completely normal app.
8 Dec 2002
Thanks to David Coulson telling me how to reproduce it, I fixed the
'tracing myself' bug seen under heavy network load. It turned out to
be a stack overflow caused by CONFIG_MODE_SKAS increasing the size of
a data structure which tt mode put on the stack. So, the problem
would go away by disabling CONFIG_MODE_SKAS, even though no skas code
ever ran.
I also took the opportunity to do a whole bunch of cleanup of the
uml_pt_regs struct (the bloated structure in question), and associated
macros and code. This turned a few-line fix into more than 1000 lines
of patch. This change is almost all of -36, which is now out.
7 Dec 2002
I finally got around to getting my BK repos hosted on bkbits.net. I
had set up a project there when I first got set up with BK, but had
never cloned my repos there. Larry McVoy had pinged me once or twice
about whether I was intending to use bkbits so he could clean it up if
not. So, I pulled all my changes over there.
I also sent everything to Linus again. Hopefully he'll pull it this
time.
Since 2.5.50 is corrupting data (and I really hope it's not my fault
:-), I'll not do anything more with 2.5 until 2.5.51. So, it's back
to bug fixing on 2.4. I've got some claims of ways of reproducing
some of the most-wanted bugs, plus a nice stack trace for another, so
I think I can knock off some good bugs.
6 Dec 2002
I released -35 with another bunch of small changes. It turned out
that signal delivery when libc didn't provide a restorer was broken.
This caused Tom's boot/root to segfault because it has an older libc
than the other filesystems. This is fixed, plus some large memory
crashes. There were also some cleanups.
I'm going to catch up with Linus now. I've been madly massaging BK to
get my repos updated. I looked at the 2.5.50 patch and spotted only a
couple of things that needed changing. One of them was the deletion
of the sys_security system call. I went to delete it from UML and was
mystified that it wasn't there. I thought I had totally missed it
until I happened to see the names of some of my files go by during a
BK pull. I took a closer look at Linus' patch, and it turned out that
other people already made those changes for me. This is a major reason
that it's nice to be in the official tree. Other people start doing
your work for you.
So, I'll get 2.5.50 up and running and ask Linus to pull everything.
Then I'll start merging my recent 2.4 changes into my 2.5 pool.
4 Dec 2002
I only tested Tomcat in skas mode before I released -33. This was a
mistake because Tomcat could hang UML in tt mode. This uncovered a
fpstate size calculation bug, which is now fixed.
I also ran into a 'sleeping process nnnnn got unexpected signal : 29'
crash, which other people have been complaining about. Getting my
hands on it ensured its swift demise.
With those two bugs fixed, I decided to release -34. Not a huge
amount of change, but those are two major bugs, and it's good to be
rid of them.
2 Dec 2002
I figured out what was happening with Java inside UML. With
SA_SIGINFO set on a signal, UML didn't produce the same format stack
frame as the host. The reason this broke JVMs is that Java can induce
segfaults in the JVM. The JVM looks at the information on the signal
frame to figure out what happened. If it figures out that the fault
was caused by Java code, it converts it into a Java exception, which
the Java may or may not catch. Since UML wasn't putting information
in the stack frame in the same format as the host, the JVM was
confused about the origin of the segfault, and crashed.
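The pattern the JVM relies on can be sketched in a few lines of ordinary userspace C. This is not JVM or UML code, and the helper names are invented. It only shows a SIGSEGV handler installed with SA_SIGINFO reading the fault address from the information the kernel supplies, and then recovering instead of crashing.

```c
#define _GNU_SOURCE
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>

static sigjmp_buf recover_point;
static void *volatile last_fault_addr;

/* SA_SIGINFO handler: the kernel passes fault details in 'info'. */
static void segv_handler(int sig, siginfo_t *info, void *ucontext)
{
    (void)sig;
    (void)ucontext;
    last_fault_addr = info->si_addr;   /* address whose access faulted */
    siglongjmp(recover_point, 1);      /* handle the fault instead of dying */
}

/* Read one byte at 'addr'. Returns 1 if that faulted, 0 if it did not,
 * -1 if the handler could not be installed. */
static int probe_address(const volatile char *addr)
{
    struct sigaction sa, old;
    int faulted;

    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = segv_handler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, &old) < 0)
        return -1;

    if (sigsetjmp(recover_point, 1) == 0) {
        (void)*addr;
        faulted = 0;
    } else {
        faulted = 1;
    }

    sigaction(SIGSEGV, &old, NULL);
    return faulted;
}

/* Map one page with no access rights, so that touching it must fault. */
static void *map_inaccessible_page(void)
{
    void *p = mmap(NULL, 4096, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    return p == MAP_FAILED ? NULL : p;
}
```

A handler like this can only make the right decision if the data it is handed matches what the host would supply.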
I released -33 today with the Java fix, plus a bunch of smaller
fixes. I was hoping to have dealt with the 'tracing myself' crashes
that David is seeing, but I haven't figured them out yet. Oh well. I
also need to get 2.5.50 going, and get all my changes pushed out to
Linus.
25 Nov 2002
After far too much trouble, I got a 2.5.49 patch put together,
compiling, and running. This uncovered a nasty bug in skas mode which
caused signals to be blocked after returning from an interrupt. I
started tripping a check in the filesystem which was making sure that
interrupts were enabled. After a fair amount of time chasing the
problem, I tracked it down to my handler forgetting to re-enable
signals before it returned to userspace.
I also have a bunch of BK repositories all set to go to Linus.
Tomorrow, I'll probably ask him to pull them. Hopefully, he will this
time. The changes are getting rather large and it will be
convenient to have them merged into his tree.
22 Nov 2002
I finished the skas merge into 2.5 and got UML booting both in tt and
skas modes. So, 2.5 is all caught up to 2.4. Now, I have to push
this stuff out to Linus, and then he has to take it. And, I have to
merge 2.5.49 and release that.
At this point, I'm in bug-fixing and janitoring mode for the
foreseeable future. Despite my attempts to make the skas changes a
no-op for the tt code, the latest 2.4 UML is breaking in strange ways
in tt mode. So, it looks like I've introduced some bugs there, and
they will have to be rooted out.
I also pulled the 2.5.49 changes and tried them out. It turns out the
only change that affects UML is an extra argument to do_fork. Fixing
that produces a UML that works. So, look for the 2.5.49 patch soon.
21 Nov 2002
I'm busy merging the 2.4 changes into 2.5. Mostly, this is the skas
changes, but there's a fair amount of other stuff in there as well.
My 19000 line patch is down to ~3000 lines right now. I had to do a
huge merge which got rid of most of those diffs before UML would
compile again. Once I got it building, I had to fix three bugs, and
it booted again, which was nice.
With that out of the way, it looks like I can merge the rest in much
smaller chunks. This makes life much easier, since I can build and
test after each one, and if UML breaks, there is a relatively small
amount of code that I have to search for the bug.
I pushed the first working bunch of code out to my BK repository and
generated a 2.5.48 patch. When SF decides to start working again,
I'll announce it.
16 Nov 2002
I went back to work on the 2.5 tree. At this point, I'm making the
current stuff work again. This wasn't a big deal. There were some
small interface changes. The biggest problem was the initramfs
stuff. My objcopy didn't support --rename-section, so I grabbed the
latest binutils tarball, and built and installed it. Then, UML
wouldn't boot. I resorted to looking for anything helpful that Jeff
Garzik may have posted to lkml. What I found was that I needed to add
a couple of definitions to the arch Makefile to tell objcopy how to
jam a piece of arbitrary data (the initramfs image) into the kernel
binary. With that, the 2.5.46 UML boots.
While pulling 2.5.47 into my BK tree in my umlcoop UML (which failed
once with an error from deep inside BK), I decided to generate the
patch from my 2.4 tree that I'm going to have to merge into my 2.5
tree. It turns out, as best as I can figure, that my 2.5 tree is up
to 2.4.19-14. So, I have to merge the changes from -14 to -31. Ouch.
The diff is almost 19000 lines. Ouch.
15 Nov 2002
I released 2.4.19-31 today with lots of bug fixes. I'm starting to
work on the backlog of bug reports, patches, and other things that
I've accumulated since I started the skas work. I also redid a lot of
the get_config code and extended it to the ubd driver. The network
drivers still don't support it, plus the console and serial line
drivers don't support plugging and unplugging devices at run time.
I also need to support listing of all devices of a particular type,
which is needed for people who want to find a spare slot into which to
plug a new device.
-31 doesn't compile with CONFIG_MODE_SKAS disabled. It was a silly
oversight - I put some skas-specific code outside CONFIG_MODE_SKAS.
I'm starting work on 2.5 again. The last UML release was 2.5.44.
Linus is up to 2.5.47. I'm going to take the existing code up to that
(or .48 if Linus releases that before I'm done). Then comes the job
of merging all the changes I've made in the 2.4 pool. The largest
piece of this is the skas stuff, but there have been a lot of bug
fixes, cleanups, features, and other changes.
13 Nov 2002
I finally figured out the FP problem that was preventing RH7.2 from
booting on -29. I wasn't initializing the FP state correctly in new
processes. Copying in a known good set of values fixed that.
I also fixed the segments problem cleanly. The host skas3 patch
included a copy-segments /proc/mm operation, which is needed to copy
arch-specific address space information from one address space to
another. This wasn't used before, but is now in order to copy x86
segment information between address spaces.
With that, plus a bit of code cleanup, I released -30. At this point,
it's time to settle things down, fix bugs, clean code, and get 2.5 up
to date again. Basically, I'd like to spend a good amount of time on
maintenance rather than new code. We'll see how well I resist the
call of new functionality...
11 Nov 2002
I'm releasing -29 and another version of the host patch today. The
big news here is that, in skas mode, UML no longer creates one host
process for each UML process. That used to be done in order to create
new address spaces on the host. Since only one of those processes
could be running at any given time (in a UP UML), they were a waste of
kernel memory.
Now, I've added /proc/mm, which provides a way to deal with address
spaces independently of processes. Opening it creates a new, empty
address space, and returns a file descriptor which can be used to
manipulate it. Closing the descriptor frees the address space if no
process is running in it. The address space can be populated by
writing to the descriptor. You write a request to it, which can cause
an mmap, munmap, or mprotect to happen. The contents of the request
are basically the arguments to the corresponding system call.
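As a sketch of the write-a-request idea, assuming a made-up request layout. The struct, field, and constant names below are illustrative and should not be taken as the host patch's actual definitions.

```c
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical request layout: an operation code followed by the
 * arguments the corresponding system call would take. */
enum mm_op_type { MM_OP_MMAP = 1, MM_OP_MUNMAP, MM_OP_MPROTECT };

struct mm_request {
    int op;
    union {
        struct { unsigned long addr, len, prot, flags, fd, offset; } map;
        struct { unsigned long addr, len; } unmap;
        struct { unsigned long addr, len, prot; } protect;
    } u;
};

/* Build a request that would unmap [addr, addr + len). */
static void mm_request_munmap(struct mm_request *req,
                              unsigned long addr, unsigned long len)
{
    memset(req, 0, sizeof(*req));
    req->op = MM_OP_MUNMAP;
    req->u.unmap.addr = addr;
    req->u.unmap.len = len;
}

/* Submit one request by writing it to the address space's descriptor.
 * Returns 0 on success, -1 on error or short write. */
static int mm_request_submit(int mm_fd, const struct mm_request *req)
{
    ssize_t n = write(mm_fd, req, sizeof(*req));

    return n == (ssize_t)sizeof(*req) ? 0 : -1;
}
```

On a patched host, mm_fd would be the descriptor returned by opening /proc/mm. Any writable descriptor works for trying out the framing.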
This allows a process, i.e. the UML kernel process, to have handles to
many address spaces without processes running in them. When UML
creates a new process, it creates a new address space for it. When
that process is scheduled, the UML userspace process on the host is
switched into that address space with PTRACE_SWITCH_MM. That causes
it to jump from its current address space into the new one. That,
plus a full register restore, is now a UML context switch.
There are some glitches with it. Floating point is broken. This
doesn't affect older distributions such as Debian potato, but it
breaks newer ones, such as RH7.2. The problem is that, while I do a
full save and restore of the floating-point registers across a context
switch, that's not good enough. What I need to do is tell the FPU
(via fxsave) to dump its state out to memory. That will give UML the
real FP register state, which would be restored when that process is
next scheduled.
With these problems fixed, I've reached the level of host support I've
wanted for a while. The number of processes on the host is close to
the absolute minimum. This causes UML to consume fewer host
resources. It also seems to be faster. I ran some kernel builds, and
it now takes 8:22. I didn't run tt or skas2 builds, but if my
previous results still hold, which they should, that's 45 seconds
faster than skas2, which came in at 9:07. That brings the skas3
kernel build time to 60% of the tt build time.
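The arithmetic behind those figures, as a small sketch. It takes the tt build time to be the 13:50 reported in the 2 Nov 2002 entry; the helper names are made up.

```c
/* Convert a minutes:seconds build time to seconds. */
static int to_seconds(int min, int sec)
{
    return min * 60 + sec;
}

/* 'part' as a percentage of 'whole', rounded to the nearest integer. */
static int percent_of(int part, int whole)
{
    return (part * 100 + whole / 2) / whole;
}
```

With skas3 at 8:22 (502 s), skas2 at 9:07 (547 s) and tt at 13:50 (830 s), to_seconds(9, 7) - to_seconds(8, 22) gives the 45 second difference, and percent_of(502, 830) gives 60.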
The one major thing that I could still ask from the host is some way to
merge the current kernel and userspace host processes into one. This
would involve some way for the kernel process to switch itself to the
userspace address space, with a register switch at the same time. It
would also require a mechanism to switch back (address space plus
registers) on any sort of signal. This essentially would mean it was
ptracing itself, plus intercepting its own signals. It's not clear to
me how to do that, so I'll just let this sit for a while.
There are a couple minor host changes which I'll probably add in at
some point. ptrace currently requires that system call intercepting
processes see both the start and end of the child's system calls.
This causes four host context switches per UML system call. What I
would like is to just see the start of a system call, at which point
UML would read it out, run it, and stick the result back in the
child. It would be told just to return immediately to userspace.
This would eliminate two of those context switches, and bring it down
to the theoretical minimum. This should also noticeably improve
performance.
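The two-stops-per-system-call behavior can be seen with stock ptrace on an unpatched host. This is a standalone demonstration, not UML code, and the function name is invented. It counts how often the tracer is woken while a child makes a known number of getpid calls.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that makes 'ncalls' raw getpid system calls under
 * PTRACE_SYSCALL. Returns the total number of syscall stops the tracer
 * saw (including a fixed number from child startup and exit), or -1. */
static int count_syscall_stops(int ncalls)
{
    int status, stops = 0;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) < 0)
            _exit(1);
        raise(SIGSTOP);                 /* wait for the tracer to take over */
        for (int i = 0; i < ncalls; i++)
            syscall(SYS_getpid);
        _exit(0);
    }

    if (waitpid(pid, &status, 0) < 0 || !WIFSTOPPED(status))
        return -1;
    /* Have syscall stops reported as SIGTRAP | 0x80, so they can be told
     * apart from ordinary signal stops. */
    if (ptrace(PTRACE_SETOPTIONS, pid, NULL,
               (void *)(long)PTRACE_O_TRACESYSGOOD) < 0)
        goto fail;

    for (;;) {
        if (ptrace(PTRACE_SYSCALL, pid, NULL, NULL) < 0)
            goto fail;
        if (waitpid(pid, &status, 0) < 0)
            return -1;
        if (WIFEXITED(status) || WIFSIGNALED(status))
            break;
        if (WIFSTOPPED(status) && WSTOPSIG(status) == (SIGTRAP | 0x80))
            stops++;
    }
    return stops;

fail:
    kill(pid, SIGKILL);
    waitpid(pid, &status, 0);
    return -1;
}
```

Subtracting count_syscall_stops(0) from count_syscall_stops(n) removes the startup and exit overhead and leaves the stops caused by the n getpid calls themselves: one at entry and one at exit for each. Seeing only the entry stop would halve that.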
The second minor thing I am planning is some way of filling in the
start and end addresses of the command line and environment of new
address spaces. Those values would be the same as the UML values.
This would have the nice effect of making ps on the host show the same
process name for the UML userspace process as ps inside UML. So, that
one host process would change its name according to what is currently
running inside UML.
I got rid of the ugly asm that I used to set up %gs correctly. There
is support in the host patch to copy segments from one address space
to another, but UML doesn't currently use it. This is
straightforward, and should be the correct fix for the segment
problems I was having last week.
There were a bunch of other changes in this patch, including:
/proc/mm support, since I tested it in UML before porting it to the host.
I fixed the segfault caused by querying the configuration of a device
that had never been opened.
UML now compiles with CONFIG_MODE_SKAS off.
There was a fair bit of code cleanup.
Fixed the behavior of the network driver when it gets an error.
5 Nov 2002
After much wailing, gnashing of teeth, rending of garments, and asm
nastiness, I fixed the problems with my RH 7.2 filesystem not booting
in skas mode. The problem is with the segment register gs, which is
used somehow by newer pthreads libraries. UML was ptracing the
correct value into new processes, but it was vanishing immediately for
reasons that I couldn't figure out for a while.
It eventually dawned on me that it would work if it was inherited from
the parent process. In tt mode, the parent process is the host
process of the parent, so that works automatically. In skas mode, the
parent process is the kernel process, so there is no relationship
between the host processes that hold UML parent and child processes.
With this level of understanding, I went about setting the value of gs
in the kernel process temporarily to the correct value so it could be
inherited by the child. This consistently segfaulted, which I didn't
understand till this morning.
It turns out that loading a value into a segment register will fail if
the corresponding segment hasn't been loaded with sane data. Now,
modify_ldt is there, but it operates on the userspace process, not the
kernel process. So, I added a kludge to make modify_ldt also operate
on the kernel process. This allows the segment register to be
assigned and then inherited properly by the child.
After staring at the i386 code, I have since realized why the value of
gs can't be ptraced in. It is added to the thread structure properly,
so if you write it in and immediately read it back out, you will get
the expected result. However, in the i386 switch_to, it zeros out fs
and gs if they don't have good segments already. I'm not sure exactly
why they don't, but this looks like the reason ptrace isn't working
the way I expect.
So, with that fix in, plus a bunch of others, -27 is out. If you want
to use skas mode with it, you'll need the host skas2 patch. The first
patch won't work because I moved the new ptrace
operation numbers so that -27 won't see it even if it is applied to
the host.
3 Nov 2002
-26 is out. This contains a few bug fixes which are enough to get my
RH 7.2 filesystem to boot. It needed a working modify_ldt and a
signal delivery fix. I started trying to run UML inside itself in
skas mode, and that turned up a timer bug. With those bugs fixed, RH
7.2 boots, but a few of the daemons are running continuously. This
needs fixing as well.
To take advantage of this, you'll need the second host patch, which I
haven't released yet. The one I'm running on my laptop has
PTRACE_LDT in it. I'm going to add PTRACE_JOIN_MM to it before
releasing it. I played with that a bit today, with a little test
program that is made to switch address spaces. It seems fine, and
I'll port it to i386 soon. I also need to port PTRACE_LDT into
UML since that's only in my i386 pool at the moment.
I haven't talked about where I'm going with the host support, so this
would be a good time to do it. PTRACE_JOIN_MM is a stepping stone
towards greater things. Currently, you give it a pid, and it will
locate that process and make the ptraced child join its address
space. The next step is to eliminate that pid, since it's only used
to locate an address space.
I'm going to make address spaces first class objects which are visible
from user space. This will probably take the form of /proc/mm, which
when opened, gives the caller a file descriptor whose underlying
object is a brand-new, empty address space. Then, PTRACE_JOIN_MM can
dispense with the pid, and use that file descriptor instead.
What this gives us is the ability to have an address space without a
process running in it. In turn, this lets us eliminate the host
processes which each hold a UML process userspace. In its place, we
will have one host process per UML processor to hold the userspace
contexts. The one host process per UML processor kernel context will
remain. So, the number of host processes will go from
num_UML_processors + num_UML_processes to 2 * num_UML_processors,
which is much smaller. The userspace context processes will get
bumped from address space to address space as needed.
That much is easy. It leaves us with each UML processor having two
host processes, one for its kernel context and one for its userspace
context. The next step will be to merge those, and have one process
per processor bouncing between kernel and userspace address spaces.
This is harder, because it implies these processes ptracing
themselves, with an automatic address space switch every time there's
a ptrace event which causes a switch to kernel space.
For system calls, this seems straightforward. The system call tracing
code can do the address space switch, and replace the registers to be
restored on return to userspace. Signals are harder because that code
is deeper down the stack, so more code would have to be fiddled in
order to make that work.
Assuming that's doable, that's the ultimate goal - one host process
per UML processor, with one host address space per UML process. This
would eliminate a bunch of the ptrace additions, which would have
served temporarily as scaffolding. PTRACE_{MMAP,MUNMAP,MPROTECT} and
PTRACE_LDT would become operations on the /proc/mm file
descriptors. PTRACE_SIGPENDING becomes unnecessary since that's only
needed to avoid races when switching between processes. PTRACE_FAULTINFO
would still be needed, unfortunately, since PTRACE_GETREGS doesn't
give it to you.
2 Nov 2002
I released -24 and -25 over the last few days. They are mostly bug
fixes to skas mode. The network now runs reasonably, thanks to fixes
to the checksumming code, although I do have sshd's hanging around
after copies. There is now thread support, which allows me to run
kernel builds - make uses vfork, which is a cheezy form of threading.
So, I did some kernel build runs to see what performance difference
there is between tt and skas modes. Two runs each in skas and tt
modes produce times of 9:07 and 13:50 respectively, identical to the
second in both cases. This makes the skas time about 66% of the tt time.
For some micro-benchmarking action, I ran a loop of 100000 calls to
getpid. In tt mode, it takes 15 sec; in skas mode, it takes 7.
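For scale, 15 seconds over 100000 calls works out to about 150 microseconds per getpid in tt mode, and 7 seconds to about 70 microseconds in skas mode. A loop of this kind can be sketched as below. This is a minimal sketch, not the test program used for the numbers above; the function names time_getpid_loop and per_call_usec are made up for illustration, and the raw syscall() entry point is an assumption chosen so that no libc-level caching of the pid can hide the kernel entry.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Time `calls` raw getpid system calls and return the elapsed wall-clock
 * time in seconds. (Hypothetical helper, not part of UML.) */
double time_getpid_loop(long calls)
{
    struct timespec start, end;
    long i;

    assert(calls > 0);
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < calls; i++)
        syscall(SYS_getpid);
    clock_gettime(CLOCK_MONOTONIC, &end);

    return (double)(end.tv_sec - start.tv_sec) +
           (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}

/* Average cost of one call, in microseconds. */
double per_call_usec(double seconds, long calls)
{
    assert(calls > 0);
    return seconds * 1e6 / (double)calls;
}
```

Calling per_call_usec(time_getpid_loop(100000), 100000) inside a UML and again on the host gives a rough per-call overhead to compare.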
There are still some improvements to be had. Currently, skas mode
does a system call with four host context switches, while tt mode does
one with four host context switches and a host signal delivery and
return. The speedup comes from losing that signal delivery, plus
maybe the context switches are faster, due to the UML process address
spaces being smaller because they don't contain the kernel. Two of
those four context switches can be eliminated, which should cut the
system call overhead down by a factor of two from where it is currently.
In the continuing quest to get my RH filesystem to boot in skas mode,
I fixed modify_ldt. This is another address-space-changing function,
which, before I fixed it, was modifying the kernel address space,
rather than the process'. So this required another ptrace kludge on
the host, which when implemented, made modify_ldt work better. It
still doesn't boot. The current hang-up is something called getkey,
which I've never heard of, has no documentation that I can find, and
has no interesting strings in its binary. It is sitting in a poll
forever, and I haven't figured out why yet.
30 Oct 2002
I released -23 last night. UML now runs in skas mode on the host. I
ported the host ptrace patch from UML to i386 and have been running it
on my laptop for the last few days. It is available here.
Once skas mode is working well, it's time for some stabilizing. With
highmem, SMP, and skas support going in over the last couple of
months, I'm going to concentrate on killing bugs for a while. Marcelo
has started the 2.4.20-rc series, so I'm thinking that I will start
new UML development projects after 2.4.20 is released, and concentrate
on stability until then.
27 Oct 2002
skas mode works now. The host has to be a recent UML, and the guest
needs to be built with CONFIG_MODE_SKAS enabled and CONFIG_NEST_LEVEL
set to 1. See the story on usermodelinux.org about the 2.4.19-21
release. That contains a good description of what skas mode is and
what the benefits are.
I released -22 today with more fixes and cleanups. UML now builds and
runs with either mode configured out. I fixed the SMP build, although
I punted on SMP on skas mode for the time being. This involved a fair
bit of code cleanup and movement.
I persuaded UML to build as a normal dynamically linked binary. This
took some surgery on the linker script which I didn't include in this
patch. The point of that exercise was to see if valgrind would work
on UML now. It starts to, but blows up on the first call
to clone, which happens fairly early. After some correspondence with
Julian Seward, the valgrind author, it appears that this is easy to
fix. So, we may soon have valgrind working on the kernel, which is
timely considering that we are entering the stabilization phase of 2.6.
24 Oct 2002
Most of the last week was spent merging the skas work done back in
September into my main pool. I created a couple of pools on the side
so that I didn't have to stall the main line of development. Those
two pools are now merged into my main pool and are in CVS and the
patch. I released 2.4.19-18 yesterday with everything finally
merged. The skas changes are present, but latent for now. You can
enable CONFIG_MODE_SKAS (it's hardwired to on currently anyway) and
the code will be compiled in, but you can't run it. -18 had some
build problems. I forgot to update a clean rule, with the effect that
there's a binary in the patch (rm arch/um/util/mk_constants if you
want to build that patch). The link will fail if you enable
CONFIG_HIGHMEM.
I released -19 with those bugs fixed and with some code movement, and
detection of the host support needed for skas mode. This gives me a
clean platform from which to start debugging the skas code, and
everyone else a cleanly building patch.
14 Oct 2002
I spent the last few days debugging the merge of the 2.4 SMP support
into my 2.5 pool. That was more painful than I expected. The
actual debugging took longer than the 2.4 debugging, notwithstanding
the fact that I had the benefit of that earlier debugging. The SMP
infrastructure had changed in 2.5, and I didn't get UML into userspace
until I stared at the i386 SMP boot process and made the UML boot do
the same things at the same time.
I had to change the locking in the ubd driver. I started with the
request queue lock being the same as the ubd device lock, which
deadlocks unfixably when a disk is added to the system with mconsole.
ubd_config holds the ubd_lock, calls add_disk, which does IO to read
the partition table, which tried to grab the ubd_lock again. I added
a ubd_io_lock, which pretty much restored the 2.4 situation, which
protects IO with the global io_request_lock.
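The shape of that deadlock, and of the fix, can be shown with a toy model. This is only a sketch under stated assumptions: toy_lock is a made-up non-recursive lock that reports "already held" where a real spinlock would spin forever, and config_disk and read_partition_table are hypothetical stand-ins for ubd_config and the IO done by add_disk, not the driver's actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Toy non-recursive lock, just enough to expose a self-deadlock. */
struct toy_lock {
    int held;
};

/* Returns 1 if the lock was taken, 0 if it was already held. */
int toy_trylock(struct toy_lock *l)
{
    assert(l != NULL);
    if (l->held)
        return 0;
    l->held = 1;
    return 1;
}

void toy_unlock(struct toy_lock *l)
{
    assert(l != NULL && l->held);
    l->held = 0;
}

/* Stand-in for the partition table read: the IO path takes io_lock. */
int read_partition_table(struct toy_lock *io_lock)
{
    if (!toy_trylock(io_lock))
        return -1;              /* a real lock would hang here */
    toy_unlock(io_lock);
    return 0;
}

/* Stand-in for ubd_config: holds dev_lock across the add_disk step. */
int config_disk(struct toy_lock *dev_lock, struct toy_lock *io_lock)
{
    int ret;

    if (!toy_trylock(dev_lock))
        return -1;
    ret = read_partition_table(io_lock);
    toy_unlock(dev_lock);
    return ret;
}
```

Passing the same lock for both arguments reproduces the failure, and passing two distinct locks corresponds to adding a separate IO lock.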
On 2.4, I had all interrupts handled on CPU 0. This turns out to be
wrong - all processors have to have timers in order to do their local
process accounting. However, only one processor, CPU 0 in the case of
UML, actually calls the timer IRQ. This matters on 2.5; it doesn't
seem to on 2.4. I also found an unfixable race with the timer
interrupt caused by UML never blocking SIGALRM and SIGVTALRM and
instead relying on flags to avoid calling into the kernel when it
shouldn't. I decided to just get rid of those flags and treat the
timer interrupts like any other signal.
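Treating the timer signals like any other signal presumably means blocking and unblocking them with the signal mask rather than letting them fire and checking a flag in the handler. A minimal sketch of that idea, using only the standard sigprocmask interface, follows. The helper names are made up for illustration and this is not UML's code.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

static volatile sig_atomic_t ticks;

static void timer_tick(int sig)
{
    (void)sig;
    ticks++;
}

/* Block or unblock both timer signals, the way "interrupts" would be
 * disabled and enabled. */
int set_timer_signals_blocked(int blocked)
{
    sigset_t set;

    sigemptyset(&set);
    sigaddset(&set, SIGALRM);
    sigaddset(&set, SIGVTALRM);
    return sigprocmask(blocked ? SIG_BLOCK : SIG_UNBLOCK, &set, NULL);
}

/* Raise SIGALRM while it is blocked. Returns 1 if the handler stayed
 * out of the blocked region and ran exactly once after unblocking,
 * 0 if not, -1 on setup failure. */
int demo_blocked_tick(void)
{
    struct sigaction sa;
    sigset_t pending;
    int seen_while_blocked;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = timer_tick;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, NULL) != 0)
        return -1;

    ticks = 0;
    if (set_timer_signals_blocked(1) != 0)
        return -1;
    raise(SIGALRM);                     /* stays pending, no handler yet */
    seen_while_blocked = ticks;
    sigpending(&pending);
    assert(sigismember(&pending, SIGALRM) == 1);

    /* The pending signal is delivered before this call returns. */
    if (set_timer_signals_blocked(0) != 0)
        return -1;
    return seen_while_blocked == 0 && ticks == 1;
}
```

With the signal blocked, the tick is held pending by the host instead of being dropped or handled at a bad moment, so there is no window between testing a flag and acting on it.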
The end result is a bunch of bug fixes and cleanups which need to be
carried back to 2.4.
I'm currently merging 2.5.42, and watching an SMP 2.5.42 do a -j5
kernel build. So far, so good. However, there are some strange
crashes which I haven't figured out yet. They were pretty
reproducible for a while, then they seemed to disappear and I got 3
-j5 kernel builds in a row.
The evil thought of the day is to port UML back into the kernel. The
internal kernel interfaces can be thought of as another OS, which
would make this an OS port of UML. Why dump UML back in the kernel
after I've spent all this time pulling it out into userspace? There
are a few reasons that come to mind:
It's whacked, therefore it appeals to me and must be done
It could perform better since it has access to the full host kernel
and is not restricted to the system call interface. This is another
way of looking at UML performance and could lead to better ideas on
making normal userspace UML perform better.
It would go in the direction of allowing Linux to partition a machine
and run separate OSes on the different partitions. The host could
hand devices to the UML, which would access them using the normal
hardware drivers, something it can't do in userspace. Over time, the
"host" kernel could be trimmed down to the point where it's nothing
but a little executive which starts the partition kernels and splits
the hardware between them. Then the "guests" would be the real
machine OSes.
8 Oct 2002
Several 2.4.19 patches later, the SMP audit is done. UML/SMP seems to
be OK. My test before releasing -12 was a -j5 kernel build on a four
processor UML. That went fine. mistral is reporting a hang, but I
want to see it myself or else I'll consider that he's just making it
up :-)
I released the 2.5.41 patch today, and pushed everything out to Linus
again. I accidentally left out port numbers on the URLs I gave Linus,
so I quickly put a bkd on port 80 of my UML so that he would be able
to pull everything.
The next task is to merge the SMP stuff from 2.4 into 2.5. That's the
reason I cleared the decks on 2.5 today.
4 Oct 2002
I've started the SMP audit. The first step was to go through the code
marking all global data that took longer than .5 seconds to tell that it
didn't need to be locked. The next step will be to go through those and
add locking where necessary. I've done some of this, also adding comments
where locking is not needed.
With this, I released 2.4.19-9. I also turned off CONFIG_UML_NET_PCAP
in defconfig since the default build would break on any system that
doesn't have libpcap installed.
1 Oct 2002
Some Boston University students seem to have some intellectual
integrity problems. Apparently some class was assigned the task of
describing the differences between the virtualization approaches of
UML and VMWare. So, what do these fine students do in response? They
send me mail such as
Would you like to tell me that what are the fundamental differences in the way
virtualizition is performed in UML vs. VMware?
I am eager to know about it.
and
I'm hoping you can help me with this. I'm doing a small research project and
am trying to figure out some of the fundamental differences in the way
virtualization is performed in User-mode Linux vs. VMware.
Ummm, right.
In actual news, I released 2.4.19-8 today. I removed the limit on the
number of network interfaces a UML can have. In related work, I
cleaned up and simplified the network transport interface. This
release contains a good number of small bug fixes.
Linus merged my changes into 2.5.40, so that's up-to-date with respect
to generic kernel changes and highmem. I still have a bunch of bug
fixes waiting. I'll be pushing them plus the networking changes.
30 Sep 2002
Highmem is now working in 2.5. UML is also updated to 2.5.39, so I
sent both the 2.5.39 and highmem updates to Linus. I released the
2.5.39 patch as well.
I now have a lot of pent-up small stuff to deal with. I'll clean out
some of that, and then get on with getting SMP working.
23 Sep 2002
I finally released the 2.4.19 UML after fixing some final build bugs.
With that out of the way, I can start working on more intrusive stuff.
In the intrusive stuff category, I got highmem support working. So,
you can specify any amount of memory you want, up to 4G, and the stuff
that can't be mapped directly will become highmem.
I released 2.4.19-6 with highmem and fixes for a couple of crashes.
One of them was a subtle timer bug that shows up with an idle UML on a
loaded host, namely umlcoop.org.
I think I've got BitKeeper figured out. I'm using it for 2.5
development, and have got UML updated to 2.5.38, which patch I'm
releasing now. I have to update the repositories on umlcoop.org so
other people can pull from them. After that, the next task will be to
get my 2.4 stuff into it.
13 Sep 2002
Lots of people are happy about UML finally getting into 2.5. I've
gotten email from all over the place. It was also fairly newsworthy,
apparently. It was first picked up by kerneltrap. A pointer to that
story was submitted anonymously to usermodelinux.org, which I posted,
notwithstanding the fact that I had just posted a story of my own
about it. An almost identical submission was made to Slashdot, who
ran it. This caused more traffic to usermodelinux.org than it had
ever seen before. To my knowledge, this is the first time
that a UML had been slashdotted. There were some hiccups, but it
seems to have fared reasonably well. It helped that it wasn't a
full-bore Slashdotting.
It was also picked up tangentially by LWN. Linux Today ran a story
that just pointed off to kerneltrap.
12 Sep 2002
I released 2.4.19-3 yesterday and announced it today. I also released
2.5.34, and sent the patch to Linus.
I think I'm getting close to deciding that 2.4.19 is done. The
problems that have cropped up that I don't have fixes for don't seem
to be as serious as they first looked. The problems that I do have
fixes for will be fixed, of course.
The one remaining fix that I have is the tracing thread crash in
kmalloc. I'll put that in 2.4.19-4 and see how that looks as the
official 2.4.19 UML.
Late breaking news : Linus finally merged UML. It will be in 2.5.35.
9 Sep 2002
2.5.34 is out and UML isn't in it.
I've been playing with a UML on a colocated server bought by Bill
Stearns and a bunch of friends, and set up by Bill over the weekend.
Everyone who contributed gets a private UML with its own IP. I've
turned mine into another mirror of the UML downloads and web site. I
also put CVS on it with a view towards moving my CVS
off SourceForge. If I can get the server side of BitKeeper (which
I've downloaded, but not looked at to see whether I got only the
client side), I might also run my BitKeeper pools from there.
After this, I had an evil thought. A root filesystem with
Apache
PHP
mod_perl
MySQL or Postgresql
CVS
Mailman
A web site, or other documentation, explaining what's there and how to use it
installed would make a fine project hosting platform. Couple it with
a host and an administration that can keep it running, and you have a
good mini-SourceForge. I would certainly prefer that to SF, mainly
because of the control and flexibility I get.
In other news, I spent today fiddling the web site. A lot of time was
spent cleaning up the XML. I fixed a bunch of mistakes and got rid of
some obsolete information.
6 Sep 2002
I've spent the last few days catching up with bug fixing and releases.
I rolled a bunch of fixes in and released 2.4.19-2. I also made a
tools release with a few changes, chiefly fixing uml_moo so that it
spits out a sparse file and cleaning it up so that it's much more
understandable.
I also got 2.5.33 going. None too soon, since the natives were
starting to get restless. James McMechan and Mike Anderson both
popped up on uml-devel with the changes needed to get UML up to
2.5.33. I'm sending UML to Linus again. We'll see how it goes.
2 Sep 2002
Today's problem is SA_RESTORER. This was a hidden same-address-space
dependency. UML signal frames are basically copies of host signal
frames with all of the signal-specific information replaced. Since
the libc in UML provides a restorer that's inside UML, UML processes
use it since that information is not changed when UML constructs its
own signal frames. This works fine as long as UML and the process are
in the same address space. When they aren't, and the process' libc
doesn't provide a restorer, the process will segfault trying to call
UML's restorer.
The fix is to get a frame from the host which has the kernel's
built-in restorer, which you get by turning off SA_RESTORER in
sigaction. However, this is easier said than done, since the
sigaction entry point provided by libc specifically provides a
restorer and disables any attempt to say that there isn't. So, I
ended up calling sigaction by going through the generic syscall()
entry point rather than libc's sigaction. Doing this, plus using the
old sigaction structure that the kernel uses got me a signal frame
with a built-in restorer.
Theoretically, this should get the new UML up to a login prompt. The
signal frame code that I changed in the host UML also needs to be
applied to the guest UML. And I decided to do that as part of the
merge of the two pools. And that is going to wait a bit.
I have a bunch of bug fixes to get out, plus I need to catch up to
Linus, plus I need to get 2.4.19 out. So, I'm going to take care of
all that, and then get on with merging the re-architecting into the
main UML pool.
30 Aug 2002
Today was Signal Delivery Day. That took me a while to get right,
mostly because it was a nice day and I felt like doing stuff outside.
That done, a bug in fork reared its ugly head. It turned out that
fork wasn't returning 0 in the child, making it believe that it was
the parent. Since init is doing the double fork trick, where it
forks, the child forks and exits, and the grandchild execs whatever
needs execing, this meant that all children just exited. init
understandably got rather upset with this.
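The double fork trick, and why a wrong fork return value breaks it, can be sketched as below. This is a minimal illustration rather than init's code: spawn_detached and demo_double_fork are made-up names, and the grandchild writes one byte to a pipe where init would exec something.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run `job` in a grandchild. Returns 0 once the intermediate child has
 * exited cleanly and been reaped, -1 on error. */
int spawn_detached(void (*job)(void))
{
    pid_t child;
    int status;

    assert(job != NULL);
    child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        /* fork returned 0, so this is the child. If it wrongly saw a
         * nonzero value here, it would take the parent path below. */
        pid_t grandchild = fork();

        if (grandchild == 0) {
            job();              /* init would exec here */
            _exit(0);
        }
        _exit(grandchild < 0 ? 1 : 0);
    }
    /* fork returned a pid, so this is the parent: reap the child. */
    if (waitpid(child, &status, 0) != child)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}

static int report_fd = -1;

static void report_job(void)
{
    char c = 'g';

    if (write(report_fd, &c, 1) != 1)
        _exit(1);
}

/* Returns the byte written by the grandchild ('g'), or -1 on failure. */
int demo_double_fork(void)
{
    int fds[2];
    char c = 0;

    if (pipe(fds) != 0)
        return -1;
    report_fd = fds[1];
    if (spawn_detached(report_job) != 0)
        return -1;
    close(fds[1]);
    if (read(fds[0], &c, 1) != 1)
        return -1;
    close(fds[0]);
    return c;
}
```

If fork never returned 0, neither the child nor the grandchild branch would ever run the job, which matches the symptom of every child simply exiting.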
exec then crapped out because of a buglet in strnlen_user. I fixed
that, plus a few instances of another bug that I happened to notice.
Now, I get a getty running. It hangs in sigsuspend. I think that's
because it's sleeping for a few seconds and I haven't hooked in the
timer yet.
29 Aug 2002
Decent progress today. Last night's infinite segfault loop turned out
to be caused by clear_user not really clearing anything. So, init's
bss, which was supposed to be zeroed, contained garbage, and it
crapped out when it tried to dereference some of it.
Fixing that and one or two other user access bugs made things work a
lot better. init now runs, and starts firing off the rc scripts. The
current hold-up is signal delivery. init won't boot the system unless
it sees SIGCHLDs from exited scripts. There are only minor problems
here, so this shouldn't be a big deal.
28 Aug 2002
Today's installment of As the Codebase Churns stars the user access
macros. I implemented the rest of them when it turned out init was
getting to the first open, but bombing on a copy_from_user. That took
little time, but I spent the rest of the day chasing bugs in them.
The problem is that when you mess them up, the kernel doesn't crap out
right there. It bombs at some later point, in an apparently unrelated
way. For example, I was chasing an infinite segfault loop. More or
less by accident, I discovered that init was being started with an sp
of zero. That turned out to be because a function which sets up the
initial stack for an exec was silently failing. And that was failing
because of a bug in strnlen_user.
I *always* get strnlen_user wrong because it's defined differently
from the libc version. The libc strnlen normally returns
strlen(str). The kernel version returns strlen(str) + 1. Believe it
or not, this makes a big difference in how well (or if) your kernel
boots.
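The off-by-one is easy to see side by side. The sketch below is a userspace model of the kernel convention, not the kernel function: strnlen_user_model is a made-up name, and it cannot model the fault case, where the real strnlen_user returns 0.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Model of the kernel convention: the size of the string INCLUDING the
 * terminating NUL, or a value greater than n if there is no NUL within
 * the first n bytes. */
long strnlen_user_model(const char *str, long n)
{
    assert(str != NULL && n >= 0);
    /* strnlen returns n when no NUL is found, so adding 1 covers both
     * the "found" case (length + 1) and the "too long" case (n + 1). */
    return (long)strnlen(str, (size_t)n) + 1;
}
```

For the string "init" with a limit of 16, libc's strnlen gives 4 and the model gives 5, which is the number of bytes a caller has to copy to get the terminator as well. Code that treats one convention as the other ends up one byte short or one byte long.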
After chasing this and a few other entertaining bugs, init is opening
files, making system calls, and all those good things. Unfortunately,
it also goes into an infinite segfault loop on a ridiculous address.
This is something to chase in the morn.
27 Aug 2002
I fixed last night's hang by having UML collect a representative set
of registers, including a set of good segment register values, from a
subprocess. This is used as the starting point for new processes,
rather than the array of zeros I was using before.
With this fix in place, init gets to the point of starting to make
system calls. The new system call handling required redefining
the pt_regs structure, which required a fair amount of hacking and
slashing before it would compile again.
It now seems to be doing system calls OK, although with some
flakiness. I've currently got it up to system call number 6.
26 Aug 2002
The Great Redesign continues.
The major task of the day was redoing the copy_user macros. When the
kernel is moved from the process address space to its own address
space, the macros which copy data back and forth between userspace and
kernelspace get more complicated. Before, with kernel and process
sharing the address space, data can just be copied back and forth,
taking care to make sure that faults are handled properly, since the
userspace address may be bogus, or it may have been swapped out.
That's no longer possible when they don't share an address space, so
what you have to do is do a process virtual to physical address
mapping, and then copy the data in or out of the physical memory.
When the page isn't present, there won't be a mapping for it, and you
have to fault it in to the process address space. Then, the mapping
will be created and the data is accessible.
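The translate, fault in, then copy sequence can be modelled in a few lines. This is a toy sketch under heavy assumptions, not the UML macros: the "page table" is a flat array of pointers, toy_fault_in stands in for the real page fault path by allocating a zeroed page, and all the toy_ names are invented for illustration. It follows the kernel convention of returning the number of bytes that could not be copied.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096UL
#define TOY_NPAGES    16UL

/* Toy process address space: one slot per virtual page, pointing at
 * its "physical" page, or NULL when the page is not present. */
struct toy_mm {
    unsigned char *pages[TOY_NPAGES];
    int faults;
};

/* Stand-in for the page fault handler: map a fresh zeroed page. */
unsigned char *toy_fault_in(struct toy_mm *mm, unsigned long vpage)
{
    mm->pages[vpage] = calloc(1, TOY_PAGE_SIZE);
    if (mm->pages[vpage] != NULL)
        mm->faults++;
    return mm->pages[vpage];
}

/* Copy n bytes to process virtual address `to`, one page at a time.
 * Returns the number of bytes NOT copied (0 on success). */
unsigned long toy_copy_to_user(struct toy_mm *mm, unsigned long to,
                               const void *from, unsigned long n)
{
    const unsigned char *src = from;

    assert(mm != NULL && from != NULL);
    while (n > 0) {
        unsigned long vpage = to / TOY_PAGE_SIZE;
        unsigned long offset = to % TOY_PAGE_SIZE;
        unsigned long chunk = TOY_PAGE_SIZE - offset;
        unsigned char *phys;

        if (vpage >= TOY_NPAGES)
            break;                      /* bogus user address */
        if (chunk > n)
            chunk = n;
        phys = mm->pages[vpage];        /* virtual to physical */
        if (phys == NULL) {
            phys = toy_fault_in(mm, vpage);
            if (phys == NULL)
                break;                  /* fault could not be handled */
        }
        memcpy(phys + offset, src, chunk);
        to += chunk;
        src += chunk;
        n -= chunk;
    }
    return n;
}
```

The per-page loop matters because a copy that straddles a page boundary may land in two unrelated physical pages, either of which may need faulting in first.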
With copy_to_user and clear_user in place, it's possible to exec init
and enter userspace. I've got to the point where it has faulted in a
few pages. init then spins because I gave it a set of registers which
is all zeros, except for the ip and sp. For the normal registers,
this is fine, but things work very badly when the segment registers
are zero. In this case, it's spinning on the first data reference
because it has a ds of zero.
In other news, it turns out that I pretty much have to totally work
inside UML. Since I'm running UML inside another UML with the kernel
pool on a hostfs mount, and because hostfs doesn't consider that files
can change underneath it, I have to edit and compile through hostfs,
rather than doing that on the host. I finally installed emacs in UML
after getting sick of @!$!@%$ vi, and I'm fairly impressed by how
quick it is. I can see a little bit of interactive slowness
sometimes, but it pretty much feels the same as emacs on the host.
Builds are another question. They are noticeably slower than on the
host. This might be due to hostfs and its synchronous IO, but I'm not
sure about that.
25 Aug 2002
I'm in the throes of redoing parts of UML so that the kernel lives in
its own address space. There would be one thread per processor
running in this address space. Every UML user process would have a
separate host process. Transfers into the kernel therefore cause a
context switch from the user process to its kernel process.
It's being done this way because of Ingo's revelation on LKML that
context switches are much faster than signal deliveries. Previously,
I was thinking that the best performance would come from intercepting
system calls without context switching. I was planning on having the
kernel optionally be put in its own address space, but was thinking
that there would be a performance penalty for doing that. Now, it
seems that this is totally the best way to go.
To support this, I've added a bunch of new stuff to the host's ptrace:
PTRACE_FAULTINFO - get the fault type and address of the child's most
recent segfault. This is needed for page fault handling.
PTRACE_SIGPENDING - get the child's pending signal mask. This will be
used to work around a race between a child executing a system call and
a SIGIO being queued to it. If they happen at the same time, the
system call causes control to be transferred into the kernel, leaving
the SIGIO pending on the child. If the kernel then switches to
another process, that IO notification will be lost for an indefinite
period of time, which can cause UML to hang.
PTRACE_MMAP, PTRACE_MUNMAP, and PTRACE_MPROTECT - these manipulate the
child's address space.
PTRACE_CLONE - not added yet, but this will cause the child to call
clone. The reason for this is a little subtle. CLONE_VM threads in
UML need to be CLONE_VM on the host as well. So, the kernel process
needs to be able to create a clone of one of its children.
When a user process is running, the kernel process will be in a loop
calling wait and ptrace in order to intercept system calls. When a
process system call bumps the kernel process out of wait, it will read
the system call and execute it itself. When it returns, it will
switch back to the user process. So, a UML system call will have an
overhead of two context switches rather than the current four context
switches, one signal delivery, and one signal return.
Kernel threads don't have an associated host process.
Context switches are done with a longjmp from one kernel stack to
another.
Traps into the kernel are done in the same way as system calls. The
interrupt to the user process will be intercepted and cancelled by the
kernel process, which will execute the appropriate handler. The
performance increase will be about the same as for system calls.
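The wait-and-ptrace loop described above can be sketched with the stock ptrace interface. This is only an illustration of the interception loop, with assumptions stated plainly: count_syscall_stops is a made-up name, the child just makes getpid calls, and the tracer merely counts the stops where the real kernel process would read the registers and execute the call itself. None of the new PTRACE_* requests listed above are used.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Give up on a traced child: kill it, reap it, report failure. */
static long abandon_child(pid_t child)
{
    kill(child, SIGKILL);
    waitpid(child, NULL, 0);
    return -1;
}

/* Fork a child that makes `calls` getpid system calls while traced.
 * Returns the number of syscall stops the tracer saw (entry and exit
 * are separate stops), or -1 on error. */
long count_syscall_stops(int calls)
{
    pid_t child;
    int status;
    long stops = 0;

    assert(calls >= 0);
    child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        int i;

        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) != 0)
            _exit(1);
        raise(SIGSTOP);         /* wait for the tracer to take over */
        for (i = 0; i < calls; i++)
            syscall(SYS_getpid);
        _exit(0);
    }

    if (waitpid(child, &status, 0) != child || !WIFSTOPPED(status))
        return abandon_child(child);
    /* Mark syscall stops with bit 7 so they can be told from signals. */
    if (ptrace(PTRACE_SETOPTIONS, child, NULL,
               (void *)(long)PTRACE_O_TRACESYSGOOD) != 0)
        return abandon_child(child);

    for (;;) {
        /* Let the child run until its next system call entry or exit. */
        if (ptrace(PTRACE_SYSCALL, child, NULL, NULL) != 0)
            return abandon_child(child);
        if (waitpid(child, &status, 0) != child)
            return -1;
        if (WIFEXITED(status) || WIFSIGNALED(status))
            break;
        if (WIFSTOPPED(status) && WSTOPSIG(status) == (SIGTRAP | 0x80))
            stops++;            /* the kernel side would run here */
    }
    return stops;
}
```

The count is at least twice the number of calls, since each system call stops the child once on entry and once on exit. A few extra stops come from libc and from the final exit, so the exact figure depends on the host.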
The benefits of this are many:
UML should be noticeably faster.
jail mode is now automatic, so kernel memory protection is always on,
and is very much faster than it is currently.
processes now have the normal full 3G of address space. This makes
honeypot mode automatic (and much more believable since the UML kernel
will no longer be visible to processes). Applications which require a
lot of address space and which may have bombed out on UML as a result
will now run.
The UML kernel now has a full 3G of address space. This makes it
possible for UML to conveniently have lots of physical and/or virtual
memory since it can all be mapped at the same time.
UML can now be a normal process, rather than the strange, statically
linked, oddly loading process it is now. UML can be debugged with
'gdb linux', rather than the ptrace proxy arrangement it uses now.
gprof and gcov should work much better as well.
Many, many kludges just vanish. The shared remapping of kernel text
and data is gone, as is switcheroo. Context switching is dead simple,
and obviously race-free. Signal handling is much simpler, since there
is no more special handling of the startup of a new process. As
already mentioned, the ptrace proxy and all its nastiness goes away.
Since this requires changes in the host, and the world may not patch
all of its kernels immediately, I'm planning for UML to be dual-mode
for the foreseeable future. It will detect the support and use it if
it's there. Otherwise, it will fall back to the traditional tracing
thread. In the code, I'm planning on separating the code that
supports the two modes into separate subdirectories. This provides a
reasonable way of splitting out code that's been somewhat messy for a
while, and separates the two pieces of code from each other. It will
also make it easy to configure a single-mode UML, in case you know
that you will be running in one mode, and you don't want the code for
the other mode compiled in.
So, what's the current status? I've added the new PTRACE_* options,
except for PTRACE_CLONE, to the kernel. The arch-specific support is
only in UML right now. I'll add it to i386 when UML is up and running
on it. As a result, I'm debugging the new UML inside an old UML with
the ptrace additions. When it's working, I'll un-nest UML, fix the
i386 kernel, and see how much faster it is.
The new UML has gotten to the point of starting to fault in init.
Right now, exec is setting up init's stack and some of its data,
forcing a few of its pages to be faulted. It has not yet entered
userspace. It has created the usual collection of kernel threads,
initialized them, and switched between them. So, the in-kernel
portions of this code appear to be working well.
22 Aug 2002
I decided to flush out all the pending changes I had, so I also
released new versions of the test suite and tools yesterday.
With 2.4.19 more or less out of the way, I decided to start into a
comparatively large and risky project. So, I decided to start working
on the changes needed to move the UML kernel into its own address
space. This was discussed on the kernel list a few weeks ago. As
things stand now, this can't be done without some extra support from
the host. I'm implementing this stuff in UML now, and will move it
into the x86 kernel later.
21 Aug 2002
Not too much to report recently. I released -53 today, which contains
a few small fixes. 'jail' mode works better now with a couple of
crashes fixed. I also cleaned up the ubd driver's error reporting.
With no complaints about the larger changes that I made recently, I
think it's time to release the 2.4.19 UML. I was holding off on this
until those changes got banged on some and I was happy that they
didn't break anything. This now seems to be the case.
OK, I updated UML to 2.4.19 and released the patch. The full release
will come after I've had a chance to bang on it some.
12 Aug 2002
Linus released 2.5.31 yesterday, again without UML. However, in stark
contrast to earlier releases, the two patches to generic code were
applied. This gives me some hope that Linus will merge the main piece
of UML at some point.
I tried getting UML running on my 2.2 box on Friday before leaving for
a camping trip. The intent was that I'd leave some stress tests
running on UML for a couple of days and see what new and interesting
bugs popped up. It turned out that the bugs popped up in the process
of building and running UML. I abandoned that plan until I fixed the
problems I saw.
I knocked those bugs off today, and released the results in -52. I'll
be trying the 2.2 thing again.
Then, it'll be off to the 2.5.32 races, with yet another patch going
to Linus.
I've got 2.4.19 sitting in the background. I want the current UML to
get banged on a bit before I upgrade it to 2.4.19.
9 Aug 2002
UML development has been cleanup lately. I found and fixed a "I'm
tracing myself and I can't get out" panic. This induced me to
understand the timer flags, which let me simplify them enough that I
can understand what's going on without lots of deep thinking.
Having done that, I merged sig_handler and irq_handler_common, which
were basically the same, except for one line. Then, I merged in
syscall_handler, so I went from three copies of the kernel entry and
exit code to one.
I also discovered a few more bugs in my stress testing. These are
fixed, along with some cleanups caused by me trying to build UML on
2.2.
So, this is all released in -51.
5 Aug 2002
There is hope on the getting UML into the 2.5 tree front. Linus has
merged the changes to generic code, and most of the linkage.h patch.
The stringification I sent him apparently breaks on some compilers, so
he backed that out. I copied the syntax I used from some gcc
documentation, which I strangely can't find any more. What I do find
is somewhat different.
However, he hasn't merged the main body of UML yet. So, we'll see
what's there when 2.5.31 comes out.
There's been some discussion on the kernel list about making UML
faster. Alan floated an idea for making jail mode faster. I pointed
out some flaws in it. I floated my ideas about making address spaces
accessible from userspace, and Alan dinged them in turn. Read all
about it over at usermodelinux.org.
In a separate development, Ingo pointed out that process context
switches are much faster than signal deliveries. This opens up the
possibility of implementing jail mode by having the tracing thread also
run the kernel side of things. This will give us a fast jail mode and
speed up system calls. The only thing that's missing is the ability
of the kernel process to change the address spaces of the other processes.
In development news, I discovered a race which produced the dreaded
"I'm tracing myself and I can't get out". This prompted a cleanup of
the timer code. 2.4.19 having just come out, I was nervous about
releasing the 2.4.19 UML with something with potentially subtle
consequences like that. So, I've been testing more heavily than
usual. I added a test to the test suite which runs the rest of the
suite with jail mode enabled. The reason for that is that jail is
sensitive to bugs in how the timer is handled.
2 Aug 2002
Linus released 2.5.30 and UML again got dropped on the floor. So, I'm
in for another round of patch generation.
2.5.30 was fairly easy. There were some block layer changes which
broke the compile, but it was simple enough to figure out what had
happened and fix the ubd driver.
30 Jul 2002
-48 is out. I reproduced the crash that mistral saw when he killed
console xterms. I turned on slab poisoning and it became 100%
reproducible. It took me a surprisingly large number of attempts to
fix it. So, that fix is in, plus the fix for hostfs compilation
failure that everyone seemed to think was a typo.
28 Jul 2002
I announced UML 2.5.29 to the kernel list and sent UML to Linus again
for him to ignore. I had redone include/linux/linkage.h in order to
clean it up and remove it from the list of generic files that UML
changes. Keith Owens noticed that and suggested an improvement to it,
which I will add the next time I send UML in.
I got hppfs to the point of allowing an outside script to generate
dynamic /proc content and to filter the real /proc file. That's the
bulk of the functionality that's needed. Then, it will be time to
flesh out the remaining file and inode operations so that everything
works as expected.
27 Jul 2002
Linus released 2.5.29 last night and guess what's not in it. Right.
The patch itself contains no particular arch changes, so it looks
fairly simple. I'll release the UML 2.5.29 today probably. I
released 2.4.18-46 today. The main feature is start of hppfs, the
Honeypot procfs. It's far from done, but works enough to allow proc
files inside UML to be replaced with versions on the host. This by
itself is enough to make a fairly convincing honeypot. I still have
to complete the file operations, so that everything works as
expected. I also have to add support for allowing something on the
host to generate files, either from scratch or from the contents of
the UML proc file.
26 Jul 2002
I sent the latest UML to Linus. We'll see if it gets in 2.5.29. If
not, I'll just keep sending it until he gets sick of the bandwidth
that it's consuming...
In actual work, I started implementing the Honeypot procfs
filesystem. This is designed to cut the heart out of the problem of
making a UML honeypot look like a physical system. The largest part
of the problem is stuff in /proc. Look in places like /proc/cmdline,
/proc/interrupts, and /proc/cpuinfo to see why.
My plan is to implement another filesystem which creates a poor-man's
overlay over the real /proc. It will have two sources of information
- the real /proc and a shadow /proc on the host. When the user inside
UML looks at a file in /proc, this new filesystem will check for the
corresponding file in the shadow hierarchy on the host and use that if
it's there. Otherwise, it will just call into the UML /proc. So, by
sticking stuff in the hierarchy on the host, the admin will be able to
override selected pieces of the UML /proc.
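As a rough illustration of that lookup rule (this is not the hppfs
code, which lives in the kernel), the shadow-first fallback can be
modeled in a few lines of Python. The function names and the
temporary directories standing in for the shadow hierarchy and the
UML /proc are made up for the sketch.

```python
import os
import tempfile


def resolve_proc_file(name, shadow_root, proc_root="/proc"):
    """Return the path that should back a lookup of `name` in /proc.

    An entry in the shadow hierarchy wins if it exists; otherwise the
    lookup falls through to the real /proc.
    """
    rel = name.lstrip("/")
    shadow_path = os.path.join(shadow_root, rel)
    if os.path.exists(shadow_path):
        return shadow_path
    return os.path.join(proc_root, rel)


def demo():
    """Build a fake /proc and a shadow tree, then read two files."""
    with tempfile.TemporaryDirectory() as real, \
            tempfile.TemporaryDirectory() as shadow:
        for name, text in (("cmdline", "ubd0=root_fs"), ("cpuinfo", "UML")):
            with open(os.path.join(real, name), "w") as f:
                f.write(text)
        # The admin overrides only cmdline on the host side.
        with open(os.path.join(shadow, "cmdline"), "w") as f:
            f.write("ro root=/dev/hda1")
        results = {}
        for name in ("cmdline", "cpuinfo"):
            with open(resolve_proc_file(name, shadow, real)) as f:
                results[name] = f.read()
        return results
```

In the demo, cmdline comes from the shadow tree and cpuinfo falls
through to the stand-in for the real /proc.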
25 Jul 2002
2.5.28 is out. The big news is that UML again got dropped. So, I get to
update and send it in again. The big change this time around is all
the irq changes. They are going to cause this UML update to be more
troublesome than usual. Global irq disabling and enabling is gone,
which overall is a good thing. I would hate to have to implement them
in UML. On an SMP UML, that would require sending an IPI around to
all the other processors, and not continuing until they had all
confirmed that they had fiddled their interrupts appropriately. That
would absolutely suck, performance-wise.
Rather than implement that, I would have just never allowed interrupts
to be handled by any processor other than processor 0. Then local
interrupt enabling and disabling would be equivalent to global enabling
and disabling. With the global versions gone, it will be possible to
distribute interrupts around the processors of an SMP UML.
In other news, hostfs in 2.5 is broken. I fixed a few bugs, but there
are more remaining.
23 Jul 2002
-43 is out. The big news here is that it has SCSI support.
Currently, there is only the scsi_debug driver, which operates in
memory. I'm scheming to split the file I/O code from the ubd driver,
leaving behind an interface to plug it back into, and doing the same
to scsi_debug. This would allow the ubd file and COW code to be
plugged into the SCSI subsystem, and allow the in-memory device from
scsi_debug to be used as a ubd device.
Another cute thing is /proc/mconsole. It is created if UML is booted
with an mconsole notification socket. Anything written to it will be
sent out to whatever is listening to that socket as a notification.
22 Jul 2002
I sent the latest UML to Linus. We'll see how this one goes.
21 Jul 2002
Linus released 2.5.27 yesterday and it did not contain UML. Oh well.
So, I'm having another go at it. I released -42 today, which is
mostly cleanups and bugfixes to -41. There are also some driver build
changes from Henrik.
I dropped this into 2.5.27 without too much trouble. So, I'll package
it up and send it in to Linus again tomorrow.
18 Jul 2002
I released 2.4.18-41 today. This included a new way of setting up
xterms which gets UML out of the business of allocating
pseudo-terminals for them, which has been a source of trouble for a
while. As a side-effect, the terminal emulator is now configurable.
As another side-effect, it should now be possible to easily run and
control the UML debugger from a script, like the UML test suite.
I also merged a bunch of the build cleanups from 2.5. Some of my
Makefiles were seriously obsolete, and now they are slightly less so.
I made a tools release, which adds a jail kit. This contains the
tools needed to run UML as a non-privileged user inside a tight chroot
jail.
17 Jul 2002
I announced 2.5.26 to the world today. Having done that, I split the
patch into two pieces, one containing changes to generic files and one
containing everything under arch/um and include/asm-um. This is the
way Linus said he liked to get ports when I asked him about it in
Ottawa.
Missing from this for now are hostfs, the tty logging patch, and the
page validation patch. I'm going to send hostfs in separately. The
tty logging patch will require advice and consent from whoever owns
the tty driver (and is willing to admit to it). And the page
validation patch will have to go in separately as well, considering
the reaction to it when I brought the topic up on LKML.
It's all off to Linus (and the generic piece was cc-d to LKML). We'll
see if this fares any better than my previous attempts.
16 Jul 2002
Linus released 2.5.26 today, so I updated. No problems with it.
There weren't any arch changes in this version.
15 Jul 2002
2.5.17 and 2.5.18 required some include tweaking because the tlb
flushing stuff moved out of pgalloc.h. Other than that, there were no
problems.
2.5.19 moved some stuff from arch code into generic code, which is
usually a good thing.
2.5.20 renamed the swap entry access macros, which was no big deal.
2.5.21 also posed no problems.
2.5.22 introduced some large build changes which broke some of my more
antiquated Makefiles. After some modernizing, they started working
again. There were also some kdev_t changes which broke hostfs.
The hotplug CPU changes in 2.5.23 required some minor changes. 2.5.24
did nothing but move sys_pause from arch code to generic code. 2.5.25
rearranged page fault handling a little, made a small build change
that had spectacular results, and changed sys_sched_yield to plain old
yield. After I got it to build, there turned out to be a division by zero
problem caused by the HZ changes. jiffies_to_clock_t is defined as
x / (HZ / USER_HZ), which blows up when HZ < USER_HZ. UML HZ was 52,
while USER_HZ is 100. I contemplated fixing this by turning the
nested division into a multiplication, but that would have just made
the calculation vulnerable to overflows. So, I just bumped the UML HZ
up to 100.
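The arithmetic can be checked with a small Python model. It assumes
the macro has the nested form x / (HZ / USER_HZ) that the description
implies, uses // to stand in for C integer division, and assumes a
32-bit unsigned jiffies value for the multiply-first variant. The
function names are invented for the sketch.

```python
USER_HZ = 100          # clock ticks per second as seen by userspace


def jiffies_to_clock_t_nested(x, hz, user_hz=USER_HZ):
    """Nested form x / (HZ / USER_HZ), with C-style integer division."""
    return x // (hz // user_hz)      # hz // user_hz is 0 when hz < user_hz


def jiffies_to_clock_t_mul(x, hz, user_hz=USER_HZ, bits=32):
    """Multiply-first form (x * USER_HZ) / HZ, product wrapped to `bits`."""
    product = (x * user_hz) & ((1 << bits) - 1)   # unsigned wraparound
    return product // hz


def nested_form_blows_up(hz, user_hz=USER_HZ):
    """True if the nested form divides by zero for this HZ."""
    try:
        jiffies_to_clock_t_nested(1, hz, user_hz)
    except ZeroDivisionError:
        return True
    return False
```

With hz=52 the inner division is zero, so the nested form fails for
any x. The multiply-first form works for small x but wraps once
x * 100 no longer fits in 32 bits, which happens after about 43
million jiffies.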
14 Jul 2002
On to 2.5.8. Nothing major, a couple new system calls, and some
header file rearrangement. It compiled and booted without too much
trouble.
2.5.9 was no problem at all. I fixed a glitch or two, built it, and
booted it. So were 2.5.10 through 2.5.13.
2.5.14 was more interesting. The 2.5 scheduler bug showed up here
with a vengeance while it hadn't with anything previous. The problem
is that the O(1) scheduler calls the arch switch_to with interrupts
blocked. With UML, this means that a SIGIO can arrive before SIGIO is
forwarded to the incoming process, and it will be pending in the
outgoing process until it is scheduled again. This may never happen
because that SIGIO could be a disk IO completion that is necessary for
anything to run again. So, I applied my old fix, which makes the
problem disappear. This checks for pending SIGIO after the
forwarding, and does an explicit kill(next_pid, SIGIO) if there is one
pending.
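The check-and-resend step can be sketched from userspace with
Python's signal module. This is only a model of the idea, not the UML
code: the forwarding of SIGIO ownership itself is not shown, the
function name is made up, and the demo uses the calling process as
both the outgoing and the incoming side.

```python
import os
import signal


def resend_pending_sigio(next_pid):
    """If a SIGIO is stuck pending on the caller, pass it on.

    Returns True if one was pending and was re-sent to next_pid.
    """
    if signal.SIGIO in signal.sigpending():
        os.kill(next_pid, signal.SIGIO)
        return True
    return False


def demo():
    """Block SIGIO, let one arrive, and show that the check sees it."""
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGIO})
    try:
        before = resend_pending_sigio(os.getpid())
        os.kill(os.getpid(), signal.SIGIO)          # arrives while blocked
        after = resend_pending_sigio(os.getpid())   # "next" is ourselves
        signal.sigwait({signal.SIGIO})              # drain the pending signal
        return before, after
    finally:
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
```

The demo drains the signal with sigwait before restoring the mask,
since the default action for SIGIO is to kill the process.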
tmpfs fails to mount because the superblock allocation is somehow
failing with -ENOMEM. And the whole boot is a mess, with -EIO
appearing all over the place and /proc failing to mount for some
reason.
It turns out that the filesystem was messed up. And it turns out that
a little message that I was seeing was one I put in to make sure
I checked some code that had never run. It started running now
because of changes elsewhere in the kernel, and it turns out to be
buggy in such a way as to cause data corruption. So, with that fixed,
I get nice clean boots, except for a panic caused by the failed tmpfs
mount.
And that turns out to have been caused by an incomplete merge of the
memory stats change. Those stats are now kept in generic code rather
than in the arch. tmpfs calculates the number of available inodes
from the number of pages of available memory. I forgot to delete the
UML declaration of totalram_pages, which meant that the generic value
stayed at zero. Fixing this eliminates the mount failure and the panic.
With that all settled, I move on to 2.5.15.
2.5.15 is uneventful. The only thing of interest was a little signal
delivery bug fix which I had spotted when it was first sent to LKML
and which I already had in my pool. After the patch went in, UML just
compiled and booted.
2.5.16 made jiffies go away as a normal variable. The other arch link
scripts all create jiffies as a symbol aliased to jiffies_64. This
booted after adding that to the UML link script and removing the
CONFIG_SMP from around the definition of mmu_gathers.
13 Jul 2002
Continuing whacking away on 2.5.5. After fixing page table things
that I had messed up, it finally booted.
2.5.6 adds a new system call and changes the interface to blk_ioctl.
Fixing these up, and disabling jffs2, which was broken, resulted in a
working UML.
2.5.7 added sys_futex and applied the fs.h crapectomy to a bunch of
filesystems. nfsservctl is now apparently an optional system call, so
the system call table needed fixing for that. With those fixes, it
builds and boots.
12 Jul 2002
On to 2.5.5. This one was mostly page table changes. There were
interface changes and some new interfaces added. Someone decided it
was a good idea to get rid of the little caches for pgdirs and page
tables. I agree, since specialized little caches like that hold on to
memory that should be available to the rest of the system.
I also discovered that hostfs needed to be updated because of the fs.h
crapectomy that happened in 2.5.3. I don't know why I didn't see this
before. Basically, in order to eliminate the header file horror show
in fs.h caused by the inode union needing to have an entry for every
possible filesystem, the filesystem-specific data now includes the
inode rather than the inode containing the filesystem-specific stuff.
11 Jul 2002
I decided to get on the stick and start getting UML going with 2.5.
My 2.5 pool had 2.5.3-pre5 in it, so that's where I started. I
grabbed all the patches from there to 2.5.25 and started whacking
away.
I got 2.5.3 and 2.5.4 compiled and booting. I put off further testing
in the interest of making progress through 2.5. I am generating UML
patches along the way, and I'll exercise them more heavily later on.
The major work needed for 2.5.3 was the block driver. It needed
updating for the bio changes. Nothing major, except it confused read
and write requests, resulting in a trashed COW file when it tried
reading the superblock. I didn't realize this for a while, and was
trying to figure out why it wasn't reading the superblock even after I
fixed the read/write confusion.
2.5.4 changed how task_structs are allocated. They used to be at the
bottom of the kernel stack. Now, there is a minimal thread_info
struct there, and the main task structure is allocated by kmalloc just
like everything else. This took a fair amount of time to fix enough
so that UML would compile. Once that happened, I had to chase down a
few bugs that assumed that the current task_struct was at the bottom
of the stack.
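The kind of assumption that broke can be modeled as follows. The
sketch assumes i386-style 8K-aligned kernel stacks, where the current
task is found by masking the stack pointer; the dictionaries standing
in for memory and all the names are hypothetical.

```python
THREAD_SIZE = 8192     # assumed: two 4K pages per kernel stack, as on i386


def stack_base(sp, thread_size=THREAD_SIZE):
    """Round a stack pointer down to the start of its aligned stack."""
    return sp & ~(thread_size - 1)


def current_old(sp, task_structs):
    """Old layout: the task_struct itself sits at the stack base."""
    return task_structs[stack_base(sp)]


def current_new(sp, thread_infos):
    """2.5.4 layout: a small thread_info sits there and points at the
    separately allocated task structure."""
    return thread_infos[stack_base(sp)]["task"]
```

Code written against the old layout finds a thread_info where it
expects a task_struct, which is the class of bug described above.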
10 Jul 2002
I started releasing rapid-fire patches. I just finished with -40.
There's been some more stuff moved under the OS interface. Also a
bunch of fixed bugs. I discovered that some gdb features that I
thought were working weren't. So, I fixed them. I also discovered
that closing a terminal at one end doesn't cause a SIGIO at the
other. So, I added that to the SIGIO emulation.
6 Jul 2002
Harald Welte is now winging his way back to Germany. He had been
staying with me for the week after KS and OLS. I fear his most
lasting impressions of lovely New England will be the insects... Oh
well.
I decided to start thinking about making UML OS-portable so that
Chandan Kudige can start merging bits of the Windows port. I created
a new directory to hide Linux specific code in and defined an
OS-independent interface for it (and other OS ports) to implement.
It's fairly rudimentary right now, containing a handful of file and
process operations, but the overall intent should be fairly clear.
So, with that, I released -37. It consists almost solely of the code
reorganization (the exception being an updated config.release).
2 Jul 2002
I'm back from the Kernel Summit and OLS. In KS news, it turns out
that Linus likes UML. Alan apparently clued him in that people
are using UML for real work, and they care about performance. He has
no problem with exposing address spaces to processes, nor with having
UML doing an address space switch in conjunction with a signal
delivery. This would essentially be sigaltstack_mm, with an mm switch
as well as a stack switch. This would have a number of advantages,
including making 'jail' mode trivial, cleaning up UML, speeding up UML
context switches, and giving UML a larger virtual address space.
In OLS news, my talk went pretty well. I described the major things
that I want in order to make UML run better, and there were no
arguments with them. They were all non-controversial, so it looks
like they're going to happen. The downside is that there is a 2.5
function freeze for Halloween this year, so it all needs doing by
then.
I departed from tradition slightly and annotated the slides before the
talk. Since this was a brand-new talk (so brand-new that I finished
it 15 minutes before the talk, which gave me just enough time to get
over to the conference from the hotel), I did the notes so I'd have
some idea what I should cover on each slide. I couldn't look at the
notes during the talk, but just doing them helped me remember what I
wanted to say for a given slide. This also has the advantage that
they can be put on the UML site immediately, rather than when I get
around to annotating them. So, here they are.
A few other UML tidbits:
There was lots of demand for a 2.5 UML. This, combined with the
Halloween deadline, means I need to get moving on this somewhat soon.
The UML swsusp patch apparently sort of works. This surprised me,
since the last news I had was that it didn't resume properly. So, I
probably should look at it and see about integrating it into UML.
Michael Richardson of the FreeS/WAN project did a talk on using UML as
a testbed for regression testing. It was fairly interesting and
well-received. He demoed it with four UMLs running on a FreeS/WAN
server someplace outside the conference. It went well, although the
connection was a bit slow. Richard Briggs was in the audience and was
running a six UML testbed on his laptop, which he offered as a demo
when the connection to the external demo was looking a bit iffy.
Bert Hubert almost used UML to demo something during his talk. He
ended up borrowing another laptop and using it instead, since that setup
would be somewhat more authentic than using UMLs. Oh well.
21 Jun 2002
-33 turned out not to build unless CONFIG_TTY_LOG was enabled. So, I
released -34 a few hours later with the fix.
Today, I'm going to release -35 with gdb stack switching implemented.
This lets you detach from the thread that's currently in context,
attach to an out-of-context thread and look at its stack. This makes
it a lot easier to debug deadlocks since you don't have to manually
reconstruct stack traces from hex dumps of the stacks. I, of course,
have gotten used to the hex dump approach, so I didn't see what the
problem was. Other people did have problems with it, including Peter
Braam and Cluster File Systems, who offered a small bounty for this
feature. And, now that it's here, I do have to admit that it's pretty
handy.
This will be the last UML release for a couple of weeks since I'm
getting ready to head up to Ottawa for the Kernel Summit and OLS.
Those will be next week, and I've got Harald Welte visiting the week
after, which means that work might be thin for that week as well.
18 Jun 2002
Last night, I added /proc/exitcode, which allows a process to set
UML's exit status. This is useful for one-shot UMLs, which run one
thing and exit. It lets the thing running inside UML export its exit
status to the outside caller.
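A one-shot wrapper along these lines might run inside the UML. It is
a sketch only: it assumes /proc/exitcode accepts the status written
as a decimal number, and the function names are invented. The demo
writes to a temporary file instead so it can run anywhere.

```python
import subprocess
import sys
import tempfile


def run_and_export(argv, exitcode_path="/proc/exitcode"):
    """Run one command and record its exit status for UML to exit with."""
    status = subprocess.call(argv)
    with open(exitcode_path, "w") as f:
        f.write("%d\n" % status)
    return status


def demo():
    """Run a command that exits with 3 and show what was recorded."""
    with tempfile.NamedTemporaryFile("w+") as fake:
        run_and_export([sys.executable, "-c", "raise SystemExit(3)"],
                       fake.name)
        return fake.read().strip()
```

The caller on the host then sees the UML's own exit status and can
treat it as the status of the job that ran inside.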
So this will be -33. It'll be available as soon as I get the web site
build process all put together again.
17 Jun 2002
I spent a couple fruitless days looking for whatever is causing JVMs
to crash under UML. It appears to be related to signals somehow,
since the thread that segfaults is always returning from sending
another thread a SIGRTMIN. It first appeared that the other thread
was provoking the crash, since there was always a context switch from
the sending thread to the receiving thread before the sending thread
came back into context and returned from the kill. However, I
suppressed context switching in that case, and it still died on return
from the kill.
The code that's segfaulting appears to be generated by the JVM rather
than being the JVM itself. It is one of very many small pieces of
code pointed to by a large table. Each of these is about 6
instructions long, starting by dereferencing a pointer, doing one or
two instructions worth of work, calculating an index, and jumping to
whatever code is pointed to by that entry in the table. The crash is
coming at the very start of one of these blocks. The register that
it's dereferencing contains zero, which is bad.
I decided to put that off for the moment and do some other things.
So, I redid the web site build. My major complaint was the speed at
which it rebuilt. The big culprit was all of the changelogs that have
accumulated. They are all generated from a large and growing XML
file, with each entry getting its own web page, and requiring the
processing of the entire file to generate. Needless to say, this is
an O(n^2) operation, and I was well within the quadratic regime.
So, I redid it, adding some infrastructure that allows rebuilding of
files only when they've really changed. So, the fact that
changelog.xml changed does not cause the rebuilding of all of the
changelog-*.html files, even though their source file changed.
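One way to get that effect, shown here as a sketch rather than as what
the site build actually does, is to regenerate the output in memory
and only touch the file when the content differs. Comparing contents
is an assumption, and the function name is made up.

```python
import os
import tempfile


def write_if_changed(path, new_text):
    """Write new_text to path only if it differs from what is there.

    Returns True if the file was written. Leaving an identical file
    alone keeps its timestamp old, so nothing downstream is rebuilt.
    """
    try:
        with open(path) as f:
            if f.read() == new_text:
                return False
    except FileNotFoundError:
        pass
    with open(path, "w") as f:
        f.write(new_text)
    return True


def demo():
    """Write a page three times; only real changes touch the file."""
    with tempfile.TemporaryDirectory() as d:
        page = os.path.join(d, "changelog-1.html")
        return [write_if_changed(page, "<p>-32</p>"),
                write_if_changed(page, "<p>-32</p>"),
                write_if_changed(page, "<p>-33</p>")]
```

The second call in the demo is the interesting one: the source is
processed again, but the unchanged page is left alone.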
I also tidied up the dependencies and reorganized the pool itself to
make it a bit cleaner.
In actual development news, I integrated the honeypot tty logging
patch into the main pool, along with a little patch I got from
Geoffrey Hing. I also added the ability to log to a preconfigured
file descriptor. This is for the benefit of chrooted UMLs, so they
can log to a file outside the jail.
11 Jun 2002
Hmmm, long time no diary entries. Well, I've been busy. I've started
getting regular offers of contracts to do UML-related things and I've
been accepting some of them. This slows down the pace of UML
development, unfortunately, but it does tend to fatten up the old bank
account, which has to be considered a good thing.
Recent development has featured a couple common threads. One is
moving UML into its own memory. Before, UML would allocate a physical
memory region which did not include its own binary. This effectively
meant that UML text, static data, and heap were not in physical
memory. This caused problems for the swsusp effort on UML because it
wants to copy physical memory out to disk. Fixing that took several
patches to get right, but it seems to be fine now.
Another theme has been openpty. This started with some mysterious
segfaults seen by a few people which ultimately were tracked down to
calls to openpty that UML was making. openpty has a larger than usual
stack frame, so when it runs on a kernel stack, it overflows it and
corrupts whatever lies just below.
My first attempt to fix it put it in a separate thread so that the
pages it used would be COWed and it wouldn't change any UML memory.
This was stupid because UML memory is mapped MAP_SHARED exactly to
defeat COWing. Attempt number two involved having the tracing thread
run openpty. This was good because the tracing thread has a proper
stack. Unfortunately, openpty can call malloc, which gets converted
into kmalloc, which is a very bad thing to call from the tracing
thread. So, my current code goes back to the original mechanism,
except that a larger than usual stack is allocated in this case. This
should be OK now.
With these things settling down, I'm about ready to release -32, and
I'm thinking that a full release would be good to do soon, as well.
Maybe later this week or over the weekend.
17 May 2002
A week or so ago, Steve Freitas kindly assembled and sent me a nice
little SMP box so I could chase the host SMP bug that UML is
exercising. I got around to fitting it into my environment - it
caused my network to outgrow my crossover cable, so I invested in a
switch and a bunch of cables. I've got it on the net with a serial
line console running to my desktop box, which is all running fine.
UML also runs fine on it, which is disappointing. It'll be hard to
find the host bug if I can't reproduce it.
In development news, I did a bunch of work on ptrace, which forms the
bulk of -26, which is released today. Watchpoints now work in gdb
inside UML. Kernel watchpoints don't work yet, but they will soon. I
fixed a couple other ptrace bugs, including one which could be used to
break out of UML.
11 May 2002
The big UML news is that it is now self-hosting. This is more a
demonstration of UML maturity than something that's very useful. It's
still nice to be able to do, though. There's a description of how to
do it here.
UML development has been concentrating on fixing bugs, as usual. The
last few releases have mostly consisted of small fixes.
The test suite received an overhaul. It is now willing to build UMLs
according to the needs of the tests. So, tests can now specify how
they want UML to be configured, and the suite will build a UML if
necessary.
28 Apr 2002
I'm back to knocking items off my todo lists. At this point, I'm down
to about 75. The recent victims are mostly small items that had
accumulated with some other bugs that people confirmed had already
been fixed.
I released -21 today after getting rid of most uses of tracing_cb,
which UML threads use to request the tracing thread to create
processes for them. Having UML threads do it themselves exposes that
code to the UML gdb, as well as getting one step closer to having
miscommunication with a helper hang only the thread that started it
rather than the whole UML. To get there also requires that input from
helpers be handled asynchronously rather than synchronously as is the
case now.
25 Apr 2002
This was a fairly lazy week. I went down to West Virginia at the
invitation of David Krovich to give a talk at WVU. That went fairly
well - the chairman of the CSEE department was apparently impressed by
the number of students who attended. After dinner, I gave another
talk at a MORLUG meeting (MORLUG == Morgantown LUG) which consisted of
me firing up UMLs and demonstrating various neat things you can do
with them. This also seemed to go well.
In extra-curricular activities, we tried to go hiking on Sunday, but
got rained on heavily. It looks like nice country if only the clouds
would get out of the way so you could see something.
In UML news, I spent some time fixing the iomem support. It was
broken in such a way that I had a hard time believing it ever worked.
What I now think is more likely is that the VM system changed in a way
that broke iomem, but no one noticed. The problem was that the VM
system deals largely with page structs instead of raw page addresses
and the iomem regions had no sane mem_maps, so they had no page
structs.
I fixed this by changing the infrastructure to allow for segmented
physical memory consisting of regions which have their own separate
mem_maps. This will allow for plugging and unplugging of iomem
regions. I was hoping this would work for physical memory regions as
well (plugging anyway; unplugging is harder), and after a bunch of
failed experiments, decided that this wasn't going to be.
After getting this working, I decided to release 2.4.18-19. It also
contains James McMechan's partitioned device support and a bunch of
smaller bug fixes that were noticed by various people.
15 Apr 2002
More bug-bashing. My lists contain a total of 93 items, but a large
number of them are about to die. The umlgdb expect script from
Chandan Kudige will knock off the two related to reloading module
symbols. When I verify that UML can boot as a diskless client, that
will get rid of five more. James McMechan's ubd partition patch is
currently accounting for six items. I should be able to knock all
those off in the next couple of days.
I released an RPM last Tuesday to get the accumulated changes out to a
wider audience. I think I'm going to start releasing an RPM every two
or three weeks from now on. 2.4.x releases are too far apart for me now.
6 Apr 2002
I spent the last few days on a bug-smashing spree. To-do items had
accumulated at an alarming pace over the last couple weeks, so I
decided to whack away at them a bit. I had over 110 items on my
various lists, so I knocked off as many as were easy to kill. I now have
99 items, so I got rid of more than 10 of them.
Prominent among them are
floating point registers not being available to gdb inside UML or
stored in core files
hostfs not being able to create unix sockets
the daemon transport now gets its MAC from uml_switch
if the umid is set on the command line, it is put into host process
names and into xterm title bars
There were some small patches from mulix, Sapan, and Daniel Phillips,
which all did useful things.
With that, I released 2.4.18-14 and uml_utilities_20020406. This may
be the basis of another full release (with an RPM instead of just a
patch). I need to do that soon since the 2.4.18 UML is getting fairly
old at this point.
3 Apr 2002
In the nearly two weeks since the last entry, I've pretty much knocked
off the console flow control bug (the one remaining piece is to make
sure that it works correctly when ptys deliver output SIGIO) and the
init hang on older machines. These are two of the oldest UML bugs.
The init hang is caused by init executing a cmov on a processor that
doesn't support it. What I ended up doing (at the suggestion of Alan
Cox who saw this on one of his boxes) is detect cmov support by
looking at /proc/cpuinfo, then panic if init gets a SIGILL on a cmov.
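The detection half of that is simple enough to sketch. This is a
userspace model in Python, not the UML code, and the function names
are invented; it only looks for cmov on a "flags" line of the host's
/proc/cpuinfo.

```python
def host_has_cmov(cpuinfo_text):
    """Return True if any 'flags' line in cpuinfo text lists cmov."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags" and "cmov" in value.split():
            return True
    return False


def read_host_cpuinfo(path="/proc/cpuinfo"):
    """Read the host's cpuinfo so it can be passed to host_has_cmov."""
    with open(path) as f:
        return f.read()
```

Matching whole words on the flags line avoids false hits from other
fields that happen to contain the string.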
In other news, I got all of Bill Stearns' bootable filesystems over to
SF, so the links to them are likely to work. They're not on
ftp.nl.linux.org yet though.
22 Mar 2002
I've decided that I'm out of major design changes I need to make and
can now concentrate on knocking items off the todo list. The one
published on the site has 44 things on it. I also keep a todo mail
folder containing pieces of mail that describe something well enough
that I can keep track of it. That folder contains 54 messages right
now, so I've got almost 100 things to do. Some are duplicates, and
some are already fixed and I haven't figured it out yet, so the actual
number is somewhat smaller.
My latest run of the test suite succeeded in booting all of the
filesystems available from the UML site, which is probably the first
time that has ever been true.
15 Mar 2002
I discovered a remnant bug. 'strace -p' didn't work. Fixing that was
easy enough, but I decided to clean up some code while I was in it.
That took more work than I expected, since it involved saving state in
the thread structure in signal handlers, and when you forget to
restore the old state, you get very obscure misbehaviors. I ended up
backing out the changes and putting them back in one line at a time
before it dawned on me that perhaps restoring the old stuff would be a
good plan.
So, I released 2.4.18-7 today. I also released another version of the
test suite. This one allows tests to be interactive and for the perl
test driver to interact with them.
In separate news, my paper proposal for OLS was accepted. So, it
looks like I'm on the hook for another paper.
14 Mar 2002
After much surgery, I finished the pt_regs to sigcontext work and put
UML back together. There were surprisingly few bugs to chase once it
compiled again. I fixed around three bugs, and haven't found any more
since.
I did some more work on the test suite. There is now a kernel build
test, for which I wrote a perl mconsole client. This prompted me to
add the option of the mconsole driver sending the name of its socket
to a socket specified on the command line. This allows scripts to
find out where to send mconsole commands without having to parse the
boot log. It also tells them when UML has booted to that point.
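From the script's side, the pattern looks roughly like this. The
sketch uses a Unix datagram socket and treats the message as a bare
path. Both are simplifications: the real driver's message layout is
not reproduced here, and the names are hypothetical. A stand-in
socket plays the part of UML.

```python
import os
import socket
import tempfile


def open_notify_socket(path):
    """Create the socket whose name is handed to UML on its command line."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    sock.bind(path)
    return sock


def read_notification(sock, timeout=30.0):
    """Block until something arrives; that also means UML got this far."""
    sock.settimeout(timeout)
    data, _sender = sock.recvfrom(4096)
    return data


def demo():
    """Bind a listener, fake a notification, and read it back."""
    with tempfile.TemporaryDirectory() as d:
        notify_path = os.path.join(d, "notify")
        listener = open_notify_socket(notify_path)
        # Stand-in for UML: send the (made-up) mconsole socket name.
        fake_uml = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
        try:
            fake_uml.sendto(b"/tmp/uml/mconsole", notify_path)
            return read_notification(listener)
        finally:
            fake_uml.close()
            listener.close()
```

The script creates its socket before starting UML, so the
notification is queued even if it arrives before the script gets
around to reading it.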
This test also exposed a bug in the mconsole driver which caused
commands not to be NULL-terminated.
7 Mar 2002
Bad news. I decided to do a UML-sucks/rocks-ometer on Google. Well,
"sucks" beats "rocks" 125-83. To make things worse, the first few
"sucks" entries are me.
4 Mar 2002
I made the 2.4.18 announcement on freshmeat today. That is the final
piece of this release.
I have known for a long time that UML is susceptible to bus errors in
random places in the code if it touches memory that it had previously
mmapped, but the host can't back with physical memory. This turned
out to be a lot easier to trigger than I expected. I run with tmpfs
mounted on /tmp for speed reasons. It has a maximum size of 1/2 RAM
by default, which is 128M for me. I got a UML to hit that limit and
crash with a bus error.
So, I posted an RFC to lkml asking for comments on my proposed
solution which was to add a hook to __alloc_pages to allow the
architecture to touch pages before they're returned to the caller.
Physical architectures have no need of this because they have a known
amount of physical memory and they know it's not going anywhere. UML
doesn't, so it needs to assure itself that an allocated page is real
and that accesses to it won't fault.
What I got was an argument with Alan Cox who persistently doesn't
understand what I'm talking about. He has appeared to believe that
I'm trying to get good behavior when the system is out of memory and
good behavior is impossible, that I want the host kernel to allocate
memory as soon as the address space is allocated, and a bunch of other
things that I don't begin to fathom. Peter Anvin hopped in briefly
with what look like similar problems.
So, I implemented what I wanted, and sent the patch in. Hopefully
that will clear out the confusion.
In actual development news, I'm converting UML from storing register
information in pt_regs structs to storing it in sigcontext structs.
What I'm doing is making UML return to userspace by way of host signal
returns rather than having the tracing thread teleport it back with
PTRACE_SETREGS. The main advantage is that it is guaranteed to
correctly restore floating point state, which has worried me for quite
a while. It has other advantages, like removing code from the tracing
thread, which will make it easier to finally eliminate at some point.
1 Mar 2002
I finished the release of 2.4.18 today with the bare kernel and the
RPM. No deb since Matt is taking care of that for me now.
I cleaned up the test driver some more. The local configuration
information is now stored in ~/.umltest, and it is possible to pass in
configuration options on the command line. This is in preparation for
automatically running the tests whenever a new patch is ready.
25 Feb 2002
I released the 2.4.18 UML patch today. It was a no-effort update,
except that there were a bunch of placeholder entries for
the extended attribute system calls.
I'm going to sit on the full release for a few days while I work up a
new test suite and harness. Hopefully, I can get something automated
set up so that whenever I release a patch, it will get banged on
without me having to do it by hand.
24 Feb 2002
I think I'm emerging from signal delivery hell. Everything seems to
work again with the exception of a memory corruption problem which may
not be new. I've discovered new and interesting ways to screw up.
It turns out that when you leave -ERESTARTSYS or -ERESTARTNOHAND in %eax
when leaving a system call, the host kernel will helpfully subtract 2
from eip. This is because a system call instruction is 2 bytes long
and it restarts them by executing that instruction over again. This
turns out to be a problem when UML itself is restarting one of its own
system calls. It delivers the signal first, so eip points to the
beginning of the handler, and if you don't put a zero or something in
eax, then you will fake the host into subtracting two from eip. That
puts it at the very end of the previous procedure, which will try to
return from a stack frame that never really existed in the first
place. This leads to very interesting debugging sessions.
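The arithmetic can be modelled in a few lines of C. This is a deliberately simplified sketch, not host kernel code: struct regs, host_restart_fixup and enter_handler are hypothetical names, the real host logic also looks at whether a handler is installed, and 512 and 514 are assumed to be the kernel-internal values of the two restart codes.

```c
#include <assert.h>
#include <stddef.h>

/* Kernel-internal restart codes (assumed values, never seen by userspace). */
#define ERESTARTSYS    512
#define ERESTARTNOHAND 514

/* Hypothetical register set: just the two registers that matter here. */
struct regs {
    long eax;
    unsigned long eip;
};

/* Simplified model of what the host does when it sees a restart code. */
void host_restart_fixup(struct regs *r)
{
    assert(r != NULL);
    if (r->eax == -ERESTARTSYS || r->eax == -ERESTARTNOHAND)
        r->eip -= 2;    /* back up over the 2-byte system call instruction */
}

/* UML side: aim the process at its handler and clear the restart code. */
void enter_handler(struct regs *r, unsigned long handler)
{
    assert(r != NULL);
    r->eax = 0;         /* without this, the host rewinds eip below */
    r->eip = handler;
}
```

If eip is set to the handler address while eax still holds a restart code, host_restart_fixup leaves eip two bytes before the handler, which is the tail of whatever procedure precedes it.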
It also turns out that ptrace "knows" that when it has intercepted the
start of a system call, eax contains -ENOSYS. It depends on this.
When I changed how UML gets the process state before a system call, I
changed the value in eax to something else, and ptrace stopped
working. It took a couple of days to figure this out.
But things seem back to normal now. My signal delivery exercisers
work OK, my stress tests work OK, UML works when compiled on 2.4 and
run on 2.2, and vice versa, which -13 didn't. So, I'm releasing -14
now and maybe this signal delivery rewrite will be done.
17 Feb 2002
I submitted a paper proposal for OLS yesterday. This time, I decided
to rant and rave about how the host kernel needs to be fixed to better
support virtual machines. We'll see how that goes. At least it's
different from the standard UML song and dance I've been flogging for
the last couple of years.
I redid the UML signal delivery code in order to fix the pthreads hang
with the newer pthreads library. I prototyped this in a small
standalone process in Incheon Airport (Seoul's international airport)
on my way back from Brisbane and just got around to integrating it
into UML. It massively cleaned up a bunch of code and opens the way
for getting rid of some other problems that have existed for a while.
11 Feb 2002
My flight leaves tonight, so I spent the day wandering around central
Brisbane. Ben LaHaise happened to take the same CityCat ferry (which
is a system of catamarans used to ferry people up and down the
Brisbane River) as me down to the city. He tried to blow up the boat
with his umbrella, but, fortunately, an alert crew member stopped
him.
Back to the Uni at the end of the afternoon, collect my bags, call a
taxi to the airport, and wait for the flight to Seoul. Then it's
another 14 hours to JFK and another couple back to CT and then I will
be hopelessly confused about what time it is for a couple of days.
10 Feb 2002
LCA is over as of yesterday, and I'm heading home tomorrow.
The slides from my talk (with notes) are available
here.
It was pretty well attended and it seemed to be received fairly well.
LCA always seems to do some innovative things, one of which is to
rerun talks that a large number of people regretted missing. The
three talks that were chosen this year were a virtual reality talk
that a huge number of people wanted to see, Andrew van der Stock's
talk on code auditing, and one other that I can't remember. This was
good because I was one of the huge number of people that wanted to
see the virtual reality talk. However, it turned out that my talk was
the number four vote-getter, and this turned out to be relevant when
Andrew was nowhere to be seen when the reruns were about to happen.
So, I was happily debugging the ppc UML build with Anton Blanchard
when one of the organizers ran up and asked me if I could do mine
again. I did so, except I skipped over some of the heavier parts of
the talk to leave a good bit of time for a demo at the end.
This went well, except for the panics I got when I tried to have two
UMLs mount the same filesystem. I demoed three UMLs running (one
Debian, two Slackwares) with most of them (plus the host, I think)
displaying on the X server of one of the Slackware UMLs. I showed
various other aspects of UML like what it looks like from the host
side.
1 Feb 2002
I'm off to Australia tomorrow for
LCA 2002.
This caps a fairly productive week of UML bug hunting:
mistral and blinky started seeing a panic in fork. mistral figured
out how to reproduce it and tracked it down to the point where it had
something to do with kernel threads reference counting their mm's.
This was enough information for me to fix the bug, by giving kernel
threads NULL mm's.
While I was at ISTS giving a talk, I spent some free time looking at
the pthreads problem that people have been seeing for a while. It
turns out to be a problem with UML signal delivery. I had assumed
that the registers going into a signal handler didn't matter (except
for the IP and SP, of course) and that only the stack frame mattered.
So, the process registers at the start of a signal delivery are
initialized with a set that was captured from a UML thread at boot
time. This works fine usually, except that recent pthreads libraries
store some thread-private data in %gs, and the %gs value has to be
preserved in signal handlers. So, the UML signal delivery mechanism
needs to be reworked again.
The UML IO hangs that were reported this week were tracked down and
found to be a bug in the host's handling of SIGIO. It turns out to be
possible, on an SMP host, for SIGIO to be queued to a process after
that process has returned from the fcntl that registered a different
process as the SIGIO recipient. This breaks UML badly because SIGIOs
end up queued, but not delivered, to a process which is out of context
and sleeping.
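For reference, the registration step in question looks roughly like this on a Linux host. It is a minimal sketch: sigio_route is a hypothetical name, and the real UML code may set things up differently.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: ask the host to send SIGIO for fd to process pid.
 * Returns 0 on success, -1 on failure.
 */
int sigio_route(int fd, pid_t pid)
{
    int flags;

    assert(fd >= 0);
    if (fcntl(fd, F_SETOWN, pid) < 0)       /* name the SIGIO recipient */
        return -1;
    flags = fcntl(fd, F_GETFL);
    if (flags < 0)
        return -1;
    /* O_ASYNC turns on signal-driven IO for this descriptor. */
    return fcntl(fd, F_SETFL, flags | O_ASYNC) < 0 ? -1 : 0;
}
```

The expectation is that once fcntl(fd, F_SETOWN, new_pid) has returned, no further SIGIO for that descriptor goes to the old owner. The bug described above is the host breaking that expectation on SMP.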
30 Jan 2002
I gave a talk at ISTS yesterday on the UML security work that I did
last week. It's available both as
html-ized
slides and as the
original Star
Office presentation. These are intended to provide a starting
point for anyone wanting to probe this for exploitable holes as well
as anyone who's curious about what was done.
I released 2.4.17-10 today. It contains a pile of bug fixes and a
bunch of changes which allow a UML patch to come close to compiling in
both 2.4 and 2.5 pools. I reverted a change which is causing problems
on SMP hosts. I decided that using sockets rather than pts devices to
communicate between the IO thread and UML was a good idea because
sockets are lighter weight and they're pretty much guaranteed to be
supported on the host, whereas there are lots of systems without pts
devices. However, there is some difference in how SIGIO is delivered
which causes UML to lose interrupts once in a while. The effect is
that it seems to hang on boot, but can be made to continue by banging
on the keyboard.
MTD is in the configuration now. So, UML supports MTD devices and
creating JFFS2 filesystems and mounting them seems to work, although
there are some nasty-looking error messages along the way. They don't
seem obviously related to UML though.
25 Jan 2002
The security work is largely done. The exception is the lcall
prevention fix that's needed on the host. So, I released 2.4.17-9
today. It also contains a number of fixes and patches from other
people, the largest being the latest set of James McMechan's ubd
changes.
23 Jan 2002
I released 2.4.17-8 yesterday (and 2.4.17-7 earlier this week without
dignifying it with a diary entry). I spent a fair amount of time
tracking down some old debugging problems. The strace recipe on the
debugging page
hasn't worked for a while, so I mostly fixed it. strace still doesn't
see system calls from new processes until they receive a signal.
I also figured out what was happening with using the gdb under ddd as an
external debugger. ddd periodically calls wait on gdb, and when UML
attaches it, gdb gets reparented away from ddd. wait starts returning
ECHILD, and ddd reacts by shutting down gdb's input, and gdb, in turn,
exits. To fix this, I think I'm going to have to have a 'gdb-parent='
switch that will make UML attach to the parent and fake normal return
values from wait.
In other news, there have been a couple of articles about UML
recently. NewsForge ran
one
yesterday. This is a followup on the article last week about Linux
virtual machines which completely failed to mention UML. Bill Stearns
also noticed
this article on using UML as the basis of a honeynet.
I've started finishing off the security work needed to make UML a
secure root jail. The 'jail' switch now checks for config options
which would make the UML inherently insecure and refuses to run if any
of them are enabled. Currently, the proscribed options are
CONFIG_MODULES, CONFIG_HOSTFS, and CONFIG_SMP. CONFIG_MODULES is
fairly obvious. If modules are enabled, then root can insert any code
at all into UML, and a nasty root would insert code that execs a shell
or something on the host. CONFIG_HOSTFS is forbidden to prevent
accidentally providing access to the host filesystem. This is a bit
dubious because hostfs is not inherently insecure, and I may relax
this one at some point. CONFIG_SMP is non-obvious. Lennert Buytenhek
noticed the relationship between SMP and security. 'jail' is
implemented by unprotecting kernel memory (by making it writable) on
entering the UML kernel, and write-protecting it on kernel exit. If a
process were to have two threads, one busy-waiting in userspace, and
the other sleeping in the kernel, kernel memory would be writable
because the sleeping thread would be in the kernel. So, the spinning
thread would wait for that to happen and write on whatever part of
kernel memory would let it escape. This will be fixed when UML gets a
separate address space for the kernel.
I've stared at /proc and /dev to find devices that provide access to
kernel memory. The only two that I spotted were /dev/mem and
/dev/kmem. These are disabled with a trick that someone on
#kernelnewbies told me about. Access to them is controlled by
CAP_SYS_RAWIO, so removing that capability from the bounding
capability set makes it impossible for any process to ever get it.
So, no process can ever open those devices. Even better, in my
limited testing, nothing seems to break badly as a result.
/proc/kcore looks suspicious, but it's a readonly file that seems to
spit out a memory image wrapped in an ELF header, so it's OK,
security-wise.
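On a present-day Linux host, the closest per-process analogue of the CAP_SYS_RAWIO trick is the prctl bounding-set interface. This sketch is not what the 2002 UML code did (PR_CAPBSET_DROP only arrived in Linux 2.6.25); the function names are hypothetical and it is shown only to make the mechanism concrete.

```c
#include <assert.h>
#include <linux/capability.h>
#include <sys/prctl.h>

/* Returns 1 if CAP_SYS_RAWIO is in the bounding set, 0 if not, -1 on error. */
int rawio_in_bounding_set(void)
{
    return prctl(PR_CAPBSET_READ, CAP_SYS_RAWIO, 0, 0, 0);
}

/*
 * Remove CAP_SYS_RAWIO from the bounding set of this process and all of
 * its future children. Needs CAP_SETPCAP and cannot be undone.
 */
int drop_rawio(void)
{
    if (prctl(PR_CAPBSET_DROP, CAP_SYS_RAWIO, 0, 0, 0) < 0)
        return -1;      /* typically EPERM when run unprivileged */
    assert(rawio_in_bounding_set() == 0);
    return 0;
}
```

Once the capability is gone from the bounding set, no exec can hand it back, so nothing started afterwards can pass the CAP_SYS_RAWIO check that guards /dev/mem and /dev/kmem.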
18 Jan 2002
2.4.17-6 is out. There are a lot of driver cleanups and bug fixes in
this patch. The IRQ hang is fixed. The default console and serial
line channel initialization strings are now configurable.
In other news, NewsForge ran an
article
about Linux virtual machines which covered everything
relevant, including some things that weren't virtual machines, except
for UML. They got some complaints about that, and later that day, I
got a piece of email from the guy who wrote the article wanting to
write a followup about UML. So, it looks like UML will be getting
another nice bit of publicity.
13 Jan 2002
I made another patch against the 2.5.2-pre tree today. This one
is against 2.5.2-pre11. It has been sent off to Linus so he can drop
it in his bit bucket.
In another development, it turns out that the O(1) scheduler breaks
UML by holding IRQs disabled across context switches. This results in
SIGIO (i.e. from disk IO completions) being trapped in a process that
has gone out of context, and can't be woken up until something else
notices that the IO has completed, and of course it won't because the
SIGIO has been delivered to the wrong process.
11 Jan 2002
After spending five days tracking down a swap corruption bug, I
discovered that Rodrigo de Castro had explained to me exactly what it
was about a month ago. Unfortunately, at the time, I didn't know
enough about the swap code to decide whether he was making sense. Of
course he was, and I discovered that at the end of the great bug hunt.
So, with that fix, Lennert's latest SMP changes, and a bunch of
smaller stuff, I'm releasing 2.4.17-5 today.
Linus saw fit to silently drop UML into the bit bucket again, so I'll
make another patch soon and send it in.
5 Jan 2002
I released the 2.4.17-3 and 2.4.17-4 patches this week. The biggest
change has been the merging of Lennert's SMP fixes.
I made another attempt to get UML into the Linus tree. This patch is
against 2.5.2-pre9 and is the 2.4.17-4 patch. We'll see how well this
attempt fares.
30 Dec 2001
I announced the full 2.4.17 release and the 2.4.17-2 patch today. The
patch is largely changes to allow UML to calculate current from the
stack. This is to make life easier for Lennert, who's trying to get
SMP working. It also contains a bunch of fixes for bugs that crop up
when host devices get closed from under UML consoles or serial lines.
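Calculating current from the stack usually means masking the stack pointer down to the base of the kernel stack, where the task structure lives. The sketch below assumes that layout, as on i386 in 2.4; KSTACK_SIZE and current_from_sp are hypothetical names, and the 8K stack size is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed kernel stack size: two 4K pages. It must be a power of two. */
#define KSTACK_SIZE 8192u

struct task;    /* opaque stand-in for the kernel's task structure */

/* Round the stack pointer down to the base of its stack. */
struct task *current_from_sp(uintptr_t sp)
{
    assert((KSTACK_SIZE & (KSTACK_SIZE - 1)) == 0);
    return (struct task *)(sp & ~(uintptr_t)(KSTACK_SIZE - 1));
}
```

This matters for SMP because each CPU runs on a different stack, so the answer is automatically per-CPU and needs no shared global variable.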
Iain Young is making a decent stab at a UML/sparc64 port. He's got
the boilerplate filled in (albeit with some skeptical comments about
their correctness...) and he's trying to get the whole thing to compile.
Linus released 2.5.2-pre4 today with no sign of UML in the changelog.
I grabbed the patch to make sure it wasn't there. It wasn't. Grrr.
I'll give him another patch and then I'll spam him with it again.
28 Dec 2001
OK, so the diary has taken a bit of a holiday break. I released the
2.4.17 UML patch yesterday. The full release will be forthcoming. I
also ported that patch into 2.5.1, created the 2.5.1 patch, and sent
it to Linus. Hopefully, he will put it in without my having to resend
it too many times. I sent a little note off to LKML announcing this,
which got some favorable reaction, both on and off the list. Alan,
being his usual taciturn self, sent a reply, which read, in its
entirety, "Cool".
This release had a lot of accumulated stuff in it. The biggest item
is the port channel, which lets you attach any number of UML consoles
and serial lines to a host port, at which point you can access them by
telnetting to that port. I also redid the context switching mechanism
after thinking of a much simpler way of doing it. This should also be
much faster since it doesn't involve signals flying around, so context
switches are now invisible to the tracing thread.
8 Dec 2001
I released 2.4.15-3 today. It contains some fixes to previous patches
and a lot of changes to the gdb support. gdb now sees ^C immediately,
rather than an arbitrary amount of time after it's typed. I also
cleaned up that code quite a bit. This knocks a couple of items off
the todo list. It also
sets me up to fix the gdb shell hang, but I'll let these changes gel
for a bit before dealing with that.
7 Dec 2001
I got the Sysadmin Disaster of
the Month contest going about a week
later than I should have. The thing that happened a week earlier was
the publication of an
article that I wrote for O'Reilly on using UML to simulate and
recover from disasters.
This month's disaster is a trashed root superblock. It involves
booting UML, zeroing out the superblock, and figuring out how to fix
the filesystem. I had some trouble coming up with a good example to
use for the contest. So, if I have similar troubles at the end of the
month, I might just trash a filesystem, make it available for
download, and the contest will be to figure out what's wrong with it
and fix it.
4 Dec 2001
Back from Linux-Kongress. Back on US/Eastern. I think I never left
it, which made life (the staying awake part of it) difficult in
Holland.
As for highlights of the conference, we have:
I and the UML project seem to have some name recognition. Everyone I
talked to seemed to have heard of both me and UML, which is very cool.
The talk went reasonably well. I gave the same talk as I did at ALS.
I had forgotten that it was somewhat tailored for ALS (i.e. I tried
to avoid talking about stuff that I had talked about at the previous
ALS), and it would have been somewhat different if I had prepared a new
talk for Linux-Kongress.
I met a pile of cool people, like
Lennert Buytenhek (who, in recognition of his contributions to both
projects, was embarrassed by both me and Rusty by being asked to stand
up for some audience appreciation, which must be a new Linux-Kongress
record)
Roman Zippel (who told me about a couple of ubd driver bugs, one of
which I knew of (a subtle rounding error), and one of which I didn't
(Greg Lonnon and I went to some trouble to put the COW header in
network byte order, but forgot to do the same for the block bitmap
(grrrr)))
Bruce Walker (who's in charge of the Compaq SSI project, and who asked
me clustering questions during my talk, at which point, I (correctly)
guessed who he worked for and why he was asking)
Fabio Olive Leite (a Conectivite who gave me a nice Conectiva
filesystem image a while back, and whose name I unaccented to get it
through my XSL processor)
Philipp Reisner (who threatened an Alpha UML port a while back, but
gave up, so he's less cool than the others :-)
The organizers took the speakers back to Amsterdam after the
conference to spend the day bumming around the city. We broke up into
small groups and went our separate ways. Our group spent much of our
time in two smallish cafe-type places just talking about random stuff.
The train trip from Schiphol airport (Amsterdam) to Enschede (near the
German border) was interesting for a number of people. It was exactly
as described for me (2+ hours, direct train, no problem), but some
track maintenance in Amsterdam's central station caused subsequent
trains to be cancelled, so other people had nightmares involving 5
trains and a bus totalling 5 hours.
26 Nov 2001
I released 2.4.14-6 today in the interest of clearing the decks for
the 2.4.15 release. Before leaving for Thanksgiving, I had redone the
mconsole protocol to be packet-oriented. This allows a lot more
flexibility in what can be done with the protocol. As a result,
you'll need the new mconsole client for this release.
While down in CT, I redid the host channel support. It is all much
cleaner now, and makes it a lot simpler to knock a bunch of
related items off the todo list.
I decided to knock off the 2.4.15 patch today as well. It went cleanly,
aside from a ptrace cleanup and a new way of generating /proc/cpuinfo
which I had to support. I also put in the file corruption fix after
forgetting to and discovering that a boot/halt caused fsck to complain
a little.
18 Nov 2001
While I'm waiting for the meteors to arrive, I'm chasing and stomping
UML bugs. I cleaned up and released the proxy arp fixes that I did on
planes and in airports on my way to Oakland. Before, uml_net would
blindly add an arp entry to eth0 and nothing else. This is wrong if
there is no eth0, and it's also wrong if eth0 doesn't connect to the
local net or if there are other interfaces also attached to the local
net. uml_net now looks at the routing table and puts an arp entry on
every interface that talks to the local net.
I also noticed that slip support wasn't up to date, so I modernized it
and cleaned up the code while I was at it. You can now change the IP
address of a slip-based interface and the host configuration will be
updated just like the other transports.
I added some RT signal support. SA_SIGINFO is now supported, which
will hopefully fix some of the strange process behaviors that have
cropped up lately. If this fix doesn't do it, I chased down another
bug which was causing rt_sigsuspend and sigsuspend to return incorrect
values. This was causing the libc sigsuspend to hang, and its process
with it. This fixes the pthread_create hang that Greg Lonnon noticed,
plus the gdb hang, I think. I haven't checked that yet.
Those fixes are in 2.4.14-3um which I just released. You'll need the
latest utilities in order to use the network, since I bumped the
uml_net version again.
14 Nov 2001
OK, I'm back from ALS. My talk was on the first day, and it was
reasonably well attended considering the somewhat dismal overall
number of attendees. It was a half-hour talk, so about an hour
beforehand, I took my OLS slides, threw out more than half of them,
and updated the rest. That worked out reasonably well, but 30 minutes
makes for a very short talk without much detail.
Daniel Phillips' talk was the last of the conference, and was somewhat
interesting. He had a pile of raw data that he needed to turn into
slides, and all of the KDE presentation tools blew up on him in one
way or another. So, in the break before his slot, he grabbed Stephen
Tweedie, and they plus me and another guy went off to a local dim-sum
place. Daniel and Stephen sweated over Star Office on Stephen's
laptop making slides. In the event, they turned out rather well.
I left just after Daniel's talk, so I missed out on some of the
socializing that afternoon and evening. It turned out that Daniel and
Larry McVoy were talking about his clustering ideas (MetaLinux, or
ML), and it occurred to them that UML was not only a good simulation
tool for ML, but that it actually implements a good part of what Larry
has in mind. I found out about this later, and had a long talk with
Larry on Monday, in which he explained his plans. I had heard various
mumblings about it, and saw a
slide show
that Larry has, and remained unenlightened. It turns out that, as far
as I can tell, the only way to find out what he's thinking is to have
him explain it in person. Anyway, I became enlightened after our
chat, and it looks like this could be a whole new area that UML could
branch into.
In actual development news, I fully released 2.4.14 today. En route
to Oakland, I fixed uml_net so it's smarter about doing proxy arp. It
figures out what devices are connected to the local net and only sets
proxy arp on those. As a side-effect, if the host is totally
isolated, then you don't get scary-looking error messages when it
tries to set proxy arp on eth0 and it turns out not to exist.
This happened to me at OLS when I tried to demo it after my talk. I got
this nasty message which convinced me that the network all of a sudden
didn't work, and I was all apologetic and had no idea what happened.
In reality, the network was fine, and I could have demoed it if I had
retained a bit more presence of mind.
This isn't in the 2.4.14 release because I'm not happy about the
cleanliness of the change. I'll probably clean it up for the next
2.4.14 patch.
On a much-delayed train from San Francisco to Mountain View (a
supposedly 1:13 hour trip that in reality required about 1:50 and two
trains), I also figured out why you can't talk to eth1 from the host
if you configure both an eth0 and eth1. It turned out to be the same
bug that other people had noticed causing dropped packets. I was
checking errno incorrectly. I had code that did this:
n = read(...);
if(errno == EAGAIN) return(0);
forgetting that successful system calls don't necessarily set errno to
zero. So, the eth1 read was succeeding, but errno was still EAGAIN
from the eth0 read.
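A corrected version of that check only consults errno after the call has actually failed. This is a sketch rather than the actual driver fix; read_packet is a hypothetical name, and returning 0 for "nothing available" follows the snippet above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns the number of bytes read, 0 if the read
 * would have blocked, or a negative errno value on a real error.
 */
ssize_t read_packet(int fd, void *buf, size_t len)
{
    ssize_t n;

    assert(buf != NULL);
    n = read(fd, buf, len);
    if (n >= 0)
        return n;       /* success: errno may hold a stale value, ignore it */
    if (errno == EAGAIN)
        return 0;       /* nothing available right now */
    return -errno;
}
```

With this ordering, a leftover EAGAIN from the eth0 read can no longer make a successful eth1 read look empty.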
In other news, beware of kernels built with gcc 3.0.2. I got a
complaint from Jens Axboe today about UML leaving all kinds of
not-quite-zombie processes lying around. I looked at it a bit and
guessed that the host kernel was messed up somehow. He looked at
that, decided I was right, and that the culprit was the latest gcc.
The interesting thing was that, until he ran UML on that kernel, it
looked just fine to him.
6 Nov 2001
In preparation for fixing the problem of the console driver losing
output, I ported the SIGIO handler to use poll instead of select.
This was mostly what 2.4.13-4 was. I later discovered a bug in it,
which is fixed in -5.
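For comparison with select, a poll-based check for pending input might look like the following. It is a sketch under assumptions, not the UML handler: pending_input and the fixed limit of 16 descriptors are hypothetical.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>

#define MAX_POLL_FDS 16     /* arbitrary limit for this sketch */

/*
 * Hypothetical helper: without blocking, find which descriptors have input.
 * Stores them in ready[] and returns how many there are, or -1 on error.
 */
int pending_input(const int *fds, int nfds, int *ready)
{
    struct pollfd p[MAX_POLL_FDS];
    int i, n, count = 0;

    assert(fds != NULL && ready != NULL);
    assert(nfds >= 0 && nfds <= MAX_POLL_FDS);

    for (i = 0; i < nfds; i++) {
        p[i].fd = fds[i];
        p[i].events = POLLIN;
        p[i].revents = 0;
    }

    n = poll(p, (nfds_t)nfds, 0);   /* zero timeout: just look, don't wait */
    if (n < 0)
        return -1;

    for (i = 0; i < nfds; i++)
        if (p[i].revents & POLLIN)
            ready[count++] = p[i].fd;
    return count;
}
```

Unlike select, poll takes an array sized to the descriptors in use, so it is not tied to a fixed-size descriptor bitmap.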
I then decided to fix the problem of UML not being able to be
interrupted and backgrounded. The problem was that all UML processes
are in the same process group, with all of them stopped except for the
one that's actually running. When UML is
backgrounded, the shell sends a SIGCONT to the process group, which
wakes up every UML process, which is very bad.
I did some failed experiments with setpgrp/setsid and friends, and
discovered that a separate process group wouldn't work because then
those threads can't write to the terminal because they're in the wrong
process group.
So, I decided that out-of-context processes should be asleep rather
than stopped. This required redoing the task switching code. They
were stopped because the tracing thread intercepted a signal from them
when they went out of context and never continued them. Having them
sleep would require that the tracing thread stop doing that and that
the threads involved in a context switch arrange the transfer
themselves.
So, what is now done is that non-running processes are asleep in
sigsuspend, and they are woken up by the going-out-of-context process
sending a SIGTERM. Races are avoided by having the SIGTERM sent
inside a section of code that has blocked SIGTERM. SIGTERM is
re-enabled atomically with the sleep with sigsuspend.
So, that plus the poll fix is the contents of -5.
3 Nov 2001
Time to patch-bomb Alan again. I sent in ten patches to get the ac
tree current with CVS. Here they are:
Miscellaneous fixes - some build cleanup,
a config update, some name changes in the mconsole driver, exporting
of gprof symbols, and various other small cleanups and fixes
A signal handling fix which eliminates
most of the process segfaults which people had been seeing
Jorgen Cederlof's context switch speedup,
which also includes some VM fixes that I found
A synchronization patch which
includes a grab-bag of changes which I hadn't managed to get into the
ac tree yet
2 Nov 2001
That last patch went into -ac6, so the ac UML builds and works again.
The next job is to get the ac tree up to date.
I released a new utilities tarball today. uml_net should now do proxy
arp correctly. uml_mconsole is now able to take a command on its
command line and execute it, rather than being strictly a command line
tool.
30 Oct 2001
I decided to make the -ac UML build again, so I made
this patch and sent it off to
Alan. The rest of the updates will be forthcoming.
29 Oct 2001
Today is 2.4.14-3 day. I decided to remove the code in fix_range
which unmaps pages whose ptes say they're not present. That basically
caused it to try to uselessly unmap all of its unused address space.
So, I did that and it uncovered a bug. It turns out that swapped-out
pages weren't marked as needing to be remapped. Everything worked a
lot better with that fixed, and context switching should be a bit
faster now.
28 Oct 2001
I released 2.4.14-2 today. This contains the fix for the process
segfaults and the gdb problems people have been having. It also turns
on morlock's context switch optimization which I disabled until I
figured out the segfaults.
26 Oct 2001
I finished releasing 2.4.13 today.
After some prodding from Greg Lonnon, and after he did some investigation,
I figured out what the problem with gdb inside UML is. The signal
handlers don't save their registers in the thread struct. This means
that when a SIGTRAP from a breakpoint comes in and it gets forwarded
to gdb, when it gets the registers to find out what the ip is, it gets
an old, bogus value. So, it doesn't recognize that as a breakpoint
and complains about a spurious SIGTRAP instead.
25 Oct 2001
I spent the last few days chasing a process segfault problem. I
finally tracked it down today. It turns out that my rewrite of the
process signal delivery code was broken in the case of a signal being
delivered from an interrupt handler rather than a system call. It
grabs the process registers from the thread structure, saves them away
on the stack, and then restores them to the process when the handler
finishes.
However, interrupts don't save their registers in the thread
structure, so those registers represent the last system call, which
has already finished. And restoring those causes great confusion in
the process.
20 Oct 2001
I released 2.4.12-3um and 2.4.12-4um over the last week. -3 fixed a
couple of problems with -2, and -4 adds some miscellaneous fixes to
that. The major ones are that physical memory protection is optional
(controlled by the 'jail' switch) and that the network driver backends
now collect uml_net commands and output and nicely printk them instead
of having the output just dumped to the terminal. To support this,
uml_net now hangs on to the commands it runs and the output they
produce and sends them back to UML. This required that the uml_net
interface be incremented, so it's now at 3. The new drivers require
the new uml_net, so if you grab the UML patch, also get the latest
utilities tarball.
13 Oct 2001
I released 2.4.12-2um today. It's almost entirely changes sent in by
other people, dominated by Adam Heath's cleanups. There were also some
ppc fixes from Chris Emerson, and small fixes from other people.
I also released a new utilities tarball. The one change was to
uml_net, which does proxy arp in a different, and apparently more
robust way than it used to.
11 Oct 2001
Linus released 2.4.11 and 2.4.12 two days apart. I had the 2.4.11
patch uploaded, and had started releasing packages when 2.4.12 came
out. So, 2.4.12 is out there and I'm doing the packages again.
8 Oct 2001
I should have mentioned the latest -ac patches already since they've
been in Alan's tree for a few days, but I didn't, so here they are
In other news, with the help of Paul, I tracked down an ancient
console driver bug that held on to a struct tty after it had been
freed and subsequently caused panics.
I released 2.4.10-7um today with that fix and some other minor changes.
5 Oct 2001
Paul Larson found a test case for the signal problems that was
reproducible for me. So, with that in hand, I tracked down the bug,
and released 2.4.10-6um.
The bug turned out to be a result of moving where state is saved
before a signal is delivered to a process. The process registers and
some other things need to be saved on the process stack so they can be
restored later. The way it used to work is that
handle_signal would figure out what the interrupted system call
eventually returns
that value is passed up the stack and stored in the process registers
stored in its task structure
the process would be sent a signal so it starts running on its process stack
the UML signal handler copies the register state from the task
structure to its own stack
it calls the process signal handler
and restores the registers back to the task struct
What I implemented did this
handle_signal figures out what the interrupted system call eventually
returns and constructs the process stack frame, copying the registers
from the task struct onto the stack
that value is passed up the stack and stored in the process registers
stored in its task structure
The bug is that the second step happened too late. The registers
saved on the stack hold a bogus return value, and it's that value
which the system call eventually returns.
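The ordering problem can be shown with a toy model. Everything here is hypothetical (the structures, the function names, and the idea that the fix was simply to swap the two steps); it only demonstrates why copying the registers before storing the return value loses that value.

```c
#include <assert.h>
#include <stddef.h>

/* Toy structures: only the register that carries the return value. */
struct regs { long eax; };
struct task { struct regs regs; };
struct sigframe { struct regs saved; };     /* copy kept on the process stack */

/* Broken order: the frame is built before the return value is stored. */
void deliver_broken(struct task *t, struct sigframe *f, long syscall_ret)
{
    assert(t != NULL && f != NULL);
    f->saved = t->regs;             /* copies a stale eax */
    t->regs.eax = syscall_ret;      /* too late, the frame is already built */
}

/* One possible fix: store the return value first, then build the frame. */
void deliver_fixed(struct task *t, struct sigframe *f, long syscall_ret)
{
    assert(t != NULL && f != NULL);
    t->regs.eax = syscall_ret;
    f->saved = t->regs;
}

/* When the handler finishes, registers are restored from the frame. */
long sigreturn_eax(struct task *t, const struct sigframe *f)
{
    assert(t != NULL && f != NULL);
    t->regs = f->saved;
    return t->regs.eax;
}
```

In the broken order the process resumes with whatever eax held before the system call result was known, which matches the bogus return values described above.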
3 Oct 2001
I decided to profile a stretch of UML thrashing. So, I took the
2.4.10-2ac UML (which I updated to the latest stuff, and which I'll be
sending to Alan shortly), gave it 128M of memory and 1G of swap, and
let a 'make -j' kernel build run for a couple of hours.
These are the
results. All of the system calls show up as <spontaneous>.
Somehow it wasn't linked against the profiling libc. I'll try to
figure out why not.
Some highlights:
Protecting kernel memory from userspace seems to be expensive -
mprotect is the top item on the list.
wait4 is number two, which I don't entirely understand. That's the
tracing thread. It sleeps in wait, and wakes up when there's
something that needs doing, so I don't understand why it shows up,
unless it's somehow being charged for all of the context switching
that UML causes.
Then we have other low-level VM things, fix_range and
flush_tlb_kernel_vm. These manually walk address spaces to update
them. These two are unnecessarily inefficient and can probably be
knocked far down the list pretty easily.
Finally, we get into generic kernel things, which show clear signs of
heavy swapping - page_launder, swap_out_pmd, do_anonymous_page.
The first system call which shows up is sys_brk, way down the list,
followed distantly by sys_read and sys_close.
There were 312175 system calls total. The most frequently called were
sys_brk, sys_read, sys_open, sys_newfstat, and sys_stat64, not
unexpected for a kernel build.
kmalloc was called most often from load_elf_binary, select_bits_alloc,
and load_elf_interp. __get_free_pages was called most often from
handle_mm_fault, pipe_poll, and do_fork. It called _alloc_pages,
which was called most frequently from read_swap_cache_async,
do_anonymous_page, and do_wp_page.
1 Oct 2001
I discovered a new way of breaking UML. A 'make -j' kernel build
drives the load above 150, and on 2.4.10 causes essentially a
livelock. I eventually regained control by sending SIGILL to all the
processes from the host. Plus, I got all kinds of interesting illegal
instruction and bus error deaths. These were absent on -ac2, probably
because that wasn't a totally up-to-date UML, so it was missing the
most recent bugs that I added. I'm going to track those bugs down by
updating UML in the -ac tree bit by bit and seeing which bit causes
these nasty little problems.
Today I redid the signal delivery code. Now all the saving and
restoring of state happens in kernelspace rather than on the process
stack like before. This allows the task structure to be protected
from processes. Since that was the only hole in the protection of
physical memory, that is now fully protected against being changed
from userspace.
24 Sep 2001
I made a .deb and an RPM in preparation for releasing 2.4.10, and
Jacques Nilo reported that yesterday's fix wasn't enough. I had
forgotten about an instance of MAP_SHARED | MAP_ANONYMOUS. So, I fixed
that, and that is 2.4.10-3um. And that is the basis of the official
2.4.10 UML release.
24 Sep 2001
I thought of an easy fix for the stack capturing problem that
prevented UML from booting on 2.2 hosts. Basically, a new process is
created which stops itself, and when that happens, the parent grabs a
copy of the stack and uses it to create a context for future threads
to run in. On 2.2, the parent used ptrace to extract the contents of
the stack from the child word by word. I looked at that code and
decided it would be much easier to map the stack MAP_SHARED so it
would be shared between parent and child and the parent could just
memcpy it to a safe place rather than ptracing it out.
What I forgot was that, while 2.4 supports MAP_SHARED | MAP_ANONYMOUS,
2.2 doesn't. So, on 2.2 hosts, UML wouldn't even begin to boot.
The easy solution was to go back to MAP_PRIVATE | MAP_ANONYMOUS, but
clone the new process with CLONE_VM, making it a thread, which allows
the parent to copy the stack directly, since they're both in the same
address space.
This fix makes 2.4.10 usable, so I've released another patch and
updated CVS.
23 Sep 2001
Linus released 2.4.10 today, so I updated UML as well. I decided not
to base this on the latest UML patch, since that is not entirely
healthy at the moment. My sigaltstack fixes broke UML totally on 2.2
hosts. So, 2.4.10-1um is 2.4.9-8um updated to 2.4.10.
CVS is not updated, but will be once I have the sigaltstack thing
fixed and that pool updated to 2.4.10.
22 Sep 2001
The last set of -ac patches went into -ac14. That brings Alan's tree
reasonably up-to-date.
I released 2.4.9-8um yesterday and 2.4.9-9um today. -8 was some bug
fixes and cleanup. -9 was fixing sigaltstack and doing a lot of cleanup
and rearrangement of the signal delivery code. This sets me up to redo
the entire signal delivery mechanism so I can finish protecting all of the
kernel's physical memory from userspace.
18 Sep 2001
That last batch of patches went into -ac12. So, the next batch is off
to him, plus one from Andrea
Arcangeli which fixes a declaration which is needed to
compile UML successfully.
Once these are in, the -ac tree is almost up to date. It'll be one
CVS release behind, which is OK because there are some tweaks I want
to make to the address space reorg. So, I'll get that right and send
it in rather than sending it in two pieces.
15 Sep 2001
I released 2.4.9-6um last night. It contains the already-mentioned
COW header changes. It also occurred to me that I can fix the
mlockall bug by sticking UML at the top of the address space where
it's supposed to be anyway. So, I went ahead and did that. This
allowed me to get rid of the vmas that UML needed to stick in each mm
to prevent mmap from reallocating areas of virtual memory that UML is
living in. This, plus the fact that these vmas had no ptes, caused
mlockall to cause major damage to UML by trying to unmap it. Putting
UML above TASK_SIZE causes it to be ignored by mmap, and the problem
just disappears. This also let me get rid of the nasty address space
reservation code that was needed in order to prevent libc from mapping
stuff in where UML wanted to put stuff.
In other news, I'm back in the ac patch business. First is a
patch that I've been sitting on
all week which defines hz_to_std and allows UML to build again. Then,
we have
calls to malloc, calloc, and free are now
converted into calls to kmalloc and kfree when the slab
allocator is running
14 Sep 2001
Greg Lonnon and I have been fiddling with the COW file header format.
I had already discovered that blindly copying the backing file path
provided by the user into the header is a problem when it is a
relative path. That COW file won't be usable by a UML run in a
different level of the directory hierarchy because, from there, the
relative path stored in the header doesn't refer to the backing file.
The fix is to write an absolute pathname into the header.
Greg had a couple of other good ideas which we thought should be
implemented earlier rather than later
The header should be able to hold a MAXPATHLEN-sized backing file name
rather than the current 256 bytes.
It should be in network byte order. This will allow COW files to be
moved between big-endian and little-endian hosts. Whether the
underlying filesystem can be mounted in UML after the move depends on
whether the filesystem has its metadata byte-swapped correctly. But,
at least the COW header won't prevent it from working.
These two are not backward compatible, so we bumped the COW header
version and made these changes in the version 2 header. The driver
can read both V1 and V2 headers but it will only write V2 headers.
The absolute pathname change is in 2.4.9-5um since it was small and
backward compatible. The other two will be introduced in 2.4.9-6um.
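Here is a rough sketch of what those three changes amount to when a header is written. It is not the ubd driver code. The layout (a 4-byte version followed by a MAXPATHLEN-sized name), the COW_HDR_SIZE constant, and the function names are all assumptions made up for the example; only the ideas (absolute path, MAXPATHLEN, network byte order, version 2) come from the text above.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/param.h>

#define COW_VERSION  2                  /* the V2 header described above */
#define COW_HDR_SIZE (4 + MAXPATHLEN)   /* hypothetical layout, see lead-in */

/* Store a 32-bit value most-significant byte first (network byte order). */
void put_be32(unsigned char *out, uint32_t v)
{
    out[0] = (unsigned char)(v >> 24);
    out[1] = (unsigned char)(v >> 16);
    out[2] = (unsigned char)(v >> 8);
    out[3] = (unsigned char)v;
}

/* Read it back the same way, regardless of host endianness. */
uint32_t get_be32(const unsigned char *in)
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8) | (uint32_t)in[3];
}

/*
 * Build a header for the given backing file.  The path is resolved to an
 * absolute one first, so the header does not depend on the directory the
 * reader happens to be started in.  Returns 0 on success, -1 if the path
 * cannot be resolved.
 */
int cow_build_header(unsigned char *hdr, const char *backing_file)
{
    char abs_path[PATH_MAX];

    assert(hdr != NULL && backing_file != NULL);
    if (realpath(backing_file, abs_path) == NULL)
        return -1;

    memset(hdr, 0, COW_HDR_SIZE);
    put_be32(hdr, COW_VERSION);
    strncpy((char *)hdr + 4, abs_path, MAXPATHLEN - 1);
    return 0;
}
```

Writing the bytes out explicitly, rather than dumping a struct, keeps the on-disk form the same on big-endian and little-endian hosts.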
The uml-user list had a couple of
interesting posts from UML users today
Martin Volf did a Slackware 8.0
installation inside UML and wrote a
page describing how he did it.
Tim Robinson had some problems with the TUN/TAP transport and
posted a nice diagnosis of them.
10 Sep 2001
Been playing with the tools and website lately. I added a bunch of
new features to the mconsole
client (and promptly had to
fix it), and fixed uml_net
building on 2.2.
I also restructured the web site build somewhat to make it more manageable.
6 Sep 2001
I tracked down the process segfault problem. It was caused by a newly
forked child inheriting some pages that were swapped out, but hadn't
been unmapped. The code that it ran on its first quantum didn't
update its address space correctly, so those pages remained mapped.
Having chased that problem down, I'm releasing 2.4.9-4um with that fix
plus Chris Emerson's latest ppc changes.
1 Sep 2001
After much ado, I revamped the UML
download page. It
essentially replaces the Sourceforge project download page. I did
this in order to be able to let people select the mirror they want to
download from and to be able to put explanatory information on the same
page as the download link. If it is missing stuff that you'd like to
see, regardless of whether it's on the SF download page,
I'd like to know about it.
There are a couple things that aren't working right now - the
'Changelog's don't link to anything, and most of the SourceForge root
filesystem links don't work. I'm in the process of copying the
filesystems over to SF to fix this.
It's now pretty trivial for me to add mirrors, so if you have a box
available (particularly if it's in a part of the world not
well-covered by the UML global mirror system),
let me know.
30 Aug 2001
Thanks to what looks like an all-night debugging session on the part
of Yon Uriarte, the TUN/TAP backend now works. You'll need the latest
uml_net for this. It wasn't setting IFF_NO_PI, which was causing
extra cruft to be stuck on the front of the packet, which probably
required the broken nastiness I had to add to the driver. Adding that
and backing out all the skbuff fiddling made everything work a lot
better.
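For reference, this is roughly what asking for a TAP device without the packet information header looks like. It is a sketch using the standard Linux TUN/TAP interface (/dev/net/tun, TUNSETIFF, IFF_TAP, IFF_NO_PI), not the actual uml_net code; the tap_fill_request and tap_open names are made up for the example.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/if.h>
#include <linux/if_tun.h>

/*
 * Fill in a TUNSETIFF request for a TAP device.  IFF_NO_PI tells the
 * kernel not to put the extra packet-information bytes on the front of
 * each frame.
 */
void tap_fill_request(struct ifreq *ifr, const char *name)
{
    assert(ifr != NULL);
    memset(ifr, 0, sizeof(*ifr));
    ifr->ifr_flags = IFF_TAP | IFF_NO_PI;
    if (name != NULL)
        strncpy(ifr->ifr_name, name, IFNAMSIZ - 1);
}

/* Returns a descriptor for the TAP device, or -1 on failure. */
int tap_open(const char *name)
{
    struct ifreq ifr;
    int fd = open("/dev/net/tun", O_RDWR);

    if (fd < 0)
        return -1;

    tap_fill_request(&ifr, name);
    if (ioctl(fd, TUNSETIFF, &ifr) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Without IFF_NO_PI, every read from the descriptor starts with a few bytes of flags and protocol information ahead of the ethernet frame, which is the "extra cruft" mentioned above.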
So, I released 2.4.9-3um with the fixed driver in it, plus new entries
in config.release, defconfig, and Configure.help.
28 Aug 2001
I implemented a TUN/TAP backend for the network driver. It involved
more work than I expected. A lot of it was due to restructuring other
code in order to keep the code relatively clean.
I haven't done any stressing or timing of it, but I did happen to
notice that pings over TUN/TAP are about 10x faster than pings over
ethertap. The absence of the helper handling each packet on the way
to the kernel is no doubt a big piece of that. At some point, I'll do
some bandwidth measurements against ethertap to see how much better it
is. Hopefully a lot.
26 Aug 2001
UML development took a bit of a break while I got busy with other
stuff.
In UML news, I started work on my ALS paper, got a first draft
ready, and sent it off for review. I also did a bunch of web site
work. I've been letting things fall behind for lack of time to deal
with them, so I decided to swallow my pride and start asking for
help. This necessarily involves describing what needs doing, so I
wrote most of it up, and the results are
here,
here,
here,
here, and
here.
I also made a pass over the site, fixing a bunch of hopelessly
outdated and wrong things, and probably leaving some things which are
only moderately outdated and wrong.
16 Aug 2001
I've been having fun playing with crashme. It's a great little tool.
It generates buffers full of random data and then executes them. It
runs differently on UML than on the host, which it shouldn't. The
problems I've tracked down so far are signal handling bugs. UML
wasn't handling write faults correctly when the accessed memory was
readonly, and it wasn't properly segfaulting processes to which
signals couldn't be delivered (because their stack pointers were
garbage). This last was the bug I was chasing a couple days ago.
There are still problems. The first process (crashme +2000 666 100)
runs just as it does on the host, but the next one (crashme +2000 667
100) doesn't. On the host, the segfault handler somehow gets bus
errors in libc, which I don't understand, and that doesn't happen
under UML.
On IRC yesterday, Lennert Buytenhek clued me in on how to reliably
segfault processes and crash UML. He was running 8 "du /". That
didn't work for me, but 16 of them does. The segfaults are on pages
that are mapped in but shouldn't be (their ptes say that they should
be mapped out, and somehow that didn't happen). So, those pages were
presumably allocated for something else, and contain garbage from the
perspective of the process that should have unmapped them, and so it
segfaults.
The panic looks like memory corruption. I turned on slab debugging,
and it looks like that makes the panic go away.
Well, Linus released 2.4.9 today, so it's time for me to go into my
UML release routine. When I did the obligatory kernel build on 2.4.9,
one of the crashme fixes turned out to be bogus. It was doing the
segfault-during-signal-delivery check too early, so it caught fixable
segfaults that happened because the stack needed extending or was
readonly.
14 Aug 2001
2.4.8-2um is out as of yesterday. I made the freshmeat announcement
of 2.4.8 this morning.
I chased the crashme bug a little. Somehow, a signal is marked as
being pending, but it's never actually delivered and reset. So, no
further signals can be delivered to that process from then on. This
makes it unkillable and unstoppable.
I also took the first step towards making UML secure against nasty
users. UML physical memory, except for the task structure and kernel
stack, is protected from userspace access. I still need to protect the
task structure and kernel virtual memory. The task structure is a bit
tricky because of the signal delivery code. It runs on the process
stack and is considered to be userspace code. However, it needs to be
able to modify the task structure to restore state that it saved
before the signal delivery. So, if the task structure isn't writable,
this isn't possible. Further thought on the subject is necessary.
13 Aug 2001
Well, Linus released 2.4.8 just as I was heading up north for a
weekend of camping and climbing mountains. He does this on purpose.
He released 2.4.3 when I had just arrived in San Jose for the Kernel
Summit.
Anyway, this was a relatively simple patch. It just dropped in and
worked, except that hostfs was already broken. My calculation of the
stat64 inode field was wrong. It looked at the kernel version to
decide what was in the userspace headers. I discovered the error of
my ways when I booted up a Debian UML to produce the 2.4.8 .deb. This
is a 2.2 filesystem (with .st_ino in stat64) with a 2.4 kernel (which
implied .__st_ino in stat64). hostfs did not build. I changed the
Makefile to just grep the appropriate header instead.
So, this fix will be the substance of 2.4.8-2um.
9 Aug 2001
The remaining differences between my pool and the ac tree are a couple
of patches that didn't go in for some reason, cleanups of printks and
some includes.
Daemonizing UML does work. I just checked it, and the only case where
it does something strange is if you background it without nohupping it
and log out. The tracing thread dies from the SIGHUP, but all the
other threads survive.
I released 2.4.7-5um today. It contains a few recent patches from
other people. I figured out how to turn -fno-common back on. I tried
all kinds of linker tricks to throw errno.o out of the binary. Then I
discovered that the linking that had already taken place had destroyed
any notion of what objects anything originally came from. So,
instead, I added -Derrno=kernel_errno to all the kernelspace gcc
lines, which translates all the kernel uses of errno to kernel_errno,
and leaves the libc errno alone. That's actually a better solution
than throwing out one of the errnos because that would leave open the
possibility that the kernel and userspace uses of libc could step on
each other. Now that they're using different symbols, that's not a problem.
7 Aug 2001
Yesterday's patches are off to Alan.
In other news, daemonizing UML seems to be broken again. Grrr. That
seems to break now and then for no apparent reason.
ac9 is out with my patches in it. So, time to make the final diff
between Alan's stuff and mine to get him totally caught up with me.
6 Aug 2001
Yesterday's patches are in ac8. So, two more patches will bring the
ac tree completely up to date:
A network driver update which
adds the ability for the drivers to tell the helper about any IP
address changes. This allows the host configuration (routing and
proxy arp) to stay in sync with the interface address changing inside
UML. If you're in the habit of getting UML from the ac tree, you'll
need the latest uml_net in order to use the network when this patch
goes in because it makes an incompatible change in the helper interface.
Another batch of (surprise!) miscellaneous
fixes, including some cleanup of stack permission setting, the
apparently gratuitous locals that are needed to persuade -pg to work
properly, a couple of symbol exports for GFS, a fix that ensures that
the pid file contains the correct pid, and yet another squashed warning.
With these in, I'll be able to diff the ac tree against mine to see
what divergences there are. I know there are some, because I
occasionally see patches fail to apply because of context conflicts
which shouldn't be there. So, there will be one more patch to clean
those up, and Alan will be completely in sync with me.
5 Aug 2001
Patch time again. This time I'm making them up ahead of ac7 coming
out. So, we have
A hostfs update, which brings
the ac tree completely up-to-date. Normally, I bundle a couple of cvs
updates into a small number of patches and send them off to Alan.
With hostfs, I decided to give him the latest stuff, since there have
been a bunch of changes spread over a number of cvs updates. This is
fairly easy since hostfs is a completely self-contained piece of code.
A network driver update, which
fixes a crash and makes net devices pluggable via the mconsole.
There's some restructuring and cleanup in this patch. Also, mconsole
actions move into keventd context from softirq context. This is
because alloc_netdevice does a GFP_KERNEL kmalloc, which has to be
done in process context.
Yet another batch of miscellaneous
fixes, including renaming CONFIG_IOMEM to CONFIG_MMAPPER,
some cleanup in the ubd driver, and removal of a number of warnings.
The complete merge of the ppc
port, which reorganizes the headers somewhat. For some
headers, there is now a header.h, which is a symlink to
header-$(SUBARCH).h, which includes header-generic.h and is allowed to
do whatever it wants before and after. This provides the flexibility
needed to do things like undef stuff after the include and rename
things beforehand.
These all will bring Alan up to my 2.4.7 release, except for hostfs,
which will be completely up to date. Since I'm up to 2.4.7-4um, and
2.4.7-2um was just a hostfs fix, I might be able to bring the ac tree
up to date with one more set of patches.
Alan released ac7 this afternoon, as I prophesied, so those patches
are off to him. I'll be looking for them in ac8.
I failed to resist temptation. I looked at the diffs between the ac
tree once those patches are in and my current stuff and I noticed a
big wad of documentation. So, I rolled that up and sent it to Alan.
4 Aug 2001
OK, I'm back in the business of sending Alan patches. I sent in a
small patch which fixes the
things that broke when 2.4.7 came out. So, UML now builds and works in
the -ac tree again. It made 2.4.7-ac6 an hour or so after I sent it over.
Also, in the interest of getting the ac tree more caught up with my CVS, I
sent Alan a batch of fixes which bring him up to 2.4.6-4um:
umid fixes from Henrik Nordstrom
which create a directory based on the umid rather than having that be
the pid file. The pid file and the mconsole socket are now in that
directory.
Another batch of small fixes -
a Makefile fix, mconsole cleanups and an update to create the socket in
the umid directory.
Some config changes, also
from Henrik Nordstrom. These change the network config names to be
more explicitly UML-specific. The config.in is also cleaned up so
that it resembles the i386 config more closely.
Greg Lonnon's example iomem
driver, plus a couple of generic UML fixes that were needed
in order to make it work.
A uaccess fix which required
a surprising amount of surgery to fix. The copy_{to,from}_user macros
previously regarded a fault location of 0 as meaning that the copy has
succeeded without faulting. When the address passed into the kernel
was NULL, this of course broke badly. It had a very interesting
side-effect in the case I saw. After running the command that
exercised the bug, every command on the system started failing to
start because libc was corrupted. This was something of a
head-scratcher. I eventually figured out that I was causing the
command to open NULL, the fault went undetected, and the buffer that
was supposed to have had the filename copied into it had
the filename of libc in it from a previous use. So, libc was opened
for writing with fairly severe results.
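The uaccess problem boils down to using 0 both as a fault address and as the "no fault" value. Here is a small userspace model of it. None of this is the kernel code: the sketch_copy_old and sketch_copy_new names are invented, and the fault check is faked by treating NULL as the only bad address.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for a real fault: in this model only NULL is a bad address. */
static int user_addr_ok(const void *addr)
{
    return addr != NULL;
}

/*
 * Old convention: return the fault address, with 0 meaning "no fault".
 * A fault at address 0 is then indistinguishable from success.
 */
const void *sketch_copy_old(void *to, const void *from, size_t n)
{
    if (!user_addr_ok(from))
        return from;        /* fault at NULL, which reads as success */
    memcpy(to, from, n);
    return NULL;            /* no fault */
}

/*
 * Fixed convention: the status and the fault address are separate, so a
 * fault at address 0 is reported like any other.
 */
int sketch_copy_new(void *to, const void *from, size_t n,
                    const void **fault_addr)
{
    assert(fault_addr != NULL);
    if (!user_addr_ok(from)) {
        *fault_addr = from;
        return -1;
    }
    memcpy(to, from, n);
    *fault_addr = NULL;
    return 0;
}
```

In the old version, a caller passing NULL sees "success" and goes on to use whatever was left in the destination buffer, which is how a stale libc filename ended up being opened.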
3 Aug 2001
The deb build problem turned out to be me accidentally redefining VERSION
in the upper layers of the build process. That value overrode a VERSION
in the kernel build, which resulted in a totally bogus KERNELRELEASE, which
confused a macro which tested it badly enough that it broke the build.
Simple to fix once I figured it out.
I discovered another hostfs bug on my way back from OLS. ls didn't work
and I found two bugs as a result. The easy one was that hostfs_readdir
was filling in the directory inode rather than the file inode for every
directory entry it passed back to vfs. This was fixed by having read_dir
pass the inode back up so it could be used to fill in the entry properly.
The more interesting one is that there was a source-incompatible change
made in the stat64 struct between 2.2 and 2.4. The st_ino field changed
its name to __st_ino and a new st_ino field was added at the end. The
inode appears in the same place (the st_ino/__st_ino field) making it
binary compatible. So, after changing to use the 2.4 field name (and
breaking hostfs on 2.2), I changed the hostfs build to figure out what
name to use and pass that in on the compile line to hostfs_user.c.
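The compile-line trick looks something like the following. This is a sketch, not hostfs_user.c: the STAT_INO_FIELD macro name and the inode_of function are made up, and it uses plain struct stat so that it builds on a current system, where the field is simply st_ino.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>

/*
 * The build figures out which name the headers use and passes it in,
 * for example with -DSTAT_INO_FIELD=__st_ino.  The macro name is
 * hypothetical; with nothing passed in, fall back to the plain name.
 */
#ifndef STAT_INO_FIELD
#define STAT_INO_FIELD st_ino
#endif

/* Returns the inode number of path, or 0 if stat() fails. */
unsigned long long inode_of(const char *path)
{
    struct stat buf;

    assert(path != NULL);
    if (stat(path, &buf) < 0)
        return 0;
    return (unsigned long long)buf.STAT_INO_FIELD;
}
```

The C code never needs to know which kernel or libc it is being built against. It only sees whatever field name the Makefile found in the header.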
In other news, we (me, Rodrigo de Castro, and Livio Baldini Soares) have
decided that -pg support in gcc is broken in multiple ways. rcastro and
livio complained a couple weeks ago about UML's gprof support not working.
I finally had a look at it, and found that it was broken, but not in the
way they described.
UML crashed in a very inconvenient place, and when I finally got in there
enough to figure out what was happening, it turned out that mcount was
segfaulting when it dereferenced ebp because ebp was NULL. The reason
for that turned out to be that in some procedures, mcount is absolutely
the first thing they do. Everything else calls mcount after the new
stack frame has been set up and ebp has a valid value in it (the old esp).
When the procedure is the main procedure for a thread, then ebp turns
out to be NULL.
The difference between the two sets of procedures seems to be that the
good ones have local variables and the bad ones don't. So, to work
around this bug, I added a useless, but non-optimizable, local to the
affected trampolines.
Having done that, rcastro and livio were still complaining about UML
crashing. So, I looked at it with rcastro using gdbbot (and livio did so
later and discovered the same thing). -pg was trashing edx for some
reason. A constant (which varies from procedure to procedure) is
dumped into it. This suggests that it's used for the profiling
bookkeeping somehow, but looking at the assembly, we don't see how.
mcount carefully pushes it and restores it, which is not typical of
something that is going to be used for something. The problem is that
FASTCALL procedures (which are regparm(3)) pass arguments in eax,
edx, and ecx. So, dumping this constant into edx trashes the second
argument to the procedure. A workaround for this bug would seem to be
to disable FASTCALL (and I guess that gprof support stopped working
when I enabled FASTCALL to fix a different bug).
I released 2.4.7-4um today. The main new thing is that you can change
the IP address of an ethertap eth0 device and the host configuration
will change to match. This required a bit of infrastructure which I
wanted for other reasons. The uml_net interface is now versioned,
which I've been meaning to do for a while. uml_net now goes away
cleanly when UML is killed messily. Before, it would hang around,
occupying the tap device, and when UML was rerun, the new uml_net
would emit non-intuitive error messages.
I also made hostfs build and run again on 2.2 with a bit of Makefile
hackery.
28 Jul 2001
That hostfs problem turned out to be different than I thought. Livio
Soares started chasing the problem and found that the hostfs_user
close_file didn't actually close anything. It took a pointer to a
file descriptor and closed the pointer (or at least tried to) rather
than the descriptor that it pointed to. Fixing that made hostfs
behave a lot better.
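The shape of that bug, and of the fix, is easy to show. This is a sketch with an assumed signature (a function taking a pointer to the descriptor), not the actual hostfs_user source.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * The broken version amounted to handing the pointer itself to close(),
 * something like
 *
 *     close((int) (long) fd_ptr);
 *
 * which fails with EBADF and leaves the real descriptor open.
 */

/* Fixed version: close the descriptor that the pointer refers to. */
void close_file(int *fd_ptr)
{
    assert(fd_ptr != NULL);
    close(*fd_ptr);
}
```

Since the failed close() was silent, the only symptom was descriptors piling up until something else broke.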
Having fixed that, I finished the page cache work for UML and it can
now successfully do the deb build through hostfs without getting the
md5sum mismatches it was getting before. Having said that, I've
started seeing a compilation problem when building UML through hostfs
that I don't get on the host.
On to OLS. My talk was in the second slot of the first day, which was
nice. It's good to get your talk over early so you can do the rest of
the conference without worrying about it. It went pretty well. I had
hoped to fit a demo in at the end, but the talk basically went the
full 90 minutes. So I did a real short watch-it-boot-up demo
afterwards while most of the crowd was filing out of the room.
There was a talk on porting Linux to the i-series IBM boxes (aka
AS-400) which was fairly interesting. They ported Linux/ppc to a
hypervisor running on OS-400, making it fairly similar to the UML
port, being a port to an OS rather than to bare hardware. Dave
Boucher, who gave the talk, made a number of comments comparing it to
UML, which was nice. He also grabbed me during lunch today to quiz me
about the COW ubd driver. It turns out that he can't do that so
easily because OS-400 doesn't have sparse files, so he can't drop
blocks down in the same location in the COW file as in the backing
file because that would allocate space. I suggested a block directory
instead of a bitmap at the beginning of the COW file and dropping
changed blocks down sequentially, but he seemed unconvinced for some
reason.
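The block directory idea can be sketched as a table mapping each backing file block to a slot in the COW file, with slots handed out in the order blocks are first written. This is only my suggestion written out as an in-memory model; the names, the fixed device size, and the layout are all assumptions for the example, and nothing like it exists in the ubd driver.

```c
#include <assert.h>
#include <stddef.h>

#define DIR_NBLOCKS  8      /* assumed device size, in blocks */
#define DIR_UNMAPPED (-1)   /* block has never been written */

struct cow_dir {
    int slot[DIR_NBLOCKS];  /* backing block number -> slot in the COW file */
    int next_slot;          /* next free slot; slots are used sequentially */
};

void cow_dir_init(struct cow_dir *d)
{
    int i;

    assert(d != NULL);
    for (i = 0; i < DIR_NBLOCKS; i++)
        d->slot[i] = DIR_UNMAPPED;
    d->next_slot = 0;
}

/* Slot to write block blk into, allocating the next one on first write. */
int cow_dir_slot_for_write(struct cow_dir *d, int blk)
{
    assert(d != NULL && blk >= 0 && blk < DIR_NBLOCKS);
    if (d->slot[blk] == DIR_UNMAPPED)
        d->slot[blk] = d->next_slot++;
    return d->slot[blk];
}

/* Slot holding blk, or DIR_UNMAPPED if the read goes to the backing file. */
int cow_dir_lookup(const struct cow_dir *d, int blk)
{
    assert(d != NULL && blk >= 0 && blk < DIR_NBLOCKS);
    return d->slot[blk];
}
```

Because changed blocks are packed together at the front of the file, the COW file only grows by what has actually been written, without relying on sparse files.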
A number of people told me either they or people they knew were using
UML for various things. The FreeS/WAN project as a whole seems
extremely interested in UML for running tests on their stuff over a
virtual network. A PPPoE maintainer complained about the ethertap
transport not being intuitively obvious on 2.4. And there were a
bunch of other people who were less specific about what their interest
in UML was who were either using it or were intending to.
In other news, I discovered that the mcast network transport didn't
work when the box had no ethernet card in it. Being at OLS, I showed
this to Harald Welte and we stared at the code a bit, then asked Andi
Kleen about it. The underlying problem turned out to be that there was
no route to any multicast address because there was no interface on
the system that supported multicast. The fix seems to be to add
multicast support to the loopback device, preferably, and if that's
not possible for some reason, to the dummy device.
22 Jul 2001
2.4.7-1um is released. A change which made kernel threads synchronize with
the parent at startup caused a hang at boot. The cause was a long-standing
bug which caused initdata not to be shared between processes. Andrea
noticed the problem as well, and found the fix.
That bug was fixed and I released everything. I released it with a fairly
big hostfs problem that I didn't notice until the middle of the release
process. I changed how it opens and closes files, with the result that it
closes them later than it used to. So, it isn't too hard to get hostfs
very confused by running UML out of file descriptors.
21 Jul 2001
2.4.7 appeared yesterday. I'm looking it over to see what's new. One
interesting thing is that Alan is sending over some bits of UML which
change the generic kernel. These don't affect anything besides UML,
so they're harmless. On the other hand, they eliminate some generic
files from my patch, which is nice. It makes the UML patch appear
purer.
16 Jul 2001
Yesterday's patches are in 2.4.6-ac5. So, time to send in another
batch. This will get him up to my 2.4.6-2um. Today's batch contains
another batch of
random fixes and Greg Lonnon's
ubd COW
patch. See this
page for more information on the ubd COW driver.
I also checked in all the userspace stuff, including the deb builder,
recent changes to the tools, and the website, which I hadn't checked
in for quite a while.
15 Jul 2001
Those two pesky patches finally made it to Alan OK and were included
in 2.4.6-ac4. This gets the ac tree up to 2.4.5-8um. The next batch
will bring him up to 2.4.5-10um. It includes a bunch of
miscellaneous
fixes, the first merge of the
iomem patch,
and an mconsole
update which makes gdb and the ubd driver hot-pluggable and
runs mconsole stuff from a tasklet rather than inside the interrupt.
With some more symlink abuse, I merged the last of Chris Emerson's ppc
port patch.
14 Jul 2001
Two of the three patches I sent to Alan were broken again. However, I
figured out why. My devious little mail reader was breaking lines
when it sent out the mail, which was way too late for me to eyeball it
to make sure it wasn't messing up. Turning off this behavior results
in much better patches at the other end of the line.
I played with the UML .deb builder and got a workable .deb out of it.
I think I figured out why the process gets a checksum error at the
very end - it's on hostfs, and hostfs reads through the page cache,
but doesn't write through it. I'll have to check with a filesystem
guru on this, but it sounds right to me. Putting the process on a
normal block device results in good checksums.
13 Jul 2001
I put out a couple more patches. Highlights include
Thanks to Simon Blake, I tracked down an interesting bug last night.
The UML build turns off __i386__ in order to throw out some very
hardware-specific code that UML definitely doesn't want.
This also turns off the i386 definition of FASTCALL, which invokes an
in-register parameter passing convention that gcc supports. This
wouldn't be a problem, except that UML borrows code from the i386 port
which assumes that this convention is being used.
In the case that I was looking at, rw_down_write_failed() was getting
its semaphore address from the wrong place and using a random
userspace address as its semaphore. This could cause all kinds of
interesting side-effects, like kernel corruption from two threads
using two different random addresses as the same semaphore or process
memory corruption from the kernel writing semaphore stuff into its
memory. Hopefully, this fix will eliminate some of the strange
crashes that people occasionally see with UML.
Here is a more
detailed description of the bug and its side-effects.
I sent the two broken patches (the
mconsole
and 64-bit
patches, see the 30 Jun 2001 entry for
descriptions) to Alan again. Hopefully they aren't broken this time.
Also, I
fixed
a few build problems that turned up lately.
7 Jul 2001
I integrated Greg Lonnon's ubd COW patch today. It allows multiple
UMLs to share a filesystem
read-write by storing the changes in a
private file. This private file can be considered to overlay the
read-only shared file. All writes go into the private file, and reads
come from the private file if it has a valid block and from the shared
file if not.
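The read and write paths described above can be modeled in memory like this. It is a sketch under assumptions, not Greg's patch: the block size, the device size, and the cow_dev structure are invented, and the real driver works on files rather than arrays. The bitmap records which blocks have a valid copy in the private file.

```c
#include <assert.h>
#include <string.h>

#define COW_BLOCK_SIZE 512  /* assumed block size */
#define COW_NBLOCKS    8    /* assumed device size, in blocks */

struct cow_dev {
    const unsigned char *backing;   /* shared, read-only file */
    unsigned char priv[COW_NBLOCKS * COW_BLOCK_SIZE];   /* private file */
    unsigned char bitmap[(COW_NBLOCKS + 7) / 8];    /* valid private blocks */
};

/* Nonzero if the private file holds a valid copy of block blk. */
static int block_is_private(const struct cow_dev *d, unsigned blk)
{
    return (d->bitmap[blk / 8] >> (blk % 8)) & 1;
}

/* All writes go into the private file and mark the block valid there. */
void cow_write(struct cow_dev *d, unsigned blk, const unsigned char *data)
{
    assert(d != NULL && blk < COW_NBLOCKS);
    memcpy(d->priv + blk * COW_BLOCK_SIZE, data, COW_BLOCK_SIZE);
    d->bitmap[blk / 8] |= (unsigned char)(1u << (blk % 8));
}

/* Reads come from the private file if it has the block, else the shared one. */
void cow_read(const struct cow_dev *d, unsigned blk, unsigned char *out)
{
    const unsigned char *src;

    assert(d != NULL && blk < COW_NBLOCKS);
    src = block_is_private(d, blk) ? d->priv : d->backing;
    memcpy(out, src + blk * COW_BLOCK_SIZE, COW_BLOCK_SIZE);
}
```

The backing data is never modified, which is what lets any number of UMLs point at the same filesystem image.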
This allows a huge savings in disk space for people running many UMLs
with large filesystems. It probably will help performance, since the
caching requirements on the host are similarly reduced.
4 Jul 2001
Two of the last three patches I sent Alan somehow got corrupted. I
suspect that what happened was I added spaces accidentally while
reading the patch in my mail composition window by trying to page it
with the space bar, then messed up the patch when deleting the
spaces. So, I'll send them in again.
Rik van Riel has been visiting for the last couple of days. He was in
Boston for Usenix, and was visiting EMC and MCLX (where a number of my
former coworkers from DEC now work) after the show. Since I live a
couple of hours north of Boston, I
invited him up. In doing so, I acquired the responsibility of getting
him to Logan airport at the same time that 2M people were going into
Boston to see the 4th of July fireworks and concert. I ended up
putting him on a bus that ran from outside the city straight to the
airport. I haven't heard anything from him since, so I suppose that's
good news.
2.4.6 was released last night. It turns out to be a piece of cake.
The well-known softirq fix is the only thing that needed changing. I
stuck that in, and it built and ran through my tests without a
problem. The patch is released, and I'll probably finish the rest
tomorrow.
30 Jun 2001
I sent Alan patches which will bring him up to 2.4.5-8um, which include:
a collection of small fixes,
including ^S/^Q support for the console, some ubd driver cleanups, and
the TASK_UNINTERRUPTIBLE fix
Lennert's reimplementation of the 64-bit
file support - the first try used libc's magic support for
popping the 64-bit interfaces under the 32-bit names. That broke UML
modules badly. This version explicitly uses the 64-bit interfaces and
seems a lot healthier.
Lennert's management
console patch. This version has support for getting the
kernel version, halting and rebooting the system, and turning the
debugger on and off.
Last night, the f00f bug was bugging me, so I fixed it. It turned out
that the tracing thread was routing SIGILL and SIGBUS incorrectly.
Fixing that causes f00f to SIGILL properly.
29 Jun 2001
I found and fixed the TASK_UNINTERRUPTIBLE hang last night. It turned
out to be caused by an interrupted write in the block driver. The
driver didn't check the return value, so didn't notice that an IO
request it sent to the IO thread didn't go anywhere. That shut down
the disk IO system, which ultimately results in the whole system being
deadlocked waiting for IO that's never going to happen.
That, plus a few other things, are checked in as 2.4.5-11um.
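The general shape of the fix, checking what a write actually did instead of assuming it went through, can be sketched as follows. This is a Python illustration only, with a made-up helper name; the driver itself is C, and Python already retries a write interrupted by a signal on its own, so the sketch only shows the return-value check and the loop.

```python
import os


def write_all(fd, data):
    """Write every byte of data to fd, or raise if the write makes no progress."""
    view = memoryview(data)
    while len(view) > 0:
        n = os.write(fd, view)   # may write fewer bytes than requested
        if n <= 0:
            raise OSError("write made no progress")
        view = view[n:]          # keep only the part that still has to be sent
```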
In other news, Bill Stearns, who's always looking for more devious
things to inflict on UML, happened across the
Linux Test Project
and decided to run it on UML. UML did pretty well. There were three
failures, two of which also fail on the host. The other is the f00f
test, which causes UML to hang. I applied the obvious fix of relaying
SIGILL from UML to the process. That fixed the hang, but after a long
pause, the test's SIGILL handler apparently gets called twice.
26 Jun 2001
Those last two patches made it into ac19. Time to start thinking
about bringing the ac tree up to -7um.
I finally got Greg Lonnon's iomem patch into UML. This allows a
process outside UML to communicate with one inside (or with a UML
driver) through a mmapped file.
I've also been chasing the TASK_UNINTERRUPTIBLE hang that a few people
have been seeing. It happens most easily under UML, apparently. I'm
using a recipe discovered by mistral to reproduce it (two infinite
loops each diffing two kernel pools). The longest it's taken to
reproduce is about 30 minutes. It hung on boot once. The others have
been in the 5-10 minute range.
I had a long chat with Al Viro last night with him telling me what he
wanted to see from gdb and me providing it. He ended up being puzzled
about what was happening. Following a suggestion from Daniel
Phillips, I've started instrumenting buffer_heads and pages to see
what happened to the ones involved in the hang.
25 Jun 2001
It's -ac patch time again. I boiled the -5um to -6um changes down to
two patches:
a miscellaneous fixes patch which
adds some IP address sanity checking to the ethertap backend, fixes
a couple of process signal delivery races, cleans up the associated
thread data a little, fixes a swap bug (which caused swapped-out pages
to never be unmapped from their processes), and gets rid of the last
vestige of the mm_changes code.
a timer patch which attempts
to eliminate missing clock ticks by never disabling the timer and
keeps track of ticks which happen when it's not safe to call the timer
IRQ. This improves things, but it doesn't eliminate missing ticks
under load.
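The tick-tracking idea in the timer patch can be sketched roughly like this. It is an idealized Python model with invented names (the class, the irq_safe flag, the counters), not the UML timer code, and unlike the real patch it loses nothing under load.

```python
class TickCounter:
    """Toy model: remember ticks that arrive while the timer IRQ can't be run."""

    def __init__(self):
        self.irq_safe = True   # hypothetical flag: is it safe to call the timer IRQ?
        self.pending = 0       # ticks seen while it was not safe
        self.delivered = 0     # ticks actually handed to the timer IRQ

    def tick(self):
        # The timer itself is never disabled; unsafe ticks are only deferred.
        if self.irq_safe:
            self.delivered += 1
        else:
            self.pending += 1

    def make_safe(self):
        # Once it is safe again, replay every deferred tick so none is lost.
        self.irq_safe = True
        self.delivered += self.pending
        self.pending = 0
```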
22 Jun 2001
Some time around ac16 or ac17, someone added a call to
linux_booted_ok() which the ports have to implement. So, I sent
the patch to Alan today.
And a short bit later, I got a reply saying not to bother. The
linux_booted_ok thing was a temporary test that's going to be
removed. So, it won't appear in my pool, but if you absolutely want
to run the ac16/ac17 UML, apply that patch.
I spent the better part of the afternoon in IRC trying to figure out
the hang that mistral is seeing. No joy, but I did learn more about
the problem. I'll attack it again later.
21 Jun 2001
gdbbot got its first test yesterday when I looked at the problem that
Chris Emerson is having with UML/ppc hanging during boot. I didn't
find the problem, but was able to check that signal delivery (which
was what I thought was broken) was working fine. The next step will
be to do a post-mortem on the hang.
20 Jun 2001
I wrote an IRC gateway for gdb. This allows a gdb (like the UML kernel
debugger) to be controlled from an IRC channel. The intent is that if
someone sees a bug that I can't reproduce, but want to look at, that
person's UML gdb can be attached to an IRC channel where I can poke
around and see what's going on.
I also integrated Lennert's management console patch. This is a very
low-level interface to the kernel (like the i386 SysRq interface).
The main use for it right now is to hot-plug devices. At this point,
only the ubd driver and gdb support this. So, you can add and remove
block devices from your UML without having to reboot it. You can
switch gdb in and out the same way. I will also do the consoles,
serial lines, and network interfaces at some point as well.
15 Jun 2001
The two patches I sent to Alan yesterday are in 2.4.5-ac15. Alan
horribly mangled Harald Welte's name, unfortunately.
Today was a patch bashing day. I merged in a good number of the
patches in my queue.
Today was also the (extended) deadline for abstracts for ALS2001. So,
I sent one in. This is the most explicit that I've been so far about
my future development plans for UML. So, if you want to see how weird
things are going to get, read all about it
here.
14 Jun 2001
IBM put out a
Linux security whitepaper
in which UML gets a pretty lame mention (down towards the bottom, there's
some prose which is basically lifted from my site). Thanks to Bill Stearns
for spotting it.
I'm finally getting around to sending off the latest stuff to Alan. The
ac tree is now two cvs updates behind. The first set will be the -5um update,
which is basically
some random fixes, including an
updated defconfig, making the console xterms go away when the machine shuts
down, making a read-only hostfs really read-only, hooking up a couple of new
system calls, allowing UML to boot on hosts with a 2G/2G address space split
12 Jun 2001
Banged out a bunch of bugs. I started booting UML with 24 megs and
plenty of swap, and running a whole bunch of stuff on it to overload
it and put it heavily into swap. This turned up a couple of signal
delivery races and a swapping bug. The signal races would cause
various strange behavior. Mostly what I saw was hangs with an
infinite sequence of sigreturns. The swap bug caused pages not to be
unmapped when they were swapped out. Obviously, this is very bad.
With the help of rcastro, I fiddled my page table macros to fix this.
I'm still seeing process segfaults. It looks like pages are being
swapped out and swapped back in with the wrong data.
8 Jun 2001
I fixed a bunch of buglets, like the console xterms not going away,
readonly hostfs not being readonly, merged Harald Welte's mcast
network transport, and a few other things, and checked them into CVS.
I also checked in the tools, so everything ought to be up-to-date and
consistent at this point.
I also have the .deb build procedure working, I think. The
uncertainty is due to the fact that I think there's a hostfs data
corruption problem. My development box runs Red Hat, and I couldn't
find RPMs for the Debian tools, so I just installed them in my Debian
filesystem (apt-get rocks, BTW :-), mount the source pool inside a
Debian UML via hostfs, and run the debian build procedure there. The
problem is that the gzipped source tarball has its md5sum recorded at
the beginning of the build and checked again at the end, and they
don't match. I also ran md5sum three times in a row on that file
while the builder was running, and got three different answers. So,
it looks like I have some debugging to do there.
3 Jun 2001
True to yesterday's promise, I sent Alan three more patches
A fix for the ethertap driver
which fixes mishandling of large packets
I got the networking cleaned up enough that I'm happy for the general public
to use it. There are three host transports, ethertap, the routing daemon,
and slip. You can have the helper do the host setup for you or not. If you
do, then getting the network running is a matter of a command line switch,
ifconfiging the device, and setting routes inside UML. This is a huge
usability improvement over the previous situation.
This is all checked in, and I'm currently building 2.4.5, which I'll release
in the next day or two.
18 May 2001
I fixed the slip interface, cleaned out some unused code which had become
a portability problem, and fixed the fix for the crash caused by someone
typing at the console too soon. It is all checked in to CVS.
17 May 2001
I grabbed 2.4.4-ac11 to see if Henrik's patch was in there, and it was. So
I don't have to worry about it any more. I guess it made ac9, but Henrik
didn't get credit for it in the ac changelog.
In other news, the ethertap interface is working reasonably well. It couldn't
do HTTP until I figured out that the mtu on the host tap device needed to be 16
bytes less than the UML eth0 mtu. The helper is now more helpful. In order
to talk to the rest of the world through it, you basically just have to
ifconfig the device inside UML and add a route to the outside world, and you're
done. Much better than what we had before.
13 May 2001
Five of yesterday's six patches made it into 2.4.4-ac9. The lonely exception
was Henrik's hostfs blocksize fix.
12 May 2001
I decided to clean out my patch backlog a bit. So, I merged and sent to Alan
the following patches:
most of Roman Zippel's Makefile
cleanup, minus some bugs and some unistd.h stuff that I want to
look at more carefully
Henrik Nordstrom's
one-line fix
that allows root hostfs to work when something has changed the ubd block size
the removal of
thread.starting_exec,
which was ancient history, is no longer needed, and was causing hangs under
heavy load
a fix for a race while getting
the stack snapshot - this was seen on SMP boxes because the child gets to run
sooner than on a UP box
11 May 2001
Chris Emerson got UML/ppc booting to a shell prompt! His uml-devel post
is here. This is the first UML port,
and it showed me how to make UML portable. There aren't really all that many
non-portable things in UML, so a port doesn't take all that much code. Based
on his work, I'm going to write up a UML porting guide, which will be found
here when it's done. If that
link is dead, keep trying until I have something to put there.
In other news, I fiddled the ethertap driver backend so that the read hang
has gone. With some help from Bill Stearns, I also figured out how to talk
to the rest of my network through the ethertap device.
9 May 2001
I got the ethertap backend to the network driver working today and I
submitted it to CVS. I haven't been able to get it
to talk to anything but the host over the tap device, but it communicates
with the host just fine.
4 May 2001
My 2.4.4 fixes, except for Andrew Morton's exitcall fix, are in 2.4.4-ac3.
I wrote and submitted my OLS paper yesterday, two days late. It's also
posted on this site, as TeX and HTML.
On the network driver front, I've got the unified front-end plus the slip
back-end working. I've started working on the ethertap back-end. After that
will come the socket and TUN/TAP back-ends. This stuff is in CVS, but I
haven't updated the patch because the ethernet driver is broken, and I don't
want a bunch of complaints from people who grabbed the latest patch without
knowing what was in it.
Update: Andrew's patch made it into 2.4.4-ac5. I was beginning to wonder.
That cleans out my pending ac patches.
The last batch of patches I sent to Alan made it into ac14.
I started looking at the two network drivers today. I think it won't
be too hard to merge them. They're pretty similar, since they're both
derived from the same code base, and the differences seem to be
orthogonal. They don't seem to have done the same things in
fundamentally different ways. I posted
my impressions for the devel list to comment on.
Andrew Morton looked at the shutdown crash that people started seeing
lately and figured out that it
was caused by /proc being unregistered before something else tried to
remove its proc entries when it was unregistered. He sent in a patch
which reversed the __exitcall order, and Henrik Nordstrom
reported that it fixed the
crash for him.
22 Apr 2001
The fixes I made on Thursday were broken. The initrd fix
introduced a name clash with a
function in hostfs, and the sleep fix made
sleep always hang. I
didn't notice because I was fixated on getting UML to boot from an initrd
image, and that wasn't obviously showing the problem.
Anyhow, I made the fixes, submitted them to CVS, updated the patch, and sent
fixes off to Alan. I hadn't sent in Thursday's changes to Alan, so the
patches are the real thing, not just patches to the patches.
The patches I sent in a couple days ago are all in ac10 by the looks of
Alan's change log.
I figured out how initrd support is supposed to work, and implemented the
necessary stuff in UML. I booted a RH initrd image far enough to convince
myself that it works.
I also figured out the sleep hang. It turns out to be a race between the
registration of the timer irq and the first time the timer interrupt calls
do_IRQ. The timer was enabled before the registration, so if an interrupt
happened in that window, do_IRQ would bail out early, leaving the irq
permanently marked as in progress and pending. This locked out all future
timer interrupts from going through the irq system, so counters would never
be decremented, and sleeps would never wake up.
I had a chat on #kernelnewbies
with Rodrigo de Castro, who's using UML
for his compressed
caching project. He understands swapping better
than I, and told me why my new pte bits were breaking it. So, I fixed
it, and swapping now seems to work.
11 Apr 2001
Sent Alan the patches necessary
for UML to build and run in his tree.
I got back a reply which said in its entirety, "ok", which I think is good.
Maybe they will make -ac5.
10 Apr 2001
UML is now in 2.4.3-ac4. I was on IRC with Alan and a bunch of other
hackers when he merged it. He looked like he was going to start
asking a bunch of embarrassing questions about my locking, but he was
concerned only about one thing, and that was a special case that
didn't need locking.
Too bad it doesn't build. The patch that Alan merged was against the
Linus 2.4.3 tree, which differs in a few respects from the current -ac tree.
8 Apr 2001
Released 2.4.3 a week or so late. Blame Linus for releasing it the
night before the kernel summit officially started. We were all in San
Jose and not able to react.
4 Apr 2001
A couple more summit tidbits that I forgot to mention in my last entry:
Willy is thinking about using UML as a testbed for NUMA support. He
wants to fire up a number of virtual machines and have them hook
themselves together so they can access each other's memory through
device files. This would allow people who don't have access to the
fancy hardware to develop and debug Linux support for these boxes.
UML may appear in the -ac trees at some point. Alan wanted to include it, but I
had sounded fairly negative towards that in the past. What I don't
want just yet is for UML to hit the Linus tree. Alan said he doesn't
send stuff to Linus if the author doesn't want it sent, which is fine
by me.
2 Apr 2001
Back from the kernel summit. I wanted to get a feel for whether four
things that I wanted from the host kernel were reasonable. I got two
OKs and two dings. That's fine, since the OKs were the important
ones. Here's the run-down:
Userspace manipulation of address spaces: I want to be able to
create, populate, release, and switch between mm_structs. This will
speed up UML context switches, and greatly clean up that code. I
asked Linus, and he said OK to the fairly static things that I want to
do. Apparently, there are serious complications when fiddling with the
address space of another process, but that's not what I want to do.
System call interception via signals: In order to avoid the
context switching between threads involved in virtualizing a system
call, I want to have a process intercept its own system calls by
having the host kernel deliver a signal whenever it makes a system
call. The handler would be the current syscall_handler, which would
read the arguments from its sigcontext_struct. This would change a
system call virtualization from four context switches to a signal
delivery and return. I infer Alan's OK on this from my describing it
in his presence and him not objecting.
Notification when a UML thread sleeps in the kernel due to a
page fault: For the sake of cleanliness and completeness, I want to
be able to have UML know when a thread is sleeping in the kernel and
be able to call schedule when that happens. This would let UML do as
much work as possible given its state of memory residence. Alan
rejected this on the grounds that UML would be the only sane user of
this mechanism.
Full kernel preemption: This was implicitly rejected as a UML
need by Alan's rejection of the previous item. If UML is to call
schedule whenever it sleeps, the whole kernel needs to be preemptible
because the swapped-out page might be a kernel page. This doesn't at
all mean that preemption isn't going to happen. Rather, it means that
UML doesn't have a particular need for it.
Other tidbits:
A number of people consider UML a very neat hack, including Ben
LaHaise, Andrea Arcangeli, and Eric Raymond.
Alan turns out to be a UML user. For the last month or so, he's
been booting his kernels as UML kernels before booting them as native
kernels. This is in part because recovery from a totally messed up
kernel is a lot easier with UML than with a native kernel. UML is
also his ptrace test case. It apparently does things with ptrace that
nothing else tries.
Al Viro is thinking about porting UML to Plan 9. He asked me
about what it would take. He had thought through the ptrace
requirement, and I told him about the mmap requirement, which is the
next hurdle. Plan 9 apparently doesn't have mmap. He's going to
think about how to do that.
On the trip over, I did some debugging, and I also threw in some
patches. There is now a "umid=<name>" switch for providing a virtual
machine with an identifier. This causes a pid file to be created
using that name, which is something that makes controlling multiple
UMLs through a nice UI a lot easier. This file will be replaced with
a socket to the machine console that Lennert is working on.
I also implemented __exitcall, which declares a procedure which is
to be called on machine shutdown. This was prompted by the need to
remove the pid files when the virtual machine goes away. I also
converted other existing cleanups to use this mechanism.
25 Mar 2001
I checked in a bunch of changes again. Henrik Nordstrom provoked me into
making it possible to use hostfs as a root directory by sending me a patch
that did it, but which was wrong (IMHO). He did it in the same way that
nfsroot and initrd support is done, which is by adding a special block of
code to fs/super.c inside CONFIG_HOSTFS_ROOT. That works fine, but I didn't
want to annoy Al Viro. What I did instead was to add a second registration
of hostfs as a device (not a virtual) filesystem and change the ubd driver
to support being given a directory rather than a file or block device. What
happens is that when a read request comes in to the ubd driver, it is
guaranteed to be a request for the superblock. The driver constructs a fake
superblock with the directory name in it. hostfs recognizes that and claims
the mount as its own. After that, it goes back to being a normal virtual
filesystem and doesn't bother the block driver again. The involvement of the
ubd driver is a bit of a kludge, but it works well on the command line, and I
can't think of anything better besides some kind of general support for
virtual root filesystems that cover nfs and initrd as well as hostfs.
There were also a number of patches from other people: Lennert Buytenhek's
modify_ldt patch, a bunch from Greg Lonnon, one from Gordon McNutt, and
a buffer overrun patch from Henrik Nordstrom.
23 Mar 2001
I spent a few days fixing the infinite recursive context switch bug. That
was a lot more complicated than I expected. The fix involved replacing the
shadow page tables that represent the mappings on the host for each process
with bits in the pte that say whether it is up-to-date or not. These bits
are set in the little functions that change ptes and cleared in fix_range
after it's updated the mappings. Since the process page tables are per-mm
and not per-process, a mapping that was changed in a multi-threaded process
would only be updated for one of the threads. This meant that UML processes
that share a memory context also need to share a memory context on the host.
This in turn complicated exec, since it now needs to create a new host process
in order to get out of a shared UML address space. I implemented this a few
times, and on about the third try, I got something that works.
So, aside from eliminating a nasty bug, this also makes the modify_ldt fix
more useful, since it now should work properly without any extra code, and
opens the way to more efficient context switching between threads, since they
won't have to go through the remapping that processes need.
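Roughly, the pte-bit scheme works like the following toy model. It is a Python sketch under assumptions: the bit value, the dictionary-based page table, and the signature given to fix_range here are all invented for illustration, and the host remapping is reduced to a callback.

```python
PTE_NEEDS_UPDATE = 0x1  # hypothetical spare bit; the real bit layout is not given here


class PageTable:
    """Toy page table: virtual page number -> [host frame, flag bits]."""

    def __init__(self):
        self.ptes = {}

    def set_pte(self, vpn, frame):
        # Any function that changes a pte also marks it as out of date on the host.
        self.ptes[vpn] = [frame, PTE_NEEDS_UPDATE]


def fix_range(table, start, end, remap):
    """Push stale mappings in [start, end) to the host, then clear their flag."""
    for vpn in range(start, end):
        pte = table.ptes.get(vpn)
        if pte is not None and pte[1] & PTE_NEEDS_UPDATE:
            remap(vpn, pte[0])              # stand-in for updating the host mapping
            pte[1] &= ~PTE_NEEDS_UPDATE     # mapping is now up to date
```

Because the state lives in the pte itself, nothing has to be allocated while a context switch is in progress, which is the property the shadow page tables lacked.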
18 Mar 2001
Lots of bugs have been fixed. I got a little list of hostfs
complaints from Al Viro, which I think I fixed. hostfs is now pretty
solid. I fixed the naming problem which cropped up if you held a file
open, then moved its directory and accessed that file by its new
name. You'd get 'file not found'. This is because I stored the full
host pathname in the inode, and when you changed its name while
holding it open by the old name, the inode continued to contain the
old bogus name. This was fixed by having anything that needs a
filename walk the dentry tree back up to the root, constructing the
current filename. The other major problem was that readdir didn't
work, resulting in missing files when a directory was copied. These
are fixed. What remains is to get rid of some interfaces which will
complain about not being implemented.
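The filename fix amounts to something like the sketch below: instead of caching a full path, rebuild it on demand from parent links. This is a Python toy with a made-up Dentry class, not the kernel's dentry structure or the hostfs code.

```python
class Dentry:
    """Minimal stand-in for a dentry: a name and a parent pointer (None at the root)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


def current_path(dentry):
    # Walk parent links up to the root, collecting names, then reverse them.
    parts = []
    while dentry.parent is not None:
        parts.append(dentry.name)
        dentry = dentry.parent
    return "/" + "/".join(reversed(parts))
```

Renaming a directory only changes one node's name, so every open file beneath it picks up the new path the next time it is computed.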
The signal delivery race is fixed. That induced me to clean up a lot
of old, crufty code in the kernel entry and exit paths. That's
sensitive code, and a few bugs in it caused some very selective and
very strange behavior.
I've put together an RPM for UML just in time for the April Linux
Magazine to hit the streets with my article in it. This is good,
because the article claims that RPMs are available, which they weren't
at the time that I wrote it. This also goes some way towards
simplifying the network mess. The RPM installs the umn_helper, which
lets the umn device run without any help from the user. It also
installs the eth tools, which are otherwise hard to find unless you
pull them from cvs.
25 Feb 2001
CVS update today. I fixed a few bugs and cleaned up a bunch of
things.
I've started keeping an up-to-date TODO list. This will help
me not forget anything important. I post it to the -devel list
occasionally to prompt people to send in whatever gripes they have.
24 Feb 2001
I'm releasing 2.4.2 today. It has a number of bug fixes and no
significant functionality changes.
A number of bugs have cropped up lately. The most significant is a
race when a process signal handler returns. There is a narrow window
in which an interrupt can cause a crash. The fix is to implement
sigreturn like the other arches and run almost all of the kernel code
on the kernel stack rather than the process stack as I'm doing now.
8 Feb 2001
I managed to reproduce a number of panics and fixed all but one of
them. The key was hitting UML with a high-concurrency ab run with
requests that fire off perl scripts which make mySQL requests, with
not too much memory, so that it is at least starting to swap.
This reproduced two bugs. One was caused by a failed memory
allocation in the middle of setting up a tracing thread request. The
failure caused a schedule, which caused a switch request, which blew
away the first, partially-set-up request. When the process was
rescheduled, its request was garbage, confusing the tracing thread
into detaching it. This was fixed by moving the allocation to before
the request started being set up.
The other one, which isn't fixed yet, is caused by the shadow page
tables maintained by arch/um/kernel/tlb.c. It occasionally needs to
allocate a page table when it sets up ptes for a new range of memory.
However, if the context switch that it's dealing with was forced by
low memory, then that allocation will fail, causing a recursive
context switch, and recursion continues until either the stack guard
page is hit, or, in the case of a kernel thread, the task structure is
polluted. I'm going to fix this by following a suggestion by prumpf,
which is to use some spare bits in the pte rather than a separate page
table to figure out what parts of the address space need updating.
And, panic number three, which is also fixed, was caused by a faulty
notion of when a thread is in kernel space. The old way was to look
at whether the thread is being traced. That fails when a breakpoint
was put in a signal handler before it requested that tracing be turned
off. The fix is to look at the current stack pointer. However, that
causes problems when a signal is being delivered to a process. In
this case, there is kernel code running on the process stack. So, a
flag was added to the thread structure when this is happening.
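The resulting test can be sketched as below. This is a Python illustration with assumed names and an assumed stack layout (a contiguous kernel stack starting at kstack_base); the real check lives in the UML thread code and is not shown in this entry.

```python
def in_kernel_space(sp, kstack_base, kstack_size, signal_delivery=False):
    """Return True if the thread should be treated as running kernel code.

    sp              -- current stack pointer
    kstack_base     -- lowest address of the thread's kernel stack (assumed layout)
    kstack_size     -- size of that stack in bytes
    signal_delivery -- hypothetical per-thread flag, set while kernel code is
                       running on the process stack to deliver a signal
    """
    if signal_delivery:
        return True
    return kstack_base <= sp < kstack_base + kstack_size
```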
29 Jan 2001
Back from Sydney. The talk went pretty well, it was well attended,
and there was a fair amount of interest in UML there. Rik van Riel
and I wandered around Sydney until the following Friday.
While I was in .au, my OLS paper proposal was accepted, so it
looks like I'll be doing my song and dance in Ottawa this summer.
3 Jan 2001
Updated the web site with a couple pages describing hostfs and the new
console/serial line input specification.
The hostfs memory corruption problems are fixed. slab debug found
them for me. They turned out to be two string buffer overrun bugs.
I'll release a patch with the fixes pretty soon.
1 Jan 2001
I released the uml patch for 2.4.0-prerelease. You can find it
here.
I'm going to make the full release tomorrow, hopefully after fixing
the hostfs crash and getting socket inputs to work.
Today is the deadline for a first draft of my Linux Magazine article
and for OLS paper proposals. I sent them both in last night. We'll
see what happens.
27 Dec 2000
hostfs now pretty much works. I built UML from inside itself on
hostfs. I fixed some bugs in the write code, added enough mmap
support to run binaries from hostfs, and implemented statfs. However,
there is some as-yet unexplained memory corruption going on.
In an attempt to reproduce the MySQL problems that a couple people are
seeing, I moved some of my work, which is heavy on MySQL and perl,
into UML. I've seen no problems, which is disappointing because I'm
not any closer to finding the bug, but also nice because it shows that
it's possible to do real work inside it.
I've also been banging on the ethernet driver trying to reproduce the
server buffer overflow that I saw earlier. No dice there either.
The swapoff bug is now fixed. It turned out to be a bad idea to give
kernel threads both a non-NULL mm and active_mm. That code has been
that way for ages. I have no idea when or why it became a bad thing.
9 Dec 2000
hostfs is now almost all working. mknod doesn't work, and you can't
run binaries out of a hostfs filesystem.
I also fixed that pesky linking failure that people have been seeing
sporadically for a while. I noticed that profiling was turned
on in the latest case that showed up in my inbox. I did a profiling
build of my own and lo! it failed to link. Since I could reproduce
it, I was out of excuses for not fixing it, and so I did. You can see
a full explanation of the problem here.
Those changes plus a couple of smaller ones are now in CVS. They
aren't in the latest patch because the SourceForge upload system has
been seriously b0rked. I'll update the patch when I can.
7 Dec 2000
I fixed the known bugs in the block driver. The
UML# dd if=/dev/ubd/0 of=/dev/null
hang was due to the driver returning to the block layer rather than
continuing to process the queue when it found an out-of-range I/O
request. The dbench corruption was due to the elevator rearranging
the request queue while a request was in flight. When that request
finished, the interrupt handler was supposed to retire it by removing
it from the head of the queue. The problem is that the elevator put
some other request at the head, and that request was retired without
ever being done. Meanwhile, the original request was pushed back in
the queue somewhere, and it got done twice.
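One way to avoid that kind of mix-up is to remember which request is in flight and retire that one, wherever the elevator has moved it. The sketch below is a Python toy with invented names; it illustrates the idea and is not necessarily how the ubd driver was fixed.

```python
from collections import deque


class RequestQueue:
    """Toy request queue in which an elevator may reorder entries at any time."""

    def __init__(self):
        self.queue = deque()
        self.in_flight = None

    def start_next(self):
        # Remember exactly which request was handed to the I/O thread.
        self.in_flight = self.queue[0] if self.queue else None
        return self.in_flight

    def complete(self):
        # Retire the request that was actually done, wherever it now sits,
        # instead of blindly popping whatever is at the head.
        done = self.in_flight
        if done is not None:
            self.queue.remove(done)
            self.in_flight = None
        return done
```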
Dan Aloni has started the Windows port. He got most of the
kernel to compile. There are a number of undefined symbols from files
that don't compile yet. Overall, though, it's looking pretty good.
30 Nov 2000
I updated the site a little. The major changes involved the "ARCH=um"
build change. The compilation page is now very explicit about that and
there's a FAQ entry for it.
I fixed up the block driver a little. In the past, if you did
UML# dd if=/dev/ubd/0 of=/dev/null
when it ran off the end of the device, it could apparently hang. This
is fixed. The problem with dbench is not fixed, but I made the
driver's synchronous mode accessible from the command line with the
"ubd=sync" switch. In synchronous mode, the driver has no problems
with dbench.
18 Nov 2000
I've got hostfs starting to work reasonably. ls now works, you can cd
around and cat things. You can't write anything, create files, or
execute them yet.
17 Nov 2000
After a bit of a hiatus, I did a CVS update. A number of buglets
relating to running UML as a daemon were fixed. The build was cleaned
up - I had hard-coded "gcc" instead of "$(CC)" in my Makefiles, the
top-level Makefile is now able to do native and user-mode builds, and
I cleaned up the drivers and fs Makefiles so that they let Rules.mk do
all the hard work.
I'm also back on hostfs. I fixed the mm problems that it uncovered.
It can now do ls on the top-level directory.
1 Nov 2000
Linus finally released the final test10 yesterday, so I made my
release last night, with a freshmeat announcement this morning. The
stack overflow problems in test9 are fixed by doubling the stack
size. There is also an inaccessible page between the two stack pages
and the task structure, so there shouldn't be any task structure corruption.
There were a number of other fixes. At the last minute, I found and
fixed a nasty race which resulted in the kernel tracing its own system
calls, resulting in some nasty stack corruption which made it hard to
figure out what happened. UML can now run when its main console is
not a terminal (i.e. /dev/null). That didn't work because it flipped
the terminal between raw and cooked mode, complaining via printk if
the ioctls failed. That led to an infinite recursion of printk
error messages which ultimately resulted in a segfault. I also made
it possible to mount host devices again. That broke when I made
the block driver check IO requests against the device size so it could
report errors for out of bounds IO. It turns out not to be possible
to get the size of the media behind a block special file, as far as I
can tell. So, as far as the block driver was concerned, block devices
had zero size, and all IO was out of bounds.
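A minimal sketch of how a zero size turns every request into an out-of-bounds one, assuming the size came from something like stat(). The names size_from_stat and io_in_bounds are hypothetical and are not taken from the ubd driver. On Linux, st_size for a device special file is normally 0, since it describes the inode rather than the media behind it.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Size as reported by stat(), or -1 on error. For device special files
 * Linux normally reports 0 here, whatever the media size really is. */
long long size_from_stat(const char *path)
{
    struct stat st;

    assert(path != NULL);
    if (stat(path, &st) != 0)
        return -1;
    return (long long)st.st_size;
}

/* Hypothetical bounds check: accept a request only if it lies entirely
 * inside [0, dev_size). */
int io_in_bounds(long long offset, long long len, long long dev_size)
{
    return offset >= 0 && len >= 0 &&
           offset <= dev_size && len <= dev_size - offset;
}
```

With a reported size of 0, io_in_bounds(0, 512, 0) is false, so even the first sector is rejected.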
I also started work on the hostfs filesystem. This is a virtual
filesystem which provides access to the host filesystem. The theory
is straightforward - vfs calls are converted into the equivalent
system calls on the host - but this uncovered a subtle memory
management bug. If a libc routine which mallocs memory is called, and the
break is increased, that extra memory only exists in that process. If
the kernel in another process tries using that memory (or tries
calling malloc at all), it will fault. What needs to happen is for
the context switching code to see if malloc has increased the size of
the data segment and map the new memory into the newly running
process. This also raises some SMP issues because when the new memory
is mapped in, the other processors will need to be told about it so
they can also map it. The same is true of the kernel's virtual memory.
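The check described above might look roughly like this: remember where the break was, and at each switch see whether it has moved. This is only a sketch under that assumption. struct brk_tracker and its functions are hypothetical names, and the actual mapping of the new range into the incoming process is left out.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical record of where the break was the last time we looked. */
struct brk_tracker {
    uintptr_t last_break;
};

void brk_tracker_init(struct brk_tracker *t)
{
    assert(t != NULL);
    t->last_break = (uintptr_t)sbrk(0);
}

/* Return how many bytes the break has grown since the last check and
 * store the start of the new range in *start. A real context switch
 * would then map [*start, *start + grown) into the incoming process. */
size_t brk_tracker_check(struct brk_tracker *t, uintptr_t *start)
{
    uintptr_t now = (uintptr_t)sbrk(0);
    size_t grown = 0;

    assert(t != NULL);
    if (now > t->last_break) {
        grown = (size_t)(now - t->last_break);
        if (start != NULL)
            *start = t->last_break;
    }
    t->last_break = now;
    return grown;
}
```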
20 Oct 2000
At long last, I added a page for related projects and other interesting
links.
In other news, it turns out that Michael Vines wrote a Linux
executable runner for Windows that does what a UML port to Windows
would have to do and he has GPL-ed it and made it available for anyone
who wants to incorporate it into a UML Windows port. See the
todo page for a link to his stuff.
17 Oct 2000
Back from ALS. The talk went pretty well. I'll put the slides up on
the site at some point.
I fixed the stack overflow problems that people were seeing. The
stack is now two pages long, with an inaccessible third page
protecting the task structure, which is on the fourth. Now, any stack
overflows will segfault rather than polluting the task structure,
making them a lot easier to debug. This is in CVS along with a few
other changes.
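For illustration, here is a userspace sketch of the same idea using mmap and mprotect. The page ordering (task structure lowest, then the guard, then the two stack pages) is an assumption made for the example, and alloc_guarded_region and probe_write are hypothetical names, not UML code. It relies on Linux delivering SIGSEGV for a write to a PROT_NONE page.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Assumed layout, lowest address first:
 *   page 0     task structure
 *   page 1     guard page, PROT_NONE
 *   pages 2-3  stack, growing down toward the guard */
enum { TASK_PAGE = 0, GUARD_PAGE = 1, STACK_PAGE = 2, NPAGES = 4 };

static sigjmp_buf probe_env;

static void on_segv(int sig)
{
    (void)sig;
    siglongjmp(probe_env, 1);
}

/* Map the four pages and make the guard inaccessible. NULL on failure. */
char *alloc_guarded_region(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    char *base = mmap(NULL, NPAGES * page, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (base == MAP_FAILED)
        return NULL;
    if (mprotect(base + GUARD_PAGE * page, page, PROT_NONE) != 0) {
        munmap(base, NPAGES * page);
        return NULL;
    }
    return base;
}

/* Try to write one byte: 1 if it worked, 0 if it raised SIGSEGV. */
int probe_write(volatile char *addr)
{
    struct sigaction sa, old;
    int ok;

    assert(addr != NULL);
    sa.sa_handler = on_segv;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGSEGV, &sa, &old);

    if (sigsetjmp(probe_env, 1) == 0) {
        *addr = 1;
        ok = 1;
    } else {
        ok = 0;
    }
    sigaction(SIGSEGV, &old, NULL);
    return ok;
}
```

A write that runs off the bottom of the stack lands in the guard page and faults immediately, instead of silently scribbling on the page below it.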
2 Oct 2000
Bill Stearns decided to go overboard on root_fs production. He's been
fiddling with the mkrootfs script so that it can handle distros other
than Red Hat 6.x. He's done Red Hat 7.0, Mandrake, and Immunix.
These are all now available from the project download page. Caldera,
Conectiva, and SuSE are in the works.
26 Sep 2000
SGI released a new version of XFS for test5 and I tried to apply it to
my test8 um pool, the idea being that I could play with xfs in
userspace. The patch went in ok, with some rejects that were not too
hard to figure out. After some work, I got it to build. It didn't
boot, though. There were some changes in ll_blk_rw.c that I didn't
understand, and it looks like they are what resulted in the block
device getting a NULL buffer to do I/O into.
So, maybe I'll give XFS another try when SGI gets it slightly more
up-to-date.
25 Sep 2000
I found out why the kernel debugging interface doesn't handle
breakpoints very well. Setting breakpoints results in process
segfaults, floating point exceptions, and other strange behavior. It
turns out that do_syscall stored the current register state in the
thread structure while determining whether the process was doing a
system call. If the process hit a breakpoint in the kernel instead,
then that overwrote the state that was stored when the system call was
called. When the system call returned, that bogus state was restored,
and the process was essentially teleported back into the kernel just
after the breakpoint, leading to all kinds of strange behavior. With
that problem fixed, things work much better. The kernel debugger
seems to be basically healthy, and works just like on a normal process.
While I was fixing breakpoints, I decided to see why gdb inside a
virtual machine crashes it whenever it sets a breakpoint. There turned
out to be a number of problems. First, SIGTRAP wasn't being delivered
to the debuggee when it hit a breakpoint. This made it hard for gdb
to find out that the breakpoint had been hit, and to remove it
temporarily so the debuggee could get by it. Then, it turned out that
PTRACE_SINGLESTEP wasn't implemented. This is used by gdb to execute
the instruction which had the breakpoint and stop on the next one.
There were one or two other buglets, but now that they are fixed, gdb
seems happy with breakpoints.
23 Sep 2000
So, I've been a little lax. Here's what's happened in the last few weeks:
two bugs were fixed, the reboot bug and the shell segfault bug. That's it.
1 Sep 2000
I realized that I am starting to lose track of bugs and functionality
requests, so I dusted off the project's bug tracking system and put
everything that I know of in it. I'm also using the patch manager to
store the fixes. The idea is that I'll put fixes there and close them
when I make a release that contains the fix.
I fixed a context-switching bug noticed by Lennert Buytenhek. The
problem turned out to be a race while
updating the address space of the process being restarted. If the
interrupt handler needed data from the kernel's vm area, and that area
hadn't yet been updated, then the kernel would crash. The
fix was to disable signals during that period of the context switch.
31 Aug 2000
I put in Andrea's LFS patch. While I was in there, I cleaned that code
up somewhat. That is some of the oldest code still remaining, and it
really needed some work. I also put in the fix for the crash caused by a
module creating a kernel thread. No word yet on whether it's the right
fix, though.
Also, Laurent Bonnaud volunteered to update the ancient filesystem in
the Debian package to potato. This is very cool. It is something
that I've been wanting to do for a long time.
25 Aug 2000
I'm releasing test7 today.
My ALS2000 paper is now available as HTML and TeX.
I redid the RH mkrootfs script. It now prompts for the info it
needs. It also works for RH6.1 and probably RH6.2, although I didn't
test that.
23 Aug 2000
Finished my ALS paper and sent it in. That's a load off my mind.
Made some more CVS checkins. This makes the various debugging options
configurable, although I haven't tested the gprof and gcov
configurations. I also added some compatibility code to make the
Debian install happier. It now recognizes that uml can have disks,
but the disk recognizer gets stuck in state 'D' for reasons I haven't
figured out yet.
Linus just released test7, so I am building it right now (right
now as I'm writing this, and not right now as you're reading it, because
I've got no idea when you're reading this). A quick check of the
patch shows no changes that I need to worry about, so this looks like
a drop-it-in-and-it-just-works patch. We'll see...
Things look good. I booted it up, and ran a few things, and
they all worked. So, I'll run the stress testers on it tomorrow, and
if that checks out, I'll release it.
21 Aug 2000
I've made the ptrace proxy, gprof, and gcov support configurable. Also
started playing with the Debian 2.2
install. It starts up ok, which the 2.1 install doesn't. It looks
like I'll have to fake some /proc/ide entries before it will deign to
admit that the virtual machine has disks. Right now, it's punting me
into the diskless install.
17 Aug 2000
Fixed a network driver bug which caused a crash when ab was run
against it. This might also fix the ping flood problem. The fix is
in cvs. It will appear for real in test7.
15 Aug 2000
I found out what was causing uml not to boot. It turned out to be a
casting bug which was making the compiler do pointer arithmetic rather
than integer arithmetic. This was a long-standing bug, and test6
changed things so it got hit more heavily. So, assuming that it's not
too badly broken now, I'll release test6 for real.
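To show the kind of difference involved (this is not the actual UML code, which the entry doesn't quote), here is a small sketch. Adding n to a long pointer moves n * sizeof(long) bytes, while adding n to the same address held as an integer moves n bytes. The buffer and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BUF_WORDS 8

static long buf[BUF_WORDS];

/* Pointer arithmetic: the compiler scales n by sizeof(long). */
ptrdiff_t offset_with_pointer_math(size_t n)
{
    assert(n <= BUF_WORDS);
    return (char *)(buf + n) - (char *)buf;
}

/* Integer arithmetic: casting to an integer first means n is added
 * as a plain byte count, with no scaling. */
ptrdiff_t offset_with_integer_math(size_t n)
{
    assert(n <= BUF_WORDS);
    return (ptrdiff_t)(((uintptr_t)buf + n) - (uintptr_t)buf);
}
```

Misplacing one cast is all it takes to switch between the two results.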
8 Aug 2000
Revamped the website. It will be put up as soon as Linus releases
test6 and I've integrated the changes in. This is because this site
talks about stuff which isn't really going to be released until then.
7 Aug 2000
Checked in changes which make the new debugging interface
more or less work. I also added a 'debug' command line switch which
starts the kernel in the debugger, so you have control of it from the start.
There are some problems with it. Commands attached to breakpoints
cause segfaults for some reason. It also can't step across a context
switch.
I also put in Rusty's patches. They completely revamp the config
mechanism. For some reason, there also seems to be very complete
networking/netfilter coverage.
5 Aug 2000
The ptrace proxy is more or less working. I've checked it in to CVS and
announced it on my devel list.
4 Aug 2000
Got a couple of patches from Rusty. I'm apparently going
to graduate to a complete port once I've applied one of them :-) It
is nice. It gives me what looks like a complete config process rather
than the one I've kludged together. He also sent in enough exports to
allow his stuff to be modular inside a UML.
I'm integrating in Lars Brinkhoff's ptrace proxy. It's partially
working - enough that I can attach to the running thread, poke around
it, set breakpoints, etc. This is without needing to detach it from the
uml tracing thread. I can't ^C gdb and have it stop the kernel
wherever it happens to be. It also doesn't seem to be following
threads as one goes to sleep and another starts up. Once these work, this
will be a huge improvement in uml kernel debugging.
2 Aug 2000
Bill Stearns pointed out a reproducible way of crashing the kernel
yesterday. It turns out that irq_save/irq_restore were completely wrong.
irq_save was enabling signals when it should have been disabling them.
This could explain a lot of the problems I saw in test4.
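Since interrupts arrive as host signals, the intended behavior can be sketched with sigprocmask. This is an assumption-laden illustration, not the UML implementation: the sketch_ names are hypothetical, and the choice of SIGALRM and SIGIO as the "interrupt" signals is made up for the example.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Hypothetical set of signals standing in for interrupts. */
static void irq_signals(sigset_t *set)
{
    sigemptyset(set);
    sigaddset(set, SIGALRM);
    sigaddset(set, SIGIO);
}

/* Save: block the interrupt signals and hand back the previous mask.
 * The bug described above amounts to unblocking here instead. */
void sketch_irq_save(sigset_t *flags)
{
    sigset_t set;

    assert(flags != NULL);
    irq_signals(&set);
    sigprocmask(SIG_BLOCK, &set, flags);
}

/* Restore: put back exactly the mask that was saved. */
void sketch_irq_restore(const sigset_t *flags)
{
    assert(flags != NULL);
    sigprocmask(SIG_SETMASK, flags, NULL);
}

/* Report whether a signal is currently blocked (1) or not (0). */
int sketch_irq_blocked(int sig)
{
    sigset_t cur;

    sigprocmask(SIG_BLOCK, NULL, &cur);
    return sigismember(&cur, sig);
}
```

Restoring the saved mask, rather than unconditionally unblocking, keeps nested save/restore pairs correct.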
Linus released test5 last night, so I'm putting out the user-mode
version today. There's nothing new in it. The virtual ethernet is in
the patch, but not enabled in the binary kernels and off by default in
defconfig.
The stress testing of this kernel produced no strange happenings.
Maybe the segfaults and other stuff in the last release weren't my
fault (heh).
27 Jul 2000
Started integrating Jim Leu's virtual ethernet driver. It
basically works, but it misbehaves a fair bit. It's unclear whether
that's the kernel's fault or the driver's.
18 Jul 2000
I/O, I/O, off to OLS I go
I'm leaving OLS a bit early because I've got a hiking weekend
coming up - Carter Dome on Saturday and possibly Moriah on Sunday.
So, no one had better expect any work from me until at least next
week...
Also discovered a few more problems which I didn't see on test2 or test3:
the occasional process segfault which I've mentioned already
a devfs segfault - I did this by displaying an xterm out to the
host; when I logged out, the kernel panicked with memory corruption in
devfs
X clients sometimes can't display against a local X server - strace says
that they're stuck in select
strace also displayed their read masks as '[?]', which doesn't
seem right
Patches for these will be forthcoming when I find fixes.
14 Jul 2000
Back from SF. Not only did Linus release test3 on Tuesday (as I
discovered when I was checking things out just before leaving for the
airport), but he also released test4 yesterday. So, it looks like I'm
going to be skipping test3 and going straight to test4.
The test3 changes are pretty minor, but enough to prevent the um
test2 patch from going in. task.priority changed to task.nice, there
were some minor locking changes, devfs_mk_dir lost a parameter, and
kernel/timer.c doesn't compile because of that field change.
With those things fixed, the kernel boots fine.
I also decided to get rid of the
pid 16 (mount) - segv changing ip to 0x10025ff2 for address 0x8064000
messages that appear at boot time. They are debugging messages to
convince me that the new uaccess macros are working. But they looked
abnormal and worried people so they are now gone. Don't worry, be
happy.
The test3 kernel runs my stress tests (lmbench and a kernel build)
fine, so I'm checking it in to CVS and
announcing it on the devel list.
On to test4. The timer.c bug got fixed. Otherwise, the patch went
in cleanly. It compiles and the resulting kernel boots cleanly.
Unfortunately, lmbench segfaults. I put in some debugging code, and
lmbench stops segfaulting. On to the kernel build. That works fine.
I try a couple more lmbench runs. They work fine. Oh well.
I'll consider this releasable. Maybe someone else can find a
better way to reproduce the problem. Check this stuff into CVS, and out
goes the announcement.
I also updated all of the downloadable stuff and announced it.
3 Jul 2000
I fixed the double panic bug. That was caused by a stacksize limit
that was not a multiple of 4 meg. The reason that matters is that
check_range (in arch/um/kernel/tlb.c), which is used to remap address
spaces during a context switch, assumes that remappable areas and
non-remappable areas are under different pgdirs, which represent 4 meg
apiece. Non-remappable areas are areas of address space which don't
belong to the process. Kernel text, data, and physical and virtual
memory, plus the original stack, fall into this category. They are
represented by vmas in the process mm_struct, but don't have page
table entries. If check_range runs into one of these areas in the
course of looking at something else, the lack of ptes for it will
cause it to be unmapped. Since the process stack is placed just
outside the stacksize limit, if that limit is (say) less than 4 meg,
when check_range checks it for remapping, it will also run into the
main stack provided by the host kernel and unmap it. The panic
happened when a process tried to change its name, which is stored in
that initial stack.
If you see this problem, you can change your stacksize limit to a
multiple of 4 meg, or apply this patch to the kernel.
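A quick way to check whether a given stack limit is affected, sketched under the assumption that one pgdir entry covers exactly 4 meg as described above. The helper names are hypothetical. The limit itself is read with getrlimit(RLIMIT_STACK).

```c
#include <assert.h>
#include <limits.h>
#include <sys/resource.h>

#define PGDIR_SPAN (4UL * 1024 * 1024)  /* 4 meg covered by one pgdir entry */

/* 1 if the size is a whole number of pgdir entries, else 0. */
int is_pgdir_aligned(unsigned long bytes)
{
    return bytes % PGDIR_SPAN == 0;
}

/* Round a size up to the next multiple of 4 meg. */
unsigned long round_up_to_pgdir(unsigned long bytes)
{
    assert(bytes <= ULONG_MAX - (PGDIR_SPAN - 1));
    return (bytes + PGDIR_SPAN - 1) / PGDIR_SPAN * PGDIR_SPAN;
}

/* 1 if the current soft stack limit is a multiple of 4 meg, 0 if not,
 * -1 if it is unlimited or can't be read. */
int stack_limit_is_safe(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_STACK, &rl) != 0 || rl.rlim_cur == RLIM_INFINITY)
        return -1;
    return is_pgdir_aligned((unsigned long)rl.rlim_cur);
}
```

For example, a 3 meg limit fails the check, and round_up_to_pgdir suggests 4 meg as the nearest safe value.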
Those two fixes are now checked into CVS.
Here's the devel list post describing the changes.
2 Jul 2000
UML doesn't run on recent 2.3/2.4 kernels and I figured out
why. The signal frame size increased due to some extra x86 state that needed
to be saved. UML is responsible for making sure that there is enough stack
available when it asks the host kernel to send a signal to one of its
processes. To do this, it pokes the stack (by reading and writing a word)
a little below the current stack pointer. If there is nothing mapped there,
the seg fault handler will map a page in and all will be well. The offset
that it used to poke was a hard-coded 512 bytes, which I got by looking at
the amount of stack state the syscall handler needed (312 bytes) and adding
a bit. However, it turns out that the new stack frames are much bigger than
that, so the 512 bytes wasn't enough. Fixing this makes UML run on new
kernels. If you are seeing this problem, apply
this patch to the 2.4.0-test2 pool.
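The poke itself is simple; a rough sketch follows, with the offset left as a parameter since the entry gives only the old value of 512 bytes and not the new one. poke_below is a hypothetical name. In the real case the read may fault and the handler maps a page in; here it is simply applied to memory that is already mapped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a word and write the same value straight back. The access is
 * the point: touching an unmapped address triggers the fault handler,
 * and the contents are left unchanged. */
static void poke_word(volatile unsigned long *addr)
{
    unsigned long w = *addr;
    *addr = w;
}

/* Poke the word `offset` bytes below `sp`, aligned down to a word
 * boundary, and return the address that was touched. */
unsigned long *poke_below(void *sp, size_t offset)
{
    uintptr_t target;

    assert(sp != NULL);
    target = (uintptr_t)sp - offset;
    target &= ~(uintptr_t)(sizeof(unsigned long) - 1);
    poke_word((volatile unsigned long *)target);
    return (unsigned long *)target;
}
```

If the signal frame needs more room than the offset passed in, the poke lands inside the frame rather than below it, which is the failure described above.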
I'm also chasing a bug which causes a panic like this:
Kernel panic: Double fault on 0xbffff874 - panicing because it wasn't
fixed the first time