Current patches

The purpose of this page is to keep people better informed about ongoing work between UML releases by making the patches currently in my working pool visible to the public. This should alleviate several issues with UML development:

Not infrequently, someone finds a bug in UML, chases it down, and submits a patch, not knowing that the bug has already been fixed in my tree. Since my working tree isn't public until a release, there was no way for anyone to know that the bug was already fixed.
Also not infrequently, the fixes in my tree are incomplete or wrong in some other way. Having those patches available before the release gives UML developers a way to test and sanity-check the patches before they are released to the public.
Having the patches in a release split out makes it easier to fix new bugs by allowing users to back out patches until the bug disappears. Then we know which patch was responsible and can probably figure out the problem quickly. This also allows non-expert users to help track things down since the only expertise needed is the ability to run patch and build UML from source.

To this end, I've started using quilt to manage patches, and will publish the unreleased patches in my current tree here. It will be updated frequently so that there will only be a short window between me putting a patch in my tree and it appearing here.

So, here are the patches pending in my 2.6 tree. I gave up the pretense that I'm still supporting 2.4, so those patches are gone.

These patches apply most easily using quilt on the full tarball. If you're not a quilt user, the "series" file in the tarball gives you the order in which to apply the patches.

2.6.24-rc3

Patches tarball : last modified - Fri Mar 7 11:53:50 EST 2008

fix-config

Last Changed - Mon Nov 19 14:50:48 EST 2007

futex-uaccess

Last Changed - Mon Oct 29 13:24:24 EDT 2007

tlb-build

Last Changed - Mon Nov 19 10:41:54 EST 2007

include-pagemap

Last Changed - Thu Oct 25 23:10:26 EDT 2007

page-flags

Last Changed - Mon Jan 7 12:52:48 EST 2008

pgtable-swap

Last Changed - Thu Oct 25 23:14:42 EDT 2007

init-updating

Last Changed - Sun Sep 9 10:52:25 EDT 2007

update-used

Last Changed - Sun Sep 9 11:09:03 EDT 2007

x86-64-signal-notify

Last Changed - Mon Nov 19 10:42:01 EST 2007

Signals aren't being properly notified to ptrace on x86_64.

o-direct-field

Last Changed - Mon Nov 19 10:42:04 EST 2007

This patch pulls the addition of the openflags.d field from externfs.
This will be merged with the o_direct patch when it is sent to mainline.

externfs-aio

Last Changed - Mon Jan 7 12:52:38 EST 2008

These are AIO changes needed by the ubd driver and humfs.
One major problem fixed here is -EAGAIN handling. It is not enough
to simply pass the error back to the driver so it can retry later.
The ubd driver retries in its interrupt routine, the theory being
that it knows that some requests have been finished on the host, so
there is room to queue some more. The problem is that all the
host AIO requests may be humfs requests, in which case the ubd
interrupt handler will never be called, since it had nothing
pending.

The solution is to centralize the kernel side of AIO request
completion. Rather than have the aio thread send finished requests
directly to the driver which submitted them, they now go to a new
AIO IRQ handler, which sends finished requests to the appropriate
driver. When the host says -EAGAIN, the driver registers a restart
handler with the AIO subsystem. When a host AIO request finishes,
the AIO IRQ handler calls all registered restart handlers. So, a
driver will be notified when new requests can be queued, even if
it's not the one which clogged up the host queue in the first place.
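
The restart-handler idea can be illustrated with a small standalone sketch. This is not code from the patch: the names aio_register_restart, aio_run_restarts, count_restart and MAX_RESTARTS are made up for illustration, and the real kernel code would need locking and would live behind the AIO IRQ handler.

```c
#include <stddef.h>

#define MAX_RESTARTS 8	/* arbitrary limit for this sketch */

typedef void (*restart_fn)(void *data);

struct restart_entry {
	restart_fn fn;
	void *data;
};

static struct restart_entry restarts[MAX_RESTARTS];
static size_t nrestarts;

/* A driver calls this after the host refused a request with -EAGAIN. */
int aio_register_restart(restart_fn fn, void *data)
{
	if (nrestarts == MAX_RESTARTS)
		return -1;
	restarts[nrestarts].fn = fn;
	restarts[nrestarts].data = data;
	nrestarts++;
	return 0;
}

/*
 * The central completion handler calls this whenever any host request
 * finishes. The table is emptied first so that a handler which hits
 * -EAGAIN again can re-register itself. Returns the number of handlers run.
 */
size_t aio_run_restarts(void)
{
	struct restart_entry pending[MAX_RESTARTS];
	size_t i, n = nrestarts;

	for (i = 0; i < n; i++)
		pending[i] = restarts[i];
	nrestarts = 0;
	for (i = 0; i < n; i++)
		pending[i].fn(pending[i].data);
	return n;
}

/* Example restart handler: just counts how often it was woken. */
void count_restart(void *data)
{
	++*(int *)data;
}
```

The point of the design is visible here: every registered driver is woken on any completion, so a driver with nothing pending on the host still gets a chance to resubmit.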

This required a bunch of changes in the ubd driver and humfs. The
interrupt routines are drastically different, as they are no longer
directly called from the IRQ system. They take a list of completed
requests and finish them off. They register a structure at boot
time with the AIO subsystem. This is used as the start of the list
of completed requests. When there are requests mixed together from
different drivers, they need to be separated into different lists,
and the aio_driver is used for this.

aio_thread_reply is gone, as the err field was redundant. Once it's
gone, the aio_context is all that's left, so we might as well just
write aio_contexts between the kernel and aio thread.

UBD_IRQ and HUMFS_IRQ are no more, being replaced by AIO_IRQ.

externfs

Last Changed - Mon Nov 19 20:02:35 EST 2007

This is the externfs/new hostfs/humfs patch. hostfs now seems to be stable.
The old hostfs will continue to exist until this one is as functional and
stable as the old one.

externfs-formatting

Last Changed - Mon Nov 19 20:04:27 EST 2007

delete-hostfs

Last Changed - Mon Nov 19 10:56:14 EST 2007

This deletes the old hostfs. This will be sent to mainline when the
externfs-based hostfs seems stable.

switch-pipe

Last Changed - Mon Nov 19 10:56:24 EST 2007

This fixes the interface of make_pipe, which doesn't need to initialize
filehandles. Instead, it is just a wrapper around pipe which just reclaims
descriptors if the initial call to pipe fails with -EMFILE.
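
A minimal userspace sketch of that shape, assuming a hypothetical reclaim callback that frees some cached descriptors and returns how many it released. The name make_pipe_sketch and the reclaim_fn type are illustrative only and are not the patch's interface.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical hook: release cached descriptors, return how many were freed. */
typedef int (*reclaim_fn)(void);

/*
 * Thin wrapper around pipe(). If the first attempt fails because the
 * process is out of descriptors (EMFILE), reclaim some and try once more.
 * Returns 0 on success or a negative errno value.
 */
int make_pipe_sketch(int fds[2], reclaim_fn reclaim)
{
	int err;

	if (pipe(fds) == 0)
		return 0;
	err = errno;	/* save it: reclaim() may change errno */
	if (err != EMFILE || reclaim == NULL || reclaim() <= 0)
		return -err;
	if (pipe(fds) == 0)
		return 0;
	return -errno;
}
```

Note that the wrapper never touches the contents of fds beyond what pipe() itself writes, which matches the observation that no filehandle initialization is needed.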

x11-fb

Last Changed - Mon Nov 19 10:59:25 EST 2007

X11 framebuffer driver from Gerd Knorr.
You have to enable CONFIG_FB (UML-specific options/Graphics support/
Support for frame buffer devices), disable CONFIG_VGA_CONSOLE
(UML-specific options/Graphics support/Console display driver support/
VGA text console), and enable Framebuffer Console support (in the same
place), plus some fonts. You also seem to have to put 'x11=<width>x<height>'
on the command line.

fix-get_user_pages

Last Changed - Mon Nov 19 10:59:29 EST 2007

From: Bodo Stroesser <bstroesser@fujitsu-siemens.com>

Fix of a wrong condition.

Signed-off-by: Bodo Stroesser <bstroesser@fujitsu-siemens.com>

add-stub-vmas.patch

Last Changed - Mon Nov 19 11:11:41 EST 2007

From: Bodo Stroesser <bstroesser@fujitsu-siemens.com>

This patch adds vm_area structs for stub-code and stub-data.
So, stub-area is displayed in /proc/XXX/maps. Also, stub-pages
are accessible for debuggers via ptrace now.

Linux has a gate-vma concept that unfortunately supports only one
gate-vma. Thus, some changes need to be made in
mm/memory.c and fs/proc/task_mmu.c.
This patch avoids the mainline changes by using some dirty tricks.
So, the patch is for testing only, mainline should be changed
to support more than one gate-vma.

fix-jiffies.patch

Last Changed - Mon Nov 19 11:12:47 EST 2007

From: Bodo Stroesser <bstroesser@fujitsu-siemens.com>

To support different subarches, UML must not use the same
address for jiffies and jiffies_64 in a hardcoded way.
I added JIFFIES_OFFSET to handle different arches. For
current arches, it is set to 0, for s390 it will be set to 4.

Signed-off-by: Bodo Stroesser <bstroesser@fujitsu-siemens.com>
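
As background for the JIFFIES_OFFSET values above: presumably the offset reflects where the low 32 bits of a 64-bit counter sit in memory, which is byte 0 on little-endian machines and byte 4 on a big-endian 32-bit machine such as s390. The sketch below demonstrates that in plain C; low_word_offset and read_low_word are illustrative names, not kernel symbols.

```c
#include <stdint.h>
#include <string.h>

/* Byte offset of the low 32 bits inside a 64-bit counter on this machine. */
unsigned low_word_offset(void)
{
	uint64_t probe = 1;
	unsigned char bytes[sizeof(probe)];

	memcpy(bytes, &probe, sizeof(probe));
	return bytes[0] == 1 ? 0 : 4;	/* little-endian : big-endian */
}

/* Read the low 32 bits of a 64-bit counter through that offset. */
uint32_t read_low_word(const uint64_t *counter64)
{
	uint32_t low;

	memcpy(&low, (const unsigned char *)counter64 + low_word_offset(),
	       sizeof(low));
	return low;
}
```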

insert-TIF_RESTART_SVC

Last Changed - Mon Nov 19 11:14:23 EST 2007

From: Bodo Stroesser <bstroesser@fujitsu-siemens.com>

s390 syscalls might be done by a "svc X" instruction (2 bytes
in size) or an "exec X,Y" instruction (4 bytes in size).
There is no way to read the size of the instruction via ptrace,
so UML/s390 can't do syscall-restarting by resetting instruction
pointer to the value before the syscall.
Also, in most cases syscall number is hardcoded in the "SVC X"
instruction, so there is no way to handle ERESTART_RESTARTBLOCK
correctly by *really* restarting the syscall.
s390 host has implemented TIF_RESTART_SVC-flag to handle the
latter case.
In UML we have to use TIF_RESTART_SVC for both cases.
This patch implements TIF_RESTART_SVC in UML.

Signed-off-by: Bodo Stroesser <bstroesser@fujitsu-siemens.com>

390-syscall-restart

Last Changed - Mon Nov 19 11:18:56 EST 2007

From: Bodo Stroesser <bstroesser@fujitsu-siemens.com>

s390 normally doesn't provide a method that allows us to force
the host to skip its syscall restart handling.
I implemented a new method in the host, which also is posted to
LKML to hopefully be inserted in s390 mainline.
To check availability of this change, I added a new check, which
is done in a slightly different way for the other arches, too.
Success in check_ptrace() and success in the new check are
absolutely necessary for UML to run in any mode.
So I changed the sequence of checks to:
1) check_ptrace() being called at startup very early
2) check_ptrace() calls the new check, too
3) can_do_skas() is called after check_ptrace()
check_ptrace() will never return, if it fails, but it now uses
printf() and exit() instead of panic().

Signed-off-by: Bodo Stroesser <bstroesser@fujitsu-siemens.com>

add-PTRACE_AREA

Last Changed - Mon Nov 19 11:20:50 EST 2007

From: Bodo Stroesser <bstroesser@fujitsu-siemens.com>

s390 doesn't have PTRACE_GETREGS and friends, but has
PTRACE_[PEEK|POKE]USR_AREA to let user of ptrace() read or write
struct user as he wants.
So we need to support this operation conditionally.

Signed-off-by: Bodo Stroesser <bstroesser@fujitsu-siemens.com>

stub-arch-optimization

Last Changed - Mon Nov 19 11:22:09 EST 2007

From: Bodo Stroesser <bstroesser@fujitsu-siemens.com>

In s390, fpregs are not reset in signal handlers.
Thus we may stop stub_segv_handler on s390 with a breakpoint instruction
instead of calling getpid, kill and sigreturn.
To make this run, we must not mask any signals in stub_segv_handler.

So I added conditional execution of set_handler in userspace_tramp
depending on ARCH_STUB_NO_SIGRETURN. If this macro isn't defined,
the code remains unchanged, else no signals for sa_mask are defined
and SA_NODEFER is added to flags.

Using the change, we also no longer need to care about correct stack
pointer for sigreturn, which would cause some nasty code on s390.

Signed-off-by: Bodo Stroesser <bstroesser@fujitsu-siemens.com>
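
The conditional handler setup described above might look roughly like the following userspace sketch. It uses plain sigaction rather than UML's set_handler, and install_stub_handler, noop_handler and the particular signals masked in the default branch are assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Install a stub-style handler for sig. Returns 0 on success, -1 on error. */
int install_stub_handler(int sig, void (*handler)(int))
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = handler;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = 0;
#ifdef ARCH_STUB_NO_SIGRETURN
	/* The handler is stopped by a breakpoint and never returns through
	 * sigreturn, so nothing may be masked and re-entry must be allowed. */
	sa.sa_flags |= SA_NODEFER;
#else
	/* Unchanged path: block some signals while the handler runs.
	 * This particular set is illustrative only. */
	sigaddset(&sa.sa_mask, SIGALRM);
	sigaddset(&sa.sa_mask, SIGUSR1);
#endif
	return sigaction(sig, &sa, NULL);
}

/* Trivial handler so the sketch can be exercised. */
void noop_handler(int sig)
{
	(void)sig;
}
```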

test_stub_kill

Last Changed - Mon Nov 19 11:22:13 EST 2007

A small stub optimization. I forget what the reasoning was, needs
more thought.

s390

Last Changed - Tue Oct 24 12:30:58 EDT 2006

This patch adds s390 (31-bit) to UML.

SKAS0 and SKAS3 have been tested a bit: at least the system boots and
shuts down correctly, network (tun/tap) works, and we could even
start YaST on it.

Notes:
We use a host running SuSE SLES8 with a "private" kernel named
2.4.21-fsc.11, that contains special adaptions and drivers for
Fujitsu-Siemens mainframes. This means, our current SKAS3-patch
doesn't fit to vanilla kernels. We will create a reworked patch for
vanilla later (2.4 and 2.6).
Our 2.4 kernel also contains two fixes and one enhancement that are all
essential to make UML run, even in SKAS0. The enhancement is
support for PT_TRACESYSGOOD, which is generally available in 2.6 kernels
but not in 2.4 for s390. So I would suggest using a 2.6 host for the
moment.
The fixes have meanwhile been included in mainline. I don't know precisely
which version first contained them, but they are present in 2.6.13-rc5.
As those two patches are very small, they are included here as a
comment (AFAICS, it's easy to make the changes by hand on older kernel
versions):

First patch to fix signal stack handling:
--- a/arch/s390/kernel/signal.c 2005-03-22 11:07:39.000000000 +0100
+++ b/arch/s390/kernel/signal.c 2005-03-22 11:08:44.000000000 +0100
@@ -285,7 +285,7 @@

     /* This is the X/Open sanctioned signal stack switching. */
     if (ka->sa.sa_flags & SA_ONSTACK) {
-        if (! on_sig_stack(sp))
+        if (! sas_ss_flags(sp))
             sp = current->sas_ss_sp + current->sas_ss_size;
     }

Second patch to allow skipping of syscall restart:
--- a/arch/s390/kernel/ptrace.c 2005-05-07 07:20:31.000000000 +0200
+++ b/arch/s390/kernel/ptrace.c 2005-08-02 06:45:48.000000000 +0200
@@ -723,6 +761,13 @@
                 ? 0x80 : 0));

     /*
+     * If the debugger has set an invalid system call number,
+     * we prepare to skip the system call restart handling.
+     */
+    if (!entryexit && regs->gprs[2] >= NR_syscalls)
+        regs->trap = -1;
+
+    /*
      * this isn't the same as continuing with a signal, but it will do
      * for normal use. strace only continues with a signal if the
      * stopping signal is not SIGTRAP. -brl

ubd-aio

Last Changed - Mon Nov 19 11:23:59 EST 2007

This adds AIO support to the ubd driver.

ubd-atomic

Last Changed - Mon Nov 19 11:24:02 EST 2007

To ensure that I/O can always make progress, even when there is no
memory, we provide static buffers which are to be used when dynamic
ones can't be allocated. These buffers are protected by flags which
are set when they are currently in use. The use of these flags is
protected by the queue lock, which is held for the duration of the
do_ubd_request call.

There is an allocation failure emulation
mechanism here - setting fail_start and fail_end will cause
allocations in that range (fail_start <= allocations < fail_end) to
fail, invoking the emergency mechanism.
When this is happening, I/O requests proceed one at a time,
essentially synchronously, until allocations start succeeding again.

This currently doesn't handle the bitmap array, since that can be of
any length, so we can't have a static version of it at this point.
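
The emergency-buffer scheme and the failure emulation can be sketched in userspace as follows. Only fail_start and fail_end come from the description above; struct io_buf, get_io_buf, put_io_buf and is_emergency_buf are hypothetical names, and malloc stands in for the kernel allocator. In the driver the in-use flag would be protected by the queue lock, which this sketch omits.

```c
#include <stdlib.h>

struct io_buf {
	char data[512];
};

static struct io_buf emergency_buf;	/* static fallback buffer */
static int emergency_in_use;		/* set while the fallback is handed out */

static unsigned long alloc_count;	/* number of allocation attempts so far */
unsigned long fail_start, fail_end;	/* attempts in [fail_start, fail_end) fail */

/* Dynamic allocation with failure emulation. */
static struct io_buf *try_alloc(void)
{
	unsigned long n = alloc_count++;

	if (n >= fail_start && n < fail_end)
		return NULL;
	return malloc(sizeof(struct io_buf));
}

/* Get a buffer; fall back to the static one, or NULL if that is busy too. */
struct io_buf *get_io_buf(void)
{
	struct io_buf *buf = try_alloc();

	if (buf != NULL)
		return buf;
	if (emergency_in_use)
		return NULL;	/* caller must wait for the request in flight */
	emergency_in_use = 1;
	return &emergency_buf;
}

void put_io_buf(struct io_buf *buf)
{
	if (buf == &emergency_buf)
		emergency_in_use = 0;
	else
		free(buf);
}

int is_emergency_buf(const struct io_buf *buf)
{
	return buf == &emergency_buf;
}
```

With a single fallback buffer, only one request can be in flight while allocations fail, which is why I/O degrades to one-at-a-time operation in that state.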

bitmap-atomic

Last Changed - Mon Nov 19 11:25:38 EST 2007

This patch completes the robustness and deadlock avoidance work by
handling the writing of the bitmap. The existing method of dealing
with low-memory situations by having an emergency structure for use
when memory can't be allocated won't work here because of the
variable size of the bitmap buffer, and the unknown (to me) limit of
a contiguous I/O request.
The allocation is avoided by writing directly from the bitmap rather
than allocating a buffer and copying the relevant chunk of the
bitmap into it.
This has a number of consequences. First, since the bitmap is
written directly from the device's bitmap, bits should not be set in
it until the I/O is just about to start. This is because reads
would see that and possibly race with the outgoing writes, returning
data from a section of the COW file which has never been written.
To prevent this, reads that overlap a pending bitmap write are
stalled until the write is finished. Modifying the bitmap bits as
late as possible shrinks the window in which this could happen.
Second, a section of bitmap that's being written out should not be
modified again until the write has finished. Otherwise, a bit might
be set and picked up by a pending I/O, resulting in it being on disk
too soon.
So, there are a couple new lists. Bitmap writes which have been
issued, but not finished are on the pending_bitmaps list. Any
subsequent bitmap writes which overlap a pending write have to
wait. These are put on the waiting_bitmaps list. Whenever a
pending bitmap write finishes, any overlapping waiting writes are
tried. They may continue waiting because they overlap an earlier
waiter, but at least one will proceed. Third, reads which overlap a
pending or waiting bitmap write will wait until those writes have
finished. This is done by do_io returning -EAGAIN, causing the
queue to wait until some requests have finished.
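
The waiting rule for bitmap writes comes down to a range-overlap test against both lists. The sketch below uses arrays instead of the driver's linked lists, and struct bitmap_range and bitmap_write_must_wait are illustrative names rather than the patch's own.

```c
#include <stddef.h>

/* Half-open range [start, end) of bitmap offsets touched by one write. */
struct bitmap_range {
	unsigned long start, end;
};

static int overlaps(const struct bitmap_range *a, const struct bitmap_range *b)
{
	return a->start < b->end && b->start < a->end;
}

static int overlaps_any(const struct bitmap_range *w,
			const struct bitmap_range *list, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (overlaps(w, &list[i]))
			return 1;
	return 0;
}

/*
 * Returns 1 if write w has to go on the waiting list, 0 if it may be
 * issued now. A write waits if it overlaps a pending write or a write
 * that is already waiting ahead of it.
 */
int bitmap_write_must_wait(const struct bitmap_range *w,
			   const struct bitmap_range *pending, size_t npending,
			   const struct bitmap_range *waiting, size_t nwaiting)
{
	return overlaps_any(w, pending, npending) ||
	       overlaps_any(w, waiting, nwaiting);
}
```

The same test would apply to reads, which stall on an overlap with either list.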

ubd-no-count

Last Changed - Mon Nov 19 11:25:44 EST 2007

This patch eliminates the atomic count associated with a bitmap_io
struct. The original thinking was that there would be a number of
aio structures associated with the bitmap_io, since different chunks
of the sg element could go to different layers. The count was
needed to know when the full sg segment reached disk and it was safe
to write the bitmap.
However, the flaw in that thinking is that a bitmap_io struct is
only needed for writes, and writes always go to the COW layer.
Hence, there will only be one bitmap_io per aio, and the counting is
unnecessary.
This patch makes it possible to merge the aio and bitmap_io structs,
which would be a good cleanup.

aio-batching

Last Changed - Mon Nov 19 21:27:46 EST 2007

I noticed that the common case in io_submit is an immediate context
switch to the AIO thread when it returns from io_getevents, followed
by a switch back. This patch changes that by having the AIO thread
wait on a pipe before calling io_getevents. When the kernel
finishes a batch of I/O, it writes the number of requests down the
pipe, and the AIO thread waits for that number, and goes back to
sleeping on the pipe.
This probably shouldn't reach mainline, as O_DIRECT I/O should have
the property of causing switching on every I/O request. Also, the
wakeup mechanism should be only used when the other side might be
sleeping.
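
The pipe handshake can be sketched with two small helpers. The names announce_batch and wait_for_batch are hypothetical, and the sketch leaves out the io_getevents call that the AIO thread would make once it knows how many completions to collect.

```c
#define _POSIX_C_SOURCE 200809L
#include <unistd.h>

/* Kernel side: announce that a batch of nreq requests was submitted.
 * Returns 0 on success, -1 on error. */
int announce_batch(int wake_fd, int nreq)
{
	ssize_t n = write(wake_fd, &nreq, sizeof(nreq));

	return n == (ssize_t)sizeof(nreq) ? 0 : -1;
}

/* AIO thread side: sleep on the pipe until a batch is announced.
 * Returns the number of requests in the batch, or -1 on error. */
int wait_for_batch(int wake_fd)
{
	int nreq;

	if (read(wake_fd, &nreq, sizeof(nreq)) != (ssize_t)sizeof(nreq))
		return -1;
	return nreq;
}
```

Because the thread blocks in read() rather than in io_getevents, submitting a request no longer wakes it immediately; it runs once per batch instead of once per request.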

o_direct

Last Changed - Mon Nov 19 11:31:50 EST 2007

This enables O_DIRECT on ubd devices. This needs work, as it will die
when creating a COW file. It also needs to do buffered I/O on backing
files.

init-io-req

Last Changed - Mon Jan 7 12:51:44 EST 2008

This uses the C99 syntax to initialize an io_thread_req. Given how this
is compiled, it may not be a good idea, as it will consume more stack
than it should.
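
For reference, this is what C99 designated initialization looks like on a made-up request structure. struct io_req_sketch and its fields are hypothetical and do not mirror io_thread_req. The stack concern presumably arises because the compiler may build the whole structure as a temporary before copying it to its destination.

```c
/* Hypothetical request structure, for illustration only. */
struct io_req_sketch {
	int op;
	int fd;
	unsigned long long offset;
	unsigned long length;
	int error;
};

enum { SKETCH_READ, SKETCH_WRITE };

struct io_req_sketch make_req(int op, int fd, unsigned long long offset,
			      unsigned long length)
{
	/* Members not named here (error) are zero-initialized. */
	struct io_req_sketch req = {
		.op	= op,
		.fd	= fd,
		.offset	= offset,
		.length	= length,
	};

	return req;
}
```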

no-o-direct

Last Changed - Mon Nov 19 11:31:55 EST 2007

This is the reversion of the o_direct patch so I can make COW files.

no-fakehd

Last Changed - Mon Nov 19 11:31:56 EST 2007

The fakehd switch lost its implementation at some point. Since no one is
screaming for it, we might as well remove it.

cow-odirect

Last Changed - Mon Nov 19 11:32:33 EST 2007

Start fixing the problems with aligned access to COW files when O_DIRECT
is enabled.

fuse

Last Changed - Mon Nov 19 11:39:35 EST 2007

This is the start of the FUSE server support, which will export the UML
filesystem to the host as a FUSE filesystem.

no-cow-odirect

Last Changed - Mon Nov 19 11:39:59 EST 2007

Back out the odirect stuff temporarily.

tty-logging

Last Changed - Mon Nov 19 11:40:03 EST 2007

Re-enabled tty logging.

separate-signals

Last Changed - Mon Jan 7 12:51:13 EST 2008

Pull the signals_enabled bit apart into signal-specific bits. This is
used by unmask_* in the genirq patch.
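
Splitting one enabled flag into per-signal bits might look like this in outline. The bit names and helper functions are assumptions for illustration and are not taken from the patch.

```c
/* One bit per signal instead of a single on/off flag. */
enum {
	SIGIO_BIT	= 1 << 0,
	SIGVTALRM_BIT	= 1 << 1,
};

static unsigned int signals_enabled;

/* Enable delivery of one signal without touching the others. */
void unmask_signal_bit(unsigned int bit)
{
	signals_enabled |= bit;
}

/* Disable delivery of one signal without touching the others. */
void mask_signal_bit(unsigned int bit)
{
	signals_enabled &= ~bit;
}

int signal_bit_enabled(unsigned int bit)
{
	return (signals_enabled & bit) != 0;
}
```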

genirq

Last Changed - Mon Jan 7 12:51:07 EST 2008

Use the genirq infrastructure correctly.

Get rid of init_irq_signals
init_new_thread_signals
start_uml_skas
userspace_tramp
exec_tramp - tt
do_new_thread_handler - tt
init_new_thread_signals should probably go away, except that this requires
tt mode to go away first

See why a separate boot_timer_handler is needed
do_boot_timer_handler
boot_timer_handler
* merge_time_init

See why SIGIO and SIGVTALRM mask off different sets of signals
* they don't any more

See if they need to mask off other signals at all

user_time_init should probably go away
* merge-time-init

Look at signal enabling/disabling in unblock_signals and the handlers

Get rid of timer_irq_inited
* no-timer-irq-inited

Merge timer_init and time_init
* merge-time-init

startup_sigvtalrm needs to call set_interval
* genirq

disable_timer should return int and not play with signals
disable_timer
error path of setup_sigvtalrm
just before exiting
* genirq

enable_timer
tt exec, separate from unblock_signals
tt new thread, separate from local_irq_enable
tt finish fork, separate from local_irq_enable
skas userspace_tramp

uml_idle_timer move to tt
* idle-timer

winch

Last Changed - Mon Nov 19 11:56:10 EST 2007

SIGWINCH handling cleanups -
Reformatting of comments and code
Closed a memory and fd leak by freeing a winch structure and
everything associated with it if communication with the winch thread
dies for some reason.

free_winch in winch_interrupt may be wrong, as it calls free_irq
winch_handler_lock needs to become _irqsave
register_winch_irq may need to return int

no-os

Last Changed - Mon Nov 19 11:56:20 EST 2007

There are a number of calls to os_* functions from code that's already
inside os-Linux, and thus can call libc directly. In most cases, the
wrappers added no value, and existed only to separate libc and kernel
headers. When the calls are under os-Linux, this is pointless.

os_{read,write}_file do add some value, unlike the other wrappers,
namely that it retries if the corresponding read() or write() returns
-EINTR and that it handles userspace addresses properly by making sure
they are mapped, avoiding the read() or write() returning -EFAULT.
So, we need to be sure that when these calls are replaced, there is a
CATCH_EINTR added and that the buffer is kernel memory.

In addition, the wrappers returned -errno, rather than {-1,0}, so the
replacement code needs to look at errno for the error.
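
A replacement for an os_read_file call might take roughly the following shape. The CATCH_EINTR definition here is written along the lines of the UML macro but should be treated as an assumption, and read_retrying is a hypothetical helper name. It follows the libc convention of returning -1 with the error in errno.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <unistd.h>

/* Repeat expr for as long as it fails with EINTR. */
#define CATCH_EINTR(expr) \
	while ((errno = 0, ((expr) < 0)) && (errno == EINTR))

/*
 * Direct libc read with EINTR retry. buf must be kernel memory in the
 * UML case. Returns the byte count, or -1 with the error left in errno
 * (not -errno, unlike the old os_* wrappers).
 */
ssize_t read_retrying(int fd, void *buf, size_t len)
{
	ssize_t n;

	CATCH_EINTR(n = read(fd, buf, len));
	return n;
}
```

A caller that used to test for a negative return value and use it as the error code now has to read errno instead.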

This patch also makes a number of whitespace and formatting cleanups,
plus some error path fixes around the affected code.

os_print_error just goes away since it's replaced by perror and printk.

os_getpid is preserved for the moment because it takes care to make
sure that the system call is executed, avoiding an old libc bug which
cached the pid of a different thread.

etap_open needs to undo tap_open_common if a socketpair fails
Also maybe undo etap_tramp if an error came back
Split out os_set_exec_close interface change
tuntap_open needs to undo tap_open_common
clean up os_set_fd_async
Remove helper_hup and helper_pause from helper_child
Add NOCLDWAIT support
os_get_exec_close can just return the flag
ptrace_child should kill itself rather than calling _exit

delay-interface

Last Changed - Mon Nov 19 11:57:34 EST 2007

Fix a confusing parameter name in the __const_udelay prototypes.
The UML changes here are suspect, and probably wrong, but also fairly
harmless.

logging

Last Changed - Mon Nov 19 11:58:01 EST 2007

This is a little logger which dumps stuff out to a host file. Used for
tracking down otherwise intractable bugs.
