1
linux/fs
Davide Libenzi 6192bd536f epoll: optimizations and cleanups
Epoll is doing multiple passes over the ready set at the moment, because of
the constraints over the f_op->poll() call.  Looking at the code again, I
noticed that we already hold the epoll semaphore in read, and this
(together with other locking conditions that hold while doing an
epoll_wait()) can lead to a smarter way [1] to "ship" events to userspace
(in a single pass).

This is a stress application that can be used to test the new code.  It
spwans multiple thread and call epoll_wait() and epoll_ctl() from many
threads.  Stress tested on my dual Opteron 254 w/out any problems.

http://www.xmailserver.org/totalmess.c

This is not a benchmark, just something that tries to stress and exploit
possible problems with the new code.
Also, I made a stupid micro-benchmark:

http://www.xmailserver.org/epwbench.c

[1] Considering that epoll must be thread-safe, there are five ways we can
    be hit during an epoll_wait() transfer loop (ep_send_events()):

    1) The epoll fd going away and calling ep_free
       This just can't happen, since we did an fget() in sys_epoll_wait

    2) An epoll_ctl(EPOLL_CTL_DEL)
       This can't happen because epoll_ctl() gets ep->sem in write, and
       we're holding it in read during ep_send_events()

    3) An fd stored inside the epoll fd going away
       This can't happen because in eventpoll_release_file() we get
       ep->sem in write, and we're holding it in read during
       ep_send_events()

    4) Another epoll_wait() happening on another thread
       They both can be inside ep_send_events() at the same time, we get
       (splice) the ready-list under the spinlock, so each one will get
       its own ready list. Note that an fd cannot be at the same time
       inside more than one ready list, because ep_poll_callback() will
       not re-queue it if it sees it already linked:

       if (ep_is_linked(&epi->rdllink))
                goto is_linked;

       Another case that can happen, is two concurrent epoll_wait(),
       coming in with a userspace event buffer of size, say, ten.
       Suppose there are 50 event ready in the list. The first
       epoll_wait() will "steal" the whole list, while the second, seeing
       no events, will go to sleep. But at the end of ep_send_events() in
       the first epoll_wait(), we will re-inject surplus ready fds, and we
       will trigger the proper wake_up to the second epoll_wait().

    5) ep_poll_callback() hitting us asyncronously
       This is the tricky part. As I said above, the ep_is_linked() test
       done inside ep_poll_callback(), will guarantee us that until the
       item will result linked to a list, ep_poll_callback() will not try
       to re-queue it again (read, write data on any of its members). When
       we do a list_del() in ep_send_events(), the item will still satisfy
       the ep_is_linked() test (whatever data is written in prev/next,
       it'll never be its own pointer), so ep_poll_callback() will still
       leave us alone. It's only after the eventual smp_mb()+INIT_LIST_HEAD(&epi->rdllink)
       that it'll become visible to ep_poll_callback(), but at the point
       we're already past it.

[akpm@osdl.org: 80 cols]
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:01 -07:00
..
9p
adfs slab allocators: Remove SLAB_DEBUG_INITIAL flag 2007-05-07 12:12:57 -07:00
affs slab allocators: Remove SLAB_DEBUG_INITIAL flag 2007-05-07 12:12:57 -07:00
afs slab allocators: Remove SLAB_DEBUG_INITIAL flag 2007-05-07 12:12:57 -07:00
autofs
autofs4
befs slab allocators: Remove SLAB_DEBUG_INITIAL flag 2007-05-07 12:12:57 -07:00
bfs slab allocators: Remove SLAB_DEBUG_INITIAL flag 2007-05-07 12:12:57 -07:00
cifs slab allocators: Remove SLAB_DEBUG_INITIAL flag 2007-05-07 12:12:57 -07:00
coda slab allocators: Remove SLAB_DEBUG_INITIAL flag 2007-05-07 12:12:57 -07:00
configfs remove "struct subsystem" as it is no longer needed 2007-05-02 18:57:59 -07:00
cramfs mm: make read_cache_page synchronous 2007-05-07 12:12:51 -07:00
debugfs remove "struct subsystem" as it is no longer needed 2007-05-02 18:57:59 -07:00
devpts devpts: add fsnotify create event 2007-05-08 11:14:59 -07:00
dlm Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw 2007-05-07 12:26:27 -07:00
ecryptfs slab allocators: Remove SLAB_DEBUG_INITIAL flag 2007-05-07 12:12:57 -07:00
efs slab allocators: Remove SLAB_DEBUG_INITIAL flag 2007-05-07 12:12:57 -07:00
exportfs
ext2 ext2/3/4: fix file date underflow on ext2 3 filesystems on 64 bit systems 2007-05-08 11:14:58 -07:00
ext3 ext3: dirindex error pointer issues 2007-05-08 11:15:01 -07:00
ext4 ext3: dirindex error pointer issues 2007-05-08 11:15:01 -07:00
fat is_power_of_2 in fat 2007-05-08 11:14:59 -07:00
freevxfs freevxfs: possible null pointer dereference fix 2007-05-08 11:14:59 -07:00
fuse Merge branch 'server-cluster-locking-api' of git://linux-nfs.org/~bfields/linux 2007-05-07 12:34:24 -07:00
gfs2 Factor outstanding I/O error handling 2007-05-08 11:14:57 -07:00
hfs is_power_of_2 in fs/hfs 2007-05-08 11:14:59 -07:00
hfsplus is_power_of_2 in fs/hfs 2007-05-08 11:14:59 -07:00
hostfs uml: hostfs style fixes 2007-05-08 11:14:57 -07:00
hpfs slab allocators: Remove SLAB_DEBUG_INITIAL flag 2007-05-07 12:12:57 -07:00
hppfs
hugetlbfs hugetlbfs: add NULL check in hugetlb_zero_setup() 2007-05-07 12:12:57 -07:00
isofs slab allocators: Remove SLAB_DEBUG_INITIAL flag 2007-05-07 12:12:57 -07:00
jbd
jbd2
jffs2 slab allocators: Remove SLAB_DEBUG_INITIAL flag 2007-05-07 12:12:57 -07:00
jfs slab allocators: Remove SLAB_DEBUG_INITIAL flag 2007-05-07 12:12:57 -07:00
lockd Merge branch 'server-cluster-locking-api' of git://linux-nfs.org/~bfields/linux 2007-05-07 12:34:24 -07:00
minix slab allocators: Remove SLAB_DEBUG_INITIAL flag 2007-05-07 12:12:57 -07:00
msdos
ncpfs slab allocators: Remove SLAB_DEBUG_INITIAL flag 2007-05-07 12:12:57 -07:00
nfs Merge branch 'server-cluster-locking-api' of git://linux-nfs.org/~bfields/linux 2007-05-07 12:34:24 -07:00
nfs_common
nfsd Merge branch 'server-cluster-locking-api' of git://linux-nfs.org/~bfields/linux 2007-05-07 12:34:24 -07:00
nls
ntfs mm: move common segment checks to separate helper function 2007-05-08 11:14:57 -07:00
ocfs2 slab allocators: Remove SLAB_DEBUG_INITIAL flag 2007-05-07 12:12:57 -07:00
openpromfs slab allocators: Remove SLAB_DEBUG_INITIAL flag 2007-05-07 12:12:57 -07:00
partitions mm: optimize acorn partition truncate 2007-05-07 12:12:55 -07:00
proc reduce size of task_struct on 64-bit machines 2007-05-08 11:14:58 -07:00
qnx4 slab allocators: Remove SLAB_DEBUG_INITIAL flag 2007-05-07 12:12:57 -07:00
ramfs
reiserfs reiserfs: correct misspelled "REISERFS_PROC_INFO" to "CONFIG_REISERFS_PROC_INFO" 2007-05-08 11:15:00 -07:00
romfs slab allocators: Remove SLAB_DEBUG_INITIAL flag 2007-05-07 12:12:57 -07:00
smbfs slab allocators: Remove SLAB_DEBUG_INITIAL flag 2007-05-07 12:12:57 -07:00
sysfs remove "struct subsystem" as it is no longer needed 2007-05-02 18:57:59 -07:00
sysv slab allocators: Remove SLAB_DEBUG_INITIAL flag 2007-05-07 12:12:57 -07:00
udf slab allocators: Remove SLAB_DEBUG_INITIAL flag 2007-05-07 12:12:57 -07:00
ufs slab allocators: Remove SLAB_DEBUG_INITIAL flag 2007-05-07 12:12:57 -07:00
vfat
xfs mm: move common segment checks to separate helper function 2007-05-08 11:14:57 -07:00
aio.c KMEM_CACHE(): simplify slab cache creation 2007-05-07 12:12:55 -07:00
attr.c
bad_inode.c
binfmt_aout.c
binfmt_elf_fdpic.c
binfmt_elf.c
binfmt_em86.c
binfmt_flat.c
binfmt_misc.c
binfmt_script.c
binfmt_som.c
bio.c KMEM_CACHE(): simplify slab cache creation 2007-05-07 12:12:55 -07:00
block_dev.c is_power_of_2 in fs/block_dev.c 2007-05-08 11:14:59 -07:00
buffer.c block_write_full_page(): report ENOSPC 2007-05-08 11:14:57 -07:00
char_dev.c
compat_ioctl.c [PATCH] x86-64: Shut up warnings for vfat compat ioctls on other file systems 2007-05-02 19:27:21 +02:00
compat.c [PATCH] x86-64: Print type and size correctly for unknown compat ioctls 2007-05-02 19:27:21 +02:00
dcache.c mm: shrink parent dentries when shrinking slab 2007-05-08 11:14:58 -07:00
dcookies.c
direct-io.c
dnotify.c
dquot.c mm: remove destroy_dirty_buffers from invalidate_bdev() 2007-05-07 12:12:55 -07:00
drop_caches.c
eventpoll.c epoll: optimizations and cleanups 2007-05-08 11:15:01 -07:00
exec.c exec: fix remove_arg_zero 2007-05-08 11:15:00 -07:00
fcntl.c
fifo.c
file_table.c
file.c
filesystems.c
fs-writeback.c
generic_acl.c
inode.c slab allocators: Remove SLAB_DEBUG_INITIAL flag 2007-05-07 12:12:57 -07:00
inotify_user.c
inotify.c
internal.h
ioctl.c
ioprio.c
Kconfig Merge git://git.linux-nfs.org/pub/linux/nfs-2.6 2007-05-04 19:55:11 -07:00
Kconfig.binfmt blackfin architecture 2007-05-07 12:12:58 -07:00
libfs.c
locks.c Merge branch 'server-cluster-locking-api' of git://linux-nfs.org/~bfields/linux 2007-05-07 12:34:24 -07:00
Makefile
mbcache.c
mpage.c Factor outstanding I/O error handling 2007-05-08 11:14:57 -07:00
namei.c mm: make read_cache_page synchronous 2007-05-07 12:12:51 -07:00
namespace.c Merge sys_clone()/sys_unshare() nsproxy and namespace handling 2007-05-08 11:15:00 -07:00
nfsctl.c
no-block.c
open.c
pipe.c
pnode.c
pnode.h
posix_acl.c
quota_v1.c
quota_v2.c
quota.c
read_write.c use use SEEK_MAX to validate user lseek arguments 2007-05-08 11:14:59 -07:00
read_write.h
readdir.c
select.c
seq_file.c
splice.c
stack.c
stat.c
super.c the overdue removal of the mount/umount uevents 2007-04-27 10:57:31 -07:00
sync.c [PATCH] Turn do_sync_file_range() into do_sync_mapping_range() 2007-04-26 15:02:26 -07:00
utimes.c
xattr_acl.c
xattr.c