Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The cpuset code to present a list of tasks using a cpuset to user space could
write to an array that it had kmalloc'd, after a kmalloc request of zero size.
The problem was that the code didn't check for writes past the allocated end
of the array until -after- the first write.
This is a race condition that is likely rare -- it would only show up if a
cpuset went from being empty to having a task in it, during the brief time
between the allocation and the first write.
Prior to roughly 2.6.22 kernels, this was also a benign problem, because a
zero kmalloc returned a few usable bytes anyway, and no harm was done with the
bogus write.
With the 2.6.22 kernel changes to make issue a warning if code tries to write
to the location returned from a zero size allocation, this problem is no
longer benign. This cpuset code would occassionally trigger that warning.
The fix is trivial -- check before storing into the array, not after, whether
the array is big enough to hold the store.
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Serge E. Hallyn" <serue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Paul Menage <menage@google.com>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix kernel-api docbook contents problems.
docproc: linux-2.6.23-git13/include/asm-x86/unaligned_32.h: No such file or directory
Warning(linux-2.6.23-git13//include/linux/list.h:482): bad line: of list entry
Warning(linux-2.6.23-git13//mm/filemap.c:864): No description found for parameter 'ra'
Warning(linux-2.6.23-git13//block/ll_rw_blk.c:3760): No description found for parameter 'req'
Warning(linux-2.6.23-git13//include/linux/input.h:1077): No description found for parameter 'private'
Warning(linux-2.6.23-git13//include/linux/input.h:1077): No description found for parameter 'cdev'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: WU Fengguang <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Implement support for file systems larger than 8 TiB.
The reiserfs superblock contains a 16 bit value for counting the number of
bitmap blocks. The rest of the disk format supports file systems up to 2^32
blocks, but the bitmap block limitation artificially limits this to 8 TiB with
a 4KiB block size.
Rather than trust the superblock's 16-bit bitmap block count, we calculate it
dynamically based on the number of blocks in the file system. When an
incorrect value is observed in the superblock, it is zeroed out, ensuring that
older kernels will not be able to mount the file system.
Userspace support has already been implemented and shipped in reiserfsprogs
3.6.20.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The first_zero_hint metadata caching was never actually used, and it's of
dubious optimization quality. This patch removes it.
It doesn't actually shrink the size of the reiserfs_bitmap_info struct, since
that doesn't work with block sizes larger than 8K. There was a big fixme in
there, and with all the work lately in allowing block size > page size, I
might as well kill the fixme as well.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Do a quick signedness check for block numbers. There are a number of places
where signed integers are used for block numbers, which limits the usable file
system size to 8 TiB. The disk format, excepting a problem which will be
fixed in the following patch, supports file systems up to 16 TiB in size.
This patch cleans up those sites so that we can enable the full usable size.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Correct the memset in reiserfs_resize to clear the memory allocated for the
new bitmap info structs. Previously, it would clear the memory used by the
old size. Depending on the contents of memory, this could cause incorrect
caching behavior for bitmap blocks in the newly allocated area.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Build in is_reusable() unconditionally and use it to catch corruption before
it reaches the block freeing paths.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change reiserfs_panic() to use panic() initially instead of BUG(). Using
BUG() ignores the configurable panic behavior, so systems that should be
failing and rebooting are left hanging. This causes problems in
active/standby HA scenarios.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add I_MUTEX_XATTR annotations to the inode locking in the reiserfs xattr code.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Note from Mingming's JBD2 fix:
Noticed all warnings are occurs when the debug level is 0. Then found the
"jbd2: Move jbd2-debug file to debugfs" patch
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=0f49d5d019afa4e94253bfc92f0daca3badb990b
changed the jbd2_journal_enable_debug from int type to u8, makes the
jbd_debug comparision is always true when the debugging level is 0. Thus
the compile warning occurs.
Thought about changing the jbd2_journal_enable_debug data type back to int,
but can't, because the jbd2-debug is moved to debug fs, where calling
debugfs_create_u8() to create the debugfs entry needs the value to be u8
type.
Even if we changed the data type back to int, the code is still buggy,
kernel should not print jbd2 debug message if the jbd2_journal_enable_debug
is set to 0. But this is not the case.
The fix is change the level of debugging to 1. The same should fixed in
ext3/JBD, but currently ext3 jbd-debug via /proc fs is broken, so we
probably should fix it all together.
Signed-off-by: Jose R. Santos <jrs@us.ibm.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We should really call journal_abort() and not __journal_abort_hard() in
case of errors. The latter call does not record the error in the journal
superblock and thus filesystem won't be marked as with errors later (and
user could happily mount it without any warning).
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The jbd-debug file used to be located in /proc/sys/fs/jbd-debug, but
create_proc_entry() does not do lookups on file names that are more that
one directory deep. This causes the entry creation to fail and hence, no
proc file is created.
Instead of fixing this on procfs might as well move the jbd2-debug file to
debugfs which would be the preferred location for this kind of tunable.
The new location is now /sys/kernel/debug/jbd/jbd-debug.
[akpm@linux-foundation.org: zillions of cleanups]
Signed-off-by: Jose R. Santos <jrs@us.ibm.com>
Acked-by: Jan Kara <jack@suse.cz>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove printk from J_ASSERT to preserve registers during BUG.
Signed-off-by: Chris Snook <csnook@redhat.com>
Cc: "Stephen C. Tweedie" <sct@redhat.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Convert kmalloc to kzalloc() and get rid of the memset().
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
REQUEST_IRQ is never used, so delete it. In the process get rid of the
macro FREE_IRQ which makes the code unnecessarily difficult to read.
Signed-off-by: Fernando Luis Vázquez Cao <fernando@oss.ntt.co.jp>
Acked-by: Karsten Keil <kkeil@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This fixes the hard freeze debugged for AVM C4 cards for the AVM T1 cards.
Signed-off-by: Karsten Keil <kkeil@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This fixes the hard freeze debugded for AVM C4 cards using the b1dma
interface.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Karsten Keil <kkeil@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
One call was missing in the previous patch.
Signed-off-by: Karsten Keil <kkeil@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Trivial change in a comment.
Signed-off-by: Joern Engel <joern@logfs.org>
Signed-off-by: Andre Haupt <andre@finow14.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some external modules like Speakup need to monitor console output.
This adds a VT notifier that such modules can use to get console output events:
allocation, deallocation, writes, other updates (cursor position, switch, etc.)
[akpm@linux-foundation.org: fix headers_check]
Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
Cc: Dmitry Torokhov <dtor@mail.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is separate notifier header, but no separate notifier .c file.
Extract notifier code out of kernel/sys.c which will remain for
misc syscalls I hope. Merge kernel/die_notifier.c into kernel/notifier.c.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch uses vm_get_page_prot() to setup vma->vm_page_prot.
Though inside vm_get_page_prot() the protection flags is AND with
(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED), it does not hurt correct code.
Signed-off-by: Coly Li <coyli@suse.de>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nobody uses flush_tlb_pgtables anymore, this patch removes all remaining
traces of it from all archs.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
copy_oldmem_page should not return leaving a page frame from the
previous kernel mapped.
Signed-off-by: Fernando Luis Vázquez Cao <fernando@oss.ntt.co.jp>
Acked-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove the `clear_50' and `clear_vesa' fields of struct
ps3av_monitor_quirk, as they're currently unused. We can always re-add
them when we really need them.
Signed-off-by: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>
Cc: "Antonino A. Daplas" <adaplas@pol.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some blind people use a kernel engine called Speakup which uses hardware
synthesis to speak what gets displayed on the screen. They use the
PC keyboard to control this engine (start/stop, accelerate, ...) and
also need to get keyboard feedback (to make sure to know what they are
typing, the caps lock status, etc.)
Up to now, the way it was done was very ugly. Below is a patch to add a
notifier list for permitting a far better implementation, see ChangeLog
above for details.
You may wonder why this can't be done at the input layer. The problem
is that what people want to monitor is the console keyboard, i.e. all
input keyboards that got attached to the console, and with the currently
active keymap (i.e. keysyms, not only keycodes).
This adds a keyboard notifier that such modules can use to get the keyboard
events and possibly eat them, at several stages:
- keycodes: even before translation into keysym.
- unbound keycodes: when no keysym is bound.
- unicode: when the keycode would get translated into a unicode character.
- keysym: when the keycode would get translated into a keysym.
- post_keysym: after the keysym got interpreted, so as to see the result
(caps lock, etc.)
This also provides access to k_handler so as to permit simulation of
keypresses.
[akpm@linux-foundation.org: various fixes]
Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Dmitry Torokhov <dtor@mail.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Declarations go into headers.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Ram Pai <linuxram@us.ibm.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Majority of host drivers using IDE PCI layer set drive->autotune, the only
exceptions are:
generic.c
ns87415.c
rz1000.c
trm290.c
* no ->set_pio_mode method
it821x.c:
* if memory allocation fails drive->autotune won't be set
(but there also won't be ->set_pio_mode method in such case)
piix.c:
* MPIIX controller (no ->init_hwif method so also no ->set_pio_mode method)
However if there is no ->set_pio_mode method there are no changes in behavior
w.r.t. PIO tuning so always set drive->autotune in ide_pci_setup_ports().
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Since cs5520 uses VDMA best PIO mode was tuned anyway by ide_dma_check()
but only if DMA was successfully initialized.
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Add IDE_HFLAG_LEGACY_IRQS host flag to tell ide_pci_setup_ports() to set
hwif->irq to legacy IRQ 14/15 (iff hwif->irq is not already set) and convert
atiixp, piix, serverworks, sis5513 and slc90e66 host drivers to use it.
While at it:
* In piix.c add IDE_HFLAGS_PIIX define and don't use ->init_hwif for MPIIX.
Acked-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Add IDE_HFLAG_SERIALIZE host flag to tell ide_pci_setup_ports() to set
hwif/mate->serialized and convert aec62xx, cs5530 and sc1200 host drivers
to use it.
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Add IDE_HFLAG_ERROR_STOPS_FIFO host flag and use it instead
of hwif->err_stops_fifo. As a side-effect this change fixes
hwif->err_stops_fifo not being restored by ide_hwif_restore().
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Add DECLARE_ICH_DEV() macro.
While at it:
* Add init_hwif_ich() (->init_hwif method) for ICH controllers.
* Rename init_chipset_piix() to init_chipset_ich() and use it only for
ICH controllers.
* Remove no longer needed piix_is_ichx() helper.
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
* No need to disable UDMA in ->init_hwif method for ATP850UF (and since we
now always tune PIO it will be disabled by ->set_pio_mode calls anyway).
* Bump driver version.
Acked-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Merge init_setup_{svwks,csb6}() into svwks_init_one().
While at it:
* Remove redundant dev->device checks.
* Operate on a local copy of serverworks_chipsets[] entry.
* Use pci_resource_start().
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
* Split off pdc20270_get_dev2() helper from init_setup_pdc20270().
* Merge init_setup_{pdcnew,pdc20270,pdc20276}() into pdc202new_init_one().
While at it:
* Change KERN_ level of interrupt fixup message from KERN_WARNING to KERN_INFO.
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
* Split off hpt{374,371,366}_init() helper from init_setup_hpt{374,371,366}().
* Merge init_setup_{374,372n,371,372a,302,366}() into hpt366_init_one().
While at it:
* Use "HPT36x" name for HPT366/HPT368 chipsets.
* Add .chip_name to struct hpt_info and use it to set set d->name.
* Convert .max_ultra in struct hpt_info to .udma_mask and use it to set
d->udma_mask.
* Fix hpt302 to use HPT302_ALLOW_ATA133_6 define.
* Change HPT366/HPT374 interrupt fixup message from KERN_WARNING to KERN_INFO.
* Use the second hpt366_chipsets[] entry for HPT37x chipsets using HPT36x PCI
device ID and fix .enablebits/.host_flags for HPT36x hpt366_chipsets[] entry.
* Bump driver version.
Cc: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>