1
Commit Graph

10442 Commits

Author SHA1 Message Date
Al Viro
08f8585121 [PATCH] move block_device_operations to blkdev.h
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-10-21 07:47:20 -04:00
Al Viro
86d434dede [PATCH] eliminate use of ->f_flags in block methods
store needed information in f_mode

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-10-21 07:47:08 -04:00
Al Viro
aeb5d72706 [PATCH] introduce fmode_t, do annotations
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-10-21 07:47:06 -04:00
Paul Mundt
2515ddc6db binfmt_elf_fdpic: Update for cputime changes.
Commit f06febc96b ("timers: fix itimer/
many thread hang") introduced a new task_cputime interface and
subsequently only converted binfmt_elf over to it.  This results in the
build for binfmt_elf_fdpic blowing up given that p->signal->{u,s}time
have disappeared from underneath us.

Apply the same trivial fix from binfmt_elf to binfmt_elf_fdpic.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 20:17:18 -07:00
Linus Torvalds
9301975ec2 Merge branch 'genirq-v28-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
This merges branches irq/genirq, irq/sparseirq-v4, timers/hpet-percpu
and x86/uv.

The sparseirq branch is just preliminary groundwork: no sparse IRQs are
actually implemented by this tree anymore - just the new APIs are added
while keeping the old way intact as well (the new APIs map 1:1 to
irq_desc[]).  The 'real' sparse IRQ support will then be a relatively
small patch ontop of this - with a v2.6.29 merge target.

* 'genirq-v28-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (178 commits)
  genirq: improve include files
  intr_remapping: fix typo
  io_apic: make irq_mis_count available on 64-bit too
  genirq: fix name space collisions of nr_irqs in arch/*
  genirq: fix name space collision of nr_irqs in autoprobe.c
  genirq: use iterators for irq_desc loops
  proc: fixup irq iterator
  genirq: add reverse iterator for irq_desc
  x86: move ack_bad_irq() to irq.c
  x86: unify show_interrupts() and proc helpers
  x86: cleanup show_interrupts
  genirq: cleanup the sparseirq modifications
  genirq: remove artifacts from sparseirq removal
  genirq: revert dynarray
  genirq: remove irq_to_desc_alloc
  genirq: remove sparse irq code
  genirq: use inline function for irq_to_desc
  genirq: consolidate nr_irqs and for_each_irq_desc()
  x86: remove sparse irq from Kconfig
  genirq: define nr_irqs for architectures with GENERIC_HARDIRQS=n
  ...
2008-10-20 13:23:01 -07:00
Linus Torvalds
99ebcf8285 Merge branch 'v28-timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'v28-timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (36 commits)
  fix documentation of sysrq-q really
  Fix documentation of sysrq-q
  timer_list: add base address to clock base
  timer_list: print cpu number of clockevents device
  timer_list: print real timer address
  NOHZ: restart tick device from irq_enter()
  NOHZ: split tick_nohz_restart_sched_tick()
  NOHZ: unify the nohz function calls in irq_enter()
  timers: fix itimer/many thread hang, fix
  timers: fix itimer/many thread hang, v3
  ntp: improve adjtimex frequency rounding
  timekeeping: fix rounding problem during clock update
  ntp: let update_persistent_clock() sleep
  hrtimer: reorder struct hrtimer to save 8 bytes on 64bit builds
  posix-timers: lock_timer: make it readable
  posix-timers: lock_timer: kill the bogus ->it_id check
  posix-timers: kill ->it_sigev_signo and ->it_sigev_value
  posix-timers: sys_timer_create: cleanup the error handling
  posix-timers: move the initialization of timer->sigq from send to create path
  posix-timers: sys_timer_create: simplify and s/tasklist/rcu/
  ...

Fix trivial conflicts due to sysrq-q description clahes in
Documentation/sysrq.txt and drivers/char/sysrq.c
2008-10-20 13:19:56 -07:00
Linus Torvalds
1d9a8a47d6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: implement nonseekable open
  fuse: add include protectors
  fuse: config description improvement
  fuse: add missing fuse_request_free
  fuse: fix SEEK_END incorrectness
2008-10-20 12:53:27 -07:00
Alexey Dobriyan
6da0b38f44 fs/Kconfig: move ext2, ext3, ext4, JBD, JBD2 out
Use fs/*/Kconfig more, which is good because everything related to one
filesystem is in one place and fs/Kconfig is quite fat.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 11:43:59 -07:00
Linus Torvalds
45e4a24f7b Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs: (26 commits)
  9p: add more conservative locking
  9p: fix oops in protocol stat parsing error path.
  9p: fix device file handling
  9p: Improve debug support
  9p: eliminate depricated conv functions
  9p: rework client code to use new protocol support functions
  9p: remove unnecessary tag field from p9_req_t structure
  9p: remove 9p fcall debug prints
  9p: add new protocol support code
  9p: encapsulate version function
  9p: move dirread to fs layer
  9p: adjust 9p vfs write operation
  9p: move readn meta-function from client to fs layer
  9p: consolidate read/write functions
  9p: drop broken unused error path from p9_conn_create()
  9p: make rpc code common and rework flush code
  9p: use the rcall structure passed in the request in trans_fd read_work
  9p: apply common request code to trans_fd
  9p: apply common tagpool handling to trans_fd
  9p: move request management to client code
  ...
2008-10-20 09:39:47 -07:00
Linus Torvalds
52c6738b7f Merge git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
  NFS: use correct fs type for v4 submounts and referrals
  Make nfs_file_cred more robust.
  NFS: Enable NFSv4 callback server to listen on AF_INET6 sockets
2008-10-20 09:39:20 -07:00
Linus Torvalds
396b122f6a Merge branch 'linux-next' of git://git.infradead.org/ubifs-2.6
* 'linux-next' of git://git.infradead.org/ubifs-2.6: (25 commits)
  UBIFS: fix ubifs_compress commentary
  UBIFS: amend printk
  UBIFS: do not read unnecessary bytes when unpacking bits
  UBIFS: check buffer length when scanning for LPT nodes
  UBIFS: correct condition to eliminate unecessary assignment
  UBIFS: add more debugging messages for LPT
  UBIFS: fix bulk-read handling uptodate pages
  UBIFS: improve garbage collection
  UBIFS: allow for sync_fs when read-only
  UBIFS: commit on sync_fs
  UBIFS: correct comment for commit_on_unmount
  UBIFS: update dbg_dump_inode
  UBIFS: fix commentary
  UBIFS: fix races in bit-fields
  UBIFS: ensure data read beyond i_size is zeroed out correctly
  UBIFS: correct key comparison
  UBIFS: use bit-fields when possible
  UBIFS: check data CRC when in error state
  UBIFS: improve znode splitting rules
  UBIFS: add no_chk_data_crc mount option
  ...
2008-10-20 09:19:03 -07:00
Linus Torvalds
2be508d847 Merge git://git.infradead.org/mtd-2.6
* git://git.infradead.org/mtd-2.6: (69 commits)
  Revert "[MTD] m25p80.c code cleanup"
  [MTD] [NAND] GPIO driver depends on ARM... for now.
  [MTD] [NAND] sh_flctl: fix compile error
  [MTD] [NOR] AT49BV6416 has swapped erase regions
  [MTD] [NAND] GPIO NAND flash driver
  [MTD] cmdlineparts documentation change - explain where mtd-id comes from
  [MTD] cfi_cmdset_0002.c: Add Macronix CFI V1.0 TopBottom detection
  [MTD] [NAND] Fix compilation warnings in drivers/mtd/nand/cs553x_nand.c
  [JFFS2] Write buffer offset adjustment for NOR-ECC (Sibley) flash
  [MTD] mtdoops: Fix a bug where block may not be erased
  [MTD] mtdoops: Add a magic number to logged kernel oops
  [MTD] mtdoops: Fix an off by one error
  [JFFS2] Correct parameter names of jffs2_compress() in comments
  [MTD] [NAND] sh_flctl: add support for Renesas SuperH FLCTL
  [MTD] [NAND] Bug on atmel_nand HW ECC : OOB info not correctly written
  [MTD] [MAPS] Remove unused variable after ROM API cleanup.
  [MTD] m25p80.c extended jedec support (v2)
  [MTD] remove unused mtd parameter in of_mtd_parse_partitions()
  [MTD] [NAND] remove dead Kconfig associated with !CONFIG_PPC_MERGE
  [MTD] [NAND] driver extension to support NAND on TQM85xx modules
  ...
2008-10-20 09:03:12 -07:00
Alexey Dobriyan
bb26b963d8 fs/Kconfig: move CIFS out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Steven French <sfrench@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:42 -07:00
Simon Horman
85a0ee342e kdump: add is_vmcore_usable() and vmcore_unusable()
The usage of elfcorehdr_addr has changed recently such that being set to
ELFCORE_ADDR_MAX is used by is_kdump_kernel() to indicate if the code is
executing in a kernel executed as a crash kernel.

However, arch/ia64/kernel/setup.c:reserve_elfcorehdr will rest
elfcorehdr_addr to ELFCORE_ADDR_MAX on error, which means any subsequent
calls to is_kdump_kernel() will return 0, even though they should return
1.

Ok, at this point in time there are no subsequent calls, but I think its
fair to say that there is ample scope for error or at the very least
confusion.

This patch add an extra state, ELFCORE_ADDR_ERR, which indicates that
elfcorehdr_addr was passed on the command line, and thus execution is
taking place in a crashdump kernel, but vmcore can't be used for some
reason.  This is tested for using is_vmcore_usable() and set using
vmcore_unusable().  A subsequent patch makes use of this new code.

To summarise, the states that elfcorehdr_addr can now be in are as follows:

ELFCORE_ADDR_MAX: not a crashdump kernel
ELFCORE_ADDR_ERR: crashdump kernel but vmcore is unusable
any other value:  crash dump kernel and vmcore is usable

Signed-off-by: Simon Horman <horms@verge.net.au>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:40 -07:00
Vivek Goyal
57cac4d188 kdump: make elfcorehdr_addr independent of CONFIG_PROC_VMCORE
o elfcorehdr_addr is used by not only the code under CONFIG_PROC_VMCORE
  but also by the code which is not inside CONFIG_PROC_VMCORE.  For
  example, is_kdump_kernel() is used by powerpc code to determine if
  kernel is booting after a panic then use previous kernel's TCE table.
  So even if CONFIG_PROC_VMCORE is not set in second kernel, one should be
  able to correctly determine that we are booting after a panic and setup
  calgary iommu accordingly.

o So remove the assumption that elfcorehdr_addr is under
  CONFIG_PROC_VMCORE.

o Move definition of elfcorehdr_addr to arch dependent crash files.
  (Unfortunately crash dump does not have an arch independent file
  otherwise that would have been the best place).

o kexec.c is not the right place as one can Have CRASH_DUMP enabled in
  second kernel without KEXEC being enabled.

o I don't see sh setup code parsing the command line for
  elfcorehdr_addr.  I am wondering how does vmcore interface work on sh.
  Anyway, I am atleast defining elfcoredhr_addr so that compilation is not
  broken on sh.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Simon Horman <horms@verge.net.au>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:39 -07:00
Roland McGrath
656eb2cd5d add CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS
This adds a kconfig option to change the /proc/PID/coredump_filter default.
Fedora has been carrying a trivial patch to change the hard-wired value for
this default, since Fedora 8.  The default default can't change safely
because there are old GDB versions out there (all before 6.7) that are
confused by the core dump files created by the MMF_DUMP_ELF_HEADERS setting.

Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Kawai Hidehiro <hidehiro.kawai.ez@hitachi.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Jones <davej@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:39 -07:00
Oleg Nesterov
6409324b38 coredump: format_corename: don't append .%pid if multi-threaded
If the coredumping is multi-threaded, format_corename() appends .%pid to
the corename.  This was needed before the proper multi-thread core dump
support, now all the threads in the mm go into a single unified core file.

Remove this special case, it is not even documented and we have "%p"
and core_uses_pid.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Roland McGrath <roland@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: La Monte Yarroll <piggy@laurelnetworks.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:39 -07:00
Lai Jiangshan
3eda201180 seq_file: add seq_cpumask_list(), seq_nodemask_list()
seq_cpumask_list(), seq_nodemask_list() are very like seq_cpumask(),
seq_nodemask(), but they print human readable string.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Paul Menage <menage@google.com>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:39 -07:00
Lai Jiangshan
85dd030edb seq_file: don't call bitmap_scnprintf_len()
"m->count + len < m->size" is true commonly, so bitmap_scnprintf()
is commonly called. this fix saves a call to bitmap_scnprintf_len().

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Paul Menage <menage@google.com>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:39 -07:00
Eric Sesterhenn
248736c2a5 hfsplus: fix possible deadlock when handling corrupted extents
A corrupted extent for the extent file itself may try to get an impossible
extent, causing a deadlock if I see it correctly.

Check the inode number after the first_blocks checks and fail if it's the
extent file, as according to the spec the extent file should have no
extent for itself.

Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:38 -07:00
Alan Cox
6e71529444 hfsplus: missing O_LARGEFILE check
hfsplus: O_LARGEFILE checking is missing

Addresses http://bugzilla.kernel.org/show_bug.cgi?id=8490

From: Alan Cox <alan@redhat.com>
Reported-by: didier <did447@gmail.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:38 -07:00
Eric Sandeen
cdbf6dba28 ext3: avoid printk floods in the face of directory corruption
A very large directory with many read failures (either due to storage
problems, or due to invalid size & blocks from corruption) will generate a
printk storm as the filesystem continues to try to read all the blocks.
This flood of messages can tie up the box until it is complete - which may
be a very long time, especially for very large corrupted values.

This is fixed by only reporting the corruption once each time we try to
read the directory.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Eugene Teo <eugeneteo@kernel.sg>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:38 -07:00
Aneesh Kumar K.V
5ec8b75e3a ext3: truncate block allocated on a failed ext3_write_begin
For blocksize < pagesize we need to remove blocks that got allocated in
block_write_begin() if we fail with ENOSPC for later blocks.
block_write_begin() internally does this if it allocated page locally.
This makes sure we don't have blocks outside inode.i_size during ENOSPC.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:38 -07:00
Eugene Dashevsky
6a897cf447 ext3: fix ext3_dx_readdir hash collision handling
This fixes a bug where readdir() would return a directory entry twice
if there was a hash collision in an hash tree indexed directory.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Eugene Dashevsky <eugene@ibrix.com>
Signed-off-by: Mike Snitzer <msnitzer@ibrix.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:38 -07:00
Hidehiro Kawai
960a22ae60 jbd: ordered data integrity fix
In ordered mode, if a file data buffer being dirtied exists in the
committing transaction, we write the buffer to the disk, move it from the
committing transaction to the running transaction, then dirty it.  But we
don't have to remove the buffer from the committing transaction when the
buffer couldn't be written out, otherwise it would miss the error and the
committing transaction would not abort.

This patch adds an error check before removing the buffer from the
committing transaction.

Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:37 -07:00
Hidehiro Kawai
0e4fb5e283 ext3: add an option to control error handling on file data
If the journal doesn't abort when it gets an IO error in file data blocks,
the file data corruption will spread silently.  Because most of
applications and commands do buffered writes without fsync(), they don't
notice the IO error.  It's scary for mission critical systems.  On the
other hand, if the journal aborts whenever it gets an IO error in file
data blocks, the system will easily become inoperable.  So this patch
introduces a filesystem option to determine whether it aborts the journal
or just call printk() when it gets an IO error in file data.

If you mount a ext3 fs with data_err=abort option, it aborts on file data
write error.  If you mount it with data_err=ignore, it doesn't abort, just
call printk().  data_err=ignore is the default.

Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: Jan Kara <jack@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:37 -07:00
Mingming Cao
46d01a225e ext3: fix ext3 block reservation early ENOSPC issue
We could run into ENOSPC error on ext3, even when there is free blocks on
the filesystem.

The problem is triggered in the case the goal block group has 0 free
blocks , and the rest block groups are skipped due to the check of
"free_blocks < windowsz/2".  Current code could fall back to non
reservation allocation to prevent early ENOSPC after examing all the block
groups with reservation on , but this code was bypassed if the reservation
window is turned off already, which is true in this case.

This patch fixed two issues:
1) We don't need to turn off block reservation if the goal block group has
0 free blocks left and continue search for the rest of block groups.

Current code the intention is to turn off the block reservation if the
goal allocation group has a few (some) free blocks left (not enough for
make the desired reservation window),to try to allocation in the goal
block group, to get better locality.  But if the goal blocks have 0 free
blocks, it should leave the block reservation on, and continues search for
the next block groups,rather than turn off block reservation completely.

2) we don't need to check the window size if the block reservation is off.

The problem was originally found and fixed in ext4.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:37 -07:00
Josef Bacik
972fbf7798 ext3: don't try to resize if there are no reserved gdt blocks left
When trying to resize a ext3 fs and you run out of reserved gdt blocks,
you get an error that doesn't actually tell you what went wrong, it just
says that the gdb it picked is not correct, which is the case since you
don't have any reserved gdt blocks left.  This patch adds a check to make
sure you have reserved gdt blocks to use, and if not prints out a more
relevant error.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Cc: <linux-ext4@vger.kernel.org>
Cc: Andreas Dilger <adilger@sun.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:37 -07:00
Hidehiro Kawai
885e353c74 jbd: don't dirty original metadata buffer on abort
Currently, original metadata buffers are dirtied when they are unfiled
whether the journal has aborted or not.  Eventually these buffers will be
written-back to the filesystem by pdflush.  This means some metadata
buffers are written to the filesystem without journaling if the journal
aborts.  So if both journal abort and system crash happen at the same
time, the filesystem would become inconsistent state.  Additionally,
replaying journaled metadata can overwrite the latest metadata on the
filesystem partly.  Because, if the journal aborts, journaled metadata are
preserved and replayed during the next mount not to lose uncheckpointed
metadata.  This would also break the consistency of the filesystem.

This patch prevents original metadata buffers from being dirtied on abort
by clearing BH_JBDDirty flag from those buffers.  Thus, no metadata
buffers are written to the filesystem without journaling.

Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Acked-by: Jan Kara <jack@suse.cz>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:37 -07:00
Hidehiro Kawai
d1645e526a jbd: abort when failed to log metadata buffers
If we failed to write metadata buffers to the journal space and succeeded
to write the commit record, stale data can be written back to the
filesystem as metadata in the recovery phase.

To avoid this, when we failed to write out metadata buffers, abort the
journal before writing the commit record.

Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Acked-by: Jan Kara <jack@suse.cz>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:36 -07:00
KOSAKI Motohiro
e575f111dc coredump_filter: add hugepage dumping
Presently hugepage's vma has a VM_RESERVED flag in order not to be
swapped.  But a VM_RESERVED vma isn't core dumped because this flag is
often used for some kernel vmas (e.g.  vmalloc, sound related).

Thus hugepages are never dumped and it can't be debugged easily.  Many
developers want hugepages to be included into core-dump.

However, We can't read generic VM_RESERVED area because this area is often
IO mapping area.  then these area reading may change device state.  it is
definitly undesiable side-effect.

So adding a hugepage specific bit to the coredump filter is better.  It
will be able to hugepage core dumping and doesn't cause any side-effect to
any i/o devices.

In additional, libhugetlb use hugetlb private mapping pages as anonymous
page.  Then, hugepage private mapping pages should be core dumped by
default.

Then, /proc/[pid]/core_dump_filter has two new bits.

 - bit 5 mean hugetlb private mapping pages are dumped or not. (default: yes)
 - bit 6 mean hugetlb shared mapping pages are dumped or not.  (default: no)

I tested by following method.

% ulimit -c unlimited
% ./crash_hugepage  50
% ./crash_hugepage  50  -p
% ls -lh
% gdb ./crash_hugepage core
%
% echo 0x43 > /proc/self/coredump_filter
% ./crash_hugepage  50
% ./crash_hugepage  50  -p
% ls -lh
% gdb ./crash_hugepage core

#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <string.h>

#include "hugetlbfs.h"

int main(int argc, char** argv){
	char* p;
	int ch;
	int mmap_flags = MAP_SHARED;
	int fd;
	int nr_pages;

	while((ch = getopt(argc, argv, "p")) != -1) {
		switch (ch) {
		case 'p':
			mmap_flags &= ~MAP_SHARED;
			mmap_flags |= MAP_PRIVATE;
			break;
		default:
			/* nothing*/
			break;
		}
	}
	argc -= optind;
	argv += optind;

	if (argc == 0){
		printf("need # of pages\n");
		exit(1);
	}

	nr_pages = atoi(argv[0]);
	if (nr_pages < 2) {
		printf("nr_pages must >2\n");
		exit(1);
	}

	fd = hugetlbfs_unlinked_fd();
	p = mmap(NULL, nr_pages * gethugepagesize(),
		 PROT_READ|PROT_WRITE, mmap_flags, fd, 0);

	sleep(2);

	*(p + gethugepagesize()) = 1; /* COW */
	sleep(2);

	/* crash! */
	*(int*)0 = 1;

	return 0;
}

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Kawai Hidehiro <hidehiro.kawai.ez@hitachi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: William Irwin <wli@holomorphy.com>
Cc: Adam Litke <agl@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:32 -07:00
Nick Piggin
51b07fc3c5 fs: buffer lock use lock bitops
trylock_buffer and unlock_buffer open and close a critical section.
Hence, we can use the lock bitops to get the desired memory ordering.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:32 -07:00
Nick Piggin
5344b7e648 vmstat: mlocked pages statistics
Add NR_MLOCK zone page state, which provides a (conservative) count of
mlocked pages (actually, the number of mlocked pages moved off the LRU).

Reworked by lts to fit in with the modified mlock page support in the
Reclaim Scalability series.

[kosaki.motohiro@jp.fujitsu.com: fix incorrect Mlocked field of /proc/meminfo]
[lee.schermerhorn@hp.com: mlocked-pages: add event counting with statistics]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:31 -07:00
Lee Schermerhorn
ba9ddf4939 Ramfs and Ram Disk pages are unevictable
Christoph Lameter pointed out that ram disk pages also clutter the LRU
lists.  When vmscan finds them dirty and tries to clean them, the ram disk
writeback function just redirties the page so that it goes back onto the
active list.  Round and round she goes...

With the ram disk driver [rd.c] replaced by the newer 'brd.c', this is no
longer the case, as ram disk pages are no longer maintained on the lru.
[This makes them unmigratable for defrag or memory hot remove, but that
can be addressed by a separate patch series.] However, the ramfs pages
behave like ram disk pages used to, so:

Define new address_space flag [shares address_space flags member with
mapping's gfp mask] to indicate that the address space contains all
unevictable pages.  This will provide for efficient testing of ramfs pages
in page_evictable().

Also provide wrapper functions to set/test the unevictable state to
minimize #ifdefs in ramfs driver and any other users of this facility.

Set the unevictable state on address_space structures for new ramfs
inodes.  Test the unevictable state in page_evictable() to cull
unevictable pages.

These changes depend on [CONFIG_]UNEVICTABLE_LRU.

[riel@redhat.com: undo the brd.c part]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Debugged-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:50:26 -07:00
Lee Schermerhorn
7b854121eb Unevictable LRU Page Statistics
Report unevictable pages per zone and system wide.

Kosaki Motohiro added support for memory controller unevictable
statistics.

[riel@redhat.com: fix printk in show_free_areas()]
[akpm@linux-foundation.org: fix units in /proc/vmstats]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Debugged-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:50:26 -07:00
Rik van Riel
4f98a2fee8 vmscan: split LRU lists into anon & file sets
Split the LRU lists in two, one set for pages that are backed by real file
systems ("file") and one for pages that are backed by memory and swap
("anon").  The latter includes tmpfs.

The advantage of doing this is that the VM will not have to scan over lots
of anonymous pages (which we generally do not want to swap out), just to
find the page cache pages that it should evict.

This patch has the infrastructure and a basic policy to balance how much
we scan the anon lists and how much we scan the file lists.  The big
policy changes are in separate patches.

[lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset]
[kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru]
[kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page]
[hugh@veritas.com: memcg swapbacked pages active]
[hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED]
[akpm@linux-foundation.org: fix /proc/vmstat units]
[nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration]
[kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo]
[kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()]
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:50:25 -07:00
Thomas Gleixner
c465a76af6 Merge branches 'timers/clocksource', 'timers/hrtimers', 'timers/nohz', 'timers/ntp', 'timers/posixtimers' and 'timers/debug' into v28-timers-for-linus 2008-10-20 13:14:06 +02:00
Geert Uytterhoeven
54779aabb0 UBIFS: fix ubifs_compress commentary
Update the comment for ubifs_compress(), which incorrectly states that it
returnsa success/failure indicator.

Signed-off-by: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2008-10-19 13:01:37 +03:00
Artem Bityutskiy
fae7fb299f UBIFS: amend printk
It is better to print "Reserved for root" than
"Reserved pool size", because it is more obvious for users
what this means.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2008-10-19 13:01:30 +03:00
Adrian Hunter
727d2dc045 UBIFS: do not read unnecessary bytes when unpacking bits
Fixes the following Oops:

BUG: unable to handle kernel paging request at f8d24000
IP: [<f8ff0657>] :ubifs:ubifs_unpack_bits+0xcd/0x231
*pde = 34333067 *pte = 00000000
Oops: 0000 [#1] PREEMPT SMP
Modules linked in: deflate zlib_deflate lzo lzo_decompress lzo_compress
ubifs ubi nandsim nand nand_ids nand_ecc mtd nfsd lockd sunrpc exportfs
[last unloaded: nand_ecc]

Pid: 7450, comm: sync Not tainted (2.6.27-rc8-ubifs-2.6 #27)
EIP: 0060:[<f8ff0657>] EFLAGS: 00010206 CPU: 0
EIP is at ubifs_unpack_bits+0xcd/0x231 [ubifs]
EAX: 00000000 EBX: 00000000 ECX: d7e43dc0 EDX: 0000ff00
ESI: 00000004 EDI: f8d23ffe EBP: d7e43db4 ESP: d7e43d8c
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process sync (pid: 7450, ti=d7e42000 task=eb6f9530 task.ti=d7e42000)
Stack: 00000400 c0103db4 dc5e8090 d7e43dc0 d7e43dc0 d7e43dc4 0000001c 00000004
      f496d1e0 f8d23ffc d7e43dd4 f8ffac7e f8d23ffe 00000000 f8d23ffe f2b7af68
      f496d1e0 f8d23ffc d7e43e2c f8ffadc5 00000000 0001f000 00000000 c03b10a7
Call Trace:
[<c0103db4>] ? restore_nocheck_notrace+0x0/0xe
[<f8ffac7e>] ? is_a_node+0x43/0x92 [ubifs]
[<f8ffadc5>] ? dbg_check_ltab+0xf8/0x5c9 [ubifs]
[<c03b10a7>] ? mutex_lock_nested+0x1b2/0x2a0
[<f8ffc86e>] ? ubifs_lpt_start_commit+0x49/0xecb [ubifs]
[<c03b0ef3>] ? mutex_unlock+0xd/0xf
[<f8fef017>] ? ubifs_tnc_start_commit+0x1cf/0xef8 [ubifs]
[<f8fe65d8>] ? do_commit+0x18f/0x52d [ubifs]
[<f8fe69f6>] ? ubifs_run_commit+0x80/0xca [ubifs]
[<f8fd8d35>] ? ubifs_sync_fs+0xdb/0xf6 [ubifs]
[<c0181a07>] ? sync_filesystems+0xc6/0x10c
[<c019f279>] ? do_sync+0x3b/0x6a
[<c019f2ba>] ? sys_sync+0x12/0x18
[<c0103ced>] ? sysenter_do_call+0x12/0x35
=======================
Code: 4d ec 89 01 8b 45 e8 89 10 89 d8 89 f1 d3 e8 85 c0 74 07 29 d6 83 fe
20 75 2a 89 d8 83 c4 1c 5b 5e 5f 5d c3 0f b6 57 01 c1 e2 08 <0f> b6 47 02
c1 e0 10 09 c2 0f b6 07 09 c2 0f b
EIP: [<f8ff0657>] ubifs_unpack_bits+0xcd/0x231 [ubifs] SS:ESP 0068:d7e43d8c
---[ end trace 1bbb4c407a6dd816 ]---

Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
2008-10-19 13:01:21 +03:00
Alexander Belyakov
5bf1723723 [JFFS2] Write buffer offset adjustment for NOR-ECC (Sibley) flash
After choosing new c->nextblock, don't leave the wbuf offset field
occasionally pointing at the start of the next physical eraseblock.
This was causing a BUG() on NOR-ECC (Sibley) flash, where we start
writing after the cleanmarker.

Among other this fix should cover write buffer offset adjustment
after flushing the last page of an eraseblock.

Signed-off-by: Alexander Belyakov <abelyako@googlemail.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
2008-10-18 11:54:09 +01:00
Linus Torvalds
58617d5e59 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: Remove automatic enabling of the HUGE_FILE feature flag
  ext4: Replace hackish ext4_mb_poll_new_transaction with commit callback
  ext4: Update Documentation/filesystems/ext4.txt
  ext4: Remove unused mount options: nomballoc, mballoc, nocheck
  ext4: Remove compile warnings when building w/o CONFIG_PROC_FS
  ext4: Add missing newlines to printk messages
  ext4: Fix file fragmentation during large file write.
  vfs: Add no_nrwrite_index_update writeback control flag
  vfs: Remove the range_cont writeback mode.
  ext4: Use tag dirty lookup during mpage_da_submit_io
  ext4: let the block device know when unused blocks can be discarded
  ext4: Don't reuse released data blocks until transaction commits
  ext4: Use an rbtree for tracking blocks freed during transaction.
  ext4: Do mballoc init before doing filesystem recovery
  ext4: Free ext4_prealloc_space using kmem_cache_free
  ext4: Fix Kconfig typo for ext4dev
  ext4: Remove an old reference to ext4dev in Makefile comment
2008-10-17 15:08:11 -07:00
Magnus Deininger
57c7b4e68e 9p: fix device file handling
In v9fs_get_inode(), for block, as well as char devices (in theory), 
the function init_special_inode() is called to set up callback functions 
for file ops. this function uses the file mode's value to determine whether 
to use block or char dev functions. In v9fs_inode_from_fid(), the function 
p9mode2unixmode() is used, but for all devices it initially returns S_IFBLK, 
then uses v9fs_get_inode() to initialise a new inode, then finally uses 
v9fs_stat2inode(), which would determine whether the inode is a block or 
character device. However, at that point init_special_inode() had already 
decided to use the block device functions, so even if the inode's mode is 
turned to a character device, the block functions are still used to operate 
on them. The attached patch simply calls init_special_inode() again for devices 
after parsing device node data in v9fs_stat2inode() so that the proper functions 
are used.

Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2008-10-17 12:44:46 -05:00
Andy Adamson
ec9a05c94c NFS: use correct fs type for v4 submounts and referrals
Signed-off-by: Andy Adamson<andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-10-17 13:06:48 -04:00
Neil Brown
504e518953 Make nfs_file_cred more robust.
As not all files have an associated open_context (e.g. device special
files), it is safest to test for the existence of the open context
before de-referencing it.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-10-17 13:06:45 -04:00
Chuck Lever
18de973530 NFS: Enable NFSv4 callback server to listen on AF_INET6 sockets
Allow the NFS callback server to listen for requests via an AF_INET6 or
AF_INET socket when IPv6 support is present in the kernel.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-10-17 13:06:41 -04:00
Eric Van Hensbergen
02da398b95 9p: eliminate depricated conv functions
Remove depricated conv functions which have been replaced with new 
protocol routines.

This patch also reworks the one instance of the file-system code which
directly calls conversion routines (to accomplish unpacking dirreads).

Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2008-10-17 11:06:57 -05:00
Eric Van Hensbergen
51a87c552d 9p: rework client code to use new protocol support functions
Now that the new protocol functions are in place, this patch switches
the client code to using the new support code.

Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2008-10-17 11:04:45 -05:00
Eric Van Hensbergen
06b55b464e 9p: move dirread to fs layer
Currently reading a directory is implemented in the client code.
This function is not actually a wire operation, but a meta operation 
which calls read operations and processes the results.

This patch moves this functionality to the fs layer and calls component
wire operations instead of constructing their packets.  This provides a 
cleaner separation and will help when we reorganize the client functions
and protocol processing methods.

Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2008-10-17 11:04:43 -05:00
Eric Van Hensbergen
dfb0ec2e13 9p: adjust 9p vfs write operation
Currently, the 9p net wire operation ensures that all data is sent by sending
multiple packets if the data requested is larger than the msize.  This is
better handled in the vfs code so that we can simplify wire operations to 
being concerned with only putting data onto and taking data off of the wire.

Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2008-10-17 11:04:43 -05:00