linux

Author	SHA1	Message	Date
Lai Jiangshan	57bdfbf9ee	block,rcu: Convert call_rcu(disk_free_ptbl_rcu_cb) to kfree_rcu() The rcu callback disk_free_ptbl_rcu_cb() just calls a kfree(), so we use kfree_rcu() instead of the call_rcu(disk_free_ptbl_rcu_cb). Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Jens Axboe <axboe@kernel.dk> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2011-07-20 14:10:13 -07:00
Shaohua Li	7700fc4f67	CFQ: add think time check for group Currently when the last queue of a group has no request, we don't expire the queue to hope request from the group comes soon, so the group doesn't miss its share. But if the think time is big, the assumption isn't correct and we just waste bandwidth. In such case, we don't do idle. [global] runtime=30 direct=1 [test1] cgroup=test1 cgroup_weight=1000 rw=randread ioengine=libaio size=500m runtime=30 directory=/mnt filename=file1 thinktime=9000 [test2] cgroup=test2 cgroup_weight=1000 rw=randread ioengine=libaio size=500m runtime=30 directory=/mnt filename=file2 patched base test1 64k 39k test2 548k 540k total 604k 578k group1 gets much better throughput because it waits less time. To check if the patch changes behavior of queue without think time. I also tried to give test1 2ms think time or no think time. The test result is stable. The thoughput doesn't change with/without the patch. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-07-12 14:24:56 +02:00
Shaohua Li	f5f2b6ceb2	CFQ: add think time check for service tree Currently when the last queue of a service tree has no request, we don't expire the queue to hope request from the service tree comes soon, so the service tree doesn't miss its share. But if the think time is big, the assumption isn't correct and we just waste bandwidth. In such case, we don't do idle. [global] runtime=10 direct=1 [test1] rw=randread ioengine=libaio size=500m directory=/mnt filename=file1 thinktime=9000 [test2] rw=read ioengine=libaio size=1G directory=/mnt filename=file2 patched base test1 41k/s 33k/s test2 15868k/s 15789k/s total 15902k/s 15817k/s A slightly better To check if the patch changes behavior of queue without think time. I also tried to give test1 2ms think time or no think time. The test has variation even without the patch, but the average throughput doesn't change with/without the patch. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-07-12 14:24:55 +02:00
Shaohua Li	383cd7213f	CFQ: move think time check variables to a separate struct Move the variables to do think time check to a sepatate struct. This is to prepare adding think time check for service tree and group. No functional change. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-07-12 14:24:35 +02:00
Justin TerAvest	4aede84b33	fixlet: Remove fs_excl from struct task. fs_excl is a poor man's priority inheritance for filesystems to hint to the block layer that an operation is important. It was never clearly specified, not widely adopted, and will not prevent starvation in many cases (like across cgroups). fs_excl was introduced with the time sliced CFQ IO scheduler, to indicate when a process held FS exclusive resources and thus needed a boost. It doesn't cover all file systems, and it was never fully complete. Lets kill it. Signed-off-by: Justin TerAvest <teravest@google.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-07-12 08:35:10 +02:00
Justin TerAvest	a07405b780	cfq: Remove special treatment for metadata rqs. There is no consistency among filesystems from what bios (or requests) are marked as being metadata. It's interesting to expose this in traces, but we shouldn't schedule the requests differently based on whether or not they're marked as being metadata. Signed-off-by: Justin TerAvest <teravest@google.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-07-10 22:09:19 +02:00
Shaohua Li	55c022bbdd	block: avoid building too big plug list When I test fio script with big I/O depth, I found the total throughput drops compared to some relative small I/O depth. The reason is the thread accumulates big requests in its plug list and causes some delays (surely this depends on CPU speed). I thought we'd better have a threshold for requests. When a threshold reaches, this means there is no request merge and queue lock contention isn't severe when pushing per-task requests to queue, so the main advantages of blk plug don't exist. We can force a plug list flush in this case. With this, my test throughput actually increases and almost equals to small I/O depth. Another side effect is irq off time decreases in blk_flush_plug_list() for big I/O depth. The BLK_MAX_REQUEST_COUNT is choosen arbitarily, but 16 is efficiently to reduce lock contention to me. But I'm open here, 32 is ok in my test too. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-07-08 08:19:20 +02:00
Mike Snitzer	0f79960391	block: eliminate potential for infinite loop in blkdev_issue_discard Due to the recently identified overflow in read_capacity_16() it was possible for max_discard_sectors to be zero but still have discards enabled on the associated device's queue. Eliminate the possibility for blkdev_issue_discard to infinitely loop. Interestingly this issue wasn't identified until a device, whose discard_granularity was 0 due to read_capacity_16 overflow, was consumed by blk_stack_limits() to construct limits for a higher-level DM multipath device. The multipath device's resulting limits never had the discard limits stacked because blk_stack_limits() will only do so if the bottom device's discard_granularity != 0. This resulted in the multipath device's limits.max_discard_sectors being 0. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-07-06 21:32:02 +02:00
Johannes Stezenbach	390192b300	compat_ioctl: fix warning caused by qemu On Linux x86_64 host with 32bit userspace, running qemu or even just "qemu-img create -f qcow2 some.img 1G" causes a kernel warning: ioctl32(qemu-img:5296): Unknown cmd fd(3) cmd(00005326){t:'S';sz:0} arg(7fffffff) on some.img ioctl32(qemu-img:5296): Unknown cmd fd(3) cmd(801c0204){t:02;sz:28} arg(fff77350) on some.img ioctl 00005326 is CDROM_DRIVE_STATUS, ioctl 801c0204 is FDGETPRM. The warning appears because the Linux compat-ioctl handler for these ioctls only applies to block devices, while qemu also uses the ioctls on plain files. Signed-off-by: Johannes Stezenbach <js@sig21.net> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-07-01 22:32:26 +02:00
Tejun Heo	85ef06d1d2	block: flush MEDIA_CHANGE from drivers on close(2) Currently, only open(2) is defined as the 'clearing' point. It has two roles - first, it's an acknowledgement from userland indicating that the event has been received and kernel can clear pending states and proceed to generate more events. Secondly, it's passed on to device drivers as a hint indicating that a synchronization point has been reached and it might want to take a deeper look at the device. The latter currently is only used by sr which uses two different mechanisms - GET_EVENT_MEDIA_STATUS_NOTIFICATION and TEST_UNIT_READY to discover events, where the former is lighter weight and safe to be used repeatedly but may not provide full coverage. Among other things, GET_EVENT can't detect media removal while TUR can. This patch makes close(2) - blkdev_put() - indicate clearing hint for MEDIA_CHANGE to drivers. disk_check_events() is renamed to disk_flush_events() and updated to take @mask for events to flush which is or'd to ev->clearing and will be passed to the driver on the next ->check_events() invocation. This change makes sr generate MEDIA_CHANGE when media is ejected from userland - e.g. with eject(1). Note: Given the current usage, it seems @clearing hint is needlessly complex. disk_clear_events() can simply clear all events and the hint can be boolean @flush. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-07-01 16:17:47 +02:00
Jens Axboe	04bf7869ca	Merge branch 'for-linus' into for-3.1/core Conflicts: block/blk-throttle.c block/cfq-iosched.c Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-07-01 16:17:13 +02:00
Shaohua Li	726e99ab88	cfq-iosched: make code consistent ioc->ioc_data is rcu protectd, so uses correct API to access it. This doesn't change any behavior, but just make code consistent. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Cc: stable@kernel.org # after `ab4bd22d` Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-06-27 09:36:06 +02:00
Shaohua Li	3181faa85b	cfq-iosched: fix a rcu warning I got a rcu warnning at boot. the ioc->ioc_data is rcu_deferenced, but doesn't hold rcu_read_lock. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Cc: stable@kernel.org # after `ab4bd22d` Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-06-27 09:36:06 +02:00
Namhyung Kim	2b727c6300	bsg: fix address space warning from sparse copy_from/to_user() and blk_rq_map_user() want __user pointer. This patch fixes following warnings from sparse: CHECK block/bsg.c block/bsg.c:185:38: warning: incorrect type in argument 2 (different address spaces) block/bsg.c:185:38: expected void const [noderef] <asn:1>from block/bsg.c:185:38: got void <noident> block/bsg.c:295:58: warning: incorrect type in argument 4 (different address spaces) block/bsg.c:295:58: expected void [noderef] <asn:1><noident> block/bsg.c:295:58: got void [assigned] dxferp block/bsg.c:311:52: warning: incorrect type in argument 4 (different address spaces) block/bsg.c:311:52: expected void [noderef] <asn:1><noident> block/bsg.c:311:52: got void [assigned] dxferp block/bsg.c:448:37: warning: incorrect type in argument 1 (different address spaces) block/bsg.c:448:37: expected void [noderef] <asn:1>dst block/bsg.c:448:37: got void <noident> Signed-off-by: Namhyung Kim <namhyung@gmail.com> Acked-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-06-20 13:27:45 +02:00
Namhyung Kim	44194e3e88	bsg: remove unnecessary conditional expressions Second condition in OR always implies first condition is false thus bytes_read in the second is not needed. The same goes to bytes_written. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Acked-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-06-20 13:27:44 +02:00
Namhyung Kim	80ceb05713	bsg: fix bsg_poll() to return POLLOUT properly POLLOUT should be returned only if bd->queued_cmds < bd->max_queue so that bsg_alloc_command() can proceed. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Acked-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-06-20 13:27:44 +02:00
Joe Perches	d2f31a5fd6	blk-throttle: Make total_nr_queued unsigned The total of two unsigned values should also be unsigned. Update throtl_log output to unsigned. Update total_nr_queued test to non-zero to be the same as the other total_nr_queued tests. Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-06-13 20:19:27 +02:00
Joe Perches	fd16d26319	block: Add __attribute__((format(printf...) and fix fallout Use the compiler to verify format strings and arguments. Fix fallout. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-06-13 20:18:49 +02:00
Wanlong Gao	9f5e486550	block:remove some spare spaces in genhd.c Remove the end-of-line spaces in genhd.c. Signed-off-by: Wanlong Gao <wanlong.gao@gmail.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-06-13 10:45:43 +02:00
Joe Perches	08e8138ade	block: Add __attribute__((format(printf...) and fix fallout Use the compiler to verify format strings and arguments. Fix fallout. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-06-13 10:42:49 +02:00
Tejun Heo	fdd514e16b	block: make disk_block_events() properly wait for work cancellation disk_block_events() should guarantee that the event work is not in flight on return and once blocked it shouldn't issue further cancellations. Because there was no synchronization between the first blocker doing cancel_delayed_work_sync() and the following blockers, the following blockers could finish before cancellation was complete, which broke both guarantees - event work could be in flight and cancellation could happen after return. This bug triggered WARN_ON_ONCE() in disk_clear_events() reported in bug#34662. https://bugzilla.kernel.org/show_bug.cgi?id=34662 Fix it by adding an outer mutex which protects both block count manipulation and work cancellation. -v2: Use outer mutex instead of bit waitqueue per Linus. Signed-off-by: Tejun Heo <tj@kernel.org> Tested-by: Sitsofe Wheeler <sitsofe@yahoo.com> Reported-by: Sitsofe Wheeler <sitsofe@yahoo.com> Reported-by: Borislav Petkov <bp@alien8.de> Reported-by: Meelis Roos <mroos@linux.ee> Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-06-09 20:43:59 +02:00
Tejun Heo	c3af54afba	block: remove non-syncing __disk_block_events() and fold it into disk_block_events() After the previous update to disk_check_events(), nobody is using non-syncing __disk_block_events(). Remove @sync and, as this makes __disk_block_events() virtually identical to disk_block_events(), remove the underscore prefixed version. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-06-09 20:43:55 +02:00
Tejun Heo	a9dce2a3b4	block: don't use non-syncing event blocking in disk_check_events() This patch is part of fix for triggering of WARN_ON_ONCE() in disk_clear_events() reported in bug#34662. https://bugzilla.kernel.org/show_bug.cgi?id=34662 disk_clear_events() blocks events, schedules and flushes the event work. It expects the work to have started execution on schedule and finished on return from flush. WARN_ON_ONCE() triggers if the event work hasn't executed as expected. This problem happens because __disk_block_events() fails to guarantee that the event work item is not in flight on return from the function in race-free manner. The problem is two-fold and this patch addresses one of them. When __disk_block_events() is called with @sync == %false, it bumps event block count, calls cancel_delayed_work() and return. This makes it impossible to guarantee that event polling is not in flight on return from syncing __disk_block_events() - if the first blocker was non-syncing, polling could still be in progress and later syncing ones would assume that the first blocker already canceled it. Making __disk_block_events() cancel_sync regardless of block count isn't feasible either as it may race with forced event checking in disk_clear_events(). As disk_check_events() is the only user of non-syncing __disk_block_events(), updating it to directly cancel and schedule event work is the easiest way to solve the issue. Note that there's another bug in __disk_block_events() and this patch doesn't fix the issue completely. Later patch will fix the other bug. Signed-off-by: Tejun Heo <tj@kernel.org> Tested-by: Sitsofe Wheeler <sitsofe@yahoo.com> Reported-by: Sitsofe Wheeler <sitsofe@yahoo.com> Reported-by: Borislav Petkov <bp@alien8.de> Reported-by: Meelis Roos <mroos@linux.ee> Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-06-09 20:43:54 +02:00
Paul Bolle	df4156569d	block: rename the return of two functions If we rename the return of alloc_io_context() and get_io_context() from "ret" to "ioc" the code get's (a bit) more readable and (a lot) more grepable. Signed-off-by: Paul Bolle <pebolle@tiscali.nl> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-06-06 05:57:25 +02:00
Paul Bolle	8aea45451b	CFQ: make two functions static Correctly suggested by sparse. Signed-off-by: Paul Bolle <pebolle@tiscali.nl> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-06-06 05:57:25 +02:00
Jens Axboe	9b50902db5	cfq-iosched: fix locking around ioc->ioc_data assignment Since we are modifying this RCU pointer, we need to hold the lock protecting it around it. This fixes a potential reuse and double free of a cfq io_context structure. The bug has been in CFQ for a long time, it hit very few people but those it did hit seemed to see it a lot. Tracked in RH bugzilla here: https://bugzilla.redhat.com/show_bug.cgi?id=577968 Credit goes to Paul Bolle for figuring out that the issue was around the one-hit ioc->ioc_data cache. Thanks to his hard work the issue is now fixed. Cc: stable@kernel.org Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-06-06 05:57:21 +02:00
Jens Axboe	ab4bd22d3c	cfq-iosched: fix locking around ioc->ioc_data assignment Since we are modifying this RCU pointer, we need to hold the lock protecting it around it. This fixes a potential reuse and double free of a cfq io_context structure. The bug has been in CFQ for a long time, it hit very few people but those it did hit seemed to see it a lot. Tracked in RH bugzilla here: https://bugzilla.redhat.com/show_bug.cgi?id=577968 Credit goes to Paul Bolle for figuring out that the issue was around the one-hit ioc->ioc_data cache. Thanks to his hard work the issue is now fixed. Cc: stable@kernel.org Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-06-06 05:56:49 +02:00
Jeff Moyer	796d5116c4	iosched: prevent aliased requests from starving other I/O Hi, Jens, If you recall, I posted an RFC patch for this back in July of last year: http://lkml.org/lkml/2010/7/13/279 The basic problem is that a process can issue a never-ending stream of async direct I/Os to the same sector on a device, thus starving out other I/O in the system (due to the way the alias handling works in both cfq and deadline). The solution I proposed back then was to start dispatching from the fifo after a certain number of aliases had been dispatched. Vivek asked why we had to treat aliases differently at all, and I never had a good answer. So, I put together a simple patch which allows aliases to be added to the rb tree (it adds them to the right, though that doesn't matter as the order isn't guaranteed anyway). I think this is the preferred solution, as it doesn't break up time slices in CFQ or batches in deadline. I've tested it, and it does solve the starvation issue. Let me know what you think. Cheers, Jeff Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-06-02 21:19:05 +02:00
Paul Bolle	e2bd9678fc	block: Use hlist_entry() for io_context.cic_list.first list_entry() and hlist_entry() are both simply aliases for container_of(), but since io_context.cic_list.first is an hlist_node one should at least use the correct alias. Signed-off-by: Paul Bolle <pebolle@tiscali.nl> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-06-02 13:05:02 +02:00
Paul Bolle	28304f485c	cfq-iosched: Remove bogus check in queue_fail path queue_fail can only be reached if cic is NULL, so its check for cic must be bogus. Signed-off-by: Paul Bolle <pebolle@tiscali.nl> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-06-02 13:05:02 +02:00
Kyungmin Park	4495a7d41d	CFQ: Fix typo and remove unnecessary semicolon Fix comment typo and remove unnecessary semicolon at macro Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-05-31 19:49:44 +02:00
Linus Torvalds	bdf7cf1c83	Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block * 'for-linus' of git://git.kernel.dk/linux-2.6-block: loop: export module parameters block: export blk_{get,put}_queue() block: remove unused variable in bio_attempt_front_merge() block: always allocate genhd->ev if check_events is implemented brd: export module parameters brd: fix comment on initial device creation brd: handle on-demand devices correctly brd: limit 'max_part' module param to DISK_MAX_PARTS brd: get rid of unused members from struct brd_device block: fix oops on !disk->queue and sysfs discard alignment display	2011-05-27 10:24:40 -07:00
Jens Axboe	d86e0e83b3	block: export blk_{get,put}_queue() We need them in SCSI to fix a bug, but currently they are not exported to modules. Export them. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-05-27 07:45:45 +02:00
Ben Blum	f780bdb7c1	cgroups: add per-thread subsystem callbacks Add cgroup subsystem callbacks for per-thread attachment in atomic contexts Add can_attach_task(), pre_attach(), and attach_task() as new callbacks for cgroups's subsystem interface. Unlike can_attach and attach, these are for per-thread operations, to be called potentially many times when attaching an entire threadgroup. Also, the old "bool threadgroup" interface is removed, as replaced by this. All subsystems are modified for the new interface - of note is cpuset, which requires from/to nodemasks for attach to be globally scoped (though per-cpuset would work too) to persist from its pre_attach to attach_task and attach. This is a pre-patch for cgroup-procs-writable.patch. Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Matt Helsley <matthltc@us.ibm.com> Reviewed-by: Paul Menage <menage@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-05-26 17:12:34 -07:00
Luca Tettamanti	700c4f3325	block: remove unused variable in bio_attempt_front_merge() sector is never read inside the function. Signed-off-by: Luca Tettamanti <kronos.it@gmail.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-05-26 21:07:26 +02:00
Tejun Heo	75e3f3ee3c	block: always allocate genhd->ev if check_events is implemented `9fd097b149` (block: unexport DISK_EVENT_MEDIA_CHANGE for legacy/fringe drivers) removed DISK_EVENT_MEDIA_CHANGE from legacy/fringe block drivers which have inadequate ->check_events(). Combined with earlier change `7c88a168da` (block: don't propagate unlisted DISK_EVENTs to userland), this enables using ->check_events() for internal processing while avoiding enabling in-kernel block event polling which can lead to infinite event loop. Unfortunately, this made many drivers including floppy without any bit set in disk->events and ->async_events in which case disk_add_events() simply skipped allocation of disk->ev, which disables whole event handling. As ->check_events() is still used during open processing for revalidation, this can lead to open failure. This patch always allocates disk->ev if ->check_events is implemented. In the long term, it would make sense to simply include the event structure inline into genhd as it's now used by virtually all block devices. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Ondrej Zary <linux@rainbow-software.org> Reported-by: Alex Villacis Lasso <avillaci@ceibo.fiec.espol.edu.ec> Cc: stable@kernel.org Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-05-26 21:06:50 +02:00
Namhyung Kim	1547010e6e	cfq-iosched: free cic_index if cfqd allocation fails When struct cfq_data allocation fails, cic_index need to be freed. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-05-24 10:23:22 +02:00
Namhyung Kim	20359f27e8	cfq-iosched: remove unused 'group_changed' in cfq_service_tree_add() The 'group_changed' variable is initialized to 0 and never changed, so checking the variable is meaningless. It is a leftover from `0bbfeb8320` ("cfq-iosched: Always provide group iosolation."). Let's get rid of it. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Cc: Justin TerAvest <teravest@google.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-05-24 10:23:22 +02:00
Namhyung Kim	229836bd63	cfq-iosched: reduce bit operations in cfq_choose_req() Reduce the number of bit operations in cfq_choose_req() on average (and worst) cases. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-05-24 10:23:21 +02:00
Namhyung Kim	b9f8ce0599	cfq-iosched: algebraic simplification in cfq_prio_to_maxrq() Simplify the calculation in cfq_prio_to_maxrq(), plus replace CFQ_PRIO_LISTS to IOPRIO_BE_NR since they are the same and IOPRIO_BE_NR looks more reasonable in this context IMHO. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-05-24 10:23:21 +02:00
Vivek Goyal	4cbadbd16e	blk-cgroup: Initialize ioc->cgroup_changed at ioc creation time If we don't explicitly initialize it to zero, CFQ might think that cgroup of ioc has changed and it generates lots of unnecessary calls to call_for_each_cic(changed_cgroup). Fix it. cfq_get_io_context() cfq_ioc_set_cgroup() call_for_each_cic(ioc, changed_cgroup) Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-05-23 19:35:04 +02:00
Vivek Goyal	95cf3dd9db	block: call elv_bio_merged() when merged Commit `73c1010119` ("block: initial patch for on-stack per-task plugging") removed calls to elv_bio_merged() when @bio merged with @req. Re-add them. This in turn will update merged stats in associated group. That should be safe as long as request has got reference to the blkio_group. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Cc: Divyesh Shah <dpshah@google.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-05-23 10:02:19 +02:00
Vivek Goyal	317389a773	cfq-iosched: Make IO merge related stats per cpu Make BLKIO_STAT_MERGED per cpu hence gettring rid of need of taking blkg->stats_lock. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-05-23 10:02:19 +02:00
Vivek Goyal	2abae55f5a	cfq-iosched: Fix a memory leak of per cpu stats for root group We allocated per cpu stats struct for root group but did not free it. Fix it. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-05-23 10:02:19 +02:00
Jens Axboe	771949d03b	block: get rid of on-stack plugging debug checks We don't need them anymore, so kill: - REQ_ON_PLUG checks in various places - !rq_mergeable() check in plug merging Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-05-20 20:52:16 +02:00
Jens Axboe	0eb8e88572	Merge branch 'for-linus' into for-2.6.40/core This patch merges in a fix that missed 2.6.39 final. Conflicts: block/blk.h	2011-05-20 20:36:16 +02:00
Vivek Goyal	af75cd3c67	blk-throttle: Make no throttling rule group processing lockless Currently we take a queue lock on each bio to check if there are any throttling rules associated with the group and also update the stats. Now access the group under rcu and update the stats without taking the queue lock. Queue lock is taken only if there are throttling rules associated with the group. So the common case of root group when there are no rules, save unnecessary pounding of request queue lock. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-05-20 20:34:53 +02:00
Vivek Goyal	f0bdc8cdd9	blk-cgroup: Make cgroup stat reset path blkg->lock free for dispatch stats Now dispatch stats update is lock free. But reset of these stats still takes blkg->stats_lock and is dependent on that. As stats are per cpu, we should be able to just reset the stats on each cpu without any locks. (Atleast for 64bit arch). On 32bit arch there is a small race where 64bit updates are not atomic. The result of this race can be that in the presence of other writers, one might not get 0 value after reset of a stat and might see something intermediate One can write more complicated code to cover this race like sending IPI to other cpus to reset stats and for offline cpus, reset these directly. Right not I am not taking that path because reset_update is more of a debug feature and it can happen only on 32bit arch and possibility of it happening is small. Will fix it if it becomes a real problem. For the time being going for code simplicity. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-05-20 20:34:53 +02:00
Vivek Goyal	575969a0dd	blk-cgroup: Make 64bit per cpu stats safe on 32bit arch Some of the stats are 64bit and updation will be non atomic on 32bit architecture. Use sequence counters on 32bit arch to make reading of stats safe. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-05-20 20:34:53 +02:00
Vivek Goyal	5624a4e445	blk-throttle: Make dispatch stats per cpu Currently we take blkg_stat lock for even updating the stats. So even if a group has no throttling rules (common case for root group), we end up taking blkg_lock, for updating the stats. Make dispatch stats per cpu so that these can be updated without taking blkg lock. If cpu goes offline, these stats simply disappear. No protection has been provided for that yet. Do we really need anything for that? Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-05-20 20:34:52 +02:00
Vivek Goyal	4843c69d49	blk-throttle: Free up a group only after one rcu grace period Soon we will allow accessing a throtl_grp under rcu_read_lock(). Hence start freeing up throtl_grp after one rcu grace period. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-05-20 20:34:52 +02:00
Vivek Goyal	5617cbef77	blk-throttle: Use helper function to add root throtl group to lists Use same helper function for root group as we use with dynamically allocated groups to add it to various lists. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-05-20 20:34:52 +02:00
Vivek Goyal	269f541555	blk-throttle: Introduce a helper function to fill in device details A helper function for the code which is used at 2-3 places. Makes reading code little easier. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-05-20 20:34:52 +02:00
Vivek Goyal	29b125892f	blk-throttle: Dynamically allocate root group Currently, we allocate root throtl_grp statically. But as we will be introducing per cpu stat pointers and that will be allocated dynamically even for root group, we might as well make whole root throtl_grp allocation dynamic and treat it in same manner as other groups. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-05-20 20:34:52 +02:00
Vivek Goyal	f469a7b4d5	blk-cgroup: Allow sleeping while dynamically allocating a group Currently, all the cfq_group or throtl_group allocations happen while we are holding ->queue_lock and sleeping is not allowed. Soon, we will move to per cpu stats and also need to allocate the per group stats. As one can not call alloc_percpu() from atomic context as it can sleep, we need to drop ->queue_lock, allocate the group, retake the lock and continue processing. In throttling code, I check the queue DEAD flag again to make sure that driver did not call blk_cleanup_queue() in the mean time. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-05-20 20:34:52 +02:00
Vivek Goyal	56edf7d75d	cfq-iosched: Fix a possible race with cfq cgroup removal code blkg->key = cfqd is an rcu protected pointer and hence we used to do call_rcu(cfqd->rcu_head) to free up cfqd after one rcu grace period. The problem here is that even though cfqd is around, there are no gurantees that associated request queue (td->queue) or q->queue_lock is still around. A driver might have called blk_cleanup_queue() and release the lock. It might happen that after freeing up the lock we call blkg->key->queue->queue_ock and crash. This is possible in following path. blkiocg_destroy() blkio_unlink_group_fn() cfq_unlink_blkio_group() Hence, wait for an rcu peirod if there are groups which have not been unlinked from blkcg->blkg_list. That way, if there are any groups which are taking cfq_unlink_blkio_group() path, can safely take queue lock. This is how we have taken care of race in throttling logic also. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-05-20 20:34:52 +02:00
Vivek Goyal	3e59cf9d66	cfq-iosched: Get rid of redundant function parameter "create" Nobody seems to be using cfq_find_alloc_cfqg() function parameter "create". Get rid of that. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-05-20 20:34:52 +02:00
Vivek Goyal	a23e686955	blk-cgroup: move some fields of unaccounted_time file under right config option cgroup unaccounted_time file is created only if CONFIG_DEBUG_BLK_CGROUP=y. there are some fields which are out side this config option. Fix that. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-05-20 20:34:52 +02:00
Vivek Goyal	a29a171e7c	blk-throttle: Do the new group initialization with the help of a function Group initialization code seems to be at two places. root group initialization in blk_throtl_init() and dynamically allocated group in throtl_find_alloc_tg(). Create a common function and use at both the places. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-05-20 20:34:51 +02:00
Jens Axboe	698567f3fa	Merge commit 'v2.6.39' into for-2.6.40/core Since for-2.6.40/core was forked off the 2.6.39 devel tree, we've had churn in the core area that makes it difficult to handle patches for eg cfq or blk-throttle. Instead of requiring that they be based in older versions with bugs that have been fixed later in the rc cycle, merge in 2.6.39 final. Also fixes up conflicts in the below files. Conflicts: drivers/block/paride/pcd.c drivers/cdrom/viocd.c drivers/ide/ide-cd.c Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-05-20 20:33:15 +02:00
James Bottomley	0a58e077eb	block: add proper state guards to __elv_next_request blk_cleanup_queue() calls elevator_exit() and after this, we can't touch the elevator without oopsing. __elv_next_request() must check for this state because in the refcounted queue model, we can still call it after blk_cleanup_queue() has been called. This was reported as causing an oops attributable to scsi. Signed-off-by: James Bottomley <James.Bottomley@suse.de> Cc: stable@kernel.org Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-05-18 19:30:32 +02:00
Shaohua Li	3ec717b7ca	block: don't delay blk_run_queue_async Let's check a scenario: 1. blk_delay_queue(q, SCSI_QUEUE_DELAY); 2. blk_run_queue_async(); the second one will became a noop, because q->delay_work already has WORK_STRUCT_PENDING_BIT set, so the delayed work will still run after SCSI_QUEUE_DELAY. But blk_run_queue_async actually hopes the delayed work runs immediately. Fix this by doing a cancel on potentially pending delayed work before queuing an immediate run of the workqueue. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-05-18 12:24:03 +02:00
Martin K. Petersen	a934a00a69	block: Fix discard topology stacking and reporting In some cases we would end up stacking discard_zeroes_data incorrectly. Fix this by enabling the feature by default for stacking drivers and clearing it for low-level drivers. Incorporating a device that does not support dzd will then cause the feature to be disabled in the stacking driver. Also ensure that the maximum discard value does not overflow when exported in sysfs and return 0 in the alignment and dzd fields for devices that don't support discard. Reported-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@kernel.org Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-05-18 10:37:35 +02:00
Vivek Goyal	70087dc38c	blk-throttle: Use task_subsys_state() to determine a task's blkio_cgroup Currentlly we first map the task to cgroup and then cgroup to blkio_cgroup. There is a more direct way to get to blkio_cgroup from task using task_subsys_state(). Use that. The real reason for the fix is that it also avoids a race in generic cgroup code. During remount/umount rebind_subsystems() is called and it can do following with and rcu protection. cgrp->subsys[i] = NULL; That means if somebody got hold of cgroup under rcu and then it tried to do cgroup->subsys[] to get to blkio_cgroup, it would get NULL which is wrong. I was running into this race condition with ltp running on a upstream derived kernel and that lead to crash. So ideally we should also fix cgroup generic code to wait for rcu grace period before setting pointer to NULL. Li Zefan is not very keen on introducing synchronize_wait() as he thinks it will slow down moun/remount/umount operations. So for the time being atleast fix the kernel crash by taking a more direct route to blkio_cgroup. One tester had reported a crash while running LTP on a derived kernel and with this fix crash is no more seen while the test has been running for over 6 days. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-05-16 15:24:08 +02:00
Lukas Czerner	8af1954d17	blkdev: Do not return -EOPNOTSUPP if discard is supported Currently we return -EOPNOTSUPP in blkdev_issue_discard() if any of the bio fails due to underlying device not supporting discard request. However, if the device is for example dm device composed of devices which some of them support discard and some of them does not, it is ok for some bios to fail with EOPNOTSUPP, but it does not mean that discard is not supported at all. This commit removes the check for bios failed with EOPNOTSUPP and change blkdev_issue_discard() to return operation not supported if and only if the device does not actually supports it, not just part of the device as some bios might indicate. This change also fixes problem with BLKDISCARD ioctl() which now works correctly on such dm devices. Signed-off-by: Lukas Czerner <lczerner@redhat.com> CC: Jens Axboe <jaxboe@fusionio.com> CC: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-05-06 19:30:01 -06:00
Lukas Czerner	5baebe5c86	blkdev: Simple cleanup in blkdev_issue_zeroout() In blkdev_issue_zeroout() we are submitting regular WRITE bios, so we do not need to check for -EOPNOTSUPP specifically in case of error. Also there is no need to have label submit: because there is no way to jump out from the while cycle without an error and we really want to exit, rather than try again. And also remove the check for (sz == 0) since at that point sz can never be zero. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> CC: Dmitry Monakhov <dmonakhov@openvz.org> CC: Jens Axboe <jaxboe@fusionio.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-05-06 19:26:28 -06:00
Lukas Czerner	5dba3089ed	blkdev: Submit discard bio in batches in blkdev_issue_discard() Currently we are waiting for every submitted REQ_DISCARD bio separately, but it can have unwanted consequences of repeatedly flushing the queue, so we rather submit bios in batches and wait for the entire batch, hence narrowing the window of other ios going in. Use bio_batch_end_io() and struct bio_batch for that purpose, the same is used by blkdev_issue_zeroout(). Also change bio_batch_end_io() so we always set !BIO_UPTODATE in the case of error and remove the check for bb, since we are the only user of this function and we always set this. Remove bio_get()/bio_put() from the blkdev_issue_discard() since bio_alloc() and bio_batch_end_io() is doing the same thing, hence it is not needed anymore. I have done simple dd testing with surprising results. The script I have used is: for i in $(seq 10); do echo $i dd if=/dev/sdb1 of=/dev/sdc1 bs=4k & sleep 5 done /usr/bin/time -f %e ./blkdiscard /dev/sdc1 Running time of BLKDISCARD on the whole device: with patch without patch 0.95 15.58 So we can see that in this artificial test the kernel with the patch applied is approx 16x faster in discarding the device. Signed-off-by: Lukas Czerner <lczerner@redhat.com> CC: Dmitry Monakhov <dmonakhov@openvz.org> CC: Jens Axboe <jaxboe@fusionio.com> CC: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-05-06 19:26:27 -06:00
shaohua.li@intel.com	3ac0cc4508	block: hold queue if flush is running for non-queueable flush drive In some drives, flush requests are non-queueable. When flush request is running, normal read/write requests can't run. If block layer dispatches such request, driver can't handle it and requeue it. Tejun suggested we can hold the queue when flush is running. This can avoid unnecessary requeue. Also this can improve performance. For example, we have request flush1, write1, flush 2. flush1 is dispatched, then queue is hold, write1 isn't inserted to queue. After flush1 is finished, flush2 will be dispatched. Since disk cache is already clean, flush2 will be finished very soon, so looks like flush2 is folded to flush1. In my test, the queue holding completely solves a regression introduced by commit `53d63e6b0d`: block: make the flush insertion use the tail of the dispatch list It's not a preempt type request, in fact we have to insert it behind requests that do specify INSERT_FRONT. which causes about 20% regression running a sysbench fileio workload. Stable: 2.6.39 only Cc: stable@kernel.org Signed-off-by: Shaohua Li <shaohua.li@intel.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-05-06 11:36:25 -06:00
shaohua.li@intel.com	f387693095	block: add a non-queueable flush flag flush request isn't queueable in some drives. Add a flag to let driver notify block layer about this. We can optimize flush performance with the knowledge. Stable: 2.6.39 only Cc: stable@kernel.org Signed-off-by: Shaohua Li <shaohua.li@intel.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-05-06 11:36:25 -06:00
Kees Cook	490b94be02	iosched: remove redundant sprintf After the anticipatory scheduler was dropped, there was no need to special-case the request_module string. As such, drop the redundant sprintf and stack variable. Signed-off-by: Kees Cook <kees.cook@canonical.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-05-05 18:02:12 -06:00
Tao Ma	addd0a09fc	block: Remove 'plug/unplug' comment in blk_execute_rq_nowait unplug is replaced with blk_run_queue now in blk_execute_rq_nowait, so change the comment accordingly. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-05-05 15:10:05 -06:00
Tejun Heo	7c88a168da	block: don't propagate unlisted DISK_EVENTs to userland DISK_EVENT_MEDIA_CHANGE is used for both userland visible event and internal event for revalidation of removeable devices. Some legacy drivers don't implement proper event detection and continuously generate events under certain circumstances. For example, ide-cd generates media changed continuously if there's no media in the drive, which can lead to infinite loop of events jumping back and forth between the driver and userland event handler. This patch updates disk event infrastructure such that it never propagates events not listed in disk->events to userland. Those events are processed the same for internal purposes but uevent generation is suppressed. This also ensures that userland only gets events which are advertised in the @events sysfs node lowering risk of confusion. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-04-21 19:43:58 +02:00
Jens Axboe	3aa72873ff	elevator: check for ELEVATOR_INSERT_SORT_MERGE in !elvpriv case too The sort insert is the one that goes to the IO scheduler. With the SORT_MERGE addition, we could bypass IO scheduler setup but still ask the IO scheduler to insert the request. This would cause an oops on switching IO schedulers through the sysfs interface, unless the disk just happened to be idle while it occured. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-04-21 19:28:35 +02:00
Tao Ma	60735b6362	block: Remove the extra check in queue_requests_store In queue_requests_store, the code looks like if (rl->count[BLK_RW_SYNC] >= q->nr_requests) { blk_set_queue_full(q, BLK_RW_SYNC); } else if (rl->count[BLK_RW_SYNC]+1 <= q->nr_requests) { blk_clear_queue_full(q, BLK_RW_SYNC); wake_up(&rl->wait[BLK_RW_SYNC]); } If we don't satify the situation of "if", we can get that rl->count[BLK_RW_SYNC} < q->nr_quests. It is the same as rl->count[BLK_RW_SYNC]+1 <= q->nr_requests. All the "else" should satisfy the "else if" check so it isn't needed actually. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-04-19 13:51:53 +02:00
Liu Yuan	ed5302d3c2	block, blk-sysfs: Fix an err return path in blk_register_queue() We do not call blk_trace_remove_sysfs() in err return path if kobject_add() fails. This path fixes it. Cc: stable@kernel.org Signed-off-by: Liu Yuan <tailai.ly@taobao.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-04-19 13:51:53 +02:00
Jens Axboe	d350e6b6e8	block: remove stale kerneldoc member from __blk_run_queue() We don't pass in a 'force_kblockd' anymore, get rid of the stsale comment. Reported-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-04-19 13:34:14 +02:00
Jens Axboe	c21e6beba8	block: get rid of QUEUE_FLAG_REENTER We are currently using this flag to check whether it's safe to call into ->request_fn(). If it is set, we punt to kblockd. But we get a lot of false positives and excessive punts to kblockd, which hurts performance. The only real abuser of this infrastructure is SCSI. So export the async queue run and convert SCSI over to use that. There's room for improvement in that SCSI need not always use the async call, but this fixes our performance issue and they can fix that up in due time. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-04-19 13:32:46 +02:00
Jens Axboe	5f45c69589	cfq-iosched: read_lock() does not always imply rcu_read_lock() For some configurations of CONFIG_PREEMPT that is not true. So get rid of __call_for_each_cic() and always uses the explicitly rcu_read_lock() protected call_for_each_cic() instead. This fixes a potential bug related to IO scheduler removal or online switching. Thanks to Paul McKenney for clarifying this. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-04-19 09:10:35 +02:00
Jens Axboe	bd900d4580	block: kill blk_flush_plug_list() export With all drivers and file systems converted, we only have in-core use of this function. So remove the export. Reporteed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-04-18 22:06:57 +02:00
Christoph Hellwig	24ecfbe27f	block: add blk_run_queue_async Instead of overloading __blk_run_queue to force an offload to kblockd add a new blk_run_queue_async helper to do it explicitly. I've kept the blk_queue_stopped check for now, but I suspect it's not needed as the check we do when the workqueue items runs should be enough. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-04-18 11:41:33 +02:00
Jens Axboe	4521cc4ed5	block: blk_delay_queue() should use kblockd workqueue Reported-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-04-18 11:36:39 +02:00
Jens Axboe	99e22598e9	block: drop queue lock before calling __blk_run_queue() for kblockd punt If we know we are going to punt to kblockd, we can drop the queue lock before calling into __blk_run_queue() since it only does a safe bit test and a workqueue call. Since kblockd needs to grab this very lock as one of the first things it does, it's a good optimization to drop the lock before waking kblockd. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-04-18 09:59:55 +02:00
Jens Axboe	b4cb290e0a	Revert "block: add callback function for unplug notification" MD can't use this since it really requires us to be able to keep more than a single piece of state for the unplug. Commit `048c9374` added the required support for MD, so get rid of this now unused code. This reverts commit `f75664570d`. Conflicts: block/blk-core.c Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-04-18 09:54:05 +02:00
NeilBrown	048c9374a7	block: Enhance new plugging support to support general callbacks md/raid requires an unplug callback, but as it does not uses requests the current code cannot provide one. So allow arbitrary callbacks to be attached to the blk_plug. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-04-18 09:52:22 +02:00
Jens Axboe	49cac01e1f	block: make unplug timer trace event correspond to the schedule() unplug It's a pretty close match to what we had before - the timer triggering would mean that nobody unplugged the plug in due time, in the new scheme this matches very closely what the schedule() unplug now is. It's essentially the difference between an explicit unplug (IO unplug) or an implicit unplug (timer unplug, we scheduled with pending IO queued). Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-04-16 13:51:05 +02:00
Jens Axboe	f6603783f9	block: only force kblockd unplugging from the schedule() path For the explicit unplugging, we'd prefer to kick things off immediately and not pay the penalty of the latency to switch to kblockd. So let blk_finish_plug() do the run inline, while the implicit-on-schedule-out unplug will punt to kblockd. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-04-15 15:49:07 +02:00
Christoph Hellwig	88b996cd06	block: cleanup the block plug helper functions It's a bit of a mess currently. task->plug is being cleared and reset in __blk_finish_plug(), and blk_finish_plug() is testing for a NULL plug which cannot happen even from schedule() anymore since it uses blk_needs_flush_plug() to determine whether to call into this function at all. So get rid of some of the cruft. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-04-15 15:20:10 +02:00
Liu Yuan	80656b67b3	block, blk-sysfs: Use the variable directly instead of a function call In the function blk_register_queue(), var _dev_ is already assigned by disk_to_dev().So use it directly instead of calling disk_to_dev() again. Signed-off-by: Liu Yuan <tailai.ly@taobao.com> Modified by me to delete an empty line in the same function while in there anyway. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-04-13 22:14:54 +02:00
Jens Axboe	f4af3c3d07	block: move queue run on unplug to kblockd There are worries that we are now consuming a lot more stack in some cases, since we potentially call into IO dispatch from schedule() or io_schedule(). We can reduce this problem by moving the running of the queue to kblockd, like the old plugging scheme did as well. This may or may not be a good idea from a performance perspective, depending on how many tasks have queue plugs running at the same time. For even the slightly contended case, doing just a single queue run from kblockd instead of multiple runs directly from the unpluggers will be faster. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-04-12 14:58:51 +02:00
Jens Axboe	cf82c79839	block: kill queue_sync_plugs() The original use for this dates back to when we had to track write requests for serializing around barriers. That's not needed anymore, so kill it. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-04-12 10:30:53 +02:00
Jens Axboe	dc6d36c971	block: readd plug trace event This was removed with the queue plug state. But we can easily readd by checking if this is the first request going to this queue. It's good information to have when tracing to see how effective the plugging is. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-04-12 10:28:28 +02:00
Jens Axboe	f75664570d	block: add callback function for unplug notification MD would like to know when a queue is unplugged, so it can flush it's bitmap writes. Add such a callback. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-04-12 10:17:31 +02:00
Jens Axboe	188112722c	block: add comment on why we save and disable interrupts in flush_plug_list() It's done at the top to avoid doing it for every queue we unplug. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-04-12 10:12:29 +02:00
Jens Axboe	94b5eb28b4	block: fixup block IO unplug trace call It was removed with the on-stack plugging, readd it and track the depth of requests added when flushing the plug. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-04-12 10:12:19 +02:00
NeilBrown	109b81296c	block: splice plug list to local context If the request_fn ends up blocking, we could be re-entering the plug flush. Since the list is protected by explicitly not allowing schedule events, this isn't a terribly good idea. Additionally, it can cause us to recurse. As request_fn called by __blk_run_queue is allowed to 'schedule()' (after dropping the queue lock of course), it is possible to get a recursive call: schedule -> blk_flush_plug -> __blk_finish_plug -> flush_plug_list -> __blk_run_queue -> request_fn -> schedule We must make sure that the second schedule does not call into blk_flush_plug again. So instead of leaving the list of requests on blk_plug->list, move them to a separate list leaving blk_plug->list empty. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-04-11 14:13:10 +02:00
Linus Torvalds	42933bac11	Merge branch 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6 * 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6: Fix common misspellings	2011-04-07 11:14:49 -07:00
Konstantin Khlebnikov	f83e826181	block: fix request sorting at unplug Comparison function for list_sort() must be anticommutative, otherwise it is not sorting in ordinary meaning. But fortunately list_sort() always check ((*cmp)(priv, a, b) <= 0) it not distinguish negative and zero, so comparison function can implement only less-or-equal instead of full three-way comparison. Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-04-05 23:52:49 +02:00
Mike Snitzer	a63a5cf84d	dm: improve block integrity support The current block integrity (DIF/DIX) support in DM is verifying that all devices' integrity profiles match during DM device resume (which is past the point of no return). To some degree that is unavoidable (stacked DM devices force this late checking). But for most DM devices (which aren't stacking on other DM devices) the ideal time to verify all integrity profiles match is during table load. Introduce the notion of an "initialized" integrity profile: a profile that was blk_integrity_register()'d with a non-NULL 'blk_integrity' template. Add blk_integrity_is_initialized() to allow checking if a profile was initialized. Update DM integrity support to: - check all devices with _initialized_ integrity profiles match during table load; uninitialized profiles (e.g. for underlying DM device(s) of a stacked DM device) are ignored. - disallow a table load that would result in an integrity profile that conflicts with a DM device's existing (in-use) integrity profile - avoid clearing an existing integrity profile - validate all integrity profiles match during resume; but if they don't all we can do is report the mismatch (during resume we're past the point of no return) Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-04-05 23:52:43 +02:00
Andreas Schwab	6f03793770	blk-throttle: don't call xchg on bool xchg does not work portably with smaller than 32bit types. Signed-off-by: Andreas Schwab <schwab@linux-m68k.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-04-05 23:51:37 +02:00
Jens Axboe	53d63e6b0d	block: make the flush insertion use the tail of the dispatch list It's not a preempt type request, in fact we have to insert it behind requests that do specify INSERT_FRONT. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-04-05 23:51:37 +02:00
Jens Axboe	b710a48055	block: get rid of elv_insert() interface Merge it with __elv_add_request(), it's pretty pointless to have a function with only two callers. The main interface is elv_add_request()/__elv_add_request(). Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-04-05 23:51:37 +02:00
Jens Axboe	8182924bc5	block: dump request state on seeing a corrupted request completion Currently we just dump a non-informative 'request botched' message. Lets actually try and print something sane to help debug issues around this. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-04-05 23:51:37 +02:00
Lucas De Marchi	25985edced	Fix common misspellings Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>	2011-03-31 11:26:23 -03:00
Jens Axboe	ad3d9d7ede	block: fix issue with calling blk_stop_queue() from the request_fn handler When the queue work handler was converted to delayed work, the stopping was inadvertently made sync as well. Change this back to being async stop, using __cancel_delayed_work() instead of cancel_delayed_work(). Reported-by: Jeremy Fitzhardinge <jeremy@goop.org> Reported-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-25 17:04:08 +01:00
Jens Axboe	401a18e92c	block: fix bug with inserting flush requests as sort/merge With the introduction of the on-stack plugging, we would assume that any request being inserted was a normal file system request. As flush/fua requires a special insert mode, this caused problems. Fix this up by checking for this in flush_plug_list() and use the appropriate insert mechanism. Big thanks goes to Markus Tripplesdorf for tirelessly testing patches, and to Sergey Senozhatsky for helping find the real issue. Reported-by: Markus Tripplesdorf <markus@trippelsdorf.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-25 17:04:08 +01:00
Linus Torvalds	6c51038900	Merge branch 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block * 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits) Documentation/iostats.txt: bit-size reference etc. cfq-iosched: removing unnecessary think time checking cfq-iosched: Don't clear queue stats when preempt. blk-throttle: Reset group slice when limits are changed blk-cgroup: Only give unaccounted_time under debug cfq-iosched: Don't set active queue in preempt block: fix non-atomic access to genhd inflight structures block: attempt to merge with existing requests on plug flush block: NULL dereference on error path in __blkdev_get() cfq-iosched: Don't update group weights when on service tree fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away block: Require subsystems to explicitly allocate bio_set integrity mempool jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging fs: make fsync_buffers_list() plug mm: make generic_writepages() use plugging blk-cgroup: Add unaccounted time to timeslice_used. block: fixup plugging stubs for !CONFIG_BLOCK block: remove obsolete comments for blkdev_issue_zeroout. blktrace: Use rq->cmd_flags directly in blk_add_trace_rq. ... Fix up conflicts in fs/{aio.c,super.c}	2011-03-24 10:16:26 -07:00
Li, Shaohua	c4ade94fc0	cfq-iosched: removing unnecessary think time checking Removing think time checking. A high thinktime queue might means the queue dispatches several requests and then do away. Limitting such queue seems meaningless. And also this can simplify code. This is suggested by Vivek. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-23 08:30:34 +01:00
Justin TerAvest	62a37f6bad	cfq-iosched: Don't clear queue stats when preempt. For v2, I added back lines to cfq_preempt_queue() that were removed during updates for accounting unaccounted_time. Thanks for pointing out that I'd missed these, Vivek. Previous commit "cfq-iosched: Don't set active queue in preempt" wrongly cleared stats for preempting queues when it shouldn't have, because when we choose a queue to preempt, it still isn't necessarily scheduled next. Thanks to Vivek Goyal for figuring this out and understanding how the preemption code works. Signed-off-by: Justin TerAvest <teravest@google.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-23 08:25:44 +01:00
Vivek Goyal	04521db04e	blk-throttle: Reset group slice when limits are changed Lina reported that if throttle limits are initially very high and then dropped, then no new bio might be dispatched for a long time. And the reason being that after dropping the limits we don't reset the existing slice and do the rate calculation with new low rate and account the bios dispatched at high rate. To fix it, reset the slice upon rate change. https://lkml.org/lkml/2011/3/10/298 Another problem with very high limit is that we never queued the bio on throtl service tree. That means we kept on extending the group slice but never trimmed it. Fix that also by regulary trimming the slice even if bio is not being queued up. Reported-by: Lina Lu <lulina_nuaa@foxmail.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-22 21:55:00 +01:00
Justin TerAvest	9026e521c0	blk-cgroup: Only give unaccounted_time under debug This change moves unaccounted_time to only be reported when CONFIG_DEBUG_BLK_CGROUP is true. Signed-off-by: Justin TerAvest <teravest@google.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-22 21:26:54 +01:00
Justin TerAvest	eda5e0c91f	cfq-iosched: Don't set active queue in preempt Commit "Add unaccounted time to timeslice_used" changed the behavior of cfq_preempt_queue to set cfqq active. Vivek pointed out that other preemption rules might get involved, so we shouldn't manually set which queue is active. This cleans up the code to just clear the queue stats at preemption time. Signed-off-by: Justin TerAvest <teravest@google.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-22 21:26:49 +01:00
Jens Axboe	5e84ea3a9c	block: attempt to merge with existing requests on plug flush One of the disadvantages of on-stack plugging is that we potentially lose out on merging since all pending IO isn't always visible to everybody. When we flush the on-stack plugs, right now we don't do any checks to see if potential merge candidates could be utilized. Correct this by adding a new insert variant, ELEVATOR_INSERT_SORT_MERGE. It works just ELEVATOR_INSERT_SORT, but first checks whether we can merge with an existing request before doing the insertion (if we fail merging). This fixes a regression with multiple processes issuing IO that can be merged. Thanks to Shaohua Li <shaohua.li@intel.com> for testing and fixing an accounting bug. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-21 10:14:27 +01:00
Linus Torvalds	c55d267de2	Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (170 commits) [SCSI] scsi_dh_rdac: Add MD36xxf into device list [SCSI] scsi_debug: add consecutive medium errors [SCSI] libsas: fix ata list corruption issue [SCSI] hpsa: export resettable host attribute [SCSI] hpsa: move device attributes to avoid forward declarations [SCSI] scsi_debug: Logical Block Provisioning (SBC3r26) [SCSI] sd: Logical Block Provisioning update [SCSI] Include protection operation in SCSI command trace [SCSI] hpsa: fix incorrect PCI IDs and add two new ones (2nd try) [SCSI] target: Fix volume size misreporting for volumes > 2TB [SCSI] bnx2fc: Broadcom FCoE offload driver [SCSI] fcoe: fix broken fcoe interface reset [SCSI] fcoe: precedence bug in fcoe_filter_frames() [SCSI] libfcoe: Remove stale fcoe-netdev entries [SCSI] libfcoe: Move FCOE_MTU definition from fcoe.h to libfcoe.h [SCSI] libfc: introduce __fc_fill_fc_hdr that accepts fc_hdr as an argument [SCSI] fcoe, libfc: initialize EM anchors list and then update npiv EMs [SCSI] Revert "[SCSI] libfc: fix exchange being deleted when the abort itself is timed out" [SCSI] libfc: Fixing a memory leak when destroying an interface [SCSI] megaraid_sas: Version and Changelog update ... Fix up trivial conflicts due to whitespace differences in drivers/scsi/libsas/{sas_ata.c,sas_scsi_host.c}	2011-03-17 17:54:40 -07:00
Justin TerAvest	8184f93ece	cfq-iosched: Don't update group weights when on service tree Version 3 is updated to apply to for-2.6.39/core. For version 2, I took Vivek's advice and made sure we update the group weight from cfq_group_service_tree_add(). If a weight was updated while a group is on the service tree, the calculation for the total weight of the service tree can be adjusted improperly, which either leads to bad service tree weights, or potentially crashes (if total_weight becomes 0). This patch defers updates to the weight until a group is off the service tree. Signed-off-by: Justin TerAvest <teravest@google.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-17 16:12:36 +01:00
Justin TerAvest	167400d340	blk-cgroup: Add unaccounted time to timeslice_used. There are two kind of times that tasks are not charged for: the first seek and the extra time slice used over the allocated timeslice. Both of these exported as a new unaccounted_time stat. I think it would be good to have this reported in 'time' as well, but that is probably a separate discussion. Signed-off-by: Justin TerAvest <teravest@google.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-12 16:54:00 +01:00
Tao Ma	eba2ed9c96	block: remove obsolete comments for blkdev_issue_zeroout. barrier is already removed, so remove the obsolete comments in blkdev_issue_zeroout. Cc: Jens Axboe <jaxboe@fusionio.com> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-11 20:13:54 +01:00
Lukas Czerner	0aeea18964	block: fix mis-synchronisation in blkdev_issue_zeroout() BZ29402 https://bugzilla.kernel.org/show_bug.cgi?id=29402 We can hit serious mis-synchronization in bio completion path of blkdev_issue_zeroout() leading to a panic. The problem is that when we are going to wait_for_completion() in blkdev_issue_zeroout() we check if the bb.done equals issued (number of submitted bios). If it does, we can skip the wait_for_completition() and just out of the function since there is nothing to wait for. However, there is a ordering problem because bio_batch_end_io() is calling atomic_inc(&bb->done) before complete(), hence it might seem to blkdev_issue_zeroout() that all bios has been completed and exit. At this point when bio_batch_end_io() is going to call complete(bb->wait), bb and wait does not longer exist since it was allocated on stack in blkdev_issue_zeroout() ==> panic! (thread 1) (thread 2) bio_batch_end_io() blkdev_issue_zeroout() if(bb) { ... if (bb->end_io) ... bb->end_io(bio, err); ... atomic_inc(&bb->done); ... ... while (issued != atomic_read(&bb.done)) ... (let issued == bb.done) ... (do the rest of the function) ... return ret; complete(bb->wait); ^^^^^^^^ panic We can fix this easily by simplifying bio_batch and completion counting. Also remove bio_end_io_t *end_io since it is not used. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reported-by: Eric Whitney <eric.whitney@hp.com> Tested-by: Eric Whitney <eric.whitney@hp.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> CC: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-11 15:36:08 +01:00
Jens Axboe	4c63f5646e	Merge branch 'for-2.6.39/stack-plug' into for-2.6.39/core Conflicts: block/blk-core.c block/blk-flush.c drivers/md/raid1.c drivers/md/raid10.c drivers/md/raid5.c fs/nilfs2/btnode.c fs/nilfs2/mdt.c Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-10 08:58:35 +01:00
Vivek Goyal	69d60eb96a	blk-throttle: Use blk_plug in throttle dispatch Use plug in throttle dispatch also as we are dispatching a bunch of bios in throttle context and some of them might merge. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-10 08:52:27 +01:00
Jens Axboe	721a9602e6	block: kill off REQ_UNPLUG With the plugging now being explicitly controlled by the submitter, callers need not pass down unplugging hints to the block layer. If they want to unplug, it's because they manually plugged on their own - in which case, they should just unplug at will. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-10 08:52:27 +01:00
Jens Axboe	7eaceaccab	block: remove per-queue plugging Code has been converted over to the new explicit on-stack plugging, and delay users have been converted to use the new API for that. So lets kill off the old plugging along with aops->sync_page(). Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-10 08:52:07 +01:00
Jens Axboe	73c1010119	block: initial patch for on-stack per-task plugging This patch adds support for creating a queuing context outside of the queue itself. This enables us to batch up pieces of IO before grabbing the block device queue lock and submitting them to the IO scheduler. The context is created on the stack of the process and assigned in the task structure, so that we can auto-unplug it if we hit a schedule event. The current queue plugging happens implicitly if IO is submitted to an empty device, yet callers have to remember to unplug that IO when they are going to wait for it. This is an ugly API and has caused bugs in the past. Additionally, it requires hacks in the vm (->sync_page() callback) to handle that logic. By switching to an explicit plugging scheme we make the API a lot nicer and can get rid of the ->sync_page() hack in the vm. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-10 08:45:54 +01:00
Jens Axboe	3cca6dc1c8	block: add API for delaying work/request_fn a little bit Currently we use plugging for that, but as plugging is going away, we need an alternative mechanism. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-10 08:45:54 +01:00
Tejun Heo	facc31ddc3	block: Don't implicitly trigger event check on disk_unblock_events() Currently, disk_unblock_events() implicitly kick event check if the block count reaches zero. This behavior is not described in the comment and hinders with future changes. Make the unblocker explicitly check events by calling disk_check_events() as necessary. This patch doesn't cause any behavior difference. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Kay Sievers <kay.sievers@vrfy.org>	2011-03-09 19:54:27 +01:00
Justin TerAvest	df457f845e	blk-cgroup: Lower minimum weight from 100 to 10. We've found that we still get good, useful isolation at weights this low. I'd like to adjust the minimum so that any other changes can take these values into account. Signed-off-by: Justin TerAvest <teravest@google.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-08 19:45:00 +01:00
Vivek Goyal	de701c74a3	blk-throttle: Some cleanups and race fixes in limit update code When throttle group limits are updated through cgroups, a thread is woken up to process these updates. While reviewing that code, oleg noted couple of race conditions existed in the code and he also suggested that code can be simplified. This patch fixes the races simplifies the code based on Oleg's suggestions: - Use xchg(). - Introduced a common function throtl_update_blkio_group_common() which is shared now by all iops/bps update functions. Reviewed-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Fixed a merge issue, throtl_schedule_delayed_work() takes throtl_data as the argument now, not the queue. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-07 21:09:32 +01:00
Vivek Goyal	231d704b4a	blk-throttle: process limit change only through one function With the help of cgroup interface one can go and upate the bps/iops limits of existing group. Once the limits are udpated, a thread is woken up to see if some blocked group needs recalculation based on new limits and needs to be requeued. There was also a piece of code where I was checking for group limit update when a fresh bio comes in. This patch gets rid of that piece of code and keeps processing the limit change at one place throtl_process_limit_change(). It just keeps the code simple and easy to understand. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-07 21:05:14 +01:00
Jens Axboe	b873c5d692	Merge branch 'block-for-2.6.39-core' of ssh://master.kernel.org/pub/scm/linux/kernel/git/tj/misc into for-2.6.39/core	2011-03-07 09:40:21 +01:00
Gui Jianfeng	a60327107b	cfq-iosched: Fix update_vdisktime logic The update_vdisktime logic is broken since commit `b54ce60eb7`, st->min_vdisktime never makes a progress. Fix it. Thanks Vivek for pointing it out. Signed-off-by: Gui Jianfeng <guijianfen@cn.fujitsu.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-07 09:28:09 +01:00
Shaohua Li	ef8a41df8c	cfq-iosched: give busy sync queue no dispatch limit If there are a sync and an async queue and the sync queue's think time is small, we can ignore the sync queue's dispatch quantum. Because the sync queue will always preempt the async queue, we don't need to care about async's latency. This can fix a performance regression of aiostress test, which is introduced by commit `f8ae6e3eb8`. The issue should exist even without the commit, but the commit amplifies the impact. The initial post does the same optimization for RT queue too, but since I have no real workload for it, Vivek suggests to drop it. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Reviewed-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-07 09:26:29 +01:00
Jens Axboe	93803e0140	cfq-iosched: fix race in cfq_set_request() We need to hold the queue lock over the reference increment, it's not atomic anymore. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-07 08:59:06 +01:00
Tejun Heo	e83a46bbb1	Merge branch 'for-linus' of ../linux-2.6-block into block-for-2.6.39/core This merge creates two set of conflicts. One is simple context conflicts caused by removal of throtl_scheduled_delayed_work() in for-linus and removal of throtl_shutdown_timer_wq() in for-2.6.39/core. The other is caused by commit `255bb490c8` (block: blk-flush shouldn't call directly into q->request_fn() __blk_run_queue()) in for-linus crashing with FLUSH reimplementation in for-2.6.39/core. The conflict isn't trivial but the resolution is straight-forward. * __blk_run_queue() calls in flush_end_io() and flush_data_end_io() should be called with @force_kblockd set to %true. * elv_insert() in blk_kick_flush() should use %ELEVATOR_INSERT_REQUEUE. Both changes are to avoid invoking ->request_fn() directly from request completion path and closely match the changes in the commit `255bb490c8`. Signed-off-by: Tejun Heo <tj@kernel.org>	2011-03-04 19:09:02 +01:00
Vivek Goyal	da52777000	block: Move blk_throtl_exit() call to blk_cleanup_queue() Move blk_throtl_exit() in blk_cleanup_queue() as blk_throtl_exit() is written in such a way that it needs queue lock. In blk_release_queue() there is no gurantee that ->queue_lock is still around. Initially blk_throtl_exit() was in blk_cleanup_queue() but Ingo reported one problem. https://lkml.org/lkml/2010/10/23/86 And a quick fix moved blk_throtl_exit() to blk_release_queue(). commit `7ad58c0286` Author: Jens Axboe <jaxboe@fusionio.com> Date: Sat Oct 23 20:40:26 2010 +0200 block: fix use-after-free bug in blk throttle code This patch reverts above change and does not try to shutdown the throtl work in blk_sync_queue(). By avoiding call to throtl_shutdown_timer_wq() from blk_sync_queue(), we should also avoid the problem reported by Ingo. blk_sync_queue() seems to be used only by md driver and it seems to be using it to make sure q->unplug_fn is not called as md registers its own unplug functions and it is about to free up the data structures used by unplug_fn(). Block throttle does not call back into unplug_fn() or into md. So there is no need to cancel blk throttle work. In fact I think cancelling block throttle work is bad because it might happen that some bios are throttled and scheduled to be dispatched later with the help of pending work and if work is cancelled, these bios might never be dispatched. Block layer also uses blk_sync_queue() during blk_cleanup_queue() and blk_release_queue() time. That should be safe as we are also calling blk_throtl_exit() which should make sure all the throttling related data structures are cleaned up. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-02 19:06:49 -05:00
Vivek Goyal	c94a96ac93	block: Initialize ->queue_lock to internal lock at queue allocation time There does not seem to be a clear convention whether q->queue_lock is initialized or not when blk_cleanup_queue() is called. In the past it was not necessary but now blk_throtl_exit() takes up queue lock by default and needs queue lock to be available. In fact elevator_exit() code also has similar requirement just that it is less stringent in the sense that elevator_exit() is called only if elevator is initialized. Two problems have been noticed because of ambiguity about spin lock status. - If a driver calls blk_alloc_queue() and then soon calls blk_cleanup_queue() almost immediately, (because some other driver structure allocation failed or some other error happened) then blk_throtl_exit() will run into issues as queue lock is not initialized. Loop driver ran into this issue recently and I noticed error paths in md driver too. Similar error paths should exist in other drivers too. - If some driver provided external spin lock and zapped the lock before blk_cleanup_queue(), then it can lead to issues. So this patch initializes the default queue lock at queue allocation time. block throttling code is one of the users of queue lock and it is initialized at the queue allocation time, so it makes sense to initialize ->queue_lock also to internal lock. A driver can overide that lock later. This will take care of the issue where a driver does not have to worry about initializing the queue lock to default before calling blk_cleanup_queue() Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-02 19:06:49 -05:00
Liu Yuan	53f22956ef	block/genhd: Change some numerals into macros Rename the numerals in the diskstats_show() into the macros. Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Liu Yuan <tailai.ly@taobao.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-02 11:00:15 -05:00
Tejun Heo	255bb490c8	block: blk-flush shouldn't call directly into q->request_fn() __blk_run_queue() blk-flush decomposes a flush into sequence of multiple requests. On completion of a request, the next one is queued; however, block layer must not implicitly call into q->request_fn() directly from completion path. This makes the queue behave unexpectedly when seen from the drivers and violates the assumption that q->request_fn() is called with process context + queue_lock. This patch makes blk-flush the following two changes to make sure q->request_fn() is not called directly from request completion path. - blk_flush_complete_seq_end_io() now asks __blk_run_queue() to always use kblockd instead of calling directly into q->request_fn(). - queue_next_fseq() uses ELEVATOR_INSERT_REQUEUE instead of ELEVATOR_INSERT_FRONT so that elv_insert() doesn't try to unplug the request queue directly. Reported by Jan in the following threads. http://thread.gmane.org/gmane.linux.ide/48778 http://thread.gmane.org/gmane.linux.ide/48786 stable: applicable to v2.6.37. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jan Beulich <JBeulich@novell.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: stable@kernel.org Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-02 08:48:06 -05:00
Tejun Heo	1654e7411a	block: add @force_kblockd to __blk_run_queue() __blk_run_queue() automatically either calls q->request_fn() directly or schedules kblockd depending on whether the function is recursed. blk-flush implementation needs to be able to explicitly choose kblockd. Add @force_kblockd. All the current users are converted to specify %false for the parameter and this patch doesn't introduce any behavior change. stable: This is prerequisite for fixing ide oops caused by the new blk-flush implementation. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jan Beulich <JBeulich@novell.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: stable@kernel.org Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-02 08:48:05 -05:00
Justin TerAvest	0bbfeb8320	cfq-iosched: Always provide group isolation. Effectively, make group_isolation=1 the default and remove the tunable. The setting group_isolation=0 was because by default we idle on sync-noidle tree and on fast devices, this can be very harmful for throughput. However, this problem can also be addressed by tuning slice_idle and possibly group_idle on faster storage devices. This change simplifies the CFQ code by removing the feature entirely. Signed-off-by: Justin TerAvest <teravest@google.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-01 15:05:08 -05:00
Jens Axboe	6fae9c2513	Merge commit 'v2.6.38-rc6' into for-2.6.39/core Conflicts: block/cfq-iosched.c Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-01 15:04:39 -05:00
Ben Hutchings	291d24f6d9	block: fix kernel-doc format for blkdev_issue_zeroout Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-01 13:45:24 -05:00
Vivek Goyal	450adcbe51	blk-throttle: Do not use kblockd workqueue for throtl work o Dominik Klein reported a system hang issue while doing some blkio throttling testing. https://lkml.org/lkml/2011/2/24/173 o Some tracing revealed that CFQ was not dispatching any more jobs as queue unplug was not happening. And queue unplug was not happening because unplug work was not being called as there was one throttling work on same cpu which as not finished yet. And throttling work had not finished as it was tyring to dispatch a bio to CFQ but all the request descriptors were consume to it was put to sleep. o So basically it is a cyclic dependecny between CFQ unplug work and throtl dispatch work. Tejun suggested that use separate workqueue for such cases. o This patch uses a separate workqueue for throttle related work and does not rely on kblockd workqueue anymore. Cc: stable@kernel.org Reported-by: Dominik Klein <dk@in-telegence.net> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-01 13:41:53 -05:00
Linus Torvalds	638691a7a4	Merge branch 'for-linus' of git://neil.brown.name/md * 'for-linus' of git://neil.brown.name/md: md: Fix - again - partition detection when array becomes active Fix over-zealous flush_disk when changing device size. md: avoid spinlock problem in blk_throtl_exit md: correctly handle probe of an 'mdp' device. md: don't set_capacity before array is active. md: Fix raid1->raid0 takeover	2011-02-25 11:13:26 -08:00
Miklos Szeredi	3c522cedb5	block: fix refcounting in BLKBSZSET Adam Kovari and others reported that disconnecting an USB drive with an ntfs-3g filesystem would cause "kernel BUG at fs/inode.c:1421!" to be triggered. The BUG could be traced back to ioctl(BLKBSZSET), which would erroneously decrement the refcount on the bdev. This is because blkdev_get() expects the refcount to be already incremented and either returns success or decrements the refcount and returns an error. The bug was introduced by `e525fd89` (block: make blkdev_get/put() handle exclusive access), which didn't take into account this behavior of blkdev_get(). This fixes https://bugzilla.kernel.org/show_bug.cgi?id=29202 (and likely 29792 too) Reported-by: Adam Kovari <kovariadam@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-02-24 08:54:21 -08:00
NeilBrown	93b270f76e	Fix over-zealous flush_disk when changing device size. There are two cases when we call flush_disk. In one, the device has disappeared (check_disk_change) so any data will hold becomes irrelevant. In the oter, the device has changed size (check_disk_size_change) so data we hold may be irrelevant. In both cases it makes sense to discard any 'clean' buffers, so they will be read back from the device if needed. In the former case it makes sense to discard 'dirty' buffers as there will never be anywhere safe to write the data. In the second case it doesnot* make sense to discard dirty buffers as that will lead to file system corruption when you simply enlarge the containing devices. flush_disk calls __invalidate_devices. __invalidate_device calls both invalidate_inodes and invalidate_bdev. invalidate_inodes does discard I_DIRTY inodes and this does lead to fs corruption. invalidate_bev doesnot* discard dirty pages, but I don't really care about that at present. So this patch adds a flag to __invalidate_device (calling it __invalidate_device2) to indicate whether dirty buffers should be killed, and this is passed to invalidate_inodes which can choose to skip dirty inodes. flusk_disk then passes true from check_disk_change and false from check_disk_size_change. dm avoids tripping over this problem by calling i_size_write directly rathher than using check_disk_size_change. md does use check_disk_size_change and so is affected. This regression was introduced by commit `608aeef17a` which causes check_disk_size_change to call flush_disk, so it is suitable for any kernel since 2.6.27. Cc: stable@kernel.org Acked-by: Jeff Moyer <jmoyer@redhat.com> Cc: Andrew Patterson <andrew.patterson@hp.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: NeilBrown <neilb@suse.de>	2011-02-24 17:25:47 +11:00
Hannes Reinecke	79775567e0	[SCSI] block: improve detail in I/O error messages Classify severity of I/O errors for target, nexus, and transport errors. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Hannes Reinecke <hare@suse.de> Acked-by: Jens Axboe <jaxboe@fusionio.com> Signed-off-by: James Bottomley <James.Bottomley@suse.de>	2011-02-12 10:33:37 -06:00
Mike Snitzer	c186794dbb	block: share request flush fields with elevator_private Flush requests are never put on the IO scheduler. Convert request structure's elevator_private* into an array and have the flush fields share a union with it. Reclaim the space lost in 'struct request' by moving 'completion_data' back in the union with 'rb_node'. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-02-11 11:08:00 +01:00
Mike Snitzer	9d5a4e946c	block: skip elevator data initialization for flush requests Skip elevator initialization for flush requests by passing priv=0 to blk_alloc_request() in get_request(). As such elv_set_request() is never called for flush requests. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-02-11 11:05:46 +01:00
Linus Torvalds	aceb91cd35	Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block * 'for-linus' of git://git.kernel.dk/linux-2.6-block: cdrom: support devices that have check_events but not media_changed cfq-iosched: Don't wait if queue already has requests. blkio-throttle: Avoid calling blkiocg_lookup_group() for root group cfq: rename a function to give it more appropriate name cciss: make cciss_revalidate not loop through CISS_MAX_LUNS volumes unnecessarily. drivers/block/aoe/Makefile: replace the use of <module>-objs with <module>-y loop: queue_lock NULL pointer derefence in blk_throtl_exit drivers/block/Makefile: replace the use of <module>-objs with <module>-y blktrace: Don't output messages if NOTIFY isn't set.	2011-02-09 11:45:21 -08:00
Justin TerAvest	02a8f01b5a	cfq-iosched: Don't wait if queue already has requests. Commit `7667aa0630` added logic to wait for the last queue of the group to become busy (have at least one request), so that the group does not lose out for not being continuously backlogged. The commit did not check for the condition that the last queue already has some requests. As a result, if the queue already has requests, wait_busy is set. Later on, cfq_select_queue() checks the flag, and decides that since the queue has a request now and wait_busy is set, the queue is expired. This results in early expiration of the queue. This patch fixes the problem by adding a check to see if queue already has requests. If it does, wait_busy is not set. As a result, time slices do not expire early. The queues with more than one request are usually buffered writers. Testing shows improvement in isolation between buffered writers. Cc: stable@kernel.org Signed-off-by: Justin TerAvest <teravest@google.com> Reviewed-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-02-09 14:22:36 +01:00
Tejun Heo	ae1b153962	block: reimplement FLUSH/FUA to support merge The current FLUSH/FUA support has evolved from the implementation which had to perform queue draining. As such, sequencing is done queue-wide one flush request after another. However, with the draining requirement gone, there's no reason to keep the queue-wide sequential approach. This patch reimplements FLUSH/FUA support such that each FLUSH/FUA request is sequenced individually. The actual FLUSH execution is double buffered and whenever a request wants to execute one for either PRE or POSTFLUSH, it queues on the pending queue. Once certain conditions are met, a flush request is issued and on its completion all pending requests proceed to the next sequence. This allows arbitrary merging of different type of flushes. How they are merged can be primarily controlled and tuned by adjusting the above said 'conditions' used to determine when to issue the next flush. This is inspired by Darrick's patches to merge multiple zero-data flushes which helps workloads with highly concurrent fsync requests. * As flush requests are never put on the IO scheduler, request fields used for flush share space with rq->rb_node. rq->completion_data is moved out of the union. This increases the request size by one pointer. As rq->elevator_private* are used only by the iosched too, it is possible to reduce the request size further. However, to do that, we need to modify request allocation path such that iosched data is not allocated for flush requests. * FLUSH/FUA processing happens on insertion now instead of dispatch. - Comments updated as per Vivek and Mike. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: "Darrick J. Wong" <djwong@us.ibm.com> Cc: Shaohua Li <shli@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-01-25 12:43:54 +01:00
Tejun Heo	143a87f4c9	block: improve flush bio completion bio's for flush are completed twice - once during the data phase and one more time after the whole sequence is complete. The first completion shouldn't notify completion to the issuer. This was achieved by skipping all bio completion steps in req_bio_endio() for the first completion; however, this has two drawbacks. * Error is not recorded in bio and must be tracked somewhere else. * Partial completion is not supported. Both don't cause problems for the current users; however, they make further improvements difficult. Change req_bio_endio() such that it only skips the actual notification part for the first completion. bio completion is implemented with partial completions on mind anyway so this is as simple as moving the REQ_FLUSH_SEQ conditional such that only calling of bio_endio() is skipped. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-01-25 12:43:52 +01:00
Tejun Heo	414b4ff5ee	block: add REQ_FLUSH_SEQ rq == &q->flush_rq was used to determine whether a rq is part of a flush sequence, which worked because all requests in a flush sequence were sequenced using the single dedicated request. This is about to change, so introduce REQ_FLUSH_SEQ flag to distinguish flush sequence requests. This patch doesn't cause any behavior change. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-01-25 12:43:49 +01:00
David Rientjes	6a108a14fa	kconfig: rename CONFIG_EMBEDDED to CONFIG_EXPERT The meaning of CONFIG_EMBEDDED has long since been obsoleted; the option is used to configure any non-standard kernel with a much larger scope than only small devices. This patch renames the option to CONFIG_EXPERT in init/Kconfig and fixes references to the option throughout the kernel. A new CONFIG_EMBEDDED option is added that automatically selects CONFIG_EXPERT when enabled and can be used in the future to isolate options that should only be considered for embedded systems (RISC architectures, SLOB, etc). Calling the option "EXPERT" more accurately represents its intention: only expert users who understand the impact of the configuration changes they are making should enable it. Reviewed-by: Ingo Molnar <mingo@elte.hu> Acked-by: David Woodhouse <david.woodhouse@intel.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Greg KH <gregkh@suse.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jens Axboe <axboe@kernel.dk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Robin Holt <holt@sgi.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-20 17:02:05 -08:00
Vivek Goyal	be2c6b1990	blkio-throttle: Avoid calling blkiocg_lookup_group() for root group o Jeff Moyer was doing some testing on a RAM backed disk and blkiocg_lookup_group() showed up high overhead after memcpy(). Similarly somebody else reported that blkiocg_lookup_group() is eating 6% extra cpu. Though looking at the code I can't think why the overhead of this function is so high. One thing is that it is called with very high frequency (once for every IO). o For lot of folks blkio controller will be compiled in but they might not have actually created cgroups. Hence optimize the case of root cgroup where we can avoid calling blkiocg_lookup_group() if IO is happening in root group (common case). Reported-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-01-19 08:25:02 -07:00
Vivek Goyal	ba5bd520f6	cfq: rename a function to give it more appropriate name o Rename a function to give it more approprate name. We are calculating cfq queue slice and function name gives the impression as if cfq group slice length is being calculated. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-01-19 08:25:02 -07:00
Shaohua Li	c553f8e335	block cfq: compensate preempted queue even if it has no slice assigned If a queue is preempted before it gets slice assigned, the queue doesn't get compensation, which looks unfair. For such queue, we compensate it for a whole slice. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-01-14 08:41:03 +01:00
Shaohua Li	f8ae6e3eb8	block cfq: make queue preempt work for queues from different workload I got this: fio-874 [007] 2157.724514: 8,32 m N cfq874 preempt fio-874 [007] 2157.724519: 8,32 m N cfq830 slice expired t=1 fio-874 [007] 2157.724520: 8,32 m N cfq830 sl_used=1 disp=0 charge=1 iops=0 sect=0 fio-874 [007] 2157.724521: 8,32 m N cfq830 set_active wl_prio:0 wl_type:0 fio-874 [007] 2157.724522: 8,32 m N cfq830 Not idling. st->count:1 cfq830 is an async queue, and preempted by a sync queue cfq874. But since we have cfqg->saved_workload_slice mechanism, the preempt is a nop. Looks currently our preempt is totally broken if the two queues are not from the same workload type. Below patch fixes it. This will might make async queue starvation, but it's what our old code does before cgroup is added. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-01-14 08:41:02 +01:00
Linus Torvalds	275220f0fc	Merge branch 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block * 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block: (43 commits) block: ensure that completion error gets properly traced blktrace: add missing probe argument to block_bio_complete block cfq: don't use atomic_t for cfq_group block cfq: don't use atomic_t for cfq_queue block: trace event block fix unassigned field block: add internal hd part table references block: fix accounting bug on cross partition merges kref: add kref_test_and_get bio-integrity: mark kintegrityd_wq highpri and CPU intensive block: make kblockd_workqueue smarter Revert "sd: implement sd_check_events()" block: Clean up exit_io_context() source code. Fix compile warnings due to missing removal of a 'ret' variable fs/block: type signature of major_to_index(int) to major_to_index(unsigned) block: convert !IS_ERR(p) && p to !IS_ERR_NOR_NULL(p) cfq-iosched: don't check cfqg in choose_service_tree() fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors cdrom: export cdrom_check_events() sd: implement sd_check_events() sr: implement sr_check_events() ...	2011-01-13 10:45:01 -08:00
Jens Axboe	81c5e2ae33	Merge branch 'for-2.6.38/event-handling' into for-2.6.38/core	2011-01-13 14:47:54 +01:00
Shaohua Li	329a67815b	block cfq: don't use atomic_t for cfq_group cfq_group->ref is used with queue_lock hold, the only exception is cfq_set_request, which looks like a bug to me, so ref doesn't need to be an atomic and atomic operation is slower. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-01-07 08:48:28 +01:00
Shaohua Li	30d7b9448f	block cfq: don't use atomic_t for cfq_queue cfq_queue->ref is used with queue_lock hold, so ref doesn't need to be an atomic and atomic operation is slower. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-01-07 08:46:59 +01:00
Jens Axboe	6c23a9681c	block: add internal hd part table references We can't use krefs since it's apparently restricted to very basic reference counting. This reverts commit `e4a683c8`. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-01-07 08:43:37 +01:00
Jerome Marchand	09e099d4ba	block: fix accounting bug on cross partition merges /proc/diskstats would display a strange output as follows. $ cat /proc/diskstats \|grep sda 8 0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089 8 1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691 ~~~~~~~~~~ 8 2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390 8 3 sda3 54 487 2188 92 0 0 0 0 0 88 92 8 4 sda4 4 0 8 0 0 0 0 0 0 0 0 8 5 sda5 81 2027 2130 138 0 0 0 0 0 87 137 Its reason is the wrong way of accounting hd_struct->in_flight. When a bio is merged into a request belongs to different partition by ELEVATOR_FRONT_MERGE. The detailed root cause is as follows. Assuming that there are two partition, sda1 and sda2. 1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight is 0 and sda2's one is 1. \| hd_struct->in_flight --------------------------- sda1 \| 0 sda2 \| 1 --------------------------- 2. A bio belongs to sda1 is issued and is merged into the request mentioned on step1 by ELEVATOR_BACK_MERGE. The first sector of the request is changed from sda2 region to sda1 region. However the two partition's hd_struct->in_flight are not changed. \| hd_struct->in_flight --------------------------- sda1 \| 0 sda2 \| 1 --------------------------- 3. The request is finished and blk_account_io_done() is called. In this case, sda2's hd_struct->in_flight, not a sda1's one, is decremented. \| hd_struct->in_flight --------------------------- sda1 \| -1 sda2 \| 1 --------------------------- The patch fixes the problem by caching the partition lookup inside the request structure, hence making sure that the increment and decrement will always happen on the same partition struct. This also speeds up IO with accounting enabled, since it cuts down on the number of lookups we have to do. Also add a refcount to struct hd_struct to keep the partition in memory as long as users exist. We use kref_test_and_get() to ensure we don't add a reference to a partition which is going away. Signed-off-by: Jerome Marchand <jmarchan@redhat.com> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: stable@kernel.org Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-01-05 16:57:38 +01:00
Tejun Heo	89b90be2d8	block: make kblockd_workqueue smarter kblockd is used for unplugging and may affect IO latency and throughput and the max number of concurrent work items are bound by the number of block devices. Make it HIGHPRI workqueue w/ default max concurrency. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-01-03 15:01:47 +01:00
Jiri Kosina	4b7bd36470	Merge branch 'master' into for-next Conflicts: MAINTAINERS arch/arm/mach-omap2/pm24xx.c drivers/scsi/bfa/bfa_fcpim.c Needed to update to apply fixes for which the old branch was too outdated.	2010-12-22 18:57:02 +01:00
Bart Van Assche	27667c996f	block: Clean up exit_io_context() source code. This patch fixes a spelling error in a source code comment and removes superfluous braces in the function exit_io_context(). Signed-off-by: Bart Van Assche <bvanassche@acm.org> Cc: Jens Axboe <jaxboe@fusionio.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-12-21 15:07:45 +01:00
Linus Torvalds	7f8635cc9e	Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block * 'for-linus' of git://git.kernel.dk/linux-2.6-block: cciss: fix cciss_revalidate panic block: max hardware sectors limit wrapper block: Deprecate QUEUE_FLAG_CLUSTER and use queue_limits instead blk-throttle: Correct the placement of smp_rmb() blk-throttle: Trim/adjust slice_end once a bio has been dispatched block: check for proper length of iov entries earlier in blk_rq_map_user_iov() drbd: fix for spin_lock_irqsave in endio callback drbd: don't recvmsg with zero length	2010-12-20 09:19:46 -08:00
Yang Zhang	e61eb2e93f	fs/block: type signature of major_to_index(int) to major_to_index(unsigned) The major/minor device numbers are always defined and used as `unsigned'. Signed-off-by: Yang Zhang <kthreadd@gmail.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-12-17 09:00:18 +01:00
Yang Zhang	b9f985b6e0	block: convert !IS_ERR(p) && p to !IS_ERR_NOR_NULL(p) Signed-off-by: Yang Zhang <kthreadd@gmail.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-12-17 08:58:36 +01:00
Gui Jianfeng	7278c9c19b	cfq-iosched: don't check cfqg in choose_service_tree() When cfq_choose_cfqg() is called in select_queue(), there must be at least one backlogged CFQ queue waiting for dispatching, hence there must be at least one backlogged CFQ group on service tree. So we never call choose_service_tree() with cfqg == NULL. Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-12-17 08:57:14 +01:00
Mike Snitzer	72d4cd9f38	block: max hardware sectors limit wrapper Implement blk_limits_max_hw_sectors() and make blk_queue_max_hw_sectors() a wrapper around it. DM needs this to avoid setting queue_limits' max_hw_sectors and max_sectors directly. dm_set_device_limits() now leverages blk_limits_max_hw_sectors() logic to establish the appropriate max_hw_sectors minimum (PAGE_SIZE). Fixes issue where DM was incorrectly setting max_sectors rather than max_hw_sectors (which caused dm_merge_bvec()'s max_hw_sectors check to be ineffective). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@kernel.org Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-12-17 08:36:01 +01:00
Martin K. Petersen	e692cb668f	block: Deprecate QUEUE_FLAG_CLUSTER and use queue_limits instead When stacking devices, a request_queue is not always available. This forced us to have a no_cluster flag in the queue_limits that could be used as a carrier until the request_queue had been set up for a metadevice. There were several problems with that approach. First of all it was up to the stacking device to remember to set queue flag after stacking had completed. Also, the queue flag and the queue limits had to be kept in sync at all times. We got that wrong, which could lead to us issuing commands that went beyond the max scatterlist limit set by the driver. The proper fix is to avoid having two flags for tracking the same thing. We deprecate QUEUE_FLAG_CLUSTER and use the queue limit directly in the block layer merging functions. The queue_limit 'no_cluster' is turned into 'cluster' to avoid double negatives and to ease stacking. Clustering defaults to being enabled as before. The queue flag logic is removed from the stacking function, and explicitly setting the cluster flag is no longer necessary in DM and MD. Reported-by: Ed Lin <ed.lin@promise.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@kernel.org Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-12-17 08:35:53 +01:00
Tejun Heo	77ea887e43	implement in-kernel gendisk events handling Currently, media presence polling for removeable block devices is done from userland. There are several issues with this. * Polling is done by periodically opening the device. For SCSI devices, the command sequence generated by such action involves a few different commands including TEST_UNIT_READY. This behavior, while perfectly legal, is different from Windows which only issues single command, GET_EVENT_STATUS_NOTIFICATION. Unfortunately, some ATAPI devices lock up after being periodically queried such command sequences. * There is no reliable and unintrusive way for a userland program to tell whether the target device is safe for media presence polling. For example, polling for media presence during an on-going burning session can make it fail. The polling program can avoid this by opening the device with O_EXCL but then it risks making a valid exclusive user of the device fail w/ -EBUSY. * Userland polling is unnecessarily heavy and in-kernel implementation is lighter and better coordinated (workqueue, timer slack). This patch implements framework for in-kernel disk event handling, which includes media presence polling. * bdops->check_events() is added, which supercedes ->media_changed(). It should check whether there's any pending event and return if so. Currently, two events are defined - DISK_EVENT_MEDIA_CHANGE and DISK_EVENT_EJECT_REQUEST. ->check_events() is guaranteed not to be called parallelly. * gendisk->events and ->async_events are added. These should be initialized by block driver before passing the device to add_disk(). The former contains the mask of all supported events and the latter the mask of all events which the device can report without polling. /sys/block//events[_async] export these to userland. Kernel parameter block.events_dfl_poll_msecs controls the system polling interval (default is 0 which means disable) and /sys/block//events_poll_msecs control polling intervals for individual devices (default is -1 meaning use system setting). Note that if a device can report all supported events asynchronously and its polling interval isn't explicitly set, the device won't be polled regardless of the system polling interval. If a device is opened exclusively with write access, event checking is automatically disabled until all write exclusive accesses are released. * There are event 'clearing' events. For example, both of currently defined events are cleared after the device has been successfully opened. This information is passed to ->check_events() callback using @clearing argument as a hint. * Event checking is always performed from system_nrt_wq and timer slack is set to 25% for polling. * Nothing changes for drivers which implement ->media_changed() but not ->check_events(). Going forward, all drivers will be converted to ->check_events() and ->media_change() will be dropped. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-12-16 17:53:38 +01:00
Tejun Heo	d2bf1b6723	block: move register_disk() and del_gendisk() to block/genhd.c There's no reason for register_disk() and del_gendisk() to be in fs/partitions/check.c. Move both to genhd.c. While at it, collapse unlink_gendisk(), which was artificially in a separate function due to genhd.c / check.c split, into del_gendisk(). Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-12-16 17:53:38 +01:00
Tejun Heo	dddd9dc340	block: kill genhd_media_change_notify() There's no user of the facility. Kill it. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-12-16 17:53:38 +01:00
Shaohua Li writes	e4ea0c16a8	block cfq: select new workload if priority changed If priority is changed, continuing to check workload_expires and service tree count of the previous workload does not make sense. We should always choose the workload with lowest key of new priority in such case. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-12-13 14:32:22 +01:00
James Smart	c7a841f3ac	[SCSI] bsg: correct fault if queue object removed while dev_t open This patch corrects an issue in bsg that results in a general protection fault if an LLD is removed while an application is using an open file handle to a bsg device, and the application issues an ioctl. The fault occurs because the class_dev is NULL, having been cleared in bsg_unregister_queue() when the driver was removed. With this patch, a check is made for the class_dev, and the application will receive ENXIO if the related object is gone. Signed-off-by: Carl Lajeunesse <carl.lajeunesse@emulex.com> Signed-off-by: James Smart <james.smart@emulex.com> Signed-off-by: James Bottomley <James.Bottomley@suse.de>	2010-12-09 09:41:14 -06:00
Vivek Goyal	04a6b516cd	blk-throttle: Correct the placement of smp_rmb() o I was discussing what are the variable being updated without spin lock and why do we need barriers and Oleg pointed out that location of smp_rmb() should be between read of td->limits_changed and tg->limits_changed. This patch fixes it. o Following is one possible sequence of events. Say cpu0 is executing throtl_update_blkio_group_read_bps() and cpu1 is executing throtl_process_limit_change(). cpu0 cpu1 tg->limits_changed = true; smp_mb__before_atomic_inc(); atomic_inc(&td->limits_changed); if (!atomic_read(&td->limits_changed)) return; if (tg->limits_changed) do_something; If cpu0 has updated tg->limits_changed and td->limits_changed, we want to make sure that if update to td->limits_changed is visible on cpu1, then update to tg->limits_changed should also be visible. Oleg pointed out to ensure that we need to insert an smp_rmb() between td->limits_changed read and tg->limits_changed read. o I had erroneously put smp_rmb() before atomic_read(&td->limits_changed). This patch fixes it. Reported-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-12-01 19:34:52 +01:00
Vivek Goyal	d1ae8ffdfa	blk-throttle: Trim/adjust slice_end once a bio has been dispatched o During some testing I did following and noticed throttling stops working. - Put a very low limit on a cgroup, say 1 byte per second. - Start some reads, this will set slice_end to a very high value. - Change the limit to higher value say 1MB/s - Now IO unthrottles and finishes as expected. - Try to do the read again but IO is not limited to 1MB/s as expected. o What is happening. - Initially low value of limit sets slice_end to a very high value. - During updation of limit, slice_end is not being truncated. - Very high value of slice_end leads to keeping the existing slice valid for a very long time and new slice does not start. - tg_may_dispatch() is called in blk_throtle_bio(), and trim_slice() is not called in this path. So slice_start is some old value and practically we are able to do huge amount of IO. o There are many ways it can be fixed. I have fixed it by trying to adjust/cleanup slice_end in trim_slice(). Generally we extend slices if bio is big and can't be dispatched in one slice. After dispatch of bio, readjust the slice_end to make sure we don't end up with huge values. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-12-01 19:34:46 +01:00
Gui Jianfeng	760701bfe1	cfq-iosched: Get rid of on_st flag It's able to check whether a CFQ group on a service tree by checking "cfqg->rb_node". There's no need to maintain an extra flag here. Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-11-30 20:52:47 +01:00
Gui Jianfeng	b54ce60eb7	cfq-iosched: Get rid of st->active When a cfq group is running, it won't be dequeued from service tree, so there's no need to store the active one in st->active. Just gid rid of it. Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-11-30 20:52:46 +01:00
Xiaotian Feng	5478755616	block: check for proper length of iov entries earlier in blk_rq_map_user_iov() commit `9284bcf` checks for proper length of iov entries in blk_rq_map_user_iov(). But if the map is unaligned, kernel will break out the loop without checking for the proper length. So we need to check the proper length before the unalign check. Signed-off-by: Xiaotian Feng <dfeng@redhat.com> Cc: stable@kernel.org Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-11-29 10:04:50 +01:00
Jens Axboe	f30195c502	Merge branch 'cleanup-bd_claim' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc into for-2.6.38/core	2010-11-27 19:49:18 +01:00
Linus Torvalds	78daa87b1d	Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block * 'for-linus' of git://git.kernel.dk/linux-2.6-block: cciss: fix build for PROC_FS disabled block: fix amiga and atari floppy driver compile warning blk-throttle: Fix calculation of max number of WRITES to be dispatched ioprio: grab rcu_read_lock in sys_ioprio_{set,get}() xen/blkfront: cope with backend that fail empty BLKIF_OP_WRITE_BARRIER requests xen/blkfront: Implement FUA with BLKIF_OP_WRITE_BARRIER xen/blkfront: change blk_shadow.request to proper pointer xen/blkfront: map REQ_FLUSH into a full barrier	2010-11-27 07:17:50 +09:00
Arnd Bergmann	451a3c24b0	BKL: remove extraneous #include <smp_lock.h> The big kernel lock has been removed from all these files at some point, leaving only the #include. Remove this too as a cleanup. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-11-17 08:59:32 -08:00
Mike Snitzer	d07335e51d	block: Rename "block_remap" tracepoint to "block_bio_remap" to clarify the event. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-11-16 12:53:39 +01:00
Jens Axboe	5fbf856392	Merge branch 'for-2.6.38/rc2-holder' into for-2.6.38/core	2010-11-16 10:10:12 +01:00
Jens Axboe	a02056349c	Merge branch 'v2.6.37-rc2' into for-2.6.38/core	2010-11-16 10:09:42 +01:00
Vivek Goyal	bdc85df7a8	blk-cgroup: Allow creation of hierarchical cgroups o Allow hierarchical cgroup creation for blkio controller o Currently we disallow it as both the io controller policies (throttling as well as proportion bandwidth) do not support hierarhical accounting and control. But the flip side is that blkio controller can not be used with libvirt as libvirt creates a cgroup hierarchy deeper than 1 level. <top-level-cgroup-dir>/<controller>/libvirt/qemu/<virtual-machine-groups> o So this patch will allow creation of cgroup hierarhcy but at the backend everything will be treated as flat. So if somebody created a an hierarchy like as follows. root / \ test1 test2 \| test3 CFQ and throttling will practically treat all groups at same level. pivot / \| \ \ root test1 test2 test3 o Once we have actual support for hierarchical accounting and control then we can introduce another cgroup tunable file "blkio.use_hierarchy" which will be 0 by default but if user wants to enforce hierarhical control then it can be set to 1. This way there should not be any ABI problems down the line. o The only not so pretty part is introduction of extra file "use_hierarchy" down the line. Kame-san had mentioned that hierarhical accounting is expensive in memory controller hence they keep it off by default. I suspect same will be the case for IO controller also as for each IO completion we shall have to account IO through hierarchy up to the root. if yes, then it probably is not a very bad idea to introduce this extra file so that it will be used only when somebody needs it and some people might enable hierarchy only in part of the hierarchy. o This is how basically memory controller also uses "use_hierarhcy" and they also allowed creation of hierarchies when actual backend support was not available. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Reviewed-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com> Reviewed-by: Ciju Rajan K <ciju@linux.vnet.ibm.com> Tested-by: Ciju Rajan K <ciju@linux.vnet.ibm.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-11-15 19:37:36 +01:00
Vivek Goyal	c2f6805d47	blk-throttle: Fix calculation of max number of WRITES to be dispatched o Currently we try to dispatch more READS and less WRITES (75%, 25%) in one dispatch round. ummy pointed out that there is a bug in max_nr_writes calculation. This patch fixes it. Reported-by: ummy y <yummylln@yahoo.com.cn> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-11-15 19:32:42 +01:00
Tejun Heo	e525fd89d3	block: make blkdev_get/put() handle exclusive access Over time, block layer has accumulated a set of APIs dealing with bdev open, close, claim and release. * blkdev_get/put() are the primary open and close functions. * bd_claim/release() deal with exclusive open. * open/close_bdev_exclusive() are combination of open and claim and the other way around, respectively. * bd_link/unlink_disk_holder() to create and remove holder/slave symlinks. * open_by_devnum() wraps bdget() + blkdev_get(). The interface is a bit confusing and the decoupling of open and claim makes it impossible to properly guarantee exclusive access as in-kernel open + claim sequence can disturb the existing exclusive open even before the block layer knows the current open if for another exclusive access. Reorganize the interface such that, * blkdev_get() is extended to include exclusive access management. @holder argument is added and, if is @FMODE_EXCL specified, it will gain exclusive access atomically w.r.t. other exclusive accesses. * blkdev_put() is similarly extended. It now takes @mode argument and if @FMODE_EXCL is set, it releases an exclusive access. Also, when the last exclusive claim is released, the holder/slave symlinks are removed automatically. * bd_claim/release() and close_bdev_exclusive() are no longer necessary and either made static or removed. * bd_link_disk_holder() remains the same but bd_unlink_disk_holder() is no longer necessary and removed. * open_bdev_exclusive() becomes a simple wrapper around lookup_bdev() and blkdev_get(). It also has an unexpected extra bdev_read_only() test which probably should be moved into blkdev_get(). * open_by_devnum() is modified to take @holder argument and pass it to blkdev_get(). Most of bdev open/close operations are unified into blkdev_get/put() and most exclusive accesses are tested atomically at the open time (as it should). This cleans up code and removes some, both valid and invalid, but unnecessary all the same, corner cases. open_bdev_exclusive() and open_by_devnum() can use further cleanup - rename to blkdev_get_by_path() and blkdev_get_by_devt() and drop special features. Well, let's leave them for another day. Most conversions are straight-forward. drbd conversion is a bit more involved as there was some reordering, but the logic should stay the same. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Neil Brown <neilb@suse.de> Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Acked-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Philipp Reisner <philipp.reisner@linbit.com> Cc: Peter Osterlund <petero2@telia.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Jan Kara <jack@suse.cz> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <joel.becker@oracle.com> Cc: Alex Elder <aelder@sgi.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: dm-devel@redhat.com Cc: drbd-dev@lists.linbit.com Cc: Leo Chen <leochen@broadcom.com> Cc: Scott Branden <sbranden@broadcom.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Cc: Joern Engel <joern@logfs.org> Cc: reiserfs-devel@vger.kernel.org Cc: Alexander Viro <viro@zeniv.linux.org.uk>	2010-11-13 11:55:17 +01:00
Jens Axboe	cedb4a7d9f	block: remove unused copy_io_context() Reported-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-11-11 13:40:11 +01:00
Christoph Hellwig	02e031cbc8	block: remove REQ_HARDBARRIER REQ_HARDBARRIER is dead now, so remove the leftovers. What's left at this point is: - various checks inside the block layer. - sanity checks in bio based drivers. - now unused bio_empty_barrier helper. - Xen blockfront use of BLKIF_OP_WRITE_BARRIER - it's dead for a while, but Xen really needs to sort out it's barrier situaton. - setting of ordered tags in uas - dead code copied from old scsi drivers. - scsi different retry for barriers - it's dead and should have been removed when flushes were converted to FS requests. - blktrace handling of barriers - removed. Someone who knows blktrace better should add support for REQ_FLUSH and REQ_FUA, though. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-11-10 14:54:09 +01:00
Vasiliy Kulikov	a014741c0a	block: ioctl: fix information leak to userland Structure hd_geometry is copied to userland with 4 padding bytes between cylinders and start fields uninitialized on 64-bit platforms. It leads to leaking of contents of kernel stack memory. Currently there is no memset() in real implementations of getgeo() in drivers/block/, so it makes sense to have memset() in blkdev_ioctl(). Signed-off-by: Vasiliy Kulikov <segooon@gmail.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-11-10 14:40:53 +01:00
Mike Snitzer	77304d2aba	block: read i_size with i_size_read() Convert direct reads of an inode's i_size to using i_size_read(). i_size_{read,write} use a seqcount to protect reads from accessing incomple writes. Concurrent i_size_write()s require mutual exclussion to protect the seqcount that is used by i_size_{read,write}. But i_size_read() callers do not need to use additional locking. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: NeilBrown <neilb@suse.de> Acked-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-11-10 14:40:53 +01:00
Jens Axboe	9f864c8091	block: take care not to overflow when calculating total iov length Reported-by: Dan Rosenberg <drosenberg@vsecurity.com> Cc: stable@kernel.org Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-11-10 14:40:42 +01:00
Jens Axboe	9284bcf4e3	block: check for proper length of iov entries in blk_rq_map_user_iov() Ensure that we pass down properly validated iov segments before calling into the mapping or copy functions. Reported-by: Dan Rosenberg <drosenberg@vsecurity.com> Cc: stable@kernel.org Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-11-10 14:40:42 +01:00
Shaohua Li	2b9408a459	cfq-iosched: don't schedule a dispatch for a non-idle queue Vivek suggests we don't need schedule a dispatch when an idle queue becomes nonidle. And he is right, cfq_should_preempt already covers the logic. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-11-09 14:51:13 +01:00
Shaohua Li	8e1ac66551	cfq-iosched: don't idle if a deep seek queue is slow If a deep seek queue slowly deliver requests but disk is much faster, idle for the queue just wastes disk throughput. If the queue delevers all requests before half its slice is used, the patch disable idle for it. In my test, application delivers 32 requests one time, the disk can accept 128 requests at maxium and disk is fast. without the patch, the throughput is just around 30m/s, while with it, the speed is about 80m/s. The disk is a SSD, but is detected as a rotational disk. I can configure it as SSD, but I thought the deep seek queue logic should be fixed too, for example, considering a fast raid. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-11-08 15:01:04 +01:00
Shaohua Li	d2d59e18a1	cfq-iosched: schedule dispatch for noidle queue A queue is idle at cfq_dispatch_requests(), but it gets noidle later. Unless other task explictly does unplug or all requests are drained, we will not deliever requests to the disk even cfq_arm_slice_timer doesn't make the queue idle. For example, cfq_should_idle() returns true because of service_tree->count == 1, and then other queues are added. Note, I didn't see obvious performance impacts so far with the patch, but just thought this could be a problem. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-11-08 15:01:03 +01:00

... 2 3 4 5 6 ...

1640 Commits