linux

Author	SHA1	Message	Date
Vivek Goyal	1efe8fe1c2	cfq-iosched: Do not idle on async queues Few weeks back, Shaohua Li had posted similar patch. I am reposting it with more test results. This patch does two things. - Do not idle on async queues. - It also changes the write queue depth CFQ drives (cfq_may_dispatch()). Currently, we seem to driving queue depth of 1 always for WRITES. This is true even if there is only one write queue in the system and all the logic of infinite queue depth in case of single busy queue as well as slowly increasing queue depth based on last delayed sync request does not seem to be kicking in at all. This patch will allow deeper WRITE queue depths (subjected to the other WRITE queue depth contstraints like cfq_quantum and last delayed sync request). Shaohua Li had reported getting more out of his SSD. For me, I have got one Lun exported from an HP EVA and when pure buffered writes are on, I can get more out of the system. Following are test results of pure buffered writes (with end_fsync=1) with vanilla and patched kernel. These results are average of 3 sets of run with increasing number of threads. AVERAGE[bufwfs][vanilla] ------- job Set NR ReadBW(KB/s) MaxClat(us) WriteBW(KB/s) MaxClat(us) --- --- -- ------------ ----------- ------------- ----------- bufwfs 3 1 0 0 95349 474141 bufwfs 3 2 0 0 100282 806926 bufwfs 3 4 0 0 109989 2.7301e+06 bufwfs 3 8 0 0 116642 3762231 bufwfs 3 16 0 0 118230 6902970 AVERAGE[bufwfs] [patched kernel] ------- bufwfs 3 1 0 0 270722 404352 bufwfs 3 2 0 0 206770 1.06552e+06 bufwfs 3 4 0 0 195277 1.62283e+06 bufwfs 3 8 0 0 260960 2.62979e+06 bufwfs 3 16 0 0 299260 1.70731e+06 I also ran buffered writes along with some sequential reads and some buffered reads going on in the system on a SATA disk because the potential risk could be that we should not be driving queue depth higher in presence of sync IO going to keep the max clat low. With some random and sequential reads going on in the system on one SATA disk I did not see any significant increase in max clat. So it looks like other WRITE queue depth control logic is doing its job. Here are the results. AVERAGE[brr, bsr, bufw together] [vanilla] ------- job Set NR ReadBW(KB/s) MaxClat(us) WriteBW(KB/s) MaxClat(us) --- --- -- ------------ ----------- ------------- ----------- brr 3 1 850 546345 0 0 bsr 3 1 14650 729543 0 0 bufw 3 1 0 0 23908 8274517 brr 3 2 981.333 579395 0 0 bsr 3 2 14149.7 1175689 0 0 bufw 3 2 0 0 21921 1.28108e+07 brr 3 4 898.333 1.75527e+06 0 0 bsr 3 4 12230.7 1.40072e+06 0 0 bufw 3 4 0 0 19722.3 2.4901e+07 brr 3 8 900 3160594 0 0 bsr 3 8 9282.33 1.91314e+06 0 0 bufw 3 8 0 0 18789.3 23890622 AVERAGE[brr, bsr, bufw mixed] [patched kernel] ------- job Set NR ReadBW(KB/s) MaxClat(us) WriteBW(KB/s) MaxClat(us) --- --- -- ------------ ----------- ------------- ----------- brr 3 1 837 417973 0 0 bsr 3 1 14357.7 591275 0 0 bufw 3 1 0 0 24869.7 8910662 brr 3 2 1038.33 543434 0 0 bsr 3 2 13351.3 1205858 0 0 bufw 3 2 0 0 18626.3 13280370 brr 3 4 913 1.86861e+06 0 0 bsr 3 4 12652.3 1430974 0 0 bufw 3 4 0 0 15343.3 2.81305e+07 brr 3 8 890 2.92695e+06 0 0 bsr 3 8 9635.33 1.90244e+06 0 0 bufw 3 8 0 0 17200.3 24424392 So looks like it might make sense to include this patch. Thanks Vivek Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-02-02 20:46:10 +01:00
Gui Jianfeng	bcf4dd4342	blk-cgroup: Fix potential deadlock in blk-cgroup I triggered a lockdep warning as following. ======================================================= [ INFO: possible circular locking dependency detected ] 2.6.33-rc2 #1 ------------------------------------------------------- test_io_control/7357 is trying to acquire lock: (blkio_list_lock){+.+...}, at: [<c053a990>] blkiocg_weight_write+0x82/0x9e but task is already holding lock: (&(&blkcg->lock)->rlock){......}, at: [<c053a949>] blkiocg_weight_write+0x3b/0x9e which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (&(&blkcg->lock)->rlock){......}: [<c04583b7>] validate_chain+0x8bc/0xb9c [<c0458dba>] __lock_acquire+0x723/0x789 [<c0458eb0>] lock_acquire+0x90/0xa7 [<c0692b0a>] _raw_spin_lock_irqsave+0x27/0x5a [<c053a4e1>] blkiocg_add_blkio_group+0x1a/0x6d [<c053cac7>] cfq_get_queue+0x225/0x3de [<c053eec2>] cfq_set_request+0x217/0x42d [<c052c8a6>] elv_set_request+0x17/0x26 [<c0532a0f>] get_request+0x203/0x2c5 [<c0532ae9>] get_request_wait+0x18/0x10e [<c0533470>] __make_request+0x2ba/0x375 [<c0531985>] generic_make_request+0x28d/0x30f [<c0532da7>] submit_bio+0x8a/0x8f [<c04d827a>] submit_bh+0xf0/0x10f [<c04d91d2>] ll_rw_block+0xc0/0xf9 [<f86e9705>] ext3_find_entry+0x319/0x544 [ext3] [<f86eae58>] ext3_lookup+0x2c/0xb9 [ext3] [<c04c3e1b>] do_lookup+0xd3/0x172 [<c04c56c8>] link_path_walk+0x5fb/0x95c [<c04c5a65>] path_walk+0x3c/0x81 [<c04c5b63>] do_path_lookup+0x21/0x8a [<c04c66cc>] do_filp_open+0xf0/0x978 [<c04c0c7e>] open_exec+0x1b/0xb7 [<c04c1436>] do_execve+0xbb/0x266 [<c04081a9>] sys_execve+0x24/0x4a [<c04028a2>] ptregs_execve+0x12/0x18 -> #1 (&(&q->__queue_lock)->rlock){..-.-.}: [<c04583b7>] validate_chain+0x8bc/0xb9c [<c0458dba>] __lock_acquire+0x723/0x789 [<c0458eb0>] lock_acquire+0x90/0xa7 [<c0692b0a>] _raw_spin_lock_irqsave+0x27/0x5a [<c053dd2a>] cfq_unlink_blkio_group+0x17/0x41 [<c053a6eb>] blkiocg_destroy+0x72/0xc7 [<c0467df0>] cgroup_diput+0x4a/0xb2 [<c04ca473>] dentry_iput+0x93/0xb7 [<c04ca4b3>] d_kill+0x1c/0x36 [<c04cb5c5>] dput+0xf5/0xfe [<c04c6084>] do_rmdir+0x95/0xbe [<c04c60ec>] sys_rmdir+0x10/0x12 [<c04027cc>] sysenter_do_call+0x12/0x32 -> #0 (blkio_list_lock){+.+...}: [<c0458117>] validate_chain+0x61c/0xb9c [<c0458dba>] __lock_acquire+0x723/0x789 [<c0458eb0>] lock_acquire+0x90/0xa7 [<c06929fd>] _raw_spin_lock+0x1e/0x4e [<c053a990>] blkiocg_weight_write+0x82/0x9e [<c0467f1e>] cgroup_file_write+0xc6/0x1c0 [<c04bd2f3>] vfs_write+0x8c/0x116 [<c04bd7c6>] sys_write+0x3b/0x60 [<c04027cc>] sysenter_do_call+0x12/0x32 other info that might help us debug this: 1 lock held by test_io_control/7357: #0: (&(&blkcg->lock)->rlock){......}, at: [<c053a949>] blkiocg_weight_write+0x3b/0x9e stack backtrace: Pid: 7357, comm: test_io_control Not tainted 2.6.33-rc2 #1 Call Trace: [<c045754f>] print_circular_bug+0x91/0x9d [<c0458117>] validate_chain+0x61c/0xb9c [<c0458dba>] __lock_acquire+0x723/0x789 [<c0458eb0>] lock_acquire+0x90/0xa7 [<c053a990>] ? blkiocg_weight_write+0x82/0x9e [<c06929fd>] _raw_spin_lock+0x1e/0x4e [<c053a990>] ? blkiocg_weight_write+0x82/0x9e [<c053a990>] blkiocg_weight_write+0x82/0x9e [<c0467f1e>] cgroup_file_write+0xc6/0x1c0 [<c0454df5>] ? trace_hardirqs_off+0xb/0xd [<c044d93a>] ? cpu_clock+0x2e/0x44 [<c050e6ec>] ? security_file_permission+0xf/0x11 [<c04bcdda>] ? rw_verify_area+0x8a/0xad [<c0467e58>] ? cgroup_file_write+0x0/0x1c0 [<c04bd2f3>] vfs_write+0x8c/0x116 [<c04bd7c6>] sys_write+0x3b/0x60 [<c04027cc>] sysenter_do_call+0x12/0x32 To prevent deadlock, we should take locks as following sequence: blkio_list_lock -> queue_lock -> blkcg_lock. The following patch should fix this bug. Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-02-01 09:58:54 +01:00
Divyesh Shah	875feb63b9	cfq-iosched: Respect ioprio_class when preempting In cfq_should_preempt(), we currently allow some cases where a non-RT request can preempt an ongoing RT cfqq timeslice. This should not happen. Examples include: o A sync_noidle wl type non-RT request pre-empting a sync_noidle wl type cfqq on which we are idling. o Once we have per-cgroup async queues, a non-RT sync request pre-empting a RT async cfqq. Signed-off-by: Divyesh Shah<dpshah@google.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-01-11 16:16:18 +01:00
Kirill Afonshin	ce289321b7	block: removed unused as_io_context It isn't used anymore, since AS was deleted. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-01-11 14:29:20 +01:00
Martin K. Petersen	17be8c2450	block: bdev_stack_limits wrapper DM does not want to know about partition offsets. Add a partition-aware wrapper that DM can use when stacking block devices. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-01-11 14:29:20 +01:00
Martin K. Petersen	dd3d145d49	block: Fix discard alignment calculation and printing Discard alignment reporting for partitions was incorrect. Update to match the algorithm used elsewhere. The alignment can be negative (misaligned). Fix format string accordingly. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-01-11 14:29:19 +01:00
Martin K. Petersen	fe0b393f2c	block: Correct handling of bottom device misaligment The top device misalignment flag would not be set if the added bottom device was already misaligned as opposed to causing a stacking failure. Also massage the reporting so that an error is only returned if adding the bottom device caused the misalignment. I.e. don't return an error if the top is already flagged as misaligned. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-01-11 14:29:19 +01:00
OGAWA Hirofumi	e79e95db5c	block: Honor the gfp_mask for alloc_page() in blkdev_issue_discard() Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-29 08:53:54 +01:00
Martin K. Petersen	81744ee44a	block: Fix incorrect alignment offset reporting and update documentation queue_sector_alignment_offset returned the wrong value which caused partitions to report an incorrect alignment_offset. Since offset alignment calculation is needed several places it has been split into a separate helper function. The topology stacking function has been updated accordingly. Furthermore, comments have been added to clarify how the stacking function works. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Tested-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-29 08:35:35 +01:00
Shaohua Li	2f7a2d89a8	cfq-iosched: don't regard requests with long distance as close seek_mean could be very big sometimes, using it as close criteria is meaningless as this doen't improve any performance. So if it's big, let's fallback to default value. Reviewed-by: Corrado Zoccolo <czoccolo@gmail.com> Signed-off-by: Shaohua Li<shaohua.li@intel.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-28 13:18:44 +01:00
Martin K. Petersen	9504e0864b	block: Fix topology stacking for data and discard alignment The stacking code incorrectly scaled up the data offset in some cases causing misaligned devices to report alignment. Rewrite the stacking algorithm to remedy this and apply the same alignment principles to the discard handling. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-21 15:55:51 +01:00
Vivek Goyal	65b32a573e	cfq-iosched: Remove prio_change logic for workload selection o CFQ now internally divides cfq queues in therr workload categories. sync-idle, sync-noidle and async. Which workload to run depends primarily on rb_key offset across three service trees. Which is a combination of mulitiple things including what time queue got queued on the service tree. There is one exception though. That is if we switched the prio class, say we served some RT tasks and again started serving BE class, then with-in BE class we always started with sync-noidle workload irrespective of rb_key offset in service trees. This can provide better latencies for sync-noidle workload in the presence of RT tasks. o This patch gets rid of that exception and which workload to run with-in class always depends on lowest rb_key across service trees. The reason being that now we have multiple BE class groups and if we always switch to sync-noidle workload with-in group, we can potentially starve a sync-idle workload with-in group. Same is true for async workload which will be in root group. Also the workload-switching with-in group will become very unpredictable as it now depends whether some RT workload was running in the system or not. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com> Acked-by: Corrado Zoccolo <czoccolo@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-18 12:40:21 +01:00
Vivek Goyal	fb104db41e	cfq-iosched: Get rid of nr_groups o Currently code does not seem to be using cfqd->nr_groups. Get rid of it. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-18 12:40:21 +01:00
Vivek Goyal	1db32c4060	cfq-iosched: Remove the check for same cfq group from allow_merge o allow_merge() already checks if submitting task is pointing to same cfqq as rq has been queued in. If everything is fine, we should not be having a task in one cgroup and having a pointer to cfqq in other cgroup. Well I guess in some situations it can happen and that is, when a random IO queue has been moved into root cgroup for group_isolation=0. In this case, tasks's cgroup/group is different from where actually cfqq is, but this is intentional and in this case merging should be allowed. The second situation is where due to close cooperator patches, multiple processes can be sharing a cfqq. If everything implemented right, we should not end up in a situation where tasks from different processes in different groups are sharing the same cfqq as we allow merging of cooperating queues only if they are in same group. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-18 12:40:21 +01:00
Jens Axboe	b568be627a	block: temporarily disable discard granularity Commit `86b3728141` adds a check for misaligned stacking offsets, but it's buggy since the defaults are 0. Hence all dm devices that pass in a non-zero starting offset will be marked as misaligned amd dm will complain. A real fix is coming, in the mean time disable the discard granularity check so that users don't worry about dm reporting about misaligned devices. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-16 09:16:41 +01:00
Linus Torvalds	51b736b851	Merge branch 'for-2.6.33' of git://git.kernel.dk/linux-2.6-block * 'for-2.6.33' of git://git.kernel.dk/linux-2.6-block: cfq: set workload as expired if it doesn't have any slice left Fix a CFQ crash in "for-2.6.33" branch of block tree cfq: Remove wait_request flag when idle time is being deleted cfq-iosched: commenting non-obvious initialization cfq-iosched: Take care of corner cases of group losing share due to deletion cfq-iosched: Get rid of cfqq wait_busy_done flag cfq: Optimization for close cooperating queue searching block,xd: Delay allocation of DMA buffers until device is known drbd: Following the hmac change to SHASH (see linux commit `8bd1209cff`) cfq-iosched: reduce write depth only if sync was delayed	2009-12-15 09:11:28 -08:00
Gui Jianfeng	66ae291978	cfq: set workload as expired if it doesn't have any slice left When a group is resumed, if it doesn't have workload slice left, we should set workload_expires as expired. Otherwise, we might start from where we left in previous group by error. Thanks the idea from Corrado. Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-15 10:08:45 +01:00
Vivek Goyal	82bbbf28db	Fix a CFQ crash in "for-2.6.33" branch of block tree I think my previous patch introduced a bug which can lead to CFQ hitting BUG_ON(). The offending commit in for-2.6.33 branch is. commit `7667aa0630` Author: Vivek Goyal <vgoyal@redhat.com> Date: Tue Dec 8 17:52:58 2009 -0500 cfq-iosched: Take care of corner cases of group losing share due to deletion While doing some stress testing on my box, I enountered following. login: [ 3165.148841] BUG: scheduling while atomic: swapper/0/0x10000100 [ 3165.149821] Modules linked in: cfq_iosched dm_multipath qla2xxx igb scsi_transport_fc dm_snapshot [last unloaded: scsi_wait_scan] [ 3165.149821] Pid: 0, comm: swapper Not tainted 2.6.32-block-for-33-merged-new #3 [ 3165.149821] Call Trace: [ 3165.149821] <IRQ> [<ffffffff8103fab8>] __schedule_bug+0x5c/0x60 [ 3165.149821] [<ffffffff8103afd7>] ? __wake_up+0x44/0x4d [ 3165.149821] [<ffffffff8153a979>] schedule+0xe3/0x7bc [ 3165.149821] [<ffffffff8103a796>] ? cpumask_next+0x1d/0x1f [ 3165.149821] [<ffffffffa000b21d>] ? cfq_dispatch_requests+0x6ba/0x93e [cfq_iosched] [ 3165.149821] [<ffffffff810422d8>] __cond_resched+0x2a/0x35 [ 3165.149821] [<ffffffffa000b21d>] ? cfq_dispatch_requests+0x6ba/0x93e [cfq_iosched] [ 3165.149821] [<ffffffff8153b1ee>] _cond_resched+0x2c/0x37 [ 3165.149821] [<ffffffff8100e2db>] is_valid_bugaddr+0x16/0x2f [ 3165.149821] [<ffffffff811e4161>] report_bug+0x18/0xac [ 3165.149821] [<ffffffff8100f1fc>] die+0x39/0x63 [ 3165.149821] [<ffffffff8153cde1>] do_trap+0x11a/0x129 [ 3165.149821] [<ffffffff8100d470>] do_invalid_op+0x96/0x9f [ 3165.149821] [<ffffffffa000b21d>] ? cfq_dispatch_requests+0x6ba/0x93e [cfq_iosched] [ 3165.149821] [<ffffffff81034b4d>] ? enqueue_task+0x5c/0x67 [ 3165.149821] [<ffffffff8103ae83>] ? task_rq_unlock+0x11/0x13 [ 3165.149821] [<ffffffff81041aae>] ? try_to_wake_up+0x292/0x2a4 [ 3165.149821] [<ffffffff8100c935>] invalid_op+0x15/0x20 [ 3165.149821] [<ffffffffa000b21d>] ? cfq_dispatch_requests+0x6ba/0x93e [cfq_iosched] [ 3165.149821] [<ffffffff810df5a6>] ? virt_to_head_page+0xe/0x2f [ 3165.149821] [<ffffffff811d8c2a>] blk_peek_request+0x191/0x1a7 [ 3165.149821] [<ffffffff811e5b8d>] ? kobject_get+0x1a/0x21 [ 3165.149821] [<ffffffff812c8d4c>] scsi_request_fn+0x82/0x3df [ 3165.149821] [<ffffffff8110b2de>] ? bio_fs_destructor+0x15/0x17 [ 3165.149821] [<ffffffff810df5a6>] ? virt_to_head_page+0xe/0x2f [ 3165.149821] [<ffffffff811d931f>] __blk_run_queue+0x42/0x71 [ 3165.149821] [<ffffffff811d9403>] blk_run_queue+0x26/0x3a [ 3165.149821] [<ffffffff812c8761>] scsi_run_queue+0x2de/0x375 [ 3165.149821] [<ffffffff812b60ac>] ? put_device+0x17/0x19 [ 3165.149821] [<ffffffff812c92d7>] scsi_next_command+0x3b/0x4b [ 3165.149821] [<ffffffff812c9b9f>] scsi_io_completion+0x1c9/0x3f5 [ 3165.149821] [<ffffffff812c3c36>] scsi_finish_command+0xb5/0xbe I think I have hit following BUG_ON() in cfq_dispatch_request(). BUG_ON(RB_EMPTY_ROOT(&cfqq->sort_list)); Please find attached the patch to fix it. I have done some stress testing with it and have not seen it happening again. o We should wait on a queue even after slice expiry only if it is empty. If queue is not empty then continue to expire it. o If we decide to keep the queue then make cfqq=NULL. Otherwise select_queue() will return a valid cfqq and cfq_dispatch_request() can hit following BUG_ON(). BUG_ON(RB_EMPTY_ROOT(&cfqq->sort_list)) Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-10 19:25:41 +01:00
Gui Jianfeng	554554f60a	cfq: Remove wait_request flag when idle time is being deleted Remove wait_request flag when idle time is being deleted, otherwise it'll hit this path every time when a request is enqueued. Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-10 09:38:39 +01:00
Linus Torvalds	4ef58d4e2a	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (42 commits) tree-wide: fix misspelling of "definition" in comments reiserfs: fix misspelling of "journaled" doc: Fix a typo in slub.txt. inotify: remove superfluous return code check hdlc: spelling fix in find_pvc() comment doc: fix regulator docs cut-and-pasteism mtd: Fix comment in Kconfig doc: Fix IRQ chip docs tree-wide: fix assorted typos all over the place drivers/ata/libata-sff.c: comment spelling fixes fix typos/grammos in Documentation/edac.txt sysctl: add missing comments fs/debugfs/inode.c: fix comment typos sgivwfb: Make use of ARRAY_SIZE. sky2: fix sky2_link_down copy/paste comment error tree-wide: fix typos "couter" -> "counter" tree-wide: fix typos "offest" -> "offset" fix kerneldoc for set_irq_msi() spidev: fix double "of of" in comment comment typo fix: sybsystem -> subsystem ...	2009-12-09 19:43:33 -08:00
Corrado Zoccolo	edc71131c4	cfq-iosched: commenting non-obvious initialization Added a comment to explain the initialization of last_delayed_sync. Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-09 20:56:04 +01:00
Vivek Goyal	7667aa0630	cfq-iosched: Take care of corner cases of group losing share due to deletion If there is a sequential reader running in a group, we wait for next request to come in that group after slice expiry and once new request is in, we expire the queue. Otherwise we delete the group from service tree and group looses its fair share. So far I was marking a queue as wait_busy if it had consumed its slice and it was last queue in the group. But this condition did not cover following two cases. 1.If a request completed and slice has not expired yet. Next request comes in and is dispatched to disk. Now select_queue() hits and slice has expired. This group will be deleted. Because request is still in the disk, this queue will never get a chance to wait_busy. 2.If request completed and slice has not expired yet. Before next request comes in (delay due to think time), select_queue() hits and expires the queue hence group. This queue never got a chance to wait busy. Gui was hitting the boundary condition 1 and not getting fairness numbers proportional to weight. This patch puts the checks for above two conditions and improves the fairness numbers for sequential workload on rotational media. Check in select_queue() takes care of case 1 and additional check in should_wait_busy() takes care of case 2. Reported-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-09 15:11:04 +01:00
Vivek Goyal	c244bb50a9	cfq-iosched: Get rid of cfqq wait_busy_done flag o Get rid of wait_busy_done flag. This flag only tells we were doing wait busy on a queue and that queue got request so expire it. That information can easily be obtained by (cfq_cfqq_wait_busy() && queue_is_not_empty). So remove this flag and keep code simple. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-09 15:11:03 +01:00
Gui Jianfeng	b9d8f4c73b	cfq: Optimization for close cooperating queue searching It doesn't make any sense to try to find out a close cooperating queue if current cfqq is the only one in the group. Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-09 15:11:03 +01:00
Corrado Zoccolo	573412b295	cfq-iosched: reduce write depth only if sync was delayed The introduction of ramp-up formula for async queue depths has slowed down dirty page reclaim, by reducing async write performance. This patch makes sure the formula kicks in only when sync request was recently delayed. Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-09 12:32:55 +01:00
Vivek Goyal	878eaddd05	cfq-iosched: Do not access cfqq after freeing it Fix a crash during boot reported by Jeff Moyer. Fix the issue of accessing cfqq after freeing it. Reported-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@carl.(none)>	2009-12-07 19:37:15 +01:00
Stephen Rothwell	accee7854b	block: include linux/err.h to use ERR_PTR Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-07 09:47:07 +01:00
Jens Axboe	bb729bc98c	cfq-iosched: use call_rcu() instead of doing grace period stall on queue exit After the merge of the IO controller patches, booting on my megaraid box ran much slower. Vivek Goyal traced it down to megaraid discovery creating tons of devices, each suffering a grace period when they later kill that queue (if no device is found). So lets use call_rcu() to batch these deferred frees, instead of taking the grace period hit for each one. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-06 09:54:19 +01:00
Vivek Goyal	846954b0a3	blkio: Allow CFQ group IO scheduling even when CFQ is a module o Now issues of blkio controller and CFQ in module mode should be fixed. Enable the cfq group scheduling support in module mode. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-04 16:38:14 +01:00
Vivek Goyal	3e25206689	blkio: Implement dynamic io controlling policy registration o One of the goals of block IO controller is that it should be able to support mulitple io control policies, some of which be operational at higher level in storage hierarchy. o To begin with, we had one io controlling policy implemented by CFQ, and I hard coded the CFQ functions called by blkio. This created issues when CFQ is compiled as module. o This patch implements a basic dynamic io controlling policy registration functionality in blkio. This is similar to elevator functionality where ioschedulers register the functions dynamically. o Now in future, when more IO controlling policies are implemented, these can dynakically register with block IO controller. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-04 16:38:14 +01:00
Vivek Goyal	9d6a986c0b	blkio: Export some symbols from blkio as its user CFQ can be a module o blkio controller is inside the kernel and cfq makes use of interfaces exported by blkio. CFQ can be a module too, hence export symbols used by CFQ. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-04 16:38:14 +01:00
Louis Rilling	b69f229206	block: Fix io_context leak after failure of clone with CLONE_IO With CLONE_IO, parent's io_context->nr_tasks is incremented, but never decremented whenever copy_process() fails afterwards, which prevents exit_io_context() from calling IO schedulers exit functions. Give a task_struct to exit_io_context(), and call exit_io_context() instead of put_io_context() in copy_process() cleanup path. Signed-off-by: Louis Rilling <louis.rilling@kerlabs.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-04 16:36:18 +01:00
Louis Rilling	61cc74fbb8	block: Fix io_context leak after clone with CLONE_IO With CLONE_IO, copy_io() increments both ioc->refcount and ioc->nr_tasks. However exit_io_context() only decrements ioc->refcount if ioc->nr_tasks reaches 0. Always call put_io_context() in exit_io_context(). Signed-off-by: Louis Rilling <louis.rilling@kerlabs.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-04 16:36:18 +01:00
André Goddard Rosa	af901ca181	tree-wide: fix assorted typos all over the place That is "success", "unknown", "through", "performance", "[re\|un]mapping" , "access", "default", "reasonable", "[con]currently", "temperature" , "channel", "[un]used", "application", "example","hierarchy", "therefore" , "[over\|under]flow", "contiguous", "threshold", "enough" and others. Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2009-12-04 15:39:55 +01:00
Shaohua Li	3c764b7a65	cfq-iosched: make nonrot check logic consistent cfq_arm_slice_timer() has logic to disable idle window for SSD device. The same thing should be done at cfq_select_queue() too, otherwise we will still see idle window. This makes the nonrot check logic consistent in cfq. Tests in a intel SSD with low_latency knob close, below patch can triple disk thoughput for muti-thread sequential read. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-04 13:12:06 +01:00
Jens Axboe	237e5bc4e5	io controller: quick fix for blk-cgroup and modular CFQ It's currently not an allowed configuration, so express that in Kconfig. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-04 10:07:38 +01:00
Jens Axboe	f2eecb9152	cfq-iosched: move IO controller declerations to a header file They should not be declared inside some other file that's not related to CFQ. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-04 10:06:35 +01:00
Jens Axboe	2f5ea47712	cfq-iosched: fix compile problem with !CONFIG_CGROUP Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-03 21:07:17 +01:00
Vivek Goyal	c04645e592	blkio: Wait on sync-noidle queue even if rq_noidle = 1 o rq_noidle() is supposed to tell cfq that do not expect a request after this one, hence don't idle. But this does not seem to work very well. For example for direct random readers, rq_noidle = 1 but there is next request coming after this. Not idling, leads to a group not getting its share even if group_isolation=1. o The right solution for this issue is to scan the higher layers and set right flag (WRITE_SYNC or WRITE_ODIRECT). For the time being, this single line fix helps. This should not have any significant impact when we are not using cgroups. I will later figure out IO paths in higher layer and fix it. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-03 19:28:53 +01:00
Vivek Goyal	ae30c28655	blkio: Implement group_isolation tunable o If a group is running only a random reader, then it will not have enough traffic to keep disk busy and we will reduce overall throughput. This should result in better latencies for random reader though. If we don't idle on random reader service tree, then this random reader will experience large latencies if there are other groups present in system with sequential readers running in these. o One solution suggested by corrado is that by default keep the random readers or sync-noidle workload in root group so that during one dispatch round we idle only once on sync-noidle tree. This means that all the sync-idle workload queues will be in their respective group and we will see service differentiation in those but not on sync-noidle workload. o Provide a tunable group_isolation. If set, this will make sure that even sync-noidle queues go in their respective group and we wait on these. This provides stronger isolation between groups but at the expense of throughput if group does not have enough traffic to keep the disk busy. o By default group_isolation = 0 Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-03 19:28:53 +01:00
Vivek Goyal	f26bd1f0a3	blkio: Determine async workload length based on total number of queues o Async queues are not per group. Instead these are system wide and maintained in root group. Hence their workload slice length should be calculated based on total number of queues in the system and not just queues in the root group. o As root group's default weight is 1000, make sure to charge async queue more in terms of vtime so that it does not get more time on disk because root group has higher weight. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-03 19:28:53 +01:00
Vivek Goyal	f75edf2dc8	blkio: Wait for cfq queue to get backlogged if group is empty o If a queue consumes its slice and then gets deleted from service tree, its associated group will also get deleted from service tree if this was the only queue in the group. That will make group loose its share. o For the queues on which we have idling on and if these have used their slice, wait a bit for these queues to get backlogged again and then expire these queues so that group does not loose its share. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-03 19:28:53 +01:00
Vivek Goyal	f8d461d692	blkio: Propagate cgroup weight updation to cfq groups o Propagate blkio cgroup weight updation to associated cfq groups. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-03 19:28:53 +01:00
Vivek Goyal	24610333d5	blkio: Drop the reference to queue once the task changes cgroup o If a task changes cgroup, drop reference to the cfqq associated with io context and set cfqq pointer stored in ioc to NULL so that upon next request arrival we will allocate a new queue in new group. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-03 19:28:52 +01:00
Vivek Goyal	8682e1f15f	blkio: Provide some isolation between groups o Do not allow following three operations across groups for isolation. - selection of co-operating queues - preemtpions across groups - request merging across groups. o Async queues are currently global and not per group. Allow preemption of an async queue if a sync queue in other group gets backlogged. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-03 19:28:52 +01:00
Vivek Goyal	220841906f	blkio: Export disk time and sectors used by a group to user space o Export disk time and sector used by a group to user space through cgroup interface. o Also export a "dequeue" interface to cgroup which keeps track of how many a times a group was deleted from service tree. Helps in debugging. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-03 19:28:52 +01:00
Vivek Goyal	2868ef7b39	blkio: Some debugging aids for CFQ o Some debugging aids for CFQ. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-03 19:28:52 +01:00
Vivek Goyal	b1c3576961	blkio: Take care of cgroup deletion and cfq group reference counting o One can choose to change elevator or delete a cgroup. Implement group reference counting so that both elevator exit and cgroup deletion can take place gracefully. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Nauman Rafique <nauman@google.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-03 19:28:52 +01:00
Vivek Goyal	25fb5169d4	blkio: Dynamic cfq group creation based on cgroup tasks belongs to o Determine the cgroup IO submitting task belongs to and create the cfq group if it does not exist already. o Also link cfqq and associated cfq group. o Currently all async IO is mapped to root group. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-03 19:28:52 +01:00
Vivek Goyal	dae739ebc4	blkio: Group time used accounting and workload context save restore o This patch introduces the functionality to do the accounting of group time when a queue expires. This time used decides which is the group to go next. o Also introduce the functionlity to save and restore the workload type context with-in group. It might happen that once we expire the cfq queue and group, a different group will schedule in and we will lose the context of the workload type. Hence save and restore it upon queue expiry. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-03 19:28:52 +01:00

1 2 3 4 5 ...

1111 Commits