This is basically identical to what Vivek Goyal posted, but combined
into one and labelled 'desktop' instead of 'fairness'. The goal
is to continue to improve on the latency side of things as it relates
to interactiveness, keeping the questionable bits under this sysfs
tunable so it would be easy for throughput-only people to turn off.
Apart from adding the interactive sysfs knob, it also adds the
behavioural change of allowing slice idling even if the hardware
does tagged command queuing.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (46 commits)
powerpc64: convert to dynamic percpu allocator
sparc64: use embedding percpu first chunk allocator
percpu: kill lpage first chunk allocator
x86,percpu: use embedding for 64bit NUMA and page for 32bit NUMA
percpu: update embedding first chunk allocator to handle sparse units
percpu: use group information to allocate vmap areas sparsely
vmalloc: implement pcpu_get_vm_areas()
vmalloc: separate out insert_vmalloc_vm()
percpu: add chunk->base_addr
percpu: add pcpu_unit_offsets[]
percpu: introduce pcpu_alloc_info and pcpu_group_info
percpu: move pcpu_lpage_build_unit_map() and pcpul_lpage_dump_cfg() upward
percpu: add @align to pcpu_fc_alloc_fn_t
percpu: make @dyn_size mandatory for pcpu_setup_first_chunk()
percpu: drop @static_size from first chunk allocators
percpu: generalize first chunk allocator selection
percpu: build first chunk allocators selectively
percpu: rename 4k first chunk allocator to page
percpu: improve boot messages
percpu: fix pcpu_reclaim() locking
...
Fix trivial conflict as by Tejun Heo in kernel/sched.c
The blktrace tools can show process id when cfq dispatched a request,
using cfq_log_cfqq() instead of cfq_log().
Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
It's not currently used, as pointed out by
Gui Jianfeng <guijianfeng@cn.fujitsu.com>. We already check the
wait_request flag to allow an idling queue priority allocation access,
so we don't need this extra flag.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Get rid of any functions that test for these bits and make callers
use bio_rw_flagged() directly. Then it is at least directly apparent
what variable and flag they check.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
o Get rid of busy_rt_queues infrastructure. Looks like it is redundant.
o Once an RT queue gets request it will preempt any of the BE or IDLE queues
immediately. Otherwise this queue will be put on service tree and scheduler
will anyway select this queue before any of the BE or IDLE queue. Hence
looks like there is no need to keep track of how many busy RT queues are
currently on service tree.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
To lessen the impact of async IO on sync IO, let the device drain of
any async IO in progress when switching to a sync cfqq that has idling
enabled.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Conflicts:
arch/sparc/kernel/smp_64.c
arch/x86/kernel/cpu/perf_counter.c
arch/x86/kernel/setup_percpu.c
drivers/cpufreq/cpufreq_ondemand.c
mm/percpu.c
Conflicts in core and arch percpu codes are mostly from commit
ed78e1e078dd44249f88b1dd8c76dafb39567161 which substituted many
num_possible_cpus() with nr_cpu_ids. As for-next branch has moved all
the first chunk allocators into mm/percpu.c, the changes are moved
from arch code to mm/percpu.c.
Signed-off-by: Tejun Heo <tj@kernel.org>
In case memory is scarce, we now default to oom_cfqq. Once memory is
available again, we should allocate a new cfqq and stop using oom_cfqq for
a particular io context.
Once a new request comes in, check if we are using oom_cfqq, and if yes,
try to allocate a new cfqq.
Tested the patch by forcing the use of oom_cfqq and upon next request thread
realized that it was using oom_cfqq and it allocated a new cfqq.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Pull linus#master to merge PER_CPU_DEF_ATTRIBUTES and alpha build fix
changes. As alpha in percpu tree uses 'weak' attribute instead of
inline assembly, there's no need for __used attribute.
Conflicts:
arch/alpha/include/asm/percpu.h
arch/mn10300/kernel/vmlinux.lds.S
include/linux/percpu-defs.h
With the changes for falling back to an oom_cfqq, we never fail
to find/allocate a queue in cfq_get_queue(). So remove the check.
Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Setup an emergency fallback cfqq that we allocate at IO scheduler init
time. If the slab allocation fails in cfq_find_alloc_queue(), we'll just
punt IO to that cfqq instead. This ensures that cfq_find_alloc_queue()
never fails without having to ensure free memory.
On cfqq lookup, always try to allocate a new cfqq if the given cfq io
context has the oom_cfqq assigned. This ensures that we only temporarily
punt to this shared queue.
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
We're going to be needing that init code outside of that function
to get rid of the __GFP_NOFAIL in cfqq allocation.
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Percpu variable definition is about to be updated such that all percpu
symbols including the static ones must be unique. Update percpu
variable definitions accordingly.
* as,cfq: rename ioc_count uniquely
* cpufreq: rename cpu_dbs_info uniquely
* xen: move nesting_count out of xen_evtchn_do_upcall() and rename it
* mm: move ratelimits out of balance_dirty_pages_ratelimited_nr() and
rename it
* ipv4,6: rename cookie_scratch uniquely
* x86 perf_counter: rename prev_left to pmc_prev_left, irq_entry to
pmc_irq_entry and nmi_entry to pmc_nmi_entry
* perf_counter: rename disable_count to perf_disable_count
* ftrace: rename test_event_disable to ftrace_test_event_disable
* kmemleak: rename test_pointer to kmemleak_test_pointer
* mce: rename next_interval to mce_next_interval
[ Impact: percpu usage cleanups, no duplicate static percpu var names ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: linux-mm <linux-mm@kvack.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Andi Kleen <andi@firstfloor.org>
I noticed a blank line in blktrace output. This patch fixes that.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Actually, last_end_request in cfq_data isn't used now. So lets
just remove it.
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Currently io_context has an atomic_t(32-bit) as refcount. In the case of
cfq, for each device against whcih a task does I/O, a reference to the
io_context would be taken. And when there are multiple process sharing
io_contexts(CLONE_IO) would also have a reference to the same io_context.
Theoretically the possible maximum number of processes sharing the same
io_context + the number of disks/cfq_data referring to the same io_context
can overflow the 32-bit counter on a very high-end machine.
Even though it is an improbable case, let us make it atomic_long_t.
Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
struct request has had a few different ways to represent some
properties of a request. ->hard_* represent block layer's view of the
request progress (completion cursor) and the ones without the prefix
are supposed to represent the issue cursor and allowed to be updated
as necessary by the low level drivers. The thing is that as block
layer supports partial completion, the two cursors really aren't
necessary and only cause confusion. In addition, manual management of
request detail from low level drivers is cumbersome and error-prone at
the very least.
Another interesting duplicate fields are rq->[hard_]nr_sectors and
rq->{hard_cur|current}_nr_sectors against rq->data_len and
rq->bio->bi_size. This is more convoluted than the hard_ case.
rq->[hard_]nr_sectors are initialized for requests with bio but
blk_rq_bytes() uses it only for !pc requests. rq->data_len is
initialized for all request but blk_rq_bytes() uses it only for pc
requests. This causes good amount of confusion throughout block layer
and its drivers and determining the request length has been a bit of
black magic which may or may not work depending on circumstances and
what the specific LLD is actually doing.
rq->{hard_cur|current}_nr_sectors represent the number of sectors in
the contiguous data area at the front. This is mainly used by drivers
which transfers data by walking request segment-by-segment. This
value always equals rq->bio->bi_size >> 9. However, data length for
pc requests may not be multiple of 512 bytes and using this field
becomes a bit confusing.
In general, having multiple fields to represent the same property
leads only to confusion and subtle bugs. With recent block low level
driver cleanups, no driver is accessing or manipulating these
duplicate fields directly. Drop all the duplicates. Now rq->sector
means the current sector, rq->data_len the current total length and
rq->bio->bi_size the current segment length. Everything else is
defined in terms of these three and available only through accessors.
* blk_recalc_rq_sectors() is collapsed into blk_update_request() and
now handles pc and fs requests equally other than rq->sector update.
This means that now pc requests can use partial completion too (no
in-kernel user yet tho).
* bio_cur_sectors() is replaced with bio_cur_bytes() as block layer
now uses byte count as the primary data length.
* blk_rq_pos() is now guranteed to be always correct. In-block users
converted.
* blk_rq_bytes() is now guaranteed to be always valid as is
blk_rq_sectors(). In-block users converted.
* blk_rq_sectors() is now guaranteed to equal blk_rq_bytes() >> 9.
More convenient one is used.
* blk_rq_bytes() and blk_rq_cur_bytes() are now inlined and take const
pointer to request.
[ Impact: API cleanup, single way to represent one property of a request ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
With recent cleanups, there is no place where low level driver
directly manipulates request fields. This means that the 'hard'
request fields always equal the !hard fields. Convert all
rq->sectors, nr_sectors and current_nr_sectors references to
accessors.
While at it, drop superflous blk_rq_pos() < 0 test in swim.c.
[ Impact: use pos and nr_sectors accessors ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>
Tested-by: Grant Likely <grant.likely@secretlab.ca>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Tested-by: Adrian McMenamin <adrian@mcmen.demon.co.uk>
Acked-by: Adrian McMenamin <adrian@mcmen.demon.co.uk>
Acked-by: Mike Miller <mike.miller@hp.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Cc: Borislav Petkov <petkovbb@googlemail.com>
Cc: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Cc: Eric Moore <Eric.Moore@lsi.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Pete Zaitcev <zaitcev@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Paul Clements <paul.clements@steeleye.com>
Cc: Tim Waugh <tim@cyberelk.net>
Cc: Jeff Garzik <jgarzik@pobox.com>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Alex Dubov <oakad@yahoo.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Dario Ballabio <ballabio_dario@emc.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: unsik Kim <donari75@gmail.com>
Cc: Laurent Vivier <Laurent@lvivier.info>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Implement accessors - blk_rq_pos(), blk_rq_sectors() and
blk_rq_cur_sectors() which return rq->hard_sector, rq->hard_nr_sectors
and rq->hard_cur_sectors respectively and convert direct references of
the said fields to the accessors.
This is in preparation of request data length handling cleanup.
Geert : suggested adding const to struct request * parameter to accessors
Sergei : spotted error in patch description
[ Impact: cleanup ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>
Acked-by: Stephen Rothwell <sfr@canb.auug.org.au>
Tested-by: Grant Likely <grant.likely@secretlab.ca>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Ackec-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Cc: Borislav Petkov <petkovbb@googlemail.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
blk_start_queueing() is identical to __blk_run_queue() except that it
doesn't check for recursion. None of the current users depends on
blk_start_queueing() running request_fn directly. Replace usages of
blk_start_queueing() with [__]blk_run_queue() and kill it.
[ Impact: removal of mostly duplicate interface function ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently we look it up from ->ioprio, but ->ioprio can change if
either the process gets its IO priority changed explicitly, or if
cfq decides to temporarily boost it. So if we are unlucky, we can
end up attempting to remove a node from a different rbtree root than
where it was added.
Fix this by using ->org_ioprio as the prio_tree index, since that
will only change for explicit IO priority settings (not for a boost).
Additionally cache the rbtree root inside the cfqq, then we don't have
to add code to reinsert the cfqq in the prio_tree if IO priority changes.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
cfq_prio_tree_lookup() should return the direct match, yet it always
returns zero. Fix that.
cfq_prio_tree_add() assumes that we don't get a direct match, while
it is very possible that we do. Using O_DIRECT, you can have different
cfqq with matching requests, since you don't have the page cache
to serialize things for you. Fix this bug by only adding the cfqq if
there isn't an existing match.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
If the cfq io context doesn't have enough samples yet to provide a mean
seek distance, then use the default threshold we have for seeky IO instead
of defaulting to 0.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Right now, depending on the first sector to which a process issues I/O,
the seek time may start out way out of whack. So make sure we start
with 0 sectors in seek, instead of the offset of the first request
issued.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
If we have processes that are working in close proximity to each
other on disk, we don't want to idle wait. Instead allow the close
process to issue a request, getting better aggregate bandwidth.
The anticipatory scheduler has similar checks, noop and deadline do
not need it since they don't care about process <-> io mappings.
The code for CFQ is a little more involved though, since we split
request queues into per-process contexts.
This fixes a performance problem with eg dump(8), since it uses
several processes in some silly attempt to speed IO up. Even if
dump(8) isn't really a valid case (it should be fixed by using
CLONE_IO), there are other cases where we see close processes
and where idling ends up hurting performance.
Credit goes to Jeff Moyer <jmoyer@redhat.com> for writing the
initial implementation.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
We only kick the dispatch for an idling queue, if we think it's a
(somewhat) fully merged request. Also allow a kick if we have other
busy queues in the system, since we don't want to risk waiting for
a potential merge in that case. It's better to get some work done and
proceed.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
It's called from the workqueue handlers from process context, so
we always have irqs enabled when entered.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
"Zhang, Yanmin" <yanmin_zhang@linux.intel.com> reports that commit
b029195dda introduced a regression
of about 50% with sequential threaded read workloads. The test
case is:
tiotest -k0 -k1 -k3 -f 80 -t 32
which starts 32 threads each reading a 80MB file. Twiddle the kick
queue logic so that we do start IO immediately, if it appears to be
a fully merged request. We can't really detect that, so just check
if the request is bigger than a page or not. The assumption is that
since single bio issues will first queue a single request with just
one page attached and then later do merges on that, if we already
have more than a page worth of data in the request, then the request
is most likely good to go.
Verified that this doesn't cause a regression with the test case that
commit b029195dda was fixing. It does not,
we still see maximum sized requests for the queue-then-merge cases.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
When CFQ is waiting for a new request from a process, currently it'll
immediately restart queuing when it sees such a request. This doesn't
work very well with streamed IO, since we then end up splitting IO
that would otherwise have been merged nicely. For a simple dd test,
this causes 10x as many requests to be issued as we should have.
Normally this goes unnoticed due to the low overhead of requests
at the device side, but some hardware is very sensitive to request
sizes and there it can cause big slow downs.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
We only manipulate the must_dispatch and queue_new flags, they are not
tested anymore. So get rid of them.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The IO scheduler core calls into the IO scheduler dispatch_request hook
to move requests from the IO scheduler and into the driver dispatch
list. It only does so when the dispatch list is empty. CFQ moves several
requests to the dispatch list, which can cause higher latencies if we
suddenly have to switch to some important sync IO. Change the logic to
move one request at the time instead.
This should almost be functionally equivalent to what we did before,
except that we now honor 'quantum' as the maximum queue depth at the
device side from any single cfqq. If there's just a single active
cfqq, we allow up to 4 times the normal quantum.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
By default, CFQ will anticipate more IO from a given io context if the
previously completed IO was sync. This used to be fine, since the only
sync IO was reads and O_DIRECT writes. But with more "normal" sync writes
being used now, we don't want to anticipate for those.
Add a bio/request flag that informs the IO scheduler that this is a sync
request that we should not idle for. Introduce WRITE_ODIRECT specifically
for O_DIRECT writes, and make sure that the other sync writes set this
flag.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds the ability to pre-empt an ongoing BE timeslice when a RT
request is waiting for the current timeslice to complete. This reduces the
wait time to disk for RT requests from an upper bound of 4 (current value
of cfq_quantum) to 1 disk request.
Applied Jens' suggeested changes to avoid the rb lookup and use !cfq_class_rt()
and retested.
Latency(secs) for the RT task when doing sequential reads from 10G file.
| only RT | RT + BE | RT + BE + this patch
small (512 byte) reads | 143 | 163 | 145
large (1Mb) reads | 142 | 158 | 146
Signed-off-by: Divyesh Shah <dpshah@google.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Original patch from Nikanth Karthikesan <knikanth@suse.de>
When a queue exits the queue lock is taken and cfq_exit_queue() would free all
the cic's associated with the queue.
But when a task exits, cfq_exit_io_context() gets cic one by one and then
locks the associated queue to call __cfq_exit_single_io_context. It looks like
between getting a cic from the ioc and locking the queue, the queue might have
exited on another cpu.
Fix this by rechecking the cfq_io_context queue key inside the queue lock
again, and not calling into __cfq_exit_single_io_context() if somebody
beat us to it.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This basically limits the hardware queue depth to 4*quantum at any
point in time, which is 16 with the default settings. As CFQ uses
other means to shrink the hardware queue when necessary in the first
place, there's really no need for this extra heuristic. Additionally,
it ends up hurting performance in some cases.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
After many improvements on kblockd_flush_work, it is now identical to
cancel_work_sync, so a direct call to cancel_work_sync is suggested.
The only difference is that cancel_work_sync is a GPL symbol,
so no non-GPL modules anymore.
Signed-off-by: Cheng Renquan <crquan@gmail.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
We really need to know about the hardware tagging support as well,
since if the SSD does not do tagging then we still want to idle.
Otherwise have the same dependent sync IO vs flooding async IO
problem as on rotational media.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
We don't want to idle in AS/CFQ if the device doesn't have a seek
penalty. So add a QUEUE_FLAG_NONROT to indicate a non-rotational
device, low level drivers should set this flag upon discovery of
an SSD or similar device type.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
CFQ's detection of queueing devices assumes a non-queuing device and detects
if the queue depth reaches a certain threshold. Under some workloads (e.g.
synchronous reads), CFQ effectively forces a unit queue depth, thus defeating
the detection logic. This leads to poor performance on queuing hardware,
since the idle window remains enabled.
This patch inverts the sense of the logic: assume a queuing-capable device,
and detect if the depth does not exceed the threshold.
Signed-off-by: Aaron Carroll <aaronc@gelato.unsw.edu.au>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Now that blktrace has the ability to carry arbitrary messages in
its stream, use that for some CFQ logging.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
If we have multiple tasks freeing cfq_io_contexts when cfq-iosched
is being unloaded, we could complete() ioc_gone twice. Fix that by
protecting ioc_gone complete() and clearing with a spinlock for
just that purpose. Doesn't matter from a performance perspective,
since it'll only enter that path when ioc_gone != NULL (when cfq-iosched
is being rmmod'ed).
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
cfq_cic_lookup() needs to properly protect ioc->ioc_data before
dereferencing it and also exclude updaters of ioc->ioc_data as well.
Also add a number of comments documenting why the existing RCU usage
is OK.
Thanks a lot to "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> for
review and comments!
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
saves 8 bytes of padding & increases objects/slab from 30 to 32 on my
AMD64 config
Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
We currently set all processes to the best-effort scheduling class,
regardless of what CPU scheduling class they belong to. Improve that
so that we correctly track idle and rt scheduling classes as well.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
put_io_context() drops the RCU read lock before calling into cfq_dtor(),
however we need to hold off freeing there before grabbing and
dereferencing the first object on the list.
So extend the rcu_read_lock() scope to cover the calling of cfq_dtor(),
and optimize cfq_free_io_context() to use a new variant for
call_for_each_cic() that assumes the RCU read lock is already held.
Hit in the wild by Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
When switching scheduler from cfq, cfq_exit_queue() does not clear
ioc->ioc_data, leaving a dangling pointer that can deceive the following
lookups when the iosched is switched back to cfq. The pattern that can
trigger that is the following:
- elevator switch from cfq to something else;
- module unloading, with elv_unregister() that calls cfq_free_io_context()
on ioc freeing the cic (via the .trim op);
- module gets reloaded and the elevator switches back to cfq;
- reallocation of a cic at the same address as before (with a valid key).
To fix it just assign NULL to ioc_data in __cfq_exit_single_io_context(),
that is called from the regular exit path and from the elevator switching
code. The only path that frees a cic and is not covered is the error handling
one, but cic's freed in this way are never cached in ioc_data.
Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
SLAB_DESTROY_BY_RCU is not a direct substitute for normal call_rcu()
freeing, since it'll page freeing but NOT object freeing. So change
cfq to do the freeing on its own.
Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
It's cumbersome to browse a radix tree from start to finish, especially
since we modify keys when a process exits. So add a hlist for the single
purpose of browsing over all known cfq_io_contexts, used for exit,
io prio change, etc.
This fixes http://bugzilla.kernel.org/show_bug.cgi?id=9948
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Currently you must be root to set idle io prio class on a process. This
is due to the fact that the idle class is implemented as a true idle
class, meaning that it will not make progress if someone else is
requesting disk access. Unfortunately this means that it opens DOS
opportunities by locking down file system resources, hence it is root
only at the moment.
This patch relaxes the idle class a little, by removing the truly idle
part (which entals a grace period with associated timer). The
modifications make the idle class as close to zero impact as can be done
while still guarenteeing progress. This means we can relax the root only
criteria as well.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The io context sharing introduced a per-ioc spinlock, that would protect
the cfq io context lookup. That is a regression from the original, since
we never needed any locking there because the ioc/cic were process private.
The cic lookup is changed from an rbtree construct to a radix tree, which
we can then use RCU to make the reader side lockless. That is the performance
critical path, modifying the radix tree is only done on process creation
(when that process first does IO, actually) and on process exit (if that
process has done IO).
As it so happens, radix trees are also much faster for this type of
lookup where the key is a pointer. It's a very sparse tree.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
elv_register() always returns 0, and there isn't anything it does where
it should return an error (the only error condition is so grave that
it's handled with a BUG_ON).
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
In theory, if the queue was idle long enough, cfq_idle_class_timer may have
a false (and very long) timeout because jiffies can wrap into the past wrt
->last_end_request.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
After the fresh boot:
ionice -c3 -p $$
echo cfq >> /sys/block/XXX/queue/scheduler
dd if=/dev/XXX of=/dev/null bs=512 count=1
Now dd hangs in D state and the queue is completely stalled for approximately
INITIAL_JIFFIES + CFQ_IDLE_GRACE jiffies. This is because cfq_init_queue()
forgets to initialize cfq_data->last_end_request.
(I guess this patch is not complete, overflow is still possible)
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Spotted by Nick <gentuu@gmail.com>, hopefully can explain the second trace in
http://bugzilla.kernel.org/show_bug.cgi?id=9180.
If ->async_idle_cfqq != NULL cfq_put_async_queues() puts it IOPRIO_BE_NR times
in a loop. Fix this.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
cfq_get_queue()->cfq_find_alloc_queue() can fail, check the returned value.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Note that this isn't a bug at the moment, since the regular IO path
does not call this path without __GFP_WAIT set. However, it could be a
future bug, so I've applied it.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Some of the code has been gradually transitioned to using the proper
struct request_queue, but there's lots left. So do a full sweet of
the kernel and get rid of this typedef and replace its uses with
the proper type.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
There are some leftover bits from the task cooperator patch, that was
yanked out again. While it will get reintroduced, no point in having
this write-only stuff in the tree. So yank it.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
If we have two processes with different ioprio_class, but the same
ioprio_data, their async requests will fall into the same queue. I guess
such behavior is not expected, because it's not right to put real-time
requests and best-effort requests in the same queue.
The attached patch fixes the problem by introducing additional *cfqq
fields on cfqd, pointing to per-(class,priority) async queues.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
kmalloc_node() and kmem_cache_alloc_node() were not available in a zeroing
variant in the past. But with __GFP_ZERO it is possible now to do zeroing
while allocating.
Use __GFP_ZERO to remove the explicit clearing of memory via memset whereever
we can.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With the cfq_queue hash removal, we inadvertently got rid of the
async queue sharing. This was not intentional, in fact CFQ purposely
shares the async queue per priority level to get good merging for
async writes.
So put some logic in cfq_get_queue() to track the shared queues.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch provides a new macro
KMEM_CACHE(<struct>, <flags>)
to simplify slab creation. KMEM_CACHE creates a slab with the name of the
struct, with the size of the struct and with the alignment of the struct.
Additional slab flags may be specified if necessary.
Example
struct test_slab {
int a,b,c;
struct list_head;
} __cacheline_aligned_in_smp;
test_slab_cache = KMEM_CACHE(test_slab, SLAB_PANIC)
will create a new slab named "test_slab" of the size sizeof(struct
test_slab) and aligned to the alignment of test slab. If it fails then we
panic.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We often lookup the same queue many times in succession, so cache
the last looked up queue to avoid browsing the rbtree.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
cfq hash is no more necessary. We always can get cfqq from io context.
cfq_get_io_context_noalloc() function is introduced, because we don't
want to allocate cic on merging and checking may_queue. In order to
identify sync queue we've used hash key = CFQ_KEY_ASYNC. Since hash is
eliminated we need to use other criterion: sync flag for queue is added.
In all places where we dig in rb_tree we're in current context, so no
additional locking is required.
Advantages of this patch: no additional memory for hash, no seeking in
hash, code is cleaner. But it is necessary now to seek cic in per-ioc
rbtree, but it is faster:
- most processes work only with few devices
- most systems have only few block devices
- it is a rb-tree
Signed-off-by: Vasily Tarasov <vtaras@openvz.org>
Changes by me:
- Merge into CFQ devel branch
- Get rid of cfq_get_io_context_noalloc()
- Fix various bugs with dereferencing cic->cfqq[] with offset other
than 0 or 1.
- Fix bug in cfqq setup, is_sync condition was reversed.
- Fix bug where only bio_sync() is used, we need to check for a READ too
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
For tagged devices, allow overlap of requests if the idle window
isn't enabled on the current active queue.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
It's only used for preemption now that the IDLE and RT queues also
use the rbtree. If we pass an 'add_front' variable to
cfq_service_tree_add(), we can set ->rb_key to 0 to force insertion
at the front of the tree.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Currently CFQ does a linked insert into the current list for RT
queues. We can just factor the class into the rb insertion,
and then we don't have to treat RT queues in a special way. It's
faster, too.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
For cases where the rbtree is mainly used for sorting and min retrieval,
a nice speedup of the rbtree code is to maintain a cache of the leftmost
node in the tree.
Also spotted in the CFS CPU scheduler code.
Improved by Alan D. Brunelle <Alan.Brunelle@hp.com> by updating the
leftmost hint in cfq_rb_first() if it isn't set, instead of only
updating it on insert.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Drawing on some inspiration from the CFS CPU scheduler design, overhaul
the pending cfq_queue concept list management. Currently CFQ uses a
doubly linked list per priority level for sorting and service uses.
Kill those lists and maintain an rbtree of cfq_queue's, sorted by when
to service them.
This unfortunately means that the ionice levels aren't as strong
anymore, will work on improving those later. We only scale the slice
time now, not the number of times we service. This means that latency
is better (for all priority levels), but that the distinction between
the highest and lower levels aren't as big.
The diffstat speaks for itself.
cfq-iosched.c | 363 +++++++++++++++++---------------------------------
1 file changed, 125 insertions(+), 238 deletions(-)
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
- Move the queue_new flag clear to when the queue is selected
- Only select the non-first queue in cfq_get_best_queue(), if there's
a substantial difference between the best and first.
- Get rid of ->busy_rr
- Only select a close cooperator, if the current queue is known to take
a while to "think".
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
- Implement logic for detecting cooperating processes, so we
choose the best available queue whenever possible.
- Improve residual slice time accounting.
- Remove dead code: we no longer see async requests coming in on
sync queues. That part was removed a long time ago. That means
that we can also remove the difference between cfq_cfqq_sync()
and cfq_cfqq_class_sync(), they are now indentical. And we can
kill the on_dispatch array, just make it a counter.
- Allow a process to go into the current list, if it hasn't been
serviced in this scheduler tick yet.
Possible future improvements including caching the cfqq lookup
in cfq_close_cooperator(), so we don't have to look it up twice.
cfq_get_best_queue() should just use that last decision instead
of doing it again.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
When testing the syslet async io approach, I discovered that CFQ
sometimes didn't perform as well as expected. cfq_should_preempt()
needs to better check for cooperating tasks, so fix that by allowing
preemption of an equal priority queue if the recently queued request
is as good a candidate for IO as the one we are currently waiting for.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
There's a really rare and obscure bug in CFQ, that causes a crash in
cfq_dispatch_insert() due to rq == NULL. One example of the resulting
oops is seen here:
http://lkml.org/lkml/2007/4/15/41
Neil correctly diagnosed the situation for how this can happen: if two
concurrent requests with the exact same sector number (due to direct IO
or aliasing between MD and the raw device access), the alias handling
will add the request to the sortlist, but next_rq remains NULL.
Read the more complete analysis at:
http://lkml.org/lkml/2007/4/25/57
This looks like it requires md to trigger, even though it should
potentially be possible to due with O_DIRECT (at least if you edit the
kernel and doctor some of the unplug calls).
The fix is to move the ->next_rq update to when we add a request to the
rbtree. Then we remove the possibility for a request to exist in the
rbtree code, but not have ->next_rq correctly updated.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We have a 10-15% performance regression for sequential writes on TCQ/NCQ
enabled drives in 2.6.21-rcX after the CFQ update went in. It has been
reported by Valerie Clement <valerie.clement@bull.net> and the Intel
testing folks. The regression is because of CFQ's now more aggressive
queue control, limiting the depth available to the device.
This patches fixes that regression by allowing a greater depth when only
one queue is busy. It has been tested to not impact sync-vs-async
workloads too much - we still do a lot better than 2.6.20.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We currently check the FIFO once per slice. Optimize that a bit and
only do it as the first thing for a new slice, so we don't end up
doing a single request and then seek to the FIFO requests.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
It must always be the active queue, otherwise it's a bug. So just
use the active_queue, don't pass it in explicitly.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
If a slice uses less than it is entitled to (or perhaps more), include
that in the decision on how much time to give it the next time it
gets serviced.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>