linux

Wu Fengguang 143dfe8611 writeback: IO-less balance_dirty_pages()	2011-10-03 21:08:57 +08:00

As proposed by Chris, Dave and Jan, don't start foreground writeback IO
inside balance_dirty_pages(). Instead, simply let it idle sleep for some
time to throttle the dirtying task. In the mean while, kick off the
per-bdi flusher thread to do background writeback IO.

RATIONALS
=========

- disk seeks on concurrent writeback of multiple inodes (Dave Chinner)

  If every thread doing writes and being throttled start foreground
  writeback, it leads to N IO submitters from at least N different
  inodes at the same time, end up with N different sets of IO being
  issued with potentially zero locality to each other, resulting in
  much lower elevator sort/merge efficiency and hence we seek the disk
  all over the place to service the different sets of IO.
  OTOH, if there is only one submission thread, it doesn't jump between
  inodes in the same way when congestion clears - it keeps writing to
  the same inode, resulting in large related chunks of sequential IOs
  being issued to the disk. This is more efficient than the above
  foreground writeback because the elevator works better and the disk
  seeks less.

- lock contention and cache bouncing on concurrent IO submitters (Dave Chinner)

  With this patchset, the fs_mark benchmark on a 12-drive software RAID0 goes
  from CPU bound to IO bound, freeing "3-4 CPUs worth of spinlock contention".

  * "CPU usage has dropped by ~55%", "it certainly appears that most of
    the CPU time saving comes from the removal of contention on the
    inode_wb_list_lock" (IMHO at least 10% comes from the reduction of
    cacheline bouncing, because the new code is able to call much less
    frequently into balance_dirty_pages() and hence access the global
    page states)

  * the user space "App overhead" is reduced by 20%, by avoiding the
    cacheline pollution by the complex writeback code path

  * "for a ~5% throughput reduction", "the number of write IOs have
    dropped by ~25%", and the elapsed time reduced from 41:42.17 to
    40:53.23.

  * On a simple test of 100 dd, it reduces the CPU %system time from 30%
    to 3%, and improves IO throughput from 38MB/s to 42MB/s.

- IO size too small for fast arrays and too large for slow USB sticks

  The write_chunk used by current balance_dirty_pages() cannot be
  directly set to some large value (eg. 128MB) for better IO efficiency.
  Because it could lead to more than 1 second user perceivable stalls.
  Even the current 4MB write size may be too large for slow USB sticks.
  The fact that balance_dirty_pages() starts IO on itself couples the
  IO size to wait time, which makes it hard to do suitable IO size while
  keeping the wait time under control.

  Now it's possible to increase writeback chunk size proportional to the
  disk bandwidth. In a simple test of 50 dd's on XFS, 1-HDD, 3GB ram,
  the larger writeback size dramatically reduces the seek count to 1/10
  (far beyond my expectation) and improves the write throughput by 24%.

- long block time in balance_dirty_pages() hurts desktop responsiveness

  Many of us may have the experience: it often takes a couple of seconds
  or even long time to stop a heavy writing dd/cp/tar command with
  Ctrl-C or "kill -9".

- IO pipeline broken by bumpy write() progress

  There are a broad class of "loop {read(buf); write(buf);}" applications
  whose read() pipeline will be under-utilized or even come to a stop if
  the write()s have long latencies _or_ don't progress in a constant rate.
  The current threshold based throttling inherently transfers the large
  low level IO completion fluctuations to bumpy application write()s,
  and further deteriorates with increasing number of dirtiers and/or bdi's.

  For example, when doing 50 dd's + 1 remote rsync to an XFS partition,
  the rsync progresses very bumpy in legacy kernel, and throughput is
  improved by 67% by this patchset. (plus the larger write chunk size,
  it will be 93% speedup).

  The new rate based throttling can support 1000+ dd's with excellent
  smoothness, low latency and low overheads.

For the above reasons, it's much better to do IO-less and low latency
pauses in balance_dirty_pages().

Jan Kara, Dave Chinner and me explored the scheme to let
balance_dirty_pages() wait for enough writeback IO completions to
safeguard the dirty limit. However it's found to have two problems:

- in large NUMA systems, the per-cpu counters may have big accounting
  errors, leading to big throttle wait time and jitters.

- NFS may kill large amount of unstable pages with one single COMMIT.
  Because NFS server serves COMMIT with expensive fsync() IOs, it is
  desirable to delay and reduce the number of COMMITs. So it's not
  likely to optimize away such kind of bursty IO completions, and the
  resulted large (and tiny) stall times in IO completion based throttling.

So here is a pause time oriented approach, which tries to control the
pause time in each balance_dirty_pages() invocations, by controlling
the number of pages dirtied before calling balance_dirty_pages(), for
smooth and efficient dirty throttling:

- avoid useless (eg. zero pause time) balance_dirty_pages() calls
- avoid too small pause time (less than 4ms, which burns CPU power)
- avoid too large pause time (more than 200ms, which hurts responsiveness)
- avoid big fluctuations of pause times

It can control pause times at will. The default policy (in a followup
patch) will be to do ~10ms pauses in 1-dd case, and increase to ~100ms
in 1000-dd case.

BEHAVIOR CHANGE
===============

(1) dirty threshold

Users will notice that the applications will get throttled once crossing
the global (background + dirty)/2=15% threshold, and then balanced around
17.5%. Before patch, the behavior is to just throttle it at 20% dirtyable
memory in 1-dd case.

Since the task will be soft throttled earlier than before, it may be
perceived by end users as performance "slow down" if his application
happens to dirty more than 15% dirtyable memory.

(2) smoothness/responsiveness

Users will notice a more responsive system during heavy writeback.
"killall dd" will take effect instantly.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
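
The arithmetic in the commit message above can be sketched in plain user-space C. This is only an illustration, not the code in page-writeback.c: every function name below is hypothetical, the 10% and 20% inputs used in the usage note are assumed defaults, and treating 17.5% as the midpoint between the 15% throttle point and the 20% limit is an assumption read off the numbers quoted in the message.

```c
/* Pause bounds quoted in the commit message. */
#define MIN_PAUSE_MS 4L    /* shorter pauses burn CPU power */
#define MAX_PAUSE_MS 200L  /* longer pauses hurt responsiveness */

/* Hypothetical helper: point where soft throttling starts,
 * (background + dirty) / 2 as stated in the commit message. */
static double throttle_start_pct(double background_pct, double dirty_pct)
{
	return (background_pct + dirty_pct) / 2.0;
}

/* Assumption: the balance point sits midway between the throttle
 * start and the hard dirty limit. */
static double balance_point_pct(double background_pct, double dirty_pct)
{
	return (throttle_start_pct(background_pct, dirty_pct) + dirty_pct) / 2.0;
}

/* Hypothetical helper: how many pages a task may dirty before its next
 * throttling call so that the resulting pause is close to
 * target_pause_ms. The target is clamped to [MIN_PAUSE_MS, MAX_PAUSE_MS]. */
static long pages_before_next_call(long rate_pages_per_sec, long target_pause_ms)
{
	long pages;

	if (target_pause_ms < MIN_PAUSE_MS)
		target_pause_ms = MIN_PAUSE_MS;
	if (target_pause_ms > MAX_PAUSE_MS)
		target_pause_ms = MAX_PAUSE_MS;

	pages = rate_pages_per_sec * target_pause_ms / 1000;
	return pages > 0 ? pages : 1;	/* always allow some progress */
}

/* Hypothetical helper: pause (in ms) owed by a task that dirtied
 * pages_dirtied pages while allowed rate_pages_per_sec. Returns 0 when
 * nothing was dirtied, and never more than MAX_PAUSE_MS. */
static long pause_ms(long pages_dirtied, long rate_pages_per_sec)
{
	long ms;

	if (pages_dirtied <= 0)
		return 0;		/* useless call: no pause needed */
	if (rate_pages_per_sec <= 0)
		return MAX_PAUSE_MS;	/* no bandwidth: longest allowed pause */

	ms = pages_dirtied * 1000 / rate_pages_per_sec;
	return ms > MAX_PAUSE_MS ? MAX_PAUSE_MS : ms;
}
```

With background and dirty ratios of 10% and 20%, throttle_start_pct() gives 15 and balance_point_pct() gives 17.5, matching the figures in the message. For a task allowed 25600 pages per second, pages_before_next_call(25600, 10) yields 256 pages, and pause_ms(256, 25600) returns the intended 10 ms pause.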
..
backing-dev.c	writeback: stabilize bdi->dirty_ratelimit	2011-10-03 21:08:57 +08:00
bootmem.c	crash_dump: export is_kdump_kernel to modules, consolidate elfcorehdr_addr, setup_elfcorehdr and saved_max_pfn	2011-03-23 19:47:19 -07:00
bounce.c	bounce: call flush_dcache_page() after bounce_copy_vec()	2010-09-09 18:57:25 -07:00
cleancache.c	mm: cleancache core ops functions and config	2011-05-26 10:01:36 -06:00
compaction.c	mm: compaction: abort compaction if too many pages are isolated and caller is asynchronous V2	2011-06-15 20:04:02 -07:00
debug-pagealloc.c
dmapool.c	devres: fix possible use after free	2011-07-25 20:57:14 -07:00
fadvise.c
failslab.c	fault-injection: add ability to export fault_attr in arbitrary directory	2011-08-03 14:25:20 -10:00
filemap_xip.c	mm: Convert i_mmap_lock to a mutex	2011-05-25 08:39:18 -07:00
filemap.c	mm: account skipped entries to avoid looping in find_get_pages	2011-09-14 18:17:56 -07:00
fremap.c	mm: don't access vm_flags as 'int'	2011-05-26 09:20:31 -07:00
highmem.c	mm: make HASHED_PAGE_VIRTUAL page_address' struct page argument const.	2011-08-17 13:00:20 -07:00
huge_memory.c	mm/huge_memory.c: minor lock simplification in __khugepaged_exit	2011-07-25 20:57:09 -07:00
hugetlb.c	mm: hugetlb: fix coding style issues	2011-07-25 20:57:09 -07:00
hwpoison-inject.c	Fix common misspellings	2011-03-31 11:26:23 -03:00
init-mm.c	atomic: use <linux/atomic.h>	2011-07-26 16:49:47 -07:00
internal.h	mm: nommu: sort mm->mmap list properly	2011-05-25 08:39:05 -07:00
Kconfig	mm Kconfig typo: cleancacne -> cleancache	2011-06-10 14:47:52 +02:00
Kconfig.debug	mm: debug-pagealloc: fix kconfig dependency warning	2011-03-22 17:44:02 -07:00
kmemcheck.c
kmemleak-test.c	kmemleak: remove memset by using kzalloc	2011-01-27 18:31:51 +00:00
kmemleak.c	atomic: use <linux/atomic.h>	2011-07-26 16:49:47 -07:00
ksm.c	ksm: fix NULL pointer dereference in scan_get_next_rmap_item()	2011-06-15 20:04:02 -07:00
maccess.c	maccess,probe_kernel: Make write/read src const void *	2011-05-25 19:56:23 -04:00
madvise.c	fs: kill i_alloc_sem	2011-07-20 20:47:46 -04:00
Makefile	mm: cleancache core ops functions and config	2011-05-26 10:01:36 -06:00
memblock.c	mm/memblock.c: avoid abuse of RED_INACTIVE	2011-07-25 20:57:09 -07:00
memcontrol.c	memcg: Revert "memcg: add memory.vmscan_stat"	2011-09-14 18:09:38 -07:00
memory_hotplug.c	mm: extend memory hotplug API to allow memory hotplug in virtual machines	2011-07-25 20:57:08 -07:00
memory-failure.c	HWPoison: add memory_failure_queue()	2011-08-03 11:15:58 -04:00
memory.c	mm/futex: fix futex writes on archs with SW tracking of dirty & young	2011-07-25 20:57:11 -07:00
mempolicy.c	mm/mempolicy.c: make copy_from_user() provably correct	2011-09-14 18:09:36 -07:00
mempool.c
migrate.c	migrate: don't account swapcache as shmem	2011-06-16 15:01:24 -07:00
mincore.c	mm: clarify the radix_tree exceptional cases	2011-08-03 14:25:24 -10:00
mlock.c	mm: don't access vm_flags as 'int'	2011-05-26 09:20:31 -07:00
mm_init.c
mmap.c	mmap: fix and tidy up overcommit page arithmetic	2011-07-25 20:57:09 -07:00
mmu_context.c
mmu_notifier.c	thp: mmu_notifier_test_young	2011-01-13 17:32:46 -08:00
mmzone.c	mm: page allocator: adjust the per-cpu counter threshold when memory is low	2011-01-13 17:32:31 -08:00
mprotect.c	thp: mprotect: transparent huge page support	2011-01-13 17:32:44 -08:00
mremap.c	mm: Convert i_mmap_lock to a mutex	2011-05-25 08:39:18 -07:00
msync.c
nobootmem.c	memblock/nobootmem: remove unneeded code from alloc_bootmem_node_high()	2011-05-25 08:39:31 -07:00
nommu.c	mmap: fix and tidy up overcommit page arithmetic	2011-07-25 20:57:09 -07:00
oom_kill.c	oom: task->mm == NULL doesn't mean the memory was freed	2011-08-01 15:24:12 -10:00
page_alloc.c	fault-injection: add ability to export fault_attr in arbitrary directory	2011-08-03 14:25:20 -10:00
page_cgroup.c	mm/page_cgroup.c: simplify code by using SECTION_ALIGN_UP() and SECTION_ALIGN_DOWN() macros	2011-07-25 20:57:09 -07:00
page_io.c	block: kill off REQ_UNPLUG	2011-03-10 08:52:27 +01:00
page_isolation.c	mm: page_isolation: codeclean fix comment and rm unneeded val init	2010-10-26 16:52:11 -07:00
page-writeback.c	writeback: IO-less balance_dirty_pages()	2011-10-03 21:08:57 +08:00
pagewalk.c	pagewalk: fix code comment for THP	2011-07-25 20:57:09 -07:00
percpu-km.c	percpu: clear memory allocated with the km allocator	2010-10-02 10:28:42 +03:00
percpu-vm.c	mm: remove gfp mask from pcpu_get_vm_areas	2011-01-13 17:32:34 -08:00
percpu.c	Merge branch 'for-2.6.40' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu	2011-05-24 11:53:42 -07:00
pgtable-generic.c	mm/pgtable-generic.c: fix CONFIG_SWAP=n build	2011-01-26 10:49:58 +10:00
prio_tree.c	sanitize <linux/prefetch.h> usage	2011-05-20 12:50:29 -07:00
quicklist.c
readahead.c	readahead: readahead page allocations are OK to fail	2011-05-25 08:39:25 -07:00
rmap.c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback	2011-07-26 10:39:54 -07:00
shmem.c	mm: clarify the radix_tree exceptional cases	2011-08-03 14:25:24 -10:00
slab.c	slab, lockdep: Annotate the locks before using them	2011-08-04 10:18:00 +02:00
slob.c	atomic: use <linux/atomic.h>	2011-07-26 16:49:47 -07:00
slub.c	slub: add slab with one free object to partial list tail	2011-08-27 11:58:59 +03:00
sparse-vmemmap.c	tree-wide: fix comment/printk typos	2010-11-01 15:38:34 -04:00
sparse.c	mm: make some struct page's const	2011-07-25 20:57:07 -07:00
swap_state.c	block: remove per-queue plugging	2011-03-10 08:52:07 +01:00
swap.c	mm: batch activate_page() to reduce lock contention	2011-05-25 08:39:37 -07:00
swapfile.c	mm: let swap use exceptional entries	2011-08-03 14:25:22 -10:00
thrash.c	mm: swap-token: add a comment for priority aging	2011-07-25 20:57:08 -07:00
truncate.c	mm: a few small updates for radix-swap	2011-08-03 14:25:24 -10:00
util.c	mm: nommu: sort mm->mmap list properly	2011-05-25 08:39:05 -07:00
vmalloc.c	mm: sync vmalloc address space page tables in alloc_vm_area()	2011-09-14 18:09:38 -07:00
vmscan.c	memcg: Revert "memcg: add memory.vmscan_stat"	2011-09-14 18:09:38 -07:00
vmstat.c	numa: fix NUMA compile error when sysfs and procfs are disabled	2011-09-14 18:09:37 -07:00