linux

Author	SHA1	Message	Date
Linus Torvalds	26bb0d3f38	for-6.12/block-20240913 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmbkZhQQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpjOKD/0fzd4yOcqxSI9W3OLGd04VrOTJIQa4CRbV GmoTq39pOeIDVGug5ekkTpqqHHnuGk+nQhCzD9vsN/eTmC7yZOIr847O2aWzvYEn PzFRgmJpoo2E9sr/IsTR5LnJjbaIZhQVkqLH6ZOj9tpKlVwN2SK0nIRVNrAi5zgT MaDrto/2OUld+vmA99Rgb23jxM6UBdCPIjuiVa+11Vg9Z3D1tWbBmrsG7OMysyIf FbASBeKHqFSO61/ipFCZv6VV1X8zoWEVyT8n4A1yUbbN5rLzPgoQJVbfSqQRXIdr cdrKeCbKxl+joSgKS6LKpvnfwRgGF+hgAfpZg4c0vrbZGTQcRhhLFECyh/aVI08F p5TOMArhVaX59664gHgSPq4KnGTXOO29dot9N3Jya/ZQnxinjY9r+GVOfLuduPPy 1B04vab8oAsk4zK7fZbkDxgYUyifwzK/vQ6OqYq2mYdpdIS/AE7T2ou61Bz5mI7I /BuucNV0Z96OKlyLEXwXXZjZgNu1TFcq6ARIBJ8L08PY64Fesj5BXabRyXkeNH26 0exyz9heeJs6OwRGfngXmS24tDSS0k74CeZX3KoePNj69u6KCn346KiU1qgntwwD E5F7AEHqCl5FjUEIWB4M1EPlfA8U0MzOL+tkx2xKJAjsU60wAy7jRSyOIcqodpMs 6UlPcJzgYg== =uuLl -----END PGP SIGNATURE----- Merge tag 'for-6.12/block-20240913' of git://git.kernel.dk/linux Pull block updates from Jens Axboe: - MD changes via Song: - md-bitmap refactoring (Yu Kuai) - raid5 performance optimization (Artur Paszkiewicz) - Other small fixes (Yu Kuai, Chen Ni) - Add a sysfs entry 'new_level' (Xiao Ni) - Improve information reported in /proc/mdstat (Mateusz Kusiak) - NVMe changes via Keith: - Asynchronous namespace scanning (Stuart) - TCP TLS updates (Hannes) - RDMA queue controller validation (Niklas) - Align field names to the spec (Anuj) - Metadata support validation (Puranjay) - A syntax cleanup (Shen) - Fix a Kconfig linking error (Arnd) - New queue-depth quirk (Keith) - Add missing unplug trace event (Keith) - blk-iocost fixes (Colin, Konstantin) - t10-pi modular removal and fixes (Alexey) - Fix for potential BLKSECDISCARD overflow (Alexey) - bio splitting cleanups and fixes (Christoph) - Deal with folios rather than rather than pages, speeding up how the block layer handles bigger IOs (Kundan) - Use spinlocks rather than bit spinlocks in zram (Sebastian, Mike) - Reduce zoned device overhead in ublk (Ming) - Add and use sendpages_ok() for drbd and nvme-tcp (Ofir) - Fix regression in partition error pointer checking (Riyan) - Add support for write zeroes and rotational status in nbd (Wouter) - Add Yu Kuai as new BFQ maintainer. The scheduler has been unmaintained for quite a while. - Various sets of fixes for BFQ (Yu Kuai) - Misc fixes and cleanups (Alvaro, Christophe, Li, Md Haris, Mikhail, Yang) * tag 'for-6.12/block-20240913' of git://git.kernel.dk/linux: (120 commits) nvme-pci: qdepth 1 quirk block: fix potential invalid pointer dereference in blk_add_partition blk_iocost: make read-only static array vrate_adj_pct const block: unpin user pages belonging to a folio at once mm: release number of pages of a folio block: introduce folio awareness and add a bigger size from folio block: Added folio-ized version of bio_add_hw_page() block, bfq: factor out a helper to split bfqq in bfq_init_rq() block, bfq: remove local variable 'bfqq_already_existing' in bfq_init_rq() block, bfq: remove local variable 'split' in bfq_init_rq() block, bfq: remove bfq_log_bfqg() block, bfq: merge bfq_release_process_ref() into bfq_put_cooperator() block, bfq: fix procress reference leakage for bfqq in merge chain block, bfq: fix uaf for accessing waker_bfqq after splitting blk-throttle: support prioritized processing of metadata blk-throttle: remove last_low_overflow_time drbd: Add NULL check for net_conf to prevent dereference in state validation nvme-tcp: fix link failure for TCP auth blk-mq: add missing unplug trace event mtip32xx: Remove redundant null pointer checks in mtip_hw_debugfs_init() ...	2024-09-16 13:33:06 +02:00
Yu Kuai	29390bb566	blk-throttle: support prioritized processing of metadata Currently, blk-throttle handle all IO fifo, hence if data IO is throttled and then meta IO is dispatched, the meta IO will have to wait for the data IO, causing priority inversion problems. This patch support to handle metadata first and then pay debt while throttling data. Test script: use cgroup v1 to throttle root cgroup, then create new dir and file while write back is throttled test() { mkdir /mnt/test/xxx touch /mnt/test/xxx/1 sync /mnt/test/xxx sync /mnt/test/xxx } mkfs.ext4 -F /dev/nvme0n1 -E lazy_itable_init=0,lazy_journal_init=0 mount /dev/nvme0n1 /mnt/test echo "259:0 $((1024*1024))" > /sys/fs/cgroup/blkio/blkio.throttle.write_bps_device dd if=/dev/zero of=/mnt/test/foo1 bs=16M count=1 conv=fdatasync status=none & sleep 4 time test echo "259:0 0" > /sys/fs/cgroup/blkio/blkio.throttle.write_bps_device sleep 1 umount /dev/nvme0n1 Test result: time cost for creating new dir and file before this patch: 14s after this patch: 0.1s Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20240903135149.271857-3-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-09-10 16:31:41 -06:00
Yu Kuai	3bf73e6283	blk-throttle: remove last_low_overflow_time last_low_overflow_time is not used anymore after commit `bf20ab538c` ("blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW"). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20240903135149.271857-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-09-10 16:31:41 -06:00
Dr. David Alan Gilbert	01aa8c869d	blk-throttle: remove more latency dead-code The struct 'latency_bucket' and the #define 'request_bucket_index' are unused since commit `bf20ab538c` ("blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW") and the 'LATENCY_BUCKET_SIZE' #define was only used by the 'request_bucket_index' define. Remove them. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Link: https://lore.kernel.org/r/20240727155824.1000042-1-linux@treblig.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-07-27 10:33:25 -06:00
Yu Kuai	1beabab88e	blk-throttle: fix lower control under super low iops limit User will configure allowed iops limit in 1s, and calculate_io_allowed() will calculate allowed iops in the slice by: limit * HZ / throtl_slice However, if limit is quite low, the result can be 0, then allowed IO in the slice is 0, this will cause missing dispatch and control will be lower than limit. For example, set iops_limit to 5 with HD disk, and test will found that iops will be 3. This is usually not a big deal, because user will unlikely to configure such low iops limit, however, this is still a problem in the extreme scene. Fix the problem by making sure the wait time calculated by tg_within_iops_limit() should allow at least one IO to be dispatched. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20240618062108.3680835-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-28 14:55:02 -06:00
Waiman Long	0a751df456	blk-throttle: Fix incorrect display of io.max Commit `bf20ab538c` ("blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW") attempts to revert the code change introduced by commit `cd5ab1b0fc` ("blk-throttle: add .low interface"). However, it leaves behind the bps_conf[] and iops_conf[] fields in the throtl_grp structure which aren't set anywhere in the new blk-throttle.c code but are still being used by tg_prfill_limit() to display the limits in io.max. Now io.max always displays the following values if a block queue is used: <m>:<n> rbps=0 wbps=0 riops=0 wiops=0 Fix this problem by removing bps_conf[] and iops_conf[] and use bps[] and iops[] instead to complete the revert. Fixes: `bf20ab538c` ("blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW") Reported-by: Justin Forbes <jforbes@redhat.com> Closes: https://github.com/containers/podman/issues/22701#issuecomment-2120627789 Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240530134547.970075-1-longman@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-30 19:44:29 -06:00
Dr. David Alan Gilbert	4a482e691c	blk-throttle: remove unused struct 'avg_latency_bucket' 'avg_latency_bucket' is unused since commit `bf20ab538c` ("blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW") Remove it. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Link: https://lore.kernel.org/r/20240522172458.334173-1-linux@treblig.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-22 11:30:47 -06:00
Yu Kuai	a3166c5170	blk-throttle: delay initialization until configuration Other cgroup policy like bfq, iocost are lazy-initialized when they are configured for the first time for the device, but blk-throttle is initialized unconditionally from blkcg_init_disk(). Delay initialization of blk-throttle as well, to save some cpu and memory overhead if it's not configured. Noted that once it's initialized, it can't be destroyed until disk removal, even if it's disabled. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240509121107.3195568-3-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-09 09:44:56 -06:00
Yu Kuai	bf20ab538c	blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW One the one hand, it's marked EXPERIMENTAL since 2017, and looks like there are no users since then, and no testers and no developers, it's just not active at all. On the other hand, even if the config is disabled, there are still many fields in throtl_grp and throtl_data and many functions that are only used for throtl low. At last, currently blk-throtl is initialized during disk initialization, and destroyed during disk removal, and it exposes many functions to be called directly from block layer. Remove throtl low to make code much more cleaner and follow up work much easier. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20240509121107.3195568-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-09 09:44:55 -06:00
John Garry	d3a3a086ad	blk-throttle: Only use seq_printf() in tg_prfill_limit() Currently tg_prfill_limit() uses a combination of snprintf() and strcpy() to generate the values parts of the limits string, before passing them as arguments to seq_printf(). Convert to use only a sequence of seq_printf() calls per argument, which is simpler. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240327094020.3505514-1-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-01 11:53:37 -06:00
Tang Yizhou	3bca7640b4	blk-throttle: Eliminate redundant checks for data direction After calling throtl_peek_queued(), the data direction can be determined so there is no need to call bio_data_dir() to check the direction again. Signed-off-by: Tang Yizhou <yizhou.tang@shopee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240123081248.3752878-1-yizhou.tang@shopee.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-02-05 10:16:12 -07:00
Jens Axboe	08420cf70c	block: add blk_time_get_ns() and blk_time_get() helpers Convert any user of ktime_get_ns() to use blk_time_get_ns(), and ktime_get() to blk_time_get(), so we have a unified API for querying the current time in nanoseconds or as ktime. No functional changes intended, this patch just wraps ktime_get_ns() and ktime_get() with a block helper. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-02-05 10:07:22 -07:00
Ming Lei	27b13e209d	blk-throttle: fix lockdep warning of "cgroup_mutex or RCU read lock required!" Inside blkg_for_each_descendant_pre(), both css_for_each_descendant_pre() and blkg_lookup() requires RCU read lock, and either cgroup_assert_mutex_or_rcu_locked() or rcu_read_lock_held() is called. Fix the warning by adding rcu read lock. Reported-by: Changhui Zhong <czhong@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20231117023527.3188627-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-11-17 10:48:58 -07:00
Khazhismel Kumykov	2dd710d476	blk-throttle: check for overflow in calculate_bytes_allowed Inexact, we may reject some not-overflowing values incorrectly, but they'll be on the order of exabytes allowed anyways. This fixes divide error crash on x86 if bps_limit is not configured or is set too high in the rare case that jiffy_elapsed is greater than HZ. Fixes: `e8368b57c0` ("blk-throttle: use calculate_io/bytes_allowed() for throtl_trim_slice()") Fixes: `8d6bbaada2` ("blk-throttle: prevent overflow while calculating wait time") Signed-off-by: Khazhismel Kumykov <khazhy@google.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20231020223617.2739774-1-khazhy@google.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-10-20 18:38:17 -06:00
Yu Kuai	eead005664	blk-throttle: consider 'carryover_ios/bytes' in throtl_trim_slice() Currently, 'carryover_ios/bytes' is not handled in throtl_trim_slice(), for consequence, 'carryover_ios/bytes' will be used to throttle bio multiple times, for example: 1) set iops limit to 100, and slice start is 0, slice end is 100ms; 2) current time is 0, and 10 ios are dispatched, those io won't be throttled and io_disp is 10; 3) still at current time 0, update iops limit to 1000, carryover_ios is updated to (0 - 10) = -10; 4) in this slice(0 - 100ms), io_allowed = 100 + (-10) = 90, which means only 90 ios can be dispatched without waiting; 5) assume that io is throttled in slice(0 - 100ms), and throtl_trim_slice() update silce to (100ms - 200ms). In this case, 'carryover_ios/bytes' is not cleared and still only 90 ios can be dispatched between 100ms - 200ms. Fix this problem by updating 'carryover_ios/bytes' in throtl_trim_slice(). Fixes: `a880ae93e5` ("blk-throttle: fix io hung due to configuration updates") Reported-by: zhuxiaohui <zhuxiaohui.400@bytedance.com> Link: https://lore.kernel.org/all/20230812072116.42321-1-zhuxiaohui.400@bytedance.com/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230816012708.1193747-5-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-08-30 10:15:01 -06:00
Yu Kuai	e8368b57c0	blk-throttle: use calculate_io/bytes_allowed() for throtl_trim_slice() There are no functional changes, just make the code cleaner. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230816012708.1193747-4-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-08-30 10:15:01 -06:00
Yu Kuai	bb8d5587bd	blk-throttle: fix wrong comparation while 'carryover_ios/bytes' is negative carryover_ios/bytes[] can be negative in the case that ios are dispatched in the slice in advance, and then configuration is updated. For example: 1) set iops limit to 1000, and slice start is 0, slice end is 100ms; 2) current time is 0, and 100 ios are dispatched, those ios will not be throttled, hence io_disp is 100; 3) still at current time 0, update iops limit to 100, then carryover_ios is (0 - 100) = -100; 4) then, dispatch a new io at time 0, the expected result is that this io will wait for 1s. The calculation in tg_within_iops_limit: io_disp = 0; io_allowed = calculate_io_allowed + carryover_ios = 10 + (-100) = -90; io won't be throttled if (io_disp + 1 < io_allowed) passed. Before this patch, in step 4) (io_disp + 1 < io_allowed) is passed, because -90 for unsigned value is very huge, and such io won't be throttled. Fix this problem by checking if 'io/bytes_allowed' is negative first. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230816012708.1193747-3-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-08-30 10:15:01 -06:00
Yu Kuai	ef100397fa	blk-throttle: print signed value 'carryover_bytes/ios' for user 'carryover_bytes/ios' can be negative, indicate that some bio is dispatched in advance within slice while configuration is updated. Print a huge value is not user-friendly. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230816012708.1193747-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-08-30 10:15:01 -06:00
Jinke Han	ad7c3b41e8	blk-throttle: Fix io statistics for cgroup v1 After commit `f382fb0bce` ("block: remove legacy IO schedulers"), blkio.throttle.io_serviced and blkio.throttle.io_service_bytes become the only stable io stats interface of cgroup v1, and these statistics are done in the blk-throttle code. But the current code only counts the bios that are actually throttled. When the user does not add the throttle limit, the io stats for cgroup v1 has nothing. I fix it according to the statistical method of v2, and made it count all ios accurately. Fixes: `a7b36ee6ba` ("block: move blk-throtl fast path inline") Tested-by: Andrea Righi <andrea.righi@canonical.com> Signed-off-by: Jinke Han <hanjinke.666@bytedance.com> Acked-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230507170631.89607-1-hanjinke.666@bytedance.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-25 08:00:39 -06:00
Chengming Zhou	8e15dfbd9a	blk-throttle: only enable blk-stat when BLK_DEV_THROTTLING_LOW blk_throtl_register() will unconditionally enable blk-stat for gendisk when register, even when we have no BLK_DEV_THROTTLING_LOW config. Since the kernel always has only BLK_DEV_THROTTLING config and the BLK_DEV_THROTTLING_LOW config is still in EXPERIMENTAL state, we can just skip blk-stat when !BLK_DEV_THROTTLING_LOW. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230413062805.2081970-2-chengming.zhou@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-13 06:48:11 -06:00
Tejun Heo	faffaab289	blkcg: Restructure blkg_conf_prep() and friends We want to support lazy init of rq-qos policies so that iolatency is enabled lazily on configuration instead of gendisk initialization. The way blkg config helpers are structured now is a bit awkward for that. Let's restructure: * blkcg_conf_open_bdev() is renamed to blkg_conf_open_bdev(). The blkcg_ prefix was used because the bdev opening step is blkg-independent. However, the distinction is too subtle and confuses more than helps. Let's switch to blkg prefix so that it's consistent with the type and other helper names. * struct blkg_conf_ctx now remembers the original input string and is always initialized by the new blkg_conf_init(). * blkg_conf_open_bdev() is updated to take a pointer to blkg_conf_ctx like blkg_conf_prep() and can be called multiple times safely. Instead of modifying the double pointer to input string directly, blkg_conf_open_bdev() now sets blkg_conf_ctx->body. * blkg_conf_finish() is renamed to blkg_conf_exit() for symmetry and now must be called on all blkg_conf_ctx's which were initialized with blkg_conf_init(). Combined, this allows the users to either open the bdev first or do it altogether with blkg_conf_prep() which will help implementing lazy init of rq-qos policies. blkg_conf_init/exit() will also be used implement synchronization against device removal. This is necessary because iolat / iocost are configured through cgroupfs instead of one of the files under /sys/block/DEVICE. As cgroupfs operations aren't synchronized with block layer, the lazy init and other configuration operations may race against device removal. This patch makes blkg_conf_init/exit() used consistently for all cgroup-orginating configurations making them a good place to implement explicit synchronization. Users are updated accordingly. No behavior change is intended by this patch. v2: bfq wasn't updated in v1 causing a build error. Fixed. v3: Update the description to include future use of blkg_conf_init/exit() as synchronization points. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Yu Kuai <yukuai1@huaweicloud.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230413000649.115785-3-tj@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-13 06:46:49 -06:00
Christoph Hellwig	a06377c5d0	Revert "blk-cgroup: pin the gendisk in struct blkcg_gq" This reverts commit `84d7d462b1`. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230214183308.1658775-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-14 14:24:09 -07:00
Christoph Hellwig	b4e94f9c2c	Revert "blk-cgroup: delay calling blkcg_exit_disk until disk_release" This reverts commit `c43332fe02` as it is not needed without moving to disk references in the blkg. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230214183308.1658775-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-14 14:24:09 -07:00
Christoph Hellwig	1231039db3	Revert "blk-cgroup: move the cgroup information to struct gendisk" This reverts commit `3f13ab7c80` as a patch it depends on caused a few problems. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230214183308.1658775-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-14 14:24:09 -07:00
Christoph Hellwig	c43332fe02	blk-cgroup: delay calling blkcg_exit_disk until disk_release While del_gendisk ensures there is no outstanding I/O on the queue, it can't prevent block layer users from building new I/O. This leads to a NULL ->root_blkg reference in bio_associate_blkg when allocating a new bio on a shut down file system. Delay freeing the blk-cgroup subsystems from del_gendisk until disk_release to make sure the blkg and throttle information is still avaіlable for bio submitters, even if those bios will immediately fail. This now can cause a case where disk_release is called on a disk that hasn't been added. That's mostly harmless, except for a case in blk_throttl_exit that now needs to check for a NULL ->td pointer. Fixes: `178fa7d498` ("blk-cgroup: delay blk-cgroup initialization until add_disk") Reported-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230208063514.171485-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-09 08:10:45 -07:00
Christoph Hellwig	3f13ab7c80	blk-cgroup: move the cgroup information to struct gendisk cgroup information only makes sense on a live gendisk that allows file system I/O (which includes the raw block device). So move over the cgroup related members. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230203150400.3199230-20-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-03 08:20:05 -07:00
Christoph Hellwig	0a0b4f79db	blk-cgroup: pass a gendisk to pd_alloc_fn No need to the request_queue here, pass a gendisk and extract the node ids from that. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230203150400.3199230-18-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-03 08:20:05 -07:00
Christoph Hellwig	40e4996ec0	blk-cgroup: pass a gendisk to blkcg_{de,}activate_policy Prepare for storing the blkcg information in the gendisk instead of the request_queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230203150400.3199230-17-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-03 08:20:05 -07:00
Christoph Hellwig	84d7d462b1	blk-cgroup: pin the gendisk in struct blkcg_gq Currently each blkcg_gq holds a request_queue reference, which is what is used in the policies. But a lot of these interfaces will move over to use a gendisk, so store a disk in struct blkcg_gq and hold a reference to it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230203150400.3199230-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-03 08:20:05 -07:00
Kemeng Shi	eea3e8b74a	blk-throttle: Use more suitable time_after check for update of slice_start There is no need to update tg->slice_start[rw] to start when they are equal already. So remove "eq" part of check before update slice_start. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Kemeng Shi <shikemeng@huawei.com> Link: https://lore.kernel.org/r/20221205115709.251489-10-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-12-05 13:45:31 -07:00
Kemeng Shi	9c9f209d9d	blk-throttle: remove repeat check of elapsed time There is no need to check elapsed time from last upgrade for each node in hierarchy. Move this check before traversing as throtl_can_upgrade do to remove repeat check. Reported-by: kernel test robot <lkp@intel.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Kemeng Shi <shikemeng@huawei.com> Link: https://lore.kernel.org/r/20221205115709.251489-9-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-12-05 13:45:11 -07:00
Kemeng Shi	e3031d4c7d	blk-throttle: remove incorrect comment for tg_last_low_overflow_time Function tg_last_low_overflow_time is called with intermediate node as following: throtl_hierarchy_can_downgrade throtl_tg_can_downgrade tg_last_low_overflow_time throtl_hierarchy_can_upgrade throtl_tg_can_upgrade tg_last_low_overflow_time throtl_hierarchy_can_downgrade/throtl_hierarchy_can_upgrade will traverse from leaf node to sub-root node and pass traversed intermediate node to tg_last_low_overflow_time. No such limit could be found from context and implementation of tg_last_low_overflow_time, so remove this limit in comment. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Kemeng Shi <shikemeng@huawei.com> Link: https://lore.kernel.org/r/20221205115709.251489-8-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-12-05 13:45:05 -07:00
Kemeng Shi	009df34171	blk-throttle: fix typo in comment of throtl_adjusted_limit lapsed time -> elapsed time Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Kemeng Shi <shikemeng@huawei.com> Link: https://lore.kernel.org/r/20221205115709.251489-7-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-12-05 13:44:59 -07:00
Kemeng Shi	a4d508e333	blk-throttle: simpfy low limit reached check in throtl_tg_can_upgrade Commit `c79892c557` ("blk-throttle: add upgrade logic for LIMIT_LOW state") added upgrade logic for low limit and methioned that 1. "To determine if a cgroup exceeds its limitation, we check if the cgroup has pending request. Since cgroup is throttled according to the limit, pending request means the cgroup reaches the limit." 2. "If a cgroup has limit set for both read and write, we consider the combination of them for upgrade. The reason is read IO and write IO can interfere with each other. If we do the upgrade based in one direction IO, the other direction IO could be severly harmed." Besides, we also determine that cgroup reaches low limit if low limit is 0, see comment in throtl_tg_can_upgrade. Collect the information above, the desgin of upgrade check is as following: 1.The low limit is reached if limit is zero or io is already queued. 2.Cgroup will pass upgrade check if low limits of READ and WRITE are both reached. Simpfy the check code described above to removce repeat check and improve readability. There is no functional change. Detail equivalence proof is as following: All replaced conditions to return true are as following: condition 1 (!read_limit && !write_limit) condition 2 read_limit && sq->nr_queued[READ] && (!write_limit \|\| sq->nr_queued[WRITE]) condition 3 write_limit && sq->nr_queued[WRITE] && (!read_limit \|\| sq->nr_queued[READ]) Transferring condition 2 as following: (read_limit && sq->nr_queued[READ]) && (!write_limit \|\| sq->nr_queued[WRITE]) is equivalent to (read_limit && sq->nr_queued[READ]) && (!write_limit \|\| (write_limit && sq->nr_queued[WRITE])) is equivalent to condition 2.1 (read_limit && sq->nr_queued[READ] && !write_limit) \|\| condition 2.2 (read_limit && sq->nr_queued[READ] && (write_limit && sq->nr_queued[WRITE])) Transferring condition 3 as following: write_limit && sq->nr_queued[WRITE] && (!read_limit \|\| sq->nr_queued[READ]) is equivalent to (write_limit && sq->nr_queued[WRITE]) && (!read_limit \|\| (read_limit && sq->nr_queued[READ])) is equivalent to condition 3.1 ((write_limit && sq->nr_queued[WRITE]) && !read_limit) \|\| condition 3.2 ((write_limit && sq->nr_queued[WRITE]) && (read_limit && sq->nr_queued[READ])) Condition 3.2 is the same as condition 2.2, so all conditions we get to return are as following: (!read_limit && !write_limit) (1) (!read_limit && (write_limit && sq->nr_queued[WRITE])) (3.1) ((read_limit && sq->nr_queued[READ]) && !write_limit) (2.1) ((write_limit && sq->nr_queued[WRITE]) && (read_limit && sq->nr_queued[READ])) (2.2) As we can extract conditions "(a1 \|\| a2) && (b1 \|\| b2)" to: a1 && b1 a1 && b2 a2 && b1 ab && b2 Considering that: a1 = !read_limit a2 = read_limit && sq->nr_queued[READ] b1 = !write_limit b2 = write_limit && sq->nr_queued[WRITE] We can pack replaced conditions to (!read_limit \|\| (read_limit && sq->nr_queued[READ])) && (!write_limit \|\| (write_limit && sq->nr_queued[WRITE])) which is equivalent to (!read_limit \|\| sq->nr_queued[READ]) && (!write_limit \|\| sq->nr_queued[WRITE]) Reported-by: kernel test robot <lkp@intel.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Kemeng Shi <shikemeng@huawei.com> Link: https://lore.kernel.org/r/20221205115709.251489-6-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-12-05 13:44:48 -07:00
Kemeng Shi	183daeb11d	blk-throttle: correct calculation of wait time in tg_may_dispatch In C language, When executing "if (expression1 && expression2)" and expression1 return false, the expression2 may not be executed. For "tg_within_bps_limit(tg, bio, bps_limit, &bps_wait) && tg_within_iops_limit(tg, bio, iops_limit, &iops_wait))", if bps is limited, tg_within_bps_limit will return false and tg_within_iops_limit will not be called. So even bps and iops are both limited, iops_wait will not be calculated and is always zero. So wait time of iops is always ignored. Fix this by always calling tg_within_bps_limit and tg_within_iops_limit to get wait time for both bps and iops. Observed that: 1. Wait time in tg_within_iops_limit/tg_within_bps_limit need always be stored as wait argument is always passed. 2. wait time is stored to zero if iops/bps is limited otherwise non-zero is stored. Simpfy tg_within_iops_limit/tg_within_bps_limit by removing wait argument and return wait time directly. Caller tg_may_dispatch checks if wait time is zero to find if iops/bps is limited. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Kemeng Shi <shikemeng@huawei.com> Link: https://lore.kernel.org/r/20221205115709.251489-5-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-12-05 13:44:42 -07:00
Kemeng Shi	eb18479182	blk-throttle: ignore cgroup without io queued in blk_throtl_cancel_bios Ignore cgroup without io queued in blk_throtl_cancel_bios for two reasons: 1. Save cpu cycle for trying to dispatch cgroup which is no io queued. 2. Avoid non-consistent state that cgroup is inserted to service queue without THROTL_TG_PENDING set as tg_update_disptime will unconditional re-insert cgroup to service queue. If we are on the default hierarchy, IO dispatched from child in tg_dispatch_one_bio will trigger inserting cgroup to service queue without erase first and ruin the tree. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Kemeng Shi <shikemeng@huawei.com> Link: https://lore.kernel.org/r/20221205115709.251489-4-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-12-05 13:44:34 -07:00
Kemeng Shi	84aca0a7e0	blk-throttle: Fix that bps of child could exceed bps limited in parent Consider situation as following (on the default hierarchy): HDD \| root (bps limit: 4k) \| child (bps limit :8k) \| fio bs=8k Rate of fio is supposed to be 4k, but result is 8k. Reason is as following: Size of single IO from fio is larger than bytes allowed in one throtl_slice in child, so IOs are always queued in child group first. When queued IOs in child are dispatched to parent group, BIO_BPS_THROTTLED is set and these IOs will not be limited by tg_within_bps_limit anymore. Fix this by only set BIO_BPS_THROTTLED when the bio traversed the entire tree. There patch has no influence on situation which is not on the default hierarchy as each group is a single root group without parent. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Kemeng Shi <shikemeng@huawei.com> Link: https://lore.kernel.org/r/20221205115709.251489-3-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-12-05 13:44:27 -07:00
Kemeng Shi	f56019aef3	blk-throttle: correct stale comment in throtl_pd_init On the default hierarchy (cgroup2), the throttle interface files don't exist in the root cgroup, so the ablity to limit the whole system by configuring root group is not existing anymore. In general, cgroup doesn't wanna be in the business of restricting resources at the system level, so correct the stale comment that we can limit whole system to we can only limit subtree. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Kemeng Shi <shikemeng@huawei.com> Link: https://lore.kernel.org/r/20221205115709.251489-2-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-12-05 13:44:19 -07:00
Christoph Hellwig	cad9266abc	blk-throttle: pass a gendisk to blk_throtl_cancel_bios Pass the gendisk to blk_throtl_cancel_bios as part of moving the blk-cgroup infrastructure to be gendisk based. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220921180501.1539876-15-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-26 19:17:28 -06:00
Christoph Hellwig	5f6dc7522a	blk-throttle: pass a gendisk to blk_throtl_register_queue Pass the gendisk to blk_throtl_register_queue as part of moving the blk-cgroup infrastructure to be gendisk based. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220921180501.1539876-14-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-26 19:17:27 -06:00
Christoph Hellwig	e13793bae6	blk-throttle: pass a gendisk to blk_throtl_init and blk_throtl_exit Pass the gendisk to blk_throtl_init and blk_throtl_exit as part of moving the blk-cgroup infrastructure to be gendisk based. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220921180501.1539876-13-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-26 19:17:27 -06:00
Yu Kuai	81c7a63abc	blk-throttle: improve bypassing bios checkings "tg->has_rules" is extended to "tg->has_rules_iops/bps", thus bios that don't need to be throttled can be checked accurately. With this patch, bio will be throttled if: 1) Bio is read/write, and corresponding read/write iops limit exist. 2) If corresponding doesn't exist, corresponding bps limit exist and bio is not throttled before. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220921095309.1481289-3-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-24 08:59:43 -06:00
Yu Kuai	8549674990	blk-throttle: remove THROTL_TG_HAS_IOPS_LIMIT Currently, "tg->has_rules" and "tg->flags & THROTL_TG_HAS_IOPS_LIMIT" both try to bypass bios that don't need to be throttled, however, they are a little redundant and both not perfect: 1) "tg->has_rules" only distinguish read and write, but not iops and bps limit. 2) "tg->flags & THROTL_TG_HAS_IOPS_LIMIT" only check if iops limit exist, read and write is not distinguished, and bps limit is not checked. tg->has_rules will extended to distinguish bps and iops in the following patch. There is no need to keep the flag. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220921095309.1481289-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-24 08:59:43 -06:00
Yu Kuai	c013710e1a	blk-throttle: cleanup tg_update_disptime() tg_update_disptime() only need to adjust postion for 'tg' in 'parent_sq', there is no need to call throtl_enqueue/dequeue_tg(), since they will set/clear flag THROTL_TG_PENDING and increase/decrease nr_pending, which is useless. By the way, clear the flag/decrease nr_pending while there are still throttled bios is not good for debugging. There are no functional changes. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220827101637.1775111-4-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-12 00:20:08 -06:00
Yu Kuai	8c25ed0cb9	blk-throttle: calling throtl_dequeue/enqueue_tg in pairs It's a little weird to call throtl_dequeue_tg() unconditionally in throtl_select_dispatch(), since it will be called in tg_update_disptime() again if some bio is still throttled. Thus call it later if there are no throttled bio. There are no functional changes. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220827101637.1775111-3-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-12 00:20:07 -06:00
Yu Kuai	7e9c5c54d4	blk-throttle: use 'READ/WRITE' instead of '0/1' Make the code easier to read, like everywhere else. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220827101637.1775111-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-12 00:20:07 -06:00
Yu Kuai	a880ae93e5	blk-throttle: fix io hung due to configuration updates If new configuration is submitted while a bio is throttled, then new waiting time is recalculated regardless that the bio might already wait for some time: tg_conf_updated throtl_start_new_slice tg_update_disptime throtl_schedule_next_dispatch Then io hung can be triggered by always submmiting new configuration before the throttled bio is dispatched. Fix the problem by respecting the time that throttled bio already waited. In order to do that, add new fields to record how many bytes/io are waited, and use it to calculate wait time for throttled bio under new configuration. Some simple test: 1) cd /sys/fs/cgroup/blkio/ echo $$ > cgroup.procs echo "8:0 2048" > blkio.throttle.write_bps_device { sleep 2 echo "8:0 1024" > blkio.throttle.write_bps_device } & dd if=/dev/zero of=/dev/sda bs=8k count=1 oflag=direct 2) cd /sys/fs/cgroup/blkio/ echo $$ > cgroup.procs echo "8:0 1024" > blkio.throttle.write_bps_device { sleep 4 echo "8:0 2048" > blkio.throttle.write_bps_device } & dd if=/dev/zero of=/dev/sda bs=8k count=1 oflag=direct test results: io finish time before this patch with this patch 1) 10s 6s 2) 8s 6s Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220829022240.3348319-5-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-12 00:19:48 -06:00
Yu Kuai	681cd46fff	blk-throttle: factor out code to calculate ios/bytes_allowed No functional changes, new apis will be used in later patches to calculate wait time for throttled bios when new configuration is submitted. Noted this patch also rename tg_with_in_iops/bps_limit() to tg_within_iops/bps_limit(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220829022240.3348319-4-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-12 00:19:48 -06:00
Yu Kuai	8d6bbaada2	blk-throttle: prevent overflow while calculating wait time There is a problem found by code review in tg_with_in_bps_limit() that 'bps_limit * jiffy_elapsed_rnd' might overflow. Fix the problem by calling mul_u64_u64_div_u64() instead. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220829022240.3348319-3-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-12 00:19:48 -06:00
Yu Kuai	320fb0f91e	blk-throttle: fix that io throttle can only work for single bio Test scripts: cd /sys/fs/cgroup/blkio/ echo "8:0 1024" > blkio.throttle.write_bps_device echo $$ > cgroup.procs dd if=/dev/zero of=/dev/sda bs=10k count=1 oflag=direct & dd if=/dev/zero of=/dev/sda bs=10k count=1 oflag=direct & Test result: 10240 bytes (10 kB, 10 KiB) copied, 10.0134 s, 1.0 kB/s 10240 bytes (10 kB, 10 KiB) copied, 10.0135 s, 1.0 kB/s The problem is that the second bio is finished after 10s instead of 20s. Root cause: 1) second bio will be flagged: __blk_throtl_bio while (true) { ... if (sq->nr_queued[rw]) -> some bio is throttled already break }; bio_set_flag(bio, BIO_THROTTLED); -> flag the bio 2) flagged bio will be dispatched without waiting: throtl_dispatch_tg tg_may_dispatch tg_with_in_bps_limit if (bps_limit == U64_MAX \|\| bio_flagged(bio, BIO_THROTTLED)) *wait = 0; -> wait time is zero return true; commit `9f5ede3c01` ("block: throttle split bio in case of iops limit") support to count split bios for iops limit, thus it adds flagged bio checking in tg_with_in_bps_limit() so that split bios will only count once for bps limit, however, it introduce a new problem that io throttle won't work if multiple bios are throttled. In order to fix the problem, handle iops/bps limit in different ways: 1) for iops limit, there is no flag to record if the bio is throttled, and iops is always applied. 2) for bps limit, original bio will be flagged with BIO_BPS_THROTTLED, and io throttle will ignore bio with the flag. Noted this patch also remove the code to set flag in __bio_clone(), it's introduced in commit `111be88398` ("block-throttle: avoid double charge"), and author thinks split bio can be resubmited and throttled again, which is wrong because split bio will continue to dispatch from caller. Fixes: `9f5ede3c01` ("block: throttle split bio in case of iops limit") Cc: <stable@vger.kernel.org> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220829022240.3348319-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-12 00:19:48 -06:00

1 2 3 4 5 ...

302 Commits