License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boilerplate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information.
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few thousand files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier                               # files
-----------------------------------------------------|-------
GPL-2.0                                                 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note", otherwise it was "GPL-2.0". The results were:
SPDX license identifier                               # files
-----------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note                           930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier                               # files
-----------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note                           270
GPL-2.0+ WITH Linux-syscall-note                          169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)        21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)        17
LGPL-2.1+ WITH Linux-syscall-note                          15
GPL-1.0+ WITH Linux-syscall-note                           14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)        5
LGPL-2.0+ WITH Linux-syscall-note                           4
LGPL-2.1 WITH Linux-syscall-note                            3
((GPL-2.0 WITH Linux-syscall-note) OR MIT)                  3
((GPL-2.0 WITH Linux-syscall-note) AND MIT)                 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with the SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally, Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 07:07:57 -07:00
// SPDX-License-Identifier: GPL-2.0
2008-01-29 06:51:59 -07:00
/*
 * Functions related to sysfs handling
 */
#include <linux/kernel.h>
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities to include those
headers directly instead of assuming availability. As this conversion
needs to touch a large number of source files, the following script is
used as the basis of the conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the following.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surroundings. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 01:04:11 -07:00
#include <linux/slab.h>
2008-01-29 06:51:59 -07:00
#include <linux/module.h>
#include <linux/bio.h>
#include <linux/blkdev.h>
2015-05-22 14:13:32 -07:00
#include <linux/backing-dev.h>
2008-01-29 06:51:59 -07:00
#include <linux/blktrace_api.h>
2020-06-19 13:47:30 -07:00
#include <linux/debugfs.h>
2008-01-29 06:51:59 -07:00

#include "blk.h"
2013-12-26 06:31:38 -07:00
#include "blk-mq.h"
2017-05-04 00:31:30 -07:00
#include "blk-mq-debugfs.h"
2021-11-23 11:53:08 -07:00
#include "blk-mq-sched.h"
2023-02-03 08:03:51 -07:00
#include "blk-rq-qos.h"
block: hook up writeback throttling
Enable throttling of buffered writeback to make it a lot
smoother, with far less impact on other system activity.
Background writeback should be, by definition, background
activity. The fact that we flush huge bundles of it at a time
means that it potentially has heavy impacts on foreground workloads,
which isn't ideal. We can't easily limit the sizes of writes that
we do, since that would impact file system layout in the presence
of delayed allocation. So just throttle back buffered writeback,
unless someone is waiting for it.
The algorithm for when to throttle takes its inspiration from the
CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors
the minimum latencies of requests over a window of time. In that
window of time, if the minimum latency of any request exceeds a
given target, then a scale count is incremented and the queue depth
is shrunk. The next monitoring window is shrunk accordingly. Unlike
CoDel, if we hit a window that exhibits good behavior, then we
simply increment the scale count and re-calculate the limits for that
scale value. This prevents us from oscillating between a
close-to-ideal value and max all the time, instead remaining in the
windows where we get good behavior.
Unlike CoDel, blk-wb allows the scale count to go negative. This
happens if we primarily have writes going on. Unlike positive
scale counts, this doesn't change the size of the monitoring window.
When the heavy writers finish, blk-wb quickly snaps back to its
stable state of a zero scale count.
The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency
target to be met. It defaults to 2 msec for non-rotational storage, and
75 msec for rotational storage. Setting this value to '0' disables
blk-wb. Generally, a user would not have to touch this setting.
We don't enable WBT on devices that are managed with CFQ, and have
a non-root block cgroup attached. If we have a proportional share setup
on this particular disk, then the wbt throttling will interfere with
that. We don't have a strong need for wbt for that case, since we will
rely on CFQ doing that for us.
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-09 12:38:14 -07:00
#include "blk-wbt.h"
2022-02-11 03:11:49 -07:00
#include "blk-cgroup.h"
2021-10-05 08:11:56 -07:00
#include "blk-throttle.h"
2008-01-29 06:51:59 -07:00

struct queue_sysfs_entry {
	struct attribute attr;
2024-06-27 04:14:03 -07:00
	ssize_t (*show)(struct gendisk *disk, char *page);
2024-09-07 17:07:04 -07:00
	int (*load_module)(struct gendisk *disk, const char *page, size_t count);
2024-06-27 04:14:03 -07:00
	ssize_t (*store)(struct gendisk *disk, const char *page, size_t count);
2008-01-29 06:51:59 -07:00
};

static ssize_t
2009-07-17 00:26:26 -07:00
queue_var_show(unsigned long var, char *page)
2008-01-29 06:51:59 -07:00
{
2009-07-17 00:26:26 -07:00
	return sprintf(page, "%lu\n", var);
2008-01-29 06:51:59 -07:00
}

static ssize_t
queue_var_store(unsigned long *var, const char *page, size_t count)
{
2012-09-08 08:55:45 -07:00
	int err;
	unsigned long v;

2013-09-11 14:20:08 -07:00
	err = kstrtoul(page, 10, &v);
2012-09-08 08:55:45 -07:00
	if (err || v > UINT_MAX)
		return -EINVAL;

	*var = v;
2008-01-29 06:51:59 -07:00

	return count;
}
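The store helper above parses a base-10 string and rejects anything that is malformed or larger than UINT_MAX. A minimal user-space sketch of the same check follows. It assumes strtoul() as a stand-in for the kernel's kstrtoul(), so the details differ (for example, this sketch rejects a leading '+'). The names parse_limit() and parse_limit_example() are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Hypothetical user-space analogue of queue_var_store(): parse a base-10
 * value and reject input that is malformed or exceeds UINT_MAX.
 * Returns 0 on success, -EINVAL otherwise.
 */
int parse_limit(const char *buf, unsigned long *out)
{
	char *end;
	unsigned long v;

	/* strtoul() would silently accept "-1" or leading blanks. */
	if (*buf < '0' || *buf > '9')
		return -EINVAL;

	errno = 0;
	v = strtoul(buf, &end, 10);
	if (errno == ERANGE)
		return -EINVAL;
	if (*end == '\n')	/* tolerate one trailing newline */
		end++;
	if (*end != '\0')
		return -EINVAL;
	if (v > UINT_MAX)
		return -EINVAL;

	*out = v;
	return 0;
}

/* Usage: a trailing newline, as written by `echo`, is accepted. */
void parse_limit_example(void)
{
	unsigned long v = 0;

	assert(parse_limit("128\n", &v) == 0 && v == 128);
	assert(parse_limit("12abc", &v) == -EINVAL);
}
```

Returning the parsed value through a pointer keeps the return code free for the error, which is the same split the helper above uses.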

2024-06-27 04:14:03 -07:00
static ssize_t queue_requests_show(struct gendisk *disk, char *page)
2008-01-29 06:51:59 -07:00
{
2024-06-27 04:14:03 -07:00
	return queue_var_show(disk->queue->nr_requests, page);
2008-01-29 06:51:59 -07:00
}

static ssize_t
2024-06-27 04:14:03 -07:00
queue_requests_store(struct gendisk *disk, const char *page, size_t count)
2008-01-29 06:51:59 -07:00
{
	unsigned long nr;
2014-05-20 10:49:02 -07:00
	int ret, err;
2009-09-11 13:44:29 -07:00

2024-06-27 04:14:03 -07:00
	if (!queue_is_mq(disk->queue))
2009-09-11 13:44:29 -07:00
		return -EINVAL;

	ret = queue_var_store(&nr, page, count);
2012-09-08 08:55:45 -07:00
	if (ret < 0)
		return ret;

2008-01-29 06:51:59 -07:00
	if (nr < BLKDEV_MIN_RQ)
		nr = BLKDEV_MIN_RQ;

2024-06-27 04:14:03 -07:00
	err = blk_mq_update_nr_requests(disk->queue, nr);
2014-05-20 10:49:02 -07:00
	if (err)
		return err;

2008-01-29 06:51:59 -07:00
	return ret;
}

2024-06-27 04:14:03 -07:00
static ssize_t queue_ra_show(struct gendisk *disk, char *page)
2008-01-29 06:51:59 -07:00
{
2024-06-27 04:14:03 -07:00
	return queue_var_show(disk->bdi->ra_pages << (PAGE_SHIFT - 10), page);
2008-01-29 06:51:59 -07:00
}

static ssize_t
2024-06-27 04:14:03 -07:00
queue_ra_store(struct gendisk *disk, const char *page, size_t count)
2008-01-29 06:51:59 -07:00
{
	unsigned long ra_kb;
2021-08-09 07:17:43 -07:00
	ssize_t ret;
2008-01-29 06:51:59 -07:00

2021-08-09 07:17:43 -07:00
	ret = queue_var_store(&ra_kb, page, count);
2012-09-08 08:55:45 -07:00
	if (ret < 0)
		return ret;
2024-06-27 04:14:03 -07:00
	disk->bdi->ra_pages = ra_kb >> (PAGE_SHIFT - 10);
2008-01-29 06:51:59 -07:00
	return ret;
}
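The read-ahead attribute is exposed in KiB but stored in pages, so the show and store paths shift by (PAGE_SHIFT - 10) in opposite directions. The arithmetic can be sketched as below, assuming 4 KiB pages (a page shift of 12). SKETCH_PAGE_SHIFT and the function names are hypothetical and only illustrate the conversion.

```c
#include <assert.h>

/* Assumption: 4 KiB pages; other page sizes change the shift. */
#define SKETCH_PAGE_SHIFT 12

/* Pages -> KiB, the direction used by the show path. */
unsigned long ra_pages_to_kb(unsigned long pages)
{
	return pages << (SKETCH_PAGE_SHIFT - 10);
}

/* KiB -> pages, the direction used by the store path; rounds down. */
unsigned long ra_kb_to_pages(unsigned long kb)
{
	return kb >> (SKETCH_PAGE_SHIFT - 10);
}

/* Usage: a value that is not a whole number of pages is truncated. */
void ra_conversion_example(void)
{
	assert(ra_kb_to_pages(128) == 32);
	assert(ra_pages_to_kb(ra_kb_to_pages(130)) == 128);
}
```

Under that assumption, writing 130 and reading the attribute back would give 128, because the store path drops the partial page.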

2024-06-27 04:14:02 -07:00
#define QUEUE_SYSFS_LIMIT_SHOW(_field) \
2024-06-27 04:14:03 -07:00
static ssize_t queue_##_field##_show(struct gendisk *disk, char *page) \
2024-06-27 04:14:02 -07:00
{ \
2024-06-27 04:14:03 -07:00
	return queue_var_show(disk->queue->limits._field, page); \
2024-06-27 04:14:02 -07:00
}

QUEUE_SYSFS_LIMIT_SHOW(max_segments)
QUEUE_SYSFS_LIMIT_SHOW(max_discard_segments)
QUEUE_SYSFS_LIMIT_SHOW(max_integrity_segments)
QUEUE_SYSFS_LIMIT_SHOW(max_segment_size)
QUEUE_SYSFS_LIMIT_SHOW(logical_block_size)
QUEUE_SYSFS_LIMIT_SHOW(physical_block_size)
QUEUE_SYSFS_LIMIT_SHOW(chunk_sectors)
QUEUE_SYSFS_LIMIT_SHOW(io_min)
QUEUE_SYSFS_LIMIT_SHOW(io_opt)
QUEUE_SYSFS_LIMIT_SHOW(discard_granularity)
QUEUE_SYSFS_LIMIT_SHOW(zone_write_granularity)
QUEUE_SYSFS_LIMIT_SHOW(virt_boundary_mask)
QUEUE_SYSFS_LIMIT_SHOW(dma_alignment)
QUEUE_SYSFS_LIMIT_SHOW(max_open_zones)
QUEUE_SYSFS_LIMIT_SHOW(max_active_zones)
QUEUE_SYSFS_LIMIT_SHOW(atomic_write_unit_min)
QUEUE_SYSFS_LIMIT_SHOW(atomic_write_unit_max)

#define QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(_field) \
2024-06-27 04:14:03 -07:00
static ssize_t queue_##_field##_show(struct gendisk *disk, char *page) \
2024-06-27 04:14:02 -07:00
{ \
	return sprintf(page, "%llu\n", \
2024-06-27 04:14:03 -07:00
		(unsigned long long)disk->queue->limits._field << \
			SECTOR_SHIFT); \
2024-06-27 04:14:02 -07:00
}

QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(max_discard_sectors)
QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(max_hw_discard_sectors)
QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(max_write_zeroes_sectors)
QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(atomic_write_max_sectors)
QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(atomic_write_boundary_sectors)

#define QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_KB(_field) \
2024-06-27 04:14:03 -07:00
static ssize_t queue_##_field##_show(struct gendisk *disk, char *page) \
2024-06-27 04:14:02 -07:00
{ \
2024-06-27 04:14:03 -07:00
	return queue_var_show(disk->queue->limits._field >> 1, page); \
2024-06-27 04:14:02 -07:00
}

QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_KB(max_sectors)
QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_KB(max_hw_sectors)

#define QUEUE_SYSFS_SHOW_CONST(_name, _val) \
2024-06-27 04:14:03 -07:00
static ssize_t queue_##_name##_show(struct gendisk *disk, char *page) \
2024-06-27 04:14:02 -07:00
{ \
	return sprintf(page, "%d\n", _val); \
2009-11-10 03:50:21 -07:00
}

2024-06-27 04:14:02 -07:00
/* deprecated fields */
QUEUE_SYSFS_SHOW_CONST(discard_zeroes_data, 0)
QUEUE_SYSFS_SHOW_CONST(write_same_max, 0)
QUEUE_SYSFS_SHOW_CONST(poll_delay, -1)
2015-07-16 08:14:26 -07:00

2024-06-27 04:14:03 -07:00
static ssize_t queue_max_discard_sectors_store(struct gendisk *disk,
2024-06-27 04:14:02 -07:00
		const char *page, size_t count)
2015-07-16 08:14:26 -07:00
{
2024-02-13 00:34:16 -07:00
	unsigned long max_discard_bytes;
2024-02-13 00:34:17 -07:00
	struct queue_limits lim;
2024-02-13 00:34:16 -07:00
	ssize_t ret;
2024-02-13 00:34:17 -07:00
	int err;
2015-07-16 08:14:26 -07:00

2024-02-13 00:34:16 -07:00
	ret = queue_var_store(&max_discard_bytes, page, count);
2015-07-16 08:14:26 -07:00
	if (ret < 0)
		return ret;

2024-06-27 04:14:03 -07:00
	if (max_discard_bytes & (disk->queue->limits.discard_granularity - 1))
2015-07-16 08:14:26 -07:00
		return -EINVAL;

2024-02-13 00:34:16 -07:00
	if ((max_discard_bytes >> SECTOR_SHIFT) > UINT_MAX)
2015-07-16 08:14:26 -07:00
		return -EINVAL;

2024-06-27 04:14:03 -07:00
	lim = queue_limits_start_update(disk->queue);
2024-02-13 00:34:17 -07:00
	lim.max_user_discard_sectors = max_discard_bytes >> SECTOR_SHIFT;
2024-06-27 04:14:03 -07:00
	err = queue_limits_commit_update(disk->queue, &lim);
2024-02-13 00:34:17 -07:00
	if (err)
		return err;
2015-07-16 08:14:26 -07:00
	return ret;
}
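The two validity checks in the discard store path (alignment to the discard granularity, then a range check on the sector count) can be tried in isolation. The sketch below assumes the granularity is a nonzero power of two, which is what makes the mask test equivalent to a modulo. SKETCH_SECTOR_SHIFT and discard_bytes_valid() are hypothetical names for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_SECTOR_SHIFT 9	/* 512-byte sectors */

/*
 * Returns true when `bytes` is a multiple of `granularity` (assumed to be a
 * nonzero power of two) and the resulting sector count fits in 32 bits.
 */
bool discard_bytes_valid(uint64_t bytes, uint32_t granularity)
{
	/* For a power of two g, (x & (g - 1)) == x % g. */
	if (bytes & (granularity - 1u))
		return false;
	if ((bytes >> SKETCH_SECTOR_SHIFT) > UINT32_MAX)
		return false;
	return true;
}

/* Usage: 6144 is not a multiple of 4096, so it is rejected. */
void discard_bytes_example(void)
{
	assert(discard_bytes_valid(8192, 4096));
	assert(!discard_bytes_valid(6144, 4096));
}
```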

2024-06-27 04:14:02 -07:00
/*
 * For zone append queue_max_zone_append_sectors does not just return the
 * underlying queue limits, but actually contains a calculation. Because of
 * that we can't simply use QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES here.
 */
2024-06-27 04:14:03 -07:00
static ssize_t queue_zone_append_max_show(struct gendisk *disk, char *page)
2020-05-12 01:55:47 -07:00
{
2024-06-27 04:14:03 -07:00
	return sprintf(page, "%llu\n",
		(u64)queue_max_zone_append_sectors(disk->queue) <<
			SECTOR_SHIFT);
2020-05-12 01:55:47 -07:00
}

2008-01-29 06:51:59 -07:00
static ssize_t
2024-06-27 04:14:03 -07:00
queue_max_sectors_store(struct gendisk *disk, const char *page, size_t count)
2008-01-29 06:51:59 -07:00
{
2024-02-13 00:34:15 -07:00
	unsigned long max_sectors_kb;
	struct queue_limits lim;
	ssize_t ret;
	int err;
2008-01-29 06:51:59 -07:00

2024-02-13 00:34:15 -07:00
	ret = queue_var_store(&max_sectors_kb, page, count);
2012-09-08 08:55:45 -07:00
	if (ret < 0)
		return ret;

2024-06-27 04:14:03 -07:00
	lim = queue_limits_start_update(disk->queue);
2024-02-13 00:34:15 -07:00
	lim.max_user_sectors = max_sectors_kb << 1;
2024-06-27 04:14:03 -07:00
	err = queue_limits_commit_update(disk->queue, &lim);
2024-02-13 00:34:15 -07:00
	if (err)
		return err;
2008-01-29 06:51:59 -07:00
	return ret;
}

2024-06-27 04:14:03 -07:00
static ssize_t queue_feature_store(struct gendisk *disk, const char *page,
2024-06-26 07:26:25 -07:00
		size_t count, blk_features_t feature)
2024-06-16 23:04:41 -07:00
{
	struct queue_limits lim;
	unsigned long val;
	ssize_t ret;

	ret = queue_var_store(&val, page, count);
	if (ret < 0)
		return ret;

2024-06-27 04:14:03 -07:00
	lim = queue_limits_start_update(disk->queue);
2024-06-16 23:04:41 -07:00
	if (val)
		lim.features |= feature;
	else
		lim.features &= ~feature;
2024-06-27 04:14:03 -07:00
	ret = queue_limits_commit_update(disk->queue, &lim);
2024-06-16 23:04:41 -07:00
	if (ret)
		return ret;
	return count;
}

2024-06-27 04:14:03 -07:00
#define QUEUE_SYSFS_FEATURE(_name, _feature) \
static ssize_t queue_##_name##_show(struct gendisk *disk, char *page) \
{ \
	return sprintf(page, "%u\n", \
		!!(disk->queue->limits.features & _feature)); \
} \
static ssize_t queue_##_name##_store(struct gendisk *disk, \
		const char *page, size_t count) \
{ \
	return queue_feature_store(disk, page, count, _feature); \
2024-06-16 23:04:41 -07:00
}

QUEUE_SYSFS_FEATURE(rotational, BLK_FEAT_ROTATIONAL)
2024-06-16 23:04:42 -07:00
QUEUE_SYSFS_FEATURE(add_random, BLK_FEAT_ADD_RANDOM)
2024-06-16 23:04:43 -07:00
QUEUE_SYSFS_FEATURE(iostats, BLK_FEAT_IO_STAT)
2024-06-16 23:04:44 -07:00
QUEUE_SYSFS_FEATURE(stable_writes, BLK_FEAT_STABLE_WRITES);
2009-01-07 04:22:39 -07:00

2024-06-27 04:14:03 -07:00
#define QUEUE_SYSFS_FEATURE_SHOW(_name, _feature) \
static ssize_t queue_##_name##_show(struct gendisk *disk, char *page) \
{ \
	return sprintf(page, "%u\n", \
		!!(disk->queue->limits.features & _feature)); \
2024-06-27 04:14:02 -07:00
}

QUEUE_SYSFS_FEATURE_SHOW(poll, BLK_FEAT_POLL);
QUEUE_SYSFS_FEATURE_SHOW(fua, BLK_FEAT_FUA);
QUEUE_SYSFS_FEATURE_SHOW(dax, BLK_FEAT_DAX);

2024-06-27 04:14:03 -07:00
static ssize_t queue_zoned_show(struct gendisk *disk, char *page)
2016-10-17 23:40:29 -07:00
{
2024-06-27 04:14:03 -07:00
	if (blk_queue_is_zoned(disk->queue))
2016-10-17 23:40:29 -07:00
		return sprintf(page, "host-managed\n");
2023-12-17 09:53:57 -07:00
	return sprintf(page, "none\n");
2016-10-17 23:40:29 -07:00
}

2024-06-27 04:14:03 -07:00
static ssize_t queue_nr_zones_show(struct gendisk *disk, char *page)
2018-10-12 03:08:48 -07:00
{
2024-06-27 04:14:03 -07:00
	return queue_var_show(disk_nr_zones(disk), page);
2018-10-12 03:08:48 -07:00
}

2024-06-27 04:14:03 -07:00
static ssize_t queue_nomerges_show(struct gendisk *disk, char *page)
2008-04-29 05:44:19 -07:00
{
2024-06-27 04:14:03 -07:00
	return queue_var_show((blk_queue_nomerges(disk->queue) << 1) |
			       blk_queue_noxmerges(disk->queue), page);
2008-04-29 05:44:19 -07:00
}

2024-06-27 04:14:03 -07:00
static ssize_t queue_nomerges_store(struct gendisk *disk, const char *page,
2008-04-29 05:44:19 -07:00
				    size_t count)
{
	unsigned long nm;
	ssize_t ret = queue_var_store(&nm, page, count);

2012-09-08 08:55:45 -07:00
	if (ret < 0)
		return ret;

2024-06-27 04:14:03 -07:00
	blk_queue_flag_clear(QUEUE_FLAG_NOMERGES, disk->queue);
	blk_queue_flag_clear(QUEUE_FLAG_NOXMERGES, disk->queue);
2010-01-29 01:04:08 -07:00
	if (nm == 2)
2024-06-27 04:14:03 -07:00
		blk_queue_flag_set(QUEUE_FLAG_NOMERGES, disk->queue);
2010-01-29 01:04:08 -07:00
	else if (nm)
2024-06-27 04:14:03 -07:00
		blk_queue_flag_set(QUEUE_FLAG_NOXMERGES, disk->queue);
2009-01-07 04:22:39 -07:00

2008-04-29 05:44:19 -07:00
	return ret;
}
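The nomerges attribute packs two flags into one small integer: the show path computes (nomerges << 1) | noxmerges, and the store path maps 2 to NOMERGES and any other nonzero value to NOXMERGES. A self-contained sketch of that round trip is shown here, with a hypothetical struct merge_flags standing in for the two queue flags.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two queue flags. */
struct merge_flags {
	bool nomerges;	/* plays the role of QUEUE_FLAG_NOMERGES */
	bool noxmerges;	/* plays the role of QUEUE_FLAG_NOXMERGES */
};

/* Store path: clear both flags, then set at most one of them. */
void nomerges_store(struct merge_flags *f, unsigned long nm)
{
	f->nomerges = false;
	f->noxmerges = false;
	if (nm == 2)
		f->nomerges = true;
	else if (nm)
		f->noxmerges = true;
}

/* Show path: bit 1 is nomerges, bit 0 is noxmerges. */
unsigned long nomerges_show(const struct merge_flags *f)
{
	return ((unsigned long)f->nomerges << 1) | f->noxmerges;
}

/* Usage: any nonzero value other than 2 reads back as 1. */
void nomerges_example(void)
{
	struct merge_flags f = { false, false };

	nomerges_store(&f, 5);
	assert(nomerges_show(&f) == 1);
}
```

Because the store path sets at most one flag, the value 3 is never produced by this path even though the encoding could represent it.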

2024-06-27 04:14:03 -07:00
static ssize_t queue_rq_affinity_show(struct gendisk *disk, char *page)
2008-09-13 11:26:01 -07:00
{
2024-06-27 04:14:03 -07:00
	bool set = test_bit(QUEUE_FLAG_SAME_COMP, &disk->queue->queue_flags);
	bool force = test_bit(QUEUE_FLAG_SAME_FORCE, &disk->queue->queue_flags);
2008-09-13 11:26:01 -07:00

2011-07-23 11:44:25 -07:00
	return queue_var_show(set << force, page);
2008-09-13 11:26:01 -07:00
}

static ssize_t
2024-06-27 04:14:03 -07:00
queue_rq_affinity_store(struct gendisk *disk, const char *page, size_t count)
2008-09-13 11:26:01 -07:00
{
	ssize_t ret = -EINVAL;
2013-11-14 15:32:07 -07:00
#ifdef CONFIG_SMP
2024-06-27 04:14:03 -07:00
	struct request_queue *q = disk->queue;
2008-09-13 11:26:01 -07:00
	unsigned long val;

	ret = queue_var_store(&val, page, count);
2012-09-08 08:55:45 -07:00
	if (ret < 0)
		return ret;

2011-08-23 12:25:12 -07:00
	if (val == 2) {
2018-11-14 09:02:07 -07:00
		blk_queue_flag_set(QUEUE_FLAG_SAME_COMP, q);
		blk_queue_flag_set(QUEUE_FLAG_SAME_FORCE, q);
2011-08-23 12:25:12 -07:00
	} else if (val == 1) {
2018-11-14 09:02:07 -07:00
		blk_queue_flag_set(QUEUE_FLAG_SAME_COMP, q);
		blk_queue_flag_clear(QUEUE_FLAG_SAME_FORCE, q);
2011-08-23 12:25:12 -07:00
	} else if (val == 0) {
2018-11-14 09:02:07 -07:00
		blk_queue_flag_clear(QUEUE_FLAG_SAME_COMP, q);
		blk_queue_flag_clear(QUEUE_FLAG_SAME_FORCE, q);
2011-07-23 11:44:25 -07:00
	}
2008-09-13 11:26:01 -07:00
#endif
	return ret;
}
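The expression `set << force` in the rq_affinity show path encodes two booleans as 0, 1 or 2: nothing set gives 0, SAME_COMP alone gives 1, and SAME_COMP with SAME_FORCE gives 2. The sketch below models the show and store paths with a hypothetical struct affinity_flags. It leaves out the CONFIG_SMP guard and the input parsing.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for QUEUE_FLAG_SAME_COMP and QUEUE_FLAG_SAME_FORCE. */
struct affinity_flags {
	bool same_comp;
	bool same_force;
};

/* Store path: 0, 1 and 2 are meaningful; other values change nothing. */
void rq_affinity_store(struct affinity_flags *f, unsigned long val)
{
	if (val == 2) {
		f->same_comp = true;
		f->same_force = true;
	} else if (val == 1) {
		f->same_comp = true;
		f->same_force = false;
	} else if (val == 0) {
		f->same_comp = false;
		f->same_force = false;
	}
}

/* Show path: shifting `set` left by `force` yields 0, 1 or 2. */
unsigned long rq_affinity_show(const struct affinity_flags *f)
{
	return (unsigned long)f->same_comp << f->same_force;
}

/* Usage: an out-of-range value leaves the previous setting in place. */
void rq_affinity_example(void)
{
	struct affinity_flags f = { false, false };

	rq_affinity_store(&f, 1);
	rq_affinity_store(&f, 7);
	assert(rq_affinity_show(&f) == 1);
}
```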
2008-01-29 06:51:59 -07:00

2024-06-27 04:14:03 -07:00
static ssize_t queue_poll_delay_store(struct gendisk *disk, const char *page,
2016-11-14 13:01:59 -07:00
				size_t count)
{
2016-11-14 13:03:03 -07:00
	return count;
2016-11-14 13:01:59 -07:00
}

2024-06-27 04:14:03 -07:00
static ssize_t queue_poll_store(struct gendisk *disk, const char *page,
2015-11-05 10:44:55 -07:00
				size_t count)
{
2024-06-27 04:14:03 -07:00
	if (!(disk->queue->limits.features & BLK_FEAT_POLL))
2015-11-05 10:44:55 -07:00
		return -EINVAL;
2021-10-12 04:12:25 -07:00
	pr_info_ratelimited("writes to the poll attribute are ignored.\n");
	pr_info_ratelimited("please use driver specific parameters instead.\n");
	return count;
2015-11-05 10:44:55 -07:00
}

2024-06-27 04:14:03 -07:00
static ssize_t queue_io_timeout_show(struct gendisk *disk, char *page)
2018-11-28 09:04:39 -07:00
{
2024-06-27 04:14:03 -07:00
	return sprintf(page, "%u\n", jiffies_to_msecs(disk->queue->rq_timeout));
2018-11-28 09:04:39 -07:00
}

2024-06-27 04:14:03 -07:00
static ssize_t queue_io_timeout_store(struct gendisk *disk, const char *page,
2018-11-28 09:04:39 -07:00
				  size_t count)
{
	unsigned int val;
	int err;

	err = kstrtou32(page, 10, &val);
	if (err || val == 0)
		return -EINVAL;

2024-06-27 04:14:03 -07:00
	blk_queue_rq_timeout(disk->queue, msecs_to_jiffies(val));
2018-11-28 09:04:39 -07:00

	return count;
}
|
|
|
|
|
2024-06-27 04:14:03 -07:00
|
|
|
static ssize_t queue_wc_show(struct gendisk *disk, char *page)
|
2016-04-12 11:32:46 -07:00
|
|
|
{
|
2024-06-27 04:14:03 -07:00
|
|
|
if (blk_queue_write_cache(disk->queue))
|
2024-06-26 07:26:23 -07:00
|
|
|
return sprintf(page, "write back\n");
|
|
|
|
return sprintf(page, "write through\n");
|
2016-04-12 11:32:46 -07:00
|
|
|
}
|
|
|
|
|
2024-06-27 04:14:03 -07:00
|
|
|
static ssize_t queue_wc_store(struct gendisk *disk, const char *page,
|
2016-04-12 11:32:46 -07:00
|
|
|
size_t count)
|
|
|
|
{
|
2024-06-16 23:04:40 -07:00
|
|
|
struct queue_limits lim;
|
|
|
|
bool disable;
|
|
|
|
int err;
|
|
|
|
|
2023-07-07 02:42:39 -07:00
|
|
|
if (!strncmp(page, "write back", 10)) {
|
2024-06-16 23:04:40 -07:00
|
|
|
disable = false;
|
2023-07-07 02:42:39 -07:00
|
|
|
} else if (!strncmp(page, "write through", 13) ||
|
2024-06-16 23:04:40 -07:00
|
|
|
!strncmp(page, "none", 4)) {
|
|
|
|
disable = true;
|
2023-07-07 02:42:39 -07:00
|
|
|
} else {
|
2023-07-07 02:42:38 -07:00
|
|
|
return -EINVAL;
|
2023-07-07 02:42:39 -07:00
|
|
|
}
|
2016-04-12 11:32:46 -07:00
|
|
|
|
2024-06-27 04:14:03 -07:00
|
|
|
lim = queue_limits_start_update(disk->queue);
|
2024-06-16 23:04:40 -07:00
|
|
|
if (disable)
|
2024-06-19 08:45:35 -07:00
|
|
|
lim.flags |= BLK_FLAG_WRITE_CACHE_DISABLED;
|
2024-06-16 23:04:40 -07:00
|
|
|
else
|
2024-06-19 08:45:35 -07:00
|
|
|
lim.flags &= ~BLK_FLAG_WRITE_CACHE_DISABLED;
|
2024-06-27 04:14:03 -07:00
|
|
|
err = queue_limits_commit_update(disk->queue, &lim);
|
2024-06-16 23:04:40 -07:00
|
|
|
if (err)
|
|
|
|
return err;
|
2016-04-12 11:32:46 -07:00
|
|
|
return count;
|
|
|
|
}
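
The string matching in queue_wc_store() compares only a fixed-length prefix, so input with a trailing newline (as written by `echo`) or other trailing characters is still accepted. A minimal userspace sketch of that matching follows. The function name parse_write_cache_mode() is hypothetical and not part of the kernel; only the strncmp() logic is taken from the code above.

```c
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical userspace mirror of the prefix matching in queue_wc_store().
 * Returns 0 and sets *disable on success, -EINVAL for unknown input.
 */
int parse_write_cache_mode(const char *page, bool *disable)
{
	if (!strncmp(page, "write back", 10)) {
		*disable = false;
	} else if (!strncmp(page, "write through", 13) ||
		   !strncmp(page, "none", 4)) {
		*disable = true;
	} else {
		return -EINVAL;
	}
	return 0;
}
```

Because only the prefix is compared, a string such as "write backwards" is treated the same as "write back", while "writeback" (no space) is rejected.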

#define QUEUE_RO_ENTRY(_prefix, _name)			\
static struct queue_sysfs_entry _prefix##_entry = {	\
	.attr	= { .name = _name, .mode = 0444 },	\
	.show	= _prefix##_show,			\
};

#define QUEUE_RW_ENTRY(_prefix, _name)			\
static struct queue_sysfs_entry _prefix##_entry = {	\
	.attr	= { .name = _name, .mode = 0644 },	\
	.show	= _prefix##_show,			\
	.store	= _prefix##_store,			\
};

#define QUEUE_RW_LOAD_MODULE_ENTRY(_prefix, _name)		\
static struct queue_sysfs_entry _prefix##_entry = {		\
	.attr		= { .name = _name, .mode = 0644 },	\
	.show		= _prefix##_show,			\
	.load_module	= _prefix##_load_module,		\
	.store		= _prefix##_store,			\
}
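
The entry macros above rely on token pasting (`##`) to derive the variable name and the callback names from a single prefix. The following userspace sketch illustrates the same pattern. All names in it (struct demo_sysfs_entry, DEMO_RO_ENTRY, DEMO_RW_ENTRY, demo_fua, demo_level) are hypothetical stand-ins, and the callback signatures are simplified compared with the kernel's.

```c
#include <stddef.h>
#include <stdio.h>

/* Simplified, hypothetical stand-ins for the kernel structures. */
struct attribute {
	const char *name;
	unsigned short mode;
};

struct demo_sysfs_entry {
	struct attribute attr;
	int (*show)(char *page);
	int (*store)(const char *page, size_t count);
};

/* _prefix##_entry and _prefix##_show are built from the same prefix. */
#define DEMO_RO_ENTRY(_prefix, _name)			\
static struct demo_sysfs_entry _prefix##_entry = {	\
	.attr	= { .name = _name, .mode = 0444 },	\
	.show	= _prefix##_show,			\
}

#define DEMO_RW_ENTRY(_prefix, _name)			\
static struct demo_sysfs_entry _prefix##_entry = {	\
	.attr	= { .name = _name, .mode = 0644 },	\
	.show	= _prefix##_show,			\
	.store	= _prefix##_store,			\
}

static int demo_fua_show(char *page)
{
	return sprintf(page, "%d\n", 1);
}

static int demo_level;

static int demo_level_show(char *page)
{
	return sprintf(page, "%d\n", demo_level);
}

static int demo_level_store(const char *page, size_t count)
{
	if (sscanf(page, "%d", &demo_level) != 1)
		return -1;
	return (int)count;
}

/* Expands to demo_fua_entry with .show = demo_fua_show and no .store. */
DEMO_RO_ENTRY(demo_fua, "fua");
/* Expands to demo_level_entry with both callbacks set. */
DEMO_RW_ENTRY(demo_level, "level");
```

In this sketch the macros omit the trailing semicolon so that each use site supplies it, the same convention QUEUE_RW_LOAD_MODULE_ENTRY uses.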

QUEUE_RW_ENTRY(queue_requests, "nr_requests");
QUEUE_RW_ENTRY(queue_ra, "read_ahead_kb");
QUEUE_RW_ENTRY(queue_max_sectors, "max_sectors_kb");
QUEUE_RO_ENTRY(queue_max_hw_sectors, "max_hw_sectors_kb");
QUEUE_RO_ENTRY(queue_max_segments, "max_segments");
QUEUE_RO_ENTRY(queue_max_integrity_segments, "max_integrity_segments");
QUEUE_RO_ENTRY(queue_max_segment_size, "max_segment_size");
QUEUE_RW_LOAD_MODULE_ENTRY(elv_iosched, "scheduler");

QUEUE_RO_ENTRY(queue_logical_block_size, "logical_block_size");
QUEUE_RO_ENTRY(queue_physical_block_size, "physical_block_size");
QUEUE_RO_ENTRY(queue_chunk_sectors, "chunk_sectors");
QUEUE_RO_ENTRY(queue_io_min, "minimum_io_size");
QUEUE_RO_ENTRY(queue_io_opt, "optimal_io_size");

QUEUE_RO_ENTRY(queue_max_discard_segments, "max_discard_segments");
QUEUE_RO_ENTRY(queue_discard_granularity, "discard_granularity");
QUEUE_RO_ENTRY(queue_max_hw_discard_sectors, "discard_max_hw_bytes");
QUEUE_RW_ENTRY(queue_max_discard_sectors, "discard_max_bytes");
QUEUE_RO_ENTRY(queue_discard_zeroes_data, "discard_zeroes_data");

QUEUE_RO_ENTRY(queue_atomic_write_max_sectors, "atomic_write_max_bytes");
QUEUE_RO_ENTRY(queue_atomic_write_boundary_sectors,
		"atomic_write_boundary_bytes");
block: Add core atomic write support
Add atomic write support, as follows:
- add helper functions to get request_queue atomic write limits
- report request_queue atomic write support limits to sysfs and update Doc
- support to safely merge atomic writes
- deal with splitting atomic writes
- misc helper functions
- add a per-request atomic write flag
New request_queue limits are added, as follows:
- atomic_write_hw_max is set by the block driver and is the maximum length
of an atomic write which the device may support. It is not
necessarily a power-of-2.
- atomic_write_max_sectors is derived from atomic_write_hw_max_sectors and
max_hw_sectors. It is always a power-of-2. Atomic writes may be merged,
and atomic_write_max_sectors would be the limit on a merged atomic write
request size. This value is not capped at max_sectors, as the value in
max_sectors can be controlled from userspace, and it would only cause
trouble if userspace could limit atomic_write_unit_max_bytes and the
other atomic write limits.
- atomic_write_hw_unit_{min,max} are set by the block driver and are the
min/max length of an atomic write unit which the device may support. They
both must be a power-of-2. Typically atomic_write_hw_unit_max will hold
the same value as atomic_write_hw_max.
- atomic_write_unit_{min,max} are derived from
atomic_write_hw_unit_{min,max}, max_hw_sectors, and block core limits.
Both min and max values must be a power-of-2.
- atomic_write_hw_boundary is set by the block driver. If non-zero, it
indicates an LBA space boundary; an atomic write that straddles it is no
longer atomically executed by the disk. The value must be a
power-of-2. Note that it would be acceptable to enforce a rule that
atomic_write_hw_boundary_sectors is a multiple of
atomic_write_hw_unit_max, but the resultant code would be more
complicated.
All atomic write limits are set to 0 by default to indicate no atomic write
support. Even though it is assumed by Linux that a logical block can always
be atomically written, we ignore this as it is not of particular interest.
Stacked devices are just not supported either for now.
An atomic write must always be submitted to the block driver as part of a
single request. As such, only a single BIO must be submitted to the block
layer for an atomic write. When a single atomic write BIO is submitted, it
cannot be split. As such, atomic_write_unit_{max, min}_bytes are limited
by the maximum guaranteed BIO size which will not be required to be split.
This max size is calculated by request_queue max segments and the number
of bvecs a BIO can fit, BIO_MAX_VECS. Currently we rely on userspace
issuing a write with iovcnt=1 for pwritev2() - as such, we can rely on each
segment containing PAGE_SIZE of data, apart from the first+last, which each
can fit logical block size of data. The first+last will be LBS
length/aligned as we rely on direct IO alignment rules also.
New sysfs files are added to report the following atomic write limits:
- atomic_write_unit_max_bytes - same as atomic_write_unit_max_sectors in
bytes
- atomic_write_unit_min_bytes - same as atomic_write_unit_min_sectors in
bytes
- atomic_write_boundary_bytes - same as atomic_write_hw_boundary_sectors in
bytes
- atomic_write_max_bytes - same as atomic_write_max_sectors in bytes
Atomic writes may only be merged with other atomic writes and only under
the following conditions:
- total resultant request length <= atomic_write_max_bytes
- the merged write does not straddle a boundary
Helper function bdev_can_atomic_write() is added to indicate whether
atomic writes may be issued to a bdev. If a bdev is a partition, the
partition start must be aligned with both atomic_write_unit_min_sectors
and atomic_write_hw_boundary_sectors.
FSes will rely on the block layer to validate that an atomic write BIO
submitted will be of valid size, so add blk_validate_atomic_write_op_size()
for this purpose. Userspace expects an atomic write which is of invalid
size to be rejected with -EINVAL, so add BLK_STS_INVAL for this. Also use
BLK_STS_INVAL for when a BIO needs to be split, as this should mean an
invalid size BIO.
Flag REQ_ATOMIC is used for indicating an atomic write.
Co-developed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20240620125359.2684798-6-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-20 05:53:54 -07:00
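
The two merge conditions stated in the commit message (the merged length must not exceed atomic_write_max_bytes, and the merged write must not straddle a boundary) can be sketched as a standalone check. This is a minimal illustration under stated assumptions, not the kernel's implementation: the function name atomic_merge_allowed() and its parameters are hypothetical, all values are in bytes, and a boundary of 0 is taken to mean that no boundary applies.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper. start is the byte offset of the merged request,
 * merged_len its total length, max_bytes the atomic_write_max_bytes limit,
 * and boundary the atomic write boundary (a power of two, or 0 for none).
 */
bool atomic_merge_allowed(uint64_t start, uint64_t merged_len,
			  uint64_t max_bytes, uint64_t boundary)
{
	uint64_t mask;

	/* Condition 1: total resultant length must fit the limit. */
	if (merged_len == 0 || merged_len > max_bytes)
		return false;

	/* No boundary configured, so nothing can be straddled. */
	if (!boundary)
		return true;

	/*
	 * Condition 2: first and last byte must fall in the same
	 * boundary-sized window. Because boundary is a power of two,
	 * masking off the low bits yields the start of the window.
	 */
	mask = ~(boundary - 1);
	return (start & mask) == ((start + merged_len - 1) & mask);
}
```

For example, with a 32 KiB boundary, an 8 KiB write starting at offset 28672 ends at offset 36863 and therefore crosses the boundary at 32768, so the check rejects it. The same write starting at 32768 is accepted.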
QUEUE_RO_ENTRY(queue_atomic_write_unit_max, "atomic_write_unit_max_bytes");
QUEUE_RO_ENTRY(queue_atomic_write_unit_min, "atomic_write_unit_min_bytes");

QUEUE_RO_ENTRY(queue_write_same_max, "write_same_max_bytes");
QUEUE_RO_ENTRY(queue_max_write_zeroes_sectors, "write_zeroes_max_bytes");
QUEUE_RO_ENTRY(queue_zone_append_max, "zone_append_max_bytes");
QUEUE_RO_ENTRY(queue_zone_write_granularity, "zone_write_granularity");

QUEUE_RO_ENTRY(queue_zoned, "zoned");
QUEUE_RO_ENTRY(queue_nr_zones, "nr_zones");
QUEUE_RO_ENTRY(queue_max_open_zones, "max_open_zones");
QUEUE_RO_ENTRY(queue_max_active_zones, "max_active_zones");

QUEUE_RW_ENTRY(queue_nomerges, "nomerges");
QUEUE_RW_ENTRY(queue_rq_affinity, "rq_affinity");
QUEUE_RW_ENTRY(queue_poll, "io_poll");
QUEUE_RW_ENTRY(queue_poll_delay, "io_poll_delay");
QUEUE_RW_ENTRY(queue_wc, "write_cache");
QUEUE_RO_ENTRY(queue_fua, "fua");
QUEUE_RO_ENTRY(queue_dax, "dax");
QUEUE_RW_ENTRY(queue_io_timeout, "io_timeout");
QUEUE_RO_ENTRY(queue_virt_boundary_mask, "virt_boundary_mask");
QUEUE_RO_ENTRY(queue_dma_alignment, "dma_alignment");

/* legacy alias for logical_block_size: */
static struct queue_sysfs_entry queue_hw_sector_size_entry = {
	.attr = {.name = "hw_sector_size", .mode = 0444 },
	.show = queue_logical_block_size_show,
};

QUEUE_RW_ENTRY(queue_rotational, "rotational");
QUEUE_RW_ENTRY(queue_iostats, "iostats");
QUEUE_RW_ENTRY(queue_add_random, "add_random");
QUEUE_RW_ENTRY(queue_stable_writes, "stable_writes");

#ifdef CONFIG_BLK_WBT
static ssize_t queue_var_store64(s64 *var, const char *page)
{
	int err;
	s64 v;

	err = kstrtos64(page, 10, &v);
	if (err < 0)
		return err;

	*var = v;
	return 0;
}

static ssize_t queue_wb_lat_show(struct gendisk *disk, char *page)
{
	if (!wbt_rq_qos(disk->queue))
		return -EINVAL;

	if (wbt_disabled(disk->queue))
		return sprintf(page, "0\n");

	return sprintf(page, "%llu\n",
		div_u64(wbt_get_min_lat(disk->queue), 1000));
}

static ssize_t queue_wb_lat_store(struct gendisk *disk, const char *page,
				  size_t count)
{
	struct request_queue *q = disk->queue;
	struct rq_qos *rqos;
	ssize_t ret;
	s64 val;

	ret = queue_var_store64(&val, page);
	if (ret < 0)
		return ret;
	if (val < -1)
		return -EINVAL;

	rqos = wbt_rq_qos(q);
	if (!rqos) {
		ret = wbt_init(disk);
		if (ret)
			return ret;
	}

	if (val == -1)
		val = wbt_default_latency_nsec(q);
	else if (val >= 0)
		val *= 1000ULL;

	if (wbt_get_min_lat(q) == val)
		return count;

	/*
	 * Ensure that the queue is idled, in case the latency update
	 * ends up either enabling or disabling wbt completely. We can't
	 * have IO inflight if that happens.
	 */
	blk_mq_quiesce_queue(q);

	wbt_set_min_lat(q, val);

	blk_mq_unquiesce_queue(q);

	return count;
}

QUEUE_RW_ENTRY(queue_wb_lat, "wbt_lat_usec");
#endif
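
The wbt_lat_usec attribute is exposed in microseconds while the value handed to wbt_set_min_lat() is in nanoseconds. The unit handling in queue_wb_lat_store() and queue_wb_lat_show() can be sketched in userspace as below. The function names and the default_nsec parameter are hypothetical; in the kernel the default comes from wbt_default_latency_nsec(q), whose value is not reproduced here.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical mirror of the value handling in queue_wb_lat_store():
 * values below -1 are rejected, -1 selects the caller-supplied default,
 * and any other value is microseconds converted to nanoseconds.
 */
int wbt_usec_to_nsec(int64_t usec, int64_t default_nsec, int64_t *nsec)
{
	if (usec < -1)
		return -EINVAL;
	*nsec = (usec == -1) ? default_nsec : usec * 1000;
	return 0;
}

/* Mirror of the division in queue_wb_lat_show(): nanoseconds to microseconds. */
uint64_t wbt_nsec_to_usec(uint64_t nsec)
{
	return nsec / 1000;
}
```

Writing 75000 would therefore store 75000000 ns, and reading it back would report 75000 again. The default used in any test of this sketch is an arbitrary placeholder.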

/* Common attributes for bio-based and request-based queues. */
static struct attribute *queue_attrs[] = {
	&queue_ra_entry.attr,
	&queue_max_hw_sectors_entry.attr,
	&queue_max_sectors_entry.attr,
	&queue_max_segments_entry.attr,
	&queue_max_discard_segments_entry.attr,
	&queue_max_integrity_segments_entry.attr,
	&queue_max_segment_size_entry.attr,
	&queue_hw_sector_size_entry.attr,
	&queue_logical_block_size_entry.attr,
	&queue_physical_block_size_entry.attr,
	&queue_chunk_sectors_entry.attr,
	&queue_io_min_entry.attr,
	&queue_io_opt_entry.attr,
	&queue_discard_granularity_entry.attr,
	&queue_max_discard_sectors_entry.attr,
	&queue_max_hw_discard_sectors_entry.attr,
	&queue_discard_zeroes_data_entry.attr,
	&queue_atomic_write_max_sectors_entry.attr,
	&queue_atomic_write_boundary_sectors_entry.attr,
	&queue_atomic_write_unit_min_entry.attr,
	&queue_atomic_write_unit_max_entry.attr,
	&queue_write_same_max_entry.attr,
	&queue_max_write_zeroes_sectors_entry.attr,
	&queue_zone_append_max_entry.attr,
	&queue_zone_write_granularity_entry.attr,
	&queue_rotational_entry.attr,
	&queue_zoned_entry.attr,
	&queue_nr_zones_entry.attr,
	&queue_max_open_zones_entry.attr,
	&queue_max_active_zones_entry.attr,
	&queue_nomerges_entry.attr,
	&queue_iostats_entry.attr,
	&queue_stable_writes_entry.attr,
	&queue_add_random_entry.attr,
	&queue_poll_entry.attr,
	&queue_wc_entry.attr,
	&queue_fua_entry.attr,
	&queue_dax_entry.attr,
	&queue_poll_delay_entry.attr,
	&queue_virt_boundary_mask_entry.attr,
	&queue_dma_alignment_entry.attr,
	NULL,
};

/* Request-based queue attributes that are not relevant for bio-based queues. */
static struct attribute *blk_mq_queue_attrs[] = {
	&queue_requests_entry.attr,
	&elv_iosched_entry.attr,
	&queue_rq_affinity_entry.attr,
	&queue_io_timeout_entry.attr,
#ifdef CONFIG_BLK_WBT
	&queue_wb_lat_entry.attr,
#endif
	NULL,
};

static umode_t queue_attr_visible(struct kobject *kobj, struct attribute *attr,
		int n)
{
	struct gendisk *disk = container_of(kobj, struct gendisk, queue_kobj);
	struct request_queue *q = disk->queue;

	if ((attr == &queue_max_open_zones_entry.attr ||
	     attr == &queue_max_active_zones_entry.attr) &&
	    !blk_queue_is_zoned(q))
		return 0;

	return attr->mode;
}

static umode_t blk_mq_queue_attr_visible(struct kobject *kobj,
					 struct attribute *attr, int n)
{
	struct gendisk *disk = container_of(kobj, struct gendisk, queue_kobj);
	struct request_queue *q = disk->queue;

	if (!queue_is_mq(q))
		return 0;

	if (attr == &queue_io_timeout_entry.attr && !q->mq_ops->timeout)
		return 0;

	return attr->mode;
}

static struct attribute_group queue_attr_group = {
	.attrs = queue_attrs,
	.is_visible = queue_attr_visible,
};

static struct attribute_group blk_mq_queue_attr_group = {
	.attrs = blk_mq_queue_attrs,
	.is_visible = blk_mq_queue_attr_visible,
};

#define to_queue(atr) container_of((atr), struct queue_sysfs_entry, attr)

static ssize_t
queue_attr_show(struct kobject *kobj, struct attribute *attr, char *page)
{
	struct queue_sysfs_entry *entry = to_queue(attr);
	struct gendisk *disk = container_of(kobj, struct gendisk, queue_kobj);
	ssize_t res;

	if (!entry->show)
		return -EIO;
	mutex_lock(&disk->queue->sysfs_lock);
	res = entry->show(disk, page);
	mutex_unlock(&disk->queue->sysfs_lock);
	return res;
}

static ssize_t
queue_attr_store(struct kobject *kobj, struct attribute *attr,
		    const char *page, size_t length)
{
	struct queue_sysfs_entry *entry = to_queue(attr);
	struct gendisk *disk = container_of(kobj, struct gendisk, queue_kobj);
	struct request_queue *q = disk->queue;
	ssize_t res;

	if (!entry->store)
		return -EIO;

	/*
	 * If the attribute needs to load a module, do it before freezing the
	 * queue to ensure that the module file can be read when the request
	 * queue is the one for the device storing the module file.
	 */
	if (entry->load_module) {
		res = entry->load_module(disk, page, length);
		if (res)
			return res;
	}

	blk_mq_freeze_queue(q);
	mutex_lock(&q->sysfs_lock);
	res = entry->store(disk, page, length);
	mutex_unlock(&q->sysfs_lock);
	blk_mq_unfreeze_queue(q);
	return res;
}

static const struct sysfs_ops queue_sysfs_ops = {
	.show	= queue_attr_show,
	.store	= queue_attr_store,
};

static const struct attribute_group *blk_queue_attr_groups[] = {
	&queue_attr_group,
	&blk_mq_queue_attr_group,
	NULL
};

static void blk_queue_release(struct kobject *kobj)
{
	/* nothing to do here, all data is associated with the parent gendisk */
}

static const struct kobj_type blk_queue_ktype = {
	.default_groups = blk_queue_attr_groups,
	.sysfs_ops	= &queue_sysfs_ops,
	.release	= blk_queue_release,
};

static void blk_debugfs_remove(struct gendisk *disk)
{
	struct request_queue *q = disk->queue;

	mutex_lock(&q->debugfs_mutex);
	blk_trace_shutdown(q);
	debugfs_remove_recursive(q->debugfs_dir);
	q->debugfs_dir = NULL;
	q->sched_debugfs_dir = NULL;
	q->rqos_debugfs_dir = NULL;
	mutex_unlock(&q->debugfs_mutex);
}

/**
 * blk_register_queue - register a block layer queue with sysfs
 * @disk: Disk of which the request queue should be registered with sysfs.
 */
int blk_register_queue(struct gendisk *disk)
{
	struct request_queue *q = disk->queue;
	int ret;
block: split .sysfs_lock into two locks
The kernfs built-in lock of 'kn->count' is held in sysfs .show/.store
path. Meantime, inside block's .show/.store callback, q->sysfs_lock is
required.
However, when mq & iosched kobjects are removed via
blk_mq_unregister_dev() & elv_unregister_queue(), q->sysfs_lock is held
too. This causes an AB-BA deadlock because the kernfs built-in lock of
'kn->count' is required inside kobject_del() too, see the lockdep warning[1].
On the other hand, it isn't necessary to acquire q->sysfs_lock for
both blk_mq_unregister_dev() & elv_unregister_queue() because
clearing the REGISTERED flag prevents stores to 'queue/scheduler'
from happening. Also sysfs write(store) is exclusive, so it is not
necessary to hold the lock for elv_unregister_queue() when it is
called in the elevator switching path.
So split .sysfs_lock into two: one is still named .sysfs_lock for
covering sync .store, the other one is named .sysfs_dir_lock
for covering kobjects and related status change.
sysfs itself can handle the race between add/remove kobjects and
showing/storing attributes under kobjects. For switching scheduler
via storing to 'queue/scheduler', we use the queue flag of
QUEUE_FLAG_REGISTERED with .sysfs_lock for avoiding the race, then
we can avoid holding .sysfs_lock while removing/adding kobjects.
[1] lockdep warning
======================================================
WARNING: possible circular locking dependency detected
5.3.0-rc3-00044-g73277fc75ea0 #1380 Not tainted
------------------------------------------------------
rmmod/777 is trying to acquire lock:
00000000ac50e981 (kn->count#202){++++}, at: kernfs_remove_by_name_ns+0x59/0x72
but task is already holding lock:
00000000fb16ae21 (&q->sysfs_lock){+.+.}, at: blk_unregister_queue+0x78/0x10b
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&q->sysfs_lock){+.+.}:
__lock_acquire+0x95f/0xa2f
lock_acquire+0x1b4/0x1e8
__mutex_lock+0x14a/0xa9b
blk_mq_hw_sysfs_show+0x63/0xb6
sysfs_kf_seq_show+0x11f/0x196
seq_read+0x2cd/0x5f2
vfs_read+0xc7/0x18c
ksys_read+0xc4/0x13e
do_syscall_64+0xa7/0x295
entry_SYSCALL_64_after_hwframe+0x49/0xbe
-> #0 (kn->count#202){++++}:
check_prev_add+0x5d2/0xc45
validate_chain+0xed3/0xf94
__lock_acquire+0x95f/0xa2f
lock_acquire+0x1b4/0x1e8
__kernfs_remove+0x237/0x40b
kernfs_remove_by_name_ns+0x59/0x72
remove_files+0x61/0x96
sysfs_remove_group+0x81/0xa4
sysfs_remove_groups+0x3b/0x44
kobject_del+0x44/0x94
blk_mq_unregister_dev+0x83/0xdd
blk_unregister_queue+0xa0/0x10b
del_gendisk+0x259/0x3fa
null_del_dev+0x8b/0x1c3 [null_blk]
null_exit+0x5c/0x95 [null_blk]
__se_sys_delete_module+0x204/0x337
do_syscall_64+0xa7/0x295
entry_SYSCALL_64_after_hwframe+0x49/0xbe
other info that might help us debug this:
Possible unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(&q->sysfs_lock);
                               lock(kn->count#202);
                               lock(&q->sysfs_lock);
  lock(kn->count#202);
*** DEADLOCK ***
2 locks held by rmmod/777:
#0: 00000000e69bd9de (&lock){+.+.}, at: null_exit+0x2e/0x95 [null_blk]
#1: 00000000fb16ae21 (&q->sysfs_lock){+.+.}, at: blk_unregister_queue+0x78/0x10b
stack backtrace:
CPU: 0 PID: 777 Comm: rmmod Not tainted 5.3.0-rc3-00044-g73277fc75ea0 #1380
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS ?-20180724_192412-buildhw-07.phx4
Call Trace:
dump_stack+0x9a/0xe6
check_noncircular+0x207/0x251
? print_circular_bug+0x32a/0x32a
? find_usage_backwards+0x84/0xb0
check_prev_add+0x5d2/0xc45
validate_chain+0xed3/0xf94
? check_prev_add+0xc45/0xc45
? mark_lock+0x11b/0x804
? check_usage_forwards+0x1ca/0x1ca
__lock_acquire+0x95f/0xa2f
lock_acquire+0x1b4/0x1e8
? kernfs_remove_by_name_ns+0x59/0x72
__kernfs_remove+0x237/0x40b
? kernfs_remove_by_name_ns+0x59/0x72
? kernfs_next_descendant_post+0x7d/0x7d
? strlen+0x10/0x23
? strcmp+0x22/0x44
kernfs_remove_by_name_ns+0x59/0x72
remove_files+0x61/0x96
sysfs_remove_group+0x81/0xa4
sysfs_remove_groups+0x3b/0x44
kobject_del+0x44/0x94
blk_mq_unregister_dev+0x83/0xdd
blk_unregister_queue+0xa0/0x10b
del_gendisk+0x259/0x3fa
? disk_events_poll_msecs_store+0x12b/0x12b
? check_flags+0x1ea/0x204
? mark_held_locks+0x1f/0x7a
null_del_dev+0x8b/0x1c3 [null_blk]
null_exit+0x5c/0x95 [null_blk]
__se_sys_delete_module+0x204/0x337
? free_module+0x39f/0x39f
? blkcg_maybe_throttle_current+0x8a/0x718
? rwlock_bug+0x62/0x62
? __blkcg_punt_bio_submit+0xd0/0xd0
? trace_hardirqs_on_thunk+0x1a/0x20
? mark_held_locks+0x1f/0x7a
? do_syscall_64+0x4c/0x295
do_syscall_64+0xa7/0x295
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fb696cdbe6b
Code: 73 01 c3 48 8b 0d 1d 20 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 008
RSP: 002b:00007ffec9588788 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559e589137c0 RCX: 00007fb696cdbe6b
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559e58913828
RBP: 0000000000000000 R08: 00007ffec9587701 R09: 0000000000000000
R10: 00007fb696d4eae0 R11: 0000000000000206 R12: 00007ffec95889b0
R13: 00007ffec95896b3 R14: 0000559e58913260 R15: 0000559e589137c0
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-27 04:01:48 -07:00
	mutex_lock(&q->sysfs_dir_lock);
	kobject_init(&disk->queue_kobj, &blk_queue_ktype);
	ret = kobject_add(&disk->queue_kobj, &disk_to_dev(disk)->kobj, "queue");
	if (ret < 0)
		goto out_put_queue_kobj;

	if (queue_is_mq(q)) {
		ret = blk_mq_sysfs_register(disk);
		if (ret)
			goto out_put_queue_kobj;
	}
	mutex_lock(&q->sysfs_lock);

	mutex_lock(&q->debugfs_mutex);
	q->debugfs_dir = debugfs_create_dir(disk->disk_name, blk_debugfs_root);
	if (queue_is_mq(q))
		blk_mq_debugfs_register(q);
	mutex_unlock(&q->debugfs_mutex);
block: Add independent access ranges support
The Concurrent Positioning Ranges VPD page (for SCSI) and data log page
(for ATA) contain parameters describing the set of contiguous LBAs that
can be served independently by a single LUN multi-actuator hard-disk.
Similarly, a logically defined block device composed of multiple disks
can in some cases execute requests directed at different sector ranges
in parallel. A dm-linear device aggregating 2 block devices together is
an example.
This patch implements support for exposing a block device's independent
access ranges to the user through sysfs to allow optimizing device
accesses to increase performance.
To describe the set of independent sector ranges of a device (actuators
of a multi-actuator HDD or table entries of a dm-linear device),
the type struct blk_independent_access_ranges is introduced. This
structure describes the sector ranges using an array of
struct blk_independent_access_range structures. This range structure
defines the start sector and number of sectors of the access range.
The ranges in the array cannot overlap and must contain all sectors
within the device capacity.
The function disk_set_independent_access_ranges() allows a device
driver to signal to the block layer that a device has multiple
independent access ranges. In this case, a struct
blk_independent_access_ranges is attached to the device request queue
by the function disk_set_independent_access_ranges(). The function
disk_alloc_independent_access_ranges() is provided for drivers to
allocate this structure.
struct blk_independent_access_ranges contains kobjects (struct kobject)
to expose to the user through sysfs the set of independent access ranges
supported by a device. When the device is initialized, sysfs
registration of the ranges information is done from blk_register_queue()
using the block layer internal function
disk_register_independent_access_ranges(). If a driver calls
disk_set_independent_access_ranges() for a registered queue, e.g. when a
device is revalidated, disk_set_independent_access_ranges() will execute
disk_register_independent_access_ranges() to update the sysfs attribute
files. The sysfs file structure created starts from the
independent_access_ranges sub-directory and contains the start sector
and number of sectors of each range, with the information for each range
grouped in numbered sub-directories.
E.g. for a dual actuator HDD, the user sees:
$ tree /sys/block/sdk/queue/independent_access_ranges/
/sys/block/sdk/queue/independent_access_ranges/
|-- 0
| |-- nr_sectors
| `-- sector
`-- 1
|-- nr_sectors
`-- sector
For a regular device with a single access range, the
independent_access_ranges sysfs directory does not exist.
Device revalidation may lead to changes to this structure and to the
attribute values. When manipulated, the queue sysfs_lock and
sysfs_dir_lock mutexes are held for atomicity, similarly to how the
blk-mq and elevator sysfs queue sub-directories are protected.
The code related to the management of independent access ranges is
added in the new file block/blk-ia-ranges.c.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20211027022223.183838-2-damien.lemoal@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-26 19:22:19 -07:00
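The constraint stated in the commit message above (ranges cannot overlap and must contain all sectors within the device capacity) can be illustrated with a small userspace sketch. This is not the kernel's check in block/blk-ia-ranges.c: the struct ia_range type and the ia_ranges_valid() helper are hypothetical names chosen for illustration, and the sketch assumes the array is already sorted by start sector.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace mirror of one access range; not the kernel struct. */
struct ia_range {
	uint64_t sector;     /* first sector of the range */
	uint64_t nr_sectors; /* number of sectors in the range */
};

/*
 * Return true if the ranges, taken in array order, are non-empty, start
 * exactly where the previous one ended (so no overlap and no hole), and
 * together cover [0, capacity). Assumes the array is sorted by start sector.
 */
static bool ia_ranges_valid(const struct ia_range *r, size_t n,
			    uint64_t capacity)
{
	uint64_t next = 0; /* first sector not yet covered */

	assert(r != NULL || n == 0);
	if (n == 0)
		return false;

	for (size_t i = 0; i < n; i++) {
		/* An empty range, a hole, or an overlap is invalid. */
		if (r[i].nr_sectors == 0 || r[i].sector != next)
			return false;
		/* Reject ranges running past the capacity (overflow-safe). */
		if (r[i].nr_sectors > capacity - next)
			return false;
		next += r[i].nr_sectors;
	}

	/* Every sector up to the capacity must be covered. */
	return next == capacity;
}
```

For the dual actuator example, the two sector/nr_sectors pairs read from the numbered sysfs sub-directories could be passed to such a helper together with the device capacity in sectors.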
|
|
|
|
2022-06-28 23:20:13 -07:00
|
|
|
ret = disk_register_independent_access_ranges(disk);
|
2021-10-26 19:22:19 -07:00
|
|
|
if (ret)
|
2022-11-13 21:26:35 -07:00
|
|
|
goto out_debugfs_remove;
|
2021-10-26 19:22:19 -07:00
|
|
|
|
2018-11-15 12:22:51 -07:00
|
|
|
if (q->elevator) {
|
block: split .sysfs_lock into two locks
The kernfs built-in lock of 'kn->count' is held in the sysfs .show/.store
path. Meanwhile, inside the block layer's .show/.store callbacks,
q->sysfs_lock is required.
However, when mq & iosched kobjects are removed via
blk_mq_unregister_dev() & elv_unregister_queue(), q->sysfs_lock is held
too. This causes an AB-BA deadlock because the kernfs built-in lock of
'kn->count' is also required inside kobject_del(); see the lockdep warning[1].
On the other hand, it isn't necessary to acquire q->sysfs_lock for
both blk_mq_unregister_dev() & elv_unregister_queue() because
clearing the REGISTERED flag prevents stores to 'queue/scheduler'
from happening. Also, sysfs write (store) is exclusive, so it is not
necessary to hold the lock for elv_unregister_queue() when it is
called in the elevator switching path.
So split .sysfs_lock into two: one is still named .sysfs_lock and
covers sync .store; the other is named .sysfs_dir_lock and
covers kobjects and related status changes.
sysfs itself can handle the race between adding/removing kobjects and
showing/storing attributes under kobjects. For switching the scheduler
via a store to 'queue/scheduler', we use the queue flag
QUEUE_FLAG_REGISTERED with .sysfs_lock to avoid the race, so
we can avoid holding .sysfs_lock while removing/adding kobjects.
[1] lockdep warning
======================================================
WARNING: possible circular locking dependency detected
5.3.0-rc3-00044-g73277fc75ea0 #1380 Not tainted
------------------------------------------------------
rmmod/777 is trying to acquire lock:
00000000ac50e981 (kn->count#202){++++}, at: kernfs_remove_by_name_ns+0x59/0x72
but task is already holding lock:
00000000fb16ae21 (&q->sysfs_lock){+.+.}, at: blk_unregister_queue+0x78/0x10b
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&q->sysfs_lock){+.+.}:
__lock_acquire+0x95f/0xa2f
lock_acquire+0x1b4/0x1e8
__mutex_lock+0x14a/0xa9b
blk_mq_hw_sysfs_show+0x63/0xb6
sysfs_kf_seq_show+0x11f/0x196
seq_read+0x2cd/0x5f2
vfs_read+0xc7/0x18c
ksys_read+0xc4/0x13e
do_syscall_64+0xa7/0x295
entry_SYSCALL_64_after_hwframe+0x49/0xbe
-> #0 (kn->count#202){++++}:
check_prev_add+0x5d2/0xc45
validate_chain+0xed3/0xf94
__lock_acquire+0x95f/0xa2f
lock_acquire+0x1b4/0x1e8
__kernfs_remove+0x237/0x40b
kernfs_remove_by_name_ns+0x59/0x72
remove_files+0x61/0x96
sysfs_remove_group+0x81/0xa4
sysfs_remove_groups+0x3b/0x44
kobject_del+0x44/0x94
blk_mq_unregister_dev+0x83/0xdd
blk_unregister_queue+0xa0/0x10b
del_gendisk+0x259/0x3fa
null_del_dev+0x8b/0x1c3 [null_blk]
null_exit+0x5c/0x95 [null_blk]
__se_sys_delete_module+0x204/0x337
do_syscall_64+0xa7/0x295
entry_SYSCALL_64_after_hwframe+0x49/0xbe
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&q->sysfs_lock);
lock(kn->count#202);
lock(&q->sysfs_lock);
lock(kn->count#202);
*** DEADLOCK ***
2 locks held by rmmod/777:
#0: 00000000e69bd9de (&lock){+.+.}, at: null_exit+0x2e/0x95 [null_blk]
#1: 00000000fb16ae21 (&q->sysfs_lock){+.+.}, at: blk_unregister_queue+0x78/0x10b
stack backtrace:
CPU: 0 PID: 777 Comm: rmmod Not tainted 5.3.0-rc3-00044-g73277fc75ea0 #1380
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS ?-20180724_192412-buildhw-07.phx4
Call Trace:
dump_stack+0x9a/0xe6
check_noncircular+0x207/0x251
? print_circular_bug+0x32a/0x32a
? find_usage_backwards+0x84/0xb0
check_prev_add+0x5d2/0xc45
validate_chain+0xed3/0xf94
? check_prev_add+0xc45/0xc45
? mark_lock+0x11b/0x804
? check_usage_forwards+0x1ca/0x1ca
__lock_acquire+0x95f/0xa2f
lock_acquire+0x1b4/0x1e8
? kernfs_remove_by_name_ns+0x59/0x72
__kernfs_remove+0x237/0x40b
? kernfs_remove_by_name_ns+0x59/0x72
? kernfs_next_descendant_post+0x7d/0x7d
? strlen+0x10/0x23
? strcmp+0x22/0x44
kernfs_remove_by_name_ns+0x59/0x72
remove_files+0x61/0x96
sysfs_remove_group+0x81/0xa4
sysfs_remove_groups+0x3b/0x44
kobject_del+0x44/0x94
blk_mq_unregister_dev+0x83/0xdd
blk_unregister_queue+0xa0/0x10b
del_gendisk+0x259/0x3fa
? disk_events_poll_msecs_store+0x12b/0x12b
? check_flags+0x1ea/0x204
? mark_held_locks+0x1f/0x7a
null_del_dev+0x8b/0x1c3 [null_blk]
null_exit+0x5c/0x95 [null_blk]
__se_sys_delete_module+0x204/0x337
? free_module+0x39f/0x39f
? blkcg_maybe_throttle_current+0x8a/0x718
? rwlock_bug+0x62/0x62
? __blkcg_punt_bio_submit+0xd0/0xd0
? trace_hardirqs_on_thunk+0x1a/0x20
? mark_held_locks+0x1f/0x7a
? do_syscall_64+0x4c/0x295
do_syscall_64+0xa7/0x295
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fb696cdbe6b
Code: 73 01 c3 48 8b 0d 1d 20 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 008
RSP: 002b:00007ffec9588788 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559e589137c0 RCX: 00007fb696cdbe6b
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559e58913828
RBP: 0000000000000000 R08: 00007ffec9587701 R09: 0000000000000000
R10: 00007fb696d4eae0 R11: 0000000000000206 R12: 00007ffec95889b0
R13: 00007ffec95896b3 R14: 0000559e58913260 R15: 0000559e589137c0
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-27 04:01:48 -07:00
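The AB-BA inversion reported in the lockdep warning above can be modelled with a tiny dependency graph. The following is a simplified userspace sketch of the idea only, not how lockdep is implemented: the lock identifiers, the dep table, and record_acquire() are hypothetical names. An edge "A then B" is recorded when B is acquired while A is held, and a new edge is rejected if it would close a cycle.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lock classes standing in for the locks named in the commit. */
enum { LOCK_SYSFS, LOCK_KN_COUNT, LOCK_SYSFS_DIR, NR_LOCKS };

/* dep[a][b] is true once some path acquired b while holding a. */
static bool dep[NR_LOCKS][NR_LOCKS];

/* Depth-first search: is 'to' reachable from 'from' along recorded edges? */
static bool reachable(int from, int to, bool *seen)
{
	if (from == to)
		return true;
	seen[from] = true;
	for (int next = 0; next < NR_LOCKS; next++)
		if (dep[from][next] && !seen[next] &&
		    reachable(next, to, seen))
			return true;
	return false;
}

/*
 * Record "acquire 'want' while holding 'held'". Returns false, and records
 * nothing, if the new edge would close a cycle (an AB-BA inversion).
 */
static bool record_acquire(int held, int want)
{
	bool seen[NR_LOCKS] = { false };

	assert(held >= 0 && held < NR_LOCKS);
	assert(want >= 0 && want < NR_LOCKS);

	/* If 'held' is already reachable from 'want', the order is inverted. */
	if (reachable(want, held, seen))
		return false;
	dep[held][want] = true;
	return true;
}
```

In this model, the .show path records kn->count then sysfs_lock, so the old unregister path (sysfs_lock then kn->count) is rejected, while holding only sysfs_dir_lock around kobject removal introduces no cycle.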
|
|
|
ret = elv_register_queue(q, false);
|
2021-10-26 19:22:19 -07:00
|
|
|
if (ret)
|
2022-11-13 21:26:35 -07:00
|
|
|
goto out_unregister_ia_ranges;
|
2008-01-29 06:51:59 -07:00
|
|
|
}
|
2019-08-27 04:01:48 -07:00
|
|
|
|
2022-11-13 21:26:33 -07:00
|
|
|
ret = blk_crypto_sysfs_register(disk);
|
blk-crypto: show crypto capabilities in sysfs
Add sysfs files that expose the inline encryption capabilities of
request queues:
/sys/block/$disk/queue/crypto/max_dun_bits
/sys/block/$disk/queue/crypto/modes/$mode
/sys/block/$disk/queue/crypto/num_keyslots
Userspace can use these new files to decide what encryption settings to
use, or whether to use inline encryption at all. This also brings the
crypto capabilities in line with the other queue properties, which are
already discoverable via the queue directory in sysfs.
Design notes:
- Place the new files in a new subdirectory "crypto" to group them
together and to avoid complicating the main "queue" directory. This
also makes it possible to replace "crypto" with a symlink later if
we ever make the blk_crypto_profiles into real kobjects (see below).
- It was necessary to define a new kobject that corresponds to the
crypto subdirectory. For now, this kobject just contains a pointer
to the blk_crypto_profile. Note that multiple queues (and hence
multiple such kobjects) may refer to the same blk_crypto_profile.
An alternative design would more closely match the current kernel
data structures: the blk_crypto_profile could be a kobject itself,
located directly under the host controller device's kobject, while
/sys/block/$disk/queue/crypto would be a symlink to it.
I decided not to do that for now because it would require a lot more
changes, such as no longer embedding blk_crypto_profile in other
structures, and also because I'm not sure we can rule out moving the
crypto capabilities into 'struct queue_limits' in the future. (Even
if multiple queues share the same crypto engine, maybe the supported
data unit sizes could differ due to other queue properties.) It
would also still be possible to switch to that design later without
breaking userspace, by replacing the directory with a symlink.
- Use "max_dun_bits" instead of "max_dun_bytes". Currently, the
kernel internally stores this value in bytes, but that's an
implementation detail. It probably makes more sense to talk about
this value in bits, and choosing bits is more future-proof.
- "modes" is a sub-subdirectory, since there may be multiple supported
crypto modes, sysfs is supposed to have one value per file, and it
makes sense to group all the mode files together.
- Each mode had to be named. The crypto API names like "xts(aes)" are
not appropriate because they don't specify the key size. Therefore,
I assigned new names. The exact names chosen are arbitrary, but
they happen to match the names used in log messages in fs/crypto/.
- The "num_keyslots" file is a bit different from the others in that
it is only useful to know for performance reasons. However, it's
included as it can still be useful. For example, a user might not
want to use inline encryption if there aren't very many keyslots.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20220124215938.2769-4-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-01-24 14:59:38 -07:00
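As an illustration of how userspace might consume the files listed in the commit message above, here is a minimal sketch that builds the attribute path and parses a single unsigned value. The helper names crypto_attr_path() and read_attr_ull() are hypothetical, and the sketch assumes that the attribute holds one unsigned decimal number, in line with the one-value-per-file convention mentioned in the design notes.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Build "/sys/block/<disk>/queue/crypto/<attr>" into buf.
 * Returns 0 on success, -1 if the disk name contains a '/' or the
 * result does not fit in the buffer.
 */
static int crypto_attr_path(char *buf, size_t len, const char *disk,
			    const char *attr)
{
	int n;

	if (strchr(disk, '/'))
		return -1;
	n = snprintf(buf, len, "/sys/block/%s/queue/crypto/%s", disk, attr);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Read one unsigned decimal value from a one-value-per-file attribute.
 * Returns 0 on success, -1 if the file cannot be opened or parsed.
 */
static int read_attr_ull(const char *path, unsigned long long *out)
{
	char line[64];
	char *end;
	FILE *f;

	assert(path != NULL && out != NULL);

	f = fopen(path, "r");
	if (!f)
		return -1; /* for instance, the file does not exist */
	if (!fgets(line, sizeof(line), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	errno = 0;
	*out = strtoull(line, &end, 10);
	if (errno || end == line)
		return -1;
	return 0;
}
```

A caller could, for example, build the path for "num_keyslots" on a given disk and fall back to not using inline encryption when read_attr_ull() fails or the value is small.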
|
|
|
if (ret)
|
2022-11-13 21:26:35 -07:00
|
|
|
goto out_elv_unregister;
|
2022-01-24 14:59:38 -07:00
|
|
|
|
block: split .sysfs_lock into two locks
The kernfs built-in lock of 'kn->count' is held in the sysfs .show/.store
path. Meanwhile, inside the block layer's .show/.store callbacks,
q->sysfs_lock is required.
However, when mq & iosched kobjects are removed via
blk_mq_unregister_dev() & elv_unregister_queue(), q->sysfs_lock is held
too. This causes an AB-BA deadlock because the kernfs built-in lock of
'kn->count' is also required inside kobject_del(); see the lockdep warning[1].
On the other hand, it isn't necessary to acquire q->sysfs_lock for
both blk_mq_unregister_dev() & elv_unregister_queue() because
clearing the REGISTERED flag prevents stores to 'queue/scheduler'
from happening. Also, sysfs write (store) is exclusive, so it is not
necessary to hold the lock for elv_unregister_queue() when it is
called in the elevator switching path.
So split .sysfs_lock into two: one is still named .sysfs_lock and
covers sync .store; the other is named .sysfs_dir_lock and
covers kobjects and related status changes.
sysfs itself can handle the race between adding/removing kobjects and
showing/storing attributes under kobjects. For switching the scheduler
via a store to 'queue/scheduler', we use the queue flag
QUEUE_FLAG_REGISTERED with .sysfs_lock to avoid the race, so
we can avoid holding .sysfs_lock while removing/adding kobjects.
[1] lockdep warning
======================================================
WARNING: possible circular locking dependency detected
5.3.0-rc3-00044-g73277fc75ea0 #1380 Not tainted
------------------------------------------------------
rmmod/777 is trying to acquire lock:
00000000ac50e981 (kn->count#202){++++}, at: kernfs_remove_by_name_ns+0x59/0x72
but task is already holding lock:
00000000fb16ae21 (&q->sysfs_lock){+.+.}, at: blk_unregister_queue+0x78/0x10b
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&q->sysfs_lock){+.+.}:
__lock_acquire+0x95f/0xa2f
lock_acquire+0x1b4/0x1e8
__mutex_lock+0x14a/0xa9b
blk_mq_hw_sysfs_show+0x63/0xb6
sysfs_kf_seq_show+0x11f/0x196
seq_read+0x2cd/0x5f2
vfs_read+0xc7/0x18c
ksys_read+0xc4/0x13e
do_syscall_64+0xa7/0x295
entry_SYSCALL_64_after_hwframe+0x49/0xbe
-> #0 (kn->count#202){++++}:
check_prev_add+0x5d2/0xc45
validate_chain+0xed3/0xf94
__lock_acquire+0x95f/0xa2f
lock_acquire+0x1b4/0x1e8
__kernfs_remove+0x237/0x40b
kernfs_remove_by_name_ns+0x59/0x72
remove_files+0x61/0x96
sysfs_remove_group+0x81/0xa4
sysfs_remove_groups+0x3b/0x44
kobject_del+0x44/0x94
blk_mq_unregister_dev+0x83/0xdd
blk_unregister_queue+0xa0/0x10b
del_gendisk+0x259/0x3fa
null_del_dev+0x8b/0x1c3 [null_blk]
null_exit+0x5c/0x95 [null_blk]
__se_sys_delete_module+0x204/0x337
do_syscall_64+0xa7/0x295
entry_SYSCALL_64_after_hwframe+0x49/0xbe
other info that might help us debug this:
Possible unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(&q->sysfs_lock);
                               lock(kn->count#202);
                               lock(&q->sysfs_lock);
  lock(kn->count#202);
*** DEADLOCK ***
2 locks held by rmmod/777:
#0: 00000000e69bd9de (&lock){+.+.}, at: null_exit+0x2e/0x95 [null_blk]
#1: 00000000fb16ae21 (&q->sysfs_lock){+.+.}, at: blk_unregister_queue+0x78/0x10b
stack backtrace:
CPU: 0 PID: 777 Comm: rmmod Not tainted 5.3.0-rc3-00044-g73277fc75ea0 #1380
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS ?-20180724_192412-buildhw-07.phx4
Call Trace:
dump_stack+0x9a/0xe6
check_noncircular+0x207/0x251
? print_circular_bug+0x32a/0x32a
? find_usage_backwards+0x84/0xb0
check_prev_add+0x5d2/0xc45
validate_chain+0xed3/0xf94
? check_prev_add+0xc45/0xc45
? mark_lock+0x11b/0x804
? check_usage_forwards+0x1ca/0x1ca
__lock_acquire+0x95f/0xa2f
lock_acquire+0x1b4/0x1e8
? kernfs_remove_by_name_ns+0x59/0x72
__kernfs_remove+0x237/0x40b
? kernfs_remove_by_name_ns+0x59/0x72
? kernfs_next_descendant_post+0x7d/0x7d
? strlen+0x10/0x23
? strcmp+0x22/0x44
kernfs_remove_by_name_ns+0x59/0x72
remove_files+0x61/0x96
sysfs_remove_group+0x81/0xa4
sysfs_remove_groups+0x3b/0x44
kobject_del+0x44/0x94
blk_mq_unregister_dev+0x83/0xdd
blk_unregister_queue+0xa0/0x10b
del_gendisk+0x259/0x3fa
? disk_events_poll_msecs_store+0x12b/0x12b
? check_flags+0x1ea/0x204
? mark_held_locks+0x1f/0x7a
null_del_dev+0x8b/0x1c3 [null_blk]
null_exit+0x5c/0x95 [null_blk]
__se_sys_delete_module+0x204/0x337
? free_module+0x39f/0x39f
? blkcg_maybe_throttle_current+0x8a/0x718
? rwlock_bug+0x62/0x62
? __blkcg_punt_bio_submit+0xd0/0xd0
? trace_hardirqs_on_thunk+0x1a/0x20
? mark_held_locks+0x1f/0x7a
? do_syscall_64+0x4c/0x295
do_syscall_64+0xa7/0x295
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fb696cdbe6b
Code: 73 01 c3 48 8b 0d 1d 20 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 008
RSP: 002b:00007ffec9588788 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559e589137c0 RCX: 00007fb696cdbe6b
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559e58913828
RBP: 0000000000000000 R08: 00007ffec9587701 R09: 0000000000000000
R10: 00007fb696d4eae0 R11: 0000000000000206 R12: 00007ffec95889b0
R13: 00007ffec95896b3 R14: 0000559e58913260 R15: 0000559e589137c0
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-27 04:01:48 -07:00
|
|
|
blk_queue_flag_set(QUEUE_FLAG_REGISTERED, q);
|
2023-02-03 08:03:49 -07:00
|
|
|
wbt_enable_default(disk);
|
block: split .sysfs_lock into two locks
2019-08-27 04:01:48 -07:00
|
|
|
|
|
|
|
/* Now everything is ready and send out KOBJ_ADD uevent */
|
2022-11-13 21:26:36 -07:00
|
|
|
kobject_uevent(&disk->queue_kobj, KOBJ_ADD);
|
2020-10-08 20:26:32 -07:00
|
|
|
if (q->elevator)
|
block: split .sysfs_lock into two locks
2019-08-27 04:01:48 -07:00
|
|
|
		kobject_uevent(&q->elevator->kobj, KOBJ_ADD);
	mutex_unlock(&q->sysfs_lock);
	mutex_unlock(&q->sysfs_dir_lock);
|
2021-06-08 18:58:22 -07:00
|
|
|
|
|
|
|
	/*
	 * SCSI probing may synchronously create and destroy a lot of
	 * request_queues for non-existent devices. Shutting down a fully
	 * functional queue takes measurable wallclock time as RCU grace
	 * periods are involved. To avoid excessive latency in these
	 * cases, a request_queue starts out in a degraded mode which is
	 * faster to shut down and is made fully functional here as
	 * request_queues for non-existent devices never get registered.
	 */
	if (!blk_queue_init_done(q)) {
		blk_queue_flag_set(QUEUE_FLAG_INIT_DONE, q);
		percpu_ref_switch_to_percpu(&q->q_usage_counter);
	}
|
|
|
|
|
block: Add independent access ranges support
The Concurrent Positioning Ranges VPD page (for SCSI) and data log page
(for ATA) contain parameters describing the set of contiguous LBAs that
can be served independently by a single LUN multi-actuator hard-disk.
Similarly, a logically defined block device composed of multiple disks
can in some cases execute requests directed at different sector ranges
in parallel. A dm-linear device aggregating 2 block devices together is
an example.
This patch implements support for exposing a block device's independent
access ranges to the user through sysfs to allow optimizing device
accesses to increase performance.
To describe the set of independent sector ranges of a device (actuators
of a multi-actuator HDD or table entries of a dm-linear device),
the type struct blk_independent_access_ranges is introduced. This
structure describes the sector ranges using an array of
struct blk_independent_access_range structures. This range structure
defines the start sector and number of sectors of the access range.
The ranges in the array cannot overlap and must contain all sectors
within the device capacity.
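The constraint on the ranges (no overlap, full coverage of the device capacity) can be sketched as a small check. This is a userspace illustration only: struct toy_ia_range and toy_ia_ranges_valid() are hypothetical names, not the kernel structures, and the sketch additionally assumes the array is sorted by start sector.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace mirror of one access range (not the kernel struct). */
struct toy_ia_range {
	uint64_t sector;     /* first sector of the range */
	uint64_t nr_sectors; /* number of sectors in the range */
};

/*
 * Returns true if the ranges, assumed sorted by start sector, are
 * non-empty, do not overlap, leave no gaps, and cover exactly the
 * sectors [0, capacity).
 */
bool toy_ia_ranges_valid(const struct toy_ia_range *r, size_t nr,
			 uint64_t capacity)
{
	uint64_t next = 0; /* sector where the next range must start */
	size_t i;

	assert(r != NULL || nr == 0);
	for (i = 0; i < nr; i++) {
		/* A gap or an overlap shows up as a start sector mismatch. */
		if (r[i].nr_sectors == 0 || r[i].sector != next)
			return false;
		/* The range must not run past the end of the device. */
		if (r[i].nr_sectors > capacity - next)
			return false;
		next += r[i].nr_sectors;
	}
	return nr > 0 && next == capacity;
}
```

For a dual actuator disk of 2000 sectors, the two ranges {0, 1000} and {1000, 1000} pass this check, while a second range starting at sector 900 or 1200 does not.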
The function disk_set_independent_access_ranges() allows a device
driver to signal to the block layer that a device has multiple
independent access ranges. In this case, a struct
blk_independent_access_ranges is attached to the device request queue
by the function disk_set_independent_access_ranges(). The function
disk_alloc_independent_access_ranges() is provided for drivers to
allocate this structure.
struct blk_independent_access_ranges contains kobjects (struct kobject)
to expose to the user through sysfs the set of independent access ranges
supported by a device. When the device is initialized, sysfs
registration of the ranges information is done from blk_register_queue()
using the block layer internal function
disk_register_independent_access_ranges(). If a driver calls
disk_set_independent_access_ranges() for a registered queue, e.g. when a
device is revalidated, disk_set_independent_access_ranges() will execute
disk_register_independent_access_ranges() to update the sysfs attribute
files. The sysfs file structure created starts from the
independent_access_ranges sub-directory and contains the start sector
and number of sectors of each range, with the information for each range
grouped in numbered sub-directories.
E.g. for a dual actuator HDD, the user sees:
$ tree /sys/block/sdk/queue/independent_access_ranges/
/sys/block/sdk/queue/independent_access_ranges/
|-- 0
|   |-- nr_sectors
|   `-- sector
`-- 1
    |-- nr_sectors
    `-- sector
For a regular device with a single access range, the
independent_access_ranges sysfs directory does not exist.
Device revalidation may lead to changes to this structure and to the
attribute values. When manipulated, the queue sysfs_lock and
sysfs_dir_lock mutexes are held for atomicity, similarly to how the
blk-mq and elevator sysfs queue sub-directories are protected.
The code related to the management of independent access ranges is
added in the new file block/blk-ia-ranges.c.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20211027022223.183838-2-damien.lemoal@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-26 19:22:19 -07:00
|
|
|
return ret;
|
|
|
|
|
2022-11-13 21:26:35 -07:00
|
|
|
out_elv_unregister:
|
blk-crypto: show crypto capabilities in sysfs
Add sysfs files that expose the inline encryption capabilities of
request queues:
/sys/block/$disk/queue/crypto/max_dun_bits
/sys/block/$disk/queue/crypto/modes/$mode
/sys/block/$disk/queue/crypto/num_keyslots
Userspace can use these new files to decide what encryption settings to
use, or whether to use inline encryption at all. This also brings the
crypto capabilities in line with the other queue properties, which are
already discoverable via the queue directory in sysfs.
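As an illustration of how userspace might consume these files, here is a minimal sketch. The helper names toy_crypto_attr_path() and toy_read_uint_attr() are hypothetical, and the sketch assumes that max_dun_bits and num_keyslots each hold a single unsigned decimal value.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: builds the sysfs path of a crypto attribute for
 * a disk. Returns 0 on success, -1 if the buffer is too small.
 */
int toy_crypto_attr_path(char *buf, size_t len, const char *disk,
			 const char *attr)
{
	int n;

	assert(buf != NULL && disk != NULL && attr != NULL);
	n = snprintf(buf, len, "/sys/block/%s/queue/crypto/%s", disk, attr);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical helper: reads one unsigned decimal value from a file.
 * Returns 0 on success, -1 if the file is missing or malformed.
 */
int toy_read_uint_attr(const char *path, unsigned int *val)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1; /* e.g. the queue has no crypto directory */
	ok = fscanf(f, "%u", val) == 1;
	fclose(f);
	return ok ? 0 : -1;
}
```

A caller could build the path for "num_keyslots", read the value, and treat a failed read as "inline encryption not available on this queue".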
Design notes:
- Place the new files in a new subdirectory "crypto" to group them
together and to avoid complicating the main "queue" directory. This
also makes it possible to replace "crypto" with a symlink later if
we ever make the blk_crypto_profiles into real kobjects (see below).
- It was necessary to define a new kobject that corresponds to the
crypto subdirectory. For now, this kobject just contains a pointer
to the blk_crypto_profile. Note that multiple queues (and hence
multiple such kobjects) may refer to the same blk_crypto_profile.
An alternative design would more closely match the current kernel
data structures: the blk_crypto_profile could be a kobject itself,
located directly under the host controller device's kobject, while
/sys/block/$disk/queue/crypto would be a symlink to it.
I decided not to do that for now because it would require a lot more
changes, such as no longer embedding blk_crypto_profile in other
structures, and also because I'm not sure we can rule out moving the
crypto capabilities into 'struct queue_limits' in the future. (Even
if multiple queues share the same crypto engine, maybe the supported
data unit sizes could differ due to other queue properties.) It
would also still be possible to switch to that design later without
breaking userspace, by replacing the directory with a symlink.
- Use "max_dun_bits" instead of "max_dun_bytes". Currently, the
kernel internally stores this value in bytes, but that's an
implementation detail. It probably makes more sense to talk about
this value in bits, and choosing bits is more future-proof.
- "modes" is a sub-subdirectory, since there may be multiple supported
crypto modes, sysfs is supposed to have one value per file, and it
makes sense to group all the mode files together.
- Each mode had to be named. The crypto API names like "xts(aes)" are
not appropriate because they don't specify the key size. Therefore,
I assigned new names. The exact names chosen are arbitrary, but
they happen to match the names used in log messages in fs/crypto/.
- The "num_keyslots" file is a bit different from the others in that
it is only useful to know for performance reasons. However, it's
included as it can still be useful. For example, a user might not
want to use inline encryption if there aren't very many keyslots.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20220124215938.2769-4-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-01-24 14:59:38 -07:00
|
|
|
elv_unregister_queue(q);
|
2022-11-13 21:26:35 -07:00
|
|
|
out_unregister_ia_ranges:
|
block: Add independent access ranges support
2021-10-26 19:22:19 -07:00
|
|
|
disk_unregister_independent_access_ranges(disk);
|
2022-11-13 21:26:35 -07:00
|
|
|
out_debugfs_remove:
	blk_debugfs_remove(disk);
|
block: Add independent access ranges support
2021-10-26 19:22:19 -07:00
|
|
|
mutex_unlock(&q->sysfs_lock);
|
2022-11-13 21:26:36 -07:00
|
|
|
out_put_queue_kobj:
	kobject_put(&disk->queue_kobj);
|
2022-11-13 21:26:35 -07:00
|
|
|
mutex_unlock(&q->sysfs_dir_lock);
|
2017-02-14 20:27:38 -07:00
|
|
|
return ret;
|
2008-01-29 06:51:59 -07:00
|
|
|
}
|
|
|
|
|
2018-01-17 12:48:10 -07:00
|
|
|
/**
 * blk_unregister_queue - counterpart of blk_register_queue()
 * @disk: Disk of which the request queue should be unregistered from sysfs.
 *
 * Note: the caller is responsible for guaranteeing that this function is called
 * after blk_register_queue() has finished.
 */
|
2008-01-29 06:51:59 -07:00
|
|
|
void blk_unregister_queue(struct gendisk *disk)
{
	struct request_queue *q = disk->queue;
|
|
|
|
|
2008-04-21 00:51:06 -07:00
|
|
|
	if (WARN_ON(!q))
		return;
|
|
|
|
|
2018-01-08 20:01:13 -07:00
|
|
|
/* Return early if disk->queue was never registered. */
|
2019-08-27 04:01:47 -07:00
|
|
|
if (!blk_queue_registered(q))
|
2018-01-08 20:01:13 -07:00
|
|
|
return;
|
|
|
|
|
2018-01-11 12:11:01 -07:00
|
|
|
/*
|
2018-01-17 12:48:10 -07:00
|
|
|
* Since sysfs_remove_dir() prevents adding new directory entries
|
|
|
|
* before removal of existing entries starts, protect against
|
|
|
|
* concurrent elv_iosched_store() calls.
|
2018-01-11 12:11:01 -07:00
|
|
|
*/
|
2017-08-28 09:52:44 -07:00
|
|
|
mutex_lock(&q->sysfs_lock);
|
2018-03-07 18:10:04 -07:00
|
|
|
blk_queue_flag_clear(QUEUE_FLAG_REGISTERED, q);
|
block: split .sysfs_lock into two locks
The kernfs built-in lock of 'kn->count' is held in the sysfs .show/.store
path. Meanwhile, inside block's .show/.store callbacks, q->sysfs_lock is
required.
However, when the mq & iosched kobjects are removed via
blk_mq_unregister_dev() & elv_unregister_queue(), q->sysfs_lock is held
too. This causes an AB-BA deadlock because the kernfs built-in lock of
'kn->count' is required inside kobject_del() too, see the lockdep warning[1].
On the other hand, it isn't necessary to acquire q->sysfs_lock for
both blk_mq_unregister_dev() & elv_unregister_queue() because
clearing the REGISTERED flag prevents stores to 'queue/scheduler'
from happening. Also, sysfs write(store) is exclusive, so it is not
necessary to hold the lock for elv_unregister_queue() when it is
called in the elevator switching path.
So split .sysfs_lock into two: one is still named .sysfs_lock and
covers sync .store; the other is named .sysfs_dir_lock and
covers kobjects and related status changes.
sysfs itself can handle the race between adding/removing kobjects and
showing/storing attributes under kobjects. For switching the scheduler
via a store to 'queue/scheduler', we use the queue flag
QUEUE_FLAG_REGISTERED with .sysfs_lock to avoid the race, so
we can avoid holding .sysfs_lock while removing/adding kobjects.
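The locking scheme described above can be illustrated with a minimal userspace sketch. It is not kernel code: `struct demo_queue` and the `demo_*` functions are hypothetical stand-ins, pthread mutexes play the role of the two kernel mutexes, and a plain boolean models QUEUE_FLAG_REGISTERED. The point it tries to show is that the flag is cleared under `sysfs_lock`, the lock is dropped, and only then are the kobjects torn down under `sysfs_dir_lock`.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the request queue (not the kernel struct). */
struct demo_queue {
	pthread_mutex_t sysfs_lock;     /* serializes .store-style updates */
	pthread_mutex_t sysfs_dir_lock; /* covers adding/removing kobjects */
	bool registered;                /* models QUEUE_FLAG_REGISTERED */
	int nr_kobjects;                /* models the mq/iosched kobjects */
	int scheduler_id;               /* models the selected scheduler */
};

static void demo_queue_init(struct demo_queue *q)
{
	int rc;

	rc = pthread_mutex_init(&q->sysfs_lock, NULL);
	assert(rc == 0);
	rc = pthread_mutex_init(&q->sysfs_dir_lock, NULL);
	assert(rc == 0);
	q->registered = true;
	q->nr_kobjects = 2;
	q->scheduler_id = 0;
}

/* Models a store to 'queue/scheduler': refused once the flag is cleared. */
static int demo_scheduler_store(struct demo_queue *q, int id)
{
	int ret = -1;

	pthread_mutex_lock(&q->sysfs_lock);
	if (q->registered) {
		q->scheduler_id = id;
		ret = 0;
	}
	pthread_mutex_unlock(&q->sysfs_lock);
	return ret;
}

/* Models the unregister path with the split locks. */
static void demo_unregister(struct demo_queue *q)
{
	/* Step 1: clear the flag under sysfs_lock, then drop the lock. */
	pthread_mutex_lock(&q->sysfs_lock);
	q->registered = false;
	pthread_mutex_unlock(&q->sysfs_lock);

	/*
	 * Step 2: remove kobjects under sysfs_dir_lock only. sysfs_lock is
	 * not held here, so a show/store callback that takes sysfs_lock
	 * cannot form an AB-BA cycle with the removal.
	 */
	pthread_mutex_lock(&q->sysfs_dir_lock);
	q->nr_kobjects = 0;
	pthread_mutex_unlock(&q->sysfs_dir_lock);
}
```

In this sketch a store issued after `demo_unregister()` returns -1 and leaves `scheduler_id` untouched, which mirrors how the cleared flag makes it unnecessary to hold `.sysfs_lock` across kobject removal.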
[1] lockdep warning
======================================================
WARNING: possible circular locking dependency detected
5.3.0-rc3-00044-g73277fc75ea0 #1380 Not tainted
------------------------------------------------------
rmmod/777 is trying to acquire lock:
00000000ac50e981 (kn->count#202){++++}, at: kernfs_remove_by_name_ns+0x59/0x72
but task is already holding lock:
00000000fb16ae21 (&q->sysfs_lock){+.+.}, at: blk_unregister_queue+0x78/0x10b
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&q->sysfs_lock){+.+.}:
__lock_acquire+0x95f/0xa2f
lock_acquire+0x1b4/0x1e8
__mutex_lock+0x14a/0xa9b
blk_mq_hw_sysfs_show+0x63/0xb6
sysfs_kf_seq_show+0x11f/0x196
seq_read+0x2cd/0x5f2
vfs_read+0xc7/0x18c
ksys_read+0xc4/0x13e
do_syscall_64+0xa7/0x295
entry_SYSCALL_64_after_hwframe+0x49/0xbe
-> #0 (kn->count#202){++++}:
check_prev_add+0x5d2/0xc45
validate_chain+0xed3/0xf94
__lock_acquire+0x95f/0xa2f
lock_acquire+0x1b4/0x1e8
__kernfs_remove+0x237/0x40b
kernfs_remove_by_name_ns+0x59/0x72
remove_files+0x61/0x96
sysfs_remove_group+0x81/0xa4
sysfs_remove_groups+0x3b/0x44
kobject_del+0x44/0x94
blk_mq_unregister_dev+0x83/0xdd
blk_unregister_queue+0xa0/0x10b
del_gendisk+0x259/0x3fa
null_del_dev+0x8b/0x1c3 [null_blk]
null_exit+0x5c/0x95 [null_blk]
__se_sys_delete_module+0x204/0x337
do_syscall_64+0xa7/0x295
entry_SYSCALL_64_after_hwframe+0x49/0xbe
other info that might help us debug this:
Possible unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(&q->sysfs_lock);
                               lock(kn->count#202);
                               lock(&q->sysfs_lock);
  lock(kn->count#202);
*** DEADLOCK ***
2 locks held by rmmod/777:
#0: 00000000e69bd9de (&lock){+.+.}, at: null_exit+0x2e/0x95 [null_blk]
#1: 00000000fb16ae21 (&q->sysfs_lock){+.+.}, at: blk_unregister_queue+0x78/0x10b
stack backtrace:
CPU: 0 PID: 777 Comm: rmmod Not tainted 5.3.0-rc3-00044-g73277fc75ea0 #1380
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS ?-20180724_192412-buildhw-07.phx4
Call Trace:
dump_stack+0x9a/0xe6
check_noncircular+0x207/0x251
? print_circular_bug+0x32a/0x32a
? find_usage_backwards+0x84/0xb0
check_prev_add+0x5d2/0xc45
validate_chain+0xed3/0xf94
? check_prev_add+0xc45/0xc45
? mark_lock+0x11b/0x804
? check_usage_forwards+0x1ca/0x1ca
__lock_acquire+0x95f/0xa2f
lock_acquire+0x1b4/0x1e8
? kernfs_remove_by_name_ns+0x59/0x72
__kernfs_remove+0x237/0x40b
? kernfs_remove_by_name_ns+0x59/0x72
? kernfs_next_descendant_post+0x7d/0x7d
? strlen+0x10/0x23
? strcmp+0x22/0x44
kernfs_remove_by_name_ns+0x59/0x72
remove_files+0x61/0x96
sysfs_remove_group+0x81/0xa4
sysfs_remove_groups+0x3b/0x44
kobject_del+0x44/0x94
blk_mq_unregister_dev+0x83/0xdd
blk_unregister_queue+0xa0/0x10b
del_gendisk+0x259/0x3fa
? disk_events_poll_msecs_store+0x12b/0x12b
? check_flags+0x1ea/0x204
? mark_held_locks+0x1f/0x7a
null_del_dev+0x8b/0x1c3 [null_blk]
null_exit+0x5c/0x95 [null_blk]
__se_sys_delete_module+0x204/0x337
? free_module+0x39f/0x39f
? blkcg_maybe_throttle_current+0x8a/0x718
? rwlock_bug+0x62/0x62
? __blkcg_punt_bio_submit+0xd0/0xd0
? trace_hardirqs_on_thunk+0x1a/0x20
? mark_held_locks+0x1f/0x7a
? do_syscall_64+0x4c/0x295
do_syscall_64+0xa7/0x295
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fb696cdbe6b
Code: 73 01 c3 48 8b 0d 1d 20 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 008
RSP: 002b:00007ffec9588788 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559e589137c0 RCX: 00007fb696cdbe6b
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559e58913828
RBP: 0000000000000000 R08: 00007ffec9587701 R09: 0000000000000000
R10: 00007fb696d4eae0 R11: 0000000000000206 R12: 00007ffec95889b0
R13: 00007ffec95896b3 R14: 0000559e58913260 R15: 0000559e589137c0
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-27 04:01:48 -07:00
|
|
|
mutex_unlock(&q->sysfs_lock);
|
2017-03-28 16:12:17 -07:00
|
|
|
|
2019-08-27 04:01:48 -07:00
|
|
|
mutex_lock(&q->sysfs_dir_lock);
|
2018-01-17 12:48:10 -07:00
|
|
|
/*
|
|
|
|
* Remove the sysfs attributes before unregistering the queue data
|
|
|
|
* structures that can be modified through sysfs.
|
|
|
|
*/
|
2018-11-15 12:22:51 -07:00
|
|
|
if (queue_is_mq(q))
|
2022-06-28 10:18:50 -07:00
|
|
|
blk_mq_sysfs_unregister(disk);
|
2022-11-13 21:26:33 -07:00
|
|
|
blk_crypto_sysfs_unregister(disk);
|
2018-01-11 12:11:01 -07:00
|
|
|
|
2019-09-23 08:12:09 -07:00
|
|
|
mutex_lock(&q->sysfs_lock);
|
2022-01-24 14:59:36 -07:00
|
|
|
elv_unregister_queue(q);
|
block: Add independent access ranges support
The Concurrent Positioning Ranges VPD page (for SCSI) and data log page
(for ATA) contain parameters describing the set of contiguous LBAs that
can be served independently by a single LUN multi-actuator hard-disk.
Similarly, a logically defined block device composed of multiple disks
can in some cases execute requests directed at different sector ranges
in parallel. A dm-linear device aggregating 2 block devices together is
an example.
This patch implements support for exposing a block device's independent
access ranges to the user through sysfs to allow optimizing device
accesses to increase performance.
To describe the set of independent sector ranges of a device (actuators
of a multi-actuator HDD or table entries of a dm-linear device),
the type struct blk_independent_access_ranges is introduced. This
structure describes the sector ranges using an array of
struct blk_independent_access_range structures. This range structure
defines the start sector and number of sectors of the access range.
The ranges in the array cannot overlap and must contain all sectors
within the device capacity.
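The constraint on the ranges (no overlap, full coverage of the device capacity) can be sketched as a small check. This is an illustration only: `struct demo_range` and `demo_ranges_valid()` are hypothetical names, the struct is a simplification of the kernel type, and the sketch assumes the array is sorted by start sector, which the text above does not state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simplified range: start sector and number of sectors. */
struct demo_range {
	uint64_t sector;
	uint64_t nr_sectors;
};

/*
 * Returns true if the ranges, assumed sorted by start sector, tile
 * [0, capacity) exactly: no gap, no overlap, nothing past the end.
 */
static bool demo_ranges_valid(const struct demo_range *r, size_t n,
			      uint64_t capacity)
{
	uint64_t next = 0; /* first sector not yet covered; stays <= capacity */

	assert(r != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (r[i].nr_sectors == 0)
			return false;
		/* A start below 'next' is an overlap, above it is a gap. */
		if (r[i].sector != next)
			return false;
		/* Overflow-safe check that the range ends within capacity. */
		if (r[i].nr_sectors > capacity - next)
			return false;
		next += r[i].nr_sectors;
	}
	return n > 0 && next == capacity;
}
```

For a dual-actuator device of 200 sectors, the ranges {0, 100} and {100, 100} pass this check, while {0, 120} followed by {100, 100} is rejected as overlapping.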
The function disk_set_independent_access_ranges() allows a device
driver to signal to the block layer that a device has multiple
independent access ranges. In this case, a struct
blk_independent_access_ranges is attached to the device request queue
by the function disk_set_independent_access_ranges(). The function
disk_alloc_independent_access_ranges() is provided for drivers to
allocate this structure.
struct blk_independent_access_ranges contains kobjects (struct kobject)
to expose to the user through sysfs the set of independent access ranges
supported by a device. When the device is initialized, sysfs
registration of the ranges information is done from blk_register_queue()
using the block layer internal function
disk_register_independent_access_ranges(). If a driver calls
disk_set_independent_access_ranges() for a registered queue, e.g. when a
device is revalidated, disk_set_independent_access_ranges() will execute
disk_register_independent_access_ranges() to update the sysfs attribute
files. The sysfs file structure created starts from the
independent_access_ranges sub-directory and contains the start sector
and number of sectors of each range, with the information for each range
grouped in numbered sub-directories.
E.g. for a dual actuator HDD, the user sees:
$ tree /sys/block/sdk/queue/independent_access_ranges/
/sys/block/sdk/queue/independent_access_ranges/
|-- 0
|   |-- nr_sectors
|   `-- sector
`-- 1
    |-- nr_sectors
    `-- sector
For a regular device with a single access range, the
independent_access_ranges sysfs directory does not exist.
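As a rough sketch of how a userspace tool might locate these attributes, the helper below builds the path of one attribute file from the layout shown above. `demo_range_attr_path()` is a hypothetical helper, not an existing API; only the directory layout comes from this commit message.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Builds the sysfs path of one attribute ("sector" or "nr_sectors") of
 * range number idx for the given disk. Returns 0 on success, -1 if a
 * name is empty or the buffer is too small.
 */
static int demo_range_attr_path(char *buf, size_t len, const char *disk,
				unsigned int idx, const char *attr)
{
	int n;

	assert(buf != NULL && disk != NULL && attr != NULL);

	if (strlen(disk) == 0 || strlen(attr) == 0)
		return -1;

	n = snprintf(buf, len,
		     "/sys/block/%s/queue/independent_access_ranges/%u/%s",
		     disk, idx, attr);
	if (n < 0 || (size_t)n >= len)
		return -1; /* encoding error or truncated output */
	return 0;
}
```

A caller would still need to handle the case described above where the directory does not exist, for example by treating a failed open of the resulting path as "single access range".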
Device revalidation may lead to changes to this structure and to the
attribute values. When manipulated, the queue sysfs_lock and
sysfs_dir_lock mutexes are held for atomicity, similarly to how the
blk-mq and elevator sysfs queue sub-directories are protected.
The code related to the management of independent access ranges is
added in the new file block/blk-ia-ranges.c.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20211027022223.183838-2-damien.lemoal@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-26 19:22:19 -07:00
|
|
|
disk_unregister_independent_access_ranges(disk);
|
2019-09-23 08:12:09 -07:00
|
|
|
mutex_unlock(&q->sysfs_lock);
|
2022-01-24 14:59:37 -07:00
|
|
|
|
|
|
|
/* Now that we've deleted all child objects, we can delete the queue. */
|
2022-11-13 21:26:36 -07:00
|
|
|
kobject_uevent(&disk->queue_kobj, KOBJ_REMOVE);
|
|
|
|
kobject_del(&disk->queue_kobj);
|
2019-08-27 04:01:48 -07:00
|
|
|
mutex_unlock(&q->sysfs_dir_lock);
|
2018-01-17 12:48:10 -07:00
|
|
|
|
2022-11-13 21:26:34 -07:00
|
|
|
blk_debugfs_remove(disk);
|
2008-01-29 06:51:59 -07:00
|
|
|
}
|