Add "prev_chunk" to raid5_conf_t, similar to "previous_raid_disks", to
remember what the chunk size was before the reshape that is currently
underway.
This seems like duplication with "chunk_size" and "new_chunk" in
mddev_t, and to some extent it is, but there are differences.
The values in mddev_t are always defined and often the same.
The prev* values are only defined if a reshape is underway.
Also (and more significantly) the raid5_conf_t values will be changed
at the same time (inside an appropriate lock) that the reshape is
started by setting reshape_position. In contrast, the new_chunk value
is set when the sysfs file is written which could be well before the
reshape starts.
Signed-off-by: NeilBrown <neilb@suse.de>
During a raid5 reshape, we have some stripes in the cache that are
'before' the reshape (and are still to be processed) and some that are
'after'. They are currently differentiated by having different
->disks values as the only reshape current supported involves changing
the number of disks.
However we will soon support reshapes that do not change the number
of disks (chunk parity or chunk size). So make the difference more
explicit with a 'generation' number.
Signed-off-by: NeilBrown <neilb@suse.de>
When reshaping a raid5 to have fewer devices, we work from the end of
the array to the beginning.
md_do_sync gives addresses to sync_request that go from the beginning
to the end. So largely ignore them use the internal state variable
"reshape_progress" to keep track of what to do next.
Never allow the size to be reduced below the minimum (4 for raid6,
3 otherwise).
We require that the size of the array has already been reduced before
the array is reshaped to a smaller size. This is because simply
reducing the size is an easily reversible operation, while the reshape
is immediately destructive and so is not reversible for the blocks at
the ends of the devices.
Thus to reshape an array to have fewer devices, you must first write
an appropriately small size to md/array_size.
When reshape finished, we remove any drives that are no longer
needed and fix up ->degraded.
Signed-off-by: NeilBrown <neilb@suse.de>
When reducing the number of devices in a raid4/5/6, the reshape
process has to start at the end of the array and work down to the
beginning. So we need to handle expand_progress and expand_lo
differently.
This patch renames "expand_progress" and "expand_lo" to avoid the
implication that anything is getting bigger (expand->reshape) and
every place they are used, we make sure that they are used the right
way depending on whether delta_disks is positive or negative.
Signed-off-by: NeilBrown <neilb@suse.de>
Currently raid5 (the only module that supports restriping)
notices that the reshape has finished be sync_request being
given a large value, and handles any cleanup them.
This patch changes it so md_check_recovery calls into an
explicit finish_reshape method as well.
The clean-up from sync_request can do things that need to be
done promptly, typically things local to the raid5_conf_t
structure.
The "finish_reshape" method is called under the mddev_lock
so it can do things involving reconfiguring the device.
This allows us to get rid of md_set_array_sectors_locked, which
would have caused a deadlock if you tried to stop and array
while a reshape was happening.
Signed-off-by: NeilBrown <neilb@suse.de>
This is the first of four patches which combine to allow md/raid5 to
reduce the number of devices in the array by restriping the data over
a subset of the devices.
If the number of disks in a raid4/5/6 is being reduced, then the
default size must be based on the new number, not the old number
of devices.
In general, it should be based on the smaller of new and old.
Signed-off-by: NeilBrown <neilb@suse.de>
Move the raid6 data processing routines into a standalone module
(raid6_pq) to prepare them to be called from async_tx wrappers and other
non-md drivers/modules. This precludes a circular dependency of raid456
needing the async modules for data processing while those modules in
turn depend on raid456 for the base level synchronous raid6 routines.
To support this move:
1/ The exportable definitions in raid6.h move to include/linux/raid/pq.h
2/ The raid6_call, recovery calls, and table symbols are exported
3/ Extra #ifdef __KERNEL__ statements to enable the userspace raid6test to
compile
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Allow userspace to set the size of the array according to the following
semantics:
1/ size must be <= to the size returned by mddev->pers->size(mddev, 0, 0)
a) If size is set before the array is running, do_md_run will fail
if size is greater than the default size
b) A reshape attempt that reduces the default size to less than the set
array size should be blocked
2/ once userspace sets the size the kernel will not change it
3/ writing 'default' to this attribute returns control of the size to the
kernel and reverts to the size reported by the personality
Also, convert locations that need to know the default size from directly
reading ->array_sectors to <pers>_size. Resync/reshape operations
always follow the default size.
Finally, fixup other locations that read a number of 1k-blocks from
userspace to use strict_blocks_to_sectors() which checks for unsigned
long long to sector_t overflow and blocks to sectors overflow.
Reviewed-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Get personalities out of the business of directly modifying
->array_sectors. Lays groundwork to introduce policy on when
->array_sectors can be modified.
Reviewed-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
In preparation for giving userspace control over ->array_sectors we need
to be able to retrieve the 'default' size, and the 'anticipated' size
when a reshape is requested. For personalities that do not reshape emit
a warning if anything but the default size is requested.
In the raid5 case we need to update ->previous_raid_disks to make the
new 'default' size available.
Reviewed-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Hello,
I found a typo Bosto"m" in FSF address.
And I am checking around linux source code.
Here is the only place which uses Bosto"m" (not Boston).
Signed-off-by: Atsushi SAKAI <sakaia@jp.fujitsu.com>
Signed-off-by: NeilBrown <neilb@suse.de>
If a raid6 is still in the layout that comes from converting raid5
into a raid6. this will allow us to convert it back again.
Signed-off-by: NeilBrown <neilb@suse.de>
2-drive raid5's aren't very interesting. But if you are converting
a raid1 into a raid5, you will at least temporarily have one. And
that it a good time to set the layout/chunksize for the new RAID5
if you aren't happy with the defaults.
layout and chunksize don't actually affect the placement of data
on a 2-drive raid5, so we just do some internal book-keeping.
Signed-off-by: NeilBrown <neilb@suse.de>
Implement this for RAID6 to be able to 'takeover' a RAID5 array. The
new RAID6 will use a layout which places Q on the last device, and
that device will be missing.
If there are any available spares, one will immediately have Q
recovered onto it.
Signed-off-by: NeilBrown <neilb@suse.de>
To be able to change the 'level' of an md/raid array, we need to
suspend the device so that no requests are active - then move some
pointers around etc.
The code already keeps counts of active requests and the ->quiesce
function can be used to wait until those counts hit zero.
However the quiesce function blocks new requests once they are all
ready 'inside' the personality module, and that is too late if we want
to replace the personality modules.
So make all md requests come in through a common md_make_request
function that keeps track of how many requests have entered the
modules but may not yet be on the internal reference counts.
Allow md_make_request to be blocked when we want to suspend the
device, and make it possible to wait for all those in-transit requests
to be added to internal lists so that ->quiesce can wait for them.
There is still a problem that when a request completes, we drop the
ref count inside the personality code so there is a short time between
when the refcount hits zero, and when the personality code is no
longer being used.
The personality code never blocks (schedule or spinlock) between
dropping the refcount and exiting the routine, so this should be safe
(as put_module calls synchronize_sched() before unmapping the module
code).
Signed-off-by: NeilBrown <neilb@suse.de>
Mostly md_unregister_thread is only called when we know that the
thread is NULL, but sometimes we need to check first. It is safer
to put the check inside md_unregister_thread itself.
Signed-off-by: NeilBrown <neilb@suse.de>
.. so that the code to create the private data structures is separate.
This will help with future code to change the level of an active
array.
Signed-off-by: NeilBrown <neilb@suse.de>
When an md array is undergoing a change, we have new_* fields that
show the new values.
When no change is happening, it is least confusing if these have
the same value as the normal fields.
This is true in most cases, but not when the values are set via sysfs.
So fix this up.
A subsequent patch will BUG_ON if these things aren't consistent.
Signed-off-by: NeilBrown <neilb@suse.de>
DDF requires RAID6 calculations over different devices in a different
order.
For md/raid6, we calculate over just the data devices, starting
immediately after the 'Q' block.
For ddf/raid6 we calculate over all devices, using zeros in place of
the P and Q blocks.
This requires unfortunately complex loops...
Signed-off-by: NeilBrown <neilb@suse.de>
DDF uses different layouts for P and Q blocks than current md/raid6
so add those that are missing.
Also add support for RAID6 layouts that are identical to various
raid5 layouts with the simple addition of one device to hold all of
the 'Q' blocks.
Finally add 'raid5' layouts to match raid4.
These last to will allow online level conversion.
Note that this does not provide correct support for DDF/raid6 yet
as the order in which data blocks are summed to produce the Q block
is significant and different between current md code and DDF
requirements.
Signed-off-by: NeilBrown <neilb@suse.de>
Rather than passing 'pd_idx' and 'qd_idx' to be filled in, pass
a 'struct stripe_head *' and fill in the relevant fields. This is
more extensible.
Signed-off-by: NeilBrown <neilb@suse.de>
Code currently assumes that the devices in a raid6 stripe are
0 1 ... N-1 P Q
in some rotated order. We will shortly add new layouts in which
this strict pattern is broken.
So remove this expectation. We still assume that the data disks
are roughly in-order. However P and Q can be inserted anywhere within
that order.
Signed-off-by: NeilBrown <neilb@suse.de>
This similar to the recent change to get_active_stripe.
There is no functional change, just come rearrangement to make
future patches cleaner.
Signed-off-by: NeilBrown <neilb@suse.de>
Rather than passing 'pd_idx' and 'disks' to these functions, just pass
'previous' which tells whether to use the 'previous' or 'current'
geometry during a reshape, and let init_stripe calculate
disks and pd_idx and anything else it might need.
This is not a substantial simplification and even adds a division.
However we will shortly be adding more complexity to init_stripe
to handle more interesting 'reshape' activities, and without this
change, the interface to these functions would get very complex.
Signed-off-by: NeilBrown <neilb@suse.de>
This patch renames the "size" field of struct mdk_rdev_s to
"sectors" and changes this field to store sectors instead of
blocks.
All users of this field, linear.c, raid0.c and md.c, are fixed up
accordingly which gets rid of many multiplications and divisions.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
This patch renames the "size" field of struct mddev_s to "dev_sectors"
and stores the number of 512-byte sectors instead of the number of
1K-blocks in it.
All users of that field, including raid levels 1,4-6,10, are adjusted
accordingly. This simplifies the code a bit because it allows to get
rid of a couple of divisions/multiplications by two.
In order to make checkpatch happy, some minor coding style issues
have also been addressed. In particular, size_store() now uses
strict_strtoull() instead of simple_strtoull().
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
When a drive is added to an array using ADD_NEW_DISK, there are two
places we can get certain flags from: the metadata on the disk or the
flags passed through the IOCTL.
For the WriteMostly flag (aka MD_DISK_WRITEMOSTLY) we take the value
from either of those sources depending on if it is set (i.e. we
effectively 'or' the two sources together).
This makes it awkward to clear, and is at best inconsistent.
As documented code (in mdadm) requires that setting
MD_DISK_WRITEMOSTLY in the ioctl will be effective, we resolve the
inconsistency by always using the value for this flag from the ioctl,
and ignoring the value on disk.
Signed-off-by: NeilBrown <neilb@suse.de>
Version 1.x metadata has the ability to record the status of a
partially completed drive recovery.
However we only update that record on a clean shutdown.
It would be nice to update it on unclean shutdowns too, particularly
when using a bitmap that removes much to the 'sync' effort after an
unclean shutdown.
One complication with checkpointing recovery is that we only know
where we are up to in terms of IO requests started, not which ones
have completed. And we need to know what has completed to record
how much is recovered. So occasionally pause the recovery until all
submitted requests are completed, then update the record of where
we are up to.
When we have a bitmap, we already do that pause occasionally to keep
the bitmap up-to-date. So enhance that code to record the recovery
offset and schedule a superblock update.
And when there is no bitmap, just pause 16 times during the resync to
do a checkpoint.
'16' is a fairly arbitrary number. But we don't really have any good
way to judge how often is acceptable, and it seems like a reasonable
number for now.
Signed-off-by: NeilBrown <neilb@suse.de>
This makes the includes more explicit, and is preparation for moving
md_k.h to drivers/md/md.h
Remove include/raid/md.h as its only remaining use was to #include
other files.
Signed-off-by: NeilBrown <neilb@suse.de>
.. as they are part of the user-space interface.
Also move MdpMinorShift into there so we can remove duplication.
Lastly move mdp_major in. It is less obviously part of the user-space
interface, but do_mounts_md.c uses it, and it is acting a bit like
user-space.
Signed-off-by: NeilBrown <neilb@suse.de>
Move the headers with the local structures for the disciplines and
bitmap.h into drivers/md/ so that they are more easily grepable for
hacking and not far away. md.h is left where it is for now as there
are some uses from the outside.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.de>
Use the -y variables instead of the old -objs so we can easily add
conditional objects to the modules. Also always use += to add
subobjects to avoid problems when placing additional objects in
some place in the file.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.de>
MAJOR_NR was only required for magic in linux/blk.h in 2.4 or earlier
kernels, so no need to keep it around.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.de>
md: Add support for data integrity to MD
If all subdevices support the same protection format the MD device is
flagged as integrity capable.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: NeilBrown <neilb@suse.de>
When we add some spares to an array and start recovery, and we have
a bitmap which is stored 'internally' on all devices, we call
bitmap_write_all to make sure the bitmap is correct on the new
device(s).
However that doesn't work as write_sb_page only writes to
'In_sync' devices, and devices undergoing recovery are not
'In_sync' until recovery finishes.
So extend write_sb_page (actually next_active_rdev) to include devices
that are under recovery.
Signed-off-by: NeilBrown <neilb@suse.de>
It is safe to clear a bit from the write-intent bitmap for a raid1
if we know the data has been written to all devices, which is
what the current test does.
But it is not always safe to update the 'events_cleared' counter in
that case. This is because one request could complete successfully
after some other request has partially failed.
So simply disable the clearing and updating of events_cleared whenever
the array is degraded. This might end up not clearing some bits that
could safely be cleared, but it is safest approach.
Note that the bug fixed here did not risk corrupting data by letting
the array get out-of-sync. Rather it meant that when a device is
removed and re-added to the array, it might incorrectly require a full
recovery rather than just recovering based on the bitmap.
Signed-off-by: NeilBrown <neilb@suse.de>
md currently insists that the chunk size used for write-intent
bitmaps (the amount of data that corresponds to one chunk)
be at least one page.
The reason for this restriction is lost in the mists of time,
but a review of the code (and a vague memory) suggests that the only
problem would be related to resync. Resync tries very hard to
work in multiples of a page, but also needs to sync with units
of a bitmap_chunk too.
This connection comes out in the bitmap_start_sync call.
So change bitmap_start_sync to always work in multiples of a page.
If the bitmap chunk size is less that one page, we flag multiple
chunks as 'syncing' and generally make them all appear to the
resync routines like one chunk.
All other code either already works with data ranges that could
span multiple chunks, or explicitly only cares about a single chunk.
Signed-off-by: Neil Brown <neilb@suse.de>
There are two problems with is_mddev_idle.
1/ sync_io is 'atomic_t' and hence 'int'. curr_events and all the
rest are 'long'.
So if sync_io were to wrap on a 64bit host, the value of
curr_events would go very negative suddenly, and take a very
long time to return to positive.
So do all calculations as 'int'. That gives us plenty of precision
for what we need.
2/ To initialise rdev->last_events we simply call is_mddev_idle, on
the assumption that it will make sure that last_events is in a
suitable range. It used to do this, but now it does not.
So now we need to be more explicit about initialisation.
Signed-off-by: NeilBrown <neilb@suse.de>
There has been a race in raid10 and raid1 for a long time
which has only recently started showing up due to a scheduler changed.
When a sync_read request finishes, as soon as reschedule_retry
is called, another thread can mark the resync request as having
completed, so md_do_sync can finish, ->stop can be called, and
->conf can be freed. So using conf after reschedule_retry is not
safe.
Similarly, when finishing a sync_write, calling md_done_sync must be
the last thing we do, as it allows a chain of events which will free
conf and other data structures.
The first of these requires action in raid10.c
The second requires action in raid1.c and raid10.c
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
For raid1/4/5/6, resync (fixing inconsistencies between devices) is
very similar to recovery (rebuilding a failed device onto a spare).
The both walk through the device addresses in order.
For raid10 it can be quite different. resync follows the 'array'
address, and makes sure all copies are the same. Recover walks
through 'device' addresses and recreates each missing block.
The 'bitmap_cond_end_sync' function allows the write-intent-bitmap
(When present) to be updated to reflect a partially completed resync.
It makes assumptions which mean that it does not work correctly for
raid10 recovery at all.
In particularly, it can cause bitmap-directed recovery of a raid10 to
not recovery some of the blocks that need to be recovered.
So move the call to bitmap_cond_end_sync into the resync path, rather
than being in the common "resync or recovery" path.
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
When doing recovery on a raid10 with a write-intent bitmap, we only
need to recovery chunks that are flagged in the bitmap.
However if we choose to skip a chunk as it isn't flag, the code
currently skips the whole raid10-chunk, thus it might not recovery
some blocks that need recovering.
This patch fixes it.
In case that is confusing, it might help to understand that there
is a 'raid10 chunk size' which guides how data is distributed across
the devices, and a 'bitmap chunk size' which says how much data
corresponds to a single bit in the bitmap.
This bug only affects cases where the bitmap chunk size is smaller
than the raid10 chunk size.
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
We can't OR shift values, so get rid of BIO_RW_SYNC and use BIO_RW_SYNCIO
and BIO_RW_UNPLUG explicitly. This brings back the behaviour from before
213d9417fe.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Each different metadata format supported by md supports a
different maximum number of devices.
We really should be enforcing this maximum in the kernel, but
we aren't quite doing that properly.
We currently only enforce it at the 'hot_add' point, which is an
older interface which is not used by current userspace.
We need to also enforce it at 'add_new_disk' time for active arrays
and at 'do_md_run' time when starting a new array.
So move the test from 'hot_add' into 'bind_rdev_to_array' which is
called from both 'hot_add' and 'add_new_disk, and add a new
test in 'analyse_sbs' which is called from 'do_md_run'.
This bug (or missing feature) has been around "forever" and so
the patch is suitable for any -stable that is currently maintained.
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
ab5bd5cbc8 introduced the following
bug in linear software raid for large arrays on 32 bit machines:
which_dev() computes the device holding a given sector by shifting
down the sector number to a 32 bit range, dividing by the array
spacing and looking up the resulting index in the hash table of
the array.
Because the computed index might be slightly too small, a loop at
the end of which_dev() increases the index until the given sector
actually falls into the range of the device associated with that index.
The changes of the above mentioned commit caused this loop to check
whether the _index_ rather than the sector number is small enough,
effectively bypassing the loop and thus possibly returning the wrong
device.
As reported by Simon Kirby, this leads to errors such as
linear_make_request: Sector 2340486136 out of bounds on dev sdi: 156301312 sectors, offset 2109870464
Fix this bug by introducing a local variable for the index so that
the variable containing the passed sector is left unchanged.
Cc: stable@kernel.org
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>