If a 'sync_max' has been set (via sysfs), it is wrong to clear it
until a resync (or reshape or recovery ...) actually reached that
point.
So if a resync is interrupted (e.g. by device failure),
leave 'resync_max' unchanged.
This is particularly important for 'reshape' operations that do not
change the size of the array. For such operations mdadm needs to
monitor the reshape taking rolling backups of the section being
reshaped. If resync_max gets cleared, the reshape can get ahead of
mdadm and then the backups that mdadm creates are useless.
This is suitable for 2.6.31.y stable kernels.
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
When a raid5 (or raid6) array is being reshaped to have fewer devices,
conf->raid_disks is the latter and hence smaller number of devices.
However sometimes we want to use a number which is the total number of
currently required devices - the larger of the 'old' and 'new' sizes.
Before we implemented reducing the number of devices, this was always
'new' i.e. ->raid_disks.
Now we need max(raid_disks, previous_raid_disks) in those places.
This particularly affects assembling an array that was shutdown while
in the middle of a reshape to fewer devices.
md.c needs a similar fix when interpreting the md metadata.
Signed-off-by: NeilBrown <neilb@suse.de>
The management thread for raid4,5,6 arrays are all called
mdX_raid5, independent of the actual raid level, which is wrong and
can be confusion.
So change md_register_thread to use the name from the personality
unless no alternate name (like 'resync' or 'reshape') is given.
This is simpler and more correct.
Cc: Jinzc <zhenchengjin@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Rename some variable and remove some duplicate definitions
to avoid there warnings. None of them are actual errors.
Signed-off-by: NeilBrown <neilb@suse.de>
Recent commit c8c00a6915
changed the exit paths in do_md_stop and was not quite
careful enough. There is one path were 'err' now needs
to be cleared but it isn't.
So setting an array to readonly (with mdadm --readonly) will
work, but will incorrectly report and error: ENXIO.
Signed-off-by: NeilBrown <neilb@suse.de>
Normally we only allow the upper limit for a reshape to be decreased
when the array not performing a sync/recovery/reshape, otherwise there
could be races. But if an array is part-way through a reshape when it
is assembled the reshape is started immediately leaving no window
to set an upper bound.
If the array is started read-only, the reshape will be suspended until
the array becomes writable, so that provides a window during which it
is perfectly safe to reduce the upper limit of a reshape.
So: allow the upper limit (sync_max) to be reduced even if the reshape
thread is running, as long as the array is still read-only.
Signed-off-by: NeilBrown <neilb@suse.de>
When assembling arrays, md allows two devices to have different event
counts as long as the difference is only '1'. This is to cope with
a system failure between updating the metadata on two difference
devices.
However there are currently times when we update the event count by
2. This was done to keep the event count even when the array is clean
and odd when it is dirty, which allows us to avoid writing common
update to spare devices and so allow those spares to go to sleep.
This is bad for the above reason. So change it to never increase by
two. This means that the alignment between 'odd/even' and
'clean/dirty' might take a little longer to attain, but that is only a
small cost. The spares will get a few more updates but that will
still be spared (;-) most updates and can still go to sleep.
Prior to this patch there was a small chance that after a crash an
array would fail to assemble due to the overly large event count
mismatch.
Signed-off-by: NeilBrown <neilb@suse.de>
A recent commit:
commit 449aad3e25
introduced the possibility of an A-B/B-A deadlock between
bd_mutex and reconfig_mutex.
__blkdev_get holds bd_mutex while calling md_open which takes
reconfig_mutex,
do_md_run is always called with reconfig_mutex held, and it now
takes bd_mutex in the call the revalidate_disk.
This potential deadlock was not caught by lockdep due to the
use of mutex_lock_interruptible_nexted which was introduced
by
commit d63a5a74de
do avoid a warning of an impossible deadlock.
It is quite possible to split reconfig_mutex in to two locks.
One protects the array data structures while it is being
reconfigured, the other ensures that an array is never even partially
open while it is being deactivated.
In particular, the second lock prevents an open from completing
between the time when do_md_stop checks if there are any active opens,
and the time when the array is either set read-only, or when ->pers is
set to NULL. So we can be certain that no IO is in flight as the
array is being destroyed.
So create a new lock, open_mutex, just to ensure exclusion between
'open' and 'stop'.
This avoids the deadlock and also avoids the lockdep warning mentioned
in commit d63a5a74d
Reported-by: "Mike Snitzer" <snitzer@gmail.com>
Reported-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: NeilBrown <neilb@suse.de>
As revalidate_disk calls check_disk_size_change, it will cause
any capacity change of a gendisk to be propagated to the blockdev
inode. So use that instead of mucking about with locks and
i_size_write.
Also add a call to revalidate_disk in do_md_run and a few other places
where the gendisk capacity is changed.
Signed-off-by: NeilBrown <neilb@suse.de>
The v1.x metadata does not have a fixed size and can grow
when devices are added.
If it grows enough to require an extra sector of storage,
we need to update the 'sb_size' to match.
Without this, md can write out an incomplete superblock with a
bad checksum, which will be rejected when trying to re-assemble
the array.
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
We trust the 'desc_nr' field in v1.x metadata enough to use it
as an index in an array. This isn't really safe.
So range-check the value first.
Signed-off-by: NeilBrown <neilb@suse.de>
When an array is changed from RAID6 to RAID5, fewer drives are
needed. So any device that is made superfluous by the level
conversion must be marked as not-active.
For the RAID6->RAID5 conversion, this will be a drive which only
has 'Q' blocks on it.
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
This patch replaces md_integrity_check() by two new public functions:
md_integrity_register() and md_integrity_add_rdev() which are both
personality-independent.
md_integrity_register() is called from the ->run and ->hot_remove
methods of all personalities that support data integrity. The
function iterates over the component devices of the array and
determines if all active devices are integrity capable and if their
profiles match. If this is the case, the common profile is registered
for the mddev via blk_integrity_register().
The second new function, md_integrity_add_rdev() is called from the
->hot_add_disk methods, i.e. whenever a new device is being added
to a raid array. If the new device does not support data integrity,
or has a profile different from the one already registered, data
integrity for the mddev is disabled.
For raid0 and linear, only the call to md_integrity_register() from
the ->run method is necessary.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Commit 5fd29d6ccb ("printk: clean up
handling of log-levels and newlines") changed printk semantics. printk
lines with multiple KERN_<level> prefixes are no longer emitted as
before the patch.
<level> is now included in the output on each additional use.
Remove all uses of multiple KERN_<level>s in formats.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
User space can set various limits on an md array so that resync waits
when it gets to a certain point, or so that I/O is blocked for a short
while.
When md is waiting against one of these limit, it should use an
interruptible wait so as not to add to the load average, and so are
not to trigger a warning if the wait goes on for too long.
Signed-off-by: NeilBrown <neilb@suse.de>
As the recent bug in md_alloc showed, having a single exit path for
unlocking and putting is a good idea. So restructure md_alloc to have
a single mutex_unlock and mddev_put, and use gotos where necessary.
Found-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
When an md device is created by name (rather than number) we need to
check that the name is not already in use. If this check finds a
duplicate, we return an error without dropping the lock or freeing
the newly create mddev.
This patch fixes that.
Cc: stable@kernel.org
Found-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
If we try to modify one of the md/ sysfs files
suspend_lo or suspend_hi
when the array is not active, we dereference a NULL.
Protect against that.
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
If the superblock of a component device indicates the presence of a
bitmap but the corresponding raid personality does not support bitmaps
(raid0, linear, multipath, faulty), then something is seriously wrong
and we'd better refuse to run such an array.
Currently, this check is performed while the superblocks are examined,
i.e. before entering personality code. Therefore the generic md layer
must know which raid levels support bitmaps and which do not.
This patch avoids this layer violation without adding identical code
to various personalities. This is accomplished by introducing a new
public function to md.c, md_check_no_bitmap(), which replaces the
hard-coded checks in the superblock loading functions.
A call to md_check_no_bitmap() is added to the ->run method of each
personality which does not support bitmaps and assembly is aborted
if at least one component device contains a bitmap.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
It is easiest to round sizes to multiples of chunk size in
the personality code for those personalities which care.
Those personalities now do the rounding, so we can
remove that function from common code.
Also remove the upper bound on the size of a chunk, and the lower
bound on the size of a device (1 chunk), neither of which really buy
us anything.
Signed-off-by: NeilBrown <neilb@suse.de>
Currently the assignment to utime gets skipped for 'external'
metadata. So move it to the top of the function so that it
always gets effected.
This is of largely cosmetic interest. Nothing actually depends
on ->utime being right for external arrays.
"mdadm --monitor" does use it for 0.90 and 1.x arrays, but with
mdadm-3.0, this is not important for external metadata.
Signed-off-by: NeilBrown <neilb@suse.de>
Currently, the md layer checks in analyze_sbs() if the raid level
supports reconstruction (mddev->level >= 1) and if reconstruction is
in progress (mddev->recovery_cp != MaxSector).
Move that printk into the personality code of those raid levels that
care (levels 1, 4, 5, 6, 10).
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
The difference between these two methods is artificial.
Both check that a pending reshape is valid, and perform any
aspect of it that can be done immediately.
'reconfig' handles chunk size and layout.
'check_reshape' handles raid_disks.
So make them just one method.
Signed-off-by: NeilBrown <neilb@suse.de>
Passing the new layout and chunksize as args is not necessary as
the mddev has fields for new_check and new_layout.
This is preparation for combining the check_reshape and reconfig
methods
Signed-off-by: NeilBrown <neilb@suse.de>
A straight-forward conversion which gets rid of some
multiplications/divisions/shifts. The patch also introduces a couple
of new ones, most of which are due to conf->chunk_size still being
represented in bytes. This will be cleaned up in subsequent patches.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
This patch renames the chunk_size field to chunk_sectors with the
implied change of semantics. Since
is_power_of_2(chunk_size) = is_power_of_2(chunk_sectors << 9)
= is_power_of_2(chunk_sectors)
these bits don't need an adjustment for the shift.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Remove chunk size check from md as this is now performed in the run
function in each personality.
Replace chunk size power 2 code calculations by a regular division.
Signed-off-by: raziebe@gmail.com
Signed-off-by: NeilBrown <neilb@suse.de>
* 'for-2.6.31' of git://git.kernel.dk/linux-2.6-block: (153 commits)
block: add request clone interface (v2)
floppy: fix hibernation
ramdisk: remove long-deprecated "ramdisk=" boot-time parameter
fs/bio.c: add missing __user annotation
block: prevent possible io_context->refcount overflow
Add serial number support for virtio_blk, V4a
block: Add missing bounce_pfn stacking and fix comments
Revert "block: Fix bounce limit setting in DM"
cciss: decode unit attention in SCSI error handling code
cciss: Remove no longer needed sendcmd reject processing code
cciss: change SCSI error handling routines to work with interrupts enabled.
cciss: separate error processing and command retrying code in sendcmd_withirq_core()
cciss: factor out fix target status processing code from sendcmd functions
cciss: simplify interface of sendcmd() and sendcmd_withirq()
cciss: factor out core of sendcmd_withirq() for use by SCSI error handling code
cciss: Use schedule_timeout_uninterruptible in SCSI error handling code
block: needs to set the residual length of a bidi request
Revert "block: implement blkdev_readpages"
block: Fix bounce limit setting in DM
Removed reference to non-existing file Documentation/PCI/PCI-DMA-mapping.txt
...
Manually fix conflicts with tracing updates in:
block/blk-sysfs.c
drivers/ide/ide-atapi.c
drivers/ide/ide-cd.c
drivers/ide/ide-floppy.c
drivers/ide/ide-tape.c
include/trace/events/block.h
kernel/trace/blktrace.c
In order for the metadata to always be consistent, we mustn't updated
curr_resync_completed without also updating reshape_position.
The reshape code updates both at the same time. However since
commit 97e4f42d62
the common md_do_sync will sometimes update curr_resync_completed
but is not in a position to update reshape_position.
So if MD_RECOVERY_RESHAPE is set (indicating that a reshape is
happening, so reshape_position might change), don't update
curr_resync_completed in md_do_sync, leave it to the per-personality
reshape code.
Signed-off-by: NeilBrown <neilb@suse.de>
The md resync engine has a 'frozen' state which ensures that
no resync/recovery. This is used to avoid races.
Export this state through the 'sync_action' sysfs attribute
so that user-space can benefit and also avoid some races.
Signed-off-by: NeilBrown <neilb@suse.de>
Instead of always returns EINVAL if anything goes wrong
when setting the array size, add the option of
E2BIG
if the size requested is too large. This makes it easier
for user-space to be sure what went wrong.
Signed-off-by: NeilBrown <neilb@suse.de>
We previously didn't update these fields when writing the metadata
because they could never change. They can now, so we better write
them.
v0.90 metadata always updated these fields.
Signed-off-by: NeilBrown <neilb@suse.de>
Until now we have had a 1:1 mapping between storage device physical
block size and the logical block sized used when addressing the device.
With SATA 4KB drives coming out that will no longer be the case. The
sector size will be 4KB but the logical block size will remain
512-bytes. Hence we need to distinguish between the physical block size
and the logical ditto.
This patch renames hardsect_size to logical_block_size.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
md maintains link in sys/mdXX/md/ to identify which device has
which role in the array. e.g.
rd2 -> dev-sda
indicates that the device with role '2' in the array is sda.
These links are only present when the array is active. They are
created immediately after ->run is called, and so should be removed
immediately after ->stop is called.
However they are currently removed a little bit later, and it is
possible for ->run to be called again, thus adding these links, before
they are removed.
So move the removal earlier so they are consistently only present when
the array is active.
Signed-off-by: NeilBrown <neilb@suse.de>
Being able to write 'clean' to an 'array_state' of an inactive array
to activate it in 'clean' mode is both unnecessary and inconvenient.
It is unnecessary because the same can be achieved by writing
'active'. This activates and array, but it still remains 'clean'
until the first write.
It is inconvenient because writing 'clean' is more often used to
cause an 'active' array to revert to 'clean' mode (thus blocking
any writes until a 'write-pending' is promoted to 'active').
Allowing 'clean' to both activate an array and mark an active array as
clean can lead to races: One program writes 'clean' to mark the
active array as clean at the same time as another program writes
'inactive' to deactivate (stop) and active array. Depending on which
writes first, the array could be deactivated and immediately
reactivated which isn't what was desired.
So just disable the use of 'clean' to activate an array.
This avoids a race that can be triggered with mdadm-3.0 and external
metadata, so it suitable for -stable.
Reported-by: Rafal Marszewski <rafal.marszewski@intel.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Two problems in status_resync.
1/ It still used Kilobytes as the basic block unit, while most code
now uses sectors uniformly.
2/ It doesn't allow for the possibility that max_sectors exceeds
the range of "unsigned long".
So
- change "max_blocks" to "max_sectors", and store sector numbers
in there and in 'resync'
- Make 'rt' a 'sector_t' so it can temporarily hold the number of
remaining sectors.
- use sector_div rather than normal division.
- change the magic '100' used to preserve precision to '32'.
+ making it a power of 2 makes division easier
+ it doesn't need to be as large as it was chosen when we averaged
speed over the entire run. Now we average speed over the last 30
seconds or so.
Reported-by: "Mario 'BitKoenig' Holbe" <Mario.Holbe@TU-Ilmenau.DE>
Signed-off-by: NeilBrown <neilb@suse.de>
There are circumstances when a user-space process might need to
"oversee" a resync/reshape process. For example when doing an
in-place reshape of a raid5, it is prudent to take a backup of each
section before reshaping it as this is the only way to provide
safety against an unplanned shutdown (i.e. crash/power failure).
The sync_max sysfs value can be used to stop the resync from
advancing beyond a particular point.
So user-space can:
suspend IO to the first section and back it up
set 'sync_max' to the end of the section
wait for 'sync_completed' to reach that point
resume IO on the first section and move on to the next section.
However this process requires the kernel and user-space to run in
lock-step which could introduce unnecessary delays.
It would be better if a 'double buffered' approach could be used with
userspace and kernel space working on different sections with the
'next' section always ready when the 'current' section is finished.
One problem with implementing this is that sync_completed is only
guaranteed to be updated when the sync process reaches sync_max.
(it is updated on a time basis at other times, but it is hard to rely
on that). This defeats some of the double buffering.
With this patch, sync_completed (and reshape_position) get updated as
the current position approaches sync_max, so there is room for
userspace to advance sync_max early without losing updates.
To be precise, sync_completed is updated when the current sync
position reaches half way between the current value of sync_completed
and the value of sync_max. This will usually be a good time for user
space to update sync_max.
If sync_max does not get updated, the updates to sync_completed
(together with associated metadata updates) will occur at an
exponentially increasing frequency which will get unreasonably fast
(one update every page) immediately before the process hits sync_max
and stops. So the update rate will be unreasonably fast only for an
insignificant period of time.
Signed-off-by: NeilBrown <neilb@suse.de>
The sync_completed file reports how much of a resync (or recovery or
reshape) has been completed.
However due to the possibility of out-of-order completion of writes,
it is not certain to be accurate.
We have an internal value - mddev->curr_resync_completed - which is an
accurate value (though it might not always be quite so uptodate).
So:
- make curr_resync_completed be uptodate a little more often,
particularly when raid5 reshape updates status in the metadata
- report curr_resync_completed in the sysfs file
- allow poll/select to report all updates to md/sync_completed.
This makes sync_completed completed usable by any external metadata
handler that wants to record this status information in its metadata.
Signed-off-by: NeilBrown <neilb@suse.de>
When adding devices to an active array via sysfs, there is currently
no way to mark a device as 'in-sync' which is useful when
incrementally assembling an array.
So add that option.
Signed-off-by: NeilBrown <neilb@suse.de>
* 'for-linus' of git://neil.brown.name/md: (53 commits)
md/raid5 revise rules for when to update metadata during reshape
md/raid5: minor code cleanups in make_request.
md: remove CONFIG_MD_RAID_RESHAPE config option.
md/raid5: be more careful about write ordering when reshaping.
md: don't display meaningless values in sysfs files resync_start and sync_speed
md/raid5: allow layout and chunksize to be changed on active array.
md/raid5: reshape using largest of old and new chunk size
md/raid5: prepare for allowing reshape to change layout
md/raid5: prepare for allowing reshape to change chunksize.
md/raid5: clearly differentiate 'before' and 'after' stripes during reshape.
Documentation/md.txt update
md: allow number of drives in raid5 to be reduced
md/raid5: change reshape-progress measurement to cope with reshaping backwards.
md: add explicit method to signal the end of a reshape.
md/raid5: enhance raid5_size to work correctly with negative delta_disks
md/raid5: drop qd_idx from r6_state
md/raid6: move raid6 data processing to raid6_pq.ko
md: raid5 run(): Fix max_degraded for raid level 4.
md: 'array_size' sysfs attribute
md: centralize ->array_sectors modifications
...
When no resync if happening, both of these files currently have
meaningless values (is slightly different ways).
Change them to "none" in that case.
Signed-off-by: NeilBrown <neilb@suse.de>
Currently raid5 (the only module that supports restriping)
notices that the reshape has finished be sync_request being
given a large value, and handles any cleanup them.
This patch changes it so md_check_recovery calls into an
explicit finish_reshape method as well.
The clean-up from sync_request can do things that need to be
done promptly, typically things local to the raid5_conf_t
structure.
The "finish_reshape" method is called under the mddev_lock
so it can do things involving reconfiguring the device.
This allows us to get rid of md_set_array_sectors_locked, which
would have caused a deadlock if you tried to stop and array
while a reshape was happening.
Signed-off-by: NeilBrown <neilb@suse.de>
Allow userspace to set the size of the array according to the following
semantics:
1/ size must be <= to the size returned by mddev->pers->size(mddev, 0, 0)
a) If size is set before the array is running, do_md_run will fail
if size is greater than the default size
b) A reshape attempt that reduces the default size to less than the set
array size should be blocked
2/ once userspace sets the size the kernel will not change it
3/ writing 'default' to this attribute returns control of the size to the
kernel and reverts to the size reported by the personality
Also, convert locations that need to know the default size from directly
reading ->array_sectors to <pers>_size. Resync/reshape operations
always follow the default size.
Finally, fixup other locations that read a number of 1k-blocks from
userspace to use strict_blocks_to_sectors() which checks for unsigned
long long to sector_t overflow and blocks to sectors overflow.
Reviewed-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Get personalities out of the business of directly modifying
->array_sectors. Lays groundwork to introduce policy on when
->array_sectors can be modified.
Reviewed-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2-drive raid5's aren't very interesting. But if you are converting
a raid1 into a raid5, you will at least temporarily have one. And
that it a good time to set the layout/chunksize for the new RAID5
if you aren't happy with the defaults.
layout and chunksize don't actually affect the placement of data
on a 2-drive raid5, so we just do some internal book-keeping.
Signed-off-by: NeilBrown <neilb@suse.de>
Implement this for RAID6 to be able to 'takeover' a RAID5 array. The
new RAID6 will use a layout which places Q on the last device, and
that device will be missing.
If there are any available spares, one will immediately have Q
recovered onto it.
Signed-off-by: NeilBrown <neilb@suse.de>
To be able to change the 'level' of an md/raid array, we need to
suspend the device so that no requests are active - then move some
pointers around etc.
The code already keeps counts of active requests and the ->quiesce
function can be used to wait until those counts hit zero.
However the quiesce function blocks new requests once they are all
ready 'inside' the personality module, and that is too late if we want
to replace the personality modules.
So make all md requests come in through a common md_make_request
function that keeps track of how many requests have entered the
modules but may not yet be on the internal reference counts.
Allow md_make_request to be blocked when we want to suspend the
device, and make it possible to wait for all those in-transit requests
to be added to internal lists so that ->quiesce can wait for them.
There is still a problem that when a request completes, we drop the
ref count inside the personality code so there is a short time between
when the refcount hits zero, and when the personality code is no
longer being used.
The personality code never blocks (schedule or spinlock) between
dropping the refcount and exiting the routine, so this should be safe
(as put_module calls synchronize_sched() before unmapping the module
code).
Signed-off-by: NeilBrown <neilb@suse.de>
Mostly md_unregister_thread is only called when we know that the
thread is NULL, but sometimes we need to check first. It is safer
to put the check inside md_unregister_thread itself.
Signed-off-by: NeilBrown <neilb@suse.de>
When an md array is undergoing a change, we have new_* fields that
show the new values.
When no change is happening, it is least confusing if these have
the same value as the normal fields.
This is true in most cases, but not when the values are set via sysfs.
So fix this up.
A subsequent patch will BUG_ON if these things aren't consistent.
Signed-off-by: NeilBrown <neilb@suse.de>
This patch renames the "size" field of struct mdk_rdev_s to
"sectors" and changes this field to store sectors instead of
blocks.
All users of this field, linear.c, raid0.c and md.c, are fixed up
accordingly which gets rid of many multiplications and divisions.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
This patch renames the "size" field of struct mddev_s to "dev_sectors"
and stores the number of 512-byte sectors instead of the number of
1K-blocks in it.
All users of that field, including raid levels 1,4-6,10, are adjusted
accordingly. This simplifies the code a bit because it allows to get
rid of a couple of divisions/multiplications by two.
In order to make checkpatch happy, some minor coding style issues
have also been addressed. In particular, size_store() now uses
strict_strtoull() instead of simple_strtoull().
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
When a drive is added to an array using ADD_NEW_DISK, there are two
places we can get certain flags from: the metadata on the disk or the
flags passed through the IOCTL.
For the WriteMostly flag (aka MD_DISK_WRITEMOSTLY) we take the value
from either of those sources depending on if it is set (i.e. we
effectively 'or' the two sources together).
This makes it awkward to clear, and is at best inconsistent.
As documented code (in mdadm) requires that setting
MD_DISK_WRITEMOSTLY in the ioctl will be effective, we resolve the
inconsistency by always using the value for this flag from the ioctl,
and ignoring the value on disk.
Signed-off-by: NeilBrown <neilb@suse.de>
Version 1.x metadata has the ability to record the status of a
partially completed drive recovery.
However we only update that record on a clean shutdown.
It would be nice to update it on unclean shutdowns too, particularly
when using a bitmap that removes much to the 'sync' effort after an
unclean shutdown.
One complication with checkpointing recovery is that we only know
where we are up to in terms of IO requests started, not which ones
have completed. And we need to know what has completed to record
how much is recovered. So occasionally pause the recovery until all
submitted requests are completed, then update the record of where
we are up to.
When we have a bitmap, we already do that pause occasionally to keep
the bitmap up-to-date. So enhance that code to record the recovery
offset and schedule a superblock update.
And when there is no bitmap, just pause 16 times during the resync to
do a checkpoint.
'16' is a fairly arbitrary number. But we don't really have any good
way to judge how often is acceptable, and it seems like a reasonable
number for now.
Signed-off-by: NeilBrown <neilb@suse.de>
This makes the includes more explicit, and is preparation for moving
md_k.h to drivers/md/md.h
Remove include/raid/md.h as its only remaining use was to #include
other files.
Signed-off-by: NeilBrown <neilb@suse.de>
.. as they are part of the user-space interface.
Also move MdpMinorShift into there so we can remove duplication.
Lastly move mdp_major in. It is less obviously part of the user-space
interface, but do_mounts_md.c uses it, and it is acting a bit like
user-space.
Signed-off-by: NeilBrown <neilb@suse.de>
Move the headers with the local structures for the disciplines and
bitmap.h into drivers/md/ so that they are more easily grepable for
hacking and not far away. md.h is left where it is for now as there
are some uses from the outside.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.de>
MAJOR_NR was only required for magic in linux/blk.h in 2.4 or earlier
kernels, so no need to keep it around.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.de>
md: Add support for data integrity to MD
If all subdevices support the same protection format the MD device is
flagged as integrity capable.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: NeilBrown <neilb@suse.de>
There are two problems with is_mddev_idle.
1/ sync_io is 'atomic_t' and hence 'int'. curr_events and all the
rest are 'long'.
So if sync_io were to wrap on a 64bit host, the value of
curr_events would go very negative suddenly, and take a very
long time to return to positive.
So do all calculations as 'int'. That gives us plenty of precision
for what we need.
2/ To initialise rdev->last_events we simply call is_mddev_idle, on
the assumption that it will make sure that last_events is in a
suitable range. It used to do this, but now it does not.
So now we need to be more explicit about initialisation.
Signed-off-by: NeilBrown <neilb@suse.de>
We can't OR shift values, so get rid of BIO_RW_SYNC and use BIO_RW_SYNCIO
and BIO_RW_UNPLUG explicitly. This brings back the behaviour from before
213d9417fe.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Each different metadata format supported by md supports a
different maximum number of devices.
We really should be enforcing this maximum in the kernel, but
we aren't quite doing that properly.
We currently only enforce it at the 'hot_add' point, which is an
older interface which is not used by current userspace.
We need to also enforce it at 'add_new_disk' time for active arrays
and at 'do_md_run' time when starting a new array.
So move the test from 'hot_add' into 'bind_rdev_to_array' which is
called from both 'hot_add' and 'add_new_disk, and add a new
test in 'analyse_sbs' which is called from 'do_md_run'.
This bug (or missing feature) has been around "forever" and so
the patch is suitable for any -stable that is currently maintained.
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
If a raid1 has only one working drive and it has a sector which
gives an error on read, then an attempt to recover onto a spare will
fail, but as the single remaining drive is not removed from the
array, the recovery will be immediately re-attempted, resulting
in an infinite recovery loop.
So detect this situation and don't retry recovery once an error
on the lone remaining drive is detected.
Allow recovery to be retried once every time a spare is added
in case the problem wasn't actually a media error.
Signed-off-by: NeilBrown <neilb@suse.de>
Using sequential numbers to identify md devices is somewhat artificial.
Using names can be a lot more user-friendly.
Also, creating md devices by opening the device special file is a bit
awkward.
So this patch provides a new option for creating and naming devices.
Writing a name such as "md_home" to
/sys/modules/md_mod/parameters/new_array
will cause an array with that name to be created. It will appear in
/sys/block/ /proc/partitions and /proc/mdstat as 'md_home'.
It will have an arbitrary minor number allocated.
md devices that a created by an open are destroyed on the last
close when the device is inactive.
For named md devices, they will not be destroyed until the array
is explicitly stopped, either with the STOP_ARRAY ioctl or by
writing 'clear' to /sys/block/md_XXXX/md/array_state.
The name of the array must start 'md_' to avoid conflict with
other devices.
Signed-off-by: NeilBrown <neilb@suse.de>
Currently md devices, once created, never disappear until the module
is unloaded. This is essentially because the gendisk holds a
reference to the mddev, and the mddev holds a reference to the
gendisk, this a circular reference.
If we drop the reference from mddev to gendisk, then we need to ensure
that the mddev is destroyed when the gendisk is destroyed. However it
is not possible to hook into the gendisk destruction process to enable
this.
So we drop the reference from the gendisk to the mddev and destroy the
gendisk when the mddev gets destroyed. However this has a
complication.
Between the call
__blkdev_get->get_gendisk->kobj_lookup->md_probe
and the call
__blkdev_get->md_open
there is no obvious way to hold a reference on the mddev any more, so
unless something is done, it will disappear and gendisk will be
destroyed prematurely.
Also, once we decide to destroy the mddev, there will be an unlockable
moment before the gendisk is unlinked (blk_unregister_region) during
which a new reference to the gendisk can be created. We need to
ensure that this reference can not be used. i.e. the ->open must
fail.
So:
1/ in md_probe we set a flag in the mddev (hold_active) which
indicates that the array should be treated as active, even
though there are no references, and no appearance of activity.
This is cleared by md_release when the device is closed if it
is no longer needed.
This ensures that the gendisk will survive between md_probe and
md_open.
2/ In md_open we check if the mddev we expect to open matches
the gendisk that we did open.
If there is a mismatch we return -ERESTARTSYS and modify
__blkdev_get to retry from the top in that case.
In the -ERESTARTSYS sys case we make sure to wait until
the old gendisk (that we succeeded in opening) is really gone so
we loop at most once.
Some udev configurations will always open an md device when it first
appears. If we allow an md device that was just created by an open
to disappear on an immediate close, then this can race with such udev
configurations and result in an infinite loop the device being opened
and closed, then re-open due to the 'ADD' even from the first open,
and then close and so on.
So we make sure an md device, once created by an open, remains active
at least until some md 'ioctl' has been made on it. This means that
all normal usage of md devices will allow them to disappear promptly
when not needed, but the worst that an incorrect usage will do it
cause an inactive md device to be left in existence (it can easily be
removed).
As an array can be stopped by writing to a sysfs attribute
echo clear > /sys/block/mdXXX/md/array_state
we need to use scheduled work for deleting the gendisk and other
kobjects. This allows us to wait for any pending gendisk deletion to
complete by simply calling flush_scheduled_work().
Signed-off-by: NeilBrown <neilb@suse.de>
md_free is the .release handler for the md kobj_type.
So it makes sense to release all the objects referenced by
the mddev in there, rather than just prior to calling kobject_put
for what we think is the last time.
Signed-off-by: NeilBrown <neilb@suse.de>
It is more balanced to just do simple initialisation in mddev_find,
which allocates and links a new md device, and leave all the
more sophisticated allocation to md_probe (which calls mddev_find).
md_probe already allocated the gendisk. It should allocate the
queue too.
Signed-off-by: NeilBrown <neilb@suse.de>
The rdev_for_each macro defined in <linux/raid/md_k.h> is identical to
list_for_each_entry_safe, from <linux/list.h>, it should be defined to
use list_for_each_entry_safe, instead of reinventing the wheel.
But some calls to each_entry_safe don't really need a safe version,
just a direct list_for_each_entry is enough, this could save a temp
variable (tmp) in every function that used rdev_for_each.
In this patch, most rdev_for_each loops are replaced by list_for_each_entry,
totally save many tmp vars; and only in the other situations that will call
list_del to delete an entry, the safe version is used.
Signed-off-by: Cheng Renquan <crquan@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
There is no compelling need for this, but sysfs_notify_dirent is a
nicer interface and the change is good for consistency.
Signed-off-by: NeilBrown <neilb@suse.de>
It turns out that it is only safe to call blkdev_ioctl when the device
is actually open (as ->bd_disk is set to NULL on last close). And it
is quite possible for do_md_stop to be called when the device is not
open. So discard the call to blkdev_ioctl(BLKRRPART) which was
added in
commit 934d9c23b4
It is just as easy to call this ioctl from userspace when needed (on
mdadm -S) so leave it out of the kernel
Signed-off-by: NeilBrown <neilb@suse.de>
md arrays are not currently destroyed when they are stopped - they
remain in /sys/block. Last time I tried this I tripped over locking
too much.
A consequence of this is that udev doesn't remove anything from /dev.
This is rather ugly.
As an interim measure until proper device removal can be achieved,
make sure all partitions are removed using the BLKRRPART ioctl, and
send a KOBJ_CHANGE when an md array is stopped.
Signed-off-by: NeilBrown <neilb@suse.de>
* 'for-linus' of git://neil.brown.name/md:
md: allow extended partitions on md devices.
md: use sysfs_notify_dirent to notify changes to md/dev-xxx/state
md: use sysfs_notify_dirent to notify changes to md/array_state
To keep the size of changesets sane we split the switch by drivers;
to keep the damn thing bisectable we do the following:
1) rename the affected methods, add ones with correct
prototypes, make (few) callers handle both. That's this changeset.
2) for each driver convert to new methods. *ALL* drivers
are converted in this series.
3) kill the old (renamed) methods.
Note that it _is_ a flagday; all in-tree drivers are converted and by the
end of this series no trace of old methods remain. The only reason why
we do that this way is to keep the damn thing bisectable and allow per-driver
debugging if anything goes wrong.
New methods:
open(bdev, mode)
release(disk, mode)
ioctl(bdev, mode, cmd, arg) /* Called without BKL */
compat_ioctl(bdev, mode, cmd, arg)
locked_ioctl(bdev, mode, cmd, arg) /* Called with BKL, legacy */
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The new extended partition support provides a much nicer was
to have partitions on md devices that the 'mdp' alternate major.
We cannot really get rid of 'mdp' at this time, but we can
enable extended partitions as that will probably make life
easier for sysadmins.
Signed-off-by: NeilBrown <neilb@suse.de>
The 'state' file for a device reports, for example, when the device
has failed. Changes should be reported to userspace ASAP without
the possibility of blocking on low-memory. sysfs_notify does
have that possibility (as it takes a mutex which can be held
across a kmalloc) so use sysfs_notify_dirent instead.
Signed-off-by: NeilBrown <neilb@suse.de>
Now that we have sysfs_notify_dirent, use it to notify changes
to md/array_state.
As sysfs_notify_dirent can be called in atomic context, we can
remove the delayed notify and the MD_NOTIFY_ARRAY_STATE flag.
Signed-off-by: NeilBrown <neilb@suse.de>
safe_delay_store() currently truncates the last character of input since
it tells strlcpy that the buffer can only hold 'len' characters, off by
one. sysfs already null terminates the buffer, so just increase the
last argument to strlcpy.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Today's linux-next build (powerpc ppc64_defconfig) failed like this:
drivers/md/raid1.c: In function 'sync_request':
drivers/md/raid1.c:1759: error: implicit declaration of function 'msleep_interruptible'
make[3]: *** [drivers/md/raid1.o] Error 1
make[3]: *** Waiting for unfinished jobs....
drivers/md/raid10.c: In function 'sync_request':
drivers/md/raid10.c:1749: error: implicit declaration of function 'msleep_interruptible'
make[3]: *** [drivers/md/raid10.o] Error 1
drivers/md/md.c: In function 'md_do_sync':
drivers/md/md.c:5915: error: implicit declaration of function 'msleep'
Caused by commit 6caa3b0bbdb474647f6bdd8a958ffc46f78d8d58 ("md: Remove
unnecessary #includes, #defines, and function declarations"). I added
the following patch.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: NeilBrown <neilb@suse.de>
Currently, the 'chunk_size' of an array must be at-least PAGE_SIZE.
This makes moving an array to a machine with a larger PAGE_SIZE, or
changing the kernel to use a larger PAGE_SIZE, can stop an array from
working.
For RAID10 and RAID4/5/6, this is non-trivial to fix as the resync
process works on whole pages at a time, and assumes them to be wholly
within a stripe. For other raid personalities, this restriction is
not needed at all and can be dropped.
So remove the test on chunk_size from common can, and add it in just
the places where it is needed: raid10 and raid4/5/6.
Signed-off-by: NeilBrown <neilb@suse.de>
Having
function (args)
instead of
function(args)
make is harder to search for calls of particular functions.
So remove all those spaces.
Signed-off-by: NeilBrown <neilb@suse.de>
'read-auto' is a variant of 'readonly' which will switch to writable
on the first write attempt.
Calling do_md_stop to set the array readonly when it is already readonly
returns an error. So make sure not to do that.
Signed-off-by: NeilBrown <neilb@suse.de>
For externally managed metadata, the 'metadata_version' sysfs
attribute is really just a channel for user-space programs to
communicate about how the array is being managed.
It can be useful for this to be changed while the array is active.
Normally changes to metadata_version are not permitted while the array
is active. Change that so that if the metadata is externally managed,
the metadata_version can be changed to a different flavour of external
management.
Signed-off-by: NeilBrown <neilb@suse.de>
Fix rdev_size_store with size == 0.
size == 0 means to use the largest size allowed by the
underlying device and is used when modifying an active array.
This fixes a regression introduced by
commit d7027458d6
Cc: <stable@kernel.org>
Signed-off-by: Chris Webb <chris@arachsys.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Move stats related fields - stamp, in_flight, dkstats - from disk to
part0 and unify stat handling such that...
* part_stat_*() now updates part0 together if the specified partition
is not part0. ie. part_stat_*() are now essentially all_stat_*().
* {disk|all}_stat_*() are gone.
* part_round_stats() is updated similary. It handles part0 stats
automatically and disk_round_stats() is killed.
* part_{inc|dec}_in_fligh() is implemented which automatically updates
part0 stats for parts other than part0.
* disk_map_sector_rcu() is updated to return part0 if no part matches.
Combined with the above changes, this makes NULL special case
handling in callers unnecessary.
* Separate stats show code paths for disk are collapsed into part
stats show code paths.
* Rename disk_stat_lock/unlock() to part_stat_lock/unlock()
While at it, reposition stat handling macros a bit and add missing
parentheses around macro parameters.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Till now, bdev->bd_part is set only if the bdev was for parts other
than part0. This patch makes bdev->bd_part always set so that code
paths don't have to differenciate common handling.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Implement {disk|part}_to_dev() and use them to access generic device
instead of directly dereferencing {disk|part}->dev. To make sure no
user is left behind, rename generic devices fields to __dev.
This is in preparation of unifying partition 0 handling with other
partitions.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
When two md arrays share some block device (e.g each uses different
partitions on the one device), a resync of one array will wait for
the resync on the other to finish.
This can be a long time and as it currently waits TASK_UNINTERRUPTIBLE,
the softlockup code notices and complains.
So use TASK_INTERRUPTIBLE instead and make sure to flush signals
before calling schedule.
Signed-off-by: NeilBrown <neilb@suse.de>
When stopping an md array, or just switching to read-only, we
currently call invalidate_partition while holding the mddev lock.
The main reason for this is probably to ensure all dirty buffers
are flushed (invalidate_partition calls fsync_bdev).
However if any dirty buffers are found, it will almost certainly cause
a deadlock as starting writeout will require an update to the
superblock, and performing that updates requires taking the mddev
lock - which is already held.
This deadlock can be demonstrated by running "reboot -f -n" with
a root filesystem on md/raid, and some dirty buffers in memory.
All other calls to stop an array should already happen after a flush.
The normal sequence is to stop using the array (e.g. umount) which
will cause __blkdev_put to call sync_blockdev. Then open the
array and issue the STOP_ARRAY ioctl while the buffers are all still
clean.
So this invalidate_partition is normally a no-op, except for one case
where it will cause a deadlock.
So remove it.
This patch possibly addresses the regression recored in
http://bugzilla.kernel.org/show_bug.cgi?id=11460
and
http://bugzilla.kernel.org/show_bug.cgi?id=11452
though it isn't yet clear how it ever worked.
Signed-off-by: NeilBrown <neilb@suse.de>
If a 'repair' is requested when an array is in a position to 'recover' raid1
will perform the repair while md believes a recovery is happening. Address
this at both ends, i.e. cancel check/repair requests upon detecting a
recover condition and do not call ->spare_active after completing a
check/repair.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Removing faulty devices from an array is a two stage process.
First the device is moved from being a part of the active array
to being similar to a spare device. Then it can be removed
by a request from user space.
The first step is currently not performed for read-only arrays,
so the second step can never succeed.
So allow readonly arrays to remove failed devices (which aren't
blocked).
Signed-off-by: NeilBrown <neilb@suse.de>
We cannot currently change the size of a write-intent bitmap.
So if we change the size of an array which has such a bitmap, it
tries to set bits beyond the end of the bitmap.
For now, simply reject any request to change the size of an array
which has a bitmap. mdadm can remove the bitmap and add a new one
after the array has changed size.
Signed-off-by: NeilBrown <neilb@suse.de>