There are places where sysfs links to rdev are handled
in a same way. Add the helper functions to consolidate
them.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
As per printk_ratelimit comment, it should not be used.
Signed-off-by: Christian Dietrich <christian.dietrich@informatik.uni-erlangen.de>
Signed-off-by: NeilBrown <neilb@suse.de>
Using __test_and_{set,clear}_bit_le() with ignoring its return value
can be replaced with __{set,clear}_bit_le().
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: NeilBrown <neilb@suse.de>
Cc: linux-raid@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
handle_stripe5() and handle_stripe6() are now virtually identical.
So discard one and rename the other to 'analyse_stripe()'.
It always returns 0, so change it to 'void' and remove the 'done'
variable in handle_stripe().
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
The RAID6 version of this code is usable for RAID5 providing:
- we test "conf->max_degraded" rather than "2" as appropriate
- we make sure s->failed_num[1] is meaningful (and not '-1')
when s->failed > 1
The 'return 1' must become 'goto finish' in the new location.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
Apart from 'prexor' which can only be set for RAID5, and
'qd_idx' which can only be meaningful for RAID6, these two
chunks of code are nearly the same.
So combine them into one adding a test to call either
handle_parity_checks5 or handle_parity_checks6 as appropriate.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
RAID6 is only allowed to choose 'reconstruct-write' while RAID5 is
also allow 'read-modify-write'
Apart from this difference, handle_stripe_dirtying[56] are nearly
identical. So resolve these differences and create just one function.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
Provided that ->failed_num[1] is not a valid device number (which is
easily achieved) fetch_block6 provides all the functionality of
fetch_block5.
So remove the latter and rename the former to simply "fetch_block".
Then handle_stripe_fill5 and handle_stripe_fill6 become the same and
can similarly be united.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
Next patch will unite fetch_block5 and fetch_block6.
First I want to make the differences a little more clear.
For RAID6 if we are writing at all and there is a failed device, then
we need to load or compute every block so we can do a
reconstruct-write.
This case isn't needed for RAID5 - we will do a read-modify-write in
that case.
So make that test a separate test in fetch_block6 rather than merged
with two other tests.
Make a similar change in fetch_block5 so the one bit that is not
needed for RAID6 is clearly separate.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
The difference between the RAID5 and RAID6 code here is easily
resolved using conf->max_degraded.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
Prior to commit ab69ae12ce the code in handle_stripe5 and
handle_stripe6 to "Finish reconstruct operations initiated by the
expansion process" was identical.
That commit added an identical stanza of code to each function, but in
different places. That was careless.
The raid5 code was correct, so move that out into handle_stripe and
remove raid6 version.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
This arg is only used to differentiate between RAID5 and RAID6 but
that is not needed. For RAID5, raid5_compute_sector will set qd_idx
to "~0" so j with certainly not equals qd_idx, so there is no need
for a guard on that condition.
So remove the guard and remove the arg from the declaration and
callers of handle_stripe_expansion.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
This allows us to move duplicated code in <asm/atomic.h>
(atomic_inc_not_zero() for now) to <linux/atomic.h>
Signed-off-by: Arun Sharma <asharma@fb.com>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
By defining the 'stripe_head_state' in 'handle_stripe', we can move
some common code out of handle_stripe[56]() and into handle_stripe.
The means that all accesses for stripe_head_state in handle_stripe[56]
need to be 's->' instead of 's.', but the compiler should inline
those functions and just use a direct stack reference, and future
patches while hoist most of this code up into handle_stripe()
so we will revert to "s.".
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
Adding these three fields will allow more common code to be moved
to handle_stripe()
struct field rearrangement by Namhyung Kim.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
'struct stripe_head_state' stores state about the 'current' stripe
that is passed around while handling the stripe.
For RAID6 there is an extension structure: r6_state, which is also
passed around.
There is no value in keeping these separate, so move the fields from
the latter into the former.
This means that all code now needs to treat s->failed_num as an small
array, but this is a small cost.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
There is common code at the start of handle_stripe5 and
handle_stripe6. Move it into handle_stripe.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
sh->lock is now mainly used to ensure that two threads aren't running
in the locked part of handle_stripe[56] at the same time.
That can more neatly be achieved with an 'active' flag which we set
while running handle_stripe. If we find the flag is set, we simply
requeue the stripe for later by setting STRIPE_HANDLE.
For safety we take ->device_lock while examining the state of the
stripe and creating a summary in 'stripe_head_state / r6_state'.
This possibly isn't needed but as shared fields like ->toread,
->towrite are checked it is safer for now at least.
We leave the label after the old 'unlock' called "unlock" because it
will disappear in a few patches, so renaming seems pointless.
This leaves the stripe 'locked' for longer as we clear STRIPE_ACTIVE
later, but that is not a problem.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
Other places that change or follow dev->towrite and dev->written take
the device_lock as well as the sh->lock.
So it should really be held in these places too.
Also, doing so will allow sh->lock to be discarded.
with merged fixes by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
This is the start of a series of patches to remove sh->lock.
sync_request takes sh->lock before setting STRIPE_SYNCING to ensure
there is no race with testing it in handle_stripe[56].
Instead, use a new flag STRIPE_SYNC_REQUESTED and test it early
in handle_stripe[56] (after getting the same lock) and perform the
same set/clear operations if it was set.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (107 commits)
vfs: use ERR_CAST for err-ptr tossing in lookup_instantiate_filp
isofs: Remove global fs lock
jffs2: fix IN_DELETE_SELF on overwriting rename() killing a directory
fix IN_DELETE_SELF on overwriting rename() on ramfs et.al.
mm/truncate.c: fix build for CONFIG_BLOCK not enabled
fs:update the NOTE of the file_operations structure
Remove dead code in dget_parent()
AFS: Fix silly characters in a comment
switch d_add_ci() to d_splice_alias() in "found negative" case as well
simplify gfs2_lookup()
jfs_lookup(): don't bother with . or ..
get rid of useless dget_parent() in btrfs rename() and link()
get rid of useless dget_parent() in fs/btrfs/ioctl.c
fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers
drivers: fix up various ->llseek() implementations
fs: handle SEEK_HOLE/SEEK_DATA properly in all fs's that define their own llseek
Ext4: handle SEEK_HOLE/SEEK_DATA generically
Btrfs: implement our own ->llseek
fs: add SEEK_HOLE and SEEK_DATA flags
reiserfs: make reiserfs default to barrier=flush
...
Fix up trivial conflicts in fs/xfs/linux-2.6/xfs_super.c due to the new
shrinker callout for the inode cache, that clashed with the xfs code to
start the periodic workers later.
Moving the event counter into the dynamically allocated 'struc seq_file'
allows poll() support without the need to allocate its own tracking
structure.
All current users are switched over to use the new counter.
Requested-by: Andrew Morton akpm@linux-foundation.org
Acked-by: NeilBrown <neilb@suse.de>
Tested-by: Lucas De Marchi lucas.demarchi@profusion.mobi
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The rcu callback free_conf() just calls a kfree(),
so we use kfree_rcu() instead of the call_rcu(free_conf).
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: NeilBrown <neilb@suse.de>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
In raid5::make_request(), once bio_data_dir(@bi) is detected
it never (and couldn't) be changed. Use the result always.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Replace kmem_cache_alloc + memset(,0,) to kmem_cache_zalloc.
I think it's not harmful since @conf->slab_cache already knows
actual size of struct stripe_head.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
When performing a recovery, only first 2 slots in r10_bio are in use,
for read and write respectively. However all of pages in the write bio
are never used and just replaced to read bio's when the read completes.
Get rid of those unused pages and share read pages properly.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
When normal-write and sync-read/write bio completes, we should
find out the disk number the bio belongs to. Factor those common
code out to a separate function.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Variable 'first' is initialized to zero and updated to @rdev->raid_disk
only if it is greater than 0. Thus condition '>= first' always implies
'>= 0' so the latter is not needed.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
If a device fails in a way that causes pending request to take a while
to complete, md will not be able to immediately remove it from the
array in remove_and_add_spares.
It will then incorrectly look like a spare device and md will try to
recover it even though it is failed.
This leads to a recovery process starting and instantly aborting over
and over again.
We should check if the device is faulty before considering it to be a
spare. This will avoid trying to start a recovery that cannot
proceed.
This bug was introduced in 2.6.26 so that patch is suitable for any
kernel since then.
Cc: stable@kernel.org
Reported-by: Jim Paradis <james.paradis@stratus.com>
Signed-off-by: NeilBrown <neilb@suse.de>
In the bio_for_each_segment loop, bvl always points current
bio_vec, so the same as bio_iovec_idx(, i). Let's get rid of
it.
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Commit e9c7469bb4 ("md: implment REQ_FLUSH/FUA support")
introduced R5_WantFUA flag and set rw to WRITE_FUA in that case.
However remaining code still checks whether rw is exactly same
as WRITE or not, so FUAed-write ends up with being treated as
READ. Fix it.
This bug has been present since 2.6.37 and the fix is suitable for any
-stable kernel since then. It is not clear why this has not caused
more problems.
Cc: Tejun Heo <tj@kernel.org>
Cc: stable@kernel.org
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
The @bio->bi_phys_segments consists of active stripes count in the
lower 16 bits and processed stripes count in the upper 16 bits. So
logical-OR operator should be bitwise one.
This bug has been present since 2.6.27 and the fix is suitable for any
-stable kernel since then. Fortunately the bad code is only used on
error paths and is relatively unlikely to be hit.
Cc: stable@kernel.org
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Get rid of ->syncchunk and ->counter_bits since they're never used.
Also discard COUNTER_BYTE_RATIO which is unused.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Add check to determine if a device needs full resync or if partial resync will do
RAID 5 was assuming that if a device was not In_sync, it must undergo a full
resync. We add a check to see if 'saved_raid_disk' is the same as 'raid_disk'.
If it is, we can safely skip the full resync and rely on the bitmap for
partial recovery instead. This is the legitimate purpose of 'saved_raid_disk',
from md.h:
int saved_raid_disk; /* role that device used to have in the
* array and could again if we did a partial
* resync from the bitmap
*/
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Add bitmap support to the device-mapper specific metadata area.
This patch allows the creation of the bitmap metadata area upon
initial array creation via device-mapper.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Add the 'sync_super' function pointer to MD array structure (struct mddev_s)
If device-mapper (dm-raid.c) is to define its own on-disk superblock and be
able to load it, there must still be a way for MD to initiate superblock
updates. The simplest way to make this happen is to provide a pointer in
the MD array structure that can be set by device-mapper (or other module)
with a function to do this. If the function has been set, it will be used;
otherwise, the method with be looked up via 'super_types' as usual.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
MD RAID1: Changes to allow RAID1 to be used by device-mapper (dm-raid.c)
Added the necessary congestion function and conditionalize calls requiring an
array 'queue' or 'gendisk'.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Move personality and sync/recovery thread starting outside md_run.
Moving the wakeup's of the personality and sync/recovery threads out of
md_run and into do_md_run and mddev_resume solves two issues:
1) It allows bitmap_load to be called before the sync_thread is run and
2) when MD personalities are used by device-mapper (dm-raid.c), the start-up
of the array is better alligned with device-mapper primatives
(CTR/resume/suspend/DTR). I/O - in this case, recovery operations - should
not happen until after a resume has taken place.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Make message a bit clearer by s/blocks/k/
I chose 'k' vs 'kiB' or 'kB' because it is what is used earlier in the
message. 'k' may be a bit ambigous, but I think it's better than "blocks"
which normally means 512, but means 1024 in MD.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Disallow resync I/O while the RAID array is suspended.
Recovery, resync, and metadata I/O should not be allowed while a device is
suspended.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Don't attempt md_integrity_register if there is no gendisk struct available.
When MD arrays are built via device-mapper, the gendisk structure is not
available via mddev.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Return client directly from dm_kcopyd_client_create, not through a
parameter, making it consistent with dm_io_client_create.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Reserve just the minimum of pages needed to process one job.
Because we allocate pages from page allocator, we don't need to reserve
a large number of pages. The maximum job size is SUB_JOB_SIZE and we
calculate the number of reserved pages based on this.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Replace the arbitrary calculation of an initial io struct mempool size
with a constant.
The code calculated the number of reserved structures based on the request
size and used a "magic" multiplication constant of 4. This patch changes
it to reserve a fixed number - itself still chosen quite arbitrarily.
Further testing might show if there is a better number to choose.
Note that if there is no memory pressure, we can still allocate an
arbitrary number of "struct io" structures. One structure is enough to
process the whole request.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch changes dm-kcopyd so that it allocates pages from the main
page allocator with __GFP_NOWARN | __GFP_NORETRY flags (so that it can
fail in case of memory pressure). If the allocation fails, dm-kcopyd
allocates pages from its own reserve.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Introduce a parameter for gfp flags to alloc_pl() for use in following
patches.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Remove the spinlock protecting the pages allocation. The spinlock is only
taken on initialization or from single-threaded workqueue. Therefore, the
spinlock is useless.
The spinlock is taken in kcopyd_get_pages and kcopyd_put_pages.
kcopyd_get_pages is only called from run_pages_job, which is only
called from process_jobs called from do_work.
kcopyd_put_pages is called from client_alloc_pages (which is initialization
function) or from run_complete_job. run_complete_job is only called from
process_jobs called from do_work.
Another spinlock, kc->job_lock is taken each time someone pushes or pops
some work for the worker thread. Once we take kc->job_lock, we
guarantee that any written memory is visible to the other CPUs.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
There's a possible theoretical deadlock in dm-kcopyd because multiple
allocations from the same mempool are required to finish a request.
Avoid this by preallocating sub jobs.
There is a mempool of 512 entries. Each request requires up to 9
entries from the mempool. If we have at least 57 concurrent requests
running, the mempool may overflow and mempool allocations may start
blocking until another entry is freed to the mempool. Because the same
thread is used to free entries to the mempool and allocate entries from
the mempool, this may result in a deadlock.
This patch changes it so that one mempool entry contains all 9 "struct
kcopyd_job" required to fulfill the whole request. The allocation is
done only once in dm_kcopyd_copy and no further mempool allocations are
done during request processing.
If dm_kcopyd_copy is not run in the completion thread, this
implementation is deadlock-free.
MIN_JOBS needs reducing accordingly and we've chosen to reduce it
further to 8.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Don't split SUB_JOB_SIZE jobs
If the job size equals SUB_JOB_SIZE, there is no point in splitting it.
Splitting it just unnecessarily wastes time, because the split job size
is SUB_JOB_SIZE too.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Integrity errors need to be passed to the owner of the integrity
metadata for processing. Consequently EILSEQ should be passed up the
stack.
Cc: stable@kernel.org
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch adds a check that a block device has a request function
defined before it is used. Otherwise, misconfiguration can cause an oops.
Because we are allowing devices with zero size e.g. an offline multipath
device as in commit 2cd54d9bed
("dm: allow offline devices") there needs to be an additional check
to ensure devices are initialised. Some block devices, like a loop
device without a backing file, exist but have no request function.
Reproducer is trivial: dm-mirror on unbound loop device
(no backing file on loop devices)
dmsetup create x --table "0 8 mirror core 2 8 sync 2 /dev/loop0 0 /dev/loop1 0"
and mirror resync will immediatelly cause OOps.
BUG: unable to handle kernel NULL pointer dereference at (null)
? generic_make_request+0x2bd/0x590
? kmem_cache_alloc+0xad/0x190
submit_bio+0x53/0xe0
? bio_add_page+0x3b/0x50
dispatch_io+0x1ca/0x210 [dm_mod]
? read_callback+0x0/0xd0 [dm_mirror]
dm_io+0xbb/0x290 [dm_mod]
do_mirror+0x1e0/0x748 [dm_mirror]
Signed-off-by: Milan Broz <mbroz@redhat.com>
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Permit a target to support discards regardless of whether or not all its
underlying devices do.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
b43: fix comment typo reqest -> request
Haavard Skinnemoen has left Atmel
cris: typo in mach-fs Makefile
Kconfig: fix copy/paste-ism for dell-wmi-aio driver
doc: timers-howto: fix a typo ("unsgined")
perf: Only include annotate.h once in tools/perf/util/ui/browsers/annotate.c
md, raid5: Fix spelling error in comment ('Ofcourse' --> 'Of course').
treewide: fix a few typos in comments
regulator: change debug statement be consistent with the style of the rest
Revert "arm: mach-u300/gpio: Fix mem_region resource size miscalculations"
audit: acquire creds selectively to reduce atomic op overhead
rtlwifi: don't touch with treewide double semicolon removal
treewide: cleanup continuations and remove logging message whitespace
ath9k_hw: don't touch with treewide double semicolon removal
include/linux/leds-regulator.h: fix syntax in example code
tty: fix typo in descripton of tty_termios_encode_baud_rate
xtensa: remove obsolete BKL kernel option from defconfig
m68k: fix comment typo 'occcured'
arch:Kconfig.locks Remove unused config option.
treewide: remove extra semicolons
...
The sysfs attribute 'resync_start' (known internally as recovery_cp),
records where a resync is up to. A value of 0 means the array is
not known to be in-sync at all. A value of MaxSector means the array
is believed to be fully in-sync.
When the size of member devices of an array (RAID1,RAID4/5/6) is
increased, the array can be increased to match. This process sets
resync_start to the old end-of-device offset so that the new part of
the array gets resynced.
However with RAID1 (and RAID6) a resync is not technically necessary
and may be undesirable. So it would be good if the implied resync
after the array is resized could be avoided.
So: change 'resync_start' so the value can be changed while the array
is active, and as a precaution only allow it to be changed while
resync/recovery is 'frozen'. Changing it once resync has started is
not going to be useful anyway.
This allows the array to be resized without a resync by:
write 'frozen' to 'sync_action'
write new size to 'component_size' (this will set resync_start)
write 'none' to 'resync_start'
write 'idle' to 'sync_action'.
Also slightly improve some tests on recovery_cp when resizing
raid1/raid5. Now that an arbitrary value could be set we should be
more careful in our tests.
Signed-off-by: NeilBrown <neilb@suse.de>
When a loop ends with an 'if' with a large body, it is neater
to make the if 'continue' on the inverse condition, and then
the body is indented less.
Apply this pattern 3 times, and wrap some other long lines.
Signed-off-by: NeilBrown <neilb@suse.de>
Currently the rdev on which a read error happened could be removed
before we perform the fix_error handling. This requires extra tests
for NULL.
So delay the rdev_dec_pending call until after the call to
fix_read_error so that we can be sure that the rdev still exists.
This allows an 'if' clause to be removed so the body gets re-indented
back one level.
Signed-off-by: NeilBrown <neilb@suse.de>
The current handling and freeing of these pages is a bit fragile.
We only keep the list of allocated pages in each bio, so we need to
still have a valid bio when freeing the pages, which is a bit clumsy.
So simply store the allocated page list in the r1_bio so it can easily
be found and freed when we are finished with the r1_bio.
Signed-off-by: NeilBrown <neilb@suse.de>
If we get a read error during resync/recovery we current repeat with
single-page reads to find out just where the error is, and possibly
read each page from a different device.
With check/repair we don't currently do that, we just fail.
However it is possible that while all devices fail on the large 64K
read, we might be able to satisfy each 4K from one device or another.
So call fix_sync_read_error before process_checks to maximise the
chance of finding good data and writing it out to the devices with
read errors.
For this to work, we need to set the 'uptodate' flags properly after
fix_sync_read_error has succeeded.
Signed-off-by: NeilBrown <neilb@suse.de>
These changes are mostly cosmetic:
1/ change mddev->raid_disks to conf->raid_disks because the later is
technically safer, though in current practice it doesn't matter in
this particular context.
2/ Rearrange two for / if loops to have an early 'continue' so the
body of the 'if' doesn't need to be indented so much.
Signed-off-by: NeilBrown <neilb@suse.de>
sync_request_write is too big and too deep.
So split out two self-contains bits of functionality into separate
function.
Signed-off-by: NeilBrown <neilb@suse.de>
- there is no need to test_bit Faulty, as that was already done in
md_error which is the only caller of these functions.
- MD_CHANGE_DEVS should be set *after* faulty is set to ensure
metadata is updated correctly.
- spinlock should be held while updating ->degraded.
Signed-off-by: NeilBrown <neilb@suse.de>
read_balance has two loops which both look for a 'best'
device based on slightly different criteria.
This is clumsy and makes is hard to add extra criteria.
So replace it all with a single loop that combines everything.
Signed-off-by: NeilBrown <neilb@suse.de>
raid10 read balance has two different loop for looking through
possible devices to chose the best.
Collapse those into one loop and generally make the code more
readable.
Signed-off-by: NeilBrown <neilb@suse.de>
If a bitmap is found to be 'stale' the events_cleared value
is set to match 'events'.
However if the array is degraded this does not get stored on disk.
This can subsequently lead to incorrect behaviour.
So change bitmap_update_sb to always update events_cleared in the
superblock from the known events_cleared.
For neatness also set ->state from ->flags.
This requires updating ->state whenever we update ->flags, which makes
sense anyway.
This is suitable for any active -stable release.
cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
The 'add_new_disk' ioctl can be used to add a device either as a
spare, or as an active disk that just needs to be resynced based on
write-intent-bitmap information (re-add)
Currently if a re-add is requested but fails we add as a spare
instead. This makes it impossible for user-space to check for
failure.
So change to require that a re-add attempt will either succeed or
completely fail. User-space can then decide what to do next.
Signed-off-by: NeilBrown <neilb@suse.de>
There is a race when creating an md device by opening /dev/mdXX.
If two processes do this at much the same time they will follow the
call path
__blkdev_get -> get_gendisk -> kobj_lookup
The first will call
-> md_probe -> md_alloc -> add_disk -> blk_register_region
and the race happens when the second gets to kobj_lookup after
add_disk has called blk_register_region but before it returns to
md_alloc.
In the case the second will not call md_probe (as the probe is already
done) but will get a handle on the gendisk, return to __blkdev_get
which will then call md_open (via the ->open) pointer.
As mddev->gendisk hasn't been set yet, md_open will think something is
wrong an return with ERESTARTSYS.
This can loop endlessly while the first thread makes no progress
through add_disk. Nothing is blocking it, but due to scheduler
behaviour it doesn't get a turn.
So this is essentially a live-lock.
We fix this by simply moving the assignment to mddev->gendisk before
the call the add_disk() so md_open doesn't get confused.
Also move blk_queue_flush earlier because add_disk should be as late
as possible.
To make sure that md_open doesn't complete until md_alloc has done all
that is needed, we take mddev->open_mutex during the last part of
md_alloc. md_open will wait for this.
This can cause a lock-up on boot so Cc:ing for stable.
For 2.6.36 and earlier a different patch will be needed as the
'blk_queue_flush' call isn't there.
Signed-off-by: NeilBrown <neilb@suse.de>
Reported-by: Thomas Jarosch <thomas.jarosch@intra2net.com>
Tested-by: Thomas Jarosch <thomas.jarosch@intra2net.com>
Cc: stable@kernel.org
There's a small typo in a comment in drivers/md/raid5.c - 'Of course' is
misspelled as 'Ofcourse'. This patch fixes the spelling error.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Change <sectors> from unsigned long long to sector_t.
This matches its source field.
ERROR: "__udivdi3" [drivers/md/raid456.ko] undefined!
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Problem:
After raid4->raid0 takeover operation, another takeover operation
(e.g raid0->raid10) results "kernel oops".
Root cause:
Variables 'degraded' in mddev structure is not cleared
on raid45->raid0 takeover.
This patch reset this variable.
Signed-off-by: Krzysztof Wojcik <krzysztof.wojcik@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
A raid0 array doesn't set 'dev_sectors' as each device might
contribute a different number of sectors.
So when converting to a RAID4 or RAID5 we need to set dev_sectors
as they need the number.
We have already verified that in fact all devices do contribute
the same number of sectors, so use that number.
Signed-off-by: NeilBrown <neilb@suse.de>
We previously needed to set ->queue_lock to match the raid5
device_lock so we could safely use queue_flag_* operations (e.g. for
plugging). which test the ->queue_lock is in fact locked.
However that need has completely gone away and is unlikely to come
back to remove this now-pointless setting.
Signed-off-by: NeilBrown <neilb@suse.de>
We just need to make sure that an unplug event wakes up the md
thread, which is exactly what mddev_check_plugged does.
Also remove some plug-related code that is no longer needed.
Signed-off-by: NeilBrown <neilb@suse.de>
In raid5 plugging is used for 2 things:
1/ collecting writes that require a bitmap update
2/ collecting writes in the hope that we can create full
stripes - or at least more-full.
We now release these different sets of stripes when plug_cnt
is zero.
Also in make_request, we call mddev_check_plug to hopefully increase
plug_cnt, and wake up the thread at the end if plugging wasn't
achieved for some reason.
Signed-off-by: NeilBrown <neilb@suse.de>
When an md device adds a request to a queue, it can call
mddev_check_plugged.
If this succeeds then we know that the md thread will be woken up
shortly, and ->plug_cnt will be non-zero until then, so some
processing can be delayed.
If it fails, then no unplug callback is expected and the make_request
function needs to do whatever is required to make the request happen.
Signed-off-by: NeilBrown <neilb@suse.de>
md has some plugging infrastructure for RAID5 to use because the
normal plugging infrastructure required a 'request_queue', and when
called from dm, RAID5 doesn't have one of those available.
This relied on the ->unplug_fn callback which doesn't exist any more.
So remove all of that code, both in md and raid5. Subsequent patches
with restore the plugging functionality.
Signed-off-by: NeilBrown <neilb@suse.de>
Now that unplugging is done differently, the unplug_fn callback is
never called, so it can be completely discarded.
Signed-off-by: NeilBrown <neilb@suse.de>
md/raid submits a lot of IO from the various raid threads.
So adding start/finish plug calls to those so that some
plugging happens.
Signed-off-by: NeilBrown <neilb@suse.de>
The current block integrity (DIF/DIX) support in DM is verifying that
all devices' integrity profiles match during DM device resume (which
is past the point of no return). To some degree that is unavoidable
(stacked DM devices force this late checking). But for most DM
devices (which aren't stacking on other DM devices) the ideal time to
verify all integrity profiles match is during table load.
Introduce the notion of an "initialized" integrity profile: a profile
that was blk_integrity_register()'d with a non-NULL 'blk_integrity'
template. Add blk_integrity_is_initialized() to allow checking if a
profile was initialized.
Update DM integrity support to:
- check all devices with _initialized_ integrity profiles match
during table load; uninitialized profiles (e.g. for underlying DM
device(s) of a stacked DM device) are ignored.
- disallow a table load that would result in an integrity profile that
conflicts with a DM device's existing (in-use) integrity profile
- avoid clearing an existing integrity profile
- validate all integrity profiles match during resume; but if they
don't all we can do is report the mismatch (during resume we're past
the point of no return)
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
We incorrectly returned -EINVAL when none of the devices in the array
had an integrity profile. This in turn prevented mdadm from starting
the metadevice. Fix this so we only return errors on mismatched
profiles and memory allocation failures.
Reported-by: Giacomo Catenazzi <cate@cateee.net>
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm:
dm stripe: implement merge method
dm mpath: allow table load with no priority groups
dm mpath: fail message ioctl if specified path is not valid
dm ioctl: add flag to wipe buffers for secure data
dm ioctl: prepare for crypt key wiping
dm crypt: wipe keys string immediately after key is set
dm: add flakey target
dm: fix opening log and cow devices for read only tables
* 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits)
Documentation/iostats.txt: bit-size reference etc.
cfq-iosched: removing unnecessary think time checking
cfq-iosched: Don't clear queue stats when preempt.
blk-throttle: Reset group slice when limits are changed
blk-cgroup: Only give unaccounted_time under debug
cfq-iosched: Don't set active queue in preempt
block: fix non-atomic access to genhd inflight structures
block: attempt to merge with existing requests on plug flush
block: NULL dereference on error path in __blkdev_get()
cfq-iosched: Don't update group weights when on service tree
fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
block: Require subsystems to explicitly allocate bio_set integrity mempool
jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
fs: make fsync_buffers_list() plug
mm: make generic_writepages() use plugging
blk-cgroup: Add unaccounted time to timeslice_used.
block: fixup plugging stubs for !CONFIG_BLOCK
block: remove obsolete comments for blkdev_issue_zeroout.
blktrace: Use rq->cmd_flags directly in blk_add_trace_rq.
...
Fix up conflicts in fs/{aio.c,super.c}
Implement a merge function in the striped target.
When the striped target's underlying devices provide a merge_bvec_fn
(like all DM devices do via dm_merge_bvec) it is important to call down
to them when building a biovec that doesn't span a stripe boundary.
Without the merge method, a striped DM device stacked on DM devices
causes bios with a single page to be submitted which results
in unnecessary overhead that hurts performance.
This change really helps filesystems (e.g. XFS and now ext4) which take
care to assemble larger bios. By implementing stripe_merge(), DM and the
stripe target no longer undermine the filesystem's work by only allowing
a single page per bio. Buffered IO sees the biggest improvement
(particularly uncached reads, buffered writes to a lesser degree). This
is especially so for more capable "enterprise" storage LUNs.
The performance improvement has been measured to be ~12-35% -- when a
reasonable chunk_size is used (e.g. 64K) in conjunction with a stripe
count that is a power of 2.
In contrast, the performance penalty is ~5-7% for the pathological worst
case stripe configuration (small chunk_size with a stripe count that is
not a power of 2). The reason for this is that stripe_map_sector() is
now called once for every call to dm_merge_bvec(). stripe_map_sector()
will use slower division if stripe count isn't a power of 2.
Signed-off-by: Mustafa Mesanovic <mume@linux.vnet.ibm.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch adjusts the multipath target to allow a table with both 0
priority groups and 0 for the initial priority group number.
If any mpath device is held open when all paths in the last priority
group have failed, userspace multipathd will attempt to reload the
associated DM table to reflect the fact that the device no longer has
any priority groups. But the reload attempt always failed because the
multipath target did not allow 0 priority groups.
All multipath target messages related to priority group (enable_group,
disable_group, switch_group) will handle a priority group of 0 (will
cause error).
When reloading a multipath table with 0 priority groups, userspace
multipathd must be updated to specify an initial priority group number
of 0 (rather than 1).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: Babu Moger <babu.moger@lsi.com>
Acked-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Fail the reinstate_path and fail_path message ioctl if the specified
path is not valid.
The message ioctl would succeed for the 'reinistate_path' and
'fail_path' messages even if action was not taken because the
specified device was not a valid path of the multipath device.
Before, when /dev/vdb is not a path of mpathb:
$ dmsetup message mpathb 0 reinstate_path /dev/vdb
$ echo $?
0
After:
$ dmsetup message mpathb 0 reinstate_path /dev/vdb
device-mapper: message ioctl failed: Invalid argument
Command failed
$ echo $?
1
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Add DM_SECURE_DATA_FLAG which userspace can use to ensure
that all buffers allocated for dm-ioctl are wiped
immediately after use.
The user buffer is wiped as well (we do not want to keep
and return sensitive data back to userspace if the flag is set).
Wiping is useful for cryptsetup to ensure that the key
is present in memory only in defined places and only
for the time needed.
(For crypt, key can be present in table during load or table
status, wait and message commands).
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Prepare code for implementing buffer wipe flag.
No functional change in this patch.
Signed-off-by: Milan Broz <mbroz@redhat.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Always wipe the original copy of the key after processing it
in crypt_set_key().
Signed-off-by: Milan Broz <mbroz@redhat.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This target is the same as the linear target except that it returns I/O
errors periodically. It's been found useful in simulating failing
devices for testing purposes.
I needed a dm target to do some failure testing on btrfs's raid code, and
Mike pointed me at this.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
If a table is read-only, also open any log and cow devices it uses read-only.
Previously, even read-only devices were opened read-write internally.
After patch 75f1dc0d07
block: check bdev_read_only() from blkdev_get()
was applied, loading such tables began to fail. The patch
was reverted by e51900f7d3
block: revert block_dev read-only check
but this patch fixes this part of the code to work with the original patch.
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
As a preparation for removing ext2 non-atomic bit operations from
asm/bitops.h. This converts ext2 non-atomic bit operations to
little-endian bit operations.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Alasdair Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As a preparation for removing ext2 non-atomic bit operations from
asm/bitops.h. This converts ext2 non-atomic bit operations to
little-endian bit operations.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>