kernel_optimize_test

Author	SHA1	Message	Date
Dan Williams	a445685647	raid5: refactor handle_stripe5 and handle_stripe6 (v3) handle_stripe5 and handle_stripe6 have very deep logic paths handling the various states of a stripe_head. By introducing the 'stripe_head_state' and 'r6_state' objects, large portions of the logic can be moved to sub-routines. 'struct stripe_head_state' consumes all of the automatic variables that previously stood alone in handle_stripe5,6. 'struct r6_state' contains the handle_stripe6 specific variables like p_failed and q_failed. One of the nice side effects of the 'stripe_head_state' change is that it allows for further reductions in code duplication between raid5 and raid6. The following new routines are shared between raid5 and raid6: handle_completed_write_requests handle_requests_to_failed_array handle_stripe_expansion Changes: * v2: fixed 'conf->raid_disk-1' for the raid6 'handle_stripe_expansion' path * v3: removed the unused 'dirty' field from struct stripe_head_state * v3: coalesced open coded bi_end_io routines into return_io() Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-By: NeilBrown <neilb@suse.de>	2007-07-13 08:06:15 -07:00
Dan Williams	9bc89cd82d	async_tx: add the async_tx api The async_tx api provides methods for describing a chain of asynchronous bulk memory transfers/transforms with support for inter-transactional dependencies. It is implemented as a dmaengine client that smooths over the details of different hardware offload engine implementations. Code that is written to the api can optimize for asynchronous operation and the api will fit the chain of operations to the available offload resources. I imagine that any piece of ADMA hardware would register with the 'async_' subsystem, and a call to async_X would be routed as appropriate, or be run in-line. - Neil Brown async_tx exploits the capabilities of struct dma_async_tx_descriptor to provide an api of the following general format: struct dma_async_tx_descriptor async_<operation>(..., struct dma_async_tx_descriptor depend_tx, dma_async_tx_callback cb_fn, void cb_param) { struct dma_chan chan = async_tx_find_channel(depend_tx, <operation>); struct dma_device device = chan ? chan->device : NULL; int int_en = cb_fn ? 1 : 0; struct dma_async_tx_descriptor tx = device ? device->device_prep_dma_<operation>(chan, len, int_en) : NULL; if (tx) { / run <operation> asynchronously / ... tx->tx_set_dest(addr, tx, index); ... tx->tx_set_src(addr, tx, index); ... async_tx_submit(chan, tx, flags, depend_tx, cb_fn, cb_param); } else { / run <operation> synchronously / ... <operation> ... async_tx_sync_epilog(flags, depend_tx, cb_fn, cb_param); } return tx; } async_tx_find_channel() returns a capable channel from its pool. The channel pool is organized as a per-cpu array of channel pointers. The async_tx_rebalance() routine is tasked with managing these arrays. In the uniprocessor case async_tx_rebalance() tries to spread responsibility evenly over channels of similar capabilities. For example if there are two copy+xor channels, one will handle copy operations and the other will handle xor. In the SMP case async_tx_rebalance() attempts to spread the operations evenly over the cpus, e.g. cpu0 gets copy channel0 and xor channel0 while cpu1 gets copy channel 1 and xor channel 1. When a dependency is specified async_tx_find_channel defaults to keeping the operation on the same channel. A xor->copy->xor chain will stay on one channel if it supports both operation types, otherwise the transaction will transition between a copy and a xor resource. Currently the raid5 implementation in the MD raid456 driver has been converted to the async_tx api. A driver for the offload engines on the Intel Xscale series of I/O processors, iop-adma, is provided in a later commit. With the iop-adma driver and async_tx, raid456 is able to offload copy, xor, and xor-zero-sum operations to hardware engines. On iop342 tiobench showed higher throughput for sequential writes (20 - 30% improvement) and sequential reads to a degraded array (40 - 55% improvement). For the other cases performance was roughly equal, +/- a few percentage points. On a x86-smp platform the performance of the async_tx implementation (in synchronous mode) was also +/- a few percentage points of the original implementation. According to 'top' on iop342 CPU utilization drops from ~50% to ~15% during a 'resync' while the speed according to /proc/mdstat doubles from ~25 MB/s to ~50 MB/s. The tiobench command line used for testing was: tiobench --size 2048 --block 4096 --block 131072 --dir /mnt/raid --numruns 5 iop342 had 1GB of memory available Details: * if CONFIG_DMA_ENGINE=n the asynchronous path is compiled away by making async_tx_find_channel a static inline routine that always returns NULL * when a callback is specified for a given transaction an interrupt will fire at operation completion time and the callback will occur in a tasklet. if the the channel does not support interrupts then a live polling wait will be performed * the api is written as a dmaengine client that requests all available channels * In support of dependencies the api implicitly schedules channel-switch interrupts. The interrupt triggers the cleanup tasklet which causes pending operations to be scheduled on the next channel * Xor engines treat an xor destination address differently than a software xor routine. To the software routine the destination address is an implied source, whereas engines treat it as a write-only destination. This patch modifies the xor_blocks routine to take a an explicit destination address to mirror the hardware. Changelog: * fixed a leftover debug print * don't allow callbacks in async_interrupt_cond * fixed xor_block changes * fixed usage of ASYNC_TX_XOR_DROP_DEST * drop dma mapping methods, suggested by Chris Leech * printk warning fixups from Andrew Morton * don't use inline in C files, Adrian Bunk * select the API when MD is enabled * BUG_ON xor source counts <= 1 * implicitly handle hardware concerns like channel switching and interrupts, Neil Brown * remove the per operation type list, and distribute operation capabilities evenly amongst the available channels * simplify async_tx_find_channel to optimize the fast path * introduce the channel_table_initialized flag to prevent early calls to the api * reorganize the code to mimic crypto * include mm.h as not all archs include it in dma-mapping.h * make the Kconfig options non-user visible, Adrian Bunk * move async_tx under crypto since it is meant as 'core' functionality, and the two may share algorithms in the future * move large inline functions into c files * checkpatch.pl fixes * gpl v2 only correction Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-By: NeilBrown <neilb@suse.de>	2007-07-13 08:06:14 -07:00
Dan Williams	685784aaf3	xor: make 'xor_blocks' a library routine for use with async_tx The async_tx api tries to use a dma engine for an operation, but will fall back to an optimized software routine otherwise. Xor support is implemented using the raid5 xor routines. For organizational purposes this routine is moved to a common area. The following fixes are also made: * rename xor_block => xor_blocks, suggested by Adrian Bunk * ensure that xor.o initializes before md.o in the built-in case * checkpatch.pl fixes * mark calibrate_xor_blocks __init, Adrian Bunk Cc: Adrian Bunk <bunk@stusta.de> Cc: NeilBrown <neilb@suse.de> Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2007-07-13 08:06:14 -07:00
Chandra Seetharaman	dd172d72ad	dm mpath: rdac This patch supports LSI/Engenio devices in RDAC mode. Like dm-emc it requires userspace support. In your multipath.conf file you must have: path_checker rdac hardware_handler "1 rdac" prio_callout "/sbin/mpath_prio_tpc /dev/%n" And you also then must have a updated multipath tools release which has rdac support. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-12 15:01:23 -07:00
Jonathan Brassow	fc1ff9588a	dm raid1: handle log failure When writing to a mirror, the log must be updated first. Failure to update the log could result in the log not properly reflecting the state of the mirror if the machine should crash. We change the return type of the rh_flush function to give us the ability to check if a log write was successful. If the log write was unsuccessful, we fail the writes to avoid the case where the log does not properly reflect the state of the mirror. A follow-up patch - which is dependent on the ability to requeue I/O's to core device-mapper - will requeue the I/O's for retry (allowing the mirror to be reconfigured.) Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-12 15:01:08 -07:00
Jonathan Brassow	f44db678ed	dm raid1: handle resync failures Device-mapper mirroring currently takes a best effort approach to recovery - failures during mirror synchronization are completely ignored. This means that regions are marked 'in-sync' and 'clean' and removed from the hash list. Future reads and writes that query the region will incorrectly interpret the region as in-sync. This patch handles failures during the recovery process. If a failure occurs, the region is marked as 'not-in-sync' (aka RH_NOSYNC) and added to a new list 'failed_recovered_regions'. Regions on the 'failed_recovered_regions' list are not marked as 'clean' upon removal from the list. Furthermore, if the DM_RAID1_HANDLE_ERRORS flag is set, the region is marked as 'not-in-sync'. This action prevents any future read-balancing from choosing an invalid device because of the 'not-in-sync' status. If "handle_errors" is not specified when creating a mirror (leaving the DM_RAID1_HANDLE_ERRORS flag unset), failures will be ignored exactly as they would be without this patch. This is to preserve backwards compatibility with user-space tools, such as 'pvmove'. However, since future read-balancing policies will rely on the correct sync status of a region, a user must choose "handle_errors" when using read-balancing. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-12 15:01:08 -07:00
Jonathan Brassow	d0d444c7d4	dm: add ratelimit logging macros Add ratelimit extension to dm logging macros. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-12 15:01:08 -07:00
Stefan Bader	07a83c47cf	dm: disable barriers This patch causes device-mapper to reject any barrier requests. This is done since most of the targets won't handle this correctly anyway. So until the situation improves it is better to reject these requests at the first place. Since barrier requests won't get to the targets, the checks there can be removed. Cc: stable@kernel.org Signed-off-by: Stefan Bader <shbader@de.ibm.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-12 15:01:08 -07:00
Jonathan Brassow	943317efdb	dm raid1: clear region outside spinlock A clear_region function is permitted to block (in practice, rare) but gets called in rh_update_states() with a spinlock held. The bits being marked and cleared by the above functions are used to update the on-disk log, but are never read directly. We can perform these operations outside the spinlock since the bits are only changed within one thread viz. - mark_region in rh_inc() - clear_region in rh_update_states(). So, we grab the clean_regions list items via list_splice() within the spinlock and defer clear_region() until we iterate over the list for deletion - similar to how the recovered_regions list is already handled. We then move the flush() call down to ensure it encapsulates the changes which are done by the later calls to clear_region(). Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-12 15:01:08 -07:00
Milan Broz	0764147b11	dm snapshot: permit invalid activation Allow invalid snapshots to be activated instead of failing. This allows userspace to reinstate any given snapshot state - for example after an unscheduled reboot - and clean up the invalid snapshot at its leisure. Cc: stable@kernel.org Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-12 15:01:08 -07:00
Milan Broz	fcac03abd3	dm snapshot: fix invalidation deadlock Process persistent exception store metadata IOs in a separate thread. A snapshot may become invalid while inside generic_make_request(). A synchronous write is then needed to update the metadata while still inside that function. Since the introduction of md-dm-reduce-stack-usage-with-stacked-block-devices.patch this has to be performed by a separate thread to avoid deadlock. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-12 15:01:08 -07:00
Jun'ichi Nomura	596f138eed	dm io: fix panic on large request bio_alloc_bioset() will return NULL if 'num_vecs' is too large. Use bio_get_nr_vecs() to get estimation of maximum number. Cc: stable@kernel.org Signed-off-by: "Jun'ichi Nomura" <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-12 15:01:08 -07:00
Milan Broz	c95bc206da	dm raid1: fix status Fix mirror status line broken in dm-log-report-fault-status.patch: - space missing between two words - placeholder ("0") required for compatibility with a subsequent patch - incorrect offset parameter Cc: stable@kernel.org Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-12 15:01:08 -07:00
Alasdair G Kergon	0cd3312434	dm: remove duplicate module name from error msgs Remove explicit module name from messages as the macro now includes it automatically. Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-12 15:01:08 -07:00
Alasdair G Kergon	ac818646d4	dm delay: cleanup Use setup_timer(). Replace semaphore with mutex. Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-12 15:01:08 -07:00
Alasdair G Kergon	028867ac28	dm: use kmem_cache macro Use new KMEM_CACHE() macro and make the newly-exposed structure names more meaningful. Also remove some superfluous casts and inlines (let a modern compiler be the judge). Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-12 15:01:08 -07:00
Alasdair G Kergon	79e15ae424	dm: bio_list prefetch removal Remove dubious prefetch from bio_list_for_each() macro. Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-12 15:01:08 -07:00
Mike Accetta	ed45666271	md: fix bug in error handling during raid1 repair If raid1/repair (which reads all block and fixes any differences it finds) hits a read error, it doesn't reset the bio for writing before writing correct data back, so the read error isn't fixed, and the device probably gets a zero-length write which it might complain about. Signed-off-by: Neil Brown <neilb@suse.de> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-06-16 13:16:15 -07:00
NeilBrown	af03b8e4e8	md: fix two raid10 bugs 1/ When resyncing a degraded raid10 which has more than 2 copies of each block, garbage can get synced on top of good data. 2/ We round the wrong way in part of the device size calculation, which can cause confusion. Signed-off-by: Neil Brown <neilb@suse.de> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-06-16 13:16:15 -07:00
NeilBrown	a778b73ff7	md: fix bug with linear hot-add and elsewhere Adding a drive to a linear array seems to have stopped working, due to changes elsewhere in md, and insufficient ongoing testing... So the patch to make linear hot-add work in the first place introduced a subtle bug elsewhere that interracts poorly with older version of mdadm. This fixes it all up. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-23 20:14:14 -07:00
NeilBrown	ab6085c795	md: don't write more than is required of the last page of a bitmap It is possible that real data or metadata follows the bitmap without full page alignment. So limit the last write to be only the required number of bytes, rounded up to the hard sector size of the device. Signed-off-by: Neil Brown <neilb@suse.de> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-23 20:14:14 -07:00
NeilBrown	787f17feb2	md: avoid overflow in raid0 calculation with large components If a raid0 has a component device larger than 4TB, and is accessed on a 32bit machines, then as 'chunk' is unsigned long, chunk << chunksize_bits can overflow (this can be as high as the size of the device in KB). chunk itself will not overflow (without triggering a BUG). So change 'chunk' to be 'sector_t, and get rid of the 'BUG' as it becomes impossible to hit. Cc: "Jeff Zheng" <Jeff.Zheng@endace.com> Signed-off-by: Neil Brown <neilb@suse.de> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-23 20:14:14 -07:00
NeilBrown	435b71be20	md: improve the is_mddev_idle test During a 'resync' or similar activity, md checks if the devices in the array are otherwise active and winds back resync activity when they are. This test in done in is_mddev_idle, and it is somewhat fragile - it sometimes thinks there is non-sync io when there isn't. The test compares the total sectors of io (disk_stat_read) with the sectors of resync io (disk->sync_io). This has problems because total sectors gets updated when a request completes, while resync io gets updated when the request is submitted. The time difference can cause large differenced between the two which do not actually imply non-resync activity. The test currently allows for some fuzz (+/- 4096) but there are some cases when it is not enough. The test currently looks for any (non-fuzz) difference, either positive or negative. This clearly is not needed. Any non-sync activity will cause the total sectors to grow faster than the sync_io count (never slower) so we only need to look for a positive differences. If we do this then the amount of in-flight sync io will never cause the appearance of non-sync IO. Once enough non-sync IO to worry about starts happening, resync will be slowed down and the measurements will thus be more precise (as there is less in-flight) and control of resync will still be suitably responsive. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-11 08:29:37 -07:00
NeilBrown	dd00a99e7a	md: avoid a possibility that a read error can wrongly propagate through md/raid1 to a filesystem. When a raid1 has only one working drive, we want read error to propagate up to the filesystem as there is no point failing the last drive in an array. Currently the code perform this check is racy. If a write and a read a both submitted to a device on a 2-drive raid1, and the write fails followed by the read failing, the read will see that there is only one working drive and will pass the failure up, even though the one working drive is actually the other one. So, tighten up the locking. Signed-off-by: Neil Brown <neilb@suse.de> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-10 09:26:53 -07:00
Linus Torvalds	44ce6294d0	Revert "md: improve partition detection in md array" This reverts commit `5b479c91da`. Quoth Neil Brown: "It causes an oops when auto-detecting raid arrays, and it doesn't seem easy to fix. The array may not be 'open' when do_md_run is called, so bdev->bd_disk might be NULL, so bd_set_size can oops. This whole approach of opening an md device before it has been assembled just seems to get more and more painful. I think I'm going to have to come up with something clever to provide both backward comparability with usage expectation, and sane integration into the rest of the kernel." Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 18:51:36 -07:00
NeilBrown	5b479c91da	md: improve partition detection in md array md currently uses ->media_changed to make sure rescan_partitions is call on md array after they are assembled. However that doesn't happen until the array is opened, which is later than some people would like. So use blkdev_ioctl to do the rescan immediately that the array has been assembled. This means we can remove all the ->change infrastructure as it was only used to trigger a partition rescan. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:57 -07:00
NeilBrown	08a02ecd28	md: allow reshape_position for md arrays to be set via sysfs "reshape_position" records how much progress has been made on a "reshape" (adding drives, changing layout or chunksize). When it is set, the number of drives, layout and chunksize can have two possible values, an old an a new. So allow these different values to be visible, and allow both old and new to be set: Set the old ones first, then the reshape_position, then the new values. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:57 -07:00
NeilBrown	42b9bebe3f	md: remove the slash from the name of a kmem_cache used by raid5 SLUB doesn't like slashes as it wants to use the cache name as the name of a directory (or symlink) in sysfs. Signed-off-by: Neil Brown <neilb@suse.de> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:57 -07:00
NeilBrown	4d167f0937	md: stop using csum_partial for checksum calculation in md If CONFIG_NET is not selected, csum_partial is not exported, so md.ko cannot use it. We shouldn't really be using csum_partial anyway as it is an internal-to-networking interface. So replace it with C code to do the same thing. Speed is not crucial here, so something simple and correct is best. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:57 -07:00
NeilBrown	e11e93facc	md: move test for whether level supports bitmap to correct place We need to check for internal-consistency of superblock in load_super. validate_super is for inter-device consistency. With the test in the wrong place, a badly created array will confuse md rather an produce sensible errors. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:57 -07:00
Martin Peschke	c3f94b40e1	md: cleanup: use seq_release_private() where appropriate We can save some lines of code by using seq_release_private(). Signed-off-by: Martin Peschke <mp3@de.ibm.com> Acked-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:57 -07:00
Ahmed S. Darwish	50511da3da	drivers/md.c: Use ARRAY_SIZE macro when appropriate Use ARRAY_SIZE macro already defined in kernel.h Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com> Acked-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:57 -07:00
Jonathan Brassow	ba8b45cea5	dm log: fix resume failed log device This patch removes the possibility of having uninitialized log state if the log device has failed. When a mirror resumes operation, it calls 'resume' on the logging module. If disk based logging is being used, the log device is read to fill in the log state. If the log device has failed, we cannot simply return, because this would leave the in-memory log state uninitialized. Instead, we assume all regions are out-of-sync and reset the log state. Failure to do this could result in the logging code reporting a region as in-sync, even though it isn't; which could result in a corrupted mirror. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:48 -07:00
Jonathan Brassow	b997b82d26	dm raid1: switch rh_in_sync to blocking in do_reads The call to rh_in_sync() in do_reads() should be allowed to block. It is in the mirror worker thread which already permits blocking operations. This will be needed to support clustered mirroring which will perform network operations. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:48 -07:00
Jonathan Brassow	f5353cd7c9	dm raid1: fix to commit pending clear region requests With the code as it is, it is possible for oustanding clear region requests never to get flushed when a mirror is deactivated or suspended. This means there will always be some resync work required when a mirror is activated, even though it may very well be in-sync. Always requesting the flush doesn't hurt us. This is because the log tracks whether any changes occurred and, if not, no flush is performed. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:48 -07:00
Heinz Mauelshagen	26b9f22870	dm: delay target New device-mapper target that can delay I/O (for testing). Reads can be separated from writes, redirected to different underlying devices and delayed by differing amounts of time. Signed-off-by: Heinz Mauelshagen <mauelshagen@redhat.com> Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:47 -07:00
Heinz Mauelshagen	0ba699347e	dm: bio list helpers More bio_list helper functions for new targets (including dm-delay and dm-loop) to manipulate lists of bios. Signed-off-by: Heinz Mauelshagen <hjm@redhat.com> Signed-off-by: Bryn Reeves <breeves@redhat.com> Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:47 -07:00
Milan Broz	bf17ce3a60	dm io: remove old interface Remove old dm-io interface. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:47 -07:00
Milan Broz	88be163abb	dm raid1: update dm io interface This patch ports dm-raid1.c to the new dm-io interface. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:47 -07:00
Heinz Mauelshagen	5d234d1e03	dm log: update dm io interface This patch ports dm-log.c to the new dm-io interface in order to make it scalable to have a large number of persistent dirty logs active in parallel. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Cc: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:47 -07:00
Heinz Mauelshagen	6cca1e7af5	dm exception store: update dm io interface This patch ports dm-exception-store.c to the new, scalable dm_io() interface. It replaces dm_io_get()/dm_io_put() by dm_io_client_create()/dm_io_client_destroy() calls and dm_io_sync_vm() by dm_io() to achive this. Signed-off-by: Heinz Mauelshagen <hjm@redhat.com> Cc: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:47 -07:00
Milan Broz	373a392bd7	dm kcopyd: update dm io interface This patch ports kcopyd.c to the new, scalable dm_io() interface. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Heinz Mauelshagen <hjm@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:47 -07:00
Heinz Mauelshagen	c8b03afe3d	dm io: new interface Add a new API to dm-io.c that uses a private mempool and bio_set for each client. The new functions to use are dm_io_client_create(), dm_io_client_destroy(), dm_io_client_resize() and dm_io(). Signed-off-by: Heinz Mauelshagen <hjm@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: Milan Broz <mbroz@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:47 -07:00
Heinz Mauelshagen	891ce20701	dm io: prepare for new interface Introduce struct dm_io_client to prepare for per-client mempools and bio_sets. Temporary functions bios() and io_pool() choose between the per-client structures and the global ones so the old and new interfaces can co-exist. Make error_bits optional. Signed-off-by: Heinz Mauelshagen <hjm@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: Milan Broz <mbroz@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:47 -07:00
Heinz Mauelshagen	c897feb3dc	dm io: delay dec_count Delay decrementing the 'struct io' reference count until after the bio has been freed so that a bio destructor function may reference it. Required by a later patch. Signed-off-by: Heinz Mauelshagen <hjm@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: Milan Broz <mbroz@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:47 -07:00
Jonathan E Brassow	a8e6afa236	dm raid1: add handle_errors feature flag This patch adds the ability to specify desired features in the mirror constructor/mapping table. The first feature of interest is "handle_errors". Currently, mirroring will ignore any I/O errors from the devices. Subsequent patches will check for this flag and handle the errors. If flag/feature is not present, mirror will do nothing - maintaining backwards compatibility. Signed-off-by: Jonathan E Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:47 -07:00
Jonathan E Brassow	315dcc226f	dm log: report fault status This patch reports the status of the log device so that userspace can detect the error and take appropriate action. Signed-off-by: Jonathan E Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:47 -07:00
Jonathan E Brassow	01d03a660e	dm log: fault detection This patch gives the disk logging code the ability to store the fact that an error occured on the log device. In addition, an event is raised when an error is encountered during I/O to the log device. Signed-off-by: Jonathan E Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:47 -07:00
Mike Anderson	2cd54d9bed	dm: allow offline devices Allow check_device_area to succeed if a device has an i_size of zero. This addresses an issue seen on DASD devices setting up a multipath table for paths in online and offline state. Signed-off-by: Mike Anderson <andmike@us.ibm.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:47 -07:00
Edward Goggin	79eb885c96	dm mpath: log device name Make the mapped device structure accessible to hardware handlers so error messages can include the device name. Signed-off-by: Edward Goggin <egoggin@emc.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:46 -07:00
Ludwig Nussel	46b477306a	dm crypt: add null iv Add a new IV generation method 'null' to read old filesystem images created with SuSE's loop_fish2 module. Signed-off-by: Ludwig Nussel <ludwig.nussel@suse.de> Acked-By: Christophe Saout <christophe@saout.de> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:46 -07:00
Olaf Kirch	f97380bcad	dm crypt: use smaller bvecs in clones Allocate smaller clones With the previous dm-crypt fixes, there is no need for the clone bios to have the same bvec size as the original - we just need to make them big enough for the remaining number of pages. The only requirement is that we clear the "out" index in convert_context, so that crypt_convert starts storing data at the right position within the clone bio. Signed-off-by: Olaf Kirch <olaf.kirch@oracle.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:46 -07:00
Olaf Kirch	2f9941b6c5	dm crypt: fix remove first_clone Get rid of first_clone in dm-crypt This gets rid of first_clone, which is not really needed. Apparently, cloned bios used to share their bvec some time way in the past - this is no longer the case. Contrarily, this even hurts us if we try to create a clone off first_clone after it has completed, and crypt_endio has destroyed its bvec. Signed-off-by: Olaf Kirch <olaf.kirch@oracle.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:46 -07:00
Olaf Kirch	98221eb757	dm crypt: fix avoid cloned bio ref after free Do not access the bio after generic_make_request We should never access a bio after generic_make_request - there's no guarantee it still exists. Signed-off-by: Olaf Kirch <olaf.kirch@oracle.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:46 -07:00
Olaf Kirch	027581f351	dm crypt: fix call to clone_init Call clone_init early We need to call clone_init as early as possible - at least before call bio_put(clone) in any error path. Otherwise, the destructor will try to dereference bi_private, which may still be NULL. Signed-off-by: Olaf Kirch <olaf.kirch@oracle.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:46 -07:00
Milan Broz	9c89f8be1a	dm crypt: disable barriers Disable barriers in dm-crypt because of current workqueue processing can reorder requests. This must be addresed later but for now disabling barriers is needed to prevent data corruption. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:46 -07:00
Holger Smolinski	6ad36fe2b4	dm raid1: one kmirrord per mirror This patch replaces the single instance of kmirrord by one instance per mirror set. This change is required to avoid a deadlock in kmirrord when the persistent dirty log of a mirror itself resides on a mirror. The single instance of kmirrord then issues a sync write to the dirty log in write_bits which gets deferred to kmirrord itself later in the call chain. But kmirrord never does the deferred work because it is still waiting for the sync write_bits. _mirror_sets is removed as it no longer needed, and we always flush the workqueue before destroying it to ensure all work is complete before destroying it. Signed-off-by: Holger Smolinski <smolinski@de.ibm.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:46 -07:00
Mark Fasheh	ef51c97623	Remove do_sync_file_range() Remove do_sync_file_range() and convert callers to just use do_sync_mapping_range(). Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:04 -07:00
Peter Zijlstra	f98393a64c	mm: remove destroy_dirty_buffers from invalidate_bdev() Remove the destroy_dirty_buffers argument from invalidate_bdev(), it hasn't been used in 6 years (so akpm says). find * -name \.[ch] \| xargs grep -l invalidate_bdev \| while read file; do quilt add $file; sed -ie 's/invalidate_bdev($[^,]$,[^)]*)/invalidate_bdev(\1)/g' $file; done Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-07 12:12:55 -07:00
Jens Axboe	5972511b77	[BLOCK] Don't pin lots of memory in mempools Currently we scale the mempool sizes depending on memory installed in the machine, except for the bio pool itself which sits at a fixed 256 entry pre-allocation. There's really no point in "optimizing" this OOM path, we just need enough preallocated to make progress. A single unit is enough, lets scale it down to 2 just to be on the safe side. This patch saves ~150kb of pinned kernel memory on a 32-bit box. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-04-30 09:08:17 +02:00
Neil Brown	505fa2c4a2	[PATCH] md: fix calculation for size of filemap_attr array in md/bitmap If 'num_pages' were ever 1 more than a multiple of 8 (32bit platforms) or of 16 (64 bit platforms). filemap_attr would be allocated one 'unsigned long' shorter than required. We need a round-up in there. Signed-off-by: Neil Brown <neilb@suse.de> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-04-12 15:31:42 -07:00
NeilBrown	5792a2856a	[PATCH] md: avoid a deadlock when removing a device from an md array via sysfs A device can be removed from an md array via e.g. echo remove > /sys/block/md3/md/dev-sde/state This will try to remove the 'dev-sde' subtree which will deadlock since commit `e7b0d26a86` With this patch we run the kobject_del via schedule_work so as to avoid the deadlock. Cc: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-04-04 21:12:47 -07:00
NeilBrown	5e55e2f5fc	[PATCH] md: convert compile time warnings into runtime warnings ... still not sure why we need this .... Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-03-27 09:05:15 -07:00
NeilBrown	041ae52e26	[PATCH] md: clear the congested_fn when stopping a raid5 If this mddev and queue got reused for another array that doesn't register a congested_fn, this function would get called incorretly. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-03-27 09:05:14 -07:00
NeilBrown	3d37890baa	[PATCH] md: allow raid4 arrays to be reshaped All that is missing the the function pointers in raid4_pers. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-03-27 09:05:14 -07:00
Andy Isaacson	bed31ed9e1	[PATCH] fix read past end of array in md/linear.c When iterating through an array, one must be careful to test one's index variable rather than another similarly-named variable. The loop will read off the end of conf->disks[] in the following (pathological) case: % dd bs=1 seek=840716287 if=/dev/zero of=d1 count=1 % for i in 2 3 4; do dd if=/dev/zero of=d$i bs=1k count=$(($i+150)); done % ./vmlinux ubd0=root ubd1=d1 ubd2=d2 ubd3=d3 ubd4=d4 # mdadm -C /dev/md0 --level=linear --raid-devices=4 /dev/ubd[1234] adding some printks, I saw this: [42949374.960000] hash_spacing = 821120 [42949374.960000] cnt = 4 [42949374.960000] min_spacing = 801 [42949374.960000] j=0 size=820928 sz=820928 [42949374.960000] i=0 sz=820928 hash_spacing=820928 [42949374.960000] j=1 size=64 sz=64 [42949374.960000] j=2 size=64 sz=128 [42949374.960000] j=3 size=64 sz=192 [42949374.960000] j=4 size=1515870810 sz=1515871002 Cc: Gautham R Shenoy <ego@in.ibm.com> Acked-by: Neil Brown <neilb@cse.unsw.edu.au> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-03-16 19:25:03 -07:00
NeilBrown	6d3baf2eb8	[PATCH] md: fix for raid6 reshape Recent patch for raid6 reshape had a change missing that showed up in subsequent review. Many places in the raid5 code used "conf->raid_disks-1" to mean "number of data disks". With raid6 that had to be changed to "conf->raid_disk - conf->max_degraded" or similar. One place was missed. This bug means that if a raid6 reshape were aborted in the middle the recorded position would be wrong. On restart it would either fail (as the position wasn't on an appropriate boundary) or would leave a section of the array unreshaped, causing data corruption. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-03-05 07:57:53 -08:00
NeilBrown	f416885ef4	[PATCH] md: add support for reshape of a raid6 i.e. one or more drives can be added and the array will re-stripe while on-line. Most of the interesting work was already done for raid5. This just extends it to raid6. mdadm newer than 2.6 is needed for complete safety, however any version of mdadm which support raid5 reshape will do a good enough job in almost all cases (an 'echo repair > /sys/block/mdX/md/sync_action' is recommended after a reshape that was aborted and had to be restarted with an such a version of mdadm). Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-03-01 14:53:36 -08:00
NeilBrown	b4c4c7b809	[PATCH] md: restart a (raid5) reshape that has been aborted due to a read/write error An error always aborts any resync/recovery/reshape on the understanding that it will immediately be restarted if that still makes sense. However a reshape currently doesn't get restarted. With this patch it does. To avoid restarting when it is not possible to do work, we call into the personality to check that a reshape is ok, and strengthen raid5_check_reshape to fail if there are too many failed devices. We also break some code out into a separate function: remove_and_add_spares as the indent level for that code was getting crazy. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-03-01 14:53:36 -08:00
NeilBrown	d1b5380c7f	[PATCH] md: clean out unplug and other queue function on md shutdown The mddev and queue might be used for another array which does not set these, so they need to be cleared. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-03-01 14:53:36 -08:00
NeilBrown	7dd5e7c3db	[PATCH] md: move warning about creating a raid array on partitions of the one device md tries to warn the user if they e.g. create a raid1 using two partitions of the same device, as this does not provide true redundancy. However it also warns if a raid0 is created like this, and there is nothing wrong with that. At the place where the warning is currently printer, we don't necessarily know what level the array will be, so move the warning from the point where the device is added to the point where the array is started. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-03-01 14:53:36 -08:00
H. Peter Anvin	a723406c4a	[PATCH] md: RAID6: clean up CPUID and FPU enter/exit code - Use kernel_fpu_begin() and kernel_fpu_end() - Use boot_cpu_has() for feature testing even in userspace Signed-off-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-03-01 14:53:36 -08:00
NeilBrown	64a742bc61	[PATCH] md: fix raid10 recovery problem. There are two errors that can lead to recovery problems with raid10 when used in 'far' more (not the default). Due to a '>' instead of '>=' the wrong block is located which would result in garbage being written to some random location, quite possible outside the range of the device, causing the newly reconstructed device to fail. The device size calculation had some rounding errors (it didn't round when it should) and so recovery would go a few blocks too far which would again cause a write to a random block address and probably a device error. The code for working with device sizes was fairly confused and spread out, so this has been tided up a bit. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-03-01 14:53:36 -08:00
Eric W. Biederman	0b4d414714	[PATCH] sysctl: remove insert_at_head from register_sysctl The semantic effect of insert_at_head is that it would allow new registered sysctl entries to override existing sysctl entries of the same name. Which is pain for caching and the proc interface never implemented. I have done an audit and discovered that none of the current users of register_sysctl care as (excpet for directories) they do not register duplicate sysctl entries. So this patch simply removes the support for overriding existing entries in the sys_sysctl interface since no one uses it or cares and it makes future enhancments harder. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Acked-by: Ralf Baechle <ralf@linux-mips.org> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: David Howells <dhowells@redhat.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Andi Kleen <ak@muc.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: Corey Minyard <minyard@acm.org> Cc: Neil Brown <neilb@suse.de> Cc: "John W. Linville" <linville@tuxdriver.com> Cc: James Bottomley <James.Bottomley@steeleye.com> Cc: Jan Kara <jack@ucw.cz> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: Mark Fasheh <mark.fasheh@oracle.com> Cc: David Chinner <dgc@sgi.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Patrick McHardy <kaber@trash.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-02-14 08:09:59 -08:00
Eric W. Biederman	ff1d28efc5	[PATCH] sysctl: md: remove unnecessary insert_at_head flag The sysctls used by the md driver are have unique binary numbers so remove the insert_at_head flag as it serves no useful purpose. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Neil Brown <neilb@cse.unsw.edu.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-02-14 08:09:55 -08:00
Arjan van de Ven	fa027c2a0a	[PATCH] mark struct file_operations const 4 Many struct file_operations in the kernel can be "const". Marking them const moves these to the .rodata section, which avoids false sharing with potential dirty data. In addition it'll catch accidental writes at compile time to these shared resources. [akpm@sdl.org: dvb fix] Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-02-12 09:48:45 -08:00
Andrew Morton	fc0ecff698	[PATCH] remove invalidate_inode_pages() Convert all calls to invalidate_inode_pages() into open-coded calls to invalidate_mapping_pages(). Leave the invalidate_inode_pages() wrapper in place for now, marked as deprecated. Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-02-11 10:51:31 -08:00
Neil Brown	da6e1a32fb	[PATCH] md: avoid possible BUG_ON in md bitmap handling md/bitmap tracks how many active write requests are pending on blocks associated with each bit in the bitmap, so that it knows when it can clear the bit (when count hits zero). The counter has 14 bits of space, so if there are ever more than 16383, we cannot cope. Currently the code just calles BUG_ON as "all" drivers have request queue limits much smaller than this. However is seems that some don't. Apparently some multipath configurations can allow more than 16383 concurrent write requests. So, in this unlikely situation, instead of calling BUG_ON we now wait for the count to drop down a bit. This requires a new wait_queue_head, some waiting code, and a wakeup call. Tested by limiting the counter to 20 instead of 16383 (writes go a lot slower in that case...). Signed-off-by: Neil Brown <neilb@suse.de> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-02-09 09:25:47 -08:00
Neil Brown	387bb17374	[PATCH] md: fix various bugs with aligned reads in RAID5 It is possible for raid5 to be sent a bio that is too big for an underlying device. So if it is a READ that we pass stright down to a device, it will fail and confuse RAID5. So in 'chunk_aligned_read' we check that the bio fits within the parameters for the target device and if it doesn't fit, fall back on reading through the stripe cache and making lots of one-page requests. Note that this is the earliest time we can check against the device because earlier we don't have a lock on the device, so it could change underneath us. Also, the code for handling a retry through the cache when a read fails has not been tested and was badly broken. This patch fixes that code. Signed-off-by: Neil Brown <neilb@suse.de> Cc: "Kai" <epimetreus@fastmail.fm> Cc: <stable@suse.de> Cc: <org@suse.de> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-02-09 09:25:46 -08:00
NeilBrown	c20086de93	[PATCH] md: remove unnecessary printk when raid5 gets an unaligned read. raid5_mergeable_bvec tries to ensure that raid5 never sees a read request that does not fit within just one chunk. However as we must always accept a single-page read, that is not always possible. So when "in_chunk_boundary" fails, it might be unusual, but it is not a problem and printing a message every time is a bad idea. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-01-26 13:51:00 -08:00
NeilBrown	2a2275d630	[PATCH] md: fix potential memalloc deadlock in md If a GFP_KERNEL allocation is attempted in md while the mddev_lock is held, it is possible for a deadlock to eventuate. This happens if the array was marked 'clean', and the memalloc triggers a write-out to the md device. For the writeout to succeed, the array must be marked 'dirty', and that requires getting the mddev_lock. So, before attempting a GFP_KERNEL allocation while holding the lock, make sure the array is marked 'dirty' (unless it is currently read-only). Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-01-26 13:51:00 -08:00
Jun'ichi Nomura	bfa152fa5e	[PATCH] dm-multipath: fix stall on noflush suspend/resume Allow noflush suspend/resume of device-mapper device only for the case where the device size is unchanged. Otherwise, dm-multipath devices can stall when resumed if noflush was used when suspending them, all paths have failed and queue_if_no_path is set. Explanation: 1. Something is doing fsync() on the block dev, holding inode->i_sem 2. The fsync write is blocked by all-paths-down and queue_if_no_path 3. Someone requests to suspend the dm device with noflush. Pending writes are left in queue. 4. In the middle of dm_resume(), __bind() tries to get inode->i_sem to do __set_size() and waits forever. 'noflush suspend' is a new device-mapper feature introduced in early 2.6.20. So I hope the fix being included before 2.6.20 is released. Example of reproducer: 1. Create a multipath device by dmsetup 2. Fail all paths during mkfs 3. Do dmsetup suspend --noflush and load new map with healthy paths 4. Do dmsetup resume Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Acked-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-01-26 13:51:00 -08:00
NeilBrown	f49d5e62d9	[PATCH] md: avoid reading past the end of a bitmap file In most cases we check the size of the bitmap file before reading data from it. However when reading the superblock, we always read the first PAGE_SIZE bytes, which might not always be appropriate. So limit that read to the size of the file if appropriate. Also, we get the count of available bytes wrong in one place, so that too can read past the end of the file. Cc: "yang yin" <yinyang801120@gmail.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-01-26 13:50:59 -08:00
NeilBrown	1031be7a5f	[PATCH] md: make sure the events count in an md array never returns to zero Now that we sometimes step the array events count backwards (when transitioning dirty->clean where nothing else interesting has happened - so that we don't need to write to spares all the time), it is possible for the event count to return to zero, which is potentially confusing and triggers and MD_BUG. We could possibly remove the MD_BUG, but is just as easy, and probably safer, to make sure we never return to zero. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-01-26 13:50:59 -08:00
NeilBrown	3eda22d19b	[PATCH] md: make 'repair' actually work for raid1 When 'repair' finds a block that is different one the various parts of the mirror. it is meant to write a chosen good version to the others. However it currently writes out the original data to each. The memcpy to make all the data the same is missing. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-01-26 13:50:59 -08:00
Lars Ellenberg	e3881a6816	[PATCH] md: pass down BIO_RW_SYNC in raid{1,10} md raidX make_request functions strip off the BIO_RW_SYNC flag, thus introducing additional latency. Fixing this in raid1 and raid10 seems to be straightforward enough. For our particular usage case in DRBD, passing this flag improved some initialization time from ~5 minutes to ~5 seconds. Acked-by: NeilBrown <neilb@suse.de> Signed-off-by: Lars Ellenberg <lars@linbit.com> Acked-by: Jens Axboe <jens.axboe@oracle.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2007-01-11 18:18:21 -08:00
NeilBrown	3f9d7b0d81	[PATCH] md: fix a few problems with the interface (sysfs and ioctl) to md While developing more functionality in mdadm I found some bugs in md... - When we remove a device from an inactive array (write 'remove' to the 'state' sysfs file - see 'state_store') would should not update the superblock information - as we may not have read and processed it all properly yet. - initialise all raid_disk entries to '-1' else the 'slot sysfs file will claim '0' for all devices in an array before the array is started. - all '\n' not to be present at the end of words written to sysfs files - when we use SET_ARRAY_INFO to set the md metadata version, set the flag to say that there is persistant metadata. - allow GET_BITMAP_FILE to be called on an array that hasn't been started yet. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-12-22 08:55:51 -08:00
NeilBrown	802ba064c4	[PATCH] md: Don't assume that READ==0 and WRITE==1 - use the names explicitly Thanks Jens for alerting me to this. Cc: Jens Axboe <jens.axboe@oracle.com> Cc: <raziebe@gmail.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-12-13 09:05:48 -08:00
Herbert Xu	3263263f70	[CRYPTO] dm-crypt: Select CRYPTO_CBC As CBC is the default chaining method for cryptoloop, we should select it from cryptoloop to ease the transition. Spotted by Rene Herman. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-12-10 10:18:57 -08:00
NeilBrown	1757128438	[PATCH] md: assorted md and raid1 one-liners Fix few bugs that meant that: - superblocks weren't alway written at exactly the right time (this could show up if the array was not written to - writting to the array causes lots of superblock updates and so hides these errors). - restarting device recovery after a clean shutdown (version-1 metadata only) didn't work as intended (or at all). 1/ Ensure superblock is updated when a new device is added. 2/ Remove an inappropriate test on MD_RECOVERY_SYNC in md_do_sync. The body of this if takes one of two branches depending on whether MD_RECOVERY_SYNC is set, so testing it in the clause of the if is wrong. 3/ Flag superblock for updating after a resync/recovery finishes. 4/ If we find the neeed to restart a recovery in the middle (version-1 metadata only) make sure a full recovery (not just as guided by bitmaps) does get done. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-12-10 09:57:21 -08:00
NeilBrown	c2b00852fb	[PATCH] md: return a non-zero error to bi_end_io as appropriate in raid5 Currently raid5 depends on clearing the BIO_UPTODATE flag to signal an error to higher levels. While this should be sufficient, it is safer to explicitly set the error code as well - less room for confusion. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-12-10 09:57:21 -08:00
NeilBrown	b8c6b64556	[PATCH] md: remove some old ifdefed-out code from raid5.c There are some vestiges of old code that was used for bypassing the stripe cache on reads in raid5.c. This was never updated after the change from buffer_heads to bios, but was left as a reminder. That functionality has nowe been implemented in a completely different way, so the old code can go. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-12-10 09:57:21 -08:00
Jeff Garzik	fdee8ae449	[PATCH] MD: conditionalize some code The autorun code is only used if this module is built into the static kernel image. Adjust #ifdefs accordingly. Signed-off-by: Jeff Garzik <jeff@garzik.org> Acked-by: NeilBrown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-12-10 09:57:21 -08:00
NeilBrown	b875e531fc	[PATCH] md: fix innocuous bug in raid6 stripe_to_pdidx stripe_to_pdidx finds the index of the parity disk for a given stripe. It assumes raid5 in that it uses "disks-1" to determine the number of data disks. This is incorrect for raid6 but fortunately the two usages cancel each other out. The only way that 'data_disks' affects the calculation of pd_idx in raid5_compute_sector is when it is divided into the sector number. But as that sector number is calculated by multiplying in the wrong value of 'data_disks' the division produces the right value. So it is innocuous but needs to be fixed. Also change the calculation of raid_disks in compute_blocknr to make it more obviously correct (it seems at first to always use disks-1 too). Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-12-10 09:57:21 -08:00
Raz Ben-Jehuda(caro)	5248861511	[PATCH] md: enable bypassing cache for reads Call the chunk_aligned_read where appropriate. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-12-10 09:57:20 -08:00
Raz Ben-Jehuda(caro)	46031f9a38	[PATCH] md: allow reads that have bypassed the cache to be retried on failure If a bypass-the-cache read fails, we simply try again through the cache. If it fails again it will trigger normal recovery precedures. update 1: From: NeilBrown <neilb@suse.de> 1/ chunk_aligned_read and retry_aligned_read assume that data_disks == raid_disks - 1 which is not true for raid6. So when an aligned read request bypasses the cache, we can get the wrong data. 2/ The cloned bio is being used-after-free in raid5_align_endio (to test BIO_UPTODATE). 3/ We forgot to add rdev->data_offset when submitting a bio for aligned-read 4/ clone_bio calls blk_recount_segments and then we change bi_bdev, so we need to invalidate the segment counts. 5/ We don't de-reference the rdev when the read completes. This means we need to record the rdev to so it is still available in the end_io routine. Fortunately bi_next in the original bio is unused at this point so we can stuff it in there. 6/ We leak a cloned bio if the target rdev is not usable. From: NeilBrown <neilb@suse.de> update 2: 1/ When aligned requests fail (read error) they need to be retried via the normal method (stripe cache). As we cannot be sure that we can process a single read in one go (we may not be able to allocate all the stripes needed) we store a bio-being-retried and a list of bioes-that-still-need-to-be-retried. When find a bio that needs to be retried, we should add it to the list, not to single-bio... 2/ We were never incrementing 'scnt' when resubmitting failed aligned requests. [akpm@osdl.org: build fix] Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-12-10 09:57:20 -08:00
Raz Ben-Jehuda(caro)	f679623f50	[PATCH] md: handle bypassing the read cache (assuming nothing fails) Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-12-10 09:57:20 -08:00
Raz Ben-Jehuda(caro)	23032a0eb9	[PATCH] md: define raid5_mergeable_bvec This will encourage read request to be on only one device, so we will often be able to bypass the cache for read requests. Signed-off-by: Neil Brown <neilb@suse.de> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-12-10 09:57:20 -08:00
NeilBrown	0d4ca600fc	[PATCH] md: tidy up device-change notification when an md array is stopped An md array can be stopped leaving all the setting still in place, or it can torn down and destroyed. set_capacity and other change notifications only happen in the latter case, but should happen in both. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-12-10 09:57:20 -08:00
Adrian Bunk	c642f9e03b	[PATCH] make drivers/md/dm-snap.c:ksnapd static Signed-off-by: Adrian Bunk <bunk@stusta.de> Acked-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-12-08 08:29:09 -08:00

1 2 3 4 5 ...

656 Commits