kernel_optimize_test/block
Divyesh Shah d585d0b9d7 block: Fix the starving writes bug in the anticipatory IO scheduler
AS scheduler alternates between issuing read and write batches. It does
the batch switch only after all requests from the previous batch are
completed.

When switching to a write batch, if there is an on-going read request,
it waits for its completion and indicates its intention of switching by
setting ad->changed_batch and the new direction but does not update the
batch_expire_time for the new write batch which it does in the case of
no previous pending requests.
On completion of the read request, it sees that we were waiting for the
switch and schedules work for kblockd right away and resets the
ad->changed_data flag.
Now when kblockd enters dispatch_request where it is expected to pick
up a write request, it in turn ends the write batch because the
batch_expire_timer was not updated and shows the expire timestamp for
the previous batch.

This results in the write starvation for all the cases where there is
the intention for switching to a write batch, but there is a previous
in-flight read request and the batch gets reverted to a read_batch
right away.

This also holds true in the reverse case (switching from a write batch
to a read batch with an in-flight write request).

I've checked that this bug exists on 2.6.11, 2.6.18, 2.6.24 and
linux-2.6-block git HEAD. I've tested the fix on x86 platforms with
SCSI drives where the driver asks for the next request while a current
request is in-flight.

This patch is based off linux-2.6-block git HEAD.

Bug reproduction:
A simple scenario which reproduces this bug is:
- dd if=/dev/hda3 of=/dev/null &
- lilo
   The lilo takes forever to complete.

This can also be reproduced fairly easily with the earlier dd and
another test
program doing msync().

The example test program below should print out a message after every
iteration
but it simply hangs forever. With this bugfix it makes forward progress.

====
Example test program using msync() (thanks to suleiman AT google DOT
com)

inline uint64_t
rdtsc(void)
{
         int64_t tsc;

         __asm __volatile("rdtsc" : "=A" (tsc));
         return (tsc);
}

int
main(int argc, char **argv)
{
         struct stat st;
         uint64_t e, s, t;
         char *p, q;
         long i;
         int fd;

         if (argc < 2) {
                 printf("Usage: %s <file>\n", argv[0]);
                 return (1);
         }

         if ((fd = open(argv[1], O_RDWR | O_NOATIME)) < 0)
                 err(1, "open");

         if (fstat(fd, &st) < 0)
                 err(1, "fstat");

         p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
MAP_SHARED, fd, 0);

         t = 0;
         for (i = 0; i < 1000; i++) {
                 *p = 0;
                 msync(p, 4096, MS_SYNC);
                 s = rdtsc();
                *p = 0;
                 __asm __volatile(""::: "memory");
                 e = rdtsc();
                 if (argc > 2)
                         printf("%d: %lld cycles %jd %jd\n",
                                i, e - s, (intmax_t)s, (intmax_t)e);
                 t += e - s;
         }
         printf("average time: %lld cycles\n", t / 1000);
         return (0);
}

Cc: <stable@kernel.org>
Acked-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-07-01 09:06:42 +02:00
..
as-iosched.c block: Fix the starving writes bug in the anticipatory IO scheduler 2008-07-01 09:06:42 +02:00
blk-barrier.c block: remove remaining __FUNCTION__ occurrences 2008-05-01 08:04:02 -07:00
blk-core.c block: Move the second call to get_request to the end of the loop 2008-05-28 14:49:27 +02:00
blk-exec.c block: make core bits checkpatch compliant 2008-02-01 09:26:33 +01:00
blk-ioc.c cfq-iosched: fix RCU race in the cfq io_context destructor handling 2008-05-07 09:28:57 +02:00
blk-map.c block: add dma alignment and padding support to blk_rq_map_kern 2008-04-29 09:50:34 +02:00
blk-merge.c block: get rid of likely/unlikely predictions in merge logic 2008-05-07 09:33:55 +02:00
blk-settings.c Remove blkdev warning triggered by using md 2008-05-14 19:11:15 -07:00
blk-sysfs.c block: sysfs store function needs to grab queue_lock and use queue_flag_*() 2008-05-07 09:09:39 +02:00
blk-tag.c block: adjust tagging function queue bit locking 2008-05-07 09:27:43 +02:00
blk.h block: rename and export rq_init() 2008-04-29 14:48:55 +02:00
blktrace.c block: disable IRQs until data is written to relay channel 2008-06-12 11:20:57 -07:00
bsg.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6 2008-05-02 13:52:35 -07:00
cfq-iosched.c cfq-iosched: fix RCU problem in cfq_cic_lookup() 2008-05-28 14:49:28 +02:00
compat_ioctl.c Fix misuses of bdevname() 2008-05-13 08:02:26 -07:00
deadline-iosched.c block: let elv_register() return void 2007-12-18 08:29:28 +01:00
elevator.c Added in elevator switch message to blktrace stream 2008-05-28 14:49:27 +02:00
genhd.c Fix invalid access errors in blk_lookup_devt 2008-06-09 10:06:24 -07:00
ioctl.c compat_ioctl: move common block ioctls to compat_blkdev_ioctl 2007-10-10 09:26:00 +02:00
Kconfig Kconfig: clean up block/Kconfig help descriptions 2008-04-21 09:51:04 +02:00
Kconfig.iosched update I/O sched Kconfig help texts - CFQ is now default, not AS. 2007-02-17 20:08:22 +01:00
Makefile block: ll_rw_blk.c split, add blk-merge.c 2008-01-29 21:55:12 +01:00
noop-iosched.c block: let elv_register() return void 2007-12-18 08:29:28 +01:00
scsi_ioctl.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6 2008-05-02 13:52:35 -07:00