kernel_optimize_test

Author	SHA1	Message	Date
Ming Lei	5a8ce240d4	block: cleanup bio_add_pc_page REQ_PC is out of date, so replace it with passthrough IO. Also remove the local variable of 'prev' since we can reuse the top local variable of 'bvec'. No function change. Cc: Omar Sandoval <osandov@fb.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-04-01 12:11:25 -06:00
Ming Lei	fd7d8d4232	block: don't merge adjacent bvecs to one segment in bio blk_queue_split For normal filesystem IO, each page is added via blk_add_page(), in which bvec(page) merge has been handled already, and basically not possible to merge two adjacent bvecs in one bio. So not try to merge two adjacent bvecs in blk_queue_split(). Cc: Omar Sandoval <osandov@fb.com> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-04-01 12:11:21 -06:00
Ming Lei	db5ebd6edd	block: avoid to break XEN by multi-page bvec XEN has special page merge requirement, see xen_biovec_phys_mergeable(). We can't merge pages into one bvec simply for XEN. So move XEN's specific check on page merge into __bio_try_merge_page(), then abvoid to break XEN by multi-page bvec. Cc: ris Ostrovsky <boris.ostrovsky@oracle.com> Cc: xen-devel@lists.xenproject.org Cc: Omar Sandoval <osandov@fb.com> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-04-01 12:11:17 -06:00
Ming Lei	0383ad4374	block: pass page to xen_biovec_phys_mergeable xen_biovec_phys_mergeable() only needs .bv_page of the 2nd bio bvec for checking if the two bvecs can be merged, so pass page to xen_biovec_phys_mergeable() directly. No function change. Cc: ris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Juergen Gross <jgross@suse.com> Cc: xen-devel@lists.xenproject.org Cc: Omar Sandoval <osandov@fb.com> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-04-01 12:11:13 -06:00
Francesco Pollicino	fffca087d5	block, bfq: save & resume weight on a queue merge/split bfq saves the state of a queue each time a merge occurs, to be able to resume such a state when the queue is associated again with its original process, on a split. Unfortunately bfq does not save & restore also the weight of the queue. If the weight is not correctly resumed when the queue is recycled, then the weight of the recycled queue could differ from the weight of the original queue. This commit adds the missing save & resume of the weight. Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Francesco Pollicino <fra.fra.800@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-04-01 08:15:40 -06:00
Francesco Pollicino	1e66413c4f	block, bfq: print SHARED instead of pid for shared queues in logs The function "bfq_log_bfqq" prints the pid of the process associated with the queue passed as input. Unfortunately, if the queue is shared, then more than one process is associated with the queue. The pid that gets printed in this case is the pid of one of the associated processes. Which process gets printed depends on the exact sequence of merge events the queue underwent. So printing such a pid is rather useless and above all is often rather confusing because it reports a random pid between those of the associated processes. This commit addresses this issue by printing SHARED instead of a pid if the queue is shared. Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Francesco Pollicino <fra.fra.800@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-04-01 08:15:40 -06:00
Paolo Valente	84a746891e	block, bfq: always protect newly-created queues from existing active queues If many bfq_queues belonging to the same group happen to be created shortly after each other, then the processes associated with these queues have typically a common goal. In particular, bursts of queue creations are usually caused by services or applications that spawn many parallel threads/processes. Examples are systemd during boot, or git grep. If there are no other active queues, then, to help these processes get their job done as soon as possible, the best thing to do is to reach a high throughput. To this goal, it is usually better to not grant either weight-raising or device idling to the queues associated with these processes. And this is exactly what BFQ currently does. There is however a drawback: if, in contrast, some other queues are already active, then the newly created queues must be protected from the I/O flowing through the already existing queues. In this case, the best thing to do is the opposite as in the other case: it is much better to grant weight-raising and device idling to the newly-created queues, if they deserve it. This commit addresses this issue by doing so if there are already other active queues. This change also helps eliminating false positives, which occur when the newly-created queues do not belong to an actual large burst of creations, but some background task (e.g., a service) happens to trigger the creation of new queues in the middle, i.e., very close to when the victim queues are created. These false positive may cause total loss of control on process latencies. Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-04-01 08:15:40 -06:00
Paolo Valente	7074f076ff	block, bfq: do not tag totally seeky queues as soft rt Sync random I/O is likely to be confused with soft real-time I/O, because it is characterized by limited throughput and apparently isochronous arrival pattern. To avoid false positives, this commits prevents bfq_queues containing only random (seeky) I/O from being tagged as soft real-time. Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-04-01 08:15:40 -06:00
Paolo Valente	8cacc5ab3e	block, bfq: do not merge queues on flash storage with queueing To boost throughput with a set of processes doing interleaved I/O (i.e., a set of processes whose individual I/O is random, but whose merged cumulative I/O is sequential), BFQ merges the queues associated with these processes, i.e., redirects the I/O of these processes into a common, shared queue. In the shared queue, I/O requests are ordered by their position on the medium, thus sequential I/O gets dispatched to the device when the shared queue is served. Queue merging costs execution time, because, to detect which queues to merge, BFQ must maintain a list of the head I/O requests of active queues, ordered by request positions. Measurements showed that this costs about 10% of BFQ's total per-request processing time. Request processing time becomes more and more critical as the speed of the underlying storage device grows. Yet, fortunately, queue merging is basically useless on the very devices that are so fast to make request processing time critical. To reach a high throughput, these devices must have many requests queued at the same time. But, in this configuration, the internal scheduling algorithms of these devices do also the job of queue merging: they reorder requests so as to obtain as much as possible a sequential I/O pattern. As a consequence, with processes doing interleaved I/O, the throughput reached by one such device is likely to be the same, with and without queue merging. In view of this fact, this commit disables queue merging, and all related housekeeping, for non-rotational devices with internal queueing. The total, single-lock-protected, per-request processing time of BFQ drops to, e.g., 1.9 us on an Intel Core i7-2760QM@2.40GHz (time measured with simple code instrumentation, and using the throughput-sync.sh script of the S suite [1], in performance-profiling mode). To put this result into context, the total, single-lock-protected, per-request execution time of the lightest I/O scheduler available in blk-mq, mq-deadline, is 0.7 us (mq-deadline is ~800 LOC, against ~10500 LOC for BFQ). Disabling merging provides a further, remarkable benefit in terms of throughput. Merging tends to make many workloads artificially more uneven, mainly because of shared queues remaining non empty for incomparably more time than normal queues. So, if, e.g., one of the queues in a set of merged queues has a higher weight than a normal queue, then the shared queue may inherit such a high weight and, by staying almost always active, may force BFQ to perform I/O plugging most of the time. This evidently makes it harder for BFQ to let the device reach a high throughput. As a practical example of this problem, and of the benefits of this commit, we measured again the throughput in the nasty scenario considered in previous commit messages: dbench test (in the Phoronix suite), with 6 clients, on a filesystem with journaling, and with the journaling daemon enjoying a higher weight than normal processes. With this commit, the throughput grows from ~150 MB/s to ~200 MB/s on a PLEXTOR PX-256M5 SSD. This is the same peak throughput reached by any of the other I/O schedulers. As such, this is also likely to be the maximum possible throughput reachable with this workload on this device, because I/O is mostly random, and the other schedulers basically just pass I/O requests to the drive as fast as possible. [1] https://github.com/Algodev-github/S Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Francesco Pollicino <fra.fra.800@gmail.com> Signed-off-by: Alessio Masola <alessio.masola@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-04-01 08:15:40 -06:00
Paolo Valente	2341d662e9	block, bfq: tune service injection basing on request service times The processes associated with a bfq_queue, say Q, may happen to generate their cumulative I/O at a lower rate than the rate at which the device could serve the same I/O. This is rather probable, e.g., if only one process is associated with Q and the device is an SSD. It results in Q becoming often empty while in service. If BFQ is not allowed to switch to another queue when Q becomes empty, then, during the service of Q, there will be frequent "service holes", i.e., time intervals during which Q gets empty and the device can only consume the I/O already queued in its hardware queues. This easily causes considerable losses of throughput. To counter this problem, BFQ implements a request injection mechanism, which tries to fill the above service holes with I/O requests taken from other bfq_queues. The hard part in this mechanism is finding the right amount of I/O to inject, so as to both boost throughput and not break Q's bandwidth and latency guarantees. To this goal, the current version of this mechanism measures the bandwidth enjoyed by Q while it is being served, and tries to inject the maximum possible amount of extra service that does not cause Q's bandwidth to decrease too much. This solution has an important shortcoming. For bandwidth measurements to be stable and reliable, Q must remain in service for a much longer time than that needed to serve a single I/O request. Unfortunately, this does not hold with many workloads. This commit addresses this issue by changing the way the amount of injection allowed is dynamically computed. It tunes injection as a function of the service times of single I/O requests of Q, instead of Q's bandwidth. Single-request service times are evidently meaningful even if Q gets very few I/O requests completed while it is in service. As a testbed for this new solution, we measured the throughput reached by BFQ for one of the nastiest workloads and configurations for this scheduler: the workload generated by the dbench test (in the Phoronix suite), with 6 clients, on a filesystem with journaling, and with the journaling daemon enjoying a higher weight than normal processes. With this commit, the throughput grows from ~100 MB/s to ~150 MB/s on a PLEXTOR PX-256M5. Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Francesco Pollicino <fra.fra.800@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-04-01 08:15:39 -06:00
Paolo Valente	fb53ac6cd0	block, bfq: do not idle for lowest-weight queues In most cases, it is detrimental for throughput to plug I/O dispatch when the in-service bfq_queue becomes temporarily empty (plugging is performed to wait for the possible arrival, soon, of new I/O from the in-service queue). There is however a case where plugging is needed for service guarantees. If a bfq_queue, say Q, has a higher weight than some other active bfq_queue, and is sync, i.e., contains sync I/O, then, to guarantee that Q does receive a higher share of the throughput than other lower-weight queues, it is necessary to plug I/O dispatch when Q remains temporarily empty while being served. For this reason, BFQ performs I/O plugging when some active bfq_queue has a higher weight than some other active bfq_queue. But this is unnecessarily overkill. In fact, if the in-service bfq_queue actually has a weight lower than or equal to the other queues, then the queue must not be guaranteed a higher share of the throughput than the other queues. So, not plugging I/O cannot cause any harm to the queue. And can boost throughput. Taking advantage of this fact, this commit does not plug I/O for sync bfq_queues with a weight lower than or equal to the weights of the other queues. Here is an example of the resulting throughput boost with the dbench workload, which is particularly nasty for BFQ. With the dbench test in the Phoronix suite, BFQ reaches its lowest total throughput with 6 clients on a filesystem with journaling, in case the journaling daemon has a higher weight than normal processes. Before this commit, the total throughput was ~80 MB/sec on a PLEXTOR PX-256M5, after this commit it is ~100 MB/sec. Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-04-01 08:15:39 -06:00
Paolo Valente	778c02a236	block, bfq: increase idling for weight-raised queues If a sync bfq_queue has a higher weight than some other queue, and remains temporarily empty while in service, then, to preserve the bandwidth share of the queue, it is necessary to plug I/O dispatching until a new request arrives for the queue. In addition, a timeout needs to be set, to avoid waiting for ever if the process associated with the queue has actually finished its I/O. Even with the above timeout, the device is however not fed with new I/O for a while, if the process has finished its I/O. If this happens often, then throughput drops and latencies grow. For this reason, the timeout is kept rather low: 8 ms is the current default. Unfortunately, such a low value may cause, on the opposite end, a violation of bandwidth guarantees for a process that happens to issue new I/O too late. The higher the system load, the higher the probability that this happens to some process. This is a problem in scenarios where service guarantees matter more than throughput. One important case are weight-raised queues, which need to be granted a very high fraction of the bandwidth. To address this issue, this commit lower-bounds the plugging timeout for weight-raised queues to 20 ms. This simple change provides relevant benefits. For example, on a PLEXTOR PX-256M5S, with which gnome-terminal starts in 0.6 seconds if there is no other I/O in progress, the same applications starts in - 0.8 seconds, instead of 1.2 seconds, if ten files are being read sequentially in parallel - 1 second, instead of 2 seconds, if, in parallel, five files are being read sequentially, and five more files are being written sequentially Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-04-01 08:14:47 -06:00
Konstantin Khlebnikov	42b1bd33dc	block/bfq: fix ifdef for CONFIG_BFQ_GROUP_IOSCHED=y Replace BFQ_GROUP_IOSCHED_ENABLED with CONFIG_BFQ_GROUP_IOSCHED. Code under these ifdefs never worked, something might be broken. Fixes: `0471559c2f` ("block, bfq: add/remove entity weights correctly") Fixes: `73d5811849` ("block, bfq: consider also ioprio classes in symmetry detection") Reviewed-by: Holger Hoffstätte <holger@applied-asynchrony.com> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-04-01 06:56:15 -06:00
Jens Axboe	e861857545	blk-mq: fix sbitmap ws_active for shared tags We now wrap sbitmap waitqueues in an active counter, so we can avoid iterating wakeups unless we have waiters there. This works as long as everyone that's manipulating the waitqueues use the proper helpers. For the tag wait case for shared tags, however, we add ourselves to the waitqueue without incrementing/decrementing the ->ws_active count. This means that wakeups can take a long time to happen. Fix this by manually doing the inc/dec as needed for the wait queue handling. Reported-by: Michael Leun <kbug@newton.leun.net> Tested-by: Michael Leun <kbug@newton.leun.net> Cc: stable@vger.kernel.org Reviewed-by: Omar Sandoval <osandov@fb.com> Fixes: `5d2ee7122c` ("sbitmap: optimize wakeup check") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-03-25 13:05:47 -06:00
Yufen Yu	85fae294e1	blk-mq: update comment for blk_mq_hctx_has_pending() For now, blk_mq_hctx_has_pending() checks any of ctx, hctx->dispatch or io scheduler have pending work. So, update the comment accordingly. Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-03-24 10:26:17 -06:00
Yufen Yu	13f0638152	blk-mq: use blk_mq_put_driver_tag() to put tag Expect arguments, blk_mq_put_driver_tag_hctx() and blk_mq_put_driver_tag() is same. We can just use argument 'request' to put tag by blk_mq_put_driver_tag(). Then we can remove the unused blk_mq_put_driver_tag_hctx(). Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-03-24 10:26:16 -06:00
Linus Torvalds	1bdd3dbfff	io_uring-20190323 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlyWVysQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpn5lD/0bEg76kbuwOUy5+FDqOpF0MNOU7xZcYcsI YkkaKkUi2YQL6NJlkU7AhtPwep+J2sgSnDW9Ho9WIXbsnsO6UF79uIdcix6zJGIl WnZZ3BLgWeciCfrzFpn3FFZnm/AKJSPWPmllUFvmUYT9GdRgN4ZnHBsS1HTlJ1m5 5HhwLtaYOsZ75NxWBRqWspmtFe+XZ/CrjGgmvIF8FjSuIP2q0RrOmCF1XAA82umd ehiU1ZtQ+v4FHxmJWjzMWhrCj2y0gmPb+DotIqefFjVnd/G+LrFGMD1fsLoQVFDy L5VzSOGj1E4KXfDpIeGnz/08dpqXmOkvsSaNnv1U7vA7SCkbodR/BA1EKJrvk5v7 MGkkcQDaU/WzC41RCyVQNWAWjzNLKbruXQ+1HqCx5eh7uthvMQMXDvGf4Jgeq+/E vGzrEKZ6qI78Vy0mXSy4dfFbFaNTjCkE2jbIG7BQx5zdtnS9/VPXNkpZxPrGLM+P /fTsLXghU9lKn6WHVtLpQsfJr0OMjyC9JA23pTX2G9MtBhDcyuRs+uCeQgG6cIkl F15LGuOY7YGYxRsegdinFaoldnHersUDx19c+uFdrB0k0A/A6KeGHuZx7aJPkW1L M89FkyJr2ZBgc26PvKz6j1Hwl2MKJC5h8TpPES/QnulWh4FbqqH3a501Qa1AQuxC 1me95iy74w== =l4lx -----END PGP SIGNATURE----- Merge tag 'io_uring-20190323' of git://git.kernel.dk/linux-block Pull io_uring fixes and improvements from Jens Axboe: "The first five in this series are heavily inspired by the work Al did on the aio side to fix the races there. The last two re-introduce a feature that was in io_uring before it got merged, but which I pulled since we didn't have a good way to have BVEC iters that already have a stable reference. These aren't necessarily related to block, it's just how io_uring pins fixed buffers" * tag 'io_uring-20190323' of git://git.kernel.dk/linux-block: block: add BIO_NO_PAGE_REF flag iov_iter: add ITER_BVEC_FLAG_NO_REF flag io_uring: mark me as the maintainer io_uring: retry bulk slab allocs as single allocs io_uring: fix poll races io_uring: fix fget/fput handling io_uring: add prepped flag io_uring: make io_read/write return an integer io_uring: use regular request ref counts	2019-03-23 10:25:12 -07:00
Bart Van Assche	537d71b3f7	blkcg: Fix kernel-doc warnings Avoid that the following warnings are reported when building with W=1: block/blk-cgroup.c:1755: warning: Function parameter or member 'q' not described in 'blkcg_schedule_throttle' block/blk-cgroup.c:1755: warning: Function parameter or member 'use_memdelay' not described in 'blkcg_schedule_throttle' block/blk-cgroup.c:1779: warning: Function parameter or member 'blkg' not described in 'blkcg_add_delay' block/blk-cgroup.c:1779: warning: Function parameter or member 'now' not described in 'blkcg_add_delay' block/blk-cgroup.c:1779: warning: Function parameter or member 'delta' not described in 'blkcg_add_delay' Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-03-20 14:39:09 -06:00
Bart Van Assche	373e915cd8	blk-iolatency: #include "blk.h" This patch avoids that the following warning is reported when building with W=1: block/blk-iolatency.c:734:5: warning: no previous prototype for 'blk_iolatency_init' [-Wmissing-prototypes] Cc: Josef Bacik <jbacik@fb.com> Fixes: `d706751215` ("block: introduce blk-iolatency io controller") # v4.19 Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-03-20 14:19:38 -06:00
Bart Van Assche	e6c987120e	block: Unexport blk_mq_add_to_requeue_list() This function is not used outside the block layer core. Hence unexport it. Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-03-20 14:19:36 -06:00
Yufen Yu	29ece8b435	block: add BLK_MQ_POLL_CLASSIC for hybrid poll and return EINVAL for unexpected value For q->poll_nsec == -1, means doing classic poll, not hybrid poll. We introduce a new flag BLK_MQ_POLL_CLASSIC to replace -1, which may make code much easier to read. Additionally, since val is an int obtained with kstrtoint(), val can be a negative value other than -1, so return -EINVAL for that case. Thanks to Damien Le Moal for some good suggestion. Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-03-20 14:02:07 -06:00
Jens Axboe	399254aaf4	block: add BIO_NO_PAGE_REF flag If bio_iov_iter_get_pages() is called on an iov_iter that is flagged with NO_REF, then we don't need to add a page reference for the pages that we add. Add BIO_NO_PAGE_REF to track this in the bio, so IO completion knows not to drop a reference to these pages. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-03-18 10:44:48 -06:00
Yufen Yu	684b73245c	blk-mq: use blk_mq_sched_mark_restart_hctx to set RESTART Let blk_mq_mark_tag_wait() use the blk_mq_sched_mark_restart_hctx() to set BLK_MQ_S_SCHED_RESTART. Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-03-18 08:14:51 -06:00
Linus Torvalds	11efae3506	for-5.1/block-post-20190315 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlyL124QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgptsxD/42slmoE5TC3vwXcgMBEilrjIHCns6O4Leo 0r8Awdwil8QkVDphfAWsgkTBjRPUNKv4cCg2kG4VEzAy62YSutUWPeqJZwLOpGDI kji9XI6WLqwQ/VhDFwEln9G+xWDUQxds5PZDomlzLpjiNqkFArwwsPFnJbshH4fB U6kZrhVSLfvJHIJmC9H4RIWuTEwUH1yFSvzzMqDOOyvRon2g/A2YlHb2KhSCaJPq 1b0jbhyR0GVP0EH1FdeKvNYFZfvXXSPAbxDN1CEtW/Lq8WxXeoaCj390tC+gL7yQ WWHntvUoVU/weWudbT3tVsYgpI91KfPM5OuWTDGod6lFwHrI5X91Pao3KYUGPb9d cwvNBOlkNqR1ENZOGTgxLeKwiwV7G1DIjvsaijRQJhGy4Uw4RkM/YEct9JHxWBIF x4ZuSVUVZ5Y3zNPC945iJ6Z5feOz/UO9bQL00oimu0c0JhAp++3pHWAFJEMQ8q1a 0IRifkeUyhf0p9CIVPDnUzmNgSBglFkAVTPVAWySBVDU+v0/GoNcYwTzPq4cgPrF UJEIlx+RdDpKKmCqBvKjtx4w7BC1lCebL/1ZJrbARNO42djt8xeuyvKw0t+MYVTZ UsvLX72tXwUIbj0IZZGuz+8uSGD4ddDs8+x486FN4oaCPf36FUnnkOZZkhjV/KQA vsZNrNNZpw== =qBae -----END PGP SIGNATURE----- Merge tag 'for-5.1/block-post-20190315' of git://git.kernel.dk/linux-block Pull more block layer changes from Jens Axboe: "This is a collection of both stragglers, and fixes that came in after I finalized the initial pull. This contains: - An MD pull request from Song, with a few minor fixes - Set of NVMe patches via Christoph - Pull request from Konrad, with a few fixes for xen/blkback - pblk fix IO calculation fix (Javier) - Segment calculation fix for pass-through (Ming) - Fallthrough annotation for blkcg (Mathieu)" * tag 'for-5.1/block-post-20190315' of git://git.kernel.dk/linux-block: (25 commits) blkcg: annotate implicit fall through nvme-tcp: support C2HData with SUCCESS flag nvmet: ignore EOPNOTSUPP for discard nvme: add proper write zeroes setup for the multipath device nvme: add proper discard setup for the multipath device nvme: remove nvme_ns_config_oncs nvme: disable Write Zeroes for qemu controllers nvmet-fc: bring Disconnect into compliance with FC-NVME spec nvmet-fc: fix issues with targetport assoc_list list walking nvme-fc: reject reconnect if io queue count is reduced to zero nvme-fc: fix numa_node when dev is null nvme-fc: use nr_phys_segments to determine existence of sgl nvme-loop: init nvmet_ctrl fatal_err_work when allocate nvme: update comment to make the code easier to read nvme: put ns_head ref if namespace fails allocation nvme-trace: fix cdw10 buffer overrun nvme: don't warn on block content change effects nvme: add get-feature to admin cmds tracer md: Fix failed allocation of md_register_thread It's wrong to add len to sector_nr in raid10 reshape twice ...	2019-03-16 12:36:39 -07:00
Nikolay Borisov	b5420237ec	mm: refactor readahead defines in mm.h All users of VM_MAX_READAHEAD actually convert it to kbytes and then to pages. Define the macro explicitly as (SZ_128K / PAGE_SIZE). This simplifies the expression in every filesystem. Also rename the macro to VM_READAHEAD_PAGES to properly convey its meaning. Finally remove unused VM_MIN_READAHEAD [akpm@linux-foundation.org: fix fs/io_uring.c, per Stephen] Link: http://lkml.kernel.org/r/20181221144053.24318-1-nborisov@suse.com Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Latchesar Ionkov <lucho@ionkov.net> Cc: Dominique Martinet <asmadeus@codewreck.org> Cc: David Howells <dhowells@redhat.com> Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: David Sterba <dsterba@suse.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-03-12 10:04:01 -07:00
Linus Torvalds	92fff53b71	SCSI misc on 20190306 This is mostly update of the usual drivers: arcmsr, qla2xxx, lpfc, hisi_sas, target/iscsi and target/core. Additionally Christoph refactored gdth as part of the dma changes. The major mid-layer change this time is the removal of bidi commands and with them the whole of the osd/exofs driver and filesystem. Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com> -----BEGIN PGP SIGNATURE----- iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCXIC54SYcamFtZXMuYm90 dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishT1GAPwJEV23 ExPiPsnuVgKj49nLTagZ3rILRQcYNbL+MNYqxQEA0cT8FHzSDBfWY5OKPNE+RQ8z f69LpXGmMpuagKGvvd4= =Fhy1 -----END PGP SIGNATURE----- Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI updates from James Bottomley: "This is mostly update of the usual drivers: arcmsr, qla2xxx, lpfc, hisi_sas, target/iscsi and target/core. Additionally Christoph refactored gdth as part of the dma changes. The major mid-layer change this time is the removal of bidi commands and with them the whole of the osd/exofs driver and filesystem. This is a major simplification for block and mq in particular" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (240 commits) scsi: cxgb4i: validate tcp sequence number only if chip version <= T5 scsi: cxgb4i: get pf number from lldi->pf scsi: core: replace GFP_ATOMIC with GFP_KERNEL in scsi_scan.c scsi: mpt3sas: Add missing breaks in switch statements scsi: aacraid: Fix missing break in switch statement scsi: kill command serial number scsi: csiostor: drop serial_number usage scsi: mvumi: use request tag instead of serial_number scsi: dpt_i2o: remove serial number usage scsi: st: osst: Remove negative constant left-shifts scsi: ufs-bsg: Allow reading descriptors scsi: ufs: Allow reading descriptor via raw upiu scsi: ufs-bsg: Change the calling convention for write descriptor scsi: ufs: Remove unused device quirks Revert "scsi: ufs: disable vccq if it's not needed by UFS device" scsi: megaraid_sas: Remove a bunch of set but not used variables scsi: clean obsolete return values of eh_timed_out scsi: sd: Optimal I/O size should be a multiple of physical block size scsi: MAINTAINERS: SCSI initiator and target tweaks scsi: fcoe: make use of fip_mode enum complete ...	2019-03-09 16:53:47 -08:00
Linus Torvalds	38e7571c07	io_uring-2019-03-06 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlyAJvAQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgphb+EACFaKI2HIdjExQ5T7Cxebzwky+Qiro3FV55 ziW00FZrkJ5g0h4ItBzh/5SDlcNQYZDMlA3s4xzWIMadWl5PjMPq1uJul0cITbSl WIJO5hpgNMXeUEhvcXUl6+f/WzpgYUxN40uW8N5V7EKlooaFVfudDqJGlvEv+UgB g8NWQYThSG+/e7r9OGwK0xDRVKfpjxVvmqmnDH3DrxKaDgSOwTf4xn1u41wKwfQ3 3uPfQ+GBeTqt4a2AhOi7K6KQFNnj5Jz5CXYMiOZI2JGtLPcL6dmyBVD7K0a0HUr+ rs4ghNdd1+puvPGNK4TX8qV0uiNrMctoRNVA/JDd1ZTYEKTmNLxeFf+olfYHlwuK K5FRs60/lgNzNkzcUpFvJHitPwYtxYJdB36PyswE1FZP1YviEeVoKNt9W8aIhEoA 549uj90brfA74eCINGhq98pJqj9CNyCPw3bfi76f5Ej2utwYDb9S5Cp2gfSa853X qc/qNda9efEq7ikwCbPzhekRMXZo6TSXtaSmC2C+Vs5+mD1Scc4kdAvdCKGQrtr9 aoy0iQMYO2NDZ/G5fppvXtMVuEPAZWbsGftyOe15IlMysjRze2ycJV8cFahKEVM9 uBeXLyH1pqGU/j7ABP4+XRZ/sbHJTwjKJbnXhTgBsdU8XO/CR3U+kRQFTsidKMfH Wlo3uH2h2A== =p78E -----END PGP SIGNATURE----- Merge tag 'io_uring-2019-03-06' of git://git.kernel.dk/linux-block Pull io_uring IO interface from Jens Axboe: "Second attempt at adding the io_uring interface. Since the first one, we've added basic unit testing of the three system calls, that resides in liburing like the other unit tests that we have so far. It'll take a while to get full coverage of it, but we're working towards it. I've also added two basic test programs to tools/io_uring. One uses the raw interface and has support for all the various features that io_uring supports outside of standard IO, like fixed files, fixed IO buffers, and polled IO. The other uses the liburing API, and is a simplified version of cp(1). This adds support for a new IO interface, io_uring. io_uring allows an application to communicate with the kernel through two rings, the submission queue (SQ) and completion queue (CQ) ring. This allows for very efficient handling of IOs, see the v5 posting for some basic numbers: https://lore.kernel.org/linux-block/20190116175003.17880-1-axboe@kernel.dk/ Outside of just efficiency, the interface is also flexible and extendable, and allows for future use cases like the upcoming NVMe key-value store API, networked IO, and so on. It also supports async buffered IO, something that we've always failed to support in the kernel. Outside of basic IO features, it supports async polled IO as well. This particular feature has already been tested at Facebook months ago for flash storage boxes, with 25-33% improvements. It makes polled IO actually useful for real world use cases, where even basic flash sees a nice win in terms of efficiency, latency, and performance. These boxes were IOPS bound before, now they are not. This series adds three new system calls. One for setting up an io_uring instance (io_uring_setup(2)), one for submitting/completing IO (io_uring_enter(2)), and one for aux functions like registrating file sets, buffers, etc (io_uring_register(2)). Through the help of Arnd, I've coordinated the syscall numbers so merge on that front should be painless. Jon did a writeup of the interface a while back, which (except for minor details that have been tweaked) is still accurate. Find that here: https://lwn.net/Articles/776703/ Huge thanks to Al Viro for helping getting the reference cycle code correct, and to Jann Horn for his extensive reviews focused on both security and bugs in general. There's a userspace library that provides basic functionality for applications that don't need or want to care about how to fiddle with the rings directly. It has helpers to allow applications to easily set up an io_uring instance, and submit/complete IO through it without knowing about the intricacies of the rings. It also includes man pages (thanks to Jeff Moyer), and will continue to grow support helper functions and features as time progresses. Find it here: git://git.kernel.dk/liburing Fio has full support for the raw interface, both in the form of an IO engine (io_uring), but also with a small test application (t/io_uring) that can exercise and benchmark the interface" * tag 'io_uring-2019-03-06' of git://git.kernel.dk/linux-block: io_uring: add a few test tools io_uring: allow workqueue item to handle multiple buffered requests io_uring: add support for IORING_OP_POLL io_uring: add io_kiocb ref count io_uring: add submission polling io_uring: add file set registration net: split out functions related to registering inflight socket files io_uring: add support for pre-mapped user IO buffers block: implement bio helper to add iter bvec pages to bio io_uring: batch io_kiocb allocation io_uring: use fget/fput_many() for file references fs: add fget_many() and fput_many() io_uring: support for IO polling io_uring: add fsync support Add io_uring IO interface	2019-03-08 14:48:40 -08:00
Linus Torvalds	80201fe175	for-5.1/block-20190302 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlx63XIQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpp2vEACfrrQsap7R+Av28mmXpmXi2FPa3g5Tev1t yYjK2qHvhlMZjPTYw3hCmbYdDDczlF7PEgSE2x2DjdcsYapb8Fy1lZ2X16c7ztBR HD/t9b5AVSQsczZzKgv3RqsNtTnjzS5V0A8XH8FAP2QRgiwDMwSN6G0FP0JBLbE/ ZgxQrH1Iy1F33Wz4hI3Z7dEghKPZrH1IlegkZCEu47q9SlWS76qUetSy2GEtchOl 3Lgu54mQZyVdI5/QZf9DyMDLF6dIz3tYU2qhuo01AHjGRCC72v86p8sIiXcUr94Q 8pbegJhJ/g8KBol9Qhv3+pWG/QUAZwi/ZwasTkK+MJ4klRXfOrznxPubW1z6t9Vn QRo39Po5SqqP0QWAscDxCFjESIQlWlKa+LZurJL7DJDCUGrSgzTpnVwFqKwc5zTP HJa5MT2tEeL2TfUYRYCfh0ZV0elINdHA1y1klDBh38drh4EWr2gW8xdseGYXqRjh fLgEpoF7VQ8kTvxKN+E4jZXkcZmoLmefp0ZyAbblS6IawpPVC7kXM9Fdn2OU8f2c fjVjvSiqxfeN6dnpfeLDRbbN9894HwgP/LPropJOQ7KmjCorQq5zMDkAvoh3tElq qwluRqdBJpWT/F05KweY+XVW8OawIycmUWqt6JrVNoIDAK31auHQv47kR0VA4OvE DRVVhYpocw== =VBaU -----END PGP SIGNATURE----- Merge tag 'for-5.1/block-20190302' of git://git.kernel.dk/linux-block Pull block layer updates from Jens Axboe: "Not a huge amount of changes in this round, the biggest one is that we finally have Mings multi-page bvec support merged. Apart from that, this pull request contains: - Small series that avoids quiescing the queue for sysfs changes that match what we currently have (Aleksei) - Series of bcache fixes (via Coly) - Series of lightnvm fixes (via Mathias) - NVMe pull request from Christoph. Nothing major, just SPDX/license cleanups, RR mp policy (Hannes), and little fixes (Bart, Chaitanya). - BFQ series (Paolo) - Save blk-mq cpu -> hw queue mapping, removing a pointer indirection for the fast path (Jianchao) - fops->iopoll() added for async IO polling, this is a feature that the upcoming io_uring interface will use (Christoph, me) - Partition scan loop fixes (Dongli) - mtip32xx conversion from managed resource API (Christoph) - cdrom registration race fix (Guenter) - MD pull from Song, two minor fixes. - Various documentation fixes (Marcos) - Multi-page bvec feature. This brings a lot of nice improvements with it, like more efficient splitting, larger IOs can be supported without growing the bvec table size, and so on. (Ming) - Various little fixes to core and drivers" * tag 'for-5.1/block-20190302' of git://git.kernel.dk/linux-block: (117 commits) block: fix updating bio's front segment size block: Replace function name in string with __func__ nbd: propagate genlmsg_reply return code floppy: remove set but not used variable 'q' null_blk: fix checking for REQ_FUA block: fix NULL pointer dereference in register_disk fs: fix guard_bio_eod to check for real EOD errors blk-mq: use HCTX_TYPE_DEFAULT but not 0 to index blk_mq_tag_set->map block: optimize bvec iteration in bvec_iter_advance block: introduce mp_bvec_for_each_page() for iterating over page block: optimize blk_bio_segment_split for single-page bvec block: optimize __blk_segment_map_sg() for single-page bvec block: introduce bvec_nth_page() iomap: wire up the iopoll method block: add bio_set_polled() helper block: wire up block device iopoll method fs: add an iopoll method to struct file_operations loop: set GENHD_FL_NO_PART_SCAN after blkdev_reread_part() loop: do not print warn message if partition scan is successful block: bounce: make sure that bvec table is updated ...	2019-03-08 14:12:17 -08:00
Ming Lei	05b700ba60	block: fix segment calculation for passthrough IO blk_recount_segments() can be called in bio_add_pc_page() for calculating how many segments this bio will has after one page is added to this bio. If the resulted segment number is beyond the queue limit, the added page will be removed. The try-and-fix policy requires blk_recount_segments(__blk_recalc_rq_segments) to not consider the segment number limit. Unfortunately bvec_split_segs() does check this limit, and causes small segment number returned to bio_add_pc_page(), then page still may be added to the bio even though segment number limit becomes broken. Fixes this issue by not considering segment number limit when calcualting bio's segment number. Fixes: `dcebd75592` ("block: use bio_for_each_bvec() to compute multi-page bvec count") Cc: Christoph Hellwig <hch@lst.de> Cc: Omar Sandoval <osandov@fb.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-03-06 09:42:54 -07:00
Ming Lei	aaeee62c84	block: fix updating bio's front segment size When the current bvec can be merged to the 1st segment, the bio's front segment size has to be updated. However, `dcebd75592` doesn't consider that case, then bio's front segment size may not be correct. This patch fixes this issue. Cc: Christoph Hellwig <hch@lst.de> Cc: Omar Sandoval <osandov@fb.com> Fixes: `dcebd75592` ("block: use bio_for_each_bvec() to compute multi-page bvec count") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-03-02 12:45:37 -07:00
Keyur Patel	dfc76d11dd	block: Replace function name in string with __func__ Replace hard coded function name register_blkdev with __func__, to improve robustness and to conform to the Linux kernel coding style. Issue found using checkpatch. Signed-off-by: Keyur Patel <iamkeyur96@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-28 14:09:08 -07:00
zhengbin	4d7c1d3fd7	block: fix NULL pointer dereference in register_disk If __device_add_disk-->bdi_register_owner-->bdi_register--> bdi_register_va-->device_create_vargs fails, bdi->dev is still NULL, __device_add_disk-->register_disk will visit bdi->dev->kobj. This patch fixes that. Signed-off-by: zhengbin <zhengbin13@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-28 14:01:36 -07:00
Dongli Zhang	7d76f8562f	blk-mq: use HCTX_TYPE_DEFAULT but not 0 to index blk_mq_tag_set->map Replace set->map[0] with set->map[HCTX_TYPE_DEFAULT] to avoid hardcoding. Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-28 13:57:32 -07:00
Jens Axboe	6d0c48aede	block: implement bio helper to add iter bvec pages to bio For an ITER_BVEC, we can just iterate the iov and add the pages to the bio directly. For now, we grab a reference to those pages, and release them normally on IO completion. This isn't really needed for the normal case of O_DIRECT from/to a file, but some of the more esoteric use cases (like splice(2)) will unconditionally put the pipe buffer pages when the buffers are released. Until we can manage that case properly, ITER_BVEC pages are treated like normal pages in terms of reference counting. Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-28 08:24:23 -07:00
Ming Lei	bbcbbd567c	block: optimize blk_bio_segment_split for single-page bvec Introduce a fast path for single-page bvec IO, then we can avoid to call bvec_split_segs() unnecessarily. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-27 06:18:55 -07:00
Ming Lei	48d7727cae	block: optimize __blk_segment_map_sg() for single-page bvec Introduce a fast path for single-page bvec IO, then blk_bvec_map_sg() can be avoided. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-27 06:18:54 -07:00
Ming Lei	4d633062c1	block: introduce bvec_nth_page() Single-page bvec can often be seen in small BS workloads, so introduce bvec_nth_page() for avoiding to call nth_page() unnecessarily, which looks not cheap. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-27 06:18:52 -07:00
Ming Lei	8f4e80da76	block: bounce: make sure that bvec table is updated Block bounce needs to allocate new page for doing IO, and the new page has to be updated to bvec table. Commit `6dc4f100c` switches __blk_queue_bounce() to use the new bio_for_each_segment_all() interface. Unfortunately the new bio_for_each_segment_all() can't be used to update bvec table. This patch fixes this issue by retrieving bvec from the table directly, then the new allocated page can be updated to the bio. This way is safe because the cloned bio is single page bvec. Fixes: `6dc4f100c` ("block: allow bio_for_each_segment_all() to iterate over multi-page bvec") Cc: Christoph Hellwig <hch@lst.de> Cc: Omar Sandoval <osandov@fb.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-21 10:58:44 -07:00
Ming Lei	49b1f22b56	block: avoid to READ fields of null bio rq->bio can be NULL sometimes, such as flush request, so don't read bio->bi_seg_front_size until this 'bio' is checked as valid. Cc: Bart Van Assche <bvanassche@acm.org> Reported-by: Bart Van Assche <bvanassche@acm.org> Fixes: `dcebd75592` ("block: use bio_for_each_bvec() to compute multi-page bvec count") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-19 09:19:06 -07:00
Linus Torvalds	24f0a48743	for-linus-20190215 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlxm7pAQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpl6JEACM5qHp7HEf7muuLKDUoX16G2eDOjacVxbL q1kqyHNvrYD/aGo+8vcshCef6xno9fL1akIxTyaTcMwYJUk9JSMicsVimxC1OvI6 a5ZiWItX2L8Nh/heJe+FtutWbrT+Nd+3Q8DqI+U0YkRnjnXaRVgLFtBmjLOxBrqJ Ps/VepB4GaxA0oWdPbhos/N3wa42uFy3ixdv3Kv6WmHdqraB9uagt8PwwUti3WzQ uxWL6J+JOBSDha8l3fp68Okib1bm/6Nmmc9l8Yz1eFwf+Y+gVgw7wPQxkUD/XaFW bDJGwp3NawK07EanIAIzfXUEGfLvgeRJBEP3OGwV/TAiHX5q9zQo/tbM6x8j4aT9 zGlwU/EnwFixgbRW/hOT5Ox4usBlfB1j0ZiNmgUm8QphHrELFnc35Kd+PR/KONNX sI6ZiifEAMR+4S99kTZ5YjHUqcUVm9ndd8iQGW9mvM6vt3o1L6QKeOeEKBMlhMcx V+JtViC50ojidYc82kEtQFY9OKRkc5x3k1wBsH49LGMT+fvEwETallOXHTarQKrv QAZNN1NINkMmrL5bgBXFqf0qpOy4xHnhis5AilUHNZwa4G8iAe8oqz/2eUCydiV1 Ogx20a8T1ifeSkI2NXrwnBjVzqnfiO9wOb9py98BiLR6k59x3GYtbCdGtpIXfSFv hG79KKoz3Q== =8mjO -----END PGP SIGNATURE----- Merge tag 'for-linus-20190215' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: - Ensure we insert into the hctx dispatch list, if a request is marked as DONTPREP (Jianchao) - NVMe pull request, single missing unlock on error fix (Keith) - MD pull request, single fix for a potentially data corrupting issue (Nate) - Floppy check_events regression fix (Yufen) * tag 'for-linus-20190215' of git://git.kernel.dk/linux-block: md/raid1: don't clear bitmap bits on interrupted recovery. floppy: check_events callback should not return a negative number nvme-pci: add missing unlock for reset error blk-mq: insert rq with DONTPREP to hctx dispatch list when requeue	2019-02-15 09:12:28 -08:00
Jens Axboe	6fb845f0e7	Linux 5.0-rc6 -----BEGIN PGP SIGNATURE----- iQFRBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlxgqNUeHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGwsoH+OVXu0NQofwTvVru 8lgF3BSDG2mhf7mxbBBlBizGVy9jnjRNGCFMC+Jq8IwiFLwprja/G27kaDTkpuF1 PHC3yfjKvjTeUP5aNdHlmxv6j1sSJfZl0y46DQal4UeTG/Giq8TFTi+Tbz7Wb/WV yCx4Lr8okAwTuNhnL8ojUCVIpd3c8QsyR9v6nEQ14Mj+MvEbokyTkMJV0bzOrM38 JOB+/X1XY4JPZ6o3MoXrBca3bxbAJzMneq+9CWw1U5eiIG3msg4a+Ua3++RQMDNr 8BP0yCZ6wo32S8uu0PI6HrZaBnLYi5g9Wh7Q7yc0mn1Uh1zWFykA6TtqK90agJeR A6Ktjw== =scY4 -----END PGP SIGNATURE----- Merge tag 'v5.0-rc6' into for-5.1/block Pull in 5.0-rc6 to avoid a dumb merge conflict with fs/iomap.c. This is needed since io_uring is now based on the block branch, to avoid a conflict between the multi-page bvecs and the bits of io_uring that touch the core block parts. * tag 'v5.0-rc6': (525 commits) Linux 5.0-rc6 x86/mm: Make set_pmd_at() paravirt aware MAINTAINERS: Update the ocores i2c bus driver maintainer, etc blk-mq: remove duplicated definition of blk_mq_freeze_queue Blk-iolatency: warn on negative inflight IO counter blk-iolatency: fix IO hang due to negative inflight counter MAINTAINERS: unify reference to xen-devel list x86/mm/cpa: Fix set_mce_nospec() futex: Handle early deadlock return correctly futex: Fix barrier comment net: dsa: b53: Fix for failure when irq is not defined in dt blktrace: Show requests without sector mips: cm: reprime error cause mips: loongson64: remove unreachable(), fix loongson_poweroff(). sit: check if IPv6 enabled before calling ip6_err_gen_icmpv6_unreach() geneve: should not call rt6_lookup() when ipv6 was disabled KVM: nVMX: unconditionally cancel preemption timer in free_nested (CVE-2019-7221) KVM: x86: work around leak of uninitialized stack contents (CVE-2019-7222) kvm: fix kvm_ioctl_create_device() reference counting (CVE-2019-6974) signal: Better detection of synchronous signals ...	2019-02-15 08:43:59 -07:00
Ming Lei	56d18f62f5	block: kill BLK_MQ_F_SG_MERGE QUEUE_FLAG_NO_SG_MERGE has been killed, so kill BLK_MQ_F_SG_MERGE too. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-15 08:40:12 -07:00
Ming Lei	2705c93742	block: kill QUEUE_FLAG_NO_SG_MERGE Since `bdced438ac` ("block: setup bi_phys_segments after splitting"), physical segment number is mainly figured out in blk_queue_split() for fast path, and the flag of BIO_SEG_VALID is set there too. Now only blk_recount_segments() and blk_recalc_rq_segments() use this flag. Basically blk_recount_segments() is bypassed in fast path given BIO_SEG_VALID is set in blk_queue_split(). For another user of blk_recalc_rq_segments(): - run in partial completion branch of blk_update_request, which is an unusual case - run in blk_cloned_rq_check_limits(), still not a big problem if the flag is killed since dm-rq is the only user. Multi-page bvec is enabled now, not doing S/G merging is rather pointless with the current setup of the I/O path, as it isn't going to save you a significant amount of cycles. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-15 08:40:12 -07:00
Ming Lei	07173c3ec2	block: enable multipage bvecs This patch pulls the trigger for multi-page bvecs. Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-15 08:40:12 -07:00
Ming Lei	6dc4f100c1	block: allow bio_for_each_segment_all() to iterate over multi-page bvec This patch introduces one extra iterator variable to bio_for_each_segment_all(), then we can allow bio_for_each_segment_all() to iterate over multi-page bvec. Given it is just one mechannical & simple change on all bio_for_each_segment_all() users, this patch does tree-wide change in one single patch, so that we can avoid to use a temporary helper for this conversion. Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-15 08:40:11 -07:00
Ming Lei	862e5a5e6f	block: use bio_for_each_bvec() to map sg It is more efficient to use bio_for_each_bvec() to map sg, meantime we have to consider splitting multipage bvec as done in blk_bio_segment_split(). Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-15 08:40:11 -07:00
Ming Lei	dcebd75592	block: use bio_for_each_bvec() to compute multi-page bvec count First it is more efficient to use bio_for_each_bvec() in both blk_bio_segment_split() and __blk_recalc_rq_segments() to compute how many multi-page bvecs there are in the bio. Secondly once bio_for_each_bvec() is used, the bvec may need to be splitted because its length can be very longer than max segment size, so we have to split the big bvec into several segments. Thirdly when splitting multi-page bvec into segments, the max segment limit may be reached, so the bio split need to be considered under this situation too. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-15 08:40:11 -07:00
Ming Lei	1a67356e9a	block: don't use bio->bi_vcnt to figure out segment number It is wrong to use bio->bi_vcnt to figure out how many segments there are in the bio even though CLONED flag isn't set on this bio, because this bio may be splitted or advanced. So always use bio_segments() in blk_recount_segments(), and it shouldn't cause any performance loss now because the physical segment number is figured out in blk_queue_split() and BIO_SEG_VALID is set meantime since `bdced438ac` ("block: setup bi_phys_segments after splitting"). Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Fixes: `76d8137a31` ("blk-merge: recaculate segment if it isn't less than max segments") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-15 08:40:10 -07:00
Jianchao Wang	aef1897cd3	blk-mq: insert rq with DONTPREP to hctx dispatch list when requeue When requeue, if RQF_DONTPREP, rq has contained some driver specific data, so insert it to hctx dispatch list to avoid any merge. Take scsi as example, here is the trace event log (no io scheduler, because RQF_STARTED would prevent merging), kworker/0:1H-339 [000] ...1 2037.209289: block_rq_insert: 8,0 R 4096 () 32768 + 8 [kworker/0:1H] scsi_inert_test-1987 [000] .... 2037.220465: block_bio_queue: 8,0 R 32776 + 8 [scsi_inert_test] scsi_inert_test-1987 [000] ...2 2037.220466: block_bio_backmerge: 8,0 R 32776 + 8 [scsi_inert_test] kworker/0:1H-339 [000] .... 2047.220913: block_rq_issue: 8,0 R 8192 () 32768 + 16 [kworker/0:1H] scsi_inert_test-1996 [000] ..s1 2047.221007: block_rq_complete: 8,0 R () 32768 + 8 [0] scsi_inert_test-1996 [000] .Ns1 2047.221045: block_rq_requeue: 8,0 R () 32776 + 8 [0] kworker/0:1H-339 [000] ...1 2047.221054: block_rq_insert: 8,0 R 4096 () 32776 + 8 [kworker/0:1H] kworker/0:1H-339 [000] ...1 2047.221056: block_rq_issue: 8,0 R 4096 () 32776 + 8 [kworker/0:1H] scsi_inert_test-1986 [000] ..s1 2047.221119: block_rq_complete: 8,0 R () 32776 + 8 [0] (32768 + 8) was requeued by scsi_queue_insert and had RQF_DONTPREP. Then it was merged with (32776 + 8) and issued. Due to RQF_DONTPREP, the sdb only contained the part of (32768 + 8), then only that part was completed. The lucky thing was that scsi_io_completion detected it and requeued the remaining part. So we didn't get corrupted data. However, the requeue of (32776 + 8) is not expected. Suggested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-11 19:51:52 -07:00
Aleksei Zakharov	fbd72127c9	block: avoid setting none scheduler if it's already none There's no reason to freeze queue and remove scheduler if there's no scheduler already. Signed-off-by: Aleksei Zakharov <zakharov.a.g@yandex.ru> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-11 08:21:40 -07:00

1 2 3 4 5 ...

4351 Commits