Commit Graph

902625 Commits

Author SHA1 Message Date
Jeff Layton
785892fe88 ceph: cache layout in parent dir on first sync create
If a create is done, then typically we'll end up writing to the file
soon afterward. We don't want to wait for the reply before doing that
when doing an async create, so that means we need the layout for the
new file before we've gotten the response from the MDS.

All files created in a directory will initially inherit the same layout,
so copy off the requisite info from the first synchronous create in the
directory, and save it in a new i_cached_layout field. Zero out the
layout when we lose Dc caps in the dir.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:42 +02:00
Jeff Layton
6deb8008a8 ceph: add new MDS req field to hold delegated inode number
Add new request field to hold the delegated inode number. Encode that
into the message when it's set.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:42 +02:00
Jeff Layton
d484648787 ceph: decode interval_sets for delegated inos
Starting in Octopus, the MDS will hand out caps that allow the client
to do asynchronous file creates under certain conditions. As part of
that, the MDS will delegate ranges of inode numbers to the client.

Add the infrastructure to decode these ranges, and stuff them into an
xarray for later consumption by the async creation code.

Because the xarray code currently only handles unsigned long indexes,
and those are 32-bits on 32-bit arches, we only enable the decoding when
running on a 64-bit arch.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:42 +02:00
Jeff Layton
966c716018 ceph: make ceph_fill_inode non-static
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:42 +02:00
Jeff Layton
2ccb45462a ceph: perform asynchronous unlink if we have sufficient caps
The MDS is getting a new lock-caching facility that will allow it
to cache the necessary locks to allow asynchronous directory operations.
Since the CEPH_CAP_FILE_* caps are currently unused on directories,
we can repurpose those bits for this purpose.

When performing an unlink, if we have Fx on the parent directory,
and CEPH_CAP_DIR_UNLINK (aka Fr), and we know that the dentry being
removed is the primary link, then then we can fire off an unlink
request immediately and don't need to wait on reply before returning.

In that situation, just fix up the dcache and link count and return
immediately after issuing the call to the MDS. This does mean that we
need to hold an extra reference to the inode being unlinked, and extra
references to the caps to avoid races. Those references are put and
error handling is done in the r_callback routine.

If the operation ends up failing, then set a writeback error on the
directory inode, and the inode itself that can be fetched later by
an fsync on the dir.

The behavior of dir caps is slightly different from caps on normal
files. Because these are just considered an optimization, if the
session is reconnected, we will not automatically reclaim them. They
are instead considered lost until we do another synchronous op in the
parent directory.

Async dirops are enabled via the "nowsync" mount option, which is
patterned after the xfs "wsync" mount option. For now, the default
is "wsync", but eventually we may flip that.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:42 +02:00
Yan, Zheng
173e70e8ac ceph: don't take refs to want mask unless we have all bits
If we don't have all of the cap bits for the want mask in
try_get_cap_refs, then just take refs on the need bits.

Signed-off-by: "Yan, Zheng" <ukernel@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:41 +02:00
Jeff Layton
a25949b990 ceph: cap tracking for async directory operations
Track and correctly handle directory caps for asynchronous operations.
Add aliases for Frc caps that we now designate at Dcu caps (when dealing
with directories).

Unlike file caps, we don't reclaim these when the session goes away, and
instead preemptively release them. In-flight async dirops are instead
handled during reconnect phase. The client needs to re-do a synchronous
operation in order to re-get directory caps.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:41 +02:00
Jeff Layton
40dcf75e82 ceph: make __take_cap_refs non-static
Rename it to ceph_take_cap_refs and make it available to other files.
Also replace a comment with a lockdep assertion.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:41 +02:00
Jeff Layton
891f3f5a6a ceph: add infrastructure for waiting for async create to complete
When we issue an async create, we must ensure that any later on-the-wire
requests involving it wait for the create reply.

Expand i_ceph_flags to be an unsigned long, and add a new bit that
MDS requests can wait on. If the bit is set in the inode when sending
caps, then don't send it and just return that it has been delayed.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:41 +02:00
Jeff Layton
f5e17aed3a ceph: track primary dentry link
Newer versions of the MDS will flag a dentry as "primary". In later
patches, we'll need to consult this info, so track it in di->flags.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:41 +02:00
Jeff Layton
3bb48b4142 ceph: add flag to designate that a request is asynchronous
...and ensure that such requests are never queued. The MDS has need to
know that a request is asynchronous so add flags and proper
infrastructure for that.

Also, delegated inode numbers and directory caps are associated with the
session, so ensure that async requests are always transmitted on the
first attempt and are never queued to wait for session reestablishment.

If it does end up looking like we'll need to queue the request, then
have it return -EJUKEBOX so the caller can reattempt with a synchronous
request.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:41 +02:00
Jeff Layton
c7e4f85ce9 ceph: more caps.c lockdep assertions
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:41 +02:00
Jeff Layton
e8a4d26771 ceph: clean up kick_flushing_inode_caps()
The last thing that this function does is release i_ceph_lock, so
have the caller do that instead. Add a lockdep assertion to
ensure that the function is always called with i_ceph_lock held.
Change the prototype to take a ceph_inode_info pointer and drop
the separate mdsc argument as we can get that from the session.

While at it, make it non-static.  We'll need this to kick any
flushing caps once the create reply comes in.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:41 +02:00
Ilya Dryomov
bb0e681dda libceph: directly skip to the end of redirect reply
Coverity complains about a double write to *p.  Don't bother with
osd_instructions and directly skip to the end of redirect reply.

Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:41 +02:00
Ilya Dryomov
4d8b8fb494 libceph: simplify ceph_monc_handle_map()
ceph_monc_handle_map() confuses static checkers which report a
false use-after-free on monc->monmap, missing that monc->monmap and
client->monc.monmap is the same pointer.

Use monc->monmap consistently and get rid of "old", which is redundant.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:41 +02:00
Xiubo Li
8ccf7fcce1 ceph: return ETIMEDOUT errno to userland when request timed out
req->r_timeout is only used during mounting, so this error will
be more accurate.

URL: https://tracker.ceph.com/issues/44215
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:41 +02:00
Luis Henriques
1b0c3b9f91 ceph: re-org copy_file_range and fix some error paths
This patch re-organizes copy_file_range, trying to fix a few issues in the
error handling.  Here's the summary:

- Abort copy if initial do_splice_direct() returns fewer bytes than
  requested.

- Move the 'size' initialization (with i_size_read()) further down in the
  code, after the initial call to do_splice_direct().  This avoids issues
  with a possibly stale value if a manual copy is done.

- Move the object copy loop into a separate function.  This makes it
  easier to handle errors (e.g, dirtying caps and updating the MDS
  metadata if only some objects have been copied before an error has
  occurred).

- Added calls to ceph_oloc_destroy() to avoid leaking memory with src_oloc
  and dst_oloc

- After the object copy loop, the new file size to be reported to the MDS
  (if there's file size change) is now the actual file size, and not the
  size after an eventual extra manual copy.

- Added a few dout() to show the number of bytes copied in the two manual
  copies and in the object copy loop.

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:41 +02:00
Jeff Layton
058daab79d ceph: move to a dedicated slabcache for mds requests
On my machine (x86_64) this struct is 952 bytes, which gets rounded up
to 1024 by kmalloc. Move this to a dedicated slabcache, so we can
allocate them without the extra 72 bytes of overhead per.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:41 +02:00
Jeff Layton
c36d641493 ceph: reorganize fields in ceph_mds_request
This shrinks the struct size by 16 bytes.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:41 +02:00
Andreas Gruenbacher
cb03c14390 ceph: switch to page_mkwrite_check_truncate in ceph_page_mkwrite
Use the "page has been truncated" logic in page_mkwrite_check_truncate
instead of reimplementing it here.  Other than with the existing code,
fail with -EFAULT / VM_FAULT_NOPAGE when page_offset(page) == size here
as well, as should be expected.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:41 +02:00
Gustavo A. R. Silva
f682dc713c ceph: replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:40 +02:00
Hannes Reinecke
f9b6b98d24 rbd: enable multiple blk-mq queues
Allocate one queue per CPU and get a performance boost from
higher parallelism.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:40 +02:00
Ilya Dryomov
59e542c869 rbd: embed image request in blk-mq pdu
Avoid making allocations for !IMG_REQ_CHILD image requests.  Only
IMG_REQ_CHILD image requests need to be freed now.

Move the initial request checks to rbd_queue_rq().  Unfortunately we
can't fill the image request and kick the state machine directly from
rbd_queue_rq() because ->queue_rq() isn't allowed to block.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:40 +02:00
Ilya Dryomov
a52cc68575 rbd: acquire header_rwsem just once in rbd_queue_workfn()
Currently header_rwsem is acquired twice: once in rbd_dev_parent_get()
when the image request is being created and then in rbd_queue_workfn()
to capture mapping_size and snapc.  Introduce rbd_img_capture_header()
and move image request allocation so that header_rwsem can be acquired
just once.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:40 +02:00
Ilya Dryomov
78b42a871a rbd: get rid of img_request_layered_clear()
No need to clear IMG_REQ_LAYERED before destroying the request.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:40 +02:00
Hannes Reinecke
679a97d286 rbd: kill img_request kref
The reference counter is never increased, so we can as well call
rbd_img_request_destroy() directly and drop the kref.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:40 +02:00
Yan, Zheng
bbb480ab05 ceph: check if file lock exists before sending unlock request
When a process exits, kernel closes its files. locks_remove_file()
is called to remove file locks on these files. locks_remove_file()
tries unlocking files even there is no file lock.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:40 +02:00
Xiubo Li
cb63483ad0 ceph: fix description of some mount options
Based on the latest code, the default value for wsize/rsize is
64MB and the default value for the mount_timeout is 60 seconds.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:40 +02:00
Xiubo Li
5107d7d505 ceph: move ceph_osdc_{read,write}pages to ceph.ko
Since these helpers are only used by ceph.ko, move them there and
rename them with _sync_ qualifiers.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:40 +02:00
Jeff Layton
70837470b4 ceph: don't ClearPageChecked in ceph_invalidatepage()
CephFS doesn't set this bit to begin with, so there should be no need
to clear it.

Reported-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:40 +02:00
Ilya Dryomov
94f4857f4b rbd: remove barriers from img_request_layered_{set,clear,test}()
IMG_REQ_LAYERED is set in rbd_img_request_create(), and tested and
cleared in rbd_img_request_destroy() when the image request is about to
be destroyed.  The barriers are unnecessary.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:40 +02:00
Ilya Dryomov
072eaf3c0f libceph: drop CEPH_DEFINE_SHOW_FUNC
Although CEPH_DEFINE_SHOW_FUNC is much older, it now duplicates
DEFINE_SHOW_ATTRIBUTE from linux/seq_file.h.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
2020-03-30 12:42:40 +02:00
Yan, Zheng
525d15e8e5 ceph: check inode type for CEPH_CAP_FILE_{CACHE,RD,REXTEND,LAZYIO}
These bits will have new meaning for directory inodes.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:40 +02:00
Jeff Layton
f85122afeb ceph: add refcounting for Fx caps
In future patches we'll be taking and relying on Fx caps. Add proper
refcounting for them.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:40 +02:00
Jeff Layton
3db0a2fc56 ceph: register MDS request with dir inode from the start
When the unsafe reply to a request comes in, the request is put on the
r_unsafe_dir inode's list. In future patches, we're going to need to
wait on requests that may not have gotten an unsafe reply yet.

Change __register_request to put the entry on the dir inode's list when
the pointer is set in the request, and don't check the
CEPH_MDS_R_GOT_UNSAFE flag when unregistering it.

The only place that uses this list today is fsync codepath, and with
the coming changes, we'll want to wait on all operations whether it has
gotten an unsafe reply or not.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:39 +02:00
Linus Torvalds
7111951b8d Linux 5.6 2020-03-29 15:25:41 -07:00
Linus Torvalds
570203ec83 Merge branch 'akpm' (patches from Andrew)
Merge vm fixes from Andrew Morton:
 "5 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm/sparse: fix kernel crash with pfn_section_valid check
  mm: fork: fix kernel_stack memcg stats for various stack implementations
  hugetlb_cgroup: fix illegal access to memory
  drivers/base/memory.c: indicate all memory blocks as removable
  mm/swapfile.c: move inode_lock out of claim_swapfile
2020-03-29 10:40:31 -07:00
Linus Torvalds
ab93e984db A single fix for the Hyper-V clocksource driver to make sched clock
actually return nanoseconds and not the virtual clock value which
 increments at 10e7 HZ (100ns).
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl6AxjkTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoX7ED/9o8WQx1wmZ9B70FZYWkCBim4YX1GzW
 IBNy5dNJKnfDyHQ7kUip5sr2V9RtTU1vTjfEF/dw4NlgFEbk8Z2K4KvraET0DX5v
 UISL64TIl2lNhM0ey7WNX9XRxN2CG1FveAwRqKiTDQP9lla/DN7q7LyOGn8iBamB
 ulCYY6Kp2uppA47XbbwITy3mDbKEGma8nAUNsF0sUcHxDq2BUUBwDdbgWuc6i7b3
 nQa4eqvo2ZHQSkCze5/KAsz0MeBTULOzhtrrZ8SE8lEGyz0KLdymi8hneg0zXexP
 yTJ/IKaP8kfL+yjcJJGBGvZsEbKdUurIMDBUl82955MUSBV0XbYUliiUVNk28n7k
 7UsJPua69erHqWFKa4ucqGzURkUhHmpvNcRiOnZNMGuXKuV2t+CEfU1PLu/nVKos
 MbRw9tSNJ4LDt9wmuJiV3o+xBEXW2QuIgWg1aSdvxdOfRrP4f8zkeojezMO0YO9O
 plMKIlNoH2Xf0sfmMWGze2l/cqpqLgXPzF+e62+M9/wdJDBTLgTdUpFQq3qu46Lc
 vWTTgmNIWO2UDTDKukCTITk/kI0AYtayAnNmcjvviGOO8ge2vlUaOxd6nEiF4y46
 U7X9VffZODopxvjmKX/8tS9sB3cPb/vWqFND0jY4SZzx0SXQFRgPNXsARA+q/Fan
 AqNrpL4auLlTlw==
 =qZa/
 -----END PGP SIGNATURE-----

Merge tag 'timers-urgent-2020-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fix from Thomas Gleixner:
 "A single fix for the Hyper-V clocksource driver to make sched clock
  actually return nanoseconds and not the virtual clock value which
  increments at 10e7 HZ (100ns)"

* tag 'timers-urgent-2020-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  clocksource/drivers/hyper-v: Make sched clock return nanoseconds correctly
2020-03-29 10:36:29 -07:00
Linus Torvalds
01af08bd24 A single bugfix to prevent reference leaks in irq affinity notifiers.
-----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl6AxL0THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoeYsD/9HDNL8ckEGtPN8LRMSH7BLOAQcKaDk
 c1zoQoa5lW94xT/aqr++IAoaWEr+dJFGWwha7A2q37iiUgzyCAAP5JXseOi5n8Mw
 xzrj/jVwY7sNxNwNtdbJqjTdrijmfmxku9eddhll3EncNP96gPbeGhv+VLZuJsIB
 EkllnDTop6jnCJU6Qxgjp9P5YBQycx7O7r5pCKrsUB9iu68tfRtLZEStzYD3RRam
 mJffSZUrdGwge6GPqSRuS3sRx584O+q8II5pP2VIG6mCz4EO5NHslSph82xwGT00
 nP9aIK9urfwzXGEhwHME1VxnRJ3Ln3AkplbMnv6bs01j9ckupvW9Zut2hSe0AZ3d
 LstlERXdRhLUIlsia4vFvkoIYYo0SNlZ3ZMLZIJ8fcnzrKtHbEJGD5dXaTSLgIaB
 E2N1/FJq9C7Ur0KYas+jQsrhQpJqhJrLg72Kj06DeB1rD9xj6lSJqxWXSvS0J0ow
 sk2Xr7a096rOqjnOPstgSnqVNMR+133L9uIIdXIzyBRaExjcllTznFjLUX3ZTVJu
 9Fk+D2ZTh8aMqWNlCZmarz+6Qs2ok3KB/mMgae8qLULKQfEop2QPVi1trjrA8g/C
 Zn+txKlVGgKm0DcOfq8tN4jHr8G2PKyX+GxcF3ElThnFBgkusz+9MBeS8CqVFJvT
 Xnt483DaCYaEHg==
 =s1UQ
 -----END PGP SIGNATURE-----

Merge tag 'irq-urgent-2020-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq fix from Thomas Gleixner:
 "A single bugfix to prevent reference leaks in irq affinity notifiers"

* tag 'irq-urgent-2020-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq: Fix reference leaks on irq affinity notifiers
2020-03-29 10:07:00 -07:00
Aneesh Kumar K.V
b943f045a9 mm/sparse: fix kernel crash with pfn_section_valid check
Fix the crash like this:

    BUG: Kernel NULL pointer dereference on read at 0x00000000
    Faulting instruction address: 0xc000000000c3447c
    Oops: Kernel access of bad area, sig: 11 [#1]
    LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
    CPU: 11 PID: 7519 Comm: lt-ndctl Not tainted 5.6.0-rc7-autotest #1
    ...
    NIP [c000000000c3447c] vmemmap_populated+0x98/0xc0
    LR [c000000000088354] vmemmap_free+0x144/0x320
    Call Trace:
       section_deactivate+0x220/0x240
       __remove_pages+0x118/0x170
       arch_remove_memory+0x3c/0x150
       memunmap_pages+0x1cc/0x2f0
       devm_action_release+0x30/0x50
       release_nodes+0x2f8/0x3e0
       device_release_driver_internal+0x168/0x270
       unbind_store+0x130/0x170
       drv_attr_store+0x44/0x60
       sysfs_kf_write+0x68/0x80
       kernfs_fop_write+0x100/0x290
       __vfs_write+0x3c/0x70
       vfs_write+0xcc/0x240
       ksys_write+0x7c/0x140
       system_call+0x5c/0x68

The crash is due to NULL dereference at

	test_bit(idx, ms->usage->subsection_map);

due to ms->usage = NULL in pfn_section_valid()

With commit d41e2f3bd5 ("mm/hotplug: fix hot remove failure in
SPARSEMEM|!VMEMMAP case") section_mem_map is set to NULL after
depopulate_section_mem().  This was done so that pfn_page() can work
correctly with kernel config that disables SPARSEMEM_VMEMMAP.  With that
config pfn_to_page does

	__section_mem_map_addr(__sec) + __pfn;

where

  static inline struct page *__section_mem_map_addr(struct mem_section *section)
  {
	unsigned long map = section->section_mem_map;
	map &= SECTION_MAP_MASK;
	return (struct page *)map;
  }

Now with SPASEMEM_VMEMAP enabled, mem_section->usage->subsection_map is
used to check the pfn validity (pfn_valid()).  Since section_deactivate
release mem_section->usage if a section is fully deactivated,
pfn_valid() check after a subsection_deactivate cause a kernel crash.

  static inline int pfn_valid(unsigned long pfn)
  {
  ...
	return early_section(ms) || pfn_section_valid(ms, pfn);
  }

where

  static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
  {
	int idx = subsection_map_index(pfn);

	return test_bit(idx, ms->usage->subsection_map);
  }

Avoid this by clearing SECTION_HAS_MEM_MAP when mem_section->usage is
freed.  For architectures like ppc64 where large pages are used for
vmmemap mapping (16MB), a specific vmemmap mapping can cover multiple
sections.  Hence before a vmemmap mapping page can be freed, the kernel
needs to make sure there are no valid sections within that mapping.
Clearing the section valid bit before depopulate_section_memap enables
this.

[aneesh.kumar@linux.ibm.com: add comment]
  Link: http://lkml.kernel.org/r/20200326133235.343616-1-aneesh.kumar@linux.ibm.comLink: http://lkml.kernel.org/r/20200325031914.107660-1-aneesh.kumar@linux.ibm.com
Fixes: d41e2f3bd5 ("mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP case")
Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-29 09:47:06 -07:00
Roman Gushchin
8380ce4790 mm: fork: fix kernel_stack memcg stats for various stack implementations
Depending on CONFIG_VMAP_STACK and the THREAD_SIZE / PAGE_SIZE ratio the
space for task stacks can be allocated using __vmalloc_node_range(),
alloc_pages_node() and kmem_cache_alloc_node().

In the first and the second cases page->mem_cgroup pointer is set, but
in the third it's not: memcg membership of a slab page should be
determined using the memcg_from_slab_page() function, which looks at
page->slab_cache->memcg_params.memcg .  In this case, using
mod_memcg_page_state() (as in account_kernel_stack()) is incorrect:
page->mem_cgroup pointer is NULL even for pages charged to a non-root
memory cgroup.

It can lead to kernel_stack per-memcg counters permanently showing 0 on
some architectures (depending on the configuration).

In order to fix it, let's introduce a mod_memcg_obj_state() helper,
which takes a pointer to a kernel object as a first argument, uses
mem_cgroup_from_obj() to get a RCU-protected memcg pointer and calls
mod_memcg_state().  It allows to handle all possible configurations
(CONFIG_VMAP_STACK and various THREAD_SIZE/PAGE_SIZE values) without
spilling any memcg/kmem specifics into fork.c .

Note: This is a special version of the patch created for stable
backports.  It contains code from the following two patches:
  - mm: memcg/slab: introduce mem_cgroup_from_obj()
  - mm: fork: fix kernel_stack memcg stats for various stack implementations

[guro@fb.com: introduce mem_cgroup_from_obj()]
  Link: http://lkml.kernel.org/r/20200324004221.GA36662@carbon.dhcp.thefacebook.com
Fixes: 4d96ba3530 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages")
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Bharata B Rao <bharata@linux.ibm.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200303233550.251375-1-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-29 09:47:05 -07:00
Mina Almasry
726b7bbeaf hugetlb_cgroup: fix illegal access to memory
This appears to be a mistake in commit faced7e080 ("mm: hugetlb
controller for cgroups v2").

Essentially that commit does a hugetlb_cgroup_from_counter assuming that
page_counter_try_charge has initialized counter.

But if that has failed then it seems will not initialize counter, so
hugetlb_cgroup_from_counter(counter) ends up pointing to random memory,
causing kasan to complain.

The solution is to simply use 'h_cg', instead of
hugetlb_cgroup_from_counter(counter), since that is a reference to the
hugetlb_cgroup anyway.  After this change kasan ceases to complain.

Fixes: faced7e080 ("mm: hugetlb controller for cgroups v2")
Reported-by: syzbot+cac0c4e204952cf449b1@syzkaller.appspotmail.com
Signed-off-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Giuseppe Scrivano <gscrivan@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Link: http://lkml.kernel.org/r/20200313223920.124230-1-almasrymina@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-29 09:47:05 -07:00
David Hildenbrand
53cdc1cb29 drivers/base/memory.c: indicate all memory blocks as removable
We see multiple issues with the implementation/interface to compute
whether a memory block can be offlined (exposed via
/sys/devices/system/memory/memoryX/removable) and would like to simplify
it (remove the implementation).

1. It runs basically lockless. While this might be good for performance,
   we see possible races with memory offlining that will require at
   least some sort of locking to fix.

2. Nowadays, more false positives are possible. No arch-specific checks
   are performed that validate if memory offlining will not be denied
   right away (and such check will require locking). For example, arm64
   won't allow to offline any memory block that was added during boot -
   which will imply a very high error rate. Other archs have other
   constraints.

3. The interface is inherently racy. E.g., if a memory block is detected
   to be removable (and was not a false positive at that time), there is
   still no guarantee that offlining will actually succeed. So any
   caller already has to deal with false positives.

4. It is unclear which performance benefit this interface actually
   provides. The introducing commit 5c755e9fd8 ("memory-hotplug: add
   sysfs removable attribute for hotplug memory remove") mentioned

	"A user-level agent must be able to identify which sections
	 of memory are likely to be removable before attempting the
	 potentially expensive operation."

   However, no actual performance comparison was included.

Known users:

 - lsmem: Will group memory blocks based on the "removable" property. [1]

 - chmem: Indirect user. It has a RANGE mode where one can specify
          removable ranges identified via lsmem to be offlined. However,
          it also has a "SIZE" mode, which allows a sysadmin to skip the
          manual "identify removable blocks" step. [2]

 - powerpc-utils: Uses the "removable" attribute to skip some memory
          blocks right away when trying to find some to offline+remove.
          However, with ballooning enabled, it already skips this
          information completely (because it once resulted in many false
          negatives). Therefore, the implementation can deal with false
          positives properly already. [3]

According to Nathan Fontenot, DLPAR on powerpc is nowadays no longer
driven from userspace via the drmgr command (powerpc-utils).  Nowadays
it's managed in the kernel - including onlining/offlining of memory
blocks - triggered by drmgr writing to /sys/kernel/dlpar.  So the
affected legacy userspace handling is only active on old kernels.  Only
very old versions of drmgr on a new kernel (unlikely) might execute
slower - totally acceptable.

With CONFIG_MEMORY_HOTREMOVE, always indicating "removable" should not
break any user space tool.  We implement a very bad heuristic now.
Without CONFIG_MEMORY_HOTREMOVE we cannot offline anything, so report
"not removable" as before.

Original discussion can be found in [4] ("[PATCH RFC v1] mm:
is_mem_section_removable() overhaul").

Other users of is_mem_section_removable() will be removed next, so that
we can remove is_mem_section_removable() completely.

[1] http://man7.org/linux/man-pages/man1/lsmem.1.html
[2] http://man7.org/linux/man-pages/man8/chmem.8.html
[3] https://github.com/ibm-power-utilities/powerpc-utils
[4] https://lkml.kernel.org/r/20200117105759.27905-1-david@redhat.com

Also, this patch probably fixes a crash reported by Steve.
http://lkml.kernel.org/r/CAPcyv4jpdaNvJ67SkjyUJLBnBnXXQv686BiVW042g03FUmWLXw@mail.gmail.com

Reported-by: "Scargall, Steve" <steve.scargall@intel.com>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Nathan Fontenot <ndfont@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Karel Zak <kzak@redhat.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200128093542.6908-1-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-29 09:47:05 -07:00
Naohiro Aota
d795a90e2b mm/swapfile.c: move inode_lock out of claim_swapfile
claim_swapfile() currently keeps the inode locked when it is successful,
or the file is already swapfile (with -EBUSY).  And, on the other error
cases, it does not lock the inode.

This inconsistency of the lock state and return value is quite confusing
and actually causing a bad unlock balance as below in the "bad_swap"
section of __do_sys_swapon().

This commit fixes this issue by moving the inode_lock() and IS_SWAPFILE
check out of claim_swapfile().  The inode is unlocked in
"bad_swap_unlock_inode" section, so that the inode is ensured to be
unlocked at "bad_swap".  Thus, error handling codes after the locking now
jumps to "bad_swap_unlock_inode" instead of "bad_swap".

    =====================================
    WARNING: bad unlock balance detected!
    5.5.0-rc7+ #176 Not tainted
    -------------------------------------
    swapon/4294 is trying to release lock (&sb->s_type->i_mutex_key) at: __do_sys_swapon+0x94b/0x3550
    but there are no more locks to release!

    other info that might help us debug this:
    no locks held by swapon/4294.

    stack backtrace:
    CPU: 5 PID: 4294 Comm: swapon Not tainted 5.5.0-rc7-BTRFS-ZNS+ #176
    Hardware name: ASUS All Series/H87-PRO, BIOS 2102 07/29/2014
    Call Trace:
     dump_stack+0xa1/0xea
     print_unlock_imbalance_bug.cold+0x114/0x123
     lock_release+0x562/0xed0
     up_write+0x2d/0x490
     __do_sys_swapon+0x94b/0x3550
     __x64_sys_swapon+0x54/0x80
     do_syscall_64+0xa4/0x4b0
     entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x7f15da0a0dc7

Fixes: 1638045c36 ("mm: set S_SWAPFILE on blockdev swap devices")
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Qais Youef <qais.yousef@arm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200206090132.154869-1-naohiro.aota@wdc.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-29 09:47:05 -07:00
Linus Torvalds
e595dd9451 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Fix memory leak in vti6, from Torsten Hilbrich.

 2) Fix double free in xfrm_policy_timer, from YueHaibing.

 3) NL80211_ATTR_CHANNEL_WIDTH attribute is put with wrong type, from
    Johannes Berg.

 4) Wrong allocation failure check in qlcnic driver, from Xu Wang.

 5) Get ks8851-ml IO operations right, for real this time, from Marek
    Vasut.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (22 commits)
  r8169: fix PHY driver check on platforms w/o module softdeps
  net: ks8851-ml: Fix IO operations, again
  mlxsw: spectrum_mr: Fix list iteration in error path
  qlcnic: Fix bad kzalloc null test
  mac80211: set IEEE80211_TX_CTRL_PORT_CTRL_PROTO for nl80211 TX
  mac80211: mark station unauthorized before key removal
  mac80211: Check port authorization in the ieee80211_tx_dequeue() case
  cfg80211: Do not warn on same channel at the end of CSA
  mac80211: drop data frames without key on encrypted links
  ieee80211: fix HE SPR size calculation
  nl80211: fix NL80211_ATTR_CHANNEL_WIDTH attribute type
  xfrm: policy: Fix doulbe free in xfrm_policy_timer
  bpf: Explicitly memset some bpf info structures declared on the stack
  bpf: Explicitly memset the bpf_attr structure
  bpf: Sanitize the bpf_struct_ops tcp-cc name
  vti6: Fix memory leak of skb if input policy check fails
  esp: remove the skb from the chain when it's enqueued in cryptd_wq
  ipv6: xfrm6_tunnel.c: Use built-in RCU list checking
  xfrm: add the missing verify_sec_ctx_len check in xfrm_add_acquire
  xfrm: fix uctx len check in verify_sec_ctx_len
  ...
2020-03-28 18:55:15 -07:00
Linus Torvalds
906c40438b Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
 "Three more driver bugfixes, and two doc improvements fixing build
  warnings while we are here"

* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  i2c: pca-platform: Use platform_irq_get_optional
  i2c: st: fix missing struct parameter description
  i2c: nvidia-gpu: Handle timeout correctly in gpu_i2c_check_status()
  i2c: fix a doc warning
  i2c: hix5hd2: add missed clk_disable_unprepare in remove
2020-03-28 13:11:26 -07:00
Linus Torvalds
83fd69c933 SCSI fixes on 20200328
Two small fixes, one in drivers (qla2xxx) and one in the core (sd) to
 try to cope with USB enclosures that silently change reported
 parameters.
 
 Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
 -----BEGIN PGP SIGNATURE-----
 
 iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCXn9r4CYcamFtZXMuYm90
 dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishcRAAQCoW8Q2
 raOl/SkEJ9CrqAsuav4B4HPl+B08wno3lHnp6wEA5Tkiw1aCLjqPBx/k0FctlRaD
 wQIxxBD8dy47cFV/qfo=
 =tWW/
 -----END PGP SIGNATURE-----

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
 "Two small fixes: one in drivers (qla2xxx), and one in the core (sd) to
  try to cope with USB enclosures that silently change reported
  parameters"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: sd: Fix optimal I/O size for devices that change reported values
  scsi: qla2xxx: Fix I/Os being passed down when FC device is being deleted
2020-03-28 09:14:16 -07:00
Chris Packham
14c1fe699c i2c: pca-platform: Use platform_irq_get_optional
The interrupt is not required so use platform_irq_get_optional() to
avoid error messages like

  i2c-pca-platform 22080000.i2c: IRQ index 0 not found

Signed-off-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
2020-03-28 05:03:17 +01:00
Alain Volmat
f491c66873 i2c: st: fix missing struct parameter description
Fix a missing struct parameter description to allow
warning free W=1 compilation.

Signed-off-by: Alain Volmat <avolmat@me.com>
Reviewed-by: Patrice Chotard <patrice.chotard@st.com>
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
2020-03-28 05:00:36 +01:00
David S. Miller
a0ba26f37e Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2020-03-27

The following pull-request contains BPF updates for your *net* tree.

We've added 3 non-merge commits during the last 4 day(s) which contain
a total of 4 files changed, 25 insertions(+), 20 deletions(-).

The main changes are:

1) Explicitly memset the bpf_attr structure on bpf() syscall to avoid
   having to rely on compiler to do so. Issues have been noticed on
   some compilers with padding and other oddities where the request was
   then unexpectedly rejected, from Greg Kroah-Hartman.

2) Sanitize the bpf_struct_ops TCP congestion control name in order to
   avoid problematic characters such as whitespaces, from Martin KaFai Lau.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-27 16:18:51 -07:00