Commit Graph

56508 Commits

Author SHA1 Message Date
Linus Torvalds
f896adc42d Changes since last update:
- Fix broken project quota inode counts
 - Fix incorrect PAGE_MASK/PAGE_SIZE usage
 - Fix incorrect return value in btree verifier
 - Fix WARN_ON remap flags false positive
 - Fix splice read overflows

Merge tag 'xfs-4.20-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:
 "Here are hopefully the last set of fixes for 4.20.

  There's a fix for a longstanding statfs reporting problem with project
  quotas, a correction for page cache invalidation behaviors when
  fallocating near EOF, and a fix for a broken metadata verifier return
  code.

  Finally, the most important fix is to the pipe splicing code (aka the
  generic copy_file_range fallback) to avoid pointless short directio
  reads by only asking the filesystem for as much data as there are
  available pages in the pipe buffer. Our previous fix (simulated short
  directio reads because the number of pages didn't match the length of
  the read requested) caused subtle problems on overlayfs, so that part
  is reverted.

  Anyhow, this series passes fstests -g all on xfs and overlay+xfs, and
  has passed 17 billion fsx operations problem-free since I started
  testing.

  Summary:

   - Fix broken project quota inode counts

   - Fix incorrect PAGE_MASK/PAGE_SIZE usage

   - Fix incorrect return value in btree verifier

   - Fix WARN_ON remap flags false positive

   - Fix splice read overflows"

* tag 'xfs-4.20-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  iomap: partially revert 4721a60109 (simulated directio short read on EFAULT)
  splice: don't read more than available pipe space
  vfs: allow some remap flags to be passed to vfs_clone_file_range
  xfs: fix inverted return from xfs_btree_sblock_verify_crc
  xfs: fix PAGE_MASK usage in xfs_free_file_space
  fs/xfs: fix f_ffree value for statfs when project quota is set
2018-12-08 11:25:02 -08:00
Linus Torvalds
7f80c7325b NFS client bugfixes for Linux 4.20
Highlights include:
 
 Stable fixes:
  - Fix a page leak when using RPCSEC_GSS/krb5p to encrypt data.
 
 Bugfixes:
  - Fix a regression that causes the RPC receive code to hang
  - Fix call_connect_status() so that it handles tasks that got transmitted
    while queued waiting for the socket lock.
  - Fix a memory leak in call_encode()
  - Fix several other connect races.
  - Fix receive code error handling.
  - Use the discard iterator rather than MSG_TRUNC for compatibility with
    AF_UNIX/AF_LOCAL sockets.
  - nfs: don't dirty kernel pages read by direct-io
  - pnfs/Flexfiles fix to enforce per-mirror stateid only for NFSv4 data
    servers

Merge tag 'nfs-for-4.20-5' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 "This is mainly fallout from the updates to the SUNRPC code that is
  being triggered from less common combinations of NFS mount options.

  Highlights include:

  Stable fixes:
   - Fix a page leak when using RPCSEC_GSS/krb5p to encrypt data.

  Bugfixes:
   - Fix a regression that causes the RPC receive code to hang
   - Fix call_connect_status() so that it handles tasks that got
     transmitted while queued waiting for the socket lock.
   - Fix a memory leak in call_encode()
   - Fix several other connect races.
   - Fix receive code error handling.
   - Use the discard iterator rather than MSG_TRUNC for compatibility
     with AF_UNIX/AF_LOCAL sockets.
   - nfs: don't dirty kernel pages read by direct-io
   - pnfs/Flexfiles fix to enforce per-mirror stateid only for NFSv4
     data servers"

* tag 'nfs-for-4.20-5' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  SUNRPC: Don't force a redundant disconnection in xs_read_stream()
  SUNRPC: Fix up socket polling
  SUNRPC: Use the discard iterator rather than MSG_TRUNC
  SUNRPC: Treat EFAULT as a truncated message in xs_read_stream_request()
  SUNRPC: Fix up handling of the XDRBUF_SPARSE_PAGES flag
  SUNRPC: Fix RPC receive hangs
  SUNRPC: Fix a potential race in xprt_connect()
  SUNRPC: Fix a memory leak in call_encode()
  SUNRPC: Fix leak of krb5p encode pages
  SUNRPC: call_connect_status() must handle tasks that got transmitted
  nfs: don't dirty kernel pages read by direct-io
  flexfiles: enforce per-mirror stateid only for v4 DSes
2018-12-06 18:57:04 -08:00
Linus Torvalds
d089709045 for-4.20-rc5-tag

Merge tag 'for-4.20-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fix from David Sterba:
 "A patch in 4.19 introduced a sanity check that was too strict and a
  filesystem cannot be mounted.

  This happens for filesystems with more than 10 devices and has been
  reported by a few users so we need the fix to propagate to stable"

* tag 'for-4.20-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: tree-checker: Don't check max block group size as current max chunk size limit is unreliable
2018-12-05 09:58:17 -08:00
Darrick J. Wong
8f67b5adc0 iomap: partially revert 4721a60109 (simulated directio short read on EFAULT)
In commit 4721a60109, we tried to fix a problem wherein directio reads
into a splice pipe will bounce EFAULT/EAGAIN all the way out to
userspace by simulating a zero-byte short read.  This happens because
some directio read implementations (xfs) will call
bio_iov_iter_get_pages to grab pipe buffer pages and issue asynchronous
reads, but as soon as we run out of pipe buffers that _get_pages call
returns EFAULT, which the splice code translates to EAGAIN and bounces
out to userspace.

In that commit, the iomap code catches the EFAULT and simulates a
zero-byte read, but that causes assertion errors on regular splice reads
because xfs doesn't allow short directio reads.  This causes infinite
splice() loops and assertion failures on generic/095 on overlayfs
because xfs only permits total success or total failure of a directio
operation.  The underlying issue in the pipe splice code has now been
fixed by changing the pipe splice loop to avoid reading more data
than there is space in the pipe.

Therefore, it's no longer necessary to simulate the short directio, so
remove the hack from iomap.

Fixes: 4721a60109 ("iomap: dio data corruption and spurious errors when pipes fill")
Reported-by: Murphy Zhou <jencce.kernel@gmail.com>
Ranted-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-12-04 09:40:02 -08:00
Darrick J. Wong
1761444557 splice: don't read more than available pipe space
In commit 4721a60109, we tried to fix a problem wherein directio reads
into a splice pipe will bounce EFAULT/EAGAIN all the way out to
userspace by simulating a zero-byte short read.  This happens because
some directio read implementations (xfs) will call
bio_iov_iter_get_pages to grab pipe buffer pages and issue asynchronous
reads, but as soon as we run out of pipe buffers that _get_pages call
returns EFAULT, which the splice code translates to EAGAIN and bounces
out to userspace.

In that commit, the iomap code catches the EFAULT and simulates a
zero-byte read, but that causes assertion errors on regular splice reads
because xfs doesn't allow short directio reads.

The brokenness is compounded by splice_direct_to_actor immediately
bailing on do_splice_to returning <= 0 without ever calling ->actor
(which empties out the pipe), so if userspace calls back we'll EFAULT
again on the full pipe, and nothing ever gets copied.

Therefore, teach splice_direct_to_actor to clamp its requests to the
amount of free space in the pipe and remove the simulated short read
from the iomap directio code.
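
The clamp described above can be pictured with a small Python model.  This
is only a sketch under stated assumptions: clamp_read_len, total_bufs and
used_bufs are hypothetical names, one pipe buffer is assumed to hold one
4096-byte page, and the real change lives in the kernel's C code.

```python
PAGE_SHIFT = 12  # assumption: 4096-byte pages


def clamp_read_len(requested: int, total_bufs: int, used_bufs: int) -> int:
    """Return how many bytes to request given the free pipe buffers.

    Hypothetical model: each free pipe buffer can take one page of data.
    """
    free_bytes = (total_bufs - used_bufs) << PAGE_SHIFT
    return min(requested, free_bytes)


# A 16-buffer pipe with 12 buffers in use has room for 4 pages.
print(clamp_read_len(1 << 20, total_bufs=16, used_bufs=12))  # 16384
# A small request that already fits is left alone.
print(clamp_read_len(100, total_bufs=16, used_bufs=12))      # 100
```

With the request clamped this way, the read side never asks for more pages
than the pipe can hold, so the out-of-buffers EFAULT path is not reached.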

Fixes: 4721a60109 ("iomap: dio data corruption and spurious errors when pipes fill")
Reported-by: Murphy Zhou <jencce.kernel@gmail.com>
Ranted-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-12-04 08:50:49 -08:00
Darrick J. Wong
6744557b53 vfs: allow some remap flags to be passed to vfs_clone_file_range
In overlayfs, ovl_remap_file_range calls vfs_clone_file_range on the
lower filesystem's inode, passing through whatever remap flags it got
from its caller.  Since vfs_copy_file_range first tries a filesystem's
remap function with REMAP_FILE_CAN_SHORTEN, this can get passed through
to the second vfs_copy_file_range call, and this isn't an issue.
Change the WARN_ON to look only for the DEDUP flag.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-12-04 08:50:49 -08:00
Eric Sandeen
7d048df4e9 xfs: fix inverted return from xfs_btree_sblock_verify_crc
xfs_btree_sblock_verify_crc is a bool so should not be returning
a failaddr_t; worse, if xfs_log_check_lsn fails it returns
__this_address which looks like a boolean true (i.e. success)
to the caller.

(interestingly xfs_btree_lblock_verify_crc doesn't have the issue)
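
Why a failure address can look like success is easy to see in a toy model.
The Python sketch below is an illustration only: the function names, the
placeholder address and the lsn_ok parameter are all hypothetical, and
Python truthiness stands in for C's conversion of a non-null pointer to
bool.

```python
FAKE_FAILURE_ADDRESS = 0xFFFFFFFF81234567  # hypothetical stand-in for __this_address


def verify_crc_buggy(lsn_ok: bool):
    """Models a bool verifier that returns a failure address on error."""
    if not lsn_ok:
        return FAKE_FAILURE_ADDRESS  # non-zero, so it reads as "true"
    return True


def verify_crc_fixed(lsn_ok: bool) -> bool:
    """Models the corrected contract: False means the check failed."""
    if not lsn_ok:
        return False
    return True


def caller_sees_success(verifier, lsn_ok: bool) -> bool:
    # The caller only tests the result for truthiness.
    return bool(verifier(lsn_ok))


print(caller_sees_success(verify_crc_buggy, lsn_ok=False))  # True: failure is missed
print(caller_sees_success(verify_crc_fixed, lsn_ok=False))  # False: failure is reported
```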

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-12-04 08:50:49 -08:00
Darrick J. Wong
a579121f94 xfs: fix PAGE_MASK usage in xfs_free_file_space
In commit e53c4b598, I *tried* to teach xfs to force writeback when we
fzero/fpunch right up to EOF so that if EOF is in the middle of a page,
the post-EOF part of the page gets zeroed before we return to userspace.
Unfortunately, I missed the part where PAGE_MASK is ~(PAGE_SIZE - 1),
which means that we totally fail to zero if we're fpunching and EOF is
within the first page.  Worse yet, the same PAGE_MASK thinko plagues the
filemap_write_and_wait_range call, so we'd initiate writeback of the
entire file, which (mostly) masked the thinko.

Drop the tricky PAGE_MASK and replace it with correct usage of PAGE_SIZE
and the proper rounding macros.
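
The PAGE_MASK mix-up can be reproduced with plain integer arithmetic.  The
Python sketch below assumes 4096-byte pages; the helper names are
hypothetical and only model the two expressions the message describes
(testing whether EOF is mid-page, and picking the writeback start), not the
actual xfs code.

```python
PAGE_SIZE = 4096                 # assumption: 4 KiB pages
PAGE_MASK = ~(PAGE_SIZE - 1)     # clears the in-page offset bits


def eof_mid_page_buggy(end: int) -> bool:
    # Thinko: this keeps the page-aligned part, not the in-page offset.
    return (end & PAGE_MASK) != 0


def eof_mid_page_fixed(end: int) -> bool:
    # The in-page offset is what tells us EOF is not page aligned.
    return (end % PAGE_SIZE) != 0


def writeback_start_buggy(end: int) -> int:
    # Thinko: yields the offset within the page, i.e. a position near byte 0.
    return end & ~PAGE_MASK


def writeback_start_fixed(end: int) -> int:
    # Round down to the start of the page that contains EOF.
    return end - (end % PAGE_SIZE)


print(eof_mid_page_buggy(100), eof_mid_page_fixed(100))            # False True
print(writeback_start_buggy(10000), writeback_start_fixed(10000))  # 1808 8192
```

The first line shows the missed zeroing when EOF is within the first page;
the second shows writeback starting near the beginning of the file instead
of at the page containing EOF.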

Fixes: e53c4b598 ("xfs: ensure post-EOF zeroing happens after zeroing part of a file")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-12-04 08:50:49 -08:00
Rafael J. Wysocki
a72173ecfc Revert "exec: make de_thread() freezable"
Revert commit c22397888f "exec: make de_thread() freezable" as
requested by Ingo Molnar:

"So there's a new regression in v4.20-rc4, my desktop produces this
lockdep splat:

[ 1772.588771] WARNING: pkexec/4633 still has locks held!
[ 1772.588773] 4.20.0-rc4-custom-00213-g93a49841322b #1 Not tainted
[ 1772.588775] ------------------------------------
[ 1772.588776] 1 lock held by pkexec/4633:
[ 1772.588778]  #0: 00000000ed85fbf8 (&sig->cred_guard_mutex){+.+.}, at: prepare_bprm_creds+0x2a/0x70
[ 1772.588786] stack backtrace:
[ 1772.588789] CPU: 7 PID: 4633 Comm: pkexec Not tainted 4.20.0-rc4-custom-00213-g93a49841322b #1
[ 1772.588792] Call Trace:
[ 1772.588800]  dump_stack+0x85/0xcb
[ 1772.588803]  flush_old_exec+0x116/0x890
[ 1772.588807]  ? load_elf_phdrs+0x72/0xb0
[ 1772.588809]  load_elf_binary+0x291/0x1620
[ 1772.588815]  ? sched_clock+0x5/0x10
[ 1772.588817]  ? search_binary_handler+0x6d/0x240
[ 1772.588820]  search_binary_handler+0x80/0x240
[ 1772.588823]  load_script+0x201/0x220
[ 1772.588825]  search_binary_handler+0x80/0x240
[ 1772.588828]  __do_execve_file.isra.32+0x7d2/0xa60
[ 1772.588832]  ? strncpy_from_user+0x40/0x180
[ 1772.588835]  __x64_sys_execve+0x34/0x40
[ 1772.588838]  do_syscall_64+0x60/0x1c0

The warning gets triggered by an ancient lockdep check in the freezer:

(gdb) list *0xffffffff812ece06
0xffffffff812ece06 is in flush_old_exec (./include/linux/freezer.h:57).
52	 * DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION
53	 * If try_to_freeze causes a lockdep warning it means the caller may deadlock
54	 */
55	static inline bool try_to_freeze_unsafe(void)
56	{
57		might_sleep();
58		if (likely(!freezing(current)))
59			return false;
60		return __refrigerator(false);
61	}

I reviewed the ->cred_guard_mutex code, and the mutex is held across all
of exec() - and we always did this.

But there's this recent -rc4 commit:

> Chanho Min (1):
>       exec: make de_thread() freezable

  c22397888f1e: exec: make de_thread() freezable

I believe this commit is bogus, you cannot call try_to_freeze() from
de_thread(), because it's holding the ->cred_guard_mutex."

Reported-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-12-04 16:04:20 +01:00
Qu Wenruo
10950929e9 btrfs: tree-checker: Don't check max block group size as current max chunk size limit is unreliable
[BUG]
A completely valid btrfs will refuse to mount, with error message like:
  BTRFS critical (device sdb2): corrupt leaf: root=2 block=239681536 slot=172 \
    bg_start=12018974720 bg_len=10888413184, invalid block group size, \
    have 10888413184 expect (0, 10737418240]

This has been reported several times as the 4.19 kernel is now being
used. The filesystem refuses to mount, but is otherwise ok and booting
4.18 is a workaround.

Btrfs check returns no error, and all kernels used on this fs are later
than 2011, which should all have the 10G size limit commit.

[CAUSE]
For a 12-device btrfs, we could allocate a chunk larger than 10G due to
the stripe size being rounded up.

__btrfs_alloc_chunk()
|- max_stripe_size = 1G
|- max_chunk_size = 10G
|- data_stripe = 11
|- if (1G * 11 > 10G) {
       stripe_size = 976128930;
       stripe_size = round_up(976128930, SZ_16M) = 989855744

However the final stripe_size (989855744) * 11 = 10888413184, which is
still larger than 10G.
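
The numbers in the [CAUSE] section can be checked directly.  The Python
sketch below only redoes the arithmetic from the message; round_up is a
local helper written for this example, and the constants are the 10G limit
and 16M alignment quoted above.

```python
SZ_16M = 16 * 1024 * 1024
MAX_CHUNK_SIZE = 10 * 1024 ** 3  # the 10G limit from the message
DATA_STRIPES = 11


def round_up(value: int, align: int) -> int:
    """Round value up to the next multiple of align."""
    return (value + align - 1) // align * align


stripe_size = MAX_CHUNK_SIZE // DATA_STRIPES   # 976128930
stripe_size = round_up(stripe_size, SZ_16M)    # 989855744
chunk_size = stripe_size * DATA_STRIPES        # 10888413184

print(stripe_size, chunk_size, chunk_size > MAX_CHUNK_SIZE)
# 989855744 10888413184 True
```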

[FIX]
For the comprehensive check, we need to do the full check at chunk read
time, and rely on bg <-> chunk mapping to do the check.

We could just skip the length check for now.

Fixes: fce466eab7 ("btrfs: tree-checker: Verify block_group_item")
Cc: stable@vger.kernel.org # v4.19+
Reported-by: Wang Yugui <wangyugui@e16-tech.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-04 15:05:30 +01:00
Dave Kleikamp
ad3cba223a nfs: don't dirty kernel pages read by direct-io
When we use direct_IO with an NFS backing store, we can trigger a
WARNING in __set_page_dirty(), as below, since we're dirtying the page
unnecessarily in nfs_direct_read_completion().

To fix, replicate the logic in commit 53cbf3b157 ("fs: direct-io:
don't dirtying pages for ITER_BVEC/ITER_KVEC direct read").

Other filesystems that implement direct_IO handle this; most use
blockdev_direct_IO(). ceph and cifs have similar logic.
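
The rule being replicated can be stated as a one-line predicate.  The
Python sketch below is a loose model, not the NFS code: IterKind and
should_dirty_after_read are hypothetical names, and the only fact taken
from the message is that ITER_BVEC and ITER_KVEC direct reads must not
dirty their pages.

```python
from enum import Enum


class IterKind(Enum):
    IOVEC = "iovec"  # assumption: buffers supplied by userspace
    BVEC = "bvec"    # kernel pages, e.g. a loop device doing direct I/O
    KVEC = "kvec"    # kernel virtual addresses


def should_dirty_after_read(kind: IterKind) -> bool:
    """Dirty pages after a direct read unless they are kernel pages."""
    return kind not in (IterKind.BVEC, IterKind.KVEC)


for kind in IterKind:
    print(kind.name, should_dirty_after_read(kind))
```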

mount 127.0.0.1:/export /nfs
dd if=/dev/zero of=/nfs/image bs=1M count=200
losetup --direct-io=on -f /nfs/image
mkfs.btrfs /dev/loop0
mount -t btrfs /dev/loop0 /mnt/

kernel: WARNING: CPU: 0 PID: 8067 at fs/buffer.c:580 __set_page_dirty+0xaf/0xd0
kernel: Modules linked in: loop(E) nfsv3(E) rpcsec_gss_krb5(E) nfsv4(E) dns_resolver(E) nfs(E) fscache(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) fuse(E) tun(E) ip6t_rpfilter(E) ipt_REJECT(E) nf_
kernel:  snd_seq(E) snd_seq_device(E) snd_pcm(E) video(E) snd_timer(E) snd(E) soundcore(E) ip_tables(E) xfs(E) libcrc32c(E) sd_mod(E) sr_mod(E) cdrom(E) ata_generic(E) pata_acpi(E) crc32c_intel(E) ahci(E) li
kernel: CPU: 0 PID: 8067 Comm: kworker/0:2 Tainted: G            E     4.20.0-rc1.master.20181111.ol7.x86_64 #1
kernel: Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
kernel: Workqueue: nfsiod rpc_async_release [sunrpc]
kernel: RIP: 0010:__set_page_dirty+0xaf/0xd0
kernel: Code: c3 48 8b 02 f6 c4 04 74 d4 48 89 df e8 ba 05 f7 ff 48 89 c6 eb cb 48 8b 43 08 a8 01 75 1f 48 89 d8 48 8b 00 a8 04 74 02 eb 87 <0f> 0b eb 83 48 83 e8 01 eb 9f 48 83 ea 01 0f 1f 00 eb 8b 48 83 e8
kernel: RSP: 0000:ffffc1c8825b7d78 EFLAGS: 00013046
kernel: RAX: 000fffffc0020089 RBX: fffff2b603308b80 RCX: 0000000000000001
kernel: RDX: 0000000000000001 RSI: ffff9d11478115c8 RDI: ffff9d11478115d0
kernel: RBP: ffffc1c8825b7da0 R08: 0000646f6973666e R09: 8080808080808080
kernel: R10: 0000000000000001 R11: 0000000000000000 R12: ffff9d11478115d0
kernel: R13: ffff9d11478115c8 R14: 0000000000003246 R15: 0000000000000001
kernel: FS:  0000000000000000(0000) GS:ffff9d115ba00000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 00007f408686f640 CR3: 0000000104d8e004 CR4: 00000000000606f0
kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
kernel: Call Trace:
kernel:  __set_page_dirty_buffers+0xb6/0x110
kernel:  set_page_dirty+0x52/0xb0
kernel:  nfs_direct_read_completion+0xc4/0x120 [nfs]
kernel:  nfs_pgio_release+0x10/0x20 [nfs]
kernel:  rpc_free_task+0x30/0x70 [sunrpc]
kernel:  rpc_async_release+0x12/0x20 [sunrpc]
kernel:  process_one_work+0x174/0x390
kernel:  worker_thread+0x4f/0x3e0
kernel:  kthread+0x102/0x140
kernel:  ? drain_workqueue+0x130/0x130
kernel:  ? kthread_stop+0x110/0x110
kernel:  ret_from_fork+0x35/0x40
kernel: ---[ end trace 01341980905412c9 ]---

Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

[forward-ported to v4.20]
Signed-off-by: Calum Mackay <calum.mackay@oracle.com>
Reviewed-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-12-02 09:43:56 -05:00
Tigran Mkrtchyan
320f35b7bf flexfiles: enforce per-mirror stateid only for v4 DSes
Since commit bb21ce0ad2 we always enforce per-mirror stateid.
However, this makes sense only for v4+ servers.

Signed-off-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-12-02 09:43:56 -05:00
Linus Torvalds
880584176e for-linus-20181201

Merge tag 'for-linus-20181201' of git://git.kernel.dk/linux-block

Pull block layer fixes from Jens Axboe:

 - Single range elevator discard merge fix, that caused crashes (Ming)

 - Fix for a regression in O_DIRECT, where we could potentially lose the
   error value (Maximilian Heyne)

 - NVMe pull request from Christoph, with little fixes all over the map
   for NVMe.

* tag 'for-linus-20181201' of git://git.kernel.dk/linux-block:
  block: fix single range discard merge
  nvme-rdma: fix double freeing of async event data
  nvme: flush namespace scanning work just before removing namespaces
  nvme: warn when finding multi-port subsystems without multipathing enabled
  fs: fix lost error code in dio_complete
  nvme-pci: fix surprise removal
  nvme-fc: initialize nvme_req(rq)->ctrl after calling __nvme_fc_init_request()
  nvme: Free ctrl device name on init failure
2018-12-01 11:36:32 -08:00
Linus Torvalds
d8f190ee83 Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "31 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (31 commits)
  ocfs2: fix potential use after free
  mm/khugepaged: fix the xas_create_range() error path
  mm/khugepaged: collapse_shmem() do not crash on Compound
  mm/khugepaged: collapse_shmem() without freezing new_page
  mm/khugepaged: minor reorderings in collapse_shmem()
  mm/khugepaged: collapse_shmem() remember to clear holes
  mm/khugepaged: fix crashes due to misaccounted holes
  mm/khugepaged: collapse_shmem() stop if punched or truncated
  mm/huge_memory: fix lockdep complaint on 32-bit i_size_read()
  mm/huge_memory: splitting set mapping+index before unfreeze
  mm/huge_memory: rename freeze_page() to unmap_page()
  initramfs: clean old path before creating a hardlink
  kernel/kcov.c: mark funcs in __sanitizer_cov_trace_pc() as notrace
  psi: make disabling/enabling easier for vendor kernels
  proc: fixup map_files test on arm
  debugobjects: avoid recursive calls with kmemleak
  userfaultfd: shmem: UFFDIO_COPY: set the page dirty if VM_WRITE is not set
  userfaultfd: shmem: add i_size checks
  userfaultfd: shmem/hugetlbfs: only allow to register VM_MAYWRITE vmas
  userfaultfd: shmem: allocate anonymous memory for MAP_PRIVATE shmem
  ...
2018-11-30 18:45:49 -08:00
Linus Torvalds
fd3b3e0ec5 FS-Cache fixes

Merge tag 'fscache-fixes-20181130' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull fscache and cachefiles fixes from David Howells:
 "Misc fixes:

   - Fix an assertion failure at fs/cachefiles/xattr.c:138 caused by a
     race between a cache object lookup failing and someone attempting
     to reenable that object, thereby triggering an update of the
     object's attributes.

   - Fix an assertion failure at fs/fscache/operation.c:449 caused by a
     split atomic subtract and atomic read that allows a race to happen.

   - Fix a leak of backing pages when simultaneously reading the same
     page from the same object from two or more threads.

   - Fix a hang due to a race between a cache object being discarded and
     the corresponding cookie being reenabled.

  There are also some minor cleanups:

   - Cast an enum value to a different enum type to prevent clang from
     generating a warning. This shouldn't cause any sort of change in
     the emitted code.

   - Use ktime_get_real_seconds() instead of get_seconds(). This is just
     used to uniquify a filename for an object to be placed in the
     graveyard. Objects placed there are deleted by cachefilesd in
     userspace immediately thereafter.

   - Remove an initialised, but otherwise unused variable. This should
     have been entirely optimised away anyway"

* tag 'fscache-fixes-20181130' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  fscache, cachefiles: remove redundant variable 'cache'
  cachefiles: avoid deprecated get_seconds()
  cachefiles: Explicitly cast enumerated type in put_object
  fscache: fix race between enablement and dropping of object
  cachefiles: Fix page leak in cachefiles_read_backing_file while vmscan is active
  fscache: Fix race in fscache_op_complete() due to split atomic_sub & read
  cachefiles: Fix an assertion failure when trying to update a failed object
2018-11-30 18:32:33 -08:00
Pan Bian
164f7e5867 ocfs2: fix potential use after free
ocfs2_get_dentry() calls iput(inode) to drop the reference count of
inode, and if the reference count hits 0, inode is freed.  However, in
this function, it then reads inode->i_generation, which may result in a
use after free bug.  Move the put operation later.

Link: http://lkml.kernel.org/r/1543109237-110227-1-git-send-email-bianpan2016@163.com
Fixes: 781f200cb7a ("ocfs2: Remove masklog ML_EXPORT.")
Signed-off-by: Pan Bian <bianpan2016@163.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-30 14:56:15 -08:00
Andrea Arcangeli
29ec90660d userfaultfd: shmem/hugetlbfs: only allow to register VM_MAYWRITE vmas
After the VMA to register the uffd onto is found, check that it has
VM_MAYWRITE set before allowing registration.  This way we inherit all
common code checks before allowing to fill file holes in shmem and
hugetlbfs with UFFDIO_COPY.

The userfaultfd memory model is not applicable for readonly files unless
it's a MAP_PRIVATE.

Link: http://lkml.kernel.org/r/20181126173452.26955-4-aarcange@redhat.com
Fixes: ff62a34210 ("hugetlb: implement memfd sealing")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Hugh Dickins <hughd@google.com>
Reported-by: Jann Horn <jannh@google.com>
Fixes: 4c27fe4c4c ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support")
Cc: <stable@vger.kernel.org>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-30 14:56:14 -08:00
Pan Bian
c7d7d620dc hfsplus: do not free node before using
hfs_bmap_free() frees node via hfs_bnode_put(node).  However it then
reads node->this when dumping error message on an error path, which may
result in a use-after-free bug.  This patch frees the node only when it is
never again used.

Link: http://lkml.kernel.org/r/1543053441-66942-1-git-send-email-bianpan2016@163.com
Signed-off-by: Pan Bian <bianpan2016@163.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Ernesto A. Fernandez <ernesto.mnd.fernandez@gmail.com>
Cc: Joe Perches <joe@perches.com>
Cc: Viacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-30 14:56:14 -08:00
Pan Bian
ce96a407ad hfs: do not free node before using
hfs_bmap_free() frees the node via hfs_bnode_put(node).  However, it
then reads node->this when dumping error message on an error path, which
may result in a use-after-free bug.  This patch frees the node only when
it is never again used.

Link: http://lkml.kernel.org/r/1542963889-128825-1-git-send-email-bianpan2016@163.com
Fixes: a1185ffa2fc ("HFS rewrite")
Signed-off-by: Pan Bian <bianpan2016@163.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Joe Perches <joe@perches.com>
Cc: Ernesto A. Fernandez <ernesto.mnd.fernandez@gmail.com>
Cc: Viacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-30 14:56:14 -08:00
Larry Chen
e21e57445a ocfs2: fix deadlock caused by ocfs2_defrag_extent()
ocfs2_defrag_extent may fall into deadlock.

ocfs2_ioctl_move_extents
    ocfs2_ioctl_move_extents
      ocfs2_move_extents
        ocfs2_defrag_extent
          ocfs2_lock_allocators_move_extents

            ocfs2_reserve_clusters
              inode_lock GLOBAL_BITMAP_SYSTEM_INODE

          __ocfs2_flush_truncate_log
              inode_lock GLOBAL_BITMAP_SYSTEM_INODE

As the backtrace above shows, ocfs2_reserve_clusters() will call inode_lock
against the global bitmap if the local allocator does not have sufficient
clusters.  Once the global bitmap can meet the demand,
ocfs2_reserve_clusters() will return success with the global bitmap locked.

After ocfs2_reserve_clusters(), if the truncate log is full,
__ocfs2_flush_truncate_log() will definitely deadlock because
it needs to inode_lock the global bitmap, which has already been locked.

To fix this bug, we could remove from
ocfs2_lock_allocators_move_extents() the code which intends to lock
global allocator, and put the removed code after
__ocfs2_flush_truncate_log().
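
The self-deadlock is the classic case of taking a non-recursive lock twice
on one call path.  The Python sketch below models it with threading.Lock
and a non-blocking acquire so that it reports the problem instead of
hanging; the function names are hypothetical and merely mirror the ordering
described above.

```python
import threading

global_bitmap_lock = threading.Lock()  # stands in for the global bitmap inode lock


def flush_truncate_log() -> bool:
    """Needs the global bitmap lock; returns False if it would block."""
    if not global_bitmap_lock.acquire(blocking=False):
        return False  # in the kernel this would be the deadlock
    global_bitmap_lock.release()
    return True


def defrag_old_order() -> bool:
    # Old order: reserve clusters (lock held), then flush the truncate log.
    global_bitmap_lock.acquire()
    try:
        return flush_truncate_log()
    finally:
        global_bitmap_lock.release()


def defrag_new_order() -> bool:
    # New order: flush first, take the global bitmap lock afterwards.
    flushed = flush_truncate_log()
    global_bitmap_lock.acquire()
    global_bitmap_lock.release()
    return flushed


print(defrag_old_order())  # False: the flush could not take the lock
print(defrag_new_order())  # True
```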

ocfs2_lock_allocators_move_extents() is called from two places: one is
here, and the other does not need the data allocator context, so
this patch does not affect that caller.

Link: http://lkml.kernel.org/r/20181101071422.14470-1-lchen@suse.com
Signed-off-by: Larry Chen <lchen@suse.com>
Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-30 14:56:13 -08:00
Linus Torvalds
5f1ca5c619 Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
 "Assorted fixes all over the place.

  The iov_iter one is this cycle's regression (splice from UDP triggering
  WARN_ON()), the rest is older"

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  afs: Use d_instantiate() rather than d_add() and don't d_drop()
  afs: Fix missing net error handling
  afs: Fix validation/callback interaction
  iov_iter: teach csum_and_copy_to_iter() to handle pipe-backed ones
  exportfs: do not read dentry after free
  exportfs: fix 'passing zero to ERR_PTR()' warning
  aio: fix failure to put the file pointer
  sysv: return 'err' instead of 0 in __sysv_write_inode
2018-11-30 10:47:50 -08:00
Linus Torvalds
e9eaf72e73 pstore fix:
- Fix corrupted compression due to unlucky size choice with ECC
 -----BEGIN PGP SIGNATURE-----
 Comment: Kees Cook <kees@outflux.net>
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAlwAcoMWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJjwQD/4t+0AYLLGJKkc+YEXj2aVDPftc
 cjHoNvT4voNTf3UzEsjbegfslg0g9UEAnDSOxlecJs81HYiopFpBQ2RMvYffWH7K
 qhJN9X12Fpop5DYJ7fgJJpNPqmVGY783gMNRA5jgh1SKZhK0yzTIwCElysFMoxKc
 U6cKXfkGwnZggNL3bFjqt5r1tiLVkDQVrfZgqcghrOkROmkF0I1kc+PwxRDkeCYh
 Kk3BtKMHTh3XQoX4Xqkq9bSCACYmfvLg6CuTAzqtw5bpWlgtZ3KcXxDxlbSwNe3X
 8SRr9N0qkUsbiQ/vFXY3PY2l9iI1NSVN0cDldaJ/bagOV7YL0kbQZfM01IEUK0j/
 iPrsv4ELT3w0NTYQCB47x4VOf2pt44OwNAmovAmtg71OwPKXAsFJZ4jCid2Pq4Pr
 esik+vwfroWJb+979WVcpyT9eA1P3BHEhBsl5yJV6jSwhWWBZ670RPRDNcHDol/x
 pJPGfDKTznCxwHqBdycqf1z1YtnD1VwGzd8OkNc183qoLeorew/Zv0VeYOL+6d92
 qVBj3FcKeAuzn6it+trBZ6zbGGH4Nxo68tI2BYiAMQJogxRcVjqJ1dNwoZWjdYNm
 w00jqlItJD0rwE8XAOYBoHg5J1o+/QUyz8dwf4mF2rWjqpBT+YC+Y1V5Xdhxg7D0
 xc0VYYZAYwc3XlR83A==
 =/GIT
 -----END PGP SIGNATURE-----

Merge tag 'pstore-v4.20-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull pstore fix from Kees Cook:
 "Fix corrupted compression due to unlucky size choice with ECC"

* tag 'pstore-v4.20-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  pstore/ram: Correctly calculate usable PRZ bytes
2018-11-30 09:03:15 -08:00
Colin Ian King
31ffa56383 fscache, cachefiles: remove redundant variable 'cache'
Variable 'cache' is being assigned but is never used hence it is
redundant and can be removed.

Cleans up clang warning:
warning: variable 'cache' set but not used [-Wunused-but-set-variable]

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2018-11-30 16:00:58 +00:00
Arnd Bergmann
34e06fe4d0 cachefiles: avoid deprecated get_seconds()
get_seconds() returns an unsigned long that can overflow on some
architectures and is deprecated because of that. In cachefs, we cast that
number to a 32-bit integer, which will overflow in year 2106 on all
architectures.

As confirmed by David Howells, the overflow probably isn't harmful
in the end, since the timestamps are only used to make the file names
unique, but they don't strictly have to be in monotonically increasing
order since the files only exist in order to be deleted as quickly
as possible.

Moving to ktime_get_real_seconds() avoids the deprecated interface.
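
The 2106 figure follows from simple arithmetic on an unsigned 32-bit
count of seconds since 1970.  The sketch below is a stand-alone
illustration of that arithmetic; the helper names are hypothetical and
nothing here is taken from the cachefiles code.

```c
#include <stdint.h>

static int is_leap(int year)
{
    return (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
}

/*
 * Calendar year (UTC) containing the last second that an unsigned 32-bit
 * count of seconds since 1970-01-01 can represent.
 */
static int u32_seconds_wrap_year(void)
{
    uint32_t days = UINT32_MAX / 86400u;    /* whole days since the epoch */
    int year = 1970;

    for (;;) {
        uint32_t len = is_leap(year) ? 366u : 365u;

        if (days < len)
            break;
        days -= len;
        year++;
    }
    return year;
}
```

u32_seconds_wrap_year() walks the 49710 whole days forward from 1970 and
stops in 2106, the year quoted above.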

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David Howells <dhowells@redhat.com>
2018-11-30 16:00:58 +00:00
Nathan Chancellor
b7e768b7e3 cachefiles: Explicitly cast enumerated type in put_object
Clang warns when one enumerated type is implicitly converted to another.

fs/cachefiles/namei.c:247:50: warning: implicit conversion from
enumeration type 'enum cachefiles_obj_ref_trace' to different
enumeration type 'enum fscache_obj_ref_trace' [-Wenum-conversion]
        cache->cache.ops->put_object(&xobject->fscache,
cachefiles_obj_put_wait_retry);

Silence this warning by explicitly casting to fscache_obj_ref_trace,
which is also done in put_object.
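
A small sketch of the kind of cast involved.  The two enum types and
both functions below are hypothetical; the sketch assumes, as an
illustration only, that the two enumerations list their values in the
same order so that the cast preserves the meaning.

```c
/* Hypothetical trace enum owned by a generic layer. */
enum generic_ref_trace {
    GENERIC_REF_GET,
    GENERIC_REF_PUT,
};

/* Hypothetical backend enum, assumed to mirror the generic one in order. */
enum backend_ref_trace {
    BACKEND_REF_GET,
    BACKEND_REF_PUT,
};

/* Callback that only knows about the generic type. */
static int put_object_sketch(enum generic_ref_trace why)
{
    return (int)why;
}

static int backend_put_sketch(void)
{
    /* The explicit cast documents the conversion and silences
     * -Wenum-conversion. */
    return put_object_sketch((enum generic_ref_trace)BACKEND_REF_PUT);
}
```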

Reported-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2018-11-30 16:00:58 +00:00
NeilBrown
c5a94f434c fscache: fix race between enablement and dropping of object
It was observed that a process blocked indefinitely in
__fscache_read_or_alloc_page(), waiting for FSCACHE_COOKIE_LOOKING_UP
to be cleared via fscache_wait_for_deferred_lookup().

At this time, ->backing_objects was empty, which would normally prevent
__fscache_read_or_alloc_page() from getting to the point of waiting.
This implies that ->backing_objects was cleared *after*
__fscache_read_or_alloc_page() was entered.

When an object is "killed" and then "dropped",
FSCACHE_COOKIE_LOOKING_UP is cleared in fscache_lookup_failure(), then
KILL_OBJECT and DROP_OBJECT are "called" and only in DROP_OBJECT is
->backing_objects cleared.  This leaves a window where
something else can set FSCACHE_COOKIE_LOOKING_UP and
__fscache_read_or_alloc_page() can start waiting, before
->backing_objects is cleared.

There is some uncertainty in this analysis, but it seems to fit the
observations.  Adding the wake in this patch will be handled correctly
by __fscache_read_or_alloc_page(), as it checks if ->backing_objects
is empty again, after waiting.

The customer who reported the hang also reports that the hang cannot be
reproduced with this fix.

The backtrace for the blocked process looked like:

PID: 29360  TASK: ffff881ff2ac0f80  CPU: 3   COMMAND: "zsh"
 #0 [ffff881ff43efbf8] schedule at ffffffff815e56f1
 #1 [ffff881ff43efc58] bit_wait at ffffffff815e64ed
 #2 [ffff881ff43efc68] __wait_on_bit at ffffffff815e61b8
 #3 [ffff881ff43efca0] out_of_line_wait_on_bit at ffffffff815e625e
 #4 [ffff881ff43efd08] fscache_wait_for_deferred_lookup at ffffffffa04f2e8f [fscache]
 #5 [ffff881ff43efd18] __fscache_read_or_alloc_page at ffffffffa04f2ffe [fscache]
 #6 [ffff881ff43efd58] __nfs_readpage_from_fscache at ffffffffa0679668 [nfs]
 #7 [ffff881ff43efd78] nfs_readpage at ffffffffa067092b [nfs]
 #8 [ffff881ff43efda0] generic_file_read_iter at ffffffff81187a73
 #9 [ffff881ff43efe50] nfs_file_read at ffffffffa066544b [nfs]
#10 [ffff881ff43efe70] __vfs_read at ffffffff811fc756
#11 [ffff881ff43efee8] vfs_read at ffffffff811fccfa
#12 [ffff881ff43eff18] sys_read at ffffffff811fda62
#13 [ffff881ff43eff50] entry_SYSCALL_64_fastpath at ffffffff815e986e

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2018-11-30 15:57:31 +00:00
Maximilian Heyne
41e817bca3 fs: fix lost error code in dio_complete
commit e259221763 ("fs: simplify the
generic_write_sync prototype") reworked callers of generic_write_sync(),
and ended up dropping the error return for the directio path. Prior to
that commit, in dio_complete(), an error would be bubbled up the stack,
but after that commit, errors passed on to dio_complete were eaten up.

This was reported on the list earlier, and a fix was proposed in
https://lore.kernel.org/lkml/20160921141539.GA17898@infradead.org/, but
never followed up with.  We recently hit this bug in our testing where
fencing io errors, which were previously erroring out with EIO, were
being returned as success operations after this commit.

The fix proposed on the list earlier was a little short -- it would have
still called generic_write_sync() in case `ret` already contained an
error. This fix ensures generic_write_sync() is only called when there's
no pending error in the write. Additionally, transferred is replaced
with ret to bring this code in line with other callers.
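
A user-space sketch of the rule described above: the sync step runs only
when the write has no pending error, and its result replaces ret.  The
write_sync_stub() helper is a hypothetical stand-in for
generic_write_sync(), and dio_complete_sketch() is an illustration, not
the kernel function.

```c
#include <errno.h>

static int sync_calls;  /* counts how often the stub was invoked */

/*
 * Hypothetical stand-in for generic_write_sync(): returns the byte count
 * on success or a negative errno if the sync fails.
 */
static long write_sync_stub(long count, int sync_err)
{
    sync_calls++;
    return sync_err ? sync_err : count;
}

/* ret is either a positive byte count or a negative errno. */
static long dio_complete_sketch(long ret, int sync_err)
{
    if (ret > 0)        /* no pending error: safe to sync */
        ret = write_sync_stub(ret, sync_err);
    return ret;         /* an earlier error is passed through untouched */
}
```

With this shape, an -EIO from the write is returned as -EIO without the
stub ever being called, and a sync failure after a successful write is
also reported to the caller.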

Fixes: e259221763 ("fs: simplify the generic_write_sync prototype")
Reported-by: Ravi Nankani <rnankani@amazon.com>
Signed-off-by: Maximilian Heyne <mheyne@amazon.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
CC: Torsten Mehlan <tomeh@amazon.de>
CC: Uwe Dannowski <uwed@amazon.de>
CC: Amit Shah <aams@amazon.de>
CC: David Woodhouse <dwmw@amazon.co.uk>
CC: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-11-30 08:35:14 -07:00
David Howells
73116df7bb afs: Use d_instantiate() rather than d_add() and don't d_drop()
Use d_instantiate() rather than d_add() and don't d_drop() in
afs_vnode_new_inode().  The dentry shouldn't be removed as it's not
changing its name.

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-11-29 21:08:14 -05:00
David Howells
4584ae96ae afs: Fix missing net error handling
kAFS can be given certain network errors (EADDRNOTAVAIL, EHOSTDOWN and
ERFKILL) that it doesn't handle in its server/address rotation algorithms.
They cause the probing and rotation to abort immediately rather than
rotating.

Fix this by:

 (1) Abstracting out the error prioritisation from the VL and FS rotation
     algorithms into a common function and expanding its usage into the
     server probing code.

     When multiple errors are available, this code selects the one we'd
     prefer to return.

 (2) Adding handling for EADDRNOTAVAIL, EHOSTDOWN and ERFKILL.

Fixes: 0fafdc9f88 ("afs: Fix file locking")
Fixes: 0338747d8454 ("afs: Probe multiple fileservers simultaneously")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-11-29 21:08:14 -05:00
David Howells
ae3b7361dc afs: Fix validation/callback interaction
When afs_validate() is called to validate a vnode (inode), there are two
unhandled cases in the fastpath at the top of the function:

 (1) If the vnode is promised (AFS_VNODE_CB_PROMISED is set), the break
     counters match and the data has expired, then there's an implicit case
     in which the vnode needs revalidating.

     This has no consequences since the default "valid = false" set at the
     top of the function happens to do the right thing.

 (2) If the vnode is not promised and it hasn't been deleted
     (AFS_VNODE_DELETED is not set) then there's a default case we're not
     handling in which the vnode is invalid.  If the vnode is invalid, we
     need to bring cb_s_break and cb_v_break up to date before we refetch
     the status.

     As a consequence, once the server loses track of the client
     (ie. sufficient time has passed since we last sent it an operation),
     it will send us a CB.InitCallBackState* operation when we next try to
     talk to it.  This calls afs_init_callback_state() which increments
     afs_server::cb_s_break, but this then doesn't propagate to the
     afs_vnode record.

     The result being that every afs_validate() call thereafter sends a
     status fetch operation to the server.

Clarify and fix this by:

 (A) Setting valid in all the branches rather than initialising it at the
     top so that the compiler catches where we've missed.

 (B) Restructuring the logic in the 'promised' branch so that we set valid
     to false if the callback is due to expire (or has expired) and so that
     the final case is that the vnode is still valid.

 (C) Adding an else-statement that ups cb_s_break and cb_v_break if the
     promised and deleted cases don't match.

Fixes: c435ee3455 ("afs: Overhaul the callback handling")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-11-29 21:08:14 -05:00
Kees Cook
89d328f637 pstore/ram: Correctly calculate usable PRZ bytes
The actual number of bytes stored in a PRZ is smaller than the
bytes requested by platform data, since there is a header on each
PRZ. Additionally, if ECC is enabled, there are trailing bytes used
as well. Normally this mismatch doesn't matter since PRZs are circular
buffers and the leading "overflow" bytes are just thrown away. However, in
the case of a compressed record, this rather badly corrupts the results.

This corruption was visible with "ramoops.mem_size=204800 ramoops.ecc=1".
Any stored crashes could not be decompressed (producing a pstorefs
"dmesg-*.enc.z" file), and errors were triggered at boot:

  [    2.790759] pstore: crypto_comp_decompress failed, ret = -22!

Backporting this depends on commit 70ad35db33 ("pstore: Convert console
write to use ->write_buf")

Reported-by: Joel Fernandes <joel@joelfernandes.org>
Fixes: b0aad7a99c ("pstore: Add compression support to pstore")
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2018-11-29 13:46:43 -08:00
Linus Torvalds
9af33b5745 \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAlv/mEEACgkQnJ2qBz9k
 QNm0kggA8SckBccdUZqpjxohASx9crp+jJeor2TRYCwyVfqdbDjkCOfv9k7+ddD4
 D1qN/AEudQXPzr/DrjI0W9fnFG957FLo60RQ0aGwsq/3wfmzfMJ0qXIw2tyoqjIH
 VxFecL2f3TKMr5zU9N1QoxJVAMAo5LOGbXO+qNYgTaOJ0EBpw9kxK14ib9JOi+pQ
 CItZkAv4eOWMsA/AaRsWN7AGmsziwcMdoAZYRT2wiWdTmubYn66Negnffq2jVHjP
 /kYlFjNcgpsuTg1AiTM+JlBwctEeLm37PUyimip3cdYwu6HcfNmCHDjEIzN2YphJ
 twzAauTEr0bjNFiNWYraUPVHEcspNw==
 =kKDs
 -----END PGP SIGNATURE-----

Merge tag 'fixes_for_v4.20-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull ext2 and udf fixes from Jan Kara:
 "Three small ext2 and udf fixes"

* tag 'fixes_for_v4.20-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  ext2: fix potential use after free
  ext2: initialize opts.s_mount_opt as zero before using it
  udf: Allow mounting volumes with incorrect identification strings
2018-11-29 09:56:00 -08:00
Linus Torvalds
121b018f8c for-4.20-rc4-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlv9qYgACgkQxWXV+ddt
 WDupPQ/8DdeLZYQG1tlx2Q+X4/tqPVyAzUjguYzbIY7wvSs1zbEEedENsD8E97yC
 So8ooGnP5B6/dqVidLFQBPwTXN59GybYbrDci8qh0DOJTl3+1r8byD9JC+iofrOF
 tltJkZ+eCOQyyqHHzlzw15uNOg48Qzj1oXvTAcE0P6iN5UcvcfwRW/S39pjsn63C
 63zc09XJ1hmJMJTWZo5h3GoD2UvzrwGXPKXNdv/NWkw9sqQbWdjvZFdqKbvY1VeM
 Oa6FPAPErJqEEEePhpDYbyRcnzjJRMs0deLGpGGChGldQxgMO8ILzBwh/KalfzK7
 h7LIuv1EclUqlyv0mXPqg2E/C3n2UMPqQYFsK9Lt+4Y/PkrWA2jx0lSg0fBl3k8c
 7PyiTqPNPNF8LU48tPEnOzJuNPkquOycgdyQOUpHnS43OF5OLIb6tVyjK4eJHRWw
 xtP65M72qM8T65+gsxYcdm0lvIDLidIwFS+2g4ibKU7EwlYkTC9AHFIAyFKTgxeP
 MpkIH90mKhSxOpbq8RICgr2jWcJZYoFQ4soi1oE+bgyjv75PyhJ0eXOprCh/4KZp
 nkXlPy2skkO9gGecyvr51x/opDEjEkObyOjQm2LhhWYvgcnHgW8Zp1jhQKxabHvz
 iZdVIs/agOerpk1d9ZBHhIXOeS2UcE5klqVRAdf961Wobh+HNis=
 =cCvI
 -----END PGP SIGNATURE-----

Merge tag 'for-4.20-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "Some of these bugs are being hit during testing so we'd like to get
  them merged, otherwise there are usual stability fixes for stable
  trees"

* tag 'for-4.20-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: relocation: set trans to be NULL after ending transaction
  Btrfs: fix race between enabling quotas and subvolume creation
  Btrfs: send, fix infinite loop due to directory rename dependencies
  Btrfs: ensure path name is null terminated at btrfs_control_ioctl
  Btrfs: fix rare chances for data loss when doing a fast fsync
  btrfs: Always try all copies when reading extent buffers
2018-11-28 08:38:20 -08:00
Kiran Kumar Modukuri
9a24ce5b66 cachefiles: Fix page leak in cachefiles_read_backing_file while vmscan is active
[Description]

In a heavily loaded system where the system pagecache is nearing memory
limits and fscache is enabled, pages can be leaked by fscache while trying
to read pages from the cachefiles backend.  This can happen because two
applications can be reading the same page from a single mount, so two
threads can be trying to read the backing page at the same time.  This
results in one of the threads finding that a page for the backing file or
netfs file is already in the radix tree.  During the error handling,
cachefiles does not clean up the reference on the backing page, leading to
a page leak.

[Fix]
The fix is straightforward: decrement the reference when an error is
encountered.

  [dhowells: Note that I've removed the clearance and put of newpage as
   they aren't attested in the commit message and don't appear to actually
   achieve anything since a new page is only allocated if newpage!=NULL and
   any residual new page is cleared before returning.]

[Testing]
I have tested the fix using the following method for 12+ hrs.

1) mkdir -p /mnt/nfs ; mount -o vers=3,fsc <server_ip>:/export /mnt/nfs
2) create 10000 files of 2.8MB in an NFS mount.
3) start a thread to simulate heavy VM pressure
   (while true ; do echo 3 > /proc/sys/vm/drop_caches ; sleep 1 ; done)&
4) start multiple parallel readers for the data set at the same time
   find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
   find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
   find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
   ..
   ..
   find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
   find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
5) finally check using cat /proc/fs/fscache/stats | grep -i pages ;
   free -h , cat /proc/meminfo and page-types -r -b lru
   to ensure all pages are freed.

Reviewed-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Shantanu Goel <sgoel01@yahoo.com>
Signed-off-by: Kiran Kumar Modukuri <kiran.modukuri@gmail.com>
[dja: forward ported to current upstream]
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: David Howells <dhowells@redhat.com>
2018-11-28 14:47:05 +00:00
David Howells
e6bc06faf6 cachefiles: Fix an assertion failure when trying to update a failed object
If cachefiles gets an error other than ENOENT when trying to look up an
object in the cache (in this case, EACCES), the object state machine will
eventually transition to the DROP_OBJECT state.

This state invokes fscache_drop_object() which tries to sync the auxiliary
data with the cache (this is done lazily since commit 402cb8dda9) on an
incomplete cache object struct.

The problem comes when cachefiles_update_object_xattr() is called to
rewrite the xattr holding the data.  There's an assertion there that the
cache object points to a dentry as we're going to update its xattr.  The
assertion trips, however, as dentry didn't get set.

Fix the problem by skipping the update in cachefiles if the object doesn't
refer to a dentry.  A better way to do it could be to skip the update from
the DROP_OBJECT state handler in fscache, but that might deny the cache the
opportunity to update intermediate state.

If this error occurs, the kernel log includes lines that look like the
following:

 CacheFiles: Lookup failed error -13
 CacheFiles:
 CacheFiles: Assertion failed
 ------------[ cut here ]------------
 kernel BUG at fs/cachefiles/xattr.c:138!
 ...
 Workqueue: fscache_object fscache_object_work_func [fscache]
 RIP: 0010:cachefiles_update_object_xattr.cold.4+0x18/0x1a [cachefiles]
 ...
 Call Trace:
  cachefiles_update_object+0xdd/0x1c0 [cachefiles]
  fscache_update_aux_data+0x23/0x30 [fscache]
  fscache_drop_object+0x18e/0x1c0 [fscache]
  fscache_object_work_func+0x74/0x2b0 [fscache]
  process_one_work+0x18d/0x340
  worker_thread+0x2e/0x390
  ? pwq_unbound_release_workfn+0xd0/0xd0
  kthread+0x112/0x130
  ? kthread_bind+0x30/0x30
  ret_from_fork+0x35/0x40

Note that there are actually two issues here: (1) EACCES happened on a
cache object and (2) an oops occurred.  I think that the second is a
consequence of the first (it certainly looks like it ought to be).  This
patch only deals with the second.

Fixes: 402cb8dda9 ("fscache: Attach the index key and aux data to the cookie")
Reported-by: Zhibin Li <zhibli@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2018-11-28 13:19:20 +00:00
Pan Bian
ecebf55d27 ext2: fix potential use after free
The function ext2_xattr_set calls brelse(bh) to drop the reference count
of bh. After that, bh may be freed. However, following brelse(bh),
it reads bh->b_data via macro HDR(bh). This may result in a
use-after-free bug. This patch moves brelse(bh) after the read of that field.
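
A user-space sketch of the ordering rule, using a hypothetical
reference-counted buffer in place of a buffer_head.  buf_new(),
buf_put() and read_then_release() are illustrative names only; buf_put()
plays the role of brelse() and may free the object.

```c
#include <stdlib.h>

/* Hypothetical reference-counted buffer. */
struct buf_sketch {
    int refcount;
    int magic;
};

static struct buf_sketch *buf_new(int magic)
{
    struct buf_sketch *b = malloc(sizeof(*b));

    if (b) {
        b->refcount = 1;
        b->magic = magic;
    }
    return b;
}

/* Drops one reference; the buffer is freed when the count reaches zero. */
static void buf_put(struct buf_sketch *b)
{
    if (--b->refcount == 0)
        free(b);
}

/* Read every needed field first, drop the reference last. */
static int read_then_release(struct buf_sketch *b)
{
    int magic = b->magic;   /* still holding our reference here */

    buf_put(b);             /* b must not be touched after this */
    return magic;
}
```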

CC: stable@vger.kernel.org
Signed-off-by: Pan Bian <bianpan2016@163.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-11-27 10:21:15 +01:00
xingaopeng
e5f5b71798 ext2: initialize opts.s_mount_opt as zero before using it
We need to initialize opts.s_mount_opt as zero before using it, else we
may get some unexpected mount options.

Fixes: 088519572c ("ext2: Parse mount options into a dedicated structure")
CC: stable@vger.kernel.org
Signed-off-by: xingaopeng <xingaopeng@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-11-27 10:21:03 +01:00
Ye Yin
de7243057e fs/xfs: fix f_ffree value for statfs when project quota is set
When a project quota is set, f_ffree should be the inode limit minus the
used count.
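
A one-function sketch of that calculation.  project_ffree() is a
hypothetical helper, and clamping to zero when usage already exceeds the
limit is an assumption made here to avoid unsigned wrap-around, not
something stated in this message.

```c
#include <stdint.h>

/* Free inodes to report for a project: limit minus used, never negative. */
static uint64_t project_ffree(uint64_t inode_limit, uint64_t inodes_used)
{
    return inode_limit > inodes_used ? inode_limit - inodes_used : 0;
}
```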

Signed-off-by: Ye Yin <dbyin@tencent.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-11-26 15:01:37 -08:00
Linus Torvalds
17c2f54086 NFS client bugfixes for Linux 4.20
Highlights include:
 
 Bugfixes:
  - Fix a NFSv4 state manager deadlock when returning a delegation
  - NFSv4.2 copy do not allocate memory under the lock
  - flexfiles: Use the correct stateid for IO in the tightly coupled case
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJb+hCNAAoJEA4mA3inWBJc8ZQP/jR+uemJycwgyWINvnn6PEtE
 hyiSwL+c3jhBHwX2IroF1KvaHIa8GXMbIWj+DfW1iB2htYnIJYz4IFJOGpfN1S7n
 bKCgonV0V06+dFF4DqcL3HcM1L6bo26n16voi3otgY0R5U5HGwB1tocZPCbR6DpK
 meiRbrmB6O962zluUlTuu9zFSvsALyZR0h4tYMGYA0MlgWQJVLH6+dufyG2Zgp2Z
 OH9tUzRFknD/Q4KrJv7zrMY198mHa+RQovsO2Jc/iE4bbrSMyVNtrPuVJphsP1BD
 lZ5SvvWLXjNepUMsDCK+Es7i6dUmtHsGPS6gNDwUwY9/UlwOPYlp44VJzmEYmQcz
 /VrrHn3LSoKDSAVNrksghto9O4T1NPnuVja1Q+SHf5hVX5OlsxyDkvX23ZUdgdkW
 BeXeNWZuAJdDTI1KU+ahm2ilfUnuFpRGRHUrH2sYczV2okC38cO5YCIRI3Tckz6e
 jrhmJcw+zCWv3Yl3h2Rbf8fVRcWJHA+qLWT3Str5nCyZiqPCag7Z7br9r5316zDv
 Yma7nITZO7HH1cZUv+byA0PVHU96kDsMhhpxYISrSr4lf2BcZNnjQC/0IHb7qdWz
 FgpYzv/BsIi+KxyZKshiR5E60kOmVxv2wIhre8uLOuuabcGsh/wit6URVnQJ+GDp
 7klRY1t1P24XaIbgBR9U
 =hqbe
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.20-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:

 - Fix a NFSv4 state manager deadlock when returning a delegation

 - NFSv4.2 copy do not allocate memory under the lock

 - flexfiles: Use the correct stateid for IO in the tightly coupled case

* tag 'nfs-for-4.20-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  flexfiles: use per-mirror specified stateid for IO
  NFSv4.2 copy do not allocate memory under the lock
  NFSv4: Fix a NFSv4 state manager deadlock
2018-11-25 09:19:58 -08:00
Linus Torvalds
e2125dac22 XArray updates for 4.20-rc4
We found some bugs in the DAX conversion to XArray (and one bug which
 predated the XArray conversion).  There were a couple of bugs in some of
 the higher-level functions, which aren't actually being called in today's
 kernel, but surfaced as a result of converting existing radix tree &
 IDR users over to the XArray.  Some of the other changes to how the
 higher-level APIs work were also motivated by converting various users;
 again, they're not in use in today's kernel, so changing them has a low
 probability of introducing a bug.
 
 Dan can still trigger a bug in the DAX code with hot-offline/online,
 and we're working on tracking that down.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCgAyFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAlv542AUHHdpbGx5QGlu
 ZnJhZGVhZC5vcmcACgkQDpNsjXcpgj5BoAf/QZzbBcYuYMLMDYofvHKGlmk2yx/a
 ObUlxlQtXGHvPp3oC3rdwAvcN/KAMDpU0u+PXab2MnrNw5okhpS6ZwGODlkarNA4
 XbVQNGbtEbACr1V3CWc0NzLbYm6JtGpMum0Wx9MVR/VdTnGArBLBYQMYa/c1YhKA
 vEBPf+w0j0QoCTAgPiIvq0aksuBQERUvjhlUvoaMY7F4sAhnaW558lvaEcc1xGxq
 70+3cRPT6Uh12tEvi0LKP1NNEXebvQSftMvFEUPF2xo5z2v//KEobzv/anbojxQ8
 BtxouIGSr4tME9g3xSpd9rTbUcW3bwDAhuWZvpP/ViRwW2UkEQonpApdaw==
 =0Ert
 -----END PGP SIGNATURE-----

Merge tag 'xarray-4.20-rc4' of git://git.infradead.org/users/willy/linux-dax

Pull XArray updates from Matthew Wilcox:
 "We found some bugs in the DAX conversion to XArray (and one bug which
  predated the XArray conversion). There were a couple of bugs in some
  of the higher-level functions, which aren't actually being called in
  today's kernel, but surfaced as a result of converting existing radix
  tree & IDR users over to the XArray.

  Some of the other changes to how the higher-level APIs work were also
  motivated by converting various users; again, they're not in use in
  today's kernel, so changing them has a low probability of introducing
  a bug.

  Dan can still trigger a bug in the DAX code with hot-offline/online,
  and we're working on tracking that down"

* tag 'xarray-4.20-rc4' of git://git.infradead.org/users/willy/linux-dax:
  XArray tests: Add missing locking
  dax: Avoid losing wakeup in dax_lock_mapping_entry
  dax: Fix huge page faults
  dax: Fix dax_unlock_mapping_entry for PMD pages
  dax: Reinstate RCU protection of inode
  dax: Make sure the unlocking entry isn't locked
  dax: Remove optimisation from dax_lock_mapping_entry
  XArray tests: Correct some 64-bit assumptions
  XArray: Correct xa_store_range
  XArray: Fix Documentation
  XArray: Handle NULL pointers differently for allocation
  XArray: Unify xa_store and __xa_store
  XArray: Add xa_store_bh() and xa_store_irq()
  XArray: Turn xa_erase into an exported function
  XArray: Unify xa_cmpxchg and __xa_cmpxchg
  XArray: Regularise xa_reserve
  nilfs2: Use xa_erase_irq
  XArray: Export __xa_foo to non-GPL modules
  XArray: Fix xa_for_each with a single element at 0
2018-11-24 18:44:01 -08:00
Linus Torvalds
abe72ff413 Changes since last update:
- Numerous corruption fixes for copy on write
 - Numerous corruption fixes for blocksize < pagesize writes
 - Don't miscalculate AG reservations for small final AGs
 - Fix page cache truncation to work properly for reflink and extent
   shifting
 - Fix use-after-free when retrying failed inode/dquot buffer logging
 - Fix corruptions seen when using copy_file_range in directio mode
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAlv1oroACgkQ+H93GTRK
 tOu+1RAAnteFwaq3WLDYmSrMia/4m52sxvatxlCCSjCGdvfvuZozTwbTdB6FFfuc
 Ql6Z6F2Lx1sHDNJvwBCsO8qPB0qOhjSnBI/wPe2kz/NETGYNp88vHX7OZvkPVONl
 jDaCWTcu0BWNiOGi17uTY8sBa8u1izbo5F+pEQIyUjoCgUc9JB2di9dVnUJ0byrh
 wZjrmPD95ojqOozqppXfFQ0QIbozpTXR3kyU9S0EhHmbnWJZ9t08Iuhd2LjOoDB4
 cUFG/1qDXuFvALyM8m1mA7xSBZpA/glFgNeAtmz53aIOZ9Tl8w8cLJJBRx5AqUDU
 bpBU1y08Bm3OIw+uiTMkiPkCQRMDgtiHKlPxuiKqlsNY0KqYgwWlWcbU/OTvHly8
 In+CnbEBqLejKyEIz3nEQ4YimfvHbAlC/3V+nx2qO45hvTXA5lEIGAbBmiLW0ni8
 6eBXGeIjKAw0zYOoXC4OuKIiHlQh7AHJB25i9xJTzknRI9jqwZFGkxgdl33Vrq8W
 gTnfgOhMX2dGmcPrgMgtu+aiBwKf+GJv94/2EJwligExnWXQSsQmGCwKl7ysoE1g
 iU/MhJT5IYYP/TDqldkahUPSwD2FN4UFtzNfpeDX3H6kxM1R41l+aerdu64UPNji
 G98U+cWyyUmbu9ziLyREM/XyWz4UhNAz7lRId3ryeu8GPUm2AoY=
 =TiLQ
 -----END PGP SIGNATURE-----

Merge tag 'xfs-4.20-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:
 "Dave and I have continued our work fixing corruption problems that can
  be found when running long-term burn-in exercisers on xfs. Here are
  some patches fixing most of the problems, but there will likely be
  more. :/

   - Numerous corruption fixes for copy on write

   - Numerous corruption fixes for blocksize < pagesize writes

   - Don't miscalculate AG reservations for small final AGs

   - Fix page cache truncation to work properly for reflink and extent
     shifting

   - Fix use-after-free when retrying failed inode/dquot buffer logging

   - Fix corruptions seen when using copy_file_range in directio mode"

* tag 'xfs-4.20-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  iomap: readpages doesn't zero page tail beyond EOF
  vfs: vfs_dedupe_file_range() doesn't return EOPNOTSUPP
  iomap: dio data corruption and spurious errors when pipes fill
  iomap: sub-block dio needs to zeroout beyond EOF
  iomap: FUA is wrong for DIO O_DSYNC writes into unwritten extents
  xfs: delalloc -> unwritten COW fork allocation can go wrong
  xfs: flush removing page cache in xfs_reflink_remap_prep
  xfs: extent shifting doesn't fully invalidate page cache
  xfs: finobt AG reserves don't consider last AG can be a runt
  xfs: fix transient reference count error in xfs_buf_resubmit_failed_buffers
  xfs: uncached buffer tracing needs to print bno
  xfs: make xfs_file_remap_range() static
  xfs: fix shared extent data corruption due to missing cow reservation
2018-11-24 09:11:52 -08:00
Linus Torvalds
b88af99487 Power management fixes for 4.20-rc4
- Fix tasks freezer deadlock in de_thread() that occurs if one
    of its sub-threads has been frozen already (Chanho Min).
 
  - Avoid registering a platform device by the ti-cpufreq driver
    on platforms that cannot use it (Dave Gerlach).
 
  - Fix a mistake in the ti-opp-supply operating performance points
    (OPP) driver that caused an incorrect reference voltage to be
    used and make it adjust the minimum voltage dynamically to avoid
    hangs or crashes in some cases (Keerthy).
 
  - Fix issues related to compiler flags in the cpupower utility
    and correct a linking problem in it by renaming a file with
    a duplicate name (Jiri Olsa, Konstantin Khlebnikov).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJb99ANAAoJEILEb/54YlRxUk4QAIkpXmLdEm0zvQXJZKJxuxu/
 8eiglPcoCAAVgR3uRjezTyQeFCDjqGIqmpiipFOkxgAaeb+aiFta2B88B4oZ3OAf
 G5yJG4CMs5+Tp/oH2+McX4PMo2WfoEIDoOGVOdtlk4iGAIh9tT5KvNfCUyxHLP7c
 cFjdeUQuc8PRPzutESxYHB23FmECCEprVmEFVckQ7vSCnWX6dm1mFteCiNZADH8/
 YNgRWRYtyWXlPRNCPJuCvYmVHLgm0Tw6CVhG4ttXT9wIkYyiLK9Flx9X6gsBWdoS
 ej1jbJM7R/EAUzHvCjUCSfMKDNWvQySR3gC2m0vG5Goext2UXWieSHaI6BfqjPP2
 wTBX9lAKu3iIGISoorjJZAMe4v26Tpbs0v5ApsOGr50WNZKZtRkBwaJ0EFT2hEa8
 MsDtUjr+d7CIQ+ZDHmAbOzghpDlSuhGypfkD7IPtA/AY1dy1wJ3BUYya8ucqSdr3
 iKkQcndN3ASR1+Fyb3c1cflh9fWe84aeSmYPihcX2m+/D43keYXbj2ANdCH1dF3E
 cwAXw9+VteUQ1tihqgJapFogK3VphMbRErPSGXryuyiMxr38g7tgHAd2rId0JG4g
 QftqpZa19aEed8tJNaQWWjBaBFgYeAT9BQqcP6CjFOO9klPH//NSiqzQUFMdzsEc
 Qql1V/aoOIsflb7gJH3/
 =JQxr
 -----END PGP SIGNATURE-----

Merge tag 'pm-4.20-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "These fix two issues in the Operating Performance Points (OPP)
  framework, one cpufreq driver issue, one problem related to the tasks
  freezer and a few build-related issues in the cpupower utility.

  Specifics:

   - Fix tasks freezer deadlock in de_thread() that occurs if one of its
     sub-threads has been frozen already (Chanho Min).

   - Avoid registering a platform device by the ti-cpufreq driver on
     platforms that cannot use it (Dave Gerlach).

   - Fix a mistake in the ti-opp-supply operating performance points
     (OPP) driver that caused an incorrect reference voltage to be used
     and make it adjust the minimum voltage dynamically to avoid hangs
     or crashes in some cases (Keerthy).

   - Fix issues related to compiler flags in the cpupower utility and
     correct a linking problem in it by renaming a file with a duplicate
     name (Jiri Olsa, Konstantin Khlebnikov)"

* tag 'pm-4.20-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  exec: make de_thread() freezable
  cpufreq: ti-cpufreq: Only register platform_device when supported
  opp: ti-opp-supply: Correct the supply in _get_optimal_vdd_voltage call
  opp: ti-opp-supply: Dynamically update u_volt_min
  tools cpupower: Override CFLAGS assignments
  tools cpupower debug: Allow to use outside build flags
  tools/power/cpupower: fix compilation with STATIC=true
2018-11-23 10:52:57 -08:00
Pan Bian
2084ac6c50 exportfs: do not read dentry after free
The function dentry_connected calls dput(dentry) to drop the previously
acquired reference to dentry. In this case, dentry can be released.
After that, IS_ROOT(dentry) checks the condition
(dentry == dentry->d_parent), which may result in a use-after-free bug.
This patch directly compares dentry with its parent obtained before
dropping the reference.

Fixes: a056cc8934c ("exportfs: stop retrying once we race with rename/remove")

Signed-off-by: Pan Bian <bianpan2016@163.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-11-23 09:08:17 -05:00
Pan Bian
42a657f576 btrfs: relocation: set trans to be NULL after ending transaction
The function relocate_block_group calls btrfs_end_transaction to release
trans when update_backref_cache returns 1, and then continues the loop
body. If btrfs_block_rsv_refill fails this time, it will jump out of the
loop and the freed trans will be accessed. This may result in a
use-after-free bug. The patch assigns NULL to trans after trans is
released so that it will not be accessed.
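
A minimal userspace model of that loop shape, assuming hypothetical
names throughout (struct txn, txn_end(), run_loop()); it is not the
btrfs code. Each slot counts how often it was ended, so ending a handle
twice would be visible.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical transaction handle that counts how often it was ended. */
struct txn {
	int end_calls;
};

static void txn_end(struct txn *t)
{
	t->end_calls++;
}

/*
 * refill_ok[i]: does the reservation step succeed on iteration i?
 * restart[i]:   does iteration i end its transaction early and retry?
 */
static int run_loop(struct txn *slots, const int *refill_ok,
		    const int *restart, int n)
{
	struct txn *trans = NULL;
	int err = 0;

	assert(slots != NULL && refill_ok != NULL && restart != NULL);

	for (int i = 0; i < n; i++) {
		if (!refill_ok[i]) {
			err = -1;
			break;		/* leaves the loop with trans as-is */
		}
		trans = &slots[i];	/* "start" a transaction */
		if (restart[i]) {
			txn_end(trans);
			trans = NULL;	/* forget the released handle */
			continue;
		}
		txn_end(trans);		/* normal end of the iteration */
		trans = NULL;
	}

	if (trans)			/* cleanup only sees a live handle */
		txn_end(trans);
	return err;
}
```

Without the assignment in the restart branch, a failed refill on the
next iteration would reach the cleanup with a stale pointer and end the
same handle a second time.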

Fixes: 0647bf564f ("Btrfs: improve forever loop when doing balance relocation")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Pan Bian <bianpan2016@163.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-11-23 13:47:46 +01:00
Tigran Mkrtchyan
bb21ce0ad2 flexfiles: use per-mirror specified stateid for IO
rfc8435 says:

  For tight coupling, ffds_stateid provides the stateid to be used by
  the client to access the file.

However, the current implementation replaces the per-mirror stateid with
the open or lock stateid.

Ensure that per-mirror stateid is used by ff_layout_write_prepare_v4 and
nfs4_ff_layout_prepare_ds.

Signed-off-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Signed-off-by: Rick Macklem <rmacklem@uoguelph.ca>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-11-22 14:04:55 -05:00
Olga Kornievskaia
99f2c55591 NFSv4.2 copy do not allocate memory under the lock
Bruce pointed out that we shouldn't allocate memory while holding
a lock in nfs4_callback_offload() and handle_async_copy(), which
deal with the case of a CB_OFFLOAD racing with the reply to COPY.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-11-22 13:54:46 -05:00
Filipe Manana
552f0329c7 Btrfs: fix race between enabling quotas and subvolume creation
We have a race between enabling quotas and subvolume creation that causes
subvolume creation to fail with -EINVAL, and the following diagram shows
how it happens:

              CPU 0                                          CPU 1

 btrfs_ioctl()
  btrfs_ioctl_quota_ctl()
   btrfs_quota_enable()
    mutex_lock(fs_info->qgroup_ioctl_lock)

                                                  btrfs_ioctl()
                                                   create_subvol()
                                                    btrfs_qgroup_inherit()
                                                     -> save fs_info->quota_root
                                                        into quota_root
                                                     -> stores a NULL value
                                                     -> tries to lock the mutex
                                                        qgroup_ioctl_lock
                                                        -> blocks waiting for
                                                           the task at CPU0

   -> sets BTRFS_FS_QUOTA_ENABLED in fs_info
   -> sets quota_root in fs_info->quota_root
      (non-NULL value)

   mutex_unlock(fs_info->qgroup_ioctl_lock)

                                                     -> checks quota enabled
                                                        flag is set
                                                     -> returns -EINVAL because
                                                        fs_info->quota_root was
                                                        NULL before it acquired
                                                        the mutex
                                                        qgroup_ioctl_lock
                                                   -> ioctl returns -EINVAL

Returning -EINVAL to user space will be confusing if all the arguments
passed to the subvolume creation ioctl were valid.

Fix it by grabbing the value from fs_info->quota_root after acquiring
the mutex.
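
The interleaving in the diagram can be replayed deterministically with a
single-threaded model. Everything here is an assumption made for
illustration: struct fs_state, enable_quota(), inherit_racy() and
inherit_fixed() are hypothetical, and the while_blocked callback stands
in for "the other CPU runs while we wait for qgroup_ioctl_lock".

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant fs_info fields. */
struct fs_state {
	int quota_enabled;
	void *quota_root;
};

/* Work done by the other CPU while we are blocked on the mutex. */
typedef void (*other_cpu_fn)(struct fs_state *fs);

static int dummy_root;

/* Models the effect of enabling quotas on the other CPU. */
static void enable_quota(struct fs_state *fs)
{
	fs->quota_enabled = 1;
	fs->quota_root = &dummy_root;
}

/* Samples quota_root before acquiring the lock. */
static int inherit_racy(struct fs_state *fs, other_cpu_fn while_blocked)
{
	void *root = fs->quota_root;	/* may still be NULL here */

	while_blocked(fs);		/* models waiting for the mutex */
	if (!fs->quota_enabled)
		return 0;
	return root ? 0 : -EINVAL;
}

/* Samples quota_root only after the lock has been acquired. */
static int inherit_fixed(struct fs_state *fs, other_cpu_fn while_blocked)
{
	void *root;

	while_blocked(fs);		/* models waiting for the mutex */
	if (!fs->quota_enabled)
		return 0;
	root = fs->quota_root;		/* consistent with the flag */
	return root ? 0 : -EINVAL;
}
```

With quotas initially disabled and enable_quota passed as the callback,
the racy variant returns -EINVAL while the fixed variant returns 0.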

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-11-22 18:59:59 +01:00
Dave Chinner
8c110d43c6 iomap: readpages doesn't zero page tail beyond EOF
When we read the EOF page of the file via readpages, we need
to zero the region beyond EOF that we either do not read or
should not contain data so that mmap does not expose stale data to
user applications.

However, iomap_adjust_read_range() fails to detect EOF correctly,
and so fsx on 1k block size filesystems fails very quickly with
mapreads exposing data beyond EOF. There are two problems here.

Firstly, when calculating the end block of the EOF byte, we have
to subtract one from the size so that a block aligned EOF does not
report an end block that is one too large. i.e. a size of 1024 bytes
is 1 block, which in index terms is block 0. Therefore we have to
calculate the end block from (isize - 1), not isize.
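
As a worked example of that arithmetic, here is a small sketch; the
helper names eof_block_index() and eof_block_index_naive() are
hypothetical, and block_bits is log2 of the block size (10 for 1k
blocks).

```c
#include <assert.h>
#include <stdint.h>

/* Index of the block holding the last byte of a file of isize bytes. */
static uint64_t eof_block_index(uint64_t isize, unsigned block_bits)
{
	assert(isize > 0);
	return (isize - 1) >> block_bits;
}

/* The off-by-one variant, kept only for comparison. */
static uint64_t eof_block_index_naive(uint64_t isize, unsigned block_bits)
{
	return isize >> block_bits;
}
```

For isize = 1024 and 1k blocks the first helper gives block 0 and the
naive one gives block 1; for any size that is not block aligned the two
agree.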

The second bug is in determining whether the current page spans EOF, and
so whether we need to split it into two halves, one for the IO, and the
other for zeroing. Unfortunately, the code that checks whether
we should split the block doesn't actually check if we span EOF, it
just checks if the read spans the /offset in the page/ that EOF
sits on. So it splits every read into two if EOF is not page
aligned, regardless of whether we are reading the EOF block or not.

Hence we need to restrict the "does the read span EOF" check to
just the page that spans EOF, not every page we read.

This patch results in correct EOF detection through readpages:

xfs_vm_readpages:     dev 259:0 ino 0x43 nr_pages 24
xfs_iomap_found:      dev 259:0 ino 0x43 size 0x66c00 offset 0x4f000 count 98304 type hole startoff 0x13c startblock 1368 blockcount 0x4
iomap_readpage_actor: orig pos 323584 pos 323584, length 4096, poff 0 plen 4096, isize 420864
xfs_iomap_found:      dev 259:0 ino 0x43 size 0x66c00 offset 0x50000 count 94208 type hole startoff 0x140 startblock 1497 blockcount 0x5c
iomap_readpage_actor: orig pos 327680 pos 327680, length 94208, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 331776 pos 331776, length 90112, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 335872 pos 335872, length 86016, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 339968 pos 339968, length 81920, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 344064 pos 344064, length 77824, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 348160 pos 348160, length 73728, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 352256 pos 352256, length 69632, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 356352 pos 356352, length 65536, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 360448 pos 360448, length 61440, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 364544 pos 364544, length 57344, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 368640 pos 368640, length 53248, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 372736 pos 372736, length 49152, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 376832 pos 376832, length 45056, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 380928 pos 380928, length 40960, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 385024 pos 385024, length 36864, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 389120 pos 389120, length 32768, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 393216 pos 393216, length 28672, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 397312 pos 397312, length 24576, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 401408 pos 401408, length 20480, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 405504 pos 405504, length 16384, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 409600 pos 409600, length 12288, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 413696 pos 413696, length 8192, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 417792 pos 417792, length 4096, poff 0 plen 3072, isize 420864
iomap_readpage_actor: orig pos 420864 pos 420864, length 1024, poff 3072 plen 1024, isize 420864

As you can see, it now does full page reads until the last one which
is split correctly at the block aligned EOF, reading 3072 bytes and
zeroing the last 1024 bytes. The original version of the patch got
this right, but it got another case wrong.

The EOF crossing detection really needs to use the original length
as plen, while it starts at the end of the block, will be shortened
as up-to-date blocks are found on the page. This means "orig_pos +
plen" no longer points to the end of the page, and so will not
correctly detect EOF crossing. Hence we have to use the length
passed in to detect this partial page case:

xfs_filemap_fault:    dev 259:1 ino 0x43  write_fault 0
xfs_vm_readpage:      dev 259:1 ino 0x43 nr_pages 1
xfs_iomap_found:      dev 259:1 ino 0x43 size 0x2cc00 offset 0x2c000 count 4096 type hole startoff 0xb0 startblock 282 blockcount 0x4
iomap_readpage_actor: orig pos 180224 pos 181248, length 4096, poff 1024 plen 2048, isize 183296
xfs_iomap_found:      dev 259:1 ino 0x43 size 0x2cc00 offset 0x2cc00 count 1024 type hole startoff 0xb3 startblock 285 blockcount 0x1
iomap_readpage_actor: orig pos 183296 pos 183296, length 1024, poff 3072 plen 1024, isize 183296

Here we see a trace where the first block on the EOF page is up to
date, hence poff = 1024 bytes. The offset into the page of EOF is
3072, so the range we want to read is 1024 - 3071, and the range we
want to zero is 3072 - 4095. You can see this is split correctly
now.
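
The split can be checked against the trace numbers with a simplified
model. This is not the kernel's iomap_adjust_read_range(); struct
page_ranges, split_at_eof() and SKETCH_PAGE_SIZE are hypothetical, the
page is assumed to start at or before the last byte of the file, and
poff is taken as given rather than derived from per-block up-to-date
state.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u

/* Byte ranges within one page: what to read and what to zero. */
struct page_ranges {
	unsigned read_off, read_len;
	unsigned zero_off, zero_len;
};

static struct page_ranges split_at_eof(uint64_t page_pos, unsigned poff,
				       uint64_t isize, unsigned block_bits)
{
	/* Default: read from poff to the end of the page, zero nothing. */
	struct page_ranges r = {
		poff, SKETCH_PAGE_SIZE - poff, SKETCH_PAGE_SIZE, 0
	};
	unsigned last_off, eof_end;

	assert(isize > 0 && page_pos < isize);
	assert(page_pos % SKETCH_PAGE_SIZE == 0 && poff < SKETCH_PAGE_SIZE);

	/* Only the page holding the last byte of the file is split. */
	if (page_pos / SKETCH_PAGE_SIZE != (isize - 1) / SKETCH_PAGE_SIZE)
		return r;

	/* Offset in the page of the first block wholly beyond EOF. */
	last_off = (unsigned)((isize - 1) % SKETCH_PAGE_SIZE);
	eof_end = ((last_off >> block_bits) + 1) << block_bits;

	r.read_len = poff < eof_end ? eof_end - poff : 0;
	r.zero_off = eof_end;
	r.zero_len = SKETCH_PAGE_SIZE - eof_end;
	return r;
}
```

With page_pos 180224, poff 1024, isize 183296 and 1k blocks this yields
a read of 2048 bytes starting at offset 1024 and a zeroed range of 1024
bytes starting at offset 3072, matching the trace above. A page that
does not contain EOF (for example page_pos 413696 with isize 420864) is
left as a full page read.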

This fixes the stale data beyond EOF problem that fsx quickly
uncovers on 1k block size filesystems.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-11-21 10:10:54 -08:00
Dave Chinner
494633fac7 vfs: vfs_dedupe_file_range() doesn't return EOPNOTSUPP
It returns EINVAL when the operation is not supported by the
filesystem. Fix it to return EOPNOTSUPP to be consistent with
the man page and clone_file_range().

Clean up the inconsistent error return handling while I'm there.
(I know, lipstick on a pig, but every little bit helps...)

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-11-21 10:10:54 -08:00
Dave Chinner
4721a60109 iomap: dio data corruption and spurious errors when pipes fill
When doing direct IO to a pipe for do_splice_direct(), the pipe is
trivial to fill up and overflow as it can only hold 16 pages. At
this point bio_iov_iter_get_pages() then returns -EFAULT, and we
abort the IO submission process. Unfortunately, iomap_dio_rw()
propagates the error back up the stack.

The error is converted from the EFAULT to EAGAIN in
generic_file_splice_read() to tell the splice layers that the pipe
is full. do_splice_direct() completely fails to handle EAGAIN errors
(it aborts on error) and returns EAGAIN to the caller.

copy_file_range() then completely fails to handle EAGAIN as well,
and so returns EAGAIN to userspace, having failed to copy the data
it was asked to.

Avoid this whole steaming pile of fail by having iomap_dio_rw()
silently swallow EFAULT errors and so do short reads.
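
The idea of turning a full pipe into a short read can be modelled as
below. This is a sketch under assumptions, not the iomap code:
fill_slot(), read_into_pipe(), PIPE_SLOTS and PIPE_PAGE_BYTES are
hypothetical, and the 16-slot limit is the figure quoted above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define PIPE_SLOTS	16	/* pages the modelled pipe can hold */
#define PIPE_PAGE_BYTES	4096L

/* Claim one pipe slot; fails with -EFAULT once the pipe is full. */
static long fill_slot(int *used)
{
	if (*used >= PIPE_SLOTS)
		return -EFAULT;
	(*used)++;
	return PIPE_PAGE_BYTES;
}

/* Read up to want bytes, reporting a short read when the pipe fills. */
static long read_into_pipe(int *used, long want)
{
	long done = 0;

	assert(used != NULL && want >= 0);

	while (done < want) {
		long ret = fill_slot(used);

		if (ret < 0) {
			if (ret == -EFAULT)
				break;	/* pipe full: swallow the error */
			return ret;	/* any other error is propagated */
		}
		done += ret;
	}
	return done;
}
```

Asking for 20 pages from an empty pipe returns 16 pages' worth of bytes
instead of an error; a further call on the full pipe returns 0.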

To make matters worse, iomap_dio_actor() has a stale data exposure
bug when bio_iov_iter_get_pages() fails - it does not zero the tail
block that may have been left uncovered by partial IO. Fix the error
handling case to drop to the sub-block zeroing rather than
immediately returning the -EFAULT error.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-11-21 10:10:53 -08:00