kernel_optimize_test/fs
Filipe Manana adc11110d1 btrfs: send: fix invalid clone operations when cloning from the same file and root
commit 518837e65068c385dddc0a87b3e577c8be7c13b1 upstream.

When an incremental send finds an extent that is shared, it checks which
file extent items in the range refer to that extent, and for those it
emits clone operations, while for others it emits regular write operations
to avoid corruption at the destination (as described and fixed by commit
d906d49fc5 ("Btrfs: send, fix file corruption due to incorrect cloning
operations")).

However when the root we are cloning from is the send root, we are cloning
from the inode currently being processed and the source file range has
several extent items that partially point to the desired extent, with an
offset smaller than the offset in the file extent item for the range we
want to clone into, it can cause the algorithm to issue a clone operation
that starts at the current eof of the file being processed in the receiver
side, in which case the receiver will fail, with EINVAL, when attempting
to execute the clone operation.

Example reproducer:

  $ cat test-send-clone.sh
  #!/bin/bash

  DEV=/dev/sdi
  MNT=/mnt/sdi

  mkfs.btrfs -f $DEV >/dev/null
  mount $DEV $MNT

  # Create our test file with a single and large extent (1M) and with
  # different content for different file ranges that will be reflinked
  # later.
  xfs_io -f \
         -c "pwrite -S 0xab 0 128K" \
         -c "pwrite -S 0xcd 128K 128K" \
         -c "pwrite -S 0xef 256K 256K" \
         -c "pwrite -S 0x1a 512K 512K" \
         $MNT/foobar

  btrfs subvolume snapshot -r $MNT $MNT/snap1
  btrfs send -f /tmp/snap1.send $MNT/snap1

  # Now do a series of changes to our file such that we end up with
  # different parts of the extent reflinked into different file offsets
  # and we overwrite a large part of the extent too, so no file extent
  # items refer to that part that was overwritten. This used to confuse
  # the algorithm used by the kernel to figure out which file ranges to
  # clone, making it attempt to clone from a source range starting at
  # the current eof of the file, resulting in the receiver to fail since
  # it is an invalid clone operation.
  #
  xfs_io -c "reflink $MNT/foobar 64K 1M 960K" \
         -c "reflink $MNT/foobar 0K 512K 256K" \
         -c "reflink $MNT/foobar 512K 128K 256K" \
         -c "pwrite -S 0x73 384K 640K" \
         $MNT/foobar

  btrfs subvolume snapshot -r $MNT $MNT/snap2
  btrfs send -f /tmp/snap2.send -p $MNT/snap1 $MNT/snap2

  echo -e "\nFile digest in the original filesystem:"
  md5sum $MNT/snap2/foobar

  # Now unmount the filesystem, create a new one, mount it and try to
  # apply both send streams to recreate both snapshots.
  umount $DEV

  mkfs.btrfs -f $DEV >/dev/null
  mount $DEV $MNT

  btrfs receive -f /tmp/snap1.send $MNT
  btrfs receive -f /tmp/snap2.send $MNT

  # Must match what we got in the original filesystem of course.
  echo -e "\nFile digest in the new filesystem:"
  md5sum $MNT/snap2/foobar

  umount $MNT

When running the reproducer, the incremental send operation fails due to
an invalid clone operation:

  $ ./test-send-clone.sh
  wrote 131072/131072 bytes at offset 0
  128 KiB, 32 ops; 0.0015 sec (80.906 MiB/sec and 20711.9741 ops/sec)
  wrote 131072/131072 bytes at offset 131072
  128 KiB, 32 ops; 0.0013 sec (90.514 MiB/sec and 23171.6148 ops/sec)
  wrote 262144/262144 bytes at offset 262144
  256 KiB, 64 ops; 0.0025 sec (98.270 MiB/sec and 25157.2327 ops/sec)
  wrote 524288/524288 bytes at offset 524288
  512 KiB, 128 ops; 0.0052 sec (95.730 MiB/sec and 24506.9883 ops/sec)
  Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
  At subvol /mnt/sdi/snap1
  linked 983040/983040 bytes at offset 1048576
  960 KiB, 1 ops; 0.0006 sec (1.419 GiB/sec and 1550.3876 ops/sec)
  linked 262144/262144 bytes at offset 524288
  256 KiB, 1 ops; 0.0020 sec (120.192 MiB/sec and 480.7692 ops/sec)
  linked 262144/262144 bytes at offset 131072
  256 KiB, 1 ops; 0.0018 sec (133.833 MiB/sec and 535.3319 ops/sec)
  wrote 655360/655360 bytes at offset 393216
  640 KiB, 160 ops; 0.0093 sec (66.781 MiB/sec and 17095.8436 ops/sec)
  Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
  At subvol /mnt/sdi/snap2

  File digest in the original filesystem:
  9c13c61cb0b9f5abf45344375cb04dfa  /mnt/sdi/snap2/foobar
  At subvol snap1
  At snapshot snap2
  ERROR: failed to clone extents to foobar: Invalid argument

  File digest in the new filesystem:
  132f0396da8f48d2e667196bff882cfc  /mnt/sdi/snap2/foobar

The clone operation is invalid because its source range starts at the
current eof of the file in the receiver, causing the receiver to get
an EINVAL error from the clone operation when attempting it.

For the example above, what happens is the following:

1) When processing the extent at file offset 1M, the algorithm checks that
   the extent is shared and can be (fully or partially) found at file
   offset 0.

   At this point the file has a size (and eof) of 1M at the receiver;

2) It finds that our extent item at file offset 1M has a data offset of
   64K and, since the file extent item at file offset 0 has a data offset
   of 0, it issues a clone operation, from the same file and root, that
   has a source range offset of 64K, destination offset of 1M and a length
   of 64K, since the extent item at file offset 0 refers only to the first
   128K of the shared extent.

   After this clone operation, the file size (and eof) at the receiver is
   increased from 1M to 1088K (1M + 64K);

3) Now there's still 896K (960K - 64K) of data left to clone or write, so
   it checks for the next file extent item, which starts at file offset
   128K. This file extent item has a data offset of 0 and a length of
   256K, so a clone operation with a source range offset of 256K, a
   destination offset of 1088K (1M + 64K) and length of 128K is issued.

   After this operation the file size (and eof) at the receiver increases
   from 1088K to 1216K (1088K + 128K);

4) Now there's still 768K (896K - 128K) of data left to clone or write, so
   it checks for the next file extent item, located at file offset 384K.
   This file extent item points to a different extent, not the one we want
   to clone, with a length of 640K. So we issue a write operation into the
   file range 1216K (1088K + 128K, end of the last clone operation), with
   a length of 640K and with a data matching the one we can find for that
   range in send root.

   After this operation, the file size (and eof) at the receiver increases
   from 1216K to 1856K (1216K + 640K);

5) Now there's still 128K (768K - 640K) of data left to clone or write, so
   we look into the file extent item, which is for file offset 1M and it
   points to the extent we want to clone, with a data offset of 64K and a
   length of 960K.

   However this matches the file offset we started with, the start of the
   range to clone into. So we can't for sure find any file extent item
   from here onwards with the rest of the data we want to clone, yet we
   proceed and since the file extent item points to the shared extent,
   with a data offset of 64K, we issue a clone operation with a source
   range starting at file offset 1856K, which matches the file extent
   item's offset, 1M, plus the amount of data cloned and written so far,
   which is 64K (step 2) + 128K (step 3) + 640K (step 4). This clone
   operation is invalid since the source range offset matches the current
   eof of the file in the receiver. We should have stopped looking for
   extents to clone at this point and instead fallback to write, which
   would simply the contain the data in the file range from 1856K to
   1856K + 128K.

So fix this by stopping the loop that looks for file ranges to clone at
clone_range() when we reach the current eof of the file being processed,
if we are cloning from the same file and using the send root as the clone
root. This ensures any data not yet cloned will be sent to the receiver
through a write operation.

A test case for fstests will follow soon.

Reported-by: Massimo B. <massimo.b@gmx.net>
Link: https://lore.kernel.org/linux-btrfs/6ae34776e85912960a253a8327068a892998e685.camel@gmx.net/
Fixes: 11f2069c11 ("Btrfs: send, allow clone operations within the same file")
CC: stable@vger.kernel.org # 5.5+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-27 11:54:53 +01:00
..
9p fs: 9p: add generic splice_write file operation 2020-12-01 21:40:47 +01:00
adfs Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2020-10-24 12:26:05 -07:00
affs Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2020-10-24 12:26:05 -07:00
afs afs: Fix memory leak when mounting with multiple source parameters 2020-12-08 15:59:25 -08:00
autofs
befs
bfs bfs: don't use WARNING: string when it's just info. 2021-01-06 14:56:52 +01:00
btrfs btrfs: send: fix invalid clone operations when cloning from the same file and root 2021-01-27 11:54:53 +01:00
cachefiles cachefiles: Handle readpage error correctly 2020-10-26 10:42:54 -07:00
ceph ceph: fix inode refcount leak when ceph_fill_inode on non-I_NEW inode fails 2021-01-06 14:56:55 +01:00
cifs cifs: fix interrupted close commands 2021-01-19 18:27:19 +01:00
coda
configfs
cramfs
crypto fscrypt: add fscrypt_is_nokey_name() 2020-12-26 16:02:43 +01:00
debugfs debugfs: remove return value of debugfs_create_devm_seqfile() 2020-10-30 08:37:39 +01:00
devpts
dlm
ecryptfs
efivarfs efivarfs: revert "fix memory leak in efivarfs_create()" 2020-11-25 16:55:02 +01:00
efs
erofs erofs: avoid using generic_block_bmap 2020-12-30 11:53:46 +01:00
exfat exfat: Avoid allocating upcase table using kcalloc() 2020-12-26 16:02:38 +01:00
exportfs
ext2 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2020-10-24 12:26:05 -07:00
ext4 ext4: fix superblock checksum failure when setting password salt 2021-01-19 18:27:31 +01:00
f2fs f2fs: fix race of pending_pages in decompression 2021-01-06 14:56:54 +01:00
fat
freevxfs
fscache
fuse fuse: fix bad inode 2021-01-09 13:46:24 +01:00
gfs2 gfs2: Fix deadlock between gfs2_{create_inode,inode_lookup} and delete_work_func 2020-12-01 00:21:10 +01:00
hfs fs: Replace zero-length array with flexible-array member 2020-10-29 17:22:59 -05:00
hfsplus fs: Replace zero-length array with flexible-array member 2020-10-29 17:22:59 -05:00
hostfs
hpfs
hugetlbfs
iomap iomap: clean up writeback state logic on writepage error 2020-11-04 08:52:46 -08:00
isofs fs: Replace zero-length array with flexible-array member 2020-10-29 17:22:59 -05:00
jbd2 jbd2: fix kernel-doc markups 2020-11-19 22:38:29 -05:00
jffs2 jffs2: Fix NULL pointer dereference in rp_size fs option parsing 2021-01-06 14:56:49 +01:00
jfs jfs: Fix array index bounds check in dbAdjTree 2020-12-30 11:54:18 +01:00
kernfs
lockd lockd: don't use interval-based rebinding over TCP 2020-12-30 11:53:30 +01:00
minix
nfs NFS: nfs_igrab_and_active must first reference the superblock 2021-01-19 18:27:31 +01:00
nfs_common nfs_common: need lock during iterate through the list 2020-12-30 11:53:45 +01:00
nfsd nfsd4: readdirplus shouldn't return parent of export 2021-01-23 16:03:58 +01:00
nilfs2 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2020-10-24 12:26:05 -07:00
nls
notify fanotify: Fix sys_fanotify_mark() on native x86-32 2021-01-17 14:16:59 +01:00
ntfs Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2020-10-24 12:26:05 -07:00
ocfs2 ocfs2: initialize ip_next_orphan 2020-11-14 11:26:04 -08:00
omfs
openpromfs
orangefs
overlayfs ovl: make ioctl() safe 2020-12-30 11:54:16 +01:00
proc mm: don't play games with pinned pages in clear_page_refs 2021-01-19 18:27:29 +01:00
pstore
qnx4
qnx6
quota quota: Don't overflow quota file offsets 2021-01-06 14:56:53 +01:00
ramfs
reiserfs reiserfs: add check for an invalid ih_entry_count 2021-01-06 14:56:52 +01:00
romfs Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2020-10-24 12:26:05 -07:00
squashfs Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2020-10-24 12:26:05 -07:00
sysfs
sysv
tracefs
ubifs ubifs: wbuf: Don't leak kernel memory to flash 2020-12-30 11:54:17 +01:00
udf Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2020-10-24 12:26:05 -07:00
ufs Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2020-10-24 12:26:05 -07:00
unicode
vboxsf
verity
xfs xfs: revert "xfs: fix rmap key and record comparison functions" 2020-11-19 15:17:50 -08:00
zonefs zonefs: select CONFIG_CRC32 2021-01-17 14:17:03 +01:00
aio.c vfs: separate __sb_start_write into blocking and non-blocking helpers 2020-11-10 16:53:07 -08:00
anon_inodes.c
attr.c
bad_inode.c
binfmt_aout.c
binfmt_elf_fdpic.c
binfmt_elf.c fs: Replace zero-length array with flexible-array member 2020-10-29 17:22:59 -05:00
binfmt_em86.c
binfmt_flat.c
binfmt_misc.c
binfmt_script.c
block_dev.c
buffer.c mm, memcg: rework remote charging API to support nesting 2020-10-18 09:27:09 -07:00
char_dev.c
compat_binfmt_elf.c
coredump.c coredump: fix core_pattern parse error 2020-12-06 10:19:07 -08:00
d_path.c
dax.c fuse update for 5.10 2020-10-19 14:28:30 -07:00
dcache.c
dcookies.c
direct-io.c
drop_caches.c
eventfd.c
eventpoll.c epoll: check for events when removing a timed out thread from the wait queue 2020-12-30 11:54:00 +01:00
exec.c exec: Transform exec_update_mutex into a rw_semaphore 2021-01-09 13:46:24 +01:00
fcntl.c fcntl: Fix potential deadlock in send_sig{io, urg}() 2021-01-06 14:56:53 +01:00
fhandle.c
file_table.c task_work: cleanup notification modes 2020-10-17 15:05:30 -06:00
file.c
filesystems.c
fs_context.c
fs_parser.c
fs_pin.c
fs_struct.c
fs_types.c
fs-writeback.c
fsopen.c
init.c
inode.c fs: Handle I_DONTCACHE in iput_final() instead of generic_drop_inode() 2020-12-30 11:53:49 +01:00
internal.h
io_uring.c io_uring: drop file refs after task cancel 2021-01-19 18:27:25 +01:00
io-wq.c io-wq: cancel request if it's asking for files and we don't have them 2020-11-04 10:22:56 -07:00
io-wq.h io_uring: fix io_wqe->work_list corruption 2020-12-30 11:54:03 +01:00
ioctl.c
Kconfig
Kconfig.binfmt
kernel_read_file.c
libfs.c libfs: fix error cast of negative value in simple_attr_write() 2020-11-22 10:48:22 -08:00
locks.c
Makefile Refactored code for 5.10: 2020-10-23 11:33:41 -07:00
mbcache.c
mount.h
mpage.c
namei.c Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2020-10-24 12:26:05 -07:00
namespace.c umount(2): move the flag validity checks first 2021-01-19 18:27:32 +01:00
no-block.c
nsfs.c
open.c openat2: reject RESOLVE_BENEATH|RESOLVE_IN_ROOT 2020-12-30 11:54:24 +01:00
pipe.c
pnode.c
pnode.h fs/namespace.c: WARN if mnt_count has become negative 2021-01-06 14:56:54 +01:00
posix_acl.c
proc_namespace.c proc mountinfo: make splice available again 2020-12-30 11:54:02 +01:00
read_write.c Refactored code for 5.10: 2020-10-23 11:33:41 -07:00
readdir.c
remap_range.c
select.c poll: fix performance regression due to out-of-line __put_user() 2021-01-19 18:27:27 +01:00
seq_file.c fix return values of seq_read_iter() 2020-11-15 22:12:53 -05:00
signalfd.c
splice.c io_uring-5.10-2020-10-24 2020-10-24 12:40:18 -07:00
stack.c
stat.c
statfs.c
super.c vfs: move __sb_{start,end}_write* to fs.h 2020-11-10 16:53:11 -08:00
sync.c
timerfd.c
userfaultfd.c
utimes.c
xattr.c