split_leaf was not properly balancing leaves when it was forced to
split a leaf twice. This commit adds an extra push left and right
before forcing the double split in hopes of getting the slot where
we want to insert at either the start or end of the leaf.
If the extra pushes do work, then we are able to avoid splitting twice
and we keep the tree properly balanced.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The "file" argument for fsync is never null so we can remove this check.
What drew my attention here is that 7ea8085910e: "drop unused dentry
argument to ->fsync" introduced an unconditional dereference at the
start of the function and that generated a smatch warning.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
posix_acl_from_xattr() returns both ERR_PTRs and null, but it's OK to
pass null values to set_cached_acl()
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If btrfs_ioctl_snap_destroy() deletes a snapshot but finishes
with end_transaction(), the cleaner kthread may come in and
drop the root in the same transaction. If that's the case, the
root's refs still == 1 in the tree when btrfs_del_root() deletes
the item, because commit_fs_roots() hasn't updated it yet (that
happens during the commit).
This wasn't a problem before only because
btrfs_ioctl_snap_destroy() would commit the transaction before dropping
the dentry reference, so the dead root wouldn't get queued up until
after the fs root item was updated in the btree.
Since it is not an error to drop the root reference and the root in the
same transaction, just drop the BUG_ON() in btrfs_del_root().
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
when used Posix File System Test Suite(pjd-fstest) to test btrfs,
some cases about setfacl failed when noacl mount option used.
I simplified used commands in pjd-fstest, and the following steps
can reproduce it.
------------------------
# cd btrfs-part/
# mkdir aaa
# setfacl -m m::rw aaa <- successed, but not expected by pjd-fstest.
------------------------
I checked ext3, a warning message occured, like as:
setfacl: aaa/: Operation not supported
Certainly, it's expected by pjd-fstest.
So, i compared acl.c of btrfs and ext3. Based on that, a patch created.
Fortunately, it works.
Signed-off-by: Shi Weihua <shiwh@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
On btrfs, do the following
------------------
# su user1
# cd btrfs-part/
# touch aaa
# getfacl aaa
# file: aaa
# owner: user1
# group: user1
user::rw-
group::rw-
other::r--
# su user2
# cd btrfs-part/
# setfacl -m u::rwx aaa
# getfacl aaa
# file: aaa
# owner: user1
# group: user1
user::rwx <- successed to setfacl
group::rw-
other::r--
------------------
but we should prohibit it that user2 changing user1's acl.
In fact, on ext3 and other fs, a message occurs:
setfacl: aaa: Operation not permitted
This patch fixed it.
Signed-off-by: Shi Weihua <shiwh@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_lookup_dir_item() can return either ERR_PTRs or null.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_read_fs_root_no_name() returns ERR_PTRs on error so I added a
check for that. It's not clear to me if it can also return NULL
pointers or not so I left the original NULL pointer check as is.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This was added by a22285a6a3: "Btrfs: Integrate metadata reservation
with start_transaction". If we goto out here then we skip all the
unwinding and there are locks still held etc.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_iget() returns an ERR_PTR() on failure and not null.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Unwind and return -ENOMEM if the allocation fails here.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If btrfs_lookup_dir_item() fails, we should can just let the mount fail
with an error.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Tree blocks can live in data block groups in FS converted from extN.
So it's easy to trigger the BUG_ON.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Fix a potential null dereference in relocation.c
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Acked-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
when we use remap_file_pages() to remap a file, remap_file_pages always return
error. It is because btrfs didn't set VM_CAN_NONLINEAR for vma.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
refs can be used with uninitialized data if btrfs_lookup_extent_info()
fails on the first pass through the loop. In the original code if that
happens then check_path_shared() probably returns 1, this patch
changes it to return 1 for safety.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Seems that when btrfs_fallocate was converted to use the new ENOSPC stuff we
dropped passing the mode to the function that actually does the preallocation.
This breaks anybody who wants to use FALLOC_FL_KEEP_SIZE. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We cannot use the loop device which has been connected to a file in the btrf
The reproduce steps is following:
# dd if=/dev/zero of=vdev0 bs=1M count=1024
# losetup /dev/loop0 vdev0
# mkfs.btrfs /dev/loop0
...
failed to zero device start -5
The reason is that the btrfs don't implement either ->write_begin or ->write
the VFS API, so we fix it by setting ->write to do_sync_write().
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The ENOSPC code will now return ENOSPC to btrfs_start_transaction.
btrfs_dirty_inode needs to check for this and error out appropriately.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In order to support DIO that isn't aligned to the filesystem blocksize,
we fall back to buffered for any unaligned DIOs.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
After the path is released, the generation number got from block
pointer is no long valid. The race may cause disk corruption, because
verify_parent_transid() calls clear_extent_buffer_uptodate() when
generation numbers mismatch.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The O_DIRECT code wasn't checking for multiple references
on preallocated or nodatacow extents. This means it
wasn't honoring snapshots properly.
The fix here is to add an explicit check for multiple references
This also fixes the math for selecting the correct disk block,
making sure not to go past the end of the extent.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_dirty_inode tries to sneak in without much waiting or
space reservation, mostly for performance reasons. This
usually works well but can cause problems when there are
many many writers.
When btrfs_update_inode fails with ENOSPC, we fallback
to a slower btrfs_start_transaction call that will reserve
some space.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This moves the delalloc space reservation done for O_DIRECT
into btrfs_direct_IO. This way we don't leak reserved space
if the generic O_DIRECT write code errors out before it
calls into btrfs_direct_IO.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This changes O_DIRECT write code to mark extents as delalloc
while it is processing them. Yan Zheng has reworked the
enospc accounting based on tracking delalloc extents and
this makes it much easier to track enospc in the O_DIRECT code.
There are a few space cases with the O_DIRECT code though,
it only sets the EXTENT_DELALLOC bits, instead of doing
EXTENT_DELALLOC | EXTENT_DIRTY | EXTENT_UPTODATE, because
we don't want to mess with clearing the dirty and uptodate
bits when things go wrong. This is important because there
are no pages in the page cache, so any extent state structs
that we put in the tree won't get freed by releasepage. We have
to clear them ourselves as the DIO ends.
With this commit, we reserve space at in btrfs_file_aio_write,
and then as each btrfs_direct_IO call progresses it sets
EXTENT_DELALLOC on the range.
btrfs_get_blocks_direct is responsible for clearing the delalloc
at the same time it drops the extent lock.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The async helper threads offload crc work onto all the
CPUs, and make streaming writes much faster. This
changes the O_DIRECT write code to use them. The only
small complication was that we need to pass in the
logical offset in the file for each bio, because we can't
find it in the bio's pages.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Yan Zheng noticed two places we were doing a lot of work
without task->state set to TASK_RUNNING. This sets the state
properly after we get ready to sleep but decide not to.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In order for AIO to work, we need to implement aio_write. This patch converts
our btrfs_file_write to btrfs_aio_write. I've tested this with xfstests and
nothing broke, and the AIO stuff magically started working. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This provides basic DIO support for reading and writing. It does not do the
work to recover from mismatching checksums, that will come later. A few design
changes have been made from Jim's code (sorry Jim!)
1) Use the generic direct-io code. Jim originally re-wrote all the generic DIO
code in order to account for all of BTRFS's oddities, but thanks to that work it
seems like the best bet is to just ignore compression and such and just opt to
fallback on buffered IO.
2) Fallback on buffered IO for compressed or inline extents. Jim's code did
it's own buffering to make dio with compressed extents work. Now we just
fallback onto normal buffered IO.
3) Use ordered extents for the writes so that all of the
lock_extent()
lookup_ordered()
type checks continue to work.
4) Do the lock_extent() lookup_ordered() loop in readpage so we don't race with
DIO writes.
I've tested this with fsx and everything works great. This patch depends on my
dio and filemap.c patches to work. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch adds metadata ENOSPC handling for the balance code.
It is consisted by following major changes:
1. Avoid COW tree leave in the phrase of merging tree.
2. Handle interaction with snapshot creation.
3. make the backref cache can live across transactions.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Pre-allocate space for data relocation. This can detect ENOPSC
condition caused by fragmentation of free space.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Previous patches make the allocater return -ENOSPC if there is no
unreserved free metadata space. This patch updates tree log code
and various other places to propagate/handle the ENOSPC error.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Reserve metadata space for extent tree, checksum tree and root tree
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Introduce metadata reservation context for delayed allocation
and update various related functions.
This patch also introduces EXTENT_FIRST_DELALLOC control bit for
set/clear_extent_bit. It tells set/clear_bit_hook whether they
are processing the first extent_state with EXTENT_DELALLOC bit
set. This change is important if set/clear_extent_bit involves
multiple extent_state.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Besides simplify the code, this change makes sure all metadata
reservation for normal metadata operations are released after
committing transaction.
Changes since V1:
Add code that check if unlink and rmdir will free space.
Add ENOSPC handling for clone ioctl.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Introducing metadata reseravtion contexts has two major advantages.
First, it makes metadata reseravtion more traceable. Second, it can
reclaim freed space and re-add them to the itself after transaction
committed.
Besides add btrfs_block_rsv structure and related helper functions,
This patch contains following changes:
Move code that decides if freed tree block should be pinned into
btrfs_free_tree_block().
Make space accounting more accurate, mainly for handling read only
block groups.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
All code in init_btrfs_i can be moved into btrfs_alloc_inode()
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Shrink delayed allocation space in a synchronized manner is more
controllable than flushing all delay allocated space in an async
thread.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We already have fs_info->chunk_mutex to avoid concurrent
chunk creation.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The size of reserved space is stored in space_info. If block groups
of different raid types are linked to separate space_info, changing
allocation profile will corrupt reserved space accounting.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: make sure the chunk allocator doesn't create zero length chunks
Btrfs: fix data enospc check overflow
A recent commit allowed for smaller chunks to be created, but didn't
make sure they were always bigger than a stripe. After some divides,
this led to zero length stripes.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: add check for changed leaves in setup_leaf_for_split
Btrfs: create snapshot references in same commit as snapshot
Btrfs: fix small race with delalloc flushing waitqueue's
Btrfs: use add_to_page_cache_lru, use __page_cache_alloc
Btrfs: fix chunk allocate size calculation
Btrfs: kill max_extent mount option
Btrfs: fail to mount if we have problems reading the block groups
Btrfs: check btrfs_get_extent return for IS_ERR()
Btrfs: handle kmalloc() failure in inode lookup ioctl
Btrfs: dereferencing freed memory
Btrfs: Simplify num_stripes's calculation logical for __btrfs_alloc_chunk()
Btrfs: Add error handle for btrfs_search_slot() in btrfs_read_chunk_tree()
Btrfs: Remove unnecessary finish_wait() in wait_current_trans()
Btrfs: add NULL check for do_walk_down()
Btrfs: remove duplicate include in ioctl.c
Fix trivial conflict in fs/btrfs/compression.c due to slab.h include
cleanups.
Because we account for reserved space we get from the allocator before we
actually account for allocating delalloc space, we can have a small window where
the amount of "used" space in a space_info is more than the total amount of
space in the space_info. This will cause a overflow in our check, so it will
seem like we have _tons_ of free space, and we'll allow reservations to occur
that will end up larger than the amount of space we have. I've seen users
report ENOSPC panic's in cow_file_range a few times recently, so I tried to
reproduce this problem and found I could reproduce it if I ran one of my tests
in a loop for like 20 minutes. With this patch my test ran all night without
issues. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
setup_leaf_for_split needs to drop the path and search again, and has
checks to see if the item we want to split changed size. But, it misses
the case where the leaf changed and now has enough room for the item
we want to insert.
This adds an extra check to make sure the leaf really needs splitting
before we call btrfs_split_leaf(), which keeps us from trying to split
a leaf with a single item.
btrfs_split_leaf() will blindly split the single item leaf, leaving us
with one good leaf and one empty leaf and then a crash.
Signed-off-by: Chris Mason <chris.mason@oracle.com>