There are lots of places where we do dentry->d_parent->d_inode without holding
the dentry->d_lock. This could cause problems with rename. So instead we need
to use dget_parent() and hold the reference to the parent as long as we are
going to use it's inode and then dput it at the end.
Signed-off-by: Josef Bacik <josef@redhat.com>
Cc: raven@themaw.net
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When creating new inodes we don't setup inode->i_generation. So if we generate
an fh with a newly created inode we save the generation of 0, but if we flush
the inode to disk and have to read it back when getting the inode on the server
we'll have the right i_generation, so gens wont match and we get ESTALE. This
patch properly sets inode->i_generation when we create the new inode and now I'm
no longer getting ESTALE. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
People kept reporting NFS issues, specifically getting ESTALE alot. I figured
out how to reproduce the problem
SERVER
mkfs.btrfs /dev/sda1
mount /dev/sda1 /mnt/btrfs-test
<add /mnt/btrfs-test to /etc/exports>
btrfs subvol create /mnt/btrfs-test/foo
service nfs start
CLIENT
mount server:/mnt/btrfs /mnt/test
cd /mnt/test/foo
ls
SERVER
echo 3 > /proc/sys/vm/drop_caches
CLIENT
ls <-- get an ESTALE here
This is because the standard way to lookup a name in nfsd is to use readdir, and
what it does is do a readdir on the parent directory looking for the inode of
the child. So in this case the parent being / and the child being foo. Well
subvols all have the same inode number, so doing a readdir of / looking for
inode 256 will return '.', which obviously doesn't match foo. So instead we
need to have our own .get_name so that we can find the right name.
Our .get_name will either lookup the inode backref or the root backref,
whichever we're looking for, and return the name we find. Running the above
reproducer with this patch results in everything acting the way its supposed to.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Set src_offset = 0, src_length = 20K, dest_offset = 20K. And the
original filesize of the dest file 'file2' is 30K:
# ls -l /mnt/file2
-rw-r--r-- 1 root root 30720 Nov 18 16:42 /mnt/file2
Now clone file1 to file2, the dest file should be 40K, but it
still shows 30K:
# ls -l /mnt/file2
-rw-r--r-- 1 root root 30720 Nov 18 16:42 /mnt/file2
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We've done the check for src_offset and src_length, and We should
also check dest_offset, otherwise we'll corrupt the destination
file:
(After cloning file1 to file2 with unaligned dest_offset)
# cat /mnt/file2
cat: /mnt/file2: Input/output error
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When I added the clear_cache option I screwed up and took the break out of
the space_cache case statement, so whenever you mount with space_cache you also
get clear_cache, which does you no good if you say set space_cache in fstab so
it always gets set. This patch adds the break back in properly.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
'unused' calculated with wrong sign in reserve_metadata_bytes().
This might have lead to unwanted over-reservations.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
extent_bio_alloc() and compressed_bio_alloc() are similar, cleanup
similar source code.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
bio_endio() will free dip and dip->csums, so dip and dip->csums twice will
be freed twice. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Migrate page will directly call the btrfs btree writepage function,
which isn't actually allowed.
Our writepage assumes that you have locked the extent_buffer and
flagged the block as written. Without doing these steps, we can
corrupt metadata blocks.
A later commit will remove the btree writepage function since
it is really only safely used internally by btrfs. We
use writepages for everything else.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
During unlink we remove any references to the inode from
the tree log. It can return -ENOENT and other errors,
and this changes the unlink code to deal with it.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Add a mount option user_subvol_rm_allowed that allows users to delete a
(potentially non-empty!) subvol when they would otherwise we allowed to do
an rmdir(2). We duplicate the may_delete() checks from the core VFS code
to implement identical security checks (minus the directory size check).
We additionally require that the user has write+exec permission on the
subvol root inode.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There is no reason to force an immediate commit when deleting a snapshot.
Users have some expectation that space from a deleted snapshot be freed
immediately, but even if we do commit the reclaim is a background process.
If users _do_ want the deletion to be durable, they can call 'sync'.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Create a snap without waiting for it to commit to disk. The ioctl is
ordered such that subsequent operations will not be contained by the
created snapshot, and the commit is initiated, but the ioctl does not
wait for the snapshot to commit to disk.
We return the specific transid to userspace so that an application can wait
for this specific snapshot creation to commit via the WAIT_SYNC ioctl.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
START_SYNC will start a sync/commit, but not wait for it to
complete. Any modification started after the ioctl returns is
guaranteed not to be included in the commit. If a non-NULL
pointer is passed, the transaction id will be returned to
userspace.
WAIT_SYNC will wait for any in-progress commit to complete. If a
transaction id is specified, the ioctl will block and then
return (success) when the specified transaction has committed.
If it has already committed when we call the ioctl, it returns
immediately. If the specified transaction doesn't exist, it
returns EINVAL.
If no transaction id is specified, WAIT_SYNC will wait for the
currently committing transaction to finish it's commit to disk.
If there is no currently committing transaction, it returns
success.
These ioctls are useful for applications which want to impose an
ordering on when fs modifications reach disk, but do not want to
wait for the full (slow) commit process to do so.
Picky callers can take the transid returned by START_SYNC and
feed it to WAIT_SYNC, and be certain to wait only as long as
necessary for the transaction _they_ started to reach disk.
Sloppy callers can START_SYNC and WAIT_SYNC without a transid,
and provided they didn't wait too long between the calls, they
will get the same result. However, if a second commit starts
before they call WAIT_SYNC, they may end up waiting longer for
it to commit as well. Even so, a START_SYNC+WAIT_SYNC still
guarantees that any operation completed before the START_SYNC
reaches disk.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Add support for an async transaction commit that is ordered such that any
subsequent operations will join the following transaction, but does not
wait until the current commit is fully on disk. This avoids much of the
latency associated with the btrfs_commit_transaction for callers concerned
with serialization and not safety.
The wait_for_unblock flag controls whether we wait for the 'middle' portion
of commit_transaction to complete, which is necessary if the caller expects
some of the modifications contained in the commit to be available (this is
the case for subvol/snapshot creation).
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We calculate timeout (either 1 or MAX_SCHEDULE_TIMEOUT) based on whether
num_writers > 1 or should_grow at the top of the loop. Then, much much
later, we wait for that timeout if either num_writers or should_grow is
true. However, it's possible for a racing process (calling
btrfs_end_transaction()) to decrement num_writers such that we wait
forever instead of for 1.
Fix this by deciding how long to wait when we wait. Include a smp_mb()
before checking if the waitqueue is active to ensure the num_writers
is visible.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
I'm no lockdep expert, but this appears to make the lockdep warning go
away for the i_mutex locking in the clone ioctl.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We had an edge case issue where the requested range was just
following an existing extent. Instead of skipping to the next
extent, we used the previous one which lead to having zero
sized extents.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The lookup_first_ordered_extent() was done on the wrong inode, and the
->delalloc_bytes test was wrong, as the following
btrfs_wait_ordered_range() would only invoke a range write and wouldn't
write the entire file data range. Also, a bad parameter was passed to
btrfs_wait_ordered_range().
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
These are all the cases where a variable is set, but not read which are
not bugs as far as I can see, but simply leftovers.
Still needs more review.
Found by gcc 4.6's new warnings
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
These are all the cases where a variable is set, but not
read which are really bugs.
- Couple of incorrect error handling fixed.
- One incorrect use of a allocation policy
- Some other things
Still needs more review.
Found by gcc 4.6's new warnings.
[akpm@linux-foundation.org: fix build. Might have been bitrot]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Use ERR_CAST(x) rather than ERR_PTR(PTR_ERR(x)). The former makes more
clear what is the purpose of the operation, which otherwise looks like a
no-op.
The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
type T;
T x;
identifier f;
@@
T f (...) { <+...
- ERR_PTR(PTR_ERR(x))
+ x
...+> }
@@
expression x;
@@
- ERR_PTR(PTR_ERR(x))
+ ERR_CAST(x)
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Use memdup_user when user data is immediately copied into the
allocated region.
The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
expression from,to,size,flag;
position p;
identifier l1,l2;
@@
- to = \(kmalloc@p\|kzalloc@p\)(size,flag);
+ to = memdup_user(from,size);
if (
- to==NULL
+ IS_ERR(to)
|| ...) {
<+... when != goto l1;
- -ENOMEM
+ PTR_ERR(to)
...+>
}
- if (copy_from_user(to, from, size) != 0) {
- <+... when != goto l2;
- -EFAULT
- ...+>
- }
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When btrfs is mounted in degraded mode, it has some internal structures
to track the missing devices. This missing device is setup as readonly,
but the mapping code can get upset when we try to write to it.
This changes the mapping code to return -EIO instead of oops when we try
to write to the readonly device.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch reduces the CPU time spent in the extent buffer search by using the
radix tree instead of the rbtree and using the rcu lock instead of the spin
lock.
I did a quick test by the benchmark tool[1] and found the patch improve the
file creation/deletion performance problem that I have reported[2].
Before applying this patch:
Create files:
Total files: 50000
Total time: 0.971531
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.366761
Average time: 0.000027
After applying this patch:
Create files:
Total files: 50000
Total time: 0.927455
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.292280
Average time: 0.000026
[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
[2] http://marc.info/?l=linux-btrfs&m=128212635122920&w=2
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
restructure try_release_extent_buffer() and write a function to release the
extent buffer. It will be used later.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We have a fairly complex set of loops around walking our list of
delalloc inodes when we find metadata delalloc space running low.
It doesn't work very well, can use large amounts of CPU and doesn't
do very efficient writeback.
This switches us to kick the bdi flusher threads instead. All dirty
data in btrfs is accounted as delalloc data, so this is very similar
in terms of what it writes, but we're able to just kick off the IO
and wait for progress.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
An earlier commit tried to keep us from allocating too many
empty metadata chunks. It was somewhat too restrictive and could
lead to ENOSPC errors on empty filesystems.
This increases the limits to about 5% of the FS size, allowing more
metadata chunks to be preallocated.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When btrfs discovers the generation number in a btree block is
incorrect, it can loop forever without forcing the RAID
code to try a valid mirror, and without returning EIO.
This changes things to properly kick out the EIO.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If you mount -o space_cache, the option will be persistent across mounts, but to
make sure the user knows that they did this, emit a message telling them if they
didn't mount with -o space_cache but the feature is still used.
Signed-off-by: Josef Bacik <josef@redhat.com>
If something goes wrong with the free space cache we need a way to make sure
it's not loaded on mount and that it's cleared for everybody. When you pass the
clear_cache option it will make it so all block groups are setup to be cleared,
which keeps them from being loaded and then they will be truncated when the
transaction is committed. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
There are just a few things that need to be fixed in the kernel to support mixed
data+metadata block groups. Mostly we just need to make sure that if we are
using mixed block groups that we continue to allocate mixed block groups as we
need them. Also we need to make sure __find_space_info will find our space info
if we search for DATA or METADATA only. Tested this with xfstests and it works
nicely. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
With the free space disk caching we can mark the block group as started with the
caching, but we don't have a caching ctl. This can race with anybody else who
tries to get the caching ctl before we cache (this is very hard to do btw). So
instead check to see if cache->caching_ctl is set, and if not return NULL.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
This patch actually loads the free space cache if it exists. The only thing
that really changes here is that we need to cache the block group if we're going
to remove an extent from it. Previously we did not do this since the caching
kthread would pick it up. With the on disk cache we don't have this luxury so
we need to make sure we read the on disk cache in first, and then remove the
extent, that way when the extent is unpinned the free space is added to the
block group. This has been tested with all sorts of things.
Signed-off-by: Josef Bacik <josef@redhat.com>
This is a simple bit, just dump the free space cache out to our preallocated
inode when we're writing out dirty block groups. There are a bunch of changes
in inode.c in order to account for special cases. Mostly when we're doing the
writeout we're holding trans_mutex, so we need to use the nolock transacation
functions. Also we can't do asynchronous completions since the async thread
could be blocked on already completed IO waiting for the transaction lock. This
has been tested with xfstests and btrfs filesystem balance, as well as my ENOSPC
tests. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
In order to save free space cache, we need an inode to hold the data, and we
need a special item to point at the right inode for the right block group. So
first, create a special item that will point to the right inode, and the number
of extent entries we will have and the number of bitmaps we will have. We
truncate and pre-allocate space everytime to make sure it's uptodate.
This feature will be turned on as soon as you mount with -o space_cache, however
it is safe to boot into old kernels, they will just generate the cache the old
fashion way. When you boot back into a newer kernel we will notice that we
modified and not the cache and automatically discard the cache.
Signed-off-by: Josef Bacik <josef@redhat.com>
Because btrfs_dirty_inode does a btrfs_join_transaction, it doesn't actually
reserve space. It does this so we can try and dirty the inode quickly without
having to deal with the ENOSPC problems. But if it does get back ENOSPC it
handles it properly. The problem is use_block_rsv does a WARN_ON whenever this
case happens, even tho btrfs_dirty_inode takes it into account and actually
expects to get -ENOSPC if things are particularly tight. So instead just remove
the warning. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
btrfs_commit_transaction will free our trans, but because we pass trans to
shrink_delalloc we could possibly have a use after free situation. So instead
if we commit the transaction, set trans to null and set committed to true so we
don't keep trying to commit a transaction. This fixes a panic I could reproduce
at will. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
If we failed to find the root subvol id, or the subvol=<name>, we would
deactivate the locked super and close the devices. The problem is at this point
we have gotten the SB all setup, which includes setting super_operations, so
when we'd deactiveate the super, we'd do a close_ctree() which closes the
devices, so we'd end up closing the devices twice. So if you do something like
this
mount /dev/sda1 /mnt/test1
mount /dev/sda1 /mnt/test2 -o subvol=xxx
umount /mnt/test1
it would blow up (if subvol xxx doesn't exist). This patch fixes that problem.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
With multi-threaded writes we were getting ENOSPC early because somebody would
come in, start flushing delalloc because they couldn't make their reservation,
and in the meantime other threads would come in and use the space that was
getting freed up, so when the original thread went to check to see if they had
space they didn't and they'd return ENOSPC. So instead if we have some free
space but not enough for our reservation, take the reservation and then start
doing the flushing. The only time we don't take reservations is when we've
already overcommitted our space, that way we don't have people who come late to
the party way overcommitting ourselves. This also moves all of the retrying and
flushing code into reserve_metdata_bytes so it's all uniform. This keeps my
fs_mark test from returning -ENOSPC as soon as it starts and actually lets me
fill up the disk. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Because the ENOSPC code over reserves super aggressively we end up allocating
chunks way more often than we should. For example with my fs_mark tests on a
2gb fs I can end up reserved 1gb just for metadata, when only 34mb of that is
being used. So instead check to see if the amount of space actually used is
less than 30% of the total space, and if so don't allocate a chunk, but only if
we have at least 256mb of free space to make sure we don't put too much pressure
on free space.
Signed-off-by: Josef Bacik <josef@redhat.com>
Currently we try and flush delalloc, but we only do that in a sort of weak way,
which works fine in most cases but if we're under heavy pressure we need to be
able to wait for flushing to happen. Also instead of checking the bytes
reserved in the block_rsv, check the space info since it is more accurate. The
sync option will be used in a future patch.
Signed-off-by: Josef Bacik <josef@redhat.com>
The global reservation stuff tries to add together DATA and METADATA used in
order to figure out how much to reserve for everything, but this doesn't work
right for mixed block groups. Instead if we have mixed block groups just set
data used to 0. Also with mixed block groups we will use bytes_may_use for
keeping track of delalloc bytes, so we need to take that into account in our
reservation calculations.
Signed-off-by: Josef Bacik <josef@redhat.com>
The new ENOSPC stuff breaks out the raid types which breaks the way we were
reporting df to the system. This fixes it back so that Available is the total
space available to data and used is the actual bytes used by the filesystem.
This means that Available is Total - data used - all of the metadata space.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
The new ENOSPC stuff broke the df ioctl since we no longer create seperate space
info's for each RAID type. So instead, loop through each space info's raid
lists so we can get the right RAID information which will allow the df ioctl to
tell us RAID types again. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
In very severe ENOSPC cases we can run out of inodes to do delalloc on, which
means we'll just keep looping trying to shrink delalloc. Instead, if we fail to
shrink delalloc 3 times in a row break out since we're not likely to make any
progress. Tested this with a 100mb fs an xfstests test 13. Before the patch it
would hang the box, with the patch we get -ENOSPC like we should. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
* 'for-2.6.36' of git://git.kernel.dk/linux-2.6-block: (149 commits)
block: make sure that REQ_* types are seen even with CONFIG_BLOCK=n
xen-blkfront: fix missing out label
blkdev: fix blkdev_issue_zeroout return value
block: update request stacking methods to support discards
block: fix missing export of blk_types.h
writeback: fix bad _bh spinlock nesting
drbd: revert "delay probes", feature is being re-implemented differently
drbd: Initialize all members of sync_conf to their defaults [Bugz 315]
drbd: Disable delay probes for the upcomming release
writeback: cleanup bdi_register
writeback: add new tracepoints
writeback: remove unnecessary init_timer call
writeback: optimize periodic bdi thread wakeups
writeback: prevent unnecessary bdi threads wakeups
writeback: move bdi threads exiting logic to the forker thread
writeback: restructure bdi forker loop a little
writeback: move last_active to bdi
writeback: do not remove bdi from bdi_list
writeback: simplify bdi code a little
writeback: do not lose wake-ups in bdi threads
...
Fixed up pretty trivial conflicts in drivers/block/virtio_blk.c and
drivers/scsi/scsi_error.c as per Jens.
BTRFS does not define a '->write_super()' method, so it should
not mark its superblock as dirty. This looks like some left-over.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Acked-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
NB: do we want btrfs_wait_ordered_range() on eviction of
inodes with positive i_nlink on subvolume with zero root_refs?
If not, btrfs_evict_inode() can be simplified by unconditionally
bailing out in case of i_nlink > 0 in the very beginning...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
add I_CLEAR instead of replacing I_FREEING with it. I_CLEAR is
equivalent to I_FREEING for almost all code looking at either;
it's there to keep track of having called clear_inode() exactly
once per inode lifetime, at some point after having set I_FREEING.
I_CLEAR and I_FREEING never get set at the same time with the
current code, so we can switch to setting i_flags to I_FREEING | I_CLEAR
instead of I_CLEAR without loss of information. As the result of
such change, checks become simpler and the amount of code that needs
to know about I_CLEAR shrinks a lot.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Replace inode_setattr with opencoded variants of it in all callers. This
moves the remaining call to vmtruncate into the filesystem methods where it
can be replaced with the proper truncate sequence.
In a few cases it was obvious that we would never end up calling vmtruncate
so it was left out in the opencoded variant:
spufs: explicitly checks for ATTR_SIZE earlier
btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier
ufs: contains an opencoded simple_seattr + truncate that sets the filesize just above
In addition to that ncpfs called inode_setattr with handcrafted iattrs,
which allowed to trim down the opencoded variant.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Remove the current bio flags and reuse the request flags for the bio, too.
This allows to more easily trace the type of I/O from the filesystem
down to the block driver. There were two flags in the bio that were
missing in the requests: BIO_RW_UNPLUG and BIO_RW_AHEAD. Also I've
renamed two request flags that had a superflous RW in them.
Note that the flags are in bio.h despite having the REQ_ name - as
blkdev.h includes bio.h that is the only way to go for now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
1. The BTRFS_IOC_CLONE and BTRFS_IOC_CLONE_RANGE ioctls should check
whether the donor file is append-only before writing to it.
2. The BTRFS_IOC_CLONE_RANGE ioctl appears to have an integer
overflow that allows a user to specify an out-of-bounds range to copy
from the source file (if off + len wraps around). I haven't been able
to successfully exploit this, but I'd imagine that a clever attacker
could use this to read things he shouldn't. Even if it's not
exploitable, it couldn't hurt to be safe.
Signed-off-by: Dan Rosenberg <dan.j.rosenberg@gmail.com>
cc: stable@kernel.org
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The CLONE and CLONE_RANGE ioctls round up the range of extents being
cloned to the block size when the range to clone extends to the end of file
(this is always the case with CLONE). It was then using that offset when
extending the destination file's i_size. Fix this by not setting i_size
beyond the originally requested ending offset.
This bug was introduced by a22285a6 (2.6.35-rc1).
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
split_leaf was not properly balancing leaves when it was forced to
split a leaf twice. This commit adds an extra push left and right
before forcing the double split in hopes of getting the slot where
we want to insert at either the start or end of the leaf.
If the extra pushes do work, then we are able to avoid splitting twice
and we keep the tree properly balanced.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This was just an odd wrapper around writeback_inodes_wb. Removing this
also allows to get rid of the bdi member of struct writeback_control
which was rather out of place there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: The file argument for fsync() is never null
Btrfs: handle ERR_PTR from posix_acl_from_xattr()
Btrfs: avoid BUG when dropping root and reference in same transaction
Btrfs: prohibit a operation of changing acl's mask when noacl mount option used
Btrfs: should add a permission check for setfacl
Btrfs: btrfs_lookup_dir_item() can return ERR_PTR
Btrfs: btrfs_read_fs_root_no_name() returns ERR_PTRs
Btrfs: unwind after btrfs_start_transaction() errors
Btrfs: btrfs_iget() returns ERR_PTR
Btrfs: handle kzalloc() failure in open_ctree()
Btrfs: handle error returns from btrfs_lookup_dir_item()
Btrfs: Fix BUG_ON for fs converted from extN
Btrfs: Fix null dereference in relocation.c
Btrfs: fix remap_file_pages error
Btrfs: uninitialized data is check_path_shared()
Btrfs: fix fallocate regression
Btrfs: fix loop device on top of btrfs
The "file" argument for fsync is never null so we can remove this check.
What drew my attention here is that 7ea8085910: "drop unused dentry
argument to ->fsync" introduced an unconditional dereference at the
start of the function and that generated a smatch warning.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
posix_acl_from_xattr() returns both ERR_PTRs and null, but it's OK to
pass null values to set_cached_acl()
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If btrfs_ioctl_snap_destroy() deletes a snapshot but finishes
with end_transaction(), the cleaner kthread may come in and
drop the root in the same transaction. If that's the case, the
root's refs still == 1 in the tree when btrfs_del_root() deletes
the item, because commit_fs_roots() hasn't updated it yet (that
happens during the commit).
This wasn't a problem before only because
btrfs_ioctl_snap_destroy() would commit the transaction before dropping
the dentry reference, so the dead root wouldn't get queued up until
after the fs root item was updated in the btree.
Since it is not an error to drop the root reference and the root in the
same transaction, just drop the BUG_ON() in btrfs_del_root().
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
when used Posix File System Test Suite(pjd-fstest) to test btrfs,
some cases about setfacl failed when noacl mount option used.
I simplified used commands in pjd-fstest, and the following steps
can reproduce it.
------------------------
# cd btrfs-part/
# mkdir aaa
# setfacl -m m::rw aaa <- successed, but not expected by pjd-fstest.
------------------------
I checked ext3, a warning message occured, like as:
setfacl: aaa/: Operation not supported
Certainly, it's expected by pjd-fstest.
So, i compared acl.c of btrfs and ext3. Based on that, a patch created.
Fortunately, it works.
Signed-off-by: Shi Weihua <shiwh@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
On btrfs, do the following
------------------
# su user1
# cd btrfs-part/
# touch aaa
# getfacl aaa
# file: aaa
# owner: user1
# group: user1
user::rw-
group::rw-
other::r--
# su user2
# cd btrfs-part/
# setfacl -m u::rwx aaa
# getfacl aaa
# file: aaa
# owner: user1
# group: user1
user::rwx <- successed to setfacl
group::rw-
other::r--
------------------
but we should prohibit it that user2 changing user1's acl.
In fact, on ext3 and other fs, a message occurs:
setfacl: aaa: Operation not permitted
This patch fixed it.
Signed-off-by: Shi Weihua <shiwh@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_lookup_dir_item() can return either ERR_PTRs or null.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_read_fs_root_no_name() returns ERR_PTRs on error so I added a
check for that. It's not clear to me if it can also return NULL
pointers or not so I left the original NULL pointer check as is.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This was added by a22285a6a3: "Btrfs: Integrate metadata reservation
with start_transaction". If we goto out here then we skip all the
unwinding and there are locks still held etc.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_iget() returns an ERR_PTR() on failure and not null.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Unwind and return -ENOMEM if the allocation fails here.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If btrfs_lookup_dir_item() fails, we should can just let the mount fail
with an error.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Tree blocks can live in data block groups in FS converted from extN.
So it's easy to trigger the BUG_ON.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Fix a potential null dereference in relocation.c
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Acked-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
when we use remap_file_pages() to remap a file, remap_file_pages always return
error. It is because btrfs didn't set VM_CAN_NONLINEAR for vma.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
refs can be used with uninitialized data if btrfs_lookup_extent_info()
fails on the first pass through the loop. In the original code if that
happens then check_path_shared() probably returns 1, this patch
changes it to return 1 for safety.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Seems that when btrfs_fallocate was converted to use the new ENOSPC stuff we
dropped passing the mode to the function that actually does the preallocation.
This breaks anybody who wants to use FALLOC_FL_KEEP_SIZE. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We cannot use the loop device which has been connected to a file in the btrf
The reproduce steps is following:
# dd if=/dev/zero of=vdev0 bs=1M count=1024
# losetup /dev/loop0 vdev0
# mkfs.btrfs /dev/loop0
...
failed to zero device start -5
The reason is that the btrfs don't implement either ->write_begin or ->write
the VFS API, so we fix it by setting ->write to do_sync_write().
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (27 commits)
Btrfs: add more error checking to btrfs_dirty_inode
Btrfs: allow unaligned DIO
Btrfs: drop verbose enospc printk
Btrfs: Fix block generation verification race
Btrfs: fix preallocation and nodatacow checks in O_DIRECT
Btrfs: avoid ENOSPC errors in btrfs_dirty_inode
Btrfs: move O_DIRECT space reservation to btrfs_direct_IO
Btrfs: rework O_DIRECT enospc handling
Btrfs: use async helpers for DIO write checksumming
Btrfs: don't walk around with task->state != TASK_RUNNING
Btrfs: do aio_write instead of write
Btrfs: add basic DIO read/write support
direct-io: do not merge logically non-contiguous requests
direct-io: add a hook for the fs to provide its own submit_bio function
fs: allow short direct-io reads to be completed via buffered IO
Btrfs: Metadata ENOSPC handling for balance
Btrfs: Pre-allocate space for data relocation
Btrfs: Metadata ENOSPC handling for tree log
Btrfs: Metadata reservation for orphan inodes
Btrfs: Introduce global metadata reservation
...
The ENOSPC code will now return ENOSPC to btrfs_start_transaction.
btrfs_dirty_inode needs to check for this and error out appropriately.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In order to support DIO that isn't aligned to the filesystem blocksize,
we fall back to buffered for any unaligned DIOs.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
After the path is released, the generation number got from block
pointer is no long valid. The race may cause disk corruption, because
verify_parent_transid() calls clear_extent_buffer_uptodate() when
generation numbers mismatch.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The O_DIRECT code wasn't checking for multiple references
on preallocated or nodatacow extents. This means it
wasn't honoring snapshots properly.
The fix here is to add an explicit check for multiple references
This also fixes the math for selecting the correct disk block,
making sure not to go past the end of the extent.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_dirty_inode tries to sneak in without much waiting or
space reservation, mostly for performance reasons. This
usually works well but can cause problems when there are
many many writers.
When btrfs_update_inode fails with ENOSPC, we fallback
to a slower btrfs_start_transaction call that will reserve
some space.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This moves the delalloc space reservation done for O_DIRECT
into btrfs_direct_IO. This way we don't leak reserved space
if the generic O_DIRECT write code errors out before it
calls into btrfs_direct_IO.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This changes O_DIRECT write code to mark extents as delalloc
while it is processing them. Yan Zheng has reworked the
enospc accounting based on tracking delalloc extents and
this makes it much easier to track enospc in the O_DIRECT code.
There are a few space cases with the O_DIRECT code though,
it only sets the EXTENT_DELALLOC bits, instead of doing
EXTENT_DELALLOC | EXTENT_DIRTY | EXTENT_UPTODATE, because
we don't want to mess with clearing the dirty and uptodate
bits when things go wrong. This is important because there
are no pages in the page cache, so any extent state structs
that we put in the tree won't get freed by releasepage. We have
to clear them ourselves as the DIO ends.
With this commit, we reserve space at in btrfs_file_aio_write,
and then as each btrfs_direct_IO call progresses it sets
EXTENT_DELALLOC on the range.
btrfs_get_blocks_direct is responsible for clearing the delalloc
at the same time it drops the extent lock.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This adds:
alias: devname:<name>
to some common kernel modules, which will allow the on-demand loading
of the kernel module when the device node is accessed.
Ideally all these modules would be compiled-in, but distros seems too
much in love with their modularization that we need to cover the common
cases with this new facility. It will allow us to remove a bunch of pretty
useless init scripts and modprobes from init scripts.
The static device node aliases will be carried in the module itself. The
program depmod will extract this information to a file in the module directory:
$ cat /lib/modules/2.6.34-00650-g537b60d-dirty/modules.devname
# Device nodes to trigger on-demand module loading.
microcode cpu/microcode c10:184
fuse fuse c10:229
ppp_generic ppp c108:0
tun net/tun c10:200
dm_mod mapper/control c10:235
Udev will pick up the depmod created file on startup and create all the
static device nodes which the kernel modules specify, so that these modules
get automatically loaded when the device node is accessed:
$ /sbin/udevd --debug
...
static_dev_create_from_modules: mknod '/dev/cpu/microcode' c10:184
static_dev_create_from_modules: mknod '/dev/fuse' c10:229
static_dev_create_from_modules: mknod '/dev/ppp' c108:0
static_dev_create_from_modules: mknod '/dev/net/tun' c10:200
static_dev_create_from_modules: mknod '/dev/mapper/control' c10:235
udev_rules_apply_static_dev_perms: chmod '/dev/net/tun' 0666
udev_rules_apply_static_dev_perms: chmod '/dev/fuse' 0666
A few device nodes are switched to statically allocated numbers, to allow
the static nodes to work. This might also useful for systems which still run
a plain static /dev, which is completely unsafe to use with any dynamic minor
numbers.
Note:
The devname aliases must be limited to the *common* and *single*instance*
device nodes, like the misc devices, and never be used for conceptually limited
systems like the loop devices, which should rather get fixed properly and get a
control node for losetup to talk to, instead of creating a random number of
device nodes in advance, regardless if they are ever used.
This facility is to hide the mess distros are creating with too modualized
kernels, and just to hide that these modules are not compiled-in, and not to
paper-over broken concepts. Thanks! :)
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: David S. Miller <davem@davemloft.net>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Alasdair G Kergon <agk@redhat.com>
Cc: Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
Cc: Ian Kent <raven@themaw.net>
Signed-Off-By: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The async helper threads offload crc work onto all the
CPUs, and make streaming writes much faster. This
changes the O_DIRECT write code to use them. The only
small complication was that we need to pass in the
logical offset in the file for each bio, because we can't
find it in the bio's pages.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Yan Zheng noticed two places we were doing a lot of work
without task->state set to TASK_RUNNING. This sets the state
properly after we get ready to sleep but decide not to.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In order for AIO to work, we need to implement aio_write. This patch converts
our btrfs_file_write to btrfs_aio_write. I've tested this with xfstests and
nothing broke, and the AIO stuff magically started working. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This provides basic DIO support for reading and writing. It does not do the
work to recover from mismatching checksums, that will come later. A few design
changes have been made from Jim's code (sorry Jim!)
1) Use the generic direct-io code. Jim originally re-wrote all the generic DIO
code in order to account for all of BTRFS's oddities, but thanks to that work it
seems like the best bet is to just ignore compression and such and just opt to
fallback on buffered IO.
2) Fallback on buffered IO for compressed or inline extents. Jim's code did
it's own buffering to make dio with compressed extents work. Now we just
fallback onto normal buffered IO.
3) Use ordered extents for the writes so that all of the
lock_extent()
lookup_ordered()
type checks continue to work.
4) Do the lock_extent() lookup_ordered() loop in readpage so we don't race with
DIO writes.
I've tested this with fsx and everything works great. This patch depends on my
dio and filemap.c patches to work. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch adds metadata ENOSPC handling for the balance code.
It is consisted by following major changes:
1. Avoid COW tree leave in the phrase of merging tree.
2. Handle interaction with snapshot creation.
3. make the backref cache can live across transactions.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Pre-allocate space for data relocation. This can detect ENOPSC
condition caused by fragmentation of free space.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Previous patches make the allocater return -ENOSPC if there is no
unreserved free metadata space. This patch updates tree log code
and various other places to propagate/handle the ENOSPC error.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Reserve metadata space for extent tree, checksum tree and root tree
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Introduce metadata reservation context for delayed allocation
and update various related functions.
This patch also introduces EXTENT_FIRST_DELALLOC control bit for
set/clear_extent_bit. It tells set/clear_bit_hook whether they
are processing the first extent_state with EXTENT_DELALLOC bit
set. This change is important if set/clear_extent_bit involves
multiple extent_state.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Besides simplify the code, this change makes sure all metadata
reservation for normal metadata operations are released after
committing transaction.
Changes since V1:
Add code that check if unlink and rmdir will free space.
Add ENOSPC handling for clone ioctl.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Introducing metadata reseravtion contexts has two major advantages.
First, it makes metadata reseravtion more traceable. Second, it can
reclaim freed space and re-add them to the itself after transaction
committed.
Besides add btrfs_block_rsv structure and related helper functions,
This patch contains following changes:
Move code that decides if freed tree block should be pinned into
btrfs_free_tree_block().
Make space accounting more accurate, mainly for handling read only
block groups.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
All code in init_btrfs_i can be moved into btrfs_alloc_inode()
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Shrink delayed allocation space in a synchronized manner is more
controllable than flushing all delay allocated space in an async
thread.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We already have fs_info->chunk_mutex to avoid concurrent
chunk creation.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The size of reserved space is stored in space_info. If block groups
of different raid types are linked to separate space_info, changing
allocation profile will corrupt reserved space accounting.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (69 commits)
fix handling of offsets in cris eeprom.c, get rid of fake on-stack files
get rid of home-grown mutex in cris eeprom.c
switch ecryptfs_write() to struct inode *, kill on-stack fake files
switch ecryptfs_get_locked_page() to struct inode *
simplify access to ecryptfs inodes in ->readpage() and friends
AFS: Don't put struct file on the stack
Ban ecryptfs over ecryptfs
logfs: replace inode uid,gid,mode initialization with helper function
ufs: replace inode uid,gid,mode initialization with helper function
udf: replace inode uid,gid,mode init with helper
ubifs: replace inode uid,gid,mode initialization with helper function
sysv: replace inode uid,gid,mode initialization with helper function
reiserfs: replace inode uid,gid,mode initialization with helper function
ramfs: replace inode uid,gid,mode initialization with helper function
omfs: replace inode uid,gid,mode initialization with helper function
bfs: replace inode uid,gid,mode initialization with helper function
ocfs2: replace inode uid,gid,mode initialization with helper function
nilfs2: replace inode uid,gid,mode initialization with helper function
minix: replace inode uid,gid,mode init with helper
ext4: replace inode uid,gid,mode init with helper
...
Trivial conflict in fs/fs-writeback.c (mark bitfields unsigned)
The patch just convert all blkdev_issue_xxx function to common
set of flags. Wait/allocation semantics preserved.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: make sure the chunk allocator doesn't create zero length chunks
Btrfs: fix data enospc check overflow
A recent commit allowed for smaller chunks to be created, but didn't
make sure they were always bigger than a stripe. After some divides,
this led to zero length stripes.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: add check for changed leaves in setup_leaf_for_split
Btrfs: create snapshot references in same commit as snapshot
Btrfs: fix small race with delalloc flushing waitqueue's
Btrfs: use add_to_page_cache_lru, use __page_cache_alloc
Btrfs: fix chunk allocate size calculation
Btrfs: kill max_extent mount option
Btrfs: fail to mount if we have problems reading the block groups
Btrfs: check btrfs_get_extent return for IS_ERR()
Btrfs: handle kmalloc() failure in inode lookup ioctl
Btrfs: dereferencing freed memory
Btrfs: Simplify num_stripes's calculation logical for __btrfs_alloc_chunk()
Btrfs: Add error handle for btrfs_search_slot() in btrfs_read_chunk_tree()
Btrfs: Remove unnecessary finish_wait() in wait_current_trans()
Btrfs: add NULL check for do_walk_down()
Btrfs: remove duplicate include in ioctl.c
Fix trivial conflict in fs/btrfs/compression.c due to slab.h include
cleanups.
Because we account for reserved space we get from the allocator before we
actually account for allocating delalloc space, we can have a small window where
the amount of "used" space in a space_info is more than the total amount of
space in the space_info. This will cause a overflow in our check, so it will
seem like we have _tons_ of free space, and we'll allow reservations to occur
that will end up larger than the amount of space we have. I've seen users
report ENOSPC panic's in cow_file_range a few times recently, so I tried to
reproduce this problem and found I could reproduce it if I ran one of my tests
in a loop for like 20 minutes. With this patch my test ran all night without
issues. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
setup_leaf_for_split needs to drop the path and search again, and has
checks to see if the item we want to split changed size. But, it misses
the case where the leaf changed and now has enough room for the item
we want to insert.
This adds an extra check to make sure the leaf really needs splitting
before we call btrfs_split_leaf(), which keeps us from trying to split
a leaf with a single item.
btrfs_split_leaf() will blindly split the single item leaf, leaving us
with one good leaf and one empty leaf and then a crash.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This creates the reference to a new snapshot in the same commit as the
snapshot itself. This avoids the need for a second commit in order for a
snapshot to be persistent, and also avoids the problem of "leaking" a
new snapshot tree root if the host crashes before the second commit takes
place.
It is not at all clear to me why it wasn't always done this way. If there
is still a reason for the two-stage {create,finish}_pending_snapshots()
approach I'm missing something! :)
I've been running this for a couple weeks under pretty heavy usage (a few
snapshots per minute) without obvious problems.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Everytime we start a new flushing thread, we init the waitqueue if there isn't a
flushing thread running. The problem with this is we check
space_info->flushing, which we clear right before doing a wake_up on the
flushing waitqueue, which causes problems if we init the waitqueue in the middle
of clearing the flushing flagh and calling wake_up. This is hard to hit, but
the code is wrong anyway, so init the flushing/allocating waitqueue when
creating the space info and let it be. I haven't seen the panic since I've been
using this patch. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Pagecache pages should be allocated with __page_cache_alloc, so they
obey pagecache memory policies.
add_to_page_cache_lru is exported, so it should be used. Benefits over
using a private pagevec: neater code, 128 bytes fewer stack used, percpu
lru ordering is preserved, and finally don't need to flush pagevec
before returning so batching may be shared with other LRU insertions.
Signed-off-by: Nick Piggin <npiggin@suse.de>:
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If the amount of free space left in a device is less than what we think should
be the minimum size, just ignore the minimum size and use the amount we have. I
ran into this running tests on a 600mb volume, the chunk allocator wouldn't let
me allocate the last 52mb of the disk for data because we want to have at least
64mb chunks for data. This patch fixes that problem. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
As Yan pointed out, theres not much reason for all this complicated math to
account for file extents being split up into max_extent chunks, since they are
likely to all end up in the same leaf anyway. Since there isn't much reason to
use max_extent, just remove the option altogether so we have one less thing we
need to test.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We don't actually check the return value of btrfs_read_block_groups, so we can
possibly succeed to mount, but then fail to say read the superblock xattr for
selinux which will cause the vfs code to deactivate the super.
This is a problem because in find_free_extent we just assume that we
will find the right space_info for the allocation we want. But if we
failed to read the block groups, we won't have setup any space_info's,
and we'll hit a NULL pointer deref in find_free_extent.
This patch fixes that problem by checking the return value of
btrfs_read_block_groups, and failing out properly. I've also added a
check in find_free_extent so if for some reason we don't find an
appropriate space_info, we just return -ENOSPC.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_get_extent() never returns NULL, only a valid pointer or ERR_PTR()
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The original code dereferenced range on the next line.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We can use this simple method to make source more readable.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We need to check return value of btrfs_search_slot() in
btrfs_read_chunk_tree() and do corresponding error handing.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We only need to call finish_wait() after wait loop.
By the way, this patch makes code of waiting loop similar to
example in wait.h(no functional change)
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_find_create_tree_block() may return NULL, so we must check the returned
value, or we will access a NULL pointer.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
fs/btrfs/ioctl.c: ctree.h is included more than once.
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the followings.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (30 commits)
Btrfs: fix the inode ref searches done by btrfs_search_path_in_tree
Btrfs: allow treeid==0 in the inode lookup ioctl
Btrfs: return keys for large items to the search ioctl
Btrfs: fix key checks and advance in the search ioctl
Btrfs: buffer results in the space_info ioctl
Btrfs: use __u64 types in ioctl.h
Btrfs: fix search_ioctl key advance
Btrfs: fix gfp flags masking in the compression code
Btrfs: don't look at bio flags after submit_bio
btrfs: using btrfs_stack_device_id() get devid
btrfs: use memparse
Btrfs: add a "df" ioctl for btrfs
Btrfs: cache the extent state everywhere we possibly can V2
Btrfs: cache ordered extent when completing io
Btrfs: cache extent state in find_delalloc_range
Btrfs: change the ordered tree to use a spinlock instead of a mutex
Btrfs: finish read pages in the order they are submitted
btrfs: fix btrfs_mkdir goto for no free objectids
Btrfs: flush data on snapshot creation
Btrfs: make df be a little bit more understandable
...
This is used by the inode lookup ioctl to follow all the backrefs up
to the subvol root. But the search being done would sometimes land one
past the last item in the leaf instead of finding the backref.
This changes the search to look for the highest possible backref and hop
back one item. It also fixes a leaked path on failure to find the root.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When a root id of 0 is sent to the inode lookup ioctl, it will
use the root of the file we're ioctling and pass the root id
back to userland along with the results.
This allows userland to do searches based on that root later on.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The search ioctl was skipping large items entirely (ones that are too
big for the results buffer). This changes things to at least copy
the item header so that we can send information about the item back to
userland.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The search ioctl was working well for finding tree roots, but using it for
generic searches requires a few changes to how the keys are advanced.
This treats the search control min fields for objectid, type and offset
more like a key, where we drop the offset to zero once we bump the type,
etc.
The downside of this is that we are changing the min_type and min_offset
fields during the search, and so the ioctl caller needs extra checks to make sure
the keys in the result are the ones it wanted.
This also changes key_in_sk to use btrfs_comp_cpu_keys, just to make
things more readable.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The space_info ioctl was using copy_to_user inside rcu_read_lock. This
commit changes things to copy into a buffer first and then dump the
result down to userland.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
key->type is u8, not u64.
fs/btrfs/ioctl.c: In function 'copy_to_sk':
fs/btrfs/ioctl.c:1024: warning: comparison is always true due to limited range of data type
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
After callling submit_bio, the bio can be freed at any time. The
btrfs submission thread helper was checking the bio flags too late,
which might not give the correct answer.
When CONFIG_DEBUG_PAGE_ALLOC is turned on, it can lead to oopsen.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We can use btrfs_stack_device_id() to get dev_item->devid
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Use memparse() instead of its own private implementation.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: linux-btrfs@vger.kernel.org
Signed-off-by: Chris Mason <chris.mason@oracle.com>