Commit Graph

291 Commits

Author SHA1 Message Date
Linus Torvalds
a22180d266 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs update from Chris Mason:
 "A big set of fixes and features.

  In terms of line count, most of the code comes from Stefan, who added
  the ability to replace a single drive in place.  This is different
  from how btrfs normally replaces drives, and is much much much faster.

  Josef is plowing through our synchronous write performance.  This pull
  request does not include the DIO_OWN_WAITING patch that was discussed
  on the list, but it has a number of other improvements to cut down our
  latencies and CPU time during fsync/O_DIRECT writes.

  Miao Xie has a big series of fixes and is spreading out ordered
  operations over more CPUs.  This improves performance and reduces
  contention.

  I've put in fixes for error handling around hash collisions.  These
  are going back to individual stable kernels as I test against them.

  Otherwise we have a lot of fixes and cleanups, thanks everyone!
  raid5/6 is being rebased against the device replacement code.  I'll
  have it posted this Friday along with a nice series of benchmarks."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (115 commits)
  Btrfs: fix a bug of per-file nocow
  Btrfs: fix hash overflow handling
  Btrfs: don't take inode delalloc mutex if we're a free space inode
  Btrfs: fix autodefrag and umount lockup
  Btrfs: fix permissions of empty files not affected by umask
  Btrfs: put raid properties into global table
  Btrfs: fix BUG() in scrub when first superblock reading gives EIO
  Btrfs: do not call file_update_time in aio_write
  Btrfs: only unlock and relock if we have to
  Btrfs: use tokens where we can in the tree log
  Btrfs: optimize leaf_space_used
  Btrfs: don't memset new tokens
  Btrfs: only clear dirty on the buffer if it is marked as dirty
  Btrfs: move checks in set_page_dirty under DEBUG
  Btrfs: log changed inodes based on the extent map tree
  Btrfs: add path->really_keep_locks
  Btrfs: do not mark ems as prealloc if we are writing to them
  Btrfs: keep track of the extents original block length
  Btrfs: inline csums if we're fsyncing
  Btrfs: don't bother copying if we're only logging the inode
  ...
2012-12-18 09:42:05 -08:00
Andrew Morton
965c8e59cf lseek: the "whence" argument is called "whence"
But the kernel decided to call it "origin" instead.  Fix most of the
sites.

Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:12 -08:00
Josef Bacik
6c760c0724 Btrfs: do not call file_update_time in aio_write
This starts a transaction and dirties the inode everytime we call it, which
is super expensive if you have a write heavy workload.  We will be updating
the inode when the IO completes and we reserve the space for the inode
update when we reserve space for the write, so there is no chance of loss of
information or enospc issues.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-16 20:46:27 -05:00
Josef Bacik
70c8a91ce2 Btrfs: log changed inodes based on the extent map tree
We don't really need to copy extents from the source tree since we have all
of the information already available to us in the extent_map tree.  So
instead just write the extents straight to the log tree and don't bother to
copy the extent items from the source tree.

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-16 20:46:24 -05:00
Josef Bacik
b493968096 Btrfs: keep track of the extents original block length
If we've written to a prealloc extent we need to know the original block len
for the extent.  We can't figure this out currently since ->block_len is
just set to the extent length.  So introduce ->orig_block_len so that we
know how many bytes were in the original extent for proper extent logging
that future patches will need.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-16 20:46:23 -05:00
Josef Bacik
b812ce2879 Btrfs: inline csums if we're fsyncing
The tree logging stuff needs the csums to be on the ordered extents in order
to log them properly, so mark that we're sync and inline the csum creation
so we don't have to wait on the csumming to be done when logging extents
that are still in flight.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-16 20:46:22 -05:00
Miao Xie
7426cc04d4 Btrfs: punch hole past the end of the file
Since we can pre-allocate the space past EOF, we should be able to reclaim
that space if we need. This patch implements it by removing the EOF check.

Though the manual of fallocate command says we can use truncate command to
reclaim the pre-allocated space which past EOF, but because truncate command
changes the file size, we must run several commands to reclaim the space if we
don't want to change the file size, so it is not a good choice.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-16 20:46:20 -05:00
Miao Xie
0061280d2c Btrfs: fix the page that is beyond EOF
Steps to reproduce:
 # mkfs.btrfs <disk>
 # mount <disk> <mnt>
 # dd if=/dev/zero of=<mnt>/<file> bs=512 seek=5 count=8
 # fallocate -p -o 2048 -l 16384 <mnt>/<file>
 # dd if=/dev/zero of=<mnt>/<file> bs=4096 seek=3 count=8 conv=notrunc,nocreat
 # umount <mnt>
 # dmesg
 WARNING: at fs/btrfs/inode.c:7140 btrfs_destroy_inode+0x2eb/0x330

The reason is that we inputed a range which is beyond the end of the file. And
because the end of this range was not page-aligned, we had to truncate the last
page in this range, this operation is similar to a buffered file write. In other
words, we reserved enough space and clear the data which was in the hole range
on that page. But when we expanded that test file, write the data into the same
page, we forgot that we have reserved enough space for the buffered write of
that page because in most cases there is no page that is beyond the end of
the file. As a result, we reserved the space twice.

In fact, we needn't truncate the page if it is beyond the end of the file, just
release the allocated space in that range. Fix the above problem by this way.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-16 20:46:19 -05:00
Miao Xie
6347b3c433 Btrfs: fix off-by-one error of the same page check in btrfs_punch_hole()
(start + len) is the start of the adjacent extent, not the end of the current
extent, so we should not use it to check the hole is on the same page or not.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-16 20:46:19 -05:00
Miao Xie
0ff6fabdb0 Btrfs: fix off-by-one error of the reserved size of btrfs_allocate()
alloc_end is not the real end of the current extent, it is the start of the
next adjoining extent. So we needn't +1 when calculating the size the space
that is about to be reserved.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-16 20:46:15 -05:00
Miao Xie
797f427711 Btrfs: use existing align macros in btrfs_allocate()
The kernel developers have implemented some often-used align macros, we should
use them instead of the complex code.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-16 20:46:14 -05:00
Miao Xie
b66f00da0c Btrfs: fix freeze vs auto defrag
If we freeze the fs, the auto defragment should not run. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-16 20:46:12 -05:00
Miao Xie
26176e7c2a Btrfs: restructure btrfs_run_defrag_inodes()
This patch restructure btrfs_run_defrag_inodes() and make the code of the auto
defragment more readable.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-16 20:46:12 -05:00
Miao Xie
8ddc473433 Btrfs: fix unprotected defragable inode insertion
We forget to get the defrag lock when we re-add the defragable inode,
Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-16 20:46:12 -05:00
Miao Xie
9247f3170b Btrfs: use slabs for auto defrag allocation
The auto defrag allocation is in the fast path of the IO, so use slabs
to improve the speed of the allocation.

And besides that, it can do check for leaked objects when the module is removed.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-16 20:46:11 -05:00
Liu Bo
b53d3f5db2 Btrfs: cleanup for btrfs_btree_balance_dirty
- 'nr' is no more used.
- btrfs_btree_balance_dirty() and __btrfs_btree_balance_dirty() can share
  a bunch of code.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:28 -05:00
Tsutomu Itoh
e1f5790e05 Btrfs: set hole punching time properly
Even if the hole punching is executed, the modification time of the
file is not updated.
So, current time is set to inode.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:26 -05:00
Liu Bo
9f3959c53d Btrfs: get right arguments for btrfs_wait_ordered_range
btrfs_wait_ordered_range expects for 'len' instead of 'end'.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:19 -05:00
Namjae Jeon
d0e1d66b5a writeback: remove nr_pages_dirtied arg from balance_dirty_pages_ratelimited_nr()
There is no reason to pass the nr_pages_dirtied argument, because
nr_pages_dirtied value from the caller is unused in
balance_dirty_pages_ratelimited_nr().

Signed-off-by: Namjae Jeon <linkinjeon@gmail.com>
Signed-off-by: Vivek Trivedi <vtrivedi018@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-11 17:22:21 -08:00
Linus Torvalds
72055425e5 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs update from Chris Mason:
 "This is a large pull, with the bulk of the updates coming from:

   - Hole punching

   - send/receive fixes

   - fsync performance

   - Disk format extension allowing more hardlinks inside a single
     directory (btrfs-progs patch required to enable the compat bit for
     this one)

  I'm cooking more unrelated RAID code, but I wanted to make sure this
  original batch makes it in.  The largest updates here are relatively
  old and have been in testing for some time."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (121 commits)
  btrfs: init ref_index to zero in add_inode_ref
  Btrfs: remove repeated eb->pages check in, disk-io.c/csum_dirty_buffer
  Btrfs: fix page leakage
  Btrfs: do not warn_on when we cannot alloc a page for an extent buffer
  Btrfs: don't bug on enomem in readpage
  Btrfs: cleanup pages properly when ENOMEM in compression
  Btrfs: make filesystem read-only when submitting barrier fails
  Btrfs: detect corrupted filesystem after write I/O errors
  Btrfs: make compress and nodatacow mount options mutually exclusive
  btrfs: fix message printing
  Btrfs: don't bother committing delayed inode updates when fsyncing
  btrfs: move inline function code to header file
  Btrfs: remove unnecessary IS_ERR in bio_readpage_error()
  btrfs: remove unused function btrfs_insert_some_items()
  Btrfs: don't commit instead of overcommitting
  Btrfs: confirmation of value is added before trace_btrfs_get_extent() is called
  Btrfs: be smarter about dropping things from the tree log
  Btrfs: don't lookup csums for prealloc extents
  Btrfs: cache extent state when writing out dirty metadata pages
  Btrfs: do not hold the file extent leaf locked when adding extent item
  ...
2012-10-10 10:49:20 +09:00
Konstantin Khlebnikov
0b173bc4da mm: kill vma flag VM_CAN_NONLINEAR
Move actual pte filling for non-linear file mappings into the new special
vma operation: ->remap_pages().

Filesystems must implement this method to get non-linear mapping support,
if it uses filemap_fault() then generic_file_remap_pages() can be used.

Now device drivers can implement this method and obtain nonlinear vma support.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>	#arch/tile
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Venkatesh Pallipadi <venki@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 16:22:17 +09:00
Josef Bacik
c3308f84c1 Btrfs: fix punch hole when no extent exists
I saw the warning in btrfs_drop_extent_cache where our end is less than our
start while running xfstests 68 in a loop.  This is because we
unconditionally do drop_end = min(end, extent_end) in
__btrfs_drop_extents(), even though we may not have found an extent in the
range we were looking to drop.  So keep track of wether or not we found
something, and if we didn't just use our end.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-04 09:40:00 -04:00
Miao Xie
90abccf2c6 Revert "Btrfs: do not do filemap_write_and_wait_range in fsync"
This reverts commit 0885ef5b56

After applying the above patch, the performance slowed down because the dirty
page flush can only be done by one task, so revert it.

The following is the test result of sysbench:
	Before		After
	24MB/s		39MB/s

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-01 15:19:22 -04:00
Liu Bo
9e8a4a8b0b Btrfs: use flag EXTENT_DEFRAG for snapshot-aware defrag
We're going to use this flag EXTENT_DEFRAG to indicate which range
belongs to defragment so that we can implement snapshow-aware defrag:

We set the EXTENT_DEFRAG flag when dirtying the extents that need
defragmented, so later on writeback thread can differentiate between
normal writeback and writeback started by defragmentation.

Original-Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
2012-10-01 15:19:15 -04:00
Miao Xie
903889f462 Btrfs: fix wrong size for the reservation when doing, file pre-allocation.
When we ran fsstress(a program in xfstests), the filesystem hung up when it
is full. It was because the space reserved in btrfs_fallocate() was wrong,
btrfs_fallocate() just used the size of the pre-allocation to reserve the
space, didn't took the block size aligning into account, so the size of
the reserved space was less than the allocated space, it caused the over
reserve problem and made the filesystem hung up when invoking cow_file_range().
Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-01 15:19:14 -04:00
Miao Xie
2ecb79239b Btrfs: fix unprotected ->log_batch
We forget to protect ->log_batch when syncing a file, this patch fix
this problem by atomic operation. And ->log_batch is used to check
if there are parallel sync operations or not, so it is unnecessary to
reset it to 0 after the sync operation of the current log tree complete.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-01 15:19:12 -04:00
Miao Xie
66d8f3dd1c Btrfs: add a new "type" field into the block reservation structure
Sometimes we need choose the method of the reservation according to the type
of the block reservation, such as the reservation for the delayed inode update.
Now we identify the type just by comparing the address of the reservation
variants, it is very ugly if it is a temporary one because we need compare it
with all the common reservation variants. So we add a new "type" field to keep
the type the reservation variants.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-01 15:19:11 -04:00
Josef Bacik
7014cdb493 Btrfs: btrfs_drop_extent_cache should never fail
I noticed this when I was doing the fsync stuff, we allocate split extents if we
drop an extent range that is in the middle of an existing extent.  This BUG()'s
if we fail to allocate memory, but the fact is this is just a cache, we will
just regenerate the cache if we need it, the important part is that we free the
range we are given.  This can be done without allocations, so if we fail to
allocate splits just skip the splitting stage and free our em and look for more
extents to drop.  This also makes btrfs_drop_extent_cache a void since nobody
was checking the return value anyway.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01 15:19:09 -04:00
Josef Bacik
2aaa665581 Btrfs: add hole punching
This patch adds hole punching via fallocate.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01 15:19:07 -04:00
Josef Bacik
2671485d39 Btrfs: remove unused hint byte argument for btrfs_drop_extents
I audited all users of btrfs_drop_extents and found that nobody actually uses
the hint_byte argument.  I'm sure it was used for something at some point but
it's not used now, and the way the pinning works the disk bytenr would never be
immediately useful anyway so lets just remove it.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01 15:19:06 -04:00
Josef Bacik
5dc562c541 Btrfs: turbo charge fsync
At least for the vm workload.  Currently on fsync we will

1) Truncate all items in the log tree for the given inode if they exist

and

2) Copy all items for a given inode into the log

The problem with this is that for things like VMs you can have lots of
extents from the fragmented writing behavior, and worst yet you may have
only modified a few extents, not the entire thing.  This patch fixes this
problem by tracking which transid modified our extent, and then when we do
the tree logging we find all of the extents we've modified in our current
transaction, sort them and commit them.  We also only truncate up to the
xattrs of the inode and copy that stuff in normally, and then just drop any
extents in the range we have that exist in the log already.  Here are some
numbers of a 50 meg fio job that does random writes and fsync()s after every
write

		Original	Patched
SATA drive	82KB/s		140KB/s
Fusion drive	431KB/s		2532KB/s

So around 2-6 times faster depending on your hardware.  There are a few
corner cases, for example if you truncate at all we have to do it the old
way since there is no way to be sure what is in the log is ok.  This
probably could be done smarter, but if you write-fsync-truncate-write-fsync
you deserve what you get.  All this work is in RAM of course so if your
inode gets evicted from cache and you read it in and fsync it we'll do it
the slow way if we are still in the same transaction that we last modified
the inode in.

The biggest cool part of this is that it requires no changes to the recovery
code, so if you fsync with this patch and crash and load an old kernel, it
will run the recovery and be a-ok.  I have tested this pretty thoroughly
with an fsync tester and everything comes back fine, as well as xfstests.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01 15:19:03 -04:00
Josef Bacik
224ecce517 Btrfs: fix possible corruption when fsyncing written prealloced extents
While working on my fsync patch my fsync tester kept hitting mismatching
md5sums when I would randomly write to a prealloc'ed region, syncfs() and
then write to the prealloced region some more and then fsync() and then
immediately reboot.  This is because the tree logging code will skip writing
csums for file extents who's generation is less than the current running
transaction.  When we mark extents as written we haven't been updating their
generation so they were always being skipped.  This wouldn't happen if you
were to preallocate and then write in the same transaction, but if you for
example prealloced a VM you could definitely run into this problem.  This
patch makes my fsync tester happy again.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01 15:19:02 -04:00
Jan Kara
b2b5ef5c8e btrfs: Convert to new freezing mechanism
We convert btrfs_file_aio_write() to use new freeze check.  We also add proper
freeze protection to btrfs_page_mkwrite(). We also add freeze protection to
the transaction mechanism to avoid starting transactions on frozen filesystem.
At minimum this is necessary to stop iput() of unlinked file to change frozen
filesystem during truncation.

Checks in cleaner_kthread() and transaction_kthread() can be safely removed
since btrfs_freeze() will lock the mutexes and thus block the threads (and they
shouldn't have anything to do anyway).

CC: linux-btrfs@vger.kernel.org
CC: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-31 09:45:52 +04:00
Linus Torvalds
5eecb9cc90 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "I held off on my rc5 pull because I hit an oops during log recovery
  after a crash.  I wanted to make sure it wasn't a regression because
  we have some logging fixes in here.

  It turns out that a commit during the merge window just made it much
  more likely to trigger directory logging instead of full commits,
  which exposed an old bug.

  The new backref walking code got some additional fixes.  This should
  be the final set of them.

  Josef fixed up a corner where our O_DIRECT writes and buffered reads
  could expose old file contents (not stale, just not the most recent).
  He and Liu Bo fixed crashes during tree log recover as well.

  Ilya fixed errors while we resume disk balancing operations on
  readonly mounts."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: run delayed directory updates during log replay
  Btrfs: hold a ref on the inode during writepages
  Btrfs: fix tree log remove space corner case
  Btrfs: fix wrong check during log recovery
  Btrfs: use _IOR for BTRFS_IOC_SUBVOL_GETFLAGS
  Btrfs: resume balance on rw (re)mounts properly
  Btrfs: restore restriper state on all mounts
  Btrfs: fix dio write vs buffered read race
  Btrfs: don't count I/O statistic read errors for missing devices
  Btrfs: resolve tree mod log locking issue in btrfs_next_leaf
  Btrfs: fix tree mod log rewind of ADD operations
  Btrfs: leave critical region in btrfs_find_all_roots as soon as possible
  Btrfs: always put insert_ptr modifications into the tree mod log
  Btrfs: fix tree mod log for root replacements at leaf level
  Btrfs: support root level changes in __resolve_indirect_ref
  Btrfs: avoid waiting for delayed refs when we must not
2012-07-05 13:06:25 -07:00
Josef Bacik
c3473e8300 Btrfs: fix dio write vs buffered read race
Miao pointed out there's a problem with mixing dio writes and buffered
reads.  If the read happens between us invalidating the page range and
actually locking the extent we can bring in pages into page cache.  Then
once the write finishes if somebody tries to read again it will just find
uptodate pages and we'll read stale data.  So we need to lock the extent and
check for uptodate bits in the range.  If there are uptodate bits we need to
unlock and invalidate again.  This will keep this race from happening since
we will hold the extent locked until we create the ordered extent, and then
teh read side always waits for ordered extents.  There was also a race in
how we updated i_size, previously we were relying on the generic DIO stuff
to adjust the i_size after the DIO had completed, but this happens outside
of the extent lock which means reads could come in and not see the updated
i_size.  So instead move this work into where we create the extents, and
then this way the update ordered i_size stuff works properly in the endio
handlers.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2012-07-02 15:36:23 -04:00
Linus Torvalds
1193755ac6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs changes from Al Viro.
 "A lot of misc stuff.  The obvious groups:
   * Miklos' atomic_open series; kills the damn abuse of
     ->d_revalidate() by NFS, which was the major stumbling block for
     all work in that area.
   * ripping security_file_mmap() and dealing with deadlocks in the
     area; sanitizing the neighborhood of vm_mmap()/vm_munmap() in
     general.
   * ->encode_fh() switched to saner API; insane fake dentry in
     mm/cleancache.c gone.
   * assorted annotations in fs (endianness, __user)
   * parts of Artem's ->s_dirty work (jff2 and reiserfs parts)
   * ->update_time() work from Josef.
   * other bits and pieces all over the place.

  Normally it would've been in two or three pull requests, but
  signal.git stuff had eaten a lot of time during this cycle ;-/"

Fix up trivial conflicts in Documentation/filesystems/vfs.txt (the
'truncate_range' inode method was removed by the VM changes, the VFS
update adds an 'update_time()' method), and in fs/btrfs/ulist.[ch] (due
to sparse fix added twice, with other changes nearby).

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (95 commits)
  nfs: don't open in ->d_revalidate
  vfs: retry last component if opening stale dentry
  vfs: nameidata_to_filp(): don't throw away file on error
  vfs: nameidata_to_filp(): inline __dentry_open()
  vfs: do_dentry_open(): don't put filp
  vfs: split __dentry_open()
  vfs: do_last() common post lookup
  vfs: do_last(): add audit_inode before open
  vfs: do_last(): only return EISDIR for O_CREAT
  vfs: do_last(): check LOOKUP_DIRECTORY
  vfs: do_last(): make ENOENT exit RCU safe
  vfs: make follow_link check RCU safe
  vfs: do_last(): use inode variable
  vfs: do_last(): inline walk_component()
  vfs: do_last(): make exit RCU safe
  vfs: split do_lookup()
  Btrfs: move over to use ->update_time
  fs: introduce inode operation ->update_time
  reiserfs: get rid of resierfs_sync_super
  reiserfs: mark the superblock as dirty a bit later
  ...
2012-06-01 10:34:35 -07:00
Josef Bacik
e41f941a23 Btrfs: move over to use ->update_time
Btrfs had been doing it's own file_update_time so we could catch ENOSPC
properly, so just update our btrfs_update_time to work with the new stuff and
then we'll be fancy later.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2012-06-01 12:07:52 -04:00
Josef Bacik
22ee6985de Btrfs: check to see if the inode is in the log before fsyncing
We have this check down in the actual logging code, but this is after we
start a transaction and all that good stuff.  So move the helper
inode_in_log() out so we can call it in fsync() and avoid starting a
transaction altogether and just exit if we've already fsync()'ed this file
recently.  You would notice this issue if you fsync()'ed a file over and
over again until the transaction committed.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2012-05-30 10:23:42 -04:00
Miao Xie
762f226326 Btrfs: fix the same inode id problem when doing auto defragment
Two files in the different subvolumes may have the same inode id, so
The rb-tree which is used to manage the defragment object must take it
into account. This patch fix this problem.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-05-30 10:23:38 -04:00
Josef Bacik
72ac3c0d79 Btrfs: convert the inode bit field to use the actual bit operations
Miao pointed this out while I was working on an orphan problem that messing
with a bitfield where different ranges are protected by different locks
doesn't work out right.  Turns out we've been doing this forever where we
have different parts of the bit field protected by either no lock at all or
different locks which could cause all sorts of weird problems including the
issue I was hitting.  So instead make a runtime_flags thing that we use the
normal bit operations on that are all atomic so we can keep having our
no/different locking for the different flags and then make force_compress
it's own thing so it can be treated normally.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2012-05-30 10:23:36 -04:00
Josef Bacik
0885ef5b56 Btrfs: do not do filemap_write_and_wait_range in fsync
We already do the btrfs_wait_ordered_range which will do this for us, so
just remove this call so we don't call it twice.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2012-05-30 10:23:29 -04:00
Josef Bacik
0c4d2d95d0 Btrfs: use i_version instead of our own sequence
We've been keeping around the inode sequence number in hopes that somebody
would use it, but nobody uses it and people actually use i_version which
serves the same purpose, so use i_version where we used the incore inode's
sequence number and that way the sequence is updated properly across the
board, and not just in file write.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2012-05-30 10:23:27 -04:00
Chris Mason
dc7fdde39e Btrfs: reduce lock contention during extent insertion
We're spending huge amounts of time on lock contention during
end_io processing because we unconditionally assume we are overwriting
an existing extent in the file for each IO.

This checks to see if we are outside i_size, and if so, it uses a
less expensive readonly search of the btree to look for existing
extents.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-04-27 14:51:05 -04:00
Jeff Mahoney
79787eaab4 btrfs: replace many BUG_ONs with proper error handling
btrfs currently handles most errors with BUG_ON. This patch is a work-in-
 progress but aims to handle most errors other than internal logic
 errors and ENOMEM more gracefully.

 This iteration prevents most crashes but can run into lockups with
 the page lock on occasion when the timing "works out."

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
2012-03-22 11:52:54 +01:00
Jeff Mahoney
d0082371cf btrfs: drop gfp_t from lock_extent
lock_extent and unlock_extent are always called with GFP_NOFS, drop the
 argument and use GFP_NOFS consistently.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
2012-03-22 01:45:35 +01:00
Linus Torvalds
855a85f704 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Quoth Chris:
 "This is later than I wanted because I got backed up running through
  btrfs bugs from the Oracle QA teams.  But they are all bug fixes that
  we've queued and tested since rc1.

  Nothing in particular stands out, this just reflects bug fixing and QA
  done in parallel by all the btrfs developers.  The most user visible
  of these is:

    Btrfs: clear the extent uptodate bits during parent transid failures

  Because that helps deal with out of date drives (say an iscsi disk
  that has gone away and come back).  The old code wasn't always
  properly retrying the other mirror for this type of failure."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (24 commits)
  Btrfs: fix compiler warnings on 32 bit systems
  Btrfs: increase the global block reserve estimates
  Btrfs: clear the extent uptodate bits during parent transid failures
  Btrfs: add extra sanity checks on the path names in btrfs_mksubvol
  Btrfs: make sure we update latest_bdev
  Btrfs: improve error handling for btrfs_insert_dir_item callers
  Btrfs: be less strict on finding next node in clear_extent_bit
  Btrfs: fix a bug on overcommit stuff
  Btrfs: kick out redundant stuff in convert_extent_bit
  Btrfs: skip states when they does not contain bits to clear
  Btrfs: check return value of lookup_extent_mapping() correctly
  Btrfs: fix deadlock on page lock when doing auto-defragment
  Btrfs: fix return value check of extent_io_ops
  btrfs: honor umask when creating subvol root
  btrfs: silence warning in raid array setup
  btrfs: fix structs where bitfields and spinlock/atomic share 8B word
  btrfs: delalloc for page dirtied out-of-band in fixup worker
  Btrfs: fix memory leak in load_free_space_cache()
  btrfs: don't check DUP chunks twice
  Btrfs: fix trim 0 bytes after a device delete
  ...
2012-02-24 09:02:53 -08:00
Jeff Liu
6af021d8fc Btrfs: return the internal error unchanged if btrfs_get_extent_fiemap() call failed for SEEK_DATA/SEEK_HOLE inquiry
Given that ENXIO only means "offset beyond EOF" for either SEEK_DATA or SEEK_HOLE inquiry
in a desired file range, so we should return the internal error unchanged if btrfs_get_extent_fiemap()
call failed, rather than ENXIO.

Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
2012-02-15 16:40:23 +01:00
Chris Mason
d98456fcaf Btrfs: don't reserve data with extents locked in btrfs_fallocate
btrfs_fallocate tries to allocate space only if ranges in the file don't
already exist.  But the enospc checks it does are not allowed with
extents locked.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-01-31 20:27:41 -05:00
Linus Torvalds
f9156c7288 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (62 commits)
  Btrfs: use larger system chunks
  Btrfs: add a delalloc mutex to inodes for delalloc reservations
  Btrfs: space leak tracepoints
  Btrfs: protect orphan block rsv with spin_lock
  Btrfs: add allocator tracepoints
  Btrfs: don't call btrfs_throttle in file write
  Btrfs: release space on error in page_mkwrite
  Btrfs: fix btrfsck error 400 when truncating a compressed
  Btrfs: do not use btrfs_end_transaction_throttle everywhere
  Btrfs: add balance progress reporting
  Btrfs: allow for resuming restriper after it was paused
  Btrfs: allow for canceling restriper
  Btrfs: allow for pausing restriper
  Btrfs: add skip_balance mount option
  Btrfs: recover balance on mount
  Btrfs: save balance parameters to disk
  Btrfs: soft profile changing mode (aka soft convert)
  Btrfs: implement online profile changing
  Btrfs: do not reduce profile in do_chunk_alloc()
  Btrfs: virtual address space subset filter
  ...

Fix up trivial conflict in fs/btrfs/ioctl.c due to the use of the new
mnt_drop_write_file() helper.
2012-01-17 15:49:54 -08:00
Josef Bacik
45a8090e62 Btrfs: don't call btrfs_throttle in file write
Btrfs_throttle will make us wait if there is a currently committing transaction
until we can open new transactions, which is ridiculous since we don't actually
start any transactions within the file write path anyway, so all this does is
introduce big latencies if we have a sync/fsync heavy workload going on while
somebody else is trying to do work.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-01-16 15:28:55 -05:00