Commit Graph

32624 Commits

Author SHA1 Message Date
Linus Torvalds
bcd7351e83 FS-Cache patches 2013-07-02
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.13 (GNU/Linux)
 
 iQIVAwUAUdLdUxOxKuMESys7AQK1kQ//W7fgFXCG+5XVk4ECHGN5tqRn4tU69DY0
 9nYU2/y1wbqV5cTO36XTcFPQK1qbW2ZdyvEZ2CF8OfwtQpLmcALGtpBIgJwYs+4H
 DMkgO06zdk4caxc0C4JBIGs+MDeLNk2SQObqblGl1BAQKQ5cqsCLsIZ/rxln999m
 ufuobfns1YvuHkzMtswUDmm3zWMpwqqPAbbl+fTwPU683a/AleckG2ACyFvKZAxA
 OyI8kJR4e33a3/BGo/5OFb3qI1+Z25EOWdvdnM+r4hdKJZF9ZySlyc640GZHAO2J
 wKj5lYp1nBpyNPvYvly174s2MxPju1CRHb7gxcV4LX3vtEY4/MCg7m6P46EUfC6R
 C3V7PMMCjZXEQ01MKEmGig47EJKIiecCQUZupJnP7HFKPzeJR9mQZFd68WqzswAM
 w9hcCw9hQ9y/kTDVrTVCHs0Q9iTxShfrJyfRJnQ1VcoT+1dieruTa9am9OBKiEw6
 CQrPjq9RZZfsZHYr6RlGZHGJyzjrTzrf6EhxwmgaCxWycpvCuV7z76YgAVZI7V4r
 qnJmH8dXWdoSA7nZ6sgsb5TRCLT9wu1nNId0DMpAGB1cDGga/55AZtqxdoJLnlkj
 y/4wQavIrkfHHuS8c3gzVXPtYmM19CHgcKRFydXD0uGobzfxwYKTKMH+Gviu1NnH
 /pGNNY2vVGI=
 =Wjhu
 -----END PGP SIGNATURE-----

Merge tag 'fscache-20130702' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull FS-Cache updates from David Howells:
 "This contains a number of fixes for various FS-Cache issues plus some
  cleanups.  The commits are, in order:

   1) Provide a system wait_on_atomic_t() and wake_up_atomic_t() sharing
      the bit-wait table (enhancement for ).

   2) Don't put spin_lock() in a while-condition as spin_lock() may have
      a do {} while(0) wrapper (cleanup).

   3) Symbolically name i_mutex lock classes rather than using numbers
      in CacheFiles (cleanup).

   4) Don't sleep in page release if __GFP_FS is not set (deadlock vs
      ext4).

   5) Uninline fscache_object_init() (cleanup for ).

   6) Wrap checks on object state (cleanup for ).

   7) Simplify the object state machine by separating work states from
      wait states.

   8) Simplify cookie retention by objects (NULL pointer deref fix).

   9) Remove unused list_to_page() macro (cleanup).

  10) Make the remaining-pages counter in the retrieval op atomic
      (assertion failure fix).

  11) Don't use spin_is_locked() in assertions (assertion failure fix)"

* tag 'fscache-20130702' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  FS-Cache: Don't use spin_is_locked() in assertions
  FS-Cache: The retrieval remaining-pages counter needs to be atomic_t
  cachefiles: remove unused macro list_to_page()
  FS-Cache: Simplify cookie retention for fscache_objects, fixing oops
  FS-Cache: Fix object state machine to have separate work and wait states
  FS-Cache: Wrap checks on object state
  FS-Cache: Uninline fscache_object_init()
  FS-Cache: Don't sleep in page release if __GFP_FS is not set
  CacheFiles: name i_mutex lock class explicitly
  fs/fscache: remove spin_lock() from the condition in while()
  Add wait_on_atomic_t() and wake_up_atomic_t()
2013-07-02 09:52:47 -07:00
Linus Torvalds
6072a93b98 dlm for 3.11
This set includes a number of SCTP related fixes in the dlm,
 and a few other minor fixes and changes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJR0eE4AAoJEDgbc8f8gGmqmE4QAKgEdiuaLzb1Y76ng7GU9UMz
 6+RVgjDr+fix5pHrs/KgTMNyfP8QKUYgoaspTTgH1YzyjL0Fvx7SjVt8MIs4694j
 OT1sPQhwWcCaeRxK9ss4qQxexN25N69iv/A6W1o+RPbL7VqXkvHfyCTePUEMlX4y
 55AXcSMlCUoYXKq3CMEkTG/25rK6L3fMopzLDKMyH+bbeOtQwHmTL9pHKj3VRwv2
 iV3A5U0pOdbGeWilTzhsLz+iF0oZsfjsNADrrAwuYaAbK5fcXQDDkO5hGkDwa/dd
 nDR4qs4TBIGUu8J8wgK4dz8Zhcm6mEo5A5/ORgbKvRo3dClLlhX/78xaYtMp/RGG
 8v+u2HVwiMd/Ma029m3SxgtjI6cH6XFtZ1ncnL8A72h2Ey1bPEbxmMOQFkSU48Ve
 ZZyjxLYhtaDxl0xzfqLhh8tWAk8JnOAKu2T5bgO2ShLFvTU6yMmyKwovPdFgZctf
 qV81wQbuXeqF4xi/uwNLh7UBIPnzXS3OYNhfgzTuawMSwSP2ix+rouqozv6c5804
 Ay7/cv+cVJItR0f7GKny0U/tEc5BXkngbhIAG3vWQfOjmU5WMuDMvAxzA9i4CH6h
 P1pgnoh0VoxoaX1meu0H7XnREAdBo1my9iaJE3Tzk4sjV341JjF9BAuYraMB6krK
 zHiQjNdDDBy/F8P0YCYt
 =/kqw
 -----END PGP SIGNATURE-----

Merge tag 'dlm-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm

Pull dlm updates from David Teigland:
 "This set includes a number of SCTP related fixes in the dlm, and a few
  other minor fixes and changes."

* tag 'dlm-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
  dlm: Avoid LVB truncation
  dlm: log an error for unmanaged lockspaces
  dlm: config: using strlcpy instead of strncpy
  dlm: remove duplicated include from lowcomms.c
  dlm: disable nagle for SCTP
  dlm: retry failed SCTP sends
  dlm: try other IPs when sctp init assoc fails
  dlm: clear correct bit during sctp init failure handling
  dlm: set sctp assoc id during setup
  dlm: clear correct init bit during sctp setup
2013-07-02 09:52:07 -07:00
Linus Torvalds
3f490f7f99 This patch-set includes the following major enhancement patches.
o remount_fs callback function
  o restore parent inode number to enhance the fsync performance
  o xattr security labels
  o reduce the number of redundant lock/unlock data pages
  o avoid frequent write_inode calls
 
 The other minor bug fixes are as follows.
  o endian conversion bugs
  o various bugs in the roll-forward recovery routine
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJR0h8YAAoJEEAUqH6CSFDSJNMP/1V+e/PaLa8YsHuw5eWPT3Fe
 RYFXJv7SXadHqgQInjwD7JOF8BC9DlUyknYUiXFUZgvKsgMHz+OTCNl+1hLTbUXt
 e6Rhn7vU/fMu3TEtj+v6g8JbpXdXH8TSbtvh9LpoczRS98GYpZuckP0BtQsVkTL3
 jIq6WD8JRkb2DvpDl7viTT0Eq0T61CSmtOwOIvfhhiVxggvDWcnR1mTM9Tymdi4J
 NKtFkjCsKP0Z/7IZZUJAczGEHjsT+O6JDwp8+KVWuZT4BSuchoX4MYAY5wX527Ne
 rZvkolbbfnAFCC3BtETr0DPOkpxnHmJ6dEveIxjZ9cux12CAFA/Ww1QAL4ygiDkd
 Avn8BBEJnfnuzeOUkE0by+9hdF9LOU3CVSxiDhWJegYB16z+c9pSBD9xvlKhKk5B
 QNsjptB6m0CftAq7vIDsryL60uJ2cSHFxPqfwAHEpNngiU/NohTFSZE0sUMbLUNh
 FI6NrHoT7yld1HcB6cvL1lnUqIENFbNhDSSDcTdlN49IJJap4oqtgCnmMMIwbUCO
 vR2/26k5W7NwG+K6XN2IX13AsayzQahxTv8in/+LV0bfjHo6/1VzzGcqAmXJbDQw
 dLrNAeWaaIJi8J/zJiENMbFKXTj8Wax9jxKsW+4towQuyEt4ADvyt1c5gX3VR42T
 x8+YEargsdBf6FAhtN+H
 =qFcy
 -----END PGP SIGNATURE-----

Merge tag 'for-f2fs-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "This patch-set includes the following major enhancement patches:
   - remount_fs callback function
   - restore parent inode number to enhance the fsync performance
   - xattr security labels
   - reduce the number of redundant lock/unlock data pages
   - avoid frequent write_inode calls

  The other minor bug fixes are as follows.
   - endian conversion bugs
   - various bugs in the roll-forward recovery routine"

* tag 'for-f2fs-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (56 commits)
  f2fs: fix to recover i_size from roll-forward
  f2fs: remove the unused argument "sbi" of func destroy_fsync_dnodes()
  f2fs: remove reusing any prefree segments
  f2fs: code cleanup and simplify in func {find/add}_gc_inode
  f2fs: optimize the init_dirty_segmap function
  f2fs: fix an endian conversion bug detected by sparse
  f2fs: fix crc endian conversion
  f2fs: add remount_fs callback support
  f2fs: recover wrong pino after checkpoint during fsync
  f2fs: optimize do_write_data_page()
  f2fs: make locate_dirty_segment() as static
  f2fs: remove unnecessary parameter "offset" from __add_sum_entry()
  f2fs: avoid freqeunt write_inode calls
  f2fs: optimise the truncate_data_blocks_range() range
  f2fs: use the F2FS specific flags in f2fs_ioctl()
  f2fs: sync dir->i_size with its block allocation
  f2fs: fix i_blocks translation on various types of files
  f2fs: set sb->s_fs_info before calling parse_options()
  f2fs: support xattr security labels
  f2fs: fix iget/iput of dir during recovery
  ...
2013-07-02 09:42:38 -07:00
Linus Torvalds
c4eb1b0730 Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw
Pull GFS2 updates from Steven Whitehouse:
 "There are a few bug fixes for various, mostly very minor corner cases,
  plus some interesting new features.

  The new features include atomic_open whose main benefit will be the
  reduction in locking overhead in case of combined lookup/create and
  open operations, sorting the log buffer lists by block number to
  improve the efficiency of AIL writeback, and aggressively issuing
  revokes in gfs2_log_flush to reduce overhead when dropping glocks."

* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw:
  GFS2: Reserve journal space for quota change in do_grow
  GFS2: Fix fstrim boundary conditions
  GFS2: fix warning message
  GFS2: aggressively issue revokes in gfs2_log_flush
  GFS2: fix regression in dir_double_exhash
  GFS2: Add atomic_open support
  GFS2: Only do one directory search on create
  GFS2: fix error propagation in init_threads()
  GFS2: Remove no-op wrapper function
  GFS2: Cocci spatch "ptr_ret.spatch"
  GFS2: Eliminate gfs2_rg_lops
  GFS2: Sort buffer lists by inplace block number
2013-07-02 09:41:18 -07:00
Linus Torvalds
9e239bb939 Lots of bug fixes, cleanups and optimizations. In the bug fixes
category, of note is a fix for on-line resizing file systems where the
 block size is smaller than the page size (i.e., file systems 1k blocks
 on x86, or more interestingly file systems with 4k blocks on Power or
 ia64 systems.)
 
 In the cleanup category, the ext4's punch hole implementation was
 significantly improved by Lukas Czerner, and now supports bigalloc
 file systems.  In addition, Jan Kara significantly cleaned up the
 write submission code path.  We also improved error checking and added
 a few sanity checks.
 
 In the optimizations category, two major optimizations deserve
 mention.  The first is that ext4_writepages() is now used for
 nodelalloc and ext3 compatibility mode.  This allows writes to be
 submitted much more efficiently as a single bio request, instead of
 being sent as individual 4k writes into the block layer (which then
 relied on the elevator code to coalesce the requests in the block
 queue).  Secondly, the extent cache shrink mechanism, which was
 introduce in 3.9, no longer has a scalability bottleneck caused by the
 i_es_lru spinlock.  Other optimizations include some changes to reduce
 CPU usage and to avoid issuing empty commits unnecessarily.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCAAGBQJR0XhgAAoJENNvdpvBGATwMXkQAJwTPk5XYLqtAwLziFLvM6wG
 0tWa1QAzTNo80tLyM9iGqI6x74X5nddLw5NMICUmPooOa9agMuA4tlYVSss5jWzV
 yyB7vLzsc/2eZJusuVqfTKrdGybE+M766OI6VO9WodOoIF1l51JXKjktKeaWegfv
 NkcLKlakD4V+ZASEDB/cOcR/lTwAs9dQ89AZzgPiW+G8Do922QbqkENJB8mhalbg
 rFGX+lu9W0f3fqdmT3Xi8KGn3EglETdVd6jU7kOZN4vb5LcF5BKHQnnUmMlpeWMT
 ksOVasb3RZgcsyf5ZOV5feXV601EsNtPBrHAmH22pWQy3rdTIvMv/il63XlVUXZ2
 AXT3cHEvNQP0/yVaOTCZ9xQVxT8sL4mI6kENP9PtNuntx7E90JBshiP5m24kzTZ/
 zkIeDa+FPhsDx1D5EKErinFLqPV8cPWONbIt/qAgo6663zeeIyMVhzxO4resTS9k
 U2QEztQH+hDDbjgABtz9M/GjSrohkTYNSkKXzhTjqr/m5huBrVMngjy/F4/7G7RD
 vSEx5aXqyagnrUcjsupx+biJ1QvbvZWOVxAE/6hNQNRGDt9gQtHAmKw1eG2mugHX
 +TFDxodNE4iWEURenkUxXW3mDx7hFbGZR0poHG3M/LVhKMAAAw0zoKrrUG5c70G7
 XrddRLGlk4Hf+2o7/D7B
 =SwaI
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 update from Ted Ts'o:
 "Lots of bug fixes, cleanups and optimizations.  In the bug fixes
  category, of note is a fix for on-line resizing file systems where the
  block size is smaller than the page size (i.e., file systems 1k blocks
  on x86, or more interestingly file systems with 4k blocks on Power or
  ia64 systems.)

  In the cleanup category, the ext4's punch hole implementation was
  significantly improved by Lukas Czerner, and now supports bigalloc
  file systems.  In addition, Jan Kara significantly cleaned up the
  write submission code path.  We also improved error checking and added
  a few sanity checks.

  In the optimizations category, two major optimizations deserve
  mention.  The first is that ext4_writepages() is now used for
  nodelalloc and ext3 compatibility mode.  This allows writes to be
  submitted much more efficiently as a single bio request, instead of
  being sent as individual 4k writes into the block layer (which then
  relied on the elevator code to coalesce the requests in the block
  queue).  Secondly, the extent cache shrink mechanism, which was
  introduce in 3.9, no longer has a scalability bottleneck caused by the
  i_es_lru spinlock.  Other optimizations include some changes to reduce
  CPU usage and to avoid issuing empty commits unnecessarily."

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (86 commits)
  ext4: optimize starting extent in ext4_ext_rm_leaf()
  jbd2: invalidate handle if jbd2_journal_restart() fails
  ext4: translate flag bits to strings in tracepoints
  ext4: fix up error handling for mpage_map_and_submit_extent()
  jbd2: fix theoretical race in jbd2__journal_restart
  ext4: only zero partial blocks in ext4_zero_partial_blocks()
  ext4: check error return from ext4_write_inline_data_end()
  ext4: delete unnecessary C statements
  ext3,ext4: don't mess with dir_file->f_pos in htree_dirblock_to_tree()
  jbd2: move superblock checksum calculation to jbd2_write_superblock()
  ext4: pass inode pointer instead of file pointer to punch hole
  ext4: improve free space calculation for inline_data
  ext4: reduce object size when !CONFIG_PRINTK
  ext4: improve extent cache shrink mechanism to avoid to burn CPU time
  ext4: implement error handling of ext4_mb_new_preallocation()
  ext4: fix corruption when online resizing a fs with 1K block size
  ext4: delete unused variables
  ext4: return FIEMAP_EXTENT_UNKNOWN for delalloc extents
  jbd2: remove debug dependency on debug_fs and update Kconfig help text
  jbd2: use a single printk for jbd_debug()
  ...
2013-07-02 09:39:34 -07:00
Linus Torvalds
63580e51bb Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull VFS patches (part 1) from Al Viro:
 "The major change in this pile is ->readdir() replacement with
  ->iterate(), dealing with ->f_pos races in ->readdir() instances for
  good.

  There's a lot more, but I'd prefer to split the pull request into
  several stages and this is the first obvious cutoff point."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (67 commits)
  [readdir] constify ->actor
  [readdir] ->readdir() is gone
  [readdir] convert ecryptfs
  [readdir] convert coda
  [readdir] convert ocfs2
  [readdir] convert fatfs
  [readdir] convert xfs
  [readdir] convert btrfs
  [readdir] convert hostfs
  [readdir] convert afs
  [readdir] convert ncpfs
  [readdir] convert hfsplus
  [readdir] convert hfs
  [readdir] convert befs
  [readdir] convert cifs
  [readdir] convert freevxfs
  [readdir] convert fuse
  [readdir] convert hpfs
  reiserfs: switch reiserfs_readdir_dentry to inode
  reiserfs: is_privroot_deh() needs only directory inode, actually
  ...
2013-07-02 09:28:37 -07:00
Dave Chinner
7747bd4bce sync: don't block the flusher thread waiting on IO
When sync does it's WB_SYNC_ALL writeback, it issues data Io and
then immediately waits for IO completion. This is done in the
context of the flusher thread, and hence completely ties up the
flusher thread for the backing device until all the dirty inodes
have been synced. On filesystems that are dirtying inodes constantly
and quickly, this means the flusher thread can be tied up for
minutes per sync call and hence badly affect system level write IO
performance as the page cache cannot be cleaned quickly.

We already have a wait loop for IO completion for sync(2), so cut
this out of the flusher thread and delegate it to wait_sb_inodes().
Hence we can do rapid IO submission, and then wait for it all to
complete.

Effect of sync on fsmark before the patch:

FSUse%        Count         Size    Files/sec     App Overhead
.....
     0       640000         4096      35154.6          1026984
     0       720000         4096      36740.3          1023844
     0       800000         4096      36184.6           916599
     0       880000         4096       1282.7          1054367
     0       960000         4096       3951.3           918773
     0      1040000         4096      40646.2           996448
     0      1120000         4096      43610.1           895647
     0      1200000         4096      40333.1           921048

And a single sync pass took:

  real    0m52.407s
  user    0m0.000s
  sys     0m0.090s

After the patch, there is no impact on fsmark results, and each
individual sync(2) operation run concurrently with the same fsmark
workload takes roughly 7s:

  real    0m6.930s
  user    0m0.000s
  sys     0m0.039s

IOWs, sync is 7-8x faster on a busy filesystem and does not have an
adverse impact on ongoing async data write operations.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-02 09:16:42 -07:00
Josef Bacik
0e267c44c3 Btrfs: wait ordered range before doing direct io
My recent truncate patch uncovered this bug, but I can reproduce it without the
truncate patch.  If you mount with -o compress-force, do a direct write to some
area, do a buffered write to some other area, and then do a direct read you will
get the wrong data for where you did the buffered write.  This is because the
generic direct io helpers only call filemap_write_and_wait once, and for
compression we need it twice.  So to be safe add the btrfs_wait_ordered_range to
the start of the direct io function to make sure any compressed writes have
truly been written.  This patch makes xfstests 130 pass when you mount with -o
compress-force=lzo.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:51:49 -04:00
Josef Bacik
7fb7d76f96 Btrfs: only do the tree_mod_log_free_eb if this is our last ref
There is another bug in the tree mod log stuff in that we're calling
tree_mod_log_free_eb every single time a block is cow'ed.  The problem with this
is that if this block is shared by multiple snapshots we will call this multiple
times per block, so if we go to rewind the mod log for this block we'll BUG_ON()
in __tree_mod_log_rewind because we try to rewind a free twice.  We only want to
call tree_mod_log_free_eb if we are actually freeing the block.  With this patch
I no longer hit the panic in __tree_mod_log_rewind.  Thanks,

Cc: stable@vger.kernel.org
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:51:20 -04:00
Josef Bacik
f1ca7e98a6 Btrfs: hold the tree mod lock in __tree_mod_log_rewind
We need to hold the tree mod log lock in __tree_mod_log_rewind since we walk
forward in the tree mod entries, otherwise we'll end up with random entries and
trip the BUG_ON() at the front of __tree_mod_log_rewind.  This fixes the panics
people were seeing when running

find /whatever -type f -exec btrfs fi defrag {} \;

Thansk,

Cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:51:18 -04:00
Josef Bacik
261c84b662 Btrfs: make backref walking code handle skinny metadata
I missed fixing the backref stuff when I introduced the skinny metadata.  If you
try and do things like snapshot aware defrag with skinny metadata you are going
to see tons of warnings related to the backref count being less than 0.  This is
because the delayed refs will be found for stuff just fine, but it won't find
the skinny metadata extent refs.  With this patch I'm not seeing warnings
anymore.  Thanks,

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:51:02 -04:00
Liu Bo
35f0399db6 Btrfs: fix crash regarding to ulist_add_merge
Several users reported this crash of NULL pointer or general protection,
the story is that we add a rbtree for speedup ulist iteration, and we
use krealloc() to address ulist growth, and krealloc() use memcpy to copy
old data to new memory area, so it's OK for an array as it doesn't use
pointers while it's not OK for a rbtree as it uses pointers.

So krealloc() will mess up our rbtree and it ends up with crash.

Reviewed-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:50:59 -04:00
Miao Xie
edd1400be9 Btrfs: fix several potential problems in copy_nocow_pages_for_inode
- It makes no sense that we deal with a inode in the dead tree.
- fix the race between dio and page copy by waiting the dio completion
- avoid the page copy vs truncate/punch hole
- check if the page is in the page cache or not

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:50:58 -04:00
Miao Xie
826aa0a82c Btrfs: cleanup the code of copy_nocow_pages_for_inode()
- It make no sense that we continue to do something after the error
  happened, just go back with this patch.
- remove some check of copy_nocow_pages_for_inode(), such as page check
  after write, inode check in the end of the function, because we are
  sure they exist.
- remove the unnecessary goto in the return value check of the write

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:50:56 -04:00
Miao Xie
26b2589190 Btrfs: fix oops when recovering the file data by scrub function
We get oops while running btrfs replace start test,
------------[ cut here ]------------
kernel BUG at mm/filemap.c:608!
[SNIP]
Call Trace:
  [<ffffffffa04b36c7>] copy_nocow_pages_for_inode+0x217/0x3f0 [btrfs]
  [<ffffffffa04b34b0>] ? scrub_print_warning_inode+0x230/0x230 [btrfs]
  [<ffffffffa04b34b0>] ? scrub_print_warning_inode+0x230/0x230 [btrfs]
  [<ffffffffa04bb8ce>] iterate_extent_inodes+0x1ae/0x300 [btrfs]
  [<ffffffffa04bbab2>] iterate_inodes_from_logical+0x92/0xb0 [btrfs]
  [<ffffffffa04b34b0>] ? scrub_print_warning_inode+0x230/0x230 [btrfs]
  [<ffffffffa04b3b07>] copy_nocow_pages_worker+0x97/0x150 [btrfs]
  [<ffffffffa048eed4>] worker_loop+0x134/0x540 [btrfs]
  [<ffffffff816274ea>] ? __schedule+0x3ca/0x7f0
  [<ffffffffa048eda0>] ? btrfs_queue_worker+0x300/0x300 [btrfs]
  [<ffffffff8106f2f0>] kthread+0xc0/0xd0
  [<ffffffff8106f230>] ? flush_kthread_worker+0x80/0x80
  [<ffffffff8163181c>] ret_from_fork+0x7c/0xb0
  [<ffffffff8106f230>] ? flush_kthread_worker+0x80/0x80
[SNIP]
 RIP  [<ffffffff8111f4c5>] unlock_page+0x35/0x40
  RSP <ffff88010316bb98>
 ---[ end trace 421e79ad0dd72c7d ]---

it is because we forgot to lock the page again after we read data to
the page. Fix it.

Signed-off-by: Lin Feng <linfeng@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:50:55 -04:00
Josef Bacik
6df9a95e63 Btrfs: make the chunk allocator completely tree lockless
When adjusting the enospc rules for relocation I ran into a deadlock because we
were relocating the only system chunk and that forced us to try and allocate a
new system chunk while holding locks in the chunk tree, which caused us to
deadlock.  To fix this I've moved all of the dev extent addition and chunk
addition out to the delayed chunk completion stuff.  We still keep the in-memory
stuff which makes sure everything is consistent.

One change I had to make was to search the commit root of the device tree to
find a free dev extent, and hold onto any chunk em's that we allocated in that
transaction so we do not allocate the same dev extent twice.  This has the side
effect of fixing a bug with balance that has been there ever since balance
existed.  Basically you can free a block group and it's dev extent and then
immediately allocate that dev extent for a new block group and write stuff to
that dev extent, all within the same transaction.  So if you happen to crash
during a balance you could come back to a completely broken file system.  This
patch should keep these sort of things from happening in the future since we
won't be able to allocate free'd dev extents until after the transaction
commits.  This has passed all of the xfstests and my super annoying stress test
followed by a balance.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:50:53 -04:00
Josef Bacik
68a7342c51 Btrfs: cleanup orphaned root orphan item
I hit a weird problem were my root item had been deleted but the orphan item had
not.  This isn't necessarily a problem, but it keeps the file system from being
mounted.  To fix this we just need to axe the orphan item if we can't find the
fs root when we're putting them altogether.  With this patch I was able to
successfully mount my file system.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:50:52 -04:00
Miao Xie
a70c6172e7 Btrfs: fix wrong mirror number tuning
Now reading the data from the target device of the replace operation is allowed,
so the mirror number that is greater than the stripes number of a chunk is valid,
we will tune it when we find there is no target device later. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:50:50 -04:00
Miao Xie
e6da5d2ec9 Btrfs: cleanup redundant code in btrfs_submit_direct()
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:50:48 -04:00
Miao Xie
f51a4a1826 Btrfs: remove btrfs_sector_sum structure
Using the structure btrfs_sector_sum to keep the checksum value is
unnecessary, because the extents that btrfs_sector_sum points to are
continuous, we can find out the expected checksums by btrfs_ordered_sum's
bytenr and the offset, so we can remove btrfs_sector_sum's bytenr. After
removing bytenr, there is only one member in the structure, so it makes
no sense to keep the structure, just remove it, and use a u32 array to
store the checksum value.

By this change, we don't use the while loop to get the checksums one by
one. Now, we can get several checksum value at one time, it improved the
performance by ~74% on my SSD (31MB/s -> 54MB/s).

test command:
 # dd if=/dev/zero of=/mnt/btrfs/file0 bs=1M count=1024 oflag=sync

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:50:47 -04:00
Josef Bacik
7ee9e4405f Btrfs: check if we can nocow if we don't have data space
We always just try and reserve data space when we write, but if we are out of
space but have prealloc'ed extents we should still successfully write.  This
patch will try and see if we can write to prealloc'ed space and if we can go
ahead and allow the write to continue.  With this patch we now pass xfstests
generic/274.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:50:45 -04:00
Josef Bacik
925a6efb8f Btrfs: stop using try_to_writeback_inodes_sb_nr to flush delalloc
try_to_writeback_inodes_sb_nr returns 1 if writeback is already underway, which
is completely fraking useless for us as we need to make sure pages are actually
written before we go and check if there are ordered extents.  So replace this
with an open coding of try_to_writeback_inodes_sb_nr minus the writeback
underway check so that we are sure to actually have flushed some dirty pages out
and will have ordered extents to use.  With this patch xfstests generic/273 now
passes.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:50:43 -04:00
Josef Bacik
b150a4f10d Btrfs: use a percpu to keep track of possibly pinned bytes
There are all of these checks in the ENOSPC code to see if committing the
transaction would free up enough space to make the allocation.  This is because
early on we just committed the transaction and hoped and prayed, which resulted
in cases where it took _forever_ to get an ENOSPC when we really were out of
space.  So we check space_info->bytes_pinned, except this isn't completely true
because it doesn't account for space we may free but are stuck in delayed refs.
So tests like xfstests 226 would fail because we wouldn't commit the transaction
to free up the data space.  So instead add a percpu counter that will be a
little fuzzier, it will add bytes as soon as we try to free up the space, and
remove any space it doesn't actually free up when we get around to doing the
actual free.  We then 0 out this counter every transaction period so we have a
better idea of how much space we will actually free up by committing this
transaction.  With this patch we now pass xfstests 226.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:50:42 -04:00
Josef Bacik
f23b5a5995 Btrfs: check for actual acls rather than just xattrs when caching no acl
We have an optimization that will go ahead and cache no acls on an inode if
there are no xattrs on the inode.  This saves us a lookup later to check the
acls for writes or any other access.  The problem is I use selinux so I always
have an xattr on inodes, so make this test a little smarter and check for the
actual acl hash on the key and if it isn't there then we still get to cache no
acl which makes everybody who uses selinux a little happier.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:50:40 -04:00
Aruna Balakrishnaiah
1d8b368ab4 pstore: Add hsize argument in write_buf call of pstore_ftrace_call
Incorporate the addition of hsize argument in write_buf callback
of pstore. This was forgotten in

    6bbbca7359
    pstore: Pass header size in the pstore write callback

Causing a build failure when ftrace and pstore are enabled.

Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-07-02 18:39:37 +10:00
Jaegeuk Kim
a1dd3c13ce f2fs: fix to recover i_size from roll-forward
If user requests many data writes and fsync together, the last updated i_size
should be stored to the inode block consistently.

But, previous write_end just marks the inode as dirty and doesn't update its
metadata into its inode block.
After that, fsync just writes the inode block with newly updated data index
excluding inode metadata updates.

So, this patch introduces write_end in which updates inode block too when the
i_size is changed.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-07-02 08:48:16 +09:00
Gu Zheng
5ebefc5b40 f2fs: remove the unused argument "sbi" of func destroy_fsync_dnodes()
As destroy_fsync_dnodes() is a simple list-cleanup func, so delete the unused
and unrelated f2fs_sb_info argument of it.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-07-02 08:48:15 +09:00
Jaegeuk Kim
763bfe1bc5 f2fs: remove reusing any prefree segments
This patch removes check_prefree_segments initially designed to enhance the
performance by narrowing the range of LBA usage across the whole block device.

When allocating a new segment, previous f2fs tries to find proper prefree
segments, and then, if finds a segment, it reuses the segment for further
data or node block allocation.

However, I found that this was totally wrong approach since the prefree segments
have several data or node blocks that will be used by the roll-forward mechanism
operated after sudden-power-off.

Let's assume the following scenario.

/* write 8MB with fsync */
for (i = 0; i < 2048; i++) {
	offset = i * 4096;
	write(fd, offset, 4KB);
	fsync(fd);
}

In this case, naive segment allocation sequence will be like:
 data segment: x, x+1, x+2, x+3
 node segment: y, y+1, y+2, y+3.

But, if we can reuse prefree segments, the sequence can be like:
 data segment: x, x+1, y, y+1
 node segment: y, y+1, y+2, y+3.
Because, y, y+1, and y+2 became prefree segments one by one, and those are
reused by data allocation.

After conducting this workload, we should consider how to recover the latest
inode with its data.
If we reuse the prefree segments such as y or y+1, we lost the old node blocks
so that f2fs even cannot start roll-forward recovery.

Therefore, I suggest that we should remove reusing prefree segments.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-07-02 08:48:15 +09:00
Gu Zheng
6cc4af5606 f2fs: code cleanup and simplify in func {find/add}_gc_inode
This patch simplifies list operations in find_gc_inode and add_gc_inode.
Just simple code cleanup.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
[Jaegeuk Kim: add description]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-07-02 08:48:14 +09:00
Namjae Jeon
8736fbf003 f2fs: optimize the init_dirty_segmap function
Optimize the while loop condition

Since this condition will always be true and while loop will
be terminated by the following condition in code:

if (segno >= TOTAL_SEGS(sbi))
    break;
Hence we can replace the while loop condition with while(1)
instead of always checking for segno to be less than Total segs.

Also we do not need to use TOTAL_SEGS() everytime. We can store
this value in a local variable since this value is constant.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Pankaj Kumar <pankaj.km@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-07-02 08:48:14 +09:00
Jaegeuk Kim
060dd67b3c f2fs: fix an endian conversion bug detected by sparse
This patch should fix the following bug reported by kbuild test robot.

fs/f2fs/recovery.c:233:33: sparse: incorrect type in assignment
(different base types)

parse warnings: (new ones prefixed by >>)

>> recovery.c:233: sparse: incorrect type in assignment (different base types)
   recovery.c:233:    expected unsigned int [unsigned] [assigned] ofs_in_node
   recovery.c:233:    got restricted __le16 [assigned] [usertype] ofs_in_node
>> recovery.c:238: sparse: incorrect type in assignment (different base types)
   recovery.c:238:    expected unsigned int [unsigned] ofs_in_node
   recovery.c:238:    got restricted __le16 [assigned] [usertype] ofs_in_node

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-07-02 08:48:13 +09:00
Jaegeuk Kim
7e586fa024 f2fs: fix crc endian conversion
While calculating CRC for the checkpoint block, we use __u32, but when storing
the crc value to the disk, we use __le32.

Let's fix the inconsistency.

Reported-and-Tested-by: Oded Gabbay <ogabbay@advaoptical.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-07-02 08:47:35 +09:00
J. Bruce Fields
d08d32e6e5 nfsd4: return delegation immediately if lease fails
This case shouldn't happen--the administrator shouldn't really allow
other applications access to the export until clients have had the
chance to reclaim their state--but if it does then we should set the
"return this lease immediately" bit on the reply.  That still leaves
some small races, but it's the best the protocol allows us to do in the
case a lease is ripped out from under us....

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-07-01 17:32:07 -04:00
J. Bruce Fields
0a262ffb75 nfsd4: do not throw away 4.1 lock state on last unlock
This reverts commit eb2099f31b "nfsd4:
release lockowners on last unlock in 4.1 case".  Trond identified
language in rfc 5661 section 8.2.4 which forbids this behavior:

	Stateids associated with byte-range locks are an exception.
	They remain valid even if a LOCKU frees all remaining locks, so
	long as the open file with which they are associated remains
	open, unless the client frees the stateids via the FREE_STATEID
	operation.

And bakeathon 2013 testing found a 4.1 freebsd client was getting an
incorrect BAD_STATEID return from a FREE_STATEID in the above situation
and then failing.

The spec language honestly was probably a mistake but at this point with
implementations already following it we're probably stuck with that.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-07-01 17:32:06 -04:00
J. Bruce Fields
89f6c3362c nfsd4: delegation-based open reclaims should bypass permissions
We saw a v4.0 client's create fail as follows:

	- open create succeeds and gets a read delegation
	- client attempts to set mode on new file, gets DELAY while
	  server recalls delegation.
	- client attempts a CLAIM_DELEGATE_CUR open using the
	  delegation, gets error because of new file mode.

This probably can't happen on a recent kernel since we're no longer
giving out delegations on create opens.  Nevertheless, it's a
bug--reclaim opens should bypass permission checks.

Reported-by: Steve Dickson <steved@redhat.com>
Reported-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-07-01 17:32:05 -04:00
J. Bruce Fields
590b743143 nfsd4: minor read_buf cleanup
The code to step to the next page seems reasonably self-contained.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-07-01 17:32:03 -04:00
J. Bruce Fields
247500820e nfsd4: fix decoding of compounds across page boundaries
A freebsd NFSv4.0 client was getting rare IO errors expanding a tarball.
A network trace showed the server returning BAD_XDR on the final getattr
of a getattr+write+getattr compound.  The final getattr started on a
page boundary.

I believe the Linux client ignores errors on the post-write getattr, and
that that's why we haven't seen this before.

Cc: stable@vger.kernel.org
Reported-by: Rick Macklem <rmacklem@uoguelph.ca>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-07-01 17:29:40 -04:00
J. Bruce Fields
99c415156c nfsd4: clean up nfs4_open_delegation
The nfs4_open_delegation logic is unecessarily baroque.

Also stop pretending we support write delegations in several places.

Some day we will support write delegations, but when that happens adding
back in these flag parameters will be the easy part.  For now they're
just confusing.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-07-01 17:23:07 -04:00
Steve Dickson
9a0590aec3 NFSD: Don't give out read delegations on creates
When an exclusive create is done with the mode bits
set (aka open(testfile, O_CREAT | O_EXCL, 0777)) this
causes a OPEN op followed by a SETATTR op. When a
read delegation is given in the OPEN, it causes
the SETATTR to delay with EAGAIN until the
delegation is recalled.

This patch caused exclusive creates to give out
a write delegation (which turn into no delegation)
which allows the SETATTR seamlessly succeed.

Signed-off-by: Steve Dickson <steved@redhat.com>
[bfields: do this for any CREATE, not just exclusive; comment]
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-07-01 17:23:07 -04:00
J. Bruce Fields
57569a7070 nfsd4: allow client to send no cb_sec flavors
In testing I notice that some of the pynfs tests forget to send any
cb_sec flavors, and that we haven't necessarily errored out in that case
before.

I'll fix pynfs, but am also inclined to default to trying AUTH_NONE in
that case in case this is something clients actually do.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-07-01 17:23:07 -04:00
J. Bruce Fields
b78724b705 nfsd4: fail attempts to request gss on the backchannel
We don't support gss on the backchannel.  We should state that fact up
front rather than just letting things continue and later making the
client try to figure out why the backchannel isn't working.

Trond suggested instead returning NFS4ERR_NOENT.  I think it would be
tricky for the client to distinguish between the case "I don't support
gss on the backchannel" and "I can't find that in my cache, please
create another context and try that instead", and I'd prefer something
that currently doesn't have any other meaning for this operation, hence
the (somewhat arbitrary) NFS4ERR_ENCR_ALG_UNSUPP.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-07-01 17:23:06 -04:00
J. Bruce Fields
57266a6e91 nfsd4: implement minimal SP4_MACH_CRED
Do a minimal SP4_MACH_CRED implementation suggested by Trond, ignoring
the client-provided spo_must_* arrays and just enforcing credential
checks for the minimum required operations.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-07-01 17:23:06 -04:00
J. Bruce Fields
0dc1531aca svcrpc: store gss mech in svc_cred
Store a pointer to the gss mechanism used in the rq_cred and cl_cred.
This will make it easier to enforce SP4_MACH_CRED, which needs to
compare the mechanism used on the exchange_id with that used on
protected operations.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-07-01 17:23:06 -04:00
Eliezer Tamir
91e2fd3378 net: avoid calling sched_clock when LLS is off
Change Low Latency Sockets code for select and poll so that
when LLS is disabled sched_clock() is never called.

Also, avoid sending POLL_LL to sockets if disabled.

Reported-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-01 14:06:47 -07:00
Dan Carpenter
6af8652849 ceph: tidy ceph_mdsmap_decode() a little
I introduced a new temporary variable "info" instead of
"m->m_info[mds]".  Also I reversed the if condition and pulled
everything in one indent level.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2013-07-01 09:52:02 -07:00
Emil Goode
c213b50b7d ceph: improve error handling in ceph_mdsmap_decode
This patch makes the following improvements to the error handling
in the ceph_mdsmap_decode function:

- Add a NULL check for return value from kcalloc
- Make use of the variable err

Signed-off-by: Emil Goode <emilgoode@gmail.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-01 09:52:02 -07:00
Jim Schutt
4d1bf79aff ceph: fix up comment for ceph_count_locks() as to which lock to hold
Signed-off-by: Jim Schutt <jaschut@sandia.gov>
Reviewed-by: Alex Elder <elder@inktank.com>
2013-07-01 09:52:01 -07:00
Josef Bacik
a71754fc68 Btrfs: move btrfs_truncate_page to btrfs_cont_expand instead of btrfs_truncate
This has plagued us forever and I'm so over working around it.  When we truncate
down to a non-page aligned offset we will call btrfs_truncate_page to zero out
the end of the page and write it back to disk, this will keep us from exposing
stale data if we truncate back up from that point.  The problem with this is it
requires data space to do this, and people don't really expect to get ENOSPC
from truncate() for these sort of things.  This also tends to bite the orphan
cleanup stuff too which keeps people from mounting.  To get around this we can
just move this into btrfs_cont_expand() to make sure if we are truncating up
from a non-page size aligned i_size we will zero out the rest of this page so
that we don't expose stale data.  This will give ENOSPC if you try to truncate()
up or if you try to write past the end of isize, which is much more reasonable.
This fixes xfstests generic/083 failing to mount because of the orphan cleanup
failing.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-01 08:52:33 -04:00
Josef Bacik
0b08851fda Btrfs: optimize reada_for_balance
This patch does two things.  First we no longer explicitly read in the blocks
we're trying to readahead.  For things like balance_level we may never actually
use the blocks so this just adds uneeded latency, and balance_level and
split_node will both read in the blocks they care about explicitly so if the
blocks need to be waited on it will be done there.  Secondly we no longer drop
the path if we do readahead, we just set the path blocking before we call
reada_for_balance() and then we're good to go.  Hopefully this will cut down on
the number of re-searches.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-01 08:52:32 -04:00
Josef Bacik
bdf7c00e8f Btrfs: optimize read_block_for_search
This patch does two things, first it only does one call to
btrfs_buffer_uptodate() with the gen specified instead of once with 0 and then
again with gen specified.  The other thing is to call btrfs_read_buffer() on the
buffer we've found instead of dropping it and then calling read_tree_block().
This will keep us from doing yet another radix tree lookup for a buffer we've
already found.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-01 08:52:32 -04:00
Josef Bacik
fdf8e2ea3c Btrfs: unlock extent range on enospc in compressed submit
A user reported a deadlock where the async submit thread was blocked on the
lock_extent() lock, and then everybody behind him was locked on the page lock
for the page he was holding.  Looking at the code I noticed we do not unlock the
extent range when we get ENOSPC and goto retry.  This is bad because we
immediately try to lock that range again to do the cow, which will cause a
deadlock.  Fix this by unlocking the range.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-01 08:52:31 -04:00
Wang Sheng-Hui
90b6d2830a Btrfs: fix the comment typo for btrfs_attach_transaction_barrier
The comment is for btrfs_attach_transaction_barrier, not for
btrfs_attach_transaction. Fix the typo.

Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Acked-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-01 08:52:30 -04:00
Josef Bacik
aee68ee5f5 Btrfs: fix not being able to find skinny extents during relocate
We unconditionally search for the EXTENT_ITEM_KEY for metadata during balance,
and then check the key that we found to see if it is actually a
METADATA_ITEM_KEY, but this doesn't work right because METADATA is a higher key
value, so if what we are looking for happens to be the first item in the leaf
the search will dump us out at the previous leaf, and we won't find our item.
So instead do what we do everywhere else, search for the skinny extent first and
if we don't find it go back and re-search for the extent item.  This patch fixes
the panic I was hitting when balancing a large file system with skinny extents.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-01 08:52:30 -04:00
Josef Bacik
da61d31a78 Btrfs: cleanup backref search commit root flag stuff
Looking into this backref problem I noticed we're using a macro to what turns
out to essentially be a NULL check to see if we need to search the commit root.
I'm killing this, let's just do what everybody else does and checks if trans ==
NULL.  I've also made it so we pass in the path to __resolve_indirect_refs which
will have the search_commit_root flag set properly already and that way we can
avoid allocating another path when we have a perfectly good one to use.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-01 08:52:29 -04:00
Josef Bacik
d88d46c6e0 Btrfs: free csums when we're done scrubbing an extent
A user reported scrub taking up an unreasonable amount of ram as it ran.  This
is because we lookup the csums for the extent we're scrubbing but don't free it
up until after we're done with the scrub, which means we can take up a whole lot
of ram.  This patch fixes this by dropping the csums once we're done with the
extent we've scrubbed.  The user reported this to fix their problem.  Thanks,

Reported-and-tested-by: Remco Hosman <remco@hosman.xs4all.nl>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-01 08:52:28 -04:00
Josef Bacik
1be41b78bc Btrfs: fix transaction throttling for delayed refs
Dave has this fs_mark script that can make btrfs abort with sufficient amount of
ram.  This is because with more ram we can keep more dirty metadata in cache
which in a round about way makes for many more pending delayed refs.  What
happens is we end up not throttling the transaction enough so when we go to
commit the transaction when we've completely filled the file system we'll
abort() because we use all of the space in the global reserve and we still have
delayed refs to run.  To fix this we need to make the delayed ref flushing and
the transaction throttling dependant upon the number of delayed refs that we
have instead of how much reserved space is left in the global reserve.  With
this patch we not only stop aborting transactions but we also get a smoother run
speed with fs_mark and it makes us about 10% faster.  Thanks,

Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-01 08:52:28 -04:00
Josef Bacik
501407aab8 Btrfs: stop waiting on current trans if we aborted
I hit a hang when run_delayed_refs returned an error in the beginning of
btrfs_commit_transaction.  If we decide we need to commit the transaction in
btrfs_end_transaction we'll set BLOCKED and start to commit, but if we get an
error this early on we'll just exit without committing.  This is fine, except
that anybody else who tried to start a transaction will sit in
wait_current_trans() since we're set to BLOCKED and we never set it to something
else and woke people up.  To fix this we want to check for trans->aborted
everywhere we wait for the transaction state to change, and make
btrfs_abort_transaction() wake up any waiters there may be.  All the callers
will notice that the transaction has aborted and exit out properly.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-01 08:52:27 -04:00
Josef Bacik
f971fe29b1 Btrfs: wake up delayed ref flushing waiters on abort
I hit a deadlock because we aborted when flushing delayed refs but didn't wake
any of the other flushers up and so everybody was just sleeping forever.  This
should fix the problem.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-01 08:52:26 -04:00
Jie Liu
3fb4037599 btrfs: fix the code comments for LZO compression workspace
Fix the code comments for lzo compression workspace.
The buf item is used to store the decompressed data
and cbuf is used to store the compressed data.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-01 08:52:26 -04:00
Miao Xie
5bc7247ac4 Btrfs: fix broken nocow after balance
Balance will create reloc_root for each fs root, and it's going to
record last_snapshot to filter shared blocks.  The side effect of
setting last_snapshot is to break nocow attributes of files.

Since the extents are not shared by the relocation tree after the balance,
we can recover the old last_snapshot safely if no one snapshoted the
source tree. We fix the above problem by this way.

Reported-by: Kyle Gates <kylegates@hotmail.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-01 08:52:25 -04:00
Ashish Sangwan
6ae06ff51e ext4: optimize starting extent in ext4_ext_rm_leaf()
Both hole punch and truncate use ext4_ext_rm_leaf() for removing
blocks.  Currently we choose the last extent as the starting
point for removing blocks:

	ex = EXT_LAST_EXTENT(eh);

This is OK for truncate but for hole punch we can optimize the extent
selection as the path is already initialized.  We could use this
information to select proper starting extent.  The code change in this
patch will not affect truncate as for truncate path[depth].p_ext will
always be NULL.

Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-07-01 08:12:41 -04:00
Theodore Ts'o
41a5b91319 jbd2: invalidate handle if jbd2_journal_restart() fails
If jbd2_journal_restart() fails the handle will have been disconnected
from the current transaction.  In this situation, the handle must not
be used for for any jbd2 function other than jbd2_journal_stop().
Enforce this with by treating a handle which has a NULL transaction
pointer as an aborted handle, and issue a kernel warning if
jbd2_journal_extent(), jbd2_journal_get_write_access(),
jbd2_journal_dirty_metadata(), etc. is called with an invalid handle.

This commit also fixes a bug where jbd2_journal_stop() would trip over
a kernel jbd2 assertion check when trying to free an invalid handle.

Also move the responsibility of setting current->journal_info to
start_this_handle(), simplifying the three users of this function.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reported-by: Younger Liu <younger.liu@huawei.com>
Cc: Jan Kara <jack@suse.cz>
2013-07-01 08:12:41 -04:00
Theodore Ts'o
21ddd568c1 ext4: translate flag bits to strings in tracepoints
Translate the bitfields used in various flags argument to strings to
make the tracepoint output more human-readable.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-07-01 08:12:40 -04:00
Theodore Ts'o
cb53054118 ext4: fix up error handling for mpage_map_and_submit_extent()
The function mpage_released_unused_page() must only be called once;
otherwise the kernel will BUG() when the second call to
mpage_released_unused_page() tries to unlock the pages which had been
unlocked by the first call.

Also restructure the error handling so that we only give up on writing
the dirty pages in the case of ENOSPC where retrying the allocation
won't help.  Otherwise, a transient failure, such as a kmalloc()
failure in calling ext4_map_blocks() might cause us to give up on
those pages, leading to a scary message in /var/log/messages plus data
loss.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2013-07-01 08:12:40 -04:00
Theodore Ts'o
39c04153fd jbd2: fix theoretical race in jbd2__journal_restart
Once we decrement transaction->t_updates, if this is the last handle
holding the transaction from closing, and once we release the
t_handle_lock spinlock, it's possible for the transaction to commit
and be released.  In practice with normal kernels, this probably won't
happen, since the commit happens in a separate kernel thread and it's
unlikely this could all happen within the space of a few CPU cycles.

On the other hand, with a real-time kernel, this could potentially
happen, so save the tid found in transaction->t_tid before we release
t_handle_lock.  It would require an insane configuration, such as one
where the jbd2 thread was set to a very high real-time priority,
perhaps because a high priority real-time thread is trying to read or
write to a file system.  But some people who use real-time kernels
have been known to do insane things, including controlling
laser-wielding industrial robots.  :-)

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-07-01 08:12:40 -04:00
Lukas Czerner
e1be3a928e ext4: only zero partial blocks in ext4_zero_partial_blocks()
Currently if we pass range into ext4_zero_partial_blocks() which covers
entire block we would attempt to zero it even though we should only zero
unaligned part of the block.

Fix this by checking whether the range covers the whole block skip
zeroing if so.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-07-01 08:12:39 -04:00
Theodore Ts'o
42c832debb ext4: check error return from ext4_write_inline_data_end()
The function ext4_write_inline_data_end() can return an error.  So we
need to assign it to a signed integer variable to check for an error
return (since copied is an unsigned int).

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Zheng Liu <wenqing.lz@taobao.com>
Cc: stable@vger.kernel.org
2013-07-01 08:12:39 -04:00
jon ernst
353eefd338 ext4: delete unnecessary C statements
Comparing unsigned variable with 0 always returns false.
err = 0 is duplicated and unnecessary.

[ tytso: Also cleaned up error handling in ext4_block_zero_page_range() ]

Signed-off-by: "Jon Ernst" <jonernst07@gmx.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-07-01 08:12:39 -04:00
Al Viro
64cb927371 ext3,ext4: don't mess with dir_file->f_pos in htree_dirblock_to_tree()
Both ext3 and ext4 htree_dirblock_to_tree() is just filling the
in-core rbtree for use by call_filldir().  All updates of ->f_pos are
done by the latter; bumping it here (on error) is obviously wrong - we
might very well have it nowhere near the block we'd found an error in.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-07-01 08:12:38 -04:00
Theodore Ts'o
fe52d17cdd jbd2: move superblock checksum calculation to jbd2_write_superblock()
Some of the functions which modify the jbd2 superblock were not
updating the checksum before calling jbd2_write_superblock().  Move
the call to jbd2_superblock_csum_set() to jbd2_write_superblock(), so
that the checksum is calculated consistently.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: stable@vger.kernel.org
2013-07-01 08:12:38 -04:00
Ashish Sangwan
aeb2817a4e ext4: pass inode pointer instead of file pointer to punch hole
No need to pass file pointer when we can directly pass inode pointer.

Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-07-01 08:12:38 -04:00
boxi liu
c4932dbe63 ext4: improve free space calculation for inline_data
In ext4 feature inline_data,it use the xattr's space to store the
inline data in inode.When we calculate the inline data as the xattr,we
add the pad.But in get_max_inline_xattr_value_size() function we count
the free space without pad.It cause some contents are moved to a block
even if it can be
stored in the inode.

Signed-off-by: liulei <lewis.liulei@huawei.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Tao Ma <boyu.mt@taobao.com>
2013-07-01 08:12:37 -04:00
Joe Perches
e7c96e8e47 ext4: reduce object size when !CONFIG_PRINTK
Reduce the object size ~10% could be useful for embedded systems.

Add #ifdef CONFIG_PRINTK #else #endif blocks to hold formats and
arguments, passing " " to functions when !CONFIG_PRINTK and still
verifying format and arguments with no_printk.

$ size fs/ext4/built-in.o*
   text	   data	    bss	    dec	    hex	filename
 239375	    610	    888	 240873	  3ace9	fs/ext4/built-in.o.new
 264167	    738	    888	 265793	  40e41	fs/ext4/built-in.o.old

    $ grep -E "CONFIG_EXT4|CONFIG_PRINTK" .config
    # CONFIG_PRINTK is not set
    CONFIG_EXT4_FS=y
    CONFIG_EXT4_USE_FOR_EXT23=y
    CONFIG_EXT4_FS_POSIX_ACL=y
    # CONFIG_EXT4_FS_SECURITY is not set
    # CONFIG_EXT4_DEBUG is not set

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-07-01 08:12:37 -04:00
Zheng Liu
d3922a777f ext4: improve extent cache shrink mechanism to avoid to burn CPU time
Now we maintain an proper in-order LRU list in ext4 to reclaim entries
from extent status tree when we are under heavy memory pressure.  For
keeping this order, a spin lock is used to protect this list.  But this
lock burns a lot of CPU time.  We can use the following steps to trigger
it.

  % cd /dev/shm
  % dd if=/dev/zero of=ext4-img bs=1M count=2k
  % mkfs.ext4 ext4-img
  % mount -t ext4 -o loop ext4-img /mnt
  % cd /mnt
  % for ((i=0;i<160;i++)); do truncate -s 64g $i; done
  % for ((i=0;i<160;i++)); do cp $i /dev/null &; done
  % perf record -a -g
  % perf report

This commit tries to fix this problem.  Now a new member called
i_touch_when is added into ext4_inode_info to record the last access
time for an inode.  Meanwhile we never need to keep a proper in-order
LRU list.  So this can avoid to burns some CPU time.  When we try to
reclaim some entries from extent status tree, we use list_sort() to get
a proper in-order list.  Then we traverse this list to discard some
entries.  In ext4_sb_info, we use s_es_last_sorted to record the last
time of sorting this list.  When we traverse the list, we skip the inode
that is newer than this time, and move this inode to the tail of LRU
list.  When the head of the list is newer than s_es_last_sorted, we will
sort the LRU list again.

In this commit, we break the loop if s_extent_cache_cnt == 0 because
that means that all extents in extent status tree have been reclaimed.

Meanwhile in this commit, ext4_es_{un}register_shrinker()'s prototype is
changed to save a local variable in these functions.

Reported-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-07-01 08:12:37 -04:00
Alexey Khoroshilov
2c00ef3ee3 ext4: implement error handling of ext4_mb_new_preallocation()
If memory allocation in ext4_mb_new_group_pa() is failed,
it returns error code, ext4_mb_new_preallocation() propages it,
but ext4_mb_new_blocks() ignores it.

An observed result was:

- allocation fail means ext4_mb_new_group_pa() does not update
  ext4_allocation_context;

- ext4_mb_new_blocks() sets ext4_allocation_request->len (ar->len =
  ac->ac_b_ex.fe_len;) to number of blocks preallocated (512) instead
  of number of blocks requested (1);

- that activates update cycle in ext4_splice_branch():
    for (i = 1; i < blks; i++) <-- blks is 512 instead of 1 here
      *(where->p + i) = cpu_to_le32(current_block++);

- it iterates 511 times and corrupts a chunk of memory including inode
  structure;

- page fault happens at EXT4_SB(inode->i_sb) in ext4_mark_inode_dirty();

- system hangs with 'scheduling while atomic' BUG.

The patch implements a check for ext4_mb_new_preallocation() error
code and handles its failure as if ext4_mb_regular_allocator() fails.

Found by Linux File System Verification project (linuxtesting.org).

[ Patch restructed by tytso to make the flow of control easier to follow. ]

Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-07-01 08:12:36 -04:00
Maarten ter Huurne
6ca792edc1 ext4: fix corruption when online resizing a fs with 1K block size
Subtracting the number of the first data block places the superblock
backups one block too early, corrupting the file system. When the block
size is larger than 1K, the first data block is 0, so the subtraction
has no effect and no corruption occurs.

Signed-off-by: Maarten ter Huurne <maarten@treewalker.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
CC: stable@vger.kernel.org
2013-07-01 08:12:08 -04:00
Aruna Balakrishnaiah
6bbbca7359 pstore: Pass header size in the pstore write callback
Header size is needed to distinguish between header and the dump data.
Incorporate the addition of new argument (hsize) in the pstore write
callback.

Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-07-01 18:10:48 +10:00
Benjamin Herrenschmidt
24a72acac1 Linux 3.10
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iQEcBAABAgAGBQJR0K2gAAoJEHm+PkMAQRiGWsEH+gMZSN1qRm34hZ82q1Tx7HvL
 Eb/Gsl3Qw/7G2TlTqgjBUs36IdqV9O2cui/aa3/TfXvdvrx+0GlhRkEwQPc+ygcO
 Mvoyoke4tT4+4jVFdCg1J8avREsa28/6oaHs0ZZxuVmJBBLTJH7aXaNsGn6eU1q9
 9+p798MQis6naIiPC63somlZcCIiBhsuWCPWpEfLMn8G1HWAFTM3xXIbNBqe/brS
 bmIOfhomlIZ5dcdaXGvjtP3+KJhkNDwhkPC4tVYu8JqqgSlrE+a+EGyEuuGqKk10
 U+swiqyuD31uBI9ga54u/2FzSqDiAu6YOcMXevjo/m3g9XLdYbYLvN+nvN8alCQ=
 =Ob6Z
 -----END PGP SIGNATURE-----

Merge tag 'v3.10' into next

Merge 3.10 in order to get some of the last minute powerpc
changes, resolve conflicts and add additional fixes on top
of them.
2013-07-01 17:57:25 +10:00
Linus Torvalds
63edbce160 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull ubifs fixes from Al Viro:
 "A couple of ubifs readdir/lseek race fixes.  Stable fodder, really
  nasty..."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  UBIFS: fix a horrid bug
  UBIFS: prepare to fix a horrid bug
2013-06-29 10:30:31 -07:00
Linus Torvalds
82d0b80ad6 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fix from Ingo Molnar:
 "One more fix for a recently discovered bug"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Disable monitoring on setuid processes for regular users
2013-06-29 10:26:50 -07:00
Al Viro
2142914e3e lseek_execute() doesn't need an inode passed to it
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:53 +04:00
Al Viro
5d48f3a2de block_dev: switch to fixed_size_llseek()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:52 +04:00
Jeff Layton
7b2296afb3 locks: give the blocked_hash its own spinlock
There's no reason we have to protect the blocked_hash and file_lock_list
with the same spinlock. With the tests I have, breaking it in two gives
a barely measurable performance benefit, but it seems reasonable to make
this locking as granular as possible.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:46 +04:00
Jeff Layton
3999e49364 locks: add a new "lm_owner_key" lock operation
Currently, the hashing that the locking code uses to add these values
to the blocked_hash is simply calculated using fl_owner field. That's
valid in most cases except for server-side lockd, which validates the
owner of a lock based on fl_owner and fl_pid.

In the case where you have a small number of NFS clients doing a lot
of locking between different processes, you could end up with all
the blocked requests sitting in a very small number of hash buckets.

Add a new lm_owner_key operation to the lock_manager_operations that
will generate an unsigned long to use as the key in the hashtable.
That function is only implemented for server-side lockd, and simply
XORs the fl_owner and fl_pid.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:45 +04:00
Jeff Layton
48f7418654 locks: turn the blocked_list into a hashtable
Break up the blocked_list into a hashtable, using the fl_owner as a key.
This speeds up searching the hash chains, which is especially significant
for deadlock detection.

Note that the initial implementation assumes that hashing on fl_owner is
sufficient. In most cases it should be, with the notable exception being
server-side lockd, which compares ownership using a tuple of the
nlm_host and the pid sent in the lock request. So, this may degrade to a
single hash bucket when you only have a single NFS client. That will be
addressed in a later patch.

The careful observer may note that this patch leaves the file_lock_list
alone. There's much less of a case for turning the file_lock_list into a
hashtable. The only user of that list is the code that generates
/proc/locks, and it always walks the entire list.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:44 +04:00
Jeff Layton
139ca04ee5 locks: convert fl_link to a hlist_node
Testing has shown that iterating over the blocked_list for deadlock
detection turns out to be a bottleneck. In order to alleviate that,
begin the process of turning it into a hashtable. We start by turning
the fl_link into a hlist_node and the global lists into hlists. A later
patch will do the conversion of the blocked_list to a hashtable.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:44 +04:00
Jeff Layton
4e8c765d38 locks: avoid taking global lock if possible when waking up blocked waiters
Since we always hold the i_lock when inserting a new waiter onto the
fl_block list, we can avoid taking the global lock at all if we find
that it's empty when we go to wake up blocked waiters.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:43 +04:00
Jeff Layton
1c8c601a8c locks: protect most of the file_lock handling with i_lock
Having a global lock that protects all of this code is a clear
scalability problem. Instead of doing that, move most of the code to be
protected by the i_lock instead. The exceptions are the global lists
that the ->fl_link sits on, and the ->fl_block list.

->fl_link is what connects these structures to the
global lists, so we must ensure that we hold those locks when iterating
over or updating these lists.

Furthermore, sound deadlock detection requires that we hold the
blocked_list state steady while checking for loops. We also must ensure
that the search and update to the list are atomic.

For the checking and insertion side of the blocked_list, push the
acquisition of the global lock into __posix_lock_file and ensure that
checking and update of the  blocked_list is done without dropping the
lock in between.

On the removal side, when waking up blocked lock waiters, take the
global lock before walking the blocked list and dequeue the waiters from
the global list prior to removal from the fl_block list.

With this, deadlock detection should be race free while we minimize
excessive file_lock_lock thrashing.

Finally, in order to avoid a lock inversion problem when handling
/proc/locks output we must ensure that manipulations of the fl_block
list are also protected by the file_lock_lock.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:42 +04:00
Jeff Layton
8897469171 locks: encapsulate the fl_link list handling
Move the fl_link list handling routines into a separate set of helpers.
Also ensure that locks and requests are always put on global lists
last (after fully initializing them) and are taken off before unintializing
them.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:41 +04:00
Jeff Layton
b9746ef80f locks: make "added" in __posix_lock_file a bool
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:40 +04:00
Jeff Layton
1cb3601259 locks: comment cleanups and clarifications
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:39 +04:00
Jeff Layton
d4f22d19df locks: make generic_add_lease and generic_delete_lease static
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:39 +04:00
Jeff Layton
1a9e64a711 cifs: use posix_unblock_lock instead of locks_delete_block
commit 66189be74 (CIFS: Fix VFS lock usage for oplocked files) exported
the locks_delete_block symbol. There's already an exported helper
function that provides this capability however, so make cifs use that
instead and turn locks_delete_block back into a static function.

Note that if fl->fl_next == NULL then this lock has already been through
locks_delete_block(), so we should be OK to ignore an ENOENT error here
and simply not retry the lock.

Cc: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:38 +04:00
Jeff Layton
f891a29f46 locks: drop the unused filp argument to posix_unblock_lock
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:37 +04:00
Linus Torvalds
da53be12bb Don't pass inode to ->d_hash() and ->d_compare()
Instances either don't look at it at all (the majority of cases) or
only want it to find the superblock (which can be had as dentry->d_sb).
A few cases that want more are actually safe with dentry->d_inode -
the only precaution needed is the check that it hadn't been replaced with
NULL by rmdir() or by overwriting rename(), which case should be simply
treated as cache miss.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:36 +04:00
Dan Carpenter
642b704cd7 minix: bug widening a binary "not" operation
"chunk_size" is an unsigned int and "pos" is an unsigned long.  The
"& ~(chunk_size-1)" operation clears the high 32 bits unintentionally.

The ALIGN() macro does the correct thing.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-06-29 12:57:35 +04:00
Al Viro
18c67cb9f0 splice: lift checks from do_splice_from() into callers
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:35 +04:00
Al Viro
68d70d03f8 constify rw_verify_area()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:34 +04:00
Al Viro
1bf9d14dff new helper: fixed_size_llseek()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:26 +04:00
Al Viro
0747fdb2bd ecryptfs: switch ecryptfs_decode_and_decrypt_filename() from dentry to sb
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:25 +04:00
Al Viro
cb5e05d1a6 fuse: another open-coded file_inode()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:24 +04:00
Al Viro
6d0379ec49 btrfs: more open-coded file_inode()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:24 +04:00
Al Viro
3058dca694 fanotify: quit wanking with FASYNC in ->release()
... especially since there's no way to get that sucker
on the list fsnotify_fasync() works with - the only thing
adding to it is fsnotify_fasync() itself and it's never
called for fanotify files while they are opened.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:23 +04:00
Al Viro
0b3fca1fd1 kill find_inode_number()
the only remaining caller (in ncpfs) is guaranteed to return 0 -
we only hit it if we'd just checked that there's no dentry with
such name.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:20 +04:00
Al Viro
6b5e1223d9 coda: don't bother with find_inode_number()
the fallback it's using for dcache misses is actually the
same value we would've used for inumber anyway.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:20 +04:00
Al Viro
1df98b8bbc proc_fill_cache(): clean up, get rid of pointless find_inode_number() use
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:19 +04:00
Al Viro
c52a47ace7 proc_fill_cache(): just make instantiate_t return int
all instances always return ERR_PTR(-E...) or NULL, anyway

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:18 +04:00
Al Viro
db96316487 proc_pid_readdir(): stop wanking with proc_fill_cache() for /proc/self
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:17 +04:00
Al Viro
147ce69974 proc_fill_cache(): kill pointless check
we'd just checked that child->d_inode is non-NULL, for fuck sake!

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:16 +04:00
Al Viro
338b2f5749 ncpfs: don't bother with EBUSY on removal of busy directories
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:16 +04:00
Al Viro
5faf153ebf don't call file_pos_write() if vfs_{read,write}{,v}() fails
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:15 +04:00
David Howells
c77cecee52 Replace a bunch of file->dentry->d_inode refs with file_inode()
Replace a bunch of file->dentry->d_inode refs with file_inode().

In __fput(), use file->f_inode instead so as not to be affected by any tricks
that file_inode() might grow.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:13 +04:00
Al Viro
656d09df8f udf: provide ->tmpfile()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:12 +04:00
Al Viro
e6bbef9542 ext3 ->tmpfile() support
In this case we do need a bit more than usual, due to orphan
list handling.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:12 +04:00
Al Viro
f4e0c30c19 allow the temp files created by open() to be linked to
O_TMPFILE | O_CREAT => linkat() with AT_SYMLINK_FOLLOW and /proc/self/fd/<n>
as oldpath (i.e. flink()) will create a link
O_TMPFILE | O_CREAT | O_EXCL => ENOENT on attempt to link those guys

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:11 +04:00
Al Viro
60545d0d46 [O_TMPFILE] it's still short a few helpers, but infrastructure should be OK now...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:10 +04:00
Al Viro
f9652e10c1 allow build_open_flags() to return an error
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:09 +04:00
Al Viro
50cd2c5776 lift file_*_write out of do_splice_direct()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:08 +04:00
Al Viro
500368f7fb lift file_*_write out of do_splice_from()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:08 +04:00
Al Viro
bc77daa783 do_last(): fix missing checks for LAST_BIND case
/proc/self/cwd with O_CREAT should fail with EISDIR.  /proc/self/exe, OTOH,
should fail with ENOTDIR when opened with O_DIRECTORY.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:07 +04:00
Al Viro
ac6614b764 [readdir] constify ->actor
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:05 +04:00
Al Viro
2233f31aad [readdir] ->readdir() is gone
everything's converted to ->iterate()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:04 +04:00
Al Viro
2de5f059c4 [readdir] convert ecryptfs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:04 +04:00
Al Viro
e924f25126 [readdir] convert coda
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:03 +04:00
Al Viro
3704412bdb [readdir] convert ocfs2
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:02 +04:00
Al Viro
2c6a2473b8 [readdir] convert fatfs
... pox upon the idiotic ioctls; life would be much easier without
those.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:01 +04:00
Al Viro
b8227554c9 [readdir] convert xfs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:00 +04:00
Al Viro
9cdda8d31f [readdir] convert btrfs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:00 +04:00
Al Viro
8e28bc7e71 [readdir] convert hostfs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:59 +04:00
Al Viro
1bbae9f818 [readdir] convert afs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:58 +04:00
Al Viro
76f582a8f6 [readdir] convert ncpfs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:57 +04:00
Al Viro
e72514e7ad [readdir] convert hfsplus
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:56 +04:00
Al Viro
002f8bec85 [readdir] convert hfs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:55 +04:00
Al Viro
f0f49ef5ce [readdir] convert befs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:55 +04:00
Al Viro
be4ccdcc25 [readdir] convert cifs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:54 +04:00
Al Viro
9b5d5a1707 [readdir] convert freevxfs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:53 +04:00
Al Viro
8d3af7f333 [readdir] convert fuse
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:52 +04:00
Al Viro
568f8f5ec5 [readdir] convert hpfs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:51 +04:00
Al Viro
cd62cdae0b reiserfs: switch reiserfs_readdir_dentry to inode
... and clean the callers up a bit

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:51 +04:00
Al Viro
99ce4169a9 reiserfs: is_privroot_deh() needs only directory inode, actually
... and that - only to get the superblock.  Privroot is a directory
and we don't allow hardlinks to those...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:50 +04:00
Al Viro
4acf381e1b [readdir] convert reiserfs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:49 +04:00
Al Viro
956ce2083c [readdir] convert ntfs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:48 +04:00
Al Viro
bfee7169c0 [readdir] convert isofs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:47 +04:00
Al Viro
0312fa7ccd [readdir] convert jffs2
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:47 +04:00
Al Viro
6f7f231e7b [readdir] convert f2fs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:46 +04:00
Al Viro
8f29843a51 [readdir] convert 9p
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:45 +04:00
Al Viro
0edf977d2a [readdir] convert affs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:44 +04:00
Al Viro
2638ffbac9 [readdir] convert adfs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:44 +04:00
Al Viro
46d0733801 [readdir] convert logfs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:43 +04:00
Al Viro
070a0ebf42 [readdir] convert jfs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:42 +04:00
Al Viro
77acfa29e1 [readdir] convert ceph
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:41 +04:00
Al Viro
23db862060 [readdir] convert nfs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:40 +04:00
Al Viro
725bebb278 [readdir] convert ext4
and trim the living hell out bogosities in inline dir case

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:40 +04:00
Al Viro
4deb398a1b [readdir] convert qnx6
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:39 +04:00
Al Viro
663f4deca7 [readdir] convert qnx4
... and use strnlen() instead of strlen() - it's done on untrusted data,
after all.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:38 +04:00
Al Viro
9fd4d05949 [readdir] convert omfs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:37 +04:00
Al Viro
1616abe841 [readdir] convert nilfs2
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:36 +04:00
Al Viro
d55fea8ddb [readdir] convert sysfs
get rid of the kludges in sysfs_readdir()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:36 +04:00
Al Viro
d81a8ef598 [readdir] convert gfs2
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:35 +04:00
Al Viro
75811d4fda [readdir] convert exofs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:34 +04:00
Al Viro
81b9f66e6b [readdir] convert bfs
... and get rid of that ridiculous mutex in bfs_readdir()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:33 +04:00
Al Viro
f0c3b5093a [readdir] convert procfs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:32 +04:00
Al Viro
68c6147113 [readdir] convert openpromfs
what the hell is op_mutex for, BTW?

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:32 +04:00
Al Viro
7aa123a0dc [readdir] convert efs
* sanity checks belong before risky operation, not after it
* don't quit as soon as we'd found an entry

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:31 +04:00
Al Viro
52018855e6 [readdir] convert configfs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:30 +04:00
Al Viro
3903b38ce7 [readdir] convert romfs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:29 +04:00
Al Viro
5f6039ce69 [readdir] convert squashfs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:28 +04:00
Al Viro
01122e0688 [readdir] convert ubifs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:25 +04:00
Al Viro
5add2ee198 [readdir] convert udf
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:46:50 +04:00
Al Viro
5ded75ec4c [readdir] convert ext3
new helper: dir_relax(inode).  Call when you are in location that will
_not_ be invalidated by directory modifications (block boundary, in case
of ext*).  Returns whether the directory has survived (dropping i_mutex
allows rmdir to kill the sucker; if it returns false to us, ->iterate()
is obviously done)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:46:49 +04:00
Al Viro
5f99f4e79a [readdir] switch dcache_readdir() users to ->iterate()
new helpers - dir_emit_dot(file, ctx, dentry), dir_emit_dotdot(file, ctx),
dir_emit_dots(file, ctx).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:46:48 +04:00
Al Viro
80886298c0 [readdir] simple local unixlike: switch to ->iterate()
ext2, ufs, minix, sysv

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:46:47 +04:00
Al Viro
bb6f619b3a [readdir] introduce ->iterate(), ctx->pos, dir_emit()
New method - ->iterate(file, ctx).  That's the replacement for ->readdir();
it takes callback from ctx->actor, uses ctx->pos instead of file->f_pos and
calls dir_emit(ctx, ...) instead of filldir(data, ...).  It does *not*
update file->f_pos (or look at it, for that matter); iterate_dir() does the
update.

Note that dir_emit() takes the offset from ctx->pos (and eventually
filldir_t will lose that argument).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:46:47 +04:00
Al Viro
5c0ba4e076 [readdir] introduce iterate_dir() and dir_context
iterate_dir(): new helper, replacing vfs_readdir().

struct dir_context: contains the readdir callback (and will get more stuff
in it), embedded into whatever data that callback wants to deal with;
eventually, we'll be passing it to ->readdir() replacement instead of
(data,filldir) pair.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:46:46 +04:00
Al Viro
e06aeb5716 compat.c: LOOP_CLR_FD is taken care of in loop.c itself...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:46:44 +04:00
Artem Bityutskiy
605c912bb8 UBIFS: fix a horrid bug
Al Viro pointed me to the fact that '->readdir()' and '->llseek()' have no
mutual exclusion, which means the 'ubifs_dir_llseek()' can be run while we are
in the middle of 'ubifs_readdir()'.

This means that 'file->private_data' can be freed while 'ubifs_readdir()' uses
it, and this is a very bad bug: not only 'ubifs_readdir()' can return garbage,
but this may corrupt memory and lead to all kinds of problems like crashes an
security holes.

This patch fixes the problem by using the 'file->f_version' field, which
'->llseek()' always unconditionally sets to zero. We set it to 1 in
'ubifs_readdir()' and whenever we detect that it became 0, we know there was a
seek and it is time to clear the state saved in 'file->private_data'.

I tested this patch by writing a user-space program which runds readdir and
seek in parallell. I could easily crash the kernel without these patches, but
could not crash it with these patches.

Cc: stable@vger.kernel.org
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:45:37 +04:00
Artem Bityutskiy
33f1a63ae8 UBIFS: prepare to fix a horrid bug
Al Viro pointed me to the fact that '->readdir()' and '->llseek()' have no
mutual exclusion, which means the 'ubifs_dir_llseek()' can be run while we are
in the middle of 'ubifs_readdir()'.

First of all, this means that 'file->private_data' can be freed while
'ubifs_readdir()' uses it.  But this particular patch does not fix the problem.
This patch is only a preparation, and the fix will follow next.

In this patch we make 'ubifs_readdir()' stop using 'file->f_pos' directly,
because 'file->f_pos' can be changed by '->llseek()' at any point. This may
lead 'ubifs_readdir()' to returning inconsistent data: directory entry names
may correspond to incorrect file positions.

So here we introduce a local variable 'pos', read 'file->f_pose' once at very
the beginning, and then stick to 'pos'. The result of this is that when
'ubifs_dir_llseek()' changes 'file->f_pos' while we are in the middle of
'ubifs_readdir()', the latter "wins".

Cc: stable@vger.kernel.org
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:45:37 +04:00
David Disseldorp
7ac0febb81 cifs: fill TRANS2_QUERY_FILE_INFO ByteCount fields
Currently the trans2 ByteCount field is incorrectly left zero in
TRANS2_QUERY_FILE_INFO info_level=SMB_QUERY_FILE_ALL_INFO and
info_level=SMB_QUERY_FILE_UNIX_BASIC requests. The field should properly
reflect the FID, information_level and padding bytes carried in these
requests.

Leaving this field zero causes such requests to fail against Novell CIFS
servers. Other SMB servers (e.g. Samba) use the parameter count fields
for data length calculations instead, so do not suffer the same fate.

Signed-off-by: David Disseldorp <ddiss@suse.de>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-06-29 00:09:44 -05:00
Chandra Seetharaman
83e782e1a1 xfs: Remove incore use of XFS_OQUOTA_ENFD and XFS_OQUOTA_CHKD
Remove all incore use of XFS_OQUOTA_ENFD and XFS_OQUOTA_CHKD. Instead,
start using XFS_GQUOTA_.* XFS_PQUOTA_.* counterparts for GQUOTA and
PQUOTA respectively.

On-disk copy still uses XFS_OQUOTA_ENFD and XFS_OQUOTA_CHKD.

Read and write of the superblock does the conversion from *OQUOTA*
to *[PG]QUOTA*.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-28 17:39:22 -05:00
Lenny Szubowicz
8e48b1a8ed pstore: Return unique error if backend registration excluded by kernel param
This is patch 1/3 of a patch set that avoids what misleadingly appears
to be a error during boot:

ERST: Could not register with persistent store

This message is displayed if the system has a valid ACPI ERST table and the
pstore.backend kernel parameter has been used to disable use of ERST by
pstore. But this same message is used for errors that preclude registration.

As part of fixing this, return a unique error status from pstore_register
if the pstore.backend kernel parameter selects a specific facility other
than the requesting facility and check for this condition before any others.
This allows the caller to distinquish this benign case from the other failure
cases.

Also, print an informational console message about which facility
successfully registered as the pstore backend. Since there are various
kernel parameters, config build options, and boot-time errors that can
influence which facility registers with pstore, it's useful to have a
positive indication.

Signed-off-by: Lenny Szubowicz <lszubowi@redhat.com>
Reported-by: Naotaka Hamaguchi <n.hamaguchi@jp.fujitsu.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2013-06-28 15:21:52 -07:00
Trond Myklebust
959d921f5e Merge branch 'labeled-nfs' into linux-next
* labeled-nfs:
  NFS: Apply v4.1 capabilities to v4.2
  NFS: Add in v4.2 callback operation
  NFS: Make callbacks minor version generic
  Kconfig: Add Kconfig entry for Labeled NFS V4 client
  NFS: Extend NFS xattr handlers to accept the security namespace
  NFS: Client implementation of Labeled-NFS
  NFS: Add label lifecycle management
  NFS:Add labels to client function prototypes
  NFSv4: Extend fattr bitmaps to support all 3 words
  NFSv4: Introduce new label structure
  NFSv4: Add label recommended attribute and NFSv4 flags
  NFSv4.2: Added NFS v4.2 support to the NFS client
  SELinux: Add new labeling type native labels
  LSM: Add flags field to security_sb_set_mnt_opts for in kernel mount data.
  Security: Add Hook to test if the particular xattr is part of a MAC model.
  Security: Add hook to calculate context based on a negative dentry.
  NFS: Add NFSv4.2 protocol constants

Conflicts:
	fs/nfs/nfs4proc.c
2013-06-28 16:29:51 -04:00
Chuck Lever
f112bb4899 NFS: Set NFS_CS_MIGRATION for NFSv4 mounts
NFS_CS_MIGRATION makes sense only for NFSv4 mounts.  Introduced by
commit 89652617 (NFS: Introduce "migration" mount option) Fri Sep 14
17:24:11 2012.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-28 16:04:10 -04:00
Andy Adamson
18aad3d552 NFSv4.1 Refactor nfs4_init_session and nfs4_init_channel_attrs
nfs4_init_session was originally written to be called prior to
nfs4_init_channel_attrs, setting the session target_max response and request
sizes that nfs4_init_channel_attrs would pay attention to.

In the current code flow, nfs4_init_session, just like nfs4_init_ds_session
for the data server case, is called after the session is all negotiated, and
is actually used in a RECLAIM COMPLETE call to the server.

Remove the un-needed fc_target_max response and request fields from
nfs4_session and just set the max_resp_sz and max_rqst_sz in
nfs4_init_channel_attrs.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-28 15:55:19 -04:00
Jeff Layton
9111c95b07 nfs: have NFSv3 try server-specified auth flavors in turn
The current scheme is to try and pick the auth flavor that the server
prefers. In some cases though, we may find that we're not actually
able to use that auth flavor later. For instance, the server may
prefer an AUTH_GSS flavor, but we may not be able to get GSSAPI creds.

The current code just gives up at that point. Change it instead to
try the ->create_server call using each of the different authflavors
in the server's list if one was not specified at mount time. Once
we have a successful ->create_server call, return the result. Only
give up and return error if all attempts fail.

Cc: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-28 15:52:37 -04:00
Jeff Layton
fb9b02fda0 nfs: have nfs_mount fake up a auth_flavs list when the server didn't provide it
Instead of handling this as a special case in the auth-selection code,
we can simply fake up an auth_flavs list when the server doesn't
provide it.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-28 15:51:51 -04:00
Jeff Layton
294ae81d4f nfs: move server_authlist into nfs_try_mount_request
In a later patch we're going to want to cycle over this list and attempt
to call ->create_server for each different flavor until one succeeds.
Move the list allocation to the stack of nfs_try_mount_request() and
pass a pointer to it and its length to nfs_request_mount().

Cc: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-28 15:49:49 -04:00
Jeff Layton
d17540c61b nfs: refactor "need_mount" code out of nfs_try_mount
This looks like pointless refactoring for now, but we'll flesh out
the need_mount case a little more in a later patch.

Cc: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-28 15:49:27 -04:00
Andy Adamson
52fcac988a NFSv4.1 use pnfs_device maxcount for the objectlayout gdia_maxcount
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-28 15:34:45 -04:00
Andy Adamson
968fe25243 NFSv4.1 use pnfs_device maxcount for the blocklayout gdia_maxcount
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-28 15:34:44 -04:00
Andy Adamson
f1c097be2b NFSv4.1 Fix gdia_maxcount calculation to fit in ca_maxresponsesize
The GETDEVICEINFO gdia_maxcount represents all of the data being returned
within the GETDEVICEINFO4resok structure and includes the XDR overhead.

The CREATE_SESSION ca_maxresponsesize is the maximum reply and includes the RPC
headers (including security flavor credentials and verifiers).

Split out the struct pnfs_device field maxcount which is the gdia_maxcount
from the pglen field which is the reply (the total) buffer length.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-28 15:34:43 -04:00
Bryan Schumaker
ffa57b9e53 NFS: Improve legacy idmapping fallback
Fallback should happen only when the request_key() call fails, because
this indicates that there was a problem running the nfsidmap program.
We shouldn't call the legacy code if the error was elsewhere.

Signed-off-by: Bryan Schumaker <bjschuma@netappp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-28 15:22:07 -04:00
Chandra Seetharaman
0e6436d99e xfs: Change xfs_dquot_acct to be a 2-dimensional array
In preparation for combined pquota/gquota support, for the sake
of readability, change xfs_dquot_acct to be a 2-dimensional array.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-28 14:12:22 -05:00
Chandra Seetharaman
113a56835d xfs: Code cleanup and removal of some typedef usage
In preparation for combined pquota/gquota support, for the sake
of readability, do some code cleanup surrounding the affected
code.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-28 13:13:59 -05:00
Chandra Seetharaman
995961c451 xfs: Replace macro XFS_DQ_TO_QIP with a function
In preparation for combined pquota/gquota support, for the sake
of readability, change the macro to an inline function.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-28 13:12:42 -05:00
Chandra Seetharaman
329e087528 xfs: Replace macro XFS_DQUOT_TREE with a function
In preparation for combined pquota/gquota support, for the sake
of readability, change the macro to an inline function.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-28 13:05:07 -05:00
Chandra Seetharaman
9cad19d2cb xfs: Define a new function xfs_is_quota_inode()
In preparation for combined pquota/gquota support, define
a new function to check if the given inode is a quota inode.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-28 13:03:49 -05:00
Dave Chinner
dc037ad7d2 xfs: implement inode change count
For CRC enabled filesystems, add support for the monotonic inode
version change counter that is needed by protocols like NFSv4 for
determining if the inode has changed in any way at all between two
unrelated operations on the inode.

This bumps the change count the first time an inode is dirtied in a
transaction. Since all modifications to the inode are logged, this
will catch all changes that are made to the inode, including
timestamp updates that occur during data writes.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-28 13:00:05 -05:00
Jan Kara
a5faeaf910 writeback: Fix periodic writeback after fs mount
Code in blkdev.c moves a device inode to default_backing_dev_info when
the last reference to the device is put and moves the device inode back
to its bdi when the first reference is acquired. This includes moving to
wb.b_dirty list if the device inode is dirty. The code however doesn't
setup timer to wake corresponding flusher thread and while wb.b_dirty
list is non-empty __mark_inode_dirty() will not set it up either. Thus
periodic writeback is effectively disabled until a sync(2) call which can
lead to unexpected data loss in case of crash or power failure.

Fix the problem by setting up a timer for periodic writeback in case we
add the first dirty inode to wb.b_dirty list in bdev_inode_switch_bdi().

Reported-by: Bert De Jonghe <Bert.DeJonghe@amplidata.com>
CC: stable@vger.kernel.org # >= 3.0
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-06-28 16:04:02 +02:00
Rafael J. Wysocki
207bc1181b Merge branch 'freezer'
* freezer:
  af_unix: use freezable blocking calls in read
  sigtimedwait: use freezable blocking call
  nanosleep: use freezable blocking call
  futex: use freezable blocking call
  select: use freezable blocking call
  epoll: use freezable blocking call
  binder: use freezable blocking calls
  freezer: add new freezable helpers using freezer_do_not_count()
  freezer: convert freezable helpers to static inline where possible
  freezer: convert freezable helpers to freezer_do_not_count()
  freezer: skip waking up tasks with PF_FREEZER_SKIP set
  freezer: shorten freezer sleep time using exponential backoff
  lockdep: check that no locks held at freeze time
  lockdep: remove task argument from debug_check_no_locks_held
  freezer: add unsafe versions of freezable helpers for CIFS
  freezer: add unsafe versions of freezable helpers for NFS
2013-06-28 13:00:53 +02:00
Jeff Layton
50285882fd cifs: fix SMB2 signing enablement in cifs_enable_signing
Commit 9ddec56131 (cifs: move handling of signed connections into
separate function) broke signing on SMB2/3 connections. While the code
to enable signing on the connections was very similar between the two,
the bits that get set in the sec_mode are different.

Declare a couple of new smb_version_values fields and set them
appropriately for SMB1 and SMB2/3. Then change cifs_enable_signing to
use those instead.

Reported-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Tested-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-06-27 23:42:18 -05:00
Dave Chinner
ddf6ad0143 xfs: Use inode create transaction
Replace the use of buffer based logging of inode initialisation,
uses the new logical form to describe the range to be initialised
in recovery. We continue to "log" the inode buffers to push them
into the AIL and ensure that the inode create transaction is not
removed from the log before the inode buffers are written to disk.

Update the transaction identifier and reservations to match the
changed implementation.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-27 14:27:18 -05:00
Dave Chinner
28c8e41af6 xfs: Inode create item recovery
When we find a icreate transaction, we need to get and initialise
the buffers in the range that has been passed. Extract and verify
the information in the item record, then loop over the range
initialising and issuing the buffer writes delayed.

Support an arbitrary size range to initialise so that in
future when we allocate inodes in much larger chunks all kernels
that understand this transaction can still recover them.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-27 14:26:21 -05:00
Dave Chinner
b8402b4729 xfs: Inode create transaction reservations
Define the log and space transaction sizes. Factor the current
create log reservation macro into the two logical halves and reuse
one half for the new icreate transactions. The icreate transaction
is transparent to all the high level create code - the
pre-calculated reservations will correctly set the reservations
dependent on whether the filesystem supports the icreate
transaction.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-27 13:36:37 -05:00
Dave Chinner
3ebe7d2d73 xfs: Inode create log items
Introduce the inode create log item type for logical inode create logging.
Instead of logging the changes in buffers, pass the range to be
initialised through the log by a new transaction type.  This reduces
the amount of log space required to record initialisation during
allocation from about 128 bytes per inode to a small fixed amount
per inode extent to be initialised.

This requires a new log item type to track it through the log
and the AIL. This is a relatively simple item - most callbacks are
noops as this item has the same life cycle as the transaction.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-27 13:34:12 -05:00
Dave Chinner
5f6bed76c0 xfs: Introduce an ordered buffer item
If we have a buffer that we have modified but we do not wish to
physically log in a transaction (e.g. we've logged a logical
change), we still need to ensure that transactional integrity is
maintained. Hence we must not move the tail of the log past the
transaction that the buffer is associated with before the buffer is
written to disk.

This means these special buffers still need to be included in the
transaction and added to the AIL just like a normal buffer, but we
do not want the modifications to the buffer written into the
transaction. IOWs, what we want is an "ordered buffer" that
maintains the same transactional life cycle as a physically logged
buffer, just without the transcribing of the modifications to the
log.

Hence we need to flag the buffer as an "ordered buffer" to avoid
including it in vector size calculations or formatting during the
transaction. Once the transaction is committed, the buffer appears
for all intents to be the same as a physically logged buffer as it
transitions through the log and AIL.

Relogging will also work just fine for such an ordered buffer - the
logical transaction will be replayed before the subsequent
modifications that relog the buffer, so everything will be
reconstructed correctly by recovery.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-27 13:33:11 -05:00
Dave Chinner
fd63875cc4 xfs: Introduce ordered log vector support
And "ordered log vector" is a log vector that is used for
tracking a log item through the CIL and into the AIL as part of the
log checkpointing. These ordered log vectors are special in that
they are not written to to journal in any way, and are not accounted
to the checkpoint being written.

The reason for this behaviour is to allow operations to attach items
to transactions and have them follow the normal transactional
lifecycle without actually having to write them to the journal. This
allows logging of items that track high level logical changes and
writing them to the log, while the physical items being modified
pass through into the AIL and pin the tail of the log (and therefore
the logical item in the log) until all the modified items are
physically written to disk.

IOWs, it allows us to write metadata without physically logging
every individual change but still maintain the full transactional
integrity guarantees we currently have w.r.t. crash recovery.

This change modifies some of the CIL item insertion loops, as
ordered log vectors introduce some new constraints as they don't
track any data. One advantage of this change is that it combines
two log vector chain walks into a single pass, so there is less
overhead in the transaction commit pass as well. It also kills some
unused code in the log vector walk loop when committing the CIL.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-27 13:32:08 -05:00
Dave Chinner
1baaed8fa9 xfs: xfs_ifree doesn't need to modify the inode buffer
Long ago, bulkstat used to read inodes directly from the backing
buffer for speed. This had the unfortunate problem of being cache
incoherent with unlinks, and so xfs_ifree() had to mark the inode
as free directly in the backing buffer. bulkstat was changed some
time ago to use inode cache coherent lookups, and so will never see
unlinked inodes in it's lookups. Hence xfs_ifree() does not need to
touch the inode backing buffer anymore.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-27 13:31:04 -05:00
Dave Chinner
cca9f93a52 xfs: don't do IO when creating an new inode
When we are allocating a new inode, we read the inode cluster off
disk to increment the generation number. We are already using a
random generation number for newly allocated inodes, so if we are not
using the ikeep mode, we can just generate a new generation number
when we initialise the newly allocated inode.

This avoids the need for reading the inode buffer during inode
creation. This will speed up allocation of inodes in cold, partially
allocated clusters as they will no longer need to be read from disk
during allocation. It will also reduce the CPU overhead of inode
allocation by not having the process the buffer read, even on cache
hits.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-27 13:28:20 -05:00
Dave Chinner
133eeb1747 xfs: don't use speculative prealloc for small files
Dedicated small file workloads have been seeing significant free
space fragmentation causing premature inode allocation failure
when large inode sizes are in use. A particular test case showed
that a workload that runs to a real ENOSPC on 256 byte inodes would
fail inode allocation with ENOSPC about about 80% full with 512 byte
inodes, and at about 50% full with 1024 byte inodes.

The same workload, when run with -o allocsize=4096 on 1024 byte
inodes would run to being 100% full before giving ENOSPC. That is,
no freespace fragmentation at all.

The issue was caused by the specific IO pattern the application had
- the framework it was using did not support direct IO, and so it
was emulating it by using fadvise(DONT_NEED). The result was that
the data was getting written back before the speculative prealloc
had been trimmed from memory by the close(), and so small single
block files were being allocated with 2 blocks, and then having one
truncated away. The result was lots of small 4k free space extents,
and hence each new 8k allocation would take another 8k from
contiguous free space and turn it into 4k of allocated space and 4k
of free space.

Hence inode allocation, which requires contiguous, aligned
allocation of 16k (256 byte inodes), 32k (512 byte inodes) or 64k
(1024 byte inodes) can fail to find sufficiently large freespace and
hence fail while there is still lots of free space available.

There's a simple fix for this, and one that has precendence in the
allocator code already - don't do speculative allocation unless the
size of the file is larger than a certain size. In this case, that
size is the minimum default preallocation size:
mp->m_writeio_blocks. And to keep with the concept of being nice to
people when the files are still relatively small, cap the prealloc
to mp->m_writeio_blocks until the file goes over a stripe unit is
size, at which point we'll fall back to the current behaviour based
on the last extent size.

This will effectively turn off speculative prealloc for very small
files, keep preallocation low for small files, and behave as it
currently does for any file larger than a stripe unit. This
completely avoids the freespace fragmentation problem this
particular IO pattern was causing.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-27 13:27:37 -05:00
Dave Chinner
34eefc06a0 xfs: plug directory buffer readahead
Similar to bulkstat inode chunk readahead, we need to plug directory
data buffer readahead during getdents to ensure that we can merge
adjacent readahead requests and sort out of order requests optimally
before they are dispatched. This improves the readahead efficiency
and reduces the IO load it generates as the IO patterns are
significantly better for both contiguous and fragmented directories.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-27 13:27:24 -05:00
Dave Chinner
cbb2864aa4 xfs: add pluging for bulkstat readahead
I was running some tests on bulkstat on CRC enabled filesystems when
I noticed that all the IO being issued was 8k in size, regardless of
the fact taht we are issuing sequential 8k buffers for inodes
clusters. The IO size should be 16k for 256 byte inodes, and 32k for
512 byte inodes, but this wasn't happening.

blktrace showed that there was an explict plug and unplug happening
around each readahead IO from _xfs_buf_ioapply, and the unplug was
causing the IO to be issued immediately. Hence no opportunity was
being given to the elevator to merge adjacent readahead requests and
dispatch them as a single IO.

Add plugging around the inode chunk readahead dispatch loop in
bulkstat to ensure that we don't unplug the queue between adjacent
inode buffer readahead IOs and so we get fewer, larger IO requests
hitting the storage subsystem for bulkstat.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-27 13:26:23 -05:00
Bob Peterson
a01aedfe21 GFS2: Reserve journal space for quota change in do_grow
If a GFS2 file system is mounted with quotas and a file is grown
in such a way that its free blocks for the allocation are represented
in a secondary bitmap, GFS2 ran out of blocks in the transaction.
That resulted in "fatal: assertion "tr->tr_num_buf <= tr->tr_blocks".
This patch reserves extra blocks for the quota change so the
transaction has enough space.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-06-27 18:16:27 +01:00
Steve French
e65a5cb417 [CIFS] Fix build warning
Fix build warning in Shirish's recent SMB3 signing patch
which occurs when SMB2 support is disabled in Kconfig.

fs/built-in.o: In function `cifs_setup_session':
>> (.text+0xa1767): undefined reference to `generate_smb3signingkey'

Pointed out by: automated 0-DAY kernel build testing backend
Intel Open Source Technology Center

CC: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-06-27 01:06:50 -05:00
Steve French
429b46f4fd [CIFS] SMB3 Signing enablement
SMB3 uses a much faster method of signing (which is also better in other ways),
AES-CMAC.  With the kernel now supporting AES-CMAC since last release, we
are overdue to allow SMB3 signing (today only CIFS and SMB2 and SMB2.1,
but not SMB3 and SMB3.1 can sign) - and we need this also for checking
secure negotation and also per-share encryption (two other new SMB3 features
which we need to implement).

This patch needs some work in a few areas - for example we need to
move signing for SMB2/SMB3 from per-socket to per-user (we may be able to
use the "nosharesock" mount option in the interim for the multiuser case),
and Shirish found a bug in the earlier authentication overhaul
(setting signing flags properly) - but those can be done in followon
patches.

Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-06-26 23:45:05 -05:00
Steve French
f87ab88b40 [CIFS] Do not set DFS flag on SMB2 open
If we would set SMB2_FLAGS_DFS_OPERATIONS on open we also would have
to pass the path on the Open SMB prefixed by \\server\share.
Not sure when we would need to do the augmented path (if ever) and
setting this flag breaks the SMB2 open operation since it is
illegal to send an empty path name (without \\server\share prefix)
when the DFS flag is set in the SMB open header. We could
consider setting the flag on all operations other than open
but it is safer to net set it for now.

Signed-off-by: Steve French <smfrench@gmail.com>
2013-06-26 19:14:55 -05:00
Steve French
84ceeb9626 [CIFS] fix static checker warning
Dan Carpenter wrote:

The patch 7f420cee8bd6: "[CIFS] Charge at least one credit, if server
says that it supports multicredit" from Jun 23, 2013, leads to the
following Smatch complaint:

fs/cifs/smb2pdu.c:120 smb2_hdr_assemble()
         warn: variable dereferenced before check 'tcon->ses' (see line 115)

CC: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-06-26 17:52:17 -05:00
Jeff Layton
52dfb446db cifs: try to handle the MUST SecurityFlags sanely
The cifs.ko SecurityFlags interface wins my award for worst-designed
interface ever, but we're sort of stuck with it since it's documented
and people do use it (even if it doesn't work correctly).

Case in point -- you can specify multiple sets of "MUST" flags. It makes
absolutely no sense, but you can do it.

What should the effect be in such a case? No one knows or seems to have
considered this so far, so let's define it now. If you try to specify
multiple MUST flags, clear any other MAY or MUST bits except for the
ones that involve signing.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-06-26 17:31:55 -05:00
Steve French
5d875cc928 When server doesn't provide SecurityBuffer on SMB2Negotiate pick default
According to MS-SMB2 section 2.2.4: if no blob, client picks default which
for us will be
	ses->sectype = RawNTLMSSP;
but for time being this is also our only auth choice so doesn't matter
as long as we include this fix (which does not treat the empty
SecurityBuffer as an error as the code had been doing).
We just found a server which sets blob length to zero expecting raw so
this fixes negotiation with that server.

Signed-off-by: Steve French <smfrench@gmail.com>
2013-06-26 17:31:50 -05:00
Steve French
fdf96a907c Handle big endianness in NTLM (ntlmv2) authentication
This is RH bug 970891
Uppercasing of username during calculation of ntlmv2 hash fails
because UniStrupr function does not handle big endian wchars.

Also fix a comment in the same code to reflect its correct usage.

[To make it easier for stable (rather than require 2nd patch) fixed
this patch of Shirish's to remove endian warning generated
by sparse -- steve f.]

Reported-by: steve <sanpatr1@in.ibm.com>
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Cc: <stable@kernel.org>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-06-26 17:31:45 -05:00
Jeff Layton
2a2c41c07c revalidate directories instiantiated via FIND_* in order to handle DFS referrals
We've had a long-standing problem with DFS referral points. CIFS servers
generally try to make them look like directories in FIND_FIRST/NEXT
responses. When you go to try to do a FIND_FIRST on them though, the
server will then (correctly) return STATUS_PATH_NOT_COVERED. Mostly this
manifests as spurious EREMOTE errors back to userland.

This patch attempts to fix this by marking directories that are
discovered via FIND_FIRST/NEXT for revaldiation. When the lookup code
runs across them again, we'll reissue a QPathInfo against them and that
will make it chase the referral properly.

There is some performance penalty involved here and no I haven't
measured it -- it'll be highly dependent upon the workload and contents
of the mounted share. To try and mitigate that though, the code only
marks the inode for revalidation when it's possible to run across a DFS
referral. i.e.: when the kernel has DFS support built in and the share
is "in DFS"

[At the Microsoft plugfest we noted that usually the DFS links had
the REPARSE attribute tag enabled - DFS junctions are reparse points
after all - so I just added a check for that flag too so the
performance impact should be smaller - Steve]

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Sachin Prabhu <sprabhu@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-06-26 17:31:41 -05:00
Steve French
4a72dafa19 SMB2 FSCTL and IOCTL worker function
This worker function is needed to send SMB2 fsctl
(and ioctl) requests including:

validating negotiation info (secure negotiate)
querying the servers network interfaces
copy offload (refcopy)

Followon patches for the above three will use this.
This patch also does general validation of the response.

In the future, as David Disseldorp notes, for the copychunk ioctl
case, we will want to enhance the response processing to allow
returning the chunk request limits to the caller (even
though the server returns an error, in that case we would
return data that the caller could use - see 2.2.32.1).

See MS-SMB2 Section 2.2.31 for more details on format of fsctl.

Acked-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-06-26 17:31:29 -05:00
Steve French
2b80d049eb Charge at least one credit, if server says that it supports multicredit
In SMB2.1 and later the server will usually set the large MTU flag, and
we need to charge at least one credit, if server says that since
it supports multicredit.  Windows seems to let us get away with putting
a zero there, but they confirmed that it is wrong and the spec says
to put one there (if the request is under 64K and the CAP_LARGE_MTU
was returned during protocol negotiation by the server.

CC: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-06-26 17:26:03 -05:00
Steve French
7f6538585e Remove typo
Cut and paste likely introduced accidentally inserted spurious #define
in d60622eb5a causes no harm but looks weird

Signed-off-by: Steve French <smfrench@gmail.com>
2013-06-26 17:26:03 -05:00
Steve French
c8664730bb Some missing share flags
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-06-26 17:26:03 -05:00
Zhao Hongjiang
46b51d0835 cifs: using strlcpy instead of strncpy
for NUL terminated string, need alway set '\0' in the end.

Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-06-26 17:25:20 -05:00
Jie Liu
80a4049813 xfs: Remove dead function prototype xfs_sync_inode_grab()
Remove dead function prototype xfs_sync_inode_grab()
from xfs_icache.h.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-26 12:29:27 -05:00
Jie Liu
43df2ee659 xfs: Remove the left function variable from xfs_ialloc_get_rec()
This patch clean out the left function variable as it is
useless to xfs_ialloc_get_rec().

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-26 12:22:41 -05:00
Bart Van Assche
cfa805f6f1 dlm: Avoid LVB truncation
For lockspaces with an LVB length above 64 bytes, avoid truncating
the LVB while exchanging it with another node in the cluster.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: David Teigland <teigland@redhat.com>
2013-06-26 11:38:02 -05:00
Stephane Eranian
2976b10f05 perf: Disable monitoring on setuid processes for regular users
There was a a bug in setup_new_exec(), whereby
the test to disabled perf monitoring was not
correct because the new credentials for the
process were not yet committed and therefore
the get_dumpable() test was never firing.

The patch fixes the problem by moving the
perf_event test until after the credentials
are committed.

Signed-off-by: Stephane Eranian <eranian@google.com>
Tested-by: Jiri Olsa <jolsa@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-26 11:40:18 +02:00
Eliezer Tamir
2d48d67fa8 net: poll/select low latency socket support
select/poll busy-poll support.

Split sysctl value into two separate ones, one for read and one for poll.
updated Documentation/sysctl/net.txt

Add a new poll flag POLL_LL. When this flag is set, sock_poll will call
sk_poll_ll if possible. sock_poll sets this flag in its return value
to indicate to select/poll when a socket that can busy poll is found.

When poll/select have nothing to report, call the low-level
sock_poll again until we are out of time or we find something.

Once the system call finds something, it stops setting POLL_LL, so it can
return the result to the user ASAP.

Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-25 16:35:52 -07:00
Linus Torvalds
5dbc746960 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse bugfix from Miklos Szeredi:
 "This fixes a race between fallocate() and truncate()"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: hold i_mutex in fuse_file_fallocate()
2013-06-25 09:06:04 -10:00
David Teigland
696b3d8460 dlm: log an error for unmanaged lockspaces
Log an error message if the dlm user daemon exits
before all the lockspaces have been removed.

Signed-off-by: David Teigland <teigland@redhat.com>
2013-06-25 12:53:20 -05:00
Zhao Hongjiang
ad917e7f82 dlm: config: using strlcpy instead of strncpy
for NUL terminated string, need alway set '\0' in the end.

Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2013-06-25 12:53:06 -05:00
Aruna Balakrishnaiah
bf2883339a pstore: Fail to unlink if a driver has not defined pstore_erase
pstore_erase is used to erase the record from the persistent store.
So if a driver has not defined pstore_erase callback return
-EPERM instead of unlinking a file as deleting the file without
erasing its record in persistent store will give a wrong impression
to customers.

Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2013-06-25 10:04:38 -07:00
Greg Kroah-Hartman
b5aef682e0 Merge 3.10-rc7 into driver-core-next
We want the firmware merge fixes, and other bits, in here now.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-06-24 15:14:43 -07:00
Steve French
be7457d388 Update headers to update various SMB3 ioctl definitions
MS-SMB2 Section 2.2.31 lists fsctls.  Update our list of valid
cifs/smb2/smb3 fsctls and some related structs
based on more recent version of docs.  Additional detail on
less common ones can be found in MS-FSCC section 2.3.

CopyChunk (server side copy, ie refcopy) will depend on a few
of these

Signed-off-by: Steve French <smfrench@gmail.com>
2013-06-24 01:56:47 -05:00
Steve French
f43a033d44 Update cifs version number
More than 160 fixes since we last bumped the version number of cifs.ko.
Update to version 2.01 so it is easier in modinfo to tell
that fixes are in.

Signed-off-by: Steve French <smfrench@gmail.com>
2013-06-24 01:56:46 -05:00
Steve French
769ee6a402 Add ability to dipslay SMB3 share flags and capabilities for debugging
SMB3 protocol adds various optional per-share capabilities (and
SMB3.02 adds one more beyond that).  Add ability to dump
(/proc/fs/cifs/DebugData) the share capabilities and share flags to
improve debugging.

Signed-off-by: Steve French <smfrench@gmail.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
2013-06-24 01:56:46 -05:00
Steve French
2b5dc286da Add some missing SMB3 and SMB3.02 flags
A few missing flags from SMB3.0 dialect, one missing from 2.1, and the
new #define flags for SMB3.02

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-06-24 01:56:45 -05:00
Steve French
20b6d8b42e Add SMB3.02 dialect support
The new Windows update supports SMB3.02 dialect, a minor update to SMB3.
This patch adds support for mounting with vers=3.02

Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
2013-06-24 01:56:45 -05:00
Steve French
9cd2e62c49 Fix endian error in SMB2 protocol negotiation
Fix minor endian error in Jeff's auth rewrite

Reviewed-by: Jeff Laytonn <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-06-24 01:56:45 -05:00
Jeff Layton
7715dad8e1 cifs: clean up the SecurityFlags write handler
The SecurityFlags handler uses an obsolete simple_strtoul() call, and
doesn't really handle the bounds checking well. Fix it to use
kstrtouint() instead. Clean up the error messages as well and fix a
bogus check for an unsigned int to be less than 0.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-06-24 01:56:44 -05:00
Jeff Layton
896a8fc25b cifs: update the default global_secflags to include "raw" NTLMv2
Before this patchset, the global_secflags could only offer up a single
sectype. With the new set though we have the ability to allow different
sectypes since we sort out the one to use after talking to the server.

Change the global_secflags to allow NTLMSSP or NTLMv2 by default. If the
server sets the extended security bit in the Negotiate response, then
we'll use NTLMSSP. If it doesn't then we'll use raw NTLMv2. Mounting a
LANMAN server will still require a sec= option by default.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-06-24 01:56:44 -05:00
Jeff Layton
3f618223dc move sectype to the cifs_ses instead of TCP_Server_Info
Now that we track what sort of NEGOTIATE response was received, stop
mandating that every session on a socket use the same type of auth.

Push that decision out into the session setup code, and make the sectype
a per-session property. This should allow us to mix multiple sectypes on
a socket as long as they are compatible with the NEGOTIATE response.

With this too, we can now eliminate the ses->secFlg field since that
info is redundant and harder to work with than a securityEnum.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-06-24 01:56:44 -05:00
Jeff Layton
38d77c50b4 cifs: track the enablement of signing in the TCP_Server_Info
Currently, we determine this according to flags in the sec_mode, flags
in the global_secflags and via other methods. That makes the semantics
very hard to follow and there are corner cases where we don't handle
this correctly.

Add a new bool to the TCP_Server_Info that acts as a simple flag to tell
us whether signing is enabled on this connection or not, and fix up the
places that need to determine this to use that flag.

This is a bit weird for the SMB2 case, where signing is per-session.
SMB2 needs work in this area already though. The existing SMB2 code has
similar logic to what we're using here, so there should be no real
change in behavior. These changes should make it easier to implement
per-session signing in the future though.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-06-24 01:56:43 -05:00
Jeff Layton
1e3cc57e47 add new fields to smb_vol to track the requested security flavor
We have this to some degree already in secFlgs, but those get "or'ed" so
there's no way to know what the last option requested was. Add new fields
that will eventually supercede the secFlgs field in the cifs_ses.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-06-24 01:56:43 -05:00
Jeff Layton
28e11bd86d cifs: add new fields to cifs_ses to track requested security flavor
Currently we have the overrideSecFlg field, but it's quite cumbersome
to work with. Add some new fields that will eventually supercede it.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-06-24 01:56:43 -05:00
Jeff Layton
e598d1d8fb cifs: track the flavor of the NEGOTIATE reponse
Track what sort of NEGOTIATE response we get from the server, as that
will govern what sort of authentication types this socket will support.

There are three possibilities:

LANMAN: server sent legacy LANMAN-type response

UNENCAP: server sent a newer-style response, but extended security bit
wasn't set. This socket will only support unencapsulated auth types.

EXTENDED: server sent a newer-style response with the extended security
bit set. This is necessary to support krb5 and ntlmssp auth types.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-06-24 01:56:42 -05:00
Jeff Layton
515d82ffd0 cifs: add new "Unspecified" securityEnum value
Add a new securityEnum value to cover the case where a sec= option
was not explicitly set.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-06-24 01:56:42 -05:00
Jeff Layton
9193400b69 cifs: factor out check for extended security bit into separate function
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-06-24 01:56:42 -05:00