This is a late set of changes from Deepa Dinamani doing an automated
treewide conversion of the inode and iattr structures from 'timespec'
to 'timespec64', to push the conversion from the VFS layer into the
individual file systems.
There were no conflicts between this and the contents of linux-next
until just before the merge window, when we saw multiple problems:
- A minor conflict with my own y2038 fixes, which I could address
by adding another patch on top here.
- One semantic conflict with late changes to the NFS tree. I addressed
this by merging Deepa's original branch on top of the changes that
now got merged into mainline and making sure the merge commit includes
the necessary changes as produced by coccinelle.
- A trivial conflict against the removal of staging/lustre.
- Multiple conflicts against the VFS changes in the overlayfs tree.
These are still part of linux-next, but apparently this is no longer
intended for 4.18 [1], so I am ignoring that part.
As Deepa writes:
The series aims to switch vfs timestamps to use struct timespec64.
Currently vfs uses struct timespec, which is not y2038 safe.
The series involves the following:
1. Add vfs helper functions for supporting struct timepec64 timestamps.
2. Cast prints of vfs timestamps to avoid warnings after the switch.
3. Simplify code using vfs timestamps so that the actual
replacement becomes easy.
4. Convert vfs timestamps to use struct timespec64 using a script.
This is a flag day patch.
Next steps:
1. Convert APIs that can handle timespec64, instead of converting
timestamps at the boundaries.
2. Update internal data structures to avoid timestamp conversions.
Thomas Gleixner adds:
I think there is no point to drag that out for the next merge window.
The whole thing needs to be done in one go for the core changes which
means that you're going to play that catchup game forever. Let's get
over with it towards the end of the merge window.
[1] https://www.spinics.net/lists/linux-fsdevel/msg128294.html
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJbInZAAAoJEGCrR//JCVInReoQAIlVIIMt5ZX6wmaKbrjy9Itf
MfgbFihQ/djLnuSPVQ3nztcxF0d66BKHZ9puVjz6+mIHqfDvJTRwZs9nU+sOF/T1
g78fRkM1cxq6ZCkGYAbzyjyo5aC4PnSMP/NQLmwqvi0MXqqrbDoq5ZdP9DHJw39h
L9lD8FM/P7T29Fgp9tq/pT5l9X8VU8+s5KQG1uhB5hii4VL6pD6JyLElDita7rg+
Z7/V7jkxIGEUWF7vGaiR1QTFzEtpUA/exDf9cnsf51OGtK/LJfQ0oiZPPuq3oA/E
LSbt8YQQObc+dvfnGxwgxEg1k5WP5ekj/Wdibv/+rQKgGyLOTz6Q4xK6r8F2ahxs
nyZQBdXqHhJYyKr1H1reUH3mrSgQbE5U5R1i3My0xV2dSn+vtK5vgF21v2Ku3A1G
wJratdtF/kVBzSEQUhsYTw14Un+xhBLRWzcq0cELonqxaKvRQK9r92KHLIWNE7/v
c0TmhFbkZA+zR8HdsaL3iYf1+0W/eYy8PcvepyldKNeW2pVk3CyvdTfY2Z87G2XK
tIkK+BUWbG3drEGG3hxZ3757Ln3a9qWyC5ruD3mBVkuug/wekbI8PykYJS7Mx4s/
WNXl0dAL0Eeu1M8uEJejRAe1Q3eXoMWZbvCYZc+wAm92pATfHVcKwPOh8P7NHlfy
A3HkjIBrKW5AgQDxfgvm
=CZX2
-----END PGP SIGNATURE-----
Merge tag 'vfs-timespec64' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground
Pull inode timestamps conversion to timespec64 from Arnd Bergmann:
"This is a late set of changes from Deepa Dinamani doing an automated
treewide conversion of the inode and iattr structures from 'timespec'
to 'timespec64', to push the conversion from the VFS layer into the
individual file systems.
As Deepa writes:
'The series aims to switch vfs timestamps to use struct timespec64.
Currently vfs uses struct timespec, which is not y2038 safe.
The series involves the following:
1. Add vfs helper functions for supporting struct timepec64
timestamps.
2. Cast prints of vfs timestamps to avoid warnings after the switch.
3. Simplify code using vfs timestamps so that the actual replacement
becomes easy.
4. Convert vfs timestamps to use struct timespec64 using a script.
This is a flag day patch.
Next steps:
1. Convert APIs that can handle timespec64, instead of converting
timestamps at the boundaries.
2. Update internal data structures to avoid timestamp conversions'
Thomas Gleixner adds:
'I think there is no point to drag that out for the next merge
window. The whole thing needs to be done in one go for the core
changes which means that you're going to play that catchup game
forever. Let's get over with it towards the end of the merge window'"
* tag 'vfs-timespec64' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground:
pstore: Remove bogus format string definition
vfs: change inode times to use struct timespec64
pstore: Convert internal records to timespec64
udf: Simplify calls to udf_disk_stamp_to_time
fs: nfs: get rid of memcpys for inode times
ceph: make inode time prints to be long long
lustre: Use long long type to print inode time
fs: add timespec64_truncate()
Pull the timespec64 conversion from Deepa Dinamani:
"The series aims to switch vfs timestamps to use
struct timespec64. Currently vfs uses struct timespec,
which is not y2038 safe.
The flag patch applies cleanly. I've not seen the timestamps
update logic change often. The series applies cleanly on 4.17-rc6
and linux-next tip (top commit: next-20180517).
I'm not sure how to merge this kind of a series with a flag patch.
We are targeting 4.18 for this.
Let me know if you have other suggestions.
The series involves the following:
1. Add vfs helper functions for supporting struct timepec64 timestamps.
2. Cast prints of vfs timestamps to avoid warnings after the switch.
3. Simplify code using vfs timestamps so that the actual
replacement becomes easy.
4. Convert vfs timestamps to use struct timespec64 using a script.
This is a flag day patch.
I've tried to keep the conversions with the script simple, to
aid in the reviews. I've kept all the internal filesystem data
structures and function signatures the same.
Next steps:
1. Convert APIs that can handle timespec64, instead of converting
timestamps at the boundaries.
2. Update internal data structures to avoid timestamp conversions."
I've pulled it into a branch based on top of the NFS changes that
are now in mainline, so I could resolve the non-obvious conflict
between the two while merging.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
In this round, we've mainly focused on discard, aka unmap, control along with
fstrim for Android-specific usage model. In addition, we've fixed writepage flow
which returned EAGAIN previously resulting in EIO of fsync(2) due to mapping's
error state. In order to avoid old MM bug [1], we decided not to use __GFP_ZERO
for the mapping for node and meta page caches. As always, we've cleaned up many
places for future fsverity and symbol conflicts.
Enhancement:
- do discard/fstrim in lower priority considering fs utilization
- split large discard commands into smaller ones for better responsiveness
- add more sanity checks to address syzbot reports
- add a mount option, fsync_mode=nobarrier, which can reduce # of cache flushes
- clean up symbol namespace with modified function names
- be strict on block allocation and IO control in corner cases
Bug fix:
- don't use __GFP_ZERO for mappings
- fix error reports in writepage to avoid fsync() failure
- avoid selinux denial on CAP_RESOURCE on resgid/resuid
- fix some subtle race conditions in GC/atomic writes/shutdown
- fix overflow bugs in sanity_check_raw_super
- fix missing bits on get_flags
Clean-up:
- prepare the generic flow for future fsverity integration
- fix some broken coding standard
[1] https://lkml.org/lkml/2018/4/8/661
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAlsepb8ACgkQQBSofoJI
UNJdSw/+IhrYJFkJEN/pV4M5xSjYirl/P2WJ4AGi6HcpjEGmaDiBi2whod1Jw2NE
1auSMiby7K91VAmPvxMmmLhOdC8XgJ8jwY1nEaZMfmMXohlaD3FDY5bzYf5rJDF4
J184P6xUZ2IKlFVA4prwNQgYi3awPthVu1lxbFPp8GUHDbmr5ZXEysxPDzz2O0Em
oE7WmklmyCHJPhmg/EcVXfF/Ekf3zMOVR+EI2otcDjnWIQioVetIK8CKi0MM4bkG
X8Z318ANjGTd42woupXIzsiTrMRONY7zzkUvE+S6tfUjKZoIdofDM5OIXMdOxpxL
DZ53WrwfeB74igD8jDZgqD6OaonIfDfCuKrwUASFAC2Ou4h3apj3ckUzoHtAhEUL
z5yTSKTrtfuoSufhBp+nKKs3ijDgms76arw8x/pPdN6D6xDwIJtBPxC2sObPaj35
damv4GyM4+sbhGO/Gbie2q6za55IvYFZc7JNCC2D2K5tnBmUaa7/XdvxcyigniGk
AZgkaddHePkAZpa5AYYirZR8bd7IFds0+m6VcybG0/pYb0qPEcI6U4mujBSCIwVy
kXuD7su3jNjj6hWnCl5PSQo8yBWS5H8c6/o+5XHozzYA91dsLAmD8entuCreg6Hp
NaIFio0qKULweLK86f66qQTsRPMpYRAtqPS0Ew0+3llKMcrlRp4=
=JrW7
-----END PGP SIGNATURE-----
Merge tag 'f2fs-for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this round, we've mainly focused on discard, aka unmap, control
along with fstrim for Android-specific usage model. In addition, we've
fixed writepage flow which returned EAGAIN previously resulting in EIO
of fsync(2) due to mapping's error state. In order to avoid old MM bug
[1], we decided not to use __GFP_ZERO for the mapping for node and
meta page caches. As always, we've cleaned up many places for future
fsverity and symbol conflicts.
Enhancements:
- do discard/fstrim in lower priority considering fs utilization
- split large discard commands into smaller ones for better responsiveness
- add more sanity checks to address syzbot reports
- add a mount option, fsync_mode=nobarrier, which can reduce # of cache flushes
- clean up symbol namespace with modified function names
- be strict on block allocation and IO control in corner cases
Bug fixes:
- don't use __GFP_ZERO for mappings
- fix error reports in writepage to avoid fsync() failure
- avoid selinux denial on CAP_RESOURCE on resgid/resuid
- fix some subtle race conditions in GC/atomic writes/shutdown
- fix overflow bugs in sanity_check_raw_super
- fix missing bits on get_flags
Clean-ups:
- prepare the generic flow for future fsverity integration
- fix some broken coding standard"
[1] https://lkml.org/lkml/2018/4/8/661
* tag 'f2fs-for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (79 commits)
f2fs: fix to clear FI_VOLATILE_FILE correctly
f2fs: let sync node IO interrupt async one
f2fs: don't change wbc->sync_mode
f2fs: fix to update mtime correctly
fs: f2fs: insert space around that ':' and ', '
fs: f2fs: add missing blank lines after declarations
fs: f2fs: changed variable type of offset "unsigned" to "loff_t"
f2fs: clean up symbol namespace
f2fs: make set_de_type() static
f2fs: make __f2fs_write_data_pages() static
f2fs: fix to avoid accessing cross the boundary
f2fs: fix to let caller retry allocating block address
disable loading f2fs module on PAGE_SIZE > 4KB
f2fs: fix error path of move_data_page
f2fs: don't drop dentry pages after fs shutdown
f2fs: fix to avoid race during access gc_thread pointer
f2fs: clean up with clear_radix_tree_dirty_tag
f2fs: fix to don't trigger writeback during recovery
f2fs: clear discard_wake earlier
f2fs: let discard thread wait a little longer if dev is busy
...
As Ted reported:
"Hi, I was looking at f2fs's sources recently, and I noticed that there
is a very large number of non-static symbols which don't have a f2fs
prefix. There's well over a hundred (see attached below).
As one example, in fs/f2fs/dir.c there is:
unsigned char get_de_type(struct f2fs_dir_entry *de)
This function is clearly only useful for f2fs, but it has a generic
name. This means that if any other file system tries to have the same
symbol name, there will be a symbol conflict and the kernel would not
successfully build. It also means that when someone is looking f2fs
sources, it's not at all obvious whether a function such as
read_data_page(), invalidate_blocks(), is a generic kernel function
found in the fs, mm, or block layers, or a f2fs specific function.
You might want to fix this at some point. Hopefully Kent's bcachefs
isn't similarly using genericly named functions, since that might
cause conflicts with f2fs's functions --- but just as this would be a
problem that we would rightly insist that Kent fix, this is something
that we should have rightly insisted that f2fs should have fixed
before it was integrated into the mainline kernel.
acquire_orphan_inode
add_ino_entry
add_orphan_inode
allocate_data_block
allocate_new_segments
alloc_nid
alloc_nid_done
alloc_nid_failed
available_free_memory
...."
This patch adds "f2fs_" prefix for all non-static symbols in order to:
a) avoid conflict with other kernel generic symbols;
b) to indicate the function is f2fs specific one instead of generic
one;
Reported-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
i_current_depth is used only for directory inode, but its space is
shared with i_gc_failures field used for regular inode, in order to
avoid affecting i_gc_failures' value, this patch fixes to initialize
the union's fields according to inode type.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This is to give a option for user to be able to recover B/foo in the below
case.
mkdir A
sync()
rename(A, B)
creat (B/foo)
fsync (B/foo)
---crash---
Sugessted-by: Velayudhan Pillai <vijay@cs.utexas.edu>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously, we use generic FS_*_FL defined by vfs to indicate inode status
for each bit of i_flags, so f2fs's flag status definition is tied to vfs'
one, it will be hard for f2fs to reuse bits f2fs never used to indicate
new status..
In order to solve this issue, we introduce private inode status mapping,
Note, for these bits have already been persisted into disk, we should
never change their definition, for other ones, we can remap them for
later new coming status.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
For anything NFS-exported we do _not_ want to unlock new inode
before it has grown an alias; original set of fixes got the
ordering right, but missed the nasty complication in case of
lockdep being enabled - unlock_new_inode() does
lockdep_annotate_inode_mutex_key(inode)
which can only be done before anyone gets a chance to touch
->i_mutex. Unfortunately, flipping the order and doing
unlock_new_inode() before d_instantiate() opens a window when
mkdir can race with open-by-fhandle on a guessed fhandle, leading
to multiple aliases for a directory inode and all the breakage
that follows from that.
Correct solution: a new primitive (d_instantiate_new())
combining these two in the right order - lockdep annotate, then
d_instantiate(), then the rest of unlock_new_inode(). All
combinations of d_instantiate() with unlock_new_inode() should
be converted to that.
Cc: stable@kernel.org # 2.6.29 and later
Tested-by: Mike Marshall <hubcap@omnibond.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
xfstest generic/429 sometimes hangs on f2fs, caused by a thread being
unable to take a directory's i_rwsem for write in vfs_rmdir(). In the
test, one thread repeatedly creates and removes a directory, and other
threads repeatedly look up a file in the directory. The bug is that
f2fs_mkdir() calls d_instantiate() before unlock_new_inode(), resulting
in the directory inode being exposed to lookups before it has been fully
initialized. And with CONFIG_DEBUG_LOCK_ALLOC, unlock_new_inode()
reinitializes ->i_rwsem, corrupting its state when it is already held.
Fix it by calling unlock_new_inode() before d_instantiate(). This
matches what other filesystems do.
Fixes: 57397d86c6 ("f2fs: add inode operations for special inodes")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces a new mount option `test_dummy_encryption'
to allow fscrypt to create a fake fscrypt context. This is used
by xfstests.
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch merges miscellaneous mount options into struct f2fs_mount_info,
After this patch, once we add new mount option, we don't need to worry
about recovery of it in remount_fs(), since we will recover the
f2fs_sb_info.mount_opt including all options.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Commit "0a007b97aad6"(f2fs: recover directory operations by fsync)
fixed xfstest generic/342 case, but it also increased the written
data and caused the performance degradation. In most cases, there's
no need to do so heavy fsync actually.
So we introduce new mount option "fsync_mode={posix,strict}" to
control the policy of fsync. "fsync_mode=posix" is set by default,
and means that f2fs uses a light fsync, which follows POSIX semantics.
And "fsync_mode=strict" means that it's a heavy fsync, which behaves
in line with xfs, ext4 and btrfs, where generic/342 will pass, but
the performance will regress.
Signed-off-by: Junling Zheng <zhengjunling@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch supports to recognize hot file extension in f2fs, so that we
can allocate proper hot segment location for its data, which can lead to
better hot/cold seperation in filesystem.
In addition, we changes a bit on query/add/del operation method for
extension_list sysfs entry as below:
- Query: cat /sys/fs/f2fs/<disk>/extension_list
- Add: echo 'extension' > /sys/fs/f2fs/<disk>/extension_list
- Del: echo '!extension' > /sys/fs/f2fs/<disk>/extension_list
- Add: echo '[h/c]extension' > /sys/fs/f2fs/<disk>/extension_list
- Del: echo '[h/c]!extension' > /sys/fs/f2fs/<disk>/extension_list
- [h] means add/del hot file extension
- [c] means add/del cold file extension
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previous dentry page uses highmem, which will cause panic in platforms
using highmem (such as arm), since the address space of dentry pages
from highmem directly goes into the decryption path via the function
fscrypt_fname_disk_to_usr. But sg_init_one assumes the address is not
from highmem, and then cause panic since it doesn't call kmap_high but
kunmap_high is triggered at the end. To fix this problem in a simple
way, this patch avoids to put dentry page in pagecache into highmem.
Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: fix coding style]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlp2R3AACgkQ8vlZVpUN
gaOIdAgApEdlFR2Gf93z2hMj5HxVL5rjkuPJVtVkKu0eH2HMQJyxNmjymrRfuFmM
8W1CrEvVKi5Aj6r8q4KHIdVV247Ya0SVEhLwKM0LX4CvlZUXmwgCmZ/MPDTXA1eq
C4vPVuJAuSNGNVYDlDs3+NiMHINGNVnBVQQFSPBP9P+iNWPD7o486712qaF8maVn
RbfbQ2rWtOIRdlAOD1U5WqgQku59lOsmHk2pc0+X4LHCZFpMoaO80JVjENPAw+BF
daRt6TX+WljMyx6DRIaszqau876CJhe/tqlZcCLOkpXZP0jJS13yodp26dVQmjCh
w8YdiY7uHK2D+S/8eyj7h7DIwzu3vg==
=ZjQP
-----END PGP SIGNATURE-----
Merge tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt
Pull fscrypt updates from Ted Ts'o:
"Refactor support for encrypted symlinks to move common code to fscrypt"
Ted also points out about the merge:
"This makes the f2fs symlink code use the fscrypt_encrypt_symlink()
from the fscrypt tree. This will end up dropping the kzalloc() ->
f2fs_kzalloc() change, which means the fscrypt-specific allocation
won't get tested by f2fs's kmalloc error injection system; which is
fine"
* tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt: (26 commits)
fscrypt: fix build with pre-4.6 gcc versions
fscrypt: remove 'ci' parameter from fscrypt_put_encryption_info()
fscrypt: document symlink length restriction
fscrypt: fix up fscrypt_fname_encrypted_size() for internal use
fscrypt: define fscrypt_fname_alloc_buffer() to be for presented names
fscrypt: calculate NUL-padding length in one place only
fscrypt: move fscrypt_symlink_data to fscrypt_private.h
fscrypt: remove fscrypt_fname_usr_to_disk()
ubifs: switch to fscrypt_get_symlink()
ubifs: switch to fscrypt ->symlink() helper functions
ubifs: free the encrypted symlink target
f2fs: switch to fscrypt_get_symlink()
f2fs: switch to fscrypt ->symlink() helper functions
ext4: switch to fscrypt_get_symlink()
ext4: switch to fscrypt ->symlink() helper functions
fscrypt: new helper function - fscrypt_get_symlink()
fscrypt: new helper functions for ->symlink()
fscrypt: trim down fscrypt.h includes
fscrypt: move fscrypt_is_dot_dotdot() to fs/crypto/fname.c
fscrypt: move fscrypt_valid_enc_modes() to fscrypt_private.h
...
This patch adds creation time field in inode layout to support showing
kstat.btime in ->statx.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Now, we invoke f2fs_mark_inode_dirty_sync() to make an inode dirty in
advance of creating a new node page for the inode. By this, some inodes
whose node page is not created yet can be linked into the global dirty
list.
If the checkpoint is executed at this moment, the inode will be written
back by writeback_single_inode() and finally update_inode_page() will
fail to detach the inode from the global dirty list because the inode
doesn't have a node page.
The problem is that the inode's state in VFS layer will become clean
after execution of writeback_single_inode() and it's still linked in
the global dirty list of f2fs and this will cause a kernel panic.
So, we will prevent the newly created inode from being dirtied during
the FI_NEW_INODE flag of the inode is set. We will make it dirty
right after the flag is cleared.
Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Youngjin Gil <youngjin.gil@samsung.com>
Tested-by: Hobin Woo <hobin.woo@samsung.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This fixes generic/342 which doesn't recover renamed file which was fsynced
before. It will be done via another fsync on newly created file.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces f2fs_kzalloc based on f2fs_kmalloc in order to
support error injection for kzalloc().
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Now, in product, more and more features based on file encryption were
introduced, their demand of xattr space is increasing, however, inline
xattr has fixed-size of 200 bytes, once inline xattr space is full, new
increased xattr data would occupy additional xattr block which may bring
us more space usage and performance regression during persisting.
In order to resolve above issue, it's better to expand inline xattr size
flexibly according to user's requirement.
So this patch introduces new filesystem feature 'flexible inline xattr',
and new mount option 'inline_xattr_size=%u', once mkfs enables the
feature, we can use the option to make f2fs supporting flexible inline
xattr size.
To support this feature, we add extra attribute i_inline_xattr_size in
inode layout, indicating that how many space inline xattr borrows from
block address mapping space in inode layout, by this, we can easily
locate and store flexible-sized inline xattr data in inode.
Inode disk layout:
+----------------------+
| .i_mode |
| ... |
| .i_ext |
+----------------------+
| .i_extra_isize |
| .i_inline_xattr_size |-----------+
| ... | |
+----------------------+ |
| .i_addr | |
| - block address or | |
| - inline data | |
+----------------------+<---+ v
| inline xattr | +---inline xattr range
+----------------------+<---+
| .i_nid |
+----------------------+
| node_footer |
| (nid, ino, offset) |
+----------------------+
Note that, we have to cnosider backward compatibility which reserved
inline_data space, 200 bytes, all the time, reported by Sheng Yong.
Previous inline data or directory always reserved 200 bytes in inode layout,
even if inline_xattr is disabled. In order to keep inline_dentry's structure
for backward compatibility, we get the space back only from inline_data.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Reported-by: Sheng Yong <shengyong1@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds to call quota_intialize in f2fs_set_acl, f2fs_unlink,
and f2fs_rename.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch replaces to use cp_error flag instead of RDONLY for quota off.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
f2fs enables hash-indexed directory by default, so we need to tag
FS_INDEX_FL in inode::i_flags during directory creataion, in order
to show correct status of inode in lsattr:
Before:
------------------- /mnt/f2fs/dir/
After:
-----------I------- /mnt/f2fs/dir/
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In ->lookup(), we will have a try to recover dot or dotdot for
corrupted directory, once disk quota is on, if it allocates new
block during dotdot recovery, we need to record disk quota info
for the allocation, so this patch fixes this issue by adding
missing dquot_initialize() in __recover_dot_dentries.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch add new flag F2FS_EXTRA_ATTR storing in inode.i_inline
to indicate that on-disk structure of current inode is extended.
In order to extend, we changed the inode structure a bit:
Original one:
struct f2fs_inode {
...
struct f2fs_extent i_ext;
__le32 i_addr[DEF_ADDRS_PER_INODE];
__le32 i_nid[DEF_NIDS_PER_INODE];
}
Extended one:
struct f2fs_inode {
...
struct f2fs_extent i_ext;
union {
struct {
__le16 i_extra_isize;
__le16 i_padding;
__le32 i_extra_end[0];
};
__le32 i_addr[DEF_ADDRS_PER_INODE];
};
__le32 i_nid[DEF_NIDS_PER_INODE];
}
Once F2FS_EXTRA_ATTR is set, we will steal four bytes in the head of
i_addr field for storing i_extra_isize and i_padding. with i_extra_isize,
we can calculate actual size of reserved space in i_addr, available
attribute fields included in total extra attribute fields for current
inode can be described as below:
+--------------------+
| .i_mode |
| ... |
| .i_ext |
+--------------------+
| .i_extra_isize |-----+
| .i_padding | |
| .i_prjid | |
| .i_atime_extra | |
| .i_ctime_extra | |
| .i_mtime_extra |<----+
| .i_inode_cs |<----- store blkaddr/inline from here
| .i_xattr_cs |
| ... |
+--------------------+
| |
| block address |
| |
+--------------------+
| .i_nid |
+--------------------+
| node_footer |
| (nid, ino, offset) |
+--------------------+
Hence, with this patch, we would enhance scalability of f2fs inode for
storing more newly added attribute.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds to support plain user/group quota.
Change Note by Jaegeuk Kim.
- Use f2fs page cache for quota files in order to consider garbage collection.
so, quota files are not tolerable for sudden power-cuts, so user needs to do
quotacheck.
- setattr() calls dquot_transfer which will transfer inode->i_blocks.
We can't reclaim that during f2fs_evict_inode(). So, we need to count
node blocks as well in order to match i_blocks with dquot's space.
Note that, Chao wrote a patch to count inode->i_blocks without inode block.
(f2fs: don't count inode block in in-memory inode.i_blocks)
- in f2fs_remount, we need to make RW in prior to dquot_resume.
- handle fault_injection case during f2fs_quota_off_umount
- TODO: Project quota
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
After renaming a directory, fsck could detect unmatched pino. The scenario
can be reproduced as the following:
$ mkdir /bar/subbar /foo
$ rename /bar/subbar /foo
Then fsck will report:
[ASSERT] (__chk_dots_dentries:1182) --> Bad inode number[0x3] for '..', parent parent ino is [0x4]
Rename sets LOST_PINO for old_inode. However, the flag cannot be cleared,
since dir is written back with CP. So, let's get rid of LOST_PINO for a
renamed dir and fix the pino directly at the end of rename.
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In this round, we've focused on enhancing performance with regards to block
allocation, GC, and discard/in-place-update IO controls. There are a bunch
of clean-ups as well as minor bug fixes.
= Enhancement
- disable heap-based allocation by default
- issue small-sized discard commands by default
- change the policy of data hotness for logging
- distinguish IOs in terms of size and wbc type
- start SSR earlier to avoid foreground GC
- enhance data structures managing discard commands
- enhance in-place update flow
- add some more fault injection routines
- secure one more xattr entry
= Bug fix
- calculate victim cost for GC correctly
- remain correct victim segment number for GC
- race condition in nid allocator and initializer
- stale pointer produced by atomic_writes
- fix missing REQ_SYNC for flush commands
- handle missing errors in more corner cases
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJZEKXrAAoJEEAUqH6CSFDSJJ8P/1Zy0NS9TM/PFtT7Sevb6vgC
LcKLtX1bVhUuX9wAt5Q6BZ9927tCQPt5vLEYUxtniqEQaC0fsJAMbRYot+gR/dvN
4bGgv1TeVST5pKbmctzhAL30PvZ1w4QS6dLvPMm2sPQSrPKGUGt0J8wPiHHZuvH4
pygKzDxbrIJTeMhLm9tgFg7dWTJXV3VDb57WpA1AM1LAFVsIPF4vZnryLv3GsRmY
eGRxgZEtt/90hCRbEcPirPZrtpv/O5f12K4Vp/NPw+4XGMEk+nTYndq6rlUWVNjg
iPEDuxONyk/yb274SqB6sbNDuxHOqn7stGJepdUpSbprIsLZ0RmMaYWjSNsLU3Vh
p4fAzRqvfSqAHCt0FEL/vT8M9ST5xQRVr9P/l0kDK5Ww95RROd05bEaGm/sKc7NB
PHiWUoMIFFmuVsoCi6sM0AKps53ZGON8GEUyVKyM7NWTw1oWLPWifGMthEkysmwm
08SdU5+XqbCeyMPAA2GURqMA5A8ssuA8+F0Citf4JPckQHPPj5pAydmx2wVlfBlc
/bneR7T/8OsUbxgG8JSbdHUiPcjb20F0GTxSOTXiV/AaZAMCtyETnw64K2V6E0n7
uraKcYYhypyphCj/IYc4vnQ3dCu3U2/NvTYEVX8DBvboN38/JVqmNWgQx9g+tLzj
+r5s7PqTDuXv5Cfzc5NC
=SBUb
-----END PGP SIGNATURE-----
Merge tag 'for-f2fs-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this round, we've focused on enhancing performance with regards to
block allocation, GC, and discard/in-place-update IO controls. There
are a bunch of clean-ups as well as minor bug fixes.
Enhancements:
- disable heap-based allocation by default
- issue small-sized discard commands by default
- change the policy of data hotness for logging
- distinguish IOs in terms of size and wbc type
- start SSR earlier to avoid foreground GC
- enhance data structures managing discard commands
- enhance in-place update flow
- add some more fault injection routines
- secure one more xattr entry
Bug fixes:
- calculate victim cost for GC correctly
- remain correct victim segment number for GC
- race condition in nid allocator and initializer
- stale pointer produced by atomic_writes
- fix missing REQ_SYNC for flush commands
- handle missing errors in more corner cases"
* tag 'for-f2fs-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (111 commits)
f2fs: fix a mount fail for wrong next_scan_nid
f2fs: enhance scalability of trace macro
f2fs: relocate inode_{,un}lock in F2FS_IOC_SETFLAGS
f2fs: Make flush bios explicitely sync
f2fs: show available_nids in f2fs/status
f2fs: flush dirty nats periodically
f2fs: introduce CP_TRIMMED_FLAG to avoid unneeded discard
f2fs: allow cpc->reason to indicate more than one reason
f2fs: release cp and dnode lock before IPU
f2fs: shrink size of struct discard_cmd
f2fs: don't hold cmd_lock during waiting discard command
f2fs: nullify fio->encrypted_page for each writes
f2fs: sanity check segment count
f2fs: introduce valid_ipu_blkaddr to clean up
f2fs: lookup extent cache first under IPU scenario
f2fs: reconstruct code to write a data page
f2fs: introduce __wait_discard_cmd
f2fs: introduce __issue_discard_cmd
f2fs: enable small discard by default
f2fs: delay awaking discard thread
...
As for ext4, now that fscrypt_has_permitted_context() correctly handles
the case where we have the key for the parent directory but not the
child, f2fs_lookup() no longer has to work around it. Also add the same
warning message that ext4 uses.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
As discuss with Jaegeuk and Chao,
"Once checkpoint is done, f2fs doesn't need to update there-in filename at all."
The disk-level filename is used only one case,
1. create a file A under a dir
2. sync A
3. godown
4. umount
5. mount (roll_forward)
Only the rename/cross_rename changes the filename, if it happens,
a. between step 1 and 2, the sync A will caused checkpoint, so that,
the roll_forward at step 5 never happens.
b. after step 2, the roll_forward happens, file A will roll forward
to the result as after step 1.
So that, any updating the disk filename is useless, just cleanup it.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The parent directory's nlink will change, not the inode.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This round introduces several interesting features such as on-disk NAT bitmaps,
IO alignment, and a discard thread. And it includes a couple of major bug fixes
as below.
== Enhancement ==
- introduce on-disk bitmaps to avoid scanning NAT blocks when getting free nids
- support IO alignment to prepare open-channel SSD integration in future
- introduce a discard thread to avoid long latency during checkpoint and fstrim
- use SSR for warm node and enable inline_xattr by default
- introduce in-memory bitmaps to check FS consistency for debugging
- improve write_begin by avoiding needless read IO
== Bug fix ==
- fix broken zone_reset behavior for SMR drive
- fix wrong victim selection policy during GC
- fix missing behavior when preparing discard commands
- fix bugs in atomic write support and fiemap
- workaround to handle multiple f2fs_add_link calls having same name
And it includes a bunch of clean-up patches as well.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJYtmdOAAoJEEAUqH6CSFDSs0UP/AzngT37xVIhVBD13J9oHIuv
rFA/eHVGRJmU1xc4SG1bghKm45xq8rwUX7irarfvLLc5aL+6VPGSdaRBykUr4A5N
MN/bgK//EPp7If8EF+8PpY+9x7g67i0mtz5iD8dDrK+bUKV/IDKV1LWw5pR3g/g6
RwMH0dUVOiD/HJ5iFp1ykTdVPe4vFY013uVmyPxUq+nCBlqlQm1nOvrGjF/HeYyX
kqcD2LEc79GPfS5ebQIKfCfLE0rsWVnnS6YaqlDNCD5/oRim71CUtA4MPTYv29vp
R/SebWlayEm+u68+uQUu6AyIk/1IdP0+AtRuQd/VxuteoyXmkTMHER662DqN4F8J
npPdNrbNdlzwuAP77avy+hplqbD19yUa7o7Fl1No5rfheT3CiNTSj2uoriyEAffH
1AM6tES7S7n5ttrXOr9iOxrK0u/vuaf7fbKVtK+RI09hwzdvyGB5HUdQB0iP/XR+
obw8dru79ISMVZ9YuDhSfjI5ohAcfthfuqgjUt2RAfDv19IRsg5eayAp3T6nUfEX
AGQbV/52dkO9svZztMbcBW95zmqkE0cMeX66KIMCPXNuDiE474t8k115K6kHpFwP
e4Kx+mTSNhR1LEAaVdmCjbLb0gVrumVHTdjaZopnxTFmE70u/M6h1vY90m1LkReF
ZDK5mhfMmGzU4wkvbgP8
=tw8c
-----END PGP SIGNATURE-----
Merge tag 'for-f2fs-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"This round introduces several interesting features such as on-disk NAT
bitmaps, IO alignment, and a discard thread. And it includes a couple
of major bug fixes as below.
Enhancements:
- introduce on-disk bitmaps to avoid scanning NAT blocks when getting
free nids
- support IO alignment to prepare open-channel SSD integration in
future
- introduce a discard thread to avoid long latency during checkpoint
and fstrim
- use SSR for warm node and enable inline_xattr by default
- introduce in-memory bitmaps to check FS consistency for debugging
- improve write_begin by avoiding needless read IO
Bug fixes:
- fix broken zone_reset behavior for SMR drive
- fix wrong victim selection policy during GC
- fix missing behavior when preparing discard commands
- fix bugs in atomic write support and fiemap
- workaround to handle multiple f2fs_add_link calls having same name
... and it includes a bunch of clean-up patches as well"
* tag 'for-f2fs-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (97 commits)
f2fs: avoid to flush nat journal entries
f2fs: avoid to issue redundant discard commands
f2fs: fix a plint compile warning
f2fs: add f2fs_drop_inode tracepoint
f2fs: Fix zoned block device support
f2fs: remove redundant set_page_dirty()
f2fs: fix to enlarge size of write_io_dummy mempool
f2fs: fix memory leak of write_io_dummy mempool during umount
f2fs: fix to update F2FS_{CP_}WB_DATA count correctly
f2fs: use MAX_FREE_NIDS for the free nids target
f2fs: introduce free nid bitmap
f2fs: new helper cur_cp_crc() getting crc in f2fs_checkpoint
f2fs: update the comment of default nr_pages to skipping
f2fs: drop the duplicate pval in f2fs_getxattr
f2fs: Don't update the xattr data that same as the exist
f2fs: kill __is_extent_same
f2fs: avoid bggc->fggc when enough free segments are avaliable after cp
f2fs: select target segment with closer temperature in SSR mode
f2fs: show simple call stack in fault injection message
f2fs: no need lock_op in f2fs_write_inline_data
...
This patch fixes the renaming bug on encrypted filenames, which was pointed by
(ext4: don't allow encrypted operations without keys)
Cc: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We checked that "inode" is not an error pointer earlier so there is
no need to check again here.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
As part of an effort to clean up fscrypt-related error codes, make
attempting to create a file in an encrypted directory that hasn't been
"unlocked" fail with ENOKEY. Previously, several error codes were used
for this case, including ENOENT, EACCES, and EPERM, and they were not
consistent between and within filesystems. ENOKEY is a better choice
because it expresses that the failure is due to lacking the encryption
key. It also matches the error code returned when trying to open an
encrypted regular file without the key.
I am not aware of any users who might be relying on the previous
inconsistent error codes, which were never documented anywhere.
This failure case will be exercised by an xfstest.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Pull partial readlink cleanups from Miklos Szeredi.
This is the uncontroversial part of the readlink cleanup patch-set that
simplifies the default readlink handling.
Miklos and Al are still discussing the rest of the series.
* git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
vfs: make generic_readlink() static
vfs: remove ".readlink = generic_readlink" assignments
vfs: default to generic_readlink()
vfs: replace calling i_op->readlink with vfs_readlink()
proc/self: use generic_readlink
ecryptfs: use vfs_get_link()
bad_inode: add missing i_op initializers
If .readlink == NULL implies generic_readlink().
Generated by:
to_del="\.readlink.*=.*generic_readlink"
for i in `git grep -l $to_del`; do sed -i "/$to_del"/d $i; done
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This is to avoid no free segment bug during checkpoint caused by a number of
dirty inodes.
The case was reported by Chao like this.
1. mount with lazytime option
2. fill 4k file until disk is full
3. sync filesystem
4. read all files in the image
5. umount
In this case, we actually don't need to flush dirty inode to inode page during
checkpoint.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Pull more vfs updates from Al Viro:
">rename2() work from Miklos + current_time() from Deepa"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: Replace current_fs_time() with current_time()
fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
fs: Replace CURRENT_TIME with current_time() for inode timestamps
fs: proc: Delete inode time initializations in proc_alloc_inode()
vfs: Add current_time() api
vfs: add note about i_op->rename changes to porting
fs: rename "rename2" i_op to "rename"
vfs: remove unused i_op->rename
fs: make remaining filesystems use .rename2
libfs: support RENAME_NOREPLACE in simple_rename()
fs: support RENAME_NOREPLACE for local filesystems
ncpfs: fix unused variable warning