As the below comment of bio_alloc_bioset, f2fs can allocate multiple bios at the
same time. So, we can't guarantee that bio is allocated all the time.
"
* When @bs is not NULL, if %__GFP_WAIT is set then bio_alloc will always be
* able to allocate a bio. This is due to the mempool guarantees. To make this
* work, callers must never allocate more than 1 bio at a time from this pool.
* Callers that need to allocate more than 1 bio must always submit the
* previously allocated bio for IO before attempting to allocate a new one.
* Failure to do so can cause deadlocks under memory pressure.
"
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch increases the number of maximum hard links for one file.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We should avoid needless checkpoints when there is no dirty and prefree segment.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces __count_free_nids/try_to_free_nids and registers
them in slab shrinker for shrinking under memory pressure.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In f2fs_delete_entry, if last dirent is remove from the dentry page,
we will try to punch that page since it has no valid date in it.
But truncate_hole which is used for punching could fail because of
no memory or IO error, if that happened, we'd better skip clearing
this valid dentry page.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We should not write node pages when deleting orphan inodes.
In order to do that, we can eaisly set POR_DOING flag earlier before entering
orphan inode routine.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In recover_orphan_inode, whenever f2fs_iget fail, we will make kernel panic,
but it's not reasonable, because f2fs_iget can fail due to a lot of reasons
including out of memory.
So we change error handling method as below:
a) when finding no entry for the orphan inode, bug_on for catching bugs;
b) for other reasons, report it to caller.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If there is not enough free segment, we should not assign a new segment
explicitly. Otherwise, we can run out of free segment.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We can always fill up the bio now, no need to estimate the possible
size based on queue parameters.
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[hch: rebased and wrote a changelog]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Previously, we use radix tree to index all registered page entries for
atomic file, but now we only use radix tree to see whether current page
is indexed or not, since the other user of radix tree is gone in commit
042b7816aa ("f2fs: remove unnecessary call to invalidate inmemory pages").
So in this patch, we try to use one more efficient way:
Introducing a macro ATOMIC_WRITTEN_PAGE, and setting it as page private
value to indicate page indexing status. By using this way, we can save
memory and lookup time.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We run ltp testcase with f2fs and obtain a TFAIL in diotest4, the result in
detail is as fallow:
dio04
<<<test_start>>>
tag=dio04 stime=1432278894
cmdline="diotest4"
contacts=""
analysis=exit
<<<test_output>>>
diotest4 1 TPASS : Negative Offset
diotest4 2 TPASS : removed
diotest4 3 TFAIL : diotest4.c:129: write allows odd count.returns 1: Success
diotest4 4 TFAIL : diotest4.c:183: Odd count of read and write
diotest4 5 TPASS : Read beyond the file size
......
the result of ext4 with same environment:
dio04
<<<test_start>>>
tag=dio04 stime=1432259643
cmdline="diotest4"
contacts=""
analysis=exit
<<<test_output>>>
diotest4 1 TPASS : Negative Offset
diotest4 2 TPASS : removed
diotest4 3 TPASS : Odd count of read and write
diotest4 4 TPASS : Read beyond the file size
......
The reason is that when triggering DIO in f2fs, we will return zero value
in ->direct_IO if writer's buffer offset, file offset and transfer size is
not alignment to block size of filesystem, resulting in falling back into
buffered write instead of returning -EINVAL.
This patch fixes that problem by returning correct error number for above
case, and removing the judgement condition in check_direct_IO to make sure
the verification will be enabled for direct reader too.
Besides, Jaegeuk Kim pointed out that there is expectional cases we should
always make direct-io falling back into buffered write, such as dio in
encrypted file.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
[Chao Yu make small change and add detail description in commit message]
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
fill_zero can fail due to a lot of reason, but previously we do not handle
its return value, so its callers such as punch_hole/f2fs_zero_range may
report success, but actually can fail because of error occurs inside
fill_zero.
This patch fixes to report correct return value of fill_zero.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When testing with generic/101 in xfstests, error message outputed as below:
--- tests/generic/101.out
+++ results//generic/101.out.bad
@@ -10,10 +10,14 @@
File foo content after log replay:
0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
*
-0200000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+0200000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
*
0372000
...
(Run 'diff -u tests/generic/101.out results/generic/101.out.bad' to see the entire diff)
The test flow is like below:
1. pwrite foo -S 0xaa 0 64K
2. pwrite foo -S 0xbb 64K 61K
3. sync
4. truncate foo 64K
5. truncate foo 125K
6. fsync foo
7. flakey drop writes
8. umount
After this test, we expect the data of recovered file will have the first
64k of data filling with value 0xaa and the next 61k of data filling with
value 0x00 because we have fsynced it before dropping writes in dm.
In f2fs, during recovering, we will only recover the valid block address
in direct node page if it is marked as a fsynced dnode, but block address
which means invalid/reserved (with value NULL_ADDR/NEW_ADDR) will not be
recovered. So, the file recovered shows its incorrect data 0xbb in range of
[61k, 125k].
In this patch, we fix to recover invalid/reserved block during recover flow.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In some cases, we only need the block address when we call
f2fs_reserve_block,
other fields of struct dnode_of_data aren't necessary.
We can try extent cache first for such cases in order to speed up the
process.
Signed-off-by: Fan li <fanofcode.li@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
To avoid meeting garbage data in next free node block at the end of warm
node chain when doing recovery, we will try to zero out that invalid block.
If the device is not support discard, our way for zeroing out block is:
grabbing a temporary zeroed page in meta inode, then, issue write request
with this page.
But, we forget to release that temporary page, so our memory usage will
increase without gaining any hit ratio benefit, so it's better to free it
for saving memory.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In following call path, we will pass a locked and referenced ipage
pointer to get_new_data_page:
- init_inode_metadata
- make_empty_dir
- get_new_data_page
There are two exit paths in get_new_data_page when error occurs:
1) grab_cache_page fails, ipage will not be released;
2) f2fs_reserve_block fails, ipage will be released in callee.
So, it's not consistent for error handling in get_new_data_page.
For f2fs_reserve_block, it's not very easy to change the rule
of error handling, since it's already complicated.
Here we deside to choose an easy way to fix this issue:
If any error occur in get_new_data_page, we will ensure releasing
ipage in this function.
The same issue is in f2fs_convert_inline_dir, fix that too.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Replace BUG_ON with f2fs_bug_on to deal with
block and segment validity check failed.
Signed-off-by: Xue Liu <liuxueliu.liu@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In get_meta_page, we guarantee no failure for the returned page,
but sometimes, IO error from device will incur returning an
non-updated page.
Then, we still use this page as updated one, exception could happen
when using this kind of page.
So in this condition, we'd better freeze fs by making fs readonly and
and stop doing checkpoint.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
some backing devices need pages to be stable during writeback. It doesn't
matter if
the page is completely overwritten or already uptodate, it needs to wait
before write.
Signed-off-by: Fan li <fanofcode.li@samsung.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds to handle error cases in commit_inmem_pages.
If an error occurs, it stops to write the pages and return the error right
away.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When there is no enough free nids in free nid cache, we will try to
readahead FREE_NID_PAGES:4 nat pages into page cache of meta_inode,
then, reading nat entries in nat page for adding free nids to free nid
cache.
But when traversing all nat pages we readaheaded in a circulation,
our exit condition is not set right, one more nat page will be scanned
without readaheading, resulting worse read performance.
This patch fixes to read the correct number nat pages to avoid bad
performance.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If we clear inline data/dentry flag in handle_failed_inode, we will fail
to decline the stat count of inline data/dentry in f2fs_evict_inode due
to no flag in inode. So remove the wrong clearing.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In f2fs_ioc_start_{atomic,volatile}_write, if we failed in converting
inline data, we will report error to user, but still remain atomic/volatile
flag in inode, it will impact further writes for this file. Fix it.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes the incorrect range (0, LONG_MAX) which is used
in ranged fsync. If we use LONG_MAX as the parameter for indicating
the end of file we want to synchronize, in 32-bits architecture
machine, these datas after 4GB offset may not be persisted in
storage after ->fsync returned.
Here, we alter LONG_MAX to LLONG_MAX to fix this issue.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When flushing comes from background, if there is no dirty page in the
mapping of inode, we'd better to skip seeking dirty page from mapping
for writebacking.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The if statement "goto continue_unlock" is exactly the same when
each if condition is true that is depended on the value of both
"step" and "is_cold_data(page)" are 0 or 1. That means when the
value of "step" equals to "is_cold_data(page)", the if condition
is true and the if statement "goto continue_unlock" appears only
once, so it can be optimized to reduce the duplicated code.
Signed-off-by: Tiezhu Yang <kernelpatch@126.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In handle_failed_inode, there is a potential deadlock which can happen
in below call path:
- f2fs_create
- f2fs_lock_op down_read(cp_rwsem)
- f2fs_add_link
- __f2fs_add_link
- init_inode_metadata
- f2fs_init_security failed
- truncate_blocks failed
- handle_failed_inode
- f2fs_truncate
- truncate_blocks(..,true)
- write_checkpoint
- block_operations
- f2fs_lock_all down_write(cp_rwsem)
- f2fs_lock_op down_read(cp_rwsem)
So in this path, we pass parameter to f2fs_truncate to make sure
cp_rwsem in truncate_blocks will not be locked again.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In f2fs_do_collapse, region cp_rwsem covered is large, since it will be
held until all blocks are left shifted, so if we try to collapse small
area at the beginning of large file, checkpoint who want to grab writer's
lock of cp_rwsem will be delayed for long time.
In order to avoid this condition, altering to lock/unlock cp_rwsem each
shift operation.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Add a lookup and a insertion interface for extent tree.
The new lookup return the insert position and the prev/next
extents closest to the offset we lookup when find no match.
The new insertion uses above parameters to improve performance.
There are three possible insertions after the lookup in
f2fs_update_extent_tree, two of them insert parts of removed extent
back to tree, since no merge happens during this process, new insertion
skips the merge check in this scanario; the another insertion inserts a
new extent to tree, new insertion uses prev/next extent and insert
position to insert this extent directly, and save the time of searching
down the tree.
As long as tree remains unchanged between lookup and insertion, this
would work fine. And the new lookup would be useful when add
multi-blocks extent support for insertion interface.
Signed-off-by: Fan li <fanofcode.li@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch changes for a caller to handle the page after its bio gets an error.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Variables for recording extent cache ratio info were updated without
protection, this patch tries to alter them to atomic_t type for more
accurate stat.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds to stat the number of inline xattr inode for
showing in debugfs.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
That encrypted page is used temporarily, so we don't need to mark it accessed.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If there are gced dirty pages and normal dirty pages in the mapping
of one inode, we might writeback them alternately with discontinuous
block address, resulting in low performance.
This patch introduces f2fs_write_cache_pages with codes copied from
write_cache_pages in mm/page-writeback.c.
In this function, we refactor flow with two steps:
1) writeback all cold type pages.
2) writeback all non-cold type pages.
By using this method, f2fs will writeback dirty pages with the same
temperature in bunch mode, it makes writeouted block being with
more continuous address, so they can be merged as much as possible
in f2fs bio cache, and also it will reduce the chance of submiting
small IO from block layer.
Test environment: 8g nokia sd card (very old sd card, but it shows
better effect when testing with this patch, and with a 32g kingston
sd card, I didn't see much more improvement).
Test step:
1. touch testfile;
2. truncate -s 512K testfile;
3. write all pages with odd index;
4. trigger gc by ioctl;
5. write all pages with even index;
6. time fsync testfile.
before:
real 0m0.402s
user 0m0.000s
sys 0m0.000s
after:
real 0m0.143s
user 0m0.004s
sys 0m0.004s
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes to return correct error number of ->setxattr, which
is reported by xfstest tests/generic/026 as below:
generic/026 - output mismatch
--- tests/generic/026.out
+++ results/generic/026.out.bad
@@ -4,6 +4,6 @@
1 below acl max
acl max
1 above acl max
-chacl: cannot set access acl on "largeaclfile": Argument list too long
+chacl: cannot set access acl on "largeaclfile": Numerical result out of range
use 16 aces
use 17 aces
...
Ran: generic/026
Failures: generic/026
Failed 1 of 1 tests
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously, since 'commit 4531929e39 ("f2fs: move grabing orphan
pages out of protection region")' was committed, in write_orphan_inodes(),
we will grab all meta page in a batch before we use them under spinlock,
so that we can avoid large time delay of grabbing meta pages under
spinlock.
Now, 'commit d6c67a4fee ("f2fs: revmove spin_lock for
write_orphan_inodes")' remove the spinlock in write_orphan_inodes,
so there is no issue we describe above, we'd better recover to move
the grab operation to original place for readability.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
With cost-benifit method, background gc will consider old section with
fewer valid blocks as candidate victim, these old blocks in section will
be treated as cold data, and laterly will be moved into cold segment.
But if the gcing page is attached by user through buffered or mmaped
write, we should reset the page as non-cold one, because this page may
have more opportunity for further updating.
So fix to add clearing code for the missed 'mmap' case.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When background gc is off, the only way to trigger gc is executing
a force gc in some operations who wants to grab space in disk.
The executing condition is limited: to execute force gc, we should
wait for the time when there is almost no more free section for LFS
allocation. This seems not reasonable for our user who wants to
control triggering gc by himself.
This patch introduces F2FS_IOC_GARBAGE_COLLECT interface for
triggering garbage collection by using ioctl. It provides our users
one more option to trigger gc.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch moves extent cache related code from data.c into extent_cache.c
since extent cache is independent feature, and its codes are not relate to
others in data.c, it's better for us to maintain them in separated place.
There is no functionality change, but several small coding style fixes
including:
* rename __drop_largest_extent to f2fs_drop_largest_extent for exporting;
* rename misspelled word 'untill' to 'until';
* remove unneeded 'return' in the end of f2fs_destroy_extent_tree().
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Since only parts of extents longer than F2FS_MIN_EXTENT_LEN will
be kept in extent cache after split, extents already shorter than
F2FS_MIN_EXTENT_LEN don't need to try split at all.
Signed-off-by: Fan Li <fanofcode.li@samsung.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes to update page flag (e.g. Uptodate/cold flag) in
->write_begin.
Otherwise, page will be non-uptodate when we try to write entire
page, and cold data flag in page will not be clean when gced page
is being rewritten.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If an extent_tree entry has a zero reference count, we can drop it from the
cache in higher priority rather than currently referencing entries.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In ->writepages, we use writepages mutex lock to serialize all block
address allocation and page submitting pairs from different inodes.
This method makes our delayed dirty pages of one inode being written
continously as many as possible.
But there is one problem that we did not submit current cached bio in
protection region of writepages mutex lock, so there is a small chance
that we submit the one of other thread's as below, resulting in
splitting more bios.
thread 1 thread 2
->writepages
lock(writepages)
->write_cache_pages
unlock(writepages)
lock(writepages)
->write_cache_pages
->f2fs_submit_merged_bio
->writepage
unlock(writepages)
fs_mark-6535 [002] .... 2242.270230: f2fs_submit_write_bio: dev = (1,0), WRITE_SYNC, DATA, sector = 5766152, size = 524288
fs_mark-6536 [000] .... 2242.270361: f2fs_submit_write_bio: dev = (1,0), WRITE_SYNC, DATA, sector = 5767176, size = 4096
fs_mark-6536 [000] .... 2242.270370: f2fs_submit_write_bio: dev = (1,0), WRITE_SYNC, NODE, sector = 8138112, size = 4096
fs_mark-6535 [002] .... 2242.270776: f2fs_submit_write_bio: dev = (1,0), WRITE_SYNC, DATA, sector = 5767184, size = 516096
This may really increase time of block layer works, and may cause
larger IO lantency.
This patch moves the submitting operation into region of writepages
mutex lock to avoid bio splits when concurrently writebacking is
intensive.
my test environment: virtual machine,
intel cpu i5 2500, 8GB size memory, 4GB size ramdisk
time fs_mark -t 16 -L 1 -s 524288 -S 1 -d /mnt/f2fs/
before:
real 0m4.244s
user 0m0.088s
sys 0m12.336s
after:
real 0m3.822s
user 0m0.072s
sys 0m10.760s
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When testing with fs_mark, some blocks were written out as cold
data which were mixed with warm data, resulting in splitting more
bios.
This is because fs_mark will create file with random filename as
below:
559551ee~~~~~~~~15Z29OCC05JCKQP60JQ42MKV
559551ee~~~~~~~~NZAZ6X8OA8LHIIP6XD0L58RM
559551ef~~~~~~~~B15YDSWAK789HPSDZKYTW6WM
559551f1~~~~~~~~2DAE5DPS79785BUNTFWBEMP3
559551f1~~~~~~~~1MYDY0BKSQCJPI32Q8C514RM
559551f1~~~~~~~~YQOTMAOMN5CVRFOUNI026MP4
559551f3~~~~~~~~1WF42LPRTQJNPPGR3EINKMPE
559551f3~~~~~~~~8Y2NRK7CEPPAA02LY936PJPG
They are regarded as cold file since their filename are ended with
multimedia files' extension, but this should be wrong as we only
match the extension of filename, not the whole one.
In this patch, we try to fix the format of multimedia filename to:
"filename + '.' + extension", then we set cold file only its
filename matches the format.
So after this change, it will reduce the probability we set the
wrong cold file, also it helps a little for fs_mark's performance
on f2fs.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This makes the function check_dnode have a return type of bool
due to this particular function only ever returning either one
or zero as its return value and changes the name of the function
to is_alive in order to better explain this function's intended
work of checking if a dnode is still in use by the filesystem.
Signed-off-by: Nicholas Krause <xerofoify@gmail.com>
[Jaegeuk Kim: change the return value check for the renamed function]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Because of the extent shrinker or other -ENOMEM scenarios, it cannot guarantee
that the largest extent would be cached in the tree all the time.
Instead of relying on extent_tree, we can simply check the cached one in extent
tree accordingly.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We don't need to handle the duplicate extent information.
The integrated rule is:
- update on-disk extent with largest one tracked by in-memory extent_cache
- destroy extent_tree for the truncation case
- drop per-inode extent_cache by shrinker
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces a shrinker targeting to reduce memory footprint consumed
by a number of in-memory f2fs data structures.
In addition, it newly adds:
- sbi->umount_mutex to avoid data races on shrinker and put_super
- sbi->shruinker_run_no to not revisit objects
Note that the basic implementation was copied from fs/ubifs/shrinker.c
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch relocates cached_en not only to be covered by spin_lock, but also
to set once after checking out completely.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously, f2fs_update_extent_cache() updates in-memory extent_cache all the
time, and then finally preserves its up-to-date extent into on-disk one during
f2fs_evict_inode.
But, in the following scenario:
1. mount
2. open & write an extent X
3. f2fs_evict_inode; on-disk extent is X
4. open & update the extent X with Y
5. sync; trigger checkpoint
6. power-cut
after power-on, f2fs should serve extent Y, but we have an on-disk extent X.
This causes a failure on xfstests/311.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes wrong calculation on block address field when an extent is
split.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
For newly added fallocate types, it should convert inline_data before handling
block swapping.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Before iput is called, the inode number used by a bad inode can be reassigned
to other new inode, resulting in any abnormal behaviors on the new inode.
This should not happen for the new inode.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The write_checkpoint can update stat information, so we should destroy the stat
structure after it.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Dirty page can be exist in mapping of newly created symlink, but previously
we did not maintain the counting of dirty page for symlink like we maintained
for regular/directory, so the counting we lookuped should be wrong.
This patch adds missed dirty page counting for symlink to fix this issue.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The key_put() function tests whether its argument is NULL and then
returns immediately. Thus the test around the call is not needed.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Currently we have two different ways to signal an I/O error on a BIO:
(1) by clearing the BIO_UPTODATE flag
(2) by returning a Linux errno value to the bi_end_io callback
The first one has the drawback of only communicating a single possible
error (-EIO), and the second one has the drawback of not beeing persistent
when bios are queued up, and are not passed along from child to parent
bio in the ever more popular chaining scenario. Having both mechanisms
available has the additional drawback of utterly confusing driver authors
and introducing bugs where various I/O submitters only deal with one of
them, and the others have to add boilerplate code to deal with both kinds
of error returns.
So add a new bi_error field to store an errno value directly in struct
bio and remove the existing mechanisms to clean all this up.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull cgroup writeback support from Jens Axboe:
"This is the big pull request for adding cgroup writeback support.
This code has been in development for a long time, and it has been
simmering in for-next for a good chunk of this cycle too. This is one
of those problems that has been talked about for at least half a
decade, finally there's a solution and code to go with it.
Also see last weeks writeup on LWN:
http://lwn.net/Articles/648292/"
* 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits)
writeback, blkio: add documentation for cgroup writeback support
vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB
writeback: do foreign inode detection iff cgroup writeback is enabled
v9fs: fix error handling in v9fs_session_init()
bdi: fix wrong error return value in cgwb_create()
buffer: remove unusued 'ret' variable
writeback: disassociate inodes from dying bdi_writebacks
writeback: implement foreign cgroup inode bdi_writeback switching
writeback: add lockdep annotation to inode_to_wb()
writeback: use unlocked_inode_to_wb transaction in inode_congested()
writeback: implement unlocked_inode_to_wb transaction and use it for stat updates
writeback: implement [locked_]inode_to_wb_and_lock_list()
writeback: implement foreign cgroup inode detection
writeback: make writeback_control track the inode being written back
writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb()
mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use
writeback: implement memcg writeback domain based throttling
writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes
writeback: implement memcg wb_domain
writeback: update wb_over_bg_thresh() to use wb_domain aware operations
...
Pull f2fs updates from Jaegeuk Kim:
"New features:
- per-file encryption (e.g., ext4)
- FALLOC_FL_ZERO_RANGE
- FALLOC_FL_COLLAPSE_RANGE
- RENAME_WHITEOUT
Major enhancement/fixes:
- recovery broken superblocks
- enhance f2fs_trim_fs with a discard_map
- fix a race condition on dentry block allocation
- fix a deadlock during summary operation
- fix a missing fiemap result
.. and many minor bug fixes and clean-ups were done"
* tag 'for-f2fs-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (83 commits)
f2fs: do not trim preallocated blocks when truncating after i_size
f2fs crypto: add alloc_bounce_page
f2fs crypto: fix to handle errors likewise ext4
f2fs: drop the volatile_write flag only
f2fs: skip committing valid superblock
f2fs: setting discard option in parse_options()
f2fs: fix to return exact trimmed size
f2fs: support FALLOC_FL_INSERT_RANGE
f2fs: hide common code in f2fs_replace_block
f2fs: disable the discard option when device doesn't support
f2fs crypto: remove alloc_page for bounce_page
f2fs: fix a deadlock for summary page lock vs. sentry_lock
f2fs crypto: clean up error handling in f2fs_fname_setup_filename
f2fs crypto: avoid f2fs_inherit_context for symlink
f2fs crypto: do not set encryption policy for non-directory by ioctl
f2fs crypto: allow setting encryption policy once
f2fs crypto: check context consistent for rename2
f2fs: avoid duplicated code by reusing f2fs_read_end_io
f2fs crypto: use per-inode tfm structure
f2fs: recovering broken superblock during mount
...
When we perform generic/092 in xfstests, output is like below:
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
0: [0..10239]: data
0: [0..10239]: data
-1: [10240..20479]: unwritten
+1: [10240..14335]: unwritten
This is because with this testcase, we redefine the regulation for
truncate in perallocated space past i_size as below:
"There was some confused about what the fs was supposed to do when you
truncate at i_size with preallocated space past i_size. We decided on the
following things.
1) truncate(i_size) will trim all blocks past i_size.
2) truncate(x) where x > i_size will not trim all blocks past i_size.
"
This method is used in xfs, and then ext4/btrfs will follow the rule.
This patch fixes to follow the new rule for f2fs.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In recovery procedure for superblock, we try to write data of valid
superblock into invalid one for recovery, work should be finished here,
but then still we will write the valid one with its original data.
This operation is not needed. Let's skip doing this unnecessary work.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
For the first mount of f2fs image with realtime discard option, we will
disable discard option if device is not supported, but for remount
operation, our discard option can still be set, this should be avoided.
This patch moves configuring of discard option to parse_options() to fix
this issue.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Now, we add all the candidates for trim commands and then finally issue
discard commands.
So, we should count the trimmed size in back-end.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
FALLOC_FL_INSERT_RANGE flag for ->fallocate was introduced in commit
dd46c78778 ("fs: Add support FALLOC_FL_INSERT_RANGE for fallocate").
The effect of FALLOC_FL_INSERT_RANGE command is the opposite of
FALLOC_FL_COLLAPSE_RANGE, if this command was performed, all data from
offset to EOF in our file will be shifted to right as given length, and
then range [offset, offset + length] becomes a hole.
This command is useful for our user who wants to add some data in the
middle of the file, for example: video/music editor will insert a keyframe
in specified position of media file, with this command we can easily create
a hole for inserting without removing original data.
This patch introduces f2fs_insert_range() to support FALLOC_FL_INSERT_RANGE.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Yuan Zhong <yuan.mark.zhong@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch clean up codes through:
1.rename f2fs_replace_block to __f2fs_replace_block().
2.introduce new f2fs_replace_block() to include __f2fs_replace_block()
and some common related codes around __f2fs_replace_block().
Then, newly introduced function f2fs_replace_block can be used by
following patch.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
With the planned cgroup writeback support, backing-dev related
declarations will be more widely used across block and cgroup;
unfortunately, including backing-dev.h from include/linux/blkdev.h
makes cyclic include dependency quite likely.
This patch separates out backing-dev-defs.h which only has the
essential definitions and updates blkdev.h to include it. c files
which need access to more backing-dev details now include
backing-dev.h directly. This takes backing-dev.h off the common
include dependency chain making it a lot easier to use it across block
and cgroup.
v2: fs/fat build failure fixed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
and the role of the separation is unclear. For cgroup support for
writeback IOs, a bdi will be updated to host multiple wb's where each
wb serves writeback IOs of a different cgroup on the bdi. To achieve
that, a wb should carry all states necessary for servicing writeback
IOs for a cgroup independently.
This patch moves bandwidth related fields from backing_dev_info into
bdi_writeback.
* The moved fields are: bw_time_stamp, dirtied_stamp, written_stamp,
write_bandwidth, avg_write_bandwidth, dirty_ratelimit,
balanced_dirty_ratelimit, completions and dirty_exceeded.
* writeback_chunk_size() and over_bground_thresh() now take @wb
instead of @bdi.
* bdi_writeout_fraction(bdi, ...) -> wb_writeout_fraction(wb, ...)
bdi_dirty_limit(bdi, ...) -> wb_dirty_limit(wb, ...)
bdi_position_ration(bdi, ...) -> wb_position_ratio(wb, ...)
bdi_update_writebandwidth(bdi, ...) -> wb_update_write_bandwidth(wb, ...)
[__]bdi_update_bandwidth(bdi, ...) -> [__]wb_update_bandwidth(wb, ...)
bdi_{max|min}_pause(bdi, ...) -> wb_{max|min}_pause(wb, ...)
bdi_dirty_limits(bdi, ...) -> wb_dirty_limits(wb, ...)
* Init/exits of the relocated fields are moved to bdi_wb_init/exit()
respectively. Note that explicit zeroing is dropped in the process
as wb's are cleared in entirety anyway.
* As there's still only one bdi_writeback per backing_dev_info, all
uses of bdi->stat[] are mechanically replaced with bdi->wb.stat[]
introducing no behavior changes.
v2: Typo in description fixed as suggested by Jan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Current f2fs check the whether the blk device can support discard.
However, the code will cause the discard option cannot be enabled.
Because the clear_opt(sbi, DISCARD) will be invoked forever.
This patch can fix this issue.
Jaegeuk Kim:
The original patch was intended to disable the discard option when device
does not support trim command.
Rather than remaining the buggy patch, let's replace with this patch as
an integrated one.
Signed-off-by: Chenxi Mao <chenxi.mao2013@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We don't need to call alloc_page() prior to mempool_alloc(), since the
mempool_alloc() calls alloc_page() internally.
And, if __GFP_WAIT is set, it never fails on page allocation, so let's
give GFP_NOWAIT and handle ENOMEM by writepage().
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes to call f2fs_inherit_context twice for newly created symlink.
The original one is called by f2fs_add_link(), which invokes f2fs_setxattr.
If the second one is called again, f2fs_setxattr is triggered again with same
encryption index.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Encryption policy should only be set to an empty directory through ioctl,
This patch add a judgement condition to verify type of the target inode
to avoid incorrectly configuring for non-directory.
Additionally, remove unneeded inline data conversion since regular or symlink
file should not be processed here.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch add XATTR_CREATE flag in setxattr when setting encryption
context for inode. Without this flag the context could be set more than
once, this should never happen. So, fix it.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
For exchange rename, we should check context consistent of encryption
between new_dir and old_inode or old_dir and new_inode. Otherwise
inheritance of parent's encryption context will be broken.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
[Jaegeuk Kim: sync with ext4 approach]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch tries to clean up code because part code of f2fs_read_end_io
and mpage_end_io are the same, so it's better to merge and reuse them.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch applies the following ext4 patch:
ext4 crypto: use per-inode tfm structure
As suggested by Herbert Xu, we shouldn't allocate a new tfm each time
we read or write a page. Instead we can use a single tfm hanging off
the inode's crypt_info structure for all of our encryption needs for
that inode, since the tfm can be used by multiple crypto requests in
parallel.
Also use cmpxchg() to avoid races that could result in crypt_info
structure getting doubly allocated or doubly freed.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch recovers a broken superblock with the other valid one.
Signed-off-by: hujianyang <hujianyang@huawei.com>
[Jaegeuk Kim: reinitialize local variables in f2fs_fill_super for retrial]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
As the description of rename in manual, RENAME_WHITEOUT is a special operation
that only makes sense for overlay/union type filesystem.
When performing rename with RENAME_WHITEOUT, dst will be replace with src, and
meanwhile, a 'whiteout' will be create with name of src.
A "whiteout" is designed to be a char device with 0,0 device number, it has
specially meaning for stackable filesystem. In these filesystems, there are
multiple layers exist, and only top of these can be modified. So a whiteout
in top layer is used to hide a corresponding file in lower layer, as well
removal of whiteout will make the file appear.
Now in overlayfs, when we rename a file which is exist in lower layer, it
will be copied up to upper if it is not on upper layer yet, and then rename
it on upper layer, source file will be whiteouted to hide corresponding file
in lower layer at the same time.
So in upper layer filesystem, implementation of RENAME_WHITEOUT provide a
atomic operation for stackable filesystem to support rename operation.
There are multiple ways to implement RENAME_WHITEOUT in log of this commit:
7dcf5c3e45 ("xfs: add RENAME_WHITEOUT support") which pointed out by
Dave Chinner.
For now, we just try to follow the way that xfs/ext4 use.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Add a help function update_meta_page() to update meta page with specified
buffer.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Now page cache of meta inode is used by garbage collection for encrypted page,
it may contain random data, so we should zero it before issuing discard.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch splits f2fs_crypto_init/exit with two parts: base initialization and
memory allocation.
Firstly, f2fs module declares the base encryption memory pointers.
Then, allocating internal memories is done at the first encrypted inode access.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When encryption feature is enable, if we rmmod f2fs module,
we will encounter a stack backtrace reported in syslog:
"BUG: Bad page state in process rmmod pfn:aaf8a
page:f0f4f148 count:0 mapcount:129 mapping:ee2f4104 index:0x80
flags: 0xee2830a4(referenced|lru|slab|private_2|writeback|swapbacked|mlocked)
page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
bad because of flags:
flags: 0x2030a0(lru|slab|private_2|writeback|mlocked)
Modules linked in: f2fs(O-) fuse bnep rfcomm bluetooth dm_crypt binfmt_misc snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm
snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device joydev ppdev mac_hid lp hid_generic i2c_piix4
parport_pc psmouse snd serio_raw parport soundcore ext4 jbd2 mbcache usbhid hid e1000 [last unloaded: f2fs]
CPU: 1 PID: 3049 Comm: rmmod Tainted: G B O 4.1.0-rc3+ #10
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
00000000 00000000 c0021eb4 c15b7518 f0f4f148 c0021ed8 c112e0b7 c1779174
c9b75674 000aaf8a 01b13ce1 c17791a4 f0f4f148 ee2830a4 c0021ef8 c112e3c3
00000000 f0f4f148 c0021f34 f0f4f148 ee2830a4 ef9f0000 c0021f20 c112fdf8
Call Trace:
[<c15b7518>] dump_stack+0x41/0x52
[<c112e0b7>] bad_page.part.72+0xa7/0x100
[<c112e3c3>] free_pages_prepare+0x213/0x220
[<c112fdf8>] free_hot_cold_page+0x28/0x120
[<c1073380>] ? try_to_wake_up+0x2b0/0x2b0
[<c112ff15>] __free_pages+0x25/0x30
[<c112c4fd>] mempool_free_pages+0xd/0x10
[<c112c5f1>] mempool_free+0x31/0x90
[<f0f441cf>] f2fs_exit_crypto+0x6f/0xf0 [f2fs]
[<f0f456c4>] exit_f2fs_fs+0x23/0x95f [f2fs]
[<c10c30e0>] SyS_delete_module+0x130/0x180
[<c11556d6>] ? vm_munmap+0x46/0x60
[<c15bd888>] sysenter_do_call+0x12/0x12"
The reason is that:
since commit 0827e645fd35
("f2fs crypto: shrink size of the f2fs_crypto_ctx structure") is merged,
some fields in f2fs_crypto_ctx structure are merged into a union as they
will never be used simultaneously in write path, read path or on free list.
In f2fs_exit_crypto, we traverse each crypto ctx from free list, in this
moment, our free_list field in union is valid, but still we will try to
release memory space which is pointed by other invalid field in union
structure for each ctx.
Then the error occurs, let's fix it with this patch.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes memory leak issue in error path of f2fs_fname_setup_filename().
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch integrates the below patch into f2fs.
"ext4 crypto: shrink size of the ext4_crypto_ctx structure
Some fields are only used when the crypto_ctx is being used on the
read path, some are only used on the write path, and some are only
used when the structure is on free list. Optimize memory use by using
a union."
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch integrates the below patch into f2fs.
"ext4 crypto: get rid of ci_mode from struct ext4_crypt_info
The ci_mode field was superfluous, and getting rid of it gets rid of
an unused hole in the structure."
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch integrates the below patch into f2fs.
"ext4 crypto: use slab caches
Use slab caches the ext4_crypto_ctx and ext4_crypt_info structures for
slighly better memory efficiency and debuggability."
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
As Hu reported, F2FS has a space leak problem, when conducting:
1) format a 4GB f2fs partition
2) dd a 3G file,
3) unlink it.
So, when doing f2fs_drop_inode(), we need to truncate data blocks
before skipping it.
We can also drop unused caches assigned to each inode.
Reported-by: hujianyang <hujianyang@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The return was not indented far enough so it looked like it was supposed
to go with the other if statement.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
A bug fix to the debug output extended the type of some local
variables to 64-bit, which now causes the kernel to fail building
because of missing 64-bit division functions:
ERROR: "__aeabi_uldivmod" [fs/f2fs/f2fs.ko] undefined!
In the kernel, we have to use div_u64 or do_div to do this,
in order to annotate that this is an expensive operation.
As the function is only called for debug out, we know this
is not performance critical, so it is safe to use div_u64.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: d1f85bd38db19 ("f2fs: avoid value overflow in showing current status")
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
introduce compat_ioctl to regular files, but doesn't add this
functionality to f2fs_dir_operations.
While running a 32-bit busybox, I met an error like this:
(A is a directory)
chattr: reading flags on A: Inappropriate ioctl for device
This patch copies compat_ioctl from f2fs_file_operations and
fix this problem.
Signed-off-by: hujianyang <hujianyang@huawei.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
I use f2fs filesystem with /data partition on my Android phone
by the default mount options. When I remount /data in order to
adding discard option to run some benchmarks, I find the default
options such as background_gc, user_xattr and acl turned off.
So I introduce a function named default_options in super.c. It do
some default setting, and both mount and remount operations will
call this function to complete default setting.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
No matter what the key is valid or not, readdir shows the dir entries correctly.
So, lookup should not failed.
But, we expect further accesses should be denied from open, rename, link, and so
on.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
1. mount $mnt
2. cp data $mnt/
3. umount $mnt
4. log out
5. log in
6. cat $mnt/data
-> panic, due to no i_crypt_info.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch implements encryption support for symlink.
Signed-off-by: Uday Savagaonkar <savagaon@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds a bit flag to indicate whether or not i_name in the inode
is encrypted.
If this name is encrypted, we can't do recover_dentry during roll-forward.
So, f2fs_sync_file() needs to do checkpoint, if this will be needed in future.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch implements filename encryption support for f2fs_lookup.
Note that, f2fs_find_entry should be outside of f2fs_(un)lock_op().
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds encryption support in read and write paths.
Note that, in f2fs, we need to consider cleaning operation.
In cleaning procedure, we must avoid encrypting and decrypting written blocks.
So, this patch implements move_encrypted_block().
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch activates the following APIs for encryption support.
The rules quoted by ext4 are:
- An unencrypted directory may contain encrypted or unencrypted files
or directories.
- All files or directories in a directory must be protected using the
same key as their containing directory.
- Encrypted inode for regular file should not have inline_data.
- Encrypted symlink and directory may have inline_data and inline_dentry.
This patch activates the following APIs.
1. f2fs_link : validate context
2. f2fs_lookup : ''
3. f2fs_rename : ''
4. f2fs_create/f2fs_mkdir : inherit its dir's context
5. f2fs_direct_IO : do buffered io for regular files
6. f2fs_open : check encryption info
7. f2fs_file_mmap : ''
8. f2fs_setattr : ''
9. f2fs_file_write_iter : '' (Called by sys_io_submit)
10. f2fs_fallocate : do not support fcollapse
11. f2fs_evict_inode : free_encryption_info
Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds filename encryption infra.
Most of codes are copied from ext4 part, but changed to adjust f2fs
directory structure.
Signed-off-by: Uday Savagaonkar <savagaon@google.com>
Signed-off-by: Ildar Muslukhov <ildarm@google.com>
Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch copies from encrypt_key.c in ext4, and modifies for f2fs.
Use GFP_NOFS, since _f2fs_get_encryption_info is called under f2fs_lock_op.
Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Ildar Muslukhov <muslukhovi@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Most of parts were copied from ext4, except:
- add f2fs_restore_and_release_control_page which returns control page and
restore control page
- remove ext4_encrypted_zeroout()
- remove sbi->s_file_encryption_mode & sbi->s_dir_encryption_mode
- add f2fs_end_io_crypto_work for mpage_end_io
Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Ildar Muslukhov <ildarm@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds encryption policy and password salt support through ioctl
implementation.
It adds three ioctls:
F2FS_IOC_SET_ENCRYPTION_POLICY,
F2FS_IOC_GET_ENCRYPTION_POLICY,
F2FS_IOC_GET_ENCRYPTION_PWSALT, which use xattr operations.
Note that, these definition and codes are taken from ext4 crypto support.
For f2fs, xattr operations and on-disk flags for superblock and inode were
changed.
Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Ildar Muslukhov <muslukhovi@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds f2fs encryption config.
This patch integrates:
"ext4 crypto: require CONFIG_CRYPTO_CTR if ext4 encryption is enabled
On arm64 this is apparently needed for CTS mode to function correctly.
Otherwise attempts to use CTS return ENOENT."
Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes overflow when do cat /sys/kernel/debug/f2fs/status.
If a section is relatively large, dist value can be overflowed.
Reported-by: Yossi Goldfill <ygoldfill@radianmemory.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Now, FALLOC_FL_ZERO_RANGE flag in ->fallocate is supported in ext4/xfs.
In commit, the semantics of this flag is descripted as following:"
1) Make sure that both offset and len are block size aligned.
2) Update the i_size of inode by len bytes.
3) Compute the file's logical block number against offset. If the computed
block number is not the starting block of the extent, split the extent
such that the block number is the starting block of the extent.
4) Shift all the extents which are lying between
[offset, last allocated extent] towards right by len bytes. This step
will make a hole of len bytes at offset."
This patch implements fallocate's FALLOC_FL_ZERO_RANGE for f2fs.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Now, FALLOC_FL_COLLAPSE_RANGE flag in ->fallocate is supported in ext4/xfs.
In commit, the semantics of this flag is descripted as following:"
1) It collapses the range lying between offset and length by removing any
data blocks which are present in this range and than updates all the
logical offsets of extents beyond "offset + len" to nullify the hole
created by removing blocks. In short, it does not leave a hole.
2) It should be used exclusively. No other fallocate flag in combination.
3) Offset and length supplied to fallocate should be fs block size aligned
in case of xfs and ext4.
4) Collaspe range does not work beyond i_size."
This patch implements fallocate's FALLOC_FL_COLLAPSE_RANGE for f2fs.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Introduce a generic function replace_block base on recover_data_page,
and export it. So with it we can operate file's meta data which is in
CP/SSA area when we invoke fallocate with FALLOC_FL_COLLAPSE_RANGE
flag.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In set_node_addr, we try to lookup cached nat entry of inode and then
set flag in it.
But previously in this function, we have already grabbed nat entry with
current node id, if the node id is the same as the one of inode, we
do not need to lookup it in cache again.
So this patch adds condition judgment for reducing unneeded lookup.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Remove f2fs_make_empty() declaration, since the main body of this function
is move into do_make_empty_dir() and the function is obsolete now.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch determines to issue discard commands by comparing given minlen and
the length of produced final candidates.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds a bitmap for discard issues from f2fs_trim_fs.
There-in rule is to issue discard commands only for invalidated blocks
after mount.
Once mount is done, f2fs_trim_fs trims out whole invalid area.
After ehn, it will not issue and discrads redundantly.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch removes spin_lock, since this is covered by f2fs_lock_op already.
And, we should avoid to use page operations inside spin_lock.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch splits find_data_page as follows.
1. f2fs_gc
- use get_read_data_page() with read only
2. find_in_level
- use find_data_page without locked page
3. truncate_partial_page
- In the case cache_only mode, just drop cached page.
- Ohterwise, use get_lock_data_page() and guarantee to truncate
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
There are two threads:
f2fs_delete_entry() get_new_data_page()
f2fs_reserve_block()
dn.blkaddr = XXX
lock_page(dentry_block)
truncate_hole()
dn.blkaddr = NULL
unlock_page(dentry_block)
lock_page(dentry_block)
fill the block from XXX address
add new dentries
unlock_page(dentry_block)
Later, f2fs_write_data_page() will truncate the dentry_block, since
its block address is NULL.
The reason for this was due to the wrong lock order.
In this case, we should do f2fs_reserve_block() after locking its dentry block.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds f2fs_sb_info and page pointers in f2fs_io_info structure.
With this change, we can reduce a lot of parameters for IO functions.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch implements f2fs_mpage_readpages for further optimization on
encryption support.
The basic code was taken from fs/mpage.c, and changed to be simple by adjusting
that block_size is equal to page_size in f2fs.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
a) instead of storing the symlink body (via nd_set_link()) and returning
an opaque pointer later passed to ->put_link(), ->follow_link() _stores_
that opaque pointer (into void * passed by address by caller) and returns
the symlink body. Returning ERR_PTR() on error, NULL on jump (procfs magic
symlinks) and pointer to symlink body for normal symlinks. Stored pointer
is ignored in all cases except the last one.
Storing NULL for opaque pointer (or not storing it at all) means no call
of ->put_link().
b) the body used to be passed to ->put_link() implicitly (via nameidata).
Now only the opaque pointer is. In the cases when we used the symlink body
to free stuff, ->follow_link() now should store it as opaque pointer in addition
to returning it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
has_fsynced_inode() has no other caller out of node.c, make it static.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In the punch_hole(), if offset bigger than inode size, it returns SUCCESS.
Then f2fs_fallocate() will update time and dirty mark.
In that case, inode has not been modified actually.
So I have added offset check routine that prevent to call the punch_hole().
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Export is_valid_blkaddr() and use it to replace some codes for readability.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Our f2fs_acl_create is copied from posix_acl_create in ./fs/posix_acl.c and
modified to avoid deadlock bug when inline_dentry feature is enabled.
Dan Carpenter rewrites posix_acl_create in commit 2799563b281f
("fs/posix_acl.c: make posix_acl_create() safer and cleaner") to make this
function more safer, so that we can avoid potential bug in its caller,
especially for ocfs2.
Let's back port the patch to f2fs.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The page_follow_link_light returns NULL and its error pointer was remained
in nd->path.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This reports performance regression by Yuanhan Liu.
The basic idea was to reduce one-point mutex, but it turns out this causes
another contention like context swithes.
https://lkml.org/lkml/2015/4/21/11
Until finishing the analysis on this issue, I'd like to revert this for a while.
This reverts commit 78373b7319.
Pull fourth vfs update from Al Viro:
"d_inode() annotations from David Howells (sat in for-next since before
the beginning of merge window) + four assorted fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
RCU pathwalk breakage when running into a symlink overmounting something
fix I_DIO_WAKEUP definition
direct-io: only inc/dec inode->i_dio_count for file systems
fs/9p: fix readdir()
VFS: assorted d_backing_inode() annotations
VFS: fs/inode.c helpers: d_inode() annotations
VFS: fs/cachefiles: d_backing_inode() annotations
VFS: fs library helpers: d_inode() annotations
VFS: assorted weird filesystems: d_inode() annotations
VFS: normal filesystems (and lustre): d_inode() annotations
VFS: security/: d_inode() annotations
VFS: security/: d_backing_inode() annotations
VFS: net/: d_inode() annotations
VFS: net/unix: d_backing_inode() annotations
VFS: kernel/: d_inode() annotations
VFS: audit: d_backing_inode() annotations
VFS: Fix up some ->d_inode accesses in the chelsio driver
VFS: Cachefiles should perform fs modifications on the top layer only
VFS: AF_UNIX sockets should call mknod on the top layer only
Pull f2fs updates from Jaegeuk Kim:
"New features:
- in-memory extent_cache
- fs_shutdown to test power-off-recovery
- use inline_data to store symlink path
- show f2fs as a non-misc filesystem
Major fixes:
- avoid CPU stalls on sync_dirty_dir_inodes
- fix some power-off-recovery procedure
- fix handling of broken symlink correctly
- fix missing dot and dotdot made by sudden power cuts
- handle wrong data index during roll-forward recovery
- preallocate data blocks for direct_io
... and a bunch of minor bug fixes and cleanups"
* tag 'for-f2fs-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (71 commits)
f2fs: pass checkpoint reason on roll-forward recovery
f2fs: avoid abnormal behavior on broken symlink
f2fs: flush symlink path to avoid broken symlink after POR
f2fs: change 0 to false for bool type
f2fs: do not recover wrong data index
f2fs: do not increase link count during recovery
f2fs: assign parent's i_mode for empty dir
f2fs: add F2FS_INLINE_DOTS to recover missing dot dentries
f2fs: fix mismatching lock and unlock pages for roll-forward recovery
f2fs: fix sparse warnings
f2fs: limit b_size of mapped bh in f2fs_map_bh
f2fs: persist system.advise into on-disk inode
f2fs: avoid NULL pointer dereference in f2fs_xattr_advise_get
f2fs: preallocate fallocated blocks for direct IO
f2fs: enable inline data by default
f2fs: preserve extent info for extent cache
f2fs: initialize extent tree with on-disk extent info of inode
f2fs: introduce __{find,grab}_extent_tree
f2fs: split set_data_blkaddr from f2fs_update_extent_cache
f2fs: enable fast symlink by utilizing inline data
...
This patch adds CP_RECOVERY to remain recovery information for checkpoint.
And, it makes sure writing checkpoint in this case.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When f2fs_symlink was triggered and checkpoint was done before syncing its
link path, f2fs can get broken symlink like "xxx -> \0\0\0".
This incurs abnormal path_walk by VFS.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch tries to avoid broken symlink case after POR in best effort.
This results in performance regression.
But, if f2fs has inline_data and the target path is under 3KB-sized long,
the page would be stored in its inode_block, so that there would be no
performance regression.
Note that, if user wants to keep this file atomically, it needs to trigger
dir->fsync.
And, there is still a hole to produce broken symlink.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
in the f2fs_fill_super function, variable "retry" is bool type
i think that it should be set as false.
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
that's the bulk of filesystem drivers dealing with inodes of their own
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The rw parameter to direct_IO is redundant with iov_iter->type, and
treated slightly differently just about everywhere it's used: some users
do rw & WRITE, and others do rw == WRITE where they should be doing a
bitwise check. Simplify this with the new iov_iter_rw() helper, which
always returns either READ or WRITE.
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Most filesystems call through to these at some point, so we'll start
here.
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
All places outside of core VFS that checked ->read and ->write for being NULL or
called the methods directly are gone now, so NULL {read,write} with non-NULL
{read,write}_iter will do the right thing in all cases.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
During the roll-forward recovery, if we found a new data index written fsync
lastly, we need to recover new block address.
But, if that address was corrupted, we should not recover that.
Otherwise, f2fs gets kernel panic from:
In check_index_in_prev_nodes(),
sentry = get_seg_entry(sbi, segno);
--------------------------> out-of-range segno.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If there are multiple fsynced dnodes having a dent flag, roll-forward routine
sets FI_INC_LINK for their inode, and recovery_dentry increases its link count
accordingly.
That results in normal file having a link count as 2, so we can't unlink those
files.
This was added to handle several inode blocks having same inode number with
different directory paths.
But, current f2fs doesn't replay all of path changes and only recover its dentry
for the last fsynced inode block.
So, there is no reason to do this.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If f2fs was corrupted with missing dot dentries, it needs to recover them after
fsck.f2fs detection.
The underlying precedure is:
1. The fsck.f2fs remains F2FS_INLINE_DOTS flag in directory inode, if it detects
missing dot dentries.
2. When f2fs looks up the corrupted directory, it triggers f2fs_add_link with
proper inode numbers and their dot and dotdot names.
3. Once f2fs recovers the directory without errors, it removes F2FS_INLINE_DOTS
finally.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes the below warning.
sparse warnings: (new ones prefixed by >>)
>> fs/f2fs/inode.c:56:23: sparse: restricted __le32 degrades to integer
>> fs/f2fs/inode.c:56:52: sparse: restricted __le32 degrades to integer
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Map bh over max size which caller defined is not needed, limit it in
f2fs_map_bh.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes to dirty inode for persisting i_advise of f2fs inode info into
on-disk inode if user sets system.advise through setxattr. Otherwise the new
value will be lost.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Normally, due to DIO_SKIP_HOLES flag is set by default, blockdev_direct_IO in
f2fs_direct_IO tries to skip DIO in holes when writing inside i_size, this
makes us falling back to buffered IO which shows lower performance.
So in commit 59b802e5a4 ("f2fs: allocate data blocks in advance for
f2fs_direct_IO"), we improve perfromance by allocating data blocks in advance
if we meet holes no matter in i_size or not, since with it we can avoid falling
back to buffered IO.
But we forget to consider for unwritten fallocated block in this commit.
This patch tries to fix it for fallocate case, this helps to improve
performance.
Test result:
Storage info: sandisk ultra 64G micro sd card.
touch /mnt/f2fs/file
truncate -s 67108864 /mnt/f2fs/file
fallocate -o 0 -l 67108864 /mnt/f2fs/file
time dd if=/dev/zero of=/mnt/f2fs/file bs=1M count=64 conv=notrunc oflag=direct
Time before applying the patch:
67108864 bytes (67 MB) copied, 36.16 s, 1.9 MB/s
real 0m36.162s
user 0m0.000s
sys 0m0.180s
Time after applying the patch:
67108864 bytes (67 MB) copied, 27.7776 s, 2.4 MB/s
real 0m27.780s
user 0m0.000s
sys 0m0.036s
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Enable inline_data feature by default since it brings us better
performance and space utilization and now has already stable.
Add another option noinline_data to disable it during mount.
Suggested-by: Jaegeuk Kim <jaegeuk@kernel.org>
Suggested-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch tries to preserve last extent info in extent tree cache into on-disk
inode, so this can help us to reuse the last extent info next time for
performance.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
With normal extent info cache, we records largest extent mapping between logical
block and physical block into extent info, and we persist extent info in on-disk
inode.
When we enable extent tree cache, if extent info of on-disk inode is exist, and
the extent is not a small fragmented mapping extent. We'd better to load the
extent info into extent tree cache when inode is loaded. By this way we can have
more chance to hit extent tree cache rather than taking more time to read dnode
page for block address.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces __{find,grab}_extent_tree for reusing by following
patches.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Split __set_data_blkaddr from f2fs_update_extent_cache for readability.
Additionally rename __set_data_blkaddr to set_data_blkaddr for exporting.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Fast symlink can utilize inline data flow to avoid using any
i_addr region, since we need to handle many cases such as
truncation, roll-forward recovery, and fsck/dump tools.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch is to avoid some punch_hole overhead when releasing volatile data.
If volatile data was not written yet, we just can make the first page as zero.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch removes wrong f2fs_bug_on in truncate_inline_inode.
When there is no space, it can happen a corner case where i_isze is over
MAX_INLINE_SIZE while its inode is still inline_data.
The scenario is
1. write small data into file #A.
2. fill the whole partition to 100%.
3. truncate 4096 on file #A.
4. write data at 8192 offset.
--> f2fs_write_begin
-> -ENOSPC = f2fs_convert_inline_page
-> f2fs_write_failed
-> truncate_blocks
-> truncate_inline_inode
BUG_ON, since i_size is 4096.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously, f2fs_write_data_pages has a mutex, sbi->writepages, to serialize
data writes to maximize write bandwidth, while sacrificing multi-threads
performance.
Practically, however, multi-threads environment is much more important for
users. So this patch tries to remove the mutex.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch modifies to call set_buffer_new, if new blocks are allocated.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch tries to set SBI_NEED_FSCK flag into sbi only when we fail to recover
in fill_super, so we could skip fscking image when we fail to fill super for
other reason.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In the following call stack, f2fs changes the bitmap for dirty segments and # of
dirty sentries without grabbing sit_i->sentry_lock.
This can result in mismatch on bitmap and # of dirty sentries, since if there
are some direct_io operations.
In allocate_data_block,
- __allocate_new_segments
- mutex_lock(&curseg->curseg_mutex);
- s_ops->allocate_segment
- new_curseg/change_curseg
- reset_curseg
- __set_sit_entry_type
- __mark_sit_entry_dirty
- set_bit(dirty_sentries_bitmap)
- dirty_sentries++;
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In __allocate_data_blocks, we should check current blkaddr which is located at
ofs_in_node of dnode page instead of checking first blkaddr all the time.
Otherwise we can only allocate one blkaddr in each dnode page. Fix it.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously if inode is with inline data, we will try to invalid partial inline
data in page #0 when we truncate size of inode in truncate_partial_data_page().
And then we set page #0 to dirty, after this we can synchronize inode page with
page #0 at ->writepage().
But sometimes we will fail to operate page #0 in truncate_partial_data_page()
due to below reason:
a) if offset is zero, we will skip setting page #0 to dirty.
b) if page #0 is not uptodate, we will fail to update it as it has no mapping
data.
So with following operations, we will meet recent data which should be
truncated.
1.write inline data to file
2.sync first data page to inode page
3.truncate file size to 0
4.truncate file size to max_inline_size
5.echo 1 > /proc/sys/vm/drop_caches
6.read file --> meet original inline data which is remained in inode page.
This patch renames truncate_inline_data() to truncate_inline_inode() for code
readability, then use truncate_inline_inode() to truncate inline data in inode
page in truncate_blocks() and truncate page #0 in truncate_partial_data_page()
for fixing.
v2:
o truncate partially #0 page in truncate_partial_data_page to avoid keeping
old data in #0 page.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Our f2fs_acl_create is copied and modified from posix_acl_create to avoid
deadlock bug when inline_dentry feature is enabled.
Now, we got reference leaks in posix_acl_create, and this has been fixed in
commit fed0b588be ("posix_acl: fix reference leaks in posix_acl_create")
by Omar Sandoval.
https://lkml.org/lkml/2015/2/9/5
Let's fix this issue in f2fs_acl_create too.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Reviewed-by: Changman Lee <cm224.lee@ssamsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When lookuping for creating, we will try to record the level of current dentry
hash table if current dentry has enough contiguous slots for storing name of new
file which will be created later, this can save our lookup time when add a link
into parent dir.
But currently in find_target_dentry, our current length of contiguous free slots
is not calculated correctly. This make us leaving some holes in dentry block
occasionally, it wastes our space of dentry block.
Let's refactor the lookup flow for max slots as following to fix this issue:
a) increase max_len if current slot is free;
b) update max_slots with max_len if max_len is larger than max_slots;
c) reset max_len to zero if current slot is not free.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
nm_i->nat_tree_lock is used to sync both the operations of nat entry
cache tree and nat set cache tree, however, it isn't held when flush
nat entries during checkpoint which lead to potential race, this patch
fix it by holding the lock when gang lookup nat set cache and delete
item from nat set cache.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Through each macro, we can read the meaning easily.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Remove the unnecessary condition judgment, because
'max_slots' has been initialized to '0' at the beginging
of the function, as following:
if (max_slots)
*max_slots = 0;
Signed-off-by: Yuan Zhong <yuan.mark.zhong@samsung.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The function 'find_in_inline_dir()' contain 'res_page'
as an argument. So, we should initiaize 'res_page' before
this function.
Signed-off-by: Yuan Zhong <yuan.mark.zhong@samsung.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In __set_free we will check whether all segment are free in one section
when free one segment, in order to set section to free status. But the
searching region of segmap is from start segno to last segno of main
area, it's not necessary. So let's just only check all segment bitmap
of target section.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
extent tree/node slab cache is created during f2fs insmod,
how, it isn't destroyed during f2fs rmmod, this patch fix
it by destroy extent tree/node slab cache once rmmod f2fs.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The f2fs has been shipped on many smartphone devices during a couple of years.
So, it is worth to relocate Kconfig into main page from misc filesystems for
developers to choose it more easily.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If inode has inline_data, it should report -ENOENT when accessing out-of-bound
region.
This is used by f2fs_fiemap which treats -ENOENT with no error.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When fsync is done through checkpoint, previous f2fs missed to clear append
and update flag. This patch fixes to clear them.
This was originally catched by Changman Lee before.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch is for looking into gc performance of f2fs in detail.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
[Jaegeuk Kim: fix build errors]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Now in f2fs, we share functions and structures for batch mode and real-time mode
discard. For real-time mode discard, in shared function add_discard_addrs, we
will use uninitialized trim_minlen in struct cp_control to compare with length
of contiguous free blocks to decide whether skipping discard fragmented freespace
or not, this makes us ignore small discard sometimes. Fix it.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Reviewed-by : Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In a preempt-off enviroment a alot of FS activity (write/delete) I run
into a CPU stall:
| NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [kworker/u2:2:59]
| Modules linked in:
| CPU: 0 PID: 59 Comm: kworker/u2:2 Tainted: G W 3.19.0-00010-g10c11c51ffed #153
| Workqueue: writeback bdi_writeback_workfn (flush-179:0)
| task: df230000 ti: df23e000 task.ti: df23e000
| PC is at __submit_merged_bio+0x6c/0x110
| LR is at f2fs_submit_merged_bio+0x74/0x80
…
| [<c00085c4>] (gic_handle_irq) from [<c0012e84>] (__irq_svc+0x44/0x5c)
| Exception stack(0xdf23fb48 to 0xdf23fb90)
| fb40: deef3484 ffff0001 ffff0001 00000027 deef3484 00000000
| fb60: deef3440 00000000 de426000 deef34ec deefc440 df23fbb4 df23fbb8 df23fb90
| fb80: c02191f0 c0218fa0 60000013 ffffffff
| [<c0012e84>] (__irq_svc) from [<c0218fa0>] (__submit_merged_bio+0x6c/0x110)
| [<c0218fa0>] (__submit_merged_bio) from [<c02191f0>] (f2fs_submit_merged_bio+0x74/0x80)
| [<c02191f0>] (f2fs_submit_merged_bio) from [<c021624c>] (sync_dirty_dir_inodes+0x70/0x78)
| [<c021624c>] (sync_dirty_dir_inodes) from [<c0216358>] (write_checkpoint+0x104/0xc10)
| [<c0216358>] (write_checkpoint) from [<c021231c>] (f2fs_sync_fs+0x80/0xbc)
| [<c021231c>] (f2fs_sync_fs) from [<c0221eb8>] (f2fs_balance_fs_bg+0x4c/0x68)
| [<c0221eb8>] (f2fs_balance_fs_bg) from [<c021e9b8>] (f2fs_write_node_pages+0x40/0x110)
| [<c021e9b8>] (f2fs_write_node_pages) from [<c00de620>] (do_writepages+0x34/0x48)
| [<c00de620>] (do_writepages) from [<c0145714>] (__writeback_single_inode+0x50/0x228)
| [<c0145714>] (__writeback_single_inode) from [<c0146184>] (writeback_sb_inodes+0x1a8/0x378)
| [<c0146184>] (writeback_sb_inodes) from [<c01463e4>] (__writeback_inodes_wb+0x90/0xc8)
| [<c01463e4>] (__writeback_inodes_wb) from [<c01465f8>] (wb_writeback+0x1dc/0x28c)
| [<c01465f8>] (wb_writeback) from [<c0146dd8>] (bdi_writeback_workfn+0x2ac/0x460)
| [<c0146dd8>] (bdi_writeback_workfn) from [<c003c3fc>] (process_one_work+0x11c/0x3a4)
| [<c003c3fc>] (process_one_work) from [<c003c844>] (worker_thread+0x17c/0x490)
| [<c003c844>] (worker_thread) from [<c0041398>] (kthread+0xec/0x100)
| [<c0041398>] (kthread) from [<c000ed10>] (ret_from_fork+0x14/0x24)
As it turns out, the code loops in sync_dirty_dir_inodes() and waits for
others to make progress but since it never leaves the CPU there is no
progress made. At the time of this stall, there is also a rm process
blocked:
| rm R running 0 1989 1774 0x00000000
| [<c047c55c>] (__schedule) from [<c00486dc>] (__cond_resched+0x30/0x4c)
| [<c00486dc>] (__cond_resched) from [<c047c8c8>] (_cond_resched+0x4c/0x54)
| [<c047c8c8>] (_cond_resched) from [<c00e1aec>] (truncate_inode_pages_range+0x1f0/0x5e8)
| [<c00e1aec>] (truncate_inode_pages_range) from [<c00e1fd8>] (truncate_inode_pages+0x28/0x30)
| [<c00e1fd8>] (truncate_inode_pages) from [<c00e2148>] (truncate_inode_pages_final+0x60/0x64)
| [<c00e2148>] (truncate_inode_pages_final) from [<c020c92c>] (f2fs_evict_inode+0x4c/0x268)
| [<c020c92c>] (f2fs_evict_inode) from [<c0137214>] (evict+0x94/0x140)
| [<c0137214>] (evict) from [<c01377e8>] (iput+0xc8/0x134)
| [<c01377e8>] (iput) from [<c01333e4>] (d_delete+0x154/0x180)
| [<c01333e4>] (d_delete) from [<c0129870>] (vfs_rmdir+0x114/0x12c)
| [<c0129870>] (vfs_rmdir) from [<c012d644>] (do_rmdir+0x158/0x168)
| [<c012d644>] (do_rmdir) from [<c012dd90>] (SyS_unlinkat+0x30/0x3c)
| [<c012dd90>] (SyS_unlinkat) from [<c000ec40>] (ret_fast_syscall+0x0/0x4c)
As explained by Jaegeuk Kim:
|This inode is the directory (c.f., do_rmdir) causing a infinite loop on
|sync_dirty_dir_inodes.
|The sync_dirty_dir_inodes tries to flush dirty dentry pages, but if the
|inode is under eviction, it submits bios and do it again until eviction
|is finished.
This patch adds a cond_resched() (as suggested by Jaegeuk) after a BIO
is submitted so other thread can make progress.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
[Jaegeuk Kim: change fs/f2fs to f2fs in subject as naming convention]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
cp_payload is introduced for sit bitmap to support large volume, and it is
just after the block of f2fs_checkpoint + nat bitmap, so the first segment
should include F2FS_CP_PACKS + NR_CURSEG_TYPE + cp_payload + orphan blocks.
However, current max orphan inodes calculation don't consider cp_payload,
this patch fix it by reducing the number of cp_payload from total blocks of
the first segment when calculate max orphan inodes.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Don't need to collect dirty sit entries and flush sit journal to sit
entries when there's no dirty sit entries. This patch check dirty_sentries
earlier just like flush_nat_entries.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
block operations is used to flush all dirty node and dentry blocks in
the page cache and suspend ordinary writing activities, however, there
are some facts such like cp error or mount read-only etc which lead to
block operations can't be invoked. Current trace point print block_ops
start premature even if block_ops doesn't have opportunity to execute.
This patch fix it by move block_ops trace point just before block_ops.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If a page is cached but its block was deallocated, we don't need to make
the page dirty again by gc and truncate_partial_data_page.
In that case, it needs to check its block allocation all the time instead
of giving up-to-date page.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If page's on-disk block was deallocated, let's remove up-to-date flag to avoid
further access with wrong contents.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
cp_pack_start_sum is calculated in do_checkpoint and is equal to
cpu_to_le32(1 + cp_payload_blks + orphan_blocks). The number of
orphan inode blocks is take advantage of by recover_orphan_inodes
to readahead meta pages and recovery inodes. However, current codes
forget to reduce the number of cp payload blocks when calculate
the number of orphan inode blocks. This patch fix it.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces a generic ioctl for fs shutdown, which was used by xfs.
If this shutdown is triggered, filesystem stops any further IOs according to the
following options.
1. FS_GOING_DOWN_FULLSYNC
: this will flush all the data and dentry blocks, and do checkpoint before
shutdown.
2. FS_GOING_DOWN_METASYNC
: this will do checkpoint before shutdown.
3. FS_GOING_DOWN_NOSYNC
: this will trigger shutdown as is.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
struct kiocb now is a generic I/O container, so move it to fs.h.
Also do a #include diet for aio.h while we're at it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
During the roll-forward recovery, -ENOENT for f2fs_iget can be skipped.
So, this error value should not be propagated.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We update extent cache for all user inode of f2fs including dir inode, so this
patch gives another chance to try to get physical address of page from extent
cache for dir inode.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch switch to check FI_NO_EXTENT in f2fs_{lookup,update}_extent_cache
instead of f2fs_{lookup,update}_extent_tree or {lookup,update}_extent_info.
No functionality modification in this patch.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds a fast lookup path for rb-tree extent cache.
In this patch we add a recently accessed extent node pointer 'cached_en' in
extent tree. In lookup path of extent cache, we will firstly lookup the last
accessed extent node which cached_en points, if we do not hit in this node,
we will try to lookup extent node in rb-tree.
By this way we can avoid unnecessary slow lookup in rb-tree sometimes.
Note that, side-effect of this patch is that we will increase memory cost,
because we will store a pointer variable in each struct extent tree
additionally.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds trace for lookup/update/shrink/destroy ops in rb-tree extent cache.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch add and show stat info of total memory footprint for extent tree,node
in debugfs.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch enables rb-tree based extent cache in f2fs.
When we mount with "-o extent_cache", f2fs will try to add recently accessed
page-block mappings into rb-tree based extent cache as much as possible, instead
of original one extent info cache.
By this way, f2fs can support more effective cache between dnode page cache and
disk. It will supply high hit ratio in the cache with fewer memory when dnode
page cache are reclaimed in environment of low memory.
Storage: Sandisk sd card 64g
1.append write file (offset: 0, size: 128M);
2.override write file (offset: 2M, size: 1M);
3.override write file (offset: 4M, size: 1M);
...
4.override write file (offset: 48M, size: 1M);
...
5.override write file (offset: 112M, size: 1M);
6.sync
7.echo 3 > /proc/sys/vm/drop_caches
8.read file (size:128M, unit: 4k, count: 32768)
(time dd if=/mnt/f2fs/128m bs=4k count=32768)
Extent Hit Ratio:
before patched
Hit Ratio 121 / 1071 1071 / 1071
Performance:
before patched
real 0m37.051s 0m35.556s
user 0m0.040s 0m0.026s
sys 0m2.990s 0m2.251s
Memory Cost:
before patched
Tree Count: 0 1 (size: 24 bytes)
Node Count: 0 45 (size: 1440 bytes)
v3:
o retest and given more details of test result.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds a mount option 'extent_cache' in f2fs.
It is try to use a rb-tree based extent cache to cache more mapping information
with less memory if this option is set, otherwise we will use the original one
extent info cache.
Suggested-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds core functions including slab cache init function and
init/lookup/update/shrink/destroy function for rb-tree based extent cache.
Thank Jaegeuk Kim and Changman Lee as they gave much suggestion about detail
design and implementation of extent cache.
Todo:
* register rb-based extent cache shrink with mm shrink interface.
v2:
o move set_extent_info and __is_{extent,back,front}_mergeable into f2fs.h.
o introduce __{attach,detach}_extent_node for code readability.
o add cond_resched() when fail to invoke kmem_cache_alloc/radix_tree_insert.
o fix some coding style and typo issues.
v3:
o fix oops due to using an unassigned pointer.
o use list_del to remove extent node in shrink list.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
[Jaegeuk Kim: add static for some funcitons and declare in f2fs.h]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Introduce infra macro and data structure for rb-tree based extent cache:
Macros:
* EXT_TREE_VEC_SIZE: indicate vector size for gang lookup in extent tree.
* F2FS_MIN_EXTENT_LEN: indicate minimum length of extent managed in cache.
* EXTENT_CACHE_SHRINK_NUMBER: indicate number of extent in cache will be shrunk.
Basic data structures for extent cache:
* struct extent_tree: extent tree entry per inode.
* struct extent_node: extent info node linked in extent tree.
Besides, adding new extent cache related fields in f2fs_sb_info.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In this patch, we do these jobs:
1. rename {check,update}_extent_cache to {lookup,update}_extent_info;
2. introduce universal lookup/update interface of extent cache:
f2fs_{lookup,update}_extent_cache including above two real functions, then
export them to function callers.
So after above cleanup, we can add new rb-tree based extent cache into exported
interfaces.
v2:
o remove "f2fs_" for inner function {lookup,update}_extent_info suggested by
Jaegeuk Kim.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces f2fs_map_bh to clean codes of check_extent_cache.
v2:
o cleanup f2fs_map_bh pointed out by Jaegeuk Kim.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Rename a filed name from 'blk_addr' to 'blk' in struct {f2fs_extent,extent_info}
as annotation of this field descripts its meaning well to us.
By this way, we can avoid long statement in code of following patches.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Move ext_lock out of struct extent_info, then in the following patches we can
use variables with struct extent_info type as a parameter to pass pure data.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We should stat inline data information for temp file in f2fs_tmpfile if we
enable inline_data feature.
Otherwise, inline data stat number will be wrong after this temp file is
evicted.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch uses ->writepage of meta mapping in sync_meta_pages instead of
f2fs_write_meta_page, by this way, in its caller we can ignore any changes
(e.g. changing name) of this registered function.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces f2fs_update_dentry to remove redundant code in
f2fs_add_inline_entry and __f2fs_add_link.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
inline_dentry_addr is introduced with inline dentry feature without being used,
now we do not need to keep it for any reason, so remove it.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Pull f2fs updates from Jaegeuk Kim:
"Major changes are to:
- add f2fs_io_tracer and F2FS_IOC_GETVERSION
- fix wrong acl assignment from parent
- fix accessing wrong data blocks
- fix wrong condition check for f2fs_sync_fs
- align start block address for direct_io
- add and refactor the readahead flows of FS metadata
- refactor atomic and volatile write policies
But most of patches are for clean-ups and minor bug fixes. Some of
them refactor old code too"
* tag 'for-f2fs-3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (64 commits)
f2fs: use spinlock for segmap_lock instead of rwlock
f2fs: fix accessing wrong indexed data blocks
f2fs: avoid variable length array
f2fs: fix sparse warnings
f2fs: allocate data blocks in advance for f2fs_direct_IO
f2fs: introduce macros to convert bytes and blocks in f2fs
f2fs: call set_buffer_new for get_block
f2fs: check node page contents all the time
f2fs: avoid data offset overflow when lseeking huge file
f2fs: fix to use highmem for pages of newly created directory
f2fs: introduce a batched trim
f2fs: merge {invalidate,release}page for meta/node/data pages
f2fs: show the number of writeback pages in stat
f2fs: keep PagePrivate during releasepage
f2fs: should fail mount when trying to recover data on read-only dev
f2fs: split UMOUNT and FASTBOOT flags
f2fs: avoid write_checkpoint if f2fs is mounted readonly
f2fs: support norecovery mount option
f2fs: fix not to drop mount options when retrying fill_super
f2fs: merge flags in struct f2fs_sb_info
...
rwlock can provide better concurrency when there are much more readers than
writers because readers can hold the rwlock simultaneously.
But now, for segmap_lock rwlock in struct free_segmap_info, there is only one
reader 'mount' from below call path:
->f2fs_fill_super
->build_segment_manager
->build_dirty_segmap
->init_dirty_segmap
->find_next_inuse
read_lock
...
read_unlock
Now that our concurrency can not be improved since there is no other reader for
this lock, we do not need to use rwlock_t type for segmap_lock, let's replace it
with spinlock_t type.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes the following test.
This causes:
attempt to access beyond end of device
sdb2: rw=16384, want=14413962000, limit=16777216
The reason is:
- f2fs_write_begin
- f2fs_convert_inline_inode returns -ENOSPC
- f2fs_write_failed
- truncate_blocks
- truncate_partial_data_page
- find_data_page
- get_dnode_of_data returns wrong data index retrieved from inline_data
- f2fs_submit_page_bio(wrong data index)
- submit_bio(wrong data index)
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch resolves the following warnings.
include/trace/events/f2fs.h:150:1: warning: expression using sizeof bool
include/trace/events/f2fs.h:180:1: warning: expression using sizeof bool
include/trace/events/f2fs.h:990:1: warning: expression using sizeof bool
include/trace/events/f2fs.h:990:1: warning: expression using sizeof bool
include/trace/events/f2fs.h:150:1: warning: odd constant _Bool cast (ffffffffffffffff becomes 1)
include/trace/events/f2fs.h:180:1: warning: odd constant _Bool cast (ffffffffffffffff becomes 1)
include/trace/events/f2fs.h:990:1: warning: odd constant _Bool cast (ffffffffffffffff becomes 1)
include/trace/events/f2fs.h:990:1: warning: odd constant _Bool cast (ffffffffffffffff becomes 1)
fs/f2fs/checkpoint.c:27:19: warning: symbol 'inode_entry_slab' was not declared. Should it be static?
fs/f2fs/checkpoint.c:577:15: warning: cast to restricted __le32
fs/f2fs/checkpoint.c:592:15: warning: cast to restricted __le32
fs/f2fs/trace.c:19:1: warning: symbol 'pids' was not declared. Should it be static?
fs/f2fs/trace.c:21:21: warning: symbol 'last_io' was not declared. Should it be static?
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds two macros for transition between byte and block offsets.
Currently, f2fs only supports 4KB blocks, so use the default size for now.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes wrong handling of buffer_new flag in get_block.
If f2fs allocates new blocks and mapped buffer_head, it needs to set buffer_new
for the bh_result.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In get_node_page, if the page is up-to-date, we assumed that the page was not
reclaimed at all.
But, sometimes it was reported that its contents was missing.
So, just for sure, let's check its mapping and contents.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
xfstest generic/285 complains our issue in lseeking huge file.
Here is the detail output of generic/285:
"./check -f2fs tests/generic/285
Ran: generic/285
Failures: generic/285
Failed 1 of 1 tests
10. Test a huge file for offset overflow
10.01 SEEK_HOLE expected 65536 or 8589934592, got 65536. succ
10.02 SEEK_HOLE expected 65536 or 8589934592, got 65536. succ
10.03 SEEK_DATA expected 0 or 0, got 0. succ
10.04 SEEK_DATA expected 1 or 1, got 1. succ
10.05 SEEK_HOLE expected 8589934592 or 8589934592, got 0. FAIL
10.06 SEEK_DATA expected 8589869056 or 8589869056, got 8589869056. succ
10.07 SEEK_DATA expected 8589869057 or 8589869057, got 8589869057. succ
10.08 SEEK_DATA expected 8589869056 or 8589869056, got 4294901760. FAIL"
The reason of this issue is:
We will calculate current offset through left shifting page-offset with
PAGE_CACHE_SHIFT bits, but our page-offset is a type of unsigned long, its size
is 4 bytes in 32-bits machine.
So if our page-offset is bigger than (1 << 32 / pagesize - 1), result of left
shifting will overflow.
Let's fix this issue by casting type of page-offset to type of current offset:
loff_t.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In commit a78186ebe5 ("f2fs: use highmem for directory pages"), we have set
__GFP_HIGHMEM into dir mapping's gfp flag in f2fs_iget, so high address memory
could be used for these existing dir's page.
But we forgot to set flag for newly created dir, due to this reason, our newly
created dir pages could not be allocated from high address memory. Fix it.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces a batched trimming feature, which submits split discard
commands.
This is to avoid long latency due to huge trim commands.
If fstrim was triggered ranging from 0 to the end of device, we should lock
all the checkpoint-related mutexes, resulting in very long latency.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch merges ->{invalidate,release}page function for meta/node/data pages.
After this, duplication of codes could be removed.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If device is read-only, we should not proceed data recovery.
But, if the previous checkpoint was done by normal clean shutdown, it's safe to
proceed the recovery, since there will be no data to be recovered.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds FASTBOOT flag into checkpoint as follows.
- CP_UMOUNT_FLAG is set when system is umounted.
- CP_FASTBOOT_FLAG is set when intermediate checkpoint having node summaries
was done.
So, if you get CP_UMOUNT_FLAG from checkpoint, the system was umounted cleanly.
Instead, if there was sudden-power-off, you can get CP_FASTBOOT_FLAG or nothing.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Do not change any partition when f2fs is changed to readonly mode.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds a mount option, norecovery, which is mostly same as
disable_roll_forward. The only difference is that norecovery should be activated
with read-only mount option.
This can be used when user wants to check whether f2fs is mountable or not
without any recovery process. (e.g., xfstests/200)
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If wrong mount option was requested, f2fs tries to fill_super again.
But, during the next trial, f2fs has no valid mount options, since
parse_options deleted all the separators in the original string.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Currently, there are several variables with Boolean type as below:
struct f2fs_sb_info {
...
int s_dirty;
bool need_fsck;
bool s_closing;
...
bool por_doing;
...
}
For this there are some issues:
1. there are some space of f2fs_sb_info is wasted due to aligning after Boolean
type variables by compiler.
2. if we continuously add new flag into f2fs_sb_info, structure will be messed
up.
So in this patch, we try to:
1. switch s_dirty to Boolean type variable since it has two status 0/1.
2. merge s_dirty/need_fsck/s_closing/por_doing variables into s_flag.
3. introduce an enum type which can indicate different states of sbi.
4. use new introduced universal interfaces is_sbi_flag_set/{set,clear}_sbi_flag
to operate flags for sbi.
After that, above issues will be fixed.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Use pointer parameter @wait to pass result in {in,de}create_sleep_time for
cleanup.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
1. make truncate_inline_date static;
2. remove parameter @from of truncate_inline_date as callers only pass zero.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
No modification in functionality, just clean codes with f2fs_radix_tree_insert.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In this patch we add the FS_IOC_GETVERSION ioctl for getting i_generation from
inode, after that, users can list file's generation number by using "lsattr -v".
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
During the recovery, any xattr blocks should not be found, since they are
written into cold log, not the warm node chain.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We will encounter deadloop in below scenario:
1. increase page count for F2FS_DIRTY_META type in following path:
->recover_fsync_data
->recover_data
->do_recover_data
->recover_data_page
->change_curseg
->write_sum_page
->set_page_dirty
2. fail in recover_data()
3. invalidate meta pages in truncate_inode_pages_final without decreasing page
count.
4. deadloop when sync_meta_pages as page count will always be non-zero.
message:
NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s!
[<c1129a37>] pagevec_lookup_tag+0x27/0x30
[<f0e774c7>] sync_meta_pages+0x87/0x160 [f2fs]
[<f0e86dd9>] recover_fsync_data+0xeb9/0xf10 [f2fs]
[<f0e75398>] f2fs_fill_super+0x888/0x980 [f2fs]
[<c11733ca>] mount_bdev+0x16a/0x1a0
[<f0e7180f>] f2fs_mount+0x1f/0x30 [f2fs]
[<c1173da6>] mount_fs+0x36/0x170
[<c118b6f5>] vfs_kern_mount+0x55/0xe0
[<c118d63f>] do_mount+0x1df/0x9f0
[<c118e110>] SyS_mount+0x70/0xb0
[<c15a0c48>] sysenter_do_call+0x12/0x12
To avoid page count leak, let's add ->invalidatepage and ->releasepage in
f2fs_meta_aops as f2fs_node_aops to release meta page count correctly.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If the previous checkpoint was done without CP_UMOUNT flag, it needs to do
checkpoint with CP_UMOUNT for the next fast boot.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes to trigger checkpoint with umount flag when kill_sb was called.
In kill_sb, f2fs_sync_fs was finally called, but at this time, f2fs can't do
checkpoint with CP_UMOUNT.
After then, f2fs_put_super is not doing checkpoint, since it is not dirty.
So, this patch adds a flag to indicate f2fs_sync_fs is called during umount.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Our value of memory footprint statistics showed in debugfs is not calculated
correctly. Fix it in this patch.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If cp_error is set, we should avoid all the infinite loop.
In f2fs_sync_file, there is a hole, and this patch fixes that.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
fs/f2fs/trace.c:19:12: sparse: symbol 'pids_lock' was not declared. Should it be static?
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In the normal case, the radix_tree_nodes are freed successfully.
But, when cp_error was detected, we should destroy them forcefully.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We use kzalloc to allocate memory in __recover_inline_status, and use this
all-zero memory to check the inline date content of inode page by comparing
them. This is low effective and not needed, let's check inline date content
directly.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
[Jaegeuk Kim: make the code more neat]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch aligns the start block address of a file for direct io to the f2fs's
section size.
Some flash devices manage an over 4KB-sized page as a write unit, and if the
direct_io'ed data are written but not aligned to that unit, the performance can
be degraded due to the partial page copies.
Thus, since f2fs has a section that is well aligned to FTL units, we can align
the block address to the section size so that f2fs avoids this misalignment.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
There are two slab cache inode_entry_slab and winode_slab using the same
structure as below:
struct dir_inode_entry {
struct list_head list; /* list head */
struct inode *inode; /* vfs inode pointer */
};
struct inode_entry {
struct list_head list;
struct inode *inode;
};
It's a little waste that the two cache can not share their memory space for each
other.
So in this patch we remove one redundant winode_slab slab cache, then use more
universal name struct inode_entry as remaining data structure name of slab,
finally we reuse the inode_entry_slab to store dirty dir item and gc item for
more effective.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Cleanup parameters for trace_f2fs_submit_{read_,write_,page_,page_m}bio with fio
as one parameter.
Suggested-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds missing parameter _type_ for trace_f2fs_submit_page_bio, then
use DECLARE_EVENT_CLASS/DEFINE_EVENT_CONDITION pair to cleanup some trace event
code related to f2fs_submit_page_{m,}bio.
Additionally, after we remove redundant code, size of code can be reduced:
text data bss dec hex filename
176787 8712 56 185555 2d4d3 f2fs.ko.org
174408 8648 56 183112 2cb48 f2fs.ko
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In do_recover_data, we find and update previous node pages after updating
its new block addresses.
After then, we call fill_node_footer without reset field, we erase its
cold bit so that this new cold node block is written to wrong log area.
This patch fixes not to miss its old flag.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds block count by in-place-update in stat.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The __f2fs_add_link is covered by cp_rwsem all the time.
This calls init_inode_metadata, which conducts some acl operations including
memory allocation with GFP_KERNEL previously.
But, under memory pressure, f2fs_write_data_page can be called, which also
grabs cp_rwsem too.
In this case, this incurs a deadlock pointed by Chao.
Thread #1 Thread #2
down_read
down_write
down_read
-> here down_read should wait forever.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds two key functions to trace process ids and IOs.
The basic idea is to
1. remain process ids, pids, in page->private.
2. show pids in IO traces.
So, later we can retrieve process information according to IO traces.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds:
o initial trace.c and trace.h with skeleton functions
o Kconfig and Makefile to activate this feature
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch cleans up parameters on IO paths.
The key idea is to use f2fs_io_info adding a parameter, block address, and then
use this structure as parameters.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Use more common function ra_meta_pages() with META_POR to readahead node blocks
in restore_node_summary() instead of ra_sum_pages(), hence we can simplify the
readahead code there, and also we can remove unused function ra_sum_pages().
changes from v2:
o use invalidate_mapping_pages as before suggested by Changman Lee.
changes from v1:
o fix one bug when using truncate_inode_pages_range which is pointed out by
Jaegeuk Kim.
Reviewed-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch moves one member of struct nat_entry: _flag_ to struct node_info,
so _version_ in struct node_info and _flag_ which are unsigned char type will
merge to one 32-bit space in register/memory. So the size of nat_entry will be
reduced from 28 bytes to 24 bytes (for 64-bit machine, reduce its size from 40
bytes to 32 bytes) and then slab memory using by f2fs will be reduced.
changes from v2:
o update description of memory usage gain for 64-bit machine suggested by
Changman Lee.
changes from v1:
o introduce inline copy_node_info() to copy valid data from node info suggested
by Jaegeuk Kim, it can avoid bug.
Reviewed-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Let's add readahead code for reading contiguous compact/normal summary blocks
in checkpoint, then we will gain better performance in mount procedure.
Changes from v1
o remove inappropriate 'unlikely' in npages_for_summary_flush.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Now we use inmemory pages for atomic write only and provide abort procedure,
we don't need to truncate them explicitly.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The ckpt_valid_map and cur_valid_map are synced by seg_info_to_raw_sit.
In the case of small discards, the candidates are selected before sync,
while fitrim selects candidates after sync.
So, for small discards, we need to add candidates only just being obsoleted.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds two new ioctls to release inmemory pages grabbed by atomic
writes.
o f2fs_ioc_abort_volatile_write
- If transaction was failed, all the grabbed pages and data should be written.
o f2fs_ioc_release_volatile_write
- This is to enhance the performance of PERSIST mode in sqlite.
In order to avoid huge memory consumption which causes OOM, this patch changes
volatile writes to use normal dirty pages, instead blocked flushing to the disk
as long as system does not suffer from memory pressure.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We don't need to force to write dirty_exceeded for f2fs_balance_fs_bg.
This flag was only meaningful to write bypassing conditions.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
To improve recovery speed, f2fs try to readahead many contiguous blocks in warm
node segment, but for most time, abnormal power-off do not occur frequently, so
when mount a normal power-off f2fs image, by contrary ra so many blocks and then
invalid them will hurt the performance of mount.
It's better to just ra the first next-block for normal condition.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch does cleanup work, it introduces is_valid_blkaddr() to include
verification code for blkaddr with upper and down boundary value which were in
ra_meta_pages previous.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
1.We use zero as upper boundary value for ra SSA/CP blocks, we will skip
readahead as verification failure with max number, it causes low performance.
2.Low boundary value is not accurate for SSA/CP/POR region verification, so
these values need to be redefined.
This patch fixes above issues.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
As inline_{dir,inode} stat is increased/decreased concurrently by multi threads,
so the value is not so accurate, let's use atomic type for counting accurately.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Added some commentaries for code readability and cleaned up if-statement
clearly.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If inode state is dirty, go straight to write.
Suggested-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The inmemory pages should be handled by invalidate_page since it needs to be
released int the truncation path.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In do_read_inode, if we failed __recover_inline_status, the inode has inline
flag without increasing its count.
Later, f2fs_evict_inode will decrease the count, which causes -1.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch revists retrial paths in f2fs.
The basic idea is to use cond_resched instead of retrying from the very early
stage.
Suggested-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch tries to fix:
BUG: using smp_processor_id() in preemptible [00000000] code: f2fs_gc-254:0/384
(radix_tree_node_alloc+0x14/0x74) from [<c033d8a0>] (radix_tree_insert+0x110/0x200)
(radix_tree_insert+0x110/0x200) from [<c02e8264>] (gc_data_segment+0x340/0x52c)
(gc_data_segment+0x340/0x52c) from [<c02e8658>] (f2fs_gc+0x208/0x400)
(f2fs_gc+0x208/0x400) from [<c02e8a98>] (gc_thread_func+0x248/0x28c)
(gc_thread_func+0x248/0x28c) from [<c0139944>] (kthread+0xa0/0xac)
(kthread+0xa0/0xac) from [<c0105ef8>] (ret_from_fork+0x14/0x3c)
The reason is that f2fs calls radix_tree_insert under enabled preemption.
So, before calling it, we need to call radix_tree_preload.
Otherwise, we should use _GFP_WAIT for the radix tree, and use mutex or
semaphore to cover the radix tree operations.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previoulsy, we used rwlock for nat_entry lock.
But, now we have a lot of complex operations in set_node_addr.
(e.g., allocating kernel memories, handling radix_trees, and so on)
So, this patches tries to change spinlock to rw_semaphore to give CPUs to other
threads.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If there are many inodes that have data blocks in victim segment,
it takes long time to find a inode in gc_inode list.
Let's use radix_tree to reduce lookup time.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We've already made fi and sbi for inode. Let's avoid duplicated work.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Fix the wrong error number in error path of f2fs_write_begin.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Little cleanup to distinguish each phase easily
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
[Jaegeuk Kim: modify indentation for code readability]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If an inode has converted inline_data which was written to the disk, we should
set its inode flag for further fsync so that this inline_data can be recovered
from sudden power off.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
After flushing dirty nat entries, it has to be no more dirty nat
entries.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
It's meaningless to check dirty_nat_cnt after re-dirtying nat entries in
journal. And although there are rooms for dirty nat entires if dirty_nat_cnt
is zero, it's also meaningless to check __has_cursum_space.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
A deadlock can be occurred:
Thread 1] Thread 2]
- f2fs_write_data_pages - f2fs_write_begin
- lock_page(page #0)
- grab_cache_page(page #X)
- get_node_page(inode_page)
- grab_cache_page(page #0)
: to convert inline_data
- f2fs_write_data_page
- f2fs_write_inline_data
- get_node_page(inode_page)
In this case, trying to lock inode_page and page #0 causes deadlock.
In order to avoid this, this patch adds a rule for this locking policy,
which is that page #0 should be locked followed by inode_page lock.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Two jump labels were adjusted in the implementation of the
create_node_manager_caches() function because these identifiers
contained typos.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In f2fs_evict_inode,
commit_inmemory_pages
f2fs_gc
f2fs_iget
iget_locked
-> wait for inode free
Here, if the inode is same as the one to be evicted, f2fs should wait forever.
Actually, we should not call f2fs_balance_fs during f2fs_evict_inode to avoid
this.
But, the commit_inmem_pages calls f2fs_balance_fs by default, even if
f2fs_evict_inode wants to free inmemory pages only.
Hence, this patch adds to trigger f2fs_balance_fs only when there is something
to write.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
It used nat_entry_set when create slab for sit_entry_set.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If a node page is request to be written during the reclaiming path, we should
submit the bio to avoid pending to recliam it.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Now in f2fs, we have three inode cache: ORPHAN_INO, APPEND_INO, UPDATE_INO,
and we manage fields related to inode cache separately in struct f2fs_sb_info
for each inode cache type.
This makes codes a bit messy, so that this patch intorduce a new struct
inode_management to wrap inner fields as following which make codes more neat.
/* for inner inode cache management */
struct inode_management {
struct radix_tree_root ino_root; /* ino entry array */
spinlock_t ino_lock; /* for ino entry lock */
struct list_head ino_list; /* inode list head */
unsigned long ino_num; /* number of entries */
};
struct f2fs_sb_info {
...
struct inode_management im[MAX_INO_ENTRY]; /* manage inode cache */
...
}
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Because we have checked the contrary condition in case of "if" judgment, we do
not need to check the condition again in case of "else" judgment. Let's remove
it.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In f2fs_remount, we will stop gc thread and set need_restart_gc as true when new
option is set without BG_GC, then if any error occurred in the following
procedure, we can restore to start the gc thread.
But after that, We will fail to restore gc thread in start_gc_thread as BG_GC is
not set in new option, so we'd better move this condition judgment out of
start_gc_thread to fix this issue.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The locked page should be released before returning the function.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If i_size becomes large outside of MAX_INLINE_DATA, we shoud convert the inode.
Otherwise, we can make some dirty pages during the truncation, and those pages
will be written through f2fs_write_data_page.
At that moment, the inode has still inline_data, so that it tries to write non-
zero pages into inline_data area.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The scenario is like this.
One trhead triggers:
f2fs_write_data_pages
lock_page
f2fs_write_data_page
f2fs_lock_op <- wait
The other thread triggers:
f2fs_truncate
truncate_blocks
f2fs_lock_op
truncate_partial_data_page
lock_page <- wait for locking the page
This patch resolves this bug by relocating truncate_partial_data_page.
This function is just to truncate user data page and not related to FS
consistency as well.
And, we don't need to call truncate_inline_data. Rather than that,
f2fs_write_data_page will finally update inline_data later.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The # of inline_data inode is decreased only when it has inline_data.
After clearing the flag, we can't decreased the number.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
It needs to write node pages if checkpoint is not doing in order to avoid
memory pressure.
Reviewed-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds to control the memory footprint used by ino entries.
This will conduct best effort, not strictly.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If a system wants to reduce the booting time as a top priority, now we can
use a mount option, -o fastboot.
With this option, f2fs conducts a little bit slow write_checkpoint, but
it can avoid the node page reads during the next mount time.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
__submit_merged_bio f2fs_write_end_io f2fs_write_end_io
wait_io = X wait_io = x
complete(X) complete(X)
wait_io = NULL
wait_for_completion()
free(X)
spin_lock(X)
kernel panic
In order to avoid this, this patch removes the wait_io facility.
Instead, we can use wait_on_all_pages_writeback(sbi) to wait for end_ios.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If there is a chance to make a huge sized discard command, we don't need
to split it out, since each blkdev_issue_discard should wait one at a
time.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch simplifies the inline_data usage with the following rule.
1. inline_data is set during the file creation.
2. If new data is requested to be written ranges out of inline_data,
f2fs converts that inode permanently.
3. There is no cases which converts non-inline_data inode to inline_data.
4. The inline_data flag should be changed under inode page lock.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
There's no point in using test_and_clear_bit_le() when we don't use the
return value of the function. Just use clear_bit_le() instead.
Coverity-id: 1016434
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>