kernel_optimize_test/Documentation/filesystems
Nick Piggin b827e496c8 mm: close page_mkwrite races
Change page_mkwrite to allow implementations to return with the page
locked, and also change it's callers (in page fault paths) to hold the
lock until the page is marked dirty.  This allows the filesystem to have
full control of page dirtying events coming from the VM.

Rather than simply hold the page locked over the page_mkwrite call, we
call page_mkwrite with the page unlocked and allow callers to return with
it locked, so filesystems can avoid LOR conditions with page lock.

The problem with the current scheme is this: a filesystem that wants to
associate some metadata with a page as long as the page is dirty, will
perform this manipulation in its ->page_mkwrite.  It currently then must
return with the page unlocked and may not hold any other locks (according
to existing page_mkwrite convention).

In this window, the VM could write out the page, clearing page-dirty.  The
filesystem has no good way to detect that a dirty pte is about to be
attached, so it will happily write out the page, at which point, the
filesystem may manipulate the metadata to reflect that the page is no
longer dirty.

It is not always possible to perform the required metadata manipulation in
->set_page_dirty, because that function cannot block or fail.  The
filesystem may need to allocate some data structure, for example.

And the VM cannot mark the pte dirty before page_mkwrite, because
page_mkwrite is allowed to fail, so we must not allow any window where the
page could be written to if page_mkwrite does fail.

This solution of holding the page locked over the 3 critical operations
(page_mkwrite, setting the pte dirty, and finally setting the page dirty)
closes out races nicely, preventing page cleaning for writeout being
initiated in that window.  This provides the filesystem with a strong
synchronisation against the VM here.

- Sage needs this race closed for ceph filesystem.
- Trond for NFS (http://bugzilla.kernel.org/show_bug.cgi?id=12913).
- I need it for fsblock.
- I suspect other filesystems may need it too (eg. btrfs).
- I have converted buffer.c to the new locking. Even simple block allocation
  under dirty pages might be susceptible to i_size changing under partial page
  at the end of file (we also have a buffer.c-side problem here, but it cannot
  be fixed properly without this patch).
- Other filesystems (eg. NFS, maybe btrfs) will need to change their
  page_mkwrite functions themselves.

[ This also moves page_mkwrite another step closer to fault, which should
  eventually allow page_mkwrite to be moved into ->fault, and thus avoiding a
  filesystem calldown and page lock/unlock cycle in __do_fault. ]

[akpm@linux-foundation.org: fix derefs of NULL ->mapping]
Cc: Sage Weil <sage@newdream.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-02 15:36:09 -07:00
..
caching CacheFiles: Fix the documentation to use the correct credential pointer names 2009-04-24 13:28:30 -07:00
configfs
pohmelfs Staging: Pohmelfs: Added IO permissions and priorities. 2009-04-17 11:06:30 -07:00
9p.txt
00-INDEX nilfs2: add document 2009-04-07 08:31:12 -07:00
adfs.txt
affs.txt
afs.txt
autofs4-mount-control.txt
automount-support.txt
befs.txt
bfs.txt
btrfs.txt Btrfs: Add Documentation/filesystem/btrfs.txt, remove old COPYING 2009-01-07 09:54:24 -05:00
cifs.txt
coda.txt
cramfs.txt
dentry-locking.txt
devpts.txt Document usage of multiple-instances of devpts 2009-01-02 10:19:36 -08:00
directory-locking
dlmfs.txt
dnotify.txt
ecryptfs.txt
exofs.txt exofs: Documentation 2009-03-31 19:44:38 +03:00
Exporting
ext2.txt trivial: fix orphan dates in ext2 documentation 2009-03-23 14:21:26 -07:00
ext3.txt trivial: document ext3 semantics of 'ro' option a bit better 2009-03-30 15:21:56 +02:00
ext4.txt ext4: Regularize mount options 2009-03-28 10:59:57 -04:00
fiemap.txt
files.txt fix f_count description in Documentation/filesystems/files.txt 2008-12-31 18:07:42 -05:00
fuse.txt
gfs2-glocks.txt
gfs2.txt
hfs.txt
hfsplus.txt
hpfs.txt
inotify.txt
isofs.txt
jfs.txt
knfsd-stats.txt Document /proc/fs/nfsd/pool_stats 2009-03-27 19:24:27 -04:00
Locking mm: close page_mkwrite races 2009-05-02 15:36:09 -07:00
locks.txt
mandatory-locking.txt
ncpfs.txt
nfs41-server.txt nfsd41: Documentation/filesystems/nfs41-server.txt 2009-04-03 17:41:24 -07:00
nfs-rdma.txt update port number in NFS/RDMA documentation 2009-01-27 17:20:14 -05:00
nfsroot.txt
nilfs2.txt nilfs2: clean up sketch file 2009-04-07 08:31:19 -07:00
ntfs.txt
ocfs2.txt ocfs2: add mount option and Kconfig option for acl 2009-01-05 08:36:52 -08:00
omfs.txt
porting
proc.txt documentation: update Documentation/filesystem/proc.txt and Documentation/sysctls 2009-04-02 19:04:53 -07:00
quota.txt
ramfs-rootfs-initramfs.txt
relay.txt
romfs.txt
rpc-cache.txt
seq_file.txt
sharedsubtree.txt
smbfs.txt
spufs.txt
squashfs.txt Squashfs: fix documentation typo, Cramfs filesystem limit is 256 MiB 2009-03-05 00:40:13 +00:00
sysfs-pci.txt PCI: Introduce /sys/bus/pci/devices/.../remove 2009-03-20 14:58:48 -07:00
sysfs.txt PATCH [2/2] Documentation/filesystems/sysfs.txt: fix descriptions of device attributes 2009-02-22 09:28:15 -08:00
sysv-fs.txt
tmpfs.txt
ubifs.txt UBIFS: remove fast unmounting 2009-01-29 16:34:30 +02:00
udf.txt udf: implement mode and dmode mounting options 2009-04-02 12:29:50 +02:00
ufs.txt
vfat.txt
vfs.txt Documentation/filesystems: remove out of date reference to BKL being held 2009-04-20 23:01:16 -04:00
xfs.txt
xip.txt