kernel_optimize_test/mm
Nick Piggin b827e496c8 mm: close page_mkwrite races
Change page_mkwrite to allow implementations to return with the page
locked, and also change it's callers (in page fault paths) to hold the
lock until the page is marked dirty.  This allows the filesystem to have
full control of page dirtying events coming from the VM.

Rather than simply hold the page locked over the page_mkwrite call, we
call page_mkwrite with the page unlocked and allow callers to return with
it locked, so filesystems can avoid LOR conditions with page lock.

The problem with the current scheme is this: a filesystem that wants to
associate some metadata with a page as long as the page is dirty, will
perform this manipulation in its ->page_mkwrite.  It currently then must
return with the page unlocked and may not hold any other locks (according
to existing page_mkwrite convention).

In this window, the VM could write out the page, clearing page-dirty.  The
filesystem has no good way to detect that a dirty pte is about to be
attached, so it will happily write out the page, at which point, the
filesystem may manipulate the metadata to reflect that the page is no
longer dirty.

It is not always possible to perform the required metadata manipulation in
->set_page_dirty, because that function cannot block or fail.  The
filesystem may need to allocate some data structure, for example.

And the VM cannot mark the pte dirty before page_mkwrite, because
page_mkwrite is allowed to fail, so we must not allow any window where the
page could be written to if page_mkwrite does fail.

This solution of holding the page locked over the 3 critical operations
(page_mkwrite, setting the pte dirty, and finally setting the page dirty)
closes out races nicely, preventing page cleaning for writeout being
initiated in that window.  This provides the filesystem with a strong
synchronisation against the VM here.

- Sage needs this race closed for ceph filesystem.
- Trond for NFS (http://bugzilla.kernel.org/show_bug.cgi?id=12913).
- I need it for fsblock.
- I suspect other filesystems may need it too (eg. btrfs).
- I have converted buffer.c to the new locking. Even simple block allocation
  under dirty pages might be susceptible to i_size changing under partial page
  at the end of file (we also have a buffer.c-side problem here, but it cannot
  be fixed properly without this patch).
- Other filesystems (eg. NFS, maybe btrfs) will need to change their
  page_mkwrite functions themselves.

[ This also moves page_mkwrite another step closer to fault, which should
  eventually allow page_mkwrite to be moved into ->fault, and thus avoiding a
  filesystem calldown and page lock/unlock cycle in __do_fault. ]

[akpm@linux-foundation.org: fix derefs of NULL ->mapping]
Cc: Sage Weil <sage@newdream.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-02 15:36:09 -07:00
..
allocpercpu.c percpu: __percpu_depopulate_mask can take a const mask 2009-04-06 13:44:15 -07:00
backing-dev.c block: change the request allocation/congestion logic to be sync/async based 2009-04-06 08:04:53 -07:00
bootmem.c bootmem, x86: further fixes for arch-specific bootmem wrapping 2009-03-01 16:06:56 +09:00
bounce.c
debug-pagealloc.c generic debug pagealloc 2009-04-01 08:59:13 -07:00
dmapool.c
fadvise.c [CVE-2009-0029] System call wrapper special cases 2009-01-14 14:15:18 +01:00
failslab.c kmemtrace, mm: fix slab.h dependency problem in mm/failslab.c 2009-04-03 12:23:01 +02:00
filemap_xip.c mm: do_xip_mapping_read: fix length calculation 2009-04-02 19:04:49 -07:00
filemap.c Export filemap_write_and_wait_range 2009-04-16 07:47:49 -07:00
fremap.c Do not account for the address space used by hugetlbfs using VM_ACCOUNT 2009-02-10 10:48:42 -08:00
highmem.c mm: introduce debug_kmap_atomic 2009-04-01 08:59:14 -07:00
hugetlb.c hugetlb: chg cannot become less than 0 2009-04-01 08:59:13 -07:00
internal.h nommu: there is no mlock() for NOMMU, so don't provide the bits 2009-04-01 08:59:14 -07:00
Kconfig mm: point the UNEVICTABLE_LRU config option at the documentation 2009-04-13 15:04:31 -07:00
Kconfig.debug generic debug pagealloc: build fix 2009-04-02 19:04:48 -07:00
maccess.c
madvise.c [CVE-2009-0029] System call wrappers part 14 2009-01-14 14:15:24 +01:00
Makefile generic debug pagealloc 2009-04-01 08:59:13 -07:00
memcontrol.c memcg: fix try_get_mem_cgroup_from_swapcache() 2009-05-02 15:36:09 -07:00
memory_hotplug.c mm: remove GFP_HIGHUSER_PAGECACHE 2009-01-06 15:59:01 -08:00
memory.c mm: close page_mkwrite races 2009-05-02 15:36:09 -07:00
mempolicy.c [CVE-2009-0029] System call wrappers part 28 2009-01-14 14:15:30 +01:00
mempool.c
migrate.c FS-Cache: Recruit a page flags for cache management 2009-04-03 16:42:36 +01:00
mincore.c [CVE-2009-0029] System call wrappers part 14 2009-01-14 14:15:24 +01:00
mlock.c Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-02-17 14:27:39 -08:00
mm_init.c
mmap.c mm: pass correct mm when growing stack 2009-04-16 14:41:25 -07:00
mmu_notifier.c
mmzone.c
mprotect.c Do not account for the address space used by hugetlbfs using VM_ACCOUNT 2009-02-10 10:48:42 -08:00
mremap.c [CVE-2009-0029] System call wrappers part 13 2009-01-14 14:15:23 +01:00
msync.c [CVE-2009-0029] System call wrappers part 13 2009-01-14 14:15:23 +01:00
nommu.c nommu: fix a number of issues with the per-MM VMA patch 2009-04-02 19:04:48 -07:00
oom_kill.c memcg: show memcg information during OOM 2009-04-02 19:04:55 -07:00
page_alloc.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-cpumask 2009-04-05 10:33:07 -07:00
page_cgroup.c memcg: remove redundant message at swapon 2009-04-02 19:04:56 -07:00
page_io.c block: fix bad definition of BIO_RW_SYNC 2009-02-18 10:32:00 +01:00
page_isolation.c
page-writeback.c mm: fix proc_dointvec_userhz_jiffies "breakage" 2009-04-01 08:59:13 -07:00
pagewalk.c
pdflush.c mm: add /proc controls for pdflush threads 2009-04-07 08:31:03 -07:00
percpu.c percpu: generalize embedding first chunk setup helper 2009-03-10 16:27:48 +09:00
prio_tree.c
quicklist.c cpumask: replace node_to_cpumask with cpumask_of_node. 2009-03-13 14:49:46 +10:30
readahead.c FS-Cache: Recruit a page flags for cache management 2009-04-03 16:42:36 +01:00
rmap.c mm: fix mlocked page counter mismatch 2009-02-11 14:25:35 -08:00
shmem_acl.c
shmem.c shmem: respect MAX_LFS_FILESIZE 2009-04-13 15:04:33 -07:00
slab.c Merge branch 'kmemtrace-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-04-06 13:30:00 -07:00
slob.c kmemtrace: trace kfree() calls with NULL or zero-length objects 2009-04-03 12:23:10 +02:00
slub.c kmemtrace: trace kfree() calls with NULL or zero-length objects 2009-04-03 12:23:10 +02:00
sparse-vmemmap.c
sparse.c mm: mminit_validate_memmodel_limits(): remove redundant test 2009-04-01 08:59:11 -07:00
swap_state.c memcg: mem+swap controller core 2009-01-08 08:31:05 -08:00
swap.c FS-Cache: Recruit a page flags for cache management 2009-04-03 16:42:36 +01:00
swapfile.c PM/hibernate: fix "swap breaks after hibernation failures" 2009-02-21 14:17:17 -08:00
thrash.c
truncate.c FS-Cache: Recruit a page flags for cache management 2009-04-03 16:42:36 +01:00
util.c mm: document get_user_pages_fast() 2009-04-13 15:04:32 -07:00
vmalloc.c vmap: remove needless lock and list in vmap 2009-04-01 08:59:11 -07:00
vmscan.c vmscan,memcg: reintroduce sc->may_swap 2009-04-21 13:41:51 -07:00
vmstat.c mm: align vmstat_work's timer 2009-04-02 19:04:48 -07:00