kernel_optimize_test

Nick Piggin b827e496c8 mm: close page_mkwrite races	2009-05-02 15:36:09 -07:00

Change page_mkwrite to allow implementations to return with the page locked, and also change its callers (in page fault paths) to hold the lock until the page is marked dirty. This allows the filesystem to have full control of page dirtying events coming from the VM.

Rather than simply hold the page locked over the page_mkwrite call, we call page_mkwrite with the page unlocked and allow callers to return with it locked, so filesystems can avoid LOR conditions with page lock.

The problem with the current scheme is this: a filesystem that wants to associate some metadata with a page as long as the page is dirty, will perform this manipulation in its ->page_mkwrite. It currently then must return with the page unlocked and may not hold any other locks (according to existing page_mkwrite convention).

In this window, the VM could write out the page, clearing page-dirty. The filesystem has no good way to detect that a dirty pte is about to be attached, so it will happily write out the page, at which point, the filesystem may manipulate the metadata to reflect that the page is no longer dirty.

It is not always possible to perform the required metadata manipulation in ->set_page_dirty, because that function cannot block or fail. The filesystem may need to allocate some data structure, for example.

And the VM cannot mark the pte dirty before page_mkwrite, because page_mkwrite is allowed to fail, so we must not allow any window where the page could be written to if page_mkwrite does fail.

This solution of holding the page locked over the 3 critical operations (page_mkwrite, setting the pte dirty, and finally setting the page dirty) closes out races nicely, preventing page cleaning for writeout being initiated in that window. This provides the filesystem with a strong synchronisation against the VM here.

- Sage needs this race closed for ceph filesystem.
- Trond for NFS (http://bugzilla.kernel.org/show_bug.cgi?id=12913).
- I need it for fsblock.
- I suspect other filesystems may need it too (eg. btrfs).
- I have converted buffer.c to the new locking. Even simple block allocation under dirty pages might be susceptible to i_size changing under partial page at the end of file (we also have a buffer.c-side problem here, but it cannot be fixed properly without this patch).
- Other filesystems (eg. NFS, maybe btrfs) will need to change their page_mkwrite functions themselves.

[ This also moves page_mkwrite another step closer to fault, which should eventually allow page_mkwrite to be moved into ->fault, and thus avoiding a filesystem calldown and page lock/unlock cycle in __do_fault. ]

[akpm@linux-foundation.org: fix derefs of NULL ->mapping]
Cc: Sage Weil <sage@newdream.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
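
The locking order described in the commit message can be illustrated with a small userspace model. This is only a sketch under stated assumptions: every name below (struct toy_page, toy_page_mkwrite, toy_writeback, toy_write_fault, toy_demo) is hypothetical and is not the kernel's API, the page lock is modelled as a plain flag, and the writeback attempt is simulated inline rather than on another CPU. It shows that when page_mkwrite returns with the page locked and the fault path keeps that lock until the pte and the page are both marked dirty, a writeback attempt in that window cannot clean the page.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a page; NOT the kernel's struct page. */
struct toy_page {
    bool locked;      /* models the page lock */
    bool dirty;       /* models the page-dirty flag */
    bool pte_dirty;   /* models the dirty bit of the mapping pte */
    bool fs_metadata; /* filesystem bookkeeping kept while the page is dirty */
};

static bool toy_trylock_page(struct toy_page *p)
{
    if (p->locked)
        return false;
    p->locked = true;
    return true;
}

static void toy_unlock_page(struct toy_page *p)
{
    p->locked = false;
}

/*
 * Hypothetical filesystem hook: called with the page unlocked, sets up
 * the metadata and returns 0 with the page LOCKED. On failure it
 * returns -1 and leaves the page unlocked and untouched.
 */
static int toy_page_mkwrite(struct toy_page *p, bool fail)
{
    if (fail || !toy_trylock_page(p))
        return -1;
    p->fs_metadata = true;
    return 0;
}

/* Writeback must take the page lock before it may clean the page. */
static bool toy_writeback(struct toy_page *p)
{
    if (!toy_trylock_page(p))
        return false;         /* page is busy: cleaning is not started */
    p->dirty = false;
    p->fs_metadata = false;   /* filesystem drops its dirty-page state */
    toy_unlock_page(p);
    return true;
}

/*
 * Write-fault path: page_mkwrite, set the pte dirty, set the page
 * dirty, all under the one page lock. *raced reports whether the
 * simulated writeback attempt inside the window managed to run.
 */
static int toy_write_fault(struct toy_page *p, bool mkwrite_fails, bool *raced)
{
    if (toy_page_mkwrite(p, mkwrite_fails) != 0)
        return -1;            /* nothing was made writable or dirty */

    *raced = toy_writeback(p); /* writeback tries to sneak in here */

    p->pte_dirty = true;
    p->dirty = true;
    toy_unlock_page(p);
    return 0;
}

/* Small self-check of the model; returns 0 when every check holds. */
static int toy_demo(void)
{
    struct toy_page pg = { false, false, false, false };
    bool raced = true;

    assert(toy_write_fault(&pg, false, &raced) == 0);
    assert(!raced);                       /* window was closed */
    assert(pg.dirty && pg.pte_dirty && pg.fs_metadata && !pg.locked);
    return 0;
}
```

Calling toy_demo() from a main() exercises the successful path. With mkwrite_fails set to true, toy_write_fault() returns -1 before the pte or the page is marked dirty, which mirrors the commit's point that the pte must not be dirtied ahead of a page_mkwrite that may fail.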
allocpercpu.c	percpu: __percpu_depopulate_mask can take a const mask	2009-04-06 13:44:15 -07:00
backing-dev.c	block: change the request allocation/congestion logic to be sync/async based	2009-04-06 08:04:53 -07:00
bootmem.c	bootmem, x86: further fixes for arch-specific bootmem wrapping	2009-03-01 16:06:56 +09:00
bounce.c
debug-pagealloc.c	generic debug pagealloc	2009-04-01 08:59:13 -07:00
dmapool.c
fadvise.c	[CVE-2009-0029] System call wrapper special cases	2009-01-14 14:15:18 +01:00
failslab.c	kmemtrace, mm: fix slab.h dependency problem in mm/failslab.c	2009-04-03 12:23:01 +02:00
filemap_xip.c	mm: do_xip_mapping_read: fix length calculation	2009-04-02 19:04:49 -07:00
filemap.c	Export filemap_write_and_wait_range	2009-04-16 07:47:49 -07:00
fremap.c	Do not account for the address space used by hugetlbfs using VM_ACCOUNT	2009-02-10 10:48:42 -08:00
highmem.c	mm: introduce debug_kmap_atomic	2009-04-01 08:59:14 -07:00
hugetlb.c	hugetlb: chg cannot become less than 0	2009-04-01 08:59:13 -07:00
internal.h	nommu: there is no mlock() for NOMMU, so don't provide the bits	2009-04-01 08:59:14 -07:00
Kconfig	mm: point the UNEVICTABLE_LRU config option at the documentation	2009-04-13 15:04:31 -07:00
Kconfig.debug	generic debug pagealloc: build fix	2009-04-02 19:04:48 -07:00
maccess.c
madvise.c	[CVE-2009-0029] System call wrappers part 14	2009-01-14 14:15:24 +01:00
Makefile	generic debug pagealloc	2009-04-01 08:59:13 -07:00
memcontrol.c	memcg: fix try_get_mem_cgroup_from_swapcache()	2009-05-02 15:36:09 -07:00
memory_hotplug.c	mm: remove GFP_HIGHUSER_PAGECACHE	2009-01-06 15:59:01 -08:00
memory.c	mm: close page_mkwrite races	2009-05-02 15:36:09 -07:00
mempolicy.c	[CVE-2009-0029] System call wrappers part 28	2009-01-14 14:15:30 +01:00
mempool.c
migrate.c	FS-Cache: Recruit a page flags for cache management	2009-04-03 16:42:36 +01:00
mincore.c	[CVE-2009-0029] System call wrappers part 14	2009-01-14 14:15:24 +01:00
mlock.c	Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip	2009-02-17 14:27:39 -08:00
mm_init.c
mmap.c	mm: pass correct mm when growing stack	2009-04-16 14:41:25 -07:00
mmu_notifier.c
mmzone.c
mprotect.c	Do not account for the address space used by hugetlbfs using VM_ACCOUNT	2009-02-10 10:48:42 -08:00
mremap.c	[CVE-2009-0029] System call wrappers part 13	2009-01-14 14:15:23 +01:00
msync.c	[CVE-2009-0029] System call wrappers part 13	2009-01-14 14:15:23 +01:00
nommu.c	nommu: fix a number of issues with the per-MM VMA patch	2009-04-02 19:04:48 -07:00
oom_kill.c	memcg: show memcg information during OOM	2009-04-02 19:04:55 -07:00
page_alloc.c	Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-cpumask	2009-04-05 10:33:07 -07:00
page_cgroup.c	memcg: remove redundant message at swapon	2009-04-02 19:04:56 -07:00
page_io.c	block: fix bad definition of BIO_RW_SYNC	2009-02-18 10:32:00 +01:00
page_isolation.c
page-writeback.c	mm: fix proc_dointvec_userhz_jiffies "breakage"	2009-04-01 08:59:13 -07:00
pagewalk.c
pdflush.c	mm: add /proc controls for pdflush threads	2009-04-07 08:31:03 -07:00
percpu.c	percpu: generalize embedding first chunk setup helper	2009-03-10 16:27:48 +09:00
prio_tree.c
quicklist.c	cpumask: replace node_to_cpumask with cpumask_of_node.	2009-03-13 14:49:46 +10:30
readahead.c	FS-Cache: Recruit a page flags for cache management	2009-04-03 16:42:36 +01:00
rmap.c	mm: fix mlocked page counter mismatch	2009-02-11 14:25:35 -08:00
shmem_acl.c
shmem.c	shmem: respect MAX_LFS_FILESIZE	2009-04-13 15:04:33 -07:00
slab.c	Merge branch 'kmemtrace-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip	2009-04-06 13:30:00 -07:00
slob.c	kmemtrace: trace kfree() calls with NULL or zero-length objects	2009-04-03 12:23:10 +02:00
slub.c	kmemtrace: trace kfree() calls with NULL or zero-length objects	2009-04-03 12:23:10 +02:00
sparse-vmemmap.c
sparse.c	mm: mminit_validate_memmodel_limits(): remove redundant test	2009-04-01 08:59:11 -07:00
swap_state.c	memcg: mem+swap controller core	2009-01-08 08:31:05 -08:00
swap.c	FS-Cache: Recruit a page flags for cache management	2009-04-03 16:42:36 +01:00
swapfile.c	PM/hibernate: fix "swap breaks after hibernation failures"	2009-02-21 14:17:17 -08:00
thrash.c
truncate.c	FS-Cache: Recruit a page flags for cache management	2009-04-03 16:42:36 +01:00
util.c	mm: document get_user_pages_fast()	2009-04-13 15:04:32 -07:00
vmalloc.c	vmap: remove needless lock and list in vmap	2009-04-01 08:59:11 -07:00
vmscan.c	vmscan,memcg: reintroduce sc->may_swap	2009-04-21 13:41:51 -07:00
vmstat.c	mm: align vmstat_work's timer	2009-04-02 19:04:48 -07:00