kernel_optimize_test/kernel
Lee Schermerhorn 06808b0827 hugetlb: derive huge pages nodes allowed from task mempolicy
This patch derives a "nodes_allowed" node mask from the numa mempolicy of
the task modifying the number of persistent huge pages to control the
allocation, freeing and adjusting of surplus huge pages when the pool page
count is modified via the new sysctl or sysfs attribute
"nr_hugepages_mempolicy".  The nodes_allowed mask is derived as follows:

* For "default" [NULL] task mempolicy, a NULL nodemask_t pointer
  is produced.  This will cause the hugetlb subsystem to use
  node_online_map as the "nodes_allowed".  This preserves the
  behavior before this patch.
* For "preferred" mempolicy, including explicit local allocation,
  a nodemask with the single preferred node will be produced.
  "local" policy will NOT track any internode migrations of the
  task adjusting nr_hugepages.
* For "bind" and "interleave" policy, the mempolicy's nodemask
  will be used.
* Other than to inform the construction of the nodes_allowed node
  mask, the actual mempolicy mode is ignored.  That is, all modes
  behave like interleave over the resulting nodes_allowed mask
  with no "fallback".

See the updated documentation [next patch] for more information
about the implications of this patch.

Examples:

Starting with:

	Node 0 HugePages_Total:     0
	Node 1 HugePages_Total:     0
	Node 2 HugePages_Total:     0
	Node 3 HugePages_Total:     0

Default behavior [with or without this patch] balances persistent
hugepage allocation across nodes [with sufficient contiguous memory]:

	sysctl vm.nr_hugepages[_mempolicy]=32

yields:

	Node 0 HugePages_Total:     8
	Node 1 HugePages_Total:     8
	Node 2 HugePages_Total:     8
	Node 3 HugePages_Total:     8

Of course, we only have nr_hugepages_mempolicy with the patch,
but with default mempolicy, nr_hugepages_mempolicy behaves the
same as nr_hugepages.

Applying mempolicy--e.g., with numactl [using '-m' a.k.a.
'--membind' because it allows multiple nodes to be specified
and it's easy to type]--we can allocate huge pages on
individual nodes or sets of nodes.  So, starting from the
condition above, with 8 huge pages per node, add 8 more to
node 2 using:

	numactl -m 2 sysctl vm.nr_hugepages_mempolicy=40

This yields:

	Node 0 HugePages_Total:     8
	Node 1 HugePages_Total:     8
	Node 2 HugePages_Total:    16
	Node 3 HugePages_Total:     8

The incremental 8 huge pages were restricted to node 2 by the
specified mempolicy.

Similarly, we can use mempolicy to free persistent huge pages
from specified nodes:

	numactl -m 0,1 sysctl vm.nr_hugepages_mempolicy=32

yields:

	Node 0 HugePages_Total:     4
	Node 1 HugePages_Total:     4
	Node 2 HugePages_Total:    16
	Node 3 HugePages_Total:     8

The 8 huge pages freed were balanced over nodes 0 and 1.

[rientjes@google.com: accomodate reworked NODEMASK_ALLOC]
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Andi Kleen <andi@firstfloor.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Eric Whitney <eric.whitney@hp.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15 08:53:12 -08:00
..
gcov
irq Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2009-12-09 19:43:33 -08:00
power PM / Runtime: Export the PM runtime workqueue 2009-12-06 16:17:56 +01:00
time Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu 2009-12-14 09:58:24 -08:00
trace Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu 2009-12-14 09:58:24 -08:00
.gitignore
acct.c bsdacct: fix uid/gid misreporting 2009-12-15 08:53:10 -08:00
async.c
audit_tree.c
audit_watch.c
audit.c
audit.h
auditfilter.c
auditsc.c
backtracetest.c
bounds.c
capability.c remove CONFIG_SECURITY_FILE_CAPABILITIES compile option 2009-11-24 15:06:47 +11:00
cgroup_freezer.c
cgroup.c cgroup: fix strstrip() misuse 2009-10-29 07:39:25 -07:00
compat.c
configs.c
cpu.c Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-12-12 11:34:10 -08:00
cpuset.c sched: Fix balance vs hotplug race 2009-12-06 21:10:56 +01:00
cred-internals.h
cred.c
delayacct.c
dma.c
exec_domain.c
exit.c tty: Move the leader test in disassociate 2009-12-11 15:18:08 -08:00
extable.c
fork.c Merge branch 'for-2.6.33' of git://git.kernel.dk/linux-2.6-block 2009-12-08 08:19:16 -08:00
freezer.c
futex_compat.c
futex.c futex: Take mmap_sem for get_user_pages in fault_in_user_writeable 2009-12-08 14:59:36 +01:00
groups.c
hrtimer.c hrtimer: move timer stats helper functions to hrtimer.c 2009-12-10 13:08:11 +01:00
hung_task.c softlockup: Fix hung_task_check_count sysctl 2009-11-27 06:21:57 +01:00
hw_breakpoint.c hw-breakpoints: Modify breakpoints without unregistering them 2009-12-09 09:48:20 +01:00
itimer.c itimers: Fix racy writes to cpu_itimer fields 2009-11-18 16:32:12 +01:00
kallsyms.c hw-breakpoints: Fix broken hw-breakpoint sample module 2009-11-10 11:23:29 +01:00
Kconfig.freezer
Kconfig.hz
Kconfig.locks mutex: Better control mutex adaptive spinning config 2009-12-03 11:50:11 +01:00
Kconfig.preempt
kexec.c
kfifo.c
kgdb.c kgdb: Always process the whole breakpoint list on activate or deactivate 2009-12-11 08:43:20 -06:00
kmod.c security: report the module name to security_module_request 2009-11-10 09:33:46 +11:00
kprobes.c Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-12-05 15:30:21 -08:00
ksysfs.c
kthread.c sched: Fix kthread_bind() by moving the body of kthread_bind() to sched.c 2009-11-03 07:25:00 +01:00
latencytop.c
lockdep_internals.h
lockdep_proc.c
lockdep_states.h
lockdep.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu 2009-12-14 09:58:24 -08:00
Makefile Merge branch 'kvm-updates/2.6.33' of git://git.kernel.org/pub/scm/virt/kvm/kvm 2009-12-08 08:02:38 -08:00
module.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu 2009-12-14 09:58:24 -08:00
mutex-debug.c
mutex-debug.h
mutex.c mutex: Better control mutex adaptive spinning config 2009-12-03 11:50:11 +01:00
mutex.h
notifier.c
ns_cgroup.c
nsproxy.c
panic.c
params.c param: fix setting arrays of bool 2009-10-29 08:56:20 +10:30
perf_event.c Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-12-11 20:47:30 -08:00
pid_namespace.c
pid.c
pm_qos_params.c pm_qos: clean up racy global "name" variable 2009-10-14 15:31:10 +02:00
posix-cpu-timers.c posix-cpu-timers: optimize and document timer_create callback 2009-11-18 12:36:05 +01:00
posix-timers.c
printk.c Merge branch 'core-printk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-12-05 09:50:22 -08:00
profile.c
ptrace.c
rcupdate.c rcu: Re-arrange code to reduce #ifdef pain 2009-11-22 18:58:16 +01:00
rcutiny.c rcu: Eliminate unneeded function wrapping 2009-11-22 18:58:16 +01:00
rcutorture.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu 2009-12-14 09:58:24 -08:00
rcutree_plugin.h rcu: Add expedited grace-period support for preemptible RCU 2009-12-03 11:35:25 +01:00
rcutree_trace.c rcu: Add expedited grace-period support for preemptible RCU 2009-12-03 11:35:25 +01:00
rcutree.c rcu: Add expedited grace-period support for preemptible RCU 2009-12-03 11:35:25 +01:00
rcutree.h rcu: Add expedited grace-period support for preemptible RCU 2009-12-03 11:35:25 +01:00
relay.c
res_counter.c
resource.c resources: when allocate_resource() fails, leave resource untouched 2009-11-04 13:06:46 -08:00
rtmutex_common.h
rtmutex-debug.c
rtmutex-debug.h
rtmutex-tester.c
rtmutex.c
rtmutex.h
rwsem.c
sched_clock.c
sched_cpupri.c
sched_cpupri.h
sched_debug.c sched: Remove forced2_migrations stats 2009-12-10 20:32:39 +01:00
sched_fair.c sched: Update normalized values on user updates via proc 2009-12-09 10:04:02 +01:00
sched_features.h sched: Discard some old bits 2009-12-09 10:03:07 +01:00
sched_idletask.c sched: Protect sched_rr_get_param() access to task->sched_class 2009-12-09 10:01:07 +01:00
sched_rt.c sched: Protect sched_rr_get_param() access to task->sched_class 2009-12-09 10:01:07 +01:00
sched_stats.h
sched.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu 2009-12-14 09:58:24 -08:00
seccomp.c
semaphore.c
signal.c Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-12-05 15:30:21 -08:00
slow-work-debugfs.c SLOW_WORK: Move slow_work's proc file to debugfs 2009-12-01 08:20:31 -08:00
slow-work.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/sysctl-2.6 2009-12-08 07:38:50 -08:00
slow-work.h SLOW_WORK: Move slow_work's proc file to debugfs 2009-12-01 08:20:31 -08:00
smp.c generic-ipi: Add smp_call_function_any() 2009-11-18 14:52:25 +01:00
softirq.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu 2009-12-14 09:58:24 -08:00
softlockup.c percpu: make percpu symbols under kernel/ and mm/ unique 2009-10-29 22:34:13 +09:00
spinlock.c locking: Reduce ifdefs in kernel/spinlock.c 2009-11-13 20:53:28 +01:00
srcu.c rcu: Add synchronize_srcu_expedited() 2009-10-26 09:40:30 +01:00
stacktrace.c
stop_machine.c
sys_ni.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6 2009-12-08 07:55:01 -08:00
sys.c Merge branch 'bkl-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-12-09 08:07:17 -08:00
sysctl_binary.c sysctl binary: Reorder the tests to process wild card entries first. 2009-11-12 01:42:31 -08:00
sysctl_check.c ipv4 05/05: add sysctl to accept packets with local source addresses 2009-12-03 12:14:38 -08:00
sysctl.c hugetlb: derive huge pages nodes allowed from task mempolicy 2009-12-15 08:53:12 -08:00
taskstats.c
test_kprobes.c
time.c Merge branch 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-12-08 19:27:08 -08:00
timeconst.pl
timer.c
tracepoint.c
tsacct.c
uid16.c
up.c
user_namespace.c
user-return-notifier.c core: Clean up user return notifers use of per_cpu 2009-12-02 10:22:59 +01:00
user.c uids: Prevent tear down race 2009-11-02 16:02:39 +01:00
utsname_sysctl.c sysctl kernel: Remove binary sysctl logic 2009-11-12 02:04:55 -08:00
utsname.c
wait.c
workqueue.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq 2009-12-10 09:35:44 -08:00