Peter Williams 2dd73a4f09 [PATCH] sched: implement smpnice
Problem:

The introduction of separate run queues per CPU has brought with it "nice"
enforcement problems that are best described by a simple example.

For the sake of argument, suppose that on a single CPU machine with a
nice==19 hard spinner and a nice==0 hard spinner running, the nice==0
task gets 95% of the CPU and the nice==19 task gets 5% of the CPU.  Now
suppose that there is a system with 2 CPUs running 2 nice==19 hard spinners
and 2 nice==0 hard spinners.  The user of this system would be entitled
to expect that the nice==0 tasks each get 95% of a CPU and the nice==19
tasks only get 5% each.  However, whether this expectation is met is pretty
much down to luck, as there are four equally likely distributions of the
tasks to the CPUs that the load balancing code will consider to be balanced,
with loads of 2.0 for each CPU.  Two of these distributions put one
nice==0 and one nice==19 task on each CPU, and in these circumstances the
user's expectations will be met.  The other two distributions put both
nice==0 tasks on one CPU and both nice==19 tasks on the other, in which
case each task will get 50% of a CPU and the user's expectations will not
be met.

Solution:

The solution to this problem implemented in the attached patch is
to use weighted loads when determining if the system is balanced and, when
an imbalance is detected, to move an amount of weighted load between run
queues (as opposed to a number of tasks) to restore the balance.  Once
again, the easiest way to explain why both of these measures are necessary
is to use a simple example.  Suppose (in a slight variation of the
above example) that we have a two CPU system with 4 nice==0 and 4 nice==19
hard spinning tasks running, and that the 4 nice==0 tasks are on one CPU
and the 4 nice==19 tasks are on the other CPU.  The weighted loads for the
two CPUs would be 4.0 and 0.2 respectively, and the load balancing code
would move 2 tasks, resulting in one CPU with a load of 2.0 and the other
with a load of 2.2.  If this were considered a big enough imbalance to
justify moving a task and that task were moved using the current
move_tasks(), then it would move the highest priority task that it found,
and this would result in one CPU with a load of 3.0 and the other with a
load of 1.2; that in turn would trigger the movement of a task in the
opposite direction, and so on -- an infinite loop.  If, on the other hand,
an amount of load to be moved is calculated from the imbalance (in this
case 0.1) and move_tasks() skips tasks until it finds ones whose
contributions to the weighted load are less than this amount, it would
move two of the nice==19 tasks, resulting in a system with 2 nice==0 and
2 nice==19 tasks on each CPU and loads of 2.1 for each CPU.
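
To sanity-check the arithmetic, here is a trivial standalone C sketch of
the example; the per-task weights of 1.0 (nice==0) and 0.05 (nice==19) are
the figures implied by the example above, not the scaling the patch
actually uses.

#include <stdio.h>

/* Weighted-load arithmetic from the example above.  The per-task weights
 * (nice==0 -> 1.0, nice==19 -> 0.05) are the figures implied by the text,
 * not the scaling the patch actually uses. */
int main(void)
{
	double w0 = 1.0, w19 = 0.05;

	/* Start: 4 nice==0 tasks on cpu0, 4 nice==19 tasks on cpu1. */
	double cpu0 = 4 * w0;		/* 4.0 */
	double cpu1 = 4 * w19;		/* 0.2 */

	/* Balancing by task count moves two tasks: 2.0 vs 2.2 ... */
	cpu0 -= 2 * w0;
	cpu1 += 2 * w0;

	/* ... while balancing by weighted load computes the amount to move
	 * from the remaining imbalance (0.1) and moves two nice==19 tasks. */
	double amount = (cpu1 - cpu0) / 2;
	cpu1 -= 2 * w19;
	cpu0 += 2 * w19;

	printf("amount=%.2f  cpu0=%.2f  cpu1=%.2f\n", amount, cpu0, cpu1);
	return 0;
}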

One of the advantages of this mechanism is that on a system where all tasks
have nice==0 the load balancing calculations are mathematically identical
to those of the current load balancing code.

Notes:

struct task_struct:

has a new field, load_weight, which (in a trade-off of space for speed)
stores the contribution that this task makes to a CPU's weighted load when
it is runnable.

struct runqueue:

has a new field, raw_weighted_load, which is the sum of the load_weight
values for the currently runnable tasks on this run queue.  This field
always needs to be updated when nr_running is updated, so two new inline
functions, inc_nr_running() and dec_nr_running(), have been created to make
sure that this happens.  This also offers a convenient way to optimize away
this part of the smpnice mechanism when CONFIG_SMP is not defined.
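
As a rough user-space sketch of the bookkeeping described in these two
notes -- the field and helper names (load_weight, raw_weighted_load,
inc_nr_running(), dec_nr_running()) follow this changelog, but the bodies
and the example weight are illustrative only:

#include <stdio.h>

struct task_struct {
	unsigned int load_weight;	/* contribution to the CPU's weighted load */
};

struct runqueue {
	unsigned long nr_running;
	unsigned long raw_weighted_load;	/* sum of load_weight of runnable tasks */
};

/* nr_running and raw_weighted_load are always updated together; on
 * !CONFIG_SMP builds the weighted-load half can simply be compiled out. */
static inline void inc_nr_running(struct task_struct *p, struct runqueue *rq)
{
	rq->nr_running++;
	rq->raw_weighted_load += p->load_weight;
}

static inline void dec_nr_running(struct task_struct *p, struct runqueue *rq)
{
	rq->nr_running--;
	rq->raw_weighted_load -= p->load_weight;
}

int main(void)
{
	struct runqueue rq = { 0, 0 };
	struct task_struct nice0 = { .load_weight = 128 };	/* illustrative weight */

	inc_nr_running(&nice0, &rq);
	printf("nr_running=%lu raw_weighted_load=%lu\n",
	       rq.nr_running, rq.raw_weighted_load);
	dec_nr_running(&nice0, &rq);
	return 0;
}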

int try_to_wake_up():

in this function the value SCHED_LOAD_SCALE is used to represent the load
contribution of a single task in various calculations in the code that
decides which CPU to put the waking task on.  While this would be valid
on a system where the nice values of the runnable tasks were distributed
evenly around zero, it will lead to anomalous load balancing if the
distribution is skewed in either direction.  To overcome this problem
SCHED_LOAD_SCALE has been replaced by the load_weight of the relevant task
or by the average load_weight per task for the queue in question (as
appropriate).
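
For the averaged case, a minimal standalone sketch of the idea;
avg_load_per_task() is a made-up helper name and the SCHED_LOAD_SCALE
value is only illustrative:

#include <stdio.h>

#define SCHED_LOAD_SCALE 128UL	/* fixed per-task unit; value illustrative */

struct runqueue {
	unsigned long nr_running;
	unsigned long raw_weighted_load;
};

/* Average contribution of one runnable task on this queue; this is the
 * kind of quantity that replaces the fixed SCHED_LOAD_SCALE unit when the
 * waking task's own weight is not the right thing to use. */
static unsigned long avg_load_per_task(const struct runqueue *rq)
{
	return rq->nr_running ? rq->raw_weighted_load / rq->nr_running
			      : SCHED_LOAD_SCALE;
}

int main(void)
{
	/* Four nice==19 tasks: the average is far below the fixed unit, so
	 * "load minus one task" computed with SCHED_LOAD_SCALE would be
	 * badly skewed. */
	struct runqueue rq = { .nr_running = 4, .raw_weighted_load = 32 };

	printf("avg per task: %lu vs fixed unit: %lu\n",
	       avg_load_per_task(&rq), SCHED_LOAD_SCALE);
	return 0;
}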

int move_tasks():

The modifications to this function were complicated by the fact that
active_load_balance() uses it to move exactly one task without checking
whether an imbalance actually exists.  This precluded simply overloading
max_nr_move with max_load_move and necessitated adding the latter as an
extra argument to the function.  The internal implementation was then
modified to move up to max_nr_move tasks and max_load_move of weighted
load.  This slightly complicates the code where move_tasks() is called,
and if active_load_balance() is ever changed to not use move_tasks() the
implementation of move_tasks() should be simplified accordingly.
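
The dual limit can be pictured with a small user-space sketch;
next_candidate() is a made-up stand-in for walking the busiest queue's
priority arrays and the weights are invented:

#include <stdio.h>
#include <stddef.h>

struct task_struct {
	unsigned int load_weight;
};

/* Stand-in for walking the busiest queue's priority arrays, highest
 * priority first; purely illustrative. */
static struct task_struct candidates[] = {
	{ 128 }, { 128 }, { 6 }, { 6 }	/* two heavy, two light tasks */
};
static size_t next;

static struct task_struct *next_candidate(void)
{
	return next < sizeof(candidates) / sizeof(candidates[0])
		? &candidates[next++] : NULL;
}

/* Move tasks until either the task-count budget (max_nr_move) or the
 * weighted-load budget (max_load_move) is exhausted; a task heavier than
 * the remaining load budget is skipped rather than moved. */
static unsigned long move_tasks_sketch(unsigned long max_nr_move,
				       unsigned long max_load_move)
{
	unsigned long pulled = 0, load_moved = 0;
	struct task_struct *p;

	while (pulled < max_nr_move && (p = next_candidate()) != NULL) {
		if (load_moved + p->load_weight > max_load_move)
			continue;	/* too heavy for what is left to move */
		/* ...dequeue from busiest, enqueue on this CPU... */
		load_moved += p->load_weight;
		pulled++;
	}
	return pulled;
}

int main(void)
{
	/* With a load budget of 12 only the two light tasks are moved. */
	printf("moved %lu tasks\n", move_tasks_sketch(4, 12));
	return 0;
}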

struct sched_group *find_busiest_group():

Similar to try_to_wake_up(), there are places in this function where
SCHED_LOAD_SCALE is used to represent the load contribution of a single
task and the same issues arise.  A similar solution is adopted, except
that it is now the average per task contribution to a group's load (as
opposed to a run queue's) that is required.  As this value is not directly
available from the group, it is calculated on the fly as the queues in the
group are visited when determining the busiest group.
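
Schematically, the on-the-fly calculation looks something like the
following standalone sketch (not the kernel's find_busiest_group(); the
weights are invented):

#include <stdio.h>

struct runqueue {
	unsigned long nr_running;
	unsigned long raw_weighted_load;
};

/* Accumulate the group's weighted load and runnable-task count while its
 * member queues are visited, then derive the average contribution of one
 * task; this average is what stands in for the fixed per-task unit. */
static unsigned long group_avg_load_per_task(const struct runqueue *rqs, int n)
{
	unsigned long sum_load = 0, sum_nr = 0;
	int i;

	for (i = 0; i < n; i++) {
		sum_load += rqs[i].raw_weighted_load;
		sum_nr += rqs[i].nr_running;
	}
	return sum_nr ? sum_load / sum_nr : 0;
}

int main(void)
{
	struct runqueue group[2] = {
		{ .nr_running = 1, .raw_weighted_load = 128 },	/* one nice==0 task */
		{ .nr_running = 2, .raw_weighted_load = 16 },	/* two low-priority tasks */
	};

	printf("avg load per task in group: %lu\n",
	       group_avg_load_per_task(group, 2));
	return 0;
}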

A key change to this function is that it is no longer necessary to scale
down *imbalance on exit, as move_tasks() uses the load in its scaled form.

void set_user_nice():

has been modified to update the task's load_weight field when its nice
value changes and also to ensure that its run queue's raw_weighted_load
field is updated if the task was runnable.
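
A minimal sketch of that bookkeeping; weight_of() is a made-up stand-in
for the scheduler's actual nice-to-weight mapping:

#include <stdio.h>

struct runqueue {
	unsigned long raw_weighted_load;
};

struct task_struct {
	int nice;
	unsigned int load_weight;
	int on_rq;			/* is the task currently runnable? */
	struct runqueue *rq;
};

/* Stand-in for the scheduler's nice -> weight mapping (illustrative values:
 * nice 0 maps to 128, more positive nice maps to smaller weights). */
static unsigned int weight_of(int nice)
{
	return nice <= 0 ? 128 : 128 / (nice + 1);
}

/* Recompute load_weight when the nice value changes and, if the task is
 * runnable, fix up its run queue's raw_weighted_load accordingly. */
static void set_user_nice_sketch(struct task_struct *p, int nice)
{
	unsigned int old = p->load_weight;

	p->nice = nice;
	p->load_weight = weight_of(nice);
	if (p->on_rq) {
		p->rq->raw_weighted_load -= old;
		p->rq->raw_weighted_load += p->load_weight;
	}
}

int main(void)
{
	struct runqueue rq = { .raw_weighted_load = 128 };
	struct task_struct p = { .nice = 0, .load_weight = 128, .on_rq = 1, .rq = &rq };

	set_user_nice_sketch(&p, 19);
	printf("raw_weighted_load=%lu\n", rq.raw_weighted_load);
	return 0;
}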

From: "Siddha, Suresh B" <suresh.b.siddha@intel.com>

With smpnice, sched groups containing the highest priority tasks can mask
an imbalance between the other sched groups within the same domain.  This
patch fixes some of the scenarios listed below by not considering sched
groups which are lightly loaded.

a) on a simple 4-way MP system, if we have one high priority and 4 normal
   priority tasks, with smpnice we would like to see the high priority task
   scheduled on one cpu, two other cpus getting one normal task each and the
   fourth cpu getting the remaining two normal tasks.  But with the current
   smpnice the extra normal priority task keeps jumping from one cpu to
   another cpu having a normal priority task.  This is because of the
   busiest_has_loaded_cpus, nr_loaded_cpus logic: we are not including the
   cpu with the high priority task in the max_load calculations but are
   including it in the total and avg_load calculations, leading to
   max_load < avg_load, and the load balance between the cpus running
   normal priority tasks (2 vs 1) will always show an imbalance, so the
   extra normal priority task will keep moving from one cpu to another cpu
   having a normal priority task.

b) 4-way system with HT (8 logical processors).  Package-P0: T0 has the
   highest priority task, T1 is idle.  Package-P1: both T0 and T1 have 1
   normal priority task each.  P2 and P3 are idle.  With this patch, one of
   the normal priority tasks on P1 will be moved to P2 or P3.

c) With the current weighted smp nice calculations, it doesn't always make
   sense to look at the highest weighted runqueue in the busy group.
   Consider a load balance scenario on a DP with HT system, with Package-0
   containing one high priority and one low priority task, and Package-1
   containing one low priority task (with the other thread being idle).
   Package-1 thinks that it needs to take the low priority thread from
   Package-0, but find_busiest_queue() returns the cpu thread with the
   highest priority task, and ultimately (with the help of active load
   balance) we move the high priority task to Package-1.  The same then
   continues with Package-0, moving the high priority task from Package-1
   back to Package-0.  Even without active load balance, load balancing
   will fail to balance the above scenario.  Fix find_busiest_queue() to
   use "imbalance" when the queue is lightly loaded.

[kernel@kolivas.org: sched: store weighted load on up]
[kernel@kolivas.org: sched: add discrete weighted cpu load function]
[suresh.b.siddha@intel.com: sched: remove dead code]
Signed-off-by: Peter Williams <pwil3058@bigpond.com.au>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Con Kolivas <kernel@kolivas.org>
Cc: John Hawkes <hawkes@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:44 -07:00