forked from luck/tmp_suning_uos_patched
e11da5b4fd
There have been numerous reports of stalls that pointed at the problem
being somewhere in the VM. There are multiple roots to the problems which
means dealing with any of the root problems in isolation is tricky to
justify on their own and they would still need integration testing. This
patch series puts together two different patch sets which in combination
should tackle some of the root causes of latency problems being reported.
Patch 1 adds a tracepoint for shrink_inactive_list. For this series, the
most important results is being able to calculate the scanning/reclaim
ratio as a measure of the amount of work being done by page reclaim.
Patch 2 accounts for time spent in congestion_wait.
Patches 3-6 were originally developed by Kosaki Motohiro but reworked for
this series. It has been noted that lumpy reclaim is far too aggressive
and trashes the system somewhat. As SLUB uses high-order allocations, a
large cost incurred by lumpy reclaim will be noticeable. It was also
reported during transparent hugepage support testing that lumpy reclaim
was trashing the system and these patches should mitigate that problem
without disabling lumpy reclaim.
Patch 7 adds wait_iff_congested() and replaces some callers of
congestion_wait(). wait_iff_congested() only sleeps if there is a BDI
that is currently congested. Patch 8 notes that any BDI being congested
is not necessarily a problem because there could be multiple BDIs of
varying speeds and numberous zones. It attempts to track when a zone
being reclaimed contains many pages backed by a congested BDI and if so,
reclaimers wait on the congestion queue.
I ran a number of tests with monitoring on X86, X86-64 and PPC64. Each
machine had 3G of RAM and the CPUs were
X86: Intel P4 2-core
X86-64: AMD Phenom 4-core
PPC64: PPC970MP
Each used a single disk and the onboard IO controller. Dirty ratio was
left at 20. I'm just going to report for X86-64 and PPC64 in a vague
attempt to keep this report short. Four kernels were tested each based on
v2.6.36-rc4
traceonly-v2r2: Patches 1 and 2 to instrument vmscan reclaims and congestion_wait
lowlumpy-v2r3: Patches 1-6 to test if lumpy reclaim is better
waitcongest-v2r3: Patches 1-7 to only wait on congestion
waitwriteback-v2r4: Patches 1-8 to detect when a zone is congested
nocongest-v1r5: Patches 1-3 for testing wait_iff_congestion
nodirect-v1r5: Patches 1-10 to disable filesystem writeback for better IO
The tests run were as follows
kernbench
compile-based benchmark. Smoke test performance
sysbench
OLTP read-only benchmark. Will be re-run in the future as read-write
micro-mapped-file-stream
This is a micro-benchmark from Johannes Weiner that accesses a
large sparse-file through mmap(). It was configured to run in only
single-CPU mode but can be indicative of how well page reclaim
identifies suitable pages.
stress-highalloc
Tries to allocate huge pages under heavy load.
kernbench, iozone and sysbench did not report any performance regression
on any machine. sysbench did pressure the system lightly and there was
reclaim activity but there were no difference of major interest between
the kernels.
X86-64 micro-mapped-file-stream
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4
pgalloc_dma 1639.00 ( 0.00%) 667.00 (-145.73%) 1167.00 ( -40.45%) 578.00 (-183.56%)
pgalloc_dma32 2842410.00 ( 0.00%) 2842626.00 ( 0.01%) 2843043.00 ( 0.02%) 2843014.00 ( 0.02%)
pgalloc_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgsteal_dma 729.00 ( 0.00%) 85.00 (-757.65%) 609.00 ( -19.70%) 125.00 (-483.20%)
pgsteal_dma32 2338721.00 ( 0.00%) 2447354.00 ( 4.44%) 2429536.00 ( 3.74%) 2436772.00 ( 4.02%)
pgsteal_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_kswapd_dma 1469.00 ( 0.00%) 532.00 (-176.13%) 1078.00 ( -36.27%) 220.00 (-567.73%)
pgscan_kswapd_dma32 4597713.00 ( 0.00%) 4503597.00 ( -2.09%) 4295673.00 ( -7.03%) 3891686.00 ( -18.14%)
pgscan_kswapd_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_direct_dma 71.00 ( 0.00%) 134.00 ( 47.01%) 243.00 ( 70.78%) 352.00 ( 79.83%)
pgscan_direct_dma32 305820.00 ( 0.00%) 280204.00 ( -9.14%) 600518.00 ( 49.07%) 957485.00 ( 68.06%)
pgscan_direct_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pageoutrun 16296.00 ( 0.00%) 21254.00 ( 23.33%) 18447.00 ( 11.66%) 20067.00 ( 18.79%)
allocstall 443.00 ( 0.00%) 273.00 ( -62.27%) 513.00 ( 13.65%) 1568.00 ( 71.75%)
These are based on the raw figures taken from /proc/vmstat. It's a rough
measure of reclaim activity. Note that allocstall counts are higher
because we are entering direct reclaim more often as a result of not
sleeping in congestion. In itself, it's not necessarily a bad thing.
It's easier to get a view of what happened from the vmscan tracepoint
report.
FTrace Reclaim Statistics: vmscan
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4
Direct reclaims 443 273 513 1568
Direct reclaim pages scanned 305968 280402 600825 957933
Direct reclaim pages reclaimed 43503 19005 30327 117191
Direct reclaim write file async I/O 0 0 0 0
Direct reclaim write anon async I/O 0 3 4 12
Direct reclaim write file sync I/O 0 0 0 0
Direct reclaim write anon sync I/O 0 0 0 0
Wake kswapd requests 187649 132338 191695 267701
Kswapd wakeups 3 1 4 1
Kswapd pages scanned 4599269 4454162 4296815 3891906
Kswapd pages reclaimed 2295947 2428434 2399818 2319706
Kswapd reclaim write file async I/O 1 0 1 1
Kswapd reclaim write anon async I/O 59 187 41 222
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 4.34 2.52 6.63 2.96
Time kswapd awake (seconds) 11.15 10.25 11.01 10.19
Total pages scanned 4905237 4734564 4897640 4849839
Total pages reclaimed
|
||
---|---|---|
.. | ||
backing-dev.c | ||
bootmem.c | ||
bounce.c | ||
compaction.c | ||
debug-pagealloc.c | ||
dmapool.c | ||
fadvise.c | ||
failslab.c | ||
filemap_xip.c | ||
filemap.c | ||
fremap.c | ||
highmem.c | ||
hugetlb.c | ||
hwpoison-inject.c | ||
init-mm.c | ||
internal.h | ||
Kconfig | ||
Kconfig.debug | ||
kmemcheck.c | ||
kmemleak-test.c | ||
kmemleak.c | ||
ksm.c | ||
maccess.c | ||
madvise.c | ||
Makefile | ||
memblock.c | ||
memcontrol.c | ||
memory_hotplug.c | ||
memory-failure.c | ||
memory.c | ||
mempolicy.c | ||
mempool.c | ||
migrate.c | ||
mincore.c | ||
mlock.c | ||
mm_init.c | ||
mmap.c | ||
mmu_context.c | ||
mmu_notifier.c | ||
mmzone.c | ||
mprotect.c | ||
mremap.c | ||
msync.c | ||
nommu.c | ||
oom_kill.c | ||
page_alloc.c | ||
page_cgroup.c | ||
page_io.c | ||
page_isolation.c | ||
page-writeback.c | ||
pagewalk.c | ||
percpu-km.c | ||
percpu-vm.c | ||
percpu.c | ||
prio_tree.c | ||
quicklist.c | ||
readahead.c | ||
rmap.c | ||
shmem.c | ||
slab.c | ||
slob.c | ||
slub.c | ||
sparse-vmemmap.c | ||
sparse.c | ||
swap_state.c | ||
swap.c | ||
swapfile.c | ||
thrash.c | ||
truncate.c | ||
util.c | ||
vmalloc.c | ||
vmscan.c | ||
vmstat.c |