forked from luck/tmp_suning_uos_patched
4efaceb1c5
swap_extent is used to map swap page offset to backing device's block offset. For a continuous block range, one swap_extent is used and all these swap_extents are managed in a linked list. These swap_extents are used by map_swap_entry() during swap's read and write path. To find out the backing device's block offset for a page offset, the swap_extent list will be traversed linearly, with curr_swap_extent being used as a cache to speed up the search. This works well as long as swap_extents are not huge or when the number of processes that access swap device are few, but when the swap device has many extents and there are a number of processes accessing the swap device concurrently, it can be a problem. On one of our servers, the disk's remaining size is tight: $df -h Filesystem Size Used Avail Use% Mounted on ... ... /dev/nvme0n1p1 1.8T 1.3T 504G 72% /home/t4 When creating a 80G swapfile there, there are as many as 84656 swap extents. The end result is, kernel spends abou 30% time in map_swap_entry() and swap throughput is only 70MB/s. As a comparison, when I used smaller sized swapfile, like 4G whose swap_extent dropped to 2000, swap throughput is back to 400-500MB/s and map_swap_entry() is about 3%. One downside of using rbtree for swap_extent is, 'struct rbtree' takes 24 bytes while 'struct list_head' takes 16 bytes, that's 8 bytes more for each swap_extent. For a swapfile that has 80k swap_extents, that means 625KiB more memory consumed. Test: Since it's not possible to reboot that server, I can not test this patch diretly there. Instead, I tested it on another server with NVMe disk. I created a 20G swapfile on an NVMe backed XFS fs. By default, the filesystem is quite clean and the created swapfile has only 2 extents. Testing vanilla and this patch shows no obvious performance difference when swapfile is not fragmented. To see the patch's effects, I used some tweaks to manually fragment the swapfile by breaking the extent at 1M boundary. This made the swapfile have 20K extents. nr_task=4 kernel swapout(KB/s) map_swap_entry(perf) swapin(KB/s) map_swap_entry(perf) vanilla 165191 90.77% 171798 90.21% patched 858993 +420% 2.16% 715827 +317% 0.77% nr_task=8 kernel swapout(KB/s) map_swap_entry(perf) swapin(KB/s) map_swap_entry(perf) vanilla 306783 92.19% 318145 87.76% patched 954437 +211% 2.35% 1073741 +237% 1.57% swapout: the throughput of swap out, in KB/s, higher is better 1st map_swap_entry: cpu cycles percent sampled by perf swapin: the throughput of swap in, in KB/s, higher is better. 2nd map_swap_entry: cpu cycles percent sampled by perf nr_task=1 doesn't show any difference, this is due to the curr_swap_extent can be effectively used to cache the correct swap extent for single task workload. [akpm@linux-foundation.org: s/BUG_ON(1)/BUG()/] Link: http://lkml.kernel.org/r/20190523142404.GA181@aaronlu Signed-off-by: Aaron Lu <ziqian.lzq@antfin.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
---|---|---|
.. | ||
kasan | ||
backing-dev.c | ||
balloon_compaction.c | ||
cleancache.c | ||
cma_debug.c | ||
cma.c | ||
cma.h | ||
compaction.c | ||
debug_page_ref.c | ||
debug.c | ||
dmapool.c | ||
early_ioremap.c | ||
fadvise.c | ||
failslab.c | ||
filemap.c | ||
frame_vector.c | ||
frontswap.c | ||
gup_benchmark.c | ||
gup.c | ||
highmem.c | ||
hmm.c | ||
huge_memory.c | ||
hugetlb_cgroup.c | ||
hugetlb.c | ||
hwpoison-inject.c | ||
init-mm.c | ||
internal.h | ||
interval_tree.c | ||
Kconfig | ||
Kconfig.debug | ||
khugepaged.c | ||
kmemleak-test.c | ||
kmemleak.c | ||
ksm.c | ||
list_lru.c | ||
maccess.c | ||
madvise.c | ||
Makefile | ||
memblock.c | ||
memcontrol.c | ||
memfd.c | ||
memory_hotplug.c | ||
memory-failure.c | ||
memory.c | ||
mempolicy.c | ||
mempool.c | ||
memtest.c | ||
migrate.c | ||
mincore.c | ||
mlock.c | ||
mm_init.c | ||
mmap.c | ||
mmu_context.c | ||
mmu_gather.c | ||
mmu_notifier.c | ||
mmzone.c | ||
mprotect.c | ||
mremap.c | ||
msync.c | ||
nommu.c | ||
oom_kill.c | ||
page_alloc.c | ||
page_counter.c | ||
page_ext.c | ||
page_idle.c | ||
page_io.c | ||
page_isolation.c | ||
page_owner.c | ||
page_poison.c | ||
page_vma_mapped.c | ||
page-writeback.c | ||
pagewalk.c | ||
percpu-internal.h | ||
percpu-km.c | ||
percpu-stats.c | ||
percpu-vm.c | ||
percpu.c | ||
pgtable-generic.c | ||
process_vm_access.c | ||
quicklist.c | ||
readahead.c | ||
rmap.c | ||
rodata_test.c | ||
shmem.c | ||
shuffle.c | ||
shuffle.h | ||
slab_common.c | ||
slab.c | ||
slab.h | ||
slob.c | ||
slub.c | ||
sparse-vmemmap.c | ||
sparse.c | ||
swap_cgroup.c | ||
swap_slots.c | ||
swap_state.c | ||
swap.c | ||
swapfile.c | ||
truncate.c | ||
usercopy.c | ||
userfaultfd.c | ||
util.c | ||
vmacache.c | ||
vmalloc.c | ||
vmpressure.c | ||
vmscan.c | ||
vmstat.c | ||
workingset.c | ||
z3fold.c | ||
zbud.c | ||
zpool.c | ||
zsmalloc.c | ||
zswap.c |