kernel_optimize_test/mm
Laurent Dufour f85086f95f mm: don't rely on system state to detect hot-plug operations
In register_mem_sect_under_node() the system_state's value is checked to
detect whether the call is made during boot time or during an hot-plug
operation.  Unfortunately, that check against SYSTEM_BOOTING is wrong
because regular memory is registered at SYSTEM_SCHEDULING state.  In
addition, memory hot-plug operation can be triggered at this system
state by the ACPI [1].  So checking against the system state is not
enough.

The consequence is that on system with interleaved node's ranges like this:

 Early memory node ranges
   node   1: [mem 0x0000000000000000-0x000000011fffffff]
   node   2: [mem 0x0000000120000000-0x000000014fffffff]
   node   1: [mem 0x0000000150000000-0x00000001ffffffff]
   node   0: [mem 0x0000000200000000-0x000000048fffffff]
   node   2: [mem 0x0000000490000000-0x00000007ffffffff]

This can be seen on PowerPC LPAR after multiple memory hot-plug and
hot-unplug operations are done.  At the next reboot the node's memory
ranges can be interleaved and since the call to link_mem_sections() is
made in topology_init() while the system is in the SYSTEM_SCHEDULING
state, the node's id is not checked, and the sections registered to
multiple nodes:

  $ ls -l /sys/devices/system/memory/memory21/node*
  total 0
  lrwxrwxrwx 1 root root     0 Aug 24 05:27 node1 -> ../../node/node1
  lrwxrwxrwx 1 root root     0 Aug 24 05:27 node2 -> ../../node/node2

In that case, the system is able to boot but if later one of theses
memory blocks is hot-unplugged and then hot-plugged, the sysfs
inconsistency is detected and this is triggering a BUG_ON():

  kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
  Oops: Exception in kernel mode, sig: 5 [#1]
  LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
  Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
  CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
  Call Trace:
    add_memory_resource+0x23c/0x340 (unreliable)
    __add_memory+0x5c/0xf0
    dlpar_add_lmb+0x1b4/0x500
    dlpar_memory+0x1f8/0xb80
    handle_dlpar_errorlog+0xc0/0x190
    dlpar_store+0x198/0x4a0
    kobj_attr_store+0x30/0x50
    sysfs_kf_write+0x64/0x90
    kernfs_fop_write+0x1b0/0x290
    vfs_write+0xe8/0x290
    ksys_write+0xdc/0x130
    system_call_exception+0x160/0x270
    system_call_common+0xf0/0x27c

This patch addresses the root cause by not relying on the system_state
value to detect whether the call is due to a hot-plug operation.  An
extra parameter is added to link_mem_sections() detailing whether the
operation is due to a hot-plug operation.

[1] According to Oscar Salvador, using this qemu command line, ACPI
memory hotplug operations are raised at SYSTEM_SCHEDULING state:

  $QEMU -enable-kvm -machine pc -smp 4,sockets=4,cores=1,threads=1 -cpu host -monitor pty \
        -m size=$MEM,slots=255,maxmem=4294967296k  \
        -numa node,nodeid=0,cpus=0-3,mem=512 -numa node,nodeid=1,mem=512 \
        -object memory-backend-ram,id=memdimm0,size=134217728 -device pc-dimm,node=0,memdev=memdimm0,id=dimm0,slot=0 \
        -object memory-backend-ram,id=memdimm1,size=134217728 -device pc-dimm,node=0,memdev=memdimm1,id=dimm1,slot=1 \
        -object memory-backend-ram,id=memdimm2,size=134217728 -device pc-dimm,node=0,memdev=memdimm2,id=dimm2,slot=2 \
        -object memory-backend-ram,id=memdimm3,size=134217728 -device pc-dimm,node=0,memdev=memdimm3,id=dimm3,slot=3 \
        -object memory-backend-ram,id=memdimm4,size=134217728 -device pc-dimm,node=1,memdev=memdimm4,id=dimm4,slot=4 \
        -object memory-backend-ram,id=memdimm5,size=134217728 -device pc-dimm,node=1,memdev=memdimm5,id=dimm5,slot=5 \
        -object memory-backend-ram,id=memdimm6,size=134217728 -device pc-dimm,node=1,memdev=memdimm6,id=dimm6,slot=6 \

Fixes: 4fbce63391 ("mm/memory_hotplug.c: make register_mem_sect_under_node() a callback of walk_memory_range()")
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Scott Cheloha <cheloha@linux.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200915094143.79181-3-ldufour@linux.ibm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-26 10:33:57 -07:00
..
kasan Kbuild updates for v5.9 2020-08-09 14:10:26 -07:00
backing-dev.c writeback: remove struct bdi_writeback_congested 2020-07-08 17:05:53 -06:00
balloon_compaction.c
cleancache.c
cma_debug.c debugfs: make sure we can remove u32_array files cleanly 2020-07-10 13:54:00 -07:00
cma.c cma: don't quit at first error when activating reserved areas 2020-08-12 10:57:57 -07:00
cma.h mm: cma: fix the name of CMA areas 2020-08-12 10:57:57 -07:00
compaction.c mm: replace hpage_nr_pages with thp_nr_pages 2020-08-14 19:56:56 -07:00
debug_page_ref.c
debug_vm_pgtable.c Documentation/mm: add descriptions for arch page table helpers 2020-08-07 11:33:23 -07:00
debug.c mm, dump_page: do not crash with bad compound_mapcount() 2020-08-07 11:33:23 -07:00
dmapool.c mm/dmapool.c: micro-optimisation remove unnecessary branch 2020-04-07 10:43:42 -07:00
early_ioremap.c mm/early_ioremap.c: use %pa to print resource_size_t variables 2020-01-31 10:30:38 -08:00
fadvise.c mm: return void from various readahead functions 2020-06-02 10:59:06 -07:00
failslab.c
filemap.c mm: fix wake_page_function() comment typos 2020-09-20 10:38:47 -07:00
frame_vector.c mmap locking API: convert mmap_sem comments 2020-06-09 09:39:14 -07:00
frontswap.c mm/frontswap: mark various intentional data races 2020-08-14 19:56:56 -07:00
gup_benchmark.c mm/gup_benchmark: support pin_user_pages() and related calls 2020-04-02 09:35:27 -07:00
gup.c mm/gup: fix gup_fast with dynamic page table folding 2020-09-26 10:33:57 -07:00
highmem.c
hmm.c mm: do page fault accounting in handle_mm_fault 2020-08-12 10:58:02 -07:00
huge_memory.c mm/thp: fix __split_huge_pmd_locked() for migration PMD 2020-09-19 13:13:38 -07:00
hugetlb_cgroup.c hugetlb_cgroup: convert comma to semicolon 2020-08-21 09:52:52 -07:00
hugetlb.c mm/hugetlb: fix a race between hugetlb sysctl handlers 2020-09-05 12:14:30 -07:00
hwpoison-inject.c
init-mm.c mmap locking API: add MMAP_LOCK_INITIALIZER 2020-06-09 09:39:14 -07:00
internal.h mm: replace hpage_nr_pages with thp_nr_pages 2020-08-14 19:56:56 -07:00
interval_tree.c
ioremap.c mm: move p?d_alloc_track to separate header file 2020-08-07 11:33:26 -07:00
Kconfig mm/sparse: cleanup the code surrounding memory_present() 2020-08-07 11:33:27 -07:00
Kconfig.debug treewide: replace '---help---' in Kconfig files with 'help' 2020-06-14 01:57:21 +09:00
khugepaged.c mm/khugepaged.c: fix khugepaged's request size in collapse_file 2020-09-05 12:14:30 -07:00
kmemleak-test.c
kmemleak.c mm/kmemleak: silence KCSAN splats in checksum 2020-08-14 19:56:56 -07:00
ksm.c ksm: reinstate memcg charge on copied pages 2020-09-19 13:13:38 -07:00
list_lru.c mm/list_lru: fix a data race in list_lru_count_one 2020-08-14 19:56:57 -07:00
maccess.c uaccess: add force_uaccess_{begin,end} helpers 2020-08-12 10:57:59 -07:00
madvise.c mm: madvise: fix vma user-after-free 2020-09-05 12:14:29 -07:00
Makefile mm: move lib/ioremap.c to mm/ 2020-08-07 11:33:26 -07:00
mapping_dirty_helpers.c mm/mapping_dirty_helpers: update huge page-table entry callbacks 2020-04-02 09:35:29 -07:00
memblock.c mm/memblock: expose only miminal interface to add/walk physmem 2020-07-10 15:08:09 +02:00
memcontrol.c mm: memcontrol: fix missing suffix of workingset_restore 2020-09-26 10:33:57 -07:00
memfd.c
memory_hotplug.c mm: don't rely on system state to detect hot-plug operations 2020-09-26 10:33:57 -07:00
memory-failure.c mm/migrate: introduce a standard migration target allocation function 2020-08-12 10:58:02 -07:00
memory.c mm: fix misplaced unlock_page in do_wp_page() 2020-09-24 08:41:32 -07:00
mempolicy.c mm: replace hpage_nr_pages with thp_nr_pages 2020-08-14 19:56:56 -07:00
mempool.c mm/mempool: fix a data race in mempool_free() 2020-08-14 19:56:57 -07:00
memremap.c memremap: rename MEMORY_DEVICE_DEVDAX to MEMORY_DEVICE_GENERIC 2020-09-04 09:59:59 +02:00
memtest.c
migrate.c mm/migrate: correct thp migration stats 2020-09-26 10:33:57 -07:00
mincore.c mmap locking API: use coccinelle to convert mmap_sem rwsem call sites 2020-06-09 09:39:14 -07:00
mlock.c mlock: fix unevictable_pgs event counts on THP 2020-09-19 13:13:38 -07:00
mm_init.c mm: adjust vm_committed_as_batch according to vm overcommit policy 2020-08-07 11:33:26 -07:00
mmap.c mm: remove unnecessary wrapper function do_mmap_pgoff() 2020-08-07 11:33:27 -07:00
mmu_gather.c mmap locking API: convert mmap_sem comments 2020-06-09 09:39:14 -07:00
mmu_notifier.c mm: mmu_notifier: fix and extend kerneldoc 2020-08-12 10:57:57 -07:00
mmzone.c
mprotect.c mmap locking API: convert mmap_sem comments 2020-06-09 09:39:14 -07:00
mremap.c mm/mremap: start addresses are properly aligned 2020-08-07 11:33:27 -07:00
msync.c mmap locking API: use coccinelle to convert mmap_sem rwsem call sites 2020-06-09 09:39:14 -07:00
nommu.c mm/nommu.c: delete duplicated words 2020-08-12 10:57:58 -07:00
oom_kill.c mm, oom: show process exiting information in __oom_kill_process() 2020-08-12 10:57:56 -07:00
page_alloc.c mm: replace memmap_context by meminit_context 2020-09-26 10:33:57 -07:00
page_counter.c mm/page_counter: fix various data races at memsw 2020-08-14 19:56:57 -07:00
page_ext.c mm/page_ext.c: drop pfn_present() check when onlining 2020-04-07 10:43:40 -07:00
page_idle.c mm/page_idle.c: skip offline pages 2020-06-08 11:05:55 -07:00
page_io.c mm/page_io: mark various intentional data races 2020-08-14 19:56:56 -07:00
page_isolation.c mm/memory_hotplug: drain per-cpu pages again during memory offline 2020-09-19 13:13:39 -07:00
page_owner.c mm: rename gfpflags_to_migratetype to gfp_migratetype for same convention 2020-06-03 20:09:45 -07:00
page_poison.c
page_reporting.c mm/page_reporting: add budget limit on how many pages can be reported per pass 2020-04-07 10:43:39 -07:00
page_reporting.h mm: introduce include/linux/pgtable.h 2020-06-09 09:39:13 -07:00
page_vma_mapped.c mm: replace hpage_nr_pages with thp_nr_pages 2020-08-14 19:56:56 -07:00
page-writeback.c mm: remove vm_total_pages 2020-08-07 11:33:28 -07:00
pagewalk.c mmap locking API: convert mmap_sem comments 2020-06-09 09:39:14 -07:00
percpu-internal.h mm: memcg/percpu: account percpu memory to memory cgroups 2020-08-12 10:57:55 -07:00
percpu-km.c mm: memcg/percpu: account percpu memory to memory cgroups 2020-08-12 10:57:55 -07:00
percpu-stats.c mm: memcg/percpu: account percpu memory to memory cgroups 2020-08-12 10:57:55 -07:00
percpu-vm.c mm: memcg/percpu: account percpu memory to memory cgroups 2020-08-12 10:57:55 -07:00
percpu.c percpu: fix first chunk size calculation for populated bitmap 2020-09-17 17:34:39 +00:00
pgalloc-track.h mm: move p?d_alloc_track to separate header file 2020-08-07 11:33:26 -07:00
pgtable-generic.c mm: introduce include/linux/pgtable.h 2020-06-09 09:39:13 -07:00
process_vm_access.c mm/gup: remove task_struct pointer for all gup code 2020-08-12 10:58:04 -07:00
ptdump.c mmap locking API: use coccinelle to convert mmap_sem rwsem call sites 2020-06-09 09:39:14 -07:00
readahead.c mm: use memalloc_nofs_save in readahead path 2020-06-02 10:59:07 -07:00
rmap.c mm/rmap: fixup copying of soft dirty and uffd ptes 2020-09-05 12:14:30 -07:00
rodata_test.c mm/rodata_test.c: fix missing function declaration 2020-08-21 09:52:53 -07:00
shmem.c tmpfs: restore functionality of nr_inodes=0 2020-09-19 13:13:38 -07:00
shuffle.c mm/shuffle: remove dynamic reconfiguration 2020-08-07 11:33:29 -07:00
shuffle.h mm/shuffle: remove dynamic reconfiguration 2020-08-07 11:33:29 -07:00
slab_common.c mm/slab_common.c: delete duplicated word 2020-08-12 10:57:58 -07:00
slab.c mm: slab: rename (un)charge_slab_page() to (un)account_slab_page() 2020-08-07 11:33:25 -07:00
slab.h mm: slab: rename (un)charge_slab_page() to (un)account_slab_page() 2020-08-07 11:33:25 -07:00
slob.c mm: memcg: convert vmstat slab counters to bytes 2020-08-07 11:33:24 -07:00
slub.c mm: slub: fix conversion of freelist_corrupted() 2020-09-05 12:14:29 -07:00
sparse-vmemmap.c mm/sparse: only sub-section aligned range would be populated 2020-08-07 11:33:27 -07:00
sparse.c mm/sparse: cleanup the code surrounding memory_present() 2020-08-07 11:33:27 -07:00
swap_cgroup.c mm: memcontrol: make swap tracking an integral part of memory control 2020-06-03 20:09:48 -07:00
swap_slots.c mm/swap_slots.c: remove redundant check for swap_slot_cache_initialized 2020-08-07 11:33:24 -07:00
swap_state.c mm/swap_state: mark various intentional data races 2020-08-14 19:56:57 -07:00
swap.c mlock: fix unevictable_pgs event counts on THP 2020-09-19 13:13:38 -07:00
swapfile.c mm, THP, swap: fix allocating cluster for swapfile by mistake 2020-09-26 10:33:57 -07:00
truncate.c
usercopy.c mm/usercopy.c: delete duplicated word 2020-08-12 10:57:58 -07:00
userfaultfd.c mm/vmscan: protect the workingset on anonymous LRU 2020-08-12 10:57:55 -07:00
util.c mm: remove unnecessary wrapper function do_mmap_pgoff() 2020-08-07 11:33:27 -07:00
vmacache.c kernel: better document the use_mm/unuse_mm API contract 2020-06-10 19:14:18 -07:00
vmalloc.c mm/vunmap: add cond_resched() in vunmap_pmd_range 2020-08-21 09:52:53 -07:00
vmpressure.c mm: vmpressure: use mem_cgroup_is_root API 2020-04-02 09:35:31 -07:00
vmscan.c mm: fix check_move_unevictable_pages() on THP 2020-09-19 13:13:38 -07:00
vmstat.c Merge branch 'simplify-do_wp_page' 2020-09-04 09:31:54 -07:00
workingset.c mm: replace hpage_nr_pages with thp_nr_pages 2020-08-14 19:56:56 -07:00
z3fold.c mm/z3fold: silence kmemleak false positives of slots 2020-05-28 11:35:40 -07:00
zbud.c mm: use false for bool variable 2020-06-04 19:06:24 -07:00
zpool.c mm/zpool.c: delete duplicated word and fix grammar 2020-08-12 10:57:58 -07:00
zsmalloc.c mm/zsmalloc.c: fix duplicated words 2020-08-12 10:57:58 -07:00
zswap.c mm/zswap: allow setting default status, compressor and allocator in Kconfig 2020-04-07 10:43:41 -07:00