kernel_optimize_test/mm
David Hildenbrand e822969cab mm/page_alloc.c: fix uninitialized memmaps on a partially populated last section
Patch series "mm: fix max_pfn not falling on section boundary", v2.

Playing with different memory sizes for a x86-64 guest, I discovered that
some memmaps (highest section if max_mem does not fall on the section
boundary) are marked as being valid and online, but contain garbage.  We
have to properly initialize these memmaps.

Looking at /proc/kpageflags and friends, I found some more issues,
partially related to this.

This patch (of 3):

If max_pfn is not aligned to a section boundary, we can easily run into
BUGs.  This can e.g., be triggered on x86-64 under QEMU by specifying a
memory size that is not a multiple of 128MB (e.g., 4097MB, but also
4160MB).  I was told that on real HW, we can easily have this scenario
(esp., one of the main reasons sub-section hotadd of devmem was added).

The issue is, that we have a valid memmap (pfn_valid()) for the whole
section, and the whole section will be marked "online".
pfn_to_online_page() will succeed, but the memmap contains garbage.

E.g., doing a "./page-types -r -a 0x144001" when QEMU was started with "-m
4160M" - (see tools/vm/page-types.c):

[  200.476376] BUG: unable to handle page fault for address: fffffffffffffffe
[  200.477500] #PF: supervisor read access in kernel mode
[  200.478334] #PF: error_code(0x0000) - not-present page
[  200.479076] PGD 59614067 P4D 59614067 PUD 59616067 PMD 0
[  200.479557] Oops: 0000 [#4] SMP NOPTI
[  200.479875] CPU: 0 PID: 603 Comm: page-types Tainted: G      D W         5.5.0-rc1-next-20191209 #93
[  200.480646] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu4
[  200.481648] RIP: 0010:stable_page_flags+0x4d/0x410
[  200.482061] Code: f3 ff 41 89 c0 48 b8 00 00 00 00 01 00 00 00 45 84 c0 0f 85 cd 02 00 00 48 8b 53 08 48 8b 2b 48f
[  200.483644] RSP: 0018:ffffb139401cbe60 EFLAGS: 00010202
[  200.484091] RAX: fffffffffffffffe RBX: fffffbeec5100040 RCX: 0000000000000000
[  200.484697] RDX: 0000000000000001 RSI: ffffffff9535c7cd RDI: 0000000000000246
[  200.485313] RBP: ffffffffffffffff R08: 0000000000000000 R09: 0000000000000000
[  200.485917] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000144001
[  200.486523] R13: 00007ffd6ba55f48 R14: 00007ffd6ba55f40 R15: ffffb139401cbf08
[  200.487130] FS:  00007f68df717580(0000) GS:ffff9ec77fa00000(0000) knlGS:0000000000000000
[  200.487804] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  200.488295] CR2: fffffffffffffffe CR3: 0000000135d48000 CR4: 00000000000006f0
[  200.488897] Call Trace:
[  200.489115]  kpageflags_read+0xe9/0x140
[  200.489447]  proc_reg_read+0x3c/0x60
[  200.489755]  vfs_read+0xc2/0x170
[  200.490037]  ksys_pread64+0x65/0xa0
[  200.490352]  do_syscall_64+0x5c/0xa0
[  200.490665]  entry_SYSCALL_64_after_hwframe+0x49/0xbe

But it can be triggered much easier via "cat /proc/kpageflags > /dev/null"
after cold/hot plugging a DIMM to such a system:

[root@localhost ~]# cat /proc/kpageflags > /dev/null
[  111.517275] BUG: unable to handle page fault for address: fffffffffffffffe
[  111.517907] #PF: supervisor read access in kernel mode
[  111.518333] #PF: error_code(0x0000) - not-present page
[  111.518771] PGD a240e067 P4D a240e067 PUD a2410067 PMD 0

This patch fixes that by at least zero-ing out that memmap (so e.g.,
page_to_pfn() will not crash).  Commit 907ec5fca3 ("mm: zero remaining
unavailable struct pages") tried to fix a similar issue, but forgot to
consider this special case.

After this patch, there are still problems to solve.  E.g., not all of
these pages falling into a memory hole will actually get initialized later
and set PageReserved - they are only zeroed out - but at least the
immediate crashes are gone.  A follow-up patch will take care of this.

Link: http://lkml.kernel.org/r/20191211163201.17179-2-david@redhat.com
Fixes: f7f99100d8 ("mm: stop zeroing memory during allocation in vmemmap")
Signed-off-by: David Hildenbrand <david@redhat.com>
Tested-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Bob Picco <bob.picco@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: <stable@vger.kernel.org>	[4.15+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04 03:05:23 +00:00
..
kasan RISC-V Patches for the 5.6 Merge Window, Part 1 2020-01-31 11:23:29 -08:00
backing-dev.c memcg: fix a crash in wb_workfn when a device disappears 2020-01-31 10:30:36 -08:00
balloon_compaction.c
cleancache.c
cma_debug.c mm/cma_debug.c: use DEFINE_DEBUGFS_ATTRIBUTE to define debugfs fops 2019-12-01 12:59:09 -08:00
cma.c mm/cma.c: switch to bitmap_zalloc() for cma bitmap allocation 2019-12-01 12:59:09 -08:00
cma.h
compaction.c mm, compaction: fix wrong pfn handling in __reset_isolation_pfn() 2019-10-14 15:04:01 -07:00
debug_page_ref.c
debug.c mm/hotplug: silence a lockdep splat with printk() 2020-01-31 10:30:39 -08:00
dmapool.c
early_ioremap.c mm/early_ioremap.c: use %pa to print resource_size_t variables 2020-01-31 10:30:38 -08:00
fadvise.c
failslab.c
filemap.c mm/filemap.c: clean up filemap_write_and_wait() 2020-01-31 10:30:37 -08:00
frame_vector.c
frontswap.c
gup_benchmark.c mm/gup_benchmark: use proper FOLL_WRITE flags instead of hard-coding "1" 2020-01-31 10:30:38 -08:00
gup.c mm, tree-wide: rename put_user_page*() to unpin_user_page*() 2020-01-31 10:30:38 -08:00
highmem.c mm, x86/mm: Untangle address space layout definitions from basic pgtable type definitions 2019-12-10 10:12:55 +01:00
hmm.c mm/hmm: remove hmm_range_dma_map and hmm_range_dma_unmap 2019-11-23 19:56:45 -04:00
huge_memory.c Merge branch 'akpm' (patches from Andrew) 2020-01-31 12:16:36 -08:00
hugetlb_cgroup.c mm: hugetlb controller for cgroups v2 2019-12-16 12:41:40 -08:00
hugetlb.c mm/hugetlb: defer freeing of huge pages if in non-task context 2020-01-04 13:55:09 -08:00
hwpoison-inject.c mm/hwpoison-inject: use DEFINE_DEBUGFS_ATTRIBUTE to define debugfs fops 2019-12-01 12:59:09 -08:00
init-mm.c mm/init-mm.c: include <linux/mman.h> for vm_committed_as_batch 2019-10-19 06:32:32 -04:00
internal.h mm, pcpu: make zone pcp updates and reset internal to the mm 2019-12-01 12:59:06 -08:00
interval_tree.c
Kconfig mm/Kconfig: fix trivial help text punctuation 2019-12-01 12:59:10 -08:00
Kconfig.debug
khugepaged.c mm/thp: flush file for !is_shmem PageDirty() case in collapse_file() 2019-12-01 12:59:09 -08:00
kmemleak-test.c
kmemleak.c mm/kmemleak: turn kmemleak_lock and object->lock to raw_spinlock_t 2020-01-31 10:30:36 -08:00
ksm.c * PPC secure guest support 2019-12-04 11:08:30 -08:00
list_lru.c
maccess.c uaccess: Add strict non-pagefault kernel-space read function 2019-11-02 12:39:12 -07:00
madvise.c mm: make do_madvise() available internally 2020-01-20 17:04:02 -07:00
Makefile kcov: ignore fault-inject and stacktrace 2020-01-31 10:30:41 -08:00
mapping_dirty_helpers.c mm: Add write-protect and clean utilities for address space ranges 2019-11-06 13:03:36 +01:00
memblock.c memblock: Use __func__ in remaining memblock_dbg() call sites 2020-01-31 10:30:38 -08:00
memcontrol.c mm/memcontrol.c: cleanup some useless code 2020-01-31 10:30:38 -08:00
memfd.c
memory_hotplug.c mm/memory_hotplug: pass in nid to online_pages() 2020-01-31 10:30:39 -08:00
memory-failure.c mm/memory-failure.c: use page_shift() in add_to_kill() 2019-12-01 12:59:04 -08:00
memory.c Linux 5.5-rc3 2019-12-25 10:41:37 +01:00
mempolicy.c mm/mempolicy.c: fix out of bounds write in mpol_parse_str() 2020-01-31 10:30:36 -08:00
mempool.c
memremap.c mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages 2020-01-31 10:30:37 -08:00
memtest.c
migrate.c mm/migrate: add stable check in migrate_vma_insert_page() 2020-01-31 10:30:39 -08:00
mincore.c
mlock.c
mm_init.c
mmap.c mm/mmap.c: get rid of odd jump labels in find_mergeable_anon_vma() 2020-01-31 10:30:39 -08:00
mmu_context.c
mmu_gather.c
mmu_notifier.c mm/mmu_notifiers: Use 'interval_sub' as the variable for mmu_interval_notifier 2020-01-14 11:54:47 -04:00
mmzone.c
mprotect.c autonuma: reduce cache footprint when scanning page tables 2019-12-01 12:59:09 -08:00
mremap.c mm/mmap.c: use IS_ERR_VALUE to check return value of get_unmapped_area 2019-12-01 06:29:19 -08:00
msync.c
nommu.c mm/mmap.c: rb_parent is not necessary in __vma_link_list() 2019-12-01 06:29:19 -08:00
oom_kill.c mm, oom: dump stack of victim when reaping failed 2020-01-31 10:30:38 -08:00
page_alloc.c mm/page_alloc.c: fix uninitialized memmaps on a partially populated last section 2020-02-04 03:05:23 +00:00
page_counter.c
page_ext.c mm, page_owner: fix off-by-one error in __set_page_owner_handle() 2019-10-14 15:04:00 -07:00
page_idle.c
page_io.c mm/page_io.c: annotate refault stalls from swap_readpage 2019-12-01 12:59:11 -08:00
page_isolation.c mm/page_isolation: fix potential warning from user 2020-01-31 10:30:39 -08:00
page_owner.c mm/page_owner: don't access uninitialized memmaps when reading /proc/pagetypeinfo 2019-10-19 06:32:31 -04:00
page_poison.c
page_vma_mapped.c mm/page_vma_mapped.c: explicitly compare pfn for normal, hugetlbfs and THP page 2020-01-31 10:30:38 -08:00
page-writeback.c mm/page-writeback.c: improve arithmetic divisions 2020-01-13 18:19:02 -08:00
pagewalk.c mm: Add a walk_page_mapping() function to the pagewalk code 2019-11-06 13:02:43 +01:00
percpu-internal.h
percpu-km.c
percpu-stats.c
percpu-vm.c
percpu.c bitmap: genericize percpu bitmap region iterators 2020-01-20 16:40:56 +01:00
pgtable-generic.c asm-generic/mm: stub out p{4,u}d_clear_bad() if __PAGETABLE_P{4,U}D_FOLDED 2019-12-01 06:29:19 -08:00
process_vm_access.c mm, tree-wide: rename put_user_page*() to unpin_user_page*() 2020-01-31 10:30:38 -08:00
readahead.c
rmap.c mm, thp: do not queue fully unmapped pages for deferred split 2019-12-01 12:59:09 -08:00
rodata_test.c
shmem.c mm/shmem.c: thp, shmem: fix conflict of above-47bit hint address and PMD alignment 2020-01-13 18:19:01 -08:00
shuffle.c
shuffle.h
slab_common.c mm: memcg/slab: call flush_memcg_workqueue() only if memcg workqueue is valid 2020-01-13 18:19:02 -08:00
slab.c mm, debug_pagealloc: don't rely on static keys too early 2020-01-13 18:19:02 -08:00
slab.h mm: clean up and clarify lruvec lookup procedure 2019-12-01 12:59:06 -08:00
slob.c
slub.c mm/slub.c: avoid slub allocation while holding list_lock 2020-01-31 10:30:36 -08:00
sparse-vmemmap.c
sparse.c mm/sparse.c: reset section's mem_map when fully deactivated 2020-01-31 10:30:36 -08:00
swap_cgroup.c
swap_slots.c
swap_state.c
swap.c mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages 2020-01-31 10:30:37 -08:00
swapfile.c mm/swapfile.c: swap_next should increase position index 2020-01-31 10:30:38 -08:00
truncate.c mm/thp: allow dropping THP from page cache 2019-10-19 06:32:33 -04:00
usercopy.c
userfaultfd.c mm: fix typos in comments when calling __SetPageUptodate() 2019-12-01 12:59:10 -08:00
util.c mm/mmap.c: rb_parent is not necessary in __vma_link_list() 2019-12-01 06:29:19 -08:00
vmacache.c
vmalloc.c Linux 5.5-rc7 2020-01-20 08:05:16 +01:00
vmpressure.c
vmscan.c mm/vmscan: remove unused RECLAIM_OFF/RECLAIM_ZONE 2020-01-31 10:30:38 -08:00
vmstat.c mm/memcontrol: use vmstat names for printing statistics 2019-12-04 19:44:11 -08:00
workingset.c mm: vmscan: detect file thrashing at the reclaim root 2019-12-01 12:59:07 -08:00
z3fold.c mm/z3fold.c: add inter-page compaction 2019-12-01 12:59:07 -08:00
zbud.c
zpool.c
zsmalloc.c mm/zsmalloc.c: fix the migrated zspage statistics. 2020-01-04 13:55:09 -08:00
zswap.c zswap: potential NULL dereference on error in init_zswap() 2020-01-31 10:30:39 -08:00