From 5b32af91b5de6f95ad99e4eaaf57777376af124f Mon Sep 17 00:00:00 2001 From: Roman Gushchin Date: Tue, 11 Aug 2020 18:30:14 -0700 Subject: [PATCH 001/164] percpu: return number of released bytes from pcpu_free_area() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Patch series "mm: memcg accounting of percpu memory", v3. This patchset adds percpu memory accounting to memory cgroups. It's based on the rework of the slab controller and reuses concepts and features introduced for the per-object slab accounting. Percpu memory is becoming more and more widely used by various subsystems, and the total amount of memory controlled by the percpu allocator can make up a significant share of the total memory. As an example, bpf maps can consume a lot of percpu memory, and they are created by a user. Also, some cgroup internals (e.g. memory controller statistics) can be quite large. On a machine with many CPUs and a large number of cgroups they can consume hundreds of megabytes. The lack of memcg accounting therefore creates a breach in memory isolation. Similar to slab memory, percpu memory should be accounted by default. Percpu allocations are by their nature scattered over multiple pages, so they can't be tracked on a per-page basis. The per-object tracking introduced by the new slab controller is therefore reused. The patchset implements charging of percpu allocations, adds memcg-level statistics, enables accounting for percpu allocations made by memory cgroup internals, and provides some basic tests. To implement the accounting of percpu memory without a significant memory and performance overhead, the following approach is used: all accounted allocations are placed into a separate percpu chunk (or chunks). These chunks are similar to the default chunks, except that they have an attached vector of pointers to obj_cgroup objects, big enough to hold a pointer for each allocated object. On allocation, if the allocation has to be accounted (__GFP_ACCOUNT is passed, the allocating process belongs to a non-root memory cgroup, etc.), the memory cgroup is charged, and if the maximum limit is not exceeded, the allocation is performed using a memcg-aware chunk. Otherwise -ENOMEM is returned or the allocation is forced over the limit, depending on the gfp flags (as for any other kernel memory allocation). The memory cgroup information is saved in the obj_cgroup vector at the corresponding offset. At release time the memcg information is restored from the vector and the cgroup is uncharged. Unaccounted allocations (at this point the absolute majority of all percpu allocations) are performed in the old way, so no additional overhead is expected. To avoid pinning dying memory cgroups by outstanding allocations, the obj_cgroup API is used instead of directly saving memory cgroup pointers. An obj_cgroup is basically a pointer to a memory cgroup with a standalone reference counter. The trick is that it can be atomically swapped to point at the parent cgroup, so that the original memory cgroup can be released before all objects that have been charged to it. Because all charges and statistics are fully recursive, it's perfectly correct to uncharge the parent cgroup instead. This scheme is used in the slab memory accounting, and percpu memory can simply follow it. This patch (of 5): To implement accounting of percpu memory we need to know the size of the freed object. Return it from pcpu_free_area().
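For illustration (not part of this patch), the user-visible contract the series builds toward can be sketched as a hypothetical example module, assuming only alloc_percpu_gfp(), GFP_KERNEL_ACCOUNT and free_percpu() as they already exist in mainline:

	/* Sketch: a percpu counter charged to the allocating task's memcg. */
	#include <linux/gfp.h>
	#include <linux/percpu.h>

	static u64 __percpu *example_counter;

	static int example_init(void)
	{
		/* __GFP_ACCOUNT routes the allocation into a memcg-aware chunk */
		example_counter = alloc_percpu_gfp(u64, GFP_KERNEL_ACCOUNT);
		return example_counter ? 0 : -ENOMEM;
	}

	static void example_exit(void)
	{
		/*
		 * free_percpu() uncharges the owning memcg; this patch makes
		 * pcpu_free_area() report how many bytes that is.
		 */
		free_percpu(example_counter);
	}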
Signed-off-by: Roman Gushchin Signed-off-by: Andrew Morton Reviewed-by: Shakeel Butt Acked-by: Dennis Zhou Cc: Tejun Heo Cc: Christoph Lameter Cc: Johannes Weiner Cc: Michal Hocko Cc: David Rientjes Cc: Joonsoo Kim Cc: Mel Gorman Cc: Pekka Enberg Cc: Tobin C. Harding Cc: Vlastimil Babka Cc: Waiman Long Cc: Bixuan Cui Cc: Michal Koutný Cc: Stephen Rothwell Link: http://lkml.kernel.org/r/20200623184515.4132564-1-guro@fb.com Link: http://lkml.kernel.org/r/20200608230819.832349-1-guro@fb.com Link: http://lkml.kernel.org/r/20200608230819.832349-2-guro@fb.com Signed-off-by: Linus Torvalds --- mm/percpu.c | 13 ++++++++++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/mm/percpu.c b/mm/percpu.c index b626766160ce..5e9eefdce21f 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -1211,11 +1211,14 @@ static int pcpu_alloc_area(struct pcpu_chunk *chunk, int alloc_bits, * * This function determines the size of an allocation to free using * the boundary bitmap and clears the allocation map. + * + * RETURNS: + * Number of freed bytes. */ -static void pcpu_free_area(struct pcpu_chunk *chunk, int off) +static int pcpu_free_area(struct pcpu_chunk *chunk, int off) { struct pcpu_block_md *chunk_md = &chunk->chunk_md; - int bit_off, bits, end, oslot; + int bit_off, bits, end, oslot, freed; lockdep_assert_held(&pcpu_lock); pcpu_stats_area_dealloc(chunk); @@ -1230,8 +1233,10 @@ static void pcpu_free_area(struct pcpu_chunk *chunk, int off) bits = end - bit_off; bitmap_clear(chunk->alloc_map, bit_off, bits); + freed = bits * PCPU_MIN_ALLOC_SIZE; + /* update metadata */ - chunk->free_bytes += bits * PCPU_MIN_ALLOC_SIZE; + chunk->free_bytes += freed; /* update first free bit */ chunk_md->first_free = min(chunk_md->first_free, bit_off); @@ -1239,6 +1244,8 @@ static void pcpu_free_area(struct pcpu_chunk *chunk, int off) pcpu_block_update_hint_free(chunk, bit_off, bits); pcpu_chunk_relocate(chunk, oslot); + + return freed; } static void pcpu_init_md_block(struct pcpu_block_md *block, int nr_bits) From 3c7be18ac9a06bc67196bfdabb7c21e1bbacdc13 Mon Sep 17 00:00:00 2001 From: Roman Gushchin Date: Tue, 11 Aug 2020 18:30:17 -0700 Subject: [PATCH 002/164] mm: memcg/percpu: account percpu memory to memory cgroups MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Percpu memory is becoming more and more widely used by various subsystems, and the total amount of memory controlled by the percpu allocator can make up a significant share of the total memory. As an example, bpf maps can consume a lot of percpu memory, and they are created by a user. Also, some cgroup internals (e.g. memory controller statistics) can be quite large. On a machine with many CPUs and a large number of cgroups they can consume hundreds of megabytes. The lack of memcg accounting therefore creates a breach in memory isolation. Similar to slab memory, percpu memory should be accounted by default. To implement the percpu accounting it's possible to take the slab memory accounting as a model to follow. Let's introduce two types of percpu chunks: root and memcg. What makes memcg chunks different is the additional space allocated to store memcg membership information. If __GFP_ACCOUNT is passed on allocation, a memcg chunk should be used. If it's possible to charge the corresponding size to the target memory cgroup, the allocation is performed, and the memcg ownership data is recorded. System-wide allocations are performed using root chunks, so there is no additional memory overhead.
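To make the membership bookkeeping concrete, here is a minimal lookup sketch; the helper name is invented for illustration, while the indexing (one obj_cgroup pointer slot per PCPU_MIN_ALLOC_SIZE unit of offset) mirrors what the hooks added by this patch do:

	/* Sketch: map an allocation offset to its recorded obj_cgroup. */
	static struct obj_cgroup *pcpu_objcg_of(struct pcpu_chunk *chunk, int off)
	{
		if (!chunk->obj_cgroups)	/* root chunk: nothing recorded */
			return NULL;
		return chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT];
	}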
To implement a fast reparenting of percpu memory on memcg removal, we don't store mem_cgroup pointers directly: instead we use the obj_cgroup API, introduced for slab accounting. [akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning] [akpm@linux-foundation.org: move unreachable code, per Roman] [cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning] Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com Signed-off-by: Roman Gushchin Signed-off-by: Bixuan Cui Signed-off-by: Andrew Morton Reviewed-by: Shakeel Butt Acked-by: Dennis Zhou Cc: Christoph Lameter Cc: David Rientjes Cc: Johannes Weiner Cc: Joonsoo Kim Cc: Mel Gorman Cc: Michal Hocko Cc: Pekka Enberg Cc: Tejun Heo Cc: Tobin C. Harding Cc: Vlastimil Babka Cc: Waiman Long Cc: Bixuan Cui Cc: Michal Koutný Cc: Stephen Rothwell Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com Signed-off-by: Linus Torvalds --- mm/percpu-internal.h | 55 ++++++++++++- mm/percpu-km.c | 5 +- mm/percpu-stats.c | 36 +++++---- mm/percpu-vm.c | 5 +- mm/percpu.c | 185 ++++++++++++++++++++++++++++++++++++++----- 5 files changed, 246 insertions(+), 40 deletions(-) diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h index 0468ba500bd4..18b768ac7dca 100644 --- a/mm/percpu-internal.h +++ b/mm/percpu-internal.h @@ -5,6 +5,25 @@ #include #include +/* + * There are two chunk types: root and memcg-aware. + * Chunks of each type have a separate list of slots. + * + * Memcg-aware chunks have an attached vector of obj_cgroup pointers, which is + * used to store memcg membership data of a percpu object. Obj_cgroups are + * ref-counted pointers to a memory cgroup with an ability to switch dynamically + * to the parent memory cgroup. This allows reclaiming a deleted memory cgroup + * without reclaiming all of the outstanding objects, which hold a reference to it. + */ +enum pcpu_chunk_type { + PCPU_CHUNK_ROOT, +#ifdef CONFIG_MEMCG_KMEM + PCPU_CHUNK_MEMCG, +#endif + PCPU_NR_CHUNK_TYPES, + PCPU_FAIL_ALLOC = PCPU_NR_CHUNK_TYPES +}; + /* * pcpu_block_md is the metadata block struct. * Each chunk's bitmap is split into a number of full blocks.
@@ -54,6 +73,9 @@ struct pcpu_chunk { int end_offset; /* additional area required to have the region end page aligned */ +#ifdef CONFIG_MEMCG_KMEM + struct obj_cgroup **obj_cgroups; /* vector of object cgroups */ +#endif int nr_pages; /* # of pages served by this chunk */ int nr_populated; /* # of populated pages */ @@ -63,7 +85,7 @@ struct pcpu_chunk { extern spinlock_t pcpu_lock; -extern struct list_head *pcpu_slot; +extern struct list_head *pcpu_chunk_lists; extern int pcpu_nr_slots; extern int pcpu_nr_empty_pop_pages; @@ -106,6 +128,37 @@ static inline int pcpu_chunk_map_bits(struct pcpu_chunk *chunk) return pcpu_nr_pages_to_map_bits(chunk->nr_pages); } +#ifdef CONFIG_MEMCG_KMEM +static inline enum pcpu_chunk_type pcpu_chunk_type(struct pcpu_chunk *chunk) +{ + if (chunk->obj_cgroups) + return PCPU_CHUNK_MEMCG; + return PCPU_CHUNK_ROOT; +} + +static inline bool pcpu_is_memcg_chunk(enum pcpu_chunk_type chunk_type) +{ + return chunk_type == PCPU_CHUNK_MEMCG; +} + +#else +static inline enum pcpu_chunk_type pcpu_chunk_type(struct pcpu_chunk *chunk) +{ + return PCPU_CHUNK_ROOT; +} + +static inline bool pcpu_is_memcg_chunk(enum pcpu_chunk_type chunk_type) +{ + return false; +} +#endif + +static inline struct list_head *pcpu_chunk_list(enum pcpu_chunk_type chunk_type) +{ + return &pcpu_chunk_lists[pcpu_nr_slots * + pcpu_is_memcg_chunk(chunk_type)]; +} + #ifdef CONFIG_PERCPU_STATS #include diff --git a/mm/percpu-km.c b/mm/percpu-km.c index 20d2b69a13b0..35c9941077ee 100644 --- a/mm/percpu-km.c +++ b/mm/percpu-km.c @@ -44,7 +44,8 @@ static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk, /* nada */ } -static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp) +static struct pcpu_chunk *pcpu_create_chunk(enum pcpu_chunk_type type, + gfp_t gfp) { const int nr_pages = pcpu_group_sizes[0] >> PAGE_SHIFT; struct pcpu_chunk *chunk; @@ -52,7 +53,7 @@ static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp) unsigned long flags; int i; - chunk = pcpu_alloc_chunk(gfp); + chunk = pcpu_alloc_chunk(type, gfp); if (!chunk) return NULL; diff --git a/mm/percpu-stats.c b/mm/percpu-stats.c index 32558063c3f9..c8400a2adbc2 100644 --- a/mm/percpu-stats.c +++ b/mm/percpu-stats.c @@ -34,11 +34,15 @@ static int find_max_nr_alloc(void) { struct pcpu_chunk *chunk; int slot, max_nr_alloc; + enum pcpu_chunk_type type; max_nr_alloc = 0; - for (slot = 0; slot < pcpu_nr_slots; slot++) - list_for_each_entry(chunk, &pcpu_slot[slot], list) - max_nr_alloc = max(max_nr_alloc, chunk->nr_alloc); + for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++) + for (slot = 0; slot < pcpu_nr_slots; slot++) + list_for_each_entry(chunk, &pcpu_chunk_list(type)[slot], + list) + max_nr_alloc = max(max_nr_alloc, + chunk->nr_alloc); return max_nr_alloc; } @@ -129,6 +133,9 @@ static void chunk_map_stats(struct seq_file *m, struct pcpu_chunk *chunk, P("cur_min_alloc", cur_min_alloc); P("cur_med_alloc", cur_med_alloc); P("cur_max_alloc", cur_max_alloc); +#ifdef CONFIG_MEMCG_KMEM + P("memcg_aware", pcpu_is_memcg_chunk(pcpu_chunk_type(chunk))); +#endif seq_putc(m, '\n'); } @@ -137,6 +144,7 @@ static int percpu_stats_show(struct seq_file *m, void *v) struct pcpu_chunk *chunk; int slot, max_nr_alloc; int *buffer; + enum pcpu_chunk_type type; alloc_buffer: spin_lock_irq(&pcpu_lock); @@ -202,18 +210,18 @@ static int percpu_stats_show(struct seq_file *m, void *v) chunk_map_stats(m, pcpu_reserved_chunk, buffer); } - for (slot = 0; slot < pcpu_nr_slots; slot++) { - list_for_each_entry(chunk, &pcpu_slot[slot], list) { - if (chunk == pcpu_first_chunk) { - seq_puts(m, 
"Chunk: <- First Chunk\n"); - chunk_map_stats(m, chunk, buffer); - - - } else { - seq_puts(m, "Chunk:\n"); - chunk_map_stats(m, chunk, buffer); + for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++) { + for (slot = 0; slot < pcpu_nr_slots; slot++) { + list_for_each_entry(chunk, &pcpu_chunk_list(type)[slot], + list) { + if (chunk == pcpu_first_chunk) { + seq_puts(m, "Chunk: <- First Chunk\n"); + chunk_map_stats(m, chunk, buffer); + } else { + seq_puts(m, "Chunk:\n"); + chunk_map_stats(m, chunk, buffer); + } } - } } diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c index a2b395acef89..e46f7a6917f9 100644 --- a/mm/percpu-vm.c +++ b/mm/percpu-vm.c @@ -328,12 +328,13 @@ static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk, pcpu_free_pages(chunk, pages, page_start, page_end); } -static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp) +static struct pcpu_chunk *pcpu_create_chunk(enum pcpu_chunk_type type, + gfp_t gfp) { struct pcpu_chunk *chunk; struct vm_struct **vms; - chunk = pcpu_alloc_chunk(gfp); + chunk = pcpu_alloc_chunk(type, gfp); if (!chunk) return NULL; diff --git a/mm/percpu.c b/mm/percpu.c index 5e9eefdce21f..dc1a213293aa 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -37,9 +37,14 @@ * takes care of normal allocations. * * The allocator organizes chunks into lists according to free size and - * tries to allocate from the fullest chunk first. Each chunk is managed - * by a bitmap with metadata blocks. The allocation map is updated on - * every allocation and free to reflect the current state while the boundary + * memcg-awareness. To make a percpu allocation memcg-aware the __GFP_ACCOUNT + * flag should be passed. All memcg-aware allocations are sharing one set + * of chunks and all unaccounted allocations and allocations performed + * by processes belonging to the root memory cgroup are using the second set. + * + * The allocator tries to allocate from the fullest chunk first. Each chunk + * is managed by a bitmap with metadata blocks. The allocation map is updated + * on every allocation and free to reflect the current state while the boundary * map is only updated on allocation. Each metadata block contains * information to help mitigate the need to iterate over large portions * of the bitmap. 
The reverse mapping from page to chunk is stored in @@ -81,6 +86,7 @@ #include #include #include +#include #include #include @@ -160,7 +166,7 @@ struct pcpu_chunk *pcpu_reserved_chunk __ro_after_init; DEFINE_SPINLOCK(pcpu_lock); /* all internal data structures */ static DEFINE_MUTEX(pcpu_alloc_mutex); /* chunk create/destroy, [de]pop, map ext */ -struct list_head *pcpu_slot __ro_after_init; /* chunk list slots */ +struct list_head *pcpu_chunk_lists __ro_after_init; /* chunk list slots */ /* chunks which need their map areas extended, protected by pcpu_lock */ static LIST_HEAD(pcpu_map_extend_chunks); @@ -500,6 +506,9 @@ static void __pcpu_chunk_move(struct pcpu_chunk *chunk, int slot, bool move_front) { if (chunk != pcpu_reserved_chunk) { + struct list_head *pcpu_slot; + + pcpu_slot = pcpu_chunk_list(pcpu_chunk_type(chunk)); if (move_front) list_move(&chunk->list, &pcpu_slot[slot]); else @@ -1341,6 +1350,10 @@ static struct pcpu_chunk * __init pcpu_alloc_first_chunk(unsigned long tmp_addr, panic("%s: Failed to allocate %zu bytes\n", __func__, alloc_size); +#ifdef CONFIG_MEMCG_KMEM + /* first chunk isn't memcg-aware */ + chunk->obj_cgroups = NULL; +#endif pcpu_init_md_blocks(chunk); /* manage populated page bitmap */ @@ -1380,7 +1393,7 @@ static struct pcpu_chunk * __init pcpu_alloc_first_chunk(unsigned long tmp_addr, return chunk; } -static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t gfp) +static struct pcpu_chunk *pcpu_alloc_chunk(enum pcpu_chunk_type type, gfp_t gfp) { struct pcpu_chunk *chunk; int region_bits; @@ -1408,6 +1421,16 @@ static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t gfp) if (!chunk->md_blocks) goto md_blocks_fail; +#ifdef CONFIG_MEMCG_KMEM + if (pcpu_is_memcg_chunk(type)) { + chunk->obj_cgroups = + pcpu_mem_zalloc(pcpu_chunk_map_bits(chunk) * + sizeof(struct obj_cgroup *), gfp); + if (!chunk->obj_cgroups) + goto objcg_fail; + } +#endif + pcpu_init_md_blocks(chunk); /* init metadata */ @@ -1415,6 +1438,10 @@ static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t gfp) return chunk; +#ifdef CONFIG_MEMCG_KMEM +objcg_fail: + pcpu_mem_free(chunk->md_blocks); +#endif md_blocks_fail: pcpu_mem_free(chunk->bound_map); bound_map_fail: @@ -1429,6 +1456,9 @@ static void pcpu_free_chunk(struct pcpu_chunk *chunk) { if (!chunk) return; +#ifdef CONFIG_MEMCG_KMEM + pcpu_mem_free(chunk->obj_cgroups); +#endif pcpu_mem_free(chunk->md_blocks); pcpu_mem_free(chunk->bound_map); pcpu_mem_free(chunk->alloc_map); @@ -1505,7 +1535,8 @@ static int pcpu_populate_chunk(struct pcpu_chunk *chunk, int page_start, int page_end, gfp_t gfp); static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk, int page_start, int page_end); -static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp); +static struct pcpu_chunk *pcpu_create_chunk(enum pcpu_chunk_type type, + gfp_t gfp); static void pcpu_destroy_chunk(struct pcpu_chunk *chunk); static struct page *pcpu_addr_to_page(void *addr); static int __init pcpu_verify_alloc_info(const struct pcpu_alloc_info *ai); @@ -1547,6 +1578,77 @@ static struct pcpu_chunk *pcpu_chunk_addr_search(void *addr) return pcpu_get_page_chunk(pcpu_addr_to_page(addr)); } +#ifdef CONFIG_MEMCG_KMEM +static enum pcpu_chunk_type pcpu_memcg_pre_alloc_hook(size_t size, gfp_t gfp, + struct obj_cgroup **objcgp) +{ + struct obj_cgroup *objcg; + + if (!memcg_kmem_enabled() || !(gfp & __GFP_ACCOUNT) || + memcg_kmem_bypass()) + return PCPU_CHUNK_ROOT; + + objcg = get_obj_cgroup_from_current(); + if (!objcg) + return PCPU_CHUNK_ROOT; + + if (obj_cgroup_charge(objcg, gfp, size * num_possible_cpus())) { + 
obj_cgroup_put(objcg); + return PCPU_FAIL_ALLOC; + } + + *objcgp = objcg; + return PCPU_CHUNK_MEMCG; +} + +static void pcpu_memcg_post_alloc_hook(struct obj_cgroup *objcg, + struct pcpu_chunk *chunk, int off, + size_t size) +{ + if (!objcg) + return; + + if (chunk) { + chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = objcg; + } else { + obj_cgroup_uncharge(objcg, size * num_possible_cpus()); + obj_cgroup_put(objcg); + } +} + +static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size) +{ + struct obj_cgroup *objcg; + + if (!pcpu_is_memcg_chunk(pcpu_chunk_type(chunk))) + return; + + objcg = chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT]; + chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = NULL; + + obj_cgroup_uncharge(objcg, size * num_possible_cpus()); + + obj_cgroup_put(objcg); +} + +#else /* CONFIG_MEMCG_KMEM */ +static enum pcpu_chunk_type +pcpu_memcg_pre_alloc_hook(size_t size, gfp_t gfp, struct obj_cgroup **objcgp) +{ + return PCPU_CHUNK_ROOT; +} + +static void pcpu_memcg_post_alloc_hook(struct obj_cgroup *objcg, + struct pcpu_chunk *chunk, int off, + size_t size) +{ +} + +static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size) +{ +} +#endif /* CONFIG_MEMCG_KMEM */ + /** * pcpu_alloc - the percpu allocator * @size: size of area to allocate in bytes @@ -1568,6 +1670,9 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved, gfp_t pcpu_gfp; bool is_atomic; bool do_warn; + enum pcpu_chunk_type type; + struct list_head *pcpu_slot; + struct obj_cgroup *objcg = NULL; static int warn_limit = 10; struct pcpu_chunk *chunk, *next; const char *err; @@ -1602,16 +1707,23 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved, return NULL; } + type = pcpu_memcg_pre_alloc_hook(size, gfp, &objcg); + if (unlikely(type == PCPU_FAIL_ALLOC)) + return NULL; + pcpu_slot = pcpu_chunk_list(type); + if (!is_atomic) { /* * pcpu_balance_workfn() allocates memory under this mutex, * and it may wait for memory reclaim. Allow current task * to become OOM victim, in case of memory pressure. */ - if (gfp & __GFP_NOFAIL) + if (gfp & __GFP_NOFAIL) { mutex_lock(&pcpu_alloc_mutex); - else if (mutex_lock_killable(&pcpu_alloc_mutex)) + } else if (mutex_lock_killable(&pcpu_alloc_mutex)) { + pcpu_memcg_post_alloc_hook(objcg, NULL, 0, size); return NULL; + } } spin_lock_irqsave(&pcpu_lock, flags); @@ -1666,7 +1778,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved, } if (list_empty(&pcpu_slot[pcpu_nr_slots - 1])) { - chunk = pcpu_create_chunk(pcpu_gfp); + chunk = pcpu_create_chunk(type, pcpu_gfp); if (!chunk) { err = "failed to allocate new chunk"; goto fail; @@ -1723,6 +1835,8 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved, trace_percpu_alloc_percpu(reserved, is_atomic, size, align, chunk->base_addr, off, ptr); + pcpu_memcg_post_alloc_hook(objcg, chunk, off, size); + return ptr; fail_unlock: @@ -1744,6 +1858,9 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved, } else { mutex_unlock(&pcpu_alloc_mutex); } + + pcpu_memcg_post_alloc_hook(objcg, NULL, 0, size); + return NULL; } @@ -1803,8 +1920,8 @@ void __percpu *__alloc_reserved_percpu(size_t size, size_t align) } /** - * pcpu_balance_workfn - manage the amount of free chunks and populated pages - * @work: unused + * __pcpu_balance_workfn - manage the amount of free chunks and populated pages + * @type: chunk type * * Reclaim all fully free chunks except for the first one. 
This is also * responsible for maintaining the pool of empty populated pages. However, @@ -1813,11 +1930,12 @@ void __percpu *__alloc_reserved_percpu(size_t size, size_t align) * allocation causes the failure as it is possible that requests can be * serviced from already backed regions. */ -static void pcpu_balance_workfn(struct work_struct *work) +static void __pcpu_balance_workfn(enum pcpu_chunk_type type) { /* gfp flags passed to underlying allocators */ const gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN; LIST_HEAD(to_free); + struct list_head *pcpu_slot = pcpu_chunk_list(type); struct list_head *free_head = &pcpu_slot[pcpu_nr_slots - 1]; struct pcpu_chunk *chunk, *next; int slot, nr_to_pop, ret; @@ -1915,7 +2033,7 @@ static void pcpu_balance_workfn(struct work_struct *work) if (nr_to_pop) { /* ran out of chunks to populate, create a new one and retry */ - chunk = pcpu_create_chunk(gfp); + chunk = pcpu_create_chunk(type, gfp); if (chunk) { spin_lock_irq(&pcpu_lock); pcpu_chunk_relocate(chunk, -1); @@ -1927,6 +2045,20 @@ static void pcpu_balance_workfn(struct work_struct *work) mutex_unlock(&pcpu_alloc_mutex); } +/** + * pcpu_balance_workfn - manage the amount of free chunks and populated pages + * @work: unused + * + * Call __pcpu_balance_workfn() for each chunk type. + */ +static void pcpu_balance_workfn(struct work_struct *work) +{ + enum pcpu_chunk_type type; + + for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++) + __pcpu_balance_workfn(type); +} + /** * free_percpu - free percpu area * @ptr: pointer to area to free @@ -1941,8 +2073,9 @@ void free_percpu(void __percpu *ptr) void *addr; struct pcpu_chunk *chunk; unsigned long flags; - int off; + int size, off; bool need_balance = false; + struct list_head *pcpu_slot; if (!ptr) return; @@ -1956,7 +2089,11 @@ void free_percpu(void __percpu *ptr) chunk = pcpu_chunk_addr_search(addr); off = addr - chunk->base_addr; - pcpu_free_area(chunk, off); + size = pcpu_free_area(chunk, off); + + pcpu_slot = pcpu_chunk_list(pcpu_chunk_type(chunk)); + + pcpu_memcg_free_hook(chunk, off, size); /* if there are more than one fully free chunks, wake up grim reaper */ if (chunk->free_bytes == pcpu_unit_size) { @@ -2267,6 +2404,7 @@ void __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai, int map_size; unsigned long tmp_addr; size_t alloc_size; + enum pcpu_chunk_type type; #define PCPU_SETUP_BUG_ON(cond) do { \ if (unlikely(cond)) { \ @@ -2384,13 +2522,18 @@ void __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai, * empty chunks. 
*/ pcpu_nr_slots = __pcpu_size_to_slot(pcpu_unit_size) + 2; - pcpu_slot = memblock_alloc(pcpu_nr_slots * sizeof(pcpu_slot[0]), - SMP_CACHE_BYTES); - if (!pcpu_slot) + pcpu_chunk_lists = memblock_alloc(pcpu_nr_slots * + sizeof(pcpu_chunk_lists[0]) * + PCPU_NR_CHUNK_TYPES, + SMP_CACHE_BYTES); + if (!pcpu_chunk_lists) panic("%s: Failed to allocate %zu bytes\n", __func__, - pcpu_nr_slots * sizeof(pcpu_slot[0])); - for (i = 0; i < pcpu_nr_slots; i++) - INIT_LIST_HEAD(&pcpu_slot[i]); + pcpu_nr_slots * sizeof(pcpu_chunk_lists[0]) * + PCPU_NR_CHUNK_TYPES); + + for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++) + for (i = 0; i < pcpu_nr_slots; i++) + INIT_LIST_HEAD(&pcpu_chunk_list(type)[i]); /* * The end of the static region needs to be aligned with the From 772616b031f06e05846488b01dab46a7c832da13 Mon Sep 17 00:00:00 2001 From: Roman Gushchin Date: Tue, 11 Aug 2020 18:30:21 -0700 Subject: [PATCH 003/164] mm: memcg/percpu: per-memcg percpu memory statistics MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Percpu memory can represent a noticeable chunk of the total memory consumption, especially on big machines with many CPUs. Let's track percpu memory usage for each memcg and display it in memory.stat. A percpu allocation is usually scattered over multiple pages (and nodes), and can be significantly smaller than a page. So let's add a byte-sized counter on the memcg level: MEMCG_PERCPU_B. The byte-sized vmstat infrastructure created for slabs can be perfectly reused for the percpu case. [guro@fb.com: v3] Link: http://lkml.kernel.org/r/20200623184515.4132564-4-guro@fb.com Signed-off-by: Roman Gushchin Signed-off-by: Andrew Morton Reviewed-by: Shakeel Butt Acked-by: Dennis Zhou Acked-by: Johannes Weiner Cc: Christoph Lameter Cc: David Rientjes Cc: Joonsoo Kim Cc: Mel Gorman Cc: Michal Hocko Cc: Pekka Enberg Cc: Tejun Heo Cc: Tobin C. Harding Cc: Vlastimil Babka Cc: Waiman Long Cc: Bixuan Cui Cc: Michal Koutný Cc: Stephen Rothwell Link: http://lkml.kernel.org/r/20200608230819.832349-4-guro@fb.com Signed-off-by: Linus Torvalds --- Documentation/admin-guide/cgroup-v2.rst | 4 ++++ include/linux/memcontrol.h | 8 ++++++++ mm/memcontrol.c | 4 +++- mm/percpu.c | 10 ++++++++++ 4 files changed, 25 insertions(+), 1 deletion(-) diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index fa4018afa5a4..6be43781ec7f 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1274,6 +1274,10 @@ PAGE_SIZE multiple when read back. Amount of memory used for storing in-kernel data structures. + percpu + Amount of memory used for storing per-cpu kernel + data structures.
+ sock Amount of memory used in network transmission buffers diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 1bb49b600310..2c2d301eac33 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -32,6 +32,7 @@ struct kmem_cache; enum memcg_stat_item { MEMCG_SWAP = NR_VM_NODE_STAT_ITEMS, MEMCG_SOCK, + MEMCG_PERCPU_B, MEMCG_NR_STAT, }; @@ -339,6 +340,13 @@ struct mem_cgroup { extern struct mem_cgroup *root_mem_cgroup; +static __always_inline bool memcg_stat_item_in_bytes(int idx) +{ + if (idx == MEMCG_PERCPU_B) + return true; + return vmstat_item_in_bytes(idx); +} + static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg) { return (memcg == root_mem_cgroup); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 8d9ceea7fe4d..36d5300f9b69 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -781,7 +781,7 @@ void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val) if (mem_cgroup_disabled()) return; - if (vmstat_item_in_bytes(idx)) + if (memcg_stat_item_in_bytes(idx)) threshold <<= PAGE_SHIFT; x = val + __this_cpu_read(memcg->vmstats_percpu->stat[idx]); @@ -1488,6 +1488,8 @@ static char *memory_stat_format(struct mem_cgroup *memcg) seq_buf_printf(&s, "slab %llu\n", (u64)(memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B) + memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE_B))); + seq_buf_printf(&s, "percpu %llu\n", + (u64)memcg_page_state(memcg, MEMCG_PERCPU_B)); seq_buf_printf(&s, "sock %llu\n", (u64)memcg_page_state(memcg, MEMCG_SOCK) * PAGE_SIZE); diff --git a/mm/percpu.c b/mm/percpu.c index dc1a213293aa..f4709629e6de 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -1610,6 +1610,11 @@ static void pcpu_memcg_post_alloc_hook(struct obj_cgroup *objcg, if (chunk) { chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = objcg; + + rcu_read_lock(); + mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B, + size * num_possible_cpus()); + rcu_read_unlock(); } else { obj_cgroup_uncharge(objcg, size * num_possible_cpus()); obj_cgroup_put(objcg); @@ -1628,6 +1633,11 @@ static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size) obj_cgroup_uncharge(objcg, size * num_possible_cpus()); + rcu_read_lock(); + mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B, + -(size * num_possible_cpus())); + rcu_read_unlock(); + obj_cgroup_put(objcg); } From 3e38e0aaca9eafb12b1c4b731d1c10975cbe7974 Mon Sep 17 00:00:00 2001 From: Roman Gushchin Date: Tue, 11 Aug 2020 18:30:25 -0700 Subject: [PATCH 004/164] mm: memcg: charge memcg percpu memory to the parent cgroup MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Memory cgroups are using large chunks of percpu memory to store vmstat data. Yet this memory is not accounted at all, so in the case when there are many (dying) cgroups, it's not exactly clear where all the memory is. Because the size of memory cgroup internal structures can dramatically exceed the size of object or page which is pinning it in the memory, it's not a good idea to simply ignore it. It actually breaks the isolation between cgroups. Let's account the consumed percpu memory to the parent cgroup. [guro@fb.com: add WARN_ON_ONCE()s, per Johannes] Link: http://lkml.kernel.org/r/20200811170611.GB1507044@carbon.DHCP.thefacebook.com Signed-off-by: Roman Gushchin Signed-off-by: Andrew Morton Reviewed-by: Shakeel Butt Acked-by: Dennis Zhou Acked-by: Johannes Weiner Cc: Christoph Lameter Cc: David Rientjes Cc: Joonsoo Kim Cc: Mel Gorman Cc: Michal Hocko Cc: Pekka Enberg Cc: Tejun Heo Cc: Tobin C. 
Harding Cc: Vlastimil Babka Cc: Waiman Long Cc: Bixuan Cui Cc: Michal Koutný Cc: Stephen Rothwell Link: http://lkml.kernel.org/r/20200623184515.4132564-5-guro@fb.com Signed-off-by: Linus Torvalds --- mm/memcontrol.c | 20 ++++++++++++++++---- 1 file changed, 16 insertions(+), 4 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 36d5300f9b69..f1fd265b9f9e 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5131,13 +5131,18 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node) if (!pn) return 1; - pn->lruvec_stat_local = alloc_percpu(struct lruvec_stat); + /* We charge the parent cgroup, never the current task */ + WARN_ON_ONCE(!current->active_memcg); + + pn->lruvec_stat_local = alloc_percpu_gfp(struct lruvec_stat, + GFP_KERNEL_ACCOUNT); if (!pn->lruvec_stat_local) { kfree(pn); return 1; } - pn->lruvec_stat_cpu = alloc_percpu(struct lruvec_stat); + pn->lruvec_stat_cpu = alloc_percpu_gfp(struct lruvec_stat, + GFP_KERNEL_ACCOUNT); if (!pn->lruvec_stat_cpu) { free_percpu(pn->lruvec_stat_local); kfree(pn); @@ -5211,11 +5216,16 @@ static struct mem_cgroup *mem_cgroup_alloc(void) goto fail; } - memcg->vmstats_local = alloc_percpu(struct memcg_vmstats_percpu); + /* We charge the parent cgroup, never the current task */ + WARN_ON_ONCE(!current->active_memcg); + + memcg->vmstats_local = alloc_percpu_gfp(struct memcg_vmstats_percpu, + GFP_KERNEL_ACCOUNT); if (!memcg->vmstats_local) goto fail; - memcg->vmstats_percpu = alloc_percpu(struct memcg_vmstats_percpu); + memcg->vmstats_percpu = alloc_percpu_gfp(struct memcg_vmstats_percpu, + GFP_KERNEL_ACCOUNT); if (!memcg->vmstats_percpu) goto fail; @@ -5264,7 +5274,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css) struct mem_cgroup *memcg; long error = -ENOMEM; + memalloc_use_memcg(parent); memcg = mem_cgroup_alloc(); + memalloc_unuse_memcg(); if (IS_ERR(memcg)) return ERR_CAST(memcg); From 90631e1dea55bfc14a8ee8c594aa136b243d4c88 Mon Sep 17 00:00:00 2001 From: Roman Gushchin Date: Tue, 11 Aug 2020 18:30:29 -0700 Subject: [PATCH 005/164] kselftests: cgroup: add percpu memory accounting test MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add a simple test to check the percpu memory accounting. The test creates a cgroup tree with 1000 child cgroups and checks the values of memory.current and memory.stat::percpu. Signed-off-by: Roman Gushchin Signed-off-by: Andrew Morton Cc: Christoph Lameter Cc: David Rientjes Cc: Dennis Zhou Cc: Johannes Weiner Cc: Joonsoo Kim Cc: Mel Gorman Cc: Michal Hocko Cc: Pekka Enberg Cc: Shakeel Butt Cc: Tejun Heo Cc: Tobin C. Harding Cc: Vlastimil Babka Cc: Waiman Long Cc: Michal Koutný Cc: Bixuan Cui Cc: Stephen Rothwell Link: http://lkml.kernel.org/r/20200608230819.832349-6-guro@fb.com Signed-off-by: Linus Torvalds --- tools/testing/selftests/cgroup/test_kmem.c | 70 +++++++++++++++++++++- 1 file changed, 69 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/cgroup/test_kmem.c b/tools/testing/selftests/cgroup/test_kmem.c index 5224dae216e5..0941aa16157e 100644 --- a/tools/testing/selftests/cgroup/test_kmem.c +++ b/tools/testing/selftests/cgroup/test_kmem.c @@ -18,6 +18,15 @@ #include "cgroup_util.h" +/* + * Memory cgroup charging and vmstat data aggregation is performed using + * percpu batches of 32 pages (see MEMCG_CHARGE_BATCH). So the maximum + * discrepancy between the charge and vmstat entries is the number of cpus + * multiplied by 32 pages multiplied by 2.
+ */ +#define MAX_VMSTAT_ERROR (4096 * 32 * 2 * get_nprocs()) + + static int alloc_dcache(const char *cgroup, void *arg) { unsigned long i; @@ -180,7 +189,7 @@ static int test_kmem_memcg_deletion(const char *root) goto cleanup; sum = slab + anon + file + kernel_stack; - if (abs(sum - current) < 4096 * 32 * 2 * get_nprocs()) { + if (abs(sum - current) < MAX_VMSTAT_ERROR) { ret = KSFT_PASS; } else { printf("memory.current = %ld\n", current); @@ -331,6 +340,64 @@ static int test_kmem_dead_cgroups(const char *root) return ret; } +/* + * This test creates a sub-tree with 1000 memory cgroups. + * Then it checks that the memory.current on the parent level + * is greater than 0 and approximately matches the percpu value + * from memory.stat. + */ +static int test_percpu_basic(const char *root) { + int ret = KSFT_FAIL; + char *parent, *child; + long current, percpu; + int i; + + parent = cg_name(root, "percpu_basic_test"); + if (!parent) + goto cleanup; + + if (cg_create(parent)) + goto cleanup; + + if (cg_write(parent, "cgroup.subtree_control", "+memory")) + goto cleanup; + + for (i = 0; i < 1000; i++) { + child = cg_name_indexed(parent, "child", i); + if (!child) + return -1; + + if (cg_create(child)) + goto cleanup_children; + + free(child); + } + + current = cg_read_long(parent, "memory.current"); + percpu = cg_read_key_long(parent, "memory.stat", "percpu "); + + if (current > 0 && percpu > 0 && abs(current - percpu) < + MAX_VMSTAT_ERROR) + ret = KSFT_PASS; + else + printf("memory.current %ld\npercpu %ld\n", + current, percpu); + +cleanup_children: + for (i = 0; i < 1000; i++) { + child = cg_name_indexed(parent, "child", i); + cg_destroy(child); + free(child); + } + +cleanup: + cg_destroy(parent); + free(parent); + + return ret; +} + #define T(x) { x, #x } struct kmem_test { int (*fn)(const char *root); @@ -341,6 +408,7 @@ struct kmem_test { T(test_kmem_proc_kpagecgroup), T(test_kmem_kernel_stacks), T(test_kmem_dead_cgroups), + T(test_percpu_basic), }; #undef T From 8ca39e6874f812a393bb66d9fdbb7598d5f0451c Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Tue, 11 Aug 2020 18:30:32 -0700 Subject: [PATCH 006/164] mm/hugetlb: add mempolicy check in the reservation routine In the reservation routine, we only check whether the cpuset meets the memory allocation requirements. But we ignore the mempolicy in the MPOL_BIND case. An mmap of hugetlb may succeed, but the subsequent memory allocation can fail due to mempolicy restrictions, and the process then receives a SIGBUS signal. This can be reproduced by the following steps.

1) Compile the test case.
   cd tools/testing/selftests/vm/
   gcc map_hugetlb.c -o map_hugetlb

2) Pre-allocate huge pages. Suppose there are 2 numa nodes in the system. Each node will pre-allocate one huge page.
   echo 2 > /proc/sys/vm/nr_hugepages

3) Run the test case (mmap 4MB). We receive the SIGBUS signal.
   numactl --membind=0 ./map_hugetlb 4

With this patch applied, the mmap in step 3) will fail and throw "mmap: Cannot allocate memory".
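For reference, the failing mmap in step 3) boils down to roughly the following; this is a simplified sketch in the spirit of tools/testing/selftests/vm/map_hugetlb.c, assuming a 2MB default huge page size:

	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/mman.h>

	#ifndef MAP_HUGETLB
	#define MAP_HUGETLB 0x40000	/* arch-specific fallback, as in the selftest */
	#endif

	#define LENGTH (4UL * 1024 * 1024)	/* 4MB = two 2MB huge pages */

	int main(void)
	{
		void *addr = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE,
				  MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
		if (addr == MAP_FAILED) {
			/* with this patch: fails here under numactl --membind=0 */
			perror("mmap");
			exit(1);
		}
		/* without the patch, this first fault could raise SIGBUS instead */
		*(char *)addr = 1;
		munmap(addr, LENGTH);
		return 0;
	}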
[akpm@linux-foundation.org: include sched.h for `current'] Reported-by: Jianchao Guo Suggested-by: Michal Hocko Signed-off-by: Muchun Song Signed-off-by: Andrew Morton Reviewed-by: Mike Kravetz Cc: David Rientjes Cc: Mel Gorman Cc: Michel Lespinasse Cc: Baoquan He Link: http://lkml.kernel.org/r/20200728034938.14993-1-songmuchun@bytedance.com Signed-off-by: Linus Torvalds --- include/linux/mempolicy.h | 16 +++++++++++++++- mm/hugetlb.c | 22 ++++++++++++++++++---- mm/mempolicy.c | 2 +- 3 files changed, 34 insertions(+), 6 deletions(-) diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h index ea9c15b60a96..5f1648c23e29 100644 --- a/include/linux/mempolicy.h +++ b/include/linux/mempolicy.h @@ -6,7 +6,7 @@ #ifndef _LINUX_MEMPOLICY_H #define _LINUX_MEMPOLICY_H 1 - +#include #include #include #include @@ -152,6 +152,15 @@ extern int huge_node(struct vm_area_struct *vma, extern bool init_nodemask_of_mempolicy(nodemask_t *mask); extern bool mempolicy_nodemask_intersects(struct task_struct *tsk, const nodemask_t *mask); +extern nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy); + +static inline nodemask_t *policy_nodemask_current(gfp_t gfp) +{ + struct mempolicy *mpol = get_task_policy(current); + + return policy_nodemask(gfp, mpol); +} + extern unsigned int mempolicy_slab_node(void); extern enum zone_type policy_zone; @@ -281,5 +290,10 @@ static inline int mpol_misplaced(struct page *page, struct vm_area_struct *vma, static inline void mpol_put_task_policy(struct task_struct *task) { } + +static inline nodemask_t *policy_nodemask_current(gfp_t gfp) +{ + return NULL; +} #endif /* CONFIG_NUMA */ #endif diff --git a/mm/hugetlb.c b/mm/hugetlb.c index e52c878940bb..dffafb5bf2ed 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -3458,13 +3458,21 @@ static int __init default_hugepagesz_setup(char *s) } __setup("default_hugepagesz=", default_hugepagesz_setup); -static unsigned int cpuset_mems_nr(unsigned int *array) +static unsigned int allowed_mems_nr(struct hstate *h) { int node; unsigned int nr = 0; + nodemask_t *mpol_allowed; + unsigned int *array = h->free_huge_pages_node; + gfp_t gfp_mask = htlb_alloc_mask(h); - for_each_node_mask(node, cpuset_current_mems_allowed) - nr += array[node]; + mpol_allowed = policy_nodemask_current(gfp_mask); + + for_each_node_mask(node, cpuset_current_mems_allowed) { + if (!mpol_allowed || + (mpol_allowed && node_isset(node, *mpol_allowed))) + nr += array[node]; + } return nr; } @@ -3643,12 +3651,18 @@ static int hugetlb_acct_memory(struct hstate *h, long delta) * we fall back to check against current free page availability as * a best attempt and hopefully to minimize the impact of changing * semantics that cpuset has. + * + * Apart from cpuset, we also have memory policy mechanism that + * also determines from which node the kernel will allocate memory + * in a NUMA system. So similar to cpuset, we also should consider + * the memory policy of the current task. Similar to the description + * above. 
*/ if (delta > 0) { if (gather_surplus_pages(h, delta) < 0) goto out; - if (delta > cpuset_mems_nr(h->free_huge_pages_node)) { + if (delta > allowed_mems_nr(h)) { return_unused_surplus_pages(h, delta); goto out; } diff --git a/mm/mempolicy.c b/mm/mempolicy.c index b9e85d467352..7af44d7cdd11 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1890,7 +1890,7 @@ static int apply_policy_zone(struct mempolicy *policy, enum zone_type zone) * Return a nodemask representing a mempolicy for filtering nodes for * page allocation */ -static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy) +nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy) { /* Lower zones don't get a nodemask applied for MPOL_BIND */ if (unlikely(policy->mode == MPOL_BIND) && From ccc5dc67340c109e624e07e02790e9fbdec900d6 Mon Sep 17 00:00:00 2001 From: Joonsoo Kim Date: Tue, 11 Aug 2020 18:30:36 -0700 Subject: [PATCH 007/164] mm/vmscan: make active/inactive ratio 1:1 for anon lru Patch series "workingset protection/detection on the anonymous LRU list", v7. * PROBLEM In the current implementation, newly created or swapped-in anonymous pages start on the active list. Growing the active list results in rebalancing the active/inactive lists, so old pages on the active list are demoted to the inactive list. Hence, hot pages on the active list aren't protected at all. Following is an example of this situation. Assume that there are 50 hot pages on the active list and the system can contain 100 pages in total. Numbers denote the number of pages on the active/inactive list (active | inactive). (h) stands for hot pages and (uo) stands for used-once pages.

1. 50 hot pages on active list
   50(h) | 0
2. workload: 50 newly created (used-once) pages
   50(uo) | 50(h)
3. workload: another 50 newly created (used-once) pages
   50(uo) | 50(uo), swap-out 50(h)

As we can see, hot pages are swapped out, which will cause swap-ins later. * SOLUTION Since this is what we want to avoid, this patchset implements workingset protection. As with the file LRU list, newly created or swapped-in anonymous pages start on the inactive list. Also, as with the file LRU list, if enough references happen, the page will be promoted. This simple modification changes the above example as follows.

1. 50 hot pages on active list
   50(h) | 0
2. workload: 50 newly created (used-once) pages
   50(h) | 50(uo)
3. workload: another 50 newly created (used-once) pages
   50(h) | 50(uo), swap-out 50(uo)

Hot pages remain in the active list. :) * EXPERIMENT I tested this scenario on my test bed and confirmed that this problem happens with the current implementation. I also checked that it is fixed by this patchset. * SUBJECT workingset detection * PROBLEM The later part of the patchset implements workingset detection for the anonymous LRU list. There is a corner case where workingset protection could cause thrashing. If we can avoid thrashing by workingset detection, we can get better performance. Following is an example of thrashing due to workingset protection.

1. 50 hot pages on active list
   50(h) | 0
2. workload: 50 newly created (will be hot) pages
   50(h) | 50(wh)
3. workload: another 50 newly created (used-once) pages
   50(h) | 50(uo), swap-out 50(wh)
4. workload: 50 (will be hot) pages
   50(h) | 50(wh), swap-in 50(wh)
5. workload: another 50 newly created (used-once) pages
   50(h) | 50(uo), swap-out 50(wh)
6. repeat 4, 5

Without workingset detection, this kind of workload cannot be promoted and thrashing happens forever. * SOLUTION Therefore, this patchset implements workingset detection.
All the infrastructure for workingset detection is already implemented, so there is not much work to do. First, extend the workingset detection code to deal with the anonymous LRU list. Then, make the swap cache handle the exceptional value for the shadow entry. Lastly, install/retrieve the shadow value into/from the swap cache and check the refault distance. * EXPERIMENT I made a test program that imitates the above scenario and confirmed that the problem exists. Then, I checked that this patchset fixes it. My test setup is a virtual machine with 8 cpus and 6100MB memory. But, the amount of memory that the test program can use is about 280 MB. This is because the system uses a large ram-backed swap and a large ramdisk to capture the trace. The test scenario is as follows.

1. allocate cold memory (512MB)
2. allocate hot-1 memory (96MB)
3. activate hot-1 memory (96MB)
4. allocate another hot-2 memory (96MB)
5. access cold memory (128MB)
6. access hot-2 memory (96MB)
7. repeat 5, 6

Since hot-1 memory (96MB) is on the active list, the inactive list can contain roughly 190MB of pages. hot-2 memory's re-access interval (96+128 MB) is more than 190MB, so it cannot be promoted without workingset detection, and swap-in/out happens repeatedly. With this patchset, workingset detection works and promotion happens. Therefore, swap-in/out occurs less. Here is the result. (average of 5 runs)

type swap-in swap-out
base 863240 989945
patch 681565 809273

As we can see, the patched kernel does less swap-in/out. * OVERALL TEST (ebizzy using modified random function) ebizzy is a test program in which the main thread allocates lots of memory and child threads access it randomly for a given time. Swap-in will happen if the allocated memory is larger than the system memory. A random function that follows the zipf distribution is used to make hot/cold memory. The hot/cold ratio is controlled by a parameter. If the parameter is high, hot memory is accessed much more than cold memory. If the parameter is low, the number of accesses to each region would be similar. I use various parameters in order to show the effect of the patchset on various hot/cold ratio workloads. My test setup is a virtual machine with 8 cpus, 1024 MB memory and 5120 MB ram swap. The result format is as follows.
param: 1-1024-0.1
- 1 (number of threads)
- 1024 (allocated memory size, MB)
- 0.1 (zipf distribution alpha; 0.1 behaves roughly like uniform random, 1.3 behaves as if a small portion of memory is hot and the rest is cold)

pswpin: smaller is better
std: standard deviation
improvement: negative is better

* single thread
param pswpin std improvement
base 1-1024.0-0.1 14101983.40 79441.19
prot 1-1024.0-0.1 14065875.80 136413.01 ( -0.26 )
detect 1-1024.0-0.1 13910435.60 100804.82 ( -1.36 )
base 1-1024.0-0.7 7998368.80 43469.32
prot 1-1024.0-0.7 7622245.80 88318.74 ( -4.70 )
detect 1-1024.0-0.7 7618515.20 59742.07 ( -4.75 )
base 1-1024.0-1.3 1017400.80 38756.30
prot 1-1024.0-1.3 940464.60 29310.69 ( -7.56 )
detect 1-1024.0-1.3 945511.40 24579.52 ( -7.07 )
base 1-1280.0-0.1 22895541.40 50016.08
prot 1-1280.0-0.1 22860305.40 51952.37 ( -0.15 )
detect 1-1280.0-0.1 22705565.20 93380.35 ( -0.83 )
base 1-1280.0-0.7 13717645.60 46250.65
prot 1-1280.0-0.7 12935355.80 64754.43 ( -5.70 )
detect 1-1280.0-0.7 13040232.00 63304.00 ( -4.94 )
base 1-1280.0-1.3 1654251.40 4159.68
prot 1-1280.0-1.3 1522680.60 33673.50 ( -7.95 )
detect 1-1280.0-1.3 1599207.00 70327.89 ( -3.33 )
base 1-1536.0-0.1 31621775.40 31156.28
prot 1-1536.0-0.1 31540355.20 62241.36 ( -0.26 )
detect 1-1536.0-0.1 31420056.00 123831.27 ( -0.64 )
base 1-1536.0-0.7 19620760.60 60937.60
prot 1-1536.0-0.7 18337839.60 56102.58 ( -6.54 )
detect 1-1536.0-0.7 18599128.00 75289.48 ( -5.21 )
base 1-1536.0-1.3 2378142.40 20994.43
prot 1-1536.0-1.3 2166260.60 48455.46 ( -8.91 )
detect 1-1536.0-1.3 2183762.20 16883.24 ( -8.17 )
base 1-1792.0-0.1 40259714.80 90750.70
prot 1-1792.0-0.1 40053917.20 64509.47 ( -0.51 )
detect 1-1792.0-0.1 39949736.40 104989.64 ( -0.77 )
base 1-1792.0-0.7 25704884.40 69429.68
prot 1-1792.0-0.7 23937389.00 79945.60 ( -6.88 )
detect 1-1792.0-0.7 24271902.00 35044.30 ( -5.57 )
base 1-1792.0-1.3 3129497.00 32731.86
prot 1-1792.0-1.3 2796994.40 19017.26 ( -10.62 )
detect 1-1792.0-1.3 2886840.40 33938.82 ( -7.75 )
base 1-2048.0-0.1 48746924.40 50863.88
prot 1-2048.0-0.1 48631954.40 24537.30 ( -0.24 )
detect 1-2048.0-0.1 48509419.80 27085.34 ( -0.49 )
base 1-2048.0-0.7 32046424.40 78624.22
prot 1-2048.0-0.7 29764182.20 86002.26 ( -7.12 )
detect 1-2048.0-0.7 30250315.80 101282.14 ( -5.60 )
base 1-2048.0-1.3 3916723.60 24048.55
prot 1-2048.0-1.3 3490781.60 33292.61 ( -10.87 )
detect 1-2048.0-1.3 3585002.20 44942.04 ( -8.47 )

* multi thread
param pswpin std improvement
base 8-1024.0-0.1 16219822.60 329474.01
prot 8-1024.0-0.1 15959494.00 654597.45 ( -1.61 )
detect 8-1024.0-0.1 15773790.80 502275.25 ( -2.75 )
base 8-1024.0-0.7 9174107.80 537619.33
prot 8-1024.0-0.7 8571915.00 385230.08 ( -6.56 )
detect 8-1024.0-0.7 8489484.20 364683.00 ( -7.46 )
base 8-1024.0-1.3 1108495.60 83555.98
prot 8-1024.0-1.3 1038906.20 63465.20 ( -6.28 )
detect 8-1024.0-1.3 941817.80 32648.80 ( -15.04 )
base 8-1280.0-0.1 25776114.20 450480.45
prot 8-1280.0-0.1 25430847.00 465627.07 ( -1.34 )
detect 8-1280.0-0.1 25282555.00 465666.55 ( -1.91 )
base 8-1280.0-0.7 15218968.00 702007.69
prot 8-1280.0-0.7 13957947.80 492643.86 ( -8.29 )
detect 8-1280.0-0.7 14158331.20 238656.02 ( -6.97 )
base 8-1280.0-1.3 1792482.80 30512.90
prot 8-1280.0-1.3 1577686.40 34002.62 ( -11.98 )
detect 8-1280.0-1.3 1556133.00 22944.79 ( -13.19 )
base 8-1536.0-0.1 33923761.40 575455.85
prot 8-1536.0-0.1 32715766.20 300633.51 ( -3.56 )
detect 8-1536.0-0.1 33158477.40 117764.51 ( -2.26 )
base 8-1536.0-0.7 20628907.80 303851.34
prot 8-1536.0-0.7 19329511.20 341719.31 ( -6.30 )
detect 8-1536.0-0.7 20013934.00 385358.66 ( -2.98 )
base 8-1536.0-1.3 2588106.40 130769.20
prot 8-1536.0-1.3 2275222.40 89637.06 ( -12.09 )
detect 8-1536.0-1.3 2365008.40 124412.55 ( -8.62 )
base 8-1792.0-0.1 43328279.20 946469.12
prot 8-1792.0-0.1 41481980.80 525690.89 ( -4.26 )
detect 8-1792.0-0.1 41713944.60 406798.93 ( -3.73 )
base 8-1792.0-0.7 27155647.40 536253.57
prot 8-1792.0-0.7 24989406.80 502734.52 ( -7.98 )
detect 8-1792.0-0.7 25524806.40 263237.87 ( -6.01 )
base 8-1792.0-1.3 3260372.80 137907.92
prot 8-1792.0-1.3 2879187.80 63597.26 ( -11.69 )
detect 8-1792.0-1.3 2892962.20 33229.13 ( -11.27 )
base 8-2048.0-0.1 50583989.80 710121.48
prot 8-2048.0-0.1 49599984.40 228782.42 ( -1.95 )
detect 8-2048.0-0.1 50578596.00 660971.66 ( -0.01 )
base 8-2048.0-0.7 33765479.60 812659.55
prot 8-2048.0-0.7 30767021.20 462907.24 ( -8.88 )
detect 8-2048.0-0.7 32213068.80 211884.24 ( -4.60 )
base 8-2048.0-1.3 3941675.80 28436.45
prot 8-2048.0-1.3 3538742.40 76856.08 ( -10.22 )
detect 8-2048.0-1.3 3579397.80 58630.95 ( -9.19 )

As we can see, all the cases show improvement. In particular, the test cases with zipf distribution 1.3 show more improvement. It means that if there is a hot/cold tendency in anon pages, this patchset works better. This patch (of 6): The current implementation of LRU management for anonymous pages has some problems. The most important one is that it doesn't protect the workingset, that is, pages on the active LRU list. Although this problem will be fixed in the following patchset, the preparation is required and this patch does it. What the following patch does is implement workingset protection. After the following patchset, newly created or swapped-in pages will start their lifetime on the inactive list. If the inactive list is too small, pages have too little chance of being referenced again and cannot become part of the workingset. In order to give newly created anonymous or swapped-in pages enough chance to be referenced again, this patch makes the active/inactive LRU ratio 1:1. This is just a temporary measure. A later patch in the series introduces workingset detection for the anonymous LRU that will be used to better decide if pages should start on the active or inactive list. Afterwards this patch is effectively reverted. Signed-off-by: Joonsoo Kim Signed-off-by: Andrew Morton Acked-by: Johannes Weiner Acked-by: Vlastimil Babka Cc: Michal Hocko Cc: Hugh Dickins Cc: Minchan Kim Cc: Mel Gorman Cc: Matthew Wilcox Link: http://lkml.kernel.org/r/1595490560-15117-1-git-send-email-iamjoonsoo.kim@lge.com Link: http://lkml.kernel.org/r/1595490560-15117-2-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds --- mm/vmscan.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 72da290b171b..e34fc04b7045 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2208,7 +2208,7 @@ static bool inactive_is_low(struct lruvec *lruvec, enum lru_list inactive_lru) active = lruvec_page_state(lruvec, NR_LRU_BASE + active_lru); gb = (inactive + active) >> (30 - PAGE_SHIFT); - if (gb) + if (gb && is_file_lru(inactive_lru)) inactive_ratio = int_sqrt(10 * gb); else inactive_ratio = 1; From b518154e59aab3ad0780a169c5cc84bd4ee4357e Mon Sep 17 00:00:00 2001 From: Joonsoo Kim Date: Tue, 11 Aug 2020 18:30:40 -0700 Subject: [PATCH 008/164] mm/vmscan: protect the workingset on anonymous LRU In the current implementation, newly created or swapped-in anonymous pages start on the active list.
Growing the active list results in rebalancing the active/inactive lists, so old pages on the active list are demoted to the inactive list. Hence, pages on the active list aren't protected at all. Following is an example of this situation. Assume that there are 50 hot pages on the active list. Numbers denote the number of pages on the active/inactive list (active | inactive).

1. 50 hot pages on active list
   50(h) | 0
2. workload: 50 newly created (used-once) pages
   50(uo) | 50(h)
3. workload: another 50 newly created (used-once) pages
   50(uo) | 50(uo), swap-out 50(h)

This patch tries to fix this issue. As with the file LRU, newly created or swapped-in anonymous pages are inserted into the inactive list. They are promoted to the active list if enough references happen. This simple modification changes the above example as follows.

1. 50 hot pages on active list
   50(h) | 0
2. workload: 50 newly created (used-once) pages
   50(h) | 50(uo)
3. workload: another 50 newly created (used-once) pages
   50(h) | 50(uo), swap-out 50(uo)

As you can see, hot pages on the active list are now protected. Note that this implementation has a drawback: a page cannot be promoted and will be swapped out if its re-access interval is greater than the size of the inactive list but less than the size of the total (active+inactive). To solve this potential issue, the following patch applies workingset detection similar to the one already applied to the file LRU. Signed-off-by: Joonsoo Kim Signed-off-by: Andrew Morton Acked-by: Johannes Weiner Acked-by: Vlastimil Babka Cc: Hugh Dickins Cc: Matthew Wilcox Cc: Mel Gorman Cc: Michal Hocko Cc: Minchan Kim Link: http://lkml.kernel.org/r/1595490560-15117-3-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds --- include/linux/swap.h | 2 +- kernel/events/uprobes.c | 2 +- mm/huge_memory.c | 2 +- mm/khugepaged.c | 2 +- mm/memory.c | 9 ++++----- mm/migrate.c | 2 +- mm/swap.c | 13 +++++++------ mm/swapfile.c | 2 +- mm/userfaultfd.c | 2 +- mm/vmscan.c | 4 +--- 10 files changed, 19 insertions(+), 21 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 7eb59bc552a5..51ec9cdb92c0 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -352,7 +352,7 @@ extern void deactivate_page(struct page *page); extern void mark_page_lazyfree(struct page *page); extern void swap_setup(void); -extern void lru_cache_add_active_or_unevictable(struct page *page, +extern void lru_cache_add_inactive_or_unevictable(struct page *page, struct vm_area_struct *vma); /* linux/mm/vmscan.c */ diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 25de10c904e6..49047d479c57 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -184,7 +184,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr, if (new_page) { get_page(new_page); page_add_new_anon_rmap(new_page, vma, addr, false); - lru_cache_add_active_or_unevictable(new_page, vma); + lru_cache_add_inactive_or_unevictable(new_page, vma); } else /* no new page, just dec_mm_counter for old_page */ dec_mm_counter(mm, MM_ANONPAGES); diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 206f52b36ffb..863c495776d7 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -640,7 +640,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf, entry = mk_huge_pmd(page, vma->vm_page_prot); entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); page_add_new_anon_rmap(page, vma, haddr, true); - lru_cache_add_active_or_unevictable(page, vma); + lru_cache_add_inactive_or_unevictable(page, vma);
pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable); set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry); add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR); diff --git a/mm/khugepaged.c b/mm/khugepaged.c index b52bd46ad146..15a9af791014 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1173,7 +1173,7 @@ static void collapse_huge_page(struct mm_struct *mm, spin_lock(pmd_ptl); BUG_ON(!pmd_none(*pmd)); page_add_new_anon_rmap(new_page, vma, address, true); - lru_cache_add_active_or_unevictable(new_page, vma); + lru_cache_add_inactive_or_unevictable(new_page, vma); pgtable_trans_huge_deposit(mm, pmd, pgtable); set_pmd_at(mm, address, pmd, _pmd); update_mmu_cache_pmd(vma, address, pmd); diff --git a/mm/memory.c b/mm/memory.c index c39a13b09602..6fe8b5b22c57 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2715,7 +2715,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) */ ptep_clear_flush_notify(vma, vmf->address, vmf->pte); page_add_new_anon_rmap(new_page, vma, vmf->address, false); - lru_cache_add_active_or_unevictable(new_page, vma); + lru_cache_add_inactive_or_unevictable(new_page, vma); /* * We call the notify macro here because, when using secondary * mmu page tables (such as kvm shadow page tables), we want the @@ -3266,10 +3266,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) /* ksm created a completely new copy */ if (unlikely(page != swapcache && swapcache)) { page_add_new_anon_rmap(page, vma, vmf->address, false); - lru_cache_add_active_or_unevictable(page, vma); + lru_cache_add_inactive_or_unevictable(page, vma); } else { do_page_add_anon_rmap(page, vma, vmf->address, exclusive); - activate_page(page); } swap_free(entry); @@ -3414,7 +3413,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES); page_add_new_anon_rmap(page, vma, vmf->address, false); - lru_cache_add_active_or_unevictable(page, vma); + lru_cache_add_inactive_or_unevictable(page, vma); setpte: set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); @@ -3672,7 +3671,7 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page) if (write && !(vma->vm_flags & VM_SHARED)) { inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES); page_add_new_anon_rmap(page, vma, vmf->address, false); - lru_cache_add_active_or_unevictable(page, vma); + lru_cache_add_inactive_or_unevictable(page, vma); } else { inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page)); page_add_file_rmap(page, false); diff --git a/mm/migrate.c b/mm/migrate.c index d179657f8685..819e55130134 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2830,7 +2830,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate, inc_mm_counter(mm, MM_ANONPAGES); page_add_new_anon_rmap(page, vma, addr, false); if (!is_zone_device_page(page)) - lru_cache_add_active_or_unevictable(page, vma); + lru_cache_add_inactive_or_unevictable(page, vma); get_page(page); if (flush) { diff --git a/mm/swap.c b/mm/swap.c index de257c0a89b1..9285e60c7d6e 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -476,23 +476,24 @@ void lru_cache_add(struct page *page) EXPORT_SYMBOL(lru_cache_add); /** - * lru_cache_add_active_or_unevictable + * lru_cache_add_inactive_or_unevictable * @page: the page to be added to LRU * @vma: vma in which page is mapped for determining reclaimability * - * Place @page on the active or unevictable LRU list, depending on its + * Place @page on the inactive or unevictable LRU list, depending on its * evictability. 
Note that if the page is not evictable, it goes * directly back onto it's zone's unevictable list, it does NOT use a * per cpu pagevec. */ -void lru_cache_add_active_or_unevictable(struct page *page, +void lru_cache_add_inactive_or_unevictable(struct page *page, struct vm_area_struct *vma) { + bool unevictable; + VM_BUG_ON_PAGE(PageLRU(page), page); - if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED)) - SetPageActive(page); - else if (!TestSetPageMlocked(page)) { + unevictable = (vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) == VM_LOCKED; + if (unlikely(unevictable) && !TestSetPageMlocked(page)) { /* * We use the irq-unsafe __mod_zone_page_stat because this * counter is not modified from interrupt context, and the pte diff --git a/mm/swapfile.c b/mm/swapfile.c index 6c26916e95fd..82183432fdd0 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1915,7 +1915,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd, page_add_anon_rmap(page, vma, addr, false); } else { /* ksm created a completely new copy */ page_add_new_anon_rmap(page, vma, addr, false); - lru_cache_add_active_or_unevictable(page, vma); + lru_cache_add_inactive_or_unevictable(page, vma); } swap_free(entry); /* diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index b80419320c7d..9a3d451402d7 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -123,7 +123,7 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm, inc_mm_counter(dst_mm, MM_ANONPAGES); page_add_new_anon_rmap(page, dst_vma, dst_addr, false); - lru_cache_add_active_or_unevictable(page, dst_vma); + lru_cache_add_inactive_or_unevictable(page, dst_vma); set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte); diff --git a/mm/vmscan.c b/mm/vmscan.c index e34fc04b7045..783cd7fdc61a 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -998,8 +998,6 @@ static enum page_references page_check_references(struct page *page, return PAGEREF_RECLAIM; if (referenced_ptes) { - if (PageSwapBacked(page)) - return PAGEREF_ACTIVATE; /* * All mapped pages start out with page table * references from the instantiating fault, so we need @@ -1022,7 +1020,7 @@ static enum page_references page_check_references(struct page *page, /* * Activate file-backed executable pages after first usage. */ - if (vm_flags & VM_EXEC) + if ((vm_flags & VM_EXEC) && !PageSwapBacked(page)) return PAGEREF_ACTIVATE; return PAGEREF_KEEP; From 170b04b7ae49634df103810dad67b22cf8a99aa6 Mon Sep 17 00:00:00 2001 From: Joonsoo Kim Date: Tue, 11 Aug 2020 18:30:43 -0700 Subject: [PATCH 009/164] mm/workingset: prepare the workingset detection infrastructure for anon LRU To prepare the workingset detection for anon LRU, this patch splits workingset event counters for refault, activate and restore into anon and file variants, as well as the refaults counter in struct lruvec. 
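The split described here leans on a small indexing idiom, visible in the mmzone.h hunk below, that is easy to miss when reading the diff. The following standalone sketch (plain userspace C with simplified stand-in names, not the kernel's actual enum) shows how anchoring the anon variant at a *_BASE value lets a boolean "is this a file page?" flag select the right counter arithmetically:

	#include <stdbool.h>
	#include <stdio.h>

	/* Simplified stand-ins for the split node_stat_item entries. */
	enum ws_stat {
		WS_REFAULT_BASE,		/* anon variant sits at the base */
		WS_REFAULT_ANON = WS_REFAULT_BASE,
		WS_REFAULT_FILE,		/* file variant is base + 1 */
		NR_WS_STATS,
	};

	static unsigned long ws_stats[NR_WS_STATS];

	/* Mirrors inc_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file). */
	static void count_refault(bool is_file)
	{
		ws_stats[WS_REFAULT_BASE + is_file]++;
	}

	int main(void)
	{
		count_refault(false);	/* anonymous page refault */
		count_refault(true);	/* page-cache (file) refault */
		printf("anon=%lu file=%lu\n",
		       ws_stats[WS_REFAULT_ANON], ws_stats[WS_REFAULT_FILE]);
		return 0;
	}

This is why the later hunks can write inc_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file) instead of branching on the page type at every call site.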
Signed-off-by: Joonsoo Kim Signed-off-by: Andrew Morton Acked-by: Johannes Weiner Acked-by: Vlastimil Babka Cc: Hugh Dickins Cc: Matthew Wilcox Cc: Mel Gorman Cc: Michal Hocko Cc: Minchan Kim Link: http://lkml.kernel.org/r/1595490560-15117-4-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds --- include/linux/mmzone.h | 16 +++++++++++----- mm/memcontrol.c | 16 +++++++++++----- mm/vmscan.c | 15 ++++++++++----- mm/vmstat.c | 9 ++++++--- mm/workingset.c | 8 +++++--- 5 files changed, 43 insertions(+), 21 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 635a96cd9b1f..efbd95dd3bbf 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -173,9 +173,15 @@ enum node_stat_item { NR_ISOLATED_ANON, /* Temporary isolated pages from anon lru */ NR_ISOLATED_FILE, /* Temporary isolated pages from file lru */ WORKINGSET_NODES, - WORKINGSET_REFAULT, - WORKINGSET_ACTIVATE, - WORKINGSET_RESTORE, + WORKINGSET_REFAULT_BASE, + WORKINGSET_REFAULT_ANON = WORKINGSET_REFAULT_BASE, + WORKINGSET_REFAULT_FILE, + WORKINGSET_ACTIVATE_BASE, + WORKINGSET_ACTIVATE_ANON = WORKINGSET_ACTIVATE_BASE, + WORKINGSET_ACTIVATE_FILE, + WORKINGSET_RESTORE_BASE, + WORKINGSET_RESTORE_ANON = WORKINGSET_RESTORE_BASE, + WORKINGSET_RESTORE_FILE, WORKINGSET_NODERECLAIM, NR_ANON_MAPPED, /* Mapped anonymous pages */ NR_FILE_MAPPED, /* pagecache pages mapped into pagetables. @@ -277,8 +283,8 @@ struct lruvec { unsigned long file_cost; /* Non-resident age, driven by LRU movement */ atomic_long_t nonresident_age; - /* Refaults at the time of last reclaim cycle */ - unsigned long refaults; + /* Refaults at the time of last reclaim cycle, anon=0, file=1 */ + unsigned long refaults[2]; /* Various lruvec state flags (enum lruvec_flags) */ unsigned long flags; #ifdef CONFIG_MEMCG diff --git a/mm/memcontrol.c b/mm/memcontrol.c index f1fd265b9f9e..c6c579217855 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1530,12 +1530,18 @@ static char *memory_stat_format(struct mem_cgroup *memcg) seq_buf_printf(&s, "%s %lu\n", vm_event_name(PGMAJFAULT), memcg_events(memcg, PGMAJFAULT)); - seq_buf_printf(&s, "workingset_refault %lu\n", - memcg_page_state(memcg, WORKINGSET_REFAULT)); - seq_buf_printf(&s, "workingset_activate %lu\n", - memcg_page_state(memcg, WORKINGSET_ACTIVATE)); + seq_buf_printf(&s, "workingset_refault_anon %lu\n", + memcg_page_state(memcg, WORKINGSET_REFAULT_ANON)); + seq_buf_printf(&s, "workingset_refault_file %lu\n", + memcg_page_state(memcg, WORKINGSET_REFAULT_FILE)); + seq_buf_printf(&s, "workingset_activate_anon %lu\n", + memcg_page_state(memcg, WORKINGSET_ACTIVATE_ANON)); + seq_buf_printf(&s, "workingset_activate_file %lu\n", + memcg_page_state(memcg, WORKINGSET_ACTIVATE_FILE)); seq_buf_printf(&s, "workingset_restore %lu\n", - memcg_page_state(memcg, WORKINGSET_RESTORE)); + memcg_page_state(memcg, WORKINGSET_RESTORE_ANON)); + seq_buf_printf(&s, "workingset_restore %lu\n", + memcg_page_state(memcg, WORKINGSET_RESTORE_FILE)); seq_buf_printf(&s, "workingset_nodereclaim %lu\n", memcg_page_state(memcg, WORKINGSET_NODERECLAIM)); diff --git a/mm/vmscan.c b/mm/vmscan.c index 783cd7fdc61a..017f323318a3 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2683,7 +2683,10 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc) if (!sc->force_deactivate) { unsigned long refaults; - if (inactive_is_low(target_lruvec, LRU_INACTIVE_ANON)) + refaults = lruvec_page_state(target_lruvec, + WORKINGSET_ACTIVATE_ANON); + if (refaults != target_lruvec->refaults[0] || + 
inactive_is_low(target_lruvec, LRU_INACTIVE_ANON)) sc->may_deactivate |= DEACTIVATE_ANON; else sc->may_deactivate &= ~DEACTIVATE_ANON; @@ -2694,8 +2697,8 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc) * rid of any stale active pages quickly. */ refaults = lruvec_page_state(target_lruvec, - WORKINGSET_ACTIVATE); - if (refaults != target_lruvec->refaults || + WORKINGSET_ACTIVATE_FILE); + if (refaults != target_lruvec->refaults[1] || inactive_is_low(target_lruvec, LRU_INACTIVE_FILE)) sc->may_deactivate |= DEACTIVATE_FILE; else @@ -2972,8 +2975,10 @@ static void snapshot_refaults(struct mem_cgroup *target_memcg, pg_data_t *pgdat) unsigned long refaults; target_lruvec = mem_cgroup_lruvec(target_memcg, pgdat); - refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE); - target_lruvec->refaults = refaults; + refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE_ANON); + target_lruvec->refaults[0] = refaults; + refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE_FILE); + target_lruvec->refaults[1] = refaults; } /* diff --git a/mm/vmstat.c b/mm/vmstat.c index 2b866cbab11d..fef03463a0cf 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1167,9 +1167,12 @@ const char * const vmstat_text[] = { "nr_isolated_anon", "nr_isolated_file", "workingset_nodes", - "workingset_refault", - "workingset_activate", - "workingset_restore", + "workingset_refault_anon", + "workingset_refault_file", + "workingset_activate_anon", + "workingset_activate_file", + "workingset_restore_anon", + "workingset_restore_file", "workingset_nodereclaim", "nr_anon_pages", "nr_mapped", diff --git a/mm/workingset.c b/mm/workingset.c index b199726924dd..941bbaa6c262 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -6,6 +6,7 @@ */ #include +#include #include #include #include @@ -280,6 +281,7 @@ void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg) */ void workingset_refault(struct page *page, void *shadow) { + bool file = page_is_file_lru(page); struct mem_cgroup *eviction_memcg; struct lruvec *eviction_lruvec; unsigned long refault_distance; @@ -346,7 +348,7 @@ void workingset_refault(struct page *page, void *shadow) memcg = page_memcg(page); lruvec = mem_cgroup_lruvec(memcg, pgdat); - inc_lruvec_state(lruvec, WORKINGSET_REFAULT); + inc_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file); /* * Compare the distance to the existing workingset size. We @@ -366,7 +368,7 @@ void workingset_refault(struct page *page, void *shadow) SetPageActive(page); workingset_age_nonresident(lruvec, hpage_nr_pages(page)); - inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE); + inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + file); /* Page was active prior to eviction */ if (workingset) { @@ -375,7 +377,7 @@ void workingset_refault(struct page *page, void *shadow) spin_lock_irq(&page_pgdat(page)->lru_lock); lru_note_cost_page(page); spin_unlock_irq(&page_pgdat(page)->lru_lock); - inc_lruvec_state(lruvec, WORKINGSET_RESTORE); + inc_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + file); } out: rcu_read_unlock(); From 3852f6768ede542ed48b9077bedf482c7ecb6327 Mon Sep 17 00:00:00 2001 From: Joonsoo Kim Date: Tue, 11 Aug 2020 18:30:47 -0700 Subject: [PATCH 010/164] mm/swapcache: support to handle the shadow entries Workingset detection for anonymous pages will be implemented in the following patch, and it requires storing the shadow entries in the swapcache. This patch implements an infrastructure to store the shadow entry in the swapcache.
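The shadow entries stored here reuse the XArray's value-entry convention (the xa_is_value() checks in the mm/swap_state.c hunks below): a swapcache slot either holds a real struct page pointer or, after eviction, a tagged non-pointer value carrying eviction information. The toy below (plain userspace C; mk_shadow()/is_shadow()/shadow_val() are hypothetical stand-ins for xa_mk_value()/xa_is_value()/xa_to_value()) sketches that encoding, assuming pointers are at least 2-byte aligned so the low bit is free to act as the tag:

	#include <assert.h>
	#include <stdint.h>
	#include <stdio.h>

	/* A set low bit marks a slot as carrying an integer payload
	 * (the shadow) rather than a page pointer. */
	static void *mk_shadow(unsigned long eviction)
	{
		return (void *)(uintptr_t)((eviction << 1) | 1);
	}

	static int is_shadow(const void *entry)
	{
		return (uintptr_t)entry & 1;
	}

	static unsigned long shadow_val(const void *entry)
	{
		return (uintptr_t)entry >> 1;
	}

	int main(void)
	{
		int page;		/* stand-in for a struct page */
		void *slot = &page;	/* the slot holds a real page... */

		assert(!is_shadow(slot));
		slot = mk_shadow(12345);	/* ...until eviction leaves a shadow */
		assert(is_shadow(slot));
		printf("eviction info recovered from shadow: %lu\n",
		       shadow_val(slot));
		return 0;
	}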
Signed-off-by: Joonsoo Kim Signed-off-by: Andrew Morton Acked-by: Johannes Weiner Cc: Hugh Dickins Cc: Matthew Wilcox Cc: Mel Gorman Cc: Michal Hocko Cc: Minchan Kim Cc: Vlastimil Babka Link: http://lkml.kernel.org/r/1595490560-15117-5-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds --- include/linux/swap.h | 17 +++++++++---- mm/shmem.c | 3 ++- mm/swap_state.c | 57 +++++++++++++++++++++++++++++++++++++++----- mm/swapfile.c | 2 ++ mm/vmscan.c | 2 +- 5 files changed, 69 insertions(+), 12 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 51ec9cdb92c0..8a4c592698a9 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -414,9 +414,13 @@ extern struct address_space *swapper_spaces[]; extern unsigned long total_swapcache_pages(void); extern void show_swap_cache_info(void); extern int add_to_swap(struct page *page); -extern int add_to_swap_cache(struct page *, swp_entry_t, gfp_t); -extern void __delete_from_swap_cache(struct page *, swp_entry_t entry); +extern int add_to_swap_cache(struct page *page, swp_entry_t entry, + gfp_t gfp, void **shadowp); +extern void __delete_from_swap_cache(struct page *page, + swp_entry_t entry, void *shadow); extern void delete_from_swap_cache(struct page *); +extern void clear_shadow_from_swap_cache(int type, unsigned long begin, + unsigned long end); extern void free_page_and_swap_cache(struct page *); extern void free_pages_and_swap_cache(struct page **, int); extern struct page *lookup_swap_cache(swp_entry_t entry, @@ -570,13 +574,13 @@ static inline int add_to_swap(struct page *page) } static inline int add_to_swap_cache(struct page *page, swp_entry_t entry, - gfp_t gfp_mask) + gfp_t gfp_mask, void **shadowp) { return -1; } static inline void __delete_from_swap_cache(struct page *page, - swp_entry_t entry) + swp_entry_t entry, void *shadow) { } @@ -584,6 +588,11 @@ static inline void delete_from_swap_cache(struct page *page) { } +static inline void clear_shadow_from_swap_cache(int type, unsigned long begin, + unsigned long end) +{ +} + static inline int page_swapcount(struct page *page) { return 0; diff --git a/mm/shmem.c b/mm/shmem.c index eb6b36d89722..49bde088a7ea 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1434,7 +1434,8 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc) list_add(&info->swaplist, &shmem_swaplist); if (add_to_swap_cache(page, swap, - __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN) == 0) { + __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN, + NULL) == 0) { spin_lock_irq(&info->lock); shmem_recalc_inode(inode); info->swapped++; diff --git a/mm/swap_state.c b/mm/swap_state.c index e82f4f8b1f63..a29b33c7c236 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -110,12 +110,14 @@ void show_swap_cache_info(void) * add_to_swap_cache resembles add_to_page_cache_locked on swapper_space, * but sets SwapCache flag and private instead of mapping and index. 
*/ -int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp) +int add_to_swap_cache(struct page *page, swp_entry_t entry, + gfp_t gfp, void **shadowp) { struct address_space *address_space = swap_address_space(entry); pgoff_t idx = swp_offset(entry); XA_STATE_ORDER(xas, &address_space->i_pages, idx, compound_order(page)); unsigned long i, nr = hpage_nr_pages(page); + void *old; VM_BUG_ON_PAGE(!PageLocked(page), page); VM_BUG_ON_PAGE(PageSwapCache(page), page); @@ -125,16 +127,25 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp) SetPageSwapCache(page); do { + unsigned long nr_shadows = 0; + xas_lock_irq(&xas); xas_create_range(&xas); if (xas_error(&xas)) goto unlock; for (i = 0; i < nr; i++) { VM_BUG_ON_PAGE(xas.xa_index != idx + i, page); + old = xas_load(&xas); + if (xa_is_value(old)) { + nr_shadows++; + if (shadowp) + *shadowp = old; + } set_page_private(page + i, entry.val + i); xas_store(&xas, page); xas_next(&xas); } + address_space->nrexceptional -= nr_shadows; address_space->nrpages += nr; __mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, nr); ADD_CACHE_INFO(add_total, nr); @@ -154,7 +165,8 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp) * This must be called only on pages that have * been verified to be in the swap cache. */ -void __delete_from_swap_cache(struct page *page, swp_entry_t entry) +void __delete_from_swap_cache(struct page *page, + swp_entry_t entry, void *shadow) { struct address_space *address_space = swap_address_space(entry); int i, nr = hpage_nr_pages(page); @@ -166,12 +178,14 @@ void __delete_from_swap_cache(struct page *page, swp_entry_t entry) VM_BUG_ON_PAGE(PageWriteback(page), page); for (i = 0; i < nr; i++) { - void *entry = xas_store(&xas, NULL); + void *entry = xas_store(&xas, shadow); VM_BUG_ON_PAGE(entry != page, entry); set_page_private(page + i, 0); xas_next(&xas); } ClearPageSwapCache(page); + if (shadow) + address_space->nrexceptional += nr; address_space->nrpages -= nr; __mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, -nr); ADD_CACHE_INFO(del_total, nr); @@ -208,7 +222,7 @@ int add_to_swap(struct page *page) * Add it to the swap cache. 
*/ err = add_to_swap_cache(page, entry, - __GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN); + __GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN, NULL); if (err) /* * add_to_swap_cache() doesn't return -EEXIST, so we can safely @@ -246,13 +260,44 @@ void delete_from_swap_cache(struct page *page) struct address_space *address_space = swap_address_space(entry); xa_lock_irq(&address_space->i_pages); - __delete_from_swap_cache(page, entry); + __delete_from_swap_cache(page, entry, NULL); xa_unlock_irq(&address_space->i_pages); put_swap_page(page, entry); page_ref_sub(page, hpage_nr_pages(page)); } +void clear_shadow_from_swap_cache(int type, unsigned long begin, + unsigned long end) +{ + unsigned long curr = begin; + void *old; + + for (;;) { + unsigned long nr_shadows = 0; + swp_entry_t entry = swp_entry(type, curr); + struct address_space *address_space = swap_address_space(entry); + XA_STATE(xas, &address_space->i_pages, curr); + + xa_lock_irq(&address_space->i_pages); + xas_for_each(&xas, old, end) { + if (!xa_is_value(old)) + continue; + xas_store(&xas, NULL); + nr_shadows++; + } + address_space->nrexceptional -= nr_shadows; + xa_unlock_irq(&address_space->i_pages); + + /* search the next swapcache until we meet end */ + curr >>= SWAP_ADDRESS_SPACE_SHIFT; + curr++; + curr <<= SWAP_ADDRESS_SPACE_SHIFT; + if (curr > end) + break; + } +} + /* * If we are the only user, then try to free up the swap cache. * @@ -429,7 +474,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, __SetPageSwapBacked(page); /* May fail (-ENOMEM) if XArray node allocation failed. */ - if (add_to_swap_cache(page, entry, gfp_mask & GFP_RECLAIM_MASK)) { + if (add_to_swap_cache(page, entry, gfp_mask & GFP_RECLAIM_MASK, NULL)) { put_swap_page(page, entry); goto fail_unlock; } diff --git a/mm/swapfile.c b/mm/swapfile.c index 82183432fdd0..e653eea1eb88 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -696,6 +696,7 @@ static void add_to_avail_list(struct swap_info_struct *p) static void swap_range_free(struct swap_info_struct *si, unsigned long offset, unsigned int nr_entries) { + unsigned long begin = offset; unsigned long end = offset + nr_entries - 1; void (*swap_slot_free_notify)(struct block_device *, unsigned long); @@ -721,6 +722,7 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset, swap_slot_free_notify(si->bdev, offset); offset++; } + clear_shadow_from_swap_cache(si->type, begin, end); } static void set_cluster_next(struct swap_info_struct *si, unsigned long next) diff --git a/mm/vmscan.c b/mm/vmscan.c index 017f323318a3..e84c4dd08c4e 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -896,7 +896,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, if (PageSwapCache(page)) { swp_entry_t swap = { .val = page_private(page) }; mem_cgroup_swapout(page, swap); - __delete_from_swap_cache(page, swap); + __delete_from_swap_cache(page, swap, NULL); xa_unlock_irqrestore(&mapping->i_pages, flags); put_swap_page(page, swap); workingset_eviction(page, target_memcg); From aae466b0052e1888edd1d7f473d4310d64936196 Mon Sep 17 00:00:00 2001 From: Joonsoo Kim Date: Tue, 11 Aug 2020 18:30:50 -0700 Subject: [PATCH 011/164] mm/swap: implement workingset detection for anonymous LRU This patch implements workingset detection for anonymous LRU. All the infrastructure is implemented by the previous patches so this patch just activates the workingset detection by installing/retrieving the shadow entry and adding refault calculation. 
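As a rough model of the refault calculation this patch enables, the sketch below compares a refaulting anonymous page's "refault distance" (how far the lruvec's nonresident age advanced since eviction) against the size of the memory the page competes with. It is a simplification with made-up numbers, assuming swap is available; it is not the kernel's exact workingset_refault():

	#include <stdbool.h>
	#include <stdio.h>

	/* Made-up lruvec counters; the kernel reads these via
	 * lruvec_page_state(). */
	struct lruvec_sim {
		unsigned long nonresident_age;	/* advances on eviction/activation */
		unsigned long active_anon;
		unsigned long inactive_file;
		unsigned long active_file;
	};

	/* Should a refaulting anonymous page go straight to the active list? */
	static bool activate_anon_refault(const struct lruvec_sim *l,
					  unsigned long eviction_age)
	{
		unsigned long refault_distance = l->nonresident_age - eviction_age;
		/* An anon refault competes with the whole file LRU plus the
		 * active anon list (the latter only counts when swap is
		 * available, which is assumed here). */
		unsigned long workingset_size = l->active_anon +
						l->inactive_file + l->active_file;

		return refault_distance <= workingset_size;
	}

	int main(void)
	{
		struct lruvec_sim l = {
			.nonresident_age = 10000,
			.active_anon = 3000,
			.inactive_file = 2000,
			.active_file = 1500,
		};

		printf("evicted recently -> activate: %d\n",
		       activate_anon_refault(&l, 9000));	/* distance 1000 */
		printf("evicted long ago -> activate: %d\n",
		       activate_anon_refault(&l, 1000));	/* distance 9000 */
		return 0;
	}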
Signed-off-by: Joonsoo Kim Signed-off-by: Andrew Morton Acked-by: Johannes Weiner Acked-by: Vlastimil Babka Cc: Hugh Dickins Cc: Matthew Wilcox Cc: Mel Gorman Cc: Michal Hocko Cc: Minchan Kim Link: http://lkml.kernel.org/r/1595490560-15117-6-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds --- include/linux/swap.h | 6 ++++++ mm/memory.c | 11 ++++------- mm/swap_state.c | 23 ++++++++++++++++++----- mm/vmscan.c | 7 ++++--- mm/workingset.c | 15 +++++++++++---- 5 files changed, 43 insertions(+), 19 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 8a4c592698a9..661046994db4 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -414,6 +414,7 @@ extern struct address_space *swapper_spaces[]; extern unsigned long total_swapcache_pages(void); extern void show_swap_cache_info(void); extern int add_to_swap(struct page *page); +extern void *get_shadow_from_swap_cache(swp_entry_t entry); extern int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp, void **shadowp); extern void __delete_from_swap_cache(struct page *page, @@ -573,6 +574,11 @@ static inline int add_to_swap(struct page *page) return 0; } +static inline void *get_shadow_from_swap_cache(swp_entry_t entry) +{ + return NULL; +} + static inline int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp_mask, void **shadowp) { diff --git a/mm/memory.c b/mm/memory.c index 6fe8b5b22c57..de311fc7639e 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3098,6 +3098,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) int locked; int exclusive = 0; vm_fault_t ret = 0; + void *shadow = NULL; if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte)) goto out; @@ -3149,13 +3150,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) goto out_page; } - /* - * XXX: Move to lru_cache_add() when it - * supports new vs putback - */ - spin_lock_irq(&page_pgdat(page)->lru_lock); - lru_note_cost_page(page); - spin_unlock_irq(&page_pgdat(page)->lru_lock); + shadow = get_shadow_from_swap_cache(entry); + if (shadow) + workingset_refault(page, shadow); lru_cache_add(page); swap_readpage(page, true); diff --git a/mm/swap_state.c b/mm/swap_state.c index a29b33c7c236..b73aabdfd35a 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -106,6 +106,20 @@ void show_swap_cache_info(void) printk("Total swap = %lukB\n", total_swap_pages << (PAGE_SHIFT - 10)); } +void *get_shadow_from_swap_cache(swp_entry_t entry) +{ + struct address_space *address_space = swap_address_space(entry); + pgoff_t idx = swp_offset(entry); + struct page *page; + + page = find_get_entry(address_space, idx); + if (xa_is_value(page)) + return page; + if (page) + put_page(page); + return NULL; +} + /* * add_to_swap_cache resembles add_to_page_cache_locked on swapper_space, * but sets SwapCache flag and private instead of mapping and index. @@ -406,6 +420,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, { struct swap_info_struct *si; struct page *page; + void *shadow = NULL; *new_page_allocated = false; @@ -474,7 +489,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, __SetPageSwapBacked(page); /* May fail (-ENOMEM) if XArray node allocation failed. 
*/ - if (add_to_swap_cache(page, entry, gfp_mask & GFP_RECLAIM_MASK, NULL)) { + if (add_to_swap_cache(page, entry, gfp_mask & GFP_RECLAIM_MASK, &shadow)) { put_swap_page(page, entry); goto fail_unlock; } @@ -484,10 +499,8 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, goto fail_unlock; } - /* XXX: Move to lru_cache_add() when it supports new vs putback */ - spin_lock_irq(&page_pgdat(page)->lru_lock); - lru_note_cost_page(page); - spin_unlock_irq(&page_pgdat(page)->lru_lock); + if (shadow) + workingset_refault(page, shadow); /* Caller will initiate read into locked page */ SetPageWorkingset(page); diff --git a/mm/vmscan.c b/mm/vmscan.c index e84c4dd08c4e..66d73fea80e4 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -854,6 +854,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, { unsigned long flags; int refcount; + void *shadow = NULL; BUG_ON(!PageLocked(page)); BUG_ON(mapping != page_mapping(page)); @@ -896,13 +897,13 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, if (PageSwapCache(page)) { swp_entry_t swap = { .val = page_private(page) }; mem_cgroup_swapout(page, swap); - __delete_from_swap_cache(page, swap, NULL); + if (reclaimed && !mapping_exiting(mapping)) + shadow = workingset_eviction(page, target_memcg); + __delete_from_swap_cache(page, swap, shadow); xa_unlock_irqrestore(&mapping->i_pages, flags); put_swap_page(page, swap); - workingset_eviction(page, target_memcg); } else { void (*freepage)(struct page *); - void *shadow = NULL; freepage = mapping->a_ops->freepage; /* diff --git a/mm/workingset.c b/mm/workingset.c index 941bbaa6c262..8cbe4e3cbe5c 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -353,15 +353,22 @@ void workingset_refault(struct page *page, void *shadow) /* * Compare the distance to the existing workingset size. We * don't activate pages that couldn't stay resident even if - * all the memory was available to the page cache. Whether - * cache can compete with anon or not depends on having swap. + * all the memory was available to the workingset. Whether + * workingset competition needs to consider anon or not depends + * on having swap. */ workingset_size = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE); + if (!file) { + workingset_size += lruvec_page_state(eviction_lruvec, + NR_INACTIVE_FILE); + } if (mem_cgroup_get_nr_swap_pages(memcg) > 0) { - workingset_size += lruvec_page_state(eviction_lruvec, - NR_INACTIVE_ANON); workingset_size += lruvec_page_state(eviction_lruvec, NR_ACTIVE_ANON); + if (file) { + workingset_size += lruvec_page_state(eviction_lruvec, + NR_INACTIVE_ANON); + } } if (refault_distance > workingset_size) goto out; From 4002570c5c585c4c9a889e45dd104ed663257eec Mon Sep 17 00:00:00 2001 From: Joonsoo Kim Date: Tue, 11 Aug 2020 18:30:54 -0700 Subject: [PATCH 012/164] mm/vmscan: restore active/inactive ratio for anonymous LRU Now that workingset detection is implemented for the anonymous LRU, we no longer need a large inactive list to detect frequently accessed pages before they are reclaimed. This effectively reverts the temporary measure put in by commit "mm/vmscan: make active/inactive ratio as 1:1 for anon lru".
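To see what restoring the ratio means in practice, here is a small userspace approximation of the ratio computed in the inactive_is_low() hunk below (sqrt() stands in for the kernel's int_sqrt(); compile with -lm). With this patch the same curve now governs the anon LRU instead of pinning it at 1:1:

	#include <math.h>
	#include <stdio.h>

	/*
	 * "gb" is the combined LRU size in GB, i.e.
	 * (inactive + active) >> (30 - PAGE_SHIFT) in the kernel.
	 */
	static unsigned long inactive_ratio(unsigned long gb)
	{
		return gb ? (unsigned long)sqrt(10.0 * gb) : 1;
	}

	int main(void)
	{
		unsigned long sizes_gb[] = { 0, 1, 10, 100, 1000 };

		for (int i = 0; i < 5; i++)
			printf("%4lu GB -> active:inactive target = %lu:1\n",
			       sizes_gb[i], inactive_ratio(sizes_gb[i]));
		return 0;
	}

This prints a 3:1 target at 1GB, 10:1 at 10GB and roughly 31:1 at 100GB, so larger systems tolerate a proportionally smaller inactive list once workingset detection can rescue mispredicted pages.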
Signed-off-by: Joonsoo Kim Signed-off-by: Andrew Morton Acked-by: Johannes Weiner Acked-by: Vlastimil Babka Cc: Hugh Dickins Cc: Matthew Wilcox Cc: Mel Gorman Cc: Michal Hocko Cc: Minchan Kim Link: http://lkml.kernel.org/r/1595490560-15117-7-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds --- mm/vmscan.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 66d73fea80e4..846b7e9b8d2e 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2207,7 +2207,7 @@ static bool inactive_is_low(struct lruvec *lruvec, enum lru_list inactive_lru) active = lruvec_page_state(lruvec, NR_LRU_BASE + active_lru); gb = (inactive + active) >> (30 - PAGE_SHIFT); - if (gb && is_file_lru(inactive_lru)) + if (gb) inactive_ratio = int_sqrt(10 * gb); else inactive_ratio = 1; From 471e78cc7687337abd10626196f5addb30b01824 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Michal=20Koutn=C3=BD?= Date: Tue, 11 Aug 2020 18:30:57 -0700 Subject: [PATCH 013/164] /proc/PID/smaps: consistent whitespace output format MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The keys in smaps output are padded to fixed width with spaces. All except for THPeligible that uses tabs (only since commit c06306696f83 ("mm: thp: fix false negative of shmem vma's THP eligibility")). Unify the output formatting to save time debugging some naïve parsers. (Part of the unification is also aligning FilePmdMapped with others.) Signed-off-by: Michal Koutný Signed-off-by: Andrew Morton Acked-by: Yang Shi Cc: Alexey Dobriyan Cc: Matthew Wilcox Link: http://lkml.kernel.org/r/20200728083207.17531-1-mkoutny@suse.com Signed-off-by: Linus Torvalds --- fs/proc/task_mmu.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index dbda4499a859..5066b0251ed8 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -786,7 +786,7 @@ static void __show_smap(struct seq_file *m, const struct mem_size_stats *mss, SEQ_PUT_DEC(" kB\nLazyFree: ", mss->lazyfree); SEQ_PUT_DEC(" kB\nAnonHugePages: ", mss->anonymous_thp); SEQ_PUT_DEC(" kB\nShmemPmdMapped: ", mss->shmem_thp); - SEQ_PUT_DEC(" kB\nFilePmdMapped: ", mss->file_thp); + SEQ_PUT_DEC(" kB\nFilePmdMapped: ", mss->file_thp); SEQ_PUT_DEC(" kB\nShared_Hugetlb: ", mss->shared_hugetlb); seq_put_decimal_ull_width(m, " kB\nPrivate_Hugetlb: ", mss->private_hugetlb >> 10, 7); @@ -816,7 +816,7 @@ static int show_smap(struct seq_file *m, void *v) __show_smap(m, &mss, false); - seq_printf(m, "THPeligible: %d\n", + seq_printf(m, "THPeligible: %d\n", transparent_hugepage_enabled(vma)); if (arch_pkeys_enabled()) From facdaa917c4d5a376d09d25865f5a863f906234a Mon Sep 17 00:00:00 2001 From: Nitin Gupta Date: Tue, 11 Aug 2020 18:31:00 -0700 Subject: [PATCH 014/164] mm: proactive compaction MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit For some applications, we need to allocate almost all memory as hugepages. However, on a running system, higher-order allocations can fail if the memory is fragmented. Linux kernel currently does on-demand compaction as we request more hugepages, but this style of compaction incurs very high latency. Experiments with one-time full memory compaction (followed by hugepage allocations) show that kernel is able to restore a highly fragmented memory state to a fairly compacted memory state within <1 sec for a 32G system. 
Such data suggests that a more proactive compaction can help us allocate a large fraction of memory as hugepages while keeping allocation latencies low.

For a more proactive compaction, the approach taken here is to define a new sysctl called 'vm.compaction_proactiveness', which dictates bounds for the external fragmentation that kcompactd tries to maintain. The tunable takes a value in range [0, 100], with a default of 20.

Note that a previous version of this patch [1] was found to introduce too many tunables (per-order extfrag{low, high}), but this one reduces them to just one sysctl. Also, the new tunable is an opaque value instead of asking for specific bounds of "external fragmentation", which would have been difficult to estimate. The internal interpretation of this opaque value allows for future fine-tuning.

Currently, we use a simple translation from this tunable to [low, high] "fragmentation score" thresholds (low=100-proactiveness, high=low+10%). The score for a node is defined as the weighted mean of per-zone external fragmentation. A zone's present_pages determines its weight.

To periodically check the per-node score, we reuse per-node kcompactd threads, which are woken up every 500 milliseconds to do the check. If a node's score exceeds its high threshold (as derived from the user-provided proactiveness value), proactive compaction is started until its score reaches its low threshold value. By default, proactiveness is set to 20, which implies threshold values of low=80 and high=90.

This patch is largely based on ideas from Michal Hocko [2]. See also the LWN article [3].

Performance data
================

System: x86_64, 1T RAM, 80 CPU threads.
Kernel: 5.6.0-rc3 + this patch

echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag

Before starting the driver, the system was fragmented by a userspace program that allocates all memory and then, for each 2M aligned section, frees 3/4 of the base pages using munmap. The workload is mainly anonymous userspace pages, which are easy to move around. I intentionally avoided unmovable pages in this test to see how much latency we incur when hugepage allocations hit direct compaction.

1. Kernel hugepage allocation latencies

With the system in such a fragmented state, a kernel driver then allocates as many hugepages as possible and measures allocation latency (all latency values are in microseconds):

- With vanilla 5.6.0-rc3

  percentile latency
  –––––––––– –––––––
           5    7894
          10    9496
          25   12561
          30   15295
          40   18244
          50   21229
          60   27556
          75   30147
          80   31047
          90   32859
          95   33799

Total 2M hugepages allocated = 383859 (749G worth of hugepages out of 762G total free => 98% of free memory could be allocated as hugepages)

- With 5.6.0-rc3 + this patch, with proactiveness=20

  sysctl -w vm.compaction_proactiveness=20

  percentile latency
  –––––––––– –––––––
           5       2
          10       2
          25       3
          30       3
          40       3
          50       4
          60       4
          75       4
          80       4
          90       5
          95     429

Total 2M hugepages allocated = 384105 (750G worth of hugepages out of 762G total free => 98% of free memory could be allocated as hugepages)

2. JAVA heap allocation

In this test, we first fragment memory using the same method as for (1). Then, we start a Java process with a heap size set to 700G and request the heap to be allocated with THP hugepages. We also set THP to madvise to allow hugepage backing of this heap.

/usr/bin/time java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch

The above command allocates 700G of Java heap using hugepages.
- With vanilla 5.6.0-rc3
  17.39user 1666.48system 27:37.89elapsed

- With 5.6.0-rc3 + this patch, with proactiveness=20
  8.35user 194.58system 3:19.62elapsed

Elapsed time remains around 3:15 as proactiveness is increased further.

Note that proactive compaction happens throughout the runtime of these workloads. The situation of a one-time compaction, sufficient to supply hugepages for the following allocation stream, can probably happen at more extreme proactiveness values, like 80 or 90.

In the above Java workload, proactiveness is set to 20. The test starts with a node's score of 80 or higher, depending on the delay between the fragmentation step and starting the benchmark, which gives the initial round of compaction more or less time to run. As the benchmark consumes hugepages, the node's score quickly rises above the high threshold (90) and proactive compaction starts again, which brings the score back down to the low threshold level (80). Repeat.

bpftrace also confirms proactive compaction running 20+ times during the runtime of this Java benchmark. kcompactd threads consume 100% of one of the CPUs while trying to bring a node's score within thresholds.

Backoff behavior
================

The above workloads produce a memory state which is easy to compact. However, if memory is filled with unmovable pages, proactive compaction should essentially back off. To test this aspect:

- Created a kernel driver that allocates almost all memory as hugepages followed by freeing the first 3/4 of each hugepage.
- Set proactiveness=40
- Note that proactive_compact_node() is deferred the maximum number of times with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check (=> ~30 seconds between retries).

[1] https://patchwork.kernel.org/patch/11098289/
[2] https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
[3] https://lwn.net/Articles/817905/

Signed-off-by: Nitin Gupta Signed-off-by: Andrew Morton Tested-by: Oleksandr Natalenko Reviewed-by: Vlastimil Babka Reviewed-by: Khalid Aziz Reviewed-by: Oleksandr Natalenko Cc: Vlastimil Babka Cc: Khalid Aziz Cc: Michal Hocko Cc: Mel Gorman Cc: Matthew Wilcox Cc: Mike Kravetz Cc: Joonsoo Kim Cc: David Rientjes Cc: Nitin Gupta Cc: Oleksandr Natalenko Link: http://lkml.kernel.org/r/20200616204527.19185-1-nigupta@nvidia.com Signed-off-by: Linus Torvalds --- Documentation/admin-guide/sysctl/vm.rst | 15 ++ include/linux/compaction.h | 2 + kernel/sysctl.c | 9 ++ mm/compaction.c | 183 +++++++++++++++++++++++- mm/internal.h | 1 + mm/vmstat.c | 18 +++ 6 files changed, 223 insertions(+), 5 deletions(-) diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst index d997cc3c26d0..4b9d2e8e9142 100644 --- a/Documentation/admin-guide/sysctl/vm.rst +++ b/Documentation/admin-guide/sysctl/vm.rst @@ -119,6 +119,21 @@ all zones are compacted such that free memory is available in contiguous blocks where possible. This can be important for example in the allocation of huge pages although processes will also directly compact memory as required. +compaction_proactiveness +======================== + +This tunable takes a value in the range [0, 100] with a default value of +20. This tunable determines how aggressively compaction is done in the +background. Setting it to 0 disables proactive compaction. + +Note that compaction has a non-trivial system-wide impact as pages +belonging to different processes are moved around, which could also lead +to latency spikes in unsuspecting applications.
The kernel employs +various heuristics to avoid wasting CPU cycles if it detects that +proactive compaction is not being effective. + +Be careful when setting it to extreme values like 100, as that may +cause excessive background compaction activity. compact_unevictable_allowed =========================== diff --git a/include/linux/compaction.h b/include/linux/compaction.h index 6fa0eea3f530..7a242d46454e 100644 --- a/include/linux/compaction.h +++ b/include/linux/compaction.h @@ -85,11 +85,13 @@ static inline unsigned long compact_gap(unsigned int order) #ifdef CONFIG_COMPACTION extern int sysctl_compact_memory; +extern int sysctl_compaction_proactiveness; extern int sysctl_compaction_handler(struct ctl_table *table, int write, void *buffer, size_t *length, loff_t *ppos); extern int sysctl_extfrag_threshold; extern int sysctl_compact_unevictable_allowed; +extern int extfrag_for_order(struct zone *zone, unsigned int order); extern int fragmentation_index(struct zone *zone, unsigned int order); extern enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order, unsigned int alloc_flags, diff --git a/kernel/sysctl.c b/kernel/sysctl.c index f785de3caac0..ab72e52f8a7b 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -2851,6 +2851,15 @@ static struct ctl_table vm_table[] = { .mode = 0200, .proc_handler = sysctl_compaction_handler, }, + { + .procname = "compaction_proactiveness", + .data = &sysctl_compaction_proactiveness, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .extra1 = SYSCTL_ZERO, + .extra2 = &one_hundred, + }, { .procname = "extfrag_threshold", .data = &sysctl_extfrag_threshold, diff --git a/mm/compaction.c b/mm/compaction.c index 86375605faa9..544a98811c82 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -50,6 +50,24 @@ static inline void count_compact_events(enum vm_event_item item, long delta) #define pageblock_start_pfn(pfn) block_start_pfn(pfn, pageblock_order) #define pageblock_end_pfn(pfn) block_end_pfn(pfn, pageblock_order) +/* + * Fragmentation score check interval for proactive compaction purposes. + */ +static const int HPAGE_FRAG_CHECK_INTERVAL_MSEC = 500; + +/* + * Page order with-respect-to which proactive compaction + * calculates external fragmentation, which is used as + * the "fragmentation score" of a node/zone. + */ +#if defined CONFIG_TRANSPARENT_HUGEPAGE +#define COMPACTION_HPAGE_ORDER HPAGE_PMD_ORDER +#elif defined HUGETLB_PAGE_ORDER +#define COMPACTION_HPAGE_ORDER HUGETLB_PAGE_ORDER +#else +#define COMPACTION_HPAGE_ORDER (PMD_SHIFT - PAGE_SHIFT) +#endif + static unsigned long release_freepages(struct list_head *freelist) { struct page *page, *next; @@ -1857,6 +1875,76 @@ static inline bool is_via_compact_memory(int order) return order == -1; } +static bool kswapd_is_running(pg_data_t *pgdat) +{ + return pgdat->kswapd && (pgdat->kswapd->state == TASK_RUNNING); +} + +/* + * A zone's fragmentation score is the external fragmentation wrt to the + * COMPACTION_HPAGE_ORDER scaled by the zone's size. It returns a value + * in the range [0, 100]. + * + * The scaling factor ensures that proactive compaction focuses on larger + * zones like ZONE_NORMAL, rather than smaller, specialized zones like + * ZONE_DMA32. For smaller zones, the score value remains close to zero, + * and thus never exceeds the high threshold for proactive compaction. 
+ */ +static int fragmentation_score_zone(struct zone *zone) +{ + unsigned long score; + + score = zone->present_pages * + extfrag_for_order(zone, COMPACTION_HPAGE_ORDER); + return div64_ul(score, zone->zone_pgdat->node_present_pages + 1); +} + +/* + * The per-node proactive (background) compaction process is started by its + * corresponding kcompactd thread when the node's fragmentation score + * exceeds the high threshold. The compaction process remains active till + * the node's score falls below the low threshold, or one of the back-off + * conditions is met. + */ +static int fragmentation_score_node(pg_data_t *pgdat) +{ + unsigned long score = 0; + int zoneid; + + for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) { + struct zone *zone; + + zone = &pgdat->node_zones[zoneid]; + score += fragmentation_score_zone(zone); + } + + return score; +} + +static int fragmentation_score_wmark(pg_data_t *pgdat, bool low) +{ + int wmark_low; + + /* + * Cap the low watermak to avoid excessive compaction + * activity in case a user sets the proactivess tunable + * close to 100 (maximum). + */ + wmark_low = max(100 - sysctl_compaction_proactiveness, 5); + return low ? wmark_low : min(wmark_low + 10, 100); +} + +static bool should_proactive_compact_node(pg_data_t *pgdat) +{ + int wmark_high; + + if (!sysctl_compaction_proactiveness || kswapd_is_running(pgdat)) + return false; + + wmark_high = fragmentation_score_wmark(pgdat, false); + return fragmentation_score_node(pgdat) > wmark_high; +} + static enum compact_result __compact_finished(struct compact_control *cc) { unsigned int order; @@ -1883,6 +1971,25 @@ static enum compact_result __compact_finished(struct compact_control *cc) return COMPACT_PARTIAL_SKIPPED; } + if (cc->proactive_compaction) { + int score, wmark_low; + pg_data_t *pgdat; + + pgdat = cc->zone->zone_pgdat; + if (kswapd_is_running(pgdat)) + return COMPACT_PARTIAL_SKIPPED; + + score = fragmentation_score_zone(cc->zone); + wmark_low = fragmentation_score_wmark(pgdat, true); + + if (score > wmark_low) + ret = COMPACT_CONTINUE; + else + ret = COMPACT_SUCCESS; + + goto out; + } + if (is_via_compact_memory(cc->order)) return COMPACT_CONTINUE; @@ -1941,6 +2048,7 @@ static enum compact_result __compact_finished(struct compact_control *cc) } } +out: if (cc->contended || fatal_signal_pending(current)) ret = COMPACT_CONTENDED; @@ -2421,6 +2529,41 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order, return rc; } +/* + * Compact all zones within a node till each zone's fragmentation score + * reaches within proactive compaction thresholds (as determined by the + * proactiveness tunable). + * + * It is possible that the function returns before reaching score targets + * due to various back-off conditions, such as, contention on per-node or + * per-zone locks. 
+ */ +static void proactive_compact_node(pg_data_t *pgdat) +{ + int zoneid; + struct zone *zone; + struct compact_control cc = { + .order = -1, + .mode = MIGRATE_SYNC_LIGHT, + .ignore_skip_hint = true, + .whole_zone = true, + .gfp_mask = GFP_KERNEL, + .proactive_compaction = true, + }; + + for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) { + zone = &pgdat->node_zones[zoneid]; + if (!populated_zone(zone)) + continue; + + cc.zone = zone; + + compact_zone(&cc, NULL); + + VM_BUG_ON(!list_empty(&cc.freepages)); + VM_BUG_ON(!list_empty(&cc.migratepages)); + } +} /* Compact all zones within a node */ static void compact_node(int nid) @@ -2467,6 +2610,13 @@ static void compact_nodes(void) /* The written value is actually unused, all memory is compacted */ int sysctl_compact_memory; +/* + * Tunable for proactive compaction. It determines how + * aggressively the kernel should compact memory in the + * background. It takes values in the range [0, 100]. + */ +int __read_mostly sysctl_compaction_proactiveness = 20; + /* * This is the entry point for compacting all nodes via * /proc/sys/vm/compact_memory @@ -2646,6 +2796,7 @@ static int kcompactd(void *p) { pg_data_t *pgdat = (pg_data_t*)p; struct task_struct *tsk = current; + unsigned int proactive_defer = 0; const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id); @@ -2661,12 +2812,34 @@ static int kcompactd(void *p) unsigned long pflags; trace_mm_compaction_kcompactd_sleep(pgdat->node_id); - wait_event_freezable(pgdat->kcompactd_wait, - kcompactd_work_requested(pgdat)); + if (wait_event_freezable_timeout(pgdat->kcompactd_wait, + kcompactd_work_requested(pgdat), + msecs_to_jiffies(HPAGE_FRAG_CHECK_INTERVAL_MSEC))) { - psi_memstall_enter(&pflags); - kcompactd_do_work(pgdat); - psi_memstall_leave(&pflags); + psi_memstall_enter(&pflags); + kcompactd_do_work(pgdat); + psi_memstall_leave(&pflags); + continue; + } + + /* kcompactd wait timeout */ + if (should_proactive_compact_node(pgdat)) { + unsigned int prev_score, score; + + if (proactive_defer) { + proactive_defer--; + continue; + } + prev_score = fragmentation_score_node(pgdat); + proactive_compact_node(pgdat); + score = fragmentation_score_node(pgdat); + /* + * Defer proactive compaction if the fragmentation + * score did not go down i.e. no progress made. + */ + proactive_defer = score < prev_score ? + 0 : 1 << COMPACT_MAX_DEFER_SHIFT; + } } return 0; diff --git a/mm/internal.h b/mm/internal.h index 9886db20d94f..42cf0b610847 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -239,6 +239,7 @@ struct compact_control { bool no_set_skip_hint; /* Don't mark blocks for skipping */ bool ignore_block_suitable; /* Scan blocks considered unsuitable */ bool direct_compaction; /* False from kcompactd or /proc/... */ + bool proactive_compaction; /* kcompactd proactive compaction */ bool whole_zone; /* Whole zone should/has been scanned */ bool contended; /* Signal lock or sched contention */ bool rescan; /* Rescanning the same pageblock */ diff --git a/mm/vmstat.c b/mm/vmstat.c index fef03463a0cf..f183aa37994e 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1096,6 +1096,24 @@ static int __fragmentation_index(unsigned int order, struct contig_page_info *in return 1000 - div_u64( (1000+(div_u64(info->free_pages * 1000ULL, requested))), info->free_blocks_total); } +/* + * Calculates external fragmentation within a zone wrt the given order. + * It is defined as the percentage of pages found in blocks of size + * less than 1 << order. It returns values in range [0, 100]. 
+ */ +int extfrag_for_order(struct zone *zone, unsigned int order) +{ + struct contig_page_info info; + + fill_contig_page_info(zone, order, &info); + if (info.free_pages == 0) + return 0; + + return div_u64((info.free_pages - + (info.free_blocks_suitable << order)) * 100, + info.free_pages); +} + /* Same as __fragmentation index but allocs contig_page_info on stack */ int fragmentation_index(struct zone *zone, unsigned int order) { From 25788738eb9ce46fe6a0fd84a3ceef5c795d41f0 Mon Sep 17 00:00:00 2001 From: Nitin Gupta Date: Tue, 11 Aug 2020 18:31:04 -0700 Subject: [PATCH 015/164] mm: fix compile error due to COMPACTION_HPAGE_ORDER Fix compile error when COMPACTION_HPAGE_ORDER is assigned to HUGETLB_PAGE_ORDER. The correct way to check if this constant is defined is to check for CONFIG_HUGETLBFS. Reported-by: Nathan Chancellor Signed-off-by: Nitin Gupta Signed-off-by: Andrew Morton Tested-by: Nathan Chancellor Cc: Stephen Rothwell Link: http://lkml.kernel.org/r/20200623064544.25766-1-nigupta@nvidia.com Signed-off-by: Linus Torvalds --- mm/compaction.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/compaction.c b/mm/compaction.c index 544a98811c82..ed8ea1511634 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -62,7 +62,7 @@ static const int HPAGE_FRAG_CHECK_INTERVAL_MSEC = 500; */ #if defined CONFIG_TRANSPARENT_HUGEPAGE #define COMPACTION_HPAGE_ORDER HPAGE_PMD_ORDER -#elif defined HUGETLB_PAGE_ORDER +#elif defined CONFIG_HUGETLBFS #define COMPACTION_HPAGE_ORDER HUGETLB_PAGE_ORDER #else #define COMPACTION_HPAGE_ORDER (PMD_SHIFT - PAGE_SHIFT) From d34c0a7599ea8c301bc471dfa1eb2bf2db6752d1 Mon Sep 17 00:00:00 2001 From: Nitin Gupta Date: Tue, 11 Aug 2020 18:31:07 -0700 Subject: [PATCH 016/164] mm: use unsigned types for fragmentation score Proactive compaction uses per-node/zone "fragmentation score" which is always in range [0, 100], so use unsigned type of these scores as well as for related constants. 
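Pulling together the pieces of the last three patches, the sketch below models how a node's fragmentation score and the proactiveness-derived thresholds interact. The zone sizes and extfrag percentages are invented for illustration; only the formulas mirror fragmentation_score_zone() and fragmentation_score_wmark():

	#include <stdio.h>

	/* Mirrors fragmentation_score_wmark(): low = 100 - proactiveness,
	 * capped below at 5; high = low + 10, capped above at 100. */
	static unsigned int score_wmark(unsigned int proactiveness, int low)
	{
		unsigned int wmark_low = 100U - proactiveness;

		if (wmark_low < 5)
			wmark_low = 5;
		if (low)
			return wmark_low;
		return wmark_low + 10 > 100 ? 100 : wmark_low + 10;
	}

	/* Mirrors fragmentation_score_zone(): a zone's extfrag weighted by
	 * its share of the node's present pages. */
	static unsigned int zone_score(unsigned long zone_pages,
				       unsigned long node_pages,
				       unsigned int extfrag)
	{
		return zone_pages * (unsigned long long)extfrag /
		       (node_pages + 1);
	}

	int main(void)
	{
		/* Invented node: 1G DMA32 zone + 15G NORMAL zone (4K pages). */
		unsigned long node = 16UL << 18;
		unsigned int score = zone_score(1UL << 18, node, 90) +	/* DMA32 */
				     zone_score(15UL << 18, node, 85);	/* NORMAL */

		printf("node score=%u, low=%u, high=%u\n",
		       score, score_wmark(20, 1), score_wmark(20, 0));
		return 0;
	}

With proactiveness=20 this prints low=80 and high=90 and a node score of 84: not enough to start proactive compaction, but a pass already in progress would keep going until the score drops to 80. The weighting also shows why the badly fragmented but small DMA32 zone contributes only a few points.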
Signed-off-by: Nitin Gupta Signed-off-by: Andrew Morton Reviewed-by: Baoquan He Cc: Luis Chamberlain Cc: Kees Cook Cc: Iurii Zaikin Cc: Vlastimil Babka Cc: Joonsoo Kim Link: http://lkml.kernel.org/r/20200618010319.13159-1-nigupta@nvidia.com Signed-off-by: Linus Torvalds --- include/linux/compaction.h | 4 ++-- kernel/sysctl.c | 2 +- mm/compaction.c | 18 +++++++++--------- mm/vmstat.c | 2 +- 4 files changed, 13 insertions(+), 13 deletions(-) diff --git a/include/linux/compaction.h b/include/linux/compaction.h index 7a242d46454e..25a521d299c1 100644 --- a/include/linux/compaction.h +++ b/include/linux/compaction.h @@ -85,13 +85,13 @@ static inline unsigned long compact_gap(unsigned int order) #ifdef CONFIG_COMPACTION extern int sysctl_compact_memory; -extern int sysctl_compaction_proactiveness; +extern unsigned int sysctl_compaction_proactiveness; extern int sysctl_compaction_handler(struct ctl_table *table, int write, void *buffer, size_t *length, loff_t *ppos); extern int sysctl_extfrag_threshold; extern int sysctl_compact_unevictable_allowed; -extern int extfrag_for_order(struct zone *zone, unsigned int order); +extern unsigned int extfrag_for_order(struct zone *zone, unsigned int order); extern int fragmentation_index(struct zone *zone, unsigned int order); extern enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order, unsigned int alloc_flags, diff --git a/kernel/sysctl.c b/kernel/sysctl.c index ab72e52f8a7b..287862f91717 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -2854,7 +2854,7 @@ static struct ctl_table vm_table[] = { { .procname = "compaction_proactiveness", .data = &sysctl_compaction_proactiveness, - .maxlen = sizeof(int), + .maxlen = sizeof(sysctl_compaction_proactiveness), .mode = 0644, .proc_handler = proc_dointvec_minmax, .extra1 = SYSCTL_ZERO, diff --git a/mm/compaction.c b/mm/compaction.c index ed8ea1511634..b7d433f1706a 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -53,7 +53,7 @@ static inline void count_compact_events(enum vm_event_item item, long delta) /* * Fragmentation score check interval for proactive compaction purposes. */ -static const int HPAGE_FRAG_CHECK_INTERVAL_MSEC = 500; +static const unsigned int HPAGE_FRAG_CHECK_INTERVAL_MSEC = 500; /* * Page order with-respect-to which proactive compaction @@ -1890,7 +1890,7 @@ static bool kswapd_is_running(pg_data_t *pgdat) * ZONE_DMA32. For smaller zones, the score value remains close to zero, * and thus never exceeds the high threshold for proactive compaction. */ -static int fragmentation_score_zone(struct zone *zone) +static unsigned int fragmentation_score_zone(struct zone *zone) { unsigned long score; @@ -1906,9 +1906,9 @@ static int fragmentation_score_zone(struct zone *zone) * the node's score falls below the low threshold, or one of the back-off * conditions is met. */ -static int fragmentation_score_node(pg_data_t *pgdat) +static unsigned int fragmentation_score_node(pg_data_t *pgdat) { - unsigned long score = 0; + unsigned int score = 0; int zoneid; for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) { @@ -1921,17 +1921,17 @@ static int fragmentation_score_node(pg_data_t *pgdat) return score; } -static int fragmentation_score_wmark(pg_data_t *pgdat, bool low) +static unsigned int fragmentation_score_wmark(pg_data_t *pgdat, bool low) { - int wmark_low; + unsigned int wmark_low; /* * Cap the low watermak to avoid excessive compaction * activity in case a user sets the proactivess tunable * close to 100 (maximum). 
*/ - wmark_low = max(100 - sysctl_compaction_proactiveness, 5); - return low ? wmark_low : min(wmark_low + 10, 100); + wmark_low = max(100U - sysctl_compaction_proactiveness, 5U); + return low ? wmark_low : min(wmark_low + 10, 100U); } static bool should_proactive_compact_node(pg_data_t *pgdat) @@ -2615,7 +2615,7 @@ int sysctl_compact_memory; * aggressively the kernel should compact memory in the * background. It takes values in the range [0, 100]. */ -int __read_mostly sysctl_compaction_proactiveness = 20; +unsigned int __read_mostly sysctl_compaction_proactiveness = 20; /* * This is the entry point for compacting all nodes via diff --git a/mm/vmstat.c b/mm/vmstat.c index f183aa37994e..3bb034c99887 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1101,7 +1101,7 @@ static int __fragmentation_index(unsigned int order, struct contig_page_info *in * It is defined as the percentage of pages found in blocks of size * less than 1 << order. It returns values in range [0, 100]. */ -int extfrag_for_order(struct zone *zone, unsigned int order) +unsigned int extfrag_for_order(struct zone *zone, unsigned int order) { struct contig_page_info info; From 860b32729a21e0565585a9e2ecea2e5244d65acd Mon Sep 17 00:00:00 2001 From: Alex Shi Date: Tue, 11 Aug 2020 18:31:10 -0700 Subject: [PATCH 017/164] mm/compaction: correct the comments of compact_defer_shift There is no compact_defer_limit; it is compact_defer_shift that is actually in use. Also add an explanation of compact_order_failed. Signed-off-by: Alex Shi Signed-off-by: Andrew Morton Reviewed-by: Alexander Duyck Link: http://lkml.kernel.org/r/3bd60e1b-a74e-050d-ade4-6e8f54e00b92@linux.alibaba.com Signed-off-by: Linus Torvalds --- include/linux/mmzone.h | 1 + mm/compaction.c | 2 +- 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index efbd95dd3bbf..8379432f4f2f 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -536,6 +536,7 @@ struct zone { * On compaction failure, 1< Date: Tue, 11 Aug 2020 18:31:13 -0700 Subject: [PATCH 018/164] mm: mempolicy: fix kerneldoc of numa_map_to_online_node() Fix W=1 compile warnings (invalid kerneldoc): mm/mempolicy.c:137: warning: Function parameter or member 'node' not described in 'numa_map_to_online_node' mm/mempolicy.c:137: warning: Excess function parameter 'nid' description in 'numa_map_to_online_node' Signed-off-by: Krzysztof Kozlowski Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Link: http://lkml.kernel.org/r/20200728171109.28687-3-krzk@kernel.org Signed-off-by: Linus Torvalds --- mm/mempolicy.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 7af44d7cdd11..92030ead2506 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -129,7 +129,7 @@ static struct mempolicy preferred_node_policy[MAX_NUMNODES]; /** * numa_map_to_online_node - Find closest online node - * @nid: Node id to start the search + * @node: Node id to start the search * * Lookup the next closest node by distance if @nid is not online. */ From 4605f057aace9291e80bf9982cc7c8babc917f56 Mon Sep 17 00:00:00 2001 From: Wenchao Hao Date: Tue, 11 Aug 2020 18:31:16 -0700 Subject: [PATCH 019/164] mm/mempolicy.c: check parameters first in kernel_get_mempolicy The previous implementation calls untagged_addr() before the error check; if the error check fails and returns EINVAL, the untagged_addr() call was just useless work.
Signed-off-by: Wenchao Hao Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Link: http://lkml.kernel.org/r/20200801090825.5597-1-haowenchao22@gmail.com Signed-off-by: Linus Torvalds --- mm/mempolicy.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 92030ead2506..25b7e412c20b 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1632,11 +1632,11 @@ static int kernel_get_mempolicy(int __user *policy, int pval; nodemask_t nodes; - addr = untagged_addr(addr); - if (nmask != NULL && maxnode < nr_node_ids) return -EINVAL; + addr = untagged_addr(addr); + err = do_get_mempolicy(&pval, &nodes, addr, flags); if (err) From f3f3416c2234c0a9372316bec5c48cbd58b565c5 Mon Sep 17 00:00:00 2001 From: Yanfei Xu Date: Tue, 11 Aug 2020 18:31:19 -0700 Subject: [PATCH 020/164] include/linux/mempolicy.h: fix typo Change "interlave" to "interleave". Signed-off-by: Yanfei Xu Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Link: http://lkml.kernel.org/r/20200810063454.9357-1-yanfei.xu@windriver.com Signed-off-by: Linus Torvalds --- include/linux/mempolicy.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h index 5f1648c23e29..5f1c74df264d 100644 --- a/include/linux/mempolicy.h +++ b/include/linux/mempolicy.h @@ -28,7 +28,7 @@ struct mm_struct; * the process policy is used. Interrupts ignore the memory policy * of the current process. * - * Locking policy for interlave: + * Locking policy for interleave: * In process context there is no locking because only the process accesses * its own state. All vma manipulation is somewhat protected by a down_read on * mmap_lock. From 9066e5cfb73cdbcdbb49e87999482ab615e9fc76 Mon Sep 17 00:00:00 2001 From: Yafang Shao Date: Tue, 11 Aug 2020 18:31:22 -0700 Subject: [PATCH 021/164] mm, oom: make the calculation of oom badness more accurate Recently we found an issue in our production environment: when a memcg oom is triggered, the oom killer doesn't choose the process with the largest resident memory but chooses the first scanned process. Note that all processes in this memcg have the same oom_score_adj, so the oom killer should choose the process with the largest resident memory.

Below is part of the oom info, which is enough to analyze this issue.

[7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
[7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
[7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
[...]
[7516987.983293] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[7516987.983510] [ 5740] 0 5740 257 1 32768 0 -998 pause
[7516987.983574] [58804] 0 58804 4594 771 81920 0 -998 entry_point.bas
[7516987.983577] [58908] 0 58908 7089 689 98304 0 -998 cron
[7516987.983580] [58910] 0 58910 16235 5576 163840 0 -998 supervisord
[7516987.983590] [59620] 0 59620 18074 1395 188416 0 -998 sshd
[7516987.983594] [59622] 0 59622 18680 6679 188416 0 -998 python
[7516987.983598] [59624] 0 59624 1859266 5161 548864 0 -998 odin-agent
[7516987.983600] [59625] 0 59625 707223 9248 983040 0 -998 filebeat
[7516987.983604] [59627] 0 59627 416433 64239 774144 0 -998 odin-log-agent
[7516987.983607] [59631] 0 59631 180671 15012 385024 0 -998 python3
[7516987.983612] [61396] 0 61396 791287 3189 352256 0 -998 client
[7516987.983615] [61641] 0 61641 1844642 29089 946176 0 -998 client
[7516987.983765] [ 9236] 0 9236 2642 467 53248 0 -998 php_scanner
[7516987.983911] [42898] 0 42898 15543 838 167936 0 -998 su
[7516987.983915] [42900] 1000 42900 3673 867 77824 0 -998 exec_script_vr2
[7516987.983918] [42925] 1000 42925 36475 19033 335872 0 -998 python
[7516987.983921] [57146] 1000 57146 3673 848 73728 0 -998 exec_script_J2p
[7516987.983925] [57195] 1000 57195 186359 22958 491520 0 -998 python2
[7516987.983928] [58376] 1000 58376 275764 14402 290816 0 -998 rosmaster
[7516987.983931] [58395] 1000 58395 155166 4449 245760 0 -998 rosout
[7516987.983935] [58406] 1000 58406 18285584 3967322 37101568 0 -998 data_sim
[7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
[7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
[7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

We can see that the first scanned process, 5740 (pause), was killed, even though its rss is only one page. That is because, when we calculate the oom badness in oom_badness(), we always ignore negative points and convert all of them to 1. Now, as the oom_score_adj of all the processes in this targeted memcg has the same value of -998, the points of these processes are all negative. As a result, the first scanned process is killed. The oom_score_adj (-998) in this memcg is set by kubelet, because it is a Guaranteed pod, which has a higher priority and should be protected from being killed by a system oom. To fix this issue, we should make the calculation of oom points more accurate. We can achieve that by converting chosen_points from 'unsigned long' to 'long'.
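A stripped-down numeric model makes the failure mode concrete. The sketch below keeps only the rss and oom_score_adj terms of the badness formula (the real oom_badness() also counts swap entries and page-table pages, and returns LONG_MIN for unkillable tasks); the rss values are borrowed from the pause and data_sim rows of the log above:

	#include <stdio.h>

	/* Toy version of the signed badness calculation. */
	static long long badness(long long rss_pages, long long totalpages,
				 long long oom_score_adj)
	{
		return rss_pages + oom_score_adj * totalpages / 1000;
	}

	int main(void)
	{
		long long total = 16LL << 30 >> 12;	/* 16G of 4K pages */

		/* rss values taken from the pause and data_sim rows above. */
		long long pause = badness(1, total, -998);
		long long data_sim = badness(3967322, total, -998);

		printf("pause    -> %lld\n", pause);
		printf("data_sim -> %lld\n", data_sim);
		printf("victim   -> %s\n",
		       data_sim > pause ? "data_sim" : "pause");
		return 0;
	}

Both scores come out negative, so the old "clamp to 1" behavior collapsed them to the same value and the scan order decided the victim; keeping the signed values preserves the ordering, and data_sim, with ~15G of rss, is correctly selected.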
[cai@lca.pw: reported an issue in the previous version] [mhocko@suse.com: fixed the issue reported by Cai] [mhocko@suse.com: add the comment in proc_oom_score()] [laoar.shao@gmail.com: v3] Link: http://lkml.kernel.org/r/1594396651-9931-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Yafang Shao Signed-off-by: Andrew Morton Tested-by: Naresh Kamboju Acked-by: Michal Hocko Cc: David Rientjes Cc: Qian Cai Link: http://lkml.kernel.org/r/1594309987-9919-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Linus Torvalds --- fs/proc/base.c | 11 ++++++++++- include/linux/oom.h | 4 ++-- mm/oom_kill.c | 22 ++++++++++------------ 3 files changed, 22 insertions(+), 15 deletions(-) diff --git a/fs/proc/base.c b/fs/proc/base.c index a333caeca291..617db4e0faa0 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -551,8 +551,17 @@ static int proc_oom_score(struct seq_file *m, struct pid_namespace *ns, { unsigned long totalpages = totalram_pages() + total_swap_pages; unsigned long points = 0; + long badness; + + badness = oom_badness(task, totalpages); + /* + * Special case OOM_SCORE_ADJ_MIN; for all others, scale the + * badness value into [0, 2000] range which we have been + * exporting for a long time so userspace might depend on it. + */ + if (badness != LONG_MIN) + points = (1000 + badness * 1000 / (long)totalpages) * 2 / 3; - points = oom_badness(task, totalpages) * 1000 / totalpages; seq_printf(m, "%lu\n", points); return 0; diff --git a/include/linux/oom.h b/include/linux/oom.h index c696c265f019..f022f581ac29 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -48,7 +48,7 @@ struct oom_control { /* Used by oom implementation, do not set */ unsigned long totalpages; struct task_struct *chosen; - unsigned long chosen_points; + long chosen_points; /* Used to print the constraint info. */ enum oom_constraint constraint; @@ -107,7 +107,7 @@ static inline vm_fault_t check_stable_address_space(struct mm_struct *mm) bool __oom_reap_task_mm(struct mm_struct *mm); -extern unsigned long oom_badness(struct task_struct *p, +long oom_badness(struct task_struct *p, unsigned long totalpages); extern bool out_of_memory(struct oom_control *oc); diff --git a/mm/oom_kill.c b/mm/oom_kill.c index d30ce75f23fb..48e0db54d838 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -196,17 +196,17 @@ static bool is_dump_unreclaim_slabs(void) * predictable as possible. The goal is to return the highest value for the * task consuming the most memory to avoid subsequent oom failures. */ -unsigned long oom_badness(struct task_struct *p, unsigned long totalpages) +long oom_badness(struct task_struct *p, unsigned long totalpages) { long points; long adj; if (oom_unkillable_task(p)) - return 0; + return LONG_MIN; p = find_lock_task_mm(p); if (!p) - return LONG_MIN; /* * Do not even consider tasks which are explicitly marked oom @@ -218,7 +218,7 @@ unsigned long oom_badness(struct task_struct *p, unsigned long totalpages) test_bit(MMF_OOM_SKIP, &p->mm->flags) || in_vfork(p)) { task_unlock(p); - return 0; + return LONG_MIN; } /* * @@ -233,11 +233,7 @@ unsigned long oom_badness(struct task_struct *p, unsigned long totalpages) adj *= totalpages / 1000; points += adj; - /* - * Never return 0 for an eligible task regardless of the root bonus and - * oom_score_adj (oom_score_adj can't be OOM_SCORE_ADJ_MIN here). - */ - return points > 0 ?
points : 1; + return points; } static const char * const oom_constraint_text[] = { @@ -310,7 +306,7 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc) static int oom_evaluate_task(struct task_struct *task, void *arg) { struct oom_control *oc = arg; - unsigned long points; + long points; if (oom_unkillable_task(task)) goto next; @@ -336,12 +332,12 @@ static int oom_evaluate_task(struct task_struct *task, void *arg) * killed first if it triggers an oom, then select it. */ if (oom_task_origin(task)) { - points = ULONG_MAX; + points = LONG_MAX; goto select; } points = oom_badness(task, oc->totalpages); - if (!points || points < oc->chosen_points) + if (points == LONG_MIN || points < oc->chosen_points) goto next; select: @@ -365,6 +361,8 @@ static int oom_evaluate_task(struct task_struct *task, void *arg) */ static void select_bad_process(struct oom_control *oc) { + oc->chosen_points = LONG_MIN; + if (is_memcg_oom(oc)) mem_cgroup_scan_tasks(oc->memcg, oom_evaluate_task, oc); else { From de3f32e1424ca5c751d4cd78cbed6d17ca261c24 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Tue, 11 Aug 2020 18:31:25 -0700 Subject: [PATCH 022/164] doc, mm: sync up oom_score_adj documentation There are at least two stale notes in the oom section. The 3% discount for root processes is gone since d46078b28889 ("mm, oom: remove 3% bonus for CAP_SYS_ADMIN processes"). Likewise, children of the selected oom victim are not sacrificed since bbbe48029720 ("mm, oom: remove 'prefer children over parent' heuristic"). Drop both of them. Signed-off-by: Michal Hocko Signed-off-by: Andrew Morton Cc: Jonathan Corbet Cc: David Rientjes Cc: Yafang Shao Link: http://lkml.kernel.org/r/20200709062603.18480-1-mhocko@kernel.org Signed-off-by: Linus Torvalds --- Documentation/filesystems/proc.rst | 8 -------- 1 file changed, 8 deletions(-) diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst index e024a9efffd8..eb96a440a064 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -1633,9 +1633,6 @@ may allocate from based on an estimation of its current memory and swap use. For example, if a task is using all allowed memory, its badness score will be 1000. If it is using half of its allowed memory, its score will be 500. -There is an additional factor included in the badness score: the current memory -and swap usage is discounted by 3% for root processes. - The amount of "allowed" memory depends on the context in which the oom killer was called. If it is due to the memory assigned to the allocating task's cpuset being exhausted, the allowed memory represents the set of mems assigned to that @@ -1671,11 +1668,6 @@ The value of /proc//oom_score_adj may be reduced no lower than the last value set by a CAP_SYS_RESOURCE process. To reduce the value any lower requires CAP_SYS_RESOURCE. -Caveat: when a parent task is selected, the oom killer will sacrifice any first -generation children with separate address spaces instead, if possible. This -avoids servers and important system daemons from being killed and loses the -minimal amount of work.
- 3.2 /proc//oom_score - Display current oom-killer score ------------------------------------------------------------- From b1aa7c9377bdff904efa0fbde1f96e1769730d8a Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Tue, 11 Aug 2020 18:31:28 -0700 Subject: [PATCH 023/164] doc, mm: clarify /proc//oom_score value range The exported value includes oom_score_adj, so the range is not [0, 1000] as described in the previous section but rather [0, 2000]. Mention that fact explicitly. Signed-off-by: Michal Hocko Signed-off-by: Andrew Morton Cc: Jonathan Corbet Cc: David Rientjes Cc: Yafang Shao Link: http://lkml.kernel.org/r/20200709062603.18480-2-mhocko@kernel.org Signed-off-by: Linus Torvalds --- Documentation/filesystems/proc.rst | 3 +++ 1 file changed, 3 insertions(+) diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst index eb96a440a064..533c79e8d2cd 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -1676,6 +1676,9 @@ This file can be used to check the current score used by the oom-killer for any given . Use it together with /proc//oom_score_adj to tune which process should be killed in an out-of-memory situation. +Please note that the exported value includes oom_score_adj so it is +effectively in range [0,2000]. + 3.3 /proc//io - Display the IO accounting fields ------------------------------------------------------- From 619b5b469bcab84ea3bee1d8d04451c781d23feb Mon Sep 17 00:00:00 2001 From: Yafang Shao Date: Tue, 11 Aug 2020 18:31:32 -0700 Subject: [PATCH 024/164] mm, oom: show process exiting information in __oom_kill_process() When the OOM killer finds a victim and tries to kill it, if the victim is already exiting, the task mm will be NULL and no process will be killed. But dump_header() has already been executed, so it looks strange to dump so much information without killing a process. We'd better show some helpful information to indicate why this happens. Suggested-by: David Rientjes Signed-off-by: Yafang Shao Signed-off-by: Andrew Morton Acked-by: Michal Hocko Cc: Tetsuo Handa Cc: Qian Cai Link: http://lkml.kernel.org/r/20200721010127.17238-1-laoar.shao@gmail.com Signed-off-by: Linus Torvalds --- mm/oom_kill.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 48e0db54d838..e90f25d6385d 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -861,6 +861,8 @@ static void __oom_kill_process(struct task_struct *victim, const char *message) p = find_lock_task_mm(victim); if (!p) { + pr_info("%s: OOM victim %d (%s) is already exiting. Skip killing the task\n", + message, task_pid_nr(victim), victim->comm); put_task_struct(victim); return; } else if (victim != p) { From 15568299b7d9988063afce60731df605ab236e2a Mon Sep 17 00:00:00 2001 From: Mike Kravetz Date: Tue, 11 Aug 2020 18:31:35 -0700 Subject: [PATCH 025/164] hugetlbfs: prevent filesystem stacking of hugetlbfs syzbot found issues with having hugetlbfs on a union/overlay as reported in [1]. Due to the limitations (no write) and special functionality of hugetlbfs, it does not work well in filesystem stacking. There are no known use cases for hugetlbfs stacking. Rather than making modifications to get hugetlbfs working in such environments, simply prevent stacking.
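For context, the one-line fix below works because every stacking filesystem derives its own depth from the filesystem it sits on. Roughly (a simplified sketch, not the exact overlayfs code):

/* sketch of the check a stacking fs performs at mount time */
sb->s_stack_depth = lower_sb->s_stack_depth + 1;
if (sb->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
	pr_err("maximum fs stacking depth exceeded\n");
	return -EINVAL;
}

Since hugetlbfs now reports s_stack_depth == FILESYSTEM_MAX_STACK_DEPTH, the increment always trips the check and any attempt to stack on top of it fails with -EINVAL.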
[1] https://lore.kernel.org/linux-mm/000000000000b4684e05a2968ca6@google.com/ Reported-by: syzbot+d6ec23007e951dadf3de@syzkaller.appspotmail.com Suggested-by: Amir Goldstein Signed-off-by: Mike Kravetz Signed-off-by: Andrew Morton Acked-by: Miklos Szeredi Cc: Al Viro Cc: Matthew Wilcox Cc: Colin Walters Link: http://lkml.kernel.org/r/80f869aa-810d-ef6c-8888-b46cee135907@oracle.com Signed-off-by: Linus Torvalds --- fs/hugetlbfs/inode.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 523954d00dff..b5c109703daa 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -1364,6 +1364,12 @@ hugetlbfs_fill_super(struct super_block *sb, struct fs_context *fc) sb->s_magic = HUGETLBFS_MAGIC; sb->s_op = &hugetlbfs_ops; sb->s_time_gran = 1; + + /* + * Due to the special and limited functionality of hugetlbfs, it does + * not work well as a stacking filesystem. + */ + sb->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH; sb->s_root = d_make_root(hugetlbfs_get_root(sb, ctx)); if (!sb->s_root) goto out_free; From 34ae204f18519f0920bd50a644abd6fefc8dbfcf Mon Sep 17 00:00:00 2001 From: Mike Kravetz Date: Tue, 11 Aug 2020 18:31:38 -0700 Subject: [PATCH 026/164] hugetlbfs: remove call to huge_pte_alloc without i_mmap_rwsem Commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization") requires callers of huge_pte_alloc to hold i_mmap_rwsem in at least read mode. This is because the explicit locking in huge_pmd_share (called by huge_pte_alloc) was removed. When restructuring the code, the call to huge_pte_alloc in the else block at the beginning of hugetlb_fault was missed. Unfortunately, that else clause is exercised when there is no page table entry. This will likely lead to a call to huge_pmd_share. If huge_pmd_share thinks pmd sharing is possible, it will traverse the mapping tree (i_mmap) without holding i_mmap_rwsem. If someone else is modifying the tree, bad things such as addressing exceptions or worse could happen. Simply remove the else clause. It should have been removed previously. The code following the else will call huge_pte_alloc with the appropriate locking. To prevent this type of issue in the future, add routines to assert that i_mmap_rwsem is held, and call these routines in huge pmd sharing routines. Fixes: c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization") Suggested-by: Matthew Wilcox Signed-off-by: Mike Kravetz Signed-off-by: Andrew Morton Cc: Michal Hocko Cc: Hugh Dickins Cc: Naoya Horiguchi Cc: "Aneesh Kumar K.V" Cc: Andrea Arcangeli Cc: "Kirill A.Shutemov" Cc: Davidlohr Bueso Cc: Prakash Sangappa Cc: Link: http://lkml.kernel.org/r/e670f327-5cf9-1959-96e4-6dc7cc30d3d5@oracle.com Signed-off-by: Linus Torvalds --- include/linux/fs.h | 10 ++++++++++ include/linux/hugetlb.h | 8 +++++--- mm/hugetlb.c | 15 +++++++-------- mm/rmap.c | 2 +- 4 files changed, 23 insertions(+), 12 deletions(-) diff --git a/include/linux/fs.h b/include/linux/fs.h index 011af396aa17..7c69dd7c6160 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -518,6 +518,16 @@ static inline void i_mmap_unlock_read(struct address_space *mapping) up_read(&mapping->i_mmap_rwsem); } +static inline void i_mmap_assert_locked(struct address_space *mapping) +{ + lockdep_assert_held(&mapping->i_mmap_rwsem); +} + +static inline void i_mmap_assert_write_locked(struct address_space *mapping) +{ + lockdep_assert_held_write(&mapping->i_mmap_rwsem); +} + /* * Might pages of this file be mapped into userspace? 
*/ diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 50650d0d01b9..a520bf26e5d8 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -164,7 +164,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz); pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr, unsigned long sz); -int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep); +int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long *addr, pte_t *ptep); void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma, unsigned long *start, unsigned long *end); struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address, @@ -203,8 +204,9 @@ static inline struct address_space *hugetlb_page_mapping_lock_write( return NULL; } -static inline int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, - pte_t *ptep) +static inline int huge_pmd_unshare(struct mm_struct *mm, + struct vm_area_struct *vma, + unsigned long *addr, pte_t *ptep) { return 0; } diff --git a/mm/hugetlb.c b/mm/hugetlb.c index dffafb5bf2ed..8a18f1234e80 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -3967,7 +3967,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma, continue; ptl = huge_pte_lock(h, mm, ptep); - if (huge_pmd_unshare(mm, &address, ptep)) { + if (huge_pmd_unshare(mm, vma, &address, ptep)) { spin_unlock(ptl); /* * We just unmapped a page of PMDs by clearing a PUD. @@ -4554,10 +4554,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, } else if (unlikely(is_hugetlb_entry_hwpoisoned(entry))) return VM_FAULT_HWPOISON_LARGE | VM_FAULT_SET_HINDEX(hstate_index(h)); - } else { - ptep = huge_pte_alloc(mm, haddr, huge_page_size(h)); - if (!ptep) - return VM_FAULT_OOM; } /* @@ -5034,7 +5030,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma, if (!ptep) continue; ptl = huge_pte_lock(h, mm, ptep); - if (huge_pmd_unshare(mm, &address, ptep)) { + if (huge_pmd_unshare(mm, vma, &address, ptep)) { pages++; spin_unlock(ptl); shared_pmd = true; @@ -5415,12 +5411,14 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud) * returns: 1 successfully unmapped a shared pte page * 0 the underlying pte page is not shared, or it is the last user */ -int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep) +int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long *addr, pte_t *ptep) { pgd_t *pgd = pgd_offset(mm, *addr); p4d_t *p4d = p4d_offset(pgd, *addr); pud_t *pud = pud_offset(p4d, *addr); + i_mmap_assert_write_locked(vma->vm_file->f_mapping); BUG_ON(page_count(virt_to_page(ptep)) == 0); if (page_count(virt_to_page(ptep)) == 1) return 0; @@ -5438,7 +5436,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud) return NULL; } -int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep) +int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long *addr, pte_t *ptep) { return 0; } diff --git a/mm/rmap.c b/mm/rmap.c index 5fe2dedce1fc..6cce9ef06753 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1469,7 +1469,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, * do this outside rmap routines. */ VM_BUG_ON(!(flags & TTU_RMAP_LOCKED)); - if (huge_pmd_unshare(mm, &address, pvmw.pte)) { + if (huge_pmd_unshare(mm, vma, &address, pvmw.pte)) { /* * huge_pmd_unshare unmapped an entire PMD * page. 
There is no way of knowing exactly From 0744f2807a4ffb111a31c0769878ec8b125b5241 Mon Sep 17 00:00:00 2001 From: Ralph Campbell Date: Tue, 11 Aug 2020 18:31:41 -0700 Subject: [PATCH 027/164] mm/migrate: optimize migrate_vma_setup() for holes Patch series "mm/migrate: optimize migrate_vma_setup() for holes". A simple optimization for migrate_vma_*() when the source vma is not an anonymous vma and a new test case to exercise it. This patch (of 2): When migrating system memory to device private memory, if the source address range is a valid VMA range and there is no memory or a zero page, the source PFN array is marked as valid but with no PFN. This lets the device driver allocate private memory and clear it, then insert the new device private struct page into the CPU's page tables when migrate_vma_pages() is called. migrate_vma_pages() only inserts the new page if the VMA is an anonymous range. There is no point in telling the device driver to allocate device private memory and then not migrate the page. Instead, mark the source PFN array entries as not migrating to avoid this overhead. [rcampbell@nvidia.com: v2] Link: http://lkml.kernel.org/r/20200710194840.7602-2-rcampbell@nvidia.com Signed-off-by: Ralph Campbell Signed-off-by: Andrew Morton Cc: Jerome Glisse Cc: John Hubbard Cc: Christoph Hellwig Cc: Jason Gunthorpe Cc: "Bharata B Rao" Cc: Shuah Khan Link: http://lkml.kernel.org/r/20200710194840.7602-1-rcampbell@nvidia.com Link: http://lkml.kernel.org/r/20200709165711.26584-1-rcampbell@nvidia.com Link: http://lkml.kernel.org/r/20200709165711.26584-2-rcampbell@nvidia.com Signed-off-by: Linus Torvalds --- mm/migrate.c | 16 ++++++++++++++-- 1 file changed, 14 insertions(+), 2 deletions(-) diff --git a/mm/migrate.c b/mm/migrate.c index 819e55130134..b52f30113366 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2168,6 +2168,16 @@ static int migrate_vma_collect_hole(unsigned long start, struct migrate_vma *migrate = walk->private; unsigned long addr; + /* Only allow populating anonymous memory. */ + if (!vma_is_anonymous(walk->vma)) { + for (addr = start; addr < end; addr += PAGE_SIZE) { + migrate->src[migrate->npages] = 0; + migrate->dst[migrate->npages] = 0; + migrate->npages++; + } + return 0; + } + for (addr = start; addr < end; addr += PAGE_SIZE) { migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE; migrate->dst[migrate->npages] = 0; @@ -2260,8 +2270,10 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, pte = *ptep; if (pte_none(pte)) { - mpfn = MIGRATE_PFN_MIGRATE; - migrate->cpages++; + if (vma_is_anonymous(vma)) { + mpfn = MIGRATE_PFN_MIGRATE; + migrate->cpages++; + } goto next; } From b0fc0f3fca80cf8ea8631eaec6b922bca4c4c2dd Mon Sep 17 00:00:00 2001 From: Ralph Campbell Date: Tue, 11 Aug 2020 18:31:45 -0700 Subject: [PATCH 028/164] mm/migrate: add migrate-shared test for migrate_vma_*() Add a migrate_vma_*() self test for mmap(MAP_SHARED) to verify that !vma_anonymous() ranges won't be migrated. 
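A note on why an mmap(MAP_SHARED | MAP_ANONYMOUS) range is not "anonymous" here: shared anonymous mappings are backed by shmem, which installs vm_ops on the vma, and the anonymity test used by the migrate path is simply the vm_ops check (this helper is from include/linux/mm.h):

static inline bool vma_is_anonymous(struct vm_area_struct *vma)
{
	return !vma->vm_ops;	/* shmem-backed MAP_SHARED vmas have vm_ops */
}

Hence the test added below expects the HMM_DMIRROR_MIGRATE request to fail with -ENOENT: no page in the range is marked for migration.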
Signed-off-by: Ralph Campbell Signed-off-by: Andrew Morton Cc: Jerome Glisse Cc: John Hubbard Cc: Christoph Hellwig Cc: Jason Gunthorpe Cc: "Bharata B Rao" Cc: Shuah Khan Link: http://lkml.kernel.org/r/20200710194840.7602-3-rcampbell@nvidia.com Link: http://lkml.kernel.org/r/20200709165711.26584-3-rcampbell@nvidia.com Signed-off-by: Linus Torvalds --- tools/testing/selftests/vm/hmm-tests.c | 35 ++++++++++++++++++++++++++ 1 file changed, 35 insertions(+) diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c index 91d38a29956b..93fc5cadce61 100644 --- a/tools/testing/selftests/vm/hmm-tests.c +++ b/tools/testing/selftests/vm/hmm-tests.c @@ -941,6 +941,41 @@ TEST_F(hmm, migrate_fault) hmm_buffer_free(buffer); } +/* + * Migrate anonymous shared memory to device private memory. + */ +TEST_F(hmm, migrate_shared) +{ + struct hmm_buffer *buffer; + unsigned long npages; + unsigned long size; + int ret; + + npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift; + ASSERT_NE(npages, 0); + size = npages << self->page_shift; + + buffer = malloc(sizeof(*buffer)); + ASSERT_NE(buffer, NULL); + + buffer->fd = -1; + buffer->size = size; + buffer->mirror = malloc(size); + ASSERT_NE(buffer->mirror, NULL); + + buffer->ptr = mmap(NULL, size, + PROT_READ | PROT_WRITE, + MAP_SHARED | MAP_ANONYMOUS, + buffer->fd, 0); + ASSERT_NE(buffer->ptr, MAP_FAILED); + + /* Migrate memory to device. */ + ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages); + ASSERT_EQ(ret, -ENOENT); + + hmm_buffer_free(buffer); +} + /* * Try to migrate various memory types to device private memory. */ From 4958e4d86ecb011a81f5d80ce2023ef9f10692d8 Mon Sep 17 00:00:00 2001 From: Yang Shi Date: Tue, 11 Aug 2020 18:31:48 -0700 Subject: [PATCH 029/164] mm: thp: remove debug_cow switch Since commit 3917c80280c93a7123f ("thp: change CoW semantics for anon-THP"), the CoW page fault of THP has been rewritten and debug_cow is not used anymore. So, just remove it. Signed-off-by: Yang Shi Signed-off-by: Andrew Morton Reviewed-by: Zi Yan Cc: Kirill A. Shutemov Link: http://lkml.kernel.org/r/1592270980-116062-1-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Linus Torvalds --- include/linux/huge_mm.h | 7 ------- mm/huge_memory.c | 21 --------------------- 2 files changed, 28 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 17c4c4975145..467302056e17 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -181,13 +181,6 @@ static inline bool transhuge_vma_suitable(struct vm_area_struct *vma, #define transparent_hugepage_use_zero_page() \ (transparent_hugepage_flags & \ (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG)) From: Anshuman Khandual Date: Tue, 11 Aug 2020 18:31:51 -0700 Subject: [PATCH 030/164] mm/vmstat: add events for THP migration without split Add the following new vmstat events which will help in validating THP migration without split. Statistics reported through these new VM events will help in performance debugging. 1. THP_MIGRATION_SUCCESS 2. THP_MIGRATION_FAIL 3. THP_MIGRATION_SPLIT In addition, these new events also update normal page migration statistics appropriately via PGMIGRATE_SUCCESS and PGMIGRATE_FAIL. While here, this updates the current trace event 'mm_migrate_pages' to accommodate the now available THP statistics.
[akpm@linux-foundation.org: s/hpage_nr_pages/thp_nr_pages/] [ziy@nvidia.com: v2] Link: http://lkml.kernel.org/r/C5E3C65C-8253-4638-9D3C-71A61858BB8B@nvidia.com [anshuman.khandual@arm.com: s/thp_nr_pages/hpage_nr_pages/] Link: http://lkml.kernel.org/r/1594287583-16568-1-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual Signed-off-by: Zi Yan Signed-off-by: Andrew Morton Reviewed-by: Daniel Jordan Cc: Hugh Dickins Cc: Matthew Wilcox Cc: Zi Yan Cc: John Hubbard Cc: Naoya Horiguchi Link: http://lkml.kernel.org/r/1594080415-27924-1-git-send-email-anshuman.khandual@arm.com Signed-off-by: Linus Torvalds --- Documentation/vm/page_migration.rst | 27 +++++++++++++++ include/linux/vm_event_item.h | 3 ++ include/trace/events/migrate.h | 17 ++++++++-- mm/migrate.c | 52 ++++++++++++++++++++++++----- mm/vmstat.c | 3 ++ 5 files changed, 91 insertions(+), 11 deletions(-) diff --git a/Documentation/vm/page_migration.rst b/Documentation/vm/page_migration.rst index 1d6cd7db4e43..68883ac485fa 100644 --- a/Documentation/vm/page_migration.rst +++ b/Documentation/vm/page_migration.rst @@ -253,5 +253,32 @@ which are function pointers of struct address_space_operations. PG_isolated is alias with PG_reclaim flag so driver shouldn't use the flag for own purpose. +Monitoring Migration +===================== + +The following events (counters) can be used to monitor page migration. + +1. PGMIGRATE_SUCCESS: Normal page migration success. Each count means that a + page was migrated. If the page was a non-THP page, then this counter is + increased by one. If the page was a THP, then this counter is increased by + the number of THP subpages. For example, migration of a single 2MB THP that + has 4KB-size base pages (subpages) will cause this counter to increase by + 512. + +2. PGMIGRATE_FAIL: Normal page migration failure. Same counting rules as for + _SUCCESS, above: this will be increased by the number of subpages, if it was + a THP. + +3. THP_MIGRATION_SUCCESS: A THP was migrated without being split. + +4. THP_MIGRATION_FAIL: A THP could not be migrated, nor could it be split. + +5. THP_MIGRATION_SPLIT: A THP was migrated, but not as such: first, the THP had + to be split. After splitting, a migration retry was used for its sub-pages. + +THP_MIGRATION_* events also update the appropriate PGMIGRATE_SUCCESS or +PGMIGRATE_FAIL events. For example, a THP migration failure will cause both +THP_MIGRATION_FAIL and PGMIGRATE_FAIL to increase. + Christoph Lameter, May 8, 2006. Minchan Kim, Mar 28, 2016.
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h index 24fc7c3ae7d6..2e6ca53b9bbd 100644 --- a/include/linux/vm_event_item.h +++ b/include/linux/vm_event_item.h @@ -56,6 +56,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, #endif #ifdef CONFIG_MIGRATION PGMIGRATE_SUCCESS, PGMIGRATE_FAIL, + THP_MIGRATION_SUCCESS, + THP_MIGRATION_FAIL, + THP_MIGRATION_SPLIT, #endif #ifdef CONFIG_COMPACTION COMPACTMIGRATE_SCANNED, COMPACTFREE_SCANNED, diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h index 705b33d1e395..4d434398d64d 100644 --- a/include/trace/events/migrate.h +++ b/include/trace/events/migrate.h @@ -46,13 +46,18 @@ MIGRATE_REASON TRACE_EVENT(mm_migrate_pages, TP_PROTO(unsigned long succeeded, unsigned long failed, - enum migrate_mode mode, int reason), + unsigned long thp_succeeded, unsigned long thp_failed, + unsigned long thp_split, enum migrate_mode mode, int reason), - TP_ARGS(succeeded, failed, mode, reason), + TP_ARGS(succeeded, failed, thp_succeeded, thp_failed, + thp_split, mode, reason), TP_STRUCT__entry( __field( unsigned long, succeeded) __field( unsigned long, failed) + __field( unsigned long, thp_succeeded) + __field( unsigned long, thp_failed) + __field( unsigned long, thp_split) __field( enum migrate_mode, mode) __field( int, reason) ), @@ -60,13 +65,19 @@ TRACE_EVENT(mm_migrate_pages, TP_fast_assign( __entry->succeeded = succeeded; __entry->failed = failed; + __entry->thp_succeeded = thp_succeeded; + __entry->thp_failed = thp_failed; + __entry->thp_split = thp_split; __entry->mode = mode; __entry->reason = reason; ), - TP_printk("nr_succeeded=%lu nr_failed=%lu mode=%s reason=%s", + TP_printk("nr_succeeded=%lu nr_failed=%lu nr_thp_succeeded=%lu nr_thp_failed=%lu nr_thp_split=%lu mode=%s reason=%s", __entry->succeeded, __entry->failed, + __entry->thp_succeeded, + __entry->thp_failed, + __entry->thp_split, __print_symbolic(__entry->mode, MIGRATE_MODE), __print_symbolic(__entry->reason, MIGRATE_REASON)) ); diff --git a/mm/migrate.c b/mm/migrate.c index b52f30113366..5ea275878383 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1418,22 +1418,35 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page, enum migrate_mode mode, int reason) { int retry = 1; + int thp_retry = 1; int nr_failed = 0; int nr_succeeded = 0; + int nr_thp_succeeded = 0; + int nr_thp_failed = 0; + int nr_thp_split = 0; int pass = 0; + bool is_thp = false; struct page *page; struct page *page2; int swapwrite = current->flags & PF_SWAPWRITE; - int rc; + int rc, nr_subpages; if (!swapwrite) current->flags |= PF_SWAPWRITE; - for(pass = 0; pass < 10 && retry; pass++) { + for (pass = 0; pass < 10 && (retry || thp_retry); pass++) { retry = 0; + thp_retry = 0; list_for_each_entry_safe(page, page2, from, lru) { retry: + /* + * THP statistics is based on the source huge page. + * Capture required information that might get lost + * during migration. 
+ */ is_thp = PageTransHuge(page); nr_subpages = hpage_nr_pages(page); cond_resched(); if (PageHuge(page)) @@ -1464,15 +1477,30 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page, unlock_page(page); if (!rc) { list_safe_reset_next(page, page2, lru); + nr_thp_split++; goto retry; } } + if (is_thp) { + nr_thp_failed++; + nr_failed += nr_subpages; + goto out; + } nr_failed++; goto out; case -EAGAIN: + if (is_thp) { + thp_retry++; + break; + } retry++; break; case MIGRATEPAGE_SUCCESS: + if (is_thp) { + nr_thp_succeeded++; + nr_succeeded += nr_subpages; + break; + } nr_succeeded++; break; default: @@ -1482,19 +1510,27 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page, * removed from migration page list and not * retried in the next outer loop. */ + if (is_thp) { + nr_thp_failed++; + nr_failed += nr_subpages; + break; + } nr_failed++; break; } } } - nr_failed += retry; + nr_failed += retry + thp_retry; + nr_thp_failed += thp_retry; rc = nr_failed; out: - if (nr_succeeded) - count_vm_events(PGMIGRATE_SUCCESS, nr_succeeded); - if (nr_failed) - count_vm_events(PGMIGRATE_FAIL, nr_failed); - trace_mm_migrate_pages(nr_succeeded, nr_failed, mode, reason); + count_vm_events(PGMIGRATE_SUCCESS, nr_succeeded); + count_vm_events(PGMIGRATE_FAIL, nr_failed); + count_vm_events(THP_MIGRATION_SUCCESS, nr_thp_succeeded); + count_vm_events(THP_MIGRATION_FAIL, nr_thp_failed); + count_vm_events(THP_MIGRATION_SPLIT, nr_thp_split); + trace_mm_migrate_pages(nr_succeeded, nr_failed, nr_thp_succeeded, + nr_thp_failed, nr_thp_split, mode, reason); if (!swapwrite) current->flags &= ~PF_SWAPWRITE; diff --git a/mm/vmstat.c b/mm/vmstat.c index 3bb034c99887..727a26d1ec1d 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1277,6 +1277,9 @@ const char * const vmstat_text[] = { #ifdef CONFIG_MIGRATION "pgmigrate_success", "pgmigrate_fail", + "thp_migration_success", + "thp_migration_fail", + "thp_migration_split", #endif #ifdef CONFIG_COMPACTION "compact_migrate_scanned", From 835832ba01bb444c7e45139e4b807527c119dafc Mon Sep 17 00:00:00 2001 From: Jianqun Xu Date: Tue, 11 Aug 2020 18:31:54 -0700 Subject: [PATCH 031/164] mm/cma.c: fix NULL pointer dereference when cma could not be activated In some cases the cma area could not be activated, but cma_alloc() may still be called; the kernel will then crash due to a NULL pointer dereference. Add a bitmap validity check in cma_alloc() to avoid this issue. Signed-off-by: Jianqun Xu Signed-off-by: Andrew Morton Reviewed-by: David Hildenbrand Link: http://lkml.kernel.org/r/20200615010123.15596-1-jay.xu@rock-chips.com Signed-off-by: Linus Torvalds --- mm/cma.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/cma.c b/mm/cma.c index 26ecff818881..3a18f8d8ea5e 100644 --- a/mm/cma.c +++ b/mm/cma.c @@ -425,7 +425,7 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align, struct page *page = NULL; int ret = -ENOMEM; - if (!cma || !cma->count) + if (!cma || !cma->count || !cma->bitmap) return NULL; pr_debug("%s(cma %p, count %zu, align %d)\n", __func__, (void *)cma, From 18e98e56f4407b8c766b6f6d9f2850fd2a081e4d Mon Sep 17 00:00:00 2001 From: Barry Song Date: Tue, 11 Aug 2020 18:31:57 -0700 Subject: [PATCH 032/164] mm: cma: fix the name of CMA areas Patch series "mm: fix the names of general cma and hugetlb cma", v2. The current code of CMA can only work when users pass a const string as the name parameter. We need to fix the way names are handled in CMA.
On the other hand, to avoid name conflicts after enabling CMA_DEBUGFS, each hugetlb CMA area should get a different name. This patch (of 2): If users give a name saved on the stack, the current code ends up with a dangling pointer into a dead stack frame. If users don't give a name (NULL), kasprintf() will always return NULL as we are at an early boot stage; that means cma_init_reserved_mem() will return -ENOMEM if users set the name parameter as NULL. [natechancellor@gmail.com: return cma->name directly in cma_get_name] Link: https://github.com/ClangBuiltLinux/linux/issues/1063 Link: http://lkml.kernel.org/r/20200623015840.621964-1-natechancellor@gmail.com Signed-off-by: Barry Song Signed-off-by: Nathan Chancellor Signed-off-by: Andrew Morton Reviewed-by: Mike Kravetz Acked-by: Roman Gushchin Link: http://lkml.kernel.org/r/20200616223131.33828-2-song.bao.hua@hisilicon.com Signed-off-by: Linus Torvalds --- mm/cma.c | 15 +++++++-------- mm/cma.h | 4 +++- 2 files changed, 10 insertions(+), 9 deletions(-) diff --git a/mm/cma.c b/mm/cma.c index 3a18f8d8ea5e..aa2df0bd49c3 100644 --- a/mm/cma.c +++ b/mm/cma.c @@ -52,7 +52,7 @@ unsigned long cma_get_size(const struct cma *cma) const char *cma_get_name(const struct cma *cma) { - return cma->name ? cma->name : "(undefined)"; + return cma->name; } static unsigned long cma_bitmap_aligned_mask(const struct cma *cma, @@ -202,13 +202,12 @@ int __init cma_init_reserved_mem(phys_addr_t base, phys_addr_t size, * subsystems (like slab allocator) are available. */ cma = &cma_areas[cma_area_count]; - if (name) { - cma->name = name; - } else { - cma->name = kasprintf(GFP_KERNEL, "cma%d\n", cma_area_count); - if (!cma->name) - return -ENOMEM; - } + + if (name) + snprintf(cma->name, CMA_MAX_NAME, name); + else + snprintf(cma->name, CMA_MAX_NAME, "cma%d\n", cma_area_count); + cma->base_pfn = PFN_DOWN(base); cma->count = size >> PAGE_SHIFT; cma->order_per_bit = order_per_bit; diff --git a/mm/cma.h b/mm/cma.h index 6698fa63279b..20f6e24bc477 100644 --- a/mm/cma.h +++ b/mm/cma.h @@ -4,6 +4,8 @@ #include +#define CMA_MAX_NAME 64 + struct cma { unsigned long base_pfn; unsigned long count; @@ -15,7 +17,7 @@ struct cma { spinlock_t mem_head_lock; struct debugfs_u32_array dfs_bitmap; #endif - const char *name; + char name[CMA_MAX_NAME]; }; extern struct cma cma_areas[MAX_CMA_AREAS]; From 29d0f41d232393482ced6a233126832df3429ad2 Mon Sep 17 00:00:00 2001 From: Barry Song Date: Tue, 11 Aug 2020 18:32:00 -0700 Subject: [PATCH 033/164] mm: hugetlb: fix the name of hugetlb CMA Once we enable CMA_DEBUGFS, we will get the below error: directory 'cma-hugetlb' with parent 'cma' already present. We should have different names for different CMA areas.
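To make the stack-name failure mode concrete, here is a hedged sketch of a hypothetical caller (reserve_example_cma() and example_cma are illustrative names, not kernel symbols). It also shows why the per-node naming in the hugetlb patch below is only safe on top of the previous patch: the name is built in a stack buffer, and cma now copies it into cma->name[] instead of keeping the caller's pointer.

static struct cma *example_cma;		/* hypothetical */

static int __init reserve_example_cma(phys_addr_t base, phys_addr_t size)
{
	char name[20];		/* gone once this function returns */

	snprintf(name, sizeof(name), "example%d", 0);
	/*
	 * Old behaviour: cma->name = name; dangles after return.
	 * New behaviour: the string is copied into cma->name[].
	 */
	return cma_init_reserved_mem(base, size, 0, name, &example_cma);
}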
Signed-off-by: Barry Song Signed-off-by: Andrew Morton Reviewed-by: Mike Kravetz Acked-by: Roman Gushchin Link: http://lkml.kernel.org/r/20200616223131.33828-3-song.bao.hua@hisilicon.com Signed-off-by: Linus Torvalds --- mm/hugetlb.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 8a18f1234e80..6b6b0d5cc642 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -5707,12 +5707,14 @@ void __init hugetlb_cma_reserve(int order) reserved = 0; for_each_node_state(nid, N_ONLINE) { int res; + char name[20]; size = min(per_node, hugetlb_cma_size - reserved); size = round_up(size, PAGE_SIZE << order); + snprintf(name, 20, "hugetlb%d", nid); res = cma_declare_contiguous_nid(0, size, 0, PAGE_SIZE << order, - 0, false, "hugetlb", + 0, false, name, &hugetlb_cma[nid], nid); if (res) { pr_warn("hugetlb_cma: reservation failed: err %d, node %d", From 3a5139f1c5bb76d69756fb8f13fffa173e261153 Mon Sep 17 00:00:00 2001 From: Mike Kravetz Date: Tue, 11 Aug 2020 18:32:03 -0700 Subject: [PATCH 034/164] cma: don't quit at first error when activating reserved areas The routine cma_init_reserved_areas is designed to activate all reserved cma areas. It quits when it first encounters an error. This can leave some areas in a state where they are reserved but not activated. There is no feedback to code which performed the reservation. Attempting to allocate memory from areas in such a state will result in a BUG. Modify cma_init_reserved_areas to always attempt to activate all areas. The called routine, cma_activate_area is responsible for leaving the area in a valid state. No one is making active use of returned error codes, so change the routine to void. How to reproduce: This example uses kernelcore, hugetlb and cma as an easy way to reproduce. However, this is a more general cma issue. Two node x86 VM 16GB total, 8GB per node Kernel command line parameters, kernelcore=4G hugetlb_cma=8G Related boot time messages, hugetlb_cma: reserve 8192 MiB, up to 4096 MiB per node cma: Reserved 4096 MiB at 0x0000000100000000 hugetlb_cma: reserved 4096 MiB on node 0 cma: Reserved 4096 MiB at 0x0000000300000000 hugetlb_cma: reserved 4096 MiB on node 1 cma: CMA area hugetlb could not be activated # echo 8 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] SMP PTI ... Call Trace: bitmap_find_next_zero_area_off+0x51/0x90 cma_alloc+0x1a5/0x310 alloc_fresh_huge_page+0x78/0x1a0 alloc_pool_huge_page+0x6f/0xf0 set_max_huge_pages+0x10c/0x250 nr_hugepages_store_common+0x92/0x120 ? 
__kmalloc+0x171/0x270 kernfs_fop_write+0xc1/0x1a0 vfs_write+0xc7/0x1f0 ksys_write+0x5f/0xe0 do_syscall_64+0x4d/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: c64be2bb1c6e ("drivers: add Contiguous Memory Allocator") Signed-off-by: Mike Kravetz Signed-off-by: Andrew Morton Reviewed-by: Roman Gushchin Acked-by: Barry Song Cc: Marek Szyprowski Cc: Michal Nazarewicz Cc: Kyungmin Park Cc: Joonsoo Kim Cc: Link: http://lkml.kernel.org/r/20200730163123.6451-1-mike.kravetz@oracle.com Signed-off-by: Linus Torvalds --- mm/cma.c | 23 +++++++++-------------- 1 file changed, 9 insertions(+), 14 deletions(-) diff --git a/mm/cma.c b/mm/cma.c index aa2df0bd49c3..7f415d7cda9f 100644 --- a/mm/cma.c +++ b/mm/cma.c @@ -93,17 +93,15 @@ static void cma_clear_bitmap(struct cma *cma, unsigned long pfn, mutex_unlock(&cma->lock); } -static int __init cma_activate_area(struct cma *cma) +static void __init cma_activate_area(struct cma *cma) { unsigned long base_pfn = cma->base_pfn, pfn = base_pfn; unsigned i = cma->count >> pageblock_order; struct zone *zone; cma->bitmap = bitmap_zalloc(cma_bitmap_maxno(cma), GFP_KERNEL); - if (!cma->bitmap) { - cma->count = 0; - return -ENOMEM; - } + if (!cma->bitmap) + goto out_error; WARN_ON_ONCE(!pfn_valid(pfn)); zone = page_zone(pfn_to_page(pfn)); @@ -133,25 +131,22 @@ static int __init cma_activate_area(struct cma *cma) spin_lock_init(&cma->mem_head_lock); #endif - return 0; + return; not_in_zone: - pr_err("CMA area %s could not be activated\n", cma->name); bitmap_free(cma->bitmap); +out_error: cma->count = 0; - return -EINVAL; + pr_err("CMA area %s could not be activated\n", cma->name); + return; } static int __init cma_init_reserved_areas(void) { int i; - for (i = 0; i < cma_area_count; i++) { - int ret = cma_activate_area(&cma_areas[i]); - - if (ret) - return ret; - } + for (i = 0; i < cma_area_count; i++) + cma_activate_area(&cma_areas[i]); return 0; } From af161bee93332a1ff10ba029f41936d21850ae82 Mon Sep 17 00:00:00 2001 From: Waiman Long Date: Tue, 11 Aug 2020 18:32:06 -0700 Subject: [PATCH 035/164] include/linux/sched/mm.h: optimize current_gfp_context() The current_gfp_context() converts a number of PF_MEMALLOC_* per-process flags into the corresponding GFP_* flags for memory allocation. In that function, current->flags is accessed 3 times. That may lead to duplicated access of the same memory location. This is not usually a problem with minimal debug config options on, as the compiler can optimize away the duplicated memory accesses. With most of the debug config options on, however, that may not be the case. For example, the x86-64 object size of __need_fs_reclaim(), which calls current_gfp_context(), was 309 bytes in a debug kernel. With this patch applied, the object size is reduced to 202 bytes. This is a saving of 107 bytes and will probably be slightly faster too. Use READ_ONCE() to access current->flags to prevent the compiler from possibly accessing current->flags multiple times.
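For readers unfamiliar with the primitive: READ_ONCE() boils down to a volatile access, which obliges the compiler to load the value exactly once; the function then tests its private snapshot. A simplified model (the real definition, in include/asm-generic/rwonce.h in this era, handles more cases):

/* simplified model of READ_ONCE() for scalar types */
#define MY_READ_ONCE(x)	(*(const volatile typeof(x) *)&(x))

/* the pattern the fix applies: snapshot once, test the snapshot */
unsigned int pflags = MY_READ_ONCE(current->flags);

if (pflags & (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS)) {
	/* every later test uses pflags, never current->flags again */
}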
Signed-off-by: Waiman Long Signed-off-by: Andrew Morton Cc: Peter Zijlstra Cc: Ingo Molnar Cc: Mathieu Desnoyers Cc: Michel Lespinasse Link: http://lkml.kernel.org/r/20200618212936.9776-1-longman@redhat.com Signed-off-by: Linus Torvalds --- include/linux/sched/mm.h | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h index 85023ddc2dc2..f889e332912f 100644 --- a/include/linux/sched/mm.h +++ b/include/linux/sched/mm.h @@ -178,14 +178,16 @@ static inline bool in_vfork(struct task_struct *tsk) */ static inline gfp_t current_gfp_context(gfp_t flags) { - if (unlikely(current->flags & (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS))) { + unsigned int pflags = READ_ONCE(current->flags); + + if (unlikely(pflags & (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS))) { /* * NOIO implies both NOIO and NOFS and it is a weaker context * so always make sure it makes precedence */ - if (current->flags & PF_MEMALLOC_NOIO) + if (pflags & PF_MEMALLOC_NOIO) flags &= ~(__GFP_IO | __GFP_FS); - else if (current->flags & PF_MEMALLOC_NOFS) + else if (pflags & PF_MEMALLOC_NOFS) flags &= ~__GFP_FS; } return flags; From d49653f35adff8c778e7c5fbd4dbdf929594eca8 Mon Sep 17 00:00:00 2001 From: Krzysztof Kozlowski Date: Tue, 11 Aug 2020 18:32:09 -0700 Subject: [PATCH 036/164] mm: mmu_notifier: fix and extend kerneldoc Fix W=1 compile warnings (invalid kerneldoc): mm/mmu_notifier.c:187: warning: Function parameter or member 'interval_sub' not described in 'mmu_interval_read_begin' mm/mmu_notifier.c:708: warning: Function parameter or member 'subscription' not described in 'mmu_notifier_register' mm/mmu_notifier.c:708: warning: Excess function parameter 'mn' description in 'mmu_notifier_register' mm/mmu_notifier.c:880: warning: Function parameter or member 'subscription' not described in 'mmu_notifier_put' mm/mmu_notifier.c:880: warning: Excess function parameter 'mn' description in 'mmu_notifier_put' mm/mmu_notifier.c:982: warning: Function parameter or member 'ops' not described in 'mmu_interval_notifier_insert' Signed-off-by: Krzysztof Kozlowski Signed-off-by: Andrew Morton Reviewed-by: Jason Gunthorpe Link: http://lkml.kernel.org/r/20200728171109.28687-4-krzk@kernel.org Signed-off-by: Linus Torvalds --- mm/mmu_notifier.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c index 352bb9f3ecc0..4fc918163dd3 100644 --- a/mm/mmu_notifier.c +++ b/mm/mmu_notifier.c @@ -166,7 +166,7 @@ static void mn_itree_inv_end(struct mmu_notifier_subscriptions *subscriptions) /** * mmu_interval_read_begin - Begin a read side critical section against a VA * range - * interval_sub: The interval subscription + * @interval_sub: The interval subscription * * mmu_interval_read_begin()/mmu_interval_read_retry() implement a * collision-retry scheme similar to seqcount for the VA range under @@ -686,7 +686,7 @@ EXPORT_SYMBOL_GPL(__mmu_notifier_register); /** * mmu_notifier_register - Register a notifier on a mm - * @mn: The notifier to attach + * @subscription: The notifier to attach * @mm: The mm to attach the notifier to * * Must not hold mmap_lock nor any other VM related lock when calling @@ -856,7 +856,7 @@ static void mmu_notifier_free_rcu(struct rcu_head *rcu) /** * mmu_notifier_put - Release the reference on the notifier - * @mn: The notifier to act on + * @subscription: The notifier to act on * * This function must be paired with each mmu_notifier_get(), it releases the * reference obtained by the get.
If this is the last reference then process @@ -965,7 +965,8 @@ static int __mmu_interval_notifier_insert( * @interval_sub: Interval subscription to register * @start: Starting virtual address to monitor * @length: Length of the range to monitor - * @mm : mm_struct to attach to + * @mm: mm_struct to attach to + * @ops: Interval notifier operations to be called on matching events * * This function subscribes the interval notifier for notifications from the * mm. Upon return the ops related to mmu_interval_notifier will be called From fe124c95df9e2acf202910b0510300e37afe074b Mon Sep 17 00:00:00 2001 From: Daniel Jordan Date: Tue, 11 Aug 2020 18:32:12 -0700 Subject: [PATCH 037/164] x86/mm: use max memory block size on bare metal Some of our servers spend significant time at kernel boot initializing memory block sysfs directories and then creating symlinks between them and the corresponding nodes. The slowness happens because the machines get stuck with the smallest supported memory block size on x86 (128M), which results in 16,288 directories to cover the 2T of installed RAM. The search for each memory block is noticeable even with commit 4fb6eabf1037 ("drivers/base/memory.c: cache memory blocks in xarray to accelerate lookup"). Commit 078eb6aa50dc ("x86/mm/memory_hotplug: determine block size based on the end of boot memory") chooses the block size based on alignment with memory end. That addresses hotplug failures in qemu guests, but for bare metal systems whose memory end isn't aligned to even the smallest size, it leaves them at 128M. Make kernels that aren't running on a hypervisor use the largest supported size (2G) to minimize overhead on big machines. Kernel boot goes 7% faster on the aforementioned servers, shaving off half a second. [daniel.m.jordan@oracle.com: v3] Link: http://lkml.kernel.org/r/20200714205450.945834-1-daniel.m.jordan@oracle.com Signed-off-by: Daniel Jordan Signed-off-by: Andrew Morton Acked-by: David Hildenbrand Cc: Andy Lutomirski Cc: Dave Hansen Cc: David Hildenbrand Cc: Michal Hocko Cc: Pavel Tatashin Cc: Peter Zijlstra Cc: Steven Sistare Cc: Ingo Molnar Cc: "H. Peter Anvin" Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/20200609225451.3542648-1-daniel.m.jordan@oracle.com Signed-off-by: Linus Torvalds --- arch/x86/mm/init_64.c | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index 3b246ae40c8f..a4ac13cc3fdc 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -1452,6 +1452,15 @@ static unsigned long probe_memory_block_size(void) goto done; } + /* + * Use max block size to minimize overhead on bare metal, where + * alignment for memory hotplug isn't a concern. + */ + if (!boot_cpu_has(X86_FEATURE_HYPERVISOR)) { + bz = MAX_BLOCK_SIZE; + goto done; + } + /* Find the largest allowed block size that aligns to memory end */ for (bz = MAX_BLOCK_SIZE; bz > MIN_MEMORY_BLOCK_SIZE; bz >>= 1) { if (IS_ALIGNED(boot_mem_end, bz)) From d622ecec5f57ee7e27fbc6db222d8cbaf3c59478 Mon Sep 17 00:00:00 2001 From: Jia He Date: Tue, 11 Aug 2020 18:32:16 -0700 Subject: [PATCH 038/164] mm/memory_hotplug: introduce default dummy memory_add_physaddr_to_nid() This is to introduce a general dummy helper. memory_add_physaddr_to_nid() is a fallback option to get the nid in case NUMA_NO_NODE is detected. After this patch, arm64/sh/s390 can simply use the general dummy version. PowerPC/x86/ia64 will still use their specific version. This is the preparation to set a fallback value for dev_dax->target_node.
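The mechanism relied on here is weak linkage: the generic definition is only a fallback, and an architecture that defines a strong symbol of the same name wins at link time. A sketch under that assumption (my_arch_lookup_nid() is a hypothetical helper, not a real kernel function; the two definitions live in different files):

/* generic fallback, as added by this patch (mm/memory_hotplug.c) */
int __weak memory_add_physaddr_to_nid(u64 start)
{
	return 0;		/* no topology information: assume node 0 */
}

/* an architecture with real knowledge overrides it elsewhere */
int memory_add_physaddr_to_nid(u64 start)
{
	return my_arch_lookup_nid(start);	/* hypothetical */
}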
Signed-off-by: Jia He Signed-off-by: Andrew Morton Reviewed-by: David Hildenbrand Cc: Dan Williams Cc: Michal Hocko Cc: Catalin Marinas Cc: Will Deacon Cc: Tony Luck Cc: Fenghua Yu Cc: Yoshinori Sato Cc: Rich Felker Cc: Dave Hansen Cc: Andy Lutomirski Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: "H. Peter Anvin" Cc: Vishal Verma Cc: Dave Jiang Cc: Baoquan He Cc: Chuhong Yuan Cc: Mike Rapoport Cc: Logan Gunthorpe Cc: Masahiro Yamada Cc: Jonathan Cameron Cc: Kaly Xin Link: http://lkml.kernel.org/r/20200710031619.18762-2-justin.he@arm.com Signed-off-by: Linus Torvalds --- arch/arm64/mm/numa.c | 10 ---------- arch/ia64/mm/numa.c | 2 -- arch/sh/mm/init.c | 9 --------- arch/x86/mm/numa.c | 1 - mm/memory_hotplug.c | 10 ++++++++++ 5 files changed, 10 insertions(+), 22 deletions(-) diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c index aafcee3e3f7e..73f8b49d485c 100644 --- a/arch/arm64/mm/numa.c +++ b/arch/arm64/mm/numa.c @@ -461,13 +461,3 @@ void __init arm64_numa_init(void) numa_init(dummy_numa_init); } - -/* - * We hope that we will be hotplugging memory on nodes we already know about, - * such that acpi_get_node() succeeds and we never fall back to this... - */ -int memory_add_physaddr_to_nid(u64 addr) -{ - pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n", addr); - return 0; -} diff --git a/arch/ia64/mm/numa.c b/arch/ia64/mm/numa.c index 5e1015eb6d0d..f34964271101 100644 --- a/arch/ia64/mm/numa.c +++ b/arch/ia64/mm/numa.c @@ -106,7 +106,5 @@ int memory_add_physaddr_to_nid(u64 addr) return 0; return nid; } - -EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid); #endif #endif diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c index 613de8096335..cd1379360f08 100644 --- a/arch/sh/mm/init.c +++ b/arch/sh/mm/init.c @@ -425,15 +425,6 @@ int arch_add_memory(int nid, u64 start, u64 size, return ret; } -#ifdef CONFIG_NUMA -int memory_add_physaddr_to_nid(u64 addr) -{ - /* Node 0 for now.. */ - return 0; -} -EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid); -#endif - void arch_remove_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap) { diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c index b05f45e5e8e2..aa76ec2d359b 100644 --- a/arch/x86/mm/numa.c +++ b/arch/x86/mm/numa.c @@ -929,5 +929,4 @@ int memory_add_physaddr_to_nid(u64 start) nid = numa_meminfo.blk[0].nid; return nid; } -EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid); #endif diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index ac6961abaa10..6289a909ea36 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -350,6 +350,16 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages, return err; } +#ifdef CONFIG_NUMA +int __weak memory_add_physaddr_to_nid(u64 start) +{ + pr_info_once("Unknown target node for memory at 0x%llx, assuming node 0\n", + start); + return 0; +} +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid); +#endif + /* find the smallest valid pfn in the range [start_pfn, end_pfn) */ static unsigned long find_smallest_section_pfn(int nid, struct zone *zone, unsigned long start_pfn, From b4223a510e2ab1bf0f971d50af7c1431014b25ad Mon Sep 17 00:00:00 2001 From: Jia He Date: Tue, 11 Aug 2020 18:32:20 -0700 Subject: [PATCH 039/164] mm/memory_hotplug: fix unpaired mem_hotplug_begin/done When check_memblock_offlined_cb() returns a failing rc (e.g. the memblock is still online at that time), mem_hotplug_begin()/mem_hotplug_done() end up unpaired. This triggers the following warning: Call Trace: percpu_up_write+0x33/0x40 try_remove_memory+0x66/0x120 ?
_cond_resched+0x19/0x30 remove_memory+0x2b/0x40 dev_dax_kmem_remove+0x36/0x72 [kmem] device_release_driver_internal+0xf0/0x1c0 device_release_driver+0x12/0x20 bus_remove_device+0xe1/0x150 device_del+0x17b/0x3e0 unregister_dev_dax+0x29/0x60 devm_action_release+0x15/0x20 release_nodes+0x19a/0x1e0 devres_release_all+0x3f/0x50 device_release_driver_internal+0x100/0x1c0 driver_detach+0x4c/0x8f bus_remove_driver+0x5c/0xd0 driver_unregister+0x31/0x50 dax_pmem_exit+0x10/0xfe0 [dax_pmem] Fixes: f1037ec0cc8a ("mm/memory_hotplug: fix remove_memory() lockdep splat") Signed-off-by: Jia He Signed-off-by: Andrew Morton Reviewed-by: David Hildenbrand Acked-by: Michal Hocko Acked-by: Dan Williams Cc: [5.6+] Cc: Andy Lutomirski Cc: Baoquan He Cc: Borislav Petkov Cc: Catalin Marinas Cc: Chuhong Yuan Cc: Dave Hansen Cc: Dave Jiang Cc: Fenghua Yu Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jonathan Cameron Cc: Kaly Xin Cc: Logan Gunthorpe Cc: Masahiro Yamada Cc: Mike Rapoport Cc: Peter Zijlstra Cc: Rich Felker Cc: Thomas Gleixner Cc: Tony Luck Cc: Vishal Verma Cc: Will Deacon Cc: Yoshinori Sato Link: http://lkml.kernel.org/r/20200710031619.18762-3-justin.he@arm.com Signed-off-by: Linus Torvalds --- mm/memory_hotplug.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 6289a909ea36..ce3d858319bd 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1757,7 +1757,7 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size) */ rc = walk_memory_blocks(start, size, NULL, check_memblock_offlined_cb); if (rc) - goto done; + return rc; /* remove memmap entry */ firmware_map_remove(start, start + size, "System RAM"); @@ -1781,9 +1781,8 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size) try_offline_node(nid); -done: mem_hotplug_done(); - return rc; + return 0; } /** From de1193f0be66f88e2c6d0fd965137668fc2ec4a5 Mon Sep 17 00:00:00 2001 From: Charan Teja Reddy Date: Tue, 11 Aug 2020 18:32:24 -0700 Subject: [PATCH 040/164] mm, memory_hotplug: update pcp lists every time a memory block is onlined When onlining the first memory block in a zone, the pcp lists are not updated, so the pcp struct keeps the default setting of ->high = 0, ->batch = 1. This means that until the second memory block in a zone (if there is one) is onlined, the pcp lists of this zone will not contain any pages, because pcp->count is never below ->high, so free_pcppages_bulk() is called to free batch-size (=1) pages every time the system wants to add a page to the pcp list through free_unref_page(). In short, the system gets no benefit from the pcp lists when there is a single onlineable memory block in a zone. Correct this by always updating the pcp lists when a memory block is onlined.
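A rough sketch of the free path this refers to, simplified from free_unref_page_commit() (details vary between kernel versions): with the boot-time defaults ->high = 0 and ->batch = 1, the threshold test fires on every free, so each page is immediately drained back to the buddy allocator and the pcp list stays empty.

/* simplified: freeing one page to the per-cpu list */
list_add(&page->lru, &pcp->lists[migratetype]);
pcp->count++;
if (pcp->count >= pcp->high)	/* 1 >= 0: true on every free */
	free_pcppages_bulk(zone, READ_ONCE(pcp->batch), pcp);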
Fixes: 1f522509c77a ("mem-hotplug: avoid multiple zones sharing same boot strapping boot_pageset") Signed-off-by: Charan Teja Reddy Signed-off-by: Andrew Morton Reviewed-by: David Hildenbrand Acked-by: Vlastimil Babka Acked-by: Michal Hocko Cc: Vinayak Menon Link: http://lkml.kernel.org/r/1596372896-15336-1-git-send-email-charante@codeaurora.org Signed-off-by: Linus Torvalds --- mm/memory_hotplug.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index ce3d858319bd..0a9e1972fbe7 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -854,8 +854,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, node_states_set_node(nid, &arg); if (need_zonelists_rebuild) build_all_zonelists(NULL); - else - zone_pcp_update(zone); + zone_pcp_update(zone); init_per_zone_wmark_min(); From 1067b261cc973d68c29de54ef0587c56786e6bd5 Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Tue, 11 Aug 2020 18:32:27 -0700 Subject: [PATCH 041/164] mm: drop duplicated words in <linux/pgtable.h> Drop the doubled words "used" and "by". Drop the repeated acronym "TLB" and make several other fixes around it. (capital letters, spellos) Signed-off-by: Randy Dunlap Signed-off-by: Andrew Morton Reviewed-by: SeongJae Park Link: http://lkml.kernel.org/r/2bb6e13e-44df-4920-52d9-4d3539945f73@infradead.org Signed-off-by: Linus Torvalds --- include/linux/pgtable.h | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 53e97da1e8e2..a124c21e3204 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -804,7 +804,7 @@ static inline void ptep_modify_prot_commit(struct vm_area_struct *vma, /* * No-op macros that just return the current protection value. Defined here - * because these macros can be used used even if CONFIG_MMU is not defined. + * because these macros can be used even if CONFIG_MMU is not defined. */ #ifndef pgprot_nx @@ -1234,7 +1234,7 @@ static inline int pmd_trans_unstable(pmd_t *pmd) * Technically a PTE can be PROTNONE even when not doing NUMA balancing but * the only case the kernel cares is for NUMA balancing and is only ever set * when the VMA is accessible. For PROT_NONE VMAs, the PTEs are not marked - * _PAGE_PROTNONE so by by default, implement the helper as "always no". It + * _PAGE_PROTNONE so by default, implement the helper as "always no". It * is the responsibility of the caller to distinguish between PROT_NONE * protections and NUMA hinting fault protections. */ @@ -1318,10 +1318,10 @@ static inline int pmd_free_pte_page(pmd_t *pmd, unsigned long addr) /* * ARCHes with special requirements for evicting THP backing TLB entries can * implement this. Otherwise also, it can help optimize normal TLB flush in - * THP regime. stock flush_tlb_range() typically has optimization to nuke the - * entire TLB TLB if flush span is greater than a threshold, which will - * likely be true for a single huge page. Thus a single thp flush will - * invalidate the entire TLB which is not desitable. + * THP regime. Stock flush_tlb_range() typically has optimization to nuke the + * entire TLB if flush span is greater than a threshold, which will + * likely be true for a single huge page. Thus a single THP flush will + * invalidate the entire TLB which is not desirable. * e.g.
see arch/arc: flush_pmd_tlb_range */ #define flush_pmd_tlb_range(vma, addr, end) flush_tlb_range(vma, addr, end) From 11192337206d2211bce3ba4ec1575439b6045654 Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Tue, 11 Aug 2020 18:32:30 -0700 Subject: [PATCH 042/164] mm: drop duplicated words in <linux/mm.h> Drop the doubled words "to" and "the". Signed-off-by: Randy Dunlap Signed-off-by: Andrew Morton Reviewed-by: SeongJae Park Link: http://lkml.kernel.org/r/d9fae8d6-0d60-4d52-9385-3199ee98de49@infradead.org Signed-off-by: Linus Torvalds --- include/linux/mm.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index f6a82f9bccd7..f97b10117d44 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -479,7 +479,7 @@ static inline bool fault_flag_allow_retry_first(unsigned int flags) { FAULT_FLAG_INTERRUPTIBLE, "INTERRUPTIBLE" } /* - * vm_fault is filled by the the pagefault handler and passed to the vma's + * vm_fault is filled by the pagefault handler and passed to the vma's * ->fault function. The vma's ->fault is responsible for returning a bitmask * of VM_FAULT_xxx flags that give details about how the fault was handled. * @@ -2599,7 +2599,7 @@ extern unsigned long stack_guard_gap; /* Generic expand stack which grows the stack according to GROWS{UP,DOWN} */ extern int expand_stack(struct vm_area_struct *vma, unsigned long address); -/* CONFIG_STACK_GROWSUP still needs to to grow downwards at some places */ +/* CONFIG_STACK_GROWSUP still needs to grow downwards at some places */ extern int expand_downwards(struct vm_area_struct *vma, unsigned long address); #if VM_GROWSUP From 3ecabd31341fb4307f156a2bf4467c3f8fa9101b Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Tue, 11 Aug 2020 18:32:33 -0700 Subject: [PATCH 043/164] include/linux/highmem.h: fix duplicated words in a comment Change the doubled word "is" in a comment to "it is". Signed-off-by: Randy Dunlap Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Link: http://lkml.kernel.org/r/ad605959-0083-4794-8d31-6b073300dd6f@infradead.org Signed-off-by: Linus Torvalds --- include/linux/highmem.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/highmem.h b/include/linux/highmem.h index d6e82e3de027..14e6202ce47f 100644 --- a/include/linux/highmem.h +++ b/include/linux/highmem.h @@ -73,7 +73,7 @@ static inline void kunmap(struct page *page) * no global lock is needed and because the kmap code must perform a global TLB * invalidation when the kmap pool wraps. * - * However when holding an atomic kmap is is not legal to sleep, so atomic + * However when holding an atomic kmap it is not legal to sleep, so atomic * kmaps are appropriate for short, tight code paths only. * * The use of kmap_atomic/kunmap_atomic is discouraged - kmap/kunmap From c82f16b52cb3b50ac1b47e6c590e3dddefc5a97c Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Tue, 11 Aug 2020 18:32:36 -0700 Subject: [PATCH 044/164] include/linux/frontswap.h: drop duplicated word in a comment Drop the doubled word "in" in a comment.
Signed-off-by: Randy Dunlap Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Cc: Konrad Rzeszutek Wilk Link: http://lkml.kernel.org/r/3af7ed91-ad62-8445-40a4-9e07a64b9523@infradead.org Signed-off-by: Linus Torvalds --- include/linux/frontswap.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/frontswap.h b/include/linux/frontswap.h index 6d775984905b..b07d88c92bb2 100644 --- a/include/linux/frontswap.h +++ b/include/linux/frontswap.h @@ -10,7 +10,7 @@ /* * Return code to denote that requested number of * frontswap pages are unused(moved to page cache). - * Used in in shmem_unuse and try_to_unuse. + * Used in shmem_unuse and try_to_unuse. */ #define FRONTSWAP_PAGES_UNUSED 2 From 0845f83122d93ae3bafa7ea10209de2148a3b9bc Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Tue, 11 Aug 2020 18:32:40 -0700 Subject: [PATCH 045/164] include/linux/memcontrol.h: drop duplicate word and fix spello Drop the doubled word "for" in a comment. Fix spello of "incremented". Signed-off-by: Randy Dunlap Signed-off-by: Andrew Morton Acked-by: Chris Down Cc: Johannes Weiner Cc: Michal Hocko Cc: Vladimir Davydov Link: http://lkml.kernel.org/r/b04aa2e4-7c95-12f0-599d-43d07fb28134@infradead.org Signed-off-by: Linus Torvalds --- include/linux/memcontrol.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 2c2d301eac33..385237e4cb44 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -65,8 +65,8 @@ struct mem_cgroup_id { /* * Per memcg event counter is incremented at every pagein/pageout. With THP, - * it will be incremated by the number of pages. This counter is used for - * for trigger some periodic events. This is straightforward and better + * it will be incremented by the number of pages. This counter is used + * to trigger some periodic events. This is straightforward and better * than using jiffies etc. to handle periodic memcg event. */ enum mem_cgroup_events_target { From c56771b3095d4f1af82847081b6781fc03a13904 Mon Sep 17 00:00:00 2001 From: Arvind Sankar Date: Tue, 11 Aug 2020 18:32:43 -0700 Subject: [PATCH 046/164] sh/mm: drop unused MAX_PHYSADDR_BITS The macro is not used anywhere, so remove the definition. Signed-off-by: Arvind Sankar Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Reviewed-by: David Hildenbrand Acked-by: Dave Hansen Acked-by: Mike Rapoport Link: http://lkml.kernel.org/r/20200723231544.17274-3-nivedita@alum.mit.edu Signed-off-by: Linus Torvalds --- arch/sh/include/asm/sparsemem.h | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/arch/sh/include/asm/sparsemem.h b/arch/sh/include/asm/sparsemem.h index 4eb899751e45..084706bb8cca 100644 --- a/arch/sh/include/asm/sparsemem.h +++ b/arch/sh/include/asm/sparsemem.h @@ -5,11 +5,9 @@ #ifdef __KERNEL__ /* * SECTION_SIZE_BITS 2^N: how big each section will be - * MAX_PHYSADDR_BITS 2^N: how much physical address space we have - * MAX_PHYSMEM_BITS 2^N: how much memory we can have in that space + * MAX_PHYSMEM_BITS 2^N: how much physical address space we have */ #define SECTION_SIZE_BITS 26 -#define MAX_PHYSADDR_BITS 32 #define MAX_PHYSMEM_BITS 32 #endif From 6f8c00ff5aa88c847af9af35131d72fae480b566 Mon Sep 17 00:00:00 2001 From: Arvind Sankar Date: Tue, 11 Aug 2020 18:32:46 -0700 Subject: [PATCH 047/164] sparc: drop unused MAX_PHYSADDR_BITS The macro is not used anywhere, so remove the definition. 
Signed-off-by: Arvind Sankar Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Reviewed-by: David Hildenbrand Acked-by: Dave Hansen Acked-by: David S. Miller Acked-by: Mike Rapoport Link: http://lkml.kernel.org/r/20200723231544.17274-4-nivedita@alum.mit.edu Signed-off-by: Linus Torvalds --- arch/sparc/include/asm/sparsemem.h | 1 - 1 file changed, 1 deletion(-) diff --git a/arch/sparc/include/asm/sparsemem.h b/arch/sparc/include/asm/sparsemem.h index 1dd1b61432db..aa9a676bc341 100644 --- a/arch/sparc/include/asm/sparsemem.h +++ b/arch/sparc/include/asm/sparsemem.h @@ -7,7 +7,6 @@ #include #define SECTION_SIZE_BITS 30 -#define MAX_PHYSADDR_BITS MAX_PHYS_ADDRESS_BITS #define MAX_PHYSMEM_BITS MAX_PHYS_ADDRESS_BITS #endif /* !(__KERNEL__) */ From a1c1dbeb2e1a995c374d4954b36269070725f3a0 Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Tue, 11 Aug 2020 18:32:49 -0700 Subject: [PATCH 048/164] mm/compaction.c: delete duplicated word Drop the repeated word "a". Signed-off-by: Randy Dunlap Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Reviewed-by: Zi Yan Link: http://lkml.kernel.org/r/20200801173822.14973-2-rdunlap@infradead.org Signed-off-by: Linus Torvalds --- mm/compaction.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/compaction.c b/mm/compaction.c index 5e8f2e22b73e..b89581bf859c 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -1477,7 +1477,7 @@ static void isolate_freepages(struct compact_control *cc) * this pfn aligned down to the pageblock boundary, because we do * block_start_pfn -= pageblock_nr_pages in the for loop. * For ending point, take care when isolating in last pageblock of a - * a zone which ends in the middle of a pageblock. + * zone which ends in the middle of a pageblock. * The low boundary is the end of the pageblock the migration scanner * is using. */ From ce89fddfe0b73e99b2edd8cfd21afd461258a2df Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Tue, 11 Aug 2020 18:32:53 -0700 Subject: [PATCH 049/164] mm/filemap.c: delete duplicated word Drop the repeated word "the". Signed-off-by: Randy Dunlap Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Reviewed-by: Zi Yan Link: http://lkml.kernel.org/r/20200801173822.14973-3-rdunlap@infradead.org Signed-off-by: Linus Torvalds --- mm/filemap.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/filemap.c b/mm/filemap.c index f2bb5ff0293d..8e75bce0346d 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2885,7 +2885,7 @@ static struct page *do_read_cache_page(struct address_space *mapping, * Case a, the page will be up to date when the page is unlocked. * There is no need to serialise on the page lock here as the page * is pinned so the lock gives no additional protection. Even if the - * the page is truncated, the data is still valid if PageUptodate as + * page is truncated, the data is still valid if PageUptodate as * it's a race vs truncate race. * Case b, the page will not be up to date * Case c, the page may be truncated but in itself, the data may still From 0cb80a2fb5bcd10f8aa7747e35511b52053232ae Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Tue, 11 Aug 2020 18:32:56 -0700 Subject: [PATCH 050/164] mm/hmm.c: delete duplicated word Drop the repeated word "pages". 
Signed-off-by: Randy Dunlap Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Reviewed-by: Zi Yan Link: http://lkml.kernel.org/r/20200801173822.14973-4-rdunlap@infradead.org Signed-off-by: Linus Torvalds --- mm/hmm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/hmm.c b/mm/hmm.c index 0809baee49d0..bb279319bf40 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -249,7 +249,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr, swp_entry_t entry = pte_to_swp_entry(pte); /* - * Never fault in device private pages pages, but just report + * Never fault in device private pages, but just report * the PFN even if not present. */ if (hmm_is_device_private_entry(range, entry)) { From 9e7ee40097ec654d5b9c7803b8e6dc560d17bbe3 Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Tue, 11 Aug 2020 18:32:59 -0700 Subject: [PATCH 051/164] mm/hugetlb.c: delete duplicated words Drop the repeated word "the" in two places. Signed-off-by: Randy Dunlap Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Reviewed-by: Mike Kravetz Reviewed-by: Zi Yan Link: http://lkml.kernel.org/r/20200801173822.14973-5-rdunlap@infradead.org Signed-off-by: Linus Torvalds --- mm/hugetlb.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 6b6b0d5cc642..b66bf74e999e 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -133,7 +133,7 @@ void hugepage_put_subpool(struct hugepage_subpool *spool) /* * Subpool accounting for allocating and reserving pages. * Return -ENOMEM if there are not enough resources to satisfy the - * the request. Otherwise, return the number of pages by which the + * request. Otherwise, return the number of pages by which the * global pools must be adjusted (upward). The returned value may * only be different than the passed value (delta) in the case where * a subpool minimum size must be maintained. @@ -2167,7 +2167,7 @@ static void return_unused_surplus_pages(struct hstate *h, * evenly across all nodes with memory. Iterate across these nodes * until we can no longer free unreserved surplus pages. This occurs * when the nodes with surplus pages have no free pages. - * free_pool_huge_page() will balance the the freed pages across the + * free_pool_huge_page() will balance the freed pages across the * on-line nodes with memory and will handle the hstate accounting. * * Note that we decrement resv_huge_pages as we free the pages. If From ac5ddd0fcea9bfee1d48884e7fa94e706d213aac Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Tue, 11 Aug 2020 18:33:02 -0700 Subject: [PATCH 052/164] mm/memcontrol.c: delete duplicated words Drop the repeated word "down". Signed-off-by: Randy Dunlap Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Reviewed-by: Zi Yan Link: http://lkml.kernel.org/r/20200801173822.14973-6-rdunlap@infradead.org Signed-off-by: Linus Torvalds --- mm/memcontrol.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c6c579217855..d59fd9af6e63 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2422,7 +2422,7 @@ static void high_work_func(struct work_struct *work) * * - MEMCG_DELAY_PRECISION_SHIFT: Extra precision bits while translating the * overage ratio to a delay. - * - MEMCG_DELAY_SCALING_SHIFT: The number of bits to scale down down the + * - MEMCG_DELAY_SCALING_SHIFT: The number of bits to scale down the * proposed penalty in order to reduce to a reasonable number of jiffies, and * to produce a reasonable delay curve. 
* From a1a0aea592c1ef05283bc7af8d66ac3de87b2e07 Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Tue, 11 Aug 2020 18:33:05 -0700 Subject: [PATCH 053/164] mm/memory.c: delete duplicated words Drop the repeated word "to" in two places. Signed-off-by: Randy Dunlap Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Reviewed-by: Zi Yan Link: http://lkml.kernel.org/r/20200801173822.14973-7-rdunlap@infradead.org Signed-off-by: Linus Torvalds --- mm/memory.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index de311fc7639e..325bb575e7ec 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1800,7 +1800,7 @@ static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr, * @pfn: source kernel pfn * @pgprot: pgprot flags for the inserted page * - * This is exactly like vmf_insert_pfn(), except that it allows drivers to + * This is exactly like vmf_insert_pfn(), except that it allows drivers * to override pgprot on a per-page basis. * * This only makes sense for IO mappings, and it makes no sense for @@ -1936,7 +1936,7 @@ static vm_fault_t __vm_insert_mixed(struct vm_area_struct *vma, * @pfn: source kernel pfn * @pgprot: pgprot flags for the inserted page * - * This is exactly like vmf_insert_mixed(), except that it allows drivers to + * This is exactly like vmf_insert_mixed(), except that it allows drivers * to override pgprot on a per-page basis. * * Typically this function should be used by drivers to set caching- and From eaf444deed482f908395defa4d9b67ce55b72208 Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Tue, 11 Aug 2020 18:33:08 -0700 Subject: [PATCH 054/164] mm/migrate.c: delete duplicated word Drop the repeated word "and". Signed-off-by: Randy Dunlap Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Reviewed-by: Zi Yan Link: http://lkml.kernel.org/r/20200801173822.14973-8-rdunlap@infradead.org Signed-off-by: Linus Torvalds --- mm/migrate.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/migrate.c b/mm/migrate.c index 5ea275878383..52896b4921a7 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2667,7 +2667,7 @@ static void migrate_vma_unmap(struct migrate_vma *migrate) /** * migrate_vma_setup() - prepare to migrate a range of memory - * @args: contains the vma, start, and and pfns arrays for the migration + * @args: contains the vma, start, and pfns arrays for the migration * * Returns: negative errno on failures, 0 when 0 or more pages were migrated * without an error. From c08b342c214c4473b329f828cdc516d2f50a192f Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Tue, 11 Aug 2020 18:33:11 -0700 Subject: [PATCH 055/164] mm/nommu.c: delete duplicated words Drop the repeated word "that" in two places. 
Signed-off-by: Randy Dunlap Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Reviewed-by: Zi Yan Link: http://lkml.kernel.org/r/20200801173822.14973-9-rdunlap@infradead.org Signed-off-by: Linus Torvalds --- mm/nommu.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/nommu.c b/mm/nommu.c index 340ae7774c13..75a327149af1 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -1762,8 +1762,8 @@ EXPORT_SYMBOL_GPL(access_process_vm); * @newsize: The proposed filesize of the inode * * Check the shared mappings on an inode on behalf of a shrinking truncate to - * make sure that that any outstanding VMAs aren't broken and then shrink the - * vm_regions that extend that beyond so that do_mmap() doesn't + * make sure that any outstanding VMAs aren't broken and then shrink the + * vm_regions that extend beyond so that do_mmap() doesn't * automatically grant mappings that are too large. */ int nommu_shrink_inode_mappings(struct inode *inode, size_t size, From 047b9967d59940f9f64ed13d02e98e9ca7cb5ffb Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Tue, 11 Aug 2020 18:33:14 -0700 Subject: [PATCH 056/164] mm/page_alloc.c: delete or fix duplicated words Drop the repeated word "them" and "that". Change "the the" to "to the". Signed-off-by: Randy Dunlap Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Reviewed-by: Zi Yan Link: http://lkml.kernel.org/r/20200801173822.14973-10-rdunlap@infradead.org Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 167732f4d124..d80c9bac489c 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4282,7 +4282,7 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order, /* * If an allocation failed after direct reclaim, it could be because * pages are pinned on the per-cpu lists or in high alloc reserves. - * Shrink them them and try again + * Shrink them and try again */ if (!page && !drained) { unreserve_highatomic_pageblock(ac, false); @@ -6192,7 +6192,7 @@ static int zone_batchsize(struct zone *zone) * locking. * * Any new users of pcp->batch and pcp->high should ensure they can cope with - * those fields changing asynchronously (acording the the above rule). + * those fields changing asynchronously (acording to the above rule). * * mutex_is_locked(&pcp_batch_high_lock) required when calling this function * outside of boot time (or some other assurance that no concurrent updaters @@ -8203,7 +8203,7 @@ void *__init alloc_large_system_hash(const char *tablename, * race condition. So you can't expect this function should be exact. * * Returns a page without holding a reference. If the caller wants to - * dereference that page (e.g., dumping), it has to make sure that that it + * dereference that page (e.g., dumping), it has to make sure that it * cannot get removed (e.g., via memory unplug) concurrently. * */ From af44c12fe7c9704c4d185446a8ffdea18792017d Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Tue, 11 Aug 2020 18:33:17 -0700 Subject: [PATCH 057/164] mm/shmem.c: delete duplicated word Drop the repeated word "the". 
Signed-off-by: Randy Dunlap Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Reviewed-by: Zi Yan Link: http://lkml.kernel.org/r/20200801173822.14973-11-rdunlap@infradead.org Signed-off-by: Linus Torvalds --- mm/shmem.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/shmem.c b/mm/shmem.c index 49bde088a7ea..271548ca20f3 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1686,7 +1686,7 @@ static int shmem_replace_page(struct page **pagep, gfp_t gfp, * Swap in the page pointed to by *pagep. * Caller has to make sure that *pagep contains a valid swapped page. * Returns 0 and the page in pagep if success. On failure, returns the - * the error code and NULL in *pagep. + * error code and NULL in *pagep. */ static int shmem_swapin_page(struct inode *inode, pgoff_t index, struct page **pagep, enum sgp_type sgp, From 081a06fa29906643c62ea9477bc374c23417e6f9 Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Tue, 11 Aug 2020 18:33:19 -0700 Subject: [PATCH 058/164] mm/slab_common.c: delete duplicated word Drop the repeated word "and". Signed-off-by: Randy Dunlap Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Reviewed-by: Zi Yan Link: http://lkml.kernel.org/r/20200801173822.14973-12-rdunlap@infradead.org Signed-off-by: Linus Torvalds --- mm/slab_common.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/slab_common.c b/mm/slab_common.c index a513f3237155..f9ccd5dc13f3 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -419,7 +419,7 @@ static void slab_caches_to_rcu_destroy_workfn(struct work_struct *work) /* * On destruction, SLAB_TYPESAFE_BY_RCU kmem_caches are put on the * @slab_caches_to_rcu_destroy list. The slab pages are freed - * through RCU and and the associated kmem_cache are dereferenced + * through RCU and the associated kmem_cache are dereferenced * while freeing the pages, so the kmem_caches should be freed only * after the pending RCU operations are finished. As rcu_barrier() * is a pretty slow operation, we batch all pending destructions From 5ce1be0e40fe64cbf540ba54dd10824a8cef99e1 Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Tue, 11 Aug 2020 18:33:23 -0700 Subject: [PATCH 059/164] mm/usercopy.c: delete duplicated word Drop the repeated word "the". Signed-off-by: Randy Dunlap Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Reviewed-by: Zi Yan Link: http://lkml.kernel.org/r/20200801173822.14973-13-rdunlap@infradead.org Signed-off-by: Linus Torvalds --- mm/usercopy.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/usercopy.c b/mm/usercopy.c index 660717a1ea5c..b3de3c4eefba 100644 --- a/mm/usercopy.c +++ b/mm/usercopy.c @@ -43,7 +43,7 @@ static noinline int check_stack_object(const void *obj, unsigned long len) /* * Reject: object partially overlaps the stack (passing the - * the check above means at least one end is within the stack, + * check above means at least one end is within the stack, * so if this check fails, the other end is outside the stack). */ if (obj < stack || stackend < obj + len) From 1eba09c15decaf4e390aba0cd409e8a4e821c2a8 Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Tue, 11 Aug 2020 18:33:26 -0700 Subject: [PATCH 060/164] mm/vmscan.c: delete or fix duplicated words Drop the repeated word "marked". Change "time time" to "same time". 
Signed-off-by: Randy Dunlap Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Reviewed-by: Zi Yan Link: http://lkml.kernel.org/r/20200801173822.14973-14-rdunlap@infradead.org Signed-off-by: Linus Torvalds --- mm/vmscan.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 846b7e9b8d2e..738115ed75e2 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2798,7 +2798,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc) set_bit(PGDAT_DIRTY, &pgdat->flags); /* - * If kswapd scans pages marked marked for immediate + * If kswapd scans pages marked for immediate * reclaim and under writeback (nr_immediate), it * implies that pages are cycling through the LRU * faster than they are written so also forcibly stall. @@ -3373,7 +3373,7 @@ static bool pgdat_watermark_boosted(pg_data_t *pgdat, int highest_zoneidx) /* * Check for watermark boosts top-down as the higher zones * are more likely to be boosted. Both watermarks and boosts - * should not be checked at the time time as reclaim would + * should not be checked at the same time as reclaim would * start prematurely when there is no boosting and a lower * zone is balanced. */ From b6aa2c83428d49f1331d89a9e3d77850ee0696ba Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Tue, 11 Aug 2020 18:33:28 -0700 Subject: [PATCH 061/164] mm/zpool.c: delete duplicated word and fix grammar Drop the repeated word "if". Fix subject/verb agreement. Signed-off-by: Randy Dunlap Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Reviewed-by: Zi Yan Link: http://lkml.kernel.org/r/20200801173822.14973-15-rdunlap@infradead.org Signed-off-by: Linus Torvalds --- mm/zpool.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/mm/zpool.c b/mm/zpool.c index 863669212070..3744a2d1a624 100644 --- a/mm/zpool.c +++ b/mm/zpool.c @@ -239,15 +239,15 @@ const char *zpool_get_type(struct zpool *zpool) } /** - * zpool_malloc_support_movable() - Check if the zpool support - * allocate movable memory + * zpool_malloc_support_movable() - Check if the zpool supports + * allocating movable memory * @zpool: The zpool to check * - * This returns if the zpool support allocate movable memory. + * This returns if the zpool supports allocating movable memory. * * Implementations must guarantee this to be thread-safe. * - * Returns: true if if the zpool support allocate movable memory, false if not + * Returns: true if the zpool supports allocating movable memory, false if not */ bool zpool_malloc_support_movable(struct zpool *zpool) { From b956b5ac28cd709567f10c0ef67e65898ded2e45 Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Tue, 11 Aug 2020 18:33:31 -0700 Subject: [PATCH 062/164] mm/zsmalloc.c: fix duplicated words Change "as as" to "as a". Signed-off-by: Randy Dunlap Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Reviewed-by: Zi Yan Link: http://lkml.kernel.org/r/20200801173822.14973-16-rdunlap@infradead.org Signed-off-by: Linus Torvalds --- mm/zsmalloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c index 952a01e45c6a..c36fdff9a371 100644 --- a/mm/zsmalloc.c +++ b/mm/zsmalloc.c @@ -79,7 +79,7 @@ /* * Object location (<PFN>, <obj_idx>) is encoded as - * as single (unsigned long) handle value. + * a single (unsigned long) handle value. * * Note that object index starts from 0.
* From bfe00c5bbd9ee37b99c281429556c335271d027b Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Tue, 11 Aug 2020 18:33:34 -0700 Subject: [PATCH 063/164] syscalls: use uaccess_kernel in addr_limit_user_check Patch series "clean up address limit helpers", v2. In preparation for eventually phasing out direct use of set_fs(), this series removes the segment_eq() arch helper that is only used to implement or duplicate the uaccess_kernel() API, and then adds descriptive helpers to force the kernel address limit. This patch (of 6): Use the uaccess_kernel helper instead of duplicating it. [hch@lst.de: arm: don't call addr_limit_user_check for nommu] Link: http://lkml.kernel.org/r/20200721045834.GA9613@lst.de Signed-off-by: Christoph Hellwig Signed-off-by: Andrew Morton Tested-by: Guenter Roeck Acked-by: Linus Torvalds Cc: Nick Hu Cc: Greentime Hu Cc: Vincent Chen Cc: Paul Walmsley Cc: Palmer Dabbelt Cc: Geert Uytterhoeven Link: http://lkml.kernel.org/r/20200714105505.935079-1-hch@lst.de Link: http://lkml.kernel.org/r/20200710135706.537715-1-hch@lst.de Link: http://lkml.kernel.org/r/20200710135706.537715-2-hch@lst.de Signed-off-by: Linus Torvalds --- arch/arm/kernel/signal.c | 2 ++ include/linux/syscalls.h | 2 +- 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/arm/kernel/signal.c b/arch/arm/kernel/signal.c index ab2568996ddb..c9dc912b83f0 100644 --- a/arch/arm/kernel/signal.c +++ b/arch/arm/kernel/signal.c @@ -713,7 +713,9 @@ struct page *get_signal_page(void) /* Defer to generic check */ asmlinkage void addr_limit_check_failed(void) { +#ifdef CONFIG_MMU addr_limit_user_check(); +#endif } #ifdef CONFIG_DEBUG_RSEQ diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index a2429d336593..dc2b827c81e5 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -263,7 +263,7 @@ static inline void addr_limit_user_check(void) return; #endif - if (CHECK_DATA_CORRUPTION(!segment_eq(get_fs(), USER_DS), + if (CHECK_DATA_CORRUPTION(uaccess_kernel(), "Invalid address limit on user-mode return")) force_sig(SIGKILL); From 9af0f90aea46081566bf2014b6dfeb87b792005c Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Tue, 11 Aug 2020 18:33:38 -0700 Subject: [PATCH 064/164] nds32: use uaccess_kernel in show_regs Use the uaccess_kernel helper instead of duplicating it. Signed-off-by: Christoph Hellwig Signed-off-by: Andrew Morton Acked-by: Linus Torvalds Acked-by: Greentime Hu Cc: Nick Hu Cc: Vincent Chen Cc: Paul Walmsley Cc: Palmer Dabbelt Cc: Geert Uytterhoeven Link: http://lkml.kernel.org/r/20200710135706.537715-3-hch@lst.de Signed-off-by: Linus Torvalds --- arch/nds32/kernel/process.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/nds32/kernel/process.c b/arch/nds32/kernel/process.c index e85bbbadc0e7..e01ad5d17224 100644 --- a/arch/nds32/kernel/process.c +++ b/arch/nds32/kernel/process.c @@ -121,7 +121,7 @@ void show_regs(struct pt_regs *regs) regs->uregs[3], regs->uregs[2], regs->uregs[1], regs->uregs[0]); pr_info(" IRQs o%s Segment %s\n", interrupts_enabled(regs) ? "n" : "ff", - segment_eq(get_fs(), KERNEL_DS)? "kernel" : "user"); + uaccess_kernel() ? "kernel" : "user"); } EXPORT_SYMBOL(show_regs); From efbfc62e1d9ce835fd3ea5f25db1844e1496ced6 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Tue, 11 Aug 2020 18:33:41 -0700 Subject: [PATCH 065/164] riscv: include <asm/pgtable.h> in <asm/uaccess.h> To ensure TASK_SIZE is defined for USER_DS.
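(Not part of the patch: an illustrative reconstruction of the dependency the include resolves. On riscv, USER_DS is built from TASK_SIZE along roughly the lines below, so <asm/uaccess.h> cannot evaluate USER_DS unless TASK_SIZE is already declared; the exact definitions here are assumptions for illustration only.)

	typedef struct {
		unsigned long seg;
	} mm_segment_t;

	#define MAKE_MM_SEG(s)	((mm_segment_t){ (s) })
	#define KERNEL_DS	MAKE_MM_SEG(~0UL)
	#define USER_DS		MAKE_MM_SEG(TASK_SIZE)	/* breaks if TASK_SIZE is not in scope */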
Signed-off-by: Christoph Hellwig Signed-off-by: Andrew Morton Acked-by: Linus Torvalds Acked-by: Palmer Dabbelt Cc: Nick Hu Cc: Greentime Hu Cc: Vincent Chen Cc: Paul Walmsley Cc: Geert Uytterhoeven Link: http://lkml.kernel.org/r/20200710135706.537715-4-hch@lst.de Signed-off-by: Linus Torvalds --- arch/riscv/include/asm/uaccess.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/riscv/include/asm/uaccess.h b/arch/riscv/include/asm/uaccess.h index 8ce9d607b53d..22de922d6ecb 100644 --- a/arch/riscv/include/asm/uaccess.h +++ b/arch/riscv/include/asm/uaccess.h @@ -8,6 +8,8 @@ #ifndef _ASM_RISCV_UACCESS_H #define _ASM_RISCV_UACCESS_H +#include <asm/pgtable.h> /* for TASK_SIZE */ + /* * User space memory access functions */ From 428e2976a5bf7e7f5554286d7a5a33b8147b106a Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Tue, 11 Aug 2020 18:33:44 -0700 Subject: [PATCH 066/164] uaccess: remove segment_eq segment_eq is only used to implement uaccess_kernel. Just open code uaccess_kernel in the arch uaccess headers and remove one layer of indirection. Signed-off-by: Christoph Hellwig Signed-off-by: Andrew Morton Acked-by: Linus Torvalds Acked-by: Greentime Hu Acked-by: Geert Uytterhoeven Cc: Nick Hu Cc: Vincent Chen Cc: Paul Walmsley Cc: Palmer Dabbelt Link: http://lkml.kernel.org/r/20200710135706.537715-5-hch@lst.de Signed-off-by: Linus Torvalds --- arch/alpha/include/asm/uaccess.h | 2 +- arch/arc/include/asm/segment.h | 3 +-- arch/arm/include/asm/uaccess.h | 4 ++-- arch/arm64/include/asm/uaccess.h | 2 +- arch/csky/include/asm/segment.h | 2 +- arch/h8300/include/asm/segment.h | 2 +- arch/ia64/include/asm/uaccess.h | 2 +- arch/m68k/include/asm/segment.h | 2 +- arch/microblaze/include/asm/uaccess.h | 2 +- arch/mips/include/asm/uaccess.h | 2 +- arch/nds32/include/asm/uaccess.h | 2 +- arch/nios2/include/asm/uaccess.h | 2 +- arch/openrisc/include/asm/uaccess.h | 2 +- arch/parisc/include/asm/uaccess.h | 2 +- arch/powerpc/include/asm/uaccess.h | 3 +-- arch/riscv/include/asm/uaccess.h | 4 +--- arch/s390/include/asm/uaccess.h | 2 +- arch/sh/include/asm/segment.h | 3 +-- arch/sparc/include/asm/uaccess_32.h | 2 +- arch/sparc/include/asm/uaccess_64.h | 2 +- arch/x86/include/asm/uaccess.h | 2 +- arch/xtensa/include/asm/uaccess.h | 2 +- include/asm-generic/uaccess.h | 4 ++-- include/linux/uaccess.h | 2 -- 24 files changed, 25 insertions(+), 32 deletions(-) diff --git a/arch/alpha/include/asm/uaccess.h b/arch/alpha/include/asm/uaccess.h index 1fe2b56cb861..1b6f25efa247 100644 --- a/arch/alpha/include/asm/uaccess.h +++ b/arch/alpha/include/asm/uaccess.h @@ -20,7 +20,7 @@ #define get_fs() (current_thread_info()->addr_limit) #define set_fs(x) (current_thread_info()->addr_limit = (x)) -#define segment_eq(a, b) ((a).seg == (b).seg) +#define uaccess_kernel() (get_fs().seg == KERNEL_DS.seg) /* * Is a address valid?
This does a straightforward calculation rather diff --git a/arch/arc/include/asm/segment.h b/arch/arc/include/asm/segment.h index 6a2a5be5026d..871f8ab11bfd 100644 --- a/arch/arc/include/asm/segment.h +++ b/arch/arc/include/asm/segment.h @@ -14,8 +14,7 @@ typedef unsigned long mm_segment_t; #define KERNEL_DS MAKE_MM_SEG(0) #define USER_DS MAKE_MM_SEG(TASK_SIZE) - -#define segment_eq(a, b) ((a) == (b)) +#define uaccess_kernel() (get_fs() == KERNEL_DS) #endif /* __ASSEMBLY__ */ #endif /* __ASMARC_SEGMENT_H */ diff --git a/arch/arm/include/asm/uaccess.h b/arch/arm/include/asm/uaccess.h index b5fdd30252f8..a13d90206472 100644 --- a/arch/arm/include/asm/uaccess.h +++ b/arch/arm/include/asm/uaccess.h @@ -76,7 +76,7 @@ static inline void set_fs(mm_segment_t fs) modify_domain(DOMAIN_KERNEL, fs ? DOMAIN_CLIENT : DOMAIN_MANAGER); } -#define segment_eq(a, b) ((a) == (b)) +#define uaccess_kernel() (get_fs() == KERNEL_DS) /* * We use 33-bit arithmetic here. Success returns zero, failure returns @@ -267,7 +267,7 @@ extern int __put_user_8(void *, unsigned long long); */ #define USER_DS KERNEL_DS -#define segment_eq(a, b) (1) +#define uaccess_kernel() (true) #define __addr_ok(addr) ((void)(addr), 1) #define __range_ok(addr, size) ((void)(addr), 0) #define get_fs() (KERNEL_DS) diff --git a/arch/arm64/include/asm/uaccess.h b/arch/arm64/include/asm/uaccess.h index 8d7c466f809b..991dd5f031e4 100644 --- a/arch/arm64/include/asm/uaccess.h +++ b/arch/arm64/include/asm/uaccess.h @@ -50,7 +50,7 @@ static inline void set_fs(mm_segment_t fs) CONFIG_ARM64_UAO)); } -#define segment_eq(a, b) ((a) == (b)) +#define uaccess_kernel() (get_fs() == KERNEL_DS) /* * Test whether a block of memory is a valid user space address. diff --git a/arch/csky/include/asm/segment.h b/arch/csky/include/asm/segment.h index db2640d5f575..79ede9b1a646 100644 --- a/arch/csky/include/asm/segment.h +++ b/arch/csky/include/asm/segment.h @@ -13,6 +13,6 @@ typedef struct { #define USER_DS ((mm_segment_t) { 0x80000000UL }) #define get_fs() (current_thread_info()->addr_limit) #define set_fs(x) (current_thread_info()->addr_limit = (x)) -#define segment_eq(a, b) ((a).seg == (b).seg) +#define uaccess_kernel() (get_fs().seg == KERNEL_DS.seg) #endif /* __ASM_CSKY_SEGMENT_H */ diff --git a/arch/h8300/include/asm/segment.h b/arch/h8300/include/asm/segment.h index a407978f9f9f..37950725d9b9 100644 --- a/arch/h8300/include/asm/segment.h +++ b/arch/h8300/include/asm/segment.h @@ -33,7 +33,7 @@ static inline mm_segment_t get_fs(void) return USER_DS; } -#define segment_eq(a, b) ((a).seg == (b).seg) +#define uaccess_kernel() (get_fs().seg == KERNEL_DS.seg) #endif /* __ASSEMBLY__ */ diff --git a/arch/ia64/include/asm/uaccess.h b/arch/ia64/include/asm/uaccess.h index 8aa473a4b0f4..179243c3dfc7 100644 --- a/arch/ia64/include/asm/uaccess.h +++ b/arch/ia64/include/asm/uaccess.h @@ -50,7 +50,7 @@ #define get_fs() (current_thread_info()->addr_limit) #define set_fs(x) (current_thread_info()->addr_limit = (x)) -#define segment_eq(a, b) ((a).seg == (b).seg) +#define uaccess_kernel() (get_fs().seg == KERNEL_DS.seg) /* * When accessing user memory, we need to make sure the entire area really is in diff --git a/arch/m68k/include/asm/segment.h b/arch/m68k/include/asm/segment.h index c6686559e9b7..2b5e68a71ef7 100644 --- a/arch/m68k/include/asm/segment.h +++ b/arch/m68k/include/asm/segment.h @@ -52,7 +52,7 @@ static inline void set_fs(mm_segment_t val) #define set_fs(x) (current_thread_info()->addr_limit = (x)) #endif -#define segment_eq(a, b) ((a).seg == (b).seg) +#define 
uaccess_kernel() (get_fs().seg == KERNEL_DS.seg) #endif /* __ASSEMBLY__ */ diff --git a/arch/microblaze/include/asm/uaccess.h b/arch/microblaze/include/asm/uaccess.h index 6723c56ec378..304b04ffea2f 100644 --- a/arch/microblaze/include/asm/uaccess.h +++ b/arch/microblaze/include/asm/uaccess.h @@ -41,7 +41,7 @@ # define get_fs() (current_thread_info()->addr_limit) # define set_fs(val) (current_thread_info()->addr_limit = (val)) -# define segment_eq(a, b) ((a).seg == (b).seg) +# define uaccess_kernel() (get_fs().seg == KERNEL_DS.seg) #ifndef CONFIG_MMU diff --git a/arch/mips/include/asm/uaccess.h b/arch/mips/include/asm/uaccess.h index 62b298c50905..61fc01f177a6 100644 --- a/arch/mips/include/asm/uaccess.h +++ b/arch/mips/include/asm/uaccess.h @@ -72,7 +72,7 @@ extern u64 __ua_limit; #define get_fs() (current_thread_info()->addr_limit) #define set_fs(x) (current_thread_info()->addr_limit = (x)) -#define segment_eq(a, b) ((a).seg == (b).seg) +#define uaccess_kernel() (get_fs().seg == KERNEL_DS.seg) /* * eva_kernel_access() - determine whether kernel memory access on an EVA system diff --git a/arch/nds32/include/asm/uaccess.h b/arch/nds32/include/asm/uaccess.h index 3a9219f53ee0..010ba5f1d7dd 100644 --- a/arch/nds32/include/asm/uaccess.h +++ b/arch/nds32/include/asm/uaccess.h @@ -44,7 +44,7 @@ static inline void set_fs(mm_segment_t fs) current_thread_info()->addr_limit = fs; } -#define segment_eq(a, b) ((a) == (b)) +#define uaccess_kernel() (get_fs() == KERNEL_DS) #define __range_ok(addr, size) (size <= get_fs() && addr <= (get_fs() -size)) diff --git a/arch/nios2/include/asm/uaccess.h b/arch/nios2/include/asm/uaccess.h index e83f831a76f9..a741abbed6fb 100644 --- a/arch/nios2/include/asm/uaccess.h +++ b/arch/nios2/include/asm/uaccess.h @@ -30,7 +30,7 @@ #define get_fs() (current_thread_info()->addr_limit) #define set_fs(seg) (current_thread_info()->addr_limit = (seg)) -#define segment_eq(a, b) ((a).seg == (b).seg) +#define uaccess_kernel() (get_fs().seg == KERNEL_DS.seg) #define __access_ok(addr, len) \ (((signed long)(((long)get_fs().seg) & \ diff --git a/arch/openrisc/include/asm/uaccess.h b/arch/openrisc/include/asm/uaccess.h index 17c24f14615f..48b691530d3e 100644 --- a/arch/openrisc/include/asm/uaccess.h +++ b/arch/openrisc/include/asm/uaccess.h @@ -43,7 +43,7 @@ #define get_fs() (current_thread_info()->addr_limit) #define set_fs(x) (current_thread_info()->addr_limit = (x)) -#define segment_eq(a, b) ((a) == (b)) +#define uaccess_kernel() (get_fs() == KERNEL_DS) /* Ensure that the range from addr to addr+size is all within the process' * address space diff --git a/arch/parisc/include/asm/uaccess.h b/arch/parisc/include/asm/uaccess.h index ebbb9ffe038c..ed2cd4fb479b 100644 --- a/arch/parisc/include/asm/uaccess.h +++ b/arch/parisc/include/asm/uaccess.h @@ -14,7 +14,7 @@ #define KERNEL_DS ((mm_segment_t){0}) #define USER_DS ((mm_segment_t){1}) -#define segment_eq(a, b) ((a).seg == (b).seg) +#define uaccess_kernel() (get_fs().seg == KERNEL_DS.seg) #define get_fs() (current_thread_info()->addr_limit) #define set_fs(x) (current_thread_info()->addr_limit = (x)) diff --git a/arch/powerpc/include/asm/uaccess.h b/arch/powerpc/include/asm/uaccess.h index 64c04ab09112..00699903f1ef 100644 --- a/arch/powerpc/include/asm/uaccess.h +++ b/arch/powerpc/include/asm/uaccess.h @@ -38,8 +38,7 @@ static inline void set_fs(mm_segment_t fs) set_thread_flag(TIF_FSCHECK); } -#define segment_eq(a, b) ((a).seg == (b).seg) - +#define uaccess_kernel() (get_fs().seg == KERNEL_DS.seg) #define user_addr_max() 
(get_fs().seg) #ifdef __powerpc64__ diff --git a/arch/riscv/include/asm/uaccess.h b/arch/riscv/include/asm/uaccess.h index 22de922d6ecb..f56c66b3f5fe 100644 --- a/arch/riscv/include/asm/uaccess.h +++ b/arch/riscv/include/asm/uaccess.h @@ -64,11 +64,9 @@ static inline void set_fs(mm_segment_t fs) current_thread_info()->addr_limit = fs; } -#define segment_eq(a, b) ((a).seg == (b).seg) - +#define uaccess_kernel() (get_fs().seg == KERNEL_DS.seg) #define user_addr_max() (get_fs().seg) - /** * access_ok: - Checks if a user space pointer is valid * @addr: User space pointer to start of block to check diff --git a/arch/s390/include/asm/uaccess.h b/arch/s390/include/asm/uaccess.h index 324438889fe1..f09444d6aeab 100644 --- a/arch/s390/include/asm/uaccess.h +++ b/arch/s390/include/asm/uaccess.h @@ -32,7 +32,7 @@ #define USER_DS_SACF (3) #define get_fs() (current->thread.mm_segment) -#define segment_eq(a,b) (((a) & 2) == ((b) & 2)) +#define uaccess_kernel() ((get_fs() & 2) == KERNEL_DS) void set_fs(mm_segment_t fs); diff --git a/arch/sh/include/asm/segment.h b/arch/sh/include/asm/segment.h index 33d1d28057cb..02e54a3335d6 100644 --- a/arch/sh/include/asm/segment.h +++ b/arch/sh/include/asm/segment.h @@ -24,8 +24,7 @@ typedef struct { #define USER_DS KERNEL_DS #endif -#define segment_eq(a, b) ((a).seg == (b).seg) - +#define uaccess_kernel() (get_fs().seg == KERNEL_DS.seg) #define get_fs() (current_thread_info()->addr_limit) #define set_fs(x) (current_thread_info()->addr_limit = (x)) diff --git a/arch/sparc/include/asm/uaccess_32.h b/arch/sparc/include/asm/uaccess_32.h index d6d8413eca83..0a2d3ebc4bb8 100644 --- a/arch/sparc/include/asm/uaccess_32.h +++ b/arch/sparc/include/asm/uaccess_32.h @@ -28,7 +28,7 @@ #define get_fs() (current->thread.current_ds) #define set_fs(val) ((current->thread.current_ds) = (val)) -#define segment_eq(a, b) ((a).seg == (b).seg) +#define uaccess_kernel() (get_fs().seg == KERNEL_DS.seg) /* We have there a nice not-mapped page at PAGE_OFFSET - PAGE_SIZE, so that this test * can be fairly lightweight. 
diff --git a/arch/sparc/include/asm/uaccess_64.h b/arch/sparc/include/asm/uaccess_64.h index bf9d330073b2..698cf69f74e9 100644 --- a/arch/sparc/include/asm/uaccess_64.h +++ b/arch/sparc/include/asm/uaccess_64.h @@ -32,7 +32,7 @@ #define get_fs() ((mm_segment_t){(current_thread_info()->current_ds)}) -#define segment_eq(a, b) ((a).seg == (b).seg) +#define uaccess_kernel() (get_fs().seg == KERNEL_DS.seg) #define set_fs(val) \ do { \ diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h index 2f3e8f2a958f..ecefaffd15d4 100644 --- a/arch/x86/include/asm/uaccess.h +++ b/arch/x86/include/asm/uaccess.h @@ -33,7 +33,7 @@ static inline void set_fs(mm_segment_t fs) set_thread_flag(TIF_FSCHECK); } -#define segment_eq(a, b) ((a).seg == (b).seg) +#define uaccess_kernel() (get_fs().seg == KERNEL_DS.seg) #define user_addr_max() (current->thread.addr_limit.seg) /* diff --git a/arch/xtensa/include/asm/uaccess.h b/arch/xtensa/include/asm/uaccess.h index e57f0d0a88d8..b9758119feca 100644 --- a/arch/xtensa/include/asm/uaccess.h +++ b/arch/xtensa/include/asm/uaccess.h @@ -35,7 +35,7 @@ #define get_fs() (current->thread.current_ds) #define set_fs(val) (current->thread.current_ds = (val)) -#define segment_eq(a, b) ((a).seg == (b).seg) +#define uaccess_kernel() (get_fs().seg == KERNEL_DS.seg) #define __kernel_ok (uaccess_kernel()) #define __user_ok(addr, size) \ diff --git a/include/asm-generic/uaccess.h b/include/asm-generic/uaccess.h index e935318804f8..ba68ee4dabfa 100644 --- a/include/asm-generic/uaccess.h +++ b/include/asm-generic/uaccess.h @@ -86,8 +86,8 @@ static inline void set_fs(mm_segment_t fs) } #endif -#ifndef segment_eq -#define segment_eq(a, b) ((a).seg == (b).seg) +#ifndef uaccess_kernel +#define uaccess_kernel() (get_fs().seg == KERNEL_DS.seg) #endif #define access_ok(addr, size) __access_ok((unsigned long)(addr),(size)) diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h index 0a76ddc07d59..5c62d0c6f15b 100644 --- a/include/linux/uaccess.h +++ b/include/linux/uaccess.h @@ -6,8 +6,6 @@ #include #include -#define uaccess_kernel() segment_eq(get_fs(), KERNEL_DS) - #include /* From 3d13f313ce4c34c524ccc37986fe77172f601ff3 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Tue, 11 Aug 2020 18:33:47 -0700 Subject: [PATCH 067/164] uaccess: add force_uaccess_{begin,end} helpers Add helpers to wrap the get_fs/set_fs magic for undoing any damage done by set_fs(KERNEL_DS). There is no real functional benefit, but this documents the intent of these calls better, and will allow stubbing the functions out easily for kernel builds that do not allow address space overrides in the future.
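(Illustrative only, mirroring the call-site conversions done in this patch; copy_from_user() stands in for whatever user access the caller performs under a forced address limit:)

	mm_segment_t old_fs;

	/* before: open-coded address-limit override */
	old_fs = get_fs();
	set_fs(USER_DS);
	ret = copy_from_user(dst, src, len);
	set_fs(old_fs);

	/* after: the intent-documenting helpers */
	old_fs = force_uaccess_begin();
	ret = copy_from_user(dst, src, len);
	force_uaccess_end(old_fs);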
[hch@lst.de: drop two incorrect hunks, fix a commit log typo] Link: http://lkml.kernel.org/r/20200714105505.935079-6-hch@lst.de Signed-off-by: Christoph Hellwig Signed-off-by: Andrew Morton Acked-by: Linus Torvalds Acked-by: Mark Rutland Acked-by: Greentime Hu Acked-by: Geert Uytterhoeven Cc: Nick Hu Cc: Vincent Chen Cc: Paul Walmsley Cc: Palmer Dabbelt Link: http://lkml.kernel.org/r/20200710135706.537715-6-hch@lst.de Signed-off-by: Linus Torvalds --- arch/arm64/kernel/sdei.c | 2 +- arch/m68k/include/asm/tlbflush.h | 6 +++--- arch/mips/kernel/unaligned.c | 27 +++++++++++++-------------- arch/nds32/mm/alignment.c | 7 +++---- arch/sh/kernel/traps_32.c | 12 +++++------- drivers/firmware/arm_sdei.c | 5 ++--- include/linux/uaccess.h | 18 ++++++++++++++++++ kernel/events/callchain.c | 5 ++--- kernel/events/core.c | 5 ++--- kernel/kthread.c | 5 ++--- kernel/stacktrace.c | 5 ++--- mm/maccess.c | 22 ++++++++++------------ 12 files changed, 63 insertions(+), 56 deletions(-) diff --git a/arch/arm64/kernel/sdei.c b/arch/arm64/kernel/sdei.c index dab88260b137..7689f2031c0c 100644 --- a/arch/arm64/kernel/sdei.c +++ b/arch/arm64/kernel/sdei.c @@ -180,7 +180,7 @@ static __kprobes unsigned long _sdei_handler(struct pt_regs *regs, /* * We didn't take an exception to get here, set PAN. UAO will be cleared - * by sdei_event_handler()s set_fs(USER_DS) call. + * by sdei_event_handler()s force_uaccess_begin() call. */ __uaccess_enable_hw_pan(); diff --git a/arch/m68k/include/asm/tlbflush.h b/arch/m68k/include/asm/tlbflush.h index 191e75a6bb24..5337bc2c262f 100644 --- a/arch/m68k/include/asm/tlbflush.h +++ b/arch/m68k/include/asm/tlbflush.h @@ -85,10 +85,10 @@ static inline void flush_tlb_mm(struct mm_struct *mm) static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long addr) { if (vma->vm_mm == current->active_mm) { - mm_segment_t old_fs = get_fs(); - set_fs(USER_DS); + mm_segment_t old_fs = force_uaccess_begin(); + __flush_tlb_one(addr); - set_fs(old_fs); + force_uaccess_end(old_fs); } } diff --git a/arch/mips/kernel/unaligned.c b/arch/mips/kernel/unaligned.c index 0adce604fa44..126a5f3f4e4c 100644 --- a/arch/mips/kernel/unaligned.c +++ b/arch/mips/kernel/unaligned.c @@ -191,17 +191,16 @@ static void emulate_load_store_insn(struct pt_regs *regs, * memory, so we need to "switch" the address limit to * user space, so that address check can work properly. 
*/ - seg = get_fs(); - set_fs(USER_DS); + seg = force_uaccess_begin(); switch (insn.spec3_format.func) { case lhe_op: if (!access_ok(addr, 2)) { - set_fs(seg); + force_uaccess_end(seg); goto sigbus; } LoadHWE(addr, value, res); if (res) { - set_fs(seg); + force_uaccess_end(seg); goto fault; } compute_return_epc(regs); @@ -209,12 +208,12 @@ static void emulate_load_store_insn(struct pt_regs *regs, break; case lwe_op: if (!access_ok(addr, 4)) { - set_fs(seg); + force_uaccess_end(seg); goto sigbus; } LoadWE(addr, value, res); if (res) { - set_fs(seg); + force_uaccess_end(seg); goto fault; } compute_return_epc(regs); @@ -222,12 +221,12 @@ static void emulate_load_store_insn(struct pt_regs *regs, break; case lhue_op: if (!access_ok(addr, 2)) { - set_fs(seg); + force_uaccess_end(seg); goto sigbus; } LoadHWUE(addr, value, res); if (res) { - set_fs(seg); + force_uaccess_end(seg); goto fault; } compute_return_epc(regs); @@ -235,35 +234,35 @@ static void emulate_load_store_insn(struct pt_regs *regs, break; case she_op: if (!access_ok(addr, 2)) { - set_fs(seg); + force_uaccess_end(seg); goto sigbus; } compute_return_epc(regs); value = regs->regs[insn.spec3_format.rt]; StoreHWE(addr, value, res); if (res) { - set_fs(seg); + force_uaccess_end(seg); goto fault; } break; case swe_op: if (!access_ok(addr, 4)) { - set_fs(seg); + force_uaccess_end(seg); goto sigbus; } compute_return_epc(regs); value = regs->regs[insn.spec3_format.rt]; StoreWE(addr, value, res); if (res) { - set_fs(seg); + force_uaccess_end(seg); goto fault; } break; default: - set_fs(seg); + force_uaccess_end(seg); goto sigill; } - set_fs(seg); + force_uaccess_end(seg); } #endif break; diff --git a/arch/nds32/mm/alignment.c b/arch/nds32/mm/alignment.c index c8b9061a2ee3..1eb7ded6992b 100644 --- a/arch/nds32/mm/alignment.c +++ b/arch/nds32/mm/alignment.c @@ -512,7 +512,7 @@ int do_unaligned_access(unsigned long addr, struct pt_regs *regs) { unsigned long inst; int ret = -EFAULT; - mm_segment_t seg = get_fs(); + mm_segment_t seg; inst = get_inst(regs->ipc); @@ -520,13 +520,12 @@ int do_unaligned_access(unsigned long addr, struct pt_regs *regs) "Faulting addr: 0x%08lx, pc: 0x%08lx [inst: 0x%08lx ]\n", addr, regs->ipc, inst); - set_fs(USER_DS); - + seg = force_uaccess_begin(); if (inst & NDS32_16BIT_INSTRUCTION) ret = do_16((inst >> 16) & 0xffff, regs); else ret = do_32(inst, regs); - set_fs(seg); + force_uaccess_end(seg); return ret; } diff --git a/arch/sh/kernel/traps_32.c b/arch/sh/kernel/traps_32.c index 058c6181bb30..b62ad0ba2395 100644 --- a/arch/sh/kernel/traps_32.c +++ b/arch/sh/kernel/traps_32.c @@ -482,8 +482,6 @@ asmlinkage void do_address_error(struct pt_regs *regs, error_code = lookup_exception_vector(); #endif - oldfs = get_fs(); - if (user_mode(regs)) { int si_code = BUS_ADRERR; unsigned int user_action; @@ -491,13 +489,13 @@ asmlinkage void do_address_error(struct pt_regs *regs, local_irq_enable(); inc_unaligned_user_access(); - set_fs(USER_DS); + oldfs = force_uaccess_begin(); if (copy_from_user(&instruction, (insn_size_t *)(regs->pc & ~1), sizeof(instruction))) { - set_fs(oldfs); + force_uaccess_end(oldfs); goto uspace_segv; } - set_fs(oldfs); + force_uaccess_end(oldfs); /* shout about userspace fixups */ unaligned_fixups_notify(current, instruction, regs); @@ -520,11 +518,11 @@ asmlinkage void do_address_error(struct pt_regs *regs, goto uspace_segv; } - set_fs(USER_DS); + oldfs = force_uaccess_begin(); tmp = handle_unaligned_access(instruction, regs, &user_mem_access, 0, address); - set_fs(oldfs); + force_uaccess_end(oldfs); 
if (tmp == 0) return; /* sorted */ diff --git a/drivers/firmware/arm_sdei.c b/drivers/firmware/arm_sdei.c index e7e36aab2386..b4b9ce97f415 100644 --- a/drivers/firmware/arm_sdei.c +++ b/drivers/firmware/arm_sdei.c @@ -1136,15 +1136,14 @@ int sdei_event_handler(struct pt_regs *regs, * access kernel memory. * Do the same here because this doesn't come via the same entry code. */ - orig_addr_limit = get_fs(); - set_fs(USER_DS); + orig_addr_limit = force_uaccess_begin(); err = arg->callback(event_num, regs, arg->callback_arg); if (err) pr_err_ratelimited("event %u on CPU %u failed with error: %d\n", event_num, smp_processor_id(), err); - set_fs(orig_addr_limit); + force_uaccess_end(orig_addr_limit); return err; } diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h index 5c62d0c6f15b..94b285411659 100644 --- a/include/linux/uaccess.h +++ b/include/linux/uaccess.h @@ -8,6 +8,24 @@ #include +/* + * Force the uaccess routines to be wired up for actual userspace access, + * overriding any possible set_fs(KERNEL_DS) still lingering around. Undone + * using force_uaccess_end below. + */ +static inline mm_segment_t force_uaccess_begin(void) +{ + mm_segment_t fs = get_fs(); + + set_fs(USER_DS); + return fs; +} + +static inline void force_uaccess_end(mm_segment_t oldfs) +{ + set_fs(oldfs); +} + /* * Architectures should provide two primitives (raw_copy_{to,from}_user()) * and get rid of their private instances of copy_{to,from}_user() and diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c index c6ce894e4ce9..58cbe357fb2b 100644 --- a/kernel/events/callchain.c +++ b/kernel/events/callchain.c @@ -217,10 +217,9 @@ get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user, if (add_mark) perf_callchain_store_context(&ctx, PERF_CONTEXT_USER); - fs = get_fs(); - set_fs(USER_DS); + fs = force_uaccess_begin(); perf_callchain_user(&ctx, regs); - set_fs(fs); + force_uaccess_end(fs); } } diff --git a/kernel/events/core.c b/kernel/events/core.c index d1f0a7e5b182..6961333ebad5 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -6453,10 +6453,9 @@ perf_output_sample_ustack(struct perf_output_handle *handle, u64 dump_size, /* Data. 
*/ sp = perf_user_stack_pointer(regs); - fs = get_fs(); - set_fs(USER_DS); + fs = force_uaccess_begin(); rem = __output_copy_user(handle, (void *) sp, dump_size); - set_fs(fs); + force_uaccess_end(fs); dyn_size = dump_size - rem; perf_output_skip(handle, rem); diff --git a/kernel/kthread.c b/kernel/kthread.c index b2807e7be772..3edaa380dc7b 100644 --- a/kernel/kthread.c +++ b/kernel/kthread.c @@ -1258,8 +1258,7 @@ void kthread_use_mm(struct mm_struct *mm) if (active_mm != mm) mmdrop(active_mm); - to_kthread(tsk)->oldfs = get_fs(); - set_fs(USER_DS); + to_kthread(tsk)->oldfs = force_uaccess_begin(); } EXPORT_SYMBOL_GPL(kthread_use_mm); @@ -1274,7 +1273,7 @@ void kthread_unuse_mm(struct mm_struct *mm) WARN_ON_ONCE(!(tsk->flags & PF_KTHREAD)); WARN_ON_ONCE(!tsk->mm); - set_fs(to_kthread(tsk)->oldfs); + force_uaccess_end(to_kthread(tsk)->oldfs); task_lock(tsk); sync_mm_rss(mm); diff --git a/kernel/stacktrace.c b/kernel/stacktrace.c index 2af66e449aa6..946f44a9e86a 100644 --- a/kernel/stacktrace.c +++ b/kernel/stacktrace.c @@ -233,10 +233,9 @@ unsigned int stack_trace_save_user(unsigned long *store, unsigned int size) if (current->flags & PF_KTHREAD) return 0; - fs = get_fs(); - set_fs(USER_DS); + fs = force_uaccess_begin(); arch_stack_walk_user(consume_entry, &c, task_pt_regs(current)); - set_fs(fs); + force_uaccess_end(fs); return c.len; } diff --git a/mm/maccess.c b/mm/maccess.c index f98ff91e32c6..3bd70405f2d8 100644 --- a/mm/maccess.c +++ b/mm/maccess.c @@ -205,15 +205,14 @@ long strncpy_from_kernel_nofault(char *dst, const void *unsafe_addr, long count) long copy_from_user_nofault(void *dst, const void __user *src, size_t size) { long ret = -EFAULT; - mm_segment_t old_fs = get_fs(); + mm_segment_t old_fs = force_uaccess_begin(); - set_fs(USER_DS); if (access_ok(src, size)) { pagefault_disable(); ret = __copy_from_user_inatomic(dst, src, size); pagefault_enable(); } - set_fs(old_fs); + force_uaccess_end(old_fs); if (ret) return -EFAULT; @@ -233,15 +232,14 @@ EXPORT_SYMBOL_GPL(copy_from_user_nofault); long copy_to_user_nofault(void __user *dst, const void *src, size_t size) { long ret = -EFAULT; - mm_segment_t old_fs = get_fs(); + mm_segment_t old_fs = force_uaccess_begin(); - set_fs(USER_DS); if (access_ok(dst, size)) { pagefault_disable(); ret = __copy_to_user_inatomic(dst, src, size); pagefault_enable(); } - set_fs(old_fs); + force_uaccess_end(old_fs); if (ret) return -EFAULT; @@ -270,17 +268,17 @@ EXPORT_SYMBOL_GPL(copy_to_user_nofault); long strncpy_from_user_nofault(char *dst, const void __user *unsafe_addr, long count) { - mm_segment_t old_fs = get_fs(); + mm_segment_t old_fs; long ret; if (unlikely(count <= 0)) return 0; - set_fs(USER_DS); + old_fs = force_uaccess_begin(); pagefault_disable(); ret = strncpy_from_user(dst, unsafe_addr, count); pagefault_enable(); - set_fs(old_fs); + force_uaccess_end(old_fs); if (ret >= count) { ret = count; @@ -310,14 +308,14 @@ long strncpy_from_user_nofault(char *dst, const void __user *unsafe_addr, */ long strnlen_user_nofault(const void __user *unsafe_addr, long count) { - mm_segment_t old_fs = get_fs(); + mm_segment_t old_fs; int ret; - set_fs(USER_DS); + old_fs = force_uaccess_begin(); pagefault_disable(); ret = strnlen_user(unsafe_addr, count); pagefault_enable(); - set_fs(old_fs); + force_uaccess_end(old_fs); return ret; } From fe81417596fa8b6577fedb7e206ff3e4c7015c13 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Tue, 11 Aug 2020 18:33:50 -0700 Subject: [PATCH 068/164] exec: use force_uaccess_begin during exec and exit Both exec and 
exit want to ensure that the uaccess routines actually do access user pointers. Use the newly added force_uaccess_begin helper instead of an open coded set_fs for that to prepare for kernel builds where set_fs() does not exist. Signed-off-by: Christoph Hellwig Signed-off-by: Andrew Morton Acked-by: Linus Torvalds Cc: Nick Hu Cc: Greentime Hu Cc: Vincent Chen Cc: Paul Walmsley Cc: Palmer Dabbelt Cc: Geert Uytterhoeven Link: http://lkml.kernel.org/r/20200710135706.537715-7-hch@lst.de Signed-off-by: Linus Torvalds --- fs/exec.c | 7 ++++++- kernel/exit.c | 2 +- 2 files changed, 7 insertions(+), 2 deletions(-) diff --git a/fs/exec.c b/fs/exec.c index 3698252719a3..29ef78ae9f50 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1402,7 +1402,12 @@ int begin_new_exec(struct linux_binprm * bprm) if (retval) goto out_unlock; - set_fs(USER_DS); + /* + * Ensure that the uaccess routines can actually operate on userspace + * pointers: + */ + force_uaccess_begin(); + me->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD | PF_NOFREEZE | PF_NO_SETAFFINITY); flush_thread(); diff --git a/kernel/exit.c b/kernel/exit.c index e731c414e024..c2d2961576f2 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -732,7 +732,7 @@ void __noreturn do_exit(long code) * mm_release()->clear_child_tid() from writing to a user-controlled * kernel address. */ - set_fs(USER_DS); + force_uaccess_begin(); if (unlikely(in_atomic())) { pr_info("note: %s[%d] exited with preempt_count %d\n", From bd72866b8da499e60633ff28f8a4f6e09ca78efe Mon Sep 17 00:00:00 2001 From: Luc Van Oostenryck Date: Tue, 11 Aug 2020 18:33:54 -0700 Subject: [PATCH 069/164] alpha: fix annotation of io{read,write}{16,32}be() These accessors must be used to read/write a big-endian bus. The value returned or written is native-endian. However, these accessors are defined using be{16,32}_to_cpu() or cpu_to_be{16,32}() to make the endian conversion but these expect a __be{16,32} when none is present. Keeping them would need a force cast that would solve nothing at all. So, do the conversion using swab{16,32}, like done in asm-generic for similar situations. Reported-by: kernel test robot Signed-off-by: Luc Van Oostenryck Signed-off-by: Andrew Morton Cc: Richard Henderson Cc: Ivan Kokshaysky Cc: Matt Turner Cc: Stephen Boyd Cc: Arnd Bergmann Link: http://lkml.kernel.org/r/20200622114232.80039-1-luc.vanoostenryck@gmail.com Signed-off-by: Linus Torvalds --- arch/alpha/include/asm/io.h | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/alpha/include/asm/io.h b/arch/alpha/include/asm/io.h index a4d0c19f1e79..640e1a2f57b4 100644 --- a/arch/alpha/include/asm/io.h +++ b/arch/alpha/include/asm/io.h @@ -489,10 +489,10 @@ extern inline void writeq(u64 b, volatile void __iomem *addr) } #endif -#define ioread16be(p) be16_to_cpu(ioread16(p)) -#define ioread32be(p) be32_to_cpu(ioread32(p)) -#define iowrite16be(v,p) iowrite16(cpu_to_be16(v), (p)) -#define iowrite32be(v,p) iowrite32(cpu_to_be32(v), (p)) +#define ioread16be(p) swab16(ioread16(p)) +#define ioread32be(p) swab32(ioread32(p)) +#define iowrite16be(v,p) iowrite16(swab16(v), (p)) +#define iowrite32be(v,p) iowrite32(swab32(v), (p)) #define inb_p inb #define inw_p inw From c5f748e2f2ad970078de11bd6b12bcd81147c636 Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Tue, 11 Aug 2020 18:33:57 -0700 Subject: [PATCH 070/164] include/linux/compiler-clang.h: drop duplicated word in a comment Drop the doubled word "the" in a comment. 
Signed-off-by: Randy Dunlap Signed-off-by: Andrew Morton Reviewed-by: Nathan Chancellor Link: http://lkml.kernel.org/r/6a18c301-3505-742f-4dd7-0f38d0e537b9@infradead.org Signed-off-by: Linus Torvalds --- include/linux/compiler-clang.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/compiler-clang.h b/include/linux/compiler-clang.h index 8a072d00e688..cee0c728d39a 100644 --- a/include/linux/compiler-clang.h +++ b/include/linux/compiler-clang.h @@ -40,7 +40,7 @@ #endif /* - * Not all versions of clang implement the the type-generic versions + * Not all versions of clang implement the type-generic versions * of the builtin overflow checkers. Fortunately, clang implements * __has_builtin allowing us to avoid awkward version * checks. Unfortunately, we don't know which version of gcc clang From cd1a406fa46f81b1a859ef874b8413a6fa6ac7e8 Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Tue, 11 Aug 2020 18:34:00 -0700 Subject: [PATCH 071/164] include/linux/exportfs.h: drop duplicated word in a comment Drop the doubled word "a" in a comment. Signed-off-by: Randy Dunlap Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Cc: Alexander Viro Link: http://lkml.kernel.org/r/c61b707a-8fd8-5b1b-aab0-679122881543@infradead.org Signed-off-by: Linus Torvalds --- include/linux/exportfs.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h index d896b8657085..3ceb72b67a7a 100644 --- a/include/linux/exportfs.h +++ b/include/linux/exportfs.h @@ -178,7 +178,7 @@ struct fid { * get_name: * @get_name should find a name for the given @child in the given @parent * directory. The name should be stored in the @name (with the - * understanding that it is already pointing to a a %NAME_MAX+1 sized + * understanding that it is already pointing to a %NAME_MAX+1 sized * buffer. get_name() should return %0 on success, a negative error code * or error. @get_name will be called without @parent->i_mutex held. * From 121ae8da9cd49f7139bf80f26352337458e63fe1 Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Tue, 11 Aug 2020 18:34:04 -0700 Subject: [PATCH 072/164] include/linux/async_tx.h: drop duplicated word in a comment Drop the doubled word "the" in a comment. Signed-off-by: Randy Dunlap Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Cc: Dan Williams Link: http://lkml.kernel.org/r/e85802f7-8f48-8b4c-29b3-ea237a2c7ae9@infradead.org Signed-off-by: Linus Torvalds --- include/linux/async_tx.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/async_tx.h b/include/linux/async_tx.h index 75e582b8d2d9..4c328fef403c 100644 --- a/include/linux/async_tx.h +++ b/include/linux/async_tx.h @@ -36,7 +36,7 @@ struct dma_chan_ref { /** * async_tx_flags - modifiers for the async_* calls * @ASYNC_TX_XOR_ZERO_DST: this flag must be used for xor operations where the - * the destination address is not a source. The asynchronous case handles this + * destination address is not a source. The asynchronous case handles this * implicitly, the synchronous case needs to zero the destination block. * @ASYNC_TX_XOR_DROP_DST: this flag must be used if the destination address is * also one of the source addresses. In the synchronous case the destination From f48ff83e9c1a8fc2e8bfedb4855933eb1ed159b2 Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Tue, 11 Aug 2020 18:34:07 -0700 Subject: [PATCH 073/164] include/linux/xz.h: drop duplicated word Drop the doubled word "than" in a comment. 
Signed-off-by: Randy Dunlap Signed-off-by: Andrew Morton Cc: Lasse Collin Link: http://lkml.kernel.org/r/05ebba7a-c1e4-01ae-fc7b-15c081b33f3e@infradead.org Signed-off-by: Linus Torvalds --- include/linux/xz.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/xz.h b/include/linux/xz.h index 64cffa6ddfce..24ad7d875977 100644 --- a/include/linux/xz.h +++ b/include/linux/xz.h @@ -28,7 +28,7 @@ * enum xz_mode - Operation mode * * @XZ_SINGLE: Single-call mode. This uses less RAM than - * than multi-call modes, because the LZMA2 + * multi-call modes, because the LZMA2 * dictionary doesn't need to be allocated as * part of the decoder state. All required data * structures are allocated at initialization, From 8043fc147a97ec2eefc582487f344f2cbe86d12e Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Tue, 11 Aug 2020 18:34:10 -0700 Subject: [PATCH 074/164] kernel: add a kernel_wait helper Add a helper that waits for a pid and stores the status in the passed in kernel pointer. Use it to fix the usage of kernel_wait4 in call_usermodehelper_exec_sync that only happens to work due to the implicit set_fs(KERNEL_DS) for kernel threads. Signed-off-by: Christoph Hellwig Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Acked-by: "Eric W. Biederman" Cc: Luis Chamberlain Link: http://lkml.kernel.org/r/20200721130449.5008-1-hch@lst.de Signed-off-by: Linus Torvalds --- include/linux/sched/task.h | 1 + kernel/exit.c | 16 ++++++++++++++++ kernel/umh.c | 29 ++++------------------------- 3 files changed, 21 insertions(+), 25 deletions(-) diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h index ae3060f0b0c9..a98965007eef 100644 --- a/include/linux/sched/task.h +++ b/include/linux/sched/task.h @@ -88,6 +88,7 @@ struct task_struct *fork_idle(int); struct mm_struct *copy_init_mm(void); extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags); extern long kernel_wait4(pid_t, int __user *, int, struct rusage *); +int kernel_wait(pid_t pid, int *stat); extern void free_task(struct task_struct *tsk); diff --git a/kernel/exit.c b/kernel/exit.c index c2d2961576f2..733e80f334e7 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -1626,6 +1626,22 @@ long kernel_wait4(pid_t upid, int __user *stat_addr, int options, return ret; } +int kernel_wait(pid_t pid, int *stat) +{ + struct wait_opts wo = { + .wo_type = PIDTYPE_PID, + .wo_pid = find_get_pid(pid), + .wo_flags = WEXITED, + }; + int ret; + + ret = do_wait(&wo); + if (ret > 0 && wo.wo_stat) + *stat = wo.wo_stat; + put_pid(wo.wo_pid); + return ret; +} + SYSCALL_DEFINE4(wait4, pid_t, upid, int __user *, stat_addr, int, options, struct rusage __user *, ru) { diff --git a/kernel/umh.c b/kernel/umh.c index a25433f9cd9a..fcf3ee803630 100644 --- a/kernel/umh.c +++ b/kernel/umh.c @@ -119,37 +119,16 @@ static void call_usermodehelper_exec_sync(struct subprocess_info *sub_info) { pid_t pid; - /* If SIGCLD is ignored kernel_wait4 won't populate the status. */ + /* If SIGCLD is ignored do_wait won't populate the status. */ kernel_sigaction(SIGCHLD, SIG_DFL); pid = kernel_thread(call_usermodehelper_exec_async, sub_info, SIGCHLD); - if (pid < 0) { + if (pid < 0) sub_info->retval = pid; - } else { - int ret = -ECHILD; - /* - * Normally it is bogus to call wait4() from in-kernel because - * wait4() wants to write the exit code to a userspace address. 
- * But call_usermodehelper_exec_sync() always runs as kernel - * thread (workqueue) and put_user() to a kernel address works - * OK for kernel threads, due to their having an mm_segment_t - * which spans the entire address space. - * - * Thus the __user pointer cast is valid here. - */ - kernel_wait4(pid, (int __user *)&ret, 0, NULL); - - /* - * If ret is 0, either call_usermodehelper_exec_async failed and - * the real error code is already in sub_info->retval or - * sub_info->retval is 0 anyway, so don't mess with it then. - */ - if (ret) - sub_info->retval = ret; - } + else + kernel_wait(pid, &sub_info->retval); /* Restore default kernel sig handler */ kernel_sigaction(SIGCHLD, SIG_IGN); - umh_complete(sub_info); } From 09c60546f04f732d194a171b3a4ccc9ae1e704ba Mon Sep 17 00:00:00 2001 From: Feng Tang Date: Tue, 11 Aug 2020 18:34:13 -0700 Subject: [PATCH 075/164] ./Makefile: add debug option to enable function aligned on 32 bytes Recently 0day reported many strange performance changes (regressions or improvements) in which, at first look, there was no obvious relation between the culprit commit and the benchmark, leading people to suspect that the test itself was wrong. Upon further checking, many of these cases turned out to be caused by a change to the alignment of kernel text or data: as the whole text/data of the kernel is linked together, a change in one domain may affect the alignment of other domains. gcc has an option '-falign-functions=n' to force function alignment, and with that option enabled some of those performance changes go away, like [1][2][3]. Add this option so that developers and 0day can easily find performance bumps caused by text alignment changes, as tracking these strange bumps is quite time consuming. Though it can't help in other cases, like the data alignment changes in [4].
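As a minimal illustration (not part of the patch; the file and function names below are made up), the effect of the flag can be observed on any pair of adjacent functions:

/* align_demo.c: build with "gcc -O2 -falign-functions=32 -c align_demo.c" */
int f(int x)
{
	return x + 1;		/* f() starts on a 32-byte boundary */
}

int g(int x)
{
	return x * 2;		/* so does g(), as "nm align_demo.o" shows */
}

Without the flag, where g() starts depends only on the compiler's default alignment and on the size of f(), which is exactly what lets an unrelated commit shift the alignment of everything linked after it.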
Following is some size data for a v5.7 kernel built with a RHEL config used in 0day:

    text     data      bss      dec  filename
19738771 13292906  5554236 38585913  vmlinux.noalign
19758591 13297002  5529660 38585253  vmlinux.align32

Raw vmlinux size in bytes:

       v5.7  v5.7+align32
  253950832     254018000  +0.02%

Some benchmark data; most of the benchmarks show no big change:

* hackbench: [ -1.8%, +0.5%]
* fsmark: [ -3.2%, +3.4%] # ext4/xfs/btrfs
* kbuild: [ -2.0%, +0.9%]
* will-it-scale: [ -0.5%, +1.8%] # mmap1/pagefault3
* netperf:
  - TCP_CRR [+16.6%, +97.4%]
  - TCP_RR [-18.5%, -1.8%]
  - TCP_STREAM [ -1.1%, +1.9%]

[1] https://lore.kernel.org/lkml/20200114085637.GA29297@shao2-debian/ [2] https://lore.kernel.org/lkml/20200330011254.GA14393@feng-iot/ [3] https://lore.kernel.org/lkml/1d98d1f0-fe84-6df7-f5bd-f4cb2cdb7f45@intel.com/ [4] https://lore.kernel.org/lkml/20200205123216.GO12867@shao2-debian/ Signed-off-by: Feng Tang Signed-off-by: Andrew Morton Cc: Masahiro Yamada Cc: Michal Marek Cc: Andi Kleen Cc: Huang Ying Cc: Andy Shevchenko Link: http://lkml.kernel.org/r/1595475001-90945-1-git-send-email-feng.tang@intel.com Signed-off-by: Linus Torvalds --- Makefile | 4 ++++ lib/Kconfig.debug | 11 +++++++++++ 2 files changed, 15 insertions(+) diff --git a/Makefile b/Makefile index 168dd19cad7c..254e80a96b23 100644 --- a/Makefile +++ b/Makefile @@ -893,6 +893,10 @@ KBUILD_CFLAGS += $(CC_FLAGS_SCS) export CC_FLAGS_SCS endif +ifdef CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_32B +KBUILD_CFLAGS += -falign-functions=32 +endif + # arch Makefile may override CC so keep this after arch Makefile is included NOSTDINC_FLAGS += -nostdinc -isystem $(shell $(CC) -print-file-name=include) diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index a164785c3b48..ba12a58b2947 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -365,6 +365,17 @@ config SECTION_MISMATCH_WARN_ONLY If unsure, say Y. +config DEBUG_FORCE_FUNCTION_ALIGN_32B + bool "Force all function address 32B aligned" if EXPERT + help + There are cases that a commit from one domain changes the function + address alignment of other domains, and cause magic performance + bump (regression or improvement). Enable this option will help to + verify if the bump is caused by function alignment changes, while + it will slightly increase the kernel size and affect icache usage. + + It is mainly for debug and performance tuning use. + # # Select this config option from the architecture Kconfig, if it # is preferred to always offer frame pointers as a config From 376653435dacf84a8aca87e66aff94079a817cf2 Mon Sep 17 00:00:00 2001 From: Arvind Sankar Date: Tue, 11 Aug 2020 18:34:16 -0700 Subject: [PATCH 076/164] kernel.h: remove duplicate include of asm/div64.h This seems to have been added inadvertently in commit 72deb455b5ec ("block: remove CONFIG_LBDAF"). Fixes: 72deb455b5ec ("block: remove CONFIG_LBDAF") Signed-off-by: Arvind Sankar Signed-off-by: Andrew Morton Reviewed-by: Christoph Hellwig Link: http://lkml.kernel.org/r/20200727034852.2813453-1-nivedita@alum.mit.edu Signed-off-by: Linus Torvalds --- include/linux/kernel.h | 1 - 1 file changed, 1 deletion(-) diff --git a/include/linux/kernel.h b/include/linux/kernel.h index 7339a00c895e..e19c13616666 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -17,7 +17,6 @@ #include #include #include -#include <asm/div64.h> #define STACK_MAGIC 0xdeadbeef From 7f317d34906c1033f0752fc137dda04e43979bb8 Mon Sep 17 00:00:00 2001 From: "Alexander A.
Klimov" Date: Tue, 11 Aug 2020 18:34:19 -0700 Subject: [PATCH 077/164] include/: replace HTTP links with HTTPS ones Rationale: Reduces attack surface on kernel devs opening the links for MITM as HTTPS traffic is much harder to manipulate. Signed-off-by: Alexander A. Klimov Signed-off-by: Andrew Morton Reviewed-by: Kees Cook Link: http://lkml.kernel.org/r/20200726110117.16346-1-grandmaster@al2klimov.de Signed-off-by: Linus Torvalds --- include/clocksource/timer-ti-dm.h | 2 +- include/linux/btree.h | 2 +- include/linux/delay.h | 2 +- include/linux/dma/k3-psil.h | 2 +- include/linux/dma/k3-udma-glue.h | 2 +- include/linux/dma/ti-cppi5.h | 2 +- include/linux/irqchip/irq-omap-intc.h | 2 +- include/linux/jhash.h | 2 +- include/linux/leds-ti-lmu-common.h | 2 +- include/linux/platform_data/davinci-cpufreq.h | 2 +- include/linux/platform_data/davinci_asp.h | 2 +- include/linux/platform_data/elm.h | 2 +- include/linux/platform_data/gpio-davinci.h | 2 +- include/linux/platform_data/gpmc-omap.h | 2 +- include/linux/platform_data/mtd-davinci-aemif.h | 2 +- include/linux/platform_data/omap-twl4030.h | 2 +- include/linux/platform_data/uio_pruss.h | 2 +- include/linux/platform_data/usb-omap.h | 2 +- include/linux/soc/ti/k3-ringacc.h | 2 +- include/linux/soc/ti/knav_qmss.h | 2 +- include/linux/soc/ti/ti-msgmgr.h | 2 +- include/linux/wkup_m3_ipc.h | 2 +- include/linux/xxhash.h | 2 +- include/linux/xz.h | 2 +- include/linux/zlib.h | 2 +- include/soc/arc/aux.h | 2 +- include/uapi/linux/elf.h | 2 +- include/uapi/linux/map_to_7segment.h | 2 +- include/uapi/linux/types.h | 2 +- include/uapi/linux/usb/ch9.h | 2 +- 30 files changed, 30 insertions(+), 30 deletions(-) diff --git a/include/clocksource/timer-ti-dm.h b/include/clocksource/timer-ti-dm.h index 531ca87fcd08..4c61dade8835 100644 --- a/include/clocksource/timer-ti-dm.h +++ b/include/clocksource/timer-ti-dm.h @@ -1,7 +1,7 @@ /* * OMAP Dual-Mode Timers * - * Copyright (C) 2010 Texas Instruments Incorporated - http://www.ti.com/ + * Copyright (C) 2010 Texas Instruments Incorporated - https://www.ti.com/ * Tarun Kanti DebBarma * Thara Gopinath * diff --git a/include/linux/btree.h b/include/linux/btree.h index 68f858c831b1..243ee544397a 100644 --- a/include/linux/btree.h +++ b/include/linux/btree.h @@ -10,7 +10,7 @@ * * A B+Tree is a data structure for looking up arbitrary (currently allowing * unsigned long, u32, u64 and 2 * u64) keys into pointers. The data structure - * is described at http://en.wikipedia.org/wiki/B-tree, we currently do not + * is described at https://en.wikipedia.org/wiki/B-tree, we currently do not * use binary search to find the key on lookups. * * Each B+Tree consists of a head, that contains bookkeeping information and diff --git a/include/linux/delay.h b/include/linux/delay.h index 5e016a4029d9..1d0e2ce6b6d9 100644 --- a/include/linux/delay.h +++ b/include/linux/delay.h @@ -16,7 +16,7 @@ * 3. CPU clock rate changes. 
* * Please see this thread: - * http://lists.openwall.net/linux-kernel/2011/01/09/56 + * https://lists.openwall.net/linux-kernel/2011/01/09/56 */ #include diff --git a/include/linux/dma/k3-psil.h b/include/linux/dma/k3-psil.h index 61d5cc0ad601..1962f75fa2d3 100644 --- a/include/linux/dma/k3-psil.h +++ b/include/linux/dma/k3-psil.h @@ -1,6 +1,6 @@ /* SPDX-License-Identifier: GPL-2.0 */ /* - * Copyright (C) 2019 Texas Instruments Incorporated - http://www.ti.com + * Copyright (C) 2019 Texas Instruments Incorporated - https://www.ti.com */ #ifndef K3_PSIL_H_ diff --git a/include/linux/dma/k3-udma-glue.h b/include/linux/dma/k3-udma-glue.h index caadbab1632a..5eb34ad973a7 100644 --- a/include/linux/dma/k3-udma-glue.h +++ b/include/linux/dma/k3-udma-glue.h @@ -1,6 +1,6 @@ /* SPDX-License-Identifier: GPL-2.0 */ /* - * Copyright (C) 2019 Texas Instruments Incorporated - http://www.ti.com + * Copyright (C) 2019 Texas Instruments Incorporated - https://www.ti.com */ #ifndef K3_UDMA_GLUE_H_ diff --git a/include/linux/dma/ti-cppi5.h b/include/linux/dma/ti-cppi5.h index 579356ae447e..5896441ee604 100644 --- a/include/linux/dma/ti-cppi5.h +++ b/include/linux/dma/ti-cppi5.h @@ -2,7 +2,7 @@ /* * CPPI5 descriptors interface * - * Copyright (C) 2019 Texas Instruments Incorporated - http://www.ti.com + * Copyright (C) 2019 Texas Instruments Incorporated - https://www.ti.com */ #ifndef __TI_CPPI5_H__ diff --git a/include/linux/irqchip/irq-omap-intc.h b/include/linux/irqchip/irq-omap-intc.h index 216e5adf80ce..dca379c0d7eb 100644 --- a/include/linux/irqchip/irq-omap-intc.h +++ b/include/linux/irqchip/irq-omap-intc.h @@ -2,7 +2,7 @@ /** * irq-omap-intc.h - INTC Idle Functions * - * Copyright (C) 2014 Texas Instruments Incorporated - http://www.ti.com + * Copyright (C) 2014 Texas Instruments Incorporated - https://www.ti.com * * Author: Felipe Balbi */ diff --git a/include/linux/jhash.h b/include/linux/jhash.h index ba2f6a9776b6..19ddd43aee68 100644 --- a/include/linux/jhash.h +++ b/include/linux/jhash.h @@ -5,7 +5,7 @@ * * Copyright (C) 2006. Bob Jenkins (bob_jenkins@burtleburtle.net) * - * http://burtleburtle.net/bob/hash/ + * https://burtleburtle.net/bob/hash/ * * These are the credits from Bob's sources: * diff --git a/include/linux/leds-ti-lmu-common.h b/include/linux/leds-ti-lmu-common.h index 5eb111f38803..420b61e5a213 100644 --- a/include/linux/leds-ti-lmu-common.h +++ b/include/linux/leds-ti-lmu-common.h @@ -1,6 +1,6 @@ /* SPDX-License-Identifier: GPL-2.0 */ // TI LMU Common Core -// Copyright (C) 2018 Texas Instruments Incorporated - http://www.ti.com/ +// Copyright (C) 2018 Texas Instruments Incorporated - https://www.ti.com/ #ifndef _TI_LMU_COMMON_H_ #define _TI_LMU_COMMON_H_ diff --git a/include/linux/platform_data/davinci-cpufreq.h b/include/linux/platform_data/davinci-cpufreq.h index 3fbf9f2793b5..bc208c64e3d7 100644 --- a/include/linux/platform_data/davinci-cpufreq.h +++ b/include/linux/platform_data/davinci-cpufreq.h @@ -2,7 +2,7 @@ /* * TI DaVinci CPUFreq platform support. * - * Copyright (C) 2009 Texas Instruments, Inc. http://www.ti.com/ + * Copyright (C) 2009 Texas Instruments, Inc. 
https://www.ti.com/ */ #ifndef _MACH_DAVINCI_CPUFREQ_H diff --git a/include/linux/platform_data/davinci_asp.h b/include/linux/platform_data/davinci_asp.h index 7fe80f1c7e08..5d1fb0d78a22 100644 --- a/include/linux/platform_data/davinci_asp.h +++ b/include/linux/platform_data/davinci_asp.h @@ -1,7 +1,7 @@ /* * TI DaVinci Audio Serial Port support * - * Copyright (C) 2012 Texas Instruments Incorporated - http://www.ti.com/ + * Copyright (C) 2012 Texas Instruments Incorporated - https://www.ti.com/ * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License as diff --git a/include/linux/platform_data/elm.h b/include/linux/platform_data/elm.h index 0f491d8abfdd..3cc78f0447b1 100644 --- a/include/linux/platform_data/elm.h +++ b/include/linux/platform_data/elm.h @@ -2,7 +2,7 @@ /* * BCH Error Location Module * - * Copyright (C) 2012 Texas Instruments Incorporated - http://www.ti.com/ + * Copyright (C) 2012 Texas Instruments Incorporated - https://www.ti.com/ */ #ifndef __ELM_H diff --git a/include/linux/platform_data/gpio-davinci.h b/include/linux/platform_data/gpio-davinci.h index a93841bfb9f7..e182a46e609f 100644 --- a/include/linux/platform_data/gpio-davinci.h +++ b/include/linux/platform_data/gpio-davinci.h @@ -1,7 +1,7 @@ /* * DaVinci GPIO Platform Related Defines * - * Copyright (C) 2013 Texas Instruments Incorporated - http://www.ti.com/ + * Copyright (C) 2013 Texas Instruments Incorporated - https://www.ti.com/ * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License as diff --git a/include/linux/platform_data/gpmc-omap.h b/include/linux/platform_data/gpmc-omap.h index ef663e570552..c9cc4e32435d 100644 --- a/include/linux/platform_data/gpmc-omap.h +++ b/include/linux/platform_data/gpmc-omap.h @@ -2,7 +2,7 @@ /* * OMAP GPMC Platform data * - * Copyright (C) 2014 Texas Instruments, Inc. - http://www.ti.com + * Copyright (C) 2014 Texas Instruments, Inc. - https://www.ti.com * Roger Quadros */ diff --git a/include/linux/platform_data/mtd-davinci-aemif.h b/include/linux/platform_data/mtd-davinci-aemif.h index a403dd51dacc..a49826214a39 100644 --- a/include/linux/platform_data/mtd-davinci-aemif.h +++ b/include/linux/platform_data/mtd-davinci-aemif.h @@ -1,7 +1,7 @@ /* * TI DaVinci AEMIF support * - * Copyright 2010 (C) Texas Instruments, Inc. http://www.ti.com/ + * Copyright 2010 (C) Texas Instruments, Inc. https://www.ti.com/ * * This file is licensed under the terms of the GNU General Public License * version 2. This program is licensed "as is" without any warranty of any diff --git a/include/linux/platform_data/omap-twl4030.h b/include/linux/platform_data/omap-twl4030.h index 8419c8caf54e..0dd851ea1c72 100644 --- a/include/linux/platform_data/omap-twl4030.h +++ b/include/linux/platform_data/omap-twl4030.h @@ -3,7 +3,7 @@ * omap-twl4030.h - ASoC machine driver for TI SoC based boards with twl4030 * codec, header. * - * Copyright (C) 2012 Texas Instruments Incorporated - http://www.ti.com + * Copyright (C) 2012 Texas Instruments Incorporated - https://www.ti.com * All rights reserved. 
* * Author: Peter Ujfalusi diff --git a/include/linux/platform_data/uio_pruss.h b/include/linux/platform_data/uio_pruss.h index 3d47d219827f..31f2e22661bc 100644 --- a/include/linux/platform_data/uio_pruss.h +++ b/include/linux/platform_data/uio_pruss.h @@ -3,7 +3,7 @@ * * Platform data for uio_pruss driver * - * Copyright (C) 2010-11 Texas Instruments Incorporated - http://www.ti.com/ + * Copyright (C) 2010-11 Texas Instruments Incorporated - https://www.ti.com/ * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License as diff --git a/include/linux/platform_data/usb-omap.h b/include/linux/platform_data/usb-omap.h index fa579b4c666b..5e70d667031c 100644 --- a/include/linux/platform_data/usb-omap.h +++ b/include/linux/platform_data/usb-omap.h @@ -1,7 +1,7 @@ /* * usb-omap.h - Platform data for the various OMAP USB IPs * - * Copyright (C) 2012 Texas Instruments Incorporated - http://www.ti.com + * Copyright (C) 2012 Texas Instruments Incorporated - https://www.ti.com * * This software is distributed under the terms of the GNU General Public * License ("GPL") version 2, as published by the Free Software Foundation. diff --git a/include/linux/soc/ti/k3-ringacc.h b/include/linux/soc/ti/k3-ringacc.h index 7ac115432fa1..5a472eca5ee4 100644 --- a/include/linux/soc/ti/k3-ringacc.h +++ b/include/linux/soc/ti/k3-ringacc.h @@ -2,7 +2,7 @@ /* * K3 Ring Accelerator (RA) subsystem interface * - * Copyright (C) 2019 Texas Instruments Incorporated - http://www.ti.com + * Copyright (C) 2019 Texas Instruments Incorporated - https://www.ti.com */ #ifndef __SOC_TI_K3_RINGACC_API_H_ diff --git a/include/linux/soc/ti/knav_qmss.h b/include/linux/soc/ti/knav_qmss.h index 9745df6ed9d3..c75ef99c99ca 100644 --- a/include/linux/soc/ti/knav_qmss.h +++ b/include/linux/soc/ti/knav_qmss.h @@ -1,7 +1,7 @@ /* * Keystone Navigator Queue Management Sub-System header * - * Copyright (C) 2014 Texas Instruments Incorporated - http://www.ti.com + * Copyright (C) 2014 Texas Instruments Incorporated - https://www.ti.com * Author: Sandeep Nair * Cyril Chemparathy * Santosh Shilimkar diff --git a/include/linux/soc/ti/ti-msgmgr.h b/include/linux/soc/ti/ti-msgmgr.h index eac8e0c6fe11..1f6e76d423cf 100644 --- a/include/linux/soc/ti/ti-msgmgr.h +++ b/include/linux/soc/ti/ti-msgmgr.h @@ -1,7 +1,7 @@ /* * Texas Instruments' Message Manager * - * Copyright (C) 2015-2016 Texas Instruments Incorporated - http://www.ti.com/ + * Copyright (C) 2015-2016 Texas Instruments Incorporated - https://www.ti.com/ * Nishanth Menon * * This program is free software; you can redistribute it and/or modify diff --git a/include/linux/wkup_m3_ipc.h b/include/linux/wkup_m3_ipc.h index e497e621dbb7..3f496967b538 100644 --- a/include/linux/wkup_m3_ipc.h +++ b/include/linux/wkup_m3_ipc.h @@ -1,7 +1,7 @@ /* * TI Wakeup M3 for AMx3 SoCs Power Management Routines * - * Copyright (C) 2015 Texas Instruments Incorporated - http://www.ti.com/ + * Copyright (C) 2015 Texas Instruments Incorporated - https://www.ti.com/ * Dave Gerlach * * This program is free software; you can redistribute it and/or diff --git a/include/linux/xxhash.h b/include/linux/xxhash.h index 52b073fea17f..df42511438d0 100644 --- a/include/linux/xxhash.h +++ b/include/linux/xxhash.h @@ -34,7 +34,7 @@ * ("BSD"). 
* * You can contact the author at: - * - xxHash homepage: http://cyan4973.github.io/xxHash/ + * - xxHash homepage: https://cyan4973.github.io/xxHash/ * - xxHash source repository: https://github.com/Cyan4973/xxHash */ diff --git a/include/linux/xz.h b/include/linux/xz.h index 24ad7d875977..9884c8440188 100644 --- a/include/linux/xz.h +++ b/include/linux/xz.h @@ -2,7 +2,7 @@ * XZ decompressor * * Authors: Lasse Collin - * Igor Pavlov + * Igor Pavlov * * This file has been put into the public domain. * You can do whatever you want with this file. diff --git a/include/linux/zlib.h b/include/linux/zlib.h index c757d848a758..78ede944c082 100644 --- a/include/linux/zlib.h +++ b/include/linux/zlib.h @@ -23,7 +23,7 @@ The data format used by the zlib library is described by RFCs (Request for - Comments) 1950 to 1952 in the files http://www.ietf.org/rfc/rfc1950.txt + Comments) 1950 to 1952 in the files https://www.ietf.org/rfc/rfc1950.txt (zlib format), rfc1951.txt (deflate format) and rfc1952.txt (gzip format). */ diff --git a/include/soc/arc/aux.h b/include/soc/arc/aux.h index e223c4ffa153..9c2eff6140b6 100644 --- a/include/soc/arc/aux.h +++ b/include/soc/arc/aux.h @@ -22,7 +22,7 @@ static inline int read_aux_reg(u32 r) /* * function helps elide unused variable warning - * see: http://lists.infradead.org/pipermail/linux-snps-arc/2016-November/001748.html + * see: https://lists.infradead.org/pipermail/linux-snps-arc/2016-November/001748.html */ static inline void write_aux_reg(u32 r, u32 v) { diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h index c6dd0215482e..22220945a5fd 100644 --- a/include/uapi/linux/elf.h +++ b/include/uapi/linux/elf.h @@ -53,7 +53,7 @@ typedef __s64 Elf64_Sxword; * * - Oracle: Linker and Libraries. * Part No: 817–1984–19, August 2011. - * http://docs.oracle.com/cd/E18752_01/pdf/817-1984.pdf + * https://docs.oracle.com/cd/E18752_01/pdf/817-1984.pdf * * - System V ABI AMD64 Architecture Processor Supplement * Draft Version 0.99.4, diff --git a/include/uapi/linux/map_to_7segment.h b/include/uapi/linux/map_to_7segment.h index f9ed18134b83..13a06e5e966e 100644 --- a/include/uapi/linux/map_to_7segment.h +++ b/include/uapi/linux/map_to_7segment.h @@ -24,7 +24,7 @@ * of (ASCII) characters to a 7-segments notation. * * The 7 segment's wikipedia notation below is used as standard. - * See: http://en.wikipedia.org/wiki/Seven_segment_display + * See: https://en.wikipedia.org/wiki/Seven_segment_display * * Notation: +-a-+ * f b diff --git a/include/uapi/linux/types.h b/include/uapi/linux/types.h index 2fce8b6876e9..f6d2f83cbe29 100644 --- a/include/uapi/linux/types.h +++ b/include/uapi/linux/types.h @@ -7,7 +7,7 @@ #ifndef __ASSEMBLY__ #ifndef __KERNEL__ #ifndef __EXPORTED_HEADERS__ -#warning "Attempt to use kernel headers from user space, see http://kernelnewbies.org/KernelHeaders" +#warning "Attempt to use kernel headers from user space, see https://kernelnewbies.org/KernelHeaders" #endif /* __EXPORTED_HEADERS__ */ #endif diff --git a/include/uapi/linux/usb/ch9.h b/include/uapi/linux/usb/ch9.h index 48766fdf6580..0f865ae4ba89 100644 --- a/include/uapi/linux/usb/ch9.h +++ b/include/uapi/linux/usb/ch9.h @@ -1229,7 +1229,7 @@ struct usb_set_sel_req { * As per USB compliance update, a device that is actively drawing * more than 100mA from USB must report itself as bus-powered in * the GetStatus(DEVICE) call. 
- * http://compliance.usb.org/index.asp?UpdateFile=Electrical&Format=Standard#34 + * https://compliance.usb.org/index.asp?UpdateFile=Electrical&Format=Standard#34 */ #define USB_SELF_POWER_VBUS_MAX_DRAW 100 From 9e58c5e2fcd8d8d8820ea277e0f577f563eed1f1 Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Tue, 11 Aug 2020 18:34:23 -0700 Subject: [PATCH 078/164] include/linux/poison.h: remove obsolete comment When the definition was changed, the comment became stale. Just remove it since there isn't anything useful to say here. Fixes: b8a0255db958 ("include/linux/poison.h: use POISON_POINTER_DELTA for poison pointers") Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton Cc: Vasily Kulikov Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/20200730174108.GJ23808@casper.infradead.org Signed-off-by: Linus Torvalds --- include/linux/poison.h | 4 ---- 1 file changed, 4 deletions(-) diff --git a/include/linux/poison.h b/include/linux/poison.h index df34330b4e34..dc8ae5d8db03 100644 --- a/include/linux/poison.h +++ b/include/linux/poison.h @@ -24,10 +24,6 @@ #define LIST_POISON2 ((void *) 0x122 + POISON_POINTER_DELTA) /********** include/linux/timer.h **********/ -/* - * Magic number "tsta" to indicate a static timer initializer - * for the object debugging code. - */ #define TIMER_ENTRY_STATIC ((void *) 0x300 + POISON_POINTER_DELTA) /********** mm/page_poison.c **********/ From 25fd529c34d063d1bef23742f2e8f8341c639dc3 Mon Sep 17 00:00:00 2001 From: Luc Van Oostenryck Date: Tue, 11 Aug 2020 18:34:26 -0700 Subject: [PATCH 079/164] sparse: group the defines by functionality By popular demand, reorder the defines for sparse annotations and group them by functionality. Signed-off-by: Luc Van Oostenryck Signed-off-by: Andrew Morton Acked-by: Miguel Ojeda Cc: Geert Uytterhoeven Link: lore.kernel.org/r/CAMuHMdWQsirja-h3wBcZezk+H2Q_HShhAks8Hc8ps5fTAp=ObQ@mail.gmail.com Link: http://lkml.kernel.org/r/20200621143652.53798-1-luc.vanoostenryck@gmail.com Signed-off-by: Linus Torvalds --- include/linux/compiler_types.h | 44 +++++++++++++++++++--------------- 1 file changed, 25 insertions(+), 19 deletions(-) diff --git a/include/linux/compiler_types.h b/include/linux/compiler_types.h index 2e231ba8fe3f..4b33cb385f96 100644 --- a/include/linux/compiler_types.h +++ b/include/linux/compiler_types.h @@ -5,48 +5,54 @@ #ifndef __ASSEMBLY__ #ifdef __CHECKER__ +/* address spaces */ # define __kernel __attribute__((address_space(0))) # define __user __attribute__((noderef, address_space(__user))) -# define __safe __attribute__((safe)) -# define __force __attribute__((force)) -# define __nocast __attribute__((nocast)) # define __iomem __attribute__((noderef, address_space(__iomem))) +# define __percpu __attribute__((noderef, address_space(__percpu))) +# define __rcu __attribute__((noderef, address_space(__rcu))) +extern void __chk_user_ptr(const volatile void __user *); +extern void __chk_io_ptr(const volatile void __iomem *); +/* context/locking */ # define __must_hold(x) __attribute__((context(x,1,1))) # define __acquires(x) __attribute__((context(x,0,1))) # define __releases(x) __attribute__((context(x,1,0))) # define __acquire(x) __context__(x,1) # define __release(x) __context__(x,-1) # define __cond_lock(x,c) ((c) ? 
({ __acquire(x); 1; }) : 0) -# define __percpu __attribute__((noderef, address_space(__percpu))) -# define __rcu __attribute__((noderef, address_space(__rcu))) +/* other */ +# define __force __attribute__((force)) +# define __nocast __attribute__((nocast)) +# define __safe __attribute__((safe)) # define __private __attribute__((noderef)) -extern void __chk_user_ptr(const volatile void __user *); -extern void __chk_io_ptr(const volatile void __iomem *); # define ACCESS_PRIVATE(p, member) (*((typeof((p)->member) __force *) &(p)->member)) #else /* __CHECKER__ */ +/* address spaces */ +# define __kernel # ifdef STRUCTLEAK_PLUGIN -# define __user __attribute__((user)) +# define __user __attribute__((user)) # else # define __user # endif -# define __kernel -# define __safe -# define __force -# define __nocast # define __iomem -# define __chk_user_ptr(x) (void)0 -# define __chk_io_ptr(x) (void)0 -# define __builtin_warning(x, y...) (1) +# define __percpu +# define __rcu +# define __chk_user_ptr(x) (void)0 +# define __chk_io_ptr(x) (void)0 +/* context/locking */ # define __must_hold(x) # define __acquires(x) # define __releases(x) -# define __acquire(x) (void)0 -# define __release(x) (void)0 +# define __acquire(x) (void)0 +# define __release(x) (void)0 # define __cond_lock(x,c) (c) -# define __percpu -# define __rcu +/* other */ +# define __force +# define __nocast +# define __safe # define __private # define ACCESS_PRIVATE(p, member) ((p)->member) +# define __builtin_warning(x, y...) (1) #endif /* __CHECKER__ */ /* Indirect macros required for expanded argument pasting, eg. __LINE__. */ From 5959f829a93c18ccf15715768317b58f5d9bf2d4 Mon Sep 17 00:00:00 2001 From: Stefano Brivio Date: Tue, 11 Aug 2020 18:34:29 -0700 Subject: [PATCH 080/164] lib/bitmap.c: fix bitmap_cut() for partial overlapping case Patch series "lib: Fix bitmap_cut() for overlaps, add test" This patch (of 2): Yury Norov reports that bitmap_cut() will not produce the right outcome if src and dst partially overlap, with src pointing at some location after dst, because the memmove() affects src before we store the bits that we need to keep, that is, the bits preceding the cut -- as long as the beginning of the cut is not aligned to a long. Fix this by storing those bits before the memmove(). Note that this is just a theoretical concern so far, as the only user of this function, pipapo_drop() from the nftables set back-end implemented in net/netfilter/nft_set_pipapo.c, always supplies entirely overlapping src and dst.
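To make the ordering issue concrete (sketch only; dst and src follow the partial-overlap layout used by the test in the next patch, e.g. dst = &b[0], src = &b[1], and "mask" stands for the existing ~0UL >> (BITS_PER_LONG - first % BITS_PER_LONG) expression):

	/* old order: broken when dst precedes an overlapping src */
	memmove(dst, src, len * sizeof(*dst));	  /* overwrites part of src[] */
	keep = src[first / BITS_PER_LONG] & mask; /* too late: reads moved data */

	/* new order (this patch) */
	keep = src[first / BITS_PER_LONG] & mask; /* snapshot bits preceding the cut */
	memmove(dst, src, len * sizeof(*dst));	  /* now safe */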
Fixes: 2092767168f0 ("bitmap: Introduce bitmap_cut(): cut bits and shift remaining") Reported-by: Yury Norov Signed-off-by: Stefano Brivio Signed-off-by: Andrew Morton Reviewed-by: Andy Shevchenko Cc: Rasmus Villemoes Cc: Pablo Neira Ayuso Link: http://lkml.kernel.org/r/cover.1592155364.git.sbrivio@redhat.com Link: http://lkml.kernel.org/r/003e38d4428cd6091ef00b5b03354f1bd7d9091e.1592155364.git.sbrivio@redhat.com Signed-off-by: Linus Torvalds --- lib/bitmap.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/lib/bitmap.c b/lib/bitmap.c index 0364452b1617..c13d859bc7ab 100644 --- a/lib/bitmap.c +++ b/lib/bitmap.c @@ -212,13 +212,13 @@ void bitmap_cut(unsigned long *dst, const unsigned long *src, unsigned long keep = 0, carry; int i; - memmove(dst, src, len * sizeof(*dst)); - if (first % BITS_PER_LONG) { keep = src[first / BITS_PER_LONG] & (~0UL >> (BITS_PER_LONG - first % BITS_PER_LONG)); } + memmove(dst, src, len * sizeof(*dst)); + while (cut--) { for (i = first / BITS_PER_LONG; i < len; i++) { if (i < len - 1) From bcb32a1d82614ffb8a56f14c648c3970bcd2cf89 Mon Sep 17 00:00:00 2001 From: Stefano Brivio Date: Tue, 11 Aug 2020 18:34:32 -0700 Subject: [PATCH 081/164] lib/test_bitmap.c: add test for bitmap_cut() Inspired by an original patch from Yury Norov: introduce a test for bitmap_cut() that also makes sure functionality is as described for partially overlapping src and dst. Signed-off-by: Stefano Brivio Signed-off-by: Andrew Morton Reviewed-by: Andy Shevchenko Cc: Pablo Neira Ayuso Cc: Rasmus Villemoes Cc: Yury Norov Link: http://lkml.kernel.org/r/5fc45e6bbd4fa837cd9577f8a0c1d639df90a4ce.1592155364.git.sbrivio@redhat.com Signed-off-by: Linus Torvalds --- lib/test_bitmap.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 58 insertions(+) diff --git a/lib/test_bitmap.c b/lib/test_bitmap.c index 6b13150667f5..df903c53952b 100644 --- a/lib/test_bitmap.c +++ b/lib/test_bitmap.c @@ -610,6 +610,63 @@ static void __init test_for_each_set_clump8(void) expect_eq_clump8(start, CLUMP_EXP_NUMBITS, clump_exp, &clump); } +struct test_bitmap_cut { + unsigned int first; + unsigned int cut; + unsigned int nbits; + unsigned long in[4]; + unsigned long expected[4]; +}; + +static struct test_bitmap_cut test_cut[] = { + { 0, 0, 8, { 0x0000000aUL, }, { 0x0000000aUL, }, }, + { 0, 0, 32, { 0xdadadeadUL, }, { 0xdadadeadUL, }, }, + { 0, 3, 8, { 0x000000aaUL, }, { 0x00000015UL, }, }, + { 3, 3, 8, { 0x000000aaUL, }, { 0x00000012UL, }, }, + { 0, 1, 32, { 0xa5a5a5a5UL, }, { 0x52d2d2d2UL, }, }, + { 0, 8, 32, { 0xdeadc0deUL, }, { 0x00deadc0UL, }, }, + { 1, 1, 32, { 0x5a5a5a5aUL, }, { 0x2d2d2d2cUL, }, }, + { 0, 15, 32, { 0xa5a5a5a5UL, }, { 0x00014b4bUL, }, }, + { 0, 16, 32, { 0xa5a5a5a5UL, }, { 0x0000a5a5UL, }, }, + { 15, 15, 32, { 0xa5a5a5a5UL, }, { 0x000125a5UL, }, }, + { 15, 16, 32, { 0xa5a5a5a5UL, }, { 0x0000a5a5UL, }, }, + { 16, 15, 32, { 0xa5a5a5a5UL, }, { 0x0001a5a5UL, }, }, + + { BITS_PER_LONG, BITS_PER_LONG, BITS_PER_LONG, + { 0xa5a5a5a5UL, 0xa5a5a5a5UL, }, + { 0xa5a5a5a5UL, 0xa5a5a5a5UL, }, + }, + { 1, BITS_PER_LONG - 1, BITS_PER_LONG, + { 0xa5a5a5a5UL, 0xa5a5a5a5UL, }, + { 0x00000001UL, 0x00000001UL, }, + }, + + { 0, BITS_PER_LONG * 2, BITS_PER_LONG * 2 + 1, + { 0xa5a5a5a5UL, 0x00000001UL, 0x00000001UL, 0x00000001UL }, + { 0x00000001UL, }, + }, + { 16, BITS_PER_LONG * 2 + 1, BITS_PER_LONG * 2 + 1 + 16, + { 0x0000ffffUL, 0x5a5a5a5aUL, 0x5a5a5a5aUL, 0x5a5a5a5aUL }, + { 0x2d2dffffUL, }, + }, +}; + +static void __init test_bitmap_cut(void) +{ + unsigned long b[5], *in = 
&b[1], *out = &b[0]; /* Partial overlap */ + int i; + + for (i = 0; i < ARRAY_SIZE(test_cut); i++) { + struct test_bitmap_cut *t = &test_cut[i]; + + memcpy(in, t->in, sizeof(t->in)); + + bitmap_cut(out, in, t->first, t->cut, t->nbits); + + expect_eq_bitmap(t->expected, out, t->nbits); + } +} + static void __init selftest(void) { test_zero_clear(); @@ -623,6 +680,7 @@ static void __init selftest(void) test_bitmap_parselist_user(); test_mem_optimisations(); test_for_each_set_clump8(); + test_bitmap_cut(); } KSTM_MODULE_LOADERS(test_bitmap); From 0a650e472d2039a5f768acd82f2a7068bf338dfd Mon Sep 17 00:00:00 2001 From: Luc Van Oostenryck Date: Tue, 11 Aug 2020 18:34:35 -0700 Subject: [PATCH 082/164] lib/generic-radix-tree.c: remove unneeded __rcu struct __genradix is defined as having its member 'root' annotated as __rcu. But in the corresponding API RCU is not used. Sparse reports this type mismatch as: lib/generic-radix-tree.c:56:35: warning: incorrect type in initializer (different address spaces) lib/generic-radix-tree.c:56:35: expected struct genradix_root *r lib/generic-radix-tree.c:56:35: got struct genradix_root [noderef] *__val with 6 other ones. So, correct root's type by removing this unneeded __rcu. Signed-off-by: Luc Van Oostenryck Signed-off-by: Andrew Morton Cc: Kent Overstreet Link: http://lkml.kernel.org/r/20200621161745.55396-1-luc.vanoostenryck@gmail.com Signed-off-by: Linus Torvalds --- include/linux/generic-radix-tree.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/generic-radix-tree.h b/include/linux/generic-radix-tree.h index 02393c0c98f9..bfd00320c7f3 100644 --- a/include/linux/generic-radix-tree.h +++ b/include/linux/generic-radix-tree.h @@ -44,7 +44,7 @@ struct genradix_root; struct __genradix { - struct genradix_root __rcu *root; + struct genradix_root *root; }; /* From 403f177304354990c36d5a7d125bce2b39bcbe2c Mon Sep 17 00:00:00 2001 From: Geert Uytterhoeven Date: Tue, 11 Aug 2020 18:34:38 -0700 Subject: [PATCH 083/164] lib/test_bitops: do the full test during module init Currently, the bitops test consists of two parts: one part is executed during module load, the second part during module unload. This is cumbersome for the user, as he has to perform two steps to execute all tests, and is different from most (all?) other tests. Merge the two parts, so both are executed during module load. 
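The resulting module shape is roughly the following (sketch only, condensed from the diff below):

static int __init test_bitops_startup(void)
{
	/* set the test bits, run every check, then clear them
	 * and verify that no bit is left set */
	return 0;
}

static void __exit test_bitops_unstartup(void)
{
}

module_init(test_bitops_startup);
module_exit(test_bitops_unstartup);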
Signed-off-by: Geert Uytterhoeven Signed-off-by: Andrew Morton Reviewed-by: Andy Shevchenko Cc: Wei Yang Cc: Jesse Brandeburg Link: http://lkml.kernel.org/r/20200706112900.7097-1-geert@linux-m68k.org Signed-off-by: Linus Torvalds --- lib/test_bitops.c | 18 ++++++++++-------- 1 file changed, 10 insertions(+), 8 deletions(-) diff --git a/lib/test_bitops.c b/lib/test_bitops.c index ced25e3a779b..471141ddd691 100644 --- a/lib/test_bitops.c +++ b/lib/test_bitops.c @@ -52,9 +52,9 @@ static unsigned long order_comb_long[][2] = { static int __init test_bitops_startup(void) { - int i; + int i, bit_set; - pr_warn("Loaded test module\n"); + pr_info("Starting bitops test\n"); set_bit(BITOPS_4, g_bitmap); set_bit(BITOPS_7, g_bitmap); set_bit(BITOPS_11, g_bitmap); @@ -81,12 +81,8 @@ static int __init test_bitops_startup(void) order_comb_long[i][0]); } #endif - return 0; -} -static void __exit test_bitops_unstartup(void) -{ - int bit_set; + barrier(); clear_bit(BITOPS_4, g_bitmap); clear_bit(BITOPS_7, g_bitmap); @@ -98,7 +94,13 @@ static void __exit test_bitops_unstartup(void) if (bit_set != BITOPS_LAST) pr_err("ERROR: FOUND SET BIT %d\n", bit_set); - pr_warn("Unloaded test module\n"); + pr_info("Completed bitops test\n"); + + return 0; +} + +static void __exit test_bitops_unstartup(void) +{ } module_init(test_bitops_startup); From f36331770406b8e693a3d8d71ab3ccbbeabc7142 Mon Sep 17 00:00:00 2001 From: Wei Yongjun Date: Tue, 11 Aug 2020 18:34:41 -0700 Subject: [PATCH 084/164] lib/test_lockup.c: make symbol 'test_works' static Fix sparse build warning: lib/test_lockup.c:403:1: warning: symbol '__pcpu_scope_test_works' was not declared. Should it be static? Reported-by: Hulk Robot Signed-off-by: Wei Yongjun Signed-off-by: Andrew Morton Link: http://lkml.kernel.org/r/20200707112252.9047-1-weiyongjun1@huawei.com Signed-off-by: Linus Torvalds --- lib/test_lockup.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/test_lockup.c b/lib/test_lockup.c index ff26f36d729f..0f81252837b9 100644 --- a/lib/test_lockup.c +++ b/lib/test_lockup.c @@ -400,7 +400,7 @@ static void test_lockup(bool master) test_unlock(master, true); } -DEFINE_PER_CPU(struct work_struct, test_works); +static DEFINE_PER_CPU(struct work_struct, test_works); static void test_work_fn(struct work_struct *work) { From 63646bc9f95f7faf5345e54a59fd1ecdd4ba5316 Mon Sep 17 00:00:00 2001 From: Tiezhu Yang Date: Tue, 11 Aug 2020 18:34:44 -0700 Subject: [PATCH 085/164] lib/Kconfig.debug: make TEST_LOCKUP depend on module Since test_lockup is a test module to generate lockups, it is better to limit TEST_LOCKUP to module (=m) or disabled (=n) because we can not use the module parameters when CONFIG_TEST_LOCKUP=y. Signed-off-by: Tiezhu Yang Signed-off-by: Andrew Morton Reviewed-by: Guenter Roeck Cc: Konstantin Khlebnikov Cc: Kees Cook Link: http://lkml.kernel.org/r/1595555407-29875-1-git-send-email-yangtiezhu@loongson.cn Signed-off-by: Linus Torvalds --- lib/Kconfig.debug | 1 + 1 file changed, 1 insertion(+) diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index ba12a58b2947..32300eb79d46 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1078,6 +1078,7 @@ config WQ_WATCHDOG config TEST_LOCKUP tristate "Test module to generate lockups" + depends on m help This builds the "test_lockup" module that helps to make sure that watchdogs and lockup detectors are working properly. 
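For context (not part of the patch): the test is driven entirely by parameters supplied at insertion time, e.g. "modprobe test_lockup time_secs=60 state=S" as in the transcripts of the next patch, which is the usage a built-in (=y) configuration would rule out.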
From 3adf3bae0d612357da516d39e1584f1547eb6e86 Mon Sep 17 00:00:00 2001 From: Tiezhu Yang Date: Tue, 11 Aug 2020 18:34:47 -0700 Subject: [PATCH 086/164] lib/test_lockup.c: fix return value of test_lockup_init() Since filp_open() returns an error pointer, we should use IS_ERR() to check the return value and, on failure, return PTR_ERR() so that the actual error code is reported instead of always -EINVAL. E.g. without this patch:

[root@localhost loongson]# ls no_such_file
ls: cannot access no_such_file: No such file or directory
[root@localhost loongson]# modprobe test_lockup file_path=no_such_file lock_sb_umount time_secs=60 state=S
modprobe: ERROR: could not insert 'test_lockup': Invalid argument
[root@localhost loongson]# dmesg | tail -1
[  126.100596] test_lockup: cannot find file_path

With this patch:

[root@localhost loongson]# ls no_such_file
ls: cannot access no_such_file: No such file or directory
[root@localhost loongson]# modprobe test_lockup file_path=no_such_file lock_sb_umount time_secs=60 state=S
modprobe: ERROR: could not insert 'test_lockup': Unknown symbol in module, or unknown parameter (see dmesg)
[root@localhost loongson]# dmesg | tail -1
[   95.134362] test_lockup: failed to open no_such_file: -2

Fixes: aecd42df6d39 ("lib/test_lockup.c: add parameters for locking generic vfs locks") Signed-off-by: Tiezhu Yang Signed-off-by: Andrew Morton Reviewed-by: Guenter Roeck Cc: Konstantin Khlebnikov Cc: Kees Cook Link: http://lkml.kernel.org/r/1595555407-29875-2-git-send-email-yangtiezhu@loongson.cn Signed-off-by: Linus Torvalds --- lib/test_lockup.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/lib/test_lockup.c b/lib/test_lockup.c index 0f81252837b9..f1a020bcc763 100644 --- a/lib/test_lockup.c +++ b/lib/test_lockup.c @@ -512,8 +512,8 @@ static int __init test_lockup_init(void) if (test_file_path[0]) { test_file = filp_open(test_file_path, O_RDONLY, 0); if (IS_ERR(test_file)) { - pr_err("cannot find file_path\n"); - return -EINVAL; + pr_err("failed to open %s: %ld\n", test_file_path, PTR_ERR(test_file)); + return PTR_ERR(test_file); } test_inode = file_inode(test_file); } else if (test_lock_inode || From d89775fc929c5a1d91ed518a71b456da0865e5ff Mon Sep 17 00:00:00 2001 From: "Alexander A. Klimov" Date: Tue, 11 Aug 2020 18:34:50 -0700 Subject: [PATCH 087/164] lib/: replace HTTP links with HTTPS ones Rationale: Reduces attack surface on kernel devs opening the links for MITM as HTTPS traffic is much harder to manipulate. Signed-off-by: Alexander A. Klimov Signed-off-by: Andrew Morton Acked-by: Coly Li [crc64.c] Link: http://lkml.kernel.org/r/20200726112154.16510-1-grandmaster@al2klimov.de Signed-off-by: Linus Torvalds --- lib/Kconfig.debug | 2 +- lib/crc64.c | 2 +- lib/decompress_bunzip2.c | 2 +- lib/decompress_unlzma.c | 6 +++--- lib/math/rational.c | 2 +- lib/rbtree.c | 2 +- lib/ts_bm.c | 2 +- lib/xxhash.c | 2 +- lib/xz/xz_crc32.c | 2 +- lib/xz/xz_dec_bcj.c | 2 +- lib/xz/xz_dec_lzma2.c | 2 +- lib/xz/xz_lzma2.h | 2 +- lib/xz/xz_stream.h | 2 +- 13 files changed, 15 insertions(+), 15 deletions(-) diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 32300eb79d46..5c5fbcb34af4 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -2215,7 +2215,7 @@ config LIST_KUNIT_TEST and associated macros. KUnit tests run during boot and output the results to the debug log - in TAP format (http://testanything.org/). Only useful for kernel devs + in TAP format (https://testanything.org/).
Only useful for kernel devs running the KUnit test harness, and not intended for inclusion into a production build. diff --git a/lib/crc64.c b/lib/crc64.c index f8928ce28280..47cfa054827f 100644 --- a/lib/crc64.c +++ b/lib/crc64.c @@ -4,7 +4,7 @@ * * This is a basic crc64 implementation following ECMA-182 specification, * which can be found from, - * http://www.ecma-international.org/publications/standards/Ecma-182.htm + * https://www.ecma-international.org/publications/standards/Ecma-182.htm * * Dr. Ross N. Williams has a great document to introduce the idea of CRC * algorithm, here the CRC64 code is also inspired by the table-driven diff --git a/lib/decompress_bunzip2.c b/lib/decompress_bunzip2.c index 7c4932eed748..f9628f3924ce 100644 --- a/lib/decompress_bunzip2.c +++ b/lib/decompress_bunzip2.c @@ -34,7 +34,7 @@ Phone (337) 232-1234 or 1-800-738-2226 Fax (337) 232-1297 - http://www.hospiceacadiana.com/ + https://www.hospiceacadiana.com/ Manuel */ diff --git a/lib/decompress_unlzma.c b/lib/decompress_unlzma.c index ed7a1fd819f2..1cf409ef8d04 100644 --- a/lib/decompress_unlzma.c +++ b/lib/decompress_unlzma.c @@ -8,7 +8,7 @@ *implementation for lzma. *Copyright (C) 2006 Aurelien Jacobs < aurel@gnuage.org > * - *Based on LzmaDecode.c from the LZMA SDK 4.22 (http://www.7-zip.org/) + *Based on LzmaDecode.c from the LZMA SDK 4.22 (https://www.7-zip.org/) *Copyright (C) 1999-2005 Igor Pavlov * *Copyrights of the parts, see headers below. @@ -56,7 +56,7 @@ static long long INIT read_int(unsigned char *ptr, int size) /* Small range coder implementation for lzma. *Copyright (C) 2006 Aurelien Jacobs < aurel@gnuage.org > * - *Based on LzmaDecode.c from the LZMA SDK 4.22 (http://www.7-zip.org/) + *Based on LzmaDecode.c from the LZMA SDK 4.22 (https://www.7-zip.org/) *Copyright (c) 1999-2005 Igor Pavlov */ @@ -213,7 +213,7 @@ rc_bit_tree_decode(struct rc *rc, uint16_t *p, int num_levels, int *symbol) * Small lzma deflate implementation. * Copyright (C) 2006 Aurelien Jacobs < aurel@gnuage.org > * - * Based on LzmaDecode.c from the LZMA SDK 4.22 (http://www.7-zip.org/) + * Based on LzmaDecode.c from the LZMA SDK 4.22 (https://www.7-zip.org/) * Copyright (C) 1999-2005 Igor Pavlov */ diff --git a/lib/math/rational.c b/lib/math/rational.c index 31fb27db2deb..df75c8809693 100644 --- a/lib/math/rational.c +++ b/lib/math/rational.c @@ -27,7 +27,7 @@ * with the fractional part size described in given_denominator. * * for theoretical background, see: - * http://en.wikipedia.org/wiki/Continued_fraction + * https://en.wikipedia.org/wiki/Continued_fraction */ void rational_best_approximation( diff --git a/lib/rbtree.c b/lib/rbtree.c index 8545872e61db..c4ac5c2421f2 100644 --- a/lib/rbtree.c +++ b/lib/rbtree.c @@ -13,7 +13,7 @@ #include /* - * red-black trees properties: http://en.wikipedia.org/wiki/Rbtree + * red-black trees properties: https://en.wikipedia.org/wiki/Rbtree * * 1) A node is either red or black * 2) The root is black diff --git a/lib/ts_bm.c b/lib/ts_bm.c index 277cb4417ac2..4cf250031f0f 100644 --- a/lib/ts_bm.c +++ b/lib/ts_bm.c @@ -11,7 +11,7 @@ * [1] A Fast String Searching Algorithm, R.S. Boyer and Moore. * Communications of the Association for Computing Machinery, * 20(10), 1977, pp. 762-772. 
- * http://www.cs.utexas.edu/users/moore/publications/fstrpos.pdf + * https://www.cs.utexas.edu/users/moore/publications/fstrpos.pdf * * [2] Handbook of Exact String Matching Algorithms, Thierry Lecroq, 2004 * http://www-igm.univ-mlv.fr/~lecroq/string/string.pdf diff --git a/lib/xxhash.c b/lib/xxhash.c index aa61e2a3802f..d5bb9ff10607 100644 --- a/lib/xxhash.c +++ b/lib/xxhash.c @@ -34,7 +34,7 @@ * ("BSD"). * * You can contact the author at: - * - xxHash homepage: http://cyan4973.github.io/xxHash/ + * - xxHash homepage: https://cyan4973.github.io/xxHash/ * - xxHash source repository: https://github.com/Cyan4973/xxHash */ diff --git a/lib/xz/xz_crc32.c b/lib/xz/xz_crc32.c index 912aae5fa09e..88a2c35e1b59 100644 --- a/lib/xz/xz_crc32.c +++ b/lib/xz/xz_crc32.c @@ -2,7 +2,7 @@ * CRC32 using the polynomial from IEEE-802.3 * * Authors: Lasse Collin - * Igor Pavlov + * Igor Pavlov * * This file has been put into the public domain. * You can do whatever you want with this file. diff --git a/lib/xz/xz_dec_bcj.c b/lib/xz/xz_dec_bcj.c index a768e6d28bbb..72ddac6ef2ec 100644 --- a/lib/xz/xz_dec_bcj.c +++ b/lib/xz/xz_dec_bcj.c @@ -2,7 +2,7 @@ * Branch/Call/Jump (BCJ) filter decoders * * Authors: Lasse Collin - * Igor Pavlov + * Igor Pavlov * * This file has been put into the public domain. * You can do whatever you want with this file. diff --git a/lib/xz/xz_dec_lzma2.c b/lib/xz/xz_dec_lzma2.c index 156f26fdc4c9..9f336bc07ed6 100644 --- a/lib/xz/xz_dec_lzma2.c +++ b/lib/xz/xz_dec_lzma2.c @@ -2,7 +2,7 @@ * LZMA2 decoder * * Authors: Lasse Collin - * Igor Pavlov + * Igor Pavlov * * This file has been put into the public domain. * You can do whatever you want with this file. diff --git a/lib/xz/xz_lzma2.h b/lib/xz/xz_lzma2.h index 071d67bee9f5..92d852d4f87a 100644 --- a/lib/xz/xz_lzma2.h +++ b/lib/xz/xz_lzma2.h @@ -2,7 +2,7 @@ * LZMA2 definitions * * Authors: Lasse Collin - * Igor Pavlov + * Igor Pavlov * * This file has been put into the public domain. * You can do whatever you want with this file. diff --git a/lib/xz/xz_stream.h b/lib/xz/xz_stream.h index 66cb5a7055ec..430bb3a0d195 100644 --- a/lib/xz/xz_stream.h +++ b/lib/xz/xz_stream.h @@ -19,7 +19,7 @@ /* * See the .xz file format specification at - * http://tukaani.org/xz/xz-file-format.txt + * https://tukaani.org/xz/xz-file-format.txt * to understand the container format. */ From b642e44e8ab335868b549fe5753b783ca47bf3a3 Mon Sep 17 00:00:00 2001 From: Kars Mulder Date: Tue, 11 Aug 2020 18:34:53 -0700 Subject: [PATCH 088/164] kstrto*: correct documentation references to simple_strto*() The documentation of the kstrto*() functions reference the simple_strtoull function by "used as a replacement for [the obsolete] simple_strtoull". All these functions describes themselves as replacements for the function simple_strtoull, even though a function like kstrtol() would be more aptly described as a replacement of simple_strtol(). Fix these references by making the documentation of kstrto*() reference the closest simple_strto*() equivalent available. The functions kstrto[u]int() do not have direct simple_strto[u]int() equivalences, so these are made to refer to simple_strto[u]l() instead. Furthermore, add parentheses after function names, as is standard in kernel documentation. 
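As a usage sketch (illustration only, not part of the patch) of the pairing that the corrected references describe:

	unsigned long val;
	char *end;
	int ret;

	ret = kstrtoul("42", 10, &val);		/* ret == 0, val == 42 */
	ret = kstrtoul("42x", 10, &val);	/* ret == -EINVAL: the whole
						 * string must be a number */

	val = simple_strtoul("42x", &end, 10);	/* val == 42, end points at 'x',
						 * no error is reported */

kstrtoul() is thus the checked counterpart of simple_strtoul(), and kstrtol() of simple_strtol(), which is what the corrected comments now say.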
Fixes: 4c925d6031f71 ("kstrto*: add documentation") Signed-off-by: Kars Mulder Signed-off-by: Andrew Morton Reviewed-by: Andy Shevchenko Cc: Eldad Zack Cc: Miguel Ojeda Cc: Geert Uytterhoeven Cc: Mans Rullgard Cc: Petr Mladek Link: http://lkml.kernel.org/r/1ee1-5f234c00-f3-165a6440@234394593 Signed-off-by: Linus Torvalds --- include/linux/kernel.h | 4 ++-- lib/kstrtox.c | 8 ++++---- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/include/linux/kernel.h b/include/linux/kernel.h index e19c13616666..47b1dad63ffc 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -346,7 +346,7 @@ int __must_check kstrtoll(const char *s, unsigned int base, long long *res); * @res: Where to write the result of the conversion on success. * * Returns 0 on success, -ERANGE on overflow and -EINVAL on parsing error. - * Used as a replacement for the simple_strtoull. Return code must be checked. + * Used as a replacement for the simple_strtoul(). Return code must be checked. */ static inline int __must_check kstrtoul(const char *s, unsigned int base, unsigned long *res) { @@ -374,7 +374,7 @@ static inline int __must_check kstrtoul(const char *s, unsigned int base, unsign * @res: Where to write the result of the conversion on success. * * Returns 0 on success, -ERANGE on overflow and -EINVAL on parsing error. - * Used as a replacement for the simple_strtoull. Return code must be checked. + * Used as a replacement for the simple_strtol(). Return code must be checked. */ static inline int __must_check kstrtol(const char *s, unsigned int base, long *res) { diff --git a/lib/kstrtox.c b/lib/kstrtox.c index 1006bf70bf74..252ac414ba9a 100644 --- a/lib/kstrtox.c +++ b/lib/kstrtox.c @@ -115,7 +115,7 @@ static int _kstrtoull(const char *s, unsigned int base, unsigned long long *res) * @res: Where to write the result of the conversion on success. * * Returns 0 on success, -ERANGE on overflow and -EINVAL on parsing error. - * Used as a replacement for the obsolete simple_strtoull. Return code must + * Used as a replacement for the obsolete simple_strtoull(). Return code must * be checked. */ int kstrtoull(const char *s, unsigned int base, unsigned long long *res) @@ -139,7 +139,7 @@ EXPORT_SYMBOL(kstrtoull); * @res: Where to write the result of the conversion on success. * * Returns 0 on success, -ERANGE on overflow and -EINVAL on parsing error. - * Used as a replacement for the obsolete simple_strtoull. Return code must + * Used as a replacement for the obsolete simple_strtoll(). Return code must * be checked. */ int kstrtoll(const char *s, unsigned int base, long long *res) @@ -211,7 +211,7 @@ EXPORT_SYMBOL(_kstrtol); * @res: Where to write the result of the conversion on success. * * Returns 0 on success, -ERANGE on overflow and -EINVAL on parsing error. - * Used as a replacement for the obsolete simple_strtoull. Return code must + * Used as a replacement for the obsolete simple_strtoul(). Return code must * be checked. */ int kstrtouint(const char *s, unsigned int base, unsigned int *res) @@ -242,7 +242,7 @@ EXPORT_SYMBOL(kstrtouint); * @res: Where to write the result of the conversion on success. * * Returns 0 on success, -ERANGE on overflow and -EINVAL on parsing error. - * Used as a replacement for the obsolete simple_strtoull. Return code must + * Used as a replacement for the obsolete simple_strtol(). Return code must * be checked. 
*/ int kstrtoint(const char *s, unsigned int base, int *res) From ef0f2685336bbc334e8b6997ce9b155e5f7edd31 Mon Sep 17 00:00:00 2001 From: Kars Mulder Date: Tue, 11 Aug 2020 18:34:56 -0700 Subject: [PATCH 089/164] kstrto*: do not describe simple_strto*() as obsolete/replaced The documentation of the kstrto*() functions describes kstrto*() as "replacements" of the "obsolete" simple_strto*() functions. Both of these terms are inaccurate: they're not replacements because they have different behaviour, and the simple_strto*() are not obsolete because there are cases where they have benefits over kstrto*(). Remove usage of the terms "replacement" and "obsolete" in reference to simple_strto*(), and instead use the term "preferred over". Fixes: 4c925d6031f71 ("kstrto*: add documentation") Fixes: 885e68e8b7b13 ("kernel.h: update comment about simple_strto() functions") Signed-off-by: Kars Mulder Signed-off-by: Andrew Morton Reviewed-by: Andy Shevchenko Cc: Eldad Zack Cc: Miguel Ojeda Cc: Geert Uytterhoeven Cc: Mans Rullgard Cc: Petr Mladek Link: http://lkml.kernel.org/r/29b9-5f234c80-13-4e3aa200@244003027 Signed-off-by: Linus Torvalds --- include/linux/kernel.h | 4 ++-- lib/kstrtox.c | 12 ++++-------- 2 files changed, 6 insertions(+), 10 deletions(-) diff --git a/include/linux/kernel.h b/include/linux/kernel.h index 47b1dad63ffc..92517ae53e7c 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -346,7 +346,7 @@ int __must_check kstrtoll(const char *s, unsigned int base, long long *res); * @res: Where to write the result of the conversion on success. * * Returns 0 on success, -ERANGE on overflow and -EINVAL on parsing error. - * Used as a replacement for the simple_strtoul(). Return code must be checked. + * Preferred over simple_strtoul(). Return code must be checked. */ static inline int __must_check kstrtoul(const char *s, unsigned int base, unsigned long *res) { @@ -374,7 +374,7 @@ static inline int __must_check kstrtoul(const char *s, unsigned int base, unsign * @res: Where to write the result of the conversion on success. * * Returns 0 on success, -ERANGE on overflow and -EINVAL on parsing error. - * Used as a replacement for the simple_strtol(). Return code must be checked. + * Preferred over simple_strtol(). Return code must be checked. */ static inline int __must_check kstrtol(const char *s, unsigned int base, long *res) { diff --git a/lib/kstrtox.c b/lib/kstrtox.c index 252ac414ba9a..a14ccf905055 100644 --- a/lib/kstrtox.c +++ b/lib/kstrtox.c @@ -115,8 +115,7 @@ static int _kstrtoull(const char *s, unsigned int base, unsigned long long *res) * @res: Where to write the result of the conversion on success. * * Returns 0 on success, -ERANGE on overflow and -EINVAL on parsing error. - * Used as a replacement for the obsolete simple_strtoull(). Return code must - * be checked. + * Preferred over simple_strtoull(). Return code must be checked. */ int kstrtoull(const char *s, unsigned int base, unsigned long long *res) { @@ -139,8 +138,7 @@ EXPORT_SYMBOL(kstrtoull); * @res: Where to write the result of the conversion on success. * * Returns 0 on success, -ERANGE on overflow and -EINVAL on parsing error. - * Used as a replacement for the obsolete simple_strtoll(). Return code must - * be checked. + * Preferred over simple_strtoll(). Return code must be checked. */ int kstrtoll(const char *s, unsigned int base, long long *res) { @@ -211,8 +209,7 @@ EXPORT_SYMBOL(_kstrtol); * @res: Where to write the result of the conversion on success. 
* * Returns 0 on success, -ERANGE on overflow and -EINVAL on parsing error. - * Used as a replacement for the obsolete simple_strtoul(). Return code must - * be checked. + * Preferred over simple_strtoul(). Return code must be checked. */ int kstrtouint(const char *s, unsigned int base, unsigned int *res) { @@ -242,8 +239,7 @@ EXPORT_SYMBOL(kstrtouint); * @res: Where to write the result of the conversion on success. * * Returns 0 on success, -ERANGE on overflow and -EINVAL on parsing error. - * Used as a replacement for the obsolete simple_strtol(). Return code must - * be checked. + * Preferred over simple_strtol(). Return code must be checked. */ int kstrtoint(const char *s, unsigned int base, int *res) { From 6d511020e13d5d6f5f0af853e32ddef75c693ccc Mon Sep 17 00:00:00 2001 From: Rikard Falkeborn Date: Tue, 11 Aug 2020 18:35:03 -0700 Subject: [PATCH 090/164] lib/test_bits.c: add tests of GENMASK Add tests of GENMASK and GENMASK_ULL. A few test cases that should fail compilation are provided under #ifdef TEST_GENMASK_FAILURES [rd.dunlap@gmail.com: add MODULE_LICENSE()] Link: http://lkml.kernel.org/r/dfc74524-0789-2827-4eff-476ddab65699@gmail.com [weiyongjun1@huawei.com: make some functions static] Link: http://lkml.kernel.org/r/20200702150336.4756-1-weiyongjun1@huawei.com Suggested-by: Andy Shevchenko Signed-off-by: Rikard Falkeborn Signed-off-by: Randy Dunlap Signed-off-by: Wei Yongjun Signed-off-by: Andrew Morton Reviewed-by: Andy Shevchenko Acked-by: William Breathitt Gray Cc: Emil Velikov Cc: Syed Nayyar Waris Cc: Andy Shevchenko Cc: Arnd Bergmann Cc: Geert Uytterhoeven Cc: Kees Cook Cc: Linus Walleij Cc: Masahiro Yamada Link: http://lkml.kernel.org/r/20200621054210.14804-2-rikard.falkeborn@gmail.com Link: http://lkml.kernel.org/r/20200608221823.35799-2-rikard.falkeborn@gmail.com Signed-off-by: Linus Torvalds --- lib/Kconfig.debug | 11 +++++++ lib/Makefile | 1 + lib/test_bits.c | 75 +++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 87 insertions(+) create mode 100644 lib/test_bits.c diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 5c5fbcb34af4..9ca1b11af1f6 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -2236,6 +2236,17 @@ config LINEAR_RANGES_TEST If unsure, say N. +config BITS_TEST + tristate "KUnit test for bits.h" + depends on KUNIT + help + This builds the bits unit test. + Tests the logic of macros defined in bits.h. + For more information on KUnit and unit tests in general please refer + to the KUnit documentation in Documentation/dev-tools/kunit/. + + If unsure, say N. 
+
 config TEST_UDELAY
 	tristate "udelay test driver"
 	help
diff --git a/lib/Makefile b/lib/Makefile
index 9d1fd82ea145..e290fc5707ea 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -342,3 +342,4 @@ obj-$(CONFIG_PLDMFW) += pldmfw/
 # KUnit tests
 obj-$(CONFIG_LIST_KUNIT_TEST) += list-test.o
 obj-$(CONFIG_LINEAR_RANGES_TEST) += test_linear_ranges.o
+obj-$(CONFIG_BITS_TEST) += test_bits.o
diff --git a/lib/test_bits.c b/lib/test_bits.c
new file mode 100644
index 000000000000..c9368a2314e7
--- /dev/null
+++ b/lib/test_bits.c
@@ -0,0 +1,75 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Test cases for functions and macros in bits.h
+ */
+
+#include <kunit/test.h>
+#include <linux/bits.h>
+
+
+static void genmask_test(struct kunit *test)
+{
+	KUNIT_EXPECT_EQ(test, 1ul, GENMASK(0, 0));
+	KUNIT_EXPECT_EQ(test, 3ul, GENMASK(1, 0));
+	KUNIT_EXPECT_EQ(test, 6ul, GENMASK(2, 1));
+	KUNIT_EXPECT_EQ(test, 0xFFFFFFFFul, GENMASK(31, 0));
+
+#ifdef TEST_GENMASK_FAILURES
+	/* these should fail compilation */
+	GENMASK(0, 1);
+	GENMASK(0, 10);
+	GENMASK(9, 10);
+#endif
+
+
+}
+
+static void genmask_ull_test(struct kunit *test)
+{
+	KUNIT_EXPECT_EQ(test, 1ull, GENMASK_ULL(0, 0));
+	KUNIT_EXPECT_EQ(test, 3ull, GENMASK_ULL(1, 0));
+	KUNIT_EXPECT_EQ(test, 0x000000ffffe00000ull, GENMASK_ULL(39, 21));
+	KUNIT_EXPECT_EQ(test, 0xffffffffffffffffull, GENMASK_ULL(63, 0));
+
+#ifdef TEST_GENMASK_FAILURES
+	/* these should fail compilation */
+	GENMASK_ULL(0, 1);
+	GENMASK_ULL(0, 10);
+	GENMASK_ULL(9, 10);
+#endif
+}
+
+static void genmask_input_check_test(struct kunit *test)
+{
+	unsigned int x, y;
+	int z, w;
+
+	/* Unknown input */
+	KUNIT_EXPECT_EQ(test, 0, GENMASK_INPUT_CHECK(x, 0));
+	KUNIT_EXPECT_EQ(test, 0, GENMASK_INPUT_CHECK(0, x));
+	KUNIT_EXPECT_EQ(test, 0, GENMASK_INPUT_CHECK(x, y));
+
+	KUNIT_EXPECT_EQ(test, 0, GENMASK_INPUT_CHECK(z, 0));
+	KUNIT_EXPECT_EQ(test, 0, GENMASK_INPUT_CHECK(0, z));
+	KUNIT_EXPECT_EQ(test, 0, GENMASK_INPUT_CHECK(z, w));
+
+	/* Valid input */
+	KUNIT_EXPECT_EQ(test, 0, GENMASK_INPUT_CHECK(1, 1));
+	KUNIT_EXPECT_EQ(test, 0, GENMASK_INPUT_CHECK(39, 21));
+}
+
+
+static struct kunit_case bits_test_cases[] = {
+	KUNIT_CASE(genmask_test),
+	KUNIT_CASE(genmask_ull_test),
+	KUNIT_CASE(genmask_input_check_test),
+	{}
+};
+
+static struct kunit_suite bits_test_suite = {
+	.name = "bits-test",
+	.test_cases = bits_test_cases,
+};
+kunit_test_suite(bits_test_suite);
+
+MODULE_LICENSE("GPL");

From 50161266973bcc662e969e63d68fc7bff71de21b Mon Sep 17 00:00:00 2001
From: Joe Perches
Date: Tue, 11 Aug 2020 18:35:07 -0700
Subject: [PATCH 091/164] checkpatch: add test for possible misuse of
 IS_ENABLED() without CONFIG_<FOO>

IS_ENABLED is almost always used with CONFIG_<FOO> defines.  Add a test
to verify that the #define being tested starts with CONFIG_.

Signed-off-by: Joe Perches
Signed-off-by: Andrew Morton
Reviewed-by: Kees Cook
Link: http://lkml.kernel.org/r/e7fda760b91b769ba82844ba282d432c0d26d709.camel@perches.com
Signed-off-by: Linus Torvalds
---
 scripts/checkpatch.pl | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index 599b8c4933a7..ba9ab2b42d73 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -6465,6 +6465,12 @@ sub process {
 		}
 	}
 
+# check for IS_ENABLED() without CONFIG_<FOO> ($rawline for comments too)
+		if ($rawline =~ /\bIS_ENABLED\s*\(\s*(\w+)\s*\)/ && $1 !~ /^CONFIG_/) {
+			WARN("IS_ENABLED_CONFIG",
+			     "IS_ENABLED($1) is normally used as IS_ENABLED(CONFIG_$1)\n" . $herecurr);
+		}
+
 # check for #if defined CONFIG_<FOO> || defined CONFIG_<FOO>_MODULE
 	if ($line =~ /^\+\s*#\s*if\s+defined(?:\s*\(?\s*|\s+)(CONFIG_[A-Z_]+)\s*\)?\s*\|\|\s*defined(?:\s*\(?\s*|\s+)\1_MODULE\s*\)?\s*$/) {
 		my $config = $1;

From 65b64b3bec3fa74937fdc18e5ccf10a48ee3a7d3 Mon Sep 17 00:00:00 2001
From: Joe Perches
Date: Tue, 11 Aug 2020 18:35:10 -0700
Subject: [PATCH 092/164] checkpatch: add --fix option for ASSIGN_IN_IF

Add a --fix option for 2 types of single-line assignment in if statements

	if ((foo = bar(...)) < BAZ) {
expands to:
	foo = bar(...);
	if (foo < BAZ) {

and

	if ((foo = bar(...))) {
expands to:
	foo = bar(...);
	if (foo) {

if statements with assignments spanning multiple lines are not
converted with the --fix option.

if statements with additional logic are also not converted.  e.g.:

	if ((foo = bar(...)) & BAZ == BAZ) {

Signed-off-by: Joe Perches
Signed-off-by: Andrew Morton
Cc: Julia Lawall
Link: http://lkml.kernel.org/r/9bc7c782516f37948f202deba511bc95ed279bbd.camel@perches.com
Signed-off-by: Linus Torvalds
---
 scripts/checkpatch.pl | 26 ++++++++++++++++++++++++--
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index ba9ab2b42d73..5cb791eb4c6f 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -5020,8 +5020,30 @@ sub process {
 		my ($s, $c) = ($stat, $cond);
 
 		if ($c =~ /\bif\s*\(.*[^<>!=]=[^=].*/s) {
-			ERROR("ASSIGN_IN_IF",
-			      "do not use assignment in if condition\n" . $herecurr);
+			if (ERROR("ASSIGN_IN_IF",
+				  "do not use assignment in if condition\n" . $herecurr) &&
+			    $fix && $perl_version_ok) {
+				if ($rawline =~ /^\+(\s+)if\s*\(\s*(\!)?\s*\(\s*(($Lval)\s*=\s*$LvalOrFunc)\s*\)\s*(?:($Compare)\s*($FuncArg))?\s*\)\s*(\{)?\s*$/) {
+					my $space = $1;
+					my $not = $2;
+					my $statement = $3;
+					my $assigned = $4;
+					my $test = $8;
+					my $against = $9;
+					my $brace = $15;
+					fix_delete_line($fixlinenr, $rawline);
+					fix_insert_line($fixlinenr, "$space$statement;");
+					my $newline = "${space}if (";
+					$newline .= '!' if defined($not);
+					$newline .= '(' if (defined $not && defined($test) && defined($against));
+					$newline .= "$assigned";
+					$newline .= " $test $against" if (defined($test) && defined($against));
+					$newline .= ')' if (defined $not && defined($test) && defined($against));
+					$newline .= ')';
+					$newline .= " {" if (defined($brace));
+					fix_insert_line($fixlinenr + 1, $newline);
+				}
+			}
 		}
 
 # Find out what is on the end of the line after the

From ced69da1db0b57bbb10054bd74201babc094f408 Mon Sep 17 00:00:00 2001
From: Quentin Monnet
Date: Tue, 11 Aug 2020 18:35:13 -0700
Subject: [PATCH 093/164] checkpatch: fix CONST_STRUCT when
 const_structs.checkpatch is missing

Checkpatch reports warnings when some specific structs are not declared
as const in the code.  The list of structs to consider was initially
defined in the checkpatch.pl script itself, but it was later moved to an
external file (scripts/const_structs.checkpatch), in commit bf1fa1dae68e
("checkpatch: externalize the structs that should be const").  This
introduced two minor issues:

- When file scripts/const_structs.checkpatch is not present (for
  example, if checkpatch is run outside of the kernel directory with
  the "--no-tree" option), a warning is printed to stderr to tell the
  user that "No structs that should be const will be found".  This is
  fair, but the warning is printed unconditionally, even if the option
  "--ignore CONST_STRUCT" is passed.  In the latter case, we explicitly
  ask checkpatch to skip this check, so no warning should be printed.
- When scripts/const_structs.checkpatch is missing, or even when trying to silence the warning by adding an empty file, $const_structs is set to "", and the regex used for finding structs that should be const, "$line =~ /struct\s+($const_structs)(?!\s*\{)/)", matches all structs found in the code, thus reporting a number of false positives. Let's fix the first item by skipping scripts/const_structs.checkpatch processing if "CONST_STRUCT" checks are ignored, and the second one by skipping the test if $const_structs is not defined. Since we modify the read_words() function a little bit, update the checks for $typedefsfile/$typeOtherTypedefs as well. Signed-off-by: Quentin Monnet Signed-off-by: Andrew Morton Acked-by: Joe Perches Link: http://lkml.kernel.org/r/20200623221822.3727-1-quentin@isovalent.com Signed-off-by: Linus Torvalds --- scripts/checkpatch.pl | 21 ++++++++++++--------- 1 file changed, 12 insertions(+), 9 deletions(-) diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl index 5cb791eb4c6f..1abddf8fe56e 100755 --- a/scripts/checkpatch.pl +++ b/scripts/checkpatch.pl @@ -59,7 +59,7 @@ my $spelling_file = "$D/spelling.txt"; my $codespell = 0; my $codespellfile = "/usr/share/codespell/dictionary.txt"; my $conststructsfile = "$D/const_structs.checkpatch"; -my $typedefsfile = ""; +my $typedefsfile; my $color = "auto"; my $allow_c99_comments = 1; # Can be overridden by --ignore C99_COMMENT_TOLERANCE # git output parsing needs US English output, so first set backtick child process LANGUAGE @@ -756,7 +756,7 @@ sub read_words { next; } - $$wordsRef .= '|' if ($$wordsRef ne ""); + $$wordsRef .= '|' if (defined $$wordsRef); $$wordsRef .= $line; } close($file); @@ -766,16 +766,18 @@ sub read_words { return 0; } -my $const_structs = ""; -read_words(\$const_structs, $conststructsfile) - or warn "No structs that should be const will be found - file '$conststructsfile': $!\n"; +my $const_structs; +if (show_type("CONST_STRUCT")) { + read_words(\$const_structs, $conststructsfile) + or warn "No structs that should be const will be found - file '$conststructsfile': $!\n"; +} -my $typeOtherTypedefs = ""; -if (length($typedefsfile)) { +if (defined($typedefsfile)) { + my $typeOtherTypedefs; read_words(\$typeOtherTypedefs, $typedefsfile) or warn "No additional types will be considered - file '$typedefsfile': $!\n"; + $typeTypedefs .= '|' . $typeOtherTypedefs if (defined $typeOtherTypedefs); } -$typeTypedefs .= '|' . $typeOtherTypedefs if ($typeOtherTypedefs ne ""); sub build_types { my $mods = "(?x: \n" . join("|\n ", (@modifierList, @modifierListFile)) . "\n)"; @@ -6643,7 +6645,8 @@ sub process { # check for various structs that are normally const (ops, kgdb, device_tree) # and avoid what seem like struct definitions 'struct foo {' - if ($line !~ /\bconst\b/ && + if (defined($const_structs) && + $line !~ /\bconst\b/ && $line =~ /\bstruct\s+($const_structs)\b(?!\s*\{)/) { WARN("CONST_STRUCT", "struct $1 should normally be const\n" . 
			     $herecurr);

From 1a3dcf2e6b35faa1176b9cd8200094fbce16ba19 Mon Sep 17 00:00:00 2001
From: Joe Perches
Date: Tue, 11 Aug 2020 18:35:16 -0700
Subject: [PATCH 094/164] checkpatch: add test for repeated words

Try to avoid adding repeated words either on the same line or
consecutive comment lines in a block

e.g.:

duplicated word in comment block

	/*
	 * this is a comment block where the last word of the previous
	 * previous line is also the first word of the next line
	 */

and simple duplication

	/* test this this again */

Signed-off-by: Joe Perches
Signed-off-by: Andrew Morton
Link: http://lkml.kernel.org/r/cda9b566ad67976e1acd62b053de50ee44a57250.camel@perches.com
Inspired-by: Randy Dunlap
Signed-off-by: Linus Torvalds
---
 scripts/checkpatch.pl | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index 1abddf8fe56e..114af54c3194 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -588,6 +588,8 @@ our @mode_permission_funcs = (
 	["__ATTR", 2],
 );
 
+my $word_pattern = '\b[A-Z]?[a-z]{2,}\b';
+
 #Create a search pattern for all these functions to speed up a loop below
 our $mode_perms_search = "";
 foreach my $entry (@mode_permission_funcs) {
@@ -3312,6 +3314,42 @@ sub process {
 		}
 	}
 
+# check for repeated words separated by a single space
+	if ($rawline =~ /^\+/) {
+		while ($rawline =~ /\b($word_pattern) (?=($word_pattern))/g) {
+
+			my $first = $1;
+			my $second = $2;
+
+			if ($first =~ /(?:struct|union|enum)/) {
+				pos($rawline) += length($first) + length($second) + 1;
+				next;
+			}
+
+			next if ($first ne $second);
+			next if ($first eq 'long');
+
+			if (WARN("REPEATED_WORD",
+				 "Possible repeated word: '$first'\n" . $herecurr) &&
+			    $fix) {
+				$fixed[$fixlinenr] =~ s/\b$first $second\b/$first/;
+			}
+		}
+
+		# if it's a repeated word on consecutive lines in a comment block
+		if ($prevline =~ /$;+\s*$/ &&
+		    $prevrawline =~ /($word_pattern)\s*$/) {
+			my $last_word = $1;
+			if ($rawline =~ /^\+\s*\*\s*$last_word /) {
+				if (WARN("REPEATED_WORD",
+					 "Possible repeated word: '$last_word'\n" . $hereprev) &&
+				    $fix) {
+					$fixed[$fixlinenr] =~ s/(\+\s*\*\s*)$last_word /$1/;
+				}
+			}
+		}
+	}
+
 # check for space before tabs.
 	if ($rawline =~ /^\+/ && $rawline =~ / \t/) {
 		my $herevet = "$here\n" . cat_vet($rawline) . "\n";

From ef3c005c0eb07a60949191bc6ee407d5f43cc502 Mon Sep 17 00:00:00 2001
From: Joe Perches
Date: Tue, 11 Aug 2020 18:35:19 -0700
Subject: [PATCH 095/164] checkpatch: remove missing switch/case break test

This test doesn't work well and newer compilers are much better at
emitting this warning.
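
As a minimal sketch (hypothetical names, not taken from any report), the
construct the removed test tried to catch is now diagnosed directly by
-Wimplicit-fallthrough, with intentional cases annotated using the
kernel's fallthrough pseudo-keyword:

	switch (cmd) {
	case CMD_PREPARE:
		prepare();
		/* intentional: silences -Wimplicit-fallthrough */
		fallthrough;
	case CMD_RUN:
		run();
		break;
	default:
		return -EINVAL;
	}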
Signed-off-by: Joe Perches Signed-off-by: Andrew Morton Cc: Cambda Zhu Link: http://lkml.kernel.org/r/7e25090c79f6a69d502ab8219863300790192fe2.camel@perches.com Signed-off-by: Linus Torvalds --- scripts/checkpatch.pl | 25 ------------------------- 1 file changed, 25 deletions(-) diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl index 114af54c3194..2cbeae6d9aee 100755 --- a/scripts/checkpatch.pl +++ b/scripts/checkpatch.pl @@ -6543,31 +6543,6 @@ sub process { } } -# check for case / default statements not preceded by break/fallthrough/switch - if ($line =~ /^.\s*(?:case\s+(?:$Ident|$Constant)\s*|default):/) { - my $has_break = 0; - my $has_statement = 0; - my $count = 0; - my $prevline = $linenr; - while ($prevline > 1 && ($file || $count < 3) && !$has_break) { - $prevline--; - my $rline = $rawlines[$prevline - 1]; - my $fline = $lines[$prevline - 1]; - last if ($fline =~ /^\@\@/); - next if ($fline =~ /^\-/); - next if ($fline =~ /^.(?:\s*(?:case\s+(?:$Ident|$Constant)[\s$;]*|default):[\s$;]*)*$/); - $has_break = 1 if ($rline =~ /fall[\s_-]*(through|thru)/i); - next if ($fline =~ /^.[\s$;]*$/); - $has_statement = 1; - $count++; - $has_break = 1 if ($fline =~ /\bswitch\b|\b(?:break\s*;[\s$;]*$|exit\s*\(\b|return\b|goto\b|continue\b)/); - } - if (!$has_break && $has_statement) { - WARN("MISSING_BREAK", - "Possible switch case/default not preceded by break or fallthrough comment\n" . $herecurr); - } - } - # check for /* fallthrough */ like comment, prefer fallthrough; my @fallthroughs = ( 'fallthrough', From 2fb3244f0a58ceb3d866ac63f644dfa31cae430f Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Tue, 11 Aug 2020 18:35:21 -0700 Subject: [PATCH 096/164] autofs: fix doubled word Change doubled word "is" to "it is". Signed-off-by: Randy Dunlap Signed-off-by: Andrew Morton Acked-by: Ian Kent Link: http://lkml.kernel.org/r/5a82befd-40f8-8dc0-3498-cbc0436cad9b@infradead.org Signed-off-by: Linus Torvalds --- include/uapi/linux/auto_dev-ioctl.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/uapi/linux/auto_dev-ioctl.h b/include/uapi/linux/auto_dev-ioctl.h index 374742651c30..62e625356dc8 100644 --- a/include/uapi/linux/auto_dev-ioctl.h +++ b/include/uapi/linux/auto_dev-ioctl.h @@ -82,7 +82,7 @@ struct args_ismountpoint { /* * All the ioctls use this structure. * When sending a path size must account for the total length - * of the chunk of memory otherwise is is the size of the + * of the chunk of memory otherwise it is the size of the * structure. */ From da27e0a0e5f655f0d58d4e153c3182bb2b290f64 Mon Sep 17 00:00:00 2001 From: Eric Biggers Date: Tue, 11 Aug 2020 18:35:24 -0700 Subject: [PATCH 097/164] fs/minix: check return value of sb_getblk() Patch series "fs/minix: fix syzbot bugs and set s_maxbytes". This series fixes all syzbot bugs in the minix filesystem: KASAN: null-ptr-deref Write in get_block KASAN: use-after-free Write in get_block KASAN: use-after-free Read in get_block WARNING in inc_nlink KMSAN: uninit-value in get_block WARNING in drop_nlink It also fixes the minix filesystem to set s_maxbytes correctly, so that userspace sees the correct behavior when exceeding the max file size. This patch (of 6): sb_getblk() can fail, so check its return value. This fixes a NULL pointer dereference. Originally from Qiujun Huang. 
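
The calling pattern, in a condensed illustrative sketch (sb_getblk() and
the buffer helpers are real kernel APIs; the surrounding names are made
up), looks like:

	struct buffer_head *bh;

	bh = sb_getblk(inode->i_sb, blocknr);
	if (!bh)		/* allocation failure, not an I/O error */
		return -ENOMEM;
	lock_buffer(bh);
	memset(bh->b_data, 0, bh->b_size);
	set_buffer_uptodate(bh);
	unlock_buffer(bh);
	brelse(bh);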
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: syzbot+4a88b2b9dc280f47baf4@syzkaller.appspotmail.com Signed-off-by: Eric Biggers Signed-off-by: Andrew Morton Cc: Qiujun Huang Cc: Alexander Viro Cc: Link: http://lkml.kernel.org/r/20200628060846.682158-1-ebiggers@kernel.org Link: http://lkml.kernel.org/r/20200628060846.682158-2-ebiggers@kernel.org Signed-off-by: Linus Torvalds --- fs/minix/itree_common.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/fs/minix/itree_common.c b/fs/minix/itree_common.c index 043c3fdbc8e7..446148792f41 100644 --- a/fs/minix/itree_common.c +++ b/fs/minix/itree_common.c @@ -75,6 +75,7 @@ static int alloc_branch(struct inode *inode, int n = 0; int i; int parent = minix_new_block(inode); + int err = -ENOSPC; branch[0].key = cpu_to_block(parent); if (parent) for (n = 1; n < num; n++) { @@ -85,6 +86,11 @@ static int alloc_branch(struct inode *inode, break; branch[n].key = cpu_to_block(nr); bh = sb_getblk(inode->i_sb, parent); + if (!bh) { + minix_free_block(inode, nr); + err = -ENOMEM; + break; + } lock_buffer(bh); memset(bh->b_data, 0, bh->b_size); branch[n].bh = bh; @@ -103,7 +109,7 @@ static int alloc_branch(struct inode *inode, bforget(branch[i].bh); for (i = 0; i < n; i++) minix_free_block(inode, block_to_cpu(branch[i].key)); - return -ENOSPC; + return err; } static inline int splice_branch(struct inode *inode, From facb03dddec04e4aac1bb2139accdceb04deb1f3 Mon Sep 17 00:00:00 2001 From: Eric Biggers Date: Tue, 11 Aug 2020 18:35:27 -0700 Subject: [PATCH 098/164] fs/minix: don't allow getting deleted inodes If an inode has no links, we need to mark it bad rather than allowing it to be accessed. This avoids WARNINGs in inc_nlink() and drop_nlink() when doing directory operations on a fuzzed filesystem. 
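
A condensed sketch of the failure path (simplified, not literal code
from the reports): a crafted image contains a directory entry pointing
at an on-disk inode whose i_nlinks is 0, so after iget succeeds, a later
link operation trips the zero-link warning:

	inode = minix_iget(dir->i_sb, ino);	/* used to succeed */
	...
	inc_nlink(inode);	/* WARNs: raising a zero link count */

Returning ERR_PTR(-ESTALE) from the iget path keeps such inodes from
ever reaching the link-count helpers.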
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: syzbot+a9ac3de1b5de5fb10efc@syzkaller.appspotmail.com Reported-by: syzbot+df958cf5688a96ad3287@syzkaller.appspotmail.com Signed-off-by: Eric Biggers Signed-off-by: Andrew Morton Cc: Alexander Viro Cc: Qiujun Huang Cc: Link: http://lkml.kernel.org/r/20200628060846.682158-3-ebiggers@kernel.org Signed-off-by: Linus Torvalds --- fs/minix/inode.c | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/fs/minix/inode.c b/fs/minix/inode.c index 7cb5fd38eb14..2bca95abe8f4 100644 --- a/fs/minix/inode.c +++ b/fs/minix/inode.c @@ -468,6 +468,13 @@ static struct inode *V1_minix_iget(struct inode *inode) iget_failed(inode); return ERR_PTR(-EIO); } + if (raw_inode->i_nlinks == 0) { + printk("MINIX-fs: deleted inode referenced: %lu\n", + inode->i_ino); + brelse(bh); + iget_failed(inode); + return ERR_PTR(-ESTALE); + } inode->i_mode = raw_inode->i_mode; i_uid_write(inode, raw_inode->i_uid); i_gid_write(inode, raw_inode->i_gid); @@ -501,6 +508,13 @@ static struct inode *V2_minix_iget(struct inode *inode) iget_failed(inode); return ERR_PTR(-EIO); } + if (raw_inode->i_nlinks == 0) { + printk("MINIX-fs: deleted inode referenced: %lu\n", + inode->i_ino); + brelse(bh); + iget_failed(inode); + return ERR_PTR(-ESTALE); + } inode->i_mode = raw_inode->i_mode; i_uid_write(inode, raw_inode->i_uid); i_gid_write(inode, raw_inode->i_gid); From 270ef41094e9fa95273f288d7d785313ceab2ff3 Mon Sep 17 00:00:00 2001 From: Eric Biggers Date: Tue, 11 Aug 2020 18:35:30 -0700 Subject: [PATCH 099/164] fs/minix: reject too-large maximum file size If the minix filesystem tries to map a very large logical block number to its on-disk location, block_to_path() can return offsets that are too large, causing out-of-bounds memory accesses when accessing indirect index blocks. This should be prevented by the check against the maximum file size, but this doesn't work because the maximum file size is read directly from the on-disk superblock and isn't validated itself. Fix this by validating the maximum file size at mount time. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: syzbot+c7d9ec7a1a7272dd71b3@syzkaller.appspotmail.com Reported-by: syzbot+3b7b03a0c28948054fb5@syzkaller.appspotmail.com Reported-by: syzbot+6e056ee473568865f3e6@syzkaller.appspotmail.com Signed-off-by: Eric Biggers Signed-off-by: Andrew Morton Cc: Alexander Viro Cc: Qiujun Huang Cc: Link: http://lkml.kernel.org/r/20200628060846.682158-4-ebiggers@kernel.org Signed-off-by: Linus Torvalds --- fs/minix/inode.c | 22 ++++++++++++++++++++-- 1 file changed, 20 insertions(+), 2 deletions(-) diff --git a/fs/minix/inode.c b/fs/minix/inode.c index 2bca95abe8f4..0dd929346f3f 100644 --- a/fs/minix/inode.c +++ b/fs/minix/inode.c @@ -150,6 +150,23 @@ static int minix_remount (struct super_block * sb, int * flags, char * data) return 0; } +static bool minix_check_superblock(struct minix_sb_info *sbi) +{ + if (sbi->s_imap_blocks == 0 || sbi->s_zmap_blocks == 0) + return false; + + /* + * s_max_size must not exceed the block mapping limitation. This check + * is only needed for V1 filesystems, since V2/V3 support an extra level + * of indirect blocks which places the limit well above U32_MAX. 
+ */ + if (sbi->s_version == MINIX_V1 && + sbi->s_max_size > (7 + 512 + 512*512) * BLOCK_SIZE) + return false; + + return true; +} + static int minix_fill_super(struct super_block *s, void *data, int silent) { struct buffer_head *bh; @@ -228,11 +245,12 @@ static int minix_fill_super(struct super_block *s, void *data, int silent) } else goto out_no_fs; + if (!minix_check_superblock(sbi)) + goto out_illegal_sb; + /* * Allocate the buffer map to keep the superblock small. */ - if (sbi->s_imap_blocks == 0 || sbi->s_zmap_blocks == 0) - goto out_illegal_sb; i = (sbi->s_imap_blocks + sbi->s_zmap_blocks) * sizeof(bh); map = kzalloc(i, GFP_KERNEL); if (!map) From 32ac86efff91a3e4ef8c3d1cadd4559e23c8e73a Mon Sep 17 00:00:00 2001 From: Eric Biggers Date: Tue, 11 Aug 2020 18:35:33 -0700 Subject: [PATCH 100/164] fs/minix: set s_maxbytes correctly The minix filesystem leaves super_block::s_maxbytes at MAX_NON_LFS rather than setting it to the actual filesystem-specific limit. This is broken because it means userspace doesn't see the standard behavior like getting EFBIG and SIGXFSZ when exceeding the maximum file size. Fix this by setting s_maxbytes correctly. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Biggers Signed-off-by: Andrew Morton Cc: Alexander Viro Cc: Qiujun Huang Link: http://lkml.kernel.org/r/20200628060846.682158-5-ebiggers@kernel.org Signed-off-by: Linus Torvalds --- fs/minix/inode.c | 12 +++++++----- fs/minix/itree_v1.c | 2 +- fs/minix/itree_v2.c | 3 +-- fs/minix/minix.h | 1 - 4 files changed, 9 insertions(+), 9 deletions(-) diff --git a/fs/minix/inode.c b/fs/minix/inode.c index 0dd929346f3f..7b09a9158e40 100644 --- a/fs/minix/inode.c +++ b/fs/minix/inode.c @@ -150,8 +150,10 @@ static int minix_remount (struct super_block * sb, int * flags, char * data) return 0; } -static bool minix_check_superblock(struct minix_sb_info *sbi) +static bool minix_check_superblock(struct super_block *sb) { + struct minix_sb_info *sbi = minix_sb(sb); + if (sbi->s_imap_blocks == 0 || sbi->s_zmap_blocks == 0) return false; @@ -161,7 +163,7 @@ static bool minix_check_superblock(struct minix_sb_info *sbi) * of indirect blocks which places the limit well above U32_MAX. 
*/ if (sbi->s_version == MINIX_V1 && - sbi->s_max_size > (7 + 512 + 512*512) * BLOCK_SIZE) + sb->s_maxbytes > (7 + 512 + 512*512) * BLOCK_SIZE) return false; return true; @@ -202,7 +204,7 @@ static int minix_fill_super(struct super_block *s, void *data, int silent) sbi->s_zmap_blocks = ms->s_zmap_blocks; sbi->s_firstdatazone = ms->s_firstdatazone; sbi->s_log_zone_size = ms->s_log_zone_size; - sbi->s_max_size = ms->s_max_size; + s->s_maxbytes = ms->s_max_size; s->s_magic = ms->s_magic; if (s->s_magic == MINIX_SUPER_MAGIC) { sbi->s_version = MINIX_V1; @@ -233,7 +235,7 @@ static int minix_fill_super(struct super_block *s, void *data, int silent) sbi->s_zmap_blocks = m3s->s_zmap_blocks; sbi->s_firstdatazone = m3s->s_firstdatazone; sbi->s_log_zone_size = m3s->s_log_zone_size; - sbi->s_max_size = m3s->s_max_size; + s->s_maxbytes = m3s->s_max_size; sbi->s_ninodes = m3s->s_ninodes; sbi->s_nzones = m3s->s_zones; sbi->s_dirsize = 64; @@ -245,7 +247,7 @@ static int minix_fill_super(struct super_block *s, void *data, int silent) } else goto out_no_fs; - if (!minix_check_superblock(sbi)) + if (!minix_check_superblock(s)) goto out_illegal_sb; /* diff --git a/fs/minix/itree_v1.c b/fs/minix/itree_v1.c index 046cc96ee7ad..c0d418209ead 100644 --- a/fs/minix/itree_v1.c +++ b/fs/minix/itree_v1.c @@ -29,7 +29,7 @@ static int block_to_path(struct inode * inode, long block, int offsets[DEPTH]) if (block < 0) { printk("MINIX-fs: block_to_path: block %ld < 0 on dev %pg\n", block, inode->i_sb->s_bdev); - } else if (block >= (minix_sb(inode->i_sb)->s_max_size/BLOCK_SIZE)) { + } else if (block >= inode->i_sb->s_maxbytes/BLOCK_SIZE) { if (printk_ratelimit()) printk("MINIX-fs: block_to_path: " "block %ld too big on dev %pg\n", diff --git a/fs/minix/itree_v2.c b/fs/minix/itree_v2.c index f7fc7ecccccc..ee8af2f9e282 100644 --- a/fs/minix/itree_v2.c +++ b/fs/minix/itree_v2.c @@ -32,8 +32,7 @@ static int block_to_path(struct inode * inode, long block, int offsets[DEPTH]) if (block < 0) { printk("MINIX-fs: block_to_path: block %ld < 0 on dev %pg\n", block, sb->s_bdev); - } else if ((u64)block * (u64)sb->s_blocksize >= - minix_sb(sb)->s_max_size) { + } else if ((u64)block * (u64)sb->s_blocksize >= sb->s_maxbytes) { if (printk_ratelimit()) printk("MINIX-fs: block_to_path: " "block %ld too big on dev %pg\n", diff --git a/fs/minix/minix.h b/fs/minix/minix.h index df081e8afcc3..168d45d3de73 100644 --- a/fs/minix/minix.h +++ b/fs/minix/minix.h @@ -32,7 +32,6 @@ struct minix_sb_info { unsigned long s_zmap_blocks; unsigned long s_firstdatazone; unsigned long s_log_zone_size; - unsigned long s_max_size; int s_dirsize; int s_namelen; struct buffer_head ** s_imap; From 0a12c4a8069607247cb8edc3b035a664e636fd9a Mon Sep 17 00:00:00 2001 From: Eric Biggers Date: Tue, 11 Aug 2020 18:35:36 -0700 Subject: [PATCH 101/164] fs/minix: fix block limit check for V1 filesystems The minix filesystem reads its maximum file size from its on-disk superblock. This value isn't necessarily a multiple of the block size. When it's not, the V1 block mapping code doesn't allow mapping the last possible block. Commit 6ed6a722f9ab ("minixfs: fix block limit check") fixed this in the V2 mapping code. Fix it in the V1 mapping code too. 
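
With hypothetical numbers (purely for illustration): take BLOCK_SIZE ==
1024 and an on-disk maximum file size of 7680 bytes, which is not a
multiple of the block size.  For the last, partial block, block 7:

	old: block >= s_maxbytes/BLOCK_SIZE	->    7 >= 7	(rejected)
	new: block*BLOCK_SIZE >= s_maxbytes	-> 7168 >= 7680	(allowed)

Integer truncation in the old division is what lost the final block.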
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Biggers Signed-off-by: Andrew Morton Cc: Alexander Viro Cc: Qiujun Huang Link: http://lkml.kernel.org/r/20200628060846.682158-6-ebiggers@kernel.org Signed-off-by: Linus Torvalds --- fs/minix/itree_v1.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/minix/itree_v1.c b/fs/minix/itree_v1.c index c0d418209ead..405573a79aab 100644 --- a/fs/minix/itree_v1.c +++ b/fs/minix/itree_v1.c @@ -29,7 +29,7 @@ static int block_to_path(struct inode * inode, long block, int offsets[DEPTH]) if (block < 0) { printk("MINIX-fs: block_to_path: block %ld < 0 on dev %pg\n", block, inode->i_sb->s_bdev); - } else if (block >= inode->i_sb->s_maxbytes/BLOCK_SIZE) { + } else if ((u64)block * BLOCK_SIZE >= inode->i_sb->s_maxbytes) { if (printk_ratelimit()) printk("MINIX-fs: block_to_path: " "block %ld too big on dev %pg\n", From f666f9fb9a36f1c833b9d18923572f0e4d304754 Mon Sep 17 00:00:00 2001 From: Eric Biggers Date: Tue, 11 Aug 2020 18:35:39 -0700 Subject: [PATCH 102/164] fs/minix: remove expected error message in block_to_path() When truncating a file to a size within the last allowed logical block, block_to_path() is called with the *next* block. This exceeds the limit, causing the "block %ld too big" error message to be printed. This case isn't actually an error; there are just no more blocks past that point. So, remove this error message. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Biggers Signed-off-by: Andrew Morton Cc: Alexander Viro Cc: Qiujun Huang Link: http://lkml.kernel.org/r/20200628060846.682158-7-ebiggers@kernel.org Signed-off-by: Linus Torvalds --- fs/minix/itree_v1.c | 12 ++++++------ fs/minix/itree_v2.c | 12 ++++++------ 2 files changed, 12 insertions(+), 12 deletions(-) diff --git a/fs/minix/itree_v1.c b/fs/minix/itree_v1.c index 405573a79aab..1fed906042aa 100644 --- a/fs/minix/itree_v1.c +++ b/fs/minix/itree_v1.c @@ -29,12 +29,12 @@ static int block_to_path(struct inode * inode, long block, int offsets[DEPTH]) if (block < 0) { printk("MINIX-fs: block_to_path: block %ld < 0 on dev %pg\n", block, inode->i_sb->s_bdev); - } else if ((u64)block * BLOCK_SIZE >= inode->i_sb->s_maxbytes) { - if (printk_ratelimit()) - printk("MINIX-fs: block_to_path: " - "block %ld too big on dev %pg\n", - block, inode->i_sb->s_bdev); - } else if (block < 7) { + return 0; + } + if ((u64)block * BLOCK_SIZE >= inode->i_sb->s_maxbytes) + return 0; + + if (block < 7) { offsets[n++] = block; } else if ((block -= 7) < 512) { offsets[n++] = 7; diff --git a/fs/minix/itree_v2.c b/fs/minix/itree_v2.c index ee8af2f9e282..9d00f31a2d9d 100644 --- a/fs/minix/itree_v2.c +++ b/fs/minix/itree_v2.c @@ -32,12 +32,12 @@ static int block_to_path(struct inode * inode, long block, int offsets[DEPTH]) if (block < 0) { printk("MINIX-fs: block_to_path: block %ld < 0 on dev %pg\n", block, sb->s_bdev); - } else if ((u64)block * (u64)sb->s_blocksize >= sb->s_maxbytes) { - if (printk_ratelimit()) - printk("MINIX-fs: block_to_path: " - "block %ld too big on dev %pg\n", - block, sb->s_bdev); - } else if (block < DIRCOUNT) { + return 0; + } + if ((u64)block * (u64)sb->s_blocksize >= sb->s_maxbytes) + return 0; + + if (block < DIRCOUNT) { offsets[n++] = block; } else if ((block -= DIRCOUNT) < INDIRCOUNT(sb)) { offsets[n++] = DIRCOUNT; From 1b0e31861d98920c3823c7519ae762499811df00 Mon Sep 17 00:00:00 2001 From: Eric Biggers Date: Tue, 11 Aug 2020 18:35:43 -0700 Subject: [PATCH 103/164] nilfs2: only call unlock_new_inode() if I_NEW Patch series "nilfs2 
updates". This patch (of 3): unlock_new_inode() is only meant to be called after a new inode has already been inserted into the hash table. But nilfs_new_inode() can call it even before it has inserted the inode, triggering the WARNING in unlock_new_inode(). Fix this by only calling unlock_new_inode() if the inode has the I_NEW flag set, indicating that it's in the table. Signed-off-by: Eric Biggers Signed-off-by: Ryusuke Konishi Signed-off-by: Andrew Morton Link: http://lkml.kernel.org/r/1595860111-3920-1-git-send-email-konishi.ryusuke@gmail.com Link: http://lkml.kernel.org/r/1595860111-3920-2-git-send-email-konishi.ryusuke@gmail.com Signed-off-by: Linus Torvalds --- fs/nilfs2/inode.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/fs/nilfs2/inode.c b/fs/nilfs2/inode.c index 28009ec54420..3318dd1350b2 100644 --- a/fs/nilfs2/inode.c +++ b/fs/nilfs2/inode.c @@ -388,7 +388,8 @@ struct inode *nilfs_new_inode(struct inode *dir, umode_t mode) failed_after_creation: clear_nlink(inode); - unlock_new_inode(inode); + if (inode->i_state & I_NEW) + unlock_new_inode(inode); iput(inode); /* * raw_inode will be deleted through * nilfs_evict_inode(). From 2987a4cfc8332b4b516f3e03d29ec0316221e45f Mon Sep 17 00:00:00 2001 From: Joe Perches Date: Tue, 11 Aug 2020 18:35:46 -0700 Subject: [PATCH 104/164] nilfs2: convert __nilfs_msg to integrate the level and format Reduce object size a bit by removing the KERN_ as a separate argument and adding it to the format string. Reduce overall object size by about ~.5% (x86-64 defconfig w/ nilfs2) old: $ size -t fs/nilfs2/built-in.a | tail -1 191738 8676 44 200458 30f0a (TOTALS) new: $ size -t fs/nilfs2/built-in.a | tail -1 190971 8676 44 199691 30c0b (TOTALS) Signed-off-by: Joe Perches Signed-off-by: Ryusuke Konishi Signed-off-by: Andrew Morton Link: http://lkml.kernel.org/r/1595860111-3920-3-git-send-email-konishi.ryusuke@gmail.com Signed-off-by: Linus Torvalds --- fs/nilfs2/nilfs.h | 9 ++++----- fs/nilfs2/super.c | 16 +++++++++++----- 2 files changed, 15 insertions(+), 10 deletions(-) diff --git a/fs/nilfs2/nilfs.h b/fs/nilfs2/nilfs.h index 42395ba52da6..979a41016743 100644 --- a/fs/nilfs2/nilfs.h +++ b/fs/nilfs2/nilfs.h @@ -289,9 +289,8 @@ static inline int nilfs_mark_inode_dirty_sync(struct inode *inode) /* super.c */ extern struct inode *nilfs_alloc_inode(struct super_block *); -extern __printf(3, 4) -void __nilfs_msg(struct super_block *sb, const char *level, - const char *fmt, ...); +__printf(2, 3) +void __nilfs_msg(struct super_block *sb, const char *fmt, ...); extern __printf(3, 4) void __nilfs_error(struct super_block *sb, const char *function, const char *fmt, ...); @@ -299,7 +298,7 @@ void __nilfs_error(struct super_block *sb, const char *function, #ifdef CONFIG_PRINTK #define nilfs_msg(sb, level, fmt, ...) \ - __nilfs_msg(sb, level, fmt, ##__VA_ARGS__) + __nilfs_msg(sb, level fmt, ##__VA_ARGS__) #define nilfs_error(sb, fmt, ...) \ __nilfs_error(sb, __func__, fmt, ##__VA_ARGS__) @@ -307,7 +306,7 @@ void __nilfs_error(struct super_block *sb, const char *function, #define nilfs_msg(sb, level, fmt, ...) \ do { \ - no_printk(fmt, ##__VA_ARGS__); \ + no_printk(level fmt, ##__VA_ARGS__); \ (void)(sb); \ } while (0) #define nilfs_error(sb, fmt, ...) 
diff --git a/fs/nilfs2/super.c b/fs/nilfs2/super.c
index 5729ee86da9a..fef4821a1f0f 100644
--- a/fs/nilfs2/super.c
+++ b/fs/nilfs2/super.c
@@ -62,19 +62,25 @@ struct kmem_cache *nilfs_btree_path_cache;
 static int nilfs_setup_super(struct super_block *sb, int is_mount);
 static int nilfs_remount(struct super_block *sb, int *flags, char *data);
 
-void __nilfs_msg(struct super_block *sb, const char *level, const char *fmt,
-		 ...)
+void __nilfs_msg(struct super_block *sb, const char *fmt, ...)
 {
 	struct va_format vaf;
 	va_list args;
+	int level;
 
 	va_start(args, fmt);
-	vaf.fmt = fmt;
+
+	level = printk_get_level(fmt);
+	vaf.fmt = printk_skip_level(fmt);
 	vaf.va = &args;
+
 	if (sb)
-		printk("%sNILFS (%s): %pV\n", level, sb->s_id, &vaf);
+		printk("%c%cNILFS (%s): %pV\n",
+		       KERN_SOH_ASCII, level, sb->s_id, &vaf);
 	else
-		printk("%sNILFS: %pV\n", level, &vaf);
+		printk("%c%cNILFS: %pV\n",
+		       KERN_SOH_ASCII, level, &vaf);
+
 	va_end(args);
 }

From a1d0747a393a079631130d61faa2a61027d1c789 Mon Sep 17 00:00:00 2001
From: Joe Perches
Date: Tue, 11 Aug 2020 18:35:49 -0700
Subject: [PATCH 105/164] nilfs2: use a more common logging style

Add macros for nilfs_<level>(sb, fmt, ...) and convert the uses of
'nilfs_msg(sb, KERN_<LEVEL>, ...)' to 'nilfs_<level>(sb, ...)' so nilfs2
uses a logging style more like the typical kernel logging style.

Miscellanea:

o Realign arguments for these uses

Signed-off-by: Joe Perches
Signed-off-by: Ryusuke Konishi
Signed-off-by: Andrew Morton
Link: http://lkml.kernel.org/r/1595860111-3920-4-git-send-email-konishi.ryusuke@gmail.com
Signed-off-by: Linus Torvalds
---
 fs/nilfs2/alloc.c     | 38 +++++++++----------
 fs/nilfs2/btree.c     | 42 ++++++++++-----------
 fs/nilfs2/cpfile.c    | 10 ++---
 fs/nilfs2/dat.c       | 14 +++----
 fs/nilfs2/direct.c    | 14 ++++---
 fs/nilfs2/gcinode.c   |  2 +-
 fs/nilfs2/ifile.c     |  4 +-
 fs/nilfs2/inode.c     | 29 +++++++--------
 fs/nilfs2/ioctl.c     | 37 +++++++++----------
 fs/nilfs2/mdt.c       |  2 +-
 fs/nilfs2/namei.c     |  6 +--
 fs/nilfs2/nilfs.h     |  9 +++++
 fs/nilfs2/page.c      | 11 +++---
 fs/nilfs2/recovery.c  | 32 +++++++---------
 fs/nilfs2/segbuf.c    |  2 +-
 fs/nilfs2/segment.c   | 38 +++++++++----------
 fs/nilfs2/sufile.c    | 29 +++++++--------
 fs/nilfs2/super.c     | 57 ++++++++++++++---------------
 fs/nilfs2/sysfs.c     | 29 +++++++--------
 fs/nilfs2/the_nilfs.c | 85 ++++++++++++++++++++-----------------------
 20 files changed, 239 insertions(+), 251 deletions(-)

diff --git a/fs/nilfs2/alloc.c b/fs/nilfs2/alloc.c
index 235b959fc2b3..adf3bb0a8048 100644
--- a/fs/nilfs2/alloc.c
+++ b/fs/nilfs2/alloc.c
@@ -613,10 +613,10 @@ void nilfs_palloc_commit_free_entry(struct inode *inode,
 	lock = nilfs_mdt_bgl_lock(inode, group);
 
 	if (!nilfs_clear_bit_atomic(lock, group_offset, bitmap))
-		nilfs_msg(inode->i_sb, KERN_WARNING,
-			  "%s (ino=%lu): entry number %llu already freed",
-			  __func__, inode->i_ino,
-			  (unsigned long long)req->pr_entry_nr);
+		nilfs_warn(inode->i_sb,
+			   "%s (ino=%lu): entry number %llu already freed",
+			   __func__, inode->i_ino,
+			   (unsigned long long)req->pr_entry_nr);
 	else
 		nilfs_palloc_group_desc_add_entries(desc, lock, 1);
 
@@ -654,10 +654,10 @@ void nilfs_palloc_abort_alloc_entry(struct inode *inode,
 	lock = nilfs_mdt_bgl_lock(inode, group);
 
 	if (!nilfs_clear_bit_atomic(lock, group_offset, bitmap))
-		nilfs_msg(inode->i_sb, KERN_WARNING,
-			  "%s (ino=%lu): entry number %llu already freed",
-			  __func__, inode->i_ino,
-			  (unsigned long long)req->pr_entry_nr);
+		nilfs_warn(inode->i_sb,
+			   "%s (ino=%lu): entry number %llu already freed",
+			   __func__, inode->i_ino,
+			   (unsigned long long)req->pr_entry_nr);
 	else
nilfs_palloc_group_desc_add_entries(desc, lock, 1); @@ -763,10 +763,10 @@ int nilfs_palloc_freev(struct inode *inode, __u64 *entry_nrs, size_t nitems) do { if (!nilfs_clear_bit_atomic(lock, group_offset, bitmap)) { - nilfs_msg(inode->i_sb, KERN_WARNING, - "%s (ino=%lu): entry number %llu already freed", - __func__, inode->i_ino, - (unsigned long long)entry_nrs[j]); + nilfs_warn(inode->i_sb, + "%s (ino=%lu): entry number %llu already freed", + __func__, inode->i_ino, + (unsigned long long)entry_nrs[j]); } else { n++; } @@ -808,10 +808,10 @@ int nilfs_palloc_freev(struct inode *inode, __u64 *entry_nrs, size_t nitems) ret = nilfs_palloc_delete_entry_block(inode, last_nrs[k]); if (ret && ret != -ENOENT) - nilfs_msg(inode->i_sb, KERN_WARNING, - "error %d deleting block that object (entry=%llu, ino=%lu) belongs to", - ret, (unsigned long long)last_nrs[k], - inode->i_ino); + nilfs_warn(inode->i_sb, + "error %d deleting block that object (entry=%llu, ino=%lu) belongs to", + ret, (unsigned long long)last_nrs[k], + inode->i_ino); } desc_kaddr = kmap_atomic(desc_bh->b_page); @@ -826,9 +826,9 @@ int nilfs_palloc_freev(struct inode *inode, __u64 *entry_nrs, size_t nitems) if (nfree == nilfs_palloc_entries_per_group(inode)) { ret = nilfs_palloc_delete_bitmap_block(inode, group); if (ret && ret != -ENOENT) - nilfs_msg(inode->i_sb, KERN_WARNING, - "error %d deleting bitmap block of group=%lu, ino=%lu", - ret, group, inode->i_ino); + nilfs_warn(inode->i_sb, + "error %d deleting bitmap block of group=%lu, ino=%lu", + ret, group, inode->i_ino); } } return 0; diff --git a/fs/nilfs2/btree.c b/fs/nilfs2/btree.c index 23e043eca237..f42ab57201e7 100644 --- a/fs/nilfs2/btree.c +++ b/fs/nilfs2/btree.c @@ -351,10 +351,10 @@ static int nilfs_btree_node_broken(const struct nilfs_btree_node *node, (flags & NILFS_BTREE_NODE_ROOT) || nchildren < 0 || nchildren > NILFS_BTREE_NODE_NCHILDREN_MAX(size))) { - nilfs_msg(inode->i_sb, KERN_CRIT, - "bad btree node (ino=%lu, blocknr=%llu): level = %d, flags = 0x%x, nchildren = %d", - inode->i_ino, (unsigned long long)blocknr, level, - flags, nchildren); + nilfs_crit(inode->i_sb, + "bad btree node (ino=%lu, blocknr=%llu): level = %d, flags = 0x%x, nchildren = %d", + inode->i_ino, (unsigned long long)blocknr, level, + flags, nchildren); ret = 1; } return ret; @@ -381,9 +381,9 @@ static int nilfs_btree_root_broken(const struct nilfs_btree_node *node, level >= NILFS_BTREE_LEVEL_MAX || nchildren < 0 || nchildren > NILFS_BTREE_ROOT_NCHILDREN_MAX)) { - nilfs_msg(inode->i_sb, KERN_CRIT, - "bad btree root (ino=%lu): level = %d, flags = 0x%x, nchildren = %d", - inode->i_ino, level, flags, nchildren); + nilfs_crit(inode->i_sb, + "bad btree root (ino=%lu): level = %d, flags = 0x%x, nchildren = %d", + inode->i_ino, level, flags, nchildren); ret = 1; } return ret; @@ -450,10 +450,10 @@ static int nilfs_btree_bad_node(const struct nilfs_bmap *btree, { if (unlikely(nilfs_btree_node_get_level(node) != level)) { dump_stack(); - nilfs_msg(btree->b_inode->i_sb, KERN_CRIT, - "btree level mismatch (ino=%lu): %d != %d", - btree->b_inode->i_ino, - nilfs_btree_node_get_level(node), level); + nilfs_crit(btree->b_inode->i_sb, + "btree level mismatch (ino=%lu): %d != %d", + btree->b_inode->i_ino, + nilfs_btree_node_get_level(node), level); return 1; } return 0; @@ -508,7 +508,7 @@ static int __nilfs_btree_get_block(const struct nilfs_bmap *btree, __u64 ptr, out_no_wait: if (!buffer_uptodate(bh)) { - nilfs_msg(btree->b_inode->i_sb, KERN_ERR, + nilfs_err(btree->b_inode->i_sb, "I/O error reading b-tree node 
block (ino=%lu, blocknr=%llu)", btree->b_inode->i_ino, (unsigned long long)ptr); brelse(bh); @@ -2074,10 +2074,10 @@ static int nilfs_btree_propagate(struct nilfs_bmap *btree, ret = nilfs_btree_do_lookup(btree, path, key, NULL, level + 1, 0); if (ret < 0) { if (unlikely(ret == -ENOENT)) - nilfs_msg(btree->b_inode->i_sb, KERN_CRIT, - "writing node/leaf block does not appear in b-tree (ino=%lu) at key=%llu, level=%d", - btree->b_inode->i_ino, - (unsigned long long)key, level); + nilfs_crit(btree->b_inode->i_sb, + "writing node/leaf block does not appear in b-tree (ino=%lu) at key=%llu, level=%d", + btree->b_inode->i_ino, + (unsigned long long)key, level); goto out; } @@ -2114,11 +2114,11 @@ static void nilfs_btree_add_dirty_buffer(struct nilfs_bmap *btree, if (level < NILFS_BTREE_LEVEL_NODE_MIN || level >= NILFS_BTREE_LEVEL_MAX) { dump_stack(); - nilfs_msg(btree->b_inode->i_sb, KERN_WARNING, - "invalid btree level: %d (key=%llu, ino=%lu, blocknr=%llu)", - level, (unsigned long long)key, - btree->b_inode->i_ino, - (unsigned long long)bh->b_blocknr); + nilfs_warn(btree->b_inode->i_sb, + "invalid btree level: %d (key=%llu, ino=%lu, blocknr=%llu)", + level, (unsigned long long)key, + btree->b_inode->i_ino, + (unsigned long long)bh->b_blocknr); return; } diff --git a/fs/nilfs2/cpfile.c b/fs/nilfs2/cpfile.c index 8d41311b5db4..86d4d850d130 100644 --- a/fs/nilfs2/cpfile.c +++ b/fs/nilfs2/cpfile.c @@ -322,7 +322,7 @@ int nilfs_cpfile_delete_checkpoints(struct inode *cpfile, int ret, ncps, nicps, nss, count, i; if (unlikely(start == 0 || start > end)) { - nilfs_msg(cpfile->i_sb, KERN_ERR, + nilfs_err(cpfile->i_sb, "cannot delete checkpoints: invalid range [%llu, %llu)", (unsigned long long)start, (unsigned long long)end); return -EINVAL; @@ -376,7 +376,7 @@ int nilfs_cpfile_delete_checkpoints(struct inode *cpfile, cpfile, cno); if (ret == 0) continue; - nilfs_msg(cpfile->i_sb, KERN_ERR, + nilfs_err(cpfile->i_sb, "error %d deleting checkpoint block", ret); break; @@ -981,12 +981,10 @@ int nilfs_cpfile_read(struct super_block *sb, size_t cpsize, int err; if (cpsize > sb->s_blocksize) { - nilfs_msg(sb, KERN_ERR, - "too large checkpoint size: %zu bytes", cpsize); + nilfs_err(sb, "too large checkpoint size: %zu bytes", cpsize); return -EINVAL; } else if (cpsize < NILFS_MIN_CHECKPOINT_SIZE) { - nilfs_msg(sb, KERN_ERR, - "too small checkpoint size: %zu bytes", cpsize); + nilfs_err(sb, "too small checkpoint size: %zu bytes", cpsize); return -EINVAL; } diff --git a/fs/nilfs2/dat.c b/fs/nilfs2/dat.c index 6f4066636be9..8bccdf1158fc 100644 --- a/fs/nilfs2/dat.c +++ b/fs/nilfs2/dat.c @@ -340,11 +340,11 @@ int nilfs_dat_move(struct inode *dat, __u64 vblocknr, sector_t blocknr) kaddr = kmap_atomic(entry_bh->b_page); entry = nilfs_palloc_block_get_entry(dat, vblocknr, entry_bh, kaddr); if (unlikely(entry->de_blocknr == cpu_to_le64(0))) { - nilfs_msg(dat->i_sb, KERN_CRIT, - "%s: invalid vblocknr = %llu, [%llu, %llu)", - __func__, (unsigned long long)vblocknr, - (unsigned long long)le64_to_cpu(entry->de_start), - (unsigned long long)le64_to_cpu(entry->de_end)); + nilfs_crit(dat->i_sb, + "%s: invalid vblocknr = %llu, [%llu, %llu)", + __func__, (unsigned long long)vblocknr, + (unsigned long long)le64_to_cpu(entry->de_start), + (unsigned long long)le64_to_cpu(entry->de_end)); kunmap_atomic(kaddr); brelse(entry_bh); return -EINVAL; @@ -471,11 +471,11 @@ int nilfs_dat_read(struct super_block *sb, size_t entry_size, int err; if (entry_size > sb->s_blocksize) { - nilfs_msg(sb, KERN_ERR, "too large DAT entry size: %zu 
bytes", + nilfs_err(sb, "too large DAT entry size: %zu bytes", entry_size); return -EINVAL; } else if (entry_size < NILFS_MIN_DAT_ENTRY_SIZE) { - nilfs_msg(sb, KERN_ERR, "too small DAT entry size: %zu bytes", + nilfs_err(sb, "too small DAT entry size: %zu bytes", entry_size); return -EINVAL; } diff --git a/fs/nilfs2/direct.c b/fs/nilfs2/direct.c index 533e24ea3a88..f353101955e3 100644 --- a/fs/nilfs2/direct.c +++ b/fs/nilfs2/direct.c @@ -328,16 +328,18 @@ static int nilfs_direct_assign(struct nilfs_bmap *bmap, key = nilfs_bmap_data_get_key(bmap, *bh); if (unlikely(key > NILFS_DIRECT_KEY_MAX)) { - nilfs_msg(bmap->b_inode->i_sb, KERN_CRIT, - "%s (ino=%lu): invalid key: %llu", __func__, - bmap->b_inode->i_ino, (unsigned long long)key); + nilfs_crit(bmap->b_inode->i_sb, + "%s (ino=%lu): invalid key: %llu", + __func__, + bmap->b_inode->i_ino, (unsigned long long)key); return -EINVAL; } ptr = nilfs_direct_get_ptr(bmap, key); if (unlikely(ptr == NILFS_BMAP_INVALID_PTR)) { - nilfs_msg(bmap->b_inode->i_sb, KERN_CRIT, - "%s (ino=%lu): invalid pointer: %llu", __func__, - bmap->b_inode->i_ino, (unsigned long long)ptr); + nilfs_crit(bmap->b_inode->i_sb, + "%s (ino=%lu): invalid pointer: %llu", + __func__, + bmap->b_inode->i_ino, (unsigned long long)ptr); return -EINVAL; } diff --git a/fs/nilfs2/gcinode.c b/fs/nilfs2/gcinode.c index aa3c328ee189..448320496856 100644 --- a/fs/nilfs2/gcinode.c +++ b/fs/nilfs2/gcinode.c @@ -142,7 +142,7 @@ int nilfs_gccache_wait_and_mark_dirty(struct buffer_head *bh) if (!buffer_uptodate(bh)) { struct inode *inode = bh->b_page->mapping->host; - nilfs_msg(inode->i_sb, KERN_ERR, + nilfs_err(inode->i_sb, "I/O error reading %s block for GC (ino=%lu, vblocknr=%llu)", buffer_nilfs_node(bh) ? "node" : "data", inode->i_ino, (unsigned long long)bh->b_blocknr); diff --git a/fs/nilfs2/ifile.c b/fs/nilfs2/ifile.c index 4140d232cadc..02727ed3a7c6 100644 --- a/fs/nilfs2/ifile.c +++ b/fs/nilfs2/ifile.c @@ -142,8 +142,8 @@ int nilfs_ifile_get_inode_block(struct inode *ifile, ino_t ino, err = nilfs_palloc_get_entry_block(ifile, ino, 0, out_bh); if (unlikely(err)) - nilfs_msg(sb, KERN_WARNING, "error %d reading inode: ino=%lu", - err, (unsigned long)ino); + nilfs_warn(sb, "error %d reading inode: ino=%lu", + err, (unsigned long)ino); return err; } diff --git a/fs/nilfs2/inode.c b/fs/nilfs2/inode.c index 3318dd1350b2..745d371d6fea 100644 --- a/fs/nilfs2/inode.c +++ b/fs/nilfs2/inode.c @@ -104,10 +104,10 @@ int nilfs_get_block(struct inode *inode, sector_t blkoff, * However, the page having this block must * be locked in this case. */ - nilfs_msg(inode->i_sb, KERN_WARNING, - "%s (ino=%lu): a race condition while inserting a data block at offset=%llu", - __func__, inode->i_ino, - (unsigned long long)blkoff); + nilfs_warn(inode->i_sb, + "%s (ino=%lu): a race condition while inserting a data block at offset=%llu", + __func__, inode->i_ino, + (unsigned long long)blkoff); err = 0; } nilfs_transaction_abort(inode->i_sb); @@ -707,9 +707,8 @@ static void nilfs_truncate_bmap(struct nilfs_inode_info *ii, goto repeat; failed: - nilfs_msg(ii->vfs_inode.i_sb, KERN_WARNING, - "error %d truncating bmap (ino=%lu)", ret, - ii->vfs_inode.i_ino); + nilfs_warn(ii->vfs_inode.i_sb, "error %d truncating bmap (ino=%lu)", + ret, ii->vfs_inode.i_ino); } void nilfs_truncate(struct inode *inode) @@ -920,9 +919,9 @@ int nilfs_set_file_dirty(struct inode *inode, unsigned int nr_dirty) * This will happen when somebody is freeing * this inode. 
*/ - nilfs_msg(inode->i_sb, KERN_WARNING, - "cannot set file dirty (ino=%lu): the file is being freed", - inode->i_ino); + nilfs_warn(inode->i_sb, + "cannot set file dirty (ino=%lu): the file is being freed", + inode->i_ino); spin_unlock(&nilfs->ns_inode_lock); return -EINVAL; /* * NILFS_I_DIRTY may remain for @@ -943,9 +942,9 @@ int __nilfs_mark_inode_dirty(struct inode *inode, int flags) err = nilfs_load_inode_block(inode, &ibh); if (unlikely(err)) { - nilfs_msg(inode->i_sb, KERN_WARNING, - "cannot mark inode dirty (ino=%lu): error %d loading inode block", - inode->i_ino, err); + nilfs_warn(inode->i_sb, + "cannot mark inode dirty (ino=%lu): error %d loading inode block", + inode->i_ino, err); return err; } nilfs_update_inode(inode, ibh, flags); @@ -971,8 +970,8 @@ void nilfs_dirty_inode(struct inode *inode, int flags) struct nilfs_mdt_info *mdi = NILFS_MDT(inode); if (is_bad_inode(inode)) { - nilfs_msg(inode->i_sb, KERN_WARNING, - "tried to mark bad_inode dirty. ignored."); + nilfs_warn(inode->i_sb, + "tried to mark bad_inode dirty. ignored."); dump_stack(); return; } diff --git a/fs/nilfs2/ioctl.c b/fs/nilfs2/ioctl.c index 4ba73dbf3e8d..07d26f61f22a 100644 --- a/fs/nilfs2/ioctl.c +++ b/fs/nilfs2/ioctl.c @@ -569,25 +569,25 @@ static int nilfs_ioctl_move_inode_block(struct inode *inode, if (unlikely(ret < 0)) { if (ret == -ENOENT) - nilfs_msg(inode->i_sb, KERN_CRIT, - "%s: invalid virtual block address (%s): ino=%llu, cno=%llu, offset=%llu, blocknr=%llu, vblocknr=%llu", - __func__, vdesc->vd_flags ? "node" : "data", - (unsigned long long)vdesc->vd_ino, - (unsigned long long)vdesc->vd_cno, - (unsigned long long)vdesc->vd_offset, - (unsigned long long)vdesc->vd_blocknr, - (unsigned long long)vdesc->vd_vblocknr); + nilfs_crit(inode->i_sb, + "%s: invalid virtual block address (%s): ino=%llu, cno=%llu, offset=%llu, blocknr=%llu, vblocknr=%llu", + __func__, vdesc->vd_flags ? "node" : "data", + (unsigned long long)vdesc->vd_ino, + (unsigned long long)vdesc->vd_cno, + (unsigned long long)vdesc->vd_offset, + (unsigned long long)vdesc->vd_blocknr, + (unsigned long long)vdesc->vd_vblocknr); return ret; } if (unlikely(!list_empty(&bh->b_assoc_buffers))) { - nilfs_msg(inode->i_sb, KERN_CRIT, - "%s: conflicting %s buffer: ino=%llu, cno=%llu, offset=%llu, blocknr=%llu, vblocknr=%llu", - __func__, vdesc->vd_flags ? "node" : "data", - (unsigned long long)vdesc->vd_ino, - (unsigned long long)vdesc->vd_cno, - (unsigned long long)vdesc->vd_offset, - (unsigned long long)vdesc->vd_blocknr, - (unsigned long long)vdesc->vd_vblocknr); + nilfs_crit(inode->i_sb, + "%s: conflicting %s buffer: ino=%llu, cno=%llu, offset=%llu, blocknr=%llu, vblocknr=%llu", + __func__, vdesc->vd_flags ? 
"node" : "data", + (unsigned long long)vdesc->vd_ino, + (unsigned long long)vdesc->vd_cno, + (unsigned long long)vdesc->vd_offset, + (unsigned long long)vdesc->vd_blocknr, + (unsigned long long)vdesc->vd_vblocknr); brelse(bh); return -EEXIST; } @@ -837,8 +837,7 @@ int nilfs_ioctl_prepare_clean_segments(struct the_nilfs *nilfs, return 0; failed: - nilfs_msg(nilfs->ns_sb, KERN_ERR, "error %d preparing GC: %s", ret, - msg); + nilfs_err(nilfs->ns_sb, "error %d preparing GC: %s", ret, msg); return ret; } @@ -947,7 +946,7 @@ static int nilfs_ioctl_clean_segments(struct inode *inode, struct file *filp, ret = nilfs_ioctl_move_blocks(inode->i_sb, &argv[0], kbufs[0]); if (ret < 0) { - nilfs_msg(inode->i_sb, KERN_ERR, + nilfs_err(inode->i_sb, "error %d preparing GC: cannot read source blocks", ret); } else { diff --git a/fs/nilfs2/mdt.c b/fs/nilfs2/mdt.c index 700870a92bc4..c0361ce45f62 100644 --- a/fs/nilfs2/mdt.c +++ b/fs/nilfs2/mdt.c @@ -199,7 +199,7 @@ static int nilfs_mdt_read_block(struct inode *inode, unsigned long block, out_no_wait: err = -EIO; if (!buffer_uptodate(first_bh)) { - nilfs_msg(inode->i_sb, KERN_ERR, + nilfs_err(inode->i_sb, "I/O error reading meta-data file (ino=%lu, block-offset=%lu)", inode->i_ino, block); goto failed_bh; diff --git a/fs/nilfs2/namei.c b/fs/nilfs2/namei.c index 9fe6d4ab74f0..a6ec7961d4f5 100644 --- a/fs/nilfs2/namei.c +++ b/fs/nilfs2/namei.c @@ -272,9 +272,9 @@ static int nilfs_do_unlink(struct inode *dir, struct dentry *dentry) goto out; if (!inode->i_nlink) { - nilfs_msg(inode->i_sb, KERN_WARNING, - "deleting nonexistent file (ino=%lu), %d", - inode->i_ino, inode->i_nlink); + nilfs_warn(inode->i_sb, + "deleting nonexistent file (ino=%lu), %d", + inode->i_ino, inode->i_nlink); set_nlink(inode, 1); } err = nilfs_delete_entry(de, page); diff --git a/fs/nilfs2/nilfs.h b/fs/nilfs2/nilfs.h index 979a41016743..f8450ee3fd06 100644 --- a/fs/nilfs2/nilfs.h +++ b/fs/nilfs2/nilfs.h @@ -317,6 +317,15 @@ void __nilfs_error(struct super_block *sb, const char *function, #endif /* CONFIG_PRINTK */ +#define nilfs_crit(sb, fmt, ...) \ + nilfs_msg(sb, KERN_CRIT, fmt, ##__VA_ARGS__) +#define nilfs_err(sb, fmt, ...) \ + nilfs_msg(sb, KERN_ERR, fmt, ##__VA_ARGS__) +#define nilfs_warn(sb, fmt, ...) \ + nilfs_msg(sb, KERN_WARNING, fmt, ##__VA_ARGS__) +#define nilfs_info(sb, fmt, ...) 
\ + nilfs_msg(sb, KERN_INFO, fmt, ##__VA_ARGS__) + extern struct nilfs_super_block * nilfs_read_super_block(struct super_block *, u64, int, struct buffer_head **); extern int nilfs_store_magic_and_option(struct super_block *, diff --git a/fs/nilfs2/page.c b/fs/nilfs2/page.c index d7fc8d369d89..b175f1330408 100644 --- a/fs/nilfs2/page.c +++ b/fs/nilfs2/page.c @@ -391,9 +391,8 @@ void nilfs_clear_dirty_page(struct page *page, bool silent) BUG_ON(!PageLocked(page)); if (!silent) - nilfs_msg(sb, KERN_WARNING, - "discard dirty page: offset=%lld, ino=%lu", - page_offset(page), inode->i_ino); + nilfs_warn(sb, "discard dirty page: offset=%lld, ino=%lu", + page_offset(page), inode->i_ino); ClearPageUptodate(page); ClearPageMappedToDisk(page); @@ -409,9 +408,9 @@ void nilfs_clear_dirty_page(struct page *page, bool silent) do { lock_buffer(bh); if (!silent) - nilfs_msg(sb, KERN_WARNING, - "discard dirty block: blocknr=%llu, size=%zu", - (u64)bh->b_blocknr, bh->b_size); + nilfs_warn(sb, + "discard dirty block: blocknr=%llu, size=%zu", + (u64)bh->b_blocknr, bh->b_size); set_mask_bits(&bh->b_state, clear_bits, 0); unlock_buffer(bh); diff --git a/fs/nilfs2/recovery.c b/fs/nilfs2/recovery.c index 140b663e91c7..0b453ef8fae5 100644 --- a/fs/nilfs2/recovery.c +++ b/fs/nilfs2/recovery.c @@ -51,7 +51,7 @@ static int nilfs_warn_segment_error(struct super_block *sb, int err) switch (err) { case NILFS_SEG_FAIL_IO: - nilfs_msg(sb, KERN_ERR, "I/O error reading segment"); + nilfs_err(sb, "I/O error reading segment"); return -EIO; case NILFS_SEG_FAIL_MAGIC: msg = "Magic number mismatch"; @@ -72,10 +72,10 @@ static int nilfs_warn_segment_error(struct super_block *sb, int err) msg = "No super root in the last segment"; break; default: - nilfs_msg(sb, KERN_ERR, "unrecognized segment error %d", err); + nilfs_err(sb, "unrecognized segment error %d", err); return -EINVAL; } - nilfs_msg(sb, KERN_WARNING, "invalid segment: %s", msg); + nilfs_warn(sb, "invalid segment: %s", msg); return -EINVAL; } @@ -543,10 +543,10 @@ static int nilfs_recover_dsync_blocks(struct the_nilfs *nilfs, put_page(page); failed_inode: - nilfs_msg(sb, KERN_WARNING, - "error %d recovering data block (ino=%lu, block-offset=%llu)", - err, (unsigned long)rb->ino, - (unsigned long long)rb->blkoff); + nilfs_warn(sb, + "error %d recovering data block (ino=%lu, block-offset=%llu)", + err, (unsigned long)rb->ino, + (unsigned long long)rb->blkoff); if (!err2) err2 = err; next: @@ -669,8 +669,7 @@ static int nilfs_do_roll_forward(struct the_nilfs *nilfs, } if (nsalvaged_blocks) { - nilfs_msg(sb, KERN_INFO, "salvaged %lu blocks", - nsalvaged_blocks); + nilfs_info(sb, "salvaged %lu blocks", nsalvaged_blocks); ri->ri_need_recovery = NILFS_RECOVERY_ROLLFORWARD_DONE; } out: @@ -681,7 +680,7 @@ static int nilfs_do_roll_forward(struct the_nilfs *nilfs, confused: err = -EINVAL; failed: - nilfs_msg(sb, KERN_ERR, + nilfs_err(sb, "error %d roll-forwarding partial segment at blocknr = %llu", err, (unsigned long long)pseg_start); goto out; @@ -703,8 +702,8 @@ static void nilfs_finish_roll_forward(struct the_nilfs *nilfs, set_buffer_dirty(bh); err = sync_dirty_buffer(bh); if (unlikely(err)) - nilfs_msg(nilfs->ns_sb, KERN_WARNING, - "buffer sync write failed during post-cleaning of recovery."); + nilfs_warn(nilfs->ns_sb, + "buffer sync write failed during post-cleaning of recovery."); brelse(bh); } @@ -739,8 +738,7 @@ int nilfs_salvage_orphan_logs(struct the_nilfs *nilfs, err = nilfs_attach_checkpoint(sb, ri->ri_cno, true, &root); if (unlikely(err)) { - nilfs_msg(sb, KERN_ERR, 
- "error %d loading the latest checkpoint", err); + nilfs_err(sb, "error %d loading the latest checkpoint", err); return err; } @@ -751,8 +749,7 @@ int nilfs_salvage_orphan_logs(struct the_nilfs *nilfs, if (ri->ri_need_recovery == NILFS_RECOVERY_ROLLFORWARD_DONE) { err = nilfs_prepare_segment_for_recovery(nilfs, sb, ri); if (unlikely(err)) { - nilfs_msg(sb, KERN_ERR, - "error %d preparing segment for recovery", + nilfs_err(sb, "error %d preparing segment for recovery", err); goto failed; } @@ -766,8 +763,7 @@ int nilfs_salvage_orphan_logs(struct the_nilfs *nilfs, nilfs_detach_log_writer(sb); if (unlikely(err)) { - nilfs_msg(sb, KERN_ERR, - "error %d writing segment for recovery", + nilfs_err(sb, "error %d writing segment for recovery", err); goto failed; } diff --git a/fs/nilfs2/segbuf.c b/fs/nilfs2/segbuf.c index 20c479b5e41b..1a8729eded8b 100644 --- a/fs/nilfs2/segbuf.c +++ b/fs/nilfs2/segbuf.c @@ -505,7 +505,7 @@ static int nilfs_segbuf_wait(struct nilfs_segment_buffer *segbuf) } while (--segbuf->sb_nbio > 0); if (unlikely(atomic_read(&segbuf->sb_err) > 0)) { - nilfs_msg(segbuf->sb_super, KERN_ERR, + nilfs_err(segbuf->sb_super, "I/O error writing log (start-blocknr=%llu, block-count=%lu) in segment %llu", (unsigned long long)segbuf->sb_pseg_start, segbuf->sb_sum.nblocks, diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c index 91b58c897f92..a651e821c2de 100644 --- a/fs/nilfs2/segment.c +++ b/fs/nilfs2/segment.c @@ -158,7 +158,7 @@ static int nilfs_prepare_segment_lock(struct super_block *sb, * it is saved and will be restored on * nilfs_transaction_commit(). */ - nilfs_msg(sb, KERN_WARNING, "journal info from a different FS"); + nilfs_warn(sb, "journal info from a different FS"); save = current->journal_info; } if (!ti) { @@ -1940,9 +1940,9 @@ static int nilfs_segctor_collect_dirty_files(struct nilfs_sc_info *sci, err = nilfs_ifile_get_inode_block( ifile, ii->vfs_inode.i_ino, &ibh); if (unlikely(err)) { - nilfs_msg(sci->sc_super, KERN_WARNING, - "log writer: error %d getting inode block (ino=%lu)", - err, ii->vfs_inode.i_ino); + nilfs_warn(sci->sc_super, + "log writer: error %d getting inode block (ino=%lu)", + err, ii->vfs_inode.i_ino); return err; } spin_lock(&nilfs->ns_inode_lock); @@ -2449,7 +2449,7 @@ int nilfs_clean_segments(struct super_block *sb, struct nilfs_argv *argv, if (likely(!err)) break; - nilfs_msg(sb, KERN_WARNING, "error %d cleaning segments", err); + nilfs_warn(sb, "error %d cleaning segments", err); set_current_state(TASK_INTERRUPTIBLE); schedule_timeout(sci->sc_interval); } @@ -2457,9 +2457,9 @@ int nilfs_clean_segments(struct super_block *sb, struct nilfs_argv *argv, int ret = nilfs_discard_segments(nilfs, sci->sc_freesegs, sci->sc_nfreesegs); if (ret) { - nilfs_msg(sb, KERN_WARNING, - "error %d on discard request, turning discards off for the device", - ret); + nilfs_warn(sb, + "error %d on discard request, turning discards off for the device", + ret); nilfs_clear_opt(nilfs, DISCARD); } } @@ -2540,9 +2540,9 @@ static int nilfs_segctor_thread(void *arg) /* start sync. */ sci->sc_task = current; wake_up(&sci->sc_wait_task); /* for nilfs_segctor_start_thread() */ - nilfs_msg(sci->sc_super, KERN_INFO, - "segctord starting. Construction interval = %lu seconds, CP frequency < %lu seconds", - sci->sc_interval / HZ, sci->sc_mjcp_freq / HZ); + nilfs_info(sci->sc_super, + "segctord starting. 
Construction interval = %lu seconds, CP frequency < %lu seconds", + sci->sc_interval / HZ, sci->sc_mjcp_freq / HZ); spin_lock(&sci->sc_state_lock); loop: @@ -2616,8 +2616,8 @@ static int nilfs_segctor_start_thread(struct nilfs_sc_info *sci) if (IS_ERR(t)) { int err = PTR_ERR(t); - nilfs_msg(sci->sc_super, KERN_ERR, - "error %d creating segctord thread", err); + nilfs_err(sci->sc_super, "error %d creating segctord thread", + err); return err; } wait_event(sci->sc_wait_task, sci->sc_task != NULL); @@ -2727,14 +2727,14 @@ static void nilfs_segctor_destroy(struct nilfs_sc_info *sci) nilfs_segctor_write_out(sci); if (!list_empty(&sci->sc_dirty_files)) { - nilfs_msg(sci->sc_super, KERN_WARNING, - "disposed unprocessed dirty file(s) when stopping log writer"); + nilfs_warn(sci->sc_super, + "disposed unprocessed dirty file(s) when stopping log writer"); nilfs_dispose_list(nilfs, &sci->sc_dirty_files, 1); } if (!list_empty(&sci->sc_iput_queue)) { - nilfs_msg(sci->sc_super, KERN_WARNING, - "disposed unprocessed inode(s) in iput queue when stopping log writer"); + nilfs_warn(sci->sc_super, + "disposed unprocessed inode(s) in iput queue when stopping log writer"); nilfs_dispose_list(nilfs, &sci->sc_iput_queue, 1); } @@ -2812,8 +2812,8 @@ void nilfs_detach_log_writer(struct super_block *sb) spin_lock(&nilfs->ns_inode_lock); if (!list_empty(&nilfs->ns_dirty_files)) { list_splice_init(&nilfs->ns_dirty_files, &garbage_list); - nilfs_msg(sb, KERN_WARNING, - "disposed unprocessed dirty file(s) when detaching log writer"); + nilfs_warn(sb, + "disposed unprocessed dirty file(s) when detaching log writer"); } spin_unlock(&nilfs->ns_inode_lock); up_write(&nilfs->ns_segctor_sem); diff --git a/fs/nilfs2/sufile.c b/fs/nilfs2/sufile.c index bf3f8f05c89b..42ff67c0c14f 100644 --- a/fs/nilfs2/sufile.c +++ b/fs/nilfs2/sufile.c @@ -171,9 +171,9 @@ int nilfs_sufile_updatev(struct inode *sufile, __u64 *segnumv, size_t nsegs, down_write(&NILFS_MDT(sufile)->mi_sem); for (seg = segnumv; seg < segnumv + nsegs; seg++) { if (unlikely(*seg >= nilfs_sufile_get_nsegments(sufile))) { - nilfs_msg(sufile->i_sb, KERN_WARNING, - "%s: invalid segment number: %llu", - __func__, (unsigned long long)*seg); + nilfs_warn(sufile->i_sb, + "%s: invalid segment number: %llu", + __func__, (unsigned long long)*seg); nerr++; } } @@ -230,9 +230,8 @@ int nilfs_sufile_update(struct inode *sufile, __u64 segnum, int create, int ret; if (unlikely(segnum >= nilfs_sufile_get_nsegments(sufile))) { - nilfs_msg(sufile->i_sb, KERN_WARNING, - "%s: invalid segment number: %llu", - __func__, (unsigned long long)segnum); + nilfs_warn(sufile->i_sb, "%s: invalid segment number: %llu", + __func__, (unsigned long long)segnum); return -EINVAL; } down_write(&NILFS_MDT(sufile)->mi_sem); @@ -410,9 +409,8 @@ void nilfs_sufile_do_cancel_free(struct inode *sufile, __u64 segnum, kaddr = kmap_atomic(su_bh->b_page); su = nilfs_sufile_block_get_segment_usage(sufile, segnum, su_bh, kaddr); if (unlikely(!nilfs_segment_usage_clean(su))) { - nilfs_msg(sufile->i_sb, KERN_WARNING, - "%s: segment %llu must be clean", __func__, - (unsigned long long)segnum); + nilfs_warn(sufile->i_sb, "%s: segment %llu must be clean", + __func__, (unsigned long long)segnum); kunmap_atomic(kaddr); return; } @@ -468,9 +466,8 @@ void nilfs_sufile_do_free(struct inode *sufile, __u64 segnum, kaddr = kmap_atomic(su_bh->b_page); su = nilfs_sufile_block_get_segment_usage(sufile, segnum, su_bh, kaddr); if (nilfs_segment_usage_clean(su)) { - nilfs_msg(sufile->i_sb, KERN_WARNING, - "%s: segment %llu is already 
clean", - __func__, (unsigned long long)segnum); + nilfs_warn(sufile->i_sb, "%s: segment %llu is already clean", + __func__, (unsigned long long)segnum); kunmap_atomic(kaddr); return; } @@ -1168,12 +1165,12 @@ int nilfs_sufile_read(struct super_block *sb, size_t susize, int err; if (susize > sb->s_blocksize) { - nilfs_msg(sb, KERN_ERR, - "too large segment usage size: %zu bytes", susize); + nilfs_err(sb, "too large segment usage size: %zu bytes", + susize); return -EINVAL; } else if (susize < NILFS_MIN_SEGMENT_USAGE_SIZE) { - nilfs_msg(sb, KERN_ERR, - "too small segment usage size: %zu bytes", susize); + nilfs_err(sb, "too small segment usage size: %zu bytes", + susize); return -EINVAL; } diff --git a/fs/nilfs2/super.c b/fs/nilfs2/super.c index fef4821a1f0f..2eee5fb1a882 100644 --- a/fs/nilfs2/super.c +++ b/fs/nilfs2/super.c @@ -112,7 +112,7 @@ static void nilfs_set_error(struct super_block *sb) * * This implements the body of nilfs_error() macro. Normally, * nilfs_error() should be used. As for sustainable errors such as a - * single-shot I/O error, nilfs_msg() should be used instead. + * single-shot I/O error, nilfs_err() should be used instead. * * Callers should not add a trailing newline since this will do it. */ @@ -184,8 +184,7 @@ static int nilfs_sync_super(struct super_block *sb, int flag) } if (unlikely(err)) { - nilfs_msg(sb, KERN_ERR, "unable to write superblock: err=%d", - err); + nilfs_err(sb, "unable to write superblock: err=%d", err); if (err == -EIO && nilfs->ns_sbh[1]) { /* * sbp[0] points to newer log than sbp[1], @@ -255,7 +254,7 @@ struct nilfs_super_block **nilfs_prepare_super(struct super_block *sb, sbp[1]->s_magic == cpu_to_le16(NILFS_SUPER_MAGIC)) { memcpy(sbp[0], sbp[1], nilfs->ns_sbsize); } else { - nilfs_msg(sb, KERN_CRIT, "superblock broke"); + nilfs_crit(sb, "superblock broke"); return NULL; } } else if (sbp[1] && @@ -365,9 +364,9 @@ static int nilfs_move_2nd_super(struct super_block *sb, loff_t sb2off) offset = sb2off & (nilfs->ns_blocksize - 1); nsbh = sb_getblk(sb, newblocknr); if (!nsbh) { - nilfs_msg(sb, KERN_WARNING, - "unable to move secondary superblock to block %llu", - (unsigned long long)newblocknr); + nilfs_warn(sb, + "unable to move secondary superblock to block %llu", + (unsigned long long)newblocknr); ret = -EIO; goto out; } @@ -530,7 +529,7 @@ int nilfs_attach_checkpoint(struct super_block *sb, __u64 cno, int curr_mnt, up_read(&nilfs->ns_segctor_sem); if (unlikely(err)) { if (err == -ENOENT || err == -EINVAL) { - nilfs_msg(sb, KERN_ERR, + nilfs_err(sb, "Invalid checkpoint (checkpoint number=%llu)", (unsigned long long)cno); err = -EINVAL; @@ -628,8 +627,7 @@ static int nilfs_statfs(struct dentry *dentry, struct kstatfs *buf) err = nilfs_ifile_count_free_inodes(root->ifile, &nmaxinodes, &nfreeinodes); if (unlikely(err)) { - nilfs_msg(sb, KERN_WARNING, - "failed to count free inodes: err=%d", err); + nilfs_warn(sb, "failed to count free inodes: err=%d", err); if (err == -ERANGE) { /* * If nilfs_palloc_count_max_entries() returns @@ -761,7 +759,7 @@ static int parse_options(char *options, struct super_block *sb, int is_remount) break; case Opt_snapshot: if (is_remount) { - nilfs_msg(sb, KERN_ERR, + nilfs_err(sb, "\"%s\" option is invalid for remount", p); return 0; @@ -777,8 +775,7 @@ static int parse_options(char *options, struct super_block *sb, int is_remount) nilfs_clear_opt(nilfs, DISCARD); break; default: - nilfs_msg(sb, KERN_ERR, - "unrecognized mount option \"%s\"", p); + nilfs_err(sb, "unrecognized mount option \"%s\"", p); return 0; } } 
@@ -814,10 +811,10 @@ static int nilfs_setup_super(struct super_block *sb, int is_mount) mnt_count = le16_to_cpu(sbp[0]->s_mnt_count); if (nilfs->ns_mount_state & NILFS_ERROR_FS) { - nilfs_msg(sb, KERN_WARNING, "mounting fs with errors"); + nilfs_warn(sb, "mounting fs with errors"); #if 0 } else if (max_mnt_count >= 0 && mnt_count >= max_mnt_count) { - nilfs_msg(sb, KERN_WARNING, "maximal mount count reached"); + nilfs_warn(sb, "maximal mount count reached"); #endif } if (!max_mnt_count) @@ -880,7 +877,7 @@ int nilfs_check_feature_compatibility(struct super_block *sb, features = le64_to_cpu(sbp->s_feature_incompat) & ~NILFS_FEATURE_INCOMPAT_SUPP; if (features) { - nilfs_msg(sb, KERN_ERR, + nilfs_err(sb, "couldn't mount because of unsupported optional features (%llx)", (unsigned long long)features); return -EINVAL; @@ -888,7 +885,7 @@ int nilfs_check_feature_compatibility(struct super_block *sb, features = le64_to_cpu(sbp->s_feature_compat_ro) & ~NILFS_FEATURE_COMPAT_RO_SUPP; if (!sb_rdonly(sb) && features) { - nilfs_msg(sb, KERN_ERR, + nilfs_err(sb, "couldn't mount RDWR because of unsupported optional features (%llx)", (unsigned long long)features); return -EINVAL; @@ -907,12 +904,12 @@ static int nilfs_get_root_dentry(struct super_block *sb, inode = nilfs_iget(sb, root, NILFS_ROOT_INO); if (IS_ERR(inode)) { ret = PTR_ERR(inode); - nilfs_msg(sb, KERN_ERR, "error %d getting root inode", ret); + nilfs_err(sb, "error %d getting root inode", ret); goto out; } if (!S_ISDIR(inode->i_mode) || !inode->i_blocks || !inode->i_size) { iput(inode); - nilfs_msg(sb, KERN_ERR, "corrupt root inode"); + nilfs_err(sb, "corrupt root inode"); ret = -EINVAL; goto out; } @@ -940,7 +937,7 @@ static int nilfs_get_root_dentry(struct super_block *sb, return ret; failed_dentry: - nilfs_msg(sb, KERN_ERR, "error %d getting root dentry", ret); + nilfs_err(sb, "error %d getting root dentry", ret); goto out; } @@ -960,7 +957,7 @@ static int nilfs_attach_snapshot(struct super_block *s, __u64 cno, ret = (ret == -ENOENT) ? 
-EINVAL : ret; goto out; } else if (!ret) { - nilfs_msg(s, KERN_ERR, + nilfs_err(s, "The specified checkpoint is not a snapshot (checkpoint number=%llu)", (unsigned long long)cno); ret = -EINVAL; @@ -969,7 +966,7 @@ static int nilfs_attach_snapshot(struct super_block *s, __u64 cno, ret = nilfs_attach_checkpoint(s, cno, false, &root); if (ret) { - nilfs_msg(s, KERN_ERR, + nilfs_err(s, "error %d while loading snapshot (checkpoint number=%llu)", ret, (unsigned long long)cno); goto out; @@ -1066,7 +1063,7 @@ nilfs_fill_super(struct super_block *sb, void *data, int silent) cno = nilfs_last_cno(nilfs); err = nilfs_attach_checkpoint(sb, cno, true, &fsroot); if (err) { - nilfs_msg(sb, KERN_ERR, + nilfs_err(sb, "error %d while loading last checkpoint (checkpoint number=%llu)", err, (unsigned long long)cno); goto failed_unload; @@ -1128,8 +1125,8 @@ static int nilfs_remount(struct super_block *sb, int *flags, char *data) err = -EINVAL; if (!nilfs_valid_fs(nilfs)) { - nilfs_msg(sb, KERN_WARNING, - "couldn't remount because the filesystem is in an incomplete recovery state"); + nilfs_warn(sb, + "couldn't remount because the filesystem is in an incomplete recovery state"); goto restore_opts; } @@ -1161,9 +1158,9 @@ static int nilfs_remount(struct super_block *sb, int *flags, char *data) ~NILFS_FEATURE_COMPAT_RO_SUPP; up_read(&nilfs->ns_sem); if (features) { - nilfs_msg(sb, KERN_WARNING, - "couldn't remount RDWR because of unsupported optional features (%llx)", - (unsigned long long)features); + nilfs_warn(sb, + "couldn't remount RDWR because of unsupported optional features (%llx)", + (unsigned long long)features); err = -EROFS; goto restore_opts; } @@ -1222,7 +1219,7 @@ static int nilfs_parse_snapshot_option(const char *option, return 0; parse_error: - nilfs_msg(NULL, KERN_ERR, "invalid option \"%s\": %s", option, msg); + nilfs_err(NULL, "invalid option \"%s\": %s", option, msg); return 1; } @@ -1325,7 +1322,7 @@ nilfs_mount(struct file_system_type *fs_type, int flags, } else if (!sd.cno) { if (nilfs_tree_is_busy(s->s_root)) { if ((flags ^ s->s_flags) & SB_RDONLY) { - nilfs_msg(s, KERN_ERR, + nilfs_err(s, "the device already has a %s mount.", sb_rdonly(s) ? 
"read-only" : "read/write"); err = -EBUSY; diff --git a/fs/nilfs2/sysfs.c b/fs/nilfs2/sysfs.c index e60be7bb55b0..303d71430bdd 100644 --- a/fs/nilfs2/sysfs.c +++ b/fs/nilfs2/sysfs.c @@ -263,8 +263,8 @@ nilfs_checkpoints_checkpoints_number_show(struct nilfs_checkpoints_attr *attr, err = nilfs_cpfile_get_stat(nilfs->ns_cpfile, &cpstat); up_read(&nilfs->ns_segctor_sem); if (err < 0) { - nilfs_msg(nilfs->ns_sb, KERN_ERR, - "unable to get checkpoint stat: err=%d", err); + nilfs_err(nilfs->ns_sb, "unable to get checkpoint stat: err=%d", + err); return err; } @@ -286,8 +286,8 @@ nilfs_checkpoints_snapshots_number_show(struct nilfs_checkpoints_attr *attr, err = nilfs_cpfile_get_stat(nilfs->ns_cpfile, &cpstat); up_read(&nilfs->ns_segctor_sem); if (err < 0) { - nilfs_msg(nilfs->ns_sb, KERN_ERR, - "unable to get checkpoint stat: err=%d", err); + nilfs_err(nilfs->ns_sb, "unable to get checkpoint stat: err=%d", + err); return err; } @@ -405,8 +405,8 @@ nilfs_segments_dirty_segments_show(struct nilfs_segments_attr *attr, err = nilfs_sufile_get_stat(nilfs->ns_sufile, &sustat); up_read(&nilfs->ns_segctor_sem); if (err < 0) { - nilfs_msg(nilfs->ns_sb, KERN_ERR, - "unable to get segment stat: err=%d", err); + nilfs_err(nilfs->ns_sb, "unable to get segment stat: err=%d", + err); return err; } @@ -779,15 +779,15 @@ nilfs_superblock_sb_update_frequency_store(struct nilfs_superblock_attr *attr, err = kstrtouint(skip_spaces(buf), 0, &val); if (err) { - nilfs_msg(nilfs->ns_sb, KERN_ERR, - "unable to convert string: err=%d", err); + nilfs_err(nilfs->ns_sb, "unable to convert string: err=%d", + err); return err; } if (val < NILFS_SB_FREQ) { val = NILFS_SB_FREQ; - nilfs_msg(nilfs->ns_sb, KERN_WARNING, - "superblock update frequency cannot be lesser than 10 seconds"); + nilfs_warn(nilfs->ns_sb, + "superblock update frequency cannot be lesser than 10 seconds"); } down_write(&nilfs->ns_sem); @@ -990,8 +990,7 @@ int nilfs_sysfs_create_device_group(struct super_block *sb) nilfs->ns_dev_subgroups = kzalloc(devgrp_size, GFP_KERNEL); if (unlikely(!nilfs->ns_dev_subgroups)) { err = -ENOMEM; - nilfs_msg(sb, KERN_ERR, - "unable to allocate memory for device group"); + nilfs_err(sb, "unable to allocate memory for device group"); goto failed_create_device_group; } @@ -1101,15 +1100,13 @@ int __init nilfs_sysfs_init(void) nilfs_kset = kset_create_and_add(NILFS_ROOT_GROUP_NAME, NULL, fs_kobj); if (!nilfs_kset) { err = -ENOMEM; - nilfs_msg(NULL, KERN_ERR, - "unable to create sysfs entry: err=%d", err); + nilfs_err(NULL, "unable to create sysfs entry: err=%d", err); goto failed_sysfs_init; } err = sysfs_create_group(&nilfs_kset->kobj, &nilfs_feature_attr_group); if (unlikely(err)) { - nilfs_msg(NULL, KERN_ERR, - "unable to create feature group: err=%d", err); + nilfs_err(NULL, "unable to create feature group: err=%d", err); goto cleanup_sysfs_init; } diff --git a/fs/nilfs2/the_nilfs.c b/fs/nilfs2/the_nilfs.c index 484785cdf96e..221a1cc597f0 100644 --- a/fs/nilfs2/the_nilfs.c +++ b/fs/nilfs2/the_nilfs.c @@ -183,7 +183,7 @@ static int nilfs_store_log_cursor(struct the_nilfs *nilfs, nilfs_get_segnum_of_block(nilfs, nilfs->ns_last_pseg); nilfs->ns_cno = nilfs->ns_last_cno + 1; if (nilfs->ns_segnum >= nilfs->ns_nsegments) { - nilfs_msg(nilfs->ns_sb, KERN_ERR, + nilfs_err(nilfs->ns_sb, "pointed segment number is out of range: segnum=%llu, nsegments=%lu", (unsigned long long)nilfs->ns_segnum, nilfs->ns_nsegments); @@ -210,12 +210,12 @@ int load_nilfs(struct the_nilfs *nilfs, struct super_block *sb) int err; if (!valid_fs) { - nilfs_msg(sb, 
KERN_WARNING, "mounting unchecked fs"); + nilfs_warn(sb, "mounting unchecked fs"); if (s_flags & SB_RDONLY) { - nilfs_msg(sb, KERN_INFO, - "recovery required for readonly filesystem"); - nilfs_msg(sb, KERN_INFO, - "write access will be enabled during recovery"); + nilfs_info(sb, + "recovery required for readonly filesystem"); + nilfs_info(sb, + "write access will be enabled during recovery"); } } @@ -230,12 +230,11 @@ int load_nilfs(struct the_nilfs *nilfs, struct super_block *sb) goto scan_error; if (!nilfs_valid_sb(sbp[1])) { - nilfs_msg(sb, KERN_WARNING, - "unable to fall back to spare super block"); + nilfs_warn(sb, + "unable to fall back to spare super block"); goto scan_error; } - nilfs_msg(sb, KERN_INFO, - "trying rollback from an earlier position"); + nilfs_info(sb, "trying rollback from an earlier position"); /* * restore super block with its spare and reconfigure @@ -248,9 +247,9 @@ int load_nilfs(struct the_nilfs *nilfs, struct super_block *sb) /* verify consistency between two super blocks */ blocksize = BLOCK_SIZE << le32_to_cpu(sbp[0]->s_log_block_size); if (blocksize != nilfs->ns_blocksize) { - nilfs_msg(sb, KERN_WARNING, - "blocksize differs between two super blocks (%d != %d)", - blocksize, nilfs->ns_blocksize); + nilfs_warn(sb, + "blocksize differs between two super blocks (%d != %d)", + blocksize, nilfs->ns_blocksize); goto scan_error; } @@ -269,8 +268,7 @@ int load_nilfs(struct the_nilfs *nilfs, struct super_block *sb) err = nilfs_load_super_root(nilfs, sb, ri.ri_super_root); if (unlikely(err)) { - nilfs_msg(sb, KERN_ERR, "error %d while loading super root", - err); + nilfs_err(sb, "error %d while loading super root", err); goto failed; } @@ -281,28 +279,28 @@ int load_nilfs(struct the_nilfs *nilfs, struct super_block *sb) __u64 features; if (nilfs_test_opt(nilfs, NORECOVERY)) { - nilfs_msg(sb, KERN_INFO, - "norecovery option specified, skipping roll-forward recovery"); + nilfs_info(sb, + "norecovery option specified, skipping roll-forward recovery"); goto skip_recovery; } features = le64_to_cpu(nilfs->ns_sbp[0]->s_feature_compat_ro) & ~NILFS_FEATURE_COMPAT_RO_SUPP; if (features) { - nilfs_msg(sb, KERN_ERR, + nilfs_err(sb, "couldn't proceed with recovery because of unsupported optional features (%llx)", (unsigned long long)features); err = -EROFS; goto failed_unload; } if (really_read_only) { - nilfs_msg(sb, KERN_ERR, + nilfs_err(sb, "write access unavailable, cannot proceed"); err = -EROFS; goto failed_unload; } sb->s_flags &= ~SB_RDONLY; } else if (nilfs_test_opt(nilfs, NORECOVERY)) { - nilfs_msg(sb, KERN_ERR, + nilfs_err(sb, "recovery cancelled because norecovery option was specified for a read/write mount"); err = -EINVAL; goto failed_unload; @@ -318,12 +316,12 @@ int load_nilfs(struct the_nilfs *nilfs, struct super_block *sb) up_write(&nilfs->ns_sem); if (err) { - nilfs_msg(sb, KERN_ERR, + nilfs_err(sb, "error %d updating super block. 
recovery unfinished.", err); goto failed_unload; } - nilfs_msg(sb, KERN_INFO, "recovery complete"); + nilfs_info(sb, "recovery complete"); skip_recovery: nilfs_clear_recovery_info(&ri); @@ -331,7 +329,7 @@ int load_nilfs(struct the_nilfs *nilfs, struct super_block *sb) return 0; scan_error: - nilfs_msg(sb, KERN_ERR, "error %d while searching super root", err); + nilfs_err(sb, "error %d while searching super root", err); goto failed; failed_unload: @@ -378,7 +376,7 @@ static int nilfs_store_disk_layout(struct the_nilfs *nilfs, struct nilfs_super_block *sbp) { if (le32_to_cpu(sbp->s_rev_level) < NILFS_MIN_SUPP_REV) { - nilfs_msg(nilfs->ns_sb, KERN_ERR, + nilfs_err(nilfs->ns_sb, "unsupported revision (superblock rev.=%d.%d, current rev.=%d.%d). Please check the version of mkfs.nilfs(2).", le32_to_cpu(sbp->s_rev_level), le16_to_cpu(sbp->s_minor_rev_level), @@ -391,13 +389,11 @@ static int nilfs_store_disk_layout(struct the_nilfs *nilfs, nilfs->ns_inode_size = le16_to_cpu(sbp->s_inode_size); if (nilfs->ns_inode_size > nilfs->ns_blocksize) { - nilfs_msg(nilfs->ns_sb, KERN_ERR, - "too large inode size: %d bytes", + nilfs_err(nilfs->ns_sb, "too large inode size: %d bytes", nilfs->ns_inode_size); return -EINVAL; } else if (nilfs->ns_inode_size < NILFS_MIN_INODE_SIZE) { - nilfs_msg(nilfs->ns_sb, KERN_ERR, - "too small inode size: %d bytes", + nilfs_err(nilfs->ns_sb, "too small inode size: %d bytes", nilfs->ns_inode_size); return -EINVAL; } @@ -406,8 +402,7 @@ static int nilfs_store_disk_layout(struct the_nilfs *nilfs, nilfs->ns_blocks_per_segment = le32_to_cpu(sbp->s_blocks_per_segment); if (nilfs->ns_blocks_per_segment < NILFS_SEG_MIN_BLOCKS) { - nilfs_msg(nilfs->ns_sb, KERN_ERR, - "too short segment: %lu blocks", + nilfs_err(nilfs->ns_sb, "too short segment: %lu blocks", nilfs->ns_blocks_per_segment); return -EINVAL; } @@ -417,7 +412,7 @@ static int nilfs_store_disk_layout(struct the_nilfs *nilfs, le32_to_cpu(sbp->s_r_segments_percentage); if (nilfs->ns_r_segments_percentage < 1 || nilfs->ns_r_segments_percentage > 99) { - nilfs_msg(nilfs->ns_sb, KERN_ERR, + nilfs_err(nilfs->ns_sb, "invalid reserved segments percentage: %lu", nilfs->ns_r_segments_percentage); return -EINVAL; @@ -503,16 +498,16 @@ static int nilfs_load_super_block(struct the_nilfs *nilfs, if (!sbp[0]) { if (!sbp[1]) { - nilfs_msg(sb, KERN_ERR, "unable to read superblock"); + nilfs_err(sb, "unable to read superblock"); return -EIO; } - nilfs_msg(sb, KERN_WARNING, - "unable to read primary superblock (blocksize = %d)", - blocksize); + nilfs_warn(sb, + "unable to read primary superblock (blocksize = %d)", + blocksize); } else if (!sbp[1]) { - nilfs_msg(sb, KERN_WARNING, - "unable to read secondary superblock (blocksize = %d)", - blocksize); + nilfs_warn(sb, + "unable to read secondary superblock (blocksize = %d)", + blocksize); } /* @@ -534,14 +529,14 @@ static int nilfs_load_super_block(struct the_nilfs *nilfs, } if (!valid[swp]) { nilfs_release_super_block(nilfs); - nilfs_msg(sb, KERN_ERR, "couldn't find nilfs on the device"); + nilfs_err(sb, "couldn't find nilfs on the device"); return -EINVAL; } if (!valid[!swp]) - nilfs_msg(sb, KERN_WARNING, - "broken superblock, retrying with spare superblock (blocksize = %d)", - blocksize); + nilfs_warn(sb, + "broken superblock, retrying with spare superblock (blocksize = %d)", + blocksize); if (swp) nilfs_swap_super_block(nilfs); @@ -575,7 +570,7 @@ int init_nilfs(struct the_nilfs *nilfs, struct super_block *sb, char *data) blocksize = sb_min_blocksize(sb, NILFS_MIN_BLOCK_SIZE); if 
(!blocksize) { - nilfs_msg(sb, KERN_ERR, "unable to set blocksize"); + nilfs_err(sb, "unable to set blocksize"); err = -EINVAL; goto out; } @@ -594,7 +589,7 @@ int init_nilfs(struct the_nilfs *nilfs, struct super_block *sb, char *data) blocksize = BLOCK_SIZE << le32_to_cpu(sbp->s_log_block_size); if (blocksize < NILFS_MIN_BLOCK_SIZE || blocksize > NILFS_MAX_BLOCK_SIZE) { - nilfs_msg(sb, KERN_ERR, + nilfs_err(sb, "couldn't mount because of unsupported filesystem blocksize %d", blocksize); err = -EINVAL; @@ -604,7 +599,7 @@ int init_nilfs(struct the_nilfs *nilfs, struct super_block *sb, char *data) int hw_blocksize = bdev_logical_block_size(sb->s_bdev); if (blocksize < hw_blocksize) { - nilfs_msg(sb, KERN_ERR, + nilfs_err(sb, "blocksize %d too small for device (sector-size = %d)", blocksize, hw_blocksize); err = -EINVAL; From 88b2e9b06381551b707d980627ad0591191f7a2d Mon Sep 17 00:00:00 2001 From: Colin Ian King Date: Tue, 11 Aug 2020 18:35:53 -0700 Subject: [PATCH 106/164] fs/ufs: avoid potential u32 multiplication overflow The 64 bit ino is being compared to the product of two u32 values, however, the multiplication is being performed using a 32 bit multiply so there is a potential of an overflow. To be fully safe, cast uspi->s_ncg to a u64 to ensure a 64 bit multiplication occurs to avoid any chance of overflow. Fixes: f3e2a520f5fb ("ufs: NFS support") Signed-off-by: Colin Ian King Signed-off-by: Andrew Morton Cc: Evgeniy Dushistov Cc: Alexey Dobriyan Link: http://lkml.kernel.org/r/20200715170355.1081713-1-colin.king@canonical.com Addresses-Coverity: ("Unintentional integer overflow") Signed-off-by: Linus Torvalds --- fs/ufs/super.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/ufs/super.c b/fs/ufs/super.c index 1da0be667409..e3b69fb280e8 100644 --- a/fs/ufs/super.c +++ b/fs/ufs/super.c @@ -101,7 +101,7 @@ static struct inode *ufs_nfs_get_inode(struct super_block *sb, u64 ino, u32 gene struct ufs_sb_private_info *uspi = UFS_SB(sb)->s_uspi; struct inode *inode; - if (ino < UFS_ROOTINO || ino > uspi->s_ncg * uspi->s_ipg) + if (ino < UFS_ROOTINO || ino > (u64)uspi->s_ncg * uspi->s_ipg) return ERR_PTR(-ESTALE); inode = ufs_iget(sb, ino); From e348e65a081d761caf72c2753c4e3d67bfb1ef80 Mon Sep 17 00:00:00 2001 From: Yubo Feng Date: Tue, 11 Aug 2020 18:35:56 -0700 Subject: [PATCH 107/164] fatfs: switch write_lock to read_lock in fat_ioctl_get_attributes There is no need to hold write_lock in fat_ioctl_get_attributes. write_lock may make an impact on concurrency of fat_ioctl_get_attributes. Signed-off-by: Yubo Feng Signed-off-by: Andrew Morton Acked-by: OGAWA Hirofumi Link: http://lkml.kernel.org/r/1593308053-12702-1-git-send-email-fengyubo3@huawei.com Signed-off-by: Linus Torvalds --- fs/fat/file.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/fat/file.c b/fs/fat/file.c index 42134c58c87e..f9ee27cf4d7c 100644 --- a/fs/fat/file.c +++ b/fs/fat/file.c @@ -25,9 +25,9 @@ static int fat_ioctl_get_attributes(struct inode *inode, u32 __user *user_attr) { u32 attr; - inode_lock(inode); + inode_lock_shared(inode); attr = fat_make_attrs(inode); - inode_unlock(inode); + inode_unlock_shared(inode); return put_user(attr, user_attr); } From 4ecfed61de766f4655025a71d79c9a0f4793a3f4 Mon Sep 17 00:00:00 2001 From: "Alexander A. 
Klimov" Date: Tue, 11 Aug 2020 18:35:59 -0700 Subject: [PATCH 108/164] VFAT/FAT/MSDOS FILESYSTEM: replace HTTP links with HTTPS ones Rationale: Reduces attack surface on kernel devs opening the links for MITM as HTTPS traffic is much harder to manipulate. Deterministic algorithm: For each file: If not .svg: For each line: If doesn't contain `xmlns`: For each link, `http://[^# ]*(?:\w|/)`: If neither `gnu\.org/license`, nor `mozilla\.org/MPL`: If both the HTTP and HTTPS versions return 200 OK and serve the same content: Replace HTTP with HTTPS. Signed-off-by: Alexander A. Klimov Signed-off-by: Andrew Morton Acked-by: OGAWA Hirofumi Link: http://lkml.kernel.org/r/20200708200409.22293-1-grandmaster@al2klimov.de Signed-off-by: Linus Torvalds --- fs/fat/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/fat/Kconfig b/fs/fat/Kconfig index ca31993dcb47..66532a71e8fd 100644 --- a/fs/fat/Kconfig +++ b/fs/fat/Kconfig @@ -41,7 +41,7 @@ config MSDOS_FS they are compressed; to access compressed MSDOS partitions under Linux, you can either use the DOS emulator DOSEMU, described in the DOSEMU-HOWTO, available from - , or try dmsdosfs in + , or try dmsdosfs in . If you intend to use dosemu with a non-compressed MSDOS partition, say Y here) and MSDOS floppies. This means that file access becomes From a090a5a7d73f79a9ae2dcc6e60d89bfc6864a65a Mon Sep 17 00:00:00 2001 From: OGAWA Hirofumi Date: Tue, 11 Aug 2020 18:36:01 -0700 Subject: [PATCH 109/164] fat: fix fat_ra_init() for data clusters == 0 If data clusters == 0, fat_ra_init() calls the ->ent_blocknr() for the cluster beyond ->max_clusters. This checks the limit before initialization to suppress the warning. Reported-by: syzbot+756199124937b31a9b7e@syzkaller.appspotmail.com Signed-off-by: OGAWA Hirofumi Signed-off-by: Andrew Morton Link: http://lkml.kernel.org/r/87mu462sv4.fsf@mail.parknet.co.jp Signed-off-by: Linus Torvalds --- fs/fat/fatent.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/fs/fat/fatent.c b/fs/fat/fatent.c index bbfe18c07417..f7e3304b7802 100644 --- a/fs/fat/fatent.c +++ b/fs/fat/fatent.c @@ -657,6 +657,9 @@ static void fat_ra_init(struct super_block *sb, struct fatent_ra *ra, unsigned long ra_pages = sb->s_bdi->ra_pages; unsigned int reada_blocks; + if (fatent->entry >= ent_limit) + return; + if (ra_pages > sb->s_bdi->io_pages) ra_pages = rounddown(ra_pages, sb->s_bdi->io_pages); reada_blocks = ra_pages << (PAGE_SHIFT - sb->s_blocksize_bits + 1); From a089e3fd5a82aea20f3d9ec4caa5f4c65cc2cfcc Mon Sep 17 00:00:00 2001 From: Helge Deller Date: Tue, 11 Aug 2020 18:36:04 -0700 Subject: [PATCH 110/164] fs/signalfd.c: fix inconsistent return codes for signalfd4 The kernel signalfd4() syscall returns different error codes when called either in compat or native mode. This behaviour makes correct emulation in qemu and testing programs like LTP more complicated. Fix the code to always return -in both modes- EFAULT for unaccessible user memory, and EINVAL when called with an invalid signal mask. 
Signed-off-by: Helge Deller Signed-off-by: Andrew Morton Cc: Alexander Viro Cc: Laurent Vivier Link: http://lkml.kernel.org/r/20200530100707.GA10159@ls3530.fritz.box Signed-off-by: Linus Torvalds --- fs/signalfd.c | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/fs/signalfd.c b/fs/signalfd.c index 44b6845b071c..5b78719be445 100644 --- a/fs/signalfd.c +++ b/fs/signalfd.c @@ -314,9 +314,10 @@ SYSCALL_DEFINE4(signalfd4, int, ufd, sigset_t __user *, user_mask, { sigset_t mask; - if (sizemask != sizeof(sigset_t) || - copy_from_user(&mask, user_mask, sizeof(mask))) + if (sizemask != sizeof(sigset_t)) return -EINVAL; + if (copy_from_user(&mask, user_mask, sizeof(mask))) + return -EFAULT; return do_signalfd4(ufd, &mask, flags); } @@ -325,9 +326,10 @@ SYSCALL_DEFINE3(signalfd, int, ufd, sigset_t __user *, user_mask, { sigset_t mask; - if (sizemask != sizeof(sigset_t) || - copy_from_user(&mask, user_mask, sizeof(mask))) + if (sizemask != sizeof(sigset_t)) return -EINVAL; + if (copy_from_user(&mask, user_mask, sizeof(mask))) + return -EFAULT; return do_signalfd4(ufd, &mask, 0); } From aaa3e7fb81d8a8598b993c8012f4b8aa18d92b20 Mon Sep 17 00:00:00 2001 From: Tiezhu Yang Date: Tue, 11 Aug 2020 18:36:08 -0700 Subject: [PATCH 111/164] selftests: kmod: use variable NAME in kmod_test_0001() Patch series "kmod/umh: a few fixes". Tiezhu Yang had sent out a patch set with a slew of kmod selftest fixes, and one patch which modified kmod to return 254 when a module was not found. This opened up Pandora's box about what that return value was being used for, and lo and behold, it's because when UMH_WAIT_PROC is used we make a kernel_wait4() call but have never unwrapped the error code. The commit log for that fix details the rationale for the approach taken. I'd appreciate some review on that, in particular from nfs folks, as it seems a case was never really hit before. This patch (of 5): Use the variable NAME instead of "\000" directly in kmod_test_0001(). Signed-off-by: Tiezhu Yang Signed-off-by: Luis Chamberlain Signed-off-by: Andrew Morton Acked-by: Luis Chamberlain Cc: Greg Kroah-Hartman Cc: Al Viro Cc: Philipp Reisner Cc: Lars Ellenberg Cc: Jens Axboe Cc: J. Bruce Fields Cc: Chuck Lever Cc: Roopa Prabhu Cc: Nikolay Aleksandrov Cc: David S. Miller Cc: Jakub Kicinski Cc: David Howells Cc: Jarkko Sakkinen Cc: James Morris Cc: "Serge E.
Hallyn" Cc: Christian Brauner Cc: Sergei Trofimovich Cc: Alexei Starovoitov Cc: Kees Cook Cc: Josh Triplett Cc: Sergey Kvachonok Cc: Tony Vroon Cc: Shuah Khan Cc: Christoph Hellwig Link: http://lkml.kernel.org/r/20200610154923.27510-1-mcgrof@kernel.org Link: http://lkml.kernel.org/r/20200610154923.27510-2-mcgrof@kernel.org Signed-off-by: Linus Torvalds --- tools/testing/selftests/kmod/kmod.sh | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/tools/testing/selftests/kmod/kmod.sh b/tools/testing/selftests/kmod/kmod.sh index ea2147248ebe..afd42387e8b2 100755 --- a/tools/testing/selftests/kmod/kmod.sh +++ b/tools/testing/selftests/kmod/kmod.sh @@ -343,7 +343,7 @@ kmod_test_0001_driver() kmod_defaults_driver config_num_threads 1 - printf '\000' >"$DIR"/config_test_driver + printf $NAME >"$DIR"/config_test_driver config_trigger ${FUNCNAME[0]} config_expect_result ${FUNCNAME[0]} MODULE_NOT_FOUND } @@ -354,7 +354,7 @@ kmod_test_0001_fs() kmod_defaults_fs config_num_threads 1 - printf '\000' >"$DIR"/config_test_fs + printf $NAME >"$DIR"/config_test_fs config_trigger ${FUNCNAME[0]} config_expect_result ${FUNCNAME[0]} -EINVAL } From 6f9e148c218641ad876845abdea4b0597468c55e Mon Sep 17 00:00:00 2001 From: Tiezhu Yang Date: Tue, 11 Aug 2020 18:36:12 -0700 Subject: [PATCH 112/164] kmod: remove redundant "be an" in the comment There exists redundant "be an" in the comment, remove it. Signed-off-by: Tiezhu Yang Signed-off-by: Luis Chamberlain Signed-off-by: Andrew Morton Acked-by: Luis Chamberlain Cc: Alexei Starovoitov Cc: Al Viro Cc: Christian Brauner Cc: Chuck Lever Cc: David Howells Cc: David S. Miller Cc: Greg Kroah-Hartman Cc: Jakub Kicinski Cc: James Morris Cc: Jarkko Sakkinen Cc: J. Bruce Fields Cc: Jens Axboe Cc: Josh Triplett Cc: Kees Cook Cc: Lars Ellenberg Cc: Nikolay Aleksandrov Cc: Philipp Reisner Cc: Roopa Prabhu Cc: "Serge E. Hallyn" Cc: Sergei Trofimovich Cc: Sergey Kvachonok Cc: Shuah Khan Cc: Tony Vroon Cc: Christoph Hellwig Link: http://lkml.kernel.org/r/20200610154923.27510-3-mcgrof@kernel.org Signed-off-by: Linus Torvalds --- kernel/kmod.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/kernel/kmod.c b/kernel/kmod.c index 37c3c4b97b8e..3cd075ce2a1e 100644 --- a/kernel/kmod.c +++ b/kernel/kmod.c @@ -36,9 +36,8 @@ * * If you need less than 50 threads would mean we're dealing with systems * smaller than 3200 pages. This assumes you are capable of having ~13M memory, - * and this would only be an be an upper limit, after which the OOM killer - * would take effect. Systems like these are very unlikely if modules are - * enabled. + * and this would only be an upper limit, after which the OOM killer would take + * effect. Systems like these are very unlikely if modules are enabled. */ #define MAX_KMOD_CONCURRENT 50 static atomic_t kmod_concurrent_max = ATOMIC_INIT(MAX_KMOD_CONCURRENT); From 0776d1231bec0c7ab43baf440a3f5ef5f49dd795 Mon Sep 17 00:00:00 2001 From: Tiezhu Yang Date: Tue, 11 Aug 2020 18:36:16 -0700 Subject: [PATCH 113/164] test_kmod: avoid potential double free in trigger_config_run_type() Reset the member "test_fs" of the test configuration after a call of the function "kfree_const" to a null pointer so that a double memory release will not be performed. 
Fixes: d9c6a72d6fa2 ("kmod: add test driver to stress test the module loader") Signed-off-by: Tiezhu Yang Signed-off-by: Luis Chamberlain Signed-off-by: Andrew Morton Acked-by: Luis Chamberlain Cc: Alexei Starovoitov Cc: Al Viro Cc: Christian Brauner Cc: Chuck Lever Cc: David Howells Cc: David S. Miller Cc: Greg Kroah-Hartman Cc: Jakub Kicinski Cc: James Morris Cc: Jarkko Sakkinen Cc: J. Bruce Fields Cc: Jens Axboe Cc: Josh Triplett Cc: Kees Cook Cc: Lars Ellenberg Cc: Nikolay Aleksandrov Cc: Philipp Reisner Cc: Roopa Prabhu Cc: "Serge E. Hallyn" Cc: Sergei Trofimovich Cc: Sergey Kvachonok Cc: Shuah Khan Cc: Tony Vroon Cc: Christoph Hellwig Link: http://lkml.kernel.org/r/20200610154923.27510-4-mcgrof@kernel.org Signed-off-by: Linus Torvalds --- lib/test_kmod.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/test_kmod.c b/lib/test_kmod.c index e651c37d56db..eab52770070d 100644 --- a/lib/test_kmod.c +++ b/lib/test_kmod.c @@ -745,7 +745,7 @@ static int trigger_config_run_type(struct kmod_test_device *test_dev, break; case TEST_KMOD_FS_TYPE: kfree_const(config->test_fs); - config->test_driver = NULL; + config->test_fs = NULL; copied = config_copy_test_fs(config, test_str, strlen(test_str)); break; From f38c85f1ba6902e4e2e2bf1b84edf065a904cdeb Mon Sep 17 00:00:00 2001 From: Lepton Wu Date: Tue, 11 Aug 2020 18:36:20 -0700 Subject: [PATCH 114/164] coredump: add %f for executable filename The documentation says "%e" should be the "executable filename", while in fact it can be changed by things like prctl(PR_SET_NAME). People who use "%e" in core_pattern are surprised when they find out they get the thread name instead of the executable filename. This is either a documentation bug or a code bug. Since the behavior of "%e" has been there for a long time, it could bring another surprise for users if we "fix" the code. So we just "fix" the documentation. Furthermore, for users who really need the "executable filename" in core_pattern, we introduce a new "%f" for the real executable filename. We already have "%E" for the executable path in the kernel, so just reuse most of its code for the newly added "%f" format. Signed-off-by: Lepton Wu Signed-off-by: Andrew Morton Link: http://lkml.kernel.org/r/20200701031432.2978761-1-ytht.net@gmail.com Signed-off-by: Linus Torvalds --- Documentation/admin-guide/sysctl/kernel.rst | 3 ++- fs/coredump.c | 17 +++++++++++++---- 2 files changed, 15 insertions(+), 5 deletions(-) diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst index 2ae9669eb22c..d4b32cc32bb7 100644 --- a/Documentation/admin-guide/sysctl/kernel.rst +++ b/Documentation/admin-guide/sysctl/kernel.rst @@ -164,7 +164,8 @@ core_pattern %s signal number %t UNIX time of dump %h hostname - %e executable filename (may be shortened) + %e executable filename (may be shortened, could be changed by prctl etc) + %f executable filename %E executable path %c maximum size of core file by resource limit RLIMIT_CORE % both are dropped diff --git a/fs/coredump.c b/fs/coredump.c index 7237f07ff6be..76e7c10edfc0 100644 --- a/fs/coredump.c +++ b/fs/coredump.c @@ -153,10 +153,10 @@ int cn_esc_printf(struct core_name *cn, const char *fmt, ...)
return ret; } -static int cn_print_exe_file(struct core_name *cn) +static int cn_print_exe_file(struct core_name *cn, bool name_only) { struct file *exe_file; - char *pathbuf, *path; + char *pathbuf, *path, *ptr; int ret; exe_file = get_mm_exe_file(current->mm); @@ -175,6 +175,11 @@ static int cn_print_exe_file(struct core_name *cn) goto free_buf; } + if (name_only) { + ptr = strrchr(path, '/'); + if (ptr) + path = ptr + 1; + } ret = cn_esc_printf(cn, "%s", path); free_buf: @@ -301,12 +306,16 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm, utsname()->nodename); up_read(&uts_sem); break; - /* executable */ + /* executable, could be changed by prctl PR_SET_NAME etc */ case 'e': err = cn_esc_printf(cn, "%s", current->comm); break; + /* file name of executable */ + case 'f': + err = cn_print_exe_file(cn, true); + break; case 'E': - err = cn_print_exe_file(cn); + err = cn_print_exe_file(cn, false); break; /* core limit size */ case 'c': From db19c91c3b75cf8ece3ffd92a4e84a306a7547b0 Mon Sep 17 00:00:00 2001 From: Kees Cook Date: Tue, 11 Aug 2020 18:36:23 -0700 Subject: [PATCH 115/164] exec: change uselib(2) IS_SREG() failure to EACCES Patch series "Relocate execve() sanity checks", v2. While looking at the code paths for the proposed O_MAYEXEC flag, I saw some things that looked like they should be fixed up. exec: Change uselib(2) IS_SREG() failure to EACCES This just regularizes the return code on uselib(2). exec: Move S_ISREG() check earlier This moves the S_ISREG() check even earlier than it was already. exec: Move path_noexec() check earlier This adds the path_noexec() check to the same place as the S_ISREG() check. This patch (of 3): Change uselib(2)' S_ISREG() error return to EACCES instead of EINVAL so the behavior matches execve(2), and the seemingly documented value. The "not a regular file" failure mode of execve(2) is explicitly documented[1], but it is not mentioned in uselib(2)[2] which does, however, say that open(2) and mmap(2) errors may apply. The documentation for open(2) does not include a "not a regular file" error[3], but mmap(2) does[4], and it is EACCES. [1] http://man7.org/linux/man-pages/man2/execve.2.html#ERRORS [2] http://man7.org/linux/man-pages/man2/uselib.2.html#ERRORS [3] http://man7.org/linux/man-pages/man2/open.2.html#ERRORS [4] http://man7.org/linux/man-pages/man2/mmap.2.html#ERRORS Signed-off-by: Kees Cook Signed-off-by: Andrew Morton Acked-by: Christian Brauner Cc: Aleksa Sarai Cc: Alexander Viro Cc: Dmitry Vyukov Cc: Eric Biggers Cc: Tetsuo Handa Link: http://lkml.kernel.org/r/20200605160013.3954297-1-keescook@chromium.org Link: http://lkml.kernel.org/r/20200605160013.3954297-2-keescook@chromium.org Signed-off-by: Linus Torvalds --- fs/exec.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/fs/exec.c b/fs/exec.c index 29ef78ae9f50..c56aad307069 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -141,11 +141,10 @@ SYSCALL_DEFINE1(uselib, const char __user *, library) if (IS_ERR(file)) goto out; - error = -EINVAL; + error = -EACCES; if (!S_ISREG(file_inode(file)->i_mode)) goto exit; - error = -EACCES; if (path_noexec(&file->f_path)) goto exit; From 633fb6ac39801514613fbe050db6abdc3fe744d5 Mon Sep 17 00:00:00 2001 From: Kees Cook Date: Tue, 11 Aug 2020 18:36:26 -0700 Subject: [PATCH 116/164] exec: move S_ISREG() check earlier The execve(2)/uselib(2) syscalls have always rejected non-regular files. 
Recently, it was noticed that a deadlock was introduced when trying to execute pipes, as the S_ISREG() test was happening too late. This was fixed in commit 73601ea5b7b1 ("fs/open.c: allow opening only regular files during execve()"), but it was added after inode_permission() had already run, which meant LSMs could see bogus attempts to execute non-regular files. Move the test into the other inode type checks (which already look for other pathological conditions[1]). Since there is no need to use FMODE_EXEC while we still have access to "acc_mode", also switch the test to MAY_EXEC. Also include a comment with the redundant S_ISREG() checks at the end of execve(2)/uselib(2) to note that they are present to avoid any mistakes. My notes on the call path, and related arguments, checks, etc: do_open_execat() struct open_flags open_exec_flags = { .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC, .acc_mode = MAY_EXEC, ... do_filp_open(dfd, filename, open_flags) path_openat(nameidata, open_flags, flags) file = alloc_empty_file(open_flags, current_cred()); do_open(nameidata, file, open_flags) may_open(path, acc_mode, open_flag) /* new location of MAY_EXEC vs S_ISREG() test */ inode_permission(inode, MAY_OPEN | acc_mode) security_inode_permission(inode, acc_mode) vfs_open(path, file) do_dentry_open(file, path->dentry->d_inode, open) /* old location of FMODE_EXEC vs S_ISREG() test */ security_file_open(f) open() [1] https://lore.kernel.org/lkml/202006041910.9EF0C602@keescook/ Signed-off-by: Kees Cook Signed-off-by: Andrew Morton Cc: Aleksa Sarai Cc: Alexander Viro Cc: Christian Brauner Cc: Dmitry Vyukov Cc: Eric Biggers Cc: Tetsuo Handa Link: http://lkml.kernel.org/r/20200605160013.3954297-3-keescook@chromium.org Signed-off-by: Linus Torvalds --- fs/exec.c | 14 ++++++++++++-- fs/namei.c | 6 ++++-- fs/open.c | 6 ------ 3 files changed, 16 insertions(+), 10 deletions(-) diff --git a/fs/exec.c b/fs/exec.c index c56aad307069..cc7ec7b9000b 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -141,8 +141,13 @@ SYSCALL_DEFINE1(uselib, const char __user *, library) if (IS_ERR(file)) goto out; + /* + * may_open() has already checked for this, so it should be + * impossible to trip now. But we need to be extra cautious + * and check again at the very end too. + */ error = -EACCES; - if (!S_ISREG(file_inode(file)->i_mode)) + if (WARN_ON_ONCE(!S_ISREG(file_inode(file)->i_mode))) goto exit; if (path_noexec(&file->f_path)) @@ -908,8 +913,13 @@ static struct file *do_open_execat(int fd, struct filename *name, int flags) if (IS_ERR(file)) goto out; + /* + * may_open() has already checked for this, so it should be + * impossible to trip now. But we need to be extra cautious + * and check again at the very end too. 
+ */ err = -EACCES; - if (!S_ISREG(file_inode(file)->i_mode)) + if (WARN_ON_ONCE(!S_ISREG(file_inode(file)->i_mode))) goto exit; if (path_noexec(&file->f_path)) diff --git a/fs/namei.c b/fs/namei.c index fde8fe086c09..9cc3b90abdff 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -2849,16 +2849,18 @@ static int may_open(const struct path *path, int acc_mode, int flag) case S_IFLNK: return -ELOOP; case S_IFDIR: - if (acc_mode & MAY_WRITE) + if (acc_mode & (MAY_WRITE | MAY_EXEC)) return -EISDIR; break; case S_IFBLK: case S_IFCHR: if (!may_open_dev(path)) return -EACCES; - /*FALLTHRU*/ + fallthrough; case S_IFIFO: case S_IFSOCK: + if (acc_mode & MAY_EXEC) + return -EACCES; flag &= ~O_TRUNC; break; } diff --git a/fs/open.c b/fs/open.c index c80e9f497e9b..9af548fb841b 100644 --- a/fs/open.c +++ b/fs/open.c @@ -779,12 +779,6 @@ static int do_dentry_open(struct file *f, return 0; } - /* Any file opened for execve()/uselib() has to be a regular file. */ - if (unlikely(f->f_flags & FMODE_EXEC && !S_ISREG(inode->i_mode))) { - error = -EACCES; - goto cleanup_file; - } - if (f->f_mode & FMODE_WRITE && !special_file(inode->i_mode)) { error = get_write_access(inode); if (unlikely(error)) From 0fd338b2d2cdf827091ae819ae90ad760b94ad0c Mon Sep 17 00:00:00 2001 From: Kees Cook Date: Tue, 11 Aug 2020 18:36:30 -0700 Subject: [PATCH 117/164] exec: move path_noexec() check earlier The path_noexec() check, like the regular file check, was happening too late, letting LSMs see impossible execve()s. Check it earlier as well in may_open() and collect the redundant fs/exec.c path_noexec() test under the same robustness comment as the S_ISREG() check. My notes on the call path, and related arguments, checks, etc: do_open_execat() struct open_flags open_exec_flags = { .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC, .acc_mode = MAY_EXEC, ... do_filp_open(dfd, filename, open_flags) path_openat(nameidata, open_flags, flags) file = alloc_empty_file(open_flags, current_cred()); do_open(nameidata, file, open_flags) may_open(path, acc_mode, open_flag) /* new location of MAY_EXEC vs path_noexec() test */ inode_permission(inode, MAY_OPEN | acc_mode) security_inode_permission(inode, acc_mode) vfs_open(path, file) do_dentry_open(file, path->dentry->d_inode, open) security_file_open(f) open() /* old location of path_noexec() test */ Signed-off-by: Kees Cook Signed-off-by: Andrew Morton Cc: Alexander Viro Cc: Aleksa Sarai Cc: Christian Brauner Cc: Dmitry Vyukov Cc: Eric Biggers Cc: Tetsuo Handa Link: http://lkml.kernel.org/r/20200605160013.3954297-4-keescook@chromium.org Signed-off-by: Linus Torvalds --- fs/exec.c | 12 ++++-------- fs/namei.c | 4 ++++ 2 files changed, 8 insertions(+), 8 deletions(-) diff --git a/fs/exec.c b/fs/exec.c index cc7ec7b9000b..a57d9785832b 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -147,10 +147,8 @@ SYSCALL_DEFINE1(uselib, const char __user *, library) * and check again at the very end too. */ error = -EACCES; - if (WARN_ON_ONCE(!S_ISREG(file_inode(file)->i_mode))) - goto exit; - - if (path_noexec(&file->f_path)) + if (WARN_ON_ONCE(!S_ISREG(file_inode(file)->i_mode) || + path_noexec(&file->f_path))) goto exit; fsnotify_open(file); @@ -919,10 +917,8 @@ static struct file *do_open_execat(int fd, struct filename *name, int flags) * and check again at the very end too. 
*/ err = -EACCES; - if (WARN_ON_ONCE(!S_ISREG(file_inode(file)->i_mode))) - goto exit; - - if (path_noexec(&file->f_path)) + if (WARN_ON_ONCE(!S_ISREG(file_inode(file)->i_mode) || + path_noexec(&file->f_path))) goto exit; err = deny_write_access(file); diff --git a/fs/namei.c b/fs/namei.c index 9cc3b90abdff..fb318e06ba3c 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -2863,6 +2863,10 @@ static int may_open(const struct path *path, int acc_mode, int flag) return -EACCES; flag &= ~O_TRUNC; break; + case S_IFREG: + if ((acc_mode & MAY_EXEC) && path_noexec(path)) + return -EACCES; + break; } error = inode_permission(inode, MAY_OPEN | acc_mode); From 0935288c6e008c0682ab6171fb5605dcc049e7bd Mon Sep 17 00:00:00 2001 From: Vijay Balakrishna Date: Tue, 11 Aug 2020 18:36:33 -0700 Subject: [PATCH 118/164] kdump: append kernel build-id string to VMCOREINFO Make the kernel's GNU build-id available in VMCOREINFO. Having the build-id in VMCOREINFO makes it easy to present the matching kernel namelist image, together with its debug information file, to kernel crash dump analysis tools. Currently VMCOREINFO lacks a uniquely identifiable key for crash analysis automation. Regarding whether this patch is necessary, or whether matching linux_banner and OSRELEASE in VMCOREINFO as employed by crash(8) already meets the need: IMO the build-id approach is more foolproof, since in most instances it is a cryptographic hash generated from internal code/ELF bits, unlike the kernel version string upon which linux_banner is based, which is external to the code. I feel each is intended for a different purpose. Also, OSRELEASE is not suitable for distinguishing two different kernel builds from the same version with different features enabled. Currently, for most Linux (and non-Linux) systems, the build-id can be extracted using standard methods for file types such as user-mode crash dumps, shared libraries, and loadable kernel modules; the Linux kernel dump has been the exception. Having the build-id in VMCOREINFO brings some uniformity for automation tools. Tyler said: : I think this is a nice improvement over today's linux_banner approach for : correlating vmlinux to a kernel dump. : : The elf notes parsing in this patch lines up with what is described in : the "Notes (Nhdr)" section of the elf(5) man page. : : BUILD_ID_MAX is sufficient to hold a sha1 build-id, which is the default : build-id type today in GNU ld(2). It is also sufficient to hold the : "fast" build-id, which is the default build-id type today in LLVM lld(2).
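For reference, an equivalent userspace walk over a buffer of concatenated ELF notes (for example one read from /sys/kernel/notes, where available) might look like the sketch below; it is illustrative only and mirrors, rather than reproduces, the in-kernel parser:

	/* Illustrative sketch: locate the GNU build-id in a blob of ELF
	 * notes. Per elf(5), the name and descriptor are each padded to
	 * 4-byte alignment.
	 */
	#include <elf.h>	/* Elf64_Nhdr, NT_GNU_BUILD_ID */
	#include <stdint.h>
	#include <string.h>

	static const uint8_t *find_gnu_build_id(const uint8_t *p, size_t len,
						uint32_t *id_len)
	{
		while (len >= sizeof(Elf64_Nhdr)) {
			const Elf64_Nhdr *n = (const Elf64_Nhdr *)p;
			size_t sz = sizeof(*n) +
				    ((n->n_namesz + 3) & ~3u) +
				    ((n->n_descsz + 3) & ~3u);

			if (sz > len)
				break;
			if (n->n_type == NT_GNU_BUILD_ID && n->n_namesz == 4 &&
			    !memcmp(p + sizeof(*n), "GNU", 4)) {
				*id_len = n->n_descsz;
				/* "GNU\0" is already 4-byte aligned */
				return p + sizeof(*n) + 4;
			}
			p += sz;
			len -= sz;
		}
		return NULL;
	}

The resulting bytes, rendered as hex, can be compared directly against the BUILD-ID= line in VMCOREINFO or against the build-id printed by readelf -n vmlinux.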
Signed-off-by: Vijay Balakrishna Signed-off-by: Andrew Morton Reviewed-by: Tyler Hicks Acked-by: Baoquan He Cc: Dave Young Cc: Vivek Goyal Link: http://lkml.kernel.org/r/1591849672-34104-1-git-send-email-vijayb@linux.microsoft.com Signed-off-by: Linus Torvalds --- include/linux/crash_core.h | 6 +++++ kernel/crash_core.c | 50 ++++++++++++++++++++++++++++++++++++++ 2 files changed, 56 insertions(+) diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h index 525510a9f965..6594dbc34a37 100644 --- a/include/linux/crash_core.h +++ b/include/linux/crash_core.h @@ -38,6 +38,8 @@ phys_addr_t paddr_vmcoreinfo_note(void); #define VMCOREINFO_OSRELEASE(value) \ vmcoreinfo_append_str("OSRELEASE=%s\n", value) +#define VMCOREINFO_BUILD_ID(value) \ + vmcoreinfo_append_str("BUILD-ID=%s\n", value) #define VMCOREINFO_PAGESIZE(value) \ vmcoreinfo_append_str("PAGESIZE=%ld\n", value) #define VMCOREINFO_SYMBOL(name) \ @@ -64,6 +66,10 @@ extern unsigned char *vmcoreinfo_data; extern size_t vmcoreinfo_size; extern u32 *vmcoreinfo_note; +/* raw contents of kernel .notes section */ +extern const void __start_notes __weak; +extern const void __stop_notes __weak; + Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type, void *data, size_t data_len); void final_note(Elf_Word *buf); diff --git a/kernel/crash_core.c b/kernel/crash_core.c index 18175687133a..106e4500fd53 100644 --- a/kernel/crash_core.c +++ b/kernel/crash_core.c @@ -11,6 +11,8 @@ #include #include +#include + /* vmcoreinfo stuff */ unsigned char *vmcoreinfo_data; size_t vmcoreinfo_size; @@ -376,6 +378,53 @@ phys_addr_t __weak paddr_vmcoreinfo_note(void) } EXPORT_SYMBOL(paddr_vmcoreinfo_note); +#define NOTES_SIZE (&__stop_notes - &__start_notes) +#define BUILD_ID_MAX SHA1_DIGEST_SIZE +#define NT_GNU_BUILD_ID 3 + +struct elf_note_section { + struct elf_note n_hdr; + u8 n_data[]; +}; + +/* + * Add build ID from .notes section as generated by the GNU ld(1) + * or LLVM lld(1) --build-id option. + */ +static void add_build_id_vmcoreinfo(void) +{ + char build_id[BUILD_ID_MAX * 2 + 1]; + int n_remain = NOTES_SIZE; + + while (n_remain >= sizeof(struct elf_note)) { + const struct elf_note_section *note_sec = + &__start_notes + NOTES_SIZE - n_remain; + const u32 n_namesz = note_sec->n_hdr.n_namesz; + + if (note_sec->n_hdr.n_type == NT_GNU_BUILD_ID && + n_namesz != 0 && + !strcmp((char *)¬e_sec->n_data[0], "GNU")) { + if (note_sec->n_hdr.n_descsz <= BUILD_ID_MAX) { + const u32 n_descsz = note_sec->n_hdr.n_descsz; + const u8 *s = ¬e_sec->n_data[n_namesz]; + + s = PTR_ALIGN(s, 4); + bin2hex(build_id, s, n_descsz); + build_id[2 * n_descsz] = '\0'; + VMCOREINFO_BUILD_ID(build_id); + return; + } + pr_warn("Build ID is too large to include in vmcoreinfo: %u > %u\n", + note_sec->n_hdr.n_descsz, + BUILD_ID_MAX); + return; + } + n_remain -= sizeof(struct elf_note) + + ALIGN(note_sec->n_hdr.n_namesz, 4) + + ALIGN(note_sec->n_hdr.n_descsz, 4); + } +} + static int __init crash_save_vmcoreinfo_init(void) { vmcoreinfo_data = (unsigned char *)get_zeroed_page(GFP_KERNEL); @@ -394,6 +443,7 @@ static int __init crash_save_vmcoreinfo_init(void) } VMCOREINFO_OSRELEASE(init_uts_ns.name.release); + add_build_id_vmcoreinfo(); VMCOREINFO_PAGESIZE(PAGE_SIZE); VMCOREINFO_SYMBOL(init_uts_ns); From 216ec27f3c5215d357f351fa103817c8b62ea804 Mon Sep 17 00:00:00 2001 From: "Gustavo A. R. 
Silva" Date: Tue, 11 Aug 2020 18:36:37 -0700 Subject: [PATCH 119/164] drivers/rapidio/devices/rio_mport_cdev.c: use struct_size() helper Make use of the struct_size() helper instead of an open-coded version in order to avoid any potential type mistakes. This issue was found with the help of Coccinelle and, audited and fixed manually. Addresses KSPP ID: https://github.com/KSPP/linux/issues/83 Signed-off-by: Gustavo A. R. Silva Signed-off-by: Andrew Morton Cc: Matt Porter Cc: Alexandre Bounine Link: http://lkml.kernel.org/r/20200619170843.GA24923@embeddedor Signed-off-by: Linus Torvalds --- drivers/rapidio/devices/rio_mport_cdev.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/rapidio/devices/rio_mport_cdev.c b/drivers/rapidio/devices/rio_mport_cdev.c index 451608e960a1..3abbba8c2b5b 100644 --- a/drivers/rapidio/devices/rio_mport_cdev.c +++ b/drivers/rapidio/devices/rio_mport_cdev.c @@ -1710,8 +1710,7 @@ static int rio_mport_add_riodev(struct mport_cdev_priv *priv, if (rval & RIO_PEF_SWITCH) { rio_mport_read_config_32(mport, destid, hopcount, RIO_SWP_INFO_CAR, &swpinfo); - size += (RIO_GET_TOTAL_PORTS(swpinfo) * - sizeof(rswitch->nextdev[0])) + sizeof(*rswitch); + size += struct_size(rswitch, nextdev, RIO_GET_TOTAL_PORTS(swpinfo)); } rdev = kzalloc(size, GFP_KERNEL); From 330d5589604d7c1d08caa097a97690d2e8c5bb3f Mon Sep 17 00:00:00 2001 From: "Gustavo A. R. Silva" Date: Tue, 11 Aug 2020 18:36:40 -0700 Subject: [PATCH 120/164] drivers/rapidio/rio-scan.c: use struct_size() helper Make use of the struct_size() helper instead of an open-coded version in order to avoid any potential type mistakes. Also, while there, use the preferred form for passing a size of a struct. The alternative form where struct name is spelled out hurts readability and introduces an opportunity for a bug when the pointer variable type is changed but the corresponding sizeof that is passed as argument is not. This issue was found with the help of Coccinelle and, audited and fixed manually. Addresses KSPP ID: https://github.com/KSPP/linux/issues/83 Signed-off-by: Gustavo A. R. Silva Signed-off-by: Andrew Morton Cc: Matt Porter Cc: Alexandre Bounine Link: http://lkml.kernel.org/r/20200619170445.GA22641@embeddedor Signed-off-by: Linus Torvalds --- drivers/rapidio/rio-scan.c | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/drivers/rapidio/rio-scan.c b/drivers/rapidio/rio-scan.c index eb8ed28533f8..19b0c33f4a62 100644 --- a/drivers/rapidio/rio-scan.c +++ b/drivers/rapidio/rio-scan.c @@ -330,7 +330,7 @@ static struct rio_dev *rio_setup_device(struct rio_net *net, size_t size; u32 swpinfo = 0; - size = sizeof(struct rio_dev); + size = sizeof(*rdev); if (rio_mport_read_config_32(port, destid, hopcount, RIO_PEF_CAR, &result)) return NULL; @@ -338,10 +338,8 @@ static struct rio_dev *rio_setup_device(struct rio_net *net, if (result & (RIO_PEF_SWITCH | RIO_PEF_MULTIPORT)) { rio_mport_read_config_32(port, destid, hopcount, RIO_SWP_INFO_CAR, &swpinfo); - if (result & RIO_PEF_SWITCH) { - size += (RIO_GET_TOTAL_PORTS(swpinfo) * - sizeof(rswitch->nextdev[0])) + sizeof(*rswitch); - } + if (result & RIO_PEF_SWITCH) + size += struct_size(rswitch, nextdev, RIO_GET_TOTAL_PORTS(swpinfo)); } rdev = kzalloc(size, GFP_KERNEL); From d71375499d7b0a954e3012d4e16ccad3f28e3d1c Mon Sep 17 00:00:00 2001 From: "Gustavo A. R. 
Silva" Date: Tue, 11 Aug 2020 18:36:43 -0700 Subject: [PATCH 121/164] rapidio/rio_mport_cdev: use array_size() helper in copy_{from,to}_user() Use array_size() helper instead of the open-coded version in copy_{from,to}_user(). These sorts of multiplication factors need to be wrapped in array_size(). This issue was found with the help of Coccinelle and, audited and fixed manually. Signed-off-by: Gustavo A. R. Silva Signed-off-by: Andrew Morton Reviewed-by: Kees Cook Cc: Matt Porter Cc: Alexandre Bounine Link: http://lkml.kernel.org/r/20200616183050.GA31840@embeddedor Addresses-KSPP-ID: https://github.com/KSPP/linux/issues/83 Signed-off-by: Linus Torvalds --- drivers/rapidio/devices/rio_mport_cdev.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/rapidio/devices/rio_mport_cdev.c b/drivers/rapidio/devices/rio_mport_cdev.c index 3abbba8c2b5b..c07ceec3c6d4 100644 --- a/drivers/rapidio/devices/rio_mport_cdev.c +++ b/drivers/rapidio/devices/rio_mport_cdev.c @@ -981,7 +981,7 @@ static int rio_mport_transfer_ioctl(struct file *filp, void __user *arg) if (unlikely(copy_from_user(transfer, (void __user *)(uintptr_t)transaction.block, - transaction.count * sizeof(*transfer)))) { + array_size(sizeof(*transfer), transaction.count)))) { ret = -EFAULT; goto out_free; } @@ -994,7 +994,7 @@ static int rio_mport_transfer_ioctl(struct file *filp, void __user *arg) if (unlikely(copy_to_user((void __user *)(uintptr_t)transaction.block, transfer, - transaction.count * sizeof(*transfer)))) + array_size(sizeof(*transfer), transaction.count)))) ret = -EFAULT; out_free: From 79076e1241bb3bf02d0aac7d39120d8161fe07b1 Mon Sep 17 00:00:00 2001 From: Tiezhu Yang Date: Tue, 11 Aug 2020 18:36:46 -0700 Subject: [PATCH 122/164] kernel/panic.c: make oops_may_print() return bool The return value of oops_may_print() is true or false, so change its type to reflect that. Signed-off-by: Tiezhu Yang Signed-off-by: Andrew Morton Reviewed-by: Kees Cook Cc: Xuefeng Li Link: http://lkml.kernel.org/r/1591103358-32087-1-git-send-email-yangtiezhu@loongson.cn Signed-off-by: Linus Torvalds --- include/linux/kernel.h | 2 +- kernel/panic.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/include/linux/kernel.h b/include/linux/kernel.h index 92517ae53e7c..211005163682 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -322,7 +322,7 @@ void nmi_panic(struct pt_regs *regs, const char *msg); extern void oops_enter(void); extern void oops_exit(void); void print_oops_end_marker(void); -extern int oops_may_print(void); +extern bool oops_may_print(void); void do_exit(long error_code) __noreturn; void complete_and_exit(struct completion *, long) __noreturn; diff --git a/kernel/panic.c b/kernel/panic.c index e2157ca387c8..93ba061d1c79 100644 --- a/kernel/panic.c +++ b/kernel/panic.c @@ -505,7 +505,7 @@ static void do_oops_enter_exit(void) * Return true if the calling CPU is allowed to print oops-related info. * This is a bit racy.. */ -int oops_may_print(void) +bool oops_may_print(void) { return pause_on_oops_flag == 0; } From 9d5b134f9f51dd5a67adf8fdc4f59af97e540ceb Mon Sep 17 00:00:00 2001 From: Tiezhu Yang Date: Tue, 11 Aug 2020 18:36:49 -0700 Subject: [PATCH 123/164] lib/Kconfig.debug: fix typo in the help text of CONFIG_PANIC_TIMEOUT There exists duplicated "the" in the help text of CONFIG_PANIC_TIMEOUT, Remove it. 
Signed-off-by: Tiezhu Yang Signed-off-by: Andrew Morton Reviewed-by: Kees Cook Cc: Xuefeng Li Link: http://lkml.kernel.org/r/1591103358-32087-2-git-send-email-yangtiezhu@loongson.cn Signed-off-by: Linus Torvalds --- lib/Kconfig.debug | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 9ca1b11af1f6..e068c3c7189a 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -917,7 +917,7 @@ config PANIC_TIMEOUT int "panic timeout" default 0 help - Set the timeout value (in seconds) until a reboot occurs when the + Set the timeout value (in seconds) until a reboot occurs when the kernel panics. If n = 0, then we wait forever. A timeout value n > 0 will wait n seconds before rebooting, while a timeout value n < 0 will reboot immediately. From 63037f74725ddd8a767ed2ad0369e60a3bf1f2ce Mon Sep 17 00:00:00 2001 From: Yue Hu Date: Tue, 11 Aug 2020 18:36:53 -0700 Subject: [PATCH 124/164] panic: make print_oops_end_marker() static Since print_oops_end_marker() is not used externally, make it static and remove its declaration from kernel.h at the same time. Signed-off-by: Yue Hu Signed-off-by: Andrew Morton Cc: Kees Cook Link: http://lkml.kernel.org/r/20200724011516.12756-1-zbestahu@gmail.com Signed-off-by: Linus Torvalds --- include/linux/kernel.h | 1 - kernel/panic.c | 2 +- 2 files changed, 1 insertion(+), 2 deletions(-) diff --git a/include/linux/kernel.h b/include/linux/kernel.h index 211005163682..500def620d8f 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -321,7 +321,6 @@ void panic(const char *fmt, ...) __noreturn __cold; void nmi_panic(struct pt_regs *regs, const char *msg); extern void oops_enter(void); extern void oops_exit(void); -void print_oops_end_marker(void); extern bool oops_may_print(void); void do_exit(long error_code) __noreturn; void complete_and_exit(struct completion *, long) __noreturn; diff --git a/kernel/panic.c b/kernel/panic.c index 93ba061d1c79..aef8872ba843 100644 --- a/kernel/panic.c +++ b/kernel/panic.c @@ -551,7 +551,7 @@ static int init_oops_id(void) } late_initcall(init_oops_id); -void print_oops_end_marker(void) +static void print_oops_end_marker(void) { init_oops_id(); pr_warn("---[ end trace %016llx ]---\n", (unsigned long long)oops_id); From 31a1b9878c0667945b078c92bce907552f57f2b6 Mon Sep 17 00:00:00 2001 From: Marco Elver Date: Tue, 11 Aug 2020 18:36:56 -0700 Subject: [PATCH 125/164] kcov: unconditionally add -fno-stack-protector to compiler options Unconditionally add -fno-stack-protector to KCOV's compiler options, as all supported compilers support the option. This saves a compiler invocation to determine if the option is supported. Because Clang does not support -fno-conserve-stack, and -fno-stack-protector was wrapped in the same cc-option, we were missing -fno-stack-protector with Clang. Unconditionally adding this option fixes this for Clang. 
Suggested-by: Nick Desaulniers Signed-off-by: Marco Elver Signed-off-by: Andrew Morton Reviewed-by: Nick Desaulniers Reviewed-by: Andrey Konovalov Cc: Dmitry Vyukov Cc: Alexander Potapenko Link: http://lkml.kernel.org/r/20200615184302.7591-1-elver@google.com Signed-off-by: Linus Torvalds --- kernel/Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/Makefile b/kernel/Makefile index 5350fd292910..b3da548691c9 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -36,7 +36,7 @@ KCOV_INSTRUMENT_stacktrace.o := n KCOV_INSTRUMENT_kcov.o := n KASAN_SANITIZE_kcov.o := n KCSAN_SANITIZE_kcov.o := n -CFLAGS_kcov.o := $(call cc-option, -fno-conserve-stack -fno-stack-protector) +CFLAGS_kcov.o := $(call cc-option, -fno-conserve-stack) -fno-stack-protector # cond_syscall is currently not LTO compatible CFLAGS_sys_ni.o = $(DISABLE_LTO) From fed79d057d0842d2e381632b783af30745dd7938 Mon Sep 17 00:00:00 2001 From: Wei Yongjun Date: Tue, 11 Aug 2020 18:36:59 -0700 Subject: [PATCH 126/164] kcov: make some symbols static Fix sparse build warnings: kernel/kcov.c:99:1: warning: symbol '__pcpu_scope_kcov_percpu_data' was not declared. Should it be static? kernel/kcov.c:778:6: warning: symbol 'kcov_remote_softirq_start' was not declared. Should it be static? kernel/kcov.c:795:6: warning: symbol 'kcov_remote_softirq_stop' was not declared. Should it be static? Reported-by: Hulk Robot Signed-off-by: Wei Yongjun Signed-off-by: Andrew Morton Reviewed-by: Andrey Konovalov Link: http://lkml.kernel.org/r/20200702115501.73077-1-weiyongjun1@huawei.com Signed-off-by: Linus Torvalds --- kernel/kcov.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/kernel/kcov.c b/kernel/kcov.c index 6afae0bcbac4..6b8368be89c8 100644 --- a/kernel/kcov.c +++ b/kernel/kcov.c @@ -96,7 +96,7 @@ struct kcov_percpu_data { int saved_sequence; }; -DEFINE_PER_CPU(struct kcov_percpu_data, kcov_percpu_data); +static DEFINE_PER_CPU(struct kcov_percpu_data, kcov_percpu_data); /* Must be called with kcov_remote_lock locked. */ static struct kcov_remote *kcov_remote_find(u64 handle) @@ -775,7 +775,7 @@ static inline bool kcov_mode_enabled(unsigned int mode) return (mode & ~KCOV_IN_CTXSW) != KCOV_MODE_DISABLED; } -void kcov_remote_softirq_start(struct task_struct *t) +static void kcov_remote_softirq_start(struct task_struct *t) { struct kcov_percpu_data *data = this_cpu_ptr(&kcov_percpu_data); unsigned int mode; @@ -792,7 +792,7 @@ void kcov_remote_softirq_start(struct task_struct *t) } } -void kcov_remote_softirq_stop(struct task_struct *t) +static void kcov_remote_softirq_stop(struct task_struct *t) { struct kcov_percpu_data *data = this_cpu_ptr(&kcov_percpu_data); From a3ec9f38a97545b5923a90a1052cbdb3683b9631 Mon Sep 17 00:00:00 2001 From: Nick Desaulniers Date: Tue, 11 Aug 2020 18:37:02 -0700 Subject: [PATCH 127/164] scripts/gdb: fix python 3.8 SyntaxWarning Fixes the observed warnings: scripts/gdb/linux/rbtree.py:20: SyntaxWarning: "is" with a literal. Did you mean "=="? if node is 0: scripts/gdb/linux/rbtree.py:36: SyntaxWarning: "is" with a literal. Did you mean "=="? if node is 0: It looks like this is a new warning added in Python 3.8. I've only seen this once after adding the add-auto-load-safe-path rule to my ~/.gdbinit for a new tree. 
Fixes: commit 449ca0c95ea2 ("scripts/gdb: add rb tree iterating utilities") Signed-off-by: Nick Desaulniers Signed-off-by: Andrew Morton Reviewed-by: Stephen Boyd Cc: Jan Kiszka Cc: Kieran Bingham Cc: Aymeric Agon-Rambosson Link: http://lkml.kernel.org/r/20200805225015.2847624-1-ndesaulniers@google.com Link: https://adamj.eu/tech/2020/01/21/why-does-python-3-8-syntaxwarning-for-is-literal/ Signed-off-by: Linus Torvalds --- scripts/gdb/linux/rbtree.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/scripts/gdb/linux/rbtree.py b/scripts/gdb/linux/rbtree.py index c4b991607917..fe462855eefd 100644 --- a/scripts/gdb/linux/rbtree.py +++ b/scripts/gdb/linux/rbtree.py @@ -17,7 +17,7 @@ def rb_first(root): raise gdb.GdbError("Must be struct rb_root not {}".format(root.type)) node = root['rb_node'] - if node is 0: + if node == 0: return None while node['rb_left']: @@ -33,7 +33,7 @@ def rb_last(root): raise gdb.GdbError("Must be struct rb_root not {}".format(root.type)) node = root['rb_node'] - if node is 0: + if node == 0: return None while node['rb_right']: From 00898e8599a184ae05e2c00903e85dc80d82a43d Mon Sep 17 00:00:00 2001 From: Alexey Dobriyan Date: Tue, 11 Aug 2020 18:37:05 -0700 Subject: [PATCH 128/164] ipc: uninline functions Two functions are only called via function pointers; don't bother inlining them. Signed-off-by: Alexey Dobriyan Signed-off-by: Andrew Morton Cc: Manfred Spraul Cc: Davidlohr Bueso Link: http://lkml.kernel.org/r/20200710200312.GA960353@localhost.localdomain Signed-off-by: Linus Torvalds --- ipc/sem.c | 3 +-- ipc/shm.c | 3 +-- 2 files changed, 2 insertions(+), 4 deletions(-) diff --git a/ipc/sem.c b/ipc/sem.c index 3687b71151b3..8c0244e0365e 100644 --- a/ipc/sem.c +++ b/ipc/sem.c @@ -585,8 +585,7 @@ static int newary(struct ipc_namespace *ns, struct ipc_params *params) /* * Called with sem_ids.rwsem and ipcp locked. */ -static inline int sem_more_checks(struct kern_ipc_perm *ipcp, - struct ipc_params *params) +static int sem_more_checks(struct kern_ipc_perm *ipcp, struct ipc_params *params) { struct sem_array *sma; diff --git a/ipc/shm.c b/ipc/shm.c index bf38d7e2fbe9..fd296135a5e6 100644 --- a/ipc/shm.c +++ b/ipc/shm.c @@ -711,8 +711,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params) /* * Called with shm_ids.rwsem and ipcp locked. */ -static inline int shm_more_checks(struct kern_ipc_perm *ipcp, - struct ipc_params *params) +static int shm_more_checks(struct kern_ipc_perm *ipcp, struct ipc_params *params) { struct shmid_kernel *shp; From ce14489c8e2d327b56ff3253a3642a6d79b9bca2 Mon Sep 17 00:00:00 2001 From: Liao Pingfang Date: Tue, 11 Aug 2020 18:37:08 -0700 Subject: [PATCH 129/164] ipc/shm.c: remove the superfluous break Remove the superfluous break, as there is a 'return' before it. 
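A side note on why the "ipc: uninline functions" change above is safe and sensible: once a function's address is stored in an ops table, the compiler must emit a real out-of-line copy regardless of the inline keyword, so the keyword was only noise. A minimal sketch of the pattern, assuming a hypothetical demo_ops table (the actual table is struct ipc_ops in ipc/util.h):

struct demo_ops {
	int (*more_checks)(struct kern_ipc_perm *ipcp,
			   struct ipc_params *params);
};

/*
 * "inline" would be futile here: taking the function's address below
 * forces the compiler to materialize an out-of-line definition anyway.
 */
static int demo_more_checks(struct kern_ipc_perm *ipcp,
			    struct ipc_params *params)
{
	return 0;	/* placeholder for the real permission checks */
}

static const struct demo_ops demo_sem_ops = {
	.more_checks = demo_more_checks,	/* address taken here */
};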
Signed-off-by: Liao Pingfang Signed-off-by: Yi Wang Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Link: http://lkml.kernel.org/r/1594724361-11525-1-git-send-email-wang.yi59@zte.com.cn Signed-off-by: Linus Torvalds --- ipc/shm.c | 1 - 1 file changed, 1 deletion(-) diff --git a/ipc/shm.c b/ipc/shm.c index fd296135a5e6..f1ed36e3ac9f 100644 --- a/ipc/shm.c +++ b/ipc/shm.c @@ -1380,7 +1380,6 @@ static long compat_ksys_shmctl(int shmid, int cmd, void __user *uptr, int versio case SHM_LOCK: case SHM_UNLOCK: return shmctl_do_lock(ns, shmid, cmd); - break; default: return -EINVAL; } From c7073bab5772e0e2ab2c766a05f68174974b1ab5 Mon Sep 17 00:00:00 2001 From: Joonsoo Kim Date: Tue, 11 Aug 2020 18:37:11 -0700 Subject: [PATCH 130/164] mm/page_isolation: prefer the node of the source page Patch series "clean-up the migration target allocation functions", v5. This patch (of 9): For locality, it's better to migrate the page to the same node as the source page, rather than to the node of the current caller's CPU. Signed-off-by: Joonsoo Kim Signed-off-by: Andrew Morton Reviewed-by: Vlastimil Babka Acked-by: Roman Gushchin Acked-by: Michal Hocko Cc: Christoph Hellwig Cc: Mike Kravetz Cc: Naoya Horiguchi Link: http://lkml.kernel.org/r/1594622517-20681-1-git-send-email-iamjoonsoo.kim@lge.com Link: http://lkml.kernel.org/r/1594622517-20681-2-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds --- mm/page_isolation.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/mm/page_isolation.c b/mm/page_isolation.c index f6d07c5f0d34..aec26d972b9f 100644 --- a/mm/page_isolation.c +++ b/mm/page_isolation.c @@ -309,5 +309,7 @@ int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn, struct page *alloc_migrate_target(struct page *page, unsigned long private) { - return new_page_nodemask(page, numa_node_id(), &node_states[N_MEMORY]); + int nid = page_to_nid(page); + + return new_page_nodemask(page, nid, &node_states[N_MEMORY]); } From b4b382238ed2f94f0d3860f9120b66404fa99463 Mon Sep 17 00:00:00 2001 From: Joonsoo Kim Date: Tue, 11 Aug 2020 18:37:14 -0700 Subject: [PATCH 131/164] mm/migrate: move migration helper from .h to .c It's not a performance-sensitive function. Move it to .c. This is a preparation step for future change. 
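For orientation, the helper being moved here is one of several allocation callbacks handed to migrate_pages(). A schematic caller, using only the signatures visible in the surrounding diffs (the demo_* names are invented; error handling elided):

/*
 * migrate_pages() takes a new_page_t allocation callback:
 *   typedef struct page *new_page_t(struct page *page, unsigned long private);
 */
static struct page *demo_new_page(struct page *page, unsigned long private)
{
	/* Allocate the target on the source page's node, per the
	 * locality argument of the page-isolation patch above. */
	return new_page_nodemask(page, page_to_nid(page),
				 &node_states[N_MEMORY]);
}

static int demo_migrate(struct list_head *pagelist)
{
	return migrate_pages(pagelist, demo_new_page,
			     NULL /* no free callback */, 0,
			     MIGRATE_SYNC, MR_CONTIG_RANGE);
}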
Signed-off-by: Joonsoo Kim Signed-off-by: Andrew Morton Reviewed-by: Vlastimil Babka Acked-by: Mike Kravetz Acked-by: Michal Hocko Cc: Christoph Hellwig Cc: Naoya Horiguchi Cc: Roman Gushchin Link: http://lkml.kernel.org/r/1594622517-20681-3-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds --- include/linux/migrate.h | 33 +++++---------------------------- mm/migrate.c | 29 +++++++++++++++++++++++++++++ 2 files changed, 34 insertions(+), 28 deletions(-) diff --git a/include/linux/migrate.h b/include/linux/migrate.h index 540998d9810b..abeb4b15b297 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -31,34 +31,6 @@ enum migrate_reason { /* In mm/debug.c; also keep sync with include/trace/events/migrate.h */ extern const char *migrate_reason_names[MR_TYPES]; -static inline struct page *new_page_nodemask(struct page *page, - int preferred_nid, nodemask_t *nodemask) -{ - gfp_t gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL; - unsigned int order = 0; - struct page *new_page = NULL; - - if (PageHuge(page)) - return alloc_huge_page_nodemask(page_hstate(compound_head(page)), - preferred_nid, nodemask); - - if (PageTransHuge(page)) { - gfp_mask |= GFP_TRANSHUGE; - order = HPAGE_PMD_ORDER; - } - - if (PageHighMem(page) || (zone_idx(page_zone(page)) == ZONE_MOVABLE)) - gfp_mask |= __GFP_HIGHMEM; - - new_page = __alloc_pages_nodemask(gfp_mask, order, - preferred_nid, nodemask); - - if (new_page && PageTransHuge(new_page)) - prep_transhuge_page(new_page); - - return new_page; -} - #ifdef CONFIG_MIGRATION extern void putback_movable_pages(struct list_head *l); @@ -67,6 +39,8 @@ extern int migrate_page(struct address_space *mapping, enum migrate_mode mode); extern int migrate_pages(struct list_head *l, new_page_t new, free_page_t free, unsigned long private, enum migrate_mode mode, int reason); +extern struct page *new_page_nodemask(struct page *page, + int preferred_nid, nodemask_t *nodemask); extern int isolate_movable_page(struct page *page, isolate_mode_t mode); extern void putback_movable_page(struct page *page); @@ -85,6 +59,9 @@ static inline int migrate_pages(struct list_head *l, new_page_t new, free_page_t free, unsigned long private, enum migrate_mode mode, int reason) { return -ENOSYS; } +static inline struct page *new_page_nodemask(struct page *page, + int preferred_nid, nodemask_t *nodemask) + { return NULL; } static inline int isolate_movable_page(struct page *page, isolate_mode_t mode) { return -EBUSY; } diff --git a/mm/migrate.c b/mm/migrate.c index 52896b4921a7..5269bc520aee 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1538,6 +1538,35 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page, return rc; } +struct page *new_page_nodemask(struct page *page, + int preferred_nid, nodemask_t *nodemask) +{ + gfp_t gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL; + unsigned int order = 0; + struct page *new_page = NULL; + + if (PageHuge(page)) + return alloc_huge_page_nodemask( + page_hstate(compound_head(page)), + preferred_nid, nodemask); + + if (PageTransHuge(page)) { + gfp_mask |= GFP_TRANSHUGE; + order = HPAGE_PMD_ORDER; + } + + if (PageHighMem(page) || (zone_idx(page_zone(page)) == ZONE_MOVABLE)) + gfp_mask |= __GFP_HIGHMEM; + + new_page = __alloc_pages_nodemask(gfp_mask, order, + preferred_nid, nodemask); + + if (new_page && PageTransHuge(new_page)) + prep_transhuge_page(new_page); + + return new_page; +} + #ifdef CONFIG_NUMA static int store_status(int __user *status, int start, int value, int nr) From 
d92bbc2719bd2be237ee336113b63492a6baca3b Mon Sep 17 00:00:00 2001 From: Joonsoo Kim Date: Tue, 11 Aug 2020 18:37:17 -0700 Subject: [PATCH 132/164] mm/hugetlb: unify migration callbacks There is no difference between the two migration callback functions, alloc_huge_page_node() and alloc_huge_page_nodemask(), except __GFP_THISNODE handling. It's redundant to have two nearly identical functions just to handle this flag. So, this patch tries to remove one by introducing a new argument, gfp_mask, to alloc_huge_page_nodemask(). After introducing the gfp_mask argument, it's the caller's job to provide a correct gfp_mask. So, all callsites of alloc_huge_page_nodemask() are changed to provide a gfp_mask. Note that it's safe to remove the node id check in alloc_huge_page_node() since there is no caller passing NUMA_NO_NODE as a node id. Signed-off-by: Joonsoo Kim Signed-off-by: Andrew Morton Reviewed-by: Mike Kravetz Reviewed-by: Vlastimil Babka Acked-by: Michal Hocko Cc: Christoph Hellwig Cc: Naoya Horiguchi Cc: Roman Gushchin Link: http://lkml.kernel.org/r/1594622517-20681-4-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds --- include/linux/hugetlb.h | 26 ++++++++++++++++++-------- mm/hugetlb.c | 35 ++--------------------------------- mm/mempolicy.c | 10 ++++++---- mm/migrate.c | 11 +++++++---- 4 files changed, 33 insertions(+), 49 deletions(-) diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index a520bf26e5d8..3517edde681e 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -10,6 +10,7 @@ #include #include #include +#include struct ctl_table; struct user_struct; @@ -506,9 +507,8 @@ struct huge_bootmem_page { struct page *alloc_huge_page(struct vm_area_struct *vma, unsigned long addr, int avoid_reserve); -struct page *alloc_huge_page_node(struct hstate *h, int nid); struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid, - nodemask_t *nmask); + nodemask_t *nmask, gfp_t gfp_mask); struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct *vma, unsigned long address); struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask, @@ -694,6 +694,15 @@ static inline bool hugepage_movable_supported(struct hstate *h) return true; } +/* Movability of hugepages depends on migration support. 
*/ +static inline gfp_t htlb_alloc_mask(struct hstate *h) +{ + if (hugepage_movable_supported(h)) + return GFP_HIGHUSER_MOVABLE; + else + return GFP_HIGHUSER; +} + static inline spinlock_t *huge_pte_lockptr(struct hstate *h, struct mm_struct *mm, pte_t *pte) { @@ -761,13 +770,9 @@ static inline struct page *alloc_huge_page(struct vm_area_struct *vma, return NULL; } -static inline struct page *alloc_huge_page_node(struct hstate *h, int nid) -{ - return NULL; -} - static inline struct page * -alloc_huge_page_nodemask(struct hstate *h, int preferred_nid, nodemask_t *nmask) +alloc_huge_page_nodemask(struct hstate *h, int preferred_nid, + nodemask_t *nmask, gfp_t gfp_mask) { return NULL; } @@ -880,6 +885,11 @@ static inline bool hugepage_movable_supported(struct hstate *h) return false; } +static inline gfp_t htlb_alloc_mask(struct hstate *h) +{ + return 0; +} + static inline spinlock_t *huge_pte_lockptr(struct hstate *h, struct mm_struct *mm, pte_t *pte) { diff --git a/mm/hugetlb.c b/mm/hugetlb.c index b66bf74e999e..eaab9ef88e9d 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1093,15 +1093,6 @@ static struct page *dequeue_huge_page_nodemask(struct hstate *h, gfp_t gfp_mask, return NULL; } -/* Movability of hugepages depends on migration support. */ -static inline gfp_t htlb_alloc_mask(struct hstate *h) -{ - if (hugepage_movable_supported(h)) - return GFP_HIGHUSER_MOVABLE; - else - return GFP_HIGHUSER; -} - static struct page *dequeue_huge_page_vma(struct hstate *h, struct vm_area_struct *vma, unsigned long address, int avoid_reserve, @@ -1985,32 +1976,10 @@ struct page *alloc_buddy_huge_page_with_mpol(struct hstate *h, return page; } -/* page migration callback function */ -struct page *alloc_huge_page_node(struct hstate *h, int nid) -{ - gfp_t gfp_mask = htlb_alloc_mask(h); - struct page *page = NULL; - - if (nid != NUMA_NO_NODE) - gfp_mask |= __GFP_THISNODE; - - spin_lock(&hugetlb_lock); - if (h->free_huge_pages - h->resv_huge_pages > 0) - page = dequeue_huge_page_nodemask(h, gfp_mask, nid, NULL); - spin_unlock(&hugetlb_lock); - - if (!page) - page = alloc_migrate_huge_page(h, gfp_mask, nid, NULL); - - return page; -} - /* page migration callback function */ struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid, - nodemask_t *nmask) + nodemask_t *nmask, gfp_t gfp_mask) { - gfp_t gfp_mask = htlb_alloc_mask(h); - spin_lock(&hugetlb_lock); if (h->free_huge_pages - h->resv_huge_pages > 0) { struct page *page; @@ -2038,7 +2007,7 @@ struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct *vma, gfp_mask = htlb_alloc_mask(h); node = huge_node(vma, address, gfp_mask, &mpol, &nodemask); - page = alloc_huge_page_nodemask(h, node, nodemask); + page = alloc_huge_page_nodemask(h, node, nodemask, gfp_mask); mpol_cond_put(mpol); return page; diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 25b7e412c20b..9ae2b704bdf6 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1068,10 +1068,12 @@ static int migrate_page_add(struct page *page, struct list_head *pagelist, /* page allocation callback for NUMA node migration */ struct page *alloc_new_node_page(struct page *page, unsigned long node) { - if (PageHuge(page)) - return alloc_huge_page_node(page_hstate(compound_head(page)), - node); - else if (PageTransHuge(page)) { + if (PageHuge(page)) { + struct hstate *h = page_hstate(compound_head(page)); + gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE; + + return alloc_huge_page_nodemask(h, node, NULL, gfp_mask); + } else if (PageTransHuge(page)) { struct page *thp; thp = 
alloc_pages_node(node, (GFP_TRANSHUGE | __GFP_THISNODE), HPAGE_PMD_ORDER); diff --git a/mm/migrate.c b/mm/migrate.c index 5269bc520aee..8d084e9bc13b 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1545,10 +1545,13 @@ struct page *new_page_nodemask(struct page *page, unsigned int order = 0; struct page *new_page = NULL; - if (PageHuge(page)) - return alloc_huge_page_nodemask( - page_hstate(compound_head(page)), - preferred_nid, nodemask); + if (PageHuge(page)) { + struct hstate *h = page_hstate(compound_head(page)); + + gfp_mask = htlb_alloc_mask(h); + return alloc_huge_page_nodemask(h, preferred_nid, + nodemask, gfp_mask); + } if (PageTransHuge(page)) { gfp_mask |= GFP_TRANSHUGE; From 9933a0c8a5396df6dbaef809a9ee4ad64ebc3abe Mon Sep 17 00:00:00 2001 From: Joonsoo Kim Date: Tue, 11 Aug 2020 18:37:20 -0700 Subject: [PATCH 133/164] mm/migrate: clear __GFP_RECLAIM to make the migration callback consistent with regular THP allocations new_page_nodemask is a migration callback and it tries to use common gfp flags for the target page allocation whether it is a base page or a THP. The latter only adds GFP_TRANSHUGE to the given mask. This results in the allocation being slightly more aggressive than necessary because the resulting gfp mask will also contain __GFP_RECLAIM_KSWAPD. THP allocations usually exclude this flag to reduce over-eager background reclaim during a high THP allocation load, which has been seen during large mmaps initialization. There is no indication that this is a problem for migration as well, but theoretically the same might happen when migrating large mappings to a different node. Make the migration callback consistent with regular THP allocations. [akpm@linux-foundation.org: fix comment typo, per Vlastimil] Signed-off-by: Joonsoo Kim Signed-off-by: Andrew Morton Acked-by: Michal Hocko Acked-by: Vlastimil Babka Cc: Christoph Hellwig Cc: Mike Kravetz Cc: Naoya Horiguchi Cc: Roman Gushchin Link: http://lkml.kernel.org/r/1594622517-20681-5-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds --- mm/migrate.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/mm/migrate.c b/mm/migrate.c index 8d084e9bc13b..46cca5c2ebff 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1554,6 +1554,11 @@ struct page *new_page_nodemask(struct page *page, } if (PageTransHuge(page)) { + /* + * clear __GFP_RECLAIM to make the migration callback + * consistent with regular THP allocations. + */ + gfp_mask &= ~__GFP_RECLAIM; gfp_mask |= GFP_TRANSHUGE; order = HPAGE_PMD_ORDER; } From 19fc7bed252c16ace29491e4cfa2bafb264eb505 Mon Sep 17 00:00:00 2001 From: Joonsoo Kim Date: Tue, 11 Aug 2020 18:37:25 -0700 Subject: [PATCH 134/164] mm/migrate: introduce a standard migration target allocation function There are some similar functions for migration target allocation. Since there is no fundamental difference, it's better to keep just one rather than keeping all variants. This patch implements the base migration target allocation function. In the following patches, the variants will be converted to use this function. Changes should be mechanical, but, unfortunately, there are some differences. First, some callers' nodemask is assigned to NULL since a NULL nodemask will be considered as all available nodes, that is, &node_states[N_MEMORY]. Second, for hugetlb page allocation, the gfp_mask is redefined as the regular hugetlb allocation gfp_mask plus __GFP_THISNODE if the user-provided gfp_mask has it. This is because a future caller of this function requires this node constraint to be set. 
Lastly, if the provided nodeid is NUMA_NO_NODE, the nodeid is set to the node where the migration source lives. This helps to remove the simple wrappers whose only job was setting up the nodeid. Note that the PageHighMem() call in the previous function is replaced with an is_highmem_idx() check on the zone index, which reads better. [akpm@linux-foundation.org: tweak patch title, per Vlastimil] [akpm@linux-foundation.org: fix typo in comment] Signed-off-by: Joonsoo Kim Signed-off-by: Andrew Morton Acked-by: Vlastimil Babka Acked-by: Michal Hocko Cc: Christoph Hellwig Cc: Mike Kravetz Cc: Naoya Horiguchi Cc: Roman Gushchin Link: http://lkml.kernel.org/r/1594622517-20681-6-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds --- include/linux/hugetlb.h | 15 +++++++++++++++ include/linux/migrate.h | 9 +++++---- mm/internal.h | 7 +++++++ mm/memory-failure.c | 7 +++++-- mm/memory_hotplug.c | 12 ++++++++---- mm/migrate.c | 26 ++++++++++++++++---------- mm/page_isolation.c | 7 +++++-- 7 files changed, 61 insertions(+), 22 deletions(-) diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 3517edde681e..30e1f14119c8 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -703,6 +703,16 @@ static inline gfp_t htlb_alloc_mask(struct hstate *h) return GFP_HIGHUSER; } +static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask) +{ + gfp_t modified_mask = htlb_alloc_mask(h); + + /* Some callers might want to enforce node */ + modified_mask |= (gfp_mask & __GFP_THISNODE); + + return modified_mask; +} + static inline spinlock_t *huge_pte_lockptr(struct hstate *h, struct mm_struct *mm, pte_t *pte) { @@ -890,6 +900,11 @@ static inline gfp_t htlb_alloc_mask(struct hstate *h) return 0; } +static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask) +{ + return 0; +} + static inline spinlock_t *huge_pte_lockptr(struct hstate *h, struct mm_struct *mm, pte_t *pte) { diff --git a/include/linux/migrate.h b/include/linux/migrate.h index abeb4b15b297..0f8d1583fa8e 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -10,6 +10,8 @@ typedef struct page *new_page_t(struct page *page, unsigned long private); typedef void free_page_t(struct page *page, unsigned long private); +struct migration_target_control; + /* * Return values from addresss_space_operations.migratepage(): * - negative errno on page migration failure; @@ -39,8 +41,7 @@ extern int migrate_page(struct address_space *mapping, enum migrate_mode mode); extern int migrate_pages(struct list_head *l, new_page_t new, free_page_t free, unsigned long private, enum migrate_mode mode, int reason); -extern struct page *new_page_nodemask(struct page *page, - int preferred_nid, nodemask_t *nodemask); +extern struct page *alloc_migration_target(struct page *page, unsigned long private); extern int isolate_movable_page(struct page *page, isolate_mode_t mode); extern void putback_movable_page(struct page *page); @@ -59,8 +60,8 @@ static inline int migrate_pages(struct list_head *l, new_page_t new, free_page_t free, unsigned long private, enum migrate_mode mode, int reason) { return -ENOSYS; } -static inline struct page *new_page_nodemask(struct page *page, - int preferred_nid, nodemask_t *nodemask) +static inline struct page *alloc_migration_target(struct page *page, + unsigned long private) { return NULL; } static inline int isolate_movable_page(struct page *page, isolate_mode_t mode) { return -EBUSY; } diff --git a/mm/internal.h b/mm/internal.h index 42cf0b610847..f725aa8a9698 100644 --- a/mm/internal.h +++ 
b/mm/internal.h @@ -614,4 +614,11 @@ static inline bool is_migrate_highatomic_page(struct page *page) void setup_zone_pageset(struct zone *zone); extern struct page *alloc_new_node_page(struct page *page, unsigned long node); + +struct migration_target_control { + int nid; /* preferred node id */ + nodemask_t *nmask; + gfp_t gfp_mask; +}; + #endif /* __MM_INTERNAL_H */ diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 47b8ccb1fb9b..f1aa6433f404 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1648,9 +1648,12 @@ EXPORT_SYMBOL(unpoison_memory); static struct page *new_page(struct page *p, unsigned long private) { - int nid = page_to_nid(p); + struct migration_target_control mtc = { + .nid = page_to_nid(p), + .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL, + }; - return new_page_nodemask(p, nid, &node_states[N_MEMORY]); + return alloc_migration_target(p, (unsigned long)&mtc); } /* diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 0a9e1972fbe7..c32ead89c911 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1276,19 +1276,23 @@ static int scan_movable_pages(unsigned long start, unsigned long end, static struct page *new_node_page(struct page *page, unsigned long private) { - int nid = page_to_nid(page); nodemask_t nmask = node_states[N_MEMORY]; + struct migration_target_control mtc = { + .nid = page_to_nid(page), + .nmask = &nmask, + .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL, + }; /* * try to allocate from a different node but reuse this node if there * are no other online nodes to be used (e.g. we are offlining a part * of the only existing node) */ - node_clear(nid, nmask); + node_clear(mtc.nid, nmask); if (nodes_empty(nmask)) - node_set(nid, nmask); + node_set(mtc.nid, nmask); - return new_page_nodemask(page, nid, &nmask); + return alloc_migration_target(page, (unsigned long)&mtc); } static int diff --git a/mm/migrate.c b/mm/migrate.c index 46cca5c2ebff..48b1f149494b 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1538,19 +1538,26 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page, return rc; } -struct page *new_page_nodemask(struct page *page, - int preferred_nid, nodemask_t *nodemask) +struct page *alloc_migration_target(struct page *page, unsigned long private) { - gfp_t gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL; + struct migration_target_control *mtc; + gfp_t gfp_mask; unsigned int order = 0; struct page *new_page = NULL; + int nid; + int zidx; + + mtc = (struct migration_target_control *)private; + gfp_mask = mtc->gfp_mask; + nid = mtc->nid; + if (nid == NUMA_NO_NODE) + nid = page_to_nid(page); if (PageHuge(page)) { struct hstate *h = page_hstate(compound_head(page)); - gfp_mask = htlb_alloc_mask(h); - return alloc_huge_page_nodemask(h, preferred_nid, - nodemask, gfp_mask); + gfp_mask = htlb_modify_alloc_mask(h, gfp_mask); + return alloc_huge_page_nodemask(h, nid, mtc->nmask, gfp_mask); } if (PageTransHuge(page)) { @@ -1562,12 +1569,11 @@ struct page *new_page_nodemask(struct page *page, gfp_mask |= GFP_TRANSHUGE; order = HPAGE_PMD_ORDER; } - - if (PageHighMem(page) || (zone_idx(page_zone(page)) == ZONE_MOVABLE)) + zidx = zone_idx(page_zone(page)); + if (is_highmem_idx(zidx) || zidx == ZONE_MOVABLE) gfp_mask |= __GFP_HIGHMEM; - new_page = __alloc_pages_nodemask(gfp_mask, order, - preferred_nid, nodemask); + new_page = __alloc_pages_nodemask(gfp_mask, order, nid, mtc->nmask); if (new_page && PageTransHuge(new_page)) prep_transhuge_page(new_page); diff --git 
a/mm/page_isolation.c b/mm/page_isolation.c index aec26d972b9f..f25c66ea37ac 100644 --- a/mm/page_isolation.c +++ b/mm/page_isolation.c @@ -309,7 +309,10 @@ int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn, struct page *alloc_migrate_target(struct page *page, unsigned long private) { - int nid = page_to_nid(page); + struct migration_target_control mtc = { + .nid = page_to_nid(page), + .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL, + }; - return new_page_nodemask(page, nid, &node_states[N_MEMORY]); + return alloc_migration_target(page, (unsigned long)&mtc); } From a097631160c3b71a12db224ece7d72208c9b6846 Mon Sep 17 00:00:00 2001 From: Joonsoo Kim Date: Tue, 11 Aug 2020 18:37:28 -0700 Subject: [PATCH 135/164] mm/mempolicy: use a standard migration target allocation callback There is a well-defined migration target allocation callback. Use it. Signed-off-by: Joonsoo Kim Signed-off-by: Andrew Morton Acked-by: Michal Hocko Acked-by: Vlastimil Babka Cc: Christoph Hellwig Cc: Mike Kravetz Cc: Naoya Horiguchi Cc: Roman Gushchin Link: http://lkml.kernel.org/r/1594622517-20681-7-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds --- mm/internal.h | 1 - mm/mempolicy.c | 31 ++++++------------------------- mm/migrate.c | 8 ++++++-- 3 files changed, 12 insertions(+), 28 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index f725aa8a9698..d11a9a8d2135 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -613,7 +613,6 @@ static inline bool is_migrate_highatomic_page(struct page *page) } void setup_zone_pageset(struct zone *zone); -extern struct page *alloc_new_node_page(struct page *page, unsigned long node); struct migration_target_control { int nid; /* preferred node id */ diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 9ae2b704bdf6..afaa09ff9f6c 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1065,29 +1065,6 @@ static int migrate_page_add(struct page *page, struct list_head *pagelist, return 0; } -/* page allocation callback for NUMA node migration */ -struct page *alloc_new_node_page(struct page *page, unsigned long node) -{ - if (PageHuge(page)) { - struct hstate *h = page_hstate(compound_head(page)); - gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE; - - return alloc_huge_page_nodemask(h, node, NULL, gfp_mask); - } else if (PageTransHuge(page)) { - struct page *thp; - - thp = alloc_pages_node(node, - (GFP_TRANSHUGE | __GFP_THISNODE), - HPAGE_PMD_ORDER); - if (!thp) - return NULL; - prep_transhuge_page(thp); - return thp; - } else - return __alloc_pages_node(node, GFP_HIGHUSER_MOVABLE | - __GFP_THISNODE, 0); -} - /* * Migrate pages from one node to a target node. * Returns error or the number of pages not migrated. 
@@ -1098,6 +1075,10 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest, nodemask_t nmask; LIST_HEAD(pagelist); int err = 0; + struct migration_target_control mtc = { + .nid = dest, + .gfp_mask = GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, + }; nodes_clear(nmask); node_set(source, nmask); @@ -1112,8 +1093,8 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest, flags | MPOL_MF_DISCONTIG_OK, &pagelist); if (!list_empty(&pagelist)) { - err = migrate_pages(&pagelist, alloc_new_node_page, NULL, dest, - MIGRATE_SYNC, MR_SYSCALL); + err = migrate_pages(&pagelist, alloc_migration_target, NULL, + (unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL); if (err) putback_movable_pages(&pagelist); } diff --git a/mm/migrate.c b/mm/migrate.c index 48b1f149494b..5053439be6ab 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1598,9 +1598,13 @@ static int do_move_pages_to_node(struct mm_struct *mm, struct list_head *pagelist, int node) { int err; + struct migration_target_control mtc = { + .nid = node, + .gfp_mask = GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, + }; - err = migrate_pages(pagelist, alloc_new_node_page, NULL, node, - MIGRATE_SYNC, MR_SYSCALL); + err = migrate_pages(pagelist, alloc_migration_target, NULL, + (unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL); if (err) putback_movable_pages(pagelist); return err; From 8b94e0b8be360cf5460331e56a2678ba265f0694 Mon Sep 17 00:00:00 2001 From: Joonsoo Kim Date: Tue, 11 Aug 2020 18:37:31 -0700 Subject: [PATCH 136/164] mm/page_alloc: remove a wrapper for alloc_migration_target() There is a well-defined standard migration target callback. Use it directly. Signed-off-by: Joonsoo Kim Signed-off-by: Andrew Morton Acked-by: Michal Hocko Acked-by: Vlastimil Babka Cc: Christoph Hellwig Cc: Mike Kravetz Cc: Naoya Horiguchi Cc: Roman Gushchin Link: http://lkml.kernel.org/r/1594622517-20681-8-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 8 ++++++-- mm/page_isolation.c | 10 ---------- 2 files changed, 6 insertions(+), 12 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index d80c9bac489c..8b7d0ecf30b1 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -8347,6 +8347,10 @@ static int __alloc_contig_migrate_range(struct compact_control *cc, unsigned long pfn = start; unsigned int tries = 0; int ret = 0; + struct migration_target_control mtc = { + .nid = zone_to_nid(cc->zone), + .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL, + }; migrate_prep(); @@ -8373,8 +8377,8 @@ static int __alloc_contig_migrate_range(struct compact_control *cc, &cc->migratepages); cc->nr_migratepages -= nr_reclaimed; - ret = migrate_pages(&cc->migratepages, alloc_migrate_target, - NULL, 0, cc->mode, MR_CONTIG_RANGE); + ret = migrate_pages(&cc->migratepages, alloc_migration_target, + NULL, (unsigned long)&mtc, cc->mode, MR_CONTIG_RANGE); } if (ret < 0) { putback_movable_pages(&cc->migratepages); diff --git a/mm/page_isolation.c b/mm/page_isolation.c index f25c66ea37ac..242c03121d73 100644 --- a/mm/page_isolation.c +++ b/mm/page_isolation.c @@ -306,13 +306,3 @@ int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn, return pfn < end_pfn ? 
-EBUSY : 0; } - -struct page *alloc_migrate_target(struct page *page, unsigned long private) -{ - struct migration_target_control mtc = { - .nid = page_to_nid(page), - .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL, - }; - - return alloc_migration_target(page, (unsigned long)&mtc); -} From 41b4dc14ee807cb1bd15e67cad287534046f92dc Mon Sep 17 00:00:00 2001 From: Joonsoo Kim Date: Tue, 11 Aug 2020 18:37:34 -0700 Subject: [PATCH 137/164] mm/gup: restrict CMA region by using allocation scope API We have a well-defined scope API to exclude the CMA region. Use it rather than manipulating the gfp_mask manually. With this change, we can now restore __GFP_MOVABLE in the gfp_mask, as for a usual migration target allocation. As a result, ZONE_MOVABLE is also searched by the page allocator. For hugetlb, the gfp_mask is redefined since it has a regular allocation mask filter for migration targets. __GFP_NOWARN is added to the hugetlb gfp_mask filter since a new user of the gfp_mask filter, gup, wants to be silent when an allocation fails. Note that this can be considered a fix for commit 9a4e9f3b2d73 ("mm: update get_user_pages_longterm to migrate pages allocated from CMA region"). However, a "Fixes" tag isn't added here since the old behaviour is merely suboptimal and doesn't cause any problem. Suggested-by: Michal Hocko Signed-off-by: Joonsoo Kim Signed-off-by: Andrew Morton Acked-by: Vlastimil Babka Cc: Christoph Hellwig Cc: Roman Gushchin Cc: Mike Kravetz Cc: Naoya Horiguchi Cc: "Aneesh Kumar K . V" Link: http://lkml.kernel.org/r/1596180906-8442-1-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds --- include/linux/hugetlb.h | 2 ++ mm/gup.c | 17 ++++++++--------- 2 files changed, 10 insertions(+), 9 deletions(-) diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 30e1f14119c8..d86c82749836 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -710,6 +710,8 @@ static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask) /* Some callers might want to enforce node */ modified_mask |= (gfp_mask & __GFP_THISNODE); + modified_mask |= (gfp_mask & __GFP_NOWARN); + return modified_mask; } diff --git a/mm/gup.c b/mm/gup.c index d8a33dd1430d..6c20c6e37635 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1620,10 +1620,12 @@ static struct page *new_non_cma_page(struct page *page, unsigned long private) * Trying to allocate a page for migration. Ignore allocation * failure warnings. We don't force __GFP_THISNODE here because * this node here is the node where we have CMA reservation and - * in some case these nodes will have really less non movable + * in some case these nodes will have really less non CMA * allocation memory. + * + * Note that CMA region is prohibited by allocation scope. */ - gfp_t gfp_mask = GFP_USER | __GFP_NOWARN; + gfp_t gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_NOWARN; if (PageHighMem(page)) gfp_mask |= __GFP_HIGHMEM; @@ -1631,6 +1633,8 @@ static struct page *new_non_cma_page(struct page *page, unsigned long private) #ifdef CONFIG_HUGETLB_PAGE if (PageHuge(page)) { struct hstate *h = page_hstate(page); + + gfp_mask = htlb_modify_alloc_mask(h, gfp_mask); /* * We don't want to dequeue from the pool because pool pages will * mostly be from the CMA region. @@ -1645,11 +1649,6 @@ static struct page *new_non_cma_page(struct page *page, unsigned long private) */ gfp_t thp_gfpmask = GFP_TRANSHUGE | __GFP_NOWARN; - /* - * Remove the movable mask so that we don't allocate from - * CMA area again. 
- */ - thp_gfpmask &= ~__GFP_MOVABLE; thp = __alloc_pages_node(nid, thp_gfpmask, HPAGE_PMD_ORDER); if (!thp) return NULL; @@ -1795,7 +1794,6 @@ static long __gup_longterm_locked(struct task_struct *tsk, vmas_tmp, NULL, gup_flags); if (gup_flags & FOLL_LONGTERM) { - memalloc_nocma_restore(flags); if (rc < 0) goto out; @@ -1808,9 +1806,10 @@ static long __gup_longterm_locked(struct task_struct *tsk, rc = check_and_migrate_cma_pages(tsk, mm, start, rc, pages, vmas_tmp, gup_flags); +out: + memalloc_nocma_restore(flags); } -out: if (vmas_tmp != vmas) kfree(vmas_tmp); return rc; From bbe88753bd42b1faf1458dde8f58ff1239990436 Mon Sep 17 00:00:00 2001 From: Joonsoo Kim Date: Tue, 11 Aug 2020 18:37:38 -0700 Subject: [PATCH 138/164] mm/hugetlb: make hugetlb migration callback CMA aware new_non_cma_page() in gup.c needs to allocate a new page that is not in the CMA area. new_non_cma_page() implements this by using the allocation scope APIs. However, there is a work-around for hugetlb. The normal hugetlb page allocation API for migration is alloc_huge_page_nodemask(). It consists of two steps: first, dequeuing from the pool; second, if no page is available on the queue, allocating from the page allocator. new_non_cma_page() can't use this API since the first step (dequeue) isn't aware of the scope API that excludes the CMA area. So, new_non_cma_page() exports the hugetlb-internal function for the second step, alloc_migrate_huge_page(), to global scope and uses it directly. This is suboptimal since hugetlb pages on the queue cannot be utilized. This patch tries to fix this situation by making the hugetlb dequeue function CMA aware. In the dequeue function, CMA memory is skipped if the PF_MEMALLOC_NOCMA flag is found. Signed-off-by: Joonsoo Kim Signed-off-by: Andrew Morton Acked-by: Mike Kravetz Acked-by: Vlastimil Babka Acked-by: Michal Hocko Cc: "Aneesh Kumar K . V" Cc: Christoph Hellwig Cc: Naoya Horiguchi Cc: Roman Gushchin Link: http://lkml.kernel.org/r/1596180906-8442-2-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds --- include/linux/hugetlb.h | 2 -- mm/gup.c | 6 +----- mm/hugetlb.c | 11 +++++++++-- 3 files changed, 10 insertions(+), 9 deletions(-) diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index d86c82749836..d5cc5f802dd4 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -511,8 +511,6 @@ struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid, nodemask_t *nmask, gfp_t gfp_mask); struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct *vma, unsigned long address); -struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask, - int nid, nodemask_t *nmask); int huge_add_to_page_cache(struct page *page, struct address_space *mapping, pgoff_t idx); diff --git a/mm/gup.c b/mm/gup.c index 6c20c6e37635..c55427beeb26 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1635,11 +1635,7 @@ static struct page *new_non_cma_page(struct page *page, unsigned long private) struct hstate *h = page_hstate(page); gfp_mask = htlb_modify_alloc_mask(h, gfp_mask); - /* - * We don't want to dequeue from the pool because pool pages will - * mostly be from the CMA region. 
- */ - return alloc_migrate_huge_page(h, gfp_mask, nid, NULL); + return alloc_huge_page_nodemask(h, nid, NULL, gfp_mask); } #endif if (PageTransHuge(page)) { diff --git a/mm/hugetlb.c b/mm/hugetlb.c index eaab9ef88e9d..a301c2d672bf 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -19,6 +19,7 @@ #include #include #include +#include #include #include #include @@ -1040,10 +1041,16 @@ static void enqueue_huge_page(struct hstate *h, struct page *page) static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid) { struct page *page; + bool nocma = !!(current->flags & PF_MEMALLOC_NOCMA); + + list_for_each_entry(page, &h->hugepage_freelists[nid], lru) { + if (nocma && is_migrate_cma_page(page)) + continue; - list_for_each_entry(page, &h->hugepage_freelists[nid], lru) if (!PageHWPoison(page)) break; + } + /* * if 'non-isolated free hugepage' not found on the list, * the allocation fails. @@ -1935,7 +1942,7 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask, return page; } -struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask, +static struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask, int nid, nodemask_t *nmask) { struct page *page; From ed03d924587e7dfc54f1ace81e5f5a497a7d9666 Mon Sep 17 00:00:00 2001 From: Joonsoo Kim Date: Tue, 11 Aug 2020 18:37:41 -0700 Subject: [PATCH 139/164] mm/gup: use a standard migration target allocation callback There is a well-defined migration target allocation callback. Use it. Signed-off-by: Joonsoo Kim Signed-off-by: Andrew Morton Acked-by: Vlastimil Babka Acked-by: Michal Hocko Cc: "Aneesh Kumar K . V" Cc: Christoph Hellwig Cc: Mike Kravetz Cc: Naoya Horiguchi Cc: Roman Gushchin Link: http://lkml.kernel.org/r/1596180906-8442-3-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds --- mm/gup.c | 54 ++++++------------------------------------------------ 1 file changed, 6 insertions(+), 48 deletions(-) diff --git a/mm/gup.c b/mm/gup.c index c55427beeb26..e9d1d0cc18f0 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1609,52 +1609,6 @@ static bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages) } #ifdef CONFIG_CMA -static struct page *new_non_cma_page(struct page *page, unsigned long private) -{ - /* - * We want to make sure we allocate the new page from the same node - * as the source page. - */ - int nid = page_to_nid(page); - /* - * Trying to allocate a page for migration. Ignore allocation - * failure warnings. We don't force __GFP_THISNODE here because - * this node here is the node where we have CMA reservation and - * in some case these nodes will have really less non CMA - * allocation memory. - * - * Note that CMA region is prohibited by allocation scope. 
- */ - gfp_t gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_NOWARN; - - if (PageHighMem(page)) - gfp_mask |= __GFP_HIGHMEM; - -#ifdef CONFIG_HUGETLB_PAGE - if (PageHuge(page)) { - struct hstate *h = page_hstate(page); - - gfp_mask = htlb_modify_alloc_mask(h, gfp_mask); - return alloc_huge_page_nodemask(h, nid, NULL, gfp_mask); - } -#endif - if (PageTransHuge(page)) { - struct page *thp; - /* - * ignore allocation failure warnings - */ - gfp_t thp_gfpmask = GFP_TRANSHUGE | __GFP_NOWARN; - - thp = __alloc_pages_node(nid, thp_gfpmask, HPAGE_PMD_ORDER); - if (!thp) - return NULL; - prep_transhuge_page(thp); - return thp; - } - - return __alloc_pages_node(nid, gfp_mask, 0); -} - static long check_and_migrate_cma_pages(struct task_struct *tsk, struct mm_struct *mm, unsigned long start, @@ -1669,6 +1623,10 @@ static long check_and_migrate_cma_pages(struct task_struct *tsk, bool migrate_allow = true; LIST_HEAD(cma_page_list); long ret = nr_pages; + struct migration_target_control mtc = { + .nid = NUMA_NO_NODE, + .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_NOWARN, + }; check_again: for (i = 0; i < nr_pages;) { @@ -1714,8 +1672,8 @@ static long check_and_migrate_cma_pages(struct task_struct *tsk, for (i = 0; i < nr_pages; i++) put_page(pages[i]); - if (migrate_pages(&cma_page_list, new_non_cma_page, - NULL, 0, MIGRATE_SYNC, MR_CONTIG_RANGE)) { + if (migrate_pages(&cma_page_list, alloc_migration_target, NULL, + (unsigned long)&mtc, MIGRATE_SYNC, MR_CONTIG_RANGE)) { /* * some of the pages failed migration. Do get_user_pages * without migration. From bce617edecada007aee8610fbe2c14d10b8de2f6 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Tue, 11 Aug 2020 18:37:44 -0700 Subject: [PATCH 140/164] mm: do page fault accounting in handle_mm_fault Patch series "mm: Page fault accounting cleanups", v5. This is v5 of the pf accounting cleanup series. It originates from Gerald Schaefer's report on an issue a week ago regarding incorrect page fault accounting for retried page faults after commit 4064b9827063 ("mm: allow VM_FAULT_RETRY for multiple times"): https://lore.kernel.org/lkml/20200610174811.44b94525@thinkpad/ What this series did: - Correct page fault accounting: we do accounting for a page fault (no matter whether it's from #PF handling, or gup, or anything else) only with the one that completed the fault. For example, page fault retries should not be counted in page fault counters. The same applies to the perf events. - Unify definition of PERF_COUNT_SW_PAGE_FAULTS: currently this perf event is used in an ad-hoc way across different archs. Case (1): for many archs it's done at the entry of a page fault handler, so that it will also cover e.g. erroneous faults. Case (2): for some other archs, it is only accounted when the page fault is resolved successfully. Case (3): there are still quite a few archs that have not enabled this perf event. Since this series touches nearly all the archs, we unify this perf event to always follow case (1), which is the one that makes most sense. And since we moved the accounting into handle_mm_fault, the other two MAJ/MIN perf events are well taken care of naturally. - Unify definition of "major faults": the definition of "major fault" is slightly changed when used in accounting (not VM_FAULT_MAJOR). More information in patch 1. - Always account the page fault onto the one that triggered the page fault. This does not matter much for #PF handling, but it does for gup. More information on this in patch 25. 
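To make the accounting rule above concrete, here is a condensed sketch modelled on the helper that patch 1 of this series adds to mm/memory.c (simplified, and the name is ours; see the patch itself for the exact merged code):

static void mm_account_fault_sketch(struct pt_regs *regs,
				    unsigned long address,
				    unsigned int flags, vm_fault_t ret)
{
	bool major;

	/* Only the attempt that completes the fault is accounted;
	 * retried or failed attempts are skipped. */
	if (ret & (VM_FAULT_RETRY | VM_FAULT_ERROR))
		return;

	/* "Major" means the fault itself was major, or it only
	 * succeeded after at least one retry. */
	major = (ret & VM_FAULT_MAJOR) || (flags & FAULT_FLAG_TRIED);

	if (major)
		current->maj_flt++;
	else
		current->min_flt++;

	/* Perf events need the interrupted context; gup-style callers
	 * pass a NULL regs and skip them. */
	if (!regs)
		return;

	if (major)
		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address);
	else
		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
}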
Patchset layout: Patch 1: Introduced the accounting in handle_mm_fault(), not enabled. Patches 2-23: Enable the new accounting for arch #PF handlers one by one. Patch 24: Enable the new accounting for the remaining outliers (gup, iommu, etc.) Patch 25: Cleanup GUP task_struct pointer since it's not needed any more This patch (of 25): This is a preparation patch to move page fault accounting into the general code in handle_mm_fault(). This includes both the per-task maj_flt/min_flt counters, and the major/minor page fault perf events. To do this, the pt_regs pointer is passed into handle_mm_fault(). PERF_COUNT_SW_PAGE_FAULTS should still be kept in per-arch page fault handlers. So far, all the pt_regs pointers passed into handle_mm_fault() are NULL, which means this patch should have no intended functional change. Suggested-by: Linus Torvalds Signed-off-by: Peter Xu Signed-off-by: Andrew Morton Cc: Albert Ou Cc: Alexander Gordeev Cc: Andy Lutomirski Cc: Benjamin Herrenschmidt Cc: Borislav Petkov Cc: Brian Cain Cc: Catalin Marinas Cc: Christian Borntraeger Cc: Chris Zankel Cc: Dave Hansen Cc: David S. Miller Cc: Geert Uytterhoeven Cc: Gerald Schaefer Cc: Greentime Hu Cc: Guo Ren Cc: Heiko Carstens Cc: Helge Deller Cc: H. Peter Anvin Cc: Ingo Molnar Cc: Ivan Kokshaysky Cc: James E.J. Bottomley Cc: John Hubbard Cc: Jonas Bonn Cc: Ley Foon Tan Cc: "Luck, Tony" Cc: Matt Turner Cc: Max Filippov Cc: Michael Ellerman Cc: Michal Simek Cc: Nick Hu Cc: Palmer Dabbelt Cc: Paul Mackerras Cc: Paul Walmsley Cc: Pekka Enberg Cc: Peter Zijlstra Cc: Richard Henderson Cc: Rich Felker Cc: Russell King Cc: Stafford Horne Cc: Stefan Kristiansson Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Vincent Chen Cc: Vineet Gupta Cc: Will Deacon Cc: Yoshinori Sato Link: http://lkml.kernel.org/r/20200707225021.200906-1-peterx@redhat.com Link: http://lkml.kernel.org/r/20200707225021.200906-2-peterx@redhat.com Signed-off-by: Linus Torvalds --- arch/alpha/mm/fault.c | 2 +- arch/arc/mm/fault.c | 2 +- arch/arm/mm/fault.c | 2 +- arch/arm64/mm/fault.c | 2 +- arch/csky/mm/fault.c | 3 +- arch/hexagon/mm/vm_fault.c | 2 +- arch/ia64/mm/fault.c | 2 +- arch/m68k/mm/fault.c | 2 +- arch/microblaze/mm/fault.c | 2 +- arch/mips/mm/fault.c | 2 +- arch/nds32/mm/fault.c | 2 +- arch/nios2/mm/fault.c | 2 +- arch/openrisc/mm/fault.c | 2 +- arch/parisc/mm/fault.c | 2 +- arch/powerpc/mm/copro_fault.c | 2 +- arch/powerpc/mm/fault.c | 2 +- arch/riscv/mm/fault.c | 2 +- arch/s390/mm/fault.c | 2 +- arch/sh/mm/fault.c | 2 +- arch/sparc/mm/fault_32.c | 4 +-- arch/sparc/mm/fault_64.c | 2 +- arch/um/kernel/trap.c | 2 +- arch/x86/mm/fault.c | 2 +- arch/xtensa/mm/fault.c | 2 +- drivers/iommu/amd/iommu_v2.c | 2 +- drivers/iommu/intel/svm.c | 3 +- include/linux/mm.h | 7 ++-- mm/gup.c | 4 +-- mm/hmm.c | 3 +- mm/ksm.c | 3 +- mm/memory.c | 64 ++++++++++++++++++++++++++++++++++- 31 files changed, 103 insertions(+), 34 deletions(-) diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c index c2303a8c2b9f..1983e43a5e2f 100644 --- a/arch/alpha/mm/fault.c +++ b/arch/alpha/mm/fault.c @@ -148,7 +148,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr, /* If for any reason at all we couldn't handle the fault, make sure we exit gracefully rather than endlessly redo the fault. 
*/ - fault = handle_mm_fault(vma, address, flags); + fault = handle_mm_fault(vma, address, flags, NULL); if (fault_signal_pending(fault, regs)) return; diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c index 7287c793d1c9..587dea524e6b 100644 --- a/arch/arc/mm/fault.c +++ b/arch/arc/mm/fault.c @@ -130,7 +130,7 @@ void do_page_fault(unsigned long address, struct pt_regs *regs) goto bad_area; } - fault = handle_mm_fault(vma, address, flags); + fault = handle_mm_fault(vma, address, flags, NULL); /* Quick path to respond to signals */ if (fault_signal_pending(fault, regs)) { diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c index c6550eddfce1..01a8e0f8fef7 100644 --- a/arch/arm/mm/fault.c +++ b/arch/arm/mm/fault.c @@ -224,7 +224,7 @@ __do_page_fault(struct mm_struct *mm, unsigned long addr, unsigned int fsr, goto out; } - return handle_mm_fault(vma, addr & PAGE_MASK, flags); + return handle_mm_fault(vma, addr & PAGE_MASK, flags, NULL); check_stack: /* Don't allow expansion below FIRST_USER_ADDRESS */ diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c index 8afb238ff335..be29f4076fe3 100644 --- a/arch/arm64/mm/fault.c +++ b/arch/arm64/mm/fault.c @@ -428,7 +428,7 @@ static vm_fault_t __do_page_fault(struct mm_struct *mm, unsigned long addr, */ if (!(vma->vm_flags & vm_flags)) return VM_FAULT_BADACCESS; - return handle_mm_fault(vma, addr & PAGE_MASK, mm_flags); + return handle_mm_fault(vma, addr & PAGE_MASK, mm_flags, NULL); } static bool is_el0_instruction_abort(unsigned int esr) diff --git a/arch/csky/mm/fault.c b/arch/csky/mm/fault.c index b1dce9f2f04d..b252e6e4d32f 100644 --- a/arch/csky/mm/fault.c +++ b/arch/csky/mm/fault.c @@ -150,7 +150,8 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long write, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(vma, address, write ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(vma, address, write ? FAULT_FLAG_WRITE : 0, + NULL); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/hexagon/mm/vm_fault.c b/arch/hexagon/mm/vm_fault.c index cd3808f96b93..f12f330e7946 100644 --- a/arch/hexagon/mm/vm_fault.c +++ b/arch/hexagon/mm/vm_fault.c @@ -88,7 +88,7 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs) break; } - fault = handle_mm_fault(vma, address, flags); + fault = handle_mm_fault(vma, address, flags, NULL); if (fault_signal_pending(fault, regs)) return; diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c index 3a4dec334cc5..abf2808f9b4b 100644 --- a/arch/ia64/mm/fault.c +++ b/arch/ia64/mm/fault.c @@ -143,7 +143,7 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re * sure we exit gracefully rather than endlessly redo the * fault. */ - fault = handle_mm_fault(vma, address, flags); + fault = handle_mm_fault(vma, address, flags, NULL); if (fault_signal_pending(fault, regs)) return; diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c index 508abb63da67..08b35a318ebe 100644 --- a/arch/m68k/mm/fault.c +++ b/arch/m68k/mm/fault.c @@ -134,7 +134,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long address, * the fault. 
*/ - fault = handle_mm_fault(vma, address, flags); + fault = handle_mm_fault(vma, address, flags, NULL); pr_debug("handle_mm_fault returns %x\n", fault); if (fault_signal_pending(fault, regs)) diff --git a/arch/microblaze/mm/fault.c b/arch/microblaze/mm/fault.c index a2bfe587b491..1a3d4c4ca28b 100644 --- a/arch/microblaze/mm/fault.c +++ b/arch/microblaze/mm/fault.c @@ -214,7 +214,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long address, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(vma, address, flags); + fault = handle_mm_fault(vma, address, flags, NULL); if (fault_signal_pending(fault, regs)) return; diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c index 01b168a90434..b1db39784db9 100644 --- a/arch/mips/mm/fault.c +++ b/arch/mips/mm/fault.c @@ -152,7 +152,7 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, unsigned long write, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(vma, address, flags); + fault = handle_mm_fault(vma, address, flags, NULL); if (fault_signal_pending(fault, regs)) return; diff --git a/arch/nds32/mm/fault.c b/arch/nds32/mm/fault.c index 8fb73f6401a0..d0ecc8fb5b23 100644 --- a/arch/nds32/mm/fault.c +++ b/arch/nds32/mm/fault.c @@ -206,7 +206,7 @@ void do_page_fault(unsigned long entry, unsigned long addr, * the fault. */ - fault = handle_mm_fault(vma, addr, flags); + fault = handle_mm_fault(vma, addr, flags, NULL); /* * If we need to retry but a fatal signal is pending, handle the diff --git a/arch/nios2/mm/fault.c b/arch/nios2/mm/fault.c index 4112ef0e247e..86beb9a2698e 100644 --- a/arch/nios2/mm/fault.c +++ b/arch/nios2/mm/fault.c @@ -131,7 +131,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long cause, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(vma, address, flags); + fault = handle_mm_fault(vma, address, flags, NULL); if (fault_signal_pending(fault, regs)) return; diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c index d2224ccca294..3daa491d1edb 100644 --- a/arch/openrisc/mm/fault.c +++ b/arch/openrisc/mm/fault.c @@ -159,7 +159,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address, * the fault. */ - fault = handle_mm_fault(vma, address, flags); + fault = handle_mm_fault(vma, address, flags, NULL); if (fault_signal_pending(fault, regs)) return; diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c index 66ac0719bd49..e32d06928c24 100644 --- a/arch/parisc/mm/fault.c +++ b/arch/parisc/mm/fault.c @@ -302,7 +302,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long code, * fault. */ - fault = handle_mm_fault(vma, address, flags); + fault = handle_mm_fault(vma, address, flags, NULL); if (fault_signal_pending(fault, regs)) return; diff --git a/arch/powerpc/mm/copro_fault.c b/arch/powerpc/mm/copro_fault.c index b83abbead4a2..2d0276abe0a6 100644 --- a/arch/powerpc/mm/copro_fault.c +++ b/arch/powerpc/mm/copro_fault.c @@ -64,7 +64,7 @@ int copro_handle_mm_fault(struct mm_struct *mm, unsigned long ea, } ret = 0; - *flt = handle_mm_fault(vma, ea, is_write ? FAULT_FLAG_WRITE : 0); + *flt = handle_mm_fault(vma, ea, is_write ? 
FAULT_FLAG_WRITE : 0, NULL); if (unlikely(*flt & VM_FAULT_ERROR)) { if (*flt & VM_FAULT_OOM) { ret = -ENOMEM; diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index 925a7231abb3..c6a5225a3521 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -511,7 +511,7 @@ static int __do_page_fault(struct pt_regs *regs, unsigned long address, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(vma, address, flags); + fault = handle_mm_fault(vma, address, flags, NULL); major |= fault & VM_FAULT_MAJOR; diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c index 5873835a3e6b..30c1124d0fb6 100644 --- a/arch/riscv/mm/fault.c +++ b/arch/riscv/mm/fault.c @@ -109,7 +109,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs) * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(vma, addr, flags); + fault = handle_mm_fault(vma, addr, flags, NULL); /* * If we need to retry but a fatal signal is pending, handle the diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c index aebf9183bedd..ad783aaaf649 100644 --- a/arch/s390/mm/fault.c +++ b/arch/s390/mm/fault.c @@ -476,7 +476,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access) * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(vma, address, flags); + fault = handle_mm_fault(vma, address, flags, NULL); if (fault_signal_pending(fault, regs)) { fault = VM_FAULT_SIGNAL; if (flags & FAULT_FLAG_RETRY_NOWAIT) diff --git a/arch/sh/mm/fault.c b/arch/sh/mm/fault.c index fbe1f2fe9a8c..3c0a11827f7e 100644 --- a/arch/sh/mm/fault.c +++ b/arch/sh/mm/fault.c @@ -482,7 +482,7 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(vma, address, flags); + fault = handle_mm_fault(vma, address, flags, NULL); if (unlikely(fault & (VM_FAULT_RETRY | VM_FAULT_ERROR))) if (mm_fault_error(regs, error_code, address, fault)) diff --git a/arch/sparc/mm/fault_32.c b/arch/sparc/mm/fault_32.c index cfef656eda0f..06af03db4417 100644 --- a/arch/sparc/mm/fault_32.c +++ b/arch/sparc/mm/fault_32.c @@ -234,7 +234,7 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write, * make sure we exit gracefully rather than endlessly redo * the fault. 
*/ - fault = handle_mm_fault(vma, address, flags); + fault = handle_mm_fault(vma, address, flags, NULL); if (fault_signal_pending(fault, regs)) return; @@ -410,7 +410,7 @@ static void force_user_fault(unsigned long address, int write) if (!(vma->vm_flags & (VM_READ | VM_EXEC))) goto bad_area; } - switch (handle_mm_fault(vma, address, flags)) { + switch (handle_mm_fault(vma, address, flags, NULL)) { case VM_FAULT_SIGBUS: case VM_FAULT_OOM: goto do_sigbus; diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c index a3806614e4dc..9ebee14ee893 100644 --- a/arch/sparc/mm/fault_64.c +++ b/arch/sparc/mm/fault_64.c @@ -422,7 +422,7 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs) goto bad_area; } - fault = handle_mm_fault(vma, address, flags); + fault = handle_mm_fault(vma, address, flags, NULL); if (fault_signal_pending(fault, regs)) goto exit_exception; diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c index 2b3afa354a90..8d9870d76da1 100644 --- a/arch/um/kernel/trap.c +++ b/arch/um/kernel/trap.c @@ -71,7 +71,7 @@ int handle_page_fault(unsigned long address, unsigned long ip, do { vm_fault_t fault; - fault = handle_mm_fault(vma, address, flags); + fault = handle_mm_fault(vma, address, flags, NULL); if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) goto out_nosemaphore; diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 0c7643d9f7cb..e1bf5555d80a 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1291,7 +1291,7 @@ void do_user_addr_fault(struct pt_regs *regs, * userland). The return to userland is identified whenever * FAULT_FLAG_USER|FAULT_FLAG_KILLABLE are both set in flags. */ - fault = handle_mm_fault(vma, address, flags); + fault = handle_mm_fault(vma, address, flags, NULL); major |= fault & VM_FAULT_MAJOR; /* Quick path to respond to signals */ diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c index c128dcc7c85b..e72c8c1359a6 100644 --- a/arch/xtensa/mm/fault.c +++ b/arch/xtensa/mm/fault.c @@ -107,7 +107,7 @@ void do_page_fault(struct pt_regs *regs) * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(vma, address, flags); + fault = handle_mm_fault(vma, address, flags, NULL); if (fault_signal_pending(fault, regs)) return; diff --git a/drivers/iommu/amd/iommu_v2.c b/drivers/iommu/amd/iommu_v2.c index e4b025c5637c..c259108ab6dd 100644 --- a/drivers/iommu/amd/iommu_v2.c +++ b/drivers/iommu/amd/iommu_v2.c @@ -495,7 +495,7 @@ static void do_fault(struct work_struct *work) if (access_error(vma, fault)) goto out; - ret = handle_mm_fault(vma, address, flags); + ret = handle_mm_fault(vma, address, flags, NULL); out: mmap_read_unlock(mm); diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c index 6c87c807a0ab..5ae59a6ad681 100644 --- a/drivers/iommu/intel/svm.c +++ b/drivers/iommu/intel/svm.c @@ -872,7 +872,8 @@ static irqreturn_t prq_event_thread(int irq, void *d) goto invalid; ret = handle_mm_fault(vma, address, - req->wr_req ? FAULT_FLAG_WRITE : 0); + req->wr_req ? 
FAULT_FLAG_WRITE : 0, + NULL); if (ret & VM_FAULT_ERROR) goto invalid; diff --git a/include/linux/mm.h b/include/linux/mm.h index f97b10117d44..ec0ffb423769 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -38,6 +38,7 @@ struct file_ra_state; struct user_struct; struct writeback_control; struct bdi_writeback; +struct pt_regs; void init_mm_internals(void); @@ -1658,7 +1659,8 @@ int invalidate_inode_page(struct page *page); #ifdef CONFIG_MMU extern vm_fault_t handle_mm_fault(struct vm_area_struct *vma, - unsigned long address, unsigned int flags); + unsigned long address, unsigned int flags, + struct pt_regs *regs); extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm, unsigned long address, unsigned int fault_flags, bool *unlocked); @@ -1668,7 +1670,8 @@ void unmap_mapping_range(struct address_space *mapping, loff_t const holebegin, loff_t const holelen, int even_cows); #else static inline vm_fault_t handle_mm_fault(struct vm_area_struct *vma, - unsigned long address, unsigned int flags) + unsigned long address, unsigned int flags, + struct pt_regs *regs) { /* should never happen if there's no MMU */ BUG(); diff --git a/mm/gup.c b/mm/gup.c index e9d1d0cc18f0..ae7121d729fa 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -884,7 +884,7 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma, fault_flags |= FAULT_FLAG_TRIED; } - ret = handle_mm_fault(vma, address, fault_flags); + ret = handle_mm_fault(vma, address, fault_flags, NULL); if (ret & VM_FAULT_ERROR) { int err = vm_fault_to_errno(ret, *flags); @@ -1238,7 +1238,7 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm, fatal_signal_pending(current)) return -EINTR; - ret = handle_mm_fault(vma, address, fault_flags); + ret = handle_mm_fault(vma, address, fault_flags, NULL); major |= ret & VM_FAULT_MAJOR; if (ret & VM_FAULT_ERROR) { int err = vm_fault_to_errno(ret, 0); diff --git a/mm/hmm.c b/mm/hmm.c index bb279319bf40..943cb2ba4442 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -75,7 +75,8 @@ static int hmm_vma_fault(unsigned long addr, unsigned long end, } for (; addr < end; addr += PAGE_SIZE) - if (handle_mm_fault(vma, addr, fault_flags) & VM_FAULT_ERROR) + if (handle_mm_fault(vma, addr, fault_flags, NULL) & + VM_FAULT_ERROR) return -EFAULT; return -EBUSY; } diff --git a/mm/ksm.c b/mm/ksm.c index 217842a66912..0aa2247bddd7 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -480,7 +480,8 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr) break; if (PageKsm(page)) ret = handle_mm_fault(vma, addr, - FAULT_FLAG_WRITE | FAULT_FLAG_REMOTE); + FAULT_FLAG_WRITE | FAULT_FLAG_REMOTE, + NULL); else ret = VM_FAULT_WRITE; put_page(page); diff --git a/mm/memory.c b/mm/memory.c index 325bb575e7ec..9b7d35734caa 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -71,6 +71,8 @@ #include #include #include +#include +#include #include @@ -4356,6 +4358,64 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma, return handle_pte_fault(&vmf); } +/** + * mm_account_fault - Do page fault accountings + * + * @regs: the pt_regs struct pointer. When set to NULL, will skip accounting + * of perf event counters, but we'll still do the per-task accounting to + * the task who triggered this page fault. + * @address: the faulted address. + * @flags: the fault flags. + * @ret: the fault retcode. + * + * This will take care of most of the page fault accountings. Meanwhile, it + * will also include the PERF_COUNT_SW_PAGE_FAULTS_[MAJ|MIN] perf counter + * updates. 
However note that the handling of PERF_COUNT_SW_PAGE_FAULTS should + * still be in per-arch page fault handlers at the entry of page fault. + */ +static inline void mm_account_fault(struct pt_regs *regs, + unsigned long address, unsigned int flags, + vm_fault_t ret) +{ + bool major; + + /* + * We don't do accounting for some specific faults: + * + * - Unsuccessful faults (e.g. when the address wasn't valid). That + * includes arch_vma_access_permitted() failing before reaching here. + * So this is not a "this many hardware page faults" counter. We + * should use the hw profiling for that. + * + * - Incomplete faults (VM_FAULT_RETRY). They will only be counted + * once they're completed. + */ + if (ret & (VM_FAULT_ERROR | VM_FAULT_RETRY)) + return; + + /* + * We define the fault as a major fault when the final successful fault + * is VM_FAULT_MAJOR, or if it retried (which implies that we couldn't + * handle it immediately previously). + */ + major = (ret & VM_FAULT_MAJOR) || (flags & FAULT_FLAG_TRIED); + + /* + * If the fault is done for GUP, regs will be NULL, and we will skip + * the fault accounting. + */ + if (!regs) + return; + + if (major) { + current->maj_flt++; + perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address); + } else { + current->min_flt++; + perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address); + } +} + /* * By the time we get here, we already hold the mm semaphore * @@ -4363,7 +4423,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma, * return value. See filemap_fault() and __lock_page_or_retry(). */ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address, - unsigned int flags) + unsigned int flags, struct pt_regs *regs) { vm_fault_t ret; @@ -4404,6 +4464,8 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address, mem_cgroup_oom_synchronize(false); } + mm_account_fault(regs, address, flags, ret); + return ret; } EXPORT_SYMBOL_GPL(handle_mm_fault); From c0f6eda41f97dbe4e4feda914c18d019b9b9fccd Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Tue, 11 Aug 2020 18:37:49 -0700 Subject: [PATCH 141/164] mm/alpha: use general page fault accounting Use the general page fault accounting by passing regs into handle_mm_fault(). Add the missing PERF_COUNT_SW_PAGE_FAULTS perf events too. Note, the other two perf events (PERF_COUNT_SW_PAGE_FAULTS_[MAJ|MIN]) were done in handle_mm_fault(). Signed-off-by: Peter Xu Signed-off-by: Andrew Morton Cc: Richard Henderson Cc: Ivan Kokshaysky Cc: Matt Turner Link: http://lkml.kernel.org/r/20200707225021.200906-3-peterx@redhat.com Signed-off-by: Linus Torvalds --- arch/alpha/mm/fault.c | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c index 1983e43a5e2f..09172f017efc 100644 --- a/arch/alpha/mm/fault.c +++ b/arch/alpha/mm/fault.c @@ -25,6 +25,7 @@ #include #include #include +#include extern void die_if_kernel(char *,struct pt_regs *,long, unsigned long *); @@ -116,6 +117,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr, #endif if (user_mode(regs)) flags |= FAULT_FLAG_USER; + perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); retry: mmap_read_lock(mm); vma = find_vma(mm, address); @@ -148,7 +150,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr, /* If for any reason at all we couldn't handle the fault, make sure we exit gracefully rather than endlessly redo the fault. 
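The pattern applied here, and repeated by most of the conversions that follow, can be sketched as below. This is an illustrative condensation of the hunks, not a literal quote from any one architecture:

	/*
	 * Count the software page-fault event exactly once per fault,
	 * before the retry loop, so a VM_FAULT_RETRY round trip cannot
	 * inflate the count.  Major/minor accounting now happens inside
	 * handle_mm_fault() via the regs pointer.
	 */
	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
retry:
	mmap_read_lock(mm);
	vma = find_vma(mm, address);
	/* ... access checks ... */
	fault = handle_mm_fault(vma, address, flags, regs);
	if (fault & VM_FAULT_RETRY) {
		flags |= FAULT_FLAG_TRIED;
		goto retry;	/* no second PERF_COUNT_SW_PAGE_FAULTS here */
	}
	mmap_read_unlock(mm);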
*/ - fault = handle_mm_fault(vma, address, flags, NULL); + fault = handle_mm_fault(vma, address, flags, regs); if (fault_signal_pending(fault, regs)) return; @@ -164,10 +166,6 @@ do_page_fault(unsigned long address, unsigned long mmcsr, } if (flags & FAULT_FLAG_ALLOW_RETRY) { - if (fault & VM_FAULT_MAJOR) - current->maj_flt++; - else - current->min_flt++; if (fault & VM_FAULT_RETRY) { flags |= FAULT_FLAG_TRIED; From 52e3f8d03052036ce97296915a3746421a1da1d0 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Tue, 11 Aug 2020 18:37:52 -0700 Subject: [PATCH 142/164] mm/arc: use general page fault accounting Use the general page fault accounting by passing regs into handle_mm_fault(). It naturally solve the issue of multiple page fault accounting when page fault retry happened. Fix PERF_COUNT_SW_PAGE_FAULTS perf event manually for page fault retries, by moving it before taking mmap_sem. Signed-off-by: Peter Xu Signed-off-by: Andrew Morton Cc: Vineet Gupta Link: http://lkml.kernel.org/r/20200707225021.200906-4-peterx@redhat.com Signed-off-by: Linus Torvalds --- arch/arc/mm/fault.c | 18 +++--------------- 1 file changed, 3 insertions(+), 15 deletions(-) diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c index 587dea524e6b..f5657cb68e4f 100644 --- a/arch/arc/mm/fault.c +++ b/arch/arc/mm/fault.c @@ -105,6 +105,7 @@ void do_page_fault(unsigned long address, struct pt_regs *regs) if (write) flags |= FAULT_FLAG_WRITE; + perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); retry: mmap_read_lock(mm); @@ -130,7 +131,7 @@ void do_page_fault(unsigned long address, struct pt_regs *regs) goto bad_area; } - fault = handle_mm_fault(vma, address, flags, NULL); + fault = handle_mm_fault(vma, address, flags, regs); /* Quick path to respond to signals */ if (fault_signal_pending(fault, regs)) { @@ -155,22 +156,9 @@ void do_page_fault(unsigned long address, struct pt_regs *regs) * Major/minor page fault accounting * (in case of retry we only land here once) */ - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); - - if (likely(!(fault & VM_FAULT_ERROR))) { - if (fault & VM_FAULT_MAJOR) { - tsk->maj_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, - regs, address); - } else { - tsk->min_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, - regs, address); - } - + if (likely(!(fault & VM_FAULT_ERROR))) /* Normal return path: fault Handled Gracefully */ return; - } if (!user_mode(regs)) goto no_context; From 79fea6c6548e28400d7870c61477d35aecb7baf8 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Tue, 11 Aug 2020 18:37:54 -0700 Subject: [PATCH 143/164] mm/arm: use general page fault accounting Use the general page fault accounting by passing regs into handle_mm_fault(). It naturally solve the issue of multiple page fault accounting when page fault retry happened. To do this, we need to pass the pt_regs pointer into __do_page_fault(). Fix PERF_COUNT_SW_PAGE_FAULTS perf event manually for page fault retries, by moving it before taking mmap_sem. 
Signed-off-by: Peter Xu Signed-off-by: Andrew Morton Cc: Russell King Cc: Will Deacon Link: http://lkml.kernel.org/r/20200707225021.200906-5-peterx@redhat.com Signed-off-by: Linus Torvalds --- arch/arm/mm/fault.c | 25 ++++++------------------- 1 file changed, 6 insertions(+), 19 deletions(-) diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c index 01a8e0f8fef7..efa402025031 100644 --- a/arch/arm/mm/fault.c +++ b/arch/arm/mm/fault.c @@ -202,7 +202,8 @@ static inline bool access_error(unsigned int fsr, struct vm_area_struct *vma) static vm_fault_t __kprobes __do_page_fault(struct mm_struct *mm, unsigned long addr, unsigned int fsr, - unsigned int flags, struct task_struct *tsk) + unsigned int flags, struct task_struct *tsk, + struct pt_regs *regs) { struct vm_area_struct *vma; vm_fault_t fault; @@ -224,7 +225,7 @@ __do_page_fault(struct mm_struct *mm, unsigned long addr, unsigned int fsr, goto out; } - return handle_mm_fault(vma, addr & PAGE_MASK, flags, NULL); + return handle_mm_fault(vma, addr & PAGE_MASK, flags, regs); check_stack: /* Don't allow expansion below FIRST_USER_ADDRESS */ @@ -266,6 +267,8 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs) if ((fsr & FSR_WRITE) && !(fsr & FSR_CM)) flags |= FAULT_FLAG_WRITE; + perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr); + /* * As per x86, we may deadlock here. However, since the kernel only * validly references user space from well defined areas of the code, @@ -290,7 +293,7 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs) #endif } - fault = __do_page_fault(mm, addr, fsr, flags, tsk); + fault = __do_page_fault(mm, addr, fsr, flags, tsk, regs); /* If we need to retry but a fatal signal is pending, handle the * signal first. We do not need to release the mmap_lock because @@ -302,23 +305,7 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs) return 0; } - /* - * Major/minor page fault accounting is only done on the - * initial attempt. If we go through a retry, it is extremely - * likely that the page will be found in page cache at that point. - */ - - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr); if (!(fault & VM_FAULT_ERROR) && flags & FAULT_FLAG_ALLOW_RETRY) { - if (fault & VM_FAULT_MAJOR) { - tsk->maj_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, - regs, addr); - } else { - tsk->min_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, - regs, addr); - } if (fault & VM_FAULT_RETRY) { flags |= FAULT_FLAG_TRIED; goto retry; From 6a1bb025d28e1026fead73b7b33e2dfccba3f4d2 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Tue, 11 Aug 2020 18:37:57 -0700 Subject: [PATCH 144/164] mm/arm64: use general page fault accounting Use the general page fault accounting by passing regs into handle_mm_fault(). It naturally solve the issue of multiple page fault accounting when page fault retry happened. To do this, we pass pt_regs pointer into __do_page_fault(). 
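Threading regs through also lets the arch drop its private major-fault tracking (the local "major |= fault & VM_FAULT_MAJOR" bookkeeping): mm_account_fault() in mm/memory.c, added earlier in this series, derives it on its own:

	/*
	 * A completed fault counts as major if it returned VM_FAULT_MAJOR,
	 * or if it only succeeded after a retry (FAULT_FLAG_TRIED set).
	 */
	major = (ret & VM_FAULT_MAJOR) || (flags & FAULT_FLAG_TRIED);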
Signed-off-by: Peter Xu Signed-off-by: Andrew Morton Acked-by: Will Deacon Cc: Catalin Marinas Link: http://lkml.kernel.org/r/20200707225021.200906-6-peterx@redhat.com Signed-off-by: Linus Torvalds --- arch/arm64/mm/fault.c | 29 ++++++----------------------- 1 file changed, 6 insertions(+), 23 deletions(-) diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c index be29f4076fe3..f07333e86c2f 100644 --- a/arch/arm64/mm/fault.c +++ b/arch/arm64/mm/fault.c @@ -404,7 +404,8 @@ static void do_bad_area(unsigned long addr, unsigned int esr, struct pt_regs *re #define VM_FAULT_BADACCESS 0x020000 static vm_fault_t __do_page_fault(struct mm_struct *mm, unsigned long addr, - unsigned int mm_flags, unsigned long vm_flags) + unsigned int mm_flags, unsigned long vm_flags, + struct pt_regs *regs) { struct vm_area_struct *vma = find_vma(mm, addr); @@ -428,7 +429,7 @@ static vm_fault_t __do_page_fault(struct mm_struct *mm, unsigned long addr, */ if (!(vma->vm_flags & vm_flags)) return VM_FAULT_BADACCESS; - return handle_mm_fault(vma, addr & PAGE_MASK, mm_flags, NULL); + return handle_mm_fault(vma, addr & PAGE_MASK, mm_flags, regs); } static bool is_el0_instruction_abort(unsigned int esr) @@ -450,7 +451,7 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr, { const struct fault_info *inf; struct mm_struct *mm = current->mm; - vm_fault_t fault, major = 0; + vm_fault_t fault; unsigned long vm_flags = VM_ACCESS_FLAGS; unsigned int mm_flags = FAULT_FLAG_DEFAULT; @@ -516,8 +517,7 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr, #endif } - fault = __do_page_fault(mm, addr, mm_flags, vm_flags); - major |= fault & VM_FAULT_MAJOR; + fault = __do_page_fault(mm, addr, mm_flags, vm_flags, regs); /* Quick path to respond to signals */ if (fault_signal_pending(fault, regs)) { @@ -538,25 +538,8 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr, * Handle the "normal" (no error) case first. */ if (likely(!(fault & (VM_FAULT_ERROR | VM_FAULT_BADMAP | - VM_FAULT_BADACCESS)))) { - /* - * Major/minor page fault accounting is only done - * once. If we go through a retry, it is extremely - * likely that the page will be found in page cache at - * that point. - */ - if (major) { - current->maj_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, - addr); - } else { - current->min_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, - addr); - } - + VM_FAULT_BADACCESS)))) return 0; - } /* * If we are in kernel mode at this point, we have no context to From a2a9e439baf8aca2af1e614fab7956c09091a6d1 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Tue, 11 Aug 2020 18:38:00 -0700 Subject: [PATCH 145/164] mm/csky: use general page fault accounting Use the general page fault accounting by passing regs into handle_mm_fault(). It naturally solve the issue of multiple page fault accounting when page fault retry happened. Signed-off-by: Peter Xu Signed-off-by: Andrew Morton Acked-by: Guo Ren Link: http://lkml.kernel.org/r/20200707225021.200906-7-peterx@redhat.com Signed-off-by: Linus Torvalds --- arch/csky/mm/fault.c | 12 +----------- 1 file changed, 1 insertion(+), 11 deletions(-) diff --git a/arch/csky/mm/fault.c b/arch/csky/mm/fault.c index b252e6e4d32f..081b178b41b1 100644 --- a/arch/csky/mm/fault.c +++ b/arch/csky/mm/fault.c @@ -151,7 +151,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long write, * the fault. */ fault = handle_mm_fault(vma, address, write ? 
FAULT_FLAG_WRITE : 0, - NULL); + regs); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; @@ -161,16 +161,6 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long write, goto bad_area; BUG(); } - if (fault & VM_FAULT_MAJOR) { - tsk->maj_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, - address); - } else { - tsk->min_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, - address); - } - mmap_read_unlock(mm); return; From e08157c3c4239fac29879143019c96498ba9c2bc Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Tue, 11 Aug 2020 18:38:03 -0700 Subject: [PATCH 146/164] mm/hexagon: use general page fault accounting Use the general page fault accounting by passing regs into handle_mm_fault(). It naturally solve the issue of multiple page fault accounting when page fault retry happened. Add the missing PERF_COUNT_SW_PAGE_FAULTS perf events too. Note, the other two perf events (PERF_COUNT_SW_PAGE_FAULTS_[MAJ|MIN]) were done in handle_mm_fault(). Signed-off-by: Peter Xu Signed-off-by: Andrew Morton Acked-by: Brian Cain Link: http://lkml.kernel.org/r/20200707225021.200906-8-peterx@redhat.com Signed-off-by: Linus Torvalds --- arch/hexagon/mm/vm_fault.c | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/arch/hexagon/mm/vm_fault.c b/arch/hexagon/mm/vm_fault.c index f12f330e7946..ef32c5a84ff3 100644 --- a/arch/hexagon/mm/vm_fault.c +++ b/arch/hexagon/mm/vm_fault.c @@ -18,6 +18,7 @@ #include #include #include +#include /* * Decode of hardware exception sends us to one of several @@ -53,6 +54,8 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs) if (user_mode(regs)) flags |= FAULT_FLAG_USER; + + perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); retry: mmap_read_lock(mm); vma = find_vma(mm, address); @@ -88,7 +91,7 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs) break; } - fault = handle_mm_fault(vma, address, flags, NULL); + fault = handle_mm_fault(vma, address, flags, regs); if (fault_signal_pending(fault, regs)) return; @@ -96,10 +99,6 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs) /* The most common case -- we are done. */ if (likely(!(fault & VM_FAULT_ERROR))) { if (flags & FAULT_FLAG_ALLOW_RETRY) { - if (fault & VM_FAULT_MAJOR) - current->maj_flt++; - else - current->min_flt++; if (fault & VM_FAULT_RETRY) { flags |= FAULT_FLAG_TRIED; goto retry; From b444eed891cf8656af1f59d239f77c5338481774 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Tue, 11 Aug 2020 18:38:06 -0700 Subject: [PATCH 147/164] mm/ia64: use general page fault accounting Use the general page fault accounting by passing regs into handle_mm_fault(). It naturally solve the issue of multiple page fault accounting when page fault retry happened. Add the missing PERF_COUNT_SW_PAGE_FAULTS perf events too. Note, the other two perf events (PERF_COUNT_SW_PAGE_FAULTS_[MAJ|MIN]) were done in handle_mm_fault(). 
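For reference, perf_sw_event() and the PERF_COUNT_SW_PAGE_FAULTS* event ids used throughout these conversions come from the perf header, which is presumably what the include hunks in these patches pull in:

	#include <linux/perf_event.h>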
Signed-off-by: Peter Xu Signed-off-by: Andrew Morton Cc: "Luck, Tony" Link: http://lkml.kernel.org/r/20200707225021.200906-9-peterx@redhat.com Signed-off-by: Linus Torvalds --- arch/ia64/mm/fault.c | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c index abf2808f9b4b..cd9766d2b6e0 100644 --- a/arch/ia64/mm/fault.c +++ b/arch/ia64/mm/fault.c @@ -14,6 +14,7 @@ #include #include #include +#include #include #include @@ -105,6 +106,8 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re flags |= FAULT_FLAG_USER; if (mask & VM_WRITE) flags |= FAULT_FLAG_WRITE; + + perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); retry: mmap_read_lock(mm); @@ -143,7 +146,7 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re * sure we exit gracefully rather than endlessly redo the * fault. */ - fault = handle_mm_fault(vma, address, flags, NULL); + fault = handle_mm_fault(vma, address, flags, regs); if (fault_signal_pending(fault, regs)) return; @@ -166,10 +169,6 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re } if (flags & FAULT_FLAG_ALLOW_RETRY) { - if (fault & VM_FAULT_MAJOR) - current->maj_flt++; - else - current->min_flt++; if (fault & VM_FAULT_RETRY) { flags |= FAULT_FLAG_TRIED; From e1c17f627b4232e2c7e9edb7151fa60a746150ee Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Tue, 11 Aug 2020 18:38:09 -0700 Subject: [PATCH 148/164] mm/m68k: use general page fault accounting Use the general page fault accounting by passing regs into handle_mm_fault(). It naturally solve the issue of multiple page fault accounting when page fault retry happened. Add the missing PERF_COUNT_SW_PAGE_FAULTS perf events too. Note, the other two perf events (PERF_COUNT_SW_PAGE_FAULTS_[MAJ|MIN]) were done in handle_mm_fault(). Signed-off-by: Peter Xu Signed-off-by: Andrew Morton Cc: Geert Uytterhoeven Link: http://lkml.kernel.org/r/20200707225021.200906-10-peterx@redhat.com Signed-off-by: Linus Torvalds --- arch/m68k/mm/fault.c | 14 ++++---------- 1 file changed, 4 insertions(+), 10 deletions(-) diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c index 08b35a318ebe..795f483b1050 100644 --- a/arch/m68k/mm/fault.c +++ b/arch/m68k/mm/fault.c @@ -12,6 +12,7 @@ #include #include #include +#include #include #include @@ -84,6 +85,8 @@ int do_page_fault(struct pt_regs *regs, unsigned long address, if (user_mode(regs)) flags |= FAULT_FLAG_USER; + + perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); retry: mmap_read_lock(mm); @@ -134,7 +137,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long address, * the fault. */ - fault = handle_mm_fault(vma, address, flags, NULL); + fault = handle_mm_fault(vma, address, flags, regs); pr_debug("handle_mm_fault returns %x\n", fault); if (fault_signal_pending(fault, regs)) @@ -150,16 +153,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long address, BUG(); } - /* - * Major/minor page fault accounting is only done on the - * initial attempt. If we go through a retry, it is extremely - * likely that the page will be found in page cache at that point. 
- */ if (flags & FAULT_FLAG_ALLOW_RETRY) { - if (fault & VM_FAULT_MAJOR) - current->maj_flt++; - else - current->min_flt++; if (fault & VM_FAULT_RETRY) { flags |= FAULT_FLAG_TRIED; From aeb6aefc3129307858e4d5a4b73ae28f1871ae05 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Tue, 11 Aug 2020 18:38:12 -0700 Subject: [PATCH 149/164] mm/microblaze: use general page fault accounting Use the general page fault accounting by passing regs into handle_mm_fault(). It naturally solve the issue of multiple page fault accounting when page fault retry happened. Add the missing PERF_COUNT_SW_PAGE_FAULTS perf events too. Note, the other two perf events (PERF_COUNT_SW_PAGE_FAULTS_[MAJ|MIN]) were done in handle_mm_fault(). Signed-off-by: Peter Xu Signed-off-by: Andrew Morton Cc: Michal Simek Link: http://lkml.kernel.org/r/20200707225021.200906-11-peterx@redhat.com Signed-off-by: Linus Torvalds --- arch/microblaze/mm/fault.c | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/arch/microblaze/mm/fault.c b/arch/microblaze/mm/fault.c index 1a3d4c4ca28b..b3fed2cecf84 100644 --- a/arch/microblaze/mm/fault.c +++ b/arch/microblaze/mm/fault.c @@ -28,6 +28,7 @@ #include #include #include +#include #include #include @@ -121,6 +122,8 @@ void do_page_fault(struct pt_regs *regs, unsigned long address, if (user_mode(regs)) flags |= FAULT_FLAG_USER; + perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); + /* When running in the kernel we expect faults to occur only to * addresses in user space. All other faults represent errors in the * kernel and should generate an OOPS. Unfortunately, in the case of an @@ -214,7 +217,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long address, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(vma, address, flags, NULL); + fault = handle_mm_fault(vma, address, flags, regs); if (fault_signal_pending(fault, regs)) return; @@ -230,10 +233,6 @@ void do_page_fault(struct pt_regs *regs, unsigned long address, } if (flags & FAULT_FLAG_ALLOW_RETRY) { - if (unlikely(fault & VM_FAULT_MAJOR)) - current->maj_flt++; - else - current->min_flt++; if (fault & VM_FAULT_RETRY) { flags |= FAULT_FLAG_TRIED; From 2558fd7f5c3eda31d4474c7cdc8dc4b3bb6526f5 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Tue, 11 Aug 2020 18:38:15 -0700 Subject: [PATCH 150/164] mm/mips: use general page fault accounting Use the general page fault accounting by passing regs into handle_mm_fault(). It naturally solve the issue of multiple page fault accounting when page fault retry happened. Fix PERF_COUNT_SW_PAGE_FAULTS perf event manually for page fault retries, by moving it before taking mmap_sem. 
Signed-off-by: Peter Xu Signed-off-by: Andrew Morton Acked-by: Thomas Bogendoerfer Link: http://lkml.kernel.org/r/20200707225021.200906-12-peterx@redhat.com Signed-off-by: Linus Torvalds --- arch/mips/mm/fault.c | 14 +++----------- 1 file changed, 3 insertions(+), 11 deletions(-) diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c index b1db39784db9..7c871b14e74a 100644 --- a/arch/mips/mm/fault.c +++ b/arch/mips/mm/fault.c @@ -96,6 +96,8 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, unsigned long write, if (user_mode(regs)) flags |= FAULT_FLAG_USER; + + perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); retry: mmap_read_lock(mm); vma = find_vma(mm, address); @@ -152,12 +154,11 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, unsigned long write, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(vma, address, flags, NULL); + fault = handle_mm_fault(vma, address, flags, regs); if (fault_signal_pending(fault, regs)) return; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; @@ -168,15 +169,6 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, unsigned long write, BUG(); } if (flags & FAULT_FLAG_ALLOW_RETRY) { - if (fault & VM_FAULT_MAJOR) { - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, - regs, address); - tsk->maj_flt++; - } else { - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, - regs, address); - tsk->min_flt++; - } if (fault & VM_FAULT_RETRY) { flags |= FAULT_FLAG_TRIED; From daf7bf5d90397bc0f3d7b45030461f55eb9c74fa Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Tue, 11 Aug 2020 18:38:19 -0700 Subject: [PATCH 151/164] mm/nds32: use general page fault accounting Use the general page fault accounting by passing regs into handle_mm_fault(). It naturally solve the issue of multiple page fault accounting when page fault retry happened. Fix PERF_COUNT_SW_PAGE_FAULTS perf event manually for page fault retries, by moving it before taking mmap_sem. Signed-off-by: Peter Xu Signed-off-by: Andrew Morton Acked-by: Greentime Hu Cc: Nick Hu Cc: Vincent Chen Link: http://lkml.kernel.org/r/20200707225021.200906-13-peterx@redhat.com Signed-off-by: Linus Torvalds --- arch/nds32/mm/fault.c | 19 +++---------------- 1 file changed, 3 insertions(+), 16 deletions(-) diff --git a/arch/nds32/mm/fault.c b/arch/nds32/mm/fault.c index d0ecc8fb5b23..f02524eb6d56 100644 --- a/arch/nds32/mm/fault.c +++ b/arch/nds32/mm/fault.c @@ -121,6 +121,8 @@ void do_page_fault(unsigned long entry, unsigned long addr, if (unlikely(faulthandler_disabled() || !mm)) goto no_context; + perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr); + /* * As per x86, we may deadlock here. However, since the kernel only * validly references user space from well defined areas of the code, @@ -206,7 +208,7 @@ void do_page_fault(unsigned long entry, unsigned long addr, * the fault. */ - fault = handle_mm_fault(vma, addr, flags, NULL); + fault = handle_mm_fault(vma, addr, flags, regs); /* * If we need to retry but a fatal signal is pending, handle the @@ -228,22 +230,7 @@ void do_page_fault(unsigned long entry, unsigned long addr, goto bad_area; } - /* - * Major/minor page fault accounting is only done on the initial - * attempt. If we go through a retry, it is extremely likely that the - * page will be found in page cache at that point. 
- */ - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr); if (flags & FAULT_FLAG_ALLOW_RETRY) { - if (fault & VM_FAULT_MAJOR) { - tsk->maj_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, - 1, regs, addr); - } else { - tsk->min_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, - 1, regs, addr); - } if (fault & VM_FAULT_RETRY) { flags |= FAULT_FLAG_TRIED; From 4487dcf9b75180eb5115c196ca9bd6ebefade5b3 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Tue, 11 Aug 2020 18:38:22 -0700 Subject: [PATCH 152/164] mm/nios2: use general page fault accounting Use the general page fault accounting by passing regs into handle_mm_fault(). It naturally solve the issue of multiple page fault accounting when page fault retry happened. Add the missing PERF_COUNT_SW_PAGE_FAULTS perf events too. Note, the other two perf events (PERF_COUNT_SW_PAGE_FAULTS_[MAJ|MIN]) were done in handle_mm_fault(). Signed-off-by: Peter Xu Signed-off-by: Andrew Morton Cc: Ley Foon Tan Link: http://lkml.kernel.org/r/20200707225021.200906-14-peterx@redhat.com Signed-off-by: Linus Torvalds --- arch/nios2/mm/fault.c | 14 ++++---------- 1 file changed, 4 insertions(+), 10 deletions(-) diff --git a/arch/nios2/mm/fault.c b/arch/nios2/mm/fault.c index 86beb9a2698e..9476feecf512 100644 --- a/arch/nios2/mm/fault.c +++ b/arch/nios2/mm/fault.c @@ -24,6 +24,7 @@ #include #include #include +#include #include #include @@ -83,6 +84,8 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long cause, if (user_mode(regs)) flags |= FAULT_FLAG_USER; + perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); + if (!mmap_read_trylock(mm)) { if (!user_mode(regs) && !search_exception_tables(regs->ea)) goto bad_area_nosemaphore; @@ -131,7 +134,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long cause, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(vma, address, flags, NULL); + fault = handle_mm_fault(vma, address, flags, regs); if (fault_signal_pending(fault, regs)) return; @@ -146,16 +149,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long cause, BUG(); } - /* - * Major/minor page fault accounting is only done on the - * initial attempt. If we go through a retry, it is extremely - * likely that the page will be found in page cache at that point. - */ if (flags & FAULT_FLAG_ALLOW_RETRY) { - if (fault & VM_FAULT_MAJOR) - current->maj_flt++; - else - current->min_flt++; if (fault & VM_FAULT_RETRY) { flags |= FAULT_FLAG_TRIED; From 38caa902dccad61e02273c18a633fc5be91aeca5 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Tue, 11 Aug 2020 18:38:25 -0700 Subject: [PATCH 153/164] mm/openrisc: use general page fault accounting Use the general page fault accounting by passing regs into handle_mm_fault(). It naturally solve the issue of multiple page fault accounting when page fault retry happened. Add the missing PERF_COUNT_SW_PAGE_FAULTS perf events too. Note, the other two perf events (PERF_COUNT_SW_PAGE_FAULTS_[MAJ|MIN]) were done in handle_mm_fault(). 
Signed-off-by: Peter Xu Signed-off-by: Andrew Morton Acked-by: Stafford Horne Cc: Jonas Bonn Cc: Stefan Kristiansson Link: http://lkml.kernel.org/r/20200707225021.200906-15-peterx@redhat.com Signed-off-by: Linus Torvalds --- arch/openrisc/mm/fault.c | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c index 3daa491d1edb..ca97d9baab51 100644 --- a/arch/openrisc/mm/fault.c +++ b/arch/openrisc/mm/fault.c @@ -15,6 +15,7 @@ #include #include #include +#include #include #include @@ -103,6 +104,8 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address, if (in_interrupt() || !mm) goto no_context; + perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); + retry: mmap_read_lock(mm); vma = find_vma(mm, address); @@ -159,7 +162,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address, * the fault. */ - fault = handle_mm_fault(vma, address, flags, NULL); + fault = handle_mm_fault(vma, address, flags, regs); if (fault_signal_pending(fault, regs)) return; @@ -176,10 +179,6 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address, if (flags & FAULT_FLAG_ALLOW_RETRY) { /*RGD modeled on Cris */ - if (fault & VM_FAULT_MAJOR) - tsk->maj_flt++; - else - tsk->min_flt++; if (fault & VM_FAULT_RETRY) { flags |= FAULT_FLAG_TRIED; From af8a7926273645dc81f9e7f5c3e18136abebf05b Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Tue, 11 Aug 2020 18:38:28 -0700 Subject: [PATCH 154/164] mm/parisc: use general page fault accounting Use the general page fault accounting by passing regs into handle_mm_fault(). It naturally solve the issue of multiple page fault accounting when page fault retry happened. Add the missing PERF_COUNT_SW_PAGE_FAULTS perf events too. Note, the other two perf events (PERF_COUNT_SW_PAGE_FAULTS_[MAJ|MIN]) were done in handle_mm_fault(). Signed-off-by: Peter Xu Signed-off-by: Andrew Morton Cc: James E.J. Bottomley Cc: Helge Deller Link: http://lkml.kernel.org/r/20200707225021.200906-16-peterx@redhat.com Signed-off-by: Linus Torvalds --- arch/parisc/mm/fault.c | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c index e32d06928c24..4bfe2da9fbe3 100644 --- a/arch/parisc/mm/fault.c +++ b/arch/parisc/mm/fault.c @@ -18,6 +18,7 @@ #include #include #include +#include #include @@ -281,6 +282,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long code, acc_type = parisc_acctyp(code, regs->iir); if (acc_type & VM_WRITE) flags |= FAULT_FLAG_WRITE; + perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); retry: mmap_read_lock(mm); vma = find_vma_prev(mm, address, &prev_vma); @@ -302,7 +304,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long code, * fault. */ - fault = handle_mm_fault(vma, address, flags, NULL); + fault = handle_mm_fault(vma, address, flags, regs); if (fault_signal_pending(fault, regs)) return; @@ -323,10 +325,6 @@ void do_page_fault(struct pt_regs *regs, unsigned long code, BUG(); } if (flags & FAULT_FLAG_ALLOW_RETRY) { - if (fault & VM_FAULT_MAJOR) - current->maj_flt++; - else - current->min_flt++; if (fault & VM_FAULT_RETRY) { /* * No need to mmap_read_unlock(mm) as we would From 428fdc094492e720c342f1c934e7972cbc328d13 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Tue, 11 Aug 2020 18:38:31 -0700 Subject: [PATCH 155/164] mm/powerpc: use general page fault accounting Use the general page fault accounting by passing regs into handle_mm_fault(). 
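One powerpc-specific piece survives the conversion (see the hunk below): the CMO (Cooperative Memory Overcommit) statistic is still bumped locally for major faults, while the generic counters and perf events move to the core. In condensed form:

	fault = handle_mm_fault(vma, address, flags, regs);
	major |= fault & VM_FAULT_MAJOR;
	/* ... */
	if (major)
		cmo_account_page_fault();	/* arch-local CMO stat only */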
Signed-off-by: Peter Xu Signed-off-by: Andrew Morton Acked-by: Michael Ellerman Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Link: http://lkml.kernel.org/r/20200707225021.200906-17-peterx@redhat.com Signed-off-by: Linus Torvalds --- arch/powerpc/mm/fault.c | 11 +++-------- 1 file changed, 3 insertions(+), 8 deletions(-) diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index c6a5225a3521..0add963a849b 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -511,7 +511,7 @@ static int __do_page_fault(struct pt_regs *regs, unsigned long address, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(vma, address, flags, NULL); + fault = handle_mm_fault(vma, address, flags, regs); major |= fault & VM_FAULT_MAJOR; @@ -537,14 +537,9 @@ static int __do_page_fault(struct pt_regs *regs, unsigned long address, /* * Major/minor page fault accounting. */ - if (major) { - current->maj_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address); + if (major) cmo_account_page_fault(); - } else { - current->min_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address); - } + return 0; } NOKPROBE_SYMBOL(__do_page_fault); From 5ac365a458902214adfbe3567c94e114cc91cde6 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Tue, 11 Aug 2020 18:38:34 -0700 Subject: [PATCH 156/164] mm/riscv: use general page fault accounting Use the general page fault accounting by passing regs into handle_mm_fault(). It naturally solve the issue of multiple page fault accounting when page fault retry happened. Signed-off-by: Peter Xu Signed-off-by: Andrew Morton Reviewed-by: Pekka Enberg Acked-by: Palmer Dabbelt Cc: Paul Walmsley Cc: Albert Ou Link: http://lkml.kernel.org/r/20200707225021.200906-18-peterx@redhat.com Signed-off-by: Linus Torvalds --- arch/riscv/mm/fault.c | 16 +--------------- 1 file changed, 1 insertion(+), 15 deletions(-) diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c index 30c1124d0fb6..716d64e36f83 100644 --- a/arch/riscv/mm/fault.c +++ b/arch/riscv/mm/fault.c @@ -109,7 +109,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs) * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(vma, addr, flags, NULL); + fault = handle_mm_fault(vma, addr, flags, regs); /* * If we need to retry but a fatal signal is pending, handle the @@ -127,21 +127,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs) BUG(); } - /* - * Major/minor page fault accounting is only done on the - * initial attempt. If we go through a retry, it is extremely - * likely that the page will be found in page cache at that point. - */ if (flags & FAULT_FLAG_ALLOW_RETRY) { - if (fault & VM_FAULT_MAJOR) { - tsk->maj_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, - 1, regs, addr); - } else { - tsk->min_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, - 1, regs, addr); - } if (fault & VM_FAULT_RETRY) { flags |= FAULT_FLAG_TRIED; From 35e45f3e5a1fe652df2153f2b1c7dd234d448356 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Tue, 11 Aug 2020 18:38:37 -0700 Subject: [PATCH 157/164] mm/s390: use general page fault accounting Use the general page fault accounting by passing regs into handle_mm_fault(). It naturally solve the issue of multiple page fault accounting when page fault retry happened. 
Signed-off-by: Peter Xu Signed-off-by: Andrew Morton Reviewed-by: Gerald Schaefer Acked-by: Gerald Schaefer Cc: Alexander Gordeev Cc: Heiko Carstens Cc: Vasily Gorbik Cc: Christian Borntraeger Link: http://lkml.kernel.org/r/20200707225021.200906-19-peterx@redhat.com Signed-off-by: Linus Torvalds --- arch/s390/mm/fault.c | 16 +--------------- 1 file changed, 1 insertion(+), 15 deletions(-) diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c index ad783aaaf649..4c8c063bce5b 100644 --- a/arch/s390/mm/fault.c +++ b/arch/s390/mm/fault.c @@ -476,7 +476,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access) * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(vma, address, flags, NULL); + fault = handle_mm_fault(vma, address, flags, regs); if (fault_signal_pending(fault, regs)) { fault = VM_FAULT_SIGNAL; if (flags & FAULT_FLAG_RETRY_NOWAIT) @@ -486,21 +486,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access) if (unlikely(fault & VM_FAULT_ERROR)) goto out_up; - /* - * Major/minor page fault accounting is only done on the - * initial attempt. If we go through a retry, it is extremely - * likely that the page will be found in page cache at that point. - */ if (flags & FAULT_FLAG_ALLOW_RETRY) { - if (fault & VM_FAULT_MAJOR) { - tsk->maj_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, - regs, address); - } else { - tsk->min_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, - regs, address); - } if (fault & VM_FAULT_RETRY) { if (IS_ENABLED(CONFIG_PGSTE) && gmap && (flags & FAULT_FLAG_RETRY_NOWAIT)) { From 105f886220e944b6aa01accfad59af49341703c4 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Tue, 11 Aug 2020 18:38:40 -0700 Subject: [PATCH 158/164] mm/sh: use general page fault accounting Use the general page fault accounting by passing regs into handle_mm_fault(). It naturally solve the issue of multiple page fault accounting when page fault retry happened. Signed-off-by: Peter Xu Signed-off-by: Andrew Morton Cc: Yoshinori Sato Cc: Rich Felker Link: http://lkml.kernel.org/r/20200707225021.200906-20-peterx@redhat.com Signed-off-by: Linus Torvalds --- arch/sh/mm/fault.c | 11 +---------- 1 file changed, 1 insertion(+), 10 deletions(-) diff --git a/arch/sh/mm/fault.c b/arch/sh/mm/fault.c index 3c0a11827f7e..482668a2f6d3 100644 --- a/arch/sh/mm/fault.c +++ b/arch/sh/mm/fault.c @@ -482,22 +482,13 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(vma, address, flags, NULL); + fault = handle_mm_fault(vma, address, flags, regs); if (unlikely(fault & (VM_FAULT_RETRY | VM_FAULT_ERROR))) if (mm_fault_error(regs, error_code, address, fault)) return; if (flags & FAULT_FLAG_ALLOW_RETRY) { - if (fault & VM_FAULT_MAJOR) { - tsk->maj_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, - regs, address); - } else { - tsk->min_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, - regs, address); - } if (fault & VM_FAULT_RETRY) { flags |= FAULT_FLAG_TRIED; From 56e10e6ab11924b19b7c9583750fdd4e440e25c8 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Tue, 11 Aug 2020 18:38:43 -0700 Subject: [PATCH 159/164] mm/sparc32: use general page fault accounting Use the general page fault accounting by passing regs into handle_mm_fault(). It naturally solve the issue of multiple page fault accounting when page fault retry happened. Signed-off-by: Peter Xu Signed-off-by: Andrew Morton Acked-by: David S. 
Miller Link: http://lkml.kernel.org/r/20200707225021.200906-21-peterx@redhat.com Signed-off-by: Linus Torvalds --- arch/sparc/mm/fault_32.c | 11 +---------- 1 file changed, 1 insertion(+), 10 deletions(-) diff --git a/arch/sparc/mm/fault_32.c b/arch/sparc/mm/fault_32.c index 06af03db4417..8071bfd72349 100644 --- a/arch/sparc/mm/fault_32.c +++ b/arch/sparc/mm/fault_32.c @@ -234,7 +234,7 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(vma, address, flags, NULL); + fault = handle_mm_fault(vma, address, flags, regs); if (fault_signal_pending(fault, regs)) return; @@ -250,15 +250,6 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write, } if (flags & FAULT_FLAG_ALLOW_RETRY) { - if (fault & VM_FAULT_MAJOR) { - current->maj_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, - 1, regs, address); - } else { - current->min_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, - 1, regs, address); - } if (fault & VM_FAULT_RETRY) { flags |= FAULT_FLAG_TRIED; From f08147df4092dd5093b85dbccbd34a7843fb60bf Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Tue, 11 Aug 2020 18:38:46 -0700 Subject: [PATCH 160/164] mm/sparc64: use general page fault accounting Use the general page fault accounting by passing regs into handle_mm_fault(). It naturally solve the issue of multiple page fault accounting when page fault retry happened. Signed-off-by: Peter Xu Signed-off-by: Andrew Morton Acked-by: David S. Miller Link: http://lkml.kernel.org/r/20200707225021.200906-22-peterx@redhat.com Signed-off-by: Linus Torvalds --- arch/sparc/mm/fault_64.c | 11 +---------- 1 file changed, 1 insertion(+), 10 deletions(-) diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c index 9ebee14ee893..0a6bcc85fba7 100644 --- a/arch/sparc/mm/fault_64.c +++ b/arch/sparc/mm/fault_64.c @@ -422,7 +422,7 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs) goto bad_area; } - fault = handle_mm_fault(vma, address, flags, NULL); + fault = handle_mm_fault(vma, address, flags, regs); if (fault_signal_pending(fault, regs)) goto exit_exception; @@ -438,15 +438,6 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs) } if (flags & FAULT_FLAG_ALLOW_RETRY) { - if (fault & VM_FAULT_MAJOR) { - current->maj_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, - 1, regs, address); - } else { - current->min_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, - 1, regs, address); - } if (fault & VM_FAULT_RETRY) { flags |= FAULT_FLAG_TRIED; From 968614fc7b8410e1ee99d0111015a1bf3e955f64 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Tue, 11 Aug 2020 18:38:49 -0700 Subject: [PATCH 161/164] mm/x86: use general page fault accounting Use the general page fault accounting by passing regs into handle_mm_fault(). Signed-off-by: Peter Xu Signed-off-by: Andrew Morton Cc: Dave Hansen Cc: Andy Lutomirski Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: H. 
Peter Anvin Link: http://lkml.kernel.org/r/20200707225021.200906-23-peterx@redhat.com Signed-off-by: Linus Torvalds --- arch/x86/mm/fault.c | 17 ++--------------- 1 file changed, 2 insertions(+), 15 deletions(-) diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index e1bf5555d80a..35f1498e9832 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1139,7 +1139,7 @@ void do_user_addr_fault(struct pt_regs *regs, struct vm_area_struct *vma; struct task_struct *tsk; struct mm_struct *mm; - vm_fault_t fault, major = 0; + vm_fault_t fault; unsigned int flags = FAULT_FLAG_DEFAULT; tsk = current; @@ -1291,8 +1291,7 @@ void do_user_addr_fault(struct pt_regs *regs, * userland). The return to userland is identified whenever * FAULT_FLAG_USER|FAULT_FLAG_KILLABLE are both set in flags. */ - fault = handle_mm_fault(vma, address, flags, NULL); - major |= fault & VM_FAULT_MAJOR; + fault = handle_mm_fault(vma, address, flags, regs); /* Quick path to respond to signals */ if (fault_signal_pending(fault, regs)) { @@ -1319,18 +1318,6 @@ void do_user_addr_fault(struct pt_regs *regs, return; } - /* - * Major/minor page fault accounting. If any of the events - * returned VM_FAULT_MAJOR, we account it as a major fault. - */ - if (major) { - tsk->maj_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address); - } else { - tsk->min_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address); - } - check_v8086_mode(regs, address, tsk); } NOKPROBE_SYMBOL(do_user_addr_fault); From 484e51e4af528a408b8e0f2db4865625169279cf Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Tue, 11 Aug 2020 18:38:53 -0700 Subject: [PATCH 162/164] mm/xtensa: use general page fault accounting Use the general page fault accounting by passing regs into handle_mm_fault(). It naturally solve the issue of multiple page fault accounting when page fault retry happened. Remove the PERF_COUNT_SW_PAGE_FAULTS_[MAJ|MIN] perf events because it's now also done in handle_mm_fault(). Move the PERF_COUNT_SW_PAGE_FAULTS event higher before taking mmap_sem for the fault, then it'll match with the rest of the archs. Signed-off-by: Peter Xu Signed-off-by: Andrew Morton Acked-by: Max Filippov Cc: Chris Zankel Link: http://lkml.kernel.org/r/20200707225021.200906-24-peterx@redhat.com Signed-off-by: Linus Torvalds --- arch/xtensa/mm/fault.c | 15 ++++----------- 1 file changed, 4 insertions(+), 11 deletions(-) diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c index e72c8c1359a6..7666408ce12a 100644 --- a/arch/xtensa/mm/fault.c +++ b/arch/xtensa/mm/fault.c @@ -72,6 +72,9 @@ void do_page_fault(struct pt_regs *regs) if (user_mode(regs)) flags |= FAULT_FLAG_USER; + + perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); + retry: mmap_read_lock(mm); vma = find_vma(mm, address); @@ -107,7 +110,7 @@ void do_page_fault(struct pt_regs *regs) * make sure we exit gracefully rather than endlessly redo * the fault. 
*/ - fault = handle_mm_fault(vma, address, flags, NULL); + fault = handle_mm_fault(vma, address, flags, regs); if (fault_signal_pending(fault, regs)) return; @@ -122,10 +125,6 @@ void do_page_fault(struct pt_regs *regs) BUG(); } if (flags & FAULT_FLAG_ALLOW_RETRY) { - if (fault & VM_FAULT_MAJOR) - current->maj_flt++; - else - current->min_flt++; if (fault & VM_FAULT_RETRY) { flags |= FAULT_FLAG_TRIED; @@ -139,12 +138,6 @@ } mmap_read_unlock(mm); - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); - if (flags & VM_FAULT_MAJOR) - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address); - else - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address); - return; /* Something tried to access memory that isn't in our memory map.. From a2beb5f1efede6924a4258462a5660572e6ca864 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Tue, 11 Aug 2020 18:38:57 -0700 Subject: [PATCH 163/164] mm: clean up the last pieces of page fault accountings Here are the last pieces of page fault accounting that were still done outside handle_mm_fault() where we still have regs==NULL when calling handle_mm_fault(): arch/powerpc/mm/copro_fault.c: copro_handle_mm_fault arch/sparc/mm/fault_32.c: force_user_fault arch/um/kernel/trap.c: handle_page_fault mm/gup.c: faultin_page fixup_user_fault mm/hmm.c: hmm_vma_fault mm/ksm.c: break_ksm Some of them have the issue of duplicated accounting for page fault retries. Some of them didn't do the accounting at all. This patch cleans all these up by letting handle_mm_fault() do per-task page fault accounting even if regs==NULL (though we'll still skip the perf event accounting). With that, we can safely remove all the outliers now. There's another functional change in that now we account the page faults to the caller of gup, rather than the task_struct that was passed into the gup code. More information on this can be found at [1]. After this patch, the following should never be touched again outside handle_mm_fault(): - task_struct.[maj|min]_flt - PERF_COUNT_SW_PAGE_FAULTS_[MAJ|MIN] [1] https://lore.kernel.org/lkml/CAHk-=wj_V2Tps2QrMn20_W0OJF9xqNh52XSGA42s-ZJ8Y+GyKw@mail.gmail.com/ Signed-off-by: Peter Xu Signed-off-by: Andrew Morton Cc: Albert Ou Cc: Alexander Gordeev Cc: Andy Lutomirski Cc: Benjamin Herrenschmidt Cc: Borislav Petkov Cc: Brian Cain Cc: Catalin Marinas Cc: Christian Borntraeger Cc: Chris Zankel Cc: Dave Hansen Cc: David S. Miller Cc: Geert Uytterhoeven Cc: Gerald Schaefer Cc: Greentime Hu Cc: Guo Ren Cc: Heiko Carstens Cc: Helge Deller Cc: H. Peter Anvin Cc: Ingo Molnar Cc: Ivan Kokshaysky Cc: James E.J. 
Bottomley Cc: John Hubbard Cc: Jonas Bonn Cc: Ley Foon Tan Cc: "Luck, Tony" Cc: Matt Turner Cc: Max Filippov Cc: Michael Ellerman Cc: Michal Simek Cc: Nick Hu Cc: Palmer Dabbelt Cc: Paul Mackerras Cc: Paul Walmsley Cc: Pekka Enberg Cc: Peter Zijlstra Cc: Richard Henderson Cc: Rich Felker Cc: Russell King Cc: Stafford Horne Cc: Stefan Kristiansson Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Vincent Chen Cc: Vineet Gupta Cc: Will Deacon Cc: Yoshinori Sato Link: http://lkml.kernel.org/r/20200707225021.200906-25-peterx@redhat.com Signed-off-by: Linus Torvalds --- arch/powerpc/mm/copro_fault.c | 5 ----- arch/um/kernel/trap.c | 4 ---- mm/gup.c | 13 ------------- mm/memory.c | 17 ++++++++++------- 4 files changed, 10 insertions(+), 29 deletions(-) diff --git a/arch/powerpc/mm/copro_fault.c b/arch/powerpc/mm/copro_fault.c index 2d0276abe0a6..8acd00178956 100644 --- a/arch/powerpc/mm/copro_fault.c +++ b/arch/powerpc/mm/copro_fault.c @@ -76,11 +76,6 @@ int copro_handle_mm_fault(struct mm_struct *mm, unsigned long ea, BUG(); } - if (*flt & VM_FAULT_MAJOR) - current->maj_flt++; - else - current->min_flt++; - out_unlock: mmap_read_unlock(mm); return ret; diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c index 8d9870d76da1..ad12f78bda7e 100644 --- a/arch/um/kernel/trap.c +++ b/arch/um/kernel/trap.c @@ -88,10 +88,6 @@ int handle_page_fault(unsigned long address, unsigned long ip, BUG(); } if (flags & FAULT_FLAG_ALLOW_RETRY) { - if (fault & VM_FAULT_MAJOR) - current->maj_flt++; - else - current->min_flt++; if (fault & VM_FAULT_RETRY) { flags |= FAULT_FLAG_TRIED; diff --git a/mm/gup.c b/mm/gup.c index ae7121d729fa..d5d44c68fa19 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -893,13 +893,6 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma, BUG(); } - if (tsk) { - if (ret & VM_FAULT_MAJOR) - tsk->maj_flt++; - else - tsk->min_flt++; - } - if (ret & VM_FAULT_RETRY) { if (locked && !(fault_flags & FAULT_FLAG_RETRY_NOWAIT)) *locked = 0; @@ -1255,12 +1248,6 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm, goto retry; } - if (tsk) { - if (major) - tsk->maj_flt++; - else - tsk->min_flt++; - } return 0; } EXPORT_SYMBOL_GPL(fixup_user_fault); diff --git a/mm/memory.c b/mm/memory.c index 9b7d35734caa..2b7f0e00f312 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4400,20 +4400,23 @@ static inline void mm_account_fault(struct pt_regs *regs, */ major = (ret & VM_FAULT_MAJOR) || (flags & FAULT_FLAG_TRIED); + if (major) + current->maj_flt++; + else + current->min_flt++; + /* - * If the fault is done for GUP, regs will be NULL, and we will skip - * the fault accounting. + * If the fault is done for GUP, regs will be NULL. We only do the + * accounting for the per thread fault counters who triggered the + * fault, and we skip the perf event updates. */ if (!regs) return; - if (major) { - current->maj_flt++; + if (major) perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address); - } else { - current->min_flt++; + else perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address); - } } /* From 64019a2e467a288a16b65ab55ddcbf58c1b00187 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Tue, 11 Aug 2020 18:39:01 -0700 Subject: [PATCH 164/164] mm/gup: remove task_struct pointer for all gup code After the cleanup of page fault accounting, gup does not need to pass task_struct around any more. Remove that parameter in the whole gup stack. 
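For callers, the conversion is mechanical: the leading task_struct argument is dropped and the rest of the call stays the same. Below is a minimal sketch of a converted call site under the post-series prototypes; the helper name example_grab_remote_page and its surroundings are illustrative only, not taken from any file touched by this series.

#include <linux/mm.h>

/*
 * Illustrative caller of the new get_user_pages_remote(). Any page
 * fault taken on this path is accounted to current inside
 * handle_mm_fault(), so no task_struct is passed any more.
 */
static int example_grab_remote_page(struct mm_struct *mm, unsigned long addr,
				    struct page **page)
{
	long ret;

	mmap_read_lock(mm);
	/*
	 * Before this series: get_user_pages_remote(tsk, mm, addr, 1,
	 * FOLL_WRITE, page, NULL, NULL). After it, the tsk argument is
	 * simply removed; mm, flags, and the output arrays are unchanged.
	 */
	ret = get_user_pages_remote(mm, addr, 1, FOLL_WRITE, page, NULL, NULL);
	mmap_read_unlock(mm);

	if (ret != 1)
		return ret < 0 ? (int)ret : -EFAULT;

	/* On success the caller holds a reference; drop it with put_page(). */
	return 0;
}
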
Signed-off-by: Peter Xu Signed-off-by: Andrew Morton Reviewed-by: John Hubbard Link: http://lkml.kernel.org/r/20200707225021.200906-26-peterx@redhat.com Signed-off-by: Linus Torvalds --- arch/arc/kernel/process.c | 2 +- arch/s390/kvm/interrupt.c | 2 +- arch/s390/kvm/kvm-s390.c | 2 +- arch/s390/kvm/priv.c | 8 +- arch/s390/mm/gmap.c | 4 +- drivers/gpu/drm/i915/gem/i915_gem_userptr.c | 2 +- drivers/infiniband/core/umem_odp.c | 2 +- drivers/vfio/vfio_iommu_type1.c | 4 +- fs/exec.c | 2 +- include/linux/mm.h | 9 +- kernel/events/uprobes.c | 6 +- kernel/futex.c | 2 +- mm/gup.c | 101 ++++++++------------ mm/memory.c | 2 +- mm/process_vm_access.c | 2 +- security/tomoyo/domain.c | 2 +- virt/kvm/async_pf.c | 2 +- virt/kvm/kvm_main.c | 2 +- 18 files changed, 69 insertions(+), 87 deletions(-) diff --git a/arch/arc/kernel/process.c b/arch/arc/kernel/process.c index e12c80d71b78..efeba1fe7252 100644 --- a/arch/arc/kernel/process.c +++ b/arch/arc/kernel/process.c @@ -91,7 +91,7 @@ SYSCALL_DEFINE3(arc_usr_cmpxchg, int *, uaddr, int, expected, int, new) goto fail; mmap_read_lock(current->mm); - ret = fixup_user_fault(current, current->mm, (unsigned long) uaddr, + ret = fixup_user_fault(current->mm, (unsigned long) uaddr, FAULT_FLAG_WRITE, NULL); mmap_read_unlock(current->mm); diff --git a/arch/s390/kvm/interrupt.c b/arch/s390/kvm/interrupt.c index 1608fd99bbee..2f177298c663 100644 --- a/arch/s390/kvm/interrupt.c +++ b/arch/s390/kvm/interrupt.c @@ -2768,7 +2768,7 @@ static struct page *get_map_page(struct kvm *kvm, u64 uaddr) struct page *page = NULL; mmap_read_lock(kvm->mm); - get_user_pages_remote(NULL, kvm->mm, uaddr, 1, FOLL_WRITE, + get_user_pages_remote(kvm->mm, uaddr, 1, FOLL_WRITE, &page, NULL, NULL); mmap_read_unlock(kvm->mm); return page; diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c index 66da278a67fb..6b74b92c1a58 100644 --- a/arch/s390/kvm/kvm-s390.c +++ b/arch/s390/kvm/kvm-s390.c @@ -1892,7 +1892,7 @@ static long kvm_s390_set_skeys(struct kvm *kvm, struct kvm_s390_skeys *args) r = set_guest_storage_key(current->mm, hva, keys[i], 0); if (r) { - r = fixup_user_fault(current, current->mm, hva, + r = fixup_user_fault(current->mm, hva, FAULT_FLAG_WRITE, &unlocked); if (r) break; diff --git a/arch/s390/kvm/priv.c b/arch/s390/kvm/priv.c index 2f721a923b54..cd74989ce0b0 100644 --- a/arch/s390/kvm/priv.c +++ b/arch/s390/kvm/priv.c @@ -273,7 +273,7 @@ static int handle_iske(struct kvm_vcpu *vcpu) rc = get_guest_storage_key(current->mm, vmaddr, &key); if (rc) { - rc = fixup_user_fault(current, current->mm, vmaddr, + rc = fixup_user_fault(current->mm, vmaddr, FAULT_FLAG_WRITE, &unlocked); if (!rc) { mmap_read_unlock(current->mm); @@ -319,7 +319,7 @@ static int handle_rrbe(struct kvm_vcpu *vcpu) mmap_read_lock(current->mm); rc = reset_guest_reference_bit(current->mm, vmaddr); if (rc < 0) { - rc = fixup_user_fault(current, current->mm, vmaddr, + rc = fixup_user_fault(current->mm, vmaddr, FAULT_FLAG_WRITE, &unlocked); if (!rc) { mmap_read_unlock(current->mm); @@ -390,7 +390,7 @@ static int handle_sske(struct kvm_vcpu *vcpu) m3 & SSKE_MC); if (rc < 0) { - rc = fixup_user_fault(current, current->mm, vmaddr, + rc = fixup_user_fault(current->mm, vmaddr, FAULT_FLAG_WRITE, &unlocked); rc = !rc ? 
-EAGAIN : rc; } @@ -1094,7 +1094,7 @@ static int handle_pfmf(struct kvm_vcpu *vcpu) rc = cond_set_guest_storage_key(current->mm, vmaddr, key, NULL, nq, mr, mc); if (rc < 0) { - rc = fixup_user_fault(current, current->mm, vmaddr, + rc = fixup_user_fault(current->mm, vmaddr, FAULT_FLAG_WRITE, &unlocked); rc = !rc ? -EAGAIN : rc; } diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c index 190357ff86b3..8747487c50a8 100644 --- a/arch/s390/mm/gmap.c +++ b/arch/s390/mm/gmap.c @@ -649,7 +649,7 @@ int gmap_fault(struct gmap *gmap, unsigned long gaddr, rc = vmaddr; goto out_up; } - if (fixup_user_fault(current, gmap->mm, vmaddr, fault_flags, + if (fixup_user_fault(gmap->mm, vmaddr, fault_flags, &unlocked)) { rc = -EFAULT; goto out_up; @@ -879,7 +879,7 @@ static int gmap_pte_op_fixup(struct gmap *gmap, unsigned long gaddr, BUG_ON(gmap_is_shadow(gmap)); fault_flags = (prot == PROT_WRITE) ? FAULT_FLAG_WRITE : 0; - if (fixup_user_fault(current, mm, vmaddr, fault_flags, &unlocked)) + if (fixup_user_fault(mm, vmaddr, fault_flags, &unlocked)) return -EFAULT; if (unlocked) /* lost mmap_lock, caller has to retry __gmap_translate */ diff --git a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c index e946032b13e4..2c2bf24140c9 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c @@ -469,7 +469,7 @@ __i915_gem_userptr_get_pages_worker(struct work_struct *_work) locked = 1; } ret = pin_user_pages_remote - (work->task, mm, + (mm, obj->userptr.ptr + pinned * PAGE_SIZE, npages - pinned, flags, diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c index 5e32f61a2fe4..cc6b4befde7c 100644 --- a/drivers/infiniband/core/umem_odp.c +++ b/drivers/infiniband/core/umem_odp.c @@ -439,7 +439,7 @@ int ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp, u64 user_virt, * complex (and doesn't gain us much performance in most use * cases). */ - npages = get_user_pages_remote(owning_process, owning_mm, + npages = get_user_pages_remote(owning_mm, user_virt, gup_num_pages, flags, local_page_list, NULL, NULL); mmap_read_unlock(owning_mm); diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index 5e556ac9102a..9d41105bfd01 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -425,7 +425,7 @@ static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm, if (ret) { bool unlocked = false; - ret = fixup_user_fault(NULL, mm, vaddr, + ret = fixup_user_fault(mm, vaddr, FAULT_FLAG_REMOTE | (write_fault ? FAULT_FLAG_WRITE : 0), &unlocked); @@ -453,7 +453,7 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr, flags |= FOLL_WRITE; mmap_read_lock(mm); - ret = pin_user_pages_remote(NULL, mm, vaddr, 1, flags | FOLL_LONGTERM, + ret = pin_user_pages_remote(mm, vaddr, 1, flags | FOLL_LONGTERM, page, NULL, NULL); if (ret == 1) { *pfn = page_to_pfn(page[0]); diff --git a/fs/exec.c b/fs/exec.c index a57d9785832b..a91003e28eaa 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -217,7 +217,7 @@ static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos, * We are doing an exec(). 'current' is the process * doing the exec and bprm->mm is the new process's mm. 
*/ - ret = get_user_pages_remote(current, bprm->mm, pos, 1, gup_flags, + ret = get_user_pages_remote(bprm->mm, pos, 1, gup_flags, &page, NULL, NULL); if (ret <= 0) return NULL; diff --git a/include/linux/mm.h b/include/linux/mm.h index ec0ffb423769..e7602a3bcef1 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1661,7 +1661,7 @@ int invalidate_inode_page(struct page *page); extern vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address, unsigned int flags, struct pt_regs *regs); -extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm, +extern int fixup_user_fault(struct mm_struct *mm, unsigned long address, unsigned int fault_flags, bool *unlocked); void unmap_mapping_pages(struct address_space *mapping, @@ -1677,8 +1677,7 @@ static inline vm_fault_t handle_mm_fault(struct vm_area_struct *vma, BUG(); return VM_FAULT_SIGBUS; } -static inline int fixup_user_fault(struct task_struct *tsk, - struct mm_struct *mm, unsigned long address, +static inline int fixup_user_fault(struct mm_struct *mm, unsigned long address, unsigned int fault_flags, bool *unlocked) { /* should never happen if there's no MMU */ @@ -1704,11 +1703,11 @@ extern int access_remote_vm(struct mm_struct *mm, unsigned long addr, extern int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm, unsigned long addr, void *buf, int len, unsigned int gup_flags); -long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm, +long get_user_pages_remote(struct mm_struct *mm, unsigned long start, unsigned long nr_pages, unsigned int gup_flags, struct page **pages, struct vm_area_struct **vmas, int *locked); -long pin_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm, +long pin_user_pages_remote(struct mm_struct *mm, unsigned long start, unsigned long nr_pages, unsigned int gup_flags, struct page **pages, struct vm_area_struct **vmas, int *locked); diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 49047d479c57..649fd53dc9ad 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -376,7 +376,7 @@ __update_ref_ctr(struct mm_struct *mm, unsigned long vaddr, short d) if (!vaddr || !d) return -EINVAL; - ret = get_user_pages_remote(NULL, mm, vaddr, 1, + ret = get_user_pages_remote(mm, vaddr, 1, FOLL_WRITE, &page, &vma, NULL); if (unlikely(ret <= 0)) { /* @@ -477,7 +477,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, if (is_register) gup_flags |= FOLL_SPLIT_PMD; /* Read the page with vaddr into memory */ - ret = get_user_pages_remote(NULL, mm, vaddr, 1, gup_flags, + ret = get_user_pages_remote(mm, vaddr, 1, gup_flags, &old_page, &vma, NULL); if (ret <= 0) return ret; @@ -2029,7 +2029,7 @@ static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr) * but we treat this as a 'remote' access since it is * essentially a kernel access to the memory. 
*/ - result = get_user_pages_remote(NULL, mm, vaddr, 1, FOLL_FORCE, &page, + result = get_user_pages_remote(mm, vaddr, 1, FOLL_FORCE, &page, NULL, NULL); if (result < 0) return result; diff --git a/kernel/futex.c b/kernel/futex.c index 83404124b77b..61e8153e6c76 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -678,7 +678,7 @@ static int fault_in_user_writeable(u32 __user *uaddr) int ret; mmap_read_lock(mm); - ret = fixup_user_fault(current, mm, (unsigned long)uaddr, + ret = fixup_user_fault(mm, (unsigned long)uaddr, FAULT_FLAG_WRITE, NULL); mmap_read_unlock(mm); diff --git a/mm/gup.c b/mm/gup.c index d5d44c68fa19..39e58df6925d 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -859,7 +859,7 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address, * does not include FOLL_NOWAIT, the mmap_lock may be released. If it * is, *@locked will be set to 0 and -EBUSY returned. */ -static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma, +static int faultin_page(struct vm_area_struct *vma, unsigned long address, unsigned int *flags, int *locked) { unsigned int fault_flags = 0; @@ -962,7 +962,6 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags) /** * __get_user_pages() - pin user pages in memory - * @tsk: task_struct of target task * @mm: mm_struct of target mm * @start: starting user address * @nr_pages: number of pages from start to pin @@ -1021,7 +1020,7 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags) * instead of __get_user_pages. __get_user_pages should be used only if * you need some special @gup_flags. */ -static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, +static long __get_user_pages(struct mm_struct *mm, unsigned long start, unsigned long nr_pages, unsigned int gup_flags, struct page **pages, struct vm_area_struct **vmas, int *locked) @@ -1103,8 +1102,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, page = follow_page_mask(vma, start, foll_flags, &ctx); if (!page) { - ret = faultin_page(tsk, vma, start, &foll_flags, - locked); + ret = faultin_page(vma, start, &foll_flags, locked); switch (ret) { case 0: goto retry; @@ -1178,8 +1176,6 @@ static bool vma_permits_fault(struct vm_area_struct *vma, /** * fixup_user_fault() - manually resolve a user page fault - * @tsk: the task_struct to use for page fault accounting, or - * NULL if faults are not to be recorded. * @mm: mm_struct of target mm * @address: user address * @fault_flags:flags to pass down to handle_mm_fault() @@ -1207,7 +1203,7 @@ static bool vma_permits_fault(struct vm_area_struct *vma, * This function will not return with an unlocked mmap_lock. So it has not the * same semantics wrt the @mm->mmap_lock as does filemap_fault(). 
*/ -int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm, +int fixup_user_fault(struct mm_struct *mm, unsigned long address, unsigned int fault_flags, bool *unlocked) { @@ -1256,8 +1252,7 @@ EXPORT_SYMBOL_GPL(fixup_user_fault); * Please note that this function, unlike __get_user_pages will not * return 0 for nr_pages > 0 without FOLL_NOWAIT */ -static __always_inline long __get_user_pages_locked(struct task_struct *tsk, - struct mm_struct *mm, +static __always_inline long __get_user_pages_locked(struct mm_struct *mm, unsigned long start, unsigned long nr_pages, struct page **pages, @@ -1290,7 +1285,7 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk, pages_done = 0; lock_dropped = false; for (;;) { - ret = __get_user_pages(tsk, mm, start, nr_pages, flags, pages, + ret = __get_user_pages(mm, start, nr_pages, flags, pages, vmas, locked); if (!locked) /* VM_FAULT_RETRY couldn't trigger, bypass */ @@ -1350,7 +1345,7 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk, } *locked = 1; - ret = __get_user_pages(tsk, mm, start, 1, flags | FOLL_TRIED, + ret = __get_user_pages(mm, start, 1, flags | FOLL_TRIED, pages, NULL, locked); if (!*locked) { /* Continue to retry until we succeeded */ @@ -1437,7 +1432,7 @@ long populate_vma_page_range(struct vm_area_struct *vma, * We made sure addr is within a VMA, so the following will * not result in a stack expansion that recurses back here. */ - return __get_user_pages(current, mm, start, nr_pages, gup_flags, + return __get_user_pages(mm, start, nr_pages, gup_flags, NULL, NULL, locked); } @@ -1521,7 +1516,7 @@ struct page *get_dump_page(unsigned long addr) struct vm_area_struct *vma; struct page *page; - if (__get_user_pages(current, current->mm, addr, 1, + if (__get_user_pages(current->mm, addr, 1, FOLL_FORCE | FOLL_DUMP | FOLL_GET, &page, &vma, NULL) < 1) return NULL; @@ -1530,8 +1525,7 @@ struct page *get_dump_page(unsigned long addr) } #endif /* CONFIG_ELF_CORE */ #else /* CONFIG_MMU */ -static long __get_user_pages_locked(struct task_struct *tsk, - struct mm_struct *mm, unsigned long start, +static long __get_user_pages_locked(struct mm_struct *mm, unsigned long start, unsigned long nr_pages, struct page **pages, struct vm_area_struct **vmas, int *locked, unsigned int foll_flags) @@ -1596,8 +1590,7 @@ static bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages) } #ifdef CONFIG_CMA -static long check_and_migrate_cma_pages(struct task_struct *tsk, - struct mm_struct *mm, +static long check_and_migrate_cma_pages(struct mm_struct *mm, unsigned long start, unsigned long nr_pages, struct page **pages, @@ -1675,7 +1668,7 @@ static long check_and_migrate_cma_pages(struct task_struct *tsk, * again migrating any new CMA pages which we failed to isolate * earlier. */ - ret = __get_user_pages_locked(tsk, mm, start, nr_pages, + ret = __get_user_pages_locked(mm, start, nr_pages, pages, vmas, NULL, gup_flags); @@ -1689,8 +1682,7 @@ static long check_and_migrate_cma_pages(struct task_struct *tsk, return ret; } #else -static long check_and_migrate_cma_pages(struct task_struct *tsk, - struct mm_struct *mm, +static long check_and_migrate_cma_pages(struct mm_struct *mm, unsigned long start, unsigned long nr_pages, struct page **pages, @@ -1705,8 +1697,7 @@ static long check_and_migrate_cma_pages(struct task_struct *tsk, * __gup_longterm_locked() is a wrapper for __get_user_pages_locked which * allows us to process the FOLL_LONGTERM flag. 
*/ -static long __gup_longterm_locked(struct task_struct *tsk, - struct mm_struct *mm, +static long __gup_longterm_locked(struct mm_struct *mm, unsigned long start, unsigned long nr_pages, struct page **pages, @@ -1731,7 +1722,7 @@ static long __gup_longterm_locked(struct task_struct *tsk, flags = memalloc_nocma_save(); } - rc = __get_user_pages_locked(tsk, mm, start, nr_pages, pages, + rc = __get_user_pages_locked(mm, start, nr_pages, pages, vmas_tmp, NULL, gup_flags); if (gup_flags & FOLL_LONGTERM) { @@ -1745,7 +1736,7 @@ static long __gup_longterm_locked(struct task_struct *tsk, goto out; } - rc = check_and_migrate_cma_pages(tsk, mm, start, rc, pages, + rc = check_and_migrate_cma_pages(mm, start, rc, pages, vmas_tmp, gup_flags); out: memalloc_nocma_restore(flags); @@ -1756,22 +1747,20 @@ static long __gup_longterm_locked(struct task_struct *tsk, return rc; } #else /* !CONFIG_FS_DAX && !CONFIG_CMA */ -static __always_inline long __gup_longterm_locked(struct task_struct *tsk, - struct mm_struct *mm, +static __always_inline long __gup_longterm_locked(struct mm_struct *mm, unsigned long start, unsigned long nr_pages, struct page **pages, struct vm_area_struct **vmas, unsigned int flags) { - return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas, + return __get_user_pages_locked(mm, start, nr_pages, pages, vmas, NULL, flags); } #endif /* CONFIG_FS_DAX || CONFIG_CMA */ #ifdef CONFIG_MMU -static long __get_user_pages_remote(struct task_struct *tsk, - struct mm_struct *mm, +static long __get_user_pages_remote(struct mm_struct *mm, unsigned long start, unsigned long nr_pages, unsigned int gup_flags, struct page **pages, struct vm_area_struct **vmas, int *locked) @@ -1790,20 +1779,18 @@ static long __get_user_pages_remote(struct task_struct *tsk, * This will check the vmas (even if our vmas arg is NULL) * and return -ENOTSUPP if DAX isn't allowed in this case: */ - return __gup_longterm_locked(tsk, mm, start, nr_pages, pages, + return __gup_longterm_locked(mm, start, nr_pages, pages, vmas, gup_flags | FOLL_TOUCH | FOLL_REMOTE); } - return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas, + return __get_user_pages_locked(mm, start, nr_pages, pages, vmas, locked, gup_flags | FOLL_TOUCH | FOLL_REMOTE); } /** * get_user_pages_remote() - pin user pages in memory - * @tsk: the task_struct to use for page fault accounting, or - * NULL if faults are not to be recorded. * @mm: mm_struct of target mm * @start: starting user address * @nr_pages: number of pages from start to pin @@ -1862,7 +1849,7 @@ static long __get_user_pages_remote(struct task_struct *tsk, * should use get_user_pages_remote because it cannot pass * FAULT_FLAG_ALLOW_RETRY to handle_mm_fault. 
*/ -long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm, +long get_user_pages_remote(struct mm_struct *mm, unsigned long start, unsigned long nr_pages, unsigned int gup_flags, struct page **pages, struct vm_area_struct **vmas, int *locked) @@ -1874,13 +1861,13 @@ long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm, if (WARN_ON_ONCE(gup_flags & FOLL_PIN)) return -EINVAL; - return __get_user_pages_remote(tsk, mm, start, nr_pages, gup_flags, + return __get_user_pages_remote(mm, start, nr_pages, gup_flags, pages, vmas, locked); } EXPORT_SYMBOL(get_user_pages_remote); #else /* CONFIG_MMU */ -long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm, +long get_user_pages_remote(struct mm_struct *mm, unsigned long start, unsigned long nr_pages, unsigned int gup_flags, struct page **pages, struct vm_area_struct **vmas, int *locked) @@ -1888,8 +1875,7 @@ long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm, return 0; } -static long __get_user_pages_remote(struct task_struct *tsk, - struct mm_struct *mm, +static long __get_user_pages_remote(struct mm_struct *mm, unsigned long start, unsigned long nr_pages, unsigned int gup_flags, struct page **pages, struct vm_area_struct **vmas, int *locked) @@ -1909,11 +1895,10 @@ static long __get_user_pages_remote(struct task_struct *tsk, * @vmas: array of pointers to vmas corresponding to each page. * Or NULL if the caller does not require them. * - * This is the same as get_user_pages_remote(), just with a - * less-flexible calling convention where we assume that the task - * and mm being operated on are the current task's and don't allow - * passing of a locked parameter. We also obviously don't pass - * FOLL_REMOTE in here. + * This is the same as get_user_pages_remote(), just with a less-flexible + * calling convention where we assume that the mm being operated on belongs to + * the current task, and doesn't allow passing of a locked parameter. We also + * obviously don't pass FOLL_REMOTE in here. 
*/ long get_user_pages(unsigned long start, unsigned long nr_pages, unsigned int gup_flags, struct page **pages, @@ -1926,7 +1911,7 @@ long get_user_pages(unsigned long start, unsigned long nr_pages, if (WARN_ON_ONCE(gup_flags & FOLL_PIN)) return -EINVAL; - return __gup_longterm_locked(current, current->mm, start, nr_pages, + return __gup_longterm_locked(current->mm, start, nr_pages, pages, vmas, gup_flags | FOLL_TOUCH); } EXPORT_SYMBOL(get_user_pages); @@ -1936,7 +1921,7 @@ EXPORT_SYMBOL(get_user_pages); * * mmap_read_lock(mm); * do_something() - * get_user_pages(tsk, mm, ..., pages, NULL); + * get_user_pages(mm, ..., pages, NULL); * mmap_read_unlock(mm); * * to: @@ -1944,7 +1929,7 @@ EXPORT_SYMBOL(get_user_pages); * int locked = 1; * mmap_read_lock(mm); * do_something() - * get_user_pages_locked(tsk, mm, ..., pages, &locked); + * get_user_pages_locked(mm, ..., pages, &locked); * if (locked) * mmap_read_unlock(mm); * @@ -1982,7 +1967,7 @@ long get_user_pages_locked(unsigned long start, unsigned long nr_pages, if (WARN_ON_ONCE(gup_flags & FOLL_PIN)) return -EINVAL; - return __get_user_pages_locked(current, current->mm, start, nr_pages, + return __get_user_pages_locked(current->mm, start, nr_pages, pages, NULL, locked, gup_flags | FOLL_TOUCH); } @@ -1992,12 +1977,12 @@ EXPORT_SYMBOL(get_user_pages_locked); * get_user_pages_unlocked() is suitable to replace the form: * * mmap_read_lock(mm); - * get_user_pages(tsk, mm, ..., pages, NULL); + * get_user_pages(mm, ..., pages, NULL); * mmap_read_unlock(mm); * * with: * - * get_user_pages_unlocked(tsk, mm, ..., pages); + * get_user_pages_unlocked(mm, ..., pages); * * It is functionally equivalent to get_user_pages_fast so * get_user_pages_fast should be used instead if specific gup_flags @@ -2020,7 +2005,7 @@ long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages, return -EINVAL; mmap_read_lock(mm); - ret = __get_user_pages_locked(current, mm, start, nr_pages, pages, NULL, + ret = __get_user_pages_locked(mm, start, nr_pages, pages, NULL, &locked, gup_flags | FOLL_TOUCH); if (locked) mmap_read_unlock(mm); @@ -2665,7 +2650,7 @@ static int __gup_longterm_unlocked(unsigned long start, int nr_pages, */ if (gup_flags & FOLL_LONGTERM) { mmap_read_lock(current->mm); - ret = __gup_longterm_locked(current, current->mm, + ret = __gup_longterm_locked(current->mm, start, nr_pages, pages, NULL, gup_flags); mmap_read_unlock(current->mm); @@ -2908,10 +2893,8 @@ int pin_user_pages_fast_only(unsigned long start, int nr_pages, EXPORT_SYMBOL_GPL(pin_user_pages_fast_only); /** - * pin_user_pages_remote() - pin pages of a remote process (task != current) + * pin_user_pages_remote() - pin pages of a remote process * - * @tsk: the task_struct to use for page fault accounting, or - * NULL if faults are not to be recorded. * @mm: mm_struct of target mm * @start: starting user address * @nr_pages: number of pages from start to pin @@ -2932,7 +2915,7 @@ EXPORT_SYMBOL_GPL(pin_user_pages_fast_only); * FOLL_PIN means that the pages must be released via unpin_user_page(). Please * see Documentation/core-api/pin_user_pages.rst for details. 
*/ -long pin_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm, +long pin_user_pages_remote(struct mm_struct *mm, unsigned long start, unsigned long nr_pages, unsigned int gup_flags, struct page **pages, struct vm_area_struct **vmas, int *locked) @@ -2942,7 +2925,7 @@ long pin_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm, return -EINVAL; gup_flags |= FOLL_PIN; - return __get_user_pages_remote(tsk, mm, start, nr_pages, gup_flags, + return __get_user_pages_remote(mm, start, nr_pages, gup_flags, pages, vmas, locked); } EXPORT_SYMBOL(pin_user_pages_remote); @@ -2974,7 +2957,7 @@ long pin_user_pages(unsigned long start, unsigned long nr_pages, return -EINVAL; gup_flags |= FOLL_PIN; - return __gup_longterm_locked(current, current->mm, start, nr_pages, + return __gup_longterm_locked(current->mm, start, nr_pages, pages, vmas, gup_flags); } EXPORT_SYMBOL(pin_user_pages); @@ -3019,7 +3002,7 @@ long pin_user_pages_locked(unsigned long start, unsigned long nr_pages, return -EINVAL; gup_flags |= FOLL_PIN; - return __get_user_pages_locked(current, current->mm, start, nr_pages, + return __get_user_pages_locked(current->mm, start, nr_pages, pages, NULL, locked, gup_flags | FOLL_TOUCH); } diff --git a/mm/memory.c b/mm/memory.c index 2b7f0e00f312..228efaca75d3 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4742,7 +4742,7 @@ int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm, void *maddr; struct page *page = NULL; - ret = get_user_pages_remote(tsk, mm, addr, 1, + ret = get_user_pages_remote(mm, addr, 1, gup_flags, &page, &vma, NULL); if (ret <= 0) { #ifndef CONFIG_HAVE_IOREMAP_PROT diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c index cc85ce81914a..29c052099aff 100644 --- a/mm/process_vm_access.c +++ b/mm/process_vm_access.c @@ -105,7 +105,7 @@ static int process_vm_rw_single_vec(unsigned long addr, * current/current->mm */ mmap_read_lock(mm); - pinned_pages = pin_user_pages_remote(task, mm, pa, pinned_pages, + pinned_pages = pin_user_pages_remote(mm, pa, pinned_pages, flags, process_pages, NULL, &locked); if (locked) diff --git a/security/tomoyo/domain.c b/security/tomoyo/domain.c index 53b3e1f5f227..dc4ecc0b2038 100644 --- a/security/tomoyo/domain.c +++ b/security/tomoyo/domain.c @@ -914,7 +914,7 @@ bool tomoyo_dump_page(struct linux_binprm *bprm, unsigned long pos, * (represented by bprm). 'current' is the process doing * the execve(). */ - if (get_user_pages_remote(current, bprm->mm, pos, 1, + if (get_user_pages_remote(bprm->mm, pos, 1, FOLL_FORCE, &page, NULL, NULL) <= 0) return false; #else diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c index 390f758d5a27..dd777688d14a 100644 --- a/virt/kvm/async_pf.c +++ b/virt/kvm/async_pf.c @@ -61,7 +61,7 @@ static void async_pf_execute(struct work_struct *work) * access remotely. */ mmap_read_lock(mm); - get_user_pages_remote(NULL, mm, addr, 1, FOLL_WRITE, NULL, NULL, + get_user_pages_remote(mm, addr, 1, FOLL_WRITE, NULL, NULL, &locked); if (locked) mmap_read_unlock(mm); diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 2c2c0254c2d8..737666db02de 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1893,7 +1893,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma, * not call the fault handler, so do it here. */ bool unlocked = false; - r = fixup_user_fault(current, current->mm, addr, + r = fixup_user_fault(current->mm, addr, (write_fault ? FAULT_FLAG_WRITE : 0), &unlocked); if (unlocked)