From f48cfddc6729ef133933062320039808bafa6f45 Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Fri, 8 Nov 2013 16:31:29 -0800 Subject: [PATCH 01/55] vfs: In d_path don't call d_dname on a mount point Aditya Kali (adityakali@google.com) wrote: > Commit bf056bfa80596a5d14b26b17276a56a0dcb080e5: > "proc: Fix the namespace inode permission checks." converted > the namespace files into symlinks. The same commit changed > the way namespace bind mounts appear in /proc/mounts: > $ mount --bind /proc/self/ns/ipc /mnt/ipc > Originally: > $ cat /proc/mounts | grep ipc > proc /mnt/ipc proc rw,nosuid,nodev,noexec 0 0 > > After commit bf056bfa80596a5d14b26b17276a56a0dcb080e5: > $ cat /proc/mounts | grep ipc > proc ipc:[4026531839] proc rw,nosuid,nodev,noexec 0 0 > > This breaks userspace which expects the 2nd field in > /proc/mounts to be a valid path. The symlink /proc//ns/{ipc,mnt,net,pid,user,uts} point to dentries allocated with d_alloc_pseudo that we can mount, and that have interesting names printed out with d_dname. When these files are bind mounted /proc/mounts is not currently displaying the mount point correctly because d_dname is called instead of just displaying the path where the file is mounted. Solve this by adding an explicit check to distinguish mounted pseudo inodes and unmounted pseudo inodes. Unmounted pseudo inodes always use mount of their filesstem as the mnt_root in their path making these two cases easy to distinguish. CC: stable@vger.kernel.org Acked-by: Serge Hallyn Reported-by: Aditya Kali Signed-off-by: "Eric W. Biederman" --- fs/dcache.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/fs/dcache.c b/fs/dcache.c index 4bdb300b16e2..f7282ebf1a37 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -3061,8 +3061,13 @@ char *d_path(const struct path *path, char *buf, int buflen) * thus don't need to be hashed. They also don't need a name until a * user wants to identify the object in /proc/pid/fd/. The little hack * below allows us to generate a name for these objects on demand: + * + * Some pseudo inodes are mountable. When they are mounted + * path->dentry == path->mnt->mnt_root. In that case don't call d_dname + * and instead have d_path return the mounted path. */ - if (path->dentry->d_op && path->dentry->d_op->d_dname) + if (path->dentry->d_op && path->dentry->d_op->d_dname && + (!IS_ROOT(path->dentry) || path->dentry != path->mnt->mnt_root)) return path->dentry->d_op->d_dname(path->dentry, buf, buflen); rcu_read_lock(); From 1f7f4dde5c945f41a7abc2285be43d918029ecc5 Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Thu, 14 Nov 2013 21:10:16 -0800 Subject: [PATCH 02/55] fork: Allow CLONE_PARENT after setns(CLONE_NEWPID) Serge Hallyn writes: > Hi Oleg, > > commit 40a0d32d1eaffe6aac7324ca92604b6b3977eb0e : > "fork: unify and tighten up CLONE_NEWUSER/CLONE_NEWPID checks" > breaks lxc-attach in 3.12. That code forks a child which does > setns() and then does a clone(CLONE_PARENT). That way the > grandchild can be in the right namespaces (which the child was > not) and be a child of the original task, which is the monitor. > > lxc-attach in 3.11 was working fine with no side effects that I > could see. Is there a real danger in allowing CLONE_PARENT > when current->nsproxy->pidns_for_children is not our pidns, > or was this done out of an "over-abundance of caution"? Can we > safely revert that new extra check? The two fundamental things I know we can not allow are: - A shared signal queue aka CLONE_THREAD. Because we compute the pid and uid of the signal when we place it in the queue. - Changing the pid and by extention pid_namespace of an existing process. From a parents perspective there is nothing special about the pid namespace, to deny CLONE_PARENT, because the parent simply won't know or care. From the childs perspective all that is special really are shared signal queues. User mode threading with CLONE_PARENT|CLONE_VM|CLONE_SIGHAND and tasks in different pid namespaces is almost certainly going to break because it is complicated. But shared signal handlers can look at per thread information to know which pid namespace a process is in, so I don't know of any reason not to support CLONE_PARENT|CLONE_VM|CLONE_SIGHAND threads at the kernel level. It would be absolutely stupid to implement but that is a different thing. So hmm. Because it can do no harm, and because it is a regression let's remove the CLONE_PARENT check and send it stable. Cc: stable@vger.kernel.org Acked-by: Oleg Nesterov Acked-by: Andy Lutomirski Acked-by: Serge E. Hallyn Signed-off-by: "Eric W. Biederman" --- kernel/fork.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/fork.c b/kernel/fork.c index 728d5be9548c..f82fa2ee7581 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1171,7 +1171,7 @@ static struct task_struct *copy_process(unsigned long clone_flags, * do not allow it to share a thread group or signal handlers or * parent with the forking task. */ - if (clone_flags & (CLONE_SIGHAND | CLONE_PARENT)) { + if (clone_flags & CLONE_SIGHAND) { if ((clone_flags & (CLONE_NEWUSER | CLONE_NEWPID)) || (task_active_pid_ns(current) != current->nsproxy->pid_ns_for_children)) From 41301ae78a99ead04ea42672a1ab72c6f44cc81d Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Thu, 14 Nov 2013 21:22:25 -0800 Subject: [PATCH 03/55] vfs: Fix a regression in mounting proc Gao feng reported that commit e51db73532955dc5eaba4235e62b74b460709d5b userns: Better restrictions on when proc and sysfs can be mounted caused a regression on mounting a new instance of proc in a mount namespace created with user namespace privileges, when binfmt_misc is mounted on /proc/sys/fs/binfmt_misc. This is an unintended regression caused by the absolutely bogus empty directory check in fs_fully_visible. The check fs_fully_visible replaced didn't even bother to attempt to verify proc was fully visible and hiding proc files with any kind of mount is rare. So for now fix the userspace regression by allowing directory with nlink == 1 as /proc/sys/fs/binfmt_misc has. I will have a better patch but it is not stable material, or last minute kernel material. So it will have to wait. Cc: stable@vger.kernel.org Acked-by: Serge Hallyn Acked-by: Gao feng Tested-by: Gao feng Signed-off-by: "Eric W. Biederman" --- fs/namespace.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/namespace.c b/fs/namespace.c index ac2ce8a766e1..be32ebccdeb1 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2886,7 +2886,7 @@ bool fs_fully_visible(struct file_system_type *type) struct inode *inode = child->mnt_mountpoint->d_inode; if (!S_ISDIR(inode->i_mode)) goto next; - if (inode->i_nlink != 2) + if (inode->i_nlink > 2) goto next; } visible = true; From f9b0e058cbd04ada76b13afffa7e1df830543c24 Mon Sep 17 00:00:00 2001 From: Jan Kara Date: Sat, 14 Dec 2013 04:21:26 +0800 Subject: [PATCH 04/55] writeback: Fix data corruption on NFS Commit 4f8ad655dbc8 "writeback: Refactor writeback_single_inode()" added a condition to skip clean inode. However this is wrong in WB_SYNC_ALL mode because there we also want to wait for outstanding writeback on possibly clean inode. This was causing occasional data corruption issues on NFS because it uses sync_inode() to make sure all outstanding writes are flushed to the server before truncating the inode and with sync_inode() returning prematurely file was sometimes extended back by an outstanding write after it was truncated. So modify the test to also check for pages under writeback in WB_SYNC_ALL mode. CC: stable@vger.kernel.org # >= 3.5 Fixes: 4f8ad655dbc82cf05d2edc11e66b78a42d38bf93 Reported-and-tested-by: Dan Duval Signed-off-by: Jan Kara --- fs/fs-writeback.c | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 1f4a10ece2f1..e0259a163f98 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -516,13 +516,16 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb, } WARN_ON(inode->i_state & I_SYNC); /* - * Skip inode if it is clean. We don't want to mess with writeback - * lists in this function since flusher thread may be doing for example - * sync in parallel and if we move the inode, it could get skipped. So - * here we make sure inode is on some writeback list and leave it there - * unless we have completely cleaned the inode. + * Skip inode if it is clean and we have no outstanding writeback in + * WB_SYNC_ALL mode. We don't want to mess with writeback lists in this + * function since flusher thread may be doing for example sync in + * parallel and if we move the inode, it could get skipped. So here we + * make sure inode is on some writeback list and leave it there unless + * we have completely cleaned the inode. */ - if (!(inode->i_state & I_DIRTY)) + if (!(inode->i_state & I_DIRTY) && + (wbc->sync_mode != WB_SYNC_ALL || + !mapping_tagged(inode->i_mapping, PAGECACHE_TAG_WRITEBACK))) goto out; inode->i_state |= I_SYNC; spin_unlock(&inode->i_lock); From c1dcc927dae01dfd4904ee82ce2c00b50eab6dc3 Mon Sep 17 00:00:00 2001 From: Soren Brinkmann Date: Tue, 26 Nov 2013 17:04:50 -0800 Subject: [PATCH 05/55] clocksource: cadence_ttc: Fix mutex taken inside interrupt context When the kernel is compiled with: CONFIG_HIGH_RES_TIMERS=no CONFIG_HZ_PERIODIC=yes CONFIG_DEBUG_ATOMIC_SLEEP=yes The following WARN appears: WARNING: CPU: 1 PID: 0 at linux/kernel/mutex.c:856 mutex_trylock+0x70/0x1fc() DEBUG_LOCKS_WARN_ON(in_interrupt()) Modules linked in: CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.12.0-xilinx-dirty #93 [] (unwind_backtrace+0x0/0x11c) from [] (show_stack+0x10/0x14) [] (show_stack+0x10/0x14) from [] (dump_stack+0x7c/0xc0) [] (dump_stack+0x7c/0xc0) from [] (warn_slowpath_common+0x60/0x84) [] (warn_slowpath_common+0x60/0x84) from [] (warn_slowpath_fmt+0x2c/0x3c) [] (warn_slowpath_fmt+0x2c/0x3c) from [] (mutex_trylock+0x70/0x1fc) [] (mutex_trylock+0x70/0x1fc) from [] (clk_prepare_lock+0xc/0xe4) [] (clk_prepare_lock+0xc/0xe4) from [] (clk_get_rate+0xc/0x44) [] (clk_get_rate+0xc/0x44) from [] (ttc_set_mode+0x34/0x78) [] (ttc_set_mode+0x34/0x78) from [] (clockevents_set_mode+0x28/0x5c) [] (clockevents_set_mode+0x28/0x5c) from [] (tick_broadcast_on_off+0x190/0x1c0) [] (tick_broadcast_on_off+0x190/0x1c0) from [] (clockevents_notify+0x58/0x1ac) [] (clockevents_notify+0x58/0x1ac) from [] (cpuidle_setup_broadcast_timer+0x20/0x24) [] (cpuidle_setup_broadcast_timer+0x20/0x24) from [] (generic_smp_call_function_single_interrupt+0) [] (generic_smp_call_function_single_interrupt+0xe0/0x130) from [] (handle_IPI+0x88/0x118) [] (handle_IPI+0x88/0x118) from [] (gic_handle_irq+0x58/0x60) [] (gic_handle_irq+0x58/0x60) from [] (__irq_svc+0x44/0x78) Exception stack(0xef099fa0 to 0xef099fe8) 9fa0: 00000001 ef092100 00000000 ef092100 ef098000 00000015 c0399f2c c0579d74 9fc0: 0000406a 413fc090 00000000 00000000 00000000 ef099fe8 c00666ec c000f46c 9fe0: 20000113 ffffffff [] (__irq_svc+0x44/0x78) from [] (arch_cpu_idle+0x34/0x3c) [] (arch_cpu_idle+0x34/0x3c) from [] (cpu_startup_entry+0xa8/0x10c) [] (cpu_startup_entry+0xa8/0x10c) from [<000085a4>] (0x85a4) We are in an interrupt context (IPI) and we are calling clk_get_rate in the set_mode function which in turn ends up by getting a mutex... Even if that does not hang, it is a potential kernel deadlock. It is not allowed to call clk_get_rate() from interrupt context. To avoid such calls the timer input frequency is stored in the driver's data struct which makes it accessible to the driver in any context. [dlezcano] completed the changelog with the WARN trace and added a more detailed description. Tested on zync zc702. Acked-by: Daniel Lezcano Tested-by: Daniel Lezcano Signed-off-by: Soren Brinkmann Signed-off-by: Daniel Lezcano --- drivers/clocksource/cadence_ttc_timer.c | 21 +++++++++++++-------- 1 file changed, 13 insertions(+), 8 deletions(-) diff --git a/drivers/clocksource/cadence_ttc_timer.c b/drivers/clocksource/cadence_ttc_timer.c index b2bb3a4bc205..a92350b55d32 100644 --- a/drivers/clocksource/cadence_ttc_timer.c +++ b/drivers/clocksource/cadence_ttc_timer.c @@ -67,11 +67,13 @@ * struct ttc_timer - This definition defines local timer structure * * @base_addr: Base address of timer + * @freq: Timer input clock frequency * @clk: Associated clock source * @clk_rate_change_nb Notifier block for clock rate changes */ struct ttc_timer { void __iomem *base_addr; + unsigned long freq; struct clk *clk; struct notifier_block clk_rate_change_nb; }; @@ -196,9 +198,8 @@ static void ttc_set_mode(enum clock_event_mode mode, switch (mode) { case CLOCK_EVT_MODE_PERIODIC: - ttc_set_interval(timer, - DIV_ROUND_CLOSEST(clk_get_rate(ttce->ttc.clk), - PRESCALE * HZ)); + ttc_set_interval(timer, DIV_ROUND_CLOSEST(ttce->ttc.freq, + PRESCALE * HZ)); break; case CLOCK_EVT_MODE_ONESHOT: case CLOCK_EVT_MODE_UNUSED: @@ -273,6 +274,8 @@ static void __init ttc_setup_clocksource(struct clk *clk, void __iomem *base) return; } + ttccs->ttc.freq = clk_get_rate(ttccs->ttc.clk); + ttccs->ttc.clk_rate_change_nb.notifier_call = ttc_rate_change_clocksource_cb; ttccs->ttc.clk_rate_change_nb.next = NULL; @@ -298,16 +301,14 @@ static void __init ttc_setup_clocksource(struct clk *clk, void __iomem *base) __raw_writel(CNT_CNTRL_RESET, ttccs->ttc.base_addr + TTC_CNT_CNTRL_OFFSET); - err = clocksource_register_hz(&ttccs->cs, - clk_get_rate(ttccs->ttc.clk) / PRESCALE); + err = clocksource_register_hz(&ttccs->cs, ttccs->ttc.freq / PRESCALE); if (WARN_ON(err)) { kfree(ttccs); return; } ttc_sched_clock_val_reg = base + TTC_COUNT_VAL_OFFSET; - setup_sched_clock(ttc_sched_clock_read, 16, - clk_get_rate(ttccs->ttc.clk) / PRESCALE); + setup_sched_clock(ttc_sched_clock_read, 16, ttccs->ttc.freq / PRESCALE); } static int ttc_rate_change_clockevent_cb(struct notifier_block *nb, @@ -334,6 +335,9 @@ static int ttc_rate_change_clockevent_cb(struct notifier_block *nb, ndata->new_rate / PRESCALE); local_irq_restore(flags); + /* update cached frequency */ + ttc->freq = ndata->new_rate; + /* fall through */ } case PRE_RATE_CHANGE: @@ -367,6 +371,7 @@ static void __init ttc_setup_clockevent(struct clk *clk, if (clk_notifier_register(ttcce->ttc.clk, &ttcce->ttc.clk_rate_change_nb)) pr_warn("Unable to register clock notifier.\n"); + ttcce->ttc.freq = clk_get_rate(ttcce->ttc.clk); ttcce->ttc.base_addr = base; ttcce->ce.name = "ttc_clockevent"; @@ -396,7 +401,7 @@ static void __init ttc_setup_clockevent(struct clk *clk, } clockevents_config_and_register(&ttcce->ce, - clk_get_rate(ttcce->ttc.clk) / PRESCALE, 1, 0xfffe); + ttcce->ttc.freq / PRESCALE, 1, 0xfffe); } /** From 6bcac805bacb3b637911244d95368512ae389abc Mon Sep 17 00:00:00 2001 From: Russell King Date: Tue, 7 Jan 2014 17:53:54 +0000 Subject: [PATCH 06/55] Revert "ARM: 7908/1: mm: Fix the arm_dma_limit calculation" This reverts commit 787b0d5c1ca7ff24feb6f92e4c7f4410ee7d81a8 since it is no longer required after 7909/1 was applied, and it causes build regressions when ARM_PATCH_PHYS_VIRT is disabled and DMA_ZONE is enabled. Signed-off-by: Russell King --- arch/arm/mm/init.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/arm/mm/init.c b/arch/arm/mm/init.c index 1f7b19a47060..3e8f106ee5fe 100644 --- a/arch/arm/mm/init.c +++ b/arch/arm/mm/init.c @@ -229,7 +229,7 @@ void __init setup_dma_zone(const struct machine_desc *mdesc) #ifdef CONFIG_ZONE_DMA if (mdesc->dma_zone_size) { arm_dma_zone_size = mdesc->dma_zone_size; - arm_dma_limit = __pv_phys_offset + arm_dma_zone_size - 1; + arm_dma_limit = PHYS_OFFSET + arm_dma_zone_size - 1; } else arm_dma_limit = 0xffffffff; arm_dma_pfn_limit = arm_dma_limit >> PAGE_SHIFT; From e44ef891e9e68b6ce7d3fd3bac73b7d5433050ae Mon Sep 17 00:00:00 2001 From: Sudeep Holla Date: Wed, 8 Jan 2014 18:24:21 +0100 Subject: [PATCH 07/55] ARM: 7934/1: DT/kernel: fix arch_match_cpu_phys_id to avoid erroneous match The MPIDR contains specific bitfields(MPIDR.Aff{2..0}) which uniquely identify a CPU, in addition to some non-identifying information and reserved bits. The ARM cpu binding defines the 'reg' property to only contain the affinity bits, and any cpu nodes with other bits set in their 'reg' entry are skipped. As such it is not necessary to mask the phys_id with MPIDR_HWID_BITMASK, and doing so could lead to matching erroneous CPU nodes in the device tree. This patch removes the masking of the physical identifier. Acked-by: Mark Rutland Signed-off-by: Sudeep Holla Signed-off-by: Russell King --- arch/arm/kernel/devtree.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/arm/kernel/devtree.c b/arch/arm/kernel/devtree.c index 739c3dfc1da2..34d5fd585bbb 100644 --- a/arch/arm/kernel/devtree.c +++ b/arch/arm/kernel/devtree.c @@ -171,7 +171,7 @@ void __init arm_dt_init_cpu_maps(void) bool arch_match_cpu_phys_id(int cpu, u64 phys_id) { - return (phys_id & MPIDR_HWID_BITMASK) == cpu_logical_map(cpu); + return phys_id == cpu_logical_map(cpu); } static const void * __init arch_get_next_mach(const char *const **match) From 261521f142fb426b58ed460d2476049dc8fa4fb9 Mon Sep 17 00:00:00 2001 From: Stephen Boyd Date: Fri, 10 Jan 2014 00:57:06 +0100 Subject: [PATCH 08/55] ARM: 7937/1: perf_event: Silence sparse warning arch/arm/kernel/perf_event_cpu.c:274:25: warning: incorrect type in assignment (different modifiers) arch/arm/kernel/perf_event_cpu.c:274:25: expected int ( *init_fn )( ... ) arch/arm/kernel/perf_event_cpu.c:274:25: got void const *const data Acked-by: Will Deacon Signed-off-by: Stephen Boyd Signed-off-by: Russell King --- arch/arm/kernel/perf_event_cpu.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/arm/kernel/perf_event_cpu.c b/arch/arm/kernel/perf_event_cpu.c index d85055cd24ba..20d553c9f5e2 100644 --- a/arch/arm/kernel/perf_event_cpu.c +++ b/arch/arm/kernel/perf_event_cpu.c @@ -254,7 +254,7 @@ static int probe_current_pmu(struct arm_pmu *pmu) static int cpu_pmu_device_probe(struct platform_device *pdev) { const struct of_device_id *of_id; - int (*init_fn)(struct arm_pmu *); + const int (*init_fn)(struct arm_pmu *); struct device_node *node = pdev->dev.of_node; struct arm_pmu *pmu; int ret = -ENODEV; From d6cd989477e2fee29ccda257614ef7b2621d0601 Mon Sep 17 00:00:00 2001 From: Taras Kondratiuk Date: Fri, 10 Jan 2014 01:36:00 +0100 Subject: [PATCH 09/55] ARM: 7939/1: traps: fix opcode endianness when read from user memory Currently code has an inverted logic: opcode from user memory is swapped to a proper endianness only in case of read error. While normally opcode should be swapped only if it was read correctly from user memory. Reviewed-by: Victor Kamensky Signed-off-by: Ben Dooks Signed-off-by: Taras Kondratiuk Signed-off-by: Russell King --- arch/arm/kernel/traps.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/arch/arm/kernel/traps.c b/arch/arm/kernel/traps.c index 6eda3bf85c52..4636d56af2db 100644 --- a/arch/arm/kernel/traps.c +++ b/arch/arm/kernel/traps.c @@ -431,9 +431,10 @@ asmlinkage void __exception do_undefinstr(struct pt_regs *regs) instr2 = __mem_to_opcode_thumb16(instr2); instr = __opcode_thumb32_compose(instr, instr2); } - } else if (get_user(instr, (u32 __user *)pc)) { + } else { + if (get_user(instr, (u32 __user *)pc)) + goto die_sig; instr = __mem_to_opcode_arm(instr); - goto die_sig; } if (call_undef_hook(regs, instr) == 0) From 9722c2dac708e9468cc0dc30218ef76946ffbc9d Mon Sep 17 00:00:00 2001 From: Rik van Riel Date: Mon, 6 Jan 2014 11:39:12 +0000 Subject: [PATCH 10/55] sched: Calculate effective load even if local weight is 0 Thomas Hellstrom bisected a regression where erratic 3D performance is experienced on virtual machines as measured by glxgears. It identified commit 58d081b5 ("sched/numa: Avoid overloading CPUs on a preferred NUMA node") as the problem which had modified the behaviour of effective_load. Effective load calculates the difference to the system-wide load if a scheduling entity was moved to another CPU. The task group is not heavier as a result of the move but overall system load can increase/decrease as a result of the change. Commit 58d081b5 ("sched/numa: Avoid overloading CPUs on a preferred NUMA node") changed effective_load to make it suitable for calculating if a particular NUMA node was compute overloaded. To reduce the cost of the function, it assumed that a current sched entity weight of 0 was uninteresting but that is not the case. wake_affine() uses a weight of 0 for sync wakeups on the grounds that it is assuming the waking task will sleep and not contribute to load in the near future. In this case, we still want to calculate the effective load of the sched entity hierarchy. As effective_load is no longer used by task_numa_compare since commit fb13c7ee (sched/numa: Use a system-wide search to find swap/migration candidates), this patch simply restores the historical behaviour. Reported-and-tested-by: Thomas Hellstrom Signed-off-by: Rik van Riel [ Wrote changelog] Signed-off-by: Mel Gorman Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/r/20140106113912.GC6178@suse.de Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index c7395d97e4cb..e64b0794060e 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3923,7 +3923,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg) { struct sched_entity *se = tg->se[cpu]; - if (!tg->parent || !wl) /* the trivial, non-cgroup case */ + if (!tg->parent) /* the trivial, non-cgroup case */ return wl; for_each_sched_entity(se) { From 0c3351d451ae2fa438d5d1ed719fc43354fbffbb Mon Sep 17 00:00:00 2001 From: John Stultz Date: Thu, 2 Jan 2014 15:11:13 -0800 Subject: [PATCH 11/55] seqlock: Use raw_ prefix instead of _no_lockdep MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Linus disliked the _no_lockdep() naming, so instead use the more-consistent raw_* prefix to the non-lockdep enabled seqcount methods. This also adds raw_ methods for the write operations as well, which will be utilized in a following patch. Acked-by: Linus Torvalds Reviewed-by: Stephen Boyd Signed-off-by: John Stultz Signed-off-by: Peter Zijlstra Cc: Krzysztof Hałasa Cc: Uwe Kleine-König Cc: Willy Tarreau Link: http://lkml.kernel.org/r/1388704274-5278-1-git-send-email-john.stultz@linaro.org Signed-off-by: Ingo Molnar --- arch/x86/vdso/vclock_gettime.c | 8 ++++---- include/linux/seqlock.h | 27 +++++++++++++++++++-------- 2 files changed, 23 insertions(+), 12 deletions(-) diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c index 2ada505067cc..eb5d7a56f8d4 100644 --- a/arch/x86/vdso/vclock_gettime.c +++ b/arch/x86/vdso/vclock_gettime.c @@ -178,7 +178,7 @@ notrace static int __always_inline do_realtime(struct timespec *ts) ts->tv_nsec = 0; do { - seq = read_seqcount_begin_no_lockdep(>od->seq); + seq = raw_read_seqcount_begin(>od->seq); mode = gtod->clock.vclock_mode; ts->tv_sec = gtod->wall_time_sec; ns = gtod->wall_time_snsec; @@ -198,7 +198,7 @@ notrace static int do_monotonic(struct timespec *ts) ts->tv_nsec = 0; do { - seq = read_seqcount_begin_no_lockdep(>od->seq); + seq = raw_read_seqcount_begin(>od->seq); mode = gtod->clock.vclock_mode; ts->tv_sec = gtod->monotonic_time_sec; ns = gtod->monotonic_time_snsec; @@ -214,7 +214,7 @@ notrace static int do_realtime_coarse(struct timespec *ts) { unsigned long seq; do { - seq = read_seqcount_begin_no_lockdep(>od->seq); + seq = raw_read_seqcount_begin(>od->seq); ts->tv_sec = gtod->wall_time_coarse.tv_sec; ts->tv_nsec = gtod->wall_time_coarse.tv_nsec; } while (unlikely(read_seqcount_retry(>od->seq, seq))); @@ -225,7 +225,7 @@ notrace static int do_monotonic_coarse(struct timespec *ts) { unsigned long seq; do { - seq = read_seqcount_begin_no_lockdep(>od->seq); + seq = raw_read_seqcount_begin(>od->seq); ts->tv_sec = gtod->monotonic_time_coarse.tv_sec; ts->tv_nsec = gtod->monotonic_time_coarse.tv_nsec; } while (unlikely(read_seqcount_retry(>od->seq, seq))); diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h index cf87a24c0f92..535f158977b9 100644 --- a/include/linux/seqlock.h +++ b/include/linux/seqlock.h @@ -117,15 +117,15 @@ static inline unsigned __read_seqcount_begin(const seqcount_t *s) } /** - * read_seqcount_begin_no_lockdep - start seq-read critical section w/o lockdep + * raw_read_seqcount_begin - start seq-read critical section w/o lockdep * @s: pointer to seqcount_t * Returns: count to be passed to read_seqcount_retry * - * read_seqcount_begin_no_lockdep opens a read critical section of the given + * raw_read_seqcount_begin opens a read critical section of the given * seqcount, but without any lockdep checking. Validity of the critical * section is tested by checking read_seqcount_retry function. */ -static inline unsigned read_seqcount_begin_no_lockdep(const seqcount_t *s) +static inline unsigned raw_read_seqcount_begin(const seqcount_t *s) { unsigned ret = __read_seqcount_begin(s); smp_rmb(); @@ -144,7 +144,7 @@ static inline unsigned read_seqcount_begin_no_lockdep(const seqcount_t *s) static inline unsigned read_seqcount_begin(const seqcount_t *s) { seqcount_lockdep_reader_access(s); - return read_seqcount_begin_no_lockdep(s); + return raw_read_seqcount_begin(s); } /** @@ -206,14 +206,26 @@ static inline int read_seqcount_retry(const seqcount_t *s, unsigned start) } + +static inline void raw_write_seqcount_begin(seqcount_t *s) +{ + s->sequence++; + smp_wmb(); +} + +static inline void raw_write_seqcount_end(seqcount_t *s) +{ + smp_wmb(); + s->sequence++; +} + /* * Sequence counter only version assumes that callers are using their * own mutexing. */ static inline void write_seqcount_begin_nested(seqcount_t *s, int subclass) { - s->sequence++; - smp_wmb(); + raw_write_seqcount_begin(s); seqcount_acquire(&s->dep_map, subclass, 0, _RET_IP_); } @@ -225,8 +237,7 @@ static inline void write_seqcount_begin(seqcount_t *s) static inline void write_seqcount_end(seqcount_t *s) { seqcount_release(&s->dep_map, 1, _RET_IP_); - smp_wmb(); - s->sequence++; + raw_write_seqcount_end(s); } /** From 7a06c41cbec33c6dbe7eec575c61986122617408 Mon Sep 17 00:00:00 2001 From: John Stultz Date: Thu, 2 Jan 2014 15:11:14 -0800 Subject: [PATCH 12/55] sched_clock: Disable seqlock lockdep usage in sched_clock() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Unfortunately the seqlock lockdep enablement can't be used in sched_clock(), since the lockdep infrastructure eventually calls into sched_clock(), which causes a deadlock. Thus, this patch changes all generic sched_clock() usage to use the raw_* methods. Acked-by: Linus Torvalds Reviewed-by: Stephen Boyd Reported-by: Krzysztof Hałasa Signed-off-by: John Stultz Cc: Uwe Kleine-König Cc: Willy Tarreau Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/r/1388704274-5278-2-git-send-email-john.stultz@linaro.org Signed-off-by: Ingo Molnar --- kernel/time/sched_clock.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c index 68b799375981..0abb36464281 100644 --- a/kernel/time/sched_clock.c +++ b/kernel/time/sched_clock.c @@ -74,7 +74,7 @@ unsigned long long notrace sched_clock(void) return cd.epoch_ns; do { - seq = read_seqcount_begin(&cd.seq); + seq = raw_read_seqcount_begin(&cd.seq); epoch_cyc = cd.epoch_cyc; epoch_ns = cd.epoch_ns; } while (read_seqcount_retry(&cd.seq, seq)); @@ -99,10 +99,10 @@ static void notrace update_sched_clock(void) cd.mult, cd.shift); raw_local_irq_save(flags); - write_seqcount_begin(&cd.seq); + raw_write_seqcount_begin(&cd.seq); cd.epoch_ns = ns; cd.epoch_cyc = cyc; - write_seqcount_end(&cd.seq); + raw_write_seqcount_end(&cd.seq); raw_local_irq_restore(flags); } From b25f3e1c358434bf850220e04f28eebfc45eb634 Mon Sep 17 00:00:00 2001 From: Taras Kondratiuk Date: Fri, 10 Jan 2014 01:27:08 +0100 Subject: [PATCH 13/55] ARM: 7938/1: OMAP4/highbank: Flush L2 cache before disabling Kexec disables outer cache before jumping to reboot code, but it doesn't flush it explicitly. Flush is done implicitly inside of l2x0_disable(). But some SoC's override default .disable handler and don't flush cache. This may lead to a corrupted memory during Kexec reboot on these platforms. This patch adds cache flush inside of OMAP4 and Highbank outer_cache.disable() handlers to make it consistent with default l2x0_disable(). Acked-by: Rob Herring Acked-by: Santosh Shilimkar Acked-by: Tony Lindgren Signed-off-by: Taras Kondratiuk Signed-off-by: Russell King --- arch/arm/mach-highbank/highbank.c | 1 + arch/arm/mach-omap2/omap4-common.c | 1 + 2 files changed, 2 insertions(+) diff --git a/arch/arm/mach-highbank/highbank.c b/arch/arm/mach-highbank/highbank.c index b3d7e5634b83..ae171506cb06 100644 --- a/arch/arm/mach-highbank/highbank.c +++ b/arch/arm/mach-highbank/highbank.c @@ -50,6 +50,7 @@ static void __init highbank_scu_map_io(void) static void highbank_l2x0_disable(void) { + outer_flush_all(); /* Disable PL310 L2 Cache controller */ highbank_smc1(0x102, 0x0); } diff --git a/arch/arm/mach-omap2/omap4-common.c b/arch/arm/mach-omap2/omap4-common.c index 57911430324e..3f44b162fcab 100644 --- a/arch/arm/mach-omap2/omap4-common.c +++ b/arch/arm/mach-omap2/omap4-common.c @@ -163,6 +163,7 @@ void __iomem *omap4_get_l2cache_base(void) static void omap4_l2x0_disable(void) { + outer_flush_all(); /* Disable PL310 L2 Cache controller */ omap_smc1(0x102, 0x0); } From 9ef9730ba84d4021a27d3a1679fd50f9bac0e0e7 Mon Sep 17 00:00:00 2001 From: Dan Carpenter Date: Thu, 9 Jan 2014 08:34:00 +0300 Subject: [PATCH 14/55] cxgb4: silence shift wrapping static checker warning I don't know how large "tp->vlan_shift" is but static checkers worry about shift wrapping bugs here. Signed-off-by: Dan Carpenter Acked-by: Dimitris Michailidis Signed-off-by: David S. Miller --- drivers/net/ethernet/chelsio/cxgb4/l2t.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/chelsio/cxgb4/l2t.c b/drivers/net/ethernet/chelsio/cxgb4/l2t.c index cb05be905def..81e8402a74b4 100644 --- a/drivers/net/ethernet/chelsio/cxgb4/l2t.c +++ b/drivers/net/ethernet/chelsio/cxgb4/l2t.c @@ -423,7 +423,7 @@ u64 cxgb4_select_ntuple(struct net_device *dev, * in the Compressed Filter Tuple. */ if (tp->vlan_shift >= 0 && l2t->vlan != VLAN_NONE) - ntuple |= (F_FT_VLAN_VLD | l2t->vlan) << tp->vlan_shift; + ntuple |= (u64)(F_FT_VLAN_VLD | l2t->vlan) << tp->vlan_shift; if (tp->port_shift >= 0) ntuple |= (u64)l2t->lport << tp->port_shift; From 1cc03eb93245e63b0b7a7832165efdc52e25b4e6 Mon Sep 17 00:00:00 2001 From: NeilBrown Date: Mon, 6 Jan 2014 13:19:42 +1100 Subject: [PATCH 15/55] md/raid5: Fix possible confusion when multiple write errors occur. commit 5d8c71f9e5fbdd95650be00294d238e27a363b5c md: raid5 crash during degradation Fixed a crash in an overly simplistic way which could leave R5_WriteError or R5_MadeGood set in the stripe cache for devices for which it is no longer relevant. When those devices are removed and spares added the flags are still set and can cause incorrect behaviour. commit 14a75d3e07c784c004b4b44b34af996b8e4ac453 md/raid5: preferentially read from replacement device if possible. Fixed the same bug if a more effective way, so we can now revert the original commit. Reported-and-tested-by: Alexander Lyakas Cc: stable@vger.kernel.org (3.2+ - 3.2 will need a different fix though) Fixes: 5d8c71f9e5fbdd95650be00294d238e27a363b5c Signed-off-by: NeilBrown --- drivers/md/raid5.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index cc055da02e2a..9168173deaf3 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -3608,7 +3608,7 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s) */ set_bit(R5_Insync, &dev->flags); - if (rdev && test_bit(R5_WriteError, &dev->flags)) { + if (test_bit(R5_WriteError, &dev->flags)) { /* This flag does not apply to '.replacement' * only to .rdev, so make sure to check that*/ struct md_rdev *rdev2 = rcu_dereference( @@ -3621,7 +3621,7 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s) } else clear_bit(R5_WriteError, &dev->flags); } - if (rdev && test_bit(R5_MadeGood, &dev->flags)) { + if (test_bit(R5_MadeGood, &dev->flags)) { /* This flag does not apply to '.replacement' * only to .rdev, so make sure to check that*/ struct md_rdev *rdev2 = rcu_dereference( From b50c259e25d9260b9108dc0c2964c26e5ecbe1c1 Mon Sep 17 00:00:00 2001 From: NeilBrown Date: Tue, 14 Jan 2014 10:38:09 +1100 Subject: [PATCH 16/55] md/raid10: fix two bugs in handling of known-bad-blocks. If we discover a bad block when reading we split the request and potentially read some of it from a different device. The code path of this has two bugs in RAID10. 1/ we get a spin_lock with _irq, but unlock without _irq!! 2/ The calculation of 'sectors_handled' is wrong, as can be clearly seen by comparison with raid1.c This leads to at least 2 warnings and a probable crash is a RAID10 ever had known bad blocks. Cc: stable@vger.kernel.org (v3.1+) Fixes: 856e08e23762dfb92ffc68fd0a8d228f9e152160 Reported-by: Damian Nowak URL: https://bugzilla.kernel.org/show_bug.cgi?id=68181 Signed-off-by: NeilBrown --- drivers/md/raid10.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c index c504e8389e69..65285211568f 100644 --- a/drivers/md/raid10.c +++ b/drivers/md/raid10.c @@ -1319,7 +1319,7 @@ static void make_request(struct mddev *mddev, struct bio * bio) /* Could not read all from this device, so we will * need another r10_bio. */ - sectors_handled = (r10_bio->sectors + max_sectors + sectors_handled = (r10_bio->sector + max_sectors - bio->bi_sector); r10_bio->sectors = max_sectors; spin_lock_irq(&conf->device_lock); @@ -1327,7 +1327,7 @@ static void make_request(struct mddev *mddev, struct bio * bio) bio->bi_phys_segments = 2; else bio->bi_phys_segments++; - spin_unlock(&conf->device_lock); + spin_unlock_irq(&conf->device_lock); /* Cannot call generic_make_request directly * as that will be queued in __generic_make_request * and subsequent mempool_alloc might block From 41a336e011887f73e7c879b60e1e3544045435cb Mon Sep 17 00:00:00 2001 From: NeilBrown Date: Tue, 14 Jan 2014 11:56:14 +1100 Subject: [PATCH 17/55] md/raid1: fix request counting bug in new 'barrier' code. The new iobarrier implementation in raid1 (which keeps normal writes and resync activity separate) counts every request what is not before the current resync point in either next_window_requests or current_window_requests. It flags that the request is counted by setting ->start_next_window. allow_barrier follows this model exactly and decrements one of the *_window_requests if and only if ->start_next_window is set. However wait_barrier(), which increments *_window_requests uses a slightly different test for setting -.start_next_window (which is set from the return value of this function). So there is a possibility of the counts getting out of sync, and this leads to the resync hanging. So change wait_barrier() to return a non-zero value in exactly the same cases that it increments *_window_requests. But was introduced in 3.13-rc1. Reported-by: Bruno Wolff III URL: https://bugzilla.kernel.org/show_bug.cgi?id=68061 Fixes: 79ef3a8aa1cb1523cc231c9a90a278333c21f761 Cc: majianpeng Signed-off-by: NeilBrown --- drivers/md/raid1.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index 1e5a540995e9..a49cfcc7a343 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -924,9 +924,8 @@ static sector_t wait_barrier(struct r1conf *conf, struct bio *bio) conf->next_window_requests++; else conf->current_window_requests++; - } - if (bio->bi_sector >= conf->start_next_window) sector = conf->start_next_window; + } } conf->nr_pending++; From 5af9bef72c074dbe946da8b74eabd79cd5a9ff19 Mon Sep 17 00:00:00 2001 From: NeilBrown Date: Tue, 14 Jan 2014 15:16:10 +1100 Subject: [PATCH 18/55] md/raid5: fix a recently broken BUG_ON(). commit 6d183de4077191d1201283a9035ce57a9b05254d md/raid5: fix newly-broken locking in get_active_stripe. simplified a BUG_ON, but removed too much so now it sometimes fires when it shouldn't. When the STRIPE_EXPANDING flag is set, the stripe_head might be on a special list while multiple stripe_heads are collected, or it might not be on any list, even a 'free' list when the refcount is zero. As long as STRIPE_EXPANDING is set, it will be found and added back to a list eventually. So both of the BUG_ONs which test for the ->lru being empty or not need to avoid the case where STRIPE_EXPANDING is set. The patch which broke this was marked for -stable, so this patch needs to be applied to any branch that received 6d183de4 Fixes: 6d183de4077191d1201283a9035ce57a9b05254d Cc: stable@vger.kernel.org (any release to which above was applied) Signed-off-by: NeilBrown --- drivers/md/raid5.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 9168173deaf3..cbb15716a5db 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -687,7 +687,8 @@ get_active_stripe(struct r5conf *conf, sector_t sector, } else { if (!test_bit(STRIPE_HANDLE, &sh->state)) atomic_inc(&conf->active_stripes); - BUG_ON(list_empty(&sh->lru)); + BUG_ON(list_empty(&sh->lru) && + !test_bit(STRIPE_EXPANDING, &sh->state)); list_del_init(&sh->lru); if (sh->group) { sh->group->stripes_cnt--; From e8b849158508565e0cd6bc80061124afc5879160 Mon Sep 17 00:00:00 2001 From: NeilBrown Date: Mon, 6 Jan 2014 10:35:34 +1100 Subject: [PATCH 19/55] md/raid10: fix bug when raid10 recovery fails to recover a block. commit e875ecea266a543e643b19e44cf472f1412708f9 md/raid10 record bad blocks as needed during recovery. added code to the "cannot recover this block" path to record a bad block rather than fail the whole recovery. Unfortunately this new case was placed *after* r10bio was freed rather than *before*, yet it still uses r10bio. This is will crash with a null dereference. So move the freeing of r10bio down where it is safe. Cc: stable@vger.kernel.org (v3.1+) Fixes: e875ecea266a543e643b19e44cf472f1412708f9 Reported-by: Damian Nowak URL: https://bugzilla.kernel.org/show_bug.cgi?id=68181 Signed-off-by: NeilBrown --- drivers/md/raid10.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c index 65285211568f..06eeb99ea6fc 100644 --- a/drivers/md/raid10.c +++ b/drivers/md/raid10.c @@ -3218,10 +3218,6 @@ static sector_t sync_request(struct mddev *mddev, sector_t sector_nr, if (j == conf->copies) { /* Cannot recover, so abort the recovery or * record a bad block */ - put_buf(r10_bio); - if (rb2) - atomic_dec(&rb2->remaining); - r10_bio = rb2; if (any_working) { /* problem is that there are bad blocks * on other device(s) @@ -3253,6 +3249,10 @@ static sector_t sync_request(struct mddev *mddev, sector_t sector_nr, mirror->recovery_disabled = mddev->recovery_disabled; } + put_buf(r10_bio); + if (rb2) + atomic_dec(&rb2->remaining); + r10_bio = rb2; break; } } From 8313b8e57f55b15e5b7f7fc5d1630bbf686a9a97 Mon Sep 17 00:00:00 2001 From: NeilBrown Date: Thu, 12 Dec 2013 10:13:33 +1100 Subject: [PATCH 20/55] md: fix problem when adding device to read-only array with bitmap. If an array is started degraded, and then the missing device is found it can be re-added and a minimal bitmap-based recovery will bring it fully up-to-date. If the array is read-only a recovery would not be allowed. But also if the array is read-only and the missing device was present very recently, then there could be no need for any recovery at all, so we simply include the device in the read-only array without any recovery. However... if the missing device was removed a little longer ago it could be missing some updates, but if a bitmap is present it will be conditionally accepted pending a bitmap-based update. We don't currently detect this case properly and will include that old device into the read-only array with no recovery even though it really needs a recovery. This patch keeps track of whether a bitmap-based-recovery is really needed or not in the new Bitmap_sync rdev flag. If that is set, then the device will not be added to a read-only array. Cc: Andrei Warkentin Fixes: d70ed2e4fafdbef0800e73942482bb075c21578b Cc: stable@vger.kernel.org (3.2+) Signed-off-by: NeilBrown --- drivers/md/md.c | 18 +++++++++++++++--- drivers/md/md.h | 3 +++ 2 files changed, 18 insertions(+), 3 deletions(-) diff --git a/drivers/md/md.c b/drivers/md/md.c index e60cebf3f519..2a456a5d59a8 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -1087,6 +1087,7 @@ static int super_90_validate(struct mddev *mddev, struct md_rdev *rdev) rdev->raid_disk = -1; clear_bit(Faulty, &rdev->flags); clear_bit(In_sync, &rdev->flags); + clear_bit(Bitmap_sync, &rdev->flags); clear_bit(WriteMostly, &rdev->flags); if (mddev->raid_disks == 0) { @@ -1165,6 +1166,8 @@ static int super_90_validate(struct mddev *mddev, struct md_rdev *rdev) */ if (ev1 < mddev->bitmap->events_cleared) return 0; + if (ev1 < mddev->events) + set_bit(Bitmap_sync, &rdev->flags); } else { if (ev1 < mddev->events) /* just a hot-add of a new device, leave raid_disk at -1 */ @@ -1573,6 +1576,7 @@ static int super_1_validate(struct mddev *mddev, struct md_rdev *rdev) rdev->raid_disk = -1; clear_bit(Faulty, &rdev->flags); clear_bit(In_sync, &rdev->flags); + clear_bit(Bitmap_sync, &rdev->flags); clear_bit(WriteMostly, &rdev->flags); if (mddev->raid_disks == 0) { @@ -1655,6 +1659,8 @@ static int super_1_validate(struct mddev *mddev, struct md_rdev *rdev) */ if (ev1 < mddev->bitmap->events_cleared) return 0; + if (ev1 < mddev->events) + set_bit(Bitmap_sync, &rdev->flags); } else { if (ev1 < mddev->events) /* just a hot-add of a new device, leave raid_disk at -1 */ @@ -2798,6 +2804,7 @@ slot_store(struct md_rdev *rdev, const char *buf, size_t len) else rdev->saved_raid_disk = -1; clear_bit(In_sync, &rdev->flags); + clear_bit(Bitmap_sync, &rdev->flags); err = rdev->mddev->pers-> hot_add_disk(rdev->mddev, rdev); if (err) { @@ -5770,6 +5777,7 @@ static int add_new_disk(struct mddev * mddev, mdu_disk_info_t *info) info->raid_disk < mddev->raid_disks) { rdev->raid_disk = info->raid_disk; set_bit(In_sync, &rdev->flags); + clear_bit(Bitmap_sync, &rdev->flags); } else rdev->raid_disk = -1; } else @@ -7716,7 +7724,8 @@ static int remove_and_add_spares(struct mddev *mddev, if (test_bit(Faulty, &rdev->flags)) continue; if (mddev->ro && - rdev->saved_raid_disk < 0) + ! (rdev->saved_raid_disk >= 0 && + !test_bit(Bitmap_sync, &rdev->flags))) continue; rdev->recovery_offset = 0; @@ -7797,9 +7806,12 @@ void md_check_recovery(struct mddev *mddev) * As we only add devices that are already in-sync, * we can activate the spares immediately. */ - clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery); remove_and_add_spares(mddev, NULL); - mddev->pers->spare_active(mddev); + /* There is no thread, but we need to call + * ->spare_active and clear saved_raid_disk + */ + md_reap_sync_thread(mddev); + clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery); goto unlock; } diff --git a/drivers/md/md.h b/drivers/md/md.h index 2f5cc8a7ef3e..0095ec84ffc7 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -129,6 +129,9 @@ struct md_rdev { enum flag_bits { Faulty, /* device is known to have a fault */ In_sync, /* device is in_sync with rest of array */ + Bitmap_sync, /* ..actually, not quite In_sync. Need a + * bitmap-based recovery to get fully in sync + */ Unmerged, /* device is being added to array and should * be considerred for bvec_merge_fn but not * yet for actual IO From 70315d22d3c7383f9a508d0aab21e2eb35b2303a Mon Sep 17 00:00:00 2001 From: Neal Cardwell Date: Fri, 10 Jan 2014 15:34:45 -0500 Subject: [PATCH 21/55] inet_diag: fix inet_diag_dump_icsk() to use correct state for timewait sockets Fix inet_diag_dump_icsk() to reflect the fact that both TCP_TIME_WAIT and TCP_FIN_WAIT2 connections are represented by inet_timewait_sock (not just TIME_WAIT), and for such sockets the tw_substate field holds the real state, which can be either TCP_TIME_WAIT or TCP_FIN_WAIT2. This brings the inet_diag state-matching code in line with the field it uses to populate idiag_state. This is also analogous to the info exported in /proc/net/tcp, where get_tcp4_sock() exports sk->sk_state and get_timewait4_sock() exports tw->tw_substate. Before fixing this, (a) neither "ss -nemoi" nor "ss -nemoi state fin-wait-2" would return a socket in TCP_FIN_WAIT2; and (b) "ss -nemoi state time-wait" would also return sockets in state TCP_FIN_WAIT2. This is an old bug that predates 05dbc7b ("tcp/dccp: remove twchain"). Signed-off-by: Neal Cardwell Cc: Eric Dumazet Acked-by: Eric Dumazet Signed-off-by: David S. Miller --- net/ipv4/inet_diag.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c index a0f52dac8940..e34dccbc4d70 100644 --- a/net/ipv4/inet_diag.c +++ b/net/ipv4/inet_diag.c @@ -930,12 +930,15 @@ void inet_diag_dump_icsk(struct inet_hashinfo *hashinfo, struct sk_buff *skb, spin_lock_bh(lock); sk_nulls_for_each(sk, node, &head->chain) { int res; + int state; if (!net_eq(sock_net(sk), net)) continue; if (num < s_num) goto next_normal; - if (!(r->idiag_states & (1 << sk->sk_state))) + state = (sk->sk_state == TCP_TIME_WAIT) ? + inet_twsk(sk)->tw_substate : sk->sk_state; + if (!(r->idiag_states & (1 << state))) goto next_normal; if (r->sdiag_family != AF_UNSPEC && sk->sk_family != r->sdiag_family) From fdc3452cd2c7b2bfe0f378f92123f4f9a98fa2bd Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Bj=C3=B8rn=20Mork?= Date: Fri, 10 Jan 2014 23:10:17 +0100 Subject: [PATCH 22/55] net: usbnet: fix SG initialisation MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Commit 60e453a940ac ("USBNET: fix handling padding packet") added an extra SG entry in case padding is necessary, but failed to update the initialisation of the list. This can cause list traversal to fall off the end of the list, resulting in an oops. Fixes: 60e453a940ac ("USBNET: fix handling padding packet") Reported-by: Thomas Kear Cc: Ming Lei Signed-off-by: Bjørn Mork Tested-by: Ming Lei Signed-off-by: David S. Miller --- drivers/net/usb/usbnet.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c index 8494bb53ebdc..aba04f561760 100644 --- a/drivers/net/usb/usbnet.c +++ b/drivers/net/usb/usbnet.c @@ -1245,7 +1245,7 @@ static int build_dma_sg(const struct sk_buff *skb, struct urb *urb) return -ENOMEM; urb->num_sgs = num_sgs; - sg_init_table(urb->sg, urb->num_sgs); + sg_init_table(urb->sg, urb->num_sgs + 1); sg_set_buf(&urb->sg[s++], skb->data, skb_headlen(skb)); total_len += skb_headlen(skb); From 2fac2b891f287691c27ee8d2eeecf39571b27fea Mon Sep 17 00:00:00 2001 From: Stephen Warren Date: Mon, 13 Jan 2014 14:29:04 -0700 Subject: [PATCH 23/55] i2c: Re-instate body of i2c_parent_is_i2c_adapter() The body of i2c_parent_is_i2c_adapter() is currently guarded by I2C_MUX. It should be CONFIG_I2C_MUX instead. Among potentially other problems, this resulted in i2c_lock_adapter() only locking I2C mux child adapters, and not the parent adapter. In turn, this could allow inter-mingling of mux child selection and I2C transactions, which could result in I2C transactions being directed to the wrong I2C bus, and possibly even switching between busses in the middle of a transaction. One concrete issue caused by this bug was corrupted HDMI EDID reads during boot on the NVIDIA Tegra Seaboard system, although this only became apparent in recent linux-next, when the boot timing was changed just enough to trigger the race condition. Fixes: 3923172b3d70 ("i2c: reduce parent checking to a NOOP in non-I2C_MUX case") Cc: Phil Carmody Cc: Signed-off-by: Stephen Warren Signed-off-by: Wolfram Sang --- include/linux/i2c.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/i2c.h b/include/linux/i2c.h index eff50e062be8..d9c8dbd3373f 100644 --- a/include/linux/i2c.h +++ b/include/linux/i2c.h @@ -445,7 +445,7 @@ static inline void i2c_set_adapdata(struct i2c_adapter *dev, void *data) static inline struct i2c_adapter * i2c_parent_is_i2c_adapter(const struct i2c_adapter *adapter) { -#if IS_ENABLED(I2C_MUX) +#if IS_ENABLED(CONFIG_I2C_MUX) struct device *parent = adapter->dev.parent; if (parent != NULL && parent->type == &i2c_adapter_type) From 3f9aec7610b39521c7c69d754de7265f6994c194 Mon Sep 17 00:00:00 2001 From: Jean Delvare Date: Tue, 14 Jan 2014 15:59:55 +0100 Subject: [PATCH 24/55] hwmon: (coretemp) Fix truncated name of alarm attributes When the core number exceeds 9, the size of the buffer storing the alarm attribute name is insufficient and the attribute name is truncated. This causes libsensors to skip these attributes as the truncated name is not recognized. Reported-by: Andreas Hollmann Signed-off-by: Jean Delvare Cc: stable@vger.kernel.org Signed-off-by: Guenter Roeck --- drivers/hwmon/coretemp.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/hwmon/coretemp.c b/drivers/hwmon/coretemp.c index 78be66176840..942509892895 100644 --- a/drivers/hwmon/coretemp.c +++ b/drivers/hwmon/coretemp.c @@ -52,7 +52,7 @@ MODULE_PARM_DESC(tjmax, "TjMax value in degrees Celsius"); #define BASE_SYSFS_ATTR_NO 2 /* Sysfs Base attr no for coretemp */ #define NUM_REAL_CORES 32 /* Number of Real cores per cpu */ -#define CORETEMP_NAME_LENGTH 17 /* String Length of attrs */ +#define CORETEMP_NAME_LENGTH 19 /* String Length of attrs */ #define MAX_CORE_ATTRS 4 /* Maximum no of basic attrs */ #define TOTAL_ATTRS (MAX_CORE_ATTRS + 1) #define MAX_CORE_DATA (NUM_REAL_CORES + BASE_SYSFS_ATTR_NO) From 267d29a69c6af39445f36102a832b25ed483f299 Mon Sep 17 00:00:00 2001 From: Christian Engelmayer Date: Sat, 11 Jan 2014 22:19:30 +0100 Subject: [PATCH 25/55] ieee802154: Fix memory leak in ieee802154_add_iface() Fix a memory leak in the ieee802154_add_iface() error handling path. Detected by Coverity: CID 710490. Signed-off-by: Christian Engelmayer Signed-off-by: David S. Miller --- net/ieee802154/nl-phy.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/net/ieee802154/nl-phy.c b/net/ieee802154/nl-phy.c index d08c7a43dcd1..89b265aea151 100644 --- a/net/ieee802154/nl-phy.c +++ b/net/ieee802154/nl-phy.c @@ -221,8 +221,10 @@ int ieee802154_add_iface(struct sk_buff *skb, struct genl_info *info) if (info->attrs[IEEE802154_ATTR_DEV_TYPE]) { type = nla_get_u8(info->attrs[IEEE802154_ATTR_DEV_TYPE]); - if (type >= __IEEE802154_DEV_MAX) - return -EINVAL; + if (type >= __IEEE802154_DEV_MAX) { + rc = -EINVAL; + goto nla_put_failure; + } } dev = phy->add_iface(phy, devname, type); From a02bbb1ccfe81d3e859ab2ddde6a2eff9c8b39fd Mon Sep 17 00:00:00 2001 From: "Michael S. Tsirkin" Date: Sun, 12 Jan 2014 13:37:56 +0200 Subject: [PATCH 26/55] MAINTAINERS: add virtio-dev ML for virtio Since virtio is an OASIS standard draft now, virtio implementation discussions are taking place on the virtio-dev OASIS mailing list. Update MAINTAINERS. Signed-off-by: Michael S. Tsirkin Signed-off-by: David S. Miller --- MAINTAINERS | 3 +++ 1 file changed, 3 insertions(+) diff --git a/MAINTAINERS b/MAINTAINERS index 31a046213274..6a6e4ac72287 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -9231,6 +9231,7 @@ F: include/media/videobuf2-* VIRTIO CONSOLE DRIVER M: Amit Shah +L: virtio-dev@lists.oasis-open.org L: virtualization@lists.linux-foundation.org S: Maintained F: drivers/char/virtio_console.c @@ -9240,6 +9241,7 @@ F: include/uapi/linux/virtio_console.h VIRTIO CORE, NET AND BLOCK DRIVERS M: Rusty Russell M: "Michael S. Tsirkin" +L: virtio-dev@lists.oasis-open.org L: virtualization@lists.linux-foundation.org S: Maintained F: drivers/virtio/ @@ -9252,6 +9254,7 @@ F: include/uapi/linux/virtio_*.h VIRTIO HOST (VHOST) M: "Michael S. Tsirkin" L: kvm@vger.kernel.org +L: virtio-dev@lists.oasis-open.org L: virtualization@lists.linux-foundation.org L: netdev@vger.kernel.org S: Maintained From 7c4b5175f65f31e0cd9867a6ddc3171007dfc110 Mon Sep 17 00:00:00 2001 From: Peter Korsgaard Date: Sun, 12 Jan 2014 23:15:51 +0100 Subject: [PATCH 27/55] dm9601: add USB IDs for new dm96xx variants A number of new dm96xx variants now exist. Reported-by: Joseph Chang Signed-off-by: Peter Korsgaard Signed-off-by: David S. Miller --- drivers/net/usb/dm9601.c | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/drivers/net/usb/dm9601.c b/drivers/net/usb/dm9601.c index 14aa48fa8d7e..e80219877730 100644 --- a/drivers/net/usb/dm9601.c +++ b/drivers/net/usb/dm9601.c @@ -614,6 +614,18 @@ static const struct usb_device_id products[] = { USB_DEVICE(0x0a46, 0x9621), /* DM9621A USB to Fast Ethernet Adapter */ .driver_info = (unsigned long)&dm9601_info, }, + { + USB_DEVICE(0x0a46, 0x9622), /* DM9622 USB to Fast Ethernet Adapter */ + .driver_info = (unsigned long)&dm9601_info, + }, + { + USB_DEVICE(0x0a46, 0x0269), /* DM962OA USB to Fast Ethernet Adapter */ + .driver_info = (unsigned long)&dm9601_info, + }, + { + USB_DEVICE(0x0a46, 0x1269), /* DM9621A USB to Fast Ethernet Adapter */ + .driver_info = (unsigned long)&dm9601_info, + }, {}, // END }; From 95f4a45de1a0f172b35451fc52283290adb21f6e Mon Sep 17 00:00:00 2001 From: Hannes Frederic Sowa Date: Mon, 13 Jan 2014 02:45:22 +0100 Subject: [PATCH 28/55] net: avoid reference counter overflows on fib_rules in multicast forwarding Bob Falken reported that after 4G packets, multicast forwarding stopped working. This was because of a rule reference counter overflow which freed the rule as soon as the overflow happend. This patch solves this by adding the FIB_LOOKUP_NOREF flag to fib_rules_lookup calls. This is safe even from non-rcu locked sections as in this case the flag only implies not taking a reference to the rule, which we don't need at all. Rules only hold references to the namespace, which are guaranteed to be available during the call of the non-rcu protected function reg_vif_xmit because of the interface reference which itself holds a reference to the net namespace. Fixes: f0ad0860d01e47 ("ipv4: ipmr: support multiple tables") Fixes: d1db275dd3f6e4 ("ipv6: ip6mr: support multiple tables") Reported-by: Bob Falken Cc: Patrick McHardy Cc: Thomas Graf Cc: Julian Anastasov Cc: Eric Dumazet Signed-off-by: Hannes Frederic Sowa Acked-by: Eric Dumazet Signed-off-by: David S. Miller --- net/ipv4/ipmr.c | 7 +++++-- net/ipv6/ip6mr.c | 7 +++++-- 2 files changed, 10 insertions(+), 4 deletions(-) diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c index 62212c772a4b..1672409f5ba5 100644 --- a/net/ipv4/ipmr.c +++ b/net/ipv4/ipmr.c @@ -157,9 +157,12 @@ static struct mr_table *ipmr_get_table(struct net *net, u32 id) static int ipmr_fib_lookup(struct net *net, struct flowi4 *flp4, struct mr_table **mrt) { - struct ipmr_result res; - struct fib_lookup_arg arg = { .result = &res, }; int err; + struct ipmr_result res; + struct fib_lookup_arg arg = { + .result = &res, + .flags = FIB_LOOKUP_NOREF, + }; err = fib_rules_lookup(net->ipv4.mr_rules_ops, flowi4_to_flowi(flp4), 0, &arg); diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c index f365310bfcca..0eb4038a4d63 100644 --- a/net/ipv6/ip6mr.c +++ b/net/ipv6/ip6mr.c @@ -141,9 +141,12 @@ static struct mr6_table *ip6mr_get_table(struct net *net, u32 id) static int ip6mr_fib_lookup(struct net *net, struct flowi6 *flp6, struct mr6_table **mrt) { - struct ip6mr_result res; - struct fib_lookup_arg arg = { .result = &res, }; int err; + struct ip6mr_result res; + struct fib_lookup_arg arg = { + .result = &res, + .flags = FIB_LOOKUP_NOREF, + }; err = fib_rules_lookup(net->ipv6.mr6_rules_ops, flowi6_to_flowi(flp6), 0, &arg); From 51bb352f15595f2dee42b599680809de3d08999d Mon Sep 17 00:00:00 2001 From: Jitendra Kalsaria Date: Tue, 14 Jan 2014 13:57:25 -0500 Subject: [PATCH 29/55] qlge: Fix vlan netdev features. vlan gets the same netdev features except vlan filter. Signed-off-by: Jitendra Kalsaria Signed-off-by: David S. Miller --- drivers/net/ethernet/qlogic/qlge/qlge_main.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/net/ethernet/qlogic/qlge/qlge_main.c b/drivers/net/ethernet/qlogic/qlge/qlge_main.c index 449f506d2e8f..f705aeeba767 100644 --- a/drivers/net/ethernet/qlogic/qlge/qlge_main.c +++ b/drivers/net/ethernet/qlogic/qlge/qlge_main.c @@ -4765,6 +4765,8 @@ static int qlge_probe(struct pci_dev *pdev, NETIF_F_RXCSUM; ndev->features = ndev->hw_features; ndev->vlan_features = ndev->hw_features; + /* vlan gets same features (except vlan filter) */ + ndev->vlan_features &= ~NETIF_F_HW_VLAN_CTAG_FILTER; if (test_bit(QL_DMA64, &qdev->flags)) ndev->features |= NETIF_F_HIGHDMA; From 70f2fe3a26248724d8a5019681a869abdaf3e89a Mon Sep 17 00:00:00 2001 From: Andreas Rohner Date: Tue, 14 Jan 2014 17:56:36 -0800 Subject: [PATCH 30/55] nilfs2: fix segctor bug that causes file system corruption There is a bug in the function nilfs_segctor_collect, which results in active data being written to a segment, that is marked as clean. It is possible, that this segment is selected for a later segment construction, whereby the old data is overwritten. The problem shows itself with the following kernel log message: nilfs_sufile_do_cancel_free: segment 6533 must be clean Usually a few hours later the file system gets corrupted: NILFS: bad btree node (blocknr=8748107): level = 0, flags = 0x0, nchildren = 0 NILFS error (device sdc1): nilfs_bmap_last_key: broken bmap (inode number=114660) The issue can be reproduced with a file system that is nearly full and with the cleaner running, while some IO intensive task is running. Although it is quite hard to reproduce. This is what happens: 1. The cleaner starts the segment construction 2. nilfs_segctor_collect is called 3. sc_stage is on NILFS_ST_SUFILE and segments are freed 4. sc_stage is on NILFS_ST_DAT current segment is full 5. nilfs_segctor_extend_segments is called, which allocates a new segment 6. The new segment is one of the segments freed in step 3 7. nilfs_sufile_cancel_freev is called and produces an error message 8. Loop around and the collection starts again 9. sc_stage is on NILFS_ST_SUFILE and segments are freed including the newly allocated segment, which will contain active data and can be allocated at a later time 10. A few hours later another segment construction allocates the segment and causes file system corruption This can be prevented by simply reordering the statements. If nilfs_sufile_cancel_freev is called before nilfs_segctor_extend_segments the freed segments are marked as dirty and cannot be allocated any more. Signed-off-by: Andreas Rohner Reviewed-by: Ryusuke Konishi Tested-by: Andreas Rohner Signed-off-by: Ryusuke Konishi Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/nilfs2/segment.c | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c index 9f6b486b6c01..a1a191634abc 100644 --- a/fs/nilfs2/segment.c +++ b/fs/nilfs2/segment.c @@ -1440,17 +1440,19 @@ static int nilfs_segctor_collect(struct nilfs_sc_info *sci, nilfs_clear_logs(&sci->sc_segbufs); - err = nilfs_segctor_extend_segments(sci, nilfs, nadd); - if (unlikely(err)) - return err; - if (sci->sc_stage.flags & NILFS_CF_SUFREED) { err = nilfs_sufile_cancel_freev(nilfs->ns_sufile, sci->sc_freesegs, sci->sc_nfreesegs, NULL); WARN_ON(err); /* do not happen */ + sci->sc_stage.flags &= ~NILFS_CF_SUFREED; } + + err = nilfs_segctor_extend_segments(sci, nilfs, nadd); + if (unlikely(err)) + return err; + nadd = min_t(int, nadd << 1, SC_MAX_SEGDELTA); sci->sc_stage = prev_stage; } From bad009fe354a00e6b2bf87328995ec76e59ab970 Mon Sep 17 00:00:00 2001 From: Huacai Chen Date: Tue, 14 Jan 2014 17:56:37 -0800 Subject: [PATCH 31/55] MIPS: fix case mismatch in local_r4k_flush_icache_range() Currently, Loongson-2 call protected_blast_icache_range() and others call protected_loongson23_blast_icache_range(), but I think the correct behavior should be the opposite. BTW, Loongson-3's cache-ops is compatible with MIPS64, but not compatible with Loongson-2. So, rename xxx_loongson23_yyy things to xxx_loongson2_yyy. The patch fixes early boot hang with 3.13-rc1, introduced in commit 14bd8c082016 ("MIPS: Loongson: Get rid of Loongson 2 #ifdefery all over arch/mips"). Signed-off-by: Huacai Chen Signed-off-by: Aaro Koskinen Reviewed-by: Aurelien Jarno Acked-by: John Crispin Cc: Ralf Baechle Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/mips/include/asm/cacheops.h | 2 +- arch/mips/include/asm/r4kcache.h | 8 ++++---- arch/mips/mm/c-r4k.c | 4 ++-- 3 files changed, 7 insertions(+), 7 deletions(-) diff --git a/arch/mips/include/asm/cacheops.h b/arch/mips/include/asm/cacheops.h index c75025f27c20..06b9bc7ea14b 100644 --- a/arch/mips/include/asm/cacheops.h +++ b/arch/mips/include/asm/cacheops.h @@ -83,6 +83,6 @@ /* * Loongson2-specific cacheops */ -#define Hit_Invalidate_I_Loongson23 0x00 +#define Hit_Invalidate_I_Loongson2 0x00 #endif /* __ASM_CACHEOPS_H */ diff --git a/arch/mips/include/asm/r4kcache.h b/arch/mips/include/asm/r4kcache.h index 34d1a1917125..91d20b08246f 100644 --- a/arch/mips/include/asm/r4kcache.h +++ b/arch/mips/include/asm/r4kcache.h @@ -165,7 +165,7 @@ static inline void flush_icache_line(unsigned long addr) __iflush_prologue switch (boot_cpu_type()) { case CPU_LOONGSON2: - cache_op(Hit_Invalidate_I_Loongson23, addr); + cache_op(Hit_Invalidate_I_Loongson2, addr); break; default: @@ -219,7 +219,7 @@ static inline void protected_flush_icache_line(unsigned long addr) { switch (boot_cpu_type()) { case CPU_LOONGSON2: - protected_cache_op(Hit_Invalidate_I_Loongson23, addr); + protected_cache_op(Hit_Invalidate_I_Loongson2, addr); break; default: @@ -452,8 +452,8 @@ static inline void prot##extra##blast_##pfx##cache##_range(unsigned long start, __BUILD_BLAST_CACHE_RANGE(d, dcache, Hit_Writeback_Inv_D, protected_, ) __BUILD_BLAST_CACHE_RANGE(s, scache, Hit_Writeback_Inv_SD, protected_, ) __BUILD_BLAST_CACHE_RANGE(i, icache, Hit_Invalidate_I, protected_, ) -__BUILD_BLAST_CACHE_RANGE(i, icache, Hit_Invalidate_I_Loongson23, \ - protected_, loongson23_) +__BUILD_BLAST_CACHE_RANGE(i, icache, Hit_Invalidate_I_Loongson2, \ + protected_, loongson2_) __BUILD_BLAST_CACHE_RANGE(d, dcache, Hit_Writeback_Inv_D, , ) __BUILD_BLAST_CACHE_RANGE(s, scache, Hit_Writeback_Inv_SD, , ) /* blast_inv_dcache_range */ diff --git a/arch/mips/mm/c-r4k.c b/arch/mips/mm/c-r4k.c index 62ffd20ea869..73f02da61baf 100644 --- a/arch/mips/mm/c-r4k.c +++ b/arch/mips/mm/c-r4k.c @@ -580,11 +580,11 @@ static inline void local_r4k_flush_icache_range(unsigned long start, unsigned lo else { switch (boot_cpu_type()) { case CPU_LOONGSON2: - protected_blast_icache_range(start, end); + protected_loongson2_blast_icache_range(start, end); break; default: - protected_loongson23_blast_icache_range(start, end); + protected_blast_icache_range(start, end); break; } } From 43a06847b9d277e9f2c3bf8052b44b74e17526c7 Mon Sep 17 00:00:00 2001 From: Aaro Koskinen Date: Tue, 14 Jan 2014 17:56:38 -0800 Subject: [PATCH 32/55] MIPS: fix blast_icache32 on loongson2 Commit 14bd8c082016 ("MIPS: Loongson: Get rid of Loongson 2 #ifdefery all over arch/mips") failed to add Loongson2 specific blast_icache32 functions. Fix that. The patch fixes the following crash seen with 3.13-rc1: Reserved instruction in kernel code[#1]: [...] Call Trace: blast_icache32_page+0x8/0xb0 r4k_flush_cache_page+0x19c/0x200 do_wp_page.isra.97+0x47c/0xe08 handle_mm_fault+0x938/0x1118 __do_page_fault+0x140/0x540 resume_userspace_check+0x0/0x10 Code: 00200825 64834000 00200825 bc900020 bc900040 bc900060 bc900080 bc9000a0 Signed-off-by: Aaro Koskinen Reviewed-by: Aurelien Jarno Acked-by: John Crispin Cc: Ralf Baechle Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/mips/include/asm/r4kcache.h | 41 ++++++++++++++++---------------- arch/mips/mm/c-r4k.c | 7 ++++++ 2 files changed, 28 insertions(+), 20 deletions(-) diff --git a/arch/mips/include/asm/r4kcache.h b/arch/mips/include/asm/r4kcache.h index 91d20b08246f..c84caddb8bde 100644 --- a/arch/mips/include/asm/r4kcache.h +++ b/arch/mips/include/asm/r4kcache.h @@ -357,8 +357,8 @@ static inline void invalidate_tcache_page(unsigned long addr) "i" (op)); /* build blast_xxx, blast_xxx_page, blast_xxx_page_indexed */ -#define __BUILD_BLAST_CACHE(pfx, desc, indexop, hitop, lsize) \ -static inline void blast_##pfx##cache##lsize(void) \ +#define __BUILD_BLAST_CACHE(pfx, desc, indexop, hitop, lsize, extra) \ +static inline void extra##blast_##pfx##cache##lsize(void) \ { \ unsigned long start = INDEX_BASE; \ unsigned long end = start + current_cpu_data.desc.waysize; \ @@ -376,7 +376,7 @@ static inline void blast_##pfx##cache##lsize(void) \ __##pfx##flush_epilogue \ } \ \ -static inline void blast_##pfx##cache##lsize##_page(unsigned long page) \ +static inline void extra##blast_##pfx##cache##lsize##_page(unsigned long page) \ { \ unsigned long start = page; \ unsigned long end = page + PAGE_SIZE; \ @@ -391,7 +391,7 @@ static inline void blast_##pfx##cache##lsize##_page(unsigned long page) \ __##pfx##flush_epilogue \ } \ \ -static inline void blast_##pfx##cache##lsize##_page_indexed(unsigned long page) \ +static inline void extra##blast_##pfx##cache##lsize##_page_indexed(unsigned long page) \ { \ unsigned long indexmask = current_cpu_data.desc.waysize - 1; \ unsigned long start = INDEX_BASE + (page & indexmask); \ @@ -410,23 +410,24 @@ static inline void blast_##pfx##cache##lsize##_page_indexed(unsigned long page) __##pfx##flush_epilogue \ } -__BUILD_BLAST_CACHE(d, dcache, Index_Writeback_Inv_D, Hit_Writeback_Inv_D, 16) -__BUILD_BLAST_CACHE(i, icache, Index_Invalidate_I, Hit_Invalidate_I, 16) -__BUILD_BLAST_CACHE(s, scache, Index_Writeback_Inv_SD, Hit_Writeback_Inv_SD, 16) -__BUILD_BLAST_CACHE(d, dcache, Index_Writeback_Inv_D, Hit_Writeback_Inv_D, 32) -__BUILD_BLAST_CACHE(i, icache, Index_Invalidate_I, Hit_Invalidate_I, 32) -__BUILD_BLAST_CACHE(s, scache, Index_Writeback_Inv_SD, Hit_Writeback_Inv_SD, 32) -__BUILD_BLAST_CACHE(d, dcache, Index_Writeback_Inv_D, Hit_Writeback_Inv_D, 64) -__BUILD_BLAST_CACHE(i, icache, Index_Invalidate_I, Hit_Invalidate_I, 64) -__BUILD_BLAST_CACHE(s, scache, Index_Writeback_Inv_SD, Hit_Writeback_Inv_SD, 64) -__BUILD_BLAST_CACHE(s, scache, Index_Writeback_Inv_SD, Hit_Writeback_Inv_SD, 128) +__BUILD_BLAST_CACHE(d, dcache, Index_Writeback_Inv_D, Hit_Writeback_Inv_D, 16, ) +__BUILD_BLAST_CACHE(i, icache, Index_Invalidate_I, Hit_Invalidate_I, 16, ) +__BUILD_BLAST_CACHE(s, scache, Index_Writeback_Inv_SD, Hit_Writeback_Inv_SD, 16, ) +__BUILD_BLAST_CACHE(d, dcache, Index_Writeback_Inv_D, Hit_Writeback_Inv_D, 32, ) +__BUILD_BLAST_CACHE(i, icache, Index_Invalidate_I, Hit_Invalidate_I, 32, ) +__BUILD_BLAST_CACHE(i, icache, Index_Invalidate_I, Hit_Invalidate_I_Loongson2, 32, loongson2_) +__BUILD_BLAST_CACHE(s, scache, Index_Writeback_Inv_SD, Hit_Writeback_Inv_SD, 32, ) +__BUILD_BLAST_CACHE(d, dcache, Index_Writeback_Inv_D, Hit_Writeback_Inv_D, 64, ) +__BUILD_BLAST_CACHE(i, icache, Index_Invalidate_I, Hit_Invalidate_I, 64, ) +__BUILD_BLAST_CACHE(s, scache, Index_Writeback_Inv_SD, Hit_Writeback_Inv_SD, 64, ) +__BUILD_BLAST_CACHE(s, scache, Index_Writeback_Inv_SD, Hit_Writeback_Inv_SD, 128, ) -__BUILD_BLAST_CACHE(inv_d, dcache, Index_Writeback_Inv_D, Hit_Invalidate_D, 16) -__BUILD_BLAST_CACHE(inv_d, dcache, Index_Writeback_Inv_D, Hit_Invalidate_D, 32) -__BUILD_BLAST_CACHE(inv_s, scache, Index_Writeback_Inv_SD, Hit_Invalidate_SD, 16) -__BUILD_BLAST_CACHE(inv_s, scache, Index_Writeback_Inv_SD, Hit_Invalidate_SD, 32) -__BUILD_BLAST_CACHE(inv_s, scache, Index_Writeback_Inv_SD, Hit_Invalidate_SD, 64) -__BUILD_BLAST_CACHE(inv_s, scache, Index_Writeback_Inv_SD, Hit_Invalidate_SD, 128) +__BUILD_BLAST_CACHE(inv_d, dcache, Index_Writeback_Inv_D, Hit_Invalidate_D, 16, ) +__BUILD_BLAST_CACHE(inv_d, dcache, Index_Writeback_Inv_D, Hit_Invalidate_D, 32, ) +__BUILD_BLAST_CACHE(inv_s, scache, Index_Writeback_Inv_SD, Hit_Invalidate_SD, 16, ) +__BUILD_BLAST_CACHE(inv_s, scache, Index_Writeback_Inv_SD, Hit_Invalidate_SD, 32, ) +__BUILD_BLAST_CACHE(inv_s, scache, Index_Writeback_Inv_SD, Hit_Invalidate_SD, 64, ) +__BUILD_BLAST_CACHE(inv_s, scache, Index_Writeback_Inv_SD, Hit_Invalidate_SD, 128, ) /* build blast_xxx_range, protected_blast_xxx_range */ #define __BUILD_BLAST_CACHE_RANGE(pfx, desc, hitop, prot, extra) \ diff --git a/arch/mips/mm/c-r4k.c b/arch/mips/mm/c-r4k.c index 73f02da61baf..49e572d879e1 100644 --- a/arch/mips/mm/c-r4k.c +++ b/arch/mips/mm/c-r4k.c @@ -237,6 +237,8 @@ static void r4k_blast_icache_page_setup(void) r4k_blast_icache_page = (void *)cache_noop; else if (ic_lsize == 16) r4k_blast_icache_page = blast_icache16_page; + else if (ic_lsize == 32 && current_cpu_type() == CPU_LOONGSON2) + r4k_blast_icache_page = loongson2_blast_icache32_page; else if (ic_lsize == 32) r4k_blast_icache_page = blast_icache32_page; else if (ic_lsize == 64) @@ -261,6 +263,9 @@ static void r4k_blast_icache_page_indexed_setup(void) else if (TX49XX_ICACHE_INDEX_INV_WAR) r4k_blast_icache_page_indexed = tx49_blast_icache32_page_indexed; + else if (current_cpu_type() == CPU_LOONGSON2) + r4k_blast_icache_page_indexed = + loongson2_blast_icache32_page_indexed; else r4k_blast_icache_page_indexed = blast_icache32_page_indexed; @@ -284,6 +289,8 @@ static void r4k_blast_icache_setup(void) r4k_blast_icache = blast_r4600_v1_icache32; else if (TX49XX_ICACHE_INDEX_INV_WAR) r4k_blast_icache = tx49_blast_icache32; + else if (current_cpu_type() == CPU_LOONGSON2) + r4k_blast_icache = loongson2_blast_icache32; else r4k_blast_icache = blast_icache32; } else if (ic_lsize == 64) From 03e5ac2fc3bf6f4140db0371e8bb4243b24e3e02 Mon Sep 17 00:00:00 2001 From: Mikulas Patocka Date: Tue, 14 Jan 2014 17:56:40 -0800 Subject: [PATCH 33/55] mm: fix crash when using XFS on loopback Commit 8456a648cf44 ("slab: use struct page for slab management") causes a crash in the LVM2 testsuite on PA-RISC (the crashing test is fsadm.sh). The testsuite doesn't crash on 3.12, crashes on 3.13-rc1 and later. Bad Address (null pointer deref?): Code=15 regs=000000413edd89a0 (Addr=000006202224647d) CPU: 3 PID: 24008 Comm: loop0 Not tainted 3.13.0-rc6 #5 task: 00000001bf3c0048 ti: 000000413edd8000 task.ti: 000000413edd8000 YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI PSW: 00001000000001101111100100001110 Not tainted r00-03 000000ff0806f90e 00000000405c8de0 000000004013e6c0 000000413edd83f0 r04-07 00000000405a95e0 0000000000000200 00000001414735f0 00000001bf349e40 r08-11 0000000010fe3d10 0000000000000001 00000040829c7778 000000413efd9000 r12-15 0000000000000000 000000004060d800 0000000010fe3000 0000000010fe3000 r16-19 000000413edd82a0 00000041078ddbc0 0000000000000010 0000000000000001 r20-23 0008f3d0d83a8000 0000000000000000 00000040829c7778 0000000000000080 r24-27 00000001bf349e40 00000001bf349e40 202d66202224640d 00000000405a95e0 r28-31 202d662022246465 000000413edd88f0 000000413edd89a0 0000000000000001 sr00-03 000000000532c000 0000000000000000 0000000000000000 000000000532c000 sr04-07 0000000000000000 0000000000000000 0000000000000000 0000000000000000 IASQ: 0000000000000000 0000000000000000 IAOQ: 00000000401fe42c 00000000401fe430 IIR: 539c0030 ISR: 00000000202d6000 IOR: 000006202224647d CPU: 3 CR30: 000000413edd8000 CR31: 0000000000000000 ORIG_R28: 00000000405a95e0 IAOQ[0]: vma_interval_tree_iter_first+0x14/0x48 IAOQ[1]: vma_interval_tree_iter_first+0x18/0x48 RP(r2): flush_dcache_page+0x128/0x388 Backtrace: flush_dcache_page+0x128/0x388 lo_splice_actor+0x90/0x148 [loop] splice_from_pipe_feed+0xc0/0x1d0 __splice_from_pipe+0xac/0xc0 lo_direct_splice_actor+0x1c/0x70 [loop] splice_direct_to_actor+0xec/0x228 lo_receive+0xe4/0x298 [loop] loop_thread+0x478/0x640 [loop] kthread+0x134/0x168 end_fault_vector+0x20/0x28 xfs_setsize_buftarg+0x0/0x90 [xfs] Kernel panic - not syncing: Bad Address (null pointer deref?) Commit 8456a648cf44 changes the page structure so that the slab subsystem reuses the page->mapping field. The crash happens in the following way: * XFS allocates some memory from slab and issues a bio to read data into it. * the bio is sent to the loopback device. * lo_receive creates an actor and calls splice_direct_to_actor. * lo_splice_actor copies data to the target page. * lo_splice_actor calls flush_dcache_page because the page may be mapped by userspace. In that case we need to flush the kernel cache. * flush_dcache_page asks for the list of userspace mappings, however that page->mapping field is reused by the slab subsystem for a different purpose. This causes the crash. Note that other architectures without coherent caches (sparc, arm, mips) also call page_mapping from flush_dcache_page, so they may crash in the same way. This patch fixes this bug by testing if the page is a slab page in page_mapping and returning NULL if it is. The patch also fixes VM_BUG_ON(PageSlab(page)) that could happen in earlier kernels in the same scenario on architectures without cache coherence when CONFIG_DEBUG_VM is enabled - so it should be backported to stable kernels. In the old kernels, the function page_mapping is placed in include/linux/mm.h, so you should modify the patch accordingly when backporting it. Signed-off-by: Mikulas Patocka Cc: John David Anglin ] Cc: Andi Kleen Cc: Christoph Lameter Acked-by: Pekka Enberg Reviewed-by: Joonsoo Kim Cc: Helge Deller Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/util.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/mm/util.c b/mm/util.c index f7bc2096071c..808f375648e7 100644 --- a/mm/util.c +++ b/mm/util.c @@ -390,7 +390,10 @@ struct address_space *page_mapping(struct page *page) { struct address_space *mapping = page->mapping; - VM_BUG_ON(PageSlab(page)); + /* This happens if someone calls flush_dcache_page on slab page */ + if (unlikely(PageSlab(page))) + return NULL; + if (unlikely(PageSwapCache(page))) { swp_entry_t entry; From 5a610fcc7390ee60308deaf09426ada87a1eeec2 Mon Sep 17 00:00:00 2001 From: Qais Yousef Date: Tue, 14 Jan 2014 17:56:41 -0800 Subject: [PATCH 34/55] crash_dump: fix compilation error (on MIPS at least) In file included from kernel/crash_dump.c:2:0: include/linux/crash_dump.h:22:27: error: unknown type name `pgprot_t' when CONFIG_CRASH_DUMP=y The error was traced back to commit 9cb218131de1 ("vmcore: introduce remap_oldmem_pfn_range()") include to get the missing definition Signed-off-by: Qais Yousef Reviewed-by: James Hogan Cc: Michael Holzheu Acked-by: Vivek Goyal Cc: [3.12+] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/crash_dump.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/include/linux/crash_dump.h b/include/linux/crash_dump.h index fe68a5a98583..7032518f8542 100644 --- a/include/linux/crash_dump.h +++ b/include/linux/crash_dump.h @@ -6,6 +6,8 @@ #include #include +#include /* for pgprot_t */ + #define ELFCORE_ADDR_MAX (-1ULL) #define ELFCORE_ADDR_ERR (-2ULL) From 74e72f894d56eb9d2e1218530c658e7d297e002b Mon Sep 17 00:00:00 2001 From: Ming Lei Date: Tue, 14 Jan 2014 17:56:42 -0800 Subject: [PATCH 35/55] lib/percpu_counter.c: fix __percpu_counter_add() __percpu_counter_add() may be called in softirq/hardirq handler (such as, blk_mq_queue_exit() is typically called in hardirq/softirq handler), so we need to call this_cpu_add()(irq safe helper) to update percpu counter, otherwise counts may be lost. This fixes the problem that 'rmmod null_blk' hangs in blk_cleanup_queue() because of miscounting of request_queue->mq_usage_counter. This patch is the v1 of previous one of "lib/percpu_counter.c: disable local irq when updating percpu couter", and takes Andrew's approach which may be more efficient for ARCHs(x86, s390) that have optimized this_cpu_add(). Signed-off-by: Ming Lei Cc: Paul Gortmaker Cc: Shaohua Li Cc: Jens Axboe Cc: Fan Du Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- lib/percpu_counter.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c index 7473ee3b4ee7..1da85bb1bc07 100644 --- a/lib/percpu_counter.c +++ b/lib/percpu_counter.c @@ -82,10 +82,10 @@ void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch) unsigned long flags; raw_spin_lock_irqsave(&fbc->lock, flags); fbc->count += count; + __this_cpu_sub(*fbc->counters, count); raw_spin_unlock_irqrestore(&fbc->lock, flags); - __this_cpu_write(*fbc->counters, 0); } else { - __this_cpu_write(*fbc->counters, count); + this_cpu_add(*fbc->counters, amount); } preempt_enable(); } From 0dce7cd67fd9055c4a2ff278f8af1431e646d346 Mon Sep 17 00:00:00 2001 From: Andrew Jones Date: Wed, 15 Jan 2014 13:39:59 +0100 Subject: [PATCH 36/55] kvm: x86: fix apic_base enable check Commit e66d2ae7c67bd moved the assignment vcpu->arch.apic_base = value above a condition with (vcpu->arch.apic_base ^ value), causing that check to always fail. Use old_value, vcpu->arch.apic_base's old value, in the condition instead. Signed-off-by: Andrew Jones Signed-off-by: Paolo Bonzini --- arch/x86/kvm/lapic.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 1673940cf9c3..775702f649ca 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -1355,7 +1355,7 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value) vcpu->arch.apic_base = value; /* update jump label if enable bit changes */ - if ((vcpu->arch.apic_base ^ value) & MSR_IA32_APICBASE_ENABLE) { + if ((old_value ^ value) & MSR_IA32_APICBASE_ENABLE) { if (value & MSR_IA32_APICBASE_ENABLE) static_key_slow_dec_deferred(&apic_hw_disabled); else From 1df0cbd509bc21b0c331358c1f9d9a6fc94bada8 Mon Sep 17 00:00:00 2001 From: Marek Lindner Date: Wed, 15 Jan 2014 20:31:18 +0800 Subject: [PATCH 37/55] batman-adv: fix batman-adv header overhead calculation MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Batman-adv prepends a full ethernet header in addition to its own header. This has to be reflected in the MTU calculation, especially since the value is used to set dev->hard_header_len. Introduced by 411d6ed93a5d0601980d3e5ce75de07c98e3a7de ("batman-adv: consider network coding overhead when calculating required mtu") Reported-by: cmsv Reported-by: Martin Hundebøll Signed-off-by: Marek Lindner Signed-off-by: Antonio Quartulli --- net/batman-adv/main.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/batman-adv/main.c b/net/batman-adv/main.c index 1511f64a6cea..faba0f61ad53 100644 --- a/net/batman-adv/main.c +++ b/net/batman-adv/main.c @@ -277,7 +277,7 @@ int batadv_max_header_len(void) sizeof(struct batadv_coded_packet)); #endif - return header_len; + return header_len + ETH_HLEN; } /** From a926592f5e4e900f3fa903298c4619a131e60963 Mon Sep 17 00:00:00 2001 From: Richard Weinberger Date: Tue, 14 Jan 2014 22:46:36 +0100 Subject: [PATCH 38/55] net,via-rhine: Fix tx_timeout handling rhine_reset_task() misses to disable the tx scheduler upon reset, this can lead to a crash if work is still scheduled while we're resetting the tx queue. Fixes: [ 93.591707] BUG: unable to handle kernel NULL pointer dereference at 0000004c [ 93.595514] IP: [] rhine_napipoll+0x491/0x6 Signed-off-by: Richard Weinberger Signed-off-by: David S. Miller --- drivers/net/ethernet/via/via-rhine.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/ethernet/via/via-rhine.c b/drivers/net/ethernet/via/via-rhine.c index cce6c4bc556a..ef312bc6b865 100644 --- a/drivers/net/ethernet/via/via-rhine.c +++ b/drivers/net/ethernet/via/via-rhine.c @@ -1618,6 +1618,7 @@ static void rhine_reset_task(struct work_struct *work) goto out_unlock; napi_disable(&rp->napi); + netif_tx_disable(dev); spin_lock_bh(&rp->lock); /* clear all descriptors */ From d9aee591b0f06bd44cd577b757d3f267bc35fe4d Mon Sep 17 00:00:00 2001 From: Yuval Mintz Date: Wed, 15 Jan 2014 12:05:30 +0200 Subject: [PATCH 39/55] bnx2x: Don't release PCI bars on shutdown The bnx2x driver in its pci shutdown() callback releases its pci bars (in the same manner it does during its pci remove() callback). During a system reboot while VFs are enabled, its possible for the VF's remove to be called (as a result of pci_disable_sriov()) after its shutdown callback has already finished running; This will cause a paging request fault as the VF tries to access the pci bar which it has previously released, crashing the system. This patch further differentiates the shutdown and remove callbacks, preventing the pci release procedures from being called during shutdown. Signed-off-by: Yuval Mintz Signed-off-by: Ariel Elior Signed-off-by: David S. Miller --- .../net/ethernet/broadcom/bnx2x/bnx2x_main.c | 29 ++++++++++--------- 1 file changed, 15 insertions(+), 14 deletions(-) diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c index 8b3107b2fcc1..0067b975873f 100644 --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c @@ -12942,25 +12942,26 @@ static void __bnx2x_remove(struct pci_dev *pdev, pci_set_power_state(pdev, PCI_D3hot); } - if (bp->regview) - iounmap(bp->regview); + if (remove_netdev) { + if (bp->regview) + iounmap(bp->regview); - /* for vf doorbells are part of the regview and were unmapped along with - * it. FW is only loaded by PF. - */ - if (IS_PF(bp)) { - if (bp->doorbells) - iounmap(bp->doorbells); + /* For vfs, doorbells are part of the regview and were unmapped + * along with it. FW is only loaded by PF. + */ + if (IS_PF(bp)) { + if (bp->doorbells) + iounmap(bp->doorbells); - bnx2x_release_firmware(bp); - } - bnx2x_free_mem_bp(bp); + bnx2x_release_firmware(bp); + } + bnx2x_free_mem_bp(bp); - if (remove_netdev) free_netdev(dev); - if (atomic_read(&pdev->enable_cnt) == 1) - pci_release_regions(pdev); + if (atomic_read(&pdev->enable_cnt) == 1) + pci_release_regions(pdev); + } pci_disable_device(pdev); } From ba42fad0964a41f0830e80c1b6be49c1e6bfcc01 Mon Sep 17 00:00:00 2001 From: Ivan Vecera Date: Wed, 15 Jan 2014 11:11:34 +0100 Subject: [PATCH 40/55] be2net: add dma_mapping_error() check for dma_map_page() The driver does not check value returned by dma_map_page. The patch fixes this. v2: Removed the bugfix for non-bug ;-) (thanks Sathya) Signed-off-by: Ivan Vecera Acked-by: Sathya Perla Signed-off-by: David S. Miller --- drivers/net/ethernet/emulex/benet/be_main.c | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/emulex/benet/be_main.c b/drivers/net/ethernet/emulex/benet/be_main.c index bf40fdaecfa3..a37039d353c5 100644 --- a/drivers/net/ethernet/emulex/benet/be_main.c +++ b/drivers/net/ethernet/emulex/benet/be_main.c @@ -1776,6 +1776,7 @@ static void be_post_rx_frags(struct be_rx_obj *rxo, gfp_t gfp) struct be_rx_page_info *page_info = NULL, *prev_page_info = NULL; struct be_queue_info *rxq = &rxo->q; struct page *pagep = NULL; + struct device *dev = &adapter->pdev->dev; struct be_eth_rx_d *rxd; u64 page_dmaaddr = 0, frag_dmaaddr; u32 posted, page_offset = 0; @@ -1788,9 +1789,15 @@ static void be_post_rx_frags(struct be_rx_obj *rxo, gfp_t gfp) rx_stats(rxo)->rx_post_fail++; break; } - page_dmaaddr = dma_map_page(&adapter->pdev->dev, pagep, - 0, adapter->big_page_size, + page_dmaaddr = dma_map_page(dev, pagep, 0, + adapter->big_page_size, DMA_FROM_DEVICE); + if (dma_mapping_error(dev, page_dmaaddr)) { + put_page(pagep); + pagep = NULL; + rx_stats(rxo)->rx_post_fail++; + break; + } page_info->page_offset = 0; } else { get_page(pagep); From aee636c4809fa54848ff07a899b326eb1f9987a2 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Wed, 15 Jan 2014 06:50:07 -0800 Subject: [PATCH 41/55] bpf: do not use reciprocal divide At first Jakub Zawadzki noticed that some divisions by reciprocal_divide were not correct. (off by one in some cases) http://www.wireshark.org/~darkjames/reciprocal-buggy.c He could also show this with BPF: http://www.wireshark.org/~darkjames/set-and-dump-filter-k-bug.c The reciprocal divide in linux kernel is not generic enough, lets remove its use in BPF, as it is not worth the pain with current cpus. Signed-off-by: Eric Dumazet Reported-by: Jakub Zawadzki Cc: Mircea Gherzan Cc: Daniel Borkmann Cc: Hannes Frederic Sowa Cc: Matt Evans Cc: Martin Schwidefsky Cc: Heiko Carstens Cc: David S. Miller Signed-off-by: David S. Miller --- arch/arm/net/bpf_jit_32.c | 6 +++--- arch/powerpc/net/bpf_jit_comp.c | 7 ++++--- arch/s390/net/bpf_jit_comp.c | 17 ++++++++++++----- arch/sparc/net/bpf_jit_comp.c | 17 ++++++++++++++--- arch/x86/net/bpf_jit_comp.c | 14 ++++++++++---- net/core/filter.c | 30 ++---------------------------- 6 files changed, 45 insertions(+), 46 deletions(-) diff --git a/arch/arm/net/bpf_jit_32.c b/arch/arm/net/bpf_jit_32.c index 9ed155ad0f97..271b5e971568 100644 --- a/arch/arm/net/bpf_jit_32.c +++ b/arch/arm/net/bpf_jit_32.c @@ -641,10 +641,10 @@ static int build_body(struct jit_ctx *ctx) emit(ARM_MUL(r_A, r_A, r_X), ctx); break; case BPF_S_ALU_DIV_K: - /* current k == reciprocal_value(userspace k) */ + if (k == 1) + break; emit_mov_i(r_scratch, k, ctx); - /* A = top 32 bits of the product */ - emit(ARM_UMULL(r_scratch, r_A, r_A, r_scratch), ctx); + emit_udiv(r_A, r_A, r_scratch, ctx); break; case BPF_S_ALU_DIV_X: update_on_xread(ctx); diff --git a/arch/powerpc/net/bpf_jit_comp.c b/arch/powerpc/net/bpf_jit_comp.c index ac3c2a10dafd..555034f8505e 100644 --- a/arch/powerpc/net/bpf_jit_comp.c +++ b/arch/powerpc/net/bpf_jit_comp.c @@ -223,10 +223,11 @@ static int bpf_jit_build_body(struct sk_filter *fp, u32 *image, } PPC_DIVWU(r_A, r_A, r_X); break; - case BPF_S_ALU_DIV_K: /* A = reciprocal_divide(A, K); */ + case BPF_S_ALU_DIV_K: /* A /= K */ + if (K == 1) + break; PPC_LI32(r_scratch1, K); - /* Top 32 bits of 64bit result -> A */ - PPC_MULHWU(r_A, r_A, r_scratch1); + PPC_DIVWU(r_A, r_A, r_scratch1); break; case BPF_S_ALU_AND_X: ctx->seen |= SEEN_XREG; diff --git a/arch/s390/net/bpf_jit_comp.c b/arch/s390/net/bpf_jit_comp.c index 16871da37371..fc0fa77728e1 100644 --- a/arch/s390/net/bpf_jit_comp.c +++ b/arch/s390/net/bpf_jit_comp.c @@ -371,11 +371,13 @@ static int bpf_jit_insn(struct bpf_jit *jit, struct sock_filter *filter, /* dr %r4,%r12 */ EMIT2(0x1d4c); break; - case BPF_S_ALU_DIV_K: /* A = reciprocal_divide(A, K) */ - /* m %r4,(%r13) */ - EMIT4_DISP(0x5c40d000, EMIT_CONST(K)); - /* lr %r5,%r4 */ - EMIT2(0x1854); + case BPF_S_ALU_DIV_K: /* A /= K */ + if (K == 1) + break; + /* lhi %r4,0 */ + EMIT4(0xa7480000); + /* d %r4,(%r13) */ + EMIT4_DISP(0x5d40d000, EMIT_CONST(K)); break; case BPF_S_ALU_MOD_X: /* A %= X */ jit->seen |= SEEN_XREG | SEEN_RET0; @@ -391,6 +393,11 @@ static int bpf_jit_insn(struct bpf_jit *jit, struct sock_filter *filter, EMIT2(0x1854); break; case BPF_S_ALU_MOD_K: /* A %= K */ + if (K == 1) { + /* lhi %r5,0 */ + EMIT4(0xa7580000); + break; + } /* lhi %r4,0 */ EMIT4(0xa7480000); /* d %r4,(%r13) */ diff --git a/arch/sparc/net/bpf_jit_comp.c b/arch/sparc/net/bpf_jit_comp.c index 218b6b23c378..01fe9946d388 100644 --- a/arch/sparc/net/bpf_jit_comp.c +++ b/arch/sparc/net/bpf_jit_comp.c @@ -497,9 +497,20 @@ void bpf_jit_compile(struct sk_filter *fp) case BPF_S_ALU_MUL_K: /* A *= K */ emit_alu_K(MUL, K); break; - case BPF_S_ALU_DIV_K: /* A /= K */ - emit_alu_K(MUL, K); - emit_read_y(r_A); + case BPF_S_ALU_DIV_K: /* A /= K with K != 0*/ + if (K == 1) + break; + emit_write_y(G0); +#ifdef CONFIG_SPARC32 + /* The Sparc v8 architecture requires + * three instructions between a %y + * register write and the first use. + */ + emit_nop(); + emit_nop(); + emit_nop(); +#endif + emit_alu_K(DIV, K); break; case BPF_S_ALU_DIV_X: /* A /= X; */ emit_cmpi(r_X, 0); diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c index 26328e800869..4ed75dd81d05 100644 --- a/arch/x86/net/bpf_jit_comp.c +++ b/arch/x86/net/bpf_jit_comp.c @@ -359,15 +359,21 @@ void bpf_jit_compile(struct sk_filter *fp) EMIT2(0x89, 0xd0); /* mov %edx,%eax */ break; case BPF_S_ALU_MOD_K: /* A %= K; */ + if (K == 1) { + CLEAR_A(); + break; + } EMIT2(0x31, 0xd2); /* xor %edx,%edx */ EMIT1(0xb9);EMIT(K, 4); /* mov imm32,%ecx */ EMIT2(0xf7, 0xf1); /* div %ecx */ EMIT2(0x89, 0xd0); /* mov %edx,%eax */ break; - case BPF_S_ALU_DIV_K: /* A = reciprocal_divide(A, K); */ - EMIT3(0x48, 0x69, 0xc0); /* imul imm32,%rax,%rax */ - EMIT(K, 4); - EMIT4(0x48, 0xc1, 0xe8, 0x20); /* shr $0x20,%rax */ + case BPF_S_ALU_DIV_K: /* A /= K */ + if (K == 1) + break; + EMIT2(0x31, 0xd2); /* xor %edx,%edx */ + EMIT1(0xb9);EMIT(K, 4); /* mov imm32,%ecx */ + EMIT2(0xf7, 0xf1); /* div %ecx */ break; case BPF_S_ALU_AND_X: seen |= SEEN_XREG; diff --git a/net/core/filter.c b/net/core/filter.c index 01b780856db2..ad30d626a5bd 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -36,7 +36,6 @@ #include #include #include -#include #include #include #include @@ -166,7 +165,7 @@ unsigned int sk_run_filter(const struct sk_buff *skb, A /= X; continue; case BPF_S_ALU_DIV_K: - A = reciprocal_divide(A, K); + A /= K; continue; case BPF_S_ALU_MOD_X: if (X == 0) @@ -553,11 +552,6 @@ int sk_chk_filter(struct sock_filter *filter, unsigned int flen) /* Some instructions need special checks */ switch (code) { case BPF_S_ALU_DIV_K: - /* check for division by zero */ - if (ftest->k == 0) - return -EINVAL; - ftest->k = reciprocal_value(ftest->k); - break; case BPF_S_ALU_MOD_K: /* check for division by zero */ if (ftest->k == 0) @@ -853,27 +847,7 @@ void sk_decode_filter(struct sock_filter *filt, struct sock_filter *to) to->code = decodes[code]; to->jt = filt->jt; to->jf = filt->jf; - - if (code == BPF_S_ALU_DIV_K) { - /* - * When loaded this rule user gave us X, which was - * translated into R = r(X). Now we calculate the - * RR = r(R) and report it back. If next time this - * value is loaded and RRR = r(RR) is calculated - * then the R == RRR will be true. - * - * One exception. X == 1 translates into R == 0 and - * we can't calculate RR out of it with r(). - */ - - if (filt->k == 0) - to->k = 1; - else - to->k = reciprocal_value(filt->k); - - BUG_ON(reciprocal_value(to->k) != filt->k); - } else - to->k = filt->k; + to->k = filt->k; } int sk_get_filter(struct sock *sk, struct sock_filter __user *ubuf, unsigned int len) From c026b3591e4f2a4993df773183704bb31634e0bd Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Fri, 10 Jan 2014 21:06:03 +0100 Subject: [PATCH 42/55] x86, mm, perf: Allow recursive faults from interrupts Waiman managed to trigger a PMI while in a emulate_vsyscall() fault, the PMI in turn managed to trigger a fault while obtaining a stack trace. This triggered the sig_on_uaccess_error recursive fault logic and killed the process dead. Fix this by explicitly excluding interrupts from the recursive fault logic. Reported-and-Tested-by: Waiman Long Fixes: e00b12e64be9 ("perf/x86: Further optimize copy_from_user_nmi()") Cc: Aswin Chandramouleeswaran Cc: Scott J Norton Cc: Linus Torvalds Cc: Andy Lutomirski Cc: Arnaldo Carvalho de Melo Cc: Andrew Morton Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/r/20140110200603.GJ7572@laptop.programming.kicks-ass.net Signed-off-by: Ingo Molnar --- arch/x86/mm/fault.c | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+) diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 9ff85bb8dd69..9d591c895803 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -641,6 +641,20 @@ no_context(struct pt_regs *regs, unsigned long error_code, /* Are we prepared to handle this kernel fault? */ if (fixup_exception(regs)) { + /* + * Any interrupt that takes a fault gets the fixup. This makes + * the below recursive fault logic only apply to a faults from + * task context. + */ + if (in_interrupt()) + return; + + /* + * Per the above we're !in_interrupt(), aka. task context. + * + * In this case we need to make sure we're not recursively + * faulting through the emulate_vsyscall() logic. + */ if (current_thread_info()->sig_on_uaccess_error && signal) { tsk->thread.trap_nr = X86_TRAP_PF; tsk->thread.error_code = error_code | PF_USER; @@ -649,6 +663,10 @@ no_context(struct pt_regs *regs, unsigned long error_code, /* XXX: hwpoison faults will set the wrong code. */ force_sig_info_fault(signal, si_code, address, tsk, 0); } + + /* + * Barring that, we can do the fixup and be happy. + */ return; } From bee09ed91cacdbffdbcd3b05de8409c77ec9fcd6 Mon Sep 17 00:00:00 2001 From: Robert Richter Date: Wed, 15 Jan 2014 15:57:29 +0100 Subject: [PATCH 43/55] perf/x86/amd/ibs: Fix waking up from S3 for AMD family 10h On AMD family 10h we see following error messages while waking up from S3 for all non-boot CPUs leading to a failed IBS initialization: Enabling non-boot CPUs ... smpboot: Booting Node 0 Processor 1 APIC 0x1 [Firmware Bug]: cpu 1, try to use APIC500 (LVT offset 0) for vector 0x400, but the register is already in use for vector 0xf9 on another cpu perf: IBS APIC setup failed on cpu #1 process: Switch to broadcast mode on CPU1 CPU1 is up ... ACPI: Waking up from system sleep state S3 Reason for this is that during suspend the LVT offset for the IBS vector gets lost and needs to be reinialized while resuming. The offset is read from the IBSCTL msr. On family 10h the offset needs to be 1 as offset 0 is used for the MCE threshold interrupt, but firmware assings it for IBS to 0 too. The kernel needs to reprogram the vector. The msr is a readonly node msr, but a new value can be written via pci config space access. The reinitialization is implemented for family 10h in setup_ibs_ctl() which is forced during IBS setup. This patch fixes IBS setup after waking up from S3 by adding resume/supend hooks for the boot cpu which does the offset reinitialization. Marking it as stable to let distros pick up this fix. Signed-off-by: Robert Richter Signed-off-by: Peter Zijlstra Cc: v3.2.. Cc: Linus Torvalds Link: http://lkml.kernel.org/r/1389797849-5565-1-git-send-email-rric.net@gmail.com Signed-off-by: Ingo Molnar --- arch/x86/kernel/cpu/perf_event_amd_ibs.c | 53 ++++++++++++++++++++---- 1 file changed, 45 insertions(+), 8 deletions(-) diff --git a/arch/x86/kernel/cpu/perf_event_amd_ibs.c b/arch/x86/kernel/cpu/perf_event_amd_ibs.c index e09f0bfb7b8f..4b8e4d3cd6ea 100644 --- a/arch/x86/kernel/cpu/perf_event_amd_ibs.c +++ b/arch/x86/kernel/cpu/perf_event_amd_ibs.c @@ -10,6 +10,7 @@ #include #include #include +#include #include @@ -816,6 +817,18 @@ static int force_ibs_eilvt_setup(void) return ret; } +static void ibs_eilvt_setup(void) +{ + /* + * Force LVT offset assignment for family 10h: The offsets are + * not assigned by the BIOS for this family, so the OS is + * responsible for doing it. If the OS assignment fails, fall + * back to BIOS settings and try to setup this. + */ + if (boot_cpu_data.x86 == 0x10) + force_ibs_eilvt_setup(); +} + static inline int get_ibs_lvt_offset(void) { u64 val; @@ -851,6 +864,36 @@ static void clear_APIC_ibs(void *dummy) setup_APIC_eilvt(offset, 0, APIC_EILVT_MSG_FIX, 1); } +#ifdef CONFIG_PM + +static int perf_ibs_suspend(void) +{ + clear_APIC_ibs(NULL); + return 0; +} + +static void perf_ibs_resume(void) +{ + ibs_eilvt_setup(); + setup_APIC_ibs(NULL); +} + +static struct syscore_ops perf_ibs_syscore_ops = { + .resume = perf_ibs_resume, + .suspend = perf_ibs_suspend, +}; + +static void perf_ibs_pm_init(void) +{ + register_syscore_ops(&perf_ibs_syscore_ops); +} + +#else + +static inline void perf_ibs_pm_init(void) { } + +#endif + static int perf_ibs_cpu_notifier(struct notifier_block *self, unsigned long action, void *hcpu) { @@ -877,18 +920,12 @@ static __init int amd_ibs_init(void) if (!caps) return -ENODEV; /* ibs not supported by the cpu */ - /* - * Force LVT offset assignment for family 10h: The offsets are - * not assigned by the BIOS for this family, so the OS is - * responsible for doing it. If the OS assignment fails, fall - * back to BIOS settings and try to setup this. - */ - if (boot_cpu_data.x86 == 0x10) - force_ibs_eilvt_setup(); + ibs_eilvt_setup(); if (!ibs_eilvt_valid()) goto out; + perf_ibs_pm_init(); get_online_cpus(); ibs_caps = caps; /* make ibs_caps visible to other cpus: */ From 4ce00dfcf19c473f3dbf23d5b1372639f0c334f6 Mon Sep 17 00:00:00 2001 From: Catalin Marinas Date: Thu, 16 Jan 2014 18:32:25 +0000 Subject: [PATCH 44/55] Revert "arm64: Fix memory shareability attribute for ioremap_wc/cache" This reverts commit 2f7dc6027522499582a520807cb9ffda589de47e. The above commit breaks the mapping type for Device memory because pgprot_default already contains a Normal memory type. pgprot_default is also not initialised early enough for earlyprintk resulting in an inconsistent memory mapping with 64K PAGE_SIZE configuration. Signed-off-by: Catalin Marinas Reported-by: Will Deacon Acked-by: Will Deacon --- arch/arm64/include/asm/io.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h index 572769727227..4cc813eddacb 100644 --- a/arch/arm64/include/asm/io.h +++ b/arch/arm64/include/asm/io.h @@ -229,7 +229,7 @@ extern void __iomem *__ioremap(phys_addr_t phys_addr, size_t size, pgprot_t prot extern void __iounmap(volatile void __iomem *addr); extern void __iomem *ioremap_cache(phys_addr_t phys_addr, size_t size); -#define PROT_DEFAULT (pgprot_default | PTE_DIRTY) +#define PROT_DEFAULT (PTE_TYPE_PAGE | PTE_AF | PTE_DIRTY) #define PROT_DEVICE_nGnRE (PROT_DEFAULT | PTE_PXN | PTE_UXN | PTE_ATTRINDX(MT_DEVICE_nGnRE)) #define PROT_NORMAL_NC (PROT_DEFAULT | PTE_ATTRINDX(MT_NORMAL_NC)) #define PROT_NORMAL (PROT_DEFAULT | PTE_ATTRINDX(MT_NORMAL)) From 38a529b5d42e4cfc5ac94844e61335a00eb2d320 Mon Sep 17 00:00:00 2001 From: Mika Westerberg Date: Thu, 16 Jan 2014 14:39:39 +0200 Subject: [PATCH 45/55] e1000e: Fix compilation warning when !CONFIG_PM_SLEEP MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Commit 7509963c703b (e1000e: Fix a compile flag mis-match for suspend/resume) moved suspend and resume hooks to be available when CONFIG_PM is set. However, it can be set even if CONFIG_PM_SLEEP is not set causing following warnings to be emitted: drivers/net/ethernet/intel/e1000e/netdev.c:6178:12: warning: ‘e1000_suspend’ defined but not used [-Wunused-function] drivers/net/ethernet/intel/e1000e/netdev.c:6185:12: warning: ‘e1000_resume’ defined but not used [-Wunused-function] To fix this make the hooks to be available only when CONFIG_PM_SLEEP is set and remove CONFIG_PM wrapping from driver ops because this is already handled by SET_SYSTEM_SLEEP_PM_OPS() and SET_RUNTIME_PM_OPS(). Signed-off-by: Mika Westerberg Cc: Dave Ertman Cc: Aaron Brown Cc: Jeff Kirsher Signed-off-by: David S. Miller --- drivers/net/ethernet/intel/e1000e/netdev.c | 8 ++------ 1 file changed, 2 insertions(+), 6 deletions(-) diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c index c30d41d6e426..6d14eea17918 100644 --- a/drivers/net/ethernet/intel/e1000e/netdev.c +++ b/drivers/net/ethernet/intel/e1000e/netdev.c @@ -6174,7 +6174,7 @@ static int __e1000_resume(struct pci_dev *pdev) return 0; } -#ifdef CONFIG_PM +#ifdef CONFIG_PM_SLEEP static int e1000_suspend(struct device *dev) { struct pci_dev *pdev = to_pci_dev(dev); @@ -6193,7 +6193,7 @@ static int e1000_resume(struct device *dev) return __e1000_resume(pdev); } -#endif /* CONFIG_PM */ +#endif /* CONFIG_PM_SLEEP */ #ifdef CONFIG_PM_RUNTIME static int e1000_runtime_suspend(struct device *dev) @@ -7015,13 +7015,11 @@ static DEFINE_PCI_DEVICE_TABLE(e1000_pci_tbl) = { }; MODULE_DEVICE_TABLE(pci, e1000_pci_tbl); -#ifdef CONFIG_PM static const struct dev_pm_ops e1000_pm_ops = { SET_SYSTEM_SLEEP_PM_OPS(e1000_suspend, e1000_resume) SET_RUNTIME_PM_OPS(e1000_runtime_suspend, e1000_runtime_resume, e1000_idle) }; -#endif /* PCI Device API Driver */ static struct pci_driver e1000_driver = { @@ -7029,11 +7027,9 @@ static struct pci_driver e1000_driver = { .id_table = e1000_pci_tbl, .probe = e1000_probe, .remove = e1000_remove, -#ifdef CONFIG_PM .driver = { .pm = &e1000_pm_ops, }, -#endif .shutdown = e1000_shutdown, .err_handler = &e1000_err_handler }; From d1969a84dd6a44d375aa82bba7d6c38713a429c3 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Thu, 16 Jan 2014 15:26:48 -0800 Subject: [PATCH 46/55] percpu_counter: unbreak __percpu_counter_add() Commit 74e72f894d56 ("lib/percpu_counter.c: fix __percpu_counter_add()") looked very plausible, but its arithmetic was badly wrong: obvious once you see the fix, but maddening to get there from the weird tmpfs ENOSPCs Signed-off-by: Hugh Dickins Cc: Ming Lei Cc: Paul Gortmaker Cc: Shaohua Li Cc: Jens Axboe Cc: Fan Du Cc: Andrew Morton Signed-off-by: Linus Torvalds --- lib/percpu_counter.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c index 1da85bb1bc07..8280a5dd1727 100644 --- a/lib/percpu_counter.c +++ b/lib/percpu_counter.c @@ -82,7 +82,7 @@ void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch) unsigned long flags; raw_spin_lock_irqsave(&fbc->lock, flags); fbc->count += count; - __this_cpu_sub(*fbc->counters, count); + __this_cpu_sub(*fbc->counters, count - amount); raw_spin_unlock_irqrestore(&fbc->lock, flags); } else { this_cpu_add(*fbc->counters, amount); From c196403b79aa241c3fefb3ee5bb328aa7c5cc860 Mon Sep 17 00:00:00 2001 From: Gerald Schaefer Date: Thu, 16 Jan 2014 16:54:48 +0100 Subject: [PATCH 47/55] net: rds: fix per-cpu helper usage commit ae4b46e9d "net: rds: use this_cpu_* per-cpu helper" broke per-cpu handling for rds. chpfirst is the result of __this_cpu_read(), so it is an absolute pointer and not __percpu. Therefore, __this_cpu_write() should not operate on chpfirst, but rather on cache->percpu->first, just like __this_cpu_read() did before. Cc: # 3.8+ Signed-off-byd Gerald Schaefer Signed-off-by: David S. Miller --- net/rds/ib_recv.c | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c index 8eb9501e3d60..b7ebe23cdedf 100644 --- a/net/rds/ib_recv.c +++ b/net/rds/ib_recv.c @@ -421,8 +421,7 @@ static void rds_ib_recv_cache_put(struct list_head *new_item, struct rds_ib_refill_cache *cache) { unsigned long flags; - struct list_head *old; - struct list_head __percpu *chpfirst; + struct list_head *old, *chpfirst; local_irq_save(flags); @@ -432,7 +431,7 @@ static void rds_ib_recv_cache_put(struct list_head *new_item, else /* put on front */ list_add_tail(new_item, chpfirst); - __this_cpu_write(chpfirst, new_item); + __this_cpu_write(cache->percpu->first, new_item); __this_cpu_inc(cache->percpu->count); if (__this_cpu_read(cache->percpu->count) < RDS_IB_RECYCLE_BATCH_COUNT) @@ -452,7 +451,7 @@ static void rds_ib_recv_cache_put(struct list_head *new_item, } while (old); - __this_cpu_write(chpfirst, NULL); + __this_cpu_write(cache->percpu->first, NULL); __this_cpu_write(cache->percpu->count, 0); end: local_irq_restore(flags); From 77f99ad16a07aa062c2d30fae57b1fee456f6ef6 Mon Sep 17 00:00:00 2001 From: Christoph Paasch Date: Thu, 16 Jan 2014 20:01:21 +0100 Subject: [PATCH 48/55] tcp: metrics: Avoid duplicate entries with the same destination-IP Because the tcp-metrics is an RCU-list, it may be that two soft-interrupts are inside __tcp_get_metrics() for the same destination-IP at the same time. If this destination-IP is not yet part of the tcp-metrics, both soft-interrupts will end up in tcpm_new and create a new entry for this IP. So, we will have two tcp-metrics with the same destination-IP in the list. This patch checks twice __tcp_get_metrics(). First without holding the lock, then while holding the lock. The second one is there to confirm that the entry has not been added by another soft-irq while waiting for the spin-lock. Fixes: 51c5d0c4b169b (tcp: Maintain dynamic metrics in local cache.) Signed-off-by: Christoph Paasch Reviewed-by: Eric Dumazet Signed-off-by: David S. Miller --- net/ipv4/tcp_metrics.c | 51 ++++++++++++++++++++++++++---------------- 1 file changed, 32 insertions(+), 19 deletions(-) diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c index 06493736fbc8..098b3a29f6f3 100644 --- a/net/ipv4/tcp_metrics.c +++ b/net/ipv4/tcp_metrics.c @@ -22,6 +22,9 @@ int sysctl_tcp_nometrics_save __read_mostly; +static struct tcp_metrics_block *__tcp_get_metrics(const struct inetpeer_addr *addr, + struct net *net, unsigned int hash); + struct tcp_fastopen_metrics { u16 mss; u16 syn_loss:10; /* Recurring Fast Open SYN losses */ @@ -130,16 +133,41 @@ static void tcpm_suck_dst(struct tcp_metrics_block *tm, struct dst_entry *dst, } } +#define TCP_METRICS_TIMEOUT (60 * 60 * HZ) + +static void tcpm_check_stamp(struct tcp_metrics_block *tm, struct dst_entry *dst) +{ + if (tm && unlikely(time_after(jiffies, tm->tcpm_stamp + TCP_METRICS_TIMEOUT))) + tcpm_suck_dst(tm, dst, false); +} + +#define TCP_METRICS_RECLAIM_DEPTH 5 +#define TCP_METRICS_RECLAIM_PTR (struct tcp_metrics_block *) 0x1UL + static struct tcp_metrics_block *tcpm_new(struct dst_entry *dst, struct inetpeer_addr *addr, - unsigned int hash, - bool reclaim) + unsigned int hash) { struct tcp_metrics_block *tm; struct net *net; + bool reclaim = false; spin_lock_bh(&tcp_metrics_lock); net = dev_net(dst->dev); + + /* While waiting for the spin-lock the cache might have been populated + * with this entry and so we have to check again. + */ + tm = __tcp_get_metrics(addr, net, hash); + if (tm == TCP_METRICS_RECLAIM_PTR) { + reclaim = true; + tm = NULL; + } + if (tm) { + tcpm_check_stamp(tm, dst); + goto out_unlock; + } + if (unlikely(reclaim)) { struct tcp_metrics_block *oldest; @@ -169,17 +197,6 @@ static struct tcp_metrics_block *tcpm_new(struct dst_entry *dst, return tm; } -#define TCP_METRICS_TIMEOUT (60 * 60 * HZ) - -static void tcpm_check_stamp(struct tcp_metrics_block *tm, struct dst_entry *dst) -{ - if (tm && unlikely(time_after(jiffies, tm->tcpm_stamp + TCP_METRICS_TIMEOUT))) - tcpm_suck_dst(tm, dst, false); -} - -#define TCP_METRICS_RECLAIM_DEPTH 5 -#define TCP_METRICS_RECLAIM_PTR (struct tcp_metrics_block *) 0x1UL - static struct tcp_metrics_block *tcp_get_encode(struct tcp_metrics_block *tm, int depth) { if (tm) @@ -282,7 +299,6 @@ static struct tcp_metrics_block *tcp_get_metrics(struct sock *sk, struct inetpeer_addr addr; unsigned int hash; struct net *net; - bool reclaim; addr.family = sk->sk_family; switch (addr.family) { @@ -304,13 +320,10 @@ static struct tcp_metrics_block *tcp_get_metrics(struct sock *sk, hash = hash_32(hash, net->ipv4.tcp_metrics_hash_log); tm = __tcp_get_metrics(&addr, net, hash); - reclaim = false; - if (tm == TCP_METRICS_RECLAIM_PTR) { - reclaim = true; + if (tm == TCP_METRICS_RECLAIM_PTR) tm = NULL; - } if (!tm && create) - tm = tcpm_new(dst, &addr, hash, reclaim); + tm = tcpm_new(dst, &addr, hash); else tcpm_check_stamp(tm, dst); From 11ffff752c6a5adc86f7dd397b2f75af8f917c51 Mon Sep 17 00:00:00 2001 From: Hannes Frederic Sowa Date: Thu, 16 Jan 2014 20:13:04 +0100 Subject: [PATCH 49/55] ipv6: simplify detection of first operational link-local address on interface In commit 1ec047eb4751e3 ("ipv6: introduce per-interface counter for dad-completed ipv6 addresses") I build the detection of the first operational link-local address much to complex. Additionally this code now has a race condition. Replace it with a much simpler variant, which just scans the address list when duplicate address detection completes, to check if this is the first valid link local address and send RS and MLD reports then. Fixes: 1ec047eb4751e3 ("ipv6: introduce per-interface counter for dad-completed ipv6 addresses") Reported-by: Jiri Pirko Cc: Flavio Leitner Signed-off-by: Hannes Frederic Sowa Acked-by: Flavio Leitner Acked-by: Jiri Pirko Signed-off-by: David S. Miller --- include/net/if_inet6.h | 1 - net/ipv6/addrconf.c | 38 +++++++++++++++++--------------------- 2 files changed, 17 insertions(+), 22 deletions(-) diff --git a/include/net/if_inet6.h b/include/net/if_inet6.h index 76d54270f2e2..65bb13035598 100644 --- a/include/net/if_inet6.h +++ b/include/net/if_inet6.h @@ -165,7 +165,6 @@ struct inet6_dev { struct net_device *dev; struct list_head addr_list; - int valid_ll_addr_cnt; struct ifmcaddr6 *mc_list; struct ifmcaddr6 *mc_tomb; diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index abe46a4228ce..4b6b720971b9 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -3189,6 +3189,22 @@ static void addrconf_dad_timer(unsigned long data) in6_ifa_put(ifp); } +/* ifp->idev must be at least read locked */ +static bool ipv6_lonely_lladdr(struct inet6_ifaddr *ifp) +{ + struct inet6_ifaddr *ifpiter; + struct inet6_dev *idev = ifp->idev; + + list_for_each_entry(ifpiter, &idev->addr_list, if_list) { + if (ifp != ifpiter && ifpiter->scope == IFA_LINK && + (ifpiter->flags & (IFA_F_PERMANENT|IFA_F_TENTATIVE| + IFA_F_OPTIMISTIC|IFA_F_DADFAILED)) == + IFA_F_PERMANENT) + return false; + } + return true; +} + static void addrconf_dad_completed(struct inet6_ifaddr *ifp) { struct net_device *dev = ifp->idev->dev; @@ -3208,14 +3224,11 @@ static void addrconf_dad_completed(struct inet6_ifaddr *ifp) */ read_lock_bh(&ifp->idev->lock); - spin_lock(&ifp->lock); - send_mld = ipv6_addr_type(&ifp->addr) & IPV6_ADDR_LINKLOCAL && - ifp->idev->valid_ll_addr_cnt == 1; + send_mld = ifp->scope == IFA_LINK && ipv6_lonely_lladdr(ifp); send_rs = send_mld && ipv6_accept_ra(ifp->idev) && ifp->idev->cnf.rtr_solicits > 0 && (dev->flags&IFF_LOOPBACK) == 0; - spin_unlock(&ifp->lock); read_unlock_bh(&ifp->idev->lock); /* While dad is in progress mld report's source address is in6_addrany. @@ -4512,19 +4525,6 @@ static void inet6_prefix_notify(int event, struct inet6_dev *idev, rtnl_set_sk_err(net, RTNLGRP_IPV6_PREFIX, err); } -static void update_valid_ll_addr_cnt(struct inet6_ifaddr *ifp, int count) -{ - write_lock_bh(&ifp->idev->lock); - spin_lock(&ifp->lock); - if (((ifp->flags & (IFA_F_PERMANENT|IFA_F_TENTATIVE|IFA_F_OPTIMISTIC| - IFA_F_DADFAILED)) == IFA_F_PERMANENT) && - (ipv6_addr_type(&ifp->addr) & IPV6_ADDR_LINKLOCAL)) - ifp->idev->valid_ll_addr_cnt += count; - WARN_ON(ifp->idev->valid_ll_addr_cnt < 0); - spin_unlock(&ifp->lock); - write_unlock_bh(&ifp->idev->lock); -} - static void __ipv6_ifa_notify(int event, struct inet6_ifaddr *ifp) { struct net *net = dev_net(ifp->idev->dev); @@ -4533,8 +4533,6 @@ static void __ipv6_ifa_notify(int event, struct inet6_ifaddr *ifp) switch (event) { case RTM_NEWADDR: - update_valid_ll_addr_cnt(ifp, 1); - /* * If the address was optimistic * we inserted the route at the start of @@ -4550,8 +4548,6 @@ static void __ipv6_ifa_notify(int event, struct inet6_ifaddr *ifp) ifp->idev->dev, 0, 0); break; case RTM_DELADDR: - update_valid_ll_addr_cnt(ifp, -1); - if (ifp->idev->cnf.forwarding) addrconf_leave_anycast(ifp); addrconf_leave_solict(ifp->idev, &ifp->addr); From 75b99dbd634f9caf7b4e48ec5c2dcec8c1837372 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Thu, 16 Jan 2014 11:15:12 -0800 Subject: [PATCH 50/55] parisc: fix SO_MAX_PACING_RATE typo SO_MAX_PACING_RATE definition on parisc got a typo. Its not too late to fix it, before 3.13 is official. Fixes: 62748f32d501 ("net: introduce SO_MAX_PACING_RATE") Signed-off-by: Eric Dumazet Signed-off-by: David S. Miller --- arch/parisc/include/uapi/asm/socket.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/parisc/include/uapi/asm/socket.h b/arch/parisc/include/uapi/asm/socket.h index f33113a6141e..70b3674dac4e 100644 --- a/arch/parisc/include/uapi/asm/socket.h +++ b/arch/parisc/include/uapi/asm/socket.h @@ -75,6 +75,6 @@ #define SO_BUSY_POLL 0x4027 -#define SO_MAX_PACING_RATE 0x4048 +#define SO_MAX_PACING_RATE 0x4028 #endif /* _UAPI_ASM_SOCKET_H */ From 3af57f78c38131b7a66e2b01e06fdacae01992a3 Mon Sep 17 00:00:00 2001 From: Heiko Carstens Date: Fri, 17 Jan 2014 09:37:15 +0100 Subject: [PATCH 51/55] s390/bpf,jit: fix 32 bit divisions, use unsigned divide instructions The s390 bpf jit compiler emits the signed divide instructions "dr" and "d" for unsigned divisions. This can cause problems: the dividend will be zero extended to a 64 bit value and the divisor is the 32 bit signed value as specified A or X accumulator, even though A and X are supposed to be treated as unsigned values. The divide instrunctions will generate an exception if the result cannot be expressed with a 32 bit signed value. This is the case if e.g. the dividend is 0xffffffff and the divisor either 1 or also 0xffffffff (signed: -1). To avoid all these issues simply use unsigned divide instructions. Signed-off-by: Heiko Carstens Signed-off-by: David S. Miller --- arch/s390/net/bpf_jit_comp.c | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/arch/s390/net/bpf_jit_comp.c b/arch/s390/net/bpf_jit_comp.c index fc0fa77728e1..708d60e40066 100644 --- a/arch/s390/net/bpf_jit_comp.c +++ b/arch/s390/net/bpf_jit_comp.c @@ -368,16 +368,16 @@ static int bpf_jit_insn(struct bpf_jit *jit, struct sock_filter *filter, EMIT4_PCREL(0xa7840000, (jit->ret0_ip - jit->prg)); /* lhi %r4,0 */ EMIT4(0xa7480000); - /* dr %r4,%r12 */ - EMIT2(0x1d4c); + /* dlr %r4,%r12 */ + EMIT4(0xb997004c); break; case BPF_S_ALU_DIV_K: /* A /= K */ if (K == 1) break; /* lhi %r4,0 */ EMIT4(0xa7480000); - /* d %r4,(%r13) */ - EMIT4_DISP(0x5d40d000, EMIT_CONST(K)); + /* dl %r4,(%r13) */ + EMIT6_DISP(0xe340d000, 0x0097, EMIT_CONST(K)); break; case BPF_S_ALU_MOD_X: /* A %= X */ jit->seen |= SEEN_XREG | SEEN_RET0; @@ -387,8 +387,8 @@ static int bpf_jit_insn(struct bpf_jit *jit, struct sock_filter *filter, EMIT4_PCREL(0xa7840000, (jit->ret0_ip - jit->prg)); /* lhi %r4,0 */ EMIT4(0xa7480000); - /* dr %r4,%r12 */ - EMIT2(0x1d4c); + /* dlr %r4,%r12 */ + EMIT4(0xb997004c); /* lr %r5,%r4 */ EMIT2(0x1854); break; @@ -400,8 +400,8 @@ static int bpf_jit_insn(struct bpf_jit *jit, struct sock_filter *filter, } /* lhi %r4,0 */ EMIT4(0xa7480000); - /* d %r4,(%r13) */ - EMIT4_DISP(0x5d40d000, EMIT_CONST(K)); + /* dl %r4,(%r13) */ + EMIT6_DISP(0xe340d000, 0x0097, EMIT_CONST(K)); /* lr %r5,%r4 */ EMIT2(0x1854); break; From 2b844ba79f4a114bd228ad6fee040ffd99a0963d Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Fri, 17 Jan 2014 14:23:29 +0100 Subject: [PATCH 52/55] Revert "ACPI: Add BayTrail SoC GPIO and LPSS ACPI IDs" This reverts commit f6308b36c411 (ACPI: Add BayTrail SoC GPIO and LPSS ACPI IDs), because it causes the Alan Cox' ASUS T100TA to "crash and burn" during boot if the Baytrail pinctrl driver is compiled in. Fixes: f6308b36c411 (ACPI: Add BayTrail SoC GPIO and LPSS ACPI IDs) Reported-by: One Thousand Gnomes Requested-by: Linus Walleij Signed-off-by: Rafael J. Wysocki --- drivers/acpi/acpi_lpss.c | 1 - drivers/pinctrl/pinctrl-baytrail.c | 1 - 2 files changed, 2 deletions(-) diff --git a/drivers/acpi/acpi_lpss.c b/drivers/acpi/acpi_lpss.c index e60390597372..6745fe137b9e 100644 --- a/drivers/acpi/acpi_lpss.c +++ b/drivers/acpi/acpi_lpss.c @@ -162,7 +162,6 @@ static const struct acpi_device_id acpi_lpss_device_ids[] = { { "80860F14", (unsigned long)&byt_sdio_dev_desc }, { "80860F41", (unsigned long)&byt_i2c_dev_desc }, { "INT33B2", }, - { "INT33FC", }, { "INT3430", (unsigned long)&lpt_dev_desc }, { "INT3431", (unsigned long)&lpt_dev_desc }, diff --git a/drivers/pinctrl/pinctrl-baytrail.c b/drivers/pinctrl/pinctrl-baytrail.c index 114f5ef4b73a..2832576d8b12 100644 --- a/drivers/pinctrl/pinctrl-baytrail.c +++ b/drivers/pinctrl/pinctrl-baytrail.c @@ -512,7 +512,6 @@ static const struct dev_pm_ops byt_gpio_pm_ops = { static const struct acpi_device_id byt_gpio_acpi_match[] = { { "INT33B2", 0 }, - { "INT33FC", 0 }, { } }; MODULE_DEVICE_TABLE(acpi, byt_gpio_acpi_match); From 72de182362e013b2c2cc92092d97fff58e429d5d Mon Sep 17 00:00:00 2001 From: Ilia Mirkin Date: Sun, 19 Jan 2014 10:30:32 -0500 Subject: [PATCH 53/55] drm/nouveau/mxm: fix null deref on load Since commit 61b365a505d6 ("drm/nouveau: populate master subdev pointer only when fully constructed"), the nouveau_mxm(bios) call will return NULL, since it's still being called from the constructor. Instead, pass the mxm pointer via the unused data field. See https://bugs.freedesktop.org/show_bug.cgi?id=73791 Reported-by: Andreas Reis Tested-by: Andreas Reis Signed-off-by: Ilia Mirkin Cc: Ben Skeggs Cc: Dave Airlie Signed-off-by: Linus Torvalds --- drivers/gpu/drm/nouveau/core/subdev/mxm/nv50.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/nouveau/core/subdev/mxm/nv50.c b/drivers/gpu/drm/nouveau/core/subdev/mxm/nv50.c index af129c2e8113..64f8b4702bf7 100644 --- a/drivers/gpu/drm/nouveau/core/subdev/mxm/nv50.c +++ b/drivers/gpu/drm/nouveau/core/subdev/mxm/nv50.c @@ -100,7 +100,7 @@ mxm_match_dcb(struct nouveau_mxm *mxm, u8 *data, void *info) static int mxm_dcb_sanitise_entry(struct nouveau_bios *bios, void *data, int idx, u16 pdcb) { - struct nouveau_mxm *mxm = nouveau_mxm(bios); + struct nouveau_mxm *mxm = data; struct context ctx = { .outp = (u32 *)(bios->data + pdcb) }; u8 type, i2cidx, link, ver, len; u8 *conn; @@ -199,7 +199,7 @@ mxm_dcb_sanitise(struct nouveau_mxm *mxm) return; } - dcb_outp_foreach(bios, NULL, mxm_dcb_sanitise_entry); + dcb_outp_foreach(bios, mxm, mxm_dcb_sanitise_entry); mxms_foreach(mxm, 0x01, mxm_show_unmatched, NULL); } From d8ec26d7f8287f5788a494f56e8814210f0e64be Mon Sep 17 00:00:00 2001 From: Linus Torvalds Date: Sun, 19 Jan 2014 18:40:07 -0800 Subject: [PATCH 54/55] Linux 3.13 --- Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Makefile b/Makefile index eeec740776f3..b8b7f74696b4 100644 --- a/Makefile +++ b/Makefile @@ -1,7 +1,7 @@ VERSION = 3 PATCHLEVEL = 13 SUBLEVEL = 0 -EXTRAVERSION = -rc8 +EXTRAVERSION = NAME = One Giant Leap for Frogkind # *DOCUMENTATION* From bed86f15bdc23436fb30d09e2faa3dfb7d3834e1 Mon Sep 17 00:00:00 2001 From: Russell King Date: Mon, 27 Jan 2014 23:33:11 +0000 Subject: [PATCH 55/55] DRM: armada: fix missing DRM_KMS_FB_HELPER select Commit 92b6f89f6b8f (drm: Add separate Kconfig option for fbdev helpers) happened in parallel with the inclusion of Armada DRM into mainline, and so missed this update. Add the necessary select statement to avoid build errors. Signed-off-by: Russell King --- drivers/gpu/drm/armada/Kconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/gpu/drm/armada/Kconfig b/drivers/gpu/drm/armada/Kconfig index 40d371521fe1..50ae88ad4d76 100644 --- a/drivers/gpu/drm/armada/Kconfig +++ b/drivers/gpu/drm/armada/Kconfig @@ -5,6 +5,7 @@ config DRM_ARMADA select FB_CFB_COPYAREA select FB_CFB_IMAGEBLIT select DRM_KMS_HELPER + select DRM_KMS_FB_HELPER help Support the "LCD" controllers found on the Marvell Armada 510 devices. There are two controllers on the device, each controller