Commit Graph

24756 Commits

Author SHA1 Message Date
Steven Rostedt (VMware)
4dce17b26b Merge commit '0f17976568b3f72e676450af0c0db6f8752253d6' into trace/ftrace/core
Need to get the changes from 0f17976568 ("ftrace: Fix regression with
module command in stack_trace_filter") as it is required to fix some other
changes with stack_trace_filter and the new development code.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-07-05 09:51:24 -04:00
Steven Rostedt (VMware)
0f17976568 ftrace: Fix regression with module command in stack_trace_filter
When doing the following command:

 # echo ":mod:kvm_intel" > /sys/kernel/tracing/stack_trace_filter

it triggered a crash.

This happened with the clean up of probes. It required all callers to the
regex function (doing ftrace filtering) to have ops->private be a pointer to
a trace_array. But for the stack tracer, that is not the case.

Allow for the ops->private to be NULL, and change the function command
callbacks to handle the trace_array pointer being NULL as well.

Fixes: d2afd57a4b ("tracing/ftrace: Allow instances to have their own function probes")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-29 10:05:45 -04:00
Steven Rostedt (VMware)
4ec7846785 ftrace: Decrement count for dyn_ftrace_total_info for init functions
Init boot up functions may be traced, but they are also freed when the
kernel finishes booting. These are removed from the ftrace tables, and the
debug variable for dyn_ftrace_total_info needs to reflect that as well.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-28 11:57:03 -04:00
Steven Rostedt (VMware)
3b58a3c72f ftrace: Unlock hash mutex on failed allocation in process_mod_list()
If the new_hash fails to allocate, then unlock the hash mutex on error.

Reported-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-28 09:09:38 -04:00
Joel Fernandes
441dae8f2f tracing: Add support for display of tgid in trace output
Earlier patches introduced ability to record the tgid using the 'record-tgid'
option. Here we read the tgid and output it if the option is enabled.

Link: http://lkml.kernel.org/r/20170626053844.5746-3-joelaf@google.com

Cc: kernel-team@android.com
Cc: Ingo Molnar <mingo@redhat.com>
Tested-by: Michael Sartain <mikesart@gmail.com>
Signed-off-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-27 13:30:28 -04:00
Joel Fernandes
d914ba37d7 tracing: Add support for recording tgid of tasks
Inorder to support recording of tgid, the following changes are made:

* Introduce a new API (tracing_record_taskinfo) to additionally record the tgid
  along with the task's comm at the same time. This has has the benefit of not
  setting trace_cmdline_save before all the information for a task is saved.
* Add a new API tracing_record_taskinfo_sched_switch to record task information
  for 2 tasks at a time (previous and next) and use it from sched_switch probe.
* Preserve the old API (tracing_record_cmdline) and create it as a wrapper
  around the new one so that existing callers aren't affected.
* Reuse the existing sched_switch and sched_wakeup probes to record tgid
  information and add a new option 'record-tgid' to enable recording of tgid

When record-tgid option isn't enabled to being with, we take care to make sure
that there's isn't memory or runtime overhead.

Link: http://lkml.kernel.org/r/20170627020155.5139-1-joelaf@google.com

Cc: kernel-team@android.com
Cc: Ingo Molnar <mingo@redhat.com>
Tested-by: Michael Sartain <mikesart@gmail.com>
Signed-off-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-27 13:30:28 -04:00
Steven Rostedt (VMware)
83dd14933e ftrace: Decrement count for dyn_ftrace_total_info file
The dyn_ftrace_total_info file is used to show how many functions have been
converted into nops and can be used by ftrace. The problem is that it does
not get decremented when functions are removed (init boot code being freed,
and modules being freed). That means the number is very inaccurate everytime
functions are removed from the ftrace tables. Decrement it when functions
are removed.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-27 13:30:27 -04:00
Steven Rostedt (VMware)
6a9c981b1e ftrace: Remove unused function ftrace_arch_read_dyn_info()
ftrace_arch_read_dyn_info() was used so that archs could add its own debug
information into the dyn_ftrace_total_info in the tracefs file system. That
file is for debugging usage of dynamic ftrace. No arch uses that function
anymore, so just get rid of it.

This also allows for tracing_read_dyn_info() to be cleaned up a bit.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-27 13:30:22 -04:00
Steven Rostedt (VMware)
8c08f0d5c6 ftrace: Have cached module filters be an active filter
When a module filter is added to set_ftrace_filter, if the module is not
loaded, it is cached. This should be considered an active filter, and
function tracing should be filtered by this. That is, if a cached module
filter is the only filter set, then no function tracing should be happening,
as all the functions available will be filtered out.

This makes sense, as the reason to add a cached module filter, is to trace
the module when you load it. There shouldn't be any other tracing happening
until then.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-26 11:53:04 -04:00
Steven Rostedt (VMware)
d7fbf8df7c ftrace: Implement cached modules tracing on module load
If a module is cached in the set_ftrace_filter, and that module is loaded,
then enable tracing on that module as if the cached module text was written
into set_ftrace_filter just as the module is loaded.

  # echo ":mod:kvm_intel" >
  # cat /sys/kernel/tracing/set_ftrace_filter
 #### all functions enabled ####
 :mod:kvm_intel
  # modprobe kvm_intel
  # cat /sys/kernel/tracing/set_ftrace_filter
 vmx_get_rflags [kvm_intel]
 vmx_get_pkru [kvm_intel]
 vmx_get_interrupt_shadow [kvm_intel]
 vmx_rdtscp_supported [kvm_intel]
 vmx_invpcid_supported [kvm_intel]
 [..]

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-26 11:53:03 -04:00
Steven Rostedt (VMware)
5985ea8bd5 ftrace: Have the cached module list show in set_ftrace_filter
When writing in a module filter into set_ftrace_filter for a module that is
not yet loaded, it it cached, and will be executed when the module is loaded
(although that is not implemented yet at this commit). Display the list of
cached modules to be traced.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-26 11:53:02 -04:00
Steven Rostedt (VMware)
673feb9d76 ftrace: Add :mod: caching infrastructure to trace_array
This is the start of the infrastructure work to allow for tracing module
functions before it is loaded.

Currently the following command:

  # echo :mod:some-mod > set_ftrace_filter

will enable tracing of all functions within the module "some-mod" if it is
loaded. What we want, is if the module is not loaded, that line will be
saved. When the module is loaded, then the "some-mod" will have that line
executed on it, so that the functions within it starts being traced.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-26 11:53:02 -04:00
Steven Rostedt (VMware)
feaf1283d1 tracing: Show address when function names are not found
Currently, when a function is not found in kallsyms, instead of simply
showing the function address, it shows nothing at all:

 # echo ':mod:kvm_intel' > /sys/kernel/tracing/set_ftrace_filter
 # echo function > /sys/kernel/tracing/set_ftrace_filter
 # qemu -enable-kvm /home/my-qemu-image
   <Ctrl-C>
 # rmmod kvm_intel
 # cat /sys/kernel/tracing/trace
 qemu-system-x86-2408  [001] d..2   135.013238:  <-kvm_arch_hardware_enable
 qemu-system-x86-2408  [001] ....   135.014574:  <-kvm_arch_vm_ioctl
 qemu-system-x86-2408  [001] ....   135.015420:  <-kvm_vm_ioctl_check_extension
 qemu-system-x86-2408  [001] ....   135.045411:  <-__do_cpuid_ent
 qemu-system-x86-2408  [001] ....   135.045412:  <-__do_cpuid_ent
 qemu-system-x86-2408  [001] ....   135.045412:  <-__do_cpuid_ent
 qemu-system-x86-2408  [001] ....   135.045412:  <-__do_cpuid_ent
 qemu-system-x86-2408  [001] ...1   135.045413:  <-__do_cpuid_ent
 qemu-system-x86-2408  [001] ....   135.045413:  <-__do_cpuid_ent

When it should show:

 qemu-system-x86-2408  [001] d..2   135.013238: 0xffffffffa02a39f0 <-kvm_arch_hardware_enable
 qemu-system-x86-2408  [001] ....   135.014574: 0xffffffffa02a2ba0 <-kvm_arch_vm_ioctl
 qemu-system-x86-2408  [001] ....   135.015420: 0xffffffffa029e4e0 <-kvm_vm_ioctl_check_extension
 qemu-system-x86-2408  [001] ....   135.045411: 0xffffffffa02a1380 <-__do_cpuid_ent
 qemu-system-x86-2408  [001] ....   135.045412: 0xffffffffa029e160 <-__do_cpuid_ent
 qemu-system-x86-2408  [001] ....   135.045412: 0xffffffffa029e180 <-__do_cpuid_ent
 qemu-system-x86-2408  [001] ....   135.045412: 0xffffffffa029e520 <-__do_cpuid_ent
 qemu-system-x86-2408  [001] ...1   135.045413: 0xffffffffa02a13b0 <-__do_cpuid_ent
 qemu-system-x86-2408  [001] ....   135.045413: 0xffffffffa02a1380 <-__do_cpuid_ent

instead.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-22 17:10:06 -04:00
Jeremy Linton
681bec0367 tracing: Rename update the enum_map file
The enum_map file is used to display a list of symbol
to name conversions. As its now used to resolve sizeof
lets update the name and description.

Link: http://lkml.kernel.org/r/20170531215653.3240-13-jeremy.linton@arm.com

Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-13 17:13:06 -04:00
Jeremy Linton
67ec0d8595 tracing: Rename enum_replace to eval_replace
The enum_replace stanza works as is for sizeof()
calls as well as enums. Rename it as well.

Link: http://lkml.kernel.org/r/20170531215653.3240-9-jeremy.linton@arm.com

Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-13 17:10:34 -04:00
Jeremy Linton
f57a41434f trace: rename enum_map functions
Rename the core trace enum routines to use eval, to
reflect their use by more than just enum to value mapping.

Link: http://lkml.kernel.org/r/20170531215653.3240-8-jeremy.linton@arm.com

Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-13 17:10:23 -04:00
Jeremy Linton
5f60b351a7 trace: rename trace.c enum functions
Rename the init and trace_enum_jmp_to_tail() routines
to reflect their use by more than enumerated types.

Link: http://lkml.kernel.org/r/20170531215653.3240-7-jeremy.linton@arm.com

Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-13 17:10:12 -04:00
Jeremy Linton
1793ed939b trace: rename trace_enum_mutex to trace_eval_mutex
There is a lock protecting the trace_enum_map, rename
it to reflect the use by more than enums.

Link: http://lkml.kernel.org/r/20170531215653.3240-6-jeremy.linton@arm.com

Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-13 17:10:01 -04:00
Jeremy Linton
23bf8cb8dc trace: rename trace enum data structures in trace.c
The enum map entries can be exported to userspace
via a sys enum_map file. Rename those functions
and structures to reflect the fact that we are using
them for more than enums.

Link: http://lkml.kernel.org/r/20170531215653.3240-5-jeremy.linton@arm.com

Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-13 17:09:50 -04:00
Jeremy Linton
99be647c58 trace: rename struct module entry for trace enums
Each module has a list of enum's its contributing to the
enum map, rename that entry to reflect its use by more than
enums.

Link: http://lkml.kernel.org/r/20170531215653.3240-4-jeremy.linton@arm.com

Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-13 17:09:31 -04:00
Jeremy Linton
00f4b652b6 trace: rename trace_enum_map to trace_eval_map
Each enum is loaded into the trace_enum_map, as we
are now using this for more than enums rename it.

Link: http://lkml.kernel.org/r/20170531215653.3240-3-jeremy.linton@arm.com

Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-13 17:08:57 -04:00
Jeremy Linton
02fd7f68f5 trace: rename kernel enum section to eval
The kernel and its modules have sections containing the enum
string to value conversions. Rename this section because we
intend to store more than enums in it.

Link: http://lkml.kernel.org/r/20170531215653.3240-2-jeremy.linton@arm.com

Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-13 17:08:46 -04:00
Joel Fernandes
6a5ae63a0c tracing: Remove unused declaration of trace_stop_cmdline_recording
trace_stop_cmdline_recording declaration isn't in use, remove it.

Link: http://lkml.kernel.org/r/20170609025327.9508-2-joelaf@google.com

Cc: kernel-team@android.com
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-13 09:40:52 -04:00
Linus Torvalds
45b44f0f28 Merge branch 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull CPU hotplug fix from Ingo Molnar:
 "An error handling corner case fix"

* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  cpu/hotplug: Drop the device lock on error
2017-06-10 10:49:42 -07:00
Linus Torvalds
6b7ed4588c Merge branch 'rcu-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU fixes from Ingo Molnar:
 "Fix an SRCU bug affecting KVM IRQ injection"

* 'rcu-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  srcu: Allow use of Classic SRCU from both process and interrupt context
  srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context
2017-06-10 10:22:35 -07:00
Linus Torvalds
f701d860af Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "This is mostly tooling fixes, plus an instruction pointer filtering
  fix.

  It's more fixes than usual - Arnaldo got back from a longer vacation
  and there was a backlog"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
  perf symbols: Kill dso__build_id_is_kmod()
  perf symbols: Keep DSO->symtab_type after decompress
  perf tests: Decompress kernel module before objdump
  perf tools: Consolidate error path in __open_dso()
  perf tools: Decompress kernel module when reading DSO data
  perf annotate: Use dso__decompress_kmodule_path()
  perf tools: Introduce dso__decompress_kmodule_{fd,path}
  perf tools: Fix a memory leak in __open_dso()
  perf annotate: Fix symbolic link of build-id cache
  perf/core: Drop kernel samples even though :u is specified
  perf script python: Remove dups in documentation examples
  perf script python: Updated trace_unhandled() signature
  perf script python: Fix wrong code snippets in documentation
  perf script: Fix documentation errors
  perf script: Fix outdated comment for perf-trace-python
  perf probe: Fix examples section of documentation
  perf report: Ensure the perf DSO mapping matches what libdw sees
  perf report: Include partial stacks unwound with libdw
  perf annotate: Add missing powerpc triplet
  perf test: Disable breakpoint signal tests for powerpc
  ...
2017-06-10 10:15:47 -07:00
Ingo Molnar
8affb06737 Merge branch 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into rcu/urgent
Pull RCU fix from Paul E. McKenney:

" This series enables srcu_read_lock() and srcu_read_unlock() to be used from
  interrupt handlers, which fixes a bug in KVM's use of SRCU in delivery
  of interrupts to guest OSes. "

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-09 08:17:10 +02:00
Linus Torvalds
0d22df90c7 Power management fixes for v4.12-rc5
- Revert a recent commit that attempted to avoid spurious wakeups
    from suspend-to-idle via ACPI SCI, but introduced regressions on
    some systems (Rafael Wysocki).
 
    We will get back to the problem it tried to address in the next
    cycle.
 
  - Fix a possible division by 0 during intel_pstate initialization
    due to a missing check (Rafael Wysocki).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJZOeX+AAoJEILEb/54YlRxYksQAIczTx5toDqcmMbyXUbS/KBp
 /kQK5iwGo9WV5V4ZsWugmwrTY9R4xwBdeBow/nacO5NGVtjYix95wTlGlw84RLIi
 2QO4SzN0ZKqh8DUxvjro/exIe+YPqX4Bodvo9CX4hr60xjutcZRVz5Dq6ZS80KPw
 o0/fWUlQzT1BM9IorfJ0YT2XEdMpdVbPWrwTpm8l2G42vWXSQwsd6fFnOLpTrd46
 580JAH5Dux6YNMsSejTazQon/3P0sChYxbJkpm6nvv819EMbFDy8p+ebIgceBlos
 l6Zlckd7ETwDwL3G3OGi5/Zcpb/YMg5Slm3+IGM/J5ccVIfdG8gjqTJklrxH7I4S
 /+0MzwlXUbRYEvurB6nsP3kIofrZN+t+c609ewmIFLy2QIDJF9BiVhKnRrjNfsuU
 KrY0zFATtxGy/0CfkZCmSWZBid06tAQ0ZZ1dYkO/1Qf5dn1ge+Yr8tcc0WKqJqbq
 NR+BfsCVa84b6s+uBsMdKR6kAg0tz7uG5njXlH7bupH3ZCtttbbWp0znmd/ow/jU
 usJyEHStvzhjC4T9s0tRzEi96B2/MlsGYmL+qq9GBScdhYKc6K/4xzdUkp+yGQwD
 311sheKvQZ06kwj0v+hK1aBOH2y3pjcBMvKjCdr/IiMtX8/kD0tsKx1+0QESR91D
 H80L6EjjEAUm1KEsQcpy
 =oWdX
 -----END PGP SIGNATURE-----

Merge tag 'pm-4.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "These revert one problematic commit related to system sleep and fix
  one recent intel_pstate regression.

  Specifics:

   - Revert a recent commit that attempted to avoid spurious wakeups
     from suspend-to-idle via ACPI SCI, but introduced regressions on
     some systems (Rafael Wysocki).

     We will get back to the problem it tried to address in the next
     cycle.

   - Fix a possible division by 0 during intel_pstate initialization
     due to a missing check (Rafael Wysocki)"

* tag 'pm-4.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  Revert "ACPI / sleep: Ignore spurious SCI wakeups from suspend-to-idle"
  cpufreq: intel_pstate: Avoid division by 0 in min_perf_pct_min()
2017-06-08 17:40:32 -07:00
Rafael J. Wysocki
fbd78afe34 Merge branches 'intel_pstate' and 'pm-sleep'
* intel_pstate:
  cpufreq: intel_pstate: Avoid division by 0 in min_perf_pct_min()

* pm-sleep:
  Revert "ACPI / sleep: Ignore spurious SCI wakeups from suspend-to-idle"
2017-06-09 01:25:16 +02:00
Linus Torvalds
dc0cf5a77d Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk
Pull printk fix from Petr Mladek:
 "This reverts a fix added into 4.12-rc1. It caused the kernel log to be
  printed on another console when two consoles of the same type were
  defined, e.g. console=ttyS0 console=ttyS1.

  This configuration was never supported by kernel itself, but it
  started to make sense with systemd. In other words, the commit broke
  userspace"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk:
  Revert "printk: fix double printing with earlycon"
2017-06-08 10:50:04 -07:00
Paolo Bonzini
1123a60416 srcu: Allow use of Classic SRCU from both process and interrupt context
Linu Cherian reported a WARN in cleanup_srcu_struct() when shutting
down a guest running iperf on a VFIO assigned device.  This happens
because irqfd_wakeup() calls srcu_read_lock(&kvm->irq_srcu) in interrupt
context, while a worker thread does the same inside kvm_set_irq().  If the
interrupt happens while the worker thread is executing __srcu_read_lock(),
updates to the Classic SRCU ->lock_count[] field or the Tree SRCU
->srcu_lock_count[] field can be lost.

The docs say you are not supposed to call srcu_read_lock() and
srcu_read_unlock() from irq context, but KVM interrupt injection happens
from (host) interrupt context and it would be nice if SRCU supported the
use case.  KVM is using SRCU here not really for the "sleepable" part,
but rather due to its IPI-free fast detection of grace periods.  It is
therefore not desirable to switch back to RCU, which would effectively
revert commit 719d93cd5f ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING",
2014-01-16).

However, the docs are overly conservative.  You can have an SRCU instance
only has users in irq context, and you can mix process and irq context
as long as process context users disable interrupts.  In addition,
__srcu_read_unlock() actually uses this_cpu_dec() on both Tree SRCU and
Classic SRCU.  For those two implementations, only srcu_read_lock()
is unsafe.

When Classic SRCU's __srcu_read_unlock() was changed to use this_cpu_dec(),
in commit 5a41344a3d ("srcu: Simplify __srcu_read_unlock() via
this_cpu_dec()", 2012-11-29), __srcu_read_lock() did two increments.
Therefore it kept __this_cpu_inc(), with preempt_disable/enable in
the caller.  Tree SRCU however only does one increment, so on most
architectures it is more efficient for __srcu_read_lock() to use
this_cpu_inc(), and any performance differences appear to be down in
the noise.

Cc: stable@vger.kernel.org
Fixes: 719d93cd5f ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING")
Reported-by: Linu Cherian <linuc.decode@gmail.com>
Suggested-by: Linu Cherian <linuc.decode@gmail.com>
Cc: kvm@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 08:25:19 -07:00
Paolo Bonzini
cdf7abc461 srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context
Linu Cherian reported a WARN in cleanup_srcu_struct() when shutting
down a guest running iperf on a VFIO assigned device.  This happens
because irqfd_wakeup() calls srcu_read_lock(&kvm->irq_srcu) in interrupt
context, while a worker thread does the same inside kvm_set_irq().  If the
interrupt happens while the worker thread is executing __srcu_read_lock(),
updates to the Classic SRCU ->lock_count[] field or the Tree SRCU
->srcu_lock_count[] field can be lost.

The docs say you are not supposed to call srcu_read_lock() and
srcu_read_unlock() from irq context, but KVM interrupt injection happens
from (host) interrupt context and it would be nice if SRCU supported the
use case.  KVM is using SRCU here not really for the "sleepable" part,
but rather due to its IPI-free fast detection of grace periods.  It is
therefore not desirable to switch back to RCU, which would effectively
revert commit 719d93cd5f ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING",
2014-01-16).

However, the docs are overly conservative.  You can have an SRCU instance
only has users in irq context, and you can mix process and irq context
as long as process context users disable interrupts.  In addition,
__srcu_read_unlock() actually uses this_cpu_dec() on both Tree SRCU and
Classic SRCU.  For those two implementations, only srcu_read_lock()
is unsafe.

When Classic SRCU's __srcu_read_unlock() was changed to use this_cpu_dec(),
in commit 5a41344a3d ("srcu: Simplify __srcu_read_unlock() via
this_cpu_dec()", 2012-11-29), __srcu_read_lock() did two increments.
Therefore it kept __this_cpu_inc(), with preempt_disable/enable in
the caller.  Tree SRCU however only does one increment, so on most
architectures it is more efficient for __srcu_read_lock() to use
this_cpu_inc(), and any performance differences appear to be down in
the noise.

Unlike Classic and Tree SRCU, Tiny SRCU does increments and decrements on
a single variable.  Therefore, as Peter Zijlstra pointed out, Tiny SRCU's
implementation already supports mixed-context use of srcu_read_lock()
and srcu_read_unlock(), at least as long as uses of srcu_read_lock()
and srcu_read_unlock() in each handler are nested and paired properly.
In other words, it is still illegal to (say) invoke srcu_read_lock()
in an interrupt handler and to invoke the matching srcu_read_unlock()
in a softirq handler.  Therefore, the only change required for Tiny SRCU
is to its comments.

Fixes: 719d93cd5f ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING")
Reported-by: Linu Cherian <linuc.decode@gmail.com>
Suggested-by: Linu Cherian <linuc.decode@gmail.com>
Cc: kvm@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Paolo Bonzini <pbonzini@redhat.com>
2017-06-08 08:24:26 -07:00
Petr Mladek
dac8bbbae1 Revert "printk: fix double printing with earlycon"
This reverts commit cf39bf58af.

The commit regression to users that define both console=ttyS1
and console=ttyS0 on the command line, see
https://lkml.kernel.org/r/20170509082915.GA13236@bistromath.localdomain

The kernel log messages always appeared only on one serial port. It is
even documented in Documentation/admin-guide/serial-console.rst:

"Note that you can only define one console per device type (serial,
video)."

The above mentioned commit changed the order in which the command line
parameters are searched. As a result, the kernel log messages go to
the last mentioned ttyS* instead of the first one.

We long thought that using two console=ttyS* on the command line
did not make sense. But then we realized that console= parameters
were handled also by systemd, see
http://0pointer.de/blog/projects/serial-console.html

"By default systemd will instantiate one serial-getty@.service on
the main kernel console, if it is not a virtual terminal."

where

"[4] If multiple kernel consoles are used simultaneously, the main
console is the one listed first in /sys/class/tty/console/active,
which is the last one listed on the kernel command line."

This puts the original report into another light. The system is running
in qemu. The first serial port is used to store the messages into a file.
The second one is used to login to the system via a socket. It depends
on systemd and the historic kernel behavior.

By other words, systemd causes that it makes sense to define both
console=ttyS1 console=ttyS0 on the command line. The kernel fix
caused regression related to userspace (systemd) and need to be
reverted.

In addition, it went out that the fix helped only partially.
The messages still were duplicated when the boot console was
removed early by late_initcall(printk_late_init). Then the entire
log was replayed when the same console was registered as a normal one.

Link: 20170606160339.GC7604@pathway.suse.cz
Cc: Aleksey Makarov <aleksey.makarov@linaro.org>
Cc: Sabrina Dubroca <sd@queasysnail.net>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Jiri Slaby <jslaby@suse.com>
Cc: Robin Murphy <robin.murphy@arm.com>,
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "Nair, Jayachandran" <Jayachandran.Nair@cavium.com>
Cc: linux-serial@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reported-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2017-06-08 12:06:00 +02:00
Jin Yao
cc1582c231 perf/core: Drop kernel samples even though :u is specified
When doing sampling, for example:

  perf record -e cycles:u ...

On workloads that do a lot of kernel entry/exits we see kernel
samples, even though :u is specified. This is due to skid existing.

This might be a security issue because it can leak kernel addresses even
though kernel sampling support is disabled.

The patch drops the kernel samples if exclude_kernel is specified.

For example, test on Haswell desktop:

  perf record -e cycles:u <mgen>
  perf report --stdio

Before patch applied:

    99.77%  mgen     mgen              [.] buf_read
     0.20%  mgen     mgen              [.] rand_buf_init
     0.01%  mgen     [kernel.vmlinux]  [k] apic_timer_interrupt
     0.00%  mgen     mgen              [.] last_free_elem
     0.00%  mgen     libc-2.23.so      [.] __random_r
     0.00%  mgen     libc-2.23.so      [.] _int_malloc
     0.00%  mgen     mgen              [.] rand_array_init
     0.00%  mgen     [kernel.vmlinux]  [k] page_fault
     0.00%  mgen     libc-2.23.so      [.] __random
     0.00%  mgen     libc-2.23.so      [.] __strcasestr
     0.00%  mgen     ld-2.23.so        [.] strcmp
     0.00%  mgen     ld-2.23.so        [.] _dl_start
     0.00%  mgen     libc-2.23.so      [.] sched_setaffinity@@GLIBC_2.3.4
     0.00%  mgen     ld-2.23.so        [.] _start

We can see kernel symbols apic_timer_interrupt and page_fault.

After patch applied:

    99.79%  mgen     mgen           [.] buf_read
     0.19%  mgen     mgen           [.] rand_buf_init
     0.00%  mgen     libc-2.23.so   [.] __random_r
     0.00%  mgen     mgen           [.] rand_array_init
     0.00%  mgen     mgen           [.] last_free_elem
     0.00%  mgen     libc-2.23.so   [.] vfprintf
     0.00%  mgen     libc-2.23.so   [.] rand
     0.00%  mgen     libc-2.23.so   [.] __random
     0.00%  mgen     libc-2.23.so   [.] _int_malloc
     0.00%  mgen     libc-2.23.so   [.] _IO_doallocbuf
     0.00%  mgen     ld-2.23.so     [.] do_lookup_x
     0.00%  mgen     ld-2.23.so     [.] open_verify.constprop.7
     0.00%  mgen     ld-2.23.so     [.] _dl_important_hwcaps
     0.00%  mgen     libc-2.23.so   [.] sched_setaffinity@@GLIBC_2.3.4
     0.00%  mgen     ld-2.23.so     [.] _start

There are only userspace symbols.

Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: acme@kernel.org
Cc: jolsa@kernel.org
Cc: kan.liang@intel.com
Cc: mark.rutland@arm.com
Cc: will.deacon@arm.com
Cc: yao.jin@intel.com
Link: http://lkml.kernel.org/r/1495706947-3744-1-git-send-email-yao.jin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08 10:11:50 +02:00
Rafael J. Wysocki
f3b7eaae1b Revert "ACPI / sleep: Ignore spurious SCI wakeups from suspend-to-idle"
Revert commit eed4d47efe (ACPI / sleep: Ignore spurious SCI wakeups
from suspend-to-idle) as it turned out to be premature and triggered
a number of different issues on various systems.

That includes, but is not limited to, premature suspend-to-RAM aborts
on Dell XPS 13 (9343) reported by Dominik.

The issue the commit in question attempted to address is real and
will need to be taken care of going forward, but evidently more work
is needed for this purpose.

Reported-by: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-06-07 00:57:37 +02:00
Linus Torvalds
ba7b2387ad Merge branch 'for-4.12-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
 "Two cgroup fixes. One to address RCU delay of cpuset removal affecting
  userland visible behaviors. The other fixes a race condition between
  controller disable and cgroup removal"

* 'for-4.12-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cpuset: consider dying css as offline
  cgroup: Prevent kill_css() from being called more than once
2017-06-05 15:37:03 -07:00
Sebastian Andrzej Siewior
40da1b11f0 cpu/hotplug: Drop the device lock on error
If a custom CPU target is specified and that one is not available _or_
can't be interrupted then the code returns to userland without dropping a
lock as notices by lockdep:

|echo 133 > /sys/devices/system/cpu/cpu7/hotplug/target
| ================================================
| [ BUG: lock held when returning to user space! ]
| ------------------------------------------------
| bash/503 is leaving the kernel with locks still held!
| 1 lock held by bash/503:
|  #0:  (device_hotplug_lock){+.+...}, at: [<ffffffff815b5650>] lock_device_hotplug_sysfs+0x10/0x40

So release the lock then.

Fixes: 757c989b99 ("cpu/hotplug: Make target state writeable")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20170602142714.3ogo25f2wbq6fjpj@linutronix.de
2017-06-03 09:35:04 +02:00
Linus Torvalds
035f1456f9 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching
Pull livepatching fix from Jiri Kosina:
 "Kconfig dependency fix for livepatching infrastructure from Miroslav
  Benes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
  livepatch: Make livepatch dependent on !TRIM_UNUSED_KSYMS
2017-06-02 08:59:17 -07:00
Linus Torvalds
39b8ab31bc Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixlet from Thomas Gleixner:
 "Silence dmesg spam by making the posix cpu timer printks depend on
  print_fatal_signals"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  posix-timers: Make signal printks conditional
2017-05-27 09:14:24 -07:00
Linus Torvalds
805f286907 Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fix from Thomas Gleixner:
 "A fix for a state leak which was introduced in the recent rework of
  futex/rtmutex interaction"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  futex,rt_mutex: Fix rt_mutex_cleanup_proxy_lock()
2017-05-27 08:59:37 -07:00
Linus Torvalds
d024baa58a Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull kthread fix from Thomas Gleixner:
 "A single fix which prevents a use after free when kthread fork fails"

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  kthread: Fix use-after-free if kthread fork fails
2017-05-27 08:52:27 -07:00
Linus Torvalds
77d6465695 There's been a few memory issues found with ftrace.
One was simply a memory leak where not all was being freed that should
 have been in releasing a file pointer on set_graph_function.
 
 Then Thomas found that the ftrace trampolines were marked for read/write
 as well as execute. To shrink the possible attack surface, he added
 calls to set them to ro. Which also uncovered some other issues with
 freeing module allocated memory that had its permissions changed.
 
 Kprobes had a similar issue which is fixed and a selftest was added
 to trigger that issue again.
 -----BEGIN PGP SIGNATURE-----
 
 iQExBAABCAAbBQJZKOiVFBxyb3N0ZWR0QGdvb2RtaXMub3JnAAoJEMm5BfJq2Y3L
 vBoH/jxVozuAEVCv+Nbj6fhRxe4emjo0lZZb32EbEaSV/nUQGqHIZFdDQtbt+ld+
 sn06/BSMBI+L4BqLj1BCAW0e/zIn/4birIg53SX5jQwc3AlhUG7HS2d+RJZZCrp9
 Zofq9L6xZ4Hl2XjkPXqwEgtrwxQtkIPLlJqeYDJ6BVrlPfOPEwB7bfR7B684wiYT
 6h2Qo7f/ZQzgJ1sK8N2IjHEnAgE08KCYcj4IB4WHJk6SqQz3bv1Y00WBg2UQihVT
 TPPSVhYLnrSw53fxyALqZbHo2DvnQf1TnNadWxvSIpbvgm/T5GG60FDtvHgNfbwz
 yKuKAog+P9xBLkoAcfvODLY9O5s=
 =75TZ
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull ftrace fixes from Steven Rostedt:
 "There's been a few memory issues found with ftrace.

  One was simply a memory leak where not all was being freed that should
  have been in releasing a file pointer on set_graph_function.

  Then Thomas found that the ftrace trampolines were marked for
  read/write as well as execute. To shrink the possible attack surface,
  he added calls to set them to ro. Which also uncovered some other
  issues with freeing module allocated memory that had its permissions
  changed.

  Kprobes had a similar issue which is fixed and a selftest was added to
  trigger that issue again"

* tag 'trace-v4.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  x86/ftrace: Make sure that ftrace trampolines are not RWX
  x86/mm/ftrace: Do not bug in early boot on irqs_disabled in cpu_flush_range()
  selftests/ftrace: Add a testcase for many kprobe events
  kprobes/x86: Fix to set RWX bits correctly before releasing trampoline
  ftrace: Fix memory leak in ftrace_graph_release()
2017-05-27 08:30:30 -07:00
Masami Hiramatsu
c93f5cf571 kprobes/x86: Fix to set RWX bits correctly before releasing trampoline
Fix kprobes to set(recover) RWX bits correctly on trampoline
buffer before releasing it. Releasing readonly page to
module_memfree() crash the kernel.

Without this fix, if kprobes user register a bunch of kprobes
in function body (since kprobes on function entry usually
use ftrace) and unregister it, kernel hits a BUG and crash.

Link: http://lkml.kernel.org/r/149570868652.3518.14120169373590420503.stgit@devbox

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Fixes: d0381c81c2 ("kprobes/x86: Set kprobes pages read-only")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-05-26 22:37:00 -04:00
Luis Henriques
f9797c2f20 ftrace: Fix memory leak in ftrace_graph_release()
ftrace_hash is being kfree'ed in ftrace_graph_release(), however the
->buckets field is not.  This results in a memory leak that is easily
captured by kmemleak:

unreferenced object 0xffff880038afe000 (size 8192):
  comm "trace-cmd", pid 238, jiffies 4294916898 (age 9.736s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff815f561e>] kmemleak_alloc+0x4e/0xb0
    [<ffffffff8113964d>] __kmalloc+0x12d/0x1a0
    [<ffffffff810bf6d1>] alloc_ftrace_hash+0x51/0x80
    [<ffffffff810c0523>] __ftrace_graph_open.isra.39.constprop.46+0xa3/0x100
    [<ffffffff810c05e8>] ftrace_graph_open+0x68/0xa0
    [<ffffffff8114003d>] do_dentry_open.isra.1+0x1bd/0x2d0
    [<ffffffff81140df7>] vfs_open+0x47/0x60
    [<ffffffff81150f95>] path_openat+0x2a5/0x1020
    [<ffffffff81152d6a>] do_filp_open+0x8a/0xf0
    [<ffffffff811411df>] do_sys_open+0x12f/0x200
    [<ffffffff811412ce>] SyS_open+0x1e/0x20
    [<ffffffff815fa6e0>] entry_SYSCALL_64_fastpath+0x13/0x94
    [<ffffffffffffffff>] 0xffffffffffffffff

Link: http://lkml.kernel.org/r/20170525152038.7661-1-lhenriques@suse.com

Cc: stable@vger.kernel.org
Fixes: b9b0c831be ("ftrace: Convert graph filter to use hash tables")
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-05-26 22:35:48 -04:00
Miroslav Benes
5720acf4bf livepatch: Make livepatch dependent on !TRIM_UNUSED_KSYMS
If TRIM_UNUSED_KSYMS is enabled, all unneeded exported symbols are made
unexported. Two-pass build of the kernel is done to find out which
symbols are needed based on a configuration. This effectively
complicates things for out-of-tree modules.

Livepatch exports functions to (un)register and enable/disable a live
patch. The only in-tree module which uses these functions is a sample in
samples/livepatch/. If the sample is disabled, the functions are
trimmed and out-of-tree live patches cannot be built.

Note that live patches are intended to be built out-of-tree.

Suggested-by: Michal Marek <mmarek@suse.com>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Jessica Yu <jeyu@redhat.com>
Signed-off-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-05-27 00:27:37 +02:00
Linus Torvalds
6741d51699 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix state pruning in bpf verifier wrt. alignment, from Daniel
    Borkmann.

 2) Handle non-linear SKBs properly in SCTP ICMP parsing, from Davide
    Caratti.

 3) Fix bit field definitions for rss_hash_type of descriptors in mlx5
    driver, from Jesper Brouer.

 4) Defer slave->link updates until bonding is ready to do a full commit
    to the new settings, from Nithin Sujir.

 5) Properly reference count ipv4 FIB metrics to avoid use after free
    situations, from Eric Dumazet and several others including Cong Wang
    and Julian Anastasov.

 6) Fix races in llc_ui_bind(), from Lin Zhang.

 7) Fix regression of ESP UDP encapsulation for TCP packets, from
    Steffen Klassert.

 8) Fix mdio-octeon driver Kconfig deps, from Randy Dunlap.

 9) Fix regression in setting DSCP on ipv6/GRE encapsulation, from Peter
    Dawson.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (43 commits)
  ipv4: add reference counting to metrics
  net: ethernet: ax88796: don't call free_irq without request_irq first
  ip6_tunnel, ip6_gre: fix setting of DSCP on encapsulated packets
  sctp: fix ICMP processing if skb is non-linear
  net: llc: add lock_sock in llc_ui_bind to avoid a race condition
  bonding: Don't update slave->link until ready to commit
  test_bpf: Add a couple of tests for BPF_JSGE.
  bpf: add various verifier test cases
  bpf: fix wrong exposure of map_flags into fdinfo for lpm
  bpf: add bpf_clone_redirect to bpf_helper_changes_pkt_data
  bpf: properly reset caller saved regs after helper call and ld_abs/ind
  bpf: fix incorrect pruning decision when alignment must be tracked
  arp: fixed -Wuninitialized compiler warning
  tcp: avoid fastopen API to be used on AF_UNSPEC
  net: move somaxconn init from sysctl code
  net: fix potential null pointer dereference
  geneve: fix fill_info when using collect_metadata
  virtio-net: enable TSO/checksum offloads for Q-in-Q vlans
  be2net: Fix offload features for Q-in-Q packets
  vlan: Fix tcp checksum offloads in Q-in-Q vlans
  ...
2017-05-26 13:51:01 -07:00
Daniel Borkmann
a316338cb7 bpf: fix wrong exposure of map_flags into fdinfo for lpm
trie_alloc() always needs to have BPF_F_NO_PREALLOC passed in via
attr->map_flags, since it does not support preallocation yet. We
check the flag, but we never copy the flag into trie->map.map_flags,
which is later on exposed into fdinfo and used by loaders such as
iproute2. Latter uses this in bpf_map_selfcheck_pinned() to test
whether a pinned map has the same spec as the one from the BPF obj
file and if not, bails out, which is currently the case for lpm
since it exposes always 0 as flags.

Also copy over flags in array_map_alloc() and stack_map_alloc().
They always have to be 0 right now, but we should make sure to not
miss to copy them over at a later point in time when we add actual
flags for them to use.

Fixes: b95a5c4db0 ("bpf: add a longest prefix match trie map implementation")
Reported-by: Jarno Rajahalme <jarno@covalent.io>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-25 13:44:28 -04:00
Daniel Borkmann
a9789ef9af bpf: properly reset caller saved regs after helper call and ld_abs/ind
Currently, after performing helper calls, we clear all caller saved
registers, that is r0 - r5 and fill r0 depending on struct bpf_func_proto
specification. The way we reset these regs can affect pruning decisions
in later paths, since we only reset register's imm to 0 and type to
NOT_INIT. However, we leave out clearing of other variables such as id,
min_value, max_value, etc, which can later on lead to pruning mismatches
due to stale data.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-25 13:44:27 -04:00
Daniel Borkmann
1ad2f5838d bpf: fix incorrect pruning decision when alignment must be tracked
Currently, when we enforce alignment tracking on direct packet access,
the verifier lets the following program pass despite doing a packet
write with unaligned access:

  0: (61) r2 = *(u32 *)(r1 +76)
  1: (61) r3 = *(u32 *)(r1 +80)
  2: (61) r7 = *(u32 *)(r1 +8)
  3: (bf) r0 = r2
  4: (07) r0 += 14
  5: (25) if r7 > 0x1 goto pc+4
   R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0)
   R3=pkt_end R7=inv,min_value=0,max_value=1 R10=fp
  6: (2d) if r0 > r3 goto pc+1
   R0=pkt(id=0,off=14,r=14) R1=ctx R2=pkt(id=0,off=0,r=14)
   R3=pkt_end R7=inv,min_value=0,max_value=1 R10=fp
  7: (63) *(u32 *)(r0 -4) = r0
  8: (b7) r0 = 0
  9: (95) exit

  from 6 to 8:
   R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0)
   R3=pkt_end R7=inv,min_value=0,max_value=1 R10=fp
  8: (b7) r0 = 0
  9: (95) exit

  from 5 to 10:
   R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0)
   R3=pkt_end R7=inv,min_value=2 R10=fp
  10: (07) r0 += 1
  11: (05) goto pc-6
  6: safe                           <----- here, wrongly found safe
  processed 15 insns

However, if we enforce a pruning mismatch by adding state into r8
which is then being mismatched in states_equal(), we find that for
the otherwise same program, the verifier detects a misaligned packet
access when actually walking that path:

  0: (61) r2 = *(u32 *)(r1 +76)
  1: (61) r3 = *(u32 *)(r1 +80)
  2: (61) r7 = *(u32 *)(r1 +8)
  3: (b7) r8 = 1
  4: (bf) r0 = r2
  5: (07) r0 += 14
  6: (25) if r7 > 0x1 goto pc+4
   R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0)
   R3=pkt_end R7=inv,min_value=0,max_value=1
   R8=imm1,min_value=1,max_value=1,min_align=1 R10=fp
  7: (2d) if r0 > r3 goto pc+1
   R0=pkt(id=0,off=14,r=14) R1=ctx R2=pkt(id=0,off=0,r=14)
   R3=pkt_end R7=inv,min_value=0,max_value=1
   R8=imm1,min_value=1,max_value=1,min_align=1 R10=fp
  8: (63) *(u32 *)(r0 -4) = r0
  9: (b7) r0 = 0
  10: (95) exit

  from 7 to 9:
   R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0)
   R3=pkt_end R7=inv,min_value=0,max_value=1
   R8=imm1,min_value=1,max_value=1,min_align=1 R10=fp
  9: (b7) r0 = 0
  10: (95) exit

  from 6 to 11:
   R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0)
   R3=pkt_end R7=inv,min_value=2
   R8=imm1,min_value=1,max_value=1,min_align=1 R10=fp
  11: (07) r0 += 1
  12: (b7) r8 = 0
  13: (05) goto pc-7                <----- mismatch due to r8
  7: (2d) if r0 > r3 goto pc+1
   R0=pkt(id=0,off=15,r=15) R1=ctx R2=pkt(id=0,off=0,r=15)
   R3=pkt_end R7=inv,min_value=2
   R8=imm0,min_value=0,max_value=0,min_align=2147483648 R10=fp
  8: (63) *(u32 *)(r0 -4) = r0
  misaligned packet access off 2+15+-4 size 4

The reason why we fail to see it in states_equal() is that the
third test in compare_ptrs_to_packet() ...

  if (old->off <= cur->off &&
      old->off >= old->range && cur->off >= cur->range)
          return true;

... will let the above pass. The situation we run into is that
old->off <= cur->off (14 <= 15), meaning that prior walked paths
went with smaller offset, which was later used in the packet
access after successful packet range check and found to be safe
already.

For example: Given is R0=pkt(id=0,off=0,r=0). Adding offset 14
as in above program to it, results in R0=pkt(id=0,off=14,r=0)
before the packet range test. Now, testing this against R3=pkt_end
with 'if r0 > r3 goto out' will transform R0 into R0=pkt(id=0,off=14,r=14)
for the case when we're within bounds. A write into the packet
at offset *(u32 *)(r0 -4), that is, 2 + 14 -4, is valid and
aligned (2 is for NET_IP_ALIGN). After processing this with
all fall-through paths, we later on check paths from branches.
When the above skb->mark test is true, then we jump near the
end of the program, perform r0 += 1, and jump back to the
'if r0 > r3 goto out' test we've visited earlier already. This
time, R0 is of type R0=pkt(id=0,off=15,r=0), and we'll prune
that part because this time we'll have a larger safe packet
range, and we already found that with off=14 all further insn
were already safe, so it's safe as well with a larger off.
However, the problem is that the subsequent write into the packet
with 2 + 15 -4 is then unaligned, and not caught by the alignment
tracking. Note that min_align, aux_off, and aux_off_align were
all 0 in this example.

Since we cannot tell at this time what kind of packet access was
performed in the prior walk and what minimal requirements it has
(we might do so in the future, but that requires more complexity),
fix it to disable this pruning case for strict alignment for now,
and let the verifier do check such paths instead. With that applied,
the test cases pass and reject the program due to misalignment.

Fixes: d117441674 ("bpf: Track alignment of register values in the verifier.")
Reference: http://patchwork.ozlabs.org/patch/761909/
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-25 13:44:27 -04:00
Tejun Heo
41c25707d2 cpuset: consider dying css as offline
In most cases, a cgroup controller don't care about the liftimes of
cgroups.  For the controller, a css becomes online when ->css_online()
is called on it and offline when ->css_offline() is called.

However, cpuset is special in that the user interface it exposes cares
whether certain cgroups exist or not.  Combined with the RCU delay
between cgroup removal and css offlining, this can lead to user
visible behavior oddities where operations which should succeed after
cgroup removals fail for some time period.  The effects of cgroup
removals are delayed when seen from userland.

This patch adds css_is_dying() which tests whether offline is pending
and updates is_cpuset_online() so that the function returns false also
while offline is pending.  This gets rid of the userland visible
delays.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Link: http://lkml.kernel.org/r/327ca1f5-7957-fbb9-9e5f-9ba149d40ba2@oracle.com
Cc: stable@vger.kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-05-24 12:43:30 -04:00