commit 1bc7896e9e
Author: Yonghong Song <yhs@fb.com>
Date:   2020-03-05 14:02:22 -08:00

bpf: Fix deadlock with rq_lock in bpf_send_signal()
When experimenting with the bpf_send_signal() helper in our production
environment (5.2 based), we experienced a deadlock in NMI mode:
   #5 [ffffc9002219f770] queued_spin_lock_slowpath at ffffffff8110be24
   #6 [ffffc9002219f770] _raw_spin_lock_irqsave at ffffffff81a43012
   #7 [ffffc9002219f780] try_to_wake_up at ffffffff810e7ecd
   #8 [ffffc9002219f7e0] signal_wake_up_state at ffffffff810c7b55
   #9 [ffffc9002219f7f0] __send_signal at ffffffff810c8602
  #10 [ffffc9002219f830] do_send_sig_info at ffffffff810ca31a
  #11 [ffffc9002219f868] bpf_send_signal at ffffffff8119d227
  #12 [ffffc9002219f988] bpf_overflow_handler at ffffffff811d4140
  #13 [ffffc9002219f9e0] __perf_event_overflow at ffffffff811d68cf
  #14 [ffffc9002219fa10] perf_swevent_overflow at ffffffff811d6a09
  #15 [ffffc9002219fa38] ___perf_sw_event at ffffffff811e0f47
  #16 [ffffc9002219fc30] __schedule at ffffffff81a3e04d
  #17 [ffffc9002219fc90] schedule at ffffffff81a3e219
  #18 [ffffc9002219fca0] futex_wait_queue_me at ffffffff8113d1b9
  #19 [ffffc9002219fcd8] futex_wait at ffffffff8113e529
  #20 [ffffc9002219fdf0] do_futex at ffffffff8113ffbc
  #21 [ffffc9002219fec0] __x64_sys_futex at ffffffff81140d1c
  #22 [ffffc9002219ff38] do_syscall_64 at ffffffff81002602
  #23 [ffffc9002219ff50] entry_SYSCALL_64_after_hwframe at ffffffff81c00068

The above call stack is very similar to the one in the issue fixed by
commit eac9153f2b ("bpf/stackmap: Fix deadlock with rq_lock in
bpf_get_stack()") by Song Liu. The only difference is that the
bpf_send_signal() helper is involved instead of the bpf_get_stack()
helper.

The above deadlock is triggered with a perf_sw_event. Similar to
commit eac9153f2b, the below almost-identical reproducer uses the
tracepoint sched/sched_switch instead, so the issue can be easily
caught.
  /* stress_test.c */
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>    /* usleep(), close() */
  #include <time.h>      /* nanosleep(), struct timespec */
  #include <sys/mman.h>
  #include <pthread.h>
  #include <sys/types.h>
  #include <sys/stat.h>
  #include <fcntl.h>

  #define THREAD_COUNT 1000
  char *filename;
  void *worker(void *p)
  {
        void *ptr;
        int fd;
        char *pptr;

        fd = open(filename, O_RDONLY);
        if (fd < 0)
                return NULL;
        while (1) {
                struct timespec ts = {0, 1000 + rand() % 2000};

                ptr = mmap(NULL, 4096 * 64, PROT_READ, MAP_PRIVATE, fd, 0);
                usleep(1);
                if (ptr == MAP_FAILED) {
                        printf("failed to mmap\n");
                        break;
                }
                munmap(ptr, 4096 * 64);
                usleep(1);
                pptr = malloc(1);
                usleep(1);
                pptr[0] = 1;
                usleep(1);
                free(pptr);
                usleep(1);
                nanosleep(&ts, NULL);
        }
        close(fd);
        return NULL;
  }

  int main(int argc, char *argv[])
  {
        int i;
        pthread_t threads[THREAD_COUNT];

        if (argc < 2)
                return 0;

        filename = argv[1];

        for (i = 0; i < THREAD_COUNT; i++) {
                if (pthread_create(threads + i, NULL, worker, NULL)) {
                        fprintf(stderr, "Error creating thread\n");
                        return 0;
                }
        }

        for (i = 0; i < THREAD_COUNT; i++)
                pthread_join(threads[i], NULL);
        return 0;
  }
and the following steps:
  1. run `stress_test /bin/ls` in one window
  2. hack bcc's trace.py with the following change:
     --- a/tools/trace.py
     +++ b/tools/trace.py
     @@ -513,6 +513,7 @@ BPF_PERF_OUTPUT(%s);
              __data.tgid = __tgid;
              __data.pid = __pid;
              bpf_get_current_comm(&__data.comm, sizeof(__data.comm));
     +        bpf_send_signal(10);
      %s
      %s
              %s.perf_submit(%s, &__data, sizeof(__data));
  3. in a different window run
     ./trace.py -p $(pidof stress_test) t:sched:sched_switch

The deadlock can be reproduced in our production system.

Similar to Song's fix, the fix here is to delay sending the signal when
irqs are disabled, to avoid deadlocks involving rq_lock. With this
change, the above stress test no longer causes a deadlock in our
production system.
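
One way to implement such a delay, and the pattern used by similar
fixes in this area, is to hand the actual delivery off to a per-cpu
irq_work. The sketch below is a condensed illustration of that pattern,
not the verbatim patch: the names (send_signal_irq_work,
do_bpf_send_signal, send_signal_irq_work_init) are illustrative, and
the upstream helper also carries extra sanity checks on current that
are omitted here.

  /* Illustrative sketch of the irq_work-based deferral (not the
   * verbatim upstream patch).
   */
  struct send_signal_irq_work {
          struct irq_work irq_work;
          struct task_struct *task;
          u32 sig;
  };

  static DEFINE_PER_CPU(struct send_signal_irq_work, send_signal_work);

  /* Deferred path: runs with irqs enabled, so acquiring
   * sighand->siglock here cannot deadlock against rq_lock.
   */
  static void do_bpf_send_signal(struct irq_work *entry)
  {
          struct send_signal_irq_work *work;

          work = container_of(entry, struct send_signal_irq_work,
                              irq_work);
          group_send_sig_info(work->sig, SEND_SIG_PRIV, work->task,
                              PIDTYPE_TGID);
  }

  BPF_CALL_1(bpf_send_signal, u32, sig)
  {
          if (irqs_disabled()) {
                  struct send_signal_irq_work *work;

                  /* Validate early; an error would otherwise be lost
                   * inside the deferred irq_work.
                   */
                  if (unlikely(!valid_signal(sig)))
                          return -EINVAL;

                  /* Guard against re-queueing while a previous request
                   * on this cpu is still pending.
                   */
                  work = this_cpu_ptr(&send_signal_work);
                  if (atomic_read(&work->irq_work.flags) & IRQ_WORK_BUSY)
                          return -EBUSY;

                  /* Record the target task now; current may be
                   * different by the time the irq_work actually runs.
                   */
                  work->task = current;
                  work->sig = sig;
                  irq_work_queue(&work->irq_work);
                  return 0;
          }

          return group_send_sig_info(sig, SEND_SIG_PRIV, current,
                                     PIDTYPE_TGID);
  }

  static int __init send_signal_irq_work_init(void)
  {
          int cpu;
          struct send_signal_irq_work *work;

          for_each_possible_cpu(cpu) {
                  work = per_cpu_ptr(&send_signal_work, cpu);
                  init_irq_work(&work->irq_work, do_bpf_send_signal);
          }
          return 0;
  }
  subsys_initcall(send_signal_irq_work_init);

The key property is that the irq_work callback runs with irqs enabled,
so taking sighand->siglock there cannot close a lock cycle with an
already-held rq_lock.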

I also implemented a scaled-down version of the reproducer in the
selftests (a subsequent commit). With the latest bpf-next, lockdep
complains about the following potential deadlock.
  [   32.832450] -> #1 (&p->pi_lock){-.-.}:
  [   32.833100]        _raw_spin_lock_irqsave+0x44/0x80
  [   32.833696]        task_rq_lock+0x2c/0xa0
  [   32.834182]        task_sched_runtime+0x59/0xd0
  [   32.834721]        thread_group_cputime+0x250/0x270
  [   32.835304]        thread_group_cputime_adjusted+0x2e/0x70
  [   32.835959]        do_task_stat+0x8a7/0xb80
  [   32.836461]        proc_single_show+0x51/0xb0
  ...
  [   32.839512] -> #0 (&(&sighand->siglock)->rlock){....}:
  [   32.840275]        __lock_acquire+0x1358/0x1a20
  [   32.840826]        lock_acquire+0xc7/0x1d0
  [   32.841309]        _raw_spin_lock_irqsave+0x44/0x80
  [   32.841916]        __lock_task_sighand+0x79/0x160
  [   32.842465]        do_send_sig_info+0x35/0x90
  [   32.842977]        bpf_send_signal+0xa/0x10
  [   32.843464]        bpf_prog_bc13ed9e4d3163e3_send_signal_tp_sched+0x465/0x1000
  [   32.844301]        trace_call_bpf+0x115/0x270
  [   32.844809]        perf_trace_run_bpf_submit+0x4a/0xc0
  [   32.845411]        perf_trace_sched_switch+0x10f/0x180
  [   32.846014]        __schedule+0x45d/0x880
  [   32.846483]        schedule+0x5f/0xd0
  ...

  [   32.853148] Chain exists of:
  [   32.853148]   &(&sighand->siglock)->rlock --> &p->pi_lock --> &rq->lock
  [   32.853148]
  [   32.854451]  Possible unsafe locking scenario:
  [   32.854451]
  [   32.855173]        CPU0                    CPU1
  [   32.855745]        ----                    ----
  [   32.856278]   lock(&rq->lock);
  [   32.856671]                                lock(&p->pi_lock);
  [   32.857332]                                lock(&rq->lock);
  [   32.857999]   lock(&(&sighand->siglock)->rlock);

  The deadlock happens on CPU0 when it tries to acquire
  &sighand->siglock while it is held by CPU1, and CPU1 in turn tries
  to grab &rq->lock and cannot get it.

  This is not exactly the call stack seen in our production
  environment, but the symptom is similar: both locks are acquired
  with spin_lock_irqsave() and both chains involve rq_lock. The fix
  of delaying the signal when irqs are disabled resolves this issue
  as well.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200304191104.2796501-1-yhs@fb.com