kernel_optimize_test

Author	SHA1	Message	Date
Wei Wang	6d37fa49da	l2tp: use sk_dst_check() to avoid race on sk->sk_dst_cache In l2tp code, if it is a L2TP_UDP_ENCAP tunnel, tunnel->sk points to a UDP socket. User could call sendmsg() on both this tunnel and the UDP socket itself concurrently. As l2tp_xmit_skb() holds socket lock and call __sk_dst_check() to refresh sk->sk_dst_cache, while udpv6_sendmsg() is lockless and call sk_dst_check() to refresh sk->sk_dst_cache, there could be a race and cause the dst cache to be freed multiple times. So we fix l2tp side code to always call sk_dst_check() to garantee xchg() is called when refreshing sk->sk_dst_cache to avoid race conditions. Syzkaller reported stack trace: BUG: KASAN: use-after-free in atomic_read include/asm-generic/atomic-instrumented.h:21 [inline] BUG: KASAN: use-after-free in atomic_fetch_add_unless include/linux/atomic.h:575 [inline] BUG: KASAN: use-after-free in atomic_add_unless include/linux/atomic.h:597 [inline] BUG: KASAN: use-after-free in dst_hold_safe include/net/dst.h:308 [inline] BUG: KASAN: use-after-free in ip6_hold_safe+0xe6/0x670 net/ipv6/route.c:1029 Read of size 4 at addr ffff8801aea9a880 by task syz-executor129/4829 CPU: 0 PID: 4829 Comm: syz-executor129 Not tainted 4.18.0-rc7-next-20180802+ #30 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113 print_address_description+0x6c/0x20b mm/kasan/report.c:256 kasan_report_error mm/kasan/report.c:354 [inline] kasan_report.cold.7+0x242/0x30d mm/kasan/report.c:412 check_memory_region_inline mm/kasan/kasan.c:260 [inline] check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267 kasan_check_read+0x11/0x20 mm/kasan/kasan.c:272 atomic_read include/asm-generic/atomic-instrumented.h:21 [inline] atomic_fetch_add_unless include/linux/atomic.h:575 [inline] atomic_add_unless include/linux/atomic.h:597 [inline] dst_hold_safe include/net/dst.h:308 [inline] ip6_hold_safe+0xe6/0x670 net/ipv6/route.c:1029 rt6_get_pcpu_route net/ipv6/route.c:1249 [inline] ip6_pol_route+0x354/0xd20 net/ipv6/route.c:1922 ip6_pol_route_output+0x54/0x70 net/ipv6/route.c:2098 fib6_rule_lookup+0x283/0x890 net/ipv6/fib6_rules.c:122 ip6_route_output_flags+0x2c5/0x350 net/ipv6/route.c:2126 ip6_dst_lookup_tail+0x1278/0x1da0 net/ipv6/ip6_output.c:978 ip6_dst_lookup_flow+0xc8/0x270 net/ipv6/ip6_output.c:1079 ip6_sk_dst_lookup_flow+0x5ed/0xc50 net/ipv6/ip6_output.c:1117 udpv6_sendmsg+0x2163/0x36b0 net/ipv6/udp.c:1354 inet_sendmsg+0x1a1/0x690 net/ipv4/af_inet.c:798 sock_sendmsg_nosec net/socket.c:622 [inline] sock_sendmsg+0xd5/0x120 net/socket.c:632 ___sys_sendmsg+0x51d/0x930 net/socket.c:2115 __sys_sendmmsg+0x240/0x6f0 net/socket.c:2210 __do_sys_sendmmsg net/socket.c:2239 [inline] __se_sys_sendmmsg net/socket.c:2236 [inline] __x64_sys_sendmmsg+0x9d/0x100 net/socket.c:2236 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x446a29 Code: e8 ac b8 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb 08 fc ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007f4de5532db8 EFLAGS: 00000246 ORIG_RAX: 0000000000000133 RAX: ffffffffffffffda RBX: 00000000006dcc38 RCX: 0000000000446a29 RDX: 00000000000000b8 RSI: 0000000020001b00 RDI: 0000000000000003 RBP: 00000000006dcc30 R08: 00007f4de5533700 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006dcc3c R13: 00007ffe2b830fdf R14: 00007f4de55339c0 R15: 0000000000000001 Fixes: `71b1391a41` ("l2tp: ensure sk->dst is still valid") Reported-by: syzbot+05f840f3b04f211bad55@syzkaller.appspotmail.com Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Guillaume Nault <g.nault@alphalink.fr> Cc: David Ahern <dsahern@gmail.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-13 08:43:36 -07:00
Virgile Jarry	e6f86b0f7a	ipv6: Add icmp_echo_ignore_all support for ICMPv6 Preventing the kernel from responding to ICMP Echo Requests messages can be useful in several ways. The sysctl parameter 'icmp_echo_ignore_all' can be used to prevent the kernel from responding to IPv4 ICMP echo requests. For IPv6 pings, such a sysctl kernel parameter did not exist. Add the ability to prevent the kernel from responding to IPv6 ICMP echo requests through the use of the following sysctl parameter : /proc/sys/net/ipv6/icmp/echo_ignore_all. Update the documentation to reflect this change. Signed-off-by: Virgile Jarry <virgile@acceis.fr> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-13 08:42:25 -07:00
David S. Miller	8f78004442	Merge branch 'net-tls-Combined-memory-allocation-for-decryption-request' Vakul Garg says: ==================== net/tls: Combined memory allocation for decryption request This patch does a combined memory allocation from heap for scatterlists, aead_request, aad and iv for the tls record decryption path. In present code, aead_request is allocated from heap, scatterlists on a conditional basis are allocated on heap or on stack. This is inefficient as it may requires multiple kmalloc/kfree. The initialization vector passed in cryption request is allocated on stack. This is a problem since the stack memory is not dma-able from crypto accelerators. Doing one combined memory allocation for each decryption request fixes both the above issues. It also paves a way to be able to submit multiple async decryption requests while the previous one is pending i.e. being processed or queued. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-13 08:41:09 -07:00
Vakul Garg	0b243d004e	net/tls: Combined memory allocation for decryption request For preparing decryption request, several memory chunks are required (aead_req, sgin, sgout, iv, aad). For submitting the decrypt request to an accelerator, it is required that the buffers which are read by the accelerator must be dma-able and not come from stack. The buffers for aad and iv can be separately kmalloced each, but it is inefficient. This patch does a combined allocation for preparing decryption request and then segments into aead_req \|\| sgin \|\| sgout \|\| iv \|\| aad. Signed-off-by: Vakul Garg <vakul.garg@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-13 08:41:09 -07:00
David S. Miller	78cbac647e	Merge branch 'ip-faster-in-order-IP-fragments' Peter Oskolkov says: ==================== ip: faster in-order IP fragments Added "Signed-off-by" in v2. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 17:54:18 -07:00
Peter Oskolkov	a4fd284a1f	ip: process in-order fragments efficiently This patch changes the runtime behavior of IP defrag queue: incoming in-order fragments are added to the end of the current list/"run" of in-order fragments at the tail. On some workloads, UDP stream performance is substantially improved: RX: ./udp_stream -F 10 -T 2 -l 60 TX: ./udp_stream -c -H <host> -F 10 -T 5 -l 60 with this patchset applied on a 10Gbps receiver: throughput=9524.18 throughput_units=Mbit/s upstream (net-next): throughput=4608.93 throughput_units=Mbit/s Reported-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Peter Oskolkov <posk@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 17:54:18 -07:00
Peter Oskolkov	353c9cb360	ip: add helpers to process in-order fragments faster. This patch introduces several helper functions/macros that will be used in the follow-up patch. No runtime changes yet. The new logic (fully implemented in the second patch) is as follows: * Nodes in the rb-tree will now contain not single fragments, but lists of consecutive fragments ("runs"). * At each point in time, the current "active" run at the tail is maintained/tracked. Fragments that arrive in-order, adjacent to the previous tail fragment, are added to this tail run without triggering the re-balancing of the rb-tree. * If a fragment arrives out of order with the offset _before_ the tail run, it is inserted into the rb-tree as a single fragment. * If a fragment arrives after the current tail fragment (with a gap), it starts a new "tail" run, as is inserted into the rb-tree at the end as the head of the new run. skb->cb is used to store additional information needed here (suggested by Eric Dumazet). Reported-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Peter Oskolkov <posk@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 17:54:18 -07:00
David S. Miller	6a92ef08a1	Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net	2018-08-11 17:52:00 -07:00
David S. Miller	9a95d9c642	Merge branch 'Remove-rtnl-lock-dependency-from-all-action-implementations' Vlad Buslov says: ==================== Remove rtnl lock dependency from all action implementations Currently, all netlink protocol handlers for updating rules, actions and qdiscs are protected with single global rtnl lock which removes any possibility for parallelism. This patch set is a second step to remove rtnl lock dependency from TC rules update path. Recently, new rtnl registration flag RTNL_FLAG_DOIT_UNLOCKED was added. Handlers registered with this flag are called without RTNL taken. End goal is to have rule update handlers(RTM_NEWTFILTER, RTM_DELTFILTER, etc.) to be registered with UNLOCKED flag to allow parallel execution. However, there is no intention to completely remove or split rtnl lock itself. This patch set addresses specific problems in implementation of tc actions that prevent their control path from being executed concurrently. Additional changes are required to refactor classifiers API and individual classifiers for parallel execution. This patch set lays groundwork to eventually register rule update handlers as rtnl-unlocked. Action API is already prepared for parallel execution with previous patch set, which means that action ops that use action API for their implementation do not require additional modifications. (delete, search, etc.) Action API implements concurrency-safe reference counting and guarantees that cleanup/delete is called only once, after last reference to action is released. The goal of this change is to update specific actions APIs that access action private state directly, in order to be independent from external locking. General approach is to re-use existing tcf_lock spinlock (used by some action implementation to synchronize control path with data path) to protect action private state from concurrent modification. If action has rcu-protected pointer, tcf spinlock is used to protect its update code, instead of relying on rtnl lock. Some actions need to determine rtnl mutex status in order to release it. For example, ife action can load additional kernel modules(meta ops) and must make sure that no locks are held during module load. In such cases 'rtnl_held' argument is used to conditionally release rtnl mutex. Changes from V1 to V2: - Patch 12: - new patch - Patch 14: - refactor gen_new_estimator() to reuse stats_lock when re-assigning rate estimator statistics pointer - Remove mirred and tunnel_key helper function changes. (to be submitted and standalone patch) ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 12:37:10 -07:00
Vlad Buslov	e329bc4273	net: sched: act_police: remove dependency on rtnl lock Use tcf spinlock to protect police action private data from concurrent modification during dump. (init already uses tcf spinlock when changing police action state) Pass tcf spinlock as estimator lock argument to gen_replace_estimator() during action init. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 12:37:10 -07:00
Vlad Buslov	51a9f5ae65	net: core: protect rate estimator statistics pointer with lock Extend gen_new_estimator() to also take stats_lock when re-assigning rate estimator statistics pointer. (to be used by unlocked actions) Rename 'stats_lock' to 'lock' and change argument description to explain that it is now also used for control path. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 12:37:10 -07:00
Vlad Buslov	4e232818bd	net: sched: act_mirred: remove dependency on rtnl lock Re-introduce mirred list spinlock, that was removed some time ago, in order to protect it from concurrent modifications, instead of relying on rtnl lock. Use tcf spinlock to protect mirred action private data from concurrent modification in init and dump. Rearrange access to mirred data in order to be performed only while holding the lock. Rearrange net dev access to always hold reference while working with it, instead of relying on rntl lock. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 12:37:10 -07:00
Vlad Buslov	84a75b329b	net: sched: extend action ops with put_dev callback As a preparation for removing dependency on rtnl lock from rules update path, all users of shared objects must take reference while working with them. Extend action ops with put_dev() API to be used on net device returned by get_dev(). Modify mirred action (only action that implements get_dev callback): - Take reference to net device in get_dev. - Implement put_dev API that releases reference to net device. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 12:37:10 -07:00
Vlad Buslov	764e9a2448	net: sched: act_vlan: remove dependency on rtnl lock Use tcf spinlock to protect vlan action private data from concurrent modification during dump and init. Use rcu swap operation to reassign params pointer under protection of tcf lock. (old params value is not used by init, so there is no need of standalone rcu dereference step) Remove rtnl assertion that is no longer necessary. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 12:37:10 -07:00
Vlad Buslov	729e012609	net: sched: act_tunnel_key: remove dependency on rtnl lock Use tcf lock to protect tunnel key action struct private data from concurrent modification in init and dump. Use rcu swap operation to reassign params pointer under protection of tcf lock. (old params value is not used by init, so there is no need of standalone rcu dereference step) Remove rtnl lock assertion that is no longer required. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 12:37:10 -07:00
Vlad Buslov	c8814552fe	net: sched: act_skbmod: remove dependency on rtnl lock Move read of skbmod_p rcu pointer to be protected by tcf spinlock. Use tcf spinlock to protect private skbmod data from concurrent modification during dump. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 12:37:09 -07:00
Vlad Buslov	5e48180ed8	net: sched: act_simple: remove dependency on rtnl lock Use tcf spinlock to protect private simple action data from concurrent modification during dump. (simple init already uses tcf spinlock when changing action state) Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 12:37:09 -07:00
Vlad Buslov	d772849566	net: sched: act_sample: remove dependency on rtnl lock Use tcf spinlock to protect private sample action data from concurrent modification during dump and init. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 12:37:09 -07:00
Vlad Buslov	67b0c1a3c9	net: sched: act_pedit: remove dependency on rtnl lock Rearrange pedit init code to only access pedit action data while holding tcf spinlock. Change keys allocation type to atomic to allow it to execute while holding tcf spinlock. Take tcf spinlock in dump function when accessing pedit action data. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 12:37:09 -07:00
Vlad Buslov	ff25276de9	net: sched: act_ipt: remove dependency on rtnl lock Use tcf spinlock to protect ipt action private data from concurrent modification during dump. Ipt init already takes tcf spinlock when modifying ipt state. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 12:37:09 -07:00
Vlad Buslov	54d0d423a4	net: sched: act_ife: remove dependency on rtnl lock Use tcf spinlock and rcu to protect params pointer from concurrent modification during dump and init. Use rcu swap operation to reassign params pointer under protection of tcf lock. (old params value is not used by init, so there is no need of standalone rcu dereference step) Ife action has meta-actions that are compiled as standalone modules. Rtnl mutex must be released while loading a kernel module. In order to support execution without rtnl mutex, propagate 'rtnl_held' argument to meta action loading functions. When requesting meta action module, conditionally release rtnl lock depending on 'rtnl_held' argument. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 12:37:09 -07:00
Vlad Buslov	e8917f4370	net: sched: act_gact: remove dependency on rtnl lock Use tcf spinlock to protect gact action private state from concurrent modification during dump and init. Remove rtnl assertion that is no longer necessary. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 12:37:09 -07:00
Vlad Buslov	b6a2b971c0	net: sched: act_csum: remove dependency on rtnl lock Use tcf lock to protect csum action struct private data from concurrent modification in init and dump. Use rcu swap operation to reassign params pointer under protection of tcf lock. (old params value is not used by init, so there is no need of standalone rcu dereference step) Remove rtnl assertion that is no longer necessary. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 12:37:09 -07:00
Vlad Buslov	2142236b45	net: sched: act_bpf: remove dependency on rtnl lock Use tcf spinlock to protect bpf action private data from concurrent modification during dump and init. Remove rtnl lock assertion that is no longer necessary. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 12:37:09 -07:00
David S. Miller	2b14e1ea21	Merge branch 'net-sctp-Avoid-allocating-high-order-memory-with-kmalloc' Konstantin Khorenko says: ==================== net/sctp: Avoid allocating high order memory with kmalloc() Each SCTP association can have up to 65535 input and output streams. For each stream type an array of sctp_stream_in or sctp_stream_out structures is allocated using kmalloc_array() function. This function allocates physically contiguous memory regions, so this can lead to allocation of memory regions of very high order, i.e.: sizeof(struct sctp_stream_out) == 24, ((65535 * 24) / 4096) == 383 memory pages (4096 byte per page), which means 9th memory order. This can lead to a memory allocation failures on the systems under a memory stress. We actually do not need these arrays of memory to be physically contiguous. Possible simple solution would be to use kvmalloc() instread of kmalloc() as kvmalloc() can allocate physically scattered pages if contiguous pages are not available. But the problem is that the allocation can happed in a softirq context with GFP_ATOMIC flag set, and kvmalloc() cannot be used in this scenario. So the other possible solution is to use flexible arrays instead of contiguios arrays of memory so that the memory would be allocated on a per-page basis. This patchset replaces kvmalloc() with flex_array usage. It consists of two parts: * First patch is preparatory - it mechanically wraps all direct access to assoc->stream.out[] and assoc->stream.in[] arrays with SCTP_SO() and SCTP_SI() wrappers so that later a direct array access could be easily changed to an access to a flex_array (or any other possible alternative). * Second patch replaces kmalloc_array() with flex_array usage. v2 changes: sctp_stream_in() users are updated to provide stream as an argument, sctp_stream_{in,out}_ptr() are now just sctp_stream_{in,out}(). v3 changes: Move type chages struct sctp_stream_out -> flex_array to next patch. Make sctp_stream_{in,out}() static incline and move them to a header. Performance results (single stream): ==================================== * Kernel: v4.18-rc6 - stock and with 2 patches from Oleg (earlier in this thread) * Node: CPU (8 cores): Intel(R) Xeon(R) CPU E31230 @ 3.20GHz RAM: 32 Gb * netperf: taken from https://github.com/HewlettPackard/netperf.git, compiled from sources with sctp support * netperf server and client are run on the same node * ip link set lo mtu 1500 The script used to run tests: # cat run_tests.sh #!/bin/bash for test in SCTP_STREAM SCTP_STREAM_MANY SCTP_RR SCTP_RR_MANY; do echo "TEST: $test"; for i in `seq 1 3`; do echo "Iteration: $i"; set -x netperf -t $test -H localhost -p 22222 -S 200000,200000 -s 200000,200000 \ -l 60 -- -m 1452; set +x done done ================================================ Results (a bit reformatted to be more readable): Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec v4.18-rc7 v4.18-rc7 + fixes TEST: SCTP_STREAM 212992 212992 1452 60.21 1125.52 1247.04 212992 212992 1452 60.20 1376.38 1149.95 212992 212992 1452 60.20 1131.40 1163.85 TEST: SCTP_STREAM_MANY 212992 212992 1452 60.00 1111.00 1310.05 212992 212992 1452 60.00 1188.55 1130.50 212992 212992 1452 60.00 1108.06 1162.50 =========== Local /Remote Socket Size Request Resp. Elapsed Trans. Send Recv Size Size Time Rate bytes Bytes bytes bytes secs. per sec v4.18-rc7 v4.18-rc7 + fixes TEST: SCTP_RR 212992 212992 1 1 60.00 45486.98 46089.43 212992 212992 1 1 60.00 45584.18 45994.21 212992 212992 1 1 60.00 45703.86 45720.84 TEST: SCTP_RR_MANY 212992 212992 1 1 60.00 40.75 40.77 212992 212992 1 1 60.00 40.58 40.08 212992 212992 1 1 60.00 39.98 39.97 Performance results for many streams: ===================================== * Kernel: v4.18-rc8 - stock and with 2 patches v3 * Node: CPU (8 cores): Intel(R) Xeon(R) CPU E31230 @ 3.20GHz RAM: 32 Gb * sctp_test: https://github.com/sctp/lksctp-tools * both server and client are run on the same node * ip link set lo mtu 1500 * sysctl -w vm.max_map_count=65530000 (need it to make memory fragmented) The script used to run tests: ============================= # cat run_sctp_test.sh #!/bin/bash set -x uname -r ip link set lo mtu 1500 swapoff -a free cat /proc/buddyinfo ./src/apps/sctp_test -H 127.0.0.1 -P 22222 -l -d 0 & sleep 3 time ./src/apps/sctp_test -H 127.0.0.1 -P 22221 -h 127.0.0.1 -p 22222 \ -s -c 1 -M 65535 -T -t 1 -x 100000 -d 0 1>/dev/null killall -9 lt-sctp_test =============================== Results (a bit reformatted to be more readable): 1) ms stock kernel v4.18-rc8, no memory fragmentation test 1 test 2 test 3 real 0m14.715s 0m14.593s 0m15.954s user 0m0.954s 0m0.955s 0m0.854s sys 0m13.388s 0m12.537s 0m13.749s 2) kernel with fixes, no memory fragmentation test 1 test 2 test 3 real 0m14.959s 0m14.693s 0m14.762s user 0m0.948s 0m0.921s 0m0.929s sys 0m13.538s 0m13.225s 0m13.217s 3) kernel with fixes, memory fragmented 'free': total used free shared buff/cache available Mem: 32906008 30555200 302740 764 2048068 266452 Mem: 32906008 30379948 541436 764 1984624 442376 Mem: 32906008 30717312 262380 764 1926316 109908 /proc/buddyinfo: Node 0, zone Normal 40773 37 34 29 0 0 0 0 0 0 0 Node 0, zone Normal 100332 68 8 4 2 1 1 0 0 0 0 Node 0, zone Normal 31113 7 2 1 0 0 0 0 0 0 0 test 1 test 2 test 3 real 0m14.159s 0m15.252s 0m15.826s user 0m0.839s 0m1.004s 0m1.048s sys 0m11.827s 0m14.240s 0m14.778s ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 12:26:48 -07:00
Konstantin Khorenko	0d493b4d0b	net/sctp: Replace in/out stream arrays with flex_array This path replaces physically contiguous memory arrays allocated using kmalloc_array() with flexible arrays. This enables to avoid memory allocation failures on the systems under a memory stress. Signed-off-by: Oleg Babin <obabin@virtuozzo.com> Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 12:25:15 -07:00
Konstantin Khorenko	05364ca03c	net/sctp: Make wrappers for accessing in/out streams This patch introduces wrappers for accessing in/out streams indirectly. This will enable to replace physically contiguous memory arrays of streams with flexible arrays (or maybe any other appropriate mechanism) which do memory allocation on a per-page basis. Signed-off-by: Oleg Babin <obabin@virtuozzo.com> Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 12:25:15 -07:00
Keara Leibovitz	b70f1f3af4	tc: Update README and add config Updated README. Added config file that contains the minimum required features enabled to run the tests currently present in the kernel. This must be updated when new unittests are created and require their own modules. Signed-off-by: Keara Leibovitz <kleib@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 12:20:59 -07:00
David S. Miller	3305f9a905	Merge branch 'l2tp-rework-pppol2tp-ioctl-handling' Guillaume Nault says: ==================== l2tp: rework pppol2tp ioctl handling The current ioctl() handling code can be simplified. It tests for non-relevant conditions and uselessly holds sockets. Once useless code is removed, it becomes even simpler to let pppol2tp_ioctl() handle commands directly, rather than dispatch them to pppol2tp_tunnel_ioctl() or pppol2tp_session_ioctl(). That is the approach taken by this series. Patch #1 and #2 define helper functions aimed at simplifying the rest of the patch set. Patch #3 drops useless tests in pppol2p_ioctl() and avoid holding a refcount on the socket. Patches #4, #5 and #6 are the core of the series. They let pppol2tp_ioctl() handle all ioctls and drop the tunnel and session specific functions. Then patch #6 brings a little bit of consolidation. Finally, patch #7 takes advantage of the simplified code to make pppol2tp sockets compatible with dev_ioctl(). Certainly not a killer feature, but it is trivial and it is always nice to see l2tp getting better integration with the rest of the stack. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 12:13:49 -07:00
Guillaume Nault	4f5f85e9a7	l2tp: let pppol2tp_ioctl() fallback to dev_ioctl() Return -ENOIOCTLCMD for unknown ioctl commands. This lets dev_ioctl() handle generic socket ioctls like SIOCGIFNAME or SIOCGIFINDEX. PF_PPPOX/PX_PROTO_OL2TP was one of the few socket types not honouring this mechanism. Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 12:13:49 -07:00
Guillaume Nault	7390ed8a40	l2tp: zero out stats in pppol2tp_copy_stats() Integrate memset(0) in pppol2tp_copy_stats() to avoid calling it manually every time. While there, constify 'stats'. Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 12:13:49 -07:00
Guillaume Nault	b0e29063dc	l2tp: remove pppol2tp_session_ioctl() pppol2tp_ioctl() has everything in place for handling PPPIOCGL2TPSTATS on session sockets. We just need to copy the stats and set ->session_id. As a side effect of sharing session and tunnel code, ->using_ipsec is properly set even when the request was made using a session socket. Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 12:13:49 -07:00
Guillaume Nault	528534f0de	l2tp: remove pppol2tp_tunnel_ioctl() Handle PPPIOCGL2TPSTATS in pppol2tp_ioctl() if the socket represents a tunnel. This one is a bit special because the caller may use the tunnel socket to retrieve statistics of one of its sessions. If the session_id is set, the corresponding session's statistics are returned, instead of those of the tunnel. This is handled by the new pppol2tp_tunnel_copy_stats() helper function. Set ->tunnel_id and ->using_ipsec out of the conditional, so that it can be used by the 'else' branch in the following patch. We cannot do that for ->session_id, because tunnel sockets have to report the value that was originally passed in 'stats.session_id', while session sockets have to report their own session_id. Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 12:13:49 -07:00
Guillaume Nault	79e6760e64	l2tp: handle PPPIOC[GS]MRU and PPPIOC[GS]FLAGS in pppol2tp_ioctl() Let pppol2tp_ioctl() handle ioctl commands directly. It still relies on pppol2tp_{session,tunnel}_ioctl() for PPPIOCGL2TPSTATS. Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 12:13:49 -07:00
Guillaume Nault	bdd0292f96	l2tp: simplify pppol2tp_ioctl() * Drop test on 'sk': sock->sk cannot be NULL, or pppox_ioctl() could not have called us. * Drop test on 'SOCK_DEAD' state: if this flag was set, the socket would be in the process of being released and no ioctl could be running anymore. * Drop test on 'PPPOX_' state: we depend on ->sk_user_data to get the session structure. If it is non-NULL, then the socket is connected. Testing for PPPOX_ is redundant. * Retrieve session using ->sk_user_data directly, instead of going through pppol2tp_sock_to_session(). This avoids grabbing a useless reference on the socket. Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 12:13:49 -07:00
Guillaume Nault	01e28b921b	l2tp: split l2tp_session_get() l2tp_session_get() is used for two different purposes. If 'tunnel' is NULL, the session is searched globally in the supplied network namespace. Otherwise it is searched exclusively in the tunnel context. Callers always know the context in which they need to search the session. But some of them do provide both a namespace and a tunnel, making the semantic of the call unclear. This patch defines l2tp_tunnel_get_session() for lookups done in a tunnel and restricts l2tp_session_get() to namespace searches. Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 12:13:49 -07:00
Guillaume Nault	d6a61ec936	l2tp: define l2tp_tunnel_uses_xfrm() Use helper function to figure out if a tunnel is using ipsec. Also, avoid accessing ->sk_policy directly since it's RCU protected. Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 12:13:49 -07:00
David S. Miller	8a8982d1e2	Merge branch 'netsec-driver-improvements' Ilias Apalodimas says: ==================== netsec driver improvements This patchset introduces some improvements on socionext netsec driver. - patch 1/2, avoids unneeded MMIO reads on the Rx path - patch 2/2, is adjusting the numbers of descriptors used Changes since v1: - Move dma_rmb() to protect descriptor accesses until the device has updated the NETSEC_RX_PKT_OWN_FIELD bit ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 12:11:36 -07:00
Ilias Apalodimas	b6311b7bea	net: socionext: Increase descriptors to 256 Increasing descriptors to 256 from 128 and adjusting the NAPI weight to 64 increases performace on Rx by ~20% on 64byte packets Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 12:11:36 -07:00
Ilias Apalodimas	63ae7949e9	net: socionext: Use descriptor info instead of MMIO reads on Rx MMIO reads for remaining packets in queue occur (at least)twice per invocation of netsec_process_rx(). We can use the packet descriptor to identify if it's owned by the hardware and break out, avoiding the more expensive MMIO read operations. This has a ~2% increase on the pps of the Rx path when tested with 64byte packets Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 12:11:36 -07:00
YueHaibing	78aca3bbee	vxge: remove set but not used variable 'req_out', 'status' and 'ret' Fixes gcc '-Wunused-but-set-variable' warning: drivers/net/ethernet/neterion/vxge/vxge-config.c:1097:6: warning: variable 'ret' set but not used [-Wunused-but-set-variable] drivers/net/ethernet/neterion/vxge/vxge-config.c:2263:6: warning: variable 'req_out' set but not used [-Wunused-but-set-variable] drivers/net/ethernet/neterion/vxge/vxge-config.c:2262:22: warning: variable 'status' set but not used [-Wunused-but-set-variable] drivers/net/ethernet/neterion/vxge/vxge-config.c:2360:22: warning: variable 'status' set but not used [-Wunused-but-set-variable] enum vxge_hw_status status = VXGE_HW_OK; Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 12:05:19 -07:00
David S. Miller	29afde5051	Merge branch 'virtio_net-Expand-affinity-to-arbitrary-numbers-of-cpu-and-vq' Caleb Raitto says: ==================== virtio_net: Expand affinity to arbitrary numbers of cpu and vq Virtio-net tries to pin each virtual queue rx and tx interrupt to a cpu if there are as many queues as cpus. Expand this heuristic to configure a reasonable affinity setting also when the number of cpus != the number of virtual queues. Patch 1 allows vqs to take an affinity mask with more than 1 cpu. Patch 2 generalizes the algorithm in virtnet_set_affinity beyond the case where #cpus == #vqs. v2 changes: Renamed "virtio_net: Make vp_set_vq_affinity() take a mask." to "virtio: Make vp_set_vq_affinity() take a mask." Tested: [InstanceSetup] set_multiqueue = false $ cd /proc/irq $ for i in `seq 24 60` ; do sudo grep "." $i/smp_affinity_list; done 0-15 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 12 12 13 13 14 14 15 15 0-15 0-15 0-15 0-15 $ cd /sys/class/net/eth0/queues/ $ for i in `seq 0 15` ; do sudo grep "." tx-$i/xps_cpus; done 0001 0002 0004 0008 0010 0020 0040 0080 0100 0200 0400 0800 1000 2000 4000 8000 $ sudo ethtool -L eth0 combined 15 $ cd /proc/irq $ for i in `seq 24 60` ; do sudo grep "." $i/smp_affinity_list; done 0-15 0-1 0-1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 12 12 13 13 14 14 15 15 15 15 0-15 0-15 0-15 0-15 $ cd /sys/class/net/eth0/queues/ $ for i in `seq 0 14` ; do sudo grep "." tx-$i/xps_cpus; done 0003 0004 0008 0010 0020 0040 0080 0100 0200 0400 0800 1000 2000 4000 8000 $ sudo ethtool -L eth0 combined 8 $ cd /proc/irq $ for i in `seq 24 60` ; do sudo grep "." $i/smp_affinity_list; done 0-15 0-1 0-1 2-3 2-3 4-5 4-5 6-7 6-7 8-9 8-9 10-11 10-11 12-13 12-13 14-15 14-15 9 9 10 10 11 11 12 12 13 13 14 14 15 15 15 15 0-15 0-15 0-15 0-15 $ cd /sys/class/net/eth0/queues/ $ for i in `seq 0 7` ; do sudo grep "." tx-$i/xps_cpus; done 0003 000c 0030 00c0 0300 0c00 3000 c000 $ sudo ethtool -L eth0 combined 16 $ sudo sh -c "echo 0 > /sys/devices/system/cpu/cpu15/online" $ cd /proc/irq $ for i in `seq 24 60` ; do sudo grep "." $i/smp_affinity_list; done 0-15 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 12 12 13 13 14 14 0 0 0-15 0-15 0-15 0-15 $ cd /sys/class/net/eth0/queues/ $ for i in `seq 0 15` ; do sudo grep "." tx-$i/xps_cpus; done 0001 0002 0004 0008 0010 0020 0040 0080 0100 0200 0400 0800 1000 2000 4000 0001 $ for i in `seq 8 15`; \ do sudo sh -c "echo 0 > /sys/devices/system/cpu/cpu$i/online"; done $ cd /proc/irq $ for i in `seq 24 60` ; do sudo grep "." $i/smp_affinity_list; done 0-15 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 0-15 0-15 0-15 0-15 $ cd /sys/class/net/eth0/queues/ $ for i in `seq 0 15` ; do sudo grep "." tx-$i/xps_cpus; done 0001 0002 0004 0008 0010 0020 0040 0080 0001 0002 0004 0008 0010 0020 0040 0080 ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 12:02:18 -07:00
Caleb Raitto	2ca653d607	virtio_net: Stripe queue affinities across cores. Always set the affinity hint, even if #cpu != #vq. Handle the case where #cpu > #vq (including when #cpu % #vq != 0) and when #vq > #cpu (including when #vq % #cpu != 0). Signed-off-by: Caleb Raitto <caraitto@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Jon Olson <jonolson@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 12:02:18 -07:00
Caleb Raitto	19e226e8cc	virtio: Make vp_set_vq_affinity() take a mask. Make vp_set_vq_affinity() take a cpumask instead of taking a single CPU. If there are fewer queues than cores, queue affinity should be able to map to multiple cores. Link: https://patchwork.ozlabs.org/patch/948149/ Suggested-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Caleb Raitto <caraitto@google.com> Acked-by: Gonglei <arei.gonglei@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 12:02:18 -07:00
Bryan Whitehead	07624df1c9	lan743x: lan743x: Add PTP support PTP support includes: Ingress, and egress timestamping. One step timestamping available. PTP clock support. Periodic output support. Signed-off-by: Bryan Whitehead <Bryan.Whitehead@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 11:42:18 -07:00
David S. Miller	217e502b89	Merge branch 'tcp-new-mechanism-to-ACK-immediately' Yuchung Cheng says: ==================== tcp: new mechanism to ACK immediately This patch is a follow-up feature improvement to the recent fixes on the performance issues in ECN (delayed) ACKs. Many of the fixes use tcp_enter_quickack_mode routine to force immediate ACKs. However the routine also reset tracking interactive session. This is not ideal because these immediate ACKs are required by protocol specifics unrelated to the interactiveness nature of the application. This patch set introduces a new flag to send a one-time immediate ACK without changing the status of interactive session tracking. With this patch set the immediate ACKs are generated upon these protocol states: 1) When a hole is repaired 2) When CE status changes between subsequent data packets received 3) When a data packet carries CWR flag ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 11:31:36 -07:00
Yuchung Cheng	fd2123a3d7	tcp: avoid resetting ACK timer upon receiving packet with ECN CWR flag Previously commit `9aee400061` ("tcp: ack immediately when a cwr packet arrives") calls tcp_enter_quickack_mode to force sending two immediate ACKs upon receiving a packet w/ CWR flag. The side effect is it'll also reset the delayed ACK timer and interactive session tracking. This patch removes that side effect by using the new ACK_NOW flag to force an immmediate ACK. Packetdrill to demonstrate: 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 setsockopt(3, SOL_TCP, TCP_CONGESTION, "dctcp", 5) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 +0 < [ect0] SEW 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7> +0 > SE. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8> +.1 < [ect0] . 1:1(0) ack 1 win 257 +0 accept(3, ..., ...) = 4 +0 < [ect0] . 1:1001(1000) ack 1 win 257 +0 > [ect01] . 1:1(0) ack 1001 +0 write(4, ..., 1) = 1 +0 > [ect01] P. 1:2(1) ack 1001 +0 < [ect0] . 1001:2001(1000) ack 2 win 257 +0 write(4, ..., 1) = 1 +0 > [ect01] P. 2:3(1) ack 2001 +0 < [ect0] . 2001:3001(1000) ack 3 win 257 +0 < [ect0] . 3001:4001(1000) ack 3 win 257 // Ack delayed ... +.01 < [ce] P. 4001:4501(500) ack 3 win 257 +0 > [ect01] . 3:3(0) ack 4001 +0 > [ect01] E. 3:3(0) ack 4501 +.001 read(4, ..., 4500) = 4500 +0 write(4, ..., 1) = 1 +0 > [ect01] PE. 3:4(1) ack 4501 win 100 +.01 < [ect0] W. 4501:5501(1000) ack 4 win 257 // No delayed ACK on CWR flag +0 > [ect01] . 4:4(0) ack 5501 +.31 < [ect0] . 5501:6501(1000) ack 4 win 257 +0 > [ect01] . 4:4(0) ack 6501 Fixes: `9aee400061` ("tcp: ack immediately when a cwr packet arrives") Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 11:31:35 -07:00
Yuchung Cheng	15bdd5686c	tcp: always ACK immediately on hole repairs RFC 5681 sec 4.2: To provide feedback to senders recovering from losses, the receiver SHOULD send an immediate ACK when it receives a data segment that fills in all or part of a gap in the sequence space. When a gap is partially filled, __tcp_ack_snd_check already checks the out-of-order queue and correctly send an immediate ACK. However when a gap is fully filled, the previous implementation only resets pingpong mode which does not guarantee an immediate ACK because the quick ACK counter may be zero. This patch addresses this issue by marking the one-time immediate ACK flag instead. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 11:31:35 -07:00
Yuchung Cheng	d2ccd7bc8a	tcp: avoid resetting ACK timer in DCTCP The recent fix of acking immediately in DCTCP on CE status change has an undesirable side-effect: it also resets TCP ack timer and disables pingpong mode (interactive session). But the CE status change has nothing to do with them. This patch addresses that by using the new one-time immediate ACK flag instead of calling tcp_enter_quickack_mode(). Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 11:31:35 -07:00
Yuchung Cheng	466466dc6c	tcp: mandate a one-time immediate ACK Add a new flag to indicate a one-time immediate ACK. This flag is occasionaly set under specific TCP protocol states in addition to the more common quickack mechanism for interactive application. In several cases in the TCP code we want to force an immediate ACK but do not want to call tcp_enter_quickack_mode() because we do not want to forget the icsk_ack.pingpong or icsk_ack.ato state. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 11:31:35 -07:00

1 2 3 4 5 ...

770549 Commits