kernel_optimize_test

Author	SHA1	Message	Date
Eric Dumazet	bd9dfc54e3	tcp: fix hang in tcp_sendpage_locked() syszkaller got a hang in tcp stack, related to a bug in tcp_sendpage_locked() root@syzkaller:~# cat /proc/3059/stack [<ffffffff83de926c>] __lock_sock+0x1dc/0x2f0 [<ffffffff83de9473>] lock_sock_nested+0xf3/0x110 [<ffffffff8408ce01>] tcp_sendmsg+0x21/0x50 [<ffffffff84163b6f>] inet_sendmsg+0x11f/0x5e0 [<ffffffff83dd8eea>] sock_sendmsg+0xca/0x110 [<ffffffff83dd9547>] kernel_sendmsg+0x47/0x60 [<ffffffff83de35dc>] sock_no_sendpage+0x1cc/0x280 [<ffffffff8408916b>] tcp_sendpage_locked+0x10b/0x160 [<ffffffff84089203>] tcp_sendpage+0x43/0x60 [<ffffffff841641da>] inet_sendpage+0x1aa/0x660 [<ffffffff83dd4fcd>] kernel_sendpage+0x8d/0xe0 [<ffffffff83dd50ac>] sock_sendpage+0x8c/0xc0 [<ffffffff81b63300>] pipe_to_sendpage+0x290/0x3b0 [<ffffffff81b67243>] __splice_from_pipe+0x343/0x750 [<ffffffff81b6a459>] splice_from_pipe+0x1e9/0x330 [<ffffffff81b6a5e0>] generic_splice_sendpage+0x40/0x50 [<ffffffff81b6b1d7>] SyS_splice+0x7b7/0x1610 [<ffffffff84d77a01>] entry_SYSCALL_64_fastpath+0x1f/0xbe Fixes: `306b13eb3c` ("proto_ops: Add locked held versions of sendmsg and sendpage") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: Tom Herbert <tom@quantonium.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-25 17:22:01 -07:00
WANG Cong	3cd904ecbb	net_sched: kill u32_node pointer in Qdisc It is ugly to hide a u32-filter-specific pointer inside Qdisc, this breaks the TC layers: 1. Qdisc is a generic representation, should not have any specific data of any type 2. Qdisc layer is above filter layer, should only save filters in the list of struct tcf_proto. This pointer is used as the head of the chain of u32 hash tables, that is struct tc_u_hnode, because u32 filter is very special, it allows to create multiple hash tables within one qdisc and across multiple u32 filters. Instead of using this ugly pointer, we can just save it in a global hash table key'ed by (dev ifindex, qdisc handle), therefore we can still treat it as a per qdisc basis data structure conceptually. Of course, because of network namespaces, this key is not unique at all, but it is fine as we already have a pointer to Qdisc in struct tc_u_common, we can just compare the pointers when collision. And this only affects slow paths, has no impact to fast path, thanks to the pointer ->tp_c. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-25 17:19:10 -07:00
WANG Cong	143976ce99	net_sched: remove tc class reference counting For TC classes, their ->get() and ->put() are always paired, and the reference counting is completely useless, because: 1) For class modification and dumping paths, we already hold RTNL lock, so all of these ->get(),->change(),->put() are atomic. 2) For filter bindiing/unbinding, we use other reference counter than this one, and they should have RTNL lock too. 3) For ->qlen_notify(), it is special because it is called on ->enqueue() path, but we already hold qdisc tree lock there, and we hold this tree lock when graft or delete the class too, so it should not be gone or changed until we release the tree lock. Therefore, this patch removes ->get() and ->put(), but: 1) Adds a new ->find() to find the pointer to a class by classid, no refcnt. 2) Move the original class destroy upon the last refcnt into ->delete(), right after releasing tree lock. This is fine because the class is already removed from hash when holding the lock. For those who also use ->put() as ->unbind(), just rename them to reflect this change. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-25 17:19:10 -07:00
WANG Cong	14546ba1e5	net_sched: introduce tclass_del_notify() Like for TC actions, ->delete() is a special case, we have to prepare and fill the notification before delete otherwise would get use-after-free after we remove the reference count. Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-25 17:19:10 -07:00
WANG Cong	27d7f07c49	net_sched: get rid of more forward declarations This is not needed if we move them up properly. Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-25 17:19:10 -07:00
David Lebrun	891ef8dd2a	ipv6: sr: implement additional seg6local actions This patch implements the following seg6local actions. - SEG6_LOCAL_ACTION_END_T: regular SRH processing and forward to the next-hop looked up in the specified routing table. - SEG6_LOCAL_ACTION_END_DX2: decapsulate an L2 frame and forward it to the specified network interface. - SEG6_LOCAL_ACTION_END_DX4: decapsulate an IPv4 packet and forward it, possibly to the specified next-hop. - SEG6_LOCAL_ACTION_END_DT6: decapsulate an IPv6 packet and forward it to the next-hop looked up in the specified routing table. Signed-off-by: David Lebrun <david.lebrun@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-25 17:10:24 -07:00
David Lebrun	d7a669dd2f	ipv6: sr: add helper functions for seg6local This patch adds three helper functions to be used with the seg6local packet processing actions. The decap_and_validate() function will be used by the End.D* actions, that decapsulate an SR-enabled packet. The advance_nextseg() function applies the fundamental operations to update an SRH for the next segment. The lookup_nexthop() function helps select the next-hop for the processed SR packets. It supports an optional next-hop address to route the packet specifically through it, and an optional routing table to use. Signed-off-by: David Lebrun <david.lebrun@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-25 17:10:24 -07:00
David Lebrun	6285217f0c	ipv6: sr: enforce IPv6 packets for seg6local lwt This patch ensures that the seg6local lightweight tunnel is used solely with IPv6 routes and processes only IPv6 packets. Signed-off-by: David Lebrun <david.lebrun@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-25 17:10:23 -07:00
David Lebrun	38ee7f2d47	ipv6: sr: add support for encapsulation of L2 frames This patch implements the L2 frame encapsulation mechanism, referred to as T.Encaps.L2 in the SRv6 specifications [1]. A new type of SRv6 tunnel mode is added (SEG6_IPTUN_MODE_L2ENCAP). It only accepts packets with an existing MAC header (i.e., it will not work for locally generated packets). The resulting packet looks like IPv6 -> SRH -> Ethernet -> original L3 payload. The next header field of the SRH is set to NEXTHDR_NONE. [1] https://tools.ietf.org/html/draft-filsfils-spring-srv6-network-programming-01 Signed-off-by: David Lebrun <david.lebrun@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-25 17:10:23 -07:00
David Lebrun	32d99d0b67	ipv6: sr: add support for ip4ip6 encapsulation This patch enables the SRv6 encapsulation mode to carry an IPv4 payload. All the infrastructure was already present, I just had to add a parameter to seg6_do_srh_encap() to specify the inner packet protocol, and perform some additional checks. Usage example: ip route add 1.2.3.4 encap seg6 mode encap segs fc00::1,fc00::2 dev eth0 Signed-off-by: David Lebrun <david.lebrun@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-25 17:10:23 -07:00
Eric Biggers	3fd8712707	strparser: initialize all callbacks commit `bbb03029a8` ("strparser: Generalize strparser") added more function pointers to 'struct strp_callbacks'; however, kcm_attach() was not updated to initialize them. This could cause the ->lock() and/or ->unlock() function pointers to be set to garbage values, causing a crash in strp_work(). Fix the bug by moving the callback structs into static memory, so unspecified members are zeroed. Also constify them while we're at it. This bug was found by syzkaller, which encountered the following splat: IP: 0x55 PGD 3b1ca067 P4D 3b1ca067 PUD 3b12f067 PMD 0 Oops: 0010 [#1] SMP KASAN Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: CPU: 2 PID: 1194 Comm: kworker/u8:1 Not tainted 4.13.0-rc4-next-20170811 #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 Workqueue: kstrp strp_work task: ffff88006bb0e480 task.stack: ffff88006bb10000 RIP: 0010:0x55 RSP: 0018:ffff88006bb17540 EFLAGS: 00010246 RAX: dffffc0000000000 RBX: ffff88006ce4bd60 RCX: 0000000000000000 RDX: 1ffff1000d9c97bd RSI: 0000000000000000 RDI: ffff88006ce4bc48 RBP: ffff88006bb17558 R08: ffffffff81467ab2 R09: 0000000000000000 R10: ffff88006bb17438 R11: ffff88006bb17940 R12: ffff88006ce4bc48 R13: ffff88003c683018 R14: ffff88006bb17980 R15: ffff88003c683000 FS: 0000000000000000(0000) GS:ffff88006de00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000055 CR3: 000000003c145000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: process_one_work+0xbf3/0x1bc0 kernel/workqueue.c:2098 worker_thread+0x223/0x1860 kernel/workqueue.c:2233 kthread+0x35e/0x430 kernel/kthread.c:231 ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:431 Code: Bad RIP value. RIP: 0x55 RSP: ffff88006bb17540 CR2: 0000000000000055 ---[ end trace f0e4920047069cee ]--- Here is a C reproducer (requires CONFIG_BPF_SYSCALL=y and CONFIG_AF_KCM=y): #include <linux/bpf.h> #include <linux/kcm.h> #include <linux/types.h> #include <stdint.h> #include <sys/ioctl.h> #include <sys/socket.h> #include <sys/syscall.h> #include <unistd.h> static const struct bpf_insn bpf_insns[3] = { { .code = 0xb7 }, /* BPF_MOV64_IMM(0, 0) / { .code = 0x95 }, / BPF_EXIT_INSN() */ }; static const union bpf_attr bpf_attr = { .prog_type = 1, .insn_cnt = 2, .insns = (uintptr_t)&bpf_insns, .license = (uintptr_t)"", }; int main(void) { int bpf_fd = syscall(__NR_bpf, BPF_PROG_LOAD, &bpf_attr, sizeof(bpf_attr)); int inet_fd = socket(AF_INET, SOCK_STREAM, 0); int kcm_fd = socket(AF_KCM, SOCK_DGRAM, 0); ioctl(kcm_fd, SIOCKCMATTACH, &(struct kcm_attach) { .fd = inet_fd, .bpf_fd = bpf_fd }); } Fixes: `bbb03029a8` ("strparser: Generalize strparser") Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Tom Herbert <tom@quantonium.net> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-24 21:57:50 -07:00
Jakub Sitnicki	b673d6ccea	ipv6: Use multipath hash from flow info if available Allow our callers to influence the choice of ECMP link by honoring the hash passed together with the flow info. This allows for special treatment of ICMP errors which we would like to route over the same path as the IPv6 datagram that triggered the error. Also go through rt6_multipath_hash(), in the usual case when we aren't dealing with an ICMP error, so that there is one central place where multipath hash is computed. Signed-off-by: Jakub Sitnicki <jkbs@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-24 18:21:17 -07:00
Jakub Sitnicki	956b45318a	ipv6: Fold rt6_info_hash_nhsfn() into its only caller Commit `644d0e6569` ("ipv6 Use get_hash_from_flowi6 for rt6 hash") has turned rt6_info_hash_nhsfn() into a one-liner, so it no longer makes sense to keep it around. Also remove the accompanying comment that has become outdated. Signed-off-by: Jakub Sitnicki <jkbs@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-24 18:21:17 -07:00
Jakub Sitnicki	23aebdacb0	ipv6: Compute multipath hash for ICMP errors from offending packet When forwarding or sending out an ICMPv6 error, look at the embedded packet that triggered the error and compute a flow hash over its headers. This let's us route the ICMP error together with the flow it belongs to when multipath (ECMP) routing is in use, which in turn makes Path MTU Discovery work in ECMP load-balanced or anycast setups (RFC 7690). Granted, end-hosts behind the ECMP router (aka servers) need to reflect the IPv6 Flow Label for PMTUD to work. The code is organized to be in parallel with ipv4 stack: ip_multipath_l3_keys -> ip6_multipath_l3_keys fib_multipath_hash -> rt6_multipath_hash Signed-off-by: Jakub Sitnicki <jkbs@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-24 18:21:17 -07:00
Jakub Sitnicki	22b6722bfa	ipv6: Add sysctl for per namespace flow label reflection Reflecting IPv6 Flow Label at server nodes is useful in environments that employ multipath routing to load balance the requests. As "IPv6 Flow Label Reflection" standard draft [1] points out - ICMPv6 PTB error messages generated in response to a downstream packets from the server can be routed by a load balancer back to the original server without looking at transport headers, if the server applies the flow label reflection. This enables the Path MTU Discovery past the ECMP router in load-balance or anycast environments where each server node is reachable by only one path. Introduce a sysctl to enable flow label reflection per net namespace for all newly created sockets. Same could be earlier achieved only per socket by setting the IPV6_FL_F_REFLECT flag for the IPV6_FLOWLABEL_MGR socket option. [1] https://tools.ietf.org/html/draft-wang-6man-flow-label-reflection-01 Signed-off-by: Jakub Sitnicki <jkbs@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-24 18:05:43 -07:00
Jesper Dangaard Brouer	a873585587	xdp: remove net_device names from xdp_redirect tracepoint There is too much overhead in the current trace_xdp_redirect tracepoint as it does strcpy and strlen on the net_device names. Besides, exposing the ifindex/index is actually the information that is needed in the tracepoint to diagnose issues. When a lookup fails (either ifindex or devmap index) then there is a need for saying which to_index that have issues. V2: Adjust args to be aligned with trace_xdp_exception. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-24 11:59:37 -07:00
Jesper Dangaard Brouer	2facaad600	xdp: make generic xdp redirect use tracepoint trace_xdp_redirect If the xdp_do_generic_redirect() call fails, it trigger the trace_xdp_exception tracepoint. It seems better to use the same tracepoint trace_xdp_redirect, as the native xdp_do_redirect{,_map} does. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-24 11:59:36 -07:00
Jesper Dangaard Brouer	d08adb82fd	xdp: remove bpf_warn_invalid_xdp_redirect Given there is a tracepoint that can track the error code of xdp_do_redirect calls, the WARN_ONCE in bpf_warn_invalid_xdp_redirect doesn't seem relevant any longer. Simply remove the function. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-24 11:59:36 -07:00
Arkadi Sharshevsky	3580732448	devlink: Move dpipe entry clear function into devlink The entry clear routine can be shared between the drivers, thus it is moved inside devlink. Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-24 09:33:16 -07:00
Arkadi Sharshevsky	ffd3cdccf2	devlink: Add support for dynamic table size Up until now the dpipe table's size was static and known at registration time. The host table does not have constant size and it is resized in dynamic manner. In order to support this behavior the size is changed to be obtained dynamically via an op. This patch also adjust the current dpipe table for the new API. Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-24 09:33:16 -07:00
Arkadi Sharshevsky	3fb886ecea	devlink: Add IPv4 header for dpipe This will be used by the IPv4 host table which will be introduced in the following patches. This header is global and can be reused by many drivers. Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-24 09:33:16 -07:00
Arkadi Sharshevsky	1177009131	devlink: Add Ethernet header for dpipe This will be used by the IPv4 host table which will be introduced in the following patches. This header is global and can be reused by many drivers. Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-24 09:33:15 -07:00
Colin Ian King	60890e0460	gre: remove duplicated assignment of iph iph is being assigned the same value twice; remove the redundant first assignment. (Thanks to Nikolay Aleksandrov for pointing out that the first asssignment should be removed and not the second) Fixes warning: net/ipv4/ip_gre.c:265:2: warning: Value stored to 'iph' is never read Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-23 22:36:03 -07:00
Arvind Yadav	042a90106b	net: tipc: constify genl_ops genl_ops are not supposed to change at runtime. All functions working with genl_ops provided by <net/genetlink.h> work with const genl_ops. So mark the non-const structs as const. Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-23 22:31:38 -07:00
Paolo Abeni	257a73031d	net/sock: allow the user to set negative peek offset This is necessary to allow the user to disable peeking with offset once it's enabled. Unix sockets already allow the above, with this patch we permit it for udp[6] sockets, too. Fixes: `627d2d6b55` ("udp: enable MSG_PEEK at non-zero offset") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-23 22:18:44 -07:00
Xin Long	5f9ae3d9e7	ipv4: do metrics match when looking up and deleting a route Now when ipv4 route inserts a fib_info, it memcmp fib_metrics. It means ipv4 route identifies one route also with metrics. But when removing a route, it tries to find the route without caring about the metrics. It will cause that the route with right metrics can't be removed. Thomas noticed this issue when doing the testing: 1. add: # ip route append 192.168.7.0/24 dev v window 1000 # ip route append 192.168.7.0/24 dev v window 1001 # ip route append 192.168.7.0/24 dev v window 1002 # ip route append 192.168.7.0/24 dev v window 1003 2. delete: # ip route delete 192.168.7.0/24 dev v window 1002 3. show: 192.168.7.0/24 proto boot scope link window 1001 192.168.7.0/24 proto boot scope link window 1002 192.168.7.0/24 proto boot scope link window 1003 The one with window 1002 wasn't deleted but the first one was. This patch is to do metrics match when looking up and deleting one route. Reported-by: Thomas Haller <thaller@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-23 20:37:10 -07:00
Mike Maloney	98aaa913b4	tcp: Extend SOF_TIMESTAMPING_RX_SOFTWARE to TCP recvmsg When SOF_TIMESTAMPING_RX_SOFTWARE is enabled for tcp sockets, return the timestamp corresponding to the highest sequence number data returned. Previously the skb->tstamp is overwritten when a TCP packet is placed in the out of order queue. While the packet is in the ooo queue, save the timestamp in the TCB_SKB_CB. This space is shared with the gso_* options which are only used on the tx path, and a previously unused 4 byte hole. When skbs are coalesced either in the sk_receive_queue or the out_of_order_queue always choose the timestamp of the appended skb to maintain the invariant of returning the timestamp of the last byte in the recvmsg buffer. Signed-off-by: Mike Maloney <maloney@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-23 20:30:47 -07:00
William Tu	e3d0328c76	gre: fix goto statement typo Fix typo: pnet_tap_faied. Signed-off-by: William Tu <u9012063@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-22 21:28:37 -07:00
Daniel Borkmann	e4a8e817d3	bpf: misc xdp redirect cleanups Few cleanups including: bpf_redirect_map() is really XDP only due to the return code. Move it to a more appropriate location where we do the XDP redirect handling and change it's name into bpf_xdp_redirect_map() to make it consistent to the bpf_xdp_redirect() helper. xdp_do_redirect_map() helper can be static since only used out of filter.c file. Drop the goto in xdp_do_generic_redirect() and only return errors directly. In xdp_do_flush_map() only clear ri->map_to_flush which is the arg we're using in that function, ri->map is cleared earlier along with ri->ifindex. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-22 21:26:29 -07:00
Eric Dumazet	9695fe6f21	net: sched: use kvmalloc() for class hash tables High order GFP_KERNEL allocations can stress the host badly. Use modern kvmalloc_array()/kvfree() instead of custom allocations. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-22 14:37:24 -07:00
William Tu	84e54fe0a5	gre: introduce native tunnel support for ERSPAN The patch adds ERSPAN type II tunnel support. The implementation is based on the draft at [1]. One of the purposes is for Linux box to be able to receive ERSPAN monitoring traffic sent from the Cisco switch, by creating a ERSPAN tunnel device. In addition, the patch also adds ERSPAN TX, so Linux virtual switch can redirect monitored traffic to the ERSPAN tunnel device. The traffic will be encapsulated into ERSPAN and sent out. The implementation reuses tunnel key as ERSPAN session ID, and field 'erspan' as ERSPAN Index fields: ./ip link add dev ers11 type erspan seq key 100 erspan 123 \ local 172.16.1.200 remote 172.16.1.100 To use the above device as ERSPAN receiver, configure Nexus 5000 switch as below: monitor session 100 type erspan-source erspan-id 123 vrf default destination ip 172.16.1.200 source interface Ethernet1/11 both source interface Ethernet1/12 both no shut monitor erspan origin ip-address 172.16.1.100 global [1] https://tools.ietf.org/html/draft-foschiano-erspan-01 [2] iproute2 patch: http://marc.info/?l=linux-netdev&m=150306086924951&w=2 [3] test script: http://marc.info/?l=linux-netdev&m=150231021807304&w=2 Signed-off-by: William Tu <u9012063@gmail.com> Signed-off-by: Meenakshi Vohra <mvohra@vmware.com> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-22 14:29:30 -07:00
Willem de Bruijn	ab2fb7e324	udp: remove unreachable ufo branches Remove two references to ufo in the udp send path that are no longer reachable now that ufo has been removed. Commit `85f1bd9a7b` ("udp: consistently apply ufo or fragmentation") is a fix to ufo. It is safe to revert what remains of it. Also, no skb can enter ip_append_page with skb_is_gso true now that skb_shinfo(skb)->gso_type is no longer set in ip_append_page/_data. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-22 14:27:18 -07:00
Tonghao Zhang	1119936927	tcp: Remove the unused parameter for tcp_try_fastopen. Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-22 14:16:12 -07:00
Tonghao Zhang	49c71586a6	tcp: Get a proper dst before checking it. tcp_peer_is_proven needs a proper route to make the determination, but dst always is NULL. This bug may be there at the beginning of git tree. This does not look serious enough to deserve backports to stable versions. Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-22 14:16:12 -07:00
David S. Miller	e2a7c34fb2	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	2017-08-21 17:06:42 -07:00
Jon Paul Maloy	40501f90ed	tipc: don't reset stale broadcast send link When the broadcast send link after 100 attempts has failed to transfer a packet to all peers, we consider it stale, and reset it. Thereafter it needs to re-synchronize with the peers, something currently done by just resetting and re-establishing all links to all peers. This has turned out to be overkill, with potentially unwanted consequences for the remaining cluster. A closer analysis reveals that this can be done much simpler. When this kind of failure happens, for reasons that may lie outside the TIPC protocol, it is typically only one peer which is failing to receive and acknowledge packets. It is hence sufficient to identify and reset the links only to that peer to resolve the situation, without having to reset the broadcast link at all. This solution entails a much lower risk of negative consequences for the own node as well as for the overall cluster. We implement this change in this commit. Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-21 13:37:45 -07:00
David Lamparter	e65a4955b0	net: check type when freeing metadata dst Commit `3fcece12bc` ("net: store port/representator id in metadata_dst") added a new type field to metadata_dst, but metadata_dst_free() wasn't updated to check it before freeing the METADATA_IP_TUNNEL specific dst cache entry. This is not currently causing problems since it's far enough back in the struct to be zeroed for the only other type currently in existance (METADATA_HW_PORT_MUX), but nevertheless it's not correct. Fixes: `3fcece12bc` ("net: store port/representator id in metadata_dst") Signed-off-by: David Lamparter <equinox@diac24.net> Cc: Jakub Kicinski <jakub.kicinski@netronome.com> Cc: Sridhar Samudrala <sridhar.samudrala@intel.com> Cc: Simon Horman <horms@verge.net.au> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-21 10:57:38 -07:00
David Ahern	4832c30d54	net: ipv6: put host and anycast routes on device with address One nagging difference between ipv4 and ipv6 is host routes for ipv6 addresses are installed using the loopback device or VRF / L3 Master device. e.g., 2001:db8:1::/120 dev veth0 proto kernel metric 256 pref medium local 2001:db8:1::1 dev lo table local proto kernel metric 0 pref medium Using the loopback device is convenient -- necessary for local tx, but has some nasty side effects, most notably setting the 'lo' device down causes all host routes for all local IPv6 address to be removed from the FIB and completely breaks IPv6 networking across all interfaces. This patch puts FIB entries for IPv6 routes against the device. This simplifies the routes in the FIB, for example by making dst->dev and rt6i_idev->dev the same (a future patch can look at removing the device reference taken for rt6i_idev for FIB entries). When copies are made on FIB lookups, the cloned route has dst->dev set to loopback (or the L3 master device). This is needed for the local Tx of packets to local addresses. With fib entries allocated against the real network device, the addrconf code that reinserts host routes on admin up of 'lo' is no longer needed. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-21 10:40:17 -07:00
Florian Westphal	89e49506bc	dsa: remove unused net_device arg from handlers compile tested only, but saw no warnings/errors with allmodconfig build. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-21 10:39:11 -07:00
David S. Miller	a43dce9358	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2017-08-21 1) Support RX checksum with IPsec crypto offload for esp4/esp6. From Ilan Tayari. 2) Fixup IPv6 checksums when doing IPsec crypto offload. From Yossi Kuperman. 3) Auto load the xfrom offload modules if a user installs a SA that requests IPsec offload. From Ilan Tayari. 4) Clear RX offload informations in xfrm_input to not confuse the TX path with stale offload informations. From Ilan Tayari. 5) Allow IPsec GSO for local sockets if the crypto operation will be offloaded. 6) Support setting of an output mark to the xfrm_state. This mark can be used to to do the tunnel route lookup. From Lorenzo Colitti. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-21 09:29:47 -07:00
Wei Wang	348a400272	ipv6: repair fib6 tree in failure case In fib6_add(), it is possible that fib6_add_1() picks an intermediate node and sets the node's fn->leaf to NULL in order to add this new route. However, if fib6_add_rt2node() fails to add the new route for some reason, fn->leaf will be left as NULL and could potentially cause crash when fn->leaf is accessed in fib6_locate(). This patch makes sure fib6_repair_tree() is called to properly repair fn->leaf in the above failure case. Here is the syzkaller reported general protection fault in fib6_locate: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] SMP KASAN Modules linked in: CPU: 0 PID: 40937 Comm: syz-executor3 Not tainted Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 task: ffff8801d7d64100 ti: ffff8801d01a0000 task.ti: ffff8801d01a0000 RIP: 0010:[<ffffffff82a3e0e1>] [<ffffffff82a3e0e1>] __ipv6_prefix_equal64_half include/net/ipv6.h:475 [inline] RIP: 0010:[<ffffffff82a3e0e1>] [<ffffffff82a3e0e1>] ipv6_prefix_equal include/net/ipv6.h:492 [inline] RIP: 0010:[<ffffffff82a3e0e1>] [<ffffffff82a3e0e1>] fib6_locate_1 net/ipv6/ip6_fib.c:1210 [inline] RIP: 0010:[<ffffffff82a3e0e1>] [<ffffffff82a3e0e1>] fib6_locate+0x281/0x3c0 net/ipv6/ip6_fib.c:1233 RSP: 0018:ffff8801d01a36a8 EFLAGS: 00010202 RAX: 0000000000000020 RBX: ffff8801bc790e00 RCX: ffffc90002983000 RDX: 0000000000001219 RSI: ffff8801d01a37a0 RDI: 0000000000000100 RBP: ffff8801d01a36f0 R08: 00000000000000ff R09: 0000000000000000 R10: 0000000000000003 R11: 0000000000000000 R12: 0000000000000001 R13: dffffc0000000000 R14: ffff8801d01a37a0 R15: 0000000000000000 FS: 00007f6afd68c700(0000) GS:ffff8801db400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000004c6340 CR3: 00000000ba41f000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Stack: ffff8801d01a37a8 ffff8801d01a3780 ffffed003a0346f5 0000000c82a23ea0 ffff8800b7bd7700 ffff8801d01a3780 ffff8800b6a1c940 ffffffff82a23ea0 ffff8801d01a3920 ffff8801d01a3748 ffffffff82a223d6 ffff8801d7d64988 Call Trace: [<ffffffff82a223d6>] ip6_route_del+0x106/0x570 net/ipv6/route.c:2109 [<ffffffff82a23f9d>] inet6_rtm_delroute+0xfd/0x100 net/ipv6/route.c:3075 [<ffffffff82621359>] rtnetlink_rcv_msg+0x549/0x7a0 net/core/rtnetlink.c:3450 [<ffffffff8274c1d1>] netlink_rcv_skb+0x141/0x370 net/netlink/af_netlink.c:2281 [<ffffffff82613ddf>] rtnetlink_rcv+0x2f/0x40 net/core/rtnetlink.c:3456 [<ffffffff8274ad38>] netlink_unicast_kernel net/netlink/af_netlink.c:1206 [inline] [<ffffffff8274ad38>] netlink_unicast+0x518/0x750 net/netlink/af_netlink.c:1232 [<ffffffff8274b83e>] netlink_sendmsg+0x8ce/0xc30 net/netlink/af_netlink.c:1778 [<ffffffff82564aff>] sock_sendmsg_nosec net/socket.c:609 [inline] [<ffffffff82564aff>] sock_sendmsg+0xcf/0x110 net/socket.c:619 [<ffffffff82564d62>] sock_write_iter+0x222/0x3a0 net/socket.c:834 [<ffffffff8178523d>] new_sync_write+0x1dd/0x2b0 fs/read_write.c:478 [<ffffffff817853f4>] __vfs_write+0xe4/0x110 fs/read_write.c:491 [<ffffffff81786c38>] vfs_write+0x178/0x4b0 fs/read_write.c:538 [<ffffffff817892a9>] SYSC_write fs/read_write.c:585 [inline] [<ffffffff817892a9>] SyS_write+0xd9/0x1b0 fs/read_write.c:577 [<ffffffff82c71e32>] entry_SYSCALL_64_fastpath+0x12/0x17 Note: there is no "Fixes" tag as this seems to be a bug introduced very early. Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-20 20:06:56 -07:00
stephen hemminger	6648c65e7e	net: style cleanups Make code closer to current style. Mostly whitespace changes. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-18 22:38:47 -07:00
stephen hemminger	667e427bc3	net: mark receive queue attributes ro_after_init Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-18 22:38:46 -07:00
stephen hemminger	2b9c758129	net: make queue attributes ro_after_init The XPS queue attributes can be ro_after_init. Also use __ATTR_RX macros to simplify initialization. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-18 22:38:46 -07:00
stephen hemminger	170c658afc	net: make BQL sysfs attributes ro_after_init Also fix macro to not have ; at end. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-18 22:38:46 -07:00
stephen hemminger	718ad681ef	net: drop unused attribute argument from sysfs queue funcs The show and store functions don't need/use the attribute. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-18 22:38:31 -07:00
stephen hemminger	ec6cc5993c	net: make net sysfs attributes ro_after_init The attributes of net devices are immutable. Ideally, attribute groups would contain const attributes but there are too many places that do modifications of list during startup (in other code) to allow that. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-18 22:37:35 -07:00
stephen hemminger	737aec57c6	net: constify net_ns_type_operations This can be const. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-18 22:37:27 -07:00
stephen hemminger	e6d473e635	net: make net_class ro_after_init The net_class in sysfs is only modified on init. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-18 22:37:21 -07:00
stephen hemminger	b793dc5c6e	net: constify netdev_class_file These functions are wrapper arount class_create_file which can take a const attribute. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-18 22:37:12 -07:00

1 2 3 4 5 ...

47581 Commits