Commit Graph

665602 Commits

Author SHA1 Message Date
Wei Yongjun
adeb45cbb5 fib_rules: fix error return code
Fix to return error code -EINVAL from the error handling
case instead of 0, as done elsewhere in this function.

Fixes: 622ec2c9d5 ("net: core: add UID to flows, rules, and routes")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-27 16:35:57 -04:00
Mike Manning
99f906e9ad bridge: add per-port broadcast flood flag
Support for l2 multicast flood control was added in commit b6cb5ac833
("net: bridge: add per-port multicast flood flag"). It allows broadcast
as it was introduced specifically for unknown multicast flood control.
But as broadcast is a special case of multicast, this may also need to
be disabled. For this purpose, introduce a flag to disable the flooding
of received l2 broadcasts. This approach is backwards compatible and
provides flexibility in filtering for the desired packet types.

Cc: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Mike Manning <mmanning@brocade.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-27 16:34:29 -04:00
Gao Feng
06b4fc520d net: fib: Decrease one unnecessary rt cache flush in fib_disable_ip
The func fib_flush already flushes the rt cache if necessary, so it
is not necessary to invoke rt_cache_flush again in fib_disable_ip.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-27 16:33:30 -04:00
Guillaume Nault
1514dc857f l2tp: remove useless device duplication test in l2tp_eth_create()
There's no need to verify that cfg->ifname is unique at this point.
register_netdev() will return -EEXIST if asked to create a device with
a name that's alrealy in use.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-27 16:32:13 -04:00
Zhang Shengju
0575c86b5d net: remove unnecessary carrier status check
Since netif_carrier_on() will do nothing if device's carrier is already
on, so it's unnecessary to do carrier status check.

It's the same for netif_carrier_off().

Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-27 16:31:05 -04:00
Zhang Shengju
8ecbc40ada net: update comment for netif_dormant() function
This patch updates the comment for netif_dormant() function to reflect
the intended usage.

Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-27 16:23:17 -04:00
David Ahern
3993f2cb98 samples/bpf: Add support for SKB_MODE to xdp1 and xdp_tx_iptunnel
Add option to xdp1 and xdp_tx_iptunnel to insert xdp program in
SKB_MODE:
 - update set_link_xdp_fd to take a flags argument that is added to the
   RTM_SETLINK message

 - Add -S option to xdp1 and xdp_tx_iptunnel user code. When passed in
   XDP_FLAGS_SKB_MODE is set in the flags arg passed to set_link_xdp_fd

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-27 12:49:26 -04:00
Herbert Xu
6d684e5469 rhashtable: Cap total number of entries to 2^31
When max_size is not set or if it set to a sufficiently large
value, the nelems counter can overflow.  This would cause havoc
with the automatic shrinking as it would then attempt to fit a
huge number of entries into a tiny hash table.

This patch fixes this by adding max_elems to struct rhashtable
to cap the number of elements.  This is set to 2^31 as nelems is
not a precise count.  This is sufficiently smaller than UINT_MAX
that it should be safe.

When max_size is set max_elems will be lowered to at most twice
max_size as is the status quo.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-27 11:48:24 -04:00
Eric Dumazet
4b726e81da tcp: tcp_rack_reo_timeout() must update tp->tcp_mstamp
I wrongly assumed tp->tcp_mstamp was up to date at the time
tcp_rack_reo_timeout() was called.

It is not true, since we only update tcp->tcp_mstamp when receiving
a packet (as initially done in commit 69e996c58a ("tcp: add
tp->tcp_mstamp field")

tcp_rack_reo_timeout() being called by a timer and not an incoming
packet, we need to refresh tp->tcp_mstamp

Fixes: 7c1c730859 ("tcp: do not pass timestamp to tcp_rack_detect_loss()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-27 11:46:15 -04:00
David S. Miller
b1513c3531 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26 22:39:08 -04:00
Linus Torvalds
f83246089c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc
Pull sparc fixes from David Miller:
 "I didn't want the release to go out without the statx system call
  properly hooked up"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
  sparc: Update syscall tables.
  sparc64: Fill in rest of HAVE_REGS_AND_STACK_ACCESS_API
2017-04-26 15:10:45 -07:00
David Howells
1e2f82d1e9 statx: Kill fd-with-NULL-path support in favour of AT_EMPTY_PATH
With the new statx() syscall, the following both allow the attributes of
the file attached to a file descriptor to be retrieved:

	statx(dfd, NULL, 0, ...);

and:

	statx(dfd, "", AT_EMPTY_PATH, ...);

Change the code to reject the first option, though this means copying
the path and engaging pathwalk for the fstat() equivalent.  dfd can be a
non-directory provided path is "".

[ The timing of this isn't wonderful, but applying this now before we
  have statx() in any released kernel, before anybody starts using the
  NULL special case.    - Linus ]

Fixes: a528d35e8b ("statx: Add a system call to make enhanced file info available")
Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Eric Sandeen <sandeen@sandeen.net>
cc: fstests@vger.kernel.org
cc: linux-api@vger.kernel.org
cc: linux-man@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-04-26 15:05:47 -07:00
Linus Torvalds
fc08b197bb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) MLX5 bug fixes from Saeed Mahameed et al:
     - released wrong resources when firmware timeout happens
     - fix wrong check for encapsulation size limits
     - UAR memory leak
     - ETHTOOL_GRXCLSRLALL failed to fill in info->data

 2) Don't cache l3mdev on mis-matches local route, causes net devices to
    leak refs. From Robert Shearman.

 3) Handle fragmented SKBs properly in macsec driver, the problem is
    that we were mis-sizing the sgvec table. From Jason A. Donenfeld.

 4) We cannot have checksum offload enabled for inner UDP tunneled
    packet during IPSEC, from Ansis Atteka.

 5) Fix double SKB free in ravb driver, from Dan Carpenter.

 6) Fix CPU port handling in b53 DSA driver, from Florian Dainelli.

 7) Don't use on-stack buffers for usb_control_msg() in CAN usb driver,
    from Maksim Salau.

 8) Fix device leak in macvlan driver, from Herbert Xu. We have to purge
    the broadcast queue properly on port destroy.

 9) Fix tx ring entry limit on EF10 devices in sfc driver. From Bert
    Kenward.

10) Fix memory leaks in team driver, from Pan Bian.

11) Don't setup ipv6_stub before it can be actually used, from Paolo
    Abeni.

12) Fix tipc socket flow control accounting, from Parthasarathy
    Bhuvaragan.

13) Fix crash on module unload in hso driver, from Andreas Kemnade.

14) Fix purging of bridge multicast entries, the problem is that if we
    don't defer it to ndo_uninit it's possible for new entries to get
    added after we purge. Fix from Xin Long.

15) Don't return garbage for PACKET_HDRLEN getsockopt, from Alexander
    Potapenko.

16) Fix autoneg stall properly in PHY layer, and revert micrel driver
    change that was papering over it. From Alexander Kochetkov.

17) Don't dereference an ipv4 route as an ipv6 one in the ip6_tunnnel
    code, from Cong Wang.

18) Clear out the congestion control private of the TCP socket in all of
    the right places, from Wei Wang.

19) rawv6_ioctl measures SKB length incorrectly, fix from Jamie
    Bainbridge.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (41 commits)
  ipv6: check raw payload size correctly in ioctl
  tcp: memset ca_priv data to 0 properly
  ipv6: check skb->protocol before lookup for nexthop
  net: core: Prevent from dereferencing null pointer when releasing SKB
  macsec: dynamically allocate space for sglist
  Revert "phy: micrel: Disable auto negotiation on startup"
  net: phy: fix auto-negotiation stall due to unavailable interrupt
  net/packet: check length in getsockopt() called with PACKET_HDRLEN
  net: ipv6: regenerate host route if moved to gc list
  bridge: move bridge multicast cleanup to ndo_uninit
  ipv6: fix source routing
  qed: Fix error in the dcbx app meta data initialization.
  netvsc: fix calculation of available send sections
  net: hso: fix module unloading
  tipc: fix socket flow control accounting error at tipc_recv_stream
  tipc: fix socket flow control accounting error at tipc_send_stream
  ipv6: move stub initialization after ipv6 setup completion
  team: fix memory leaks
  sfc: tx ring can only have 2048 entries for all EF10 NICs
  macvlan: Fix device ref leak when purging bc_queue
  ...
2017-04-26 13:42:32 -07:00
Jamie Bainbridge
105f5528b9 ipv6: check raw payload size correctly in ioctl
In situations where an skb is paged, the transport header pointer and
tail pointer can be the same because the skb contents are in frags.

This results in ioctl(SIOCINQ/FIONREAD) incorrectly returning a
length of 0 when the length to receive is actually greater than zero.

skb->len is already correctly set in ip6_input_finish() with
pskb_pull(), so use skb->len as it always returns the correct result
for both linear and paged data.

Signed-off-by: Jamie Bainbridge <jbainbri@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26 14:59:35 -04:00
Wei Wang
c120144407 tcp: memset ca_priv data to 0 properly
Always zero out ca_priv data in tcp_assign_congestion_control() so that
ca_priv data is cleared out during socket creation.
Also always zero out ca_priv data in tcp_reinit_congestion_control() so
that when cc algorithm is changed, ca_priv data is cleared out as well.
We should still zero out ca_priv data even in TCP_CLOSE state because
user could call connect() on AF_UNSPEC to disconnect the socket and
leave it in TCP_CLOSE state and later call setsockopt() to switch cc
algorithm on this socket.

Fixes: 2b0a8c9ee ("tcp: add CDG congestion control")
Reported-by: Andrey Konovalov  <andreyknvl@google.com>
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26 14:58:32 -04:00
WANG Cong
199ab00f3c ipv6: check skb->protocol before lookup for nexthop
Andrey reported a out-of-bound access in ip6_tnl_xmit(), this
is because we use an ipv4 dst in ip6_tnl_xmit() and cast an IPv4
neigh key as an IPv6 address:

        neigh = dst_neigh_lookup(skb_dst(skb),
                                 &ipv6_hdr(skb)->daddr);
        if (!neigh)
                goto tx_err_link_failure;

        addr6 = (struct in6_addr *)&neigh->primary_key; // <=== HERE
        addr_type = ipv6_addr_type(addr6);

        if (addr_type == IPV6_ADDR_ANY)
                addr6 = &ipv6_hdr(skb)->daddr;

        memcpy(&fl6->daddr, addr6, sizeof(fl6->daddr));

Also the network header of the skb at this point should be still IPv4
for 4in6 tunnels, we shold not just use it as IPv6 header.

This patch fixes it by checking if skb->protocol is ETH_P_IPV6: if it
is, we are safe to do the nexthop lookup using skb_dst() and
ipv6_hdr(skb)->daddr; if not (aka IPv4), we have no clue about which
dest address we can pick here, we have to rely on callers to fill it
from tunnel config, so just fall to ip6_route_output() to make the
decision.

Fixes: ea3dc9601b ("ip6_tunnel: Add support for wildcard tunnel endpoints.")
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26 14:51:26 -04:00
Willem de Bruijn
78a57b482a virtio-net: on tx, only call napi_disable if tx napi is on
As of tx napi, device down (`ip link set dev $dev down`) hangs unless
tx napi is enabled. Else napi_enable is not called, so napi_disable
will spin on test_and_set_bit NAPI_STATE_SCHED.

Only call napi_disable if tx napi is enabled.

Fixes: 5a719c2552ca ("virtio-net: transmit napi")
Reported-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26 14:50:06 -04:00
David S. Miller
1f7fe5d492 Merge branch 'ibmvnic-Move-sub-crq-init-out-of-interrupt-context'
Nathan Fontenot says:

====================
ibmvnic: Move sub crq init out of interrupt context

The sub crqs are currently intialized in interrupt context when
handling a crq response fromn the vios server. There is no reason
they must be initialized there.

Moving the initialization of the sub crqs to the ibmvnic_init routine
allows us to do the initialization outside of interrupt context and
make all of the allocations with GFP_KERNEL instead of GFP_ATOMIC.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26 14:49:16 -04:00
Nathan Fontenot
1bb3c739ad ibmvnic: Move initialization of sub crqs to ibmvnic_init
The sub crq structures are initialized in interrupt context while
handling the response to crqs when negotiating capabilities for
the driver. The sub crqs do not need to be initialized at this point
and can be moved to being done from ibmvnic_init. Moving the init
of the sub crqs to ibmvnic_init also allows use to allocate the
memory with GFP_KERNEL instead of GFP_ATOMIC.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26 14:49:15 -04:00
Nathan Fontenot
d346b9bc4f ibmvnic: Split initialization of scrqs to its own routine
Split the sending of capability request crqs and the initialization
of sub crqs into their own routines. This is a first step to moving
the allocation of sub-crqs out of interrupt context.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26 14:49:15 -04:00
Myungho Jung
9899886d5e net: core: Prevent from dereferencing null pointer when releasing SKB
Added NULL check to make __dev_kfree_skb_irq consistent with kfree
family of functions.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=195289

Signed-off-by: Myungho Jung <mhjungk@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26 14:47:14 -04:00
Florian Fainelli
4c5e7a2c05 dt-bindings: mdio: Clarify binding document
The described GPIO reset property is applicable to *all* child PHYs. If
we have one reset line per PHY present on the MDIO bus, these
automatically become properties of the child PHY nodes.

Finally, indicate how the RESET pulse width must be defined, which is
the maximum value of all individual PHYs RESET pulse widths determined
by reading their datasheets.

Fixes: 69226896ad ("mdio_bus: Issue GPIO RESET to PHYs.")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Roger Quadros <rogerq@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26 14:45:34 -04:00
David S. Miller
4629498336 Merge branch 'tcp-do-not-use-tcp_time_stamp-for-rcv-autotuning'
Eric Dumazet says:

====================
tcp: do not use tcp_time_stamp for rcv autotuning

Some devices or linux distributions use HZ=100 or HZ=250

TCP receive buffer autotuning has poor behavior caused by this choice.
Since autotuning happens after 4 ms or 10 ms, short distance flows
get their receive buffer tuned to a very high value, but after an initial
period where it was frozen to (too small) initial value.

With BBR (or other CC allowing to increase BDP), we are willing to
increase tcp_rmem[2], but this receive autotuning defect is a blocker
for hosts dealing with gazillions of TCP flows in the data centers,
since many of them have inflated RCVBUF. Risk of OOM is too high.

Note that TSO autodefer, tcp cubic, and TCP TS options (RFC 7323)
also suffer from our dependency to jiffies (via tcp_time_stamp).

We have ongoing efforts to improve all that in the future.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26 14:44:39 -04:00
Eric Dumazet
645f4c6f2e tcp: switch rcv_rtt_est and rcvq_space to high resolution timestamps
Some devices or distributions use HZ=100 or HZ=250

TCP receive buffer autotuning has poor behavior caused by this choice.
Since autotuning happens after 4 ms or 10 ms, short distance flows
get their receive buffer tuned to a very high value, but after an initial
period where it was frozen to (too small) initial value.

With tp->tcp_mstamp introduction, we can switch to high resolution
timestamps almost for free (at the expense of 8 additional bytes per
TCP structure)

Note that some TCP stacks use usec TCP timestamps where this
patch makes even more sense : Many TCP flows have < 500 usec RTT.
Hopefully this finer TS option can be standardized soon.

Tested:
 HZ=100 kernel
 ./netperf -H lpaa24 -t TCP_RR -l 1000 -- -r 10000,10000 &

 Peer without patch :
 lpaa24:~# ss -tmi dst lpaa23
 ...
 skmem:(r0,rb8388608,...)
 rcv_rtt:10 rcv_space:3210000 minrtt:0.017

 Peer with the patch :
 lpaa23:~# ss -tmi dst lpaa24
 ...
 skmem:(r0,rb428800,...)
 rcv_rtt:0.069 rcv_space:30000 minrtt:0.017

We can see saner RCVBUF, and more precise rcv_rtt information.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26 14:44:39 -04:00
Eric Dumazet
a6db50b81e tcp: remove ack_time from struct tcp_sacktag_state
It is no longer needed, everything uses tp->tcp_mstamp instead.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26 14:44:38 -04:00
Eric Dumazet
7e0ca8a4c1 tcp: use tp->tcp_mstamp in tcp_clean_rtx_queue()
Following patch will remove ack_time from struct tcp_sacktag_state

Same info is now found in tp->tcp_mstamp

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26 14:44:38 -04:00
Eric Dumazet
d2329f102d tcp: do not pass timestamp to tcp_rack_advance()
No longer needed, since tp->tcp_mstamp holds the information.

This is needed to remove sack_state.ack_time in a following patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26 14:44:38 -04:00
Eric Dumazet
88d5c65098 tcp: do not pass timestamp to tcp_rate_gen()
No longer needed, since tp->tcp_mstamp holds the information.

This is needed to remove sack_state.ack_time in a following patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26 14:44:38 -04:00
Eric Dumazet
1317a9d69f tcp: do not pass timestamp to tcp_fastretrans_alert()
Not used anymore now tp->tcp_mstamp holds the information.

This is needed to remove sack_state.ack_time in a following patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26 14:44:38 -04:00
Eric Dumazet
efab8f8582 tcp: do not pass timestamp to tcp_rack_identify_loss()
Not used anymore now tp->tcp_mstamp holds the information.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26 14:44:37 -04:00
Eric Dumazet
128eda86be tcp: do not pass timestamp to tcp_rack_mark_lost()
This is no longer used, since tcp_rack_detect_loss() takes
the timestamp from tp->tcp_mstamp

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26 14:44:37 -04:00
Eric Dumazet
7c1c730859 tcp: do not pass timestamp to tcp_rack_detect_loss()
We can use tp->tcp_mstamp as it contains a recent timestamp.

This removes a call to skb_mstamp_get() from tcp_rack_reo_timeout()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26 14:44:37 -04:00
Eric Dumazet
69e996c58a tcp: add tp->tcp_mstamp field
We want to use precise timestamps in TCP stack, but we do not
want to call possibly expensive kernel time services too often.

tp->tcp_mstamp is guaranteed to be updated once per incoming packet.

We will use it in the following patches, removing specific
skb_mstamp_get() calls, and removing ack_time from
struct tcp_sacktag_state.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26 14:44:36 -04:00
Jason A. Donenfeld
5294b83086 macsec: dynamically allocate space for sglist
We call skb_cow_data, which is good anyway to ensure we can actually
modify the skb as such (another error from prior). Now that we have the
number of fragments required, we can safely allocate exactly that amount
of memory.

Fixes: c09440f7dc ("macsec: introduce IEEE 802.1AE driver")
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Acked-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26 14:41:53 -04:00
Florian Westphal
038a3e858d rhashtable: remove insecure_max_entries param
no users in the tree, insecure_max_entries is always set to
ht->p.max_size * 2 in rhtashtable_init().

Replace only spot that uses it with a ht->p.max_size check.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26 14:39:24 -04:00
David S. Miller
b43bd72835 Revert "phy: micrel: Disable auto negotiation on startup"
This reverts commit 99f81afc13.

It was papering over the real problem, which is fixed by commit
f555f34fdc ("net: phy: fix auto-negotiation stall due to unavailable
interrupt")

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26 14:34:07 -04:00
Alexander Kochetkov
f555f34fdc net: phy: fix auto-negotiation stall due to unavailable interrupt
The Ethernet link on an interrupt driven PHY was not coming up if the Ethernet
cable was plugged before the Ethernet interface was brought up.

The patch trigger PHY state machine to update link state if PHY was requested to
do auto-negotiation and auto-negotiation complete flag already set.

During power-up cycle the PHY do auto-negotiation, generate interrupt and set
auto-negotiation complete flag. Interrupt is handled by PHY state machine but
doesn't update link state because PHY is in PHY_READY state. After some time
MAC bring up, start and request PHY to do auto-negotiation. If there are no new
settings to advertise genphy_config_aneg() doesn't start PHY auto-negotiation.
PHY continue to stay in auto-negotiation complete state and doesn't fire
interrupt. At the same time PHY state machine expect that PHY started
auto-negotiation and is waiting for interrupt from PHY and it won't get it.

Fixes: 321beec504 ("net: phy: Use interrupts when available in NOLINK state")
Signed-off-by: Alexander Kochetkov <al.kochet@gmail.com>
Cc: stable <stable@vger.kernel.org> # v4.9+
Tested-by: Roger Quadros <rogerq@ti.com>
Tested-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26 14:32:00 -04:00
Linus Torvalds
ea3a8596a0 sound fixes for 4.11
Since we got a bonus week, let me try to screw a few pending fixes.
 
 A slightly large fix is the locking fix in ASoC STI driver, but it's
 pretty board-specific, and the risk is fairly low.
 
 All the rest are small / trivial fixes, mostly marked as stable, for
 ALSA sequencer core, ASoC topology, ASoC Intel bytcr and Firewire
 drivers.
 -----BEGIN PGP SIGNATURE-----
 
 iQJCBAABCAAsFiEECxfAB4MH3rD5mfB6bDGAVD0pKaQFAlkAczcOHHRpd2FpQHN1
 c2UuZGUACgkQbDGAVD0pKaQP+Q//Z2yK7q80sbEpX6CVtTpX86Y4qNCPdO5v4p5Y
 25w/kR1P1lpGUTl2FYv+CRB6hjjaNc5egTJqPGk2zjKT1ZSOR+Y1KE+woaTJ9iFF
 fATKzQPaLFy+fB5w2egqB4WZnk0jx285QkDlcLPNIOrnaQwSvoc/Hlhxf+HmClhq
 U2PN9ihdd9sX1XeyhmueJx98hm4BQ+MhVX4V1W86qF+qOOXmZ70pYQS/ertSpCwj
 yf8na1wrWB0DChwdT44ZdHDhcnvPMqPI+Wg5R/kpJenPSdPjLht9hVkGz3Faj1TJ
 30aEHzEiJClnbrJJ5vIF+5e2IXbMeNs/otdal3QmDG/u6bKUbeqNRVI0aNvqRL+r
 RpwR01rDrxTuq/aL3Dn0vYoIwK49DvLLq74vovN1lySEvA6Mb0h7kH8PK6R8z9fC
 yxHHDbnujo7eqLuQiwOmPDUjlRmulRn5XdnxdQtvcik4PWWfBxDO8xQNS9baVwFZ
 lRwAsyuv6CAlLMGh5UbYQNnMj69KNd7HHmif4xtay2py64+NS92Q6Aba3ji50ww6
 x3PEaxPta581d/b5czaJWZw1xqH+19k0U0v9ssla0OyOwHs7UFdm4NTY7hv/J6VD
 MfG6XRfNEqUH/+OOoAGqYh2YywkDSj7KoyWgwX1E8RXMNnWuBQ8+I+LJH9fCtS2F
 0Y5xVFs=
 =TQDC
 -----END PGP SIGNATURE-----

Merge tag 'sound-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound

Pull sound fixes from Takashi Iwai:
 "Since we got a bonus week, let me try to screw a few pending fixes.

  A slightly large fix is the locking fix in ASoC STI driver, but it's
  pretty board-specific, and the risk is fairly low.

  All the rest are small / trivial fixes, mostly marked as stable, for
  ALSA sequencer core, ASoC topology, ASoC Intel bytcr and Firewire
  drivers"

* tag 'sound-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
  ASoC: intel: Fix PM and non-atomic crash in bytcr drivers
  ALSA: firewire-lib: fix inappropriate assignment between signed/unsigned type
  ALSA: seq: Don't break snd_use_lock_sync() loop by timeout
  ASoC: topology: Fix to store enum text values
  ASoC: STI: Fix null ptr deference in IRQ handler
  ALSA: oxfw: fix regression to handle Stanton SCS.1m/1d
2017-04-26 09:30:33 -07:00
Linus Torvalds
ea839b4174 Last minute fixes for ARC
- Build error in Mellanox nps platform
 
  - addressing lack of saving FPU regs in releavnt configs
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIbBAABAgAGBQJY/7Y9AAoJEGnX8d3iisJerOIP+OIw4iInJEycs2A1lQxMTUAa
 DFny0uv1EuWD5cC7g1O3m/SGIeN14zt6WBo40VHgxiOw/VFDsk4fze/QyfA+FFj0
 Qr7c5UDQKsWCNUNwkAfUYqp5/HWM9cO+WlUr3JbKL5sThh0SOBhb09T+Z9BEh+Ns
 jMu4VKa1Jh8RuXPd0Wb86CEhkRalTySTbY3og6g4F+HYRPvvqHvigdJ3yPI/mh0E
 ToPbZcqCNqgpJ0vyHak+QXb1gSbMjlevmjmokJR9O49Ypb8jEvQjwkZE0I81N793
 ayEftdCL9zVQ3WCcGVrhviXOMuxvDcimNUrAQiDJBqBSGEjq9AVaejEi1DIOyT3X
 InN1SmZPKno64OuZSFVDF2IuwTAumNXjp886of7LNy76lnznov2VvTXiiJSG/2ZE
 QEyhIq++DWFns+1IerJfU7QjzHEmx3StibBgPYur6K1dN49iHsFBTgRGgKzEhd0L
 iUOQQajg0RoYBhrcS/HQXBQgJhu0Fpkx6csDTtmW66hgw95I6ZoPJshk8SvouRTg
 Lif/U/qOd86Bghwj0V5pw/504LQ9i8xfSZzidM+St8XvhEa/DwuTggVuoIzOB8AH
 6qOExsN75jwExjAR71JsI33jNNTdTQ7HaHaqB1DQJ+2y12QmttNXaPy/iLYqQNsN
 iLeZ6MKdCrXKr73t2Js=
 =wA0V
 -----END PGP SIGNATURE-----

Merge tag 'arc-4.11-final' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc

Pull ARC fix from Vineet Gupta:
 "Last minute fixes for ARC:

   - build error in Mellanox nps platform

   - addressing lack of saving FPU regs in releavnt configs"

* tag 'arc-4.11-final' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
  ARCv2: entry: save Accumulator register pair (r58:59) if present
  ARC: [plat-eznps] Fix build error
2017-04-25 14:07:24 -07:00
Eric Dumazet
7acedaf5c4 net: move xdp_prog field in RX cache lines
(struct net_device, xdp_prog) field should be moved in RX cache lines,
reducing latencies when a single packet is received on idle host,
since netif_elide_gro() needs it.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-25 16:25:36 -04:00
Alexander Potapenko
fd2c83b357 net/packet: check length in getsockopt() called with PACKET_HDRLEN
In the case getsockopt() is called with PACKET_HDRLEN and optlen < 4
|val| remains uninitialized and the syscall may behave differently
depending on its value, and even copy garbage to userspace on certain
architectures. To fix this we now return -EINVAL if optlen is too small.

This bug has been detected with KMSAN.

Signed-off-by: Alexander Potapenko <glider@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-25 14:05:52 -04:00
David Ahern
8048ced9be net: ipv6: regenerate host route if moved to gc list
Taking down the loopback device wreaks havoc on IPv6 routing. By
extension, taking down a VRF device wreaks havoc on its table.

Dmitry and Andrey both reported heap out-of-bounds reports in the IPv6
FIB code while running syzkaller fuzzer. The root cause is a dead dst
that is on the garbage list gets reinserted into the IPv6 FIB. While on
the gc (or perhaps when it gets added to the gc list) the dst->next is
set to an IPv4 dst. A subsequent walk of the ipv6 tables causes the
out-of-bounds access.

Andrey's reproducer was the key to getting to the bottom of this.

With IPv6, host routes for an address have the dst->dev set to the
loopback device. When the 'lo' device is taken down, rt6_ifdown initiates
a walk of the fib evicting routes with the 'lo' device which means all
host routes are removed. That process moves the dst which is attached to
an inet6_ifaddr to the gc list and marks it as dead.

The recent change to keep global IPv6 addresses added a new function,
fixup_permanent_addr, that is called on admin up. That function restarts
dad for an inet6_ifaddr and when it completes the host route attached
to it is inserted into the fib. Since the route was marked dead and
moved to the gc list, re-inserting the route causes the reported
out-of-bounds accesses. If the device with the address is taken down
or the address is removed, the WARN_ON in fib6_del is triggered.

All of those faults are fixed by regenerating the host route if the
existing one has been moved to the gc list, something that can be
determined by checking if the rt6i_ref counter is 0.

Fixes: f1705ec197 ("net: ipv6: Make address flushing on ifdown optional")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-25 14:04:44 -04:00
Xin Long
b1b9d36602 bridge: move bridge multicast cleanup to ndo_uninit
During removing a bridge device, if the bridge is still up, a new mdb entry
still can be added in br_multicast_add_group() after all mdb entries are
removed in br_multicast_dev_del(). Like the path:

  mld_ifc_timer_expire ->
    mld_sendpack -> ...
      br_multicast_rcv ->
        br_multicast_add_group

The new mp's timer will be set up. If the timer expires after the bridge
is freed, it may cause use-after-free panic in br_multicast_group_expired.

BUG: unable to handle kernel NULL pointer dereference at 0000000000000048
IP: [<ffffffffa07ed2c8>] br_multicast_group_expired+0x28/0xb0 [bridge]
Call Trace:
 <IRQ>
 [<ffffffff81094536>] call_timer_fn+0x36/0x110
 [<ffffffffa07ed2a0>] ? br_mdb_free+0x30/0x30 [bridge]
 [<ffffffff81096967>] run_timer_softirq+0x237/0x340
 [<ffffffff8108dcbf>] __do_softirq+0xef/0x280
 [<ffffffff8169889c>] call_softirq+0x1c/0x30
 [<ffffffff8102c275>] do_softirq+0x65/0xa0
 [<ffffffff8108e055>] irq_exit+0x115/0x120
 [<ffffffff81699515>] smp_apic_timer_interrupt+0x45/0x60
 [<ffffffff81697a5d>] apic_timer_interrupt+0x6d/0x80

Nikolay also found it would cause a memory leak - the mdb hash is
reallocated and not freed due to the mdb rehash.

unreferenced object 0xffff8800540ba800 (size 2048):
  backtrace:
    [<ffffffff816e2287>] kmemleak_alloc+0x67/0xc0
    [<ffffffff81260bea>] __kmalloc+0x1ba/0x3e0
    [<ffffffffa05c60ee>] br_mdb_rehash+0x5e/0x340 [bridge]
    [<ffffffffa05c74af>] br_multicast_new_group+0x43f/0x6e0 [bridge]
    [<ffffffffa05c7aa3>] br_multicast_add_group+0x203/0x260 [bridge]
    [<ffffffffa05ca4b5>] br_multicast_rcv+0x945/0x11d0 [bridge]
    [<ffffffffa05b6b10>] br_dev_xmit+0x180/0x470 [bridge]
    [<ffffffff815c781b>] dev_hard_start_xmit+0xbb/0x3d0
    [<ffffffff815c8743>] __dev_queue_xmit+0xb13/0xc10
    [<ffffffff815c8850>] dev_queue_xmit+0x10/0x20
    [<ffffffffa02f8d7a>] ip6_finish_output2+0x5ca/0xac0 [ipv6]
    [<ffffffffa02fbfc6>] ip6_finish_output+0x126/0x2c0 [ipv6]
    [<ffffffffa02fc245>] ip6_output+0xe5/0x390 [ipv6]
    [<ffffffffa032b92c>] NF_HOOK.constprop.44+0x6c/0x240 [ipv6]
    [<ffffffffa032bd16>] mld_sendpack+0x216/0x3e0 [ipv6]
    [<ffffffffa032d5eb>] mld_ifc_timer_expire+0x18b/0x2b0 [ipv6]

This could happen when ip link remove a bridge or destroy a netns with a
bridge device inside.

With Nikolay's suggestion, this patch is to clean up bridge multicast in
ndo_uninit after bridge dev is shutdown, instead of br_dev_delete, so
that netif_running check in br_multicast_add_group can avoid this issue.

v1->v2:
  - fix this issue by moving br_multicast_dev_del to ndo_uninit, instead
    of calling dev_close in br_dev_delete.

(NOTE: Depends upon b6fe0440c6 ("bridge: implement missing ndo_uninit()"))

Fixes: e10177abf8 ("bridge: multicast: fix handling of temp and perm entries")
Reported-by: Jianwen Ji <jiji@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-25 14:02:39 -04:00
Sabrina Dubroca
ec9c4215fe ipv6: fix source routing
Commit a149e7c7ce ("ipv6: sr: add support for SRH injection through
setsockopt") introduced handling of IPV6_SRCRT_TYPE_4, but at the same
time restricted it to only IPV6_SRCRT_TYPE_0 and
IPV6_SRCRT_TYPE_4. Previously, ipv6_push_exthdr() and fl6_update_dst()
would also handle other values (ie STRICT and TYPE_2).

Restore previous source routing behavior, by handling IPV6_SRCRT_STRICT
and IPV6_SRCRT_TYPE_2 the same way as IPV6_SRCRT_TYPE_0 in
ipv6_push_exthdr() and fl6_update_dst().

Fixes: a149e7c7ce ("ipv6: sr: add support for SRH injection through setsockopt")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-25 13:59:24 -04:00
Wei Yongjun
5e1fc7c5ba drivers: net: xgene-v2: Fix error return code in xge_mdio_config()
Fix to return error code -ENODEV from the no PHY found error
handling case instead of 0, as done elsewhere in this function.

Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-25 13:48:06 -04:00
David S. Miller
b5cdae3291 net: Generic XDP
This provides a generic SKB based non-optimized XDP path which is used
if either the driver lacks a specific XDP implementation, or the user
requests it via a new IFLA_XDP_FLAGS value named XDP_FLAGS_SKB_MODE.

It is arguable that perhaps I should have required something like
this as part of the initial XDP feature merge.

I believe this is critical for two reasons:

1) Accessibility.  More people can play with XDP with less
   dependencies.  Yes I know we have XDP support in virtio_net, but
   that just creates another depedency for learning how to use this
   facility.

   I wrote this to make life easier for the XDP newbies.

2) As a model for what the expected semantics are.  If there is a pure
   generic core implementation, it serves as a semantic example for
   driver folks adding XDP support.

One thing I have not tried to address here is the issue of
XDP_PACKET_HEADROOM, thanks to Daniel for spotting that.  It seems
incredibly expensive to do a skb_cow(skb, XDP_PACKET_HEADROOM) or
whatever even if the XDP program doesn't try to push headers at all.
I think we really need the verifier to somehow propagate whether
certain XDP helpers are used or not.

v5:
 - Handle both negative and positive offset after running prog
 - Fix mac length in XDP_TX case (Alexei)
 - Use rcu_dereference_protected() in free_netdev (kbuild test robot)

v4:
 - Fix MAC header adjustmnet before calling prog (David Ahern)
 - Disable LRO when generic XDP is installed (Michael Chan)
 - Bypass qdisc et al. on XDP_TX and record the event (Alexei)
 - Do not perform generic XDP on reinjected packets (DaveM)

v3:
 - Make sure XDP program sees packet at MAC header, push back MAC
   header if we do XDP_TX.  (Alexei)
 - Elide GRO when generic XDP is in use.  (Alexei)
 - Add XDP_FLAG_SKB_MODE flag which the user can use to request generic
   XDP even if the driver has an XDP implementation.  (Alexei)
 - Report whether SKB mode is in use in rtnl_xdp_fill() via XDP_FLAGS
   attribute.  (Daniel)

v2:
 - Add some "fall through" comments in switch statements based
   upon feedback from Andrew Lunn
 - Use RCU for generic xdp_prog, thanks to Johannes Berg.

Tested-by: Andy Gospodarek <andy@greyhouse.net>
Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
Tested-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-25 13:33:49 -04:00
Wei Yongjun
2f7878c06e qed: fix invalid use of sizeof in qed_alloc_qm_data()
sizeof() when applied to a pointer typed expression gives the
size of the pointer, not that of the pointed data.

Fixes: b5a9ee7cf3 ("qed: Revise QM configuration")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Acked-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-25 11:59:36 -04:00
sudarsana.kalluru@cavium.com
c8fcd133ea qed: Fix error in the dcbx app meta data initialization.
DCBX app_data array is initialized with the incorrect values for
personality field. This would  prevent offloaded protocols from
honoring the PFC.

Signed-off-by: Sudarsana Reddy Kalluru <Sudarsana.Kalluru@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-25 11:58:36 -04:00
Teng Qin
8fe4592438 bpf: map_get_next_key to return first key on NULL
When iterating through a map, we need to find a key that does not exist
in the map so map_get_next_key will give us the first key of the map.
This often requires a lot of guessing in production systems.

This patch makes map_get_next_key return the first key when the key
pointer in the parameter is NULL.

Signed-off-by: Teng Qin <qinteng@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-25 11:57:45 -04:00
stephen hemminger
fdfb70d275 netvsc: fix calculation of available send sections
My change (introduced in 4.11) to use find_first_clear_bit
incorrectly assumed that the size argument was words, not bits.
The effect was only a small limited number of the available send
sections were being actually used. This can cause performance loss
with some workloads.

Since map_words is now used only during initialization, it can
be on stack instead of in per-device data.

Fixes: b58a185801 ("netvsc: simplify get next send section")
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-25 11:56:59 -04:00