Commit Graph

23410 Commits

Author SHA1 Message Date
David S. Miller
8e77327783 inet: Add inetpeer tree roots to the FIB tables.
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-11 02:09:16 -07:00
David S. Miller
b48c80ece9 inet: Add family scope inetpeer flushes.
This implementation can deal with having many inetpeer roots, which is
a necessary prerequisite for per-FIB table rooted peer tables.

Each family (AF_INET, AF_INET6) has a sequence number which we bump
when we get a family invalidation request.

Each peer lookup cheaply checks whether the flush sequence of the
root we are using is out of date, and if so flushes it and updates
the sequence number.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-11 02:09:10 -07:00
David S. Miller
46517008e1 ipv4: Kill ip_rt_frag_needed().
There is zero point to this function.

It's only real substance is to perform an extremely outdated BSD4.2
ICMP check, which we can safely remove.  If you really have a MTU
limited link being routed by a BSD4.2 derived system, here's a nickel
go buy yourself a real router.

The other actions of ip_rt_frag_needed(), checking and conditionally
updating the peer, are done by the per-protocol handlers of the ICMP
event.

TCP, UDP, et al. have a handler which will receive this event and
transmit it back into the associated route via dst_ops->update_pmtu().

This simplification is important, because it eliminates the one place
where we do not have a proper route context in which to make an
inetpeer lookup.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-11 02:08:59 -07:00
David S. Miller
97bab73f98 inet: Hide route peer accesses behind helpers.
We encode the pointer(s) into an unsigned long with one state bit.

The state bit is used so we can store the inetpeer tree root to use
when resolving the peer later.

Later the peer roots will be per-FIB table, and this change works to
facilitate that.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-11 02:08:47 -07:00
David S. Miller
c0efc887dc inet: Pass inetpeer root into inet_getpeer*() interfaces.
Otherwise we reference potentially non-existing members when
ipv6 is disabled.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-09 19:12:36 -07:00
Eric Dumazet
8b51b064a6 af_unix: remove unix_iter_state
As pointed out by Michael Tokarev , struct unix_iter_state is no longer
needed.

Suggested-by: Michael Tokarev <mjt@tls.msk.ru>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-09 19:06:21 -07:00
David S. Miller
2b823f7258 ipv6: Do not mark ipv6_inetpeer_ops as __net_initdata.
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-09 19:00:16 -07:00
David S. Miller
56a6b248eb inet: Consolidate inetpeer_invalidate_tree() interfaces.
We only need one interface for this operation, since we always know
which inetpeer root we want to flush.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-09 16:32:41 -07:00
David S. Miller
c3426b4719 inet: Initialize per-netns inetpeer roots in net/ipv{4,6}/route.c
Instead of net/ipv4/inetpeer.c

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-09 16:27:05 -07:00
David S. Miller
2397849baa [PATCH] tcp: Cache inetpeer in timewait socket, and only when necessary.
Since it's guarenteed that we will access the inetpeer if we're trying
to do timewait recycling and TCP options were enabled on the
connection, just cache the peer in the timewait socket.

In the future, inetpeer lookups will be context dependent (per routing
realm), and this helps facilitate that as well.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-09 14:56:12 -07:00
David S. Miller
4670fd819e tcp: Get rid of inetpeer special cases.
The get_peer method TCP uses is full of special cases that make no
sense accommodating, and it also gets in the way of doing more
reasonable things here.

First of all, if the socket doesn't have a usable cached route, there
is no sense in trying to optimize timewait recycling.

Likewise for the case where we have IP options, such as SRR enabled,
that make the IP header destination address (and thus the destination
address of the route key) differ from that of the connection's
destination address.

Just return a NULL peer in these cases, and thus we're also able to
get rid of the clumsy inetpeer release logic.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-09 01:25:47 -07:00
David S. Miller
fbfe95a42e inet: Create and use rt{,6}_get_peer_create().
There's a lot of places that open-code rt{,6}_get_peer() only because
they want to set 'create' to one.  So add an rt{,6}_get_peer_create()
for their sake.

There were also a few spots open-coding plain rt{,6}_get_peer() and
those are transformed here as well.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-08 23:24:18 -07:00
Eric Dumazet
7123aaa3a1 af_unix: speedup /proc/net/unix
/proc/net/unix has quadratic behavior, and can hold unix_table_lock for
a while if high number of unix sockets are alive. (90 ms for 200k
sockets...)

We already have a hash table, so its quite easy to use it.

Problem is unbound sockets are still hashed in a single hash slot
(unix_socket_table[UNIX_HASH_TABLE])

This patch also spreads unbound sockets to 256 hash slots, to speedup
both /proc/net/unix and unix_diag.

Time to read /proc/net/unix with 200k unix sockets :
(time dd if=/proc/net/unix of=/dev/null bs=4k)

before : 520 secs
after : 2 secs

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-08 14:27:23 -07:00
Gao feng
54db0cc2ba inetpeer: add parameter net for inet_getpeer_v4,v6
add struct net as a parameter of inet_getpeer_v[4,6],
use net to replace &init_net.

and modify some places to provide net for inet_getpeer_v[4,6]

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-08 14:27:23 -07:00
Gao feng
c8a627ed06 inetpeer: add namespace support for inetpeer
now inetpeer doesn't support namespace,the information will
be leaking across namespace.

this patch move the global vars v4_peers and v6_peers to
netns_ipv4 and netns_ipv6 as a field peers.

add struct pernet_operations inetpeer_ops to initial pernet
inetpeer data.

and change family_to_base and inet_getpeer to support namespace.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-08 14:27:23 -07:00
Yuval Mintz
80f12eccce Added kernel support in EEE Ethtool commands
This patch extends the kernel's ethtool interface by adding support
for 2 new EEE commands - get_eee and set_eee.

Thanks goes to Giuseppe Cavallaro for his original patch adding this support.

Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Reviewed-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-07 13:18:54 -07:00
Ben Hutchings
94b6042cfe net: Update kernel-doc for __alloc_skb()
__alloc_skb() now extends tailroom to allow the use of padding added
by the heap allocator.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-07 13:18:54 -07:00
David S. Miller
c1864cfb80 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2012-06-06 15:06:41 -07:00
David S. Miller
da2e852612 Merge branch 'for-davem' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless
John Linville says:
====================
Amitkumar Karwar gives us a cfg80211 fix that changes some state
tracking in order to avoid a WARNING.

Arik Nemtsov provide a mac80211 fix for an RCU-related race.

Avinash Patil shares a pair of mwifiex fixes, one which invalidates
some stale configuration data before a channel change and another to
restrict hidden SSID support to zero-length SSIDs only.

Chun-Yeow Yeoh brings a mac80211 fix for a mesh problem triggered
when combining multiple mesh networks into one.

Felix Fietkau provides a mac80211 lockdep fix.

Joe Perches fixes a couple of thinkos related to bitwise operations.

Johannes Berg comes through with a flurry of fixes.  The iwlwifi ones
address a problem Linus recently reported, and some of the fallout
discovered while fixing it.  The mac80211 fix properly cleans-up
remain-on-channel work on an interface that is stopped.  The others
are clean-ups for regressions caused by stricter checking of possible
virtual interfaces supported by wireless drivers.

Meenakshi Venkataraman provides a mac80211 fix for an off-by-one error.

Seth Forshee provides a fix to make the wireless adapters used in
some Mac boxes work after being in S3 power saving state.

Stanislaw Gruszka offers a copule of fixes, a fix for a mac80211
scanning regression and an rt2x00 fix to avoid some lockdep spew.

Last but not least, Vinicius Costa Gomes provides a bluetooth fix
for a typo that "was preventing important features of Bluetooth
from working".
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-06 13:30:02 -07:00
David S. Miller
9b97b84eb5 Merge branch 'master' of git://gitorious.org/linux-can/linux-can-next 2012-06-06 11:13:26 -07:00
John W. Linville
4e924fec59 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless into for-davem 2012-06-06 14:02:56 -04:00
John W. Linville
2d4524ac18 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth 2012-06-06 13:57:05 -04:00
Eric Dumazet
55432d2b54 inetpeer: fix a race in inetpeer_gc_worker()
commit 5faa5df1fa (inetpeer: Invalidate the inetpeer tree along with
the routing cache) added a race :

Before freeing an inetpeer, we must respect a RCU grace period, and make
sure no user will attempt to increase refcnt.

inetpeer_invalidate_tree() waits for a RCU grace period before inserting
inetpeer tree into gc_list and waking the worker. At that time, no
concurrent lookup can find a inetpeer in this tree.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-06 10:45:15 -07:00
Johannes Berg
463454b5db cfg80211: fix interface combinations check
If a given interface combination doesn't contain
a required interface type then we missed checking
that and erroneously allowed it even though iface
type wasn't there at all. Add a check that makes
sure that all interface types are accounted for.

Cc: stable@kernel.org
Reported-by: Mohammed Shafi Shajakhan <mohammed@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2012-06-05 14:27:30 -04:00
Vinicius Costa Gomes
ddcd0f4147 Bluetooth: Fix checking the wrong flag when accepting a socket
Most probably a typo, the check should have been for BT_SK_DEFER_SETUP
instead of BT_DEFER_SETUP (which right now only represents a socket
option).

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@openbossa.org>
Acked-by: Andrei Emeltchenko <andrei.emeltchenko@intel.com>
Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
2012-06-05 06:26:26 +03:00
Arik Nemtsov
794454ce72 mac80211: fix non RCU-safe sta_list manipulation
sta_info_cleanup locks the sta_list using rcu_read_lock however
the delete operation isn't rcu safe. A race between sta_info_cleanup
timer being called and a STA being removed can occur which leads
to a panic while traversing sta_list. Fix this by switching to the
RCU-safe versions.

Cc: stable@vger.kernel.org
Reported-by: Eyal Shapira <eyal@wizery.com>
Signed-off-by: Arik Nemtsov <arik@wizery.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2012-06-04 15:26:54 -04:00
Joe Perches
5204267d2f mac80211: Fix likely misuse of | for &
Using | with a constant is always true.
Likely this should have be &.

cc: Ben Greear <greearb@candelatech.com>
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2012-06-04 15:25:56 -04:00
Felix Fietkau
d8c7aae64c mac80211: add missing rcu_read_lock/unlock in agg-rx session timer
Fixes a lockdep warning:

===================================================
[ INFO: suspicious rcu_dereference_check() usage. ]
---------------------------------------------------
net/mac80211/agg-rx.c:148 invoked rcu_dereference_check() without protection!

other info that might help us debug this:

rcu_scheduler_active = 1, debug_locks = 1
1 lock held by arecord/11226:
 #0:  (&tid_agg_rx->session_timer){+.-...}, at: [<ffffffff81066bb0>] call_timer_fn+0x0/0x360

stack backtrace:
Pid: 11226, comm: arecord Not tainted 3.1.0-kml #16
Call Trace:
 <IRQ>  [<ffffffff81093454>] lockdep_rcu_dereference+0xa4/0xc0
 [<ffffffffa02778c9>] sta_rx_agg_session_timer_expired+0xc9/0x110 [mac80211]
 [<ffffffffa0277800>] ? ieee80211_process_addba_resp+0x220/0x220 [mac80211]
 [<ffffffff81066c3a>] call_timer_fn+0x8a/0x360
 [<ffffffff81066bb0>] ? init_timer_deferrable_key+0x30/0x30
 [<ffffffff81477bb0>] ? _raw_spin_unlock_irq+0x30/0x70
 [<ffffffff81067049>] run_timer_softirq+0x139/0x310
 [<ffffffff81091d5e>] ? put_lock_stats.isra.25+0xe/0x40
 [<ffffffff810922ac>] ? lock_release_holdtime.part.26+0xdc/0x160
 [<ffffffffa0277800>] ? ieee80211_process_addba_resp+0x220/0x220 [mac80211]
 [<ffffffff8105cb78>] __do_softirq+0xc8/0x3c0
 [<ffffffff8108f088>] ? tick_dev_program_event+0x48/0x110
 [<ffffffff8108f16f>] ? tick_program_event+0x1f/0x30
 [<ffffffff81153b15>] ? putname+0x35/0x50
 [<ffffffff8147a43c>] call_softirq+0x1c/0x30
 [<ffffffff81004c55>] do_softirq+0xa5/0xe0
 [<ffffffff8105d1ee>] irq_exit+0xae/0xe0
 [<ffffffff8147ac6b>] smp_apic_timer_interrupt+0x6b/0x98
 [<ffffffff81479ab3>] apic_timer_interrupt+0x73/0x80
 <EOI>  [<ffffffff8146aac6>] ? free_debug_processing+0x1a1/0x1d5
 [<ffffffff81153b15>] ? putname+0x35/0x50
 [<ffffffff8146ab2b>] __slab_free+0x31/0x2ca
 [<ffffffff81477c3a>] ? _raw_spin_unlock_irqrestore+0x4a/0x90
 [<ffffffff81253b8f>] ? __debug_check_no_obj_freed+0x15f/0x210
 [<ffffffff81097054>] ? lock_release_nested+0x84/0xc0
 [<ffffffff8113ec55>] ? kmem_cache_free+0x105/0x250
 [<ffffffff81153b15>] ? putname+0x35/0x50
 [<ffffffff81153b15>] ? putname+0x35/0x50
 [<ffffffff8113ed8f>] kmem_cache_free+0x23f/0x250
 [<ffffffff81153b15>] putname+0x35/0x50
 [<ffffffff81146d8d>] do_sys_open+0x16d/0x1d0
 [<ffffffff81146e10>] sys_open+0x20/0x30
 [<ffffffff81478f42>] system_call_fastpath+0x16/0x1b

Reported-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2012-06-04 15:25:41 -04:00
Johannes Berg
71ecfa1893 mac80211: clean up remain-on-channel on interface stop
When any interface goes down, it could be the one that we
were doing a remain-on-channel with. We therefore need to
cancel the remain-on-channel and flush the related work
structs so they don't run after the interface has been
removed or even destroyed.

It's also possible in this case that an off-channel SKB
was never transmitted, so free it if this is the case.
Note that this can also happen if the driver finishes
the off-channel period without ever starting it.

Cc: stable@kernel.org
Reported-by: Nirav Shah <nirav.j2.shah@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2012-06-04 15:25:26 -04:00
Meenakshi Venkataraman
bd34ab62a3 mac80211: fix error in station state transitions during reconfig
As part of hardware reconfig mac80211 tries
to restore the station state to its values
before the hardware reconfig, but it only
goes to the last-state - 1. Fix this
off-by-one error.

Cc: stable@kernel.org [3.4]
Signed-off-by: Meenakshi Venkataraman <meenakshi.venkataraman@intel.com>
Reviewed-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2012-06-04 15:24:56 -04:00
Chun-Yeow Yeoh
b8bacc187a mac80211: Fix Unreachable Mesh Station Problem when joining to another MBSS
Mesh station that joins an MBSS is reachable using mesh portal with 6
address frame by mesh stations from another MBSS if these two different
MBSSes are bridged. However, if the mesh station later moves into the
same MBSS of those mesh stations, it is unreachable by mesh stations
in the MBSS due to the mpp_paths table is not deleted. A quick fix
is to perform mesh_path_lookup, if it is available for the target
destination, mpp_path_lookup is not performed. When the mesh station
moves back to its original MBSS, the mesh_paths will be deleted once
expired. So, it will be reachable using mpp_path_lookup again.

Signed-off-by: Chun-Yeow Yeoh <yeohchunyeow@gmail.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2012-06-04 15:23:23 -04:00
Amitkumar Karwar
28f333666e cfg80211: use sme_state in ibss start/join path
CFG80211_DEV_WARN_ON() at "net/wireless/ibss.c line 63"
is unnecessarily triggered even after successful connection,
when cfg80211_ibss_joined() is called by driver inside
.join_ibss handler.

This patch fixes the problem by changing 'sme_state' in ibss path
and having WARN_ON() check for 'sme_state' similar to infra
association.

Signed-off-by: Amitkumar Karwar <akarwar@marvell.com>
Signed-off-by: Bing Zhao <bzhao@marvell.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2012-06-04 15:21:28 -04:00
Stanislaw Gruszka
925e64c3c5 mac80211: run scan after finish connection monitoring
commit 133d40f9a2
Author: Stanislaw Gruszka <sgruszka@redhat.com>
Date:   Wed Mar 28 16:01:19 2012 +0200

    mac80211: do not scan and monitor connection in parallel

add bug, which make possible to start a scan and never finish it, so
make every new scanning request finish with -EBUSY error. This can
happen on code paths where we finish connection monitoring and clear
IEEE80211_STA_*_POLL flags, but do not check if scan was deferred.
This patch fixes those code paths.

Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2012-06-04 15:21:18 -04:00
Joe Perches
f07d90107c net/9p: Add __force to cast of __user pointer
A recent commit that removed unnecessary casts of pointers
to the same type uncovered a missing __force cast.

Add it.

Reported by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-04 13:51:17 -04:00
Joe Perches
e3192690a3 net: Remove casts to same type
Adding casts of objects to the same type is unnecessary
and confusing for a human reader.

For example, this cast:

	int y;
	int *p = (int *)&y;

I used the coccinelle script below to find and remove these
unnecessary casts.  I manually removed the conversions this
script produces of casts with __force and __user.

@@
type T;
T *p;
@@

-	(T *)p
+	p

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-04 11:45:11 -04:00
Eric Dumazet
bec4596b4e drop_monitor: dont sleep in atomic context
drop_monitor calls several sleeping functions while in atomic context.

 BUG: sleeping function called from invalid context at mm/slub.c:943
 in_atomic(): 1, irqs_disabled(): 0, pid: 2103, name: kworker/0:2
 Pid: 2103, comm: kworker/0:2 Not tainted 3.5.0-rc1+ #55
 Call Trace:
  [<ffffffff810697ca>] __might_sleep+0xca/0xf0
  [<ffffffff811345a3>] kmem_cache_alloc_node+0x1b3/0x1c0
  [<ffffffff8105578c>] ? queue_delayed_work_on+0x11c/0x130
  [<ffffffff815343fb>] __alloc_skb+0x4b/0x230
  [<ffffffffa00b0360>] ? reset_per_cpu_data+0x160/0x160 [drop_monitor]
  [<ffffffffa00b022f>] reset_per_cpu_data+0x2f/0x160 [drop_monitor]
  [<ffffffffa00b03ab>] send_dm_alert+0x4b/0xb0 [drop_monitor]
  [<ffffffff810568e0>] process_one_work+0x130/0x4c0
  [<ffffffff81058249>] worker_thread+0x159/0x360
  [<ffffffff810580f0>] ? manage_workers.isra.27+0x240/0x240
  [<ffffffff8105d403>] kthread+0x93/0xa0
  [<ffffffff816be6d4>] kernel_thread_helper+0x4/0x10
  [<ffffffff8105d370>] ? kthread_freezable_should_stop+0x80/0x80
  [<ffffffff816be6d0>] ? gs_change+0xb/0xb

Rework the logic to call the sleeping functions in right context.

Use standard timer/workqueue api to let system chose any cpu to perform
the allocation and netlink send.

Also avoid a loop if reset_per_cpu_data() cannot allocate memory :
use mod_timer() to wait 1/10 second before next try.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Reviewed-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-04 11:42:01 -04:00
Eric Dumazet
d594e987c6 sock_diag: add SK_MEMINFO_BACKLOG
Adding socket backlog len in INET_DIAG_SKMEMINFO is really useful to
diagnose various TCP problems.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-04 11:27:40 -04:00
Eric Dumazet
5d0ba55b64 net: use consume_skb() in place of kfree_skb()
Remove some dropwatch/drop_monitor false positives.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-04 11:27:40 -04:00
Eric Dumazet
4aea39c11c tcp: tcp_make_synack() consumes dst parameter
tcp_make_synack() clones the dst, and callers release it.

We can avoid two atomic operations per SYNACK if tcp_make_synack()
consumes dst instead of cloning it.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-04 11:27:39 -04:00
Eric Dumazet
90ba9b1986 tcp: tcp_make_synack() can use alloc_skb()
There is no value using sock_wmalloc() in tcp_make_synack().

A listener socket only sends SYNACK packets, they are not queued in a
socket queue, only in Qdisc and device layers, so the number of in
flight packets is limited in these layers. We used sock_wmalloc() with
the %force parameter set to 1 to ignore socket limits anyway.

This patch removes two atomic operations per SYNACK packet.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-04 11:27:39 -04:00
Linus Torvalds
4fc3acf291 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking updates from David Miller:

 1) Make syn floods consume significantly less resources by

    a) Not pre-COW'ing routing metrics for SYN/ACKs
    b) Mirroring the device queue mapping of the SYN for the SYN/ACK
       reply.

    Both from Eric Dumazet.

 2) Fix calculation errors in Byte Queue Limiting, from Hiroaki SHIMODA.

 3) Validate the length requested when building a paged SKB for a
    socket, so we don't overrun the page vector accidently.  From Jason
    Wang.

 4) When netlabel is disabled, we abort all IP option processing when we
    see a CIPSO option.  This isn't the right thing to do, we should
    simply skip over it and continue processing the remaining options
    (if any).  Fix from Paul Moore.

 5) SRIOV fixes for the mellanox driver from Jack orgenstein and Marcel
    Apfelbaum.

 6) 8139cp enables the receiver before the ring address is properly
    programmed, which potentially lets the device crap over random
    memory.  Fix from Jason Wang.

 7) e1000/e1000e fixes for i217 RST handling, and an improper buffer
    address reference in jumbo RX frame processing from Bruce Allan and
    Sebastian Andrzej Siewior, respectively.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
  fec_mpc52xx: fix timestamp filtering
  mcs7830: Implement link state detection
  e1000e: fix Rapid Start Technology support for i217
  e1000: look into the page instead of skb->data for e1000_tbi_adjust_stats()
  r8169: call netif_napi_del at errpaths and at driver unload
  tcp: reflect SYN queue_mapping into SYNACK packets
  tcp: do not create inetpeer on SYNACK message
  8139cp/8139too: terminate the eeprom access with the right opmode
  8139cp: set ring address before enabling receiver
  cipso: handle CIPSO options correctly when NetLabel is disabled
  net: sock: validate data_len before allocating skb in sock_alloc_send_pskb()
  bql: Avoid possible inconsistent calculation.
  bql: Avoid unneeded limit decrement.
  bql: Fix POSDIFF() to integer overflow aware.
  net/mlx4_core: Fix obscure mlx4_cmd_box parameter in QUERY_DEV_CAP
  net/mlx4_core: Check port out-of-range before using in mlx4_slave_cap
  net/mlx4_core: Fixes for VF / Guest startup flow
  net/mlx4_en: Fix improper use of "port" parameter in mlx4_en_event
  net/mlx4_core: Fix number of EQs used in ICM initialisation
  net/mlx4_core: Fix the slave_id out-of-range test in mlx4_eq_int
2012-06-02 16:22:51 -07:00
Linus Torvalds
f309532bf3 tty: Revert the tty locking series, it needs more work
This reverts the tty layer change to use per-tty locking, because it's
not correct yet, and fixing it will require some more deep surgery.

The main revert is d29f3ef39b ("tty_lock: Localise the lock"), but
there are several smaller commits that built upon it, they also get
reverted here. The list of reverted commits is:

  fde86d3108 - tty: add lockdep annotations
  8f6576ad47 - tty: fix ldisc lock inversion trace
  d3ca8b64b9 - pty: Fix lock inversion
  b1d679afd7 - tty: drop the pty lock during hangup
  abcefe5fc3 - tty/amiserial: Add missing argument for tty_unlock()
  fd11b42e35 - cris: fix missing tty arg in wait_event_interruptible_tty call
  d29f3ef39b - tty_lock: Localise the lock

The revert had a trivial conflict in the 68360serial.c staging driver
that got removed in the meantime.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-02 15:21:43 -07:00
Eric Dumazet
fff3269907 tcp: reflect SYN queue_mapping into SYNACK packets
While testing how linux behaves on SYNFLOOD attack on multiqueue device
(ixgbe), I found that SYNACK messages were dropped at Qdisc level
because we send them all on a single queue.

Obvious choice is to reflect incoming SYN packet @queue_mapping to
SYNACK packet.

Under stress, my machine could only send 25.000 SYNACK per second (for
200.000 incoming SYN per second). NIC : ixgbe with 16 rx/tx queues.

After patch, not a single SYNACK is dropped.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hans Schillstrom <hans.schillstrom@ericsson.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-01 14:22:11 -04:00
Eric Dumazet
7433819a1e tcp: do not create inetpeer on SYNACK message
Another problem on SYNFLOOD/DDOS attack is the inetpeer cache getting
larger and larger, using lots of memory and cpu time.

tcp_v4_send_synack()
->inet_csk_route_req()
 ->ip_route_output_flow()
  ->rt_set_nexthop()
   ->rt_init_metrics()
    ->inet_getpeer( create = true)

This is a side effect of commit a4daad6b09 (net: Pre-COW metrics for
TCP) added in 2.6.39

Possible solution :

Instruct inet_csk_route_req() to remove FLOWI_FLAG_PRECOW_METRICS

Before patch :

# grep peer /proc/slabinfo
inet_peer_cache   4175430 4175430    192   42    2 : tunables    0    0    0 : slabdata  99415  99415      0

Samples: 41K of event 'cycles', Event count (approx.): 30716565122
+  20,24%      ksoftirqd/0  [kernel.kallsyms]           [k] inet_getpeer
+   8,19%      ksoftirqd/0  [kernel.kallsyms]           [k] peer_avl_rebalance.isra.1
+   4,81%      ksoftirqd/0  [kernel.kallsyms]           [k] sha_transform
+   3,64%      ksoftirqd/0  [kernel.kallsyms]           [k] fib_table_lookup
+   2,36%      ksoftirqd/0  [ixgbe]                     [k] ixgbe_poll
+   2,16%      ksoftirqd/0  [kernel.kallsyms]           [k] __ip_route_output_key
+   2,11%      ksoftirqd/0  [kernel.kallsyms]           [k] kernel_map_pages
+   2,11%      ksoftirqd/0  [kernel.kallsyms]           [k] ip_route_input_common
+   2,01%      ksoftirqd/0  [kernel.kallsyms]           [k] __inet_lookup_established
+   1,83%      ksoftirqd/0  [kernel.kallsyms]           [k] md5_transform
+   1,75%      ksoftirqd/0  [kernel.kallsyms]           [k] check_leaf.isra.9
+   1,49%      ksoftirqd/0  [kernel.kallsyms]           [k] ipt_do_table
+   1,46%      ksoftirqd/0  [kernel.kallsyms]           [k] hrtimer_interrupt
+   1,45%      ksoftirqd/0  [kernel.kallsyms]           [k] kmem_cache_alloc
+   1,29%      ksoftirqd/0  [kernel.kallsyms]           [k] inet_csk_search_req
+   1,29%      ksoftirqd/0  [kernel.kallsyms]           [k] __netif_receive_skb
+   1,16%      ksoftirqd/0  [kernel.kallsyms]           [k] copy_user_generic_string
+   1,15%      ksoftirqd/0  [kernel.kallsyms]           [k] kmem_cache_free
+   1,02%      ksoftirqd/0  [kernel.kallsyms]           [k] tcp_make_synack
+   0,93%      ksoftirqd/0  [kernel.kallsyms]           [k] _raw_spin_lock_bh
+   0,87%      ksoftirqd/0  [kernel.kallsyms]           [k] __call_rcu
+   0,84%      ksoftirqd/0  [kernel.kallsyms]           [k] rt_garbage_collect
+   0,84%      ksoftirqd/0  [kernel.kallsyms]           [k] fib_rules_lookup

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hans Schillstrom <hans.schillstrom@ericsson.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-01 14:22:11 -04:00
Linus Torvalds
1193755ac6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs changes from Al Viro.
 "A lot of misc stuff.  The obvious groups:
   * Miklos' atomic_open series; kills the damn abuse of
     ->d_revalidate() by NFS, which was the major stumbling block for
     all work in that area.
   * ripping security_file_mmap() and dealing with deadlocks in the
     area; sanitizing the neighborhood of vm_mmap()/vm_munmap() in
     general.
   * ->encode_fh() switched to saner API; insane fake dentry in
     mm/cleancache.c gone.
   * assorted annotations in fs (endianness, __user)
   * parts of Artem's ->s_dirty work (jff2 and reiserfs parts)
   * ->update_time() work from Josef.
   * other bits and pieces all over the place.

  Normally it would've been in two or three pull requests, but
  signal.git stuff had eaten a lot of time during this cycle ;-/"

Fix up trivial conflicts in Documentation/filesystems/vfs.txt (the
'truncate_range' inode method was removed by the VM changes, the VFS
update adds an 'update_time()' method), and in fs/btrfs/ulist.[ch] (due
to sparse fix added twice, with other changes nearby).

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (95 commits)
  nfs: don't open in ->d_revalidate
  vfs: retry last component if opening stale dentry
  vfs: nameidata_to_filp(): don't throw away file on error
  vfs: nameidata_to_filp(): inline __dentry_open()
  vfs: do_dentry_open(): don't put filp
  vfs: split __dentry_open()
  vfs: do_last() common post lookup
  vfs: do_last(): add audit_inode before open
  vfs: do_last(): only return EISDIR for O_CREAT
  vfs: do_last(): check LOOKUP_DIRECTORY
  vfs: do_last(): make ENOENT exit RCU safe
  vfs: make follow_link check RCU safe
  vfs: do_last(): use inode variable
  vfs: do_last(): inline walk_component()
  vfs: do_last(): make exit RCU safe
  vfs: split do_lookup()
  Btrfs: move over to use ->update_time
  fs: introduce inode operation ->update_time
  reiserfs: get rid of resierfs_sync_super
  reiserfs: mark the superblock as dirty a bit later
  ...
2012-06-01 10:34:35 -07:00
Linus Torvalds
419f431949 Merge branch 'for-3.5' of git://linux-nfs.org/~bfields/linux
Pull the rest of the nfsd commits from Bruce Fields:
 "... and then I cherry-picked the remainder of the patches from the
  head of my previous branch"

This is the rest of the original nfsd branch, rebased without the
delegation stuff that I thought really needed to be redone.

I don't like rebasing things like this in general, but in this situation
this was the lesser of two evils.

* 'for-3.5' of git://linux-nfs.org/~bfields/linux: (50 commits)
  nfsd4: fix, consolidate client_has_state
  nfsd4: don't remove rebooted client record until confirmation
  nfsd4: remove some dprintk's and a comment
  nfsd4: return "real" sequence id in confirmed case
  nfsd4: fix exchange_id to return confirm flag
  nfsd4: clarify that renewing expired client is a bug
  nfsd4: simpler ordering of setclientid_confirm checks
  nfsd4: setclientid: remove pointless assignment
  nfsd4: fix error return in non-matching-creds case
  nfsd4: fix setclientid_confirm same_cred check
  nfsd4: merge 3 setclientid cases to 2
  nfsd4: pull out common code from setclientid cases
  nfsd4: merge last two setclientid cases
  nfsd4: setclientid/confirm comment cleanup
  nfsd4: setclientid remove unnecessary terms from a logical expression
  nfsd4: move rq_flavor into svc_cred
  nfsd4: stricter cred comparison for setclientid/exchange_id
  nfsd4: move principal name into svc_cred
  nfsd4: allow removing clients not holding state
  nfsd4: rearrange exchange_id logic to simplify
  ...
2012-06-01 08:32:58 -07:00
Al Viro
d58367515f sch_atm.c: get rid of poinless extern
sockfd_lookup() is declared in linux/net.h, which is pulled by
linux/skbuff.h (and needed for a lot of other stuff in sch_atm.c
anyway).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-06-01 10:37:18 -04:00
Linus Torvalds
a00b6151a2 Merge branch 'for-3.5-take-2' of git://linux-nfs.org/~bfields/linux
Pull nfsd update from Bruce Fields.

* 'for-3.5-take-2' of git://linux-nfs.org/~bfields/linux: (23 commits)
  nfsd: trivial: use SEEK_SET instead of 0 in vfs_llseek
  SUNRPC: split upcall function to extract reusable parts
  nfsd: allocate id-to-name and name-to-id caches in per-net operations.
  nfsd: make name-to-id cache allocated per network namespace context
  nfsd: make id-to-name cache allocated per network namespace context
  nfsd: pass network context to idmap init/exit functions
  nfsd: allocate export and expkey caches in per-net operations.
  nfsd: make expkey cache allocated per network namespace context
  nfsd: make export cache allocated per network namespace context
  nfsd: pass pointer to export cache down to stack wherever possible.
  nfsd: pass network context to export caches init/shutdown routines
  Lockd: pass network namespace to creation and destruction routines
  NFSd: remove hard-coded dereferences to name-to-id and id-to-name caches
  nfsd: pass pointer to expkey cache down to stack wherever possible.
  nfsd: use hash table from cache detail in nfsd export seq ops
  nfsd: pass svc_export_cache pointer as private data to "exports" seq file ops
  nfsd: use exp_put() for svc_export_cache put
  nfsd: use cache detail pointer from svc_export structure on cache put
  nfsd: add link to owner cache detail to svc_export structure
  nfsd: use passed cache_detail pointer expkey_parse()
  ...
2012-05-31 18:18:11 -07:00
J. Bruce Fields
d5497fc693 nfsd4: move rq_flavor into svc_cred
Move the rq_flavor into struct svc_cred, and use it in setclientid and
exchange_id comparisons as well.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2012-05-31 20:29:58 -04:00
J. Bruce Fields
03a4e1f6dd nfsd4: move principal name into svc_cred
Instead of keeping the principal name associated with a request in a
structure that's private to auth_gss and using an accessor function,
move it to svc_cred.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2012-05-31 20:29:55 -04:00