Otherwise we beat heavily on the global tcp_memory atomics
when all of the sockets in the system are slowly sending
perioding packet clumps.
Noticed and suggested by Eric Dumazet.
Signed-off-by: David S. Miller <davem@davemloft.net>
I.e. remove the net != &init_net checks from the places, that now can
handle other-than-init net namespace.
Acked-by: Benjamin Thery <benjamin.thery@bull.net>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Benjamin Thery <benjamin.thery@bull.net>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
... up to rtentry_to_fib_config
Acked-by: Benjamin Thery <benjamin.thery@bull.net>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Benjamin Thery <benjamin.thery@bull.net>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes the netlink socket to be per namespace. That allows
to have each namespace its own socket for routing queries.
Acked-by: Benjamin Thery <benjamin.thery@bull.net>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The preparatory work has been done. All we need is to substitute
fib_table_hash with net->ipv4.fib_table_hash. Netns context is
available when required.
Acked-by: Benjamin Thery <benjamin.thery@bull.net>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The final trick for rules: place fib4_rules_ops into struct net and
modify initialization path for this.
Acked-by: Benjamin Thery <benjamin.thery@bull.net>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is the second part (for the CONFIG_IP_FIB_TRIE case) of the patch
#4, where we have created proc files in namespaces.
Now we can dump correct info in them.
Acked-by: Benjamin Thery <benjamin.thery@bull.net>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is the second part (for the CONFIG_IP_FIB_HASH case) of the patch
#4, where we have created proc files in namespaces.
Now we can dump correct info in them.
Acked-by: Benjamin Thery <benjamin.thery@bull.net>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
nl_info is used to track the end-user destination of routing change
notification. This is a natural object to hold a namespace on. Place
it there and utilize the context in the appropriate places.
Acked-by: Benjamin Thery <benjamin.thery@bull.net>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The patch extends the inet_addr_type and inet_dev_addr_type with the
network namespace pointer. That allows to access the different tables
relatively to the network namespace.
The modification of the signature function is reported in all the
callers of the inet_addr_type using the pointer to the well known
init_net.
Acked-by: Benjamin Thery <benjamin.thery@bull.net>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch extends the fib_get_table and the fib_new_table functions
with the network namespace pointer. That will allow to access the
table relatively from the network namespace.
Acked-by: Benjamin Thery <benjamin.thery@bull.net>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace the direct pointers to local and main tables with
calls to fib_get_table() with appropriate argument.
This doesn't introduce additional dereferences, but makes the access to fib
tables uniform in any (CONFIG_IP_MULTIPLE_TABLES) case.
Acked-by: Benjamin Thery <benjamin.thery@bull.net>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes the fib to be initialized as a subsystem for the
network namespaces. The code does not handle several namespaces yet,
so in case of a creation of a network namespace, the
creation/initialization will not occur.
Acked-by: Benjamin Thery <benjamin.thery@bull.net>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds error paths into both versions of fib4_rules_init
(with/without CONFIG_IP_MULTIPLE_TABLES) and returns error code to the
caller.
Acked-by: Benjamin Thery <benjamin.thery@bull.net>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds netns parameter to fib_proc_init/exit and replaces __init
specifier with __net_init. After this, we will not yet have these proc
files show info from the specific namespace - this will be done when
these tables become namespaced.
Acked-by: Benjamin Thery <benjamin.thery@bull.net>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
fib_rules_ops contains operations and the list of configured rules. ops will
become per/namespace soon, so we need them to be known in the default_pref
callback.
Acked-by: Benjamin Thery <benjamin.thery@bull.net>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The patch extends the different fib rules API in order to pass the
network namespace pointer. That will allow to access the different
tables from a namespace relative object. As usual, the pointer to the
init_net variable is passed as parameter so we don't break the
network.
Acked-by: Benjamin Thery <benjamin.thery@bull.net>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This includes the most simple cases for netfilter.
The first part is tne queue modules for ipv4 and ipv6,
on which the net/ipv4/ and net/ipv6/ paths are reused
from the appropriate ipv4 and ipv6 code.
The conntrack module is also patched, but this hunk is
very small and simple.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The feature of ipvs ctls is that the net/ipv4/vs path
is common for core ipvs ctls and for two schedulers,
so I make it exported and re-use it in modules.
Two other .c files required linux/sysctl.h to make the
extern declaration of this path compile well.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
- The DNAT (Destination NAT) is not implemented in IPV4.
- This patch remove the code which checks these flags
in net/ipv4/arp.c and net/ipv4/route.c.
The RTCF_NAT and RTCF_NAT should stay in the header (linux/in_route.h)
because they are used in DECnet.
Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It seems that ip_build_xmit is no longer used in here and
ip_append_data is used.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Previous NETNS patches broke CONFIG_SYSCTL=n case
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
CHECK net/ipv4/icmp.c
net/ipv4/icmp.c:249:13: warning: context imbalance in 'icmp_xmit_unlock' -
unexpected unlock
net/ipv4/icmp.c:376:13: warning: context imbalance in 'icmp_reply' - different
lock contexts for basic block
net/ipv4/icmp.c:430:6: warning: context imbalance in 'icmp_send' - different
lock contexts for basic block
Solution is to declare both icmp_xmit_lock() and icmp_xmit_unlock() as inline
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1) Cleanups (all functions are prefixed by sock_prot_inuse)
sock_prot_inc_use(prot) -> sock_prot_inuse_add(prot,-1)
sock_prot_dec_use(prot) -> sock_prot_inuse_add(prot,-1)
sock_prot_inuse() -> sock_prot_inuse_get()
New functions :
sock_prot_inuse_init() and sock_prot_inuse_free() to abstract pcounter use.
2) if CONFIG_PROC_FS=n, we can zap 'inuse' member from "struct proto",
since nobody wants to read the inuse value.
This saves 1372 bytes on i386/SMP and some cpu cycles.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case of segments which are purely for control without any
data (SYN/ACK/FIN/RST), many fields are set to common values
in multiple places.
i386 results:
$ gcc --version
gcc (GCC) 4.1.2 20070626 (Red Hat 4.1.2-13)
$ codiff tcp_output.o.old tcp_output.o.new
net/ipv4/tcp_output.c:
tcp_xmit_probe_skb | -48
tcp_send_ack | -56
tcp_retransmit_skb | -79
tcp_connect | -43
tcp_send_active_reset | -35
tcp_make_synack | -42
tcp_send_fin | -48
7 functions changed, 351 bytes removed
net/ipv4/tcp_output.c:
tcp_init_nondata_skb | +90
1 function changed, 90 bytes added
tcp_output.o.mid:
8 functions changed, 90 bytes added, 351 bytes removed, diff: -261
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Removed case indentation level & combined some nested ifs, mostly
within 80 lines now. This is a leftover from indent patch, it
just had to be done manually to avoid messing it up completely.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add __acquires() and __releases() annotations to suppress some sparse
warnings.
example of warnings :
net/ipv4/udp.c:1555:14: warning: context imbalance in 'udp_seq_start' - wrong
count at exit
net/ipv4/udp.c:1571:13: warning: context imbalance in 'udp_seq_stop' -
unexpected unlock
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These were manually selected from indent's results which as is
are too noisy to be of any use without human reason. In addition,
some extra newlines between function and its comment were removed
too.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
The snd_up check should be enough. I suspect this has been
there to provide a minor optimization in clean_rtx_queue which
used to have a small if (!->sacked) block which could skip
snd_up check among the other work.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
SACK reneging can be precalculated to a FLAG in clean_rtx_queue
which has the right skb looked up. This will help a bit in
future because skb->sacked access will be changed eventually,
changing it already won't hurt any.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
There's very little need to have the packets_out incrementing in
a separate function. Also name the combined function
appropriately.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Earlier resolution for NewReno's sacked_out should now keep
it small enough for this to become invariant-like check.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces new memory accounting functions for each network
protocol. Most of them are renamed from memory accounting functions
for stream protocols. At the same time, some stream memory accounting
functions are removed since other functions do same thing.
Renaming:
sk_stream_free_skb() -> sk_wmem_free_skb()
__sk_stream_mem_reclaim() -> __sk_mem_reclaim()
sk_stream_mem_reclaim() -> sk_mem_reclaim()
sk_stream_mem_schedule -> __sk_mem_schedule()
sk_stream_pages() -> sk_mem_pages()
sk_stream_rmem_schedule() -> sk_rmem_schedule()
sk_stream_wmem_schedule() -> sk_wmem_schedule()
sk_charge_skb() -> sk_mem_charge()
Removeing
sk_stream_rfree(): consolidates into sock_rfree()
sk_stream_set_owner_r(): consolidates into skb_set_owner_r()
sk_stream_mem_schedule()
The following functions are added.
sk_has_account(): check if the protocol supports accounting
sk_mem_uncharge(): do the opposite of sk_mem_charge()
In addition, to achieve consolidation, updating sk_wmem_queued is
removed from sk_mem_charge().
Next, to consolidate memory accounting functions, this patch adds
memory accounting calls to network core functions. Moreover, present
memory accounting call is renamed to new accounting call.
Finally we replace present memory accounting calls with new interface
in TCP and SCTP.
Signed-off-by: Takahiro Yasui <tyasui@redhat.com>
Signed-off-by: Hideo Aoki <haoki@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Let's nip the code duplication in the bud :)
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
When netfilter is off the transport-mode async resumption doesn't work
because we don't push back the IP header. This patch fixes that by
moving most of the code outside of ifdef NETFILTER since the only part
that's not common is the short-circuit in the protocol handler.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
While checking Gavin's patch I noticed that the returned seq_rtt
is not used by the caller.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
If snd_wnd - snd_nxt wasn't multiple of MSS, skb was split on
odd boundary by the callers of tcp_window_allows.
We try really hard to avoid unnecessary modulos. Therefore the
old caller side check "if (skb->len < limit)" was too wide as
well because limit is not bound in any way to skb->len and can
cause spurious testing for trimming in the middle of the queue
while we only wanted that to happen at the tail of the queue.
A simple additional caller side check for tcp_write_queue_tail
would likely have resulted 2 x modulos because the limit would
have to be first calculated from window, however, doing that
unnecessary modulo is not mandatory. After a minor change to
the algorithm, simply determine first if the modulo is needed
at all and at that point immediately decide also from which
value it should be calculated from.
This approach also kills some duplicated code.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
I'm actually surprised at how much was involved. At first glance it
appears that the neighbour table data structures are already split by
network device so all that should be needed is to modify the user
interface commands to filter the set of neighbours by the network
namespace of their devices.
However a couple things turned up while I was reading through the
code. The proxy neighbour table allows entries with no network
device, and the neighbour parms are per network device (except for the
defaults) so they now need a per network namespace default.
So I updated the two structures (which surprised me) with their very
own network namespace parameter. Updated the relevant lookup and
destroy routines with a network namespace parameter and modified the
code that interacts with users to filter out neighbour table entries
for devices of other namespaces.
I'm a little concerned that we can modify and display the global table
configuration and from all network namespaces. But this appears good
enough for now.
I keep thinking modifying the neighbour table to have per network
namespace instances of each table type would should be cleaner. The
hash table is already dynamically sized so there are it is not a
limiter. The default parameter would be straight forward to take care
of. However when I look at the how the network table is built and
used I still find some assumptions that there is only a single
neighbour table for each type of table in the kernel. The netlink
operations, neigh_seq_start, the non-core network users that call
neigh_lookup. So while it might be doable it would require more
refactoring than my current approach of just doing a little extra
filtering in the code.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a number of new IPsec audit events to meet the auditing
requirements of RFC4303. This includes audit hooks for the following events:
* Could not find a valid SA [sections 2.1, 3.4.2]
. xfrm_audit_state_notfound()
. xfrm_audit_state_notfound_simple()
* Sequence number overflow [section 3.3.3]
. xfrm_audit_state_replay_overflow()
* Replayed packet [section 3.4.3]
. xfrm_audit_state_replay()
* Integrity check failure [sections 3.4.4.1, 3.4.4.2]
. xfrm_audit_state_icvfail()
While RFC4304 deals only with ESP most of the changes in this patch apply to
IPsec in general, i.e. both AH and ESP. The one case, integrity check
failure, where ESP specific code had to be modified the same was done to the
AH code for the sake of consistency.
Signed-off-by: Paul Moore <paul.moore@hp.com>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_win_from_space() being signed, compiler might emit an integer divide
to compute tcp_win_from_space()/2 .
Using right shifts is OK here and less expensive.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_mtu_to_mss() being signed, compiler might emit an integer divide
to compute tcp_mtu_to_mss()/2 .
Using a right shift is OK here and less expensive.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before submiting a patch to change a divide to a right shift, I felt
necessary to create a helper function tcp_mtu_probing() to reduce length of
lines exceeding 100 chars in tcp_write_timeout().
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since 'goal' is a signed int, compiler may emit an integer divide
to compute goal/2.
Using a right shift is OK here and less expensive.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Several length variables cannot be negative, so convert int to
unsigned int. This also allows us to do sane shift operations
on those variables.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Because 'free_space' variable in __tcp_select_window() is signed,
expression (free_space / 2) forces compiler to emit an integer divide.
This can be changed to a plain right shift, less expensive.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv6 specific thing is wrongly removed from transformation at net-2.6.25.
This patch recovers it with current design.
o Update "path" of xfrm_dst since IPv6 transformation should
care about routing changes. It is required by MIPv6 and
off-link destined IPsec.
o Rename nfheader_len which is for non-fragment transformation used by
MIPv6 to rt6i_nfheader_len as IPv6 name space.
Signed-off-by: Masahide NAKAMURA <nakam@linux-ipv6.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
I'd say that most of what tcp_tso_should_defer had in between
there was dead code because of this.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is -700 bytes from the net/ipv4/built-in.o
add/remove: 1/0 grow/shrink: 1/3 up/down: 340/-1040 (-700)
function old new delta
__inet_lookup_established - 339 +339
tcp_sacktag_write_queue 2254 2255 +1
tcp_v4_err 1304 973 -331
tcp_v4_rcv 2089 1744 -345
tcp_v4_do_rcv 826 462 -364
Exporting is for dccp module (used via e.g. inet_lookup).
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This one is used in quite many places in the networking code and
seems to big to be inline.
After the patch net/ipv4/build-in.o loses ~650 bytes:
add/remove: 2/0 grow/shrink: 0/5 up/down: 461/-1114 (-653)
function old new delta
__inet_hash_nolisten - 282 +282
__inet_hash - 179 +179
tcp_sacktag_write_queue 2255 2254 -1
__inet_lookup_listener 284 274 -10
tcp_v4_syn_recv_sock 755 493 -262
tcp_v4_hash 389 35 -354
inet_hash_connect 1086 599 -487
This version addresses the issue pointed by Eric, that
while being inline this function was optimized by gcc
in respect to the 'listen_possible' argument.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
It appears that I've managed to create two different functions both
called xfrm6_tunnel_output. This is because we have the plain tunnel
encapsulation named xfrmX_tunnel as well as the tunnel-mode encapsulation
which lives in the files xfrmX_mode_tunnel.c.
This patch renames functions from the latter to use the xfrmX_mode_tunnel
prefix to avoid name-space conflicts.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The NETFILTER_ADVANCED option hides lots of the rather obscure netfilter
options when disabled and provides defaults (M) that should allow to
run a distribution firewall without further thinking.
Defaults to 'y' to avoid breaking current configurations.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Apply Eric Dumazet's jhash optimizations where applicable. Quoting Eric:
Thanks to jhash, hash value uses full 32 bits. Instead of returning
hash % size (implying a divide) we return the high 32 bits of the
(hash * size) that will give results between [0 and size-1] and same
hash distribution.
On most cpus, a multiply is less expensive than a divide, by an order
of magnitude.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Parenthesize macro parameters.
Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
A few netfilter modules provide their own union of IPv4 and IPv6
address storage. Will unify that in this patch series.
(1/4): Rename union nf_conntrack_address to union nf_inet_addr and
move it to x_tables.h.
Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use %u format specifiers as ->family is unsigned.
Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to use rcu_assign_pointer/rcu_dereference to avoid races.
Also remove an obsolete CONFIG_IP_NAT_NEEDED ifdef.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
nf_nat_setup_info gets the hook number and translates that to the
manip type to perform. This is a relict from the time when one
manip per hook could exist, the exact hook number doesn't matter
anymore, its converted to the manip type. Most callers already
know what kind of NAT they want to perform, so pass the maniptype
in directly.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The combination of NAT and helpers may produce TCP sequence adjustments.
In failover setups, this information needs to be replicated in order to
achieve a successful recovery of mangled, related connections. This patch is
particularly useful for conntrackd, see:
http://people.netfilter.org/pablo/conntrack-tools/
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Resync get_entries() with ip_tables.c by moving the checks from the
setsockopt handler to the function itself.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
More resyncing with ip_tables.c as preparation for compat support.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Resync with ip_tables.c as preparation for compat support.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove compatiblity hack copied from ip_tables.c - ipchains didn't even
support arp_tables :)
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The size check is already performed by xt_check_target, no need
to do it again.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
ipchains support has been removed years ago. kill last remains.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reformat ip_tables.c and ip6_tables.c in order to eliminate non-functional
differences and minimize diff output.
This allows to get a view of the real differences using:
sed -e 's/IP6T/IPT/g' \
-e 's/IP6/IP/g' \
-e 's/INET6/INET/g' \
-e 's/ip6t/ipt/g' \
-e 's/ip6/ip/g' \
-e 's/ipv6/ip/g' \
-e 's/icmp6/icmp/g' \
net/ipv6/netfilter/ip6_tables.c | \
diff -wup /dev/stdin net/ipv4/netfilter/ip_tables.c
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use compat types and compat iterators when dealing with compat entries for
clarity. This doesn't actually make a difference for ip_tables, but is
needed for ip6_tables and arp_tables.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Account for size differences when dumping entries or calculating the
entry positions. This doesn't actually make any difference for IPv4
since the structures have the same size, but its logically correct
and needed for IPv6.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make xt_compat_match_from_user return an int to make it usable in the
*tables iterator macros and kill a now unnecessary wrapper function.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The compat code has some very odd formating, clean it up before porting
it to ip6_tables.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
These are scattered over the code, but almost all the
"critical" places already have the proper struct net
at hand except for snmp proc showing function and routing
rtnl handler.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
They are all collected in the net/ipv4/devinet.c file and
mostly use the IPV4_DEVCONF_DFLT macro.
So I add the net parameter to it and patch users accordingly.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is the core.
Add all and default pointers on the netns_ipv4 and register
a new pernet subsys to initialize them.
Also add the ctl_table_header to register the
net.ipv4.ip_forward ctl.
I don't allocate additional memory for init_net, but use
global devinets.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some handers and strategies of devinet sysctl tables need
to know the net to propagate the ctl change to all the
net devices.
I use the (currently unused) extra2 pointer on the tables
to get it.
Holding the reference on the struct net is not possible,
because otherwise we'll get a net->ctl_table->net circular
dependency. But since the ctl tables are unregistered during
the net destruction, this is safe to get it w/o additional
protection.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This one will need to set the IPV4_DEVCONF_ALL(PROXY_ARP), but
there's no ways to get the net right in place, so we have to
pull one from the inet_ioctl's struct sock.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, this function is void, so failures in creating
sysctls for new/renamed devices are not reported to anywhere.
Fixing this is another complex (needed?) task, but this
return value is needed during the namespaces creation to
handle the case, when we failed to create "all" and "default"
entries.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that external users may increment the counters directly, we need
to ensure that udp_stats_in6 is always available. Otherwise we'd
either have to requrie the external users to be built as modules or
ipv6 to be built-in.
This isn't too bad because udp_stats_in6 is just a pair of pointers
plus an EXPORT, e.g., just 40 (16 + 24) bytes on x86-64.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are several thresholds for trie fib hash management. They are used
in the code as a constants. Make them constants from the compiler point of
view.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is similar to the change already done for IPIP tunnels.
Once created, a GRE tunnel can't be bound to another device.
To reproduce:
# create a tunnel:
ip tunnel add tunneltest0 mode gre remote 10.0.0.1 dev eth0
# try to change the bounding device from eth0 to eth1:
ip tunnel change tunneltest0 dev eth1
# show the result:
ip tunnel show tunneltest0
tunneltest0: gre/ip remote 10.0.0.1 local any dev eth0 ttl inherit
Notice the bound device has not changed from eth0 to eth1.
This patch fixes it. When changing the binding, it also recalculates the
MTU according to the new bound device's MTU.
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes a logical error in ICMP policy checks which lets
packets through if the state ICMP flag is off.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch converts all callers of xfrm_lookup that used an
explicit value of 1 to indiciate blocking to use the new flag
XFRM_LOOKUP_WAIT.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The policy check I added for ICMP on IPv6 is reversed. This
patch fixes that.
It also adds an skb->sp check so that unprotected packets that
fail the policy check do not crash the machine.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Once created, an IP tunnel can't be bound to another device.
(reported as https://bugzilla.redhat.com/show_bug.cgi?id=419671)
To reproduce:
# create a tunnel:
ip tunnel add tunneltest0 mode ipip remote 10.0.0.1 dev eth0
# try to change the bounding device from eth0 to eth1:
ip tunnel change tunneltest0 dev eth1
# show the result:
ip tunnel show tunneltest0
tunneltest0: ip/ip remote 10.0.0.1 local any dev eth0 ttl inherit
Notice the bound device has not changed from eth0 to eth1.
This patch fixes it. When changing the binding, it also recalculates the
MTU according to the new bound device's MTU.
If the change is acceptable, I'll do the same for GRE and SIT tunnels.
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RFC 4301 requires us to relookup ICMP traffic that does not match any
policies using the reverse of its payload. This patch implements this
for ICMP traffic that originates from or terminates on localhost.
This is activated on outbound with the new policy flag XFRM_POLICY_ICMP,
and on inbound by the new state flag XFRM_STATE_ICMP.
On inbound the policy check is now performed by the ICMP protocol so
that it can repeat the policy check where necessary.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
RFC 4301 requires us to relookup ICMP traffic that does not match any
policies using the reverse of its payload. This patch adds the functions
xfrm_decode_session_reverse and xfrmX_policy_check_reverse so we can get
the reverse flow to perform such a lookup.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This includes:
* moving neigh_sysctl_(un)register calls inside
devinet_sysctl_(un)register ones, as they are always
called in pairs;
* making __devinet_sysctl_unregister() to unregister
the ipv4_devconf struct, while original devinet_sysctl_unregister()
works with the in_device to handle both - devconf and
neigh sysctls;
* make stubs for CONFIG_SYSCTL=n case to get rid of
in-code ifdefs.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make the INET_TWDR_TWKILL_SLOTS vs sizeof(twdr->thread_slots)
check nicer.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The sizeof(struct tcp_skb_cb) should not be less than the
sizeof(skb->cb). This is checked in net/ipv4/tcp.c, but
this check can be made more gracefully.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
With fixes from Arnaldo Carvalho de Melo.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ipv4: no need pass pointer to a default into fib_detect_death
Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
single list_head variable initialized with LIST_HEAD_INIT could almost
always can be replaced with LIST_HEAD declaration, this shrinks the code
and looks better.
Signed-off-by: Denis Cheng <crquan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We only use these variables when displaying the trie in proc so
place them into the iterator to make this explicit. We should
probably do something smarter to handle the CONFIG_IP_MULTIPLE_TABLES
case but at least this makes it clear that the silliness is limited
to the display in /proc.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are only 2 users and it doesn't hurt to call fib_get_table
instead, and it makes it easier to make the fib network namespace
aware.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move dst entries to a namespace loopback to catch refcounting leaks.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The PROXY_ARP is set on devconfigs in a similar way in
both calls.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ATF_PUBL requests are handled completely separate from
the others. Emphasize it with a separate function. This also
reduces the indentation level.
The same issue exists with the arp_delete_request, but
when I tried to make it in one patch diff produced completely
unreadable patch. So I split it into two, but they may be
done with one commit.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
There's no need in having this function exist in a form
of macro. Properly formatted function looks much better.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The rt_cache, stats/rt_cache and rt_acct(optional) files
creation looks a bit messy. Clean this out and join them
to other proc-related functions under the proper ifdef.
The struct net * argument in a new function will help net
namespaces patches look nicer.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The net/ipv4/route.c file declares some entries for proc
to dump some routing info. The reading functions are
scattered over this file - collect them together.
Besides, remove a useless IP_RT_ACCT_CPU macro.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The previous move of the the UDP inDatagrams counter caused each
peek of the same packet to be counted separately. This may be
undesirable.
This patch fixes this by adding a bit to sk_buff to record whether
this packet has already been seen through skb_recv_datagram. We
then only increment the counter when the packet is seen for the
first time.
The only dodgy part is the fact that skb_recv_datagram doesn't have
a good way of returning this new bit of information. So I've added
a new function __skb_recv_datagram that does return this and made
skb_recv_datagram a wrapper around it.
The plan is to eventually replace all uses of skb_recv_datagram with
this new function at which time it can be renamed its proper name.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The previous move of the the UDP inDatagrams counter caused the
counting of encapsulated packets, SUNRPC data (as opposed to call)
packets and RXRPC packets to go missing.
This patch restores all of these.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently it is possible for two processes to peek on the same socket
and end up incrementing the error counter twice for the same packet.
This patch fixes it by making skb_kill_datagram return whether it
succeeded in unlinking the packet and only incrementing the counter
if it did.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
AFAIS these two entries should do the same thing - change the
forwarding state on ipv4_devconf and on all the devices.
I propose to merge the handlers together using ctl paths.
The inet_forward_change() is static after this and I move
it higher to be closer to other "propagation" helpers and
to avoid diff making patches based on { and } matching :)
i.e. - make them easier to read.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is the same as I did for the net/core/ table in the
second patch in his series: use the paths and isolate the
whole table in the .c file.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This includes several cleanups:
* tune Makefile to compile out this file when SYSCTL=n. Now
it looks like net/core/sysctl_net_core.c one;
* move the ipv4_config to af_inet.c to exist all the time;
* remove additional sysctl_ip_nonlocal_bind declaration
(it is already declared in net/ip.h);
* remove no nonger needed ifdefs from this file.
This is a preparation for using ctl paths for net/ipv4/
sysctl table.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that issue_verdict doesn't need to free the queue entries anymore,
all it does is disable local BHs and call nf_reinject. Move the BH
disabling to the okfn invocation in nf_reinject and kill the
issue_verdict functions.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move common fields for queue management to struct nf_info and rename it
to struct nf_queue_entry. The avoids one allocation/free per packet and
simplifies the code a bit.
Alternatively we could add some private room at the tail, but since
all current users use identical structs this seems easier.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
A queue entry lookup currently looks like this:
ipq_find_dequeue_entry -> __ipq_find_dequeue_entry ->
__ipq_find_entry -> cmpfn -> id_cmp
Use simple open-coded list walking and kill the cmpfn for
ipq_find_dequeue_entry. Instead add it to ipq_flush (after
similar cleanups) and use ipq_flush for both complete flushes
and flushing entries related to a device.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use list_add_tail/list_for_each_entry instead of list_add and
list_for_each_prev as a preparation for switching to RCU.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove the data pointer from struct nf_queue_handler. It has never been used
and is useless for the only handler that really matters, nfnetlink_queue,
since the handler is shared between all instances.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
nf_conntrack_h323 needs ip6_route_output for the call forwarding filter.
Add a ->route function to nf_afinfo and use that to avoid pulling in the
ipv6 module.
Fix the #ifdef for the IPv6 code while I'm at it - the IPv6 support is
only needed when IPv6 conntrack is enabled.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Log GID in addition to UID
Signed-off-by: Maciej Soltysiak <maciej.soltysiak@ae.poznan.pl>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove the ipt_SAME target as scheduled in feature-removal-schedule.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge ipt_TOS into xt_DSCP.
Merge ipt_TOS (tos v0 target) into xt_DSCP. They both modify the same
field in the IPv4 header, so it seems reasonable to keep them in one
piece. This is part two of the implicit 4-patch series to move tos to
xtables and extend it by IPv6.
Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge ipt_tos into xt_dscp.
Merge ipt_tos (tos v0 match) into xt_dscp. They both match on the same
field in the IPv4 header, so it seems reasonable to keep them in one
piece. This is part one of the implicit 4-patch series to move tos to
xtables and extend it by IPv6.
Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Unify netfilter match kconfig descriptions
Consistently use lowercase for matches in kconfig one-line
descriptions and name the match module.
Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Addrtype match has a new revision (1), which lets address type checking
limited to the interface the current packet belongs to. Either incoming
or outgoing interface can be used depending on the current hook. In the
FORWARD hook two maches should be used if both interfaces have to be checked.
The new structure is ipt_addrtype_info_v1.
Revision 0 lets older userspace programs use the match as earlier.
ipt_addrtype_info is used.
Signed-off-by: Laszlo Attila Toth <panther@balabit.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Address type search can be limited to an interface by
inet_dev_addr_type function.
Signed-off-by: Laszlo Attila Toth <panther@balabit.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
xt_owner merges ipt_owner and ip6t_owner, and adds a flag to match
on socket (non-)existence.
Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
We're not multiplying the size with the number of CPUs anymore, so the
check is obsolete.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of using a big array of NR_CPUS entries, we can compute the size
needed at runtime, using nr_cpu_ids
This should save some ram (especially on David's machines where NR_CPUS=4096 :
32 KB can be saved per table, and 64KB for dynamically allocated ones (because
of slab/slub alignements) )
In particular, the 'bootstrap' tables are not any more static (in data
section) but on stack as their size is now very small.
This also should reduce the size used on stack in compat functions
(get_info() declares an automatic variable, that could be bigger than kernel
stack size for big NR_CPUS)
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Give all Netfilter modules consistent and unique symbol names.
Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
When merging the input paths of IPsec I accidentally left a hard-coded
AF_INET for the state lookup call. This broke IPv6 obviously. This
patch fixes by getting the input callers to specify the family through
skb->cb.
Credit goes to Kazunori Miyazawa for diagnosing this and providing an
initial patch.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
System calls should be USER. So change the BH to USER for
UDP*_INC_STATS_BH().
Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since we have macro IS_UDPLITE, we can use it.
Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Thanks dave, herbert, gerrit, andi and other people for your
discussion about this problem.
UdpInDatagrams can be confusing because it counts packets that
might be dropped later.
Move UdpInDatagrams into recvmsg() as allowed by the RFC.
Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pointing to the next skb is necessary to avoid referencing
already SACKed skbs which will soon be on a separate list.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lines won't be that long and it's compiler's job to optimize
them.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
They better be valid when call to write_queue functions is made
once things that follow are going in.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bogus seqno compares just mislead, the code is identical for
both sides of the seqno compare (and was even executed just
once because of return in between).
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
To get there, highest_sack must have advanced. When it advances,
a new skb is SACKed, which already sets that FLAG. Besides, the
original purpose of it has puzzled me, never understood why
LOST bit setting of retransmitted skb is marked with
FLAG_DATA_SACKED.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Usually those skbs will have L set, not counting them as lost
retransmissions is misleading.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This looks very much like the patch for neighbors.
The path is also located on the stack and is prepared
inside the function. This time, the call to the registering
function is guarded with the RTNL lock, but I decided
to keep it on the stack not to litter the devinet.c file
with unneeded names and to make it look similar to the
neighbors code.
This is also intended to help us with the net namespaces
and saves the vmlinux size as well - this time by more
than 670 bytes.
The difference from the first version is just the patch
offsets, that changed due to changes in the patch #2.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently this call is used to register sysctls for devices
and for the "default" confs. The "all" sysctls are registered
separately.
Besides, the inet_device is passed to this function, but it is
not needed there at all - just the device name and ifindex are
required.
Thanks to Herbert, who noticed, that this call doesn't even
require the devconf pointer (the last argument) - all we need
we can take from the in_device itself.
The fix is to make a __devinet_sysctl_register(), which registers
sysctls for all "devices" we need, including "default" and "all" :)
The original devinet_sysctl_register() works with struct net_device,
not the inet_device, and calls the introduced function, passing
the device name and ifindex (to be used as procname and ctl_name)
into it.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
I moved the call to kmalloc() from the *t declaration into
the code (this is confusing when a variable is initialized
with the result of some call) and removed unneeded comment
near the error path. Just like I did with the neigh ctl-s.
Besides, I fixed the goto's and the labels - they were indented
with spaces :(
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kill the defines again, convert to the new checksum helper names and
remove the dependency of NET_ACT_NAT on NETFILTER.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
1) Skip condition used to be wrong way around which made SACK
processing very broken, missed many blocks because of that.
2) Use highest_sack advancement only if some skbs are already
sacked because otherwise tcp_write_queue_next may move things
too far (occurs mainly with GSO). The other similar advancement
is not problem because highest_sack was previosly put to point
a sacked skb.
These problems were located because of problem report from Matt
Mathis <mathis@psc.edu>.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The 3rd argument is always zero (according to grep :) Eliminate
it and merge the function with sk_stream_alloc_skb.
This saves 44 more bytes, and together with the previous patch
we have:
add/remove: 1/0 grow/shrink: 0/8 up/down: 183/-751 (-568)
function old new delta
sk_stream_alloc_skb - 183 +183
ip_rt_init 529 525 -4
arp_ignore 112 107 -5
__inet_lookup_listener 284 274 -10
tcp_sendmsg 2583 2481 -102
tcp_sendpage 1449 1300 -149
tso_fragment 417 258 -159
tcp_fragment 1149 988 -161
__tcp_push_pending_frames 1998 1837 -161
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This function seems too big for inlining. Indeed, it saves
half-a-kilo when uninlined:
add/remove: 1/0 grow/shrink: 0/7 up/down: 195/-719 (-524)
function old new delta
sk_stream_alloc_pskb - 195 +195
ip_rt_init 529 525 -4
__inet_lookup_listener 284 274 -10
tcp_sendmsg 2583 2486 -97
tcp_sendpage 1449 1305 -144
tso_fragment 417 267 -150
tcp_fragment 1149 992 -157
__tcp_push_pending_frames 1998 1841 -157
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
fib_hash: kmalloc + memset conversion to kzalloc
fix to avoid memset entirely.
Signed-off-by: Joonwoo Park <joonwpark81@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
fib_semantics: kmalloc + memset conversion to kzalloc
fix to avoid memset entirely.
Signed-off-by: Joonwoo Park <joonwpark81@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Better place exists in update_send_head (other non-queue related
adjustments are done there as well) which is the only caller of
tcp_advance_send_head (now that the bogus call from mtu_probe is
gone).
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The sock_wake_async() performs a bit different actions
depending on "how" argument. Unfortunately this argument
ony has numerical magic values.
I propose to give names to their constants to help people
reading this function callers understand what's going on
without looking into this function all the time.
I suppose this is 2.6.25 material, but if it's not (or the
naming seems poor/bad/awful), I can rework it against the
current net-2.6 tree.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Both try_module_get/module_put already handle the module == NULL
case, so no need in manual checking.
This patch fits both net-2.6 and net-2.6.25.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Every 600 seconds (ip_rt_secret_interval), a softirq flush of the
whole ip route cache is triggered. On loaded machines, this can starve
softirq for many seconds and can eventually crash.
This patch moves this flush to a workqueue context, using the worker
we intoduced in commit 39c90ece75 (IPV4:
Convert rt_check_expire() from softirq processing to workqueue.)
Also, immediate flushes (echo 0 >/proc/sys/net/ipv4/route/flush) are
using rt_do_flush() helper function, wich take attention to
rescheduling.
Next step will be to handle delayed flushes
("echo -1 >/proc/sys/net/ipv4/route/flush" or "ip route flush cache")
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Both ipv6/raw.c and ipv4/raw.c use the seq files to walk
through the raw sockets hash and show them.
The "walking" code is rather huge, but is identical in both
cases. The difference is the hash table to walk over and
the protocol family to check (this was not in the first
virsion of the patch, which was noticed by YOSHIFUJI)
Make the ->open store the needed hash table and the family
on the allocated raw_iter_state and make the start/next/stop
callbacks work with it.
This removes most of the code.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Same as the ->hash one, this is easily consolidated.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Having the raw_hashinfo it's easy to consolidate the
raw[46]_hash functions.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ipv4/raw.c and ipv6/raw.c contain many common code (most
of which is proc interface) which can be consolidated.
Most of the places to consolidate deal with the raw sockets
hashtable, so introduce a struct raw_hashinfo which describes
the raw sockets hash.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The raw sockets functions are explicitly used from
inside the kernel in two places:
1. in ip_local_deliver_finish to intercept skb-s
2. in icmp_error
For this purposes many functions and even data structures,
that are naturally internal for raw protocol, are exported.
Compact the API to two functions and hide all the other
(including hash table and rwlock) inside the net/ipv4/raw.c
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
After this patch none of the netlink callback support anything
except the initial network namespace but the rtnetlink infrastructure
now handles multiple network namespaces.
Changes from v2:
- IPv6 addrlabel processing
Changes from v1:
- no need for special rtnl_unlock handling
- fixed IPv6 ndisc
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before I can enable rtnetlink to work in all network namespaces I need
to be certain that something won't break. So this patch deliberately
disables all of the rtnletlink methods in everything except the
initial network namespace. After the methods have been audited this
extra check can be disabled.
Changes from v1:
- added IPv6 addrlabel protection
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
With this patch the synced connections are created with their real state,
which can be changed on the next synchronizations if necessary. This way
on fail-over all the connections will be treated according to their actual
state, causing no scheduling problems (the active and the nonactive
connections have different weights in the schedulers).
The backwards compatibility is preserved and the existing tools will show
the true connection states even on the backup director.
Signed-off-by: Rumen G. Bogdanovski <rumen@voicecho.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch labels the sync-created connections with IP_VS_CONN_F_SYNC
flag and creates /proc/net/ip_vs_conn_sync to enable monitoring of the
origin of the connections, if they are local or created by the
synchronization.
Signed-off-by: Rumen G. Bogdanovski <rumen@voicecho.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Previously one of the in-block skip branches was missing it.
Also, drop it from tail-fully-processed case because the next
iteration will do exactly the same thing, i.e., process the
SACK block that contains the DSACK information.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_rt_acct needs 4096 bytes per cpu to perform some accounting.
It is actually allocated as a single huge array [4096*NR_CPUS]
(rounded up to a power of two)
Converting it to a per cpu variable is wanted to :
- Save space on machines were num_possible_cpus() < NR_CPUS
- Better NUMA placement (each cpu gets memory on its node)
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Key points of this patch are:
- In case new SACK information is advance only type, no skb
processing below previously discovered highest point is done
- Optimize cases below highest point too since there's no need
to always go up to highest point (which is very likely still
present in that SACK), this is not entirely true though
because I'm dropping the fastpath_skb_hint which could
previously optimize those cases even better. Whether that's
significant, I'm not too sure.
Currently it will provide skipping by walking. Combined with
RB-tree, all skipping would become fast too regardless of window
size (can be done incrementally later).
Previously a number of cases in TCP SACK processing fails to
take advantage of costly stored information in sack_recv_cache,
most importantly, expected events such as cumulative ACK and new
hole ACKs. Processing on such ACKs result in rather long walks
building up latencies (which easily gets nasty when window is
huge). Those latencies are often completely unnecessary
compared with the amount of _new_ information received, usually
for cumulative ACK there's no new information at all, yet TCP
walks whole queue unnecessary potentially taking a number of
costly cache misses on the way, etc.!
Since the inclusion of highest_sack, there's a lot information
that is very likely redundant (SACK fastpath hint stuff,
fackets_out, highest_sack), though there's no ultimate guarantee
that they'll remain the same whole the time (in all unearthly
scenarios). Take advantage of this knowledge here and drop
fastpath hint and use direct access to highest SACKed skb as
a replacement.
Effectively "special cased" fastpath is dropped. This change
adds some complexity to introduce better coveraged "fastpath",
though the added complexity should make TCP behave more cache
friendly.
The current ACK's SACK blocks are compared against each cached
block individially and only ranges that are new are then scanned
by the high constant walk. For other parts of write queue, even
when in previously known part of the SACK blocks, a faster skip
function is used (if necessary at all). In addition, whenever
possible, TCP fast-forwards to highest_sack skb that was made
available by an earlier patch. In typical case, no other things
but this fast-forward and mandatory markings after that occur
making the access pattern quite similar to the former fastpath
"special case".
DSACKs are special case that must always be walked.
The local to recv_sack_cache copying could be more intelligent
w.r.t DSACKs which are likely to be there only once but that
is left to a separate patch.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Worker function that implements the main logic of
the inner-most loop of tcp_sacktag_write_queue().
Idea was originally presented by David S. Miller.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>