Allow one-packet scheduling for UDP connections. When the fwmark-based or
normal virtual service is marked with the '-o' or '--ops' option, all
connections are created only to schedule one packet. This is useful for
scheduling UDP packets from the same client port to different real servers.
Recommended with the RR or WRR schedulers (the connections are not visible
with ipvsadm -L).
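As a rough illustration of the effect (a toy model, not the IPVS code; the
RoundRobin class, its schedule() method and the server labels are
hypothetical), the sketch below contrasts connection-based scheduling, where
a client address and port stays pinned to one real server, with one-packet
scheduling, where every datagram is scheduled afresh:

```python
from itertools import cycle


class RoundRobin:
    """Toy scheduler: hands out real servers in round-robin order."""

    def __init__(self, servers, one_packet=False):
        self._next = cycle(servers)
        self.one_packet = one_packet
        self.conns = {}  # (client_ip, client_port) -> real server

    def schedule(self, client):
        if self.one_packet:
            # One-packet mode: nothing is remembered between datagrams,
            # so each one is scheduled independently.
            return next(self._next)
        # Normal mode: the first packet creates a connection entry and
        # later packets from the same client port reuse it.
        if client not in self.conns:
            self.conns[client] = next(self._next)
        return self.conns[client]


client = ("192.0.2.10", 5060)  # illustrative client address and port
normal = RoundRobin(["rs1", "rs2"])
ops = RoundRobin(["rs1", "rs2"], one_packet=True)
normal_path = [normal.schedule(client) for _ in range(4)]
ops_path = [ops.schedule(client) for _ in range(4)]
```

In this model the normal scheduler keeps sending the client to rs1, while the
one-packet scheduler alternates between rs1 and rs2.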
Signed-off-by: Nick Chalk <nick@loadbalancer.org>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2.6.34 introduced 'conntrack zones' to deal with cases where packets
from multiple identical networks are handled by conntrack/NAT. Packets
are looped through veth devices, during which they are NATed to private
addresses, after which they can continue normally through the stack
and possibly have NAT rules applied a second time.
This works well, but is needlessly complicated for cases where only
a single SNAT/DNAT mapping needs to be applied to these packets. In that
case, all that needs to be done is to assign each network to a separate
zone and perform NAT as usual. However, this doesn't work for packets
destined for the machine performing NAT itself, since it's currently not
possible to configure SNAT mappings for the LOCAL_IN chain.
This patch adds a new INPUT chain to the NAT table and changes the
targets performing SNAT to be usable in that chain.
Example usage with two identical networks (192.168.0.0/24) on eth0/eth1:
iptables -t raw -A PREROUTING -i eth0 -j CT --zone 1
iptables -t raw -A PREROUTING -i eth0 -j MARK --set-mark 1
iptables -t raw -A PREROUTING -i eth1 -j CT --zone 2
iptables -t raw -A PREROUTING -i eth1 -j MARK --set-mark 2
iptables -t nat -A INPUT -m mark --mark 1 -j NETMAP --to 10.0.0.0/24
iptables -t nat -A POSTROUTING -m mark --mark 1 -j NETMAP --to 10.0.0.0/24
iptables -t nat -A INPUT -m mark --mark 2 -j NETMAP --to 10.0.1.0/24
iptables -t nat -A POSTROUTING -m mark --mark 2 -j NETMAP --to 10.0.1.0/24
iptables -t raw -A PREROUTING -d 10.0.0.0/24 -j CT --zone 1
iptables -t raw -A OUTPUT -d 10.0.0.0/24 -j CT --zone 1
iptables -t raw -A PREROUTING -d 10.0.1.0/24 -j CT --zone 2
iptables -t raw -A OUTPUT -d 10.0.1.0/24 -j CT --zone 2
iptables -t nat -A PREROUTING -d 10.0.0.0/24 -j NETMAP --to 192.168.0.0/24
iptables -t nat -A OUTPUT -d 10.0.0.0/24 -j NETMAP --to 192.168.0.0/24
iptables -t nat -A PREROUTING -d 10.0.1.0/24 -j NETMAP --to 192.168.0.0/24
iptables -t nat -A OUTPUT -d 10.0.1.0/24 -j NETMAP --to 192.168.0.0/24
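The address arithmetic behind the NETMAP rules above can be sketched as
follows. This is only an illustration of the mapping (network bits replaced,
host bits kept); the netmap() helper is hypothetical and not part of
iptables:

```python
import ipaddress


def netmap(addr, to_net):
    """Replace the network bits of addr with those of to_net, keep host bits."""
    net = ipaddress.ip_network(to_net)
    host_bits = int(ipaddress.ip_address(addr)) & int(net.hostmask)
    return str(ipaddress.ip_address(int(net.network_address) | host_bits))


# A host in the identical network behind eth0 (mark 1) and behind eth1 (mark 2)
via_eth0 = netmap("192.168.0.5", "10.0.0.0/24")
via_eth1 = netmap("192.168.0.5", "10.0.1.0/24")
# Reverse direction: traffic addressed to the private range is mapped back
back = netmap("10.0.1.5", "192.168.0.0/24")
```

The two hosts that share 192.168.0.5 become distinguishable as 10.0.0.5 and
10.0.1.5, and the reverse mapping restores the original address.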
Signed-off-by: Patrick McHardy <kaber@trash.net>
This patch implements an idletimer Xtables target that can be used to
identify when interfaces have been idle for a certain period of time.
Timers are identified by labels and are created when a rule is set with a new
label. The rules also take a timeout value (in seconds) as an option. If
more than one rule uses the same timer label, the timer will be restarted
whenever any of the rules get a hit.
One entry for each timer is created in sysfs. This attribute contains the
time remaining until the timer expires. The attributes are located under
the xt_idletimer class:
/sys/class/xt_idletimer/timers/<label>
When the timer expires, the target module sends a sysfs notification to
userspace, which can then decide what to do (e.g. disconnect to save power).
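The label semantics can be modelled in a few lines. This is a toy sketch, not
the xt_IDLETIMER code: the IdleTimers class, the injectable clock and the
"data0" label are hypothetical, and it assumes the timer starts running as
soon as the first rule with that label is added:

```python
class IdleTimers:
    """Toy model of label-keyed idle timers with an injectable clock."""

    def __init__(self, clock):
        self._clock = clock
        self._expires = {}  # label -> absolute expiry time in seconds

    def add_rule(self, label, timeout):
        # A rule with a new label creates the timer; an existing label
        # just shares the timer that is already there.
        if label not in self._expires:
            self._expires[label] = self._clock() + timeout
        return (label, timeout)

    def hit(self, rule):
        # Any rule using the label restarts the shared timer.
        label, timeout = rule
        self._expires[label] = self._clock() + timeout

    def remaining(self, label):
        # Analogous to reading the per-label attribute: seconds left.
        return max(0, self._expires[label] - self._clock())


now = [0]
timers = IdleTimers(lambda: now[0])
rule_a = timers.add_rule("data0", 30)
rule_b = timers.add_rule("data0", 30)   # same label, same timer
now[0] = 20
left_before_hit = timers.remaining("data0")
timers.hit(rule_b)
left_after_hit = timers.remaining("data0")
now[0] = 60
left_at_end = timers.remaining("data0")
```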
Cc: Timo Teras <timo.teras@iki.fi>
Signed-off-by: Luciano Coelho <luciano.coelho@nokia.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
- clusterip_lock becomes a spinlock
- lockless lookups
- kfree() deferred after RCU grace period
- rcu_barrier_bh() inserted in clusterip_tg_exit()
v2)
- As Patrick pointed out, we use atomic_inc_not_zero() in
clusterip_config_find_get().
- list_add_rcu() and list_del_rcu() variants are used.
- atomic_dec_and_lock() used in clusterip_config_entry_put()
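Why a lockless lookup needs an "increment unless zero" operation can be shown
with a small model. This is a sketch only: a Python lock stands in for the
atomic instruction, and the RefCount class and find_get() helper are
hypothetical names, not the CLUSTERIP code:

```python
import threading


class RefCount:
    """Toy reference counter; a lock stands in for atomic instructions."""

    def __init__(self, value=1):
        self._value = value
        self._lock = threading.Lock()

    def inc_not_zero(self):
        # Take a reference only if the object is still live.
        with self._lock:
            if self._value == 0:
                return False
            self._value += 1
            return True

    def dec_and_test(self):
        # Drop a reference; True means this was the last one.
        with self._lock:
            self._value -= 1
            return self._value == 0


def find_get(table, key):
    """Lookup without the list lock: skip entries that are being freed."""
    entry = table.get(key)
    if entry is not None and entry["refs"].inc_not_zero():
        return entry
    return None
```

A reader that finds an entry whose count has already dropped to zero must
treat it as absent, because the entry is only waiting for the grace period
to end before it is freed.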
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Change the 2-wire transfer rate of the SFP+ module EEPROM from 400 kHz to
100 kHz, since some DACs (direct attached cables) do not work at 400 kHz.
Reported-by: Krzysztof Oldzki <ole@ans.pl>
Signed-off-by: Yaniv Rosner <yanivr@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Try to reduce cache line contentions in peer management, to reduce IP
defragmentation overhead.
- peer_fake_node is marked 'const' to make sure it's not modified.
(tested with CONFIG_DEBUG_RODATA=y)
- Group variables in two structures to reduce number of dirtied cache
lines. One named "peers" for avl tree root, its number of entries, and
associated lock. (candidate for RCU conversion)
- A second one named "unused_peers" for unused list and its lock
- Add a !list_empty() test in unlink_from_unused() to avoid taking lock
when entry is not unused.
- Use atomic_dec_and_lock() in inet_putpeer() to avoid taking lock in
some cases.
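The idea behind atomic_dec_and_lock() can be sketched as below. This is a
model under stated assumptions, not the kernel implementation: a small guard
lock emulates atomicity, and the Counter class and dec_and_lock() function
are hypothetical names:

```python
import threading


class Counter:
    def __init__(self, value):
        self.value = value
        self.guard = threading.Lock()  # emulates atomic access to value


def dec_and_lock(counter, lock):
    """Decrement; return True with `lock` held only if the count hit zero."""
    with counter.guard:
        if counter.value > 1:
            counter.value -= 1   # fast path: the big lock is never touched
            return False
    lock.acquire()               # slow path: this may be the last reference
    with counter.guard:
        counter.value -= 1
        if counter.value == 0:
            return True          # caller now holds `lock`
    lock.release()
    return False
```

Only the final put pays for the lock, which is what saves the lock
acquisition "in some cases".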
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use a seqcount_t to synchronize the stats producer and consumer for the
packets and bytes counters, which are now u64.
(The dropped counter, being rarely used, stays a native "unsigned long".)
No noticeable performance impact on x86, as it only adds two increments
per frame. It might be more expensive on arches where smp_wmb() is not
free.
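The producer/consumer protocol can be modelled as follows. This sketch shows
the control flow only: Python has no memory barriers, so the smp_wmb() and
smp_rmb() ordering is not represented, and the SeqStats class is a
hypothetical name:

```python
class SeqStats:
    """Toy seqcount-protected counters (one writer, any number of readers)."""

    def __init__(self):
        self.seq = 0
        self.packets = 0
        self.bytes = 0

    def update(self, nbytes):
        self.seq += 1          # odd value: update in progress
        self.packets += 1
        self.bytes += nbytes
        self.seq += 1          # even value: update complete

    def read(self):
        while True:
            start = self.seq
            snapshot = (self.packets, self.bytes)
            # Accept the snapshot only if no write was in progress when we
            # started and none happened while we were copying.
            if start % 2 == 0 and start == self.seq:
                return snapshot


stats = SeqStats()
stats.update(100)
stats.update(50)
```

The two sequence increments in update() are the "two increments per frame"
mentioned above.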
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use RCU to avoid atomic ops on idev refcnt in ipv6_get_mtu()
and ip6_dst_hoplimit()
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use __in6_dev_get() instead of in6_dev_get()/in6_dev_put()
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 97f8aefbbf "net: fix ethtool coding style errors and warnings"
changed the indentation of several macro definitions in ethtool.h. These
definitions line up in the diff, where there is an extra character at the
start of each line, but not in the resulting file.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The parameter (work) is unused, remove it.
Reported by Eric Dumazet.
Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Instead of doing one atomic operation per frag, we can factorize them.
Reported by Eric Dumazet.
Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
If the returned csum value is 0, we have already set ip_summed to
CHECKSUM_UNNECESSARY in __skb_checksum_complete_head().
So this patch removes the check and returns directly to the
caller.
Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
- must use atomic_inc_not_zero() in instance_lookup_get()
- must use hlist_add_head_rcu() instead of hlist_add_head()
- must use hlist_del_rcu() instead of hlist_del()
- Introduce NFULNL_COPY_DISABLED to stop lockless reader from using an
instance, before we do final instance_put() on it.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
This patch added to 2.6.34:
commit f8d1dcaf88
Author: Jesse Brandeburg <jesse.brandeburg@intel.com>
Date: Tue Apr 27 01:37:20 2010 +0000
ixgbe: enable extremely low latency
introduced a feature where LRO (called RSC on the hardware) was disabled
automatically when setting rx-usecs to 0 via ethtool. Some might not
like the fact that LRO was disabled automatically, but I'm fine with
that. What I don't like is that LRO/RSC is automatically enabled when
rx-usecs is set >0 via ethtool.
This would certainly be a problem if the device was used for forwarding
and it was determined that the low latency wasn't needed after the
device was already forwarding. I played around with saving the state of
LRO in the driver, but it just didn't seem worthwhile and would require
a small change to dev_disable_lro() that I did not like.
This patch simply leaves LRO disabled when setting rx-usecs >0 and
requires that the user enable it again. An extra informational message
will also now appear in the log so users can understand why LRO isn't
being enabled as they expect.
Inconsistency of LRO setting first noticed by Stanislaw Gruszka.
Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
CC: Stanislaw Gruszka <sgruszka@redhat.com>
CC: stable@kernel.org
Tested-by: Stephen Ko <stephen.s.ko@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 675ad47375
removed the capability to use ethtool.set_msglevel to
control the types of messages emitted by the driver.
That commit should probably be reverted.
If not, then this patch fixes a message logging defect
introduced by converting a printk without KERN_<level>
to e_info.
This also reduces text by about 200 bytes.
Signed-off-by: Joe Perches <joe@perches.com>
Tested-by: Emil Tantilov <emil.s.tantilov@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a small window where the watchdog could be running as the
interface is brought down on a NIC with two ports wired back to back.
If ixgbe_update_status is then called, it can lead to a panic. This patch
allows the update to bail if we are in that condition.
This issue was originally reported, and a fix proposed, by Akihiko Saitou.
CC: Akihiko Saitou <asaitou@users.sourceforge.net>
Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
No need to copy rxhash again in __skb_clone()
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
deliver_no_wcard is not being set in skb_copy_header.
In the skb_cloned case it is not being cleared and
may cause the skb to be dropped when the loopback device
pushes it back up the stack.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Device statistics have type unsigned long and several of the
device-specific parameters printed here have type __u32.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use struct rtnl_link_stats64 as the statistics structure.
On 32-bit architectures, insert 32 bits of padding after/before each
field of struct net_device_stats to make its layout compatible with
struct rtnl_link_stats64. Add an anonymous union in net_device; move
stats into the union and add struct rtnl_link_stats64 stats64.
Add net_device_ops::ndo_get_stats64, implementations of which will
return a pointer to struct rtnl_link_stats64. Drivers that implement
this operation must not update the structure asynchronously.
Change dev_get_stats() to call ndo_get_stats64 if available, and to
return a pointer to struct rtnl_link_stats64. Change callers of
dev_get_stats() accordingly.
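Why the padding goes after each field on little-endian machines and before it
on big-endian ones can be checked with a byte-level sketch. The helper names
are hypothetical, and the code models only the memory layout, not the kernel
structures themselves:

```python
import struct


def pack_padded_stats(values, byteorder):
    """32-bit 'unsigned long' fields, each padded with 32 zero bits."""
    if byteorder == "little":
        # Low-order word comes first, so the padding follows the field.
        return b"".join(struct.pack("<I4x", v) for v in values)
    # High-order word comes first, so the padding precedes the field.
    return b"".join(struct.pack(">4xI", v) for v in values)


def unpack_stats64(buf, byteorder):
    """Read the same bytes as an array of 64-bit counters."""
    prefix = "<" if byteorder == "little" else ">"
    return list(struct.unpack(prefix + "Q" * (len(buf) // 8), buf))


counters = [1, 0xFFFFFFFF, 42]
as_le = unpack_stats64(pack_padded_stats(counters, "little"), "little")
as_be = unpack_stats64(pack_padded_stats(counters, "big"), "big")
```

In both byte orders the padded 32-bit fields read back unchanged as 64-bit
values, which is what makes the two layouts compatible.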
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ioctl operation (ndo_do_ioctl) is added to make mii-tools work
Signed-off-by: Sergey Matyukevich <geomatsi@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix build warning on i386 (32-bit) with 32-bit dma_addr_t:
drivers/net/enic/vnic_dev.c: In function 'vnic_dev_init_prov':
drivers/net/enic/vnic_dev.c:716: warning: passing argument 3 of 'pci_alloc_consistent' from incompatible pointer type
include/asm-generic/pci-dma-compat.h:16: note: expected 'dma_addr_t *' but argument is of type 'u64 *'
Now builds without warnings on i386 and on x86_64.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Scott Feldman <scofeldm@cisco.com>
Cc: Vasanthy Kolluri <vkolluri@cisco.com>
Cc: Roopa Prabhu <roprabhu@cisco.com>
Acked-by: Scott Feldman <scofeldm@cisco.com>
This patch increases the granularity of the rate generated by pktgen.
The previous version of pktgen used microsecond (udelay) resolution for its
delays, which caused gaps in the achievable rates. It now uses nanosecond
(ndelay) resolution, so any rate is possible.
It also allows the desired rate to be set in Mb/s or packets per second.
The documentation has been updated.
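The arithmetic that turns a requested rate into an inter-packet delay can be
sketched like this. It is a worked example of the unit conversion, assuming
1 Mb/s means 10**6 bits per second; the function names are hypothetical and
this is not a copy of the pktgen code:

```python
NSEC_PER_SEC = 1_000_000_000


def delay_from_mbps(rate_mbps, pkt_size_bytes):
    """Nanoseconds between packet starts for a rate given in Mb/s."""
    # bits per packet / (bits per ns) = bytes * 8 * 1000 / Mb/s
    return pkt_size_bytes * 8 * 1000 // rate_mbps


def delay_from_pps(rate_pps):
    """Nanoseconds between packet starts for a rate in packets per second."""
    return NSEC_PER_SEC // rate_pps


gap_100mbps_64b = delay_from_mbps(100, 64)
gap_150kpps = delay_from_pps(150_000)
```

A 150 kpps target needs a gap of 6666 ns; with whole microseconds the nearest
choices are 6 us and 7 us, which is the kind of gap in rates the finer
resolution removes.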
Signed-off-by: Daniel Turull <daniel.turull@gmail.com>
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
econet lacks proper locking. It holds econet_lock only when inserting or
deleting an entry in econet_sklist, not during lookups.
- convert econet_lock from rwlock to spinlock
- use econet_lock in ec_listening_socket() lookup
- use appropriate sock_hold() / sock_put() to avoid corruptions.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The gen_kill_estimator() API is incomplete or not well documented, since
the caller should make sure an RCU grace period is respected before
freeing stats_lock.
This was partially addressed in commit 5d944c640b
(gen_estimator: deadlock fix), but the same problem exists for all
gen_kill_estimator() users if the lock they use is not already RCU
protected.
A code review shows xt_RATEEST.c, act_api.c and act_police.c have this
problem. The others are OK because they use the qdisc lock, which is already
RCU protected.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
drivers/net/bnx2.c: In function 'bnx2_disable_forced_2g5':
drivers/net/bnx2.c:1489: warning: 'bmcr' may be used uninitialized in this function
We fix it by checking return values from all bnx2_read_phy() and proceeding
to do read-modify-write only if the read operation is successful.
The related bnx2_enable_forced_2g5() is also fixed the same way.
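The pattern of the fix, writing back only after a successful read, looks
roughly like this. The FakePhy class and clear_bit() helper are hypothetical
stand-ins for illustration and do not mirror the bnx2 register interface:

```python
class FakePhy:
    """Dictionary-backed register file whose reads can be made to fail."""

    def __init__(self, regs, fail=False):
        self.regs = dict(regs)
        self.fail = fail
        self.writes = 0

    def read(self, reg):
        # Returns (error, value); value is meaningless when error is set.
        return (-1, None) if self.fail else (0, self.regs[reg])

    def write(self, reg, value):
        self.regs[reg] = value
        self.writes += 1


def clear_bit(phy, reg, bit):
    """Read-modify-write that never writes back an unread value."""
    err, value = phy.read(reg)
    if err:
        return err
    phy.write(reg, value & ~bit)
    return 0
```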
Reported-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If oui were a null variable then vic_provinfo_alloc() would leak memory.
But this function is only called from one place and oui is not null so
I removed the check.
I also moved the memory allocation down a line so it was easier to spot.
(No one ever reads variable declarations).
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The point of using the devres resource management routines is that they
simplify the driver by taking care of releasing resources on failure and
release. A recent commit added a bunch of error handling that is unnecessary
in this context.
This patch removes this redundant error handling, as well as using
dmam_alloc_coherent in place of dma_alloc_coherent in order to use this
framework consistently throughout the driver.
Signed-off-by: Jonas Bonn <jonas@southpole.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
This matches what ethoc_mdio_read does and makes the functions
symmetric.
Signed-off-by: Jonas Bonn <jonas@southpole.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
- No need to iterate over all possible addresses on bus
- Use helper function phy_find_first
- Use phy_connect_direct as we already have the relevant structure
Signed-off-by: Jonas Bonn <jonas@southpole.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
This moves the write of the TX_BD_NUM to init_ring together with the
rest of the code setting up the transmission buffers.
Signed-off-by: Jonas Bonn <jonas@southpole.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ethoc driver should be writing bus addresses to the ethoc registers, not
virtual addresses. This patch adds an array to store the virtual addresses
in and references that array when manipulating the contents of the buffer
descriptors.
Signed-off-by: Jonas Bonn <jonas@southpole.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
This moves the calculation of the number of transmission buffers to
ethoc_probe where it more logically fits with the rest of the memory
allocation code.
Signed-off-by: Jonas Bonn <jonas@southpole.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
i2400m_fw_hdr_check() was accessing hardware field
bcf_hdr->module_type (little endian 32) without converting to host
byte order.
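What goes wrong without the conversion can be seen from a short sketch. The
le32_to_cpu() helper below is a Python stand-in for the kernel macro of the
same name, and the field value 6 is purely illustrative:

```python
import struct


def le32_to_cpu(raw):
    """Decode 4 bytes that hold a little-endian 32-bit value."""
    return struct.unpack("<I", raw)[0]


raw_field = bytes([0x06, 0x00, 0x00, 0x00])      # value 6 as stored by hardware
converted = le32_to_cpu(raw_field)               # correct on any host
unconverted_be = struct.unpack(">I", raw_field)[0]  # big-endian host, no conversion
```

On a little-endian host the bug is invisible because both readings agree; on
a big-endian host the unconverted field compares as 0x06000000.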
Reported-by: Данилин Михаил <mdanilin@nsg.net.ru>
Signed-off-by: Inaky Perez-Gonzalez <inaky@linux.intel.com>
Remove the useless union keyword in rtable, rt6_info and dn_route.
Since each of these unions has only one member, the union keyword isn't useful.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix a race at the end of NAPI complete processing: __napi_complete()
should be called before interrupts are re-enabled.
Signed-off-by: Figo.zhang <figo1802@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch corrects a bug in the delay of pktgen.
It makes sure the inter-packet interval is accurate.
Signed-off-by: Daniel Turull <daniel.turull@gmail.com>
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
gen_kill_estimator() / gen_new_estimator() are not always called with
RTNL held.
net/netfilter/xt_RATEEST.c is one user of these APIs that does not hold
RTNL, so random corruption can occur between "tc" and "iptables".
Add a new fine-grained lock instead of trying to use RTNL in netfilter.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 66018506e1 (ip: Router Alert RCU conversion) introduced RCU
lookups to ip_call_ra_chain(). It missed a proper deinit phase:
When ip_ra_control() deletes an ip_ra_chain, it should make sure
ip_call_ra_chain() users can not start to use socket during the rcu
grace period. It should also delay the sock_put() after the grace
period, or we risk a premature socket freeing and corruptions, as
raw sockets are not rcu protected yet.
This delay avoids using expensive atomic_inc_not_zero() in
ip_call_ra_chain().
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, the accelerated receive path for VLANs will
drop packets if the real device is an inactive slave and
the packet is not one of the special pkts tested for in
skb_bond_should_drop(). This behavior is different than
the non-accelerated path and for pkts over a bonded vlan.
For example,
vlanx -> bond0 -> ethx
will be dropped in the vlan path and not delivered to any
packet handlers at all. However,
bond0 -> vlanx -> ethx
and
bond0 -> ethx
will be delivered to handlers that match the exact dev,
because the VLAN path checks the real_dev which is not a
slave and netif_receive_skb() doesn't drop frames but only
delivers them to exact matches.
This patch adds a sk_buff flag which is used for tagging
skbs that would previously have been dropped and allows the
skb to continue to netif_receive_skb(). Here we add
logic to check for the deliver_no_wcard flag and if it
is set only deliver to handlers that match exactly. This
makes both paths above consistent and gives pkt handlers
a way to identify skbs that come from inactive slaves.
Without this patch, in some configurations skbs will be
delivered to handlers with exact matches and in others
be dropped outright in the vlan path.
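The delivery rule for flagged skbs can be modelled as below. This is a toy
sketch: the deliver() function and the handler names are hypothetical, and a
handler bound to no device (None) plays the role of a wildcard handler:

```python
def deliver(skb_dev, handlers, deliver_no_wcard):
    """Return the names of handlers that receive a frame arriving on skb_dev.

    handlers is a list of (name, dev) pairs; dev None means wildcard.
    """
    delivered = []
    for name, dev in handlers:
        if dev is None:
            # Wildcard handlers are skipped for frames from inactive slaves.
            if not deliver_no_wcard:
                delivered.append(name)
        elif dev == skb_dev:
            # Handlers bound to this exact device always see the frame.
            delivered.append(name)
    return delivered


handlers = [("ip", None), ("monitor", "eth0"), ("other", "eth1")]
active = deliver("eth0", handlers, deliver_no_wcard=False)
inactive = deliver("eth0", handlers, deliver_no_wcard=True)
```

A frame from an active device reaches both the wildcard and the exact-match
handler; a flagged frame from an inactive slave reaches only the exact match.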
I have tested the following 4 configurations in failover modes
and load balancing modes.
# bond0 -> ethx
# vlanx -> bond0 -> ethx
# bond0 -> vlanx -> ethx
# bond0 -> ethx
            |
  vlanx -> --
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 1f8438a853 (icmp: Account for ICMP out errors), I made a typo
on the IPv6 side, using ICMP6_MIB_OUTMSGS instead of ICMP6_MIB_OUTERRORS.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>