Commit Graph

12847 Commits

Author SHA1 Message Date
Rusty Russell
c45d286e72 [NET]: Inline net_device_stats
Network drivers which keep stats allocate their own stats structure
then write a get_stats() function to return them.  It would be nice if
this were done by default.

1) Add a new "stats" field to "struct net_device".
2) Add a new feature field to say "this driver uses the internal one"
3) Have a default "get_stats" which returns NULL if that feature not set.
4) Change callers to check result of get_stats call for NULL, not if
   ->get_stats is set.

This should not break backwards compatibility with older drivers, yet
allow modern drivers to shed some boilerplate code.

Lightly tested: works for a modified lguest network driver.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:26 -07:00
Arnaldo Carvalho de Melo
d626f62b11 [SK_BUFF]: Introduce skb_copy_from_linear_data{_offset}
To clearly state the intent of copying from linear sk_buffs, _offset being a
overly long variant but interesting for the sake of saving some bytes.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2007-04-25 22:28:23 -07:00
Arnaldo Carvalho de Melo
2a123b86e2 [BLUETOOTH]: Introduce skb->data accessor methods for hci_{acl,event,sco}_hdr
For consistency with other skb data accessors, reducing the number of direct
accesses to skb->data.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2007-04-25 22:28:21 -07:00
Thomas Graf
73417f617a [NET] fib_rules: Flush route cache after rule modifications
The results of FIB rules lookups are cached in the routing cache
except for IPv6 as no such cache exists. So far, it was the
responsibility of the user to flush the cache after modifying any
rules. This lead to many false bug reports due to misunderstanding
of this concept.

This patch automatically flushes the route cache after inserting
or deleting a rule.

Thanks to Muli Ben-Yehuda <muli@il.ibm.com> for catching a bug
in the previous patch.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:18 -07:00
Herbert Xu
35fc92a9de [NET]: Allow forwarding of ip_summed except CHECKSUM_COMPLETE
Right now Xen has a horrible hack that lets it forward packets with
partial checksums.  One of the reasons that CHECKSUM_PARTIAL and
CHECKSUM_COMPLETE were added is so that we can get rid of this hack
(where it creates two extra bits in the skbuff to essentially mirror
ip_summed without being destroyed by the forwarding code).

I had forgotten that I've already gone through all the deivce drivers
last time around to make sure that they're looking at ip_summed ==
CHECKSUM_PARTIAL rather than ip_summed != 0 on transmit.  In any case,
I've now done that again so it should definitely be safe.

Unfortunately nobody has yet added any code to update CHECKSUM_COMPLETE
values on forward so we I'm setting that to CHECKSUM_NONE.  This should
be safe to remove for bridging but I'd like to check that code path
first.

So here is the patch that lets us get rid of the hack by preserving
ip_summed (mostly) on forwarded packets.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:16 -07:00
Thomas Graf
fa0b2d1d21 [NET] fib_rules: Add no-operation action
The use of nop rules simplifies the usage of goto rules
and adds more flexibility as they allow targets to remain
while the actual content of the branches can change easly.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:14 -07:00
Thomas Graf
2b44368307 [NET] fib_rules: Mark rules detached from the device
Rules which match against device names in their selector can
remain while the device itself disappears, in fact the device
doesn't have to present when the rule is added in the first
place. The device name is resolved by trying when the rule is
added and later by listening to NETDEV_REGISTER/UNREGISTER
notifications.

This patch adds the flag FIB_RULE_DEV_DETACHED which is set
towards userspace when a rule contains a device match which
is unresolved at the moment. This eases spotting the reason
why certain rules seem not to function properly.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:13 -07:00
Thomas Graf
0947c9fe56 [NET] fib_rules: goto rule action
This patch adds a new rule action FR_ACT_GOTO which allows
to skip a set of rules by jumping to another rule. The rule
to jump to is specified via the FRA_GOTO attribute which
carries a rule preference.

Referring to a rule which doesn't exists is explicitely allowed.
Such goto rules are marked with the flag FIB_RULE_UNRESOLVED
and will act like a rule with a non-matching selector. The rule
will become functional as soon as its target is present.

The goto action enables performance optimizations by reducing
the average number of rules that have to be passed per lookup.

Example:
0:      from all lookup local
40:     not from all to 192.168.23.128 goto 32766
41:     from all fwmark 0xa blackhole
42:     from all fwmark 0xff blackhole
32766:  from all lookup main

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:12 -07:00
David S. Miller
b3da2cf37c [INET]: Use jhash + random secret for ehash.
The days are gone when this was not an issue, there are folks out
there with huge bot networks that can be used to attack the
established hash tables on remote systems.

So just like the routing cache and connection tracking
hash, use Jenkins hash with random secret input.

Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:06 -07:00
Johannes Berg
d30045a0bc [NETLINK]: introduce NLA_BINARY type
This patch introduces a new NLA_BINARY attribute policy type with the
verification of simply checking the maximum length of the payload.

It also fixes a small typo in the example.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:05 -07:00
Vlad Yasevich
703315712c [SCTP]: Implement SCTP_MAX_BURST socket option.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:04 -07:00
Vlad Yasevich
a5a35e7675 [SCTP]: Implement sac_info field in SCTP_ASSOC_CHANGE notification.
As stated in the sctp socket api draft:

   sac_info: variable

   If the sac_state is SCTP_COMM_LOST and an ABORT chunk was received
   for this association, sac_info[] contains the complete ABORT chunk as
   defined in the SCTP specification RFC2960 [RFC2960] section 3.3.7.

We now save received ABORT chunks into the sac_info field and pass that
to the user.

Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:03 -07:00
Vlad Yasevich
bdf3092af6 [SCTP]: Honor flags when setting peer address parameters
Parameters only take effect when a corresponding flag bit is set
and a value is specified. This means we need to check the flags
in addition to checking for non-zero value.

Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:02 -07:00
Vlad Yasevich
1ae4114dce [SCTP]: Implement SCTP_ADDR_CONFIRMED state for ADDR_CHNAGE event
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:01 -07:00
Vlad Yasevich
d49d91d79a [SCTP]: Implement SCTP_PARTIAL_DELIVERY_POINT option.
This option induces partial delivery to run as soon
as the specified amount of data has been accumulated on
the association.  However, we give preference to fully
reassembled messages over PD messages.  In any case,
window and buffer is freed up.

Signed-off-by: Vlad Yasevich <vladislav.yasevich@.hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:00 -07:00
Vlad Yasevich
b6e1331f3c [SCTP]: Implement SCTP_FRAGMENT_INTERLEAVE socket option
This option was introduced in draft-ietf-tsvwg-sctpsocket-13.  It
prevents head-of-line blocking in the case of one-to-many endpoint.
Applications enabling this option really must enable SCTP_SNDRCV event
so that they would know where the data belongs.  Based on an
earlier patch by Ivan Skytte Jørgensen.

Additionally, this functionality now permits multiple associations
on the same endpoint to enter Partial Delivery.  Applications should
be extra careful, when using this functionality, to track EOR indicators.

Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:59 -07:00
Patrick McHardy
a48b5a6144 [NET_SCHED]: Unline tcf_destroy
Uninline tcf_destroy and add a helper function to destroy an entire filter
chain.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:56 -07:00
Patrick McHardy
3bebcda280 [NET_SCHED]: turn PSCHED_GET_TIME into inline function
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:55 -07:00
Patrick McHardy
03cc45c0a5 [NET_SCHED]: turn PSCHED_TDIFF_SAFE into inline function
Also rename to psched_tdiff_bounded.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:54 -07:00
Patrick McHardy
8edc0c31d6 [NET_SCHED]: kill PSCHED_TDIFF
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:53 -07:00
Patrick McHardy
a084980dcb [NET_SCHED]: kill PSCHED_SET_PASTPERFECT/PSCHED_IS_PASTPERFECT
Use direct assignment and comparison instead.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:51 -07:00
Patrick McHardy
104e087898 [NET_SCHED]: kill PSCHED_TLESS
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:50 -07:00
Patrick McHardy
7c59e25f31 [NET_SCHED]: kill PSCHED_TADD/PSCHED_TADD2
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:49 -07:00
Patrick McHardy
26e252df1e [NET_SCHED]: kill PSCHED_AUDIT_TDIFF
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:48 -07:00
Yasuyuki Kozakai
de6e05c49f [NETFILTER]: nf_conntrack: kill destroy() in struct nf_conntrack for diet
The destructor per conntrack is unnecessary, then this replaces it with
system wide destructor.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:45 -07:00
Yasuyuki Kozakai
5f79e0f916 [NETFILTER]: nf_conntrack: don't use nfct in skb if conntrack is disabled
Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:44 -07:00
Thomas Graf
1d00a4eb42 [NETLINK]: Remove error pointer from netlink message handler
The error pointer argument in netlink message handlers is used
to signal the special case where processing has to be interrupted
because a dump was started but no error happened. Instead it is
simpler and more clear to return -EINTR and have netlink_run_queue()
deal with getting the queue right.

nfnetlink passed on this error pointer to its subsystem handlers
but only uses it to signal the start of a netlink dump. Therefore
it can be removed there as well.

This patch also cleans up the error handling in the affected
message handlers to be consistent since it had to be touched anyway.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:30 -07:00
Thomas Graf
c454673da7 [NET] rules: Unified rules dumping
Implements a unified, protocol independant rules dumping function
which is capable of both, dumping a specific protocol family or
all of them. This speeds up dumping as less lookups are required.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:17 -07:00
Thomas Graf
c127ea2c45 [IPv6]: Use rtnl registration interface
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:13 -07:00
Thomas Graf
fa34ddd739 [DECNet]: Use rtnl registration interface
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:12 -07:00
Thomas Graf
be577ddc2b [PKT_SCHED] qdisc: Use rtnl registration interface
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:09 -07:00
Thomas Graf
63f3444fb9 [IPv4]: Use rtnl registration interface
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:08 -07:00
Thomas Graf
9d9e6a5819 [NET] rules: Use rtnl registration interface
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:07 -07:00
Thomas Graf
c8822a4e00 [NEIGH]: Use rtnl registration interface
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:06 -07:00
Thomas Graf
e284986385 [RTNL]: Message handler registration interface
This patch adds a new interface to register rtnetlink message
handlers replacing the exported rtnl_links[] array which
required many message handlers to be exported unnecessarly.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:04 -07:00
Gerrit Renker
89560b53b9 [DCCP]: Sample RTT from SYN exchange
Function:
2007-04-25 22:27:02 -07:00
Arnaldo Carvalho de Melo
dc5fc579b9 [NETLINK]: Use nlmsg_trim() where appropriate
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:37 -07:00
Arnaldo Carvalho de Melo
a36ca73337 [NETLINK]: Remove NLMSG_{NEW_ANSWER,CANCEL,END}
Not used anywhere and defined inside __KERNEL__, Thomas acked this on irc.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2007-04-25 22:26:36 -07:00
Arnaldo Carvalho de Melo
897933bcdf [SK_BUFF]: Remove skb_add_mtu() leftovers
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2007-04-25 22:26:35 -07:00
Arnaldo Carvalho de Melo
b529ccf279 [NETLINK]: Introduce nlmsg_hdr() helper
For the common "(struct nlmsghdr *)skb->data" sequence, so that we reduce the
number of direct accesses to skb->data and for consistency with all the other
cast skb member helpers.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:34 -07:00
Arnaldo Carvalho de Melo
4305b54135 [SK_BUFF]: Convert skb->end to sk_buff_data_t
Now to convert the last one, skb->data, that will allow many simplifications
and removal of some of the offset helpers.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:29 -07:00
Arnaldo Carvalho de Melo
27a884dc3c [SK_BUFF]: Convert skb->tail to sk_buff_data_t
So that it is also an offset from skb->head, reduces its size from 8 to 4 bytes
on 64bit architectures, allowing us to combine the 4 bytes hole left by the
layer headers conversion, reducing struct sk_buff size to 256 bytes, i.e. 4
64byte cachelines, and since the sk_buff slab cache is SLAB_HWCACHE_ALIGN...
:-)

Many calculations that previously required that skb->{transport,network,
mac}_header be first converted to a pointer now can be done directly, being
meaningful as offsets or pointers.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:28 -07:00
Arnaldo Carvalho de Melo
2e07fa9cd3 [SK_BUFF]: Use offsets for skb->{mac,network,transport}_header on 64bit architectures
With this we save 8 bytes per network packet, leaving a 4 bytes hole to be used
in further shrinking work, likely with the offsetization of other pointers,
such as ->{data,tail,end}, at the cost of adds, that were minimized by the
usual practice of setting skb->{mac,nh,n}.raw to a local variable that is then
accessed multiple times in each function, it also is not more expensive than
before with regards to most of the handling of such headers, like setting one
of these headers to another (transport to network, etc), or subtracting, adding
to/from it, comparing them, etc.

Now we have this layout for sk_buff on a x86_64 machine:

[acme@mica net-2.6.22]$ pahole vmlinux sk_buff
struct sk_buff {
	struct sk_buff *       next;             /*   0   8 */
	struct sk_buff *       prev;             /*   8   8 */
	struct rb_node         rb;               /*  16  24 */
	struct sock *          sk;               /*  40   8 */
	ktime_t                tstamp;           /*  48   8 */
	struct net_device *    dev;              /*  56   8 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	struct net_device *    input_dev;        /*  64   8 */
	sk_buff_data_t         transport_header; /*  72   4 */
	sk_buff_data_t         network_header;   /*  76   4 */
	sk_buff_data_t         mac_header;       /*  80   4 */

	/* XXX 4 bytes hole, try to pack */

	struct dst_entry *     dst;              /*  88   8 */
	struct sec_path *      sp;               /*  96   8 */
	char                   cb[48];           /* 104  48 */
	/* cacheline 2 boundary (128 bytes) was 24 bytes ago*/
	unsigned int           len;              /* 152   4 */
	unsigned int           data_len;         /* 156   4 */
	unsigned int           mac_len;          /* 160   4 */
	union {
		__wsum         csum;             /*       4 */
		__u32          csum_offset;      /*       4 */
	};                                       /* 164   4 */
	__u32                  priority;         /* 168   4 */
	__u8                   local_df:1;       /* 172   1 */
	__u8                   cloned:1;         /* 172   1 */
	__u8                   ip_summed:2;      /* 172   1 */
	__u8                   nohdr:1;          /* 172   1 */
	__u8                   nfctinfo:3;       /* 172   1 */
	__u8                   pkt_type:3;       /* 173   1 */
	__u8                   fclone:2;         /* 173   1 */
	__u8                   ipvs_property:1;  /* 173   1 */

	/* XXX 2 bits hole, try to pack */

	__be16                 protocol;         /* 174   2 */
	void    (*destructor)(struct sk_buff *); /* 176   8 */
	struct nf_conntrack *  nfct;             /* 184   8 */
	/* --- cacheline 3 boundary (192 bytes) --- */
	struct sk_buff *       nfct_reasm;       /* 192   8 */
	struct nf_bridge_info *nf_bridge;        /* 200   8 */
	__u16                  tc_index;         /* 208   2 */
	__u16                  tc_verd;          /* 210   2 */
	dma_cookie_t           dma_cookie;       /* 212   4 */
	__u32                  secmark;          /* 216   4 */
	__u32                  mark;             /* 220   4 */
	unsigned int           truesize;         /* 224   4 */
	atomic_t               users;            /* 228   4 */
	unsigned char *        head;             /* 232   8 */
	unsigned char *        data;             /* 240   8 */
	unsigned char *        tail;             /* 248   8 */
	/* --- cacheline 4 boundary (256 bytes) --- */
	unsigned char *        end;              /* 256   8 */
}; /* size: 264, cachelines: 5 */
   /* sum members: 260, holes: 1, sum holes: 4 */
   /* bit holes: 1, sum bit holes: 2 bits */
   /* last cacheline: 8 bytes */

On 32 bits nothing changes, and pointers continue to be used with the compiler
turning all this abstraction layer into dust. But there are some sk_buff
validation tricks that are now possible, humm... :-)

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:21 -07:00
Arnaldo Carvalho de Melo
b0e380b1d8 [SK_BUFF]: unions of just one member don't get anything done, kill them
Renaming skb->h to skb->transport_header, skb->nh to skb->network_header and
skb->mac to skb->mac_header, to match the names of the associated helpers
(skb[_[re]set]_{transport,network,mac}_header).

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:20 -07:00
Arnaldo Carvalho de Melo
cfe1fc7759 [SK_BUFF]: Introduce skb_network_header_len
For the common sequence "skb->h.raw - skb->nh.raw", similar to skb->mac_len,
that is precalculated tho, don't think we need to bloat skb with one more
member, so just use this new helper, reducing the number of non-skbuff.h
references to the layer headers even more.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:19 -07:00
Patrick McHardy
00c04af9df [NET_SCHED]: kill jiffie conversion macros
Now that all packet schedulers have been converted to hrtimers most users
of PSCHED_JIFFIE2US and PSCHED_US2JIFFIE are gone. The remaining users use
it to convert external time units to packet scheduler clock ticks, so use
PSCHED_TICKS_PER_SEC instead.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:14 -07:00
Patrick McHardy
4179477f63 [NET_SCHED]: Add hrtimer based qdisc watchdog
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:05 -07:00
Patrick McHardy
641b9e0e8b [NET_SCHED]: Use ktime as clocksource
Get rid of the manual clock source selection mess and use ktime. Also
use a scalar representation, which allows to clean up pkt_sched.h a bit
more and results in less ktime_to_ns() calls in most cases.

The PSCHED_US2JIFFIE/PSCHED_JIFFIE2US macros are implemented quite
inefficient by this patch, following patches will convert all qdiscs
to hrtimers and get rid of them entirely.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:04 -07:00
Arnaldo Carvalho de Melo
0a6114d94b [KBUILD]: Unifdef headers changed by the skb layer header refactorings
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:02 -07:00
Pablo Neira Ayuso
c8e2078cfe [NETFILTER]: ctnetlink: add support for internal tcp connection tracking flags handling
This patch let userspace programs set the IP_CT_TCP_BE_LIBERAL flag to
force the pickup of established connections.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:57 -07:00
Yasuyuki Kozakai
e7ac05f340 [NETFILTER]: nf_conntrack: add nf_copy() to safely copy members in skb
This unifies the codes to copy netfilter related datas. Before copying,
nf_copy() puts original members in destination skb.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:55 -07:00
Yasuyuki Kozakai
edda553c32 [NETFILTER]: nf_conntrack: add __nf_copy() to copy members in skb
This unifies the codes to copy netfilter related datas. Note that
__nf_copy() assumes destination skb doesn't have any netfilter
related members.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:54 -07:00
Patrick McHardy
8e87e014ec [JHASH]: Use const in jhash2
Use const to avoid forcing users to cast const data.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:52 -07:00
Patrick McHardy
010c7d6f86 [NETFILTER]: nf_conntrack: uninline notifier registration functions
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:46 -07:00
Patrick McHardy
a3c5029cf7 [NETFILTER]: nfnetlink: use mutex instead of semaphore
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:43 -07:00
Patrick McHardy
ac5357ebac [NETFILTER]: nf_conntrack: remove ugly hack in l4proto registration
Remove ugly special-casing of nf_conntrack_l4proto_generic, all it
wants is its sysctl tables registered, so do that explicitly in an
init function and move the remaining protocol initialization and
cleanup code to nf_conntrack_proto.c as well.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:40 -07:00
Patrick McHardy
587aa64163 [NETFILTER]: Remove IPv4 only connection tracking/NAT
Remove the obsolete IPv4 only connection tracking/NAT as scheduled in
feature-removal-schedule.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:34 -07:00
Arnaldo Carvalho de Melo
9c70220b73 [SK_BUFF]: Introduce skb_transport_header(skb)
For the places where we need a pointer to the transport header, it is
still legal to touch skb->h.raw directly if just adding to,
subtracting from or setting it to another layer header.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:31 -07:00
Arnaldo Carvalho de Melo
39b89160df [SK_BUFF]: Introduce ipipv6_hdr(), remove skb->h.ipv6h
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:28 -07:00
Arnaldo Carvalho de Melo
b0061ce49c [SK_BUFF]: Introduce ipip_hdr(), remove skb->h.ipiph
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:27 -07:00
Arnaldo Carvalho de Melo
aa8223c7bb [SK_BUFF]: Introduce tcp_hdr(), remove skb->h.th
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:26 -07:00
Arnaldo Carvalho de Melo
ab6a5bb6b2 [TCP]: Introduce tcp_hdrlen() and tcp_optlen()
The ip_hdrlen() buddy, created to reduce the number of skb->h.th-> uses and to
avoid the longer, open coded equivalent.

Ditched a no-op in bnx2 in the process.

I wonder if we should have a BUG_ON(skb->h.th->doff < 5) in tcp_optlen()...

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:24 -07:00
Arnaldo Carvalho de Melo
88c7664f13 [SK_BUFF]: Introduce icmp_hdr(), remove skb->h.icmph
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:23 -07:00
Arnaldo Carvalho de Melo
4bedb45203 [SK_BUFF]: Introduce udp_hdr(), remove skb->h.uh
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:22 -07:00
Arnaldo Carvalho de Melo
d9edf9e2be [SK_BUFF]: Introduce igmp_hdr() & friends, remove skb->h.igmph
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:21 -07:00
Arnaldo Carvalho de Melo
cc70ab261c [ICMP6]: Introduce icmp6_hdr()
For consistency with all the other skb->h.raw accessors.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:20 -07:00
Arnaldo Carvalho de Melo
2c0fd387b0 [SCTP]: Introduce sctp_hdr()
For consistency with all the other skb->h.raw accessors.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:19 -07:00
Arnaldo Carvalho de Melo
967b05f64e [SK_BUFF]: Introduce skb_set_transport_header
For the cases where the transport header is being set to a offset from
skb->data.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:17 -07:00
Arnaldo Carvalho de Melo
ea2ae17d64 [SK_BUFF]: Introduce skb_transport_offset()
For the quite common 'skb->h.raw - skb->data' sequence.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:16 -07:00
Arnaldo Carvalho de Melo
badff6d01a [SK_BUFF]: Introduce skb_reset_transport_header(skb)
For the common, open coded 'skb->h.raw = skb->data' operation, so that we can
later turn skb->h.raw into a offset, reducing the size of struct sk_buff in
64bit land while possibly keeping it as a pointer on 32bit.

This one touches just the most simple cases:

skb->h.raw = skb->data;
skb->h.raw = {skb_push|[__]skb_pull}()

The next ones will handle the slightly more "complex" cases.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:15 -07:00
Arnaldo Carvalho de Melo
0660e03f6b [SK_BUFF]: Introduce ipv6_hdr(), remove skb->nh.ipv6h
Now the skb->nh union has just one member, .raw, i.e. it is just like the
skb->mac union, strange, no? I'm just leaving it like that till the transport
layer is done with, when we'll rename skb->mac.raw to skb->mac_header (or
->mac_header_offset?), ditto for ->{h,nh}.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:14 -07:00
Arnaldo Carvalho de Melo
d0a92be05e [SK_BUFF]: Introduce arp_hdr(), remove skb->nh.arph
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:12 -07:00
Arnaldo Carvalho de Melo
eddc9ec53b [SK_BUFF]: Introduce ip_hdr(), remove skb->nh.iph
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:10 -07:00
Arnaldo Carvalho de Melo
c9bdd4b525 [IP]: Introduce ip_hdrlen()
For the common sequence "skb->nh.iph->ihl * 4", removing a good number of open
coded skb->nh.iph uses, now to go after the rest...

Just out of curiosity, here are the idioms found to get the same result:

skb->nh.iph->ihl << 2
skb->nh.iph->ihl<<2
skb->nh.iph->ihl * 4
skb->nh.iph->ihl*4
(skb->nh.iph)->ihl * sizeof(u32)

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:07 -07:00
Arnaldo Carvalho de Melo
c14d2450cb [SK_BUFF]: Introduce skb_set_network_header
For the cases where the network header is being set to a offset from skb->data.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:01 -07:00
Arnaldo Carvalho de Melo
d56f90a7c9 [SK_BUFF]: Introduce skb_network_header()
For the places where we need a pointer to the network header, it is still legal
to touch skb->nh.raw directly if just adding to, subtracting from or setting it
to another layer header.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:59 -07:00
Arnaldo Carvalho de Melo
bbe735e424 [SK_BUFF]: Introduce skb_network_offset()
For the quite common 'skb->nh.raw - skb->data' sequence.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:58 -07:00
Arnaldo Carvalho de Melo
c1d2bbe1cd [SK_BUFF]: Introduce skb_reset_network_header(skb)
For the common, open coded 'skb->nh.raw = skb->data' operation, so that we can
later turn skb->nh.raw into a offset, reducing the size of struct sk_buff in
64bit land while possibly keeping it as a pointer on 32bit.

This one touches just the most simple case, next will handle the slightly more
"complex" cases.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:46 -07:00
Arnaldo Carvalho de Melo
797659fb4a [PPPOE]: Introduce pppoe_hdr()
For consistency with all the other skb->nh.raw accessors.

Also do some really obvious simplifications in pppoe_recvmsg, well the
kfree_skb one is not so obvious, but free() and kfree() have the same behaviour
(hint :-) ).

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:43 -07:00
Arnaldo Carvalho de Melo
37e6636669 [LLC]: Kill llc_set_pdu_hdr
We'll have skb_reset_network_header soon.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:42 -07:00
Arnaldo Carvalho de Melo
98e399f82a [SK_BUFF]: Introduce skb_mac_header()
For the places where we need a pointer to the mac header, it is still legal to
touch skb->mac.raw directly if just adding to, subtracting from or setting it
to another layer header.

This one also converts some more cases to skb_reset_mac_header() that my
regex missed as it had no spaces before nor after '=', ugh.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:41 -07:00
Arnaldo Carvalho de Melo
48d49d0ccd [SK_BUFF]: Introduce skb_set_mac_header()
For the cases where we want to set skb->mac.raw to an offset from skb->data.

Simple cases first, the memmove ones and specially pktgen will be left for later.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:37 -07:00
Arnaldo Carvalho de Melo
459a98ed88 [SK_BUFF]: Introduce skb_reset_mac_header(skb)
For the common, open coded 'skb->mac.raw = skb->data' operation, so that we can
later turn skb->mac.raw into a offset, reducing the size of struct sk_buff in
64bit land while possibly keeping it as a pointer on 32bit.

This one touches just the most simple case, next will handle the slightly more
"complex" cases.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:32 -07:00
Eric Dumazet
92f37fd2ee [NET]: Adding SO_TIMESTAMPNS / SCM_TIMESTAMPNS support
Now that network timestamps use ktime_t infrastructure, we can add a new
SOL_SOCKET sockopt  SO_TIMESTAMPNS.

This command is similar to SO_TIMESTAMP, but permits transmission of
a 'timespec struct' instead of a 'timeval struct' control message.
(nanosecond resolution instead of microsecond)

Control message is labelled SCM_TIMESTAMPNS instead of SCM_TIMESTAMP

A socket cannot mix SO_TIMESTAMP and SO_TIMESTAMPNS : the two modes are
mutually exclusive.

sock_recv_timestamp() became too big to be fully inlined so I added a
__sock_recv_timestamp() helper function.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
CC: linux-arch@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:21 -07:00
Stephen Hemminger
a2a316fd06 [NET]: Replace CONFIG_NET_DEBUG with sysctl.
Covert network warning messages from a compile time to runtime choice.
Removes kernel config option and replaces it with new /proc/sys/net/core/warnings.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:05 -07:00
Eric Dumazet
ae40eb1ef3 [NET]: Introduce SIOCGSTAMPNS ioctl to get timestamps with nanosec resolution
Now network timestamps use ktime_t infrastructure, we can add a new
ioctl() SIOCGSTAMPNS command to get timestamps in 'struct timespec'.
User programs can thus access to nanosecond resolution.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
CC: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:04 -07:00
David S. Miller
fe067e8ab5 [TCP]: Abstract out all write queue operations.
This allows the write queue implementation to be changed,
for example, to one which allows fast interval searching.

Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:02 -07:00
Herbert Xu
759e5d0064 [UDP]: Clean up UDP-Lite receive checksum
This patch eliminates some duplicate code for the verification of
receive checksums between UDP-Lite and UDP.  It does this by
introducing __skb_checksum_complete_head which is identical to
__skb_checksum_complete_head apart from the fact that it takes
a length parameter rather than computing the first skb->len bytes.

As a result UDP-Lite will be able to use hardware checksum offload
for packets which do not use partial coverage checksums.  It also
means that UDP-Lite loopback no longer does unnecessary checksum
verification.

If any NICs start support UDP-Lite this would also start working
automatically.

This patch removes the assumption that msg_flags has MSG_TRUNC clear
upon entry in recvmsg.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:51 -07:00
David S. Miller
fc910a2783 [NETLINK]: Limit NLMSG_GOODSIZE to 8K.
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:45 -07:00
Neil Horman
95c385b4d5 [IPV6] ADDRCONF: Optimistic Duplicate Address Detection (RFC 4429) Support.
Nominally an autoconfigured IPv6 address is added to an interface in the
Tentative state (as per RFC 2462).  Addresses in this state remain in this
state while the Duplicate Address Detection process operates on them to
determine their uniqueness on the network.  During this period, these
tentative addresses may not be used for communication, increasing the time
before a node may be able to communicate on a network.  Using Optimistic
Duplicate Address Detection, autoconfigured addresses may be used
immediately for communication on the network, as long as certain rules are
followed to avoid conflicts with other nodes during the Duplicate Address
Detection process.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:43 -07:00
Eric Dumazet
b7aa0bf70c [NET]: convert network timestamps to ktime_t
We currently use a special structure (struct skb_timeval) and plain
'struct timeval' to store packet timestamps in sk_buffs and struct
sock.

This has some drawbacks :
- Fixed resolution of micro second.
- Waste of space on 64bit platforms where sizeof(struct timeval)=16

I suggest using ktime_t that is a nice abstraction of high resolution
time services, currently capable of nanosecond resolution.

As sizeof(ktime_t) is 8 bytes, using ktime_t in 'struct sock' permits
a 8 byte shrink of this structure on 64bit architectures. Some other
structures also benefit from this size reduction (struct ipq in
ipv4/ip_fragment.c, struct frag_queue in ipv6/reassembly.c, ...)

Once this ktime infrastructure adopted, we can more easily provide
nanosecond resolution on top of it. (ioctl SIOCGSTAMPNS and/or
SO_TIMESTAMPNS/SCM_TIMESTAMPNS)

Note : this patch includes a bug correction in
compat_sock_get_timestamp() where a "err = 0;" was missing (so this
syscall returned -ENOENT instead of 0)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
CC: Stephen Hemminger <shemminger@linux-foundation.org>
CC: John find <linux.kernel@free.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:34 -07:00
Stephen Hemminger
3927f2e8f9 [NET]: div64_64 consolidate (rev3)
Here is the current version of the 64 bit divide common code.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:33 -07:00
James Morris
9d729f72dc [NET]: Convert xtime.tv_sec to get_seconds()
Where appropriate, convert references to xtime.tv_sec to the
get_seconds() helper function.

Signed-off-by: James Morris <jmorris@namei.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:32 -07:00
Eric Dumazet
fa438ccfdf [NET]: Keep sk_backlog near sk_lock
sk_backlog is a critical field of struct sock. (known famous words)

It is (ab)used in hot paths, in particular in release_sock(), tcp_recvmsg(),
tcp_v4_rcv(), sk_receive_skb().

It really makes sense to place it next to sk_lock, because sk_backlog is only
used after sk_lock locked (and thus memory cache line in L1 cache). This
should reduce cache misses and sk_lock acquisition time.

(In theory, we could only move the head pointer near sk_lock, and leaving tail
far away, because 'tail' is normally not so hot, but keep it simple :) )

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:27 -07:00
Ilpo Järvinen
3cfe3baaf0 [TCP]: Add two new spurious RTO responses to FRTO
New sysctl tcp_frto_response is added to select amongst these
responses:
	- Rate halving based; reuses CA_CWR state (default)
	- Very conservative; used to be the only one available (=1)
	- Undo cwr; undoes ssthresh and cwnd reductions (=2)

The response with rate halving requires a new parameter to
tcp_enter_cwr because FRTO has already reduced ssthresh and
doing a second reduction there has to be prevented. In addition,
to keep things nice on 80 cols screen, a local variable was
added.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:23 -07:00
David S. Miller
e0ef57cc56 [TCP]: Make snd_cwnd_clamp a u32.
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:21 -07:00
Eric Dumazet
54287cc178 [TCP]: Keep copied_seq, rcv_wup and rcv_next together.
I noticed in oprofile study a cache miss in tcp_rcv_established() to read
copied_seq.

ffffffff80400a80 <tcp_rcv_established>: /* tcp_rcv_established total: 4034293  
2.0400 */

 55493  0.0281 :ffffffff80400bc9:   mov    0x4c8(%r12),%eax copied_seq
543103  0.2746 :ffffffff80400bd1:   cmp    0x3e0(%r12),%eax   rcv_nxt    

if (tp->copied_seq == tp->rcv_nxt &&
        len - tcp_header_len <= tp->ucopy.len) {

In this function, the cache line 0x4c0 -> 0x500 is used only for this
reading 'copied_seq' field.

rcv_wup and copied_seq should be next to rcv_nxt field, to lower number of
active cache lines in hot paths. (tcp_rcv_established(), tcp_poll(), ...)

As you suggested, I changed tcp_create_openreq_child() so that these fields
are changed together, to avoid adding a new store buffer stall.

Patch is 64bit friendly (no new hole because of alignment constraints)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:21 -07:00
John Heffner
886236c124 [TCP]: Add RFC3742 Limited Slow-Start, controlled by variable sysctl_tcp_max_ssthresh.
Signed-off-by: John Heffner <jheffner@psc.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:19 -07:00
Ilpo Järvinen
46d0de4ed9 [TCP] FRTO: Entry is allowed only during (New)Reno like recovery
This interpretation comes from RFC4138:
    "If the sender implements some loss recovery algorithm other
     than Reno or NewReno [FHG04], the F-RTO algorithm SHOULD
     NOT be entered when earlier fast recovery is underway."

I think the RFC means to say (especially in the light of
Appendix B) that ...recovery is underway (not just fast recovery)
or was underway when it was interrupted by an earlier (F-)RTO
that hasn't yet been resolved (snd_una has not advanced enough).
Thus, my interpretation is that whenever TCP has ever
retransmitted other than head, basic version cannot be used
because then the order assumptions which are used as FRTO basis
do not hold.

NewReno has only the head segment retransmitted at a time.
Therefore, walk up to the segment that has not been SACKed, if
that segment is not retransmitted nor anything before it, we know
for sure, that nothing after the non-SACKed segment should be
either. This assumption is valid because TCPCB_EVER_RETRANS does
not leave holes but each non-SACKed segment is rexmitted
in-order.

Check for retrans_out > 1 avoids more expensive walk through the
skb list, as we can know the result beforehand: F-RTO will not be
allowed.

SACKed skb can turn into non-SACked only in the extremely rare
case of SACK reneging, in this case we might fail to detect
retransmissions if there were them for any other than head. To
get rid of that feature, whole rexmit queue would have to be
walked (always) or FRTO should be prevented when SACK reneging
happens. Of course RTO should still trigger after reneging which
makes this issue even less likely to show up. And as long as the
response is as conservative as it's now, nothing bad happens even
then.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:12 -07:00
Ilpo Järvinen
bdaae17da8 [TCP] FRTO: Moved tcp_use_frto from tcp.h to tcp_input.c
In addition, removed inline.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:02 -07:00
Linus Torvalds
12145387a0 Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6:
  [BNX2]: Fix occasional NETDEV WATCHDOG on 5709.
  [IPV6]: Disallow RH0 by default.
  [XFRM]: beet: fix pseudo header length value
  [TCP]: Congestion control initialization.
2007-04-24 18:20:32 -07:00
YOSHIFUJI Hideaki
0bcbc92629 [IPV6]: Disallow RH0 by default.
A security issue is emerging.  Disallow Routing Header Type 0 by default
as we have been doing for IPv4.
Note: We allow RH2 by default because it is harmless.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-24 14:58:30 -07:00
Balbir Singh
7e40f2ab0a Taskstats fix the structure members alignment issue
We broke the the alignment of members of taskstats to the 8 byte boundary
with the CSA patches.  In the current kernel, the taskstats structure is
not suitable for use by 32 bit applications in a 64 bit kernel.

On x86_64

Offsets of taskstats' members (64 bit kernel, 64 bit application)

@taskstats'offsetof[@taskstats'indices] = (
        0,      # version
        4,      # ac_exitcode
        8,      # ac_flag
        9,      # ac_nice
        16,     # cpu_count
        24,     # cpu_delay_total
        32,     # blkio_count
        40,     # blkio_delay_total
        48,     # swapin_count
        56,     # swapin_delay_total
        64,     # cpu_run_real_total
        72,     # cpu_run_virtual_total
        80,     # ac_comm
        112,    # ac_sched
        113,    # ac_pad
        116,    # ac_uid
        120,    # ac_gid
        124,    # ac_pid
        128,    # ac_ppid
        132,    # ac_btime
        136,    # ac_etime
        144,    # ac_utime
        152,    # ac_stime
        160,    # ac_minflt
        168,    # ac_majflt
        176,    # coremem
        184,    # virtmem
        192,    # hiwater_rss
        200,    # hiwater_vm
        208,    # read_char
        216,    # write_char
        224,    # read_syscalls
        232,    # write_syscalls
        240,    # read_bytes
        248,    # write_bytes
        256,    # cancelled_write_bytes
    );

Offsets of taskstats' members (64 bit kernel, 32 bit application)

@taskstats'offsetof[@taskstats'indices] = (
        0,      # version
        4,      # ac_exitcode
        8,      # ac_flag
        9,      # ac_nice
        12,     # cpu_count
        20,     # cpu_delay_total
        28,     # blkio_count
        36,     # blkio_delay_total
        44,     # swapin_count
        52,     # swapin_delay_total
        60,     # cpu_run_real_total
        68,     # cpu_run_virtual_total
        76,     # ac_comm
        108,    # ac_sched
        109,    # ac_pad
        112,    # ac_uid
        116,    # ac_gid
        120,    # ac_pid
        124,    # ac_ppid
        128,    # ac_btime
        132,    # ac_etime
        140,    # ac_utime
        148,    # ac_stime
        156,    # ac_minflt
        164,    # ac_majflt
        172,    # coremem
        180,    # virtmem
        188,    # hiwater_rss
        196,    # hiwater_vm
        204,    # read_char
        212,    # write_char
        220,    # read_syscalls
        228,    # write_syscalls
        236,    # read_bytes
        244,    # write_bytes
        252,    # cancelled_write_bytes
    );

This is one way to solve the problem without re-arranging structure members
is to pack the structure.  The patch adds an __attribute__((aligned(8))) to
the taskstats structure members so that 32 bit applications using taskstats
can work with a 64 bit kernel.

Using __attribute__((packed)) would break the 64 bit alignment of members.

The fix was tested on x86_64. After the fix, we got

Offsets of taskstats' members (64 bit kernel, 64 bit application)

@taskstats'offsetof[@taskstats'indices] = (
        0,      # version
        4,      # ac_exitcode
        8,      # ac_flag
        9,      # ac_nice
        16,     # cpu_count
        24,     # cpu_delay_total
        32,     # blkio_count
        40,     # blkio_delay_total
        48,     # swapin_count
        56,     # swapin_delay_total
        64,     # cpu_run_real_total
        72,     # cpu_run_virtual_total
        80,     # ac_comm
        112,    # ac_sched
        113,    # ac_pad
        120,    # ac_uid
        124,    # ac_gid
        128,    # ac_pid
        132,    # ac_ppid
        136,    # ac_btime
        144,    # ac_etime
        152,    # ac_utime
        160,    # ac_stime
        168,    # ac_minflt
        176,    # ac_majflt
        184,    # coremem
        192,    # virtmem
        200,    # hiwater_rss
        208,    # hiwater_vm
        216,    # read_char
        224,    # write_char
        232,    # read_syscalls
        240,    # write_syscalls
        248,    # read_bytes
        256,    # write_bytes
        264,    # cancelled_write_bytes
    );

Offsets of taskstats' members (64 bit kernel, 32 bit application)

@taskstats'offsetof[@taskstats'indices] = (
        0,      # version
        4,      # ac_exitcode
        8,      # ac_flag
        9,      # ac_nice
        16,     # cpu_count
        24,     # cpu_delay_total
        32,     # blkio_count
        40,     # blkio_delay_total
        48,     # swapin_count
        56,     # swapin_delay_total
        64,     # cpu_run_real_total
        72,     # cpu_run_virtual_total
        80,     # ac_comm
        112,    # ac_sched
        113,    # ac_pad
        120,    # ac_uid
        124,    # ac_gid
        128,    # ac_pid
        132,    # ac_ppid
        136,    # ac_btime
        144,    # ac_etime
        152,    # ac_utime
        160,    # ac_stime
        168,    # ac_minflt
        176,    # ac_majflt
        184,    # coremem
        192,    # virtmem
        200,    # hiwater_rss
        208,    # hiwater_vm
        216,    # read_char
        224,    # write_char
        232,    # read_syscalls
        240,    # write_syscalls
        248,    # read_bytes
        256,    # write_bytes
        264,    # cancelled_write_bytes
    );

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-24 08:23:08 -07:00
Linus Torvalds
ea8df8c5e6 Merge branch 'upstream' of git://ftp.linux-mips.org/pub/scm/upstream-linus
* 'upstream' of git://ftp.linux-mips.org/pub/scm/upstream-linus:
  [MIPS] Fix wrong checksum for split TCP packets on 64-bit MIPS
  [MIPS] Fix BUG(), BUG_ON() handling
  [MIPS] Retry {save,restore}_fp_context if failed in atomic context.
  [MIPS] Disallow CpU exception in kernel again.
  [MIPS] Add missing silicon revisions for BCM112x
2007-04-20 22:57:51 -07:00
Trond Myklebust
8e821cad12 NFS: clean up the unstable write code
Get rid of the inlined #ifdefs.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-20 22:56:29 -07:00
Dave Johnson
1d464c26b5 [MIPS] Fix wrong checksum for split TCP packets on 64-bit MIPS
I've traced down an off-by-one TCP checksum calculation error under
the following conditions:

1) The TCP code needs to split a full-sized packet due to a reduced
   MSS (typically due to the addition of TCP options mid-stream like
   SACK).
   _AND_
2) The checksum of the 2nd fragment is larger than the checksum of the
   original packet.  After subtraction this results in a checksum for
   the 1st fragment with bits 16..31 set to 1. (this is ok)
   _AND_
3) The checksum of the 1st fragment's TCP header plus the previously
   32bit checksum of the 1st fragment DOES NOT cause a 32bit overflow
   when added together.  This results in a checksum of the TCP header
   plus TCP data that still has the upper 16 bits as 1's.
   _THEN_
4) The TCP+data checksum is added to the checksum of the pseudo IP
   header with csum_tcpudp_nofold() incorrectly (the bug).
    
The problem is the checksum of the TCP+data is passed to
csum_tcpudp_nofold() as an 32bit unsigned value, however the assembly
code acts on it as if it is a 64bit unsigned value.

This causes an incorrect 32->64bit extension if the sum has bit 31
set.  The resulting checksum is off by one.
    
This problems is data and TCP header dependent due to #2 and #3
above so it doesn't occur on every TCP packet split.
    
Signed-off-by: Dave Johnson <djohnson+linux-mips@sw.starentnetworks.com>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2007-04-20 14:58:37 +01:00
Atsushi Nemoto
ba755f8ec8 [MIPS] Fix BUG(), BUG_ON() handling
With commit 63dc68a8cf, kernel can not
handle BUG() and BUG_ON() properly since get_user() returns false for
kernel code.  Use __get_user() to skip unnecessary access_ok().  This
patch also make BRK_BUG code encoded in the TNE instruction.

Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2007-04-20 14:58:37 +01:00
Atsushi Nemoto
faea623464 [MIPS] Retry {save,restore}_fp_context if failed in atomic context.
The save_fp_context()/restore_fp_context() might sleep on accessing
user stack and therefore might lose FPU ownership in middle of them.

If these function failed due to "in_atomic" test in do_page_fault,
touch the sigcontext area in non-atomic context and retry these
save/restore operation.

This is a replacement of a (broken) fix which was titled "Allow CpU
exception in kernel partially".

Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2007-04-20 14:58:37 +01:00
Atsushi Nemoto
5323180db7 [MIPS] Disallow CpU exception in kernel again.
The commit 4d40bff7110e9e1a97ff8c01bdd6350e9867cc10 ("Allow CpU
exception in kernel partially") was broken.  The commit was to fix
theoretical problem but broke usual case.  Revert it for now.

Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2007-04-20 14:58:37 +01:00
Mark Mason
9a9943575a [MIPS] Add missing silicon revisions for BCM112x
Recent versions of the BCM112X processors aren't recognized by Linux
(preventing Linux from booting on those processors).  This patch adds
support for those that are missing.

Signed-off-by: Mark Mason <mason@broadcom.com>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2007-04-20 14:58:37 +01:00
Linus Torvalds
80d74d5123 Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6:
  [BRIDGE]: Unaligned access when comparing ethernet addresses
  [SCTP]: Unmap v4mapped addresses during SCTP_BINDX_REM_ADDR operation.
  [SCTP]: Fix assertion (!atomic_read(&sk->sk_rmem_alloc)) failed message
  [NET]: Set a separate lockdep class for neighbour table's proxy_queue
  [NET]: Fix UDP checksum issue in net poll mode.
  [KEY]: Fix conversion between IPSEC_MODE_xxx and XFRM_MODE_xxx.
  [NET]: Get rid of alloc_skb_from_cache
2007-04-17 16:51:32 -07:00
Russell King
93da28790c Provide dummy devm_ioport_* if !HAS_IOPORT
Provide an dummy implementation of devm_ioport_map() and
devm_ioport_unmap() to allow drivers (eg, pata_platform) to build for
platforms where CONFIG_NO_IOPORT is selected.

Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Jeff Garzik <jeff@garzik.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-17 16:36:27 -07:00
Ivan Kokshaysky
88ed39b064 alpha: build fixes - force architecture
Override compiler .arch directive for generic kernel build.

Signed-off-by: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Signed-off-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-17 16:36:27 -07:00
Ivan Kokshaysky
1b75b05b73 alpha: fixes for specific machine types
Files:

arch/alpha/kernel/core_mcpcia.c
arch/alpha/kernel/sys_rawhide.c
include/asm-alpha/core_mcpcia.h

	Determine correct hose configuration; RAWHIDE family can have
        2 or 4 hoses, so make sure non-existent hoses are ignored.

arch/alpha/kernel/err_titan.c

	Supply a needed #include <asm/irq_regs.h>

arch/alpha/kernel/module.c

	Add some useful output to the relocation overflow messages.

arch/alpha/kernel/sys_noritake.c

	Supply necessary noritake_end_irq() to correct interrupt handling.
	This fixes a problem first noted by hangs during boot probing with
	a DE500-BA TULIP NIC present.

arch/alpha/kernel/sys_sio.c

	Correct saving of original PIRQ register (PCI IRQ routing);
	change default PIRQ setting to leave PCI IRQs 9 and 14 free to
	be used for sound (Multia) and IDE (any), respectively.

include/asm-alpha/io.h

	Supply the "isa_virt_to_bus" routine.

Signed-off-by: Jay Estabrook <jay.estabrook@hp.com>
Signed-off-by: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Richard Henderson <rth@twiddle.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-17 16:36:27 -07:00
Randy Dunlap
112654208b kernel-doc: fix plist.h comments
Make kernel-doc comments match macro names.
Correct parameter names in a few places.
Remove '#' from beginning of kernel-doc comment macro names.
Remove extra (erroneous) blank lines in kernel-doc.

Warning(plist.h:100): Cannot understand  * #PLIST_HEAD_INIT - static struct plist_head initializer on line 100 - I thought it was a doc line
Warning(plist.h:112): Cannot understand  * #PLIST_NODE_INIT - static struct plist_node initializer on line 112 - I thought it was a doc line
Warning(plist.h:103): No description found for parameter '_lock'
Warning(plist.h:129): No description found for parameter 'lock'
Warning(plist.h:158): No description found for parameter 'pos'
Warning(plist.h:169): No description found for parameter 'pos'
Warning(plist.h:169): No description found for parameter 'n'
Warning(plist.h:179): No description found for parameter 'mem'

This still leaves one warning & one error that need attention:
Error(plist.h:219): cannot understand prototype: '('
Warning(plist.h): no structured comments found

Acked-by: Inaky Perez-Gonzalez <inaky.perez-gonzalez@intel.com>
Cc: Daniel Walker <dwalker@mvista.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-17 16:36:26 -07:00
Don Zickus
c4b7e8754e allow vmsplice to work in 32-bit mode on ppc64
Trivial change to pass vmsplice arguments through the compat layer on
pp64.

Signed-off-by: Don Zickus <dzickus@redhat.com>
Acked-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-17 16:36:26 -07:00
Pavel Emelianov
c2ecba7171 [NET]: Set a separate lockdep class for neighbour table's proxy_queue
Otherwise the following calltrace will lead to a wrong
lockdep warning:

  neigh_proxy_process()
    `- lock(neigh_table->proxy_queue.lock);
  arp_redo /* via tbl->proxy_redo */
  arp_process
  neigh_event_ns
  neigh_update
  skb_queue_purge
    `- lock(neighbor->arp_queue.lock);

This is not a deadlock actually, as neighbor table's proxy_queue
and the neighbor's arp_queue are different queues.

Lockdep thinks there is a deadlock as both queues are initialized
with skb_queue_head_init() and thus have a common class.

Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-17 13:13:31 -07:00
Herbert Xu
b4dfa0b1fb [NET]: Get rid of alloc_skb_from_cache
Since this was added originally for Xen, and Xen has recently (~2.6.18)
stopped using this function, we can safely get rid of it.  Good timing
too since this function has started to bit rot.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-17 13:13:16 -07:00
sshahrom@micron.com
8c60e5475d [MTD][NAND] Add Micron Manufacturer ID
Add Micron Manufacturer ID.

Signed-off-by: Shahrom Sharif <sshahrom@micron.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2007-04-17 18:27:06 +01:00
Trond Myklebust
5a6d41b32a NFS: Ensure PG_writeback is cleared when writeback fails
If the writebacks are cancelled via nfs_cancel_dirty_list, or due to the
memory allocation failing in nfs_flush_one/nfs_flush_multi, then we must
ensure that the PG_writeback flag is cleared.

Also ensure that we actually own the PG_writeback flag whenever we
schedule a new writeback by making nfs_set_page_writeback() return the
value of test_set_page_writeback().
The PG_writeback page flag ends up replacing the functionality of the
PG_FLUSHING nfs_page flag, so we rip that out too.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-14 21:46:48 -07:00
Linus Torvalds
8bd51cce98 Merge master.kernel.org:/pub/scm/linux/kernel/git/bart/ide-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/bart/ide-2.6:
  ide: add "optical" to sysfs "media" attribute
  ide: ugly messages trying to open CD drive with no media present
  ide: correctly prevent IDE timer expiry function to run if request was already handled
2007-04-10 17:22:31 -07:00
Linus Torvalds
b4dfd6bc35 Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6:
  [IA64] SGI Altix : fix pcibr_dmamap_ate32() bug
  [IA64] Fix CPU freq displayed in /proc/cpuinfo
  [IA64] Fix wrong assumption about irq and vector in msi_ia64.c
  [IA64] BTE error timer fix
2007-04-10 17:21:57 -07:00
Suleiman Souhlal
23450319e2 ide: correctly prevent IDE timer expiry function to run if request was already handled
It is possible for the timer expiry function to run even though the
request has already been handled: ide_timer_expiry() only checks that
the handler is not NULL, but it is possible that we have handled a
request (thus clearing the handler) and then started a new request
(thus starting the timer again, and setting a handler). 

A simple way to exhibit this is to set the DMA timeout to 1 jiffy and
run dd: The kernel will panic after a few minutes because
ide_timer_expiry() tries to add a timer when it's already active.

To fix this, we simply add a request generation count that gets
incremented at every interrupt, and check in ide_timer_expiry() that
we have not already handled a new interrupt before running the expiry
function.

Signed-off-by: Suleiman Souhlal <suleiman@google.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2007-04-10 22:38:37 +02:00
Zachary Amsden
49f1971051 [PATCH] Proper fix for highmem kmap_atomic functions for VMI for 2.6.21
Since lazy MMU batching mode still allows interrupts to enter, it is
possible for interrupt handlers to try to use kmap_atomic, which fails when
lazy mode is active, since the PTE update to highmem will be delayed.  The
best workaround is to issue an explicit flush in kmap_atomic_functions
case; this is the only way nested PTE updates can happen in the interrupt
handler.

Thanks to Jeremy Fitzhardinge for noting the bug and suggestions on a fix.

This patch gets reverted again when we start 2.6.22 and the bug gets fixed
differently.

Signed-off-by: Zachary Amsden <zach@vmware.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-08 19:47:55 -07:00
Ingo Molnar
995f054f2a [PATCH] high-res timers: resume fix
Soeren Sonnenburg reported that upon resume he is getting
this backtrace:

 [<c0119637>] smp_apic_timer_interrupt+0x57/0x90
 [<c0142d30>] retrigger_next_event+0x0/0xb0
 [<c0104d30>] apic_timer_interrupt+0x28/0x30
 [<c0142d30>] retrigger_next_event+0x0/0xb0
 [<c0140068>] __kfifo_put+0x8/0x90
 [<c0130fe5>] on_each_cpu+0x35/0x60
 [<c0143538>] clock_was_set+0x18/0x20
 [<c0135cdc>] timekeeping_resume+0x7c/0xa0
 [<c02aabe1>] __sysdev_resume+0x11/0x80
 [<c02ab0c7>] sysdev_resume+0x47/0x80
 [<c02b0b05>] device_power_up+0x5/0x10

it turns out that on resume we mistakenly re-enable interrupts too
early.  Do the timer retrigger only on the current CPU.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Soeren Sonnenburg <kernel@nn7.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-07 10:03:43 -07:00
Mike Habeck
2e0d232bff [IA64] SGI Altix : fix pcibr_dmamap_ate32() bug
On a SGI Altix TIOCP based PCI bus we need to include the ATE_PIO attribute
bit if we're mapping a 32bit MSI address.

Signed-off-by: Mike Habeck <habeck@sgi.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2007-04-06 15:38:12 -07:00
Maciej Zenczykowski
58e9491390 [PATCH] ia64: desc_empty thinko/typo fix
Just a one-byter for an ia64 thinko/typo - already fixed for i386 and x86_64.

Acked-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-04 21:12:48 -07:00
Andrew Morton
2363cc0264 [PATCH] remove protection of LANANA-reserved majors
Revert all this.  It can cause device-mapper to receive a different major from
earlier kernels and it turns out that the Amanda backup program (via GNU tar,
apparently) checks major numbers on files when performing incremental backups.

Which is a bit broken of Amanda (or tar), but this feature isn't important
enough to justify the churn.

Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-04 21:12:47 -07:00
NeilBrown
5792a2856a [PATCH] md: avoid a deadlock when removing a device from an md array via sysfs
A device can be removed from an md array via e.g.
  echo remove > /sys/block/md3/md/dev-sde/state

This will try to remove the 'dev-sde' subtree which will deadlock
since
  commit e7b0d26a86

With this patch we run the kobject_del via schedule_work so as to
avoid the deadlock.

Cc: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-04 21:12:47 -07:00
Albert Lee
6f23a31d1c libata: Limit ATAPI DMA to R/W commands only for TORiSAN DVD drives (take 3)
patch 4/4:

  Limit ATAPI DMA to R/W commands only for TORiSAN DRD-N216 DVD-ROM drives
  (http://bugzilla.kernel.org/show_bug.cgi?id=6710)

Signed-off-by: Albert Lee <albertcc@tw.ibm.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
2007-04-04 02:12:27 -04:00
Albert Lee
18d6e9d518 libata: Limit max sector to 128 for TORiSAN DVD drives (take 3)
patch 3/4:
  The TORiSAN drive locks up when max sector == 256.
  Limit max sector to 128 for the TORiSAN DRD-N216 drives.
  (http://bugzilla.kernel.org/show_bug.cgi?id=6710)

Signed-off-by: Albert Lee <albertcc@tw.ibm.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
2007-04-04 02:12:27 -04:00
Albert Lee
7152764700 libata: reorder HSM_ST_FIRST for easier decoding (take 3)
patch 1/4:
  Reorder HSM_ST_FIRST, such that the task state transition is easier decoded with human eyes.

Signed-off-by: Albert Lee <albertcc@tw.ibm.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
2007-04-04 02:12:27 -04:00
Robert Reif
d80f0a4beb [SPARC]: Add unsigned to unused bit field in a.out.h
Add unsigned to unused bit field in a.out.h to make sparse happy.

[ I took care of the sparc64 side as well -DaveM ]

Signed-off-by: Robert Reif <reif@earthlink.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-02 14:26:21 -07:00
Linus Torvalds
9a5ee4cc9e Merge branch 'for-linus' of git://one.firstfloor.org/home/andi/git/linux-2.6
* 'for-linus' of git://one.firstfloor.org/home/andi/git/linux-2.6:
  [PATCH] x86: Don't probe for DDC on VBE1.2
  [PATCH] x86-64: Increase NMI watchdog probing timeout
  [PATCH] x86-64: Let oprofile reserve MSR on all CPUs
  [PATCH] x86-64: Disable local APIC timer use on AMD systems with C1E
2007-04-02 11:41:55 -07:00
Rodolfo Giometti
0ecbc81adf [MTD] [NOR] Support for auto locking flash on power up
Auto unlock sectors on resume for auto locking flash on power up.

Signed-off-by: Rodolfo Giometti <giometti@enneenne.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2007-04-02 14:12:23 -04:00
Rafael J. Wysocki
1d64b9cb1d [PATCH] Fix microcode-related suspend problem
Fix the regression resulting from the recent change of suspend code
ordering that causes systems based on Intel x86 CPUs using the microcode
driver to hang during the resume.

The problem occurs since the microcode driver uses request_firmware() in
its CPU hotplug notifier, which is called after tasks has been frozen and
hangs.  It can be fixed by telling the microcode driver to use the
microcode stored in memory during the resume instead of trying to load it
from disk.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Adrian Bunk <bunk@stusta.de>
Cc: Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Maxim <maximlevitsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-02 10:06:09 -07:00
Kay Sievers
0c84ce268b [PATCH] driver core: fix built-in drivers sysfs links
built-in drivers had broken sysfs links that caused bootup hangs for
certain driver unregistry sequences.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-02 10:06:09 -07:00
Paolo 'Blaisorblade' Giarrusso
10fa1155a2 [PATCH] uml: fix unreasonably long udelay
Currently we have a confused udelay implementation.

* __const_udelay does not accept usecs but xloops in i386 and x86_64
* our implementation requires usecs as arg
* it gets a xloops count when called by asm/arch/delay.h

Bugs related to this (extremely long shutdown times) where reported by some
x86_64 users, especially using Device Mapper.

To hit this bug, a compile-time constant time parameter must be passed -
that's why UML seems to work most times.  Fix this with a simple udelay
implementation.

Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Acked-by: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-02 10:06:08 -07:00
Andi Kleen
3556ddfa92 [PATCH] x86-64: Disable local APIC timer use on AMD systems with C1E
AMD dual core laptops with C1E do not run the APIC timer correctly
when they go idle. Previously the code assumed this only happened
on C2 or deeper.  But not all of these systems report support C2.

Use a AMD supplied snippet to detect C1E being enabled and then disable
local apic timer use.

This supercedes an earlier workaround using DMI detection of specific systems.

Thanks to Mark Langsdorf for the detection snippet.

Signed-off-by: Andi Kleen <ak@suse.de>
2007-04-02 12:14:12 +02:00
Linus Torvalds
2e175a9004 Merge master.kernel.org:/home/rmk/linux-2.6-arm
* master.kernel.org:/home/rmk/linux-2.6-arm:
  [ARM] 4298/1: fix memory barriers for DMA coherent and SMP platforms
  [ARM] 4295/2: Fix error-handling in pxaficp_ir.c (version 2)
  [ARM] Fix __NR_kexec_load
  [ARM] Export dma_channel_active()
  [ARM] 4296/1: ixp4xx: compile fix
  [ARM] 4289/1: AT91: SAM9260 NAND flash timing
2007-04-01 14:43:57 -07:00
Lennert Buytenhek
398e692fd5 [ARM] 4298/1: fix memory barriers for DMA coherent and SMP platforms
This patch:
- Switches mb/rmb/wmb back to being full-blown DMBs on ARM SMP systems,
  since mb/rmb/wmb are required to order Normal memory accesses as well.
- Enables the use of DMB and ISB on XSC3 (which is an ARMv5TE ISA core
  but conforms to the ARMv6 memory ordering model and supports the
  various ARMv6 barriers.)
- Makes DMA coherent platforms (only ixp23xx at the moment) map
  mb/rmb/wmb to dmb(), as on DMA coherent platforms, DMA consistent
  mappings are done as Normal mappings, which are weakly ordered.

Signed-off-by: Lennert Buytenhek <buytenh@wantstofly.org>
Acked-by: David Howells <dhowells@redhat.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2007-04-01 22:38:36 +01:00
Russell King
6c330ba72c [ARM] Fix __NR_kexec_load
It's __NR_kexec_load, not __NR_sys_kexec_load

Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2007-04-01 22:35:01 +01:00
Paolo 'Blaisorblade' Giarrusso
c35e584c08 [PATCH] uml: fix static linking for real
There was a typo in commit 7632fc8f80,
preventing it from working - 32bit binaries crashed hopelessly before
the below fix and work perfectly now.

Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-30 19:21:30 -07:00
Vladimir Barinov
6b8777b468 [ARM] 4296/1: ixp4xx: compile fix
Fix compilation fail for ixp4xx platforms for the case when CONFIG_IXP4XX_INDIRECT_PCI is set. That is due to the check_signature() is appeared in include/linux/io.h.

Signed-off-by: Vladimir Barinov <vbarinov@ru.mvista.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2007-03-30 16:56:05 +01:00
Ralf Baechle
8a1e97ee2e [MIPS] SMTC: Fix recursion in instant IPI replay code.
local_irq_restore -> raw_local_irq_restore -> irq_restore_epilog ->
	smtc_ipi_replay -> smtc_ipi_dq -> spin_unlock_irqrestore ->
	_spin_unlock_irqrestore -> local_irq_restore

The recursion does abort when there is no more IPI queued for a CPU, so
this isn't usually fatal which is why we got away with this for so long
until this was discovered by code inspection.

Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2007-03-29 23:46:36 +01:00
Ralf Baechle
eb541cb240 [MIPS] MV64340: Add missing prototype for mv64340_irq_init().
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2007-03-29 23:46:35 +01:00
Linus Torvalds
9415fddd99 Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6:
  [IFB]: Fix crash on input device removal
  [BNX2]: Fix link interrupt problem.
2007-03-29 13:15:13 -07:00
Timur Tabi
297640e89e [POWERPC] qe: ucc_slow.guemr is in the wrong place
The definition of struct ucc_slow puts the guemr register immediately after the
utpt register, when it should be at offset 0x90.  This patch adds the missing
0x52-byte padding.

Signed-off-by: Timur Tabi <timur@freescale.com>
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
2007-03-29 14:33:56 -05:00
Patrick McHardy
c01003c205 [IFB]: Fix crash on input device removal
The input_device pointer is not refcounted, which means the device may
disappear while packets are queued, causing a crash when ifb passes packets
with a stale skb->dev pointer to netif_rx().

Fix by storing the interface index instead and do a lookup where neccessary.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Acked-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-29 11:46:52 -07:00
Miklos Szeredi
602ed87ecd [PATCH] uml: fix pte bit collision
_PAGE_PROTNONE conflicts with the lowest bit of pgoff.  This causes all sorts
of weirdness when nonlinear mappings are used.

Took me a good half day to track this down.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jeff Dike <jdike@linux.intel.com>
Acked-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-29 08:22:24 -07:00
Yinghai Lu
c97beb4710 [PATCH] x86_64 irq: Fix comments after changing IRQ0_VECTOR from 0x20 to 0x30
Signed-off-by: Yinghai Lu <yinghai.lu@amd.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-29 08:16:23 -07:00
Linus Torvalds
d0a9af8091 Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/sparc-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/davem/sparc-2.6:
  [VIDEO]: Fix section mismatch in cg3.c
  [SPARC]: sparc64 gcc-4.2.0 20070317 -Werror failure
  [VIDEO] ffb: Fix two DAC handling bugs.
  [SPARC32]: Fix SMP build regression
  [DRM]: Delete sparc64 FFB driver code that never gets built.
2007-03-28 14:01:21 -07:00
Linus Torvalds
6faee84b00 Merge master.kernel.org:/pub/scm/linux/kernel/git/lethal/sh-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/lethal/sh-2.6:
  sh: Trivial fix for hp6xx build.
  sh: Fixup __cmpxchg() compile breakage with gcc4.
  sh: Kill bogus GCC4 symbol exports.
2007-03-28 13:53:40 -07:00
Kristoffer Ericson
b863f46e6a sh: Trivial fix for hp6xx build.
The IRQ3 define was removed when asm-sh/irq.h was cleaned up,
this updates the hp6xx header to use the IRQ number directly.

Signed-off-by: Kristoffer Ericson <kristoffer_e1@hotmail.com>
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2007-03-28 19:45:59 +09:00
Paul Mundt
310f7963c2 sh: Fixup __cmpxchg() compile breakage with gcc4.
As reported by Manuel:

When I build linux with GCC-4.x and enable
CONFIG_CC_OPTIMIZE_FOR_SIZE linking fails with this error:

  LD      .tmp_vmlinux1
  kernel/built-in.o: In function '__cmpxchg_called_with_bad_pointer'
  make[1]: *** [.tmp_vmlinux1] Error 1
  make: *** [_all] Error 2

This ended up being an inlining problem, fixed by explicitly
including linux/compiler.h and grabbing the definitions from there.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2007-03-28 17:26:19 +09:00
Jeff Garzik
a9c87a10db Merge branch 'upstream-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6 into upstream-fixes 2007-03-28 02:21:18 -04:00
Jean Tourrilhes
c2805fbb86 [PATCH] WE-22 : prevent information leak on 64 bit
Johannes Berg discovered that kernel space was leaking to
userspace on 64 bit platform. He made a first patch to fix that. This
is an improved version of his patch.

Signed-off-by: Jean Tourrilhes <jt@hpl.hp.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2007-03-27 14:10:26 -04:00
Linus Torvalds
d0d87aae79 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/v4l-dvb
* git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/v4l-dvb:
  V4L/DVB (5472): Isl6421: don't reference freed memory
  V4L/DVB (5441): Saa7146: Fix allocation of clipping memory
  V4L/DVB (5421): Fix suspend/resume in msp3400 and tuner
  V4L/DVB (5415): Msp_attach must return 0 if no msp3400 was found.
  V4L/DVB (5408): Fix SECAM handling on saa7115
  V4L/DVB (5400): Core: fix several locking related problems
  V4L/DVB (5390): Radio: Fix error in Kbuild file
  V4L/DVB (5332): Ir_rc5_timer_end decoder lockup fix
2007-03-27 09:23:59 -07:00
Linus Torvalds
e0ab0bb6d2 Merge branch 'for-linus' of git://git.kernel.dk/data/git/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/data/git/linux-2.6-block:
  Export __splice_from_pipe()
  2/2 splice: dont readpage
  1/2 splice: dont steal
  make elv_register() output atomic
  block: blk_max_pfn is somtimes wrong
2007-03-27 09:05:49 -07:00
Serge E. Hallyn
a28d193cbf [PATCH] ipcns: fix !CONFIG_IPC_NS behavior
When CONFIG_IPC_NS=n, clone(CLONE_NEWIPC) claims success, but did not actually
clone a new IPC namespace.

Fix this to return -EINVAL so the caller knows his request was denied.

Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-27 09:05:16 -07:00
Jeff Dike
7632fc8f80 [PATCH] uml: fix static linking
During a static link, ld has started putting a .note section in the
.uml.setup.init section.  This has the result that the UML setups begin
with 32 bytes of garbage and UML crashes immediately on boot.

This patch creates a specific .note section for ld to drop this stuff
into.

Signed-off-by: Jeff Dike <jdike@linux.intel.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-27 09:05:15 -07:00
Serge E. Hallyn
78d832f626 [PATCH] utsns: fix !CONFIG_UTS_NS behavior
When CONFIG_UTS_NS=n, clone(CLONE_NEWUTS) quietly refuses.  So correctly does
not unshare a new uts namespace, but also does not return -EINVAL.

Fix this to return -EINVAL so the caller knows his request was denied.

Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-27 09:05:15 -07:00
Alan Cox
62b3e920ed [PATCH] tty: minor merge correction
Its now used.. because we added the new definitions so enabled all the
goodies on i386

Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-27 09:05:15 -07:00
Jeff Dike
d75e26a829 [PATCH] uml: fix epoll
UML/x86_64 needs the same packing of struct epoll_event as x86_64.

Signed-off-by: Jeff Dike <jdike@linux.intel.com>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-27 09:05:15 -07:00
Oliver Endriss
7a7cd19209 V4L/DVB (5441): Saa7146: Fix allocation of clipping memory
Olaf Hering pointed out that SAA7146_CLIPPING_MEM would become
very large for PAGE_SIZE > 4K.
In fact, the number of clipping windows is limited to 16,
and calculate_clipping_registers_rect() does not use more
than 256 bytes. SAA7146_CLIPPING_MEM adjusted accordingly.

Thanks-to: Olaf Hering <olaf@aepfle.de>
Acked-by: Michael Hunold <hunold@linuxtv.org>
Signed-off-by: Oliver Endriss <o.endriss@gmx.de>
Signed-off-by: Mauro Carvalho Chehab <mchehab@infradead.org>
2007-03-27 08:45:56 -03:00
Mikael Pettersson
7945d5626c [SPARC]: sparc64 gcc-4.2.0 20070317 -Werror failure
Compiling 2.6.21-rc5 with gcc-4.2.0 20070317 (prerelease)
for sparc64 fails as follows:

  gcc -Wp,-MD,arch/sparc64/kernel/.time.o.d  -nostdinc -isystem /home/mikpe/pkgs/linux-sparc64/gcc-4.2.0/lib/gcc/sparc64-unknown-linux-gnu/4.2.0/include -D__KERNEL__ -Iinclude  -include include/linux/autoconf.h -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -Os -m64 -pipe -mno-fpu -mcpu=ultrasparc -mcmodel=medlow -ffixed-g4 -ffixed-g5 -fcall-used-g7 -Wno-sign-compare -Wa,--undeclared-regs -fomit-frame-pointer  -fno-stack-protector -Wdeclaration-after-statement -Wno-pointer-sign -Werror   -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(time)"  -D"KBUILD_MODNAME=KBUILD_STR(time)" -c -o arch/sparc64/kernel/time.o arch/sparc64/kernel/time.c
cc1: warnings being treated as errors
arch/sparc64/kernel/time.c: In function 'kick_start_clock':
arch/sparc64/kernel/time.c:559: warning: overflow in implicit constant conversion
make[1]: *** [arch/sparc64/kernel/time.o] Error 1
make: *** [arch/sparc64/kernel] Error 2

gcc gets unhappy when the MSTK_SET macro's u8 __val variable
is updated with &= ~0xff (MSTK_YEAR_MASK). Making the constant
unsigned fixes the problem.

[ I fixed up the sparc32 side as well -DaveM ]

Signed-off-by: Mikael Pettersson <mikpe@it.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-27 01:13:55 -07:00
Mark Fasheh
40bee44eae Export __splice_from_pipe()
Ocfs2 wants to implement it's own splice write actor so that it can better
manage cluster / page locks. This lets us re-use the rest of splice write
while only providing our own code where it's actually important.

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-03-27 08:55:47 +02:00
Linus Torvalds
703071b5b9 Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6:
  [SUNGEM]: Fix MAC address setting when interface is up.
  [IPV4] fib_trie: Document locking.
  [NET]: Correct accept(2) recovery after sock_attach_fd()
  [PPP]: Don't leak an sk_buff on interface destruction.
  [NET_SCHED]: Fix ingress locking
  [NET_SCHED]: cls_basic: fix NULL pointer dereference
  [DCCP]: make dccp_write_xmit_timer() static again
  [TG3]: Update version and reldate.
  [TG3]: Exit irq handler during chip reset.
  [TG3]: Eliminate the unused TG3_FLAG_SPLIT_MODE flag.
  [IPV6]: Fix routing round-robin locking.
  [DECNet] fib: Fix out of bound access of dn_fib_props[]
  [IPv4] fib: Fix out of bound access of fib_props[]
  [NET] AX.25 Kconfig and docs updates and fixes
  [NET]: Fix neighbour destructor handling.
  [NET]: Fix fib_rules compatibility breakage
  [SCTP]: Update SCTP Maintainers entry
  [NET]: remove unused header file: drivers/net/wan/lmc/lmc_media.h
2007-03-26 14:51:24 -07:00
Linus Torvalds
6288c33866 Merge branch 'for-linus' of git://git390.osdl.marist.edu/pub/scm/linux-2.6
* 'for-linus' of git://git390.osdl.marist.edu/pub/scm/linux-2.6:
  [S390] zcrypt: Fix ap_poll_requests counter in lost requests error path.
  [S390] zcrypt: Fix possible dead lock in AP bus module.
  [S390] cio: Device status validity.
  [S390] kprobes: Align probe address.
  [S390] Fix TCP/UDP pseudo header checksum computation.
  [S390] dasd: Work around gcc bug.
2007-03-26 14:45:56 -07:00
Russ Cox
04a3952330 [PATCH] Add const to pointer qualifiers for __chk_user_ptr and __chk_io_ptr.
Change prototypes for __chk_user_ptr and __chk_io_ptr to take const
void* instead of void*, so that code can pass "const void *" to them.

(Right now sparse does not warn about passing const void* to void*
functions, but that is a separate bug that I believe Josh is working on,
and once sparse does check this, the changed prototypes will be
necessary.)

Signed-off-by: Russ Cox <rsc@swtch.com>
Signed-off-by: Josh Triplett <josh@freedesktop.org>
Acked-by: Christopher Li <sparse@chrisli.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-26 14:23:52 -07:00
Suleiman Souhlal
513daadd15 ide: use correct IDE error recovery
IDE error recovery is using IDLE IMMEDIATE if the drive is busy or has DRQ set.
This violates the ATA spec (can only send IDLE IMMEDIATE when drive is not
busy) and really hoses up some drives (modern drives will not be able to
recover using this error handling).  The correct thing to do is issue a SRST
followed by a SET FEATURES command.  This is what Western Digital recommends
for error recovery and what Western Digital says Windows does.  It also does
not violate the ATA spec as far as I can tell.

Bart:
* port the patch over the current tree
* undo the recalibration code removal
* send SET FEATURES command after checking for good drive status
* don't check whether the current request is of REQ_TYPE_ATA_{CMD,TASK}
  type because we need to send SET FEATURES before handling any requests
* some pre-ATA4 drives require INITIALIZE DEVICE PARAMETERS command before
  other commands (except IDENTIFY) so send SET FEATURES only if there are
  no pending drive->special requests
* update comments and patch description
* any bugs introduced by this patch are mine and not Suleiman's :-)

Signed-off-by: Suleiman Souhlal <suleiman@google.com>
Acked-by: Alan Cox <alan@redhat.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2007-03-26 23:03:20 +02:00
Heiko Carstens
afbc1e994d [S390] Fix TCP/UDP pseudo header checksum computation.
git commit f994aae1bd changed the
function declaration of csum_tcpudp_nofold. Argument types were
changed from unsigned long to __be32 (unsigned int). Therefore we
lost the implicit type conversion that zeroed the upper half of the
registers that are used to pass parameters. Since the inline assembly
relied on this we ended up adding random values and wrong checksums
were created.
Showed only up on machines with more than 4GB since gcc produced code
where the registers that are used to pass 'saddr' and 'daddr' previously
contained addresses before calling this function.
Fix this by using 32 bit arithmetics and convert code to C, since gcc
produces better code than these hand-optimized versions.

Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2007-03-26 20:43:46 +02:00
David S. Miller
f11e6659ce [IPV6]: Fix routing round-robin locking.
As per RFC2461, section 6.3.6, item #2, when no routers on the
matching list are known to be reachable or probably reachable we
do round robin on those available routes so that we make sure
to probe as many of them as possible to detect when one becomes
reachable faster.

Each routing table has a rwlock protecting the tree and the linked
list of routes at each leaf.  The round robin code executes during
lookup and thus with the rwlock taken as a reader.  A small local
spinlock tries to provide protection but this does not work at all
for two reasons:

1) The round-robin list manipulation, as coded, goes like this (with
   read lock held):

	walk routes finding head and tail

	spin_lock();
	rotate list using head and tail
	spin_unlock();

   While one thread is rotating the list, another thread can
   end up with stale values of head and tail and then proceed
   to corrupt the list when it gets the lock.  This ends up causing
   the OOPS in fib6_add() later onthat many people have been hitting.

2) All the other code paths that run with the rwlock held as
   a reader do not expect the list to change on them, they
   expect it to remain completely fixed while they hold the
   lock in that way.

So, simply stated, it is impossible to implement this correctly using
a manipulation of the list without violating the rwlock locking
semantics.

Reimplement using a per-fib6_node round-robin pointer.  This way we
don't need to manipulate the list at all, and since the round-robin
pointer can only ever point to real existing entries we don't need
to perform any locking on the changing of the round-robin pointer
itself.  We only need to reset the round-robin pointer to NULL when
the entry it is pointing to is removed.

The idea is from Thomas Graf and it is very similar to how this
was implemented before the advanced router selection code when in.

Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-25 18:48:05 -07:00
Alexey Kuznetsov
ecbb416939 [NET]: Fix neighbour destructor handling.
->neigh_destructor() is killed (not used), replaced with
->neigh_cleanup(), which is called when neighbor entry goes to dead
state. At this point everything is still valid: neigh->dev,
neigh->parms etc.

The device should guarantee that dead neighbor entries (neigh->dead !=
0) do not get private part initialized, otherwise nobody will cleanup
it.

I think this is enough for ipoib which is the only user of this thing.
Initialization private part of neighbor entries happens in ipib
start_xmit routine, which is not reached when device is down.  But it
would be better to add explicit test for neigh->dead in any case.

Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-25 18:48:01 -07:00
Thomas Graf
e1701c68c1 [NET]: Fix fib_rules compatibility breakage
Based upon a patch from Patrick McHardy.

The fib_rules netlink attribute policy introduced in 2.6.19 broke
userspace compatibilty. When specifying a rule with "from all"
or "to all", iproute adds a zero byte long netlink attribute,
but the policy requires all addresses to have a size equal to
sizeof(struct in_addr)/sizeof(struct in6_addr), resulting in a
validation error.

Check attribute length of FRA_SRC/FRA_DST in the generic framework
by letting the family specific rules implementation provide the
length of an address. Report an error if address length is non
zero but no address attribute is provided. Fix actual bug by
checking address length for non-zero instead of relying on
availability of attribute.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-25 18:48:00 -07:00
Linus Torvalds
317ec6cd00 Merge master.kernel.org:/home/rmk/linux-2.6-arm
* master.kernel.org:/home/rmk/linux-2.6-arm:
  [ARM] 4278/1: configure pxa27x I2C SCL as "input"
  [ARM] 4272/1: Missing symbol h1940_pm_return fix
  [ARM] 4235/1: ns9xxx: declare the clock functions as "const"
  [ARM] 4271/1: iop32x: fix ep80219 detection (support iq80219 platforms)
  [ARM] 4270/2: mach-s3c2443/irq.c off by one error in dma irqs
2007-03-24 17:01:45 -07:00
Guennadi Liakhovetski
53698d2537 [ARM] 4278/1: configure pxa27x I2C SCL as "input"
It has been reported by Julian Deng that configuring the pxa27x i2c SCL line as output generates a short negative pulse on it during the call to pxa_gpio_mode(GPIO117_I2CSCL_MD); as it first switches it to output and then configures it for the alternate function. The SCL line is in fact bidirectional and can also be configured as 117 | GPIO_ALT_FN_1_IN, in which case the pulse is not generated. This is exactly what this patch does.

Author: Julian Deng <dengtj@sitek.cn>

Signed-off-by: G. Liakhovetski <gl@dsa-ac.de>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2007-03-24 23:24:39 +00:00
Ralf Baechle
8fb303c7f1 [MIPS] SB1250: Fix bugs/warnings by creative use of volatile.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2007-03-24 17:01:50 +00:00
Ralf Baechle
ce486cd810 [MIPS] ARC: Fix warning.
The missing cast did result a warning when calling an 32-bit ARC firmware
function that takes 5 arguments where the 5th argument is a pointer from a
64-bit kernel.

Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2007-03-24 17:01:49 +00:00
Ralf Baechle
7575a49f20 [MIPS] Implement flush_anon_page().
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2007-03-24 17:01:49 +00:00
Ralf Baechle
7605b39061 [MIPS] Fix pipeline hazard.
In the the sequence:
        ei
        ..
        mfc0    $x, $status

the mfc0 may not see the SR_IE bit set. This was a deliberate bug in the
kernel code because we knew this was a safe thing to do on all R2 silicon
so far but new silicon is changing this.

Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2007-03-24 17:01:49 +00:00
Deepak Saxena
83598f1cb0 [MIPS] Make MIPS udelay() preempt safe under DEBUG_PREEMPT
Signed-off-by: Manish Lachwani <mlachwani@mvista.com>
Signed-off-by: Deepak Saxena <dsaxena@mvista.com>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2007-03-24 17:01:49 +00:00
Franck Bui-Huu
c9d0696223 [MIPS] Always use virt_to_phys() when translating kernel addresses
This patch fixes two places where we used plain 'x - PAGE_OFFSET' to
achieve virtual to physical address convertions. This type of convertion
is no more allowed since commit 6f284a2ce7.

Reported-by: Maxime Bizon <mbizon@freebox.fr>
Signed-off-by: Franck Bui-Huu <fbuihuu@gmail.com>

[Build fixes for machines that don't use the generic dma-coherence.h]

Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2007-03-24 17:01:49 +00:00
Roland McGrath
6ea65ff79c [PATCH] i386: clear segment register padding in core dumps
The segment register slots in struct pt_regs are padded to 32 bits.
Some of these are stored with instructions like "pushl %es", which
leaves the high 16 bits as they were.  So the high bits of these
fields in struct pt_regs contain kernel stack garbage.  These bits are
ignored by everything and never leak to user space, except in core
dumps.  The user struct pt_regs is always at the base of the thread's
kernel stack and so it seems unlikely the information that leaks from
here is ever worthwhile so as to be a security concern, but I'm not
sure about that.  It has been this way for ages; userland consumers of
core dumps all mask off these high bits themselves.  So it is not urgent.

This change masks off the padding bits of the segment register slots
in core dumps.  ptrace already masks off these high bits, so this
makes the values in core dumps consistent with what ptrace would
report just before the process died.

As I read the processor manuals, the cs and ss values will always be
padded with zero bits rather than stack garbage.  But unlike "pushl %es",
this is not simple to test with a userland program.  So I added the two
instructions rather than wonder if they are really never necessary.

I think that x86_64 does not have this problem (for either 32-bit or
64-bit processes).  It only uses "mov" instructions from segment
registers, which zero-extend.

Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-23 15:32:58 -07:00
Linus Torvalds
2e7c28382b x86-64: add "local_apic_timer_c2_ok" here too
Needed for any architecture that claims ARCH_APICTIMER_STOPS_ON_C3,
not just i386.

I'm hoping Thomas will clean this up a bit later..

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-23 11:32:31 -07:00
Thomas Gleixner
e585bef815 [PATCH] i386: add command line option "local_apic_timer_c2_ok"
It turned out that it is almost impossible to trust ACPI, BIOS & Co.
regarding the C states. This was the reason to switch the local apic
timer off in C2 state already. OTOH there are sane and well behaving
systems, which get punished by that decision.

Allow the user to confirm that the local apic timer is trustworthy in C2
state. This keeps the default behaviour on the safe side.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-23 10:21:02 -07:00
Robert P. J. Day
118af321b2 [MTD] Delete unused header file linux/mtd/iflash.h.
Delete the unreferenced header file include/linux/mtd/iflash.h.

Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2007-03-23 15:34:05 +00:00
Linus Torvalds
37c70d0d09 Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6:
  ACPI: IA64: fix %ll build warnings
  ACPI: IA64: fix allnoconfig build
  ACPI: Only use IPI on known broken machines (AMD, Dothan/BaniasPentium M)
  ACPI: ibm-acpi: allow module to load when acpi notifiers can't be set (v2)
  ACPI: parse 2nd MADT by default
  ACPICA: revert "acpi_serialize" changes
  sony-laptop: MAINTAINERS fix entry, add L: and W:
  ACPI: resolve HP nx6125 S3 immediate wakeup regression
  ACPI: Add support to parse 2nd MADT
2007-03-22 19:43:02 -07:00
Linus Torvalds
7f52a3afc4 Merge branch 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc:
  [POWERPC] Bypass hcall stats until cpu features have run
  [POWERPC] Avoid hypervisor statistics calculation in real mode
  [POWERPC] Fix atomicity of TIF update in flush_thread()
2007-03-22 19:42:42 -07:00
Jarek Poplawski
e3a55fd18d [PATCH] lockdep: lockdep_depth vs. debug_locks
lockdep found a bug during a run of workqueue function - this could be also
caused by a bug from other code running simultaneously.

lockdep really shouldn't be used when debug_locks == 0!

Reported-by: Folkert van Heusden <folkert@vanheusden.com>
Inspired-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-22 19:39:06 -07:00
David Howells
8da38d7bac [PATCH] FRV: fix unannotated variable declarations
Fix unannotated variable declarations.  Variables that have allocation
section annotations (such as __meminitdata) on their definitions must also
have them on their declarations as not doing so may affect the addressing
mode used by the compiler and may result in a linker error.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-22 19:39:05 -07:00
Linus Torvalds
d6e8823e7b Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6:
  [NETFILTER]: nat: avoid rerouting packets if only XFRM policy key changed
  [NETFILTER]: nf_conntrack_netlink: add missing dependency on NF_NAT
  [NET]: fix up misplaced inlines.
  [SCTP]: Correctly reset ssthresh when restarting association
  [BRIDGE]: Fix fdb RCU race
  [NET]: Fix fib_rules dump race
  [XFRM]: ipsecv6 needs a space when printing audit record.
  [X25] x25_forward_call(): fix NULL dereferences
  [SCTP]: Reset some transport and association variables on restart
  [SCTP]: Increment error counters on user requested HBs.
  [SCTP]: Clean up stale data during association restart
  [IrDA]: Calling ppp_unregister_channel() from process context
  [IrDA]: irttp_dup spin_lock initialisation
  [IrDA]: Delay needed when uploading firmware chunks
2007-03-22 19:34:09 -07:00
Mohan Kumar M
b4aea36b79 [POWERPC] Avoid hypervisor statistics calculation in real mode
kexec invokes plpar_hcall hypervisor call in real mode.  plpar_hcall
refers to per cpu variables for accounting hypervisor statistics.
These variables may not be in the RMO region, so accesses to them
in real mode may result in a data storage exception.

This fixes this problem by using a new plpar_hcall_raw function which
does not update the hypervisor call statistics.  Thanks to Anton for
suggesting this idea.

Signed-off-by: Mohan Kumar M <mohan@in.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2007-03-22 15:01:43 +11:00
Linus Torvalds
8559840c4c Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6:
  [IA64] Fix wrong /proc/iomem on SGI Altix
  [IA64] Altix: ioremap vga_console_iobase
  [IA64] Fix typo/thinko in crash.c
  [IA64] Fix get_model_name() for mixed cpu type systems
  [IA64] min_low_pfn and max_low_pfn calculation fix
2007-03-21 19:45:50 -07:00
Zou Nan hai
a3f5c338b9 [IA64] min_low_pfn and max_low_pfn calculation fix
We have seen bad_pte_print when testing crashdump on an SN machine in
recent 2.6.20 kernel.  There are tons of bad pte print (pfn < max_low_pfn)
reports when the crash kernel boots up, all those reported bad pages
are inside initmem range; That is because if the crash kernel code and
data happens to be at the beginning of the 1st node. build_node_maps in
discontig.c will bypass reserved regions with filter_rsvd_memory. Since
min_low_pfn is calculated in build_node_map, so in this case, min_low_pfn
will be greater than kernel code and data.

Because pages inside initmem are freed and reused later, we saw
pfn_valid check fail on those pages.

I think this theoretically happen on a normal kernel. When I check
min_low_pfn and max_low_pfn calculation in contig.c and discontig.c.
I found more issues than this.

1. min_low_pfn and max_low_pfn calculation is inconsistent between
contig.c and discontig.c,
min_low_pfn is calculated as the first page number of boot memmap in
contig.c (Why? Though this may work at the most of the time, I don't
think it is the right logic). It is calculated as the lowest physical
memory page number bypass reserved regions in discontig.c.
max_low_pfn is calculated include reserved regions in contig.c. It is
calculated exclude reserved regions in discontig.c.

2. If kernel code and data region is happen to be at the begin or the
end of physical memory, when min_low_pfn and max_low_pfn calculation is
bypassed kernel code and data, pages in initmem will report bad.

3. initrd is also in reserved regions, if it is at the begin or at the
end of physical memory, kernel will refuse to reuse the memory. Because
the virt_addr_valid check in free_initrd_mem.

So it is better to fix and clean up those issues.
Calculate min_low_pfn and max_low_pfn in a consistent way.

Signed-off-by:	Zou Nan hai <nanhai.zou@intel.com>
Acked-by: Jay Lan <jlan@sgi.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2007-03-20 13:41:57 -07:00
Uwe Kleine-König
9d5cf5adcb [ARM] 4235/1: ns9xxx: declare the clock functions as "const"
This patch removes some "const"s that I introduced thinking they mean
the same thing as the "const"s introduced here.  So it fixes three warnings.

Signed-off-by: Uwe Kleine-König <ukleinek@informatik.uni-freiburg.de>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2007-03-20 17:22:23 +00:00
Len Brown
f5ea908c8f Pull bugzilla-8171 into release branch 2007-03-20 11:06:00 -04:00
Len Brown
54b8c39fbd Pull misc-for-upstream into release branch 2007-03-20 11:05:41 -04:00
Vlad Yasevich
749bf9215e [SCTP]: Reset some transport and association variables on restart
If the association has been restarted, we need to reset the
transport congestion variables as well as accumulated error
counts and CACC variables.  If we do not, the association
will use the wrong values and may terminate prematurely.

This was found with a scenario where the peer restarted
the association when lksctp was in the last HB timeout for
its association.  The restart happened, but the error counts
have not been reset and when the timeout occurred, a newly
restarted association was terminated due to excessive
retransmits.

Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-20 00:09:45 -07:00
Vlad Yasevich
0b58a81146 [SCTP]: Clean up stale data during association restart
During association restart we may have stale data sitting
on the ULP queue waiting for ordering or reassembly.  This
data may cause severe problems if not cleaned up.  In particular
stale data pending ordering may cause problems with receive
window exhaustion if our peer has decided to restart the
association.

Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-20 00:09:43 -07:00