commit 8ec1e0ebe26087bfc5c0394ada5feb5758014fc8
Author: Patrick McHardy <kaber@trash.net>
Date: Thu Dec 3 12:16:35 2009 +0100
ipv4: add sysctl to accept packets with local source addresses
Change fib_validate_source() to accept packets with a local source address when
the "accept_local" sysctl is set for the incoming inet device. Combined with the
previous patches, this allows to communicate between multiple local interfaces
over the wire.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit d124356ce314fff22a047ea334379d5105b2d834
Author: Patrick McHardy <kaber@trash.net>
Date: Thu Dec 3 12:16:35 2009 +0100
net: fib_rules: allow to delete local rule
Allow to delete the local rule and recreate it with a higher priority. This
can be used to force packets with a local destination out on the wire instead
of routing them to loopback. Additionally this patch allows to recreate rules
with a priority of 0.
Combined with the previous patch to allow oif classification, a socket can
be bound to the desired interface and packets routed to the wire like this:
# move local rule to lower priority
ip rule add pref 1000 lookup local
ip rule del pref 0
# route packets of sockets bound to eth0 to the wire independant
# of the destination address
ip rule add pref 100 oif eth0 lookup 100
ip route add default dev eth0 table 100
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Otherwise:
ERROR: "sysctl_tcp_cookie_size" [net/ipv6/ipv6.ko] undefined!
make[1]: *** [__modpost] Error 1
Signed-off-by: David S. Miller <davem@davemloft.net>
net/ipv4/tcp_output.c: In function ‘tcp_make_synack’:
net/ipv4/tcp_output.c:2488: warning: cast from pointer to integer of different size
Signed-off-by: David S. Miller <davem@davemloft.net>
Parse incoming TCP_COOKIE option(s).
Calculate <SYN,ACK> TCP_COOKIE option.
Send optional <SYN,ACK> data.
This is a significantly revised implementation of an earlier (year-old)
patch that no longer applies cleanly, with permission of the original
author (Adam Langley):
http://thread.gmane.org/gmane.linux.network/102586
Requires:
TCPCT part 1a: add request_values parameter for sending SYNACK
TCPCT part 1b: generate Responder Cookie secret
TCPCT part 1c: sysctl_tcp_cookie_size, socket option TCP_COOKIE_TRANSACTIONS
TCPCT part 1d: define TCP cookie option, extend existing struct's
TCPCT part 1e: implement socket option TCP_COOKIE_TRANSACTIONS
TCPCT part 1f: Initiator Cookie => Responder
Signed-off-by: William.Allen.Simpson@gmail.com
Signed-off-by: David S. Miller <davem@davemloft.net>
Calculate and format <SYN> TCP_COOKIE option.
This is a significantly revised implementation of an earlier (year-old)
patch that no longer applies cleanly, with permission of the original
author (Adam Langley):
http://thread.gmane.org/gmane.linux.network/102586
Requires:
TCPCT part 1c: sysctl_tcp_cookie_size, socket option TCP_COOKIE_TRANSACTIONS
TCPCT part 1d: define TCP cookie option, extend existing struct's
Signed-off-by: William.Allen.Simpson@gmail.com
Signed-off-by: David S. Miller <davem@davemloft.net>
Provide per socket control of the TCP cookie option and SYN/SYNACK data.
This is a straightforward re-implementation of an earlier (year-old)
patch that no longer applies cleanly, with permission of the original
author (Adam Langley):
http://thread.gmane.org/gmane.linux.network/102586
The principle difference is using a TCP option to carry the cookie nonce,
instead of a user configured offset in the data.
Allocations have been rearranged to avoid requiring GFP_ATOMIC.
Requires:
net: TCP_MSS_DEFAULT, TCP_MSS_DESIRED
TCPCT part 1c: sysctl_tcp_cookie_size, socket option TCP_COOKIE_TRANSACTIONS
TCPCT part 1d: define TCP cookie option, extend existing struct's
Signed-off-by: William.Allen.Simpson@gmail.com
Signed-off-by: David S. Miller <davem@davemloft.net>
Data structures are carefully composed to require minimal additions.
For example, the struct tcp_options_received cookie_plus variable fits
between existing 16-bit and 8-bit variables, requiring no additional
space (taking alignment into consideration). There are no additions to
tcp_request_sock, and only 1 pointer in tcp_sock.
This is a significantly revised implementation of an earlier (year-old)
patch that no longer applies cleanly, with permission of the original
author (Adam Langley):
http://thread.gmane.org/gmane.linux.network/102586
The principle difference is using a TCP option to carry the cookie nonce,
instead of a user configured offset in the data. This is more flexible and
less subject to user configuration error. Such a cookie option has been
suggested for many years, and is also useful without SYN data, allowing
several related concepts to use the same extension option.
"Re: SYN floods (was: does history repeat itself?)", September 9, 1996.
http://www.merit.net/mail.archives/nanog/1996-09/msg00235.html
"Re: what a new TCP header might look like", May 12, 1998.
ftp://ftp.isi.edu/end2end/end2end-interest-1998.mail
These functions will also be used in subsequent patches that implement
additional features.
Requires:
TCPCT part 1a: add request_values parameter for sending SYNACK
TCPCT part 1b: generate Responder Cookie secret
TCPCT part 1c: sysctl_tcp_cookie_size, socket option TCP_COOKIE_TRANSACTIONS
Signed-off-by: William.Allen.Simpson@gmail.com
Signed-off-by: David S. Miller <davem@davemloft.net>
Define sysctl (tcp_cookie_size) to turn on and off the cookie option
default globally, instead of a compiled configuration option.
Define per socket option (TCP_COOKIE_TRANSACTIONS) for setting constant
data values, retrieving variable cookie values, and other facilities.
Move inline tcp_clear_options() unchanged from net/tcp.h to linux/tcp.h,
near its corresponding struct tcp_options_received (prior to changes).
This is a straightforward re-implementation of an earlier (year-old)
patch that no longer applies cleanly, with permission of the original
author (Adam Langley):
http://thread.gmane.org/gmane.linux.network/102586
These functions will also be used in subsequent patches that implement
additional features.
Requires:
net: TCP_MSS_DEFAULT, TCP_MSS_DESIRED
Signed-off-by: William.Allen.Simpson@gmail.com
Signed-off-by: David S. Miller <davem@davemloft.net>
Define (missing) hash message size for SHA1.
Define hashing size constants specific to TCP cookies.
Add new function: tcp_cookie_generator().
Maintain global secret values for tcp_cookie_generator().
This is a significantly revised implementation of earlier (15-year-old)
Photuris [RFC-2522] code for the KA9Q cooperative multitasking platform.
Linux RCU technique appears to be well-suited to this application, though
neither of the circular queue items are freed.
These functions will also be used in subsequent patches that implement
additional features.
Signed-off-by: William.Allen.Simpson@gmail.com
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add optional function parameters associated with sending SYNACK.
These parameters are not needed after sending SYNACK, and are not
used for retransmission. Avoids extending struct tcp_request_sock,
and avoids allocating kernel memory.
Also affects DCCP as it uses common struct request_sock_ops,
but this parameter is currently reserved for future use.
Signed-off-by: William.Allen.Simpson@gmail.com
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Take advantage of the new pernet automatic storage management,
and stop using compatibility network namespace functions.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Take advantage of the new pernet automatic storage management,
and stop using compatibility network namespace functions.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The motivation for an additional notifier in batched netdevice
notification (rt_do_flush) only needs to be called once per batch not
once per namespace.
For further batching improvements I need a guarantee that the
netdevices are unregistered in order allowing me to unregister an all
of the network devices in a network namespace at the same time with
the guarantee that the loopback device is really and truly
unregistered last.
Additionally it appears that we moved the route cache flush after
the final synchronize_net, which seems wrong and there was no
explanation. So I have restored the original location of the final
synchronize_net.
Cc: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a large packet gets reassembled by ip_defrag(), the head skb
accounts for all the fragments in skb->truesize. If this packet is
refragmented again, skb->truesize is not re-adjusted to reflect only
the head size since its not owned by a socket. If the head fragment
then gets recycled and reused for another received fragment, it might
exceed the defragmentation limits due to its large truesize value.
skb_recycle_check() explicitly checks for linear skbs, so any recycled
skb should reflect its true size in skb->truesize. Change ip_fragment()
to also adjust the truesize value of skbs not owned by a socket.
Reported-and-tested-by: Ben Menchaca <ben@bigfootnetworks.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
NFS can reuse its TCP socket after calling tcp_disconnect().
We noticed window scaling was not negotiated in SYN packet of next
connection request.
Fix is to clear tp->window_clamp in tcp_disconnect().
Reported-by: Krzysztof Oledzki <ole@ans.pl>
Tested-by: Krzysztof Oledzki <ole@ans.pl>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ipv4 ip_frag_reasm(), fully replace 'dev_net(dev)' with 'net', defined
previously patched into 2.6.29.
Between 2.6.28.10 and 2.6.29, net/ipv4/ip_fragment.c was patched,
changing from dev_net(dev) to container_of(...). Unfortunately the goto
section (out_fail) on oversized packets inside ip_frag_reasm() didn't
get touched up as well. Oversized IP packets cause a NULL pointer
dereference and immediate hang.
I discovered this running openvasd and my previous email on this is
titled: NULL pointer dereference at 2.6.32-rc8:net/ipv4/ip_fragment.c:566
Signed-off-by: David Ford <david@blue-labs.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Not including net/atm/
Compiled tested x86 allyesconfig only
Added a > 80 column line or two, which I ignored.
Existing checkpatch plaints willfully, cheerfully ignored.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of using the hardcoded truncation for authentication
algorithms, use the truncation length specified on xfrm_state.
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rewrite statistics accumulation to be in terms of structure fields,
not raw u32 additions. Keep them in same order, though.
This is the last user of create_proc_read_entry() in net/,
please NAK all new ones as well as all new ->write_proc, ->read_proc and
create_proc_entry() users. Cc me if there are problems. :-)
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Generated with the following semantic patch
@@
struct net *n1;
struct net *n2;
@@
- n1 == n2
+ net_eq(n1, n2)
@@
struct net *n1;
struct net *n2;
@@
- n1 != n2
+ !net_eq(n1, n2)
applied over {include,net,drivers/net}.
Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On Sun, 2009-11-22 at 16:31 -0800, David Miller wrote:
> It should be of the form:
> if (x &&
> y)
>
> or:
> if (x && y)
>
> Fix patches, rather than complaints, for existing cases where things
> do not follow this pattern are certainly welcome.
Also collapsed some multiple tabs to single space.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
That's extremely non-intuitive, noticed by William Allen Simpson.
And let's make the default be on, it's been suggested by a lot of
people so we'll give it a try.
Signed-off-by: David S. Miller <davem@davemloft.net>
Resolve the conflict between v2.6.32-rc7 where dn_def_dev_handler
gets a small bug fix and the sysctl tree where I am removing all
sysctl strategy routines.
While investigating for network latencies, I found inet_getid() was a
contention point for some workloads, as inet_peer_idlock is shared
by all inet_getid() users regardless of peers.
One way to fix this is to make ip_id_count an atomic_t instead
of __u16, and use atomic_add_return().
In order to keep sizeof(struct inet_peer) = 64 on 64bit arches
tcp_ts_stamp is also converted to __u32 instead of "unsigned long".
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stephen Hemminger a écrit :
> On Thu, 12 Nov 2009 15:11:36 +0100
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>> When handling large number of netdevices, inet_dump_ifaddr()
>> is very slow because it has O(N^2) complexity.
>>
>> Instead of scanning one single list, we can use the NETDEV_HASHENTRIES
>> sub lists of the dev_index hash table, and RCU lookups.
>>
>> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
>
> You might be able to make RCU critical section smaller by moving
> it into loop.
>
Indeed. But we dump at most one skb (<= 8192 bytes ?), so rcu_read_lock
holding time is small, unless we meet many netdevices without
addresses. I wonder if its really common...
Thanks
[PATCH net-next-2.6] ipv4: speedup inet_dump_ifaddr()
When handling large number of netdevices, inet_dump_ifaddr()
is very slow because it has O(N2) complexity.
Instead of scanning one single list, we can use the NETDEV_HASHENTRIES
sub lists of the dev_index hash table, and RCU lookups.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to use next_det_device_rcu() in RCU protected section.
We also can avoid in_dev_get()/in_dev_put() overhead (code size mainly)
in rcu_read_lock() sections.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Define two symbols needed in both kernel and user space.
Remove old (somewhat incorrect) kernel variant that wasn't used in
most cases. Default should apply to both RMSS and SMSS (RFC2581).
Replace numeric constants with defined symbols.
Stand-alone patch, originally developed for TCPCT.
Signed-off-by: William.Allen.Simpson@gmail.com
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The other error paths in front of this one have a dev_put() but this one
got missed.
Found by smatch static checker.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Acked-by: Wang Chen <ellre923@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The addition of rcv_nxt allows to discern whether the skb
was out of place or tp->copied. Also catch fancy combination
of flags if necessary (sadly we might miss the actual causer
flags as it might have already returned).
Btw, we perhaps would want to forward copied_seq in
somewhere or otherwise we might have some nice loop with
WARN stuff within but where to do that safely I don't
know at this stage until more is known (but it is not
made significantly worse by this patch).
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that sys_sysctl is a compatiblity wrapper around /proc/sys
all sysctl strategy routines, and all ctl_name and strategy
entries in the sysctl tables are unused, and can be
revmoed.
In addition neigh_sysctl_register has been modified to no longer
take a strategy argument and it's callers have been modified not
to pass one.
Cc: "David Miller" <davem@davemloft.net>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: netdev@vger.kernel.org
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
This also needs to be optimized for large number of devices.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
UDP bind() can be O(N^2) in some pathological cases.
Thanks to secondary hash tables, we can make it O(N)
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When skb_clone() fails, we should increment sk_drops and SNMP counters.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
UDP multicast rx path is a bit complex and can hold a spinlock
for a long time.
Using a small (32 or 64 entries) stack of socket pointers can help
to perform expensive operations (skb_clone(), udp_queue_rcv_skb())
outside of the lock, in most cases.
It's also a base for a future RCU conversion of multicast recption.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Lucian Adrian Grijincu <lgrijincu@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We first locate the (local port) hash chain head
If few sockets are in this chain, we proceed with previous lookup algo.
If too many sockets are listed, we take a look at the secondary
(port, address) hash chain we added in previous patch.
We choose the shortest chain and proceed with a RCU lookup on the elected chain.
But, if we chose (port, address) chain, and fail to find a socket on given address,
we must try another lookup on (port, INADDR_ANY) chain to find socket not bound
to a particular IP.
-> No extra cost for typical setups, where the first lookup will probabbly
be performed.
RCU lookups everywhere, we dont acquire spinlock.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Extends udp_table to contain a secondary hash table.
socket anchor for this second hash is free, because UDP
doesnt use skc_bind_node : We define an union to hold
both skc_bind_node & a new hlist_nulls_node udp_portaddr_node
udp_lib_get_port() inserts sockets into second hash chain
(additional cost of one atomic op)
udp_lib_unhash() deletes socket from second hash chain
(additional cost of one atomic op)
Note : No spinlock lockdep annotation is needed, because
lock for the secondary hash chain is always get after
lock for primary hash chain.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Union sk_hash with two u16 hashes for udp (no extra memory taken)
One 16 bits hash on (local port) value (the previous udp 'hash')
One 16 bits hash on (local address, local port) values, initialized
but not yet used. This second hash is using jenkin hash for better
distribution.
Because the 'port' is xored later, a partial hash is performed
on local address + net_hash_mix(net)
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adds a counter in udp_hslot to keep an accurate count
of sockets present in chain.
This will permit to upcoming UDP lookup algo to chose
the shortest chain when secondary hash is added.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no good reason to not support userspace specifying the
network namespace during device creation, and it makes it easier
to create a network device and pass it to a child network namespace
with a well known name.
We have to be careful to ensure that the target network namespace
for the new device exists through the life of the call. To keep
that logic clear I have factored out the network namespace grabbing
logic into rtnl_link_get_net.
In addtion we need to continue to pass the source network namespace
to the rtnl_link_ops.newlink method so that we can find the base
device source network namespace.
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
RFC 2003 requires the outer header to have DF set if DF is set
on the inner header, even when PMTU discovery is off for the
tunnel. Our implementation does exactly that.
For this to work properly the IPIP gateway also needs to engate
in PMTU when the inner DF bit is set. As otherwise the original
host would not be able to carry out its PMTU successfully since
part of the path is only visible to the gateway.
Unfortunately when the tunnel PMTU discovery setting is off, we
do not collect the necessary soft state, resulting in blackholes
when the original host tries to perform PMTU discovery.
This problem is not reproducible on the IPIP gateway itself as
the inner packet usually has skb->local_df set. This is not
correctly cleared (an unrelated bug) when the packet passes
through the tunnel, which allows fragmentation to occur. For
hosts behind the IPIP gateway it is readily visible with a simple
ping.
This patch fixes the problem by performing PMTU discovery for
all packets with the inner DF bit set, regardless of the PMTU
discovery setting on the tunnel itself.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The NETLINK_URELEASE notifier is only invoked for bound sockets, so
there is no need to check ->pid again.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Conflicts:
drivers/net/usb/cdc_ether.c
All CDC ethernet devices of type USB_CLASS_COMM need to use
'&mbm_info'.
Signed-off-by: David S. Miller <davem@davemloft.net>
Vitezslav Samel discovered that since 2.6.30.4+ active FTP can not work
over NAT. The "cause" of the problem was a fix of unacknowledged data
detection with NAT (commit a3a9f79e36).
However, actually, that fix uncovered a long standing bug in TCP conntrack:
when NAT was enabled, we simply updated the max of the right edge of
the segments we have seen (td_end), by the offset NAT produced with
changing IP/port in the data. However, we did not update the other parameter
(td_maxend) which is affected by the NAT offset. Thus that could drift
away from the correct value and thus resulted breaking active FTP.
The patch below fixes the issue by *not* updating the conntrack parameters
from NAT, but instead taking into account the NAT offsets in conntrack in a
consistent way. (Updating from NAT would be more harder and expensive because
it'd need to re-calculate parameters we already calculated in conntrack.)
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
When sending fragmentation expiration ICMP V4/V6 messages,
we can avoid touching device refcount, thanks to RCU
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before calling capable(CAP_NET_RAW) check if this operations is on behalf
of the kernel or on behalf of userspace. Do not do the security check if
it is on behalf of the kernel.
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The generic __sock_create function has a kern argument which allows the
security system to make decisions based on if a socket is being created by
the kernel or by userspace. This patch passes that flag to the
net_proto_family specific create function, so it can do the same thing.
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
struct can_proto had a capability field which wasn't ever used. It is
dropped entirely.
struct inet_protosw had a capability field which can be more clearly
expressed in the code by just checking if sock->type = SOCK_RAW.
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The variable 'other_way' gets initialized but is not read afterwards,
so remove it. Pass the right arguments to a pr_debug call.
While being at tidy up a bit and it fix this checkpatch warning:
WARNING: suspect code indent for conditional statements
Signed-off-by: Hannes Eder <heder@google.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Trying to parse the option of a SYN packet that we have
no route entry for should just use global wide defaults
for route entry options.
Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Tested-by: Valdis.Kletnieks@vt.edu
Signed-off-by: David S. Miller <davem@davemloft.net>
Calling IPv4 specific inet_csk_route_req in tcp_check_req
is a bad idea and crashes machine on IPv6 connections, as reported
by Valdis Kletnieks
Also, all we are really interested in is the timestamp
option in the header, so calling tcp_parse_options()
with the "estab" set to false flag is an overkill as
it tries to parse half a dozen other TCP options.
We know whether timestamp should be enabled or not
using data from request_sock.
Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Tested-by: Valdis.Kletnieks@vt.edu
Signed-off-by: David S. Miller <davem@davemloft.net>
As pointed by Stephen Rothwell, commit c6d14c84 added a warning :
net/ipv4/devinet.c: In function 'inet_select_addr':
net/ipv4/devinet.c:902: warning: label 'out' defined but not used
delete unused 'out' label and do some cleanups as well
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
nf_unregister_queue_handlers() already does a synchronize_rcu()
call, we dont need to do it again in callers.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Adds RCU management to the list of netdevices.
Convert some for_each_netdev() users to RCU version, if
it can avoid read_lock-ing dev_base_lock
Ie:
read_lock(&dev_base_loack);
for_each_netdev(net, dev)
some_action();
read_unlock(&dev_base_lock);
becomes :
rcu_read_lock();
for_each_netdev_rcu(net, dev)
some_action();
rcu_read_unlock();
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can avoid touching device refcount in icmp_send(),
using dev_get_by_index_rcu()
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use dev_get_by_index_rcu() instead of __dev_get_by_index() and
dev_base_lock rwlock
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nathan Neulinger noticed that gretap devices get their MAC address
from the local IP address, which results in invalid MAC addresses
half of the time.
This is because gretap is still using the tunnel netdev ops rather
than the correct tap netdev ops struct.
This patch also fixes changelink to not clobber the MAC address
for the gretap case.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Tested-by: Nathan Neulinger <nneul@mst.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
On UDP sockets, we must call skb_free_datagram() with socket locked,
or risk sk_forward_alloc corruption. This requirement is not respected
in SUNRPC.
Add a convenient helper, skb_free_datagram_locked() and use it in SUNRPC
Reported-by: Francis Moreau <francis.moro@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Policy routing is not looked up by mark on reverse path filtering.
This fixes it.
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
proto_ops->getname implies copying protocol specific data
into storage unit (particulary to __kernel_sockaddr_storage).
So when we implement new protocol support we should keep such
a detail in mind (which is easy to forget about).
Lets introduce DECLARE_SOCKADDR helper which check if
storage unit is not overfowed at build time.
Eventually inet_getname is switched to use DECLARE_SOCKADDR
(to show example of usage).
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
optlen is unsigned so the `< 0' test is never true.
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add and use no DSCAK bit in the features field.
Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Sigend-off-by: Ori Finkelman <ori@comsleep.com>
Sigend-off-by: Yony Amit <yony@comsleep.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add and use no window scale bit in the features field.
Note that this is not the same as setting a window scale of 0
as would happen with window limit on route.
Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Sigend-off-by: Ori Finkelman <ori@comsleep.com>
Sigend-off-by: Yony Amit <yony@comsleep.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement querying and acting upon the no timestamp bit in the feature
field.
Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Sigend-off-by: Ori Finkelman <ori@comsleep.com>
Sigend-off-by: Yony Amit <yony@comsleep.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement querying and acting upon the no sack bit in the features
field.
Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Sigend-off-by: Ori Finkelman <ori@comsleep.com>
Sigend-off-by: Yony Amit <yony@comsleep.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We need tcp_parse_options to be aware of dst_entry to
take into account per dst_entry TCP options settings
Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Sigend-off-by: Ori Finkelman <ori@comsleep.com>
Sigend-off-by: Yony Amit <yony@comsleep.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since we only use tcp_parse_options here to check for the exietence
of TCP timestamp option in the header, it is better to call with
the "established" flag on.
Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Signed-off-by: Ori Finkelman <ori@comsleep.com>
Signed-off-by: Yony Amit <yony@comsleep.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Speedup module unloading by factorizing synchronize_rcu() calls
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Augment raw_send_hdrinc to correct for incorrect ip header length values
A series of oopses was reported to me recently. Apparently when using AF_RAW
sockets to send data to peers that were reachable via ipsec encapsulation,
people could panic or BUG halt their systems.
I've tracked the problem down to user space sending an invalid ip header over an
AF_RAW socket with IP_HDRINCL set to 1.
Basically what happens is that userspace sends down an ip frame that includes
only the header (no data), but sets the ip header ihl value to a large number,
one that is larger than the total amount of data passed to the sendmsg call. In
raw_send_hdrincl, we allocate an skb based on the size of the data in the msghdr
that was passed in, but assume the data is all valid. Later during ipsec
encapsulation, xfrm4_tranport_output moves the entire frame back in the skbuff
to provide headroom for the ipsec headers. During this operation, the
skb->transport_header is repointed to a spot computed by
skb->network_header + the ip header length (ihl). Since so little data was
passed in relative to the value of ihl provided by the raw socket, we point
transport header to an unknown location, resulting in various crashes.
This fix for this is pretty straightforward, simply validate the value of of
iph->ihl when sending over a raw socket. If (iph->ihl*4U) > user data buffer
size, drop the frame and return -EINVAL. I just confirmed this fixes the
reported crashes.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Corrected a spelling error in a function name.
Signed-off-by: Andreas Petlund <apetlund@simula.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
Speedup module unloading by factorizing synchronize_rcu() calls
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Speedup module unloading by factorizing synchronize_rcu() calls
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We're adding enough nfs documentation that it may as well have its own
subdirectory.
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
GRE tunnels use one rwlock to protect their hash tables.
This locking scheme can be converted to RCU for free, since netdevice
already must wait for a RCU grace period at dismantle time.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPIP tunnels use one rwlock to protect their hash tables.
This locking scheme can be converted to RCU for free, since netdevice
already must wait for a RCU grace period at dismantle time.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit b6b39e8f3f (tcp: Try to catch MSG_PEEK bug) added a printk()
to the WARN_ON() that's in tcp.c. This patch changes this combination
to WARN(); the advantage of WARN() is that the printk message shows up
inside the message, so that kerneloops.org will collect the message.
In addition, this gets rid of an extra if() statement.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dst_negative_advice() should check for changed dst and reset
sk_tx_queue_mapping accordingly. Pass sock to the callers of
dst_negative_advice.
(sk_reset_txq is defined just for use by dst_negative_advice. The
only way I could find to get around this is to move dst_negative_()
from dst.h to dst.c, include sock.h in dst.c, etc)
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch tries to print out more information when we hit the
MSG_PEEK bug in tcp_recvmsg. It's been around since at least
2005 and it's about time that we finally fix it.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use symbols instead of magic constants while checking PMTU discovery
setsockopt.
Remove redundant test in ip_rt_frag_needed() (done by caller).
Signed-off-by: John Dykstra <john.dykstra1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ipv4/ipv6 setsockopt(IP_MULTICAST_IF) have dubious __dev_get_by_index() calls.
This function should be called only with RTNL or dev_base_lock held, or reader
could see a corrupt hash chain and eventually enter an endless loop.
Fix is to call dev_get_by_index()/dev_put().
If this happens to be performance critical, we could define a new dev_exist_by_index()
function to avoid touching dev refcount.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix TCP_DEFER_ACCEPT conversion between seconds and
retransmission to match the TCP SYN-ACK retransmission periods
because the time is converted to such retransmissions. The old
algorithm selects one more retransmission in some cases. Allow
up to 255 retransmissions.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change SYN-ACK retransmitting code for the TCP_DEFER_ACCEPT
users to not retransmit SYN-ACKs during the deferring period if
ACK from client was received. The goal is to reduce traffic
during the deferring period. When the period is finished
we continue with sending SYN-ACKs (at least one) but this time
any traffic from client will change the request to established
socket allowing application to terminate it properly.
Also, do not drop acked request if sending of SYN-ACK fails.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Willy Tarreau and many other folks in recent years
were concerned what happens when the TCP_DEFER_ACCEPT period
expires for clients which sent ACK packet. They prefer clients
that actively resend ACK on our SYN-ACK retransmissions to be
converted from open requests to sockets and queued to the
listener for accepting after the deferring period is finished.
Then application server can decide to wait longer for data
or to properly terminate the connection with FIN if read()
returns EAGAIN which is an indication for accepting after
the deferring period. This change still can have side effects
for applications that expect always to see data on the accepted
socket. Others can be prepared to work in both modes (with or
without TCP_DEFER_ACCEPT period) and their data processing can
ignore the read=EAGAIN notification and to allocate resources for
clients which proved to have no data to send during the deferring
period. OTOH, servers that use TCP_DEFER_ACCEPT=1 as flag (not
as a timeout) to wait for data will notice clients that didn't
send data for 3 seconds but that still resend ACKs.
Thanks to Willy Tarreau for the initial idea and to
Eric Dumazet for the review and testing the change.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 6d01a026b7.
Julian Anastasov, Willy Tarreau and Eric Dumazet have come up
with a more correct way to deal with this.
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch converts ah4 to the new ahash interface.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- skb_kill_datagram() can increment sk->sk_drops itself, not callers.
- UDP on IPV4 & IPV6 dropped frames (because of bad checksum or policy checks) increment sk_drops
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to have better cache layouts of struct sock (separate zones
for rx/tx paths), we need this preliminary patch.
Goal is to transfert fields used at lookup time in the first
read-mostly cache line (inside struct sock_common) and move sk_refcnt
to a separate cache line (only written by rx path)
This patch adds inet_ prefix to daddr, rcv_saddr, dport, num, saddr,
sport and id fields. This allows a future patch to define these
fields as macros, like sk_refcnt, without name clashes.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sock_queue_rcv_skb() can update sk_drops itself, removing need for
callers to take care of it. This is more consistent since
sock_queue_rcv_skb() also reads sk_drops when queueing a skb.
This adds sk_drops managment to many protocols that not cared yet.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Storing the mask (size - 1) instead of the size allows fast path to be
a bit faster.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
udp_poll() can in some circumstances drop frames with incorrect checksums.
Problem is we now have to lock the socket while dropping frames, or risk
sk_forward corruption.
This bug is present since commit 95766fff6b
([UDP]: Add memory accounting.)
While we are at it, we can correct ioctl(SIOCINQ) to also drop bad frames.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I was trying to use TCP_DEFER_ACCEPT and noticed that if the
client does not talk, the connection is never accepted and
remains in SYN_RECV state until the retransmits expire, where
it finally is deleted. This is bad when some firewall such as
netfilter sits between the client and the server because the
firewall sees the connection in ESTABLISHED state while the
server will finally silently drop it without sending an RST.
This behaviour contradicts the man page which says it should
wait only for some time :
TCP_DEFER_ACCEPT (since Linux 2.4)
Allows a listener to be awakened only when data arrives
on the socket. Takes an integer value (seconds), this
can bound the maximum number of attempts TCP will
make to complete the connection. This option should not
be used in code intended to be portable.
Also, looking at ipv4/tcp.c, a retransmit counter is correctly
computed :
case TCP_DEFER_ACCEPT:
icsk->icsk_accept_queue.rskq_defer_accept = 0;
if (val > 0) {
/* Translate value in seconds to number of
* retransmits */
while (icsk->icsk_accept_queue.rskq_defer_accept < 32 &&
val > ((TCP_TIMEOUT_INIT / HZ) <<
icsk->icsk_accept_queue.rskq_defer_accept))
icsk->icsk_accept_queue.rskq_defer_accept++;
icsk->icsk_accept_queue.rskq_defer_accept++;
}
break;
==> rskq_defer_accept is used as a counter of retransmits.
But in tcp_minisocks.c, this counter is only checked. And in
fact, I have found no location which updates it. So I think
that what was intended was to decrease it in tcp_minisocks
whenever it is checked, which the trivial patch below does.
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Create a new socket level option to report number of queue overflows
Recently I augmented the AF_PACKET protocol to report the number of frames lost
on the socket receive queue between any two enqueued frames. This value was
exported via a SOL_PACKET level cmsg. AFter I completed that work it was
requested that this feature be generalized so that any datagram oriented socket
could make use of this option. As such I've created this patch, It creates a
new SOL_SOCKET level option called SO_RXQ_OVFL, which when enabled exports a
SOL_SOCKET level cmsg that reports the nubmer of times the sk_receive_queue
overflowed between any two given frames. It also augments the AF_PACKET
protocol to take advantage of this new feature (as it previously did not touch
sk->sk_drops, which this patch uses to record the overflow count). Tested
successfully by me.
Notes:
1) Unlike my previous patch, this patch simply records the sk_drops value, which
is not a number of drops between packets, but rather a total number of drops.
Deltas must be computed in user space.
2) While this patch currently works with datagram oriented protocols, it will
also be accepted by non-datagram oriented protocols. I'm not sure if thats
agreeable to everyone, but my argument in favor of doing so is that, for those
protocols which aren't applicable to this option, sk_drops will always be zero,
and reporting no drops on a receive queue that isn't used for those
non-participating protocols seems reasonable to me. This also saves us having
to code in a per-protocol opt in mechanism.
3) This applies cleanly to net-next assuming that commit
977750076d (my af packet cmsg patch) is reverted
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
UDP_HTABLE_SIZE was initialy defined to 128, which is a bit small for
several setups.
4000 active UDP sockets -> 32 sockets per chain in average. An
incoming frame has to lookup all sockets to find best match, so long
chains hurt latency.
Instead of a fixed size hash table that cant be perfect for every
needs, let UDP stack choose its table size at boot time like tcp/ip
route, using alloc_large_system_hash() helper
Add an optional boot parameter, uhash_entries=x so that an admin can
force a size between 256 and 65536 if needed, like thash_entries and
rhash_entries.
dmesg logs two new lines :
[ 0.647039] UDP hash table entries: 512 (order: 0, 4096 bytes)
[ 0.647099] UDP Lite hash table entries: 512 (order: 0, 4096 bytes)
Maximal size on 64bit arches would be 65536 slots, ie 1 MBytes for non
debugging spinlocks.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no reason that cipso_v4_delopt() is not
defined as a static function.
Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for route lookup using sk_mark on IPv4 listening sockets.
Signed-off-by: Atis Elsts <atis@mikrotik.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes a bug with arp_notify.
If arp_notify is enabled, kernel will crash if address is changed
and no IP address is assigned.
http://bugzilla.kernel.org/show_bug.cgi?id=14330
Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
All usages of structure net_proto_ops should be declared const.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When routing daemon wants to enable forwarding of multicast traffic it
performs something like:
struct vifctl vc = {
.vifc_vifi = 1,
.vifc_flags = 0,
.vifc_threshold = 1,
.vifc_rate_limit = 0,
.vifc_lcl_addr = ip, /* <--- ip address of physical
interface, e.g. eth0 */
.vifc_rmt_addr.s_addr = htonl(INADDR_ANY),
};
setsockopt(fd, IPPROTO_IP, MRT_ADD_VIF, &vc, sizeof(vc));
This leads (in the kernel) to calling vif_add() function call which
search the (physical) device using assigned IP address:
dev = ip_dev_find(net, vifc->vifc_lcl_addr.s_addr);
The current API (struct vifctl) does not allow to specify an
interface other way than using it's IP, and if there are more than a
single interface with specified IP only the first one will be found.
The attached patch (against 2.6.30.4) allows to specify an interface
by its index, instead of IP address:
struct vifctl vc = {
.vifc_vifi = 1,
.vifc_flags = VIFF_USE_IFINDEX, /* NEW */
.vifc_threshold = 1,
.vifc_rate_limit = 0,
.vifc_lcl_ifindex = if_nametoindex("eth0"), /* NEW */
.vifc_rmt_addr.s_addr = htonl(INADDR_ANY),
};
setsockopt(fd, IPPROTO_IP, MRT_ADD_VIF, &vc, sizeof(vc));
Signed-off-by: Ilia K. <mail4ilia@gmail.com>
=== modified file 'include/linux/mroute.h'
Signed-off-by: David S. Miller <davem@davemloft.net>
We currently dirty a cache line to update tunnel device stats
(tx_packets/tx_bytes). We better use the txq->tx_bytes/tx_packets
counters that already are present in cpu cache, in the cache
line shared with txq->_xmit_lock
This patch extends IPTUNNEL_XMIT() macro to use txq pointer
provided by the caller.
Also &tunnel->dev->stats can be replaced by &dev->stats
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The FIB algorithim for IPV4 is set at compile time, but kernel goes through
the overhead of function call indirection at runtime. Save some
cycles by turning the indirect calls to direct calls to either
hash or trie code.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can make icmp messages tx completion callback a litle bit faster.
Setting SOCK_USE_WRITE_QUEUE sk flag tells sock_wfree() to
not call sk_write_space() on a socket we know no thread is posssibly
waiting for write space. (on per cpu kernel internal icmp sockets only)
This avoids the sock_def_write_space() call and
read_lock(&sk->sk_callback_lock)/read_unlock(&sk->sk_callback_lock) calls
as well.
We avoid three atomic ops.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_splice_read() doesnt take into account socket's O_NONBLOCK flag
Before this patch :
splice(socket,0,pipe,0,128*1024,SPLICE_F_MOVE);
causes a random endless block (if pipe is full) and
splice(socket,0,pipe,0,128*1024,SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
will return 0 immediately if the TCP buffer is empty.
User application has no way to instruct splice() that socket should be in blocking mode
but pipe in nonblock more.
Many projects cannot use splice(tcp -> pipe) because of this flaw.
http://git.samba.org/?p=samba.git;a=history;f=source3/lib/recvfile.c;h=ea0159642137390a0f7e57a123684e6e63e47581;hb=HEADhttp://lkml.indiana.edu/hypermail/linux/kernel/0807.2/0687.html
Linus introduced SPLICE_F_NONBLOCK in commit 29e350944f
(splice: add SPLICE_F_NONBLOCK flag )
It doesn't make the splice itself necessarily nonblocking (because the
actual file descriptors that are spliced from/to may block unless they
have the O_NONBLOCK flag set), but it makes the splice pipe operations
nonblocking.
Linus intention was clear : let SPLICE_F_NONBLOCK control the splice pipe mode only
This patch instruct tcp_splice_read() to use the underlying file O_NONBLOCK
flag, as other socket operations do.
Users will then call :
splice(socket,0,pipe,0,128*1024,SPLICE_F_MOVE | SPLICE_F_NONBLOCK );
to block on data coming from socket (if file is in blocking mode),
and not block on pipe output (to avoid deadlock)
First version of this patch was submitted by Octavian Purdila
Reported-by: Volker Lendecke <vl@samba.org>
Reported-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch against v2.6.31 adds support for route lookup using sk_mark in some
more places. The benefits from this patch are the following.
First, SO_MARK option now has effect on UDP sockets too.
Second, ip_queue_xmit() and inet_sk_rebuild_header() could fail to do routing
lookup correctly if TCP sockets with SO_MARK were used.
Signed-off-by: Atis Elsts <atis@mikrotik.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Acknowledge TCP window scale support by inserting the proper option in SYN/ACK
and SYN headers even if our window scale is zero.
This fixes the following observed behavior:
1. Client sends a SYN with TCP window scaling option and non zero window scale
value to a Linux box.
2. Linux box notes large receive window from client.
3. Linux decides on a zero value of window scale for its part.
4. Due to compare against requested window scale size option, Linux does not to
send windows scale TCP option header on SYN/ACK at all.
With the following result:
Client box thinks TCP window scaling is not supported, since SYN/ACK had no
TCP window scale option, while Linux thinks that TCP window scaling is
supported (and scale might be non zero), since SYN had TCP window scale
option and we have a mismatched idea between the client and server
regarding window sizes.
Probably it also fixes up the following bug (not observed in practice):
1. Linux box opens TCP connection to some server.
2. Linux decides on zero value of window scale.
3. Due to compare against computed window scale size option, Linux does
not to set windows scale TCP option header on SYN.
With the expected result that the server OS does not use window scale option
due to not receiving such an option in the SYN headers, leading to suboptimal
performance.
Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Signed-off-by: Ori Finkelman <ori@comsleep.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/ipv4/tcp.c: In function 'do_tcp_setsockopt':
net/ipv4/tcp.c:2050: warning: comparison of distinct pointer types lacks a cast
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This provides safety against negative optlen at the type
level instead of depending upon (sometimes non-trivial)
checks against this sprinkled all over the the place, in
each and every implementation.
Based upon work done by Arjan van de Ven and feedback
from Linus Torvalds.
Signed-off-by: David S. Miller <davem@davemloft.net>
It seems recursion field from "struct ip_tunnel" is not anymore needed.
recursion prevention is done at the upper level (in dev_queue_xmit()),
since we use HARD_TX_LOCK protection for tunnels.
This avoids a cache line ping pong on "struct ip_tunnel" : This structure
should be now mostly read on xmit and receive paths.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Due to man page of setsockopt, if optlen is not valid, kernel should return
-EINVAL. But a simple testcase as following, errno is 0, which means setsockopt
is successful.
addr.s_addr = inet_addr("192.1.2.3");
setsockopt(s, IPPROTO_IP, IP_MULTICAST_IF, &addr, 1);
printf("errno is %d\n", errno);
Xiaotian Feng(dfeng@redhat.com) caught the bug. We fix it firstly checking
the availability of optlen and then dealing with the logic like other options.
Reported-by: Xiaotian Feng <dfeng@redhat.com>
Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Acked-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's unused.
It isn't needed -- read or write flag is already passed and sysctl
shouldn't care about the rest.
It _was_ used in two places at arch/frv for some reason.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sizing of memory allocations shouldn't depend on the number of physical
pages found in a system, as that generally includes (perhaps a huge amount
of) non-RAM pages. The amount of what actually is usable as storage
should instead be used as a basis here.
Some of the calculations (i.e. those not intending to use high memory)
should likely even use (totalram_pages - totalhigh_pages).
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Dave Airlie <airlied@linux.ie>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (66 commits)
be2net: fix some cmds to use mccq instead of mbox
atl1e: fix 2.6.31-git4 -- ATL1E 0000:03:00.0: DMA-API: device driver frees DMA
pkt_sched: Fix qstats.qlen updating in dump_stats
ipv6: Log the affected address when DAD failure occurs
wl12xx: Fix print_mac() conversion.
af_iucv: fix race when queueing skbs on the backlog queue
af_iucv: do not call iucv_sock_kill() twice
af_iucv: handle non-accepted sockets after resuming from suspend
af_iucv: fix race in __iucv_sock_wait()
iucv: use correct output register in iucv_query_maxconn()
iucv: fix iucv_buffer_cpumask check when calling IUCV functions
iucv: suspend/resume error msg for left over pathes
wl12xx: switch to %pM to print the mac address
b44: the poll handler b44_poll must not enable IRQ unconditionally
ipv6: Ignore route option with ROUTER_PREF_INVALID
bonding: make ab_arp select active slaves as other modes
cfg80211: fix SME connect
rc80211_minstrel: fix contention window calculation
ssb/sdio: fix printk format warnings
p54usb: add Zcomax XG-705A usbid
...
I have recently came across a preemption imbalance detected by:
<4>huh, entered ffffffff80644630 with preempt_count 00000102, exited with 00000101?
<0>------------[ cut here ]------------
<2>kernel BUG at /usr/src/linux/kernel/timer.c:664!
<0>invalid opcode: 0000 [1] PREEMPT SMP
with ffffffff80644630 being inet_twdr_hangman().
This appeared after I enabled CONFIG_TCP_MD5SIG and played with it a
bit, so I looked at what might have caused it.
One thing that struck me as strange is tcp_twsk_destructor(), as it
calls tcp_put_md5sig_pool() -- which entails a put_cpu(), causing the
detected imbalance. Found on 2.6.23.9, but 2.6.31 is affected as well,
as far as I can tell.
Signed-off-by: Robert Varga <nite@hq.alert.sk>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (46 commits)
powerpc64: convert to dynamic percpu allocator
sparc64: use embedding percpu first chunk allocator
percpu: kill lpage first chunk allocator
x86,percpu: use embedding for 64bit NUMA and page for 32bit NUMA
percpu: update embedding first chunk allocator to handle sparse units
percpu: use group information to allocate vmap areas sparsely
vmalloc: implement pcpu_get_vm_areas()
vmalloc: separate out insert_vmalloc_vm()
percpu: add chunk->base_addr
percpu: add pcpu_unit_offsets[]
percpu: introduce pcpu_alloc_info and pcpu_group_info
percpu: move pcpu_lpage_build_unit_map() and pcpul_lpage_dump_cfg() upward
percpu: add @align to pcpu_fc_alloc_fn_t
percpu: make @dyn_size mandatory for pcpu_setup_first_chunk()
percpu: drop @static_size from first chunk allocators
percpu: generalize first chunk allocator selection
percpu: build first chunk allocators selectively
percpu: rename 4k first chunk allocator to page
percpu: improve boot messages
percpu: fix pcpu_reclaim() locking
...
Fix trivial conflict as by Tejun Heo in kernel/sched.c
This patch fixes commit e36b9d16c6. The approach
there is to call dev_close()/dev_open() whenever the device type is changed in
order to remap the device IP multicast addresses to HW multicast addresses.
This approach suffers from 2 drawbacks:
*. It assumes tha the device is UP when calling dev_close(), or otherwise
dev_close() has no affect. It is worth to mention that initscripts (Redhat)
and sysconfig (Suse) doesn't act the same in this matter.
*. dev_close() has other side affects, like deleting entries from the routing
table, which might be unnecessary.
The fix here is to directly remap the IP multicast addresses to HW multicast
addresses for a bonding device that changes its type, and nothing else.
Reported-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Moni Shoua <monis@voltaire.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It was once upon time so that snd_sthresh was a 16-bit quantity.
...That has not been true for long period of time. I run across
some ancient compares which still seem to trust such legacy.
Put all that magic into a single place, I hopefully found all
of them.
Compile tested, though linking of allyesconfig is ridiculous
nowadays it seems.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove long removed "inet_protocol_base" declaration.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1623 commits)
netxen: update copyright
netxen: fix tx timeout recovery
netxen: fix file firmware leak
netxen: improve pci memory access
netxen: change firmware write size
tg3: Fix return ring size breakage
netxen: build fix for INET=n
cdc-phonet: autoconfigure Phonet address
Phonet: back-end for autoconfigured addresses
Phonet: fix netlink address dump error handling
ipv6: Add IFA_F_DADFAILED flag
net: Add DEVTYPE support for Ethernet based devices
mv643xx_eth.c: remove unused txq_set_wrr()
ucc_geth: Fix hangs after switching from full to half duplex
ucc_geth: Rearrange some code to avoid forward declarations
phy/marvell: Make non-aneg speed/duplex forcing work for 88E1111 PHYs
drivers/net/phy: introduce missing kfree
drivers/net/wan: introduce missing kfree
net: force bridge module(s) to be GPL
Subject: [PATCH] appletalk: Fix skb leak when ipddp interface is not loaded
...
Fixed up trivial conflicts:
- arch/x86/include/asm/socket.h
converted to <asm-generic/socket.h> in the x86 tree. The generic
header has the same new #define's, so that works out fine.
- drivers/net/tun.c
fix conflict between 89f56d1e9 ("tun: reuse struct sock fields") that
switched over to using 'tun->socket.sk' instead of the redundantly
available (and thus removed) 'tun->sk', and 2b980dbd ("lsm: Add hooks
to the TUN driver") which added a new 'tun->sk' use.
Noted in 'next' by Stephen Rothwell.
This fixed a lockdep warning which appeared when doing stress
memory tests over NFS:
inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
page reclaim => nfs_writepage => tcp_sendmsg => lock sk_lock
mount_root => nfs_root_data => tcp_close => lock sk_lock =>
tcp_send_fin => alloc_skb_fclone => page reclaim
David raised a concern that if the allocation fails in tcp_send_fin(), and it's
GFP_ATOMIC, we are going to yield() (which sleeps) and loop endlessly waiting
for the allocation to succeed.
But fact is, the original GFP_KERNEL also sleeps. GFP_ATOMIC+yield() looks
weird, but it is no worse the implicit sleep inside GFP_KERNEL. Both could
loop endlessly under memory pressure.
CC: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
CC: David S. Miller <davem@davemloft.net>
CC: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Christoph Lameter pointed out that packet drops at qdisc level where not
accounted in SNMP counters. Only if application sets IP_RECVERR, drops
are reported to user (-ENOBUFS errors) and SNMP counters updated.
IP_RECVERR is used to enable extended reliable error message passing,
but these are not needed to update system wide SNMP stats.
This patch changes things a bit to allow SNMP counters to be updated,
regardless of IP_RECVERR being set or not on the socket.
Example after an UDP tx flood
# netstat -s
...
IP:
1487048 outgoing packets dropped
...
Udp:
...
SndbufErrors: 1487048
send() syscalls, do however still return an OK status, to not
break applications.
Note : send() manual page explicitly says for -ENOBUFS error :
"The output queue for a network interface was full.
This generally indicates that the interface has stopped sending,
but may be caused by transient congestion.
(Normally, this does not occur in Linux. Packets are just silently
dropped when a device queue overflows.) "
This is not true for IP_RECVERR enabled sockets : a send() syscall
that hit a qdisc drop returns an ENOBUFS error.
Many thanks to Christoph, David, and last but not least, Alexey !
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function block inet_connect_sock_af_ops contains no data
make it constant.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These tables are never modified at runtime. Move to read-only
section.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RFC 1122 specifies two threshold values R1 and R2 for connection timeouts,
which may represent a number of allowed retransmissions or a timeout value.
Currently linux uses sysctl_tcp_retries{1,2} to specify the thresholds
in number of allowed retransmissions.
For any desired threshold R2 (by means of time) one can specify tcp_retries2
(by means of number of retransmissions) such that TCP will not time out
earlier than R2. This is the case, because the RTO schedule follows a fixed
pattern, namely exponential backoff.
However, the RTO behaviour is not predictable any more if RTO backoffs can be
reverted, as it is the case in the draft
"Make TCP more Robust to Long Connectivity Disruptions"
(http://tools.ietf.org/html/draft-zimmermann-tcp-lcd).
In the worst case TCP would time out a connection after 3.2 seconds, if the
initial RTO equaled MIN_RTO and each backoff has been reverted.
This patch introduces a function retransmits_timed_out(N),
which calculates the timeout of a TCP connection, assuming an initial
RTO of MIN_RTO and N unsuccessful, exponentially backed-off retransmissions.
Whenever timeout decisions are made by comparing the retransmission counter
to some value N, this function can be used, instead.
The meaning of tcp_retries2 will be changed, as many more RTO retransmissions
can occur than the value indicates. However, it yields a timeout which is
similar to the one of an unpatched, exponentially backing off TCP in the same
scenario. As no application could rely on an RTO greater than MIN_RTO, there
should be no risk of a regression.
Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Here, an ICMP host/network unreachable message, whose payload fits to
TCP's SND.UNA, is taken as an indication that the RTO retransmission has
not been lost due to congestion, but because of a route failure
somewhere along the path.
With true congestion, a router won't trigger such a message and the
patched TCP will operate as standard TCP.
This patch reverts one RTO backoff, if an ICMP host/network unreachable
message, whose payload fits to TCP's SND.UNA, arrives.
Based on the new RTO, the retransmission timer is reset to reflect the
remaining time, or - if the revert clocked out the timer - a retransmission
is sent out immediately.
Backoffs are only reverted, if TCP is in RTO loss recovery, i.e. if
there have been retransmissions and reversible backoffs, already.
Changes from v2:
1) Renaming of skb in tcp_v4_err() moved to another patch.
2) Reintroduced tcp_bound_rto() and __tcp_set_rto().
3) Fixed code comments.
Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
This supplementary patch renames skb to icmp_skb in tcp_v4_err() in order to
disambiguate from another sk_buff variable, which will be introduced
in a separate patch.
Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove the copy of the MD5 authentication key from tcp_check_req().
This key has already been copied by tcp_v4_syn_recv_sock() or
tcp_v6_syn_recv_sock().
Signed-off-by: John Dykstra <john.dykstra1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a race condition in the time-wait sockets code that can lead
to premature termination of FIN_WAIT2 and, subsequently, to RST
generation when the FIN,ACK from the peer finally arrives:
Time TCP header
0.000000 30755 > http [SYN] Seq=0 Win=2920 Len=0 MSS=1460 TSV=282912 TSER=0
0.000008 http > 30755 aSYN, ACK] Seq=0 Ack=1 Win=2896 Len=0 MSS=1460 TSV=...
0.136899 HEAD /1b.html?n1Lg=v1 HTTP/1.0 [Packet size limited during capture]
0.136934 HTTP/1.0 200 OK [Packet size limited during capture]
0.136945 http > 30755 [FIN, ACK] Seq=187 Ack=207 Win=2690 Len=0 TSV=270521...
0.136974 30755 > http [ACK] Seq=207 Ack=187 Win=2734 Len=0 TSV=283049 TSER=...
0.177983 30755 > http [ACK] Seq=207 Ack=188 Win=2733 Len=0 TSV=283089 TSER=...
0.238618 30755 > http [FIN, ACK] Seq=207 Ack=188 Win=2733 Len=0 TSV=283151...
0.238625 http > 30755 [RST] Seq=188 Win=0 Len=0
Say twdr->slot = 1 and we are running inet_twdr_hangman and in this
instance inet_twdr_do_twkill_work returns 1. At that point we will
mark slot 1 and schedule inet_twdr_twkill_work. We will also make
twdr->slot = 2.
Next, a connection is closed and tcp_time_wait(TCP_FIN_WAIT2, timeo)
is called which will create a new FIN_WAIT2 time-wait socket and will
place it in the last to be reached slot, i.e. twdr->slot = 1.
At this point say inet_twdr_twkill_work will run which will start
destroying the time-wait sockets in slot 1, including the just added
TCP_FIN_WAIT2 one.
To avoid this issue we increment the slot only if all entries in the
slot have been purged.
This change may delay the slots cleanup by a time-wait death row
period but only if the worker thread didn't had the time to run/purge
the current slot in the next period (6 seconds with default sysctl
settings). However, on such a busy system even without this change we
would probably see delays...
Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Here is rework and cleanup of the resize function.
Some bugs we had. We were using ->parent when we should use
node_parent(). Also we used ->parent which is not assigned by
inflate in inflate loop.
Also a fix to set thresholds to power 2 to fit halve
and double strategy.
max_resize is renamed to max_work which better indicates
it's function.
Reaching max_work is not an error, so warning is removed.
max_work only limits amount of work done per resize.
(limits CPU-usage, outstanding memory etc).
The clean-up makes it relatively easy to add fixed sized
root-nodes if we would like to decrease the memory pressure
on routers with large routing tables and dynamic routing.
If we'll need that...
Its been tested with 280k routes.
Work done together with Robert Olsson.
Signed-off-by: Jens Låås <jens.laas@its.uu.se>
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
While doing some forwarding benchmarks, I noticed
ip_rt_send_redirect() is rather expensive, even if send_redirects is
false for the device.
Fix is to avoid two atomic ops, we dont really need to take a
reference on in_dev
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce keepalive_probes(tp) helper, and use it, like
keepalive_time_when(tp) and keepalive_intvl_when(tp)
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a check in ip_append_data() for NULL *rtp to prevent future bugs in
callers from being exploitable.
Signed-off-by: Julien Tinnes <julien@cr0.org>
Signed-off-by: Tavis Ormandy <taviso@sdf.lonestar.org>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Log packets dropped by helpers using the netfilter logging API. This
is useful in combination with nfnetlink_log to analyze those packets
in userspace for debugging.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Kernel 2.6.30 introduced a patch [1] for the persistent option in the
netfilter SNAT target. This is exactly what we need here so I had a quick look
at the code and noticed that the patch is wrong. The logic is simply inverted.
The patch below fixes this.
Also note that because of this the default behavior of the SNAT target has
changed since kernel 2.6.30 as it now ignores the destination IP in choosing
the source IP for nating (which should only be the case if the persistent
option is set).
[1] http://git.eu.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=98d500d66cb7940747b424b245fc6a51ecfbf005
Signed-off-by: Maximilian Engelhardt <maxi@daemonizer.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
The inputted table is never modified, so should be considered const.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
The GRE header length should be subtracted when the tunnel MTU is
calculated. This just corrects for the associativity change
introduced by commit 42aa916265
("gre: Move MTU setting out of ipgre_tunnel_bind_dev").
Signed-off-by: Tom Goff <thomas.goff@boeing.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
arch/sparc/kernel/smp_64.c
arch/x86/kernel/cpu/perf_counter.c
arch/x86/kernel/setup_percpu.c
drivers/cpufreq/cpufreq_ondemand.c
mm/percpu.c
Conflicts in core and arch percpu codes are mostly from commit
ed78e1e078dd44249f88b1dd8c76dafb39567161 which substituted many
num_possible_cpus() with nr_cpu_ids. As for-next branch has moved all
the first chunk allocators into mm/percpu.c, the changes are moved
from arch code to mm/percpu.c.
Signed-off-by: Tejun Heo <tj@kernel.org>
The networking code checks CAP_SYS_MODULE before using request_module() to
try to load a kernel module. While this seems reasonable it's actually
weakening system security since we have to allow CAP_SYS_MODULE for things
like /sbin/ip and bluetoothd which need to be able to trigger module loads.
CAP_SYS_MODULE actually grants those binaries the ability to directly load
any code into the kernel. We should instead be protecting modprobe and the
modules on disk, rather than granting random programs the ability to load code
directly into the kernel. Instead we are going to gate those networking checks
on CAP_NET_ADMIN which still limits them to root but which does not grant
those processes the ability to load arbitrary code into the kernel.
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Paul Moore <paul.moore@hp.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: James Morris <jmorris@namei.org>
This adds the second check that Rusty wanted to have a long time ago. :-)
Base chain policies must have absolute verdicts that cease processing
in the table, otherwise rule execution may continue in an unexpected
spurious fashion (e.g. next chain that follows in memory).
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
This adds a check that iptables's original author Rusty set forth in
a FIXME comment.
Underflows in iptables are better known as chain policies, and are
required to be unconditional or there would be a stochastical chance
for the policy rule to be skipped if it does not match. If that were
to happen, rule execution would continue in an unexpected spurious
fashion.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
The "hook_entry" and "underflow" array contains values even for hooks
not provided, such as PREROUTING in conjunction with the "filter"
table. Usually, the values point to whatever the next rule is. For
the upcoming unconditionality and underflow checking patches however,
we must not inspect that arbitrary rule.
Skipping unassigned hooks seems like a good idea, also because
newinfo->hook_entry and newinfo->underflow will then continue to have
the poison value for detecting abnormalities.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Instead of inspecting each u32/char open-coded, clean up and make use
of memcmp. On some arches, memcmp is implemented as assembly or GCC's
__builtin_memcmp which can possibly take advantages of known
alignment.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
String literals are constant, and usually, we can also tag the array
of pointers const too, moving it to the .rodata section.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix build errors when SYSCTLs are not enabled:
(.init.text+0x5154): undefined reference to `net_ipv4_ctl_path'
(.init.text+0x5176): undefined reference to `register_net_sysctl_table'
xfrm4_policy.c:(.exit.text+0x573): undefined reference to `unregister_net_sysctl_table
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Choose saner defaults for xfrm[4|6] gc_thresh values on init
Currently, the xfrm[4|6] code has hard-coded initial gc_thresh values
(set to 1024). Given that the ipv4 and ipv6 routing caches are sized
dynamically at boot time, the static selections can be non-sensical.
This patch dynamically selects an appropriate gc threshold based on
the corresponding main routing table size, using the assumption that
we should in the worst case be able to handle as many connections as
the routing table can.
For ipv4, the maximum route cache size is 16 * the number of hash
buckets in the route cache. Given that xfrm4 starts garbage
collection at the gc_thresh and prevents new allocations at 2 *
gc_thresh, we set gc_thresh to half the maximum route cache size.
For ipv6, its a bit trickier. there is no maximum route cache size,
but the ipv6 dst_ops gc_thresh is statically set to 1024. It seems
sane to select a simmilar gc_thresh for the xfrm6 code that is half
the number of hash buckets in the v6 route cache times 16 (like the v4
code does).
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If arp_format_neigh_entry() can be called with n->dev->addr_len == 0, then a
write to hbuffer[-1] occurs.
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Export garbage collector thresholds for xfrm[4|6]_dst_ops
Had a problem reported to me recently in which a high volume of ipsec
connections on a system began reporting ENOBUFS for new connections
eventually.
It seemed that after about 2000 connections we started being unable to
create more. A quick look revealed that the xfrm code used a dst_ops
structure that limited the gc_thresh value to 1024, and always
dropped route cache entries after 2x the gc_thresh.
It seems the most direct solution is to export the gc_thresh values in
the xfrm[4|6] dst_ops as sysctls, like the main routing table does, so
that higher volumes of connections can be supported. This patch has
been tested and allows the reporter to increase their ipsec connection
volume successfully.
Reported-by: Joe Nall <joe@nall.com>
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
ipv4/xfrm4_policy.c | 18 ++++++++++++++++++
ipv6/xfrm6_policy.c | 18 ++++++++++++++++++
2 files changed, 36 insertions(+)
Signed-off-by: David S. Miller <davem@davemloft.net>
While looking for something else I spent some time adding
one liner comments to the tcp_output.c functions that
didn't have any. That makes the comments more consistent.
I hope I documented everything right.
No code changes.
v2: Incorporated feedback from Ilpo.
v3: Change style of one liner comments, add a few more comments.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the TCP connection handshake completes on the passive
side, a variety of state must be set up in the "child" sock,
including the key if MD5 authentication is being used. Fix TCP
for both address families to label the key with the peer's
destination address, rather than the address from the listening
sock, which is usually the wildcard.
Reported-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: John Dykstra <john.dykstra1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix MD5 signature checking so that an IPv4 active open
to an IPv6 socket can succeed. In particular, use the
correct address family's signature generation function
for the SYN/ACK.
Reported-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: John Dykstra <john.dykstra1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While looking for other fib_trie problems reported by Pawel Staszewski
I noticed there are a few uses of tnode_get_child() and node_parent()
in lookups instead of their rcu versions.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
During large updates there could be triggered warnings like: "Fix
inflate_threshold_root. Now=25 size=11 bits" if inflate() of the root
node isn't finished in 10 loops. It should be much rarer now, after
changing the threshold from 15 to 25, and a temporary problem, so
this patch tries to handle it automatically using a fix variable to
increase by one inflate threshold for next root resizes (up to the 35
limit, max fix = 10). The fix variable is decreased when root's
inflate() finishes below 7 loops (even if some other, smaller table/
trie is updated -- for simplicity the fix variable is global for now).
Reported-by: Pawel Staszewski <pstaszewski@itcare.pl>
Reported-by: Jorge Boncompte [DTI2] <jorge@dti2.net>
Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
During trie_rebalance() we free memory after resizing with call_rcu(),
but large updates, especially with PREEMPT_NONE configs, can cause
memory stresses, so this patch calls synchronize_rcu() in
tnode_free_flush() after each sync_pages to guarantee such freeing
(especially before resizing the root node).
The value of sync_pages = 128 is based on Pawel Staszewski's tests as
the lowest which doesn't hinder updating times. (For testing purposes
there was a sysfs module parameter to change it on demand, but it's
removed until we're sure it could be really useful.)
The patch is based on suggestions by: Paul E. McKenney
<paulmck@linux.vnet.ibm.com>
Reported-by: Pawel Staszewski <pstaszewski@itcare.pl>
Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes two bugs:
- ToS/DiffServ inheritance was unintentionally activated when using impair fixed ToS values
- ECN bit was lost during ToS/DiffServ inheritance
Signed-off-by: Andreas Jaggi <aj@open.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
- validate and forward GSO UDP/IPv4 packets from untrusted sources.
- do software UFO if the outgoing device doesn't support UFO.
Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit 2b85a34e91
(net: No more expensive sock_hold()/sock_put() on each tx)
we do not take any more references on sk->sk_refcnt on outgoing packets.
I forgot to delete two __sock_put() from ip_push_pending_frames()
and ip6_push_pending_frames().
Reported-by: Emil S Tantilov <emils.tantilov@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Emil S Tantilov <emils.tantilov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adding memory barrier after the poll_wait function, paired with
receive callbacks. Adding fuctions sock_poll_wait and sk_has_sleeper
to wrap the memory barrier.
Without the memory barrier, following race can happen.
The race fires, when following code paths meet, and the tp->rcv_nxt
and __add_wait_queue updates stay in CPU caches.
CPU1 CPU2
sys_select receive packet
... ...
__add_wait_queue update tp->rcv_nxt
... ...
tp->rcv_nxt check sock_def_readable
... {
schedule ...
if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
wake_up_interruptible(sk->sk_sleep)
...
}
If there was no cache the code would work ok, since the wait_queue and
rcv_nxt are opposit to each other.
Meaning that once tp->rcv_nxt is updated by CPU2, the CPU1 either already
passed the tp->rcv_nxt check and sleeps, or will get the new value for
tp->rcv_nxt and will return with new data mask.
In both cases the process (CPU1) is being added to the wait queue, so the
waitqueue_active (CPU2) call cannot miss and will wake up CPU1.
The bad case is when the __add_wait_queue changes done by CPU1 stay in its
cache, and so does the tp->rcv_nxt update on CPU2 side. The CPU1 will then
endup calling schedule and sleep forever if there are no more data on the
socket.
Calls to poll_wait in following modules were ommited:
net/bluetooth/af_bluetooth.c
net/irda/af_irda.c
net/irda/irnet/irnet_ppp.c
net/mac80211/rc80211_pid_debugfs.c
net/phonet/socket.c
net/rds/af_rds.c
net/rfkill/core.c
net/sunrpc/cache.c
net/sunrpc/rpc_pipe.c
net/tipc/socket.c
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pawel Staszewski wrote:
<blockquote>
Some time ago i report this:
http://bugzilla.kernel.org/show_bug.cgi?id=6648
and now with 2.6.29 / 2.6.29.1 / 2.6.29.3 and 2.6.30 it back
dmesg output:
oprofile: using NMI interrupt.
Fix inflate_threshold_root. Now=15 size=11 bits
...
Fix inflate_threshold_root. Now=15 size=11 bits
cat /proc/net/fib_triestat
Basic info: size of leaf: 40 bytes, size of tnode: 56 bytes.
Main:
Aver depth: 2.28
Max depth: 6
Leaves: 276539
Prefixes: 289922
Internal nodes: 66762
1: 35046 2: 13824 3: 9508 4: 4897 5: 2331 6: 1149 7: 5
9: 1 18: 1
Pointers: 691228
Null ptrs: 347928
Total size: 35709 kB
</blockquote>
It seems, the current threshold for root resizing is too aggressive,
and it causes misleading warnings during big updates, but it might be
also responsible for memory problems, especially with non-preempt
configs, when RCU freeing is delayed long after call_rcu.
It should be also mentioned that because of non-atomic changes during
resizing/rebalancing the current lookup algorithm can miss valid leaves
so it's additional argument to shorten these activities even at a cost
of a minimally longer searching.
This patch restores values before the patch "[IPV4]: fib_trie root
node settings", commit: 965ffea43d from
v2.6.22.
Pawel's report:
<blockquote>
I dont see any big change of (cpu load or faster/slower
routing/propagating routes from bgpd or something else) - in avg there
is from 2% to 3% more of CPU load i dont know why but it is - i change
from "preempt" to "no preempt" 3 times and check this my "mpstat -P ALL
1 30"
always avg cpu load was from 2 to 3% more compared to "no preempt"
[...]
cat /proc/net/fib_triestat
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
Aver depth: 2.44
Max depth: 6
Leaves: 277814
Prefixes: 291306
Internal nodes: 66420
1: 32737 2: 14850 3: 10332 4: 4871 5: 2313 6: 942 7: 371 8: 3 17: 1
Pointers: 599098
Null ptrs: 254865
Total size: 18067 kB
</blockquote>
According to this and other similar reports average depth is slightly
increased (~0.2), and root nodes are shorter (log 17 vs. 18), but
there is no visible performance decrease. So, until memory handling is
improved or added parameters for changing this individually, this
patch resets to safer defaults.
Reported-by: Pawel Staszewski <pstaszewski@itcare.pl>
Reported-by: Jorge Boncompte [DTI2] <jorge@dti2.net>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is the result of an automatic spatch transformation to convert
all ndo_start_xmit() return values of 0 to NETDEV_TX_OK.
Some occurences are missed by the automatic conversion, those will be
handled in a seperate patch.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The SCTP pushed the skb data above the sctp chunk header, so the check
of pskb_may_pull(skb, xprth + 4 - skb->data) in _decode_session4() will
never return 0 because xprth + 4 - skb->data < 0, the ports decode of
sctp will always fail.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull linus#master to merge PER_CPU_DEF_ATTRIBUTES and alpha build fix
changes. As alpha in percpu tree uses 'weak' attribute instead of
inline assembly, there's no need for __used attribute.
Conflicts:
arch/alpha/include/asm/percpu.h
arch/mn10300/kernel/vmlinux.lds.S
include/linux/percpu-defs.h
This reverts commit 73ce7b01b4.
After discovering that we don't listen to gratuitious arps in 2.6.30
I tracked the failure down to this commit.
The patch makes absolutely no sense. RFC2131 RFC3927 and RFC5227.
are all in agreement that an arp request with sip == 0 should be used
for the probe (to prevent learning) and an arp request with sip == tip
should be used for the gratitous announcement that people can learn
from.
It appears the author of the broken patch got those two cases confused
and modified the code to drop all gratuitous arp traffic. Ouch!
Cc: stable@kernel.org
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alas current delaying of freeing old tnodes by RCU in trie_rebalance
is still not enough because we can free a top tnode before updating a
t->trie pointer.
Reported-by: Pawel Staszewski <pstaszewski@itcare.pl>
Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If a socket starts out on a non-TSO route, and then switches to
a TSO route, then we will tack on data to the tail of the tx queue
even if it started out life as non-TSO. This is suboptimal because
all of it will then be copied and checksummed unnecessarily.
This patch fixes this by ensuring that skb->ip_summed is set to
CHECKSUM_PARTIAL before appending extra data beyond the MSS.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
If a socket starts out on a non-TSO route, and then switches to
a TSO route, then the tail on the tx queue can morph into a TSO
packet, causing mischief because the rest of the stack does not
expect a partially linear TSO packet.
This patch fixes this by ensuring that skb->ip_summed is set to
CHECKSUM_PARTIAL before declaring a packet as TSO.
Reported-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
When NAT helpers change the TCP packet size, the highest seen sequence
number needs to be corrected. This is currently only done upwards, when
the packet size is reduced the sequence number is unchanged. This causes
TCP conntrack to falsely detect unacknowledged data and decrease the
timeout.
Fix by updating the highest seen sequence number in both directions after
packet mangling.
Tested-by: Krzysztof Piotr Oledzki <ole@ans.pl>
Signed-off-by: Patrick McHardy <kaber@trash.net>
As transparent proxying looks up the socket early and assigns
it to the skb for later processing, we must drop any existing
socket ownership prior to that in order to distinguish between
the case where tproxy is active and where it is not.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
RFC0793 defined that in FIN-WAIT-2 state if the ACK bit is off drop
the segment and return[Page 72]. But this check is missing in function
tcp_timewait_state_process(). This cause the segment with FIN flag but
no ACK has two diffent action:
Case 1:
Node A Node B
<------------- FIN,ACK
(enter FIN-WAIT-1)
ACK ------------->
(enter FIN-WAIT-2)
FIN -------------> discard
(move sk to tw list)
Case 2:
Node A Node B
<------------- FIN,ACK
(enter FIN-WAIT-1)
ACK ------------->
(enter FIN-WAIT-2)
(move sk to tw list)
FIN ------------->
<------------- ACK
This patch fixed the problem.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6:
bnx2: Fix the behavior of ethtool when ONBOOT=no
qla3xxx: Don't sleep while holding lock.
qla3xxx: Give the PHY time to come out of reset.
ipv4 routing: Ensure that route cache entries are usable and reclaimable with caching is off
net: Move rx skb_orphan call to where needed
ipv6: Use correct data types for ICMPv6 type and code
net: let KS8842 driver depend on HAS_IOMEM
can: let SJA1000 driver depend on HAS_IOMEM
netxen: fix firmware init handshake
netxen: fix build with without CONFIG_PM
netfilter: xt_rateest: fix comparison with self
netfilter: xt_quota: fix incomplete initialization
netfilter: nf_log: fix direct userspace memory access in proc handler
netfilter: fix some sparse endianess warnings
netfilter: nf_conntrack: fix conntrack lookup race
netfilter: nf_conntrack: fix confirmation race condition
netfilter: nf_conntrack: death_by_timeout() fix
Percpu variable definition is about to be updated such that all percpu
symbols including the static ones must be unique. Update percpu
variable definitions accordingly.
* as,cfq: rename ioc_count uniquely
* cpufreq: rename cpu_dbs_info uniquely
* xen: move nesting_count out of xen_evtchn_do_upcall() and rename it
* mm: move ratelimits out of balance_dirty_pages_ratelimited_nr() and
rename it
* ipv4,6: rename cookie_scratch uniquely
* x86 perf_counter: rename prev_left to pmc_prev_left, irq_entry to
pmc_irq_entry and nmi_entry to pmc_nmi_entry
* perf_counter: rename disable_count to perf_disable_count
* ftrace: rename test_event_disable to ftrace_test_event_disable
* kmemleak: rename test_pointer to kmemleak_test_pointer
* mce: rename next_interval to mce_next_interval
[ Impact: percpu usage cleanups, no duplicate static percpu var names ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: linux-mm <linux-mm@kvack.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Currently, the following three different ways to define percpu arrays
are in use.
1. DEFINE_PER_CPU(elem_type[array_len], array_name);
2. DEFINE_PER_CPU(elem_type, array_name[array_len]);
3. DEFINE_PER_CPU(elem_type, array_name)[array_len];
Unify to #1 which correctly separates the roles of the two parameters
and thus allows more flexibility in the way percpu variables are
defined.
[ Impact: cleanup ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: linux-mm@kvack.org
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: David S. Miller <davem@davemloft.net>
When route caching is disabled (rt_caching returns false), We still use route
cache entries that are created and passed into rt_intern_hash once. These
routes need to be made usable for the one call path that holds a reference to
them, and they need to be reclaimed when they're finished with their use. To be
made usable, they need to be associated with a neighbor table entry (which they
currently are not), otherwise iproute_finish2 just discards the packet, since we
don't know which L2 peer to send the packet to. To do this binding, we need to
follow the path a bit higher up in rt_intern_hash, which calls
arp_bind_neighbour, but not assign the route entry to the hash table.
Currently, if caching is off, we simply assign the route to the rp pointer and
are reutrn success. This patch associates us with a neighbor entry first.
Secondly, we need to make sure that any single use routes like this are known to
the garbage collector when caching is off. If caching is off, and we try to
hash in a route, it will leak when its refcount reaches zero. To avoid this,
this patch calls rt_free on the route cache entry passed into rt_intern_hash.
This places us on the gc list for the route cache garbage collector, so that
when its refcount reaches zero, it will be reclaimed (Thanks to Alexey for this
suggestion).
I've tested this on a local system here, and with these patches in place, I'm
able to maintain routed connectivity to remote systems, even if I set
/proc/sys/net/ipv4/rt_cache_rebuild_count to -1, which forces rt_caching to
return false.
Signed-off-by: Neil Horman <nhorman@redhat.com>
Reported-by: Jarek Poplawski <jarkao2@gmail.com>
Reported-by: Maxime Bizon <mbizon@freebox.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (43 commits)
via-velocity: Fix velocity driver unmapping incorrect size.
mlx4_en: Remove redundant refill code on RX
mlx4_en: Removed redundant check on lso header size
mlx4_en: Cancel port_up check in transmit function
mlx4_en: using stop/start_all_queues
mlx4_en: Removed redundant skb->len check
mlx4_en: Counting all the dropped packets on the TX side
usbnet cdc_subset: fix issues talking to PXA gadgets
Net: qla3xxx, remove sleeping in atomic
ipv4: fix NULL pointer + success return in route lookup path
isdn: clean up documentation index
cfg80211: validate station settings
cfg80211: allow setting station parameters in mesh
cfg80211: allow adding/deleting stations on mesh
ath5k: fix beacon_int handling
MAINTAINERS: Fix Atheros pattern paths
ath9k: restore PS mode, before we put the chip into FULL SLEEP state.
ath9k: wait for beacon frame along with CAB
acer-wmi: fix rfkill conversion
ath5k: avoid PCI FATAL interrupts by restoring RETRY_TIMEOUT disabling
...
Don't drop route if we're not caching
I recently got a report of an oops on a route lookup. Maxime was
testing what would happen if route caching was turned off (doing so by setting
making rt_caching always return 0), and found that it triggered an oops. I
looked at it and found that the problem stemmed from the fact that the route
lookup routines were returning success from their lookup paths (which is good),
but never set the **rp pointer to anything (which is bad). This happens because
in rt_intern_hash, if rt_caching returns false, we call rt_drop and return 0.
This almost emulates slient success. What we should be doing is assigning *rp =
rt and _not_ dropping the route. This way, during slow path lookups, when we
create a new route cache entry, we don't immediately discard it, rather we just
don't add it into the cache hash table, but we let this one lookup use it for
the purpose of this route request. Maxime has tested and reports it prevents
the oops. There is still a subsequent routing issue that I'm looking into
further, but I'm confident that, even if its related to this same path, this
patch makes sense to take.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (55 commits)
netxen: fix tx ring accounting
netxen: fix detection of cut-thru firmware mode
forcedeth: fix dma api mismatches
atm: sk_wmem_alloc initial value is one
net: correct off-by-one write allocations reports
via-velocity : fix no link detection on boot
Net / e100: Fix suspend of devices that cannot be power managed
TI DaVinci EMAC : Fix rmmod error
net: group address list and its count
ipv4: Fix fib_trie rebalancing, part 2
pkt_sched: Update drops stats in act_police
sky2: version 1.23
sky2: add GRO support
sky2: skb recycling
sky2: reduce default transmit ring
sky2: receive counter update
sky2: fix shutdown synchronization
sky2: PCI irq issues
sky2: more receive shutdown
sky2: turn off pause during shutdown
...
Manually fix trivial conflict in net/core/skbuff.c due to kmemcheck
commit 2b85a34e91
(net: No more expensive sock_hold()/sock_put() on each tx)
changed initial sk_wmem_alloc value.
We need to take into account this offset when reporting
sk_wmem_alloc to user, in PROC_FS files or various
ioctls (SIOCOUTQ/TIOCOUTQ)
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
My previous patch, which explicitly delays freeing of tnodes by adding
them to the list to flush them after the update is finished, isn't
strict enough. It treats exceptionally tnodes without parent, assuming
they are newly created, so "invisible" for the read side yet.
But the top tnode doesn't have parent as well, so we have to exclude
all exceptions (at least until a better way is found). Additionally we
need to move rcu assignment of this node before flushing, so the
return type of the trie_rebalance() function is changed.
Reported-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The use of bitfields here would lead to false positive warnings with
kmemcheck. Silence them.
(Additionally, one erroneous comment related to the bitfield was also
fixed.)
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
While doing trie_rebalance(): resize(), inflate(), halve() RCU free
tnodes before updating their parents. It depends on RCU delaying the
real destruction, but if RCU readers start after call_rcu() and before
parent update they could access freed memory.
It is currently prevented with preempt_disable() on the update side,
but it's not safe, except maybe classic RCU, plus it conflicts with
memory allocations with GFP_KERNEL flag used from these functions.
This patch explicitly delays freeing of tnodes by adding them to the
list, which is flushed after the update is finished.
Reported-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv4:
- make PIM register vifs netns local
- set the netns when a PIM register vif is created
- make PIM available in all network namespaces (if CONFIG_IP_PIMSM_V2)
by adding the protocol handler when multicast routing is initialized
IPv6:
- make PIM register vifs netns local
- make PIM available in all network namespaces (if CONFIG_IPV6_PIMSM_V2)
by adding the protocol handler when multicast routing is initialized
Signed-off-by: Tom Goff <thomas.goff@boeing.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Removed the statements about ARP cache size as this config option does
not affect it. The cache size is controlled by neigh_table gc thresholds.
Remove also expiremental and obsolete markings as the API originally
intended for arp caching is useful for implementing ARP-like protocols
(e.g. NHRP) in user space and has been there for a long enough time.
Signed-off-by: Timo Teras <timo.teras@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
For the sake of power saver lovers, use a deferrable timer to fire
rt_check_expire()
As some big routers cache equilibrium depends on garbage collection
done in time, we take into account elapsed time between two
rt_check_expire() invocations to adjust the amount of slots we have to
check.
Based on an initial idea and patch from Tero Kristo
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Tero Kristo <tero.kristo@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix build error introduced by commit bb70dfa5 (netfilter: xtables:
consolidate comefrom debug cast access):
net/ipv4/netfilter/ip_tables.c: In function 'ipt_do_table':
net/ipv4/netfilter/ip_tables.c:421: error: 'comefrom' undeclared (first use in this function)
net/ipv4/netfilter/ip_tables.c:421: error: (Each undeclared identifier is reported only once
net/ipv4/netfilter/ip_tables.c:421: error: for each function it appears in.)
Signed-off-by: Patrick McHardy <kaber@trash.net>
One of the problem with sock memory accounting is it uses
a pair of sock_hold()/sock_put() for each transmitted packet.
This slows down bidirectional flows because the receive path
also needs to take a refcount on socket and might use a different
cpu than transmit path or transmit completion path. So these
two atomic operations also trigger cache line bounces.
We can see this in tx or tx/rx workloads (media gateways for example),
where sock_wfree() can be in top five functions in profiles.
We use this sock_hold()/sock_put() so that sock freeing
is delayed until all tx packets are completed.
As we also update sk_wmem_alloc, we could offset sk_wmem_alloc
by one unit at init time, until sk_free() is called.
Once sk_free() is called, we atomic_dec_and_test(sk_wmem_alloc)
to decrement initial offset and atomicaly check if any packets
are in flight.
skb_set_owner_w() doesnt call sock_hold() anymore
sock_wfree() doesnt call sock_put() anymore, but check if sk_wmem_alloc
reached 0 to perform the final freeing.
Drawback is that a skb->truesize error could lead to unfreeable sockets, or
even worse, prematurely calling __sk_free() on a live socket.
Nice speedups on SMP. tbench for example, going from 2691 MB/s to 2711 MB/s
on my 8 cpu dev machine, even if tbench was not really hitting sk_refcnt
contention point. 5 % speedup on a UDP transmit workload (depends
on number of flows), lowering TX completion cpu usage.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current conntrack code kills the ICMP conntrack entry as soon as
the first reply is received. This is incorrect, as we then see only
the first ICMP echo reply out of several possible duplicates as
ESTABLISHED, while the rest will be INVALID. Also this unnecessarily
increases the conntrackd traffic on H-A firewalls.
Make all the ICMP conntrack entries (including the replied ones)
last for the default of nf_conntrack_icmp{,v6}_timeout seconds.
Signed-off-by: Jan "Yenya" Kasprzak <kas@fi.muni.cz>
Signed-off-by: Patrick McHardy <kaber@trash.net>
The lock "protects" an assignment and a comparision of an integer.
When the caller of device_cmp() evaluates the result, nat->masq_index
may already have been changed (regardless if the lock is there or not).
So, the lock either has to be held during nf_ct_iterate_cleanup(),
or can be removed.
This does the latter.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
ip_mc_drop_socket() method is declared in linux/igmp.h, which
is included anyhow in af_inet.c. So there is no need for this declaration.
This patch removes it from af_inet.c.
Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Define three accessors to get/set dst attached to a skb
struct dst_entry *skb_dst(const struct sk_buff *skb)
void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst)
void skb_dst_drop(struct sk_buff *skb)
This one should replace occurrences of :
dst_release(skb->dst)
skb->dst = NULL;
Delete skb->dst field
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Define skb_rtable(const struct sk_buff *skb) accessor to get rtable from skb
Delete skb->rtable field
Setting rtable is not allowed, just set dst instead as rtable is an alias.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch simplifies the conntrack event caching system by removing
several events:
* IPCT_[*]_VOLATILE, IPCT_HELPINFO and IPCT_NATINFO has been deleted
since the have no clients.
* IPCT_COUNTER_FILLING which is a leftover of the 32-bits counter
days.
* IPCT_REFRESH which is not of any use since we always include the
timeout in the messages.
After this patch, the existing events are:
* IPCT_NEW, IPCT_RELATED and IPCT_DESTROY, that are used to identify
addition and deletion of entries.
* IPCT_STATUS, that notes that the status bits have changes,
eg. IPS_SEEN_REPLY and IPS_ASSURED.
* IPCT_PROTOINFO, that reports that internal protocol information has
changed, eg. the TCP, DCCP and SCTP protocol state.
* IPCT_HELPER, that a helper has been assigned or unassigned to this
entry.
* IPCT_MARK and IPCT_SECMARK, that reports that the mark has changed, this
covers the case when a mark is set to zero.
* IPCT_NATSEQADJ, to report that there's updates in the NAT sequence
adjustment.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
After some discussion offline with Christoph Lameter and David Stevens
regarding multicast behaviour in Linux, I'm submitting a slightly
modified patch from the one Christoph submitted earlier.
This patch provides a new socket option IP_MULTICAST_ALL.
In this case, default behaviour is _unchanged_ from the current
Linux standard. The socket option is set by default to provide
original behaviour. Sockets wishing to receive data only from
multicast groups they join explicitly will need to clear this
socket option.
Signed-off-by: Nivedita Singhvi <niv@us.ibm.com>
Signed-off-by: Christoph Lameter<cl@linux.com>
Acked-by: David Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Somewhat luckily, I was looking into these parts with very fine
comb because I've made somewhat similar changes on the same
area (conflicts that arose weren't that lucky though). The loop
was very much overengineered recently in commit 915219441d
(tcp: Use SKB queue and list helpers instead of doing it
by-hand), while it basically just wants to know if there are
skbs after 'skb'.
Also it got broken because skb1 = skb->next got translated into
skb1 = skb1->next (though abstracted) improperly. Note that
'skb1' is pointing to previous sk_buff than skb or NULL if at
head. Two things went wrong:
- We'll kfree 'skb' on the first iteration instead of the
skbuff following 'skb' (it would require required SACK reneging
to recover I think).
- The list head case where 'skb1' is NULL is checked too early
and the loop won't execute whereas it previously did.
Conclusion, mostly revert the recent changes which makes the
cset very messy looking but using proper accessor in the
previous-like version.
The effective changes against the original can be viewed with:
git-diff 915219441d566f1da0caa0e262be49b666159e17^ \
net/ipv4/tcp_input.c | sed -n -e '57,70 p'
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
ipgre_tunnel_xmit() might need skb->dst, so tell dev_hard_start_xmit()
to no release it.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ipip_tunnel_xmit() might need skb->dst, so tell dev_hard_start_xmit()
to no release it.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no need to repeatedly check flush when comparing TCP
options for GRO as it will be false 99% of the time where it
matters.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch optimises the IPv4 GRO code by using 32-bit loads
(instead of 16-bit ones) on the ID and length checks in the receive
function.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
For the overwhelming majority of cases, skb_gro_header's return
value cannot be NULL. Yet we must check it because of its current
form. This patch splits it up into multiple functions in order
to avoid this.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of checking len > mss || len == 0, we can accomplish
both by checking (len - 1) > mss using the unsigned wraparound.
At nearly a million times a second, this might just help.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The window has already been checked as part of the flag word
so there is no need to check it explicitly.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of doing two 16-bit operations for the source/destination
ports, we can do one 32-bit operation to take care both.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes ssthresh accounting issues in tcp_vegas when cwnd decreases
Signed-off-by: Doug Leith <doug.leith@nuim.ie>
Signed-off-by: David S. Miller <davem@davemloft.net>
It seems we can fix this by disabling preemption while we re-balance the
trie. This is with the CONFIG_CLASSIC_RCU. It's been stress-tested at high
loads continuesly taking a full BGP table up/down via iproute -batch.
Note. fib_trie is not updated for CONFIG_PREEMPT_RCU
Reported-by: Andrei Popa
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
The netlink message header (struct nlmsghdr) is an unused parameter in
fill method of fib_rules_ops struct. This patch removes this
parameter from this method and fixes the places where this method is
called.
(include/net/fib_rules.h)
Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander V. Lukyanov found a regression in 2.6.29 and made a complete
analysis found in http://bugzilla.kernel.org/show_bug.cgi?id=13339
Quoted here because its a perfect one :
begin_of_quotation
2.6.29 patch has introduced flexible route cache rebuilding. Unfortunately the
patch has at least one critical flaw, and another problem.
rt_intern_hash calculates rthi pointer, which is later used for new entry
insertion. The same loop calculates cand pointer which is used to clean the
list. If the pointers are the same, rtable leak occurs, as first the cand is
removed then the new entry is appended to it.
This leak leads to unregister_netdevice problem (usage count > 0).
Another problem of the patch is that it tries to insert the entries in certain
order, to facilitate counting of entries distinct by all but QoS parameters.
Unfortunately, referencing an existing rtable entry moves it to list beginning,
to speed up further lookups, so the carefully built order is destroyed.
For the first problem the simplest patch it to set rthi=0 when rthi==cand, but
it will also destroy the ordering.
end_of_quotation
Problematic commit is 1080d709fb
(net: implement emergency route cache rebulds when gc_elasticity is exceeded)
Trying to keep dst_entries ordered is too complex and breaks the fact that
order should depend on the frequency of use for garbage collection.
A possible fix is to make rt_intern_hash() simpler, and only makes
rt_check_expire() a litle bit smarter, being able to cope with an arbitrary
entries order. The added loop is running on cache hot data, while cpu
is prefetching next object, so should be unnoticied.
Reported-and-analyzed-by: Alexander V. Lukyanov <lav@yar.ru>
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rt_check_expire() computes average and standard deviation of chain lengths,
but not correclty reset length to 0 at beginning of each chain.
This probably gives overflows for sum2 (and sum) on loaded machines instead
of meaningful results.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The DHCP spec allows the server to specify the MTU. This can be useful
for netbooting with UDP-based NFS-root on a network using jumbo frames.
This patch allows the kernel IP autoconfiguration to handle this option
correctly.
It would be possible to use initramfs and add a script to set the MTU,
but that seems like a complicated solution if no initramfs is otherwise
necessary, and would bloat the kernel image more than this code would.
This patch was originally submitted to LKML in 2003 by Hans-Peter Jansen.
Signed-off-by: Chris Friesen <cfriesen@nortel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sysctls are unregistered with the rntl_lock held making
it unsafe to unconditionally grab the the rtnl_lock. Instead
we need to call rtnl_trylock and restart the system call
if we can not grab it. Otherwise we could deadlock at unregistration
time.
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit e81963b1 ("ipv4: Make INET_LRO a bool instead of tristate.")
changed this config from tristate to bool. Add default so that it is
consistent with the help text.
Signed-off-by: Frans Pop <elendil@planet.nl>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no need for net/icmp.h header in net/ipv4/fib_frontend.c.
This patch removes the #include net/icmp.h from it.
Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 518a09ef11 (tcp: Fix recvmsg MSG_PEEK influence of
blocking behavior) lets the loop run longer than the race check
did previously expect, so we need to be more careful with this
check and consider the work we have been doing.
I tried my best to deal with urg hole madness too which happens
here:
if (!sock_flag(sk, SOCK_URGINLINE)) {
++*seq;
...
by using additional offset by one but I certainly have very
little interest in testing that part.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Tested-by: Frans Pop <elendil@planet.nl>
Tested-by: Ian Zimmermann <itz@buug.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
If a DHCP server is delayed, it's possible for the client to receive the
DHCPOFFER after it has already sent out a new DHCPDISCOVER message from
a second interface. The client then sends out a DHCPREQUEST from the
second interface, but the server doesn't recognize the device and
rejects the request.
This patch simply tracks the current device being configured and throws
away the OFFER if it is not intended for the current device. A more
sophisticated approach would be to put the OFFER information into the
struct ic_device rather than storing it globally.
Signed-off-by: Chris Friesen <cfriesen@nortel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This code is used as a library by several device drivers,
which select INET_LRO.
If some are modules and some are statically built into the
kernel, we get build failures if INET_LRO is modular.
Signed-off-by: David S. Miller <davem@davemloft.net>
By separating the freeing code from the refcounting decrementing.
Probably reducing icache pressure when we still have reference counts to
go.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_prequeue() refers to the constant value (TCP_RTO_MIN) regardless of
the actual value might be tuned. The following patches fix this and make
tcp_prequeue get the actual value returns from tcp_rto_min().
Signed-off-by: Satoru SATOH <satoru.satoh@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This should be very safe compared with full enabled, so I see
no reason why it shouldn't be done right away. As ECN can only
be negotiated if the SYN sending party is also supporting it,
somebody in the loop probably knows what he/she is doing. If
SYN does not ask for ECN, the server side SYN-ACK is identical
to what it is without ECN. Thus it's quite safe.
The chosen value is safe w.r.t to existing configs which
choose to currently set manually either 0 or 1 but
silently upgrades those who have not explicitly requested
ECN off.
Whether to just enable both sides comes up time to time but
unless that gets done now we can at least make the servers
aware of ECN already. As there are some known problems to occur
if ECN is enabled, it's currently questionable whether there's
any real gain from enabling clients as servers mostly won't
support it anyway (so we'd hit just the negative sides). After
enabling the servers and getting that deployed, the client end
enable really has some potential gain too.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
The x_tables are organized with a table structure and a per-cpu copies
of the counters and rules. On older kernels there was a reader/writer
lock per table which was a performance bottleneck. In 2.6.30-rc, this
was converted to use RCU and the counters/rules which solved the performance
problems for do_table but made replacing rules much slower because of
the necessary RCU grace period.
This version uses a per-cpu set of spinlocks and counters to allow to
table processing to proceed without the cache thrashing of a global
reader lock and keeps the same performance for table updates.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
These are later assigned to other values without being used meanwhile.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On a brand new GRO skb, we cannot call ip_hdr since the header
may lie in the non-linear area. This patch adds the helper
skb_gro_network_header to handle this.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Right now we have no upper limit on the size of the route cache hash table.
On a 128GB POWER6 box it ends up as 32MB:
IP route cache hash table entries: 4194304 (order: 9, 33554432 bytes)
It would be nice to cap this for memory consumption reasons, but a massive
hashtable also causes a significant spike when measuring OS jitter.
With a 32MB hashtable and 4 million entries, rt_worker_func is taking
5 ms to complete. On another system with more memory it's taking 14 ms.
Even though rt_worker_func does call cond_sched() to limit its impact,
in an HPC environment we want to keep all sources of OS jitter to a minimum.
With the patch applied we limit the number of entries to 512k which
can still be overriden by using the rt_entries boot option:
IP route cache hash table entries: 524288 (order: 6, 4194304 bytes)
With this patch rt_worker_func now takes 0.460 ms on the same system.
Signed-off-by: Anton Blanchard <anton@samba.org>
Acked-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The IP MIB (RFC 4293) defines stats for InOctets, OutOctets, InMcastOctets and
OutMcastOctets:
http://tools.ietf.org/html/rfc4293
But it seems we don't track those in any way that easy to separate from other
protocols. This patch adds those missing counters to the stats file. Tested
successfully by me
With help from Eric Dumazet.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
last_synq_overflow eats 4 or 8 bytes in struct tcp_sock, even
though it is only used when a listening sockets syn queue
is full.
We can (ab)use rx_opt.ts_recent_stamp to store the same information;
it is not used otherwise as long as a socket is in listen state.
Move linger2 around to avoid splitting struct mtu_probe
across cacheline boundary on 32 bit arches.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Just noticed while doing some new work that the recent
mid-wq adjustment logic will misbehave when FACK is not
in use (happens either due sysctl'ed off or auto-detected
reordering) because I forgot the relevant TCPCB tagbit.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
inet_register_protosw() function is responsible for adding a new
inet protocol into a global table (inetsw[]) that is used with RCU rules.
As soon as the store of the pointer is done, other cpus might see
this new protocol in inetsw[], so we have to make sure new protocol
is ready for use. All pending memory updates should thus be committed
to memory before setting the pointer.
This is correctly done using rcu_assign_pointer()
synchronize_net() is typically used at unregister time, after
unsetting the pointer, to make sure no other cpu is still using
the object we want to dismantle. Using it at register time
is only adding an artificial delay that could hide a real bug,
and this bug could popup if/when synchronize_rcu() can proceed
faster than now.
This saves about 13 ms on boot time on a HZ=1000 8 cpus machine ;)
(4 calls to inet_register_protosw(), and about 3200 us per call)
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After calling skb_gro_receive skb->len can no longer be relied
on since if the skb was merged using frags, then its pages will
have been removed and the length reduced.
This caused tcp_gro_receive to prematurely end merging which
resulted in suboptimal performance with ixgbe.
The fix is to store skb->len on the stack.
Reported-by: Mark Wagner <mwagner@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The removal of the SAME target accidentally removed one feature that is
not available from the normal NAT targets so far, having multi-range
mappings that use the same mapping for each connection from a single
client. The current behaviour is to choose the address from the range
based on source and destination IP, which breaks when communicating
with sites having multiple addresses that require all connections to
originate from the same IP address.
Introduce a IP_NAT_RANGE_PERSISTENT option that controls whether the
destination address is taken into account for selecting addresses.
http://bugzilla.kernel.org/show_bug.cgi?id=12954
Signed-off-by: Patrick McHardy <kaber@trash.net>
A long-standing feature in tcp_init_metrics() is such that
any of its goto reset prevents call to tcp_init_cwnd().
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (54 commits)
glge: remove unused #include <version.h>
dnet: remove unused #include <version.h>
tcp: miscounts due to tcp_fragment pcount reset
tcp: add helper for counter tweaking due mid-wq change
hso: fix for the 'invalid frame length' messages
hso: fix for crash when unplugging the device
fsl_pq_mdio: Fix compile failure
fsl_pq_mdio: Revive UCC MDIO support
ucc_geth: Pass proper device to DMA routines, otherwise oops happens
i.MX31: Fixing cs89x0 network building to i.MX31ADS
tc35815: Fix build error if NAPI enabled
hso: add Vendor/Product ID's for new devices
ucc_geth: Remove unused header
gianfar: Remove unused header
kaweth: Fix locking to be SMP-safe
net: allow multiple dev per napi with GRO
r8169: reset IntrStatus after chip reset
ixgbe: Fix potential memory leak/driver panic issue while setting up Tx & Rx ring parameters
ixgbe: fix ethtool -A|a behavior
ixgbe: Patch to fix driver panic while freeing up tx & rx resources
...
It seems that trivial reset of pcount to one was not sufficient
in tcp_retransmit_skb. Multiple counters experience a positive
miscount when skb's pcount gets lowered without the necessary
adjustments (depending on skb's sacked bits which exactly), at
worst a packets_out miscount can crash at RTO if the write queue
is empty!
Triggering this requires mss change, so bidir tcp or mtu probe or
like.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Tested-by: Uwe Bugla <uwe.bugla@gmx.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
We need full-scale adjustment to fix a TCP miscount in the next
patch, so just move it into a helper and call for that from the
other places.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 784544739a
(netfilter: iptables: lock free counters) forgot to disable BH
in arpt_do_table(), ipt_do_table() and ip6t_do_table()
Use rcu_read_lock_bh() instead of rcu_read_lock() cures the problem.
Reported-and-bisected-by: Roman Mindalev <r000n@r000n.net>
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Patrick McHardy <kaber@trash.net>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6:
smack: Add a new '-CIPSO' option to the network address label configuration
netlabel: Cleanup the Smack/NetLabel code to fix incoming TCP connections
lsm: Remove the socket_post_accept() hook
selinux: Remove the "compat_net" compatibility code
netlabel: Label incoming TCP connections correctly in SELinux
lsm: Relocate the IPv4 security_inet_conn_request() hooks
TOMOYO: Fix a typo.
smack: convert smack to standard linux lists
The current NetLabel/SELinux behavior for incoming TCP connections works but
only through a series of happy coincidences that rely on the limited nature of
standard CIPSO (only able to convey MLS attributes) and the write equality
imposed by the SELinux MLS constraints. The problem is that network sockets
created as the result of an incoming TCP connection were not on-the-wire
labeled based on the security attributes of the parent socket but rather based
on the wire label of the remote peer. The issue had to do with how IP options
were managed as part of the network stack and where the LSM hooks were in
relation to the code which set the IP options on these newly created child
sockets. While NetLabel/SELinux did correctly set the socket's on-the-wire
label it was promptly cleared by the network stack and reset based on the IP
options of the remote peer.
This patch, in conjunction with a prior patch that adjusted the LSM hook
locations, works to set the correct on-the-wire label format for new incoming
connections through the security_inet_conn_request() hook. Besides the
correct behavior there are many advantages to this change, the most significant
is that all of the NetLabel socket labeling code in SELinux now lives in hooks
which can return error codes to the core stack which allows us to finally get
ride of the selinux_netlbl_inode_permission() logic which greatly simplfies
the NetLabel/SELinux glue code. In the process of developing this patch I
also ran into a small handful of AF_INET6 cleanliness issues that have been
fixed which should make the code safer and easier to extend in the future.
Signed-off-by: Paul Moore <paul.moore@hp.com>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: James Morris <jmorris@namei.org>
The current placement of the security_inet_conn_request() hooks do not allow
individual LSMs to override the IP options of the connection's request_sock.
This is a problem as both SELinux and Smack have the ability to use labeled
networking protocols which make use of IP options to carry security attributes
and the inability to set the IP options at the start of the TCP handshake is
problematic.
This patch moves the IPv4 security_inet_conn_request() hooks past the code
where the request_sock's IP options are set/reset so that the LSM can safely
manipulate the IP options as needed. This patch intentionally does not change
the related IPv6 hooks as IPv6 based labeling protocols which use IPv6 options
are not currently implemented, once they are we will have a better idea of
the correct placement for the IPv6 hooks.
Signed-off-by: Paul Moore <paul.moore@hp.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: James Morris <jmorris@namei.org>
Conflicts:
arch/sparc/kernel/time_64.c
drivers/gpu/drm/drm_proc.c
Manual merge to resolve build warning due to phys_addr_t type change
on x86:
drivers/gpu/drm/drm_info.c
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Use "hlist_nulls" infrastructure we added in 2.6.29 for RCUification of UDP & TCP.
This permits an easy conversion from call_rcu() based hash lists to a
SLAB_DESTROY_BY_RCU one.
Avoiding call_rcu() delay at nf_conn freeing time has numerous gains.
First, it doesnt fill RCU queues (up to 10000 elements per cpu).
This reduces OOM possibility, if queued elements are not taken into account
This reduces latency problems when RCU queue size hits hilimit and triggers
emergency mode.
- It allows fast reuse of just freed elements, permitting better use of
CPU cache.
- We delete rcu_head from "struct nf_conn", shrinking size of this structure
by 8 or 16 bytes.
This patch only takes care of "struct nf_conn".
call_rcu() is still used for less critical conntrack parts, that may
be converted later if necessary.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Commit e1b4b9f ([NETFILTER]: {ip,ip6,arp}_tables: fix exponential worst-case
search for loops) introduced a regression in the loop detection algorithm,
causing sporadic incorrectly detected loops.
When a chain has already been visited during the check, it is treated as
having a standard target containing a RETURN verdict directly at the
beginning in order to not check it again. The real target of the first
rule is then incorrectly treated as STANDARD target and checked not to
contain invalid verdicts.
Fix by making sure the rule does actually contain a standard target.
Based on patch by Francis Dupont <Francis_Dupont@isc.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
We use same not trivial helper function in four places. We can factorize it.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
The ipv6 version of bind_conflict code calls ipv6_rcv_saddr_equal()
which at times wrongly identified intersections between addresses.
It particularly broke down under a few instances and caused erroneous
bind conflicts.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arches without efficient unaligned access can still perform a loop
assuming 16bit alignment in ifname_compare()
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Discard incoming packets whose ack field iincludes data not yet sent.
This is consistent with RFC 793 Section 3.9.
Change tcp_ack() to distinguish between too-small and too-large ack
field values. Keep segments with too-large ack fields out of the fast
path, and change slow path to discard them.
Reported-by: Oliver Zheng <mailinglists+netdev@oliverzheng.com>
Signed-off-by: John Dykstra <john.dykstra1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_sack_swap seems unnecessary so I pushed swap to the caller.
Also removed comment that seemed then pointless, and added include
when not already there. Compile tested.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
dev can be NULL in ip[6]_frag_reasm for skb's coming from RAW sockets.
Quagga's OSPFD sends fragmented packets on a RAW socket, when netfilter
conntrack reassembles them on the OUTPUT path you hit this code path.
You can test it with something like "hping2 -0 -d 2000 -f AA.BB.CC.DD"
With help from Jarek Poplawski.
Signed-off-by: Jorge Boncompte [DTI2] <jorge@dti2.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes an unused parameter (addr_len) from tcp_recv_urg()
method in net/ipv4/tcp.c.
Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ip_queue module is missing the net-pf-16-proto-3 alias that would
causae it to be auto-loaded when a socket of that type is opened. This
patch adds the alias.
Signed-off-by: Scott James Remnant <scott@canonical.com>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
This patch increases the statistics of packets drop if the sequence
adjustment fails in ipv4_confirm().
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
This patch modifies nf_log to use a linked list of loggers for each
protocol. This list of loggers is read and write protected with a
mutex.
This patch separates registration and binding. To be used as
logging module, a module has to register calling nf_log_register()
and to bind to a protocol it has to call nf_log_bind_pf().
This patch also converts the logging modules to the new API. For nfnetlink_log,
it simply switchs call to register functions to call to bind function and
adds a call to nf_log_register() during init. For other modules, it just
remove a const flag from the logger structure and replace it with a
__read_mostly.
Signed-off-by: Eric Leblond <eric@inl.fr>
Signed-off-by: Patrick McHardy <kaber@trash.net>
It's not too likely to happen, would basically require crafted
packets (must hit the max guard in tcp_bound_to_half_wnd()).
It seems that nothing that bad would happen as there's tcp_mems
and congestion window that prevent runaway at some point from
hurting all too much (I'm not that sure what all those zero
sized segments we would generate do though in write queue).
Preventing it regardless is certainly the best way to go.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Evgeniy Polyakov <zbr@ioremap.net>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
The results is very unlikely change every so often so we
hardly need to divide again after doing that once for a
connection. Yet, if divide still becomes necessary we
detect that and do the right thing and again settle for
non-divide state. Takes the u16 space which was previously
taken by the plain xmit_size_goal.
This should take care part of the tso vs non-tso difference
we found earlier.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
There's very little need for most of the callsites to get
tp->xmit_goal_size updated. That will cost us divide as is,
so slice the function in two. Also, the only users of the
tp->xmit_goal_size are directly behind tcp_current_mss(),
so there's no need to store that variable into tcp_sock
at all! The drop of xmit_goal_size currently leaves 16-bit
hole and some reorganization would again be necessary to
change that (but I'm aiming to fill that hole with u16
xmit_goal_size_segs to cache the results of the remaining
divide to get that tso on regression).
Bring xmit_goal_size parts into tcp.c
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Evgeniy Polyakov <zbr@ioremap.net>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
It seems that no variables clash such that we couldn't do
the check just once later on. Therefore move it.
Also kill dead obvious comment, dead argument and add
unlikely since this mtu probe does not happen too often.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wow, it was quite tricky to merge that stream of negations
but I think I finally got it right:
check & replace_ts_recent:
(s32)(rcv_tsval - ts_recent) >= 0 => 0
(s32)(ts_recent - rcv_tsval) <= 0 => 0
discard:
(s32)(ts_recent - rcv_tsval) > TCP_PAWS_WINDOW => 1
(s32)(ts_recent - rcv_tsval) <= TCP_PAWS_WINDOW => 0
I toggled the return values of tcp_paws_check around since
the old encoding added yet-another negation making tracking
of truth-values really complicated.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
I've already forgotten what for this was necessary, anyway
it's no longer used (if it ever was).
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the pure assignment case, the earlier zeroing is
still in effect.
David S. Miller raised concerns if the ifs are there to avoid
dirtying cachelines. I came to these conclusions:
> We'll be dirty it anyway (now that I check), the first "real" statement
> in tcp_rcv_established is:
>
> tp->rx_opt.saw_tstamp = 0;
>
> ...that'll land on the same dword. :-/
>
> I suppose the blocks are there just because they had more complexity
> inside when they had to calculate the eff_sacks too (maybe it would
> have been better to just remove them in that drop-patch so you would
> have had less head-ache :-)).
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Windows (XP at least) hosts on boot, with configured static ip, performing
address conflict detection, which is defined in RFC3927.
Here is quote of important information:
"
An ARP announcement is identical to the ARP Probe described above,
except that now the sender and target IP addresses are both set
to the host's newly selected IPv4 address.
"
But it same time this goes wrong with RFC5227.
"
The 'sender IP address' field MUST be set to all zeroes; this is to avoid
polluting ARP caches in other hosts on the same link in the case
where the address turns out to be already in use by another host.
"
When ARP proxy configured, it must not answer to both cases, because
it is address conflict verification in any case. For Windows it is just
causing to detect false "ip conflict". Already there is code for RFC5227, so
just trivially we just check also if source ip == target ip.
Signed-off-by: Denys Fedoryshchenko <denys@visp.net.lb>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some systems send SYN packets with apparently wrong RFC1323 timestamp
option values [timestamp tsval=0 tsecr=0].
It might be for security reasons (http://www.secuobs.com/plugs/25220.shtml )
Linux TCP stack ignores this option and sends back a SYN+ACK packet
without timestamp option, thus many TCP flows cannot use timestamps
and lose some benefit of RFC1323.
Other operating systems seem to not care about initial tsval value, and let
tcp flows to negotiate timestamp option.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Protocols that use packet_type can be __read_mostly section for better
locality. Elminate any unnecessary initializations of NULL.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To remove the possibility of packets flying around when network
devices are being cleaned up use reisger_pernet_subsys instead of
register_pernet_device.
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Acked-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Recently I had a kernel panic in icmp_send during a network namespace
cleanup. There were packets in the arp queue that failed to be sent
and we attempted to generate an ICMP host unreachable message, but
failed because icmp_sk_exit had already been called.
The network devices are removed from a network namespace and their
arp queues are flushed before we do attempt to shutdown subsystems
so this error should have been impossible.
It turns out icmp_init is using register_pernet_device instead
of register_pernet_subsys. Which resulted in icmp being shut down
while we still had the possibility of packets in flight, making
a nasty NULL pointer deference in interrupt context possible.
Changing this to register_pernet_subsys fixes the problem in
my testing.
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Acked-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The above functions from include/net/tcp.h have been defined with an
argument that they never use. The argument is 'u32 ack' which is never
used inside the function body, and thus it can be removed. The rest of
the patch involves the necessary changes to the function callers of the
above two functions.
Signed-off-by: Hantzis Fotis <xantzis@ceid.upatras.gr>
Signed-off-by: David S. Miller <davem@davemloft.net>
I guess these fields were one day 16-bit in the struct but
nowadays they're just using 8 bits anyway.
This is just a precaution, didn't result any change in my
case but who knows what all those varying gcc versions &
options do. I've been told that 16-bit is not so nice with
some cpus.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
copied was assigned zero right before the goto, so if (copied)
cannot ever be true.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Also fixes insignificant bug that would cause sending of stale
SACK block (would occur in some corner cases).
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
It seems that implementation in yeah was inconsistent to what
other did as it would increase cwnd one ack earlier than the
others do.
Size benefits:
bictcp_cong_avoid | -36
tcp_cong_avoid_ai | +52
bictcp_cong_avoid | -34
tcp_scalable_cong_avoid | -36
tcp_veno_cong_avoid | -12
tcp_yeah_cong_avoid | -38
= -104 bytes total
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to what is done elsewhere in TCP code when double
state checks are being done.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Redundant checks made indentation impossible to follow.
However, it might be useful to make this ca_state+is_sack
indexed array.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some comment about its current state added. So far I have
seen very few cases where the thing is actually useful,
usually just marginally (though admittedly I don't usually
see top of window losses where it seems possible that there
could be some gain), instead, more often the cases suffer
from L-marking spike which is certainly not desirable
(I'll bury improving it to my todo list, but on a low
prio position).
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arnd Hannemann <hannemann@nets.rwth-aachen.de> noticed and was
puzzled by the fact that !tcp_is_fack(tp) leads to early return
near the beginning and the later on tcp_is_fack(tp) was still
used in an if condition. The later check was a left-over from
RFC3517 SACK stuff (== !tcp_is_fack(tp) behavior nowadays) as
there wasn't clear way how to handle this particular check
cheaply in the spirit of RFC3517 (using only SACK blocks, not
holes + SACK blocks as with FACK). I sort of left it there as
a reminder but since it's confusing other people just remove
it and comment the missing-feature stuff instead.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Arnd Hannemann <hannemann@nets.rwth-aachen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
If cur_mss grew very recently so that the previously G/TSOed skb
now fits well into a single segment it would get send up in
parts unless we calculate # of segments again. This corner-case
could happen eg. after mtu probe completes or less than
previously sack blocks are required for the opposite direction.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
1) We didn't remove any skbs, so no need to handle stale refs.
2) scoreboard_skb_hint is trivial, no timestamps were changed
so no need to clear that one
3) lost_skb_hint needs tweaking similar to that of
tcp_sacktag_one().
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
If skb can be sent right away, we certainly should do that
if it's in the middle of the queue because it won't get
more data into it.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is possible that lost_cnt_hint gets underflow in
tcp_clean_rtx_queue because the cumulative ACK can cover
the segment where lost_skb_hint points to only partially,
which means that the hint is not cleared, opposite to what
my (earlier) comment claimed.
Also I don't agree what I ended up writing about non-trivial
case there to be what I intented to say. It was not supposed
to happen that the hint won't get cleared and we underflow
in any scenario.
In general, this is quite hard to trigger in practice.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Backtracking to sacked skbs is a horrible performance killer
since the hint cannot be advanced successfully past them...
...And it's totally unnecessary too.
In theory this is 2.6.27..28 regression but I doubt anybody
can make .28 to have worse performance because of other TCP
improvements.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
There's conflicting assumptions in shifting, the caller assumes
that dupsack results in S'ed skbs (or a part of it) for sure but
never gave a hint to tcp_sacktag_one when dsack is actually in
use. Thus DSACK retrans_out -= pcount was not taken and the
counter became out of sync. Remove obstacle from that information
flow to get DSACKs accounted in tcp_sacktag_one as expected.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Tested-by: Denys Fedoryshchenko <denys@visp.net.lb>
Signed-off-by: David S. Miller <davem@davemloft.net>
Impact: Attribute function with __releases(...)
Fix this sparse warning:
net/ipv4/inet_fragment.c:276:35: warning: context imbalance in 'inet_frag_find' - unexpected unlock
Signed-off-by: Hannes Eder <hannes@hanneseder.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Impact: build fix
API was changed, but not all usage sites were converted:
net/ipv4/route.c: In function ‘ip_rt_init’:
net/ipv4/route.c:3379: error: too few arguments to function ‘__alloc_percpu’
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The functions time_before is more robust for comparing
jiffies against other values.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The functions time_before is more robust for comparing
jiffies against other values.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch changes the return value of nlmsg_notify() as follows:
If NETLINK_BROADCAST_ERROR is set by any of the listeners and
an error in the delivery happened, return the broadcast error;
else if there are no listeners apart from the socket that
requested a change with the echo flag, return the result of the
unicast notification. Thus, with this patch, the unicast
notification is handled in the same way of a broadcast listener
that has set the NETLINK_BROADCAST_ERROR socket flag.
This patch is useful in case that the caller of nlmsg_notify()
wants to know the result of the delivery of a netlink notification
(including the broadcast delivery) and take any action in case
that the delivery failed. For example, ctnetlink can drop packets
if the event delivery failed to provide reliable logging and
state-synchronization at the cost of dropping packets.
This patch also modifies the rtnetlink code to ignore the return
value of rtnl_notify() in all callers. The function rtnl_notify()
(before this patch) returned the error of the unicast notification
which makes rtnl_set_sk_err() reports errors to all listeners. This
is not of any help since the origin of the change (the socket that
requested the echoing) notices the ENOBUFS error if the notification
fails and should resync itself.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The IP_ADVANCED_ROUTER Kconfig describes the rp_filter
proc option. Recent changes added a loose mode.
Instead of documenting this change too places, refer to
the document describing it:
Documentation/networking/ip-sysctl.txt
I'm considering moving the rp_filter description away
from the Kconfig file into ip-sysctl.txt.
Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
netns: fix double free at netns creation
veth : add the set_mac_address capability
sunlance: Beyond ARRAY_SIZE of ib->btx_ring
sungem: another error printed one too early
ISDN: fix sc/shmem printk format warning
SMSC: timeout reaches -1
smsc9420: handle magic field of ethtool_eeprom
sundance: missing parentheses?
smsc9420: fix another postfixed timeout
wimax/i2400m: driver loads firmware v1.4 instead of v1.3
vlan: Update skb->mac_header in __vlan_put_tag().
cxgb3: Add support for PCI ID 0x35.
tcp: remove obsoleted comment about different passes
TG3: &&/|| confusion
ATM: misplaced parentheses?
net/mv643xx: don't disable the mib timer too early and lock properly
net/mv643xx: use GFP_ATOMIC while atomic
atl1c: Atheros L1C Gigabit Ethernet driver
net: Kill skb_truesize_check(), it only catches false-positives.
net: forcedeth: Fix wake-on-lan regression
To remove the possibility of packets flying around when network
devices are being cleaned up use reisger_pernet_subsys instead of
register_pernet_device.
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Acked-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Recently I had a kernel panic in icmp_send during a network namespace
cleanup. There were packets in the arp queue that failed to be sent
and we attempted to generate an ICMP host unreachable message, but
failed because icmp_sk_exit had already been called.
The network devices are removed from a network namespace and their
arp queues are flushed before we do attempt to shutdown subsystems
so this error should have been impossible.
It turns out icmp_init is using register_pernet_device instead
of register_pernet_subsys. Which resulted in icmp being shut down
while we still had the possibility of packets in flight, making
a nasty NULL pointer deference in interrupt context possible.
Changing this to register_pernet_subsys fixes the problem in
my testing.
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Acked-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
While going through net/ipv4/Kconfig cleanup whitespaces.
Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
The reverse path filter (rp_filter) will NOT get enabled
when enabling forwarding. Read the code and tested in
in practice.
Most distributions do enable it in startup scripts.
Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Get rid of compile warning about non-const format
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Extend existing reverse path filter option to allow strict or loose
filtering. (See http://en.wikipedia.org/wiki/Reverse_path_filtering).
For compatibility with existing usage, the value 1 is chosen for strict mode
and 2 for loose mode.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The CIPSO protocol engine incorrectly stated that the FIPS-188 specification
could be found in the kernel's Documentation directory. This patch corrects
that by removing the comment and directing users to the FIPS-188 documented
hosted online. For the sake of completeness I've also included a link to the
CIPSO draft specification on the NetLabel website.
Thanks to Randy Dunlap for spotting the error and letting me know.
Signed-off-by: Paul Moore <paul.moore@hp.com>
Signed-off-by: James Morris <jmorris@namei.org>
Our TCP stack does not set the urgent flag if the urgent pointer
does not fit in 16 bits, i.e., if it is more than 64K from the
sequence number of a packet.
This behaviour is different from the BSDs, and clearly contradicts
the purpose of urgent mode, which is to send the notification
(though not necessarily the associated data) as soon as possible.
Our current behaviour may in fact delay the urgent notification
indefinitely if the receiver window does not open up.
Simply matching BSD however may break legacy applications which
incorrectly rely on the out-of-band delivery of urgent data, and
conversely the in-band delivery of non-urgent data.
Alexey Kuznetsov suggested a safe solution of following BSD only
if the urgent pointer itself has not yet been transmitted. This
way we guarantee that when the remote end sees the packet with
non-urgent data marked as urgent due to wrap-around we would have
advanced the urgent pointer beyond, either to the actual urgent
data or to an as-yet untransmitted packet.
The only potential downside is that applications on the remote
end may see multiple SIGURG notifications. However, this would
occur anyway with other TCP stacks. More importantly, the outcome
of such a duplicate notification is likely to be harmless since
the signal itself does not carry any information other than the
fact that we're in urgent mode.
Thanks to Ilpo Järvinen for fixing a critical bug in this and
Jeff Chua for reporting that bug.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
While doing oprofile tests I noticed two loops are not properly unrolled by gcc
Using a hand coded unrolled loop provides nice speedup : ipt_do_table
credited of 2.52 % of cpu instead of 3.29 % in tbench.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
The reader/writer lock in ip_tables is acquired in the critical path of
processing packets and is one of the reasons just loading iptables can cause
a 20% performance loss. The rwlock serves two functions:
1) it prevents changes to table state (xt_replace) while table is in use.
This is now handled by doing rcu on the xt_table. When table is
replaced, the new table(s) are put in and the old one table(s) are freed
after RCU period.
2) it provides synchronization when accesing the counter values.
This is now handled by swapping in new table_info entries for each cpu
then summing the old values, and putting the result back onto one
cpu. On a busy system it may cause sampling to occur at different
times on each cpu, but no packet/byte counts are lost in the process.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Sucessfully tested on my dual quad core machine too, but iptables only (no ipv6 here)
BTW, my new "tbench 8" result is 2450 MB/s, (it was 2150 MB/s not so long ago)
Acked-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
This prepares for a real __alloc_percpu, by adding an alignment argument.
Only one place uses __alloc_percpu directly, and that's for a string.
tj: af_inet also uses __alloc_percpu(), update it.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Jens Axboe <axboe@kernel.dk>
Concern has been expressed about the changing Kconfig options.
Provide the old options that forward-select.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
This is obsolete since the passes got combined.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Suggested by: James King <t.james.king@gmail.com>
Similarly to commit c9fd496809, merge
TTL and HL. Since HL does not depend on any IPv6-specific function,
no new module dependencies would arise.
With slight adjustments to the Kconfig help text.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
x86 and powerpc can perform long word accesses in an efficient maner.
We can use this to unroll two loops in arp_packet_match(), to
perform arithmetic on long words instead of bytes. This is a win
on x86_64 for example.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Later patches change the locking on xt_table and the initialization of
the lock element is not needed since the lock is always initialized in
xt_table_register anyway.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Impact: syntax fix
Interestingly enough this compiles w/o any complaints:
orphans = percpu_counter_sum_positive(&tcp_orphan_count),
sockets = percpu_counter_sum_positive(&tcp_sockets_allocated),
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instructions for time stamping outgoing packets are take from the
socket layer and later copied into the new skb.
Signed-off-by: Patrick Ohly <patrick.ohly@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
gro: Optimise TCP packet reception
As this function can be called more than half a million times for
10GbE, it's important to optimise it as much as we can.
This patch uses bit ops to logical ops, as well as open coding
memcmp to exploit alignment properties.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
As this function can be called more than half a million times for
10GbE, it's important to optimise it as much as we can.
This patch does some obvious changes to use 2-byte and 4-byte
operations instead of byte-oriented ones where possible. Bit
ops are also used to replace logical ops to reduce branching.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Like the UDP header fix, pskb_may_pull() can potentially
alter the SKB buffer. Thus the saddr and daddr, pointers
may point to the old skb->data buffer.
I haven't seen corruptions, as its only seen if the old
skb->data buffer were reallocated by another user and
written into very quickly (or poison'd by SLAB debugging).
Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
The UDP header pointer assignment must happen after calling
pskb_may_pull(). As pskb_may_pull() can potentially alter the SKB
buffer.
This was exposted by running multicast traffic through the NIU driver,
as it won't prepull the protocol headers into the linear area on
receive.
Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 93821778de (udp: Fix rcv socket
locking) accidentally removed sk_drops increments for UDP IPV4
sockets.
This field can be used to detect incorrect sizing of socket receive
buffers.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk_alloc now sets sk_family so this is redundant. In fact it caught
my eye because sock_init_data already uses sk_family so this is too
late anyway.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
And switch bsockets to atomic_t since it might be changed in parallel.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Evgeniy Polyakov <zbr@ioremap.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
From: Stephen Hemminger <shemminger@vyatta.com>
Fix regression introduced by a9d8f9110d
("inet: Allowing more than 64k connections and heavily optimize
bind(0) time.")
Based upon initial patches and feedback from Evegniy Polyakov and
Eric Dumazet.
From Eric Dumazet:
--------------------
Also there might be a problem at line 175
if (sk->sk_reuse && sk->sk_state != TCP_LISTEN && --attempts >= 0) {
spin_unlock(&head->lock);
goto again;
If we entered inet_csk_get_port() with a non null snum, we can "goto again"
while it was not expected.
--------------------
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds another inet device option to enable gratuitous ARP
when device is brought up or address change. This is handy for
clusters or virtualization.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Base versions handle constant folding now.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Unfortunately simplicity isn't always the best. The fraginfo
interface turned out to be suboptimal. The problem was quite
obvious. For every packet, we have to copy the headers from
the frags structure into skb->head, even though for 99% of the
packets this part is immediately thrown away after the merge.
LRO didn't have this problem because it directly read the headers
from the frags structure.
This patch attempts to address this by creating an interface
that allows GRO to access the headers in the first frag without
having to copy it. Because all drivers that use frags place the
headers in the first frag this optimisation should be enough.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_splice_data_recv has two lengths to consider: the len parameter it
gets from tcp_read_sock, which specifies the amount of data in the skb,
and rd_desc->count, which is the amount of data the splice caller still
wants. Currently it passes just the latter to skb_splice_bits, which then
splices min(rd_desc->count, skb->len - offset) bytes.
Most of the time this is fine, except when the skb contains urgent data.
In that case len goes only up to the urgent byte and is less than
skb->len - offset. By ignoring len tcp_splice_data_recv may a) splice
data tcp_read_sock told it not to, b) return to tcp_read_sock a value > len.
Now, tcp_read_sock doesn't handle used > len and leaves the socket in a
bad state (both sk_receive_queue and copied_seq are bad at that point)
resulting in duplicated data and corruption.
Fix by passing min(rd_desc->count, len) to skb_splice_bits.
Signed-off-by: Dimitris Michailidis <dm@chelsio.com>
Acked-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 9088c56095
(udp: Improve port randomization) introduced a regression for UDP bind() syscall
to null port (getting a random port) in case lot of ports are already in use.
This is because we do about 28000 scans of very long chains (220 sockets per chain),
with many spin_lock_bh()/spin_unlock_bh() calls.
Fix this using a bitmap (64 bytes for current value of UDP_HTABLE_SIZE)
so that we scan chains at most once.
Instead of 250 ms per bind() call, we get after patch a time of 2.9 ms
Based on a report from Vitaly Mayatskikh
Reported-by: Vitaly Mayatskikh <v.mayatskih@gmail.com>
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Tested-by: Vitaly Mayatskikh <v.mayatskih@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of keeping candidate tunnel device from all categories,
keep only one candidate with best score. This optimizes stack
usage and speeds up exit code.
Signed-off-by: Timo Teras <timo.teras@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
This last patch makes the appropriate changes to use and propagate the
network namespace where needed in IPv4 multicast routing code.
This consists mainly in replacing all the remaining init_net occurences
with current netns pointer retrieved from sockets, net devices or
mfc_caches depending on the routines' contexts.
Some routines receive a new 'struct net' parameter to propagate the current
netns:
* vif_add/vif_delete
* ipmr_new_tunnel
* mroute_clean_tables
* ipmr_cache_find
* ipmr_cache_report
* ipmr_cache_unresolved
* ipmr_mfc_add/ipmr_mfc_delete
* ipmr_get_route
* rt_fill_info (in route.c)
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Declare IPv4 multicast forwarding /proc/net entries per-namespace:
/proc/net/ip_mr_vif
/proc/net/ip_mr_cache
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Preliminary work to make IPv4 multicast routing netns-aware.
Declare variable 'reg_vif_num' per-namespace, move into struct netns_ipv4.
At the moment, this variable is only referenced in init_net.
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>