From: Kirill Korotaev <dev@sw.ru>
During OpenVZ stress testing we found that UDP traffic with random source
addresses can cause excessive growth of the rt hash, eventually leading to
OOM and kernel panics.
It was found that on a 4GB i686 system (1048576 total pages, of which
225280 are normal zone pages) the kernel allocates the following route hash:
syslog: IP route cache hash table entries: 262144 (order: 8, 1048576 bytes)
=> ip_rt_max_size = 4194304 entries, i.e. the maximum rt size is
4194304 * 256b = 1GB of RAM > normal_zone
The attached patch removes the HASH_HIGHMEM flag from the
alloc_large_system_hash() call.
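The arithmetic above can be checked with a small userspace sketch. The factor of 16 entries per hash bucket is inferred from the quoted figures (4194304 / 262144), the 4096-byte page size is an assumption for i686, and all names are hypothetical.

```c
#include <stdint.h>

/* Values quoted in, or inferred from, the message above. */
#define SKETCH_RT_HASH_BUCKETS      262144ULL  /* from the syslog line */
#define SKETCH_RT_ENTRIES_PER_BUCKET    16ULL  /* inferred: 4194304 / 262144 */
#define SKETCH_RT_ENTRY_BYTES          256ULL  /* per-entry size used above */
#define SKETCH_PAGE_BYTES             4096ULL  /* assumed i686 page size */
#define SKETCH_NORMAL_ZONE_PAGES    225280ULL

/* Worst-case memory consumed by a completely full route cache. */
static uint64_t sketch_rt_max_bytes(void)
{
	return SKETCH_RT_HASH_BUCKETS * SKETCH_RT_ENTRIES_PER_BUCKET *
	       SKETCH_RT_ENTRY_BYTES;
}

/* Size of the normal zone in bytes. */
static uint64_t sketch_normal_zone_bytes(void)
{
	return SKETCH_NORMAL_ZONE_PAGES * SKETCH_PAGE_BYTES;
}

/* Non-zero when a full route cache cannot fit in the normal zone. */
static int sketch_rt_cache_exceeds_normal_zone(void)
{
	return sketch_rt_max_bytes() > sketch_normal_zone_bytes();
}
```

With these figures the full cache needs 1073741824 bytes while the normal zone holds 922746880 bytes, which is the imbalance the message describes.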
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch will linearize and check there is enough data.
It handles the pprop case as well as avoiding a whole audit of
the routing code.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
If we don't find the item we are looking for, we allocate a new one, and
then grab the lock again and search to see if it has been added while we
did the alloc. If it had been added we need to 'cache_put' the newly
created item that we are never going to use. But as it hasn't been
initialised properly, putting it can cause an oops.
So move the ->init call earlier so that it will always be fully initialised
if we have to put it.
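The ordering described above can be sketched in userspace as follows. This is a simplified single-threaded model, not the sunrpc cache code: the struct, the helper names and the reference-counting scheme are all hypothetical, and the points where the lock would be taken or dropped are only marked by comments.

```c
#include <stdlib.h>

struct sketch_item {
	int key;
	int refcount;			/* meaningful only after sketch_init() */
	struct sketch_item *next;
};

static struct sketch_item *sketch_head;

static void sketch_init(struct sketch_item *it, int key)
{
	it->key = key;
	it->refcount = 1;
	it->next = NULL;
}

/* Drops one reference; relies on refcount having been initialised. */
static void sketch_put(struct sketch_item *it)
{
	if (--it->refcount == 0)
		free(it);
}

static struct sketch_item *sketch_find(int key)
{
	struct sketch_item *it;

	for (it = sketch_head; it; it = it->next)
		if (it->key == key)
			return it;
	return NULL;
}

static struct sketch_item *sketch_lookup(int key)
{
	struct sketch_item *it, *fresh;

	/* lock would be taken here for the first search */
	it = sketch_find(key);
	if (it) {
		it->refcount++;
		return it;
	}

	/* lock dropped: the allocation may sleep */
	fresh = malloc(sizeof(*fresh));
	if (!fresh)
		return NULL;
	sketch_init(fresh, key);	/* initialise BEFORE the second search */

	/* lock retaken: another caller may have added the item meanwhile */
	it = sketch_find(key);
	if (it) {
		it->refcount++;
		sketch_put(fresh);	/* safe, fresh is fully initialised */
		return it;
	}

	fresh->next = sketch_head;
	sketch_head = fresh;
	fresh->refcount++;		/* one reference for the list, one for the caller */
	return fresh;
}
```

If the initialisation were done only after the second search, the put on the losing path would operate on an uninitialised reference count, which is the failure mode the message describes.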
Thanks to Philipp Matthias Hahn <pmhahn@svs.Informatik.Uni-Oldenburg.de>
for reporting the problem.
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In bug #6954, Norbert Reinartz reported the following issue:
"Function lapb_setparms() in file net/lapb/lapb_iface.c checks if the given
parameters are valid. If the given window size is in the range of 8 .. 127,
lapb_setparms() fails and returns an error value of LAPB_INVALUE, even if bit
LAPB_EXTENDED in parms->mode is set.
If bit LAPB_EXTENDED in parms->mode is set and the window size is in the range
of 8 .. 127, the first check "(parms->mode & LAPB_EXTENDED)" results true and
the second check "(parms->window < 1 || parms->window > 127)" results false.
Both checks in conjunction result to false, thus the third check "(parms->window
< 1 || parms->window > 7)" is done by fault.
This third check results true, so that we leave lapb_setparms() by 'goto out_put'.
Seems that this bug doesn't cause any problems, because lapb_setparms() isn't
used to change the default values of LAPB. We are using kernel lapb in our
software project and also change the default parameters of lapb, so we found
this bug"
He also pasted a fix, which I have turned into this patch.
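The validation that the report asks for can be sketched as follows. This is an illustration of the intended logic only, not the patch itself; the function name and the value of the flag bit are placeholders and are not taken from the kernel headers.

```c
#include <stdbool.h>

/* Placeholder bit for this sketch; the real LAPB_EXTENDED value is not assumed. */
#define SKETCH_LAPB_EXTENDED 0x01u

/*
 * Extended mode allows window sizes 1..127, standard mode only 1..7.
 * Exactly one of the two range checks is applied, depending on the mode.
 */
static bool sketch_lapb_window_valid(unsigned int mode, unsigned int window)
{
	if (mode & SKETCH_LAPB_EXTENDED)
		return window >= 1 && window <= 127;
	return window >= 1 && window <= 7;
}
```

The reported bug corresponds to falling through to the 1..7 check even when the extended bit is set, so that a window of 8..127 is rejected.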
Signed-off-by: Diego Calleja <diegocg@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Whenever a transfer is application limited, we are allowed at least an
initial window's worth of data per window, unless cwnd was already
smaller than that.
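A minimal sketch of the rule, assuming the surrounding logic reduces cwnd toward the window actually used by averaging the two (that averaging step is an assumption about the congestion window validation code, not something stated above). The function and parameter names are hypothetical.

```c
#include <stdint.h>

/*
 * cwnd:      current congestion window, in segments
 * cwnd_used: segments actually used during the last window
 * init_win:  initial window, in segments
 *
 * The used window is never considered smaller than the initial window,
 * so the reduction cannot push cwnd below init_win. A cwnd that is
 * already at or below that floor is left untouched.
 */
static uint32_t sketch_cwnd_app_limited(uint32_t cwnd, uint32_t cwnd_used,
					uint32_t init_win)
{
	uint32_t win_used = cwnd_used > init_win ? cwnd_used : init_win;

	if (win_used < cwnd)
		cwnd = (cwnd + win_used) / 2;	/* assumed decay step */
	return cwnd;
}
```

For example, with cwnd = 20, two segments used and an initial window of 4, the result is 12 rather than 11, because the initial window replaces the smaller used value.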
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
The datagram interface of LLC is broken in a couple of ways.
These were discovered when trying to use it to build an out-of-kernel
version of STP.
First it didn't pass the source address of the received packet
in recvfrom(). It needs to copy the source address of received LLC packets
into the socket control block. At the same time fix a security issue
because there was uninitialized data leakage. Every recvfrom call
was just copying out old data.
Second, LLC should not merge multiple packets in one receive call
on datagram sockets. LLC should preserve packet boundaries on
SOCK_DGRAM.
This fix goes against the old historical comments about UNIX98 semantics
but without this fix SOCK_DGRAM is broken and useless. So either ANK's
interpretation was incorrect or the UNIX98 standard was wrong.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Acked-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix code that passes back netlink status messages about
bridge changes. Submitted by Aji_Srinivas@emc.com
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
If we're part way through transmitting a TCP request, and the client
errors, then we need to disconnect and reconnect the TCP socket in order to
avoid confusing the server.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
(cherry picked from 031a50c8b9ea82616abd4a4e18021a25848941ce commit)
Remove the lock_cpu_hotplug()/unlock_cpu_hotplug() calls from
net_dma_rebalance
The lock_cpu_hotplug()/unlock_cpu_hotplug() sequence in
net_dma_rebalance is both incorrect (as pointed out by David Miller)
because lock_cpu_hotplug() may sleep while the net_dma_event_lock
spinlock is held, and unnecessary (as pointed out by Andrew Morton) as
spin_lock() disables preemption which protects from CPU hotplug
events.
Signed-off-by: Chris Leech <christopher.leech@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes a bug in the DECnet routing code where we were
selecting a loopback device in preference to an outward facing device
even when the destination was known non-local. This patch should fix
the problem.
Signed-off-by: Patrick Caulfield <patrick@tykepenguin.com>
Signed-off-by: Steven Whitehouse <steve@chygwyn.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
From: Catherine Zhang <cxzhang@watson.ibm.com>
This patch implements a cleaner fix for the memory leak problem of the
original unix datagram getpeersec patch. Instead of creating a
security context each time a unix datagram is sent, we only create the
security context when the receiver requests it.
This new design requires modification of the current
unix_getsecpeer_dgram LSM hook and addition of two new hooks, namely,
secid_to_secctx and release_secctx. The former retrieves the security
context and the latter releases it. A hook is required for releasing
the security context because it is up to the security module to decide
how that's done. In the case of Selinux, it's a simple kfree
operation.
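The retrieve/release pairing can be illustrated with a userspace sketch. The two helpers below only mimic the roles of the hooks named above; their signatures, the context string and the receive helper are hypothetical, and free() stands in for kfree.

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in for the "retrieve" hook: maps a secid to a heap-allocated
 * context string. Returns 0 on success, -1 on failure. */
static int sketch_secid_to_secctx(unsigned int secid, char **secdata,
				  unsigned int *seclen)
{
	const char *ctx = (secid == 1) ? "system_u:object_r:sketch_t" : NULL;

	if (!ctx)
		return -1;
	*seclen = (unsigned int)strlen(ctx) + 1;
	*secdata = malloc(*seclen);
	if (!*secdata)
		return -1;
	memcpy(*secdata, ctx, *seclen);
	return 0;
}

/* Stand-in for the "release" hook: the module decides how to free. */
static void sketch_release_secctx(char *secdata, unsigned int seclen)
{
	(void)seclen;
	free(secdata);
}

/*
 * Receive path: the datagram carries only the secid. The context string
 * is created here, and only when the receiver asked for it.
 * Returns the number of bytes copied, 0 if not requested, -1 on error.
 */
static int sketch_recv_peersec(unsigned int secid, int want_secctx,
			       char *out, unsigned int outlen)
{
	char *secdata;
	unsigned int seclen;

	if (!want_secctx)
		return 0;
	if (sketch_secid_to_secctx(secid, &secdata, &seclen) != 0)
		return -1;
	if (seclen > outlen) {
		sketch_release_secctx(secdata, seclen);
		return -1;
	}
	memcpy(out, secdata, seclen);
	sketch_release_secctx(secdata, seclen);
	return (int)seclen;
}
```

Because nothing is allocated on the send side, there is no per-datagram context left to leak when the receiver never asks for it.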
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When I tested Linux kernel 2.6.17.7 for the statistic
"ipv6IfStatsOutFragCreates", I found that it was not incremented
correctly. The reference is RFC 2465:
ipv6IfStatsOutFragCreates OBJECT-TYPE
SYNTAX Counter32
MAX-ACCESS read-only
STATUS current
DESCRIPTION
"The number of output datagram fragments that have
been generated as a result of fragmentation at
this output interface."
::= { ipv6IfStatsEntry 15 }
I think there are two issues in the Linux kernel.
1st:
RFC2465 specifies that the counter is "The number of output datagram
fragments...". I think it is better to increase this counter only after
a fragment has been output successfully. It should not be increased when
a fragment is created but fails to be output.
2nd:
Suppose we send a big ICMP/ICMPv6 echo request to a host and receive an
ICMP/ICMPv6 echo reply consisting of several fragments. In the Linux
kernel the first fragmentation occurs in the ICMP layer (perhaps
"transport layer" is more accurate), but this is not the "real"
fragmentation; it only does some "pre-fragmentation": it allocates space
for the data, forms a frag_list, etc. The "real" fragmentation happens in
the IP layer, which sets the offset, the MF flag and so on. So I think
that in the "fast path" of ip_fragment/ip6_fragment, when we send a
fragment that was "pre-fragmented" by the upper layer, we should also
increase "ipv6IfStatsOutFragCreates".
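The first point, counting a fragment only once it has actually been output, can be sketched like this. The statistics struct, the output callback and the helper names are hypothetical and only model the ordering of "output, then count".

```c
struct sketch_frag_stats {
	unsigned long out_frag_creates;
};

/* Example output callback: the first two fragments succeed, later ones fail. */
static int sketch_output_ok_until_third(int index)
{
	return index < 2 ? 0 : -1;
}

/*
 * Sends nfrags fragments through output(). The counter is bumped only
 * after output() reports success, so a fragment that was built but could
 * not be sent is not counted.
 */
static int sketch_send_fragments(struct sketch_frag_stats *st, int nfrags,
				 int (*output)(int index))
{
	int i, err;

	for (i = 0; i < nfrags; i++) {
		err = output(i);
		if (err)
			return err;
		st->out_frag_creates++;
	}
	return 0;
}
```

The same counting step would apply whether the fragments were split in the slow path or arrived already chained from the upper layer, which is the second point above.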
Signed-off-by: Wei Dong <weid@nanjing-fnst.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When I tested Linux kernel 2.6.17.7 for the statistic
"ipv6IfStatsInHdrErrors", I found that this counter was not incremented
correctly. The reference is RFC2465:
ipv6IfStatsInHdrErrors OBJECT-TYPE
SYNTAX Counter32
MAX-ACCESS read-only
STATUS current
DESCRIPTION
"The number of input datagrams discarded due to
errors in their IPv6 headers, including version
number mismatch, other format errors, hop count
exceeded, errors discovered in processing their
IPv6 options, etc."
::= { ipv6IfStatsEntry 2 }
When I send a packet with TTL=0 or TTL=1 to a router that needs to
forward it, the router just sends an ICMPv6 message to tell the sender
TIME_EXCEED and HOPLIMITS, but does not increment this counter (in the
function ip6_forward).
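The missing increment can be sketched as follows. This models only the hop limit check on the forwarding path; the struct, enum and function names are hypothetical, and the ICMPv6 error is represented by a comment.

```c
#include <stdint.h>

struct sketch_in_stats {
	unsigned long in_hdr_errors;
};

enum sketch_fwd_result {
	SKETCH_FWD_OK,
	SKETCH_FWD_DROP_HOPLIMIT
};

/*
 * A packet that must be forwarded but arrives with a hop limit of 0 or 1
 * is discarded. RFC2465 lists "hop count exceeded" among the header
 * errors, so the discard is counted as well as reported to the sender.
 */
static enum sketch_fwd_result
sketch_forward_check(struct sketch_in_stats *st, uint8_t hop_limit)
{
	if (hop_limit <= 1) {
		/* the ICMPv6 time exceeded message would be sent here */
		st->in_hdr_errors++;
		return SKETCH_FWD_DROP_HOPLIMIT;
	}
	return SKETCH_FWD_OK;
}
```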
Signed-off-by: Wei Dong <weid@nanjing-fnst.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We have a more complete solution in the works, involving
the separation of CHECKSUM_HW on input vs. output, and
having netfilter properly do incremental checksums.
But that is a very involved patch and is thus 2.6.19
material.
What we have now is infinitely better than the past,
wherein all TSO packets were dropped due to corrupt
checksums as soon as the NAT module was loaded. At
least now, the checksums do get fixed up, it just
isn't the cleanest nor most optimal solution.
Signed-off-by: David S. Miller <davem@davemloft.net>
The hashlimit table name and the textsearch algorithm name need to be
nul-terminated, and the textsearch pattern length must not exceed the
maximum size.
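A sketch of this kind of validation on a user-supplied structure. The struct layout, the field sizes and the function name are hypothetical; checking the last byte of the fixed-size field is one common way to enforce termination and is assumed here.

```c
#include <string.h>

#define SKETCH_NAME_LEN     16	/* hypothetical fixed field size */
#define SKETCH_MAX_PATTERN 128	/* hypothetical maximum pattern size */

struct sketch_match_info {
	char name[SKETCH_NAME_LEN];
	char pattern[SKETCH_MAX_PATTERN];
	unsigned int patlen;
};

/* Returns 1 if the structure is acceptable, 0 if it must be rejected. */
static int sketch_checkentry(const struct sketch_match_info *info)
{
	/* The name must be terminated inside its fixed-size field. */
	if (info->name[sizeof(info->name) - 1] != '\0')
		return 0;
	/* The claimed pattern length must fit in the pattern buffer. */
	if (info->patlen > SKETCH_MAX_PATTERN)
		return 0;
	return 1;
}
```

Without the first check, any later string operation on the name could read past the end of the field; without the second, a copy of patlen bytes could overrun the pattern buffer.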
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since we don't know in which direction the first packet will arrive, we
need to create one expectation for each direction, which is currently
prevented by max_expected being set to 1.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a dev_alloc_skb variant that takes a struct net_device * parameter.
For now that parameter is unused, but I'll use it to allocate the skb
from node-local memory in a follow-up patch. Also there have been some
other plans mentioned on the list that can use it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Based upon guidance from Alexey Kuznetsov.
When linger2 is active, we check to see if the fin_wait2
timeout is longer than the timewait. If it is, we schedule
the keepalive timer for the difference between the timewait
timeout and the fin_wait2 timeout.
When this orphan socket is seen by tcp_keepalive_timer()
it will try to transform this fin_wait2 socket into a
fin_wait2 mini-socket, again if linger2 is active.
Not all paths were setting this initial keepalive timer correctly.
The tcp input path was doing it correctly, but tcp_close() wasn't,
potentially making the socket linger longer than it really needs to.
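The timer computation described above can be sketched as follows. The tick rate is an assumption (HZ is a build-time setting), the 60 second time-wait length matches TCP_TIMEWAIT_LEN, and the function name is hypothetical.

```c
#define SKETCH_HZ            1000L		/* assumed tick rate */
#define SKETCH_TIMEWAIT_LEN  (60 * SKETCH_HZ)	/* time-wait length in ticks */

/*
 * Returns the delay, in ticks, for which the keepalive timer should be
 * armed on a socket entering FIN_WAIT2, or 0 when the FIN_WAIT2 timeout
 * is not longer than the time-wait length and the socket can be handed
 * to the time-wait machinery directly.
 */
static long sketch_fin_wait2_keepalive_delay(long fin_wait2_tmo)
{
	if (fin_wait2_tmo > SKETCH_TIMEWAIT_LEN)
		return fin_wait2_tmo - SKETCH_TIMEWAIT_LEN;
	return 0;
}
```

Both the input path and the close path would need to apply the same computation; the bug described above is that only one of them did.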
Signed-off-by: David S. Miller <davem@davemloft.net>
The patch below fixes a problem in the iptables SECMARK target, where
the user-supplied 'selctx' string may not be nul-terminated.
From initial analysis, it seems that the strlen() called from
selinux_string_to_sid() could run until it arbitrarily finds a zero,
and possibly cause a kernel oops before then.
The impact of this appears limited because the operation requires
CAP_NET_ADMIN, which is essentially always root. Also, the module is
not yet in wide use.
Signed-off-by: James Morris <jmorris@namei.org>
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: David S. Miller <davem@davemloft.net>
Generate netevents for:
- neighbour changes
- routing redirects
- pmtu changes
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch uses notifier blocks to implement a network event
notifier mechanism.
Clients register their callback function by calling
register_netevent_notifier() like this:
	static struct notifier_block nb = {
		.notifier_call = my_callback_func
	};
	...
	register_netevent_notifier(&nb);
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
According to RFC2012, tcpAttemptFails is defined as follows:
tcpAttemptFails OBJECT-TYPE
SYNTAX Counter32
MAX-ACCESS read-only
STATUS current
DESCRIPTION
"The number of times TCP connections have made a direct
transition to the CLOSED state from either the SYN-SENT
state or the SYN-RCVD state, plus the number of times TCP
connections have made a direct transition to the LISTEN
state from the SYN-RCVD state."
::= { tcp 7 }
When I looked into RFC793, I found that these state changes should occur
under the following conditions:
1. SYN-SENT -> CLOSED
a) Received ACK,RST segment when SYN-SENT state.
2. SYN-RCVD -> CLOSED
b) Received SYN segment when SYN-RCVD state(came from LISTEN).
c) Received RST segment when SYN-RCVD state(came from SYN-SENT).
d) Received SYN segment when SYN-RCVD state(came from SYN-SENT).
3. SYN-RCVD -> LISTEN
e) Received RST segment when SYN-RCVD state(came from LISTEN).
In my test, these direct state transitions were not counted in
tcpAttemptFails.
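The counting rule from the RFC2012 definition quoted above can be expressed as a small predicate. The enum and function names are hypothetical; only the transitions named in the definition are treated as failed attempts.

```c
enum sketch_tcp_state {
	SKETCH_CLOSED,
	SKETCH_LISTEN,
	SKETCH_SYN_SENT,
	SKETCH_SYN_RCVD,
	SKETCH_ESTABLISHED
};

/*
 * Returns 1 when a direct transition from 'from' to 'to' should bump
 * tcpAttemptFails:
 *   SYN-SENT -> CLOSED, SYN-RCVD -> CLOSED, SYN-RCVD -> LISTEN.
 */
static int sketch_counts_as_attempt_fail(enum sketch_tcp_state from,
					 enum sketch_tcp_state to)
{
	if (to == SKETCH_CLOSED)
		return from == SKETCH_SYN_SENT || from == SKETCH_SYN_RCVD;
	if (to == SKETCH_LISTEN)
		return from == SKETCH_SYN_RCVD;
	return 0;
}
```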
Signed-off-by: Wei Yongjun <yjwei@nanjing-fnst.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Based upon a patch by Jesper Juhl.
Signed-off-by: James Morris <jmorris@namei.org>
Acked-by: Stephen Hemminger <shemminger@osdl.org>
Acked-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the trim point is within the head and there is no paged data,
___pskb_trim fails to drop the first element in the frag_list.
This patch fixes this by moving the len <= offset case out of the
page data loop.
This patch also adds a missing kfree_skb on the frag that we just
cloned.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current users of ip6_dst_lookup can be divided into two classes:
1) The caller holds no locks and is in user-context (UDP).
2) The caller does not want to lookup the dst cache at all.
The second class covers everyone except UDP because most people do
the cache lookup directly before calling ip6_dst_lookup. This patch
adds ip6_sk_dst_lookup for the first class.
Similarly ip6_dst_store users can be divided into those that need to
take the socket dst lock and those that don't. This patch adds
__ip6_dst_store for those (everyone except UDP/datagram) that don't
need an extra lock.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
We also do not try regenerating a new temporary address corresponding to an
address with infinite preferred lifetime.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
ieee80211_crypt_tkip will not work without CRC32.
LD .tmp_vmlinux1
net/built-in.o: In function `ieee80211_tkip_encrypt':
net/ieee80211/ieee80211_crypt_tkip.c:349: undefined reference to `crc32_le'
Reported by Toralf Foerster <toralf.foerster@gmx.de>
Signed-off-by: Chuck Ebbert <76306.1226@compuserve.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Johann Uhrmann reported a bcm43xx crash and Michael Buesch tracked
it down to a problem with the new shared key auth code (recursive
calls into the driver)
This patch (effectively Michael's patch with a couple of small
modifications) solves the problem by sending the authentication
challenge response frame from a workqueue entry.
I also removed a lone \n from the bcm43xx messages relating to
authentication mode - this small change was previously discussed but
not patched in.
Signed-off-by: Daniel Drake <dsd@gentoo.org>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Michael Buesch <mb@bu3sch.de>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
From: Tetsuo Handa <from-linux-kernel@i-love.sakura.ne.jp>
recvmsg() on a raw socket seems to return a random u16 value
from kernel stack memory, since the port field is not initialized.
But I'm not sure this patch is correct.
Does a raw socket return any information stored in the port field?
[ BSD defines RAW IP recvmsg to return a sin_port value of zero.
This is described in Stevens' TCP/IP Illustrated Volume 2 on
page 1055, which is discussing the BSD rip_input() implementation. ]
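A userspace sketch of filling the source address so that no stale stack bytes reach the caller. The helper name is hypothetical; clearing the whole structure first is a choice of this sketch that also covers the padding bytes.

```c
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

/*
 * Fills *sin for a raw IP receive. The port has no meaning for a raw
 * socket, so it is set to zero explicitly instead of being left with
 * whatever happened to be in memory.
 */
static void sketch_fill_raw_src_addr(struct sockaddr_in *sin,
				     struct in_addr saddr)
{
	memset(sin, 0, sizeof(*sin));	/* clears padding as well */
	sin->sin_family = AF_INET;
	sin->sin_addr = saddr;
	sin->sin_port = 0;
}
```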
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The IP multicast route code was reusing an skb, which caused a
use-after-free and a double free.
From: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Note that it is a real skb_clone(), not alloc_skb(). The enqueued skb contains
the whole half-prepared netlink message plus room for the rest.
It could also be skb_copy(), if we want to be purist about mangling
cloned data, but the original copy is really not going to be used.
Acked-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Clear the accumulated junk in IP6CB when starting to handle an IPV6
packet.
Signed-off-by: Guillaume Chazarain <guichaz@yahoo.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
After the recent problems with all the SCTP stuff it seems reasonable
to mark this as experimental.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add bridge netfilter deferred output hooks to feature-removal-schedule
and disable them by default. Until their removal they will be
activated by the physdev match when needed.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Locally generated broadcast and multicast packets have pkttype set to
PACKET_LOOPBACK instead of PACKET_BROADCAST or PACKET_MULTICAST. This
causes the pkttype match to fail to match packets of either type.
The below patch remedies this by using the daddr as a hint as to
broadcast|multicast. While not pretty, this seems like the only way
to solve the problem short of just noting this as a limitation of the
match.
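The hint can be sketched as follows. The enum stands in for the real packet type constants, the address is taken in host byte order to keep the example short, and treating every non-multicast loopback packet as broadcast is an assumption based on the description above.

```c
#include <stdint.h>

enum sketch_pkt_type {
	SKETCH_PKT_HOST,
	SKETCH_PKT_BROADCAST,
	SKETCH_PKT_MULTICAST,
	SKETCH_PKT_LOOPBACK
};

/* IPv4 multicast is 224.0.0.0/4; daddr is in host byte order here. */
static int sketch_is_ipv4_multicast(uint32_t daddr)
{
	return (daddr & 0xf0000000u) == 0xe0000000u;
}

/*
 * Returns the type the match should compare against. Packets that are
 * not marked as loopback keep their type; locally generated ones are
 * classified from the destination address.
 */
static enum sketch_pkt_type
sketch_effective_pkt_type(enum sketch_pkt_type type, uint32_t daddr)
{
	if (type != SKETCH_PKT_LOOPBACK)
		return type;
	return sketch_is_ipv4_multicast(daddr) ? SKETCH_PKT_MULTICAST
					       : SKETCH_PKT_BROADCAST;
}
```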
This resolves netfilter bugzilla #484
Signed-off-by: Phil Oester <kernel@linuxace.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case of an unknown verdict or NF_STOP the packet leaks. Unknown verdicts
can happen when userspace is buggy. Reinject the packet in case of NF_STOP,
drop on unknown verdicts.
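A sketch of the verdict handling described above. The enums and the function are hypothetical stand-ins; verdicts other than the ones discussed here (for example re-queueing) are left out to keep the example short.

```c
enum sketch_verdict {
	SKETCH_V_DROP   = 0,
	SKETCH_V_ACCEPT = 1,
	SKETCH_V_STOLEN = 2,
	SKETCH_V_STOP   = 5
};

enum sketch_action {
	SKETCH_ACT_CONTINUE,	/* hand the packet on (reinject) */
	SKETCH_ACT_NONE,	/* someone else owns the packet now */
	SKETCH_ACT_FREE		/* free the packet */
};

/*
 * Every verdict maps to an action that accounts for the packet, so no
 * path leaves it neither delivered nor freed. Anything unrecognised,
 * which a buggy userspace program may send, is dropped.
 */
static enum sketch_action sketch_reinject_action(unsigned int verdict)
{
	switch (verdict) {
	case SKETCH_V_ACCEPT:
	case SKETCH_V_STOP:
		return SKETCH_ACT_CONTINUE;
	case SKETCH_V_STOLEN:
		return SKETCH_ACT_NONE;
	case SKETCH_V_DROP:
	default:
		return SKETCH_ACT_FREE;
	}
}
```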
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>