linux

Author	SHA1	Message	Date
David S. Miller	caacf05e5a	ipv4: Properly purge netdev references on uncached routes. When a device is unregistered, we have to purge all of the references to it that may exist in the entire system. If a route is uncached, we currently have no way of accomplishing this. So create a global list that is scanned when a network device goes down. This mirrors the logic in net/core/dst.c's dst_ifdown(). Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-31 15:06:50 -07:00
David S. Miller	c5038a8327	ipv4: Cache routes in nexthop exception entries. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-31 15:02:02 -07:00
Eric Dumazet	d26b3a7c4b	ipv4: percpu nh_rth_output cache Input path is mostly run under RCU and doesnt touch dst refcnt But output path on forwarding or UDP workloads hits badly dst refcount, and we have lot of false sharing, for example in ipv4_mtu() when reading rt->rt_pmtu Using a percpu cache for nh_rth_output gives a nice performance increase at a small cost. 24 udpflood test on my 24 cpu machine (dummy0 output device) (each process sends 1.000.000 udp frames, 24 processes are started) before : 5.24 s after : 2.06 s For reference, time on linux-3.5 : 6.60 s Signed-off-by: Eric Dumazet <edumazet@google.com> Tested-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-31 14:41:39 -07:00
Eric Dumazet	54764bb647	ipv4: Restore old dst_free() behavior. commit `404e0a8b6a` (net: ipv4: fix RCU races on dst refcounts) tried to solve a race but added a problem at device/fib dismantle time : We really want to call dst_free() as soon as possible, even if sockets still have dst in their cache. dst_release() calls in free_fib_info_rcu() are not welcomed. Root of the problem was that now we also cache output routes (in nh_rth_output), we must use call_rcu() instead of call_rcu_bh() in rt_free(), because output route lookups are done in process context. Based on feedback and initial patch from David Miller (adding another call_rcu_bh() call in fib, but it appears it was not the right fix) I left the inet_sk_rx_dst_set() helper and added __rcu attributes to nh_rth_output and nh_rth_input to better document what is going on in this code. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-31 14:41:38 -07:00
Eric Dumazet	0c7462a235	ipv4: remove rt_cache_rebuild_count After IP route cache removal, rt_cache_rebuild_count is no longer used. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-30 14:53:22 -07:00
Eric Dumazet	404e0a8b6a	net: ipv4: fix RCU races on dst refcounts commit `c6cffba4ff` (ipv4: Fix input route performance regression.) added various fatal races with dst refcounts. crashes happen on tcp workloads if routes are added/deleted at the same time. The dst_free() calls from free_fib_info_rcu() are clearly racy. We need instead regular dst refcounting (dst_release()) and make sure dst_release() is aware of RCU grace periods : Add DST_RCU_FREE flag so that dst_release() respects an RCU grace period before dst destruction for cached dst Introduce a new inet_sk_rx_dst_set() helper, using atomic_inc_not_zero() to make sure we dont increase a zero refcount (On a dst currently waiting an rcu grace period before destruction) rt_cache_route() must take a reference on the new cached route, and release it if was not able to install it. With this patch, my machines survive various benchmarks. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-30 14:53:22 -07:00
Eric Dumazet	cca32e4bf9	net: TCP early demux cleanup early_demux() handlers should be called in RCU context, and as we use skb_dst_set_noref(skb, dst), caller must not exit from RCU context before dst use (skb_dst(skb)) or release (skb_drop(dst)) Therefore, rcu_read_lock()/rcu_read_unlock() pairs around ->early_demux() are confusing and not needed : Protocol handlers are already in an RCU read lock section. (__netif_receive_skb() does the rcu_read_lock() ) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-30 14:53:21 -07:00
Lin Ming	61648d91fc	ipv4: clean up put_child The first parameter struct trie *t is not used anymore. Remove it. Signed-off-by: Lin Ming <mlin@ss.pku.edu.cn> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-29 23:18:30 -07:00
Lin Ming	4ea4bf7ebc	ipv4: fix debug info in tnode_new It should print size of struct rt_trie_node * allocated instead of size of struct rt_trie_node. Signed-off-by: Lin Ming <mlin@ss.pku.edu.cn> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-29 23:18:30 -07:00
Jiri Kosina	59ea33a68a	tcp: perform DMA to userspace only if there is a task waiting for it Back in 2006, commit `1a2449a87b` ("[I/OAT]: TCP recv offload to I/OAT") added support for receive offloading to IOAT dma engine if available. The code in tcp_rcv_established() tries to perform early DMA copy if applicable. It however does so without checking whether the userspace task is actually expecting the data in the buffer. This is not a problem under normal circumstances, but there is a corner case where this doesn't work -- and that's when MSG_TRUNC flag to recvmsg() is used. If the IOAT dma engine is not used, the code properly checks whether there is a valid ucopy.task and the socket is owned by userspace, but misses the check in the dmaengine case. This problem can be observed in real trivially -- for example 'tbench' is a good reproducer, as it makes a heavy use of MSG_TRUNC. On systems utilizing IOAT, you will soon find tbench waiting indefinitely in sk_wait_data(), as they have been already early-copied in tcp_rcv_established() using dma engine. This patch introduces the same check we are performing in the simple iovec copy case to the IOAT case as well. It fixes the indefinite recvmsg(MSG_TRUNC) hangs. Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-27 13:45:51 -07:00
Eric Dumazet	505fbcf035	ipv4: fix TCP early demux commit `92101b3b2e` (ipv4: Prepare for change of rt->rt_iif encoding.) invalidated TCP early demux, because rx_dst_ifindex is not properly initialized and checked. Also remove the use of inet_iif(skb) in favor or skb->skb_iif Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-27 13:45:51 -07:00
Hangbin Liu	4249357010	tcp: Add TCP_USER_TIMEOUT negative value check TCP_USER_TIMEOUT is a TCP level socket option that takes an unsigned int. But patch "tcp: Add TCP_USER_TIMEOUT socket option"(`dca43c75`) didn't check the negative values. If a user assign -1 to it, the socket will set successfully and wait for 4294967295 miliseconds. This patch add a negative value check to avoid this issue. Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-27 13:45:50 -07:00
Eric Dumazet	c7109986db	ipv6: Early TCP socket demux This is the IPv6 missing bits for infrastructure added in commit `41063e9dd1` (ipv4: Early TCP socket demux.) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-26 15:50:39 -07:00
David S. Miller	c6cffba4ff	ipv4: Fix input route performance regression. With the routing cache removal we lost the "noref" code paths on input, and this can kill some routing workloads. Reinstate the noref path when we hit a cached route in the FIB nexthops. With help from Eric Dumazet. Reported-by: Alexander Duyck <alexander.duyck@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-26 15:50:39 -07:00
Eric Dumazet	4331debc51	ipv4: rt_cache_valid must check expired routes commit `d2d68ba9fe` (ipv4: Cache input routes in fib_info nexthops.) introduced rt_cache_valid() helper. It unfortunately doesn't check if route is expired before caching it. I noticed sk_setup_caps() was constantly called on a tcp workload. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-25 15:24:14 -07:00
Eric Dumazet	9cb429d692	tcp: early_demux fixes 1) Remove a non needed pskb_may_pull() in tcp_v4_early_demux() and fix a potential bug if skb->head was reallocated (iph & th pointers were not reloaded) TCP stack will pull/check headers anyway. 2) must reload iph in ip_rcv_finish() after early_demux() call since skb->head might have changed. 3) skb->dev->ifindex can be now replaced by skb->skb_iif Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-24 13:54:15 -07:00
David S. Miller	13378cad02	ipv4: Change rt->rt_iif encoding. On input packet processing, rt->rt_iif will be zero if we should use skb->dev->ifindex. Since we access rt->rt_iif consistently via inet_iif(), that is the only spot whose interpretation have to adjust. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-23 16:36:27 -07:00
David S. Miller	92101b3b2e	ipv4: Prepare for change of rt->rt_iif encoding. Use inet_iif() consistently, and for TCP record the input interface of cached RX dst in inet sock. rt->rt_iif is going to be encoded differently, so that we can legitimately cache input routes in the FIB info more aggressively. When the input interface is "use SKB device index" the rt->rt_iif will be set to zero. This forces us to move the TCP RX dst cache installation into the ipv4 specific code, and as well it should since doing the route caching for ipv6 is pointless at the moment since it is not inspected in the ipv6 input paths yet. Also, remove the unlikely on dst->obsolete, all ipv4 dsts have obsolete set to a non-zero value to force invocation of the check callback. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-23 16:36:26 -07:00
David S. Miller	fe3edf4579	ipv4: Remove all RTCF_DIRECTSRC handliing. The last and final kernel user, ICMP address replies, has been removed. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-23 13:22:20 -07:00
David S. Miller	838942a594	ipv4: Really ignore ICMP address requests/replies. Alexey removed kernel side support for requests, and the only thing we do for replies is log a message if something doesn't look right. As Alexey's comment indicates, this belongs in userspace (if anywhere), and thus we can safely just get rid of this code. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-23 13:20:26 -07:00
Saurabh	e7d4b18cbe	net/ipv4/ip_vti.c: Fix __rcu warnings detected by sparse. With CONFIG_SPARSE_RCU_POINTER=y sparse identified references which did not specificy __rcu in ip_vti.c Signed-off-by: Saurabh Mohan <saurabh.mohan@vyatta.com> Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-23 13:00:54 -07:00
Lin Ming	8fe5cb873b	ipv4: Remove redundant assignment It is redundant to set no_addr and accept_local to 0 and then set them with other values just after that. Signed-off-by: Lin Ming <mlin@ss.pku.edu.cn> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-23 13:00:54 -07:00
Eric Dumazet	563d34d057	tcp: dont drop MTU reduction indications ICMP messages generated in output path if frame length is bigger than mtu are actually lost because socket is owned by user (doing the xmit) One example is the ipgre_tunnel_xmit() calling icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED, htonl(mtu)); We had a similar case fixed in commit `a34a101e1e` (ipv6: disable GSO on sockets hitting dst_allfrag). Problem of such fix is that it relied on retransmit timers, so short tcp sessions paid a too big latency increase price. This patch uses the tcp_release_cb() infrastructure so that MTU reduction messages (ICMP messages) are not lost, and no extra delay is added in TCP transmits. Reported-by: Maciej Żenczykowski <maze@google.com> Diagnosed-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Nandita Dukkipati <nanditad@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Tore Anderson <tore@fud.no> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-23 00:58:46 -07:00
Julian Anastasov	9a0a9502cb	tcp: avoid oops in tcp_metrics and reset tcpm_stamp In tcp_tw_remember_stamp we incorrectly checked tw instead of tm, it can lead to oops if the cached entry is not found. tcpm_stamp was not updated in tcpm_check_stamp when tcpm_suck_dst was called, move the update into tcpm_suck_dst, so that we do not call it infinitely on every next cache hit after TCP_METRICS_TIMEOUT. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-23 00:57:12 -07:00
David S. Miller	5e9965c15b	Merge branch 'kill_rtcache' The ipv4 routing cache is non-deterministic, performance wise, and is subject to reasonably easy to launch denial of service attacks. The routing cache works great for well behaved traffic, and the world was a much friendlier place when the tradeoffs that led to the routing cache's design were considered. What it boils down to is that the performance of the routing cache is a product of the traffic patterns seen by a system rather than being a product of the contents of the routing tables. The former of which is controllable by external entitites. Even for "well behaved" legitimate traffic, high volume sites can see hit rates in the routing cache of only ~%10. The general flow of this patch series is that first the routing cache is removed. We build a completely new rtable entry every lookup request. Next we make some simplifications due to the fact that removing the routing cache causes several members of struct rtable to become no longer necessary. Then we need to make some amends such that we can legally cache pre-constructed routes in the FIB nexthops. Firstly, we need to invalidate routes which are hit with nexthop exceptions. Secondly we have to change the semantics of rt->rt_gateway such that zero means that the destination is on-link and non-zero otherwise. Now that the preparations are ready, we start caching precomputed routes in the FIB nexthops. Output and input routes need different kinds of care when determining if we can legally do such caching or not. The details are in the commit log messages for those changes. The patch series then winds down with some more struct rtable simplifications and other tidy ups that remove unnecessary overhead. On a SPARC-T3 output route lookups are ~876 cycles. Input route lookups are ~1169 cycles with rpfilter disabled, and about ~1468 cycles with rpfilter enabled. These measurements were taken with the kbench_mod test module in the net_test_tools GIT tree: git://git.kernel.org/pub/scm/linux/kernel/git/davem/net_test_tools.git That GIT tree also includes a udpflood tester tool and stresses route lookups on packet output. For example, on the same SPARC-T3 system we can run: time ./udpflood -l 10000000 10.2.2.11 with routing cache: real 1m21.955s user 0m6.530s sys 1m15.390s without routing cache: real 1m31.678s user 0m6.520s sys 1m25.140s Performance undoubtedly can easily be improved further. For example fib_table_lookup() performs a lot of excessive computations with all the masking and shifting, some of it conditionalized to deal with edge cases. Also, Eric's no-ref optimization for input route lookups can be re-instated for the FIB nexthop caching code path. I would be really pleased if someone would work on that. In fact anyone suitable motivated can just fire up perf on the loading of the test net_test_tools benchmark kernel module. I spend much of my time going: bash# perf record insmod ./kbench_mod.ko dst=172.30.42.22 src=74.128.0.1 iif=2 bash# perf report Thanks to helpful feedback from Joe Perches, Eric Dumazet, Ben Hutchings, and others. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-22 17:04:15 -07:00
Eric Dumazet	0980e56e50	ipv4: tcp: set unicast_sock uc_ttl to -1 Set unicast_sock uc_ttl to -1 so that we select the right ttl, instead of sending packets with a 0 ttl. Bug added in commit `be9f4a44e7` (ipv4: tcp: remove per net tcp_sock) Signed-off-by: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-22 12:06:21 -07:00
David S. Miller	2860583fe8	ipv4: Kill rt->fi It's not really needed. We only grabbed a reference to the fib_info for the sake of fib_info local metrics. However, fib_info objects are freed using RCU, as are therefore their private metrics (if any). We would have triggered a route cache flush if we eliminated a reference to a fib_info object in the routing tables. Therefore, any existing cached routes will first check and see that they have been invalidated before an errant reference to these metric values would occur. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-20 13:40:07 -07:00
David S. Miller	9917e1e876	ipv4: Turn rt->rt_route_iif into rt->rt_is_input. That is this value's only use, as a boolean to indicate whether a route is an input route or not. So implement it that way, using a u16 gap present in the struct already. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-20 13:40:02 -07:00
David S. Miller	4fd551d7be	ipv4: Kill rt->rt_oif Never actually used. It was being set on output routes to the original OIF specified in the flow key used for the lookup. Adjust the only user, ipmr_rt_fib_lookup(), for greater correctness of the flowi4_oif and flowi4_iif values, thanks to feedback from Julian Anastasov. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-20 13:38:34 -07:00
David S. Miller	93ac53410a	ipv4: Dirty less cache lines in route caching paths. Don't bother incrementing dst->__use and setting dst->lastuse, they are completely pointless and just slow things down. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-20 13:36:55 -07:00
David S. Miller	ba3f7f04ef	ipv4: Kill FLOWI_FLAG_RT_NOCACHE and associated code. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-20 13:36:54 -07:00
David S. Miller	d2d68ba9fe	ipv4: Cache input routes in fib_info nexthops. Caching input routes is slightly simpler than output routes, since we don't need to be concerned with nexthop exceptions. (locally destined, and routed packets, never trigger PMTU events or redirects that will be processed by us). However, we have to elide caching for the DIRECTSRC and non-zero itag cases. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-20 13:36:40 -07:00
David S. Miller	f2bb4bedf3	ipv4: Cache output routes in fib_info nexthops. If we have an output route that lacks nexthop exceptions, we can cache it in the FIB info nexthop. Such routes will have DST_HOST cleared because such routes refer to a family of destinations, rather than just one. The sequence of the handling of exceptions during route lookup is adjusted to make the logic work properly. Before we allocate the route, we lookup the exception. Then we know if we will cache this route or not, and therefore whether DST_HOST should be set on the allocated route. Then we use DST_HOST to key off whether we should store the resulting route, during rt_set_nexthop(), in the FIB nexthop cache. With help from Eric Dumazet. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-20 13:36:16 -07:00
David S. Miller	ceb3320610	ipv4: Kill routes during PMTU/redirect updates. Mark them obsolete so there will be a re-lookup to fetch the FIB nexthop exception info. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-20 13:31:22 -07:00
David S. Miller	f5b0a87436	net: Document dst->obsolete better. Add a big comment explaining how the field works, and use defines instead of magic constants for the values assigned to it. Suggested by Joe Perches. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-20 13:31:21 -07:00
David S. Miller	f8126f1d51	ipv4: Adjust semantics of rt->rt_gateway. In order to allow prefixed routes, we have to adjust how rt_gateway is set and interpreted. The new interpretation is: 1) rt_gateway == 0, destination is on-link, nexthop is iph->daddr 2) rt_gateway != 0, destination requires a nexthop gateway Abstract the fetching of the proper nexthop value using a new inline helper, rt_nexthop(), as suggested by Joe Perches. Signed-off-by: David S. Miller <davem@davemloft.net> Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>	2012-07-20 13:31:20 -07:00
David S. Miller	f1ce3062c5	ipv4: Remove 'rt_dst' from 'struct rtable' Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-20 13:31:19 -07:00
David Miller	b48698895d	ipv4: Remove 'rt_mark' from 'struct rtable' Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-20 13:31:18 -07:00
David Miller	d6c0a4f609	ipv4: Kill 'rt_src' from 'struct rtable' Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-20 13:31:00 -07:00
David Miller	1a00fee4ff	ipv4: Remove rt_key_{src,dst,tos} from struct rtable. They are always used in contexts where they can be reconstituted, or where the finally resolved rt->rt_{src,dst} is semantically equivalent. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-20 13:30:59 -07:00
David Miller	38a424e465	ipv4: Kill ip_route_input_noref(). The "noref" argument to ip_route_input_common() is now always ignored because we do not cache routes, and in that case we must always grab a reference to the resulting 'dst'. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-20 13:30:59 -07:00
David S. Miller	89aef8921b	ipv4: Delete routing cache. The ipv4 routing cache is non-deterministic, performance wise, and is subject to reasonably easy to launch denial of service attacks. The routing cache works great for well behaved traffic, and the world was a much friendlier place when the tradeoffs that led to the routing cache's design were considered. What it boils down to is that the performance of the routing cache is a product of the traffic patterns seen by a system rather than being a product of the contents of the routing tables. The former of which is controllable by external entitites. Even for "well behaved" legitimate traffic, high volume sites can see hit rates in the routing cache of only ~%10. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-20 13:30:27 -07:00
Julian Anastasov	521f549097	ipv4: show pmtu in route list Override the metrics with rt_pmtu Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-20 11:16:49 -07:00
Eric Dumazet	6f458dfb40	tcp: improve latencies of timer triggered events Modern TCP stack highly depends on tcp_write_timer() having a small latency, but current implementation doesn't exactly meet the expectations. When a timer fires but finds the socket is owned by the user, it rearms itself for an additional delay hoping next run will be more successful. tcp_write_timer() for example uses a 50ms delay for next try, and it defeats many attempts to get predictable TCP behavior in term of latencies. Use the recently introduced tcp_release_cb(), so that the user owning the socket will call various handlers right before socket release. This will permit us to post a followup patch to address the tcp_tso_should_defer() syndrome (some deferred packets have to wait RTO timer to be transmitted, while cwnd should allow us to send them sooner) Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Nandita Dukkipati <nanditad@google.com> Cc: H.K. Jerry Chu <hkchu@google.com> Cc: John Heffner <johnwheffner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-20 10:59:41 -07:00
Eric Dumazet	9dc274151a	tcp: fix ABC in tcp_slow_start() When/if sysctl_tcp_abc > 1, we expect to increase cwnd by 2 if the received ACK acknowledges more than 2*MSS bytes, in tcp_slow_start() Problem is this RFC 3465 statement is not correctly coded, as the while () loop increases snd_cwnd one by one. Add a new variable to avoid this off-by one error. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Nandita Dukkipati <nanditad@google.com> Cc: John Heffner <johnwheffner@gmail.com> Cc: Stephen Hemminger <shemminger@vyatta.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-20 10:59:41 -07:00
Eric Dumazet	5815d5e7aa	tcp: use hash_32() in tcp_metrics Fix a missing roundup_pow_of_two(), since tcpmhash_entries is not guaranteed to be a power of two. Uses hash_32() instead of custom hash. tcpmhash_entries should be an unsigned int. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-20 10:59:41 -07:00
Vijay Subramanian	67b95bd78f	tcp: Return bool instead of int where appropriate Applied to a set of static inline functions in tcp_input.c Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-20 10:59:41 -07:00
Julian Anastasov	f31fd38382	ipv4: Fix again the time difference calculation Fix again the diff value in rt_bind_exception after collision of two latest patches, my original commit actually fixed the same problem. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-19 13:01:44 -07:00
David S. Miller	abaa72d7fd	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c	2012-07-19 11:17:30 -07:00
Yuchung Cheng	67da22d23f	net-tcp: Fast Open client - cookie-less mode In trusted networks, e.g., intranet, data-center, the client does not need to use Fast Open cookie to mitigate DoS attacks. In cookie-less mode, sendmsg() with MSG_FASTOPEN flag will send SYN-data regardless of cookie availability. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-19 11:02:03 -07:00
Yuchung Cheng	aab4874355	net-tcp: Fast Open client - detecting SYN-data drops On paths with firewalls dropping SYN with data or experimental TCP options, Fast Open connections will have experience SYN timeout and bad performance. The solution is to track such incidents in the cookie cache and disables Fast Open temporarily. Since only the original SYN includes data and/or Fast Open option, the SYN-ACK has some tell-tale sign (tcp_rcv_fastopen_synack()) to detect such drops. If a path has recurring Fast Open SYN drops, Fast Open is disabled for 2^(recurring_losses) minutes starting from four minutes up to roughly one and half day. sendmsg with MSG_FASTOPEN flag will succeed but it behaves as connect() then write(). Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-19 11:02:03 -07:00
Yuchung Cheng	cf60af03ca	net-tcp: Fast Open client - sendmsg(MSG_FASTOPEN) sendmsg() (or sendto()) with MSG_FASTOPEN is a combo of connect(2) and write(2). The application should replace connect() with it to send data in the opening SYN packet. For blocking socket, sendmsg() blocks until all the data are buffered locally and the handshake is completed like connect() call. It returns similar errno like connect() if the TCP handshake fails. For non-blocking socket, it returns the number of bytes queued (and transmitted in the SYN-data packet) if cookie is available. If cookie is not available, it transmits a data-less SYN packet with Fast Open cookie request option and returns -EINPROGRESS like connect(). Using MSG_FASTOPEN on connecting or connected socket will result in simlar errno like repeating connect() calls. Therefore the application should only use this flag on new sockets. The buffer size of sendmsg() is independent of the MSS of the connection. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-19 11:02:03 -07:00
Yuchung Cheng	8e4178c1c7	net-tcp: Fast Open client - receiving SYN-ACK On receiving the SYN-ACK after SYN-data, the client needs to a) update the cached MSS and cookie (if included in SYN-ACK) b) retransmit the data not yet acknowledged by the SYN-ACK in the final ACK of the handshake. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-19 11:02:03 -07:00
Yuchung Cheng	783237e8da	net-tcp: Fast Open client - sending SYN-data This patch implements sending SYN-data in tcp_connect(). The data is from tcp_sendmsg() with flag MSG_FASTOPEN (implemented in a later patch). The length of the cookie in tcp_fastopen_req, init'd to 0, controls the type of the SYN. If the cookie is not cached (len==0), the host sends data-less SYN with Fast Open cookie request option to solicit a cookie from the remote. If cookie is not available (len > 0), the host sends a SYN-data with Fast Open cookie option. If cookie length is negative, the SYN will not include any Fast Open option (for fall back operations). To deal with middleboxes that may drop SYN with data or experimental TCP option, the SYN-data is only sent once. SYN retransmits do not include data or Fast Open options. The connection will fall back to regular TCP handshake. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-19 11:02:03 -07:00
Yuchung Cheng	1fe4c481ba	net-tcp: Fast Open client - cookie cache With help from Eric Dumazet, add Fast Open metrics in tcp metrics cache. The basic ones are MSS and the cookies. Later patch will cache more to handle unfriendly middleboxes. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-19 10:55:36 -07:00
Yuchung Cheng	2100c8d2d9	net-tcp: Fast Open base This patch impelements the common code for both the client and server. 1. TCP Fast Open option processing. Since Fast Open does not have an option number assigned by IANA yet, it shares the experiment option code 254 by implementing draft-ietf-tcpm-experimental-options with a 16 bits magic number 0xF989. This enables global experiments without clashing the scarce(2) experimental options available for TCP. When the draft status becomes standard (maybe), the client should switch to the new option number assigned while the server supports both numbers for transistion. 2. The new sysctl tcp_fastopen 3. A place holder init function Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-19 10:55:36 -07:00
Eric Dumazet	be9f4a44e7	ipv4: tcp: remove per net tcp_sock tcp_v4_send_reset() and tcp_v4_send_ack() use a single socket per network namespace. This leads to bad behavior on multiqueue NICS, because many cpus contend for the socket lock and once socket lock is acquired, extra false sharing on various socket fields slow down the operations. To better resist to attacks, we use a percpu socket. Each cpu can run without contention, using appropriate memory (local node) Additional features : 1) We also mirror the queue_mapping of the incoming skb, so that answers use the same queue if possible. 2) Setting SOCK_USE_WRITE_QUEUE socket flag speedup sock_wfree() 3) We now limit the number of in-flight RST/ACK [1] packets per cpu, instead of per namespace, and we honor the sysctl_wmem_default limit dynamically. (Prior to this patch, sysctl_wmem_default value was copied at boot time, so any further change would not affect tcp_sock limit) [1] These packets are only generated when no socket was matched for the incoming packet. Reported-by: Bill Sommerfeld <wsommerfeld@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-19 10:35:30 -07:00
Julian Anastasov	aee06da672	ipv4: use seqlock for nh_exceptions Use global seqlock for the nh_exceptions. Call fnhe_oldest with the right hash chain. Correct the diff value for dst_set_expires. v2: after suggestions from Eric Dumazet: * get rid of spin lock fnhe_lock, rearrange update_or_create_fnhe * continue daddr search in rt_bind_exception v3: * remove the daddr check before seqlock in rt_bind_exception * restart lookup in rt_bind_exception on detected seqlock change, as suggested by David Miller Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-19 10:30:14 -07:00
David S. Miller	7fed84f622	ipv4: Fix time difference calculation in rt_bind_exception(). Reported-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-19 08:46:59 -07:00
Julian Anastasov	0cc535a299	ipv4: fix address selection in fib_compute_spec_dst ip_options_compile can be called for forwarded packets, make sure the specific-destionation address is a local one as specified in RFC 1812, 4.2.2.2 Addresses in Options Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-19 08:30:49 -07:00
Julian Anastasov	6255e5ead0	ipv4: optimize fib_compute_spec_dst call in ip_options_echo Move fib_compute_spec_dst at the only place where it is needed. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-19 08:30:49 -07:00
Eric Dumazet	ddbe503203	ipv6: add ipv6_addr_hash() helper Introduce ipv6_addr_hash() helper doing a XOR on all bits of an IPv6 address, with an optimized x86_64 version. Use it in flow dissector, as suggested by Andrew McGregor, to reduce hash collision probabilities in fq_codel (and other users of flow dissector) Use it in ip6_tunnel.c and use more bit shuffling, as suggested by David Laight, as existing hash was ignoring most of them. Use it in sunrpc and use more bit shuffling, using hash_32(). Use it in net/ipv6/addrconf.c, using hash_32() as well. As a cleanup, use it in net/ipv4/tcp_metrics.c Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrew McGregor <andrewmcgr@gmail.com> Cc: Dave Taht <dave.taht@gmail.com> Cc: Tom Herbert <therbert@google.com> Cc: David Laight <David.Laight@ACULAB.COM> Cc: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-18 11:28:46 -07:00
Saurabh	1181412c1a	net/ipv4: VTI support new module for ip_vti. New VTI tunnel kernel module, Kconfig and Makefile changes. Signed-off-by: Saurabh Mohan <saurabh.mohan@vyatta.com> Reviewed-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-18 09:36:12 -07:00
Saurabh	eb8637cd4a	net/ipv4: VTI support rx-path hook in xfrm4_mode_tunnel. Incorporated David and Steffen's comments. Add hook for rx-path xfmr4_mode_tunnel for VTI tunnel module. Signed-off-by: Saurabh Mohan <saurabh.mohan@vyatta.com> Reviewed-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-18 09:36:12 -07:00
Eric Dumazet	e371589917	tcp: refine SYN handling in tcp_validate_incoming Followup of commit `0c24604b68` (tcp: implement RFC 5961 4.2) As reported by Vijay Subramanian, we should send a challenge ACK instead of a dup ack if a SYN flag is set on a packet received out of window. This permits the ratelimiting to work as intended, and to increase correct SNMP counters. Suggested-by: Vijay Subramanian <subramanian.vijay@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Vijay Subramanian <subramanian.vijay@gmail.com> Cc: Kiran Kumar Kella <kkiran@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-18 09:31:25 -07:00
Paul Moore	89d7ae34cd	cipso: don't follow a NULL pointer when setsockopt() is called As reported by Alan Cox, and verified by Lin Ming, when a user attempts to add a CIPSO option to a socket using the CIPSO_V4_TAG_LOCAL tag the kernel dies a terrible death when it attempts to follow a NULL pointer (the skb argument to cipso_v4_validate() is NULL when called via the setsockopt() syscall). This patch fixes this by first checking to ensure that the skb is non-NULL before using it to find the incoming network interface. In the unlikely case where the skb is NULL and the user attempts to add a CIPSO option with the _TAG_LOCAL tag we return an error as this is not something we want to allow. A simple reproducer, kindly supplied by Lin Ming, although you must have the CIPSO DOI #3 configure on the system first or you will be caught early in cipso_v4_validate(): #include <sys/types.h> #include <sys/socket.h> #include <linux/ip.h> #include <linux/in.h> #include <string.h> struct local_tag { char type; char length; char info[4]; }; struct cipso { char type; char length; char doi[4]; struct local_tag local; }; int main(int argc, char **argv) { int sockfd; struct cipso cipso = { .type = IPOPT_CIPSO, .length = sizeof(struct cipso), .local = { .type = 128, .length = sizeof(struct local_tag), }, }; memset(cipso.doi, 0, 4); cipso.doi[3] = 3; sockfd = socket(AF_INET, SOCK_DGRAM, 0); #define SOL_IP 0 setsockopt(sockfd, SOL_IP, IP_OPTIONS, &cipso, sizeof(struct cipso)); return 0; } CC: Lin Ming <mlin@ss.pku.edu.cn> Reported-by: Alan Cox <alan@lxorguk.ukuu.org.uk> Signed-off-by: Paul Moore <pmoore@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-18 09:01:12 -07:00
Eric Dumazet	5abf7f7e0f	ipv4: fix rcu splat free_nh_exceptions() should use rcu_dereference_protected(..., 1) since its called after one RCU grace period. Also add some const-ification in recent code. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-17 13:47:33 -07:00
David S. Miller	d3a25c980f	ipv4: Fix nexthop exception hash computation. Need to mask it with (FNHE_HASH_SIZE - 1). Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-17 13:23:08 -07:00
David S. Miller	a6ff1a2f1e	Merge branch 'nexthop_exceptions' These patches implement the final mechanism necessary to really allow us to go without the route cache in ipv4. We need a place to have long-term storage of PMTU/redirect information which is independent of the routes themselves, yet does not get us back into a situation where we have to write to metrics or anything like that. For this we use an "next-hop exception" table in the FIB nexthops. The one thing I desperately want to avoid is having to create clone routes in the FIB trie for this purpose, because that is very expensive. However, I'm willing to entertain such an idea later if this current scheme proves to have downsides that the FIB trie variant would not have. In order to accomodate this any such scheme, we need to be able to produce a full flow key at PMTU/redirect time. That required an adjustment of the interface call-sites used to propagate these events. For a PMTU/redirect with a fully specified socket, we pass that socket and use it to produce the flow key. Otherwise we use a passed in SKB to formulate the key. There are two cases that need to be distinguished, ICMP message processing (in which case the IP header is at skb->data) and output packet processing (mostly tunnels, and in all such cases the IP header is at ip_hdr(skb)). We also have to make the code able to handle the case where the dst itself passed into the dst_ops->{update_pmtu,redirect} method is invalidated. This matters for calls from sockets that have cached that route. We provide a inet{,6} helper function for this purpose, and edit SCTP specially since it caches routes at the transport rather than socket level. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-17 10:48:26 -07:00
David S. Miller	4895c771c7	ipv4: Add FIB nexthop exceptions. In a regime where we have subnetted route entries, we need a way to store persistent storage about destination specific learned values such as redirects and PMTU values. This is implemented here via nexthop exceptions. The initial implementation is a 2048 entry hash table with relaiming starting at chain length 5. A more sophisticated scheme can be devised if that proves necessary. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-17 08:48:50 -07:00
Eric Dumazet	0c24604b68	tcp: implement RFC 5961 4.2 Implement the RFC 5691 mitigation against Blind Reset attack using SYN bit. Section 4.2 of RFC 5961 advises to send a Challenge ACK and drop incoming packet, instead of resetting the session. Add a new SNMP counter to count number of challenge acks sent in response to SYN packets. (netstat -s \| grep TCPSYNChallenge) Remove obsolete TCPAbortOnSyn, since we no longer abort a TCP session because of a SYN flag. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Kiran Kumar Kella <kkiran@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-17 07:40:46 -07:00
David S. Miller	6700c2709c	net: Pass optional SKB and SK arguments to dst_ops->{update_pmtu,redirect}() This will be used so that we can compose a full flow key. Even though we have a route in this context, we need more. In the future the routes will be without destination address, source address, etc. keying. One ipv4 route will cover entire subnets, etc. In this environment we have to have a way to possess persistent storage for redirects and PMTU information. This persistent storage will exist in the FIB tables, and that's why we'll need to be able to rebuild a full lookup flow key here. Using that flow key will do a fib_lookup() and create/update the persistent entry. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-17 03:29:28 -07:00
Eric Dumazet	282f23c6ee	tcp: implement RFC 5961 3.2 Implement the RFC 5691 mitigation against Blind Reset attack using RST bit. Idea is to validate incoming RST sequence, to match RCV.NXT value, instead of previouly accepted window : (RCV.NXT <= SEG.SEQ < RCV.NXT+RCV.WND) If sequence is in window but not an exact match, send a "challenge ACK", so that the other part can resend an RST with the appropriate sequence. Add a new sysctl, tcp_challenge_ack_limit, to limit number of challenge ACK sent per second. Add a new SNMP counter to count number of challenge acks sent. (netstat -s \| grep TCPChallengeACK) Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Kiran Kumar Kella <kkiran@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-17 01:36:20 -07:00
Andrey Vagin	51d7cccf07	net: make sock diag per-namespace Before this patch sock_diag works for init_net only and dumps information about sockets from all namespaces. This patch expands sock_diag for all name-spaces. It creates a netlink kernel socket for each netns and filters data during dumping. v2: filter accoding with netns in all places remove an unused variable. Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Cc: Pavel Emelyanov <xemul@parallels.com> CC: Eric Dumazet <eric.dumazet@gmail.com> Cc: linux-kernel@vger.kernel.org Cc: netdev@vger.kernel.org Signed-off-by: Andrew Vagin <avagin@openvz.org> Acked-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-16 22:31:34 -07:00
Eric Dumazet	a6df1ae938	tcp: add OFO snmp counters Add three SNMP TCP counters, to better track TCP behavior at global stage (netstat -s), when packets are received Out Of Order (OFO) TCPOFOQueue : Number of packets queued in OFO queue TCPOFODrop : Number of packets meant to be queued in OFO but dropped because socket rcvbuf limit hit. TCPOFOMerge : Number of packets in OFO that were merged with other packets. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-16 22:12:00 -07:00
David S. Miller	80d0a69fc5	ipv4: Add helper inet_csk_update_pmtu(). This abstracts away the call to dst_ops->update_pmtu() so that we can transparently handle the fact that, in the future, the dst itself can be invalidated by the PMTU update (when we have non-host routes cached in sockets). So we try to rebuild the socket cached route after the method invocation if necessary. This isn't used by SCTP because it needs to cache dsts per-transport, and thus will need it's own local version of this helper. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-16 03:28:06 -07:00
David S. Miller	85b91b0339	ipv4: Don't store a rule pointer in fib_result. We only use it to fetch the rule's tclassid, so just store the tclassid there instead. This also decreases the size of fib_result by a full 8 bytes on 64-bit. On 32-bits it's a wash. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-13 08:21:29 -07:00
Eric Dumazet	d01cb20711	tcp: add LAST_ACK as a valid state for TSQ Socket state LAST_ACK should allow TSQ to send additional frames, or else we rely on incoming ACKS or timers to send them. Reported-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Matt Mathis <mattmathis@google.com> Cc: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-13 05:48:36 -07:00
David S. Miller	391e5c22f5	ipv4: Remove tb_peers from fib_table. No longer used. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-12 09:39:28 -07:00
David S. Miller	f0a70e902f	ipv4: Put proper checks into icmp_socket_deliver(). All handler->err() routines expect that we've done a pskb_may_pull() test to make sure that IP header length + 8 bytes can be safely pulled. Reported-by: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-12 08:06:04 -07:00
David S. Miller	99ee038d41	ipv4: Fix warnings in ip_do_redirect() for some configurations. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-12 07:40:05 -07:00
David S. Miller	1ed5c48f23	net: Remove checks for dst_ops->redirect being NULL. No longer necessary. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-12 00:41:25 -07:00
David S. Miller	b587ee3ba2	net: Add dummy dst_ops->redirect method where needed. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-12 00:39:24 -07:00
David S. Miller	1f42539d25	ipv4: Kill ip_rt_redirect(). No longer needed, as the protocol handlers now all properly propagate the redirect back into the routing code. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-11 21:30:08 -07:00
David S. Miller	55be7a9c60	ipv4: Add redirect support to all protocol icmp error handlers. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-11 21:27:49 -07:00
David S. Miller	b42597e2f3	ipv4: Add ipv4_redirect() and ipv4_sk_redirect() helper functions. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-11 21:25:45 -07:00
David S. Miller	e47a185b31	ipv4: Generalize ip_do_redirect() and hook into new dst_ops->redirect. All of the redirect acceptance policy is now contained within. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-11 20:55:47 -07:00
David S. Miller	94206125c4	ipv4: Rearrange arguments to ip_rt_redirect() Pass in the SKB rather than just the IP addresses, so that policy and other aspects can reside in ip_rt_redirect() rather then icmp_redirect(). Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-11 20:38:08 -07:00
David S. Miller	d0da720f9f	ipv4: Pull redirect instantiation out into a helper function. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-11 20:27:54 -07:00
David S. Miller	d3351b75a7	ipv4: Deliver ICMP redirects to sockets too. And thus, we can remove the ping_err() hack. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-11 18:35:12 -07:00
David S. Miller	1de9243bbf	ipv4: Pull icmp socket delivery out into a helper function. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-11 18:32:17 -07:00
Eric Dumazet	46d3ceabd8	tcp: TCP Small Queues This introduce TSQ (TCP Small Queues) TSQ goal is to reduce number of TCP packets in xmit queues (qdisc & device queues), to reduce RTT and cwnd bias, part of the bufferbloat problem. sk->sk_wmem_alloc not allowed to grow above a given limit, allowing no more than ~128KB [1] per tcp socket in qdisc/dev layers at a given time. TSO packets are sized/capped to half the limit, so that we have two TSO packets in flight, allowing better bandwidth use. As a side effect, setting the limit to 40000 automatically reduces the standard gso max limit (65536) to 40000/2 : It can help to reduce latencies of high prio packets, having smaller TSO packets. This means we divert sock_wfree() to a tcp_wfree() handler, to queue/send following frames when skb_orphan() [2] is called for the already queued skbs. Results on my dev machines (tg3/ixgbe nics) are really impressive, using standard pfifo_fast, and with or without TSO/GSO. Without reduction of nominal bandwidth, we have reduction of buffering per bulk sender : < 1ms on Gbit (instead of 50ms with TSO) < 8ms on 100Mbit (instead of 132 ms) I no longer have 4 MBytes backlogged in qdisc by a single netperf session, and both side socket autotuning no longer use 4 Mbytes. As skb destructor cannot restart xmit itself ( as qdisc lock might be taken at this point ), we delegate the work to a tasklet. We use one tasklest per cpu for performance reasons. If tasklet finds a socket owned by the user, it sets TSQ_OWNED flag. This flag is tested in a new protocol method called from release_sock(), to eventually send new segments. [1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable [2] skb_orphan() is usually called at TX completion time, but some drivers call it in their start_xmit() handler. These drivers should at least use BQL, or else a single TCP session can still fill the whole NIC TX ring, since TSQ will have no effect. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Dave Taht <dave.taht@bufferbloat.net> Cc: Tom Herbert <therbert@google.com> Cc: Matt Mathis <mattmathis@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Nandita Dukkipati <nanditad@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-11 18:12:59 -07:00
Alexander Duyck	2100844ca9	tcp: Fix out of bounds access to tcpm_vals The recent patch "tcp: Maintain dynamic metrics in local cache." introduced an out of bounds access due to what appears to be a typo. I believe this change should resolve the issue by replacing the access to RTAX_CWND with TCP_METRIC_CWND. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-11 17:30:41 -07:00
Ben Hutchings	ae86b9e384	net: Fix non-kernel-doc comments with kernel-doc start marker Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-10 23:13:45 -07:00
Ben Hutchings	2c53040f01	net: Fix (nearly-)kernel-doc comments for various functions Fix incorrect start markers, wrapped summary lines, missing section breaks, incorrect separators, and some name mismatches. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-10 23:13:45 -07:00
David S. Miller	f185071ddf	ipv4: Remove inetpeer from routes. No longer used. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-10 22:40:18 -07:00
David S. Miller	312487313d	ipv4: Calling ->cow_metrics() now is a bug. Nothing every writes to ipv4 metrics any longer. PMTU is stored in rt->rt_pmtu. Dynamic TCP metrics are stored in a special TCP metrics cache, completely outside of the routes. Therefore ->cow_metrics() can simply nothing more than a WARN_ON trigger so we can catch anyone who tries to add new writes to ipv4 route metrics. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-10 22:40:17 -07:00
David S. Miller	2db2d67e4c	ipv4: Kill dst_copy_metrics() call from ipv4_blackhole_route(). Blackhole routes have a COW metrics operation that returns NULL always, therefore this dst_copy_metrics() call did absolutely nothing. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-10 22:40:16 -07:00
David S. Miller	710ab6c031	ipv4: Enforce max MTU metric at route insertion time. Rather than at every struct rtable creation. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-10 22:40:15 -07:00
David S. Miller	5943634fc5	ipv4: Maintain redirect and PMTU info in struct rtable again. Maintaining this in the inetpeer entries was not the right way to do this at all. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-10 22:40:14 -07:00
David S. Miller	87a50699cb	rtnetlink: Remove ts/tsage args to rtnl_put_cacheinfo(). Nobody provides non-zero values any longer. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-10 22:40:13 -07:00
David S. Miller	3e12939a2a	inet: Kill FLOWI_FLAG_PRECOW_METRICS. No longer needed. TCP writes metrics, but now in it's own special cache that does not dirty the route metrics. Therefore there is no longer any reason to pre-cow metrics in this way. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-10 22:40:12 -07:00
David S. Miller	1d861aa4b3	inet: Minimize use of cached route inetpeer. Only use it in the absolutely required cases: 1) COW'ing metrics 2) ipv4 PMTU 3) ipv4 redirects Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-10 22:40:11 -07:00
David S. Miller	16d1839907	inet: Remove ->get_peer() method. No longer used. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-10 22:40:10 -07:00
David S. Miller	b6242b9b45	tcp: Remove tw->tw_peer No longer used. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-10 22:40:09 -07:00
David S. Miller	81166dd6fa	tcp: Move timestamps from inetpeer to metrics cache. With help from Lin Ming. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-10 22:40:08 -07:00
David S. Miller	794785bf12	net: Don't report route RTT metric value in cache dumps. We don't maintain it dynamically any longer, so reporting it would be extremely misleading. Report zero instead. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-10 22:40:06 -07:00
David S. Miller	51c5d0c4b1	tcp: Maintain dynamic metrics in local cache. Maintain a local hash table of TCP dynamic metrics blobs. Computed TCP metrics are no longer maintained in the route metrics. The table uses RCU and an extremely simple hash so that it has low latency and low overhead. A simple hash is legitimate because we only make metrics blobs for fully established connections. Some tweaking of the default hash table sizes, metric timeouts, and the hash chain length limit certainly could use some tweaking. But the basic design seems sound. With help from Eric Dumazet and Joe Perches. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-10 22:39:57 -07:00
David S. Miller	ab92bb2f67	tcp: Abstract back handling peer aliveness test into helper function. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-10 20:33:49 -07:00
David S. Miller	4aabd8ef8c	tcp: Move dynamnic metrics handling into seperate file. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-10 20:31:36 -07:00
David S. Miller	d3a5ea6e21	Merge branch 'master' of git://1984.lsi.us.es/nf-next	2012-07-07 16:18:50 -07:00
David S. Miller	f4530fa574	ipv4: Avoid overhead when no custom FIB rules are installed. If the user hasn't actually installed any custom rules, or fiddled with the default ones, don't go through the whole FIB rules layer. It's just pure overhead. Instead do what we do with CONFIG_IP_MULTIPLE_TABLES disabled, check the individual tables by hand, one by one. Also, move fib_num_tclassid_users into the ipv4 network namespace. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-05 22:13:13 -07:00
Eric Dumazet	bf5e53e371	ipv4: defer fib_compute_spec_dst() call ip_options_compile() can avoid calling fib_compute_spec_dst() by default, and perform the call only if needed. David suggested to add a helper to make the call only once. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-05 03:03:32 -07:00
David S. Miller	f187bc6efb	ipv4: No need to set generic neighbour pointer. Nobody reads it any longer. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-05 02:41:59 -07:00
David S. Miller	f894cbf847	net: Add optional SKB arg to dst_ops->neigh_lookup(). Causes the handler to use the daddr in the ipv4/ipv6 header when the route gateway is unspecified (local subnet). Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-05 01:04:01 -07:00
David S. Miller	5110effee8	net: Do delayed neigh confirmation. When a dst_confirm() happens, mark the confirmation as pending in the dst. Then on the next packet out, when we have the neigh in-hand, do the update. This removes the dependency in dst_confirm() of dst's having an attached neigh. While we're here, remove the explicit 'dst' NULL check, all except 2 or 3 call sites ensure it's not NULL. So just fix those cases up. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-05 01:03:06 -07:00
David S. Miller	3c521f2ba9	ipv4: Don't report neigh uptodate state in rtcache procfs. Soon routes will not have a cached neigh attached, nor will we be able to necessarily go directly to a neigh from an arbitrary route. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-05 01:02:31 -07:00
David S. Miller	a263b30936	ipv4: Make neigh lookups directly in output packet path. Do not use the dst cached neigh, we'll be getting rid of that. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-05 01:02:12 -07:00
David S. Miller	11604721a3	ipv4: Fix crashes in ip_options_compile(). The spec_dst uses should be guarded by skb_rtable() being non-NULL not just the SKB being non-null. Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-04 16:13:17 -07:00
Pablo Neira Ayuso	08911475d1	netfilter: nf_conntrack: generalize nf_ct_l4proto_net This patch generalizes nf_ct_l4proto_net by splitting it into chunks and moving the corresponding protocol part to where it really belongs to. To clarify, note that we follow two different approaches to support per-net depending if it's built-in or run-time loadable protocol tracker. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Gao feng <gaofeng@cn.fujitsu.com>	2012-07-04 19:37:22 +02:00
Pablo Neira Ayuso	a31f2d17b3	netlink: add netlink_kernel_cfg parameter to netlink_kernel_create This patch adds the following structure: struct netlink_kernel_cfg { unsigned int groups; void (input)(struct sk_buff skb); struct mutex *cb_mutex; }; That can be passed to netlink_kernel_create to set optional configurations for netlink kernel sockets. I've populated this structure by looking for NULL and zero parameters at the existing code. The remaining parameters that always need to be set are still left in the original interface. That includes optional parameters for the netlink socket creation. This allows easy extensibility of this interface in the future. This patch also adapts all callers to use this new interface. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-29 16:46:02 -07:00
David S. Miller	7a9bc9b81a	ipv4: Elide fib_validate_source() completely when possible. If rpfilter is off (or the SKB has an IPSEC path) and there are not tclassid users, we don't have to do anything at all when fib_validate_source() is invoked besides setting the itag to zero. We monitor tclassid uses with a counter (modified only under RTNL and marked __read_mostly) and we protect the fib_validate_source() real work with a test against this counter and whether rpfilter is to be done. Having a way to know whether we need no tclassid processing or not also opens the door for future optimized rpfilter algorithms that do not perform full FIB lookups. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-29 01:36:36 -07:00
David S. Miller	3085a4b7d3	ipv4: Remove extraneous assignment of dst->tclassid. We already set it several lines above. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-28 22:17:39 -07:00
David S. Miller	9e56e3800e	ipv4: Adjust in_dev handling in fib_validate_source() Checking for in_dev being NULL is pointless. In fact, all of our callers have in_dev precomputed already, so just pass it in and remove the NULL checking. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-28 18:54:02 -07:00
David S. Miller	a207a4b2e8	ipv4: Fix bugs in fib_compute_spec_dst(). Based upon feedback from Julian Anastasov. 1) Use route flags to determine multicast/broadcast, not the packet flags. 2) Leave saddr unspecified in flow key. 3) Adjust how we invoke inet_select_addr(). Pass ip_hdr(skb)->saddr as second arg, and if it was zeronet use link scope. 4) Use loopback as input interface in flow key. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-28 18:33:58 -07:00
David S. Miller	41347dcdd8	ipv4: Kill rt->rt_spec_dst, no longer used. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-28 04:05:27 -07:00
David S. Miller	35ebf65e85	ipv4: Create and use fib_compute_spec_dst() helper. The specific destination is the host we direct unicast replies to. Usually this is the original packet source address, but if we are responding to a multicast or broadcast packet we have to use something different. Specifically we must use the source address we would use if we were to send a packet to the unicast source of the original packet. The routing cache precomputes this value, but we want to remove that precomputation because it creates a hard dependency on the expensive rpfilter source address validation which we'd like to make cheaper. There are only three places where this matters: 1) ICMP replies. 2) pktinfo CMSG 3) IP options Now there will be no real users of rt->rt_spec_dst and we can simply remove it altogether. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-28 03:59:11 -07:00
David S. Miller	70e7341673	ipv4: Show that ip_send_reply() is purely unicast routine. Rename it to ip_send_unicast_reply() and add explicit 'saddr' argument. This removed one of the few users of rt->rt_spec_dst. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-28 03:21:41 -07:00
David S. Miller	160eb5a6b1	ipv4: Kill early demux method return value. It's completely unnecessary. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-27 22:01:22 -07:00
David S. Miller	c10237e077	Revert "ipv4: tcp: dont cache unconfirmed intput dst" This reverts commit `c074da2810`. This change has several unwanted side effects: 1) Sockets will cache the DST_NOCACHE route in sk->sk_rx_dst and we'll thus never create a real cached route. 2) All TCP traffic will use DST_NOCACHE and never use the routing cache at all. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-27 17:05:06 -07:00
Eric Dumazet	22911fc581	net: skb_free_datagram_locked() doesnt drop all packets dropwatch wrongly diagnose all received UDP packets as drops. This patch removes trace_kfree_skb() done in skb_free_datagram_locked(). Locations calling skb_free_datagram_locked() should do it on their own. As a result, drops are accounted on the right function. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-27 15:40:57 -07:00
Thomas Graf	92a395e52f	ipmr: Do not use RTA_PUT() macros Also fix a needless skb tailroom check for a 4 bytes area after after each rtnexthop block. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-27 15:36:44 -07:00
Thomas Graf	6e277ed59a	inet_diag: Do not use RTA_PUT() macros Also, no need to trim on nlmsg_put() failure, nothing has been added yet. We also want to use nlmsg_end(), nlmsg_new() and nlmsg_free(). Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-27 15:36:44 -07:00
Eric Dumazet	c074da2810	ipv4: tcp: dont cache unconfirmed intput dst DDOS synflood attacks hit badly IP route cache. On typical machines, this cache is allowed to hold up to 8 Millions dst entries, 256 bytes for each, for a total of 2GB of memory. rt_garbage_collect() triggers and tries to cleanup things. Eventually route cache is disabled but machine is under fire and might OOM and crash. This patch exploits the new TCP early demux, to set a nocache boolean in case incoming TCP frame is for a not yet ESTABLISHED or TIMEWAIT socket. This 'nocache' boolean is then used in case dst entry is not found in route cache, to create an unhashed dst entry (DST_NOCACHE) SYN-cookie-ACK sent use a similar mechanism (ipv4: tcp: dont cache output dst for syncookies), so after this patch, a machine is able to absorb a DDOS synflood attack without polluting its IP route cache. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Hans Schillstrom <hans.schillstrom@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-27 15:34:24 -07:00
Gao feng	a9082b45ad	netfilter: nf_ct_icmp: add icmp_kmemdup[_compat]_sysctl_table function Split sysctl function into smaller chucks to cleanup code and prepare patches to reduce ifdef pollution. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2012-06-27 19:14:55 +02:00
Gao feng	f1caad2745	netfilter: nf_conntrack: prepare l4proto->init_net cleanup l4proto->init contain quite redundant code. We can simplify this by adding a new parameter l3proto. This patch prepares that code simplification. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2012-06-27 18:31:14 +02:00
David S. Miller	c2bd4baf41	netfilter: ipt_ULOG: Move away from NLMSG_PUT(). And use nlmsg_data() while we're here too. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-26 21:30:49 -07:00
David S. Miller	d106352d9f	inet_diag: Move away from NLMSG_PUT(). And use nlmsg_data() while we're here too, and remove useless casts. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-26 21:28:54 -07:00
David S. Miller	251da41301	ipv4: Cache ip_error() routes even when not forwarding. And account for the fact that, when we are not forwarding, we should bump statistic counters rather than emit an ICMP response. RP-filter rejected lookups are still not cached. Since -EHOSTUNREACH and -ENETUNREACH can now no longer be seen in ip_rcv_finish(), remove those checks. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-26 16:27:09 -07:00
David S. Miller	df67e6c9a6	ipv4: Remove unnecessary code from rt_check_expire(). IPv4 routing cache entries no longer use dst->expires, because the metrics, PMTU, and redirect information are stored in the inetpeer cache. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-26 00:10:09 -07:00
Vijay Subramanian	7011d0851b	tcp: Fix bug in tcp socket early demux The dest port for the call to __inet_lookup_established() in TCP early demux code is passed with the wrong endian-ness. This causes the lookup to fail leading to early demux not being used. Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-23 23:22:38 -07:00
David S. Miller	0b4a9e1a59	Merge branch 'master' of git://1984.lsi.us.es/nf-next Pablo says: ==================== The following four patches provide Netfilter fixes for the cthelper infrastructure that was recently merged mainstream, they are: * two fixes for compilation breakage with two different configurations: - CONFIG_NF_NAT=m and CONFIG_NF_CT_NETLINK=y - NF_CONNTRACK_EVENTS=n and CONFIG_NETFILTER_NETLINK_QUEUE_CT=y * two fixes for sparse warnings. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-23 17:10:10 -07:00
Eric Dumazet	7586eceb0a	ipv4: tcp: dont cache output dst for syncookies Don't cache output dst for syncookies, as this adds pressure on IP route cache and rcu subsystem for no gain. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Hans Schillstrom <hans.schillstrom@ericsson.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-22 21:47:33 -07:00
Alexander Duyck	6648bd7e0e	ipv4: Add sysctl knob to control early socket demux This change is meant to add a control for disabling early socket demux. The main motivation behind this patch is to provide an option to disable the feature as it adds an additional cost to routing that reduces overall throughput by up to 5%. For example one of my systems went from 12.1Mpps to 11.6 after the early socket demux was added. It looks like the reason for the regression is that we are now having to perform two lookups, first the one for an established socket, and then the one for the routing table. By adding this patch and toggling the value for ip_early_demux to 0 I am able to get back to the 12.1Mpps I was previously seeing. [ Move local variables in ip_rcv_finish() down into the basic block in which they are actually used. -DaveM ] Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-22 17:11:13 -07:00
Pablo Neira Ayuso	d584a61a93	netfilter: nfnetlink_queue: fix compilation with CONFIG_NF_NAT=m and CONFIG_NF_CT_NETLINK=y LD init/built-in.o net/built-in.o:(.data+0x4408): undefined reference to `nf_nat_tcp_seq_adjust' make: *** [vmlinux] Error 1 This patch adds a new pointer hook (nfq_ct_nat_hook) similar to other existing in Netfilter to solve our complicated configuration dependencies. Reported-by: Valdis Kletnieks <valdis.kletnieks@vt.edu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2012-06-22 02:49:52 +02:00
David S. Miller	fd62e09b94	tcp: Validate route interface in early demux. Otherwise we might violate reverse path filtering. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-21 14:58:10 -07:00
Eric Dumazet	da55737467	inetpeer: inetpeer_invalidate_tree() cleanup No need to use cmpxchg() in inetpeer_invalidate_tree() since we hold base lock. Also use correct rcu annotations to remove sparse errors (CONFIG_SPARSE_RCU_POINTER=y) net/ipv4/inetpeer.c:144:19: error: incompatible types in comparison expression (different address spaces) net/ipv4/inetpeer.c:149:20: error: incompatible types in comparison expression (different address spaces) net/ipv4/inetpeer.c:595:10: error: incompatible types in comparison expression (different address spaces) Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-20 14:38:55 -07:00
David S. Miller	41063e9dd1	ipv4: Early TCP socket demux. Input packet processing for local sockets involves two major demuxes. One for the route and one for the socket. But we can optimize this down to one demux for certain kinds of local sockets. Currently we only do this for established TCP sockets, but it could at least in theory be expanded to other kinds of connections. If a TCP socket is established then it's identity is fully specified. This means that whatever input route was used during the three-way handshake must work equally well for the rest of the connection since the keys will not change. Once we move to established state, we cache the receive packet's input route to use later. Like the existing cached route in sk->sk_dst_cache used for output packets, we have to check for route invalidations using dst->obsolete and dst->ops->check(). Early demux occurs outside of a socket locked section, so when a route invalidation occurs we defer the fixup of sk->sk_rx_dst until we are actually inside of established state packet processing and thus have the socket locked. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-19 21:22:05 -07:00
David S. Miller	f9242b6b28	inet: Sanitize inet{,6} protocol demux. Don't pretend that inet_protos[] and inet6_protos[] are hashes, thay are just a straight arrays. Remove all unnecessary hash masking. Document MAX_INET_PROTOS. Use RAW_HTABLE_SIZE when appropriate. Reported-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-19 18:56:21 -07:00
David S. Miller	6fac262526	ipv4: Cap ADVMSS metric in the FIB rather than the routing cache. It makes no sense to execute this limit test every time we create a routing cache entry. We can't simply error out on these things since we've silently accepted and truncated them forever. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-17 19:47:34 -07:00
David S. Miller	82f437b950	Merge branch 'master' of git://1984.lsi.us.es/nf-next Pablo says: ==================== This is the second batch of Netfilter updates for net-next. It contains the kernel changes for the new user-space connection tracking helper infrastructure. More details on this infrastructure are provides here: http://lwn.net/Articles/500196/ Still, I plan to provide some official documentation through the conntrack-tools user manual on how to setup user-space utilities for this. So far, it provides two helper in user-space, one for NFSv3 and another for Oracle/SQLnet/TNS. Yet in my TODO list. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-16 15:23:35 -07:00
Pablo Neira Ayuso	12f7a50533	netfilter: add user-space connection tracking helper infrastructure There are good reasons to supports helpers in user-space instead: * Rapid connection tracking helper development, as developing code in user-space is usually faster. * Reliability: A buggy helper does not crash the kernel. Moreover, we can monitor the helper process and restart it in case of problems. * Security: Avoid complex string matching and mangling in kernel-space running in privileged mode. Going further, we can even think about running user-space helpers as a non-root process. * Extensibility: It allows the development of very specific helpers (most likely non-standard proprietary protocols) that are very likely not to be accepted for mainline inclusion in the form of kernel-space connection tracking helpers. This patch adds the infrastructure to allow the implementation of user-space conntrack helpers by means of the new nfnetlink subsystem `nfnetlink_cthelper' and the existing queueing infrastructure (nfnetlink_queue). I had to add the new hook NF_IP6_PRI_CONNTRACK_HELPER to register ipv[4\|6]_helper which results from splitting ipv[4\|6]_confirm into two pieces. This change is required not to break NAT sequence adjustment and conntrack confirmation for traffic that is enqueued to our user-space conntrack helpers. Basic operation, in a few steps: 1) Register user-space helper by means of `nfct': nfct helper add ftp inet tcp [ It must be a valid existing helper supported by conntrack-tools ] 2) Add rules to enable the FTP user-space helper which is used to track traffic going to TCP port 21. For locally generated packets: iptables -I OUTPUT -t raw -p tcp --dport 21 -j CT --helper ftp For non-locally generated packets: iptables -I PREROUTING -t raw -p tcp --dport 21 -j CT --helper ftp 3) Run the test conntrackd in helper mode (see example files under doc/helper/conntrackd.conf conntrackd 4) Generate FTP traffic going, if everything is OK, then conntrackd should create expectations (you can check that with `conntrack': conntrack -E expect [NEW] 301 proto=6 src=192.168.1.136 dst=130.89.148.12 sport=0 dport=54037 mask-src=255.255.255.255 mask-dst=255.255.255.255 sport=0 dport=65535 master-src=192.168.1.136 master-dst=130.89.148.12 sport=57127 dport=21 class=0 helper=ftp [DESTROY] 301 proto=6 src=192.168.1.136 dst=130.89.148.12 sport=0 dport=54037 mask-src=255.255.255.255 mask-dst=255.255.255.255 sport=0 dport=65535 master-src=192.168.1.136 master-dst=130.89.148.12 sport=57127 dport=21 class=0 helper=ftp This confirms that our test helper is receiving packets including the conntrack information, and adding expectations in kernel-space. The user-space helper can also store its private tracking information in the conntrack structure in the kernel via the CTA_HELP_INFO. The kernel will consider this a binary blob whose layout is unknown. This information will be included in the information that is transfered to user-space via glue code that integrates nfnetlink_queue and ctnetlink. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2012-06-16 15:40:02 +02:00
Pablo Neira Ayuso	8c88f87cb2	netfilter: nfnetlink_queue: add NAT TCP sequence adjustment if packet mangled User-space programs that receive traffic via NFQUEUE may mangle packets. If NAT is enabled, this usually puzzles sequence tracking, leading to traffic disruptions. With this patch, nfnl_queue will make the corresponding NAT TCP sequence adjustment if: 1) The packet has been mangled, 2) the NFQA_CFG_F_CONNTRACK flag has been set, and 3) NAT is detected. There are some records on the Internet complaning about this issue: http://stackoverflow.com/questions/260757/packet-mangling-utilities-besides-iptables By now, we only support TCP since we have no helpers for DCCP or SCTP. Better to add this if we ever have some helper over those layer 4 protocols. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2012-06-16 15:09:08 +02:00
Pablo Neira Ayuso	1afc56794e	netfilter: nf_ct_helper: implement variable length helper private data This patch uses the new variable length conntrack extensions. Instead of using union nf_conntrack_help that contain all the helper private data information, we allocate variable length area to store the private helper data. This patch includes the modification of all existing helpers. It also includes a couple of include header to avoid compilation warnings. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2012-06-16 15:08:55 +02:00
David S. Miller	3639339553	ipv4: Handle PMTU in all ICMP error handlers. With ip_rt_frag_needed() removed, we have to explicitly update PMTU information in every ICMP error handler. Create two helper functions to facilitate this. 1) ipv4_sk_update_pmtu() This updates the PMTU when we have a socket context to work with. 2) ipv4_update_pmtu() Raw version, used when no socket context is available. For this interface, we essentially just pass in explicit arguments for the flow identity information we would have extracted from the socket. And you'll notice that ipv4_sk_update_pmtu() is simply implemented in terms of ipv4_update_pmtu() Note that __ip_route_output_key() is used, rather than something like ip_route_output_flow() or ip_route_output_key(). This is because we absolutely do not want to end up with a route that does IPSEC encapsulation and the like. Instead, we only want the route that would get us to the node described by the outermost IP header. Reported-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-14 22:22:07 -07:00
David S. Miller	43b03f1f6d	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: MAINTAINERS drivers/net/wireless/iwlwifi/pcie/trans.c The iwlwifi conflict was resolved by keeping the code added in 'net' that turns off the buggy chip feature. The MAINTAINERS conflict was merely overlapping changes, one change updated all the wireless web site URLs and the other changed some GIT trees to be Johannes's instead of John's. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-12 21:59:18 -07:00
Michel Machado	95603e2293	net-next: add dev_loopback_xmit() to avoid duplicate code Add dev_loopback_xmit() in order to deduplicate functions ip_dev_loopback_xmit() (in net/ipv4/ip_output.c) and ip6_dev_loopback_xmit() (in net/ipv6/ip6_output.c). I was about to reinvent the wheel when I noticed that ip_dev_loopback_xmit() and ip6_dev_loopback_xmit() do exactly what I need and are not IP-only functions, but they were not available to reuse elsewhere. ip6_dev_loopback_xmit() does not have line "skb_dst_force(skb);", but I understand that this is harmless, and should be in dev_loopback_xmit(). Signed-off-by: Michel Machado <michel@digirati.com.br> CC: "David S. Miller" <davem@davemloft.net> CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> CC: James Morris <jmorris@namei.org> CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> CC: Patrick McHardy <kaber@trash.net> CC: Eric Dumazet <edumazet@google.com> CC: Jiri Pirko <jpirko@redhat.com> CC: "Michał Mirosław" <mirq-linux@rere.qmqm.pl> CC: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-12 18:51:09 -07:00
Thomas Graf	d0daebc3d6	ipv4: Add interface option to enable routing of 127.0.0.0/8 Routing of 127/8 is tradtionally forbidden, we consider packets from that address block martian when routing and do not process corresponding ARP requests. This is a sane default but renders a huge address space practically unuseable. The RFC states that no address within the 127/8 block should ever appear on any network anywhere but it does not forbid the use of such addresses outside of the loopback device in particular. For example to address a pool of virtual guests behind a load balancer. This patch adds a new interface option 'route_localnet' enabling routing of the 127/8 address block and processing of ARP requests on a specific interface. Note that for the feature to work, the default local route covering 127/8 dev lo needs to be removed. Example: $ sysctl -w net.ipv4.conf.eth0.route_localnet=1 $ ip route del 127.0.0.0/8 dev lo table local $ ip addr add 127.1.0.1/16 dev eth0 $ ip route flush cache V2: Fix invalid check to auto flush cache (thanks davem) Signed-off-by: Thomas Graf <tgraf@suug.ch> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-12 15:25:46 -07:00
David S. Miller	67da255210	Merge branch 'master' of git://1984.lsi.us.es/net-next	2012-06-11 12:56:14 -07:00
David S. Miller	7b34ca2ac7	inet: Avoid potential NULL peer dereference. We handle NULL in rt{,6}_set_peer but then our caller will try to pass that NULL pointer into inet_putpeer() which isn't ready for it. Fix this by moving the NULL check one level up, and then remove the now unnecessary NULL check from inetpeer_ptr_set_peer(). Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-11 04:13:57 -07:00
David S. Miller	8b96d22d7a	inet: Use FIB table peer roots in routes. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-11 02:10:54 -07:00
David S. Miller	8e77327783	inet: Add inetpeer tree roots to the FIB tables. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-11 02:09:16 -07:00
David S. Miller	b48c80ece9	inet: Add family scope inetpeer flushes. This implementation can deal with having many inetpeer roots, which is a necessary prerequisite for per-FIB table rooted peer tables. Each family (AF_INET, AF_INET6) has a sequence number which we bump when we get a family invalidation request. Each peer lookup cheaply checks whether the flush sequence of the root we are using is out of date, and if so flushes it and updates the sequence number. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-11 02:09:10 -07:00
David S. Miller	46517008e1	ipv4: Kill ip_rt_frag_needed(). There is zero point to this function. It's only real substance is to perform an extremely outdated BSD4.2 ICMP check, which we can safely remove. If you really have a MTU limited link being routed by a BSD4.2 derived system, here's a nickel go buy yourself a real router. The other actions of ip_rt_frag_needed(), checking and conditionally updating the peer, are done by the per-protocol handlers of the ICMP event. TCP, UDP, et al. have a handler which will receive this event and transmit it back into the associated route via dst_ops->update_pmtu(). This simplification is important, because it eliminates the one place where we do not have a proper route context in which to make an inetpeer lookup. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-11 02:08:59 -07:00
David S. Miller	97bab73f98	inet: Hide route peer accesses behind helpers. We encode the pointer(s) into an unsigned long with one state bit. The state bit is used so we can store the inetpeer tree root to use when resolving the peer later. Later the peer roots will be per-FIB table, and this change works to facilitate that. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-11 02:08:47 -07:00
David S. Miller	c0efc887dc	inet: Pass inetpeer root into inet_getpeer*() interfaces. Otherwise we reference potentially non-existing members when ipv6 is disabled. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-09 19:12:36 -07:00
David S. Miller	56a6b248eb	inet: Consolidate inetpeer_invalidate_tree() interfaces. We only need one interface for this operation, since we always know which inetpeer root we want to flush. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-09 16:32:41 -07:00
David S. Miller	c3426b4719	inet: Initialize per-netns inetpeer roots in net/ipv{4,6}/route.c Instead of net/ipv4/inetpeer.c Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-09 16:27:05 -07:00
David S. Miller	2397849baa	[PATCH] tcp: Cache inetpeer in timewait socket, and only when necessary. Since it's guarenteed that we will access the inetpeer if we're trying to do timewait recycling and TCP options were enabled on the connection, just cache the peer in the timewait socket. In the future, inetpeer lookups will be context dependent (per routing realm), and this helps facilitate that as well. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-09 14:56:12 -07:00
David S. Miller	4670fd819e	tcp: Get rid of inetpeer special cases. The get_peer method TCP uses is full of special cases that make no sense accommodating, and it also gets in the way of doing more reasonable things here. First of all, if the socket doesn't have a usable cached route, there is no sense in trying to optimize timewait recycling. Likewise for the case where we have IP options, such as SRR enabled, that make the IP header destination address (and thus the destination address of the route key) differ from that of the connection's destination address. Just return a NULL peer in these cases, and thus we're also able to get rid of the clumsy inetpeer release logic. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-09 01:25:47 -07:00
David S. Miller	fbfe95a42e	inet: Create and use rt{,6}_get_peer_create(). There's a lot of places that open-code rt{,6}_get_peer() only because they want to set 'create' to one. So add an rt{,6}_get_peer_create() for their sake. There were also a few spots open-coding plain rt{,6}_get_peer() and those are transformed here as well. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-08 23:24:18 -07:00
Gao feng	54db0cc2ba	inetpeer: add parameter net for inet_getpeer_v4,v6 add struct net as a parameter of inet_getpeer_v[4,6], use net to replace &init_net. and modify some places to provide net for inet_getpeer_v[4,6] Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-08 14:27:23 -07:00
Gao feng	c8a627ed06	inetpeer: add namespace support for inetpeer now inetpeer doesn't support namespace,the information will be leaking across namespace. this patch move the global vars v4_peers and v6_peers to netns_ipv4 and netns_ipv6 as a field peers. add struct pernet_operations inetpeer_ops to initial pernet inetpeer data. and change family_to_base and inet_getpeer to support namespace. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-08 14:27:23 -07:00
Vincent Bernat	2d8dbb04c6	snmp: fix OutOctets counter to include forwarded datagrams RFC 4293 defines ipIfStatsOutOctets (similar definition for ipSystemStatsOutOctets): The total number of octets in IP datagrams delivered to the lower layers for transmission. Octets from datagrams counted in ipIfStatsOutTransmits MUST be counted here. And ipIfStatsOutTransmits: The total number of IP datagrams that this entity supplied to the lower layers for transmission. This includes datagrams generated locally and those forwarded by this entity. Therefore, IPSTATS_MIB_OUTOCTETS must be incremented when incrementing IPSTATS_MIB_OUTFORWDATAGRAMS. IP_UPD_PO_STATS is not used since ipIfStatsOutRequests must not include forwarded datagrams: The total number of IP datagrams that local IP user-protocols (including ICMP) supplied to IP in requests for transmission. Note that this counter does not include any datagrams counted in ipIfStatsOutForwDatagrams. Signed-off-by: Vincent Bernat <bernat@luffy.cx> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-07 14:50:56 -07:00
Alban Crequy	89a48e35f5	netfilter: ipv4, defrag: switch hook PFs to nfproto This patch is a cleanup. Use NFPROTO_* for consistency with other netfilter code. Signed-off-by: Alban Crequy <alban.crequy@collabora.co.uk> Reviewed-by: Javier Martinez Canillas <javier.martinez@collabora.co.uk> Reviewed-by: Vincent Sanders <vincent.sanders@collabora.co.uk>	2012-06-07 14:58:42 +02:00
Gao feng	8264deb818	netfilter: nf_conntrack: add namespace support for cttimeout This patch adds namespace support for cttimeout. Acked-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2012-06-07 14:58:41 +02:00
Pablo Neira Ayuso	e76d0af5e4	netfilter: nf_conntrack: remove now unused sysctl for nf_conntrack_l[3\|4]proto Since the sysctl data for l[3\|4]proto now resides in pernet nf_proto_net. We can now remove this unused fields from struct nf_contrack_l[3,4]proto. Acked-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2012-06-07 14:58:41 +02:00
Gao feng	3ea04dd3a7	netfilter: nf_ct_ipv4: add namespace support This patch adds namespace support for IPv4 protocol tracker. Acked-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2012-06-07 14:58:40 +02:00
Gao feng	4b626b9c5d	netfilter: nf_ct_icmp: add namespace support This patch adds namespace support for ICMP protocol tracker. Acked-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2012-06-07 14:58:40 +02:00
Gao feng	524a53e5ad	netfilter: nf_conntrack: prepare namespace support for l3 protocol trackers This patch prepares the namespace support for layer 3 protocol trackers. Basically, this modifies the following interfaces: * nf_ct_l3proto_[un]register_sysctl. * nf_conntrack_l3proto_[un]register. We add a new nf_ct_l3proto_net is used to get the pernet data of l3proto. This adds rhe new struct nf_ip_net that is used to store the sysctl header and l3proto_ipv4,l4proto_tcp(6),l4proto_udp(6),l4proto_icmp(v6) because the protos such tcp and tcp6 use the same data,so making nf_ip_net as a field of netns_ct is the easiest way to manager it. This patch also adds init_net to struct nf_conntrack_l3proto to initial the layer 3 protocol pernet data. Acked-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2012-06-07 14:58:39 +02:00
Gao feng	2c352f444c	netfilter: nf_conntrack: prepare namespace support for l4 protocol trackers This patch prepares the namespace support for layer 4 protocol trackers. Basically, this modifies the following interfaces: * nf_ct_[un]register_sysctl * nf_conntrack_l4proto_[un]register to include the namespace parameter. We still use init_net in this patch to prepare the ground for follow-up patches for each layer 4 protocol tracker. We add a new net_id field to struct nf_conntrack_l4proto that is used to store the pernet_operations id for each layer 4 protocol tracker. Note that AF_INET6's protocols do not need to do sysctl compat. Thus, we only register compat sysctl when l4proto.l3proto != AF_INET6. Acked-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2012-06-07 14:58:39 +02:00
David S. Miller	c1864cfb80	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	2012-06-06 15:06:41 -07:00
Eric Dumazet	55432d2b54	inetpeer: fix a race in inetpeer_gc_worker() commit `5faa5df1fa` (inetpeer: Invalidate the inetpeer tree along with the routing cache) added a race : Before freeing an inetpeer, we must respect a RCU grace period, and make sure no user will attempt to increase refcnt. inetpeer_invalidate_tree() waits for a RCU grace period before inserting inetpeer tree into gc_list and waking the worker. At that time, no concurrent lookup can find a inetpeer in this tree. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-06 10:45:15 -07:00
Joe Perches	e3192690a3	net: Remove casts to same type Adding casts of objects to the same type is unnecessary and confusing for a human reader. For example, this cast: int y; int p = (int )&y; I used the coccinelle script below to find and remove these unnecessary casts. I manually removed the conversions this script produces of casts with __force and __user. @@ type T; T p; @@ - (T )p + p Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-04 11:45:11 -04:00
Eric Dumazet	5d0ba55b64	net: use consume_skb() in place of kfree_skb() Remove some dropwatch/drop_monitor false positives. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-04 11:27:40 -04:00
Eric Dumazet	4aea39c11c	tcp: tcp_make_synack() consumes dst parameter tcp_make_synack() clones the dst, and callers release it. We can avoid two atomic operations per SYNACK if tcp_make_synack() consumes dst instead of cloning it. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-04 11:27:39 -04:00
Eric Dumazet	90ba9b1986	tcp: tcp_make_synack() can use alloc_skb() There is no value using sock_wmalloc() in tcp_make_synack(). A listener socket only sends SYNACK packets, they are not queued in a socket queue, only in Qdisc and device layers, so the number of in flight packets is limited in these layers. We used sock_wmalloc() with the %force parameter set to 1 to ignore socket limits anyway. This patch removes two atomic operations per SYNACK packet. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-04 11:27:39 -04:00
Eric Dumazet	fff3269907	tcp: reflect SYN queue_mapping into SYNACK packets While testing how linux behaves on SYNFLOOD attack on multiqueue device (ixgbe), I found that SYNACK messages were dropped at Qdisc level because we send them all on a single queue. Obvious choice is to reflect incoming SYN packet @queue_mapping to SYNACK packet. Under stress, my machine could only send 25.000 SYNACK per second (for 200.000 incoming SYN per second). NIC : ixgbe with 16 rx/tx queues. After patch, not a single SYNACK is dropped. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Hans Schillstrom <hans.schillstrom@ericsson.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-01 14:22:11 -04:00
Eric Dumazet	7433819a1e	tcp: do not create inetpeer on SYNACK message Another problem on SYNFLOOD/DDOS attack is the inetpeer cache getting larger and larger, using lots of memory and cpu time. tcp_v4_send_synack() ->inet_csk_route_req() ->ip_route_output_flow() ->rt_set_nexthop() ->rt_init_metrics() ->inet_getpeer( create = true) This is a side effect of commit `a4daad6b09` (net: Pre-COW metrics for TCP) added in 2.6.39 Possible solution : Instruct inet_csk_route_req() to remove FLOWI_FLAG_PRECOW_METRICS Before patch : # grep peer /proc/slabinfo inet_peer_cache 4175430 4175430 192 42 2 : tunables 0 0 0 : slabdata 99415 99415 0 Samples: 41K of event 'cycles', Event count (approx.): 30716565122 + 20,24% ksoftirqd/0 [kernel.kallsyms] [k] inet_getpeer + 8,19% ksoftirqd/0 [kernel.kallsyms] [k] peer_avl_rebalance.isra.1 + 4,81% ksoftirqd/0 [kernel.kallsyms] [k] sha_transform + 3,64% ksoftirqd/0 [kernel.kallsyms] [k] fib_table_lookup + 2,36% ksoftirqd/0 [ixgbe] [k] ixgbe_poll + 2,16% ksoftirqd/0 [kernel.kallsyms] [k] __ip_route_output_key + 2,11% ksoftirqd/0 [kernel.kallsyms] [k] kernel_map_pages + 2,11% ksoftirqd/0 [kernel.kallsyms] [k] ip_route_input_common + 2,01% ksoftirqd/0 [kernel.kallsyms] [k] __inet_lookup_established + 1,83% ksoftirqd/0 [kernel.kallsyms] [k] md5_transform + 1,75% ksoftirqd/0 [kernel.kallsyms] [k] check_leaf.isra.9 + 1,49% ksoftirqd/0 [kernel.kallsyms] [k] ipt_do_table + 1,46% ksoftirqd/0 [kernel.kallsyms] [k] hrtimer_interrupt + 1,45% ksoftirqd/0 [kernel.kallsyms] [k] kmem_cache_alloc + 1,29% ksoftirqd/0 [kernel.kallsyms] [k] inet_csk_search_req + 1,29% ksoftirqd/0 [kernel.kallsyms] [k] __netif_receive_skb + 1,16% ksoftirqd/0 [kernel.kallsyms] [k] copy_user_generic_string + 1,15% ksoftirqd/0 [kernel.kallsyms] [k] kmem_cache_free + 1,02% ksoftirqd/0 [kernel.kallsyms] [k] tcp_make_synack + 0,93% ksoftirqd/0 [kernel.kallsyms] [k] _raw_spin_lock_bh + 0,87% ksoftirqd/0 [kernel.kallsyms] [k] __call_rcu + 0,84% ksoftirqd/0 [kernel.kallsyms] [k] rt_garbage_collect + 0,84% ksoftirqd/0 [kernel.kallsyms] [k] fib_rules_lookup Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Hans Schillstrom <hans.schillstrom@ericsson.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-01 14:22:11 -04:00
Linus Torvalds	13199a0845	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull networking changes from David S. Miller: 1) Fix IPSEC header length calculation for transport mode in ESP. The issue is whether to do the calculation before or after alignment. Fix from Benjamin Poirier. 2) Fix regression in IPV6 IPSEC fragment length calculations, from Gao Feng. This is another transport vs tunnel mode issue. 3) Handle AF_UNSPEC connect()s properly in L2TP to avoid OOPSes. Fix from James Chapman. 4) Fix USB ASIX driver's reception of full sized VLAN packets, from Eric Dumazet. 5) Allow drop monitor (and, more generically, all generic netlink protocols) to be automatically loaded as a module. From Neil Horman. Fix up trivial conflict in Documentation/feature-removal-schedule.txt due to new entries added next to each other at the end. As usual. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (38 commits) net/smsc911x: Repair broken failure paths virtio-net: remove useless disable on freeze netdevice: Update netif_dbg for CONFIG_DYNAMIC_DEBUG drop_monitor: Add module alias to enable automatic module loading genetlink: Build a generic netlink family module alias net: add MODULE_ALIAS_NET_PF_PROTO_NAME r6040: Do a Proper deinit at errorpath and also when driver unloads (calling r6040_remove_one) r6040: disable pci device if the subsequent calls (after pci_enable_device) fails skb: avoid unnecessary reallocations in __skb_cow net: sh_eth: fix the rxdesc pointer when rx descriptor empty happens asix: allow full size 8021Q frames to be received rds_rdma: don't assume infiniband device is PCI l2tp: fix oops in L2TP IP sockets for connect() AF_UNSPEC case mac80211: fix ADDBA declined after suspend with wowlan wlcore: fix undefined symbols when CONFIG_PM is not defined mac80211: fix flag check for QoS NOACK frames ath9k_hw: apply internal regulator settings on AR933x ath9k_hw: update AR933x initvals to fix issues with high power devices ath9k: fix a use-after-free-bug when ath_tx_setup_buffer() fails ath9k: stop rx dma before stopping tx ...	2012-05-31 10:32:36 -07:00
Glauber Costa	3f13461939	memcg: decrement static keys at real destroy time We call the destroy function when a cgroup starts to be removed, such as by a rmdir event. However, because of our reference counters, some objects are still inflight. Right now, we are decrementing the static_keys at destroy() time, meaning that if we get rid of the last static_key reference, some objects will still have charges, but the code to properly uncharge them won't be run. This becomes a problem specially if it is ever enabled again, because now new charges will be added to the staled charges making keeping it pretty much impossible. We just need to be careful with the static branch activation: since there is no particular preferred order of their activation, we need to make sure that we only start using it after all call sites are active. This is achieved by having a per-memcg flag that is only updated after static_key_slow_inc() returns. At this time, we are sure all sites are active. This is made per-memcg, not global, for a reason: it also has the effect of making socket accounting more consistent. The first memcg to be limited will trigger static_key() activation, therefore, accounting. But all the others will then be accounted no matter what. After this patch, only limited memcgs will have its sockets accounted. [akpm@linux-foundation.org: move enum sock_flag_bits into sock.h, document enum sock_flag_bits, convert memcg_proto_active() and memcg_proto_activated() to test_bit(), redo tcp_update_limit() comment to 80 cols] Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Acked-by: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-05-29 16:22:28 -07:00
Benjamin Poirier	91657eafb6	xfrm: take net hdr len into account for esp payload size calculation Corrects the function that determines the esp payload size. The calculations done in esp{4,6}_get_mtu() lead to overlength frames in transport mode for certain mtu values and suboptimal frames for others. According to what is done, mainly in esp{,6}_output() and tcp_mtu_to_mss(), net_header_len must be taken into account before doing the alignment calculation. Signed-off-by: Benjamin Poirier <bpoirier@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-05-27 01:08:29 -04:00
Linus Torvalds	28f3d71761	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull more networking updates from David Miller: "Ok, everything from here on out will be bug fixes." 1) One final sync of wireless and bluetooth stuff from John Linville. These changes have all been in his tree for more than a week, and therefore have had the necessary -next exposure. John was just away on a trip and didn't have a change to send the pull request until a day or two ago. 2) Put back some defines in user exposed header file areas that were removed during the tokenring purge. From Stephen Hemminger and Paul Gortmaker. 3) A bug fix for UDP hash table allocation got lost in the pile due to one of those "you got it.. no I've got it.." situations. :-) From Tim Bird. 4) SKB coalescing in TCP needs to have stricter checks, otherwise we'll try to coalesce overlapping frags and crash. Fix from Eric Dumazet. 5) RCU routing table lookups can race with free_fib_info(), causing crashes when we deref the device pointers in the route. Fix by releasing the net device in the RCU callback. From Yanmin Zhang. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (293 commits) tcp: take care of overlaps in tcp_try_coalesce() ipv4: fix the rcu race between free_fib_info and ip_route_output_slow mm: add a low limit to alloc_large_system_hash ipx: restore token ring define to include/linux/ipx.h if: restore token ring ARP type to header xen: do not disable netfront in dom0 phy/micrel: Fix ID of KSZ9021 mISDN: Add X-Tensions USB ISDN TA XC-525 gianfar:don't add FCB length to hard_header_len Bluetooth: Report proper error number in disconnection Bluetooth: Create flags for bt_sk() Bluetooth: report the right security level in getsockopt Bluetooth: Lock the L2CAP channel when sending Bluetooth: Restore locking semantics when looking up L2CAP channels Bluetooth: Fix a redundant and problematic incoming MTU check Bluetooth: Add support for Foxconn/Hon Hai AR5BBU22 0489:E03C Bluetooth: Fix EIR data generation for mgmt_device_found Bluetooth: Fix Inquiry with RSSI event mask Bluetooth: improve readability of l2cap_seq_list code Bluetooth: Fix skb length calculation ...	2012-05-24 11:54:29 -07:00
Eric Dumazet	1ca7ee3063	tcp: take care of overlaps in tcp_try_coalesce() Sergio Correia reported following warning : WARNING: at net/ipv4/tcp.c:1301 tcp_cleanup_rbuf+0x4f/0x110() WARN(skb && !before(tp->copied_seq, TCP_SKB_CB(skb)->end_seq), "cleanup rbuf bug: copied %X seq %X rcvnxt %X\n", tp->copied_seq, TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt); It appears TCP coalescing, and more specifically commit `b081f85c29` (net: implement tcp coalescing in tcp_queue_rcv()) should take care of possible segment overlaps in receive queue. This was properly done in the case of out_or_order_queue by the caller. For example, segment at tail of queue have sequence 1000-2000, and we add a segment with sequence 1500-2500. This can happen in case of retransmits. In this case, just don't do the coalescing. Reported-by: Sergio Correia <lists@uece.net> Signed-off-by: Eric Dumazet <edumazet@google.com> Tested-by: Sergio Correia <lists@uece.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-05-24 00:28:21 -04:00
Yanmin Zhang	e49cc0da72	ipv4: fix the rcu race between free_fib_info and ip_route_output_slow We hit a kernel OOPS. <3>[23898.789643] BUG: sleeping function called from invalid context at /data/buildbot/workdir/ics/hardware/intel/linux-2.6/arch/x86/mm/fault.c:1103 <3>[23898.862215] in_atomic(): 0, irqs_disabled(): 0, pid: 10526, name: Thread-6683 <4>[23898.967805] HSU serial 0000:00:05.1: 0000:00:05.2:HSU serial prevented me to suspend... <4>[23899.258526] Pid: 10526, comm: Thread-6683 Tainted: G W 3.0.8-137685-ge7742f9 #1 <4>[23899.357404] HSU serial 0000:00:05.1: 0000:00:05.2:HSU serial prevented me to suspend... <4>[23899.904225] Call Trace: <4>[23899.989209] [<c1227f50>] ? pgtable_bad+0x130/0x130 <4>[23900.000416] [<c1238c2a>] __might_sleep+0x10a/0x110 <4>[23900.007357] [<c1228021>] do_page_fault+0xd1/0x3c0 <4>[23900.013764] [<c18e9ba9>] ? restore_all+0xf/0xf <4>[23900.024024] [<c17c007b>] ? napi_complete+0x8b/0x690 <4>[23900.029297] [<c1227f50>] ? pgtable_bad+0x130/0x130 <4>[23900.123739] [<c1227f50>] ? pgtable_bad+0x130/0x130 <4>[23900.128955] [<c18ea0c3>] error_code+0x5f/0x64 <4>[23900.133466] [<c1227f50>] ? pgtable_bad+0x130/0x130 <4>[23900.138450] [<c17f6298>] ? __ip_route_output_key+0x698/0x7c0 <4>[23900.144312] [<c17f5f8d>] ? __ip_route_output_key+0x38d/0x7c0 <4>[23900.150730] [<c17f63df>] ip_route_output_flow+0x1f/0x60 <4>[23900.156261] [<c181de58>] ip4_datagram_connect+0x188/0x2b0 <4>[23900.161960] [<c18e981f>] ? _raw_spin_unlock_bh+0x1f/0x30 <4>[23900.167834] [<c18298d6>] inet_dgram_connect+0x36/0x80 <4>[23900.173224] [<c14f9e88>] ? _copy_from_user+0x48/0x140 <4>[23900.178817] [<c17ab9da>] sys_connect+0x9a/0xd0 <4>[23900.183538] [<c132e93c>] ? alloc_file+0xdc/0x240 <4>[23900.189111] [<c123925d>] ? sub_preempt_count+0x3d/0x50 Function free_fib_info resets nexthop_nh->nh_dev to NULL before releasing fi. Other cpu might be accessing fi. Fixing it by delaying the releasing. With the patch, we ran MTBF testing on Android mobile for 12 hours and didn't trigger the issue. Thank Eric for very detailed review/checking the issue. Signed-off-by: Yanmin Zhang <yanmin_zhang@linux.intel.com> Signed-off-by: Kun Jiang <kunx.jiang@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-05-24 00:28:21 -04:00
Tim Bird	31fe62b958	mm: add a low limit to alloc_large_system_hash UDP stack needs a minimum hash size value for proper operation and also uses alloc_large_system_hash() for proper NUMA distribution of its hash tables and automatic sizing depending on available system memory. On some low memory situations, udp_table_init() must ignore the alloc_large_system_hash() result and reallocs a bigger memory area. As we cannot easily free old hash table, we leak it and kmemleak can issue a warning. This patch adds a low limit parameter to alloc_large_system_hash() to solve this problem. We then specify UDP_HTABLE_SIZE_MIN for UDP/UDPLite hash table allocation. Reported-by: Mark Asselstine <mark.asselstine@windriver.com> Reported-by: Tim Bird <tim.bird@am.sony.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-05-24 00:28:21 -04:00
Linus Torvalds	644473e9c6	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull user namespace enhancements from Eric Biederman: "This is a course correction for the user namespace, so that we can reach an inexpensive, maintainable, and reasonably complete implementation. Highlights: - Config guards make it impossible to enable the user namespace and code that has not been converted to be user namespace safe. - Use of the new kuid_t type ensures the if you somehow get past the config guards the kernel will encounter type errors if you enable user namespaces and attempt to compile in code whose permission checks have not been updated to be user namespace safe. - All uids from child user namespaces are mapped into the initial user namespace before they are processed. Removing the need to add an additional check to see if the user namespace of the compared uids remains the same. - With the user namespaces compiled out the performance is as good or better than it is today. - For most operations absolutely nothing changes performance or operationally with the user namespace enabled. - The worst case performance I could come up with was timing 1 billion cache cold stat operations with the user namespace code enabled. This went from 156s to 164s on my laptop (or 156ns to 164ns per stat operation). - (uid_t)-1 and (gid_t)-1 are reserved as an internal error value. Most uid/gid setting system calls treat these value specially anyway so attempting to use -1 as a uid would likely cause entertaining failures in userspace. - If setuid is called with a uid that can not be mapped setuid fails. I have looked at sendmail, login, ssh and every other program I could think of that would call setuid and they all check for and handle the case where setuid fails. - If stat or a similar system call is called from a context in which we can not map a uid we lie and return overflowuid. The LFS experience suggests not lying and returning an error code might be better, but the historical precedent with uids is different and I can not think of anything that would break by lying about a uid we can't map. - Capabilities are localized to the current user namespace making it safe to give the initial user in a user namespace all capabilities. My git tree covers all of the modifications needed to convert the core kernel and enough changes to make a system bootable to runlevel 1." Fix up trivial conflicts due to nearby independent changes in fs/stat.c * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (46 commits) userns: Silence silly gcc warning. cred: use correct cred accessor with regards to rcu read lock userns: Convert the move_pages, and migrate_pages permission checks to use uid_eq userns: Convert cgroup permission checks to use uid_eq userns: Convert tmpfs to use kuid and kgid where appropriate userns: Convert sysfs to use kgid/kuid where appropriate userns: Convert sysctl permission checks to use kuid and kgids. userns: Convert proc to use kuid/kgid where appropriate userns: Convert ext4 to user kuid/kgid where appropriate userns: Convert ext3 to use kuid/kgid where appropriate userns: Convert ext2 to use kuid/kgid where appropriate. userns: Convert devpts to use kuid/kgid where appropriate userns: Convert binary formats to use kuid/kgid where appropriate userns: Add negative depends on entries to avoid building code that is userns unsafe userns: signal remove unnecessary map_cred_ns userns: Teach inode_capable to understand inodes whose uids map to other namespaces. userns: Fail exec for suid and sgid binaries with ids outside our user namespace. userns: Convert stat to return values mapped from kuids and kgids userns: Convert user specfied uids and gids in chown into kuids and kgid userns: Use uid_eq gid_eq helpers when comparing kuids and kgids in the vfs ...	2012-05-23 17:42:39 -07:00
Linus Torvalds	88d6ae8dc3	Merge branch 'for-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "cgroup file type addition / removal is updated so that file types are added and removed instead of individual files so that dynamic file type addition / removal can be implemented by cgroup and used by controllers. blkio controller changes which will come through block tree are dependent on this. Other changes include res_counter cleanup and disallowing kthread / PF_THREAD_BOUND threads to be attached to non-root cgroups. There's a reported bug with the file type addition / removal handling which can lead to oops on cgroup umount. The issue is being looked into. It shouldn't cause problems for most setups and isn't a security concern." Fix up trivial conflict in Documentation/feature-removal-schedule.txt * 'for-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits) res_counter: Account max_usage when calling res_counter_charge_nofail() res_counter: Merge res_counter_charge and res_counter_charge_nofail cgroups: disallow attaching kthreadd or PF_THREAD_BOUND threads cgroup: remove cgroup_subsys->populate() cgroup: get rid of populate for memcg cgroup: pass struct mem_cgroup instead of struct cgroup to socket memcg cgroup: make css->refcnt clearing on cgroup removal optional cgroup: use negative bias on css->refcnt to block css_tryget() cgroup: implement cgroup_rm_cftypes() cgroup: introduce struct cfent cgroup: relocate __d_cgrp() and __d_cft() cgroup: remove cgroup_add_file[s]() cgroup: convert memcg controller to the new cftype interface memcg: always create memsw files if CONFIG_CGROUP_MEM_RES_CTLR_SWAP cgroup: convert all non-memcg controllers to the new cftype interface cgroup: relocate cftype and cgroup_subsys definitions in controllers cgroup: merge cft_release_agent cftype array into the base files array cgroup: implement cgroup_add_cftypes() and friends cgroup: build list of all cgroups under a given cgroupfs_root cgroup: move cgroup_clear_directory() call out of cgroup_populate_dir() ...	2012-05-22 17:40:19 -07:00
David S. Miller	17eea0df5f	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	2012-05-20 21:53:04 -04:00
Eldad Zack	413c27d869	net/ipv4: replace simple_strtoul with kstrtoul Replace simple_strtoul with kstrtoul in three similar occurrences, all setup handlers: * route.c: set_rhash_entries * tcp.c: set_thash_entries * udp.c: set_uhash_entries Also check if the conversion failed. Signed-off-by: Eldad Zack <eldad@fogrefinery.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-05-20 04:06:17 -04:00
Eldad Zack	b37f4d7b01	net/ipv4/ipconfig: neaten __setup placement The __setup macro should follow the corresponding setup handler. Signed-off-by: Eldad Zack <eldad@fogrefinery.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-05-20 04:06:16 -04:00
Eric Dumazet	3cc4949269	ipv4: use skb coalescing in defragmentation ip_frag_reasm() can use skb_try_coalesce() to build optimized skb, reducing memory used by them (truesize), and reducing number of cache line misses and overhead for the consumer. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-05-19 18:34:57 -04:00
Eric Dumazet	bad43ca832	net: introduce skb_try_coalesce() Move tcp_try_coalesce() protocol independent part to skb_try_coalesce(). skb_try_coalesce() can be used in IPv4 defrag and IPv6 reassembly, to build optimized skbs (less sk_buff, and possibly less 'headers') skb_try_coalesce() is zero copy, unless the copy can fit in destination header (its a rare case) kfree_skb_partial() is also moved to net/core/skbuff.c and exported, because IPv6 will need it in patch (ipv6: use skb coalescing in reassembly). Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-05-19 18:34:57 -04:00
Eric Dumazet	cbc264cacd	ip_frag: struct inet_frags match() method returns a bool - match() method returns a boolean - return (A && B && C && D) -> return A && B && C && D - fix indentation Signed-off-by: Eric Dumazet <edumazet@google.com>	2012-05-18 01:40:27 -04:00
Willy Tarreau	bad115cfe5	tcp: do_tcp_sendpages() must try to push data out on oom conditions Since recent changes on TCP splicing (starting with commits `2f533844` "tcp: allow splice() to build full TSO packets" and `35f9c09f` "tcp: tcp_sendpages() should call tcp_push() once"), I started seeing massive stalls when forwarding traffic between two sockets using splice() when pipe buffers were larger than socket buffers. Latest changes (net: netdev_alloc_skb() use build_skb()) made the problem even more apparent. The reason seems to be that if do_tcp_sendpages() fails on out of memory condition without being able to send at least one byte, tcp_push() is not called and the buffers cannot be flushed. After applying the attached patch, I cannot reproduce the stalls at all and the data rate it perfectly stable and steady under any condition which previously caused the problem to be permanent. The issue seems to have been there since before the kernel migrated to git, which makes me think that the stalls I occasionally experienced with tux during stress-tests years ago were probably related to the same issue. This issue was first encountered on 3.0.31 and 3.2.17, so please backport to -stable. Signed-off-by: Willy Tarreau <w@1wt.eu> Acked-by: Eric Dumazet <edumazet@google.com> Cc: <stable@vger.kernel.org>	2012-05-17 18:31:43 -04:00
Eric Dumazet	a2a385d627	tcp: bool conversions bool conversions where possible. __inline__ -> inline space cleanups Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-05-17 14:59:59 -04:00
Eric Dumazet	dc6b9b7823	net: include/net/sock.h cleanup bool/const conversions where possible __inline__ -> inline space cleanups Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-05-17 04:50:21 -04:00
David S. Miller	028940342a	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	2012-05-16 22:17:37 -04:00
David S. Miller	c727e7f007	Merge branch 'delete-tokenring' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux	2012-05-16 01:02:40 -04:00
Joe Perches	91df42bedc	net: ipv4 and ipv6: Convert printk(KERN_DEBUG to pr_debug Use the current debugging style and enable dynamic_debug. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-05-16 01:01:03 -04:00
Paul Gortmaker	211ed86510	net: delete all instances of special processing for token ring We are going to delete the Token ring support. This removes any special processing in the core networking for token ring, (aside from net/tr.c itself), leaving the drivers and remaining tokenring support present but inert. The mass removal of the drivers and net/tr.c will be in a separate commit, so that the history of these files that we still care about won't have the giant deletion tied into their history. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>	2012-05-15 20:14:35 -04:00
Joe Perches	e87cc4728f	net: Convert net_ratelimit uses to net_<level>_ratelimited Standardize the net core ratelimited logging functions. Coalesce formats, align arguments. Change a printk then vprintk sequence to use printf extension %pV. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-05-15 13:45:03 -04:00
Jan Beulich	7e15252498	xfrm: make xfrm_algo.c a module By making this a standalone config option (auto-selected as needed), selecting CRYPTO from here rather than from XFRM (which is boolean) allows the core crypto code to become a module again even when XFRM=y. Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-05-15 13:13:34 -04:00
Pavel Emelyanov	1070b1b831	tcp: Out-line tcp_try_rmem_schedule As proposed by Eric, make the tcp_input.o thinner. add/remove: 1/1 grow/shrink: 1/4 up/down: 868/-1329 (-461) function old new delta tcp_try_rmem_schedule - 864 +864 tcp_ack 4811 4815 +4 tcp_validate_incoming 817 815 -2 tcp_collapse 860 858 -2 tcp_send_rcvq 555 353 -202 tcp_data_queue 3435 3033 -402 tcp_prune_queue 721 - -721 Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-05-10 23:24:36 -04:00
Pavel Emelyanov	3c961afed4	tcp: Schedule rmem for rcvq repair send As noted by Eric, no checks are performed on the data size we're putting in the read queue during repair. Thus, validate the given data size with the common rmem management routine. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-05-10 23:24:35 -04:00
Pavel Emelyanov	292e8d8c85	tcp: Move rcvq sending to tcp_input.c It actually works on the input queue and will use its read mem routines, thus it's better to have in in the tcp_input.c file. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-05-10 23:24:35 -04:00
David S. Miller	dccd9ecc37	ipv4: Do not use dead fib_info entries. Due to RCU lookups and RCU based release, fib_info objects can be found during lookup which have fi->fib_dead set. We must ignore these entries, otherwise we risk dereferencing the parts of the entry which are being torn down. Reported-by: Yevgen Pronenko <yevgen.pronenko@sonymobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-05-10 22:16:32 -04:00
Pablo Neira Ayuso	d16cf20e2f	netfilter: remove ip_queue support This patch removes ip_queue support which was marked as obsolete years ago. The nfnetlink_queue modules provides more advanced user-space packet queueing mechanism. This patch also removes capability code included in SELinux that refers to ip_queue. Otherwise, we break compilation. Several warning has been sent regarding this to the mailing list in the past month without anyone rising the hand to stop this with some strong argument. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2012-05-08 20:25:42 +02:00
David S. Miller	0d6c4a2e46	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/ethernet/intel/e1000e/param.c drivers/net/wireless/iwlwifi/iwl-agn-rx.c drivers/net/wireless/iwlwifi/iwl-trans-pcie-rx.c drivers/net/wireless/iwlwifi/iwl-trans.h Resolved the iwlwifi conflict with mainline using 3-way diff posted by John Linville and Stephen Rothwell. In 'net' we added a bug fix to make iwlwifi report a more accurate skb->truesize but this conflicted with RX path changes that happened meanwhile in net-next. In e1000e a conflict arose in the validation code for settings of adapter->itr. 'net-next' had more sophisticated logic so that logic was used. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-05-07 23:35:40 -04:00
Jiri Pirko	3a084ddb4b	net: IP_MULTICAST_IF setsockopt now recognizes struct mreq Until now, struct mreq has not been recognized and it was worked with as with struct in_addr. That means imr_multiaddr was copied to imr_address. So do recognize struct mreq here and copy that correctly. Signed-off-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-05-07 23:03:22 -04:00
Eric Dumazet	bd14b1b2e2	tcp: be more strict before accepting ECN negociation It appears some networks play bad games with the two bits reserved for ECN. This can trigger false congestion notifications and very slow transferts. Since RFC 3168 (6.1.1) forbids SYN packets to carry CT bits, we can disable TCP ECN negociation if it happens we receive mangled CT bits in the SYN packet. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Perry Lorier <perryl@google.com> Cc: Matt Mathis <mattmathis@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Wilmer van der Gaast <wilmer@google.com> Cc: Ankur Jain <jankur@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Dave Täht <dave.taht@bufferbloat.net> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-05-04 12:05:27 -04:00
Alexander Duyck	3a7c1ee4ab	skb: Add skb_head_is_locked helper function This patch adds support for a skb_head_is_locked helper function. It is meant to be used any time we are considering transferring the head from skb->head to a paged frag. If the head is locked it means we cannot remove the head from the skb so it must be copied or we must take the skb as a whole. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-05-03 13:18:37 -04:00
Eric W. Biederman	ae2975bc34	userns: Convert group_info values from gid_t to kgid_t. As a first step to converting struct cred to be all kuid_t and kgid_t values convert the group values stored in group_info to always be kgid_t values. Unless user namespaces are used this change should have no effect. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2012-05-03 03:27:21 -07:00
Alexander Duyck	34a802a5b9	tcp: move stats merge to the end of tcp_try_coalesce This change cleans up the last bits of tcp_try_coalesce so that we only need one goto which jumps to the end of the function. The idea is to make the code more readable by putting things in a linear order so that we start execution at the top of the function, and end it at the bottom. I also made a slight tweak to the code for handling frags when we are a clone. Instead of making it an if (clone) loop else nr_frags = 0 I changed the logic so that if (!clone) we just set the number of frags to 0 which disables the for loop anyway. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-05-03 04:21:33 -04:00
Alexander Duyck	57b55a7ec6	tcp: Move code related to head frag in tcp_try_coalesce This change reorders the code related to the use of an skb->head_frag so it is placed before we check the rest of the frags. This allows the code to read more linearly instead of like some sort of loop. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-05-03 04:21:33 -04:00
Alexander Duyck	c73c3d9c49	tcp: Fix truesize accounting in tcp_try_coalesce This patch addresses several issues in the way we were tracking the truesize in tcp_try_coalesce. First it was using ksize which prevents us from having a 0 sized head frag and getting a usable result. To resolve that this patch uses the end pointer which is set based off either ksize, or the frag_size supplied in build_skb. This allows us to compute the original truesize of the entire buffer and remove that value leaving us with just what was added as pages. The second issue was the use of skb->len if there is a mergeable head frag. We should only need to remove the size of an data aligned sk_buff from our current skb->truesize to compute the delta for a buffer with a reused head. By using skb->len the value of truesize was being artificially reduced which means that head frags could use more memory than buffers using standard allocations. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-05-03 04:21:33 -04:00
Alexander Duyck	2996d31f9f	net: Stop decapitating clones that have a head_frag This change is meant ot prevent stealing the skb->head to use as a page in the event that the skb->head was cloned. This allows the other clones to track each other via shinfo->dataref. Without this we break down to two methods for tracking the reference count, one being dataref, the other being the page count. As a result it becomes difficult to track how many references there are to skb->head. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-05-03 01:34:37 -04:00
Eric Dumazet	b081f85c29	net: implement tcp coalescing in tcp_queue_rcv() Extend tcp coalescing implementing it from tcp_queue_rcv(), the main receiver function when application is not blocked in recvmsg(). Function tcp_queue_rcv() is moved a bit to allow its call from tcp_data_queue() This gives good results especially if GRO could not kick, and if skb head is a fragment. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Alexander Duyck <alexander.h.duyck@intel.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-05-02 21:11:11 -04:00
Eric Dumazet	923dd347b8	net: take care of cloned skbs in tcp_try_coalesce() Before stealing fragments or skb head, we must make sure skbs are not cloned. Alexander was worried about destination skb being cloned : In bridge setups, a driver could be fooled if skb->data_len would not match skb nr_frags. If source skb is cloned, we must take references on pages instead. Bug happened using tcpdump (if not using mmap()) Introduce kfree_skb_partial() helper to cleanup code. Reported-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-05-02 21:11:11 -04:00
Eric Dumazet	b49960a05e	tcp: change tcp_adv_win_scale and tcp_rmem[2] tcp_adv_win_scale default value is 2, meaning we expect a good citizen skb to have skb->len / skb->truesize ratio of 75% (3/4) In 2.6 kernels we (mis)accounted for typical MSS=1460 frame : 1536 + 64 + 256 = 1856 'estimated truesize', and 1856 * 3/4 = 1392. So these skbs were considered as not bloated. With recent truesize fixes, a typical MSS=1460 frame truesize is now the more precise : 2048 + 256 = 2304. But 2304 * 3/4 = 1728. So these skb are not good citizen anymore, because 1460 < 1728 (GRO can escape this problem because it build skbs with a too low truesize.) This also means tcp advertises a too optimistic window for a given allocated rcvspace : When receiving frames, sk_rmem_alloc can hit sk_rcvbuf limit and we call tcp_prune_queue()/tcp_collapse() too often, especially when application is slow to drain its receive queue or in case of losses (netperf is fast, scp is slow). This is a major latency source. We should adjust the len/truesize ratio to 50% instead of 75% This patch : 1) changes tcp_adv_win_scale default to 1 instead of 2 2) increase tcp_rmem[2] limit from 4MB to 6MB to take into account better truesize tracking and to allow autotuning tcp receive window to reach same value than before. Note that same amount of kernel memory is consumed compared to 2.6 kernels. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-05-02 21:08:58 -04:00
Yuchung Cheng	750ea2bafa	tcp: early retransmit: delayed fast retransmit Implementing the advanced early retransmit (sysctl_tcp_early_retrans==2). Delays the fast retransmit by an interval of RTT/4. We borrow the RTO timer to implement the delay. If we receive another ACK or send a new packet, the timer is cancelled and restored to original RTO value offset by time elapsed. When the delayed-ER timer fires, we enter fast recovery and perform fast retransmit. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-05-02 20:56:10 -04:00
Yuchung Cheng	eed530b6c6	tcp: early retransmit This patch implements RFC 5827 early retransmit (ER) for TCP. It reduces DUPACK threshold (dupthresh) if outstanding packets are less than 4 to recover losses by fast recovery instead of timeout. While the algorithm is simple, small but frequent network reordering makes this feature dangerous: the connection repeatedly enter false recovery and degrade performance. Therefore we implement a mitigation suggested in the appendix of the RFC that delays entering fast recovery by a small interval, i.e., RTT/4. Currently ER is conservative and is disabled for the rest of the connection after the first reordering event. A large scale web server experiment on the performance impact of ER is summarized in section 6 of the paper "Proportional Rate Reduction for TCP”, IMC 2011. http://conferences.sigcomm.org/imc/2011/docs/p155.pdf Note that Linux has a similar feature called THIN_DUPACK. The differences are THIN_DUPACK do not mitigate reorderings and is only used after slow start. Currently ER is disabled if THIN_DUPACK is enabled. I would be happy to merge THIN_DUPACK feature with ER if people think it's a good idea. ER is enabled by sysctl_tcp_early_retrans: 0: Disables ER 1: Reduce dupthresh to packets_out - 1 when outstanding packets < 4. 2: (Default) reduce dupthresh like mode 1. In addition, delay entering fast recovery by RTT/4. Note: mode 2 is implemented in the third part of this patch series. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-05-02 20:56:10 -04:00
Yuchung Cheng	1fbc340514	tcp: early retransmit: tcp_enter_recovery() This a prepartion patch that refactors the code to enter recovery into a new function tcp_enter_recovery(). It's needed to implement the delayed fast retransmit in ER. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-05-02 20:56:09 -04:00
Eric Dumazet	329033f645	tcp: makes tcp_try_coalesce aware of skb->head_frag TCP coalesce can check if skb to be merged has its skb->head mapped to a page fragment, instead of a kmalloc() area. We had to disable coalescing in this case, for performance reasons. We 'upgrade' skb->head as a fragment in itself. This reduces number of cache misses when user makes its copies, since a less sk_buff are fetched. This makes receive and ofo queues shorter and thus reduce cache line misses in TCP stack. This is a followup of patch "net: allow skb->head to be a page fragment" Tested with tg3 nic, with GRO on or off. We can see "TCPRcvCoalesce" counter being incremented. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Maciej Żenczykowski <maze@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Cc: Ben Hutchings <bhutchings@solarflare.com> Cc: Matt Carlson <mcarlson@broadcom.com> Cc: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-04-30 21:35:49 -04:00
Yuchung Cheng	1cebce36d6	tcp: fix infinite cwnd in tcp_complete_cwr() When the cwnd reduction is done, ssthresh may be infinite if TCP enters CWR via ECN or F-RTO. If cwnd is not undone, i.e., undo_marker is set, tcp_complete_cwr() falsely set cwnd to the infinite ssthresh value. The correct operation is to keep cwnd intact because it has been updated in ECN or F-RTO. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-04-30 13:44:39 -04:00
Neal Cardwell	651913ce9d	tcp: clean up use of jiffies in tcp_rcv_rtt_measure() Clean up a reference to jiffies in tcp_rcv_rtt_measure() that should instead reference tcp_time_stamp. Since the result of the subtraction is passed into a function taking u32, this should not change any behavior (and indeed the generated assembly does not change on x86_64). However, it seems worth cleaning this up for consistency and clarity (and perhaps to avoid bugs if this is copied and pasted somewhere else). Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-04-27 12:34:39 -04:00
Eric Dumazet	6746960140	ipv6: RTAX_FEATURE_ALLFRAG causes inefficient TCP segment sizing Quoting Tore Anderson from : https://bugzilla.kernel.org/show_bug.cgi?id=42572 When RTAX_FEATURE_ALLFRAG is set on a route, the effective TCP segment size does not take into account the size of the IPv6 Fragmentation header that needs to be included in outbound packets, causing every transmitted TCP segment to be fragmented across two IPv6 packets, the latter of which will only contain 8 bytes of actual payload. RTAX_FEATURE_ALLFRAG is typically set on a route in response to receving a ICMPv6 Packet Too Big message indicating a Path MTU of less than 1280 bytes. 1280 bytes is the minimum IPv6 MTU, however ICMPv6 PTBs with MTU < 1280 are still valid, in particular when an IPv6 packet is sent to an IPv4 destination through a stateless translator. Any ICMPv4 Need To Fragment packets originated from the IPv4 part of the path will be translated to ICMPv6 PTB which may then indicate an MTU of less than 1280. The Linux kernel refuses to reduce the effective MTU to anything below 1280 bytes, instead it sets it to exactly 1280 bytes, and RTAX_FEATURE_ALLFRAG is also set. However, the TCP segment size appears to be set to 1240 bytes (1280 Path MTU - 40 bytes of IPv6 header), instead of 1232 (additionally taking into account the 8 bytes required by the IPv6 Fragmentation extension header). This in turn results in rather inefficient transmission, as every transmitted TCP segment now is split in two fragments containing 1232+8 bytes of payload. After this patch, all the outgoing packets that includes a Fragmentation header all are "atomic" or "non-fragmented" fragments, i.e., they both have Offset=0 and More Fragments=0. With help from David S. Miller Reported-by: Tore Anderson <tore@fud.no> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Maciej Żenczykowski <maze@google.com> Cc: Tom Herbert <therbert@google.com> Tested-by: Tore Anderson <tore@fud.no> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-04-27 00:03:34 -04:00
Pavel Emelyanov	de248a75c3	tcp repair: Fix unaligned access when repairing options (v2) Don't pick __u8/__u16 values directly from raw pointers, but instead use an array of structures of code:value pairs. This is OK, since the buffer we take options from is not an skb memory, but a user-to-kernel one. For those options which don't require any value now, require this to be zero (for potential future extension of this API). v2: Changed tcp_repair_opt to use two __u32-s as spotted by David Laight. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-04-26 06:13:51 -04:00
Shan Wei	8dcf01fc00	net: sock_diag_handler structs can be const read only, so change it to const. Signed-off-by: Shan Wei <davidshan@tencent.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-04-25 20:46:59 -04:00
Shan Wei	62ad6fcd74	udp_diag: implement idiag_get_info for udp/udplite to get queue information When we use netlink to monitor queue information for udp socket, idiag_rqueue and idiag_wqueue of inet_diag_msg are returned with 0. Keep consistent with netstat, just return back allocated rmem/wmem size. Signed-off-by: Shan Wei <davidshan@tencent.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-04-25 20:43:01 -04:00
Eric Dumazet	38ba0a65fa	net: skb_can_coalesce returns a boolean Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-04-24 00:18:02 -04:00
Eric Dumazet	783c175f90	tcp: tcp_try_coalesce returns a boolean This clarifies code intention, as suggested by David. Suggested-by: David Miller <davem@davemloft.net> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-04-23 23:36:58 -04:00
David S. Miller	f24001941c	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Fix merge between commit `3adadc08cc` ("net ax25: Reorder ax25_exit to remove races") and commit `0ca7a4c87d` ("net ax25: Simplify and cleanup the ax25 sysctl handling") The former moved around the sysctl register/unregister calls, the later simply removed them. With help from Stephen Rothwell. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-04-23 23:15:17 -04:00
Eric Dumazet	1402d36601	tcp: introduce tcp_try_coalesce commit `c8628155ec` (tcp: reduce out_of_order memory use) took care of coalescing tcp segments provided by legacy devices (linear skbs) We extend this idea to fragged skbs, as their truesize can be heavy. ixgbe for example uses 256+1024+PAGE_SIZE/2 = 3328 bytes per segment. Use this coalescing strategy for receive queue too. This contributes to reduce number of tcp collapses, at minimal cost, and reduces memory overhead and packets drops. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Maciej Żenczykowski <maze@google.com> Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-04-23 22:42:49 -04:00
Eric Dumazet	da882c1f2e	tcp: sk_add_backlog() is too agressive for TCP While investigating TCP performance problems on 10Gb+ links, we found a tcp sender was dropping lot of incoming ACKS because of sk_rcvbuf limit in sk_add_backlog(), especially if receiver doesnt use GRO/LRO and sends one ACK every two MSS segments. A sender usually tweaks sk_sndbuf, but sk_rcvbuf stays at its default value (87380), allowing a too small backlog. A TCP ACK, even being small, can consume nearly same truesize space than outgoing packets. Using sk_rcvbuf + sk_sndbuf as a limit makes sense and is fast to compute. Performance results on netperf, single flow, receiver with disabled GRO/LRO : 7500 Mbits instead of 6050 Mbits, no more TCPBacklogDrop increments at sender. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Maciej Żenczykowski <maze@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Cc: Rick Jones <rick.jones2@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-04-23 22:28:28 -04:00
Eric Dumazet	f545a38f74	net: add a limit parameter to sk_add_backlog() sk_add_backlog() & sk_rcvqueues_full() hard coded sk_rcvbuf as the memory limit. We need to make this limit a parameter for TCP use. No functional change expected in this patch, all callers still using the old sk_rcvbuf limit. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Maciej Żenczykowski <maze@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Cc: Rick Jones <rick.jones2@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-04-23 22:28:28 -04:00
David S. Miller	ac807fa8e6	tcp: Fix build warning after tcp_{v4,v6}_init_sock consolidation. net/ipv4/tcp_ipv4.c: In function 'tcp_v4_init_sock': net/ipv4/tcp_ipv4.c:1891:19: warning: unused variable 'tp' [-Wunused-variable] net/ipv6/tcp_ipv6.c: In function 'tcp_v6_init_sock': net/ipv6/tcp_ipv6.c:1836:19: warning: unused variable 'tp' [-Wunused-variable] Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-04-23 03:21:58 -04:00
Neal Cardwell	900f65d361	tcp: move duplicate code from tcp_v4_init_sock()/tcp_v6_init_sock() This commit moves the (substantial) common code shared between tcp_v4_init_sock() and tcp_v6_init_sock() to a new address-family independent function, tcp_init_sock(). Centralizing this functionality should help avoid drift issues, e.g. where the IPv4 side is updated without a corresponding update to IPv6. There was already some drift: IPv4 initialized snd_cwnd to TCP_INIT_CWND, while the IPv6 side was still initializing snd_cwnd to 2 (in this case it should not matter, since snd_cwnd is also initialized in tcp_init_metrics(), but the general risks and maintenance overhead remain). When diffing the old and new code, note that new tcp_init_sock() function uses the order of steps from the tcp_v4_init_sock() implementation (the order is slightly different in tcp_v6_init_sock()). Signed-off-by: Neal Cardwell <ncardwell@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-04-21 16:36:42 -04:00
Pavel Emelyanov	b139ba4e90	tcp: Repair connection-time negotiated parameters There are options, which are set up on a socket while performing TCP handshake. Need to resurrect them on a socket while repairing. A new sockoption accepts a buffer and parses it. The buffer should be CODE:VALUE sequence of bytes, where CODE is standard option code and VALUE is the respective value. Only 4 options should be handled on repaired socket. To read 3 out of 4 of these options the TCP_INFO sockoption can be used. An ability to get the last one (the mss_clamp) was added by the previous patch. Now the restore. Three of these options -- timestamp_ok, mss_clamp and snd_wscale -- are just restored on a coket. The sack_ok flags has 2 issues. First, whether or not to do sacks at all. This flag is just read and set back. No other sack info is saved or restored, since according to the standart and the code dropping all sack-ed segments is OK, the sender will resubmit them again, so after the repair we will probably experience a pause in connection. Next, the fack bit. It's just set back on a socket if the respective sysctl is set. No collected stats about packets flow is preserved. As far as I see (plz, correct me if I'm wrong) the fack-based congestion algorithm survives dropping all of the stats and repairs itself eventually, probably losing the performance for that period. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-04-21 15:52:25 -04:00
Pavel Emelyanov	5e6a3ce657	tcp: Report mss_clamp with TCP_MAXSEG option in repair mode The mss_clamp is the only connection-time negotiated option which cannot be obtained from the user space. Make the TCP_MAXSEG sockopt report one in the repair mode. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-04-21 15:52:25 -04:00

... 3 4 5 6 7 ...

5135 Commits