linux

Ilpo Järvinen 832d11c5cd tcp: Try to restore large SKBs while SACK processing

During SACK processing, most of the benefits of TSO are eaten by the SACK blocks that one-by-one fragment SKBs to MSS sized chunks. Then we're in problems when cleanup work for them has to be done when a large cumulative ACK comes. Try to return back to pre-split state already while more and more SACK info gets discovered by combining newly discovered SACK areas with the previous skb if that's SACKed as well.

This approach has a number of benefits:

1) The processing overhead is spread more equally over the RTT
2) Write queue has less skbs to process (affect everything which has to walk in the queue past the sacked areas)
3) Write queue is consistent whole the time, so no other parts of TCP has to be aware of this (this was not the case with some other approach that was, well, quite intrusive all around).
4) Clean_rtx_queue can release most of the pages using single put_page instead of previous PAGE_SIZE/mss+1 calls

In case a hole is fully filled by the new SACK block, we attempt to combine the next skb too which allows construction of skbs that are even larger than what tso split them to and it handles hole per on every nth patterns that often occur during slow start overshoot pretty nicely. Though this to be really useful also a retransmission would have to get lost since cumulative ACKs advance one hole at a time in the most typical case.

TODO: handle upwards only merging. That should be rather easy when segment is fully sacked but I'm leaving that as future work item (it won't make very large difference anyway since this current approach already covers quite a lot of normal cases).

I was earlier thinking of some sophisticated way of tracking timestamps of the first and the last segment but later on realized that it won't be that necessary at all to store the timestamp of the last segment. The cases that can occur are basically either:

1) ambiguous => no sensible measurement can be taken anyway
2) non-ambiguous is due to reordering => having the timestamp of the last segment there is just skewing things more off than does some good since the ack got triggered by one of the holes (besides some substle issues that would make determining right hole/skb even harder problem). Anyway, it has nothing to do with this change then.

I choose to route some abnormal looking cases with goto noop, some could be handled differently (eg., by stopping the walking at that skb but again). In general, they either shouldn't happen at all or are rare enough to make no difference in practice.

In theory this change (as whole) could cause some macroscale regression (global) because of cache misses that are taken over the round-trip time but it gets very likely better because of much less (local) cache misses per other write queue walkers and the big recovery clearing cumulative ack.

Worth to note that these benefits would be very easy to get also without TSO/GSO being on as long as the data is in pages so that we can merge them. Currently I won't let that happen because DSACK splitting at fragment that would mess up pcounts due to sk_can_gso in tcp_set_skb_tso_segs. Once DSACKs fragments gets avoided, we have some conditions that can be made less strict.

TODO: I will probably have to convert the excessive pointer passing to struct sacktag_state... :-)

My testing revealed that considerable amount of skbs couldn't be shifted because they were cloned (most likely still awaiting tx reclaim)...

[The rest is considering future work instead since I got repeatably EFAULT to tcpdump's recvfrom when I added pskb_expand_head to deal with clones, so I separated that into another, later patch]

...To counter that, I gave up on the fifth advantage:

5) When growing previous SACK block, less allocs for new skbs are done, basically a new alloc is needed only when new hole is detected and when the previous skb runs out of frags space

...which now only happens of if reclaim is fast enough to dispose the clone before the SACK block comes in (the window is RTT long), otherwise we'll have to alloc some.

With clones being handled I got these numbers (will be somewhat worse without that), taken with fine-grained mibs:

TCPSackShifted 398
TCPSackMerged 877
TCPSackShiftFallback 320
TCPSACKCOLLAPSEFALLBACKGSO 0
TCPSACKCOLLAPSEFALLBACKSKBBITS 0
TCPSACKCOLLAPSEFALLBACKSKBDATA 0
TCPSACKCOLLAPSEFALLBACKBELOW 0
TCPSACKCOLLAPSEFALLBACKFIRST 1
TCPSACKCOLLAPSEFALLBACKPREVBITS 318
TCPSACKCOLLAPSEFALLBACKMSS 1
TCPSACKCOLLAPSEFALLBACKNOHEAD 0
TCPSACKCOLLAPSEFALLBACKSHIFT 0
TCPSACKCOLLAPSENOOPSEQ 0
TCPSACKCOLLAPSENOOPSMALLPCOUNT 0
TCPSACKCOLLAPSENOOPSMALLLEN 0
TCPSACKCOLLAPSEHOLE 12

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>

2008-11-24 21:20:15 -08:00
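
The merging idea in the commit message can be illustrated with a toy model. The sketch below is not kernel code: `Segment` and `sack_range` are hypothetical names, the queue is a plain Python list, and none of the real patch's constraints (page fragments, clones, pcount accounting, the fallback cases counted by the MIBs) are modelled. It only shows how a newly SACKed range is folded into an adjacent SACKed segment, so the queue stays short, and how filling a hole joins the SACKed segments on both sides.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """Toy stand-in for an skb in the write queue (not the kernel struct)."""
    start: int            # first sequence number covered
    end: int              # one past the last sequence number covered
    sacked: bool = False  # True once the receiver has SACKed this range


def sack_range(queue: List[Segment], lo: int, hi: int) -> None:
    """Mark [lo, hi) as SACKed, then merge adjacent SACKed segments in place."""
    split: List[Segment] = []
    for seg in queue:
        # Segments that are already SACKed or outside the block are kept as-is.
        if seg.sacked or seg.end <= lo or seg.start >= hi:
            split.append(seg)
            continue
        # Split into an un-SACKed head, a SACKed middle and an un-SACKed tail.
        a, b = max(seg.start, lo), min(seg.end, hi)
        if seg.start < a:
            split.append(Segment(seg.start, a))
        split.append(Segment(a, b, True))
        if b < seg.end:
            split.append(Segment(b, seg.end))

    # Fold each SACKed segment into a directly preceding SACKed segment.
    merged: List[Segment] = []
    for seg in split:
        prev = merged[-1] if merged else None
        if prev and prev.sacked and seg.sacked and prev.end == seg.start:
            prev.end = seg.end
        else:
            merged.append(seg)
    queue[:] = merged
```

In this model, starting from one 8000-byte segment and SACKing 1000-byte blocks, a block that extends an existing SACKed area leaves the queue length unchanged, while a block that fills the hole between two SACKed areas collapses all three into one segment.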
netfilter	net: '&' redux	2008-11-03 18:21:05 -08:00
af_inet.c	net: some optimizations in af_inet	2008-11-23 15:42:23 -08:00
ah4.c	net: clean up net/ipv4/ah4.c esp4.c fib_semantics.c inet_connection_sock.c inetpeer.c ip_output.c	2008-11-03 00:23:42 -08:00
arp.c	ipv4: Fix ARP behavior with many mac-vlans	2008-11-16 19:19:38 -08:00
cipso_ipv4.c	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6	2008-10-31 00:17:34 -07:00
datagram.c	mib: add net to IP_INC_STATS_BH	2008-07-16 20:20:11 -07:00
devinet.c	net: clean up net/ipv4/devinet.c	2008-11-03 02:48:48 -08:00
esp4.c	net: clean up net/ipv4/ah4.c esp4.c fib_semantics.c inet_connection_sock.c inetpeer.c ip_output.c	2008-11-03 00:23:42 -08:00
fib_frontend.c	net: clean up net/ipv4/fib_frontend.c fib_hash.c ip_gre.c	2008-11-03 00:25:16 -08:00
fib_hash.c	net: clean up net/ipv4/fib_frontend.c fib_hash.c ip_gre.c	2008-11-03 00:25:16 -08:00
fib_lookup.h
fib_rules.c	net: add fib_rules_ops to flush_cache method	2008-07-05 19:01:28 -07:00
fib_semantics.c	net: clean up net/ipv4/ah4.c esp4.c fib_semantics.c inet_connection_sock.c inetpeer.c ip_output.c	2008-11-03 00:23:42 -08:00
fib_trie.c	net: replace NIPQUAD() in net/ipv4/ net/ipv6/	2008-10-31 00:53:57 -07:00
icmp.c	net: avoid a pair of dst_hold()/dst_release() in ip_append_data()	2008-11-24 15:52:46 -08:00
igmp.c	net: clean up net/ipv4/igmp.c	2008-11-03 00:26:09 -08:00
inet_connection_sock.c	net: ib_net pointer should depends on CONFIG_NET_NS	2008-11-12 00:54:20 -08:00
inet_diag.c	net: Convert TCP/DCCP listening hash tables to use RCU	2008-11-23 17:22:55 -08:00
inet_fragment.c	net: convert BUG_TRAP to generic WARN_ON	2008-07-25 21:43:18 -07:00
inet_hashtables.c	net: Make sure BHs are disabled in sock_prot_inuse_add()	2008-11-24 00:09:29 -08:00
inet_lro.c	include/net net/ - csum_partial - remove unnecessary casts	2008-11-19 15:44:53 -08:00
inet_timewait_sock.c	net: convert TCP/DCCP ehash rwlocks to spinlocks	2008-11-20 20:39:09 -08:00
inetpeer.c	net: clean up net/ipv4/ah4.c esp4.c fib_semantics.c inet_connection_sock.c inetpeer.c ip_output.c	2008-11-03 00:23:42 -08:00
ip_forward.c	net: reduce structures when XFRM=n	2008-10-28 13:24:06 -07:00
ip_fragment.c	net: '&' redux	2008-11-03 18:21:05 -08:00
ip_gre.c	net: fix tunnels in netns after ndo_ changes	2008-11-23 17:26:26 -08:00
ip_input.c	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6	2008-11-18 23:38:23 -08:00
ip_options.c	cipso: Add support for native local labeling and fixup mapping names	2008-10-10 10:16:34 -04:00
ip_output.c	net: avoid a pair of dst_hold()/dst_release() in ip_push_pending_frames()	2008-11-24 16:07:50 -08:00
ip_sockglue.c	net: ip_sockglue.c add static, annotate ports' endianness	2008-11-20 01:54:27 -08:00
ipcomp.c	net: replace NIPQUAD() in net/ipv4/ net/ipv6/	2008-10-31 00:53:57 -07:00
ipconfig.c	net: replace NIPQUAD() in net/ipv4/ net/ipv6/	2008-10-31 00:53:57 -07:00
ipip.c	net: fix tunnels in netns after ndo_ changes	2008-11-23 17:26:26 -08:00
ipmr.c	ipmr: convert ipmr virtual interface to net_device_ops	2008-11-20 20:28:35 -08:00
Kconfig	IPVS: Move IPVS to net/netfilter/ipvs	2008-10-07 08:38:24 +11:00
Makefile	IPVS: Move IPVS to net/netfilter/ipvs	2008-10-07 08:38:24 +11:00
netfilter.c	netfilter: netns: fix {ip,6}_route_me_harder() in netns	2008-10-08 11:35:03 +02:00
proc.c	net: fix /proc/net/snmp as memory corruptor	2008-11-10 21:43:08 -08:00
protocol.c	net: remove CVS keywords	2008-06-11 21:00:38 -07:00
raw.c	net: avoid a pair of dst_hold()/dst_release() in ip_append_data()	2008-11-24 15:52:46 -08:00
route.c	net: remove struct dst_entry::entry_size	2008-11-11 17:25:22 -08:00
syncookies.c	tcp: Port redirection support for TCP	2008-10-01 07:46:49 -07:00
sysctl_net_ipv4.c	net: '&' redux	2008-11-03 18:21:05 -08:00
tcp_bic.c
tcp_cong.c	net: Remove CONFIG_KMOD from net/ (towards removing CONFIG_KMOD entirely)	2008-10-16 15:24:51 -07:00
tcp_cubic.c	[TCP] CUBIC v2.3	2008-11-02 00:28:10 -07:00
tcp_diag.c	net: inet_diag_handler structs can be const	2008-11-19 15:43:27 -08:00
tcp_highspeed.c
tcp_htcp.c	tcp_htcp: last_cong bug fix	2008-11-12 01:41:09 -08:00
tcp_hybla.c	tcp: Fix tcp_hybla zero congestion window growth with small rho and large cwnd.	2008-10-07 15:58:17 -07:00
tcp_illinois.c
tcp_input.c	tcp: Try to restore large SKBs while SACK processing	2008-11-24 21:20:15 -08:00
tcp_ipv4.c	net: Convert TCP/DCCP listening hash tables to use RCU	2008-11-23 17:22:55 -08:00
tcp_lp.c
tcp_minisocks.c	net: clean up net/ipv4/ipip.c raw.c tcp.c tcp_minisocks.c tcp_yeah.c xfrm4_policy.c	2008-11-03 00:24:34 -08:00
tcp_output.c	tcp: move tcp_simple_retransmit to tcp_input	2008-11-24 21:11:55 -08:00
tcp_probe.c	net: replace NIPQUAD() in net/ipv4/ net/ipv6/	2008-10-31 00:53:57 -07:00
tcp_scalable.c
tcp_timer.c	net: clean up net/ipv4/ip_fragment.c tcp_timer.c ip_input.c	2008-11-03 02:47:38 -08:00
tcp_vegas.c	net: fix returning void-valued expression warnings	2008-05-01 02:47:38 -07:00
tcp_vegas.h
tcp_veno.c	net: fix returning void-valued expression warnings	2008-05-01 02:47:38 -07:00
tcp_westwood.c
tcp_yeah.c	net: clean up net/ipv4/ipip.c raw.c tcp.c tcp_minisocks.c tcp_yeah.c xfrm4_policy.c	2008-11-03 00:24:34 -08:00
tcp.c	net: Convert TCP & DCCP hash tables to use RCU / hlist_nulls	2008-11-16 19:40:17 -08:00
tunnel4.c	[IPV4] TUNNEL4: Fix incoming packet length check for inter-protocol tunnel.	2008-06-05 04:02:33 +09:00
udp_impl.h	udp: introduce struct udp_table and multiple spinlocks	2008-10-29 01:41:45 -07:00
udp.c	net: avoid a pair of dst_hold()/dst_release() in ip_append_data()	2008-11-24 15:52:46 -08:00
udplite.c	udp: RCU handling for Unicast packets.	2008-10-29 02:11:14 -07:00
xfrm4_input.c
xfrm4_mode_beet.c	ipsec: Interfamily IPSec BEET	2008-08-06 02:39:30 -07:00
xfrm4_mode_transport.c
xfrm4_mode_tunnel.c	xfrm: fix fragmentation for ipv4 xfrm tunnel	2008-06-17 16:38:23 -07:00
xfrm4_output.c
xfrm4_policy.c	net: remove struct dst_entry::entry_size	2008-11-11 17:25:22 -08:00
xfrm4_state.c	xfrm: Have af-specific init_tempsel() initialize family field of temporary selector	2008-11-04 14:49:19 -08:00
xfrm4_tunnel.c