Goals are :
1) Optimizing handling of incoming Unicast UDP frames, so that no memory
writes should happen in the fast path.
Note: Multicasts and broadcasts still will need to take a lock,
because doing a full lockless lookup in this case is difficult.
2) No expensive operations in the socket bind/unhash phases :
- No expensive synchronize_rcu() calls.
- No added rcu_head in socket structure, increasing memory needs,
but more important, forcing us to use call_rcu() calls,
that have the bad property of making sockets structure cold.
(rcu grace period between socket freeing and its potential reuse
make this socket being cold in CPU cache).
David did a previous patch using call_rcu() and noticed a 20%
impact on TCP connection rates.
Quoting Cristopher Lameter :
"Right. That results in cacheline cooldown. You'd want to recycle
the object as they are cache hot on a per cpu basis. That is screwed
up by the delayed regular rcu processing. We have seen multiple
regressions due to cacheline cooldown.
The only choice in cacheline hot sensitive areas is to deal with the
complexity that comes with SLAB_DESTROY_BY_RCU or give up on RCU."
- Because udp sockets are allocated from dedicated kmem_cache,
use of SLAB_DESTROY_BY_RCU can help here.
Theory of operation :
---------------------
As the lookup is lockfree (using rcu_read_lock()/rcu_read_unlock()),
special attention must be taken by readers and writers.
Use of SLAB_DESTROY_BY_RCU is tricky too, because a socket can be freed,
reused, inserted in a different chain or in worst case in the same chain
while readers could do lookups in the same time.
In order to avoid loops, a reader must check each socket found in a chain
really belongs to the chain the reader was traversing. If it finds a
mismatch, lookup must start again at the begining. This *restart* loop
is the reason we had to use rdlock for the multicast case, because
we dont want to send same message several times to the same socket.
We use RCU only for fast path.
Thus, /proc/net/udp still takes spinlocks.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
UDP sockets are hashed in a 128 slots hash table.
This hash table is protected by *one* rwlock.
This rwlock is readlocked each time an incoming UDP message is handled.
This rwlock is writelocked each time a socket must be inserted in
hash table (bind time), or deleted from this table (close time)
This is not scalable on SMP machines :
1) Even in read mode, lock() and unlock() are atomic operations and
must dirty a contended cache line, shared by all cpus.
2) A writer might be starved if many readers are 'in flight'. This can
happen on a machine with some NIC receiving many UDP messages. User
process can be delayed a long time at socket creation/dismantle time.
This patch prepares RCU migration, by introducing 'struct udp_table
and struct udp_hslot', and using one spinlock per chain, to reduce
contention on central rwlock.
Introducing one spinlock per chain reduces latencies, for port
randomization on heavily loaded UDP servers. This also speedup
bindings to specific ports.
udp_lib_unhash() was uninlined, becoming to big.
Some cleanups were done to ease review of following patch
(RCUification of UDP Unicast lookups)
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Open code NIP6_FMT in the one call inside sscanf and one user
of NIP6() that could use %p6 in the netfilter code.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace all uses of IPOIB_GID_FMT, IPOIB_GID_RAW_ARG() and IPOIB_GID_ARG()
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This enables more ethtool information. The speed and settings of the
underlying device are propagated up. This makes services like SNMP that
use ethtool to get speed setting, work when managing a vlan, without adding
silly heurtistics into SNMP daemon.
For the driver info, just use existing driver strings.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The veth network device is stored in a list in the netdev private.
AFAICS, this list is never used so I removed this list from the code.
Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The veth private structure contains a netdev pointer refering to its peer.
This field is never used and it is pointless because if we can access,
the veth_priv, that means we already have the netdev which is stored
in veth_priv->dev.
Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The iscsi_ibft.c changes are almost certainly a bugfix as the
pointer 'ip' is a u8 *, so they never print the last 8 bytes
of the IPv6 address, and the eight bytes they do print have
a zero byte with them in each 16-bit word.
Other than that, this should cause no difference in functionality.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The define in kernel.h can be done away with at a later time.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Takes a pointer to a IPv6 address and formats it in the usual
colon-separated hex format:
xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx
Each 16 bit word is printed in network-endian byteorder.
%#p6 is also supported and will omit the colons.
%p6 is a replacement for NIP6_FMT and NIP6()
%#p6 is a replacement for NIP6_SEQFMT and NIP6()
Note that NIP6() took a struct in6_addr whereas this takes a pointer
to a struct in6_addr.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add new_mapping() implementation to the netlink xfrm_mgr to notify
address/port changes detected in UDP encapsulated ESP packets.
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
call_rcu() will unconditionally rewrite RCU head anyway.
Applies to
struct neigh_parms
struct neigh_table
struct net
struct cipso_v4_doi
struct in_ifaddr
struct in_device
rt->u.dst
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To make testing of the network namespace simpler allow
the network namespace code and the sysfs code to be
compiled and run at the same time. To do this only
virtual devices are allowed in the additional network
namespaces and those virtual devices are not placed
in the kobject tree.
Since virtual devices don't actually do anything interesting
hardware wise that needs device management there should
be no loss in keeping them out of the kobject tree and
by implication sysfs. The gain in ease of testing
and code coverage should be significant.
Changelog:
v2: As pointed out by Benjamin Thery it only makes sense to call
device_rename in the initial network namespace for now.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Benjamin Thery <benjamin.thery@bull.net>
Tested-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A number of places still use %02x:...:%02x because it's
in debug statements or for no real reason. Make a few
of them use %pM.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This converts pretty much everything to print_mac. There were
a few things that had conflicts which I have just dropped for
now, no harm done.
I've built an allyesconfig with this and looked at the files
that weren't built very carefully, but it's a huge patch.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Also remove a few stray DECLARE_MAC_BUF that were no longer
used at all.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add format specifiers for printing out six colon-separated bytes:
MAC addresses (%pM):
xx:xx:xx:xx:xx:xx
%#pM is also supported and omits the colon separators.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a patch to provide on demand route cache rebuilding. Currently, our
route cache is rebulid periodically regardless of need. This introduced
unneeded periodic latency. This patch offers a better approach. Using code
provided by Eric Dumazet, we compute the standard deviation of the average hash
bucket chain length while running rt_check_expire. Should any given chain
length grow to larger that average plus 4 standard deviations, we trigger an
emergency hash table rebuild for that net namespace. This allows for the common
case in which chains are well behaved and do not grow unevenly to not incur any
latency at all, while those systems (which may be being maliciously attacked),
only rebuild when the attack is detected. This patch take 2 other factors into
account:
1) chains with multiple entries that differ by attributes that do not affect the
hash value are only counted once, so as not to unduly bias system to rebuilding
if features like QOS are heavily used
2) if rebuilding crosses a certain threshold (which is adjustable via the added
sysctl in this patch), route caching is disabled entirely for that net
namespace, since constant rebuilding is less efficient that no caching at all
Tested successfully by me.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6:
ALSA: ASoC: Blackfin: update SPORT0 port selector (v2)
ALSA: hda - Restore default pin configs for realtek codecs
sound: use a common working email address
pci: use pci_ioremap_bar() in sound/
- Setting the TFS pin selector for SPORT 0 based on whether the selected
port id F or G. If the port is F then no conflict should exist for the
TFS. When Port G is selected and EMAC then there is a conflict between
the PHY interrupt line and TFS. Current settings prevent the conflict
by ignoring the TFS pin when Port G is selected. This allows both
ssm2602 using Port G and EMAC concurrently.
- some code cleanup
Signed-off-by: Cliff Cai <cliff.cai@analog.com>
Signed-off-by: Bryan Wu <cooloney@kernel.org>
Acked-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Some machines have broken BIOS resume that doesn't restore the default
pin configuration properly, which results in a wrong detection of HP
pin. This causes a silent speaker output due to missing HP detection.
Related bug: Novell bug#406101
https://bugzilla.novell.com/show_bug.cgi?id=406101
This patch fixes the issue by saving/restoring the default pin configs
by the driver itself.
Signed-off-by: Takashi Iwai <tiwai@suse.de>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
syncookies: fix inclusion of tcp options in syn-ack
libertas: free sk_buff with kfree_skb
btsdio: free sk_buff with kfree_skb
Phonet: do not reply to indication reset packets
Phonet: include generic link-layer header size in MAX_PHONET_HEADER
The leds-da903x LED driver was missing the proper #include of
linux/workqueue.h, but happened to compile on ARM due to implied
includes through other header files.
We do need the explict include on other architectures (reported at least
for x86-64).
Reported-tested-and-acked-by: Jean Delvare <khali@linux-fr.org>
Signed-off-by: Eric Miao <eric.miao@marvell.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Miller noticed that commit
33ad798c92 '(tcp: options clean up')
did not move the req->cookie_ts check.
This essentially disabled commit 4dfc281702
'[Syncookies]: Add support for TCP options via timestamps.'.
This restores the original logic.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes a potential error packet loop.
Signed-off-by: Remi Denis-Courmont <remi.denis-courmont@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes an OOPS in hard_header if a Phonet address is assigned to a
non-Phonet network interface.
Signed-off-by: Remi Denis-Courmont <remi.denis-courmont@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'for-linus' of git://neil.brown.name/md:
md: allow extended partitions on md devices.
md: use sysfs_notify_dirent to notify changes to md/dev-xxx/state
md: use sysfs_notify_dirent to notify changes to md/array_state
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Input: psmouse - add support for Elantech touchpads
Input: i8042 - add Blue FB5601 to noloop exception table
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid:
HID: Add support for Sony Vaio VGX-TP1E
HID: fix lock imbalance in hiddev
HID: fix lock imbalance in hidraw
HID: fix hidbus/appletouch device binding regression
HID: add hid_type to general hid struct
HID: quirk for OLED devices present in ASUS G50/G70/G71
HID: Remove "default m" for Thrustmaster and Zeroplus
HID: fix hidraw_exit section mismatch
HID: add support for another Gyration remote control
Revert "HID: Invert HWHEEL mappings for some Logitech mice"
Allow macros that are annotated with kernel-doc to contain whitespace
between the '#' and "define". It's valid and being used, so allow it.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6:
leds-hp-disk: fix build warning
ACPI: Oops in ACPI with git latest
ACPI suspend: build fix for ACPI_SLEEP=n && XEN_SAVE_RESTORE=y.
toshiba_acpi: always call input_sync() after input_report_switch()
ACPI: Always report a sync event after a lid state change
ACPI: cpufreq, processor: fix compile error in drivers/acpi/processor_perflib.c
i7300_idle: Fix compile warning CONFIG_I7300_IDLE_IOAT_CHANNEL not defined
i7300_idle: Cleanup based review comments
i7300_idle: Disable ioat channel only on platforms where ile driver can load