2005-04-16 15:20:36 -07:00
|
|
|
/*
|
|
|
|
* net-sysfs.c - network device class and attributes
|
|
|
|
*
|
|
|
|
* Copyright (c) 2003 Stephen Hemminger <shemminger@osdl.org>
|
2007-02-09 07:24:36 -07:00
|
|
|
*
|
2005-04-16 15:20:36 -07:00
|
|
|
* This program is free software; you can redistribute it and/or
|
|
|
|
* modify it under the terms of the GNU General Public License
|
|
|
|
* as published by the Free Software Foundation; either version
|
|
|
|
* 2 of the License, or (at your option) any later version.
|
|
|
|
*/
|
|
|
|
|
2006-01-11 13:17:47 -07:00
|
|
|
#include <linux/capability.h>
|
2005-04-16 15:20:36 -07:00
|
|
|
#include <linux/kernel.h>
|
|
|
|
#include <linux/netdevice.h>
|
|
|
|
#include <linux/if_arp.h>
|
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities to include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the following.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 01:04:11 -07:00
|
|
|
#include <linux/slab.h>
|
2010-05-04 17:36:45 -07:00
|
|
|
#include <linux/nsproxy.h>
|
2005-04-16 15:20:36 -07:00
|
|
|
#include <net/sock.h>
|
2010-05-04 17:36:45 -07:00
|
|
|
#include <net/net_namespace.h>
|
2005-04-16 15:20:36 -07:00
|
|
|
#include <linux/rtnetlink.h>
|
|
|
|
#include <linux/wireless.h>
|
rfs: Receive Flow Steering
This patch implements receive flow steering (RFS). RFS steers
received packets for layer 3 and 4 processing to the CPU where
the application for the corresponding flow is running. RFS is an
extension of Receive Packet Steering (RPS).
The basic idea of RFS is that when an application calls recvmsg
(or sendmsg) the application's running CPU is stored in a hash
table that is indexed by the connection's rxhash which is stored in
the socket structure. The rxhash is passed in skb's received on
the connection from netif_receive_skb. For each received packet,
the associated rxhash is used to look up the CPU in the hash table;
if a valid CPU is set, the packet is steered to that CPU using
the RPS mechanisms.
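
The basic idea above can be sketched in userspace C. This is only an illustration of "record the CPU under the flow hash, then look it up per packet"; the type and function names (sock_flow_table, flow_table_record, flow_table_lookup), the table size, and the NO_CPU marker are assumptions made for the example and are not the structures added by this patch.

```c
#include <stdint.h>

/* Hypothetical names and sizes, chosen for illustration only. */
#define FLOW_TABLE_SIZE 256u        /* assumed power of two so masking works */
#define NO_CPU          0xffffu     /* stand-in for "no CPU recorded yet" */

struct sock_flow_table {
	uint32_t mask;                  /* FLOW_TABLE_SIZE - 1 */
	uint16_t cpu[FLOW_TABLE_SIZE];  /* desired CPU per hash bucket */
};

/* Mark every bucket as having no desired CPU. */
static void flow_table_init(struct sock_flow_table *t)
{
	t->mask = FLOW_TABLE_SIZE - 1;
	for (uint32_t i = 0; i < FLOW_TABLE_SIZE; i++)
		t->cpu[i] = NO_CPU;
}

/* recvmsg/sendmsg side: remember which CPU the application runs on. */
static void flow_table_record(struct sock_flow_table *t, uint32_t rxhash,
			      uint16_t cpu)
{
	t->cpu[rxhash & t->mask] = cpu;
}

/* Receive side: find the CPU the flow would like to be processed on. */
static uint16_t flow_table_lookup(const struct sock_flow_table *t,
				  uint32_t rxhash)
{
	return t->cpu[rxhash & t->mask];
}
```

Because the table is indexed by a masked hash, two flows can share a bucket; in this sketch the most recent writer simply wins.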
The complication with this simple approach is that it would potentially
allow OOO packets. If threads are thrashing around CPUs or multiple
threads are trying to read from the same sockets, a quickly changing
CPU value in the hash table could cause rampant OOO packets--
we consider this a non-starter.
To avoid OOO packets, this solution implements two types of hash
tables: rps_sock_flow_table and rps_dev_flow_table.
rps_sock_flow_table is a global hash table. Each entry is just a CPU
number and it is populated in recvmsg and sendmsg as described above.
This table contains the "desired" CPUs for flows.
rps_dev_flow_table is specific to each device queue. Each entry
contains a CPU and a tail queue counter. The CPU is the "current"
CPU for a matching flow. The tail queue counter holds the value
of a tail queue counter for the associated CPU's backlog queue at
the time of last enqueue for a flow matching the entry.
Each backlog queue has a queue head counter which is incremented
on dequeue, and so a queue tail counter is computed as queue head
count + queue length. When a packet is enqueued on a backlog queue,
the current value of the queue tail counter is saved in the hash
entry of the rps_dev_flow_table.
And now the trick: when selecting the CPU for RPS (get_rps_cpu)
the rps_sock_flow table and the rps_dev_flow table for the RX queue
are consulted. When the desired CPU for the flow (found in the
rps_sock_flow table) does not match the current CPU (found in the
rps_dev_flow table), the current CPU is changed to the desired CPU
if one of the following is true:
- The current CPU is unset (equal to RPS_NO_CPU)
- Current CPU is offline
- The current CPU's queue head counter >= queue tail counter in the
rps_dev_flow table. This checks if the queue head has advanced
beyond the last packet that was enqueued using this table entry.
This guarantees that all packets queued using this entry have been
dequeued, thus preserving in order delivery.
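
The three conditions above, together with the "tail = head + length" bookkeeping, can be sketched as follows. This is a simplified userspace model, not the get_rps_cpu code: the struct layouts, the helper names (backlog_tail, flow_enqueue, backlog_dequeue, may_switch_cpu) and the NO_CPU marker are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define NO_CPU 0xffffu              /* stand-in for RPS_NO_CPU */

/* Per-flow entry in a (hypothetical) device flow table. */
struct dev_flow {
	uint16_t cpu;                   /* "current" CPU for the flow */
	uint32_t last_qtail;            /* tail counter saved at last enqueue */
};

/* Per-CPU backlog queue state (hypothetical layout). */
struct backlog {
	uint32_t head;                  /* incremented on every dequeue */
	uint32_t len;                   /* packets currently queued */
	bool online;
};

/* The tail counter is derived, not stored: head count + queue length. */
static uint32_t backlog_tail(const struct backlog *q)
{
	return q->head + q->len;
}

/* Enqueue one packet of the flow and save the resulting tail counter. */
static void flow_enqueue(struct dev_flow *f, struct backlog *q, uint16_t cpu)
{
	q->len++;
	f->cpu = cpu;
	f->last_qtail = backlog_tail(q);
}

/* Dequeue one packet; the caller guarantees the queue is not empty. */
static void backlog_dequeue(struct backlog *q)
{
	q->head++;
	q->len--;
}

/*
 * May the flow move from its current CPU (whose backlog is 'cur') to the
 * desired CPU?  The signed difference keeps the comparison meaningful
 * when the 32-bit counters wrap around.
 */
static bool may_switch_cpu(const struct dev_flow *f, const struct backlog *cur)
{
	if (f->cpu == NO_CPU || !cur->online)
		return true;
	return (int32_t)(cur->head - f->last_qtail) >= 0;
}
```

Once the head counter reaches the saved tail value, every packet queued through this entry has left the old CPU's backlog, so moving the flow cannot reorder it.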
Making each queue have its own rps_dev_flow table has two advantages:
1) the tail queue counters will be written on each receive, so
keeping the table local to the interrupting CPU is good for locality. 2)
this allows lockless access to the table. The CPU number and queue
tail counter need to be accessed together under mutual exclusion
from netif_receive_skb; we assume that this is only called from
device napi_poll, which is non-reentrant.
This patch implements RFS for TCP and connected UDP sockets.
It should be usable for other flow oriented protocols.
There are two configuration parameters for RFS. The
"rps_flow_entries" kernel init parameter sets the number of
entries in the rps_sock_flow_table, the per rxqueue sysfs entry
"rps_flow_cnt" contains the number of entries in the rps_dev_flow
table for the rxqueue. Both are rounded to a power of two.
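
A power-of-two table size lets a hash be reduced to an index with a single mask instead of a modulo. The sketch below assumes the rounding is upward, which the text above does not state explicitly; round_up_pow2 and bucket_index are hypothetical helper names, not kernel functions.

```c
#include <stdint.h>

/*
 * Round n up to the next power of two (assumed direction of rounding).
 * Valid for n <= 2^31; 0 and 1 both map to 1.
 */
static uint32_t round_up_pow2(uint32_t n)
{
	if (n <= 1)
		return 1;
	n--;                /* so exact powers of two map to themselves */
	n |= n >> 1;        /* smear the highest set bit downwards ... */
	n |= n >> 2;
	n |= n >> 4;
	n |= n >> 8;
	n |= n >> 16;
	return n + 1;       /* ... then step up to the next power of two */
}

/* With a power-of-two entry count, indexing is a mask, not a division. */
static uint32_t bucket_index(uint32_t hash, uint32_t entries)
{
	return hash & (entries - 1);
}
```

Under this assumption, a requested size of 1000 entries would yield a table of 1024 entries.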
The obvious benefit of RFS (over just RPS) is that it achieves
CPU locality between the receive processing for a flow and the
application's processing; this can result in increased performance
(higher pps, lower latency).
The benefits of RFS are dependent on cache hierarchy, application
load, and other factors. On simple benchmarks, we don't necessarily
see improvement and sometimes see degradation. However, for more
complex benchmarks and for applications where cache pressure is
much higher this technique seems to perform very well.
Below are some benchmark results which show the potential benefit of
this patch. The netperf test has 500 instances of netperf TCP_RR
test with 1 byte req. and resp. The RPC test is a request/response
test similar in structure to the netperf RR test with 100 threads on
each host, but does more work in userspace than netperf.
e1000e on 8 core Intel
  No RFS or RPS               104K tps at 30% CPU
  No RFS (best RPS config)    290K tps at 63% CPU
  RFS                         303K tps at 61% CPU

RPC test      tps    CPU%   50/90/99% usec latency   Latency StdDev
  No RFS/RPS  103K   48%    757/900/3185             4472.35
  RPS only    174K   73%    415/993/2468             491.66
  RFS         223K   73%    379/651/1382             315.61
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-16 16:01:27 -07:00
|
|
|
#include <linux/vmalloc.h>
|
2009-09-28 06:26:43 -07:00
|
|
|
#include <net/wext.h>
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2007-10-23 21:14:45 -07:00
|
|
|
#include "net-sysfs.h"
|
|
|
|
|
2007-09-26 22:02:53 -07:00
|
|
|
#ifdef CONFIG_SYSFS
|
2005-04-16 15:20:36 -07:00
|
|
|
static const char fmt_hex[] = "%#x\n";
|
2005-05-29 20:28:25 -07:00
|
|
|
static const char fmt_long_hex[] = "%#lx\n";
|
2005-04-16 15:20:36 -07:00
|
|
|
static const char fmt_dec[] = "%d\n";
|
|
|
|
static const char fmt_ulong[] = "%lu\n";
|
2010-06-08 00:19:54 -07:00
|
|
|
static const char fmt_u64[] = "%llu\n";
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2007-02-09 07:24:36 -07:00
|
|
|
static inline int dev_isalive(const struct net_device *dev)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
2006-05-06 17:56:03 -07:00
|
|
|
return dev->reg_state <= NETREG_REGISTERED;
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
|
|
|
|
|
|
|
/* use same locking rules as GIF* ioctl's */
|
2002-04-09 12:14:34 -07:00
|
|
|
static ssize_t netdev_show(const struct device *dev,
|
|
|
|
struct device_attribute *attr, char *buf,
|
2005-04-16 15:20:36 -07:00
|
|
|
ssize_t (*format)(const struct net_device *, char *))
|
|
|
|
{
|
2002-04-09 12:14:34 -07:00
|
|
|
struct net_device *net = to_net_dev(dev);
|
2005-04-16 15:20:36 -07:00
|
|
|
ssize_t ret = -EINVAL;
|
|
|
|
|
|
|
|
read_lock(&dev_base_lock);
|
|
|
|
if (dev_isalive(net))
|
|
|
|
ret = (*format)(net, buf);
|
|
|
|
read_unlock(&dev_base_lock);
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* generate a show function for simple field */
|
|
|
|
#define NETDEVICE_SHOW(field, format_string) \
|
|
|
|
static ssize_t format_##field(const struct net_device *net, char *buf) \
|
|
|
|
{ \
|
|
|
|
return sprintf(buf, format_string, net->field); \
|
|
|
|
} \
|
2002-04-09 12:14:34 -07:00
|
|
|
static ssize_t show_##field(struct device *dev, \
|
|
|
|
struct device_attribute *attr, char *buf) \
|
2005-04-16 15:20:36 -07:00
|
|
|
{ \
|
2002-04-09 12:14:34 -07:00
|
|
|
return netdev_show(dev, attr, buf, format_##field); \
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
/* use same locking and permission rules as SIF* ioctl's */
|
2002-04-09 12:14:34 -07:00
|
|
|
static ssize_t netdev_store(struct device *dev, struct device_attribute *attr,
|
2005-04-16 15:20:36 -07:00
|
|
|
const char *buf, size_t len,
|
|
|
|
int (*set)(struct net_device *, unsigned long))
|
|
|
|
{
|
|
|
|
struct net_device *net = to_net_dev(dev);
|
|
|
|
char *endp;
|
|
|
|
unsigned long new;
|
|
|
|
int ret = -EINVAL;
|
|
|
|
|
|
|
|
if (!capable(CAP_NET_ADMIN))
|
|
|
|
return -EPERM;
|
|
|
|
|
|
|
|
new = simple_strtoul(buf, &endp, 0);
|
|
|
|
if (endp == buf)
|
|
|
|
goto err;
|
|
|
|
|
2009-02-25 23:49:24 -07:00
|
|
|
if (!rtnl_trylock())
|
2009-05-13 09:57:25 -07:00
|
|
|
return restart_syscall();
|
2009-02-25 23:49:24 -07:00
|
|
|
|
2005-04-16 15:20:36 -07:00
|
|
|
if (dev_isalive(net)) {
|
|
|
|
ret = (*set)(net, new);
if (ret == 0)
|
|
|
|
ret = len;
|
|
|
|
}
|
|
|
|
rtnl_unlock();
|
|
|
|
err:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2008-04-20 16:07:43 -07:00
|
|
|
NETDEVICE_SHOW(dev_id, fmt_hex);
|
2010-07-21 19:50:21 -07:00
|
|
|
NETDEVICE_SHOW(addr_assign_type, fmt_dec);
|
2005-12-18 17:42:56 -07:00
|
|
|
NETDEVICE_SHOW(addr_len, fmt_dec);
|
|
|
|
NETDEVICE_SHOW(iflink, fmt_dec);
|
|
|
|
NETDEVICE_SHOW(ifindex, fmt_dec);
|
|
|
|
NETDEVICE_SHOW(features, fmt_long_hex);
|
|
|
|
NETDEVICE_SHOW(type, fmt_dec);
|
2006-03-20 18:09:11 -07:00
|
|
|
NETDEVICE_SHOW(link_mode, fmt_dec);
|
2005-04-16 15:20:36 -07:00
|
|
|
|
|
|
|
/* use same locking rules as GIFHWADDR ioctl's */
|
2002-04-09 12:14:34 -07:00
|
|
|
static ssize_t show_address(struct device *dev, struct device_attribute *attr,
|
|
|
|
char *buf)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
struct net_device *net = to_net_dev(dev);
|
|
|
|
ssize_t ret = -EINVAL;
|
|
|
|
|
|
|
|
read_lock(&dev_base_lock);
|
|
|
|
if (dev_isalive(net))
|
2007-12-24 22:28:09 -07:00
|
|
|
ret = sysfs_format_mac(buf, net->dev_addr, net->addr_len);
|
2005-04-16 15:20:36 -07:00
|
|
|
read_unlock(&dev_base_lock);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2002-04-09 12:14:34 -07:00
|
|
|
static ssize_t show_broadcast(struct device *dev,
|
|
|
|
struct device_attribute *attr, char *buf)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
struct net_device *net = to_net_dev(dev);
|
|
|
|
if (dev_isalive(net))
|
2007-12-24 22:28:09 -07:00
|
|
|
return sysfs_format_mac(buf, net->broadcast, net->addr_len);
|
2005-04-16 15:20:36 -07:00
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
2002-04-09 12:14:34 -07:00
|
|
|
static ssize_t show_carrier(struct device *dev,
|
|
|
|
struct device_attribute *attr, char *buf)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
struct net_device *netdev = to_net_dev(dev);
|
|
|
|
if (netif_running(netdev)) {
|
|
|
|
return sprintf(buf, fmt_dec, !!netif_carrier_ok(netdev));
|
|
|
|
}
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
2009-10-02 02:26:12 -07:00
|
|
|
static ssize_t show_speed(struct device *dev,
|
|
|
|
struct device_attribute *attr, char *buf)
|
|
|
|
{
|
|
|
|
struct net_device *netdev = to_net_dev(dev);
|
|
|
|
int ret = -EINVAL;
|
|
|
|
|
|
|
|
if (!rtnl_trylock())
|
|
|
|
return restart_syscall();
|
|
|
|
|
2009-10-25 18:23:33 -07:00
|
|
|
if (netif_running(netdev) &&
|
|
|
|
netdev->ethtool_ops &&
|
|
|
|
netdev->ethtool_ops->get_settings) {
|
2009-10-02 02:26:12 -07:00
|
|
|
struct ethtool_cmd cmd = { ETHTOOL_GSET };
|
|
|
|
|
|
|
|
if (!netdev->ethtool_ops->get_settings(netdev, &cmd))
|
|
|
|
ret = sprintf(buf, fmt_dec, ethtool_cmd_speed(&cmd));
|
|
|
|
}
|
|
|
|
rtnl_unlock();
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static ssize_t show_duplex(struct device *dev,
|
|
|
|
struct device_attribute *attr, char *buf)
|
|
|
|
{
|
|
|
|
struct net_device *netdev = to_net_dev(dev);
|
|
|
|
int ret = -EINVAL;
|
|
|
|
|
|
|
|
if (!rtnl_trylock())
|
|
|
|
return restart_syscall();
|
|
|
|
|
2009-10-25 18:23:33 -07:00
|
|
|
if (netif_running(netdev) &&
|
|
|
|
netdev->ethtool_ops &&
|
|
|
|
netdev->ethtool_ops->get_settings) {
|
2009-10-02 02:26:12 -07:00
|
|
|
struct ethtool_cmd cmd = { ETHTOOL_GSET };
|
|
|
|
|
|
|
|
if (!netdev->ethtool_ops->get_settings(netdev, &cmd))
|
|
|
|
ret = sprintf(buf, "%s\n", cmd.duplex ? "full" : "half");
|
|
|
|
}
|
|
|
|
rtnl_unlock();
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2002-04-09 12:14:34 -07:00
|
|
|
static ssize_t show_dormant(struct device *dev,
|
|
|
|
struct device_attribute *attr, char *buf)
|
2006-03-20 18:09:11 -07:00
|
|
|
{
|
|
|
|
struct net_device *netdev = to_net_dev(dev);
|
|
|
|
|
|
|
|
if (netif_running(netdev))
|
|
|
|
return sprintf(buf, fmt_dec, !!netif_dormant(netdev));
|
|
|
|
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
2009-08-05 10:42:58 -07:00
|
|
|
static const char *const operstates[] = {
|
2006-03-20 18:09:11 -07:00
|
|
|
"unknown",
|
|
|
|
"notpresent", /* currently unused */
|
|
|
|
"down",
|
|
|
|
"lowerlayerdown",
|
|
|
|
"testing", /* currently unused */
|
|
|
|
"dormant",
|
|
|
|
"up"
|
|
|
|
};
|
|
|
|
|
2002-04-09 12:14:34 -07:00
|
|
|
static ssize_t show_operstate(struct device *dev,
|
|
|
|
struct device_attribute *attr, char *buf)
|
2006-03-20 18:09:11 -07:00
|
|
|
{
|
|
|
|
const struct net_device *netdev = to_net_dev(dev);
|
|
|
|
unsigned char operstate;
|
|
|
|
|
|
|
|
read_lock(&dev_base_lock);
|
|
|
|
operstate = netdev->operstate;
|
|
|
|
if (!netif_running(netdev))
|
|
|
|
operstate = IF_OPER_DOWN;
|
|
|
|
read_unlock(&dev_base_lock);
|
|
|
|
|
2006-04-05 22:19:47 -07:00
|
|
|
if (operstate >= ARRAY_SIZE(operstates))
|
2006-03-20 18:09:11 -07:00
|
|
|
return -EINVAL; /* should not happen */
|
|
|
|
|
|
|
|
return sprintf(buf, "%s\n", operstates[operstate]);
|
|
|
|
}
|
|
|
|
|
2005-04-16 15:20:36 -07:00
|
|
|
/* read-write attributes */
|
|
|
|
NETDEVICE_SHOW(mtu, fmt_dec);
|
|
|
|
|
|
|
|
static int change_mtu(struct net_device *net, unsigned long new_mtu)
|
|
|
|
{
|
|
|
|
return dev_set_mtu(net, (int) new_mtu);
|
|
|
|
}
|
|
|
|
|
2002-04-09 12:14:34 -07:00
|
|
|
static ssize_t store_mtu(struct device *dev, struct device_attribute *attr,
|
|
|
|
const char *buf, size_t len)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
2002-04-09 12:14:34 -07:00
|
|
|
return netdev_store(dev, attr, buf, len, change_mtu);
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
|
|
|
|
|
|
|
NETDEVICE_SHOW(flags, fmt_hex);
|
|
|
|
|
|
|
|
static int change_flags(struct net_device *net, unsigned long new_flags)
|
|
|
|
{
|
|
|
|
return dev_change_flags(net, (unsigned) new_flags);
|
|
|
|
}
|
|
|
|
|
2002-04-09 12:14:34 -07:00
|
|
|
static ssize_t store_flags(struct device *dev, struct device_attribute *attr,
|
|
|
|
const char *buf, size_t len)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
2002-04-09 12:14:34 -07:00
|
|
|
return netdev_store(dev, attr, buf, len, change_flags);
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
|
|
|
|
|
|
|
NETDEVICE_SHOW(tx_queue_len, fmt_ulong);
|
|
|
|
|
|
|
|
static int change_tx_queue_len(struct net_device *net, unsigned long new_len)
|
|
|
|
{
|
|
|
|
net->tx_queue_len = new_len;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2002-04-09 12:14:34 -07:00
|
|
|
static ssize_t store_tx_queue_len(struct device *dev,
|
|
|
|
struct device_attribute *attr,
|
|
|
|
const char *buf, size_t len)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
2002-04-09 12:14:34 -07:00
|
|
|
return netdev_store(dev, attr, buf, len, change_tx_queue_len);
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
|
|
|
|
2008-09-22 21:28:11 -07:00
|
|
|
static ssize_t store_ifalias(struct device *dev, struct device_attribute *attr,
|
|
|
|
const char *buf, size_t len)
|
|
|
|
{
|
|
|
|
struct net_device *netdev = to_net_dev(dev);
|
|
|
|
size_t count = len;
|
|
|
|
ssize_t ret;
|
|
|
|
|
|
|
|
if (!capable(CAP_NET_ADMIN))
|
|
|
|
return -EPERM;
|
|
|
|
|
|
|
|
/* ignore trailing newline */
|
|
|
|
if (len > 0 && buf[len - 1] == '\n')
|
|
|
|
--count;
|
|
|
|
|
2009-05-13 09:57:25 -07:00
|
|
|
if (!rtnl_trylock())
|
|
|
|
return restart_syscall();
|
2008-09-22 21:28:11 -07:00
|
|
|
ret = dev_set_alias(netdev, buf, count);
|
|
|
|
rtnl_unlock();
|
|
|
|
|
|
|
|
return ret < 0 ? ret : len;
|
|
|
|
}
|
|
|
|
|
|
|
|
static ssize_t show_ifalias(struct device *dev,
|
|
|
|
struct device_attribute *attr, char *buf)
|
|
|
|
{
|
|
|
|
const struct net_device *netdev = to_net_dev(dev);
|
|
|
|
ssize_t ret = 0;
|
|
|
|
|
2009-05-13 09:57:25 -07:00
|
|
|
if (!rtnl_trylock())
|
|
|
|
return restart_syscall();
|
2008-09-22 21:28:11 -07:00
|
|
|
if (netdev->ifalias)
|
|
|
|
ret = sprintf(buf, "%s\n", netdev->ifalias);
|
|
|
|
rtnl_unlock();
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2002-04-09 12:14:34 -07:00
|
|
|
static struct device_attribute net_class_attributes[] = {
|
2010-07-21 19:50:21 -07:00
|
|
|
__ATTR(addr_assign_type, S_IRUGO, show_addr_assign_type, NULL),
|
2005-12-18 17:42:56 -07:00
|
|
|
__ATTR(addr_len, S_IRUGO, show_addr_len, NULL),
|
2008-04-20 16:07:43 -07:00
|
|
|
__ATTR(dev_id, S_IRUGO, show_dev_id, NULL),
|
2008-09-22 21:28:11 -07:00
|
|
|
__ATTR(ifalias, S_IRUGO | S_IWUSR, show_ifalias, store_ifalias),
|
2005-12-18 17:42:56 -07:00
|
|
|
__ATTR(iflink, S_IRUGO, show_iflink, NULL),
|
|
|
|
__ATTR(ifindex, S_IRUGO, show_ifindex, NULL),
|
|
|
|
__ATTR(features, S_IRUGO, show_features, NULL),
|
|
|
|
__ATTR(type, S_IRUGO, show_type, NULL),
|
2006-03-20 18:09:11 -07:00
|
|
|
__ATTR(link_mode, S_IRUGO, show_link_mode, NULL),
|
2005-12-18 17:42:56 -07:00
|
|
|
__ATTR(address, S_IRUGO, show_address, NULL),
|
|
|
|
__ATTR(broadcast, S_IRUGO, show_broadcast, NULL),
|
|
|
|
__ATTR(carrier, S_IRUGO, show_carrier, NULL),
|
2009-10-02 02:26:12 -07:00
|
|
|
__ATTR(speed, S_IRUGO, show_speed, NULL),
|
|
|
|
__ATTR(duplex, S_IRUGO, show_duplex, NULL),
|
2006-03-20 18:09:11 -07:00
|
|
|
__ATTR(dormant, S_IRUGO, show_dormant, NULL),
|
|
|
|
__ATTR(operstate, S_IRUGO, show_operstate, NULL),
|
2005-12-18 17:42:56 -07:00
|
|
|
__ATTR(mtu, S_IRUGO | S_IWUSR, show_mtu, store_mtu),
|
|
|
|
__ATTR(flags, S_IRUGO | S_IWUSR, show_flags, store_flags),
|
|
|
|
__ATTR(tx_queue_len, S_IRUGO | S_IWUSR, show_tx_queue_len,
|
|
|
|
store_tx_queue_len),
|
|
|
|
{}
|
2005-04-16 15:20:36 -07:00
|
|
|
};
|
|
|
|
|
|
|
|
/* Show a given attribute in the statistics group */
|
2002-04-09 12:14:34 -07:00
|
|
|
static ssize_t netstat_show(const struct device *d,
|
|
|
|
struct device_attribute *attr, char *buf,
|
2005-04-16 15:20:36 -07:00
|
|
|
unsigned long offset)
|
|
|
|
{
|
2002-04-09 12:14:34 -07:00
|
|
|
struct net_device *dev = to_net_dev(d);
|
2005-04-16 15:20:36 -07:00
|
|
|
ssize_t ret = -EINVAL;
|
|
|
|
|
2010-06-08 00:19:54 -07:00
|
|
|
WARN_ON(offset > sizeof(struct rtnl_link_stats64) ||
|
|
|
|
offset % sizeof(u64) != 0);
|
2005-04-16 15:20:36 -07:00
|
|
|
|
|
|
|
read_lock(&dev_base_lock);
|
2008-05-21 14:12:46 -07:00
|
|
|
if (dev_isalive(dev)) {
|
2010-07-07 14:58:56 -07:00
|
|
|
struct rtnl_link_stats64 temp;
|
|
|
|
const struct rtnl_link_stats64 *stats = dev_get_stats(dev, &temp);
|
|
|
|
|
2010-06-08 00:19:54 -07:00
|
|
|
ret = sprintf(buf, fmt_u64, *(u64 *)(((u8 *) stats) + offset));
|
2008-05-21 14:12:46 -07:00
|
|
|
}
|
2005-04-16 15:20:36 -07:00
|
|
|
read_unlock(&dev_base_lock);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* generate a read-only statistics attribute */
|
|
|
|
#define NETSTAT_ENTRY(name) \
|
2002-04-09 12:14:34 -07:00
|
|
|
static ssize_t show_##name(struct device *d, \
|
|
|
|
struct device_attribute *attr, char *buf) \
|
2005-04-16 15:20:36 -07:00
|
|
|
{ \
|
2002-04-09 12:14:34 -07:00
|
|
|
return netstat_show(d, attr, buf, \
|
2010-06-08 00:19:54 -07:00
|
|
|
offsetof(struct rtnl_link_stats64, name)); \
|
2005-04-16 15:20:36 -07:00
|
|
|
} \
|
2002-04-09 12:14:34 -07:00
|
|
|
static DEVICE_ATTR(name, S_IRUGO, show_##name, NULL)
|
2005-04-16 15:20:36 -07:00
|
|
|
|
|
|
|
NETSTAT_ENTRY(rx_packets);
|
|
|
|
NETSTAT_ENTRY(tx_packets);
|
|
|
|
NETSTAT_ENTRY(rx_bytes);
|
|
|
|
NETSTAT_ENTRY(tx_bytes);
|
|
|
|
NETSTAT_ENTRY(rx_errors);
|
|
|
|
NETSTAT_ENTRY(tx_errors);
|
|
|
|
NETSTAT_ENTRY(rx_dropped);
|
|
|
|
NETSTAT_ENTRY(tx_dropped);
|
|
|
|
NETSTAT_ENTRY(multicast);
|
|
|
|
NETSTAT_ENTRY(collisions);
|
|
|
|
NETSTAT_ENTRY(rx_length_errors);
|
|
|
|
NETSTAT_ENTRY(rx_over_errors);
|
|
|
|
NETSTAT_ENTRY(rx_crc_errors);
|
|
|
|
NETSTAT_ENTRY(rx_frame_errors);
|
|
|
|
NETSTAT_ENTRY(rx_fifo_errors);
|
|
|
|
NETSTAT_ENTRY(rx_missed_errors);
|
|
|
|
NETSTAT_ENTRY(tx_aborted_errors);
|
|
|
|
NETSTAT_ENTRY(tx_carrier_errors);
|
|
|
|
NETSTAT_ENTRY(tx_fifo_errors);
|
|
|
|
NETSTAT_ENTRY(tx_heartbeat_errors);
|
|
|
|
NETSTAT_ENTRY(tx_window_errors);
|
|
|
|
NETSTAT_ENTRY(rx_compressed);
|
|
|
|
NETSTAT_ENTRY(tx_compressed);
|
|
|
|
|
|
|
|
static struct attribute *netstat_attrs[] = {
|
2002-04-09 12:14:34 -07:00
|
|
|
&dev_attr_rx_packets.attr,
|
|
|
|
&dev_attr_tx_packets.attr,
|
|
|
|
&dev_attr_rx_bytes.attr,
|
|
|
|
&dev_attr_tx_bytes.attr,
|
|
|
|
&dev_attr_rx_errors.attr,
|
|
|
|
&dev_attr_tx_errors.attr,
|
|
|
|
&dev_attr_rx_dropped.attr,
|
|
|
|
&dev_attr_tx_dropped.attr,
|
|
|
|
&dev_attr_multicast.attr,
|
|
|
|
&dev_attr_collisions.attr,
|
|
|
|
&dev_attr_rx_length_errors.attr,
|
|
|
|
&dev_attr_rx_over_errors.attr,
|
|
|
|
&dev_attr_rx_crc_errors.attr,
|
|
|
|
&dev_attr_rx_frame_errors.attr,
|
|
|
|
&dev_attr_rx_fifo_errors.attr,
|
|
|
|
&dev_attr_rx_missed_errors.attr,
|
|
|
|
&dev_attr_tx_aborted_errors.attr,
|
|
|
|
&dev_attr_tx_carrier_errors.attr,
|
|
|
|
&dev_attr_tx_fifo_errors.attr,
|
|
|
|
&dev_attr_tx_heartbeat_errors.attr,
|
|
|
|
&dev_attr_tx_window_errors.attr,
|
|
|
|
&dev_attr_rx_compressed.attr,
|
|
|
|
&dev_attr_tx_compressed.attr,
|
2005-04-16 15:20:36 -07:00
|
|
|
NULL
|
|
|
|
};
|
|
|
|
|
|
|
|
|
|
|
|
static struct attribute_group netstat_group = {
|
|
|
|
.name = "statistics",
|
|
|
|
.attrs = netstat_attrs,
|
|
|
|
};
|
|
|
|
|
2008-07-10 02:16:47 -07:00
|
|
|
#ifdef CONFIG_WIRELESS_EXT_SYSFS
|
2005-04-16 15:20:36 -07:00
|
|
|
/* helper function that does all the locking etc for wireless stats */
|
2002-04-09 12:14:34 -07:00
|
|
|
static ssize_t wireless_show(struct device *d, char *buf,
|
2005-04-16 15:20:36 -07:00
|
|
|
ssize_t (*format)(const struct iw_statistics *,
|
|
|
|
char *))
|
|
|
|
{
|
2002-04-09 12:14:34 -07:00
|
|
|
struct net_device *dev = to_net_dev(d);
|
2009-09-28 06:26:43 -07:00
|
|
|
const struct iw_statistics *iw;
|
2005-04-16 15:20:36 -07:00
|
|
|
ssize_t ret = -EINVAL;
|
2007-02-09 07:24:36 -07:00
|
|
|
|
2010-02-19 06:23:47 -07:00
|
|
|
if (!rtnl_trylock())
|
|
|
|
return restart_syscall();
|
2006-01-09 21:51:28 -07:00
|
|
|
if (dev_isalive(dev)) {
|
2009-09-28 06:26:43 -07:00
|
|
|
iw = get_wireless_stats(dev);
|
|
|
|
if (iw)
|
2006-01-09 21:51:28 -07:00
|
|
|
ret = (*format)(iw, buf);
|
|
|
|
}
|
2009-10-05 02:22:23 -07:00
|
|
|
rtnl_unlock();
|
2005-04-16 15:20:36 -07:00
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* show function template for wireless fields */
|
|
|
|
#define WIRELESS_SHOW(name, field, format_string) \
|
|
|
|
static ssize_t format_iw_##name(const struct iw_statistics *iw, char *buf) \
|
|
|
|
{ \
|
|
|
|
return sprintf(buf, format_string, iw->field); \
|
|
|
|
} \
|
2002-04-09 12:14:34 -07:00
|
|
|
static ssize_t show_iw_##name(struct device *d, \
|
|
|
|
struct device_attribute *attr, char *buf) \
|
2005-04-16 15:20:36 -07:00
|
|
|
{ \
|
2002-04-09 12:14:34 -07:00
|
|
|
return wireless_show(d, buf, format_iw_##name); \
|
2005-04-16 15:20:36 -07:00
|
|
|
} \
|
2002-04-09 12:14:34 -07:00
|
|
|
static DEVICE_ATTR(name, S_IRUGO, show_iw_##name, NULL)
|
2005-04-16 15:20:36 -07:00
|
|
|
|
|
|
|
WIRELESS_SHOW(status, status, fmt_hex);
|
|
|
|
WIRELESS_SHOW(link, qual.qual, fmt_dec);
|
|
|
|
WIRELESS_SHOW(level, qual.level, fmt_dec);
|
|
|
|
WIRELESS_SHOW(noise, qual.noise, fmt_dec);
|
|
|
|
WIRELESS_SHOW(nwid, discard.nwid, fmt_dec);
|
|
|
|
WIRELESS_SHOW(crypt, discard.code, fmt_dec);
|
|
|
|
WIRELESS_SHOW(fragment, discard.fragment, fmt_dec);
|
|
|
|
WIRELESS_SHOW(misc, discard.misc, fmt_dec);
|
|
|
|
WIRELESS_SHOW(retries, discard.retries, fmt_dec);
|
|
|
|
WIRELESS_SHOW(beacon, miss.beacon, fmt_dec);
|
|
|
|
|
|
|
|
static struct attribute *wireless_attrs[] = {
|
2002-04-09 12:14:34 -07:00
|
|
|
&dev_attr_status.attr,
|
|
|
|
&dev_attr_link.attr,
|
|
|
|
&dev_attr_level.attr,
|
|
|
|
&dev_attr_noise.attr,
|
|
|
|
&dev_attr_nwid.attr,
|
|
|
|
&dev_attr_crypt.attr,
|
|
|
|
&dev_attr_fragment.attr,
|
|
|
|
&dev_attr_retries.attr,
|
|
|
|
&dev_attr_misc.attr,
|
|
|
|
&dev_attr_beacon.attr,
|
2005-04-16 15:20:36 -07:00
|
|
|
NULL
|
|
|
|
};
|
|
|
|
|
|
|
|
static struct attribute_group wireless_group = {
|
|
|
|
.name = "wireless",
|
|
|
|
.attrs = wireless_attrs,
|
|
|
|
};
|
|
|
|
#endif
|
2010-05-16 21:59:45 -07:00
|
|
|
#endif /* CONFIG_SYSFS */
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2010-03-29 01:00:44 -07:00
|
|
|
#ifdef CONFIG_RPS
|
2010-03-16 01:03:29 -07:00
|
|
|
/*
|
|
|
|
* RX queue sysfs structures and functions.
|
|
|
|
*/
|
|
|
|
struct rx_queue_attribute {
|
|
|
|
struct attribute attr;
|
|
|
|
ssize_t (*show)(struct netdev_rx_queue *queue,
|
|
|
|
struct rx_queue_attribute *attr, char *buf);
|
|
|
|
ssize_t (*store)(struct netdev_rx_queue *queue,
|
|
|
|
struct rx_queue_attribute *attr, const char *buf, size_t len);
|
|
|
|
};
|
|
|
|
#define to_rx_queue_attr(_attr) container_of(_attr, \
|
|
|
|
struct rx_queue_attribute, attr)
|
|
|
|
|
|
|
|
#define to_rx_queue(obj) container_of(obj, struct netdev_rx_queue, kobj)
|
|
|
|
|
|
|
|
static ssize_t rx_queue_attr_show(struct kobject *kobj, struct attribute *attr,
|
|
|
|
char *buf)
|
|
|
|
{
|
|
|
|
struct rx_queue_attribute *attribute = to_rx_queue_attr(attr);
|
|
|
|
struct netdev_rx_queue *queue = to_rx_queue(kobj);
|
|
|
|
|
|
|
|
if (!attribute->show)
|
|
|
|
return -EIO;
|
|
|
|
|
|
|
|
return attribute->show(queue, attribute, buf);
|
|
|
|
}
|
|
|
|
|
|
|
|
static ssize_t rx_queue_attr_store(struct kobject *kobj, struct attribute *attr,
|
|
|
|
const char *buf, size_t count)
|
|
|
|
{
|
|
|
|
struct rx_queue_attribute *attribute = to_rx_queue_attr(attr);
|
|
|
|
struct netdev_rx_queue *queue = to_rx_queue(kobj);
|
|
|
|
|
|
|
|
if (!attribute->store)
|
|
|
|
return -EIO;
|
|
|
|
|
|
|
|
return attribute->store(queue, attribute, buf, count);
|
|
|
|
}
|
|
|
|
|
2010-08-31 05:14:13 -07:00
|
|
|
static const struct sysfs_ops rx_queue_sysfs_ops = {
|
2010-03-16 01:03:29 -07:00
|
|
|
.show = rx_queue_attr_show,
|
|
|
|
.store = rx_queue_attr_store,
|
|
|
|
};
|
|
|
|
|
|
|
|
static ssize_t show_rps_map(struct netdev_rx_queue *queue,
|
|
|
|
struct rx_queue_attribute *attribute, char *buf)
|
|
|
|
{
|
|
|
|
struct rps_map *map;
|
|
|
|
cpumask_var_t mask;
|
|
|
|
size_t len = 0;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
map = rcu_dereference(queue->rps_map);
|
|
|
|
if (map)
|
|
|
|
for (i = 0; i < map->len; i++)
|
|
|
|
cpumask_set_cpu(map->cpus[i], mask);
|
|
|
|
|
|
|
|
len += cpumask_scnprintf(buf + len, PAGE_SIZE, mask);
|
|
|
|
if (PAGE_SIZE - len < 3) {
|
|
|
|
rcu_read_unlock();
|
|
|
|
free_cpumask_var(mask);
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
rcu_read_unlock();
|
|
|
|
|
|
|
|
free_cpumask_var(mask);
|
|
|
|
len += sprintf(buf + len, "\n");
|
|
|
|
return len;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void rps_map_release(struct rcu_head *rcu)
|
|
|
|
{
|
|
|
|
struct rps_map *map = container_of(rcu, struct rps_map, rcu);
|
|
|
|
|
|
|
|
kfree(map);
|
|
|
|
}
|
|
|
|
|
2010-04-19 14:40:57 -07:00
|
|
|
static ssize_t store_rps_map(struct netdev_rx_queue *queue,
|
2010-03-16 01:03:29 -07:00
		      struct rx_queue_attribute *attribute,
		      const char *buf, size_t len)
{
	struct rps_map *old_map, *map;
	cpumask_var_t mask;
	int err, cpu, i;
	static DEFINE_SPINLOCK(rps_map_lock);

	if (!capable(CAP_NET_ADMIN))
		return -EPERM;

	if (!alloc_cpumask_var(&mask, GFP_KERNEL))
		return -ENOMEM;

	err = bitmap_parse(buf, len, cpumask_bits(mask), nr_cpumask_bits);
	if (err) {
		free_cpumask_var(mask);
		return err;
	}

	map = kzalloc(max_t(unsigned,
	    RPS_MAP_SIZE(cpumask_weight(mask)), L1_CACHE_BYTES),
	    GFP_KERNEL);
	if (!map) {
		free_cpumask_var(mask);
		return -ENOMEM;
	}

	i = 0;
	for_each_cpu_and(cpu, mask, cpu_online_mask)
		map->cpus[i++] = cpu;

	if (i)
		map->len = i;
	else {
		kfree(map);
		map = NULL;
	}

	spin_lock(&rps_map_lock);
2010-10-24 20:02:02 -07:00
	old_map = rcu_dereference_protected(queue->rps_map,
					    lockdep_is_held(&rps_map_lock));
2010-03-16 01:03:29 -07:00
	rcu_assign_pointer(queue->rps_map, map);
	spin_unlock(&rps_map_lock);

	if (old_map)
		call_rcu(&old_map->rcu, rps_map_release);

	free_cpumask_var(mask);
	return len;
}
rfs: Receive Flow Steering
This patch implements receive flow steering (RFS). RFS steers
received packets for layer 3 and 4 processing to the CPU where
the application for the corresponding flow is running. RFS is an
extension of Receive Packet Steering (RPS).
The basic idea of RFS is that when an application calls recvmsg
(or sendmsg) the application's running CPU is stored in a hash
table that is indexed by the connection's rxhash which is stored in
the socket structure. The rxhash is passed in skb's received on
the connection from netif_receive_skb. For each received packet,
the associated rxhash is used to look up the CPU in the hash table,
if a valid CPU is set then the packet is steered to that CPU using
the RPS mechanisms.
The complication with this simple approach is that it would potentially
allow OOO packets. If threads are thrashing around CPUs or multiple
threads are trying to read from the same sockets, a quickly changing
CPU value in the hash table could cause rampant OOO packets--
we consider this a non-starter.
To avoid OOO packets, this solution implements two types of hash
tables: rps_sock_flow_table and rps_dev_flow_table.
rps_sock_flow_table is a global hash table. Each entry is just a CPU
number and it is populated in recvmsg and sendmsg as described above.
This table contains the "desired" CPUs for flows.
rps_dev_flow_table is specific to each device queue. Each entry
contains a CPU and a tail queue counter. The CPU is the "current"
CPU for a matching flow. The tail queue counter holds the value
of a tail queue counter for the associated CPU's backlog queue at
the time of last enqueue for a flow matching the entry.
Each backlog queue has a queue head counter which is incremented
on dequeue, and so a queue tail counter is computed as queue head
count + queue length. When a packet is enqueued on a backlog queue,
the current value of the queue tail counter is saved in the hash
entry of the rps_dev_flow_table.
And now the trick: when selecting the CPU for RPS (get_rps_cpu)
the rps_sock_flow table and the rps_dev_flow table for the RX queue
are consulted. When the desired CPU for the flow (found in the
rps_sock_flow table) does not match the current CPU (found in the
rps_dev_flow table), the current CPU is changed to the desired CPU
if one of the following is true:
- The current CPU is unset (equal to RPS_NO_CPU)
- The current CPU is offline
- The current CPU's queue head counter >= queue tail counter in the
rps_dev_flow table. This checks if the queue tail has advanced
beyond the last packet that was enqueued using this table entry.
This guarantees that all packets queued using this entry have been
dequeued, thus preserving in order delivery.
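The switching rule above can be sketched in plain userspace C. This is only a model under stated assumptions, not the kernel code: the names flow_entry, backlog, may_switch_cpu, steer and dequeue_one, and the NO_CPU value, are hypothetical stand-ins, and the wrap-safe signed comparison is a choice made for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NO_CPU 0xffffu	/* hypothetical stand-in for RPS_NO_CPU */

/* Hypothetical model of one rps_dev_flow_table entry. */
struct flow_entry {
	uint16_t cpu;		/* "current" CPU for the flow */
	uint32_t last_qtail;	/* tail counter saved at the last enqueue */
};

/* Hypothetical model of one CPU's backlog queue. */
struct backlog {
	uint32_t head;		/* incremented on every dequeue */
	uint32_t qlen;		/* packets currently queued */
	bool online;
};

/* Queue tail counter = queue head count + queue length. */
static uint32_t backlog_tail(const struct backlog *q)
{
	return q->head + q->qlen;
}

/* True when the flow may move from its current CPU to the desired one. */
static bool may_switch_cpu(const struct flow_entry *f,
			   const struct backlog *queues, uint16_t desired)
{
	if (f->cpu == desired)
		return false;		/* nothing to change */
	if (f->cpu == NO_CPU)
		return true;		/* current CPU is unset */
	if (!queues[f->cpu].online)
		return true;		/* current CPU is offline */
	/* head >= saved tail: everything queued via this entry is dequeued */
	return (int32_t)(queues[f->cpu].head - f->last_qtail) >= 0;
}

/* Pick a CPU for one packet, enqueue it there and save the new tail.
 * desired must be a valid index into queues. */
static uint16_t steer(struct flow_entry *f, struct backlog *queues,
		      uint16_t desired)
{
	if (may_switch_cpu(f, queues, desired))
		f->cpu = desired;
	queues[f->cpu].qlen++;
	f->last_qtail = backlog_tail(&queues[f->cpu]);
	return f->cpu;
}

/* Remove one packet from a backlog queue, advancing the head counter. */
static void dequeue_one(struct backlog *q)
{
	if (q->qlen) {
		q->qlen--;
		q->head++;
	}
}
```

In this model a flow stays on its current CPU while packets it enqueued there are still pending, and moves to the desired CPU only once the head counter has caught up with the saved tail.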
Making each queue have its own rps_dev_flow table has two advantages:
1) the tail queue counters will be written on each receive, so
keeping the table local to the interrupting CPU is good for locality. 2)
this allows lockless access to the table. The CPU number and queue
tail counter need to be accessed together under mutual exclusion
from netif_receive_skb; we assume that this is only called from
device napi_poll, which is non-reentrant.
This patch implements RFS for TCP and connected UDP sockets.
It should be usable for other flow oriented protocols.
There are two configuration parameters for RFS. The
"rps_flow_entries" kernel init parameter sets the number of
entries in the rps_sock_flow_table; the per-rxqueue sysfs entry
"rps_flow_cnt" contains the number of entries in the rps_dev_flow
table for the rxqueue. Both are rounded up to a power of two.
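A power-of-two table size lets an entry be selected with a single AND against (size - 1). The sketch below illustrates that idea in userspace C; round_up_pow2 and flow_index are hypothetical helper names, and indexing the table by rxhash & mask is an assumption made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Round n up to the next power of two. Intended for 1 <= n <= 2^30,
 * so the shift below cannot overflow. */
static uint32_t round_up_pow2(uint32_t n)
{
	uint32_t p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* Select a table slot from a flow hash; mask is (table size - 1). */
static uint32_t flow_index(uint32_t rxhash, uint32_t mask)
{
	return rxhash & mask;
}
```

For example, a requested count of 1000 becomes a 1024-entry table with mask 1023, and reading the count back as mask + 1 reports 1024 rather than the value originally written.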
The obvious benefit of RFS (over just RPS) is that it achieves
CPU locality between the receive processing for a flow and the
application's processing; this can result in increased performance
(higher pps, lower latency).
The benefits of RFS are dependent on cache hierarchy, application
load, and other factors. On simple benchmarks, we don't necessarily
see improvement and sometimes see degradation. However, for more
complex benchmarks and for applications where cache pressure is
much higher this technique seems to perform very well.
Below are some benchmark results which show the potential benefit of
this patch. The netperf test has 500 instances of netperf TCP_RR
test with 1 byte req. and resp. The RPC test is a request/response
test similar in structure to the netperf RR test, with 100 threads on
each host, but does more work in userspace than netperf.
e1000e on 8 core Intel
   No RFS or RPS             104K tps at 30% CPU
   No RFS (best RPS config)  290K tps at 63% CPU
   RFS                       303K tps at 61% CPU

RPC test      tps    CPU%   50/90/99% usec latency   Latency StdDev
   No RFS/RPS 103K   48%    757/900/3185             4472.35
   RPS only   174K   73%    415/993/2468             491.66
   RFS        223K   73%    379/651/1382             315.61
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-16 16:01:27 -07:00
static ssize_t show_rps_dev_flow_table_cnt(struct netdev_rx_queue *queue,
					   struct rx_queue_attribute *attr,
					   char *buf)
{
	struct rps_dev_flow_table *flow_table;
	unsigned int val = 0;

	rcu_read_lock();
	flow_table = rcu_dereference(queue->rps_flow_table);
	if (flow_table)
		val = flow_table->mask + 1;
	rcu_read_unlock();

	return sprintf(buf, "%u\n", val);
}

static void rps_dev_flow_table_release_work(struct work_struct *work)
{
	struct rps_dev_flow_table *table = container_of(work,
	    struct rps_dev_flow_table, free_work);

	vfree(table);
}

static void rps_dev_flow_table_release(struct rcu_head *rcu)
{
	struct rps_dev_flow_table *table = container_of(rcu,
	    struct rps_dev_flow_table, rcu);

	INIT_WORK(&table->free_work, rps_dev_flow_table_release_work);
	schedule_work(&table->free_work);
}
2010-04-19 14:40:57 -07:00
static ssize_t store_rps_dev_flow_table_cnt(struct netdev_rx_queue *queue,
2010-04-16 16:01:27 -07:00
				     struct rx_queue_attribute *attr,
				     const char *buf, size_t len)
{
	unsigned int count;
	char *endp;
	struct rps_dev_flow_table *table, *old_table;
	static DEFINE_SPINLOCK(rps_dev_flow_lock);

	if (!capable(CAP_NET_ADMIN))
		return -EPERM;

	count = simple_strtoul(buf, &endp, 0);
	if (endp == buf)
		return -EINVAL;

	if (count) {
		int i;

		if (count > 1<<30) {
			/* Enforce a limit to prevent overflow */
			return -EINVAL;
		}
		count = roundup_pow_of_two(count);
		table = vmalloc(RPS_DEV_FLOW_TABLE_SIZE(count));
		if (!table)
			return -ENOMEM;

		table->mask = count - 1;
		for (i = 0; i < count; i++)
			table->flows[i].cpu = RPS_NO_CPU;
	} else
		table = NULL;

	spin_lock(&rps_dev_flow_lock);
2010-10-24 20:02:02 -07:00
	old_table = rcu_dereference_protected(queue->rps_flow_table,
					      lockdep_is_held(&rps_dev_flow_lock));
2010-04-16 16:01:27 -07:00
	rcu_assign_pointer(queue->rps_flow_table, table);
	spin_unlock(&rps_dev_flow_lock);

	if (old_table)
		call_rcu(&old_table->rcu, rps_dev_flow_table_release);

	return len;
}
2010-03-16 01:03:29 -07:00
static struct rx_queue_attribute rps_cpus_attribute =
	__ATTR(rps_cpus, S_IRUGO | S_IWUSR, show_rps_map, store_rps_map);
2010-04-16 16:01:27 -07:00
static struct rx_queue_attribute rps_dev_flow_table_cnt_attribute =
	__ATTR(rps_flow_cnt, S_IRUGO | S_IWUSR,
	    show_rps_dev_flow_table_cnt, store_rps_dev_flow_table_cnt);
2010-03-16 01:03:29 -07:00
static struct attribute *rx_queue_default_attrs[] = {
	&rps_cpus_attribute.attr,
2010-04-16 16:01:27 -07:00
	&rps_dev_flow_table_cnt_attribute.attr,
2010-03-16 01:03:29 -07:00
	NULL
};

static void rx_queue_release(struct kobject *kobj)
{
	struct netdev_rx_queue *queue = to_rx_queue(kobj);
	struct netdev_rx_queue *first = queue->first;
2010-10-24 20:02:02 -07:00
	struct rps_map *map;
	struct rps_dev_flow_table *flow_table;
2010-03-16 01:03:29 -07:00
|
|
|
|
rfs: Receive Flow Steering
This patch implements receive flow steering (RFS). RFS steers
received packets for layer 3 and 4 processing to the CPU where
the application for the corresponding flow is running. RFS is an
extension of Receive Packet Steering (RPS).
The basic idea of RFS is that when an application calls recvmsg
(or sendmsg) the application's running CPU is stored in a hash
table that is indexed by the connection's rxhash which is stored in
the socket structure. The rxhash is passed in skb's received on
the connection from netif_receive_skb. For each received packet,
the associated rxhash is used to look up the CPU in the hash table,
if a valid CPU is set then the packet is steered to that CPU using
the RPS mechanisms.
The convolution of the simple approach is that it would potentially
allow OOO packets. If threads are thrashing around CPUs or multiple
threads are trying to read from the same sockets, a quickly changing
CPU value in the hash table could cause rampant OOO packets--
we consider this a non-starter.
To avoid OOO packets, this solution implements two types of hash
tables: rps_sock_flow_table and rps_dev_flow_table.
rps_sock_table is a global hash table. Each entry is just a CPU
number and it is populated in recvmsg and sendmsg as described above.
This table contains the "desired" CPUs for flows.
rps_dev_flow_table is specific to each device queue. Each entry
contains a CPU and a tail queue counter. The CPU is the "current"
CPU for a matching flow. The tail queue counter holds the value
of a tail queue counter for the associated CPU's backlog queue at
the time of last enqueue for a flow matching the entry.
Each backlog queue has a queue head counter which is incremented
on dequeue, and so a queue tail counter is computed as queue head
count + queue length. When a packet is enqueued on a backlog queue,
the current value of the queue tail counter is saved in the hash
entry of the rps_dev_flow_table.
And now the trick: when selecting the CPU for RPS (get_rps_cpu)
the rps_sock_flow table and the rps_dev_flow table for the RX queue
are consulted. When the desired CPU for the flow (found in the
rps_sock_flow table) does not match the current CPU (found in the
rps_dev_flow table), the current CPU is changed to the desired CPU
if one of the following is true:
- The current CPU is unset (equal to RPS_NO_CPU)
- Current CPU is offline
- The current CPU's queue head counter >= queue tail counter in the
rps_dev_flow table. This checks if the queue tail has advanced
beyond the last packet that was enqueued using this table entry.
This guarantees that all packets queued using this entry have been
dequeued, thus preserving in order delivery.
Making each queue have its own rps_dev_flow table has two advantages:
1) the tail queue counters will be written on each receive, so
keeping the table local to interrupting CPU s good for locality. 2)
this allows lockless access to the table-- the CPU number and queue
tail counter need to be accessed together under mutual exclusion
from netif_receive_skb, we assume that this is only called from
device napi_poll which is non-reentrant.
This patch implements RFS for TCP and connected UDP sockets.
It should be usable for other flow oriented protocols.
There are two configuration parameters for RFS. The
"rps_flow_entries" kernel init parameter sets the number of
entries in the rps_sock_flow_table; the per-rxqueue sysfs entry
"rps_flow_cnt" contains the number of entries in the rps_dev_flow
table for the rxqueue. Both are rounded to a power of two.
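A power-of-two table size lets a flow hash be reduced to an index with a mask instead of a modulo. The sketch below illustrates that idea only; `round_up_pow2` and `flow_index` are hypothetical helpers, and rounding up (rather than down) is an assumption, since the text does not state the direction.

```c
#include <assert.h>
#include <stdint.h>

/* Round n up to the next power of two (n must be in 1..2^31). */
static uint32_t round_up_pow2(uint32_t n)
{
	assert(n > 0 && n <= 0x80000000u);
	n--;
	n |= n >> 1;
	n |= n >> 2;
	n |= n >> 4;
	n |= n >> 8;
	n |= n >> 16;
	return n + 1;
}

/* Map a flow hash to a table slot; table_size must be a power of two. */
static uint32_t flow_index(uint32_t hash, uint32_t table_size)
{
	assert((table_size & (table_size - 1)) == 0 && table_size != 0);
	return hash & (table_size - 1);
}
```

For example, a requested size of 1000 entries would become 1024 under this assumption, and every hash then maps to a slot in the range 0..1023.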
The obvious benefit of RFS (over just RPS) is that it achieves
CPU locality between the receive processing for a flow and the
application's processing; this can result in increased performance
(higher pps, lower latency).
The benefits of RFS are dependent on cache hierarchy, application
load, and other factors. On simple benchmarks, we don't necessarily
see improvement and sometimes see degradation. However, for more
complex benchmarks and for applications where cache pressure is
much higher this technique seems to perform very well.
Below are some benchmark results which show the potential benefit of
this patch. The netperf test has 500 instances of netperf TCP_RR
test with 1 byte req. and resp. The RPC test is a request/response
test similar in structure to the netperf RR test, with 100 threads on
each host, but does more work in userspace than netperf.
e1000e on 8 core Intel
  netperf test               tps    CPU%
  No RFS or RPS              104K   30%
  No RFS (best RPS config)   290K   63%
  RFS                        303K   61%

  RPC test      tps    CPU%   50/90/99% usec latency   Latency StdDev
  No RFS/RPS    103K   48%    757/900/3185             4472.35
  RPS only      174K   73%    415/993/2468             491.66
  RFS           223K   73%    379/651/1382             315.61
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-16 16:01:27 -07:00
|
|
|
|
2010-10-24 20:02:02 -07:00
|
|
|
map = rcu_dereference_raw(queue->rps_map);
|
|
|
|
if (map)
|
|
|
|
call_rcu(&map->rcu, rps_map_release);
|
|
|
|
|
|
|
|
flow_table = rcu_dereference_raw(queue->rps_flow_table);
|
|
|
|
if (flow_table)
|
|
|
|
call_rcu(&flow_table->rcu, rps_dev_flow_table_release);
|
2010-03-16 01:03:29 -07:00
|
|
|
|
|
|
|
if (atomic_dec_and_test(&first->count))
|
|
|
|
kfree(first);
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct kobj_type rx_queue_ktype = {
|
|
|
|
.sysfs_ops = &rx_queue_sysfs_ops,
|
|
|
|
.release = rx_queue_release,
|
|
|
|
.default_attrs = rx_queue_default_attrs,
|
|
|
|
};
|
|
|
|
|
|
|
|
static int rx_queue_add_kobject(struct net_device *net, int index)
|
|
|
|
{
|
|
|
|
struct netdev_rx_queue *queue = net->_rx + index;
|
2010-10-07 03:09:10 -07:00
|
|
|
struct netdev_rx_queue *first = queue->first;
|
2010-03-16 01:03:29 -07:00
|
|
|
struct kobject *kobj = &queue->kobj;
|
|
|
|
int error = 0;
|
|
|
|
|
|
|
|
kobj->kset = net->queues_kset;
|
|
|
|
error = kobject_init_and_add(kobj, &rx_queue_ktype, NULL,
|
|
|
|
"rx-%u", index);
|
|
|
|
if (error) {
|
|
|
|
kobject_put(kobj);
|
|
|
|
return error;
|
|
|
|
}
|
|
|
|
|
|
|
|
kobject_uevent(kobj, KOBJ_ADD);
|
2010-10-07 03:09:10 -07:00
|
|
|
atomic_inc(&first->count);
|
2010-03-16 01:03:29 -07:00
|
|
|
|
|
|
|
return error;
|
|
|
|
}
|
|
|
|
|
2010-09-27 01:24:33 -07:00
|
|
|
int
|
|
|
|
net_rx_queue_update_kobjects(struct net_device *net, int old_num, int new_num)
|
2010-03-16 01:03:29 -07:00
|
|
|
{
|
|
|
|
int i;
|
|
|
|
int error = 0;
|
|
|
|
|
2010-09-27 01:24:33 -07:00
|
|
|
for (i = old_num; i < new_num; i++) {
|
2010-03-16 01:03:29 -07:00
|
|
|
error = rx_queue_add_kobject(net, i);
|
2010-09-27 01:24:33 -07:00
|
|
|
if (error) {
|
|
|
|
new_num = old_num;
|
2010-03-16 01:03:29 -07:00
|
|
|
break;
|
2010-09-27 01:24:33 -07:00
|
|
|
}
|
2010-03-16 01:03:29 -07:00
|
|
|
}
|
|
|
|
|
2010-09-27 01:24:33 -07:00
|
|
|
while (--i >= new_num)
|
|
|
|
kobject_put(&net->_rx[i].kobj);
|
2010-03-16 01:03:29 -07:00
|
|
|
|
|
|
|
return error;
|
|
|
|
}
|
|
|
|
|
2010-09-27 01:24:33 -07:00
|
|
|
static int rx_queue_register_kobjects(struct net_device *net)
|
2010-03-16 01:03:29 -07:00
|
|
|
{
|
2010-09-27 01:24:33 -07:00
|
|
|
net->queues_kset = kset_create_and_add("queues",
|
|
|
|
NULL, &net->dev.kobj);
|
|
|
|
if (!net->queues_kset)
|
|
|
|
return -ENOMEM;
|
|
|
|
return net_rx_queue_update_kobjects(net, 0, net->real_num_rx_queues);
|
|
|
|
}
|
2010-03-16 01:03:29 -07:00
|
|
|
|
2010-09-27 01:24:33 -07:00
|
|
|
static void rx_queue_remove_kobjects(struct net_device *net)
|
|
|
|
{
|
|
|
|
net_rx_queue_update_kobjects(net, net->real_num_rx_queues, 0);
|
2010-03-16 01:03:29 -07:00
|
|
|
kset_unregister(net->queues_kset);
|
|
|
|
}
|
2010-03-29 01:00:44 -07:00
|
|
|
#endif /* CONFIG_RPS */
|
2010-05-04 17:36:45 -07:00
|
|
|
|
|
|
|
static const void *net_current_ns(void)
|
|
|
|
{
|
|
|
|
return current->nsproxy->net_ns;
|
|
|
|
}
|
|
|
|
|
|
|
|
static const void *net_initial_ns(void)
|
|
|
|
{
|
|
|
|
return &init_net;
|
|
|
|
}
|
|
|
|
|
|
|
|
static const void *net_netlink_ns(struct sock *sk)
|
|
|
|
{
|
|
|
|
return sock_net(sk);
|
|
|
|
}
|
|
|
|
|
2010-08-05 08:45:15 -07:00
|
|
|
struct kobj_ns_type_operations net_ns_type_operations = {
|
2010-05-04 17:36:45 -07:00
|
|
|
.type = KOBJ_NS_TYPE_NET,
|
|
|
|
.current_ns = net_current_ns,
|
|
|
|
.netlink_ns = net_netlink_ns,
|
|
|
|
.initial_ns = net_initial_ns,
|
|
|
|
};
|
2010-08-05 08:45:15 -07:00
|
|
|
EXPORT_SYMBOL_GPL(net_ns_type_operations);
|
2010-05-04 17:36:45 -07:00
|
|
|
|
|
|
|
static void net_kobj_ns_exit(struct net *net)
|
|
|
|
{
|
|
|
|
kobj_ns_exit(KOBJ_NS_TYPE_NET, net);
|
|
|
|
}
|
|
|
|
|
2010-05-16 21:59:45 -07:00
|
|
|
static struct pernet_operations kobj_net_ops = {
|
2010-05-04 17:36:45 -07:00
|
|
|
.exit = net_kobj_ns_exit,
|
|
|
|
};
|
|
|
|
|
2007-09-26 22:02:53 -07:00
|
|
|
|
2005-04-16 15:20:36 -07:00
|
|
|
#ifdef CONFIG_HOTPLUG
|
2007-08-14 06:15:12 -07:00
|
|
|
static int netdev_uevent(struct device *d, struct kobj_uevent_env *env)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
2002-04-09 12:14:34 -07:00
|
|
|
struct net_device *dev = to_net_dev(d);
|
2007-08-14 06:15:12 -07:00
|
|
|
int retval;
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2005-11-16 01:00:00 -07:00
|
|
|
/* pass interface to uevent. */
|
2007-08-14 06:15:12 -07:00
|
|
|
retval = add_uevent_var(env, "INTERFACE=%s", dev->name);
|
2007-03-30 22:23:12 -07:00
|
|
|
if (retval)
|
|
|
|
goto exit;
|
2007-03-07 11:49:30 -07:00
|
|
|
|
|
|
|
/* pass ifindex to uevent.
|
|
|
|
* ifindex is useful as it won't change (interface name may change)
|
|
|
|
* and is what RtNetlink uses natively. */
|
2007-08-14 06:15:12 -07:00
|
|
|
retval = add_uevent_var(env, "IFINDEX=%d", dev->ifindex);
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2007-03-30 22:23:12 -07:00
|
|
|
exit:
|
|
|
|
return retval;
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
/*
|
2007-02-09 07:24:36 -07:00
|
|
|
* netdev_release -- destroy and free a dead device.
|
2002-04-09 12:14:34 -07:00
|
|
|
* Called when last reference to device kobject is gone.
|
2005-04-16 15:20:36 -07:00
|
|
|
*/
|
2002-04-09 12:14:34 -07:00
|
|
|
static void netdev_release(struct device *d)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
2002-04-09 12:14:34 -07:00
|
|
|
struct net_device *dev = to_net_dev(d);
|
2005-04-16 15:20:36 -07:00
|
|
|
|
|
|
|
BUG_ON(dev->reg_state != NETREG_RELEASED);
|
|
|
|
|
2008-09-22 21:28:11 -07:00
|
|
|
kfree(dev->ifalias);
|
2005-04-16 15:20:36 -07:00
|
|
|
kfree((char *)dev - dev->padded);
|
|
|
|
}
|
|
|
|
|
2010-05-04 17:36:45 -07:00
|
|
|
static const void *net_namespace(struct device *d)
|
|
|
|
{
|
|
|
|
struct net_device *dev;
|
|
|
|
dev = container_of(d, struct net_device, dev);
|
|
|
|
return dev_net(dev);
|
|
|
|
}
|
|
|
|
|
2005-04-16 15:20:36 -07:00
|
|
|
static struct class net_class = {
|
|
|
|
.name = "net",
|
2002-04-09 12:14:34 -07:00
|
|
|
.dev_release = netdev_release,
|
2007-09-26 22:02:53 -07:00
|
|
|
#ifdef CONFIG_SYSFS
|
2002-04-09 12:14:34 -07:00
|
|
|
.dev_attrs = net_class_attributes,
|
2007-09-26 22:02:53 -07:00
|
|
|
#endif /* CONFIG_SYSFS */
|
2005-04-16 15:20:36 -07:00
|
|
|
#ifdef CONFIG_HOTPLUG
|
2002-04-09 12:14:34 -07:00
|
|
|
.dev_uevent = netdev_uevent,
|
2005-04-16 15:20:36 -07:00
|
|
|
#endif
|
2010-05-04 17:36:45 -07:00
|
|
|
.ns_type = &net_ns_type_operations,
|
|
|
|
.namespace = net_namespace,
|
2005-04-16 15:20:36 -07:00
|
|
|
};
|
|
|
|
|
2007-05-19 15:39:25 -07:00
|
|
|
/* Delete sysfs entries but hold kobject reference until after all
|
|
|
|
* netdev references are gone.
|
|
|
|
*/
|
2007-09-26 22:02:53 -07:00
|
|
|
void netdev_unregister_kobject(struct net_device *net)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
2007-05-19 15:39:25 -07:00
|
|
|
struct device *dev = &(net->dev);
|
|
|
|
|
|
|
|
kobject_get(&dev->kobj);
|
2008-10-27 17:51:47 -07:00
|
|
|
|
2010-03-29 01:00:44 -07:00
|
|
|
#ifdef CONFIG_RPS
|
2010-03-16 01:03:29 -07:00
|
|
|
rx_queue_remove_kobjects(net);
|
2010-03-22 18:06:47 -07:00
|
|
|
#endif
|
2010-03-16 01:03:29 -07:00
|
|
|
|
2007-05-19 15:39:25 -07:00
|
|
|
device_del(dev);
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Create sysfs entries for network device. */
|
2007-09-26 22:02:53 -07:00
|
|
|
int netdev_register_kobject(struct net_device *net)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
2002-04-09 12:14:34 -07:00
|
|
|
struct device *dev = &(net->dev);
|
2009-06-24 10:06:31 -07:00
|
|
|
const struct attribute_group **groups = net->sysfs_groups;
|
2010-03-16 01:03:29 -07:00
|
|
|
int error = 0;
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2010-05-04 17:36:49 -07:00
|
|
|
device_initialize(dev);
|
2002-04-09 12:14:34 -07:00
|
|
|
dev->class = &net_class;
|
|
|
|
dev->platform_data = net;
|
|
|
|
dev->groups = groups;
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2009-03-09 06:51:55 -07:00
|
|
|
dev_set_name(dev, "%s", net->name);
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2007-09-26 22:02:53 -07:00
|
|
|
#ifdef CONFIG_SYSFS
|
2009-10-29 07:18:21 -07:00
|
|
|
/* Allow for a device specific group */
|
|
|
|
if (*groups)
|
|
|
|
groups++;
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2009-10-29 07:18:21 -07:00
|
|
|
*groups++ = &netstat_group;
|
2008-07-10 02:16:47 -07:00
|
|
|
#ifdef CONFIG_WIRELESS_EXT_SYSFS
|
2009-09-29 14:27:28 -07:00
|
|
|
if (net->ieee80211_ptr)
|
2006-05-06 17:56:03 -07:00
|
|
|
*groups++ = &wireless_group;
|
2009-09-29 14:27:28 -07:00
|
|
|
#ifdef CONFIG_WIRELESS_EXT
|
|
|
|
else if (net->wireless_handlers)
|
|
|
|
*groups++ = &wireless_group;
|
|
|
|
#endif
|
2005-04-16 15:20:36 -07:00
|
|
|
#endif
|
2007-09-26 22:02:53 -07:00
|
|
|
#endif /* CONFIG_SYSFS */
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2010-03-16 01:03:29 -07:00
|
|
|
error = device_add(dev);
|
|
|
|
if (error)
|
|
|
|
return error;
|
|
|
|
|
2010-03-29 01:00:44 -07:00
|
|
|
#ifdef CONFIG_RPS
|
2010-03-16 01:03:29 -07:00
|
|
|
error = rx_queue_register_kobjects(net);
|
|
|
|
if (error) {
|
|
|
|
device_del(dev);
|
|
|
|
return error;
|
|
|
|
}
|
2010-03-22 18:06:47 -07:00
|
|
|
#endif
|
2010-03-16 01:03:29 -07:00
|
|
|
|
|
|
|
return error;
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
|
|
|
|
2008-06-13 18:12:04 -07:00
|
|
|
int netdev_class_create_file(struct class_attribute *class_attr)
|
|
|
|
{
|
|
|
|
return class_create_file(&net_class, class_attr);
|
|
|
|
}
|
2010-07-09 14:22:04 -07:00
|
|
|
EXPORT_SYMBOL(netdev_class_create_file);
|
2008-06-13 18:12:04 -07:00
|
|
|
|
|
|
|
void netdev_class_remove_file(struct class_attribute *class_attr)
|
|
|
|
{
|
|
|
|
class_remove_file(&net_class, class_attr);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(netdev_class_remove_file);
|
|
|
|
|
2007-09-26 22:02:53 -07:00
|
|
|
int netdev_kobject_init(void)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
2010-05-04 17:36:45 -07:00
|
|
|
kobj_ns_type_register(&net_ns_type_operations);
|
2010-05-16 21:59:45 -07:00
|
|
|
register_pernet_subsys(&kobj_net_ops);
|
2005-04-16 15:20:36 -07:00
|
|
|
return class_register(&net_class);
|
|
|
|
}
|