Impact: cleanup
Time to clean up remaining laggards using the old cpu_ functions.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Trond.Myklebust@netapp.com
This patch fixes a dependency with IPv6:
ERROR: "__ipv6_addr_type" [net/netfilter/xt_cluster.ko] undefined!
This patch adds a function that checks if the higher bits of the
address is 0xFF to identify a multicast address, instead of adding a
dependency due to __ipv6_addr_type(). I came up with this idea after
Patrick McHardy pointed possible problems with runtime module
dependencies.
Reported-by: Steven Noonan <steven@uplinklabs.net>
Reported-by: Randy Dunlap <randy.dunlap@oracle.com>
Reported-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allows for the removal of byteswapping in some places and
the removal of HIPQUAD (replaced by %pI4).
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dcc_ip is treated as a host-endian value in the first printk,
but the second printk uses %pI4 which expects a be32. This
will cause a mismatch between the debug statement and the
warning statement.
Treat as a be32 throughout and avoid some byteswapping during
some comparisions, and allow another user of HIPQUAD to bite the
dust.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When GRO/frag_list support was added to GSO, I made an error
which broke the support for segmenting linear GSO packets (GSO
packets are normally non-linear in the payload).
These days most of these packets are constructed by the tun
driver, which prefers to allocate linear memory if possible.
This is fixed in the latest kernel, but for 2.6.29 and earlier
it is still the norm.
Therefore this bug causes failures with GSO when used with tun
in 2.6.29.
Reported-by: James Huang <jamesclhuang@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6:
smack: Add a new '-CIPSO' option to the network address label configuration
netlabel: Cleanup the Smack/NetLabel code to fix incoming TCP connections
lsm: Remove the socket_post_accept() hook
selinux: Remove the "compat_net" compatibility code
netlabel: Label incoming TCP connections correctly in SELinux
lsm: Relocate the IPv4 security_inet_conn_request() hooks
TOMOYO: Fix a typo.
smack: convert smack to standard linux lists
* 'percpu-cpumask-x86-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (682 commits)
percpu: fix spurious alignment WARN in legacy SMP percpu allocator
percpu: generalize embedding first chunk setup helper
percpu: more flexibility for @dyn_size of pcpu_setup_first_chunk()
percpu: make x86 addr <-> pcpu ptr conversion macros generic
linker script: define __per_cpu_load on all SMP capable archs
x86: UV: remove uv_flush_tlb_others() WARN_ON
percpu: finer grained locking to break deadlock and allow atomic free
percpu: move fully free chunk reclamation into a work
percpu: move chunk area map extension out of area allocation
percpu: replace pcpu_realloc() with pcpu_mem_alloc() and pcpu_mem_free()
x86, percpu: setup reserved percpu area for x86_64
percpu, module: implement reserved allocation and use it for module percpu variables
percpu: add an indirection ptr for chunk page map access
x86: make embedding percpu allocator return excessive free space
percpu: use negative for auto for pcpu_setup_first_chunk() arguments
percpu: improve first chunk initial area map handling
percpu: cosmetic renames in pcpu_setup_first_chunk()
percpu: clean up percpu constants
x86: un-__init fill_pud/pmd/pte
x86: remove vestigial fix_ioremap prototypes
...
Manually merge conflicts in arch/ia64/kernel/irq_ia64.c
We just augmented the kernel's RPC service registration code so that
it automatically adjusts to what is supported in user space. Thus we
no longer need the kernel configuration option to enable registering
RPC services with v4 -- it's all done automatically.
This patch is part of a series that addresses
http://bugzilla.kernel.org/show_bug.cgi?id=12256
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Move error reporting for RPC registration to rpcb_register's caller.
This way the caller can choose to recover silently from certain
errors, but report errors it does not recognize. Error reporting
for kernel RPC service registration is now handled in one place.
This patch is part of a series that addresses
http://bugzilla.kernel.org/show_bug.cgi?id=12256
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The kernel registers RPC services with the local portmapper with an
rpcbind SET upcall to the local portmapper. Traditionally, this used
rpcbind v2 (PMAP), but registering RPC services that support IPv6
requires rpcbind v3 or v4.
Since we now want separate PF_INET and PF_INET6 listeners for each
kernel RPC service, svc_register() will do only one of those
registrations at a time.
For PF_INET, it tries an rpcb v4 SET upcall first; if that fails, it
does a legacy portmap SET. This makes it entirely backwards
compatible with legacy user space, but allows a proper v4 SET to be
used if rpcbind is available.
For PF_INET6, it does an rpcb v4 SET upcall. If that fails, it fails
the registration, and thus the transport creation. This let's the
kernel detect if user space is able to support IPv6 RPC services, and
thus whether it should maintain a PF_INET6 listener for each service
at all.
This provides complete backwards compatibilty with legacy user space
that only supports rpcbind v2. The only down-side is that registering
a new kernel RPC service may take an extra exchange with the local
portmapper on legacy systems, but this is an infrequent operation and
is done over UDP (no lingering sockets in TIMEWAIT), so it shouldn't
be consequential.
This patch is part of a series that addresses
http://bugzilla.kernel.org/show_bug.cgi?id=12256
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Our initial implementation of svc_unregister() assumed that PMAP_UNSET
cleared all rpcbind registrations for a [program, version] tuple.
However, we now have evidence that PMAP_UNSET clears only "inet"
entries, and not "inet6" entries, in the rpcbind database.
For backwards compatibility with the legacy portmapper, the
svc_unregister() function also must work if user space doesn't support
rpcbind version 4 at all.
Thus we'll send an rpcbind v4 UNSET, and if that fails, we'll send a
PMAP_UNSET.
This simplifies the code in svc_unregister() and provides better
backwards compatibility with legacy user space that does not support
rpcbind version 4. We can get rid of the conditional compilation in
here as well.
This patch is part of a series that addresses
http://bugzilla.kernel.org/show_bug.cgi?id=12256
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The user space TI-RPC library uses an empty string for the universal
address when unregistering all target addresses for [program, version].
The kernel's rpcb client should behave the same way.
Here, we are switching between several registration methods based on
the protocol family of the incoming address. Rename the other rpcbind
v4 registration functions to make it clear that they, as well, are
switched on protocol family. In /etc/netconfig, this is either "inet"
or "inet6".
NB: The loopback protocol families are not supported in the kernel.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
RFC 1833 has little to say about the contents of r_owner; it only
specifies that it is a string, and states that it is used to control
who can UNSET an entry.
Our port of rpcbind (from Sun) assumes this string contains a numeric
UID value, not alphabetical or symbolic characters, but checks this
value only for AF_LOCAL RPCB_SET or RPCB_UNSET requests. In all other
cases, rpcbind ignores the contents of the r_owner string.
The reference user space implementation of rpcb_set(3) uses a numeric
UID for all SET/UNSET requests (even via the network) and an empty
string for all other requests. We emulate that behavior here to
maintain bug-for-bug compatibility.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up: Simplify rpcb_v4_register() and its helpers by moving the
details of sockaddr type casting to rpcb_v4_register()'s helper
functions.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The RPC client returns -EPROTONOSUPPORT if there is a protocol version
mismatch (ie the remote RPC server doesn't support the RPC protocol
version sent by the client).
Helpers for the svc_register() function return -EPROTONOSUPPORT if they
don't recognize the passed-in IPPROTO_ value.
These are two entirely different failure modes.
Have the helpers return -ENOPROTOOPT instead of -EPROTONOSUPPORT. This
will allow callers to determine more precisely what the underlying
problem is, and decide to report or recover appropriately.
This patch is part of a series that addresses
http://bugzilla.kernel.org/show_bug.cgi?id=12256
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The kernel uses an IPv6 loopback address when registering its AF_INET6
RPC services so that it can tell whether the local portmapper is
actually IPv6-enabled.
Since the legacy portmapper doesn't listen on IPv6, however, this
causes a long timeout on older systems if the kernel happens to try
creating and registering an AF_INET6 RPC service. Originally I wanted
to use a connected transport (either TCP or connected UDP) so that the
upcall would fail immediately if the portmapper wasn't listening on
IPv6, but we never agreed on what transport to use.
In the end, it's of little consequence to the kernel whether the local
portmapper is listening on IPv6. It's only important whether the
portmapper supports rpcbind v4. And the kernel can't tell that at all
if it is sending requests via IPv6 -- the portmapper will just ignore
them.
So, send both rpcbind v2 and v4 SET/UNSET requests via IPv4 loopback
to maintain better backwards compatibility between new kernels and
legacy user space, and prevent multi-second hangs in some cases when
the kernel attempts to register RPC services.
This patch is part of a series that addresses
http://bugzilla.kernel.org/show_bug.cgi?id=12256
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We are about to convert to using separate RPC listener sockets for
PF_INET and PF_INET6. This echoes the way IPv6 is handled in user
space by TI-RPC, and eliminates the need for ULPs to worry about
mapped IPv4 AF_INET6 addresses when doing address comparisons.
Start by setting the IPV6ONLY flag on PF_INET6 RPC listener sockets.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Since an RPC service listener's protocol family is specified now via
svc_create_xprt(), it no longer needs to be passed to svc_create() or
svc_create_pooled(). Remove that argument from the synopsis of those
functions, and remove the sv_family field from the svc_serv struct.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The sv_family field is going away. Pass a protocol family argument to
svc_create_xprt() instead of extracting the family from the passed-in
svc_serv struct.
Again, as this is a listener socket and not an address, we make this
new argument an "int" protocol family, instead of an "sa_family_t."
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Since the sv_family field is going away, modify svc_setup_socket() to
extract the protocol family from the passed-in socket instead of from
the passed-in svc_serv struct.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The sv_family field is going away. Instead of using sv_family, have
the svc_register() function take a protocol family argument.
Since this argument represents a protocol family, and not an address
family, this argument takes an int, as this is what is passed to
sock_create_kern(). Also make sure svc_register's helpers are
checking for PF_FOO instead of AF_FOO. The value of [AP]F_FOO are
equivalent; this is simply a symbolic change to reflect the semantics
of the value stored in that variable.
sock_create_kern() should return EPFNOSUPPORT if the passed-in
protocol family isn't supported, but it uses EAFNOSUPPORT for this
case. We will stick with that tradition here, as svc_register()
is called by the RPC server in the same path as sock_create_kern().
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up: add documentating comment and use appropriate data types for
svc_find_xprt()'s arguments.
This also eliminates a mixed sign comparison: @port was an int, while
the return value of svc_xprt_local_port() is an unsigned short.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In 2007, commit e65fe3976f added
additional sanity checking to rpcb_decode_getaddr() to make sure we
were getting a reply that was long enough to be an actual universal
address. If the uaddr string isn't long enough, the XDR decoder
returns EIO.
However, an empty string is a valid RPCB_GETADDR response if the
requested service isn't registered. Moreover, "::.n.m" is also a
valid RPCB_GETADDR response for IPv6 addresses that is shorter
than rpcb_decode_getaddr()'s lower limit of 11. So this sanity
check introduced a regression for rpcbind requests against IPv6
remotes.
So revert the lower bound check added by commit
e65fe3976f, and add an explicit check
for an empty uaddr string, similar to libtirpc's rpcb_getaddr(3).
Pointed-out-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch cleans up a lot of the Smack network access control code. The
largest changes are to fix the labeling of incoming TCP connections in a
manner similar to the recent SELinux changes which use the
security_inet_conn_request() hook to label the request_sock and let the label
move to the child socket via the normal network stack mechanisms. In addition
to the incoming TCP connection fixes this patch also removes the smk_labled
field from the socket_smack struct as the minor optimization advantage was
outweighed by the difficulty in maintaining it's proper state.
Signed-off-by: Paul Moore <paul.moore@hp.com>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: James Morris <jmorris@namei.org>
The socket_post_accept() hook is not currently used by any in-tree modules
and its existence continues to cause problems by confusing people about
what can be safely accomplished using this hook. If a legitimate need for
this hook arises in the future it can always be reintroduced.
Signed-off-by: Paul Moore <paul.moore@hp.com>
Signed-off-by: James Morris <jmorris@namei.org>
The current NetLabel/SELinux behavior for incoming TCP connections works but
only through a series of happy coincidences that rely on the limited nature of
standard CIPSO (only able to convey MLS attributes) and the write equality
imposed by the SELinux MLS constraints. The problem is that network sockets
created as the result of an incoming TCP connection were not on-the-wire
labeled based on the security attributes of the parent socket but rather based
on the wire label of the remote peer. The issue had to do with how IP options
were managed as part of the network stack and where the LSM hooks were in
relation to the code which set the IP options on these newly created child
sockets. While NetLabel/SELinux did correctly set the socket's on-the-wire
label it was promptly cleared by the network stack and reset based on the IP
options of the remote peer.
This patch, in conjunction with a prior patch that adjusted the LSM hook
locations, works to set the correct on-the-wire label format for new incoming
connections through the security_inet_conn_request() hook. Besides the
correct behavior there are many advantages to this change, the most significant
is that all of the NetLabel socket labeling code in SELinux now lives in hooks
which can return error codes to the core stack which allows us to finally get
ride of the selinux_netlbl_inode_permission() logic which greatly simplfies
the NetLabel/SELinux glue code. In the process of developing this patch I
also ran into a small handful of AF_INET6 cleanliness issues that have been
fixed which should make the code safer and easier to extend in the future.
Signed-off-by: Paul Moore <paul.moore@hp.com>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: James Morris <jmorris@namei.org>
The current placement of the security_inet_conn_request() hooks do not allow
individual LSMs to override the IP options of the connection's request_sock.
This is a problem as both SELinux and Smack have the ability to use labeled
networking protocols which make use of IP options to carry security attributes
and the inability to set the IP options at the start of the TCP handshake is
problematic.
This patch moves the IPv4 security_inet_conn_request() hooks past the code
where the request_sock's IP options are set/reset so that the LSM can safely
manipulate the IP options as needed. This patch intentionally does not change
the related IPv6 hooks as IPv6 based labeling protocols which use IPv6 options
are not currently implemented, once they are we will have a better idea of
the correct placement for the IPv6 hooks.
Signed-off-by: Paul Moore <paul.moore@hp.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: James Morris <jmorris@namei.org>
Conflicts:
arch/sparc/kernel/time_64.c
drivers/gpu/drm/drm_proc.c
Manual merge to resolve build warning due to phys_addr_t type change
on x86:
drivers/gpu/drm/drm_info.c
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (166 commits)
Revert "ax25: zero length frame filtering in AX25"
Revert "netrom: zero length frame filtering in NetRom"
cfg80211: default CONFIG_WIRELESS_OLD_REGULATORY to n
mac80211/iwlwifi: move virtual A-MDPU queue bookkeeping to iwlwifi
mac80211: fix aggregation to not require queue stop
mac80211: add skb length sanity checking
mac80211: unify and fix TX aggregation start
mac80211: clean up __ieee80211_tx args
mac80211: rework the pending packets code
mac80211: fix A-MPDU queue assignment
mac80211: rewrite fragmentation
iwlwifi: show current driver status in user readable format
b43: Add BCM4307 PCI-ID
cfg80211: fix locking in nl80211_set_wiphy
mac80211: fix RX path
ath5k: properly drop packets from ops->tx
ar9170: single module build
ath9k: fix dma mapping leak of rx buffer upon rmmod
rt2x00: New USB ID for rt73usb
ath5k: warn and correct rate for unknown hw rate indexes
...
This reverts commit f99bcff7a2.
Like netrom, Alan Cox says that zero lengths have real meaning
and are useful in this protocol.
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit a3ac80a130.
Alan Cox says that zero length writes do have special meaning
and are useful in this protocol.
Signed-off-by: David S. Miller <davem@davemloft.net>
And update description and feature-removal schedule according
to the new plan.
Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This patch removes all the virtual A-MPDU-queue bookkeeping from
mac80211. Curiously, iwlwifi already does its own bookkeeping, so
it doesn't require much changes except where it needs to handle
starting and stopping the queues in mac80211.
To handle the queue stop/wake properly, we rewrite the software
queue number for aggregation frames and internally to iwlwifi keep
track of the queues that map into the same AC queue, and only talk
to mac80211 about the AC queue. The implementation requires calling
two new functions, iwl_stop_queue and iwl_wake_queue instead of the
mac80211 counterparts.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Cc: Reinette Chattre <reinette.chatre@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Instead of stopping the entire AC queue when enabling aggregation
(which was only done for hardware with aggregation queues) buffer
the packets for each station, and release them to the pending skb
queue once aggregation is turned on successfully.
We get a little more code, but it becomes conceptually simpler and
we can remove the entire virtual queue mechanism from mac80211 in
a follow-up patch.
This changes how mac80211 behaves towards drivers that support
aggregation but have no hardware queues -- those drivers will now
not be handed packets while the aggregation session is being
established, but only after it has been fully established.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
We just found a bug in zd1211rw where it would reject
packets in the ->tx() method but leave them modified,
which would cause retransmit attempts with completely
bogus skbs, eventually leading to a panic due to not
having enough headroom in those.
This patch adds a sanity check to mac80211 to catch
such driver mistakes; in this case we warn and drop
the skb.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
When TX aggregation becomes operational, we do a number of steps:
1) print a debug message
2) wake the virtual queue
3) notify the driver
Unfortunately, 1) and 3) are only done if the driver is first to
reply to the aggregation request, it is, however, possible that the
remote station replies before the driver! Thus, unify the code for
this and call the new function ieee80211_agg_tx_operational in both
places where TX aggregation can become operational.
Additionally, rename the driver notification from
IEEE80211_AMPDU_TX_RESUME to IEEE80211_AMPDU_TX_OPERATIONAL.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
__ieee80211_tx takes a struct ieee80211_tx_data argument, but only
uses a few of its members, namely 'skb' and 'sta'. Make that explicit,
so that less internal knowledge is required in ieee80211_tx_pending
and the possibility of introducing errors here is removed.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Reviewed-by: Luis R. Rodriguez <lrodriguez@atheros.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
The pending packets code is quite incomprehensible, uses memory barriers
nobody really understands, etc. This patch reworks it entirely, using
the queue spinlock, proper stop bits and the skb queues themselves to
indicate whether packets are pending or not (rather than a separate
variable like before).
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Reviewed-by: Luis R. Rodriguez <lrodriguez@atheros.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Internally, mac80211 requires the skb's queue mapping to be set
to the AC queue, not the virtual A-MPDU queue. This is not done
correctly currently, this patch moves the code down to directly
before the driver is invoked and adds a comment that it will be
moved into the driver later.
Since this requires __ieee80211_tx() to have the sta pointer,
make sure to provide it in ieee80211_tx_pending().
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Reviewed-by: Luis R. Rodriguez <lrodriguez@atheros.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Fragmentation currently uses an allocated array to store the
fragment skbs, and then keeps track of which have been sent
and which are still pending etc. This is rather complicated;
make it simpler by just chaining the fragments into skb->next
and removing from that list when sent. Also simplifies all
code that needs to touch fragments, since it now only needs
to walk the skb->next list.
This is a prerequisite for fixing the stored packet code,
which I need to do for proper aggregation packet storing.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Reviewed-by: Luis R. Rodriguez <lrodriguez@atheros.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Luis reports that there's a circular locking dependency;
this is because cfg80211_dev_rename() will acquire the
cfg80211_mutex while the device mutex is held, while
this normally is done the other way around. The solution
is to open-code the device-getting in nl80211_set_wiphy
and require holding the mutex around cfg80211_dev_rename
rather than acquiring it within.
Also fix a bug -- rtnl locking is expected by drivers so
we need to provide it.
Reported-by: Luis R. Rodriguez <lrodriguez@atheros.com>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
My previous patch ("mac80211: remove mixed-cell and userspace MLME code")
was too obvious to me, so obvious that a stupid bug crept in. The IBSS
RX function must be invoked for IBSS, of course, not anything != IBSS.
Reported-by: Larry Finger <Larry.Finger@lwfinger.net>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Tested-by: Larry Finger <Larry.Finger@lwfinger.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This patch changes mac80211 to not notify the rate control algorithm's
tx_status() method when reporting status for a packet that didn't go
through the rate control algorithm's get_rate() method.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Add IEEE80211_HW_BEACON_FILTERING flag so that driver inform that it supports
beacon filtering. Drivers need to call the new function
ieee80211_beacon_loss() to notify about beacon loss.
Signed-off-by: Kalle Valo <kalle.valo@nokia.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
In beacon filtering there needs to be a way to not expire the BSS even
when no beacons are received. Add an interface to cfg80211 to hold
BSS and make sure that it's not expired.
Signed-off-by: Kalle Valo <kalle.valo@nokia.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
When software scanning we need to disable power save so that all possible
probe responses and beacons are received. For hardware scanning assume that
hardware will take care of that and document that assumption.
Signed-off-by: Kalle Valo <kalle.valo@nokia.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Separate beacon and rx path tracking in preparation for the beacon filtering
support. At the same time change ieee80211_associated() to look a bit simpler.
Probe requests are now sent only after IEEE80211_PROBE_IDLE_TIME, which
is now set to 60 seconds.
Signed-off-by: Kalle Valo <kalle.valo@nokia.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Currently the timer is triggering every two seconds
(IEEE80211_MONITORING_INTERVAL). Decrease the timer to only trigger during
data idle periods to avoid waking up CPU unnecessary. The timer will
still trigger during idle periods, that needs to be fixed later.
There's also a functional change that probe requests are sent only when the
data path is idle, earlier they were sent also while there was activity
on the data path.
This is also preparation for the beacon filtering support. Thanks to
Johannes Berg for the idea.
Signed-off-by: Kalle Valo <kalle.valo@nokia.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Neither can currently be set from userspace, so there's no
regression potential, and neither will be supported from
userspace since the new userspace APIs allow the SME, which
is in userspace, to control all we need.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
When somebody tries to set the interface mode to the existing
mode, don't ask the driver but silently accept the setting.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
We had left in code to allow interested developers to add
support for parsing country IEs when OLD_REG was enabled.
This never happened and since we're going to remove OLD_REG
lets just remove these comments and code for it.
This code path was never being entered so this has no
functional change.
Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
It seems a few users are using this module parameter although its not
recommended. People are finding it useful despite there being utilities
for setting this in userspace. I'm not aware of any distribution using
this though.
Until userspace and distributions catch up with a default userspace
automatic replacement (GeoClue integration would be nirvana) we copy
the ieee80211_regdom module parameter from OLD_REG to the new reg
code to help these users migrate.
Users who are using the non-valid ISO / IEC 3166 alpha "EU" in their
ieee80211_regdom module parameter and migrate to non-OLD_REG enabled
system will world roam.
This also schedules removal of this same ieee80211_regdom module
parameter circa March 2010. Hope is by then nirvana is reached and
users will abandoned the module parameter completely.
Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
The incorrect assumption is the last regulatory request
(last_request) is always a country IE when processing
country IEs. Although this is true 99% of the time the
first time this happens this could not be true.
This fixes an oops in the branch check for the last_request
when accessing drv_last_ie. The access was done under the
assumption the struct won't be null.
Note to stable: to port to 29 replace as follows, only 29 has
country IE code:
s|NL80211_REGDOM_SET_BY_COUNTRY_IE|REGDOM_SET_BY_COUNTRY_IE
Cc: stable@kernel.org
Reported-by: Quentin Armitage <Quentin@armitage.org.uk>
Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Although EU is a bogus alpha2 we need to process the send request
as our code depends on last_request being set.
Cc: stable@kernel.org
Reported-by: Quentin Armitage <Quentin@armitage.org.uk>
Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
We do not want to require all the drivers using cfg80211 to need to do
this. In addition, make the error values consistent by using
EOPNOTSUPP instead of semi-random assortment of errno values.
Signed-off-by: Jouni Malinen <jouni.malinen@atheros.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
We do not want to require all the drivers using cfg80211 to need to do
this or to be prepared to handle these commands when the interface is
down.
Signed-off-by: Jouni Malinen <jouni.malinen@atheros.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Check that the used authentication type and reason code are valid here
so that drivers/mac80211 do not need to care about this. In addition,
remove the unnecessary validation of SSID attribute length which is
taken care of by netlink policy.
Signed-off-by: Jouni Malinen <jouni.malinen@atheros.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
The functionality that NL80211_CMD_SET_MGMT_EXTRA_IE provided can now
be achieved with cleaner design by adding IE(s) into
NL80211_CMD_TRIGGER_SCAN, NL80211_CMD_AUTHENTICATE,
NL80211_CMD_ASSOCIATE, NL80211_CMD_DEAUTHENTICATE, and
NL80211_CMD_DISASSOCIATE.
Since this is a very recently added command and there are no known (or
known planned) applications using NL80211_CMD_SET_MGMT_EXTRA_IE and
taken into account how much extra complexity it adds to the IE
processing we have now (and need to add in the future to fix IE order
in couple of frames), it looks like the best option is to just remove
the implementation of this command for now. The enum values themselves
are left to avoid changing the nl80211 command or attribute numbers.
Signed-off-by: Jouni Malinen <jouni.malinen@atheros.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This file was forgotten from the quilt patch that added MLME
primitives, so the kfree on interface removal is missing. Fix this
potential memleak by freeing the temporary Authentication frame IEs
from SME when the interface is being removed.
Signed-off-by: Jouni Malinen <jouni.malinen@atheros.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
When mac80211 resumes, it currently doesn't reconfigure the interfaces
entirely and also doesn't reconfigure BSS information -- fix this.
Also, to be able to test this, add a debugfs file that just calls
the suspend/resume code to see what happens when we go through that,
without needing the time-consuming suspend/resume cycle.
(Original version broke the build for CONFIG_PM=n. Define alternative
functions for that situation. -- JWL)
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This patch adds new nl80211 commands to allow user space to request
authentication and association (and also deauthentication and
disassociation). The commands are structured to allow separate
authentication and association steps, i.e., the interface between
kernel and user space is similar to the MLME SAP interface in IEEE
802.11 standard and an user space application takes the role of the
SME.
The patch introduces MLME-AUTHENTICATE.request,
MLME-{,RE}ASSOCIATE.request, MLME-DEAUTHENTICATE.request, and
MLME-DISASSOCIATE.request primitives. The authentication and
association commands request the actual operations in two steps
(assuming the driver supports this; if not, separate authentication
step is skipped; this could end up being a separate "connect"
command).
The initial implementation for mac80211 uses the current
net/mac80211/mlme.c for actual sending and processing of management
frames and the new nl80211 commands will just stop the current state
machine from moving automatically from authentication to association.
Future cleanup may move more of the MLME operations into cfg80211.
The goal of this design is to provide more control of authentication and
association process to user space without having to move the full MLME
implementation. This should be enough to allow IEEE 802.11r FT protocol
and 802.11s SAE authentication to be implemented. Obviously, this will
also bring the extra benefit of not having to use WEXT for association
requests with mac80211. An example implementation of a user space SME
using the new nl80211 commands is available for wpa_supplicant.
This patch is enough to get IEEE 802.11r FT protocol working with
over-the-air mechanism (over-the-DS will need additional MLME
primitives for handling the FT Action frames).
Signed-off-by: Jouni Malinen <j@w1.fi>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Add new nl80211 event notifications (and a new multicast group, "mlme")
for informing user space about received and processed Authentication,
(Re)Association Response, Deauthentication, and Disassociation frames in
station and IBSS modes (i.e., MLME SAP interface primitives
MLME-AUTHENTICATE.confirm, MLME-ASSOCIATE.confirm,
MLME-REASSOCIATE.confirm, MLME-DEAUTHENTICATE.indicate, and
MLME-DISASSOCIATE.indication). The event data is encapsulated as the 802.11
management frame since we already have the frame in that format and it
includes all the needed information.
This is the initial step in providing MLME SAP interface for
authentication and association with nl80211. In other words, kernel code
will act as the MLME and a user space application can control it as the
SME.
Signed-off-by: Jouni Malinen <j@w1.fi>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
We must not clear the previous BSSID when roaming to another AP within
the same ESS for reassociation to be used properly. It is fine to
clear this when the SSID changes, so let's move the code into
ieee80211_sta_set_ssid().
Signed-off-by: Jouni Malinen <jouni.malinen@atheros.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
ieee80211_tx_h_check_assoc() was dropping everything else than probe
requests during software scan. So the nullfunc frame with the power save
bit was dropped and AP never received it. This meant that AP never
buffered any frames for the station during software scan.
Fix this by allowing to transmit both probe request and nullfunc frames
during software scan. Tested with stlc45xx.
Signed-off-by: Kalle Valo <kalle.valo@nokia.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
When I added scanning to cfg80211, we got a lock dependency like this:
rtnl --> cfg80211_mtx
nl80211, on the other hand, has the reverse lock dependency:
cfg80211_mtx --> rtnl
which clearly is a bad idea. This patch reworks nl80211 to take these
two locks in the other order to fix the possible, and easily
triggerable, deadlock.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
When the driver has been notified with a STA_REMOVE, it tears down
the internal ADDBA state. On resume, trying to initiate aggregation would
fail because mac80211 has not cleared the operational state for that <TID,STA>.
This can be fixed by tearing down the existing sessions on a suspend.
Also, the driver can initiate a new BA session when suspend is in progress.
This is fixed by marking the station as being in suspend state and
denying ADDBA requests for such STAs.
Signed-off-by: Sujith <Sujith.Manoharan@atheros.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
To avoid concurrent manipulations of the sta list (which shouldn't
be possible at this point, but anyway) we need to hold the sta_lock
around iterating the list.
At the same time, we do not need to iterate the list at all if
the driver doesn't want to be notified.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Bob Copeland <me@bobcopeland.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This makes nl80211 export the supported commands (command groups)
per wiphy so userspace has an idea what it can do -- this will be
required reading for userspace when we introduce auth/assoc /or/
connect for older hardware that cannot separate auth and assoc.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Radiotap was updated to include a "bad PLCP" flag and standardise
the "bad FCS" flag in the "flags" rather than "RX flags" field,
this patch updates Linux to that standard.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Even though userland probably cannot submit packets, there might
still be some coming, and that's no good when the driver doesn't
expect them. Stop the queues across suspend/resume.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
The last warning can never trigger, and the explicit AP_VLAN
check is pointless if we move the config_interface check down,
in practice config_interface is required anyway.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
If a scan is queued in STA mode while the interface is in state direct
probe, authenticate or associate the scan is delayed until the interface
enters disabled or associated state. But in case of direct probe-,
authentication- or association- timeout sta_work will not be scheduled
anymore (without external trigger) and thus the pending scan is not
executed and prevents a new scan from being triggered (-EBUSY).
Fix this by queueing the sta work again after direct probe-, authentication-
and association- timeout.
Signed-off-by: Helmut Schaa <helmut.schaa@googlemail.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This inline is useless and actually makes the code _longer_
rather than shorter.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
The scan capability added to cfg80211/nl80211 introduced a
dependency on nl80211 by cfg80211. We can thus no longer have
just cfg80211 without nl80211. Specifically, cfg80211_scan_done()
calls nl80211_send_scan_aborted() or nl80211_send_scan_done().
Now we remove the option for user to select nl80211. It will always
be compiled if user selects cfg80211.
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Don't call ieee80211_sta_find_ibss() directly, like it's done in STA
mode, so that the commit() call is more harmless respectively has
less site-effects.
Signed-off-by: Alina Friedrichsen <x-alina@gmx.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
There is no need to set rqstp->rq_server to serv, while serv is initialized as rqstp->rq_server at previous line. And between these two lines, there is no change to rqstp->rq_server.
Signed-off-by: ideawu <ideawu@163.com>
Reviewed-by: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Otherwise we can wrap the sizes and end up sending garbage.
Closes#10423
Signed-off-by: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
spin_lock() should be spin_unlock() in xfrm_state_walk_done().
caused by:
commit 12a169e7d8
"ipsec: Put dumpers on the dump list"
Reported-by: Marc Milgram <mmilgram@redhat.com>
Signed-off-by: Chuck Ebbert <cebbert@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 778d80be52
(ipv6: Add disable_ipv6 sysctl to disable IPv6 operaion on specific interface)
seems to have introduced a leak of sk_buff's for ipv6 traffic,
at least in some configurations where idev is NULL, or when ipv6
is disabled via sysctl.
The problem is that if the first condition of the if-statement
returns non-NULL, it returns an skb with only one reference,
and when the other conditions apply, execution jumps to the "out"
label, which does not call kfree_skb for it.
To plug this leak, change to use the "drop" label instead.
(this relies on it being ok to call kfree_skb on NULL)
This also allows us to avoid calling rcu_read_unlock here,
and removes the only user of the "out" label.
Signed-off-by: Jesper Nilsson <jesper.nilsson@axis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When I fixed the GRO crash in the legacy receive path I used
napi_complete to replace __napi_complete. Unfortunately they're
not the same when NETPOLL is enabled, which may result in us
not calling __napi_complete at all.
What's more, we really do need to keep the __napi_complete call
within the IRQ-off section since in theory an IRQ can occur in
between and fill up the backlog to the maximum, causing us to
lock up.
Since we can't seem to find a fix that works properly right now,
this patch reverts all the GRO support from the netif_rx path.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'bkl-removal' of git://git.lwn.net/linux-2.6:
Rationalize fasync return values
Move FASYNC bit handling to f_op->fasync()
Use f_lock to protect f_flags
Rename struct file->f_ep_lock
* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (29 commits)
crypto: sha512-s390 - Add missing block size
hwrng: timeriomem - Breaks an allyesconfig build on s390:
nlattr: Fix build error with NET off
crypto: testmgr - add zlib test
crypto: zlib - New zlib crypto module, using pcomp
crypto: testmgr - Add support for the pcomp interface
crypto: compress - Add pcomp interface
netlink: Move netlink attribute parsing support to lib
crypto: Fix dead links
hwrng: timeriomem - New driver
crypto: chainiv - Use kcrypto_wq instead of keventd_wq
crypto: cryptd - Per-CPU thread implementation based on kcrypto_wq
crypto: api - Use dedicated workqueue for crypto subsystem
crypto: testmgr - Test skciphers with no IVs
crypto: aead - Avoid infinite loop when nivaead fails selftest
crypto: skcipher - Avoid infinite loop when cipher fails selftest
crypto: api - Fix crypto_alloc_tfm/create_create_tfm return convention
crypto: api - crypto_alg_mod_lookup either tested or untested
crypto: amcc - Add crypt4xx driver
crypto: ansi_cprng - Add maintainer
...
On a box with most of the optional Netfilter switches turned off some
of the NLAs are never send, e. g. secmark, mark or the conntrack
byte/packet counters. As a worst case scenario this may possibly
still lead to ctnetlink skbs being reallocated in netlink_trim()
later, loosing all the nice effects from the previous patches.
I try to solve that (at least partly) by correctly #ifdef'ing the
NLAs in the computation.
Signed-off-by: Holger Eitzenberger <holger@eitzenberger.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
This patch renames the ebt_ulog nf_logger from "ulog" to "ebt_ulog" to
be in sync with other modules naming. As this name was currently only
used for informational purpose, the renaming should be harmless.
Signed-off-by: Eric Leblond <eric@inl.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ebt_ulog module does not follow the fixed convention about function
return. Loading the module is triggering the following message:
sys_init_module: 'ebt_ulog'->init suspiciously returned 1, it should follow 0/-E convention
sys_init_module: loading module anyway...
Pid: 2334, comm: modprobe Not tainted 2.6.29-rc5edenwall0-00883-g199e57b #146
Call Trace:
[<c0441b81>] ? printk+0xf/0x16
[<c02311af>] sys_init_module+0x107/0x186
[<c0202cfa>] syscall_call+0x7/0xb
The following patch fixes the return treatment in ebt_ulog_init()
function.
Signed-off-by: Eric Leblond <eric@inl.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes the declaration of the logger structure in ebt_log
and ebt_ulog: I forgot to remove the const option from their declaration
in the commit ca735b3aaa ("netfilter:
use a linked list of loggers").
Pointed-out-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Eric Leblond <eric@inl.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes an crash when empty bond device is added to a bridge.
If an interface with invalid ethernet address (all zero) is added
to a bridge, then bridge code detects it when setting up the forward
databas entry. But the error unwind is broken, the bridge port object
can get freed twice: once when ref count went to zeo, and once by kfree.
Since object is never really accessible, just free it.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Usefull for all protocols which do not add additional data, such
as GRE or UDPlite.
Signed-off-by: Holger Eitzenberger <holger@eitzenberger.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Try to allocate a Netlink skb roughly the size of the actual
message, with the help from the l3 and l4 protocol helpers.
This is all to prevent a reallocation in netlink_trim() later.
The overhead of allocating the right-sized skb is rather small, with
ctnetlink_alloc_skb() actually being inlined away on my x86_64 box.
The size of the per-proto space is determined at registration time of
the protocol helper.
Signed-off-by: Holger Eitzenberger <holger@eitzenberger.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Use "hlist_nulls" infrastructure we added in 2.6.29 for RCUification of UDP & TCP.
This permits an easy conversion from call_rcu() based hash lists to a
SLAB_DESTROY_BY_RCU one.
Avoiding call_rcu() delay at nf_conn freeing time has numerous gains.
First, it doesnt fill RCU queues (up to 10000 elements per cpu).
This reduces OOM possibility, if queued elements are not taken into account
This reduces latency problems when RCU queue size hits hilimit and triggers
emergency mode.
- It allows fast reuse of just freed elements, permitting better use of
CPU cache.
- We delete rcu_head from "struct nf_conn", shrinking size of this structure
by 8 or 16 bytes.
This patch only takes care of "struct nf_conn".
call_rcu() is still used for less critical conntrack parts, that may
be converted later if necessary.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Commit e1b4b9f ([NETFILTER]: {ip,ip6,arp}_tables: fix exponential worst-case
search for loops) introduced a regression in the loop detection algorithm,
causing sporadic incorrectly detected loops.
When a chain has already been visited during the check, it is treated as
having a standard target containing a RETURN verdict directly at the
beginning in order to not check it again. The real target of the first
rule is then incorrectly treated as STANDARD target and checked not to
contain invalid verdicts.
Fix by making sure the rule does actually contain a standard target.
Based on patch by Francis Dupont <Francis_Dupont@isc.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
This is necessary in order to have an upper bound for Netlink
message calculation, which is not a problem at all, as there
are no helpers with a longer name.
Signed-off-by: Holger Eitzenberger <holger@eitzenberger.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
It calculates the max. length of a Netlink policy, which is usefull
for allocating Netlink buffers roughly the size of the actual
message.
Signed-off-by: Holger Eitzenberger <holger@eitzenberger.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
There is added a single callback for the l3 proto helper. The two
callbacks for the l4 protos are necessary because of the general
structure of a ctnetlink event, which is in short:
CTA_TUPLE_ORIG
<l3/l4-proto-attributes>
CTA_TUPLE_REPLY
<l3/l4-proto-attributes>
CTA_ID
...
CTA_PROTOINFO
<l4-proto-attributes>
CTA_TUPLE_MASTER
<l3/l4-proto-attributes>
Therefore the formular is
size := sizeof(generic-nlas) + 3 * sizeof(tuple_nlas) + sizeof(protoinfo_nlas)
Some of the NLAs are optional, e. g. CTA_TUPLE_MASTER, which is only
set if it's an expected connection. But the number of optional NLAs is
small enough to prevent netlink_trim() from reallocating if calculated
properly.
Signed-off-by: Holger Eitzenberger <holger@eitzenberger.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
We use same not trivial helper function in four places. We can factorize it.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Using hlist_add_head() in nf_conntrack_set_hashsize() is quite dangerous.
Without any barrier, one CPU could see a loop while doing its lookup.
Its true new table cannot be seen by another cpu, but previous table is still
readable.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
net/netfilter/xt_LED.c:40: error: field netfilter_led_trigger has incomplete type
net/netfilter/xt_LED.c: In function led_timeout_callback:
net/netfilter/xt_LED.c:78: warning: unused variable ledinternal
net/netfilter/xt_LED.c: In function led_tg_check:
net/netfilter/xt_LED.c:102: error: implicit declaration of function led_trigger_register
net/netfilter/xt_LED.c: In function led_tg_destroy:
net/netfilter/xt_LED.c:135: error: implicit declaration of function led_trigger_unregister
Fix by adding a dependency on LED_TRIGGERS.
Reported-by: Sachin Sant <sachinp@in.ibm.com>
Tested-by: Subrata Modak <tosubrata@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
The ipv6 version of bind_conflict code calls ipv6_rcv_saddr_equal()
which at times wrongly identified intersections between addresses.
It particularly broke down under a few instances and caused erroneous
bind conflicts.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Binding to a v4-mapped address on an AF_INET6 socket should
produce the same result as binding to an IPv4 address on
AF_INET socket. The two are interchangable as v4-mapped
address is really a portability aid.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The IPv4 wildcard (0.0.0.0) address does not intersect
in any way with explicit IPv6 addresses. These two should
be permitted, but the IPv4 conflict code checks the ipv6only
bit as part of the test. Since binding to an explicit IPv6
address restricts the socket to only that IPv6 address, the
side-effect is that the socket behaves as v6-only. By
explicitely setting ipv6only in this case, allows the 2 binds
to succeed.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A socket marked v6-only, can not receive or send traffic to v4-mapped
addresses. Thus allowing binding to v4-mapped address on such a
socket makes no sense.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch combines Greg Bank's dprintk() work with the existing dynamic
printk patchset, we are now calling it 'dynamic debug'.
The new feature of this patchset is a richer /debugfs control file interface,
(an example output from my system is at the bottom), which allows fined grained
control over the the debug output. The output can be controlled by function,
file, module, format string, and line number.
for example, enabled all debug messages in module 'nf_conntrack':
echo -n 'module nf_conntrack +p' > /mnt/debugfs/dynamic_debug/control
to disable them:
echo -n 'module nf_conntrack -p' > /mnt/debugfs/dynamic_debug/control
A further explanation can be found in the documentation patch.
Signed-off-by: Greg Banks <gnb@sgi.com>
Signed-off-by: Jason Baron <jbaron@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
dpm_list currently relies on the fact that child devices will
be registered after their parents to get a correct suspend
order. Using device_move() however destroys this assumption, as
an already registered device may be moved under a newly registered
one.
This patch adds a new argument to device_move(), allowing callers
to specify how dpm_list should be adapted.
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This patch adds the NETLINK_NO_ENOBUFS socket flag. This flag can
be used by unicast and broadcast listeners to avoid receiving
ENOBUFS errors.
Generally speaking, ENOBUFS errors are useful to notify two things
to the listener:
a) You may increase the receiver buffer size via setsockopt().
b) You have lost messages, you may be out of sync.
In some cases, ignoring ENOBUFS errors can be useful. For example:
a) nfnetlink_queue: this subsystem does not have any sort of resync
method and you can decide to ignore ENOBUFS once you have set a
given buffer size.
b) ctnetlink: you can use this together with the socket flag
NETLINK_BROADCAST_SEND_ERROR to stop getting ENOBUFS errors as
you do not need to resync (packets whose event are not delivered
are drop to provide reliable logging and state-synchronization).
Moreover, the use of NETLINK_NO_ENOBUFS also reduces a "go up, go down"
effect in terms of performance which is due to the netlink congestion
control when the listener cannot back off. The effect is the following:
1) throughput rate goes up and netlink messages are inserted in the
receiver buffer.
2) Then, netlink buffer fills and overruns (set on nlk->state bit 0).
3) While the listener empties the receiver buffer, netlink keeps
dropping messages. Thus, throughput goes dramatically down.
4) Then, once the listener has emptied the buffer (nlk->state
bit 0 is set off), goto step 1.
This effect is easy to trigger with netlink broadcast under heavy
load, and it is more noticeable when using a big receiver buffer.
You can find some results in [1] that show this problem.
[1] http://1984.lsi.us.es/linux/netlink/
This patch also includes the use of sk_drop to account the number of
netlink messages drop due to overrun. This value is shown in
/proc/net/netlink.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arches without efficient unaligned access can still perform a loop
assuming 16bit alignment in ifname_compare()
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We use RCU to defer freeing of conntrack structures. In DOS situation, RCU might
accumulate about 10.000 elements per CPU in its internal queues. To get accurate
conntrack counts (at the expense of slightly more RAM used), we might consider
conntrack counter not taking into account "about to be freed elements, waiting
in RCU queues". We thus decrement it in nf_conntrack_free(), not in the RCU
callback.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Tested-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se>
Signed-off-by: Patrick McHardy <kaber@trash.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (32 commits)
ucc_geth: Fix oops when using fixed-link support
dm9000: locking bugfix
net: update dnet.c for bus_id removal
dnet: DNET should depend on HAS_IOMEM
dca: add missing copyright/license headers
nl80211: Check that function pointer != NULL before using it
sungem: missing net_device_ops
be2net: fix to restore vlan ids into BE2 during a IF DOWN->UP cycle
be2net: replenish when posting to rx-queue is starved in out of mem conditions
bas_gigaset: correctly allocate USB interrupt transfer buffer
smsc911x: reset last known duplex and carrier on open
sh_eth: Fix mistake of the address of SH7763
sh_eth: Change handling of IRQ
netns: oops in ip[6]_frag_reasm incrementing stats
net: kfree(napi->skb) => kfree_skb
net: fix sctp breakage
ipv6: fix display of local and remote sit endpoints
net: Document /proc/sys/net/core/netdev_budget
tulip: fix crash on iface up with shirq debug
virtio_net: Make virtio_net support carrier detection
...
This patch fixes an unaligned memory access in tcp_sack while reading
sequence numbers from TCP selective acknowledgement options. Prior to
applying this patch, upstream linux-2.6.27.20 was occasionally
generating messages like this on my sparc64 system:
[54678.532071] Kernel unaligned access at TPC[6b17d4] tcp_packet+0xcd4/0xd00
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Patrick McHardy <kaber@trash.net>
This patch adds nfnetlink_set_err() to propagate the error to netlink
broadcast listener in case of memory allocation errors in the
message building.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
This patchs adds support of modification of the used logger via sysctl.
It can be used to change the logger to module that can not use the bind
operation (ipt_LOG and ipt_ULOG). For this purpose, it creates a
directory /proc/sys/net/netfilter/nf_log which contains a file
per-protocol. The content of the file is the name current logger (NONE if
not set) and a logger can be setup by simply echoing its name to the file.
By echoing "NONE" to a /proc/sys/net/netfilter/nf_log/PROTO file, the
logger corresponding to this PROTO is set to NULL.
Signed-off-by: Eric Leblond <eric@inl.fr>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Discard incoming packets whose ack field iincludes data not yet sent.
This is consistent with RFC 793 Section 3.9.
Change tcp_ack() to distinguish between too-small and too-large ack
field values. Keep segments with too-large ack fields out of the fast
path, and change slow path to discard them.
Reported-by: Oliver Zheng <mailinglists+netdev@oliverzheng.com>
Signed-off-by: John Dykstra <john.dykstra1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that most network device drivers in (all but one in x86_64 allmodconfig)
support net_device_ops. Expose it as a configuration parameter. Still
need to address even older 32 bit drivers, and other arch before
compatiablity can be scheduled for removal in some future release.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Need to reference net_device_ops not old pointer.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This converts the mpc device to using new netdevice_ops.
Compile tested only, needs more than usual review since
device was swaping pointers around etc.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Chas Williams <chas@cmf.nrl.navy.mil>
Signed-off-by: David S. Miller <davem@davemloft.net>
The initial version of the DSA driver only supported a single switch
chip per network interface, while DSA-capable switch chips can be
interconnected to form a tree of switch chips. This patch adds support
for multiple switch chips on a network interface.
An example topology for a 16-port device with an embedded CPU is as
follows:
+-----+ +--------+ +--------+
| |eth0 10| switch |9 10| switch |
| CPU +----------+ +-------+ |
| | | chip 0 | | chip 1 |
+-----+ +---++---+ +---++---+
|| ||
|| ||
||1000baseT ||1000baseT
||ports 1-8 ||ports 9-16
This requires a couple of interdependent changes in the DSA layer:
- The dsa platform driver data needs to be extended: there is still
only one netdevice per DSA driver instance (eth0 in the example
above), but each of the switch chips in the tree needs its own
mii_bus device pointer, MII management bus address, and port name
array. (include/net/dsa.h) The existing in-tree dsa users need
some small changes to deal with this. (arch/arm)
- The DSA and Ethertype DSA tagging modules need to be extended to
use the DSA device ID field on receive and demultiplex the packet
accordingly, and fill in the DSA device ID field on transmit
according to which switch chip the packet is heading to.
(net/dsa/tag_{dsa,edsa}.c)
- The concept of "CPU port", which is the switch chip port that the
CPU is connected to (port 10 on switch chip 0 in the example), needs
to be extended with the concept of "upstream port", which is the
port on the switch chip that will bring us one hop closer to the CPU
(port 10 for both switch chips in the example above).
- The dsa platform data needs to specify which ports on which switch
chips are links to other switch chips, so that we can enable DSA
tagging mode on them. (For inter-switch links, we always use
non-EtherType DSA tagging, since it has lower overhead. The CPU
link uses dsa or edsa tagging depending on what the 'root' switch
chip supports.) This is done by specifying "dsa" for the given
port in the port array.
- The dsa platform data needs to be extended with information on via
which port to reach any given switch chip from any given switch chip.
This info is specified via the per-switch chip data struct ->rtable[]
array, which gives the nexthop ports for each of the other switches
in the tree.
For the example topology above, the dsa platform data would look
something like this:
static struct dsa_chip_data sw[2] = {
{
.mii_bus = &foo,
.sw_addr = 1,
.port_names[0] = "p1",
.port_names[1] = "p2",
.port_names[2] = "p3",
.port_names[3] = "p4",
.port_names[4] = "p5",
.port_names[5] = "p6",
.port_names[6] = "p7",
.port_names[7] = "p8",
.port_names[9] = "dsa",
.port_names[10] = "cpu",
.rtable = (s8 []){ -1, 9, },
}, {
.mii_bus = &foo,
.sw_addr = 2,
.port_names[0] = "p9",
.port_names[1] = "p10",
.port_names[2] = "p11",
.port_names[3] = "p12",
.port_names[4] = "p13",
.port_names[5] = "p14",
.port_names[6] = "p15",
.port_names[7] = "p16",
.port_names[10] = "dsa",
.rtable = (s8 []){ 10, -1, },
},
},
static struct dsa_platform_data pd = {
.netdev = &foo,
.nr_switches = 2,
.sw = sw,
};
Signed-off-by: Lennert Buytenhek <buytenh@marvell.com>
Tested-by: Gary Thomas <gary@mlbassoc.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for the Marvell 88E6095/6095F switch chips. These
chips are similar to the 88e6131, so we can add the support to
mv88e6131.c easily.
Thanks to Gary Thomas <gary@mlbassoc.com> and Jesper Dangaard
Brouer <hawk@diku.dk> for testing various patches.
Signed-off-by: Lennert Buytenhek <buytenh@marvell.com>
Tested-by: Gary Thomas <gary@mlbassoc.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
..so that we can parse the DSA topology from 'ip link' output:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast qlen 1000
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast qlen 1000
4: lan1@eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue
5: lan2@eth0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue
6: lan3@eth0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue
7: lan4@eth0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue
Signed-off-by: Lennert Buytenhek <buytenh@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix compiler warning about non-const format string.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Protocols should be able to use constant value for the descriptor.
Minor whitespace cleanup as well
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no gain using prefetch() in dev_hard_start_xmit(), since
we already had to read ops->ndo_select_queue pointer in dev_pick_tx(),
and both pointers are probably located in the same cache line.
This prefetch call slows down fast path because of a stall in address
computation.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove 2 TEST_FRAME hacks that are no longer needed. These allowed
sctp regression tests to compile before, but are no longer needed.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some minor changes to queue hashing:
1. Use const on accessor functions
2. Export skb_tx_hash for use in drivers (see ixgbe)
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rather than calling device pointer directly (which is incorrect with
net_device_ops), use the standard dev_change_mtu. Compile tested only.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_sack_swap seems unnecessary so I pushed swap to the caller.
Also removed comment that seemed then pointless, and added include
when not already there. Compile tested.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
A zero length frame filter was recently introduced in ROSE protocole.
Previous commit makes the same at AX25 protocole level.
This patch has the same purpose for NetRom protocole.
The reason is that empty frames have no meaning in NetRom protocole.
Signed-off-by: Bernard Pidoux <f6bvp@amsat.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In previous commit 244f46ae6e
was introduced a zero length frame filter for ROSE protocole.
This patch has the same purpose at AX25 frame level for the same
reason. Empty frames have no meaning in AX25 protocole.
Signed-off-by: Bernard Pidoux <f6bvp@amsat.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
NL80211_CMD_GET_MESH_PARAMS and NL80211_CMD_SET_MESH_PARAMS handlers
did not verify whether a function pointer is NULL (not supported by
the driver) before trying to call the function. The former nl80211
command is available for unprivileged users, too, so this can
potentially allow normal users to kill networking (or worse..) if
mac80211 is built without CONFIG_MAC80211_MESH=y.
Signed-off-by: Jouni Malinen <jouni.malinen@atheros.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
printk formats in prior commit were reversed/incorrect.
Compiled without warning on x86 and x86_64, but detected on ppc.
Signed-off-by: Tom Talpey <tmtalpey@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
As long as one task is holding the socket lock, then calls to
xprt_force_disconnect(xprt) will not succeed in shutting down the socket.
In particular, this would mean that a server initiated shutdown will not
succeed until the lock is relinquished.
In order to avoid the deadlock, we should ensure that xs_tcp_send_request()
closes the socket on EPIPE errors too.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This fixes a regression against FreeBSD servers as reported by Tomas
Kasparek. Apparently when using RPC over a TCP socket, the FreeBSD servers
don't ever react to the client closing the socket, and so commit
e06799f958 (SUNRPC: Use shutdown() instead of
close() when disconnecting a TCP socket) causes the setup to hang forever
whenever the client attempts to close and then reconnect.
We break the deadlock by adding a 'linger2' style timeout to the socket,
after which, the client will abort the connection using a TCP 'RST'.
The default timeout is set to 15 seconds. A subsequent patch will put it
under user control by means of a systctl.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
dev can be NULL in ip[6]_frag_reasm for skb's coming from RAW sockets.
Quagga's OSPFD sends fragmented packets on a RAW socket, when netfilter
conntrack reassembles them on the OUTPUT path you hit this code path.
You can test it with something like "hping2 -0 -d 2000 -f AA.BB.CC.DD"
With help from Jarek Poplawski.
Signed-off-by: Jorge Boncompte [DTI2] <jorge@dti2.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
struct sk_buff pointers should be freed with kfree_skb.
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
broken by commit 5e739d1752aca4e8f3e794d431503bfca3162df4; AFAICS should
be -stable fodder as well...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Aced-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix warnings from current gcc about using non-const strings as printf
args in TIPC. Compile tested only (not a TIPC user).
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes the regressions cause by
commit 1326c3d5a4
(v2.6.28-rc6-461-g23a12b1) broke the display of local and remote
addresses of an SIT tunnel in iproute2.
nt->parms is used by ipip6_tunnel_init() and therefore need to be
initialized first.
Tracked as http://bugzilla.kernel.org/show_bug.cgi?id=12868
Reported-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes an unused parameter (addr_len) from tcp_recv_urg()
method in net/ipv4/tcp.c.
Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix the behavior of allowing both sysctl and addrconf_dad_failure()
to set the disable_ipv6 parameter without any bad side-effects.
If DAD fails and accept_dad > 1, we will still set disable_ipv6=1,
but then instead of allowing an RA to add an address then
immediately fail DAD, we simply don't allow the address to be
added in the first place. This also lets the user set this flag
and disable all IPv6 addresses on the interface, or on the entire
system.
Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow the NFSv4 server to make use of TCP autotuning behaviour, which
was previously disabled by setting the sk_userlocks variable.
Set the receive buffers to be big enough to receive the whole RPC
request, and set this for the listening socket, not the accept socket.
Remove the code that readjusts the receive/send buffer sizes for the
accepted socket. Previously this code was used to influence the TCP
window management behaviour, which is no longer needed when autotuning
is enabled.
This can improve IO bandwidth on networks with high bandwidth-delay
products, where a large tcp window is required. It also simplifies
performance tuning, since getting adequate tcp buffers previously
required increasing the number of nfsd threads.
Signed-off-by: Olga Kornievskaia <aglo@citi.umich.edu>
Cc: Jim Rees <rees@umich.edu>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Add /proc/fs/nfsd/pool_stats to export to userspace various
statistics about the operation of rpc server thread pools.
This patch is based on a forward-ported version of
knfsd-add-pool-thread-stats which has been shipping in the SGI
"Enhanced NFS" product since 2006 and which was previously
posted:
http://article.gmane.org/gmane.linux.nfs/10375
It has also been updated thus:
* moved EXPORT_SYMBOL() to near the function it exports
* made the new struct struct seq_operations const
* used SEQ_START_TOKEN instead of ((void *)1)
* merged fix from SGI PV 990526 "sunrpc: use dprintk instead of
printk in svc_pool_stats_*()" by Harshula Jayasuriya.
* merged fix from SGI PV 964001 "Crash reading pool_stats before
nfsds are started".
Signed-off-by: Greg Banks <gnb@sgi.com>
Signed-off-by: Harshula Jayasuriya <harshula@sgi.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Avoid overloading the CPU scheduler with enormous load averages
when handling high call-rate NFS loads. When the knfsd bottom half
is made aware of an incoming call by the socket layer, it tries to
choose an nfsd thread and wake it up. As long as there are idle
threads, one will be woken up.
If there are lot of nfsd threads (a sensible configuration when
the server is disk-bound or is running an HSM), there will be many
more nfsd threads than CPUs to run them. Under a high call-rate
low service-time workload, the result is that almost every nfsd is
runnable, but only a handful are actually able to run. This situation
causes two significant problems:
1. The CPU scheduler takes over 10% of each CPU, which is robbing
the nfsd threads of valuable CPU time.
2. At a high enough load, the nfsd threads starve userspace threads
of CPU time, to the point where daemons like portmap and rpc.mountd
do not schedule for tens of seconds at a time. Clients attempting
to mount an NFS filesystem timeout at the very first step (opening
a TCP connection to portmap) because portmap cannot wake up from
select() and call accept() in time.
Disclaimer: these effects were observed on a SLES9 kernel, modern
kernels' schedulers may behave more gracefully.
The solution is simple: keep in each svc_pool a counter of the number
of threads which have been woken but have not yet run, and do not wake
any more if that count reaches an arbitrary small threshold.
Testing was on a 4 CPU 4 NIC Altix using 4 IRIX clients, each with 16
synthetic client threads simulating an rsync (i.e. recursive directory
listing) workload reading from an i386 RH9 install image (161480
regular files in 10841 directories) on the server. That tree is small
enough to fill in the server's RAM so no disk traffic was involved.
This setup gives a sustained call rate in excess of 60000 calls/sec
before being CPU-bound on the server. The server was running 128 nfsds.
Profiling showed schedule() taking 6.7% of every CPU, and __wake_up()
taking 5.2%. This patch drops those contributions to 3.0% and 2.2%.
Load average was over 120 before the patch, and 20.9 after.
This patch is a forward-ported version of knfsd-avoid-nfsd-overload
which has been shipping in the SGI "Enhanced NFS" product since 2006.
It has been posted before:
http://article.gmane.org/gmane.linux.nfs/10374
Signed-off-by: Greg Banks <gnb@sgi.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Convert the remaining refcount users.
As pointed out by Patrick McHardy, the protocols can be accessed safely using RCU.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
On the legacy netif_rx path, I incorrectly tried to optimise
the napi_complete call by using __napi_complete before we reenable
IRQs. This simply doesn't work since we need to flush the held
GRO packets first.
This patch fixes it by doing the obvious thing of reenabling
IRQs first and then calling napi_complete.
Reported-by: Frank Blaschka <blaschka@linux.vnet.ibm.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jarek Poplawski pointed out that my previous fix is broken for
VLAN+netpoll as if netpoll is enabled we'd end up in the normal
receive path instead of the VLAN receive path.
This patch fixes it by calling the VLAN receive hook.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This allows us to send to userspace "regulatory" events.
For now we just send an event when we change regulatory domains.
We also notify userspace when devices are using their own custom
world roaming regulatory domains.
Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
We do this so we can later inform userspace who set the
regulatory domain and provide details of the request.
Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This is not used as we can always just assume the first
regulatory domain set will _always_ be a static regulatory
domain. REGDOM_SET_BY_CORE will be the first request from
cfg80211 for a regdomain and that then populates the first
regulatory request.
Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Even after commit "mac80211: deauth when interface is marked down"
(e327b847 on Linus tree), userspace still isn't notified when interface
goes down. There isn't a problem with this commit, but because of other
code changes it doesn't work on kernels >= 2.6.28 (works if same/similar
change applied on 2.6.27 for example).
The issue is as follows: after commit "mac80211: restructure disassoc/deauth
flows" in 2.6.28, the call to ieee80211_sta_deauthenticate added by
commit e327b847 will not work: because we do sta_info_flush(local, sdata)
inside ieee80211_stop (iface.c), all stations in interface are cleared, so
when calling ieee80211_sta_deauthenticate->ieee80211_set_disassoc (mlme.c),
inside ieee80211_set_disassoc we have this in the beginning:
sta = sta_info_get(local, ifsta->bssid);
if (!sta) {
The !sta check triggers, thus the function returns early and
ieee80211_sta_send_apinfo(sdata, ifsta) later isn't called, so
wpa_supplicant/userspace isn't notified with SIOCGIWAP.
This commit moves deauthentication to before flushing STA info
(sta_info_flush), thus the above can't happen and userspace is really
notified when interface goes down.
Signed-off-by: Herton Ronaldo Krzesinski <herton@mandriva.com.br>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
If cfg80211 requests a scan it awaits either a return code != 0 from
the scan function or the cfg80211_scan_done to be called. In case of
a STA mac80211's scan function ever returns 0 and queues the scan request.
If ieee80211_sta_work is executed and ieee80211_start_scan fails for
some reason cfg80211_scan_done will never be called but cfg80211 still
thinks the scan was triggered successfully and will refuse any future
scan requests due to drv->scan_req not being cleaned up.
If a scan is triggered from within the MLME a similar problem appears. If
ieee80211_start_scan returns an error, local->scan_req will not be reset
and mac80211 will refuse any future scan requests.
Hence, in both cases call ieee80211_scan_failed (which notifies cfg80211
and resets local->scan_req) if ieee80211_start_scan returns an error.
Signed-off-by: Helmut Schaa <helmut.schaa@googlemail.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This is the lowest value amongst countries which do enable 5 GHz operation.
Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Incorrect local->wmm_acm bits were set for AC_BK and AC_BE. Fix this
and add some comments to make it easier to understand the AC-to-UP(pair)
mapping. Set the wmm_acm bits (and show WMM debug) even if the driver
does not implement conf_tx() handler.
In addition, fix the ACM-based AC downgrade code to not use the
highest priority in error cases. We need to break the loop to get the
correct AC_BK value (3) instead of returning 0 (which would indicate
AC_VO). The comment here was not really very useful either, so let's
provide somewhat more helpful description of the situation.
Since it is very unlikely that the ACM flag would be set for AC_BK and
AC_BE, these bugs are not likely to be seen in real life networks.
Anyway, better do these things correctly should someone really use
silly AP configuration (and to pass some functionality tests, too).
Remove the TODO comment about handling ACM. Downgrading AC is
perfectly valid mechanism for ACM. Eventually, we may add support for
WMM-AC and send a request for a TS, but anyway, that functionality
won't be here at the location of this TODO comment.
Signed-off-by: Jouni Malinen <jouni.malinen@atheros.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
It was possible to hit a kernel panic on NULL pointer dereference in
dev_queue_xmit() when sending power save buffered frames to a STA that
woke up from sleep. This happened when the buffered frame was requeued
for transmission in ap_sta_ps_end(). In order to avoid the panic, copy
the skb->dev and skb->iif values from the first fragment to all other
fragments.
Signed-off-by: Jouni Malinen <jouni.malinen@atheros.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
When they were part of the now defunct ieee80211 component, these
messages were only visible when special debugging settings were enabled.
Let's mirror that with a new lib80211 debugging Kconfig option.
Signed-off-by: John W. Linville <linville@tuxdriver.com>
As my netpoll fix for net doesn't really work for net-next, we
need this update to move the checks into the right place. As it
stands we may pass freed skbs to netpoll_receive_skb.
This patch also introduces a netpoll_rx_on function to avoid GRO
completely if we're invoked through netpoll. This might seem
paranoid but as netpoll may have an external receive hook it's
better to be safe than sorry. I don't think we need this for
2.6.29 though since there's nothing immediately broken by it.
This patch also moves the GRO_* return values to netdevice.h since
VLAN needs them too (I tried to avoid this originally but alas
this seems to be the easiest way out). This fixes a bug in VLAN
where it continued to use the old return value 2 instead of the
correct GRO_DROP.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the iptables cluster match. This match can be used
to deploy gateway and back-end load-sharing clusters. The cluster
can be composed of 32 nodes maximum (although I have only tested
this with two nodes, so I cannot tell what is the real scalability
limit of this solution in terms of cluster nodes).
Assuming that all the nodes see all packets (see below for an
example on how to do that if your switch does not allow this), the
cluster match decides if this node has to handle a packet given:
(jhash(source IP) % total_nodes) & node_mask
For related connections, the master conntrack is used. The following
is an example of its use to deploy a gateway cluster composed of two
nodes (where this is the node 1):
iptables -I PREROUTING -t mangle -i eth1 -m cluster \
--cluster-total-nodes 2 --cluster-local-node 1 \
--cluster-proc-name eth1 -j MARK --set-mark 0xffff
iptables -A PREROUTING -t mangle -i eth1 \
-m mark ! --mark 0xffff -j DROP
iptables -A PREROUTING -t mangle -i eth2 -m cluster \
--cluster-total-nodes 2 --cluster-local-node 1 \
--cluster-proc-name eth2 -j MARK --set-mark 0xffff
iptables -A PREROUTING -t mangle -i eth2 \
-m mark ! --mark 0xffff -j DROP
And the following commands to make all nodes see the same packets:
ip maddr add 01:00:5e:00:01:01 dev eth1
ip maddr add 01:00:5e:00:01:02 dev eth2
arptables -I OUTPUT -o eth1 --h-length 6 \
-j mangle --mangle-mac-s 01:00:5e:00:01:01
arptables -I INPUT -i eth1 --h-length 6 \
--destination-mac 01:00:5e:00:01:01 \
-j mangle --mangle-mac-d 00:zz:yy:xx:5a:27
arptables -I OUTPUT -o eth2 --h-length 6 \
-j mangle --mangle-mac-s 01:00:5e:00:01:02
arptables -I INPUT -i eth2 --h-length 6 \
--destination-mac 01:00:5e:00:01:02 \
-j mangle --mangle-mac-d 00:zz:yy:xx:5a:27
In the case of TCP connections, pickup facility has to be disabled
to avoid marking TCP ACK packets coming in the reply direction as
valid.
echo 0 > /proc/sys/net/netfilter/nf_conntrack_tcp_loose
BTW, some final notes:
* This match mangles the skbuff pkt_type in case that it detects
PACKET_MULTICAST for a non-multicast address. This may be done in
a PKTTYPE target for this sole purpose.
* This match supersedes the CLUSTERIP target.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Module specific data moved into per-net site and being allocated/freed
during net namespace creation/deletion.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Daniel Lezcano <daniel.lezcano@free.fr>
Signed-off-by: Patrick McHardy <kaber@trash.net>
NEXTHDR_NONE doesn't has an IPv6 option header, so the first check
for the length will always fail and results in a confusing message
"too short" if debugging enabled. With this patch, we check for
NEXTHDR_NONE before length sanity checkings are done.
Signed-off-by: Christoph Paasch <christoph.paasch@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
We currently use the negative value in the conntrack code to encode
the packet verdict in the error. As NF_DROP is equal to 0, inverting
NF_DROP makes no sense and, as a result, no packets are ever dropped.
Signed-off-by: Christoph Paasch <christoph.paasch@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
This patch fixes a possible crash due to the missing initialization
of the expectation class when nf_ct_expect_related() is called.
Reported-by: BORBELY Zoltan <bozo@andrews.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Commit 784544739a (netfilter: iptables:
lock free counters) broke a number of modules whose rule data referenced
itself. A reallocation would not reestablish the correct references, so
it is best to use a separate struct that does not fall under RCU.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Removing the BKL from FASYNC handling ran into the challenge of keeping the
setting of the FASYNC bit in filp->f_flags atomic with regard to calls to
the underlying fasync() function. Andi Kleen suggested moving the handling
of that bit into fasync(); this patch does exactly that. As a result, we
have a couple of internal API changes: fasync() must now manage the FASYNC
bit, and it will be called without the BKL held.
As it happens, every fasync() implementation in the kernel with one
exception calls fasync_helper(). So, if we make fasync_helper() set the
FASYNC bit, we can avoid making any changes to the other fasync()
functions - as long as those functions, themselves, have proper locking.
Most fasync() implementations do nothing but call fasync_helper() - which
has its own lock - so they are easily verified as correct. The BKL had
already been pushed down into the rest.
The networking code has its own version of fasync_helper(), so that code
has been augmented with explicit FASYNC bit handling.
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: David Miller <davem@davemloft.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
The ip_queue module is missing the net-pf-16-proto-3 alias that would
causae it to be auto-loaded when a socket of that type is opened. This
patch adds the alias.
Signed-off-by: Scott James Remnant <scott@canonical.com>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
The ip6_queue module is missing the net-pf-16-proto-13 alias that would
cause it to be auto-loaded when a socket of that type is opened. This
patch adds the alias.
Signed-off-by: Scott James Remnant <scott@canonical.com>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
This patch moves the event reporting outside the lock section. With
this patch, the creation and update of entries is homogeneous from
the event reporting perspective. Moreover, as the event reporting is
done outside the lock section, the netlink broadcast delivery can
benefit of the yield() call under congestion.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
This patch moves the preliminary checkings that must be fulfilled
to update a conntrack, which are the following:
* NAT manglings cannot be updated
* Changing the master conntrack is not allowed.
This patch is a cleanup.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
This patch moves the assignation of the master conntrack to
ctnetlink_create_conntrack(), which is where it really belongs.
This patch is a cleanup.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
This patch increases the statistics of packets drop if the sequence
adjustment fails in ipv4_confirm().
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
This patch modifies the proc output to add display of registered
loggers. The content of /proc/net/netfilter/nf_log is modified. Instead
of displaying a protocol per line with format:
proto:logger
it now displays:
proto:logger (comma_separated_list_of_loggers)
NONE is used as keyword if no logger is used.
Signed-off-by: Eric Leblond <eric@inl.fr>
Signed-off-by: Patrick McHardy <kaber@trash.net>
This patch modifies nf_log to use a linked list of loggers for each
protocol. This list of loggers is read and write protected with a
mutex.
This patch separates registration and binding. To be used as
logging module, a module has to register calling nf_log_register()
and to bind to a protocol it has to call nf_log_bind_pf().
This patch also converts the logging modules to the new API. For nfnetlink_log,
it simply switchs call to register functions to call to bind function and
adds a call to nf_log_register() during init. For other modules, it just
remove a const flag from the logger structure and replace it with a
__read_mostly.
Signed-off-by: Eric Leblond <eric@inl.fr>
Signed-off-by: Patrick McHardy <kaber@trash.net>
It's not too likely to happen, would basically require crafted
packets (must hit the max guard in tcp_bound_to_half_wnd()).
It seems that nothing that bad would happen as there's tcp_mems
and congestion window that prevent runaway at some point from
hurting all too much (I'm not that sure what all those zero
sized segments we would generate do though in write queue).
Preventing it regardless is certainly the best way to go.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Evgeniy Polyakov <zbr@ioremap.net>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
The results is very unlikely change every so often so we
hardly need to divide again after doing that once for a
connection. Yet, if divide still becomes necessary we
detect that and do the right thing and again settle for
non-divide state. Takes the u16 space which was previously
taken by the plain xmit_size_goal.
This should take care part of the tso vs non-tso difference
we found earlier.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
There's very little need for most of the callsites to get
tp->xmit_goal_size updated. That will cost us divide as is,
so slice the function in two. Also, the only users of the
tp->xmit_goal_size are directly behind tcp_current_mss(),
so there's no need to store that variable into tcp_sock
at all! The drop of xmit_goal_size currently leaves 16-bit
hole and some reorganization would again be necessary to
change that (but I'm aiming to fill that hole with u16
xmit_goal_size_segs to cache the results of the remaining
divide to get that tso on regression).
Bring xmit_goal_size parts into tcp.c
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Evgeniy Polyakov <zbr@ioremap.net>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
It seems that no variables clash such that we couldn't do
the check just once later on. Therefore move it.
Also kill dead obvious comment, dead argument and add
unlikely since this mtu probe does not happen too often.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wow, it was quite tricky to merge that stream of negations
but I think I finally got it right:
check & replace_ts_recent:
(s32)(rcv_tsval - ts_recent) >= 0 => 0
(s32)(ts_recent - rcv_tsval) <= 0 => 0
discard:
(s32)(ts_recent - rcv_tsval) > TCP_PAWS_WINDOW => 1
(s32)(ts_recent - rcv_tsval) <= TCP_PAWS_WINDOW => 0
I toggled the return values of tcp_paws_check around since
the old encoding added yet-another negation making tracking
of truth-values really complicated.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
I've already forgotten what for this was necessary, anyway
it's no longer used (if it ever was).
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the pure assignment case, the earlier zeroing is
still in effect.
David S. Miller raised concerns if the ifs are there to avoid
dirtying cachelines. I came to these conclusions:
> We'll be dirty it anyway (now that I check), the first "real" statement
> in tcp_rcv_established is:
>
> tp->rx_opt.saw_tstamp = 0;
>
> ...that'll land on the same dword. :-/
>
> I suppose the blocks are there just because they had more complexity
> inside when they had to calculate the eff_sacks too (maybe it would
> have been better to just remove them in that drop-patch so you would
> have had less head-ache :-)).
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
While looking for a possible reason of bugzilla report on HTB oops:
http://bugzilla.kernel.org/show_bug.cgi?id=12858
I found the code in htb_delete calling htb_destroy_class on zero
refcount is very misleading: it can suggest this is a common path, and
destroy is called under sch_tree_lock. Actually, this can never happen
like this because before deletion cops->get() is done, and after
delete a class is still used by tclass_notify. The class destroy is
always called from cops->put(), so without sch_tree_lock.
This doesn't mean much now (since 2.6.27) because all vulnerable calls
were moved from htb_destroy_class to htb_delete, but there was a bug
in older kernels. The same change is done for other classful scheds,
which, it seems, didn't have similar locking problems here.
Reported-by: m0sia <m0sia@m0sia.ru>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb->len is an unsigned int, so the test in x25_rx_call_request() always
evaluates to true.
len in x25_sendmsg() is unsigned as well. so -ERRORS returned by x25_output()
are not noticed.
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Windows (XP at least) hosts on boot, with configured static ip, performing
address conflict detection, which is defined in RFC3927.
Here is quote of important information:
"
An ARP announcement is identical to the ARP Probe described above,
except that now the sender and target IP addresses are both set
to the host's newly selected IPv4 address.
"
But it same time this goes wrong with RFC5227.
"
The 'sender IP address' field MUST be set to all zeroes; this is to avoid
polluting ARP caches in other hosts on the same link in the case
where the address turns out to be already in use by another host.
"
When ARP proxy configured, it must not answer to both cases, because
it is address conflict verification in any case. For Windows it is just
causing to detect false "ip conflict". Already there is code for RFC5227, so
just trivially we just check also if source ip == target ip.
Signed-off-by: Denys Fedoryshchenko <denys@visp.net.lb>
Signed-off-by: David S. Miller <davem@davemloft.net>
The change to make xfrm_state objects hash on source address
broke the case where such source addresses are wildcarded.
Fix this by doing a two phase lookup, first with fully specified
source address, next using saddr wildcarded.
Reported-by: Nicolas Dichtel <nicolas.dichtel@dev.6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add FC CRC offload check for ETH_P_FCOE.
Signed-off-by: Yi Zou <yi.zou@intel.com>
Acked-by: David Miller <davem@davemloft.net>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
RFC5061 states:
Each adaptation layer that is defined that wishes
to use this parameter MUST specify an adaptation code point in an
appropriate RFC defining its use and meaning.
If the user has not set one - assume they don't want to sent the param
with a zero Adaptation Code Point.
Rationale - Currently the IANA defines zero as reserved - and
1 as the only valid value - so we consider zero to be unset - to save
adding a boolean to the socket structure.
Including this parameter unconditionally causes endpoints that do not
understand it to report errors unnecessarily.
Signed-off-by: Malcolm Lashley <mlashley@gmail.com>
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RFC3758 Section 3.3.1. Sending Forward-TSN-Supported param in INIT
Note that if the endpoint chooses NOT to include the parameter, then
at no time during the life of the association can it send or process
a FORWARD TSN.
If peer does not support PR-SCTP capable, don't send FORWARD-TSN chunk
to peer.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fix to indicate ASCONF support in INIT-ACK only if peer has
such capable.
This patch also fix to calc the chunk size if peer has no FWD-TSN
capable.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sctp_inet_listen() call is split between UDP and TCP style. Looking
at the code, the two functions are almost the same and can be
merged into a single helper. This also fixes a bug that was
fixed in the UDP function, but missed in the TCP function.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Impact: cleanup
node_to_cpumask (and the blecherous node_to_cpumask_ptr which
contained a declaration) are replaced now everyone implements
cpumask_of_node.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
If xs_nospace() finds that the socket has disconnected, it attempts to
return ENOTCONN, however that value is then squashed by the callers.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Enforce the comment in xs_tcp_connect_worker4/xs_tcp_connect_worker6 that
we should delay, then retry on certain connection errors.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
While we should definitely return socket errors to the task that is
currently trying to send data, there is no need to propagate the same error
to all the other tasks on xprt->pending. Doing so actually slows down
recovery, since it causes more than one tasks to attempt socket recovery.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If we get an ECONNREFUSED error, we currently go to sleep on the
'xprt->sending' wait queue. The problem is that no timeout is set there,
and there is nothing else that will wake the task up later.
We should deal with ECONNREFUSED in call_status, given that is where we
also deal with -EHOSTDOWN, and friends.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
...so that we can distinguish between when we need to shutdown and when we
don't. Also remove the call to xs_tcp_shutdown() from xs_tcp_connect(),
since xprt_connect() makes the same test.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the socket is unconnected, and xprt_transmit() returns ENOTCONN, we
currently give up the lock on the transport channel. Doing so means that
the lock automatically gets assigned to the next task in the xprt->sending
queue, and so that task needs to be woken up to do the actual connect.
The following patch aims to avoid that unnecessary task switch.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Provide an api to attempt to load any necessary kernel RPC
client transport module automatically. By convention, the
desired module name is "xprt"+"transport name". For example,
when NFS mounting with "-o proto=rdma", attempt to load the
"xprtrdma" module.
Signed-off-by: Tom Talpey <tmtalpey@gmail.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Certain client rpc's which contain both lengthy page-contained
metadata and a non-empty xdr_tail buffer require careful handling
to avoid overlapped memory copying. Rearranging of existing rpcrdma
marshaling code avoids it; this fixes an NFSv4 symlink creation error
detected with connectathon basic/test8 to multiple servers.
Signed-off-by: Tom Talpey <tmtalpey@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Certain client-provided RPCRDMA chunk alignments result in an
additional scatter/gather entry, which triggered nfs/rdma server
assertions incorrectly. OpenSolaris nfs/rdma client connectathon
testing was blocked by these in the special/locking section.
Signed-off-by: Tom Talpey <tmtalpey@gmail.com>
Cc: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
To clear out old state, the UDP connect workers unconditionally invoke
xs_close() before proceeding with a new connect. Nowadays this causes
a spurious wake-up of the task waiting for the connect to complete.
This is a little racey, but usually harmless. The waiting task
immediately retries the connect via a call_bind/call_connect sequence,
which usually finds the transport already in the connected state
because the connect worker has finished in the background.
To avoid a spurious wake-up, factor the xs_close() logic that resets
the underlying socket into a helper, and have the UDP connect workers
call that helper instead of xs_close().
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the transport isn't bound, then we should just return ENOTCONN, letting
call_connect_status() and/or call_status() deal with retrying. Currently,
we appear to abort all pending tasks with an EIO error.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We can Oops in both xs_udp_send_request() and xs_tcp_send_request() if the
call to xs_sendpages() returns an error due to the socket not yet being
set up.
Deal with that situation by returning a new error: ENOTSOCK, so that we
know to avoid dereferencing transport->sock.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Some systems send SYN packets with apparently wrong RFC1323 timestamp
option values [timestamp tsval=0 tsecr=0].
It might be for security reasons (http://www.secuobs.com/plugs/25220.shtml )
Linux TCP stack ignores this option and sends back a SYN+ACK packet
without timestamp option, thus many TCP flows cannot use timestamps
and lose some benefit of RFC1323.
Other operating systems seem to not care about initial tsval value, and let
tcp flows to negotiate timestamp option.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Do not try to "uninitialize" ipv6 if its initialization had been skipped
because module parameter disable=1 had been specified.
Reported-by: Thomas Backlund <tmb@mandriva.org>
Signed-off-by: John Dykstra <john.dykstra1@gmail.com>
Acked-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We should probably not be testing any flags after we've cleared the
RPC_TASK_RUNNING flag, since rpc_make_runnable() is then free to assign the
rpc_task to another workqueue, which may then destroy it.
We can fix any races with rpc_make_runnable() by ensuring that we only
clear the RPC_TASK_RUNNING flag while holding the rpc_wait_queue->lock that
the task is supposed to be sleeping on (and then checking whether or not
the task really is sleeping).
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Since dev_set_name takes a printf style string, new gcc complains
if arg is not const.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Protocols that use packet_type can be __read_mostly section for better
locality. Elminate any unnecessary initializations of NULL.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
freq_diff is unsigned, so test before subtraction
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
In IBSS mode, the beacon timestamp has to be filled with the
BSS's timestamp when joining, and set to zero when creating
a new BSS.
Signed-off-by: Sujith <Sujith.Manoharan@atheros.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>