We should probably not be testing any flags after we've cleared the
RPC_TASK_RUNNING flag, since rpc_make_runnable() is then free to assign the
rpc_task to another workqueue, which may then destroy it.
We can fix any races with rpc_make_runnable() by ensuring that we only
clear the RPC_TASK_RUNNING flag while holding the rpc_wait_queue->lock that
the task is supposed to be sleeping on (and then checking whether or not
the task really is sleeping).
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
A race between svc_revisit and svc_delete_xprt can result in
deferred requests holding references on a transport that can never be
recovered because dead transports are not enqueued for subsequent
processing.
Check for XPT_DEAD in revisit to clean up completing deferrals on a dead
transport and sweep a transport's deferred queue to do the same for queued
but unprocessed deferrals.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
The rqstp structure has a pointer to a svc_deferred_req record
that is allocated when requests are deferred. This record is common
to all transports and can be freed in common code.
Move the kfree of the rq_deferred to the common svc_xprt_release
function.
This also fixes a memory leak in the RDMA transport which does not
kfree the dr structure in it's version of the xpo_release_rqst callback.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
We want to ensure that connected sockets close down the connection when we
set XPT_CLOSE, so that we don't keep it hanging while cleaning up all the
stuff that is keeping a reference to the socket.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
svc_check_conn_limits() attempts to prevent denial of service attacks
by having the service close old connections once it reaches a
threshold. This threshold is based on the number of threads in the
service:
(serv->sv_nrthreads + 3) * 20
Once we reach this, we drop the oldest connections and a printk pops
to warn the admin that they should increase the number of threads.
Increasing the number of threads isn't an option however for services
like lockd. We don't want to eliminate this check entirely for such
services but we need some way to increase this limit.
This patch adds a sv_maxconn field to the svc_serv struct. When it's
set to 0, we use the current method to calculate the max number of
connections. RPC services can then set this on an as-needed basis.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Neil Brown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
... and don't bother in callers. Don't bother with zeroing i_blocks,
while we are at it - it's already been zeroed.
i_mode is not worth the effort; it has no common default value.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1429 commits)
net: Allow dependancies of FDDI & Tokenring to be modular.
igb: Fix build warning when DCA is disabled.
net: Fix warning fallout from recent NAPI interface changes.
gro: Fix potential use after free
sfc: If AN is enabled, always read speed/duplex from the AN advertising bits
sfc: When disabling the NIC, close the device rather than unregistering it
sfc: SFT9001: Add cable diagnostics
sfc: Add support for multiple PHY self-tests
sfc: Merge top-level functions for self-tests
sfc: Clean up PHY mode management in loopback self-test
sfc: Fix unreliable link detection in some loopback modes
sfc: Generate unique names for per-NIC workqueues
802.3ad: use standard ethhdr instead of ad_header
802.3ad: generalize out mac address initializer
802.3ad: initialize ports LACPDU from const initializer
802.3ad: remove typedef around ad_system
802.3ad: turn ports is_individual into a bool
802.3ad: turn ports is_enabled into a bool
802.3ad: make ntt bool
ixgbe: Fix set_ringparam in ixgbe to use the same memory pools.
...
Fixed trivial IPv4/6 address printing conflicts in fs/cifs/connect.c due
to the conversion to %pI (in this networking merge) and the addition of
doing IPv6 addresses (from the earlier merge of CIFS).
This patch extends the new upcall with a "service" field that currently
can have 2 values: "*" or "nfs". These values specify matching rules for
principals in the keytab file. The "*" means that gssd is allowed to use
"root", "nfs", or "host" keytab entries while the other option requires
"nfs".
Restricting gssd to use the "nfs" principal is needed for when the
server performs a callback to the client. The server in this case has
to authenticate itself as an "nfs" principal.
We also need "service" field to distiguish between two client-side cases
both currently using a uid of 0: the case of regular file access by the
root user, and the case of state-management calls (such as setclientid)
which should use a keytab for authentication. (And the upcall should
fail if an appropriate principal can't be found.)
Signed-off: Olga Kornievskaia <aglo@citi.umich.edu>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch extends the new upcall by adding a "target" field
communicating who we want to authenticate to (equivalently, the service
principal that we want to acquire a ticket for).
Signed-off: Olga Kornievskaia <aglo@citi.umich.edu>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch adds server-side support for callbacks other than AUTH_SYS.
Signed-off-by: Olga Kornievskaia <aglo@citi.umich.edu>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch adds client-side support to allow for callbacks other than
AUTH_SYS.
Signed-off-by: Olga Kornievskaia <aglo@citi.umich.edu>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The rpc client needs to know the principal that the setclientid was done
as, so it can tell gssd who to authenticate to.
Signed-off-by: Olga Kornievskaia <aglo@citi.umich.edu>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Two principals are involved in krb5 authentication: the target, who we
authenticate *to* (normally the name of the server, like
nfs/server.citi.umich.edu@CITI.UMICH.EDU), and the source, we we
authenticate *as* (normally a user, like bfields@UMICH.EDU)
In the case of NFSv4 callbacks, the target of the callback should be the
source of the client's setclientid call, and the source should be the
nfs server's own principal.
Therefore we allow svcgssd to pass down the name of the principal that
just authenticated, so that on setclientid we can store that principal
name with the new client, to be used later on callbacks.
Signed-off-by: Olga Kornievskaia <aglo@citi.umich.edu>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Implement the new upcall. We decide which version of the upcall gssd
will use (new or old), by creating both pipes (the new one named "gssd",
the old one named after the mechanism (e.g., "krb5")), and then waiting
to see which version gssd actually opens.
We don't permit pipes of the two different types to be opened at once.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Keep a pointer to the inode that the message is queued on in the struct
gss_upcall_msg. This will be convenient, especially after we have a
choice of two pipes that an upcall could be queued on.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Introduce a global variable pipe_version which will eventually be used
to keep track of which version of the upcall gssd is using.
For now, though, it only keeps track of whether any pipe is open or not;
it is negative if not, zero if one is opened. We use this to wait for
the first gssd to open a pipe.
(Minor digression: note this waits only for the very first open of any
pipe, not for the first open of a pipe for a given auth; thus we still
need the RPC_PIPE_WAIT_FOR_OPEN behavior to wait for gssd to open new
pipes that pop up on subsequent mounts.)
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Keep a count of the number of pipes open plus the number of messages on
a pipe. This count isn't used yet.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
I can't see any reason we need to call this until either the kernel or
the last gssd closes the pipe.
Also, this allows to guarantee that open_pipe and release_pipe are
called strictly in pairs; open_pipe on gssd's first open, release_pipe
on gssd's last close (or on the close of the kernel side of the pipe, if
that comes first).
That will make it very easy for the gss code to keep track of which
pipes gssd is using.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We want to transition to a new gssd upcall which is text-based and more
easily extensible.
To simplify upgrades, as well as testing and debugging, it will help if
we can upgrade gssd (to a version which understands the new upcall)
without having to choose at boot (or module-load) time whether we want
the new or the old upcall.
We will do this by providing two different pipes: one named, as
currently, after the mechanism (normally "krb5"), and supporting the
old upcall. One named "gssd" and supporting the new upcall version.
We allow gssd to indicate which version it supports by its choice of
which pipe to open.
As we have no interest in supporting *simultaneous* use of both
versions, we'll forbid opening both pipes at the same time.
So, add a new pipe_open callback to the rpc_pipefs api, which the gss
code can use to track which pipes have been open, and to refuse opens of
incompatible pipes.
We only need this to be called on the first open of a given pipe.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
I want to add a little more code here, so it'll be convenient to have
this flatter.
Also, I'll want to add another error condition, so it'll be more
convenient to return -ENOMEM than NULL in the error case. The only
caller is already converting NULL to -ENOMEM anyway.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We'll want to call this from elsewhere soon. And this is a bit nicer
anyway.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We're just about to kfree() gss_auth, so there's no point to setting any
of its fields.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
There's a bit of a chicken and egg problem when it comes to destroying
auth_gss credentials. When we destroy the last instance of a GSSAPI RPC
credential, we should send a NULL RPC call with a GSS procedure of
RPCSEC_GSS_DESTROY to hint to the server that it can destroy those
creds.
This isn't happening because we're setting clearing the uptodate bit on
the credentials and then setting the operations to the gss_nullops. When
we go to do the RPC call, we try to refresh the creds. That fails with
-EACCES and the call fails.
Fix this by not clearing the UPTODATE bit for the credentials and adding
a new crdestroy op for gss_nullops that just tears down the cred without
trying to destroy the context.
The only difference between this patch and the first one is the removal
of some minor formatting deltas.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Hi.
I've been looking at a bugzilla which describes a problem where
a customer was advised to use either the "noac" or "actimeo=0"
mount options to solve a consistency problem that they were
seeing in the file attributes. It turned out that this solution
did not work reliably for them because sometimes, the local
attribute cache was believed to be valid and not timed out.
(With an attribute cache timeout of 0, the cache should always
appear to be timed out.)
In looking at this situation, it appears to me that the problem
is that the attribute cache timeout code has an off-by-one
error in it. It is assuming that the cache is valid in the
region, [read_cache_jiffies, read_cache_jiffies + attrtimeo]. The
cache should be considered valid only in the region,
[read_cache_jiffies, read_cache_jiffies + attrtimeo). With this
change, the options, "noac" and "actimeo=0", work as originally
expected.
This problem was previously addressed by special casing the
attrtimeo == 0 case. However, since the problem is only an off-
by-one error, the cleaner solution is address the off-by-one
error and thus, not require the special case.
Thanx...
ps
Signed-off-by: Peter Staubach <staubach@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We've never considered the sunrpc code as part of any ABI to be used by
out-of-tree modules.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Somehow, this escaped the previous purge. There should be no need to keep
any extra locks in the XDR callbacks.
The NFS client XDR code only writes into private objects, whereas all reads
of shared objects are confined to fields that do not change, such as
filehandles...
Ditto for lockd, the NFSv2/v3 client mount code, and rpcbind.
The nfsd XDR code may require the BKL, but since it does a synchronous RPC
call from a thread that already holds the lock, that issue is moot.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Conflicts:
fs/nfsd/nfs4recover.c
Manually fixed above to use new creds API functions, e.g.
nfs4_save_creds().
Signed-off-by: James Morris <jmorris@namei.org>
* 'for-2.6.28' of git://linux-nfs.org/~bfields/linux:
NLM: client-side nlm_lookup_host() should avoid matching on srcaddr
nfsd: use of unitialized list head on error exit in nfs4recover.c
Add a reference to sunrpc in svc_addsock
nfsd: clean up grace period on early exit
fix this warning:
net/sunrpc/xprtrdma/verbs.c: In function ‘rpcrdma_conn_upcall’:
net/sunrpc/xprtrdma/verbs.c:279: warning: unused variable ‘addr’
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
this warning:
net/sunrpc/xprtrdma/svc_rdma_transport.c: In function ‘svc_rdma_accept’:
net/sunrpc/xprtrdma/svc_rdma_transport.c:830: warning: ‘dma_mr_acc’ may be used uninitialized in this function
triggers because GCC does not recognize the (correct) flow connection
between need_dma_mr and dma_mr_acc.
Annotate it.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
The svc_addsock function adds transport instances without taking a
reference on the sunrpc.ko module, however, the generic transport
destruction code drops a reference when a transport instance
is destroyed.
Add a try_module_get call to the svc_addsock function for transport
instances added by this function.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Tested-by: Jeff Moyer <jmoyer@redhat.com>
Fix a regression reported by Max Kellermann whereby kernel profiling
showed that his clients were spending 45% of their time in
rpcauth_lookup_credcache.
It turns out that although his processes had identical uid/gid/groups,
generic_match() was failing to detect this, because the task->group_info
pointers were not shared. This again lead to the creation of a huge number
of identical credentials at the RPC layer.
The regression is fixed by comparing the contents of task->group_info
if the actual pointers are not identical.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wrap current->cred and a few other accessors to hide their actual
implementation.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
Separate the task security context from task_struct. At this point, the
security data is temporarily embedded in the task_struct with two pointers
pointing to it.
Note that the Alpha arch is altered as it refers to (E)UID and (E)GID in
entry.S via asm-offsets.
With comment fixes Signed-off-by: Marc Dionne <marc.c.dionne@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
Wrap access to task credentials so that they can be separated more easily from
the task_struct during the introduction of COW creds.
Change most current->(|e|s|fs)[ug]id to current_(|e|s|fs)[ug]id().
Change some task->e?[ug]id to task_e?[ug]id(). In some places it makes more
sense to use RCU directly rather than a convenient wrapper; these will be
addressed by later patches.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: J. Bruce Fields <bfields@fieldses.org>
Cc: Neil Brown <neilb@suse.de>
Cc: linux-nfs@vger.kernel.org
Signed-off-by: James Morris <jmorris@namei.org>
Using NIPQUAD() with NIPQUAD_FMT, %d.%d.%d.%d or %u.%u.%u.%u
can be replaced with %pI4
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
gcc warns when using the # modifier with the %p format specifier,
so we can't use this to omit the colons when needed, introduces
%pi6 instead.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Open code NIP6_FMT in the one call inside sscanf and one user
of NIP6() that could use %p6 in the netfilter code.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The define in kernel.h can be done away with at a later time.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We have to be careful when we try to unhash the credential in
put_rpccred(), because we're not holding the credcache lock, so the call to
rpcauth_unhash_cred() may fail if someone else has looked the cred up, and
obtained a reference to it.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We need to make sure that we don't remove creds from the cred_unused list
if they are still under the moratorium, or else they will never get
garbage collected.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the server sends us an RST error while we're in the TCP_ESTABLISHED
state, then that will not result in a state change, and so the RPC client
ends up hanging forever (see
http://bugzilla.kernel.org/show_bug.cgi?id=11154)
We can intercept the reset by setting up an sk->sk_error_report callback,
which will then allow us to initiate a proper shutdown and retry...
We also make sure that if the send request receives an ECONNRESET, then we
shutdown too...
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
net: Remove CONFIG_KMOD from net/ (towards removing CONFIG_KMOD entirely)
ipv4: Add a missing rcu_assign_pointer() in routing cache.
[netdrvr] ibmtr: PCMCIA IBMTR is ok on 64bit
xen-netfront: Avoid unaligned accesses to IP header
lmc: copy_*_user under spinlock
[netdrvr] myri10ge, ixgbe: remove broken select INTEL_IOATDMA
Some code here depends on CONFIG_KMOD to not try to load
protocol modules or similar, replace by CONFIG_MODULES
where more than just request_module depends on CONFIG_KMOD
and and also use try_then_request_module in ebtables.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'for-2.6.28' of git://linux-nfs.org/~bfields/linux: (59 commits)
svcrdma: Fix IRD/ORD polarity
svcrdma: Update svc_rdma_send_error to use DMA LKEY
svcrdma: Modify the RPC reply path to use FRMR when available
svcrdma: Modify the RPC recv path to use FRMR when available
svcrdma: Add support to svc_rdma_send to handle chained WR
svcrdma: Modify post recv path to use local dma key
svcrdma: Add a service to register a Fast Reg MR with the device
svcrdma: Query device for Fast Reg support during connection setup
svcrdma: Add FRMR get/put services
NLM: Remove unused argument from svc_addsock() function
NLM: Remove "proto" argument from lockd_up()
NLM: Always start both UDP and TCP listeners
lockd: Remove unused fields in the nlm_reboot structure
lockd: Add helper to sanity check incoming NOTIFY requests
lockd: change nlmclnt_grant() to take a "struct sockaddr *"
lockd: Adjust nlmsvc_lookup_host() to accomodate AF_INET6 addresses
lockd: Adjust nlmclnt_lookup_host() signature to accomodate non-AF_INET
lockd: Support non-AF_INET addresses in nlm_lookup_host()
NLM: Convert nlm_lookup_host() to use a single argument
svcrdma: Add Fast Reg MR Data Types
...
Clean up the various different email addresses of mine listed in the code
to a single current and valid address. As Dave says his network merges
for 2.6.28 are now done this seems a good point to send them in where
they won't risk disrupting real changes.
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The RPC/RDMA connection logic could return early from reconnection
attempts, leading to additional spurious retries.
Signed-off-by: Tom Talpey <talpey@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The RPC/RDMA code had a constant 5-second reconnect backoff, and
always performed it, even when re-establishing a connection to a
server after the RPC layer closed it due to being idle. Make it
an geometric backoff (up to 30 seconds), and don't delay idle
reconnect.
Signed-off-by: Tom Talpey <talpey@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The send marshaling code split a particular dprintk across two
lines, which makes it hard to extract from logfiles.
Signed-off-by: Tom Talpey <talpey@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Add defensive timeouts to wait_for_completion() calls in RDMA
address resolution, and make them interruptible. Fix the timeout
units to milliseconds (formerly jiffies) and move to private header.
Signed-off-by: Tom Talpey <talpey@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The RPC/RDMA code can leak RDMA connection manager endpoints in
certain error cases on connect. Don't signal unwanted events,
and be certain to destroy any allocated qp.
Signed-off-by: Tom Talpey <talpey@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The xprt_connect call path does not expect such errors as ECONNREFUSED
to be returned from failed transport connection attempts, otherwise it
translates them to EIO and signals fatal errors. For example, mount.nfs
prints simply "internal error". Translate all such errors to ENOTCONN
from RPC/RDMA to match sockets behavior.
Signed-off-by: Tom Talpey <talpey@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The RPC/RDMA protocol allows clients and servers to avoid RDMA
operations for data which is purely the result of XDR padding.
On the client, automatically insert the necessary padding for
such server replies, and optionally don't marshal such chunks.
Signed-off-by: Tom Talpey <talpey@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
RDMA disconnects yield an upcall from the RDMA connection manager,
which can race with rpc transport close, e.g. on ^C of a mount.
Ensure any rdma cm_id and qp are fully destroyed before continuing.
Signed-off-by: Tom Talpey <talpey@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
An RPC/RDMA client cannot retransmit on an unbroken connection,
doing so violates its flow control with the server.
Signed-off-by: Tom Talpey <talpey@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This logic sets the connection parameter that configures the local device
and informs the remote peer how many concurrent incoming RDMA_READ
requests are supported. The original logic didn't really do what was
intended for two reasons:
- The max number supported by the device is typically smaller than
any one factor in the calculation used, and
- The field in the connection parameter structure where the value is
stored is a u8 and always overflows for the default settings.
So what really happens is the value requested for responder resources
is the left over 8 bits from the "desired value". If the desired value
happened to be a multiple of 256, the result was zero and it wouldn't
connect at all.
Given the above and the fact that max_requests is almost always larger
than the max responder resources supported by the adapter, this patch
simplifies this logic and simply requests the max supported by the device,
subject to a reasonable limit.
This bug was found by Jim Schutt at Sandia.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Acked-by: Tom Talpey <talpey@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Configure, detect and use "fastreg" support from IB/iWARP verbs
layer to perform RPC/RDMA memory registration.
Make FRMR the default memreg mode (will fall back if not supported
by the selected RDMA adapter).
This allows full and optimal operation over the cxgb3 adapter, and others.
Signed-off-by: Tom Talpey <talpey@netapp.com>
Acked-by: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
At transport creation, check for, and use, any local dma lkey.
Then, check that the selected memory registration mode is in fact
supported by the RDMA adapter selected for the mount. Fall back
to best alternative if not.
Signed-off-by: Tom Talpey <talpey@netapp.com>
Acked-by: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Refactor the memory registration and deregistration routines.
This saves stack space, makes the code more readable and prepares
to add the new FRMR registration methods.
Signed-off-by: Tom Talpey <talpey@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Despite the fact that cloned rpc clients won't have the cl_autobind flag
set, they may still find themselves calling rpcb_getport_async(). For this
to happen, it suffices for a _parent_ rpc_clnt to use autobinding, in which
case any clone may find itself triggering the !xprt_bound() case in
call_bind().
The correct fix for this is to walk back up the tree of cloned rpc clients,
in order to find the parent that 'owns' the transport, either because it
has clnt->cl_autobind set, or because it originally created the
transport...
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Basically, try_module_get here are pretty useless. Any other module using
this API will pin sunrpc in memory due using exported symbols.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The inititator/responder resources in the event have been swapped. They
no represent what the local peer would set their values to in order to
match the peer. Note that iWARP does not exchange these on the wire and
the provider is simply putting in the local device max.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Update the svc_rdma_send_error code to use the DMA LKEY which is valid
regardless of the memory registration strategy in use.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Use FRMR to map local RPC reply data. This allows RDMA_WRITE to send reply
data using a single WR. The FRMR is invalidated by linking the LOCAL_INV WR
to the RDMA_SEND message used to complete the reply.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
RPCRDMA requests that specify a read-list are fetched with RDMA_READ. Using
an FRMR to map the data sink improves NFSRDMA security on transports that
place the RDMA_READ data sink LKEY on the wire because the valid lifetime
of the MR is only the duration of the RDMA_READ. The LKEY is invalidated
when the last RDMA_READ WR completes.
Mapping the data sink also allows for very large amounts to data to be
fetched with a single WR, so if the client is also using FRMR, the entire
RPC read-list can be fetched with a single WR.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
WR can be submitted as linked lists of WR. Update the svc_rdma_send
routine to handle WR chains. This will be used to submit a WR that
uses an FRMR with another WR that invalidates the FRMR.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Update the svc_rdma_post_recv routine to use the adapter's global LKEY
instead of sc_phys_mr which is only valid when using a DMA MR.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Fast Reg MR introduces a new WR type. Add a service to register the
region with the adapter and update the completion handling to support
completions with a NULL WR context.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Query the device capabilities in the svc_rdma_accept function to determine
what advanced memory management capabilities are supported by the device.
Based on the query, select the most secure model available given the
requirements of the transport and capabilities of the adapter.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Add services for the allocating, freeing, and unmapping Fast Reg MR. These
services will be used by the transport connection setup, send and receive
routines.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Clean up: The svc_addsock() function no longer uses its "proto"
argument, so remove it.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
since commit ff7d9756b5
"nfsd: use static memory for callback program and stats"
do_probe_callback uses a static callback program
(NFS4_CALLBACK) rather than the one set in clp->cl_callback.cb_prog
as passed in by the client in setclientid (4.0)
or create_session (4.1).
This patches introduces rpc_create_args.prognumber that allows
overriding program->number when creating rpc_clnt.
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
The RPCB XDR functions are used for multiple procedures. For instance,
rpcb_encode_getaddr() is used for RPCB_GETADDR, RPCB_SET, and
RPCB_UNSET. Make the XDR debug messages more generic so they are less
confusing.
And, unlike in other RPC consumers in the kernel, a single debug flag
enables all levels of debug messages in the RPC bind client, including
XDR debug messages. Since the XDR decoders already report success or
failure in this case, remove redundant debug messages in the mid-level
rpcb_register_call() function.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
With the new rpcbind code, a PMAP_UNSET will not have any effect on
services registered via rpcbind v3 or v4.
Implement a version of svc_unregister() that uses an RPCB_UNSET with
an empty netid string to make sure we have cleared *all* entries for
a kernel RPC service when shutting down, or before starting a fresh
instance of the service.
Use the new version only when CONFIG_SUNRPC_REGISTER_V4 is enabled;
otherwise, the legacy PMAP version is used to ensure complete
backwards-compatibility with the Linux portmapper daemon.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Clean up: When doing an RPCB_SET, make the kernel's rpcb client use the
shorthand "::" for the universal form of the IPv6 ANY address.
Without this patch, rpcbind will advertise:
0000:0000:0000:0000:0000:0000:0000:0000.x.y
This is cosmetic only. It cleans up the display of information from
/sbin/rpcinfo.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
TI-RPC is a user-space library of RPC functions that replaces ONC RPC
and allows RPC to operate in the new world of IPv6.
TI-RPC combines the concept of a transport protocol (UDP and TCP)
and a protocol family (PF_INET and PF_INET6) into a single identifier
called a "netid." For example, "udp" means UDP over IPv4, and "udp6"
means UDP over IPv6.
For rpcbind, then, the RPC service tuple that is registered and
advertised is:
[RPC program, RPC version, service address and port, netid]
instead of
[RPC program, RPC version, port, protocol]
Service address is typically ANYADDR, but can be a specific address
of one of the interfaces on a multi-homed host. The third item in
the new tuple is expressed as a universal address.
The current Linux rpcbind implementation registers a netid for both
protocol families when RPCB_SET is done for just the PF_INET6 version
of the netid (ie udp6 or tcp6). So registering "udp6" causes a
registration for "udp" to appear automatically as well.
We've recently determined that this is incorrect behavior. In the
TI-RPC world, "udp6" is not meant to imply that the registered RPC
service handles requests from AF_INET as well, even if the listener
socket does address mapping. "udp" and "udp6" are entirely separate
capabilities, and must be registered separately.
The Linux kernel, unlike TI-RPC, leverages address mapping to allow a
single listener socket to handle requests for both AF_INET and AF_INET6.
This is still OK, but the kernel currently assumes registering "udp6"
will cover "udp" as well. It registers only "udp6" for it's AF_INET6
services, even though they handle both AF_INET and AF_INET6 on the same
port.
So svc_register() actually needs to register both "udp" and "udp6"
explicitly (and likewise for TCP). Until rpcbind is fixed, the
kernel can ignore the return code for the second RPCB_SET call.
Please merge this with commit 15231312:
SUNRPC: Support IPv6 when registering kernel RPC services
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: Olaf Kirch <okir@suse.de>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
In order to advertise NFS-related services on IPv6 interfaces via
rpcbind, the kernel RPC server implementation must use
rpcb_v4_register() instead of rpcb_register().
A new kernel build option allows distributions to use the legacy
v2 call until they integrate an appropriate user-space rpcbind
daemon that can support IPv6 RPC services.
I tried adding some automatic logic to fall back if registering
with a v4 protocol request failed, but there are too many corner
cases. So I just made it a compile-time switch that distributions
can throw when they've replaced portmapper with rpcbind.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Create a separate server-level interface for unregistering RPC services.
The mechanics of, and the API for, registering and unregistering RPC
services will diverge further as support for IPv6 is added.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Bruce suggested there's no need to expose the difference between an error
sending the PMAP_SET request and an error reply from the portmapper to
rpcb_register's callers. The user space equivalent of rpcb_register() is
pmap_set(3), which returns a bool_t : either the PMAP set worked, or it
didn't. Simple.
So let's remove the "*okay" argument from rpcb_register() and
rpcb_v4_register(), and simply return an error if any part of the call
didn't work.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
My plan is to use an AF_INET listener on systems that support only IPv4,
and an AF_INET6 listener on systems that can support IPv6. Incoming
IPv4 packets will be posted to an AF_INET6 listener with a mapped IPv4
address.
Max Matveev <makc@sgi.com> says:
Creating a single listener can be dangerous - if net.ipv6.bindv6only
is enabled then it's possible to create another listener in v4
namespace on the same port and steal the traffic from the "unifed"
listener. You need to disable V6ONLY explicitly via a sockopt to stop
that.
Set appropriate socket option on RPC server listener sockets to prevent
this.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Teach svc_create_xprt() to use the correct ANY address for AF_INET6 based
RPC services.
No caller uses AF_INET6 yet.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Introduce and initialize an address family field in the svc_serv structure.
This field will determine what family to use for the service's listener
sockets and what families are advertised via the local rpcbind daemon.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Vegard Nossum reported
----------------------
> I noticed that something weird is going on with /proc/sys/sunrpc/transports.
> This file is generated in net/sunrpc/sysctl.c, function proc_do_xprt(). When
> I "cat" this file, I get the expected output:
> $ cat /proc/sys/sunrpc/transports
> tcp 1048576
> udp 32768
> But I think that it does not check the length of the buffer supplied by
> userspace to read(). With my original program, I found that the stack was
> being overwritten by the characters above, even when the length given to
> read() was just 1.
David Wagner added (among other things) that copy_to_user could be
probably used here.
Ingo Oeser suggested to use simple_read_from_buffer() here.
The conclusion is that proc_do_xprt doesn't check for userside buffer
size indeed so fix this by using Ingo's suggestion.
Reported-by: Vegard Nossum <vegard.nossum@gmail.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
CC: Ingo Oeser <ioe-lkml@rameria.de>
Cc: Neil Brown <neilb@suse.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Greg Banks <gnb@sgi.com>
Cc: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
RDMA_READ completions are kept on a separate queue from the general
I/O request queue. Since a separate lock is used to protect the RDMA_READ
completion queue, a race exists between the dto_tasklet and the
svc_rdma_recvfrom thread where the dto_tasklet sets the XPT_DATA
bit and adds I/O to the read-completion queue. Concurrently, the
recvfrom thread checks the generic queue, finds it empty and resets
the XPT_DATA bit. A subsequent svc_xprt_enqueue will fail to enqueue
the transport for I/O and cause the transport to "stall".
The fix is to protect both lists with the same lock and set the XPT_DATA
bit with this lock held.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Kmem cache passed to constructor is only needed for constructors that are
themselves multiplexeres. Nobody uses this "feature", nor does anybody uses
passed kmem cache in non-trivial way, so pass only pointer to object.
Non-trivial places are:
arch/powerpc/mm/init_64.c
arch/powerpc/mm/hugetlbpage.c
This is flag day, yes.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Jon Tollefson <kniht@linux.vnet.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Matt Mackall <mpm@selenic.com>
[akpm@linux-foundation.org: fix arch/powerpc/mm/hugetlbpage.c]
[akpm@linux-foundation.org: fix mm/slab.c]
[akpm@linux-foundation.org: fix ubifs]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add per-device dma_mapping_ops support for CONFIG_X86_64 as POWER
architecture does:
This enables us to cleanly fix the Calgary IOMMU issue that some devices
are not behind the IOMMU (http://lkml.org/lkml/2008/5/8/423).
I think that per-device dma_mapping_ops support would be also helpful for
KVM people to support PCI passthrough but Andi thinks that this makes it
difficult to support the PCI passthrough (see the above thread). So I
CC'ed this to KVM camp. Comments are appreciated.
A pointer to dma_mapping_ops to struct dev_archdata is added. If the
pointer is non NULL, DMA operations in asm/dma-mapping.h use it. If it's
NULL, the system-wide dma_ops pointer is used as before.
If it's useful for KVM people, I plan to implement a mechanism to register
a hook called when a new pci (or dma capable) device is created (it works
with hot plugging). It enables IOMMUs to set up an appropriate
dma_mapping_ops per device.
The major obstacle is that dma_mapping_error doesn't take a pointer to the
device unlike other DMA operations. So x86 can't have dma_mapping_ops per
device. Note all the POWER IOMMUs use the same dma_mapping_error function
so this is not a problem for POWER but x86 IOMMUs use different
dma_mapping_error functions.
The first patch adds the device argument to dma_mapping_error. The patch
is trivial but large since it touches lots of drivers and dma-mapping.h in
all the architecture.
This patch:
dma_mapping_error() doesn't take a pointer to the device unlike other DMA
operations. So we can't have dma_mapping_ops per device.
Note that POWER already has dma_mapping_ops per device but all the POWER
IOMMUs use the same dma_mapping_error function. x86 IOMMUs use device
argument.
[akpm@linux-foundation.org: fix sge]
[akpm@linux-foundation.org: fix svc_rdma]
[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: fix bnx2x]
[akpm@linux-foundation.org: fix s2io]
[akpm@linux-foundation.org: fix pasemi_mac]
[akpm@linux-foundation.org: fix sdhci]
[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: fix sparc]
[akpm@linux-foundation.org: fix ibmvscsi]
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Muli Ben-Yehuda <muli@il.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Replace previous instances of the cpumask_of_cpu_ptr* macros
with a the new (lvalue capable) generic cpumask_of_cpu().
Signed-off-by: Mike Travis <travis@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jack Steiner <steiner@sgi.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'for-2.6.27' of git://linux-nfs.org/~bfields/linux: (51 commits)
nfsd: nfs4xdr.c do-while is not a compound statement
nfsd: Use C99 initializers in fs/nfsd/nfs4xdr.c
lockd: Pass "struct sockaddr *" to new failover-by-IP function
lockd: get host reference in nlmsvc_create_block() instead of callers
lockd: minor svclock.c style fixes
lockd: eliminate duplicate nlmsvc_lookup_host call from nlmsvc_lock
lockd: eliminate duplicate nlmsvc_lookup_host call from nlmsvc_testlock
lockd: nlm_release_host() checks for NULL, caller needn't
file lock: reorder struct file_lock to save space on 64 bit builds
nfsd: take file and mnt write in nfs4_upgrade_open
nfsd: document open share bit tracking
nfsd: tabulate nfs4 xdr encoding functions
nfsd: dprint operation names
svcrdma: Change WR context get/put to use the kmem cache
svcrdma: Create a kmem cache for the WR contexts
svcrdma: Add flush_scheduled_work to module exit function
svcrdma: Limit ORD based on client's advertised IRD
svcrdma: Remove unused wait q from svcrdma_xprt structure
svcrdma: Remove unneeded spin locks from __svc_rdma_free
svcrdma: Add dma map count and WARN_ON
...
* This patch replaces the dangerous lvalue version of cpumask_of_cpu
with new cpumask_of_cpu_ptr macros. These are patterned after the
node_to_cpumask_ptr macros.
In general terms, if there is a cpumask_of_cpu_map[] then a pointer to
the cpumask_of_cpu_map[cpu] entry is used. The cpumask_of_cpu_map
is provided when there is a large NR_CPUS count, reducing
greatly the amount of code generated and stack space used for
cpumask_of_cpu(). The pointer to the cpumask_t value is needed for
calling set_cpus_allowed_ptr() to reduce the amount of stack space
needed to pass the cpumask_t value.
If there isn't a cpumask_of_cpu_map[], then a temporary variable is
declared and filled in with value from cpumask_of_cpu(cpu) as well as
a pointer variable pointing to this temporary variable. Afterwards,
the pointer is used to reference the cpumask value. The compiler
will optimize out the extra dereference through the pointer as well
as the stack space used for the pointer, resulting in identical code.
A good example of the orthogonal usages is in net/sunrpc/svc.c:
case SVC_POOL_PERCPU:
{
unsigned int cpu = m->pool_to[pidx];
cpumask_of_cpu_ptr(cpumask, cpu);
*oldmask = current->cpus_allowed;
set_cpus_allowed_ptr(current, cpumask);
return 1;
}
case SVC_POOL_PERNODE:
{
unsigned int node = m->pool_to[pidx];
node_to_cpumask_ptr(nodecpumask, node);
*oldmask = current->cpus_allowed;
set_cpus_allowed_ptr(current, nodecpumask);
return 1;
}
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Push it into those callback functions that actually need it.
Note that all the NFS operations use their own locking, so don't need the
BKL. Ditto for the rpcbind client.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Introduce a new API to register RPC services on IPv6 interfaces to allow
the NFS server and lockd to advertise on IPv6 networks.
Unlike rpcb_register(), the new rpcb_v4_register() function uses rpcbind
protocol version 4 to contact the local rpcbind daemon. The version 4
SET/UNSET procedures allow services to register address families besides
AF_INET, register at specific network interfaces, and register transport
protocols besides UDP and TCP. All of this functionality is exposed via
the new rpcb_v4_register() kernel API.
A user-space rpcbind daemon implementation that supports version 4 of the
rpcbind protocol is required in order to make use of this new API.
Note that rpcbind version 3 is sufficient to support the new rpcbind
facilities listed above, but most extant implementations use version 4.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
rpcbind version 4 registration will reuse part of rpcb_register, so just
split it out into a separate function now.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up: Callers that required a privileged source port now use
rpcb_create_local(), so we can remove the @privileged argument from
rpcb_create().
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Add rpcb_create_local() for use by rpcb_register() and upcoming IPv6
registration functions.
Ensure any errors encountered by rpcb_create_local() are properly
reported.
We can also use a statically allocated constant loopback socket address
instead of one allocated on the stack and initialized every time the
function is called.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The rpcbind versions 3 and 4 SET and UNSET procedures use the same
arguments as the GETADDR procedure.
While definitely a bug, this hasn't been a problem so far since the
kernel hasn't used version 3 or 4 SET and UNSET. But this will change
in just a moment.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If another task is busy in rpcb_getport_async number, it is more efficient
to have it wake us up when it has finished instead of arbitrarily sleeping
for 5 seconds.
Also ensure that rpcb_wake_rpcbind_waiters() is called regardless of
whether or not rpcb_getport_done() gets called.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Some server vendors support the higher versions of rpcbind only for
AF_INET6. The kernel doesn't need to use v3 or v4 for AF_INET anyway,
so change the kernel's rpcbind client to query AF_INET servers over
rpcbind v2 only.
This has a few interesting benefits:
1. If the rpcbind request is going over TCP, and the server doesn't
support rpcbind versions 3 or 4, the client reduces by two the number
of ephemeral ports left in TIME_WAIT for each rpcbind request. This
will help during NFS mount storms.
2. The rpcbind interaction with servers that don't support rpcbind
versions 3 or 4 will use less network traffic. Also helpful
during mount storms.
3. We can eliminate the kernel build option that controls whether the
kernel's rpcbind client uses rpcbind version 3 and 4 for AF_INET
servers. Less complicated kernel configuration...
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Some rpcbind servers that do support rpcbind version 4 do not support
the GETVERSADDR procedure. Use GETADDR for querying rpcbind servers
via rpcbind version 4 instead of GETVERSADDR.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up: Change the version 2 procedure name to GETPORT. It's the same
procedure number as GETADDR, but version 2 implementations usually refer
to it as GETPORT.
This also now matches the procedure name used in the version 2 procedure
entry in the rpcb_next_version[] array, making it slightly less confusing.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The RPC client uses the rq_xtime field in each RPC request to determine the
round-trip time of the request. Currently, the rq_xtime field is
initialized by each transport just before it starts enqueing a request to
be sent. However, transports do not handle initializing this value
consistently; sometimes they don't initialize it at all.
To make the measurement of request round-trip time consistent for all
RPC client transport capabilities, pull rq_xtime initialization into the
RPC client's generic transport logic. Now all transports will get a
standardized RTT measure automatically, from:
xprt_transmit()
to
xprt_complete_rqst()
This makes round-trip time calculation more accurate for the TCP transport.
The socket ->sendmsg() method can return "-EAGAIN" if the socket's output
buffer is full, so the TCP transport's ->send_request() method may call
the ->sendmsg() method repeatedly until it gets all of the request's bytes
queued in the socket's buffer.
Currently, the TCP transport sets the rq_xtime field every time through
that loop so the final value is the timestamp just before the *last* call
to the underlying socket's ->sendmsg() method. After this patch, the
rq_xtime field contains a timestamp that reflects the time just before the
*first* call to ->sendmsg().
This is consequential under heavy workloads because large requests often
take multiple ->sendmsg() calls to get all the bytes of a request queued.
The TCP transport causes the request to sleep until the remote end of the
socket has received enough bytes to clear space in the socket's local
output buffer. This delay can be quite significant.
The method introduced by this patch is a more accurate measure of RTT
for stream transports, since the server can cause enough back pressure
to delay (ie increase the latency of) requests from the client.
Additionally, this patch corrects the behavior of the RDMA transport, which
entirely neglected to initialize the rq_xtime field. RPC performance
metrics for RDMA transports now display correct RPC request round trip
times.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Tom Talpey <thomas.talpey@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Try to make the comment here a little more clear and concise.
Also, this macro definition seems unnecessary.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
There used to be a print_hexl() function that used isprint(), now gone.
I don't know why NFS_NGROUPS and CA_RUN_AS_MACHINE were here.
I also don't know why another #define that's actually used was marked
"unused".
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Also, a minor comment grammar fix in the same file.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The cl_chatty flag alows us to control whether a given rpc client leaves
"server X not responding, timed out"
messages in the syslog. Such messages make sense for ordinary nfs
clients (where an unresponsive server means applications on the
mountpoint are probably hanging), but not for the callback client (which
can fail more commonly, with the only result just of disabling some
optimizations).
Previously cl_chatty was removed, do to lack of users; reinstate it, and
use it for the nfsd's callback client.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Recent changes to the RPC client's transport connect logic make connect
status values ECONNREFUSED and ECONNRESET impossible.
Clean up xprt_connect_status() to account for these changes.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up: move the logic that displays each task to its own function.
This removes indentation and makes future changes easier.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up: don't display the rpc_show_tasks column header unless there is at
least one task to display. As far as I can tell, it is safe to let the
list_for_each_entry macro decide that each list is empty.
scripts/checkpatch.pl also wants a KERN_FOO at the start of any newly added
printk() calls, so this and subsequent patches will also add KERN_INFO.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The RPC client uses a finite state machine to move RPC tasks through each
step of an RPC request. Each state is contained in a function in
net/sunrpc/clnt.c, and named call_foo.
Some of the functions named call_foo have changed over the past few years and
are no longer states in the FSM. These include: call_encode, call_header,
and call_verify. As a clean up, rename the functions that have changed.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Improve debugging messages in call_start() and call_verify() by having
them show the RPC procedure name instead of the procedure number.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Since the credentials may be allocated during the call to rpc_new_task(),
which again may be called by a memory allocator...
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The special 'ENOMEM' case that was previously flagged as non-fatal is
bogus: auth_gss always returns EAGAIN for non-fatal errors, and may in fact
return ENOMEM in the special case where xdr_buf_read_netobj runs out of
preallocated buffer space (invariably a _fatal_ error, since there is no
provision for preallocating larger buffers).
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
All errors from call_encode(), with exception of EAGAIN are fatal, so we
should immediately return instead of proceeding to xprt_transmit().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Now that rpcb_next_version has been split into an IPv4 version and an IPv6
version, we Oops when rpcb_call_async attempts to look up the IPv6-specific
RPC procedure in rpcb_next_version.
Fix the Oops simply by having rpcb_getport_async pass the correct RPC
procedure as an argument.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
It is wrong to be freeing up the rpcbind arguments if the call to
rpcb_call_async() fails, since they should already have been freed up by
rpcb_map_release().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
To return garbage_args, the accept_stat must be 0, and we must have a
verifier. So we shouldn't be resetting the write pointer as we reject
the call.
Also, we must add the two placeholder words here regardless of success
of the unwrap, to ensure the output buffer is left in a consistent state
for svcauth_gss_release().
This fixes a BUG() in svcauth_gss.c:svcauth_gss_release().
Thanks to Aime Le Rouzic for bug report, debugging help, and testing.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Tested-by: Aime Le Rouzic <aime.le-rouzic@bull.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change the WR context pool to be shared across mount points. This
reduces the RDMA transport memory footprint significantly since
idle mounts don't consume WR context memory.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Create a kmem cache to hold WR contexts. Next we will convert
the WR context get and put services to use this kmem cache.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>