With mthca, RC QPs can starve each other and even UD QPs on the same
hardware schedule queue. As a result, userspace MPI can starve
e.g. IPoIB traffic, with netdev watchdog warnings getting printed out,
and TCP connections getting stuck or failing.
Reduce the chance of this happening by using three separate hardware
schedule queues: one for userspace RC QPs, one for kernel RC QPs, and
one for all other QPs.
Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
mthca_free_qp() already has local variables to hold the QP's send_cq
and recv_cq, so we can slightly clean up the calls to mthca_cq_clean()
by using those local variables instead of expressions like
to_mcq(qp->ibqp.send_cq).
Also, by cleaning the recv_cq first, we can avoid worrying about
whether the QP is attached to an SRQ for the second call, because we
would only clean send_cq if send_cq is not equal to recv_cq, and that
means send_cq cannot have any receive completions from the QP being
destroyed.
All this work even improves the generated code a bit:
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-5 (-5)
function                                     old     new   delta
mthca_free_qp                                510     505      -5
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The garbled logic in mthca_alloc_memfree() causes it to return 0, even
if it fails to allocate all doorbell records. Fix it to return -ENOMEM
when it fails.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
RESET->RESET is an allowed QP state transition, so mthca should handle
it correctly, by just returning success without involving the firmware.
Signed-off-by: Dotan Barak <dotanb@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When clearing the ib_ah_attr parameter in to_ib_ah_attr(), use sizeof
*ib_ah_attr instead of sizeof *path.
Pointed out by Jack Morgenstein <jackm@mellanox.co.il>.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
If a QP being queried is in the RESET state, don't execute the
QUERY_QP firmware command (because it will fail).
Signed-off-by: Dotan Barak <dotanb@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This facility provides three entry points:
ilog2() Log base 2 of unsigned long
ilog2_u32() Log base 2 of u32
ilog2_u64() Log base 2 of u64
These facilities can either be used inside functions on dynamic data:
	int do_something(long q)
	{
		...;
		y = ilog2(q);
		...;
	}
Or can be used to statically initialise global variables with constant values:
unsigned n = ilog2(27);
When performing static initialisation, the compiler will report "error:
initializer element is not constant" if asked to take a log of zero or of
something not reducible to a constant. They treat negative numbers as
unsigned.
When not dealing with a constant, they fall back to using fls() which permits
them to use arch-specific log calculation instructions - such as BSR on
x86/x86_64 or SCAN on FRV - if available.
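
As a rough userspace illustration of the two paths described above (not
the kernel's actual header), the sketch below expands constants through a
chain of conditionals and sends everything else to a runtime helper. It
assumes the GCC/Clang builtins __builtin_constant_p() and __builtin_clzl();
the names my_ilog2, const_ilog2_8 and runtime_ilog2 are made up for this
example, and the constant path only covers values below 256 to keep it
short. Unlike the real facility, a zero argument here trips a runtime
assert rather than a compile-time error.

```c
#include <assert.h>
#include <limits.h>

/* Constant path: a conditional chain the compiler can fold at build time.
 * Only valid for 0 < n < 256 in this sketch. */
#define const_ilog2_8(n)		\
	((n) & 0x80 ? 7 :		\
	 (n) & 0x40 ? 6 :		\
	 (n) & 0x20 ? 5 :		\
	 (n) & 0x10 ? 4 :		\
	 (n) & 0x08 ? 3 :		\
	 (n) & 0x04 ? 2 :		\
	 (n) & 0x02 ? 1 : 0)

/* Runtime path: index of the highest set bit, via a count-leading-zeros
 * builtin (standing in for an fls()-style arch instruction). */
static inline int runtime_ilog2(unsigned long n)
{
	assert(n != 0);
	return (int)(sizeof(n) * CHAR_BIT) - 1 - __builtin_clzl(n);
}

/* Pick the foldable expansion for small constants, the helper otherwise.
 * Note: being a macro, this may evaluate its argument more than once. */
#define my_ilog2(n)						\
	(__builtin_constant_p(n) && (n) > 0 && (n) < 256 ?	\
	 const_ilog2_8(n) : runtime_ilog2(n))
```

With this sketch, my_ilog2(27) takes the constant path and yields 4, while
a value only known at run time goes through runtime_ilog2().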
[akpm@osdl.org: MMC fix]
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: David Howells <dhowells@redhat.com>
Cc: Wojtek Kaniewski <wojtekka@toxygen.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Commit b3b30f5e ("IB/mthca: Recover from catastrophic errors")
introduced some section mismatch breakage, because the error recovery
code tears down and reinitializes the device, which calls into lots of
code originally marked __devinit and __devexit from regular .text.
Fix this by getting rid of these now-incorrect section markers.
Reported by Randy Dunlap <randy.dunlap@oracle.com>.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
We discovered a problem when running IPoIB applications on multiple
CPUs on an Altix system. Many messages such as:
ib_mthca 0002:01:00.0: SQ 000014 full (19941644 head, 19941707 tail, 64 max, 0 nreq)
appear in syslog, and the driver wedges up.
Apparently this is because writes to the doorbells from different CPUs
reach the device out of order. The following patch adds mmiowb() calls
after doorbell rings to ensure the doorbell writes are ordered.
Signed-off-by: Arthur Kepner <akepner@sgi.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
If a QP has separate send and receive CQs, then the send CQ will never
have receive completions from that QP in it. So when cleaning the
send CQ, there's no need to pass in an SRQ pointer, even if the QP is
attached to an SRQ.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
An incorrect number of bits was taken for the static_rate field.
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
port_num was not being returned for unconnected QPs.
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Pass a struct ib_udata to the low-level driver's ->modify_srq() and
->modify_qp() methods, so that it can get to the device-specific data
passed in by the userspace driver.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When destroying a QP, mthca locks both the QP's send CQ and receive
CQ. However, the following scenario is perfectly valid:
QP_a: send_cq == CQ_x, recv_cq == CQ_y
QP_b: send_cq == CQ_y, recv_cq == CQ_x
The old mthca code simply locked send_cq and then recv_cq, which in
this case could lead to an AB-BA deadlock if QP_a and QP_b were
destroyed simultaneously.
We can fix this by changing the locking code to lock the CQ with the
lower CQ number first, which will create a consistent lock ordering.
Also, the second CQ is locked with spin_lock_nested() to tell lockdep
that we know what we're doing with the lock nesting.
This bug was found by lockdep.
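
The ordering rule can be illustrated in userspace with pthread mutexes
standing in for the CQ spinlocks. This is only a sketch: struct cq,
lock_cqs() and unlock_cqs() are names invented for the example, and there
is no lockdep annotation equivalent here.

```c
#include <pthread.h>

struct cq {
	unsigned int	cqn;	/* CQ number, assumed unique per CQ */
	pthread_mutex_t	lock;
};

/* Take both CQ locks, always locking the lower-numbered CQ first so that
 * every caller agrees on the order regardless of send/recv roles. */
static void lock_cqs(struct cq *send_cq, struct cq *recv_cq)
{
	if (send_cq == recv_cq) {
		pthread_mutex_lock(&send_cq->lock);
	} else if (send_cq->cqn < recv_cq->cqn) {
		pthread_mutex_lock(&send_cq->lock);
		pthread_mutex_lock(&recv_cq->lock);
	} else {
		pthread_mutex_lock(&recv_cq->lock);
		pthread_mutex_lock(&send_cq->lock);
	}
}

/* Release in the reverse order of acquisition. */
static void unlock_cqs(struct cq *send_cq, struct cq *recv_cq)
{
	if (send_cq == recv_cq) {
		pthread_mutex_unlock(&send_cq->lock);
	} else if (send_cq->cqn < recv_cq->cqn) {
		pthread_mutex_unlock(&recv_cq->lock);
		pthread_mutex_unlock(&send_cq->lock);
	} else {
		pthread_mutex_unlock(&send_cq->lock);
		pthread_mutex_unlock(&recv_cq->lock);
	}
}
```

With QP_a and QP_b from the scenario above, both destroy paths now take
the lock of whichever of CQ_x and CQ_y has the lower number first.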
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The fence bit needs to be set in the doorbell too, not just the WQE.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
After recent changes, mthca_wq_init does not actually initialize the WQ as it
used to - it simply resets all index fields to their initial values. So,
let's rename it to mthca_wq_reset.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Cc: Roland Dreier <rolandd@cisco.com>
Acked-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
mthca: initialize send and receive queue locks separately
lockdep identifies a lock by the call site of its initialization. By
initializing the send and receive queue locks in mthca_wq_init() we confuse
lockdep. It warns that the ordered acquisition of both locks in
mthca_modify_qp() is a recursive acquisition of one lock:
=============================================
[ INFO: possible recursive locking detected ]
---------------------------------------------
modprobe/1192 is trying to acquire lock:
(&wq->lock){....}, at: [<f892b4db>] mthca_modify_qp+0x60/0xa7b [ib_mthca]
but task is already holding lock:
(&wq->lock){....}, at: [<f892b4ce>] mthca_modify_qp+0x53/0xa7b [ib_mthca]
Initializing the locks separately in mthca_alloc_qp_common() stops the
warning and will let lockdep enforce proper ordering on paths that acquire
both locks.
Signed-off-by: Zach Brown <zach.brown@oracle.com>
Cc: Roland Dreier <rolandd@cisco.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Documentation/infiniband/core_locking.txt says:
All of the methods in struct ib_device exported by a low-level
driver must be fully reentrant. The low-level driver is required to
perform all synchronization necessary to maintain consistency, even
if multiple function calls using the same object are run
simultaneously.
However, mthca's modify_qp, modify_srq and resize_cq methods are
currently not reentrant. Add a mutex to the QP, SRQ and CQ structures
so that these calls can be properly serialized.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Some error paths after the mthca_alloc_mailbox() call in mthca_modify_qp()
just do a "return -EINVAL" without freeing the mailbox. Convert these
returns to "goto out" to avoid leaking the mailbox storage.
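
The pattern looks roughly like the userspace sketch below, where malloc()
stands in for the mailbox allocator. All names here (alloc_mailbox,
free_mailbox, modify_thing, live_mailboxes) are hypothetical and exist
only to make a leak observable; this is not the driver code.

```c
#include <errno.h>
#include <stdlib.h>

static int live_mailboxes;	/* allocations not yet freed */

static void *alloc_mailbox(void)
{
	void *mb = malloc(256);

	if (mb)
		++live_mailboxes;
	return mb;
}

static void free_mailbox(void *mb)
{
	if (mb) {
		free(mb);
		--live_mailboxes;
	}
}

static int modify_thing(int port, int max_port)
{
	void *mailbox;
	int err = 0;

	mailbox = alloc_mailbox();
	if (!mailbox)
		return -ENOMEM;	/* nothing to free yet, plain return is fine */

	if (port < 1 || port > max_port) {
		err = -EINVAL;
		goto out;	/* not "return -EINVAL": the mailbox must be freed */
	}

	/* ... fill in the mailbox and issue the command here ... */

out:
	free_mailbox(mailbox);
	return err;
}
```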
Signed-off-by: Roland Dreier <rolandd@cisco.com>
If we post a list of length 256 exactly, nreq in doorbell gets set to
256 which is wrong: it should be encoded by 0. This is because we
only zero it out on the next WR, which may not be there. The solution
is to ring the doorbell after posting a WQE, not before posting the
next one.
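
A small simulation of the "ring after posting" approach, as a sketch only.
It assumes the nreq field is 8 bits wide (so a full batch of 256 is
encoded as 0); MAX_WQES_PER_DB, struct db_log, ring_doorbell() and
post_list() are names invented for the example.

```c
#include <assert.h>

#define MAX_WQES_PER_DB	256	/* assumed batch limit for this sketch */

struct db_log {
	int rings;	/* number of doorbells rung so far */
	int last_nreq;	/* raw nreq field of the most recent doorbell */
};

static void ring_doorbell(struct db_log *log, int nreq)
{
	assert(nreq >= 1 && nreq <= MAX_WQES_PER_DB);
	log->rings++;
	log->last_nreq = nreq & 0xff;	/* 256 wraps to 0 in an 8-bit field */
}

static void post_list(struct db_log *log, int nwr)
{
	int nreq = 0;
	int i;

	for (i = 0; i < nwr; ++i) {
		/* ... build WQE i here ... */
		++nreq;

		/* Ring as soon as the batch is full, rather than waiting
		 * for a next WR that may never come. */
		if (nreq == MAX_WQES_PER_DB) {
			ring_doorbell(log, nreq);
			nreq = 0;
		}
	}

	if (nreq)
		ring_doorbell(log, nreq);
}
```

Posting exactly 256 requests rings one doorbell with the field set to 0;
posting 300 rings two, the second with nreq = 44.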
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix races in destroying various objects. If a destroy routine
waits for an object to become free by doing
wait_event(obj->wait, !atomic_read(&obj->refcount));
/* now clean up and destroy the object */
and another place drops a reference to the object by doing
if (atomic_dec_and_test(&obj->refcount))
wake_up(&obj->wait);
then this is susceptible to a race where the wait_event() and final
freeing of the object occur between the atomic_dec_and_test() and the
wake_up(). And this is a use-after-free, since wake_up() will be
called on part of the already-freed object.
Fix this in mthca by replacing the atomic_t refcounts with plain old
integers protected by a spinlock. This makes it possible to do the
decrement of the reference count and the wake_up() so that it appears
as a single atomic operation to the code waiting on the wait queue.
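
A userspace analogue of this scheme, with a pthread mutex and condition
variable standing in for the spinlock and wait queue. This is a sketch
under those assumptions, not the driver code; struct obj and the obj_*()
helpers are hypothetical names.

```c
#include <pthread.h>
#include <stddef.h>

struct obj {
	pthread_mutex_t	lock;
	pthread_cond_t	wait;
	int		refcount;	/* plain int, only touched under lock */
};

static void obj_init(struct obj *o)
{
	pthread_mutex_init(&o->lock, NULL);
	pthread_cond_init(&o->wait, NULL);
	o->refcount = 1;	/* initial reference, held by the owner */
}

static void obj_get(struct obj *o)
{
	pthread_mutex_lock(&o->lock);
	++o->refcount;
	pthread_mutex_unlock(&o->lock);
}

/* Drop a reference. The decrement and the wake-up both happen under the
 * lock, so the waiter cannot see zero and free the object in between. */
static void obj_put(struct obj *o)
{
	pthread_mutex_lock(&o->lock);
	if (--o->refcount == 0)
		pthread_cond_broadcast(&o->wait);
	pthread_mutex_unlock(&o->lock);
}

/* Drop the owner's reference and wait until all others are gone.
 * On return it is safe to clean up and destroy the object. */
static void obj_wait_unused(struct obj *o)
{
	pthread_mutex_lock(&o->lock);
	--o->refcount;
	while (o->refcount != 0)
		pthread_cond_wait(&o->wait, &o->lock);
	pthread_mutex_unlock(&o->lock);
}
```

Because the waiter has to take the same lock to observe the count, it can
only proceed once the thread dropping the last reference has finished its
wake-up and released the lock.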
While touching this code, also simplify mthca_cq_clean(): the CQ being
cleaned cannot go away, because it still has a QP attached to it. So
there's no reason to be paranoid and look up the CQ by number; it's
perfectly safe to use the pointer that the callers already have.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Push translation of static rate to HCA format into low-level drivers,
where it belongs. For static rate encoding, use encoding of rate
field from IB standard PathRecord, with addition of value 0, for
backwards compatibility with current usage. The changes are:
- Add enum ib_rate to midlayer includes.
- Get rid of static rate translation in IPoIB; just use static rate
directly from Path and MulticastGroup records.
- Update mthca driver to translate absolute static rate into the
format used by hardware. This also fixes mthca's static rate
handling for HCAs that are capable of 4X DDR.
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Quite a few cleanup functions in mthca were marked as __devexit.
However, they could also be called from error paths during
initialization, so they cannot be marked that way. Just delete all of
the incorrect annotations.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
If the call to mthca_MODIFY_QP() failed, then mthca_modify_qp() would
still do some things it shouldn't, such as store away attributes for
special QPs. Fix this, and simplify the code, by simply jumping to
the exit path if mthca_MODIFY_QP() fails.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
mthca_alloc_sqp() needs to set qp->transport before calling
mthca_set_qp_size(), since the value is used there.
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add a check that the modify QP parameters sgid_index and path_mtu are
valid, since they might come from userspace.
Signed-off-by: Dotan Barak <dotanb@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Check that the alternate P_Key index is in range when setting the
alternate path for a QP. Also make a cosmetic touch up to the debug
message printed when the main P_Key index is out of range.
Signed-off-by: Dotan Barak <dotanb@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add support for IB_SEND_FENCE flag in post_send methods.
Signed-off-by: Dotan Barak <dotanb@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Use ib_modify_qp_is_ok() in mthca, and delete the big table of
attributes for queue pair state transitions.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add low-level driver support to ib_mthca so that consumers can request
a "send queue drained" event be generated when a transition to the SQD
state completes.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The function mthca_free_err_wqe() can never fail, so get rid of its
return value. That means handle_error_cqe() doesn't have to check
what mthca_free_err_wqe() returns, which means it can't fail either
and doesn't have to return anything either. All this results in
simpler source code and a slight object code improvement:
add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-10 (-10)
function                                     old     new   delta
mthca_free_err_wqe                            83      81      -2
mthca_poll_cq                               1758    1750      -8
Signed-off-by: Roland Dreier <rolandd@cisco.com>
build_mlx_header() was using sqp->ud_header.grh_present before it was
initialized by mthca_read_ah(). Furthermore, header->grh_present is
set by ib_ud_header_init, so there's no need to set it again in
mthca_read_ah().
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add code to modify QP operation to handle setting alternate paths for
connected QPs.
Signed-off-by: Dotan Barak <dotanb@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
PKEY_INDEX is not a legal parameter in the RTR->RTS transition.
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fixes to SQEr->RTS transition in modify_qp:
1. The flag IB_QP_ACCESS_FLAGS is optional for UC qps
2. The SQEr state is not supported for RC qps
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix a case where copying max_inline_data from a successful create_qp
capabilities output to create_qp input could cause EINVAL error:
mthca_set_qp_size must check max_inline_data directly against
max_desc_sz; checking qp->sq.max_gs is wrong since max_inline_data
depends on the qp type and does not involve max_sg.
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Modify_qp should check that the physical port number provided
is a legal value.
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
sae and sre bits should only be set when setting sra_max. Further, in
the old code, if the caller specifies max_rd_atomic = 0, the sre and
sae bits are still set, with the result that the QP ends up with
max_rd_atomic = 1 in effect.
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch corrects some corner cases in managing the RAE/RRE bits in
the mthca qp context. These bits need to be zero if the user requests
max_dest_rd_atomic of zero. The bits need to be restored to the value
implied by the qp access flags attribute in a previous (or the
current) modify-qp command if the dest_rd_atomic variable is changed
to non-zero.
In the current implementation, the following scenario will not work:
RESET-to-INIT set QP access flags to all disabled (zeroes)
INIT-to-RTR set max_dest_rd_atomic=10, AND
set qp_access_flags = IB_ACCESS_REMOTE_READ | IB_ACCESS_REMOTE_ATOMIC
The current code will incorrectly take the access-flags value set in
the RESET-to-INIT transition.
We can simplify, and correct, this IB_QP_ACCESS_FLAGS handling: it is
always safe to set qp access flags in the firmware command if either
of IB_QP_MAX_DEST_RD_ATOMIC or IB_QP_ACCESS_FLAGS is set, so let's
just set it to the correct value, always.
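
The rule can be sketched as a pure function of the two inputs. The flag
values and names below (ACC_*, HW_*, hw_access_bits) are hypothetical and
chosen only for this example; the caller is assumed to pass the most
recent access flags and max_dest_rd_atomic, taken from the current command
when present and from the stored QP attributes otherwise.

```c
/* Hypothetical flag values, for illustration only. */
enum {
	ACC_REMOTE_WRITE	= 1 << 0,
	ACC_REMOTE_READ		= 1 << 1,
	ACC_REMOTE_ATOMIC	= 1 << 2,
};

enum {
	HW_RWE	= 1 << 0,	/* remote write enable */
	HW_RRE	= 1 << 1,	/* remote read enable */
	HW_RAE	= 1 << 2,	/* remote atomic enable */
};

static unsigned int hw_access_bits(unsigned int access_flags,
				   unsigned int dest_rd_atomic)
{
	unsigned int hw = 0;

	/* Remote writes do not consume responder resources. */
	if (access_flags & ACC_REMOTE_WRITE)
		hw |= HW_RWE;

	/* Reads and atomics stay disabled while dest_rd_atomic is zero and
	 * follow the access flags as soon as it becomes non-zero. */
	if (dest_rd_atomic) {
		if (access_flags & ACC_REMOTE_READ)
			hw |= HW_RRE;
		if (access_flags & ACC_REMOTE_ATOMIC)
			hw |= HW_RAE;
	}

	return hw;
}
```

Recomputing the bits from both values on every relevant modify-qp call
avoids having to track which transition last touched them.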
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Only change the driver's copy of the QP attributes in modify QP after
checking the modify QP command completed successfully.
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix thinko in rd_atomic calculation: ffs(x) - 1 does not find the next
power of 2 -- it should be fls(x - 1).
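
To see the difference, here is a small sketch with portable stand-ins for
the two helpers. my_ffs(), my_fls() and log2_roundup() are names made up
for this example; like the kernel helpers, both bit searches are 1-based
and return 0 for an input of 0.

```c
/* Position of the lowest set bit (1-based), 0 if x == 0. */
static int my_ffs(unsigned int x)
{
	int pos = 1;

	if (!x)
		return 0;
	while (!(x & 1)) {
		x >>= 1;
		++pos;
	}
	return pos;
}

/* Position of the highest set bit (1-based), 0 if x == 0. */
static int my_fls(unsigned int x)
{
	int pos = 0;

	while (x) {
		x >>= 1;
		++pos;
	}
	return pos;
}

/* log2 of the smallest power of two that is >= x, for x >= 1. */
static int log2_roundup(unsigned int x)
{
	return my_fls(x - 1);
}
```

For x = 6, my_ffs(6) - 1 is 1 (that is, 2, which is smaller than 6),
whereas log2_roundup(6) is 3 (that is, 8). The two only agree when x is
already a power of 2.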
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add limit checking on rd_atomic and dest_rd_atomic attributes:
especially for max_dest_rd_atomic, a value that is larger than HCA
capability can cause RDB overflow and corruption of another QP.
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
On mem-free HCAs, when posting a long list of send requests, a
doorbell must be rung every 255 requests. Add code to handle this.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The last pointer is not updated when a QP is modified to the reset state. This
causes data corruption if WQEs are already posted on the queue.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Calculation of QP capabilities still isn't exactly right in mthca:
max_send_sge/max_recv_sge fields returned in create_qp can exceed the
hardware supported limits.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Responder resources are only required to handle RDMA reads and atomic
operations, not RDMA writes. So the driver should allow RDMA writes
even if responder resources are set to 0. This is especially
important for the UC transport -- with the old code, it was impossible
to enable RDMA writes for UC QPs.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
In Tavor mode, when posting a long list of receive work requests, a
doorbell must be rung every 256 requests. Add code to do this when
required.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>