linux

Author	SHA1	Message	Date
Ingo Molnar	6c9fcaf2ee	Merge branch 'core/rcu' into core/rcu-for-linus	2008-07-15 21:10:12 +02:00
David S. Miller	b9e4085768	netdev: Do not use TX lock to protect address lists. Now that we have a specific lock to protect the network device unicast and multicast lists, remove extraneous grabs of the TX lock in cases where the code only needs address list protection. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-15 00:15:08 -07:00
David S. Miller	e308a5d806	netdev: Add netdev->addr_list_lock protection. Add netif_addr_{lock,unlock}{,_bh}() helpers. Use them to protect operations that operate on or read the network device unicast and multicast address lists. Also use them in cases where the code simply wants to block calls into the driver's ->set_rx_mode() and ->set_multicast_list() methods. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-15 00:13:44 -07:00
Eli Cohen	f507d28bff	IB/mlx4: Use kzalloc() for new QPs so flags are initialized to 0 Current code uses kmalloc() and then just does a bitwise OR operation on qp->flags in create_qp_common(), which means that qp->flags may potentially have some unintended bits set. This patch uses kzalloc() and avoids further explicit clearing of structure members, which also shrinks the code: add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-65 (-65) function old new delta create_qp_common 2024 1959 -65 Signed-off-by: Eli Cohen <eli@mellanox.co.il> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-07-14 23:48:53 -07:00
Or Gerlitz	de910bd921	RDMA/cma: Simplify locking needed for serialization of callbacks The RDMA CM has some logic in place to make sure that callbacks on a given CM ID are delivered to the consumer in a serialized manner. Specifically it has code to protect against a device removal racing with a running callback function. This patch simplifies this logic by using a mutex per ID instead of a wait queue and atomic variable. This means that cma_disable_remove() now is more properly named to cma_disable_callback(), and cma_enable_remove() can now be removed because it just would become a trivial wrapper around mutex_unlock(). Signed-off-by: Or Gerlitz <ogerlitz@voltaire.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-07-14 23:48:53 -07:00
Or Gerlitz	64c5e613b9	RDMA/addr: Keep pointer to netdevice in struct rdma_dev_addr Keep a pointer to the local (src) netdevice in struct rdma_dev_addr, and copy it in as part of rdma_copy_addr(). Use rdma_translate_ip() in cma_new_conn_id() to reduce some code duplication and also make sure the src_dev member gets set. In a high-availability configuration the netdevice pointer can be used by the RDMA CM to align RDMA sessions to use the same links as the IP stack does under fail-over and route change cases. Signed-off-by: Or Gerlitz <ogerlitz@voltaire.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-07-14 23:48:53 -07:00
Steve Wise	4ab928f692	RDMA/cxgb3: Fixes for zero STag Handling the zero STag in receive work request requires some extra logic in the driver: - Only set the QP_PRIV bit for kernel mode QPs. - Add a zero STag build function for recv wrs. The uP needs a PBL allocated and passed down in the recv WR so it can construct a HW PBL for the zero STag S/G entries. Note: we need to place a few restrictions on zero STag usage because of this: 1) all SGEs in a recv WR must either be zero STag or not. No mixing. 2) an individual SGE length cannot exceed 128MB for a zero-stag SGE. This should be OK since it's not really practical to allocate such a large chunk of pinned contiguous DMA mapped memory. - Add an optimized non-zero-STag recv wr format for kernel users. This is needed to optimize both zero and non-zero STag cracking in the recv path for kernel users. - Remove the iwch_ prefix from the static build functions. - Bump required FW version. Signed-off-by: Steve Wise <swise@opengridcomputing.com>	2008-07-14 23:48:53 -07:00
Steve Wise	96f15c0353	RDMA/core: Add local DMA L_Key support - Change the IB_DEVICE_ZERO_STAG flag to the transport-neutral name IB_DEVICE_LOCAL_DMA_LKEY, which is used by iWARP RNICs to indicate 0 STag support and IB HCAs to indicate reserved L_Key support. - Add a u32 local_dma_lkey member to struct ib_device. Drivers fill this in with the appropriate local DMA L_Key (if they support it). - Fix up the drivers using this flag. Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-07-14 23:48:53 -07:00
Roland Dreier	aed012279d	IB/mthca: Fix check of max_send_sge for special QPs The MLX transport requires two extra gather entries for sends (one for the header and one for the checksum at the end, as the comment says). However the code checked that max_recv_sge was not too big, instead of checking max_send_sge as it should have. Fix the code to check the correct condition. Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-07-14 23:48:52 -07:00
Roland Dreier	c036925ac0	IB/mthca: Use round_jiffies() for catastrophic error polling timer Exactly when the catastrophic error polling timer function runs is not important, so use round_jiffies() to save unnecessary wakeups. Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-07-14 23:48:52 -07:00
Roland Dreier	4522e08ced	IB/mthca: Remove "stop" flag for catastrophic error polling timer Since we use del_timer_sync() anyway, there's no need for an additional flag to tell the timer not to rearm. Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-07-14 23:48:52 -07:00
Eli Cohen	bc3a290b51	IPoIB: Double default RX/TX ring sizes Increase IPoIB ring sizes to twice their original sizes (RX: 128->256, TX: 64->128) to act as a shock absorber for high traffic peaks. With the current settings, we have seen cases that there are many calls to netif_stop_queue(), which causes degradation in throughput. Also, larger receive buffer sizes help IPoIB in CM mode to avoid experiencing RNR NAK conditions due to insufficient receive buffers at the SRQ. Signed-off-by: Eli Cohen <eli@mellanox.co.il> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-07-14 23:48:52 -07:00
Eli Cohen	e112373fd6	IPoIB/cm: Reduce connected mode TX object size Since IPoIB connected mode does not NETIF_F_SG, we only have one DMA mapping per send, so we don't need a mapping[] array. Define a new struct with a single u64 mapping member and use it for the CM tx_ring. Signed-off-by: Eli Cohen <eli@mellanox.co.il> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-07-14 23:48:52 -07:00
Ralph Campbell	df8666198d	IB/ipath: Use IEEE OUI for vendor_id reported by ibv_query_device() The IB spe. for SubnGet(NodeInfo) and query HCA says that the vendor ID field should be the IEEE OUI assigned to the vendor. The ipath driver was returning the PCI vendor ID instead. This will affect applications which call ibv_query_device(). The old value was 0x001fc1 or 0x001077, the new value is 0x001175. The vendor ID doesn't appear to be exported via /sys so that should reduce possible compatibility issues. I'm only aware of Open MPI as a major application which depends on this change, and they have made necessary adjustments. Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-07-14 23:48:52 -07:00
Eli Cohen	bd3606715e	IPoIB: Use dev_set_mtu() to change mtu When the driver sets the MTU of the net device outside of its change_mtu method, it should make use of dev_set_mtu() instead of directly setting the mtu field of struct netdevice. Otherwise functions registered to be called upon MTU change will not get called (this is done through call_netdevice_notifiers() in dev_set_mtu()). Signed-off-by: Eli Cohen <eli@mellanox.co.il> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-07-14 23:48:51 -07:00
Eli Cohen	c8c2afe360	IPoIB: Use rtnl lock/unlock when changing device flags Use of this lock is required to synchronize changes to the netdvice's data structs. Also move the call to ipoib_flush_paths() after the modification of the netdevice flags in set_mode(). Signed-off-by: Eli Cohen <eli@mellanox.co.il> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-07-14 23:48:51 -07:00
Roland Dreier	9eae554c17	IPoIB: Get rid of ipoib_mcast_detach() wrapper ipoib_mcast_detach() does nothing except call ib_detach_mcast(), so just use the core API in the one place that does a multicast group detach. add/remove: 0/1 grow/shrink: 0/1 up/down: 0/-105 (-105) function old new delta ipoib_mcast_leave 357 319 -38 ipoib_mcast_detach 67 - -67 Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-07-14 23:48:50 -07:00
Eli Cohen	d0de13622d	IPoIB: Only set Q_Key once: after joining broadcast group The current code will set the Q_Key for any join of a non-sendonly multicast group. The operation involves a modify QP operation, which is fairly heavyweight, and is only really required after the join of the broadcast group. Fix this by adding a parameter to ipoib_mcast_attach() to control when the Q_Key is set. Signed-off-by: Eli Cohen <eli@mellanox.co.il> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-07-14 23:48:50 -07:00
Eli Cohen	5892eff91a	IPoIB: Remove priv->mcast_mutex No need for a mutex around calls to ib_attach_mcast/ib_detach_mcast since these operations are synchronized at the HW driver layer. Signed-off-by: Eli Cohen <eli@mellanox.co.il> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-07-14 23:48:50 -07:00
Eli Cohen	c03d4731b5	IPoIB: Remove unused IPOIB_MCAST_STARTED code The IPOIB_MCAST_STARTED flag is not used at all since commit `b3e2749b` ("IPoIB: Don't drop multicast sends when they can be queued"), so remove it. Signed-off-by: Eli Cohen <eli@mellanox.co.il> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-07-14 23:48:50 -07:00
Steve Wise	70fe1796a5	RDMA/cxgb3: Set rkey field for new memory windows in iwch_alloc_mw() Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-07-14 23:48:49 -07:00
Roland Dreier	8294f29767	RDMA/nes: Get rid of ring_doorbell parameter of nes_post_cqp_request() Every caller of nes_post_cqp_request() passed it NES_CQP_REQUEST_RING_DOORBELL, so just remove that parameter and always ring the doorbell. Signed-off-by: Roland Dreier <rolandd@cisco.com> Acked-by: Faisal Latif <flatif@neteffect.com>	2008-07-14 23:48:49 -07:00
Jon Mason	52c8084b74	RDMA/cxgb3: Propagate HW page size capabilities cxgb3 does not currently report the page size capabilities, and incorrectly reports them internally. This version changes the bit-shifting to a static value (per Steve's request). Signed-off-by: Jon Mason <jon@opengridcomputing.com> Acked-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-07-14 23:48:49 -07:00
Roland Dreier	1ff66e8c1f	RDMA/nes: Encapsulate logic nes_put_cqp_request() The iw_nes driver repeats the logic if (atomic_dec_and_test(&cqp_request->refcount)) { if (cqp_request->dynamic) { kfree(cqp_request); } else { spin_lock_irqsave(&nesdev->cqp.lock, flags); list_add_tail(&cqp_request->list, &nesdev->cqp_avail_reqs); spin_unlock_irqrestore(&nesdev->cqp.lock, flags); } } over and over. Wrap this up in functions nes_free_cqp_request() and nes_put_cqp_request() to simplify such code. In addition to making the source smaller and more readable, this shrinks the compiled code quite a bit: add/remove: 2/0 grow/shrink: 0/13 up/down: 164/-1692 (-1528) function old new delta nes_free_cqp_request - 147 +147 nes_put_cqp_request - 17 +17 nes_modify_qp 2316 2293 -23 nes_hw_modify_qp 737 657 -80 nes_dereg_mr 945 860 -85 flush_wqes 501 416 -85 nes_manage_apbvt 648 560 -88 nes_reg_mr 1117 1026 -91 nes_cqp_ce_handler 927 769 -158 nes_alloc_mw 1052 884 -168 nes_create_qp 5314 5141 -173 nes_alloc_fmr 2212 2035 -177 nes_destroy_cq 1097 918 -179 nes_create_cq 2787 2598 -189 nes_dealloc_mw 762 566 -196 Signed-off-by: Roland Dreier <rolandd@cisco.com> Acked-by: Faisal Latif <flatif@neteffect.com>	2008-07-14 23:48:49 -07:00
Moni Shoua	ee1e2c82c2	IPoIB: Refresh paths instead of flushing them on SM change events The patch tries to solve the problem of device going down and paths being flushed on an SM change event. The method is to mark the paths as candidates for refresh (by setting the new valid flag to 0), and wait for an ARP probe a new path record query. The solution requires a different and less intrusive handling of SM change event. For that, the second argument of the flush function changes its meaning from a boolean flag to a level. In most cases, SM failover doesn't cause LID change so traffic won't stop. In the rare cases of LID change, the remote host (the one that hadn't changed its LID) will lose connectivity until paths are refreshed. This is no worse than the current state. In fact, preventing the device from going down saves packets that otherwise would be lost. Signed-off-by: Moni Levy <monil@voltaire.com> Signed-off-by: Moni Shoua <monis@voltaire.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-07-14 23:48:49 -07:00
Joachim Fenkes	038919f296	IB/ehca: Make device table externally visible This gives ehca an autogenerated modalias and therefore enables automatic loading. Signed-off-by: Joachim Fenkes <fenkes@de.ibm.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-07-14 23:48:49 -07:00
Vladimir Sokolovsky	af40da894e	IPoIB: add LRO support Add "ipoib_use_lro" module parameter to enable LRO and an "ipoib_lro_max_aggr" module parameter to set the max number of packets to be aggregated. Make LRO controllable and LRO statistics accessible through ethtool. Signed-off-by: Vladimir Sokolovsky <vlad@mellanox.co.il> Signed-off-by: Eli Cohen <eli@mellanox.co.il> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-07-14 23:48:48 -07:00
Ron Livne	1240673405	IPoIB: Use multicast loopback blocking if available Set IB_QP_CREATE_BLOCK_MULTICAST_LOOPBACK for IPoIB's UD QPs if supported by the underlying device. This creates an improvement of up to 39% in bandwidth when sending multicast packets with IPoIB, and an improvment of 12% in cpu usage. Signed-off-by: Ron Livne <ronli@voltaire.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-07-14 23:48:48 -07:00
Ron Livne	521e575b9a	IB/mlx4: Add support for blocking multicast loopback packets Add support for handling the IB_QP_CREATE_MULTICAST_BLOCK_LOOPBACK flag by using the per-multicast group loopback blocking feature of mlx4 hardware. Signed-off-by: Ron Livne <ronli@voltaire.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-07-14 23:48:48 -07:00
Steve Wise	14cc180f7b	RDMA/cxgb3: Add support for protocol statistics - Add a new rdma ctl command called RDMA_GET_MIB to the cxgb3 low level driver to obtain the protocol mib from the rnic hardware. - Add new iw_cxgb3 provider method to get the MIB from the low level driver. Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-07-14 23:48:48 -07:00
Steve Wise	7f624d023b	RDMA/core: Add iWARP protocol statistics attributes in sysfs This patch adds a sysfs attribute group called "proto_stats" under /sys/class/infiniband/$device/ and populates this group with protocol statistics if they exist for a given device. Currently, only iWARP stats are defined, but the code is designed to allow InfiniBand protocol stats if they become available. These stats are per-device and more importantly -not- per port. Details: - Add union rdma_protocol_stats in ib_verbs.h. This union allows defining transport-specific stats. Currently only iwarp stats are defined. - Add struct iw_protocol_stats to define the current set of iwarp protocol stats. - Add new ib_device method called get_proto_stats() to return protocol statistics. - Add logic in core/sysfs.c to create iwarp protocol stats attributes if the device is an RNIC and has a get_proto_stats() method. Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-07-14 23:48:48 -07:00
Roland Dreier	a7d834c4bc	IPoIB/cm: Fix racy use of receive WR/SGL in ipoib_cm_post_receive_nonsrq() For devices that don't support SRQs, ipoib_cm_post_receive_nonsrq() is called from both ipoib_cm_handle_rx_wc() and ipoib_cm_nonsrq_init_rx(), and these two callers are not synchronized against each other. However, ipoib_cm_post_receive_nonsrq() always reuses the same receive work request and scatter list structures, so multiple callers can end up stepping on each other, which leads to posting garbled work requests. Fix this by having the caller pass in the ib_recv_wr and ib_sge structures to use, and allocating new local structures in ipoib_cm_nonsrq_init_rx(). Based on a patch by Pradeep Satyanarayana <pradeep@us.ibm.com> and David Wilder <dwilder@us.ibm.com>, with debugging help from Hoang-Nam Nguyen <hnguyen@de.ibm.com>. Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-07-14 23:48:47 -07:00
Roland Dreier	468f2239bc	RDMA/cma: Add missing newlines to printk()s Signed-off-by: Roland Dreier <rolandd@cisco.com> Acked-by: Sean Hefty <sean.hefty@intel.com>	2008-07-14 23:48:47 -07:00
Roland Dreier	eec8845d29	RDMA/cxgb3: Remove write-only iwch_rnic_attributes fields The members struct iwch_rnic_attributes.vendor_id and .vendor_part_id are write-only, so we might as well get rid of them. Signed-off-by: Roland Dreier <rolandd@cisco.com> Acked-by: Steve Wise <swise@opengridcomputing.com>	2008-07-14 23:48:47 -07:00
Steve Wise	97d1cc8055	RDMA/cxgb3: Fix up some ib_device_attr fields - set fw_ver - set hw_ver - set max_qp_wr to something reasonable - set max_cqe to something reasonable Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-07-14 23:48:47 -07:00
Stefan Roscher	6f7bc01a73	IB/ehca: In case of lost interrupts, trigger EOI to reenable interrupts During corner case testing, we noticed that some versions of ehca do not properly transition to interrupt done in special load situations. This can be resolved by periodically triggering EOI through H_EOI, if EQEs are pending. Signed-off-by: Stefan Roscher <stefan.roscher@de.ibm.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-07-14 23:48:47 -07:00
Joachim Fenkes	3e255eac56	IB/ehca: Reject receive work requests if QP is in RESET state Signed-off-by: Joachim Fenkes <fenkes@de.ibm.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-07-14 23:48:47 -07:00
Roland Dreier	7c27f35820	IB/mlx4: Remove extra code for RESET->ERR QP state transition Commit `65adfa91` ("IB/mlx4: Fix RESET to RESET and RESET to ERROR transitions") added some extra code to handle a QP state transition from RESET to ERROR. However, the latest 1.2.1 version of the IB spec has clarified that this transition is actually not allowed, so we can remove this extra code again. Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-07-14 23:48:46 -07:00
Roland Dreier	d3809ad097	IB/mthca: Remove extra code for RESET->ERR QP state transition Commit `b18aad71` ("IB/mthca: Fix RESET to ERROR transition") added some extra code to handle a QP state transition from RESET to ERROR. However, the latest 1.2.1 version of the IB spec has clarified that this transition is actually not allowed, so we can remove this extra code again. Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-07-14 23:48:46 -07:00
Ralph Campbell	e5a5e7d59a	IB/core: Reset to error QP state transition is not allowed I was reviewing the QP state transition diagram in the IB 1.2.1 spec and the code for qp_state_table[], and noticed that the code allows a QP to be modified from IB_QPS_RESET to IB_QPS_ERR whereas the notes for figure 124 (pg 457) specifically says that this transition isn't allowed. This is a clarification from earlier versions of the IB spec, which were ambiguous in this area and suggested that the RESET to ERR transition was allowed. Fix up the qp_state_table[] to make RESET->ERR not allowed. Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-07-14 23:48:46 -07:00
Eli Cohen	6578cf3398	IB/mlx4: Pass congestion management class MADs to the HCA ConnectX HCAs support the IB_MGMT_CLASS_CONG_MGMT management class, so process MADs of this class through the MAD_IFC firmware command. Signed-off-by: Eli Cohen <eli@mellanox.co.il> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-07-14 23:48:45 -07:00
Eli Cohen	d1f2cd895f	IB/mlx4: Configure QPs' max message size based on real device capability ConnectX returns the max message size it supports through the QUERY_DEV_CAP firmware command. When modifying a QP to RTR, the max message size for the QP must be specified. This value must not exceed the value declared through QUERY_DEV_CAP. The current code ignores the max allowed size and unconditionally sets the value to 2^31. This patch sets all QPs to the max value allowed as returned from firmware. Signed-off-by: Eli Cohen <eli@mellanox.co.il> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-07-14 23:48:45 -07:00
Steve Wise	e7e5582999	RDMA/cxgb3: MEM_MGT_EXTENSIONS support - set IB_DEVICE_MEM_MGT_EXTENSIONS capability bit if fw supports it. - set max_fast_reg_page_list_len device attribute. - add iwch_alloc_fast_reg_mr function. - add iwch_alloc_fastreg_pbl - add iwch_free_fastreg_pbl - adjust the WQ depth for kernel mode work queues to account for fastreg possibly taking 2 WR slots. - add fastreg_mr work request support. - add local_inv work request support. - add send_with_inv and send_with_se_inv work request support. - removed useless duplicate enums/defines for TPT/MW/MR stuff. Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-07-14 23:48:45 -07:00
Steve Wise	00f7ec36c9	RDMA/core: Add memory management extensions support This patch adds support for the IB "base memory management extension" (BMME) and the equivalent iWARP operations (which the iWARP verbs mandates all devices must implement). The new operations are: - Allocate an ib_mr for use in fast register work requests. - Allocate/free a physical buffer lists for use in fast register work requests. This allows device drivers to allocate this memory as needed for use in posting send requests (eg via dma_alloc_coherent). - New send queue work requests: * send with remote invalidate * fast register memory region * local invalidate memory region * RDMA read with invalidate local memory region (iWARP only) Consumer interface details: - A new device capability flag IB_DEVICE_MEM_MGT_EXTENSIONS is added to indicate device support for these features. - New send work request opcodes IB_WR_FAST_REG_MR, IB_WR_LOCAL_INV, IB_WR_RDMA_READ_WITH_INV are added. - A new consumer API function, ib_alloc_mr() is added to allocate fast register memory regions. - New consumer API functions, ib_alloc_fast_reg_page_list() and ib_free_fast_reg_page_list() are added to allocate and free device-specific memory for fast registration page lists. - A new consumer API function, ib_update_fast_reg_key(), is added to allow the key portion of the R_Key and L_Key of a fast registration MR to be updated. Consumers call this if desired before posting a IB_WR_FAST_REG_MR work request. Consumers can use this as follows: - MR is allocated with ib_alloc_mr(). - Page list memory is allocated with ib_alloc_fast_reg_page_list(). - MR R_Key/L_Key "key" field is updated with ib_update_fast_reg_key(). - MR made VALID and bound to a specific page list via ib_post_send(IB_WR_FAST_REG_MR) - MR made INVALID via ib_post_send(IB_WR_LOCAL_INV), ib_post_send(IB_WR_RDMA_READ_WITH_INV) or an incoming send with invalidate operation. - MR is deallocated with ib_dereg_mr() - page lists dealloced via ib_free_fast_reg_page_list(). Applications can allocate a fast register MR once, and then can repeatedly bind the MR to different physical block lists (PBLs) via posting work requests to a send queue (SQ). For each outstanding MR-to-PBL binding in the SQ pipe, a fast_reg_page_list needs to be allocated (the fast_reg_page_list is owned by the low-level driver from the consumer posting a work request until the request completes). Thus pipelining can be achieved while still allowing device-specific page_list processing. The 32-bit fast register memory key/STag is composed of a 24-bit index and an 8-bit key. The application can change the key each time it fast registers thus allowing more control over the peer's use of the key/STag (ie it can effectively be changed each time the rkey is rebound to a page list). Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-07-14 23:48:45 -07:00
Eli Cohen	f89271da32	IPoIB: Copy small received SKBs in connected mode The connected mode implementation in the IPoIB driver has a large overhead in the way SKBs are handled in the receive flow. It usually allocates an SKB with as big as was used in the currently received SKB and moves unused fragments from the old SKB to the new one. This involves a loop on all the remaining fragments and incurs overhead on the CPU. This patch, for small SKBs, allocates an SKB just large enough to contain the received data and copies to it the data from the received SKB. The newly allocated SKB is passed to the stack and the old SKB is reposted. When running netperf, UDP small messages, without this pach I get: UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 14.4.3.178 (14.4.3.178) port 0 AF_INET Socket Message Elapsed Messages Size Size Time Okay Errors Throughput bytes bytes secs # # 10^6bits/sec 114688 128 10.00 5142034 0 526.31 114688 10.00 1130489 115.71 With this patch I get both send and receive at ~315 mbps. The reason that send performance actually slows down is as follows: When using this patch, the overhead of the CPU for handling RX packets is dramatically reduced. As a result, we do not experience RNR NAK messages from the receiver which cause the connection to be closed and reopened again; when the patch is not used, the receiver cannot handle the packets fast enough so there is less time to post new buffers and hence the mentioned RNR NACKs. So what happens is that the application thinks it posted a certain number of packets for transmission but these packets are flushed and do not really get transmitted. Since the connection gets opened and closed many times, each time netperf gets the CPU time that otherwise would have been given to IPoIB to actually transmit the packets. This can be verified when looking at the port counters -- the output of ifconfig and the oputput of netperf (this is for the case without the patch): tx packets ========== port counter: 1,543,996 ifconfig: 1,581,426 netperf: 5,142,034 rx packets ========== netperf 1,1304,089 Signed-off-by: Eli Cohen <eli@mellanox.co.il>	2008-07-14 23:48:44 -07:00
Roland Dreier	f3781d2e89	RDMA: Remove subversion $Id tags They don't get updated by git and so they're worse than useless. Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-07-14 23:48:44 -07:00
Robert P. J. Day	fd91b1bf1b	IB/ipath: Simplify code using ARRAY_SIZE() macro Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-07-14 23:48:44 -07:00
Eli Cohen	9670e55391	IB/mlx4: Optimize QP stamping The idea is that for QPs with fixed size work requests (eg selective signaling QPs), before stamping the WQE, we read the value of the DS field, which gives the effective size of the descriptor as used in the previous post. Then we stamp only that area, since the rest of the descriptor is already stamped. When initializing the send queue buffer, make sure the DS field is initialized to the max descriptor size so that the subsequent stamping will be done on the entire descriptor area. Signed-off-by: Eli Cohen <eli@mellanox.co.il> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-07-14 23:48:44 -07:00
Moni Shoua	164ba0893c	IB/sa: Fail requests made while creating new SM AH This patch solves a race that occurs after an event occurs that causes the SA query module to flush its SM address handle (AH). When SM AH becomes invalid and needs an update it is handled by the global workqueue. On the other hand this event is also handled in the IPoIB driver by queuing work in the ipoib_workqueue that does multicast joins. Although queuing is in the right order, it is done to 2 different workqueues and so there is no guarantee that the first to be queued is the first to be executed. This causes a problem because IPoIB may end up sending an request to the old SM, which will take a long time to time out (since the old SM is gone); this leads to a much longer than necessary interruption in multicast traffer. The patch sets the SA query module's SM AH to NULL when the event occurs, and until update_sm_ah() is done, any request that needs sm_ah fails with -EAGAIN return status. For consumers, the patch doesn't make things worse. Before the patch, MADs are sent to the wrong SM so the request gets lost. Consumers can be improved if they examine the return code and respond to EAGAIN properly but even without an improvement the situation is not getting worse. Signed-off-by: Moni Levy <monil@voltaire.com> Signed-off-by: Moni Shoua <monis@voltaire.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-07-14 23:48:43 -07:00
Sean Hefty	a947491709	RDMA: Fix license text The license text for several files references a third software license that was inadvertently copied in. Update the license to what was intended. This update was based on a request from HP. Signed-off-by: Sean Hefty <sean.hefty@intel.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-07-14 23:48:43 -07:00
Christophe Jaillet	929555a2ba	RDMA/nes: Remove unnecessary memset() Remove an explicit memset(..., 0, ...) of a 'listener' structure allocated with kzalloc(). Signed-off-by: Christophe Jaillet <christophe.jaillet@wanadoo.fr> Acked-by: Faisal Latif <faisal@neteffect.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-07-14 23:48:43 -07:00
Roland Dreier	969a60f9db	IB/srp: Remove use of cached P_Key/GID queries The SRP initiator is currently using ib_find_cached_pkey() and ib_get_cached_gid() in situations where the uncached ib_find_pkey() and ib_query_gid() functions serve just as well: sleeping is allowed and performance is not an issue. Since we want to eliminate the cached operations in the long term, convert SRP to use the uncached variants. Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-07-14 23:48:43 -07:00
Jonathan Corbet	2fceef397f	Merge commit 'v2.6.26' into bkl-removal	2008-07-14 15:29:34 -06:00
Mike Christie	8e9a20cee4	[SCSI] libiscsi, iscsi_tcp, ib_iser: fix setting of can_queue with old tools. This patch fixes two bugs that are related. 1. Old tools did not set can_queue/cmds_max. This patch modifies libiscsi so that when we add the host we catch this and set it to the default. 2. iscsi_tcp thought that the scsi command that was passed to the eh functions needed a iscsi_cmd_task allocated for it. It only needed a mgmt task, and now it does not matter since it all comes from the same pool and libiscsi handles this for the drivers. ib_iser had copied iscsi_tcp's code and set can_queue to its max - 1 to handle this. So this patch removes the max -1, and just sets it to the max. Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>	2008-07-12 08:22:29 -05:00
Mike Christie	913e5bf435	[SCSI] libiscsi, iser, tcp: remove recv_lock The recv lock was defined so the iscsi layer could block the recv path from processing IO during recovery. It turns out iser just set a lock to that pointer which was pointless. We now disconnect the transport connection before doing recovery so we do not need the recv lock. For iscsi_tcp we still stop the recv path incase older tools are being used. This patch also has iscsi_itt_to_ctask user grab the session lock and has the caller access the task with the lock or get a ref to it in case the target is broken and sends a tmf success response then sends data or a response for the command that was supposed to be affected bty the tmf. Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>	2008-07-12 08:22:22 -05:00
Mike Christie	88dfd340b9	[SCSI] iscsi class: Add session initiatorname and ifacename sysfs attrs. This adds two new attrs used for creating initiator ports and binding sessions to hardware. The session level initiatorname: Since bnx2i does a scsi_host per host device, we need to add the iface initiator port settings on the session, so we can create multiple initiator ports (each with different inames) per device/scsi_host. The current iname reflects that qla4xxx can have one iname per hba, and we are allocating a host per session for software. The iname on the host will remain so we can export and set the hba level qla4xxx setting. The ifacename attr: To bind a session to a some peice of hardware in userspace we maintain some mappings, but during boot or iscsid restart (iscsid contains the user space part of the driver) we need to be able to figure out which of those host mappings abstractions maps to certain sessions. This patch adds a ifacename attr, which userspace can set to id the host side of the endpoint across pivot_roots and iscsid restarts. Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>	2008-07-12 08:22:21 -05:00
Mike Christie	412eeafa0a	[SCSI] iser: Modify iser to take a iscsi_endpoint struct in ep callouts and session setup This hooks iser into the iscsi endpoint code. Previously it handled the lookup and allocation. This has been made generic so bnx2i and iser can share it. It also allows us to pass iser the leading conn's ep, so we know the ib_deivce being used and can set it as the scsi_host's parent. And that allows scsi-ml to set the dma_mask based on those values. Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>	2008-07-12 08:22:21 -05:00
Mike Christie	7970634b81	[SCSI] iscsi class: user device_for_each_child instead of duplicating session list Currently we duplicate the list of sessions, because we were using the test for if a session was on the host list to indicate if the session was bound or unbound. We can instead use the target_id and fix up the class so that drivers like bnx2i do not have to manage the target id space. Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>	2008-07-12 08:22:20 -05:00
Mike Christie	2261ec3d68	[SCSI] iser: handle iscsi_cmd_task rename This handles the iscsi_cmd_task rename and renames the iser cmd task to iser task. Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>	2008-07-12 08:22:20 -05:00
Mike Christie	2747fdb257	[SCSI] iser: convert ib_iser to support merged tasks Convert ib_iser to support merged tasks. Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>	2008-07-12 08:22:19 -05:00
Mike Christie	0af967f5d4	[SCSI] libiscsi, iscsi_tcp, iser: add session cmds array accessor Currently to get a ctask from the session cmd array, you have to know to use the itt modifier. To make this easier on LLDs and so in the future we can easilly kill the session array and use the host shared map instead, this patch adds a nice wrapper to strip the itt into a session->cmds index and return a ctask. Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>	2008-07-12 08:22:18 -05:00
Mike Christie	b40977d95f	[SCSI] iser: fix handling of scsi cmnds during recovery. After the stop_conn callback has returned the LLD should not touch the scsi cmds. iscsi_tcp and libiscsi use the conn->recv_lock and suspend_rx field to halt recv path processing, but iser does not have any protection. This patch modifies iser so that userspace can just call the ep_disconnect callback, which will halt all recv IO, before calling the stop_conn callback so we do not have to worry about the conn->recv_lock and suspend rx field. iser just needs to stop the send side from accessing the ib conn. Fixup to handle when the ep poll fails and ep disconnect is called from Erez. Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>	2008-07-12 08:22:17 -05:00
Mike Christie	5d91e209fb	[SCSI] iscsi: remove session/conn_data_size from iscsi_transport This removes the session and conn data_size fields from the iscsi_transport. Just pass in the value like with host allocation. This patch also makes it so the LLD iscsi_conn data is allocated with the iscsi_cls_conn. Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>	2008-07-12 08:22:16 -05:00
Mike Christie	a4804cd6eb	[SCSI] iscsi: add iscsi host helpers This finishes the host/session unbinding, by adding some helpers to add and remove hosts and the session they manage. Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>	2008-07-12 08:22:16 -05:00
Mike Christie	756135215e	[SCSI] iscsi: remove session and host binding in libiscsi bnx2i allocates a host per netdevice but will use libiscsi, so this unbinds the session from the host in that code. This will also be useful for the iser parent device dma settings fixes. Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>	2008-07-12 08:22:16 -05:00
Mike Christie	d3826721b1	[SCSI] iscsi class, iscsi drivers: remove unused iscsi_transport attrs max_cmd_len and max_conn are not really used. max_cmd_len is always 16 and can be set by the LLD. max_conn is always one since we do not support MCS. Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>	2008-07-12 08:22:15 -05:00
Mike Christie	40753caa36	[SCSI] iscsi class, iscsi_tcp/iser: add host arg to session creation iscsi offload (bnx2i and qla4xx) allocate a scsi host per hba, so the session creation path needs a shost/host_no argument. Software iscsi/iser will follow the same behabior as before where it allcoates a host per session, but in the future iser will probably look more like bnx2i where the host's parent is the hardware (rnic for iser and for bnx2i it is the nic), because it does not use a socket layer like how iscsi_tcp does. Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>	2008-07-12 08:22:15 -05:00
Roland Dreier	feae1ef116	IB/umad: BKL is not needed for ib_umad_open() Remove explicit lock_kernel() calls and document why the code is safe. Signed-off-by: Roland Dreier <rolandd@cisco.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net>	2008-07-11 16:40:58 -06:00
Ingo Molnar	0c81b2a144	Merge branch 'linus' into core/rcu Conflicts: include/linux/rculist.h kernel/rcupreempt.c Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-07-11 10:46:50 +02:00
Steve Wise	5e19cf663b	RDMA/cxgb3: Fix regression caused by class_device -> device conversion The change to iwch_provider.c in commit `f4e91eb4` ("IB: convert struct class_device to struct device") undid the fix done in commit `7f049f2f` ("RDMA/cxgb3: Hold rtnl_lock() around ethtool get_drvinfo call"). It removed the calls to rtnl_lock() that serialized the iw_cxgb3 ethtool ops calls into the cxgb3 driver. This locking is needed to avoid messing up the internal state of the cxgb3 driver. Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-07-08 14:40:05 -07:00
Ingo Molnar	68083e05d7	Merge commit 'v2.6.26-rc9' into cpus4096	2008-07-06 14:23:39 +02:00
Roland Dreier	5b2d281acb	IB/uverbs: BKL is not needed for ib_uverbs_open() Remove explicit lock_kernel() calls and document why the code is safe. Signed-off-by: Roland Dreier <rolandd@cisco.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net>	2008-07-04 10:32:28 -06:00
Ingo Molnar	9a13150109	Merge commit 'v2.6.26-rc8' into core/rcu	2008-06-26 09:24:23 +02:00
Eli Cohen	87afd448b1	IB/mthca: Clear ICM pages before handing to FW Current memfree FW has a bug which in some cases, assumes that ICM pages passed to it are cleared. This patch uses __GFP_ZERO to allocate all ICM pages passed to the FW. Once firmware with a fix is released, we can make the workaround conditional on firmware version. This fixes the bug reported by Arthur Kepner <akepner@sgi.com> here: http://lists.openfabrics.org/pipermail/general/2008-May/050026.html Cc: <stable@kernel.org> Signed-off-by: Eli Cohen <eli@mellanox.co.il> [ Rewritten to be a one-liner using __GFP_ZERO instead of vmap()ing ICM memory and memset()ing it to 0. - Roland ] Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-06-23 09:29:58 -07:00
Ingo Molnar	1e74f9cbbb	Merge branch 'linus' into core/rcu	2008-06-23 11:29:11 +02:00
Arnd Bergmann	6b0ee363b2	infiniband-ucma: BKL pushdown Signed-off-by: Arnd Bergmann <arnd@arndb.de>	2008-06-20 14:05:57 -06:00
Jonathan Corbet	f2b9857eee	Add a bunch of cycle_kernel_lock() calls All of the open() functions which don't need the BKL on their face may still depend on its acquisition to serialize opens against driver initialization. So make those functions acquire then release the BKL to be on the safe side. Signed-off-by: Jonathan Corbet <corbet@lwn.net>	2008-06-20 14:05:53 -06:00
Jonathan Corbet	057e7c7ff9	infiniband: more BKL pushdown Be extra-cautious and protect the remaining open() functions. Signed-off-by: Jonathan Corbet <corbet@lwn.net>	2008-06-20 14:05:51 -06:00
Jonathan Corbet	d21c95c569	Add "no BKL needed" comments to several drivers This documents the fact that somebody looked at the relevant open() functions and concluded that, due to their trivial nature, no locking was needed. Signed-off-by: Jonathan Corbet <corbet@lwn.net>	2008-06-20 14:05:50 -06:00
Jack Morgenstein	fb77bcef9f	IB/uverbs: Fix check of is_closed flag check in ib_uverbs_async_handler() Commit `1ae5c187` ("IB/uverbs: Don't store struct file * for event files") changed the way that closed files are handled in the uverbs code. However, after the conversion, is_closed flag is checked incorrectly in ib_uverbs_async_handler(). As a result, no async events are ever passed to applications. Found by: Ronni Zimmerman <ronniz@mellanox.co.il> Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-06-18 15:36:38 -07:00
Ingo Molnar	766d02786e	Merge branch 'linus' into core/rcu	2008-06-16 11:23:36 +02:00
Roland Dreier	24797a3442	RDMA/nes: Fix off-by-one in nes_reg_user_mr() error path nes_reg_user_mr() should fail if page_count becomes >= 1024 * 512 rather than just testing for strict >, because page_count is essentially used as an index into an array with 1024 * 512 entries, so allowing the loop to continue with page_count == 1024 * 512 means that memory after the end of the array is corrupted. This leads to a crash triggerable by a userspace application that requests registration of a too-big region. Also get rid of the call to pci_free_consistent() here to avoid corrupting state with a double free, since the same memory will be freed in the code jumped to at reg_user_mr_err. Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-06-10 12:29:49 -07:00
Roland Dreier	4c0283fc56	IB/core: Remove IB_DEVICE_SEND_W_INV capability flag In 2.6.26, we added some support for send with invalidate work requests, including a device capability flag to indicate whether a device supports such requests. However, the support was incomplete: the completion structure was not extended with a field for the key contained in incoming send with invalidate requests. Full support for memory management extensions (send with invalidate, local invalidate, fast register through a send queue, etc) is planned for 2.6.27. Since send with invalidate is not very useful by itself, just remove the IB_DEVICE_SEND_W_INV bit before the 2.6.26 final release; we will add an IB_DEVICE_MEM_MGT_EXTENSIONS bit in 2.6.27, which makes things simpler for applications, since they will not have quite as confusing an array of fine-grained bits to check. Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-06-09 09:58:42 -07:00
Roland Dreier	8079ffa0e1	IB/umem: Avoid sign problems when demoting npages to integer On a 64-bit architecture, if ib_umem_get() is called with a size value that is so big that npages is negative when cast to int, then the length of the page list passed to get_user_pages(), namely min_t(int, npages, PAGE_SIZE / sizeof (struct page *)) will be negative, and get_user_pages() will immediately return 0 (at least since `900cf086`, "Be more robust about bad arguments in get_user_pages()"). This leads to an infinite loop in ib_umem_get(), since the code boils down to: while (npages) { ret = get_user_pages(...); npages -= ret; } Fix this by taking the minimum as unsigned longs, so that the value of npages is never truncated. The impact of this bug isn't too severe, since the value of npages is checked against RLIMIT_MEMLOCK, so a process would need to have an astronomical limit or have CAP_IPC_LOCK to be able to trigger this, and such a process could already cause lots of mischief. But it does let buggy userspace code cause a kernel lock-up; for example I hit this with code that passes a negative value into a memory registartion function where it is promoted to a huge u64 value. Cc: <stable@kernel.org> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-06-06 21:38:37 -07:00
Ralph Campbell	27676a3e16	IB/ipath: Fix SM trap forwarding SM/SMA traps received by the ipath driver should be forwarded to the SM if it is running on the host. The ib_ipath driver was incorrectly replying with "bad method." Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-06-06 11:23:29 -07:00
Joachim Fenkes	088af1543c	IB/ehca: Reject send WRs only for RESET, INIT and RTR state Signed-off-by: Joachim Fenkes <fenkes@de.ibm.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-06-06 11:21:33 -07:00
Ralph Campbell	03031f71c7	IB/ipath: Fix device capability flags The driver supports a few features (RNR NAK, port active event, SRQ resize) that were not reported in the device capability flags. This patch fixes that. Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-05-26 15:22:17 -07:00
Roland Dreier	e8ffef73c8	IB/ipath: Avoid test_bit() on u64 SDMA status value Gabriel C <nix.or.die@googlemail.com> pointed out that when the x86 bitops are updated to operate on unsigned long, the code in sdma_abort_task() will produce warnings: drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'sdma_abort_task': drivers/infiniband/hw/ipath/ipath_sdma.c:267: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type and so on, because it uses test_bit() to operation on a u64 value (returned by ipath_read_kref64() for a hardware register). Fix up these warnings by converting the test_bit() operations to &ing with appropriate symbolic defines of the bits within the hardware register. This has the benign side-effect of making the code more self-documenting as well. Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-05-26 15:20:34 -07:00
Linus Torvalds	c2448278e3	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: IB/mad: Fix kernel crash when .process_mad() returns SUCCESS\|CONSUMED IPoIB: Test for NULL broadcast object in ipiob_mcast_join_finish() MAINTAINERS: Add cxgb3 and iw_cxgb3 NIC and iWARP driver entries IB/mlx4: Fix creation of kernel QP with max number of send s/g entries IB/mthca: Fix max_sge value returned by query_device RDMA/cxgb3: Fix uninitialized variable warning in iwch_post_send() IB/mlx4: Fix uninitialized-var warning in mlx4_ib_post_send() IB/ipath: Fix UC receive completion opcode for RDMA WRITE with immediate IB/ipath: Fix printk format for ipath_sdma_status	2008-05-23 11:11:44 -07:00
Dave Olson	5a4f2b6752	IB/mad: Fix kernel crash when .process_mad() returns SUCCESS\|CONSUMED If a low-level driver returns IB_MAD_RESULT_SUCCESS \| IB_MAD_RESULT_CONSUMED, handle_outgoing_dr_smp() doesn't clean up properly. The fix is to kfree the local data and break, rather than falling through. This was observed with the ipath driver, but could happen with any driver. This fixes <https://bugs.openfabrics.org/show_bug.cgi?id=1027>. Signed-off-by: Dave Olson <dave.olson@qlogic.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-05-23 10:52:59 -07:00
Mike Travis	5d7bfd0c4d	infiniband: use performance variant for_each_cpu_mask_nr Change references from for_each_cpu_mask to for_each_cpu_mask_nr where appropriate Reviewed-by: Paul Jackson <pj@sgi.com> Reviewed-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 18:39:06 +02:00
Jack Morgenstein	e1d50dce5a	IPoIB: Test for NULL broadcast object in ipiob_mcast_join_finish() We saw a kernel oops in our regression testing when a multicast "join finish" occurred just after the interface was -- this is <https://bugs.openfabrics.org/show_bug.cgi?id=1040>. The test randomly causes the HCA physical port to go down then up. The cause of this is that ipoib_mcast_join_finish() processing happen just after ipoib_mcast_dev_flush() was invoked (in which case the broadcast pointer is NULL). This patch tests for and handles the case where priv->broadcast is NULL. Cc: <stable@kernel.org> Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-05-20 15:41:09 -07:00
Roland Dreier	cd155c1c7c	IB/mlx4: Fix creation of kernel QP with max number of send s/g entries When creating a kernel QP where the consumer asked for a send queue with lots of scatter/gater entries, set_kernel_sq_size() incorrectly returned an error if the send queue stride is larger than the hardware's maximum send work request descriptor size. This is not a problem; the only issue is to make sure that the actual descriptors used do not overflow the maximum descriptor size, so check this instead. Clamp the returned max_send_sge value to be no bigger than what query_device returns for the max_sge to avoid confusing hapless users, even if the hardware is capable of handling a few more s/g entries. This bug caused NFS/RDMA mounts to fail when the server adapter used the mlx4 driver. Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-05-20 14:00:02 -07:00
Greg Kroah-Hartman	6c06aec248	IB: fix race in device_create There is a race from when a device is created with device_create() and then the drvdata is set with a call to dev_set_drvdata() in which a sysfs file could be open, yet the drvdata will be NULL, causing all sorts of bad things to happen. This patch fixes the problem by using the new function, device_create_drvdata(). Cc: Kay Sievers <kay.sievers@vrfy.org> Reviewed-by: Roland Dreier <rolandd@cisco.com> Cc: Sean Hefty <sean.hefty@intel.com> Cc: Hal Rosenstock <hal.rosenstock@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-05-20 13:31:55 -07:00
Franck Bui-Huu	82524746c2	rcu: split list.h and move rcu-protected lists into rculist.h Move rcu-protected lists from list.h into a new header file rculist.h. This is done because list are a very used primitive structure all over the kernel and it's currently impossible to include other header files in this list.h without creating some circular dependencies. For example, list.h implements rcu-protected list and uses rcu_dereference() without including rcupdate.h. It actually compiles because users of rcu_dereference() are macros. Others RCU functions could be used too but aren't probably because of this. Therefore this patch creates rculist.h which includes rcupdates without to many changes/troubles. Signed-off-by: Franck Bui-Huu <fbuihuu@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Josh Triplett <josh@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-05-19 10:01:37 +02:00
Roland Dreier	12103dca52	IB/mthca: Fix max_sge value returned by query_device The mthca driver returns the maximum number of scatter/gather entries returned by the firmware as the max_sge value when device properties are queried. However, the firmware also reports a limit on the maximum descriptor size allowed, and because mthca takes into account the worst case send request overhead when checking whether to allow a QP to be created, the largest number of scatter/gather entries that can be used with mthca may be limited by the maximum descriptor size rather than just by the actual s/g entry limit. This means that applications cannot actually create QPs with max_send_sge equal to the limit returned by ib_query_device(). Fix this by checking if the maximum descriptor size imposes a lower limit and if so returning that lower limit. Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-05-16 14:58:44 -07:00
Roland Dreier	21609ae3ef	RDMA/cxgb3: Fix uninitialized variable warning in iwch_post_send() drivers/infiniband/hw/cxgb3/iwch_qp.c: In function 'iwch_post_send': drivers/infiniband/hw/cxgb3/iwch_qp.c:232: warning: 't3_wr_flit_cnt' may be used uninitialized in this function This is what akpm describes as "the dopey gcc-doesn't-know-that-foo(&var)-writes-to-var problem." Signed-off-by: Roland Dreier <rolandd@cisco.com> Acked-by: Steve Wise <swise@opengridcomputing.com>	2008-05-16 14:58:40 -07:00
Andrew Morton	a3d8e1591d	IB/mlx4: Fix uninitialized-var warning in mlx4_ib_post_send() drivers/infiniband/hw/mlx4/qp.c: In function 'mlx4_ib_post_send': drivers/infiniband/hw/mlx4/qp.c:1460: warning: 'seglen' may be used uninitialized in this function This is the dopey gcc-doesn't-know-that-foo(&var)-writes-to-var problem. Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-05-16 14:28:30 -07:00
Ralph Campbell	df3f0da8db	IB/ipath: Fix UC receive completion opcode for RDMA WRITE with immediate When I fixed the RC receive completion opcode in `2bfc8e9e` ("IB/ipath: Return the correct opcode for RDMA WRITE with immediate"), I forgot to fix UC, which had the same problem for RDMA write with immediate returning the wrong opcode. Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-05-15 16:37:25 -07:00
Roland Dreier	cd80ec6f81	IB/ipath: Fix printk format for ipath_sdma_status Commit `f018c7e1` ("IB/ipath: Change ipath_devdata.ipath_sdma_status to be unsigned long") changed ipath_sdma_status to be unsigned long, but left a few debug messages that printed it out with a %016llx format, which generates the warnings drivers/infiniband/hw/ipath/ipath_sdma.c:348: warning: format '%016llx' expects type 'long long unsigned int', but argument 3 has type 'long unsigned int' drivers/infiniband/hw/ipath/ipath_sdma.c:618: warning: format '%016llx' expects type 'long long unsigned int', but argument 3 has type 'long unsigned int' Fix this by changing the format used to print out the value to %08lx (8 hex digits are now sufficient, because the highest bit used is 31). Warnings reported by Randy Dunlap <randy.dunlap@oracle.com>. Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-05-15 15:28:55 -07:00
Steve Wise	a58e58fafd	RDMA/cxgb3: Wrap the software send queue pointer as needed on flush cxio_flush_sq() was failing to wrap around the software send queue causing garbage completion entries on a flush operation. Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-05-13 11:52:55 -07:00
Roland Dreier	f018c7e177	IB/ipath: Change ipath_devdata.ipath_sdma_status to be unsigned long Andrew Morton <akpm@linux-foundation.org> pointed out that bitops should take an unsigned long * arg. However, the ipath driver was doing bitops on struct ipath_devdata.ipath_sdma_status, which is u64. Change this member to unsigned long to avoid tons of warnings when x86 fixes the bitops to take unsigned long * instead of void *. Also, change the IPATH_SDMA_RUNNING and IPATH_SDMA_SHUTDOWN bit numbers to 30 and 31 (instead of 62 and 63) so that we're not setting another booby trap for someone who tries to make ipath work on a 32-bit architecture. Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-05-13 11:51:23 -07:00
Pavel Emelyanov	40d97692fb	IB/ipath: Make ipath_portdata work with struct pid * not pid_t The official reason is "with the presence of pid namespaces in the kernel using pid_t-s inside one is no longer safe." But the reason I fix this right now is the following: About a month ago (when 2.6.25 was not yet released) there still was a one last caller of a to-be-deprecated-soon function find_pid() - the kill_proc() function, which in turn was only used by nfs callback code. During the last merge window, this last caller was finally eliminated by some NFS patch(es) and I was about to finally kill this kill_proc() and find_pid(), but found, that I was late and the kill_proc is now called from the ipath driver since commit `58411d1c` ("IB/ipath: Head of Line blocking vs forward progress of user apps"). So here's a patch that fixes this code to use struct pid * and (!) the kill_pid routine. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-05-13 11:45:32 -07:00
Ralph Campbell	74116f580b	IB/ipath: Fix RDMA read response sequence checking If an out of sequence RDMA read response middle or last packet is received, we should only resend the RDMA read request on the first out of sequence packet and drop subsequent out of sequence packets otherwise, we get "too many retries". Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-05-13 11:42:20 -07:00
Ralph Campbell	e509be898d	IB/ipath: Fix many locking issues when switching to error state The send DMA hardware queue voided a number of prior assumptions about when a send is complete which led to completions being generated out of order. There were also a number of locking issues when switching the QP to the error or reset states, and we implement the IB_QPS_SQD state. Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-05-13 11:41:29 -07:00
Ralph Campbell	53dc1ca194	IB/ipath: Fix RC and UC error handling When errors are detected in RC, the QP should transition to the IB_QPS_ERR state, not the IB_QPS_SQE state. Also, when the error is on the responder side, the receive work completion error was incorrect (remote vs. local). Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-05-13 11:40:25 -07:00
Roland Dreier	dd37818dbd	RDMA/nes: Fix up nes_lro_max_aggr module parameter Fix some bugs with the max_aggr module parameter added with LRO support: - The module parameter value ignored and not actually used to set lro_mgr.max_aggr. - MODULE_PARM_DESC had a typo "_mro_" instead of "_lro_" so it didn't end up describing the actual module parameter. - The nes_lro_max_aggr variable was declared as unsigned, but the module_param line said "int" instead of "uint" for the type. - The default value for the parameter was stuck in the permissions field of module_param, which led to nonsensical permissions for the file under /sys/module/iw_nes/param. - The parameter was used in only one file but defined in another, which led to the variable being global for no good reason. Move everything related to the parameter to the file nes_hw.c where it is actually used. Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-05-13 11:27:25 -07:00
Stefan Roscher	12137c593d	IB/ehca: Wait for async events to finish before destroying QP This is necessary because, in a multicore environment, a race between uverbs async handler and destroy QP could occur. Signed-off-by: Stefan Roscher <stefan.roscher at de.ibm.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-05-07 11:35:06 -07:00
John Gregor	ab69b3cf12	IB/ipath: Fix SDMA error recovery in absence of link status change What's fixed: in ipath_cancel_sends() We need to unconditionally set ABORTING. So, swap the tests so the set_bit() isn't shadowed by the &&. If we've disarmed the piobufs, then we need to unconditionally set DISARMED. So, move it out from the overly protective if at the bottom. in sdma_abort_task() Abort_task was written knowing that the SDMA engine would always be reset (and restarted) on error. A recent change broke that fundamental assumption by taking the restart portion and making it conditional on a link status change. But, SDMA can go boom without a link status change in some conditions. Signed-off-by: John Gregor <john.gregor@qlogic.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-05-07 11:01:10 -07:00
Dave Olson	e2ab41cae4	IB/ipath: Need to always request and handle PIO avail interrupts Now that we always use PIO for vl15 on 7220, we could get stuck forever if we happened to run out of PIO buffers from the verbs code, because the setup code wouldn't run; the interrupt was also ignored if SDMA was supported. We also have to reduce the pio update threshold if we have fewer kernel buffers than the existing threshold. Clean up the initialization a bit to get ordering safer and more sensible, and use the existing ipath_chg_kernavail call to do init, rather than doing it separately. Drop unnecessary clearing of pio buffer on pio parity error. Drop incorrect updating of pioavailshadow when exitting freeze mode (software state may not match chip state if buffer has been allocated and not yet written). If we couldn't get a kernel buffer for a while, make sure we are in sync with hardware, mainly to handle the exitting freeze case. Signed-off-by: Dave Olson <dave.olson@qlogic.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-05-07 11:00:15 -07:00
Michael Albaugh	2889d1ef12	IB/ipath: Fix count of packets received by kernel The loop in ipath_kreceive() that processes packets increments the loop-index 'i' once too often, because the exit condition does not depend on it, and is checked after the increment. By adding a check for !last to the iterator in the for loop, we correct that in a way that is not so likely to be re-broken by changes in the loop body. Signed-off-by: Michael Albaugh <micheal.albaugh@qlogic.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-05-07 10:59:23 -07:00
Ralph Campbell	2bfc8e9edf	IB/ipath: Return the correct opcode for RDMA WRITE with immediate This patch fixes a bug in the RC responder which generates a completion entry with the wrong opcode when an RDMA WRITE with immediate is received. Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-05-07 10:58:50 -07:00
Dave Olson	b4d390d8d2	IB/ipath: Fix bug that can leave sends disabled after freeze recovery The semantics of cancel_sends changed, but the code using it was missed. Don't leave sends and pioavail updates disabled, and add a comment as to why the force update is needed. Signed-off-by: Dave Olson <dave.olson@qlogic.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-05-07 10:57:48 -07:00
Ralph Campbell	6e87d15007	IB/ipath: Only increment SSN if WQE is put on send queue If a send work request has immediate errors and is not put on the send queue, we shouldn't update any of the QP state. The increment of the SSN wasn't obeying this. Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-05-07 10:57:14 -07:00
Michael Albaugh	5f51efc195	IB/ipath: Only warn about prototype chip during init We warn about prototype chips, but the function that checks for support is also called as a result of a get_portinfo request, which can clutter the logs. Restrict warning to only appear during initialization. Signed-off-by: Michael Albaugh <michael.albaugh@qlogic.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-05-07 10:56:47 -07:00
Roland Dreier	273748cc90	RDMA/cxgb3: Fix severe limit on userspace memory registration size Currently, iw_cxgb3 is severely limited on the amount of userspace memory that can be registered in in a single memory region, which causes big problems for applications that expect to be able to register 100s of MB. The problem is that the driver uses a single kmalloc()ed buffer to hold the physical buffer list (PBL) for the entire memory region during registration, which means that 8 bytes of contiguous memory are required for each page of memory being registered. For example, a 64 MB registration will require 128 KB of contiguous memory with 4 KB pages, and it unlikely that such an allocation will succeed on a busy system. This is purely a driver problem: the temporary page list buffer is not needed by the hardware, so we can fix this by writing the PBL to the hardware in page-sized chunks rather than all at once. We do this by splitting the memory registration operation up into several steps: - Allocate PBL space in adapter memory for the full registration - Copy PBL to adapter memory in chunks - Allocate STag and enable memory region This also allows several other cleanups to the __cxio_tpt_op() interface and related parts of the driver. This change leaves the reregister memory region and memory window operations broken, but they already didn't work due to other longstanding bugs, so fixing them will be left to a later patch. Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-05-06 15:56:22 -07:00
Roland Dreier	0e9913362a	RDMA/cxgb3: Don't add PBL memory to gen_pool in chunks Current iw_cxgb3 code adds PBL memory to the driver's gen_pool in 2 MB chunks. This limits the largest single allocation that can be done to the same size, which means that with 4 KB pages, each of which takes 8 bytes of PBL memory, the largest memory region that can be allocated is 1 GB (256K PBL entries * 4 KB/entry). Remove this limit by adding all the PBL memory in a single gen_pool chunk, if possible. Add code that falls back to smaller chunks if gen_pool_add() fails, which can happen if there is not sufficient contiguous lowmem for the internal gen_pool bitmap. Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-05-06 15:03:38 -07:00
Stefan Roscher	cf04690885	IB/ehca: Fix function return types Also remove duplicate assignment of local_ca_ack_delay and change min_t check for local_ca_ack_delay to u8 instead of int. Signed-off-by: Stefan Roscher <stefan.roscher at de.ibm.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-05-05 15:51:49 -07:00
Steve Wise	77a8d5741f	RDMA/cxgb3: Bump up the MPA connection setup timeout. Testing on large clusters shows its way too short at 10 secs. Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-05-02 10:57:09 -07:00
Steve Wise	c4d49776e8	RDMA/cxgb3: Silently ignore close reply after abort. Remove bad BUG_ON() that can trigger in correct operation from close_con_rpl(). It is possible to get a close_rpl message on a dead connection. The sequence is: - host refs ep for close exchange - host posts close_req - hw posts PEER_ABORT from incoming RST - host marks ep DEAD - host posts ABORT_RPL and releases ep resources - hw posts CLOSE_RPL - host derefs ep and ep freed. Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-05-02 10:57:09 -07:00
Steve Wise	c8286944b8	RDMA/cxgb3: QP flush fixes - Flush the QP only after the HW disables the connection. Currently we flush the QP when transitioning to CLOSING. This exposes a race condition where the HW can complete a RECV WR, for instance, -and- the SW can flush that same WR. - Only call CQ event handlers on flush IFF we actually flushed something. Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-05-02 10:56:57 -07:00
Eli Cohen	57ce41d1d1	IB/ipoib: Fix transmit queue stalling forever Commit `f56bcd80` ("IPoIB: Use separate CQ for UD send completions") introduced a bug where the transmit queue could get stopped and never woken up. The problem is that send completions are only polled at the end of the xmit function, so if the send queue fills up and the xmit path stops the queue, then there is no way for send completions to ever get polled, and so the transmit queue stays stopped forever. Fix this by arming the send CQ just before posting the last send request that fills the send queue. Then, when the completion event handler is called, drain the send CQ. Since it is possible that not enough send completions are in the CQ, verify that the the net queue has been woken up after draining the send CQ, and if not arm a timer and drain again at the timer function. Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-04-30 20:02:45 -07:00
Roland Dreier	3ae15e1623	IB/mlx4: Fix off-by-one errors in calls to mlx4_ib_free_cq_buf() When I merged `bbf8eed1` ("IB/mlx4: Add support for resizing CQs") I changed things around so that mlx4_ib_alloc_cq_buf() and mlx4_ib_free_cq_buf() were used everywhere they could be. However, I screwed up the number of entries passed into mlx4_ib_alloc_cq_buf() in a couple places -- the function bumps the number of entries internally, so the caller shouldn't add 1 as well. Passing a too-big value for the number of entries to mlx4_ib_free_cq_buf() can cause the cleanup to go off the end of an array and corrupt allocator state in interesting ways. Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-04-30 19:52:55 -07:00
Glenn Streiff	7495ab6837	RDMA/nes: Formatting cleanup Various cleanups: - Change // to /* .. */ - Place whitespace around binary operators. - Trim down a few long lines. - Some minor alignment formatting for better readability. - Remove some silly tabs. Signed-off-by: Glenn Streiff <gstreiff@neteffect.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-04-29 13:46:54 -07:00
Eric Schneider	0e1de5d62e	RDMA/nes: Add support for SFP+ PHY This patch enables the iw_nes module for NetEffect RNICs to support additional PHYs including SFP+ (referred to as ARGUS in the code). Signed-off-by: Eric Schneider <eric.schneider@neteffect.com> Signed-off-by: Glenn Streiff <gstreiff@neteffect.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-04-29 13:46:54 -07:00
Faisal Latif	37dab4112d	RDMA/nes: Use LRO Signed-off-by: Faisal Latif <flatif@neteffect.com. Signed-off-by: Glenn Streiff <gstreiff@neteffect.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-04-29 13:46:54 -07:00
Eli Cohen	b4132efa1a	IPoIB: Copy child MTU from parent When creating a child interface, copy the MTU information from the parent. Otherwise when the child's multicast join completes, the MTU will not be updated since the code does dev->mtu = min(priv->mcast_mtu, priv->admin_mtu); and priv->admin_mtu will be set to 0. Signed-off-by: Eli Cohen <eli@mellanox.co.il> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-04-29 13:46:53 -07:00
Roland Dreier	baaad380c0	IB/mthca: Avoid changing userspace ABI to handle DMA write barrier attribute Commit `cb9fbc5c` ("IB: expand ib_umem_get() prototype") changed the mthca userspace ABI to provide a way for userspace to indicate which memory regions need the DMA write barrier attribute. However, it is possible to handle this without breaking existing userspace, by having the mthca kernel driver recognize whether it is talking to old or new userspace, depending on the size of the register MR structure passed in. The only potential drawback of this is that is allows old userspace (which has a bug with DMA ordering on large SGI Altix systems) to continue to run on new kernels, but the advantage of allowing old userspace to continue to work on unaffected systems seems to outweigh this, and we can print a warning to push people to upgrade their userspace. Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-04-29 13:46:53 -07:00
Olaf Kirch	0bfe151cc4	IB/mthca: Avoid recycling old FMR R_Keys too soon When a FMR is unmapped, mthca resets the map count to 0, and clears the upper part of the R_Key which is used as the sequence counter. This poses a problem for RDS, which uses ib_fmr_unmap as a fence operation. RDS assumes that after issuing an unmap, the old R_Keys will be invalid for a "reasonable" period of time. For instance, Oracle processes uses shared memory buffers allocated from a pool of buffers. When a process dies, we want to reclaim these buffers -- but we must make sure there are no pending RDMA operations to/from those buffers. The only way to achieve that is by using unmap and sync the TPT. However, when the sequence count is reset on unmap, there is a high likelihood that a new mapping will be given the same R_Key that was issued a few milliseconds ago. To prevent this, don't reset the sequence count when unmapping a FMR. Signed-off-by: Olaf Kirch <olaf.kirch@oracle.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-04-29 13:46:53 -07:00
Stefan Roscher	d227fa7288	IB/ehca: Allocate event queue size depending on max number of CQs and QPs If a lot of QPs fall into Error state at once and the EQ of the respective HCA is too small, it might overrun, causing the eHCA driver to stop processing completion events and calling the application's completion handlers, effectively causing traffic to stop. Fix this by limiting available QPs and CQs to a customizable max count, and determining EQ size based on these counts and a worst-case assumption. Signed-off-by: Stefan Roscher <stefan.roscher@de.ibm.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-04-29 13:46:53 -07:00
Eli Cohen	f56bcd8013	IPoIB: Use separate CQ for UD send completions Use a dedicated CQ for UD send completions. Also, do not arm the UD send CQ, which reduces the number of interrupts generated. This patch farther reduces overhead by not calling poll CQ for every posted send WR -- it does polls only when there 16 or more outstanding work requests. Signed-off-by: Eli Cohen <eli@mellanox.co.il> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-04-29 13:46:53 -07:00
Eli Dorfman	87528227df	IB/iser: Count FMR alignment violations per session Count FMR alignment violations per session as part of the iscsi statistics. Signed-off-by: Eli Dorfman <elid@voltaire.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-04-29 13:46:52 -07:00
Eli Dorfman	6f735e36ba	IB/iser: Move high-volume debug output to higher debug level Add another level for debug. Signed-off-by: Eli Dorfman <elid@voltaire.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-04-29 13:46:52 -07:00
Hoang-Nam Nguyen	7df109d917	IB/ehca: handle negative return value from ibmebus_request_irq() properly ehca_create_eq() was assigning a signed return value to an unsiged local variable and then checking if the variable was < 0, which meant that errors were always ignored. Fix this by using one variable for signed integer return values and another for u64 hcall return values. Bug originally found by Roel Kluin <12o3l@tiscali.nl>. Signed-off-by: Hoang-Nam Nguyen <hnguyen@de.ibm.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-04-29 13:46:52 -07:00
Steve Wise	f8b0dfd152	RDMA/cxgb3: Support peer-2-peer connection setup Open MPI, Intel MPI and other applications don't respect the iWARP requirement that the client (active) side of the connection send the first RDMA message. This class of application connection setup is called peer-to-peer. Typically once the connection is setup, _both_ sides want to send data. This patch enables supporting peer-to-peer over the chelsio RNIC by enforcing this iWARP requirement in the driver itself as part of RDMA connection setup. Connection setup is extended, when the peer2peer module option is 1, such that the MPA initiator will send a 0B Read (the RTR) just after connection setup. The MPA responder will suspend SQ processing until the RTR message is received and reply-to. In the longer term, this will be handled in a standardized way by enhancing the MPA negotiation so peers can indicate whether they want/need the RTR and what type of RTR (0B read, 0B write, or 0B send) should be sent. This will be done by standardizing a few bits of the private data in order to negotiate all this. However this patch enables peer-to-peer applications now and allows most of the required firmware and driver changes to be done and tested now. Design: - Add a module option, peer2peer, to enable this mode. - New firmware support for peer-to-peer mode: - a new bit in the rdma_init WR to tell it to do peer-2-peer and what form of RTR message to send or expect. - process _all_ preposted recvs before moving the connection into rdma mode. - passive side: defer completing the rdma_init WR until all pre-posted recvs are processed. Suspend SQ processing until the RTR is received. - active side: expect and process the 0B read WR on offload TX queue. Defer completing the rdma_init WR until all pre-posted recvs are processed. Suspend SQ processing until the 0B read WR is processed from the offload TX queue. - If peer2peer is set, driver posts 0B read request on offload TX queue just after posting the rdma_init WR to the offload TX queue. - Add CQ poll logic to ignore unsolicitied read responses. Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-04-29 13:46:52 -07:00
Steve Wise	ccaf10d0ad	RDMA/cxgb3: Set the max_mr_size device attribute correctly cxgb3 only supports 4GB memory regions. The lustre RDMA code uses this attribute and currently has to code around our bad setting. Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-04-29 13:46:52 -07:00
Steve Wise	989a178069	RDMA/cxgb3: Correctly serialize peer abort path Open MPI and other stress testing exposed a few bad bugs in handling aborts in the middle of a normal close. Fix these by: - serializing abort reply and peer abort processing with disconnect processing - warning (and ignoring) if ep timer is stopped when it wasn't running - cleaning up disconnect path to correctly deal with aborting and dead endpoints - in iwch_modify_qp(), taking a ref on the ep before releasing the qp lock if iwch_ep_disconnect() will be called. The ref is dropped after calling disconnect. Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-04-29 13:46:51 -07:00
Yevgeny Petrilin	e463c7b197	mlx4_core: Add a way to set the "collapsed" CQ flag Extend the mlx4_cq_resize() API with a way to set the "collapsed" flag for the CQ being created. Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-04-29 13:46:50 -07:00
Arthur Kepner	cb9fbc5c37	IB: expand ib_umem_get() prototype Add a new parameter, dmasync, to the ib_umem_get() prototype. Use dmasync = 1 when mapping user-allocated CQs with ib_umem_get(). Signed-off-by: Arthur Kepner <akepner@sgi.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Jesse Barnes <jbarnes@virtuousgeek.org> Cc: Jes Sorensen <jes@sgi.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Roland Dreier <rdreier@cisco.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: David Miller <davem@davemloft.net> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Grant Grundler <grundler@parisc-linux.org> Cc: Michael Ellerman <michael@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:12 -07:00
Roland Dreier	31d1e340f0	RDMA/nes: Remove volatile qualifier from struct nes_hw_cq.cq_vbase Remove the volatile qualifier from the cq_vbase member of struct nes_hw_cq, and add an rmb() in the one place where it looks like access order might make a difference. As usual, removing a volatile qualifier in a declaration is actually a bug fix, since a volatile qualifier is not sufficient to make sure that aggressively out-of-order CPUs don't reorder things and cause incorrect results. For example, a CPU might speculatively execute reads of other cqe fields before the NIC hardware has written those fields and before it has set the NES_CQE_VALID bit (even though those reads come after the test of the NES_CQE_VALID bit in program order), but then when the CPU actually executes the conditional test of the NES_CQE_VALID, the bit has been set, and the CPU will proceed with the results of the earlier speculative execution and end up using bogus data. This also gets rid of the warning: drivers/infiniband/hw/nes/nes_verbs.c: In function 'nes_destroy_cq': drivers/infiniband/hw/nes/nes_verbs.c:1978: warning: passing argument 3 of 'pci_free_consistent' discards qualifiers from pointer target type Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-04-23 11:55:45 -07:00
Yevgeny Petrilin	6296883ca4	mlx4_core: Move kernel doorbell management into core In addition to mlx4_ib, there will be ethernet and FC consumers of mlx4_core, so move the code for managing kernel doorbells into the core module to avoid having to duplicate this multiple times. Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-04-23 11:55:45 -07:00
Joachim Fenkes	14fb05b349	IB/ehca: Bump version number to 0026 Signed-off-by: Joachim Fenkes <fenkes@de.ibm.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-04-23 11:55:45 -07:00
Joachim Fenkes	0455e36d81	IB/ehca: Make some module parameters bool, update descriptions Signed-off-by: Joachim Fenkes <fenkes@de.ibm.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-04-23 11:55:45 -07:00
Joachim Fenkes	a7607c9b11	IB/ehca: Remove mr_largepage parameter Always enable large page support; didn't seem to cause problems for anyone. Signed-off-by: Joachim Fenkes <fenkes@de.ibm.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-04-23 11:55:45 -07:00
Joachim Fenkes	4da27d6d5b	IB/ehca: Move high-volume debug output to higher debug levels Signed-off-by: Joachim Fenkes <fenkes@de.ibm.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-04-23 11:55:45 -07:00
Joachim Fenkes	863fb09fbf	IB/ehca: Prevent posting of SQ WQEs if QP not in RTS ...as required by IB Spec, C10-29. Signed-off-by: Joachim Fenkes <fenkes@de.ibm.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-04-23 11:55:45 -07:00
Shirley Ma	bc7b3a36ba	IPoIB: Handle 4K IB MTU for UD (datagram) mode This patch enables IPoIB to use 4K UD messages (when the underlying device and fabrics support a 4K MTU) by using two scatter buffers when PAGE_SIZE is less than or equal to thhe HCA IB MTU size. The first buffer is for IPoIB header + GRH header, and the second buffer is the IPoIB payload, which is 4K-4. Signed-off-by: Shirley Ma <xma@us.ibm.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-04-23 11:55:45 -07:00
Chien Tung	bc5698f3ec	RDMA/nes: Fix adapter reset after PXE boot After PXE boot, the iw_nes driver does a full reset to ensure the card is in a clean state. However, it doesn't wait for firmware to complete its work before issuing a port reset to enable the ports, which leads to problems bringing up the ports. The solution is to wait for firmware to complete its work before proceeding with port reset. This bug was flagged by Roland Dreier <rolandd@cisco.com>. Cc: <stable@kernel.org> Signed-off-by: Chien Tung <ctung@neteffect.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-04-23 11:55:45 -07:00
Roland Dreier	e447703123	RDMA/nes: Print IPv4 addresses in a readable format Use NIPQUAD_FMT instead of printing raw 32-bit hex quantities in debugging output. Acked-by: Glenn Streiff <gstreiff@neteffect.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-04-23 11:55:43 -07:00
Roland Dreier	2bd01c5d2e	RDMA/nes: Use print_mac() to format ethernet addresses for printing Removing open-coded MAC formats shrinks the source and the generated code too, eg on x86-64: add/remove: 0/0 grow/shrink: 0/4 up/down: 0/-103 (-103) function old new delta make_cm_node 932 912 -20 nes_netdev_set_mac_address 427 406 -21 nes_netdev_set_multicast_list 1148 1124 -24 nes_probe 2349 2311 -38 Acked-by: Glenn Streiff <gstreiff@neteffect.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>	2008-04-23 11:52:18 -07:00

1 2 3 4 5 ...

1884 Commits