net: add devmem TCP documentation
Add documentation outlining the usage and details of devmem TCP. Signed-off-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20240910171458.219195-12-almasrymina@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This commit is contained in:
parent
678f6e28b5
commit
09d1db26b5
269
Documentation/networking/devmem.rst
Normal file
269
Documentation/networking/devmem.rst
Normal file
@ -0,0 +1,269 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
=================
|
||||
Device Memory TCP
|
||||
=================
|
||||
|
||||
|
||||
Intro
|
||||
=====
|
||||
|
||||
Device memory TCP (devmem TCP) enables receiving data directly into device
|
||||
memory (dmabuf). The feature is currently implemented for TCP sockets.
|
||||
|
||||
|
||||
Opportunity
|
||||
-----------
|
||||
|
||||
A large number of data transfers have device memory as the source and/or
|
||||
destination. Accelerators drastically increased the prevalence of such
|
||||
transfers. Some examples include:
|
||||
|
||||
- Distributed training, where ML accelerators, such as GPUs on different hosts,
|
||||
exchange data.
|
||||
|
||||
- Distributed raw block storage applications transfer large amounts of data with
|
||||
remote SSDs. Much of this data does not require host processing.
|
||||
|
||||
Typically the Device-to-Device data transfers in the network are implemented as
|
||||
the following low-level operations: Device-to-Host copy, Host-to-Host network
|
||||
transfer, and Host-to-Device copy.
|
||||
|
||||
The flow involving host copies is suboptimal, especially for bulk data transfers,
|
||||
and can put significant strains on system resources such as host memory
|
||||
bandwidth and PCIe bandwidth.
|
||||
|
||||
Devmem TCP optimizes this use case by implementing socket APIs that enable
|
||||
the user to receive incoming network packets directly into device memory.
|
||||
|
||||
Packet payloads go directly from the NIC to device memory.
|
||||
|
||||
Packet headers go to host memory and are processed by the TCP/IP stack
|
||||
normally. The NIC must support header split to achieve this.
|
||||
|
||||
Advantages:
|
||||
|
||||
- Alleviate host memory bandwidth pressure, compared to existing
|
||||
network-transfer + device-copy semantics.
|
||||
|
||||
- Alleviate PCIe bandwidth pressure, by limiting data transfer to the lowest
|
||||
level of the PCIe tree, compared to the traditional path which sends data
|
||||
through the root complex.
|
||||
|
||||
|
||||
More Info
|
||||
---------
|
||||
|
||||
slides, video
|
||||
https://netdevconf.org/0x17/sessions/talk/device-memory-tcp.html
|
||||
|
||||
patchset
|
||||
[PATCH net-next v24 00/13] Device Memory TCP
|
||||
https://lore.kernel.org/netdev/20240831004313.3713467-1-almasrymina@google.com/
|
||||
|
||||
|
||||
Interface
|
||||
=========
|
||||
|
||||
|
||||
Example
|
||||
-------
|
||||
|
||||
tools/testing/selftests/net/ncdevmem.c:do_server shows an example of setting up
|
||||
the RX path of this API.
|
||||
|
||||
|
||||
NIC Setup
|
||||
---------
|
||||
|
||||
Header split, flow steering, & RSS are required features for devmem TCP.
|
||||
|
||||
Header split is used to split incoming packets into a header buffer in host
|
||||
memory, and a payload buffer in device memory.
|
||||
|
||||
Flow steering & RSS are used to ensure that only flows targeting devmem land on
|
||||
an RX queue bound to devmem.
|
||||
|
||||
Enable header split & flow steering::
|
||||
|
||||
# enable header split
|
||||
ethtool -G eth1 tcp-data-split on
|
||||
|
||||
|
||||
# enable flow steering
|
||||
ethtool -K eth1 ntuple on
|
||||
|
||||
Configure RSS to steer all traffic away from the target RX queue (queue 15 in
|
||||
this example)::
|
||||
|
||||
ethtool --set-rxfh-indir eth1 equal 15
|
||||
|
||||
|
||||
The user must bind a dmabuf to any number of RX queues on a given NIC using
|
||||
the netlink API::
|
||||
|
||||
/* Bind dmabuf to NIC RX queue 15 */
|
||||
struct netdev_queue *queues;
|
||||
queues = malloc(sizeof(*queues) * 1);
|
||||
|
||||
queues[0]._present.type = 1;
|
||||
queues[0]._present.idx = 1;
|
||||
queues[0].type = NETDEV_RX_QUEUE_TYPE_RX;
|
||||
queues[0].idx = 15;
|
||||
|
||||
*ys = ynl_sock_create(&ynl_netdev_family, &yerr);
|
||||
|
||||
req = netdev_bind_rx_req_alloc();
|
||||
netdev_bind_rx_req_set_ifindex(req, 1 /* ifindex */);
|
||||
netdev_bind_rx_req_set_dmabuf_fd(req, dmabuf_fd);
|
||||
__netdev_bind_rx_req_set_queues(req, queues, n_queue_index);
|
||||
|
||||
rsp = netdev_bind_rx(*ys, req);
|
||||
|
||||
dmabuf_id = rsp->dmabuf_id;
|
||||
|
||||
|
||||
The netlink API returns a dmabuf_id: a unique ID that refers to this dmabuf
|
||||
that has been bound.
|
||||
|
||||
The user can unbind the dmabuf from the netdevice by closing the netlink socket
|
||||
that established the binding. We do this so that the binding is automatically
|
||||
unbound even if the userspace process crashes.
|
||||
|
||||
Note that any reasonably well-behaved dmabuf from any exporter should work with
|
||||
devmem TCP, even if the dmabuf is not actually backed by devmem. An example of
|
||||
this is udmabuf, which wraps user memory (non-devmem) in a dmabuf.
|
||||
|
||||
|
||||
Socket Setup
|
||||
------------
|
||||
|
||||
The socket must be flow steered to the dmabuf bound RX queue::
|
||||
|
||||
ethtool -N eth1 flow-type tcp4 ... queue 15
|
||||
|
||||
|
||||
Receiving data
|
||||
--------------
|
||||
|
||||
The user application must signal to the kernel that it is capable of receiving
|
||||
devmem data by passing the MSG_SOCK_DEVMEM flag to recvmsg::
|
||||
|
||||
ret = recvmsg(fd, &msg, MSG_SOCK_DEVMEM);
|
||||
|
||||
Applications that do not specify the MSG_SOCK_DEVMEM flag will receive an EFAULT
|
||||
on devmem data.
|
||||
|
||||
Devmem data is received directly into the dmabuf bound to the NIC in 'NIC
|
||||
Setup', and the kernel signals such to the user via the SCM_DEVMEM_* cmsgs::
|
||||
|
||||
for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
|
||||
if (cm->cmsg_level != SOL_SOCKET ||
|
||||
(cm->cmsg_type != SCM_DEVMEM_DMABUF &&
|
||||
cm->cmsg_type != SCM_DEVMEM_LINEAR))
|
||||
continue;
|
||||
|
||||
dmabuf_cmsg = (struct dmabuf_cmsg *)CMSG_DATA(cm);
|
||||
|
||||
if (cm->cmsg_type == SCM_DEVMEM_DMABUF) {
|
||||
/* Frag landed in dmabuf.
|
||||
*
|
||||
* dmabuf_cmsg->dmabuf_id is the dmabuf the
|
||||
* frag landed on.
|
||||
*
|
||||
* dmabuf_cmsg->frag_offset is the offset into
|
||||
* the dmabuf where the frag starts.
|
||||
*
|
||||
* dmabuf_cmsg->frag_size is the size of the
|
||||
* frag.
|
||||
*
|
||||
* dmabuf_cmsg->frag_token is a token used to
|
||||
* refer to this frag for later freeing.
|
||||
*/
|
||||
|
||||
struct dmabuf_token token;
|
||||
token.token_start = dmabuf_cmsg->frag_token;
|
||||
token.token_count = 1;
|
||||
continue;
|
||||
}
|
||||
|
||||
if (cm->cmsg_type == SCM_DEVMEM_LINEAR)
|
||||
/* Frag landed in linear buffer.
|
||||
*
|
||||
* dmabuf_cmsg->frag_size is the size of the
|
||||
* frag.
|
||||
*/
|
||||
continue;
|
||||
|
||||
}
|
||||
|
||||
Applications may receive 2 cmsgs:
|
||||
|
||||
- SCM_DEVMEM_DMABUF: this indicates the fragment landed in the dmabuf indicated
|
||||
by dmabuf_id.
|
||||
|
||||
- SCM_DEVMEM_LINEAR: this indicates the fragment landed in the linear buffer.
|
||||
This typically happens when the NIC is unable to split the packet at the
|
||||
header boundary, such that part (or all) of the payload landed in host
|
||||
memory.
|
||||
|
||||
Applications may receive no SO_DEVMEM_* cmsgs. That indicates non-devmem,
|
||||
regular TCP data that landed on an RX queue not bound to a dmabuf.
|
||||
|
||||
|
||||
Freeing frags
|
||||
-------------
|
||||
|
||||
Frags received via SCM_DEVMEM_DMABUF are pinned by the kernel while the user
|
||||
processes the frag. The user must return the frag to the kernel via
|
||||
SO_DEVMEM_DONTNEED::
|
||||
|
||||
ret = setsockopt(client_fd, SOL_SOCKET, SO_DEVMEM_DONTNEED, &token,
|
||||
sizeof(token));
|
||||
|
||||
The user must ensure the tokens are returned to the kernel in a timely manner.
|
||||
Failure to do so will exhaust the limited dmabuf that is bound to the RX queue
|
||||
and will lead to packet drops.
|
||||
|
||||
|
||||
Implementation & Caveats
|
||||
========================
|
||||
|
||||
Unreadable skbs
|
||||
---------------
|
||||
|
||||
Devmem payloads are inaccessible to the kernel processing the packets. This
|
||||
results in a few quirks for payloads of devmem skbs:
|
||||
|
||||
- Loopback is not functional. Loopback relies on copying the payload, which is
|
||||
not possible with devmem skbs.
|
||||
|
||||
- Software checksum calculation fails.
|
||||
|
||||
- TCP Dump and bpf can't access devmem packet payloads.
|
||||
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
More realistic example code can be found in the kernel source under
|
||||
``tools/testing/selftests/net/ncdevmem.c``
|
||||
|
||||
ncdevmem is a devmem TCP netcat. It works very similarly to netcat, but
|
||||
receives data directly into a udmabuf.
|
||||
|
||||
To run ncdevmem, you need to run it on a server on the machine under test, and
|
||||
you need to run netcat on a peer to provide the TX data.
|
||||
|
||||
ncdevmem has a validation mode as well that expects a repeating pattern of
|
||||
incoming data and validates it as such. For example, you can launch
|
||||
ncdevmem on the server by::
|
||||
|
||||
ncdevmem -s <server IP> -c <client IP> -f eth1 -d 3 -n 0000:06:00.0 -l \
|
||||
-p 5201 -v 7
|
||||
|
||||
On client side, use regular netcat to send TX data to ncdevmem process
|
||||
on the server::
|
||||
|
||||
yes $(echo -e \\x01\\x02\\x03\\x04\\x05\\x06) | \
|
||||
tr \\n \\0 | head -c 5G | nc <server IP> 5201 -p 5201
|
@ -49,6 +49,7 @@ Contents:
|
||||
cdc_mbim
|
||||
dccp
|
||||
dctcp
|
||||
devmem
|
||||
dns_resolver
|
||||
driver
|
||||
eql
|
||||
|
Loading…
Reference in New Issue
Block a user