102d1404c3
Document new behavior when the number of frags passed is too big. Signed-off-by: Mina Almasry <almasrymina@google.com> Link: https://patch.msgid.link/20241107210331.3044434-2-almasrymina@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
279 lines
8.0 KiB
ReStructuredText
279 lines
8.0 KiB
ReStructuredText
.. SPDX-License-Identifier: GPL-2.0
|
|
|
|
=================
|
|
Device Memory TCP
|
|
=================
|
|
|
|
|
|
Intro
|
|
=====
|
|
|
|
Device memory TCP (devmem TCP) enables receiving data directly into device
|
|
memory (dmabuf). The feature is currently implemented for TCP sockets.
|
|
|
|
|
|
Opportunity
|
|
-----------
|
|
|
|
A large number of data transfers have device memory as the source and/or
|
|
destination. Accelerators drastically increased the prevalence of such
|
|
transfers. Some examples include:
|
|
|
|
- Distributed training, where ML accelerators, such as GPUs on different hosts,
|
|
exchange data.
|
|
|
|
- Distributed raw block storage applications transfer large amounts of data with
|
|
remote SSDs. Much of this data does not require host processing.
|
|
|
|
Typically the Device-to-Device data transfers in the network are implemented as
|
|
the following low-level operations: Device-to-Host copy, Host-to-Host network
|
|
transfer, and Host-to-Device copy.
|
|
|
|
The flow involving host copies is suboptimal, especially for bulk data transfers,
|
|
and can put significant strains on system resources such as host memory
|
|
bandwidth and PCIe bandwidth.
|
|
|
|
Devmem TCP optimizes this use case by implementing socket APIs that enable
|
|
the user to receive incoming network packets directly into device memory.
|
|
|
|
Packet payloads go directly from the NIC to device memory.
|
|
|
|
Packet headers go to host memory and are processed by the TCP/IP stack
|
|
normally. The NIC must support header split to achieve this.
|
|
|
|
Advantages:
|
|
|
|
- Alleviate host memory bandwidth pressure, compared to existing
|
|
network-transfer + device-copy semantics.
|
|
|
|
- Alleviate PCIe bandwidth pressure, by limiting data transfer to the lowest
|
|
level of the PCIe tree, compared to the traditional path which sends data
|
|
through the root complex.
|
|
|
|
|
|
More Info
|
|
---------
|
|
|
|
slides, video
|
|
https://netdevconf.org/0x17/sessions/talk/device-memory-tcp.html
|
|
|
|
patchset
|
|
[PATCH net-next v24 00/13] Device Memory TCP
|
|
https://lore.kernel.org/netdev/20240831004313.3713467-1-almasrymina@google.com/
|
|
|
|
|
|
Interface
|
|
=========
|
|
|
|
|
|
Example
|
|
-------
|
|
|
|
tools/testing/selftests/net/ncdevmem.c:do_server shows an example of setting up
|
|
the RX path of this API.
|
|
|
|
|
|
NIC Setup
|
|
---------
|
|
|
|
Header split, flow steering, & RSS are required features for devmem TCP.
|
|
|
|
Header split is used to split incoming packets into a header buffer in host
|
|
memory, and a payload buffer in device memory.
|
|
|
|
Flow steering & RSS are used to ensure that only flows targeting devmem land on
|
|
an RX queue bound to devmem.
|
|
|
|
Enable header split & flow steering::
|
|
|
|
# enable header split
|
|
ethtool -G eth1 tcp-data-split on
|
|
|
|
|
|
# enable flow steering
|
|
ethtool -K eth1 ntuple on
|
|
|
|
Configure RSS to steer all traffic away from the target RX queue (queue 15 in
|
|
this example)::
|
|
|
|
ethtool --set-rxfh-indir eth1 equal 15
|
|
|
|
|
|
The user must bind a dmabuf to any number of RX queues on a given NIC using
|
|
the netlink API::
|
|
|
|
/* Bind dmabuf to NIC RX queue 15 */
|
|
struct netdev_queue *queues;
|
|
queues = malloc(sizeof(*queues) * 1);
|
|
|
|
queues[0]._present.type = 1;
|
|
queues[0]._present.idx = 1;
|
|
queues[0].type = NETDEV_RX_QUEUE_TYPE_RX;
|
|
queues[0].idx = 15;
|
|
|
|
*ys = ynl_sock_create(&ynl_netdev_family, &yerr);
|
|
|
|
req = netdev_bind_rx_req_alloc();
|
|
netdev_bind_rx_req_set_ifindex(req, 1 /* ifindex */);
|
|
netdev_bind_rx_req_set_dmabuf_fd(req, dmabuf_fd);
|
|
__netdev_bind_rx_req_set_queues(req, queues, n_queue_index);
|
|
|
|
rsp = netdev_bind_rx(*ys, req);
|
|
|
|
dmabuf_id = rsp->dmabuf_id;
|
|
|
|
|
|
The netlink API returns a dmabuf_id: a unique ID that refers to this dmabuf
|
|
that has been bound.
|
|
|
|
The user can unbind the dmabuf from the netdevice by closing the netlink socket
|
|
that established the binding. We do this so that the binding is automatically
|
|
unbound even if the userspace process crashes.
|
|
|
|
Note that any reasonably well-behaved dmabuf from any exporter should work with
|
|
devmem TCP, even if the dmabuf is not actually backed by devmem. An example of
|
|
this is udmabuf, which wraps user memory (non-devmem) in a dmabuf.
|
|
|
|
|
|
Socket Setup
|
|
------------
|
|
|
|
The socket must be flow steered to the dmabuf bound RX queue::
|
|
|
|
ethtool -N eth1 flow-type tcp4 ... queue 15
|
|
|
|
|
|
Receiving data
|
|
--------------
|
|
|
|
The user application must signal to the kernel that it is capable of receiving
|
|
devmem data by passing the MSG_SOCK_DEVMEM flag to recvmsg::
|
|
|
|
ret = recvmsg(fd, &msg, MSG_SOCK_DEVMEM);
|
|
|
|
Applications that do not specify the MSG_SOCK_DEVMEM flag will receive an EFAULT
|
|
on devmem data.
|
|
|
|
Devmem data is received directly into the dmabuf bound to the NIC in 'NIC
|
|
Setup', and the kernel signals such to the user via the SCM_DEVMEM_* cmsgs::
|
|
|
|
for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
|
|
if (cm->cmsg_level != SOL_SOCKET ||
|
|
(cm->cmsg_type != SCM_DEVMEM_DMABUF &&
|
|
cm->cmsg_type != SCM_DEVMEM_LINEAR))
|
|
continue;
|
|
|
|
dmabuf_cmsg = (struct dmabuf_cmsg *)CMSG_DATA(cm);
|
|
|
|
if (cm->cmsg_type == SCM_DEVMEM_DMABUF) {
|
|
/* Frag landed in dmabuf.
|
|
*
|
|
* dmabuf_cmsg->dmabuf_id is the dmabuf the
|
|
* frag landed on.
|
|
*
|
|
* dmabuf_cmsg->frag_offset is the offset into
|
|
* the dmabuf where the frag starts.
|
|
*
|
|
* dmabuf_cmsg->frag_size is the size of the
|
|
* frag.
|
|
*
|
|
* dmabuf_cmsg->frag_token is a token used to
|
|
* refer to this frag for later freeing.
|
|
*/
|
|
|
|
struct dmabuf_token token;
|
|
token.token_start = dmabuf_cmsg->frag_token;
|
|
token.token_count = 1;
|
|
continue;
|
|
}
|
|
|
|
if (cm->cmsg_type == SCM_DEVMEM_LINEAR)
|
|
/* Frag landed in linear buffer.
|
|
*
|
|
* dmabuf_cmsg->frag_size is the size of the
|
|
* frag.
|
|
*/
|
|
continue;
|
|
|
|
}
|
|
|
|
Applications may receive 2 cmsgs:
|
|
|
|
- SCM_DEVMEM_DMABUF: this indicates the fragment landed in the dmabuf indicated
|
|
by dmabuf_id.
|
|
|
|
- SCM_DEVMEM_LINEAR: this indicates the fragment landed in the linear buffer.
|
|
This typically happens when the NIC is unable to split the packet at the
|
|
header boundary, such that part (or all) of the payload landed in host
|
|
memory.
|
|
|
|
Applications may receive no SO_DEVMEM_* cmsgs. That indicates non-devmem,
|
|
regular TCP data that landed on an RX queue not bound to a dmabuf.
|
|
|
|
|
|
Freeing frags
|
|
-------------
|
|
|
|
Frags received via SCM_DEVMEM_DMABUF are pinned by the kernel while the user
|
|
processes the frag. The user must return the frag to the kernel via
|
|
SO_DEVMEM_DONTNEED::
|
|
|
|
ret = setsockopt(client_fd, SOL_SOCKET, SO_DEVMEM_DONTNEED, &token,
|
|
sizeof(token));
|
|
|
|
The user must ensure the tokens are returned to the kernel in a timely manner.
|
|
Failure to do so will exhaust the limited dmabuf that is bound to the RX queue
|
|
and will lead to packet drops.
|
|
|
|
The user must pass no more than 128 tokens, with no more than 1024 total frags
|
|
among the token->token_count across all the tokens. If the user provides more
|
|
than 1024 frags, the kernel will free up to 1024 frags and return early.
|
|
|
|
The kernel returns the number of actual frags freed. The number of frags freed
|
|
can be less than the tokens provided by the user in case of:
|
|
|
|
(a) an internal kernel leak bug.
|
|
(b) the user passed more than 1024 frags.
|
|
|
|
Implementation & Caveats
|
|
========================
|
|
|
|
Unreadable skbs
|
|
---------------
|
|
|
|
Devmem payloads are inaccessible to the kernel processing the packets. This
|
|
results in a few quirks for payloads of devmem skbs:
|
|
|
|
- Loopback is not functional. Loopback relies on copying the payload, which is
|
|
not possible with devmem skbs.
|
|
|
|
- Software checksum calculation fails.
|
|
|
|
- TCP Dump and bpf can't access devmem packet payloads.
|
|
|
|
|
|
Testing
|
|
=======
|
|
|
|
More realistic example code can be found in the kernel source under
|
|
``tools/testing/selftests/net/ncdevmem.c``
|
|
|
|
ncdevmem is a devmem TCP netcat. It works very similarly to netcat, but
|
|
receives data directly into a udmabuf.
|
|
|
|
To run ncdevmem, you need to run it on a server on the machine under test, and
|
|
you need to run netcat on a peer to provide the TX data.
|
|
|
|
ncdevmem has a validation mode as well that expects a repeating pattern of
|
|
incoming data and validates it as such. For example, you can launch
|
|
ncdevmem on the server by::
|
|
|
|
ncdevmem -s <server IP> -c <client IP> -f eth1 -d 3 -n 0000:06:00.0 -l \
|
|
-p 5201 -v 7
|
|
|
|
On client side, use regular netcat to send TX data to ncdevmem process
|
|
on the server::
|
|
|
|
yes $(echo -e \\x01\\x02\\x03\\x04\\x05\\x06) | \
|
|
tr \\n \\0 | head -c 5G | nc <server IP> 5201 -p 5201
|