Performance profiling and design choices of an RDMA implementation using FPGA devices

RDMA communication is an efficient choice for many applications, such as data acquisition systems, data center networking and other networking applications where high bandwidth and low latency are necessary. RDMA can be implemented using a wide array of options, which need to be tailored to the use case at hand in order to obtain optimal results. This work investigates several such aspects for implementing RDMA on FPGA devices: the effects of using multiple simultaneous connections, of using different transport functions such as RDMA Write and RDMA Send, and of different communication models such as sending individual bursts or continuous streams of data.


Background
RDMA stands for Remote Direct Memory Access. It is a method of transferring data between devices with as little CPU involvement as possible. It offers several transport services, such as Reliable Connection, Reliable Datagram, Unreliable Connection and Unreliable Datagram. There are also several transport functions, such as Send/Receive, Write and Read. When using RDMA Write, only the sender does the work, while the receiver is passive, thus consuming no CPU resources. In this work the Reliable Connection transport service, which is similar to TCP/IP, and the RDMA Write transport function are used.
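To illustrate the one-sided nature of RDMA Write, the sketch below shows how a sender could post a single RDMA Write work request using the standard libibverbs API. It assumes a queue pair qp that is already connected, a registered memory region mr covering the local buffer, and the remote buffer address and rkey exchanged during connection setup; these names are illustrative and are not taken from the implementation described here.

#include <infiniband/verbs.h>
#include <cstdint>

// Post one one-sided RDMA Write; the remote CPU is not involved in the transfer.
// Assumes qp is a connected Reliable Connection queue pair and mr covers local_buf.
int post_rdma_write(ibv_qp *qp, ibv_mr *mr, void *local_buf, uint32_t len,
                    uint64_t remote_addr, uint32_t rkey)
{
    ibv_sge sge{};
    sge.addr   = reinterpret_cast<uintptr_t>(local_buf);
    sge.length = len;
    sge.lkey   = mr->lkey;

    ibv_send_wr wr{};
    wr.opcode              = IBV_WR_RDMA_WRITE;   // one-sided write operation
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;   // ask for a completion entry
    wr.wr.rdma.remote_addr = remote_addr;         // target address on the receiver
    wr.wr.rdma.rkey        = rkey;                // remote key exchanged at setup

    ibv_send_wr *bad = nullptr;
    return ibv_post_send(qp, &wr, &bad);          // 0 on success
}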
In previous work [1, 2], all the tests have been run only on individual bursts of data, each burst having a size equal to message size × message count. However, in a production setting this is not enough. In order to obtain meaningful test results, several other factors need to be taken into account. For example: what happens when multiple simultaneous clients (PCs running the receiver software) and connections (instances of the receiver software) are used? And what happens when data is sent as a continuous flow rather than a single burst, taking into account how the clients' consumption of data influences the transfer?

Design choices
The clients are all implemented in software since, in the current setup, the receiver is always a PC. For the multiple simultaneous clients and connections tests, senders have been implemented both in software and in hardware. The hardware senders, which currently work only for sending individual bursts, run on Xilinx Alveo boards. For continuous flows of data, the current implementation of the sender works only in software.
Continuous flow transfers cannot be safely implemented using RDMA Write unless clients account for the consumption of the received data. First, as seen from the results in [2], the FPGA RDMA implementation can reach bandwidths close to the theoretical maximum of a 100 Gb/s link. That is 12.5 GB/s, which means that 125 GB of memory could be filled in 10 seconds, so there will never be enough memory available to store all the data received over a long period of time. One solution is to store the incoming data in a circular buffer, which puts a hard limit on the memory used by the receiver while still allowing arbitrary amounts of data to be received. This introduces a second challenge: the rate at which data is received may be higher than the rate at which it is consumed by the client. In order not to overwrite data that has not been consumed yet, a backpressure mechanism is needed on top of the circular buffer.
The backpressure mechanism stops the transfer when the circular buffer usage goes above a configured threshold and restarts it when the usage drops again. In order to avoid oscillating around a single threshold, a system with a pair of upper and lower thresholds has been implemented.
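A minimal sketch of this hysteresis logic is given below. It assumes that buffer occupancy is tracked as a count of slots that have been written but not yet consumed; the class and member names are illustrative and not taken from the actual implementation.

#include <cstddef>
#include <mutex>

// Hysteresis-style backpressure on top of a circular buffer.
// Occupancy is the number of slots written but not yet consumed.
class BackpressureGauge {
public:
    BackpressureGauge(std::size_t capacity, double upper, double lower)
        : upper_(static_cast<std::size_t>(capacity * upper)),   // e.g. 0.90
          lower_(static_cast<std::size_t>(capacity * lower)) {} // e.g. 0.85

    // Called when a new message has been written into the buffer.
    // Returns true when backpressure has just been asserted (sender must stop).
    bool on_write() {
        std::lock_guard<std::mutex> lock(m_);
        ++occupancy_;
        if (!active_ && occupancy_ >= upper_) { active_ = true; return true; }
        return false;
    }

    // Called when a message has been consumed by the client.
    // Returns true when backpressure has just been released (sender may restart).
    bool on_consume() {
        std::lock_guard<std::mutex> lock(m_);
        --occupancy_;
        if (active_ && occupancy_ <= lower_) { active_ = false; return true; }
        return false;
    }

private:
    const std::size_t upper_;   // stop threshold
    const std::size_t lower_;   // restart threshold
    std::size_t occupancy_ = 0;
    bool active_ = false;
    std::mutex m_;
};

With the 90/85 configuration on a buffer of capacity 100, for example, the transfer would be stopped when 90 slots are in use and restarted only once occupancy drops back to 85, so the sender is not toggled on and off around a single value.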
Because RDMA Write is used, the client never knows, on its own, when data has been received. Thus, an out-of-band mechanism for notifying the client that data has been transmitted is needed. This out-of-band signaling mechanism is implemented using a TCP/IP socket. The choice of a TCP/IP socket has been made because this signaling mechanism is not on the time-critical path of the data transfer, and using TCP/IP instead of something more exotic makes the implementation easier.
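As a sketch, each notification could be a small fixed-size record sent over the TCP/IP control socket after an RDMA Write completes, telling the client where in its buffer new data has landed. The record layout and function below are hypothetical and only illustrate the idea; the actual message format of the implementation may differ.

#include <cstdint>
#include <sys/socket.h>

// Hypothetical out-of-band notification: describes where in the client's
// circular buffer new data has just been written by an RDMA Write.
struct WriteNotification {
    uint64_t offset;   // offset of the new data inside the receive buffer
    uint32_t length;   // number of bytes written
} __attribute__((packed));

// Send one notification over the already connected TCP control socket.
// A production version would also handle partial sends and byte ordering.
bool notify_client(int tcp_fd, uint64_t offset, uint32_t length)
{
    WriteNotification n{offset, length};
    return send(tcp_fd, &n, sizeof(n), 0) == static_cast<ssize_t>(sizeof(n));
}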
The receiver runs two threads, synchronized using a semaphore: the first thread receives the data write notifications, posts the semaphore and activates the backpressure state when the circular buffer usage goes over the upper threshold; the second thread waits for the semaphore, reads data and deactivates the backpressure state when the circular buffer usage goes below the lower threshold.
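A condensed sketch of this thread structure, reusing the hysteresis gauge from above, is shown below. Receiving a notification, consuming a message and signaling backpressure back to the sender are reduced to placeholder functions, since those details are specific to the actual implementation.

#include <semaphore>

// Placeholders for implementation-specific operations.
bool receive_write_notification();   // blocks on the TCP control socket
void consume_one_message();          // hand one message to the client application
void send_backpressure(bool stop);   // tell the sender to stop or restart

std::counting_semaphore<> data_ready{0};

// First receiver thread: handles write notifications and asserts backpressure.
void notification_thread(BackpressureGauge &gauge) {
    while (receive_write_notification()) {
        if (gauge.on_write())          // usage crossed the upper threshold
            send_backpressure(true);
        data_ready.release();          // wake up the consuming thread
    }
}

// Second receiver thread: consumes data and releases backpressure.
void consumer_thread(BackpressureGauge &gauge) {
    for (;;) {
        data_ready.acquire();          // wait until at least one message arrived
        consume_one_message();
        if (gauge.on_consume())        // usage dropped below the lower threshold
            send_backpressure(false);
    }
}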
The sender runs three threads, two of them synchronized using a semaphore: the first thread sends data and posts the semaphore; the second thread waits for the semaphore, sends data and the write notifications; the third thread listens for and receives backpressure commands.
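On the sender side, the third thread only needs to toggle a flag that the two sending threads consult before posting more data. A minimal sketch, assuming a hypothetical one-byte stop/go command arriving on the TCP control socket, could look as follows.

#include <atomic>
#include <cstdint>
#include <sys/socket.h>

std::atomic<bool> paused{false};   // checked by the two sending threads

// Third sender thread: listens for backpressure commands from the client.
// Hypothetical encoding: 1 = stop sending, 0 = resume sending.
void backpressure_listener(int tcp_fd)
{
    uint8_t cmd = 0;
    while (recv(tcp_fd, &cmd, sizeof(cmd), MSG_WAITALL) == sizeof(cmd))
        paused.store(cmd == 1);
}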

Development setup
The development setup was based on Xilinx Alveo U50 cards [3], Nvidia Mellanox ConnectX-5 cards [4] and a DELL Networking Z9264F-ON RDMA-capable switch. One machine, used as the sender, has a Xilinx Alveo U50 card installed for the FPGA RDMA sender and an Nvidia Mellanox ConnectX-5 board installed for out-of-band control. Two other machines, used as the clients/receivers, each have an Nvidia Mellanox ConnectX-5 board installed, which is used both for out-of-band control and RDMA data transfer. All of them must be connected to the RDMA-capable switch.

Design consequences
Tests have been run with 8192×100, 8192×1000, 32768×100 and 32768×1000 message size [bytes] × message count bursts. Using message sizes of less than 8192 bytes or message counts of less than 100 does not allow for fully utilizing the available bandwidth.

The circular buffer capacities used in the tests have been 10, 100 and 1000. A capacity of 10 has proven to be too small: no matter what other parameters were used, buffer overruns always occurred. Capacities of 100 or 1000 performed without any problems.

Continuous flow, multiple connections (software)
All tests have been run with a single sender. Each individual connection has independent flow control. The 1 connection/1 client tests have been run with 8192×100, 8192×1000, 32768×100 and 32768×1000 bursts. The 2 connections/1 client, 2 connections/2 clients, 4 connections/1 client and 4 connections/2 clients tests have been run only with 8192×1000 bursts. Two threshold configurations have been tested: 90/85 (90% buffer usage for the upper threshold and 85% buffer usage for the lower threshold) and 80/75. There was no noticeable difference in performance between the two configurations. From the tests that were run, the only important parameter seems to be the upper threshold, which needs to be chosen in such a way that no capacity overruns happen.
On the PCs in the development setup, if the sender sends data at less than approximately 4.2 GB/s, the receiver is able to read it fast enough that backpressure is never triggered. If backpressure is not triggered, the send and receive bandwidths are almost equal. If backpressure is triggered, the send bandwidth is roughly double the receive bandwidth, see figure 1. If more than one connection is used, on one or multiple clients, then the fraction of the total available bandwidth of 100 Gb/s allotted to each connection is small enough that backpressure never activates; see figure 2 for the results of the 4 connections/1 client test case.

Multiple connections (hardware)
The hardware implementation of the multiple connections feature was initially developed to run individual burst tests. As a result, independent control of each connection has not been implemented yet. The tests have been run with message sizes from 128 bytes to 512 megabytes and message counts of 10, 100 and 1000. Two references have been used: the bandwidth reported by the ib_write_bw test from the Perftest package and the bandwidth measured when running the test with a single connection.
For the 10 and 100 message count tests, two connection configurations have been used: tests with 2 connections and tests with 4 connections. The tests with 2 connections have in turn been run in two configurations: both connections on the same client machine, and two client machines with one connection each. The tests with 4 connections have been run in another two configurations: all 4 connections on the same client machine, and two client machines with 2 connections each. In the case of the 1000 message count tests, only the 2 connection tests have been run. All of the 4 connection test setups currently overload the resources of the FPGA RDMA core implementation.
Finally, the theoretical maximum bandwidth of the links used is 100 Gb/s (i.e. 12.5 GB/s). A software implementation, both ours and what can be measured with Perftest, can reach up to 10.5 GB/s. The hardware implementation has been measured to reach up to 11.54 GB/s with a single connection and up to 11.98 GB/s in total with multiple connections.