To spray or not to spray: Validation

[The test results in this article were updated after initial publication]

Introduction

The previous article described the potential option of spraying RDMA packets across an Ethernet fabric. Packet spraying is a promising technique to ensure even utilization of the fabric links, but it may negatively impact applications. This article presents spraying validation results with shipping NVIDIA ConnectX-6 Dx NICs, which support out-of-order packet processing for the RoCEv2 Reliable Connected transport service. This NVIDIA extension is called "Adaptive Routing".

In summary, spraying works even in case of heavy re-ordering in the Ethernet fabric.

Let’s first define where and how re-ordering can occur and what level of re-ordering we can expect in production.

What kind of re-ordering can we expect?

The figure below shows the concept of the packet spraying solution – packets take different paths through the fabric and the path selection is done at the leaf randomly, packet by packet.

Packets of the same flow are directed into different queues, where some other packets may reside. As a result, some packets may stay longer in the fabric and subsequent packets may reach the endpoint earlier, breaking the order.

The diagram below shows one example, where a packet is delayed behind another packet in a spine port queue – re-ordering within the flow occurs as a result.

Load balancing is random; therefore, over longer periods of time all the paths in the system are loaded evenly, but several packets may accidentally take the same path during a very short period (microseconds). The number of packets that may be directed to the same queue is a function of the incoming and outgoing port count:

  • At the leaf device there are typically 16 or 32 GPU ports sending traffic to 16 or 32 uplinks.

  • In a 3-stage Clos network, many leaf devices (typically 32 to 64) may pick the same spine device, causing congestion towards the target leaf.

While it is possible to mathematically determine, with a certain confidence, the number of packets that may show up on a given uplink, it is fair to assume that this number is a fraction of the incoming port count.
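As a rough illustration (not part of the original test), a small Monte Carlo sketch along the following lines can estimate this number, assuming 32 GPU ports each emit one packet in a short window and every packet independently picks one of 32 uplinks at random:

    import random
    from collections import Counter

    PORTS = 32          # GPU-facing ports, each sending one packet in a short window
    UPLINKS = 32        # fabric uplinks a packet can be sprayed onto
    TRIALS = 100_000

    worst_case = Counter()
    for _ in range(TRIALS):
        # each of the 32 packets independently picks a random uplink
        picks = Counter(random.randrange(UPLINKS) for _ in range(PORTS))
        worst_case[max(picks.values())] += 1

    # distribution of the deepest uplink queue in a single spraying round
    for depth, count in sorted(worst_case.items()):
        print(f"at most {depth} packets on one uplink: {count / TRIALS:.1%} of rounds")

In this toy model the deepest queue rarely exceeds a handful of packets – a small fraction of the 32 incoming ports, consistent with the assumption above.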

Our main focus is out-of-order packets, i.e., packets with lower sequence numbers arriving after packets with higher sequence numbers.

In this test, we initially verified the scenario where packets arriving out of order are separated by 1 to 20 other packets of the same flow. Two more tests verified scenarios where sequential out-of-order packets were separated by more than 200 and 400 packets.

The capture below shows the case where the distance (in packets) between two subsequent unordered packets is 18.

The distance between the unordered packets is a good measure of re-ordering. 

The following plot shows the distance distribution of the packets we used in the test, expressed as a percentage of the total packet count.

The bell-shaped curve in the middle is not intentional – it is a property of the test bed. In this scenario, 25.03% of the packets arrived at the target NIC out of order.
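As one possible way to compute such statistics from a capture, consider the sketch below. It assumes the packet sequence numbers (PSNs) of one flow are available in arrival order and defines a packet's distance as the number of positions it arrived after its in-order slot; this is an illustration, not the exact tooling used in the test:

    from collections import Counter

    def reorder_stats(psns_in_arrival_order):
        """Out-of-order fraction and distance histogram for one flow."""
        in_order_rank = {psn: rank for rank, psn in enumerate(sorted(psns_in_arrival_order))}
        distances = Counter()
        total = len(psns_in_arrival_order)
        for arrival_index, psn in enumerate(psns_in_arrival_order):
            distance = arrival_index - in_order_rank[psn]
            if distance > 0:             # packet showed up later than its in-order slot
                distances[distance] += 1
        late = sum(distances.values())
        print(f"{late / total:.2%} of packets arrived out of order")
        for d, n in sorted(distances.items()):
            print(f"distance {d}: {n / total:.2%} of all packets")

    # toy example: the packet with PSN 3 arrives two positions late
    reorder_stats([1, 2, 4, 5, 3, 6, 7])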

How did we achieve this? We were using the most versatile router on the planet: Juniper Networks MX.

Juniper MX Router for the Packet Re-order Emulation

Juniper MX supports a large variety of use cases: residential and business edge services, metro, enterprise routing, and Virtual Private Cloud routing, just to name a few. In the AI/ML context, we have demonstrated in-network aggregation support on MX as well. But in this test, we use the MX in an unusual role: a configurable packet re-ordering device.

Note that our actual AI/ML fabric design is based on PTX routers and QFX switches.

The diagram below shows the traffic flow through the router (from right to left).

Packet re-ordering is emulated using the following technique:

  • RDMA_WRITE_ONLY packets are separated into two groups: packets following the normal (short-delay) path and packets following a longer path. The paths are chosen randomly, with the same 50% probability. Other, non-RDMA_WRITE_ONLY packets are sent over a third (direct) path.

  • Extra delay on one of the paths is implemented by chaining 8 policer instructions in a filter, which adds several microseconds to the packet processing time. A very high policing rate is chosen to avoid any drops – the policer has no effect on packet processing other than the added delay. By changing the number of policers in the chain we can manipulate the distance distribution; a simplified model of this mechanism is sketched below.
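To make the emulation concrete, here is a simplified, purely illustrative model of it: every RDMA_WRITE_ONLY packet flips a fair coin between the short path and the long path, the long path adds a fixed extra delay, and the receiver observes packets in arrival-time order. The interval and delay values are assumptions chosen only to produce visible re-ordering, not measured values from the MX:

    import random

    PACKET_INTERVAL_NS = 340     # ~4 KB packet at ~100 Gbps (illustrative)
    EXTRA_DELAY_NS = 5_000       # assumed extra latency of the policer-chain path
    PACKETS = 100_000

    arrivals = []
    for seq in range(PACKETS):
        send_time = seq * PACKET_INTERVAL_NS
        # 50/50 random choice between the short path and the longer path
        delay = EXTRA_DELAY_NS if random.random() < 0.5 else 0
        arrivals.append((send_time + delay, seq))

    arrivals.sort()              # the NIC sees packets in arrival-time order
    highest_seen = -1
    out_of_order = 0
    for _, seq in arrivals:
        if seq < highest_seen:   # a lower sequence number after a higher one
            out_of_order += 1
        highest_seen = max(highest_seen, seq)

    print(f"{out_of_order / PACKETS:.1%} of packets arrived out of order")

Tuning EXTRA_DELAY_NS (or, on the real router, the number of chained policers) changes both the out-of-order fraction and the distance distribution.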

Besides the fabric re-ordering emulation device, we also need the actual NIC and the server.

The NIC and the server

We used an NVIDIA ConnectX-6 Dx NIC for our tests, connected to the emulated fabric through 100GE interfaces. The RoCEv2 Adaptive Routing functionality requires firmware version 22 or higher, hence we picked a NIC supported by this firmware.

Here is the list of the hardware used:

  • ConnectX-6 Dx NIC (P/N MCX623106AC-CDA_Ax), firmware version 22.38.1900.

  • Supermicro Server, with Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz, Ubuntu 22.04.3 LTS.

Test Results

The tests were performed using the ib_write_bw utility from the perftest package. The tool reports throughput in gigabits per second. Note that a specific processor core was picked to ensure that the processor core and the NIC are on the same NUMA node.
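A sketch of how such NUMA-local pinning can be automated is shown below; the device name mlx5_0, the 600-second duration, and the core-selection logic are assumptions for illustration, not the exact commands used in this test:

    import subprocess

    DEV = "mlx5_0"                  # assumed name of the ConnectX-6 Dx IB device

    # The kernel exposes the NIC's NUMA node and each node's CPU list via sysfs.
    with open(f"/sys/class/infiniband/{DEV}/device/numa_node") as f:
        node = max(int(f.read().strip()), 0)     # -1 means "no NUMA info"
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        cpulist = f.read().strip()               # e.g. "0-11,24-35"
    core = cpulist.split(",")[0].split("-")[0]   # first core local to the NIC

    # Run ib_write_bw pinned to that core: 4096-byte messages, 600-second
    # duration, throughput reported in Gb/s. On the client side, append the
    # server's IP address to the argument list.
    subprocess.run(
        ["taskset", "-c", core,
         "ib_write_bw", "-d", DEV, "-s", "4096", "-D", "600", "--report_gbits"],
        check=True,
    )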

Here is the sample output:

First, we ran the test in 3 different scenarios, for 10 minutes each, and received results that match the theoretical maximum performance:

  • Adaptive Routing : Enabled. Spraying : Disabled. Throughput : 97.66 Gbps.

  • Adaptive Routing : Enabled. Spraying : Enabled. Throughput : 97.66 Gbps.

  • Adaptive Routing : Disabled. Spraying : Disabled. Throughput : 98.01 Gbps.

For a packet carrying 4096 bytes of payload, the L2 and L3 overhead is 78 bytes, and another 20 bytes are added at L1: 4096 / 4194 * 100 Gbps = 97.66332 Gbps.
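The same calculation, spelled out (the 78-byte and 20-byte overheads are the figures quoted above):

    payload = 4096                   # RDMA_WRITE_ONLY payload, bytes
    l2_l3_overhead = 78              # Ethernet/IP/UDP/IB transport headers, per the article
    l1_overhead = 20                 # preamble + inter-frame gap
    wire_bytes = payload + l2_l3_overhead + l1_overhead    # 4194 bytes on the wire

    goodput_gbps = payload / wire_bytes * 100
    print(f"{goodput_gbps:.3f} Gbps")    # ~97.663 Gbps, matching the measured 97.66 Gbps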

Note that if Adaptive Routing is disabled, the network overhead decreases slightly because not every packet carries an RDMA Extended Transport Header (RETH), and the performance improves by 0.34%.

Just for comparison, the next test verifies the impact of packet re-ordering when adaptive routing is disabled. In this test, all packets traversing the MX router were re-ordered.

As expected, the impact is significant: the performance drops by more than 80x (probably to the point where no re-ordering is actually seen, due to the increased gaps between packets).

  • Adaptive Routing : Disabled. Spraying : Enabled. Throughput : 0.19 Gbps.

The result confirms that one cannot simply spray all packets at random – only eligible packets must be sprayed, and only if the endpoints support out-of-order packet reception.

Finally, two more tests were performed to exercise how much re-ordering the NIC can tolerate (in a default NIC configuration).

In the first test, the number of paths was increased to 8 and delays were adjusted using the following policer / counter combinations (more policer / counter instructions increase the delay):

  • Path 1 : 0 policers / counters

  • Path 2 : 8 policers / counters

  • Path 3 : 8 policers / counters

  • Path 4 : 8 policers / counters

  • Path 5 : 16 policers / counters

  • Path 6 : 32 policers / counters

  • Path 7 : 64 policers / counters

  • Path 8 : 128 policers / counters

The resulting distance distribution is shown below.

The final test increased the number of policers / counters for path 2 from 8 to 256; the maximum distance increased further, as shown below.

The performance results are shown below:

  • Profile with 200+ distance. Adaptive Routing : Enabled. Spraying : Enabled. Throughput : 97.66 Gbps.

  • Profile with 400+ distance. Adaptive Routing : Enabled. Spraying : Enabled. Throughput : 87.58 Gbps.

As seen from the tests, in a scenario with very heavy re-ordering, the performance may degrade. Note that no drops were registered; most likely the rate drops when the number of unacknowledged packets reaches the configured / supported maximum.

Overall, the Adaptive Routing and spraying combination results are very encouraging – NICs that have been shipping for many years perform great under very heavy re-ordering when deployed against the Juniper AI/ML fabric emulation!

But is there a downside?

What is the cost of doing it?

Well, first, RDMA packet spraying is not a new technique – it is enabled by default in Nvidia InfiniBand deployments. It has been in use for a while, without people even noticing it.

But we tried to understand the negative impact, and here is what we have found so far, besides the small 0.34% reduction in throughput: if out-of-order packet reception (Adaptive Routing) is enabled in the NIC, the number of packets carrying acknowledgements does increase. In the non-Adaptive Routing case, an acknowledgement is generated per entire message (which may be comprised of an RDMA WRITE First packet, multiple RDMA WRITE Middle packets, and an RDMA WRITE Last packet), whereas with Adaptive Routing, acknowledgements are generated for groups of several RDMA_WRITE_ONLY packets. In our case, the ACKs were sent for groups of 3 to 4 packets.

Here is the capture of the packet with an acknowledgement.

For every 3 to 4 frames of 4096 bytes each sent in one direction, there will be a small 66-byte L2 packet sent in the other direction.

Is it worth enabling? 66 bytes per 12 KB to 16 KB of transferred data is probably not a big tax, compared to alternative techniques such as Fully Scheduled Fabrics.
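A quick back-of-the-envelope check of that tax, assuming the same 20-byte L1 overhead applies to the ACK frame (the 66-byte ACK size and the 3-to-4-packet ACK interval are taken from the capture above):

    data_wire_bytes = 4096 + 78 + 20     # one RDMA_WRITE_ONLY packet on the wire
    ack_wire_bytes = 66 + 20             # 66-byte L2 ACK plus assumed L1 overhead

    for pkts_per_ack in (3, 4):
        overhead = ack_wire_bytes / (pkts_per_ack * data_wire_bytes)
        print(f"one ACK per {pkts_per_ack} packets: "
              f"{overhead:.2%} of the forward data volume, carried on the reverse path")

That works out to roughly 0.5–0.7% of the data volume, and it is carried in the opposite direction, so it does not eat into the forward throughput.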

Conclusion

This test validates the simple yet effective Juniper AI/ML fabric design, which uses packet spraying as a technique to load-balance traffic evenly across all fabric paths and avoid congestion, thus reducing job completion times.

The tests were performed with the NVIDIA ConnectX-6 Dx NIC, and we are open to verifying the behavior with other NICs too.

Comments

Thanks Dmitry Shokarev, loved your analysis. Did you happen to test retransmission with packet spraying in the case of packet drops? I believe going back to the sequence number that was dropped and restarting from there would be a performance hit, even though packets beyond the dropped sequence number are received.

Shawn Zhang (Sr. Consulting Systems Engineer @ Juniper Networks): Some vendors introduce cell-based transport between leaf and spine – the idea goes back to QFabric, or even the TXP Matrix era.

Chris Whyte (Principal Network Solutions Architect at Marvell Semiconductor): Traffic patterns from collectives based on how you parallelize an LLM are a critical piece missing from this analysis, imo – especially as you overlay them onto a cluster that will inherently involve both a scale-out and a scale-up domain. Not to mention, there are other things you can do at the NIC to improve ECMP, or things you can do with some topology detection mechanisms in order to optimize for traffic locality. Point being, you will get a very different result in your analysis if/when the above are included. Therefore, you've really only identified the importance of packet spraying with and without adaptive routing in a very crude (unrealistic) deployment, so it's not clear to me how useful this is. Now, I suspect you understand this already, so how do you plan to address it?

Thank you for sharing, and it's encouraging to see the statistical rigor :).


As someone who was at Juniper when the M160 was released and reordered packets, and then the T640 also reordered for a little while, it's kind of ironic to see it as a desired behavior now :)
