Hands-On Demo: Collective Benchmarking for AI Data Centers
Show Description
This demonstration of Keysight AI (KAI) Data Center Builder shows how network events affect completion times. The first demo shows how congestion and poor fabric utilization degrade performance and lengthen completion times. You’ll also see how KAI Data Center Builder reveals that increasing the parallelism of data transfers improves utilization and completion times.
In this Keysight Technologies presentation at AI Field Day, Ankur Sheth, Director of AI Test R&D, demonstrates KAI Data Center Builder, focusing on how network events affect completion times. The setup emulates a server with eight GPUs connected to a two-tier fabric network, using the AresONE box to simulate the GPUs and network interface cards (NICs). The demonstration shows the effects of network congestion on performance and how increasing the parallelism of data transfers can improve fabric utilization and completion times. The first scenario examines the impact of congestion on the network and reveals poor performance caused by misconfigured congestion control settings.
Sheth explains the configuration and results of running an "All Reduce" collective operation, which is commonly used during the backward pass of a training job. The initial test shows that the network’s poor configuration leads to low utilization and high latency, with only 25% of the theoretical throughput achieved. Per-flow completion times and their cumulative distribution functions (CDFs) show large discrepancies in data transfer times, which point to a problem in the network configuration. After the network settings are adjusted, particularly the Priority Flow Control (PFC) settings, performance improves dramatically: utilization reaches 95% and completion times drop significantly.
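The kind of analysis described above, comparing achieved throughput with the theoretical rate and looking at the spread of flow completion times, can be sketched in a few lines of Python. This is a minimal illustration, not KAI Data Center Builder's API: the function names, the 400 Gb/s link speed, and every sample value are hypothetical and were chosen only to make the arithmetic easy to follow.

```python
import math

def empirical_cdf(samples):
    """Return (value, cumulative fraction) pairs for the sorted samples."""
    ordered = sorted(samples)
    n = len(ordered)
    return [(value, (i + 1) / n) for i, value in enumerate(ordered)]

def percentile(samples, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

def utilization(bytes_moved, seconds, link_gbps):
    """Achieved throughput as a fraction of the link's line rate."""
    achieved_gbps = bytes_moved * 8 / seconds / 1e9
    return achieved_gbps / link_gbps

# Hypothetical flow completion times in milliseconds (not from the demo).
fct_ms = [10, 11, 10, 12, 48, 11, 10, 50]

median = percentile(fct_ms, 50)
tail = percentile(fct_ms, 99)
print(f"p50 = {median} ms, p99 = {tail} ms, tail/median = {tail / median:.1f}x")

# Hypothetical transfer: 1 GB moved in 80 ms over an assumed 400 Gb/s link.
print(f"utilization = {utilization(1e9, 0.08, 400):.0%}")
```

A wide gap between the median and the tail of the CDF is the kind of discrepancy that suggests a few flows are being held back, which is what prompts a closer look at the fabric configuration.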
In a second experiment, Sheth demonstrates the impact of using a different collective algorithm and of increasing the number of queue pairs (Q-Pairs), which are the connections used in the RDMA over Converged Ethernet (RoCE) protocol. The halving-doubling algorithm initially shows only average performance, with significant tail latencies. Increasing the number of Q-Pairs from one to eight improves the network’s performance, with more parallel and more consistent data transfer times. Spreading each transfer across more connections allows the network to load balance the traffic better, resulting in more efficient utilization. The presentation concludes with a demonstration of how KAI Data Center Builder's metrics and data can be integrated into automated test cases and analyzed using tools like Jupyter notebooks, providing valuable insights for network designers and engineers.
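For readers unfamiliar with halving-doubling, the following is a minimal sketch of the textbook form of the algorithm (a reduce-scatter by recursive halving followed by an all-gather by recursive doubling), simulated in plain Python lists. It assumes a power-of-two number of ranks and a buffer length divisible by the rank count. It is not the implementation used in the demo, and the function name is hypothetical.

```python
def halving_doubling_allreduce(buffers):
    """Simulate a sum all-reduce across ranks; returns each rank's final buffer."""
    n = len(buffers)
    length = len(buffers[0])
    assert n > 0 and n & (n - 1) == 0, "rank count must be a power of two"
    assert length % n == 0, "buffer length must be divisible by rank count"

    data = [list(b) for b in buffers]
    lo = [0] * n        # start of the segment each rank is responsible for
    hi = [length] * n   # end (exclusive) of that segment

    # Phase 1: reduce-scatter by recursive halving.
    # Each step pairs ranks at distance `dist`; each keeps half of the segment
    # and adds in the partner's partial sums for that half.
    dist = n // 2
    while dist >= 1:
        snapshot = [list(d) for d in data]
        for r in range(n):
            peer = r ^ dist
            mid = (lo[r] + hi[r]) // 2
            keep = (lo[r], mid) if r & dist == 0 else (mid, hi[r])
            for i in range(*keep):
                data[r][i] = snapshot[r][i] + snapshot[peer][i]
            lo[r], hi[r] = keep
        dist //= 2

    # Phase 2: all-gather by recursive doubling.
    # Pairs are revisited in reverse order and swap their reduced segments,
    # so the owned segment doubles at every step.
    dist = 1
    while dist < n:
        snapshot = [list(d) for d in data]
        segments = list(zip(lo, hi))
        for r in range(n):
            peer = r ^ dist
            peer_lo, peer_hi = segments[peer]
            data[r][peer_lo:peer_hi] = snapshot[peer][peer_lo:peer_hi]
            lo[r] = min(segments[r][0], peer_lo)
            hi[r] = max(segments[r][1], peer_hi)
        dist *= 2

    return data

# Eight emulated ranks, matching the eight-GPU setup; values are illustrative.
ranks = 8
inputs = [[rank * 10 + i for i in range(ranks)] for rank in range(ranks)]
outputs = halving_doubling_allreduce(inputs)
print(outputs[0])
```

With eight ranks the exchange takes six steps in total (three halving, three doubling), and in each step every rank talks to exactly one partner. With a single queue pair per partner, that is one flow per rank per step, which is one plausible reason the number of Q-Pairs matters for how well the fabric can spread the load.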