RADAR: Rapid Dynamic Rerouting to Achieve Load Balancing in Datacenter Networks

Datacenters are among the most critical facilities supporting Internet services. A datacenter can host hundreds of thousands of servers to enable data- and compute-intensive applications. For large-scale datacenters, the interconnection network, called the datacenter network (DCN), must provide high-speed, low-latency data exchange among the servers.


Modern DCN designs employ dense interconnections to achieve ultra-high capacity. Examples of such topologies include fat tree and Clos networks. A critical issue in such designs is load balancing in multi-path routing. A good scheme should split traffic among the feasible paths in a balanced way so that high throughput is achieved without causing congestion. The typical approach to load balancing is equal-cost multi-path (ECMP) routing with hash-based traffic splitting. In this scheme, each switch maintains multiple next-hop forwarding entries for each destination. When a flow arrives, the switch calculates a hash value from the flow ID and selects one of the forwarding entries based on that value. This approach can balance the number of flows across the paths, but it cannot balance the traffic load, because flows have different bit rates.
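The limitation of hash-based splitting can be illustrated with a minimal Python sketch. It is not taken from any particular switch implementation: the function names `ecmp_next_hop` and `path_load`, the use of a 5-tuple as the flow ID, and the choice of SHA-256 as the hash are all assumptions made for illustration.

```python
import hashlib
from collections import Counter


def ecmp_next_hop(flow_id, next_hops):
    """Pick one next hop by hashing the flow ID (assumed here to be a 5-tuple)."""
    key = "|".join(str(field) for field in flow_id).encode()
    digest = hashlib.sha256(key).digest()
    # Use the first 4 bytes of the digest as an integer hash value.
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]


def path_load(flows, next_hops):
    """Return (flows per hop, total rate per hop) for (flow_id, rate) pairs."""
    count = Counter()
    load = Counter()
    for flow_id, rate_mbps in flows:
        hop = ecmp_next_hop(flow_id, next_hops)
        count[hop] += 1
        load[hop] += rate_mbps
    return count, load


if __name__ == "__main__":
    hops = ["uplink0", "uplink1"]
    # Hypothetical flows: (src IP, dst IP, protocol, src port, dst port), rate in Mb/s.
    flows = [
        (("10.0.0.1", "10.0.1.1", 6, 40000 + i, 80), 900 if i == 0 else 5)
        for i in range(8)
    ]
    count, load = path_load(flows, hops)
    print("flows per hop:", dict(count))
    print("load per hop (Mb/s):", dict(load))
```

Because the hash depends only on the flow ID, every packet of a flow follows the same path, and a single high-rate flow can dominate one uplink even when the per-hop flow counts are similar.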


In this project, we present a new scheme called RADAR (rapid dynamic rerouting) to achieve load balancing in datacenter networks. RADAR is complementary to ECMP in that the default routing is performed by ECMP. When congestion is detected, RADAR reroutes some of the big flows to uncongested paths to improve performance. The key elements of the design are the following. First, we present a big flow detection scheme. The scheme is based on sampling and has the advantage of low hardware complexity. Second, a probe-based scheme is designed to detect congestion and to find rerouting paths. Probe-based congestion detection avoids high software/hardware complexity in the switches and does not introduce much overhead traffic. Third, a low-complexity flow-based rerouting mechanism is designed that works by modifying the packet header. As a result, we perform flow-based rerouting without requiring a flow table in every switch and router.
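As a rough illustration of sampling-based big flow detection, the following Python sketch samples one packet in every N and flags flows that accumulate enough samples. It is not the RADAR design itself: the class name `SampledBigFlowDetector`, the 1-in-N sampling rule, and the `sample_every` and `threshold` parameters are hypothetical choices made for this example.

```python
from collections import Counter


class SampledBigFlowDetector:
    """Flag flows that appear often among 1-in-N sampled packets (illustrative only)."""

    def __init__(self, sample_every=100, threshold=20):
        self.sample_every = sample_every  # sample one packet out of this many
        self.threshold = threshold        # samples needed to call a flow "big"
        self._seen = 0                    # total packets observed so far
        self._samples = Counter()         # sampled packet count per flow ID

    def observe(self, flow_id):
        """Process one packet; only every N-th packet updates the per-flow state."""
        self._seen += 1
        if self._seen % self.sample_every == 0:
            self._samples[flow_id] += 1

    def big_flows(self):
        """Return the set of flow IDs whose sample count reached the threshold."""
        return {flow for flow, n in self._samples.items() if n >= self.threshold}


if __name__ == "__main__":
    detector = SampledBigFlowDetector(sample_every=100, threshold=20)
    for i in range(10000):
        detector.observe("elephant")
        if i % 1000 == 0:
            detector.observe("mouse")  # a short flow interleaved with the big one
    print(detector.big_flows())
```

State is kept only for sampled packets, so the per-packet work is a counter increment. This is one plausible reading of why a sampling-based detector keeps hardware complexity low.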


A comparison of the proposed scheme with the default ECMP scheme shows that RADAR achieves a substantial performance improvement. For example, using TCP in a small fat tree network, the throughput improvement can be as high as 48%. Details are given below:

Traffic pattern   Packet drop rate in RADAR   Throughput improvement over ECMP
A                 0                           31.93%
B                 140.97 ppm                  48.06%
C                 18.24 ppm                   36.93%
D                 0.70 ppm                    32.00%

 

The following figure shows the throughput of each flow with RADAR rerouting turned on and off. RADAR significantly increases the throughput of many flows.

An earlier version of RADAR was published at the IEEE INFOCOM 2011 Workshop on Cloud Computing. We are still working on this project and will publish more of the design and experiments in the near future.

Publications

Kang Xi, Yulei Liu, H. Jonathan Chao, "Enabling Flow-based Routing Control in Data Center Networks using Probe and ECMP", IEEE INFOCOM 2011 Workshop on Cloud Computing, 2011.