Google's Maglev paper describes a high-performance software-based load balancing system. Like many of the systems that Google has described in the literature, Maglev applies commodity hardware and horizontal scaling to a problem in a novel way. These are some notes on the parts most interesting to me.
Packets reaching Google's networks are destined for virtual IP addresses, which are assigned to particular services. These addresses map to several physical machines running the service application. Maglev's basic job is to look up the machines serving that virtual IP and then to evenly divide the load between these servers. Each Maglev array manages many virtual IPs and many backend servers per IP.
The main smarts in Maglev lie in letting an unintelligent frontend hardware router distribute packets, and packet fragments, in any fashion it chooses across a resizeable array of Maglev instances. Those instances forward the packets on to the machines handling each service while still maintaining connection semantics.
There were several problems that caught my attention:
The basics here are a combination of efficient coding and moving the networking work into user space. This speeds things up in a few ways. First, you avoid the context switches into kernel space. Second, the kernel's stack supports far more features than Maglev requires, so doing the work itself avoids a lot of possible code paths and checks. Finally, you can interact directly with the network card via shared memory buffers and poll them rather than wait for interrupts; when there will almost always be packets waiting, polling is efficient.
Maglev is fortunate to be distributing load that arrives from many different source IPs, so it can achieve good balancing characteristics by hashing on the source and destination data.
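To make that concrete for myself, here is a small sketch of hash-based spreading. It is not taken from the paper: the function names, the use of SHA-256 and the synthetic addresses are all my own assumptions. It only shows that many distinct flows spread roughly evenly over a handful of buckets, while a single flow always lands in one.

```python
import hashlib
from collections import Counter

def bucket_for(src_ip, src_port, dst_ip, dst_port, protocol, buckets):
    """Map one flow to a bucket. SHA-256 is an arbitrary stand-in hash."""
    key = f"{src_ip}|{src_port}|{dst_ip}|{dst_port}|{protocol}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % buckets

def spread(flows, buckets):
    """Count how many of the given flows land in each bucket."""
    counts = Counter(bucket_for(*flow, buckets) for flow in flows)
    return [counts[b] for b in range(buckets)]

# 10,000 synthetic clients talking to one made-up virtual IP over TCP (protocol 6).
many_clients = [
    (f"10.{i // 256 % 256}.{i % 256}.7", 40000 + i % 1000, "203.0.113.10", 443, 6)
    for i in range(10_000)
]
# The same number of packets, but all belonging to a single flow.
one_client = [("10.0.0.7", 40000, "203.0.113.10", 443, 6)] * 10_000

print(spread(many_clients, 4))  # roughly equal counts
print(spread(one_client, 4))    # everything in one bucket
```

The second case is why the variety of sources matters: hashing keeps a flow together, so it can only balance well when there are many flows to spread around.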
An obvious initial problem is that the router is distributing packets across multiple Maglev machines, but these packets form part of a larger connection stream, so they all need to reach the same backend machine for complete requests to be reconstructed. The Maglev machines take care of this, routing packets belonging to the same connection to the same backend machine using a consistent hashing technique.
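As I read the paper, the consistent hashing scheme works by building a lookup table of prime size: each backend gets its own permutation of the table slots, derived from an offset and a skip value, and the backends take turns claiming their next preferred free slot until the table is full. The sketch below is my own reconstruction under those assumptions. The helper names, the SHA-256 based stand-in hashes and the tiny table size are mine, and a real table would be far larger.

```python
import hashlib

def _h(key: str, seed: str) -> int:
    """Stand-in hash; different seeds act as independent hash functions."""
    digest = hashlib.sha256((seed + ":" + key).encode()).digest()
    return int.from_bytes(digest[:8], "big")

def build_lookup_table(backends, table_size=13):
    """Fill a table of prime size, with backends taking turns to claim slots."""
    if not backends:
        raise ValueError("need at least one backend")
    m = table_size  # must be prime so every skip value visits every slot
    offsets = [_h(b, "offset") % m for b in backends]
    skips = [_h(b, "skip") % (m - 1) + 1 for b in backends]
    next_idx = [0] * len(backends)  # position in each backend's permutation
    table = [None] * m
    filled = 0
    while True:
        for i, backend in enumerate(backends):
            # Walk this backend's preference list until a free slot turns up.
            slot = (offsets[i] + next_idx[i] * skips[i]) % m
            while table[slot] is not None:
                next_idx[i] += 1
                slot = (offsets[i] + next_idx[i] * skips[i]) % m
            table[slot] = backend
            next_idx[i] += 1
            filled += 1
            if filled == m:
                return table

table = build_lookup_table(["backend-a", "backend-b", "backend-c"])
print(table)
```

Because the table depends only on the backend names and the table size, every Maglev machine that sees the same backend list builds the same table without talking to its peers. Taking turns also means each backend ends up with an almost equal share of the slots.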
Backend selection is based on a packet's 5-tuple: its source IP, source port, destination IP, destination port and IP protocol number. To keep backend selection consistent without sharing state between Maglev machines, a Maglev server takes the 5-tuple of an incoming packet and derives the appropriate backend from its hash. First, the Maglev machine checks its connection table for an existing entry. If that fails, a specialised consistent hashing technique allows each Maglev machine to choose the same backend server for a given connection. The selection is then stored in the connection table, so that later packets belonging to that connection can be directed more quickly.
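The lookup order described above can be sketched as follows. This is only an illustration under my own assumptions: `FiveTuple`, `Forwarder` and `tuple_hash` are hypothetical names, SHA-256 is a stand-in hash, and the lookup table is just a list of backend names supplied by the caller rather than one built by the real algorithm.

```python
import hashlib
from typing import NamedTuple

class FiveTuple(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: int  # IP protocol number, e.g. 6 for TCP

def tuple_hash(ft: FiveTuple) -> int:
    """Hash the 5-tuple; SHA-256 is an arbitrary stand-in."""
    data = "|".join(str(field) for field in ft).encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

class Forwarder:
    """One Maglev-like machine with its own local connection table."""

    def __init__(self, lookup_table):
        self.lookup_table = lookup_table  # consistent hashing table
        self.connections = {}             # 5-tuple hash -> chosen backend

    def select_backend(self, ft: FiveTuple) -> str:
        key = tuple_hash(ft)
        backend = self.connections.get(key)
        if backend is None:
            # No entry yet: fall back to the shared-nothing hash lookup,
            # then remember the choice for later packets of this connection.
            backend = self.lookup_table[key % len(self.lookup_table)]
            self.connections[key] = backend
        return backend

table = ["backend-a", "backend-b", "backend-c"]
packet = FiveTuple("198.51.100.4", 51515, "203.0.113.10", 443, 6)
first, second = Forwarder(table), Forwarder(table)
print(first.select_backend(packet) == second.select_backend(packet))
```

Two forwarders that have never exchanged any state agree on the backend because both derive it from the same hash and the same table. The connection table then keeps an established connection pinned to its backend even if the lookup table is rebuilt later.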
Overall, the paper is a very readable solution to an interesting problem of scale.