Today's data centers evolved to handle the explosion of web and mobile apps that dominated the first two decades of the 21st century. Entering 2024, data-center operators are rethinking their design, layout, and equipment choices thanks to the unique demands that AI apps place on tech infrastructure.
Billions of dollars have already been invested over the last year retrofitting data centers used by the hyperscalers — AWS, Google, Meta, Microsoft, and data-center landlords like Equinix — to accommodate these new requirements, and it will take years for those changes to ripple across the hundreds of cloud computing regions operated by those giants. Everything from the placement of servers within those massive buildings to the upgrade of networking equipment is in a state of flux.
It's one of the biggest inflection points in data-center architecture in decades, and it will require massive amounts of investment to stay competitive against AI data-center upstarts like Lambda Labs and CoreWeave that have the luxury of starting from scratch. "A good chunk of the new data center bill is going to be dominated in terms of handling AI workloads," said Partha Ranganathan, an engineering fellow and vice president at Google who has played a key role in its data-center design strategy.
But this buildout also comes just as many industry experts expect that overall demand for AI computing power will shift from the model-training frenzy of the present to AI inference, which some believe will be better served by computing resources located outside the centralized cloud data center, on edge networks or even back on-premises.
Here are some of the components of modern data centers that are rapidly changing amid the AI boom, as explained by the people responsible for keeping up with those changes.
Chips are the most fundamental part of any computer, and the reordering of this market thanks to AI is old news in 2024; just ask any Nvidia shareholder. But the chips needed to train AI models use more electricity than their traditional computing cousins, and that is having several follow-on effects across the data center.
"Power is by far the most constrained resource in any data center," said Prasad Kalyanaraman, vice president of infrastructure services for AWS. For example, traditional server racks could accommodate 30 to 40 servers per rack with a standard power cable, known in this world as a "whip," but only three or four AI servers in a rack would draw the maximum load those cables can accommodate, he said.
As a result, AWS is redesigning the electrical systems that run above server racks in its data centers to accommodate increased demand for electricity across all of its racks. It is also improving its demand forecasting abilities to make sure it can obtain enough renewable energy to service AI workloads as they grow, Kalyanaraman said.
Equinix rents data center space to a wide variety of tenants with different needs, but it is seeing a big increase in demand for power among its clients, said Tiffany Osias, vice president of global colocation services. Customers used to need anywhere from four to eight kilowatts per server rack, but as they move into AI-adjacent applications those requirements have increased almost tenfold in some cases, she said.
"When we think about architecting our data centers, we think about, do we have enough power in order to feed these boxes? And then, do we have enough cooling in order to keep them cold enough to operate in the most efficient condition?" Osias said.
One of the trickiest parts of operating a modern data center is finding the best way to mitigate the heat coming off thousands of very powerful computers. For a long time, the easiest way to do that was to simply direct a steady stream of air across the front of those servers and capture the waste heat coming off the back, sometimes directing that air through pads soaked with water in what's known as evaporative cooling.
AI servers need a more powerful approach. Data-center operators are in various stages of shifting to liquid cooling, which uses a closed-circuit loop of chilled liquid to absorb the heat directly off the chip and cool that liquid down outside the server rack.
This is an expensive undertaking and requires a fundamental shift in server rack design. For now, Microsoft is using what it called a "sidekick" to cool its newest Maia 100 AI chips in data centers that were designed around air cooling, which is most of them.
AWS has been able to rely on air cooling during its 2023 AI buildout, but Kalyanaraman expects that by the end of this year and into next, it will need to incorporate liquid cooling across its AI servers. The company has been experimenting with technology similar to Microsoft's sidekick approach, as well as immersion cooling, a technique Microsoft has also invested significant research in bringing to maturity.
Equinix has a mixture of customers using both traditional air cooling and newer liquid cooling technologies inside its spaces, Osias said. But the trend is definitely moving toward liquid cooling, as more and more companies start building liquid cooling products and driving the prices down.
Liquid cooling also allows data-center operators to cluster AI servers much more closely together than they could if they had to cool those servers with air, she said, which helps address another crucial bottleneck in the AI data center.
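The physics behind that shift can be sketched with a simple heat-transfer estimate: water carries roughly four times more heat per kilogram per degree than air, at roughly 800 times the density. The rack power and temperature rises below are assumed values for illustration, not vendor specifications:

```python
# Back-of-envelope heat-removal comparison for one rack; the rack power and
# temperature rises below are assumptions for illustration, not vendor specs.

RACK_HEAT_KW = 40.0     # assumed heat output of a dense AI rack

# Air cooling: heat removed = mass_flow * specific_heat * delta_T
AIR_CP = 1.005          # kJ/(kg*K)
AIR_DENSITY = 1.2       # kg/m^3
AIR_DELTA_T = 15.0      # assumed inlet-to-exhaust temperature rise, K

air_kg_per_s = RACK_HEAT_KW / (AIR_CP * AIR_DELTA_T)
air_m3_per_s = air_kg_per_s / AIR_DENSITY
print(f"Air:   {air_m3_per_s:.2f} m^3/s (~{air_m3_per_s * 2119:.0f} CFM) of airflow")

# Liquid cooling: water carries ~4x more heat per kg per degree, at ~830x the density
WATER_CP = 4.186        # kJ/(kg*K)
WATER_DENSITY = 998.0   # kg/m^3
WATER_DELTA_T = 10.0    # assumed supply-to-return temperature rise, K

water_kg_per_s = RACK_HEAT_KW / (WATER_CP * WATER_DELTA_T)
water_l_per_s = water_kg_per_s / WATER_DENSITY * 1000
print(f"Water: {water_l_per_s:.2f} L/s (~{water_l_per_s * 15.85:.0f} gal/min) of coolant")
```

With these assumed numbers, carrying 40 kilowatts away on air takes thousands of cubic feet of airflow per minute, while a liquid loop does the same job with about a liter of water per second.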
At Marvell, which designs optical interconnects for data-center networking, customers are increasingly interested in supplementing traditional copper networking equipment with faster connections that can help minimize the amount of time it takes to train models on extremely expensive AI servers.
"What happens in large AI clusters is that connectivity becomes more important to keep all the GPUs up and running properly," said Radha Nagarajan, senior vice president and chief technology officer for Marvell's optical and cloud connectivity group. This problem wasn't as prevalent on smaller AI clusters used over the last few years, he said, but the generative AI boom has changed the way companies allocate computing power for these tasks.
"When you do training workloads, typically customers want to run large clusters where they can run hundreds of billions of parameters that they actually use for their training workloads," AWS's Kalyanaraman said. "In those workloads, latency matters, and specifically what matters is the instance to instance latency, (or) the latency between the different racks within a data center that is part of the same cluster."
Networking cables, routers, and switches designed for applications that aren't as sensitive to latency therefore need an upgrade to accommodate the demands of training generative AI models. AWS believes that the work it did developing the Elastic Fabric Adapter for high-performance computing workloads helps solve part of this challenge, but it also developed what Kalyanaraman called the "10P/10U network," an internal design that provides 10 terabytes per second of throughput with 10 microseconds of latency.
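A simplified cost model shows why those microseconds matter. In data-parallel training, gradients are synchronized with collective operations such as ring all-reduce thousands of times per run, and for the small, frequent messages involved, per-hop latency can dominate the bandwidth term. The cluster size, gradient bucket size, and per-link bandwidth below are assumptions for illustration, not AWS's actual figures:

```python
# Simplified ring all-reduce cost model for one gradient bucket; cluster size,
# bucket size, and per-link bandwidth are assumptions, not AWS's actual figures.

def ring_allreduce_seconds(num_gpus: int, payload_bytes: float,
                           link_bytes_per_s: float, hop_latency_s: float) -> float:
    """2*(p-1) communication steps, each moving payload/p bytes plus one hop of latency."""
    steps = 2 * (num_gpus - 1)
    per_step = payload_bytes / num_gpus / link_bytes_per_s + hop_latency_s
    return steps * per_step

BUCKET_BYTES = 25e6        # a 25 MB gradient bucket, a common data-parallel chunk size
LINK_BYTES_PER_S = 50e9    # assumed 50 GB/s effective per-accelerator bandwidth

for hop_latency_us in (10, 100):   # AI-tuned fabric vs. an assumed slower network path
    t = ring_allreduce_seconds(512, BUCKET_BYTES, LINK_BYTES_PER_S, hop_latency_us * 1e-6)
    print(f"{hop_latency_us:>3} us per hop -> ~{t * 1000:.0f} ms per all-reduce")
```

Under those assumptions, cutting per-hop latency from 100 microseconds to 10 shrinks each synchronization from roughly 100 milliseconds to about 11, a difference that compounds over the thousands of steps in a training run.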
Other companies are adopting InfiniBand, a networking standard led by Nvidia and several other data-center equipment makers. Sales of InfiniBand technology are surging, according to The Next Platform, but most data centers still use Ethernet as their standard networking tech.
InfiniBand promises the high-performance, low-latency connections needed for AI training, but it is best suited for rack-to-rack interconnects rather than the longer-distance links that cloud providers like AWS must run between availability zones in cloud regions that can be several miles apart. That means traditional Ethernet is unlikely to disappear from modern data centers, in part because many AI experts think the current obsession with training models will soon start to fade.
While AI foundation model companies and cloud providers battled for the scarce GPUs needed to train AI models in 2023, over the next couple of years AI researchers and cloud providers expect AI inference to become a much more important part of the enterprise AI stack. Inference is the process of returning an answer when the end user enters a prompt into an AI model, and that type of workload resembles traditional computing much more than AI model training does.
"Inference is where the money happens for companies," Equinix's Osias said. "Inference is where they gain competitive advantage."
Because inference doesn't require as much computing horsepower as training, a lot of companies are betting that customers will want to run AI inference locally, either on their own on-premises servers, across edge computing networks, or even on PCs and smartphones.
"AI inference on a network is going to be the sweet spot for many businesses: private data stays close to wherever users physically are, while still being extremely cost-effective to run because it’s nearby,” said Matthew Prince, CEO of Cloudflare, in a press release announcing a partnership with Nvidia to deploy GPUs across Cloudflare's network.
The big centralized cloud providers might play a very different role in that world; their customers will always need some amount of AI training resources as their models evolve, but the bulk of their activity could shift away from the technology and equipment buildout now underway in cloud data centers to accommodate the training boom.
AWS's Kalyanaraman didn't dispute the notion that low-latency edge processing will make sense for a lot of inference workloads, but, as you might expect, he thinks a lot of companies will still want to use cloud providers for inference because of the stability and resiliency they offer.
"Training workloads have the ability to checkpoint and then restart from a certain checkpoint if there's any kind of failure. You can't do that with inference," he said.
The speed at which the generative AI boom has shifted enterprise computing priorities continues to have ripple effects across hardware, networking, and software companies, and seems unlikely to slow down any time soon. What is clear, however, is that enterprise tech vendors are spending enormous amounts of money to try to keep up with those changes, and the foundational layer for the entire IT industry is getting a fresh look.
Editor's note: This story was updated on Wed. Jan 10 to correct the positioning of the electrical systems in AWS data centers.