Why Vercel overhauled its serverless infrastructure for the AI era

Vercel's serverless infrastructure was designed at a time when speed was the most important goal. AI apps are a little different, and Fluid Compute is an effort to rebuild that infrastructure for the AI era.

Vercel founder and CEO Guillermo Rauch speaks on stage at Vercel Ship in New York last year. (Credit: Vercel)

As companies struggle to deploy apps built around large language models, they're also exposing inefficiencies in cloud tools that were designed to run older workloads. After rewriting the infrastructure beneath its serverless computing service last year, Vercel is ready to shift its customers onto a new platform that will make it cheaper to run AI apps.

Fluid Compute is a new architecture for Vercel Functions designed to eliminate the idle period an AI app spends waiting for a model to answer a question. That wait can stretch to seconds or even minutes on computing infrastructure accustomed to operating in milliseconds, and it costs real money. In an exclusive interview with Runtime, Vercel co-founder and CEO Guillermo Rauch described Fluid Compute as the natural evolution of serverless computing.

"Fluid Compute sets out to fix serverless for the AI era," Rauch said. It's an acknowledgement that tried-and-true computing infrastructure strategies can change very quickly when something like generative AI comes along, a broader topic that I'll be discussing with Rauch at the HumanX conference in March alongside fellow panelists Andrew Feldman of Cerebras, Robert Nishihara of Anyscale, and Sharon Zhou of Lamini AI.

Vercel, which according to Crunchbase has raised $563 million in funding, is primarily known for its web application development platform. Developers use Vercel's open-source Next.js framework and managed infrastructure services to quickly launch and run cloud apps without having to provision and configure their own hardware.

However, Vercel originally designed the infrastructure that powers its managed computing services to run traditional web apps. Fluid Compute is an effort to rebuild that infrastructure to run AI apps efficiently without changing anything about how non-AI apps run.

"It's the future of Vercel, and I'm hoping it's the future of the industry at large," he said. The company plans to demonstrate Fluid Compute for customers and developers later today.

Speed kills

AWS introduced the principles behind serverless computing back in 2014 with the launch of Lambda. Apps built around Lambda and other serverless development platforms use functions that execute distinct tasks in response to external triggers, which allows computing resources to spin up and shut down very quickly.
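To make the pattern concrete, here's a minimal sketch of a serverless function in TypeScript, using the web-standard Request/Response handler shape that platforms like Vercel Functions accept. The route and field names are illustrative, not from Vercel's docs:

```typescript
// Minimal sketch of a serverless function: it runs only in response to
// an incoming request, and the platform spins an instance up per
// invocation and tears it down afterward. Nothing runs between requests.
export async function GET(request: Request): Promise<Response> {
  const url = new URL(request.url);
  const name = url.searchParams.get("name") ?? "world";
  return new Response(JSON.stringify({ greeting: `Hello, ${name}` }), {
    headers: { "Content-Type": "application/json" },
  });
}
```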

At that time developers were obsessed with speed, having realized that their users and customers wouldn't tolerate sites and apps that ran even 100 milliseconds or so slower than what they expected, and "we optimized the world's compute for that [problem]," Rauch said. Vercel's managed infrastructure runs on AWS and the company works closely with its Lambda team.

Now there's an entirely different set of expectations around apps that work with LLMs, given that concerns about accuracy make it harder for users and developers to trust those apps. "Even the customer wants the back end to be slow," Rauch said, pointing to a new feature in OpenAI's ChatGPT that allows the user to ask it to "use more intelligence" to answer a prompt, which takes longer to run.


But as Vercel's customers started using the serverless platform to build AI apps, they realized they were wasting computing resources while awaiting a response from the model. Traditional servers understand how to manage idle resources, but in serverless platforms like Vercel's "the problem is that you have that computer just waiting for a very long time and while you're claiming that space of memory, the customer is indeed paying," Rauch said.
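The shape of the problem is easy to sketch. In the hypothetical handler below, the function's own work takes milliseconds, but the instance then blocks on a model call that may take many seconds, and under per-invocation billing that reserved memory is paid for the entire wait. The model endpoint and payload are made up for illustration:

```typescript
// Hypothetical handler illustrating the idle-time problem: the
// instance's memory stays reserved (and billed) while it waits.
export async function POST(request: Request): Promise<Response> {
  const { prompt } = await request.json();

  // The function sits here, mostly idle, until the model answers,
  // potentially for seconds or minutes.
  const llmResponse = await fetch("https://llm.example.com/v1/complete", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt }),
  });

  return new Response(await llmResponse.text(), {
    headers: { "Content-Type": "application/json" },
  });
}
```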

Fluid Compute gets around this problem by introducing what the company is calling "in-function concurrency," which "allows a single instance to handle multiple invocations by utilizing idle time spent waiting for backend responses," Vercel said in a blog post last October announcing a beta version of the technology. "Basically, you're treating it more like a server when you need it," Rauch said.
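Vercel hasn't published the internals, but the core idea resembles what Node.js's event loop already does: awaiting I/O yields the process so other work can proceed. Here's a toy illustration (my own, not Vercel's code) of how waits on a single instance can overlap instead of stacking up:

```typescript
// Toy illustration of in-function concurrency's core idea: while one
// invocation awaits a slow "model," the same Node.js process can serve
// others, so their waits overlap.
const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function handleInvocation(id: number): Promise<void> {
  console.log(`invocation ${id}: waiting on model`);
  await sleep(2000); // stand-in for a slow model response
  console.log(`invocation ${id}: model answered`);
}

async function main(): Promise<void> {
  const start = Date.now();
  // Three concurrent invocations share one instance's idle time.
  await Promise.all([1, 2, 3].map(handleInvocation));
  console.log(`elapsed: ${Date.now() - start}ms`); // ~2000ms, not ~6000ms
}

main();
```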

Suno was one of Fluid Compute's beta testers, and saw "upwards of 40% cost savings on function workloads," Rauch said. Depending on the app, other customers could see even greater savings without having to change their app's configuration, he said.

Back is the new front

Fluid Compute was designed to work with Node.js and Python applications, which rank among the most widely used web technologies and programming languages, respectively, in Stack Overflow's surveys of professional developers. Cloudflare Workers, a rival serverless computing platform, uses a similar technique to handle idle requests more efficiently, but it is built on a different runtime, and Node.js developers have to implement a few workarounds to get their apps to run.

Rauch is hopeful that Fluid Compute will reduce the number of customers shocked by the size of their Vercel bills after their apps went viral or saw an unexpected surge in demand. That experience has been even more painful for AI app developers, who found they were paying more than they expected to serve their users with a slow app.

"Fluid addresses a huge percentage of those cases," Rauch said. "Developers felt like they weren't in control of that back end becoming slower, and Fluid brings into a world of predictability where you're concerned about what you do control, which is your code and the things that you ship."

The new platform could also make Vercel a more interesting option for larger enterprises that like the principles of serverless computing but need to make sure they're operating as efficiently as possible.

"Typically, Vercel has been seen by many as for front-end workloads." Rauch said. "With Fluid, you can run any kind of back-end workload as long as it's in those runtimes like Node and Python. It's not just the ability to run it, but to do so efficiently."
