Every modern enterprise company will tell you they employ a data-driven approach toward making business decisions, but an increasing number of those same companies are no longer willing to let someone else drive their data. Open storage formats are here to stay, and they have leveled the playing field for one of tech's best rivalries: Snowflake and Databricks, which will outline competing visions for the future of data over the next two weeks at back-to-back user conferences.
Faced with rising costs and new opportunities to use their corporate data across a variety of tools, enterprises are bringing open-source table formats into the mainstream this year. Data stored in those formats can sit in a low-cost storage service like AWS's S3 and be processed by a wide variety of services, which gives companies far more flexibility and control over how they put their data to work.
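To make the idea concrete, here is a minimal sketch of that pattern using PySpark and Apache Iceberg; the bucket, warehouse path, and table name are hypothetical, and any other engine that understands the Iceberg specification could read the same files.

```python
# A minimal sketch of the pattern described above: one Iceberg table living in
# S3 and queried directly by Spark. The bucket, warehouse path, and table name
# are hypothetical, and the Iceberg Spark runtime jar must be on the classpath.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("open-table-format-demo")
    .config(
        "spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    )
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://example-bucket/warehouse")
    .getOrCreate()
)

# Because the table is just open-format files plus metadata in object storage,
# other engines (Trino, DuckDB, Snowflake's external Iceberg tables, and so on)
# can point at the same location without keeping a second copy of the data.
spark.sql(
    "SELECT region, SUM(revenue) AS total FROM lake.sales.orders GROUP BY region"
).show()
```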
"As data becomes more important within an organization, putting it into a centralized location where any query engine or any workflow can touch it is an advantage, as opposed to having a cloud data warehouse for marketing and a cloud data warehouse for product," said Tomasz Tunguz, general partner at Theory Ventures.
This trend has been building for several years, but the AI boom has forced enterprises that had been sitting on the sidelines to modernize their data architectures, and most of those enterprises already work with either Databricks or Snowflake as a central part of their data strategy.
Databricks was ahead of the open-format trend with Delta Lake, but the two companies have been converging on each other's turf for several years. In 2022 Snowflake announced support for Delta Lake as well as Iceberg, a rival open format.
In the years since, Iceberg has emerged as the leading format choice thanks to its reputation as the more vendor-neutral option, according to buyers and sellers of data services interviewed by Runtime. "We are 100% behind Iceberg as the specification for open table formats," said Christian Kleinerman, executive vice president for product at Snowflake, and it would surprise no one if the company announced comprehensive support for Iceberg next week.
Databricks also supports Iceberg, but is far more closely identified with Delta Lake and is still its biggest champion, said Jonathan Keller, vice president of product management at the company. "We obviously think it's the best format, [that] it provides the best range of capabilities from streaming to core data engineering to [business intelligence]," Keller said.
But companies like Microsoft and Google are betting that in the future, customers will want data services that work with whichever format they prefer. If those services take off, it could calm a budding standards war before it gets out of hand and force Databricks and Snowflake to compete on the strength of their data-processing engines, as well as the security and governance features that every enterprise customer needs.
Separate ways
The rise of open table formats is just the latest chapter in the ongoing quest to decouple storage from computing engines, a legacy of the success and decline of Hadoop.
Hadoop was a revelation when it first hit the scene in the mid-2000s because it allowed companies to add tons of new data to a common holding tank, which eventually became known as a data lake. Companies loved that they no longer had to maintain a bunch of separate databases, but in the original versions of Hadoop compute and storage were tightly coupled, so adding a bunch of new data also meant adding new computing resources.
Hadoop was also originally designed for batch processing, and therefore not very good at generating the real-time data that business intelligence analysts need to make decisions, Kleinerman said. If you wanted to do that, you still had to link Hadoop data with another database, such as IBM's Netezza or Teradata.
That was an expensive strategy, said Justin Borgman, founder and CEO of Starburst. Borgman worked for Teradata in that era and tried very hard to keep customers from defecting to newer alternatives like Snowflake and Amazon Redshift, which could provide both business intelligence reporting and independent scaling of compute and storage; in hindsight, that was never going to work.
But the landscape has shifted once again. Until recently, Snowflake required customers to upload their data to its processing engine in proprietary formats and charged storage fees, which nobody minded until interest rates began to rise and the budget hawks got the upper hand. Suddenly nobody wanted to maintain one copy of their data in Snowflake and another copy for some other important service, such as observability.
Databricks, on the other hand, allows customers to store their data where they please and only pay for the computing resources used by its data engine, which was a popular approach with the data scientists and researchers working on the massive datasets associated with AI models. Over the last year Snowflake has moved in this direction, and Iceberg is its main strategy for meeting those customers where they are.
"We're seeing this movie play out again," Borgman said. "The biggest difference is that in 15 years, both the query engines and the file formats have advanced enough that this is like actually doable."
Peace in our time
The three main open table formats used to separate data storage from data engines — Delta Lake, Iceberg, and Hudi — all emerged around the same time as interest in cloud computing surged. Delta Lake was developed internally by Databricks and released to the Linux Foundation in 2019, while Iceberg, born at Netflix, and Hudi, developed by Uber, wound up at the Apache Software Foundation.
Until recently, betting on a format was a consequential decision that could lock you into a particular set of vendor tools for quite some time. But efforts like the OneTable project, backed by Google and Microsoft, are trying to make it easier for end users to work across formats, and last week at Build, Microsoft announced a partnership with Snowflake that added support for Iceberg to Fabric, its data-analytics platform, which previously supported only Delta Lake.
"We don't really think we need to pick a horse in the race," said Arun Ulagaratchagan, corporate vice president for Azure data at Microsoft. "Why should we argue about these things? If eventually it goes one way or another, we're fine."
Borgman agreed. As customers adopt open formats, "what happens is that we instead compete at a higher level, which I think for a while will be really the query-engine level," he said.
But there have been two main obstacles to the adoption of open formats: security and governance concerns. After all, when all corporate data is stored in a single pile, making sure that certain people have access to only certain parts of that pile gets trickier.
Even when using mature storage technology, "it's very hard for your average enterprise to answer a question like, 'who has access to a piece of data?'" Keller said. "You can go from having data in pockets and not being able to access it, to it all being in a swamp if you don't have a good strategy for how you're going to govern and manage it."
Given the number of competing projects, they could easily head in different directions when building out those features. However, "we're slowly seeing the industry agree on some governance and security constructs to have representations that different entities can share," Kleinerman said.
With the rise of AI models and huge training data sets, "the surface area to protect is now massive, and much bigger," Tunguz said. As Snowflake, Databricks, and their various partners compete for data business in the AI era, security and governance tools could become a much bigger selling point than support for one open format or another.