Databricks CEO Ali Ghodsi laid out a compelling vision for enterprise data storage buyers last month as he described the strategy behind his company's purchase of Tabular.
Founded by the creators of the Apache Iceberg open data format, Tabular will allow Databricks to work on making Iceberg, widely adopted by other data companies including archrival Snowflake, compatible with the Delta Lake format born inside Databricks, Ghodsi said in a press conference at its Data & AI Summit in June. He described that goal — which will take years of work — as a step toward creating something akin to a "USB format" for open data.
However, he made no mention of Apache Hudi, the first open table format to pave the way toward the creation of today's standard for modern enterprise data storage: the data lakehouse. Hudi, which was developed inside Uber and is actually older than Delta Lake or Iceberg, has many similarities with the other two and is widely used by companies like Walmart and Notion that depend on real-time data.
Ghodsi's unification drive appeared to leave Hudi on the outside, but in a recent interview OneHouse CEO Vinoth Chandar, who created Hudi at Uber, pointed to an announcement released two days after the hubbub of the Tabular acquisition. Databricks' Jonathan Brito and OneHouse's Kyle Weller announced that XTable, an open-source project originally developed inside OneHouse with backing from Google and Microsoft, now works with a Delta Lake feature called UniForm.
XTable is currently incubating inside the Apache Software Foundation, which means it's not ready to go into production at most enterprises, but it allows users to convert data stored in one format to either of the other two. UniForm, on the other hand, was designed to allow companies using Iceberg or Hudi to read data stored in the Delta Lake format, and while there are still plenty of hurdles, a partnership between the two projects could be the road map to Ghodsi's vision of a "USB format."
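In practice, UniForm is switched on per table through Delta table properties. The PySpark sketch below is a rough illustration based on the properties shown in Databricks' documentation; the table name and schema are hypothetical, exact property names can vary by Delta Lake or Databricks runtime version, and the session is assumed to have Delta Lake configured.

```python
# Hedged sketch: enabling UniForm on a Delta table so Iceberg-compatible engines
# can read it. Assumes a Spark session with Delta Lake set up (e.g., on Databricks).
# Table name and columns are hypothetical; property names may differ by version.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("uniform-sketch").getOrCreate()

spark.sql("""
    CREATE TABLE ride_events (event_id STRING, eta_seconds INT)
    USING DELTA
    TBLPROPERTIES (
      'delta.enableIcebergCompatV2' = 'true',
      'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")
# After each Delta commit, UniForm generates Iceberg metadata alongside the Delta
# transaction log, so an Iceberg reader sees the same underlying Parquet files.
```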
That could allow Delta Lake and Iceberg shops to take advantage of some of the features of Hudi that its users can't find in other formats, such as its support for workloads that require a lot of updates, said Notion's head of data Daniel Sternberg.
"These are all interesting decisions you have to make when choosing one of these competing open-source formats," he said. "And you make a bet based on what fits your needs and your workloads the best, and what type of adoption this has across the industry."
Traffic jam
Hudi was born from a problem Uber encountered as it went through a period of explosive growth, Chandar said. As more and more riders booked trips through Uber, the data needed to calculate how long it would take a driver to pick up a user and travel to their destination was in constant flux thanks to traffic, construction, or road closures.
Uber was using both a data warehouse and a Hadoop data lake at the time, but neither one could handle this new type of data, which needed to be updated nearly every second. The data lake scaled easily as mountains of data came in, but it couldn't process that data fast enough for Uber's needs, while the data warehouse was quite snappy but could only handle limited amounts of data at a time.
"Every industry moved toward, "oh, I can't shove files into the cloud anymore and plug in a data lake, I have to do some management around it,'" Chandar said.
So Uber created Hudi to address those concerns, and after the project drew a lot of interest from the open-source community, Uber transferred it to the Apache Software Foundation in 2019. In 2021, Chandar founded OneHouse, which has raised $68 million in funding, around the idea of an open data lakehouse that can work with data stored in any of the three major formats.
When Notion was looking to modernize its data strategy two years ago, Hudi was by far the best format choice for its needs, Sternberg said. Notion's note-taking and productivity app needs to keep track of an enormous number of updates to its database as users edit their documents, a problem similar to the one that led to the creation of Hudi.
At that time, Iceberg lacked an open-source way to connect to Apache Kafka, an event-streaming project, and Delta Lake was getting a lot of flak from users who believed the project was designed to advance the interests of Databricks over those of its open-source community. Hudi's open-source bona fides were clear, and that was a big plus for Notion and its determination to use as much open-source software as possible across its infrastructure as it grows, Sternberg said.
Come together
The three major open formats have a lot in common because they are all based on Parquet, a widely used columnar file format that serves as the storage layer underneath each of the table formats. However, there are still substantial enough differences in the way they read and write data that some sort of common denominator is required for them to be interoperable.
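That shared foundation is visible on disk: the data files inside a Delta, Iceberg, or Hudi table are ordinary Parquet files, and what distinguishes the formats is the metadata kept alongside them. A minimal illustration, with a hypothetical file path:

```python
# The data files under a Delta, Iceberg, or Hudi table directory are plain Parquet;
# the formats differ in the metadata next to them (_delta_log/ for Delta, metadata/
# for Iceberg, .hoodie/ for Hudi). Reading a data file directly, as below, bypasses
# that transactional metadata — which is why a common translation layer is needed.
import pyarrow.parquet as pq

table = pq.read_table("/tmp/lake/trip_etas/part-00000.parquet")  # hypothetical path
print(table.schema)
```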
For fast-growing companies like Notion, format compatibility isn't as important a consideration, said XZ Tie, an engineering lead at Notion.
"I care less about specific formats, I care more about what I can get out of it, the scalability needed to operate and the compatibility and integration with different tools," Tie said.
Format compatibility is much more important for larger companies that have multiple business units with different technology requirements or need to integrate recent acquisitions into their tech stacks. It also appeals to companies looking to cut costs and simplify their data infrastructure.
Before Snowflake embraced the open data format concept over the last couple of years, its proprietary storage formats essentially locked customers into its query engine. Moving to open formats allows companies to store their data where and how they like, but managing data stored in different, incompatible open formats across an enterprise remains complex.
"If we're getting to a place where a higher percentage of Notion's data beyond just these blocks lives in our lake, and we want to do be able to externally reference and join it with data that's already in the warehouse, then I start to care a lot more about that kind of compatibility," Sternberg said. "Maybe I would trade some level of workload efficiency for compatibility, but we're not quite there at this moment in time."
Databricks, Snowflake, OneHouse, and their cloud computing partners will have a lot of work ahead of them trying to bring UniForm and XTable together. At the moment, the projects allow data to be read across different formats but "they do not address the significant distinctions between these formats on the write side," according to Alex Merced, a developer advocate at Dremio.
And there's also the political wrangling needed to bring three formats together in a way that doesn't favor any one vendor. That will require Databricks and Snowflake to turn down the heat on their long-simmering feud.
But Chandar sees a lot of progress in a short amount of time.
"At the end of 2022 we had zero products on interoperability. Now we have XTable and UniForm, two products on interoperability," he said. "We look at what the users are saying in the open-source community, and then we are going to continue building the technology out."