Why Hudi is key to data format unity

Today: Hudi was lost in last month's discussions of open data formats, but its tech will enable true format compatibility; Stack Overflow's annual developer survey; and the latest moves in enterprise tech.


Sometimes a notion

Databricks CEO Ali Ghodsi laid out a compelling vision for enterprise data storage buyers last month as he described the strategy behind his company's purchase of Tabular. Founded by the creators of the Apache Iceberg open data format, Tabular will allow Databricks to work on making Iceberg, widely adopted by other data companies including archrival Snowflake, compatible with the Delta Lake format born inside Databricks, Ghodsi said in a press conference at its Data & AI Summit in June.

Ghodsi's unification drive appeared to leave out the first open data format to pave the way toward the creation of the data lakehouse: Apache Hudi. Hudi, which was developed inside Uber and is actually older than Delta Lake or Iceberg, has many similarities with the other two and is widely used by companies like Walmart and Notion that depend on real-time data.

But in a recent interview, Onehouse CEO Vinoth Chandar, who created Hudi at Uber, pointed to an announcement released two days after the hubbub of the Tabular acquisition.

Hudi was born from a problem Uber encountered as it went through a period of explosive growth, Chandar said.

  • As more and more users started using Uber to book trips, the data needed to calculate how long it would take a driver to pick up a user and travel to their destination was in constant flux thanks to traffic, construction, or road closures.
  • Uber was using both a data warehouse and a Hadoop data lake at the time, but neither one could handle this new type of data, which needed to be updated nearly every second.
  • The data lake scaled easily as mountains of data came in, but it couldn't process that data fast enough for Uber's needs. The data warehouse was quite snappy, but it could handle only limited amounts of data at a time.
  • So Uber created Hudi to address those concerns, and after it saw a lot of interest from the open-source community, Uber transferred the project to Apache in 2019.

Hudi was by far the best format choice for Notion's needs when it was looking to modernize its data strategy two years ago, said Daniel Sternberg, head of data.

  • Notion's note-taking and productivity app needs to keep track of an enormous number of updates to its database as users edit their documents, a problem similar to the one that led to the creation of Hudi.
  • For fast-growing companies like Notion, format compatibility isn't as important a consideration, said XZ Tie, engineering lead.
  • "I care less about specific formats, I care more about what I can get out of it, the scalability needed to operate and the compatibility and integration with different tools," Tie said.

But format compatibility is much more important for larger companies that have multiple business units with different technology requirements or need to integrate recent acquisitions into their tech stacks. It also appeals to companies looking to cut costs and simplify their data infrastructure.

  • "If we're getting to a place where a higher percentage of Notion's data beyond just these blocks lives in our lake, and we want to be able to externally reference and join it with data that's already in the warehouse, then I start to care a lot more about that kind of compatibility," Sternberg said. "Maybe I would trade some level of workload efficiency for compatibility, but we're not quite there at this moment in time."

Read the rest of the full story on Runtime here.


Survey says...

After years of generative AI hype, 63% of professional software developers are using AI coding tools, an increase from a year ago, according to Stack Overflow's annual developer survey results. That doesn't mean they're fully comfortable with those tools.

When asked how they feel about AI developer tools overall, 72% rated them very favorable or favorable, which is a pretty good satisfaction rating. But that's down from the 77% of developers who felt the same way last year, and an overwhelming majority of those surveyed said AI development tools have trouble handling complex tasks.

Still, an overwhelming majority also said their companies plan to increase the use of AI tools in their development workflows next year. The entire survey is always worth a read for insights into what developers really think about developer tools, programming languages, databases, and cloud providers, and can be found here.


Enterprise moves

Nadav Zafrir is the new CEO of Check Point Software, with plans to take over the role from founder Gil Shwed in December.

Larry D’Angelo is the new chief revenue officer at DigitalOcean, joining the cloud provider from Rapid7.

Pranay Ahlawat is the new chief technology and AI officer at Commvault, a newly created role for the former partner at Boston Consulting Group.

Jason Wakeam is the new chief revenue officer at Backblaze, also a newly created position for the former vice president of global sales at SnapLogic.

Manish Gupta and Marilyn Miller are the new chief marketing officer and chief people officer, respectively, at LaunchDarkly.


The Runtime roundup

ServiceNow's strong earnings performance in the second quarter was eclipsed by the resignation of president and chief operating officer CJ Desai, who acknowledged violating internal hiring practices as part of an ongoing internal investigation into the company's pursuit of a government contract.

IBM beat Wall Street expectations for revenue and profit, but it's still only growing at 1.9%.

CrowdStrike sent $10 Uber Eats cards to partners as an apology for forcing them to reboot so many Windows PCs, thanks to what it acknowledged Wednesday was a bad update rolled out all at once.


Thanks for reading — see you Tuesday!
