The only thing that hasn't changed about data in the nearly two decades that Vikas Ranjan has been shaping T-Mobile's data strategy is its strategic importance to the company.
Other than that, nearly everything has changed, as big data warehouses running Hadoop on premises gave way to cloud computing and the explosion of data-management options that has flourished in its wake. And now generative AI technology is changing the equation once again, but this time in a slightly different way.
Coding assistants like GitHub's Copilot could help data scientists who lack a traditional software-engineering background solve business problems much more quickly than before, said Ranjan, who now manages a platform team of developers and data engineers. "The Copilots of the world are now giving you some hints … they give you enough for you to understand what the problem is that you're trying to solve with the code."
In a recent interview, Ranjan spoke about the differences between Snowflake and Databricks, the concept of data observability, and how to solve the skills gap in data science.
This interview has been edited and condensed for clarity.
It feels like we've really moved from the height of the Big Data era maybe 10 years ago, through to the Snowflakes and Databricks and other cloud-focused companies that became mainstream about five years ago. Now it seems like people are starting to feel comfortable with these new approaches; they understand Snowflake, they understand Databricks, and these are well-understood ways to operate on data in the cloud. As T-Mobile thinks about the evolution of all of that, where are you planning for your data infrastructure to go over the next couple of years?
What we have seen is that both of these tools are really, really good. I'm not here to critique one versus the other. I'm just talking from my experience, and I've been a big user of both of these great tools in our ecosystem.
Telcos are much bigger in scale than many other industry verticals, and I think we are seeing a space for both. When we talk about heavy, complex, big volume data sets, the distributed processing engine is where I see Databricks playing a really critical role for us. You're talking about processing and curating hundreds of terabytes to a petabyte of data every day.
Then at the same time, Snowflake came in over the last, I would say, five or six years, looking (at data) more from a business-user perspective.
Not everybody is a data engineer; not everybody knows how to scan the data and make a business decision. Business users wanted to generate insights; they wanted to solve their business problems. This is where we saw Snowflake look at this as an opportunity: "Let's provide a tool which can help solve their problems rather than doing these complex data transformations and learning Spark and big (data) tools," and Snowflake works really, really well for them.
There is great value, if you're talking from a technology perspective, in more of a curation layer, more of an analytics layer. And if you want to make it simple and easy for the 80% of your user community who are not Spark users, Snowflake fits in really well.
Now, if you fast-forward from today to two years from now, I think this is a very interesting space, because in their announcements last month, both companies are kind of doing the same thing. The way I see this evolving over the next two years, it all depends on the enterprise, the scale and the volumes you're dealing with, and the competency and the skills they have on their teams.
If you have advanced engineers with really strong skills in Scala and Java, dealing with high-volume, high-scale data sets, Databricks has an advantage there. At the same time, if your users are more SQL-based users, SQL data warehouse users, and good with Python, I think Snowflake will really serve most of the use cases. So the way I see it, the future is going to be a combination of both.
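To make that split concrete, here is a rough, hypothetical sketch of the same daily usage rollup written both ways: once with the Spark DataFrame API a Databricks engineer would use, and once as the kind of warehouse SQL a Snowflake analyst would write. All table, column, and path names are invented for illustration and are not from the interview.

```python
# Hypothetical example: one daily-usage rollup, written two ways.
# Table, column, and path names (network_events, bytes_used, etc.) are invented.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("usage-rollup").getOrCreate()

# Spark/Databricks style: a data engineer works with the DataFrame API
# over large Parquet data sets.
events = spark.read.parquet("s3://example-bucket/network_events/")
rollup = (
    events
    .withColumn("event_date", F.to_date("event_ts"))
    .groupBy("event_date", "subscriber_id")
    .agg(F.sum("bytes_used").alias("total_bytes"))
)
rollup.write.mode("overwrite").parquet("s3://example-bucket/usage_rollup/")

# Warehouse/Snowflake style: a SQL analyst expresses the same rollup
# as a single statement against a curated table.
warehouse_sql = """
    SELECT CAST(event_ts AS DATE) AS event_date,
           subscriber_id,
           SUM(bytes_used)        AS total_bytes
    FROM   network_events
    GROUP  BY 1, 2
"""
```

The logic is identical; the difference is in who can comfortably write and maintain it, which is the skills question Ranjan raises.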
Earlier you mentioned data observability. I'm familiar with observability from the systems infrastructure standpoint, but can you explain a little bit more about how that works in the world of data?
I think a lot of people have different understandings of what data observability means. Traditionally, when we talk about observability, as companies have been doing on the systems and infrastructure side for quite a long time, it was more about looking at the health of the infrastructure, looking at alerts, and looking at optimization of the infrastructure. In the data space, I look at this as four dimensions.
The first dimension is looking at this purely from the perspective of the operators of these data pipelines, all the data developers or engineers who are building them. The first and foremost (priority) for them is to identify issues with the data pipelines. Is a pipeline running slower? Are there transformation rules that are not applied correctly? Is the data not providing the insights they are looking for?
The second side of looking at observability is more around data reliability or data quality. Do we have the right reconciliation rules in place? Are we seeing any drift in the data? Are we seeing any drift in the schema? Was my data supposed to solve problem A, but it gave me result B?
I think the third is more about cost. With the cloud, you're constantly looking at ways to optimize your capacity and your processing, looking at it in different ways to reduce waste and improve the efficiency of your pipelines.
And the fourth dimension is looking at it from the user's perspective. This is more about providing those insights to all the different personas from a business perspective when they're looking at the data.
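The second dimension, data reliability, often comes down to simple automated checks run alongside the pipeline. Below is a minimal, hypothetical sketch in PySpark of the kind of reconciliation and schema-drift checks described above; the paths, table layout, and 1% threshold are assumptions for illustration, not T-Mobile's implementation.

```python
# Minimal sketch of reconciliation and schema-drift checks.
# Paths, column names, and the 1% threshold are assumptions for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("observability-checks").getOrCreate()

source = spark.read.parquet("s3://example-bucket/raw/events/")
target = spark.read.parquet("s3://example-bucket/curated/events/")

# Reconciliation: did the pipeline drop or duplicate rows?
source_count, target_count = source.count(), target.count()
if abs(source_count - target_count) > 0.01 * source_count:
    raise ValueError(
        f"Row-count drift: source={source_count}, target={target_count}"
    )

# Schema drift: did an upstream change add, remove, or retype a column?
expected_schema = {
    "subscriber_id": "string",
    "event_ts": "timestamp",
    "bytes_used": "bigint",
}
actual_schema = {f.name: f.dataType.simpleString() for f in target.schema.fields}
if actual_schema != expected_schema:
    raise ValueError(f"Schema drift detected: {actual_schema}")
```

Checks like these would typically run after each pipeline load and feed the alerts and dashboards that the first dimension, pipeline health, relies on.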
Are data engineering roles going to converge, where data people will need to get better at business cases and business users will need to understand data? Or are they distinct enough that they really are separate disciplines and need to be treated separately?
Traditionally, what we have seen is that the successful data engineers — when we're talking about high-scale solutions, high volumes, and complex problems — are the ones who had traditional backgrounds in Java or Scala, some of these distributed, object-oriented languages. So they were more software engineers who then started playing with data, but the guiding principles and the core foundation remained the same. You have to understand how your JVMs work, you have to understand how your infrastructure works, you have to really figure out memory utilization, optimization and whatnot.
And then you have the second side of the organization; these were more like SQL workers. They are really, really good at writing complex SQL transformations to present this data: some aggregates, some KPIs, some business outcomes and results. Power users.
And there is always a big gap between the skills on both sides, right? SQL developers are really good in SQL, but they really do not understand how to work in Java and Scala and how distributed processing works. They can pick up Python, but again, that's all they can do.
Now what I've really become fascinated with — and I think pretty much everybody is fascinated with seeing this promise — is where Gen AI is coming into the picture. If you look at the last year and a half since the OpenAI announcements, I think every single company is now doing some acquisitions in this space, and kind of deciphering this, opening up this big area of opportunity for our developer communities.
We have been looking at tools like Copilot. Say a SQL developer who is not a Java developer, or even an ordinary user, comes across a complex transformation written in Java. The user is trying to understand what it means, and they want to use that transformation logic to write a simple SQL statement for some XYZ business problem. The Copilots of the world are now giving you some hints … they give you enough for you to understand what the problem is that you're trying to solve with the code.
It's building a knowledge base, it's building a community, it's building this whole thing with the aim of increasing the velocity of your developers. If your development time traditionally was 12 hours to write complex code, you'd probably be able to write it in like five hours now. And then you can hand this over to users and developers who are new to the team. So I see that as an opportunity where the convergence will happen between the two traditional user groups, the developers and the consumers.
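As a hypothetical version of that Copilot scenario: suppose the original Java code loops over usage records and buckets subscribers into tiers. A coding assistant can explain that intent and suggest an equivalent SQL statement, which the SQL developer can then run directly, for example through Spark SQL. All names and thresholds below are invented for illustration; this is a sketch of the workflow Ranjan describes, not T-Mobile's code.

```python
# Hypothetical: a Java transformation looped over usage records and bucketed
# subscribers into "light", "medium", and "heavy" tiers. A coding assistant can
# surface that intent and propose the equivalent SQL, which a SQL developer
# runs directly. Table, column names, and thresholds are invented, and the
# usage_events table is assumed to already be registered in the catalog.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("copilot-translation").getOrCreate()

tiered = spark.sql("""
    SELECT subscriber_id,
           SUM(bytes_used) AS monthly_bytes,
           CASE
               WHEN SUM(bytes_used) < 1e9  THEN 'light'
               WHEN SUM(bytes_used) < 50e9 THEN 'medium'
               ELSE 'heavy'
           END AS usage_tier
    FROM   usage_events
    WHERE  event_month = '2024-06'
    GROUP  BY subscriber_id
""")
tiered.show()
```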