Behind every breakthrough in enterprise technology over the past few decades you'll find a database that emerged to meet the needs of applications that had evolved beyond what traditional products could provide. This year, as engineering managers and CIOs are being asked to articulate a generative AI strategy in the middle of a hype cycle for the ages, the vector database is having its coming-out party.
Vector databases are ideal for generative AI applications because they allow companies to search for relationships between their unstructured data points and help their large language models remember those relationships over time. The concept is not exactly new; recommendation engines have been telling you for decades that you might want to watch The Wire because you liked The Sopranos, but older databases can't handle the explosion of different types of data that we've seen in recent years.
Vector databases are designed to handle these unique kinds of information retrieval tasks "in the same way that in your brain, the way you remember faces and the way you remember, I don't know, poetry, is completely different," said Pinecone founder and CEO Edo Liberty during a recent panel discussion. "It's just organized differently. It's accessed differently. It's a different kind of data."
In an otherwise down year for venture investing, hundreds of millions of dollars are being poured into vector database startups like Pinecone, which raised $100 million from manifesto-generation machine Andreessen Horowitz in April. Established database companies like MongoDB, PlanetScale, and even Oracle are adding vector search capabilities to their existing products in hopes of capitalizing on the trend.
"I think the opportunity that the market has is if you look at the size and growth of unstructured data within an enterprise — I've seen analyst reports saying that it's growing at 3x the pace of any other type of data — the volumes of unstructured data that with some of this new technology can be made harvestable is massive," said Michael Gilfix, chief product and engineering officer for KX, which has adapted its work on time-series databases into the vector database world.
What's your vector
When large language models are trained on massive data sets, they spit out what they've learned as vectors, "a quantity that has magnitude and direction" that can be represented on a graph. Those vectors help establish relationships between words, phrases, and sentences to understand how they are both similar and different.
After those relationships are established, the models can rely on the vectors to help them draw lines between existing data and a new input that doesn't match anything in that data set. When someone types a query into an AI model, the model processes that query into a vector and then searches across the vectors already stored in a vector database to figure out a directionally correct answer.
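The mechanics of that search can be sketched in a few lines. This is a toy illustration, not any vendor's implementation: the made-up three-dimensional vectors stand in for real embeddings, which typically have hundreds or thousands of dimensions, and the brute-force scan stands in for the approximate indexes real vector databases use.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: closer to 1.0 means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "database" of stored embeddings (invented values for illustration).
stored = {
    "the sopranos": [0.90, 0.10, 0.20],
    "the wire":     [0.85, 0.15, 0.25],
    "cooking show": [0.05, 0.90, 0.30],
}

def nearest(query_vector, k=1):
    """Brute-force search: rank every stored vector by similarity to the query."""
    ranked = sorted(stored.items(),
                    key=lambda item: cosine_similarity(query_vector, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

print(nearest([0.88, 0.12, 0.22], k=2))  # the two crime dramas rank highest
```

A production system replaces the linear scan with an approximate nearest-neighbor index so the search stays fast across billions of vectors.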
Yury Malkov, author of a widely used index for implementing vector search called HNSW, loosely compared the process to the famous "six degrees of separation" research conducted by Stanley Milgram to plot social graphs. "Vector representations are natural representations for neural networks," he said, and they make LLMs practical because it would be prohibitively expensive and time-consuming to run each new query through the training model every time.
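The small-world analogy can be made concrete. Real HNSW layers several navigable graphs on top of each other; the single-layer greedy walk below, over invented 2D points, is only an illustration of the routing idea: hop to whichever neighbor is closer to the query until no neighbor improves.

```python
import math

# Toy "navigable graph": each point links to a few neighbors, like
# acquaintances in Milgram's six-degrees experiment. (Invented data.)
points = {
    "a": (0.0, 0.0), "b": (1.0, 0.0), "c": (2.0, 0.0),
    "d": (2.0, 1.0), "e": (3.0, 1.0),
}
neighbors = {
    "a": ["b"], "b": ["a", "c"], "c": ["b", "d"],
    "d": ["c", "e"], "e": ["d"],
}

def greedy_search(start, query):
    """Hop to whichever neighbor is closer to the query until none is."""
    current = start
    while True:
        best = min(neighbors[current],
                   key=lambda n: math.dist(points[n], query))
        if math.dist(points[best], query) >= math.dist(points[current], query):
            return current
        current = best

print(greedy_search("a", (3.0, 1.2)))  # walks a -> b -> c -> d -> e
```

Each hop touches only a handful of candidates, which is why graph-based indexes scale to billions of vectors where an exhaustive comparison could not.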
Those models also can't retain the output produced by new queries; they can only fall back on their training sets. Vector databases allow companies to store new queries specific to their businesses over time as vectors and provide that information to AI models for more accurate and timely results.
Jonathan Ellis, co-founder and CTO of DataStax, provided an example. The LLM that kicked off this generative AI boom, OpenAI's GPT-4, can't tell you anything that has happened after September 2021, the cut-off date for its training data. If you asked GPT-4 why the Los Angeles Clippers have struggled so much in recent years despite the presence of several superstars, it would most likely hallucinate an answer given the gap in its training data, he said.
However, if you indexed several news articles about the Clippers written in recent years in a vector database and directed GPT-4 to access that data when asking the question, "you get much more specific, much higher quality results and it doesn't feel the need to make stuff up, because now it's grounded in reality from that context," he said. (The answer is they can't stay healthy.)
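The retrieve-then-prompt pattern Ellis describes can be sketched simply. Everything here is a stand-in: the two one-line "articles" are invented, and word overlap substitutes for real embedding similarity so the example runs without a model.

```python
# Toy corpus (invented one-line stand-ins for indexed news articles).
articles = [
    "The Clippers stars missed large stretches of recent seasons with injuries.",
    "The Lakers won the 2020 NBA championship in the Orlando bubble.",
]

def retrieve(question, docs, k=1):
    """Rank documents by shared words with the question.

    A real system would embed both and compare vectors; word overlap is
    just a cheap stand-in for that similarity score."""
    q_words = set(question.lower().split())
    scored = sorted(docs,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(question, docs):
    """Splice the most relevant documents into the prompt as grounding context."""
    context = "\n".join(retrieve(question, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

prompt = build_prompt("Why have the Clippers struggled in recent seasons", articles)
```

The model now answers from the supplied context rather than guessing from stale training data, which is the grounding effect Ellis describes.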
Plotting a course
As companies scramble to figure out how generative AI can have an impact on their businesses, vector databases will play a key role in marrying the power of these LLMs to the unstructured data sets produced by those companies in the course of business.
One question that has yet to be resolved, however, is whether companies will be able to get away with layering vector search capabilities onto their existing databases, or whether they'll need a purpose-built vector database to unlock the technology's full potential. Believe it or not, vector database startups want you to choose the best-of-breed option, while existing vendors think they can provide a more comprehensive product.
Businesses that are just starting to experiment with vector search will probably be fine working with their existing database providers, but only to a certain point, said Andre Zayarni, co-founder and CEO of vector database startup Qdrant.
"As soon as your use cases start to grow beyond a few million data points, you'll probably need a dedicated solution," he said. That might sound like a lot, but even the smaller LLMs have billions of data points.
The question is a little different for Ellis and DataStax, which is best known for its NoSQL databases based on Apache Cassandra. The company is layering vector search capabilities into its Astra and DataStax Enterprise products, and thinks customers will be able to see a difference right away.
For older relational database vendors, "it's a square peg in a round hole to do this kind of vector search," Ellis said. However, he believes "that because of how Cassandra works under the hood and how we built a pluggable indexing system, we can do this in a much more efficient and much more performant way, while still delivering the kinds of CRUD applications — create, retrieve, update, delete — that people expect from a normal database."
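The combination Ellis is pointing at, ordinary CRUD operations on rows that also carry an embedding, can be sketched as a toy class. This is an illustration of the interface, not DataStax's design: a real database would back `search` with a pluggable index rather than a linear scan.

```python
import math

class TinyVectorStore:
    """Sketch of a store mixing CRUD with vector search (linear scan for clarity)."""

    def __init__(self):
        self.rows = {}  # row_id -> (payload, embedding)

    def create(self, row_id, payload, vector):
        self.rows[row_id] = (payload, vector)

    def read(self, row_id):
        return self.rows[row_id][0]

    def update(self, row_id, payload, vector):
        self.rows[row_id] = (payload, vector)

    def delete(self, row_id):
        del self.rows[row_id]

    def search(self, query, k=1):
        """Return the ids of the k rows whose embeddings best match the query."""
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            return dot / (math.sqrt(sum(x * x for x in a)) *
                          math.sqrt(sum(y * y for y in b)))
        ranked = sorted(self.rows,
                        key=lambda rid: cosine(query, self.rows[rid][1]),
                        reverse=True)
        return ranked[:k]

store = TinyVectorStore()
store.create("doc1", "injury report", [0.9, 0.1])
store.create("doc2", "box score",     [0.1, 0.9])
store.update("doc1", "injury report, revised", [0.95, 0.05])
print(store.search([1.0, 0.0]))  # the revised injury report matches best
```

The hard part vendors are competing on is making that `search` call fast and fresh while deletes and updates keep mutating the index underneath it.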
Enterprise customers need to know that the investments they make in generative AI technologies will still work several years down the road, and vector databases will go a long way toward making sure they have that continuity, KX's Gilfix said.
"It isn't just the finding of the information, they have to put that information in context. and they've got to figure out how the fact that their information changes over time so they can give the right answer," he said. "That's something that the data layer has to help."