Building Production RAG Systems: What Google Taught Me
Google Cloud engineers walk through building a production-ready RAG agent, revealing the gap between demo code and systems that actually ship.
Written by AI. Bob Reynolds
March 30, 2026

Photo: Google Cloud Tech / YouTube
There's a moment in every technical tutorial when you can tell whether the presenters have shipped this thing to actual users or just got it working on their laptops. Google Cloud engineers Ayo Adedeji and Annie Wang hit that moment about twenty minutes into their seventy-three-minute walkthrough of building a production-ready RAG agent, when they casually mention the difference between OLAP and OLTP workloads.
Most tutorials would skip that distinction. Most tutorials are wrong.
The video covers what Google calls a "production-ready" RAG system—retrieval-augmented generation, the technique that lets AI models answer questions by first consulting a knowledge base rather than hallucinating from their training data. The presenters build the entire backend: vector database, embedding pipeline, retrieval system, deployment infrastructure. What makes it interesting is watching where complexity lives in these systems, and it's not always where you'd expect.
The Chunking Problem Nobody Talks About
RAG systems start with a problem that sounds mundane: how do you break up documents? The technical term is "chunking," and Adedeji explains why it matters more than you'd think.
"When you're working with these big unstructured documents, it doesn't make sense for various reasons to try to embed the entire document at once," he notes. "If you encode everything at once you can kind of dilute the semantic meaning of what's being encoded."
The team demonstrates recursive chunking—splitting text at sentence boundaries. It's a blunt instrument. As Adedeji acknowledges, "this is what you would call perhaps not context aware. You can imagine there might be meaningful context that's along multiple periods or multiple sentences." But blunt instruments ship, and perfect context-aware systems sit in research papers.
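The idea is simple enough to sketch in a few lines. This is a minimal stand-in for the sentence-boundary splitting described above, not the video's actual code; the function name, the regex, and the `max_chars` budget are all illustrative choices.

```python
import re

def chunk_by_sentences(text: str, max_chars: int = 200) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars.

    A rough sketch of sentence-boundary chunking; a real pipeline would
    likely use a library splitter and smarter overlap handling.
    """
    # Naive sentence split: break after terminal punctuation + whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk once adding this sentence would exceed the budget.
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Note that this is exactly the "not context aware" splitter Adedeji describes: a sentence that depends on its neighbors for meaning can land in a chunk without them.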
They're using Google's text-embedding-005 model, which converts each chunk into a 768-dimensional vector. Wang explains it plainly: "We have 768 dimension perspective to describe a text information." That's the semantic fingerprint that makes retrieval possible.
Two Databases, Two Jobs
Here's where the tutorial diverges from the usual "build a chatbot in fifteen minutes" content. The presenters implement RAG twice—first in BigQuery, then in Cloud SQL. This isn't redundancy; it's architecture.
BigQuery handles OLAP workloads: analytical processing, batch operations, the kind of work that takes seconds and processes millions of records. "These are analytical big data sets that take perhaps multiple seconds but that's the point," Adedeji explains.
Cloud SQL handles OLTP: real-time transactional processing with low latency requirements. "When you think about production, if we're doing RAG in production, we want to do that against an OLTP database rather than an OLAP data warehouse just because we want real-time results," he says.
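The same top-k lookup takes a different shape on each side. Here is a sketch of the two query patterns, assuming a BigQuery table queried with the `VECTOR_SEARCH` table function and a Cloud SQL for PostgreSQL table using the pgvector extension; the table names, column names, and parameter placeholders are invented, and the exact `VECTOR_SEARCH` argument names should be checked against the BigQuery docs.

```python
# Analytical (OLAP) shape: BigQuery's VECTOR_SEARCH table function,
# suited to batch scoring across millions of rows in one pass.
bigquery_sql = """
SELECT base.doc_id, base.chunk_text, distance
FROM VECTOR_SEARCH(
  TABLE demo.chunks, 'embedding',
  (SELECT embedding FROM demo.query_embeddings WHERE query_id = @query_id),
  top_k => 3, distance_type => 'COSINE')
"""

# Transactional (OLTP) shape: pgvector's cosine-distance operator (<=>)
# on Cloud SQL for PostgreSQL, suited to a single low-latency lookup
# on the hot path of a chat request.
cloudsql_sql = """
SELECT doc_id, chunk_text, embedding <=> %(query_vec)s AS distance
FROM chunks
ORDER BY distance
LIMIT 3
"""
```

The split mirrors the presenters' point: the first query is fine taking seconds over a warehouse; the second has to return in milliseconds, per request, every time.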
Most developers building their first RAG system wouldn't know this matters. They'd pick whatever database looked easiest and wonder why their production system felt slow. The gap between demo code and production code often lives in these architectural choices that only reveal themselves under load.
The Identity Problem
One detail stands out for its obscurity: connection identities. Wang walks through creating a connection between BigQuery and Google's embedding model, emphasizing that "most operations have some sort of identity. Whether it be a service or the connection between two different services you have to have an identity behind that."
This is infrastructure work that nobody photographs for their portfolio. It's also the work that breaks at three in the morning when you're trying to diagnose why your RAG system suddenly can't talk to your vector database. Security models, service accounts, connection permissions—the plumbing that makes systems actually run.
Semantic Search in Practice
The retrieval demonstration uses cosine distance to find the top three chunks most similar to a query. Wang explains the choice: "We always recommend something like cosine distance because it's more of a matter of similarity rather than just magnitude. Euclidean distance would factor in magnitude. What we mean by that is like the length of the document, things like that."
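Wang's point about magnitude is easy to verify numerically. In this toy example (three dimensions standing in for 768), two vectors point in the same direction but one is twice as long, the way a longer document can produce a "bigger" embedding without meaning anything different:

```python
import math

def cosine_distance(a, b):
    # 1 minus cosine similarity: sensitive to direction, not length.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    # Straight-line distance: grows with magnitude.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

short_doc = [1.0, 2.0, 3.0]
long_doc = [2.0, 4.0, 6.0]  # same direction, twice the magnitude

print(cosine_distance(short_doc, long_doc))    # ~0.0: "same meaning"
print(euclidean_distance(short_doc, long_doc))  # ~3.74: "far apart"
```

Cosine distance calls the two vectors essentially identical; Euclidean distance calls them far apart purely because of length. For semantic retrieval, the first behavior is the one you want.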
They test it with a gaming query about tactics against enemies that cause paralysis. The top result contains the exact phrase "paralyzing aura." It works, but Adedeji immediately contextualizes what this means: "There are certain things you cannot convert into structured queries. When it comes to things that have semantic meaning involved, embeddings and retrieval-augmented generation is your best bet of actually getting useful insights."
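Stripped of the database, the retrieval step itself is just "sort chunks by distance to the query, keep the top three." A toy version, with invented chunk labels and three-dimensional vectors standing in for real 768-dimensional embeddings:

```python
import math

def cosine_dist(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (na * nb)

# Pretend chunk embeddings; in the real system these come from
# text-embedding-005 and live in the vector database.
chunks = {
    "paralyzing aura tactics": [0.9, 0.1, 0.0],
    "crafting recipes":        [0.0, 0.8, 0.2],
    "boss arena layout":       [0.7, 0.2, 0.1],
    "patch notes":             [0.1, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]  # pretend embedding of the paralysis question

# Brute-force top-3 by cosine distance; a production store replaces
# this linear scan with an index.
top3 = sorted(chunks, key=lambda name: cosine_dist(query, chunks[name]))[:3]
print(top3)  # → ['paralyzing aura tactics', 'boss arena layout', 'patch notes']
```

The brute-force scan is fine for a demo; the whole reason pgvector indexes show up later in the tutorial is that scanning every row stops being fine in production.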
That qualification matters. RAG isn't a replacement for SQL analytics. It's a different tool for different problems. Knowing which tool to use requires understanding what problem you actually have.
What Production Means
The tutorial eventually moves into Apache Beam pipelines, pgvector indexing, and Cloud Run deployments. The transcript cuts off before the final "fight the boss" demonstration, but the pattern is clear: production readiness isn't one thing. It's containerization, scaling, security settings, batch processing, monitoring, connection management.
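On the pgvector piece specifically, "production" mostly means adding an approximate-nearest-neighbor index so the cosine-distance lookup stops scanning every row. A hypothetical sketch of that DDL, using pgvector's IVFFlat index with the cosine operator class; the table name, column name, and `lists` value are illustrative, not taken from the video:

```python
# IVFFlat clusters the vectors into `lists` buckets and searches only the
# nearest buckets at query time: faster, at the cost of exact recall.
create_index_sql = """
CREATE INDEX chunks_embedding_ivfflat
ON chunks
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
"""
print(create_index_sql.strip())
```

The operator class has to match the query: an index built with `vector_cosine_ops` accelerates `ORDER BY embedding <=> query`, not Euclidean lookups.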
"Without retrieval-augmented generation the model often times is limited to its pre-trained knowledge," Adedeji notes. "It may not have knowledge of your unique business documents." True enough. But the corollary is that making RAG work in production requires knowledge that most tutorials don't provide: which database for which workload, how to handle identities between services, why chunking strategy matters, what distance metrics to use.
Google is selling cloud services here, but they're doing it by showing the actual work. The video runs seventy-three minutes because that's how long it takes to build something that might actually survive contact with users. Whether that's reassuring or terrifying depends on whether you've already promised your stakeholders you'd have this shipped by next week.
Bob Reynolds is Senior Technology Correspondent for Buzzrag.
Watch the Original Video
How to Build a production-ready RAG AI agent
Google Cloud Tech
1h 13m