You are in an AI engineer interview at Google.
The interviewer asks:
"Our data is spread across several sources (Gmail, Drive, etc.)
How would you build a unified query engine over it?"
You: "I'll embed everything in a vector DB and do RAG."
Interview over!
Here's what you missed:
Devs treat context retrieval like a weekend project.
Their mental model is simple: "Just embed the data, store it in a vector DB, and call it a day."
This works beautifully for static sources.
But no real-world workflow looks like this.
To understand better, consider this query:
"What's blocking the Chicago office project, and when's our next meeting about it?"
Answering this single query requires searching across sources like Linear (for blockers), Calendar (for meetings), Gmail (for emails), and Slack (for discussions).
No naive RAG setup with data dumped into a vector DB can answer this!
To actually solve this problem, you'd need to think of it as building an agentic context-retrieval system with three critical layers:
> Ingestion layer:
- Connect to apps without auth headaches.
- Process different data sources properly before embedding (email vs code vs calendar).
- Detect if a source is updated and refresh embeddings (ideally, without a full refresh).
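Processing each source properly matters more than it sounds: an email, a code file, and a calendar event all need different chunking before embedding. A minimal sketch of that dispatch (all function names and splitting heuristics here are illustrative assumptions, not Airweave's API):

```python
# Hypothetical sketch: route each document to a source-appropriate
# pre-processing strategy before embedding.

def chunk_email(doc: str) -> list[str]:
    # Emails: split on quoted-reply markers so each message is one chunk
    # (a crude heuristic; real parsers handle many reply formats).
    return [part.strip() for part in doc.split("\nOn ") if part.strip()]

def chunk_code(doc: str) -> list[str]:
    # Code: split on blank lines between top-level blocks, a rough proxy
    # for function/class boundaries.
    return [block for block in doc.split("\n\n") if block.strip()]

CHUNKERS = {"gmail": chunk_email, "github": chunk_code}

def preprocess(source: str, doc: str) -> list[str]:
    # Unknown sources fall back to embedding the whole document as one chunk.
    chunker = CHUNKERS.get(source, lambda d: [d])
    return chunker(doc)
```

The point of the dispatch table: adding a new source means adding one chunker, not rewriting the pipeline.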
> Retrieval layer:
- Expand vague queries to infer what users actually want.
- Direct queries to the correct data sources.
- Layer multiple search strategies like semantic-based, keyword-based, graph-based.
- Ensure users retrieve only what they're authorized to see.
- Weigh old vs. new retrieved info (recent data matters more, but old context still counts).
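Two of those bullets, layering search strategies and weighing old vs. new info, can be combined in one scoring step. Here's a minimal sketch that merges ranked lists from a semantic index and a keyword index via reciprocal rank fusion, then applies an exponential recency decay (all names and the 30-day half-life are illustrative assumptions):

```python
# Hypothetical sketch of hybrid search scoring: fuse two ranked result
# lists, then down-weight stale documents without zeroing them out.

def rrf_scores(ranked_ids: list[str], k: int = 60) -> dict[str, float]:
    # Reciprocal rank fusion: earlier rank -> larger score; k dampens
    # the gap between top ranks.
    return {doc_id: 1.0 / (k + rank) for rank, doc_id in enumerate(ranked_ids, start=1)}

def fuse(semantic: list[str], keyword: list[str],
         ages_days: dict[str, float], half_life_days: float = 30.0) -> list[str]:
    scores: dict[str, float] = {}
    for partial in (rrf_scores(semantic), rrf_scores(keyword)):
        for doc_id, score in partial.items():
            scores[doc_id] = scores.get(doc_id, 0.0) + score
    # Recency decay: a doc's score halves every `half_life_days`,
    # so old context still counts but recent data matters more.
    for doc_id in scores:
        scores[doc_id] *= 0.5 ** (ages_days.get(doc_id, 0.0) / half_life_days)
    return sorted(scores, key=scores.get, reverse=True)
```

A document that both indexes agree on rises to the top even if neither ranked it first; that's the usual argument for rank fusion over raw score mixing.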
> Generation layer:
- Provide a citation-backed LLM response.
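The simplest way to get citation-backed answers is to number the retrieved chunks in the prompt and instruct the model to cite them inline. A minimal sketch (the prompt wording and `chunks` shape are my assumptions, not a specific framework's API):

```python
# Hypothetical sketch: assemble a prompt that forces the LLM to cite
# the numbered source chunks it was given.

def build_cited_prompt(question: str, chunks: list[dict]) -> str:
    context = "\n".join(
        f"[{i}] ({c['source']}) {c['text']}" for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer using ONLY the sources below. After every claim, cite the "
        "supporting source as [n]. If no source supports a claim, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```

Because each chunk keeps its index, the `[n]` markers in the model's answer can be mapped back to source links in the UI.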
That's months of engineering before your first query works.
It's definitely a tough problem to solve...
...but this is precisely how giants like Google (in Vertex AI Search), Microsoft (in M365 products), AWS (in Amazon Q Business), etc., are solving it.
If you want to see it in practice, this approach is actually implemented in Airweave, a recently trending 100% open-source framework that provides the context retrieval layer for AI agents across 30+ apps and databases.
It implements everything I mentioned above:
- How to handle authentication across apps.
- How to process different data sources.
- How to gather info from multiple tools.
- How to weigh old vs. new info.
- How to detect updates and do real-time sync.
- How to generate Perplexity-like citation-backed responses, and more.
For instance, to detect updates and trigger a re-sync, you might compare timestamps.
But a timestamp doesn't tell you whether the content actually changed (maybe only a permission was updated), so you may still re-embed everything unnecessarily.
Airweave handles this with source-specific change-detection techniques: entity-level hashing, file-content hashing, cursor-based syncing, and more.
You can see the full implementation on GitHub and try it yourself.
But the core insight applies regardless of the framework you use:
Context retrieval for agents is an infrastructure problem, not an embedding problem.
You need to build for continuous sync, intelligent chunking, and hybrid search from day one.
I have shared the Airweave repo in the replies!