Multi Tenant RAG: The Retriever Is the Security Boundary

🗓️ Last updated: June 2026

Every B2B SaaS company is shipping some version of the same product right now: "ask questions about your data." Underneath the marketing copy, it is a multi tenant retrieval augmented generation system. Ingest documents per customer, embed them, store them, retrieve them at query time, hand the snippets to a model, return a grounded answer. The architecture diagram is everywhere. The hard problems, as is traditional, are not in it.

The hard problems surface the first time a customer's CISO asks: "How do you guarantee that a query from Acme Corp can never, under any circumstances, surface a chunk from Globex?" If the answer is "we filter by tenant ID in the application layer," the conversation is already lost. In multi tenant RAG, the retriever is the boundary, and any control sitting above the retriever is theatre.

This is the third article in System Design, Reimagined. Article 1 added a synthesis layer on top of classical search. Article 2 put the model in the loop of a typing user. This article does something different: it takes a system that used to be a per tenant document search engine and shows how the retrieval layer becomes the single most security critical component the moment an LLM is reading its output.

Requirements: Permissions Are Part of the Index

The functional requirements look familiar from any document search product:

Customers upload documents (PDFs, HTML pages, Markdown, transcripts, Confluence exports, source code)
A user submits a natural language question
The system returns a grounded answer with citations to specific document spans
Retrieval latency is bounded; total time to first token is in the low seconds

The nonfunctional requirements are where the architecture lives. Five of them are nonnegotiable in this product class:

Strict tenant isolation. Cross tenant data leakage is the highest severity bug a B2B SaaS company can ship. It also voids contracts.
Per document ACLs. Even inside one tenant, users have different permissions. A retrieval result that a user is not allowed to see is itself a leak.
Freshness with eventual consistency tolerance. Newly uploaded documents must become retrievable within a reasonable window, but the system cannot block ingestion on the index being fully up to date.
Citation fidelity. Every claim in the generated answer must trace to a specific span of a specific document the user has read access to. No citation, no claim.
Audit. Who asked what, what got retrieved, what got cited, when. Frequently required by compliance regimes.

The first two requirements are the article. Everything else is consequence.

The Classical Baseline: Document Search Without Synthesis

Before RAG, a multi tenant document search product looked like this:

Ingestion parsed documents, extracted text, normalised encodings, and pushed records into an inverted index (typically Elasticsearch, OpenSearch, or Solr).
Permissions were enforced either by per tenant indices (one index per customer, total isolation) or by document level access control filters applied at query time.
Query sent a keyword search to the index and returned ranked document hits.
UI showed snippets and let the user click through to the source.

This architecture had two genuinely nice properties. The permission boundary was crystal clear: the retrieval call already knew which tenant and which user, and the filter happened inside the index server. The worst failure mode was "user sees a search result they should not have seen," which is bad but recoverable.

RAG breaks both properties. The retrieved content does not just appear in a clickable list. It gets stuffed into a prompt and read aloud as fact by a language model. A retrieval result the user wasn't supposed to see is no longer "an item in the list"; it is content the user will see, narrated as truth. The blast radius of a permissions bug just went up by an order of magnitude.

The Four Decisions That Cascade Through Everything Else

Four architectural choices set the shape of the rest of the system. Make them carefully; they are expensive to reverse.

Decision 1: Per tenant or shared index?

A per tenant index is the safest answer to the isolation question (no shared physical storage, no filter logic to get wrong). It is also operationally heavy at scale: thousands of tenants means thousands of indices to keep warm. The on call team will develop opinions about this. A shared index with tenant ID filtering is cheaper but makes isolation a property of correct filter application, which is a property that fails silently and announces itself at the worst possible moment.

The reasonable middle path is partitioned storage with a shared service: tenant data lives in distinct collections, namespaces, or shards, but the same service handles requests for all of them. Most production vector databases (Pinecone, Weaviate, Qdrant, Milvus) offer a first class primitive for this, under names such as namespaces, tenants, partitions, or collections.

Decision 2: What is the retrieval primitive?

Pure lexical (BM25), pure dense (embedding similarity), or hybrid (both, combined). Each has a different failure mode. Lexical search is precise but brittle to phrasing. Dense search is robust to phrasing but happily retrieves semantically adjacent, factually wrong content and calls it a win. Hybrid retrieval with a rank fusion step is the production default for almost every real RAG product today.

Decision 3: How are documents chunked?

Too small and the chunks lose context. Too large and they waste prompt tokens and dilute relevance. Fixed size, sentence boundary, paragraph boundary, semantic similarity: each has tradeoffs and none is universally correct. This single decision contributes more to retrieval quality than any other ingestion choice, and it is also the one most teams make once, regret indefinitely, and never have budget to revisit.

Decision 4: Where does the permission filter live?

In the application layer (filter after retrieval), in the index layer (filter during retrieval), or in the storage layer (separate physical storage per permission domain). The further down the stack the filter lives, the safer it is, and the harder it is to bypass. This is not a coincidence.

We will work through all four.

Ingestion: Chunking, Embeddings, and Metadata

The ingestion pipeline is the first place a multi tenant RAG system either earns or loses its quality budget. The pipeline is conceptually simple: parse, chunk, embed, store. Each step has a nonobvious failure mode, and the steps do not fail loudly.

Parsing is harder than it looks. PDFs are the canonical example: multi column layouts that come back as interleaved lines, headers and footers stapled to body paragraphs, tables flattened into runs of numbers with no structural context surviving the journey. Cheap PDF parsing will ruin retrieval quality before the rest of the pipeline has a chance. Pay the price for a good parser. Then accept that you will pay it again when your parser of choice stops being maintained.

Chunking is where most teams underinvest. Fixed size chunks (a few hundred to around a thousand tokens, with some overlap) are the universal default because they are simple, not because they are good. Better approaches include sentence boundary chunking, paragraph aware chunking, and recursive structure aware chunking that respects headings. The chunk boundary is the unit your retriever returns and your model reads. Treat it accordingly.
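A minimal sketch of structure aware chunking, assuming the parser emits Markdown style `#` headings and blank line separated paragraphs. The names `Chunk`, `pack_paragraphs`, and `chunk_by_structure` are illustrative, and the word count stands in for a real tokenizer. The budget is soft: the overlap paragraph is carried into the next chunk and counts toward its size.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    heading: str  # nearest heading above the text, kept as retrieval context
    text: str


def pack_paragraphs(heading: str, paragraphs: list[str],
                    max_words: int, overlap: int) -> list[Chunk]:
    """Pack whole paragraphs into chunks of roughly max_words words."""
    chunks: list[Chunk] = []
    window: list[str] = []
    for para in paragraphs:
        size = sum(len(p.split()) for p in window)
        if window and size + len(para.split()) > max_words:
            chunks.append(Chunk(heading, "\n\n".join(window)))
            # Repeat the last `overlap` paragraphs at the start of the next chunk.
            window = window[-overlap:] if overlap > 0 else []
        window.append(para)
    if window:
        chunks.append(Chunk(heading, "\n\n".join(window)))
    return chunks


def chunk_by_structure(doc: str, max_words: int = 200, overlap: int = 1) -> list[Chunk]:
    """Never let a chunk cross a heading; split paragraphs only within a section."""
    chunks: list[Chunk] = []
    heading, paragraphs = "", []
    for block in re.split(r"\n\s*\n", doc.strip()):
        block = block.strip()
        if block.startswith("#"):
            chunks.extend(pack_paragraphs(heading, paragraphs, max_words, overlap))
            heading, paragraphs = block.lstrip("#").strip(), []
        elif block:
            paragraphs.append(block)
    chunks.extend(pack_paragraphs(heading, paragraphs, max_words, overlap))
    return chunks
```

Carrying the heading on every chunk is a cheap way to keep some of the context that a fixed size splitter would throw away.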

Embedding is the most fungible step. Multiple providers offer competitive embedding models today. The interesting architectural choices are around dimension (smaller dimensions reduce storage and search cost; larger dimensions generally improve retrieval quality), versioning (you will eventually reembed everything; this is not a risk to hedge, it is a certainty to plan for), and per tenant embedding model selection (most products use one model for everyone; some niche use cases benefit from tenant specific fine tuning).

Metadata is where the permissions story actually lives, and we will come back to it.

Hybrid Retrieval: BM25 plus Dense Vectors

A query in production grade RAG is almost never a pure vector similarity search. It is a hybrid of:

Lexical retrieval: BM25 (Robertson and Zaragoza's well known formulation) over the inverted index. Excellent for queries containing specific terms, product names, error codes, or domain jargon.
Dense retrieval: vector similarity over the embedding index. Excellent for paraphrased queries, conceptual questions, and queries that share no vocabulary with the documents.

The two retrievers return overlapping but nonidentical candidate sets. A fusion step combines them. Reciprocal Rank Fusion (Cormack et al., 2009) is the standard combiner because it is simple, needs only ranks rather than comparable scores, has a single smoothing constant that rarely needs tuning, and is consistently competitive against tuned weighting schemes.
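The fusion step is small enough to show in full. The sketch below scores each document as the sum of 1 / (k + rank) over every ranking it appears in, with k = 60 as the constant used by Cormack et al. The function name and the tie break on document ID are illustrative choices, not part of the method.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists: score(d) = sum over rankings of 1 / (k + rank of d)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks start at 1
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_hits = ["a", "b", "c"]
dense_hits = ["c", "a", "d"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

Because only ranks are used, the BM25 scores and the cosine similarities never have to be normalised onto a common scale.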

The hybrid step is followed by an optional reranker: a cross encoder model that scores each candidate against the query and reorders them. Rerankers are expensive (you are running a model on every candidate) but they substantially improve precision at small candidate counts. The standard pattern is: retrieve broadly with hybrid search, narrow with a reranker, hand a small number of top passages to the synthesis model.

The architectural lesson is that "RAG retrieval" is a pipeline of three or four stages, not a single call. Treat it that way in your code, your latency budget, and your observability.
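One way to make the stages explicit, sketched under the assumption that each stage is a function from a query and the current candidates to new candidates. `run_pipeline` and the stage names are hypothetical; the point is only that each stage gets its own name and its own latency measurement.

```python
import time
from typing import Callable

# A stage takes the query and the current candidate IDs, and returns new candidates.
Stage = Callable[[str, list[str]], list[str]]


def run_pipeline(query: str, stages: list[tuple[str, Stage]]) -> tuple[list[str], dict[str, float]]:
    """Run the stages in order and record per-stage latency in milliseconds."""
    candidates: list[str] = []
    timings_ms: dict[str, float] = {}
    for name, stage in stages:
        start = time.perf_counter()
        candidates = stage(query, candidates)
        timings_ms[name] = (time.perf_counter() - start) * 1000.0
    return candidates, timings_ms


# Toy stages standing in for hybrid retrieval, a reranker, and the final cut.
stages: list[tuple[str, Stage]] = [
    ("hybrid", lambda q, c: ["d1", "d2", "d3", "d4"]),
    ("rerank", lambda q, c: list(reversed(c))),
    ("top_k", lambda q, c: c[:2]),
]
passages, timings_ms = run_pipeline("how do I rotate keys?", stages)
```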

Permissions Coupled to Retrieval

This is the section the article exists for.

The wrong answer is: retrieve everything, then filter by permissions in the application layer before passing to the model. This is wrong for two reasons. First, the model will read whatever you hand it; if you accidentally hand it a chunk the user cannot see, the leak happens regardless of whether you remember to redact it afterward. Second, post filter retrieval kills relevance. If your top candidates were all permission restricted, the user gets a "no results" response even when relevant accessible content exists further down the list. You have simultaneously leaked data and served a bad answer.

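The right answer is: permissions are enforced inside the index as part of the retrieval call, not filtered afterwards in application code. The ACL is written at index time and evaluated during the search, so the retrieval call itself produces only chunks the user is allowed to see. This requires three things in the architecture: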

Every chunk carries its ACL as metadata. When a document is ingested, the document's permissions (tenant ID, group IDs, user level grants, sensitivity labels) are attached as searchable metadata to every chunk derived from it.
The retriever applies the ACL filter inside the index. Pinecone, Weaviate, Qdrant, Milvus, and Elasticsearch all support metadata filtered queries; use them, do not roll your own. A filter that runs in the vector database, before the top k cut, is fundamentally safer than a filter that runs in your application code.
The user's effective permissions are computed once, server side, and attached to the query. Never trust permission claims from the client. Resolve the user's full ACL set (tenant, groups, document grants, role based exclusions) at the query gateway, then pass that resolved set into the retriever.
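The difference between filtering before and after the top k cut can be shown with a toy in-memory index. This is a sketch, not any vector database's API: `Chunk`, `Principal`, and both search functions are hypothetical, and `score` stands in for similarity to the query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    tenant_id: str
    allowed_groups: frozenset[str]  # ACL copied from the parent document at ingestion
    score: float                    # stand-in for similarity to the query


@dataclass(frozen=True)
class Principal:
    tenant_id: str          # resolved server side, never taken from the client
    groups: frozenset[str]


def can_read(p: Principal, c: Chunk) -> bool:
    return c.tenant_id == p.tenant_id and bool(c.allowed_groups & p.groups)


def search_prefiltered(index: list[Chunk], p: Principal, k: int) -> list[Chunk]:
    """ACL applied before the top k cut: only readable chunks compete."""
    visible = [c for c in index if can_read(p, c)]
    return sorted(visible, key=lambda c: c.score, reverse=True)[:k]


def search_postfiltered(index: list[Chunk], p: Principal, k: int) -> list[Chunk]:
    """Anti-pattern: cut to top k first, then drop what the user cannot read."""
    top = sorted(index, key=lambda c: c.score, reverse=True)[:k]
    return [c for c in top if can_read(p, c)]


index = [
    Chunk("c1", "acme", frozenset({"finance"}), 0.95),
    Chunk("c2", "globex", frozenset({"eng"}), 0.90),
    Chunk("c3", "acme", frozenset({"eng"}), 0.70),
]
user = Principal("acme", frozenset({"eng"}))
```

With `k=2`, the prefiltered search returns `c3`, while the postfiltered search returns nothing, because both top slots were taken by chunks the user cannot read. In the postfiltered version the restricted chunks have also already crossed into application memory, which is the exposure the section warns about.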

A subtler point: when permissions change, the index does not change automatically. A user removed from a document yesterday must not be able to retrieve chunks from that document today. This means either you rewrite the chunk metadata on permission change (cheap if the database supports partial updates; expensive otherwise) or you maintain a per user denylist consulted at query time. The first option is cleaner; the second is operationally simpler when permissions churn frequently. Pick one and write down the consistency guarantee explicitly. The one you cannot articulate is the one that fails at the worst time.
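A sketch of the denylist option, assuming the denylist is folded into the filter that the index evaluates rather than applied after retrieval. The filter dictionary is a made-up shape for illustration, not the filter syntax of any particular database, and `RevocationList`, `build_filter`, and `matches` are hypothetical names.

```python
class RevocationList:
    """Per-user denylist of document IDs, updated the moment a grant is removed."""

    def __init__(self) -> None:
        self._denied: dict[str, set[str]] = {}

    def revoke(self, user_id: str, doc_id: str) -> None:
        self._denied.setdefault(user_id, set()).add(doc_id)

    def denied_docs(self, user_id: str) -> frozenset[str]:
        return frozenset(self._denied.get(user_id, set()))


def build_filter(tenant_id: str, groups: set[str], denied: frozenset[str]) -> dict:
    """Resolve the filter once at the gateway and hand it to the retriever."""
    return {
        "tenant_id": tenant_id,
        "groups_any": sorted(groups),
        "doc_id_not_in": sorted(denied),
    }


def matches(flt: dict, meta: dict) -> bool:
    """What the index would evaluate per chunk, before the top k cut."""
    return (
        meta["tenant_id"] == flt["tenant_id"]
        and bool(set(meta["groups"]) & set(flt["groups_any"]))
        and meta["doc_id"] not in flt["doc_id_not_in"]
    )
```

The chunk metadata can then be rewritten lazily in the background, and the denylist entry dropped once the rewrite has landed.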

Freshness and Eventual Consistency

Customers expect documents to be findable shortly after upload. They also expect deleted documents to become unfindable immediately. These two expectations meet the practical reality that embedding and indexing are not instantaneous.

The architecture splits the problem in three:

Soft delete the document immediately. The metadata says "deleted"; the retriever filters out anything marked deleted. This is fast because it is a metadata flip, and it gives you correctness on the safety critical side right away.
Hard delete asynchronously. A background job purges the actual chunks and embeddings on a schedule.
Index new documents asynchronously with a freshness SLA. The user gets a "processing" state until the document is queryable.

The same pattern applies to updates. A user editing a document expects the new content to be retrievable, but reembedding and reindexing takes time. Most production systems treat the previous version as authoritative until the new version is fully ingested, then atomically swap the references.
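The soft delete and version swap contract can be sketched as a small state machine. `DocumentStore` is a hypothetical in-memory stand-in for the index; in a real system the live pointer and the deleted flag would be chunk metadata that the retriever filters on.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentStore:
    versions: dict[tuple[str, int], list[str]] = field(default_factory=dict)
    live: dict[str, int] = field(default_factory=dict)   # doc_id -> published version
    deleted: set[str] = field(default_factory=set)       # soft delete tombstones

    def stage(self, doc_id: str, version: int, chunks: list[str]) -> None:
        """Ingest a new version without making it visible to retrieval."""
        self.versions[(doc_id, version)] = chunks

    def publish(self, doc_id: str, version: int) -> None:
        """Swap the live pointer only after the version is fully ingested."""
        if (doc_id, version) not in self.versions:
            raise KeyError(f"version {version} of {doc_id} has not been staged")
        self.live[doc_id] = version

    def soft_delete(self, doc_id: str) -> None:
        """Metadata flip: takes effect on the very next query."""
        self.deleted.add(doc_id)

    def retrievable_chunks(self, doc_id: str) -> list[str]:
        if doc_id in self.deleted or doc_id not in self.live:
            return []
        return self.versions[(doc_id, self.live[doc_id])]

    def purge(self) -> None:
        """Background hard delete of everything belonging to deleted documents."""
        for key in [k for k in self.versions if k[0] in self.deleted]:
            del self.versions[key]
        for doc_id in self.deleted:
            self.live.pop(doc_id, None)
```

The tombstone set is deliberately kept after `purge`, so a late-arriving ingestion job for a deleted document cannot make it retrievable again.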

Document this contract explicitly: "Your changes are visible to search within N minutes." Customers tolerate eventual consistency when they know it exists. They hate it when they discover it by surprise, which is always at the worst possible time.

Observability, Citation, and Audit

Every multi tenant RAG system needs three streams of telemetry that go beyond classical search:

Retrieval observability captures, per query: the user, the tenant, the resolved ACL set, the candidate set returned by hybrid retrieval, the reranker scores, and the final top passages handed to the model. Without this, you cannot debug "the model said X but I have no idea where it got that from." You will say this exact sentence at 11pm on a Tuesday. Build the observability now.

Citation telemetry captures, per generated answer: which input chunks the answer claims to be grounded in, and ideally a span level pointer back to the source document. This is where citation grounding research (Gao, Yen et al., 2023) and emerging OpenTelemetry GenAI conventions are doing useful work. If your answer cites a passage, you must be able to reproduce that citation from the logs.

Audit logs capture the security view: who queried, what they were allowed to retrieve, what was retrieved, what was cited, and what was rendered. This is the view auditors want. Compliance customers will not buy your product without it.

These three streams overlap significantly but serve different audiences (engineering, product, security). Build them as separate pipelines with shared upstream events, not as a single firehose that you grep through later.
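A sketch of "shared upstream events, separate pipelines": one immutable event per query, projected into three records. The field names are illustrative assumptions and do not follow any published schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryEvent:
    query_id: str
    user_id: str
    tenant_id: str
    resolved_acl: tuple[str, ...]
    candidates: tuple[tuple[str, float], ...]     # (chunk_id, reranker score)
    passages: tuple[str, ...]                     # chunk IDs handed to the model
    citations: tuple[tuple[str, int, int], ...]   # (chunk_id, span start, span end)


def retrieval_record(e: QueryEvent) -> dict:
    """Engineering view: how the candidate set was formed."""
    return {
        "query_id": e.query_id,
        "tenant_id": e.tenant_id,
        "resolved_acl": list(e.resolved_acl),
        "candidates": [{"chunk_id": c, "score": s} for c, s in e.candidates],
        "passages": list(e.passages),
    }


def citation_record(e: QueryEvent) -> dict:
    """Product view: what the answer claims to be grounded in."""
    cited = {chunk_id for chunk_id, _, _ in e.citations}
    return {
        "query_id": e.query_id,
        "citations": [{"chunk_id": c, "start": a, "end": b} for c, a, b in e.citations],
        "uncited_passages": [p for p in e.passages if p not in cited],
    }


def audit_record(e: QueryEvent) -> dict:
    """Security view: who asked, what they were allowed, what they got."""
    return {
        "query_id": e.query_id,
        "user_id": e.user_id,
        "tenant_id": e.tenant_id,
        "allowed": list(e.resolved_acl),
        "retrieved": list(e.passages),
        "cited": sorted({chunk_id for chunk_id, _, _ in e.citations}),
    }
```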

Tenant Isolation

The CISO's question deserves a real answer. Multi tenant RAG has at least four layers where isolation can be enforced, and the strongest implementations enforce at all four:

Storage layer. Per tenant collections, namespaces, or shards. Even when the underlying database is shared, the data physically lives in distinct logical containers.
Index layer. Mandatory tenant ID filter applied by the retriever, with no path that allows queries without it. Defense in depth means a request with no tenant ID returns zero results, not "everything."
Application layer. The query gateway validates the authenticated user belongs to the claimed tenant before any retrieval call is made. Tenant ID is never accepted from the client.
Encryption layer. Per tenant encryption keys (envelope encryption with a customer managed KMS key) so that even a low level data exfiltration scenario yields ciphertext.

Pick the layers your security posture and compliance regime require. For most B2B SaaS at the SOC 2 or ISO 27001 level, the first three are nonnegotiable; the fourth is upsold to regulated customers (finance, healthcare).
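The index layer rule (no tenant ID means zero results) is easiest to enforce when the retrieval function has no unscoped code path at all. A minimal sketch, with `scoped_search` and the dictionary based chunk shape as illustrative assumptions:

```python
from typing import Optional


def scoped_search(index: list[dict], tenant_id: Optional[str],
                  groups: frozenset[str], k: int = 5) -> list[dict]:
    """Fail closed: a missing or empty scope yields nothing, never everything."""
    if not tenant_id or not groups:
        return []
    visible = [
        c for c in index
        if c["tenant_id"] == tenant_id and groups & set(c["groups"])
    ]
    return sorted(visible, key=lambda c: c["score"], reverse=True)[:k]


index = [
    {"id": "c1", "tenant_id": "acme", "groups": ["eng"], "score": 0.8},
    {"id": "c2", "tenant_id": "globex", "groups": ["eng"], "score": 0.9},
]
```

Raising an error instead of returning an empty list is an equally defensible choice; what matters is that the unscoped branch can never fall through to a full scan.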

Failure Modes

These are the failures that production multi tenant RAG systems actually exhibit, in roughly the order of how badly they hurt:

1. Cross tenant leakage. A query from tenant A returns a chunk from tenant B. The highest severity bug in the entire product class. Causes: missing tenant filter in the retrieval call, shared embedding index without partition discipline, application layer filtering instead of index layer filtering.
2. Cross user leakage inside a tenant. A user without permission retrieves a chunk they should not have. Same family of causes as failure 1, scoped narrower. Still a security incident.
3. Stale embeddings. A document was updated; the index still has the old embeddings. The model cites text that no longer exists in the document. Mitigations: index version metadata on every chunk, atomic ingestion, ingestion status surfaced to the UI.
4. Chunk truncation. A chunk is too short to contain the answer; the model hallucinates context from neighbouring chunks; the citation lands on the wrong span. Mitigations: structure aware chunking, larger overlap, semantic similarity based chunk grouping during retrieval.
5. Hybrid retrieval imbalance. BM25 dominates the candidates and dense retrieval has no influence (or vice versa). The product behaves well on some query classes and badly on others. Mitigations: track per query candidate sources; evaluate on a query distribution that represents real usage.
6. Citation drift. The model generates an answer with citations that do not actually support the claim. Mitigations: post generation grounding check that reverifies each cited claim against its source chunk before showing it to the user.

Failures 1 and 2 are security incidents. Failures 3 through 6 are quality incidents. The first two must be prevented at design time; the rest can be mitigated by design and caught operationally.
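A deliberately crude sketch of a post generation grounding check for citation drift. It uses lexical overlap as a stand-in for a real entailment or grounding model, so it only shows where the check sits and what it returns. The stopword list, the 0.6 threshold, and the function names are all assumptions.

```python
import re

STOPWORDS = frozenset(
    {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to", "with"}
)


def content_tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens with common function words removed."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def is_grounded(claim: str, source_chunk: str, min_overlap: float = 0.6) -> bool:
    """True if enough of the claim's content words appear in the cited chunk."""
    claim_tokens = content_tokens(claim)
    if not claim_tokens:
        return False  # nothing checkable; treat as ungrounded
    overlap = len(claim_tokens & content_tokens(source_chunk)) / len(claim_tokens)
    return overlap >= min_overlap


def ungrounded_claims(cited: list[tuple[str, str]]) -> list[str]:
    """Given (claim, cited chunk text) pairs, return claims to withhold or flag."""
    return [claim for claim, chunk in cited if not is_grounded(claim, chunk)]


source = "Logs are kept for a retention period of 90 days."
```

A lexical check like this will pass claims that reuse the right words in the wrong relation, which is why it should be read as a placeholder for a stronger verifier.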

Cost Economics

The cost surface of multi tenant RAG has four contributors, and they amortise very differently:

Ingestion cost (parsing, chunking, embedding) is paid once per document and again on update. Embedding is the dominant per document cost. Strategies: batch embedding requests, deduplicate identical chunks across tenants where permitted, archive cold tenants to cheaper storage.
Storage cost (vector index plus metadata) is paid continuously and scales with corpus size. Strategies: tiered storage (hot/cold partitions), shorter embedding dimensions where quality permits, compression and quantisation.
Retrieval cost is paid per query. It is small per query but high per peak second. Strategies: query caching for popular questions (be careful that cache keys include the resolved ACL), connection pooling, candidate count tuning.
Synthesis cost is the LLM call. It is by far the largest per query cost. Strategies: prompt caching for stable system prompts (see article 2), context trimming, smaller models for low stakes queries.
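The caching caveat above is worth making concrete: a cache keyed on the query text alone will serve one user's answer to another user with different permissions. A minimal sketch, assuming the resolved ACL can be represented as a set of group IDs; `cache_key` is a hypothetical helper.

```python
import hashlib
import json


def cache_key(tenant_id: str, groups: set[str], query: str) -> str:
    """Key an answer cache on the tenant, the resolved ACL, and the query."""
    payload = json.dumps(
        {
            "tenant": tenant_id,
            "groups": sorted(groups),          # order independent
            "query": query.strip().lower(),    # light normalisation only
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Two users share a cache entry only when their tenant and resolved group set are identical, which lowers the hit rate and is the price of keeping the cache inside the permission boundary.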

The economic shape of multi tenant RAG is fixed corpus cost plus variable query cost. Pricing has to reflect both. Single axis pricing (per seat or per query alone) undercharges some customers and overcharges others. The math is not subtle. Most production products end up with a hybrid pricing model that mirrors the cost structure.

The Architect's Checklist

A 12 item action list to take into a real multi tenant RAG design:

Choose your isolation model up front: per tenant indices, shared with partition, or hybrid. Write down which layers enforce the boundary.
Make the retrieval call refuse to execute without a resolved tenant ID and ACL set. Defense in depth means "no filter" returns zero results, never everything.
Attach the document's full ACL to every chunk's metadata at ingestion. Filter inside the index, never above it.
Resolve the user's effective permissions server side at the query gateway. Never trust client supplied tenant or permission claims.
Use a hybrid retrieval pipeline (lexical plus dense) with reciprocal rank fusion. Add a reranker for high precision use cases.
Treat chunk boundary design as a first class quality decision. Default to structure aware chunking with overlap; do not ship fixed size chunking on important content.
Implement soft delete that flips metadata immediately, with hard delete on an async schedule.
Build three telemetry streams from the start: retrieval observability, citation telemetry, security audit log. Do not merge them.
Run a post generation grounding check on every cited claim before showing the answer. This is your defense against citation drift.
Plan for a reembedding migration on day one. It will happen; design the pipeline to make it boring.
Budget per tenant for ingestion, storage, retrieval, and synthesis separately. Price to match the cost shape.
Pen test cross tenant leakage continuously. The boundary that has not been actively attacked is a boundary that has silently rotted.

The Architect's Mental Model

A multi tenant RAG system is not "a vector database with a chatbot on top." It is a permission aware retrieval engine whose output is read aloud by a language model that does not understand authorisation. Every architectural move (index layer filtering, per tenant partitioning, metadata on chunks, structured ingestion, hybrid retrieval, grounding checks, audit logs) exists to make sure the model never sees a chunk it should not, and never says something it cannot ground.

The classical document search engine is still in there. The inverted index still runs. BM25 still earns its keep. The vector database sits next to it. The LLM sits on top. The security perimeter sits inside the retriever, not above it.

If you are building this, the rest of Architecting Agents applies. The safe tool call discipline from article 3 is the right model for the retriever: it is a read tool that crosses tenant boundaries, and it deserves contract level controls. The observability discipline from article 4 is your audit pipeline. The cost bounding discipline from article 5 applies per tenant, not just per query.

In multi tenant RAG, the retriever is the boundary. Everything that bypasses the retriever is a leak waiting to happen.
