A practical guide to enterprise RAG rollout
Retrieval-augmented generation is easy to prototype and surprisingly hard to run for a year without embarrassing regressions. The failure pattern is familiar: a charismatic demo on a static snapshot, then production documents change, ACLs are wrong in the index, chunk boundaries mangle tables, and the team discovers users asked questions that no chunk can answer. This guide sequences the work so you improve retrieval quality and trust in measured steps—not by buying a bigger embedding model and hoping.
Phase zero: scope the corpus and the question types
Before touching vectors, list the questions you must answer and the documents that are authoritative. If two departments disagree on which PDF is canonical, fix that socially before you encode conflict into embeddings. Categorize questions: factual lookups, procedural steps, comparisons across versions, and “no sufficient internal source” cases. Your assistant should handle the last category explicitly—refusal and escalation beat confident guessing.
Define freshness expectations: HR policies may be quarterly; security advisories may be daily. Different SLAs imply different re-ingestion jobs and monitoring. If everything is labeled “best effort,” nothing will be maintained.
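Those per-source SLAs can be encoded directly so monitoring, not memory, catches breaches. A minimal sketch; the source names and cadences are illustrative assumptions, not recommendations:

```python
from datetime import datetime, timedelta, timezone

# Illustrative SLAs per source family (hypothetical names and cadences).
REINGEST_SLA = {
    "hr-policies": timedelta(days=90),         # quarterly
    "security-advisories": timedelta(days=1),  # daily
}
DEFAULT_SLA = timedelta(days=30)

def stale_sources(last_ingested, now=None):
    """Return sources whose last ingestion breached their SLA.

    Feeding this into monitoring is what keeps 'best effort' sources
    from silently rotting."""
    now = now or datetime.now(timezone.utc)
    return sorted(
        name for name, ts in last_ingested.items()
        if now - ts > REINGEST_SLA.get(name, DEFAULT_SLA)
    )
```

The explicit default cadence means an unlabeled source still gets flagged rather than falling outside monitoring entirely.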
Ingestion with permissions, not “ingest everything”
Enterprise RAG must enforce access control at retrieval time. Practical approaches include: indexing with metadata that mirrors your IAM or document ACLs, filtering every query by the caller’s entitlements, and segmented indices per sensitivity tier when complexity warrants it. Testing should include negative cases—a user who may read folder A must never see chunks from folder B even if the embedding space is numerically close.
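The core invariant, filtering every candidate by the caller's entitlements before anything is ranked or shown, can be sketched in a few lines. The chunk shape below is a simplifying assumption; in practice the ACL fields live as metadata filters pushed down into the vector store query:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    text: str
    source: str
    allowed_groups: frozenset  # mirrors the source document's ACL entry

def filter_by_entitlements(candidates, caller_groups):
    """Drop chunks the caller may not read, before any ranking happens.

    Enforcing the filter ahead of similarity scoring means a numerically
    close embedding can never leak a forbidden chunk into an answer."""
    caller = frozenset(caller_groups)
    return [c for c in candidates if c.allowed_groups & caller]
```

The negative case described above becomes a one-line test: a caller entitled only to folder A must get an empty result from folder B's chunks, no matter how close the vectors are.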
OCR quality, scanned PDFs, and password-protected attachments are where projects stall. Decide early whether low-quality sources are excluded, remediated, or flagged to users. “Garbage in, fluent garbage out” is worse than omitting a document and saying so.
Chunking and structure: respect the document physics
Fixed token windows ignore headings, tables, and lists. Prefer structure-aware chunking where headers define boundaries, tables stay intact or are serialized consistently, and code blocks are not torn mid-function. For HTML knowledge bases, consider chunking by article or section ID rather than blind character counts.
Use overlap sparingly—enough to preserve continuity, not so much that near-identical chunks dilute ranking signals. Version metadata (source URL, modified time, product SKU) should travel with every chunk as filterable fields: retrieval uses them aggressively even though the model never sees them directly.
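For a markdown or HTML knowledge base, header-boundary chunking with metadata riding along is straightforward. A minimal sketch under the assumption that H1–H3 headings mark section boundaries; real pipelines also need the table and code-block handling described above:

```python
import re

def chunk_by_headers(markdown_text, metadata):
    """Split on H1-H3 headings so each chunk is one coherent section.

    `metadata` (source URL, modified time, etc.) is copied onto every
    chunk as filterable fields the retriever can use."""
    chunks, current_header, buf = [], "preamble", []
    for line in markdown_text.splitlines():
        if re.match(r"^#{1,3}\s", line):
            if buf:
                chunks.append({"header": current_header,
                               "text": "\n".join(buf).strip(),
                               **metadata})
            current_header, buf = line.lstrip("# ").strip(), []
        else:
            buf.append(line)
    if buf:
        chunks.append({"header": current_header,
                       "text": "\n".join(buf).strip(),
                       **metadata})
    return chunks
```

Chunking by section keeps a heading and its body together, so a retrieved chunk carries its own context instead of starting mid-list at an arbitrary token offset.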
Embeddings and bilingual or domain vocabulary
Choose embedding models that match your languages and domain. For specialized jargon (legal, medicine, manufacturing), evaluate retrieval with a realistic query set—not Wikipedia trivia. Sometimes a hybrid pipeline (sparse keyword plus dense vector) beats a fancier embedding model on exact phrase and SKU retrieval.
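One common way to combine the sparse and dense rankings is reciprocal rank fusion, which needs no score calibration between the two retrievers. A sketch with hypothetical document IDs:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge rankings (best first) from multiple retrievers via RRF.

    Each document scores 1/(k + rank + 1) per list it appears in;
    documents ranked well by both retrievers rise to the top."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# A keyword retriever nails the exact SKU; the dense retriever ranks it
# lower. Fusion keeps both signals without tuning score weights.
sparse = ["sku-4411-datasheet", "pricing-faq"]
dense = ["pricing-faq", "returns-policy", "sku-4411-datasheet"]
fused = reciprocal_rank_fusion([sparse, dense])
```

The document both retrievers rank highly ends up first, while the exact-match SKU hit from the sparse side still beats a dense-only candidate.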
Plan for model migration: when you change embeddings, you likely rebuild the index. Budget downtime or blue-green index swaps and regression tests so you do not silently degrade quality on a weekend deploy.
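The blue-green swap reduces to routing queries through an alias and flipping it only after the regression harness passes. A hypothetical sketch; most vector stores expose alias or collection-swap primitives of their own:

```python
class IndexAlias:
    """Queries resolve through an alias, never a concrete index name.

    A rebuilt index (e.g. after an embedding model change) goes live
    only by flipping the alias, and the old index is retained so
    rollback is a second flip rather than a rebuild."""

    def __init__(self, live_index):
        self._live = live_index

    def resolve(self):
        return self._live

    def promote(self, candidate_index, regression_passed):
        if not regression_passed:
            raise RuntimeError("regression suite failed; alias unchanged")
        previous, self._live = self._live, candidate_index
        return previous  # keep for instant rollback
```

Because the candidate index is built and tested while the old one still serves traffic, the weekend deploy either passes the gate or changes nothing.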
Reranking and context packing
Initial vector hits are candidates, not finals. Cross-encoder rerankers or lightweight scoring layers often lift precision on the top five chunks users actually see. Context packing—deciding what fits in the LLM window—should prioritize diversity (do not fetch five nearly identical chunks) and citeability (each paragraph should map to a source pointer users can audit).
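The diversity and citeability rules above can be sketched as a greedy packer over reranked chunks. The chunk dict shape and the toy word-overlap similarity are assumptions of this sketch; a real pipeline would use embedding cosine similarity and a proper tokenizer for budgeting:

```python
def jaccard(a, b):
    """Toy word-overlap similarity standing in for embedding cosine."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

def pack_context(reranked, token_budget, similarity=jaccard, max_sim=0.8):
    """Greedily pack reranked chunks into the LLM window.

    Enforces the token budget, skips near-duplicates (diversity), and
    keeps each chunk's source pointer intact (citeability)."""
    packed, used = [], 0
    for chunk in reranked:
        if used + chunk["tokens"] > token_budget:
            continue
        if any(similarity(chunk["text"], p["text"]) > max_sim for p in packed):
            continue  # near-identical to something already packed
        packed.append(chunk)
        used += chunk["tokens"]
    return packed
```

Dropping the second near-duplicate frees budget for a chunk that adds new information, and every packed chunk still carries the source pointer the answer must cite.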
If your application requires citations, enforce citation discipline in the prompt and, in evaluation, penalize answers that paraphrase without anchors. Users forgive imperfect answers far more readily than they forgive fake precision.
Evaluation beyond “vibes”
Minimum viable evaluation includes: labeled question-answer pairs with gold spans, automated checks for empty retrieval, manual review sampling on new releases, and red-team prompts designed to elicit policy violations. Track metrics over time—drift in grounding rate often precedes user complaints by weeks.
Integrate evaluations into CI for ingestion changes. When someone tweaks chunking or swaps rerankers, the same harness should run before promotion. This is how you prevent “helpful” tweaks from being undocumented quality gambles.
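A CI gate of this kind needs little machinery: compute grounding rate on the labeled set and refuse promotion on regression. A minimal sketch, assuming a `retrieve` callable for the pipeline variant under test and eval examples with `gold_source` labels:

```python
def grounding_rate(eval_set, retrieve):
    """Share of labeled questions whose gold source appears in the
    retrieved results for the pipeline variant under test."""
    hits = sum(1 for ex in eval_set
               if ex["gold_source"] in retrieve(ex["question"]))
    return hits / len(eval_set)

def gate_promotion(eval_set, retrieve, baseline, tolerance=0.02):
    """Run before promoting a chunking or reranker change: fail the
    build when grounding rate regresses beyond the tolerance."""
    rate = grounding_rate(eval_set, retrieve)
    if rate < baseline - tolerance:
        raise RuntimeError(
            f"grounding rate {rate:.0%} regressed below baseline {baseline:.0%}"
        )
    return rate
```

Because the same harness runs for every ingestion change, a "helpful" chunking tweak that quietly halves grounding rate fails the build instead of reaching users.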
Observability and operations
Log query distributions, retrieval latencies, empty-hit rates, and low-confidence generations. Alert when distributions shift abruptly—often a sign of a broken connector or a poisoned document batch. Runbooks should cover index rollback, feature flags to disable retrieval in crisis, and communications templates if you must pause a customer-facing assistant.
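The empty-hit alert is the simplest of these signals to wire up: track the rate over a sliding window and page when it jumps. A minimal sketch with illustrative thresholds:

```python
from collections import deque

class EmptyHitMonitor:
    """Sliding-window empty-retrieval rate with an abrupt-shift alert.

    A jump in empty hits usually means a broken connector or a bad
    document batch, not a sudden change in what users ask."""

    def __init__(self, window=500, alert_rate=0.15):
        self._recent = deque(maxlen=window)
        self.alert_rate = alert_rate

    def record(self, hit_count):
        self._recent.append(hit_count == 0)

    def alerting(self):
        if not self._recent:
            return False
        return sum(self._recent) / len(self._recent) > self.alert_rate
```

The fixed-size window means the alert reacts to recent traffic only, so one bad night does not keep the pager firing a week later.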
Cost controls belong here too: cap expensive reranker calls for noisy traffic, deduplicate embeddings on duplicate documents, and cache frequent lookups where latency matters.
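Deduplicating before embedding is a one-pass content hash. A sketch assuming documents arrive as dicts with a `content` field:

```python
import hashlib

def dedupe_for_embedding(docs):
    """Keep one copy of byte-identical content so duplicate uploads
    cost a single embedding call instead of many."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc["content"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```

Hashing catches only exact duplicates; near-duplicates (the same policy exported twice with different footers) need the similarity-based diversity checks discussed in the packing section.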
Human workflows around the AI
Even strong RAG needs human maintenance: owners who approve new sources, reviewers who adjudicate bad answers, and helpdesk paths for “the assistant said X but policy says Y.” If those roles are unnamed, quality will decay because no one feels responsible for the corpus.
Multilingual and multi-region nuances
If your users ask questions in several languages while documents remain mostly English (or the reverse), plan for query translation, multilingual embeddings, or parallel indices per locale—with explicit QA in each language. Mixed-language orgs often underestimate how retrieval precision drops when chunks and queries are not aligned linguistically. Pilot with native speakers reviewing answers, not only automated scores.
Data residency may force region-specific vector stores and model endpoints. That duplication increases ops cost but can be non-negotiable. Document which regions are in scope for phase one so you do not accidentally index EU HR documents into a US-only index “temporarily.”
When RAG is the wrong primary bet
If answers require heavy numeric computation over live transactional data, a retrieval layer that dumps text snippets into the model may underperform against deterministic tools or SQL-backed agents. If the task is mostly stable formatting or classification with little factual drift, fine-tuning or disciplined prompting might reach reliability faster than maintaining a large corpus pipeline. Use RAG where freshness, citations, and permission-aware knowledge dominate—not everywhere generative text appears.
How this relates to agents
RAG often powers tool-using agents that fetch, summarize, and act. The same ACL and freshness discipline applies—agents amplify mistakes if retrieval is sloppy. See our agent implementation playbook for orchestration patterns that pair well with grounded retrieval.
Checklists and deeper guides
Use our enterprise RAG checklist as a worksheet, and the long-form RAG vs fine-tuning article when executives ask why you are not “just training a model.” For build-focused positioning, the RAG development page summarizes how we deliver. When you want a second opinion on sequencing or risk, reach out with your corpus outline and ten example questions—we can stress-test feasibility quickly.