The Promise and Reality of RAG
Retrieval-Augmented Generation (RAG) has become the default architecture for grounding large language models in organizational data. The concept is straightforward: instead of relying solely on what an LLM learned during training, you retrieve relevant documents at query time and feed them as context.
In theory, this eliminates hallucinations, keeps responses current, and respects data boundaries. In practice, most RAG implementations deliver mediocre results — not because the architecture is flawed, but because the details matter enormously.
We've deployed RAG systems for NGOs managing 50,000+ beneficiary records, research institutions with decades of publications, and government agencies with complex regulatory documentation. Every deployment taught us something. This article distills those lessons into a practical guide — built entirely on open-source tools.
Why Open-Source RAG?
Before diving into architecture, a critical decision: why build with open-source?
When you send a query containing beneficiary data to a proprietary AI API, that data travels to external servers. You lose control over who processes it, where it's stored, and what happens to it. For organizations handling sensitive information — refugee records, medical data, legal cases — this is unacceptable.
An open-source RAG stack means:
- Data never leaves your infrastructure — Every component runs on servers you control
- Full auditability — You can inspect every model, every algorithm, every data flow
- No per-query costs — After the initial infrastructure investment, querying is virtually free
- No vendor lock-in — Replace any component without rewriting your system
- GDPR/RGPD compliance by design — Data residency is guaranteed, not promised
Understanding the RAG Pipeline
A RAG system has four distinct phases, and each one can silently degrade overall quality:
```mermaid
graph LR
    A["INGEST\nParse · Clean\nChunk · Embed"] --> B["RETRIEVE\nEmbed · Search\nRe-rank · Filter"]
    B --> C["GENERATE\nAssemble · Prompt\nGenerate · Stream"]
    C --> D["VALIDATE\nFaithfulness · Relevance\nCitations · Feedback"]
    style A fill:#EFF6FF,stroke:#1E40AF,color:#0F172A
    style B fill:#ECFDF5,stroke:#0F766E,color:#0F172A
    style C fill:#FEF3C7,stroke:#D97706,color:#0F172A
    style D fill:#FCE7F3,stroke:#DB2777,color:#0F172A
```
The critical insight: retrieval quality caps generation quality. If you retrieve the wrong documents, even the best LLM will produce wrong answers — confidently, with perfect grammar.
Where Most RAG Systems Fail
1. Document Preprocessing
The most overlooked step. Raw documents contain headers, footers, page numbers, table formatting, image captions, and metadata noise that pollutes your embeddings.
Common failures:
- PDF extraction that loses table structure, turning rows into garbled text
- OCR artifacts from scanned documents introducing phantom characters
- HTML-to-text conversion that strips semantic structure
- Duplicate content from multiple document versions
What works — all open-source:
- `unstructured.io` (Apache 2.0) for complex multi-format documents
- `pymupdf4llm` (AGPL) for PDFs with tables and structured content
- `Docling` (MIT, by IBM) for enterprise-grade document conversion
- `Tesseract OCR` (Apache 2.0) for scanned documents
```python
# Open-source document processing pipeline
from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title

elements = partition(filename="annual_report.pdf")

# Filter out noise elements
clean_elements = [
    el for el in elements
    if el.category not in ["Header", "Footer", "PageNumber"]
    and len(str(el)) > 50  # Skip tiny fragments
]

# Chunk respecting document structure
chunks = chunk_by_title(
    clean_elements,
    max_characters=1500,
    overlap=200,
    combine_text_under_n_chars=300,
)
```
2. Chunking Strategy
The most common mistake is treating chunking as a preprocessing afterthought. Fixed-size chunks (512 tokens, 1000 characters) ignore document structure entirely. A paragraph split mid-sentence loses its meaning.
| Strategy | Quality | Complexity | Best for |
|---|---|---|---|
| Fixed-size | ⭐⭐ | Trivial | Homogeneous text |
| Sentence-split | ⭐⭐⭐ | Low | Conversational |
| Semantic | ⭐⭐⭐⭐ | Medium | Technical docs |
| Document-aware | ⭐⭐⭐⭐⭐ | High | Structured docs |
Recommendation: Start with document-aware chunking. Fall back to semantic only if parsing is unreliable.
What works: Semantic chunking that respects document structure. Use heading boundaries, paragraph breaks, and logical sections. Overlap chunks by 10-15% to maintain context at boundaries. Add a metadata prefix to each chunk so the LLM knows the source.
```python
# Adding provenance context to each chunk
# (doc.title and chunk.section assume your parser exposes this metadata)
for chunk in chunks:
    chunk.text = f"[Source: {doc.title} | Section: {chunk.section}]\n{chunk.text}"
```
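The document-aware approach can also be sketched without a library: split on headings, pack paragraphs up to a size limit, and carry a small tail of each chunk forward as overlap. A minimal illustration — the heading regex and size parameters here are assumptions, not a fixed recipe:

```python
import re

def chunk_by_headings(text, max_chars=1500, overlap_chars=200):
    """Split on markdown-style headings, then pack paragraphs into
    chunks of at most ~max_chars, carrying a tail as overlap."""
    sections = re.split(r"\n(?=#{1,3} )", text)  # assumed heading style
    chunks = []
    for section in sections:
        paragraphs = [p for p in section.split("\n\n") if p.strip()]
        current = ""
        for para in paragraphs:
            if current and len(current) + len(para) > max_chars:
                chunks.append(current)
                current = current[-overlap_chars:]  # overlap with previous chunk
            current = (current + "\n\n" + para).strip()
        if current:
            chunks.append(current)
    return chunks
```

A dedicated chunker like `chunk_by_title` handles the edge cases (tables, nested sections, oversized paragraphs) far better; this only shows the core logic.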
3. Embedding Quality
Not all embeddings are created equal. Generic embeddings miss domain-specific semantic connections. When your documents use specialized vocabulary — legal terms, medical codes, humanitarian jargon — generic models fail silently.
Real-world example: In a project for an NGO, the term "protection" appeared constantly. In humanitarian contexts, "protection" refers specifically to safeguarding vulnerable populations from violence. Generic embeddings treated it as a synonym for "security" or "defense," retrieving irrelevant IT documents when users asked about protection programs.
Open-source embedding models we recommend:
| Model | Languages | Dimensions | License |
|---|---|---|---|
| BAAI/bge-large-en-v1.5 | EN | 1024 | MIT |
| BAAI/bge-m3 | 100+ | 1024 | MIT |
| intfloat/e5-large-v2 | EN | 1024 | MIT |
| intfloat/multilingual-e5-large-instruct | 100+ | 1024 | MIT |
| nomic-ai/nomic-embed-text-v1.5 | EN | 768 | Apache 2.0 |
| Snowflake/arctic-embed | EN | 1024 | Apache 2.0 |
For multilingual (our default): BAAI/bge-m3 or multilingual-e5-large-instruct. For English-only: Snowflake/arctic-embed or bge-large.
We've seen 30-40% improvement in retrieval accuracy by switching from generic to domain-tuned open-source embeddings. Fine-tuning bge-m3 on domain-specific pairs reduced hallucinations by 60% in one deployment.
4. Retrieval Strategy
Simple cosine similarity with top-k retrieval is a starting point, not a solution. It misses documents that are relevant but use different terminology. It can't handle multi-hop reasoning.
The retrieval hierarchy:
```mermaid
graph TB
    L4["Level 4: AGENTIC<br/>LLM reasons about query, iterates search,<br/>combines multi-hop evidence"]
    L3["Level 3: RE-RANKED<br/>Cross-encoder re-scores top candidates<br/>(cross-encoder/ms-marco — open-source)"]
    L2["Level 2: HYBRID<br/>Dense embeddings + sparse BM25<br/>Catches semantic and lexical matches"]
    L1["Level 1: BASIC<br/>Cosine similarity, top-k retrieval<br/>Starting point only"]
    L4 --> L3
    L3 --> L2
    L2 --> L1
    style L4 fill:#7C3AED,stroke:#5B21B6,color:#FFFFFF
    style L3 fill:#1E40AF,stroke:#1E3A8A,color:#FFFFFF
    style L2 fill:#0F766E,stroke:#065F46,color:#FFFFFF
    style L1 fill:#94A3B8,stroke:#64748B,color:#FFFFFF
```
```python
# Hybrid retrieval with open-source re-ranking
from sentence_transformers import CrossEncoder  # Apache 2.0

# Stage 1: Hybrid retrieval (Qdrant for dense vectors, rank_bm25 for sparse)
dense_results = qdrant_client.search(
    collection_name="documents",
    query_vector=embed_query(query),
    limit=30,
)
sparse_results = bm25_index.search(query, k=30)  # rank_bm25 (BSD)

# Merge and deduplicate
candidates = merge_results(dense_results, sparse_results, k=50)

# Stage 2: Cross-encoder re-ranking (open-source model)
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")
pairs = [(query, doc.text) for doc in candidates]
scores = reranker.predict(pairs)

# Return top results by re-ranked score
ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
final_context = [doc for doc, score in ranked[:8]]
```
5. Context Assembly
Even with perfect retrieval, how you assemble the context window matters. Stuffing all retrieved documents into a single prompt leads to the "lost in the middle" problem — LLMs pay more attention to information at the beginning and end of the context.
What works:
- Order retrieved chunks by relevance, most relevant first
- Include source metadata so the LLM can cite its sources
- Set a context budget (e.g., 4000 tokens) and truncate intelligently
- Use a structured prompt template that separates context from instructions
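Those rules combine into a small assembly function. A sketch with an assumed chunk shape (`score`, `source`, `section`, `text`) and a naive whitespace token estimate — swap in a real tokenizer for production:

```python
def assemble_context(chunks_with_scores, budget_tokens=4000):
    """Order chunks by relevance, prefix each with its source,
    and stop before the token budget is exceeded."""
    ordered = sorted(chunks_with_scores, key=lambda c: c["score"], reverse=True)
    parts, used = [], 0
    for chunk in ordered:
        block = f'[Source: {chunk["source"]} | Section: {chunk["section"]}]\n{chunk["text"]}'
        cost = len(block.split())  # crude token estimate
        if used + cost > budget_tokens:
            break  # budget exhausted: drop the least relevant remainder
        parts.append(block)
        used += cost
    return "\n\n---\n\n".join(parts)

# One possible structured template separating context from instructions
PROMPT_TEMPLATE = """Answer using ONLY the context below. Cite your sources.

### Context
{context}

### Question
{question}"""
```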
The Full Open-Source Architecture
Here's our production architecture — every single component is open-source and self-hostable:
```mermaid
graph TB
    subgraph ingestion["Ingestion Pipeline"]
        D["Documents<br/>(PDF, HTML, DOCX)"] --> P["Unstructured.io<br/>(parsing)"]
        P --> SC["Semantic<br/>Chunking"]
        SC --> ME["Metadata<br/>Enrichment"]
        ME --> EMB["BGE-M3 / E5-Large<br/>(open-source embeddings)"]
        EMB --> QD[("Qdrant<br/>(vectors)")]
        EMB --> ES[("Elasticsearch<br/>(BM25 index)")]
    end
    subgraph query["Query Pipeline"]
        UQ["User Query"] --> QX["Query<br/>Expander"]
        QX --> HS["Hybrid<br/>Search"]
        HS --> RR["Cross-Encoder<br/>Re-ranking"]
        RR --> CA["Context Assembly<br/>+ Prompt Template"]
        CA --> LLM["Llama 3 / Mistral / Qwen<br/>(via vLLM or Ollama)"]
        LLM --> ANS["Answer + Citations<br/>+ Source Documents"]
    end
    subgraph monitor["Monitoring"]
        AP["Arize Phoenix<br/>(open-source)"]
        PG["Prometheus<br/>+ Grafana"]
    end
    QD -.-> HS
    ES -.-> HS
    style ingestion fill:#EFF6FF,stroke:#1E40AF
    style query fill:#ECFDF5,stroke:#0F766E
    style monitor fill:#FEF3C7,stroke:#D97706
```
The Complete Open-Source Stack
| Component | Tool | License |
|---|---|---|
| Vector Store | Qdrant | Apache 2.0 |
| Sparse Index | Elasticsearch / Meilisearch | SSPL / MIT |
| Embeddings | BGE-M3 / E5-Large | MIT |
| Re-ranker | cross-encoder/ms-marco | Apache 2.0 |
| LLM (large) | Llama 3.1 70B / Qwen 2.5 72B | Llama Community / Qwen license |
| LLM (fast) | Mistral 7B / Llama 3.1 8B | Apache 2.0 |
| LLM Serving | vLLM / Ollama | Apache 2.0 / MIT |
| Orchestration | LangChain / Haystack | MIT |
| Monitoring | Arize Phoenix | Apache 2.0 |
| Metrics | Prometheus + Grafana | Apache 2.0 |
| Document Parsing | Unstructured / Docling | Apache 2.0 / MIT |
| Evaluation | RAGAS | Apache 2.0 |
Total license cost: $0. Total vendor lock-in: zero. Total data leaving your infrastructure: none.
Multi-Language RAG
For organizations operating across languages — common for international NGOs and EU institutions:
- Multilingual embeddings — BAAI/bge-m3 handles 100+ languages in the same vector space, entirely self-hosted
- Cross-lingual retrieval — A query in Spanish retrieves relevant documents in English, and vice versa
- Language-aware generation — Open-source LLMs like Llama 3 and Qwen 2.5 are natively multilingual
We've found that multilingual retrieval accuracy drops ~15% compared to monolingual. Compensate by retrieving more candidates (top-50 instead of top-30) and relying more heavily on re-ranking.
Evaluation: The RAGAS Framework
Don't deploy a RAG system without systematic evaluation. RAGAS (open-source, Apache 2.0) measures four dimensions:
| Dimension | Question it answers | Target |
|---|---|---|
| Context Precision | Are retrieved docs actually relevant? | > 0.75 |
| Context Recall | Did we find ALL relevant information? | > 0.80 |
| Faithfulness | Does the answer stick to the context? No hallucinations? | > 0.85 |
| Answer Relevance | Does the answer address the user's question? | > 0.80 |
Below these thresholds, users will notice quality problems.
```python
# Automated evaluation with RAGAS (Apache 2.0)
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# Using a self-hosted LLM as the evaluator
results = evaluate(
    dataset=eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision],
    llm=local_llm,  # Llama 3 via vLLM — no data leaves your infra
)

print(f"Faithfulness: {results['faithfulness']:.2f}")
print(f"Relevancy: {results['answer_relevancy']:.2f}")
print(f"Precision: {results['context_precision']:.2f}")
```
Production Monitoring with Open-Source Tools
```mermaid
graph TB
    subgraph phoenix["Arize Phoenix (Apache 2.0)"]
        PT["LLM Trace Logging"]
        PQ["Retrieval Quality Dashboards"]
        PH["Hallucination Detection"]
        PD["Embedding Drift Monitoring"]
    end
    subgraph prom["Prometheus + Grafana (Apache 2.0)"]
        PL["Retrieval Latency (P50, P95, P99)"]
        PE["End-to-end Latency (< 3s target)"]
        PR["Empty Retrieval Rate"]
        PU["Token Throughput & GPU Utilization"]
    end
    subgraph feedback["User Feedback Loop"]
        FT["Thumbs Up/Down"]
        FR["Report Issue Mechanism"]
        FC["Feedback drives Re-evaluation"]
    end
    style phoenix fill:#EFF6FF,stroke:#1E40AF
    style prom fill:#ECFDF5,stroke:#0F766E
    style feedback fill:#FEF3C7,stroke:#D97706
```
Set up alerting for anomalies. A sudden spike in empty retrievals might mean new documents weren't indexed. A drop in faithfulness scores might indicate prompt template changes had unintended effects.
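The empty-retrieval alert can be prototyped in a few lines before wiring it into Prometheus. A sketch using a rolling window — the window size and threshold are illustrative, not recommendations:

```python
from collections import deque

class EmptyRetrievalMonitor:
    """Track the share of queries that retrieved nothing over a
    rolling window, and flag when it crosses a threshold."""

    def __init__(self, window=100, threshold=0.2):
        self.events = deque(maxlen=window)
        self.threshold = threshold

    def record(self, n_retrieved):
        self.events.append(n_retrieved == 0)

    @property
    def empty_rate(self):
        return sum(self.events) / len(self.events) if self.events else 0.0

    def should_alert(self):
        # A sudden spike often means new documents were never indexed
        return len(self.events) >= 10 and self.empty_rate > self.threshold
```

In production you would export `empty_rate` as a Prometheus gauge and let an alerting rule watch it, but the logic is exactly this.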
Cost Optimization
One of the biggest advantages of open-source RAG: you control the cost curve.
| | Proprietary (API) | Open-Source (Self-hosted) |
|---|---|---|
| Embedding cost | ~$0.0001 per 1K tokens | Near zero after infrastructure |
| LLM generation | ~$0.01-0.06 per 1K tokens | Near zero after infrastructure |
| 10K queries/month | $300-1,800/month | $200-800/month (GPU server) |
| Scaling | Linear — costs grow with usage | Flat — 100K queries cost the same as 10K |
| Data sovereignty | Data leaves your infra every query | Data never leaves your infra |
| Break-even | — | ~5,000 queries/month |
At 50K+ queries/month, open-source is 5-10x cheaper.
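The break-even figure follows from simple arithmetic: a flat GPU cost crosses per-query API pricing at `gpu_monthly / cost_per_query`. A sketch using mid-range numbers from the table above (illustrative assumptions, not a price quote):

```python
def api_monthly_cost(queries, cost_per_query=0.10):
    """Per-query API pricing scales linearly with volume.
    ~$0.10/query is the midpoint of the table's $300-1,800 per 10K."""
    return queries * cost_per_query

def self_hosted_monthly_cost(queries, gpu_flat=500.0):
    """A dedicated GPU server costs roughly the same at any volume."""
    return gpu_flat

def break_even_queries(gpu_flat=500.0, cost_per_query=0.10):
    """Volume at which the flat GPU cost beats per-query pricing."""
    return gpu_flat / cost_per_query

print(break_even_queries())              # ≈ 5,000 queries/month
print(api_monthly_cost(50_000))          # ≈ $5,000 via API
print(self_hosted_monthly_cost(50_000))  # $500 self-hosted, ~10x cheaper
```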
Tips for optimizing your self-hosted stack:
- Batch embedding during ingestion (not real-time) to maximize GPU utilization
- Use quantized models — GPTQ or AWQ quantization runs 70B models on a single A100
- Tiered inference — Use Mistral 7B for query analysis/routing, Llama 3.1 70B for final generation
- Redis caching — Cache frequent queries and responses. Even 10% cache hit saves GPU cycles
- Context pruning — Don't send 20 chunks to the LLM if 5 suffice
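The caching tip can be sketched with a hash of the normalized query as the key plus a TTL. Here an in-process dict stands in for Redis (in production you would use `SETEX`/`GET` on a Redis client); the key scheme and TTL are assumptions:

```python
import hashlib
import time

class QueryCache:
    """Cache answers under a hash of the normalized query, with a TTL.
    Swap the dict for a Redis client in production."""

    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self.store = {}

    def _key(self, query):
        # Collapse case and whitespace so trivially different
        # phrasings of the same query hit the same entry
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get(self, query):
        entry = self.store.get(self._key(query))
        if entry and time.monotonic() - entry[1] < self.ttl:
            return entry[0]
        return None

    def put(self, query, answer):
        self.store[self._key(query)] = (answer, time.monotonic())
```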
Hardware Recommendations
| Scale | GPU | Handles |
|---|---|---|
| Prototype | RTX 4090 (24GB) | Mistral 7B + embeddings |
| Small org | A10G (24GB) | Llama 3.1 8B quantized |
| Medium org | A100 (80GB) | Llama 3.1 70B quantized |
| Large org | 2x A100 (80GB) | Qwen 2.5 72B full precision |
CPU-only option: Ollama with Llama 3.1 8B runs on any modern server — slower but zero GPU cost.
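The claim that a quantized 70B model fits on one 80GB A100 is simple arithmetic: weight memory is roughly parameters × bits / 8, before KV-cache and activation overhead.

```python
def weight_memory_gb(params_billion, bits):
    """Approximate weight memory for a dense LLM, ignoring
    KV-cache and activation overhead."""
    return params_billion * 1e9 * bits / 8 / 1e9

print(weight_memory_gb(70, 16))  # fp16: 140.0 GB — needs 2x A100
print(weight_memory_gb(70, 4))   # 4-bit GPTQ/AWQ: 35.0 GB — fits one A100
```

The remaining ~45GB on the A100 absorbs the KV-cache, which grows with batch size and context length.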
Common Pitfalls We've Seen
After building RAG systems for dozens of organizations, these patterns keep recurring:
- "We'll just use the defaults" — Default chunking, default embeddings, default prompts. This gives you a demo, not a product.
- No evaluation dataset — You can't improve what you can't measure. Building a 50-100 question eval set is the single best investment you can make.
- Ignoring document freshness — Documents get updated, policies change, data expires. Without a re-indexing strategy, your RAG system serves stale information.
- Overloading the context — More context is not better. After ~4000 tokens of context, quality degrades because of the "lost in the middle" effect.
- Skipping the human-in-the-loop — For high-stakes applications (legal, medical, humanitarian), always include a mechanism for human review of AI responses.
- Defaulting to proprietary APIs — Sending sensitive organizational data to external APIs is a compliance and security risk. Self-hosted open-source models eliminate this risk entirely.
Conclusion
RAG is not a plug-and-play solution. It's an architecture that requires careful engineering at every stage — from document processing to retrieval to generation. The difference between a RAG system that "kind of works" and one that transforms how an organization uses its knowledge is in these engineering details.
The good news: the entire stack can be built with open-source tools. No vendor lock-in. No data leaving your infrastructure. No per-query fees that scale linearly with usage. And with models like Llama 3.1, Mistral, and Qwen 2.5, the quality gap between open-source and proprietary models has narrowed dramatically.
At Quorax, we've built RAG systems for organizations processing thousands of documents across multiple languages — from NGO beneficiary databases to research institution archives. Every system runs on open-source infrastructure that the organization owns and controls. Not because we're ideological about it, but because for organizations handling sensitive data, sovereignty isn't optional.