The Promise and Reality of RAG
Retrieval-Augmented Generation (RAG) has become the default architecture for grounding large language models in organizational data. The concept is straightforward: instead of relying solely on what an LLM learned during training, you retrieve relevant documents at query time and feed them as context.
In theory, this eliminates hallucinations, keeps responses current, and respects data boundaries. In practice, most RAG implementations deliver mediocre results — not because the architecture is flawed, but because the details matter enormously.
We've deployed RAG systems for NGOs managing 50,000+ beneficiary records, research institutions with decades of publications, and government agencies with complex regulatory documentation. Every deployment taught us something. This article distills those lessons into a practical guide — built entirely on open-source tools.
Why Open-Source RAG?
Before diving into architecture, a critical decision: why build with open-source?
When you send a query containing beneficiary data to a proprietary AI API, that data travels to external servers. You lose control over who processes it, where it's stored, and what happens to it. For organizations handling sensitive information — refugee records, medical data, legal cases — this is unacceptable.
An open-source RAG stack means:
- Data never leaves your infrastructure — Every component runs on servers you control
- Full auditability — You can inspect every model, every algorithm, every data flow
- No per-query costs — After the initial infrastructure investment, querying is virtually free
- No vendor lock-in — Replace any component without rewriting your system
- GDPR/RGPD compliance by design — Data residency is guaranteed, not promised
Understanding the RAG Pipeline
A RAG system has four distinct phases, and each one can silently degrade overall quality:
```mermaid
graph LR
    A["INGEST\nParse · Clean\nChunk · Embed"] --> B["RETRIEVE\nEmbed · Search\nRe-rank · Filter"]
    B --> C["GENERATE\nAssemble · Prompt\nGenerate · Stream"]
    C --> D["VALIDATE\nFaithfulness · Relevance\nCitations · Feedback"]
    style A fill:#EFF6FF,stroke:#1E40AF,color:#0F172A
    style B fill:#ECFDF5,stroke:#0F766E,color:#0F172A
    style C fill:#FEF3C7,stroke:#D97706,color:#0F172A
    style D fill:#FCE7F3,stroke:#DB2777,color:#0F172A
```
The critical insight: retrieval quality caps generation quality. If you retrieve the wrong documents, even the best LLM will produce wrong answers — confidently, with perfect grammar.
Where Most RAG Systems Fail
1. Document Preprocessing
The most overlooked step. Raw documents contain headers, footers, page numbers, table formatting, image captions, and metadata noise that pollutes your embeddings.
Common failures:
- PDF extraction that loses table structure, turning rows into garbled text
- OCR artifacts from scanned documents introducing phantom characters
- HTML-to-text conversion that strips semantic structure
- Duplicate content from multiple document versions
What works — all open-source:
- `unstructured.io` (Apache 2.0) for complex multi-format documents
- `pymupdf4llm` (AGPL) for PDFs with tables and structured content
- `Docling` (MIT, by IBM) for enterprise-grade document conversion
- `Tesseract OCR` (Apache 2.0) for scanned documents
```python
# Open-source document processing pipeline
from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title

elements = partition(filename="annual_report.pdf")

# Filter out noise elements
clean_elements = [
    el for el in elements
    if el.category not in ["Header", "Footer", "PageNumber"]
    and len(str(el)) > 50  # Skip tiny fragments
]

# Chunk respecting document structure
chunks = chunk_by_title(
    clean_elements,
    max_characters=1500,
    overlap=200,
    combine_text_under_n_chars=300,
)
```
2. Chunking Strategy
The most common mistake is treating chunking as a preprocessing afterthought. Fixed-size chunks (512 tokens, 1000 characters) ignore document structure entirely. A paragraph split mid-sentence loses its meaning.
| Strategy | Quality | Complexity | Best for |
|---|---|---|---|
| Fixed-size | ⭐⭐ | Trivial | Homogeneous text |
| Sentence-split | ⭐⭐⭐ | Low | Conversational |
| Semantic | ⭐⭐⭐⭐ | Medium | Technical docs |
| Document-aware | ⭐⭐⭐⭐⭐ | High | Structured docs |
Recommendation: Start with document-aware chunking. Fall back to semantic only if parsing is unreliable.
What works: Semantic chunking that respects document structure. Use heading boundaries, paragraph breaks, and logical sections. Overlap chunks by 10-15% to maintain context at boundaries. Add a metadata prefix to each chunk so the LLM knows the source.
```python
# Adding provenance context to each chunk
# (doc.title and chunk.section assume your parser exposes this metadata)
for chunk in chunks:
    chunk.text = f"[Source: {doc.title} | Section: {chunk.section}]\n{chunk.text}"
```
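The document-aware approach can also be sketched without a library: split on headings, pack paragraphs up to a size limit, and carry a small tail of each chunk forward as overlap. A minimal illustration — the heading regex and size parameters here are assumptions, not a fixed recipe:

```python
import re

def chunk_by_headings(text, max_chars=1500, overlap_chars=200):
    """Split on markdown-style headings, then pack paragraphs into
    chunks of at most ~max_chars, carrying a tail as overlap."""
    sections = re.split(r"\n(?=#{1,3} )", text)  # assumed heading style
    chunks = []
    for section in sections:
        paragraphs = [p for p in section.split("\n\n") if p.strip()]
        current = ""
        for para in paragraphs:
            if current and len(current) + len(para) > max_chars:
                chunks.append(current)
                current = current[-overlap_chars:]  # overlap with previous chunk
            current = (current + "\n\n" + para).strip()
        if current:
            chunks.append(current)
    return chunks
```

A dedicated chunker like `chunk_by_title` handles the edge cases (tables, nested sections, oversized paragraphs) far better; this only shows the core logic.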
3. Embedding Quality
Not all embeddings are created equal. Generic embeddings miss domain-specific semantic connections. When your documents use specialized vocabulary — legal terms, medical codes, humanitarian jargon — generic models fail silently.
Real-world example: In a project for an NGO, the term "protection" appeared constantly. In humanitarian contexts, "protection" refers specifically to safeguarding vulnerable populations from violence. Generic embeddings treated it as a synonym for "security" or "defense," retrieving irrelevant IT documents when users asked about protection programs.
Open-source embedding models we recommend:
| Model | Languages | Dimensions | License |
|---|---|---|---|
| BAAI/bge-large-en-v1.5 | EN | 1024 | MIT |
| BAAI/bge-m3 | 100+ | 1024 | MIT |
| intfloat/e5-large-v2 | EN | 1024 | MIT |
| intfloat/multilingual-e5-large-instruct | 100+ | 1024 | MIT |
| nomic-ai/nomic-embed-text-v1.5 | EN | 768 | Apache 2.0 |
| Snowflake/arctic-embed | EN | 1024 | Apache 2.0 |
For multilingual (our default): BAAI/bge-m3 or multilingual-e5-large-instruct. For English-only: Snowflake/arctic-embed or bge-large.
We've seen 30-40% improvement in retrieval accuracy by switching from generic to domain-tuned open-source embeddings. Fine-tuning bge-m3 on domain-specific pairs reduced hallucinations by 60% in one deployment.
4. Retrieval Strategy
Simple cosine similarity with top-k retrieval is a starting point, not a solution. It misses documents that are relevant but use different terminology. It can't handle multi-hop reasoning.
The retrieval hierarchy:
```mermaid
graph TB
    L4["Level 4: AGENTIC<br/>LLM reasons about query, iterates search,<br/>combines multi-hop evidence"]
    L3["Level 3: RE-RANKED<br/>Cross-encoder re-scores top candidates<br/>(cross-encoder/ms-marco — open-source)"]
    L2["Level 2: HYBRID<br/>Dense embeddings + sparse BM25<br/>Catches semantic and lexical matches"]
    L1["Level 1: BASIC<br/>Cosine similarity, top-k retrieval<br/>Starting point only"]
    L4 --> L3
    L3 --> L2
    L2 --> L1
    style L4 fill:#7C3AED,stroke:#5B21B6,color:#FFFFFF
    style L3 fill:#1E40AF,stroke:#1E3A8A,color:#FFFFFF
    style L2 fill:#0F766E,stroke:#065F46,color:#FFFFFF
    style L1 fill:#94A3B8,stroke:#64748B,color:#FFFFFF
```
```python
# Hybrid retrieval with open-source re-ranking
from sentence_transformers import CrossEncoder  # Apache 2.0

# Stage 1: Hybrid retrieval (Qdrant for dense vectors, rank_bm25 for sparse)
dense_results = qdrant_client.search(
    collection_name="documents",
    query_vector=embed_query(query),
    limit=30,
)
sparse_results = bm25_index.search(query, k=30)  # rank_bm25 (BSD)

# Merge and deduplicate
candidates = merge_results(dense_results, sparse_results, k=50)

# Stage 2: Cross-encoder re-ranking (open-source model)
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")
pairs = [(query, doc.text) for doc in candidates]
scores = reranker.predict(pairs)

# Return top results by re-ranked score
ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
final_context = [doc for doc, score in ranked[:8]]
```
5. Context Assembly
Even with perfect retrieval, how you assemble the context window matters. Stuffing all retrieved documents into a single prompt leads to the "lost in the middle" problem — LLMs pay more attention to information at the beginning and end of the context.
What works:
- Order retrieved chunks by relevance, most relevant first
- Include source metadata so the LLM can cite its sources
- Set a context budget (e.g., 4000 tokens) and truncate intelligently
- Use a structured prompt template that separates context from instructions
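Those rules combine into a small assembly function. A sketch with an assumed chunk shape (`score`, `source`, `section`, `text`) and a naive whitespace token estimate — swap in a real tokenizer for production:

```python
def assemble_context(chunks_with_scores, budget_tokens=4000):
    """Order chunks by relevance, prefix each with its source,
    and stop before the token budget is exceeded."""
    ordered = sorted(chunks_with_scores, key=lambda c: c["score"], reverse=True)
    parts, used = [], 0
    for chunk in ordered:
        block = f'[Source: {chunk["source"]} | Section: {chunk["section"]}]\n{chunk["text"]}'
        cost = len(block.split())  # crude token estimate
        if used + cost > budget_tokens:
            break  # budget exhausted: drop the least relevant remainder
        parts.append(block)
        used += cost
    return "\n\n---\n\n".join(parts)

# One possible structured template separating context from instructions
PROMPT_TEMPLATE = """Answer using ONLY the context below. Cite your sources.

### Context
{context}

### Question
{question}"""
```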
The Full Open-Source Architecture
Here's our production architecture — every single component is open-source and self-hostable:
```mermaid
graph TB
    subgraph ingestion["Ingestion Pipeline"]
        D["Documents<br/>(PDF, HTML, DOCX)"] --> P["Unstructured.io<br/>(parsing)"]
        P --> SC["Semantic<br/>Chunking"]
        SC --> ME["Metadata<br/>Enrichment"]
        ME --> EMB["BGE-M3 / E5-Large<br/>(open-source embeddings)"]
        EMB --> QD[("Qdrant<br/>(vectors)")]
        EMB --> ES[("Elasticsearch<br/>(BM25 index)")]
    end
    subgraph query["Query Pipeline"]
        UQ["User Query"] --> QX["Query<br/>Expander"]
        QX --> HS["Hybrid<br/>Search"]
        HS --> RR["Cross-Encoder<br/>Re-ranking"]
        RR --> CA["Context Assembly<br/>+ Prompt Template"]
        CA --> LLM["Llama 3 / Mistral / Qwen<br/>(via vLLM or Ollama)"]
        LLM --> ANS["Answer + Citations<br/>+ Source Documents"]
    end
    subgraph monitor["Monitoring"]
        AP["Arize Phoenix<br/>(open-source)"]
        PG["Prometheus<br/>+ Grafana"]
    end
    QD -.-> HS
    ES -.-> HS
    style ingestion fill:#EFF6FF,stroke:#1E40AF
    style query fill:#ECFDF5,stroke:#0F766E
    style monitor fill:#FEF3C7,stroke:#D97706
```
The Complete Open-Source Stack
| Component | Tool | License |
|---|---|---|
| Vector Store | Qdrant | Apache 2.0 |
| Sparse Index | Elasticsearch / Meilisearch | SSPL / MIT |
| Embeddings | BGE-M3 / E5-Large | MIT |
| Re-ranker | cross-encoder/ms-marco | Apache 2.0 |
| LLM (large) | Llama 3.1 70B / Qwen 2.5 72B | Llama Community / Qwen license |
| LLM (fast) | Mistral 7B / Llama 3.1 8B | Apache 2.0 |
| LLM Serving | vLLM / Ollama | Apache 2.0 / MIT |
| Orchestration | LangChain / Haystack | MIT |
| Monitoring | Arize Phoenix | Apache 2.0 |
| Metrics | Prometheus + Grafana | Apache 2.0 |
| Document Parsing | Unstructured / Docling | Apache 2.0 / MIT |
| Evaluation | RAGAS | Apache 2.0 |
Total license cost: $0. Total vendor lock-in: zero. Total data leaving your infrastructure: none.
Multi-Language RAG
For organizations operating across languages — common for international NGOs and EU institutions:
- Multilingual embeddings — BAAI/bge-m3 handles 100+ languages in the same vector space, entirely self-hosted
- Cross-lingual retrieval — A query in Spanish retrieves relevant documents in English, and vice versa
- Language-aware generation — Open-source LLMs like Llama 3 and Qwen 2.5 are natively multilingual
We've found that multilingual retrieval accuracy drops ~15% compared to monolingual. Compensate by retrieving more candidates (top-50 instead of top-30) and relying more heavily on re-ranking.
Evaluation: The RAGAS Framework
Don't deploy a RAG system without systematic evaluation. RAGAS (open-source, Apache 2.0) measures four dimensions:
| Dimension | Question it answers | Target |
|---|---|---|
| Context Precision | Are retrieved docs actually relevant? | > 0.75 |
| Context Recall | Did we find ALL relevant information? | > 0.80 |
| Faithfulness | Does the answer stick to the context? No hallucinations? | > 0.85 |
| Answer Relevance | Does the answer address the user's question? | > 0.80 |
Below these thresholds, users will notice quality problems.
```python
# Automated evaluation with RAGAS (Apache 2.0)
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# Using a self-hosted LLM as the evaluator
results = evaluate(
    dataset=eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision],
    llm=local_llm,  # Llama 3 via vLLM — no data leaves your infra
)

print(f"Faithfulness: {results['faithfulness']:.2f}")
print(f"Relevancy: {results['answer_relevancy']:.2f}")
print(f"Precision: {results['context_precision']:.2f}")
```
Production Monitoring with Open-Source Tools
```mermaid
graph TB
    subgraph phoenix["Arize Phoenix (Apache 2.0)"]
        PT["LLM Trace Logging"]
        PQ["Retrieval Quality Dashboards"]
        PH["Hallucination Detection"]
        PD["Embedding Drift Monitoring"]
    end
    subgraph prom["Prometheus + Grafana (Apache 2.0)"]
        PL["Retrieval Latency (P50, P95, P99)"]
        PE["End-to-end Latency (< 3s target)"]
        PR["Empty Retrieval Rate"]
        PU["Token Throughput & GPU Utilization"]
    end
    subgraph feedback["User Feedback Loop"]
        FT["Thumbs Up/Down"]
        FR["Report Issue Mechanism"]
        FC["Feedback drives Re-evaluation"]
    end
    style phoenix fill:#EFF6FF,stroke:#1E40AF
    style prom fill:#ECFDF5,stroke:#0F766E
    style feedback fill:#FEF3C7,stroke:#D97706
```
Set up alerting for anomalies. A sudden spike in empty retrievals might mean new documents weren't indexed. A drop in faithfulness scores might indicate prompt template changes had unintended effects.
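The empty-retrieval alert can be prototyped in a few lines before wiring it into Prometheus. A sketch using a rolling window — the window size and threshold are illustrative, not recommendations:

```python
from collections import deque

class EmptyRetrievalMonitor:
    """Track the share of queries that retrieved nothing over a
    rolling window, and flag when it crosses a threshold."""

    def __init__(self, window=100, threshold=0.2):
        self.events = deque(maxlen=window)
        self.threshold = threshold

    def record(self, n_retrieved):
        self.events.append(n_retrieved == 0)

    @property
    def empty_rate(self):
        return sum(self.events) / len(self.events) if self.events else 0.0

    def should_alert(self):
        # A sudden spike often means new documents were never indexed
        return len(self.events) >= 10 and self.empty_rate > self.threshold
```

In production you would export `empty_rate` as a Prometheus gauge and let an alerting rule watch it, but the logic is exactly this.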
Cost Optimization
One of the biggest advantages of open-source RAG: you control the cost curve.
| | Proprietary (API) | Open-Source (Self-hosted) |
|---|---|---|
| Embedding cost | ~$0.0001 per 1K tokens | Near zero after infrastructure |
| LLM generation | ~$0.01-0.06 per 1K tokens | Near zero after infrastructure |
| 10K queries/month | $300-1,800/month | $200-800/month (GPU server) |
| Scaling | Linear — costs grow with usage | Flat — 100K queries cost the same as 10K |
| Data sovereignty | Data leaves your infra every query | Data never leaves your infra |
| Break-even | — | ~5,000 queries/month |
At 50K+ queries/month, open-source is 5-10x cheaper.
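The break-even figure follows from simple arithmetic: a flat GPU cost crosses per-query API pricing at `gpu_monthly / cost_per_query`. A sketch using mid-range numbers from the table above (illustrative assumptions, not a price quote):

```python
def api_monthly_cost(queries, cost_per_query=0.10):
    """Per-query API pricing scales linearly with volume.
    ~$0.10/query is the midpoint of the table's $300-1,800 per 10K."""
    return queries * cost_per_query

def self_hosted_monthly_cost(queries, gpu_flat=500.0):
    """A dedicated GPU server costs roughly the same at any volume."""
    return gpu_flat

def break_even_queries(gpu_flat=500.0, cost_per_query=0.10):
    """Volume at which the flat GPU cost beats per-query pricing."""
    return gpu_flat / cost_per_query

print(break_even_queries())              # ≈ 5,000 queries/month
print(api_monthly_cost(50_000))          # ≈ $5,000 via API
print(self_hosted_monthly_cost(50_000))  # $500 self-hosted, ~10x cheaper
```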
Tips for optimizing your self-hosted stack:
- Batch embedding during ingestion (not real-time) to maximize GPU utilization
- Use quantized models — GPTQ or AWQ quantization runs 70B models on a single A100
- Tiered inference — Use Mistral 7B for query analysis/routing, Llama 3.1 70B for final generation
- Redis caching — Cache frequent queries and responses. Even 10% cache hit saves GPU cycles
- Context pruning — Don't send 20 chunks to the LLM if 5 suffice
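The caching tip can be sketched with a hash of the normalized query as the key plus a TTL. Here an in-process dict stands in for Redis (in production you would use `SETEX`/`GET` on a Redis client); the key scheme and TTL are assumptions:

```python
import hashlib
import time

class QueryCache:
    """Cache answers under a hash of the normalized query, with a TTL.
    Swap the dict for a Redis client in production."""

    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self.store = {}

    def _key(self, query):
        # Collapse case and whitespace so trivially different
        # phrasings of the same query hit the same entry
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get(self, query):
        entry = self.store.get(self._key(query))
        if entry and time.monotonic() - entry[1] < self.ttl:
            return entry[0]
        return None

    def put(self, query, answer):
        self.store[self._key(query)] = (answer, time.monotonic())
```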
Hardware Recommendations
| Scale | GPU | Handles |
|---|---|---|
| Prototype | RTX 4090 (24GB) | Mistral 7B + embeddings |
| Small org | A10G (24GB) | Llama 3.1 8B quantized |
| Medium org | A100 (80GB) | Llama 3.1 70B quantized |
| Large org | 2x A100 (80GB) | Qwen 2.5 72B full precision |
CPU-only option: Ollama with Llama 3.1 8B runs on any modern server — slower but zero GPU cost.
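The claim that a quantized 70B model fits on one 80GB A100 is simple arithmetic: weight memory is roughly parameters × bits / 8, before KV-cache and activation overhead.

```python
def weight_memory_gb(params_billion, bits):
    """Approximate weight memory for a dense LLM, ignoring
    KV-cache and activation overhead."""
    return params_billion * 1e9 * bits / 8 / 1e9

print(weight_memory_gb(70, 16))  # fp16: 140.0 GB — needs 2x A100
print(weight_memory_gb(70, 4))   # 4-bit GPTQ/AWQ: 35.0 GB — fits one A100
```

The remaining ~45GB on the A100 absorbs the KV-cache, which grows with batch size and context length.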
Common Pitfalls We've Seen
After building RAG systems for dozens of organizations, these patterns keep recurring:
- "We'll just use the defaults" — Default chunking, default embeddings, default prompts. This gives you a demo, not a product.
- No evaluation dataset — You can't improve what you can't measure. Building a 50-100 question eval set is the single best investment you can make.
- Ignoring document freshness — Documents get updated, policies change, data expires. Without a re-indexing strategy, your RAG system serves stale information.
- Overloading the context — More context is not better. After ~4000 tokens of context, quality degrades because of the "lost in the middle" effect.
- Skipping the human-in-the-loop — For high-stakes applications (legal, medical, humanitarian), always include a mechanism for human review of AI responses.
- Defaulting to proprietary APIs — Sending sensitive organizational data to external APIs is a compliance and security risk. Self-hosted open-source models eliminate this risk entirely.
Conclusion
RAG is not a plug-and-play solution. It's an architecture that requires careful engineering at every stage — from document processing to retrieval to generation. The difference between a RAG system that "kind of works" and one that transforms how an organization uses its knowledge is in these engineering details.
The good news: the entire stack can be built with open-source tools. No vendor lock-in. No data leaving your infrastructure. No per-query fees that scale linearly with usage. And with models like Llama 3.1, Mistral, and Qwen 2.5, the quality gap between open-source and proprietary models has narrowed dramatically.
At Quorax, we've built RAG systems for organizations processing thousands of documents across multiple languages — from NGO beneficiary databases to research institution archives. Every system runs on open-source infrastructure that the organization owns and controls. Not because we're ideological about it, but because for organizations handling sensitive data, sovereignty isn't optional.