AI & Machine Learning · October 21st, 2025 · 13 min read

Evaluating RAG Systems with RAGAS: Metrics That Matter

A comprehensive guide to evaluating RAG systems using RAGAS framework. Learn how to measure faithfulness, answer relevancy, context precision, and context recall to build reliable AI applications.

By Team Astraq

In our RAG Complete Guide and Advanced RAG Agents articles, we built production-ready retrieval systems with hybrid search and multi-hop reasoning. But how do you know if your RAG system is actually good? Enter RAGAS, the gold standard for evaluating retrieval-augmented generation systems.

Why Evaluate RAG Systems?

Building a RAG system is one thing; knowing whether it works reliably is another. Without proper evaluation, you're flying blind:

  • Hallucinations may go undetected until users complain
  • Retrieval quality degrades silently as your corpus grows
  • Prompt changes might hurt performance without you knowing
  • Model upgrades could introduce regressions

"You can't improve what you can't measure." This principle is especially critical for RAG systems where both retrieval AND generation can fail independently.

Understanding RAGAS Metrics

RAGAS provides four core metrics that evaluate different aspects of your RAG pipeline:

| Metric            | What It Measures                     | Range | Target |
|-------------------|--------------------------------------|-------|--------|
| Faithfulness      | Is the answer grounded in context?   | 0-1   | > 0.9  |
| Answer Relevancy  | Does it answer the question?         | 0-1   | > 0.8  |
| Context Precision | Are retrieved docs ranked correctly? | 0-1   | > 0.75 |
| Context Recall    | Was all necessary info retrieved?    | 0-1   | > 0.8  |

The Metric Relationships

Faithfulness and answer relevancy evaluate the generation side of the pipeline, while context precision and context recall evaluate retrieval. Keeping that split in mind tells you which component to debug when a score drops.

Setting Up RAGAS

Install Dependencies

Add RAGAS to your existing RAG project:

$> uv add ragas datasets

The core RAGAS metrics use an LLM as a judge, so make sure your provider credentials are available when you run evaluations (the CI workflow later in this article passes OPENAI_API_KEY, for example).

Prepare Your Evaluation Dataset

RAGAS requires a specific format with questions, ground truth answers, and retrieved contexts:

evaluation/dataset.py

from datasets import Dataset

eval_data = {
    "question": [
        "What is the embedding dimension of all-MiniLM-L6-v2?",
        "How does hybrid search improve retrieval quality?",
        "When should you use ColBERT reranking?",
    ],
    "ground_truth": [
        "The all-MiniLM-L6-v2 model produces embeddings with 384 dimensions.",
        "Hybrid search combines dense semantic embeddings with sparse keyword matching, capturing both meaning and exact terms.",
        "ColBERT reranking should be used when you need high precision and can afford the additional compute cost.",
    ],
    "answer": [],  # Will be filled by your RAG system
    "contexts": [],  # Will be filled by retrieval
}

dataset = Dataset.from_dict(eval_data)

Generate Answers and Contexts

Run your RAG system on the evaluation questions:

evaluation/generate.py

from search import hybrid_search
from generate import generate_response

def evaluate_rag_system(questions: list[str]) -> tuple[list[str], list[list[str]]]:
    """Run RAG pipeline and collect answers with contexts."""
    answers = []
    all_contexts = []

    for question in questions:
        # Retrieve context
        results = hybrid_search(question, limit=5)
        contexts = [r["document"] for r in results]

        # Generate answer
        answer = generate_response(question, contexts)

        answers.append(answer)
        all_contexts.append(contexts)

    return answers, all_contexts

Running RAGAS Evaluation

Here's the complete evaluation pipeline:

evaluation/run_ragas.py

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

from evaluation.generate import evaluate_rag_system

# Your evaluation questions and ground truths
questions = [
    "What is the embedding dimension of all-MiniLM-L6-v2?",
    "How does hybrid search improve retrieval quality?",
    "When should you use ColBERT reranking?",
]

ground_truths = [
    "The all-MiniLM-L6-v2 model produces embeddings with 384 dimensions.",
    "Hybrid search combines dense semantic embeddings with sparse keyword matching.",
    "ColBERT reranking is best for precision-critical applications.",
]

# Generate answers and retrieve contexts
answers, contexts = evaluate_rag_system(questions)

# Create RAGAS dataset
eval_dataset = Dataset.from_dict({
    "question": questions,
    "ground_truth": ground_truths,
    "answer": answers,
    "contexts": contexts,
})

# Run evaluation
results = evaluate(
    dataset=eval_dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
    ],
)

print(results)

Deep Dive: Faithfulness

Faithfulness measures whether the generated answer is grounded in the retrieved context, essentially detecting hallucinations.

metrics/faithfulness_example.py

from ragas.metrics import faithfulness

# Example of faithful response
faithful_example = {
    "question": "What is the dimension of MiniLM embeddings?",
    "contexts": [["MiniLM-L6-v2 produces 384-dimensional embeddings."]],
    "answer": "MiniLM embeddings have 384 dimensions.",
}

# Example of unfaithful (hallucinated) response
unfaithful_example = {
    "question": "What is the dimension of MiniLM embeddings?",
    "contexts": [["MiniLM-L6-v2 produces 384-dimensional embeddings."]],
    "answer": "MiniLM embeddings have 768 dimensions and support GPU acceleration.",
}

How Faithfulness is Calculated

The formula is straightforward:

Faithfulness = (Supported Claims) / (Total Claims)
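
The unfaithful example above makes two claims (768 dimensions, GPU acceleration) and the context supports neither, so its faithfulness is 0/2 = 0. The sketch below illustrates the calculation; it is not the RAGAS implementation, and llm_supports stands in for whatever LLM judge you would use to verify a claim against the context.

def faithfulness_score(claims: list[str], contexts: list[str], llm_supports) -> float:
    """Fraction of the answer's claims that the retrieved context supports.

    `llm_supports(claim, contexts)` is a placeholder judge that returns True
    when the claim can be verified from the context.
    """
    if not claims:
        return 0.0
    supported = sum(1 for claim in claims if llm_supports(claim, contexts))
    return supported / len(claims)


# For the examples above:
#   faithful answer   -> 1 supported claim / 1 total claim   = 1.0
#   unfaithful answer -> 0 supported claims / 2 total claims = 0.0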

Deep Dive: Answer Relevancy

Answer relevancy measures whether the response actually addresses the user's question:

metrics/relevancy_example.py

# Highly relevant answer
relevant = {
    "question": "How do I configure Qdrant for hybrid search?",
    "answer": "To configure Qdrant for hybrid search, create a collection with both dense and sparse vector configs...",
}

# Irrelevant answer (factually correct but doesn't answer the question)
irrelevant = {
    "question": "How do I configure Qdrant for hybrid search?",
    "answer": "Qdrant is a vector database written in Rust that supports HNSW indexing.",
}

The Relevancy Algorithm

RAGAS calculates relevancy by:

  1. Generating hypothetical questions from the answer
  2. Computing semantic similarity between generated and original questions
  3. Averaging the similarity scores

metrics/relevancy_internal.py

def answer_relevancy_score(question: str, answer: str, llm) -> float:
    """Simplified relevancy calculation.

    Note: `llm`, `embed`, and `cosine_sim` are placeholders for your own
    LLM client and embedding helpers; RAGAS wires these up internally.
    """
    # Generate 3 questions that the answer could be responding to
    generated_questions = llm.generate(
        f"Generate 3 questions that this answer responds to:\n{answer}"
    )

    # Compute embeddings
    q_embedding = embed(question)
    gen_embeddings = [embed(q) for q in generated_questions]

    # Average cosine similarity
    similarities = [cosine_sim(q_embedding, ge) for ge in gen_embeddings]
    return sum(similarities) / len(similarities)

Deep Dive: Context Precision

Context precision evaluates whether the most relevant documents are ranked highest:

metrics/precision_example.py

# Good precision: relevant docs at top
good_ranking = {
    "question": "What is BM25?",
    "contexts": [
        "BM25 is a ranking function for information retrieval.",  # Relevant ✓
        "BM25 uses term frequency and inverse document frequency.",  # Relevant ✓
        "Vector databases store embeddings.",  # Less relevant
    ],
}

# Poor precision: relevant docs buried
poor_ranking = {
    "question": "What is BM25?",
    "contexts": [
        "Qdrant is a vector database.",  # Not relevant
        "FastEmbed generates embeddings.",  # Not relevant
        "BM25 is a ranking function for information retrieval.",  # Relevant but ranked last!  
    ],
}

Why Ranking Matters

Even if you retrieve all relevant documents, poor ranking wastes context window tokens on irrelevant information at the top. This leads to:

  • Lower generation quality
  • Higher token costs
  • Potential hallucinations from irrelevant context

| Position | Precision@K Weight |
|----------|--------------------|
| 1st      | Highest impact     |
| 2nd      | High impact        |
| 3rd      | Medium impact      |
| 4th+     | Decreasing impact  |
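
Conceptually, context precision is an average of precision@k taken at each position that holds a relevant chunk, which is what gives early positions their outsized weight. Here is a small illustrative sketch of that idea (not the exact RAGAS formula, which uses an LLM to judge relevance); relevance is a list of binary judgments for the retrieved chunks in ranked order:

def context_precision(relevance: list[bool]) -> float:
    """Average precision@k over the positions that hold relevant chunks."""
    if not any(relevance):
        return 0.0
    score, hits = 0.0, 0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            score += hits / k  # precision@k at this relevant position
    return score / hits


# Relevant chunks at the top score far better than the same chunks buried low:
print(context_precision([True, True, False]))   # 1.0  (like good_ranking)
print(context_precision([False, False, True]))  # 0.33 (like poor_ranking)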

Deep Dive: Context Recall

Context recall measures whether your retrieval captured all the information needed to answer the question:

metrics/recall_example.py

# Good recall: all needed info retrieved
good_recall = {
    "question": "Compare BM25 and dense embeddings.",
    "ground_truth": "BM25 uses keyword matching while dense embeddings capture semantic similarity.",
    "contexts": [
        "BM25 is a keyword-based ranking algorithm.",  # ✓ Covers BM25
        "Dense embeddings capture semantic meaning.",  # ✓ Covers dense
    ],
}

# Poor recall: missing information
poor_recall = {
    "question": "Compare BM25 and dense embeddings.",
    "ground_truth": "BM25 uses keyword matching while dense embeddings capture semantic similarity.",
    "contexts": [
        "BM25 is a keyword-based ranking algorithm.",  # ✓ Covers BM25
        # Missing any context about dense embeddings!  
    ],
}
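
RAGAS computes recall by breaking the ground-truth answer into statements and checking which of them can be attributed to the retrieved context; recall is the attributed fraction. A minimal sketch of that idea, where attributable is a placeholder for the LLM judge:

def context_recall(gt_statements: list[str], contexts: list[str], attributable) -> float:
    """Fraction of ground-truth statements supported by the retrieved context.

    `attributable(statement, contexts)` is a placeholder judge returning True
    when the statement can be traced back to the context.
    """
    if not gt_statements:
        return 0.0
    covered = sum(1 for s in gt_statements if attributable(s, contexts))
    return covered / len(gt_statements)


# In poor_recall above, the ground truth has two statements (BM25 keyword
# matching, dense embeddings semantic similarity) but only the BM25 one is
# covered by the context, giving a recall of 1/2 = 0.5.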

Building an Evaluation Pipeline

For production systems, automate evaluation with a complete pipeline:

evaluation/pipeline.py

import json
from datetime import datetime
from pathlib import Path

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset


class RAGEvaluator:
    """Automated RAG evaluation pipeline."""

    def __init__(self, output_dir: str = "eval_results"):
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(exist_ok=True)
        self.metrics = [
            faithfulness,
            answer_relevancy,
            context_precision,
            context_recall,
        ]

    def run_evaluation(
        self,
        questions: list[str],
        ground_truths: list[str],
        answers: list[str],
        contexts: list[list[str]],
        run_name: str | None = None,
    ) -> dict:
        """Run full RAGAS evaluation and save results."""
        dataset = Dataset.from_dict({
            "question": questions,
            "ground_truth": ground_truths,
            "answer": answers,
            "contexts": contexts,
        })

        results = evaluate(dataset=dataset, metrics=self.metrics)

        # Save results
        run_name = run_name or datetime.now().strftime("%Y%m%d_%H%M%S")
        self._save_results(results, run_name)

        return results

    def _save_results(self, results: dict, run_name: str) -> None:
        """Persist evaluation results."""
        output_file = self.output_dir / f"{run_name}.json"
        with open(output_file, "w") as f:
            json.dump(
                {
                    "timestamp": datetime.now().isoformat(),
                    "metrics": dict(results),
                },
                f,
                indent=2,
            )

    def compare_runs(self, run1: str, run2: str) -> dict:
        """Compare two evaluation runs."""
        with open(self.output_dir / f"{run1}.json") as f:
            results1 = json.load(f)
        with open(self.output_dir / f"{run2}.json") as f:
            results2 = json.load(f)

        comparison = {}
        for metric in results1["metrics"]:
            diff = results2["metrics"][metric] - results1["metrics"][metric]
            comparison[metric] = {
                "before": results1["metrics"][metric],
                "after": results2["metrics"][metric],
                "change": diff,
                "improved": diff > 0,
            }

        return comparison

Using the Pipeline

evaluation/run_pipeline.py

from evaluation.pipeline import RAGEvaluator
from evaluation.generate import evaluate_rag_system

evaluator = RAGEvaluator()

# Load your test set (see "Creating Test Datasets" below for one way to
# implement these helpers)
questions = load_test_questions()
ground_truths = load_ground_truths()

# Run your RAG system
answers, contexts = evaluate_rag_system(questions)

# Evaluate
results = evaluator.run_evaluation(
    questions=questions,
    ground_truths=ground_truths,
    answers=answers,
    contexts=contexts,
    run_name="baseline_v1",
)

print(f"Faithfulness: {results['faithfulness']:.3f}")
print(f"Relevancy: {results['answer_relevancy']:.3f}")
print(f"Precision: {results['context_precision']:.3f}")
print(f"Recall: {results['context_recall']:.3f}")

Creating Test Datasets
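
The pipeline example above called load_test_questions() and load_ground_truths() without defining them. One simple approach, assuming you keep a versioned test set in a JSON file (the evaluation/testset.py module and file location below are illustrative, not a fixed convention), looks like this:

evaluation/testset.py

import json
from pathlib import Path

# Assumed location of the test set; adjust to wherever you version yours.
TEST_SET_PATH = Path("evaluation/test_set.json")


def load_test_set(path: Path = TEST_SET_PATH) -> list[dict]:
    """Load test cases shaped like {"question": ..., "ground_truth": ...}."""
    with open(path) as f:
        return json.load(f)


def load_test_questions() -> list[str]:
    return [case["question"] for case in load_test_set()]


def load_ground_truths() -> list[str]:
    return [case["ground_truth"] for case in load_test_set()]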

Test Data Best Practices

Interpreting Results

When Metrics Disagree

| Scenario                         | Likely Cause                            | Action                                         |
|----------------------------------|-----------------------------------------|------------------------------------------------|
| High Faithfulness, Low Relevancy | Overly cautious answers                 | Adjust prompt to be more comprehensive         |
| Low Faithfulness, High Relevancy | Hallucinating relevant-sounding content | Add grounding instructions, reduce temperature |
| High Precision, Low Recall       | Not retrieving enough docs              | Increase k, broaden search                     |
| Low Precision, High Recall       | Too many irrelevant docs                | Add reranking, tune similarity threshold       |

Setting Thresholds
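
One practical way to turn the target values from the metrics table into enforcement is a small quality-gate script that fails when any score slips below its threshold. The sketch below uses the targets listed earlier and reads the JSON files written by RAGEvaluator; the module name matches the check_gates step in the CI workflow that follows, but treat both the thresholds and the script as a starting point to calibrate against your own baseline.

evaluation/check_gates.py

import json
import sys
from pathlib import Path

# Targets from the metrics table above; calibrate against your baseline runs.
THRESHOLDS = {
    "faithfulness": 0.9,
    "answer_relevancy": 0.8,
    "context_precision": 0.75,
    "context_recall": 0.8,
}


def check_gates(results_file: Path) -> bool:
    """Return True when every metric meets its threshold."""
    metrics = json.loads(results_file.read_text())["metrics"]
    passed = True
    for name, minimum in THRESHOLDS.items():
        score = metrics.get(name, 0.0)
        status = "PASS" if score >= minimum else "FAIL"
        print(f"{status} {name}: {score:.3f} (threshold {minimum})")
        passed = passed and score >= minimum
    return passed


if __name__ == "__main__":
    # Check the lexically latest results file (timestamped run names sort
    # chronologically); pass an explicit path if you name runs differently.
    latest = max(Path("eval_results").glob("*.json"))
    sys.exit(0 if check_gates(latest) else 1)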

A/B Testing RAG Changes

Use RAGAS to validate improvements before deploying:

evaluation/ab_test.py

from evaluation.pipeline import RAGEvaluator

evaluator = RAGEvaluator()

# Test current system
current_answers, current_contexts = evaluate_rag_system_v1(questions)
current_results = evaluator.run_evaluation(
    questions, ground_truths, current_answers, current_contexts,
    run_name="current",
)

# Test proposed changes
new_answers, new_contexts = evaluate_rag_system_v2(questions)
new_results = evaluator.run_evaluation(
    questions, ground_truths, new_answers, new_contexts,
    run_name="proposed",
)

# Compare
comparison = evaluator.compare_runs("current", "proposed")
for metric, data in comparison.items():
    status = "📈" if data["improved"] else "📉"
    print(f"{status} {metric}: {data['before']:.3f}{data['after']:.3f}")

CI/CD Integration

Integrate RAGAS into your deployment pipeline:

.github/workflows/rag-eval.yml

name: RAG Evaluation

on:
  pull_request:
    paths:
      - "src/rag/**"
      - "prompts/**"

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install dependencies
        run: |
          pip install uv
          uv sync

      - name: Run RAGAS evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          uv run python -m evaluation.run_ragas

      - name: Check quality gates
        run: |
          uv run python -m evaluation.check_gates

Advanced: Custom Metrics

Sometimes the built-in metrics aren't enough. RAGAS also supports custom metrics, but the base-class interface has changed across RAGAS versions, so treat the example below as a sketch of the LLM-as-judge pattern rather than a drop-in implementation:

metrics/custom.py

from ragas.metrics.base import MetricWithLLM
from dataclasses import dataclass


@dataclass
class TechnicalAccuracy(MetricWithLLM):
    """Custom metric for technical documentation accuracy."""

    name: str = "technical_accuracy"

    def _score(self, row: dict) -> float:
        """Score technical accuracy using LLM-as-judge."""
        prompt = f"""
        Question: {row['question']}
        Answer: {row['answer']}
        Ground Truth: {row['ground_truth']}

        Rate the technical accuracy from 0.0 to 1.0.
        Consider: code correctness, API accuracy, version specificity.

        Return only a number.
        """

        response = self.llm.invoke(prompt)
        return float(response.content.strip())


# Use in evaluation
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

custom_metric = TechnicalAccuracy()
results = evaluate(
    dataset=eval_dataset,
    metrics=[faithfulness, answer_relevancy, custom_metric],
)

Conclusion

Evaluating your RAG system isn't optional; it's essential for building reliable AI applications. With RAGAS, you can:

  • Detect hallucinations before users do (Faithfulness)
  • Ensure relevance of generated answers (Answer Relevancy)
  • Optimize retrieval ranking for better context (Context Precision)
  • Verify complete information retrieval (Context Recall)

Integrate evaluation into your development workflow, set quality gates, and continuously monitor production systems. Your users, and your reputation, depend on it.


Need help evaluating your RAG system? Contact us to learn how AstraQ can help you build reliable AI applications.