RAGAS evaluates RAG pipeline quality with reference-free metrics. Install with `pip install ragas`; import `evaluate` and `EvaluationDataset` from `ragas`, and the metrics (`faithfulness`, `answer_relevancy`, `context_precision`, `context_recall`, `answer_correctness`) from `ragas.metrics`. Core metrics: `faithfulness` measures whether the answer is grounded in the retrieved context (an LLM decomposes the answer into claims and checks each against the context); `answer_relevancy` measures whether the answer is relevant to the question (embeds the answer and reverse-engineered questions); `context_precision` measures whether retrieved context is ranked appropriately (useful chunks ranked higher); `context_recall` measures whether every gold-answer sentence can be attributed to the retrieved context (requires ground truth); `answer_correctness` combines factual and semantic similarity against the ground-truth answer. Dataset: build samples such as `SingleTurnSample(user_input="What is RAG?", response="RAG combines retrieval...", retrieved_contexts=["RAG paper text..."], reference="Retrieval-Augmented Generation...")` (from `ragas.dataset_schema`) and collect them with `dataset = EvaluationDataset(samples=[sample1, sample2, ...])`. Evaluate: wrap the judge LLM, e.g. `llm = LangchainLLMWrapper(ChatAnthropic(model="claude-sonnet-4-6"))`, then `result = evaluate(dataset=dataset, metrics=[faithfulness, answer_relevancy, context_precision], llm=llm)`; `result.to_pandas()` returns a DataFrame. Testset generation: `TestsetGenerator.from_langchain(...)` plus `generate_with_langchain_docs(documents, ...)` synthesizes QA pairs from your corpus. Async: class-based metric instances expose `await metric.single_turn_ascore(sample)` for concurrent scoring. Pass `RunConfig(timeout=60, max_retries=3, max_wait=120)` to handle API rate limits. Claude Code generates RAGAS evaluation pipelines, testset generators, metric configs, and result analysis scripts for RAG applications.
# CLAUDE.md for RAGAS
## RAGAS Stack
- Version: ragas >= 0.2
- Core: from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
- Sample: SingleTurnSample(user_input, response, retrieved_contexts, reference?)
- Dataset: EvaluationDataset(samples=[...])
- Eval: evaluate(dataset, metrics=[...], llm=LangchainLLMWrapper(ChatAnthropic(...)))
- Testset: TestsetGenerator.from_langchain(llm, embeddings) → generate_with_langchain_docs(docs, testset_size=N); ragas 0.2 dropped the separate critic_llm
- Reference-free: faithfulness, answer_relevancy (no ground truth needed)
- Reference-required: context_recall, answer_correctness (need gold answers)
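The faithfulness metric above reduces to a supported-claims ratio. A toy sketch of that arithmetic (not RAGAS internals — the real metric uses an LLM judge for both claim extraction and verification):

```python
def toy_faithfulness(claims_supported: list[bool]) -> float:
    """Faithfulness = supported claims / total claims.

    RAGAS decomposes the answer into claims with an LLM and checks each claim
    against the retrieved context; this toy version assumes that judgment step
    already happened and only computes the final ratio.
    """
    if not claims_supported:
        return 0.0
    return sum(claims_supported) / len(claims_supported)

# Three claims, two grounded in context -> score 2/3
score = toy_faithfulness([True, True, False])
```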
## RAGAS Evaluation Pipeline
# evaluation/ragas_eval.py — comprehensive RAG evaluation with RAGAS
from __future__ import annotations
import pandas as pd
# ── 1. Prepare evaluation dataset ────────────────────────────────────────────
def build_eval_dataset_manual() -> "EvaluationDataset":
"""Build RAGAS evaluation dataset from manually curated QA pairs."""
from ragas.dataset_schema import EvaluationDataset, SingleTurnSample
samples = [
SingleTurnSample(
user_input="What is Retrieval-Augmented Generation?",
response=(
"Retrieval-Augmented Generation (RAG) is a technique that enhances "
"LLM responses by first retrieving relevant documents from a knowledge "
"base and then using them as context for generation."
),
retrieved_contexts=[
"RAG was introduced in the paper 'Retrieval-Augmented Generation for "
"Knowledge-Intensive NLP Tasks'. It combines a retrieval component with "
"a sequence-to-sequence model.",
"The retrieval component uses dense vector search to find relevant "
"documents from a large corpus.",
],
reference=(
"RAG is an AI framework that retrieves relevant information from a "
"knowledge base before generating a response, improving factual accuracy."
),
),
SingleTurnSample(
user_input="What is the difference between sparse and dense retrieval?",
response=(
"Sparse retrieval uses keyword matching (like BM25) while dense retrieval "
"uses neural embeddings to find semantically similar documents."
),
retrieved_contexts=[
"BM25 is a sparse retrieval method based on term frequency and inverse "
"document frequency. It works by exact keyword matching.",
"Dense retrieval uses bi-encoder models to embed queries and documents "
"into the same vector space for semantic similarity search.",
],
reference=(
"Sparse retrieval matches exact terms; dense retrieval uses embedding "
"similarity to find semantically related content."
),
),
# Reference-free sample (no ground truth answer)
SingleTurnSample(
user_input="How does chunking affect RAG performance?",
response="Smaller chunks improve precision but reduce context; larger chunks retain more context but may introduce noise.",
retrieved_contexts=[
"Document chunking strategies significantly impact RAG quality. "
"Fixed-size chunks are simple but may split semantically related text. "
"Semantic chunking preserves meaning boundaries."
],
),
]
return EvaluationDataset(samples=samples)
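Before spending judge-LLM tokens, it can pay to sanity-check curated samples. A hypothetical helper operating on plain dicts (structural checks only; names and checks are illustrative):

```python
def validate_sample_dict(sample: dict) -> list[str]:
    """Return a list of structural problems for one QA sample; empty means OK."""
    problems = []
    if not sample.get("user_input", "").strip():
        problems.append("empty user_input")
    if not sample.get("response", "").strip():
        problems.append("empty response")
    if not sample.get("retrieved_contexts"):
        problems.append("no retrieved_contexts")
    if sample.get("reference") is None:
        problems.append("no reference (reference-requiring metrics will be skipped)")
    return problems
```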
# ── 2. Configure LLM evaluator ────────────────────────────────────────────────
def get_evaluator_llm(provider: str = "anthropic"):
"""Build LLM wrapper for RAGAS evaluation."""
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
if provider == "anthropic":
from langchain_anthropic import ChatAnthropic
from langchain_openai import OpenAIEmbeddings # RAGAS needs embeddings for some metrics
llm = LangchainLLMWrapper(ChatAnthropic(model="claude-sonnet-4-6", max_tokens=1024))
embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())
else:
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())
return llm, embeddings
# ── 3. Run evaluation ──────────────────────────────────────────────────────────
def evaluate_rag_pipeline(
dataset_or_path: "EvaluationDataset | str | None" = None,
provider: str = "anthropic",
include_reference_metrics: bool = True,
) -> pd.DataFrame:
"""
Run full RAGAS evaluation suite.
Returns DataFrame with per-sample metric scores.
"""
from ragas import evaluate, RunConfig
from ragas.metrics import (
faithfulness,
answer_relevancy,
context_precision,
context_recall,
answer_correctness,
)
# Build dataset
if dataset_or_path is None:
dataset = build_eval_dataset_manual()
elif isinstance(dataset_or_path, str):
dataset = load_eval_dataset_from_csv(dataset_or_path)
else:
dataset = dataset_or_path
llm, embeddings = get_evaluator_llm(provider)
# Core reference-free metrics (always run)
metrics = [faithfulness, answer_relevancy, context_precision]
# Add reference-requiring metrics if ground truth is available
if include_reference_metrics:
metrics += [context_recall, answer_correctness]
run_config = RunConfig(
timeout=120,
max_retries=3,
max_wait=180,
max_workers=4, # Parallel evaluation calls
)
print(f"Evaluating {len(dataset)} samples with {len(metrics)} metrics...")
result = evaluate(
dataset=dataset,
metrics=metrics,
llm=llm,
embeddings=embeddings,
run_config=run_config,
show_progress=True,
)
scores_df = result.to_pandas()
print("\n=== RAGAS Evaluation Results ===")
mean_scores = scores_df[[m.name for m in metrics if m.name in scores_df.columns]].mean()
for metric, score in mean_scores.items():
print(f" {metric:<25}: {score:.3f}")
return scores_df
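Mean scores like these are commonly wired into CI as a quality gate. A stdlib sketch (the threshold values are illustrative assumptions; tune per application and feed it `scores_df.mean().to_dict()`):

```python
QUALITY_THRESHOLDS = {
    "faithfulness": 0.70,
    "answer_relevancy": 0.70,
    "context_precision": 0.60,
}

def check_quality_gate(mean_scores: dict[str, float],
                       thresholds: dict[str, float] = QUALITY_THRESHOLDS) -> list[str]:
    """Return human-readable failures; an empty list means the gate passes."""
    return [
        f"{metric}: {mean_scores[metric]:.3f} < {floor:.2f}"
        for metric, floor in thresholds.items()
        if metric in mean_scores and mean_scores[metric] < floor
    ]
```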
# ── 4. Load from CSV ──────────────────────────────────────────────────────────
def load_eval_dataset_from_csv(csv_path: str) -> "EvaluationDataset":
"""
Load evaluation dataset from CSV.
Expected columns: user_input, response, retrieved_contexts (pipe-separated), reference (optional)
"""
from ragas.dataset_schema import EvaluationDataset, SingleTurnSample
df = pd.read_csv(csv_path)
samples = []
for _, row in df.iterrows():
        raw_ctx = row.get("retrieved_contexts")
        # pd.notna guards against NaN cells, which would otherwise pass a truthiness check
        contexts = str(raw_ctx).split("|") if pd.notna(raw_ctx) and str(raw_ctx) else []
samples.append(SingleTurnSample(
user_input=str(row["user_input"]),
response=str(row["response"]),
retrieved_contexts=contexts,
reference=str(row["reference"]) if pd.notna(row.get("reference")) else None,
))
return EvaluationDataset(samples=samples)
# ── 5. Testset generation ─────────────────────────────────────────────────────
def generate_testset_from_docs(
doc_paths: list[str],
test_size: int = 20,
output_path: str = "testset.csv",
provider: str = "anthropic",
) -> pd.DataFrame:
"""
Generate synthetic QA testset from local documents using RAGAS TestsetGenerator.
Creates diverse question types: simple, reasoning, multi-context, conditioning.
"""
from langchain_community.document_loaders import TextLoader, PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from ragas.testset import TestsetGenerator
llm, embeddings = get_evaluator_llm(provider)
# Load documents
docs = []
for path in doc_paths:
if path.endswith(".pdf"):
docs.extend(PyPDFLoader(path).load())
else:
docs.extend(TextLoader(path).load())
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
doc_chunks = splitter.split_documents(docs)
print(f"Loaded {len(doc_chunks)} document chunks from {len(doc_paths)} files")
    # ragas 0.2 dropped the separate critic_llm argument: the generator takes one
    # LLM plus an embedding model, and the size kwarg is `testset_size` (the 0.1
    # API differed — check against your installed version).
    generator = TestsetGenerator.from_langchain(
        llm.langchain_llm,
        embeddings.langchain_embeddings,
    )
    testset = generator.generate_with_langchain_docs(doc_chunks, testset_size=test_size)
testset_df = testset.to_pandas()
testset_df.to_csv(output_path, index=False)
print(f"Testset saved: {output_path} ({len(testset_df)} samples)")
    # 0.2 testsets label question types in `synthesizer_name`; 0.1 used `evolution_type`
    type_col = "synthesizer_name" if "synthesizer_name" in testset_df.columns else "evolution_type"
    if type_col in testset_df.columns:
        print(f"Question types:\n{testset_df[type_col].value_counts()}")
return testset_df
# ── 6. Async per-sample scoring ──────────────────────────────────────────────
async def async_evaluate(
    dataset: "EvaluationDataset",
    provider: str = "anthropic",
) -> pd.DataFrame:
    """
    Async evaluation for faster throughput with concurrent API calls.
    Class-based metric instances expose `single_turn_ascore(sample)` for
    per-sample async scoring (shown here for faithfulness).
    """
    import asyncio
    from ragas.metrics import Faithfulness
    llm, _embeddings = get_evaluator_llm(provider)
    metric = Faithfulness(llm=llm)
    # EvaluationDataset stores its SingleTurnSamples in `.samples`
    scores = await asyncio.gather(
        *(metric.single_turn_ascore(sample) for sample in dataset.samples)
    )
    return pd.DataFrame({
        "user_input": [s.user_input for s in dataset.samples],
        "faithfulness": scores,
    })
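Firing every judge call at once can trip provider rate limits. A bounded-concurrency wrapper (stdlib asyncio sketch; `ascore` stands in for any async scorer, such as a metric's `single_turn_ascore`):

```python
import asyncio

async def score_bounded(samples: list, ascore, max_concurrency: int = 4) -> list:
    """Score samples concurrently, never running more than max_concurrency at once."""
    sem = asyncio.Semaphore(max_concurrency)

    async def one(sample):
        async with sem:
            return await ascore(sample)

    # gather preserves input order, so scores line up with samples
    return await asyncio.gather(*(one(s) for s in samples))
```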
if __name__ == "__main__":
scores_df = evaluate_rag_pipeline()
print("\nSample-level scores:")
print(scores_df[["user_input", "faithfulness", "answer_relevancy", "context_precision"]].to_string())
scores_df.to_csv("rag_eval_results.csv", index=False)
print("\nResults saved: rag_eval_results.csv")
Choose the Arize Phoenix alternative when you need combined tracing and evaluation in a single platform with a visual UI for span exploration: Phoenix handles the full observability stack, while RAGAS focuses exclusively on evaluation metrics. That focus makes RAGAS the right choice when you already have distributed tracing (Langfuse, Datadog) and need best-in-class RAG-specific metrics. Choose the LangSmith alternative when you are already using LangChain Hub and want native integration with the LangChain tracing platform: LangSmith traces and evaluates LangChain pipelines natively, while RAGAS provides the most widely cited academic RAG metrics (faithfulness, answer relevancy, context precision/recall), designed to measure retrieval-augmented generation quality regardless of framework. The Claude Skills 360 bundle includes RAGAS skill sets covering dataset preparation, metric configuration, LLM judge setup, testset generation, async evaluation, and result analysis. Start with the free tier to try RAG evaluation code generation.