Arize Phoenix traces LLM applications and runs automated evaluations. Install: pip install arize-phoenix openinference-instrumentation-anthropic. Start: import phoenix as px; px.launch_app() — opens the UI at http://localhost:6006. Instrument Anthropic: from openinference.instrumentation.anthropic import AnthropicInstrumentor; AnthropicInstrumentor().instrument() — all Anthropic calls are auto-traced. OpenAI: from openinference.instrumentation.openai import OpenAIInstrumentor; OpenAIInstrumentor().instrument(). LangChain: from openinference.instrumentation.langchain import LangChainInstrumentor; LangChainInstrumentor().instrument(). LlamaIndex: LlamaIndexInstrumentor().instrument(). Query traces: client = px.Client(); spans_df = client.get_spans_dataframe(). Evaluate hallucination: from phoenix.evals import HallucinationEvaluator, run_evals; evaluator = HallucinationEvaluator(OpenAIModel(model="gpt-4o")); results = run_evals(dataframe=spans_df, evaluators=[evaluator]). RAG relevance: from phoenix.evals import RelevanceEvaluator. QA correctness: from phoenix.evals import QAEvaluator. Custom eval: from phoenix.evals import llm_classify; template = ClassificationTemplate(rails=["correct","incorrect"], template="Is {response} correct for {question}?"); results = llm_classify(dataframe=df, template=template, model=OpenAIModel()). Datasets: from phoenix.datasets import Dataset; dataset = px.Client().upload_dataset(dataframe=df, dataset_name="rag-eval"). px.active_session() returns the current session handle. session.get_evaluations() lists all logged evals. Tracing integration: from phoenix.trace.openai import OpenAIInstrumentor — compatible with OpenTelemetry exporters. Claude Code generates Phoenix instrumentation, RAG eval pipelines, custom eval templates, and evaluation result analysis.
CLAUDE.md for Arize Phoenix
## Arize Phoenix Stack
- Version: arize-phoenix >= 4.0, openinference-instrumentation-* for auto-trace
- Launch: px.launch_app() → http://localhost:6006 (or connect to remote)
- Instrument: AnthropicInstrumentor/OpenAIInstrumentor/LangChainInstrumentor().instrument()
- Trace spans: SpanKind.LLM | RETRIEVER | CHAIN | EMBEDDING | RERANKER
- Query: px.Client().get_spans_dataframe() → pandas DataFrame with trace data
- Eval: run_evals(dataframe, evaluators=[HallucinationEvaluator, RelevanceEvaluator])
- Custom: llm_classify(df, template=ClassificationTemplate(rails=[...], template=...))
- Upload: px.Client().upload_dataset(df, dataset_name) for persistent eval datasets
Phoenix Tracing and Evaluation
# observability/phoenix_eval.py — LLM tracing and automated evaluation with Phoenix
from __future__ import annotations
import os
import pandas as pd
import phoenix as px
from phoenix.evals import (
HallucinationEvaluator,
QAEvaluator,
RelevanceEvaluator,
OpenAIModel,
run_evals,
llm_classify,
)
from phoenix.evals.templates import ClassificationTemplate, PromptTemplate
from phoenix.trace import SpanKind
# ── 1. Launch Phoenix and configure instrumentation ───────────────────────────
def _try_instrument(module_path: str, class_name: str, label: str, hint: str | None = None) -> None:
    """Best-effort auto-instrumentation of one SDK.

    Imports *module_path*, instantiates *class_name*, and calls
    ``.instrument()``. A missing package is non-fatal: we optionally print
    an install *hint* and continue, so partially-installed environments
    still get traces for whatever SDKs they do have.
    """
    import importlib

    try:
        module = importlib.import_module(module_path)
    except ImportError:
        if hint:
            print(hint)
        return
    getattr(module, class_name)().instrument()
    print(f"{label} instrumentation enabled")


def setup_phoenix(
    project_name: str = "my-rag-app",
    port: int = 6006,
    remote_url: str | None = None,
) -> px.Client:
    """
    Launch Phoenix UI (or connect to a remote collector) and instrument
    all installed LLM SDKs.

    Args:
        project_name: Logical project label (currently unused here;
            reserved for future project-scoped configuration).
        port: Local port for the Phoenix UI when launching locally.
        remote_url: If set, skip the local launch and route spans to a
            remote Phoenix instance (Arize Cloud or self-hosted).

    Returns:
        Phoenix client for querying traces.
    """
    if remote_url:
        # Must be set before px.Client() is constructed below so the client
        # targets the remote collector. (The original re-imported phoenix
        # here redundantly; px is already imported at module level.)
        os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = remote_url
    else:
        # Launch the local Phoenix server and surface its UI URL.
        session = px.launch_app(port=port)
        print(f"Phoenix UI: {session.url}")

    # Auto-instrument every supported SDK that happens to be installed.
    _try_instrument(
        "openinference.instrumentation.anthropic",
        "AnthropicInstrumentor",
        "Anthropic",
        hint="pip install openinference-instrumentation-anthropic",
    )
    _try_instrument("openinference.instrumentation.openai", "OpenAIInstrumentor", "OpenAI")
    _try_instrument("openinference.instrumentation.langchain", "LangChainInstrumentor", "LangChain")
    _try_instrument("openinference.instrumentation.llama_index", "LlamaIndexInstrumentor", "LlamaIndex")

    return px.Client()
# ── 2. Manual span creation ───────────────────────────────────────────────────
def traced_rag_pipeline_with_spans(query: str) -> str:
    """
    Manually trace a toy RAG pipeline with OpenTelemetry spans.

    Emits one root span plus a RETRIEVER and an LLM child span, annotated
    with OpenInference attributes so Phoenix renders them correctly.
    """
    from opentelemetry import trace
    from opentelemetry.trace import SpanKind as OTSpanKind

    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("rag-pipeline", kind=OTSpanKind.INTERNAL) as root:
        root.set_attribute("input.value", query)
        root.set_attribute("session.id", "user-session-123")

        # Retrieval step: fabricate three documents and record them.
        with tracer.start_as_current_span("vector-retrieval", kind=OTSpanKind.CLIENT) as retrieval:
            retrieval.set_attribute("input.value", query)
            retrieval.set_attribute("openinference.span.kind", "RETRIEVER")
            documents = []
            for idx in range(3):
                documents.append({"doc_id": f"d{idx}", "text": f"Relevant doc {idx} for {query}"})
            retrieval.set_attribute("retrieval.documents", str(documents))
            retrieval.set_attribute("output.value", f"Retrieved {len(documents)} docs")

        # Generation step: join context and record a synthetic completion.
        joined_context = " ".join(doc["text"] for doc in documents)
        with tracer.start_as_current_span("llm-completion", kind=OTSpanKind.CLIENT) as generation:
            generation.set_attribute("openinference.span.kind", "LLM")
            generation.set_attribute("llm.model_name", "claude-sonnet-4-6")
            generation.set_attribute("input.value", f"Context: {joined_context}\nQ: {query}")
            answer = f"Answer to '{query}' based on context."
            generation.set_attribute("output.value", answer)
            # Rough token estimates: ~2 tokens per whitespace-separated word.
            generation.set_attribute("llm.token_count.prompt", len(joined_context.split()) * 2)
            generation.set_attribute("llm.token_count.completion", len(answer.split()) * 2)

        root.set_attribute("output.value", answer)
    return answer
# ── 3. Retrieve spans for evaluation ─────────────────────────────────────────
def get_rag_spans_for_eval(client: px.Client) -> pd.DataFrame:
"""
Query traced RAG spans and prepare DataFrame for evaluation.
Expected columns: input, output, reference (retrieved context).
"""
spans_df = client.get_spans_dataframe(
filter_condition="span_kind == 'LLM'",
start_time=pd.Timestamp.now() - pd.Timedelta(hours=24),
)
if spans_df.empty:
# Synthetic data if no traces yet
spans_df = pd.DataFrame({
"input": ["What is RAG?", "Explain embeddings", "How does attention work?"],
"output": [
"RAG combines retrieval with generation for grounded responses.",
"Embeddings are dense vector representations of text.",
"Attention weighs token importance within a sequence.",
],
"reference": [
"Retrieval-Augmented Generation (RAG) retrieves relevant documents...",
"Word embeddings map words to high-dimensional vectors...",
"The attention mechanism assigns weights to each token...",
],
"context": [
"Document: RAG overview. RAG combines information retrieval with LLMs...",
"Document: ML glossary. Embeddings are numerical representations...",
"Document: Transformer paper. Attention is all you need...",
],
})
return spans_df
# ── 4. Automated evaluation ────────────────────────────────────────────────────
def run_rag_evaluations(
    spans_df: pd.DataFrame,
    eval_model_name: str = "gpt-4o-mini",
) -> dict[str, pd.DataFrame]:
    """
    Run multiple automated evaluators on RAG traces.

    Args:
        spans_df: Trace rows with the columns each evaluator expects
            (input / output / reference — TODO confirm against the
            installed phoenix.evals version).
        eval_model_name: OpenAI model used as the LLM judge.

    Returns:
        Dict of evaluator_name -> per-row results DataFrame. An evaluator
        that raises is logged and skipped rather than aborting the run.
    """
    eval_model = OpenAIModel(
        model=eval_model_name,
        api_key=os.environ.get("OPENAI_API_KEY", ""),
    )
    evaluators = {
        "hallucination": HallucinationEvaluator(eval_model),
        "relevance": RelevanceEvaluator(eval_model),
        "qa_correctness": QAEvaluator(eval_model),
    }
    results: dict[str, pd.DataFrame] = {}
    for name, evaluator in evaluators.items():
        print(f"Running {name} evaluation...")
        try:
            # BUG FIX: run_evals returns a *list* of DataFrames, one per
            # evaluator passed in. The original stored the list itself, so
            # the .columns access below raised AttributeError and every
            # evaluator was reported as "failed". Unpack the single result.
            [result_df] = run_evals(
                dataframe=spans_df,
                evaluators=[evaluator],
                provide_explanation=True,
            )
            results[name] = result_df
            score_cols = [c for c in result_df.columns if "score" in c.lower()]
            if score_cols:
                avg = result_df[score_cols[0]].mean()
                print(f" {name}: avg_score={avg:.3f} over {len(result_df)} samples")
        except Exception as e:
            # Best-effort: one failing judge should not kill the other evals.
            print(f" {name} failed: {e}")
    return results
# ── 5. Custom LLM-as-judge evaluation ────────────────────────────────────────
def custom_conciseness_eval(
    spans_df: pd.DataFrame,
    eval_model_name: str = "gpt-4o-mini",
) -> pd.DataFrame:
    """Custom LLM-as-judge evaluation: is each response concise (under 50 words)?"""
    labels = ["concise", "verbose"]
    judge = OpenAIModel(model=eval_model_name)
    prompt = (
        "You are evaluating response conciseness.\n\n"
        "Question: {input}\n"
        "Response: {output}\n\n"
        "Is this response concise (under 50 words) or verbose (over 50 words)?\n"
        "Answer:"
    )
    classification_template = ClassificationTemplate(
        rails=labels,
        template=prompt,
        explanation_template="Explain in one sentence why this response is {label}.",
    )
    return llm_classify(
        dataframe=spans_df,
        template=classification_template,
        model=judge,
        rails=labels,
        provide_explanation=True,
    )
# ── 6. Upload evaluation datasets ────────────────────────────────────────────
def create_eval_dataset(
client: px.Client,
dataset_name: str = "rag-gold-standard",
) -> None:
"""Upload a golden dataset for repeatable benchmark evaluation."""
gold_df = pd.DataFrame({
"input": ["What is attention?", "Define RLHF"],
"expected_output":["A mechanism that weighs...", "RLHF trains models..."],
"context": ["Attention paper abstract...", "InstructGPT paper..."],
})
dataset = client.upload_dataset(
dataframe=gold_df,
dataset_name=dataset_name,
input_keys=["input", "context"],
output_keys=["expected_output"],
)
print(f"Dataset uploaded: {dataset_name} ({len(gold_df)} examples)")
# ── Main pipeline ─────────────────────────────────────────────────────────────
if __name__ == "__main__":
    # Boot Phoenix, trace a couple of requests, then evaluate the traces.
    phoenix_client = setup_phoenix()

    queries = ["What is a transformer?", "Explain gradient descent"]
    for q in queries:
        answer = traced_rag_pipeline_with_spans(q)
        print(f"Q: {q}\nA: {answer[:80]}...\n")

    eval_frame = get_rag_spans_for_eval(phoenix_client)
    for eval_name, result in run_rag_evaluations(eval_frame).items():
        print(f"\n{eval_name} results:\n{result.head(3)}")
Langfuse is the alternative when you need cost tracking, prompt versioning, and team-collaboration features baked into the same observability tool — Langfuse handles cost/budget management tightly, while Phoenix's automated LLM-as-judge evaluators (HallucinationEvaluator, RelevanceEvaluator, QAEvaluator) and built-in RAG evaluation metrics make it the stronger choice for teams running systematic quality-evaluation pipelines on retrieval-augmented generation applications. A custom LLM evaluation harness is the alternative when you want evaluation logic written from scratch in Python with pytest or a bespoke framework — a custom harness gives full control, but Phoenix's pre-built evaluators cover the most common RAG failure modes (hallucination, context relevance, answer faithfulness, QA correctness) with OpenAI-compatible LLM-as-judge templates, saving several days of implementation work. The Claude Skills 360 bundle includes Arize Phoenix skill sets covering auto-instrumentation, manual span tracing, RAG evaluation, custom LLM-as-judge templates, dataset uploads, and evaluation result analysis. Start with the free tier to try LLM evaluation pipeline generation.