Building AI-powered features into applications is now a standard engineering task — but the patterns are still evolving. Streaming tokens, function calling, embedding pipelines, retrieval-augmented generation, and multi-step agents all have sharp edges. Claude Code helps here because it knows the current SDK patterns and generates code that handles the non-obvious parts: token counting, retry logic, context window management, and tool use schemas.
This guide covers building LLM integrations with Claude Code: Anthropic/OpenAI API setup, streaming, RAG pipelines, function calling, and agentic workflows.
Setting Up the Claude API
Add Claude API integration to this project.
I need: streaming responses, token counting, and a retry wrapper
for rate limits. Use the latest Anthropic SDK.
import Anthropic from '@anthropic-ai/sdk';
const client = new Anthropic({
apiKey: process.env.ANTHROPIC_API_KEY,
maxRetries: 3, // Auto-retry 429 and 529 with exponential backoff
timeout: 60_000, // 60s per request
});
// Non-streaming — returns full response
async function callClaude(
prompt: string,
options?: {
systemPrompt?: string;
model?: string;
maxTokens?: number;
temperature?: number;
}
): Promise<{ content: string; inputTokens: number; outputTokens: number }> {
const response = await client.messages.create({
model: options?.model ?? 'claude-sonnet-4-6',
max_tokens: options?.maxTokens ?? 1024,
temperature: options?.temperature ?? 0,
system: options?.systemPrompt,
messages: [{ role: 'user', content: prompt }],
});
  const content = response.content
    .filter((block): block is Anthropic.TextBlock => block.type === 'text')
    .map(block => block.text)
    .join('');
return {
content,
inputTokens: response.usage.input_tokens,
outputTokens: response.usage.output_tokens,
};
}
// Token counting (before sending — avoid unexpected costs)
async function countTokens(messages: Anthropic.MessageParam[], systemPrompt?: string): Promise<number> {
const response = await client.messages.countTokens({
model: 'claude-sonnet-4-6',
system: systemPrompt,
messages,
});
return response.input_tokens;
}
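The SDK's maxRetries already covers transient API errors, but sometimes you want an app-level wrapper — for example, around a whole multi-call operation. A minimal sketch (the name withRetry and its options are assumptions, not part of the Anthropic SDK):

```typescript
// App-level retry with exponential backoff and jitter.
// Use sparingly on top of the SDK's built-in retries to avoid retry storms.
async function withRetry<T>(
  fn: () => Promise<T>,
  opts: { retries?: number; baseDelayMs?: number } = {},
): Promise<T> {
  const { retries = 3, baseDelayMs = 500 } = opts;
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (attempt === retries) break;
      // Exponential backoff with jitter: ~500ms, ~1s, ~2s, ...
      const delay = baseDelayMs * 2 ** attempt * (0.5 + Math.random() / 2);
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```

Usage: `const result = await withRetry(() => callClaude(prompt));` — the final error is rethrown once the attempts are exhausted.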
CLAUDE.md for AI-integrated projects
## AI Integration
- LLM provider: Anthropic Claude (primary), OpenAI (fallback for embeddings)
- Model: claude-sonnet-4-6 for most tasks, claude-haiku-4-5-20251001 for classification/routing
- Max context: 200k tokens — but keep prompts under 10k for latency
- All AI calls go through src/lib/ai/client.ts — never instantiate directly
- Log all LLM calls with token counts to the ai_usage table
- Never include PII in prompts — anonymize user data before sending
- Rate limit: 10 AI requests per user per minute
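The "10 AI requests per user per minute" rule above can be enforced with a sliding-window limiter. A minimal in-memory sketch — the function name and the Map-backed store are assumptions; a multi-instance deployment would back this with Redis or similar:

```typescript
const WINDOW_MS = 60_000;
const MAX_REQUESTS = 10;
const requestLog = new Map<string, number[]>(); // userId -> request timestamps

function checkAiRateLimit(userId: string, now = Date.now()): boolean {
  // Keep only timestamps inside the current window
  const timestamps = (requestLog.get(userId) ?? []).filter(t => now - t < WINDOW_MS);
  if (timestamps.length >= MAX_REQUESTS) {
    requestLog.set(userId, timestamps);
    return false; // Over the limit — reject before calling the LLM
  }
  timestamps.push(now);
  requestLog.set(userId, timestamps);
  return true;
}
```

The check runs before the API call, so rejected requests cost nothing in tokens.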
Streaming Responses
The chat feature needs streaming — show tokens as they arrive,
not all at once after 10 seconds. Send to the frontend via SSE.
// Backend — Next.js route handler
import { NextRequest } from 'next/server';
import Anthropic from '@anthropic-ai/sdk';
const client = new Anthropic();
export async function POST(request: NextRequest) {
const { message, conversationHistory } = await request.json();
// Return a readable stream
const encoder = new TextEncoder();
const stream = new ReadableStream({
async start(controller) {
try {
const anthropicStream = await client.messages.stream({
model: 'claude-sonnet-4-6',
max_tokens: 2048,
system: 'You are a helpful assistant.',
messages: [
...conversationHistory,
{ role: 'user', content: message },
],
});
for await (const event of anthropicStream) {
if (event.type === 'content_block_delta' && event.delta.type === 'text_delta') {
// SSE format: "data: {text}\n\n"
controller.enqueue(encoder.encode(`data: ${JSON.stringify({ text: event.delta.text })}\n\n`));
}
}
// Signal completion
controller.enqueue(encoder.encode('data: [DONE]\n\n'));
controller.close();
} catch (error) {
controller.enqueue(encoder.encode(`data: ${JSON.stringify({ error: 'Stream failed' })}\n\n`));
controller.close();
}
},
});
return new Response(stream, {
headers: {
'Content-Type': 'text/event-stream',
'Cache-Control': 'no-cache',
'Connection': 'keep-alive',
},
});
}
// Frontend — React hook
import { useCallback, useState } from 'react';

type Message = { role: 'user' | 'assistant'; content: string };

function useStreamingChat() {
  const [response, setResponse] = useState('');
  const [isStreaming, setIsStreaming] = useState(false);
  const sendMessage = useCallback(async (message: string, history: Message[]) => {
    setResponse('');
    setIsStreaming(true);
    const res = await fetch('/api/chat', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ message, conversationHistory: history }),
    });
    const reader = res.body!.getReader();
    const decoder = new TextDecoder();
    let buffer = ''; // SSE events can be split across network chunks
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      buffer += decoder.decode(value, { stream: true });
      const events = buffer.split('\n\n');
      buffer = events.pop() ?? ''; // Hold back the trailing partial event
      for (const line of events.filter(e => e.startsWith('data: '))) {
        const data = line.slice(6); // Remove "data: "
        if (data === '[DONE]') { setIsStreaming(false); return; }
        const parsed = JSON.parse(data);
        if (parsed.text) setResponse(prev => prev + parsed.text);
      }
    }
setIsStreaming(false);
}, []);
return { response, isStreaming, sendMessage };
}
Retrieval-Augmented Generation (RAG)
Build a RAG pipeline for our documentation.
Users ask questions, the system retrieves relevant docs,
and Claude answers using only that context.
Use pgvector for embeddings storage.
import OpenAI from 'openai'; // Using OpenAI for embeddings, Claude for generation
import Anthropic from '@anthropic-ai/sdk';
import { db } from './db'; // Postgres with pgvector extension

const openai = new OpenAI();
const claude = new Anthropic();
// Step 1: Embed documents during indexing
async function embedDocument(chunk: DocumentChunk): Promise<void> {
const embedding = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: chunk.content,
});
await db.query(
`INSERT INTO document_embeddings (id, content, source_url, embedding, metadata)
VALUES ($1, $2, $3, $4::vector, $5)
ON CONFLICT (id) DO UPDATE SET content = $2, embedding = $4::vector`,
[chunk.id, chunk.content, chunk.sourceUrl, JSON.stringify(embedding.data[0].embedding), chunk.metadata],
);
}
// Step 2: Retrieve relevant chunks for a query.
// Rows come back with snake_case column names, so type them accordingly.
interface RetrievedChunk {
  id: string;
  content: string;
  source_url: string;
  metadata: Record<string, unknown>;
  similarity: number;
}

async function retrieveContext(query: string, topK = 5): Promise<RetrievedChunk[]> {
  const queryEmbedding = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: query,
  });
  const results = await db.query(
    `SELECT id, content, source_url, metadata,
            1 - (embedding <=> $1::vector) AS similarity
     FROM document_embeddings
     WHERE 1 - (embedding <=> $1::vector) > 0.7 -- Minimum similarity threshold
     ORDER BY similarity DESC
     LIMIT $2`,
    [JSON.stringify(queryEmbedding.data[0].embedding), topK],
  );
  return results.rows;
}
// Step 3: Generate answer using retrieved context
async function answerWithRAG(question: string): Promise<{ answer: string; sources: string[] }> {
const contextChunks = await retrieveContext(question);
if (contextChunks.length === 0) {
return {
answer: "I couldn't find relevant documentation for that question.",
sources: [],
};
}
const contextText = contextChunks
.map((chunk, i) => `[Source ${i + 1}: ${chunk.source_url}]\n${chunk.content}`)
.join('\n\n---\n\n');
const response = await claude.messages.create({
model: 'claude-sonnet-4-6',
max_tokens: 1024,
system: `You are a documentation assistant. Answer questions using ONLY the provided context.
If the context doesn't contain enough information, say so — don't make up answers.
Always cite which source(s) you used.`,
messages: [{
role: 'user',
content: `Context:\n${contextText}\n\nQuestion: ${question}`,
}],
});
const answer = response.content[0].type === 'text' ? response.content[0].text : '';
const sources = contextChunks.map(c => c.source_url);
return { answer, sources };
}
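Step 1 assumes DocumentChunk values already exist, so the pipeline also needs a chunking step. A minimal character-based sketch with overlap — the chunk size, overlap, and function name are assumptions; production pipelines often split on headings or sentence boundaries instead:

```typescript
// Split a document into overlapping chunks for embedding.
// Overlap preserves context that straddles chunk boundaries.
function chunkText(text: string, chunkSize = 2000, overlap = 200): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
    start += chunkSize - overlap; // Step forward, re-covering the overlap
  }
  return chunks;
}
```

Each returned string becomes one DocumentChunk's content, with the source URL and metadata attached at indexing time.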
Function Calling / Tool Use
I want Claude to be able to look up real-time product inventory
and create support tickets from a conversation.
Add tool use to the chat API.
const tools: Anthropic.Tool[] = [
{
name: 'check_inventory',
description: 'Check current inventory level for a product SKU',
input_schema: {
type: 'object',
properties: {
sku: {
type: 'string',
description: 'Product SKU to check (e.g., "WIDGET-RED-L")',
},
},
required: ['sku'],
},
},
{
name: 'create_support_ticket',
description: 'Create a support ticket when a customer issue cannot be resolved in chat',
input_schema: {
type: 'object',
properties: {
issue_summary: { type: 'string', description: 'Brief description of the issue' },
priority: { type: 'string', enum: ['low', 'medium', 'high', 'urgent'] },
customer_id: { type: 'string' },
},
required: ['issue_summary', 'priority', 'customer_id'],
},
},
];
async function chatWithTools(messages: Anthropic.MessageParam[], customerId: string): Promise<string> {
const response = await client.messages.create({
model: 'claude-sonnet-4-6',
max_tokens: 1024,
tools,
system: `You are a customer support agent for Acme Store.
Help customers with inventory questions and issues.
Create a support ticket if you cannot resolve an issue directly.`,
messages,
});
// Check if Claude wants to use a tool
if (response.stop_reason === 'tool_use') {
const toolUseBlocks = response.content.filter(b => b.type === 'tool_use');
const toolResults: Anthropic.ToolResultBlockParam[] = [];
for (const toolUse of toolUseBlocks) {
if (toolUse.type !== 'tool_use') continue;
let result: string;
if (toolUse.name === 'check_inventory') {
const inventory = await getInventory((toolUse.input as any).sku);
result = JSON.stringify(inventory);
} else if (toolUse.name === 'create_support_ticket') {
const input = toolUse.input as any;
const ticket = await createTicket({
summary: input.issue_summary,
priority: input.priority,
customerId,
});
result = JSON.stringify({ ticketId: ticket.id, status: 'created' });
} else {
result = JSON.stringify({ error: 'Unknown tool' });
}
toolResults.push({
type: 'tool_result',
tool_use_id: toolUse.id,
content: result,
});
}
// Continue the conversation with tool results
return chatWithTools([
...messages,
{ role: 'assistant', content: response.content },
{ role: 'user', content: toolResults },
], customerId);
}
// No more tool calls — return final text
  return response.content
    .filter((b): b is Anthropic.TextBlock => b.type === 'text')
    .map(b => b.text)
    .join('');
}
Prompt Caching (Cost Reduction)
We send the same large system prompt on every API call.
It's 5,000 tokens. How do I cache it to reduce costs?
// Anthropic supports prompt caching for repeated content
const response = await client.messages.create({
model: 'claude-sonnet-4-6',
max_tokens: 1024,
system: [
{
type: 'text',
text: LARGE_SYSTEM_PROMPT, // 5,000 tokens
cache_control: { type: 'ephemeral' }, // Cache for 5 minutes
},
],
messages: [{ role: 'user', content: userMessage }],
});
// After the first call, this 5,000 token prompt is cached.
// Subsequent calls within 5 minutes:
// - Input cost: only the new user message tokens
// - Cache read cost: ~10% of normal input cost for the cached portion
Prompt caching is especially valuable for RAG (cache the retrieved context across follow-up questions) and for multi-turn conversations with a long system prompt.
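For the RAG case, that means putting cache_control on the retrieved context block itself so follow-up questions re-read it from cache. A sketch of the request shape — buildRagRequest is an assumed helper name, not part of the SDK:

```typescript
// Build Messages API params with the large, stable context cached
// and the per-turn question left uncached.
function buildRagRequest(contextText: string, question: string) {
  return {
    model: 'claude-sonnet-4-6',
    max_tokens: 1024,
    messages: [
      {
        role: 'user' as const,
        content: [
          {
            type: 'text' as const,
            text: `Context:\n${contextText}`,
            cache_control: { type: 'ephemeral' as const }, // Cache up to here
          },
          { type: 'text' as const, text: `Question: ${question}` },
        ],
      },
    ],
  };
}
```

The cache prefix must be byte-identical across calls, so keep the context block stable and put anything that varies per turn after the cache_control marker.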
Building a Simple Agent Loop
Build an agent that can search the web, read URLs, and
summarize research on a given topic. It should keep working
until it has enough information to write a comprehensive answer.
const agentTools: Anthropic.Tool[] = [
{
name: 'search_web',
description: 'Search the web for information on a topic',
input_schema: {
type: 'object',
properties: {
query: { type: 'string' },
num_results: { type: 'number', description: 'Number of results to return (1-10)' },
},
required: ['query'],
},
},
{
name: 'read_url',
description: 'Fetch and read the content of a URL',
input_schema: {
type: 'object',
properties: { url: { type: 'string' } },
required: ['url'],
},
},
];
async function runResearchAgent(topic: string): Promise<string> {
const messages: Anthropic.MessageParam[] = [
{
role: 'user',
content: `Research this topic thoroughly and write a comprehensive summary: ${topic}`,
},
];
const MAX_ITERATIONS = 10; // Prevent infinite loops
for (let i = 0; i < MAX_ITERATIONS; i++) {
const response = await client.messages.create({
model: 'claude-sonnet-4-6',
max_tokens: 4096,
tools: agentTools,
messages,
});
// Add assistant response to history
messages.push({ role: 'assistant', content: response.content });
// If no more tool calls, we have the final answer
if (response.stop_reason === 'end_turn') {
      return response.content
        .filter((b): b is Anthropic.TextBlock => b.type === 'text')
        .map(b => b.text)
        .join('');
}
// Execute tools and add results
if (response.stop_reason === 'tool_use') {
const toolResults: Anthropic.ToolResultBlockParam[] = [];
for (const block of response.content) {
if (block.type !== 'tool_use') continue;
const result = block.name === 'search_web'
? await searchWeb((block.input as any).query, (block.input as any).num_results)
: await fetchUrl((block.input as any).url);
toolResults.push({
type: 'tool_result',
tool_use_id: block.id,
content: JSON.stringify(result),
});
}
messages.push({ role: 'user', content: toolResults });
}
}
throw new Error(`Agent exceeded max iterations for topic: ${topic}`);
}
The agent loop is simple: send messages → get response → if tool calls, execute them and add results → repeat until the model returns with end_turn. The MAX_ITERATIONS guard prevents runaway loops from bugs in tool execution or model behavior.
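Long-running loops have a second failure mode: tool results pile up until the history exceeds the context window. A minimal truncation sketch that keeps the original task plus the newest turns — the 4-chars-per-token estimate and the function name are assumptions:

```typescript
type Msg = { role: 'user' | 'assistant'; content: unknown };

// Rough token estimate; a real implementation would use the count-tokens API
function estimateTokens(msg: Msg): number {
  return Math.ceil(JSON.stringify(msg.content).length / 4);
}

// Keep the first message (the task) and as many recent messages as fit
function truncateHistory(messages: Msg[], budgetTokens: number): Msg[] {
  if (messages.length === 0) return messages;
  const [first, ...rest] = messages;
  let total = estimateTokens(first);
  const kept: Msg[] = [];
  for (let i = rest.length - 1; i >= 0; i--) {
    const cost = estimateTokens(rest[i]);
    if (total + cost > budgetTokens) break;
    total += cost;
    kept.unshift(rest[i]);
  }
  return [first, ...kept];
}
```

One caveat: dropping an assistant message containing tool_use blocks without its matching tool_result message (or vice versa) will cause an API error, so real truncation should drop whole assistant/user pairs.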
Structured Output
I need Claude to always return JSON matching a specific schema.
The output feeds into a database insert — it must be valid.
// Use tool calling as structured output mechanism
const extractionTool: Anthropic.Tool = {
name: 'extract_product_data',
description: 'Extract structured product information from the input text',
input_schema: {
type: 'object',
properties: {
name: { type: 'string' },
price_cents: { type: 'integer' },
category: { type: 'string', enum: ['electronics', 'clothing', 'food', 'furniture'] },
in_stock: { type: 'boolean' },
features: { type: 'array', items: { type: 'string' } },
},
required: ['name', 'price_cents', 'category', 'in_stock'],
},
};
async function extractProductData(rawText: string): Promise<Product> {
const response = await client.messages.create({
model: 'claude-haiku-4-5-20251001', // Haiku is fast and cheap for extraction
max_tokens: 500,
tools: [extractionTool],
tool_choice: { type: 'tool', name: 'extract_product_data' }, // Force tool use
messages: [{ role: 'user', content: rawText }],
});
const toolUse = response.content.find(b => b.type === 'tool_use');
if (!toolUse || toolUse.type !== 'tool_use') {
throw new Error('Model did not call the extraction tool');
}
return toolUse.input as Product;
}
tool_choice: { type: 'tool', name: '...' } forces Claude to call the specified tool, which reliably yields JSON matching the schema — far more dependable than asking for JSON in the prompt.
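Even so, it's cheap insurance to validate at runtime before the database insert. A minimal hand-rolled guard sketch mirroring the input_schema above (a real codebase might use a library like zod instead; the Product interface here is an assumption, since the original listing doesn't define it):

```typescript
interface Product {
  name: string;
  price_cents: number;
  category: 'electronics' | 'clothing' | 'food' | 'furniture';
  in_stock: boolean;
  features?: string[];
}

const CATEGORIES = ['electronics', 'clothing', 'food', 'furniture'];

// Type guard: checks every required field and the category enum
function isValidProduct(input: unknown): input is Product {
  if (typeof input !== 'object' || input === null) return false;
  const p = input as Record<string, unknown>;
  return (
    typeof p.name === 'string' &&
    Number.isInteger(p.price_cents) &&
    typeof p.category === 'string' &&
    CATEGORIES.includes(p.category) &&
    typeof p.in_stock === 'boolean' &&
    (p.features === undefined ||
      (Array.isArray(p.features) && p.features.every(f => typeof f === 'string')))
  );
}
```

In extractProductData, replace the bare `toolUse.input as Product` cast with a check that throws (or retries) when `isValidProduct(toolUse.input)` is false.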
For observability into your LLM calls — tracking latency, token costs, and error rates across models — see the observability guide. For building full autonomous agents with multi-step reasoning, see the building autonomous agents guide. The Claude Skills 360 bundle includes AI integration skill sets for RAG pipeline patterns, prompt templates, and agent architectures. Start with the free tier to try LLM integration code generation.