OpenCLIP provides open-source CLIP models with LAION pretrained weights. pip install open_clip_torch. import open_clip. List pretrained: open_clip.list_pretrained() — shows model+dataset combos. Load: model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="laion2b_s34b_b79k"). Large: "ViT-L-14" with "laion2b_s32b_b82k". EVA: "EVA02-E-14-plus" with "laion2b_s9b_b144k" — highest accuracy. Tokenize: tokenizer = open_clip.get_tokenizer("ViT-B-32"), text_tokens = tokenizer(["a photo of a cat", "a photo of a dog"]). Image: image_tensor = preprocess(PIL_image).unsqueeze(0). Encode: with torch.no_grad(): image_features = model.encode_image(image_tensor), text_features = model.encode_text(text_tokens). Normalize: image_features /= image_features.norm(dim=-1, keepdim=True). Similarity: similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1). Zero-shot class names: classes = ["cat", "dog", "bird"], prompts = [f"a photo of a {c}" for c in classes]. Forward: logits_per_image, logits_per_text = model(image, text). Batch encode: features = model.encode_image(batch_images) — (N, 512) for ViT-B-32. CoCa: model = open_clip.create_model("coca_ViT-L-14", pretrained="mscoco_finetuned_laion2b_s13b_b90k"). CoCa caption: generated = model.generate(image). Retrieval: compute pairwise image-text similarity matrix. Claude Code generates OpenCLIP zero-shot classifiers, image-text retrieval systems, embedding extractors, and CLIP fine-tuning scripts.
CLAUDE.md for OpenCLIP
## OpenCLIP Stack
- Version: open_clip_torch >= 2.24
- Load: create_model_and_transforms(model_name, pretrained=checkpoint_name)
- Models: "ViT-B-32" | "ViT-L-14" | "ViT-H-14" | "EVA02-E-14-plus"
- Tokenize: get_tokenizer(model_name)(list_of_strings) → (N, 77) tokens
- Encode: model.encode_image(images) | model.encode_text(tokens) → L2-normalize after
- Classify: (image_features @ text_features.T).softmax(-1) → class probabilities
- Batch: encode_image(N, 3, H, W) → (N, D) | encode_text(N, 77) → (N, D)
- CoCa: create_model("coca_ViT-L-14", ...) → model.generate(image) for captioning
- Pretrained: list_pretrained() shows all available model+dataset combos
OpenCLIP Vision-Language Pipeline
# vision/openclip_pipeline.py — vision-language embeddings with OpenCLIP
from __future__ import annotations
import os
from pathlib import Path
from typing import Optional
import torch
import torch.nn.functional as F
import numpy as np
from PIL import Image
import open_clip
# ── 1. Model loading ──────────────────────────────────────────────────────────
def load_clip_model(
    model_name: str = "ViT-B-32",
    pretrained: str = "laion2b_s34b_b79k",
    device: str = "cpu",
) -> tuple:
    """
    Create an OpenCLIP model together with its eval transform and tokenizer.

    Recommended combinations:
      - Fast:     ViT-B-32        + laion2b_s34b_b79k
      - Balanced: ViT-L-14        + laion2b_s32b_b82k
      - Best:     EVA02-E-14-plus + laion2b_s9b_b144k (requires more VRAM)

    Returns (model, preprocess, tokenizer); the model is moved to `device`
    and put in eval mode.
    """
    created = open_clip.create_model_and_transforms(model_name, pretrained=pretrained)
    model, _train_preprocess, preprocess = created
    tok = open_clip.get_tokenizer(model_name)
    model = model.to(device).eval()
    print(f"OpenCLIP {model_name} ({pretrained}) ready on {device}")
    return model, preprocess, tok
def list_available_models(filter_str: str | None = None) -> list[tuple]:
    """
    List available pretrained (model, checkpoint) combinations.

    Parameters:
        filter_str: optional case-insensitive substring matched against the
            model name; None (the default) returns every combination.

    Returns:
        List of (model_name, checkpoint_name) tuples from
        open_clip.list_pretrained().
    """
    all_models = open_clip.list_pretrained()
    if filter_str:
        # Lower-case the needle once instead of per list entry.
        needle = filter_str.lower()
        all_models = [(m, d) for m, d in all_models if needle in m.lower()]
    return all_models
# ── 2. Embedding extraction ───────────────────────────────────────────────────
@torch.no_grad()
def embed_images(
    model,
    preprocess,
    images: list,  # list of PIL.Image or file paths
    batch_size: int = 64,
    device: str = "cpu",
    normalize: bool = True,
) -> torch.Tensor:
    """
    Compute image embeddings in mini-batches.

    Strings/paths are opened as RGB via PIL; anything else is passed to
    `preprocess` as-is. Embeddings are L2-normalized when `normalize` is
    True. Returns an (N, embed_dim) tensor on the CPU.
    """
    chunks: list[torch.Tensor] = []
    for start in range(0, len(images), batch_size):
        loaded = [
            Image.open(item).convert("RGB") if isinstance(item, (str, Path)) else item
            for item in images[start : start + batch_size]
        ]
        batch_tensor = torch.stack([preprocess(x) for x in loaded]).to(device)
        embeds = model.encode_image(batch_tensor)
        if normalize:
            embeds = F.normalize(embeds, dim=-1)
        chunks.append(embeds.cpu())
    return torch.cat(chunks)
@torch.no_grad()
def embed_texts(
    model,
    tokenizer,
    texts: list[str],
    batch_size: int = 256,
    device: str = "cpu",
    normalize: bool = True,
) -> torch.Tensor:
    """
    Compute text embeddings in mini-batches.

    Each batch is tokenized, encoded, and (optionally) L2-normalized.
    Returns an (N, embed_dim) tensor on the CPU.
    """
    chunks: list[torch.Tensor] = []
    for start in range(0, len(texts), batch_size):
        token_batch = tokenizer(texts[start : start + batch_size]).to(device)
        embeds = model.encode_text(token_batch)
        if normalize:
            embeds = F.normalize(embeds, dim=-1)
        chunks.append(embeds.cpu())
    return torch.cat(chunks)
# ── 3. Zero-shot image classification ────────────────────────────────────────
# Prompt templates for zero-shot classification; "{}" is filled with a class
# name. build_class_embeddings averages embeddings over several templates
# ("prompt ensembling") to make class prototypes more robust than a single
# phrasing would be.
IMAGENET_TEMPLATES = [
    "a photo of a {}.",
    "a good photo of a {}.",
    "a bad photo of a {}.",
    "a close-up photo of a {}.",
    "a photo of many {}.",
    "an image of a {}.",
    "a {} in a scene.",
]
def build_class_embeddings(
model,
tokenizer,
class_names: list[str],
templates: list[str] = None,
device: str = "cpu",
ensemble: bool = True,
) -> torch.Tensor:
"""
Build class prototype embeddings using prompt templates.
ensemble=True averages over all templates for robustness.
Returns (n_classes, embed_dim).
"""
templates = templates or IMAGENET_TEMPLATES[:3]
with torch.no_grad():
class_embeds = []
for cls_name in class_names:
prompts = [t.format(cls_name) for t in templates]
tokens = tokenizer(prompts).to(device)
features = model.encode_text(tokens) # (n_templates, D)
features = F.normalize(features, dim=-1)
if ensemble:
class_embed = features.mean(dim=0)
class_embed = F.normalize(class_embed, dim=-1)
else:
class_embed = features[0]
class_embeds.append(class_embed)
return torch.stack(class_embeds).cpu() # (n_classes, D)
def zero_shot_classify(
    model,
    preprocess,
    tokenizer,
    images: list,
    class_names: list[str],
    templates: list[str] | None = None,
    device: str = "cpu",
    top_k: int = 5,
) -> list[list[tuple[str, float]]]:
    """
    Zero-shot image classification without any training.

    Builds class prototypes from prompt templates, embeds the images, and
    turns cosine similarities into class probabilities with a softmax
    scaled by 100.0 (CLIP's conventional logit scale).

    Returns, for each image, up to top_k (class_name, probability) pairs
    sorted by descending probability.
    """
    # Fix: annotation was `list[str] = None`; default must be Optional.
    class_embeds = build_class_embeddings(
        model, tokenizer, class_names, templates, device
    ).to(device)  # (C, D)
    image_embeds = embed_images(model, preprocess, images, device=device).to(device)  # (N, D)
    logits = 100.0 * image_embeds @ class_embeds.T  # (N, C)
    probabilities = logits.softmax(dim=-1).cpu()
    k = min(top_k, len(class_names))  # hoisted: invariant across images
    results = []
    for probs in probabilities:
        top_indices = torch.topk(probs, k=k).indices
        results.append(
            [(class_names[i], round(float(probs[i]), 4)) for i in top_indices]
        )
    return results
# ── 4. Image-text retrieval ───────────────────────────────────────────────────
def build_image_index(
    model,
    preprocess,
    image_paths: list[str],
    device: str = "cpu",
) -> tuple[torch.Tensor, list[str]]:
    """Embed a collection of images and return (embeddings, paths) as a searchable index."""
    embeddings = embed_images(model, preprocess, image_paths, device=device)
    print(f"Image index: {len(image_paths)} images, {embeddings.shape[1]}D embeddings")
    return embeddings, image_paths
def search_images_by_text(
    query_text: str,
    image_embeddings: torch.Tensor,  # (N, D) normalized
    image_paths: list[str],
    model,
    tokenizer,
    top_k: int = 10,
    device: str = "cpu",
) -> list[tuple[str, float]]:
    """Rank indexed images by cosine similarity to a text query; return top_k (path, score) pairs."""
    query_vec = embed_texts(model, tokenizer, [query_text], device=device)[0].to(device)
    scores = image_embeddings.to(device) @ query_vec  # (N,)
    k = min(top_k, len(image_paths))
    best = torch.topk(scores, k=k).indices.cpu()
    return [(image_paths[i], round(float(scores[i]), 4)) for i in best]
def search_texts_by_image(
    query_image_path: str,
    text_embeddings: torch.Tensor,  # (N, D) normalized
    texts: list[str],
    model,
    preprocess,
    top_k: int = 5,
    device: str = "cpu",
) -> list[tuple[str, float]]:
    """Rank candidate texts by cosine similarity to an image query; return top_k (text, score) pairs."""
    query_vec = embed_images(model, preprocess, [query_image_path], device=device)[0].to(device)
    scores = text_embeddings.to(device) @ query_vec  # (N,)
    k = min(top_k, len(texts))
    best = torch.topk(scores, k=k).indices.cpu()
    return [(texts[i], round(float(scores[i]), 4)) for i in best]
# ── 5. Image similarity ───────────────────────────────────────────────────────
def find_similar_images_by_image(
    query_path: str,
    gallery_embeddings: torch.Tensor,
    gallery_paths: list[str],
    model,
    preprocess,
    top_k: int = 10,
    device: str = "cpu",
) -> list[tuple[str, float]]:
    """
    Find gallery images visually similar to a query image.

    The query image is embedded once, similarities against the gallery
    embeddings are computed in a single matrix-vector product, and the
    top_k matches are returned as (path, similarity) pairs with scores
    rounded to 4 decimals.

    Fixes over the previous version: removed the dead `if False` branches
    and the bogus `search_images_by_text.__wrapped__` reference, and the
    similarity product / torch.topk are now computed once instead of twice.
    """
    query_embed = embed_images(model, preprocess, [query_path], device=device)[0].to(device)
    similarities = gallery_embeddings.to(device) @ query_embed  # (N,)
    top = torch.topk(similarities, k=min(top_k, len(gallery_paths)))
    return [
        (gallery_paths[i], round(float(s), 4))
        for i, s in zip(top.indices.cpu().tolist(), top.values.cpu().tolist())
    ]
@torch.no_grad()
def compute_pairwise_similarity(
    embeddings_a: torch.Tensor,  # (N, D)
    embeddings_b: torch.Tensor,  # (M, D)
) -> torch.Tensor:
    """Return the (N, M) matrix of dot products between two embedding sets (cosine similarity when both are L2-normalized)."""
    return torch.matmul(embeddings_a, embeddings_b.transpose(0, 1))
# ── 6. CoCa image captioning ──────────────────────────────────────────────────
def load_coca_model(device: str = "cpu") -> tuple:
    """Load CoCa (Contrastive Captioners) for image captioning; returns (model, preprocess, tokenizer)."""
    created = open_clip.create_model_and_transforms(
        "coca_ViT-L-14",
        pretrained="mscoco_finetuned_laion2b_s13b_b90k",
    )
    model, _, preprocess = created
    model = model.to(device).eval()
    # NOTE(review): tokenizer is fetched for "ViT-L-14" rather than
    # "coca_ViT-L-14" — presumably they share a tokenizer; confirm.
    tokenizer = open_clip.get_tokenizer("ViT-L-14")
    print("CoCa ViT-L-14 ready for image captioning")
    return model, preprocess, tokenizer
@torch.no_grad()
def caption_image(
    model,
    preprocess,
    image_path: str,
    top_k: int = 1,
    max_seq_len: int = 30,
    device: str = "cpu",
) -> list[str]:
    """
    Generate captions for an image using CoCa.

    Parameters:
        model: a CoCa model (see load_coca_model) exposing .generate().
        preprocess: image transform returned alongside the model.
        image_path: path to the image file to caption.
        top_k: maximum number of captions to return. Previously accepted
            but ignored; now applied as a slice over the generated
            sequences (identical behavior at the default of 1).
        max_seq_len: passed to model.generate as seq_len.

    Returns decoded caption strings with the <start_of_text>/<end_of_text>
    markers stripped.
    """
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0).to(device)
    generated = model.generate(image, seq_len=max_seq_len)
    # Fix: removed a dead local (a tokenizer was created and never used —
    # open_clip.decode handles detokenization directly).
    captions = [
        open_clip.decode(g).split("<end_of_text>")[0].replace("<start_of_text>", "").strip()
        for g in generated
    ]
    return captions[:top_k]
# ── Demo ──────────────────────────────────────────────────────────────────────
if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess, tokenizer = load_clip_model("ViT-B-32", device=device)

    # Synthetic RGB image so the demo runs without any files on disk.
    pixels = np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8)
    dummy_image = Image.fromarray(pixels)

    # Zero-shot classification over a small animal label set.
    animal_classes = ["cat", "dog", "bird", "fish", "rabbit", "hamster"]
    results = zero_shot_classify(
        model,
        preprocess,
        tokenizer,
        images=[dummy_image],
        class_names=animal_classes,
        device=device,
        top_k=3,
    )
    print("Zero-shot predictions:")
    for cls_name, prob in results[0]:
        print(f" {cls_name:<12}: {prob:.3f}")

    # Text embeddings and their pairwise similarity matrix.
    texts = ["a photo of a cat", "a dog running in a park", "a bird in the sky"]
    text_embeds = embed_texts(model, tokenizer, texts, device=device)
    print(f"\nText embeddings: {text_embeds.shape}")
    sim_matrix = compute_pairwise_similarity(text_embeds, text_embeds)
    print(f"Text-text similarity:\n{sim_matrix.numpy().round(3)}")

    # Show a few checkpoint options for the demo model.
    print("\nAvailable ViT-B-32 checkpoints:")
    for model_n, pretrained_n in list_available_models("ViT-B-32")[:5]:
        print(f" {model_n} + {pretrained_n}")
For the OpenAI CLIP alternative when wanting the original OpenAI paper–matching checkpoints and relying on OpenAI’s official implementation — the original CLIP provides reliable baselines while OpenCLIP’s LAION-trained variants (ViT-L-14 on LAION-2B, EVA02-E-14 on LAION) achieve higher zero-shot accuracy than OpenAI’s public checkpoints on ImageNet and most retrieval benchmarks, and the open training data allows fine-tuning without commercial licensing restrictions. For the BLIP/BLIP-2 alternative when needing visual question answering, image captioning, and grounded understanding beyond classification and retrieval — BLIP-2 with frozen large language model produces richer visual answers while OpenCLIP’s dual-encoder architecture with joint embedding space is faster for large-scale retrieval (millions of images), similarity search, and zero-shot classification where contrastive embeddings are more efficient than decoder-based models. The Claude Skills 360 bundle includes OpenCLIP skill sets covering model loading, image and text embedding, zero-shot classification with template ensembling, image-text retrieval, image similarity search, CoCa captioning, and pairwise similarity matrices. Start with the free tier to try vision-language embedding code generation.