Whisper transcribes audio in 99 languages with high accuracy. pip install openai-whisper. import whisper. model = whisper.load_model("base") — sizes: tiny (39M), base (74M), small (244M), medium (769M), large-v3 (1.5B). Transcribe: result = model.transcribe("audio.mp3"), result["text"] — full transcript. Language: result = model.transcribe("audio.mp3", language="fr"). Translate to English: result = model.transcribe("audio.mp3", task="translate"). Word timestamps: result = model.transcribe("audio.mp3", word_timestamps=True), result["segments"][0]["words"] — list with {word, start, end, probability}. Segment-level: result["segments"] — list with {id, start, end, text}. Decoding options: options = whisper.DecodingOptions(language="en", without_timestamps=False, beam_size=5, best_of=5, temperature=0.0). Initial prompt: model.transcribe(audio, initial_prompt="This is a medical lecture about cardiology.") — guides vocabulary. Long audio: whisper.audio.load_audio("file.mp3") → whisper.audio.pad_or_trim(audio). Detect language: mel = whisper.log_mel_spectrogram(audio), probs = model.detect_language(mel)[1], max(probs, key=probs.get). faster-whisper: from faster_whisper import WhisperModel, model = WhisperModel("large-v3", device="cpu", compute_type="int8"), segments, info = model.transcribe("audio.mp3", beam_size=5, word_timestamps=True). OpenAI API: from openai import OpenAI, client.audio.transcriptions.create(model="whisper-1", file=open("audio.mp3","rb"), response_format="verbose_json"). Claude Code generates Whisper transcription pipelines, batch processors, subtitle exporters, and speaker-aware transcription scripts.
CLAUDE.md for Whisper
## Whisper Stack
- Version: openai-whisper >= 20231117 | faster-whisper >= 1.0
- Load: whisper.load_model("base" | "small" | "medium" | "large-v3")
- Transcribe: model.transcribe(audio_path, language=, task="transcribe"|"translate")
- Timestamps: word_timestamps=True → result["segments"][i]["words"][j] with start/end/prob
- Prompt: initial_prompt="domain vocab" to guide recognition
- FasterWhisper: WhisperModel(size, device, compute_type="int8") → transcribe(path, beam_size)
- OpenAI API: client.audio.transcriptions.create(model="whisper-1", file=..., response_format)
- Formats: response_format="json"|"text"|"srt"|"vtt"|"verbose_json"
Whisper Transcription Pipeline
# audio/whisper_pipeline.py — speech recognition with OpenAI Whisper
from __future__ import annotations
import os
import json
import time
from pathlib import Path
from typing import Optional
# ── 1. Model loading ──────────────────────────────────────────────────────────
def load_whisper_model(
    size: str = "base",   # tiny | base | small | medium | large-v3
    device: str = "cpu",  # cpu | cuda
):
    """Load an openai-whisper checkpoint; larger sizes trade speed for accuracy."""
    import whisper

    print(f"Loading whisper-{size} on {device}...")
    model = whisper.load_model(size, device=device)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"Model ready: {n_params:,} params")
    return model
def load_faster_whisper(
    size: str = "base",
    device: str = "cpu",
    compute_type: str = "int8",  # int8 | float16 | float32 | int8_float16
):
    """
    Build a faster-whisper model (CTranslate2 backend).

    Runs roughly 2-4x quicker than the reference implementation; int8
    quantization makes CPU inference practical.
    """
    from faster_whisper import WhisperModel

    threads = os.cpu_count() or 4
    model = WhisperModel(
        size,
        device=device,
        compute_type=compute_type,
        cpu_threads=threads,
        num_workers=2,
    )
    print(f"faster-whisper-{size} ({compute_type}) ready")
    return model
# ── 2. Basic transcription ────────────────────────────────────────────────────
def transcribe(
    model,
    audio_path: str,
    language: Optional[str] = None,       # None = auto-detect
    task: str = "transcribe",             # "transcribe" | "translate"
    beam_size: int = 5,
    temperature: float = 0.0,
    initial_prompt: Optional[str] = None,
    word_timestamps: bool = False,
) -> dict:
    """
    Transcribe an audio file; returns {"text", "segments", "language"}.

    Works with both openai-whisper and faster-whisper models.

    Fix: the old detection `not hasattr(model, "transcribe")` was always
    False because BOTH libraries expose a .transcribe() method, so
    faster-whisper models went down the openai-whisper path and a raw
    (segments, info) tuple was returned instead of a dict. We now dispatch
    on the module that defines the model's class. The unused
    `import whisper as _whisper` was also removed.
    """
    is_faster = type(model).__module__.split(".", 1)[0] == "faster_whisper"
    if is_faster:
        # faster-whisper returns a lazy segment generator plus an info object.
        segments, info = model.transcribe(
            audio_path,
            language=language,
            task=task,
            beam_size=beam_size,
            temperature=temperature,
            initial_prompt=initial_prompt,
            word_timestamps=word_timestamps,
            vad_filter=True,  # voice-activity filter skips long silences
            vad_parameters={"min_silence_duration_ms": 300},
        )
        segment_list = []
        full_text = []
        for seg in segments:  # consuming the generator performs the decode
            s = {"id": seg.id, "start": seg.start, "end": seg.end, "text": seg.text.strip()}
            if word_timestamps and seg.words:
                s["words"] = [
                    {"word": w.word, "start": w.start, "end": w.end, "probability": w.probability}
                    for w in seg.words
                ]
            segment_list.append(s)
            full_text.append(seg.text.strip())
        return {
            "text": " ".join(full_text),
            "language": info.language,
            "segments": segment_list,
        }
    # openai-whisper already returns the dict shape we want — pass through.
    return model.transcribe(
        audio_path,
        language=language,
        task=task,
        beam_size=beam_size,
        temperature=temperature,
        initial_prompt=initial_prompt,
        word_timestamps=word_timestamps,
    )
def detect_language(model, audio_path: str) -> tuple[str, dict]:
    """
    Detect the spoken language from the start of the audio.

    Returns (top_language_code, top5) where top5 maps language code →
    probability, sorted descending.

    Fix: removed the unused `import numpy as np`.
    """
    import whisper

    audio = whisper.load_audio(audio_path)
    audio = whisper.pad_or_trim(audio)  # model expects a fixed-length window
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
    _, probs = model.detect_language(mel)
    top_lang = max(probs, key=probs.get)
    top5 = dict(sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:5])
    return top_lang, top5
# ── 3. Subtitle generation ────────────────────────────────────────────────────
def format_timestamp(seconds: float, srt: bool = False) -> str:
    """
    Format seconds as an SRT ("00:00:00,000") or VTT ("00:00:00.000") timestamp.

    Fix: derive all fields from one rounded millisecond total. The old
    per-field float arithmetic truncated instead of rounding, so e.g.
    1.9996 s rendered as "...:01,999" rather than "...:02,000". Negative
    inputs (possible from upstream alignment jitter) clamp to zero.
    """
    total_ms = max(0, round(seconds * 1000))
    h, rest = divmod(total_ms, 3_600_000)
    m, rest = divmod(rest, 60_000)
    s, ms = divmod(rest, 1000)
    sep = "," if srt else "."
    return f"{h:02d}:{m:02d}:{s:02d}{sep}{ms:03d}"


def to_srt(result: dict) -> str:
    """Convert a Whisper result dict to SRT subtitle text (1-based cue index)."""
    cues = []
    for i, seg in enumerate(result["segments"], 1):
        start = format_timestamp(seg["start"], srt=True)
        end = format_timestamp(seg["end"], srt=True)
        cues.append(f"{i}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(cues)


def to_vtt(result: dict) -> str:
    """Convert a Whisper result dict to WebVTT subtitle text."""
    cues = ["WEBVTT\n"]  # mandatory WebVTT header line
    for seg in result["segments"]:
        start = format_timestamp(seg["start"], srt=False)
        end = format_timestamp(seg["end"], srt=False)
        cues.append(f"{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(cues)
def save_subtitles(result: dict, base_path: str):
    """Write both <base_path>.srt and <base_path>.vtt from a Whisper result."""
    srt_path = f"{base_path}.srt"
    vtt_path = f"{base_path}.vtt"
    Path(srt_path).write_text(to_srt(result), encoding="utf-8")
    Path(vtt_path).write_text(to_vtt(result), encoding="utf-8")
    print(f"Saved: {srt_path}, {vtt_path}")
# ── 4. Batch audio processing ─────────────────────────────────────────────────
def batch_transcribe(
    model,
    audio_files: list[str],
    output_dir: str = "./transcripts",
    language: Optional[str] = None,  # fix: annotation was `str = None`
    format: str = "json",            # "json" | "txt" | "srt"
) -> list[dict]:
    """
    Transcribe multiple audio files and persist each result.

    model: openai-whisper or faster-whisper model (see transcribe()).
    audio_files: paths to input audio, processed in order.
    output_dir: created if missing; one output file per input stem.
    language: forced language code, or None for auto-detect.
    format: "json" (full result), "txt" (plain text), "srt" (SRT+VTT pair).
    Returns the list of result dicts in input order.
    """
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)
    results = []
    for i, audio_file in enumerate(audio_files):
        stem = Path(audio_file).stem
        print(f"[{i+1}/{len(audio_files)}] Transcribing: {audio_file}...")
        t0 = time.perf_counter()
        result = transcribe(model, audio_file, language=language)
        elapsed = time.perf_counter() - t0
        print(f" Done in {elapsed:.1f}s: {result['language']} — {len(result['text'].split())} words")
        if format == "json":
            out_file = output_path / f"{stem}.json"
            out_file.write_text(json.dumps(result, ensure_ascii=False, indent=2), encoding="utf-8")
        elif format == "txt":
            out_file = output_path / f"{stem}.txt"
            out_file.write_text(result["text"], encoding="utf-8")
        elif format == "srt":
            save_subtitles(result, str(output_path / stem))
        results.append(result)
    print(f"\nCompleted {len(results)} files → {output_dir}")
    return results
# ── 5. Domain-specific transcription ─────────────────────────────────────────
# Domain → initial_prompt text handed to Whisper. The prompt biases decoding
# toward the vocabulary it mentions (see transcribe(initial_prompt=...)).
DOMAIN_PROMPTS = {
    "medical": "This is a medical consultation discussing symptoms, diagnoses, and treatments.",
    "legal": "This is a legal deposition with formal legal terminology and case discussions.",
    "technical": "This is a software engineering discussion about APIs, microservices, and machine learning.",
    "finance": "This is a financial earnings call discussing revenue, EBITDA, and market outlook.",
    "podcast": "This is a podcast interview with casual conversation.",
}
def transcribe_domain(
    model,
    audio_path: str,
    domain: str = "technical",
    language: str = "en",
) -> dict:
    """Transcribe with a domain vocabulary prompt picked from DOMAIN_PROMPTS."""
    # Unknown domains fall back to an empty prompt rather than raising.
    return transcribe(
        model,
        audio_path,
        language=language,
        initial_prompt=DOMAIN_PROMPTS.get(domain, ""),
        temperature=0.0,
        beam_size=5,
        word_timestamps=True,
    )
# ── 6. OpenAI Whisper API ─────────────────────────────────────────────────────
def transcribe_via_api(
    audio_path: str,
    language: Optional[str] = None,
    response_format: str = "verbose_json",           # json | text | srt | vtt | verbose_json
    timestamp_granularities: Optional[list] = None,  # ["word"] | ["segment"]
) -> dict | str:
    """
    Transcribe through the hosted OpenAI Whisper API — no local GPU needed.
    Costs ~$0.006/minute of audio.

    Fix: the audio file handle was opened inline (`file=open(...)`) and never
    closed; it is now managed with a `with` block. Also fixed the
    `str = None` / `list = None` annotations.

    Returns a parsed dict for json/verbose_json formats, otherwise the raw
    string body (text, srt, vtt).
    """
    from openai import OpenAI
    client = OpenAI()
    with open(audio_path, "rb") as audio_file:
        params = dict(
            model="whisper-1",
            file=audio_file,
            response_format=response_format,
        )
        if language:
            params["language"] = language
        if timestamp_granularities:
            params["timestamp_granularities"] = timestamp_granularities
        result = client.audio.transcriptions.create(**params)
    if response_format in ("json", "verbose_json"):
        return result.model_dump()
    return result  # text, srt, vtt as string
def translate_via_api(audio_path: str) -> str:
    """
    Translate audio in any language → English text via the Whisper API.

    Fix: the audio file handle was opened inline and never closed; it is
    now managed with a `with` block.
    """
    from openai import OpenAI
    client = OpenAI()
    with open(audio_path, "rb") as audio_file:
        result = client.audio.translations.create(
            model="whisper-1",
            file=audio_file,
            response_format="text",
        )
    return result
# ── 7. WhisperX with speaker diarization ─────────────────────────────────────
def transcribe_with_diarization(
    audio_path: str,
    model_size: str = "base",
    device: str = "cpu",
    hf_token: str = None,      # Required for pyannote diarization models
    num_speakers: int = None,  # None = auto-detect
    language: str = None,
) -> dict:
    """
    WhisperX pipeline: transcription → word alignment → speaker diarization.
    pip install whisperx
    Segments carry speaker labels, e.g.
    {"start": 0.0, "end": 2.3, "text": "...", "speaker": "SPEAKER_00"}
    """
    import whisperx

    # Step 1 — transcribe with the batched WhisperX front-end.
    asr_model = whisperx.load_model(model_size, device, compute_type="int8")
    waveform = whisperx.load_audio(audio_path)
    result = asr_model.transcribe(waveform, batch_size=16, language=language)

    # Step 2 — refine segment times to word-level with an alignment model.
    align_model, align_meta = whisperx.load_align_model(
        language_code=result["language"],
        device=device,
    )
    result = whisperx.align(
        result["segments"], align_model, align_meta, waveform, device,
        return_char_alignments=False,
    )

    # Step 3 — optional diarization (HF token needed for pyannote weights).
    if hf_token:
        diarizer = whisperx.DiarizationPipeline(use_auth_token=hf_token, device=device)
        speaker_turns = diarizer(waveform, num_speakers=num_speakers)
        result = whisperx.assign_word_speakers(speaker_turns, result)
    return result
# ── Demo ──────────────────────────────────────────────────────────────────────
if __name__ == "__main__":
    # Demo with faster-whisper — the cheapest way to run on CPU.
    model = load_faster_whisper("base", device="cpu", compute_type="int8")

    # Transcribe (replace "audio.mp3" with a real file)
    result = transcribe(model, "audio.mp3", word_timestamps=True)
    print(f"Language: {result['language']}")
    print(f"Transcript: {result['text'][:200]}...")

    segments = result["segments"]
    if segments:
        head = segments[0]
        print(f"\nFirst segment: [{head['start']:.2f}s - {head['end']:.2f}s]")
        print(f" Text: {head['text']}")
        if "words" in head:
            for w in head["words"][:5]:
                print(f" [{w['start']:.2f}-{w['end']:.2f}] {w['word']} (p={w['probability']:.2f})")

    # Subtitle preview
    srt = to_srt(result)
    print(f"\nSRT preview:\n{srt[:200]}")
For the AssemblyAI API alternative when needing production-grade cloud transcription with speaker diarization, auto-chapters, entity detection, and sentiment analysis in a single managed API — AssemblyAI bundles multiple post-processing features while Whisper runs completely on your own hardware with no per-minute cost, no data leaving your infrastructure, and full control over the model size and accuracy/speed trade-off. For the Google Speech-to-Text alternative when needing real-time streaming transcription of telephony audio with automatic punctuation and speaker tagging via managed cloud infrastructure — Google STT offers streaming with low first-byte latency while Whisper’s batch transcription achieves higher accuracy on challenging audio (accents, noise, domain vocabulary) and its initial_prompt parameter provides domain vocabulary guidance without any API configuration. The Claude Skills 360 bundle includes Whisper skill sets covering model loading, transcription with word timestamps, subtitle SRT/VTT generation, domain-specific prompting, batch audio processing, OpenAI Whisper API, faster-whisper CPU acceleration, and WhisperX speaker diarization. Start with the free tier to try speech recognition pipeline generation.