Python’s tokenize module lexes Python source into a stream of tokens (import tokenize, token). The core API:

- tokenize.generate_tokens(readline) → iterator of TokenInfo; readline is a callable returning one line of text at a time (e.g. io.StringIO(source).readline).
- tokenize.tokenize(readline) → the same for binary streams; also yields an ENCODING token first.
- TokenInfo → named tuple (type, string, start, end, line), where start and end are (row, col) pairs.
- Token types → token.NAME (identifiers and keywords), token.NUMBER, token.STRING, token.OP, token.COMMENT, token.NEWLINE (ends a logical line), token.NL (a non-logical line break), token.INDENT, token.DEDENT, token.ENCODING, token.ENDMARKER, token.ERRORTOKEN; token.tok_name[type] maps a type to its name string.
- tokenize.untokenize(iterable) → source string; accepts an iterable of (type, string) pairs or full (type, string, start, end, line) tuples.
- tokenize.detect_encoding(readline) → (encoding, lines_read).
- tokenize.open(filename) → file object opened with the auto-detected encoding.
- tokenize.TokenError → raised on premature EOF (unclosed brackets, unterminated strings, and the like).

Common recipes: comment extraction filters type == token.COMMENT; string extraction filters type == token.STRING; identifier extraction filters type == token.NAME with string not in keyword.kwlist. Claude Code generates comment extractors, string literal scanners, identifier renaming tools, coding style checkers, and lightweight source modifiers.
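A minimal sketch of the core loop just described, using only the stdlib: feed io.StringIO(source).readline to generate_tokens and map each token type through token.tok_name.

```python
import io
import token
import tokenize

src = "x = 1  # answer\n"

# generate_tokens wants a readline callable; StringIO provides one for str input.
toks = list(tokenize.generate_tokens(io.StringIO(src).readline))
for t in toks:
    # TokenInfo is (type, string, start, end, line); start/end are (row, col).
    print(f"{token.tok_name[t.type]:10} {t.string!r:12} {t.start}")
```

The stream ends with NEWLINE and ENDMARKER, and the inline comment appears as its own COMMENT token with an exact (row, col) position.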
# CLAUDE.md for tokenize
## tokenize Stack
- Stdlib: import tokenize, token, io, keyword
- Tokens: list(tokenize.generate_tokens(io.StringIO(src).readline))
- Filter: [t for t in tokens if t.type == token.COMMENT]
- Rename: modify t.string where t.type == token.NAME and t.string == old
- Restore: tokenize.untokenize(modified_tokens) → new source str
- Encode: tokenize.open(path) for encoding-safe file open
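The Rename and Restore bullets above compose into a round trip; a small sketch (the identifier names are illustrative):

```python
import io
import keyword
import token
import tokenize

src = "total = first + second  # sum\n"

# Tokenize, swap one identifier, rebuild. Keeping the full 5-tuples (via
# TokenInfo._replace) leaves untokenize in exact mode, so spacing and the
# trailing comment survive the round trip.
out = []
for t in tokenize.generate_tokens(io.StringIO(src).readline):
    if t.type == token.NAME and t.string == "first" and not keyword.iskeyword("first"):
        t = t._replace(string="renamed")
    out.append(t)

result = tokenize.untokenize(out)
print(result)  # total = renamed + second  # sum
```

Passing plain (type, string) pairs instead would switch untokenize into compat mode, which regenerates whitespace rather than preserving it.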
## tokenize Lexical Analysis Pipeline
# app/tokutil.py — token extraction, comment scan, string finder, renamer
from __future__ import annotations
import io
import keyword
import token
import tokenize
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator
# ─────────────────────────────────────────────────────────────────────────────
# 1. Token helpers
# ─────────────────────────────────────────────────────────────────────────────
def tokenize_source(source: str) -> list[tokenize.TokenInfo]:
"""
Tokenize a Python source string into a list of TokenInfo tuples.
Example:
toks = tokenize_source("x = 1 + 2 # add")
for t in toks:
print(token.tok_name[t.type], repr(t.string))
"""
    tokens: list[tokenize.TokenInfo] = []
    try:
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            tokens.append(tok)
    except tokenize.TokenError:
        pass  # premature EOF (e.g. unclosed bracket): keep the tokens read so far
    return tokens
def tokenize_file(path: str | Path) -> list[tokenize.TokenInfo]:
"""Tokenize a Python file, auto-detecting encoding."""
    with tokenize.open(str(path)) as f:
        tokens: list[tokenize.TokenInfo] = []
        try:
            for tok in tokenize.generate_tokens(f.readline):
                tokens.append(tok)
        except tokenize.TokenError:
            pass  # premature EOF: keep the tokens read so far
        return tokens
def type_name(tok: tokenize.TokenInfo) -> str:
"""Return the human-readable token type name."""
return token.tok_name.get(tok.type, f"UNKNOWN({tok.type})")
def tokens_of_type(
tokens: list[tokenize.TokenInfo],
*types: int,
) -> list[tokenize.TokenInfo]:
"""
Filter tokens to those with any of the specified types.
Example:
names = tokens_of_type(toks, token.NAME)
strings = tokens_of_type(toks, token.STRING)
"""
return [t for t in tokens if t.type in types]
# ─────────────────────────────────────────────────────────────────────────────
# 2. Comment and docstring extraction
# ─────────────────────────────────────────────────────────────────────────────
@dataclass
class Comment:
text: str # comment text without leading #
lineno: int
col: int
raw: str # original "# text" string
def extract_comments(source: str) -> list[Comment]:
"""
Extract all comments from Python source.
Example:
for c in extract_comments(src):
print(f" line {c.lineno}: {c.text}")
"""
return [
Comment(
text=tok.string.lstrip("#").strip(),
lineno=tok.start[0],
col=tok.start[1],
raw=tok.string,
)
for tok in tokenize_source(source)
if tok.type == token.COMMENT
]
@dataclass
class StringLiteral:
value: str # decoded string (or raw if decoding fails)
lineno: int
col: int
raw: str # original token string e.g. '"hello"' or 'b"bytes"'
is_bytes: bool
is_fstring: bool
def extract_strings(source: str) -> list[StringLiteral]:
"""
Extract all string literals from Python source.
Example:
for s in extract_strings(src):
print(f" line {s.lineno}: {s.raw[:40]}")
"""
    import ast  # literal_eval decodes escape sequences without executing code

    result = []
    for tok in tokenize_source(source):
        if tok.type != token.STRING:
            continue
        raw = tok.string
        # Split the literal into its prefix (r/b/u/f flags) and quoted body.
        quote_index = min(i for i, ch in enumerate(raw) if ch in "\"'")
        prefix = raw[:quote_index].lower()
        is_bytes = "b" in prefix
        is_fstring = "f" in prefix
        try:
            value = ast.literal_eval(raw)  # eval() would run code inside f-strings
        except (ValueError, SyntaxError):
            value = raw  # f-strings and malformed literals keep the raw text
result.append(StringLiteral(
value=value if isinstance(value, str) else raw,
lineno=tok.start[0],
col=tok.start[1],
raw=raw,
is_bytes=is_bytes,
is_fstring=is_fstring,
))
return result
# ─────────────────────────────────────────────────────────────────────────────
# 3. Identifier analysis
# ─────────────────────────────────────────────────────────────────────────────
@dataclass
class IdentifierInfo:
name: str
lineno: int
col: int
is_kw: bool
def extract_identifiers(source: str, include_keywords: bool = False) -> list[IdentifierInfo]:
"""
Extract all NAME tokens (identifiers and keywords).
Example:
for ident in extract_identifiers(src, include_keywords=False):
print(ident.name, ident.lineno)
"""
return [
IdentifierInfo(
name=tok.string,
lineno=tok.start[0],
col=tok.start[1],
is_kw=keyword.iskeyword(tok.string),
)
for tok in tokenize_source(source)
if tok.type == token.NAME and (include_keywords or not keyword.iskeyword(tok.string))
]
def unique_names(source: str) -> list[str]:
"""
Return sorted list of unique non-keyword identifier names.
Example:
names = unique_names(source)
"""
return sorted({i.name for i in extract_identifiers(source)})
def name_frequency(source: str) -> dict[str, int]:
"""
Return {identifier: count} sorted by frequency (descending).
Example:
freq = name_frequency(source)
for name, count in list(freq.items())[:5]:
print(f" {name}: {count}")
"""
from collections import Counter
counts = Counter(i.name for i in extract_identifiers(source))
return dict(counts.most_common())
# ─────────────────────────────────────────────────────────────────────────────
# 4. Token-based source transformation
# ─────────────────────────────────────────────────────────────────────────────
def rename_identifier(source: str, old: str, new: str) -> str:
"""
Rename all occurrences of identifier old to new throughout the source.
Preserves whitespace, comments, and formatting exactly (unlike AST round-trip).
Example:
new_src = rename_identifier(src, "old_function_name", "new_function_name")
"""
toks = tokenize_source(source)
modified = []
for tok in toks:
if tok.type == token.NAME and tok.string == old and not keyword.iskeyword(old):
modified.append((tok.type, new, tok.start, tok.end, tok.line))
else:
modified.append(tok)
try:
return tokenize.untokenize(modified)
except Exception:
return source # fall back to original on error
def strip_comments(source: str) -> str:
"""
Remove all inline and standalone comments from source.
Preserves all other formatting including blank lines.
Example:
clean = strip_comments(source_with_many_comments)
"""
toks = tokenize_source(source)
modified = [tok for tok in toks if tok.type != token.COMMENT]
try:
return tokenize.untokenize(modified)
except Exception:
return source
def find_todo_comments(source: str) -> list[Comment]:
"""
    Return all comments that start with TODO, FIXME, HACK, XXX, NOTE, or BUG.
Example:
todos = find_todo_comments(source)
for t in todos:
print(f" [{t.lineno}] {t.text}")
"""
markers = {"TODO", "FIXME", "HACK", "XXX", "NOTE", "BUG"}
return [
c for c in extract_comments(source)
if any(c.text.upper().startswith(m) for m in markers)
]
# ─────────────────────────────────────────────────────────────────────────────
# 5. Source statistics
# ─────────────────────────────────────────────────────────────────────────────
@dataclass
class SourceStats:
total_lines: int
code_lines: int
comment_lines: int
blank_lines: int
n_tokens: int
n_names: int
n_numbers: int
n_strings: int
n_operators: int
n_keywords: int
todo_count: int
@classmethod
def from_source(cls, source: str) -> "SourceStats":
"""
Compute token statistics for a Python source string.
Example:
stats = SourceStats.from_source(Path("app.py").read_text())
print(stats)
"""
lines = source.splitlines()
blank = sum(1 for l in lines if not l.strip())
comment_lines_set: set[int] = set()
n_tok = n_name = n_num = n_str = n_op = n_kw = 0
toks = tokenize_source(source)
for tok in toks:
if tok.type in (token.NEWLINE, token.NL, token.ENDMARKER, token.ENCODING):
continue
n_tok += 1
if tok.type == token.COMMENT:
comment_lines_set.add(tok.start[0])
elif tok.type == token.NAME:
if keyword.iskeyword(tok.string):
n_kw += 1
else:
n_name += 1
elif tok.type == token.NUMBER:
n_num += 1
elif tok.type == token.STRING:
n_str += 1
elif tok.type == token.OP:
n_op += 1
        comment_lines = len(comment_lines_set)
        # Approximation: comment_lines counts every line containing a comment,
        # including inline comments on code lines, so code_lines can undercount.
        code_lines = len(lines) - blank - comment_lines
return cls(
total_lines=len(lines),
code_lines=max(0, code_lines),
comment_lines=comment_lines,
blank_lines=blank,
n_tokens=n_tok,
n_names=n_name,
n_numbers=n_num,
n_strings=n_str,
n_operators=n_op,
n_keywords=n_kw,
todo_count=len(find_todo_comments(source)),
)
def __str__(self) -> str:
return (
f"Lines: total={self.total_lines} code={self.code_lines} "
f"comments={self.comment_lines} blank={self.blank_lines}\n"
f"Tokens: {self.n_tokens} (names={self.n_names} kw={self.n_keywords} "
f"num={self.n_numbers} str={self.n_strings} op={self.n_operators})\n"
f"TODOs: {self.todo_count}"
)
# ─────────────────────────────────────────────────────────────────────────────
# Demo
# ─────────────────────────────────────────────────────────────────────────────
if __name__ == "__main__":
source = '''\
import os # standard library
import requests # third-party
# TODO: add retry logic
def fetch_data(url: str, timeout: int = 30) -> dict:
"""Fetch JSON from url."""
# FIXME: handle rate limits
response = requests.get(url, timeout=timeout)
result = response.json()
return result
class DataLoader:
"""Loads data from multiple sources."""
def __init__(self, base_url: str) -> None:
self.base_url = base_url # NOTE: no trailing slash
def load(self, path: str) -> dict:
full_url = self.base_url + "/" + path
return fetch_data(full_url)
x = 42
message = "Hello, world!"
pi = 3.14159
'''
print("=== tokenize demo ===")
toks = tokenize_source(source)
print(f"\n total tokens: {len(toks)}")
print("\n--- extract_comments ---")
for c in extract_comments(source):
print(f" line {c.lineno:3d} col {c.col}: {c.text}")
print("\n--- find_todo_comments ---")
for t in find_todo_comments(source):
print(f" [{t.lineno}] {t.text}")
print("\n--- extract_strings ---")
for s in extract_strings(source)[:5]:
print(f" line {s.lineno}: {s.raw[:40]!r}")
print("\n--- name_frequency (top 5) ---")
freq = name_frequency(source)
for name, count in list(freq.items())[:5]:
print(f" {name:20s}: {count}")
print("\n--- rename_identifier ---")
renamed = rename_identifier(source, "fetch_data", "fetch_json")
# Confirm only identifier changed, not string content
count_old = renamed.count("fetch_data")
count_new = renamed.count("fetch_json")
print(f" 'fetch_data' occurrences after rename: {count_old}")
print(f" 'fetch_json' occurrences after rename: {count_new}")
print("\n--- strip_comments ---")
clean = strip_comments(source)
original_comments = len(extract_comments(source))
remaining_comments = len(extract_comments(clean))
print(f" comments before: {original_comments} after strip: {remaining_comments}")
print("\n--- SourceStats ---")
stats = SourceStats.from_source(source)
print(stats)
print("\n=== done ===")
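One failure mode the helpers above deliberately swallow: on premature EOF (an unclosed bracket or unterminated string) the token stream raises tokenize.TokenError mid-iteration, so a bare list() call loses everything read so far.

```python
import io
import tokenize

bad = "items = [1, 2"  # unclosed bracket: the lexer keeps waiting for input
err = None
try:
    list(tokenize.generate_tokens(io.StringIO(bad).readline))
except tokenize.TokenError as exc:
    err = exc
print("raised:", err)
```

This is why tokenize_source catches TokenError instead of letting it propagate to callers that only want a best-effort token list.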
For the ast alternative: ast.parse() produces a high-level structural tree with expression nodes, statement classes, and scope information, while tokenize operates at the character level and yields the raw token stream, including comment positions, whitespace layout, and exact source coordinates. Use ast for structural analysis (finding function definitions, analyzing call graphs, transforming expressions); use tokenize when you need to preserve exact source formatting (comments, blank lines, indentation), extract literals without evaluating them, or perform rename/replace operations that must not disturb surrounding whitespace.

For the pygments alternative: pygments (PyPI) provides lexers for 500+ languages and renders syntax highlighting to HTML, ANSI, LaTeX, and more, whereas tokenize handles only Python and outputs token tuples rather than rendered markup. Use pygments for documentation generation, code display in web apps, and terminal colorization; use tokenize for Python-specific programmatic analysis where you need exact CPython token types, positions, and encoding detection.

The Claude Skills 360 bundle includes tokenize skill sets covering the tokenize_source()/tokenize_file()/type_name()/tokens_of_type() core helpers, extract_comments()/extract_strings() content extraction, extract_identifiers()/unique_names()/name_frequency() identifier analysis, rename_identifier()/strip_comments()/find_todo_comments() source transformation, and SourceStats with code/comment/blank line counts and a token frequency breakdown. Start with the free tier to try Python lexical analysis and tokenize pipeline code generation.
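The ast-versus-tokenize trade-off in one stdlib-only contrast: the same one-line source parsed both ways. ast keeps structure but drops the comment, while tokenize surfaces the comment with its exact position.

```python
import ast
import io
import token
import tokenize

src = "x = 1  # keep me\n"

# ast sees an Assign statement; the comment is gone from the tree entirely.
tree = ast.parse(src)
print(type(tree.body[0]).__name__)  # Assign

# tokenize yields the comment as its own token with a (row, col) position.
comments = [(t.string, t.start)
            for t in tokenize.generate_tokens(io.StringIO(src).readline)
            if t.type == token.COMMENT]
print(comments)  # [('# keep me', (1, 7))]
```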