Python’s tokenize module lexes Python source into a stream of tokens (import tokenize, token). The core API:

- tokenize.generate_tokens(readline) → iterator of TokenInfo; readline is a callable returning one line of text at a time (e.g. io.StringIO(source).readline).
- tokenize.tokenize(readline) → the same for binary streams; also yields an ENCODING token first.
- TokenInfo → named tuple (type, string, start, end, line), where start and end are (row, col) pairs.
- Token types → token.NAME (identifiers and keywords), token.NUMBER, token.STRING, token.OP, token.COMMENT, token.NEWLINE (ends a logical line), token.NL (a non-logical line break), token.INDENT, token.DEDENT, token.ENCODING, token.ENDMARKER, token.ERRORTOKEN; token.tok_name[type] maps a type to its name string.
- tokenize.untokenize(iterable) → source string; accepts an iterable of (type, string) pairs or full (type, string, start, end, line) tuples.
- tokenize.detect_encoding(readline) → (encoding, lines_read).
- tokenize.open(filename) → file object opened with the auto-detected encoding.
- tokenize.TokenError → raised on premature EOF (unclosed brackets, unterminated strings, and the like).

Common recipes: comment extraction filters type == token.COMMENT; string extraction filters type == token.STRING; identifier extraction filters type == token.NAME with string not in keyword.kwlist. Claude Code generates comment extractors, string literal scanners, identifier renaming tools, coding style checkers, and lightweight source modifiers.
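A minimal sketch of the core loop just described, using only the stdlib: feed io.StringIO(source).readline to generate_tokens and map each token type through token.tok_name.

```python
import io
import token
import tokenize

src = "x = 1  # answer\n"

# generate_tokens wants a readline callable; StringIO provides one for str input.
toks = list(tokenize.generate_tokens(io.StringIO(src).readline))
for t in toks:
    # TokenInfo is (type, string, start, end, line); start/end are (row, col).
    print(f"{token.tok_name[t.type]:10} {t.string!r:12} {t.start}")
```

The stream ends with NEWLINE and ENDMARKER, and the inline comment appears as its own COMMENT token with an exact (row, col) position.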
# CLAUDE.md for tokenize
## tokenize Stack
- Stdlib: import tokenize, token, io, keyword
- Tokens: list(tokenize.generate_tokens(io.StringIO(src).readline))
- Filter: [t for t in tokens if t.type == token.COMMENT]
- Rename: modify t.string where t.type == token.NAME and t.string == old
- Restore: tokenize.untokenize(modified_tokens) → new source str
- Encode: tokenize.open(path) for encoding-safe file open
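The Rename and Restore bullets above compose into a round trip; a small sketch (the identifier names are illustrative):

```python
import io
import keyword
import token
import tokenize

src = "total = first + second  # sum\n"

# Tokenize, swap one identifier, rebuild. Keeping the full 5-tuples (via
# TokenInfo._replace) leaves untokenize in exact mode, so spacing and the
# trailing comment survive the round trip.
out = []
for t in tokenize.generate_tokens(io.StringIO(src).readline):
    if t.type == token.NAME and t.string == "first" and not keyword.iskeyword("first"):
        t = t._replace(string="renamed")
    out.append(t)

result = tokenize.untokenize(out)
print(result)  # total = renamed + second  # sum
```

Passing plain (type, string) pairs instead would switch untokenize into compat mode, which regenerates whitespace rather than preserving it.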
## tokenize Lexical Analysis Pipeline
# app/tokutil.py — token extraction, comment scan, string finder, renamer
from __future__ import annotations
import io
import keyword
import token
import tokenize
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator
# ─────────────────────────────────────────────────────────────────────────────
# 1. Token helpers
# ─────────────────────────────────────────────────────────────────────────────
def tokenize_source(source: str) -> list[tokenize.TokenInfo]:
"""
Tokenize a Python source string into a list of TokenInfo tuples.
Example:
toks = tokenize_source("x = 1 + 2 # add")
for t in toks:
print(token.tok_name[t.type], repr(t.string))
"""
    tokens: list[tokenize.TokenInfo] = []
    try:
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            tokens.append(tok)
    except tokenize.TokenError:
        pass  # premature EOF (e.g. unclosed bracket): keep the tokens read so far
    return tokens
def tokenize_file(path: str | Path) -> list[tokenize.TokenInfo]:
"""Tokenize a Python file, auto-detecting encoding."""
    with tokenize.open(str(path)) as f:
        tokens: list[tokenize.TokenInfo] = []
        try:
            for tok in tokenize.generate_tokens(f.readline):
                tokens.append(tok)
        except tokenize.TokenError:
            pass  # premature EOF: keep the tokens read so far
        return tokens
def type_name(tok: tokenize.TokenInfo) -> str:
"""Return the human-readable token type name."""
return token.tok_name.get(tok.type, f"UNKNOWN({tok.type})")
def tokens_of_type(
tokens: list[tokenize.TokenInfo],
*types: int,
) -> list[tokenize.TokenInfo]:
"""
Filter tokens to those with any of the specified types.
Example:
names = tokens_of_type(toks, token.NAME)
strings = tokens_of_type(toks, token.STRING)
"""
return [t for t in tokens if t.type in types]
# ─────────────────────────────────────────────────────────────────────────────
# 2. Comment and docstring extraction
# ─────────────────────────────────────────────────────────────────────────────
@dataclass
class Comment:
text: str # comment text without leading #
lineno: int
col: int
raw: str # original "# text" string
def extract_comments(source: str) -> list[Comment]:
"""
Extract all comments from Python source.
Example:
for c in extract_comments(src):
print(f" line {c.lineno}: {c.text}")
"""
return [
Comment(
text=tok.string.lstrip("#").strip(),
lineno=tok.start[0],
col=tok.start[1],
raw=tok.string,
)
for tok in tokenize_source(source)
if tok.type == token.COMMENT
]
@dataclass
class StringLiteral:
value: str # decoded string (or raw if decoding fails)
lineno: int
col: int
raw: str # original token string e.g. '"hello"' or 'b"bytes"'
is_bytes: bool
is_fstring: bool
def extract_strings(source: str) -> list[StringLiteral]:
"""
Extract all string literals from Python source.
Example:
for s in extract_strings(src):
print(f" line {s.lineno}: {s.raw[:40]}")
"""
    import ast  # literal_eval decodes escape sequences without executing code

    result = []
    for tok in tokenize_source(source):
        if tok.type != token.STRING:
            continue
        raw = tok.string
        # Split the literal into its prefix (r/b/u/f flags) and quoted body.
        quote_index = min(i for i, ch in enumerate(raw) if ch in "\"'")
        prefix = raw[:quote_index].lower()
        is_bytes = "b" in prefix
        is_fstring = "f" in prefix
        try:
            value = ast.literal_eval(raw)  # eval() would run code inside f-strings
        except (ValueError, SyntaxError):
            value = raw  # f-strings and malformed literals keep the raw text
result.append(StringLiteral(
value=value if isinstance(value, str) else raw,
lineno=tok.start[0],
col=tok.start[1],
raw=raw,
is_bytes=is_bytes,
is_fstring=is_fstring,
))
return result
# ─────────────────────────────────────────────────────────────────────────────
# 3. Identifier analysis
# ─────────────────────────────────────────────────────────────────────────────
@dataclass
class IdentifierInfo:
name: str
lineno: int
col: int
is_kw: bool
def extract_identifiers(source: str, include_keywords: bool = False) -> list[IdentifierInfo]:
"""
Extract all NAME tokens (identifiers and keywords).
Example:
for ident in extract_identifiers(src, include_keywords=False):
print(ident.name, ident.lineno)
"""
return [
IdentifierInfo(
name=tok.string,
lineno=tok.start[0],
col=tok.start[1],
is_kw=keyword.iskeyword(tok.string),
)
for tok in tokenize_source(source)
if tok.type == token.NAME and (include_keywords or not keyword.iskeyword(tok.string))
]
def unique_names(source: str) -> list[str]:
"""
Return sorted list of unique non-keyword identifier names.
Example:
names = unique_names(source)
"""
return sorted({i.name for i in extract_identifiers(source)})
def name_frequency(source: str) -> dict[str, int]:
"""
Return {identifier: count} sorted by frequency (descending).
Example:
freq = name_frequency(source)
for name, count in list(freq.items())[:5]:
print(f" {name}: {count}")
"""
from collections import Counter
counts = Counter(i.name for i in extract_identifiers(source))
return dict(counts.most_common())
# ─────────────────────────────────────────────────────────────────────────────
# 4. Token-based source transformation
# ─────────────────────────────────────────────────────────────────────────────
def rename_identifier(source: str, old: str, new: str) -> str:
"""
Rename all occurrences of identifier old to new throughout the source.
Preserves whitespace, comments, and formatting exactly (unlike AST round-trip).
Example:
new_src = rename_identifier(src, "old_function_name", "new_function_name")
"""
toks = tokenize_source(source)
modified = []
for tok in toks:
if tok.type == token.NAME and tok.string == old and not keyword.iskeyword(old):
modified.append((tok.type, new, tok.start, tok.end, tok.line))
else:
modified.append(tok)
try:
return tokenize.untokenize(modified)
except Exception:
return source # fall back to original on error
def strip_comments(source: str) -> str:
"""
Remove all inline and standalone comments from source.
Preserves all other formatting including blank lines.
Example:
clean = strip_comments(source_with_many_comments)
"""
toks = tokenize_source(source)
modified = [tok for tok in toks if tok.type != token.COMMENT]
try:
return tokenize.untokenize(modified)
except Exception:
return source
def find_todo_comments(source: str) -> list[Comment]:
"""
    Return all comments that start with TODO, FIXME, HACK, XXX, NOTE, or BUG.
Example:
todos = find_todo_comments(source)
for t in todos:
print(f" [{t.lineno}] {t.text}")
"""
markers = {"TODO", "FIXME", "HACK", "XXX", "NOTE", "BUG"}
return [
c for c in extract_comments(source)
if any(c.text.upper().startswith(m) for m in markers)
]
# ─────────────────────────────────────────────────────────────────────────────
# 5. Source statistics
# ─────────────────────────────────────────────────────────────────────────────
@dataclass
class SourceStats:
total_lines: int
code_lines: int
comment_lines: int
blank_lines: int
n_tokens: int
n_names: int
n_numbers: int
n_strings: int
n_operators: int
n_keywords: int
todo_count: int
@classmethod
def from_source(cls, source: str) -> "SourceStats":
"""
Compute token statistics for a Python source string.
Example:
stats = SourceStats.from_source(Path("app.py").read_text())
print(stats)
"""
lines = source.splitlines()
blank = sum(1 for l in lines if not l.strip())
comment_lines_set: set[int] = set()
n_tok = n_name = n_num = n_str = n_op = n_kw = 0
toks = tokenize_source(source)
for tok in toks:
if tok.type in (token.NEWLINE, token.NL, token.ENDMARKER, token.ENCODING):
continue
n_tok += 1
if tok.type == token.COMMENT:
comment_lines_set.add(tok.start[0])
elif tok.type == token.NAME:
if keyword.iskeyword(tok.string):
n_kw += 1
else:
n_name += 1
elif tok.type == token.NUMBER:
n_num += 1
elif tok.type == token.STRING:
n_str += 1
elif tok.type == token.OP:
n_op += 1
        comment_lines = len(comment_lines_set)
        # Approximation: comment_lines counts every line containing a comment,
        # including inline comments on code lines, so code_lines can undercount.
        code_lines = len(lines) - blank - comment_lines
return cls(
total_lines=len(lines),
code_lines=max(0, code_lines),
comment_lines=comment_lines,
blank_lines=blank,
n_tokens=n_tok,
n_names=n_name,
n_numbers=n_num,
n_strings=n_str,
n_operators=n_op,
n_keywords=n_kw,
todo_count=len(find_todo_comments(source)),
)
def __str__(self) -> str:
return (
f"Lines: total={self.total_lines} code={self.code_lines} "
f"comments={self.comment_lines} blank={self.blank_lines}\n"
f"Tokens: {self.n_tokens} (names={self.n_names} kw={self.n_keywords} "
f"num={self.n_numbers} str={self.n_strings} op={self.n_operators})\n"
f"TODOs: {self.todo_count}"
)
# ─────────────────────────────────────────────────────────────────────────────
# Demo
# ─────────────────────────────────────────────────────────────────────────────
if __name__ == "__main__":
source = '''\
import os # standard library
import requests # third-party
# TODO: add retry logic
def fetch_data(url: str, timeout: int = 30) -> dict:
"""Fetch JSON from url."""
# FIXME: handle rate limits
response = requests.get(url, timeout=timeout)
result = response.json()
return result
class DataLoader:
"""Loads data from multiple sources."""
def __init__(self, base_url: str) -> None:
self.base_url = base_url # NOTE: no trailing slash
def load(self, path: str) -> dict:
full_url = self.base_url + "/" + path
return fetch_data(full_url)
x = 42
message = "Hello, world!"
pi = 3.14159
'''
print("=== tokenize demo ===")
toks = tokenize_source(source)
print(f"\n total tokens: {len(toks)}")
print("\n--- extract_comments ---")
for c in extract_comments(source):
print(f" line {c.lineno:3d} col {c.col}: {c.text}")
print("\n--- find_todo_comments ---")
for t in find_todo_comments(source):
print(f" [{t.lineno}] {t.text}")
print("\n--- extract_strings ---")
for s in extract_strings(source)[:5]:
print(f" line {s.lineno}: {s.raw[:40]!r}")
print("\n--- name_frequency (top 5) ---")
freq = name_frequency(source)
for name, count in list(freq.items())[:5]:
print(f" {name:20s}: {count}")
print("\n--- rename_identifier ---")
renamed = rename_identifier(source, "fetch_data", "fetch_json")
# Confirm only identifier changed, not string content
count_old = renamed.count("fetch_data")
count_new = renamed.count("fetch_json")
print(f" 'fetch_data' occurrences after rename: {count_old}")
print(f" 'fetch_json' occurrences after rename: {count_new}")
print("\n--- strip_comments ---")
clean = strip_comments(source)
original_comments = len(extract_comments(source))
remaining_comments = len(extract_comments(clean))
print(f" comments before: {original_comments} after strip: {remaining_comments}")
print("\n--- SourceStats ---")
stats = SourceStats.from_source(source)
print(stats)
print("\n=== done ===")
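One failure mode the helpers above deliberately swallow: on premature EOF (an unclosed bracket or unterminated string) the token stream raises tokenize.TokenError mid-iteration, so a bare list() call loses everything read so far.

```python
import io
import tokenize

bad = "items = [1, 2"  # unclosed bracket: the lexer keeps waiting for input
err = None
try:
    list(tokenize.generate_tokens(io.StringIO(bad).readline))
except tokenize.TokenError as exc:
    err = exc
print("raised:", err)
```

This is why tokenize_source catches TokenError instead of letting it propagate to callers that only want a best-effort token list.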
For the ast alternative: ast.parse() produces a high-level structural tree with expression nodes, statement classes, and scope information, while tokenize operates at the character level and yields the raw token stream, including comment positions, whitespace layout, and exact source coordinates. Use ast for structural analysis (finding function definitions, analyzing call graphs, transforming expressions); use tokenize when you need to preserve exact source formatting (comments, blank lines, indentation), extract literals without evaluating them, or perform rename/replace operations that must not disturb surrounding whitespace.

For the pygments alternative: pygments (PyPI) provides lexers for 500+ languages and renders syntax highlighting to HTML, ANSI, LaTeX, and more, whereas tokenize handles only Python and outputs token tuples rather than rendered markup. Use pygments for documentation generation, code display in web apps, and terminal colorization; use tokenize for Python-specific programmatic analysis where you need exact CPython token types, positions, and encoding detection.

The Claude Skills 360 bundle includes tokenize skill sets covering the tokenize_source()/tokenize_file()/type_name()/tokens_of_type() core helpers, extract_comments()/extract_strings() content extraction, extract_identifiers()/unique_names()/name_frequency() identifier analysis, rename_identifier()/strip_comments()/find_todo_comments() source transformation, and SourceStats with code/comment/blank line counts and a token frequency breakdown. Start with the free tier to try Python lexical analysis and tokenize pipeline code generation.
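The ast-versus-tokenize trade-off in one stdlib-only contrast: the same one-line source parsed both ways. ast keeps structure but drops the comment, while tokenize surfaces the comment with its exact position.

```python
import ast
import io
import token
import tokenize

src = "x = 1  # keep me\n"

# ast sees an Assign statement; the comment is gone from the tree entirely.
tree = ast.parse(src)
print(type(tree.body[0]).__name__)  # Assign

# tokenize yields the comment as its own token with a (row, col) position.
comments = [(t.string, t.start)
            for t in tokenize.generate_tokens(io.StringIO(src).readline)
            if t.type == token.COMMENT]
print(comments)  # [('# keep me', (1, 7))]
```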