
Claude Code for bleach: Python HTML Sanitization

Published: February 12, 2028 | Read time: 5 min | By: Claude Skills 360

bleach sanitizes HTML against an allowlist and linkifies URLs. Install: pip install bleach. Clean: import bleach; bleach.clean(html). Allowlist tags: bleach.clean(html, tags=["p","strong","em","a","ul","li"]). Attributes: bleach.clean(html, attributes={"a":["href","title"],"img":["src","alt"]}). Callable attributes: attributes=lambda tag,name,val: name in ("href","class"). Strip vs escape: strip=True removes disallowed tags but keeps their text content; strip=False (the default) HTML-escapes them. Comments: strip_comments=True (the default) removes HTML comments.

Linkify: bleach.linkify(text) finds bare URLs and wraps them in <a>. Linkify callback: bleach.linkify(text, callbacks=[set_target]), where set_target(attrs, new=False) sets attrs[(None,"target")]="_blank" and returns attrs; attribute keys are (namespace, name) tuples, and the namespace is None for ordinary HTML attributes. Skip pre: bleach.linkify(text, skip_tags=["pre","code"]).

Cleaner: from bleach import Cleaner; c = Cleaner(tags=[...], attributes={...}); c.clean(html). LinkifyFilter: from bleach.linkifier import LinkifyFilter; c = Cleaner(..., filters=[LinkifyFilter]). Markdown pipeline: raw_html = markdown.markdown(user_input); safe = bleach.clean(raw_html, tags=ALLOWED_TAGS, attributes=ALLOWED_ATTRS). Plain text: bleach.clean(html, tags=[], strip=True, strip_comments=True) strips all tags. Claude Code generates bleach sanitizer configs, markdown+bleach pipelines, and Jinja2 safe filters.
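The strip-versus-escape distinction and the callable attribute filter are easiest to see side by side. The snippet below is a minimal sketch that assumes bleach is installed; the sample strings and the helper name allow_href_and_title are illustrative, not part of the library.

```python
import bleach

html = "<b><span>is not allowed</span></b>"

# Default (strip=False): the disallowed <span> is HTML-escaped and shown as text.
escaped = bleach.clean(html, tags={"b"})

# strip=True: the disallowed <span> tag is dropped, but its text content stays.
stripped = bleach.clean(html, tags={"b"}, strip=True)

print(escaped)   # <b>&lt;span&gt;is not allowed&lt;/span&gt;</b>
print(stripped)  # <b>is not allowed</b>


def allow_href_and_title(tag, name, value):
    """Hypothetical attribute filter: keep only href and title on any allowed tag."""
    return name in ("href", "title")


link = bleach.clean(
    '<a href="http://example.com" onclick="steal()">link</a>',
    tags={"a"},
    attributes=allow_href_and_title,
)
print(link)  # the onclick attribute is removed, href is kept
```

Note that strip=True does not delete the content of a disallowed element, so the text inside a stripped script tag still appears in the output as inert text.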

CLAUDE.md for bleach

## bleach Stack
- Version: bleach >= 6.1 | pip install bleach
- Clean: bleach.clean(html, tags=ALLOWED_TAGS, attributes=ALLOWED_ATTRS, strip=True)
- Linkify: bleach.linkify(text, callbacks=[nofollow_callback], skip_tags=["pre","code"])
- Reusable: Cleaner(tags=..., attributes=..., filters=[LinkifyFilter]) instance
- Pipeline: markdown.markdown(src) → bleach.clean(html, tags=MD_TAGS, strip=True)
- Plain text: bleach.clean(html, tags=[], strip=True) — removes all HTML
- Security: always sanitize user-provided HTML before rendering; never trust raw input
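The Linkify entry in the stack above can be sketched as follows. This is an illustration under the assumption that bleach is installed; set_target and the sample text are made-up names for the example.

```python
import bleach


def set_target(attrs, new=False):
    """Hypothetical callback: open every link in a new tab.

    Keys are (namespace, name) tuples; the namespace is None for plain HTML attributes.
    """
    attrs[(None, "target")] = "_blank"
    return attrs


text = "Docs at http://example.com <pre>curl http://example.org</pre>"

# Passing callbacks replaces bleach's default callback list,
# and skip_tags leaves URLs inside <pre> untouched.
linked = bleach.linkify(text, callbacks=[set_target], skip_tags={"pre"})
print(linked)
```

Because the callbacks argument replaces the defaults rather than extending them, a callback such as bleach.callbacks.nofollow has to be listed explicitly if you still want rel="nofollow" on generated links.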

bleach Sanitization Pipeline

# app/sanitize.py — bleach HTML sanitization, linkification, and markdown pipeline
from __future__ import annotations

import re
from typing import Callable

import bleach
from bleach import Cleaner
from bleach.linkifier import LinkifyFilter


# ─────────────────────────────────────────────────────────────────────────────
# 1. Tag and attribute allowlists
# ─────────────────────────────────────────────────────────────────────────────

# Basic prose — suitable for blog comments, forum posts
BASIC_TAGS = [
    "p", "br", "strong", "b", "em", "i", "u", "s",
    "ul", "ol", "li",
    "blockquote", "pre", "code",
    "h1", "h2", "h3", "h4", "h5", "h6",
    "a", "hr",
]

BASIC_ATTRS: dict[str, list[str]] = {
    "a":   ["href", "title", "rel"],
    "code": ["class"],   # allow lang class for syntax highlighting
}

# Rich content — suitable for trusted CMS editors
RICH_TAGS = BASIC_TAGS + [
    "img", "figure", "figcaption",
    "table", "thead", "tbody", "tfoot", "tr", "th", "td",
    "caption", "colgroup", "col",
    "details", "summary",
    "abbr", "cite", "del", "ins", "mark", "sub", "sup",
    "div", "span",
]

RICH_ATTRS: dict[str, list[str]] = {
    **BASIC_ATTRS,
    "img":   ["src", "alt", "width", "height", "loading"],
    "th":    ["scope", "colspan", "rowspan"],
    "td":    ["colspan", "rowspan"],
    "div":   ["class", "id"],
    "span":  ["class"],
    "abbr":  ["title"],
    "*":     ["class"],   # dict form only: allows class on any tag; clean_rich() uses _allow_class instead
}


def _allow_class(tag: str, name: str, value: str) -> bool:
    """
    Callable attribute filter — allow `class` only for known safe tag types.
    Use instead of "*": ["class"] in the dict form to avoid overbroad permission.
    """
    CLASSABLE = {"div", "span", "code", "pre", "table", "td", "th", "p"}
    if name == "class" and tag in CLASSABLE:
        # Optionally restrict to known CSS class patterns
        # fullmatch rejects values with a trailing newline, which "$" would accept
        return bool(re.fullmatch(r"[a-zA-Z0-9_\- ]+", value))
    return name in RICH_ATTRS.get(tag, [])


# ─────────────────────────────────────────────────────────────────────────────
# 2. Linkify callbacks
# ─────────────────────────────────────────────────────────────────────────────

def _nofollow_callback(
    attrs: dict[tuple[str | None, str], str],
    new: bool = False,
) -> dict[tuple[str | None, str], str]:
    """
    Add rel="nofollow noopener noreferrer" to all auto-linked and existing hrefs.
    Called by bleach.linkify for every <a> tag found.
    """
    attrs[(None, "rel")] = "nofollow noopener noreferrer"
    return attrs


def _target_blank_callback(
    attrs: dict[tuple[str | None, str], str],
    new: bool = False,
) -> dict[tuple[str | None, str], str]:
    """Open every link in a new tab."""
    attrs[(None, "target")] = "_blank"
    return attrs


def _external_only_callback(
    attrs: dict[tuple[str | None, str], str],
    new: bool = False,
) -> dict[tuple[str | None, str], str]:
    """Decorate only absolute http(s) links; relative and other links are returned unchanged."""
    href = attrs.get((None, "href"), "")
    if not href.startswith(("http://", "https://")):
        return attrs
    attrs[(None, "rel")] = "nofollow noopener noreferrer"
    attrs[(None, "target")] = "_blank"
    return attrs


# ─────────────────────────────────────────────────────────────────────────────
# 3. Simple clean helpers
# ─────────────────────────────────────────────────────────────────────────────

def clean_basic(html: str) -> str:
    """
    Strip everything except basic prose tags.
    strip=True drops disallowed tags instead of escaping them; the text inside
    a dropped tag (including a <script> body) is kept as inert text.
    """
    return bleach.clean(
        html,
        tags=BASIC_TAGS,
        attributes=BASIC_ATTRS,
        strip=True,
        strip_comments=True,
    )


def clean_rich(html: str) -> str:
    """Allow richer tag set with callable attribute filter."""
    return bleach.clean(
        html,
        tags=RICH_TAGS,
        attributes=_allow_class,
        strip=True,
        strip_comments=True,
    )


def strip_all_tags(html: str) -> str:
    """Remove all HTML tags, keeping only text content (still HTML-escaped, e.g. & becomes &amp;)."""
    return bleach.clean(html, tags=[], strip=True, strip_comments=True)


# ─────────────────────────────────────────────────────────────────────────────
# 4. Reusable Cleaner instances
# ─────────────────────────────────────────────────────────────────────────────

# Basic cleaner — no linkification
basic_cleaner = Cleaner(
    tags=BASIC_TAGS,
    attributes=BASIC_ATTRS,
    strip=True,
    strip_comments=True,
)

# Rich cleaner with automatic URL → link conversion
rich_cleaner_with_links = Cleaner(
    tags=RICH_TAGS,
    attributes=_allow_class,
    strip=True,
    strip_comments=True,
    filters=[LinkifyFilter],
)


def make_cleaner(
    tags: list[str] | None = None,
    attributes: dict | Callable | None = None,
    linkify: bool = False,
) -> Cleaner:
    """Factory for per-context Cleaner instances."""
    filters = [LinkifyFilter] if linkify else []
    return Cleaner(
        tags=tags or BASIC_TAGS,
        attributes=attributes or BASIC_ATTRS,
        strip=True,
        strip_comments=True,
        filters=filters,
    )


# ─────────────────────────────────────────────────────────────────────────────
# 5. Linkify text
# ─────────────────────────────────────────────────────────────────────────────

def linkify(text: str, open_new_tab: bool = True) -> str:
    """
    Wrap bare URLs in the text with <a> tags.
    skip_tags=["pre","code"] avoids linkifying URLs inside code blocks.
    """
    callbacks = [_nofollow_callback]
    if open_new_tab:
        callbacks.append(_target_blank_callback)
    return bleach.linkify(
        text,
        callbacks=callbacks,
        skip_tags=["pre", "code"],
    )


# ─────────────────────────────────────────────────────────────────────────────
# 6. Markdown → bleach pipeline (the recommended safe rendering pattern)
# ─────────────────────────────────────────────────────────────────────────────

# Tags that the `markdown` library produces — only these are allowed through
MARKDOWN_ALLOWED_TAGS = [
    "p", "br",
    "strong", "em", "del",
    "h1", "h2", "h3", "h4", "h5", "h6",
    "ul", "ol", "li",
    "blockquote",
    "pre", "code",
    "a", "hr",
    "table", "thead", "tbody", "tr", "th", "td",
    "img",
    "sup",   # footnotes
    "div",   # toc wrapper
]

MARKDOWN_ALLOWED_ATTRS: dict[str, list[str]] = {
    "a":   ["href", "title", "rel", "id"],
    "img": ["src", "alt", "title", "width", "height"],
    "code": ["class"],
    "div": ["class", "id"],
    "th":  ["align"],
    "td":  ["align"],
    "h1":  ["id"], "h2": ["id"], "h3": ["id"],   # TOC anchors
    "h4":  ["id"], "h5": ["id"], "h6": ["id"],
}


def render_markdown_safe(user_markdown: str) -> str:
    """
    Convert user-supplied Markdown to sanitized HTML.

    Step 1 — markdown.markdown() converts the Markdown syntax to raw HTML.
             This HTML may contain arbitrary tags if the user embedded raw HTML.
    Step 2 — bleach.clean() strips everything not in the allowlist, preventing XSS.

    Never render markdown.markdown() output directly without sanitization.
    """
    try:
        import markdown as _md
        raw_html = _md.markdown(
            user_markdown,
            extensions=["tables", "fenced_code", "nl2br", "sane_lists"],
        )
    except ImportError:
        # Fallback: treat as plain text
        raw_html = bleach.clean(user_markdown, tags=[], strip=True)

    return bleach.clean(
        raw_html,
        tags=MARKDOWN_ALLOWED_TAGS,
        attributes=MARKDOWN_ALLOWED_ATTRS,
        strip=True,
        strip_comments=True,
    )


# ─────────────────────────────────────────────────────────────────────────────
# 7. Jinja2 filter registration
# ─────────────────────────────────────────────────────────────────────────────

def register_bleach_filters(env) -> None:
    """
    Register sanitization filters for Jinja2 templates.

    Usage:
      {{ user.bio | clean_html | safe }}
      {{ comment.body | mdrender_safe | safe }}
      {{ post.text | linkify | safe }}
    """
    env.filters["clean_html"]    = clean_basic
    env.filters["clean_rich"]    = clean_rich
    env.filters["strip_tags"]    = strip_all_tags
    env.filters["linkify"]       = linkify
    env.filters["mdrender_safe"] = render_markdown_safe


# ─────────────────────────────────────────────────────────────────────────────
# Demo
# ─────────────────────────────────────────────────────────────────────────────

if __name__ == "__main__":
    # XSS attempt
    evil = (
        '<p>Hello <script>alert("xss")</script> world</p>'
        '<img src=x onerror=alert(1)>'
        '<a href="javascript:void(0)">click</a>'
        '<div onclick="steal()">text</div>'
    )

    print("=== Input ===")
    print(evil)

    print("\n=== clean_basic (strip=True) ===")
    print(clean_basic(evil))

    print("\n=== strip_all_tags ===")
    print(strip_all_tags(evil))

    text_with_urls = (
        "Check out https://python.org and also http://example.com/path?q=1 for more."
    )
    print("\n=== linkify ===")
    print(linkify(text_with_urls))

    md_input = (
        "## Hello\n\n"
        "This is **bold** and _italic_.\n\n"
        '<script>alert("xss in markdown")</script>\n\n'
        "| A | B |\n"
        "|---|---|\n"
        "| 1 | 2 |\n"
    )

    print("\n=== render_markdown_safe ===")
    print(render_markdown_safe(md_input))
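One caveat about the reusable cleaners in the listing: passing LinkifyFilter bare means it runs with linkify's default callbacks and no skip_tags. A possible way to configure it, following the functools.partial pattern from the bleach documentation, is sketched below. It assumes bleach is installed; add_nofollow and the sample input are illustrative names.

```python
from functools import partial

from bleach import Cleaner
from bleach.linkifier import LinkifyFilter


def add_nofollow(attrs, new=False):
    """Hypothetical callback mirroring _nofollow_callback from the listing."""
    attrs[(None, "rel")] = "nofollow noopener noreferrer"
    return attrs


cleaner = Cleaner(
    tags={"p", "pre", "a"},
    attributes={"a": ["href", "rel"]},
    strip=True,
    # partial() pre-binds the filter options; Cleaner supplies the token source.
    filters=[partial(LinkifyFilter, callbacks=[add_nofollow], skip_tags={"pre"})],
)

result = cleaner.clean("<p>See http://example.com</p><pre>http://example.org</pre>")
print(result)
```

The filter runs after sanitization, so the URL in the paragraph is linked with the custom rel value while the URL inside the pre block is left as plain text.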

Compared with html.escape(): html.escape(user_input) turns every < into &lt; and every > into &gt; (and escapes & and quotes as well), which is correct for inserting plain text into HTML but rules out all formatting. If you want users to write **bold** and have it render as <strong>bold</strong>, you must parse Markdown first and then run an allowlist-based sanitizer such as bleach over the result, because html.escape() applied after markdown.markdown() would escape the intended tags along with the malicious ones.

Compared with lxml.html.clean.Cleaner: it is a more powerful C-backed HTML cleaner that also copes with malformed HTML and supports remove_tags, allow_tags, safe_attrs_only, and inline CSS removal, but it requires the lxml C extension and is heavier to install (since lxml 5.2 the cleaner is distributed separately as the lxml_html_clean package). bleach is a pure-Python library built on html5lib, which handles tag soup and broken HTML reliably, and bleach's Cleaner + LinkifyFilter pattern covers the common case of sanitizing Markdown output while also auto-linking bare URLs in a single pass.

The Claude Skills 360 bundle includes bleach skill sets covering bleach.clean with tags/attributes/strip/strip_comments, a callable attribute filter with the tag+name+value signature, the _nofollow_callback and _target_blank_callback linkify callbacks, bleach.linkify with skip_tags for pre/code blocks, the Cleaner class for reusable instances, LinkifyFilter in Cleaner.filters, the render_markdown_safe pipeline (markdown → bleach), the MARKDOWN_ALLOWED_TAGS/ATTRS allowlist for post-markdown sanitization, strip_all_tags for plain text extraction, and Jinja2 filter registration for clean_html/mdrender_safe/linkify. Start with the free tier to try HTML sanitization pipeline code generation.
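The html.escape() comparison above can be checked with the standard library alone. In this small sketch the rendered string stands in for what a Markdown renderer might produce; it is an assumed sample, not captured output.

```python
import html

# Escaping raw user input is safe for plain text: the script tag becomes inert text.
user_input = '<script>alert("xss")</script>'
escaped_script = html.escape(user_input)
print(escaped_script)  # &lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;

# Assumed output of a Markdown renderer for "This is **bold**."
rendered = "<p>This is <strong>bold</strong>.</p>"

# Escaping after rendering neutralises the intended formatting as well.
escaped_markup = html.escape(rendered)
print(escaped_markup)  # &lt;p&gt;This is &lt;strong&gt;bold&lt;/strong&gt;.&lt;/p&gt;
```

This is why the pipeline sanitizes the rendered HTML against an allowlist instead of escaping it wholesale.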
