bleach sanitizes HTML against an allowlist and linkifies bare URLs. Install it with pip install bleach. The core calls:
- Clean: import bleach; bleach.clean(html)
- Allowlist tags: bleach.clean(html, tags=["p","strong","em","a","ul","li"])
- Attributes: bleach.clean(html, attributes={"a":["href","title"],"img":["src","alt"]})
- Strip vs escape: strip=True removes disallowed tags but keeps their text; strip=False (the default) HTML-escapes them
- Comments: strip_comments=True (the default) removes HTML comments
- Callable attributes: attributes=lambda tag, name, val: name in ("href","class")
- Linkify: bleach.linkify(text) finds bare URLs and wraps them in <a>
- Linkify callback: bleach.linkify(text, callbacks=[set_target]), where set_target(attrs, new=False) sets attrs[(None,"target")]="_blank" and returns attrs
- Skip pre: bleach.linkify(text, skip_tags=["pre","code"])
- Cleaner: from bleach import Cleaner; c = Cleaner(tags=[...], attributes={...}); c.clean(html)
- LinkifyFilter: from bleach.linkifier import LinkifyFilter; c = Cleaner(..., filters=[LinkifyFilter])
- Markdown pipeline: raw_html = markdown.markdown(user_input); safe = bleach.clean(raw_html, tags=ALLOWED_TAGS, attributes=ALLOWED_ATTRS)
- Plain text: bleach.clean(html, tags=[], strip=True, strip_comments=True) strips all tags

Claude Code generates bleach sanitizer configs, markdown+bleach pipelines, and Jinja2 safe filters.
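The callable form of attributes is easiest to understand as a plain predicate: bleach calls it with the tag name, attribute name, and attribute value, and keeps the attribute only when it returns True. The following is a minimal sketch under assumptions: the names allow_attr and clean_comment, the three-tag allowlist, and the "language-" class prefix are hypothetical choices for illustration, not part of bleach.

```python
def allow_attr(tag: str, name: str, value: str) -> bool:
    """Return True to keep an attribute, False to drop it (illustrative policy)."""
    if tag == "a":
        # Links may carry only href and title.
        return name in {"href", "title"}
    if tag == "code" and name == "class":
        # Assumption: syntax-highlighting classes look like "language-python".
        return value.startswith("language-")
    return False


def clean_comment(html: str) -> str:
    """Sanitize a comment with the predicate above."""
    # bleach is third-party; it is imported here so allow_attr can be
    # exercised on its own without bleach installed.
    import bleach

    return bleach.clean(
        html,
        tags={"a", "p", "code"},
        attributes=allow_attr,
        strip=True,
    )
```

Because the predicate is an ordinary function, the attribute policy can be unit-tested without parsing any HTML.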
CLAUDE.md for bleach
## bleach Stack
- Version: bleach >= 6.1 | pip install bleach
- Clean: bleach.clean(html, tags=ALLOWED_TAGS, attributes=ALLOWED_ATTRS, strip=True)
- Linkify: bleach.linkify(text, callbacks=[nofollow_callback], skip_tags=["pre","code"])
- Reusable: Cleaner(tags=..., attributes=..., filters=[LinkifyFilter]) instance
- Pipeline: markdown.markdown(src) → bleach.clean(html, tags=MD_TAGS, strip=True)
- Plain text: bleach.clean(html, tags=[], strip=True) — removes all HTML
- Security: always sanitize user-provided HTML before rendering; never trust raw input
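The linkify callbacks referenced in the list above follow a small contract: each callback receives the link's attributes as a dict keyed by (namespace, name) tuples plus a new flag that is True for links bleach created itself, and it returns the dict, or None to drop the link. A rough sketch of that contract follows; the function names and the intranet.example host are hypothetical and exist only for illustration.

```python
def nofollow_new_links(attrs, new=False):
    """Add rel="nofollow" only to links that linkify created itself."""
    if new:
        attrs[(None, "rel")] = "nofollow"
    return attrs


def drop_internal_hosts(attrs, new=False):
    """Refuse to auto-link URLs on a (hypothetical) internal host."""
    href = attrs.get((None, "href"), "")
    if new and "intranet.example" in href:
        return None  # returning None tells bleach not to create the link
    return attrs


def linkify_comment(text: str) -> str:
    """Run both callbacks in order; bleach stops at the first None."""
    # bleach is third-party; imported here so the callbacks above can be
    # tested as plain functions.
    import bleach

    return bleach.linkify(
        text,
        callbacks=[drop_internal_hosts, nofollow_new_links],
        skip_tags={"pre", "code"},
    )
```

Ordering matters in this sketch: the filtering callback runs first so that later callbacks never see a link that is about to be dropped.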
bleach Sanitization Pipeline
# app/sanitize.py — bleach HTML sanitization, linkification, and markdown pipeline
from __future__ import annotations
import re
from typing import Any, Callable
import bleach
from bleach import Cleaner
from bleach.linkifier import LinkifyFilter
# ─────────────────────────────────────────────────────────────────────────────
# 1. Tag and attribute allowlists
# ─────────────────────────────────────────────────────────────────────────────
# Basic prose — suitable for blog comments, forum posts
BASIC_TAGS = [
    "p", "br", "strong", "b", "em", "i", "u", "s",
    "ul", "ol", "li",
    "blockquote", "pre", "code",
    "h1", "h2", "h3", "h4", "h5", "h6",
    "a", "hr",
]
BASIC_ATTRS: dict[str, list[str]] = {
    "a": ["href", "title", "rel"],
    "code": ["class"],  # allow lang class for syntax highlighting
}
# Rich content — suitable for trusted CMS editors
RICH_TAGS = BASIC_TAGS + [
    "img", "figure", "figcaption",
    "table", "thead", "tbody", "tfoot", "tr", "th", "td",
    "caption", "colgroup", "col",
    "details", "summary",
    "abbr", "cite", "del", "ins", "mark", "sub", "sup",
    "div", "span",
]
RICH_ATTRS: dict[str, list[str]] = {
    **BASIC_ATTRS,
    "img": ["src", "alt", "width", "height", "loading"],
    "th": ["scope", "colspan", "rowspan"],
    "td": ["colspan", "rowspan"],
    "div": ["class", "id"],
    "span": ["class"],
    "abbr": ["title"],
    # Wildcard: applies only if this dict is passed to bleach directly.
    # The callable filter _allow_class looks up per-tag entries and ignores it.
    "*": ["class"],
}
def _allow_class(tag: str, name: str, value: str) -> bool:
    """
    Callable attribute filter: allow `class` only for known safe tag types.
    Use instead of "*": ["class"] in the dict form to avoid overbroad permission.
    """
    CLASSABLE = {"div", "span", "code", "pre", "table", "td", "th", "p"}
    if name == "class" and tag in CLASSABLE:
        # Optionally restrict to known CSS class patterns
        return bool(re.match(r"^[a-zA-Z0-9_\- ]+$", value))
    return name in RICH_ATTRS.get(tag, [])
# ─────────────────────────────────────────────────────────────────────────────
# 2. Linkify callbacks
# ─────────────────────────────────────────────────────────────────────────────
def _nofollow_callback(
    attrs: dict[tuple[str | None, str], str],
    new: bool = False,
) -> dict[tuple[str | None, str], str]:
    """
    Add rel="nofollow noopener noreferrer" to all auto-linked and existing hrefs.
    Called by bleach.linkify for every <a> tag found.
    """
    attrs[(None, "rel")] = "nofollow noopener noreferrer"
    return attrs


def _target_blank_callback(
    attrs: dict[tuple[str | None, str], str],
    new: bool = False,
) -> dict[tuple[str | None, str], str]:
    """Open every link in a new tab."""
    attrs[(None, "target")] = "_blank"
    return attrs


def _external_only_callback(
    attrs: dict[tuple[str | None, str], str],
    new: bool = False,
) -> dict[tuple[str | None, str], str]:
    """Only process links that are external (start with http/https)."""
    href = attrs.get((None, "href"), "")
    if not href.startswith(("http://", "https://")):
        return attrs
    attrs[(None, "rel")] = "nofollow noopener noreferrer"
    attrs[(None, "target")] = "_blank"
    return attrs
# ─────────────────────────────────────────────────────────────────────────────
# 3. Simple clean helpers
# ─────────────────────────────────────────────────────────────────────────────
def clean_basic(html: str) -> str:
    """
    Keep only basic prose tags.
    strip=True removes disallowed tags instead of escaping them;
    the text inside a removed tag is kept.
    """
    return bleach.clean(
        html,
        tags=BASIC_TAGS,
        attributes=BASIC_ATTRS,
        strip=True,
        strip_comments=True,
    )


def clean_rich(html: str) -> str:
    """Allow richer tag set with callable attribute filter."""
    return bleach.clean(
        html,
        tags=RICH_TAGS,
        attributes=_allow_class,
        strip=True,
        strip_comments=True,
    )


def strip_all_tags(html: str) -> str:
    """
    Remove all HTML tags, leaving only text content.
    Text inside removed tags (including <script> bodies) is kept.
    """
    return bleach.clean(html, tags=[], strip=True, strip_comments=True)
# ─────────────────────────────────────────────────────────────────────────────
# 4. Reusable Cleaner instances
# ─────────────────────────────────────────────────────────────────────────────
# Basic cleaner — no linkification
basic_cleaner = Cleaner(
    tags=BASIC_TAGS,
    attributes=BASIC_ATTRS,
    strip=True,
    strip_comments=True,
)

# Rich cleaner with automatic URL → link conversion
rich_cleaner_with_links = Cleaner(
    tags=RICH_TAGS,
    attributes=_allow_class,
    strip=True,
    strip_comments=True,
    filters=[LinkifyFilter],
)


def make_cleaner(
    tags: list[str] | None = None,
    attributes: dict | Callable | None = None,
    linkify: bool = False,
) -> Cleaner:
    """Factory for per-context Cleaner instances."""
    filters = [LinkifyFilter] if linkify else []
    return Cleaner(
        tags=tags or BASIC_TAGS,
        attributes=attributes or BASIC_ATTRS,
        strip=True,
        strip_comments=True,
        filters=filters,
    )
# ─────────────────────────────────────────────────────────────────────────────
# 5. Linkify text
# ─────────────────────────────────────────────────────────────────────────────
def linkify(text: str, open_new_tab: bool = True) -> str:
    """
    Wrap bare URLs in the text with <a> tags.
    skip_tags=["pre","code"] avoids linkifying URLs inside code blocks.
    """
    callbacks = [_nofollow_callback]
    if open_new_tab:
        callbacks.append(_target_blank_callback)
    return bleach.linkify(
        text,
        callbacks=callbacks,
        skip_tags=["pre", "code"],
    )
# ─────────────────────────────────────────────────────────────────────────────
# 6. Markdown → bleach pipeline (the recommended safe rendering pattern)
# ─────────────────────────────────────────────────────────────────────────────
# Tags that the `markdown` library produces — only these are allowed through
MARKDOWN_ALLOWED_TAGS = [
    "p", "br",
    "strong", "em", "del",
    "h1", "h2", "h3", "h4", "h5", "h6",
    "ul", "ol", "li",
    "blockquote",
    "pre", "code",
    "a", "hr",
    "table", "thead", "tbody", "tr", "th", "td",
    "img",
    "sup",  # footnotes
    "div",  # toc wrapper
]
MARKDOWN_ALLOWED_ATTRS: dict[str, list[str]] = {
    "a": ["href", "title", "rel", "id"],
    "img": ["src", "alt", "title", "width", "height"],
    "code": ["class"],
    "div": ["class", "id"],
    "th": ["align"],
    "td": ["align"],
    "h1": ["id"], "h2": ["id"], "h3": ["id"],  # TOC anchors
    "h4": ["id"], "h5": ["id"], "h6": ["id"],
}
def render_markdown_safe(user_markdown: str) -> str:
    """
    Convert user-supplied Markdown to sanitized HTML.

    Step 1: markdown.markdown() converts the Markdown syntax to raw HTML.
    This HTML may contain arbitrary tags if the user embedded raw HTML.
    Step 2: bleach.clean() strips everything not in the allowlist, preventing XSS.

    Never render markdown.markdown() output directly without sanitization.
    """
    try:
        import markdown as _md
        raw_html = _md.markdown(
            user_markdown,
            extensions=["tables", "fenced_code", "nl2br", "sane_lists"],
        )
    except ImportError:
        # Fallback: treat as plain text
        raw_html = bleach.clean(user_markdown, tags=[], strip=True)
    return bleach.clean(
        raw_html,
        tags=MARKDOWN_ALLOWED_TAGS,
        attributes=MARKDOWN_ALLOWED_ATTRS,
        strip=True,
        strip_comments=True,
    )
# ─────────────────────────────────────────────────────────────────────────────
# 7. Jinja2 filter registration
# ─────────────────────────────────────────────────────────────────────────────
def register_bleach_filters(env) -> None:
    """
    Register sanitization filters for Jinja2 templates.

    Usage:
        {{ user.bio | clean_html | safe }}
        {{ comment.body | mdrender_safe | safe }}
        {{ post.text | linkify | safe }}
    """
    env.filters["clean_html"] = clean_basic
    env.filters["clean_rich"] = clean_rich
    env.filters["strip_tags"] = strip_all_tags
    env.filters["linkify"] = linkify
    env.filters["mdrender_safe"] = render_markdown_safe
# ─────────────────────────────────────────────────────────────────────────────
# Demo
# ─────────────────────────────────────────────────────────────────────────────
if __name__ == "__main__":
    import textwrap  # only needed by the demo

    # XSS attempt
    evil = (
        '<p>Hello <script>alert("xss")</script> world</p>'
        '<img src=x onerror=alert(1)>'
        '<a href="javascript:void(0)">click</a>'
        '<div onclick="steal()">text</div>'
    )
    print("=== Input ===")
    print(evil)
    print("\n=== clean_basic (strip=True) ===")
    print(clean_basic(evil))
    print("\n=== strip_all_tags ===")
    print(strip_all_tags(evil))

    text_with_urls = (
        "Check out https://python.org and also http://example.com/path?q=1 for more."
    )
    print("\n=== linkify ===")
    print(linkify(text_with_urls))

    # Markdown sample with an embedded <script> that must not survive rendering
    md_input = textwrap.dedent("""\
        ## Hello

        This is **bold** and _italic_.

        <script>alert("xss in markdown")</script>

        | A | B |
        |---|---|
        | 1 | 2 |
    """)
    print("\n=== render_markdown_safe ===")
    print(render_markdown_safe(md_input))
For the html.escape() alternative: html.escape(user_input) turns every < into &lt; and every > into &gt;, which is correct for inserting plain text into HTML but rules out all formatting. If you want users to write **bold** and have it render as <strong>bold</strong>, you must parse Markdown first, and html.escape() cannot be applied afterwards because it would escape the tags that markdown.markdown() just generated. An allowlist-based sanitizer like bleach fills that gap: it keeps the intended tags and strips or escapes everything else.

For the lxml.html.clean.Cleaner alternative: lxml.html.clean.Cleaner is a more powerful C-backed HTML cleaner that handles malformed HTML well and supports remove_tags, allow_tags, safe_attrs_only, and inline CSS removal, but it requires the lxml C extension and is heavier to install. bleach is a pure-Python wrapper around html5lib, which handles tag soup and broken HTML reliably, and bleach's Cleaner + LinkifyFilter pattern covers the common case of sanitizing Markdown output while also auto-linking bare URLs in a single pass.

The Claude Skills 360 bundle includes bleach skill sets covering bleach.clean with tags/attributes/strip/strip_comments, the callable attribute filter with its tag+name+value signature, the _nofollow_callback and _target_blank_callback linkify callbacks, bleach.linkify with skip_tags for pre/code blocks, the Cleaner class for reusable instances, LinkifyFilter in Cleaner.filters, the render_markdown_safe pipeline (markdown → bleach), the MARKDOWN_ALLOWED_TAGS/ATTRS allowlist for post-markdown sanitization, strip_all_tags for plain text extraction, and Jinja2 filter registration for clean_html/mdrender_safe/linkify. Start with the free tier to try HTML sanitization pipeline code generation.
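The html.escape() trade-off described above can be seen with the standard library alone. This short sketch uses only html.escape; the helper name render_plain_text and the sample strings are made up for illustration.

```python
import html


def render_plain_text(user_input: str) -> str:
    """Escape user input so it is safe to embed as plain text in HTML."""
    return html.escape(user_input)


# Raw user input: every tag becomes inert text, including harmless ones.
escaped_input = render_plain_text("<b>bold</b> & <script>alert(1)</script>")

# HTML that a Markdown renderer already produced: escaping it afterwards
# turns the intended <strong> tag into visible text as well.
escaped_rendered = render_plain_text("<p><strong>bold</strong></p>")

print(escaped_input)
print(escaped_rendered)
```

The second call is the case that motivates an allowlist sanitizer: escaping cannot tell the tags Markdown generated from the ones an attacker typed.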