I am using the bleach package to strip invalid HTML from a string. I am puzzled that the dir attribute is being stripped as well. Is dir not a valid attribute, or does the package simply not support it?
I have included the entire script so you can run it yourself.
import bleach

string = """<p dir="rtl">asdasdasd <span>asdasdasd</span> asdsadasdsad .<br data-mce-bogus="1"></p>"""

def strip_invalid_html(html):
    """ strips invalid tags/attributes """
    allowed_tags = [
        'p', 'a', 'blockquote',
        'h1', 'h2', 'h3', 'h4', 'h5',
        'strong', 'em',
        'br',
        'span',
    ]
    allowed_attributes = {
        'a': ['href', 'title'],
        'dir': ['rtl', 'ltr']
    }
    cleaned_html = bleach.clean(
        html,
        attributes=allowed_attributes,
        strip=True,
        tags=allowed_tags
    )
    print(cleaned_html)

strip_invalid_html(string)
Answer
If you pass a dict for attributes, the dict should map tag names to allowed attribute names, not map attribute names to allowed attribute values.
If you want 'dir' to be an allowed attribute for p tags, you need a 'p': ['dir'] entry, not a 'dir': ['rtl', 'ltr'] entry:
allowed_attributes = {
    'a': ['href', 'title'],
    'p': ['dir'],
}
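If the original intent was also to restrict dir to the values rtl and ltr, a plain list of attribute names cannot express that. As far as I know, bleach also accepts a callable as a dict value, called with (tag, name, value) and returning True to keep the attribute. Here is a minimal sketch along those lines; the names allow_dir, ALLOWED_DIRECTIONS and clean_html are made up for this example and are not part of bleach:

```python
import bleach

# Hypothetical helper names for illustration; only bleach.clean is library API.
ALLOWED_DIRECTIONS = {"rtl", "ltr"}


def allow_dir(tag, name, value):
    """Keep only a dir attribute whose value is rtl or ltr."""
    return name == "dir" and value.lower() in ALLOWED_DIRECTIONS


allowed_tags = [
    "p", "a", "blockquote",
    "h1", "h2", "h3", "h4", "h5",
    "strong", "em",
    "br",
    "span",
]

# Keys are tag names; values are either a list of allowed attribute
# names or a callable that decides per attribute.
allowed_attributes = {
    "a": ["href", "title"],
    "p": allow_dir,
}


def clean_html(html):
    """Strip disallowed tags and attributes from html."""
    return bleach.clean(
        html,
        tags=allowed_tags,
        attributes=allowed_attributes,
        strip=True,
    )


good = clean_html('<p dir="rtl">text <br data-mce-bogus="1"></p>')
bad = clean_html('<p dir="sideways">text</p>')
print(good)
print(bad)
```

With this setup the dir="rtl" attribute should survive on p tags, while an unexpected value such as dir="sideways" and the data-mce-bogus attribute should be dropped. If dir should be allowed on every tag rather than only p, the dict form also accepts a '*' key that applies to all tags.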