Extracting the URL from style="background-image: url(...)" with BeautifulSoup and without regex?

I have:

<div class="image" style="background-image: url('/uploads/images/players/16113-1399107741.jpeg');"

I want to get the URL, but I don't know how to do that without regex. Is it even possible?

So far, my solution with regex is:

url = re.findall(r"\('(.*?)'\)", soup['style'])[0]

Answer

You could try using the cssutils package. Something like this should work:

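import cssutils
from bs4 import BeautifulSoup

html = """<div class="image" style="background-image: url('/uploads/images/players/16113-1399107741.jpeg');" />"""
# Name the parser explicitly; BeautifulSoup warns when it has to guess one.
soup = BeautifulSoup(html, 'html.parser')
div_style = soup.find('div')['style']
# Parse the inline style attribute as a CSS declaration block.
style = cssutils.parseStyle(div_style)
url = style['background-image']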

>>> url
u'url(/uploads/images/players/16113-1399107741.jpeg)'
>>> url = url.replace('url(', '').replace(')', '')    # or regex/split/find/slice etc.
>>> url
u'/uploads/images/players/16113-1399107741.jpeg'
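The chained replace above strips every ")" in the string, so a URL that itself contains parentheses would get mangled. If that matters, something like the following slice-based helper might be safer. This is only a sketch: strip_css_url is a hypothetical name (not part of cssutils or BeautifulSoup), and it assumes the value is a single url(...) token, not a gradient or a comma-separated list of backgrounds.

```python
def strip_css_url(value):
    """Return the bare URL from a CSS value such as "url('/a/b.jpg')".

    Hypothetical helper for illustration; it is not part of cssutils
    or BeautifulSoup and only handles a single url(...) value.
    """
    value = value.strip()
    if not (value.startswith("url(") and value.endswith(")")):
        raise ValueError("not a single url(...) value: %r" % value)
    # Drop the leading "url(" and the trailing ")" by slicing.
    inner = value[len("url("):-1].strip()
    # Drop one pair of matching quotes, if the URL is quoted.
    if len(inner) >= 2 and inner[0] == inner[-1] and inner[0] in "'\"":
        inner = inner[1:-1]
    return inner


print(strip_css_url("url(/uploads/images/players/16113-1399107741.jpeg)"))
```

You would call it as strip_css_url(style['background-image']). Because it only slices off the outer wrapper, parentheses inside the URL itself are left alone.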

Although you will ultimately still need to pull the actual URL out of the url(...) value, this method should be more resilient to changes in the HTML. If you really dislike string manipulation and regex, you can get the URL in this roundabout way:

sheet = cssutils.css.CSSStyleSheet()
sheet.add("dummy_selector { %s }" % div_style)
url = list(cssutils.getUrls(sheet))[0]
>>> url
u'/uploads/images/players/16113-1399107741.jpeg'
User contributions licensed under: CC BY-SA