I have:
<div class="image" style="background-image: url('/uploads/images/players/16113-1399107741.jpeg');" />
I want to get the URL, but I don't know how to do that without regex. Is it even possible?
So far, my solution with regex is:
url = re.findall(r"\('(.*?)'\)", soup['style'])[0]
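For reference, here is a minimal sketch of the regex route that also tolerates double quotes or no quotes around the path. The names `URL_RE` and `extract_url_regex` are hypothetical, and the pattern assumes the URL itself contains no quotes or parentheses.

```python
import re

# Hypothetical helper, not part of any library.
# Assumption: the URL contains no quotes or parentheses.
URL_RE = re.compile(r"""url\(\s*['"]?([^'")]+?)['"]?\s*\)""")


def extract_url_regex(style):
    """Return the first url(...) target in an inline style string, or None."""
    match = URL_RE.search(style)
    return match.group(1) if match else None


style = "background-image: url('/uploads/images/players/16113-1399107741.jpeg');"
print(extract_url_regex(style))
```

Returning `None` instead of indexing `[0]` avoids an `IndexError` when the style has no `url(...)` value at all.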
Answer
You could try using the cssutils package. Something like this should work:
import cssutils
from bs4 import BeautifulSoup

html = """<div class="image" style="background-image: url('/uploads/images/players/16113-1399107741.jpeg');" />"""
soup = BeautifulSoup(html)
div_style = soup.find('div')['style']
style = cssutils.parseStyle(div_style)
url = style['background-image']

>>> url
u'url(/uploads/images/players/16113-1399107741.jpeg)'
>>> url = url.replace('url(', '').replace(')', '')  # or regex/split/find/slice etc.
>>> url
u'/uploads/images/players/16113-1399107741.jpeg'
Although you will ultimately still need to parse out the actual URL, this method should be more resilient to changes in the HTML. If you really dislike string manipulation and regex, you can pull the URL out in this roundabout way:
sheet = cssutils.css.CSSStyleSheet()
sheet.add("dummy_selector { %s }" % div_style)
url = list(cssutils.getUrls(sheet))[0]

>>> url
u'/uploads/images/players/16113-1399107741.jpeg'
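The `replace` chain in the first snippet removes every `)` in the value, so it may mangle a path that happens to contain one after the wrapper is gone. As a rough alternative, the sketch below unwraps a `url(...)` value with `find` and slicing only. The helper name `strip_css_url` is hypothetical, and it assumes the URL itself contains no `)`.

```python
def strip_css_url(value):
    """Unwrap a CSS url(...) value using only string methods (no regex).

    Hypothetical helper. Assumption: the URL itself contains no ')'.
    """
    start = value.find("url(")
    if start == -1:
        return None
    end = value.find(")", start)
    if end == -1:
        return None
    inner = value[start + len("url("):end]
    # Drop surrounding whitespace and optional single or double quotes.
    return inner.strip().strip("'\"")


print(strip_css_url("url(/uploads/images/players/16113-1399107741.jpeg)"))
```

It works on the value returned by `style['background-image']` as well as on the raw `style` attribute, since it only looks for the first `url(` and the next `)`.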