I'm trying to extract all URLs from a sitemap that contain the word foo. I've managed to extract all the URLs, but I can't figure out how to keep only the ones I want. In the example below, I only want the URLs for apples and pears returned.
<url>
  <loc> https://www.example.com/p-1224-apples-foo-09897.php </loc>
  <lastmod>2018-05-29</lastmod>
  <changefreq>daily</changefreq>
  <priority>1.0</priority>
</url>
<url>
  <loc> https://www.example.com/p-1433-pears-foo-00077.php </loc>
  <lastmod>2018-05-29</lastmod>
  <changefreq>daily</changefreq>
  <priority>1.0</priority>
</url>
<url>
  <loc> https://www.example.com/p-3411-oranges-ping-66554.php </loc>
  <lastmod>2018-05-29</lastmod>
  <changefreq>daily</changefreq>
  <priority>1.0</priority>
</url>
Answer
I wrapped the XML in a single root element (<urls> and </urls>) to make it well-formed, and saved it as src.xml:
<urls>
  <url>
    <loc> https://www.example.com/p-1224-apples-foo-09897.php </loc>
    <lastmod>2018-05-29</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc> https://www.example.com/p-1433-pears-foo-00077.php </loc>
    <lastmod>2018-05-29</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc> https://www.example.com/p-3411-oranges-ping-66554.php </loc>
    <lastmod>2018-05-29</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
</urls>
Use xml.etree.ElementTree to parse the XML and print only the <loc> values that contain foo:
>>> import xml.etree.ElementTree as ET
>>> tree = ET.parse('src.xml')
>>> root = tree.getroot()
>>> for url in root.findall('url'):
...     for loc in url.findall('loc'):
...         # <loc> text may be empty or padded with whitespace
...         link = (loc.text or '').strip()
...         if 'foo' in link:
...             print(link)
...
https://www.example.com/p-1224-apples-foo-09897.php
https://www.example.com/p-1433-pears-foo-00077.php
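A real sitemap usually has a <urlset> root that declares the namespace http://www.sitemaps.org/schemas/sitemap/0.9, in which case findall('url') matches nothing because every tag is namespace-qualified. Below is a minimal sketch of one way to handle both cases, assuming the sitemap is already available as a string. The function name urls_containing and the SAMPLE data are made up for illustration and are not part of the original question.

```python
import xml.etree.ElementTree as ET


def urls_containing(xml_text, word):
    """Return the stripped text of every <loc> element that contains `word`."""
    root = ET.fromstring(xml_text)
    matches = []
    for element in root.iter():
        # Drop a leading "{namespace}" prefix, if any, so that both plain
        # and namespace-qualified <loc> tags are recognised.
        local_name = element.tag.rsplit('}', 1)[-1]
        if local_name != 'loc':
            continue
        # <loc> text may be missing or padded with whitespace.
        link = (element.text or '').strip()
        if word in link:
            matches.append(link)
    return matches


# Hypothetical sample mirroring the question, with the sitemap namespace added.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc> https://www.example.com/p-1224-apples-foo-09897.php </loc></url>
  <url><loc> https://www.example.com/p-1433-pears-foo-00077.php </loc></url>
  <url><loc> https://www.example.com/p-3411-oranges-ping-66554.php </loc></url>
</urlset>"""

for link in urls_containing(SAMPLE, 'foo'):
    print(link)
```

Matching on the local tag name keeps the sketch independent of whether the file declares a namespace. If the namespace is known in advance, passing a prefix map to findall is an equally valid choice.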