Python, extract urls from xml sitemap that contain a certain word

Question

I'm trying to extract all URLs from a sitemap that contain the word foo in the URL. I've managed to extract all the URLs but can't figure out how to get only the ones I want. In the example below, I only want the URLs for apples and pears returned.

Accepted Answer

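I modified the XML to make it valid (adding the opening and closing root tags) and saved it as src.xml:

    <urlset>
      <url>
        <loc>https://www.example.com/p-1224-apples-foo-09897.php</loc>
        <lastmod>2018-05-29</lastmod>
        <changefreq>daily</changefreq>
        <priority>1.0</priority>
      </url>
      <url>
        <loc>https://www.example.com/p-1433-pears-foo-00077.php</loc>
        <lastmod>2018-05-29</lastmod>
        <changefreq>daily</changefreq>
        <priority>1.0</priority>
      </url>
      <url>
        <loc>https://www.example.com/p-3411-oranges-ping-66554.php</loc>
        <lastmod>2018-05-29</lastmod>
        <changefreq>daily</changefreq>
        <priority>1.0</priority>
      </url>
    </urlset>

Use xml.etree.ElementTree to parse the XML and keep only the loc values that contain foo:

    >>> import xml.etree.ElementTree as ET
    >>> tree = ET.parse('src.xml')
    >>> root = tree.getroot()
    >>> for url in root.findall('url'):
    ...     for loc in url.findall('loc'):
    ...         if 'foo' in loc.text:
    ...             print(loc.text)
    ...
    https://www.example.com/p-1224-apples-foo-09897.php
    https://www.example.com/p-1433-pears-foo-00077.php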
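The loop above assumes the sitemap has no XML namespace. Sitemaps that follow the sitemaps.org protocol usually declare one on the root element, and in that case a plain findall('url') matches nothing. Below is a minimal sketch of a namespace-aware variant. The function name urls_containing and the inline sample string are illustrative assumptions, not part of the original answer.

```python
import xml.etree.ElementTree as ET

# Namespace declared by sitemaps that follow the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_containing(xml_text, word):
    """Return the <loc> URLs in a sitemap string that contain `word`.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(xml_text)
    # Try namespaced <loc> elements first, then fall back to plain ones.
    locs = root.findall(".//sm:loc", SITEMAP_NS) or root.findall(".//loc")
    return [loc.text.strip() for loc in locs if loc.text and word in loc.text]


# Sample data mirroring the URLs from the question (assumed layout).
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/p-1224-apples-foo-09897.php</loc></url>
  <url><loc>https://www.example.com/p-1433-pears-foo-00077.php</loc></url>
  <url><loc>https://www.example.com/p-3411-oranges-ping-66554.php</loc></url>
</urlset>"""

matches = urls_containing(sample, "foo")
for url in matches:
    print(url)
```

Returning a list rather than printing inside the loop makes it easy to reuse the result, for example to write the matching URLs to a file.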
