I’m trying to extract all URLs from a sitemap that contain the word foo. I’ve managed to extract all the URLs, but can’t figure out how to keep only the ones I want. In the example below, I only want the URLs for apples and pears returned.
<url>
  <loc>https://www.example.com/p-1224-apples-foo-09897.php</loc>
  <lastmod>2018-05-29</lastmod>
  <changefreq>daily</changefreq>
  <priority>1.0</priority>
</url>
<url>
  <loc>https://www.example.com/p-1433-pears-foo-00077.php</loc>
  <lastmod>2018-05-29</lastmod>
  <changefreq>daily</changefreq>
  <priority>1.0</priority>
</url>
<url>
  <loc>https://www.example.com/p-3411-oranges-ping-66554.php</loc>
  <lastmod>2018-05-29</lastmod>
  <changefreq>daily</changefreq>
  <priority>1.0</priority>
</url>
Answer
I wrapped the XML in a single root element (<urls> and </urls>) so that it is well-formed, and saved it as src.xml:
<urls>
  <url>
    <loc>https://www.example.com/p-1224-apples-foo-09897.php</loc>
    <lastmod>2018-05-29</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://www.example.com/p-1433-pears-foo-00077.php</loc>
    <lastmod>2018-05-29</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://www.example.com/p-3411-oranges-ping-66554.php</loc>
    <lastmod>2018-05-29</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
</urls>
Use xml.etree.ElementTree to parse the XML and keep only the <loc> values that contain foo:
>>> import xml.etree.ElementTree as ET
>>> tree = ET.parse('src.xml')
>>> root = tree.getroot()
>>> for url in root.findall('url'):
...     for loc in url.findall('loc'):
...         if 'foo' in loc.text:
...             print(loc.text)
...
https://www.example.com/p-1224-apples-foo-09897.php
https://www.example.com/p-1433-pears-foo-00077.php
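The same idea can be wrapped in a small reusable function. The following is only a sketch under a few assumptions: the function name urls_containing and the SAMPLE string are made up for illustration, and the XML is parsed from a string rather than from src.xml. It also compares only the local part of each tag name, because a real sitemap normally declares the namespace http://www.sitemaps.org/schemas/sitemap/0.9 on its root element, in which case a plain findall('url') would find nothing.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: the URLs from the question inside a <urls> root.
SAMPLE = """<urls>
  <url><loc>https://www.example.com/p-1224-apples-foo-09897.php</loc></url>
  <url><loc>https://www.example.com/p-1433-pears-foo-00077.php</loc></url>
  <url><loc>https://www.example.com/p-3411-oranges-ping-66554.php</loc></url>
</urls>"""


def urls_containing(xml_text, keyword):
    """Return the text of every <loc> element whose URL contains keyword."""
    root = ET.fromstring(xml_text)
    matches = []
    for element in root.iter():
        # Namespaced tags look like "{namespace}loc"; keep only the local name.
        if not isinstance(element.tag, str):
            continue
        if element.tag.rsplit("}", 1)[-1] != "loc":
            continue
        # An empty <loc/> has text None; strip whitespace around the URL.
        text = (element.text or "").strip()
        if keyword in text:
            matches.append(text)
    return matches


if __name__ == "__main__":
    for url in urls_containing(SAMPLE, "foo"):
        print(url)
    # prints:
    # https://www.example.com/p-1224-apples-foo-09897.php
    # https://www.example.com/p-1433-pears-foo-00077.php
```

Returning a list instead of printing inside the loop makes the result easier to reuse, for example to write the matching URLs to a file.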