I’m using BeautifulSoup to extract this line:

    <a href="https://github.com/J-Lentz/iwgtk/archive/v0.8.tar.gz">iwgtk-0.8.tar.gz</a>

from a webpage.

    <div>
      <ul id="pkgsrcslist">
        <li>
          <a href="https://github.com/J-Lentz/iwgtk/archive/v0.8.tar.gz">iwgtk-0.8.tar.gz</a>
        </li>
      </ul>
    </div>

Specifically, I want this part: iwgtk-0.8.tar.gz
I’ve written this code:

    #!/usr/bin/env python3

    from bs4 import BeautifulSoup
    import requests

    url="https://aur.archlinux.org/packages/iwgtk"
    #url=sys.argv[1]

    page = requests.get(url)
    if page.status_code ==200:
        soup = BeautifulSoup(page.text, 'html.parser')
        urls = []
        # loop over the [li] tags
        for tag in soup.find_all('li'):
            atag = tag.find('a')
            try:
                if 'href' in atag.attrs:
                    url = atag.get('href').contents[0]
                    urls.append(url)
            except:
                pass

        # print all the urls stored in the urls list
        for url in urls:
            print(url)

and I assume it is this line

    url = atag.get('href').contents[0]

that fails. I’ve tried

    url = atag.get('a').contents[0]

but that failed too.
Answer
Try to select your elements more specifically:

    soup.find('ul', {'id': 'pkgsrcslist'}).find_all('a')

or, more conveniently, with a CSS selector:

    soup.select('#pkgsrcslist a')

Then use get('href') to get the URL, or text / get_text() to get the link text, or use both and store them as key-value pairs in a dict:

    soup = BeautifulSoup(page.text, 'html.parser')
    pkgs = {}

    for tag in soup.select('#pkgsrcslist a'):
        print('url: ' + tag.get('href'))
        print('text: ' + tag.text)
        # update your dict of package versions and links
        pkgs.update({
            tag.text: tag.get('href')
        })

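To check the selectors without hitting the network, here is a minimal sketch that runs both variants against the HTML fragment from the question. The `html` string and the names `links_find` and `links_select` are only for illustration; the real page may contain more markup around the list.

```python
from bs4 import BeautifulSoup

# HTML fragment copied from the question (stand-in for the downloaded page)
html = """
<div>
<ul id="pkgsrcslist">
<li>
<a href="https://github.com/J-Lentz/iwgtk/archive/v0.8.tar.gz">iwgtk-0.8.tar.gz</a>
</li>
</ul>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# Variant 1: find the list by its id, then collect the links inside it
links_find = soup.find('ul', {'id': 'pkgsrcslist'}).find_all('a')

# Variant 2: the same lookup expressed as a CSS selector
links_select = soup.select('#pkgsrcslist a')

for tag in links_select:
    print('url: ' + tag.get('href'))
    print('text: ' + tag.get_text())
```

Both variants should return the same single link here. The CSS selector is shorter, and unlike the chained `find(...).find_all(...)` it simply returns an empty list if the `ul` is missing, rather than raising on `None`.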
Example

    from bs4 import BeautifulSoup
    import requests

    url = "https://aur.archlinux.org/packages/iwgtk"

    page = requests.get(url)
    if page.status_code == 200:
        soup = BeautifulSoup(page.text, 'html.parser')
        pkgs = {}
        for tag in soup.select('#pkgsrcslist a'):
            pkgs.update({
                tag.text: tag.get('href')
            })

        print(pkgs)

Output

    {'iwgtk-0.8.tar.gz': 'https://github.com/J-Lentz/iwgtk/archive/v0.8.tar.gz'}

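As for why the line in the question fails: `atag.get('href')` already returns the attribute value as a plain string, and a string has no `.contents`, so the expression raises `AttributeError`, which the bare `except: pass` then silently swallows. The second attempt, `atag.get('a')`, looks up an attribute named `a`, which does not exist, so it returns `None`. A short sketch of this, using a hypothetical one-link HTML string in place of the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page
html = '<li><a href="https://github.com/J-Lentz/iwgtk/archive/v0.8.tar.gz">iwgtk-0.8.tar.gz</a></li>'
atag = BeautifulSoup(html, 'html.parser').find('a')

href = atag.get('href')      # the attribute value, a plain str (not a Tag)
try:
    href.contents[0]         # what the question's code attempts
    error = None
except AttributeError as exc:
    error = exc              # str objects have no .contents

missing = atag.get('a')      # no attribute named 'a' on the tag, so None
text = atag.contents[0]      # .contents belongs to the tag itself

print(type(href).__name__, error)
print(missing, text)
```

Catching a specific exception instead of using a bare `except` would have surfaced the problem immediately.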