I’m using BeautifulSoup to extract this line:
<a href="https://github.com/J-Lentz/iwgtk/archive/v0.8.tar.gz">iwgtk-0.8.tar.gz</a>
from a webpage.
<div> <ul id="pkgsrcslist"> <li> <a href="https://github.com/J-Lentz/iwgtk/archive/v0.8.tar.gz">iwgtk-0.8.tar.gz</a> </li> </ul> </div>
Specifically, I want this part: iwgtk-0.8.tar.gz
I’ve written this code:
#!/usr/bin/env python3
from bs4 import BeautifulSoup
import requests

url="https://aur.archlinux.org/packages/iwgtk"
#url=sys.argv[1]
page = requests.get(url)
if page.status_code ==200:
    soup = BeautifulSoup(page.text, 'html.parser')
    urls = []
    # loop over the [li] tags
    for tag in soup.find_all('li'):
        atag = tag.find('a')
        try:
            if 'href' in atag.attrs:
                url = atag.get('href').contents[0]
                urls.append(url)
        except:
            pass
    # print all the urls stored in the urls list
    for url in urls:
        print(url)
and I assume it is this line
url = atag.get('href').contents[0]
that fails. I’ve tried
url = atag.get('a').contents[0]
but that failed too.
Answer
Try to select your elements more specifically:
soup.find('ul',{'id':'pkgsrcslist'}).find_all('a')
or, more conveniently, with a CSS selector:
soup.select('#pkgsrcslist a')
Then use get('href') to get the URL and text / get_text() to get the link text. You can also use both and store them as key/value pairs in a dict:
...
soup = BeautifulSoup(page.text, 'html.parser')
pkgs = {}
for tag in soup.select('#pkgsrcslist a'):
    print('url: ' + tag.get('href'))
    print('text: ' + tag.text)
    # update the dict of package file names and links
    pkgs.update({tag.text: tag.get('href')})
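The two selection styles mentioned above should find the same anchors. A minimal offline sketch, assuming the HTML fragment from the question stands in for the real page (the names via_find, via_select and same are only for illustration):

```python
from bs4 import BeautifulSoup

# HTML fragment copied from the question; used here instead of a live request.
html = (
    '<div> <ul id="pkgsrcslist"> <li> '
    '<a href="https://github.com/J-Lentz/iwgtk/archive/v0.8.tar.gz">iwgtk-0.8.tar.gz</a> '
    '</li> </ul> </div>'
)
soup = BeautifulSoup(html, "html.parser")

# Style 1: find the <ul> by id, then collect every <a> inside it.
via_find = soup.find("ul", {"id": "pkgsrcslist"}).find_all("a")

# Style 2: the same query written as a CSS selector.
via_select = soup.select("#pkgsrcslist a")

# Compare the hrefs found by each approach.
same = [a.get("href") for a in via_find] == [a.get("href") for a in via_select]
print(same)
```

Either style works; the CSS selector is just shorter. Note that find() returns None when no matching ul exists, so the chained find_all() would raise in that case, while select() simply returns an empty list.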
Example
from bs4 import BeautifulSoup
import requests

url = "https://aur.archlinux.org/packages/iwgtk"
page = requests.get(url)
if page.status_code == 200:
    soup = BeautifulSoup(page.text, 'html.parser')
    pkgs = {}
    for tag in soup.select('#pkgsrcslist a'):
        pkgs.update({tag.text: tag.get('href')})
    print(pkgs)
Output
{'iwgtk-0.8.tar.gz': 'https://github.com/J-Lentz/iwgtk/archive/v0.8.tar.gz'}
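As for why the original line failed: get('href') returns the attribute value as a plain string, and a string has no .contents, so the call raises AttributeError, which the bare except in the question then hides. The link text lives on the tag itself. A small sketch of this, assuming the HTML fragment from the question as input (error_name and link_text are illustrative names):

```python
from bs4 import BeautifulSoup

# HTML fragment from the question, parsed offline.
html = (
    '<ul id="pkgsrcslist"><li>'
    '<a href="https://github.com/J-Lentz/iwgtk/archive/v0.8.tar.gz">iwgtk-0.8.tar.gz</a>'
    '</li></ul>'
)
soup = BeautifulSoup(html, "html.parser")
atag = soup.find("li").find("a")

# get('href') gives back the attribute value, which is an ordinary str.
href = atag.get("href")

# Reproduce the failing expression from the question and record the error type.
try:
    href.contents[0]
    error_name = None
except AttributeError as exc:
    error_name = type(exc).__name__

# The text between <a> and </a> is a child of the tag, not of the href string.
link_text = str(atag.contents[0])

print(type(href).__name__, error_name)
print(link_text)
```

The second attempt, atag.get('a'), fails for a similar reason: the tag has no attribute named a, so get() returns None, and None has no .contents either. Catching a specific exception instead of a bare except would have surfaced both errors.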