How to extract the link to Package Sources from the Arch User Repository (AUR) website

I’m using BeautifulSoup to extract this line:

<a href="https://github.com/J-Lentz/iwgtk/archive/v0.8.tar.gz">iwgtk-0.8.tar.gz</a>

from a webpage.

<div>
    <ul id="pkgsrcslist">
        <li>
            <a href="https://github.com/J-Lentz/iwgtk/archive/v0.8.tar.gz">iwgtk-0.8.tar.gz</a>
        </li>
    </ul>
</div>

Specifically, I want this part: iwgtk-0.8.tar.gz

I’ve written this code:

#!/usr/bin/env python3

from bs4 import BeautifulSoup
import requests

url="https://aur.archlinux.org/packages/iwgtk"
#url=sys.argv[1]

page = requests.get(url)
if page.status_code ==200:
    soup = BeautifulSoup(page.text, 'html.parser')
    urls = []
# loop over the [li] tags
    for tag in soup.find_all('li'):
        atag = tag.find('a')
        try:
            if 'href' in atag.attrs:
                url = atag.get('href').contents[0]
                urls.append(url)
        except:
            pass

# print all the urls stored in the urls list
for url in urls:
    print(url)

and I assume it is this line

url = atag.get('href').contents[0]

that fails. I’ve tried

url = atag.get('a').contents[0]

but that failed too.

Answer

Try to select your elements more specifically:

soup.find('ul',{'id':'pkgsrcslist'}).find_all('a')

Or, more conveniently, use a CSS selector:

soup.select('#pkgsrcslist a')
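
The two lookups differ in how they behave when the list is missing from the page. Here is a minimal sketch, run against the HTML snippet from the question instead of the live site. The helper names `links_via_find` and `links_via_select` and the `WITHOUT_LIST` page are made up for illustration.

```python
from bs4 import BeautifulSoup

# HTML snippet from the question.
WITH_LIST = """
<div>
    <ul id="pkgsrcslist">
        <li>
            <a href="https://github.com/J-Lentz/iwgtk/archive/v0.8.tar.gz">iwgtk-0.8.tar.gz</a>
        </li>
    </ul>
</div>
"""
# Made-up page without the sources list, to show the failure mode.
WITHOUT_LIST = "<div><p>No sources listed.</p></div>"


def links_via_find(html):
    soup = BeautifulSoup(html, 'html.parser')
    src_list = soup.find('ul', {'id': 'pkgsrcslist'})
    # find() returns None when nothing matches, so guard before chaining.
    return src_list.find_all('a') if src_list is not None else []


def links_via_select(html):
    soup = BeautifulSoup(html, 'html.parser')
    # select() returns an empty list when nothing matches.
    return soup.select('#pkgsrcslist a')


for html in (WITH_LIST, WITHOUT_LIST):
    print([a.get('href') for a in links_via_find(html)],
          [a.get('href') for a in links_via_select(html)])
```

Without the guard, the chained `find(...).find_all(...)` would raise an `AttributeError` on a page that has no `pkgsrcslist` element, whereas the selector version just yields nothing to loop over.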

Then use get('href') to get the URL, or text / get_text() to get the link text. You can also use both and store them as key-value pairs in a dict:

...
soup = BeautifulSoup(page.text, 'html.parser')
pkgs = {}

for tag in soup.select('#pkgsrcslist a'):
    print('url: ' + tag.get('href'))
    print('text: ' + tag.text)
    # add the link text and its URL to the dict of package sources
    pkgs.update({
        tag.text: tag.get('href')
    })
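
As for why the original attempts printed nothing: `get('href')` already returns the attribute value as a plain string, and a string has no `.contents`, so that line raises an `AttributeError` that the bare `except: pass` swallows. The sketch below shows what each accessor returns, again using the snippet from the question (collapsed to one line here) instead of the live page.

```python
from bs4 import BeautifulSoup

html = ('<ul id="pkgsrcslist"><li>'
        '<a href="https://github.com/J-Lentz/iwgtk/archive/v0.8.tar.gz">iwgtk-0.8.tar.gz</a>'
        '</li></ul>')
atag = BeautifulSoup(html, 'html.parser').find('a')

href = atag.get('href')        # attribute value, a plain str
print(type(href).__name__)

error = None
try:
    href.contents[0]           # what the question's code attempted
except AttributeError as exc:
    error = exc
    print('AttributeError:', exc)

print(atag.contents[0])        # first child of the tag: the link text
print(atag.get('a'))           # the tag has no attribute named 'a', so None
```

So `.contents` belongs on the tag itself, not on the value returned by `get()`, and `get('a')` looks up an attribute called `a`, which the `<a>` tag does not have.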

Example

from bs4 import BeautifulSoup
import requests

url = "https://aur.archlinux.org/packages/iwgtk"

# created up front so print(pkgs) also works if the request fails
pkgs = {}

page = requests.get(url)
if page.status_code == 200:
    soup = BeautifulSoup(page.text, 'html.parser')
    for tag in soup.select('#pkgsrcslist a'):
        # map the link text (file name) to its download URL
        pkgs[tag.text] = tag.get('href')

print(pkgs)

Output

{'iwgtk-0.8.tar.gz': 'https://github.com/J-Lentz/iwgtk/archive/v0.8.tar.gz'}
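
If this is needed for more than one package, the parsing can be kept separate from the download so it is testable without network access. This is only a sketch: `parse_sources` and `fetch_sources` are hypothetical helper names, the 10-second timeout is an arbitrary choice, and it assumes other AUR package pages use the same `pkgsrcslist` id.

```python
from bs4 import BeautifulSoup
import requests


def parse_sources(html):
    """Map each source file name to its URL (hypothetical helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    return {a.get_text(strip=True): a.get('href')
            for a in soup.select('#pkgsrcslist a')}


def fetch_sources(url, timeout=10):
    """Download a package page and parse it (hypothetical helper)."""
    page = requests.get(url, timeout=timeout)
    page.raise_for_status()  # raise on 4xx/5xx instead of silently skipping
    return parse_sources(page.text)


# Offline check against the snippet from the question.
sample = ('<ul id="pkgsrcslist"><li>'
          '<a href="https://github.com/J-Lentz/iwgtk/archive/v0.8.tar.gz">iwgtk-0.8.tar.gz</a>'
          '</li></ul>')
print(parse_sources(sample))
```

Calling `fetch_sources('https://aur.archlinux.org/packages/iwgtk')` should then give a dict like the output above, provided the page still has this structure.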
