
Python – Iterate through list of website and scrape data – failing at requests.get

I have a list of items that I scraped from GitHub, stored in df_actionname['ActionName']. Each 'ActionName' can be converted into a 'Weblink' that forms a website URL. I am trying to loop through each weblink and scrape data from it.

My code:

#Code to create input data

import pandas as pd
import requests

actionnameListFinal = ['TruffleHog OSS','Metrics embed','Super-Linter',]

# Create dataframe
df_actionname = pd.DataFrame(actionnameListFinal, columns = ['ActionName'])

#Create new column for parsed action names
df_actionname['Parsed'] = df_actionname['ActionName'].str.replace(r'[^A-Za-z0-9]+', '-', regex=True)
df_actionname['Weblink'] = 'https://github.com/marketplace/actions/' + df_actionname['Parsed']

for website in df_actionname['Weblink']:
    URL = df_actionname['Weblink']
    detailpage = requests.get(URL)

My code is failing at "detailpage = requests.get(URL)". The error message I am getting is:

in get_adapter
    raise InvalidSchema(f"No connection adapters were found for {url!r}")
requests.exceptions.InvalidSchema: No connection adapters were found for '0    https://github.com/marketplace/actions/Truffle...
1    https://github.com/marketplace/actions/Metrics...
2    https://github.com/marketplace/actions/Super-L...
3    https://github.com/marketplace/actions/Swift-Doc
Name: Weblink, dtype: object'


Answer

You need to pass a single valid URL per request. Changing your for loop to

import requests
from bs4 import BeautifulSoup

for website in df_actionname['Weblink']:
    detailpage = requests.get(website)
    pageSoup = BeautifulSoup(detailpage.content, 'html.parser')
    print(f'scraped "{pageSoup.title.text}" from {website}')

gives me the output

scraped "TruffleHog OSS · Actions · GitHub Marketplace · GitHub" from https://github.com/marketplace/actions/TruffleHog-OSS
scraped "Metrics embed · Actions · GitHub Marketplace · GitHub" from https://github.com/marketplace/actions/Metrics-embed
scraped "Super-Linter · Actions · GitHub Marketplace · GitHub" from https://github.com/marketplace/actions/Super-Linter

The way you were doing it, your code was not only sending the same GET request on every loop iteration (since URL did not depend on website at all), but the argument passed to requests.get was not a single URL, as you can see if you add a print before the request.
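To see why the Series itself is rejected, here is a small standalone sketch (the two example.com URLs are hypothetical stand-ins for your Weblink column): iterating over a Series yields one plain string per row, while passing the whole Series makes requests stringify its repr, which begins with the index column rather than a URL scheme.

import pandas as pd

# Hypothetical two-row Series standing in for df_actionname['Weblink'].
links = pd.Series(['https://example.com/a', 'https://example.com/b'], name='Weblink')

# Iterating yields one plain string per row -- each one is a usable URL.
for url in links:
    assert isinstance(url, str)
    assert url.startswith('https://')

# Passing the whole Series instead stringifies its repr, whose first line
# starts with the index ("0    https://..."), so requests finds no
# connection adapter for it -- hence the InvalidSchema error.
print(str(links).splitlines()[0])

This is exactly the string you can see in your traceback, index numbers and "Name: Weblink, dtype: object" included.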

User contributions licensed under: CC BY-SA