I have a list of items that I scraped from GitHub, sitting in `df_actionname['ActionName']`. Each `'ActionName'` can then be converted into a `'Weblink'` column holding a website link. I am trying to loop through each weblink and scrape data from it.
My code:
```
# Code to create input data
import pandas as pd
import requests

actionnameListFinal = ['TruffleHog OSS', 'Metrics embed', 'Super-Linter']

# Create dataframe
df_actionname = pd.DataFrame(actionnameListFinal, columns=['ActionName'])

# Create new column for parsed action names
df_actionname['Parsed'] = df_actionname['ActionName'].str.replace(
    r'[^A-Za-z0-9]+', '-', regex=True)
df_actionname['Weblink'] = 'https://github.com/marketplace/actions/' + df_actionname['Parsed']

for website in df_actionname['Weblink']:
    URL = df_actionname['Weblink']
    detailpage = requests.get(URL)
```
My code is failing at `detailpage = requests.get(URL)`. The error message I am getting is:
```
in get_adapter
    raise InvalidSchema(f"No connection adapters were found for {url!r}")
requests.exceptions.InvalidSchema: No connection adapters were found for '0    https://github.com/marketplace/actions/Truffle…
1    https://github.com/marketplace/actions/Metrics…
2    https://github.com/marketplace/actions/Super-L…
3    https://github.com/marketplace/actions/Swift-Doc
Name: Weblink, dtype: object'
```
Answer
You need to pass a single valid URL. Changing your `for` loop to
```
# from bs4 import BeautifulSoup
for website in df_actionname['Weblink']:
    detailpage = requests.get(website)
    pageSoup = BeautifulSoup(detailpage.content, 'html.parser')
    print(f'scraped "{pageSoup.title.text}" from {website}')
```
gives me the output
scraped "TruffleHog OSS · Actions · GitHub Marketplace · GitHub" from https://github.com/marketplace/actions/TruffleHog-OSS scraped "Metrics embed · Actions · GitHub Marketplace · GitHub" from https://github.com/marketplace/actions/Metrics-embed scraped "Super-Linter · Actions · GitHub Marketplace · GitHub" from https://github.com/marketplace/actions/Super-Linter
As for the original code: not only was it trying to send the same GET request on every pass through the loop (since `URL` did not depend on `website` at all), the input to `requests.get` was not a single URL, as you can see if you add a `print` before the request:
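```
# your original loop, unchanged except for the added print
for website in df_actionname['Weblink']:
    URL = df_actionname['Weblink']
    print(URL)                       # prints the entire Weblink column, on every iteration
    detailpage = requests.get(URL)   # so requests receives the Series' string form, not one link
```

Each iteration, `print(URL)` dumps the whole Series in pandas' truncated repr, and that repr is exactly the string quoted in your `InvalidSchema` message: it starts with `0    https://…` rather than with a URL scheme, so `requests` cannot find a connection adapter for it.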