I am trying to save all the `<a>` links within the Python homepage into a folder named 'Downloaded Pages'. However, after 2 iterations through the for loop I receive the following error:
```
www.python.org#content <_io.BufferedWriter name='Downloaded Pages/www.python.org#content'>
www.python.org#python-network <_io.BufferedWriter name='Downloaded Pages/www.python.org#python-network'>

Traceback (most recent call last):
  File "/Users/Lucas/Python/AP book exercise/Web Scraping/linkVerification.py", line 26, in <module>
    downloadedPage = open(os.path.join('Downloaded Pages', os.path.basename(linkUrlToOpen)), 'wb')
IsADirectoryError: [Errno 21] Is a directory: 'Downloaded Pages/'
```
I am unsure why this happens, as it appears the pages are being saved correctly — the output `<_io.BufferedWriter name='Downloaded Pages/www.python.org#content'>` suggests to me that the path is right.
This is my code:
```python
import requests, os, bs4

# Create a new folder to download webpages to
os.makedirs('Downloaded Pages', exist_ok=True)

# Download webpage
url = 'https://www.python.org/'
res = requests.get(url)
res.raise_for_status()  # Check if the download was successful

soupObj = bs4.BeautifulSoup(res.text, 'html.parser')  # Collects all text from the webpage

# Find all 'a' links on the webpage
linkElem = soupObj.select('a')
numOfLinks = len(linkElem)

for i in range(numOfLinks):
    linkUrlToOpen = 'https://www.python.org' + linkElem[i].get('href')
    print(os.path.basename(linkUrlToOpen))

    # save each downloaded page to the 'Downloaded Pages' folder
    downloadedPage = open(os.path.join('Downloaded Pages', os.path.basename(linkUrlToOpen)), 'wb')
    print(downloadedPage)

    if linkElem == []:
        print('Error, link does not work')
    else:
        for chunk in res.iter_content(100000):
            downloadedPage.write(chunk)
        downloadedPage.close()
```
Appreciate any advice, thanks.
Answer
The problem is that when you parse the basename of a URL that ends with a page name such as an `.html` file, it works — but for a URL that doesn't specify one, like `http://python.org/`, the basename is actually empty (you can try printing first the URL and then the basename between brackets or something to see what I mean). In that case `os.path.join('Downloaded Pages', '')` points at the directory itself, which is why you get `IsADirectoryError`. To work around that, the easiest solution would be to use absolute paths, as @Thyebri said.
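To see concretely why the basename comes back empty, here is a quick self-contained illustration using only the standard library:

```python
import os.path

# A URL that ends with a path component has a normal basename:
print(os.path.basename('https://www.python.org/downloads'))  # 'downloads'

# But a bare domain URL ending in '/' yields an empty basename,
# so joining it onto the folder name gives back the folder itself:
print(repr(os.path.basename('https://www.python.org/')))  # ''
print(os.path.join('Downloaded Pages', os.path.basename('https://www.python.org/')))
```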
And also, remember that the file name you write cannot contain characters like `/`, `\` or `?`.
So, I don't know if the following code is messy or not, but using the `re` library I would do the following:
```python
filename = re.sub('[/*:"?]+', '-', linkUrlToOpen.split("://")[1])
downloadedPage = open(os.path.join('Downloaded_Pages', filename), 'wb')
```
So first I remove the `https://` part, and then with the regular expressions library I replace all the usual symbols that are present in URL links with a dash `-`, and that is the name that will be given to the file.
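As a quick sanity check, that sanitizing step can be wrapped in a small helper (`url_to_filename` is just an illustrative name, not part of the original answer):

```python
import re

def url_to_filename(url):
    """Strip the scheme and replace filesystem-unsafe characters with '-'.

    Same approach as above; for Windows you might also want to add
    '\\\\', '<', '>' and '|' to the character class.
    """
    return re.sub('[/*:"?]+', '-', url.split('://')[1])

print(url_to_filename('https://www.python.org/'))          # 'www.python.org-'
print(url_to_filename('https://www.python.org/#content'))  # 'www.python.org-#content'
```

Note that unlike `os.path.basename`, this never produces an empty name for a bare domain URL, so the `IsADirectoryError` goes away.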
Hope it works!