I am brand new to coding, and was given a web scraping tutorial (found here) to help build my skills as I learn. I’ve already had to make several adjustments to the code in this tutorial, but I digress. I’m scraping off of http://books.toscrape.com/ and, when I try to export a Dataframe of just the book categories into Excel, I get a couple of issues. Note that, when exporting to csv (and then opening the file in Notepad), these issues are not present. I am working in a Jupyter Notebook in Azure Data Studio.
First, the row with all the data appears to not exist, even though it is displayed, making it so that I have to tab over to each column to go past the data that is shown in the default windowsize of Excel. Second, it only displays the first 9 results (the first being “Books,” and the other 8 being the first 8 categories).
Image of desired scrape section
Here is my code:
s = Service('C:/Users/.../.../chromedriver.exe') browser = webdriver.Chrome(service=s) url = 'http://books.toscrape.com/' browser.get(url) results = [] content = browser.page_source soup = BeautifulSoup(content) # changes from the tutorial due to recommendations # from StackOverflow, based on similar questions # from error messages popping up when using original # formatting; tutorial is outdated. for li in soup.find_all(attrs={'class': 'side_categories'}): name = element.find('li') if name not in results: results.append(name.text) df = pd.DataFrame({'Categories': results}) df.to_excel('categories.xlsx', index=False) # per https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.io.formats.style.Styler.to_excel.html # encoding is Deprecated and apparently wasn't # needed for any excel writer other than xlwt, # which is no longer maintained.
Images of results:
View before tabbing further in the columns
What can I do to fix this?
Edit: Many apologies, I didn’t realize I have copied an older, incorrect version of my code blocks. Should be correct now.
Advertisement
Answer
The code in question will not create any dataframe. However, you should select your elements more specific for example with css selectors
:
for a in soup.select('ul.nav-list a'): if a.get_text(strip=True) not in results: results.append(a.get_text(strip=True))
Example
import requests from bs4 import BeautifulSoup import pandas as pd results = [] soup = BeautifulSoup(requests.get('http://books.toscrape.com/').content) for a in soup.select('ul.nav-list a'): if a.get_text(strip=True) not in results: results.append(a.get_text(strip=True)) pd.DataFrame({'Categories': results})
Output
Categories | |
---|---|
0 | Books |
1 | Travel |
2 | Mystery |
3 | Historical Fiction |
4 | Sequential Art |
5 | Classics |
6 | Philosophy |
…