Skip to content
Advertisement

WebScraping: Pandas to_excel Not Displaying full DataFrame

I am brand new to coding, and was given a web scraping tutorial (found here) to help build my skills as I learn. I’ve already had to make several adjustments to the code in this tutorial, but I digress. I’m scraping off of http://books.toscrape.com/ and, when I try to export a Dataframe of just the book categories into Excel, I get a couple of issues. Note that, when exporting to csv (and then opening the file in Notepad), these issues are not present. I am working in a Jupyter Notebook in Azure Data Studio.

First, the row with all the data appears to not exist, even though it is displayed, making it so that I have to tab over to each column to go past the data that is shown in the default windowsize of Excel. Second, it only displays the first 9 results (the first being “Books,” and the other 8 being the first 8 categories).

Image of desired scrape section

Here is my code:

s = Service('C:/Users/.../.../chromedriver.exe')
browser = webdriver.Chrome(service=s)
url = 'http://books.toscrape.com/'
browser.get(url)
results = []
content = browser.page_source
soup = BeautifulSoup(content)
# changes from the tutorial due to recommendations
# from StackOverflow, based on similar questions
# from error messages popping up when using original
# formatting; tutorial is outdated. 

for li in soup.find_all(attrs={'class': 'side_categories'}):
    name = element.find('li')
    if name not in results:
        results.append(name.text)

df = pd.DataFrame({'Categories': results})
df.to_excel('categories.xlsx', index=False)
# per https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.io.formats.style.Styler.to_excel.html
# encoding is Deprecated and apparently wasn't
# needed for any excel writer other than xlwt,
# which is no longer maintained.

Images of results:

View before tabbing further in the columns

End of the displayed results

What can I do to fix this?

Edit: Many apologies, I didn’t realize I have copied an older, incorrect version of my code blocks. Should be correct now.

Advertisement

Answer

The code in question will not create any dataframe. However, you should select your elements more specific for example with css selectors:

for a in soup.select('ul.nav-list a'):
    if a.get_text(strip=True) not in results:
        results.append(a.get_text(strip=True))

Example

import requests
from bs4 import BeautifulSoup
import pandas as pd

results = []

soup = BeautifulSoup(requests.get('http://books.toscrape.com/').content)

for a in soup.select('ul.nav-list a'):
    if a.get_text(strip=True) not in results:
        results.append(a.get_text(strip=True))

pd.DataFrame({'Categories': results})

Output

Categories
0 Books
1 Travel
2 Mystery
3 Historical Fiction
4 Sequential Art
5 Classics
6 Philosophy

User contributions licensed under: CC BY-SA
6 People found this is helpful
Advertisement