I’m trying to scrape data from a webpage using BeautifulSoup and (ultimately) output it to a CSV. As a first step, I’ve tried to get the text of the relevant table. I managed to do this, but the code no longer gives me the same output when I rerun it: instead of keeping all 12372 records when I run the for loop, it just saves the last one.
An abbreviated version of my code is:
from bs4 import BeautifulSoup

BirthsSoup = BeautifulSoup(browser.page_source, features="html.parser")
print(BirthsSoup.prettify())  # this confirms that the soup has captured the page as I want it to

birthsTable = BirthsSoup.select('#t2 td')  # selects all the elements in the table I want
birthsLen = len(birthsTable)  # birthsLen: 12372

for i in range(birthsLen):
    print(birthsTable[i].prettify())  # this confirms that the beautifulsoup tag object correctly captured all of the table

for i in range(birthsLen):
    birthsText = birthsTable[i].getText()  # this was supposed to compile the text for every element in the table
But the for loop only saves the text for the last (i.e. 12372nd) element in the table. Do I need to do something else in order for it to save each element when it loops through? I think my previous (desired) output had the text of each element on a new line.
This is my first time using Python, so apologies if I’ve made an obvious mistake.
Answer
You’re overwriting the birthsText variable on each iteration, so by the time the loop finishes only the last element’s text is left. To solve this, create a list before the loop and append each element’s text to it:
birthsLen = len(birthsTable)
birthsText = []
for i in range(birthsLen):
    birthsText.append(birthsTable[i].getText())
Or, more concisely:
birthsText = [line.getText() for line in birthsTable]
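Since the end goal is a CSV, here is a rough sketch of how the resulting list might be turned into rows. The selector '#t2 td' returns every cell as one flat list, so the cells have to be regrouped into rows first. The sample values, the NUM_COLUMNS value of 3, and the name cell_texts are assumptions for illustration only; the real column count depends on the table on your page.

```python
import csv
import io

# Stand-in for birthsText: a flat list of cell texts (made-up sample values).
cell_texts = ["Ann", "1901", "Leeds", "Bob", "1902", "York"]

# Hypothetical column count; check how many <td> cells each row of the real table has.
NUM_COLUMNS = 3

# Slice the flat list into consecutive rows of NUM_COLUMNS cells each.
rows = [
    cell_texts[i:i + NUM_COLUMNS]
    for i in range(0, len(cell_texts), NUM_COLUMNS)
]

# Write to an in-memory buffer so the sketch runs anywhere.
# For a real file, something like open("births.csv", "w", newline="") could be used instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

If the table has header cells (th rather than td), they would not be matched by '#t2 td' and would need to be collected separately.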