from bs4 import BeautifulSoup
import requests

page = requests.get('https://www.capitol.tn.gov/house/members/').text
soup = BeautifulSoup(page, 'html.parser')
table = soup.find('table')
rows = table.find_all('tr')
header = rows[0].find_all('th')
header_text = []
for item in header:
    header_text.append(item.get_text(strip=True))
# check header results
print(header_text)
# get rows
for row in rows:
    row_text = []
    a = row.find_all('a')
    td = row.find_all('td')
    for item in td:
        if item:
            row_text.append(item.get_text(strip=True))
    # check row results
    if len(row_text) > 0:
        print(row_text)
I’m sorry if this is a stupid question, but I’m having a bit of trouble coming up with how to get the ‘a’s or ‘hrefs’ (aka the emails) to actually appear as the first item in the row. For starters, I’ve tried the insert() method, but it never actually gives me anything.
Answer
This does the job:
# get rows
for row in rows:
    row_text = []
    a = row.find_all('a')
    td = row.find_all('td')
    # print(td)
    for item in td:
        # look for an email link inside this cell
        email = item.find("a", {"class": "email"})
        if email is not None:
            # store the link target (href) ahead of the cell's text
            email = email.get("href")
            row_text.append(email)
        if item:
            row_text.append(item.get_text(strip=True))
    # check row results
    if len(row_text) > 0:
        print(row_text)
The code checks whether each td cell contains an a tag that belongs to the class email. If it does, it reads the href from that tag into a variable named email and appends it to the row_text list, just before the text of the cell itself.
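If the email has to be the very first item in each row, whichever column the link sits in, one option is to collect the cell text first and then call insert(0, ...) on the list. The sketch below is only an illustration under assumptions: SAMPLE_HTML is made-up markup standing in for the real members page (whose structure may differ), rows_with_email_first is a hypothetical helper name, and the email class is taken from the code above. Note that insert() changes the list in place and returns None, so the list itself is what should be printed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the members table; the real page may differ.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>District</th><th>Email</th></tr>
  <tr>
    <td>Jane Doe</td>
    <td>District 1</td>
    <td><a class="email" href="mailto:jane.doe@example.com">Email</a></td>
  </tr>
  <tr>
    <td>John Roe</td>
    <td>District 2</td>
    <td></td>
  </tr>
</table>
"""


def rows_with_email_first(html):
    """Return one list per data row, with the email href (if any) at index 0."""
    soup = BeautifulSoup(html, "html.parser")
    result = []
    for row in soup.find("table").find_all("tr"):
        cells = row.find_all("td")
        if not cells:
            continue  # the header row has only th cells, so skip it
        row_text = [cell.get_text(strip=True) for cell in cells]
        # Assumption: the email link carries class="email", as in the answer above.
        email = row.find("a", class_="email")
        if email is not None and email.get("href"):
            # insert() mutates row_text in place and returns None
            row_text.insert(0, email.get("href"))
        result.append(row_text)
    return result


for row_text in rows_with_email_first(SAMPLE_HTML):
    print(row_text)
```

Rows without an email link are left as they are, so their lists are one item shorter than the rest. If a fixed column count matters, a placeholder such as an empty string could be inserted instead.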