I'm trying to write some code that scrapes data from a table on a stock screener website and saves it to Excel. The problem I'm having is that some of the values I want to pull from the table don't have a distinct class. I tried the code below for just the first column I wanted, the ticker, but it pulls all of the tab-links on the page. Any help would be appreciated.

    from bs4 import BeautifulSoup
    import requests
    import pandas as pd

    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0'}
    df_headers = ['Ticker', 'Owner', 'Relationship', 'Date', 'Transaction', 'Total Shares', 'SEC Form']
    url = "https://finviz.com/insidertrading.ashx"

    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, 'lxml')

    Ticker = [item.text for item in soup.select('.tab-link:nth-of-type(1):not([id])')]
    print(Ticker)

I also tried this code:

    Ticker = [item.text for item in soup.select('.insider-buy-row-2 .tab-link')]

It did pull the ticker I wanted, but it also included the person's name and other rows.
Answer
Use a combination of pandas and BeautifulSoup:
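
    from io import StringIO

    from bs4 import BeautifulSoup
    import requests
    import pandas as pd

    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0'}
    df_headers = ['Ticker', 'Owner', 'Relationship', 'Date', 'Transaction', 'Total Shares', 'SEC Form']
    url = "https://finviz.com/insidertrading.ashx"

    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, 'lxml')

    # Collect every <table> on the page and let pandas parse all of them.
    tbl = soup.find_all("table")
    # Wrapping the markup in StringIO avoids the literal-HTML deprecation warning in newer pandas.
    tbls = pd.read_html(StringIO(str(tbl)))

    # Index 4 depends on the page layout; check the list if the site changes.
    df = tbls[4]
    # Use the first row as the column names and drop it from the data.
    df, df.columns = df[1:], df.iloc[0]
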
The important part here is pd.read_html, which can read multiple DataFrames from the <table> tags in a page. You just have to grab the right table from the output and set the header properly.
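
Since a hard-coded index like tbls[4] can break when the page layout changes, here is a minimal offline sketch of one way to pick the table by its content instead. The HTML string, the find_table helper, and the sample values are all made up for illustration; the real finviz markup may differ, and the sketch assumes the header row is parsed as ordinary data (as happens when the header cells are plain <td> elements). pd.read_html also needs lxml, or bs4 with html5lib, installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for a page containing several <table> tags.
SAMPLE_HTML = """
<table><tr><td>Home</td><td>News</td></tr></table>
<table>
  <tr><td>Ticker</td><td>Owner</td><td>Transaction</td></tr>
  <tr><td>AAA</td><td>Jane Doe</td><td>Buy</td></tr>
  <tr><td>BBB</td><td>John Roe</td><td>Sale</td></tr>
</table>
"""


def find_table(tables, first_header="Ticker"):
    """Return the first table whose top-left cell equals first_header,
    with that first row promoted to the column names (hypothetical helper)."""
    for tbl in tables:
        if str(tbl.iloc[0, 0]) == first_header:
            body = tbl.iloc[1:].reset_index(drop=True)
            body.columns = tbl.iloc[0].tolist()
            return body
    raise ValueError(f"no table starts with {first_header!r}")


# read_html returns one DataFrame per <table> it finds.
tables = pd.read_html(StringIO(SAMPLE_HTML))
insider = find_table(tables)
print(insider["Ticker"].tolist())  # prints ['AAA', 'BBB']
```

With the real page you would pass the downloaded markup in place of SAMPLE_HTML, and from there insider.to_excel("insider.xlsx", index=False) would write the result to Excel (the file name is only an example, and an Excel writer such as openpyxl must be installed).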