I want to scrape Congressional stock trades from Capitol Trades. I can scrape the data, but the column that contains stock tickers has a span tag that separates company names from company tickers. pandas.read_html()
removes this span tag, which concatenates company names and tickers and makes it difficult to recover tickers.
For example, company names that end with an “INC” suffix run into tickers, which are also capital letters. See my example below with “INC” and “AE”.
Here is where I found the span tag:
Company tickers are 1 to 5 characters in length, and I have failed to regex tickers because there are many varieties of company suffixes (e.g., “INC”, “CORP”, “PLC”, “SE”, etc.), and not all company names have suffixes.
How can I either replace span tags with whitespace to separate company names and tickers or parse the span as another column?
Here is my code:
import pandas as pd import yfinance as yf from selenium import webdriver from bs4 import BeautifulSoup import time import datetime def get_url(page=1, pageSize=50, assetType='stock'): if page == 1: return f'https://www.capitoltrades.com/trades?assetType={assetType}&pageSize={pageSize}' elif page > 1: return f'https://www.capitoltrades.com/trades?assetType={assetType}&page={page}&pageSize={pageSize}' else: return None driver = webdriver.Firefox() driver.get(get_url(page=1)) driver.implicitly_wait(10) time.sleep(1) soup = BeautifulSoup(driver.page_source, 'lxml') tables = soup.find_all('table') table = pd.read_html(str(tables))[0] driver.close()
Advertisement
Answer
To separate company names and tickers or parse the span as another column aka to get overall neat and clean ResultSet
, you can change your tool selection strategy a bit. In this case, It would be better to apply bs4 with pandas DataFrame
instead of pd.read_html() method.
Full working code as an example:
import pandas as pd from selenium import webdriver from selenium.webdriver.chrome.service import Service from webdriver_manager.chrome import ChromeDriverManager from bs4 import BeautifulSoup import time driver = webdriver.Chrome(service=Service(ChromeDriverManager().install())) #The base url: https://www.capitoltrades.com/trades?assetType=stock&pageSize=50 data = [] for page in range(1, 5): driver.get(f'https://www.capitoltrades.com/trades?assetType=stock&pageSize=50&page={page}') driver.maximize_window() time.sleep(3) soup = BeautifulSoup(driver.page_source, 'lxml') for row in soup.select('table.q-table.trades-table > tbody tr'): Politician = row.select_one('[class="q-fieldset politician-name"] > a').text.strip() Politician_info = row.select_one('[class="q-fieldset politician-info"]').get_text(' ',strip=True) Traded_Issuer = row.select_one('[class="q-fieldset issuer-name"] > a').text.strip() Issuer_ticker =row.select_one('span[class="q-field issuer-ticker"]').text.strip() Published = row.select_one('[class="q-td q-column--pubDate"] .q-value').text.strip() Traded = row.select_one('[class="q-td q-column--txbDate"] .q-value') Traded = Traded.text.strip() if Traded else None Filed_after = row.select_one('[class="q-td q-column--reportingGap"] .q-value').text.strip() Owner =row.select_one('[class="svg-image owner-icon"]+span').text.strip() _type = row.select_one('[class="q-data-cell tx-type"]').get_text(strip=True) Size = row.select_one('[class="q-td q-column--value"] > div').get_text(strip=True) #Size = Size.text.strip() if Size else None Price =row.select_one('[class="q-field trade-price"]').text.strip() data.append({ 'Politician':Politician, 'Politician_info':Politician_info, 'Traded_Issuer':Traded_Issuer, 'Issuer_ticker':Issuer_ticker, 'Published':Published, 'Traded':Traded, 'Filed_after':Filed_after, 'Owner':Owner, 'Type':_type, 'Size':Size, 'Price':Price }) df = pd.DataFrame(data) print(df)
Output:
Politician Politician_info Traded_Issuer Issuer_ticker ... Owner Type Size Price 0 Debbie Wasserman Schultz Democrat House FL ADAMS RESOURCES & ENERGY INC AE:US ... Child sell 1K - 15K 32.27 1 Kathy Manning Democrat House NC 3M Co MMM:US ... Spouse buy 1K - 15K 108.95 2 Kathy Manning Democrat House NC Accenture PLC ACN:US ... Spouse buy 1K - 15K 250.07 3 Kathy Manning Democrat House NC Adobe Inc ADBE:US ... Spouse sell 1K - 15K 286.15 4 Kathy Manning Democrat House NC Alphabet Inc GOOGL:US ... Spouse buy 1K - 15K 97.56 .. ... ... ... ... ... ... ... ... ... 195 Diana Harshbarger Republican House TN CME Group Inc CME:US ... Joint sell 1K - 15K 176.26 196 Diana Harshbarger Republican House TN CME Group Inc CME:US ... Spouse sell 1K - 15K 176.26 197 Diana Harshbarger Republican House TN The Home Depot Inc HD:US ... Undisclosed sell 1K - 15K 268.69 198 Diana Harshbarger Republican House TN The Home Depot Inc HD:US ... Undisclosed sell 1K - 15K 268.69 199 Diana Harshbarger Republican House TN The Home Depot Inc HD:US ... Joint sell 1K - 15K 268.69 [200 rows x 11 columns]