Replace span tags with whitespace or parse contents as new column with pandas.read_html

I want to scrape Congressional stock trades from Capitol Trades. I can scrape the data, but the column that contains stock tickers has a span tag that separates company names from company tickers. pandas.read_html() removes this span tag, which concatenates company names and tickers and makes it difficult to recover tickers.

For example, company names that end with an “INC” suffix run into tickers, which are also capital letters. See my example below with “INC” and “AE”.

[Image: INC and AE]

Here is where I found the span tag:

[Image: span tag]

Company tickers are 1 to 5 characters long, and I have not been able to extract them with a regex because company suffixes vary widely (e.g., “INC”, “CORP”, “PLC”, “SE”), and not all company names have a suffix.
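
To illustrate the ambiguity, here is a small sketch using only Python's `re` module. The string `EXAMPLE ENERGY INCAE` is made-up data standing in for a flattened cell (company name "EXAMPLE ENERGY INC" followed by ticker "AE"); it is not taken from the site.

```python
import re

# Hypothetical flattened cell text: company name immediately followed by ticker.
# Intended split: "EXAMPLE ENERGY INC" + "AE".
flattened = "EXAMPLE ENERGY INCAE"

# Naive pattern: treat the trailing 1 to 5 capital letters as the ticker.
match = re.search(r"[A-Z]{1,5}$", flattened)
naive_ticker = match.group()
print(naive_ticker)  # INCAE (the suffix "INC" is swallowed into the ticker)

# Every trailing slice of 1 to 5 letters is a plausible ticker,
# so the text alone cannot say where the name ends.
candidates = [flattened[-n:] for n in range(1, 6)]
print(candidates)  # ['E', 'AE', 'CAE', 'NCAE', 'INCAE']
```

Without the span boundary, nothing in the text distinguishes "AE" from the other candidates.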

How can I either replace span tags with whitespace to separate company names and tickers or parse the span as another column?

Here is my code:

JavaScript

Answer

To separate company names from tickers, or to parse the span as its own column, change tools: in this case it is better to parse the page with bs4 and build a pandas DataFrame from the extracted cells instead of using pd.read_html().
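
A minimal sketch of that idea, run on an inline HTML string rather than the live site. The table markup, the column names, and the sample rows are all assumptions made for illustration; the real page may use different tags and class names, so inspect it and adjust the selectors.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup: a span inside the issuer cell holds the ticker.
html = """
<table>
  <thead><tr><th>Politician</th><th>Issuer</th></tr></thead>
  <tbody>
    <tr><td>Jane Doe</td><td>Example Energy INC<span>AE</span></td></tr>
    <tr><td>John Roe</td><td>Sample Holdings<span>SMPL</span></td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.select("tbody tr"):
    cells = tr.find_all("td")
    issuer_cell = cells[1]
    span = issuer_cell.find("span")
    # Read the ticker from the span, then detach the span so the
    # remaining cell text is the company name only.
    ticker = span.extract().get_text(strip=True) if span else None
    rows.append({
        "politician": cells[0].get_text(strip=True),
        "company": issuer_cell.get_text(strip=True),
        "ticker": ticker,
    })

df = pd.DataFrame(rows)
print(df)
```

If a single column is enough, calling `issuer_cell.get_text(" ", strip=True)` before detaching the span joins the name and ticker with a space instead of running them together.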

Full working code as an example:

JavaScript

Output:

JavaScript