I have some Beautiful Soup code, like the example below, that I’m using to screen-scrape mutual fund data from Yahoo Finance. In this piece of code I’m trying to scrape the “Bond Ratings” percentages and save them in a dictionary. I’ve been selecting element values based on span class="Fl(end)", but for some mutual funds this pulls the wrong text for the “AAA” value. Is there a way to pull the percentage text using the bond name between the spans instead? For example, using <span>US Government</span> to get the 0.00% from the corresponding <span class="Fl(end)">0.00%</span>.
Example code, sample data, and desired output are below.
code:
import decimal as dc

import requests
from bs4 import BeautifulSoup as bs

profile_URL = 'https://finance.yahoo.com/quote/TRARX/holdings?p=TRARX'
profile_req = requests.get(profile_URL)
profile_soup = bs(profile_req.text)

bond_rating_list = ['us_government_bond_perc', 'AAA_bond_perc', 'AA_bond_perc', 'A_bond_perc', 'BBB_bond_perc',
                    'BB_bond_perc', 'B_bond_perc', 'below_B_bond_perc', 'others_bond_perc']
bond_rating_dct = {}

# Picks values purely by position among the Fl(end) spans, starting at index 5
for n in range(len(bond_rating_list)):
    bond_rating_dct[bond_rating_list[n]] = dc.Decimal(profile_soup.select('span[class="Fl(end)"]')[n+5].text.replace('%',''))
beautiful soup sample:
URL = 'https://finance.yahoo.com/quote/TRARX/holdings?p=TRARX'
req = requests.get(URL)
soup = bs(req.text)
soup
data:
<span class="Mend(5px) Whs(nw)" data-reactid="225"><span data-reactid="226">US Government</span></span></span><span class="Fl(end)" data-reactid="227">0.00%</span></div><div class="Bdbw(1px) Bdbc($seperatorColor) Bdbs(s) H(25px) Pt(10px)" data-reactid="228"><span class="Fl(start)" data-reactid="229"><span class="Mend(5px) Whs(nw)" data-reactid="230"><span data-reactid="231">AAA</span></span></span><span class="Fl(end)" data-reactid="232">0.00%</span>
desired output example:
bond_rating_dct['us_government_bond_perc'] = 0.00
bond_rating_dct['AAA_bond_perc'] = 50.23
UPDATE: I’m running beautifulsoup4 4.6.3
Answer
To be a little more robust, I would move away from classes, which tend to be dynamic, and rely on the relationships between elements instead. I still use :contains to anchor on the h3.
I add a Session for the efficiency of TCP connection re-use. I also refactor the code into a function that takes a symbol as its argument and returns your desired dictionary, for ease of code re-use.
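The connection re-use mentioned above comes from sending every request through one requests.Session. As a minimal sketch, assuming hypothetical helpers named holdings_url, fetch_holdings_html and print_page_sizes (none of them appear in the answer's code), the session can be passed in as a parameter rather than read from a global:

```python
import requests


def holdings_url(symbol):
    # Same URL pattern as in the question; `symbol` is the fund ticker.
    return f'https://finance.yahoo.com/quote/{symbol}/holdings?p={symbol}'


def fetch_holdings_html(session, symbol):
    # `session` is any object with a requests-style .get() method.
    # Passing it in (instead of using a global) keeps the function testable.
    response = session.get(holdings_url(symbol), timeout=10)
    response.raise_for_status()  # raise on HTTP error status codes
    return response.text


def print_page_sizes(symbols):
    # One Session for all symbols, so the underlying connection can be reused.
    # Calling this function performs real network requests.
    with requests.Session() as session:
        for symbol in symbols:
            html = fetch_holdings_html(session, symbol)
            print(symbol, len(html))
```

Calling print_page_sizes(['TRARX', 'FLCEX']) would fetch both pages over the same session. Whether the site answers such requests is outside what this sketch covers.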
Here I use a mapping dictionary to produce the keys you want; you can extend it should new elements appear in that table.
from bs4 import BeautifulSoup as bs
import requests

def get_dict(symbol):
    bond_ratings = {}
    r = s.get(f'https://finance.yahoo.com/quote/{symbol}/holdings?p={symbol}')
    soup = bs(r.content, 'lxml') # or 'html.parser'
    # :has() and :contains() need Beautiful Soup 4.7+ (Soup Sieve);
    # newer Soup Sieve releases prefer :-soup-contains() over :contains()
    for i in soup.select('div:has(> h3:contains("Bond Ratings")) [class]:nth-child(n+2)'):
        for j in i.select_one('span'):
            # label text -> output key; the value sits in the next sibling span
            bond_ratings[mappings[j.span.text]] = j.parent.next_sibling.text
    return bond_ratings

mappings = {
    'US Government': 'us_government_bond_perc',
    'AAA': 'AAA_bond_perc',
    'AA': 'AA_bond_perc',
    'A': 'A_bond_perc',
    'BBB': 'BBB_bond_perc',
    'BB': 'BB_bond_perc',
    'B': 'B_bond_perc',
    'Below B': 'below_B_bond_perc',
    'Others': 'others_bond_perc'
}

symbols = ['TRARX', 'FLCEX']

with requests.Session() as s:
    for symbol in symbols:
        bond_rating_dct = get_dict(symbol)
        print(bond_rating_dct)
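The idea of anchoring on the label text rather than on class names can also be tried offline. The sketch below is not the answer's method: it uses only find and find_parent, so it does not depend on :has() or :contains() selector support. It assumes a trimmed copy of the markup from the question, with illustrative percentages (the AAA figure is taken from the desired output example, the AA figure is made up) and a hypothetical helper named value_for_label:

```python
from bs4 import BeautifulSoup

# Trimmed, hand-written copy of the row markup shown in the question.
# The percentages are illustrative values, not real fund data.
SAMPLE_HTML = (
    '<div><span class="Fl(start)"><span class="Mend(5px) Whs(nw)">'
    '<span>US Government</span></span></span>'
    '<span class="Fl(end)">0.00%</span></div>'
    '<div><span class="Fl(start)"><span class="Mend(5px) Whs(nw)">'
    '<span>AAA</span></span></span>'
    '<span class="Fl(end)">50.23%</span></div>'
    '<div><span class="Fl(start)"><span class="Mend(5px) Whs(nw)">'
    '<span>AA</span></span></span>'
    '<span class="Fl(end)">12.50%</span></div>'
)

# Subset of the mapping dictionary from the answer.
MAPPINGS = {
    'US Government': 'us_government_bond_perc',
    'AAA': 'AAA_bond_perc',
    'AA': 'AA_bond_perc',
}


def value_for_label(soup, label):
    # A plain string passed to string= must equal the text node exactly,
    # so 'AA' does not match the 'AAA' row.
    text_node = soup.find(string=label)
    if text_node is None:
        return None
    # Climb to the enclosing row <div>.
    row = text_node.find_parent('div')
    if row is None:
        return None
    # The value is the last <span> that is a direct child of the row.
    direct_spans = row.find_all('span', recursive=False)
    return direct_spans[-1].get_text(strip=True)


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
bond_rating_dct = {key: value_for_label(soup, label)
                   for label, key in MAPPINGS.items()}
print(bond_rating_dct)
```

Because the lookup starts from the label, a change in the number or order of Fl(end) spans elsewhere on the page does not shift which value is picked up. On the live page the search would need to be scoped to the Bond Ratings section first, since a label such as "Others" may occur in more than one table.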
Should you decide to be more consistent in your output dictionary keys, and don’t mind retrieving every item that has a percentage, you can remove the mapping and simply use the following:
from bs4 import BeautifulSoup as bs
import requests

def get_dict(symbol):
    bond_ratings = {}
    r = s.get(f'https://finance.yahoo.com/quote/{symbol}/holdings?p={symbol}')
    soup = bs(r.content, 'lxml') # html.parser
    for i in soup.select('div:has(> h3:contains("Bond Ratings")) [class]:nth-child(n+2)'):
        for j in i.select_one('span'):
            bond_ratings[j.span.text.replace(' ','_').lower() + '_bond_perc'] = j.parent.next_sibling.text
    return bond_ratings

symbols = ['TRARX', 'FLCEX']

with requests.Session() as s:
    for symbol in symbols:
        bond_rating_dct = get_dict(symbol)
        print(bond_rating_dct)
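Two details of the mapping-free version are easy to check in isolation: the generated keys are all lower case, and the values are still strings ending in %. The sketch below, assuming hypothetical helpers make_key and to_decimal and made-up sample percentages, shows the key transformation together with a conversion to Decimal like the one in the question's code:

```python
from decimal import Decimal


def make_key(label):
    # Same transformation as in the code above: spaces become underscores,
    # everything is lower-cased, and '_bond_perc' is appended.
    return label.replace(' ', '_').lower() + '_bond_perc'


def to_decimal(percent_text):
    # '50.23%' -> Decimal('50.23'); surrounding whitespace is ignored.
    return Decimal(percent_text.strip().rstrip('%'))


# Illustrative label/value pairs, not real fund data.
raw = {'US Government': '0.00%', 'AAA': '50.23%', 'Below B': '1.10%'}

converted = {make_key(label): to_decimal(text) for label, text in raw.items()}
print(converted)
```

Note that this produces aaa_bond_perc and below_b_bond_perc rather than the mixed-case AAA_bond_perc and below_B_bond_perc keys used in the question, so any code that reads the dictionary has to use the lower-case names.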