Screen scrape text values from a span based on the text values of the corresponding span with Beautiful Soup

Question

I have some Beautiful Soup code, like the example code below. I'm using it to screen scrape financial data about mutual funds from Yahoo Finance. In this piece of code I'm trying to scrape the "Bond Ratings" percentages and save them in a dictionary. I've been trying to select element values based on the span class="Fl(end)", but I'm finding that

Accepted Answer

To be somewhat more robust, I would move away from selecting on classes, which tend to be dynamic, and rely on the relationships between elements instead. I still use :contains to anchor on the h3 heading. I also use a requests Session so the TCP connection is re-used across requests, and I refactor the code into a function that takes a symbol as its argument and returns the desired dictionary, which makes the code easier to re-use.

Here I use a mapping dictionary to produce the keys you want; you can extend it should new elements appear in that table.

    from bs4 import BeautifulSoup as bs
    import requests

    def get_dict(symbol):
        bond_ratings = {}
        # s is the requests.Session opened in the with block further down
        r = s.get(f'https://finance.yahoo.com/quote/{symbol}/holdings?p={symbol}')
        soup = bs(r.content, 'lxml')  # or 'html.parser'

        # Inside the div whose direct child h3 contains "Bond Ratings", match
        # descendants that have a class attribute and are not a first child.
        # Note: newer soupsieve releases deprecate :contains in favour of :-soup-contains.
        for i in soup.select('div:has(> h3:contains("Bond Ratings")) [class]:nth-child(n+2)'):
            # Iterate over the children of the first span in this element
            for j in i.select_one('span'):
                # The label text selects the key; the percentage is read from
                # the next sibling of j's parent
                bond_ratings[mappings[j.span.text]] = j.parent.next_sibling.text
        return bond_ratings

    mappings = {
        'US Government': 'us_government_bond_perc',
        'AAA': 'AAA_bond_perc',
        'AA': 'AA_bond_perc',
        'A': 'A_bond_perc',
        'BBB': 'BBB_bond_perc',
        'BB': 'BB_bond_perc',
        'B': 'B_bond_perc',
        'Below B': 'below_B_bond_perc',
        'Others': 'others_bond_perc'
    }

    symbols = ['TRARX', 'FLCEX']

    with requests.Session() as s:
        for symbol in symbols:
            bond_rating_dct = get_dict(symbol)
            print(bond_rating_dct)

Should you decide to be more consistent in your output dictionary keys, and don't mind retrieving all items that have percentages, you can remove the mapping and simply use the following:

    from bs4 import BeautifulSoup as bs
    import requests

    def get_dict(symbol):
        bond_ratings = {}
        # s is the requests.Session opened in the with block further down
        r = s.get(f'https://finance.yahoo.com/quote/{symbol}/holdings?p={symbol}')
        soup = bs(r.content, 'lxml')  # or 'html.parser'

        for i in soup.select('div:has(> h3:contains("Bond Ratings")) [class]:nth-child(n+2)'):
            for j in i.select_one('span'):
                # Derive the key from the label: spaces to underscores, lower-cased
                key = j.span.text.replace(' ', '_').lower() + '_bond_perc'
                bond_ratings[key] = j.parent.next_sibling.text
        return bond_ratings

    symbols = ['TRARX', 'FLCEX']

    with requests.Session() as s:
        for symbol in symbols:
            bond_rating_dct = get_dict(symbol)
            print(bond_rating_dct)
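The two versions do not produce identical keys, which matters if other code already depends on the mapped names. The following is a minimal sketch that compares them offline, without touching Yahoo Finance. The helper name derive_key is hypothetical and not part of the answer; it only wraps the key expression used in the second version.

```python
# Hypothetical helper (not in the original answer): wraps the key expression
# from the mapping-free version so it can be checked without any scraping.
def derive_key(label: str) -> str:
    return label.replace(' ', '_').lower() + '_bond_perc'


# The mapping dictionary from the first version, copied unchanged.
mappings = {
    'US Government': 'us_government_bond_perc',
    'AAA': 'AAA_bond_perc',
    'AA': 'AA_bond_perc',
    'A': 'A_bond_perc',
    'BBB': 'BBB_bond_perc',
    'BB': 'BB_bond_perc',
    'B': 'B_bond_perc',
    'Below B': 'below_B_bond_perc',
    'Others': 'others_bond_perc',
}

# Labels whose derived key differs from the hand-written mapped key.
changed = {
    label: (mapped, derive_key(label))
    for label, mapped in mappings.items()
    if derive_key(label) != mapped
}

for label, (mapped, derived) in changed.items():
    print(f'{label!r}: {mapped} -> {derived}')
```

Only 'US Government' and 'Others' keep the same key; every rating label is lower-cased, so 'AAA_bond_perc' becomes 'aaa_bond_perc' and 'below_B_bond_perc' becomes 'below_b_bond_perc'. If downstream code reads the dictionary by the mapped names, it would need updating when switching versions.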

Answer