
Screen scrape text values from a span based on the text of the corresponding span with Beautiful Soup

I have some Beautiful Soup code, like the example below, that I'm using to screen scrape financial data about mutual funds from Yahoo Finance. In this piece of code I'm trying to scrape the "Bond Ratings" percentages and save them in a dictionary. I've been selecting the values by position among the span class="Fl(end)" elements, but for some mutual funds this pulls the wrong text for the "AAA" value. Is there a way to instead pull the percentage using the bond name text between the spans? For example, use <span>US Government</span> to get the 0.00% from the corresponding <span class="Fl(end)">0.00%</span>.

Some example code, sample data, and the desired output are below.

code:

import decimal as dc

import requests
from bs4 import BeautifulSoup as bs

profile_URL = 'https://finance.yahoo.com/quote/TRARX/holdings?p=TRARX'

profile_req = requests.get(profile_URL)

profile_soup = bs(profile_req.text)

bond_rating_list = ['us_government_bond_perc',
                    'AAA_bond_perc',
                    'AA_bond_perc',
                    'A_bond_perc',
                    'BBB_bond_perc',
                    'BB_bond_perc',
                    'B_bond_perc',
                    'below_B_bond_perc',
                    'others_bond_perc']

bond_rating_dct = {}

# Positional lookup: assumes the bond rating values start at the sixth
# span with class "Fl(end)", which is what goes wrong for some funds.
for n in range(len(bond_rating_list)):
    bond_rating_dct[bond_rating_list[n]] = dc.Decimal(profile_soup.select('span[class="Fl(end)"]')[n+5].text.replace('%',''))

beautiful soup sample:

URL='https://finance.yahoo.com/quote/TRARX/holdings?p=TRARX'

req = requests.get(URL)

soup = bs(req.text)

soup

data:

<span class="Mend(5px) Whs(nw)" data-reactid="225"><span data-reactid="226">US Government</span></span></span><span class="Fl(end)" data-reactid="227">0.00%</span></div><div class="Bdbw(1px) Bdbc($seperatorColor) Bdbs(s) H(25px) Pt(10px)" data-reactid="228"><span class="Fl(start)" data-reactid="229"><span class="Mend(5px) Whs(nw)" data-reactid="230"><span data-reactid="231">AAA</span></span></span><span class="Fl(end)" data-reactid="232">0.00%</span>

desired output example:

bond_rating_dct['us_government_bond_perc'] =0.00
bond_rating_dct['AAA_bond_perc'] =50.23

UPDATE: I’m running beautifulsoup4 4.6.3


Answer

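To hopefully be a little more robust, I would move away from matching on classes, which tend to be dynamic, and rely on the relationships between elements instead. I still use :contains to anchor on the h3 heading. Note that :has and :contains are handled by the Soup Sieve selector library, which Beautiful Soup uses from version 4.7.0 onwards, so the 4.6.3 mentioned in the question would need upgrading. Recent Soup Sieve releases also prefer the spelling :-soup-contains over :contains.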

I add a Session for the efficiency of TCP connection re-use. I also refactor the code into a function that takes a symbol as its argument and returns your desired dictionary, for ease of code re-use.

Here I use a mapping dictionary to produce the keys you want, which you can extend should new elements appear in that table.

from bs4 import BeautifulSoup as bs
import requests

def get_dict(symbol):
    bond_ratings = {}
    r = s.get(f'https://finance.yahoo.com/quote/{symbol}/holdings?p={symbol}')
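    soup = bs(r.content, 'lxml') # or 'html.parser' if lxml is not installed
    
    # Inside the div whose direct child h3 contains "Bond Ratings", match
    # elements that have a class attribute and are not a first child
    for i in soup.select('div:has(> h3:contains("Bond Ratings")) [class]:nth-child(n+2)'):
        # Each child of the element's first span wraps a label; the value is
        # the text of the next sibling of that child's parent
        for j in i.select_one('span'):
            bond_ratings[mappings[j.span.text]] = j.parent.next_sibling.text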
    return bond_ratings

mappings = { 
        'US Government': 'us_government_bond_perc',
        'AAA': 'AAA_bond_perc',
        'AA': 'AA_bond_perc',
        'A': 'A_bond_perc',
        'BBB': 'BBB_bond_perc',
        'BB': 'BB_bond_perc',
        'B': 'B_bond_perc',
        'Below B': 'below_B_bond_perc',
        'Others': 'others_bond_perc'
}

symbols = ['TRARX', 'FLCEX']

with requests.Session() as s:    
    for symbol in symbols:
        bond_rating_dct = get_dict(symbol)
        print(bond_rating_dct)
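
For anyone stuck on an older Beautiful Soup such as the 4.6.3 in the question, where select() does not support :has or :contains, the same label-to-value idea can probably be sketched with find() instead. This is only a minimal offline sketch: the HTML is a trimmed version of the question's sample with the wrapping div elements assumed and the percentages made up for illustration, and value_for_label is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Trimmed from the question's sample data. The wrapping <div> elements are
# assumed so the fragment is well-formed, and the values are illustrative.
SAMPLE_HTML = """
<div>
  <div><span class="Fl(start)"><span class="Mend(5px) Whs(nw)"><span>US Government</span></span></span><span class="Fl(end)">0.00%</span></div>
  <div><span class="Fl(start)"><span class="Mend(5px) Whs(nw)"><span>AAA</span></span></span><span class="Fl(end)">50.23%</span></div>
</div>
"""


def value_for_label(soup, label):
    """Return the percentage text in the same row as `label`, or None."""
    # Exact string match, so "A" does not also match "AA" or "AAA".
    label_span = soup.find("span", string=label)
    if label_span is None:
        return None
    # Climb to the row's <div>, then take its last direct child span,
    # which holds the value in this layout.
    row = label_span.find_parent("div")
    value_span = row.find_all("span", recursive=False)[-1]
    return value_span.get_text(strip=True)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(value_for_label(soup, "US Government"))
print(value_for_label(soup, "AAA"))
```

Because the lookup starts from the label text rather than from a position in a list of spans, it does not depend on how many other Fl(end) spans appear earlier on the page.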

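Should you decide to be more consistent in your output dictionary keys, and don't mind retrieving every item in that table that has a percentage, you can remove the mapping and simply use the following. Note that the keys are then derived from the labels in lower case, so AAA becomes aaa_bond_perc: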

from bs4 import BeautifulSoup as bs
import requests

def get_dict(symbol):
    bond_ratings = {}
    r = s.get(f'https://finance.yahoo.com/quote/{symbol}/holdings?p={symbol}')
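    soup = bs(r.content, 'lxml') # or 'html.parser' if lxml is not installed
    
    # Same selection as before; the key is now built from the label itself
    for i in soup.select('div:has(> h3:contains("Bond Ratings")) [class]:nth-child(n+2)'):
        for j in i.select_one('span'):
            # e.g. 'US Government' -> 'us_government_bond_perc'
            bond_ratings[j.span.text.replace(' ','_').lower() + '_bond_perc'] = j.parent.next_sibling.text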
    return bond_ratings

symbols = ['TRARX', 'FLCEX']

with requests.Session() as s:   
    for symbol in symbols:
        bond_rating_dct = get_dict(symbol)
        print(bond_rating_dct)
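
The question's desired output holds numbers rather than strings like '0.00%', so the dictionary returned by get_dict would still need converting. Below is a minimal sketch using only the standard library. Here to_decimal_percentages is a hypothetical helper name, and the 'N/A' entry is an assumed placeholder to show how an unparseable value is handled, not something taken from the page.

```python
from decimal import Decimal, InvalidOperation


def to_decimal_percentages(raw):
    """Convert values like '50.23%' to Decimal('50.23').

    Values that cannot be parsed as a number are stored as None.
    """
    converted = {}
    for key, text in raw.items():
        # Drop surrounding whitespace, the trailing percent sign, and any
        # thousands separators before parsing.
        cleaned = text.strip().rstrip('%').replace(',', '')
        try:
            converted[key] = Decimal(cleaned)
        except InvalidOperation:
            converted[key] = None
    return converted


# Illustrative input shaped like the output of get_dict (values assumed)
example = {
    'us_government_bond_perc': '0.00%',
    'AAA_bond_perc': '50.23%',
    'others_bond_perc': 'N/A',
}
print(to_decimal_percentages(example))
```

Keeping the conversion separate from the scraping means a layout change and a formatting change each fail in one obvious place.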
User contributions licensed under: CC BY-SA