Extracting specific string matches from a Stock Website page

I am trying to scrape stock market caps using the code below. I first tried to fetch the list of market cap values with bs4. When I used print(x.find('span',{'class': 'Trsdu(0.3s)'}).text) for this, I got the error AttributeError: 'NoneType' object has no attribute 'text'.

for x in marketCapArray:
    print(x.find('span', {'class': 'Trsdu(0.3s)'}).text)

I did not know how to resolve this error in my code, so as an alternative I used regex to extract the required values, as shown below.

Main Code

import bs4
import re
import requests
from bs4 import BeautifulSoup
from urllib.request import urlopen

def pickTopGainers():
  url =  'https://in.finance.yahoo.com/gainers?offset=0&count=100'
  page = urlopen(url)
  soup = bs4.BeautifulSoup(page,"html.parser")
  marketCapArray = soup.find_all('td', {'class': 'Va(m) Ta(end) Pstart(20px) Pend(10px) W(120px) Fz(s)',
 'aria-label': 'Market cap'})
  print(str(marketCapArray))
  xi = re.findall("........</span>", str(marketCapArray)) # regex-use-1
  pi = re.sub(r'(</span>|....>N/A|>|")', "", str(xi))  # single-quoted pattern so the literal " does not end the string
  print(pi)

pickTopGainers()

Results

This is what print(str(marketCapArray)) outputs (only part of it is pasted here).

[<td aria-label="Market cap" class="Va(m) Ta(end) Pstart(20px) Pend(10px) W(120px) Fz(s)" colspan="" data-reactid="93"><span class="Trsdu(0.3s)" data-reactid="94">159.404M</span></td>, 
<td aria-label="Market cap" class="Va(m) Ta(end) Pstart(20px) Pend(10px) W(120px) Fz(s)" colspan="" data-reactid="119"><span class="Trsdu(0.3s)" data-reactid="120">533.97M</span></td>, 
<td aria-label="Market cap" class="Va(m) Ta(end) Pstart(20px) Pend(10px) W(120px) Fz(s)" colspan="" data-reactid="145"><span data-reactid="146">N/A</span></td>, 
<td aria-label="Market cap" class="Va(m) Ta(end) Pstart(20px) Pend(10px) W(120px) Fz(s)" colspan="" data-reactid="171"><span class="Trsdu(0.3s)" data-reactid="172">2.952B</span></td>, 
<td aria-label="Market cap" class="Va(m) Ta(end) Pstart(20px) Pend(10px) W(120px) Fz(s)" colspan="" data-reactid="197"><span class="Trsdu(0.3s)" data-reactid="198">9.223B</span></td>, 
<td aria-label="Market cap" class="Va(m) Ta(end) Pstart(20px) Pend(10px) W(120px) Fz(s)" colspan="" data-reactid="223"><span data-reactid="224">N/A</span></td>]

This is the output of print(pi), which is also the final output I want.

['159.404M', '533.97M', '', '2.952B', '9.223B', '']


Question

How can I avoid the regex replace (re.sub) in the Main Code above and still get the final output pi? Or, if there is a better approach, what is it? I feel my regex is unpleasant.

Answer

You can iterate row by row over the <table> that holds all of the data. For example:

import requests
from bs4 import BeautifulSoup


url = 'https://in.finance.yahoo.com/gainers?offset=0&count=100'

soup = BeautifulSoup(requests.get(url).content, 'html.parser')

fmt_string = '{:<15} {:<60} {:<10} {:<10} {:<10} {:<10} {:<10} {:<10} {:<10}'
print(fmt_string.format('Symbol', 'Name', 'Price(int)', 'Change', '% change', 'Volume', 'AvgVol(3M)', 'Market Cap', 'PE ratio'))
for row in soup.select('table:has(a[href*="/quote/"]) > tbody > tr'):
    cells = [td.get_text(strip=True) for td in row.select('td')]
    print(fmt_string.format(*cells[:-1]))

Prints:

Symbol          Name                                                         Price(int) Change     % change   Volume     AvgVol(3M) Market Cap PE ratio  
CCCL.NS         Consolidated Construction Consortium Limited                 0.2000     +0.0500    +33.33%    57,902     290,154    159.404M   N/A       
KSERASERA.NS    KSS Limited                                                  0.2500     +0.0500    +25.00%    1.607M     2.601M     533.97M    N/A       
BONLON.BO       BONLON INDUSTRIES LIMITED                                    21.60      +3.60      +20.00%    16,000     N/A        N/A        N/A       
MENONBE.NS      Menon Bearings Limited                                       52.80      +8.80      +20.00%    2.334M     65,713     2.952B     25.05     
RPOWER.NS       Reliance Power Limited                                       3.3000     +0.5500    +20.00%    127.814M   18.439M    9.223B     N/A       
11DPD.BO        Nippon India Mutual Fund                                     0.0600     +0.0100    +20.00%    190        N/A        N/A        N/A       
ABFRLPP-E1.NS   Aditya Birla Rs.5 ppd up                                     105.65     +17.60     +19.99%    1.238M     N/A        N/A        N/A       
500110.BO       Chennai Petroleum Corporation Limited                        64.55      -0.15      -0.23%     42,765     61,584     9.612B     N/A       
ABFRLPP.BO      Aditya Birla Fashion and Retai                               106.05     +17.65     +19.97%    387,703    N/A        N/A        N/A       
RADIOCITY.NS    Music Broadcast Limited                                      21.35      +3.55      +19.94%    12.657M    1.013M     7.38B      124.13    
RADIOCITY.BO    Music Broadcast Limited                                      21.35      +3.55      +19.94%    898,070    90,236     7.38B      124.13    
MENONBE.BO      Menon Bearings Limited                                       52.65      +8.75      +19.93%    137,065    8,648      2.951B     24.98     
MTNL.BO         Mahanagar Telephone Nigam Limited                            10.72      +1.78      +19.91%    1.142M     156,275    6.754B     N/A       

...and so on.
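
If only the market cap list from the question is needed, the same idea works on that single column. The AttributeError most likely comes from the N/A cells: their <span> has no Trsdu(0.3s) class, so find() returns None for them. Reading the text of the <td> itself avoids both the class lookup and the regex. The following is a minimal sketch, not a definitive implementation. It runs on a shortened copy of the HTML pasted in the question rather than the live page, and the names SAMPLE_HTML and market_caps are made up for illustration.

```python
from bs4 import BeautifulSoup

# Three cells modelled on the question's output (long class attributes on <td> omitted).
SAMPLE_HTML = """
<table><tbody><tr>
<td aria-label="Market cap"><span class="Trsdu(0.3s)">159.404M</span></td>
<td aria-label="Market cap"><span>N/A</span></td>
<td aria-label="Market cap"><span class="Trsdu(0.3s)">2.952B</span></td>
</tr></tbody></table>
"""


def market_caps(html):
    """Return the text of every 'Market cap' cell, mapping 'N/A' to ''."""
    soup = BeautifulSoup(html, "html.parser")
    caps = []
    for td in soup.select('td[aria-label="Market cap"]'):
        # get_text() works whether or not the inner <span> carries a class.
        text = td.get_text(strip=True)
        caps.append("" if text == "N/A" else text)
    return caps


print(market_caps(SAMPLE_HTML))  # ['159.404M', '', '2.952B']
```

To apply this to the real page, pass requests.get(url).content instead of SAMPLE_HTML, assuming the page still marks those cells with aria-label="Market cap".
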
User contributions licensed under: CC BY-SA