I am trying to scrape data from the HTML tables on this page and export it to a CSV.
The only success I've had is with extracting the headers. I thought the problem might be that the page is not fully loading before the data is scraped, hence my use of the 'requests_html' library, but the issue persists.
Here's the code I am using:
import requests
from bs4 import BeautifulSoup
import csv
from requests_html import HTMLSession

url = 'https://www.fidelitypensionmanagers.com/Home/PriceHistory'

s = HTMLSession()
r = s.get(url)
r.html.render(sleep=2)

soup = BeautifulSoup(r.content.decode('utf8'), "lxml")
table = soup.find("table", {"id": "fund-I"})  # to select the right table

# find all rows
rows = table.findAll('tr')

# strip the header from rows
headers = rows[0]
header_text = []

# add the table header text to array
for th in headers.findAll('th'):
    header_text.append(th.text)

# init row text array
row_text_array = []

# loop through rows and add row text to array
for row in rows[1:]:
    row_text = []
    # loop through the elements
    for row_element in row.findAll(['th', 'td']):
        # append the array with the elements inner text
        row_text.append(row_element.text.replace('\n', '').strip())
    # append the text array to the row text array
    row_text_array.append(row_text)

with open("out.csv", "w") as f:
    wr = csv.writer(f)
    wr.writerow(header_text)
    for row_text_single in row_text_array:
        wr.writerow(row_text_single)
Any help would be appreciated, thanks.
Answer
Your assumption is correct: the contents are only loaded when the corresponding accordion sections are opened, so you should start there and retrieve the data via these separate requests.
Simply change the value of the id parameter in https://www.fidelitypensionmanagers.com/Services/GetPriceHistory?id=1&format=json. You can get the ids from the onclick attributes:
<div class="accordion" id="accordionOne">
    <div class="card" onclick="getUnitPrices(12)"></div>
    <div class="card" onclick="getUnitPrices(1)"></div>
    <div class="card" onclick="getUnitPrices(13)"></div>
    <div class="card" onclick="getUnitPrices(2)"></div>
    <div class="card" onclick="getUnitPrices(9)"></div>
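If you want to collect every fund automatically, you can scrape those ids first and then call the JSON endpoint for each one. Here is a minimal sketch, assuming the accordion markup is present in the static HTML as shown above; the helper names fetch_fund_ids and fetch_price_history are mine, not part of the site:

import re
import requests
from bs4 import BeautifulSoup

BASE = 'https://www.fidelitypensionmanagers.com'

def fetch_fund_ids():
    # parse the overview page and pull every id passed to getUnitPrices(...)
    html = requests.get(f'{BASE}/Home/PriceHistory').text
    soup = BeautifulSoup(html, 'lxml')
    ids = []
    for card in soup.select('#accordionOne .card[onclick]'):
        match = re.search(r'getUnitPrices\((\d+)\)', card['onclick'])
        if match:
            ids.append(int(match.group(1)))
    return ids

def fetch_price_history(fund_id):
    # each fund's price history is served as JSON by this endpoint
    url = f'{BASE}/Services/GetPriceHistory'
    return requests.get(url, params={'id': fund_id, 'format': 'json'}).json()

if __name__ == '__main__':
    for fund_id in fetch_fund_ids():
        rows = fetch_price_history(fund_id)
        print(fund_id, len(rows), 'rows')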
Example
The simplest approach, in my opinion, would be to use pandas:
import pandas as pd

pd.read_json('https://www.fidelitypensionmanagers.com/Services/GetPriceHistory?id=1&format=json').to_csv('myfile.csv', index=False)
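If you want a separate CSV per fund, the same call works in a loop over the ids taken from the onclick handlers above (the filename pattern is just an example):

import pandas as pd

# ids taken from the onclick handlers shown above
for fund_id in (12, 1, 13, 2, 9):
    url = f'https://www.fidelitypensionmanagers.com/Services/GetPriceHistory?id={fund_id}&format=json'
    pd.read_json(url).to_csv(f'fund_{fund_id}.csv', index=False)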
Output
SCHEME_ID,VALUATION_DATE,BID_PRICE_BF,OFFER_PRICE_BF,TOTAL_UNITS,ASSET_VALUE,STOCKS_VALUE,UNQUOTED_VALUE,UNQUOTED_APPR,UNINVESTED_CASH,UNDISTR_INCOME,MM_VALUE,STAMP_DUTIES,NSE_CSCS,BROKERAGE_FEES,TOTAL_BID_VALUE,TOTAL_OFFER_VALUE,BID_PRICE,OFFER_PRICE,APPR_TO_DATE,ID,PRICE_MOVEMENT,PRICE_CHANGE
1,/Date(1665644400000)/,0,0,22197476839.3793,93586560820.23,6562586060.8,0,0,395651878.65,0,9143340045.7,0,0,0,93586782102.51,93586782102.51,4.2161,4.2161,105797375.82,77484,,
1,/Date(1665558000000)/,0,0,22178176379.013,93502175535.94,6549468002.14,0,0,274281137.71,0,9179939634.67,0,0,0,93503191613.92,93503191613.92,4.216,4.216,92679317.16,77463,,
1,/Date(1665471600000)/,0,0,22173955797.4287,93469684659.81,6557409158.97,0,0,251510018.56,0,9177109809.92,0,0,0,93469875872.9,93469875872.9,4.2153,4.2153,100620473.99,77453,,
1,/Date(1665385200000)/,0,0,22164552155.3263,93418356788.97,6565624234.22,0,0,168155797.81,0,9174279985.2,0,0,0,93419154424.27,93419154424.27,4.2148,4.2148,108835549.24,77413,,
1,/Date(1665298800000)/,0,0,22164552155.3263,93396581110.67,6565624234.22,0,0,168155797.81,0,9171485395.41,0,0,0,93396989872.11,93396989872.11,4.2138,4.2138,108835549.24,77393,,
1,/Date(1665212400000)/,0,0,22164552155.3263,93374808849.27,6565624234.22,0,0,168155797.81,0,9168690805.6,0,0,0,93374825319.96,93374825319.96,4.2128,4.2128,108835549.24,77373,,
1,/Date(1665126000000)/,0,0,22164552155.3263,93353040003.64,6565624234.22,0,0,168155797.81,0,9165896215.82,0,0,0,93352660767.8,93352660767.8,4.2118,4.2118,108835549.24,77353,,
...
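As you can see, the VALUATION_DATE column is serialized in ASP.NET's /Date(milliseconds)/ form. If you want readable dates in the CSV, here is a small sketch that strips the wrapper and converts the epoch milliseconds, assuming the column stays a plain string after read_json as in the output above:

import pandas as pd

url = 'https://www.fidelitypensionmanagers.com/Services/GetPriceHistory?id=1&format=json'
df = pd.read_json(url)

# VALUATION_DATE arrives as '/Date(1665644400000)/' (epoch milliseconds);
# pull out the digits and convert them before writing the CSV
ms = df['VALUATION_DATE'].str.extract(r'(\d+)')[0].astype('int64')
df['VALUATION_DATE'] = pd.to_datetime(ms, unit='ms')

df.to_csv('myfile.csv', index=False)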