I want to scrape data from this page (and pages similar to it): https://cereals.ahdb.org.uk/market-data-centre/historical-data/feed-ingredients.aspx
This page uses Power BI. Unfortunately, finding a way to scrape Power BI is hard, because everyone wants to scrape using/into Power BI, not from it. The closest answer was this question. Yet unrelated.
Firstly, I used Apache tika, and soon I realized the table data is been loading after loading the page. I need the rendered version of the page.
Therefore, I used Selenium. I wanted to Select All
at the begining (sending Ctrl+A
key combination), but it doesn’t work. Maybe it is restricted by the page events (I also tried to remove all the events using developer tools, yet still Ctrl+A
doesn’t work.
I also tried to read the HTML contents, but Power BI puts div
elements on the screen using position:absolute
and distinguishing the location of a div
in the table (both row and column) is an effortful activity.
Since Power BI uses JSON, I tried to read data from there. However it is so complicated I couldn’t find out the rules. It seems it puts keywords somewhere and uses their indices in the table.
Note: I realized that all of the data is not loaded and even shown at the same time. A div
of class scroll-bar-part-bar
is responsible to act as a scroll bar, and moving that loads/shows other parts of the data.
The code I used to read data is as follows. As mentioned, the order of the produced data differs from what is rendered on the browser:
from selenium import webdriver from selenium.webdriver.common.keys import Keys options = webdriver.ChromeOptions() options.binary_location = "C:/Program Files (x86)/Google/Chrome/Application/chrome.exe" driver = webdriver.Chrome(options=options, executable_path="C:/Drivers/chromedriver.exe") driver.get("https://app.powerbi.com/view?r=eyJrIjoiYjVjM2MyNjItZDE1Mi00OWI1LWE5YWYtODY4M2FhYjU4ZDU1IiwidCI6ImExMmNlNTRiLTNkM2QtNDM0Ni05NWVmLWZmMTNjYTVkZDQ3ZCJ9") parent = driver.find_element_by_xpath('//*[@id="pvExplorationHost"]/div/div/div/div[2]/div/div[2]/div[2]/visual-container[4]/div/div[3]/visual/div') children = parent.find_elements_by_xpath('.//*') values = [child.get_attribute('title') for child in children]
I appreciate solutions for any of the above problems. The most interesting for me though, is the convention of storing Power BI data in JSON format.
Advertisement
Answer
Putting the scroll part and the JSON aside, I managed to read the data. The key is to read all of the elements inside the parent (which is done in the question):
parent = driver.find_element_by_xpath('//*[@id="pvExplorationHost"]/div/div/div/div[2]/div/div[2]/div[2]/visual-container[4]/div/div[3]/visual/div') children = parent.find_elements_by_xpath('.//*')
Then sort them using their location:
x = [child.location['x'] for child in children] y = [child.location['y'] for child in children] index = np.lexsort((x,y))
To sort what we have read in different lines, this code may help:
rows = [] row = [] last_line = y[index[0]] for i in index: if last_line != y[i]: row.append[children[i].get_attribute('title')] else: rows.append(row) row = list([children[i].get_attribute('title')] rows.append(row)