Accessing the dataLayer (JS variable) when scraping with Python

Question

I'm using Beautiful Soup to scrape a web page. I want to access the dataLayer (a JavaScript variable) that is present on the page. How can I retrieve it using Python?

Accepted Answer

You can parse it from the page source, using re to find the script tag that contains the JSON and json.loads to decode it:

    import re
    from json import loads

    import requests
    from bs4 import BeautifulSoup

    url = "http://www.allocine.fr/video/player_gen_cmedia=19561982&cfilm=144185.html"
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    # Find the <script> whose text declares dataLayer and keep what follows the first "= "
    script_text = soup.find("script", text=re.compile(r"var\s+dataLayer")).text.split("= ", 1)[1]
    # The JSON ends at the first semicolon
    json_data = loads(script_text[:script_text.find(";")])

Running it you see we get what you want:

    In [31]: from bs4 import BeautifulSoup
    In [32]: import re
    In [33]: from json import loads
    In [34]: import requests
    In [35]: url = "http://www.allocine.fr/video/player_gen_cmedia=19561982&cfilm=144185.html"
    In [36]: soup = BeautifulSoup(requests.get(url).content, "html.parser")
    In [37]: script_text = soup.find("script", text=re.compile(r"var\s+dataLayer")).text.split("= ", 1)[1]
    In [38]: json_data = loads(script_text[:script_text.find(";")])
    In [39]: json_data
    Out[39]:
    [{'actor': '403573,19358,22868,612492,418933,436500,46797,729453,66391,16893,211493,249636,18324,483703,1193,165792,231665,114167,139915,155111,258115,119842,610268,166263,597100,134791,520768,149470,734146,633703,684803,763372,673220,748361,178486,241328,517093,765381,693327,196630,758799,220756,550759,737383,263596,174710,118600,663153,463379,740361,702873,659451,779133,779134,779135,779136,779137,779138,779139,779140,779141,779142,779143,779144,779145,779146,779147,779241,779242,779243,779244',
      'director': '41198',
      'genre': '13025=action&13012=fantastique',
      'movie_distributors': 929,
      'movie_id': 144185,
      'movie_isshowtime': 1,
      'movie_label': 'suicide_squad',
      'nationality': '5002',
      'press_rating': 2,
      'releasedate': '2016-08-03',
      'site_route': 'moviepage_videos_trailer',
      'site_section': 'movie',
      'user_activity': 'videowatch',
      'user_rating': 3.4,
      'video_id': 19561982,
      'video_label': 'suicide_squad_bande_annonce_finale_vo',
      'video_type_id': 31003,
      'video_type_label': 'trailer'}]

You could also use a regex, but in this case using str.find to get the end of the data is sufficient.
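If the data might itself contain a semicolon, cutting at the first ";" would truncate the JSON. One possible alternative, sketched here using only the standard library, is to locate the assignment with a regex and let json.JSONDecoder.raw_decode work out where the value ends. The SAMPLE_HTML string and the extract_data_layer helper are hypothetical, made up for illustration, and the sketch assumes the dataLayer literal is valid JSON (double-quoted keys and strings), which is not guaranteed for arbitrary JavaScript.

```python
import json
import re

# Hypothetical page source standing in for the real response body.
SAMPLE_HTML = """
<html><head>
<script>var other = 1;</script>
<script>
var dataLayer = [{"movie_id": 144185, "movie_label": "suicide_squad", "note": "a;b"}];
</script>
</head><body></body></html>
"""

# Matches "var dataLayer =" plus any whitespace, so the match ends
# exactly where the JSON value begins.
DATALAYER_START = re.compile(r"var\s+dataLayer\s*=\s*")


def extract_data_layer(html):
    """Return the decoded dataLayer value, or None if the page has none."""
    match = DATALAYER_START.search(html)
    if match is None:
        return None
    # raw_decode parses a single JSON value starting at the given index
    # and ignores whatever follows it (here, the trailing ";").
    value, _end = json.JSONDecoder().raw_decode(html, match.end())
    return value


data_layer = extract_data_layer(SAMPLE_HTML)
print(data_layer[0]["movie_label"])
```

With a real page, you would pass requests.get(url).text to the helper instead of the sample string.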

