
Accessing the dataLayer (JS variable) when scraping with python

I'm using Beautiful Soup to scrape a webpage. I want to access the dataLayer (a JavaScript variable) that is present on this page. How can I retrieve it using Python?

Answer

You can parse it from the page source: use re to find the script tag that declares dataLayer, then json.loads to decode the JSON it contains:

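from bs4 import BeautifulSoup
import re
from json import loads
import requests

url = "http://www.allocine.fr/video/player_gen_cmedia=19561982&cfilm=144185.html"

soup = BeautifulSoup(requests.get(url).content, "html.parser")

# Find the script tag that declares dataLayer and keep everything after the first "= ".
script_text = soup.find("script", text=re.compile(r"var\s+dataLayer")).text.split("= ", 1)[1]

# The JSON ends at the first ";", so slice up to it before decoding.
json_data = loads(script_text[:script_text.find(";")])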

Running it gives the data you want:

In [31]: from bs4 import BeautifulSoup
In [32]: import re    
In [33]: from json import loads    
In [34]: import requests

In [35]: url = "http://www.allocine.fr/video/player_gen_cmedia=19561982&cfilm=144185.html"

In [36]: soup = BeautifulSoup(requests.get(url).content, "html.parser")

In [37]: script_text = soup.find("script", text=re.compile(r"var\s+dataLayer")).text.split("= ", 1)[1]

In [38]: json_data = loads(script_text[:script_text.find(";")])

In [39]: json_data
Out[39]: 
[{'actor': '403573,19358,22868,612492,418933,436500,46797,729453,66391,16893,211493,249636,18324,483703,1193,165792,231665,114167,139915,155111,258115,119842,610268,166263,597100,134791,520768,149470,734146,633703,684803,763372,673220,748361,178486,241328,517093,765381,693327,196630,758799,220756,550759,737383,263596,174710,118600,663153,463379,740361,702873,659451,779133,779134,779135,779136,779137,779138,779139,779140,779141,779142,779143,779144,779145,779146,779147,779241,779242,779243,779244',
  'director': '41198',
  'genre': '13025=action&13012=fantastique',
  'movie_distributors': 929,
  'movie_id': 144185,
  'movie_isshowtime': 1,
  'movie_label': 'suicide_squad',
  'nationality': '5002',
  'press_rating': 2,
  'releasedate': '2016-08-03',
  'site_route': 'moviepage_videos_trailer',
  'site_section': 'movie',
  'user_activity': 'videowatch',
  'user_rating': 3.4,
  'video_id': 19561982,
  'video_label': 'suicide_squad_bande_annonce_finale_vo',
  'video_type_id': 31003,
  'video_type_label': 'trailer'}]
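
Slicing up to the first ";" works here because none of the values contain a semicolon. If one did, the slice would cut the JSON short. A minimal sketch of one way around that, using the standard library's json.JSONDecoder.raw_decode, which decodes the first JSON value in a string and ignores whatever follows. The sample script text and the helper name decode_first_json are made up for illustration and are not taken from the page:

```python
import json

# Hypothetical script text (not the real page); note the ";" inside a string value.
SAMPLE_SCRIPT = 'var dataLayer = [{"video_label": "a;b", "video_id": 19561982}];\nvar other = 2;'


def decode_first_json(script_text):
    """Decode the first JSON value after the first '=' and ignore the rest."""
    _, _, tail = script_text.partition("=")
    tail = tail.lstrip()  # raw_decode expects the JSON value to start at index 0
    value, _end = json.JSONDecoder().raw_decode(tail)
    return value


data_layer = decode_first_json(SAMPLE_SCRIPT)
print(data_layer[0]["video_label"])
```

This assumes, as the answer's code does, that the value assigned to dataLayer is valid JSON. If the text after "=" is not JSON, raw_decode raises json.JSONDecodeError.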

You could also use a regex to extract the data, but in this case using str.find to locate the end of it is sufficient.
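
For comparison, a rough sketch of what the regex variant might look like. The pattern, the sample script text and the helper name extract_data_layer are assumptions for illustration. The non-greedy match stops at the first "]" that is followed by ";", so it would need adjusting if that sequence could appear inside the data itself:

```python
import json
import re

# Hypothetical script text (not the real page), shaped like the inline script.
SAMPLE_SCRIPT = 'var dataLayer = [{"movie_id": 144185, "movie_label": "suicide_squad"}];\nvar other = 1;'

# Capture the array between "var dataLayer =" and the first "]" followed by ";".
DATALAYER_RE = re.compile(r"var\s+dataLayer\s*=\s*(\[.*?\])\s*;", re.DOTALL)


def extract_data_layer(script_text):
    """Return the decoded dataLayer list, or None if the pattern is absent."""
    match = DATALAYER_RE.search(script_text)
    if match is None:
        return None
    return json.loads(match.group(1))


data_layer = extract_data_layer(SAMPLE_SCRIPT)
print(data_layer[0]["movie_label"])
```

Returning None when nothing matches lets the caller tell "no dataLayer on this page" apart from a JSON decoding error.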
