I'm using Beautiful Soup to scrape a webpage. I want to access the dataLayer (a JavaScript variable) that is present on this page. How can I retrieve it using Python?
Answer
You can parse it from the page source: use re to find the script tag that declares the variable, then json.loads to decode the JSON it contains:
from bs4 import BeautifulSoup
import re
from json import loads
import requests

url = "http://www.allocine.fr/video/player_gen_cmedia=19561982&cfilm=144185.html"

soup = BeautifulSoup(requests.get(url).content, "html.parser")

# Find the script tag that declares dataLayer and keep everything after the first "= "
script_text = soup.find("script", text=re.compile(r"var\s+dataLayer")).text.split("= ", 1)[1]

# The JSON ends at the first ";"
json_data = loads(script_text[:script_text.find(";")])
Running it shows that we get the data you want:
In [31]: from bs4 import BeautifulSoup
In [32]: import re
In [33]: from json import loads
In [34]: import requests

In [35]: url = "http://www.allocine.fr/video/player_gen_cmedia=19561982&cfilm=144185.html"

In [36]: soup = BeautifulSoup(requests.get(url).content, "html.parser")

In [37]: script_text = soup.find("script", text=re.compile(r"var\s+dataLayer")).text.split("= ", 1)[1]

In [38]: json_data = loads(script_text[:script_text.find(";")])

In [39]: json_data
Out[39]:
[{'actor': '403573,19358,22868,612492,418933,436500,46797,729453,66391,16893,211493,249636,18324,483703,1193,165792,231665,114167,139915,155111,258115,119842,610268,166263,597100,134791,520768,149470,734146,633703,684803,763372,673220,748361,178486,241328,517093,765381,693327,196630,758799,220756,550759,737383,263596,174710,118600,663153,463379,740361,702873,659451,779133,779134,779135,779136,779137,779138,779139,779140,779141,779142,779143,779144,779145,779146,779147,779241,779242,779243,779244',
  'director': '41198',
  'genre': '13025=action&13012=fantastique',
  'movie_distributors': 929,
  'movie_id': 144185,
  'movie_isshowtime': 1,
  'movie_label': 'suicide_squad',
  'nationality': '5002',
  'press_rating': 2,
  'releasedate': '2016-08-03',
  'site_route': 'moviepage_videos_trailer',
  'site_section': 'movie',
  'user_activity': 'videowatch',
  'user_rating': 3.4,
  'video_id': 19561982,
  'video_label': 'suicide_squad_bande_annonce_finale_vo',
  'video_type_id': 31003,
  'video_type_label': 'trailer'}]
You could also use a regex to extract the JSON, but in this case using str.find to locate the end of the data is sufficient.
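For reference, here is a rough sketch of a regex-based variant. It is not the approach used above: it uses a regex only to locate the start of the assignment and then lets json.JSONDecoder().raw_decode read a single JSON value from that position, so a ";" inside a string value would not cut the data short. The SAMPLE_HTML string, the extract_data_layer helper and the "note" key are made up for illustration, and the sketch assumes the assigned value is strict JSON rather than an arbitrary JavaScript literal.

```python
import json
import re

# Hypothetical stand-in for the downloaded page source.
# The "note" key is invented to show a ";" inside a string value.
SAMPLE_HTML = """
<html><head>
<script>var dataLayer = [{"movie_id": 144185, "movie_label": "suicide_squad", "note": "a;b"}];</script>
</head></html>
"""


def extract_data_layer(html):
    """Return the value assigned to dataLayer, or None if no assignment is found."""
    # Match up to and including the whitespace after "=", so the match
    # ends exactly where the JSON value begins.
    match = re.search(r"var\s+dataLayer\s*=\s*", html)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (here the trailing ";" and "</script>").
    value, _end = json.JSONDecoder().raw_decode(html, match.end())
    return value


data = extract_data_layer(SAMPLE_HTML)
print(data[0]["movie_label"])
```

With a real page you would pass the response text (for example requests.get(url).text) instead of SAMPLE_HTML.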