I'm building a scraper for this website. I'm using Python and the Scrapy shell to extract the data that I want. The XPath would be: //a[@class="sb-card sb-card-company site-1x1 with-hover"]/@href
Using response.xpath('//a[@class="sb-card sb-card-company site-1x1 with-hover"]/@href')
returns []
I tried using contains(@class, "sb-card-company")
with the same result. Using other containers in the same way changed nothing. Using a different page also had no effect. Using hard nodes instead worked, but I'm curious about what I did wrong.
Answer
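It's not a problem with your XPath. It's a dynamically loaded content issue: the company cards are fetched by the page's JavaScript after the initial HTML arrives, so they are not in the HTML that Scrapy downloads.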
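One way to confirm this kind of problem is to check whether the element exists in the raw HTML at all, before any JavaScript runs. The sketch below uses only the standard library; `raw_html` and `rendered_html` are made-up stand-ins for what the server sends and what the browser shows, and the helper names are hypothetical.

```python
from html.parser import HTMLParser


class AnchorClassCollector(HTMLParser):
    """Collect the class tokens of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.anchor_classes = []  # one list of class tokens per <a> tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            class_value = dict(attrs).get("class") or ""
            self.anchor_classes.append(class_value.split())


def has_anchor_with_class(html, class_token):
    """Return True if any <a> tag in `html` carries `class_token`."""
    collector = AnchorClassCollector()
    collector.feed(html)
    return any(class_token in tokens for tokens in collector.anchor_classes)


# Assumed examples: the server response before scripts run, and the DOM
# as the browser shows it after the scripts have filled in the cards.
raw_html = (
    '<html><body><div id="app"></div>'
    '<script src="/app.js"></script></body></html>'
)
rendered_html = (
    '<html><body><div id="app">'
    '<a class="sb-card sb-card-company site-1x1 with-hover" '
    'href="/organization/creditshelf/">creditshelf</a>'
    '</div></body></html>'
)

print(has_anchor_with_class(raw_html, "sb-card-company"))       # False
print(has_anchor_with_class(rendered_html, "sb-card-company"))  # True
```

In the Scrapy shell you could pass `response.text` as the first argument, or call `view(response)` to open the downloaded HTML in a browser. If the cards are missing there, no XPath will find them.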
Here's an example of how you can get it from the JSON API instead, using the Scrapy shell:
scrapy shell

In [1]: url = 'https://www.startbase.de/api/companies/?format=json&display=small&sort=company.startbase_score&sort-direction=desc&page=1&limit=21&filters={%22company.type%22:%22startup%22,%22startup_profile.industry_id%22:[10]}'

In [2]: headers = {
   ...:     "Accept": "application/json",
   ...:     "Accept-Encoding": "gzip, deflate, br",
   ...:     "Accept-Language": "en-US,en;q=0.5",
   ...:     "Cache-Control": "no-cache",
   ...:     "Connection": "keep-alive",
   ...:     "Content-Type": "application/json",
   ...:     "DNT": "1",
   ...:     "Host": "www.startbase.de",
   ...:     "Pragma": "no-cache",
   ...:     "Referer": "https://www.startbase.de/startups/?listOptions%5Bcompany-startup%5D=%7B%22version%22%3A1.3%2C%22sort%22%3A%22company.startbase_score%22%2C%22sortDirection%22%3A%22desc%22%2C%22display%22%3A%22small%22%2C%22itemsPerPage%22%3A21%2C%22page%22%3A1%2C%22userLocation%22%3Anull%2C%22filters%22%3A%7B%22startup_profile.industry_id%22%3A%5B10%5D%7D%7D",
   ...:     "Sec-Fetch-Dest": "empty",
   ...:     "Sec-Fetch-Mode": "cors",
   ...:     "Sec-Fetch-Site": "same-origin",
   ...:     "Sec-GPC": "1",
   ...:     "TE": "trailers",
   ...:     "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
   ...:     "X-KL-Ajax-Request": "Ajax_Request"
   ...: }

In [3]: req = scrapy.Request(url=url, headers=headers)

In [4]: fetch(req)
2021-10-16 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.startbase.de/api/companies/?format=json&display=small&sort=company.startbase_score&sort-direction=desc&page=1&limit=21&filters=%7B%22company.type%22:%22startup%22,%22startup_profile.industry_id%22:[10]%7D> (referer: https://www.startbase.de/startups/?listOptions%5Bcompany-startup%5D=%7B%22version%22%3A1.3%2C%22sort%22%3A%22company.startbase_score%22%2C%22sortDirection%22%3A%22desc%22%2C%22display%22%3A%22small%22%2C%22itemsPerPage%22%3A21%2C%22page%22%3A1%2C%22userLocation%22%3Anull%2C%22filters%22%3A%7B%22startup_profile.industry_id%22%3A%5B10%5D%7D%7D)

In [5]: json_data = response.json()

In [6]: for company in json_data['body']['items']:
   ...:     print(company['company.url'])
   ...:
/organization/creditshelf/
/organization/amafin-gmbh/
/organization/fincompare/
/organization/epap/
/organization/clearvat/
/organization/51nodes/
/organization/altruja-gmbh/
/organization/flexvelop/
/organization/coin-analyst-ug/
/organization/caya/
/organization/rubarb/
/organization/memrange/
/organization/sevdesk-sevenit/
/organization/getsafe/
/organization/xavin/
/organization/giromatch/
/organization/digi-bel-projekt-von-meeting-minds/
/organization/digioptions/
/organization/trafinscout/
/organization/tangany-gmbh/
/organization/kiwi-financial-living/
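The API returns relative paths, and the URL carries a `page` parameter, so a natural next step is to build the URL per page and turn the paths into absolute links. This is only a sketch with the standard library: the query parameters are copied from the URL used in the session, while `build_api_url`, `company_links`, and the assumption that the server accepts a fully percent-encoded `filters` value are mine.

```python
import json
from urllib.parse import urlencode, urljoin

BASE_URL = "https://www.startbase.de"
API_PATH = "/api/companies/"


def build_api_url(page, industry_id=10, limit=21):
    """Build the listing URL for one result page.

    The parameter names and values mirror the URL from the shell session;
    other values are assumed to work the same way but are not verified.
    """
    filters = {
        "company.type": "startup",
        "startup_profile.industry_id": [industry_id],
    }
    query = {
        "format": "json",
        "display": "small",
        "sort": "company.startbase_score",
        "sort-direction": "desc",
        "page": page,
        "limit": limit,
        # Compact JSON, percent-encoded by urlencode below.
        "filters": json.dumps(filters, separators=(",", ":")),
    }
    return urljoin(BASE_URL, API_PATH) + "?" + urlencode(query)


def company_links(json_data):
    """Turn the relative 'company.url' values into absolute URLs."""
    return [
        urljoin(BASE_URL, item["company.url"])
        for item in json_data["body"]["items"]
    ]


# Small made-up payload with the same shape as the response in the session.
sample = {"body": {"items": [{"company.url": "/organization/creditshelf/"}]}}
print(company_links(sample))
print(build_api_url(page=2))
```

In the shell, `company_links(response.json())` would give the full links for the fetched page, and `build_api_url(page=2)` could be passed to `scrapy.Request` together with the same headers.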