Scrapy Python can't extract links with more stable XPath

Question

I'm building a scraper for this website. I'm using Python and the Scrapy shell to extract the data that I want. The XPath would be:

    //a[@class="sb-card sb-card-company site-1x1 with-hover"]/@href

Using response.xpath('//a[@class="sb-card sb-card-company site-1x1 with-hover"]/@href') returns []. I tried using contains(@class, "sb-card-company") with the same result. Using other containers in the same way changed nothing. Using a different page also had no effect. Using

Accepted Answer

It's not a problem with the XPath. It's a dynamically loaded content issue: the links are not in the HTML that Scrapy downloads, they are fetched afterwards from a JSON API.

Here's an example of how you can get it from the JSON file:

    scrapy shell

    In [1]: url='https://www.startbase.de/api/companies/?format=json&display=small&sort=company.startbase_score&sort-direction=desc&page=1&limit=21&filters={%22company.type%22:%22startup%22,%22startup_profile.industry_id%22:[10]}'

    In [2]: headers = {
       ...: "Accept": "application/json",
       ...: "Accept-Encoding": "gzip, deflate, br",
       ...: "Accept-Language": "en-US,en;q=0.5",
       ...: "Cache-Control": "no-cache",
       ...: "Connection": "keep-alive",
       ...: "Content-Type": "application/json",
       ...: "DNT": "1",
       ...: "Host": "www.startbase.de",
       ...: "Pragma": "no-cache",
       ...: "Referer": "https://www.startbase.de/startups/?listOptions%5Bcompany-startup%5D=%7B%22version%22%3A1.3%2C%22sort%22%3A%22company.startbase_score%22%2C%22sortDirection%22%3A%22desc%22%2C%22display%22%3A%22small%22%2C%22itemsPerPage%22%3A21%2C%22page%22%3A1%2C%22userLocation%22%3Anull%2C%22filters%22%3A%7B%22startup_profile.industry_id%22%3A%5B10%5D%7D%7D",
       ...: "Sec-Fetch-Dest": "empty",
       ...: "Sec-Fetch-Mode": "cors",
       ...: "Sec-Fetch-Site": "same-origin",
       ...: "Sec-GPC": "1",
       ...: "TE": "trailers",
       ...: "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
       ...: "X-KL-Ajax-Request": "Ajax_Request"
       ...: }

    In [3]: req = scrapy.Request(url=url, headers=headers)

    In [4]: fetch(req)
    2021-10-16 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.startbase.de/api/companies/?format=json&display=small&sort=company.startbase_score&sort-direction=desc&page=1&limit=21&filters=%7B%22company.type%22:%22startup%22,%22startup_profile.industry_id%22:[10]%7D> (referer: https://www.startbase.de/startups/?listOptions%5Bcompany-startup%5D=%7B%22version%22%3A1.3%2C%22sort%22%3A%22company.startbase_score%22%2C%22sortDirection%22%3A%22desc%22%2C%22display%22%3A%22small%22%2C%22itemsPerPage%22%3A21%2C%22page%22%3A1%2C%22userLocation%22%3Anull%2C%22filters%22%3A%7B%22startup_profile.industry_id%22%3A%5B10%5D%7D%7D)

    In [5]: json_data = response.json()

    In [6]: for company in json_data['body']['items']:
       ...:     print(company['company.url'])
       ...:
    /organization/creditshelf/
    /organization/amafin-gmbh/
    /organization/fincompare/
    /organization/epap/
    /organization/clearvat/
    /organization/51nodes/
    /organization/altruja-gmbh/
    /organization/flexvelop/
    /organization/coin-analyst-ug/
    /organization/caya/
    /organization/rubarb/
    /organization/memrange/
    /organization/sevdesk-sevenit/
    /organization/getsafe/
    /organization/xavin/
    /organization/giromatch/
    /organization/digi-bel-projekt-von-meeting-minds/
    /organization/digioptions/
    /organization/trafinscout/
    /organization/tangany-gmbh/
    /organization/kiwi-financial-living/
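If you want to reuse this outside the shell, the two moving parts (building the API URL for a given page and pulling the links out of the decoded JSON) could be sketched roughly like this. It only uses the standard library. The helper names `build_api_url` and `extract_company_urls` and the sample payload are made up for illustration, and the query parameters are simply copied from the request above, so treat them as assumptions about an API that may have changed.

```python
import json
from urllib.parse import urlencode, urljoin

BASE_URL = "https://www.startbase.de"  # site from the request above
API_PATH = "/api/companies/"           # endpoint from the request above


def build_api_url(page=1, limit=21, industry_id=10):
    """Build the JSON API URL for one result page (hypothetical helper)."""
    filters = {
        "company.type": "startup",
        "startup_profile.industry_id": [industry_id],
    }
    params = {
        "format": "json",
        "display": "small",
        "sort": "company.startbase_score",
        "sort-direction": "desc",
        "page": page,
        "limit": limit,
        # The filters travel as a compact JSON string inside the query string.
        "filters": json.dumps(filters, separators=(",", ":")),
    }
    return urljoin(BASE_URL, API_PATH) + "?" + urlencode(params)


def extract_company_urls(json_data):
    """Return absolute company URLs from a decoded API response."""
    items = json_data.get("body", {}).get("items", [])
    return [
        urljoin(BASE_URL, item["company.url"])
        for item in items
        if "company.url" in item
    ]


# Made-up payload with the same shape as the response shown above.
sample = {"body": {"items": [
    {"company.url": "/organization/creditshelf/"},
    {"company.url": "/organization/fincompare/"},
]}}

print(build_api_url(page=2))
print(extract_company_urls(sample))
```

In a spider you would pass `build_api_url(page)` to `scrapy.Request` (with at least the `Accept: application/json` header) and call `extract_company_urls(response.json())` in the callback, increasing `page` until the items list comes back empty.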

Answer