Scrapy extracting entire HTML element instead of following link

Question

I&#8217;m trying to access or follow every link that appears for commercial contractors from this website: https://lslbc.louisiana.gov/contractor-search/search-type-contractor/ then extract the emails from the sites that each link leads to but when I run this script, scrapy follows the base url with the entir…

Accepted Answer

The webpage contains its in-built search option. Whenever you search by selecting the commercial contractors then data is loaded dynamically by JS via API as json format alomg with GET method.That&#8217;s why you can&#8217;t get the desired data from the plain HTML DOM.Full working Code as an example:import scrapyimport jsonclass TestSpider(scrapy.Spider):    name = 'test'       def start_requests(self):        headers= {            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',            'x-requested-with': 'XMLHttpRequest'        }        url='https://lslbc.louisiana.gov/wp-admin/admin-ajax.php?api_action=advanced&contractor_type=Commercial+License&classification=&action=api_actions'        yield scrapy.Request(            url=url,            headers=headers,            callback= self.parse,            method="GET")    def parse(self, response):               resp = json.loads(response.body)        for item in resp['results']:            api_url = 'https://lslbc.louisiana.gov/wp-admin/admin-ajax.php?action=api_actions&api_action=company_details&company_id='+item['id']                        yield scrapy.Request(                url= api_url,                callback= self.parse_email,                method="GET"                             )        def parse_email(self, response):            resp2 = json.loads(response.body)        yield {            'Email':resp2['email_address']        }

Advertisement

Answer