I’m new to Python and I’m trying to scrape an HTML page with a Scrapy spider, but the spider yields nothing. What’s wrong here? Thanks in advance for any help.
The url:
My spider:
import scrapy

class lngspider(scrapy.Spider):
    name = 'scrapylng'
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'
    start_urls = ['https://directory.lubesngreases.com/LngMain/includes/themes/MuraBootstrap3/remote/api?fn=searchcompany&name&query&STATE&brand&COUNTRY&query2&mode=advanced&filters=%7B%7D&page=1&datatype=html']

    def parse(self,response):
        for company in response.css('div.company-item row'):
            yield{
                'name' : products.css('class.CompanyHead').get()
            }
Output:
(workenv) C:Usersseanllngscraperlngscraper>scrapy crawl scrapylng
2022-05-26 21:53:12 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: lngscraper)
2022-05-26 21:53:12 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.4.0, Python 3.10.4 (tags/v3.10.4:9d38120, Mar 23 2022, 23:13:41) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 22.0.0 (OpenSSL 3.0.3 3 May 2022), cryptography 37.0.2, Platform Windows-10-10.0.19043-SP0
2022-05-26 21:53:12 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'lngscraper', 'NEWSPIDER_MODULE': 'lngscraper.spiders', 'SPIDER_MODULES': ['lngscraper.spiders']}
2022-05-26 21:53:12 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-05-26 21:53:12 [scrapy.extensions.telnet] INFO: Telnet Password: 5b71199b20af863b
2022-05-26 21:53:12 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.logstats.LogStats']
2022-05-26 21:53:12 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-05-26 21:53:12 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-05-26 21:53:12 [scrapy.middleware] INFO: Enabled item pipelines: []
2022-05-26 21:53:12 [scrapy.core.engine] INFO: Spider opened
2022-05-26 21:53:12 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-05-26 21:53:12 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-05-26 21:53:13 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://directory.lubesngreases.com/LngMain/includes/themes/MuraBootstrap3/remote/api/?fn=searchcompany&name&query&STATE&brand&COUNTRY&query2&mode=advanced&filters=%7B%7D&page=1&datatype=html> from <GET https://directory.lubesngreases.com/LngMain/includes/themes/MuraBootstrap3/remote/api?fn=searchcompany&name&query&STATE&brand&COUNTRY&query2&mode=advanced&filters=%7B%7D&page=1&datatype=html>
2022-05-26 21:53:15 [filelock] DEBUG: Attempting to acquire lock 2667801190576 on C:Usersseanlpythonscriptsworkenvlibsite-packagestldextract.suffix_cache/publicsuffix.org-tldsde84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-05-26 21:53:15 [filelock] DEBUG: Lock 2667801190576 acquired on C:Usersseanlpythonscriptsworkenvlibsite-packagestldextract.suffix_cache/publicsuffix.org-tldsde84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-05-26 21:53:15 [filelock] DEBUG: Attempting to release lock 2667801190576 on C:Usersseanlpythonscriptsworkenvlibsite-packagestldextract.suffix_cache/publicsuffix.org-tldsde84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-05-26 21:53:15 [filelock] DEBUG: Lock 2667801190576 released on C:Usersseanlpythonscriptsworkenvlibsite-packagestldextract.suffix_cache/publicsuffix.org-tldsde84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-05-26 21:53:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://directory.lubesngreases.com/LngMain/includes/themes/MuraBootstrap3/remote/api/?fn=searchcompany&name&query&STATE&brand&COUNTRY&query2&mode=advanced&filters=%7B%7D&page=1&datatype=html> (referer: None)
2022-05-26 21:53:15 [scrapy.core.engine] INFO: Closing spider (finished)
2022-05-26 21:53:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {'downloader/request_bytes': 925, 'downloader/request_count': 2, 'downloader/request_method_count/GET': 2, 'downloader/response_bytes': 15651, 'downloader/response_count': 2, 'downloader/response_status_count/200': 1, 'downloader/response_status_count/301': 1, 'elapsed_time_seconds': 2.974988, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2022, 5, 26, 13, 53, 15, 282689), 'httpcompression/response_bytes': 67300, 'httpcompression/response_count': 1, 'log_count/DEBUG': 7, 'log_count/INFO': 10, 'response_received_count': 1, 'scheduler/dequeued': 2, 'scheduler/dequeued/memory': 2, 'scheduler/enqueued': 2, 'scheduler/enqueued/memory': 2, 'start_time': datetime.datetime(2022, 5, 26, 13, 53, 12, 307701)}
2022-05-26 21:53:15 [scrapy.core.engine] INFO: Spider closed (finished)
Answer
I added print('url:', response.url) to parse() and confirmed that Scrapy does run this function, so the request itself succeeds. The problems are in the selectors.
The first problem is that you use the CSS selector in the wrong way. This div has two classes, company-item and row, so you have to chain them with two dots and no space:
div.company-item.row
You used
div.company-item row
which means a <row> element nested inside <div class="company-item">.
The second problem is that you use the variable products, which doesn’t exist. The loop variable is company, so it has to be company.css() instead of products.css().
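This probably also explains why the log shows no NameError: Python looks up a name only when the line actually runs, and because the selector matched nothing, the loop body never executed. A small sketch of that behaviour follows; the function collect_names is hypothetical and plain lists stand in for the selector results.

```python
def collect_names(companies):
    """Mimic the loop in parse(); `products` is deliberately undefined."""
    names = []
    for company in companies:
        names.append(products.upper())  # NameError, but only if this line runs
    return names

# Empty input: the loop body is skipped, so the bug stays hidden.
hidden = collect_names([])

# Non-empty input: the undefined name is finally looked up.
try:
    collect_names(['Acme Lubricants'])
    raised = False
except NameError:
    raised = True

print(hidden, raised)
```

Once the selector is fixed and the loop starts iterating, the undefined name would surface as an error, so both fixes are needed together.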
The third problem is in company.css(): class is not a tag name, so the selector has to start with span instead, or you can leave the tag name out and keep only the class:
company.css('span.CompanyHead')
company.css('.CompanyHead')
But this gives you the whole HTML element. To get only the text, you need the pseudo-selector ::text:
company.css('span.CompanyHead::text')
import scrapy

class lngspider(scrapy.Spider):
    name = 'scrapylng'
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'
    start_urls = ['https://directory.lubesngreases.com/LngMain/includes/themes/MuraBootstrap3/remote/api?fn=searchcompany&name&query&STATE&brand&COUNTRY&query2&mode=advanced&filters=%7B%7D&page=1&datatype=html']

    def parse(self, response):
        print('url:', response.url)

        # see HTML
        #print(response.body.decode())

        # save HTML in file to see it later in browser
        #with open('output.html', 'wb') as f:
        #    f.write(response.body)

        for company in response.css('div.company-item.row'):
            name = company.css('span.CompanyHead::text').get()
            print('name:', name)
            yield {
                'name': name,
            }

# --- run without project and save in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'FEEDS': {'output.csv': {'format': 'csv'}},
})
c.crawl(lngspider)
c.start()