I am scraping some links from a website using a Scrapy spider.
    # image urls
    look_inside_image_urls = response.xpath('//ul[@class="list-unstyled pages"]/li').extract_first()
    for i in look_inside_image_urls:
        print("============> look_inside_image_urls ===============>", i)
But I get a NoneType value. There can be any number of image links inside the li elements, and I want to download them in a loop.
This is my HTML code
    <div class="lookInsideDiv" style="display: block;">
      <div class="exitBtn"><i class="ion-close-round"></i></div>
      <div class="pagesArea">
        <ul class="list-unstyled pages">
          <li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/fc955fd4b_117698-1.jpg"></li>
          <li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/11f94595e_117698-2.jpg"></li>
          <li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/555959ec2_117698-3.jpg"></li>
          <li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/81b071d0c_117698-4.jpg"></li>
          <li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/30ef8b806_117698-5.jpg"></li>
          <li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/6cb40391f_117698-6.jpg"></li>
          <li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/a41c97880_117698-7.jpg"></li>
          <li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/d1a4bff6e_117698-8.jpg"></li>
          <li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/9503cfda1_117698-9.jpg"></li>
          <li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/54f1774ee_117698-10.jpg"></li>
        </ul>
      </div>
    </div>
I just want to get all the links inside the li elements, like this:
    https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/fc955fd4b_117698-1.jpg
    https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/fc955fd4b_117698-1.jpg
    https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/fc955fd4b_117698-1.jpg
    https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/fc955fd4b_117698-1.jpg
Answer
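Try this. To extract all the images, use extract() (which returns a list) instead of extract_first() (which returns only the first item), and point the XPath at the src attribute of each img rather than at the li element.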
    look_inside_image_urls = response.xpath('//ul[@class="list-unstyled pages"]/li/img/@src').extract()
    for i in look_inside_image_urls:
        print("============> look_inside_image_urls ===============>", i)
Edit
    from scrapy.selector import Selector

    html = """<div class="lookInsideDiv" style="display: block;">
      <div class="exitBtn"><i class="ion-close-round"></i></div>
      <div class="pagesArea">
        <ul class="list-unstyled pages">
          <li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/fc955fd4b_117698-1.jpg"></li>
          <li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/11f94595e_117698-2.jpg"></li>
          <li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/555959ec2_117698-3.jpg"></li>
          <li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/81b071d0c_117698-4.jpg"></li>
          <li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/30ef8b806_117698-5.jpg"></li>
          <li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/6cb40391f_117698-6.jpg"></li>
          <li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/a41c97880_117698-7.jpg"></li>
          <li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/d1a4bff6e_117698-8.jpg"></li>
          <li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/9503cfda1_117698-9.jpg"></li>
          <li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/54f1774ee_117698-10.jpg"></li>
        </ul>
      </div>
    </div>"""

    data = Selector(text=html)
    look_inside_image_urls = data.xpath('//*/ul[@class="list-unstyled pages"]/li/img/@src').extract()
    for i in look_inside_image_urls:
        print("============> look_inside_image_urls ===============>", i)

    ============> look_inside_image_urls ===============> https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/fc955fd4b_117698-1.jpg
    ============> look_inside_image_urls ===============> https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/11f94595e_117698-2.jpg
    ============> look_inside_image_urls ===============> https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/555959ec2_117698-3.jpg
    ============> look_inside_image_urls ===============> https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/81b071d0c_117698-4.jpg
    ============> look_inside_image_urls ===============> https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/30ef8b806_117698-5.jpg
    ============> look_inside_image_urls ===============> https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/6cb40391f_117698-6.jpg
    ============> look_inside_image_urls ===============> https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/a41c97880_117698-7.jpg
    ============> look_inside_image_urls ===============> https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/d1a4bff6e_117698-8.jpg
    ============> look_inside_image_urls ===============> https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/9503cfda1_117698-9.jpg
    ============> look_inside_image_urls ===============> https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/54f1774ee_117698-10.jpg