import scrapy
from scrapy.http import Request


class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://www.ifep.ro/justice/lawyers/lawyerspanel.aspx']
    custom_settings = {
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        'DOWNLOAD_DELAY': 1,
        'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }

    def parse(self, response):
        books = response.xpath("//div[@class='list-group']//@href").extract()
        for book in books:
            url = response.urljoin(book)
            print(url)
I want to exclude the following unnecessary URLs from the extracted links. The website is https://www.ifep.ro/justice/lawyers/lawyerspanel.aspx
http://www.unbr.ro
http://www.inppa.ro
http://www.uniuneanotarilor.ro/Prima pagină
http://www.executori.ro/
http://www.csm1909.ro
http://www.inm-lex.ro
http://www.just.ro
Answer
You can use the endswith method together with the continue keyword to skip the unwanted URLs:
import scrapy
from scrapy.http import Request


class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://www.ifep.ro/justice/lawyers/lawyerspanel.aspx']
    custom_settings = {
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        'DOWNLOAD_DELAY': 1,
        'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }

    def parse(self, response):
        books = response.xpath("//div[@class='list-group']//@href").extract()
        for book in books:
            url = response.urljoin(book)
            # Skip links that end at a bare .ro domain (with or without a trailing slash)
            if url.endswith('.ro') or url.endswith('.ro/'):
                continue
            print(url)
Output:
https://www.ifep.ro/justice/lawyers/LawyerFile.aspx?RecordId=1091&Signature=378270
https://www.ifep.ro/justice/lawyers/LawyerFile.aspx?RecordId=159077&Signature=378270
https://www.ifep.ro/justice/lawyers/LawyerFile.aspx?RecordId=159076&Signature=378270
https://www.ifep.ro/justice/lawyers/LawyerFile.aspx?RecordId=159075&Signature=378270
https://www.ifep.ro/justice/lawyers/LawyerFile.aspx?RecordId=159021&Signature=378270
https://www.ifep.ro/justice/lawyers/LawyerFile.aspx?RecordId=159020&Signature=378270
https://www.ifep.ro/justice/lawyers/LawyerFile.aspx?RecordId=159019&Signature=378270
https://www.ifep.ro/justice/lawyers/LawyerFile.aspx?RecordId=159018&Signature=378270
https://www.ifep.ro/justice/lawyers/LawyerFile.aspx?RecordId=21846&Signature=378270
https://www.ifep.ro/justice/lawyers/LawyerFile.aspx?RecordId=165927&Signature=378270
https://www.ifep.ro/justice/lawyers/LawyerFile.aspx?RecordId=83465&Signature=378270
https://www.ifep.ro/justice/lawyers/LawyerFile.aspx?RecordId=47724&Signature=378270
https://www.ifep.ro/justice/lawyers/LawyerFile.aspx?RecordId=32097&Signature=378270
https://www.ifep.ro/justice/lawyers/LawyerFile.aspx?RecordId=29573&Signature=378270
https://www.ifep.ro/justice/lawyers/LawyerFile.aspx?RecordId=19880&Signature=378270
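
The endswith check only catches URLs whose text ends in .ro or .ro/, so an external link with anything after the domain would slip through. As a possible alternative, here is a minimal sketch that filters by host name instead, assuming the only links you want are the ones on www.ifep.ro. The names ALLOWED_HOST, is_internal and filter_internal are made up for this illustration and are not part of Scrapy.

```python
from urllib.parse import urlparse

# Assumption: only links hosted on this domain should be kept.
ALLOWED_HOST = "www.ifep.ro"


def is_internal(url, allowed_host=ALLOWED_HOST):
    """Return True when the URL's host name equals allowed_host."""
    return urlparse(url).hostname == allowed_host


def filter_internal(urls, allowed_host=ALLOWED_HOST):
    """Keep only the URLs whose host name equals allowed_host."""
    return [u for u in urls if is_internal(u, allowed_host)]


# Sample URLs taken from the question and the output above.
sample = [
    "http://www.unbr.ro",
    "http://www.executori.ro/",
    "https://www.ifep.ro/justice/lawyers/LawyerFile.aspx?RecordId=1091&Signature=378270",
]

for url in filter_internal(sample):
    print(url)
```

Inside the spider, the same idea would replace the endswith condition with something like "if not is_internal(url): continue". Comparing host names does not depend on how the external URL happens to end.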