import scrapy
from scrapy.http import Request


class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://www.ifep.ro/justice/lawyers/lawyerspanel.aspx']
    custom_settings = {
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        'DOWNLOAD_DELAY': 1,
        'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }

    def parse(self, response):
        # Collect every href found inside the list-group container.
        books = response.xpath("//div[@class='list-group']//@href").extract()
        for book in books:
            # Resolve relative links against the page URL.
            url = response.urljoin(book)
            print(url)
I want to remove these unnecessary URLs from the extracted links. The website is https://www.ifep.ro/justice/lawyers/lawyerspanel.aspx
http://www.unbr.ro
http://www.inppa.ro
http://www.uniuneanotarilor.ro/
Prima pagină
http://www.executori.ro/
http://www.csm1909.ro
http://www.inm-lex.ro
http://www.just.ro
Answer
You can use the endswith method together with the continue keyword to skip the unwanted URLs:
import scrapy
from scrapy.http import Request


class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://www.ifep.ro/justice/lawyers/lawyerspanel.aspx']
    custom_settings = {
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        'DOWNLOAD_DELAY': 1,
        'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }

    def parse(self, response):
        books = response.xpath("//div[@class='list-group']//@href").extract()
        for book in books:
            url = response.urljoin(book)
            # Skip the external site links, which end in .ro or .ro/
            if url.endswith('.ro') or url.endswith('.ro/'):
                continue
            print(url)
Output:
https://www.ifep.ro/justice/lawyers/LawyerFile.aspx?RecordId=1091&Signature=378270
https://www.ifep.ro/justice/lawyers/LawyerFile.aspx?RecordId=159077&Signature=378270
https://www.ifep.ro/justice/lawyers/LawyerFile.aspx?RecordId=159076&Signature=378270
https://www.ifep.ro/justice/lawyers/LawyerFile.aspx?RecordId=159075&Signature=378270
https://www.ifep.ro/justice/lawyers/LawyerFile.aspx?RecordId=159021&Signature=378270
https://www.ifep.ro/justice/lawyers/LawyerFile.aspx?RecordId=159020&Signature=378270
https://www.ifep.ro/justice/lawyers/LawyerFile.aspx?RecordId=159019&Signature=378270
https://www.ifep.ro/justice/lawyers/LawyerFile.aspx?RecordId=159018&Signature=378270
https://www.ifep.ro/justice/lawyers/LawyerFile.aspx?RecordId=21846&Signature=378270
https://www.ifep.ro/justice/lawyers/LawyerFile.aspx?RecordId=165927&Signature=378270
https://www.ifep.ro/justice/lawyers/LawyerFile.aspx?RecordId=83465&Signature=378270
https://www.ifep.ro/justice/lawyers/LawyerFile.aspx?RecordId=47724&Signature=378270
https://www.ifep.ro/justice/lawyers/LawyerFile.aspx?RecordId=32097&Signature=378270
https://www.ifep.ro/justice/lawyers/LawyerFile.aspx?RecordId=29573&Signature=378270
https://www.ifep.ro/justice/lawyers/LawyerFile.aspx?RecordId=19880&Signature=378270
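The endswith check depends on how the unwanted links happen to end, so it may need adjusting if other external links show up. One possible alternative, sketched here under assumptions, is to keep only links that stay on the start page's host and whose path contains LawyerFile.aspx, which is the pattern visible in the output above. The helper name keep_lawyer_links and its path_marker parameter are hypothetical, and the relative href in the sample is an assumption about the page markup; only urljoin and urlsplit from the standard library are used.

```python
from urllib.parse import urljoin, urlsplit

# Start page from the question.
BASE_URL = "https://www.ifep.ro/justice/lawyers/lawyerspanel.aspx"


def keep_lawyer_links(hrefs, base_url=BASE_URL, path_marker="LawyerFile.aspx"):
    """Hypothetical helper: return absolute URLs that are on the same host
    as base_url and whose path contains path_marker."""
    base_host = urlsplit(base_url).netloc
    kept = []
    for href in hrefs:
        # Resolve relative hrefs the same way response.urljoin would.
        url = urljoin(base_url, href)
        parts = urlsplit(url)
        if parts.netloc == base_host and path_marker in parts.path:
            kept.append(url)
    return kept


# Sample input: two external links from the question plus one
# lawyer link written as a relative href (an assumption).
sample = [
    "http://www.unbr.ro",
    "http://www.executori.ro/",
    "LawyerFile.aspx?RecordId=1091&Signature=378270",
]
print(keep_lawyer_links(sample))
```

Inside the spider, the same idea would amount to passing the extracted hrefs and response.url to the helper and iterating over the result instead of testing each URL with endswith.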