I am learning to scrape websites. I need to collect document titles and the links to them. This already works, but some of the resulting links are not in the format I need. Here is a snippet of the output I get:
['Плотность населения субъектов Российской Федерации', 'http://gks.ru/free_doc/new_site/population/demo/dem11_map.htm']
Плотность населения субъектов Российской Федерации
http://gks.ru/free_doc/new_site/population/demo/dem11_map.htm


['Численность мужчин и женщин', '/storage/mediabank/yKsfiyjR/demo13.xls']
Численность мужчин и женщин
/storage/mediabank/yKsfiyjR/demo13.xls
In the second case I get only a relative path, while in the first I get the full URL. I need to prepend a prefix that I know in advance to links like the second one, but only when the link is actually in that relative format. That is, I want the output to look like this:
['Плотность населения субъектов Российской Федерации', 'http://gks.ru/free_doc/new_site/population/demo/dem11_map.htm']
Плотность населения субъектов Российской Федерации
http://gks.ru/free_doc/new_site/population/demo/dem11_map.htm


['Численность мужчин и женщин', 'https://rosstat.gov.ru/storage/mediabank/yKsfiyjR/demo13.xls']
Численность мужчин и женщин
https://rosstat.gov.ru/storage/mediabank/yKsfiyjR/demo13.xls
How should I do this? Here is my current code:
import requests
from bs4 import BeautifulSoup

URL = "https://rosstat.gov.ru/folder/12781"

responce = requests.get(URL).text
soup = BeautifulSoup(responce, 'lxml')
block = soup.find('div', class_="col-lg-8 order-1 order-lg-1")

list_info_block_row = block.find_all('div', class_='document-list__item document-list__item--row')
list_info_block_col = block.find_all('div', class_='document-list__item document-list__item--col')

sources = []

for text_block_row in list_info_block_row:
    new_list = []
    title_element_row = text_block_row.find('div', class_='document-list__item-title')
    preprocessing_title = title_element_row.text.strip()
    link_element_row = text_block_row.find('a').get('href')
    new_list.append(preprocessing_title)
    new_list.append(link_element_row)
    print(new_list)
    print(title_element_row.text.strip())
    print(link_element_row)
    print('\n\n')
Answer
You can check whether the link has a scheme and, if it does not, prepend the scheme and host taken from the page URL:
from urllib.parse import urlparse

# Inside the loop, after link_element_row has been read from the <a> tag:
if not link_element_row.startswith("http"):
    parsed_url = urlparse(URL)
    link_element_row = (
        parsed_url.scheme + "://" + parsed_url.netloc + link_element_row
    )
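As an alternative to concatenating the parts by hand, the standard library's `urljoin` resolves a link against the page URL. The sketch below is not part of the original answer. It runs on the two `href` values from the question, with no network access, and the `hrefs` and `absolute_links` names are only illustrative.

```python
from urllib.parse import urljoin

# Page the links were scraped from (same URL as in the question).
URL = "https://rosstat.gov.ru/folder/12781"

# The two href values shown in the question's output.
hrefs = [
    "http://gks.ru/free_doc/new_site/population/demo/dem11_map.htm",
    "/storage/mediabank/yKsfiyjR/demo13.xls",
]

# urljoin leaves absolute URLs untouched and resolves relative ones
# against the scheme and host of the base URL.
absolute_links = [urljoin(URL, href) for href in hrefs]

for link in absolute_links:
    print(link)
```

Compared with the `startswith("http")` check, this also copes with relative paths that do not begin with a slash, so it may be the safer choice if the site mixes link styles.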
Full working code:
import requests
from urllib.parse import urlparse
from bs4 import BeautifulSoup

URL = "https://rosstat.gov.ru/folder/12781"

responce = requests.get(URL).text
soup = BeautifulSoup(responce, "lxml")
block = soup.find("div", class_="col-lg-8 order-1 order-lg-1")

list_info_block_row = block.find_all(
    "div", class_="document-list__item document-list__item--row"
)
list_info_block_col = block.find_all(
    "div", class_="document-list__item document-list__item--col"
)

for text_block_row in list_info_block_row:
    new_list = []
    title_element_row = text_block_row.find("div", class_="document-list__item-title")
    preprocessing_title = title_element_row.text.strip()
    link_element_row = text_block_row.find("a").get("href")
    new_list.append(preprocessing_title)

    # Relative link: prepend the scheme and host of the page URL.
    if not link_element_row.startswith("http"):
        parsed_url = urlparse(URL)
        link_element_row = (
            parsed_url.scheme + "://" + parsed_url.netloc + link_element_row
        )

    new_list.append(link_element_row)

    print(new_list)
    print(title_element_row.text.strip())
    print(link_element_row)
    print("\n\n")
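The question's code declares a `sources` list but never fills it. A minimal sketch of collecting the normalized pairs there, using the same scheme check as the answer, might look like the following. The helper names `absolutize` and `collect_sources` are hypothetical, and the `scraped` list stands in for the values that BeautifulSoup would return, so the example runs without a network request.

```python
from urllib.parse import urlparse

URL = "https://rosstat.gov.ru/folder/12781"


def absolutize(href, page_url=URL):
    """Return href unchanged if it starts with "http", else prefix scheme and host."""
    if href.startswith("http"):
        return href
    parsed_url = urlparse(page_url)
    return parsed_url.scheme + "://" + parsed_url.netloc + href


def collect_sources(pairs, page_url=URL):
    """Build [title, absolute_link] lists from (title, href) pairs."""
    return [[title.strip(), absolutize(href, page_url)] for title, href in pairs]


# Stand-in for the (title, href) values extracted from the page.
scraped = [
    ("Плотность населения субъектов Российской Федерации",
     "http://gks.ru/free_doc/new_site/population/demo/dem11_map.htm"),
    ("Численность мужчин и женщин",
     "/storage/mediabank/yKsfiyjR/demo13.xls"),
]

sources = collect_sources(scraped)
for title, link in sources:
    print(title)
    print(link)
    print()
```

Keeping the normalization in one small function makes it easy to reuse for the `document-list__item--col` blocks, which the original loop does not process.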