I am learning to scrape websites. I need to extract document titles and the links to them. This already works, but the resulting links are sometimes not in the format I need. Here is a snippet of the output I get:
    ['Плотность населения субъектов Российской Федерации', 'http://gks.ru/free_doc/new_site/population/demo/dem11_map.htm']
    Плотность населения субъектов Российской Федерации
    http://gks.ru/free_doc/new_site/population/demo/dem11_map.htm

    ['Численность мужчин и женщин', '/storage/mediabank/yKsfiyjR/demo13.xls']
    Численность мужчин и женщин
    /storage/mediabank/yKsfiyjR/demo13.xls
As you can see, the first link is complete, but the second is only a path. When a link comes back in that second form, and only then, I need to prepend a prefix that I know in advance. In other words, I want the output to be:
    ['Плотность населения субъектов Российской Федерации', 'http://gks.ru/free_doc/new_site/population/demo/dem11_map.htm']
    Плотность населения субъектов Российской Федерации
    http://gks.ru/free_doc/new_site/population/demo/dem11_map.htm

    ['Численность мужчин и женщин', 'https://rosstat.gov.ru/storage/mediabank/yKsfiyjR/demo13.xls']
    Численность мужчин и женщин
    https://rosstat.gov.ru/storage/mediabank/yKsfiyjR/demo13.xls
How should I do this? Here is my current code:
    import requests
    from bs4 import BeautifulSoup

    URL = "https://rosstat.gov.ru/folder/12781"
    responce = requests.get(URL).text
    soup = BeautifulSoup(responce, 'lxml')

    # Container that holds the document list
    block = soup.find('div', class_="col-lg-8 order-1 order-lg-1")
    list_info_block_row = block.find_all('div', class_='document-list__item document-list__item--row')
    list_info_block_col = block.find_all('div', class_='document-list__item document-list__item--col')

    sources = []
    for text_block_row in list_info_block_row:
        new_list = []
        title_element_row = text_block_row.find('div', class_='document-list__item-title')
        preprocessing_title = title_element_row.text.strip()
        # href may be absolute (http://...) or site-relative (/storage/...)
        link_element_row = text_block_row.find('a').get('href')
        new_list.append(preprocessing_title)
        new_list.append(link_element_row)
        print(new_list)
        print(title_element_row.text.strip())
        print(link_element_row)
        print('\n\n')
Answer
You can check whether the link has a scheme and, if it does not, prepend the scheme and host of the page URL:
    # Requires: from urllib.parse import urlparse
    if not link_element_row.startswith("http"):
        parsed_url = urlparse(URL)
        link_element_row = (
            parsed_url.scheme + "://" + parsed_url.netloc + link_element_row
        )
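To see what the check is working with, here is a small standalone sketch of how `urlparse` splits the two kinds of links from the question. The helper names `has_scheme` and `site_root` and the constant `BASE_URL` are made up for illustration; only `urlparse` itself comes from the standard library. Testing `urlparse(link).scheme` is a slightly stricter variant of `startswith("http")`, not a required change.

```python
from urllib.parse import urlparse

# Page URL taken from the question; the name BASE_URL is illustrative.
BASE_URL = "https://rosstat.gov.ru/folder/12781"


def has_scheme(link: str) -> bool:
    """Return True if the link carries its own scheme (http, https, ...)."""
    return bool(urlparse(link).scheme)


def site_root(page_url: str) -> str:
    """Return the scheme and host of a page URL, without any path."""
    parts = urlparse(page_url)
    return f"{parts.scheme}://{parts.netloc}"


links = [
    "http://gks.ru/free_doc/new_site/population/demo/dem11_map.htm",
    "/storage/mediabank/yKsfiyjR/demo13.xls",
]

for link in links:
    # Leave absolute links alone; prefix site-relative ones with the site root.
    full_link = link if has_scheme(link) else site_root(BASE_URL) + link
    print(full_link)
```

The first link is printed unchanged, and the second comes out prefixed with `https://rosstat.gov.ru`.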
Full working code:
    import requests
    from urllib.parse import urlparse
    from bs4 import BeautifulSoup

    URL = "https://rosstat.gov.ru/folder/12781"
    responce = requests.get(URL).text
    soup = BeautifulSoup(responce, "lxml")

    block = soup.find("div", class_="col-lg-8 order-1 order-lg-1")
    list_info_block_row = block.find_all(
        "div", class_="document-list__item document-list__item--row"
    )
    list_info_block_col = block.find_all(
        "div", class_="document-list__item document-list__item--col"
    )

    for text_block_row in list_info_block_row:
        new_list = []
        title_element_row = text_block_row.find("div", class_="document-list__item-title")
        preprocessing_title = title_element_row.text.strip()
        link_element_row = text_block_row.find("a").get("href")
        new_list.append(preprocessing_title)

        # Site-relative link: prepend the scheme and host of the page URL
        if not link_element_row.startswith("http"):
            parsed_url = urlparse(URL)
            link_element_row = (
                parsed_url.scheme + "://" + parsed_url.netloc + link_element_row
            )

        new_list.append(link_element_row)
        print(new_list)
        print(title_element_row.text.strip())
        print(link_element_row)
        print("\n\n")
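As an alternative to building the prefix by hand, the standard library's `urllib.parse.urljoin` resolves a link against a base URL and leaves already-absolute links untouched. The sketch below is not part of the code above; the names `PAGE_URL` and `to_absolute` are hypothetical, and the two sample links are the ones from the question.

```python
from urllib.parse import urljoin

# Page the links were scraped from (same URL as in the question).
PAGE_URL = "https://rosstat.gov.ru/folder/12781"


def to_absolute(href: str, base: str = PAGE_URL) -> str:
    """Resolve href against base; absolute hrefs are returned unchanged."""
    return urljoin(base, href)


relative = "/storage/mediabank/yKsfiyjR/demo13.xls"
absolute = "http://gks.ru/free_doc/new_site/population/demo/dem11_map.htm"

print(to_absolute(relative))  # https://rosstat.gov.ru/storage/mediabank/yKsfiyjR/demo13.xls
print(to_absolute(absolute))  # http://gks.ru/free_doc/new_site/population/demo/dem11_map.htm
```

Inside the loop this would replace the `if not ... startswith("http")` block with a single call, `link_element_row = to_absolute(link_element_row)`. It also handles links written without a leading slash, which plain string concatenation would join incorrectly.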