Appending text to a string if it matches a condition

I am learning to scrape websites. I need to get document titles and the links to them. I can already do this, but the resulting links are sometimes not in the format I need. Here is a sample of the output I get:

['Плотность населения субъектов Российской Федерации', 'http://gks.ru/free_doc/new_site/population/demo/dem11_map.htm']
Плотность населения субъектов Российской Федерации
http://gks.ru/free_doc/new_site/population/demo/dem11_map.htm


['Численность мужчин и женщин', '/storage/mediabank/yKsfiyjR/demo13.xls']
Численность мужчин и женщин
/storage/mediabank/yKsfiyjR/demo13.xls

In the second case I get only part of the link (a path starting with "/"), while in the first I get the whole link. For links in the second format, I need to prepend a part that I know in advance, but only when the link is actually in that format. That is, I want the output to look like this:

['Плотность населения субъектов Российской Федерации', 'http://gks.ru/free_doc/new_site/population/demo/dem11_map.htm']
Плотность населения субъектов Российской Федерации
http://gks.ru/free_doc/new_site/population/demo/dem11_map.htm


['Численность мужчин и женщин', 'https://rosstat.gov.ru/storage/mediabank/yKsfiyjR/demo13.xls']
Численность мужчин и женщин
https://rosstat.gov.ru/storage/mediabank/yKsfiyjR/demo13.xls

How should I do it? Here is the code that produces the output above:

import requests
from bs4 import BeautifulSoup

URL = "https://rosstat.gov.ru/folder/12781"

response = requests.get(URL).text
soup = BeautifulSoup(response, 'lxml')
block = soup.find('div', class_="col-lg-8 order-1 order-lg-1")

list_info_block_row = block.find_all('div', class_='document-list__item document-list__item--row')
list_info_block_col = block.find_all('div', class_='document-list__item document-list__item--col')

sources = []

for text_block_row in list_info_block_row:
    new_list = []
    title_element_row = text_block_row.find('div', class_='document-list__item-title')
    preprocessing_title = title_element_row.text.strip()
    link_element_row = text_block_row.find('a').get('href')
    new_list.append(preprocessing_title)
    new_list.append(link_element_row)
    print(new_list)
    print(title_element_row.text.strip())
    print(link_element_row)
    print('\n\n')

Answer

You can check whether the link has a scheme and, if it does not, prepend the scheme and host taken from the page URL:

if not link_element_row.startswith("http"):
    parsed_url = urlparse(URL)
    link_element_row = (
        parsed_url.scheme + "://" + parsed_url.netloc + link_element_row
    )
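
To try the check without hitting the network, it can be wrapped in a small function and run on the two sample links from the question. This is only a sketch: the function name `make_absolute` and the constant `BASE_URL` are made up for illustration, and it assumes every link without a scheme is a path starting with "/".

```python
from urllib.parse import urlparse

# Page URL from the question; BASE_URL is a hypothetical name for this sketch.
BASE_URL = "https://rosstat.gov.ru/folder/12781"


def make_absolute(link, base_url=BASE_URL):
    """Return link unchanged if it starts with "http";
    otherwise prefix it with the scheme and host of base_url."""
    if link.startswith("http"):
        return link
    parsed = urlparse(base_url)
    # Assumes link is a root-relative path such as "/storage/...".
    return parsed.scheme + "://" + parsed.netloc + link


sample_links = [
    "http://gks.ru/free_doc/new_site/population/demo/dem11_map.htm",
    "/storage/mediabank/yKsfiyjR/demo13.xls",
]
for link in sample_links:
    print(make_absolute(link))
```

The first link is printed as-is, and the second comes out as https://rosstat.gov.ru/storage/mediabank/yKsfiyjR/demo13.xls.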

Full working code:

import requests
from urllib.parse import urlparse
from bs4 import BeautifulSoup

URL = "https://rosstat.gov.ru/folder/12781"

response = requests.get(URL).text
soup = BeautifulSoup(response, "lxml")
block = soup.find("div", class_="col-lg-8 order-1 order-lg-1")

list_info_block_row = block.find_all(
    "div", class_="document-list__item document-list__item--row"
)
list_info_block_col = block.find_all(
    "div", class_="document-list__item document-list__item--col"
)

for text_block_row in list_info_block_row:
    new_list = []
    title_element_row = text_block_row.find("div", class_="document-list__item-title")
    preprocessing_title = title_element_row.text.strip()
    link_element_row = text_block_row.find("a").get("href")
    new_list.append(preprocessing_title)

    if not link_element_row.startswith("http"):
        parsed_url = urlparse(URL)
        link_element_row = (
            parsed_url.scheme + "://" + parsed_url.netloc + link_element_row
        )

    new_list.append(link_element_row)

    print(new_list)
    print(title_element_row.text.strip())
    print(link_element_row)
    print("\n\n")
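
As a possible alternative to the manual `startswith` check (not part of the code above), the standard library's `urllib.parse.urljoin` resolves a link against a base URL and leaves links that are already absolute untouched. A minimal sketch, where `resolve_link` and `BASE_URL` are hypothetical names chosen for this example:

```python
from urllib.parse import urljoin

# Page URL from the question; BASE_URL is a hypothetical name for this sketch.
BASE_URL = "https://rosstat.gov.ru/folder/12781"


def resolve_link(link, base_url=BASE_URL):
    """Resolve link against base_url; absolute links are returned unchanged."""
    return urljoin(base_url, link)


# Root-relative path: scheme and host are taken from BASE_URL.
print(resolve_link("/storage/mediabank/yKsfiyjR/demo13.xls"))
# Already absolute: returned as-is.
print(resolve_link("http://gks.ru/free_doc/new_site/population/demo/dem11_map.htm"))
```

One reason to consider it is that `urljoin` also handles links that do not start with "/" (they are resolved relative to the folder of the base URL) and protocol-relative links such as "//host/path", which plain string concatenation would get wrong.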
