Appending text to a string if it matches a condition

Question

I am learning to scrape websites. I need to get document titles and the links to them. I have already managed to do this, but the format of the resulting links is sometimes not what I need. Here is a snippet of the information I get: You can see that in the second case I get only part of the link, while

Accepted Answer

You can check whether the string has a scheme and, if it does not, prepend the scheme and the host taken from the page URL:

    if not link_element_row.startswith("http"):
        parsed_url = urlparse(URL)
        link_element_row = (
            parsed_url.scheme + "://" + parsed_url.netloc + link_element_row
        )

Full working code:

    import requests
    from urllib.parse import urlparse
    from bs4 import BeautifulSoup

    URL = "https://rosstat.gov.ru/folder/12781"

    response = requests.get(URL).text
    soup = BeautifulSoup(response, "lxml")

    block = soup.find("div", class_="col-lg-8 order-1 order-lg-1")
    list_info_block_row = block.find_all(
        "div", class_="document-list__item document-list__item--row"
    )
    list_info_block_col = block.find_all(
        "div", class_="document-list__item document-list__item--col"
    )

    for text_block_row in list_info_block_row:
        new_list = []
        title_element_row = text_block_row.find("div", class_="document-list__item-title")
        preprocessing_title = title_element_row.text.strip()
        link_element_row = text_block_row.find("a").get("href")
        new_list.append(preprocessing_title)
        # Links without a scheme are site-relative: prepend scheme and host.
        if not link_element_row.startswith("http"):
            parsed_url = urlparse(URL)
            link_element_row = (
                parsed_url.scheme + "://" + parsed_url.netloc + link_element_row
            )
        new_list.append(link_element_row)
        print(new_list)
        print(title_element_row.text.strip())
        print(link_element_row)
        print("\n\n")

Research:

Get protocol + host name from URL
startswith
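As an alternative to the manual concatenation in this answer, the standard library's urljoin from urllib.parse can resolve a partial link against the page URL. The sketch below is not part of the original answer: the helper name make_absolute and the sample hrefs are hypothetical and chosen only for illustration, and it assumes the partial links start with "/" as the concatenation above implies.

```python
from urllib.parse import urljoin

# Page URL taken from the answer above.
URL = "https://rosstat.gov.ru/folder/12781"


def make_absolute(href, base_url=URL):
    """Resolve href against base_url (hypothetical helper).

    urljoin leaves an href that is already absolute unchanged and
    fills in the scheme and host for a site-relative one.
    """
    return urljoin(base_url, href)


# Hypothetical sample hrefs, not taken from the real page.
print(make_absolute("/storage/mediabank/example.pdf"))
print(make_absolute("https://example.com/doc.pdf"))
```

In the loop above, this would replace the startswith check with a single call such as link_element_row = make_absolute(link_element_row). Unlike plain concatenation, urljoin also handles relative paths that do not begin with a slash by resolving them against the path of the base URL.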

Answer