
BeautifulSoup trying to remove HTML data from list

As the title says, I am trying to strip the HTML from the printed output so that only the text and my | and - separators remain. Right now the output also includes the span tags and other markup that I would like to remove. Because this code runs inside a loop over many pages, I cannot search for specific text, since the text changes from page to page. The page structure stays the same, which is why the list indexes I print stay the same. What would be the easiest way to clean up the output? Here is the relevant section of code:

        infoLink = driver.find_element_by_xpath("//a[contains(@href, '?tmpl=component&detail=true&parcel=')]").click()
        driver.switch_to.window(driver.window_handles[1])
        aInfo = driver.current_url
        data = requests.get(aInfo)
        src = data.text
        soup = BeautifulSoup(src, "html.parser")
        parsed = soup.find_all("td")
        for item in parsed:
            Original = (parsed[21])
            Owner = parsed[13]
            Address = parsed[17]
            print (*Original, "|",*Owner, "-",*Address)

Example output is:

<span class="detail-text">123 Main St</span> | <span class="detail-text">Banner,Bruce</span> - <span class="detail-text">1313 Mockingbird Lane<br>Santa Monica, CA  90405</br></span>

Thank you!

Answer

To get just the text between the tags, call get_text() on each element. Be aware that this assumes each of those cells exists and actually contains text, so check for that if you want to avoid errors:

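# No loop needed: iterating over parsed would only repeat the same three lookups.
# get_text(strip=True) returns the tag's text with surrounding whitespace removed.
Original = parsed[21].get_text(strip=True)
Owner = parsed[13].get_text(strip=True)
Address = parsed[17].get_text(strip=True)
print(Original, "|", Owner, "-", Address)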
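
One detail worth checking with the example output from the question: the address cell contains a <br>, and get_text(strip=True) with no separator joins the two lines without a space between them. The sketch below is only an illustration under assumptions: the sample markup is rebuilt from the example output, the cell indexes 0, 1 and 2 stand in for 21, 13 and 17 on the real page, and cell_text is a hypothetical helper, not part of BeautifulSoup. It passes a separator to get_text() and returns a default when the index is missing.

```python
from bs4 import BeautifulSoup

# Sample markup rebuilt from the example output in the question (assumption:
# on the real page these spans sit inside the <td> cells found by find_all).
html = (
    "<table><tr>"
    '<td><span class="detail-text">123 Main St</span></td>'
    '<td><span class="detail-text">Banner,Bruce</span></td>'
    '<td><span class="detail-text">1313 Mockingbird Lane<br>'
    "Santa Monica, CA  90405</br></span></td>"
    "</tr></table>"
)


def cell_text(cells, index, separator=" ", default=""):
    """Hypothetical helper: text of cells[index], or default if the cell is missing."""
    if index >= len(cells):
        return default
    # The separator is placed between text fragments (e.g. around <br>);
    # split/join then collapses any repeated whitespace.
    text = cells[index].get_text(separator, strip=True)
    return " ".join(text.split())


cells = BeautifulSoup(html, "html.parser").find_all("td")

# Indexes 0, 1, 2 are for this sample only; the question uses 21, 13, 17.
original = cell_text(cells, 0)
owner = cell_text(cells, 1)
address = cell_text(cells, 2, separator=", ")

print(original, "|", owner, "-", address)
# → 123 Main St | Banner,Bruce - 1313 Mockingbird Lane, Santa Monica, CA 90405
```

Returning a default for a missing cell keeps the surrounding loop running on a page with fewer cells than expected. Whether a silent empty string or an explicit error is preferable depends on how the results are used.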
User contributions licensed under: CC BY-SA