Removing HTML tags without losing the word attached to the closing “>”

I am working with text data and want to remove any HTML markup, that is, anything between “<” and “>”. For example:

< HTML > < p style="text-align:justify" >Labour Solutions Australia (LSA) is a national labour hire and sourcing

So I used the following code:

import re

def remove_html(s):
    s = re.sub('[^S]*<[^S]*', "", s)
    s = re.sub('[^S]*>[^S]*', "", s)
    return s

Running the code gives the following result:

Solutions Australia LSA is a national labour hire and sourcing

I don’t want to remove the word “Labour”, but it gets removed because it is stuck to the ‘>’. Is there any way I can keep it? Please suggest.

Answer

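import re

def remove_html(data):
    # Match a whole tag: "<", then one or more characters that are not ">", then ">".
    # Text after the closing ">" is outside the match, so "Labour" is kept.
    return re.sub(r'<[^>]+>', '', data).strip()

test_case = '< HTML > < p style="text-align:justify" >Labour Solutions Australia (LSA) is a national labour hire and sourcing'
print(remove_html(test_case))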

Output:

Labour Solutions Australia (LSA) is a national labour hire and sourcing
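
To see why the patterns in the question swallow "Labour", it may help to run both versions side by side. The sketch below assumes the question's patterns are exactly as posted, with a literal capital `S` inside the character class; the names `sample`, `remove_html_original` and `remove_html_tags` are made up for this illustration.

```python
import re

sample = ('< HTML > < p style="text-align:justify" >'
          'Labour Solutions Australia (LSA) is a national labour hire and sourcing')

def remove_html_original(s):
    # Patterns copied as posted in the question. [^S] means "any character
    # except a capital S", so each greedy [^S]* run extends up to the next "S".
    s = re.sub('[^S]*<[^S]*', '', s)
    s = re.sub('[^S]*>[^S]*', '', s)
    return s

def remove_html_tags(s):
    # Match one whole tag: "<", then one or more non-">" characters, then ">".
    return re.sub(r'<[^>]+>', '', s).strip()

print(remove_html_original(sample))
print(remove_html_tags(sample))
```

With this input, the first call drops everything before "Solutions", including "Labour", because the match runs from the start of the string up to the first capital "S". The second call removes only the two tags. Note that `<[^>]+>` is a simple pattern rather than a full HTML parser, so a literal ">" inside an attribute value would end the match early.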

User contributions licensed under: CC BY-SA