Removing HTML but keeping the word stuck to the end of a tag

I am working with text data and want to remove any HTML code, that is, anything between "<" and ">". For example:

< HTML > < p style="text-align:justify" >Labour Solutions Australia (LSA) is a national labour hire and sourcing

So I use the following code:

import re

def remove_html(s):
    # Patterns kept exactly as posted; they remove more than just the tags.
    s = re.sub('[^S]*<[^S]*', "", s)
    s = re.sub('[^S]*>[^S]*', "", s)
    return s

Running this code gives the following result:

Solutions Australia LSA is a national labour hire and sourcing

I don’t want to remove the word "Labour", but it gets removed because it is stuck to the ‘>’. Is there any way I can keep it? Please suggest.

Answer

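import re

def remove_html(data):
    # Delete each "<...>" span: a "<", one or more characters that are not ">", then ">".
    # strip() then trims the whitespace left at both ends.
    return re.sub(r'<[^>]+>', '', data).strip()

test_case = '< HTML > < p style="text-align:justify" >Labour Solutions Australia (LSA) is a national labour hire and sourcing'
print(remove_html(test_case))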

Output:

Labour Solutions Australia (LSA) is a national labour hire and sourcing
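
To see why the two approaches differ, here is a small side-by-side sketch. It assumes the question's pattern is exactly as posted, and the names `text`, `greedy` and `tag_only` are made up for illustration. As written, `[^S]*` matches any run of characters other than a capital "S", so the match around the last "<" swallows everything up to the "S" of "Solutions", including "Labour". The answer's pattern stops at the first ">" after each "<", so the text following the tag is left alone.

```python
import re

text = ('< HTML > < p style="text-align:justify" >Labour Solutions Australia '
        '(LSA) is a national labour hire and sourcing')

# Pattern from the question, as posted: [^S]* is "any characters except capital S",
# so the match around "<" extends all the way to the "S" in "Solutions".
greedy = re.sub('[^S]*<[^S]*', '', text)

# Pattern from the answer: only a "<", the non-">" characters after it, and the ">".
tag_only = re.sub(r'<[^>]+>', '', text).strip()

print(greedy)    # Solutions Australia (LSA) is a national labour hire and sourcing
print(tag_only)  # Labour Solutions Australia (LSA) is a national labour hire and sourcing
```

Note that `<[^>]+>` removes anything between a "<" and the next ">", so it would also delete ordinary text that happens to contain both characters.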

Answer