I am working with text data and want to remove any HTML code, that is, anything between "<" and ">". For example:
< HTML > < p style="text-align:justify" >Labour Solutions Australia (LSA) is a national labour hire and sourcing
So I used the following code:
def remove_html(s):
    s = re.sub('[^S]*<[^S]*', "", s)
    s = re.sub('[^S]*>[^S]*', "", s)
    return s
Running this code gives the following result:
Solutions Australia LSA is a national labour hire and sourcing
I don't want to remove the word "Labour", but it gets removed because it is attached to the ">". Is there any way to keep it? Please suggest.
Answer
import re

def remove_html(data):
    # Remove everything from a "<" up to the next ">", then trim surrounding whitespace.
    return re.sub('<[^>]+>', '', data).strip()

test_case = '< HTML > < p style="text-align:justify" >Labour Solutions Australia (LSA) is a national labour hire and sourcing'
print(remove_html(test_case))
Output:
Labour Solutions Australia (LSA) is a national labour hire and sourcing
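To see why the pattern from the question drops "Labour", it may help to run both patterns on the same input. The sketch below is only an illustration: the variable names `text`, `greedy` and `tags_only` are made up for this comparison. Inside brackets, `[^S]` means "any character except a capital S", so `[^S]*` can swallow spaces and whole words. The question may have intended `\S` (any non-whitespace character), but that is a guess about the intent.

```python
import re

text = '< HTML > < p style="text-align:justify" >Labour Solutions Australia (LSA) is a national labour hire and sourcing'

# Pattern from the question: [^S]* greedily consumes every character that is
# not a capital "S", so a single match runs from the start of the string
# right up to the "S" of "Solutions", taking "Labour " with it.
greedy = re.sub('[^S]*<[^S]*', '', text)

# Pattern from the answer: a "<", then one or more characters that are not ">",
# then the closing ">". Only the tags themselves are matched.
tags_only = re.sub('<[^>]+>', '', text).strip()

print(greedy)     # Solutions Australia (LSA) is a national labour hire and sourcing
print(tags_only)  # Labour Solutions Australia (LSA) is a national labour hire and sourcing
```

The parentheses around "LSA" survive in both results, so the output shown in the question presumably went through some other cleanup step that is not shown. Note also that `<[^>]+>` stops at the first ">", so it would cut a tag short if an attribute value itself contained a ">" character.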