I want to find out whether a substring is contained in a string and, if so, remove it without touching the rest of the string. The catch is that the pattern I search with does not exactly match what the string contains. In particular, the string may contain Spanish accented vowels while the substring is uppercase and unaccented, so for example:
    myString = "I'm júst a tésting stríng"
    substring = 'TESTING'

Perform something to obtain:

    resultingString = "I'm júst a stríng"
So far I've read that the difflib library can compare two strings and score their similarity somehow, but I'm not sure how to apply it to my case (not to mention that I failed to install this library).
Thanks!
Answer
This normalize() method might be a little overkill, and maybe the code from @Harpe at https://stackoverflow.com/a/71591988/218663 works fine.
Here I am going to break the original string into “words” and then join all the non-matching words back into a string:
    import unicodedata

    def normalize(text):
        return unicodedata.normalize("NFD", text).encode('ascii', 'ignore').decode('utf-8').lower()

    myString = "I'm júst a tésting stríng"
    substring = "TESTING"

    newString = " ".join(word for word in myString.split(" ") if normalize(word) != normalize(substring))
    print(newString)
giving you:
I'm júst a stríng
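To see why the comparison matches, here is a small sketch of what normalize() does at each step. The helper name `normalize_steps` is made up for illustration and is not part of the answer's code.

```python
import unicodedata

def normalize_steps(text):
    # Hypothetical helper: returns the intermediate results of normalize().
    # NFD splits an accented letter into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    # Encoding to ASCII with errors="ignore" drops the combining marks
    # (and any other non-ASCII character).
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return decomposed, ascii_only, ascii_only.lower()

# "\u00e9" is the precomposed form of "é"
decomposed, ascii_only, lowered = normalize_steps("T\u00e9sting")
print(len("T\u00e9sting"), len(decomposed))  # 7 8
print(ascii_only)  # Testing
print(lowered)     # testing
```

Both the word from the string and the search term end up as plain lowercase ASCII, so they compare equal even though the originals differ in accents and case.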
If your “substring” could be multi-word I might think about switching strategies to a regex:
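    import re
    import unicodedata

    def normalize(text):
        return unicodedata.normalize("NFD", text).encode('ascii', 'ignore').decode('utf-8').lower()

    myString = "I'm júst á tésting stríng"
    substring = "A TESTING"

    # Raw f-string so \s reaches the regex engine; re.escape guards against
    # regex metacharacters in the substring. The \s on both sides means the
    # phrase is only found when it has whitespace before and after it.
    match = re.search(rf"\s{re.escape(normalize(substring))}\s", normalize(myString))
    if match:
        found_at = match.span()
        first_part = myString[:found_at[0]]
        second_part = myString[found_at[1]:]
        print(f"{first_part} {second_part}".strip())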
I think that will give you:
I'm júst stríng
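One caveat with the regex version: it slices myString using offsets found in the normalized string, which only lines up when both have the same length, and it misses a phrase at the very start or end of the string. A possible variation, sketched below, keeps a map from each normalized character back to its position in the original. The names `normalize_with_map` and `remove_phrase` are my own, and treating whitespace or the string edges as the phrase boundary is an assumption.

```python
import re
import unicodedata

def normalize_with_map(text):
    # Returns (normalized, index_map); index_map[i] is the position in `text`
    # of the character that produced normalized[i].
    chars, index_map = [], []
    for i, ch in enumerate(text):
        for piece in unicodedata.normalize("NFD", ch):
            if piece.isascii():  # drop combining marks and other non-ASCII
                chars.append(piece.lower())
                index_map.append(i)
    return "".join(chars), index_map

def remove_phrase(text, phrase):
    norm_text, index_map = normalize_with_map(text)
    norm_phrase, _ = normalize_with_map(phrase)
    if not norm_phrase:
        return text
    # Phrase must be bounded by whitespace or by the start/end of the string
    match = re.search(rf"(?<!\S){re.escape(norm_phrase)}(?!\S)", norm_text)
    if not match:
        return text
    start = index_map[match.start()]
    end = index_map[match.end() - 1] + 1
    # If the original was already decomposed, swallow trailing combining marks
    while end < len(text) and unicodedata.combining(text[end]):
        end += 1
    return f"{text[:start].rstrip()} {text[end:].lstrip()}".strip()

print(remove_phrase("I'm júst á tésting stríng", "A TESTING"))
```

For the example above this should print the same result as the regex version, and it also handles a phrase at either end of the string. It needs Python 3.7 or later for str.isascii().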