Find and remove a slightly different substring from a string

I want to find out whether a substring is contained in a string and, if so, remove it without touching the rest of the string. The catch is that the substring I have to search for does not exactly match what appears in the string. In particular, the string contains Spanish accented vowels while the substring is unaccented and uppercase. For example:

myString = "I'm júst a tésting stríng"
substring = 'TESTING'

Do something to obtain:

resultingString = "I'm júst a stríng"

So far I’ve read that the difflib library can compare two strings and weigh their similarity somehow, but I’m not sure how to apply it to my case (not to mention that I failed to install the library).

Thanks!

Answer

The normalize() function below might be a little overkill; the code from @Harpe at https://stackoverflow.com/a/71591988/218663 may work fine as well.

Here I am going to break the original string into “words” and then join all the non-matching words back into a string:

import unicodedata

def normalize(text):
    # Decompose accented letters (NFD), drop the non-ASCII combining marks, then lowercase.
    return unicodedata.normalize("NFD", text).encode('ascii', 'ignore').decode('utf-8').lower()

myString = "I'm júst a tésting stríng"
substring = "TESTING"
newString = " ".join(word for word in myString.split(" ") if normalize(word) != normalize(substring))

print(newString)

giving you:

I'm júst a stríng
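
To make it clearer why comparing normalized words works, here is a minimal sketch of what normalize() does at each stage. The variable names are only for illustration, and the accented letter is written as an escape so its exact form is unambiguous:

```python
import unicodedata

text = "t\u00e9sting"  # "tésting" with a precomposed é

# NFD splits each accented letter into a base letter plus a combining mark,
# so the decomposed string is one character longer here.
decomposed = unicodedata.normalize("NFD", text)
print(len(text), len(decomposed))  # 7 8

# Encoding to ASCII with errors="ignore" silently drops the combining mark
# (and any other character that has no ASCII form).
stripped = decomposed.encode("ascii", "ignore").decode("ascii")
print(stripped)  # testing

# After lowercasing both sides, the accent and case differences are gone.
print(stripped.lower() == "TESTING".lower())  # True
```

One caveat: letters with no canonical decomposition, such as "ø", are dropped entirely rather than mapped to a base letter, so this approach suits Spanish accented vowels better than arbitrary text.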

If your “substring” could be multi-word I might think about switching strategies to a regex:

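import re
import unicodedata

def normalize(text):
    # Decompose accented letters (NFD), drop the non-ASCII combining marks, then lowercase.
    return unicodedata.normalize("NFD", text).encode('ascii', 'ignore').decode('utf-8').lower()

myString = "I'm júst á tésting stríng"
substring = "A TESTING"
# Raw f-string so \s is a regex escape; re.escape() protects any special characters in the substring.
match = re.search(rf"\s{re.escape(normalize(substring))}\s", normalize(myString))
if match:
    # The span indexes the normalized string. It lines up with myString only when every
    # character normalizes to exactly one character (true for precomposed accented letters).
    found_at = match.span()
    first_part = myString[:found_at[0]]
    second_part = myString[found_at[1]:]
    print(f"{first_part} {second_part}".strip())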

That will give you:

I'm júst stríng
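
The regex version above relies on the normalized string having the same length as the original, and it needs whitespace on both sides of the match. As a rough sketch of one way around both limits (not part of the original answer), the hypothetical helpers fold() and remove_phrase() below record which original index produced each normalized character and use lookarounds instead of \s:

```python
import re
import unicodedata


def fold(ch):
    """Accent- and case-insensitive form of one character (may be empty)."""
    decomposed = unicodedata.normalize("NFD", ch)
    return decomposed.encode("ascii", "ignore").decode("ascii").lower()


def remove_phrase(text, phrase):
    """Remove the first accent/case-insensitive occurrence of phrase."""
    folded_chars = []
    origin = []  # origin[i] = index in text that produced folded char i
    for index, ch in enumerate(text):
        for folded in fold(ch):
            folded_chars.append(folded)
            origin.append(index)
    folded_text = "".join(folded_chars)
    folded_phrase = "".join(fold(ch) for ch in phrase)
    if not folded_phrase:
        return text

    # Lookarounds instead of \s, so a match at the very start or end of
    # the string counts, while a match inside a longer word does not.
    pattern = rf"(?<!\w){re.escape(folded_phrase)}(?!\w)"
    match = re.search(pattern, folded_text)
    if not match:
        return text

    start = origin[match.start()]
    end = origin[match.end() - 1] + 1
    # If the text was already decomposed, also swallow trailing combining marks.
    while end < len(text) and unicodedata.combining(text[end]):
        end += 1

    remainder = text[:start] + text[end:]
    # Simplification: this collapses every run of whitespace in the result.
    return " ".join(remainder.split())


print(remove_phrase("I'm júst á tésting stríng", "A TESTING"))
print(remove_phrase("Tésting stríng", "TESTING"))
```

The first call should print the same result as above, and the second shows a match at the start of the string. If the original spacing must be preserved exactly, the final whitespace collapse would need to be replaced with something more careful.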
User contributions licensed under: CC BY-SA