I have a regex in Python and I want to prevent it from matching substrings. I want to add ‘@’ at the beginning of words that consist only of alphanumeric characters and ‘_’ and are 4 to 15 characters long. But my pattern also matches substrings of larger words. I have this method:
import re

def add_atsign(sents):
    for i, sent in enumerate(sents):
        sents[i] = re.sub(r'([a-zA-Z0-9_]{4,15})', r'@\1', str(sent))
    return sents
And the example is:
mylist = list()
mylist.append("ali_s ali_t ali_u aabs:/t.co/kMMALke2l9")
add_atsign(mylist)
And the answer is:
['@ali_s @ali_t @ali_u @aabs:/t.co/@kMMALke2l9']
As you can see, it puts ‘@’ at the beginning of ‘aabs’ and ‘kMMALke2l9’, which is wrong. I tried to edit the code as below:
def add_atsign(sents):
    for i, sent in enumerate(sents):
        sents[i] = re.sub(r'((^|\s)[a-zA-Z0-9_]{4,15}(\s|$))', r'@\1', str(sent))
    return sents
But the result becomes:
['@ali_s ali_t@ ali_u aabs:/t.co/kMMALke2l9']
As you can see, the replacements are wrong. The correct result I expect is:
"@ali_s @ali_t @ali_u aabs:/t.co/kMMALke2l9"
Could anyone help? Thanks
Answer
This is a pretty interesting question. If I understand correctly, the issue is that you want to divide the string by spaces, and then do the replacement only if the entire word matches, and not catch a substring.
I think the best way to do this is to first split by spaces, and then add anchors (^ and $) to your regex so that it matches only an entire word:
import re

def add_atsign(sents):
    new_list = []
    for string in sents:
        # Split on whitespace, prefix only words that match in full, then rejoin.
        new_list.append(' '.join(
            re.sub(r'^([a-zA-Z0-9_]{4,15})$', r'@\1', w)
            for w in string.split()))
    return new_list

mylist = ["ali_s ali_t ali_u aabs:/t.co/kMMALke2l9"]
add_atsign(mylist)
> ['@ali_s @ali_t @ali_u aabs:/t.co/kMMALke2l9']
That is, we split, then replace only if the entire word matches, then rejoin.
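For comparison, here is a small sketch of why the second attempt in the question misplaces the ‘@’. It assumes the original pattern used \s for whitespace, and the names sentence and pattern are only for illustration. The outer group captures the whitespace around each word, so the ‘@’ is inserted before that whitespace, and a space consumed by one match is no longer available to start the next one:

```python
import re

sentence = "ali_s ali_t ali_u aabs:/t.co/kMMALke2l9"
# The pattern from the question's second attempt (assumed to use \s).
pattern = re.compile(r'((^|\s)[a-zA-Z0-9_]{4,15}(\s|$))')

# Collect what group 1 actually captures for each match.
matches = [m.group(1) for m in pattern.finditer(sentence)]
print(matches)  # ['ali_s ', ' ali_u ']
```

The word ali_t is skipped because the space in front of it was already consumed by the first match, and ' ali_u ' starts with a space, which is why the ‘@’ ends up attached to the end of ali_t.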
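By the way, your regex can be simplified to r'^(\w{4,15})$' (note that in Python 3, \w also matches non-ASCII letters and digits unless you pass flags=re.ASCII):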
def add_atsign(sents):
    new_list = []
    for string in sents:
        new_list.append(' '.join(
            re.sub(r'^(\w{4,15})$', r'@\1', w)
            for w in string.split()))
    return new_list
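If you would rather keep the original spacing intact (split() followed by ' '.join collapses runs of whitespace into single spaces), a single-pass alternative with lookarounds may work. This is only a sketch and not part of the approach above: the function name add_atsign_lookaround is hypothetical, and it assumes words are delimited by whitespace only.

```python
import re

# (?<!\S) requires that the word is not preceded by a non-whitespace character,
# (?!\S) requires that it is not followed by one. Neither consumes any text,
# so adjacent words can all match.
WORD = re.compile(r'(?<!\S)([a-zA-Z0-9_]{4,15})(?!\S)')

def add_atsign_lookaround(sents):
    # Build a new list instead of modifying the input in place.
    return [WORD.sub(r'@\1', str(sent)) for sent in sents]

print(add_atsign_lookaround(["ali_s ali_t ali_u aabs:/t.co/kMMALke2l9"]))
# ['@ali_s @ali_t @ali_u aabs:/t.co/kMMALke2l9']
```

Because the lookarounds are zero-width, the whitespace stays where it was and only whole, whitespace-delimited words get the prefix.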