
How to prevent a regex from matching substrings of words?

I have a regex in Python and I want to prevent it from matching substrings. I want to add '@' at the beginning of words that consist only of alphanumeric characters and underscores and are 4 to 15 characters long. But it also matches substrings of larger words. I have this function:

import re

def add_atsign(sents):
    for i, sent in enumerate(sents):
        # \1 refers back to the captured word
        sents[i] = re.sub(r'([a-zA-Z0-9_]{4,15})', r'@\1', str(sent))
    return sents

Here is an example:

mylist = list()
mylist.append("ali_s ali_t ali_u aabs:/t.co/kMMALke2l9")
add_atsign(mylist)

And the output is:

['@ali_s @ali_t @ali_u @aabs:/t.co/@kMMALke2l9']

As you can see, it puts '@' at the beginning of 'aabs' and 'kMMALke2l9', which is wrong. I tried to edit the code as below:

def add_atsign(sents):
    for i, sent in enumerate(sents):
        sents[i] = re.sub(r'((^|\s)[a-zA-Z0-9_]{4,15}(\s|$))', r'@\1', str(sent))
    return sents

But the result becomes:

['@ali_s ali_t@ ali_u aabs:/t.co/kMMALke2l9']

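As you can see, the replacements are wrong: the surrounding whitespace is consumed as part of each match, so 'ali_t' is skipped and the '@' lands before the space in front of 'ali_u'. The correct result I expect is: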

"@ali_s @ali_t @ali_u aabs:/t.co/kMMALke2l9"

Could anyone help? Thanks


Answer

This is a pretty interesting question. If I understand correctly, the issue is that you want to divide the string by spaces, and then do the replacement only if the entire word matches, and not catch a substring.

I think the best way to do this is to first split by spaces, and then anchor your regex with ^ and $ so that it only matches an entire word:

import re

def add_atsign(sents):
    new_list = []
    for string in sents:
        # ^ and $ force the pattern to cover the whole word
        new_list.append(' '.join(re.sub(r'^([a-zA-Z0-9_]{4,15})$', r'@\1', w)
                        for w in string.split()))
    return new_list

mylist = ["ali_s ali_t ali_u aabs:/t.co/kMMALke2l9"]
print(add_atsign(mylist))
# ['@ali_s @ali_t @ali_u aabs:/t.co/kMMALke2l9']

That is, we split, then replace only if the entire word matches, then rejoin. Note that split() followed by ' '.join() collapses any run of whitespace into a single space.
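If you would rather not split and rejoin, the same whole-word condition can probably be expressed in a single pass with lookarounds. The following is only a sketch: the names TOKEN and add_atsign_lookaround are made up for illustration, and it assumes that a "word" means anything delimited by whitespace or the ends of the string.

```python
import re

# (?<!\S) : the match must not be preceded by a non-whitespace character
# (?!\S)  : the match must not be followed by a non-whitespace character
# Neither lookaround consumes the whitespace, so adjacent words still match.
TOKEN = re.compile(r'(?<!\S)([a-zA-Z0-9_]{4,15})(?!\S)')

def add_atsign_lookaround(sents):
    # Hypothetical alternative to add_atsign; returns a new list
    return [TOKEN.sub(r'@\1', str(sent)) for sent in sents]

print(add_atsign_lookaround(["ali_s ali_t ali_u aabs:/t.co/kMMALke2l9"]))
# ['@ali_s @ali_t @ali_u aabs:/t.co/kMMALke2l9']
```

Unlike the split-and-join version, this leaves the original whitespace between words untouched.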

By the way, your regex can be simplified to r'^(\w{4,15})$' (in Python 3, \w also matches non-ASCII letters and digits unless the re.ASCII flag is set):

def add_atsign(sents):
    new_list = []
    for string in sents:
        new_list.append(' '.join(re.sub(r'^(\w{4,15})$', r'@\1', w)
                        for w in string.split()))
    return new_list
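One thing to keep in mind with the \w shorthand: for str patterns in Python 3 it is Unicode-aware by default, so it is not strictly identical to [a-zA-Z0-9_]. The small sketch below shows the difference. The helper name tag_word and the sample word are assumptions made up for this illustration.

```python
import re

ASCII_WORD = re.compile(r'\w{4,15}', re.ASCII)   # same set as [a-zA-Z0-9_]
UNICODE_WORD = re.compile(r'\w{4,15}')           # also accepts letters such as 'é'

def tag_word(word, pattern=ASCII_WORD):
    # fullmatch succeeds only if the whole word matches, like ^...$
    return '@' + word if pattern.fullmatch(word) else word

print(tag_word("ali_s"))                      # @ali_s
print(tag_word("caf\u00e9_x"))                # unchanged under re.ASCII
print(tag_word("caf\u00e9_x", UNICODE_WORD))  # gets the '@' prefix
```

If your words should be restricted to ASCII characters, passing re.ASCII (or keeping the explicit character class) is the safer choice.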