Python find all occurrences of hyphenated word and replace at position

I have to replace all occurrences of hyphenated stutter patterns such as c-c-c-c-come or oh-oh-oh-oh with their last token (come and oh in these examples), where

  • The number of characters between hyphens is arbitrary; it can be one or more characters
  • The token to match is the last token in the hyphenation, hence come in c-c-come.
  • The input string may contain one or more occurrences, as in the following sentences:

    c-c-c-c-come to home today c-c-c-c-come to me

    oh-oh-oh-oh it's a bad life oh-oh-oh-oh

  • I need to find the start and end position of each matched token via finditer:

    r = re.compile(pattern, flags=re.I | re.X | re.UNICODE)
    for m in r.finditer(text):
        word = m.group()
        characterOffsetBegin = m.start()
        characterOffsetEnd = m.end()
        # now replace and store indexes
    

[UPDATE]

Assuming that those hyphenated words do not belong to a fixed dictionary, I'm adding this constraint:

  • The number of characters between hyphens must range from a minimum to a maximum, like {1,3}, so that the capture group matches c-come or c-c-come, but not a real hyphenated word like fine-tuning or inter-face.

Answer

An option using a capturing group and a backreference might be:

(?<!\S)(\w{1,3})(?:-\1)*-(\w+)(?!\S)

That will match:

  • (?<!\S) Negative lookbehind, assert that what is on the left is not a non-whitespace char (i.e. start of string or whitespace)
  • (\w{1,3}) Capture one to three word chars in group 1
  • (?:-\1)* Repeat 0+ times matching a hyphen followed by a backreference to what was matched in group 1
  • -(\w+) Match - followed by 1+ word chars captured in group 2
  • (?!\S) Negative lookahead, assert that what is on the right is not a non-whitespace char (i.e. end of string or whitespace)
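
To see how the 1-to-3 character limit on group 1 behaves, here is a small sketch that runs the pattern against a few isolated tokens. The sample list is my own, not from the question. Note that a real word with a short prefix, such as re-do, would still be matched, so this is a heuristic rather than a guarantee.

```python
import re

# Pattern from the answer: a 1-3 character prefix, optional repeats of it,
# then a hyphen and the final token captured in group 2.
pattern = re.compile(r"(?<!\S)(\w{1,3})(?:-\1)*-(\w+)(?!\S)")

# Hypothetical sample tokens, chosen for illustration only.
samples = ["c-come", "c-c-come", "oh-oh-oh-oh", "fine-tuning", "inter-face", "re-do"]

results = {}
for token in samples:
    m = pattern.fullmatch(token)
    # Store the last token if the pattern matched, otherwise None.
    results[token] = m.group(2) if m else None

for token, last in results.items():
    print(f"{token!r} -> {last!r}")
```

fine-tuning and inter-face are left alone because the prefix before the first hyphen is longer than three characters; re-do is reduced to do because its prefix is short enough.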

In the replacement, use the second capturing group \2, written in Python as r'\2'

For example

import re

text = "c-c-c-c-come oh-oh-oh-oh it's a bad life oh-oh-oh-oh"
pattern = r"(?<!\S)(\w{1,3})(?:-\1)*-(\w+)(?!\S)"
text = re.sub(pattern, r'\2', text)
print(text)

Result

come oh it's a bad life oh
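
The question also asks for the start and end position of each matched token via finditer. A minimal sketch of that, assuming the offsets wanted are those of the last token (group 2) in the original string; the helper name collapse_stutters is hypothetical:

```python
import re

# Same pattern as above; re.I lets the backreference match case-insensitively.
pattern = re.compile(r"(?<!\S)(\w{1,3})(?:-\1)*-(\w+)(?!\S)", flags=re.I | re.UNICODE)

def collapse_stutters(text):
    """Return the cleaned text and a list of (token, start, end) tuples.

    The offsets refer to the ORIGINAL text and delimit the last token
    (group 2). Use m.start() / m.end() instead to get the whole stutter.
    """
    spans = []
    for m in pattern.finditer(text):
        spans.append((m.group(2), m.start(2), m.end(2)))
    cleaned = pattern.sub(r"\2", text)
    return cleaned, spans

text = "c-c-c-c-come to home today c-c-c-c-come to me"
cleaned, spans = collapse_stutters(text)
print(cleaned)  # come to home today come to me
print(spans)    # [('come', 8, 12), ('come', 35, 39)]
```

If offsets into the cleaned string are needed instead, they would have to be adjusted by the number of characters removed before each match.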