I have to replace all occurrences of hyphenated patterns like c-c-c-c-come or oh-oh-oh-oh, etc. with the last token, i.e. come or oh in these examples, where
- The number of characters between hyphens is arbitrary; it can be one or more characters
- The token to match is the last token in the hyphenation, hence come in c-c-come.

The input string may have one or more occurrences of it, as in the following sentences:

c-c-c-c-come to home today c-c-c-c-come to me
oh-oh-oh-oh it's a bad life oh-oh-oh-oh

I need to find the start and end position of the matched token via finditer:

import re

r = re.compile(pattern, flags=re.I | re.X | re.UNICODE)
for m in r.finditer(text):
    word = m.group()
    characterOffsetBegin = m.start()
    characterOffsetEnd = m.end()
    # now replace and store indexes
[UPDATE]
Assuming that those hyphenated words do not belong to a fixed dictionary, I'm adding this constraint:
- The number of characters between hyphens must range from a minimum to a maximum, like {1,3}, so that the capture group matches c-come or c-c-come, but not a real hyphenated word like fine-tuning or inter-face, etc.
Answer
An option using a capturing group and a backreference might be:
(?<!\S)(\w{1,3})(?:-\1)*-(\w+)(?!\S)
That will match:
- (?<!\S) Negative lookbehind, assert what is on the left is not a non-whitespace char
- (\w{1,3}) Capture in group 1 one to three word chars
- (?:-\1)* Repeat 0+ times matching a hyphen followed by a backreference to what is matched in group 1
- -(\w+) Match - followed by 1+ word chars, captured in group 2
- (?!\S) Negative lookahead, assert what is on the right is not a non-whitespace char
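To see how the pieces above interact, a small check against single words can help. This is only a sketch: the sample words and the helper name last_token are made up for illustration, and re.fullmatch is used so that each word is tested on its own.

```python
import re

# Pattern from the answer, with 1 to 3 word characters in the repeated part.
PATTERN = re.compile(r"(?<!\S)(\w{1,3})(?:-\1)*-(\w+)(?!\S)")

# Hypothetical sample words, chosen only for illustration.
samples = ["c-come", "c-c-come", "oh-oh-oh-oh",
           "fine-tuning", "inter-face", "re-use"]


def last_token(word):
    """Return group 2 (the final token) if the whole word matches, else None."""
    m = PATTERN.fullmatch(word)
    return m.group(2) if m else None


results = {word: last_token(word) for word in samples}
for word, token in results.items():
    print(f"{word!r:15} -> {token!r}")
```

The words fine-tuning and inter-face are rejected because their first part is longer than three characters. A real word with a short first part, such as re-use, still matches, because the pattern does not require the final token to start with the repeated part.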
In the replacement use the second capturing group, \2, written in Python as r'\2'
For example
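import re

text = "c-c-c-c-come oh-oh-oh-oh it's a bad life oh-oh-oh-oh"
# Group 1 is the repeated 1-3 character part, group 2 is the final token to keep
pattern = r"(?<!\S)(\w{1,3})(?:-\1)*-(\w+)(?!\S)"
# Replace each whole match with the contents of group 2
text = re.sub(pattern, r'\2', text)
print(text)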
Result
come oh it's a bad life oh
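The question also asks for the start and end position of each matched token. One possible way to get them, sketched here with the same pattern, is to walk the matches with finditer and rebuild the string by hand instead of calling re.sub. The function name replace_with_offsets and the dictionary keys are illustrative assumptions, not part of the original answer.

```python
import re

PATTERN = re.compile(r"(?<!\S)(\w{1,3})(?:-\1)*-(\w+)(?!\S)")


def replace_with_offsets(text):
    """Return the cleaned text and one record per replaced match."""
    pieces = []    # chunks of the output string
    records = []   # offsets for every replaced match
    last_end = 0   # end of the previous match in the source text
    out_len = 0    # length of the output built so far
    for m in PATTERN.finditer(text):
        # Copy the untouched text between the previous match and this one.
        gap = text[last_end:m.start()]
        pieces.append(gap)
        out_len += len(gap)

        token = m.group(2)
        records.append({
            "matched": m.group(),
            "token": token,
            "source_span": (m.start(), m.end()),   # whole hyphenated run
            "token_source_span": m.span(2),        # final token in the input
            "output_span": (out_len, out_len + len(token)),  # token in the output
        })
        pieces.append(token)
        out_len += len(token)
        last_end = m.end()
    pieces.append(text[last_end:])
    return "".join(pieces), records


text = "c-c-c-c-come to home today c-c-c-c-come to me"
cleaned, records = replace_with_offsets(text)
print(cleaned)
for record in records:
    print(record)
```

Each record keeps the offsets in the original string as well as in the cleaned string, since the two differ once earlier matches have been shortened.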