I have to replace all occurrences of hyphenated patterns like c-c-c-c-come or oh-oh-oh-oh, etc. with the last token, i.e. come or oh in these examples, where
- The number of characters between hyphens is arbitrary; it can be one or more characters
- The token to match is the last token in the hyphenation, hence come in c-c-come.

The input string may have one or more occurrences of it, as in the following sentences:
c-c-c-c-come to home today c-c-c-c-come to me
oh-oh-oh-oh it's a bad life oh-oh-oh-oh
I need to find the start and end position of the matched token via finditer:

    import re

    r = re.compile(pattern, flags=re.I | re.X | re.UNICODE)
    for m in r.finditer(text):
        word = m.group()
        characterOffsetBegin = m.start()
        characterOffsetEnd = m.end()
        # now replace and store indexes
[UPDATE]
Assuming that those hyphenated words do not belong to a fixed dictionary, I'm adding this constraint:
- The number of characters between hyphens must range from a minimum to a maximum, like {1,3}, so that the capture group matches c-come or c-c-come, but not a hyphenated real word like fine-tuning or inter-face, etc.
Answer
An option using a capturing group and a backreference might be:

    (?<!\S)(\w{1,3})(?:-\1)*-(\w+)(?!\S)

That will match:
- (?<!\S) Negative lookbehind: assert that what is on the left is not a non-whitespace char
- (\w{1,3}) Capture one to three word chars in group 1
- (?:-\1)* Repeat 0+ times matching a hyphen followed by a backreference to what is matched in group 1
- -(\w+) Match - followed by 1+ word chars, captured in group 2
- (?!\S) Negative lookahead: assert that what is on the right is not a non-whitespace char
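As a quick check of the {1,3} bound, the sketch below runs the pattern against the stutter tokens and the real hyphenated words mentioned in the question. The helper name is_stutter is made up for illustration and is not part of the original post.

```python
import re

# Pattern from the answer: 1 to 3 word chars per repeated chunk.
PATTERN = re.compile(r"(?<!\S)(\w{1,3})(?:-\1)*-(\w+)(?!\S)")

def is_stutter(token):
    """Hypothetical helper: True if the whole token matches the pattern."""
    return PATTERN.fullmatch(token) is not None

for token in ["c-come", "c-c-come", "oh-oh-oh-oh", "fine-tuning", "inter-face", "to-do"]:
    print(token, is_stutter(token))
```

The length bound is only a heuristic: fine-tuning and inter-face are rejected because their first part is longer than three characters, but a short real word such as to-do still matches, since the pattern does not require the last token to start with the repeated chunk.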
In the replacement use the second capturing group \2, or r'\2' as a Python raw string.
For example:

    import re

    text = "c-c-c-c-come oh-oh-oh-oh it's a bad life oh-oh-oh-oh"
    pattern = r"(?<!\S)(\w{1,3})(?:-\1)*-(\w+)(?!\S)"
    text = re.sub(pattern, r'\2', text)
    print(text)
Result
come oh it's a bad life oh
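Since the question also asks for the start and end positions, one possible approach is to iterate with finditer and rebuild the string by hand, recording the offsets along the way. This is a minimal sketch, assuming the offsets wanted are those in the original (unreplaced) string; the function name replace_with_offsets and the (token, start, end) tuple layout are illustrative choices, not from the original post.

```python
import re

# Same pattern as above; re.I lets the backreference match case-insensitively.
PATTERN = re.compile(r"(?<!\S)(\w{1,3})(?:-\1)*-(\w+)(?!\S)", flags=re.I)

def replace_with_offsets(text):
    """Return (new_text, hits); each hit is (token, start, end) in the original text."""
    pieces, hits, last = [], [], 0
    for m in PATTERN.finditer(text):
        hits.append((m.group(2), m.start(), m.end()))
        pieces.append(text[last:m.start()])  # unchanged text before the match
        pieces.append(m.group(2))            # keep only the last token
        last = m.end()
    pieces.append(text[last:])               # unchanged tail after the last match
    return "".join(pieces), hits

new_text, hits = replace_with_offsets("c-c-c-c-come to home today c-c-c-c-come to me")
print(new_text)
print(hits)
```

If only the position of the kept token is needed rather than the whole hyphenated run, m.start(2) and m.end(2) give the span of the second group instead.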