Find indices of target words without the surrounding brackets

Question

I have a set of sentences in which the target words (target["text"]) are surrounded by brackets, braces, or parentheses, and some of these delimiters overlap or are nested. I want to extract the target words together with their correct indices in the sentence, as if the brackets/braces/parentheses were not there. I have manag…

Accepted Answer

Given your sentence and your pattern:

    sentence = "{ia} ({fascia} antebrachii). Genom att aponeurosen fäster i armb"
    pattern = r'\{[^{}]+\}|\[[^\[\]]+\]|\([^()]+\)'

and given that your delimiters are braces, brackets and parentheses, you can do the following. Note that overlapped=True is a feature of the third-party regex module, not of the standard library's re, so the snippet assumes regex is imported under the name re:

    import regex as re  # third-party module; needed for overlapped=True

    # extract your matches from the sentence
    matches = re.findall(pattern, sentence, overlapped=True)

    # clean the matches from the delimiters
    words = [re.sub(r'[{}\[\]()]', '', m) for m in matches]

    # clean your sentence from the delimiters
    clean_sent = re.sub(r'[{}\[\]()]', '', sentence)

    # search for the clean words in the clean sentence
    targets = [{
        "start": m.start(2),
        "end": m.end(2),
        "text": clean_sent[m.start(2) : m.end(2)],
    } for m in map(lambda word: re.search(rf'(^|[^\w]+)({word})($|[^\w]+)', clean_sent), words)]

Side note on the last search pattern, (^|[^\w]+)({word})($|[^\w]+). It looks for the word ({word}) where it appears:

- after the start of the string or one or more non-word characters: (^|[^\w]+)
- before the end of the string or one or more non-word characters: ($|[^\w]+)

match.start and match.end are called with 2 as their argument because we want the start and end index of the second group, which is the word itself. The resulting indices refer to the cleaned sentence.

Does this solution help you?

EDIT: How to handle the case when words are directly next to delimiters during sentence cleaning?

You can handle that edge case by adding a space between delimiters and adjacent words before removing the delimiters.

    # clean your sentence from the delimiters
    clean_sent = re.sub(r'(\w)([(\[{])', r'\1 \2', sentence)
    clean_sent = re.sub(r'([)\]}])(\w)', r'\1 \2', clean_sent)
    clean_sent = re.sub(r'[{}\[\]()]', '', clean_sent)

The first regex matches every opening delimiter preceded by a word character and replaces the pair with the word character and the delimiter separated by a space, using backreferences. The second regex matches every closing delimiter followed by a word character and replaces the pair with the delimiter and the word character separated by a space, again using backreferences. The third regex is taken directly from the snippet above.
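The steps in the answer above can be combined into one function. What follows is a minimal sketch, not the answerer's exact code, and it rests on a few assumptions: it uses only the standard-library re module and emulates overlapping matches by wrapping the pattern in a lookahead; it applies the space-insertion step before extracting the words so that words and sentence stay consistent; and it adds re.escape and a None check. The names find_targets, DELIMS and BRACKETED are hypothetical.

```python
import re

# Delimiter characters to strip: braces, brackets, parentheses.
DELIMS = re.compile(r'[{}\[\]()]')

# The answer's pattern wrapped in a lookahead so the standard-library `re`
# can report overlapping/nested spans (assumption: this stands in for
# `overlapped=True`, which only the third-party `regex` module offers).
BRACKETED = re.compile(r'(?=(\{[^{}]+\}|\[[^\[\]]+\]|\([^()]+\)))')


def find_targets(sentence):
    """Return (clean_sentence, targets); indices refer to clean_sentence."""
    # Put a space between a delimiter and a word character glued to it.
    spaced = re.sub(r'(\w)([(\[{])', r'\1 \2', sentence)
    spaced = re.sub(r'([)\]}])(\w)', r'\1 \2', spaced)

    clean_sent = DELIMS.sub('', spaced)

    # Bracketed spans (possibly nested), then the same spans without delimiters.
    matches = [m.group(1) for m in BRACKETED.finditer(spaced)]
    words = [DELIMS.sub('', m) for m in matches]

    targets = []
    for word in words:
        # re.escape guards against regex metacharacters inside the word.
        m = re.search(rf'(^|[^\w]+)({re.escape(word)})($|[^\w]+)', clean_sent)
        if m is None:  # skip words that cannot be located
            continue
        targets.append({"start": m.start(2), "end": m.end(2), "text": m.group(2)})
    return clean_sent, targets


sentence = "{ia} ({fascia} antebrachii). Genom att aponeurosen fäster i armb"
clean_sent, targets = find_targets(sentence)
for t in targets:
    print(t)
```

For the example sentence this reports ia at 0 to 2, fascia antebrachii at 3 to 21, and fascia at 3 to 9, all relative to the cleaned sentence. As in the answer, re.search returns only the first occurrence of each word, so a target word that appears more than once will always map to its first position.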


Answer