Find indices of target words without the surrounding brackets

I have a set of sentences with target words (target["text"]) surrounded by brackets, braces, or parentheses, some of which overlap or are nested. I want to extract these target words along with their correct indices in the sentence once the brackets, braces, and parentheses are removed. So far I have managed to do this with the delimiters still in place:

JavaScript

Now I want to remove the brackets, braces, and parentheses from each target["text"] and find the correct indices of these targets in the sentence without the delimiters. Because of the overlapping delimiters, I am having trouble identifying the correct indices. The code below works only when the brackets do not overlap:

JavaScript

What would be the recommended approach here? Thanks!

Expected output:

JavaScript


Answer

Given your sentence and your pattern:

JavaScript

and given that your delimiters are braces, brackets and parentheses.

You can do the following:

JavaScript

Side note on the last pattern search, (^|[^\w]+)({word})($|[^\w]+). It matches the word ({word}) only where it is found:

  • after the start of the string or one or more non-word characters: (^|[^\w]+)
  • before the end of the string or one or more non-word characters: ($|[^\w]+)

match.start and match.end are called with 2 as the argument because we want the start and end index of the second group, which is the word itself.
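To make the group numbering concrete, here is a minimal sketch of that search on its own. The helper name find_word_spans and the sample sentence are made up for illustration, and the re.escape call is my addition so that a word containing regex metacharacters is matched literally.

```python
import re

def find_word_spans(sentence, word):
    """Return (start, end) index pairs for each whole-word hit of `word`.

    Hypothetical helper, not part of the original answer.
    """
    # Group 1: start of string or non-word characters before the word
    # Group 2: the word itself (the group whose indices we want)
    # Group 3: end of string or non-word characters after the word
    pattern = rf"(^|[^\w]+)({re.escape(word)})($|[^\w]+)"
    return [(m.start(2), m.end(2)) for m in re.finditer(pattern, sentence)]

spans = find_word_spans("the cat sat on the mat", "the")
print(spans)  # [(0, 3), (15, 18)]
```

One caveat with this pattern: group 3 consumes the separator after the word, so two occurrences separated by a single character (e.g. "the the") would only report the first. If that matters for your data, lookarounds such as (?<!\w) and (?!\w) are a possible alternative.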

Does this solution help you?


EDIT: How to handle the case when words are near delimiters during sentence cleaning?

You can handle that edge case by adding one space between delimiters and words before removing the delimiters.

JavaScript

The first regex will match every delimiter preceded by a letter, and replace the pair with the letter + delimiter separated by a space, using backreferencing.

The second regex will match all delimiters followed by a letter, and replace it with the delimiter + letter separated by a space, using backreferencing.

The third regex was taken directly from the answer snippet.
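As a rough sketch of those three steps, assuming the delimiters are exactly [ ] { } ( ) and that "letter" means ASCII A-Z/a-z (the name clean_sentence and the exact patterns are illustrative and may differ from the snippet above):

```python
import re

# Assumed delimiter set: brackets, braces and parentheses
DELIMS = r"[\[\]{}()]"

def clean_sentence(sentence):
    """Pad delimiters that touch a letter with a space, then drop them."""
    # 1) letter immediately followed by a delimiter -> "letter delimiter"
    spaced = re.sub(rf"([A-Za-z])({DELIMS})", r"\1 \2", sentence)
    # 2) delimiter immediately followed by a letter -> "delimiter letter"
    spaced = re.sub(rf"({DELIMS})([A-Za-z])", r"\1 \2", spaced)
    # 3) remove the delimiters themselves
    return re.sub(DELIMS, "", spaced)

print(repr(clean_sentence("foo(bar)baz")))  # 'foo  bar  baz'
```

Without the first two substitutions, "foo(bar)baz" would collapse to "foobarbaz" and the whole-word search would no longer find "bar". Keep in mind that the inserted spaces shift positions, so the indices you get afterwards refer to the cleaned sentence.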

User contributions licensed under: CC BY-SA