
Regular Expression split w/ Lookbehind loses second half

I have a string that contains a number of keywords. I would like to split the string into a list at those keywords, but keep the keywords in the output because they identify what the following data means.

Take the following string for example:

test_string = "ªttypmp3pfilfDjTunes/DJ Music/(I've Had) The Time Of My Life.mp3tsng<(I've Had) The Time Of My Lifetart:Bill Medley & Jennifer Warnes"

The important keywords are “ttyp”, “pfil”, “tsng”, “tart”. I would like to split the string so the output looks like this:

split_test_string = ["ª", "ttypmp3", "pfilfDjTunes/DJ Music/(I've Had) The Time Of My Life.mp3", "tsng<(I've Had) The Time Of My Life", "tart:Bill Medley & Jennifer Warnes"]

I’ve been researching regular expressions, and I think this expression should work, but when I test it in Python, I end up losing the part that I want to keep. According to the Python re.split documentation, this should work.

Check out my regex on regex101: https://regex101.com/r/FOlgv8/1

Note: I’m trying to get the first part to work. Then I’ll add the rest of the keywords using |.

regex = r'(?=ttyp).*'

This is my example code:

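import re

test_string = "ªttypmp3pfilfDjTunes/DJ Music/(I've Had) The Time Of My Life.mp3tsng<(I've Had) The Time Of My Lifetart:Bill Medley & Jennifer Warnes"
regex = r'(?=ttyp).*'

split_test_string = re.split(regex, test_string)
print(f"Results: {split_test_string}")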

Console Output:

Results: ['ª', '']

I’ve tried positive lookahead and positive lookbehind with no luck. I could just split on a literal ‘ttyp’, but then I lose the keyword.

Any help would be appreciated. I’ve been researching and trial-and-erroring (mostly erroring) for hours now.


Answer

Here ya go:

re.split("(?=ttyp|pfil|tsng|tart)", test_string)

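The reason yours didn’t work is that your pattern ends in .*, so the match covers everything from ttyp to the end of the string. re.split treats the whole match as the separator and discards it. With only the lookahead, the match is zero-width: the string is split at that position and no characters are thrown away.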
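
For comparison, here is a small sketch that runs both patterns on the string from the question. The names KEYWORDS, greedy_parts and parts are just illustrative, and building the pattern from a tuple is one possible way to add the remaining keywords, not the only one. Note that splitting on a zero-width match needs Python 3.7 or newer.

```python
import re

test_string = (
    "ªttypmp3pfilfDjTunes/DJ Music/(I've Had) The Time Of My Life.mp3"
    "tsng<(I've Had) The Time Of My Lifetart:Bill Medley & Jennifer Warnes"
)

# Original attempt: ".*" consumes everything from "ttyp" to the end of the
# string, and re.split() drops whatever the pattern matched.
greedy_parts = re.split(r"(?=ttyp).*", test_string)

# Lookahead only: the match is zero-width, so the split happens just before
# each keyword and no characters are consumed.
KEYWORDS = ("ttyp", "pfil", "tsng", "tart")  # keywords named in the question
pattern = re.compile("(?=" + "|".join(map(re.escape, KEYWORDS)) + ")")
parts = pattern.split(test_string)

print(greedy_parts)  # ['ª', '']
print(parts)
```

The second print should give the five-element list asked for in the question, with each keyword kept at the start of its piece. If a keyword could also appear inside the data itself, this approach would split there too, so it only fits when the keywords are unambiguous.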

User contributions licensed under: CC BY-SA