I want to split a WhatsApp chat backup text by date and keep the date as part of each message. I tried but couldn't achieve the exact result I want. If anyone can suggest a way to achieve this, that would be a big help. (I don't know much about regex.)
import re
chat = '27/01/2019, 08:58 - Member 01 created group "Python Lovers ❤️" 27/01/2019, 08:58 - You were added 19/03/2019, 19:29 - Member 02: Hello guys,,, 19/03/2019, 19:29 - Member 03: Hi there..'
regex = r"(\b\d+/\d+/\d+.*?(?=\b\d+/\d+/\d+|$)*)"
results = re.split(regex, chat)
print(results)
The above code does the job and keeps the separator as a separate item, but I want it to be part of its corresponding message (item):
Current Result
['27/01/2019', '08:58 - You were added', '19/03/2019', '19:29 - Member 02: Hello guys,,', '19/03/2019', '19:29 - Member 03: Hi there..']
WHAT I WANT
['27/01/2019, 08:58 - You were added', '19/03/2019, 19:29 - Member 02: Hello guys', '19/03/2019, 19:29 - Member 03: Hi there..']
Answer
That happened because you used re.split with a capturing group: re.split keeps the captured chunks in the resulting list as separate items.
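To make the re.split behavior described above concrete, here is a minimal sketch. The sample string and variable names are made up for illustration; only re.split itself comes from the question.

```python
import re

# Hypothetical, shortened sample in the same shape as the chat export.
text = "27/01/2019, 08:58 - A 19/03/2019, 19:29 - B"

# No capturing group: the matched dates are consumed and dropped.
without_group = re.split(r"\d+/\d+/\d+", text)

# Capturing group: every matched date comes back as its own list item.
with_group = re.split(r"(\d+/\d+/\d+)", text)

print(without_group)  # ['', ', 08:58 - A ', ', 19:29 - B']
print(with_group)     # ['', '27/01/2019', ', 08:58 - A ', '19/03/2019', ', 19:29 - B']
```

Either way the date ends up detached from the rest of its message, which is why splitting is an awkward fit for this task.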
Your regex makes sense only if your matches can span several lines; otherwise, extracting every line that starts with a date-like pattern would be enough.
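For the simpler case where every message sits on exactly one line, a line-based filter might look like the sketch below. It assumes the backup has one message per line; the names date_prefix and messages are illustrative only.

```python
import re

# Assumption: one message per line, separated by newlines.
chat = (
    "27/01/2019, 08:58 - You were added\n"
    "19/03/2019, 19:29 - Member 02: Hello guys,,,\n"
    "19/03/2019, 19:29 - Member 03: Hi there.."
)

# Keep every line that starts with a date-like prefix such as 27/01/2019.
date_prefix = re.compile(r"\d+/\d+/\d+")
messages = [line for line in chat.splitlines() if date_prefix.match(line)]

print(messages)
```

This would silently drop the continuation lines of any message that contains a line break, so it is only suitable when that cannot happen.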
That is why I'd suggest using re.findall instead:
regex = r"\b\d+/\d+/\d.*?(?=\s*\b\d+/\d+/\d+|$)"
results = re.findall(regex, chat, re.S)
See the Python demo:
import re
chat = '''27/01/2019, 08:58 - Member 01 created group "Python Lovers ❤️"
27/01/2019, 08:58 - You were added
19/03/2019, 19:29 - Member 02: Hello guys,,,
19/03/2019, 19:29 - Member 03: Hi there..'''
regex = r"\b\d+/\d+/\d.*?(?=\s*\b\d+/\d+/\d+|$)"
results = re.findall(regex, chat, re.S)
for r in results:
    print(r)
Output:
27/01/2019, 08:58 - Member 01 created group "Python Lovers ❤️"
27/01/2019, 08:58 - You were added
19/03/2019, 19:29 - Member 02: Hello guys,,,
19/03/2019, 19:29 - Member 03: Hi there..
Note the absence of the redundant capturing group and of the * after the positive lookahead, which had made the lookahead optional. Trailing whitespace is kept out of each match by the \s* pattern inside the lookahead. The re.S flag allows . to match any character, including line break characters.
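The effect of re.S is easiest to see on a message that contains a line break. The sketch below uses invented sample data (the second line of the first message and the 19:30 timestamp are not from the question) and runs the same pattern with and without the flag.

```python
import re

# Hypothetical sample: the first message spans two lines.
chat = (
    "19/03/2019, 19:29 - Member 02: Hello guys,\n"
    "this message continues on a second line\n"
    "19/03/2019, 19:30 - Member 03: Hi there.."
)

regex = r"\b\d+/\d+/\d.*?(?=\s*\b\d+/\d+/\d+|$)"

# With re.S, "." also matches "\n", so a match can run across lines
# until the next date (or the end of the string) is reached.
with_flag = re.findall(regex, chat, re.S)

# Without re.S, "." stops at "\n", so the two-line message cannot match.
without_flag = re.findall(regex, chat)

print(len(with_flag), len(without_flag))
```

With the flag, the two-line message should come back as a single item with the line break inside it; without the flag, only the last message is found.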