I'm working with a DataFrame column in which each row holds Date | TimeStamp | Name | Message as a single string:
59770 [08/10/18, 5:57:43 PM] Luke: Message
59771 [08/10/18, 5:57:48 PM] Luke: Message
59772 [08/10/18, 5:57:50 PM] Luke: Message
I use the following function to capture the Date.
def getdate(x):
    res = re.search(r"\d\d/\d\d/\d\d", x)
    return res.group() if res else None
and the following code to capture the rest of the data (TimeStamp | Name | Message) into columns:
df['Data'].str.extract(r'\s*(.{10})\](.*):(.*)')
Is there a way to capture and extract all four entities together?
Please advise.
Answer
As an alternative, you could use regex named groups together with pandas extractall.
import pandas as pd
import re
df = pd.DataFrame(
[" [08/10/18, 5:57:43 PM] Luke: Message",
" [08/10/18, 5:57:48 PM] Luke: Message",
" [08/10/18, 5:57:50 PM] Luke: Message"])
print(df)
regex = re.compile(
    r"(?P<date>\d{2}/\d{2}/\d{2}),\s*"
    r"(?P<timestamp>\d+:\d+:\d+\s[AP]M)\]\s+"
    r"(?P<name>.+?):\s*"
    r"(?P<message>.+)$"
)
df_out = df[0].str.extractall(regex).droplevel(1)
print(df_out)
Output from df_out
       date   timestamp  name  message
0  08/10/18  5:57:43 PM  Luke  Message
1  08/10/18  5:57:48 PM  Luke  Message
2  08/10/18  5:57:50 PM  Luke  Message
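To check what each named group captures before handing the pattern to pandas, it can help to try it on a single line with plain re. The sketch below reuses the pattern from the answer, with an explicit leading "[" added; the names LINE_RE and parse_line are made up for illustration, and the sample line is copied from the question.

```python
import re

# Same pattern as in the answer, plus an explicit leading "[".
# LINE_RE and parse_line are hypothetical names used only for this sketch.
LINE_RE = re.compile(
    r"\[(?P<date>\d{2}/\d{2}/\d{2}),\s*"
    r"(?P<timestamp>\d+:\d+:\d+\s[AP]M)\]\s+"
    r"(?P<name>.+?):\s*"
    r"(?P<message>.+)$"
)

def parse_line(line):
    """Return the four fields as a dict, or None if the line does not match."""
    match = LINE_RE.search(line)
    return match.groupdict() if match else None

# Sample row from the question, including the leading row id.
sample = "59770 [08/10/18, 5:57:43 PM] Luke: Message"
print(parse_line(sample))
```

Because search scans for the first match, the leading row id is simply skipped. Once the groups look right, the same pattern string should also work with df['Data'].str.extract, which returns one column per named group.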