I’m working with a DataFrame column in which each row holds Date | TimeStamp | Name | Message as a single string:
59770 [08/10/18, 5:57:43 PM] Luke: Message
59771 [08/10/18, 5:57:48 PM] Luke: Message
59772 [08/10/18, 5:57:50 PM] Luke: Message
I use the following function to capture the Date.
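import re

def getdate(x):
    # \d\d/\d\d/\d\d matches a date such as 08/10/18
    res = re.search(r"\d\d/\d\d/\d\d", x)
    return res.group() if res else None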
and the following code to capture the rest of the data (TimeStamp | Name | Message) into columns:
df['Data'].str.extract(r'\s*(.{10})\](.*):(.*)')
Is there a way to capture and extract all four fields together?
Please advise.
Answer
As an alternative, you could use regex named groups together with pandas str.extractall. The group names become the column names of the result.
import pandas as pd
import re

df = pd.DataFrame(
    [" [08/10/18, 5:57:43 PM] Luke: Message",
     " [08/10/18, 5:57:48 PM] Luke: Message",
     " [08/10/18, 5:57:50 PM] Luke: Message"])
print(df)

# One named group per field; adjacent raw strings are concatenated into one pattern
regex = re.compile(
    r"(?P<date>\d{2}/\d{2}/\d{2}),\s*"
    r"(?P<timestamp>\d+:\d+:\d+\s[AP]M)\]\s+"
    r"(?P<name>.+?):\s*"
    r"(?P<message>.+)$"
)

# extractall adds a "match" index level; droplevel(1) removes it
df_out = df[0].str.extractall(regex).droplevel(1)
print(df_out)
Output from df_out:

       date   timestamp  name  message
0  08/10/18  5:57:43 PM  Luke  Message
1  08/10/18  5:57:48 PM  Luke  Message
2  08/10/18  5:57:50 PM  Luke  Message
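Since each row contains exactly one message, str.extract with the same named groups may be enough. It returns one row per input row, so no droplevel is needed. The sketch below is only an illustration under stated assumptions: the column name Data is taken from the question, the variable names pattern and parsed are made up here, and the date is assumed to be day/month/year. Swap %d and %m if your export is month/day/year.

```python
import pandas as pd

# Sample rows copied from the question; "Data" is the column name used there.
df = pd.DataFrame({"Data": [
    "[08/10/18, 5:57:43 PM] Luke: Message",
    "[08/10/18, 5:57:48 PM] Luke: Message",
    "[08/10/18, 5:57:50 PM] Luke: Message",
]})

# Same four named groups as in the answer, anchored on the opening bracket.
pattern = (
    r"\[(?P<date>\d{2}/\d{2}/\d{2}),\s*"
    r"(?P<timestamp>\d+:\d+:\d+\s[AP]M)\]\s+"
    r"(?P<name>.+?):\s*"
    r"(?P<message>.*)$"
)

# str.extract keeps the original index and returns the first match per row.
parsed = df["Data"].str.extract(pattern)

# ASSUMPTION: dates are day/month/year; change the format if that is wrong.
parsed["datetime"] = pd.to_datetime(
    parsed["date"] + " " + parsed["timestamp"],
    format="%d/%m/%y %I:%M:%S %p",
)

print(parsed)
```

Rows that do not match the pattern come back as NaN in every extracted column, which makes malformed lines easy to spot with parsed["date"].isna().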