Using Regex to extract Data to different Columns in Pandas

I’m working with a DataFrame column in which each string holds a date, a timestamp, a name, and a message:

59770        [08/10/18, 5:57:43 PM] Luke: Message
59771   [08/10/18, 5:57:48 PM] Luke: Message
59772     [08/10/18, 5:57:50 PM] Luke: Message

I use the following function to capture the Date.

def getdate(x):
    res = re.search(r"\d\d/\d\d/\d\d", x)  # requires `import re`
    return res.group() if res else None

and the following code to capture the rest of the data (TimeStamp | Name | Message) into columns:

df['Data'].str.extract(r'\s*(.{10})](.*):(.*)')

Is there a way to capture and extract all four fields together, in a single step?

Please advise.

Answer

As an alternative, you could use regex named groups together with pandas str.extractall. This captures all four fields in one pass, and the group names become the column names.

import pandas as pd
import re

df = pd.DataFrame(
    ["        [08/10/18, 5:57:43 PM] Luke: Message",
     "   [08/10/18, 5:57:48 PM] Luke: Message",
     "     [08/10/18, 5:57:50 PM] Luke: Message"])

print(df)

regex = re.compile(
    r"(?P<date>\d{2}/\d{2}/\d{2}),\s*"
    r"(?P<timestamp>\d+:\d+:\d+\s[AP]M)\]\s+"
    r"(?P<name>.+?):\s*"
    r"(?P<message>.+)$"
    )

# extractall adds a "match" index level; drop it, since each row matches only once
df_out = df[0].str.extractall(regex).droplevel(1)
print(df_out)

Output from df_out

       date   timestamp  name  message
0  08/10/18  5:57:43 PM  Luke  Message
1  08/10/18  5:57:48 PM  Luke  Message
2  08/10/18  5:57:50 PM  Luke  Message
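
Since each row here contains exactly one match, a similar result can probably be had with str.extract, which returns one row per input row and so needs no droplevel. The sketch below is an illustration rather than part of the original answer. The column name Data is taken from the question, the leading \[ is added to anchor on the opening bracket, and the combined datetime column assumes the dates are day/month/year (swap %d and %m in the format if they are month/day/year).

```python
import pandas as pd

# Sample rows copied from the question; the column name "Data" follows the question.
df = pd.DataFrame({"Data": [
    "        [08/10/18, 5:57:43 PM] Luke: Message",
    "   [08/10/18, 5:57:48 PM] Luke: Message",
    "     [08/10/18, 5:57:50 PM] Luke: Message",
]})

pattern = (
    r"\[(?P<date>\d{2}/\d{2}/\d{2}),\s*"
    r"(?P<timestamp>\d+:\d+:\d+\s[AP]M)\]\s+"
    r"(?P<name>.+?):\s*"
    r"(?P<message>.+)$"
)

# str.extract keeps the original index and returns one column per named group;
# rows that do not match the pattern come back as NaN.
parts = df["Data"].str.extract(pattern)

# Assumption: day/month/year. Adjust the format string if the data is month/day/year.
parts["datetime"] = pd.to_datetime(
    parts["date"] + " " + parts["timestamp"], format="%d/%m/%y %I:%M:%S %p"
)
print(parts)
```

Passing an explicit format to pd.to_datetime avoids any guessing about which number is the day and which is the month.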
User contributions licensed under: CC BY-SA