Python – Parse text file with no delimiter and dynamic width values

Tags: , ,



My goal is to parse a text file in Python that has no headers therefore no columns names and no delimiters. A sample of the original file looks as follows:

Apr 14, 2021 12:40:00 AM CEST INFO   [purge.PurgeManager run] PURGE: Atom purge all data beginning (1 threads)
Apr 14, 2021 1:40:00 AM CEST INFO    [purge.PurgeManager run] PURGE: Atom purge all data beginning (1 threads)
Apr 14, 2021 2:40:00 AM CEST INFO    [purge.PurgeManager run] PURGE: Atom purge all data beginning (1 threads)

I tried to import the file into an Excel file but since it has no delimiter nor fixed-width, each value of the row is wrapped within one cell (cell A).

Now, since the file is not fixed width or delimited, how can I extract the date from within each row (which is dynamic in width as you can see – row 1 length Apr 14, 2021 12:40:00 AM CEST INFO > row 2 length Apr 14, 2021 1:40:00 AM CEST INFO)? I have no interest in manipulating other columns’ values, except for this.

I’ve tried using panda library with both read_csv() and read_fwf() and:

  1. using read_csv() it returns a data frame with two columns: column[0] – the index and column[1] – the value (date and the rest) wrapped into one column cell;
  2. using read_fwf(): cannot quite use it since the width of the date is dynamic.

Is there any way of achieving this using Python? Thanks

Answer

You can read the file line by line and use str.split() to parse it:

import dateutil
import pandas as pd

data = []
with open("your_file.txt", "r") as f_in:
    for line in map(str.strip, f_in):
        if not line:
            continue
        line = line.split(maxsplit=6)
        date = " ".join(line[:6])
        status = line[-1].split(maxsplit=1)[0]
        rest = line[-1].split(maxsplit=1)[-1]
        data.append({"date": date, "status": status, "rest": rest})


tzmapping = {
    "CET": dateutil.tz.gettz("Europe/Berlin"),
    "CEST": dateutil.tz.gettz("Europe/Berlin"),
}

df = pd.DataFrame(data)
df["date"] = df["date"].apply(dateutil.parser.parse, tzinfos=tzmapping)
print(df)

Prints:

                       date status                                               rest
0 2021-04-14 00:40:00+02:00   INFO  [purge.PurgeManager run] PURGE: Atom purge all...
1 2021-04-14 01:40:00+02:00   INFO  [purge.PurgeManager run] PURGE: Atom purge all...
2 2021-04-14 02:40:00+02:00   INFO  [purge.PurgeManager run] PURGE: Atom purge all...


Source: stackoverflow