Skip to content
Advertisement

How to create Dataframe with the columns names being as a part of a csv file path?

I have a root folder

/home/project/data

With multiple folders in it with and ultimate paths to csv files:

/home/project/data/2020-12-05/John_Smith/data.csv
/home/project/data/2020-12-05/Robert_White/data.csv
/home/project/data/2020-12-06/John_Smith/data.csv
/home/project/data/2020-12-06/Sam_Walberg/data.csv
/home/project/data/2020-12-06/Garry_Oswald/data.csv
...

I was managed to create a Dataframe containing all csv files concatenated using the following code:

full_path = []
for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        full_path.append(os.path.join(subdir, file))
dfs = [pd.read_csv(csv_path) for csv_path in full_path]
df = pd.concat(dfs)

Result:

df=
pr_id   quantity
0   27  4
1   89  1
2   33  2
3   8   3
4   16  1
...

But I am now struggling to add the respective date+name to the Dataframe, so it would look like this:

df=
pr_id quantity name date
0   27  4 John_Smith 2020-12-05
1   89  1 Robert_White 2020-12-05
2   33  2 John_Smith 2020-12-06
3   8   3 Sam_Walberg 2020-12-06
4   16  1 Garry_Oswald 2020-12-06
...

How can I do it?

Advertisement

Answer

With pathlib, you can go 1 & 2 directories up and get the name and date. Since this involves two things, an explicit for loop might be more readable than the list comprehension:

from pathlib import Path

# ...above are the same
dfs = []
for csv_path in full_path:
    # generate a `Path` object and get parents
    p = Path(csv_path)
    parents = p.parents

    # get the desired values from "parent" dirs
    name = parents[0].name
    date = parents[1].name

    # read in the CSV as is
    frame = pd.read_csv(csv_path)
    
    # assign the `name` and `date` columns
    frame["name"] = name
    frame["date"] = date

    # store in the list
    dfs.append(frame)

# lastly concating as you did
df = pd.concat(dfs)

Or equivalently, the list comprehension counterpart is:

dfs = [pd.read_csv(csv_path).assign(name=csv_path.parents[0].name,
                                    date=csv_path.parents[1].name)
       for csv_path in map(Path, full_path)]

df = pd.concat(dfs)

where we use assign to put new columns to each dataframe.

It depends on you to choose between explicit for loop or list comprehension.

Advertisement