I have a root folder:

/home/project/data

It contains multiple subfolders that ultimately lead to CSV files:

/home/project/data/2020-12-05/John_Smith/data.csv
/home/project/data/2020-12-05/Robert_White/data.csv
/home/project/data/2020-12-06/John_Smith/data.csv
/home/project/data/2020-12-06/Sam_Walberg/data.csv
/home/project/data/2020-12-06/Garry_Oswald/data.csv

I managed to create a DataFrame containing all the CSV files concatenated, using the following code:

import os
import pandas as pd

rootdir = "/home/project/data"  # the root folder shown above

full_path = []
for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        full_path.append(os.path.join(subdir, file))
dfs = [pd.read_csv(csv_path) for csv_path in full_path]
df = pd.concat(dfs)

Result:

df=
   pr_id  quantity
0     27         4
1     89         1
2     33         2
3      8         3
4     16         1

But I am now struggling to add the respective date and name to the DataFrame, so that it would look like this:

df=
   pr_id  quantity          name        date
0     27         4    John_Smith  2020-12-05
1     89         1  Robert_White  2020-12-05
2     33         2    John_Smith  2020-12-06
3      8         3   Sam_Walberg  2020-12-06
4     16         1  Garry_Oswald  2020-12-06

How can I do it?
Answer
With pathlib, you can go one and two directories up from each CSV file to get the name and the date, respectively. Since this involves two things, an explicit for loop might be more readable than the list comprehension:

from pathlib import Path

# ... the code above stays the same
dfs = []
for csv_path in full_path:
    # generate a `Path` object and get parents
    p = Path(csv_path)
    parents = p.parents

    # get the desired values from "parent" dirs
    name = parents[0].name
    date = parents[1].name

    # read in the CSV as is
    frame = pd.read_csv(csv_path)

    # assign the `name` and `date` columns
    frame["name"] = name
    frame["date"] = date

    # store in the list
    dfs.append(frame)

# lastly, concatenate as you did
df = pd.concat(dfs)

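As a quick check of what the two parent lookups return, here is a minimal sketch using one of the paths from the question. `PurePosixPath` is used here, instead of `Path`, only as an assumption to make the snippet behave the same on any operating system without touching the disk.

```python
from pathlib import PurePosixPath

# One of the paths from the question. PurePosixPath only parses the string,
# so the file does not need to exist.
p = PurePosixPath("/home/project/data/2020-12-05/John_Smith/data.csv")

name = p.parents[0].name  # the directory that directly contains data.csv
date = p.parents[1].name  # one level further up

print(name, date)  # John_Smith 2020-12-05
```

Note that this relies on the folder layout being exactly root/date/name/data.csv. A file stored at a different depth would yield different values.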
Or equivalently, the list comprehension counterpart is:

dfs = [pd.read_csv(csv_path).assign(name=csv_path.parents[0].name,
                                    date=csv_path.parents[1].name)
       for csv_path in map(Path, full_path)]

df = pd.concat(dfs)

where we use assign to add the new columns to each DataFrame.
It is up to you to choose between the explicit for loop and the list comprehension.
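For anyone who wants to try this without the real data, below is a self-contained sketch that builds a small temporary tree in the same date/name/data.csv layout and runs the list comprehension variant on it. A few things here are assumptions and not part of the answer above: the sample rows are made up, `Path.rglob("*.csv")` replaces the `os.walk` loop so that non-CSV files are skipped, and `ignore_index=True` is passed to `pd.concat` so the rows are renumbered instead of repeating each file's own index.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up sample files, keyed by (date, name), for illustration only.
samples = {
    ("2020-12-05", "John_Smith"): "pr_id,quantity\n27,4\n",
    ("2020-12-06", "Sam_Walberg"): "pr_id,quantity\n8,3\n",
}

with tempfile.TemporaryDirectory() as tmp:
    rootdir = Path(tmp)

    # Create <rootdir>/<date>/<name>/data.csv for each sample.
    for (date, name), text in samples.items():
        folder = rootdir / date / name
        folder.mkdir(parents=True)
        (folder / "data.csv").write_text(text)

    # rglob yields Path objects directly; sorted() keeps the row order stable.
    dfs = [
        pd.read_csv(p).assign(name=p.parents[0].name, date=p.parents[1].name)
        for p in sorted(rootdir.rglob("*.csv"))
    ]

# Renumber the rows 0..n-1 across all files.
df = pd.concat(dfs, ignore_index=True)
print(df)
```

If the date column should be a real datetime and not a string, `pd.to_datetime(df["date"])` can be applied afterwards.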