I am trying to concat multiple CSVs that live in subfolders of my parent directory into a data frame, while also adding a new filename column.
/ParentDirectory │ │ ├───SubFolder 1 │ test1.csv │ ├───SubFolder 2 │ test2.csv │ ├───SubFolder 3 │ test3.csv │ test4.csv │ ├───SubFolder 4 │ test5.csv
I can do something like this to concat all the CSVs into a single data frame
import pandas as pd import glob files = glob.glob('/ParentDirectory/**/*.csv', recursive=True) df = pd.concat([pd.read_csv(fp) for fp in files], ignore_index=True)
But is there a way to also add the filename of each file as a column to the final data frame, or do I have to loop through each individual file first before concatenating the final data frame? Output should look like:
Col1 Col2 file_name 0 AAAA XYZ test1.csv 1 BBBB XYZ test1.csv 2 CCCC RST test1.csv 3 DDDD XYZ test2.csv 4 AAAA WXY test3.csv 5 CCCC RST test4.csv 6 DDDD XTZ test4.csv 7 AAAA TTT test4.csv 8 CCCC RRR test4.csv 9 AAAA QQQ test4.csv
Advertisement
Answer
you can assign the file_names on the fly:
from pathlib import Path df = pd.concat([pd.read_csv(fp).assign(file_name=Path(fp).name) for fp in files], ignore_index=True)
where pathlib.Path helps to extract the basename of the file from the path.