I am trying to concat multiple CSVs that live in subfolders of my parent directory into a data frame, while also adding a new filename column. I can do something like this to concat all the CSVs into a single data frame But is there a way to also add the filename of each file as a column to the

Concatenating CSVs into dataframe with filename column

I am trying to concat multiple CSVs that live in subfolders of my parent directory into a data frame, while also adding a new filename column.

/ParentDirectory
│  
│
├───SubFolder 1
│       test1.csv
│
├───SubFolder 2
│       test2.csv
│
├───SubFolder 3
│       test3.csv
│       test4.csv
│
├───SubFolder 4
│       test5.csv

I can do something like this to concat all the CSVs into a single data frame

import pandas as pd
import glob

files = glob.glob('/ParentDirectory/**/*.csv', recursive=True)
df = pd.concat([pd.read_csv(fp) for fp in files], ignore_index=True)

But is there a way to also add the filename of each file as a column to the final data frame, or do I have to loop through each individual file first before concatenating the final data frame? Output should look like:

   Col1  Col2    file_name
0  AAAA   XYZ    test1.csv
1  BBBB   XYZ    test1.csv
2  CCCC   RST    test1.csv
3  DDDD   XYZ    test2.csv
4  AAAA   WXY    test3.csv
5  CCCC   RST    test4.csv
6  DDDD   XTZ    test4.csv
7  AAAA   TTT    test4.csv
8  CCCC   RRR    test4.csv
9  AAAA   QQQ    test4.csv

Answer

you can assign the file_names on the fly:

from pathlib import Path

df = pd.concat([pd.read_csv(fp).assign(file_name=Path(fp).name)
                for fp in files], ignore_index=True)

where pathlib.Path helps to extract the basename of the file from the path.

Advertisement

Answer