This is my folder structure:
ORDERNO'S YEAR'S MONTH'S DATE'S CSVFILES 408------->2010---->01-->21--->1.CSV --->2.CSV --->22--->1.CSV --->2.CSV ----->02-->21--->1.CSV --->2.CSV --->22--->1.CSV --->2.CSV ... ------->201101-->21--->1.CSV --->2.CSV --->22--->1.CSV --->2.CSV ----->02-->21--->1.CSV --->2.CSV --->22--->1.CSV --->2.CSV ------->201201-->21--->1.CSV --->2.CSV --->22--->1.CSV --->2.CSV ----->02-->21--->1.CSV --->2.CSV --->22--->1.CSV --->2.CSV ... ------->2013--01-->21--->1.CSV --->2.CSV --->22--->1.CSV --->2.CSV ----->02-->21--->1.CSV --->2.CSV --->22--->1.CSV --->2.CSV ...
As their are many csv files inside a particular date. i want to combine all csv file inside each date folder with header of 1st file only into one single file and name is as orderno_year_month_date.csv. means every datefolder will have have only one csv named after their parent folders.So, i want clean command for this in linux which i can run outside of ORDERNO directory or execute same command using python script such that i don’t have to go again and again inside folder and manually execute the command.
**This same question is posted in askubuntu 3 months before but no answers there.
Expected structure should be like
ORDERNO'S YEAR'S MONTH'S DATE'S CSVFILES 408------->2010---->01-->21--->408_2010_01_21.CSV --->22--->408_2010_01_22.CSV ...
Previously i was going to each date folders of each month folder of each year folder of each order no. and use to run this command for single file for header.
awk '(NR == 1) || (FNR > 1)' *.csv > 4011_2020_07_16.csv (example)
Advertisement
Answer
I will be using this mock file structure (plotted using the tree
command, and saved under ~/test/
in my computer):
test └── 408 └── 2010 └── 01 ├── 21 │ ├── 1.csv │ └── 2.csv ├── 22 │ ├── 1.csv │ └── 2.csv └── 23 ├── 1.csv └── 2.csv
You can rename the files using Python, with the help of pathlib
and concatenate them using pandas
:
import pandas as pd from pathlib import Path def getfolders(files): return sorted(list(set([file.parent for file in files]))) def getpathproperty(folder, prop): properties = {"orderno": 3, "year": 2, "month": 1, "day": 0} for i in range(properties[prop]): folder = folder.parent return folder.stem path = Path("~/test").expanduser() allfiles = list(path.rglob("*.csv")) # Each file in allfiles is a Path object folders = getfolders(allfiles) for folder in folders: files = sorted(list(folder.glob("*.csv"))) df = pd.concat([pd.read_csv(file) for file in files]) # Get the values from the path to rename the files orderno = getpathproperty(folder, "orderno") year = getpathproperty(folder, "year") month = getpathproperty(folder, "month") day = getpathproperty(folder, "day") # Save the new CSV file df.to_csv(folder/f"{orderno}_{year}_{month}_{day}.csv", index=False) # Delete old files, commented for safety # for file in files: # file.unlink(missing_ok=True)
This yields:
test └── 408 └── 2010 └── 01 ├── 21 │ └── 408_2010_01_21.csv ├── 22 │ └── 408_2010_01_22.csv └── 23 └── 408_2010_01_23.csv