Linux command from python to execute inside subfolders

Tags: ,



This is my folder structure:

ORDERNO'S  YEAR'S  MONTH'S  DATE'S  CSVFILES
408------->2010---->01-->21--->1.CSV
                           --->2.CSV
                      --->22--->1.CSV
                           --->2.CSV
               ----->02-->21--->1.CSV
                           --->2.CSV
                      --->22--->1.CSV
                           --->2.CSV
               ...
    ------->201101-->21--->1.CSV
                           --->2.CSV
                      --->22--->1.CSV
                           --->2.CSV
               ----->02-->21--->1.CSV
                           --->2.CSV
                      --->22--->1.CSV
                           --->2.CSV
    ------->201201-->21--->1.CSV
                           --->2.CSV
                      --->22--->1.CSV
                           --->2.CSV
               ----->02-->21--->1.CSV
                           --->2.CSV
                      --->22--->1.CSV
                           --->2.CSV
               ...
    ------->2013--01-->21--->1.CSV
                           --->2.CSV
                      --->22--->1.CSV
                           --->2.CSV
               ----->02-->21--->1.CSV
                           --->2.CSV
                      --->22--->1.CSV
                           --->2.CSV
               ...

As their are many csv files inside a particular date. i want to combine all csv file inside each date folder with header of 1st file only into one single file and name is as orderno_year_month_date.csv. means every datefolder will have have only one csv named after their parent folders.So, i want clean command for this in linux which i can run outside of ORDERNO directory or execute same command using python script such that i don’t have to go again and again inside folder and manually execute the command.

**This same question is posted in askubuntu 3 months before but no answers there.

Expected structure should be like

ORDERNO'S  YEAR'S  MONTH'S  DATE'S  CSVFILES
408------->2010---->01-->21--->408_2010_01_21.CSV
                      --->22--->408_2010_01_22.CSV
                      ...

Previously i was going to each date folders of each month folder of each year folder of each order no. and use to run this command for single file for header.

awk '(NR == 1) || (FNR > 1)' *.csv > 4011_2020_07_16.csv  (example)

Answer

I will be using this mock file structure (plotted using the tree command, and saved under ~/test/ in my computer):

test
└── 408
    └── 2010
        └── 01
            ├── 21
            │   ├── 1.csv
            │   └── 2.csv
            ├── 22
            │   ├── 1.csv
            │   └── 2.csv
            └── 23
                ├── 1.csv
                └── 2.csv

You can rename the files using Python, with the help of pathlib and concatenate them using pandas:

import pandas as pd

from pathlib import Path

def getfolders(files):
    return sorted(list(set([file.parent for file in files])))

def getpathproperty(folder, prop):
    properties = {"orderno": 3, "year": 2, "month": 1, "day": 0}
    for i in range(properties[prop]):
        folder = folder.parent
    return folder.stem

path = Path("~/test").expanduser()
allfiles = list(path.rglob("*.csv")) # Each file in allfiles is a Path object

folders = getfolders(allfiles)

for folder in folders:
    files = sorted(list(folder.glob("*.csv")))
    df = pd.concat([pd.read_csv(file) for file in files])

    # Get the values from the path to rename the files
    orderno = getpathproperty(folder, "orderno")
    year = getpathproperty(folder, "year")
    month = getpathproperty(folder, "month")
    day = getpathproperty(folder, "day")

    # Save the new CSV file
    df.to_csv(folder/f"{orderno}_{year}_{month}_{day}.csv", index=False)

    # Delete old files, commented for safety
    # for file in files:
        # file.unlink(missing_ok=True)

This yields:

test
└── 408
    └── 2010
        └── 01
            ├── 21
            │   └── 408_2010_01_21.csv
            ├── 22
            │   └── 408_2010_01_22.csv
            └── 23
                └── 408_2010_01_23.csv


Source: stackoverflow