I am merging 3700 csv files with a total of 10 million rows. The files do not have a sequential naming but the date in which they were created(Descending) is sequential. I use the following code to merge them but do not know how to add pick them in that sequence.
import pandas as pd import glob path = r'C:UsersUser' # path all_files = glob.glob(path + "/*.csv") li = [] for filename in all_files: df = pd.read_csv(filename, index_col=None, header=0) li.append(df) frame = pd.concat(li, axis=0, ignore_index=True)
The following are names of files arranged in date modified(oldest to newest) from the os
V100(1005).csv V100(1).csv V100(778).csv V100(2).csv
The file names are not sequential because while downloading other files were deleted in between.
Advertisement
Answer
If I understand the question correctly, you need this before the loop (based on an answer here), which sorts list of files by creation/modification date in Python 3:
import os all_files = sorted(all_files, key=os.path.getmtime) #all_files = sorted(all_files, key=os.path.getctime) #works too
Test on MacOs:
import glob all_files = glob.glob('V*.csv') print(all_files) #['V100-778.csv', 'V100-778-.csv', 'V100-1005-1.csv', 'V100-1.csv', 'V100-2.csv']
- Sorting the files based on modification time:
import os all_files = sorted(all_files, key=os.path.getmtime) print(all_files) #['V100-778-.csv', 'V100-1005-1.csv', 'V100-1.csv', 'V100-778.csv', 'V100-2.csv']