I need to create a dataframe one 100+ csv file. My issues is that I have more than 100 CSVs with more than 55000 rows in each (as primary keys).
Now the difference between the csv files is that is all columns (maybe around 1200 columns) were broken into separate files. In other words, I need to do a FULL JOIN these csv files based on the PRIMARY row.
I got to the below code so far but need help to compelete it.
import os import glob import pandas as pd os.chdir(r'C:Usersuser1GCP_All') extension = 'csv' all_filenames = [i for i in glob.glob('*.{}'.format(extension))] #combine all files in the list combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames ]) #export to csv combined_csv.to_csv(r'C:Usersuser1GCP_Allcombined.csv', index=False, encoding='utf-8-sig')
similar example is this:
CSV_A: PK TOP_SALES TOP_PURCH 111111 A B 222222 C D 333333 E F CSV_B: PK AVG_SALE AVG_PURCH 111111 G H 222222 I J 333333 K L CSV_C: PK MIN_SALE MIN_PURCH 111111 M N 222222 O P 333333 Q R desired output: PK TOP_SALES TOP_PURCH AVG_SALE AVG_PURCH MIN_SALE MIN_PURCH 111111 A B G H M N 222222 O P I J O P 333333 Q R K L Q R
Thanks in advance! :)
Advertisement
Answer
Convert PK
column to index and then add axis=1
to concat
:
combined_csv = pd.concat([pd.read_csv(f, index_col=0) for f in all_filenames ], axis=1) print (combined_csv) TOP_SALES TOP_PURCH AVG_SALE AVG_PURCH MIN_SALE MIN_PURCH PK 111111 A B G H M N 222222 C D I J O P 333333 E F K L Q R