Creating a large dataframe out of 100 csv files (full join required)

I need to create a dataframe one 100+ csv file. My issues is that I have more than 100 CSVs with more than 55000 rows in each (as primary keys).

Now the difference between the csv files is that is all columns (maybe around 1200 columns) were broken into separate files. In other words, I need to do a FULL JOIN these csv files based on the PRIMARY row.

I got to the below code so far but need help to compelete it.

import os
import glob
import pandas as pd
os.chdir(r'C:Usersuser1GCP_All')

extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]

#combine all files in the list
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames ])

#export to csv
combined_csv.to_csv(r'C:Usersuser1GCP_Allcombined.csv', index=False, encoding='utf-8-sig')

JavaScript
​x
 
import os
import glob
import pandas as pd
os.chdir(r'C:Usersuser1GCP_All')
​
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
​
#combine all files in the list
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames ])
​
#export to csv
combined_csv.to_csv(r'C:Usersuser1GCP_Allcombined.csv', index=False, encoding='utf-8-sig')
​

similar example is this:

CSV_A:
PK        TOP_SALES  TOP_PURCH
111111    A          B
222222    C          D
333333    E          F

CSV_B:
PK        AVG_SALE   AVG_PURCH
111111    G          H
222222    I          J
333333    K          L

CSV_C:
PK        MIN_SALE   MIN_PURCH
111111    M          N
222222    O          P
333333    Q          R


desired output:
PK    TOP_SALES  TOP_PURCH  AVG_SALE   AVG_PURCH  MIN_SALE  MIN_PURCH
111111    A          B      G          H          M         N
222222    O          P      I          J          O         P
333333    Q          R      K          L          Q         R

JavaScript
 
CSV_A:
PK        TOP_SALES  TOP_PURCH
111111    A          B
222222    C          D
333333    E          F
​
CSV_B:
PK        AVG_SALE   AVG_PURCH
111111    G          H
222222    I          J
333333    K          L
​
CSV_C:
PK        MIN_SALE   MIN_PURCH
111111    M          N
222222    O          P
333333    Q          R
​
​
desired output:
PK    TOP_SALES  TOP_PURCH  AVG_SALE   AVG_PURCH  MIN_SALE  MIN_PURCH
111111    A          B      G          H          M         N
222222    O          P      I          J          O         P
333333    Q          R      K          L          Q         R
​

Thanks in advance! :)

Answer

Convert PK column to index and then add axis=1 to concat:

combined_csv = pd.concat([pd.read_csv(f, index_col=0) for f in all_filenames ], axis=1)

print (combined_csv)
       TOP_SALES TOP_PURCH AVG_SALE AVG_PURCH MIN_SALE MIN_PURCH
PK                                                              
111111         A         B        G         H        M         N
222222         C         D        I         J        O         P
333333         E         F        K         L        Q         R

JavaScript
 
combined_csv = pd.concat([pd.read_csv(f, index_col=0) for f in all_filenames ], axis=1)
​
print (combined_csv)
       TOP_SALES TOP_PURCH AVG_SALE AVG_PURCH MIN_SALE MIN_PURCH
PK                                                              
111111         A         B        G         H        M         N
222222         C         D        I         J        O         P
333333         E         F        K         L        Q         R
​

Advertisement

Answer