I need to create a dataframe one 100+ csv file. My issues is that I have more than 100 CSVs with more than 55000 rows in each (as primary keys).
Now the difference between the csv files is that is all columns (maybe around 1200 columns) were broken into separate files. In other words, I need to do a FULL JOIN these csv files based on the PRIMARY row.
I got to the below code so far but need help to compelete it.
JavaScript
x
14
14
1
import os
2
import glob
3
import pandas as pd
4
os.chdir(r'C:Usersuser1GCP_All')
5
6
extension = 'csv'
7
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
8
9
#combine all files in the list
10
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames ])
11
12
#export to csv
13
combined_csv.to_csv(r'C:Usersuser1GCP_Allcombined.csv', index=False, encoding='utf-8-sig')
14
similar example is this:
JavaScript
1
25
25
1
CSV_A:
2
PK TOP_SALES TOP_PURCH
3
111111 A B
4
222222 C D
5
333333 E F
6
7
CSV_B:
8
PK AVG_SALE AVG_PURCH
9
111111 G H
10
222222 I J
11
333333 K L
12
13
CSV_C:
14
PK MIN_SALE MIN_PURCH
15
111111 M N
16
222222 O P
17
333333 Q R
18
19
20
desired output:
21
PK TOP_SALES TOP_PURCH AVG_SALE AVG_PURCH MIN_SALE MIN_PURCH
22
111111 A B G H M N
23
222222 O P I J O P
24
333333 Q R K L Q R
25
Thanks in advance! :)
Advertisement
Answer
Convert PK
column to index and then add axis=1
to concat
:
JavaScript
1
9
1
combined_csv = pd.concat([pd.read_csv(f, index_col=0) for f in all_filenames ], axis=1)
2
3
print (combined_csv)
4
TOP_SALES TOP_PURCH AVG_SALE AVG_PURCH MIN_SALE MIN_PURCH
5
PK
6
111111 A B G H M N
7
222222 C D I J O P
8
333333 E F K L Q R
9