I’m using Python to merge 4 headerless CSV’s into one output file.
Each CSV has a unique number in the first column as shown in the 2 example CSV files below:
1.csv
1,Ringo,Beatles 2,John,Beatles 3,Mick,Rolling Stones 4,Keith,Rolling Stones 5,Rivers,Weezer
2.csv
1,TSLA,XNAS,1.0,USD 2,AAPL,XNAS,1.0,USD 3,SPY,ARCX,1.0,USD 4,BP LN,XLON,1.0,GBP 5,ESUD,XCME,1.0,USD
I have generated the output from these CSV’s using the following code.
import os import csv filenames = ['1.csv', '2.csv', '3.csv', '4.csv'] with open('output_file', 'w') as outfile: for fname in filenames: with open(fname) as infile: outfile.write(infile.read())
This works fine and outputs a file. The data ends up as follows
1,Ringo,Beatles 2,John,Beatles 3,Mick,Rolling Stones 4,Keith,Rolling Stones 5,Rivers,Weezer 1,TSLA,XNAS,1.0,USD 2,AAPL,XNAS,1.0,USD 3,SPY,ARCX,1.0,USD 4,BP LN,XLON,1.0,GBP 5,ESUD,XCME,1.0,USD1,5,-600,1043.22,-625932.00 3,5,200,304.89,60978.00 5,4,6,3015.25,904575.005,4,-1,2,3009.50 5,4,1,1,3011.75 4,3,1,1000,308.37 4,3,1,200,309.15 1,3,1,100,309.0125
Is there a way to use the first column number as a ‘unique’ number to link the data, such that it takes the three results that start with ‘1’, and adds them to the same row?
For example, these have the same ‘unique’ number ‘1’:
1,Ringo,Beatles 1,TSLA,XNAS,1.0,USD 1,3,1,100,309.0125
The resulting row would be:
(1) Ringo,Beatles,TSLA,XNAS,1.0,USD,3,1,100,309.0125
Advertisement
Answer
You could use dictionary to put all data as
{ 1: [1, "Ringo", "Beatles", 1, "TSLA", "XNAS", 1.0, "USD", 1, 3, 1, 100, 309.0125], 2: [2, ...], 3: [3, ...], ... }
and later write all in new file.
So first create empty dictionary. ie. new_rows = {}
Next get row from file, get ID and check if it exists in dictionary. If not exists then create it with list with has only ID new_rows[key] = [key]
Next you can add other values from row to this list new_rows[key] += values
Repeate it for all rows in all files.
And later you can use this dictionary to write all rows to new file.
I use io
only to simulate files in memory but you should use open()
text1 = '''1,Ringo,Beatles 2,John,Beatles 3,Mick,Rolling Stones 4,Keith,Rolling Stones 5,Rivers,Weezer''' text2 = '''1,TSLA,XNAS,1.0,USD 2,AAPL,XNAS,1.0,USD 3,SPY,ARCX,1.0,USD 4,BP LN,XLON,1.0,GBP 5,ESUD,XCME,1.0,USD''' import os import csv import io new_rows = {} # dict filenames = [text1, text2] #filenames = ['1.csv', '2.csv', '3.csv', '4.csv'] for fname in filenames: #with open(fname) as infile: with io.StringIO(fname) as infile: reader = csv.reader(infile) for row in reader: key = row[0] # ID values = row[1:] # rest # create key if not exists if key not in new_rows: new_rows[key] = [key] new_rows[key] += values # add two lists # OR #if key not in new_rows: # new_rows[key] = values # only for first file #else: # new_rows[key] += values # for other file - add two lists # --- write it --- with open('output_file', 'w') as outfile: writer = csv.writer(outfile) all_rows = new_rows.values() writer.writerows(all_rows) # `writerows` with `s` to write list with many rows.
BTW:
In older Python dict
didn’t have to keep order so it could write new rows in different order – and it would need to sort list of rows before saving or it would need to use collections.OrderedDict()