How to merge CSV files such that rows with a unique identifier are added into the same row of the output?

I’m using Python to merge four headerless CSVs into one output file.

Each CSV has a unique number in the first column, as shown in the two example CSV files below:

1.csv

1,Ringo,Beatles
2,John,Beatles
3,Mick,Rolling Stones
4,Keith,Rolling Stones
5,Rivers,Weezer

2.csv

1,TSLA,XNAS,1.0,USD
2,AAPL,XNAS,1.0,USD
3,SPY,ARCX,1.0,USD
4,BP LN,XLON,1.0,GBP
5,ESUD,XCME,1.0,USD

I have generated the output from these CSVs using the following code.

import os
import csv

filenames = ['1.csv', '2.csv', '3.csv', '4.csv']
with open('output_file', 'w') as outfile:
    for fname in filenames:
        with open(fname) as infile:
            outfile.write(infile.read())

This works fine and outputs a file. The data ends up as follows:

1,Ringo,Beatles
2,John,Beatles
3,Mick,Rolling Stones
4,Keith,Rolling Stones
5,Rivers,Weezer
1,TSLA,XNAS,1.0,USD
2,AAPL,XNAS,1.0,USD
3,SPY,ARCX,1.0,USD
4,BP LN,XLON,1.0,GBP
5,ESUD,XCME,1.0,USD1,5,-600,1043.22,-625932.00
3,5,200,304.89,60978.00
5,4,6,3015.25,904575.005,4,-1,2,3009.50
5,4,1,1,3011.75
4,3,1,1000,308.37
4,3,1,200,309.15
1,3,1,100,309.0125

Is there a way to use the first column number as a ‘unique’ number to link the data, such that it takes the three results that start with ‘1’, and adds them to the same row?

For example, these have the same ‘unique’ number ‘1’:

1,Ringo,Beatles
1,TSLA,XNAS,1.0,USD
1,3,1,100,309.0125

The resulting row would be:

(1) Ringo,Beatles,TSLA,XNAS,1.0,USD,3,1,100,309.0125


Answer

You could use a dictionary to gather all the data as

{
1: [1, "Ringo", "Beatles", "TSLA", "XNAS", 1.0, "USD", 3, 1, 100, 309.0125],
2: [2, ...],
3: [3, ...],
...
}

and later write it all to a new file.

So first create an empty dictionary, i.e. new_rows = {}

Next, read a row from the file, get its ID and check whether it already exists in the dictionary. If it doesn’t exist, create it with a list that holds only the ID: new_rows[key] = [key]

Then you can add the other values from the row to this list: new_rows[key] += values

Repeat this for all rows in all files.

Afterwards you can use this dictionary to write all the rows to a new file.


I use io only to simulate files in memory, but you should use open().

text1 = '''1,Ringo,Beatles
2,John,Beatles
3,Mick,Rolling Stones
4,Keith,Rolling Stones
5,Rivers,Weezer'''

text2 = '''1,TSLA,XNAS,1.0,USD
2,AAPL,XNAS,1.0,USD
3,SPY,ARCX,1.0,USD
4,BP LN,XLON,1.0,GBP
5,ESUD,XCME,1.0,USD'''

import csv
import io

new_rows = {} # dict

filenames = [text1, text2]
#filenames = ['1.csv', '2.csv', '3.csv', '4.csv']

for fname in filenames:
    #with open(fname) as infile:
    with io.StringIO(fname) as infile:

        reader = csv.reader(infile)
        for row in reader:

            key = row[0]      # ID
            values = row[1:]  # rest
            
            # create key if not exists
            if key not in new_rows:
                new_rows[key] = [key]
                
            new_rows[key] += values  # add two lists
            
            # OR

            #if key not in new_rows:
            #    new_rows[key] = values    # only for the first file
            #else:
            #    new_rows[key] += values   # for other files - add the two lists

# --- write it  ---

with open('output_file', 'w', newline='') as outfile:  # newline='' avoids blank lines on Windows
    writer = csv.writer(outfile)
    all_rows = new_rows.values()
    writer.writerows(all_rows)   # `writerows` (with `s`) writes many rows at once
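
With real files on disk, the same logic uses open() instead of io.StringIO, as the commented-out lines above suggest. A minimal sketch, assuming the four CSV files from the question exist in the working directory:

import csv

new_rows = {}

filenames = ['1.csv', '2.csv', '3.csv', '4.csv']

for fname in filenames:
    with open(fname, newline='') as infile:
        reader = csv.reader(infile)
        for row in reader:
            key = row[0]      # ID from the first column
            values = row[1:]  # rest of the row
            if key not in new_rows:
                new_rows[key] = [key]
            new_rows[key] += values

with open('output_file', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerows(new_rows.values())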

BTW:

In older Python versions a dict did not have to keep insertion order, so it could write the new rows in a different order. In that case you would need to sort the list of rows before saving, or use collections.OrderedDict().
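
If row order matters, a minimal sketch of the sorting approach, reusing the new_rows dictionary built above and assuming the IDs are plain integers as in the example data:

import csv

# sort the merged rows numerically by their ID (the first element of each list)
sorted_rows = sorted(new_rows.values(), key=lambda row: int(row[0]))

with open('output_file', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerows(sorted_rows)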
