I wanted to inquire on how I could merge 2 csv files so that I can generate queries. Note that I am not allowed to use the “pandas” library.
As an example I have these 2 csv:
data.csv:
cod_pers, cod_enti, fec_venc 2317422,208,04/12/2022 2392115,210,04/02/2022 2086638,211,31/03/2022 2086638,212,03/13/2022
enti.csv:
cod_enti,cod_market 208,40 209,50 210,16 211,40 212,50
And what I’m looking for is to be able to join them through cod_enti
and thus be able to evaluate the number of cod_pers
that they have as cod_mercado = 40
in the last 15 days.
For this I understand that I can generate the reading of the csv files as follows:
import csv import numpy as np from time import strftime from datetime import datetime, date, time, timedelta from dateutil.relativedelta import relativedelta #Read the CSV file str2date = lambda x: datetime.strptime(x, '%d/%m/%Y') data_datos = np.genfromtxt('datos.csv', delimiter=',', dtype=None, names=True, converters={'fec_venc':str2date}, encoding="UTF-8") data_enti = np.genfromtxt('enti.csv', delimiter=',', dtype=None, names=True, encoding="UTF-8")
And later to be able to search by days with a method similar to this:
#definition of days today = datetime.now() yesterday = today - timedelta(days=15) # Generate array of dates values_on_date =[] calc_date = data_datos['fec_venc'] for date_obt in calc_date: if (yesterday <= date_obt): values_on_date.append(date_obt) tot_doc_mor_15d = len(values_on_date) print(tot_doc_mor_15d)
But for this I must first join the 2 csv files to be able to generate the query.
I look forward to your comments and any kind of help, it is appreciated. Thank you!! :D
MODIFICATION:
I have added the following lines to my code:
from numpy.lib import recfunctions merged_array = recfunctions.join_by('cod_enti', data_datos, data_enti)
And it works correctly, but when I want to enter csv with more data, it gives me the following error:
TypeError: invalid type promotion with structured datatype(s).
Advertisement
Answer
You can do this in pretty straight-forward approach with just the CSV module.
I create a map of each row in data.csv to its code_enti
value. Then, for every row in enti.csv that has a matching code_enti, I update the row in the map:
import csv import pprint # Create a mapping of a data row to its cod_enti, e.g.: # {208: {cod_pers:2317422, cod_enti:208, fec_venc:04/12/2022}, ...} cod_enti_row_map = {} with open("data.csv", newline="") as f: reader = csv.DictReader(f, skipinitialspace=True) # because your header row has leading spaces for row in reader: cod_enti = row["cod_enti"] cod_enti_row_map[cod_enti] = row print(f"Map before join") pprint.pprint(cod_enti_row_map, width=100, sort_dicts=False) # Now, update each row in the map with cod_market for the key, cod_enti with open("enti.csv", newline="") as f: reader = csv.DictReader(f) for row in reader: cod_enti = row["cod_enti"] # skip cod_enti in enti.csv that is not in data.csv, like 209 if cod_enti not in cod_enti_row_map: continue cod_enti_row_map[cod_enti].update(row) print(f"Map after join") pprint.pprint(cod_enti_row_map, width=100, sort_dicts=False)
Here’s what I get when I run that:
Map before join {'208': {'cod_pers': '2317422', 'cod_enti': '208', 'fec_venc': '04/12/2022'}, '210': {'cod_pers': '2392115', 'cod_enti': '210', 'fec_venc': '04/02/2022'}, '211': {'cod_pers': '2086638', 'cod_enti': '211', 'fec_venc': '31/03/2022'}, '212': {'cod_pers': '2086638', 'cod_enti': '212', 'fec_venc': '03/13/2022'}} Map after join {'208': {'cod_pers': '2317422', 'cod_enti': '208', 'fec_venc': '04/12/2022', 'cod_market': '40'}, '210': {'cod_pers': '2392115', 'cod_enti': '210', 'fec_venc': '04/02/2022', 'cod_market': '16'}, '211': {'cod_pers': '2086638', 'cod_enti': '211', 'fec_venc': '31/03/2022', 'cod_market': '40'}, '212': {'cod_pers': '2086638', 'cod_enti': '212', 'fec_venc': '03/13/2022', 'cod_market': '50'}}
From there, you can extract the rows into a normal list and do all your filtering by key-value, or however else.