input_dict:
{'basename_AM1.csv': ['AM1286', 'AM1287', 'AM1288']}
I have large CSV files in the format below:
basename_AM1.csv
          ID1     ID2  Score
    0  AM1287  AM1286  97.55
    1  AM1288  AM1286  78.91
    2  AM1289  AM1286  95.38
    3  AM1290  AM1286  94.83
    4  AM1291  AM1286  82.91
Now I need to create a similarity dict like the one below for the given input_dict by searching/filtering the CSV files:
{'AM1286': {'AM1286': 0, 'AM1287': 97.55, 'AM1288': 78.91}, 'AM1287': {'AM1286': 97.55, 'AM1287': 100.0, 'AM1288': 78.91}, 'AM1288': {'AM1286': 78.91, 'AM1287': 78.91, 'AM1288': 100.0}}
I have come up with the logic below, but for an input_dict of 100 samples it takes too long. Can someone please suggest a faster, more optimized way to achieve this?
    import os
    from collections import defaultdict

    import pandas as pd

    final_dict = defaultdict(dict)
    for key, value in input_dict.items():
        # Check that the file exists before trying to read it
        if not os.path.exists('csv_file_path'):
            continue
        base_name_df = pd.read_csv('csv_file_path')
        base_name_df.columns = "ID1", "ID2", "Score"
        for id1 in range(len(value)):
            for id2 in range(len(value)):
                scan_df = base_name_df[(base_name_df['ID1'] == value[id1]) & (base_name_df['ID2'] == value[id2])]
                if not scan_df.empty:
                    # Keep the highest score when a pair appears more than once
                    scan_df = scan_df.groupby(['ID1', 'ID2'], as_index=False)['Score'].max()
                    final_dict[value[id1]][value[id2]] = scan_df.iloc[0]['Score']
Answer
IIUC, you can use:
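    import pandas as pd

    input_dict = {'basename_AM1.csv': ['AM1286', 'AM1287', 'AM1288']}

    for fname, lst in input_dict.items():
        # Whitespace-separated file; the separator is the regex \s+
        df = pd.read_csv(fname, sep=r'\s+', names=['ID1', 'ID2', 'score'])
        # Reshape to an ID1 x ID2 matrix restricted to the requested IDs
        df2 = (df.pivot(index='ID1', columns='ID2', values='score')
                 .reindex(index=lst, columns=lst))
        # Mirror the scores across the diagonal; missing pairs become 0
        df2 = df2.combine_first(df2.T).fillna(0)

        # print for example
        print(df2.to_dict())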
If you want 100 on the diagonal:
    import numpy as np

    # Copy the values so the array is writable, then set the diagonal
    a = df2.to_numpy(copy=True)
    np.fill_diagonal(a, 100)
    df2 = pd.DataFrame(a, index=lst, columns=lst)
Output of the first snippet (before filling the diagonal):
{'AM1286': {'AM1286': 0.0, 'AM1287': 97.55, 'AM1288': 78.91}, 'AM1287': {'AM1286': 97.55, 'AM1287': 0.0, 'AM1288': 0.0}, 'AM1288': {'AM1286': 78.91, 'AM1287': 0.0, 'AM1288': 0.0}}
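The loop in the question takes the maximum score per pair, which suggests a pair may appear more than once, and in that case `pivot` raises an error on the duplicate entries. Below is a minimal sketch of one way to handle that, swapping in `pivot_table` with `aggfunc="max"`. The function name `build_similarity`, the `diagonal` parameter, and the in-memory sample data are all hypothetical, chosen only for illustration; the column names follow the question.

```python
import io

import numpy as np
import pandas as pd


def build_similarity(df, ids, diagonal=100.0):
    """Return a nested dict of symmetric scores restricted to `ids` (hypothetical helper)."""
    # Keep only rows where both IDs are among the requested samples.
    sub = df[df["ID1"].isin(ids) & df["ID2"].isin(ids)]
    # pivot_table tolerates duplicate (ID1, ID2) pairs; keep the highest score.
    mat = sub.pivot_table(index="ID1", columns="ID2", values="Score", aggfunc="max")
    mat = mat.reindex(index=ids, columns=ids)
    # Mirror scores across the diagonal, restore the requested order,
    # then mark pairs that never appear in the file as 0.
    mat = mat.combine_first(mat.T).reindex(index=ids, columns=ids).fillna(0.0)
    values = mat.to_numpy(copy=True)
    np.fill_diagonal(values, diagonal)
    return pd.DataFrame(values, index=ids, columns=ids).to_dict()


# Assumed stand-in for basename_AM1.csv: whitespace-separated, no header row,
# with one duplicated pair to exercise the max aggregation.
sample = io.StringIO(
    "AM1287 AM1286 97.55\n"
    "AM1288 AM1286 78.91\n"
    "AM1289 AM1286 95.38\n"
    "AM1288 AM1286 70.00\n"
)
df = pd.read_csv(sample, sep=r"\s+", names=["ID1", "ID2", "Score"])
result = build_similarity(df, ["AM1286", "AM1287", "AM1288"])
print(result)
```

Filtering with `isin` before reshaping keeps the pivoted matrix small even when the CSV is large, and each file is read once instead of being scanned for every pair of IDs.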