I am having trouble with a for loop inside a function. I am calculating cosine distances for a list of word vectors. For each vector, I calculate the cosine distance and then append it as a new column to a pandas dataframe. The problem is that there are several models, so I am comparing a word vector from model 1 with the vector for the same word in every other model.
This means that some words are not present in all models. In that case, I catch the KeyError and let the loop move on without throwing an error. When this happens, I also want a 0 value to be added to the pandas dataframe. This is causing duplicated indexes, and I am stuck on how to move forward from here. The code is as follows:
from scipy.spatial.distance import cosine
import pandas as pd

def cosines(model1, model2, model3, model4, model5, model6, model7, words):
    df = pd.DataFrame()
    model = [model2, model3, model4, model5, model6, model7]
    for i in model:
        for j in words:
            try:
                cos = 1 - cosine(model1.wv[j], i.wv[j])
                print(f'cosine for model1 vs {i.name}: {cos}')
                # One-cell frame: a single row (model) and a single column (word)
                tempdf = pd.DataFrame([cos], columns=[f'{j}'], index=[f'{i.name}'])
                #print(tempdf)
                df = pd.concat([df, tempdf], axis=0)
            except KeyError:
                print(f'word not present at {i.name}')
                ke_tempdf = pd.DataFrame([0], columns=[f'{j}'], index=[f'{i.name}'])
                df = pd.concat([df, ke_tempdf], axis=0)
    return df
The function works. However, for each KeyError, instead of adding a 0 to the existing row for that model, it creates a new, duplicated row with the value 0. With two words this doubles the number of rows in the dataframe, and the ultimate aim is to use a list of many words. The resulting dataframe is shown below:
           word1     word2
model1  0.000000       NaN
model1       NaN  0.761573
model2  0.000000       NaN
model2       NaN  0.000000
model3  0.000000       NaN
model3       NaN  0.000000
model4  0.245140       NaN
model4       NaN  0.680306
model5  0.090268       NaN
model5       NaN  0.662234
model6  0.000000       NaN
model6       NaN  0.709828
As you can see, for every word that isn't present, instead of adding a 0 to the existing model row (in place of the NaN), it adds a new row with the number 0. The first row should read: model1, 0, 0.76, and so on for the other models, instead of the duplicated rows. Any help is much appreciated, thank you!
Answer
I can’t quite test it without your model objects, but I think this would address your issue:
from scipy.spatial.distance import cosine
import pandas as pd

def cosines(model1, model2, model3, model4, model5, model6, model7, words):
    df = pd.DataFrame()
    model = [model2, model3, model4, model5, model6, model7]
    for i in model:
        # Collect one value per word for this model, then build a single row
        cos_dict = {}
        for j in words:
            try:
                cos_dict[j] = 1 - cosine(model1.wv[j], i.wv[j])
                print(f'cosine for model1 vs {i.name}: {cos_dict[j]}')
            except KeyError:
                print(f'word not present at {i.name}')
                cos_dict[j] = 0
        # The values are all scalars, so pandas needs an explicit index for the row
        tempdf = pd.DataFrame(cos_dict, index=[f'{i.name}'])
        df = pd.concat([df, tempdf])
    return df
It collects the values for all the words of each model in a dictionary in the inner loop, and only appends them to the full dataframe once per model in the outer loop, so each model ends up with exactly one row.
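Since the real model objects aren't available here, the following is only a minimal sketch of the same idea that can be run on its own. `FakeModel` and `cosine_similarity` are hypothetical stand-ins (a plain dict of word vectors and a NumPy calculation in place of `1 - scipy.spatial.distance.cosine`), and the function takes the other models as a list rather than as separate arguments. It gathers one dictionary per model and builds the dataframe in a single step at the end, which avoids calling `pd.concat` inside the loop.

```python
import numpy as np
import pandas as pd


class FakeModel:
    """Hypothetical stand-in for a trained model: a name plus a word -> vector lookup."""

    def __init__(self, name, vectors):
        self.name = name
        # A plain dict raises KeyError for missing words, like the lookup in the question
        self.wv = {word: np.asarray(vec, dtype=float) for word, vec in vectors.items()}


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors (same quantity as 1 - cosine distance)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def cosines(reference, others, words):
    rows = {}  # model name -> {word: similarity}
    for model in others:
        row = {}
        for word in words:
            try:
                row[word] = cosine_similarity(reference.wv[word], model.wv[word])
            except KeyError:
                # Word missing from one of the two models: record 0 in the same row
                row[word] = 0.0
        rows[model.name] = row
    # One row per model, one column per word
    return pd.DataFrame.from_dict(rows, orient="index")


m1 = FakeModel("m1", {"a": [1, 0], "b": [0, 1]})
m2 = FakeModel("m2", {"a": [1, 0]})               # "b" is missing here
m3 = FakeModel("m3", {"a": [0, 1], "b": [0, 2]})
df = cosines(m1, [m2, m3], ["a", "b"])
print(df)
```

Because each model contributes exactly one dictionary, the index of the result has no duplicates, and a missing word shows up as a 0 in that model's row rather than as an extra row.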