Duplicated rows when appending to a pandas DataFrame inside a for loop

I am having trouble with a for loop inside a function. I am calculating cosine similarities (1 minus the cosine distance) for a list of word vectors. For each word, I calculate the value and then append it to a pandas dataframe as a new column. There are several models, so I am comparing each word's vector from model 1 with the vector for the same word in every other model.

Some words are not present in every model. In that case, I catch the KeyError so the loop can move on without failing, and I ask for a 0 to be added to the pandas dataframe instead. This is causing duplicated index labels, and I am stuck on how to move forward from here. The code is as follows:

from scipy.spatial.distance import cosine
import pandas as pd

def cosines(model1, model2, model3, model4, model5, model6, model7, words):
    df = pd.DataFrame()

    model = [model2, model3, model4, model5, model6, model7]

    for i in model:
        for j in words:
            try:
                cos = 1 - cosine(model1.wv[j], i.wv[j])
                print(f'cosine for model1 vs {i.name}: {cos}')
                tempdf = pd.DataFrame([cos], columns=[f'{j}'], index=[f'{i.name}'])
                #print(tempdf)
                df = pd.concat([df, tempdf], axis=0)
            except KeyError:
                print(f'word not present at {i.name}')
                ke_tempdf = pd.DataFrame([0], columns=[f'{j}'], index=[f'{i.name}'])
                df = pd.concat([df, ke_tempdf], axis=0)
                pass
    return df

The function works; however, for each KeyError, instead of adding a 0 to the existing row, it creates a new duplicated row with the value 0. With two words this doubles the number of rows in the dataframe, and the ultimate aim is to pass a list of many words. The resulting dataframe is shown below:

        word1       word2
model1  0.000000    NaN
model1  NaN         0.761573
model2  0.000000    NaN
model2  NaN         0.000000
model3  0.000000    NaN
model3  NaN         0.000000
model4  0.245140    NaN
model4  NaN         0.680306
model5  0.090268    NaN
model5  NaN         0.662234
model6  0.000000    NaN
model6  NaN         0.709828

As you can see, for every word that isn't present, instead of adding a 0 to the existing model row (where the NaN is), it adds a new row with the value 0. Each model should have a single row, e.g. model1, 0, 0.76, instead of the duplicated rows. Any help is much appreciated, thank you!

Answer

I can’t quite test it without your model objects, but I think this would address your issue:

from scipy.spatial.distance import cosine
import pandas as pd

def cosines(model1, model2, model3, model4, model5, model6, model7, words):
    df = pd.DataFrame()

    model = [model2, model3, model4, model5, model6, model7]

    for i in model:
        cos_dict = {}
        for j in words:
            try:
                cos_dict[j] = 1 - cosine(model1.wv[j], i.wv[j])
                print(f'cosine for model1 vs {i.name}: {cos_dict[j]}')
            except KeyError:
                print(f'word not present at {i.name}')
                cos_dict[j] = 0
                
        # All values are scalars, so an explicit one-row index is required
        tempdf = pd.DataFrame(cos_dict, index=[i.name])
        
        df = pd.concat([df, tempdf])
            
    return df

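It collects the values for all words for each model in a dictionary in the inner loop, and only adds them to the full dataframe once per model in the outer loop. Because pd.concat along axis=0 always appends new rows and never merges rows that share an index label, building one complete row per model is what avoids the duplicates.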
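Since the original model objects are not available, here is a minimal sketch of the same idea that can be run on its own. The `make_model` helper and the toy vectors are hypothetical stand-ins (a plain dict as `.wv`, so a missing word raises KeyError as in the question), and the signature is simplified to a reference model plus a list of other models. It builds one dictionary per model and creates the dataframe once at the end, which avoids calling pd.concat in the loop altogether.

```python
from types import SimpleNamespace

import numpy as np
import pandas as pd
from scipy.spatial.distance import cosine


def make_model(name, vectors):
    # Hypothetical stand-in for a trained model: `.wv` is a plain dict,
    # so looking up a missing word raises KeyError.
    wv = {word: np.array(vec, dtype=float) for word, vec in vectors.items()}
    return SimpleNamespace(name=name, wv=wv)


def cosines(reference, others, words):
    rows, names = [], []
    for model in others:
        row = {}
        for word in words:
            try:
                row[word] = 1 - cosine(reference.wv[word], model.wv[word])
            except KeyError:
                # Word missing from one of the models: use 0 as in the question
                row[word] = 0.0
        rows.append(row)
        names.append(model.name)
    # One dict per model becomes one row; the index is set in a single step
    return pd.DataFrame(rows, index=names)


# Toy data, invented for illustration only
model1 = make_model('model1', {'word1': [1, 0], 'word2': [0, 1]})
model2 = make_model('model2', {'word1': [1, 0]})  # 'word2' is missing
model3 = make_model('model3', {'word1': [0, 1], 'word2': [0, 2]})

result = cosines(model1, [model2, model3], ['word1', 'word2'])
print(result)
```

With this layout each model name appears exactly once in the index, and a missing word shows up as 0 in that model's row rather than as an extra row.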

User contributions licensed under: CC BY-SA