I have a data frame of language similarity. Here is a small snippet that’s been edited for simplicity:
0 1 2 0 English Spanish 0.50 1 English Russian 0.15
I would like to create a correlation dataframe such as:
English Spanish Russian English 1 0.5 0.15 Spanish 0.5 1 - Russian 0.15 - 1
To create the first dataframe, I ran:
pairing_list = [["English","Spanish",0.5],["English","Russian",0.15]] df = pd.DataFrame(pairing_list)
I have tried:
df.corr()
Which returns:
2 2 1.0
I have looked at other similar questions but it seems that the data for use in .corr() is by itself (ie: my data here is already a correlation between the two columns, whereas the examples I have seen are not yet such related).
To clarify: the data presented is already the similarity between the two languages, and thus is not some value associated with one language alone; it is for the pair listed in the columns.
How could I use Python / Pandas to do this?
Advertisement
Answer
Use crosstab
to create the all language combinations and fill with the existing data:
lg = pd.concat([df[0], df[1]]).unique() # ['English', 'Spanish', 'Russian'] cx = pd.crosstab(lg, lg) cx.update(df.set_index([0, 1]).squeeze().unstack()) cx.update(df.set_index([0, 1]).squeeze().unstack().T)
>>> cx col_0 English Russian Spanish row_0 English 1.00 0.15 0.5 Russian 0.15 1.00 0.0 Spanish 0.50 0.00 1.0