How to Create a Correlation Dataframe from already related data

I have a data frame of language similarity. Here is a small snippet that’s been edited for simplicity:

    0       1       2
0   English Spanish 0.50
1   English Russian 0.15

I would like to create a correlation dataframe such as:

        English Spanish Russian
English 1       0.5     0.15
Spanish 0.5     1       -
Russian 0.15    -       1

To create the first dataframe, I ran:

import pandas as pd

pairing_list = [["English", "Spanish", 0.5], ["English", "Russian", 0.15]]
df = pd.DataFrame(pairing_list)

I have tried:

df.corr()

Which returns:

        2
2       1.0

I have looked at other similar questions, but in those the data passed to .corr() is raw, not-yet-related values from which the correlation is computed, whereas my data is already a correlation between the two columns.

To clarify: the data presented is already the similarity between the two languages, and thus is not some value associated with one language alone; it is for the pair listed in the columns.

How could I use Python / Pandas to do this?

Answer

Use crosstab to create all the language combinations, then fill in the existing data. Pairs with no data keep the 0 that crosstab puts there:

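lg = pd.concat([df[0], df[1]]).unique()  # ['English', 'Spanish', 'Russian']
cx = pd.crosstab(lg, lg).astype(float)  # identity matrix; float so update() can store the similarities

pairs = df.set_index([0, 1]).squeeze().unstack()
cx.update(pairs)    # fill in the (row, column) pairs that exist in df
cx.update(pairs.T)  # mirror them so the matrix is symmetric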
>>> cx
col_0    English  Russian  Spanish
row_0
English     1.00     0.15      0.5
Russian     0.15     1.00      0.0
Spanish     0.50     0.00      1.0
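
The question's desired output marks the missing Spanish/Russian pair with "-" rather than 0. As a rough alternative sketch (not part of the answer above), one could build the symmetric matrix by hand with NumPy and leave unknown pairs as NaN. The names `langs`, `idx`, `mat` and `sim` are made up for this illustration, and the code assumes each pair appears at most once in df:

```python
import numpy as np
import pandas as pd

pairing_list = [["English", "Spanish", 0.5], ["English", "Russian", 0.15]]
df = pd.DataFrame(pairing_list)

# All languages, in order of first appearance across both name columns.
langs = pd.unique(df[[0, 1]].to_numpy().ravel())
idx = {lang: i for i, lang in enumerate(langs)}

# Start with NaN everywhere (unknown), then put 1.0 on the diagonal.
mat = np.full((len(langs), len(langs)), np.nan)
np.fill_diagonal(mat, 1.0)

# Write each known similarity into both (a, b) and (b, a).
for a, b, value in df.itertuples(index=False, name=None):
    mat[idx[a], idx[b]] = value
    mat[idx[b], idx[a]] = value

sim = pd.DataFrame(mat, index=langs, columns=langs)
print(sim)
```

Here the Spanish/Russian cells come out as NaN instead of 0.0, so a missing pair is not mistaken for a measured similarity of zero. If zeros are preferred, sim.fillna(0) should give the same shape of result as the crosstab approach.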
User contributions licensed under: CC BY-SA