Cramér's V in Python using a weight column instead of raw frequencies?

My dataset has several categorical variables in columns, plus one column that tells us how often each combination of values occurs, similar to the table below:

Season     Age      Weather    Sales
Spring     New      Cold       100
Fall       Old      Warm       50 
Summer     New      Hot        200

I want to calculate Cramér's V between Season, Age and Weather, using Sales as the weight. If that is doable, how would one write the code for it? Or is there a different approach for measuring association here? Thanks!

Answer

As you probably know, Cramér's V measures the association between two nominal variables. So you can convert your table into a separate contingency table for each pair of variables, using the summed Sales values as the cell counts, and then compute the statistic for each pair.

Code to create a table similar to yours:

from itertools import product
import numpy as np
import pandas as pd
import scipy.stats as stats

np.random.seed(42)

all_combs = product(
    ['Spring', 'Summer', 'Fall', 'Winter'],
    ['New', 'Old'],
    ['Cold', 'Warm', 'Hot']
)

df = pd.DataFrame(all_combs, columns=['Season', 'Age', 'Weather'])
df['Sales'] = np.random.randint(25, 200, len(df))
df.head()

#     Season    Age    Weather    Sales
# 0   Spring    New      Cold       127
# 1   Spring    New      Warm       117
# 2   Spring    New       Hot        39
# 3   Spring    Old      Cold       131
# 4   Spring    Old      Warm        96

Convert the table into a contingency table for measuring the association between Season and Age (each cell is the sum of Sales for that combination), and save it as a 2-D array:

cont = df.pivot_table(values='Sales', index='Season', columns='Age', aggfunc='sum')
cont
#    Age    New Old
# Season        
# Fall      459 277
# Spring    283 272
# Summer    372 377
# Winter    356 384

cont_arr = cont.values
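
Summing the weight column per cell should give the same table you would get by repeating each row as many times as its weight and then counting rows. The sketch below checks this on a small, made-up table (the `toy` data and its values are hypothetical, not taken from the question); it assumes the weights are non-negative integers.

```python
import pandas as pd

# Hypothetical toy data in the same layout as the question's table.
toy = pd.DataFrame({
    'Season': ['Spring', 'Spring', 'Fall', 'Fall'],
    'Age': ['New', 'Old', 'New', 'Old'],
    'Sales': [3, 1, 2, 4],
})

# Weighted route: sum the Sales weights for each (Season, Age) cell.
weighted = pd.crosstab(toy['Season'], toy['Age'],
                       values=toy['Sales'], aggfunc='sum')

# Frequency route: repeat each row Sales times, then count the rows.
expanded = toy.loc[toy.index.repeat(toy['Sales'])]
counted = pd.crosstab(expanded['Season'], expanded['Age'])

print(weighted)
print(counted)
```

Both printed tables should hold the same numbers, which is why the weighted contingency table can be passed to the chi-squared test as if it contained ordinary frequencies.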

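Now you can calculate the chi-squared statistic and, from it, Cramér's V. The formula is V = sqrt((chi2 / n) / min(r - 1, c - 1)), where n is the total of all cells in the contingency table and r and c are its numbers of rows and columns.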

chi2 = stats.chi2_contingency(cont_arr, correction=False)[0]
sample_size = np.sum(cont_arr)
min_dim = min(cont_arr.shape) - 1

cramer_v = np.sqrt((chi2 / sample_size) / min_dim)

cramer_v
# 0.1157257...
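
To cover every pair of variables, the steps above can be wrapped in a small helper and applied to each pair in turn. This is only a sketch: the function name `weighted_cramers_v` and the `results` dictionary are my own, and it assumes every variable has at least two levels and that no row or column of the contingency table sums to zero.

```python
from itertools import combinations, product

import numpy as np
import pandas as pd
import scipy.stats as stats


def weighted_cramers_v(data, col_a, col_b, weight_col):
    """Cramér's V for two categorical columns, using weight_col as cell counts."""
    # Sum the weights per (col_a, col_b) cell; combinations that never occur get 0.
    cont = data.pivot_table(values=weight_col, index=col_a, columns=col_b,
                            aggfunc='sum', fill_value=0)
    cont_arr = cont.to_numpy()
    # Uncorrected chi-squared statistic, as in the single-pair example.
    chi2 = stats.chi2_contingency(cont_arr, correction=False)[0]
    n = cont_arr.sum()
    # Would be 0 (division by zero) if a variable had only one level.
    min_dim = min(cont_arr.shape) - 1
    return np.sqrt((chi2 / n) / min_dim)


# Same kind of example table as in the answer.
np.random.seed(42)
all_combs = product(
    ['Spring', 'Summer', 'Fall', 'Winter'],
    ['New', 'Old'],
    ['Cold', 'Warm', 'Hot']
)
df = pd.DataFrame(all_combs, columns=['Season', 'Age', 'Weather'])
df['Sales'] = np.random.randint(25, 200, len(df))

# One Cramér's V per unordered pair of categorical columns.
results = {
    (a, b): weighted_cramers_v(df, a, b, 'Sales')
    for a, b in combinations(['Season', 'Age', 'Weather'], 2)
}

for pair, v in results.items():
    print(pair, round(v, 4))
```

Each value lies between 0 (no association) and 1 (perfect association), so the pairs can be compared directly.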

User contributions licensed under: CC BY-SA