My dataset for a Cramer’s V calculation has several categorical variables in columns, plus a column that tells how often each combination of values appears, similar to the table below:
Season Age Weather Sales
Spring New Cold 100
Fall Old Warm 50
Summer New Hot 200
I want to figure out how to calculate Cramer’s V between Season, Age, and Weather, using Sales as the weight. If that is doable, how would one write code to calculate it? Or is there a different approach for measuring the association here? Thanks!
Answer
As you probably know, Cramer’s V measures association between two nominal variables. So you can convert your current table into separate contingency tables for each pairwise combination of your variables and then compute pairwise statistics.
Code to create a table similar to yours:
from itertools import product
import numpy as np
import pandas as pd
import scipy.stats as stats
np.random.seed(42)
all_combs = product(
    ['Spring', 'Summer', 'Fall', 'Winter'],
    ['New', 'Old'],
    ['Cold', 'Warm', 'Hot']
)
df = pd.DataFrame(all_combs, columns=['Season', 'Age', 'Weather'])
df['Sales'] = np.random.randint(25, 200, len(df))
df.head()
# Season Age Weather Sales
# 0 Spring New Cold 127
# 1 Spring New Warm 117
# 2 Spring New Hot 39
# 3 Spring Old Cold 131
# 4 Spring Old Warm 96
Convert the table into a contingency table for measuring association between Season and Age, and save it as a 2-d array:
cont = df.pivot_table(values='Sales', index='Season', columns='Age', aggfunc='sum')
cont
# Age New Old
# Season
# Fall 459 277
# Spring 283 272
# Summer 372 377
# Winter 356 384
cont_arr = cont.values
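Now you can calculate the chi-squared statistic and from it compute Cramer’s V, defined as V = sqrt((chi2 / n) / min(r - 1, c - 1)), where n is the total count in the contingency table and r and c are its numbers of rows and columns.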
chi2 = stats.chi2_contingency(cont_arr, correction=False)[0]
sample_size = np.sum(cont_arr)
min_dim = min(cont_arr.shape) - 1
cramer_v = np.sqrt((chi2 / sample_size) / min_dim)
cramer_v
# 0.1157257...
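Since the question asks about all three variables, the same steps can be repeated for every pair of columns. The following is a minimal sketch, not a library function: the helper name weighted_cramers_v and the dictionary pairwise_v are hypothetical names introduced here for illustration, and the sketch assumes that every category has a non-zero Sales total (chi2_contingency raises an error if an expected frequency is zero). It uses pd.crosstab with values and aggfunc, which builds the same summed table as the pivot_table call above.

```python
from itertools import combinations, product

import numpy as np
import pandas as pd
from scipy import stats


def weighted_cramers_v(df, col_a, col_b, weight_col):
    """Cramer's V between two categorical columns, using the summed
    weight_col as cell counts (hypothetical helper for illustration)."""
    # Sum the weights into a contingency table; missing cells become 0.
    cont = pd.crosstab(
        df[col_a], df[col_b], values=df[weight_col], aggfunc="sum"
    ).fillna(0)
    arr = cont.to_numpy()
    # Uncorrected chi-squared statistic, as in the single-pair example.
    chi2 = stats.chi2_contingency(arr, correction=False)[0]
    n = arr.sum()
    min_dim = min(arr.shape) - 1
    return np.sqrt((chi2 / n) / min_dim)


# Same kind of synthetic data as in the answer above.
np.random.seed(42)
df = pd.DataFrame(
    product(
        ["Spring", "Summer", "Fall", "Winter"],
        ["New", "Old"],
        ["Cold", "Warm", "Hot"],
    ),
    columns=["Season", "Age", "Weather"],
)
df["Sales"] = np.random.randint(25, 200, len(df))

# One Cramer's V value per unordered pair of categorical columns.
pairwise_v = {
    (a, b): weighted_cramers_v(df, a, b, "Sales")
    for a, b in combinations(["Season", "Age", "Weather"], 2)
}
for pair, v in pairwise_v.items():
    print(pair, round(v, 4))
```

Each value is still a pairwise measure. Cramer’s V does not describe a joint association among all three variables at once, so the result is three separate numbers rather than one.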