My dataset for a Cramér's V calculation has several categorical variables in columns, plus a column that tells us how often each combination of values appears, similar to the table below:
Season  Age  Weather  Sales
Spring  New  Cold     100
Fall    Old  Warm     50
Summer  New  Hot      200
I want to calculate Cramér's V between Season, Age, and Weather, using Sales as the weight. If that is doable, how would one write code to calculate it? Or is there a different approach to measuring the association here? Thanks!
Answer
As you probably know, Cramér's V measures association between two nominal variables. So you can convert your current table into a separate contingency table for each pair of variables, using the summed Sales as the cell counts, and then compute the statistic for each pair.
Code to create a table similar to yours:
    from itertools import product

    import numpy as np
    import pandas as pd
    import scipy.stats as stats

    np.random.seed(42)

    all_combs = product(
        ['Spring', 'Summer', 'Fall', 'Winter'],
        ['New', 'Old'],
        ['Cold', 'Warm', 'Hot']
    )
    df = pd.DataFrame(all_combs, columns=['Season', 'Age', 'Weather'])
    df['Sales'] = np.random.randint(25, 200, len(df))

    df.head()
    #    Season  Age Weather  Sales
    # 0  Spring  New    Cold    127
    # 1  Spring  New    Warm    117
    # 2  Spring  New     Hot     39
    # 3  Spring  Old    Cold    131
    # 4  Spring  Old    Warm     96
Convert the table into a contingency table for measuring association between Season and Age, and save it as a 2-D array:
    cont = df.pivot_table(values='Sales', index='Season', columns='Age', aggfunc='sum')
    cont
    # Age     New  Old
    # Season
    # Fall    459  277
    # Spring  283  272
    # Summer  372  377
    # Winter  356  384

    cont_arr = cont.values
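To see why summing Sales gives the right weighted table, it may help to check that it matches what you would get by repeating each row Sales times and simply counting rows. This is a minimal sketch, assuming Sales holds non-negative integer counts; the small frame and the names `toy`, `weighted`, `expanded`, and `counted` are made up for illustration:

```python
import pandas as pd

# Hypothetical toy data in the same layout as the question's table.
toy = pd.DataFrame({
    'Season': ['Spring', 'Spring', 'Fall', 'Fall'],
    'Age':    ['New', 'Old', 'New', 'Old'],
    'Sales':  [3, 1, 2, 4],
})

# Weighted table: sum the Sales column within each Season/Age cell.
weighted = toy.pivot_table(values='Sales', index='Season',
                           columns='Age', aggfunc='sum', fill_value=0)

# Unweighted equivalent: repeat each row Sales times, then count rows.
expanded = toy.loc[toy.index.repeat(toy['Sales'].to_numpy())]
counted = pd.crosstab(expanded['Season'], expanded['Age'])

# Compare the cell counts produced by the two approaches.
same = bool((weighted.to_numpy() == counted.to_numpy()).all())
print(same)
```

If the two tables agree, the summed-Sales pivot can be treated as an ordinary table of observed counts in the steps that follow.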
Now you can calculate the chi-squared statistic and, from that, compute Cramér's V as V = sqrt(chi2 / (n * (min(r, c) - 1))), where n is the total count and r and c are the number of rows and columns of the contingency table.
    chi2 = stats.chi2_contingency(cont_arr, correction=False)[0]
    sample_size = np.sum(cont_arr)
    min_dim = min(cont_arr.shape) - 1
    cramer_v = np.sqrt((chi2 / sample_size) / min_dim)

    cramer_v
    # 0.1157257...
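The same steps can be wrapped in a helper and applied to every pair of variables. This is only a sketch under assumptions: `weighted_cramers_v` is a hypothetical helper name (not a pandas or SciPy function), the toy frame is invented, and Sales is assumed to be a non-negative count.

```python
from itertools import combinations

import numpy as np
import pandas as pd
import scipy.stats as stats


def weighted_cramers_v(df, col_a, col_b, weight):
    """Cramér's V between col_a and col_b, counting each row `weight` times.

    Hypothetical helper for illustration; not part of pandas or SciPy.
    """
    # Sum the weights per cell; combinations that never occur become 0.
    cont = df.pivot_table(values=weight, index=col_a, columns=col_b,
                          aggfunc='sum', fill_value=0).to_numpy()
    chi2 = stats.chi2_contingency(cont, correction=False)[0]
    n = cont.sum()
    min_dim = min(cont.shape) - 1
    return np.sqrt(chi2 / (n * min_dim))


# Invented example data in the same layout as the question's table.
toy = pd.DataFrame({
    'Season':  ['Spring', 'Spring', 'Fall', 'Fall', 'Summer', 'Summer'],
    'Age':     ['New', 'Old', 'New', 'Old', 'New', 'Old'],
    'Weather': ['Cold', 'Warm', 'Warm', 'Cold', 'Hot', 'Hot'],
    'Sales':   [100, 40, 50, 80, 200, 30],
})

# One value of V per unordered pair of categorical columns.
results = {
    (a, b): weighted_cramers_v(toy, a, b, 'Sales')
    for a, b in combinations(['Season', 'Age', 'Weather'], 2)
}
for pair, v in results.items():
    print(pair, round(v, 3))
```

Note that V is symmetric and is only defined when both variables have at least two levels, so with three variables you get three pairwise values rather than one overall number.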