I have the following data frame. I want to add in all score levels (high, mid, low), for each group (a, b, c, d), for all dates (there are two dates – 2020-06-01 and 2020-06-02)
x = pd.DataFrame(data={ 'date' : ['2020-06-01','2020-06-01','2020-06-02','2020-06-01','2020-06-02','2020-06-01','2020-06-02','2020-06-02','2020-06-02'], 'group' : ['a','a','a','b','b','c','c','c','d'], 'score' : ['high','low','mid','low','high','high','high','mid','high'], 'count' : [12,13,2,19,22,3,4,49,12]})
I can add in the score categories for all subjects with the following, but i am having trouble adding date in as well
cats = ['high', 'mid','low'] x_re = pd.DataFrame(list(product(x['group'].unique(), cats)),columns=['group', 'score']) x_re.merge(x, how='left').fillna(0)
the expected output would be this : so there are 6 rows per subject, 3 rows for each date, and one row for each score category. The count is then filled in with np.nan (or zero is fine) where the data points are missing
pd.DataFrame(data={ 'date' : ['2020-06-01','2020-06-01','2020-06-01','2020-06-02','2020-06-02','2020-06-02','2020-06-01','2020-06-01','2020-06-01','2020-06-02','2020-06-02','2020-06-02','2020-06-01','2020-06-01','2020-06-01','2020-06-02','2020-06-02','2020-06-02','2020-06-01','2020-06-01','2020-06-01','2020-06-02','2020-06-02','2020-06-02'], 'group' : ['a','a','a','a','a','a','b','b','b','b','b','b','c','c','c','c','c','c','d','d','d','d','d','d'], 'score' : ['high','low','mid','high','low','mid','high','low','mid','high','low','mid','high','low','mid','high','low','mid','high','low','mid','high','low','mid'], 'count' : [12, 13, np.nan, np.nan, np.nan, 2, np.nan, 22, np.nan, 19, np.nan, np.nan, 3, np.nan, np.nan, 4, np.nan, np.nan, np.nan, np.nan, np.nan, 12, np.nan, 49]})
any advice would be great, thank you
Advertisement
Answer
Your solution is possible modify with add date
columns by unique values, this solution working if not unique triples date, group, score
in input data:
cats = ['high', 'mid','low'] x_re = pd.DataFrame(list(product(x['date'].unique(), x['group'].unique(), cats)),columns=['date','group', 'score']) x = x_re.merge(x, how='left').fillna(0)
Solution with reindex
by 3 level MultiIndex
is similar:
cats = ['high', 'mid','low'] x_re = pd.MultiIndex.from_product([x['date'].unique(), x['group'].unique(), cats],names=['date','group', 'score']) x = x.set_index(['date','group','score']).reindex(x_re).reset_index() print (x) date group score count 0 2020-06-01 a high 12.0 1 2020-06-01 a mid NaN 2 2020-06-01 a low 13.0 3 2020-06-01 b high NaN 4 2020-06-01 b mid NaN 5 2020-06-01 b low 19.0 6 2020-06-01 c high 3.0 7 2020-06-01 c mid NaN 8 2020-06-01 c low NaN 9 2020-06-01 d high NaN 10 2020-06-01 d mid NaN 11 2020-06-01 d low NaN 12 2020-06-02 a high NaN 13 2020-06-02 a mid 2.0 14 2020-06-02 a low NaN 15 2020-06-02 b high 22.0 16 2020-06-02 b mid NaN 17 2020-06-02 b low NaN 18 2020-06-02 c high 4.0 19 2020-06-02 c mid 49.0 20 2020-06-02 c low NaN 21 2020-06-02 d high 12.0 22 2020-06-02 d mid NaN 23 2020-06-02 d low NaN
With one call unstack
and one call stack
is possible use, but is necessary all unique values cats
have to exist in input data:
x = (x.set_index(['date', 'group', 'score']) .unstack(['group','score']) .stack([1, 2], dropna=False) .reset_index()) print (x) date group score count 0 2020-06-01 a high 12.0 1 2020-06-01 a low 13.0 2 2020-06-01 a mid NaN 3 2020-06-01 b high NaN 4 2020-06-01 b low 19.0 5 2020-06-01 b mid NaN 6 2020-06-01 c high 3.0 7 2020-06-01 c low NaN 8 2020-06-01 c mid NaN 9 2020-06-01 d high NaN 10 2020-06-01 d low NaN 11 2020-06-01 d mid NaN 12 2020-06-02 a high NaN 13 2020-06-02 a low NaN 14 2020-06-02 a mid 2.0 15 2020-06-02 b high 22.0 16 2020-06-02 b low NaN 17 2020-06-02 b mid NaN 18 2020-06-02 c high 4.0 19 2020-06-02 c low NaN 20 2020-06-02 c mid 49.0 21 2020-06-02 d high 12.0 22 2020-06-02 d low NaN 23 2020-06-02 d mid NaN