Skip to content
Advertisement

inserting missing categories and dates in pandas dataframe

I have the following data frame. I want to add in all score levels (high, mid, low), for each group (a, b, c, d), for all dates (there are two dates – 2020-06-01 and 2020-06-02)

x = pd.DataFrame(data={ 'date'  : ['2020-06-01','2020-06-01','2020-06-02','2020-06-01','2020-06-02','2020-06-01','2020-06-02','2020-06-02','2020-06-02'],
                        'group' : ['a','a','a','b','b','c','c','c','d'],
                        'score' : ['high','low','mid','low','high','high','high','mid','high'],
                        'count' : [12,13,2,19,22,3,4,49,12]})

I can add in the score categories for all subjects with the following, but i am having trouble adding date in as well

cats = ['high', 'mid','low'] 
x_re = pd.DataFrame(list(product(x['group'].unique(), cats)),columns=['group', 'score'])
x_re.merge(x, how='left').fillna(0)

the expected output would be this : so there are 6 rows per subject, 3 rows for each date, and one row for each score category. The count is then filled in with np.nan (or zero is fine) where the data points are missing

pd.DataFrame(data={ 'date'  : ['2020-06-01','2020-06-01','2020-06-01','2020-06-02','2020-06-02','2020-06-02','2020-06-01','2020-06-01','2020-06-01','2020-06-02','2020-06-02','2020-06-02','2020-06-01','2020-06-01','2020-06-01','2020-06-02','2020-06-02','2020-06-02','2020-06-01','2020-06-01','2020-06-01','2020-06-02','2020-06-02','2020-06-02'],                        
                        'group' : ['a','a','a','a','a','a','b','b','b','b','b','b','c','c','c','c','c','c','d','d','d','d','d','d'],
                        'score' : ['high','low','mid','high','low','mid','high','low','mid','high','low','mid','high','low','mid','high','low','mid','high','low','mid','high','low','mid'],
                        'count' : [12, 13, np.nan, np.nan, np.nan, 2, np.nan, 22, np.nan, 19, np.nan, np.nan, 3, np.nan, np.nan, 4, np.nan, np.nan, np.nan, np.nan, np.nan, 12, np.nan, 49]})

any advice would be great, thank you

Advertisement

Answer

Your solution is possible modify with add date columns by unique values, this solution working if not unique triples date, group, score in input data:

cats = ['high', 'mid','low'] 
x_re = pd.DataFrame(list(product(x['date'].unique(), 
                                 x['group'].unique(), 
                                 cats)),columns=['date','group', 'score'])
x = x_re.merge(x, how='left').fillna(0)

Solution with reindex by 3 level MultiIndex is similar:

cats = ['high', 'mid','low'] 
x_re = pd.MultiIndex.from_product([x['date'].unique(), 
                                   x['group'].unique(),
                                   cats],names=['date','group', 'score'])

x = x.set_index(['date','group','score']).reindex(x_re).reset_index()
print (x)
          date group score  count
0   2020-06-01     a  high   12.0
1   2020-06-01     a   mid    NaN
2   2020-06-01     a   low   13.0
3   2020-06-01     b  high    NaN
4   2020-06-01     b   mid    NaN
5   2020-06-01     b   low   19.0
6   2020-06-01     c  high    3.0
7   2020-06-01     c   mid    NaN
8   2020-06-01     c   low    NaN
9   2020-06-01     d  high    NaN
10  2020-06-01     d   mid    NaN
11  2020-06-01     d   low    NaN
12  2020-06-02     a  high    NaN
13  2020-06-02     a   mid    2.0
14  2020-06-02     a   low    NaN
15  2020-06-02     b  high   22.0
16  2020-06-02     b   mid    NaN
17  2020-06-02     b   low    NaN
18  2020-06-02     c  high    4.0
19  2020-06-02     c   mid   49.0
20  2020-06-02     c   low    NaN
21  2020-06-02     d  high   12.0
22  2020-06-02     d   mid    NaN
23  2020-06-02     d   low    NaN

With one call unstack and one call stack is possible use, but is necessary all unique values cats have to exist in input data:

x = (x.set_index(['date', 'group', 'score'])
      .unstack(['group','score'])
      .stack([1, 2], dropna=False)
      .reset_index())
print (x)
          date group score  count
0   2020-06-01     a  high   12.0
1   2020-06-01     a   low   13.0
2   2020-06-01     a   mid    NaN
3   2020-06-01     b  high    NaN
4   2020-06-01     b   low   19.0
5   2020-06-01     b   mid    NaN
6   2020-06-01     c  high    3.0
7   2020-06-01     c   low    NaN
8   2020-06-01     c   mid    NaN
9   2020-06-01     d  high    NaN
10  2020-06-01     d   low    NaN
11  2020-06-01     d   mid    NaN
12  2020-06-02     a  high    NaN
13  2020-06-02     a   low    NaN
14  2020-06-02     a   mid    2.0
15  2020-06-02     b  high   22.0
16  2020-06-02     b   low    NaN
17  2020-06-02     b   mid    NaN
18  2020-06-02     c  high    4.0
19  2020-06-02     c   low    NaN
20  2020-06-02     c   mid   49.0
21  2020-06-02     d  high   12.0
22  2020-06-02     d   low    NaN
23  2020-06-02     d   mid    NaN
User contributions licensed under: CC BY-SA
7 People found this is helpful
Advertisement