Pandas DataFrame mean of data in columns occurring before certain date time

I have a dataframe with client IDs and their yearly expenses for 2014-2018. I want the mean of the expenses per ID, but only the years that lie before a certain date may be taken into account when calculating it (so the 'Date' column dictates which columns count toward the mean).

Example: for index 0 (ID: 12), the date is '2016-03-08', so the mean should be taken over the columns 'y_2014' and 'y_2015', which gives 111.0 for this index. If the date is too early (somewhere in 2014 or earlier in this case), NaN should be returned (see index 6 and 9).
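To make the rule concrete, here is a minimal sketch for a single row. The helper name `mean_before` and the dict layout are assumptions made only for illustration; they are not part of the original code.

```python
import numpy as np

def mean_before(expenses, date_year):
    """Mean of the expenses for years strictly before date_year, skipping NaN.

    `expenses` maps a year to its expense value (hypothetical layout).
    """
    eligible = [value for year, value in expenses.items()
                if year < date_year and not np.isnan(value)]
    # No eligible year means there is nothing to average: return NaN
    return float(np.mean(eligible)) if eligible else float("nan")

# Index 0 (ID 12) from the example data
row0 = {2014: 100.0, 2015: 122.0, 2016: 324.0, 2017: 632.0, 2018: float("nan")}

print(mean_before(row0, 2016))  # 111.0 (only 2014 and 2015 count)
print(mean_before(row0, 2014))  # nan (no year lies before 2014)
```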

Desired output:

   y_2014  y_2015  y_2016  y_2017  y_2018        Date  ID    mean
0   100.0   122.0     324     632     NaN  2016-03-08  12   111.0
1   120.0   159.0      54     452   541.0  2015-04-09  96   120.0
2     NaN   164.0     687     165   245.0  2016-02-15  20   164.0
3   180.0   421.0     512     184   953.0  2018-05-01  73  324.25
4   110.0   654.0     913     173   103.0  2017-08-04  84   559.0
5   130.0     NaN     754     124   207.0  2016-07-03  26   130.0
6   170.0   256.0     843      97   806.0  2013-02-04  87     NaN
7   140.0   754.0      95     101   541.0  2016-06-08  64     447
8    80.0   985.0     184      84    90.0  2019-03-05  11   284.6
9    96.0    65.0     127     130   421.0  2014-05-14  34     NaN


The code below is what I tried.

Tried code:

import pandas as pd
import numpy as np

df = pd.DataFrame({"ID":   [12,96,20,73,84,26,87,64,11,34],
               "y_2014": [100,120,np.nan,180,110,130,170,140,80,96],
               "y_2015": [122,159,164,421,654,np.nan,256,754,985,65],
               "y_2016": [324,54,687,512,913,754,843,95,184,127],
               "y_2017": [632,452,165,184,173,124,97,101,84,130],
               "y_2018": [np.nan,541,245,953,103,207,806,541,90,421],
                 "Date": ['2016-03-08', '2015-04-09', '2016-02-15', '2018-05-01', '2017-08-04',
                          '2016-07-03', '2013-02-04', '2016-06-08', '2019-03-05', '2014-05-14']})

print(df)

# the years from columns
data = df.filter(like='y_')
data_years = data.columns.str.extract(r'(\d+)')[0].astype(int)

# the years from Date
years = pd.to_datetime(df.Date).dt.year.values

df['mean'] = data.where(data_years<years[:,None]).mean(1)
print(df)

-> ValueError: Lengths must match to compare
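
The error most likely arises because `data_years` is a pandas Series of length 5 while `years[:,None]` is a NumPy array with 10 rows: pandas checks that the lengths match instead of broadcasting. A minimal sketch of one way around it, assuming the intent was a row-by-column year comparison, is to convert the column years to a plain NumPy array first (the `mask` variable name is introduced here for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [12, 96, 20, 73, 84, 26, 87, 64, 11, 34],
    "y_2014": [100, 120, np.nan, 180, 110, 130, 170, 140, 80, 96],
    "y_2015": [122, 159, 164, 421, 654, np.nan, 256, 754, 985, 65],
    "y_2016": [324, 54, 687, 512, 913, 754, 843, 95, 184, 127],
    "y_2017": [632, 452, 165, 184, 173, 124, 97, 101, 84, 130],
    "y_2018": [np.nan, 541, 245, 953, 103, 207, 806, 541, 90, 421],
    "Date": ["2016-03-08", "2015-04-09", "2016-02-15", "2018-05-01", "2017-08-04",
             "2016-07-03", "2013-02-04", "2016-06-08", "2019-03-05", "2014-05-14"],
})

data = df.filter(like="y_")

# Column years as a plain NumPy array, shape (5,)
data_years = data.columns.str.extract(r"(\d+)")[0].astype(int).to_numpy()

# Year of each row's Date as a column vector, shape (10, 1)
years = pd.to_datetime(df["Date"]).dt.year.to_numpy()[:, None]

# NumPy broadcasts (5,) against (10, 1) into a (10, 5) boolean mask
mask = data_years < years

# Cells outside the mask become NaN, which mean() skips;
# rows with no eligible year end up as NaN
df["mean"] = data.where(mask).mean(axis=1)
print(df)
```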


Answer

Solved: one possible answer to my own question

import pandas as pd
import numpy as np

df = pd.DataFrame({"ID":   [12,96,20,73,84,26,87,64,11,34],
               "y_2014": [100,120,np.nan,180,110,130,170,140,80,96],
               "y_2015": [122,159,164,421,654,np.nan,256,754,985,65],
               "y_2016": [324,54,687,512,913,754,843,95,184,127],
               "y_2017": [632,452,165,184,173,124,97,101,84,130],
               "y_2018": [np.nan,541,245,953,103,207,806,541,90,421],
                 "Date": ['2016-03-08', '2015-04-09', '2016-02-15', '2018-05-01', '2017-08-04',
                          '2016-07-03', '2013-02-04', '2016-06-08', '2019-03-05', '2014-05-14']})

# Subset of the original df that holds only the yearly expense columns
subset = df.loc[:, ['y_2014', 'y_2015', 'y_2016', 'y_2017', 'y_2018']]

# An expense value only counts toward the mean once its year has passed.
# Each column is therefore relabelled with the first day of the following year
# ('y_2014' becomes '2015-01-01', etc.) so it can be compared with the 'Date' column.
subset.columns = ['2015-01-01', '2016-01-01', '2017-01-01', '2018-01-01', '2019-01-01']

# Boolean mask (rows x columns): True where the column's cutoff lies before the row's Date.
# The dates are ISO-formatted strings, so comparing them as strings gives the right order.
s = subset.columns.values < df.Date.values[:, None]
t = s.astype(float)
t[t == 0] = np.nan  # excluded cells become NaN so that mean() skips them

df['mean'] = (subset * t).mean(axis=1)
print(df)

# Additionally: the sum of expenses before the date in the 'Date' column.
# Note that sum() skips NaN, so rows without any eligible year give 0 here, not NaN.
df['sum'] = (subset * t).sum(axis=1)
print(df)
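
A possible variation on the answer above, sketched under the assumption that rows without any eligible year should show NaN in the sum as well: compare parsed dates instead of strings and pass `min_count=1` to `sum`. The names `year_cols`, `cutoffs`, and `masked` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [12, 96, 20, 73, 84, 26, 87, 64, 11, 34],
    "y_2014": [100, 120, np.nan, 180, 110, 130, 170, 140, 80, 96],
    "y_2015": [122, 159, 164, 421, 654, np.nan, 256, 754, 985, 65],
    "y_2016": [324, 54, 687, 512, 913, 754, 843, 95, 184, 127],
    "y_2017": [632, 452, 165, 184, 173, 124, 97, 101, 84, 130],
    "y_2018": [np.nan, 541, 245, 953, 103, 207, 806, 541, 90, 421],
    "Date": ["2016-03-08", "2015-04-09", "2016-02-15", "2018-05-01", "2017-08-04",
             "2016-07-03", "2013-02-04", "2016-06-08", "2019-03-05", "2014-05-14"],
})

year_cols = ["y_2014", "y_2015", "y_2016", "y_2017", "y_2018"]

# First day of the year after each expense column, e.g. y_2014 -> 2015-01-01
cutoffs = pd.to_datetime(
    [f"{int(col.split('_')[1]) + 1}-01-01" for col in year_cols]
).to_numpy()

# Each row's Date as a column vector so it broadcasts against the cutoffs
dates = pd.to_datetime(df["Date"]).to_numpy()[:, None]

# (10, 5) mask: True where the expense year had fully passed before Date
masked = df[year_cols].where(cutoffs < dates)

df["mean"] = masked.mean(axis=1)
# min_count=1 makes sum() return NaN when a row has no valid value at all
df["sum"] = masked.sum(axis=1, min_count=1)
print(df)
```

Without `min_count=1`, the rows at index 6 and 9 would get a sum of 0 while their mean is NaN, which can be misleading when the two columns are read side by side.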
