I have a dataframe with client IDs and their expenses for 2014-2018. I want the mean of the expenses per ID, but only the years before a certain date may be taken into account when calculating the mean (so the column ‘Date’ dictates which columns are included).
Example: for index 0 (ID: 12), the date is ‘2016-03-08’, so the mean should be taken from the columns ‘y_2014’ and ‘y_2015’, which gives 111.0. If the date is too early (here, somewhere in 2014 or earlier), NaN should be returned (see index 6 and 9).
Desired output:
       y_2014  y_2015  y_2016  y_2017  y_2018        Date  ID    mean
    0   100.0   122.0     324     632     NaN  2016-03-08  12   111.0
    1   120.0   159.0      54     452   541.0  2015-04-09  96   120.0
    2     NaN   164.0     687     165   245.0  2016-02-15  20   164.0
    3   180.0   421.0     512     184   953.0  2018-05-01  73  324.25
    4   110.0   654.0     913     173   103.0  2017-08-04  84   559.0
    5   130.0     NaN     754     124   207.0  2016-07-03  26   130.0
    6   170.0   256.0     843      97   806.0  2013-02-04  87     NaN
    7   140.0   754.0      95     101   541.0  2016-06-08  64   447.0
    8    80.0   985.0     184      84    90.0  2019-03-05  11   284.6
    9    96.0    65.0     127     130   421.0  2014-05-14  34     NaN
The code below is what I tried.
Tried code:
    import pandas as pd
    import numpy as np

    df = pd.DataFrame({"ID": [12,96,20,73,84,26,87,64,11,34],
                       "y_2014": [100,120,np.nan,180,110,130,170,140,80,96],
                       "y_2015": [122,159,164,421,654,np.nan,256,754,985,65],
                       "y_2016": [324,54,687,512,913,754,843,95,184,127],
                       "y_2017": [632,452,165,184,173,124,97,101,84,130],
                       "y_2018": [np.nan,541,245,953,103,207,806,541,90,421],
                       "Date": ['2016-03-08', '2015-04-09', '2016-02-15', '2018-05-01', '2017-08-04',
                                '2016-07-03', '2013-02-04', '2016-06-08', '2019-03-05', '2014-05-14']})
    print(df)

    # the years from columns
    data = df.filter(like='y_')
    data_years = data.columns.str.extract(r'(\d+)')[0].astype(int)

    # the years from Date
    years = pd.to_datetime(df.Date).dt.year.values

    df['mean'] = data.where(data_years<years[:,None]).mean(1)
    print(df)

-> ValueError: Lengths must match to compare
Answer
Solved: one possible answer to my own question
    import pandas as pd
    import numpy as np

    df = pd.DataFrame({"ID": [12,96,20,73,84,26,87,64,11,34],
                       "y_2014": [100,120,np.nan,180,110,130,170,140,80,96],
                       "y_2015": [122,159,164,421,654,np.nan,256,754,985,65],
                       "y_2016": [324,54,687,512,913,754,843,95,184,127],
                       "y_2017": [632,452,165,184,173,124,97,101,84,130],
                       "y_2018": [np.nan,541,245,953,103,207,806,541,90,421],
                       "Date": ['2016-03-08', '2015-04-09', '2016-02-15', '2018-05-01', '2017-08-04',
                                '2016-07-03', '2013-02-04', '2016-06-08', '2019-03-05', '2014-05-14']})

    # Subset from original df to calculate mean
    subset = df.loc[:, ['y_2014', 'y_2015', 'y_2016', 'y_2017', 'y_2018']]

    # An expense value is only available for the calculation of the mean when
    # that year has passed. Therefore 2015-01-01 is chosen for the 'y_2014'
    # column in the subset, etc., to check against the 'Date' column.
    subset.columns = ['2015-01-01', '2016-01-01', '2017-01-01', '2018-01-01', '2019-01-01']

    # ISO-formatted date strings compare correctly as plain strings;
    # the result is a (rows x columns) boolean array
    s = subset.columns.to_numpy() < df.Date.to_numpy()[:, None]

    # Turn the mask into 1.0 / NaN so that multiplying blanks out excluded years
    t = s.astype(float)
    t[t == 0] = np.nan

    df['mean'] = (subset * t).mean(1)
    print(df)

    # Additionally: the sum of expenses before the date in the 'Date' column
    # (note: rows with no eligible year get 0.0 here, not NaN)
    df['sum'] = (subset * t).sum(1)
    print(df)
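For comparison, the original attempt from the question can probably be repaired with a small change. This is a minimal sketch, assuming the ValueError comes from comparing a pandas Series (`data_years`) with a 2-D NumPy array: pandas checks the lengths (5 vs. 10) and refuses, whereas two plain NumPy arrays of shapes (5,) and (10, 1) broadcast to a (10, 5) mask. The `mask` name and the `min_count=1` argument for the sum are my own additions for illustration, not part of the code above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [12, 96, 20, 73, 84, 26, 87, 64, 11, 34],
    "y_2014": [100, 120, np.nan, 180, 110, 130, 170, 140, 80, 96],
    "y_2015": [122, 159, 164, 421, 654, np.nan, 256, 754, 985, 65],
    "y_2016": [324, 54, 687, 512, 913, 754, 843, 95, 184, 127],
    "y_2017": [632, 452, 165, 184, 173, 124, 97, 101, 84, 130],
    "y_2018": [np.nan, 541, 245, 953, 103, 207, 806, 541, 90, 421],
    "Date": ["2016-03-08", "2015-04-09", "2016-02-15", "2018-05-01", "2017-08-04",
             "2016-07-03", "2013-02-04", "2016-06-08", "2019-03-05", "2014-05-14"],
})

# Expense columns, and the year each one covers as a plain NumPy array, shape (5,)
data = df.filter(like="y_")
data_years = data.columns.str.extract(r"(\d+)")[0].astype(int).to_numpy()

# Year of each row's cutoff date as a column vector, shape (10, 1)
years = pd.to_datetime(df["Date"]).dt.year.to_numpy()[:, None]

# NumPy broadcasting: True where the expense year lies strictly before the Date year
mask = data_years < years

# where() keeps values where the mask is True and puts NaN elsewhere;
# mean() skips NaN, and a row with no eligible year stays NaN
df["mean"] = data.where(mask).mean(axis=1)

# min_count=1 makes the sum NaN (instead of 0.0) when no year is eligible
df["sum"] = data.where(mask).sum(axis=1, min_count=1)

print(df)
```

This compares calendar years only, which matches the rule that a year counts once it has fully passed, and it avoids renaming the columns to date strings.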