I have a dataframe with 50k+ rows. df.head(5)
is below:
start_date finish_date months_used 2841 2019-06-23 2019-07-17 2 2842 2019-06-16 2019-06-23 1 2843 2019-03-27 2019-07-17 5 2844 2019-05-29 2019-06-05 2 2845 2019-03-25 2019-07-17 5
I need to create one more column with a list of months spent between start and finish dates to use df.explode
relying on this column and get for every ID with months_used > 1 new row with date of every month the work was in progress.
My primitive way is:
for i in df2.index: if df2.loc[i, 'months_used'] > 1: x = pd.date_range(start=df2.loc[i, 'start_date'], end=df2.loc[i, 'finish_date'], freq='M') df2.loc[i, 'rep_list'] = str(x)
But it does not make any sense, because it gives me strings like this "DatetimeIndex(['2019-03-31', '2019-04-30', '2019-05-31', '2019-06-30'], dtype='datetime64[ns]', freq='M')"
and explode doesnt work. If i remove str()
i get a ValueError: Must have equal len keys and value when setting with an iterable
. And it is very slow…
I even cant fill any column with list-type
because i get the same error (i suppose it tries to fill a column with values from the list instead of inserting the list itself to the dataframe…)
expected:
start_date finish_date months_used rep_list 2841 2019-06-23 2019-07-17 2 [2019-06, 2019-07] 2842 2019-06-16 2019-06-23 1 [2019-06] 2843 2019-03-27 2019-07-17 5 [2019-03, 2019-04, 2019-05, 2019-06, 2019-07] 2844 2019-05-29 2019-06-05 2 [2019-05, 2019-06] 2845 2019-03-25 2019-07-17 5 [2019-03, 2019-04, 2019-05, 2019-06, 2019-07]
Inside ‘expected’-‘rep_list’ there could be any dates from current month… e.g. [2019-03-01, 2019-04-01, 2019-05-01, 2019-06-01, 2019-07-01] or others. Just need explode to work after that.
Advertisement
Answer
Try:
s=[[*pd.date_range(x,periods=y,freq='M').strftime('%Y-%m')] for x,y in zip(df['start_date'],df['months_used'])] #Finally: df['rep_list']=s
OR
s=df.agg(lambda x:[*pd.date_range(x['start_date'],periods=x['months_used'],freq='M').strftime('%Y-%m')],1).tolist() #Finally: df['rep_list']=s
Now if you print df
you will get your desired output