Skip to content
Advertisement

How to get a date_range and insert them as a ‘list’ to a new column in dataframe?

I have a dataframe with 50k+ rows. df.head(5) is below:

       start_date   finish_date months_used
2841    2019-06-23  2019-07-17       2
2842    2019-06-16  2019-06-23       1
2843    2019-03-27  2019-07-17       5
2844    2019-05-29  2019-06-05       2
2845    2019-03-25  2019-07-17       5

I need to create one more column with a list of months spent between start and finish dates to use df.explode relying on this column and get for every ID with months_used > 1 new row with date of every month the work was in progress.

My primitive way is:

for i in df2.index:
    if df2.loc[i, 'months_used'] > 1:
        x = pd.date_range(start=df2.loc[i, 'start_date'], end=df2.loc[i, 'finish_date'], freq='M')    
        df2.loc[i, 'rep_list'] = str(x)

But it does not make any sense, because it gives me strings like this "DatetimeIndex(['2019-03-31', '2019-04-30', '2019-05-31', '2019-06-30'], dtype='datetime64[ns]', freq='M')" and explode doesnt work. If i remove str() i get a ValueError: Must have equal len keys and value when setting with an iterable. And it is very slow…

I even cant fill any column with list-type because i get the same error (i suppose it tries to fill a column with values from the list instead of inserting the list itself to the dataframe…)

expected:

        start_date  finish_date months_used rep_list
2841    2019-06-23  2019-07-17       2      [2019-06, 2019-07]
2842    2019-06-16  2019-06-23       1      [2019-06]
2843    2019-03-27  2019-07-17       5      [2019-03, 2019-04, 2019-05, 2019-06, 2019-07]
2844    2019-05-29  2019-06-05       2      [2019-05, 2019-06]
2845    2019-03-25  2019-07-17       5      [2019-03, 2019-04, 2019-05, 2019-06, 2019-07]

Inside ‘expected’-‘rep_list’ there could be any dates from current month… e.g. [2019-03-01, 2019-04-01, 2019-05-01, 2019-06-01, 2019-07-01] or others. Just need explode to work after that.

Advertisement

Answer

Try:

s=[[*pd.date_range(x,periods=y,freq='M').strftime('%Y-%m')] for x,y in zip(df['start_date'],df['months_used'])]
#Finally:
df['rep_list']=s

OR

s=df.agg(lambda x:[*pd.date_range(x['start_date'],periods=x['months_used'],freq='M').strftime('%Y-%m')],1).tolist()
#Finally:
df['rep_list']=s

Now if you print df you will get your desired output

User contributions licensed under: CC BY-SA
7 People found this is helpful
Advertisement