I have a large time-series dataframe. The column has already been formatted as datetime. Such as
2017-10-06T00:00:00+00:00 2020-04-29 00:00:00+00:00
I want to plot the sample numbers for each season. Such as the following. The values are the count number of samples in that season.
1997 Winter 4 1997 Spring 8 1997 Summer 8 ... 2020 Winter 32
I do make a little search and realize I can create a dictionary to convert the months into seasons. However, The most tricky part since the ‘real wintertime’ contains two years’ data. For example, the 1997 winter actually should contain 1997 December 1998 January, and 1998 February.
Please note, I want the ‘1997 January, 1997 February’ to be excluded from 1997 winter since they are ‘1996 winter’.
I am wondering what is the most efficient way to do that? It does not have to be named such as ‘1997 winter’, any index should work for me as long as the counting numbers are successive from the beginning to the end.
Many thanks!
Advertisement
Answer
There is a fast way to solve it, but it’s not very orthodox… You create a column ‘season’, and with a np.where(), you assign the season. At the beginning, you say winter for the first 3 month, spring, for the 3 next, and so. And then, you apply a shift(-1) on the column to shift it by one row back. Then, you’ve got your seasons (just ffill the las nan). You can then solve your problem in a lazy way. If you’re not confortable with the code, tell me, I will edit it.
EDIT:
I assume that the dates are in the index. If not, you should apply a dt.month instead of .month. I decompose it to make it clear
_condtion_spring = (df.index.month>=4)&(df.index.month<=6) _condition_summer = (df.index.month>7)&(df.index.month<=9) _condition_automn = (df.index.month>=10)@(df.index.month<=12) df['Season'] = np.where(_condition_winter,'Winter',np.where(_condtion_spring,'Spring',np.where(_condition_summer,'Summer',np.where(_condition_automn,'Automn',np.nan)))) df['Season'] = df['Season'].shift(-1).fillna(method='ffill')
EDIT 2:
Here a complete example :
dates = pd.date_range("1983-09-01","1985-12-31",freq="1M") df = pd.DataFrame(np.random.randint(100, 200,size=28)/100,index =dates,columns=["Sample"]) df = df.sort_index() _condition_winter = (df.index.month>=1)&(df.index.month<=3) _condtion_spring = (df.index.month>=4)&(df.index.month<=6) _condition_summer = (df.index.month>=7)&(df.index.month<=9) _condition_automn = (df.index.month>=10)@(df.index.month<=12) df['Season'] = np.where(_condition_winter,'Winter',np.where(_condtion_spring,'Spring',np.where(_condition_summer,'Summer',np.where(_condition_automn,'Automn',np.nan)))) df['Season'] = df['Season']+'_'+df.index.strftime(date_format='%Y') df['Season'] = df['Season'].shift(-1).fillna(method='ffill') print('Sample for winter 1984 = ',df[df.Season=='Winter_1984'].Sample.sum())
EDIT 3:
Here the complete example if you have several rows for same month:
#### Build our df #### This is just to make it clear that we will have 2 rows of each month. It could be more or less. dates = pd.date_range("1983-09-01","1985-12-31",freq="1M") dates2 = pd.date_range("1983-09-01","1985-12-31",freq="1M") df1 = pd.DataFrame(np.random.randint(100, 200,size=28)/100,index =dates,columns=["Sample"]).append(pd.DataFrame(np.random.randint(100, 200,size=28)/100,index =dates2,columns=["Sample"])) df1 = df1.sort_index() #### Now, to keep it clear, even if we could do faster, let's do a dataframe with 1 row per month with total of sample each time df = pd.DataFrame() df = df1.groupby(df1.index).sum() #### Let's sort by date to be sure that it won't me messy #### If you've got a 'Date' column and not the index, apply a .sort_values('Date') instead of sort_index df = df.sort_index() #### If youve got a 'Date' column, it will be df.Date.dt.month istead of df.index.month _condition_winter = (df.index.month>=1)&(df.index.month<=3) _condtion_spring = (df.index.month>=4)&(df.index.month<=6) _condition_summer = (df.index.month>=7)&(df.index.month<=9) _condition_automn = (df.index.month>=10)@(df.index.month<=12) df['Season'] = np.where(_condition_winter,'Winter',np.where(_condtion_spring,'Spring',np.where(_condition_summer,'Summer',np.where(_condition_automn,'Automn',np.nan)))) df['Season'] = df['Season']+'_'+df.index.strftime(date_format='%Y') df['Season'] = df['Season'].shift(-1).fillna(method='ffill') print('Sample for winter 1984 = ',df[df.Season=='Winter_1984'].Sample.sum())