I have a df in pandas with daily data. I would like to return the last value of each month. I thought the simple solution would be to .resample("M").apply(lambda ser: ser.iloc[-1,])
however, it seems as if resample
actually computes the month end date rather than return the actual date that appears that month. Is this intended behavior? MWE:
import pandas as pd import numpy as np df = pd.Series(np.arange(100), index=pd.date_range(start="2000-01-02", periods=100)).to_frame() df.sort_index().resample("M").apply(lambda ser: ser.iloc[-1,]) # 0 #2000-01-31 29 #2000-02-29 58 #2000-03-31 89 #2000-04-30 99
While the last date appearing in df
is 2000-04-10
Advertisement
Answer
By using resample
with the offset M
, you are downsampling your sample to calendar month’s end (see the linked documentation on offsets), and then passing a function. So your indices will always be the last day of that month, and this is indeed the intended behaviour. The function you are applying (lambda ser: ser.iloc[-1,]
) is just saying: for the calendar date that ends on this day, what is the last value found in the original data.
For an example, you could also resample to month start, using the offset MS
instead of M
, and the result will be the same, except that the index will be the first day of the calendar month instead of the last day:
# Resample to month end, as you had originally: >>> df.sort_index().resample("M").apply(lambda ser: ser.iloc[-1,]) 0 2000-01-31 29 2000-02-29 58 2000-03-31 89 2000-04-30 99 # Resample to month start: same data, except index is month start instead of month end >>> df.sort_index().resample("MS").apply(lambda ser: ser.iloc[-1,]) 0 2000-01-01 29 2000-02-01 58 2000-03-01 89 2000-04-01 99
As pointed out by Wen, if you just want to show the actual last date found in your data, you’re better off with a groupby. Resample is useful if you want to up- or down- sample your data to a different time frequency, not so much to select real data from your original time frequency