I have a Series object that has:
date price dec 12 may 15 apr 13 ..
Problem statement: I want to make it appear by month and compute the mean price for each month and present it with a sorted manner by month.
Desired Output:
month mean_price Jan XXX Feb XXX Mar XXX
I thought of making a list and passing it in a sort function:
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
but the sort_values doesn’t support that for series.
One big problem I have is that even though
df = df.sort_values(by='date',ascending=True,inplace=True)
works
to the initial df
but after I did a groupby
, it didn’t maintain the order coming out from the sorted df
.
To conclude, I needed from the initial data frame these two columns. Sorted the datetime column and through a groupby using the month (dt.strftime(‘%B’)) the sorting got messed up. Now I have to sort it by month name.
My code:
df # has 5 columns though I need the column 'date' and 'price' df.sort_values(by='date',inplace=True) #at this part it is sorted according to date, great total=(df.groupby(df['date'].dt.strftime('%B'))['price'].mean()) # Though now it is not as it was but instead the months appear alphabetically
Advertisement
Answer
Thanks @Brad Solomon for offering a faster way to capitalize string!
Note 1 @Brad Solomon’s answer using pd.categorical
should save your resources more than my answer. He showed how to assign order to your categorical data. You should not miss it :P
Alternatively, you can use.
df = pd.DataFrame([["dec", 12], ["jan", 40], ["mar", 11], ["aug", 21], ["aug", 11], ["jan", 11], ["jan", 1]], columns=["Month", "Price"]) # Preprocessing: capitalize `jan`, `dec` to `Jan` and `Dec` df["Month"] = df["Month"].str.capitalize() # Now the dataset should look like # Month Price # ----------- # Dec XX # Jan XX # Apr XX # make it a datetime so that we can sort it: # use %b because the data use the abbreviation of month df["Month"] = pd.to_datetime(df.Month, format='%b', errors='coerce').dt.month df = df.sort_values(by="Month") total = (df.groupby(df['Month'])['Price'].mean()) # total Month 1 17.333333 3 11.000000 8 16.000000 12 12.000000
Note 2
groupby
by default will sort group keys for you. Be aware to use the same key to sort and groupby in the df = df.sort_values(by=SAME_KEY)
and total = (df.groupby(df[SAME_KEY])['Price'].mean()).
Otherwise, one may gets unintended behavior. See Groupby preserve order among groups? In which way? for more information.
Note 3
A more computationally efficient way is first compute mean and then do sorting on months. In this way, you only need to sort on 12 items rather than the whole df
. It will reduce the computational cost if one don’t need df
to be sorted.
Note 4 For people already have month
as index, and wonder how to make it categorical, take a look at pandas.CategoricalIndex
@jezrael has a working example on making categorical index ordered in Pandas series sort by month index