Background
Pretty new to Python and dataframes. I’m on a Mac (Sierra) running Jupyter Notebook in Firefox (87.0). I’ve got a dataframe like this:
import pandas as pd

df = pd.DataFrame({'A':[1,2,3,4,5,6,7,8,9,10], 'SubGroup':[1,5,6,5,8,6,8,6,6,5], 'Price':[7,1,0,10,2,3,0,0,10,0]})

    A  SubGroup  Price
0   1         1      7
1   2         5      1
2   3         6      0
3   4         5     10
4   5         8      2
5   6         6      3
6   7         8      0
7   8         6      0
8   9         6     10
9  10         5      0
I want to add a Boolean column to this dataframe that checks whether (a) the price in this row is zero and (b) it’s the first occurrence of a zero price for this subgroup (reading from top to bottom). If both (a) and (b) are true, it should return True, otherwise False. So it should look like this:
    A  SubGroup  Price   Test
0   1         1      7  False
1   2         5      1  False
2   3         6      0   True
3   4         5     10  False
4   5         8      2  False
5   6         6      3  False
6   7         8      0   True
7   8         6      0  False
8   9         6     10  False
9  10         5      0   True
What I’ve Tried
The first condition (Price == 0) is easy. Checking whether it’s the first occurrence for the subgroup is where I could use some help. I have an Excel background, so I started by thinking about how to solve this using a MINIFS function. The idea was to find the minimum Price for the Subgroup, looking only at the rows above the current row. If that min was greater than zero, then I’d know this was the first zero occurrence. The closest I could find (from this post) was a line like…
df['subgroupGlobalMin'] = df.groupby('SubGroup')['Price'].transform('min')
…which works but takes a global minimum across all rows for the Subgroup, not just the ones above the current row. So I tried to specify the target range for my min using iloc, like this…
df['subgroupPreviousMin'] = df.iloc[:df.index].groupby('SubGroup')['Price'].transform('min')
…but this produces the error “cannot do positional indexing on RangeIndex with these indexers [RangeIndex(start=0, stop=10, step=1)] of type RangeIndex”. I couldn’t figure out how to dynamically specify my rows/indices.
So I changed strategies and instead tried to find the index of the first occurrence of the minimum value for a subgroup using idxmin (like this post):
df['minIndex'] = df.groupby(['SubGroup'])[['Price']].idxmin()
The plan was to check this against the current row index with df.index, but I get unexpected output here:
    A  SubGroup  Price  minIndex
0   1         1      7       NaN
1   2         5      1       0.0
2   3         6      0       NaN
3   4         5     10       NaN
4   5         8      2       NaN
5   6         6      3       9.0
6   7         8      0       2.0
7   8         6      0       NaN
8   9         6     10       6.0
9  10         5      0       NaN
I know what it’s doing here, but I don’t know why or how to fix it.
Questions
- Which strategy is best for what I’m trying to achieve – using a min function, checking the index with something like idxmin, or something else?
- How should I add a column to my dataframe that checks if the price is 0 for that row and if it’s the first occurrence of a zero for that subgroup?
Answer
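Let us try your logic: flag the zero-price rows, keep a running count of them within each SubGroup, and mark the zero-price rows where that count is exactly 1: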
is_zero = df.Price.eq(0)
is_first_zero = is_zero.groupby(df['SubGroup']).cumsum().eq(1)
df['Test'] = is_zero & is_first_zero
Output:
    A  SubGroup  Price   Test
0   1         1      7  False
1   2         5      1  False
2   3         6      0   True
3   4         5     10  False
4   5         8      2  False
5   6         6      3  False
6   7         8      0   True
7   8         6      0  False
8   9         6     10  False
9  10         5      0   True
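For comparison, the two strategies from the question can probably be made to work as well. The sketch below is only an illustration, not a recommendation over the cumsum approach: the names prev_min, first_min_idx, test_prev_min and test_idxmin are made up here, and both variants assume prices are never negative, so that a minimum of 0 means a zero price was seen. On the idxmin attempt: groupby(...).idxmin() returns one row label per SubGroup, indexed by the SubGroup values (1, 5, 6, 8). Assigning that straight to a column aligns on index labels, which is why the values landed in rows 1, 5, 6 and 8 and everything else became NaN. Mapping it through the SubGroup column avoids that.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'SubGroup': [1, 5, 6, 5, 8, 6, 8, 6, 6, 5],
    'Price': [7, 1, 0, 10, 2, 3, 0, 0, 10, 0],
})

is_zero = df['Price'].eq(0)

# Strategy 1 (the MINIFS idea): minimum of the rows *above* the current row,
# computed separately for each SubGroup.
# shift() excludes the current row; cummin() is the running minimum.
# The first row of each group has nothing above it, so its value is NaN.
prev_min = df.groupby('SubGroup')['Price'].transform(lambda s: s.shift().cummin())

# Assumption: prices are never negative, so "previous min != 0" means
# "no zero seen yet". NaN != 0 is True, so a zero in the first row of a
# group still counts as a first occurrence.
test_prev_min = is_zero & prev_min.ne(0)

# Strategy 2 (the idxmin idea): idxmin() gives the row label of the first
# minimum per SubGroup, indexed by SubGroup value. map() broadcasts it back
# onto the rows instead of relying on index alignment.
first_min_idx = df['SubGroup'].map(df.groupby('SubGroup')['Price'].idxmin())

# A row qualifies if it is the row of the group's first minimum and that
# minimum is zero (same non-negative-price assumption as above).
test_idxmin = is_zero & first_min_idx.eq(df.index.to_series())

print(pd.DataFrame({'prev_min': test_prev_min, 'idxmin': test_idxmin}))
```

On the sample data both columns should match the Test column above. The cumsum version is still the shorter one and does not depend on the non-negative-price assumption.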