I have a pandas DataFrame that I want to summarize by group, using a custom function that resolves to a boolean value.
Consider the following data. df describes 4 people and, for each person, the fruits they like.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["danny", "danny", "danny", "monica", "monica", "monica", "fred", "fred", "sam", "sam"],
    "fruit": ["apricot", "apple", "orange", "apricot", "banana", "watermelon", "apple", "apricot", "apricot", "peach"]
})
print(df)
##      name       fruit
## 0   danny     apricot
## 1   danny       apple
## 2   danny      orange
## 3  monica     apricot
## 4  monica      banana
## 5  monica  watermelon
## 6    fred       apple
## 7    fred     apricot
## 8     sam     apricot
## 9     sam       peach
I want to summarize this table to find the people who like both apricot and apple. In other words, my desired output is the following table:
# desired output
##      name  fruit
## 0   danny   True
## 1  monica  False
## 2    fred   True
## 3     sam  False
My attempt
I first defined a function that checks whether every string in one list exists in a target list:
def is_needle_in_haystack(needle, haystack):
    return all(x in haystack for x in needle)
Examples showing that is_needle_in_haystack() works:
is_needle_in_haystack(["zebra", "lion"], ["whale", "lion", "dog"])  # False
is_needle_in_haystack(["rabbit", "cat"], ["hamster", "cat", "monkey", "rabbit"])  # True
Now I used is_needle_in_haystack() while grouping df by name:
target_fruits = ["apricot", "apple"]
df.groupby(df["name"]).agg({"fruit": lambda x: is_needle_in_haystack(target_fruits, x)})
Then why do I get the following output, which is clearly not as expected?
##         fruit
## name
## danny   False
## fred    False
## monica  False
## sam     False
What have I done wrong in my code?
Answer
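The problem is that haystack is a Series when the function is called from .agg, so the in check does not look at its values. Change the function to convert it to a set first: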
def is_needle_in_haystack(needle, haystack):
    return all(x in set(haystack) for x in needle)

target_fruits = ["apricot", "apple"]
res = df.groupby(df["name"]).agg({"fruit": lambda x: is_needle_in_haystack(target_fruits, x)})
print(res)
Output
        fruit
name
danny    True
fred     True
monica  False
sam     False
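The result above is indexed by name and sorted alphabetically, whereas the desired output in the question keeps the names in their original order with a default integer index. Here is a minimal sketch of one way to get that shape, assuming the same df as in the question: pass sort=False to groupby and call reset_index() on the result. It also uses set.issubset, which accepts any iterable and therefore iterates over the values of the Series. This is an alternative to the set(haystack) fix, not a replacement for it.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["danny", "danny", "danny", "monica", "monica", "monica",
             "fred", "fred", "sam", "sam"],
    "fruit": ["apricot", "apple", "orange", "apricot", "banana", "watermelon",
              "apple", "apricot", "apricot", "peach"],
})

target_fruits = {"apricot", "apple"}

res = (
    df.groupby("name", sort=False)["fruit"]        # sort=False keeps first-appearance order
      .agg(lambda x: target_fruits.issubset(x))    # iterating a Series yields its values
      .reset_index()                               # turn the name index back into a column
)
print(res)
```

With the data from the question, this should list danny, monica, fred, sam in that order, with a boolean fruit column.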
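The in operator on a Series tests membership in the index labels, not in the values, so it returns False here. For example:
"hamster" in pd.Series(["hamster", "cat", "monkey", "rabbit"])  # False: "hamster" is a value, not an index label
0 in pd.Series(["hamster", "cat", "monkey", "rabbit"])  # True: 0 is an index label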