Let’s say this the dataframe:
First Name Last Name Val1 George Clooney N George Clooney N George Clooney Y George Freeman N George Freeman N Claire Stark N Claire Stark Y
Then the goal is to produce this:
First Name Last Name Val1 Total George Clooney Y 3 George Freeman N 2 Claire Stark Y 2
The total Val1 is Y as long as one of the instances is Y.
My code looks like this:
grouped = df.groupby(by=['First Name', 'Last Name'])
def val_func(x):
if (x['Val1'] == 'Y').any():
return 'Y'
else:
return 'N'
cumulative = grouped.apply(val_func)
This works except that cumulative has dtype object and I can only access Val1, that is, I cannot access First Name or Last Name (Although when I run print(cumulative), it does print everything).
If I try:
df_cumulative = pd.DataFrame(cumulative)
then, I just get the column with Y or N, but not the names.
How to fix this? Moreover, can I return two arguments? one for Val1 and one for Total? or would I have to run another apply for Total and append the column to the dataframe?
Advertisement
Answer
Another way is to use groupby.agg where you use max to get “Y” if it exists (because Y>N) and count:
out = df.groupby(['First Name', 'Last Name'], sort=False, as_index=False)
.agg(Val1=('Val1', 'max'), Total=('Val1', 'count'))
Output:
First Name Last Name Val1 Total 0 George Clooney Y 3 1 George Freeman N 2 2 Claire Stark Y 2
You can pass in a lambda that selects based whatever criteria you want. For example, the following aggregates “Val1” based on whether the number of “Y”s are greater than the number of “N”s (if there are more “Y”s select “Y” else “N”):
out = df.groupby(['First Name', 'Last Name'], sort=False, as_index=False)
.agg(Val1=('Val1', lambda x: 'Y' if x.eq('Y').sum() > x.eq('N').sum() else 'N'),
Total=('Val1', 'count'))