Let’s say this the dataframe:
First Name Last Name Val1 George Clooney N George Clooney N George Clooney Y George Freeman N George Freeman N Claire Stark N Claire Stark Y
Then the goal is to produce this:
First Name Last Name Val1 Total George Clooney Y 3 George Freeman N 2 Claire Stark Y 2
The total Val1
is Y
as long as one of the instances is Y
.
My code looks like this:
grouped = df.groupby(by=['First Name', 'Last Name']) def val_func(x): if (x['Val1'] == 'Y').any(): return 'Y' else: return 'N' cumulative = grouped.apply(val_func)
This works except that cumulative
has dtype object
and I can only access Val1
, that is, I cannot access First Name
or Last Name
(Although when I run print(cumulative)
, it does print everything).
If I try:
df_cumulative = pd.DataFrame(cumulative)
then, I just get the column with Y
or N
, but not the names.
How to fix this? Moreover, can I return two arguments? one for Val1
and one for Total
? or would I have to run another apply for Total
and append the column to the dataframe?
Advertisement
Answer
Another way is to use groupby.agg
where you use max
to get “Y” if it exists (because Y>N
) and count
:
out = df.groupby(['First Name', 'Last Name'], sort=False, as_index=False) .agg(Val1=('Val1', 'max'), Total=('Val1', 'count'))
Output:
First Name Last Name Val1 Total 0 George Clooney Y 3 1 George Freeman N 2 2 Claire Stark Y 2
You can pass in a lambda that selects based whatever criteria you want. For example, the following aggregates “Val1” based on whether the number of “Y”s are greater than the number of “N”s (if there are more “Y”s select “Y” else “N”):
out = df.groupby(['First Name', 'Last Name'], sort=False, as_index=False) .agg(Val1=('Val1', lambda x: 'Y' if x.eq('Y').sum() > x.eq('N').sum() else 'N'), Total=('Val1', 'count'))