I have a data-frame with columns as ['ID','Title','Category','Company','Field']
and it has both blank values and at some places missing values are put as N/A. I have to pick the row which has maximum information available.
For example one case could be.
'ID' 'Title' 'Category' 'Company' 'Field' 0 ABD12567 Title1 Company1 1 ABD12567 Title1 N/A Field1 2 ABD12567 Title1 Category1 Company1 Field1 3 ABD12567 Title1 Company1 4 ABD12567 Title1 N/A Company1 Field1
In this case i want to select the row number 2 as it has maximum information available. I tried to add one length columns to the data-frame where i will add all the string lengths and will pick the one with maximum length with help of group-by. For Example catalog_11=catalog_1.groupby(['ID'], as_index = False).agg({'combined_len': max})
. And I will keep the index of that row in this case 2 and will filter out the row with index 2 from the old data-frame. But this is not working as i am loosing the index of the old data-frame after group-by. Not sure how to solve this. Please help me to pick the right row. Not Sure how to solve this with some other workaround.
If there is any other alternative than the combined length which i am using then please suggest.
Advertisement
Answer
You can use (~df.isin(["", "N/A"])).sum(1)
to get the number of valid values in a row by summing the boolean values on axis=1
. Combine it with groupby
and idxmax
:
Example data (added an extra ID to showcase the groupby
):
ID Title Category Company Field 0 ABD12567 Title1 Company1 1 ABD12567 Title1 N/A Field1 2 ABD12567 Title1 Category1 Company1 Field1 3 ABD12567 Title1 Company1 4 ABD12567 Title1 N/A Company1 Field1 5 ABD12568 Title1 N/A Company1 Field1
Code:
idx = (df.assign(max=(~df.isin(["", "N/A"])).sum(1)) # assign temp column .groupby("ID")["max"].idxmax()) #retrieve index of max value within each group print (df.loc[idx]) ID Title Category Company Field 2 ABD12567 Title1 Category1 Company1 Field1 5 ABD12568 Title1 N/A Company1 Field1