Skip to content
Advertisement

Selecting rows based on condition in python pandas

I have a data-frame with columns as ['ID','Title','Category','Company','Field'] and it has both blank values and at some places missing values are put as N/A. I have to pick the row which has maximum information available.

For example one case could be.

  'ID'           'Title'   'Category'      'Company'     'Field'
0 ABD12567       Title1                    Company1
1 ABD12567       Title1     N/A                          Field1
2 ABD12567       Title1     Category1      Company1      Field1
3 ABD12567       Title1                    Company1
4 ABD12567       Title1     N/A            Company1      Field1

In this case i want to select the row number 2 as it has maximum information available. I tried to add one length columns to the data-frame where i will add all the string lengths and will pick the one with maximum length with help of group-by. For Example catalog_11=catalog_1.groupby(['ID'], as_index = False).agg({'combined_len': max}) . And I will keep the index of that row in this case 2 and will filter out the row with index 2 from the old data-frame. But this is not working as i am loosing the index of the old data-frame after group-by. Not sure how to solve this. Please help me to pick the right row. Not Sure how to solve this with some other workaround.

If there is any other alternative than the combined length which i am using then please suggest.

Advertisement

Answer

You can use (~df.isin(["", "N/A"])).sum(1) to get the number of valid values in a row by summing the boolean values on axis=1. Combine it with groupby and idxmax:

Example data (added an extra ID to showcase the groupby):

  ID             Title      Category       Company       Field
0 ABD12567       Title1                    Company1   
1 ABD12567       Title1     N/A                          Field1
2 ABD12567       Title1     Category1      Company1      Field1
3 ABD12567       Title1                    Company1   
4 ABD12567       Title1     N/A            Company1      Field1
5 ABD12568       Title1     N/A            Company1      Field1

Code:

idx = (df.assign(max=(~df.isin(["", "N/A"])).sum(1)) # assign temp column
         .groupby("ID")["max"].idxmax()) #retrieve index of max value within each group

print (df.loc[idx])

         ID   Title   Category   Company   Field
2  ABD12567  Title1  Category1  Company1  Field1
5  ABD12568  Title1        N/A  Company1  Field1
User contributions licensed under: CC BY-SA
5 People found this is helpful
Advertisement