Skip to content
Advertisement

Obtaining n number highest values in dataframe also containing strings

I have a dataframe and I would like to find the n highest numbers in each column. There are a variety of methods to do this, but all seem to fail as a result of strings also being in the dataframe. I have tried a multitude of ways to get around this but I am always stumped by the presence of strings.

As some cells contain % a blanket omission of all string type columns wouldn’t work. However, ignoring cells containing A-Z would work.

Example dataframe:

import pandas as pd
test_data = {
            'Animal': ['Otter', 'Turtle', 'Chicken'],
            'Squeak Appeal': [12.8, 1.92, 11.4],
            'Richochet Chance': ['8%', '30%', '16%'],
            }            
test_df = pd.DataFrame(
                        test_data,
                        columns=['Animal', 'Squeak Appeal','Richochet Chance']
                    )

i). Attempt using apply:

test_df.apply(
            lambda x: pd.Series
            (x.str.strip('%').astype(float).nlargest(2).index)
            )

AttributeError: ('Can only use .str accessor with string values!', 'occurred at index Squeak Appeal')

ii). a). attempt using a for-loop:

headers = list(test_df.columns.values)
for header in headers:
    if not ['a-z'] in test_df[header]:
        max_value = (
                  test_df[header]
                  .str.strip('%')          # remove the ending %
                  .astype(float)           # convert to float
                  .nlargest(10).index       # nlargest and index
                  )

TypeError: unhashable type: 'list'

ii). b). I also tried excluding ‘e’ as an experiment to get past the if-statement:

#...
    if not 'e' in test_df[header]:
#...

AttributeError: Can only use .str accessor with string values!

iii). I attempted using numpy as I had see utilised it elsewhere but don’t really grasp the idea:

import numpy as np
N = 3
a = np.argsort(-test_df.values, axis=0)[-1:-1-N:-1]
b = pd.DataFrame(df.index[a], columns=df.columns)
print (b)

TypeError: bad operand type for unary -: 'str'

I could go on but I feel like it would be a waste of text space. Could anyone point me in the right direction?

Example Outcome:

print(richochet_chance_max)

    Animal  Squeak Appeal Richochet Chance
1   Turtle           1.92              30%
2  Chicken          11.40              16%


print(squeak_appeal_max)

    Animal  Squeak Appeal Richochet Chance
1    Otter          12.8               8%
2  Chicken          11.4              16%

Advertisement

Answer

You can convert the string column to float, then convert it back as str after obtaining the n largest values:

# Convert the string column to float
test_df['Richochet Chance'] = test_df['Richochet Chance'].str.strip('%').astype(float)
# Get nlargest as you want
test_df = test_df.nlargest(2, columns=['Squeak Appeal', 'Richochet Chance'])
# Convert the string column back to string
test_df['Richochet Chance'] = test_df['Richochet Chance'].map(lambda x: f'{x:.0f}%')

Output for nlargest = 2:

    Animal  Squeak Appeal Richochet Chance
0    Otter           12.8               8%
2  Chicken           11.4              16%
Advertisement