I have a dataframe and I would like to find the n highest numbers in each column. There are a variety of methods to do this, but all seem to fail as a result of strings also being in the dataframe. I have tried a multitude of ways to get around this but I am always stumped by the presence of strings.
As some cells contain %
a blanket omission of all string type columns wouldn’t work. However, ignoring cells containing A-Z would work.
Example dataframe:
import pandas as pd
test_data = {
'Animal': ['Otter', 'Turtle', 'Chicken'],
'Squeak Appeal': [12.8, 1.92, 11.4],
'Richochet Chance': ['8%', '30%', '16%'],
}
test_df = pd.DataFrame(
test_data,
columns=['Animal', 'Squeak Appeal','Richochet Chance']
)
i). Attempt using apply
:
test_df.apply(
lambda x: pd.Series
(x.str.strip('%').astype(float).nlargest(2).index)
)
AttributeError: ('Can only use .str accessor with string values!', 'occurred at index Squeak Appeal')
ii). a). attempt using a for-loop
:
headers = list(test_df.columns.values)
for header in headers:
if not ['a-z'] in test_df[header]:
max_value = (
test_df[header]
.str.strip('%') # remove the ending %
.astype(float) # convert to float
.nlargest(10).index # nlargest and index
)
TypeError: unhashable type: 'list'
ii). b). I also tried excluding ‘e’ as an experiment to get past the if-statement
:
#...
if not 'e' in test_df[header]:
#...
AttributeError: Can only use .str accessor with string values!
iii). I attempted using numpy
as I had see utilised it elsewhere but don’t really grasp the idea:
import numpy as np
N = 3
a = np.argsort(-test_df.values, axis=0)[-1:-1-N:-1]
b = pd.DataFrame(df.index[a], columns=df.columns)
print (b)
TypeError: bad operand type for unary -: 'str'
I could go on but I feel like it would be a waste of text space. Could anyone point me in the right direction?
Example Outcome:
print(richochet_chance_max)
Animal Squeak Appeal Richochet Chance
1 Turtle 1.92 30%
2 Chicken 11.40 16%
print(squeak_appeal_max)
Animal Squeak Appeal Richochet Chance
1 Otter 12.8 8%
2 Chicken 11.4 16%
Advertisement
Answer
You can convert the string column to float
, then convert it back as str
after obtaining the n largest values:
# Convert the string column to float
test_df['Richochet Chance'] = test_df['Richochet Chance'].str.strip('%').astype(float)
# Get nlargest as you want
test_df = test_df.nlargest(2, columns=['Squeak Appeal', 'Richochet Chance'])
# Convert the string column back to string
test_df['Richochet Chance'] = test_df['Richochet Chance'].map(lambda x: f'{x:.0f}%')
Output for nlargest = 2
:
Animal Squeak Appeal Richochet Chance
0 Otter 12.8 8%
2 Chicken 11.4 16%