I don't know if it is right to say "standardize" for a categorical string variable, but basically I want to create a function that sets all observations F or f in the column below to 0 and M or m to 1:
```
> df['gender']

gender
f
F
f
M
M
m
```
I tried this:
```python
def padroniza_genero(x):
    if(x == 'f' or x == 'F'):
        replace(['f', 'F'], 0)
    else:
        replace(1)

df1['gender'] = df1['gender'].apply(padroniza_genero)
```
But I got an error:
```
NameError: name 'replace' is not defined
```
Any ideas? Thanks!
Answer
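There is no replace function defined in your code. In pandas, replace is a method of a Series or DataFrame, not a standalone function, so calling it by its bare name raises NameError.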
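If you want to keep the apply approach, a minimal sketch of a repaired function is shown here. It returns the code for each value instead of calling replace; the sample DataFrame is an assumption that mirrors the column in the question.

```python
import pandas as pd

def padroniza_genero(x):
    # apply() collects whatever the function returns for each element,
    # so return the code directly instead of calling replace
    if x in ('f', 'F'):
        return 0
    return 1

# Hypothetical sample data mirroring the column in the question
df1 = pd.DataFrame({'gender': ['f', 'F', 'f', 'M', 'M', 'm']})
df1['gender'] = df1['gender'].apply(padroniza_genero)
```

Like the original attempt, this treats every value other than f or F as 1.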
Back to your goal: use a vectorized function.
Convert to lowercase and map f to 0 and m to 1:
```python
df['gender_num'] = df['gender'].str.lower().map({'f': 0, 'm': 1})
```
Or use a comparison (not equal to f) and conversion from boolean to integer:
```python
df['gender_num'] = df['gender'].str.lower().ne('f').astype(int)
```
Output:
```
  gender  gender_num
0      f           0
1      F           0
2      f           0
3      M           1
4      M           1
5      m           1
```
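The two one-liners agree on clean data, but they treat values other than f or m differently. The sketch below illustrates this on a hypothetical Series (not from the question) that contains an unexpected label and a missing value.

```python
import pandas as pd

# Hypothetical data: one unexpected label ('x') and one missing value
s = pd.Series(['f', 'M', 'x', None])

# map: anything not in the dict becomes NaN, so bad values stay visible
mapped = s.str.lower().map({'f': 0, 'm': 1})

# ne: anything that is not 'f' (including missing values) becomes 1
compared = s.str.lower().ne('f').astype(int)
```

If the column may contain dirty data, the map version is probably the safer choice, since unknown labels end up as NaN rather than being counted as 1.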
Generalization
You can generalize to any number of categories using pandas.factorize. Note that factorize returns plain integer codes together with the unique values; if you want a real Categorical type, convert the column with astype('category').
NB. The numeric codes are assigned in order of first appearance, or in lexicographic order if sort=True:
```python
s, key = pd.factorize(df['gender'].str.lower(), sort=True)
df['gender_num'] = s

key = dict(enumerate(key))
# {0: 'f', 1: 'm'}
```
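If a categorical dtype is what you are after, one possible alternative is astype('category') with the .cat accessor. This is a sketch on assumed sample data matching the question; the categories of a column converted this way are sorted, so the codes should match factorize with sort=True.

```python
import pandas as pd

# Hypothetical sample data mirroring the column in the question
df = pd.DataFrame({'gender': ['f', 'F', 'f', 'M', 'M', 'm']})

# Lowercase first so 'F' and 'f' fall into the same category
cat = df['gender'].str.lower().astype('category')

# Integer code of each row, plus the code -> label lookup
df['gender_num'] = cat.cat.codes
key = dict(enumerate(cat.cat.categories))
```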