I don’t know if “standardize” is the right word for a categorical string variable, but basically I want to write a function that sets all F or f observations in the column below to 0 and all M or m observations to 1:
> df['gender']
gender
f
F
f
M
M
m
I tried this:
def padroniza_genero(x):
    if(x == 'f' or x == 'F'):
        replace(['f', 'F'], 0)
    else:
        replace(1)

df1['gender'] = df1['gender'].apply(padroniza_genero)
But I got an error:
NameError: name 'replace' is not defined
Any ideas? Thanks!
Answer
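There is no replace function defined in your code: replace exists as a method (for example Series.replace or str.replace), not as a standalone function, so calling it by its bare name raises the NameError. The function also never returns a value, so apply would produce None for every row even without the error.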
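For reference, here is a minimal sketch of how the original function could be repaired, assuming it should stay an element-wise function used with apply. It returns a value instead of calling replace. The name padroniza_genero comes from the question; the sample DataFrame is rebuilt here only for illustration.

```python
import pandas as pd

# Sample data mirroring the column shown in the question.
df = pd.DataFrame({'gender': ['f', 'F', 'f', 'M', 'M', 'm']})

def padroniza_genero(x):
    # Return the code for a single value; apply() collects the results.
    if x == 'f' or x == 'F':
        return 0
    return 1

df['gender_num'] = df['gender'].apply(padroniza_genero)
print(df['gender_num'].tolist())
```

This works, but it calls the Python function once per row, which is why a column-wise approach is usually preferable.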
Back to your goal: use a vectorized operation on the whole column rather than apply.
Convert to lower and map f->0, m->1:
df['gender_num'] = df['gender'].str.lower().map({'f': 0, 'm': 1})
Or use a comparison (not equal to f) and conversion from boolean to integer:
df['gender_num'] = df['gender'].str.lower().ne('f').astype(int)
output:
  gender  gender_num
0      f           0
1      F           0
2      f           0
3      M           1
4      M           1
5      m           1
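The two approaches agree on the data above but differ on values other than f or m. A small sketch of the difference, using a made-up extra value 'x' that is not in the question's data:

```python
import pandas as pd

# 'x' is a hypothetical unexpected value, added only for illustration.
s = pd.Series(['f', 'M', 'x'])
lower = s.str.lower()

# map: values missing from the dictionary become NaN
mapped = lower.map({'f': 0, 'm': 1})

# ne: everything that is not 'f' becomes 1, including 'x'
compared = lower.ne('f').astype(int)

print(mapped.isna().tolist())
print(compared.tolist())
```

So map makes unexpected values visible as NaN (and yields a float column when that happens), while the comparison silently assigns them to 1.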
generalization
You can generalize to any number of categories using pandas.factorize. Advantage: you get the integer codes together with the unique values, which serve as a lookup key.
NB. the codes are assigned in order of first appearance, or in sorted (lexicographic) order if sort=True:
s, key = pd.factorize(df['gender'].str.lower(), sort=True)
df['gender_num'] = s
key = dict(enumerate(key))
# {0: 'f', 1: 'm'}
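The key dictionary can also be used to turn the codes back into labels. A minimal sketch, assuming the same sample data as in the question; the column name gender_label is made up for illustration:

```python
import pandas as pd

# Sample data mirroring the column shown in the question.
df = pd.DataFrame({'gender': ['f', 'F', 'f', 'M', 'M', 'm']})

# Integer codes plus the unique (lowercased) values, in sorted order.
codes, uniques = pd.factorize(df['gender'].str.lower(), sort=True)
df['gender_num'] = codes
key = dict(enumerate(uniques))

# Map the integer codes back to their labels.
df['gender_label'] = df['gender_num'].map(key)
print(df)
```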