I don’t know if “standardize” is the right word for a categorical string variable, but basically I want a function that sets every F or f in the column below to 0 and every M or m to 1:
> df['gender']
gender
f
F
f
M
M
m
I tried this:
def padroniza_genero(x):
    if(x == 'f' or x == 'F'):
        replace(['f', 'F'], 0)
    else:
        replace(1)

df1['gender'] = df1['gender'].apply(padroniza_genero)
But I got an error:
NameError: name 'replace' is not defined
Any ideas? Thanks!
Answer
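There is no replace function defined in your code. In pandas, replace is a method of a Series or DataFrame (for example df['gender'].replace(...)), not a standalone function, so calling it by its bare name raises the NameError.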
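If you want to keep the apply approach, the function has to return a value for each element instead of calling replace. A minimal sketch, reusing the padroniza_genero name from the question; the sample DataFrame and the choice to return None for anything other than f/m are assumptions for illustration:

```python
import pandas as pd

# Sample data matching the column shown in the question
df = pd.DataFrame({'gender': ['f', 'F', 'f', 'M', 'M', 'm']})

def padroniza_genero(x):
    # apply() calls this once per value and collects the return values
    if x in ('f', 'F'):
        return 0
    if x in ('m', 'M'):
        return 1
    return None  # assumption: any other value is treated as missing

df['gender_num'] = df['gender'].apply(padroniza_genero)
print(df)
```

This works, but it runs a Python function once per row, so it is usually slower than the built-in string and mapping methods on large columns.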
Back to your goal: use a vectorized method rather than apply.
Convert to lowercase and map f->0, m->1:
df['gender_num'] = df['gender'].str.lower().map({'f': 0, 'm': 1})
Or use a comparison (not equal to f) and conversion from boolean to integer:
df['gender_num'] = df['gender'].str.lower().ne('f').astype(int)
output:
  gender  gender_num
0      f           0
1      F           0
2      f           0
3      M           1
4      M           1
5      m           1
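The two one-liners give the same result on clean data but behave differently when the column contains something other than f or m. A small sketch of that difference; the extra 'x' label and the missing value are hypothetical inputs added for illustration:

```python
import pandas as pd

# Hypothetical data with an unexpected label and a missing value
s = pd.Series(['f', 'M', 'x', None])
lower = s.str.lower()

# map: values without an entry in the dict become NaN
mapped = lower.map({'f': 0, 'm': 1})

# ne: everything that is not 'f' becomes 1, including 'x' and missing values
compared = lower.ne('f').astype(int)

print(pd.DataFrame({'gender': s, 'map': mapped, 'ne': compared}))
```

So map is the safer choice if unexpected values should stay visible as NaN, while the comparison silently folds them into the 1 group.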
Generalization
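You can generalize to any number of categories using pandas.factorize. Advantage: it returns the integer codes together with the unique labels, so the codes can be mapped back to the original values. (If you want a real Categorical type, astype('category') is the usual route.)
NB. The numeric values are assigned in order of first appearance, or in lexicographic order if sort=True: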
s, key = pd.factorize(df['gender'].str.lower(), sort=True)
df['gender_num'] = s
key = dict(enumerate(key))
# {0: 'f', 1: 'm'}
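The key dictionary can be used to turn the codes back into labels. A short sketch under the same sample data as the question; the gender_back and gender_cat_code column names are made up for illustration, and the astype('category') part is an alternative I am adding, not part of the factorize approach itself:

```python
import pandas as pd

df = pd.DataFrame({'gender': ['f', 'F', 'f', 'M', 'M', 'm']})

# Integer codes plus the unique labels, sorted lexicographically
codes, uniques = pd.factorize(df['gender'].str.lower(), sort=True)
df['gender_num'] = codes
key = dict(enumerate(uniques))

# Map the codes back to their labels using the key
df['gender_back'] = df['gender_num'].map(key)

# Alternative: a categorical column, with its codes available via .cat.codes
cat = df['gender'].str.lower().astype('category')
df['gender_cat_code'] = cat.cat.codes

print(df)
```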