
Creating a function to standardize categorical variables (python)

I don’t know if “standardize” is the right word for a categorical string variable, but basically I want to create a function that sets all observations F or f in the column below to 0, and M or m to 1:

> df['gender']

gender
  f
  F
  f
  M
  M
  m

I tried this:

def padroniza_genero(x):
    if(x == 'f' or x == 'F'):
        replace(['f', 'F'], 0)
    else:
        replace(1)
        
df1['gender'] = df1['gender'].apply(padroniza_genero)

But I got an error:

NameError: name 'replace' is not defined

Any ideas? Thanks!


Answer

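There is no replace function defined in your code: replace exists as a method of strings and of pandas objects, not as a standalone name, so calling it on its own raises the NameError. The function also never returns anything, so even without the error apply would fill the column with None.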
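If you want to keep the apply approach, a minimal sketch of a working version could look like the following. The sample DataFrame is rebuilt from the values shown in the question, and the gender_num column name is only an illustration; the key change is that the function returns the new value instead of calling replace.

```python
import pandas as pd

def padroniza_genero(x):
    # apply() collects whatever the function returns for each element
    if x in ('f', 'F'):
        return 0
    # anything that is not f/F is treated as M/m here (assumption from the question)
    return 1

# sample data reconstructed from the question
df = pd.DataFrame({'gender': ['f', 'F', 'f', 'M', 'M', 'm']})
df['gender_num'] = df['gender'].apply(padroniza_genero)
```

This works, but it calls a Python function once per row, which is usually slower than a vectorized operation on the whole column.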

Back to your goal: use a vectorized operation on the whole column rather than a per-row function.

Convert to lowercase and map f -> 0, m -> 1:

df['gender_num'] = df['gender'].str.lower().map({'f': 0, 'm': 1})

Or use a comparison (not equal to f) and convert the boolean result to an integer; note that this sends every value other than f to 1:

df['gender_num'] = df['gender'].str.lower().ne('f').astype(int)

output:

  gender  gender_num
0      f           0
1      F           0
2      f           0
3      M           1
4      M           1
5      m           1
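The two one-liners give the same result on this data, but they may differ when the column contains something other than f or m. The sketch below uses a made-up value 'x' (not in the original data) to illustrate the difference: map leaves unmapped values missing, while the comparison turns them into 1.

```python
import pandas as pd

# 'x' is a hypothetical unexpected value, added only for illustration
s = pd.Series(['f', 'M', 'x'])
lower = s.str.lower()

# map: values without an entry in the dict become NaN
mapped = lower.map({'f': 0, 'm': 1})

# comparison: everything that is not 'f' becomes 1
compared = lower.ne('f').astype(int)
```

If unexpected values should be noticed rather than silently counted as m, the map variant is probably the safer choice.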

Generalization

You can generalize to any number of categories using pandas.factorize. Advantage: it returns the integer codes together with the unique values they stand for, so you can always map the codes back to the labels.

NB. The codes are assigned in order of first appearance, or in sorted (lexicographic) order if sort=True:

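# s holds the integer codes, key the unique values in code order
s, key = pd.factorize(df['gender'].str.lower(), sort=True)
df['gender_num'] = s

# lookup table from code back to label
key = dict(enumerate(key))
# {0: 'f', 1: 'm'}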
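
If you want the column to have an actual categorical dtype, one possible variant is astype('category') followed by the .cat accessor. This is a sketch on the sample data from the question, not part of the original answer; the variable name cat is only illustrative.

```python
import pandas as pd

# sample data reconstructed from the question
df = pd.DataFrame({'gender': ['f', 'F', 'f', 'M', 'M', 'm']})

# lowercase first so that 'f' and 'F' end up in the same category
cat = df['gender'].str.lower().astype('category')

# integer code of each row; categories are sorted by default
df['gender_num'] = cat.cat.codes

# lookup table from code back to label
key = dict(enumerate(cat.cat.categories))
```

Here the codes follow the sorted order of the categories, which matches the sort=True behaviour of factorize on this data.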
User contributions licensed under: CC BY-SA