Skip to content
Advertisement

Recode multiple values in several columns in Python [similar to R]

I am trying to translate my R script to python. I have a survey data with several date of birth and education level columns for each family member(from family member 1 to member 10): here a sample:

id_name  dob_1   dob_2     dob_3    education_1    education_2    education_3
12       1958     2001      2005         1            5              1
13       1990     1999      1932         2            1              3
14       1974     1965      1965         3            3              3
15       1963     1963      1990         4            3              1
16       2020     1995      1988         1            1              2

I had a function in R in order to check the logic and re code wrong education level in all columns.Like this

# R function

edu_recode <- function(dob, edu){
  case_when(
    dob >= 2003 & (edu == 1 | edu == 2  | edu == 3 | edu == 4) ~ 8,
    dob > 2000 &  (edu == 1 | edu == 2  | edu == 3 | edu == 4) ~ 1,
    dob >= 1996 &  (edu == 3 | edu == 4) ~ 2,
    dob >  1995 &  edu == 4 ~ 3,
    (dob >= 2001 & dob <= 2002) & edu == 8 ~ 1,
    TRUE ~ as.numeric(edu)
  )
} 

and apply it for all columns like this:

library(tidyverse)
 df %>% 
  mutate(education_1 = edu_recode(dob_1,education_1),
         education_2 = edu_recode(dob_2,education_2),
         education_3 = edu_recode(dob_3,education_3),
         education_4 = edu_recode(dob_4,education_4),
         education_5 = edu_recode(dob_5,education_5),
         education_6 = edu_recode(dob_6,education_6),
         education_7 = edu_recode(dob_7,education_7),
         education_8 = edu_recode(dob_8,education_8),
         education_9 = edu_recode(dob_9,education_9),
         education_10 = edu_recode(dob_10,education_10)
  )

is there a way to do similar process in Python instead of manually recoding each column?

Advertisement

Answer

You can write a function that combines pipe with np.select, as well as a dictionary (to abstract as much manual processing as possible):

def edu_recode(df, dob, edu):
    df = df.copy()
    cond1 = (df[dob] >= 2003) & (df[edu].isin([1, 4]))
    cond2 = (df[dob] > 2000) & (df[edu].isin([1, 4]))
    cond3 = (df[dob] > 1996) & (df[edu].isin([3, 4]))
    cond4 = (df[dob] > 1995) & (df[edu] == 4)
    cond5 = (df[dob].isin([2001, 2002])) & (df[edu] == 8)
    condlist = [cond1, cond2, cond3, cond4, cond5]
    choicelist = [8, 1, 2, 3, 1]
    return np.select(condlist, choicelist, pd.to_numeric(df[edu]))

# sticking to the sample data, you can extend this 
mapping = {f"education_{num}": df.pipe(edu_recode, f"dob_{num}", 
                                                   f"education_{num}")
           for num in range(1, 4)}

df.assign(**mapping)

        

     id_name    dob_1   dob_2   dob_3   education_1 education_2 education_3
    0   12      1958    2001    2005       1           5          8
    1   13      1990    1999    1932       2           1          3
    2   14      1974    1965    1965       3           3          3
    3   15      1963    1963    1990       4           3          1    
    4   16      2020    1995    1988       8           1          2
User contributions licensed under: CC BY-SA
3 People found this is helpful
Advertisement