I am trying to translate my R script to python. I have a survey data with several date of birth and education level columns for each family member(from family member 1 to member 10): here a sample:
id_name dob_1 dob_2 dob_3 education_1 education_2 education_3 12 1958 2001 2005 1 5 1 13 1990 1999 1932 2 1 3 14 1974 1965 1965 3 3 3 15 1963 1963 1990 4 3 1 16 2020 1995 1988 1 1 2
I had a function in R in order to check the logic and re code wrong education level in all columns.Like this
# R function edu_recode <- function(dob, edu){ case_when( dob >= 2003 & (edu == 1 | edu == 2 | edu == 3 | edu == 4) ~ 8, dob > 2000 & (edu == 1 | edu == 2 | edu == 3 | edu == 4) ~ 1, dob >= 1996 & (edu == 3 | edu == 4) ~ 2, dob > 1995 & edu == 4 ~ 3, (dob >= 2001 & dob <= 2002) & edu == 8 ~ 1, TRUE ~ as.numeric(edu) ) }
and apply it for all columns like this:
library(tidyverse) df %>% mutate(education_1 = edu_recode(dob_1,education_1), education_2 = edu_recode(dob_2,education_2), education_3 = edu_recode(dob_3,education_3), education_4 = edu_recode(dob_4,education_4), education_5 = edu_recode(dob_5,education_5), education_6 = edu_recode(dob_6,education_6), education_7 = edu_recode(dob_7,education_7), education_8 = edu_recode(dob_8,education_8), education_9 = edu_recode(dob_9,education_9), education_10 = edu_recode(dob_10,education_10) )
is there a way to do similar process in Python instead of manually recoding each column?
Advertisement
Answer
You can write a function that combines pipe with np.select, as well as a dictionary (to abstract as much manual processing as possible):
def edu_recode(df, dob, edu): df = df.copy() cond1 = (df[dob] >= 2003) & (df[edu].isin([1, 4])) cond2 = (df[dob] > 2000) & (df[edu].isin([1, 4])) cond3 = (df[dob] > 1996) & (df[edu].isin([3, 4])) cond4 = (df[dob] > 1995) & (df[edu] == 4) cond5 = (df[dob].isin([2001, 2002])) & (df[edu] == 8) condlist = [cond1, cond2, cond3, cond4, cond5] choicelist = [8, 1, 2, 3, 1] return np.select(condlist, choicelist, pd.to_numeric(df[edu])) # sticking to the sample data, you can extend this mapping = {f"education_{num}": df.pipe(edu_recode, f"dob_{num}", f"education_{num}") for num in range(1, 4)} df.assign(**mapping) id_name dob_1 dob_2 dob_3 education_1 education_2 education_3 0 12 1958 2001 2005 1 5 8 1 13 1990 1999 1932 2 1 3 2 14 1974 1965 1965 3 3 3 3 15 1963 1963 1990 4 3 1 4 16 2020 1995 1988 8 1 2