I am trying to translate my R script to python. I have a survey data with several date of birth and education level columns for each family member(from family member 1 to member 10): here a sample:
id_name dob_1 dob_2 dob_3 education_1 education_2 education_3 12 1958 2001 2005 1 5 1 13 1990 1999 1932 2 1 3 14 1974 1965 1965 3 3 3 15 1963 1963 1990 4 3 1 16 2020 1995 1988 1 1 2
I had a function in R in order to check the logic and re code wrong education level in all columns.Like this
# R function
edu_recode <- function(dob, edu){
case_when(
dob >= 2003 & (edu == 1 | edu == 2 | edu == 3 | edu == 4) ~ 8,
dob > 2000 & (edu == 1 | edu == 2 | edu == 3 | edu == 4) ~ 1,
dob >= 1996 & (edu == 3 | edu == 4) ~ 2,
dob > 1995 & edu == 4 ~ 3,
(dob >= 2001 & dob <= 2002) & edu == 8 ~ 1,
TRUE ~ as.numeric(edu)
)
}
and apply it for all columns like this:
library(tidyverse)
df %>%
mutate(education_1 = edu_recode(dob_1,education_1),
education_2 = edu_recode(dob_2,education_2),
education_3 = edu_recode(dob_3,education_3),
education_4 = edu_recode(dob_4,education_4),
education_5 = edu_recode(dob_5,education_5),
education_6 = edu_recode(dob_6,education_6),
education_7 = edu_recode(dob_7,education_7),
education_8 = edu_recode(dob_8,education_8),
education_9 = edu_recode(dob_9,education_9),
education_10 = edu_recode(dob_10,education_10)
)
is there a way to do similar process in Python instead of manually recoding each column?
Advertisement
Answer
You can write a function that combines pipe with np.select, as well as a dictionary (to abstract as much manual processing as possible):
def edu_recode(df, dob, edu):
df = df.copy()
cond1 = (df[dob] >= 2003) & (df[edu].isin([1, 4]))
cond2 = (df[dob] > 2000) & (df[edu].isin([1, 4]))
cond3 = (df[dob] > 1996) & (df[edu].isin([3, 4]))
cond4 = (df[dob] > 1995) & (df[edu] == 4)
cond5 = (df[dob].isin([2001, 2002])) & (df[edu] == 8)
condlist = [cond1, cond2, cond3, cond4, cond5]
choicelist = [8, 1, 2, 3, 1]
return np.select(condlist, choicelist, pd.to_numeric(df[edu]))
# sticking to the sample data, you can extend this
mapping = {f"education_{num}": df.pipe(edu_recode, f"dob_{num}",
f"education_{num}")
for num in range(1, 4)}
df.assign(**mapping)
id_name dob_1 dob_2 dob_3 education_1 education_2 education_3
0 12 1958 2001 2005 1 5 8
1 13 1990 1999 1932 2 1 3
2 14 1974 1965 1965 3 3 3
3 15 1963 1963 1990 4 3 1
4 16 2020 1995 1988 8 1 2