I’m looking for an algorithm to create a new column based on values from other columns AND respecting pre-established rules. Here’s an example:
artificial data
JavaScript
x
6
1
df = data.frame(
2
col_1 = c('No','Yes','Yes','Yes','Yes','Yes','No','No','No','Unknown'),
3
col_2 = c('Yes','Yes','Unknown','Yes','Unknown','No','Unknown','No','Unknown','Unknown'),
4
col_3 = c('Unknown','Yes','Yes','Unknown','Unknown','No','No','Unknown','Unknown','Unknown')
5
)
6
The goal is to create a new_column based on the values of col_1, col_2, and col_3. For that, the rules are:
- If the value ‘Yes’ is present in any of the columns, the value of the new_column will be ‘Yes’;
- If the value ‘Yes’ is not present in any of the columns, but the value ‘No’ is present, then the value of the new_column will be ‘No’;
- If the values ’Yes’ and ‘No’ are absent, then the value of new_columns will be ‘Unknown’.
I managed to operationalize this using case_when() describing all possible combinations; or ifelse sequential. But these solutions are not scalable to N variables.
Current solution:
JavaScript
1
12
12
1
library(dplyr)
2
df_1 <-
3
df %>%
4
mutate(
5
new_column = ifelse(
6
(col_1 == 'Yes' | col_2 == 'Yes' | col_3 == 'Yes'), 'Yes',
7
ifelse(
8
(col_1 == 'Unknown' & col_2 == 'Unknown' & col_3 == 'Unknown'), 'Unknown','No'
9
)
10
)
11
)
12
I’m looking for some algorithm capable of operationalizing this faster and capable of being expanded to N variables.
After searching for StackOverflow, I couldn’t find a way to my problem (I know there are several posts about creating a new column based on values obtained from different columns, but none). Perhaps the search strategy was not the best. If anyone finds it, please provide the link.
I used R in the code, but the current solution works in Python using np.where. Solutions in R or Python are welcome.
Advertisement
Answer
A solution using Python:
JavaScript
1
12
12
1
import pandas as pd
2
3
df = pd.DataFrame({
4
'col_1': ['No','Yes','Yes','Yes','Yes','Yes','No','No','No','Unknown'],
5
'col_2': ['Yes','Yes','Unknown','Yes','Unknown','No','Unknown','No','Unknown','Unknown'],
6
'col_3': ['Unknown','Yes','Yes','Unknown','Unknown','No','No','Unknown','Unknown','Unknown']
7
})
8
9
df['col_4'] = [('Yes' if 'Yes' in x else ('No' if 'No' in x else 'Unknown')) for x in zip(df['col_1'], df['col_2'], df['col_3'])]
10
11
print(df)
12
Output:
JavaScript
1
12
12
1
col_1 col_2 col_3 col_4
2
0 No Yes Unknown Yes
3
1 Yes Yes Yes Yes
4
2 Yes Unknown Yes Yes
5
3 Yes Yes Unknown Yes
6
4 Yes Unknown Unknown Yes
7
5 Yes No No Yes
8
6 No Unknown No No
9
7 No No Unknown No
10
8 No Unknown Unknown No
11
9 Unknown Unknown Unknown Unknown
12