I would like to create a function to go through each unique value in a column and check if another column contains a value and then create a column that shows the result.
For example: for each unique ID in df, check if the stats is A then create a result column:
df:
ID | Status |
---|---|
1 | A |
1 | B |
2 | B |
2 | C |
new_df:
ID | Status | Result A? |
---|---|---|
1 | A | YES |
2 | B | NO |
I started off with
def function(df): id = list(df['ID'].unique()) status = ['A'] for i in id:
If anyone can share some thoughts I would much appreciate.
Advertisement
Answer
I am not sure if I understand the rules correctly. Should I always take the first occurrence of the ID? Then the second row in your expected output is wrong.
You can use numpy.where
df = pd.DataFrame({'ID': {0: 1, 1: 1, 2: 2, 3: 2}, 'Status': {0: 'A', 1: 'B', 2: 'B', 3: 'C'}}) new_df = df.drop_duplicates(subset=["ID"]).copy() new_df["Result A?"] = np.where(new_df.Status == "A", "YES", "NO")
to get this:
ID Status Result A? 0 1 A YES 2 2 B NO
Edit: Your desired output is ambiguous, two things you can do:
df.groupby("ID")["Status"].apply({"A"}.issubset).replace({True: 'Yes', False: 'No'}).rename("Result A?")
gives you:
ID 1 Yes 2 No Name: Result A?, dtype: object
Or:
df["Result A?"] = np.where(df.groupby("ID")["Status"].apply({"A"}.issubset).loc[df.ID], "YES", "NO")
which gives you:
ID Status Result A? 0 1 A YES 1 1 B YES 2 2 B NO 3 2 C NO