I would like to create a function to go through each unique value in a column and check if another column contains a value and then create a column that shows the result.
For example: for each unique ID in df, check if the stats is A then create a result column:
df:
| ID | Status | 
|---|---|
| 1 | A | 
| 1 | B | 
| 2 | B | 
| 2 | C | 
new_df:
| ID | Status | Result A? | 
|---|---|---|
| 1 | A | YES | 
| 2 | B | NO | 
I started off with
def function(df): id = list(df['ID'].unique()) status = ['A'] for i in id:
If anyone can share some thoughts I would much appreciate.
Advertisement
Answer
I am not sure if I understand the rules correctly. Should I always take the first occurrence of the ID? Then the second row in your expected output is wrong.
You can use numpy.where
df = pd.DataFrame({'ID': {0: 1, 1: 1, 2: 2, 3: 2}, 'Status': {0: 'A', 1: 'B', 2: 'B', 3: 'C'}})
new_df = df.drop_duplicates(subset=["ID"]).copy()
new_df["Result A?"] = np.where(new_df.Status == "A", "YES", "NO")
to get this:
ID Status Result A? 0 1 A YES 2 2 B NO
Edit: Your desired output is ambiguous, two things you can do:
df.groupby("ID")["Status"].apply({"A"}.issubset).replace({True: 'Yes', False: 'No'}).rename("Result A?")
gives you:
ID 1 Yes 2 No Name: Result A?, dtype: object
Or:
df["Result A?"] = np.where(df.groupby("ID")["Status"].apply({"A"}.issubset).loc[df.ID], "YES", "NO")
which gives you:
ID Status Result A? 0 1 A YES 1 1 B YES 2 2 B NO 3 2 C NO