I want to create a third column in my pandas dataframe based on columns 1 and 2. When both are filled in they match, but sometimes one of them is blank. If I build the third column from id1 alone, it ends up blank wherever id1 is blank. I want it to take whichever of the two isn't blank and use that to create the college name.
Original:
    id1   id2
0  ID01  ID01
1        ID03
2  ID07
3  ID08  ID08
Desired:
    id1   id2 college_name
0  ID01  ID01     College1
1        ID03     College3
2  ID07           College7
3  ID08  ID08     College8
Also, one thing about this data frame is that I am pretty sure the first 2 columns either are an exact match or one of them is empty. I would like to double-check if there is an instance where id1 and id2 are completely different numbers in the same row. How should I do that?
Answer
Backfill values from id2 to id1. Extract the numbers. Convert to int (which drops the leading zero), then to str.
Given:
    id1   id2
0  ID01  ID01
1   NaN  ID03
2  ID07   NaN
3  ID08  ID08
Doing:
df['college_name'] = 'College' + (
    df.bfill(axis=1)['id1']
      .str.extract(r'(\d+)', expand=False)  # digits only; expand=False returns a Series
      .astype(int)
      .astype(str)
)
Output:
    id1   id2 college_name
0  ID01  ID01     College1
1   NaN  ID03     College3
2  ID07   NaN     College7
3  ID08  ID08     College8
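For anyone who wants to run this end to end, here is a minimal sketch of the same idea. It is an illustration under a few assumptions rather than the only way to do it: the sample frame is built by hand, blanks may be stored as empty strings (so they are converted to NaN first), and `fillna` is used in place of `bfill(axis=1)`, which gives the same result when there are only these two id columns. The name `merged_id` is made up for the example.

```python
import numpy as np
import pandas as pd

# Hand-built sample data (assumption): one blank stored as "", one as NaN.
df = pd.DataFrame({
    "id1": ["ID01", "", "ID07", "ID08"],
    "id2": ["ID01", "ID03", np.nan, "ID08"],
})

# Treat empty strings as missing so both kinds of blank behave the same.
df = df.replace("", np.nan)

# Take id1 where it is present, otherwise fall back to id2.
merged_id = df["id1"].fillna(df["id2"])

# Extract the digits, drop leading zeros via int, then prefix with "College".
df["college_name"] = "College" + (
    merged_id.str.extract(r"(\d+)", expand=False).astype(int).astype(str)
)

print(df)
```

If a row has both ids missing, `astype(int)` will raise, so such rows would need to be handled separately.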
To check for rows where the ids are different:
Given:
    id1   id2
0  ID01  ID01
1   NaN  ID03
2  ID07   NaN
3  ID08  ID98
Doing:
print(df[df.id1.ne(df.id2) & df.id1.notna() & df.id2.notna()])
Output:
    id1   id2
3  ID08  ID98
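The check can also be written as a self-contained sketch, again with a hand-built frame. The `notna()` guards matter because a comparison against NaN with `ne` counts as "not equal", so without them every row with one missing id would be reported as a conflict. The names `both_present`, `mismatch` and `conflicts` are only illustrative.

```python
import numpy as np
import pandas as pd

# Hand-built sample data (assumption): row 3 has two different ids.
df = pd.DataFrame({
    "id1": ["ID01", np.nan, "ID07", "ID08"],
    "id2": ["ID01", "ID03", np.nan, "ID98"],
})

# Only rows where both ids are filled in can genuinely conflict.
both_present = df["id1"].notna() & df["id2"].notna()

# Flag rows where both ids exist and differ.
mismatch = both_present & df["id1"].ne(df["id2"])
conflicts = df[mismatch]

print(conflicts)
```

An empty `conflicts` frame would confirm that the two columns never disagree when both are present.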