Skip to content
Advertisement

Compare Same and Different in Two Columns of Dataframe

I have a small dataframe, like this.

JavaScript

enter image description here

I am trying to find out if, and when a person switched to a different cell phone service.

I tried this logic; didn’t work.

JavaScript

Basically, given index row 0 and row 1, when the address_id was the same, the business was the same, but the business changed from verizon to comcast in index row 2. Also, given index row 3 and row 4, the address_id was the same, but the business changed from sprint to att in index row 4. I’d like to add a new column to the dataframe to flag these changes. How can I do that?

Advertisement

Answer

UPDATE: A comment by @rickhg12hs points out that my earlier answers (see below), while detecting when a person switches to a new cell phone service, do not handle the case of a person switching back to a previous service.

To handle this possibility, we must use something like the logic in another answer (@Pranav Hosangadi), though I would do it slightly differently:

JavaScript

Input:

JavaScript

Output:

JavaScript

Performance comparison: Here is test code for 600k rows and 5 columns, with results showing that PREVIOUS UPDATE takes about 0.1 seconds to identify 333815 rows for which new==True, while UPDATE takes about 35 seconds to find 335334 True rows, reflecting about 0.5% additional rows for which a person has switched cell phone services and then switched back.

JavaScript

Results:

JavaScript

PREVIOUS UPDATE: Here is an even simpler way than my original answer using join() (see below) to do what your question asks:

JavaScript

Explanation:

  • Use groupby() and first() to create a dataframe whose business column contains the first one encountered for each address_id
  • Use Series.map() to transform the original dataframe’s address_id column into this first business value
  • Add column new which is True only if this new business differs from the original business column.

ORIGINAL SOLUTION:

Here is a simple way to do what you’ve asked using groupby() and join():

JavaScript

Output:

JavaScript

Explanation:

  • Use groupby() and first() to create a dataframe whose business column contains the first one encountered for each address_id
  • Use join() to add a column business_first to df containing the corresponding first business for each address_id
  • Use assign() to add a column new containing a boolean indicating whether the row contains a new business with an existing address_id
  • Use drop() to eliminate the business_first column.
Advertisement