I have a small dataframe, like this:
import pandas as pd
import numpy as np

# data is stored in a dictionary
details = {
    'address_id': [1, 1, 1, 2, 2],
    'business': ['verizon', 'verizon', 'comcast', 'sprint', 'att']
}
df = pd.DataFrame(details)
print(df)
I am trying to find out if, and when, a person switched to a different cell phone service.
I tried this logic, but it didn't work:
df['new'] = df.Column1.isin(df.Column1) & df[~df.Column2.isin(df.Column2)]
Basically, in rows 0 and 1 the address_id was the same and the business was the same, but the business changed from verizon to comcast in row 2. Likewise, rows 3 and 4 share an address_id, but the business changed from sprint to att in row 4. I'd like to add a new column to the dataframe that flags these changes. How can I do that?
Answer
UPDATE: A comment by @rickhg12hs points out that my earlier answers (see below), while detecting when a person switches to a new cell phone service, do not handle the case of a person switching back to a previous service.
To handle this possibility, we must use something like the logic in another answer (@Pranav Hosangadi), though I would do it slightly differently:
df['new'] = (
    df
    .groupby('address_id', sort=False)
    .apply(lambda x: x.business != x.business.shift().bfill())
    .reset_index(0).business
)
Input:
   address_id business
0           1  verizon
1           1  verizon
2           1  comcast
3           2   sprint
4           2      att
5           2   sprint
Output:
   address_id business    new
0           1  verizon  False
1           1  verizon  False
2           1  comcast   True
3           2   sprint  False
4           2      att   True
5           2   sprint   True
Performance comparison:
Here is test code for 600k rows and 5 columns. The results show that PREVIOUS UPDATE takes about 0.1 seconds to identify 333815 rows for which new == True, while UPDATE takes about 35 seconds to find 335334 True rows, reflecting about 0.5% additional rows for which a person has switched cell phone services and then switched back.
from datetime import datetime

rng = np.random.default_rng()
details = {
    'address_id': rng.integers(1, 100_000, size=600_000),
    'business': [['verizon', 'comcast', 'sprint', 'att'][i]
                 for i in rng.integers(0, 3, size=600_000)],
    'foo': 1,
    'bar': 2
}
df = pd.DataFrame(details)

print('groupby() ...')
start = datetime.now()
x = (
    df
    .groupby('address_id', sort=False)
)
print(f'... complete after {datetime.now() - start} time elapsed.')

print('apply() ...')
start = datetime.now()
x = (
    x
    .apply(lambda x: x.business != x.business.shift().bfill())
)
print(f'... complete after {datetime.now() - start} time elapsed.')

print('reset_index() ...')
start = datetime.now()
df['new'] = (
    x
    .reset_index(0).business
)
print(f'... complete after {datetime.now() - start} time elapsed.')

print(df)
print('rows with "new" == True', df.new.sum())

df = pd.DataFrame(details)
print('PREVIOUS UPDATE() ...')
start = datetime.now()
df['new'] = df.address_id.map(df.groupby('address_id').first().business) != df.business
print(f'... complete after {datetime.now() - start} time elapsed.')
print('rows with "new" == True', df.new.sum())
Results:
groupby() ...
... complete after 0:00:00 time elapsed.
apply() ...
... complete after 0:00:33.541322 time elapsed.
reset_index() ...
... complete after 0:00:00.040942 time elapsed.
        address_id business  foo  bar    new
0            20223   sprint    1    2  False
1            29297  comcast    1    2  False
2            92489  comcast    1    2  False
3            29297  verizon    1    2   True
4            98901  comcast    1    2  False
...            ...      ...  ...  ...    ...
599995       29823  comcast    1    2   True
599996       39328  comcast    1    2   True
599997       27594  comcast    1    2  False
599998       14903   sprint    1    2   True
599999       87375  verizon    1    2   True

[600000 rows x 5 columns]
rows with "new" == True 335334
PREVIOUS UPDATE() ...
... complete after 0:00:00.097930 time elapsed.
rows with "new" == True 333815
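As an aside, most of that apply() cost can usually be avoided by pairing groupby() with a vectorized shift(); this is a sketch of such an alternative (the variable name prev is mine), shown on the six-row sample data above:

```python
import pandas as pd

# Sample data from the answer above
df = pd.DataFrame({
    'address_id': [1, 1, 1, 2, 2, 2],
    'business': ['verizon', 'verizon', 'comcast', 'sprint', 'att', 'sprint'],
})

# Previous business within each address_id group; the first row of each
# group has no predecessor, so fill it with its own value (never flagged)
prev = df.groupby('address_id', sort=False)['business'].shift()
df['new'] = df['business'].ne(prev.fillna(df['business']))
print(df['new'].tolist())  # [False, False, True, False, True, True]
```

Like the apply() version, this also flags a switch back to a previously used service, since each row is compared only to the immediately preceding row in its group.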
PREVIOUS UPDATE: Here is an even simpler way than my original answer using join() (see below) to do what your question asks:
df['new'] = df.address_id.map(df.groupby('address_id').first().business) != df.business
Explanation:
- Use groupby() and first() to create a dataframe whose business column contains the first one encountered for each address_id
- Use Series.map() to transform the original dataframe's address_id column into this first business value
- Add column new, which is True only if this new business differs from the original business column.
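To see the intermediate pieces, here is that one-liner broken into steps on the question's data (the variable names first and mapped are mine):

```python
import pandas as pd

df = pd.DataFrame({
    'address_id': [1, 1, 1, 2, 2],
    'business': ['verizon', 'verizon', 'comcast', 'sprint', 'att'],
})

# Step 1: first business encountered for each address_id
first = df.groupby('address_id').first().business  # 1 -> verizon, 2 -> sprint

# Step 2: broadcast that first business back onto every row
mapped = df.address_id.map(first)

# Step 3: True wherever the row's business differs from the first one
df['new'] = mapped != df.business
print(df['new'].tolist())  # [False, False, True, False, True]
```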
ORIGINAL SOLUTION:
Here is a simple way to do what you've asked using groupby() and join():
df = df.join(df.groupby('address_id').first(), on='address_id', rsuffix='_first')
df = df.assign(new=df.business != df.business_first).drop(columns='business_first')
Output:
   address_id business    new
0           1  verizon  False
1           1  verizon  False
2           1  comcast   True
3           2   sprint  False
4           2      att   True
Explanation:
- Use groupby() and first() to create a dataframe whose business column contains the first one encountered for each address_id
- Use join() to add a column business_first to df containing the corresponding first business for each address_id
- Use assign() to add a column new containing a boolean indicating whether the row contains a new business with an existing address_id
- Use drop() to eliminate the business_first column.
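Spelled out step by step on the question's data, the join() version looks like this (the intermediate variable joined is mine, kept separate so the helper column is visible before it is dropped):

```python
import pandas as pd

df = pd.DataFrame({
    'address_id': [1, 1, 1, 2, 2],
    'business': ['verizon', 'verizon', 'comcast', 'sprint', 'att'],
})

# join() aligns each group's first business back onto df as business_first
joined = df.join(df.groupby('address_id').first(),
                 on='address_id', rsuffix='_first')

# Flag rows whose business differs from the group's first, then drop helper
df = (joined.assign(new=joined.business != joined.business_first)
            .drop(columns='business_first'))
print(df['new'].tolist())  # [False, False, True, False, True]
```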