Skip to content
Advertisement

Verify if elements of pandas columns have been shuffled

I have the following df:

name   id   line_number    add_el    add_ver    del_el    del_ver                                                                  
name1  1    1              elem2     1.3        elem1     1.2
name1  1    2              elem3     1.4        elem3     1.1
name1  1    3              elem1     1.2        elem5     1.3
name2  2    1              elem10    2.0        elem11    2.1
name2  2    3              elem11    2.1        elem13    2.3

The above df represents the lines in a csv file where the del_el is an add_el on another line. I want to add a column action in which the value would be “replace” if for the same (name, id), the del_el is equal to the add_el column on another line_number.

Desired output

name   id   line_number    add_el    add_ver    del_el  del_ver  action                                                                
name1  1    1              elem2     1.3        elem1     1.2    none
name1  1    2              elem3     1.4        elem3     1.1    update
name1  1    3              elem1     1.2        elem5     1.3    replace
name2  2    1              elem10    2.0        elem11    2.1    none
name2  2    3              elem11    2.1        elem13    2.3    replace

Sample code to recreate the input df

df = pd.DataFrame({'name':['name1', 'name1', 'name1', 'name2', 'name2'], 
                   'id': [1, 1, 1, 2, 2], 
                   'line_number': [1, 2, 3, 1, 3], 
                   'add_el': ['elem2', 'elem3', 'elem1', 'elem10', 'elem11'], 
                   'add_ver': ['1.3', '1.4', '1.2', '2.0', '2.1'],
                   'del_el': ['elem1', 'elem3', 'elem5', 'elem11', 'elem13'],
                   'del_ver': ['1.2', '1.1', '1.3', '2.1', '2.3']})

In my current solution, I define the actions as follow: Tuple format: ((add_el_name, added_el_ver), (del_el_name, del_el_ver))

((NaN, NaN), (X1, V1)) - delete
((X1, V2), (X1, V1)) - update
((X1, V1), (X1, V2)) - downgrade
((X1, V1), (X2, V2)) - replace 

Code of my current solution:

def get_action_type(row):
    if str(row['add_el']) == 'nan' and str(row['del_el']) != 'nan':
        return 'delete'
    else:
        if (version.parse(str(row['add_ver'])) > version.parse(str(row['del_ver']))) and str(row['add_el']) == str(row['del_el']):
            return 'update'
        elif (version.parse(str(row['add_ver'])) < version.parse(str(row['del_ver']))) and str(row['add_el']) == str(row['del_el']):
            return 'downgrade'
        elif (version.parse(str(row['add_ver'])) == version.parse(str(row['del_ver']))) and str(row['add_el']) == str(row['del_el']):
            return 'none'
        else:
            if str(row['add_el']) != str(row['del_el']) and str(row['del_el']) != 'nan':
                return 'replace'
            else:
                return 'unknown'

My current solution only verifies if the name and version of the deleted element are different from the ones of the added element. I need the “replace” action to verify if the del_el, del_ver is added to another line_number of the same (name, id).

Advertisement

Answer

The solution I came up with consists in grouping the rows by name and id and aggregating the columns added and deleted into a list(removed version for simplicity purpose). More info here.

res = df.groupby(['name', 'id']).agg(tuple).applymap(list).reset_index()

I then create a column replaced with list comprehension that returns the set intersection between added and deleted elements. More info here.

res['replaced'] = [(set(a) & set(b)) if len((set(a) & set(b))) != 0 else 'NaN' for a, b in zip(res.added, res.deleted)]
res = res[['name', 'id', 'replaced']] #selecting necessary columns

I merge the result with the original dataframe so I have the set intersection in each row.

res_final = pd.merge(df, res, on=['name', 'id']) #merging with original df

I finally create a function that checks if the deleted element appears in the set intersection column replaced. If yes, then the label “replace” is added. Else, I just return the action that was previously there. To ensure that we are not looking at elements on the same row, I verify if the action isn’t none (based on the code in my question post).

def is_it_replaced(row):
    if str(row['deleted']) in str(row['replaced']) and str(row['action']) != 'none':
        return 'replace'
    else:
        return str(row['action'])

res_final['action_type'] = res_final.apply(lambda x: is_it_replaced(x), axis=1)
res_final = res_final.drop(columns=['action', 'replaced']) #final cleanup

Good: it works Bad: it’s slow, especially if you dataframe is big. It is preferable to avoid list comprehension.

Advertisement