I have the following df:
name id line_number add_el add_ver del_el del_ver name1 1 1 elem2 1.3 elem1 1.2 name1 1 2 elem3 1.4 elem3 1.1 name1 1 3 elem1 1.2 elem5 1.3 name2 2 1 elem10 2.0 elem11 2.1 name2 2 3 elem11 2.1 elem13 2.3
The above df represents the lines in a csv file where the del_el
is an add_el
on another line. I want to add a column action
in which the value would be “replace” if for the same (name, id), the del_el
is equal to the add_el
column on another line_number
.
Desired output
name id line_number add_el add_ver del_el del_ver action name1 1 1 elem2 1.3 elem1 1.2 none name1 1 2 elem3 1.4 elem3 1.1 update name1 1 3 elem1 1.2 elem5 1.3 replace name2 2 1 elem10 2.0 elem11 2.1 none name2 2 3 elem11 2.1 elem13 2.3 replace
Sample code to recreate the input df
df = pd.DataFrame({'name':['name1', 'name1', 'name1', 'name2', 'name2'], 'id': [1, 1, 1, 2, 2], 'line_number': [1, 2, 3, 1, 3], 'add_el': ['elem2', 'elem3', 'elem1', 'elem10', 'elem11'], 'add_ver': ['1.3', '1.4', '1.2', '2.0', '2.1'], 'del_el': ['elem1', 'elem3', 'elem5', 'elem11', 'elem13'], 'del_ver': ['1.2', '1.1', '1.3', '2.1', '2.3']})
In my current solution, I define the actions as follow: Tuple format: ((add_el_name, added_el_ver), (del_el_name, del_el_ver))
((NaN, NaN), (X1, V1)) - delete ((X1, V2), (X1, V1)) - update ((X1, V1), (X1, V2)) - downgrade ((X1, V1), (X2, V2)) - replace
Code of my current solution:
def get_action_type(row): if str(row['add_el']) == 'nan' and str(row['del_el']) != 'nan': return 'delete' else: if (version.parse(str(row['add_ver'])) > version.parse(str(row['del_ver']))) and str(row['add_el']) == str(row['del_el']): return 'update' elif (version.parse(str(row['add_ver'])) < version.parse(str(row['del_ver']))) and str(row['add_el']) == str(row['del_el']): return 'downgrade' elif (version.parse(str(row['add_ver'])) == version.parse(str(row['del_ver']))) and str(row['add_el']) == str(row['del_el']): return 'none' else: if str(row['add_el']) != str(row['del_el']) and str(row['del_el']) != 'nan': return 'replace' else: return 'unknown'
My current solution only verifies if the name and version of the deleted element are different from the ones of the added element. I need the “replace” action to verify if the del_el, del_ver
is added to another line_number
of the same (name, id).
Advertisement
Answer
The solution I came up with consists in grouping the rows by name
and id
and aggregating the columns added
and deleted
into a list(removed version for simplicity purpose). More info here.
res = df.groupby(['name', 'id']).agg(tuple).applymap(list).reset_index()
I then create a column replaced
with list comprehension that returns the set intersection between added and deleted elements. More info here.
res['replaced'] = [(set(a) & set(b)) if len((set(a) & set(b))) != 0 else 'NaN' for a, b in zip(res.added, res.deleted)] res = res[['name', 'id', 'replaced']] #selecting necessary columns
I merge the result with the original dataframe so I have the set intersection in each row.
res_final = pd.merge(df, res, on=['name', 'id']) #merging with original df
I finally create a function that checks if the deleted element appears in the set intersection column replaced
. If yes, then the label “replace” is added. Else, I just return the action that was previously there. To ensure that we are not looking at elements on the same row, I verify if the action isn’t none
(based on the code in my question post).
def is_it_replaced(row): if str(row['deleted']) in str(row['replaced']) and str(row['action']) != 'none': return 'replace' else: return str(row['action']) res_final['action_type'] = res_final.apply(lambda x: is_it_replaced(x), axis=1) res_final = res_final.drop(columns=['action', 'replaced']) #final cleanup
Good: it works Bad: it’s slow, especially if you dataframe is big. It is preferable to avoid list comprehension.