Speeding-up pandas column operation based on several rules

Question

I have a data frame consisting of 5.1 million rows. Now, consider only a subset of my data frame, which has the following form:

date	ID1	ID2
201908	a	X
201905	b	Y
201811	a	Y
201807	a	Z

You can assume that the rows are sorted by date and that there are no duplicates in the subset ['ID1', 'ID2']. Now, …

Accepted Answer

Okay, after googling and thinking about an approach, I finally found one using the library networkx. I wanted to share it in case someone else is facing, or will face, the same problem. Basically, I have a bipartite graph that I want to decompose into connected components. You can define the following functions and get the desired result as follows:

    import pandas as pd
    import networkx as nx
    from itertools import chain

    df_sub = pd.DataFrame(
        data=dict(
            date=[201906, 201903, 201811, 201802, 202003, 202001, 201907, 201904],
            ID1=["a", "b", "a", "a", "c", "d", "c", "d"],
            ID2=["X", "Y", "Y", "Z", "H", "H", "I", "J"]
        )
    )


    def _graph_decomposition(graph_as_df: pd.DataFrame) -> list:
        # Initialize graph (in my case, a bipartite graph)
        G = nx.Graph()
        # Each (ID1, ID2) row is an edge
        G.add_edges_from(graph_as_df.drop_duplicates().to_numpy().tolist())
        # Create list containing the connected components (one set of nodes each)
        connected_components = list(nx.connected_components(G))
        return connected_components


    def stabilized_ID(graph_as_df: pd.DataFrame) -> pd.DataFrame:
        components: list = _graph_decomposition(graph_as_df)
        # Chain components -> list of sets to only one flat list
        ID1_mapping = list(chain.from_iterable(components))
        ID1_true = []
        for component in components:
            # Convert set to list
            component = list(component)
            # For my case, ID2 always starts with '0' and ID1 always with 'C',
            # and max(['C', '0999999']) = 'C'
            ID1_true += [max(component)] * len(component)
        # Assert lengths are equal
        assert len(ID1_true) == len(ID1_mapping)

        # Define final mapping
        mapping = pd.DataFrame(data={'ID1': ID1_mapping, 'ID1_true': ID1_true})
        return mapping


    mapping = stabilized_ID(df_sub[['ID1', 'ID2']])
    pd.merge(df_sub, mapping, on=['ID1'], how='inner')

This approach takes 40 seconds for my whole data frame, which consists of 5.1 million rows (the merge operation alone takes 34 seconds).
It produces the following data frame:

        date    ID1  ID2  ID1_true
    0   201906  a    X    b
    1   201811  a    Y    b
    2   201802  a    Z    b
    3   201903  b    Y    b
    4   202003  c    H    d
    5   201907  c    I    d
    6   202001  d    H    d
    7   201904  d    J    d

Since I made the next steps time-independent, I no longer need the most recent value. Now it only matters to me that each ID_New value equals one of the ID1 values in its connected component, not necessarily the most recent one. If needed, one could also map the most recent ID1 value, as described in my question.
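The last remark, mapping each row to the most recent ID1 of its connected component instead of max(component), could be sketched as follows. This is a minimal sketch under stated assumptions, not the code behind the timings above: the function name most_recent_id1, the column name ID1_recent, and the ("ID1", value) / ("ID2", value) node tagging are all introduced here for illustration. It rebuilds the same example frame so that it runs on its own, and it uses Series.map with a plain dict in place of pd.merge. Whether that is faster on 5.1 million rows has not been measured.

```python
import networkx as nx
import pandas as pd

df_sub = pd.DataFrame(
    data=dict(
        date=[201906, 201903, 201811, 201802, 202003, 202001, 201907, 201904],
        ID1=["a", "b", "a", "a", "c", "d", "c", "d"],
        ID2=["X", "Y", "Y", "Z", "H", "H", "I", "J"],
    )
)


def most_recent_id1(df: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: per row, the most recent ID1 of the row's component."""
    # Tag nodes by column so an ID1 value can never collide with an ID2 value.
    G = nx.Graph()
    G.add_edges_from(
        (("ID1", a), ("ID2", b))
        for a, b in df[["ID1", "ID2"]].drop_duplicates().itertuples(index=False)
    )

    # Latest date on which each ID1 value appears.
    last_seen = df.groupby("ID1")["date"].max()

    # Map every ID1 value to the ID1 in its component with the latest date.
    id1_to_recent = {}
    for component in nx.connected_components(G):
        id1_nodes = [name for kind, name in component if kind == "ID1"]
        recent = last_seen.loc[id1_nodes].idxmax()
        id1_to_recent.update(dict.fromkeys(id1_nodes, recent))

    # Series.map looks each ID1 up in the dict and keeps the original row order.
    return df["ID1"].map(id1_to_recent)


df_sub["ID1_recent"] = most_recent_id1(df_sub)
print(df_sub)
```

On the example frame, rows with ID1 a or b are mapped to a (last seen 201906), and rows with ID1 c or d are mapped to c (last seen 202003). Tagging the nodes also removes the reliance on ID1 and ID2 values sorting in a particular order.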

EDIT 1:

date	ID1	ID2	New_ID	New_ID_desired
201908	a	X	a	a
201905	b	Y	a	a
201811	a	Y	a	a
201807	a	Z	a	a
202003	c	H	d	c
202001	d	H	d	c
201907	c	I	c	c
201904	d	J	d	c