Skip to content
Advertisement

Speeding-up pandas column operation based on several rules

I have a data frame consisting of 5.1 mio rows. Now, consider only a query of my data frame

df_queried = df.query("ID1=='a' or ID2=='Y'")

which has the following form:

date ID1 ID2
201908 a X
201905 b Y
201811 a Y
201807 a Z

You can assume that the date is sorted and that there are no duplicates in the subset ['ID1', 'ID2'].

Now, the goal is to check whether there are ID2 duplicates that contain more than one ID1 value. If that’s the case, then assign the most recent ID1 value from that list to a new column for each ID1 in that list.

For the special query of my data frame:

Y is a duplicate of ID2 containing different values for ID1, namely ['a', 'b']. Now, we have to find the most recent value from the list and assign it to the new column for all ID1 values that are in the list.

Output:

date ID1 ID2 New_ID
201908 a X a
201905 b Y a
201811 a Y a
201807 a Z a

where New_ID equals the most recent value of ID1 and follows the following rules:

  1. Within each ID2 attribute New_ID must have the same and most recent value

Example: This obviously holds for ID2=X and ID2=Z. For ID2=Y there are two values for ID1, {a, b}. b must be overwritten with the most recent ID1 value of this segment.

  1. If there is more than one ID1 value within an ID2 value, then find all rows for which ID1 equals one of those values and assign the most recent one

Example: For ID2=Y, ID1 contains two values, a and b. Now, for each ID1==a or ID1==b, the new columns New_ID must equal the most recent value of ID1 independent of ID2.

I am able to achieve this:

date ID1 ID2 New_ID
201908 a X b
201905 b Y b
201811 a Y b
201807 a Z b

using the following loop:

df_queried['New_ID'] = df_queried['ID1']
for v2 in df_queried.ID2.unique():
    # Query data frame by ID2 value
    df_query1 = df_queried.query(f'ID2 == {v2!r}')
    # Get most recent value
    most_recent_val = df_query1.iloc[0, 1]
    # Define unique ID1 values within ID2 query
    unique_ID1_vals = df_query1.ID1.unique()
    # If several ID1 values were found, check if one val 
    # also occurs in different ID1 position
    if len(unique_ID1_vals) > 1:
        for v1 in unique_ID1_vals:
            # Get id1 query to check existence of multiple id2's
            df_queried.loc[df_queried['ID1'] == v1, 'New_ID'] = most_recent_val

Now, I can join the actual value a to the new column:

mapping = df_queried.drop_duplicates(subset=['New_ID'])[['ID1', 'New_ID']]
pd.merge(df_queried, mapping.rename(columns={'ID1': 'ID_temp'}), how='left')
    .drop(columns=['New_ID'])
    .rename(columns={'ID_temp': 'New_ID'})

which yields the desired result.

However, it takes way too long. I was thinking about a smarter approach. One that mainly relies on joins. But I was not able to find one.

Note: Obviously, I want to operate over the whole data frame not only on the queried one. Therefore, the code must be stable and applicable to the whole data frame. I think my code is, but I did not try it out on the whole data (after 6 hours I killed the kernel). I also tried to use numba, but failed to fully implement it.

I hope my problem got clear.

EDIT 1:

df_queried['New_ID'] = df_queried.groupby('ID2')['ID1'].transform('last')

This approach indeed works for this special case. However, if it is applied to a larger subset of my data, for instance:

date ID1 ID2 New_ID New_ID_desired
201908 a X a a
201905 b Y a a
201811 a Y a a
201807 a Z a a
202003 c H d c
202001 d H d c
201907 c I c c
201904 d J d c

the method does not hold anymore. It satisfies rule 1, but not rule 2.

However, when you use my approach, you get:

    date    ID1 ID2 New_ID
0   201906  a   X   a
1   201903  b   Y   a
2   201811  a   Y   a
3   201802  a   Z   a
4   202003  c   H   c
5   202001  d   H   c
6   201907  c   I   c
7   201904  d   J   c

Advertisement

Answer

Okay, after googling and thinking about an approach I finally found one using the library networkx. I wanted to share it for the case someone else is/will be facing the same problem. Basically, I have a bipartit graph that I want to decompose in connected components. You can define the following functions and get the desired result as follows:

import pandas as pd
import networkx as nx

from itertools import chain

df_sub = pd.DataFrame(
    data=dict(
        date=[201906, 201903, 201811, 201802, 202003, 202001, 201907, 201904],
        ID1=["a", "b", "a", "a", "c", "d", "c", "d"],
        ID2=["X", "Y", "Y", "Z", "H", "H", "I", "J"]
    )
)
        
def _graph_decomposition(graph_as_df: pd.DataFrame) -> list:
    # Initialize Graph (in my case, bipartit graph)
    G = nx.Graph()
    # Get connections
    G.add_edges_from(graph_as_df.drop_duplicates().to_numpy().tolist())
    # Create list containing connected components
    connected_components = list(nx.connected_components(G))
    return connected_components


def stabilized_ID(graph_as_df: pd.DataFrame) -> pd.DataFrame:
    components: list= _graph_decomposition(graph_as_df)
    # Chain components -> list of list to only one list
    ID1_mapping = list(chain.from_iterable(components))
    ID1_true = []
    for component in components:
        # Convert set to list
        component = list(component)
        # For my case, ID2 starts always with '0' and ID1 always with 'C'
        # and max(['C', '0999999']) = 'C'
        ID1_true += [max(component)] * len(component)
    # Assert length are equal
    assert len(ID1_true) == len(ID1_mapping)    
    # Define final mapping
    mapping = pd.DataFrame(data={'ID1': ID1_mapping, 'ID1_true': ID1_true})
    return mapping


mapping = stabilized_ID(df_sub[['ID1', 'ID2']])
pd.merge(df_sub, mapping, on=['ID1'], how='inner')

This approach takes 40 seconds for my whole data frame that consists of 5.1 mio rows (the merge operation alone takes 34 seconds). It produces the following data frame:

    date    ID1 ID2 ID1_true
0   201906  a   X   b
1   201811  a   Y   b
2   201802  a   Z   b
3   201903  b   Y   b
4   202003  c   H   d
5   201907  c   I   d
6   202001  d   H   d
7   201904  d   J   d

Since I made the next steps time-independent, I do not need the most recent value anymore. Now, it is only important to me that the ID_New values are equal to one of the connected components from ID1, not to the most recent one. If needed, one could also map the most recent ID1 value as described in my question.

User contributions licensed under: CC BY-SA
4 People found this is helpful
Advertisement