I have a data frame consisting of 5.1 mio rows. Now, consider only a query of my data frame
df_queried = df.query("ID1=='a' or ID2=='Y'")
which has the following form:
date | ID1 | ID2 |
---|---|---|
201908 | a | X |
201905 | b | Y |
201811 | a | Y |
201807 | a | Z |
You can assume that the date is sorted and that there are no duplicates in the subset ['ID1', 'ID2']
.
Now, the goal is to check whether there are ID2
duplicates that contain more than one ID1
value. If that’s the case, then assign the most recent ID1
value from that list to a new column for each ID1
in that list.
For the special query of my data frame:
Y
is a duplicate of ID2
containing different values for ID1
, namely ['a', 'b']
. Now, we have to find the most recent value from the list and assign it to the new column for all ID1
values that are in the list.
Output:
date | ID1 | ID2 | New_ID |
---|---|---|---|
201908 | a | X | a |
201905 | b | Y | a |
201811 | a | Y | a |
201807 | a | Z | a |
where New_ID
equals the most recent value of ID1
and follows the following rules:
- Within each
ID2
attributeNew_ID
must have the same and most recent value
Example:
This obviously holds for ID2=X
and ID2=Z
. For ID2=Y
there are two values for ID1
, {a, b}
. b
must be overwritten with the most recent ID1 value of this segment.
- If there is more than one
ID1
value within anID2
value, then find all rows for whichID1
equals one of those values and assign the most recent one
Example: For ID2=Y
, ID1
contains two values, a
and b
. Now, for each ID1==a
or ID1==b
, the new columns New_ID
must equal the most recent value of ID1
independent of ID2
.
I am able to achieve this:
date | ID1 | ID2 | New_ID |
---|---|---|---|
201908 | a | X | b |
201905 | b | Y | b |
201811 | a | Y | b |
201807 | a | Z | b |
using the following loop:
df_queried['New_ID'] = df_queried['ID1'] for v2 in df_queried.ID2.unique(): # Query data frame by ID2 value df_query1 = df_queried.query(f'ID2 == {v2!r}') # Get most recent value most_recent_val = df_query1.iloc[0, 1] # Define unique ID1 values within ID2 query unique_ID1_vals = df_query1.ID1.unique() # If several ID1 values were found, check if one val # also occurs in different ID1 position if len(unique_ID1_vals) > 1: for v1 in unique_ID1_vals: # Get id1 query to check existence of multiple id2's df_queried.loc[df_queried['ID1'] == v1, 'New_ID'] = most_recent_val
Now, I can join the actual value a
to the new column:
mapping = df_queried.drop_duplicates(subset=['New_ID'])[['ID1', 'New_ID']] pd.merge(df_queried, mapping.rename(columns={'ID1': 'ID_temp'}), how='left') .drop(columns=['New_ID']) .rename(columns={'ID_temp': 'New_ID'})
which yields the desired result.
However, it takes way too long. I was thinking about a smarter approach. One that mainly relies on joins. But I was not able to find one.
Note: Obviously, I want to operate over the whole data frame not only on the queried one. Therefore, the code must be stable and applicable to the whole data frame. I think my code is, but I did not try it out on the whole data (after 6 hours I killed the kernel). I also tried to use numba
, but failed to fully implement it.
I hope my problem got clear.
EDIT 1:
df_queried['New_ID'] = df_queried.groupby('ID2')['ID1'].transform('last')
This approach indeed works for this special case. However, if it is applied to a larger subset of my data, for instance:
date | ID1 | ID2 | New_ID | New_ID_desired |
---|---|---|---|---|
201908 | a | X | a | a |
201905 | b | Y | a | a |
201811 | a | Y | a | a |
201807 | a | Z | a | a |
202003 | c | H | d | c |
202001 | d | H | d | c |
201907 | c | I | c | c |
201904 | d | J | d | c |
the method does not hold anymore. It satisfies rule 1, but not rule 2.
However, when you use my approach, you get:
date ID1 ID2 New_ID 0 201906 a X a 1 201903 b Y a 2 201811 a Y a 3 201802 a Z a 4 202003 c H c 5 202001 d H c 6 201907 c I c 7 201904 d J c
Advertisement
Answer
Okay, after googling and thinking about an approach I finally found one using the library networkx
. I wanted to share it for the case someone else is/will be facing the same problem. Basically, I have a bipartit graph that I want to decompose in connected components. You can define the following functions and get the desired result as follows:
import pandas as pd import networkx as nx from itertools import chain df_sub = pd.DataFrame( data=dict( date=[201906, 201903, 201811, 201802, 202003, 202001, 201907, 201904], ID1=["a", "b", "a", "a", "c", "d", "c", "d"], ID2=["X", "Y", "Y", "Z", "H", "H", "I", "J"] ) ) def _graph_decomposition(graph_as_df: pd.DataFrame) -> list: # Initialize Graph (in my case, bipartit graph) G = nx.Graph() # Get connections G.add_edges_from(graph_as_df.drop_duplicates().to_numpy().tolist()) # Create list containing connected components connected_components = list(nx.connected_components(G)) return connected_components def stabilized_ID(graph_as_df: pd.DataFrame) -> pd.DataFrame: components: list= _graph_decomposition(graph_as_df) # Chain components -> list of list to only one list ID1_mapping = list(chain.from_iterable(components)) ID1_true = [] for component in components: # Convert set to list component = list(component) # For my case, ID2 starts always with '0' and ID1 always with 'C' # and max(['C', '0999999']) = 'C' ID1_true += [max(component)] * len(component) # Assert length are equal assert len(ID1_true) == len(ID1_mapping) # Define final mapping mapping = pd.DataFrame(data={'ID1': ID1_mapping, 'ID1_true': ID1_true}) return mapping mapping = stabilized_ID(df_sub[['ID1', 'ID2']]) pd.merge(df_sub, mapping, on=['ID1'], how='inner')
This approach takes 40 seconds for my whole data frame that consists of 5.1 mio rows (the merge operation alone takes 34 seconds). It produces the following data frame:
date ID1 ID2 ID1_true 0 201906 a X b 1 201811 a Y b 2 201802 a Z b 3 201903 b Y b 4 202003 c H d 5 201907 c I d 6 202001 d H d 7 201904 d J d
Since I made the next steps time-independent, I do not need the most recent value anymore. Now, it is only important to me that the ID_New
values are equal to one of the connected components from ID1
, not to the most recent one. If needed, one could also map the most recent ID1
value as described in my question.