
Speeding up a pandas column operation based on several rules

I have a data frame consisting of 5.1 million rows. Now, consider only a queried subset of my data frame, which has the following form:

date ID1 ID2
201908 a X
201905 b Y
201811 a Y
201807 a Z

You can assume that the date is sorted and that there are no duplicates in the subset ['ID1', 'ID2'].
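For reproducibility, the sample subset can be built as follows (df is simply the variable name used throughout the sketches below):

import pandas as pd

# Sample subset from the question; dates are sorted, no duplicates in ['ID1', 'ID2']
df = pd.DataFrame({
    'date': [201908, 201905, 201811, 201807],
    'ID1': ['a', 'b', 'a', 'a'],
    'ID2': ['X', 'Y', 'Y', 'Z'],
})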

Now, the goal is to check whether any ID2 value occurs together with more than one ID1 value. If that is the case, assign the most recent ID1 value from that set to a new column, for every row whose ID1 is in that set.

For this particular subset of my data frame:

Y is an ID2 value that occurs with different ID1 values, namely ['a', 'b']. Now, we have to find the most recent ID1 value from that list and assign it to the new column for all ID1 values that are in the list.

Output:

date ID1 ID2 New_ID
201908 a X a
201905 b Y a
201811 a Y a
201807 a Z a

where New_ID equals the most recent ID1 value and follows these rules:

  1. Within each ID2 group, New_ID must take the same value for every row, namely the most recent ID1 value.

Example: This obviously holds for ID2=X and ID2=Z. For ID2=Y there are two values for ID1, {a, b}. b must be overwritten with the most recent ID1 value of this segment.

  2. If there is more than one ID1 value within an ID2 group, then find all rows whose ID1 equals one of those values and assign the most recent one to all of them.

Example: For ID2=Y, ID1 contains two values, a and b. Now, for each row with ID1==a or ID1==b, the new column New_ID must equal the most recent ID1 value, independent of ID2.

I am able to achieve this intermediate result:

date ID1 ID2 New_ID
201908 a X b
201905 b Y b
201811 a Y b
201807 a Z b

using the following loop:

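Sketched, the loop links all ID1 values that share an ID2 value and assigns each row a common representative; the union-find helpers and names below are illustrative, not the exact original code:

# Union-find over ID1 values: two ID1 values end up in the same set
# whenever they occur under the same ID2 value. (Sketch only.)
parent = {}

def find(x):
    # Follow parent pointers up to the representative of x.
    while parent.get(x, x) != x:
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

for _, grp in df.groupby('ID2'):
    ids = grp['ID1'].unique()
    for other in ids[1:]:
        union(ids[0], other)

# Each row gets an arbitrary representative of its component; the most
# recent ID1 value is joined in afterwards.
df['New_ID'] = df['ID1'].map(find)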

Now, I can join the actual value a to the new column:

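Sketched, assuming the largest date value marks the most recent row of a component:

# Per component (New_ID), take the ID1 value of the most recent row
# and replace the arbitrary representative with it. (Sketch only.)
most_recent = (df.sort_values('date', ascending=False)
                 .drop_duplicates('New_ID')
                 .set_index('New_ID')['ID1'])
df['New_ID'] = df['New_ID'].map(most_recent)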

which yields the desired result.

However, it takes way too long. I was thinking about a smarter approach, one that mainly relies on joins, but I was not able to find one.

Note: Obviously, I want to operate on the whole data frame, not only on the queried subset. Therefore, the code must be stable and applicable to the whole data frame. I think my code is, but a run on the whole data never finished (I killed the kernel after 6 hours). I also tried to use numba, but failed to fully implement it.

I hope my problem is clear.

EDIT 1:

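In essence, the suggested approach makes New_ID constant within each ID2 group. One illustrative formulation (the exact transform is an assumption; this version reproduces the New_ID column shown below) is:

df['New_ID'] = df.groupby('ID2')['ID1'].transform('last')  # per-ID2 broadcast (assumed formulation)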

This approach indeed works for this special case. However, if it is applied to a larger subset of my data, for instance:

date ID1 ID2 New_ID New_ID_desired
201908 a X a a
201905 b Y a a
201811 a Y a a
201807 a Z a a
202003 c H d c
202001 d H d c
201907 c I c c
201904 d J d c

the method does not hold anymore. It satisfies rule 1, but not rule 2.

However, when you use my approach, you get exactly the values shown in the New_ID_desired column above.


Answer

Okay, after googling and thinking about an approach, I finally found one using the library networkx. I wanted to share it in case someone else is, or will be, facing the same problem. Basically, I have a bipartite graph that I want to decompose into connected components. You can define the following functions and get the desired result as follows:

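In outline: build a graph with one node per ID1 value and one per ID2 value, add an edge per row, and derive the mapping from the connected components, which is then merged back. The sketch below follows that outline; the function name and details are assumptions:

import networkx as nx
import pandas as pd

def connected_component_map(frame):
    # Sketch of the idea; names and details are assumptions.
    # Build a bipartite graph with one node per ID1 value and one node
    # per ID2 value; prefixes keep the two namespaces from colliding.
    G = nx.Graph()
    G.add_edges_from(
        ('ID1_' + a, 'ID2_' + b) for a, b in zip(frame['ID1'], frame['ID2'])
    )
    # Every connected component is one group of linked IDs; map each
    # ID1 value of a component to a single representative of it.
    rows = []
    for component in nx.connected_components(G):
        ids1 = sorted(n[4:] for n in component if n.startswith('ID1_'))
        rows.extend((id1, ids1[0]) for id1 in ids1)
    return pd.DataFrame(rows, columns=['ID1', 'New_ID'])

df = df.merge(connected_component_map(df), on='ID1', how='left')

Which ID1 value becomes the representative of a component is arbitrary in this sketch (here the lexicographically smallest), which is sufficient given the remark below.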

This approach takes 40 seconds for my whole data frame of 5.1 million rows (the merge operation alone takes 34 seconds). It produces a data frame in which New_ID is constant within each connected component.

Since I made the next steps time-independent, I do not need the most recent value anymore. Now, it is only important to me that the New_ID values equal one of the ID1 values of the connected component, not the most recent one. If needed, one could also map the most recent ID1 value onto each component, as described in my question.
