
Fastest way to get all first-matched rows given a sequence of column values in Pandas

Say I have a pandas DataFrame with 10 rows and 2 columns, 'col1' and 'col2'.

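The original snippet did not survive on this page. A minimal stand-in, assuming hypothetical values for both columns (only the shape and the column names come from the question), could look like this:

```python
import pandas as pd

# Hypothetical data: 10 rows, 2 columns, with repeated values in 'col1'.
df = pd.DataFrame({
    "col1": [1, 2, 3, 3, 1, 2, 2, 3, 1, 1],
    "col2": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
})
print(df)
```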

Now I am given a sequence of 'col1' values as a NumPy array:

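The array itself is missing here; going by the values named in the question (3, 1 and 2), it was presumably along these lines. The variable name `seq` is an assumption.

```python
import numpy as np

# Values to look up in 'col1', in the order the results are wanted.
seq = np.array([3, 1, 2])
```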

I want to find the rows that have the first occurrence of 3, 1 and 2 in 'col1', and then get the corresponding 'col2' values in that order. Right now I am using a list comprehension:

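The list comprehension was not preserved. A plausible reconstruction, using hypothetical data and the assumed names `df` and `seq`, scans all of 'col1' once per value in the sequence, which would explain why it scales poorly:

```python
import numpy as np
import pandas as pd

# Hypothetical data; only the column names come from the question.
df = pd.DataFrame({
    "col1": [1, 2, 3, 3, 1, 2, 2, 3, 1, 1],
    "col2": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
})
seq = np.array([3, 1, 2])

# For each value: build a boolean mask over the whole column,
# then take 'col2' from the first matching row.
result = [df.loc[df["col1"] == v, "col2"].iloc[0] for v in seq]
# For this data the values are 30, 10, 20.
```

With n rows and m values in the sequence this does roughly n * m comparisons, so 30,000 rows and 3,000 values means about 90 million of them.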

This works for small dataframes, but it becomes the bottleneck of my code, because I have a huge dataframe of more than 30,000 rows and am often given long sequences of column values (> 3,000). Is there a more efficient way to do this?


Answer

Option 1

This is perhaps faster than what I suggested earlier (Option 2 below):

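The code for this option is missing from the page, so what the answer actually proposed is unknown. One approach that fits the problem, offered only as an assumption, is to keep the first row per 'col1' value with `drop_duplicates` and then look the sequence up through the index:

```python
import numpy as np
import pandas as pd

# Hypothetical data; only the column names come from the question.
df = pd.DataFrame({
    "col1": [1, 2, 3, 3, 1, 2, 2, 3, 1, 1],
    "col2": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
})
seq = np.array([3, 1, 2])

# Keep only the first row for every distinct 'col1' value
# (keep='first' is the default), then index 'col2' by 'col1'.
first = df.drop_duplicates("col1").set_index("col1")["col2"]

# Reorder to follow the sequence; values not present would become NaN.
result = first.reindex(seq)
```

This does a single pass to deduplicate and a single index lookup, rather than one full scan per value.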

Option 2

  • First get the matches for col1 by using Series.isin and select from the df based on the mask.
  • Now, apply df.groupby and get the first non-null entry for each group.
  • Finally, apply df.reindex to sort the values.
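The three steps above can be sketched as follows. The data and the names `df` and `seq` are hypothetical, and the answer's own code was not preserved, so this is a reading of the description rather than the original snippet:

```python
import numpy as np
import pandas as pd

# Hypothetical data; only the column names come from the question.
df = pd.DataFrame({
    "col1": [1, 2, 3, 3, 1, 2, 2, 3, 1, 1],
    "col2": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
})
seq = np.array([3, 1, 2])

# Step 1: keep only the rows whose 'col1' value appears in the sequence.
matches = df[df["col1"].isin(seq)]

# Step 2: first non-null 'col2' entry per 'col1' group.
first = matches.groupby("col1")["col2"].first()

# Step 3: put the result in the order of the sequence.
result = first.reindex(seq)
```

Note that `groupby` sorts the group keys by default, which is why the final `reindex` is needed to restore the order of the sequence.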

If a value from the sequence cannot be found in 'col1', you'll end up with a NaN for it. E.g.:

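The example is missing here. A small sketch of the behaviour, assuming hypothetical data and a sequence that contains one value (4) that never occurs in 'col1':

```python
import numpy as np
import pandas as pd

# Hypothetical data; only the column names come from the question.
df = pd.DataFrame({
    "col1": [1, 2, 3, 3, 1, 2, 2, 3, 1, 1],
    "col2": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
})
seq = np.array([3, 1, 4])  # 4 does not occur in 'col1'

result = (
    df[df["col1"].isin(seq)]
    .groupby("col1")["col2"]
    .first()
    .reindex(seq)  # labels missing from the index are filled with NaN
)

# True only for the value that was not found.
missing = result.isna()
```

Because NaN is a float, the integer 'col2' values are upcast to float in the result. Calling `result.dropna()` would discard the unmatched entries if they are not needed.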
User contributions licensed under: CC BY-SA