Say I have a Pandas dataframe with 10 rows and 2 columns.
import pandas as pd
df = pd.DataFrame({'col1': [1,2,4,3,1,3,1,5,1,4],
                   'col2': [.9,.7,.1,.3,.2,.4,.8,.2,.3,.5]})
df

   col1  col2
0     1   0.9
1     2   0.7
2     4   0.1
3     3   0.3
4     1   0.2
5     3   0.4
6     1   0.8
7     5   0.2
8     1   0.3
9     4   0.5
Now suppose I am given a sequence of 'col1' values in a NumPy array:
import numpy as np
nums = np.array([3,1,2])
I want to find the rows that have the first occurrence of 3, 1 and 2 in 'col1', and then get the corresponding 'col2' values in order. Right now I am using a list comprehension:
res = np.array([df[df['col1']==n].reset_index(drop=True).at[0,'col2'] for n in nums])
res

[0.3 0.9 0.7]
This works for small dataframes, but it becomes the bottleneck of my code, as I have a huge dataframe of more than 30,000 rows and am often given long sequences of column values (> 3,000). Is there a more efficient way to do this?
Answer
Option 1
Perhaps faster than what I suggested earlier (below: option 2):
df.groupby('col1').first().reindex(nums)

      col2
col1
3      0.3
1      0.9
2      0.7
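The question builds a NumPy array called res, while the call above returns a DataFrame indexed by col1. A minimal sketch of getting back to an array, assuming only the 'col2' column is needed (the name first_per_value is just illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 4, 3, 1, 3, 1, 5, 1, 4],
                   'col2': [.9, .7, .1, .3, .2, .4, .8, .2, .3, .5]})
nums = np.array([3, 1, 2])

# Group once: first non-null 'col2' value for every distinct 'col1' value.
first_per_value = df.groupby('col1')['col2'].first()

# Look up the requested values in the order given by nums, then drop the index.
res = first_per_value.reindex(nums).to_numpy()
print(res)  # [0.3 0.9 0.7]
```

The grouping happens once for the whole frame, so the per-value boolean filtering in the list comprehension is replaced by a single index lookup.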
Option 2
- First get the matches for col1 by using Series.isin and select from the df based on the mask.
- Now, apply df.groupby and get the first non-null entry for each group.
- Finally, apply df.reindex to sort the values.
df[df['col1'].isin(nums)].groupby('col1').first().reindex(nums)

      col2
col1
3      0.3
1      0.9
2      0.7
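One caveat: since first() takes the first non-null entry per group, it can differ from the list comprehension in the question when 'col2' itself contains NaN. If the literal first row is wanted, a possible alternative (not one of the two options above, just a sketch) is drop_duplicates, which keeps the first row per col1 value by default. The small frame df_nan below is made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the first row for col1 == 3 has a missing col2.
df_nan = pd.DataFrame({'col1': [3, 3, 1],
                       'col2': [np.nan, 0.4, 0.9]})
nums = np.array([3, 1])

# groupby(...).first() skips the NaN and returns the next value in the group.
by_group = df_nan.groupby('col1')['col2'].first().reindex(nums).to_numpy()

# drop_duplicates keeps the literal first row per col1 value, NaN included.
by_row = (df_nan.drop_duplicates('col1')
                .set_index('col1')['col2']
                .reindex(nums)
                .to_numpy())

print(by_group)
print(by_row)
```

On data without missing 'col2' values, such as the example df, both variants should give the same result.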
If a value cannot be found, you’ll end up with a NaN. E.g.
df.iloc[1,0] = 6 # there's no '2' in `col1` anymore
df[df['col1'].isin(nums)].groupby('col1').first().reindex(nums)

      col2
col1
3      0.3
1      0.9
2      NaN
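If you need to know which requested values were missing, rather than just seeing NaN in the result, one way is to check nums against col1 directly. A small sketch, reusing the example above (the name missing is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 4, 3, 1, 3, 1, 5, 1, 4],
                   'col2': [.9, .7, .1, .3, .2, .4, .8, .2, .3, .5]})
df.iloc[1, 0] = 6  # there's no '2' in col1 anymore
nums = np.array([3, 1, 2])

res = df.groupby('col1')['col2'].first().reindex(nums)

# Values of nums that never occur in col1; these are the rows that come back as NaN.
missing = nums[~np.isin(nums, df['col1'].to_numpy())]
print(missing)  # [2]

# Keep only the values that were actually found.
found = res.dropna()
```

Checking membership with np.isin separates "value not present in col1" from the case where col2 is itself NaN for every row of a group, which would also show up as NaN in res.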