Skip to content
Advertisement

python – using an index in one series to find values in a separate dataframe with matching index

I have a for loop that is taking a subsample of my original dataset, doing a prediction from a previously fit model, and then i need to match the target value from the original dataframe to the prediction to calculate a different value.

20 lines from original subsample:

index       f0        f1         f2          product
89641   11.758713   -2.548885   5.007187    134.766305
30665   7.134050    -7.369558   3.990141    107.813044
71148   -13.860892  -2.727111   4.995418    137.945408
63263   -1.949113   6.340399    4.999270    134.766305
34301   2.741874    -5.114227   1.990971    57.085625
28150   -9.194978   -8.220917   4.000539    110.992147
37974   5.416532    -6.685454   3.997102    107.813044
63541   8.116958    -0.106199   1.992089    53.906522
69007   -0.886114   -8.732907   3.004329    84.038886
8808    -10.138814  -5.428649   3.996867    110.992147
77082   -7.427920   -9.558472   5.002233    137.945408
30523   0.780631    -1.872719   1.000312    30.132364
78523   3.096930    -6.854314   3.000831    84.038886
66519   4.459357    -6.787551   4.994414    134.766305
69231   10.113738   -10.433003  4.004866    107.813044
48418   -17.092959  -3.294716   1.999222    57.085625
59715   -0.970615   -1.741134   2.012687    57.085625
30159   -7.075355   -16.977595  4.997697    137.945408
34763   5.850225    -5.069475   2.994821    80.859783
99239   -8.493579   -8.126316   1.004643    30.132364

code:

r2_revenue = []
for i in range(1000):
    subsample = r2_test.sample(500,replace=True)
    features = subsample.drop(['product'],axis=1)
    predict = model2.predict(features)
    top_200 = pd.Series(predict).sort_values(ascending=False).iloc[:200]
    target = subsample['product'].isin(top_200)
    result = (revenue(target).sum())
    r2_revenue.append(result)

so, my “target” needs to find the index of each top_200 entry and then find the resulting entry in the ['product'] from the original subsample.

i am striking out on finding the way to take the index number from the series top_200 and find the corresponding product value from the original dataset.

i feel like i am missing something obvious, but searches like “matching an index from a series to a value in a dataframe” are turning up results for a single dataframe, not a series to a dataframe.

if i were looking up data, i’d use a .query() but i don’t know how to do that with an index to an index?

any input would be greatly appreciated!

:Edit to help clarify (hopefully):

so my series top_200 is predictions from the subsample dataframe. the index from the series should be the same as the index from the subsample dataframe. based on the index for a particular row, i want to look up a value in the product column of the subsample dataframe with the same index number.

so here is an example output for that series:

303    139.893243
203    138.886222
21     138.561583
296    138.535309
391    138.491757

the rows are 303,203,21,296 and 391. i now want to get the value in the column product from the subsample dataframe for the rows 303,203,21,296 and 391.

Advertisement

Answer

When you apply a condition to a Series the result is a boolean Series.

>>> s = pd.Series(range(10))
>>> s
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64
>>> q = s % 2 == 0
>>> q
0     True
1    False
2     True
3    False
4     True
5    False
6     True
7    False
8     True
9    False
dtype: bool

You can then use that boolean Series to filter the original.

>>> s[q]
0    0
2    2
4    4
6    6
8    8
dtype: int64

You can obtain the indices of the True values and use that to select from a like indexed Series

>>> q[q].index
Int64Index([0, 2, 4, 6, 8], dtype='int64')
>>> indices = q[q].index
>>> s[indices]
0    0
2    2
4    4
6    6
8    8
dtype: int64

>>> df = pd.DataFrame({'ex':range(10),'wye':list('zyxwvutsrq')}, index=list('abcdefghij'))
>>> df
   ex wye
a   0   z
b   1   y
c   2   x
d   3   w
e   4   v
f   5   u
g   6   t
h   7   s
i   8   r
j   9   q
>>> m = df.ex.isin([2,4,6,8])
>>> m
a    False
b    False
c     True
d    False
e     True
f    False
g     True
h    False
i     True
j    False
Name: ex, dtype: bool

>>> df.loc[m,'wye']
c    x
e    v
g    t
i    r
Name: wye, dtype: object

>>> m[m].index
Index(['c', 'e', 'g', 'i'], dtype='object')
>>> r = m[m].index
>>> df.loc[r,:]
   ex wye
c   2   x
e   4   v
g   6   t
i   8   r

the rows are 303,203,21,296 and 391. i now want to get the value in the column product from the subsample dataframe for the rows 303,203,21,296 and 391

In my example, the rows that meet the condition have the indices Index(['c', 'e', 'g', 'i'], dtype='object') and can be used to select the same rows of the 'wye' column.

>>> df.loc[r,'wye']
c    x
e    v
g    t
i    r
Name: wye, dtype: object

The indices were obtained by filtering the boolean Series for all the True values and accessing the index attribute of the result.

>>> m[m].index
Index(['c', 'e', 'g', 'i'], dtype='object')
>>> r = m[m].index

Pandas User Guide: Selection
Indexing and selecting data

User contributions licensed under: CC BY-SA
7 People found this is helpful
Advertisement