I have a for loop that is taking a subsample of my original dataset, doing a prediction from a previously fit model, and then i need to match the target
value from the original dataframe to the prediction to calculate a different value.
20 lines from original subsample:
index f0 f1 f2 product 89641 11.758713 -2.548885 5.007187 134.766305 30665 7.134050 -7.369558 3.990141 107.813044 71148 -13.860892 -2.727111 4.995418 137.945408 63263 -1.949113 6.340399 4.999270 134.766305 34301 2.741874 -5.114227 1.990971 57.085625 28150 -9.194978 -8.220917 4.000539 110.992147 37974 5.416532 -6.685454 3.997102 107.813044 63541 8.116958 -0.106199 1.992089 53.906522 69007 -0.886114 -8.732907 3.004329 84.038886 8808 -10.138814 -5.428649 3.996867 110.992147 77082 -7.427920 -9.558472 5.002233 137.945408 30523 0.780631 -1.872719 1.000312 30.132364 78523 3.096930 -6.854314 3.000831 84.038886 66519 4.459357 -6.787551 4.994414 134.766305 69231 10.113738 -10.433003 4.004866 107.813044 48418 -17.092959 -3.294716 1.999222 57.085625 59715 -0.970615 -1.741134 2.012687 57.085625 30159 -7.075355 -16.977595 4.997697 137.945408 34763 5.850225 -5.069475 2.994821 80.859783 99239 -8.493579 -8.126316 1.004643 30.132364
code:
r2_revenue = [] for i in range(1000): subsample = r2_test.sample(500,replace=True) features = subsample.drop(['product'],axis=1) predict = model2.predict(features) top_200 = pd.Series(predict).sort_values(ascending=False).iloc[:200] target = subsample['product'].isin(top_200) result = (revenue(target).sum()) r2_revenue.append(result)
so, my “target” needs to find the index of each top_200
entry and then find the resulting entry in the ['product']
from the original subsample.
i am striking out on finding the way to take the index number from the series top_200
and find the corresponding product
value from the original dataset.
i feel like i am missing something obvious, but searches like “matching an index from a series to a value in a dataframe” are turning up results for a single dataframe, not a series to a dataframe.
if i were looking up data, i’d use a .query() but i don’t know how to do that with an index to an index?
any input would be greatly appreciated!
:Edit to help clarify (hopefully):
so my series top_200 is predictions from the subsample dataframe. the index from the series should be the same as the index from the subsample dataframe. based on the index for a particular row, i want to look up a value in the product
column of the subsample dataframe with the same index number.
so here is an example output for that series:
303 139.893243 203 138.886222 21 138.561583 296 138.535309 391 138.491757
the rows are 303,203,21,296 and 391. i now want to get the value in the column product
from the subsample
dataframe for the rows 303,203,21,296 and 391.
Advertisement
Answer
When you apply a condition to a Series the result is a boolean Series.
>>> s = pd.Series(range(10)) >>> s 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 dtype: int64 >>> q = s % 2 == 0 >>> q 0 True 1 False 2 True 3 False 4 True 5 False 6 True 7 False 8 True 9 False dtype: bool
You can then use that boolean Series to filter the original.
>>> s[q] 0 0 2 2 4 4 6 6 8 8 dtype: int64
You can obtain the indices of the True values and use that to select from a like indexed Series
>>> q[q].index Int64Index([0, 2, 4, 6, 8], dtype='int64') >>> indices = q[q].index >>> s[indices] 0 0 2 2 4 4 6 6 8 8 dtype: int64
>>> df = pd.DataFrame({'ex':range(10),'wye':list('zyxwvutsrq')}, index=list('abcdefghij')) >>> df ex wye a 0 z b 1 y c 2 x d 3 w e 4 v f 5 u g 6 t h 7 s i 8 r j 9 q >>> m = df.ex.isin([2,4,6,8]) >>> m a False b False c True d False e True f False g True h False i True j False Name: ex, dtype: bool >>> df.loc[m,'wye'] c x e v g t i r Name: wye, dtype: object >>> m[m].index Index(['c', 'e', 'g', 'i'], dtype='object') >>> r = m[m].index >>> df.loc[r,:] ex wye c 2 x e 4 v g 6 t i 8 r
the rows are 303,203,21,296 and 391. i now want to get the value in the column product from the subsample dataframe for the rows 303,203,21,296 and 391
In my example, the rows that meet the condition have the indices Index(['c', 'e', 'g', 'i'], dtype='object')
and can be used to select the same rows of the 'wye'
column.
>>> df.loc[r,'wye'] c x e v g t i r Name: wye, dtype: object
The indices were obtained by filtering the boolean Series for all the True values and accessing the index attribute of the result.
>>> m[m].index Index(['c', 'e', 'g', 'i'], dtype='object') >>> r = m[m].index
Pandas User Guide:
Selection
Indexing and selecting data