How can I select top k rows based on another dataframe in python?

Question

I have data as follows. Users are numbered 1001 to 1004 (the actual data has one million users). Each user has a probability for each of the variables AT1 to AT6. I would like to select the top 3 users for each choice based on the following data. In the output, top1 to top3 are the top 3 users based on probability for

Accepted Answer

I have no idea how this will scale to a million rows, but have a go with this dictionary comprehension:

    import numpy as np
    import pandas as pd

    # Set up test df's and re-index.
    df_user = pd.DataFrame({
        "user": [1001, 1002, 1003, 1004],
        "AT2": [0.003, 0.1, 0.13, 0.23],
        "AT3": [0.03, 0.3, 0.22, 0.43],
        "AT5": [0.5, 0.1, 0.08, 0.04],
        "AT6": [0.453, 0.2, 0.2, 0.14]})
    df_user.set_index("user", inplace=True)

    df_client = pd.DataFrame({
        "client": [997, 223],
        "choice_1": ["AT2", "AT6"],
        "choice_2": ["AT3", "AT5"]})

    # dictionary comprehension: one entry per client, holding the top 3 users
    # for choice_1 followed by the top 3 users for choice_2
    pd.DataFrame({row["client"]: np.append(df_user[row["choice_1"]].nlargest(3).index.values,
                                           df_user[row["choice_2"]].nlargest(3).index.values)
                  for (i, row) in df_client.iterrows()}).T

You still have to rename the columns of the output, obviously.

Short explanation: run the following code to see that the items yielded by df_client.iterrows() are tuples of (a) the index label of the row and (b) the row itself as a Series.

    for it in df_client.iterrows():
        print(it)

Once you've run that last snippet, the variable it holds the tuple for the last row of df_client, so set row = it[1] to experiment with the various bits of information that you can extract from it. In particular, row["choice_1"] gives you a column name such as "AT6", which you can use to select the corresponding column from df_user, on which you can then call the pandas nlargest method. The dictionary comprehension follows once you've put all the bits together.
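Since the answer is unsure about scaling, here is one possible variation, offered as a sketch rather than a tested solution for a million users. It assumes the same test data as above and computes the top 3 users once per probability column, so each client row only needs dictionary lookups instead of a fresh nlargest call. The names K, top_users, column_names and result, and the output column labels such as choice_1_top1, are made up for illustration and are not from the question.

```python
import pandas as pd

# Same test data as in the answer above.
df_user = pd.DataFrame({
    "user": [1001, 1002, 1003, 1004],
    "AT2": [0.003, 0.1, 0.13, 0.23],
    "AT3": [0.03, 0.3, 0.22, 0.43],
    "AT5": [0.5, 0.1, 0.08, 0.04],
    "AT6": [0.453, 0.2, 0.2, 0.14],
}).set_index("user")

df_client = pd.DataFrame({
    "client": [997, 223],
    "choice_1": ["AT2", "AT6"],
    "choice_2": ["AT3", "AT5"],
})

K = 3  # number of top users to keep per choice (illustrative name)

# Compute the top-K users once per probability column,
# instead of once per client row.
top_users = {col: df_user[col].nlargest(K).index.tolist()
             for col in df_user.columns}

# Hypothetical output column labels; rename to whatever you need.
column_names = [f"choice_{c}_top{r}" for c in (1, 2) for r in range(1, K + 1)]

# For each client, concatenate the precomputed lists for its two choices.
rows = [top_users[c1] + top_users[c2]
        for c1, c2 in zip(df_client["choice_1"], df_client["choice_2"])]

result = pd.DataFrame(rows,
                      index=pd.Index(df_client["client"], name="client"),
                      columns=column_names)
print(result)
```

The number of nlargest calls here depends on the number of probability columns rather than the number of clients, which may help when there are many clients and only a few AT columns.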

Answer