
How can I select the top k rows based on another dataframe in Python?

I have data as follows. The users are 1001 to 1004 (the actual data has one million users). Each user has a probability for each of the variables AT1 to AT6.

JavaScript

I would like to select the top 3 users for each choice based on the following data.

JavaScript

In the output, top1 to top3 are the top 3 users by probability for choice_1, while top4 to top6 are the top 3 users for choice_2. The client id is given, not computed. The value of N is also given rather than computed: it is 3 for each choice. The output should look like this:

JavaScript

How can I construct the last dataframe in Python?


Answer

I have no idea how this will scale to a million rows, but have a go with this dictionary comprehension:

JavaScript

Output (posted as an image in the original answer; you still have to rename the columns, obviously).

Short explanation: run the following code to see that each item yielded by df.iterrows() is a tuple of (a) the index label of the row and (b) the row itself as a Series keyed by column name.

JavaScript

Once you’ve run that last snippet, the variable it will hold the tuple for the last row of df_client, so set row = it[1] to experiment with the various bits of information that you can extract from it. In particular, row["choice_1"] gives you something like "AT1", which you can use to select the corresponding column from df_user, and on that column you can call the pandas nlargest method. The dictionary comprehension follows directly once you’ve put all the bits together.
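
The steps described above can be put together roughly as follows. This is a minimal sketch, not necessarily the code from the original answer. The sample values, the column names user and client_id, the helper top_users, and the constants TOP_N and CHOICE_COLS are all assumptions made for illustration, since the original tables are not shown here.

```python
import pandas as pd

# Hypothetical sample data (the original tables are not available).
# Assumption: df_user is indexed by user id, one probability column per variable.
df_user = pd.DataFrame(
    {
        "user": [1001, 1002, 1003, 1004],
        "AT1": [0.10, 0.40, 0.30, 0.20],
        "AT2": [0.50, 0.10, 0.20, 0.90],
        "AT3": [0.70, 0.60, 0.80, 0.05],
    }
).set_index("user")

# Assumption: df_client holds one row per client with the chosen variable names.
df_client = pd.DataFrame(
    {
        "client_id": [1, 2],
        "choice_1": ["AT1", "AT3"],
        "choice_2": ["AT2", "AT1"],
    }
)

TOP_N = 3                               # given, not computed
CHOICE_COLS = ["choice_1", "choice_2"]  # hypothetical column names


def top_users(row):
    """Return the TOP_N user ids for each choice in this client row, concatenated."""
    users = []
    for col in CHOICE_COLS:
        # row[col] is a column name such as "AT1"; nlargest keeps the user-id index.
        users.extend(df_user[row[col]].nlargest(TOP_N).index.tolist())
    return users


# iterrows() yields (index label, row Series) tuples; only the row is needed here.
result = pd.DataFrame.from_dict(
    {row["client_id"]: top_users(row) for _, row in df_client.iterrows()},
    orient="index",
    columns=[f"top{i}" for i in range(1, TOP_N * len(CHOICE_COLS) + 1)],
)
result.index.name = "client_id"
print(result)
```

Each client row triggers one nlargest call per choice over the full user column, so the cost grows with the number of client rows times the number of users. Whether that is acceptable for a million users depends on how many clients there are.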

User contributions licensed under: CC BY-SA