I have data as follows. Users are 1001 to 1004 (but actual data has one million users). Each user has corresponding probabilities for the variables AT1 to AT6.
user   AT1    AT2    AT3   AT4   AT5   AT6
1001   0.004  0.003  0.03  0.01  0.5   0.453
1002   0.2    0.1    0.3   0.1   0.1   0.2
1003   0.07   0.13   0.22  0.3   0.08  0.2
1004   0.01   0.23   0.43  0.15  0.04  0.14
I would like to select the top 3 users for each choice based on the following data.
client  choice_1  choice_2
997     AT2       AT3
223     AT6       AT5
444     AT1       AT4
121     AT1       AT5
In the output, top1 to top3 are the top 3 users by probability for choice_1, while top4 to top6 are the top 3 users for choice_2. The client id is given, not computed, and N is fixed at 3 for each choice. The output should look like this:
client  top1  top2  top3  top4  top5  top6
997     1004  1003  1002  1004  1002  1003
223     1001  1002  1003  1001  1002  1003
444     1002  1003  1004  1004  1003  1002
121     1002  1003  1004  1001  1002  1003
How can I construct the last dataframe in Python with pandas?
Answer
I have no idea how this will scale to a million rows, but have a go with this dictionary comprehension:
import numpy as np
import pandas as pd

# Set up test df's and re-index the users by their id.
df_user = pd.DataFrame({
    "user": [1001, 1002, 1003, 1004],
    "AT2": [0.003, 0.1, 0.13, 0.23],
    "AT3": [0.03, 0.3, 0.22, 0.43],
    "AT5": [0.5, 0.1, 0.08, 0.04],
    "AT6": [0.453, 0.2, 0.2, 0.14]
})
df_user.set_index("user", inplace=True)

df_client = pd.DataFrame({
    "client": [997, 223],
    "choice_1": ["AT2", "AT6"],
    "choice_2": ["AT3", "AT5"]
})

# Dictionary comprehension: one entry per client, holding the top 3 user ids
# for choice_1 followed by the top 3 user ids for choice_2.
pd.DataFrame({
    row["client"]: np.append(
        df_user[row["choice_1"]].nlargest(3).index.values,
        df_user[row["choice_2"]].nlargest(3).index.values)
    for (i, row) in df_client.iterrows()
}).T
The result has one row per client, indexed by client id, and six columns labelled 0 to 5, which you still have to rename.
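One possible way to do the renaming is sketched below. It rebuilds the same test frames so that it runs on its own; the variable name result and the use of rename_axis plus reset_index to turn the client ids into a regular column are my own choices here, not the only way to do it.

```python
import numpy as np
import pandas as pd

# Same test data as in the answer, with users as the index.
df_user = pd.DataFrame({
    "user": [1001, 1002, 1003, 1004],
    "AT2": [0.003, 0.1, 0.13, 0.23],
    "AT3": [0.03, 0.3, 0.22, 0.43],
    "AT5": [0.5, 0.1, 0.08, 0.04],
    "AT6": [0.453, 0.2, 0.2, 0.14],
}).set_index("user")

df_client = pd.DataFrame({
    "client": [997, 223],
    "choice_1": ["AT2", "AT6"],
    "choice_2": ["AT3", "AT5"],
})

# Same dictionary comprehension as in the answer.
result = pd.DataFrame({
    row["client"]: np.append(
        df_user[row["choice_1"]].nlargest(3).index.values,
        df_user[row["choice_2"]].nlargest(3).index.values,
    )
    for _, row in df_client.iterrows()
}).T

# Rename columns 0..5 to top1..top6 and move the client ids into a column.
result.columns = [f"top{i}" for i in range(1, 7)]
result = result.rename_axis("client").reset_index()
print(result)
```

After this, result has the columns client, top1, ..., top6, matching the layout asked for in the question.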
Short explanation: run the following code to see that the items yielded by df_client.iterrows() are tuples of (a) the index label of the row and (b) the row itself as a Series.

for it in df_client.iterrows():
    print(it)

Once you've run that last snippet, the variable it will hold the tuple for the last row of df_client, so set row = it[1] to experiment with the various bits of information that you can extract from the row. In particular, row["choice_1"] gives you something like "AT1", which you can use to select the corresponding column from df_user, on which you can then call the pandas nlargest method. The dictionary comprehension follows once you've put all the bits together.