I have a multiindex pandas dataframe that looks like this (called p_z):
                     p_z
entry subentry
0     0         0.338738
      1         0.636035
      2        -0.307365
      3        -0.167779
      4         0.243284
...                  ...
26692 891      -0.459227
      892       0.055993
      893      -0.469857
      894       0.192554
      895       0.155738

[11742280 rows x 1 columns]
I want to be able to select certain rows based on another dataframe (or numpy array) which is multidimensional. It would look like this as a pandas dataframe (called tofpid):
                tofpid
entry subentry
0     0              0
      1              2
      2              4
      3              5
      4              7
...                ...
26692 193          649
      194          670
      195          690
      196          725
      197          737

[2006548 rows x 1 columns]
I also have it as an awkward array, where it’s a (26692, ) array (each entry has a variable number of subentries). This is a selection df/array that tells the p_z df which rows to keep. So in entry 0 of p_z, it should keep subentries 0, 2, 4, 5, 7, etc.
I can’t find a way to get this done in pandas. I’m new to pandas, and even newer to multiindex, but I feel there ought to be a way to do this. If it can be broadcast, even better, as I’ll be doing this over ~1500 dataframes of similar size. If it helps, these dataframes are from a *.root file imported using uproot (if there’s another way to do this without pandas, I’ll take it, but I would love to use pandas to keep things organised).
Edit: Here’s a reproducible example (courtesy of Jim Pavinski’s answer; thanks!).
import awkward as ak
import pandas as pd
>>> p_z = ak.Array([[ 0.338738, 0.636035, -0.307365, -0.167779, 0.243284, 0.338738, 0.636035],
...                 [-0.459227, 0.055993, -0.469857, 0.192554, 0.155738, -0.459227]])
>>> p_z = ak.to_pandas(p_z)
>>> tofpid = ak.Array([[0, 2, 4, 5], [1, 2, 4]])
>>> tofpid = ak.to_pandas(tofpid)
Both of these dataframes are produced natively in uproot, but this will reproduce the same dataframes that uproot would (using the awkward library).
Answer
IIUC:
Input data:
>>> p_z
                     p_z
entry subentry
0     0         0.338738
      1         0.636035
      2        -0.307365
      3        -0.167779
      4         0.243284
>>> tofpid
                tofpid
entry subentry
0     0              0
      1              2
      2              4
      3              5
      4              7
Create a new multiindex from the columns (entry, tofpid) of your second dataframe:
mi = pd.MultiIndex.from_frame(tofpid.reset_index(level='subentry', drop=True)
                                    .reset_index())
Output result:
>>> p_z.loc[mi.intersection(p_z.index)]
              p_z
entry
0     0  0.338738
      2 -0.307365
      4  0.243284
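For reference, here is a minimal end-to-end sketch of the same approach applied to the two-entry example from the question. It builds the frames with plain pandas rather than awkward/uproot, purely as an assumption so the snippet runs on its own; the helper `make_frame` and the variable `selected` are hypothetical names, not part of either library.

```python
import pandas as pd


def make_frame(nested, name):
    """Hypothetical helper: build an (entry, subentry)-indexed frame from a list of lists."""
    tuples = [(e, s) for e, row in enumerate(nested) for s in range(len(row))]
    index = pd.MultiIndex.from_tuples(tuples, names=["entry", "subentry"])
    values = [v for row in nested for v in row]
    return pd.DataFrame({name: values}, index=index)


p_z = make_frame(
    [[0.338738, 0.636035, -0.307365, -0.167779, 0.243284, 0.338738, 0.636035],
     [-0.459227, 0.055993, -0.469857, 0.192554, 0.155738, -0.459227]],
    "p_z",
)
tofpid = make_frame([[0, 2, 4, 5], [1, 2, 4]], "tofpid")

# Pair each entry number with the subentry numbers stored in the tofpid column.
mi = pd.MultiIndex.from_frame(
    tofpid.reset_index(level="subentry", drop=True).reset_index()
)

# Keep only the (entry, subentry) pairs that actually exist in p_z,
# then select those rows in one vectorised lookup.
selected = p_z.loc[mi.intersection(p_z.index)]
print(selected)
```

The intersection step guards against a `KeyError` when tofpid points at a subentry that p_z does not contain, as with subentries 5 and 7 in the truncated input above.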