Use a multidimensional index on a MultiIndex pandas dataframe?

I have a multiindex pandas dataframe that looks like this (called p_z):

                     p_z
entry subentry
0     0         0.338738
      1         0.636035
      2        -0.307365
      3        -0.167779
      4         0.243284
...                  ...
26692 891      -0.459227
      892       0.055993
      893      -0.469857
      894       0.192554
      895       0.155738

[11742280 rows x 1 columns]

JavaScript
​x
 
                     p_z
entry subentry
0     0         0.338738
      1         0.636035
      2        -0.307365
      3        -0.167779
      4         0.243284
...                  ...
26692 891      -0.459227
      892       0.055993
      893      -0.469857
      894       0.192554
      895       0.155738
​
[11742280 rows x 1 columns]
​

I want to be able to select certain rows based on another dataframe (or numpy array) which is multidimensional. It would look like this as a pandas dataframe (called tofpid):

                tofpid
entry subentry
0     0              0
      1              2
      2              4
      3              5
      4              7
...                ...
26692 193          649
      194          670
      195          690
      196          725
      197          737

[2006548 rows x 1 columns]

JavaScript
 
                tofpid
entry subentry
0     0              0
      1              2
      2              4
      3              5
      4              7
...                ...
26692 193          649
      194          670
      195          690
      196          725
      197          737
​
[2006548 rows x 1 columns]
​

I also have it as an awkward array, where it’s a (26692, ) array (each of the entries has a non-standard number of subentries). This is a selection df/array that tells the p_z df which rows to keep. So in entry 0 of p_z, it should keep subentries 0, 2, 4, 5, 7, etc.

I can’t find a way to get this done in pandas. I’m new to pandas, and even newer to multiindex; but I feel there ought to be a way to do this. If it’s able to be broadcast even better as I’ll be doing this over ~1500 dataframes of similar size. If it helps, these dataframes are from a *.root file imported using uproot (if there’s another way to do this without pandas, I’ll take it; but I would love to use pandas to keep things organised).

Edit: Here’s a reproducible example (courtesy of Jim Pavinski’s answer; thanks!).

import awkward as ak
import pandas as pd

>>> p_z = ak.Array([[ 0.338738, 0.636035, -0.307365, -0.167779, 0.243284,  
                      0.338738, 0.636035],
                    [-0.459227, 0.055993, -0.469857,  0.192554, 0.155738, 
                     -0.459227]])
>>> p_z = ak.to_pandas(p_z)
>>> tofpid = ak.Array([[0, 2, 4, 5], [1, 2, 4]])
>>> tofpid = ak.to_pandas(tofpid)

JavaScript
 
import awkward as ak
import pandas as pd
​
>>> p_z = ak.Array([[ 0.338738, 0.636035, -0.307365, -0.167779, 0.243284,  
                      0.338738, 0.636035],
                    [-0.459227, 0.055993, -0.469857,  0.192554, 0.155738, 
                     -0.459227]])
>>> p_z = ak.to_pandas(p_z)
>>> tofpid = ak.Array([[0, 2, 4, 5], [1, 2, 4]])
>>> tofpid = ak.to_pandas(tofpid)
​

Both of these dataframes are produced natively in uproot, but this will reproduce the same dataframes that uproot would (using the awkward library).

Answer

IIUC:

Input data:

>>> p_z
                     p_z
entry subentry
0     0         0.338738
      1         0.636035
      2        -0.307365
      3        -0.167779
      4         0.243284

>>> tofpid
                tofpid
entry subentry
0     0              0
      1              2
      2              4
      3              5
      4              7

JavaScript
 
>>> p_z
                     p_z
entry subentry
0     0         0.338738
      1         0.636035
      2        -0.307365
      3        -0.167779
      4         0.243284
​
>>> tofpid
                tofpid
entry subentry
0     0              0
      1              2
      2              4
      3              5
      4              7
​

Create a new multiindex from the columns (entry, tofpid) of your second dataframe:

mi = pd.MultiIndex.from_frame(tofpid.reset_index(level='subentry', drop=True)
                                    .reset_index())

JavaScript
 
mi = pd.MultiIndex.from_frame(tofpid.reset_index(level='subentry', drop=True)
                                    .reset_index())
​

Output result:

>>> p_z.loc[mi.intersection(p_z.index)]
              p_z
entry
0     0  0.338738
      2 -0.307365
      4  0.243284

JavaScript
 
>>> p_z.loc[mi.intersection(p_z.index)]
              p_z
entry
0     0  0.338738
      2 -0.307365
      4  0.243284
​

Advertisement

Answer