Skip to content
Advertisement

Boolean Masking on a Pandas Dataframe where columns may not exist

I have a dataframe called compare that looks like this:

Resident 1xdisc 1xdisc_doc conpark parking parking_doc conmil conmil_doc pest pest_doc pet pet1x pet_doc rent rent_doc stlc storage trash trash_doc water water_doc
John 0 -500 0 50 50 0 0 3 3 0 0 0 1803 1803 0 0 30 30 0 0
Cheldone -500 0 0 50 50 0 0 1.25 1.25 0 0 0 1565 1565 0 0 30 30 0 0
Dieu -300 -300 0 0 0 0 0 3 3 0 0 0 1372 1372 0 0 18 18 0 0

Here is the dataframe in a form that can be copied and pasted:

,Resident,1xdisc,1xdisc_doc,conpark,parking,parking_doc,conmil,conmil_doc,pest,pest_doc,pet,pet1x,pet_doc,rent,rent_doc,stlc,storage,trash,trash_doc,water,water_doc 0,Acacia,0,0,0,0,0,0,-500,3.0,3.0,0,0,70,2067,2067,0,0,15,15,0,0 1,ashley,0,0,0,0,0,0,0,3.0,3.0,0,0,0,2067,2067,0,0,15,15,0,0 2,Sheila,0,0,0,0,0,0,0,0.0,0.0,0,0,0,1574,1574,0,0,0,0,0,0 3,Brionne,0,0,0,0,0,0,0,0.0,0.0,0,0,0,1787,1787,0,0,0,0,0,0 4,Danielle,0,0,0,0,0,0,0,0.0,0.0,0,0,0,1422,0,0,0,0,0,0,0 5,Nmesomachi,0,0,0,0,0,0,0,0.0,0.0,0,0,0,1675,1675,0,0,0,0,0,0 6,Doaa,0,0,0,0,0,0,0,0.0,0.0,0,0,0,1306,1306,0,0,0,0,0,0 7,Reynaldo,0,0,0,0,0,0,0,0.0,0.0,0,0,0,1685,1685,0,0,0,0,0,0 8,Shajuan,0,0,0,0,0,0,0,0.0,0.0,0,0,0,1768,0,0,0,0,0,0,0 9,Dalia,0,0,0,0,0,0,0,0.0,0.0,0,0,0,1596,1596,0,0,0,0,0,0

I want to create another dataframe using boolean masking that only contains rows where there are mismatches various sets of columns. For example, where parking doesn’t match parking_doc or conmil doesn’t match conmil_doc.

Here is the code I am using currently:

nonmatch = compare[((compare['1xdisc']!=compare['1xdisc_doc']) & 
                 (compare['conpark']!=compare['1xdisc']))
               | (compare['rent']!=compare['rent_doc']) 
               | (compare['parking']!=compare['parking_doc']) 
               | (compare['trash']!=compare['trash_doc'])
               | (compare['pest']!=compare['pest_doc'])
               | (compare['stlc']!=compare['stlc_doc'])
               | (compare['pet']!=compare['pet_doc'])
               |(compare['conmil']!=compare['conmil_doc'])
              ]

The problem I’m having is that some columns may not always exist, for example stlc_doc or pet_doc. How do I select rows with mismatches, but only check for mismatches for particular columns if the columns exist?

Advertisement

Answer

If the column names doesn’t always exist, you can either add the columns that doesn’t exist which I don’t think will be a good idea since you will have to replicate the corresponding columns which will eventually increase the size of the dataframe.

So, another approach might be to filter the column names themselves and take only the column pairs that exists:

Given DataFrame:

>>> df.head(3)
  Resident  1xdisc  1xdisc_doc  conpark  parking  parking_doc  conmil  conmil_doc  pest  pest_doc  pet  pet1x  pet_doc  rent  rent_doc  stlc  storage  trash  trash_doc  water  water_doc 
0   Acacia       0           0        0        0            0       0        -500   3.0       3.0    0      0       70  2067      2067     0        0     15         15      0           0
1   ashley       0           0        0        0            0       0           0   3.0       3.0    0      0        0  2067      2067     0        0     15         15      0           0
2   Sheila       0           0        0        0            0       0           0   0.0       0.0    0      0        0  1574      1574     0        0      0          0      0           0

Take out the columns pairs:

>>> maskingCols = [(col[:-4], col) for col in df if col[:-4] in df and col.endswith('_doc')]
maskingCols
[('1xdisc', '1xdisc_doc'), ('parking', 'parking_doc'), ('conmil', 'conmil_doc'), ('pest', 'pest_doc'), ('pet', 'pet_doc'), ('rent', 'rent_doc'), ('trash', 'trash_doc')]

Now that you have the column pairs, you can create the expression required to mask the dataframe.

>>> "|".join(f"(df['{col1}'] != df['{col2}'])" for col1, col2 in maskingCols)

"(df['1xdisc'] != df['1xdisc_doc'])|(df['parking'] != df['parking_doc'])|(df['conmil'] != df['conmil_doc'])|(df['pest'] != df['pest_doc'])|(df['pet'] != df['pet_doc'])|(df['rent'] != df['rent_doc'])|(df['trash'] != df['trash_doc'])"

You can simply pass this expression string to eval function to evaluate it.

>>> eval("|".join(f"(df['{col1}'] != df['{col2}'])" for col1, col2 in maskingCols))

You can add other criteria other than this masking:

>>> eval("|".join(f"(df['{col1}'] != df['{col2}'])" for col1, col2 in maskingCols)) | ((df['1xdisc']!=df['1xdisc_doc']) & (df['conpark']!=df['1xdisc']))
0     True
1    False
2    False
3    False
4     True
5    False
6    False
7    False
8     True
9    False
dtype: bool

You can use it to get your desired dataframe:

>>> df[eval("|".join(f"(df['{col1}'] != df['{col2}'])" for col1, col2 in maskingCols)) | ((df['1xdisc']!=df['1xdisc_doc']) & (df['conpark']!=df['1xdisc']))]

OUTPUT:

 Resident  1xdisc  1xdisc_doc  conpark  parking  parking_doc  conmil  conmil_doc  pest  pest_doc  pet  pet1x  pet_doc  rent  rent_doc  stlc  storage  trash  trash_doc  water  water_doc 
0    Acacia       0           0        0        0            0       0        -500   3.0       3.0    0      0       70  2067      2067     0        0     15         15      0           0
4  Danielle       0           0        0        0            0       0           0   0.0       0.0    0      0        0  1422         0     0        0      0          0      0           0
8   Shajuan       0           0        0        0            0       0           0   0.0       0.0    0      0        0  1768         0     0        0      0          0      0           0

User contributions licensed under: CC BY-SA
10 People found this is helpful
Advertisement