I have a dataframe called compare
that looks like this:
Resident | 1xdisc | 1xdisc_doc | conpark | parking | parking_doc | conmil | conmil_doc | pest | pest_doc | pet | pet1x | pet_doc | rent | rent_doc | stlc | storage | trash | trash_doc | water | water_doc |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
John | 0 | -500 | 0 | 50 | 50 | 0 | 0 | 3 | 3 | 0 | 0 | 0 | 1803 | 1803 | 0 | 0 | 30 | 30 | 0 | 0 |
Cheldone | -500 | 0 | 0 | 50 | 50 | 0 | 0 | 1.25 | 1.25 | 0 | 0 | 0 | 1565 | 1565 | 0 | 0 | 30 | 30 | 0 | 0 |
Dieu | -300 | -300 | 0 | 0 | 0 | 0 | 0 | 3 | 3 | 0 | 0 | 0 | 1372 | 1372 | 0 | 0 | 18 | 18 | 0 | 0 |
Here is the dataframe in a form that can be copied and pasted:
,Resident,1xdisc,1xdisc_doc,conpark,parking,parking_doc,conmil,conmil_doc,pest,pest_doc,pet,pet1x,pet_doc,rent,rent_doc,stlc,storage,trash,trash_doc,water,water_doc 0,Acacia,0,0,0,0,0,0,-500,3.0,3.0,0,0,70,2067,2067,0,0,15,15,0,0 1,ashley,0,0,0,0,0,0,0,3.0,3.0,0,0,0,2067,2067,0,0,15,15,0,0 2,Sheila,0,0,0,0,0,0,0,0.0,0.0,0,0,0,1574,1574,0,0,0,0,0,0 3,Brionne,0,0,0,0,0,0,0,0.0,0.0,0,0,0,1787,1787,0,0,0,0,0,0 4,Danielle,0,0,0,0,0,0,0,0.0,0.0,0,0,0,1422,0,0,0,0,0,0,0 5,Nmesomachi,0,0,0,0,0,0,0,0.0,0.0,0,0,0,1675,1675,0,0,0,0,0,0 6,Doaa,0,0,0,0,0,0,0,0.0,0.0,0,0,0,1306,1306,0,0,0,0,0,0 7,Reynaldo,0,0,0,0,0,0,0,0.0,0.0,0,0,0,1685,1685,0,0,0,0,0,0 8,Shajuan,0,0,0,0,0,0,0,0.0,0.0,0,0,0,1768,0,0,0,0,0,0,0 9,Dalia,0,0,0,0,0,0,0,0.0,0.0,0,0,0,1596,1596,0,0,0,0,0,0
I want to create another dataframe using boolean masking that only contains rows where there are mismatches various sets of columns. For example, where parking
doesn’t match parking_doc
or conmil
doesn’t match conmil_doc
.
Here is the code I am using currently:
nonmatch = compare[((compare['1xdisc']!=compare['1xdisc_doc']) & (compare['conpark']!=compare['1xdisc'])) | (compare['rent']!=compare['rent_doc']) | (compare['parking']!=compare['parking_doc']) | (compare['trash']!=compare['trash_doc']) | (compare['pest']!=compare['pest_doc']) | (compare['stlc']!=compare['stlc_doc']) | (compare['pet']!=compare['pet_doc']) |(compare['conmil']!=compare['conmil_doc']) ]
The problem I’m having is that some columns may not always exist, for example stlc_doc
or pet_doc
. How do I select rows with mismatches, but only check for mismatches for particular columns if the columns exist?
Advertisement
Answer
If the column names doesn’t always exist, you can either add the columns that doesn’t exist which I don’t think will be a good idea since you will have to replicate the corresponding columns which will eventually increase the size of the dataframe.
So, another approach might be to filter the column names themselves and take only the column pairs that exists:
Given DataFrame:
>>> df.head(3) Resident 1xdisc 1xdisc_doc conpark parking parking_doc conmil conmil_doc pest pest_doc pet pet1x pet_doc rent rent_doc stlc storage trash trash_doc water water_doc 0 Acacia 0 0 0 0 0 0 -500 3.0 3.0 0 0 70 2067 2067 0 0 15 15 0 0 1 ashley 0 0 0 0 0 0 0 3.0 3.0 0 0 0 2067 2067 0 0 15 15 0 0 2 Sheila 0 0 0 0 0 0 0 0.0 0.0 0 0 0 1574 1574 0 0 0 0 0 0
Take out the columns pairs:
>>> maskingCols = [(col[:-4], col) for col in df if col[:-4] in df and col.endswith('_doc')] maskingCols [('1xdisc', '1xdisc_doc'), ('parking', 'parking_doc'), ('conmil', 'conmil_doc'), ('pest', 'pest_doc'), ('pet', 'pet_doc'), ('rent', 'rent_doc'), ('trash', 'trash_doc')]
Now that you have the column pairs, you can create the expression required to mask the dataframe.
>>> "|".join(f"(df['{col1}'] != df['{col2}'])" for col1, col2 in maskingCols) "(df['1xdisc'] != df['1xdisc_doc'])|(df['parking'] != df['parking_doc'])|(df['conmil'] != df['conmil_doc'])|(df['pest'] != df['pest_doc'])|(df['pet'] != df['pet_doc'])|(df['rent'] != df['rent_doc'])|(df['trash'] != df['trash_doc'])"
You can simply pass this expression string to eval
function to evaluate it.
>>> eval("|".join(f"(df['{col1}'] != df['{col2}'])" for col1, col2 in maskingCols))
You can add other criteria other than this masking:
>>> eval("|".join(f"(df['{col1}'] != df['{col2}'])" for col1, col2 in maskingCols)) | ((df['1xdisc']!=df['1xdisc_doc']) & (df['conpark']!=df['1xdisc'])) 0 True 1 False 2 False 3 False 4 True 5 False 6 False 7 False 8 True 9 False dtype: bool
You can use it to get your desired dataframe:
>>> df[eval("|".join(f"(df['{col1}'] != df['{col2}'])" for col1, col2 in maskingCols)) | ((df['1xdisc']!=df['1xdisc_doc']) & (df['conpark']!=df['1xdisc']))]
OUTPUT:
Resident 1xdisc 1xdisc_doc conpark parking parking_doc conmil conmil_doc pest pest_doc pet pet1x pet_doc rent rent_doc stlc storage trash trash_doc water water_doc 0 Acacia 0 0 0 0 0 0 -500 3.0 3.0 0 0 70 2067 2067 0 0 15 15 0 0 4 Danielle 0 0 0 0 0 0 0 0.0 0.0 0 0 0 1422 0 0 0 0 0 0 0 8 Shajuan 0 0 0 0 0 0 0 0.0 0.0 0 0 0 1768 0 0 0 0 0 0 0