Using Python how can you use a group-by to filter this dataset
Start
First Last Location ID1 ID2 First3 Last3 John Smith Toronto JohnToronto SmithToronto Joh Smi Joh Smith Toronto JohToronto SmithToronto Joh Smi Steph Sax Vancouver StephVancouver SaxVancouver Ste Sax Steph Sa Vancouver StephVancouver SaxeVancouver Ste Sax Stacy Lee Markham StacyMarkham LeeMarkham Sta Lee Stac Lee Markham StacMarkham LeeMarkham Sta Lee Stac Wong Aurora StacAurora LeeAurora Sta Won Stac Lee Newmarket StacNewmarket LeeNewmarket Sta Lee Steve Smith Toronto SteveToronto SmithToronto Ste Smi John Jones Toronto JohnToronto JonesToronto Joh Jon
How can I make it so that where either the two conditions are accepted, filtering everything else that doesn’t meet these two criteria
- ID1 – Matches another ID1 and the Last3 are the same
- ID2 – Matches another ID2 and the First 3 are the same
End
First Last Location ID1 ID2 First3 Last3 John Smith Toronto JohnToronto SmithToronto Joh Smi Joh Smith Toronto JohToronto SmithToronto Joh Smi Steph Sax Vancouver StephVancouver SaxVancouver Ste Sax Steph Sa Vancouver StephVancouver SaxeVancouver Ste Sax Stacy Lee Markham StacyMarkham LeeMarkham Sta Lee Stac Lee Markham StacMarkham LeeMarkham Sta Lee
Advertisement
Answer
Based on comment for clarification of the problem statement –
trying to groupby ID1 or ID2. And then depending which ID filter if Last3 col and First3 Col are the same respectively
Try this approach –
#group by ID1 and check if duplicates in last3. Then extract the index number that satisfies condition c1 = df.groupby('ID1').apply(pd.DataFrame.duplicated, subset=['Last3'], keep=False) c1_idx = c1[c1].droplevel(0).index #group by ID2 and check if duplicates in first3. Then extract the index number that satisfies condition c2 = df.groupby('ID2').apply(pd.DataFrame.duplicated, subset=['First3'], keep=False) c2_idx = c2[c2].droplevel(0).index #take a union of the 2 indexes and then .. #filter dataframe for the indexes that meet the 2 independent conditions output = df.iloc[c1_idx.union(c2_idx)] print(output)
First Last Location ID1 ID2 First3 Last3 0 John Smith Toronto JohnToronto SmithToronto Joh Smi 1 Joh Smith Toronto JohToronto SmithToronto Joh Smi 2 Steph Sax Vancouver StephVancouver SaxVancouver Ste Sax 3 Steph Sa Vancouver StephVancouver SaxeVancouver Ste Sax 4 Stacy Lee Markham StacyMarkham LeeMarkham Sta Lee 5 Stac Lee Markham StacMarkham LeeMarkham Sta Lee
EDIT: Modifying the above answer provided by @SomeDude, you can run this as 2 independent conditions without a groupby and take an OR between them as well –
m1 = df.duplicated(subset=['ID1','Last3'],keep=False) m2 = df.duplicated(subset=['ID2','First3'],keep=False) df[m1 | m2]