PATIENT_ID | PATHOLOGIES |
---|---|
12 | null |
12 | null |
3 | patho1 |
3 | null |
5 | patho2 |
2 | patho1 |
12 | null |
If you can see, patient ID 12 is always null but others can be null or has pathologies if the same ID is always null, I want to delete it with the related rows in all columns note: I have 2 million ID, so I want a code to search for the ID’s (Python, CSV)
Advertisement
Answer
To remove all patients with only "null"
variables you can use this example:
import csv from itertools import groupby with open("input.csv", "r") as f_in: reader = csv.reader(f_in) next(reader) # skip header out = [] for id_, g in groupby(sorted(reader), lambda k: k[0]): g = list(g) if all(pathology == "null" for _, pathology in g): continue out.extend(g) with open("output.csv", "w") as f_out: writer = csv.writer(f_out) writer.writerow(["PATIENT_ID", "PATHOLOGIES"]) writer.writerows(out)
This creates output.csv
:
PATIENT_ID | PATHOLOGIES |
---|---|
2 | patho1 |
3 | null |
3 | patho1 |
5 | patho2 |