This is in continuation from my previous question. I have two files, `file1.csv` and a large CSV called `master_file.csv`. They have several columns and share a common column called `EMP_Code`.
File 1 example:
| EMP_name | EMP_Code | EMP_dept |
| --- | --- | --- |
| b | f367 | abc |
| a | c264 | xyz |
| c | d264 | abc |
`master_file.csv` example:
| EMP_name | EMP_age | EMP_Service | EMP_Code | EMP_dept |
| --- | --- | --- | --- | --- |
| a | 30 | 6 | c264 | xyz |
| b | 29 | 3 | f367 | abc |
| r | 27 | 1 | g364 | lmn |
| d | 45 | 10 | c264 | abc |
| t | 50 | 25 | t453 | lmn |
I want to extract the matching rows from `master_file.csv` using all the `EMP_Code` values in `file1.csv`. I tried the following code and I am losing a lot of data. I cannot read the complete master CSV file because it is around 20 GB with millions of rows, and I run out of memory. I want to read `master_file.csv` in chunks, extract the complete rows for each `EMP_Code` present in `file1.csv`, and save them into a new file, `Employee_full_data.csv`.
```python
import csv
import pandas as pd

df = pd.read_csv(r"master_file.csv")
li = ['c264', 'f367']
full_data = df[df.EMP_Code.isin(li)]
full_data.to_csv(r"Employee_full_data.csv", index=False)
```
I also tried the following code. I receive an empty file whenever I filter on the `EMP_Code` column, but it works fine when I use columns like `EMP_name` or `EMP_dept`. I want to extract the data using `EMP_Code`.
```python
import csv
import pandas as pd

df = pd.read_csv(r"file1.csv")
list_codes = list(df.EMP_Code)

selected_rows = []
with open(r"master_file.csv") as csv_file:
    reader = csv.DictReader(csv_file)
    for row in reader:
        if row['EMP_Code'] in list_codes:
            selected_rows.append(row)

article_usage = pd.DataFrame.from_records(selected_rows)
article_usage.to_csv(r"Employee_full_data.csv", index=False)
```
Is there any other way I can extract the data without losing rows? I have heard about joins and reading data in chunks, but I am not sure how to use them here. Any help is appreciated.
Answer
I ran the code from your 2nd example (using `csv.DictReader`) on your small example and it worked. I'm guessing your problem might have to do with the real-life scale of `master_file.csv`, as you've alluded to.

The problem might be that despite using `csv.DictReader` to stream information in, you're still using a Pandas dataframe to aggregate everything before writing it out, and maybe the output is breaking your memory budget.

If that's true, then use `csv.DictWriter` to stream out. The only tricky bit is getting the writer set up, because it needs to know the fieldnames, which can't be known till we've read the first row, so we'll set up the writer in the first iteration of the read loop.
(I've removed the `with open(...)` contexts because I think they add too much indentation.)
```python
import csv

import pandas as pd

# Codes to keep, taken from file1.csv
df = pd.read_csv(r"file1.csv")
list_codes = list(df.EMP_Code)

f_in = open(r"master_file.csv", newline="")
reader = csv.DictReader(f_in)

f_out = open(r"output.csv", "w", newline="")

init_writer = True
for row in reader:
    # Set up the writer on the first iteration, once the fieldnames are known
    if init_writer:
        writer = csv.DictWriter(f_out, fieldnames=row)
        writer.writeheader()
        init_writer = False
    if row["EMP_Code"] in list_codes:
        writer.writerow(row)

f_out.close()
f_in.close()
```
| EMP_name | EMP_age | EMP_Service | EMP_Code | EMP_dept |
| --- | --- | --- | --- | --- |
| a | 30 | 6 | c264 | xyz |
| b | 29 | 3 | f367 | abc |
| d | 45 | 10 | c264 | abc |
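A small aside: `list_codes` is a plain Python list, so every `in` check scans it linearly. If `file1.csv` contains more than a handful of codes, it's probably worth turning it into a set (duplicates don't matter here, and the Pandas-free version below already uses one):

```python
# Membership tests on a set are O(1) on average, versus O(n) for a list.
list_codes = set(df.EMP_Code)
```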
And if you’d like to get rid of Pandas altogether:
```python
import csv

list_codes = set()
with open(r"file1.csv", newline="") as f:
    reader = csv.DictReader(f)
    for row in reader:
        list_codes.add(row["EMP_Code"])
```
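Since you also asked about reading in chunks: if you'd rather stay in Pandas, here's a rough sketch of that approach. The `chunksize` value is just a guess to tune against your memory budget, and I'm writing to `Employee_full_data.csv` as in your own code; the idea is to filter each chunk and append it to the output as you go, so the full 20 GB never sits in memory at once.

```python
import pandas as pd

# Codes to keep, taken from file1.csv
codes = set(pd.read_csv(r"file1.csv")["EMP_Code"])

header_written = False
for chunk in pd.read_csv(r"master_file.csv", chunksize=100_000):
    matched = chunk[chunk["EMP_Code"].isin(codes)]
    matched.to_csv(
        r"Employee_full_data.csv",
        mode="a",                    # append each filtered chunk
        header=not header_written,   # write the header only once
        index=False,
    )
    header_written = True
```

Note that `mode="a"` appends, so delete any existing `Employee_full_data.csv` before re-running, or you'll end up with duplicate rows.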