This is in continuation from my previous question. I have two files, `file1.csv` and a large CSV called `master_file.csv`. They have several columns and share a common column called `EMP_Code`.
File 1 example:
| EMP_name | EMP_Code | EMP_dept |
| --- | --- | --- |
| b | f367 | abc |
| a | c264 | xyz |
| c | d264 | abc |
`master_file.csv` example:
| EMP_name | EMP_age | EMP_Service | EMP_Code | EMP_dept |
| --- | --- | --- | --- | --- |
| a | 30 | 6 | c264 | xyz |
| b | 29 | 3 | f367 | abc |
| r | 27 | 1 | g364 | lmn |
| d | 45 | 10 | c264 | abc |
| t | 50 | 25 | t453 | lmn |
I want to extract the matching rows from `master_file.csv` using all the `EMP_Code` values in `file1.csv`. I tried the following code and I am losing a lot of data. I cannot read the complete master CSV file because it is around 20 GB with millions of rows, and I run out of memory. I want to read `master_file.csv` in chunks, extract the complete rows for each `EMP_Code` present in `file1.csv`, and save them into a new file, `Employee_full_data.csv`.
```python
import csv
import pandas as pd

df = pd.read_csv(r"master_file.csv")
li = ['c264', 'f367']
full_data = df[df.EMP_Code.isin(li)]
full_data.to_csv(r"Employee_full_data.csv", index=False)
```
I also tried the following code. I receive an empty file whenever I filter on the `EMP_Code` column, but it works fine when I use columns like `EMP_name` or `EMP_dept`. I want to extract the data using `EMP_Code`.
```python
import csv
import pandas as pd

df = pd.read_csv(r"file1.csv")
list_codes = list(df.EMP_Code)

selected_rows = []
with open(r"master_file.csv") as csv_file:
    reader = csv.DictReader(csv_file)
    for row in reader:
        if row['EMP_Code'] in list_codes:
            selected_rows.append(row)

article_usage = pd.DataFrame.from_records(selected_rows)
article_usage.to_csv(r"Employee_full_data.csv", index=False)
```
Is there any other way I can extract the data without losing rows? I have heard about joins and reading data in chunks, but I am not sure how to use them here. Any help is appreciated.
Answer
I ran the code from your 2nd example (using `csv.DictReader`) on your small example and it worked. I'm guessing your problem might have to do with the real-life scale of `master_file.csv`, as you've alluded to.

The problem might be that despite using `csv.DictReader` to stream information in, you're still using a Pandas dataframe to aggregate everything before writing it out, and maybe the output is breaking your memory budget.

If that's true, then use `csv.DictWriter` to stream out. The only tricky bit is getting the writer set up, because it needs to know the fieldnames, which can't be known till we've read the first row, so we'll set up the writer in the first iteration of the read loop.
(I've removed the `with open(...)` contexts because I think they add too much indentation.)
```python
import csv

import pandas as pd

# Codes to keep, taken from file1.csv
df = pd.read_csv(r"file1.csv")
list_codes = list(df.EMP_Code)

f_in = open(r"master_file.csv", newline="")
reader = csv.DictReader(f_in)

f_out = open(r"output.csv", "w", newline="")

init_writer = True
for row in reader:
    # Set up the writer on the first iteration, once the fieldnames are known
    if init_writer:
        writer = csv.DictWriter(f_out, fieldnames=row)
        writer.writeheader()
        init_writer = False
    if row["EMP_Code"] in list_codes:
        writer.writerow(row)

f_out.close()
f_in.close()
```
| EMP_name | EMP_age | EMP_Service | EMP_Code | EMP_dept |
| --- | --- | --- | --- | --- |
| a | 30 | 6 | c264 | xyz |
| b | 29 | 3 | f367 | abc |
| d | 45 | 10 | c264 | abc |
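A small aside: `list_codes` is a plain Python list, so every `in` check scans it linearly. If `file1.csv` contains more than a handful of codes, it's probably worth turning it into a set (duplicates don't matter here, and the Pandas-free version below already uses one):

```python
# Membership tests on a set are O(1) on average, versus O(n) for a list.
list_codes = set(df.EMP_Code)
```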
And if you’d like to get rid of Pandas altogether:
```python
import csv

list_codes = set()
with open(r"file1.csv", newline="") as f:
    reader = csv.DictReader(f)
    for row in reader:
        list_codes.add(row["EMP_Code"])
```
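Since you also asked about reading in chunks: if you'd rather stay in Pandas, here's a rough sketch of that approach. The `chunksize` value is just a guess to tune against your memory budget, and I'm writing to `Employee_full_data.csv` as in your own code; the idea is to filter each chunk and append it to the output as you go, so the full 20 GB never sits in memory at once.

```python
import pandas as pd

# Codes to keep, taken from file1.csv
codes = set(pd.read_csv(r"file1.csv")["EMP_Code"])

header_written = False
for chunk in pd.read_csv(r"master_file.csv", chunksize=100_000):
    matched = chunk[chunk["EMP_Code"].isin(codes)]
    matched.to_csv(
        r"Employee_full_data.csv",
        mode="a",                    # append each filtered chunk
        header=not header_written,   # write the header only once
        index=False,
    )
    header_written = True
```

Note that `mode="a"` appends, so delete any existing `Employee_full_data.csv` before re-running, or you'll end up with duplicate rows.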