This question was a little hard for me to phrase, so please feel free to edit it if that would make it clearer.
Problem Statement: I want all the rows that share a specific column value to be saved to the same file.
Example Code: I want to do something like this. Say I have a dataframe:
import pandas as pd
d = {'col1': [1, 2, 6, 3, 4], 'col2': [3, 4, 2, 5, 6], 'col3': ['a', 'b', 'c', 'a', 'b'], 'col4': ['2', '3', '2', '2', '2']}
df = pd.DataFrame(data=d)
I want to create csv files such that:
- all rows where col3 is a get saved in a.csv
- all rows where col3 is b get saved in b.csv
- all rows where col3 is c get saved in c.csv
Hypothesized Solution: The only way I can think of to create the CSV files is to iterate through the dataframe row by row and check whether a CSV file already exists for that row's column value (e.g. the col3 value). If it does not, create the file and add the row; otherwise, append the row to the existing CSV file.
Issue:
The sample code above is just a representation. I have a very large dataframe. If it helps, I already have the unique values of the column in question (col3 in the example) available as a list. However, one of the most popular answers to "How to iterate over rows in a DataFrame in Pandas" (the second answer there) says: DON'T. I might have to iterate as a last resort if there is no other way, but if there is one, can someone help me find a better solution to this problem?
Answer
If your file (here all.csv) is large and you want to process the CSV in chunks, you can try this strategy: open an output file the first time a value of col3 is met and save the file handle in a dict. The next time you meet the same value, look up the handle in the dict and use it to write the data, and so on.
import pandas as pd
import pathlib
DIRPATH = "/tmp/csv_folder"
# create folder if it doesn't exist
dirpath = pathlib.Path(DIRPATH)
dirpath.mkdir(parents=True, exist_ok=True)
# chunksize=2 for demo purpose only...
reader = pd.read_csv("all.csv", chunksize=2)
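streams = {}
try:
    for df in reader:
        for grp, dfg in df.groupby("col3"):
            if grp not in streams:
                # grp is met for the first time: open its file and write the header once
                streams[grp] = open(dirpath / f"{grp}.csv", "w", newline="")
                dfg.to_csv(streams[grp], index=False)
            else:
                # file already open: append rows without repeating the header
                dfg.to_csv(streams[grp], index=False, header=False)
finally:
    # close every handle, even if an error interrupted the loop
    for fp in streams.values():
        fp.close()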
$ cat /tmp/csv_folder/a.csv
col1,col2,col3,col4
1,3,a,2
3,5,a,2

$ cat /tmp/csv_folder/b.csv
col1,col2,col3,col4
2,4,b,3
4,6,b,2

$ cat /tmp/csv_folder/c.csv
col1,col2,col3,col4
6,2,c,2
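
If the whole dataframe already fits in memory, as in the question's example, reading in chunks is not strictly needed: grouping once on col3 and writing each group should be enough. Below is a minimal sketch under that assumption. The output folder `out_dir` is a hypothetical temporary directory chosen only for illustration, and the `written` list is there just to show which files were created.

```python
import pathlib
import tempfile

import pandas as pd

d = {
    "col1": [1, 2, 6, 3, 4],
    "col2": [3, 4, 2, 5, 6],
    "col3": ["a", "b", "c", "a", "b"],
    "col4": ["2", "3", "2", "2", "2"],
}
df = pd.DataFrame(data=d)

# Hypothetical output folder (assumption): a fresh temporary directory
out_dir = pathlib.Path(tempfile.mkdtemp())

# groupby yields one (value, sub-dataframe) pair per distinct value of col3,
# so each file is written in a single call and no row-by-row loop is needed
for key, group in df.groupby("col3"):
    group.to_csv(out_dir / f"{key}.csv", index=False)

# names of the files that were created, sorted for readability
written = sorted(p.name for p in out_dir.iterdir())
```

This avoids iterating over rows entirely. The chunked version remains the better fit when the data is too large to load at once.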