Skip to content
Advertisement

How to create files from a groupby object, based on the length of the dataframe

I have a dataframe (df) that looks like this (highly simplified):

ID  A   B   C   VALUE
1   10  462 2241    217
2   11  498 6953    217
3   67  120 6926    654
4   68  898 7153    654
5   87  557 4996    654
6   88  227 6475    911
7   47  875 5097    911
8   48  143 8953    111
9   65  157 4470    111
10  66  525 9328    111

The 'VALUE' column contains a variable number of rows with identical values. I am trying to output a series of csv files that contain all of the rows that contain a 'VALUE' length == 2, ==3 etc. For example:

to_csv('/Path/to/VALUE_len_2.csv')              
ID  A   B   C      VALUE
1   10  462 2241    217
2   11  498 6953    217
6   88  227 6475    911
7   47  875 5097    911
to_csv('/Path/to/VALUE_len_3.csv')              
ID  A   B   C      VALUE
3   67  120 6926    654
4   68  898 7153    654
5   87  557 4996    654
to_csv('/Path/to/VALUE_len_4.csv')              
ID  A   B   C      VALUE
7   47  875 5097    111
8   48  143 8953    111
9   65  157 4470    111
10  66  525 9328    111

I can get the desired output of one length value at a time, e.g., using:

df = pd.concat(v for _, v in df.groupby("VALUE") if len(v) == 2)
df.to_csv("/Path/to/VALUE_len_2.csv")

However, I have dozens of values to test. I would like to put this in a for loop on the order of:

mylist = [2,3,4,5,6,7,8,9] or len([2,3,4,5,6,7,8,9])
grouped = df.groupby(['VALUE'])
output = '/Path/to/VALUE_len_{}.csv'

for loop here:
    if item in my list found in grouped:
        output rows to csv
    else:
        pass
  • I’ve tried various constructions to iterate the groupby object using the items in the list, and I haven’t been able to make anything work.
  • It might be an issue with trying to use a groupby object this way, but it is more than likely it is my inability to get the syntax right to complete the iteration.

Advertisement

Answer

  • It doesn’t make sense to use a predetermined list to create the filenames.
  • df_len will be used to generate a filename using an f-string.
  • Path.exists() is used to determine if the file exists or not
import pandas as pd
from pathlib import Path

# test data
data = {'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'A': [10, 11, 67, 68, 87, 88, 47, 48, 65, 66], 'B': [462, 498, 120, 898, 557, 227, 875, 143, 157, 525], 'C': [2241, 6953, 6926, 7153, 4996, 6475, 5097, 8953, 4470, 9328], 'VALUE': [217, 217, 654, 654, 654, 911, 911, 111, 111, 111]}
df = pd.DataFrame(data)

# groupby value
for group, data in df.groupby('VALUE'):
    
    # get the length of the dataframe
    df_len = len(data)
    
    # create a filename with df_len
    file = Path(f'/path/to/VALUE_len_{df_len}.csv')
    
    # if the file exists, append without the header
    if file.exists():
        data.to_csv(file, index=False, mode='a', header=False)
        
    # create a new file
    else:
        data.to_csv(file, index=False) 

  • If you must only create a file for dataframes of a specific length
desired_length = [2, 3, 4, 5, 6, 7, 8, 9]

# groupby value
for group, data in df.groupby('VALUE'):
    
    # get the length of the dataframe
    df_len = len(data)
    
    # create a filename with df_len
    file = Path(f'/path/to/VALUE_len_{df_len}.csv')
    
    # check if the length of the dataframe is in the desired length
    if df_len in desired_length:
        
        # if the file exists, append without the header
        if file.exists():
            data.to_csv(file, index=False, mode='a', header=False)

        # create a new file
        else:
            data.to_csv(file, index=False)
Advertisement