I have a dataframe (df) that looks like this (highly simplified):
ID A B C VALUE 1 10 462 2241 217 2 11 498 6953 217 3 67 120 6926 654 4 68 898 7153 654 5 87 557 4996 654 6 88 227 6475 911 7 47 875 5097 911 8 48 143 8953 111 9 65 157 4470 111 10 66 525 9328 111
The 'VALUE' column contains a variable number of rows with identical values. I am trying to output a series of csv files that contain all of the rows that contain a 'VALUE' length == 2, ==3 etc. For example:
to_csv('/Path/to/VALUE_len_2.csv')
ID A B C VALUE
1 10 462 2241 217
2 11 498 6953 217
6 88 227 6475 911
7 47 875 5097 911
to_csv('/Path/to/VALUE_len_3.csv')
ID A B C VALUE
3 67 120 6926 654
4 68 898 7153 654
5 87 557 4996 654
to_csv('/Path/to/VALUE_len_4.csv')
ID A B C VALUE
7 47 875 5097 111
8 48 143 8953 111
9 65 157 4470 111
10 66 525 9328 111
I can get the desired output of one length value at a time, e.g., using:
df = pd.concat(v for _, v in df.groupby("VALUE") if len(v) == 2)
df.to_csv("/Path/to/VALUE_len_2.csv")
However, I have dozens of values to test. I would like to put this in a for loop on the order of:
mylist = [2,3,4,5,6,7,8,9] or len([2,3,4,5,6,7,8,9])
grouped = df.groupby(['VALUE'])
output = '/Path/to/VALUE_len_{}.csv'
for loop here:
if item in my list found in grouped:
output rows to csv
else:
pass
- I’ve tried various constructions to iterate the groupby object using the items in the list, and I haven’t been able to make anything work.
- It might be an issue with trying to use a groupby object this way, but it is more than likely it is my inability to get the syntax right to complete the iteration.
Advertisement
Answer
- It doesn’t make sense to use a predetermined list to create the filenames.
df_lenwill be used to generate a filename using anf-string.Path.exists()is used to determine if the file exists or not
import pandas as pd
from pathlib import Path
# test data
data = {'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'A': [10, 11, 67, 68, 87, 88, 47, 48, 65, 66], 'B': [462, 498, 120, 898, 557, 227, 875, 143, 157, 525], 'C': [2241, 6953, 6926, 7153, 4996, 6475, 5097, 8953, 4470, 9328], 'VALUE': [217, 217, 654, 654, 654, 911, 911, 111, 111, 111]}
df = pd.DataFrame(data)
# groupby value
for group, data in df.groupby('VALUE'):
# get the length of the dataframe
df_len = len(data)
# create a filename with df_len
file = Path(f'/path/to/VALUE_len_{df_len}.csv')
# if the file exists, append without the header
if file.exists():
data.to_csv(file, index=False, mode='a', header=False)
# create a new file
else:
data.to_csv(file, index=False)
- If you must only create a file for dataframes of a specific length
desired_length = [2, 3, 4, 5, 6, 7, 8, 9]
# groupby value
for group, data in df.groupby('VALUE'):
# get the length of the dataframe
df_len = len(data)
# create a filename with df_len
file = Path(f'/path/to/VALUE_len_{df_len}.csv')
# check if the length of the dataframe is in the desired length
if df_len in desired_length:
# if the file exists, append without the header
if file.exists():
data.to_csv(file, index=False, mode='a', header=False)
# create a new file
else:
data.to_csv(file, index=False)