I have a dataframe (df) that looks like this (highly simplified):
JavaScript
x
12
12
1
ID A B C VALUE
2
1 10 462 2241 217
3
2 11 498 6953 217
4
3 67 120 6926 654
5
4 68 898 7153 654
6
5 87 557 4996 654
7
6 88 227 6475 911
8
7 47 875 5097 911
9
8 48 143 8953 111
10
9 65 157 4470 111
11
10 66 525 9328 111
12
The 'VALUE'
column contains a variable number of rows with identical values. I am trying to output a series of csv files that contain all of the rows that contain a 'VALUE'
length == 2, ==3 etc. For example:
JavaScript
1
18
18
1
to_csv('/Path/to/VALUE_len_2.csv')
2
ID A B C VALUE
3
1 10 462 2241 217
4
2 11 498 6953 217
5
6 88 227 6475 911
6
7 47 875 5097 911
7
to_csv('/Path/to/VALUE_len_3.csv')
8
ID A B C VALUE
9
3 67 120 6926 654
10
4 68 898 7153 654
11
5 87 557 4996 654
12
to_csv('/Path/to/VALUE_len_4.csv')
13
ID A B C VALUE
14
7 47 875 5097 111
15
8 48 143 8953 111
16
9 65 157 4470 111
17
10 66 525 9328 111
18
I can get the desired output of one length value at a time, e.g., using:
JavaScript
1
3
1
df = pd.concat(v for _, v in df.groupby("VALUE") if len(v) == 2)
2
df.to_csv("/Path/to/VALUE_len_2.csv")
3
However, I have dozens of values to test. I would like to put this in a for loop on the order of:
JavaScript
1
10
10
1
mylist = [2,3,4,5,6,7,8,9] or len([2,3,4,5,6,7,8,9])
2
grouped = df.groupby(['VALUE'])
3
output = '/Path/to/VALUE_len_{}.csv'
4
5
for loop here:
6
if item in my list found in grouped:
7
output rows to csv
8
else:
9
pass
10
- I’ve tried various constructions to iterate the groupby object using the items in the list, and I haven’t been able to make anything work.
- It might be an issue with trying to use a groupby object this way, but it is more than likely it is my inability to get the syntax right to complete the iteration.
Advertisement
Answer
- It doesn’t make sense to use a predetermined list to create the filenames.
df_len
will be used to generate a filename using anf-string
.Path.exists()
is used to determine if the file exists or not
JavaScript
1
24
24
1
import pandas as pd
2
from pathlib import Path
3
4
# test data
5
data = {'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'A': [10, 11, 67, 68, 87, 88, 47, 48, 65, 66], 'B': [462, 498, 120, 898, 557, 227, 875, 143, 157, 525], 'C': [2241, 6953, 6926, 7153, 4996, 6475, 5097, 8953, 4470, 9328], 'VALUE': [217, 217, 654, 654, 654, 911, 911, 111, 111, 111]}
6
df = pd.DataFrame(data)
7
8
# groupby value
9
for group, data in df.groupby('VALUE'):
10
11
# get the length of the dataframe
12
df_len = len(data)
13
14
# create a filename with df_len
15
file = Path(f'/path/to/VALUE_len_{df_len}.csv')
16
17
# if the file exists, append without the header
18
if file.exists():
19
data.to_csv(file, index=False, mode='a', header=False)
20
21
# create a new file
22
else:
23
data.to_csv(file, index=False)
24
- If you must only create a file for dataframes of a specific length
JavaScript
1
22
22
1
desired_length = [2, 3, 4, 5, 6, 7, 8, 9]
2
3
# groupby value
4
for group, data in df.groupby('VALUE'):
5
6
# get the length of the dataframe
7
df_len = len(data)
8
9
# create a filename with df_len
10
file = Path(f'/path/to/VALUE_len_{df_len}.csv')
11
12
# check if the length of the dataframe is in the desired length
13
if df_len in desired_length:
14
15
# if the file exists, append without the header
16
if file.exists():
17
data.to_csv(file, index=False, mode='a', header=False)
18
19
# create a new file
20
else:
21
data.to_csv(file, index=False)
22