
Balance dataset using pandas

This is for a machine learning program.

I am working with a dataset described by a csv. Each row contains an id, which identifies a .tif image in another directory, and a label, 1 or 0. I have loaded this csv as a pandas dataframe. It has 220,025 rows: 130,908 with label 0 and 89,117 with label 1.

There are 41,791 more rows with label 0 than label 1. I want to randomly drop the extra rows with label 0. After that, I want to decrease the sample size from 178,234 to just 50,000, with 25,000 ids for each label.

Another approach might be to randomly drop 105,908 rows with label 0 and 64,117 with label 1.

How can I do this using pandas?

I have already looked at using .groupby and then .sample, but that seems to drop an equal number of rows from both labels, while I only want to drop rows from one label.

Sample of the csv:

id,label
f38a6374c348f90b587e046aac6079959adf3835,0
c18f2d887b7ae4f6742ee445113fa1aef383ed77,1
755db6279dae599ebb4d39a9123cce439965282d,0
bc3f0c64fb968ff4a8bd33af6971ecae77c75e08,0
068aba587a4950175d04c680d38943fd488d6a9d,0
acfe80838488fae3c89bd21ade75be5c34e66be7,0
a24ce148f6ffa7ef8eefb4efb12ebffe8dd700da,1
7f6ccae485af121e0b6ee733022e226ee6b0c65f,1
559e55a64c9ba828f700e948f6886f4cea919261,0
8eaaa7a400aa79d36c2440a4aa101cc14256cda4,0


Answer

Personally, I would break it up into the following steps:

Since you have more 0s than 1s, we’re first going to even out the number of each. Here, I’m using the sample data you pasted in, loaded as df.
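
To follow along, df can be rebuilt from the ten sample rows in the question. This is a minimal sketch that reads those rows from an in-memory string (SAMPLE_CSV is just an illustrative name); with the real file you would presumably pass its path to pd.read_csv instead.

```python
import io

import pandas as pd

# The ten sample rows from the question, inlined so the snippet is self-contained.
SAMPLE_CSV = """id,label
f38a6374c348f90b587e046aac6079959adf3835,0
c18f2d887b7ae4f6742ee445113fa1aef383ed77,1
755db6279dae599ebb4d39a9123cce439965282d,0
bc3f0c64fb968ff4a8bd33af6971ecae77c75e08,0
068aba587a4950175d04c680d38943fd488d6a9d,0
acfe80838488fae3c89bd21ade75be5c34e66be7,0
a24ce148f6ffa7ef8eefb4efb12ebffe8dd700da,1
7f6ccae485af121e0b6ee733022e226ee6b0c65f,1
559e55a64c9ba828f700e948f6886f4cea919261,0
8eaaa7a400aa79d36c2440a4aa101cc14256cda4,0
"""

df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Count the rows per label to see how unbalanced the data is.
print(df["label"].value_counts())
```

For this sample the counts are seven 0s and three 1s, so label 1 is the smaller group here, just as in the full dataset.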

  • Count the number of 1s (since this is our smaller value)
import pandas as pd

# Keep only the rows labelled 1 (the smaller group)
ones_subset = df.loc[df["label"] == 1, :]
number_of_1s = len(ones_subset)

print(number_of_1s)
3
  • Sample only the zeros, drawing number_of_1s rows so that both labels end up with the same count
zeros_subset = df.loc[df["label"] == 0, :]
sampled_zeros = zeros_subset.sample(number_of_1s)

print(sampled_zeros)
  • Stick these 2 chunks (all of the 1s from our ones_subset and our matched sampled_zeros) together to make one clean dataframe that has an equal number of 1 and 0 labels
clean_df = pd.concat([ones_subset, sampled_zeros], ignore_index=True)

print(clean_df)
                                         id  label
0  c18f2d887b7ae4f6742ee445113fa1aef383ed77      1
1  a24ce148f6ffa7ef8eefb4efb12ebffe8dd700da      1
2  7f6ccae485af121e0b6ee733022e226ee6b0c65f      1
3  559e55a64c9ba828f700e948f6886f4cea919261      0
4  f38a6374c348f90b587e046aac6079959adf3835      0
5  068aba587a4950175d04c680d38943fd488d6a9d      0
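
Two optional refinements may be worth considering at this point: passing random_state to .sample so the same zeros are picked on every run, and shuffling after the concat so the 1s and 0s are no longer stored in two contiguous blocks. The sketch below assumes a toy dataframe with made-up ids in place of the real data, and the seed value 0 is arbitrary.

```python
import pandas as pd

# Toy stand-in for the question's sample: seven 0s and three 1s (ids are made up).
df = pd.DataFrame({
    "id": [f"img_{i}" for i in range(10)],
    "label": [0, 1, 0, 0, 0, 0, 1, 1, 0, 0],
})

ones_subset = df.loc[df["label"] == 1]
zeros_subset = df.loc[df["label"] == 0]

# random_state fixes the draw, so rerunning the script picks the same zeros.
sampled_zeros = zeros_subset.sample(n=len(ones_subset), random_state=0)

# sample(frac=1) returns every row in random order, which mixes the two labels.
clean_df = (
    pd.concat([ones_subset, sampled_zeros])
    .sample(frac=1, random_state=0)
    .reset_index(drop=True)
)

print(clean_df["label"].value_counts())
```

Whether the shuffle matters depends on how the training code reads the dataframe; many data loaders shuffle on their own.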

Now that we have a balanced dataset, we can proceed with the last step:

  • Use the groupby(...).sample(...) approach you mentioned to further downsample this dataset, taking it from 3 of each label (three 1s and three 0s) to a smaller matched size (two 1s and two 0s)
downsampled_df = clean_df.groupby("label").sample(2)

print(downsampled_df)
                                         id  label
4  f38a6374c348f90b587e046aac6079959adf3835      0
5  068aba587a4950175d04c680d38943fd488d6a9d      0
1  a24ce148f6ffa7ef8eefb4efb12ebffe8dd700da      1
0  c18f2d887b7ae4f6742ee445113fa1aef383ed77      1
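
At full scale, the two stages can probably be collapsed into a single call, because groupby("label").sample(n=...) draws the same number of rows from every group no matter how large each group starts out. The sketch below assumes pandas 1.1 or later (where groupby(...).sample is available) and uses synthetic integer ids with the label counts from the question; n_per_label and the seed 42 are illustrative choices, not something the question specifies.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in with the question's label counts (ids are placeholders).
n_zeros, n_ones = 130_908, 89_117
df = pd.DataFrame({
    "id": np.arange(n_zeros + n_ones),
    "label": np.r_[np.zeros(n_zeros, dtype=int), np.ones(n_ones, dtype=int)],
})

n_per_label = 25_000

# Draw exactly n_per_label rows from each label group, without replacement.
balanced_df = df.groupby("label").sample(n=n_per_label, random_state=42)

# Shuffle so the labels are interleaved, then renumber the rows.
balanced_df = balanced_df.sample(frac=1, random_state=42).reset_index(drop=True)

print(balanced_df["label"].value_counts())
```

With these counts, the result keeps 25,000 rows per label, which amounts to dropping 105,908 rows with label 0 and 64,117 rows with label 1.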