This is for a machine learning program.
I am working with a dataset that has a csv which contains an id, for a .tif
image in another directory, and a label, 1 or 0. There are 220,025 rows in the csv. I have loaded this csv as a pandas dataframe. Currently in the dataframe, there are 220,025 rows, with 130,908 rows with label 0 and 89,117 rows with label 1.
There are 41,791 more rows with label 0 than label 1. I want to randomly drop the extra rows with label 0. After that, I want to decrease the sample size from 178,234 to just 50,000, with 25,000 ids for each label.
Another approach might be to randomly drop 105,908 rows with label 0 and 64,117 rows with label 1 in a single step.
How can I do this using pandas?
I have already looked at using `.groupby` and then `.sample`, but that drops an equal amount of rows from both labels, while I only want to drop rows from one label.
Sample of the csv:
```
id,label
f38a6374c348f90b587e046aac6079959adf3835,0
c18f2d887b7ae4f6742ee445113fa1aef383ed77,1
755db6279dae599ebb4d39a9123cce439965282d,0
bc3f0c64fb968ff4a8bd33af6971ecae77c75e08,0
068aba587a4950175d04c680d38943fd488d6a9d,0
acfe80838488fae3c89bd21ade75be5c34e66be7,0
a24ce148f6ffa7ef8eefb4efb12ebffe8dd700da,1
7f6ccae485af121e0b6ee733022e226ee6b0c65f,1
559e55a64c9ba828f700e948f6886f4cea919261,0
8eaaa7a400aa79d36c2440a4aa101cc14256cda4,0
```
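If you want to follow along with the sample, it can be loaded straight from a string without writing a file. This is just a convenience sketch using `io.StringIO`; for the real dataset you would use `pd.read_csv` on the CSV path instead.

```python
import io

import pandas as pd

# The sample rows from the question, pasted as a string.
csv_text = """\
id,label
f38a6374c348f90b587e046aac6079959adf3835,0
c18f2d887b7ae4f6742ee445113fa1aef383ed77,1
755db6279dae599ebb4d39a9123cce439965282d,0
bc3f0c64fb968ff4a8bd33af6971ecae77c75e08,0
068aba587a4950175d04c680d38943fd488d6a9d,0
acfe80838488fae3c89bd21ade75be5c34e66be7,0
a24ce148f6ffa7ef8eefb4efb12ebffe8dd700da,1
7f6ccae485af121e0b6ee733022e226ee6b0c65f,1
559e55a64c9ba828f700e948f6886f4cea919261,0
8eaaa7a400aa79d36c2440a4aa101cc14256cda4,0
"""

df = pd.read_csv(io.StringIO(csv_text))
# The sample is imbalanced just like the full dataset: 7 zeros, 3 ones.
print(df["label"].value_counts())
```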
Answer
Personally, I would break it up into the following steps:
Since you have more 0s than 1s, we're first going to even out the number of each. Here, I'm using the sample data you pasted in as `df`.
- Count the number of 1s (since this is our smaller value)
```python
ones_subset = df.loc[df["label"] == 1, :]
number_of_1s = len(ones_subset)
print(number_of_1s)
# 3
```
- Sample only the zeros to match `number_of_1s`
```python
zeros_subset = df.loc[df["label"] == 0, :]
sampled_zeros = zeros_subset.sample(number_of_1s)
print(sampled_zeros)
```
- Stick these 2 chunks (all of the 1s from our `ones_subset`, and our matched `sampled_zeros`) together to make one clean dataframe that has an equal number of 1 and 0 labels
```python
clean_df = pd.concat([ones_subset, sampled_zeros], ignore_index=True)
print(clean_df)
```
```
                                         id  label
0  c18f2d887b7ae4f6742ee445113fa1aef383ed77      1
1  a24ce148f6ffa7ef8eefb4efb12ebffe8dd700da      1
2  7f6ccae485af121e0b6ee733022e226ee6b0c65f      1
3  559e55a64c9ba828f700e948f6886f4cea919261      0
4  f38a6374c348f90b587e046aac6079959adf3835      0
5  068aba587a4950175d04c680d38943fd488d6a9d      0
```
Now that we have a cleaned up dataset, we can proceed with the last step:
- Use the `groupby(...).sample(...)` approach you mentioned to further downsample this dataset, taking it from three of each label down to a smaller matched size of two of each
```python
downsampled_df = clean_df.groupby("label").sample(2)
print(downsampled_df)
```
```
                                         id  label
4  f38a6374c348f90b587e046aac6079959adf3835      0
5  068aba587a4950175d04c680d38943fd488d6a9d      0
1  a24ce148f6ffa7ef8eefb4efb12ebffe8dd700da      1
0  c18f2d887b7ae4f6742ee445113fa1aef383ed77      1
```
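Scaled up to the real dataset, the same steps fit in a small helper. The sketch below is one way to wrap them; `balanced_sample` and the synthetic demo dataframe are stand-ins for illustration, and on the real data you would call it with `per_label=25_000`. A `random_state` is passed through so the sampling is reproducible.

```python
import numpy as np
import pandas as pd


def balanced_sample(df, per_label, seed=42):
    """Balance the labels, then return per_label random rows for each label."""
    # Steps 1-3: downsample the majority 0s to match the number of 1s.
    ones = df.loc[df["label"] == 1]
    zeros = df.loc[df["label"] == 0].sample(len(ones), random_state=seed)
    clean_df = pd.concat([ones, zeros], ignore_index=True)
    # Step 4: take per_label rows from each group of the balanced frame.
    return clean_df.groupby("label").sample(per_label, random_state=seed)


# Synthetic stand-in for the real 220,025-row dataframe (1200 zeros, 800 ones).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "id": [f"img{i:06d}" for i in range(2000)],
    "label": rng.permutation([0] * 1200 + [1] * 800),
})

result = balanced_sample(demo, per_label=500)
# result has 1000 rows: 500 with label 0 and 500 with label 1.
print(result["label"].value_counts())
```

Worth noting: since both labels in your data already have more than 25,000 rows, `df.groupby("label").sample(25_000, random_state=42)` on its own would also yield a balanced 50,000-row sample, making the intermediate balancing step optional in this particular case.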