This is for a machine learning program.
I am working with a dataset that has a csv which contains an id, for a .tif
image in another directory, and a label, 1 or 0. There are 220,025 rows in the csv. I have loaded this csv as a pandas dataframe. Currently in the dataframe, there are 220,025 rows, with 130,908 rows with label 0 and 89,117 rows with label 1.
There are 41,791 more rows with label 0 than label 1. I want to randomly drop the extra rows with label 0. After that, I want to decrease the sample size from 178,234 to just 50,000, with 25,000 ids for each label.
Another approach might be to randomly drop 105,908 rows with label 0 and 64,117 rows with label 1 in a single step.
How can I do this using pandas?
I have already looked at using `.groupby` and then `.sample`, but that drops an equal amount of rows from both labels, while I only want to drop rows from one label.
Sample of the csv:
```
id,label
f38a6374c348f90b587e046aac6079959adf3835,0
c18f2d887b7ae4f6742ee445113fa1aef383ed77,1
755db6279dae599ebb4d39a9123cce439965282d,0
bc3f0c64fb968ff4a8bd33af6971ecae77c75e08,0
068aba587a4950175d04c680d38943fd488d6a9d,0
acfe80838488fae3c89bd21ade75be5c34e66be7,0
a24ce148f6ffa7ef8eefb4efb12ebffe8dd700da,1
7f6ccae485af121e0b6ee733022e226ee6b0c65f,1
559e55a64c9ba828f700e948f6886f4cea919261,0
8eaaa7a400aa79d36c2440a4aa101cc14256cda4,0
```
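If you want to follow along with the sample, it can be loaded straight from a string without writing a file. This is just a convenience sketch using `io.StringIO`; for the real dataset you would use `pd.read_csv` on the CSV path instead.

```python
import io

import pandas as pd

# The sample rows from the question, pasted as a string.
csv_text = """\
id,label
f38a6374c348f90b587e046aac6079959adf3835,0
c18f2d887b7ae4f6742ee445113fa1aef383ed77,1
755db6279dae599ebb4d39a9123cce439965282d,0
bc3f0c64fb968ff4a8bd33af6971ecae77c75e08,0
068aba587a4950175d04c680d38943fd488d6a9d,0
acfe80838488fae3c89bd21ade75be5c34e66be7,0
a24ce148f6ffa7ef8eefb4efb12ebffe8dd700da,1
7f6ccae485af121e0b6ee733022e226ee6b0c65f,1
559e55a64c9ba828f700e948f6886f4cea919261,0
8eaaa7a400aa79d36c2440a4aa101cc14256cda4,0
"""

df = pd.read_csv(io.StringIO(csv_text))
# The sample is imbalanced just like the full dataset: 7 zeros, 3 ones.
print(df["label"].value_counts())
```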
Answer
Personally, I would break it up into the following steps:
Since you have more 0s than 1s, we're first going to even out the number of each. Here, I'm using the sample data you pasted in as `df`.
- Count the number of 1s (since this is our smaller value)
```python
ones_subset = df.loc[df["label"] == 1, :]
number_of_1s = len(ones_subset)
print(number_of_1s)
# 3
```
- Sample only the zeros to match `number_of_1s`
```python
zeros_subset = df.loc[df["label"] == 0, :]
sampled_zeros = zeros_subset.sample(number_of_1s)
print(sampled_zeros)
```
- Stick these 2 chunks (all of the 1s from our `ones_subset`, and our matched `sampled_zeros`) together to make one clean dataframe that has an equal number of 1 and 0 labels
```python
clean_df = pd.concat([ones_subset, sampled_zeros], ignore_index=True)
print(clean_df)
```
```
                                         id  label
0  c18f2d887b7ae4f6742ee445113fa1aef383ed77      1
1  a24ce148f6ffa7ef8eefb4efb12ebffe8dd700da      1
2  7f6ccae485af121e0b6ee733022e226ee6b0c65f      1
3  559e55a64c9ba828f700e948f6886f4cea919261      0
4  f38a6374c348f90b587e046aac6079959adf3835      0
5  068aba587a4950175d04c680d38943fd488d6a9d      0
```
Now that we have a cleaned up dataset, we can proceed with the last step:
- Use the `groupby(...).sample(...)` approach you mentioned to further downsample this dataset, taking it from three of each label down to a smaller matched size of two of each
```python
downsampled_df = clean_df.groupby("label").sample(2)
print(downsampled_df)
```
```
                                         id  label
4  f38a6374c348f90b587e046aac6079959adf3835      0
5  068aba587a4950175d04c680d38943fd488d6a9d      0
1  a24ce148f6ffa7ef8eefb4efb12ebffe8dd700da      1
0  c18f2d887b7ae4f6742ee445113fa1aef383ed77      1
```
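Scaled up to the real dataset, the same steps fit in a small helper. The sketch below is one way to wrap them; `balanced_sample` and the synthetic demo dataframe are stand-ins for illustration, and on the real data you would call it with `per_label=25_000`. A `random_state` is passed through so the sampling is reproducible.

```python
import numpy as np
import pandas as pd


def balanced_sample(df, per_label, seed=42):
    """Balance the labels, then return per_label random rows for each label."""
    # Steps 1-3: downsample the majority 0s to match the number of 1s.
    ones = df.loc[df["label"] == 1]
    zeros = df.loc[df["label"] == 0].sample(len(ones), random_state=seed)
    clean_df = pd.concat([ones, zeros], ignore_index=True)
    # Step 4: take per_label rows from each group of the balanced frame.
    return clean_df.groupby("label").sample(per_label, random_state=seed)


# Synthetic stand-in for the real 220,025-row dataframe (1200 zeros, 800 ones).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "id": [f"img{i:06d}" for i in range(2000)],
    "label": rng.permutation([0] * 1200 + [1] * 800),
})

result = balanced_sample(demo, per_label=500)
# result has 1000 rows: 500 with label 0 and 500 with label 1.
print(result["label"].value_counts())
```

Worth noting: since both labels in your data already have more than 25,000 rows, `df.groupby("label").sample(25_000, random_state=42)` on its own would also yield a balanced 50,000-row sample, making the intermediate balancing step optional in this particular case.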