Trying to use pandas to oversample my ragged data (data with different lengths).
Given the following data samples:
import pandas as pd

x = pd.DataFrame({'id': [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 5, 6, 6],
                  'f1': [11, 11, 11, 22, 22, 33, 33, 33, 33, 44, 55, 66, 66]})
y = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6],
                  'target': [1, 0, 1, 0, 0, 0]})
Data (groups are separated with --- for convenience):
    id  f1
0    1  11
1    1  12
2    1  13
-----------
3    2  22
4    2  22
-----------
5    3  33
6    3  34
7    3  35
8    3  36
-----------
9    4  44
-----------
10   5  55
-----------
11   6  66
12   6  66
Targets:
   id  target
0   1       1
1   2       0
2   3       1
3   4       0
4   5       0
5   6       0
I would like to balance the minority class. In the sample above, target 1 is the minority class with 2 samples, for ids 1 & 3.
I’m looking for a way to oversample the data so the results would be:
    id  f1
0    1  11
1    1  12
2    1  13
-----------
3    2  22
4    2  22
-----------
5    3  33
6    3  34
7    3  35
8    3  36
-----------
9    4  44
-----------
10   5  55
-----------
11   6  66
12   6  66
-----------------
13   7  11
14   7  12      Replica of id 1
15   7  13
-----------------
16   8  33
17   8  34      Replica of id 3
18   8  35
19   8  36
And the targets would be balanced:
   id  target
0   1       1
1   2       0
2   3       1
3   4       0
4   5       0
5   6       0
6   7       1
7   8       1
With exactly 4 positive and 4 negative samples.
Answer
You can use:
x = pd.DataFrame({'id': [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 5, 6, 6],
                  'f1': [11, 11, 11, 22, 22, 33, 33, 33, 33, 44, 55, 66, 66]})
#more general (unbalanced) sample
y = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7],
                  'target': [1, 0, 1, 0, 0, 0, 0]})
import numpy as np

#repeat the value 1 or 0 as many times as needed to balance the target
s = y['target'].value_counts()
s1 = s.rsub(s.max())
new = s1.index.repeat(s1).tolist()

#create a helper DataFrame with new ids and append it to y
y1 = pd.DataFrame({'id': range(y['id'].max() + 1, y['id'].max() + len(new) + 1),
                   'target': new})
y2 = y.append(y1, ignore_index=True)
print (y2)

#filter rows of the class that needs oversampling (first value of new)
add = y[y['target'].eq(new[0])]

#repeat the minority ids with np.tile (np.repeat is also possible),
#attach the new ids from y1 and merge with x
add = (add.loc[np.tile(add.index, (len(new) // len(add)) + 1), ['id']]
          .head(len(new))
          .assign(new = y1['id'].tolist())
          .merge(x, on='id', how='left')
          .drop('id', axis=1)
          .rename(columns={'new':'id'}))

#append the replicated rows to x
x2 = x.append(add, ignore_index=True)
print (x2)
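The key step is s.rsub(s.max()): for every class it computes how many extra samples are needed to reach the size of the largest class. A minimal illustration of just that step, run against the y defined above (the comments show what each expression evaluates to):

s = y['target'].value_counts()      # counts per class: 0 -> 5, 1 -> 2
s1 = s.rsub(s.max())                # deficit per class: 0 -> 0, 1 -> 3
new = s1.index.repeat(s1).tolist()
print(new)                          # [1, 1, 1] -> three extra positive ids are needed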
The solution above only works for unbalanced data. If the data may already be balanced, add a check:
#balanced sample
y = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6],
                  'target': [1, 1, 1, 0, 0, 0]})

#repeat the value 1 or 0 as many times as needed to balance the target
s = y['target'].value_counts()
s1 = s.rsub(s.max())
new = s1.index.repeat(s1).tolist()

if len(new) > 0:
    #create a helper DataFrame with new ids and append it to y
    y1 = pd.DataFrame({'id': range(y['id'].max() + 1, y['id'].max() + len(new) + 1),
                       'target': new})
    y2 = y.append(y1, ignore_index=True)
    print (y2)

    #filter rows of the class that needs oversampling (first value of new)
    add = y[y['target'].eq(new[0])]

    #repeat the minority ids with np.tile (np.repeat is also possible),
    #attach the new ids from y1 and merge with x
    add = (add.loc[np.tile(add.index, (len(new) // len(add)) + 1), ['id']]
              .head(len(new))
              .assign(new = y1['id'].tolist())
              .merge(x, on='id', how='left')
              .drop('id', axis=1)
              .rename(columns={'new':'id'}))

    #append the replicated rows to x
    x2 = x.append(add, ignore_index=True)
    print (x2)
else:
    print ('y is already balanced')
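For reuse, the same steps could be wrapped in a small helper function. This is only a sketch, not part of the original answer: the name oversample_groups is made up here, and it uses pd.concat instead of DataFrame.append, since append was removed in recent pandas versions.

import numpy as np
import pandas as pd

def oversample_groups(x, y):
    #how many extra samples each class needs to match the majority class
    s = y['target'].value_counts()
    s1 = s.rsub(s.max())
    new = s1.index.repeat(s1).tolist()
    if not new:
        return x.copy(), y.copy()   #already balanced, nothing to add

    #fresh ids continue after the current maximum id
    y1 = pd.DataFrame({'id': range(y['id'].max() + 1, y['id'].max() + len(new) + 1),
                       'target': new})
    y2 = pd.concat([y, y1], ignore_index=True)

    #cycle through the ids of the class being oversampled and attach the fresh ids
    add = y[y['target'].eq(new[0])]
    add = (add.loc[np.tile(add.index, (len(new) // len(add)) + 1), ['id']]
              .head(len(new))
              .assign(new=y1['id'].tolist())
              .merge(x, on='id', how='left')
              .drop('id', axis=1)
              .rename(columns={'new': 'id'}))
    x2 = pd.concat([x, add], ignore_index=True)
    return x2, y2

x2, y2 = oversample_groups(x, y)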