Pandas oversampling ragged sequential data

Question

I'm trying to use pandas to oversample my ragged data (sequences with different lengths). Given the following data samples: Data (groups are separated with --- for convenience): Targets: I would like to balance the minority class. In the sample above, target 1 is the minority class with 2 samples, for ids 1 & 3. I'm looking for a way to oversample

Accepted Answer

You can use:

    import numpy as np
    import pandas as pd

    x = pd.DataFrame({'id': [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 5, 6, 6],
                      'f1': [11, 11, 11, 22, 22, 33, 33, 33, 33, 44, 55, 66, 66]})

    # more general (unbalanced) sample
    y = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7], 'target': [1, 0, 1, 0, 0, 0, 0]})

    # number of extra samples each class needs to match the largest class
    s = y['target'].value_counts()
    s1 = s.rsub(s.max())
    # target value repeated once per missing sample
    new = s1.index.repeat(s1).tolist()

    # create a helper DataFrame with fresh ids and add it to y
    y1 = pd.DataFrame({'id': range(y['id'].max() + 1, y['id'].max() + len(new) + 1),
                       'target': new})
    y2 = pd.concat([y, y1], ignore_index=True)
    print(y2)

    # filter by the first value of new (the minority class)
    add = y[y['target'].eq(new[0])]

    # repeat the minority rows with np.tile (np.repeat is also possible),
    # add a helper column with the ids from y1 and merge with x
    add = (add.loc[np.tile(add.index, (len(new) // len(add)) + 1), ['id']]
              .head(len(new))
              .assign(new=y1['id'].tolist())
              .merge(x, on='id', how='left')
              .drop('id', axis=1)
              .rename(columns={'new': 'id'}))

    # add to x
    x2 = pd.concat([x, add], ignore_index=True)
    print(x2)

The solution above works only for unbalanced data: when the classes are already balanced, new is empty and new[0] raises an IndexError. If the data may sometimes be balanced, guard the steps with a length check:

    # balanced sample (x, np and pd are defined above)
    y = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6], 'target': [1, 1, 1, 0, 0, 0]})

    # number of extra samples each class needs to match the largest class
    s = y['target'].value_counts()
    s1 = s.rsub(s.max())
    new = s1.index.repeat(s1).tolist()

    if len(new) > 0:
        # create a helper DataFrame with fresh ids and add it to y
        y1 = pd.DataFrame({'id': range(y['id'].max() + 1, y['id'].max() + len(new) + 1),
                           'target': new})
        y2 = pd.concat([y, y1], ignore_index=True)
        print(y2)

        # filter by the first value of new (the minority class)
        add = y[y['target'].eq(new[0])]

        # repeat the minority rows with np.tile (np.repeat is also possible),
        # add a helper column with the ids from y1 and merge with x
        add = (add.loc[np.tile(add.index, (len(new) // len(add)) + 1), ['id']]
                  .head(len(new))
                  .assign(new=y1['id'].tolist())
                  .merge(x, on='id', how='left')
                  .drop('id', axis=1)
                  .rename(columns={'new': 'id'}))

        # add to x
        x2 = pd.concat([x, add], ignore_index=True)
        print(x2)
    else:
        print('y is already balanced')
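The steps above can also be wrapped in one reusable function, which may be easier to test. The following is only a sketch under some assumptions: the name oversample_ragged, its parameters and the temporary column _new_id are hypothetical (they are not part of pandas or of the answer), the target is binary, ids are integers, and every minority id has rows in x.

```python
import numpy as np
import pandas as pd


def oversample_ragged(x, y, id_col="id", target_col="target"):
    """Duplicate whole minority-class sequences until both classes are balanced.

    Hypothetical helper (not part of pandas). Assumes a binary target and
    integer ids; returns new (x, y) frames and leaves the inputs untouched.
    """
    counts = y[target_col].value_counts()
    deficit = counts.max() - counts          # extra samples needed per class
    new_targets = deficit.index.repeat(deficit.to_numpy()).tolist()
    if not new_targets:                      # already balanced, nothing to add
        return x.copy(), y.copy()

    # Fresh ids for the duplicated sequences, continuing after the largest id.
    start = y[id_col].max() + 1
    new_ids = list(range(start, start + len(new_targets)))
    y_new = pd.DataFrame({id_col: new_ids, target_col: new_targets})

    # Cycle through the minority ids until there are enough source sequences.
    minority_ids = y.loc[y[target_col].eq(new_targets[0]), id_col].to_numpy()
    reps = len(new_targets) // len(minority_ids) + 1
    source_ids = np.tile(minority_ids, reps)[: len(new_targets)]

    # Copy every row of each source sequence and relabel it with its new id.
    mapping = pd.DataFrame({id_col: source_ids, "_new_id": new_ids})
    x_new = (mapping.merge(x, on=id_col, how="left")
                    .drop(columns=id_col)
                    .rename(columns={"_new_id": id_col}))

    return (pd.concat([x, x_new], ignore_index=True),
            pd.concat([y, y_new], ignore_index=True))


x = pd.DataFrame({"id": [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 5, 6, 6],
                  "f1": [11, 11, 11, 22, 22, 33, 33, 33, 33, 44, 55, 66, 66]})
y = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6, 7], "target": [1, 0, 1, 0, 0, 0, 0]})

x2, y2 = oversample_ragged(x, y)
print(y2["target"].value_counts())   # both classes now have 5 samples
print(x2.tail(10))                   # rows of the copied sequences (new ids 8, 9, 10)
```

With the sample data, np.tile cycles through the minority ids as 1, 3, 1, so the new ids 8, 9 and 10 copy the sequences of ids 1, 3 and 1. Using np.repeat instead would take them in the order 1, 1, 3, which duplicates the first minority sequence more often when the deficit is not a multiple of the minority size.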

Answer