I'm trying to use pandas to oversample my ragged data (groups of different lengths).
Given the following data samples:
    import pandas as pd

    x = pd.DataFrame({'id':[1,1,1,2,2,3,3,3,3,4,5,6,6],'f1':[11,11,11,22,22,33,33,33,33,44,55,66,66]})
    y = pd.DataFrame({'id':[1,2,3,4,5,6],'target':[1,0,1,0,0,0]})
Data (groups are separated with --- for convenience):
        id  f1
    0    1  11
    1    1  12
    2    1  13
    -----------
    3    2  22
    4    2  22
    -----------
    5    3  33
    6    3  34
    7    3  35
    8    3  36
    -----------
    9    4  44
    -----------
    10   5  55
    -----------
    11   6  66
    12   6  66
Targets:
       id  target
    0   1       1
    1   2       0
    2   3       1
    3   4       0
    4   5       0
    5   6       0
I would like to balance the minority class. In the sample above, target 1 is the minority class with 2 samples, for ids 1 & 3.
I’m looking for a way to oversample the data so the results would be:
        id  f1
    0    1  11
    1    1  12
    2    1  13
    -----------
    3    2  22
    4    2  22
    -----------
    5    3  33
    6    3  34
    7    3  35
    8    3  36
    -----------
    9    4  44
    -----------
    10   5  55
    -----------
    11   6  66
    12   6  66
    -----------------
    13   7  11
    14   7  12    Replica of id 1
    15   7  13
    -----------------
    16   8  33
    17   8  34    Replica of id 3
    18   8  35
    19   8  36
And the targets would be balanced:
       id  target
    0   1       1
    1   2       0
    2   3       1
    3   4       0
    4   5       0
    5   6       0
    6   7       1
    7   8       1
With exactly 4 positive and 4 negative samples.
Answer
You can use:
    import numpy as np
    import pandas as pd

    x = pd.DataFrame({'id':[1,1,1,2,2,3,3,3,3,4,5,6,6],
                      'f1':[11,11,11,22,22,33,33,33,33,44,55,66,66]})

    # more general sample
    y = pd.DataFrame({'id':[1,2,3,4,5,6,7],'target':[1,0,1,0,0,0,0]})

    # for each class, count how many groups are missing to reach the largest class,
    # then repeat the class label that many times
    s = y['target'].value_counts()
    s1 = s.rsub(s.max())
    new = s1.index.repeat(s1).tolist()

    # create helper df with fresh ids and add it to y
    y1 = pd.DataFrame({'id':range(y['id'].max() + 1, y['id'].max() + len(new) + 1),
                       'target':new})
    y2 = pd.concat([y, y1], ignore_index=True)
    print(y2)

    # filter y by the first value of new (the minority class)
    add = y[y['target'].eq(new[0])]

    # cycle through the minority rows with np.tile (np.repeat is also possible),
    # add the new ids from y1 as a helper column and merge to x to copy the rows
    add = (add.loc[np.tile(add.index, (len(new) // len(add)) + 1), ['id']]
              .head(len(new))
              .assign(new = y1['id'].tolist())
              .merge(x, on='id', how='left')
              .drop('id', axis=1)
              .rename(columns={'new':'id'}))

    # add to x
    x2 = pd.concat([x, add], ignore_index=True)
    print(x2)
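The counting step and the `np.tile` step can be checked in isolation. The sketch below reuses the more general sample `y` from above; the names `counts`, `missing`, `minority_ids` and `n_cycles` are only illustrative and do not appear in the answer. It shows how many replicas are requested and which existing id each replica would copy with `np.tile` compared with `np.repeat`.

```python
import numpy as np
import pandas as pd

y = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7],
                  'target': [1, 0, 1, 0, 0, 0, 0]})

# Groups per class: class 0 has 5, class 1 has 2.
counts = y['target'].value_counts()

# Groups each class is missing to match the largest class
# (equivalent to counts.rsub(counts.max())).
missing = counts.max() - counts

# One label per replica that has to be created.
new = missing.index.repeat(missing.to_numpy()).tolist()

# Existing ids of the class that needs replicas.
minority_ids = y.loc[y['target'].eq(new[0]), 'id'].to_numpy()
n_cycles = len(new) // len(minority_ids) + 1

# np.tile cycles through the ids, np.repeat exhausts one id before the next.
tiled = np.tile(minority_ids, n_cycles)[:len(new)]
repeated = np.repeat(minority_ids, n_cycles)[:len(new)]

print(new)                # [1, 1, 1]
print(tiled.tolist())     # [1, 3, 1]
print(repeated.tolist())  # [1, 1, 3]
```

With `np.tile` the replicas are spread across the minority groups as evenly as possible, which is probably what you want when the number of replicas is not a multiple of the number of minority groups.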
The solution above only works for unbalanced data: if y is already balanced, new is empty and new[0] raises an IndexError. If the data can sometimes be balanced, add a check:
    # balanced sample
    y = pd.DataFrame({'id':[1,2,3,4,5,6],'target':[1,1,1,0,0,0]})

    # for each class, count how many groups are missing to reach the largest class,
    # then repeat the class label that many times
    s = y['target'].value_counts()
    s1 = s.rsub(s.max())
    new = s1.index.repeat(s1).tolist()

    if len(new) > 0:

        # create helper df with fresh ids and add it to y
        y1 = pd.DataFrame({'id':range(y['id'].max() + 1, y['id'].max() + len(new) + 1),
                           'target':new})
        y2 = pd.concat([y, y1], ignore_index=True)
        print(y2)

        # filter y by the first value of new (the minority class)
        add = y[y['target'].eq(new[0])]

        # cycle through the minority rows with np.tile (np.repeat is also possible),
        # add the new ids from y1 as a helper column and merge to x to copy the rows
        add = (add.loc[np.tile(add.index, (len(new) // len(add)) + 1), ['id']]
                  .head(len(new))
                  .assign(new = y1['id'].tolist())
                  .merge(x, on='id', how='left')
                  .drop('id', axis=1)
                  .rename(columns={'new':'id'}))

        # add to x
        x2 = pd.concat([x, add], ignore_index=True)
        print(x2)

    else:
        print('y is already balanced')
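If you need this more than once, the same steps could be wrapped in a function. This is only a minimal sketch under a few assumptions: the target is binary, `oversample_groups` is a made-up name, and the `f1` values are taken from the table printed in the question so that the replicas are easy to recognise. It returns copies unchanged when `y` is already balanced.

```python
import numpy as np
import pandas as pd


def oversample_groups(x, y):
    """Replicate minority-class groups until both classes have equal counts.

    Hypothetical helper, not part of pandas; assumes a binary `target` column.
    """
    counts = y['target'].value_counts()
    missing = counts.max() - counts
    new = missing.index.repeat(missing.to_numpy()).tolist()
    if not new:  # already balanced: nothing to add
        return x.copy(), y.copy()

    # Fresh ids for the replicas, continuing after the largest existing id.
    first_new_id = int(y['id'].max()) + 1
    new_ids = list(range(first_new_id, first_new_id + len(new)))
    y_extra = pd.DataFrame({'id': new_ids, 'target': new})

    # Cycle through the minority ids until there is one source id per new id.
    source = y.loc[y['target'].eq(new[0]), 'id'].to_numpy()
    source = np.tile(source, len(new) // len(source) + 1)[:len(new)]

    # Copy the rows of each source group and relabel them with the new id.
    x_extra = (pd.DataFrame({'id': source, 'new': new_ids})
                 .merge(x, on='id', how='left')
                 .drop(columns='id')
                 .rename(columns={'new': 'id'}))

    return (pd.concat([x, x_extra], ignore_index=True),
            pd.concat([y, y_extra], ignore_index=True))


x = pd.DataFrame({'id': [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 5, 6, 6],
                  'f1': [11, 12, 13, 22, 22, 33, 34, 35, 36, 44, 55, 66, 66]})
y = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6], 'target': [1, 0, 1, 0, 0, 0]})

x2, y2 = oversample_groups(x, y)
print(y2['target'].value_counts().to_dict())   # four groups of each class
print(x2.loc[x2['id'].eq(7), 'f1'].tolist())   # [11, 12, 13], replica of id 1
print(x2.loc[x2['id'].eq(8), 'f1'].tolist())   # [33, 34, 35, 36], replica of id 3
```

Returning new frames instead of modifying `x` and `y` in place keeps the original data available, for example to oversample only the training split.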