I want to shuffle this dataset to get a random sample. It has 1.6 million rows, but the first rows are class 0 and the last are class 4, so I need to pick samples randomly to get more than one class. The current code prints only class 0 (meaning the sample contains just one class). I followed advice from this platform, but it doesn't work.
fid = open("sentiment_train.csv", "r")
li = fid.readlines(16000000)
random.shuffle(li)
fid2 = open("shuffled_train.csv", "w")
fid2.writelines(li)
fid2.close()
fid.close()
sentiment_onefourty_train = pd.read_csv('shuffled_train.csv', header= 0, delimiter=",", usecols=[0,5], nrows=100000)
sentiment_onefourty_train.columns=['target', 'text']
print(sentiment_onefourty_train['target'].value_counts())
Answer
Because you read in your data using Pandas, you can also do the randomisation in a different way, using the DataFrame's sample method (df.sample):
import pandas as pd
df = pd.read_csv('sentiment_train.csv', header= 0, delimiter=",", usecols=[0,5])
df.columns=['target', 'text']
df1 = df.sample(n=100000)
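One likely reason the file-based shuffle in the question only ever shows class 0: the argument to readlines is a size hint (roughly the number of characters to read), not a number of rows, so readlines(16000000) may stop well before the end of a large file, while still inside the class 0 rows. The sketch below illustrates that behaviour on a small in-memory stand-in for the CSV; the fake rows and the hint of 100 are made up for illustration only.

```python
import io

# Hypothetical stand-in for the CSV: 1000 rows of class 0 followed by 1000 rows of class 4.
rows = [f"0,row {i}\n" for i in range(1000)] + [f"4,row {i}\n" for i in range(1000)]
fake_file = io.StringIO("".join(rows))

# The argument is a size hint, not a line count:
# reading stops once the lines read so far exceed that size.
partial = fake_file.readlines(100)

# Without a hint, every line is returned.
fake_file.seek(0)
everything = fake_file.readlines()

partial_classes = {line.split(",")[0] for line in partial}
all_classes = {line.split(",")[0] for line in everything}
print(len(partial), partial_classes)
print(len(everything), all_classes)
```

If that is what happened, calling readlines() with no argument would read the whole file before shuffling, at the cost of holding all lines in memory.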
If sampling this way still returns a single class, check how many unique values the target column has and how frequently each one appears. If the first 1,599,999 rows are 0 and only the last one is 4, then chances are that you won't get any 4 in the sample.
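As a rough sketch of that check, the snippet below counts the classes with value_counts and then, as one possible follow-up that goes beyond the answer itself, draws the same number of rows from each class with groupby(...).sample (available in pandas 1.1 and later). The toy DataFrame and the per_class value are assumptions made up for illustration; with the real data you would use the frame loaded from sentiment_train.csv.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real data: sorted by class, as in the question.
df = pd.DataFrame({
    "target": [0] * 800 + [4] * 800,
    "text": [f"tweet {i}" for i in range(1600)],
})

# How many rows does each class have?
print(df["target"].value_counts())

# Draw the same number of rows from every class (needs pandas >= 1.1).
per_class = 50  # assumed per-class sample size, chosen only for illustration
balanced = df.groupby("target").sample(n=per_class, random_state=0)

# Shuffle the result so the classes are interleaved rather than grouped.
balanced = balanced.sample(frac=1, random_state=0).reset_index(drop=True)
print(balanced["target"].value_counts())
```

Sampling per class guarantees every class is present, whereas a plain df.sample only makes it likely when the classes are reasonably balanced.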