I have two text files containing parallel text in two languages (potentially millions of lines), and I am trying to generate random train/validate/test files from them, the way train_test_split does in sklearn. However, when I try to import the data into pandas using read_csv, many lines raise errors because of malformed data, and it would be far too much work to fix the broken lines. If I set error_bad_lines=False, pandas may skip lines in one file but not the other, which would ruin the alignment. Splitting the files manually with the unix split command works fine for my needs, so I'm not concerned with cleaning the data, but the split it returns is not random.
How should I go about splitting this dataset into train/validate/test sets?
I'm using Python, but I can also use Linux commands if that would be easier.
Answer
I found that I can use the shuf command with the --random-source option, like this:

    shuf tgt-full.txt -o tgt-fullshuf.txt --random-source=tgt-full.txt

Because shuf draws its randomness from the bytes of the --random-source file, shuffling both parallel files with the same random source applies the identical permutation to each, so the line alignment is preserved.
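A minimal sketch of the whole pipeline under that approach, assuming the parallel files are named src-full.txt and tgt-full.txt and using placeholder slice sizes of 1,000,000 train / 100,000 validate lines (adjust to the real line count):

    # Shuffle both files with the SAME --random-source so each gets
    # the identical permutation and the line alignment is preserved.
    shuf tgt-full.txt -o tgt-fullshuf.txt --random-source=tgt-full.txt
    shuf src-full.txt -o src-fullshuf.txt --random-source=tgt-full.txt

    # Carve each shuffled file into train/validate/test slices
    # at the same line offsets.
    for f in src tgt; do
        head -n 1000000  "${f}-fullshuf.txt" > "${f}-train.txt"
        tail -n +1000001 "${f}-fullshuf.txt" | head -n 100000 > "${f}-validate.txt"
        tail -n +1100001 "${f}-fullshuf.txt" > "${f}-test.txt"
    done

Whatever remains after the train and validate slices becomes the test set, and because both files are cut at identical offsets the sentence pairs stay matched up.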