I have two text files containing parallel text in two languages (potentially millions of lines), and I am trying to generate random train/validate/test files from them, the way train_test_split does in sklearn. However, when I try to import the data into pandas using read_csv, many of the lines raise errors because of malformed data, and it would be far too much work to fix the broken lines by hand. If I set error_bad_lines=False, pandas may skip lines in one of the files and not the other, which would ruin the alignment. Splitting the files manually with the Unix split command works fine for my needs, so I'm not concerned with cleaning the data; the problem is that the data split that way is not random.
How should I go about splitting this dataset into train/validate/test sets?
I'm using Python, but I can also use Linux commands if that would be easier.
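One pure-Python way to keep the two files aligned is to shuffle a single list of line indices and apply that one permutation to both files. A minimal sketch (the file-name scheme, the 80/10/10 ratios, and the function name are my own illustration, not an established API; it also assumes both files fit in memory):

```python
import random

def split_parallel(src_path, tgt_path, prefixes=("src", "tgt"),
                   ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle two line-aligned files with one shared permutation and
    write train/valid/test pairs (e.g. src-train.txt / tgt-train.txt)."""
    with open(src_path, encoding="utf-8") as f:
        src = f.readlines()
    with open(tgt_path, encoding="utf-8") as f:
        tgt = f.readlines()
    assert len(src) == len(tgt), "files are not line-aligned"

    # One permutation of indices, applied to BOTH files, so each
    # sentence pair stays together after shuffling.
    idx = list(range(len(src)))
    random.Random(seed).shuffle(idx)

    n_train = int(len(idx) * ratios[0])
    n_valid = int(len(idx) * ratios[1])
    parts = {
        "train": idx[:n_train],
        "valid": idx[n_train:n_train + n_valid],
        "test":  idx[n_train + n_valid:],
    }
    for name, ids in parts.items():
        for lines, prefix in ((src, prefixes[0]), (tgt, prefixes[1])):
            with open(f"{prefix}-{name}.txt", "w", encoding="utf-8") as out:
                out.writelines(lines[i] for i in ids)
```

The fixed seed makes the split reproducible; reading both files fully into memory is usually fine even for a few million short lines.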
I found that I can use the shuf command on the file with the --random-source parameter, like this: shuf tgt-full.txt -o tgt-fullshuf.txt --random-source=tgt-full.txt. Because shuf draws its randomness from the bytes of the file passed to --random-source, running it on both files with the same random source applies the identical permutation to each, so the line alignment survives the shuffle.
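The whole pipeline can be sketched in shell. This is a toy-sized sketch under stated assumptions: the real corpus file names (src-full.txt / tgt-full.txt), the 1000-line stand-in data, and the 800/100/100 split sizes are all placeholders to adjust; the key point is passing the same --random-source file to both shuf runs.

```shell
# Toy stand-ins for the real parallel corpus (1000 aligned lines each);
# in practice src-full.txt and tgt-full.txt already exist.
seq 1 1000 | sed 's/^/src /' > src-full.txt
seq 1 1000 | sed 's/^/tgt /' > tgt-full.txt

# Use the SAME file as --random-source for both runs: shuf reads its
# random bytes from that file, so both shuffles apply the identical
# permutation and the sentence pairs stay line-aligned.
shuf src-full.txt -o src-shuf.txt --random-source=src-full.txt
shuf tgt-full.txt -o tgt-shuf.txt --random-source=src-full.txt

# 800/100/100 split with head/tail (scale to your real line counts).
head -n 800 src-shuf.txt               > src-train.txt
head -n 900 src-shuf.txt | tail -n 100 > src-valid.txt
tail -n 100 src-shuf.txt               > src-test.txt
head -n 800 tgt-shuf.txt               > tgt-train.txt
head -n 900 tgt-shuf.txt | tail -n 100 > tgt-valid.txt
tail -n 100 tgt-shuf.txt               > tgt-test.txt
```

The permutation is deterministic for a fixed random-source file, which doubles as a way to reproduce the exact same split later.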