How to split parallel corpora while keeping alignment?

I have two text files containing parallel text in two languages (potentially millions of lines each). I am trying to generate random train/validate/test files from them, as train_test_split does in sklearn. However, when I try to load them into pandas using read_csv, many lines raise errors because of bad data, and it would be far too much work to fix the broken lines. If I set error_bad_lines=False, it will skip some lines in one file and possibly not in the other, which would ruin the alignment. If I split the files manually with Unix split, it works fine for my needs, so I'm not concerned with cleaning the data, but the resulting split is not random.
How should I go about splitting this dataset into train/validate/test sets?
I’m using python but I can also use linux commands if that would be easier.

Answer

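I found that I can use the shuf command with the --random-source parameter, like this: shuf tgt-full.txt -o tgt-fullshuf.txt --random-source=tgt-full.txt. Running shuf on the other file with the same --random-source file should produce the same permutation, provided both files have the same number of lines, so the pairs stay aligned. The shuffled files can then be cut into train/validate/test with split, as before.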
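Since the question mentions Python, the same idea can also be sketched there: shuffle one list of line indices and apply it to both files. This is only a minimal sketch, not a tested recipe for every corpus. The function names (read_lines, split_parallel), the output names (train.src, valid.tgt, and so on), the 80/10/10 ratios and the seed are all assumptions made for illustration. It reads both files fully into memory and works on raw bytes, so nothing is parsed and no line can be dropped from one side only.

```python
import random
from pathlib import Path


def read_lines(path):
    # Binary mode: lines are split on b"\n" only and never decoded,
    # so malformed text cannot raise errors or be skipped.
    with open(path, "rb") as f:
        lines = f.readlines()
    # Terminate the last line so it cannot merge with a neighbour after shuffling.
    if lines and not lines[-1].endswith(b"\n"):
        lines[-1] += b"\n"
    return lines


def split_parallel(src_path, tgt_path, out_dir, ratios=(0.8, 0.1, 0.1), seed=0):
    """Hypothetical helper: write aligned train/valid/test files for two parallel files."""
    src, tgt = read_lines(src_path), read_lines(tgt_path)
    if len(src) != len(tgt):
        raise ValueError(f"line counts differ: {len(src)} vs {len(tgt)}")

    # One shuffled index list is applied to both files; this keeps pairs aligned.
    order = list(range(len(src)))
    random.Random(seed).shuffle(order)

    n_train = int(len(order) * ratios[0])
    n_valid = int(len(order) * ratios[1])
    parts = {
        "train": order[:n_train],
        "valid": order[n_train:n_train + n_valid],
        "test": order[n_train + n_valid:],  # whatever remains
    }

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, idx in parts.items():
        for side, lines in (("src", src), ("tgt", tgt)):
            with open(out_dir / f"{name}.{side}", "wb") as f:
                f.writelines(lines[i] for i in idx)
    # Return the number of line pairs written to each split.
    return {name: len(idx) for name, idx in parts.items()}
```

Calling split_parallel("src-full.txt", "tgt-full.txt", "splits") would write six files into splits/ (src-full.txt is a placeholder name for the other language's file). For corpora too large for memory, the shuf approach above is probably the better fit.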
