The code tries to split the text data based on a separator but I keep getting an error
    Traceback (most recent call last):
      File "split.py", line 7, in <module>
        en_text = split_text[1].lstrip()
    IndexError: list index out of range
The two output files should also have the same number of lines, but I got 94132 lines in en_out.txt and 94304 lines in mn_out.txt, and I'm not sure what's going on.
The code I used is:

    with open('mn_en_sentences_split.txtaa') as inputFile:
        inFile = inputFile.readlines()
    for i in inFile:
        split_text = i.split("+++++SEP+++++")
        mn_text = split_text[0].rstrip()
        en_text = split_text[1].lstrip()
        with open("mn_out.txt", "a") as mn_out:
            mn_out.write(mn_text + "\n")
        with open("en_out.txt", "a") as en_out:
            en_out.write(en_text)
The input file for this code can be found here at https://drive.google.com/file/d/1GNo1XJxRFxjey5VDsHjLvj9upXJOqd3e/view
Answer
The reason for the IndexError is that split_text has only one element when the line does not contain the separator.
You have to deal with this case: either drop that line or process it differently.
Another case is a line that contains multiple separators. Marat had a nice solution for that case (see the edit below).
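To see how often each case occurs before deciding what to do with it, a small tally can help. This is a minimal sketch, not part of the original code: the helper name `count_separators` and the sample lines are made up for illustration, and only the separator string comes from the question.

```python
from collections import Counter

SEP = "+++++SEP+++++"  # separator string from the question


def count_separators(lines, sep=SEP):
    """Tally how many lines contain 0, 1, 2, ... occurrences of sep."""
    return Counter(line.count(sep) for line in lines)


# Hypothetical sample lines standing in for the real input file.
sample = [
    "mn one +++++SEP+++++ en one\n",
    "no separator here\n",
    "a +++++SEP+++++ b +++++SEP+++++ c\n",
]
tally = count_separators(sample)
print(tally)
```

For the real data, an open file object can be passed instead of the sample list, since iterating over a file yields its lines. Any count under a key other than 1 points at lines that need special handling.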
A few other refactoring tips:
There is no need to read the whole file before processing it.
For faster processing, do not open and close the output files once per line.
Use a debugger to check whether the results of the split contain the newline character.
If you don't need any whitespace at the ends of a string, you can remove all of it with strip(), or remove only the newline character with strip('\n').
Then add the newline back for both written lines so they are handled the same way.
    with open('mn_en_sentences_split.txtaa') as inputFile:
        with open("mn_out.txt", "w") as mn_out:
            with open("en_out.txt", "w") as en_out:
                for i in inputFile:
                    # a list (not map) so that len() and indexing work in Python 3
                    split_text = [x.strip('\n') for x in i.split("+++++SEP+++++")]
                    if len(split_text) < 2:
                        continue  # drop line if no separator
                    mn_out.write(split_text[0].rstrip() + "\n")
                    en_out.write(split_text[1].lstrip() + "\n")
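The difference between strip() and strip('\n') mentioned in the tips can be checked on a single string. The sample string here is invented for illustration.

```python
# Hypothetical sample line with leading spaces, a trailing space, a tab and a newline.
line = "  hello world \t\n"

all_stripped = line.strip()        # removes all leading and trailing whitespace
newline_only = line.strip("\n")    # removes only leading and trailing newlines

print(repr(all_stripped))
print(repr(newline_only))
```

Which of the two you want depends on whether leading and trailing spaces in the sentences carry any meaning for you.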
Edit
Marat made a few suggestions to refactor the code and make it fail-safe when the separator is not found. The three with statements can be joined into a single with statement (syntactic sugar) to reduce the indentation; this is supported from Python 3.1 onwards.
I really like the variable unpacking of the split result. If the line has no separator, the unpacking raises a ValueError exception.
I have chosen to skip the lines that do not have a separator. If you want to do something with these lines, you have to put the write() calls outside/below the try/except and set mn and en to some value in the exception handler.
I like to keep the normal flow of code inside the try.
What and how you want to strip from the strings is all up to you depending on what you want and what the input might contain.
    with open('mn_en_sentences_split.txtaa') as inputFile, open("mn_out.txt", "w") as mn_out, open("en_out.txt", "w") as en_out:
        for line in inputFile:
            try:
                mn, en = line.strip('\n').split("+++++SEP+++++", 1)
                mn_out.write(mn.rstrip() + "\n")
                en_out.write(en.lstrip() + "\n")
            except ValueError:
                pass
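The alternative described above, where lines without a separator are kept instead of skipped, might look roughly like this. It is only a sketch under assumptions: the function name `split_pairs`, the in-memory sample data, and the fallback values (the whole line as mn, an empty string as en) are illustrative choices, not something the original code prescribes.

```python
import io

SEP = "+++++SEP+++++"  # separator string from the question


def split_pairs(input_file, mn_out, en_out, sep=SEP):
    """Write one line to each output for every input line."""
    for line in input_file:
        try:
            mn, en = line.strip("\n").split(sep, 1)
        except ValueError:
            # No separator found. These fallback values are an assumption:
            # keep the whole line as mn and write an empty en line.
            mn, en = line.strip("\n"), ""
        # The writes sit below the try/except, so they run for every line.
        mn_out.write(mn.rstrip() + "\n")
        en_out.write(en.lstrip() + "\n")


# Hypothetical in-memory data standing in for the real files.
source = io.StringIO("mn text +++++SEP+++++ en text\nno separator\n")
mn_buf, en_buf = io.StringIO(), io.StringIO()
split_pairs(source, mn_buf, en_buf)
print(repr(mn_buf.getvalue()))
print(repr(en_buf.getvalue()))
```

Because both writes happen for every input line, the two outputs always end up with the same number of lines, which is the property the question asks for.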