
IndexError in Python while splitting an input file based on a pattern

The code tries to split the text data based on a separator, but I keep getting this error:

Traceback (most recent call last):
  File "split.py", line 7, in <module>
    en_text = split_text[1].lstrip()
IndexError: list index out of range

The two output files should also have the same number of lines, but I got 94132 lines in en_out.txt and 94304 lines in mn_out.txt, and I'm not sure what's going on.

The code I used is

with open('mn_en_sentences_split.txtaa') as inputFile:
    inFile = inputFile.readlines()

for i in inFile:
    split_text = i.split("+++++SEP+++++")
    mn_text = split_text[0].rstrip()
    en_text = split_text[1].lstrip()
    with open("mn_out.txt", "a") as mn_out:
        mn_out.write(mn_text + "\n")
    
    with open("en_out.txt", "a") as en_out:
        en_out.write(en_text)

The input file for this code can be found here at https://drive.google.com/file/d/1GNo1XJxRFxjey5VDsHjLvj9upXJOqd3e/view


Answer

The IndexError occurs because split_text has only one element when the line does not contain the separator.

You have to handle this case: either drop that line or process it differently.

Another case to consider is a line with multiple separators. Marat had a nice solution for that case (see the Edit section of this answer).
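To make the three cases concrete, here is a minimal sketch of how str.split behaves. The sample lines are made up for illustration; only the separator string comes from the question.

```python
SEP = "+++++SEP+++++"  # separator string taken from the question

# Hypothetical sample lines, not taken from the real input file
no_sep = "only one side\n"
one_sep = "mn text +++++SEP+++++ en text\n"
two_sep = "mn +++++SEP+++++ en +++++SEP+++++ extra\n"

parts_none = no_sep.split(SEP)        # separator missing: a single element
parts_one = one_sep.split(SEP)        # the expected case: two elements
parts_two = two_sep.split(SEP)        # two separators: three elements
parts_two_max = two_sep.split(SEP, 1) # maxsplit=1: never more than two elements

print(len(parts_none))   # 1, so parts_none[1] raises IndexError
print(parts_one)
print(parts_two)
print(parts_two_max)
```

Note that with two separators, element [1] no longer ends with the newline, because the newline stays on the last element. Since the question's code writes en_text without adding a newline itself, such lines could be one contributor to the differing line counts, although that has not been checked against the actual input file.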

A few other refactoring tips:

You don't need to read the whole file before processing; you can iterate over the file object line by line.

For faster processing, open each output file once instead of opening and closing it for every line.

Use a debugger to inspect the results of the split and check whether they contain the newline character.

If you don't need any whitespace at the ends of the string, you can remove all of it with strip(), or remove only the newline character with strip('\n').

Then add the newline back to both written lines so the two files stay consistent.
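If no debugger is at hand, printing with repr() shows the same thing. The sketch below uses a made-up line to compare the two stripping options mentioned above.

```python
# Hypothetical line with leading spaces, a trailing tab and a newline
raw = "  some text \t\n"

only_newline = raw.strip("\n")  # removes only newline characters at the ends
all_whitespace = raw.strip()    # removes every kind of whitespace at the ends

# repr() makes otherwise invisible characters such as \t and \n visible
print(repr(raw))
print(repr(only_newline))
print(repr(all_whitespace))
```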

with open('mn_en_sentences_split.txtaa') as inputFile:
    with open("mn_out.txt", "w") as mn_out:
        with open("en_out.txt", "w") as en_out:
            for i in inputFile:
                # build a list (not a map object) so len() and indexing work in Python 3
                split_text = [x.strip('\n') for x in i.split("+++++SEP+++++")]
                if len(split_text) < 2: continue  # drop line if no separator
                mn_out.write(split_text[0].rstrip() + "\n")
                en_out.write(split_text[1].lstrip() + "\n")


Edit

Marat made a few suggestions to refactor the code and make it fail-safe when the separator is not found. The three with statements can be combined into a single one (syntactic sugar) to reduce the indentation; this form requires Python 3.1 or later.

I really like unpacking the split result into variables. If the unpacking fails, you get a ValueError exception.

I have chosen to skip the lines that do not have a separator. If you want to do something with these lines, move the write() calls below the try/except and set mn and en to some value in the exception handler.

I like to keep the normal flow of code inside the try.

What you strip from the strings, and how, is up to you and depends on what the input might contain.

with open('mn_en_sentences_split.txtaa') as inputFile, \
     open("mn_out.txt", "w") as mn_out, \
     open("en_out.txt", "w") as en_out:
    for line in inputFile:
        try:
            mn, en = line.strip('\n').split("+++++SEP+++++", 1)
            mn_out.write(mn.rstrip() + "\n")
            en_out.write(en.lstrip() + "\n")
        except ValueError:
            pass
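For the alternative mentioned earlier, keeping the lines without a separator instead of skipping them, a rough sketch could look like the following. The helper name split_line, the empty-string placeholder, and the choice to keep the whole line on the mn side are all assumptions for illustration, and io.StringIO stands in for the real files so the example runs on its own.

```python
import io

SEP = "+++++SEP+++++"  # separator string taken from the question


def split_line(line, sep=SEP, placeholder=""):
    """Return (mn, en) for one line; hypothetical helper, not from the answer."""
    try:
        mn, en = line.strip("\n").split(sep, 1)
    except ValueError:
        # Assumption: keep the whole line as mn and use a placeholder for en
        mn, en = line.strip("\n"), placeholder
    return mn.rstrip(), en.lstrip()


# In-memory stand-ins for the input and output files (made-up content)
sample = io.StringIO("a +++++SEP+++++ b\nno separator here\nc +++++SEP+++++ d\n")
mn_out, en_out = io.StringIO(), io.StringIO()

for line in sample:
    mn, en = split_line(line)
    # The writes sit outside the try/except, so every input line produces
    # exactly one line in each output
    mn_out.write(mn + "\n")
    en_out.write(en + "\n")

mn_lines = mn_out.getvalue().splitlines()
en_lines = en_out.getvalue().splitlines()
print(mn_lines)
print(en_lines)
```

Because both outputs receive one line per input line, line N of one file always corresponds to line N of the other, which matters if the files are later used as aligned sentence pairs.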

User contributions licensed under: CC BY-SA