
Using Python to clean a text file so that every line follows a consistent format

I have an unusual use case. I have a .txt file in the following format, where every line starts with “APA” and ends with “||” (the lines vary in length and content, which does not matter):

JavaScript

However, for unknown reasons, some of these lines are split, like so:

JavaScript

This line should have looked like this:

JavaScript

The file is large (16 MB), and this is the logic I have come up with:

Read each line into a list of strings and apply the following algorithm:

JavaScript

It mostly works. However, some of the results are omitted or are not in insertion order. Please feel free to suggest your own logic as well.

In summary, I need a list of strings with the correct format mentioned above.

Thanks


Answer

It appears that you have || not only at the end of each line, but also in between columns. If this is a typo, the plain str.split() method would be sufficient; if it’s not, you can use re.split() to split only on the || at the end of lines. Afterwards, remove the line breaks from the resulting list elements and finally join everything back together:

JavaScript

Output:

JavaScript
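
For reference, here is a minimal sketch of the re.split() approach described above. The sample text and the function name repair_records are made up for illustration. It assumes every record starts with APA and ends with || right before a line break, and that the whole file fits in memory (16 MB should be fine):

```python
import re

def repair_records(text: str) -> list[str]:
    """Rejoin records that were broken across lines (hypothetical helper).

    Assumption: a record really ends only where "||" is followed by a line
    break and the next line starts with "APA" (or the text ends there).
    """
    # Split on a line break that is preceded by "||" and followed by "APA"
    # or the end of the text; every other line break stays inside its chunk.
    chunks = re.split(r"(?<=\|\|)\r?\n(?=APA|\Z)", text)
    # Remove the stray line breaks left inside each chunk.
    records = [chunk.replace("\r", "").replace("\n", "") for chunk in chunks]
    # Drop the empty string produced by a trailing line break.
    return [record for record in records if record]

# Made-up sample: the second and third records are broken across two lines.
sample = (
    "APA||one||two||\n"
    "APA||thr\n"
    "ee||four||\n"
    "APA||five||\n"
    "six||\n"
)

for record in repair_records(sample):
    print(record)
```

To run it on a real file, read the contents with open(path, encoding="utf-8").read() (path being whatever your file is called) and pass the string to repair_records. The lookahead for APA is there so that a record which happens to break right after a || column separator is not cut in two; order is preserved because re.split() returns the pieces in the order they occur.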
User contributions licensed under: CC BY-SA