I have a somewhat unusual use case. I have a text file in which every line starts with “APA” and ends with “||” (the lines vary in length and content, which does not matter):
APA lEDGER|5023|124223|STAFF NAME|XYZ|123||
APA lEDGER|5023|124223|STAFF NAME|XYZ|131|12r2gw||
APA lEDGER|5023|124223|STAFF NAME|XYZ|43s|12|123sdfq|prime||
However, for unknown reasons, some of these lines are split across two lines, like so:
APA lEDGER|5023|hello|
40937 / 903.01 for period: 2021|8|332.48||
This should have been a single line:
APA lEDGER|5023|hello|40937 / 903.01 for period: 2021|8|332.48||
The file is fairly large (16 MB). My approach so far is to read each line into a list of strings and apply the following algorithm:
pattern = re.compile(r"\s*[^APA]")
patternOK = re.compile(r"\s*[APA]")
final_list = []  # list to store the cleaned strings
ptr = 0
for elem in string_list:
    if elem.startswith("APA") and elem.endswith("||"):
        final_list.append(elem)  # add each string with the proper format to the final list
    # maintain a pointer, point it to the current string and one to the previous string;
    # if the current string does not start with an APA then append all the strings
    # in a while loop and then append it back to the previous proper string
    if ptr < len(string_list):
        if ptr - 1 >= 0:
            prev_el = str(string_list[ptr - 1])
            curr_el = str(string_list[ptr])
            if pattern.match(curr_el.strip()):
                while pattern.match(string_list[ptr]):
                    prev_el = prev_el + '|' + string_list[ptr]
                    ptr = ptr + 1
                    if ptr + 1 > len(string_list):
                        break
                final_list.append(prev_el)  # append the cleaned string to the final list
    ptr = timer+1
It mostly works. However, some of the results are omitted or do not appear in their original order. Please feel free to suggest your own logic as well.
In summary, I need a list of strings with the correct format mentioned above.
Thanks
Answer
It appears that you have || not only at the end of each line but also between columns. If this is a typo, the normal .split() function would be sufficient. If it is not, you can use re.split() to split only on the || at the end of lines. Afterwards, remove the line breaks from the resulting list elements and finally join everything back together:
import re

data = """APA lEDGER|5023|124223|STAFF NAME|XYZ|123||
APA lEDGER|5023|124223|STAFF NAME|XYZ|131|12r2gw||
APA lEDGER|5023|124223|STAFF NAME|XYZ|43s|12|123sdfq||prime||
APA lEDGER|5023|hello|
40937 / 903.01 for period: 2021|8|332.48||
"""

# Match "||" only at the end of a line (re.M makes $ match before each newline)
r = re.compile(r'\|\|$', re.M)
splits = re.split(r, data)
# Replace the line breaks left inside each record with a space
clean_lines = (line.replace('\n', ' ').strip() for line in splits)
# Put the "||" terminator back and rejoin the records, one per line
clean_file = '||\n'.join(clean_lines)
print(clean_file.splitlines())
Output:
['APA lEDGER|5023|124223|STAFF NAME|XYZ|123||', 'APA lEDGER|5023|124223|STAFF NAME|XYZ|131|12r2gw||', 'APA lEDGER|5023|124223|STAFF NAME|XYZ|43s|12|123sdfq||prime||', 'APA lEDGER|5023|hello| 40937 / 903.01 for period: 2021|8|332.48||']
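As an alternative, here is a minimal sketch of the line-by-line idea from the question: any line that does not start with APA is treated as a continuation and glued onto the previous record. The function name merge_records and its start parameter are made up for illustration, and the sketch assumes that a continuation fragment never itself starts with APA. Unlike the regex version above, it joins the fragments without a space, which matches the expected line shown in the question.

```python
from typing import Iterable, List


def merge_records(lines: Iterable[str], start: str = "APA") -> List[str]:
    """Glue continuation lines onto the record they belong to.

    Assumption: a line that does not begin with `start` is the tail
    of the record that precedes it.
    """
    records: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(start) or not records:
            records.append(line)   # a new record begins here
        else:
            records[-1] += line    # continuation: append to the previous record
    return records


sample = """APA lEDGER|5023|124223|STAFF NAME|XYZ|123||
APA lEDGER|5023|hello|
40937 / 903.01 for period: 2021|8|332.48||
"""

for record in merge_records(sample.splitlines()):
    print(record)
```

For the real file, passing the open file object, for example merge_records(open(path, encoding="utf-8")) where path is a placeholder, should work the same way, since the function only iterates over lines. Because records are appended in the order they are read, the original order is preserved.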