We have a 100MB pipe delimited file that has 5 column/4 delimiters each separated by a pipe. However there are few rows where the second column has an extra pipe. For these few rows total delimiter are 5.
For example, in the below 4 rows, the 3rd is a problematic one as it has an extra pipe.
1|B|3|D|5 A|1|2|34|5 D|This is a |text|3|5|7 B|4|5|5|6
Is there any way we can remove an extra pipe from the second position where the delimiter count for the row is 5. So, post correction, the file needs to look like below.
1|B|3|D|5 A|1|2|34|5 D|This is a text|3|5|7 B|4|5|5|6
Please note that the file size is 100 MB. Any help is appreciated.
Advertisement
Answer
Source: my_file.txt
1|B|3|D|5 A|1|2|34|5 D|This is a |text|3|5|7 B|4|5|5|6 E|1 |9 |2 |8 |Not| a |text|!!!|3|7|4
Code
# If using Python3.10, this can be Parenthesized context managers # https://docs.python.org/3.10/whatsnew/3.10.html#parenthesized-context-managers with open('./my_file.txt') as file_src, open('./my_file_parsed.txt', 'w') as file_dst: for line in file_src.readlines(): # Split the line by the character '|' line_list = line.split('|') if len(line_list) <= 5: # If the number of columns doesn't exceed, just write the original line as is. file_dst.write(line) else: # If the number of columns exceeds, count the number of columns that should be merged. to_merge_columns_count = (len(line_list) - 5) + 1 # Merge the columns from index 1 to index x which includes all the columns to be merged. merged_column = "".join(line_list[1:1+to_merge_columns_count]) # Replace all the items from index 1 to index x with the single merged column line_list[1:1+to_merge_columns_count] = [merged_column] # Write the updated line. file_dst.write("|".join(line_list))
Result: my_file_parsed.txt
1|B|3|D|5 A|1|2|34|5 D|This is a text|3|5|7 B|4|5|5|6 E|1 9 2 8 Not a text!!!|3|7|4