Reading files faster in Python

I’m writing a script to read a TXT file where each line is a log entry, and I need to split this log into separate files (one each for Hor, Sia, Lmu). Reading each line and writing it to the new files works with no problem on my test file (80 KB), but when I apply it to the actual file (177 MB, around 500k lines) it takes too long: after more than an hour it had only read 80K lines.

The lines are like this:

Crm|Hor|SiebelSeed

Crm|Sia|SiebelSeed

Crm|Lmu|LMU|

Is there any way I can make it run faster?

My code

with open(path, "r", encoding="UTF-16") as file:
    for i, line in enumerate(file): 
    
            if i > 2: # lines 1-2 are headers
                component = re.match(r"Crm\|([A-Za-z0-9_]+)\|", line).group(1)
                
                if component not in comp_list:
                    comp_list.append(component)
                    
                    with open(f'HHR_Splitter/output/{component}.txt','w+', encoding="UTF-16") as new_file:
                        new_file.write('{}'.format(line))
                        
                        
                if component in comp_list:
                    
                    with open(f'HHR_Splitter/output/{component}.txt','a+', encoding="UTF-16") as existing_file: 
                        existing_file.write('{}'.format(line))

                else:
                    break

Answer

The first thing I spot is that you open an output file for every line. You could open each output file once and then process all the lines. The same applies to the regex: you can compile it once before the for loop with re.compile().

Here is an example:

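import re
from contextlib import ExitStack

# Components named in the question; extend this list if the log has more.
COMPONENTS = ["Hor", "Sia", "Lmu"]

def process_log(input_file, output_files):
    # Compile once; the pipes are escaped so they match a literal "|".
    prog = re.compile(r"Crm\|([A-Za-z0-9_]+)\|")
    for i, line in enumerate(input_file):
        if i > 2:  # same header skip as in the question
            match = prog.match(line)
            if match and match.group(1) in output_files:
                output_files[match.group(1)].write(line)

def open_outputs_files(stack):
    # Open every output file once. A plain "with" here would close each
    # file before it is used, so the ExitStack keeps them open instead.
    output_files = {}
    for component in COMPONENTS:
        output_files[component] = stack.enter_context(
            open(f'HHR_Splitter/output/{component}.txt', 'w', encoding="UTF-16"))
    return output_files

with ExitStack() as stack:  # closes the input and all outputs on exit
    input_file = stack.enter_context(open(path, "r", encoding="UTF-16"))
    output_files = open_outputs_files(stack)
    process_log(input_file, output_files)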
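
If the component names are not known in advance (the question builds comp_list while reading), the same "open once" idea can be applied lazily: open an output file the first time its component appears and keep it open until the end. The following is only a sketch under some assumptions: the function name split_log and its parameters out_dir and header_lines are made up for illustration, lines that do not match the pattern are silently skipped, and the number of distinct components is assumed to be small enough to keep all files open at once.

```python
import os
import re
from contextlib import ExitStack

# Pipes are escaped so they match the literal "|" separators.
LINE_RE = re.compile(r"Crm\|([A-Za-z0-9_]+)\|")

def split_log(lines, out_dir, encoding="UTF-16", header_lines=0):
    """Write each line to <out_dir>/<component>.txt, opening files on demand.

    Returns a dict mapping each component to the number of lines written.
    """
    os.makedirs(out_dir, exist_ok=True)
    handles = {}  # component -> open file object
    counts = {}   # component -> lines written
    with ExitStack() as stack:  # closes every opened file on exit
        for i, line in enumerate(lines):
            if i < header_lines:
                continue  # skip header lines
            match = LINE_RE.match(line)
            if match is None:
                continue  # assumption: ignore lines that are not log entries
            component = match.group(1)
            if component not in handles:
                # First time this component is seen: open its file once.
                out_path = os.path.join(out_dir, f"{component}.txt")
                handles[component] = stack.enter_context(
                    open(out_path, "w", encoding=encoding))
                counts[component] = 0
            handles[component].write(line)
            counts[component] += 1
    return counts
```

With the question's setup this could be called as split_log(input_file, 'HHR_Splitter/output', header_lines=3) inside the with open(path, ...) block; header_lines=3 mirrors the i > 2 check in the question.
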
User contributions licensed under: CC BY-SA