Lower execution time for Apache log parser in Python

I have a school assignment where I was tasked with writing an Apache log parser in Python. This parser extracts all the IP addresses and all the HTTP methods using regex and stores them in a nested dictionary. The code can be seen below:

JavaScript

This code works (it gives me the expected data for the log files we were given). However, when extracting data from large log files (in my case, ~500 MB), the program is VERY slow (the script takes ~30 min to finish). According to my teacher, a good script should be able to process the large file in under 3 minutes (wth?). My question is: is there anything I can do to speed up my script? I have already done some things, like replacing the lists with sets, which have better lookup times.
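For context, the list-to-set change mentioned above might look roughly like the sketch below. It is not the code from the assignment; the function names and sample data are made up for illustration. A membership check on a list scans the elements one by one, while a set uses hashing, so the check takes roughly constant time on average.

```python
def unique_ips_list(ips):
    """Collect unique IPs in a list; every `in` check scans the whole list."""
    seen = []
    for ip in ips:
        if ip not in seen:  # O(n) lookup
            seen.append(ip)
    return seen


def unique_ips_set(ips):
    """Collect unique IPs in a set; lookups are O(1) on average."""
    seen = set()
    for ip in ips:
        seen.add(ip)  # duplicates are ignored automatically
    return seen


# Hypothetical sample data
sample = ["10.0.0.1", "10.0.0.2", "10.0.0.1"]
print(unique_ips_list(sample))
print(sorted(unique_ips_set(sample)))
```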

Answer

I found my answer. Use “re.findall()” instead of storing the returned regex matches in an array, like so:

JavaScript

instead of

JavaScript
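The two snippets being compared are not shown here, so the following is only a Python sketch of the contrast described, not the exact code from the answer. The pattern `LINE_RE`, the sample log lines, and both function names are assumptions; the pattern expects the client IP at the start of each line and the HTTP method at the start of the quoted request, as in Apache's common log format.

```python
import re

# Hypothetical pattern: IP at the start of a line, HTTP method right after
# the opening quote of the request. MULTILINE lets ^ match at every line start.
LINE_RE = re.compile(r'^(\d{1,3}(?:\.\d{1,3}){3}) .*?"([A-Z]+) ', re.MULTILINE)

# Made-up sample data in common log format
SAMPLE_LOG = (
    '10.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326\n'
    '10.0.0.2 - - [10/Oct/2023:13:55:37 +0000] "POST /login HTTP/1.1" 302 0\n'
    '10.0.0.1 - - [10/Oct/2023:13:55:38 +0000] "GET /about.html HTTP/1.1" 200 512\n'
)


def parse_with_findall(text):
    """One call over the whole text; returns a list of (ip, method) tuples."""
    return LINE_RE.findall(text)


def parse_line_by_line(text):
    """Search each line separately and append every match to a list."""
    results = []
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match:
            results.append((match.group(1), match.group(2)))
    return results


print(parse_with_findall(SAMPLE_LOG))
```

Both functions return the same pairs. The `findall` version simply moves the per-line loop out of Python code and into the regex engine.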

I also read the entire file in one go:

JavaScript
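As a rough Python sketch of reading the file in one go and feeding it to “re.findall()”: the function name `count_methods_per_ip`, the pattern, and the shape of the nested dictionary (IP, then method, then count) are assumptions, since the assignment's exact structure is not shown.

```python
import re
from collections import defaultdict

# Hypothetical pattern: IP at the start of a line, HTTP method in the request
LINE_RE = re.compile(r'^(\d{1,3}(?:\.\d{1,3}){3}) .*?"([A-Z]+) ', re.MULTILINE)


def count_methods_per_ip(path):
    """Return {ip: {method: count}} for the log file at `path`."""
    # Read the whole file with a single call instead of looping over lines
    with open(path, encoding="utf-8", errors="replace") as handle:
        text = handle.read()

    counts = defaultdict(lambda: defaultdict(int))
    for ip, method in LINE_RE.findall(text):
        counts[ip][method] += 1

    # Convert to plain dicts so the result prints and compares cleanly
    return {ip: dict(methods) for ip, methods in counts.items()}
```

Keep in mind that reading everything at once holds the whole file in memory, so a ~500 MB log needs at least that much free RAM.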

This implementation processed the file in under 1 minute!

User contributions licensed under: CC BY-SA