I have a school assignment where I was tasked with writing an Apache log parser in Python. The parser extracts all the IP addresses and all the HTTP methods using regex and stores them in a nested dictionary. The code can be seen below:
from re import search

def aggregatelog(filename):
    keyvaluepairscounter = {"IP": {}, "HTTP": {}}
    with open(filename, "r") as file:
        for line in file:
            # Combines the regexes: IP (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) and HTTP method ("(\b[A-Z]+\b))
            result = search(r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s*.*"(\b[A-Z]+\b)', line).groups()
            if result[0] in set(keyvaluepairscounter["IP"].keys()):  # Using a set will lower lookup time complexity from O(n) to O(1)
                keyvaluepairscounter["IP"][result[0]] += 1
            else:
                keyvaluepairscounter["IP"][result[0]] = 1
            if result[1] in set(keyvaluepairscounter["HTTP"].keys()):
                keyvaluepairscounter["HTTP"][result[1]] += 1
            else:
                keyvaluepairscounter["HTTP"][result[1]] = 1
    return keyvaluepairscounter
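For reference, on a line in Apache common log format (a hypothetical sample, not from the assignment's logs), the pattern captures the IP address and the method:

    127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326

so a file containing only this line would return {'IP': {'127.0.0.1': 1}, 'HTTP': {'GET': 1}}.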
This code works (it gives the expected data for the log files we were given). However, when extracting data from large log files (in my case, ~500 MB) the program is VERY slow: it takes ~30 min for the script to finish. According to my teacher, a good script should be able to process the large file in under 3 minutes (wth?). My question is: is there anything I can do to speed up my script? I have already tried some things, like replacing the lists with sets, which have better lookup times.
Answer
I found my answer. Use re.findall() instead of storing the returned regex data in an array, as such:
for data in re.findall(pattern, text):
    # do things
instead of
array = re.findall(pattern, text)
for data in array:
    # do things
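Note that because the pattern has two capture groups, re.findall() returns a list of (ip, method) tuples, so both values can be unpacked directly in the loop header (a fragment; pattern, text, and the counts dictionary are assumed to match the question's setup):

for ip, method in pattern.findall(text):  # each match is an (ip, method) tuple
    counts["IP"][ip] = counts["IP"].get(ip, 0) + 1
    counts["HTTP"][method] = counts["HTTP"].get(method, 0) + 1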
I also read the entire file in one go, so the regex engine makes a single pass over the whole text instead of being invoked once per line:
with open("file", "r") as file:
    text = file.read()
This implementation processed the file in under 1 minute!
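For completeness, here is a minimal sketch of how the two changes fit together. The pattern is the question's, with its backslashes restored; collections.Counter is used for the counting as one convenient option, an assumption rather than part of the original answer:

import re
from collections import Counter

def aggregatelog(filename):
    # One findall() pass over the whole file instead of a search() call per line.
    # '.' does not match newlines, so each match still stays within one log line,
    # and lines that don't match are simply skipped instead of raising an error.
    pattern = re.compile(r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s*.*"(\b[A-Z]+\b)')
    with open(filename, "r") as file:
        text = file.read()  # read the whole file in one go
    ips, methods = Counter(), Counter()
    for ip, method in pattern.findall(text):  # list of (ip, method) tuples
        ips[ip] += 1
        methods[method] += 1
    return {"IP": dict(ips), "HTTP": dict(methods)}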