Fastest way to transform a continuous string of hex into base 10 in Python

I have 100M strings of hexadecimal digits, each 3600 characters long, that I want to split into blocks of three and then convert to base 10. Strictly speaking, I want to transform these into signed 4-byte numbers.

DEADBEEF2022 -> (Split) DEA, DBE, EF2, 022 -> (Convert) 3562, 3518, 3826, 34
-> (Signed) -534, -578, -270, 34.
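The arithmetic in this example can be checked with a short sketch. It assumes that "signed" means 12-bit two's complement (values of 0x800 and above wrap to negatives), which is an interpretation of the example and not something the question states outright. The helper name `to_signed12` is hypothetical.

```python
def to_signed12(value):
    """Reinterpret an unsigned 12-bit value (0..4095) as two's complement."""
    return value - 0x1000 if value >= 0x800 else value


sample = "DEADBEEF2022"

# Split into 3-character blocks, parse each as hex, then apply the sign rule.
chunks = [sample[i:i + 3] for i in range(0, len(sample), 3)]
unsigned = [int(chunk, 16) for chunk in chunks]
signed = [to_signed12(v) for v in unsigned]

print(chunks)    # ['DEA', 'DBE', 'EF2', '022']
print(unsigned)  # [3562, 3518, 3826, 34]
print(signed)    # [-534, -578, -270, 34]
```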

As I have 100M of these strings to process, my main aim is code efficiency/speed.

For splitting the strings I am using regex:

import re

def chunkstring(string):
    return re.findall(r'.{3}', string)
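For comparison, fixed-width blocks can also be produced by plain slicing, without going through the regex engine. The following is a minimal sketch; `chunk_by_slicing` is a hypothetical name, and whether it is actually faster than `re.findall` on this data would need to be measured. Like the regex, it drops a trailing block shorter than three characters.

```python
import re


def chunkstring(string):
    # Regex-based splitting, as in the question.
    return re.findall(r'.{3}', string)


def chunk_by_slicing(string, width=3):
    # Ignore any leftover characters that do not fill a whole block,
    # which matches what the '.{3}' pattern does.
    usable = len(string) - len(string) % width
    return [string[i:i + width] for i in range(0, usable, width)]


sample = "DEADBEEF2022"
print(chunk_by_slicing(sample))
```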

For converting from hex I am using Python's built-in int converter:

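def to_hex(a):
    # Parse a 3-digit hex string, then reinterpret it as a signed 12-bit value.
    number = int(a, 16)
    if number >= 2048:  # top bit set: negative in two's complement
        return number - 4096
    else:
        return number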

For the overall code I am combining things with a list comprehension:

import numpy as np

def overall(data):
    return np.array([to_hex(i) for i in chunkstring(data)])

I ran cProfile over a smaller sample of 3000 strings. The total time spent splitting and the total time spent converting are approximately equal, even though the conversion function is called 1200 times per string while the splitting function is called only once.

        3624005 function calls in 1.500 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     3000    0.461    0.000    0.461    0.000 {method 'findall' of 're.Pattern' objects}
  3600000    0.436    0.000    0.436    0.000 1197454951.py:1(to_hex)
     3000    0.350    0.000    0.787    0.000 2461773553.py:4(<listcomp>)
     3000    0.193    0.000    0.193    0.000 {built-in method numpy.array}
     3000    0.002    0.000    0.464    0.000 re.py:233(findall)
     3000    0.001    0.000    0.002    0.000 re.py:289(_compile)
     3000    0.001    0.000    0.001    0.000 {built-in method builtins.isinstance}
     3000    0.002    0.000    0.466    0.000 1363934133.py:5(chunkstring)
     3000    0.012    0.000    1.457    0.000 2461773553.py:1(row_to_number)
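A profile of this kind can be reproduced with the standard `cProfile` and `pstats` modules. The sketch below is not the original benchmark: it uses randomly generated rows and a much smaller sample, omits NumPy, and keeps `to_hex` as a plain `int` call, so the numbers will differ from those above.

```python
import cProfile
import io
import pstats
import random
import re


def chunkstring(string):
    return re.findall(r'.{3}', string)


def to_hex(a):
    return int(a, 16)


def row_to_number(data):
    return [to_hex(i) for i in chunkstring(data)]


# Hypothetical test data: 30 random rows of 3600 hex digits each.
rng = random.Random(0)
rows = ["".join(rng.choices("0123456789ABCDEF", k=3600)) for _ in range(30)]

profiler = cProfile.Profile()
profiler.enable()
results = [row_to_number(row) for row in rows]
profiler.disable()

# Collect the report in a string, sorted by time spent inside each function.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("tottime").print_stats(10)
report = stream.getvalue()
print(report)
```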

Are there any ways I can improve the speed of this code?

Answer

Are there any ways I can improve the speed of this code?

You might try the functools.lru_cache decorator, which stores the result for each input it has already seen and returns the stored value instead of computing it again, as follows:

import functools

@functools.lru_cache(4096)
def to_hex(a):
    # Reinterpret the parsed value as a signed 12-bit (two's complement) number.
    number = int(a, 16)
    if number >= 2048:
        return number - 4096
    else:
        return number

where 4096 is the number of all possible inputs (16^3 three-digit hex strings). Note that caching will increase memory usage.
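Because the set of possible inputs is that small, a related option is to build the whole mapping up front in a plain dict, so each conversion is a single lookup. This is a sketch, not a measured improvement. It assumes the input contains only uppercase hex digits and that "signed" means 12-bit two's complement; `SIGNED_TABLE` and `convert_row` are hypothetical names.

```python
from itertools import product

HEX_DIGITS = "0123456789ABCDEF"  # assumption: input is uppercase hex only


def to_signed12(value):
    # Values with the top bit set are negative in 12-bit two's complement.
    return value - 4096 if value >= 2048 else value


# Precompute the result for every possible 3-digit block (16**3 entries).
SIGNED_TABLE = {
    "".join(digits): to_signed12(int("".join(digits), 16))
    for digits in product(HEX_DIGITS, repeat=3)
}


def convert_row(data):
    # Slice the row into 3-character blocks and look each one up.
    usable = len(data) - len(data) % 3
    return [SIGNED_TABLE[data[i:i + 3]] for i in range(0, usable, 3)]


print(convert_row("DEADBEEF2022"))  # [-534, -578, -270, 34]
```

Lowercase or non-hex input would raise a `KeyError` here, which may or may not be the desired behavior for malformed rows.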

User contributions licensed under: CC BY-SA