I have 100M strings of hexadecimal digits, each 3600 characters long, that I want to split into blocks of three characters and then convert to base 10. Strictly speaking, I want to transform these into signed 4 byte numbers.
DEADBEEF2022 -> (Split) DEA, DBE, EF2, 022 -> (Convert) 3562, 3518, 3826, 034 -> (Signed) 534, 578, 270, 034.
As I have 100M of these strings to process, my main aim is code efficiency/speed.
For splitting the strings I am using regex:
import re

def chunkstring(string):
    # Split the string into consecutive 3-character blocks.
    return re.findall('.{3}', string)
For converting from hex I am using Python's built-in int converter:
def to_hex(a):
    number = int(a, 16)
    if number > 2048:
        return number-2048
    else:
        return number
For the overall code I am combining things with a list comprehension:
import numpy as np

def overall(data):
    return np.array([to_hex(i) for i in chunkstring(data)])
I have run cProfile over a smaller sample of 3000 strings, and the total times for splitting and converting are approximately equal, even though converting happens far more often (1200 times per string) than splitting (once per string).
3624005 function calls in 1.500 seconds

Ordered by: standard name

 ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
   3000    0.461    0.000    0.461    0.000  {method 'findall' of 're.Pattern' objects}
3600000    0.436    0.000    0.436    0.000  1197454951.py:1(to_hex)
   3000    0.350    0.000    0.787    0.000  2461773553.py:4(<listcomp>)
   3000    0.193    0.000    0.193    0.000  {built-in method numpy.array}
   3000    0.002    0.000    0.464    0.000  re.py:233(findall)
   3000    0.001    0.000    0.002    0.000  re.py:289(_compile)
   3000    0.001    0.000    0.001    0.000  {built-in method builtins.isinstance}
   3000    0.002    0.000    0.466    0.000  1363934133.py:5(chunkstring)
   3000    0.012    0.000    1.457    0.000  2461773553.py:1(row_to_number)
Are there any ways I can improve the speed of this code?
Answer
Are there any ways I can improve the speed of this code?
You might try the functools.lru_cache decorator, which stores the result for each input it has already seen and returns the stored value instead of computing it again, as follows:
import functools

@functools.lru_cache(4096)
def to_hex(a):
    number = int(a, 16)
    if number > 2048:
        return number-2048
    else:
        return number
where 4096 is the number of all possible inputs (16^3 three-character blocks of hex digits, assuming a consistent letter case). Note that the cache will increase memory usage.
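To check whether the cache is actually being hit, cache_info() can be inspected after a run. The following is only a minimal sketch: the sample input is hypothetical (the example string from the question, repeated 100 times), and the chunks are produced with plain slicing rather than the regex from the question, purely to keep the example short.

```python
import functools


@functools.lru_cache(maxsize=4096)
def to_hex(a):
    # Same conversion as in the question, wrapped in a cache.
    number = int(a, 16)
    if number > 2048:
        return number - 2048
    return number


# Hypothetical sample input: the question's example string repeated 100 times.
sample = "DEADBEEF2022" * 100

# Plain slicing into 3-character blocks (an assumption made for brevity;
# the question uses re.findall for this step).
chunks = [sample[i:i + 3] for i in range(0, len(sample), 3)]

values = [to_hex(c) for c in chunks]

# hits = calls answered from the cache, misses = calls that ran int(a, 16).
info = to_hex.cache_info()
print(info.hits, info.misses, info.currsize)
```

With only four distinct blocks in this sample, nearly every call is a cache hit. On real data the hit rate depends on how many of the 4096 possible blocks occur, so it is worth timing both versions before committing to the cached one.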