
Faster bit-level data packing

A 256×64 pixel OLED display connected to a Raspberry Pi (Zero W) has 4-bit greyscale pixel data packed two pixels per byte, so 8192 bytes in total. E.g. the bytes


become

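As a minimal illustration of the packing described above (a sketch, not the original listing): it assumes the input is one 8-bit greyscale value per pixel, that only the upper 4 bits of each value are kept, and that the first pixel of each pair goes into the high nibble. The name `pack_nibbles` is made up for this example.

```python
def pack_nibbles(pixels):
    """Pack 8-bit greyscale values into bytes holding two 4-bit pixels each."""
    if len(pixels) % 2:
        raise ValueError("need an even number of pixels")
    packed = bytearray()
    for i in range(0, len(pixels), 2):
        high = pixels[i] >> 4       # upper 4 bits of the first pixel
        low = pixels[i + 1] >> 4    # upper 4 bits of the second pixel
        packed.append((high << 4) | low)
    return bytes(packed)


# Four 8-bit pixels become two packed bytes.
print(pack_nibbles(bytes([0x00, 0xFF, 0x80, 0x10])).hex())  # 0f81
```

A full 256×64 frame (16384 pixels) comes out as the 8192 bytes mentioned above.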

Converting these bytes, obtained either from a Pillow (PIL) Image or from a cairo ImageSurface, takes up to 0.9 s when naively iterating over the pixel data, depending on color depth.

Combining every two bytes from a Pillow “L” (8-bit greyscale) Image:

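A naive loop of this kind might look roughly like the following sketch. It is an assumption about the general shape, not the original code: `data` is a stand-in for the pixel bytes (e.g. what Pillow's `Image.tobytes()` returns for an "L" image), and the loop keeps a `pending` state variable and goes through a float division.

```python
# Stand-in for the 8-bit pixel data of a 256x64 "L" image (assumption).
data = bytes(range(256)) * 64

packed = bytearray()
pending = None  # upper nibble waiting for its partner pixel
for value in data:
    nibble = int(value / 16)  # float division followed by truncation
    if pending is None:
        pending = nibble      # remember the first pixel of the pair
    else:
        packed.append(pending * 16 + nibble)
        pending = None

print(len(packed))  # 8192
```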

This (omitting state and saving float/integer conversion) brings it down to about half (0.2 s):

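One way to drop the state variable and the float/integer round trip, sketched under the same assumptions as before (stand-in `data`, first pixel in the high nibble), is to step through the input two bytes at a time and use only integer bit operations:

```python
# Stand-in for the 8-bit pixel data of a 256x64 "L" image (assumption).
data = bytes(range(256)) * 64

packed = bytearray()
for i in range(0, len(data), 2):
    # Upper nibble of the first pixel stays in place,
    # upper nibble of the second pixel moves into the low half.
    packed.append((data[i] & 0xF0) | (data[i + 1] >> 4))

print(len(packed))  # 8192
```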

Basically the first approach applied to an RGB24 (32 bits per pixel!) cairo ImageSurface, though with a crude greyscale conversion:

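For the cairo case, a rough sketch could look like this. Several things here are assumptions: `data` stands in for the buffer of a 256×64 `FORMAT_RGB24` surface, each pixel occupies 4 bytes in B, G, R, unused order (little-endian machine), the row stride equals `width * 4`, and the "crude" greyscale is simply the average of the three channels.

```python
# Stand-in for the pixel buffer of a 256x64 RGB24 surface (assumption):
# every pixel is B=0x20, G=0x80, R=0xE0 plus one unused byte.
width, height = 256, 64
data = bytes([0x20, 0x80, 0xE0, 0x00]) * (width * height)

packed = bytearray()
for i in range(0, len(data), 8):  # 8 bytes = two 32-bit pixels
    # Crude greyscale: plain average of the three colour channels.
    first = (data[i] + data[i + 1] + data[i + 2]) // 3
    second = (data[i + 4] + data[i + 5] + data[i + 6]) // 3
    packed.append((first & 0xF0) | (second >> 4))

print(len(packed))  # 8192
```

The loop touches four times as many input bytes per output byte as the "L" version, which fits it being the slower of the two.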

takes about twice as long (0.9 s vs 0.4 s).

The struct module does not support nibbles (half-bytes).

bitstring does allow packing nibbles:


But there does not seem to be a method to unpack this into a list of integers quickly; this takes 30 seconds:


Does Python 3 have the means to do this efficiently, or do I need an alternative? An external packer wrapped with ctypes? The same, but simpler, with Cython (I have not yet looked into these)? Cython looks very good, see my answer.


Answer

Down to 130 ms from 200 ms by just wrapping the loop in a function

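The gain from a function is plausible because CPython resolves local variables faster than module-level globals. A sketch of the wrapped loop, with a hypothetical function name `pack` and stand-in input data:

```python
def pack(data):
    """Pack pairs of 8-bit pixels into bytes of two 4-bit pixels."""
    packed = bytearray()
    for i in range(0, len(data), 2):
        # All names used in the loop are now function locals.
        packed.append((data[i] & 0xF0) | (data[i + 1] >> 4))
    return packed


# Stand-in for the 8-bit pixel data of a 256x64 image (assumption).
data = bytes(range(256)) * 64
packed = pack(data)
print(len(packed))  # 8192
```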

Down to 35 ms by Cythonizing the same code


Down to 16 ms with type declarations


Not much of a difference with a “simplified” loop


Maybe a tiny bit faster even (15 ms)


Here it is with timeit

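A timing harness along these lines could be used to reproduce such measurements. This is only a sketch: `pack` is a hypothetical pure-Python stand-in for whichever variant is being measured, and the numbers will depend on the machine.

```python
import timeit


def pack(data):
    """Hypothetical function under test: two 4-bit pixels per output byte."""
    packed = bytearray()
    for i in range(0, len(data), 2):
        packed.append((data[i] & 0xF0) | (data[i + 1] >> 4))
    return packed


data = bytes(range(256)) * 64  # stand-in for one 256x64 frame
runs = 10
seconds = timeit.timeit(lambda: pack(data), number=runs)
print(f"{seconds / runs * 1000:.1f} ms per call")
```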

This already meets my requirements, but I guess there may be further optimization possible with the input/output iterables (-> unsigned int array?) or accessing the input data with a wider data type (Raspbian is 32 bit, BCM2835 is ARM1176JZF-S single-core).

Or with parallelism on the GPU or the multi-core Raspberry Pis.
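As one illustration of handling the data in bulk rather than byte by byte, here is a standard-library-only sketch. It is an assumption about what such an optimization could look like, not something measured on the Pi: two lookup tables replace the per-pixel mask and shift, and the two halves are OR-ed together as one large integer. `pack_bulk`, `KEEP_HIGH` and `MOVE_DOWN` are made-up names, and the input is assumed to have an even length.

```python
# Lookup tables built once: keep the upper nibble in place / move it down.
KEEP_HIGH = bytes(v & 0xF0 for v in range(256))
MOVE_DOWN = bytes(v >> 4 for v in range(256))


def pack_bulk(data):
    """Pack an even-length bytes object into two 4-bit pixels per byte."""
    high = data[0::2].translate(KEEP_HIGH)  # first pixel of each pair
    low = data[1::2].translate(MOVE_DOWN)   # second pixel of each pair
    # OR both halves in a single big-integer operation.
    combined = int.from_bytes(high, "big") | int.from_bytes(low, "big")
    return combined.to_bytes(len(high), "big")


data = bytes(range(256)) * 64  # stand-in for one 256x64 frame
packed = pack_bulk(data)
print(len(packed))  # 8192
```

No Python-level loop runs per pixel here; whether that actually beats the Cython version would need to be timed.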


A crude comparison with the same loop in C (ideone):


It’s apparently 100 times faster:


Eliminating the shifts/division may be another slight optimization (I have not checked the resulting C, nor the binary):


results in

User contributions licensed under: CC BY-SA