A 256×64 pixel OLED display connected to a Raspberry Pi (Zero W) expects 4-bit greyscale pixel data packed two pixels per byte, i.e. 8192 bytes in total. E.g. the bytes
0a 0b 0c 0d (only lower nibble has data)
become
ab cd
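The packing rule can be sketched in a few lines of plain Python (the pixel values are made up for illustration):

```python
# Two 4-bit greyscale values (low nibbles of two input bytes) become one byte:
# the first pixel goes into the high nibble, the second into the low nibble.
pixels = [0x0a, 0x0b, 0x0c, 0x0d]  # one pixel per byte, value in the low nibble
packed = bytes(
    (pixels[i] & 0x0F) << 4 | (pixels[i + 1] & 0x0F)
    for i in range(0, len(pixels), 2)
)
assert packed == b'\xab\xcd'
```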
Converting these bytes, obtained either from a Pillow (PIL) Image or from a cairo ImageSurface, takes up to 0.9 s when naively iterating over the pixel data, depending on the color depth.
Combining every two bytes from a Pillow “L” (8-bit greyscale) Image:
```
imd = im.tobytes()
nibbles = [int(p / 16) for p in imd]
packed = []
msn = None
for n in nibbles:
    nib = n & 0x0F
    if msn is not None:
        b = msn << 4 | nib
        packed.append(b)
        msn = None
    else:
        msn = nib
```
This (dropping the state handling and avoiding the float/integer conversion) brings it down to about half (0.2 s):
```
packed = []
for b in range(0, 256*64, 2):
    packed.append( (imd[b]//16)<<4 | (imd[b+1]//16) )
```
Basically the first approach applied to an RGB24 (actually 32 bits per pixel!) cairo ImageSurface, though with a crude greyscale conversion:
```
mv = surface.get_data()
w = surface.get_width()
h = surface.get_height()
f = surface.get_format()
s = surface.get_stride()
print(len(mv), w, h, f, s)

# convert xRGB
o = []
msn = None
for p in range(0, len(mv), 4):
    nib = int( (mv[p+1] + mv[p+2] + mv[p+3]) / 3 / 16 ) & 0x0F
    if msn is not None:
        b = msn << 4 | nib
        o.append(b)
        msn = None
    else:
        msn = nib
```
takes about twice as long (0.9 s vs 0.4 s).
The struct module does not support nibbles (half-bytes).
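A quick illustration: struct’s smallest format unit is a full byte (`'B'`, unsigned char), so two nibbles have to be combined with bit arithmetic before packing anyway:

```python
import struct

# struct has no half-byte format code; 'B' (unsigned char) is the smallest
# unit, so the two nibbles must be merged into one integer first.
hi, lo = 0xA, 0xB
packed = struct.pack('B', hi << 4 | lo)
assert packed == b'\xab'
```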
bitstring does allow packing nibbles:
```
>>> a = bitstring.BitStream()
>>> a.insert('0xf')
>>> a.insert('0x1')
>>> a
BitStream('0xf1')
>>> a.insert(5)
>>> a
BitStream('0b1111000100000')
>>> a.insert('0x2')
>>> a
BitStream('0b11110001000000010')
>>>
```
But there does not seem to be a method to unpack this into a list of integers quickly — this takes 30 seconds!:
```
a = bitstring.BitStream()
for p in imd:
    a.append( bitstring.Bits(uint=p//16, length=4) )
packed = []
a.pos = 0
for p in range(256*64//2):
    packed.append( a.read(8).uint )
```
Does Python 3 have the means to do this efficiently or do I need an alternative?
An external packer wrapped with ctypes? The same, but simpler, with Cython? (I have not yet looked into these.) Update: Cython looks very good, see my answer.
Answer
Down to 130 ms from 200 ms by just wrapping the loop in a function
```
def packer0(imd):
    """same loop in a def"""
    packed = []
    for b in range(0, 256*64, 2):
        packed.append( (imd[b]//16)<<4 | (imd[b+1]//16) )
    return packed
```
Down to 35 ms by Cythonizing the same code
```
def packer1(imd):
    """Cythonize python nibble packing loop"""
    packed = []
    for b in range(0, 256*64, 2):
        packed.append( (imd[b]//16)<<4 | (imd[b+1]//16) )
    return packed
```
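For reference, a minimal build script for such a module (assuming the code lives in a file called pack.pyx; the filename and this setup are my sketch, not part of the original measurements):

```python
# setup.py -- minimal Cython build script (assumed module file: pack.pyx)
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("pack.pyx"))
```

Build with `python3 setup.py build_ext --inplace`, then `import pack` as usual.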
Down to 16 ms by typing the loop variable
```
def packer2(imd):
    """Cythonize python nibble packing loop, typed"""
    packed = []
    cdef unsigned int b
    for b in range(0, 256*64, 2):
        packed.append( (imd[b]//16)<<4 | (imd[b+1]//16) )
    return packed
```
Not much of a difference with a “simplified” loop
```
def packer3(imd):
    """Cythonize python nibble packing loop, typed"""
    packed = []
    cdef unsigned int i
    for i in range(256*64//2):
        packed.append( (imd[i*2]//16)<<4 | (imd[i*2+1]//16) )
    return packed
```
Maybe a tiny bit faster even (15 ms)
```
def packer4(it):
    """Cythonize python nibble packing loop, typed"""
    cdef unsigned int n = len(it)//2
    cdef unsigned int i
    return [ (it[i*2]//16)<<4 | it[i*2+1]//16 for i in range(n) ]
```
Here are the timings with timeit:
```
>>> timeit.timeit('packer4(data)', setup='from pack import packer4; data = [0]*256*64', number=100)
1.31725951000044
>>> exit()
pi@raspberrypi:~ $ python3 -m timeit -s 'from pack import packer4; data = [0]*256*64' 'packer4(data)'
100 loops, best of 3: 9.04 msec per loop
```
This already meets my requirements, but I guess there may be further optimization possible with the input/output iterables (-> unsigned int array?) or accessing the input data with a wider data type (Raspbian is 32 bit, BCM2835 is ARM1176JZF-S single-core).
Or with parallelism on the GPU or the multi-core Raspberry Pis.
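If NumPy is available on the Pi, the same arithmetic can also be vectorized without Cython; this is a sketch I have not benchmarked on the Pi itself:

```python
import numpy as np

# Vectorized nibble packing: reduce each 8-bit pixel to its high nibble,
# then combine even/odd pixels into one byte per pair.
imd = bytes([0xA0, 0xB0, 0xC0, 0xD0] * (256 * 64 // 4))  # dummy 8-bit pixels
a = np.frombuffer(imd, dtype=np.uint8) >> 4              # 0..15 per pixel
packed = (a[0::2] << 4 | a[1::2]).astype(np.uint8).tobytes()
assert packed[:2] == b'\xab\xcd' and len(packed) == 256 * 64 // 2
```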
A crude comparison with the same loop in C (ideone):
```
#include <stdio.h>
#include <stdint.h>

#define SIZE (256*64)

int main(void)
{
    uint8_t in[SIZE] = {0};
    uint8_t out[SIZE/2] = {0};
    uint8_t t;
    for (t = 0; t < 100; t++) {
        uint16_t i;
        for (i = 0; i < SIZE/2; i++) {
            out[i] = (in[i*2]/16)<<4 | in[i*2+1]/16;
        }
    }
    return 0;
}
```
It’s apparently 100 times faster:
```
pi@raspberry:~ $ gcc p.c
pi@raspberry:~ $ time ./a.out

real    0m0.085s
user    0m0.060s
sys     0m0.010s
```
Eliminating the shifts/divisions may be another slight optimization (I have not checked the resulting C, nor the binary):
```
def packs(bytes it):
    """Cythonize python nibble packing loop, typed"""
    cdef unsigned int n = len(it)//2
    cdef unsigned int i
    return [ ( (it[i<<1]&0xF0) | (it[(i<<1)+1]>>4) ) for i in range(n) ]
```
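The shift/mask variant computes byte-for-byte the same result as the division variant, since `b//16 == b>>4` and `(b//16)<<4 == b & 0xF0` for unsigned bytes; a quick exhaustive check in plain Python:

```python
# Exhaustive check: the divide version and the shift/mask version agree
# for every possible pair of input bytes (0..255 each).
for b in range(256):
    for c in range(256):
        assert (b // 16) << 4 | (c // 16) == (b & 0xF0) | (c >> 4)
```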
results in
```
python3 -m timeit -s 'from pack import pack; data = bytes([0]*256*64)' 'pack(data)'
100 loops, best of 3: 12.7 msec per loop
python3 -m timeit -s 'from pack import packs; data = bytes([0]*256*64)' 'packs(data)'
100 loops, best of 3: 12 msec per loop
python3 -m timeit -s 'from pack import packs; data = bytes([0]*256*64)' 'packs(data)'
100 loops, best of 3: 11 msec per loop
python3 -m timeit -s 'from pack import pack; data = bytes([0]*256*64)' 'pack(data)'
100 loops, best of 3: 13.9 msec per loop
```