A 256×64 pixel OLED display connected to a Raspberry Pi (Zero W) expects 4-bit greyscale pixel data packed two pixels per byte, i.e. 8192 bytes in total. E.g. the bytes
0a 0b 0c 0d (only lower nibble has data)
become
ab cd
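The packing rule can be sketched in a few lines of plain Python (the pixel values are made up for illustration):

```python
# Two 4-bit greyscale values (low nibbles of two input bytes) become one byte:
# the first pixel goes into the high nibble, the second into the low nibble.
pixels = [0x0a, 0x0b, 0x0c, 0x0d]  # one pixel per byte, value in the low nibble
packed = bytes(
    (pixels[i] & 0x0F) << 4 | (pixels[i + 1] & 0x0F)
    for i in range(0, len(pixels), 2)
)
assert packed == b'\xab\xcd'
```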
Converting these bytes, obtained either from a Pillow (PIL) Image or from a cairo ImageSurface, takes up to 0.9 s when naively iterating over the pixel data, depending on the color depth.
Combining every two bytes from a Pillow “L” (8-bit greyscale) Image:
```
imd = im.tobytes()
nibbles = [int(p / 16) for p in imd]
packed = []
msn = None
for n in nibbles:
    nib = n & 0x0F
    if msn is not None:
        b = msn << 4 | nib
        packed.append(b)
        msn = None
    else:
        msn = nib
```
This (dropping the state handling and avoiding the float/integer conversion) brings it down to about half (0.2 s):
```
packed = []
for b in range(0, 256*64, 2):
    packed.append( (imd[b]//16)<<4 | (imd[b+1]//16) )
```
Basically the first approach applied to an RGB24 (actually 32 bits per pixel!) cairo ImageSurface, though with a crude greyscale conversion:
```
mv = surface.get_data()
w = surface.get_width()
h = surface.get_height()
f = surface.get_format()
s = surface.get_stride()
print(len(mv), w, h, f, s)

# convert xRGB
o = []
msn = None
for p in range(0, len(mv), 4):
    nib = int( (mv[p+1] + mv[p+2] + mv[p+3]) / 3 / 16 ) & 0x0F
    if msn is not None:
        b = msn << 4 | nib
        o.append(b)
        msn = None
    else:
        msn = nib
```
takes about twice as long (0.9 s vs 0.4 s).
The struct module does not support nibbles (half-bytes).
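A quick illustration: struct’s smallest format unit is a full byte (`'B'`, unsigned char), so two nibbles have to be combined with bit arithmetic before packing anyway:

```python
import struct

# struct has no half-byte format code; 'B' (unsigned char) is the smallest
# unit, so the two nibbles must be merged into one integer first.
hi, lo = 0xA, 0xB
packed = struct.pack('B', hi << 4 | lo)
assert packed == b'\xab'
```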
bitstring does allow packing nibbles:
```
>>> a = bitstring.BitStream()
>>> a.insert('0xf')
>>> a.insert('0x1')
>>> a
BitStream('0xf1')
>>> a.insert(5)
>>> a
BitStream('0b1111000100000')
>>> a.insert('0x2')
>>> a
BitStream('0b11110001000000010')
>>>
```
But there does not seem to be a method to unpack this into a list of integers quickly — this takes 30 seconds!:
```
a = bitstring.BitStream()
for p in imd:
    a.append( bitstring.Bits(uint=p//16, length=4) )
packed = []
a.pos = 0
for p in range(256*64//2):
    packed.append( a.read(8).uint )
```
Does Python 3 have the means to do this efficiently or do I need an alternative?
An external packer wrapped with ctypes? The same, but simpler, with Cython? (I have not yet looked into these.) Update: Cython looks very good, see my answer.
Answer
Down to 130 ms from 200 ms by just wrapping the loop in a function
```
def packer0(imd):
    """same loop in a def"""
    packed = []
    for b in range(0, 256*64, 2):
        packed.append( (imd[b]//16)<<4 | (imd[b+1]//16) )
    return packed
```
Down to 35 ms by Cythonizing the same code
```
def packer1(imd):
    """Cythonize python nibble packing loop"""
    packed = []
    for b in range(0, 256*64, 2):
        packed.append( (imd[b]//16)<<4 | (imd[b+1]//16) )
    return packed
```
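For reference, a minimal build script for such a module (assuming the code lives in a file called pack.pyx; the filename and this setup are my sketch, not part of the original measurements):

```python
# setup.py -- minimal Cython build script (assumed module file: pack.pyx)
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("pack.pyx"))
```

Build with `python3 setup.py build_ext --inplace`, then `import pack` as usual.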
Down to 16 ms by typing the loop variable
```
def packer2(imd):
    """Cythonize python nibble packing loop, typed"""
    packed = []
    cdef unsigned int b
    for b in range(0, 256*64, 2):
        packed.append( (imd[b]//16)<<4 | (imd[b+1]//16) )
    return packed
```
Not much of a difference with a “simplified” loop
```
def packer3(imd):
    """Cythonize python nibble packing loop, typed"""
    packed = []
    cdef unsigned int i
    for i in range(256*64//2):
        packed.append( (imd[i*2]//16)<<4 | (imd[i*2+1]//16) )
    return packed
```
Maybe a tiny bit faster even (15 ms)
```
def packer4(it):
    """Cythonize python nibble packing loop, typed"""
    cdef unsigned int n = len(it)//2
    cdef unsigned int i
    return [ (it[i*2]//16)<<4 | it[i*2+1]//16 for i in range(n) ]
```
Here are the timings with timeit:
```
>>> timeit.timeit('packer4(data)', setup='from pack import packer4; data = [0]*256*64', number=100)
1.31725951000044
>>> exit()
pi@raspberrypi:~ $ python3 -m timeit -s 'from pack import packer4; data = [0]*256*64' 'packer4(data)'
100 loops, best of 3: 9.04 msec per loop
```
This already meets my requirements, but I guess there may be further optimization possible with the input/output iterables (-> unsigned int array?) or accessing the input data with a wider data type (Raspbian is 32 bit, BCM2835 is ARM1176JZF-S single-core).
Or with parallelism on the GPU or the multi-core Raspberry Pis.
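If NumPy is available on the Pi, the same arithmetic can also be vectorized without Cython; this is a sketch I have not benchmarked on the Pi itself:

```python
import numpy as np

# Vectorized nibble packing: reduce each 8-bit pixel to its high nibble,
# then combine even/odd pixels into one byte per pair.
imd = bytes([0xA0, 0xB0, 0xC0, 0xD0] * (256 * 64 // 4))  # dummy 8-bit pixels
a = np.frombuffer(imd, dtype=np.uint8) >> 4              # 0..15 per pixel
packed = (a[0::2] << 4 | a[1::2]).astype(np.uint8).tobytes()
assert packed[:2] == b'\xab\xcd' and len(packed) == 256 * 64 // 2
```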
A crude comparison with the same loop in C (ideone):
```
#include <stdio.h>
#include <stdint.h>

#define SIZE (256*64)

int main(void)
{
    uint8_t in[SIZE] = {0};
    uint8_t out[SIZE/2] = {0};
    uint8_t t;
    for (t = 0; t < 100; t++) {
        uint16_t i;
        for (i = 0; i < SIZE/2; i++) {
            out[i] = (in[i*2]/16)<<4 | in[i*2+1]/16;
        }
    }
    return 0;
}
```
It’s apparently 100 times faster:
```
pi@raspberry:~ $ gcc p.c
pi@raspberry:~ $ time ./a.out

real    0m0.085s
user    0m0.060s
sys     0m0.010s
```
Eliminating the shifts/divisions may be another slight optimization (I have not checked the resulting C, nor the binary):
```
def packs(bytes it):
    """Cythonize python nibble packing loop, typed"""
    cdef unsigned int n = len(it)//2
    cdef unsigned int i
    return [ ( (it[i<<1]&0xF0) | (it[(i<<1)+1]>>4) ) for i in range(n) ]
```
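The shift/mask variant computes byte-for-byte the same result as the division variant, since `b//16 == b>>4` and `(b//16)<<4 == b & 0xF0` for unsigned bytes; a quick exhaustive check in plain Python:

```python
# Exhaustive check: the divide version and the shift/mask version agree
# for every possible pair of input bytes (0..255 each).
for b in range(256):
    for c in range(256):
        assert (b // 16) << 4 | (c // 16) == (b & 0xF0) | (c >> 4)
```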
results in
```
python3 -m timeit -s 'from pack import pack; data = bytes([0]*256*64)' 'pack(data)'
100 loops, best of 3: 12.7 msec per loop
python3 -m timeit -s 'from pack import packs; data = bytes([0]*256*64)' 'packs(data)'
100 loops, best of 3: 12 msec per loop
python3 -m timeit -s 'from pack import packs; data = bytes([0]*256*64)' 'packs(data)'
100 loops, best of 3: 11 msec per loop
python3 -m timeit -s 'from pack import pack; data = bytes([0]*256*64)' 'pack(data)'
100 loops, best of 3: 13.9 msec per loop
```