I’m trying to debug my tflite model, which uses custom ops. I’ve found the correspondence between op names (in the *.pb) and op ids (in the *.tflite), and I’m doing a layer-by-layer comparison to make sure the output differences always stay within 1e-4 (since the difference blows up at the end, I want to find the exact place where my custom layer fails). I do this as follows:
Method 1: I use get_tensor to get the output as follows:
from tensorflow.contrib.lite.python import interpreter

# load the model
model = interpreter.Interpreter(model_path='model.tflite')
model.allocate_tensors()

# get tensors
for i in tensor_ids:
    tensor_output[i] = model.get_tensor(i)
It shows totally inadequate random values (compared to the outputs of the TensorFlow model).
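For context, the surrounding driver code that produces these values looks roughly like the sketch below. tensor_ids and tensor_output are not defined in the snippet above, so this is a hedged reconstruction: it assumes the interpreter exposes get_tensor_details() (the standalone tf.lite.Interpreter does; I am not certain the contrib build shipped with 1.9 has it).

import numpy as np
from tensorflow.contrib.lite.python import interpreter

model = interpreter.Interpreter(model_path='model.tflite')
model.allocate_tensors()

# feed one test input and run the whole graph once
input_details = model.get_input_details()
model.set_tensor(input_details[0]['index'],
                 np.zeros(input_details[0]['shape'],
                          dtype=input_details[0]['dtype']))
model.invoke()

# collect the indices of every tensor in the graph
# (get_tensor_details() is an assumption for the 1.9 contrib build)
tensor_ids = [t['index'] for t in model.get_tensor_details()]

# read back each tensor after invoke()
tensor_output = {i: model.get_tensor(i) for i in tensor_ids}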
Method 2: convert the *.pb only up to a certain layer, then repeat (a sketch of one iteration is shown after this list). Basically:

1. Create a *.pb so that it contains the network only from input up to layer_1.
2. Convert it to tflite (so the output is now layer_1) and check the outputs of TF-Lite against TensorFlow.
3. Repeat steps 1-2 for layer_2, layer_3, …, outputs.
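Here is a sketch of what one iteration of that procedure can look like, assuming a frozen graph model.pb, an input tensor named input with a made-up shape of (1, 224, 224, 3), and the TF 1.x converter API tf.lite.TFLiteConverter.from_frozen_graph (in 1.9 the converter still lives under tf.contrib.lite, so the imports need to be adjusted):

import numpy as np
import tensorflow as tf

GRAPH = 'model.pb'
INPUT = 'input'      # name of the input tensor in the graph
LAYER = 'layer_1'    # layer we truncate the network at
x = np.random.rand(1, 224, 224, 3).astype(np.float32)   # assumed input shape

# --- TensorFlow reference: run the original graph up to LAYER ---
graph_def = tf.GraphDef()
with open(GRAPH, 'rb') as f:
    graph_def.ParseFromString(f.read())
with tf.Graph().as_default() as g:
    tf.import_graph_def(graph_def, name='')
    with tf.Session(graph=g) as sess:
        tf_out = sess.run(LAYER + ':0', feed_dict={INPUT + ':0': x})

# --- TF-Lite: convert the same graph with LAYER as its only output ---
converter = tf.lite.TFLiteConverter.from_frozen_graph(GRAPH, [INPUT], [LAYER])
tflite_model = converter.convert()

interp = tf.lite.Interpreter(model_content=tflite_model)
interp.allocate_tensors()
interp.set_tensor(interp.get_input_details()[0]['index'], x)
interp.invoke()
lite_out = interp.get_tensor(interp.get_output_details()[0]['index'])

print('max abs diff at', LAYER, ':', np.max(np.abs(tf_out - lite_out)))
# expected to stay below 1e-4 until the custom op is reached

Repeating this with LAYER set to layer_2, layer_3, … is exactly the loop described above; once the truncation point passes a custom op, the converter additionally needs allow_custom_ops set and the interpreter needs the custom op registered.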
This method requires much more work and many more runs, but it correctly shows that for built-in operations the outputs of the tflite and pb models are identical, and they only start to differ at my custom ops (while in Method 1 the outputs diverge right away, from the first layers).
Question: Why is the behaviour of get_tensor so strange? Maybe it is because I am using tensorflow 1.9 (when TF-Lite was not yet released and was available only as a developer preview)?
PS: I am aware of the release of TF-Lite, but I’ve manually compiled TensorFlow 1.9 for my project and it is now hard to change versions.
Answer
I had the same problem a few months ago. The thing is, TF-Lite is completely different from TensorFlow – it uses static memory and execution plans, memory-maps the model file for faster loading, and it is supposed to optimize everything possible in the network’s forward-propagation pipeline.
I’m not a TF-Lite developer, but I suppose it keeps its memory footprint extremely low by re-using the memory areas that were used by previously computed ops. Let’s look at the idea in the following illustration:
Step 1: First, we feed the inputs into the symbolic tensor I (shown in parentheses). Let’s say its value is stored in a buffer called buffer_1.
     op1     op2     op3
(I) ----> A ----> B ----> O
___________________________
^^^      ^^^^^^^^^^^^     ^^^
input    intermediate  output
tensor     tensors     tensor
Step 2: Now, we need to compute op1 on symbolic tensor I to obtain symbolic tensor A. We compute on buffer_1 and store the value of symbolic tensor A in a buffer called buffer_2.
    [op1]     op2     op3
(I) ----> (A) ----> B ----> O
Step 3: Now, we compute op2 on symbolic tensor A to obtain symbolic tensor B. We compute on buffer_2 and store the value of symbolic tensor B in a buffer called buffer_3…
     op1     [op2]     op3
 I  ----> (A) ----> (B) ----> O
But wait! Why waste memory storing B in buffer_3, when buffer_1 is now unused and its value is useless for getting the output O? So, instead of storing in buffer_3, we will actually store the result of this operation in buffer_1!
That’s the basic idea of efficient memory re-use, which I think is implemented in TF-Lite, given the built-in static graph analyzer in toco and other machinery. And that’s why you can’t simply apply get_tensor to non-output tensors.
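To make that concrete, here is a toy simulation of such a two-buffer execution plan in plain Python. This is purely illustrative and not actual TF-Lite code; it only shows how the chain I -> A -> B -> O can run with two buffers, after which the intermediate values are gone.

# toy two-buffer plan for the chain I -> A -> B -> O (illustration only)
buffers = {'buffer_1': None, 'buffer_2': None}
plan = [
    # (op, reads from, writes to)
    ('op1', 'buffer_1', 'buffer_2'),   # A goes into a fresh buffer
    ('op2', 'buffer_2', 'buffer_1'),   # B overwrites I, which is no longer needed
    ('op3', 'buffer_1', 'buffer_2'),   # O overwrites A, which is no longer needed
]

buffers['buffer_1'] = 1.0              # feed the input I
for op, src, dst in plan:
    buffers[dst] = buffers[src] + 1.0  # stand-in for the real computation
    print(op, 'wrote into', dst, buffers)

# At the end buffer_2 holds O and buffer_1 holds B; the values of I and A
# no longer exist anywhere, so reading their tensors after invoking the
# interpreter can only return whatever stale data sits in the re-used buffers.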
An easier way to debug?
You’ve mentioned that you’re writing a custom op, so I suppose you’ve built tflite with bazel, right? Then you can actually inject some logging code into Interpreter::Invoke() in the file tensorflow/lite/interpreter.cc. It’s an ugly hack, but it works.
PS: I would be glad if any TensorFlow Lite developers come across this and comment on it :)