I’m trying to debug my tflite model, which uses custom ops. I’ve found the correspondence between op names (in *.pb) and op ids (in *.tflite), and I’m doing a layer-by-layer comparison to make sure the output differences always stay within 1e-4 (since the error blows up at the end, I want to find the exact place where my custom layer fails), as follows:
Method 1: I use get_tensor to get the output as follows:
from tensorflow.contrib.lite.python import interpreter
# load the model
model = interpreter.Interpreter(model_path='model.tflite')
model.allocate_tensors()
# ... feed the input and call model.invoke() here ...
# get tensors (tensor_ids are the op ids I mapped from the *.pb op names)
tensor_output = {}
for i in tensor_ids:
    tensor_output[i] = model.get_tensor(i)
It shows totally inadequate, seemingly random values (compared to the outputs of the TensorFlow model).
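For reference, the per-tensor comparison itself is roughly the following (a minimal sketch; tf_outputs is a hypothetical dict holding the corresponding outputs computed by running the original *.pb model):
import numpy as np
# compare each TF-Lite tensor against the reference from the *.pb model
for i in tensor_ids:
    max_diff = np.max(np.abs(tensor_output[i] - tf_outputs[i]))
    if max_diff > 1e-4:
        print('tensor %d differs from TensorFlow by %g' % (i, max_diff))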
Method 2: convert the *.pb only up to a certain layer, then repeat, basically:
1. Create a *.pb so that it contains the network only from input up to layer_1.
2. Convert to tflite (so the output is now layer_1) and check the outputs of TF-Lite against TensorFlow.
3. Repeat steps 1-2 for layer_2, layer_3, … outputs.
This method requires much more work and many more conversions, but it correctly shows that for built-in operations the outputs of the tflite and pb models are identical, and the difference only appears in my custom ops (whereas with Method 1 the outputs diverge right away from the first layers).
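For completeness, the conversion step in Method 2 looks roughly like this (a minimal sketch assuming the TF 1.9 tf.contrib.lite.TocoConverter API; 'input', 'layer_1/output' and the input shape are placeholders for my actual node names):
import tensorflow as tf
# convert the frozen graph cut at layer_1, so layer_1 becomes the tflite output
converter = tf.contrib.lite.TocoConverter.from_frozen_graph(
    'model_up_to_layer_1.pb',
    input_arrays=['input'],
    output_arrays=['layer_1/output'],
    input_shapes={'input': [1, 224, 224, 3]})
converter.allow_custom_ops = True  # only needed once the cut graph contains the custom ops
tflite_model = converter.convert()
with open('model_up_to_layer_1.tflite', 'wb') as f:
    f.write(tflite_model)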
Question: Why is the behaviour of get_tensor so strange? Maybe it is because I am using tensorflow 1.9 (when TF-Lite was not yet released and only available in developer preview)?
PS: I am aware of the release of TF-Lite, but I’ve manually compiled TensorFlow 1.9 for my project and now it is hard to change the version.
Answer
I had the same problem a few months ago. The thing is, TF-Lite works quite differently from TensorFlow: it uses static memory allocation and a static execution plan, memory-mapped model files for faster loading, and it is supposed to optimize everything possible in the network’s forward-propagation pipeline.
I’m not a developer of TF-Lite, but I suppose it keeps its memory footprint extremely low by re-using the memory areas that were used for previously computed ops. Let’s look at the idea in the following illustration:
Step 1: first, we feed the inputs into a symbolic tensor I (in parentheses). Let’s say its value is stored in a buffer called buffer_1.
    op1     op2     op3
(I) ----> A ----> B ----> O
_________________________________
^^^     ^^^^^^^^^^^^     ^^^
input   intermediate    output
tensor    tensors       tensor
Step 2: Now, we need to compute op1 on symbolic tensor I to attain the symbolic tensor A. We compute on buffer_1 and store the value of symbolic tensor A in a buffer called buffer_2.
    [op1]      op2     op3
(I) ----> (A) ----> B ----> O
Step 3: Now, we’re computing op2 on symbolic tensor A to attain the symbolic tensor B. We compute on buffer_2 and store the value of symbolic tensor B in a buffer called buffer_3…
     op1      [op2]      op3
 I  ----> (A) ----> (B) ----> O
But wait! Why waste memory on buffer_3 if buffer_1 is now unused, and its contents are useless for getting the output O? So, instead of storing in buffer_3, we will actually store the result of this operation in buffer_1!
That’s the basic idea of efficient memory re-use, which I think is implemented in TF-Lite, given the static graph analysis built into toco and the rest of the toolchain. And that’s why you can’t simply apply get_tensor to non-output tensors.
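To make the idea concrete, here is a toy sketch of such a greedy buffer re-use plan for the chain above (purely illustrative; plan_buffers and consumed_when are made-up names, and this is not TF-Lite’s actual memory planner):
# toy greedy re-use: each tensor grabs a freed buffer if one exists,
# and a buffer is freed as soon as its tensor has been consumed
def plan_buffers(tensors, consumed_when):
    free, assignment, next_id = [], {}, 1
    for t in tensors:
        if free:                      # reuse a buffer whose value is no longer needed
            assignment[t] = free.pop()
        else:                         # otherwise allocate a fresh buffer
            assignment[t] = 'buffer_%d' % next_id
            next_id += 1
        # free every buffer whose tensor was consumed while computing t
        for other, consumer in consumed_when.items():
            if consumer == t and other in assignment:
                free.append(assignment[other])
    return assignment

# I is consumed when computing A (op1), A when computing B (op2), B when computing O (op3)
print(plan_buffers(['I', 'A', 'B', 'O'], {'I': 'A', 'A': 'B', 'B': 'O'}))
# -> {'I': 'buffer_1', 'A': 'buffer_2', 'B': 'buffer_1', 'O': 'buffer_2'}
Note how B ends up in buffer_1: after invocation, reading an intermediate tensor with get_tensor may therefore give you the value of a completely different tensor.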
An easier way to debug?
You’ve mentioned that you’re writing a custom op, so I suppose you’ve built tflite with bazel, right? Then you can actually inject some logging code into Interpreter::Invoke() in the file tensorflow/lite/interpreter.cc. An ugly hack, but it works.
PS: I would be glad if any TensorFlow Lite developer comes across this and comments on it :)