
Compute gradients across two models

Let’s assume that we are building a basic CNN that recognizes pictures of cats and dogs (a binary classifier).

An example of such a CNN could be the following:

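A rough sketch of the model (the layer sizes and the 150x150 RGB input shape are just placeholders):

import tensorflow as tf

# A minimal cat-vs-dog CNN; the exact layer sizes and input shape are assumptions
model = tf.keras.Sequential([
    tf.keras.Input(shape=(150, 150, 3)),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # binary output: cat vs. dog
])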

Let’s also assume that we want to have the model split into two parts, or two models, called model_0 and model_1.

model_0 will handle the input, and model_1 will take model_0's output as its own input.

For example, the previous model will become:

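Roughly like this (where exactly to cut the model is arbitrary):

# model_0: convolutional front end that consumes the raw image
model_0 = tf.keras.Sequential([
    tf.keras.Input(shape=(150, 150, 3)),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
])

# model_1: classifier head that takes model_0's output as its input
model_1 = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])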

How do I train the two models as if they were one single model? I have tried to set the gradients manually, but I don't understand how to pass the gradients from model_1 to model_0:

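Roughly what I tried (a sketch; loss_fn, the optimizers and train_step are my own names):

loss_fn = tf.keras.losses.BinaryCrossentropy()
optimizer_0 = tf.keras.optimizers.Adam()
optimizer_1 = tf.keras.optimizers.Adam()

def train_step(x, y):
    with tf.GradientTape() as tape_0:
        z = model_0(x)                     # intermediate features
    with tf.GradientTape() as tape_1:
        tape_1.watch(z)
        y_pred = model_1(z)
        loss = loss_fn(y, y_pred)

    # This only trains model_1 properly: tape_0 never sees the loss,
    # so there is no obvious way to get the loss gradients for model_0 from it.
    grads_1 = tape_1.gradient(loss, model_1.trainable_variables)
    optimizer_1.apply_gradients(zip(grads_1, model_1.trainable_variables))

    grads_0 = tape_0.gradient(z, model_0.trainable_variables)  # not the gradient of the loss!
    optimizer_0.apply_gradients(zip(grads_0, model_0.trainable_variables))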

This method will of course not work: I am basically just training two models separately and then binding them together, which is not what I want to achieve.

This is a Google Colab notebook for a simpler version of this problem, using only two fully connected layers and two activation functions: https://colab.research.google.com/drive/14Px1rJtiupnB6NwtvbgeVYw56N1xM6JU#scrollTo=PeqtJJWS3wyG

Please note that I am aware of Sequential([model_0, model_1]), but this is not what I want to achieve. I want to do the backpropagation step manually.

Also, I would like to keep using two separate tapes. The trick here is to use grads_1 to calculate grads_0.

Any clues?


Answer

After asking for help and getting a better understanding of the dynamics of automatic differentiation (autodiff), I have managed to put together a working, simple example of what I wanted to achieve. Even though this approach does not fully resolve the problem, it is a step forward in understanding how to approach it.

Reference model

I have simplified the model into something much smaller:

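For example, something along these lines (the layer sizes and the toy data are arbitrary):

import tensorflow as tf
import numpy as np

tf.random.set_seed(0)

# Toy data: a single batch of random inputs and targets
x = tf.random.normal((8, 10))
y = tf.random.normal((8, 1))

# Three small blocks that, chained together, form the reference model
layer_0 = tf.keras.layers.Dense(16, activation="sigmoid")
layer_1 = tf.keras.layers.Dense(8, activation="sigmoid")
layer_2 = tf.keras.layers.Dense(1)

loss_fn = tf.keras.losses.MeanSquaredError()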

We split it into three parts: layer_0, layer_1, and layer_2. The vanilla approach is simply to put everything together and compute the gradients one by one (or in a single step):

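A sketch of the single-tape version (persistent=True just so we can call tape.gradient several times afterwards):

with tf.GradientTape(persistent=True) as tape:
    out_0 = layer_0(x)
    out_1 = layer_1(out_0)
    out_2 = layer_2(out_1)
    loss = loss_fn(y, out_2)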

And the different gradients can be calculated with simple calls to tape.gradient:

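Along these lines:

# Reference gradients, taken directly from the single tape
ref_grads_0 = tape.gradient(loss, layer_0.trainable_variables)
ref_grads_1 = tape.gradient(loss, layer_1.trainable_variables)
ref_grads_2 = tape.gradient(loss, layer_2.trainable_variables)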

We will use these values later to compare the correctness of our approach.

Split model with manual autodiff

By splitting, we mean that every layer_x needs its own GradientTape, responsible for generating its own gradients:

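A sketch of the split forward pass (each tape is persistent because we will query it more than once, and the intermediate tensors have to be watched explicitly):

with tf.GradientTape(persistent=True) as tape_0:
    out_0 = layer_0(x)

with tf.GradientTape(persistent=True) as tape_1:
    tape_1.watch(out_0)
    out_1 = layer_1(out_0)

with tf.GradientTape(persistent=True) as tape_2:
    tape_2.watch(out_1)
    out_2 = layer_2(out_1)
    loss = loss_fn(y, out_2)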

Now, simply calling tape_n.gradient at every step will not work: we would be losing a lot of information that we cannot recover afterwards.

Instead, we have to use tape.jacobian and tape.batch_jacobian, except for the last step (the loss), as there we only have a single scalar value as a source.

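For instance (the shapes in the comments assume the toy sizes above):

# Per-block Jacobians; the loss block is handled later with plain tape.gradient
jac_1_vars = tape_1.jacobian(out_1, layer_1.trainable_variables)   # list of (batch, 8, *var_shape)
jac_1_out0 = tape_1.batch_jacobian(out_1, out_0)                   # (batch, 8, 16)
jac_0_vars = tape_0.jacobian(out_0, layer_0.trainable_variables)   # list of (batch, 16, *var_shape)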

We will use a couple of utility functions to adjust the result to what we want:

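One such helper could look like this (a minimal sketch; contract is my own name for it):

def contract(upstream, jacobian):
    # Contract an upstream gradient of shape (batch, n) with a Jacobian
    # of shape (batch, n, *var_shape) into a gradient of shape var_shape.
    return tf.tensordot(upstream, jacobian, axes=[[0, 1], [0, 1]])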

Finally, we can apply the chain rule:

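Roughly as follows:

# Last block: the loss is a scalar, so plain tape.gradient is enough
grads_2 = tape_2.gradient(loss, layer_2.trainable_variables)
dloss_dout1 = tape_2.gradient(loss, out_1)                         # (batch, 8)

# layer_1: contract the upstream gradient with layer_1's Jacobians
grads_1 = [contract(dloss_dout1, j) for j in jac_1_vars]

# Push the upstream gradient one block further back: dloss/dout_0
# (batch_jacobian is fine here because each sample's out_1 depends only on its own out_0)
dloss_dout0 = tf.einsum("bi,bik->bk", dloss_dout1, jac_1_out0)     # (batch, 16)

# layer_0: same contraction against layer_0's Jacobians
grads_0 = [contract(dloss_dout0, j) for j in jac_0_vars]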

We can then check the result against the reference gradients computed earlier:

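For instance, comparing each gradient with its reference counterpart (a sketch relying on the gradient lists from the snippets above):

# Compare every manually computed gradient with its reference counterpart
for ref, man in zip(ref_grads_0 + ref_grads_1 + ref_grads_2,
                    grads_0 + grads_1 + grads_2):
    print(np.allclose(ref.numpy(), man.numpy(), atol=1e-6))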

All the values are close to each other. Mind that with the Jacobian approach we get some degree of error/approximation, around 10^-7, but I think this is good enough.

Gotchas

For extremely small or toy models, this works perfectly well. However, in a real scenario you would have big images with tons of dimensions, which is not ideal when dealing with Jacobians, as they can quickly reach very high dimensionality. But that is an issue of its own.

