Let’s assume that we are building a basic CNN that recognizes pictures of cats and dogs (binary classifier).
An example of such a CNN could be as follows:
model = Sequential([
    Conv2D(32, (3, 3), input_shape=...),
    Activation('relu'),
    MaxPooling2D(pool_size=(2, 2)),

    Conv2D(32, (3, 3)),
    Activation('relu'),
    MaxPooling2D(pool_size=(2, 2)),

    Conv2D(64, (3, 3)),
    Activation('relu'),
    MaxPooling2D(pool_size=(2, 2)),

    Flatten(),
    Dense(64),
    Activation('relu'),
    Dropout(0.5),
    Dense(1),
    Activation('sigmoid'),
])
Let’s also assume that we want to split the model into two parts, or two models, called model_0 and model_1. model_0 will handle the input, and model_1 will take model_0’s output as its input.
For example, the previous model will become:
model_0 = Sequential([
    Conv2D(32, (3, 3), input_shape=...),
    Activation('relu'),
    MaxPooling2D(pool_size=(2, 2)),

    Conv2D(32, (3, 3)),
    Activation('relu'),
    MaxPooling2D(pool_size=(2, 2)),

    Conv2D(64, (3, 3)),
    Activation('relu'),
    MaxPooling2D(pool_size=(2, 2)),
])

model_1 = Sequential([
    Flatten(),
    Dense(64),
    Activation('relu'),
    Dropout(0.5),
    Dense(1),
    Activation('sigmoid'),
])
How do I train the two models as if they were one single model? I have tried to manually set the gradients, but I don’t understand how to pass the gradients from model_1 to model_0:
for epoch in range(epochs):
    for step, (x_batch, y_batch) in enumerate(train_generator):
        # model 0
        with tf.GradientTape() as tape_0:
            y_pred_0 = model_0(x_batch, training=True)

        # model 1
        with tf.GradientTape() as tape_1:
            y_pred_1 = model_1(y_pred_0, training=True)
            loss_value = loss_fn(y_batch, y_pred_1)

        grads_1 = tape_1.gradient(loss_value, model_1.trainable_weights)
        grads_0 = tape_0.gradient(y_pred_0, model_0.trainable_weights)

        optimizer.apply_gradients(zip(grads_1, model_1.trainable_weights))
        optimizer.apply_gradients(zip(grads_0, model_0.trainable_weights))
This method will of course not work: I am basically training two models separately and stitching them together, which is not what I want to achieve.
This is a Google Colab notebook for a simpler version of this problem, using only two fully connected layers and two activation functions: https://colab.research.google.com/drive/14Px1rJtiupnB6NwtvbgeVYw56N1xM6JU#scrollTo=PeqtJJWS3wyG
Please note that I am aware of Sequential([model_0, model_1]), but this is not what I want to achieve: I want to perform the backpropagation step manually. I would also like to keep using two separate tapes. The trick here is to use grads_1 to calculate grads_0.
Any clues?
Answer
After asking for help and gaining a better understanding of the dynamics of automatic differentiation (autodiff), I managed to get a simple, working example of what I wanted to achieve. Even though this approach does not fully solve the problem, it is a step forward in understanding how to approach it.
Reference model
I have simplified the model into something much smaller:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Activation, Dense, Flatten, Conv2D

tf.random.set_seed(0)

# a batch of 3 images, 10x10 pixels, 1 channel
x = tf.random.uniform((3, 10, 10, 1))
y = tf.cast(tf.random.uniform((3, 1)) > 0.5, tf.float32)

layer_0 = Sequential([Conv2D(filters=6, kernel_size=2, activation="relu")])
layer_1 = Sequential([Conv2D(filters=6, kernel_size=2, activation="relu")])
layer_2 = Sequential([Flatten(), Dense(1), Activation("sigmoid")])

loss_fn = tf.keras.losses.MeanSquaredError()
We split it into three parts: layer_0, layer_1, and layer_2. The vanilla approach is to put everything under a single tape and calculate the gradients one by one (or in a single step):
# persistent=True lets us call tape.gradient multiple times below
with tf.GradientTape(persistent=True) as tape:
    out_layer_0 = layer_0(x)
    out_layer_1 = layer_1(out_layer_0)
    out_layer_2 = layer_2(out_layer_1)
    loss = loss_fn(y, out_layer_2)
The different gradients can then be calculated with simple calls to tape.gradient:
ref_conv_dLoss_dWeights2 = tape.gradient(loss, layer_2.trainable_weights)
ref_conv_dLoss_dWeights1 = tape.gradient(loss, layer_1.trainable_weights)
ref_conv_dLoss_dWeights0 = tape.gradient(loss, layer_0.trainable_weights)

ref_conv_dLoss_dY = tape.gradient(loss, out_layer_2)
ref_conv_dLoss_dOut1 = tape.gradient(loss, out_layer_1)
ref_conv_dOut2_dOut1 = tape.gradient(out_layer_2, out_layer_1)
ref_conv_dLoss_dOut0 = tape.gradient(loss, out_layer_0)
ref_conv_dOut1_dOut0 = tape.gradient(out_layer_1, out_layer_0)

ref_conv_dOut0_dWeights0 = tape.gradient(out_layer_0, layer_0.trainable_weights)
ref_conv_dOut1_dWeights1 = tape.gradient(out_layer_1, layer_1.trainable_weights)
ref_conv_dOut2_dWeights2 = tape.gradient(out_layer_2, layer_2.trainable_weights)
We will use these values later to compare the correctness of our approach.
Split model with manual autodiff
By splitting, we mean that every layer_x needs to have its own GradientTape, responsible for generating its own gradients:
with tf.GradientTape(persistent=True) as tape_0:
    out_layer_0 = layer_0(x)

with tf.GradientTape(persistent=True) as tape_1:
    # out_layer_0 is a plain tensor, not a variable, so the tape must be
    # told explicitly to record operations involving it
    tape_1.watch(out_layer_0)
    out_layer_1 = layer_1(out_layer_0)

with tf.GradientTape(persistent=True) as tape_2:
    tape_2.watch(out_layer_1)
    out_layer_2 = layer_2(out_layer_1)
    loss = loss_fn(y, out_layer_2)
Now, simply using tape_n.gradient for every step will not work: we would be losing a lot of information that we cannot recover afterwards. Instead, we have to use tape.jacobian and tape.batch_jacobian, except for the loss, as we only have one value as a source.
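To see concretely what gets lost, here is a toy illustration of mine (not part of the original answer): on a non-scalar target, tape.gradient implicitly sums over the target’s elements, while tape.jacobian keeps one derivative per output element.

v = tf.constant([1.0, 2.0])
with tf.GradientTape(persistent=True) as demo_tape:
    demo_tape.watch(v)
    w = v * v  # non-scalar target, shape (2,)

# gradient() collapses the Jacobian by summing over the outputs: [2., 4.]
print(demo_tape.gradient(w, v))
# jacobian() keeps the full matrix: [[2., 0.], [0., 4.]]
print(demo_tape.jacobian(w, v))

With that in mind, these are the Jacobians we need for the split model: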
dOut0_dWeights0 = tape_0.jacobian(out_layer_0, layer_0.trainable_weights)

dOut1_dOut0 = tape_1.batch_jacobian(out_layer_1, out_layer_0)
dOut1_dWeights1 = tape_1.jacobian(out_layer_1, layer_1.trainable_weights)

dOut2_dOut1 = tape_2.batch_jacobian(out_layer_2, out_layer_1)
dOut2_dWeights2 = tape_2.jacobian(out_layer_2, layer_2.trainable_weights)
dLoss_dOut2 = tape_2.gradient(loss, out_layer_2)  # or dL/dY
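If I compute the shapes correctly, for the toy model above dOut1_dOut0 (a batch_jacobian) has shape (3, 8, 8, 6, 9, 9, 6), i.e. one Jacobian per batch element, while dOut0_dWeights0 (a jacobian over a list of weights) is a list of two tensors of shapes (3, 9, 9, 6, 2, 2, 1, 6) and (3, 9, 9, 6, 6), one for the kernel and one for the bias.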
We will use a couple of utility functions to adjust the result to what we want:
def add_missing_axes(source_tensor, target_tensor):
    # note: the number of tf.newaxis to add is determined by how many axes
    # are missing to reach the same rank as the target tensor
    len_missing_axes = len(target_tensor.shape) - len(source_tensor.shape)
    assert len_missing_axes >= 0

    # convenience renaming
    source_tensor_extended = source_tensor
    # add every missing axis
    for _ in range(len_missing_axes):
        source_tensor_extended = source_tensor_extended[..., tf.newaxis]
    return source_tensor_extended


def upstream_gradient_loss_weights(dOutUpstream_dWeightsLocal, dLoss_dOutUpstream):
    dLoss_dOutUpstream_extended = add_missing_axes(dLoss_dOutUpstream, dOutUpstream_dWeightsLocal)
    # reduce over the batch and output axes, keeping the weight axes
    len_reduce = range(len(dLoss_dOutUpstream.shape))
    return tf.reduce_sum(dOutUpstream_dWeightsLocal * dLoss_dOutUpstream_extended, axis=len_reduce)


def upstream_gradient_loss_out(dOutUpstream_dOutLocal, dLoss_dOutUpstream):
    dLoss_dOutUpstream_extended = add_missing_axes(dLoss_dOutUpstream, dOutUpstream_dOutLocal)
    # reduce over the output axes, keeping the batch axis
    len_reduce = range(len(dLoss_dOutUpstream.shape))[1:]
    return tf.reduce_sum(dOutUpstream_dOutLocal * dLoss_dOutUpstream_extended, axis=len_reduce)
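As a sanity check (this snippet is mine, not part of the original answer), the contraction performed by upstream_gradient_loss_weights is equivalent to an explicit tf.einsum over the batch and output axes. For example, with random tensors shaped like dOut1_dWeights1[0] and dLoss_dOut1 from the toy model:

B, H, W, C = 3, 8, 8, 6                              # shape of out_layer_1
jac = tf.random.uniform((B, H, W, C, 2, 2, 6, 6))    # stand-in for dOut1_dWeights1[0]
dLoss_dOut = tf.random.uniform((B, H, W, C))         # stand-in for dLoss_dOut1

manual = upstream_gradient_loss_weights(jac, dLoss_dOut)
via_einsum = tf.einsum("bhwckxio,bhwc->kxio", jac, dLoss_dOut)
print(tf.experimental.numpy.allclose(manual, via_einsum).numpy())  # True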
Finally, we can apply the chain rule:
dOut2_dOut1 = tape_2.batch_jacobian(out_layer_2, out_layer_1)
dOut2_dWeights2 = tape_2.jacobian(out_layer_2, layer_2.trainable_weights)
dLoss_dOut2 = tape_2.gradient(loss, out_layer_2)  # or dL/dY

dLoss_dWeights2 = upstream_gradient_loss_weights(dOut2_dWeights2[0], dLoss_dOut2)
dLoss_dBias2 = upstream_gradient_loss_weights(dOut2_dWeights2[1], dLoss_dOut2)

dLoss_dOut1 = upstream_gradient_loss_out(dOut2_dOut1, dLoss_dOut2)
dLoss_dWeights1 = upstream_gradient_loss_weights(dOut1_dWeights1[0], dLoss_dOut1)
dLoss_dBias1 = upstream_gradient_loss_weights(dOut1_dWeights1[1], dLoss_dOut1)

dLoss_dOut0 = upstream_gradient_loss_out(dOut1_dOut0, dLoss_dOut1)
dLoss_dWeights0 = upstream_gradient_loss_weights(dOut0_dWeights0[0], dLoss_dOut0)
dLoss_dBias0 = upstream_gradient_loss_weights(dOut0_dWeights0[1], dLoss_dOut0)

print("dLoss_dWeights2 valid:", tf.experimental.numpy.allclose(ref_conv_dLoss_dWeights2[0], dLoss_dWeights2).numpy())
print("dLoss_dBias2 valid:", tf.experimental.numpy.allclose(ref_conv_dLoss_dWeights2[1], dLoss_dBias2).numpy())
print("dLoss_dWeights1 valid:", tf.experimental.numpy.allclose(ref_conv_dLoss_dWeights1[0], dLoss_dWeights1).numpy())
print("dLoss_dBias1 valid:", tf.experimental.numpy.allclose(ref_conv_dLoss_dWeights1[1], dLoss_dBias1).numpy())
print("dLoss_dWeights0 valid:", tf.experimental.numpy.allclose(ref_conv_dLoss_dWeights0[0], dLoss_dWeights0).numpy())
print("dLoss_dBias0 valid:", tf.experimental.numpy.allclose(ref_conv_dLoss_dWeights0[1], dLoss_dBias0).numpy())
And the output will be:
dLoss_dWeights2 valid: True
dLoss_dBias2 valid: True
dLoss_dWeights1 valid: True
dLoss_dBias1 valid: True
dLoss_dWeights0 valid: True
dLoss_dBias0 valid: True
as all the values are close to each other. Mind that with the Jacobian approach there is some degree of numerical error, on the order of 10^-7, but I think this is good enough.
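From here, a natural final step is to feed the manually computed gradients to a standard Keras optimizer. This is my own sketch, not part of the original answer, and the optimizer choice is arbitrary; apply_gradients just needs (gradient, variable) pairs:

# hypothetical final step: apply the manually computed gradients,
# pairing each gradient with the corresponding variable of its sub-model
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
optimizer.apply_gradients([
    (dLoss_dWeights0, layer_0.trainable_weights[0]),
    (dLoss_dBias0, layer_0.trainable_weights[1]),
    (dLoss_dWeights1, layer_1.trainable_weights[0]),
    (dLoss_dBias1, layer_1.trainable_weights[1]),
    (dLoss_dWeights2, layer_2.trainable_weights[0]),
    (dLoss_dBias2, layer_2.trainable_weights[1]),
])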
Gotchas
For extremely small or toy models, this is perfect and works well. However, in a real scenario you would have large images with tons of dimensions, which is a problem when dealing with Jacobians, as they can quickly reach very high dimensionality. But this is an issue all of its own.
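To get a sense of the scale (my own back-of-the-envelope numbers, not the original answer’s): for a batch of 32 RGB images of 224x224 pixels going through a Conv2D layer with a (3, 3, 3, 64) kernel, the Jacobian dOut/dW has shape (32, 222, 222, 64, 3, 3, 3, 64), roughly 1.7 * 10^11 elements, i.e. about 700 GB in float32. Full Jacobians are therefore only feasible for tiny models.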