
Compute gradients across two models

Let’s assume that we are building a basic CNN that recognizes pictures of cats and dogs (binary classifier).

An example of such a CNN is as follows:

model = Sequential([
  Conv2D(32, (3,3), input_shape=...),
  Activation('relu'),
  MaxPooling2D(pool_size=(2,2)),

  Conv2D(32, (3,3)),
  Activation('relu'),
  MaxPooling2D(pool_size=(2,2)),

  Conv2D(64, (3,3)),
  Activation('relu'),
  MaxPooling2D(pool_size=(2,2)),

  Flatten(),
  Dense(64),
  Activation('relu'),
  Dropout(0.5),
  Dense(1),
  Activation('sigmoid')
])

Let’s also assume that we want to have the model split into two parts, or two models, called model_0 and model_1.

model_0 will handle the input, and model_1 will take model_0's output as its input.

For example, the previous model will become:

model_0 = Sequential([
  Conv2D(32, (3,3), input_shape=...),
  Activation('relu'),
  MaxPooling2D(pool_size=(2,2)),

  Conv2D(32, (3,3)),
  Activation('relu'),
  MaxPooling2D(pool_size=(2,2)),

  Conv2D(64, (3,3)),
  Activation('relu'),
  MaxPooling2D(pool_size=(2,2))
])

model_1 = Sequential([
  Flatten(),
  Dense(64),
  Activation('relu'),
  Dropout(0.5),
  Dense(1),
  Activation('sigmoid')
])

How do I train the two models as if they were one single model? I have tried to compute the gradients manually, but I don't understand how to pass the gradients from model_1 to model_0:

for epoch in range(epochs):
    for step, (x_batch, y_batch) in enumerate(train_generator):

        # model 0
        with tf.GradientTape() as tape_0:
            y_pred_0 = model_0(x_batch, training=True)

        # model 1
        with tf.GradientTape() as tape_1:
            y_pred_1 = model_1(y_pred_0, training=True)

            loss_value = loss_fn(y_batch, y_pred_1)

        grads_1 = tape_1.gradient(y_pred_1, model_1.trainable_weights)
        grads_0 = tape_0.gradient(y_pred_0, model_0.trainable_weights)
        optimizer.apply_gradients(zip(grads_1, model_1.trainable_weights))
        optimizer.apply_gradients(zip(grads_0, model_0.trainable_weights))

This method will of course not work, as I am basically just training two models separately and stitching them together, which is not what I want to achieve.

This is a Google Colab notebook for a simpler version of this problem, using only two fully connected layers and two activation functions: https://colab.research.google.com/drive/14Px1rJtiupnB6NwtvbgeVYw56N1xM6JU#scrollTo=PeqtJJWS3wyG

Please note that I am aware of Sequential([model_0, model_1]), but this is not what I want to achieve. I want to do the backpropagation step manually.

Also, I would like to keep using two separate tapes. The trick here is to use grads_1 to calculate grads_0.

Any clues?


Answer

After asking for help and better understanding the dynamics of automatic differentiation (or autodiff), I have managed to get a working, simple example of what I wanted to achieve. Even though this approach does not fully resolve the problem, it is a step forward in understanding how to approach the problem at hand.

Reference model

I have simplified the model into something much smaller:

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Activation, Dense, Layer, Flatten, Conv2D
import numpy as np

tf.random.set_seed(0)
# a batch of 3 10x10 images, 1 channel
x = tf.random.uniform((3, 10, 10, 1))
y = tf.cast(tf.random.uniform((3, 1)) > 0.5, tf.float32)

layer_0 = Sequential([Conv2D(filters=6, kernel_size=2, activation="relu")])
layer_1 = Sequential([Conv2D(filters=6, kernel_size=2, activation="relu")])
layer_2 = Sequential([Flatten(), Dense(1), Activation("sigmoid")])

loss_fn = tf.keras.losses.MeanSquaredError()

We split the model into three parts: layer_0, layer_1, and layer_2. The vanilla approach is putting everything on a single tape and calculating the gradients one by one (or in a single step):

with tf.GradientTape(persistent=True) as tape:
    out_layer_0 = layer_0(x)
    out_layer_1 = layer_1(out_layer_0)
    out_layer_2 = layer_2(out_layer_1)
    loss = loss_fn(y, out_layer_2)

And the different gradients can be calculated just with simple calls to tape.gradient:

ref_conv_dLoss_dWeights2 = tape.gradient(loss, layer_2.trainable_weights)
ref_conv_dLoss_dWeights1 = tape.gradient(loss, layer_1.trainable_weights)
ref_conv_dLoss_dWeights0 = tape.gradient(loss, layer_0.trainable_weights)

ref_conv_dLoss_dY = tape.gradient(loss, out_layer_2)
ref_conv_dLoss_dOut1 = tape.gradient(loss, out_layer_1)
ref_conv_dOut2_dOut1 = tape.gradient(out_layer_2, out_layer_1)
ref_conv_dLoss_dOut0 = tape.gradient(loss, out_layer_0)
ref_conv_dOut1_dOut0 = tape.gradient(out_layer_1, out_layer_0)
ref_conv_dOut0_dWeights0 = tape.gradient(out_layer_0, layer_0.trainable_weights)
ref_conv_dOut1_dWeights1 = tape.gradient(out_layer_1, layer_1.trainable_weights)
ref_conv_dOut2_dWeights2 = tape.gradient(out_layer_2, layer_2.trainable_weights)

We will use these values later to compare the correctness of our approach.

Split model with manual autodiff

By splitting, we mean that every layer_x needs to have its own GradientTape, responsible for generating its own gradients:

with tf.GradientTape(persistent=True) as tape_0:
    out_layer_0 = layer_0(x)

with tf.GradientTape(persistent=True) as tape_1:
    tape_1.watch(out_layer_0)
    out_layer_1 = layer_1(out_layer_0)

with tf.GradientTape(persistent=True) as tape_2:
    tape_2.watch(out_layer_1)
    out_layer_2 = layer_2(out_layer_1)
    loss = loss_fn(y, out_layer_2)

Now, simply using tape_n.gradient for every step will not work: we would be losing a lot of information that we cannot recover afterwards.
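
To see what gets lost, recall that for a non-scalar target, tape.gradient returns the Jacobian already summed over the output axes (equivalent to back-propagating an all-ones upstream gradient). A minimal sketch on toy tensors, not from the model above:

import tensorflow as tf

x = tf.constant([1.0, 2.0])
with tf.GradientTape(persistent=True) as t:
    t.watch(x)
    y = x * x  # non-scalar target

# gradient() collapses the output axis: it returns sum_i dy_i/dx_j,
# losing the per-output structure that the chain rule needs
print(t.gradient(y, x))                         # [2. 4.]
print(tf.reduce_sum(t.jacobian(y, x), axis=0))  # same values: [2. 4.]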

Instead, we have to use tape.jacobian and tape.batch_jacobian, except for the loss, as it is a single scalar value and tape.gradient loses nothing there.

dOut0_dWeights0 = tape_0.jacobian(out_layer_0, layer_0.trainable_weights)

dOut1_dOut0 = tape_1.batch_jacobian(out_layer_1, out_layer_0)
dOut1_dWeights1 = tape_1.jacobian(out_layer_1, layer_1.trainable_weights)

dOut2_dOut1 = tape_2.batch_jacobian(out_layer_2, out_layer_1)
dOut2_dWeights2 = tape_2.jacobian(out_layer_2, layer_2.trainable_weights)

dLoss_dOut2 = tape_2.gradient(loss, out_layer_2) # or dL/dY
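
For intuition, these are the shapes the Jacobians take for the toy model above (batch of 3, 10x10 inputs, 2x2 valid convolutions):

print(dOut0_dWeights0[0].shape)  # (3, 9, 9, 6, 2, 2, 1, 6): batch, out_0 dims, kernel dims
print(dOut1_dOut0.shape)         # (3, 8, 8, 6, 9, 9, 6): batch, out_1 dims, out_0 dims
print(dOut2_dOut1.shape)         # (3, 1, 8, 8, 6): batch, out_2 dims, out_1 dims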

We will use a couple of utility functions to adjust the result to what we want:

def add_missing_axes(source_tensor, target_tensor):
    len_missing_axes = len(target_tensor.shape) - len(source_tensor.shape)
    # note: the number of tf.newaxis is determined by the number of axis missing to reach
    # the same dimension of the target tensor
    assert len_missing_axes >= 0

    # convenience renaming
    source_tensor_extended = source_tensor
    # add every missing axis
    for _ in range(len_missing_axes):
        source_tensor_extended = source_tensor_extended[..., tf.newaxis]

    return source_tensor_extended

def upstream_gradient_loss_weights(dOutUpstream_dWeightsLocal, dLoss_dOutUpstream):
    dLoss_dOutUpstream_extended = add_missing_axes(dLoss_dOutUpstream, dOutUpstream_dWeightsLocal)
    # reduce over the first axes
    len_reduce = range(len(dLoss_dOutUpstream.shape))
    return tf.reduce_sum(dOutUpstream_dWeightsLocal * dLoss_dOutUpstream_extended, axis=len_reduce)

def upstream_gradient_loss_out(dOutUpstream_dOutLocal, dLoss_dOutUpstream):
    dLoss_dOutUpstream_extended = add_missing_axes(dLoss_dOutUpstream, dOutUpstream_dOutLocal)
    len_reduce = range(len(dLoss_dOutUpstream.shape))[1:]
    return tf.reduce_sum(dOutUpstream_dOutLocal * dLoss_dOutUpstream_extended, axis=len_reduce)
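
As a quick sanity check of the broadcasting logic, add_missing_axes pads the upstream gradient with trailing singleton axes until it can broadcast against the local Jacobian (toy shapes, just for illustration):

a = tf.zeros((3, 5))        # e.g. dLoss/dOut, shape (batch, out)
b = tf.zeros((3, 5, 4, 2))  # e.g. dOut/dW, with two extra weight axes
print(add_missing_axes(a, b).shape)  # (3, 5, 1, 1), broadcastable against b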

Finally, we can apply the chain rule:

dOut2_dOut1 = tape_2.batch_jacobian(out_layer_2, out_layer_1)
dOut2_dWeights2 = tape_2.jacobian(out_layer_2, layer_2.trainable_weights)

dLoss_dOut2 = tape_2.gradient(loss, out_layer_2) # or dL/dY
dLoss_dWeights2 = upstream_gradient_loss_weights(dOut2_dWeights2[0], dLoss_dOut2)
dLoss_dBias2 = upstream_gradient_loss_weights(dOut2_dWeights2[1], dLoss_dOut2)

dLoss_dOut1 = upstream_gradient_loss_out(dOut2_dOut1, dLoss_dOut2)
dLoss_dWeights1 = upstream_gradient_loss_weights(dOut1_dWeights1[0], dLoss_dOut1)
dLoss_dBias1 = upstream_gradient_loss_weights(dOut1_dWeights1[1], dLoss_dOut1)

dLoss_dOut0 = upstream_gradient_loss_out(dOut1_dOut0, dLoss_dOut1)
dLoss_dWeights0 = upstream_gradient_loss_weights(dOut0_dWeights0[0], dLoss_dOut0)
dLoss_dBias0 = upstream_gradient_loss_weights(dOut0_dWeights0[1], dLoss_dOut0)

print("dLoss_dWeights2 valid:", tf.experimental.numpy.allclose(ref_conv_dLoss_dWeights2[0], dLoss_dWeights2).numpy())
print("dLoss_dBias2 valid:", tf.experimental.numpy.allclose(ref_conv_dLoss_dWeights2[1], dLoss_dBias2).numpy())
print("dLoss_dWeights1 valid:", tf.experimental.numpy.allclose(ref_conv_dLoss_dWeights1[0], dLoss_dWeights1).numpy())
print("dLoss_dBias1 valid:", tf.experimental.numpy.allclose(ref_conv_dLoss_dWeights1[1], dLoss_dBias1).numpy())
print("dLoss_dWeights0 valid:", tf.experimental.numpy.allclose(ref_conv_dLoss_dWeights0[0], dLoss_dWeights0).numpy())
print("dLoss_dBias0 valid:", tf.experimental.numpy.allclose(ref_conv_dLoss_dWeights0[1], dLoss_dBias0).numpy())

And the output will be:

dLoss_dWeights2 valid: True
dLoss_dBias2 valid: True
dLoss_dWeights1 valid: True
dLoss_dBias1 valid: True
dLoss_dWeights0 valid: True
dLoss_dBias0 valid: True

as all the values are close to each other. Mind that with the Jacobian-based approach we will have some degree of numerical error/approximation, around 10^-7, but I think this is good enough.
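
Once validated, these gradients can be applied just like the ones coming from a single tape. A minimal sketch (the optimizer and learning rate are illustrative choices, not prescribed):

optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
grads = [dLoss_dWeights0, dLoss_dBias0,
         dLoss_dWeights1, dLoss_dBias1,
         dLoss_dWeights2, dLoss_dBias2]
weights = (layer_0.trainable_weights
           + layer_1.trainable_weights
           + layer_2.trainable_weights)
# kernel and bias gradients are listed in the same order as trainable_weights
optimizer.apply_gradients(zip(grads, weights))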

Gotchas

For extremely small or toy models, this is perfect and works well. However, in a real scenario you would have big images with tons of dimensions, which is not ideal when dealing with Jacobians: they can quickly reach very high dimensionality. But this is an issue all of its own.
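
If the Jacobians become a bottleneck, one way to sidestep them entirely, while still keeping separate tapes, is the output_gradients argument of tape.gradient, which feeds the upstream gradient dLoss/dOut0 directly into the backward pass of the first tape. A sketch of the idea on the same toy model:

with tf.GradientTape() as tape_0:
    out_layer_0 = layer_0(x)

with tf.GradientTape(persistent=True) as tape_1:
    tape_1.watch(out_layer_0)
    out_layer_2 = layer_2(layer_1(out_layer_0))
    loss = loss_fn(y, out_layer_2)

# downstream gradients as usual...
grads_down = tape_1.gradient(loss, layer_1.trainable_weights + layer_2.trainable_weights)
# ...then pass dLoss/dOut0 as the upstream gradient of the first tape,
# so no full Jacobian is ever materialized
dLoss_dOut0 = tape_1.gradient(loss, out_layer_0)
grads_up = tape_0.gradient(out_layer_0, layer_0.trainable_weights,
                           output_gradients=dLoss_dOut0)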

