Gradient Accumulation with Custom model.fit in TF.Keras?

Question

Please add a brief comment with your thoughts so that I can improve my question. Thank you. :-) I'm trying to train a tf.keras model with Gradient Accumulation (GA). However, I don't want to use a custom training loop; I want to customize the .fit() method by overriding train_step. Is it possible? How can I accomplish this? The reason is

Accepted Answer

Yes, it is possible to customize the .fit() method by overriding train_step, without writing a custom training loop. The following simple example shows how to train a simple MNIST classifier with gradient accumulation:

    import tensorflow as tf

    class CustomTrainStep(tf.keras.Model):
        def __init__(self, n_gradients, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.n_gradients = tf.constant(n_gradients, dtype=tf.int32)
            self.n_acum_step = tf.Variable(0, dtype=tf.int32, trainable=False)
            self.gradient_accumulation = [tf.Variable(tf.zeros_like(v, dtype=tf.float32), trainable=False) for v in self.trainable_variables]

        def train_step(self, data):
            self.n_acum_step.assign_add(1)

            x, y = data
            # Gradient Tape
            with tf.GradientTape() as tape:
                y_pred = self(x, training=True)
                loss = self.compiled_loss(y, y_pred, regularization_losses=self.losses)
            # Calculate batch gradients
            gradients = tape.gradient(loss, self.trainable_variables)
            # Accumulate batch gradients
            for i in range(len(self.gradient_accumulation)):
                self.gradient_accumulation[i].assign_add(gradients[i])

            # If n_acum_step reaches n_gradients, apply the accumulated gradients to update the variables; otherwise do nothing
            tf.cond(tf.equal(self.n_acum_step, self.n_gradients), self.apply_accu_gradients, lambda: None)

            # Update metrics
            self.compiled_metrics.update_state(y, y_pred)
            return {m.name: m.result() for m in self.metrics}

        def apply_accu_gradients(self):
            # Apply accumulated gradients
            self.optimizer.apply_gradients(zip(self.gradient_accumulation, self.trainable_variables))

            # Reset the step counter and the accumulators
            self.n_acum_step.assign(0)
            for i in range(len(self.gradient_accumulation)):
                self.gradient_accumulation[i].assign(tf.zeros_like(self.trainable_variables[i], dtype=tf.float32))

    # Model
    input = tf.keras.Input(shape=(28, 28))
    base_maps = tf.keras.layers.Flatten(input_shape=(28, 28))(input)
    base_maps = tf.keras.layers.Dense(128, activation='relu')(base_maps)
    base_maps = tf.keras.layers.Dense(units=10, activation='softmax', name='primary')(base_maps)

    custom_model = CustomTrainStep(n_gradients=10, inputs=[input], outputs=[base_maps])

    # Bind all
    custom_model.compile(
        loss = tf.keras.losses.CategoricalCrossentropy(),
        metrics = ['accuracy'],
        optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3) )

    # Data
    (x_train, y_train), (_, _) = tf.keras.datasets.mnist.load_data()
    x_train = tf.divide(x_train, 255)
    y_train = tf.one_hot(y_train , depth=10)

    # Customized fit
    custom_model.fit(x_train, y_train, batch_size=6, epochs=3, verbose = 1)

Outputs:

    Epoch 1/3
    10000/10000 [==============================] - 13s 1ms/step - loss: 0.5053 - accuracy: 0.8584
    Epoch 2/3
    10000/10000 [==============================] - 13s 1ms/step - loss: 0.1389 - accuracy: 0.9600
    Epoch 3/3
    10000/10000 [==============================] - 13s 1ms/step - loss: 0.0898 - accuracy: 0.9748

Pros:

- Gradient accumulation is a mechanism to split the batch of samples used for training a neural network into several mini-batches of samples that are run sequentially.
- GA calculates the loss and gradients after each mini-batch, but instead of updating the model parameters immediately, it waits and accumulates the gradients over consecutive batches. This lets it overcome memory constraints, i.e. it trains the model as if with a large batch size while using less memory.
- Example: if you run gradient accumulation with 5 steps and a batch size of 4 images, it serves almost the same purpose as running with a batch size of 20 images.
- We could also parallelize the training when using GA, i.e. aggregate gradients from multiple machines.

Things to consider:

This technique works so well that it is widely used, but there are a few things to consider before using it. I don't think they should be called cons; after all, all GA does is turn 4 + 4 into 2 + 2 + 2 + 2.

If your machine has sufficient memory for a batch size that is already large enough, there is no need to use it. It is well known that too large a batch size leads to poor generalization, and training will certainly run slower if you use GA to reach a batch size that your machine's memory can already handle.

Reference: What is Gradient Accumulation in Deep Learning?
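
To make the "5 steps of 4 images versus one batch of 20" comparison concrete, here is a minimal pure-Python sketch. It is not part of the original answer: the one-parameter linear model, the squared-error loss, and the names mean_gradient, accum_steps and micro_batch are assumptions chosen only for illustration. It shows that the average of the micro-batch gradients matches the gradient of the full batch, while the plain sum (which is what assign_add in the train_step above accumulates) is accum_steps times larger.

```python
import random

def mean_gradient(w, xs, ys):
    """Gradient w.r.t. w of the mean of 0.5 * (w * x - y) ** 2 over a batch."""
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

# Hypothetical toy data: 20 samples from y = 3x plus a little noise.
random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(20)]
ys = [3.0 * x + random.gauss(0.0, 0.1) for x in xs]
w = 0.5  # current parameter value

accum_steps, micro_batch = 5, 4

# Gradient of one large batch of 20 samples.
full_grad = mean_gradient(w, xs, ys)

# Gradients of five micro-batches of 4 samples, summed step by step.
accumulated = 0.0
for step in range(accum_steps):
    lo = step * micro_batch
    hi = lo + micro_batch
    accumulated += mean_gradient(w, xs[lo:hi], ys[lo:hi])

# Dividing by the number of steps recovers the large-batch gradient.
averaged = accumulated / accum_steps

print(abs(averaged - full_grad) < 1e-9)                   # averaged sum matches
print(abs(accumulated - accum_steps * full_grad) < 1e-9)  # raw sum is 5x larger
```

Whether the extra factor matters depends on the optimizer. With plain SGD, applying the raw sum behaves like a learning rate multiplied by the number of accumulation steps, so dividing the accumulated gradients by that number before apply_gradients is a common adjustment. Adaptive optimizers such as Adam are much less sensitive to a constant gradient scale.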
