
In PyTorch, how do I update a neural network via the average gradient from a list of losses?

I have a toy reinforcement learning project based on the REINFORCE algorithm (here’s PyTorch’s implementation) that I would like to add batch updates to. In RL, the “target” can only be created after a “prediction” has been made, so standard batching techniques do not apply. As such, I accrue losses for each episode and append them to a list l_losses where each item is a zero-dimensional tensor. I hold off on calling .backward() or optimizer.step() until a certain number of episodes have passed in order to create a sort of pseudo batch.

Given this list of losses, how do I have PyTorch update the network based on their average gradient? Or would updating based on the average gradient be the same as updating on the average loss (I seem to have read otherwise elsewhere)?

My current method is to create a new tensor t_loss from torch.stack(l_losses), and then run t_loss = t_loss.mean(), t_loss.backward(), optimizer.step(), and zero the gradient, but I’m unsure if this is equivalent to my intent. It’s also unclear to me whether I should have been running .backward() on each individual loss instead of concatenating them in a list (while holding off on the .step() call until the end).
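For reference, here is a minimal sketch of what I mean (run_episode, policy, optimizer, and batch_size stand in for the real pieces of my project):

import torch

l_losses = []
for episode in range(batch_size):
    loss = run_episode(policy)  # assumed helper returning a zero-dimensional loss
    l_losses.append(loss)

t_loss = torch.stack(l_losses).mean()  # average loss over the pseudo batch
optimizer.zero_grad()
t_loss.backward()
optimizer.step()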


Answer

The gradient is a linear operator, so the gradient of the average is the same as the average of the gradients.
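Formally, for losses L_1, …, L_N and parameters θ:

∇_θ [ (1/N) · Σ_i L_i ] = (1/N) · Σ_i ∇_θ L_i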

Take some example data

import torch
a = torch.randn(1, 4, requires_grad=True)  # leaf tensor standing in for the parameters
b = torch.randn(5, 4)                      # five rows of data

You could store all the losses and compute the mean, as you are doing:

a.grad = None                 # reset any accumulated gradient
x = (a * b).mean(axis=1)      # one scalar loss per row of b
x.mean().backward()           # gradient of the mean of the losses
print(a.grad)

Or, at every iteration, run the backward pass to accumulate the contribution of that loss to the gradient:

a.grad = None
for bi in b:
    # dividing by len(b) makes the accumulated gradient match the mean above
    (a * bi / len(b)).mean().backward()
print(a.grad)
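To see that the two approaches agree, you can compare the two accumulated gradients directly (a quick sanity check, reusing a and b from above):

a.grad = None
(a * b).mean(axis=1).mean().backward()
grad_mean = a.grad.clone()

a.grad = None
for bi in b:
    (a * bi / len(b)).mean().backward()
print(torch.allclose(grad_mean, a.grad))  # True, up to floating point tolerance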

Performance

I don’t know the internal details of the PyTorch backward implementation, but I can tell that:

(1) the graph is destroyed by default after the backward pass, unless you pass retain_graph=True or create_graph=True to backward() (see the sketch after the verification code below);

(2) the gradient is not kept except for leaf tensors, unless you call retain_grad() on the tensor (also demonstrated in the sketch below);

(3) if you evaluate a model twice using different inputs, you can run the backward pass on each result individually, which means that they have separate graphs. This can be verified with the following code:

a.grad = None
# compute all the losses in advance
r = [(a * bi / len(b)).mean() for bi in b]
for ri in r:
    # This backward pass depends on the graph of r[i], but the graph of
    # r[i-1] was already destroyed; since it still works, each r[i] graph
    # is independent of the others, hence they require separate memory.
    ri.backward()  # this destroys the graph of ri
print(a.grad)
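Points (1) and (2) can be checked with a similar sketch, reusing the same a and b:

# (1) a second backward pass on the same graph fails without retain_graph=True
a.grad = None
y = (a * b).mean()
y.backward()        # fine; the graph is freed afterwards
try:
    y.backward()    # RuntimeError: the graph was already destroyed
except RuntimeError as e:
    print("second backward failed:", e)

# (2) gradients of non-leaf tensors are discarded unless retain_grad() is called
a.grad = None
z = a * b           # non-leaf tensor: the result of an operation
z.retain_grad()     # without this line, z.grad would stay None
z.mean().backward()
print(a.grad is not None)  # True: a is a leaf with requires_grad=True
print(z.grad is not None)  # True only because of retain_grad()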

So if you update the gradient after each episode, PyTorch will accumulate the gradients at the leaf nodes; that is all the information you need for the next optimization step, so you can discard each loss as soon as it is backpropagated, freeing up resources for further computations. I would expect a reduction in memory usage, and potentially even faster execution if the allocator can efficiently reuse the just-deallocated pages for the next allocation.
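Applied to the question’s REINFORCE setup, that suggests calling backward() once per episode and stepping once per pseudo batch (a sketch; run_episode, policy, and batch_size are placeholders as before):

optimizer.zero_grad()
for episode in range(batch_size):
    loss = run_episode(policy)
    (loss / batch_size).backward()  # scale so the accumulated gradients average out
    # each episode's graph is freed here, so memory does not grow with the batch
optimizer.step()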
