If I have a model:
import torch
import torch.nn as nn
import torch.optim as optim
class net_x(nn.Module):
    def __init__(self):
        super(net_x, self).__init__()
        self.fc1 = nn.Linear(2, 20)
        self.fc2 = nn.Linear(20, 20)
        self.out = nn.Linear(20, 4)

    def forward(self, x):
        x = self.fc1(x)
        x = self.fc2(x)
        x = self.out(x)
        return x

nx = net_x()
And then I’m defining my inputs, optimizer (with lr=0.1), scheduler (with base_lr=1e-3), and training:
r = torch.tensor([1.0,2.0])
optimizer = optim.Adam(nx.parameters(), lr = 0.1)
scheduler = torch.optim.lr_scheduler.CyclicLR(optimizer, base_lr=1e-3, max_lr=0.1, step_size_up=1, mode="triangular2", cycle_momentum=False)
path = 'opt.pt'
for epoch in range(10):
    optimizer.zero_grad()
    net_predictions = nx(r)
    loss = torch.sum(torch.randint(0, 10, (4,)) - net_predictions)
    loss.backward()
    optimizer.step()
    scheduler.step()
    print('loss:', loss)

# save state dict
torch.save({'epoch': epoch,
            'net_x_state_dict': nx.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'scheduler': scheduler.state_dict(),
            }, path)
#loading state dict
checkpoint = torch.load(path)
nx.load_state_dict(checkpoint['net_x_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
scheduler.load_state_dict(checkpoint['scheduler'])
The optimizer seems to take on the learning rate of the scheduler:
for g in optimizer.param_groups:
    print(g)
>>>
{'lr': 0.001, 'betas': (0.9, 0.999), 'eps': 1e-08, 'weight_decay': 0, 'amsgrad': False, 'initial_lr': 0.001, 'params': [Parameter containing:
Does the learning rate scheduler overwrite the optimizer? How does it connect to it? I'm trying to understand the relation between them (i.e., how they interact).
Answer
TL;DR: The LR scheduler holds the optimizer as a member and explicitly alters the learning rates of its parameter groups.
As mentioned in the official PyTorch documentation, the learning rate scheduler receives the optimizer as an argument in its constructor, and thus has access to its parameter groups.
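One way to see this connection directly is to check that the scheduler stores the very optimizer object it was given (it keeps it as the optimizer attribute in current PyTorch). A minimal sketch; StepLR and the toy model are used only as an illustration, and any scheduler behaves the same way:
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(2, 4)                     # toy model, illustrative only
optimizer = optim.Adam(model.parameters(), lr=0.1)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.5)

print(scheduler.optimizer is optimizer)     # True -- the scheduler keeps a reference to the same optimizer
print(optimizer.param_groups[0]['lr'])      # 0.1 -- the learning rate itself still lives in the optimizer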
The common use is to update the LR after every epoch:
scheduler = ...  # initialize some LR scheduler
for epoch in range(100):
    train()      # here optimizer.step() is called numerous times
    validate()
    scheduler.step()
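For concreteness, here is a runnable sketch of that pattern; the model, random data, and ExponentialLR settings are made up purely for illustration. Note that since PyTorch 1.1 the scheduler should be stepped after the optimizer, as shown:
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(2, 4)
optimizer = optim.Adam(model.parameters(), lr=0.1)
scheduler = optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

x = torch.randn(8, 2)            # dummy inputs and targets, illustrative only
y = torch.randn(8, 4)

for epoch in range(5):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()             # update the parameters first...
    scheduler.step()             # ...then let the scheduler set the LR for the next epoch
    print(epoch, scheduler.get_last_lr(), optimizer.param_groups[0]['lr'])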
All optimizers inherit from a common parent class torch.optim.Optimizer and are updated using the step method implemented for each of them.
Similarly, all LR schedulers (besides ReduceLROnPlateau) inherit from a common parent class named _LRScheduler. Inspecting its source code reveals that in its step method the class indeed changes the LR of the optimizer's parameter groups:
for i, data in enumerate(zip(self.optimizer.param_groups, values)):
    param_group, lr = data
    param_group['lr'] = lr
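This is exactly why the optimizer's param group in the question shows lr=0.001 after training and reloading: each scheduler.step() writes the next cyclic LR value straight into optimizer.param_groups, and that value is part of the optimizer state dict that gets saved and restored. A minimal sketch reusing the question's CyclicLR settings on a throwaway model:
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(2, 4)                      # throwaway model, illustrative only
optimizer = optim.Adam(model.parameters(), lr=0.1)
scheduler = optim.lr_scheduler.CyclicLR(optimizer, base_lr=1e-3, max_lr=0.1,
                                        step_size_up=1, mode="triangular2",
                                        cycle_momentum=False)

for step in range(4):
    optimizer.zero_grad()
    model(torch.ones(2)).sum().backward()    # dummy forward/backward so optimizer.step() has gradients
    optimizer.step()
    scheduler.step()                         # overwrites 'lr' in optimizer.param_groups in place
    print(step, optimizer.param_groups[0]['lr'])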