I am really confused about the shuffle order of DataLoader in PyTorch. Suppose I have a dataset:
datasets = [0,1,2,3,4]
In scenario I, the code is:
torch.manual_seed(1)
G = torch.Generator()
G.manual_seed(1)
ran_sampler = RandomSampler(data_source=datasets, generator=G)
dataloader = DataLoader(dataset=datasets, sampler=ran_sampler)
the shuffling result is 0,4,2,3,1.
In scenario II, the code is:
torch.manual_seed(1)
G = torch.Generator()
G.manual_seed(1)
ran_sampler = RandomSampler(data_source=datasets)
dataloader = DataLoader(dataset=datasets, sampler=ran_sampler, generator=G)
the shuffling result is 1,3,4,0,2.
In scenario III, the code is:
torch.manual_seed(1)
G = torch.Generator()
G.manual_seed(1)
ran_sampler = RandomSampler(data_source=datasets, generator=G)
dataloader = DataLoader(dataset=datasets, sampler=ran_sampler, generator=G)
the shuffling result is 4,1,3,0,2.
Can someone explain what is going on here?
Answer
Based on your code, I made a small modification (to scenario II) and added some inspection:
import torch
from torch.utils.data import DataLoader, RandomSampler

datasets = [0, 1, 2, 3, 4]

# Scenario I: the generator is passed to the sampler only.
torch.manual_seed(1)
G = torch.Generator()
G.manual_seed(1)
ran_sampler = RandomSampler(data_source=datasets, generator=G)
dataloader = DataLoader(dataset=datasets, sampler=ran_sampler)
print(id(dataloader.generator) == id(dataloader.sampler.generator))
xs = []
for x in dataloader:
    xs.append(x.item())
print(xs)
# Modified scenario II: let shuffle=True build the sampler. This differs
# from the OP's scenario II, where the RandomSampler was not initialized
# with the generator.
torch.manual_seed(1)
G = torch.Generator()
G.manual_seed(1)
dataloader = DataLoader(dataset=datasets, shuffle=True, generator=G)
print(id(dataloader.generator) == id(dataloader.sampler.generator))
xs = []
for x in dataloader:
    xs.append(x.item())
print(xs)
# Scenario III: the same generator is passed to both the sampler and the DataLoader.
torch.manual_seed(1)
G = torch.Generator()
G.manual_seed(1)
ran_sampler = RandomSampler(data_source=datasets, generator=G)
dataloader = DataLoader(dataset=datasets, sampler=ran_sampler, generator=G)
print(id(dataloader.generator) == id(dataloader.sampler.generator))
xs = []
for x in dataloader:
    xs.append(x.item())
print(xs)
The outputs are:
False
[0, 4, 2, 3, 1]
True
[4, 1, 3, 0, 2]
True
[4, 1, 3, 0, 2]
The reason the three seemingly equivalent setups lead to different outcomes is that two distinct generator slots are actually used inside the DataLoader, and in the first scenario one of them (the loader's own self.generator) is None while the sampler's is G.
To make this clear, let's look at the source. The generator not only drives the random number generation of the _index_sampler inside DataLoader but also affects the initialization of _BaseDataLoaderIter. See the source code:
if sampler is None:  # give default samplers
    if self._dataset_kind == _DatasetKind.Iterable:
        # See NOTE [ Custom Samplers and IterableDataset ]
        sampler = _InfiniteConstantSampler()
    else:  # map-style
        if shuffle:
            sampler = RandomSampler(dataset, generator=generator)  # type: ignore[arg-type]
        else:
            sampler = SequentialSampler(dataset)  # type: ignore[arg-type]
and
self.sampler = sampler
self.batch_sampler = batch_sampler
self.generator = generator
and
def _get_iterator(self) -> '_BaseDataLoaderIter':
    if self.num_workers == 0:
        return _SingleProcessDataLoaderIter(self)
    else:
        self.check_worker_number_rationality()
        return _MultiProcessingDataLoaderIter(self)
and
class _BaseDataLoaderIter(object):
    def __init__(self, loader: DataLoader) -> None:
        ...
        self._index_sampler = loader._index_sampler
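As a side note, RandomSampler without replacement (the default) draws its permutation from torch.randperm(len(data_source), generator=generator), so the sampler's order can be reproduced directly from an identically seeded generator. A quick sketch, assuming the default replacement=False:

```python
import torch
from torch.utils.data import RandomSampler

datasets = [0, 1, 2, 3, 4]

# A freshly seeded generator driving the sampler...
G = torch.Generator()
G.manual_seed(1)
sampler_order = list(RandomSampler(data_source=datasets, generator=G))

# ...and an identically seeded generator fed straight to randperm
# should walk through the same RNG states.
G2 = torch.Generator()
G2.manual_seed(1)
randperm_order = torch.randperm(len(datasets), generator=G2).tolist()

print(sampler_order == randperm_order)
```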
- Scenarios II & III

Both setups (the modified scenario II and scenario III) are equivalent in effect: a generator reaches both slots. With shuffle=True and no sampler, DataLoader automatically creates a RandomSampler with the given generator and assigns that same generator to self.generator.
- Scenario I

We pass a sampler carrying the correctly seeded generator, but we never pass the generator keyword argument to DataLoader.__init__(...). DataLoader keeps the given sampler, yet self.generator stays at its default None, and that None is what the _BaseDataLoaderIter object returned by self._get_iterator() sees.
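This also explains why scenarios I and III produce different orders even though the sampler holds the same seeded generator in both: when DataLoader itself receives the generator, _BaseDataLoaderIter draws a base seed from it at iterator construction, consuming one value from G before the sampler runs. A minimal sketch of that mechanism, assuming the single base-seed draw is the only extra state consumed from the loader's generator in the single-worker case:

```python
import torch
from torch.utils.data import DataLoader, RandomSampler

datasets = [0, 1, 2, 3, 4]

# Simulate scenario III by hand: one base-seed draw from G advances its
# state, then the sampler's permutation comes from the advanced generator.
G = torch.Generator()
G.manual_seed(1)
_ = torch.empty((), dtype=torch.int64).random_(generator=G)  # base-seed draw
by_hand = torch.randperm(len(datasets), generator=G).tolist()

# The real scenario III for comparison.
G2 = torch.Generator()
G2.manual_seed(1)
sampler = RandomSampler(data_source=datasets, generator=G2)
loader = DataLoader(dataset=datasets, sampler=sampler, generator=G2)
real = [x.item() for x in loader]

print(by_hand == real)
```

In scenario I no such draw touches G (the loader's generator is None, so the base seed comes from the global RNG), which is why its order matches a fresh seed-1 permutation instead.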