I am really confused about the shuffle order of `DataLoader` in PyTorch. Suppose I have a dataset:

```python
datasets = [0, 1, 2, 3, 4]
```
In scenario I, the code is:
```python
import torch
from torch.utils.data import DataLoader, RandomSampler

torch.manual_seed(1)
G = torch.Generator()
G.manual_seed(1)
ran_sampler = RandomSampler(data_source=datasets, generator=G)
dataloader = DataLoader(dataset=datasets, sampler=ran_sampler)
```
and the shuffling result is 0, 4, 2, 3, 1.
In scenario II, the code is:
```python
torch.manual_seed(1)
G = torch.Generator()
G.manual_seed(1)
ran_sampler = RandomSampler(data_source=datasets)
dataloader = DataLoader(dataset=datasets, sampler=ran_sampler, generator=G)
```
and the shuffling result is 1, 3, 4, 0, 2.
In scenario III, the code is:
```python
torch.manual_seed(1)
G = torch.Generator()
G.manual_seed(1)
ran_sampler = RandomSampler(data_source=datasets, generator=G)
dataloader = DataLoader(dataset=datasets, sampler=ran_sampler, generator=G)
```
and the shuffling result is 4, 1, 3, 0, 2.
Can someone explain what is going on here?
Answer
Based on your code, I made a small modification (to scenario II) and inspected the generators:
```python
import torch
from torch.utils.data import DataLoader, RandomSampler

datasets = [0, 1, 2, 3, 4]

torch.manual_seed(1)
G = torch.Generator()
G = G.manual_seed(1)
ran_sampler = RandomSampler(data_source=datasets, generator=G)
dataloader = DataLoader(dataset=datasets, sampler=ran_sampler)
print(id(dataloader.generator) == id(dataloader.sampler.generator))
xs = []
for x in dataloader:
    xs.append(x.item())
print(xs)

torch.manual_seed(1)
G = torch.Generator()
G.manual_seed(1)
# This is different from the OP's scenario II because there ran_sampler
# is not initialized with the right generator.
dataloader = DataLoader(dataset=datasets, shuffle=True, generator=G)
print(id(dataloader.generator) == id(dataloader.sampler.generator))
xs = []
for x in dataloader:
    xs.append(x.item())
print(xs)

torch.manual_seed(1)
G = torch.Generator()
G.manual_seed(1)
ran_sampler = RandomSampler(data_source=datasets, generator=G)
dataloader = DataLoader(dataset=datasets, sampler=ran_sampler, generator=G)
print(id(dataloader.generator) == id(dataloader.sampler.generator))
xs = []
for x in dataloader:
    xs.append(x.item())
print(xs)
```
The outputs are:
```
False
[0, 4, 2, 3, 1]
True
[4, 1, 3, 0, 2]
True
[4, 1, 3, 0, 2]
```
The reason the three seemingly equivalent setups lead to different outcomes is that two different generators are actually used inside the `DataLoader`, and in the first scenario one of them is `None`.
To make this clear, let's look at the source. The `generator` not only drives the random number generation of the `_index_sampler` inside `DataLoader` but also affects the initialization of `_BaseDataLoaderIter`. See the source code:
```python
if sampler is None:  # give default samplers
    if self._dataset_kind == _DatasetKind.Iterable:
        # See NOTE [ Custom Samplers and IterableDataset ]
        sampler = _InfiniteConstantSampler()
    else:  # map-style
        if shuffle:
            sampler = RandomSampler(dataset, generator=generator)  # type: ignore[arg-type]
        else:
            sampler = SequentialSampler(dataset)  # type: ignore[arg-type]
```
and
```python
self.sampler = sampler
self.batch_sampler = batch_sampler
self.generator = generator
```
and
```python
def _get_iterator(self) -> '_BaseDataLoaderIter':
    if self.num_workers == 0:
        return _SingleProcessDataLoaderIter(self)
    else:
        self.check_worker_number_rationality()
        return _MultiProcessingDataLoaderIter(self)
```
and
```python
class _BaseDataLoaderIter(object):
    def __init__(self, loader: DataLoader) -> None:
        ...
        self._index_sampler = loader._index_sampler
```
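For completeness, `RandomSampler` itself has a related fallback. Paraphrasing its `__iter__` in recent PyTorch versions (treat this as a hedged sketch, not the exact code path): when the sampler's own generator is `None`, it seeds a fresh `Generator` from the global RNG, which is why the *unmodified* scenario II is still deterministic under `torch.manual_seed(1)`, just with a different permutation:

```python
import torch

# Hedged sketch of RandomSampler's fallback (recent PyTorch versions):
# with no generator of its own, the sampler seeds a fresh Generator
# from the *global* RNG, which torch.manual_seed(1) controls.
torch.manual_seed(1)
seed = int(torch.empty((), dtype=torch.int64).random_().item())
generator = torch.Generator()
generator.manual_seed(seed)

# The permutation such a sampler would yield over a 5-element dataset.
perm = torch.randperm(5, generator=generator).tolist()
print(perm)
```

Re-running this snippet always prints the same permutation, because the only source of randomness is the globally seeded RNG.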
- Scenario II & Scenario III

Both setups are equivalent: `dataloader.generator` and `dataloader.sampler.generator` end up being the same object `G`. In (the modified) scenario II we pass a generator to `DataLoader` and do not specify a sampler, so `DataLoader` automatically creates a `RandomSampler` with that generator and assigns the same generator to `self.generator`; in scenario III we pass both the sampler built with `G` and `generator=G` explicitly.
- Scenario I

We pass a sampler to `DataLoader` with the right generator but do not explicitly specify the keyword argument `generator` in `DataLoader.__init__(...)`. `DataLoader` uses the given sampler as-is but falls back to the default `generator=None` for `self.generator` and for the `_BaseDataLoaderIter` object returned by `self._get_iterator()`.
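To see concretely why scenarios I and III diverge, here is a hedged sketch based on the quoted internals (not the actual `DataLoader` code path): `_BaseDataLoaderIter.__init__` also draws a base seed from `loader.generator`, so in scenario III one random number is consumed from `G` before the sampler generates its permutation, while in scenario I `G` is untouched until the sampler runs.

```python
import torch

# Scenario I: G feeds only the sampler, so the permutation comes
# straight from a freshly seeded G.
G = torch.Generator()
G.manual_seed(1)
perm_I = torch.randperm(5, generator=G).tolist()

# Scenario III: loader.generator is also G, so the iterator first draws
# its base seed from G (one int64), shifting the sampler's permutation.
G = torch.Generator()
G.manual_seed(1)
_ = torch.empty((), dtype=torch.int64).random_(generator=G)  # base-seed draw
perm_III = torch.randperm(5, generator=G).tolist()

print(perm_I)
print(perm_III)
```

If this sketch is right, `perm_I` should reproduce scenario I's shuffle and `perm_III` scenario III's: the two permutations differ solely because of that one extra draw from `G`.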