I am really confused about the shuffle order of DataLoader in PyTorch. Suppose I have a dataset:
datasets = [0,1,2,3,4]
In scenario I, the code is:
torch.manual_seed(1)
G = torch.Generator()
G.manual_seed(1)
ran_sampler = RandomSampler(data_source=datasets, generator=G)
dataloader = DataLoader(dataset=datasets, sampler=ran_sampler)
the shuffling result is 0,4,2,3,1.
In scenario II, the code is:
torch.manual_seed(1)
G = torch.Generator()
G.manual_seed(1)
ran_sampler = RandomSampler(data_source=datasets)
dataloader = DataLoader(dataset=datasets, sampler=ran_sampler, generator=G)
the shuffling result is 1,3,4,0,2.
In scenario III, the code is:
torch.manual_seed(1)
G = torch.Generator()
G.manual_seed(1)
ran_sampler = RandomSampler(data_source=datasets, generator=G)
dataloader = DataLoader(dataset=datasets, sampler=ran_sampler, generator=G)
the shuffling result is 4,1,3,0,2.
Can someone explain what is going on here?
Answer
Based on your code, I made a small modification (to scenario II) and added some inspection:
import torch
from torch.utils.data import DataLoader, RandomSampler

datasets = [0, 1, 2, 3, 4]

# Scenario I: the generator is passed to the sampler only.
torch.manual_seed(1)
G = torch.Generator()
G.manual_seed(1)
ran_sampler = RandomSampler(data_source=datasets, generator=G)
dataloader = DataLoader(dataset=datasets, sampler=ran_sampler)
print(id(dataloader.generator) == id(dataloader.sampler.generator))
xs = []
for x in dataloader:
    xs.append(x.item())
print(xs)
# Modified scenario II: let shuffle=True build the sampler. This differs
# from the OP's scenario II, where the RandomSampler was not initialized
# with the generator.
torch.manual_seed(1)
G = torch.Generator()
G.manual_seed(1)
dataloader = DataLoader(dataset=datasets, shuffle=True, generator=G)
print(id(dataloader.generator) == id(dataloader.sampler.generator))
xs = []
for x in dataloader:
    xs.append(x.item())
print(xs)
# Scenario III: the same generator is passed to both the sampler and the DataLoader.
torch.manual_seed(1)
G = torch.Generator()
G.manual_seed(1)
ran_sampler = RandomSampler(data_source=datasets, generator=G)
dataloader = DataLoader(dataset=datasets, sampler=ran_sampler, generator=G)
print(id(dataloader.generator) == id(dataloader.sampler.generator))
xs = []
for x in dataloader:
    xs.append(x.item())
print(xs)
The outputs are:
False
[0, 4, 2, 3, 1]
True
[4, 1, 3, 0, 2]
True
[4, 1, 3, 0, 2]
The reason the three seemingly equivalent setups lead to different outcomes is that two distinct generator slots are actually used inside the DataLoader, and in the first scenario one of them (the loader's own self.generator) is None while the sampler's is G.
To make this clear, let's look at the source. The generator not only drives the random number generation of the _index_sampler inside DataLoader but also affects the initialization of _BaseDataLoaderIter. See the source code:
if sampler is None:  # give default samplers
    if self._dataset_kind == _DatasetKind.Iterable:
        # See NOTE [ Custom Samplers and IterableDataset ]
        sampler = _InfiniteConstantSampler()
    else:  # map-style
        if shuffle:
            sampler = RandomSampler(dataset, generator=generator)  # type: ignore[arg-type]
        else:
            sampler = SequentialSampler(dataset)  # type: ignore[arg-type]
and
self.sampler = sampler
self.batch_sampler = batch_sampler
self.generator = generator
and
def _get_iterator(self) -> '_BaseDataLoaderIter':
    if self.num_workers == 0:
        return _SingleProcessDataLoaderIter(self)
    else:
        self.check_worker_number_rationality()
        return _MultiProcessingDataLoaderIter(self)
and
class _BaseDataLoaderIter(object):
    def __init__(self, loader: DataLoader) -> None:
        ...
        self._index_sampler = loader._index_sampler
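As a side note, RandomSampler without replacement (the default) draws its permutation from torch.randperm(len(data_source), generator=generator), so the sampler's order can be reproduced directly from an identically seeded generator. A quick sketch, assuming the default replacement=False:

```python
import torch
from torch.utils.data import RandomSampler

datasets = [0, 1, 2, 3, 4]

# A freshly seeded generator driving the sampler...
G = torch.Generator()
G.manual_seed(1)
sampler_order = list(RandomSampler(data_source=datasets, generator=G))

# ...and an identically seeded generator fed straight to randperm
# should walk through the same RNG states.
G2 = torch.Generator()
G2.manual_seed(1)
randperm_order = torch.randperm(len(datasets), generator=G2).tolist()

print(sampler_order == randperm_order)
```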
- Scenarios II & III

Both setups (the modified scenario II and scenario III) are equivalent in effect: a generator reaches both slots. With shuffle=True and no sampler, DataLoader automatically creates a RandomSampler with the given generator and assigns that same generator to self.generator.
- Scenario I

We pass a sampler carrying the correctly seeded generator, but we never pass the generator keyword argument to DataLoader.__init__(...). DataLoader keeps the given sampler, yet self.generator stays at its default None, and that None is what the _BaseDataLoaderIter object returned by self._get_iterator() sees.
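This also explains why scenarios I and III produce different orders even though the sampler holds the same seeded generator in both: when DataLoader itself receives the generator, _BaseDataLoaderIter draws a base seed from it at iterator construction, consuming one value from G before the sampler runs. A minimal sketch of that mechanism, assuming the single base-seed draw is the only extra state consumed from the loader's generator in the single-worker case:

```python
import torch
from torch.utils.data import DataLoader, RandomSampler

datasets = [0, 1, 2, 3, 4]

# Simulate scenario III by hand: one base-seed draw from G advances its
# state, then the sampler's permutation comes from the advanced generator.
G = torch.Generator()
G.manual_seed(1)
_ = torch.empty((), dtype=torch.int64).random_(generator=G)  # base-seed draw
by_hand = torch.randperm(len(datasets), generator=G).tolist()

# The real scenario III for comparison.
G2 = torch.Generator()
G2.manual_seed(1)
sampler = RandomSampler(data_source=datasets, generator=G2)
loader = DataLoader(dataset=datasets, sampler=sampler, generator=G2)
real = [x.item() for x in loader]

print(by_hand == real)
```

In scenario I no such draw touches G (the loader's generator is None, so the base seed comes from the global RNG), which is why its order matches a fresh seed-1 permutation instead.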