Should I separate my data into different batches and then perform tsne on each batch?

I have a very large dataset and need to reduce embeddings from 768 dimensions to 128 dimensions with t-SNE. Since I have more than 1 million rows, it takes weeks to complete the dimensionality reduction on the whole dataset, so I thought maybe I could split the dataset into different parts and then run the reduction on each part separately. I do not have a GPU, only a CPU.

from sklearn.manifold import TSNE
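# note: method='exact' is O(n^2) in the number of samples, which is why this is so slow;
# the faster 'barnes_hut' method only supports n_components < 4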
tsne = TSNE(n_components=128, init='pca', random_state=1001, perplexity=30, method='exact', n_iter=250, verbose=1)
X_tsne = tsne.fit_transform(df_dataset[:1000000]) # this will either fail or take a while (most likely overnight)
I am wondering whether my approach is considered OK?

The code above does not split anything yet; it just loads the whole dataset. I just want to confirm whether splitting into multiple batches and then running fit_transform on each batch is the right way to do this.

Also, I checked the link below about whitening sentence representations, but I am not sure whether it would work with my approach above, i.e. replacing t-SNE with whitening: https://deep-ch.medium.com/dimension-reduction-by-whitening-bert-roberta-5e103093f782
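For reference, a minimal sketch of the whitening-and-truncation transform that post describes, assuming the embeddings are available as a NumPy array; the function name, the array name, and k=128 are just for illustration:

import numpy as np

def whitening_reduce(embeddings, k=128):
    # Centre the embeddings and estimate their covariance.
    mu = embeddings.mean(axis=0, keepdims=True)
    cov = np.cov(embeddings.T)
    # The SVD of the covariance gives the whitening matrix W = U * diag(1/sqrt(S));
    # keeping only the first k columns of W reduces 768 dimensions down to k.
    u, s, _ = np.linalg.svd(cov)
    w = u @ np.diag(1.0 / np.sqrt(s))
    return (embeddings - mu) @ w[:, :k]

X_128 = whitening_reduce(df_dataset[:1000000].to_numpy(), k=128)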


Answer

It probably depends on what you’re trying to do, but I suspect the answer is that it is the wrong thing to do.

Between different batches it would be difficult to guarantee that the reduced-dimension representations were comparable, since each batch would have been optimised independently, not on the same data. So you could end up with points that look similar in the low-dimensional representation when they aren’t similar in the original representation.

It seems like PCA might be better suited to your problem, since it’s very fast. Or UMAP, which is also fast but additionally has ways to work with batched data (for example, a fitted model can transform new batches into the same embedding space).
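A rough sketch of what the PCA route could look like with scikit-learn; the array name X, the batch size, and the use of IncrementalPCA for chunked fitting are my assumptions, not part of the original answer:

from sklearn.decomposition import PCA, IncrementalPCA

# One shot: PCA on the full 1M x 768 matrix is usually fast even on a CPU.
X_128 = PCA(n_components=128).fit_transform(X)

# If the data has to be processed in chunks, IncrementalPCA learns one shared
# projection, so every batch is mapped into the same 128-dimensional space.
# Each chunk passed to partial_fit must contain at least n_components rows.
ipca = IncrementalPCA(n_components=128, batch_size=10000)
for start in range(0, len(X), 10000):
    ipca.partial_fit(X[start:start + 10000])
X_128 = ipca.transform(X)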

User contributions licensed under: CC BY-SA