Word2Vec + LSTM Good Training and Validation but Poor on Test

Question

Currently I'm training my Word2Vec + LSTM model for Twitter sentiment analysis. I use the pre-trained GoogleNewsVectorNegative300 word embedding. I chose the pre-trained GoogleNewsVectorNegative300 because performance was much worse when I trained my own Word2Vec on my own dataset. The problem is w…

Accepted Answer

Without reviewing everything, a few high-order things that may be limiting your results:

The GoogleNews vectors were trained on media-outlet news stories from 2012 and earlier. Tweets in 2020+ use a very different style of language. I wouldn't necessarily expect those pretrained vectors, from a different era and domain of writing, to be very good at modeling the words you'll need. A well-trained word2vec model (using plenty of modern tweet data, with good preprocessing/tokenization and parameterization choices) has a good chance of working better, so you may want to revisit that choice.

The GoogleNews training-text preprocessing, while as far as I can tell never fully documented, did not appear to flatten all casing, remove stopwords, or involve lemmatization. It didn't mutate obvious negations into antonyms, but it did perform a statistical combination of some single words into multigram tokens. So some of your steps are potentially causing your tokens to have less concordance with that set's vectors, and may even be throwing away info, like inflectional variations of words, that could be beneficially retained. Be sure every step you're taking is worth the trouble, and note that a sufficient modern word2vec model, trained on tweets using the same preprocessing for word2vec training and for the later steps, would match vocabularies perfectly.

Both the word2vec model and any deeper neural network often need lots of data to train well and avoid overfitting. Even disregarding the 900 million parameters from GoogleNews, you're trying to train ~130k parameters (at least 520KB of state) from an initial set of merely 2060 tweet-sized texts (maybe 100KB of data). Models that learn generalizable things tend to be compressions of the data, in some sense, and a model that's much larger than the training data brings risk of severe overfitting. (Your mechanistic process for replacing words with synonyms may not really be giving the model any info that the word-vector similarity between synonyms didn't already provide.)

So: consider shrinking your model and getting much more training data, potentially even from domains other than your main classification interest, as long as the use of language is similar.
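
To see how much a preprocessing pipeline hurts concordance with a pretrained vocabulary (the second point above), one option is to measure what fraction of tokens actually have a vector. This is a minimal sketch, not the asker's code: `coverage` and the toy vocabulary are hypothetical, and `known_keys` stands in for the key set of whatever vectors are loaded (with gensim 4.x `KeyedVectors`, the `key_to_index` mapping could be passed instead).

```python
from collections import Counter

def coverage(tokenized_texts, known_keys):
    """Return (fraction of token occurrences that have a vector, Counter of misses)."""
    total = 0
    missing = Counter()
    for tokens in tokenized_texts:
        for tok in tokens:
            total += 1
            if tok not in known_keys:
                missing[tok] += 1
    hit_rate = 1.0 if total == 0 else (total - sum(missing.values())) / total
    return hit_rate, missing

# Toy stand-in for a case-sensitive pretrained vocabulary (hypothetical data).
toy_vocab = {"I", "love", "loved", "New_York", "not", "good"}

# Lightly tokenized texts that keep casing, inflection, and phrase tokens.
raw = [["I", "loved", "New_York"], ["not", "good"]]
# The same texts after lowercasing, stemming, and splitting the phrase.
processed = [["i", "love", "new", "york"], ["not", "good"]]

raw_rate, raw_missing = coverage(raw, toy_vocab)
proc_rate, proc_missing = coverage(processed, toy_vocab)

print(f"raw coverage: {raw_rate:.2f}, processed coverage: {proc_rate:.2f}")
print("most common misses after preprocessing:", proc_missing.most_common(3))
```

Running the same check on real tweets before and after each preprocessing step would show which steps cost the most vocabulary matches.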
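
The size comparison in the third point can be reproduced with simple arithmetic. The sketch below assumes 32-bit floats (4 bytes per parameter) and a GoogleNews table of 3 million vectors with 300 dimensions each; the ~130k trainable parameters and ~100KB of text are the answer's own estimates. `lstm_param_count` is a hypothetical helper using the Keras-style count (one bias vector per gate), and the 64-unit layer is an illustration only, not the asker's architecture.

```python
BYTES_PER_FLOAT32 = 4

def lstm_param_count(input_dim, units):
    """Keras-style LSTM layer: 4 gates, each with input weights, recurrent weights, one bias."""
    return 4 * (units * (input_dim + units) + units)

# Pretrained embedding table: assumed 3,000,000 words x 300 dimensions.
embedding_params = 3_000_000 * 300

# Figures quoted in the answer.
trainable_params = 130_000
training_text_bytes = 100_000  # "maybe 100KB" for 2060 tweets

model_bytes = trainable_params * BYTES_PER_FLOAT32
ratio = model_bytes / training_text_bytes

print(f"embedding parameters: {embedding_params:,}")
print(f"trainable state: {model_bytes / 1000:.0f} KB")
print(f"model state is about {ratio:.1f}x the size of the training text")

# Hypothetical smaller layer on 300-d inputs, to see how unit count drives size.
print(f"LSTM(64) on 300-d input: {lstm_param_count(300, 64):,} parameters")
```

Because the recurrent weights grow with the square of the unit count, reducing the number of LSTM units is usually the quickest way to shrink the trainable state.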

Answer