How to train Naive Bayes Classifier for n-gram (movie_reviews)

Below is the code for training a Naive Bayes classifier on the movie_reviews dataset with a unigram model. I want to train it and analyze its performance with bigram and trigram models as well. How can I do that?

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def create_word_features(words):
    useful_words = [word for word in words if word not in stopwords.words("english")] 
    my_dict = dict([(word, True) for word in useful_words])
    return my_dict

pos_data = []
for fileid in movie_reviews.fileids('pos'):
    words = movie_reviews.words(fileid)
    pos_data.append((create_word_features(words), "positive"))    

neg_data = []
for fileid in movie_reviews.fileids('neg'):
    words = movie_reviews.words(fileid)
    neg_data.append((create_word_features(words), "negative")) 

train_set = pos_data[:800] + neg_data[:800]
test_set =  pos_data[800:] + neg_data[800:]

classifier = NaiveBayesClassifier.train(train_set)

accuracy = nltk.classify.util.accuracy(classifier, test_set)


Answer

Simply change your featurizer:

from nltk import ngrams

def create_ngram_features(words, n=2):
    ngram_vocab = ngrams(words, n)
    my_dict = dict([(ng, True) for ng in ngram_vocab])
    return my_dict
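
For example, on a toy token list the bigram featurizer keys the feature dict on word pairs instead of single words:

>>> create_ngram_features('this is a test sentence'.split(), 2)
{('this', 'is'): True, ('is', 'a'): True, ('a', 'test'): True, ('test', 'sentence'): True}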

BTW, your code will be a lot faster if you change your featurizer to use a set for your stopword list and initialize it only once:

stoplist = set(stopwords.words("english"))

def create_word_features(words):
    useful_words = [word for word in words if word not in stoplist] 
    my_dict = dict([(word, True) for word in useful_words])
    return my_dict
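
If you want to measure the speedup yourself, here is a rough sketch with timeit (timings depend on your machine, so no numbers are shown):

import timeit
from nltk.corpus import movie_reviews, stopwords

# tokens of a single review to filter
words = list(movie_reviews.words(movie_reviews.fileids('pos')[0]))

# original approach: reloads the stopword list and does list-membership checks on every call
print(timeit.timeit(lambda: [w for w in words if w not in stopwords.words('english')], number=10))

# one-time set: constant-time membership checks
stoplist = set(stopwords.words('english'))
print(timeit.timeit(lambda: [w for w in words if w not in stoplist], number=10))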

Someone should really tell the NLTK people to convert the stopwords list into a set type since it’s “technically” a unique list (i.e. a set).

>>> from nltk.corpus import stopwords
>>> type(stopwords.words('english'))
<class 'list'>
>>> type(set(stopwords.words('english')))
<class 'set'>

For the fun of benchmarking:

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk import ngrams

def create_ngram_features(words, n=2):
    ngram_vocab = ngrams(words, n)
    my_dict = dict([(ng, True) for ng in ngram_vocab])
    return my_dict

for n in [1,2,3,4,5]:
    pos_data = []
    for fileid in movie_reviews.fileids('pos'):
        words = movie_reviews.words(fileid)
        pos_data.append((create_ngram_features(words, n), "positive"))    

    neg_data = []
    for fileid in movie_reviews.fileids('neg'):
        words = movie_reviews.words(fileid)
        neg_data.append((create_ngram_features(words, n), "negative")) 

    train_set = pos_data[:800] + neg_data[:800]
    test_set =  pos_data[800:] + neg_data[800:]

    classifier = NaiveBayesClassifier.train(train_set)

    accuracy = nltk.classify.util.accuracy(classifier, test_set)
    print(str(n)+'-gram accuracy:', accuracy)

[out]:

1-gram accuracy: 0.735
2-gram accuracy: 0.7625
3-gram accuracy: 0.8275
4-gram accuracy: 0.8125
5-gram accuracy: 0.74

Your original code returns an accuracy of 0.725.

Use more orders of ngrams
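
everygrams collects every n-gram order from 1 up to n in a single pass, so the feature dict mixes unigrams, bigrams, and so on (sorted below only to make the output order version-independent):

>>> from nltk import everygrams
>>> sorted(everygrams('this is a test'.split(), 1, 2))
[('a',), ('a', 'test'), ('is',), ('is', 'a'), ('test',), ('this',), ('this', 'is')]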

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk import everygrams

def create_ngram_features(words, n=2):
    ngram_vocab = everygrams(words, 1, n)
    my_dict = dict([(ng, True) for ng in ngram_vocab])
    return my_dict

for n in range(1,6):
    pos_data = []
    for fileid in movie_reviews.fileids('pos'):
        words = movie_reviews.words(fileid)
        pos_data.append((create_ngram_features(words, n), "positive"))    

    neg_data = []
    for fileid in movie_reviews.fileids('neg'):
        words = movie_reviews.words(fileid)
        neg_data.append((create_ngram_features(words, n), "negative")) 

    train_set = pos_data[:800] + neg_data[:800]
    test_set =  pos_data[800:] + neg_data[800:]
    classifier = NaiveBayesClassifier.train(train_set)

    accuracy = nltk.classify.util.accuracy(classifier, test_set)
    print('1-gram to', str(n)+'-gram accuracy:', accuracy)

[out]:

1-gram to 1-gram accuracy: 0.735
1-gram to 2-gram accuracy: 0.7625
1-gram to 3-gram accuracy: 0.7875
1-gram to 4-gram accuracy: 0.8
1-gram to 5-gram accuracy: 0.82