I am new to text analysis and am trying to create a bag-of-words model (using sklearn's CountVectorizer). I have a data frame with a column of text containing words like 'acid', 'acidic', 'acidity', 'wood', 'woodsy', and 'woody'.
I think 'acid' and 'wood' should be the only words in the final output; however, neither stemming nor lemmatizing accomplishes this.
Stemming produces 'acid', 'wood', 'woodi', 'woodsi',
and lemmatizing produces the even worse output 'acid', 'acidic', 'acidity', 'wood', 'woodsy', 'woody'. I assume this is because the part of speech is not specified accurately, although I am not sure where that specification should go. I have included it in the line X = vectorizer.fit_transform(df['text'],'a')
(I believe most of the words should be adjectives); however, it makes no difference in the output.
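For what it's worth, the second positional argument to fit_transform is the supervised target y, which CountVectorizer accepts only for pipeline compatibility and ignores, so passing 'a' there cannot change the vocabulary. A minimal check (using only sklearn):

```python
from sklearn.feature_extraction.text import CountVectorizer

words = ['acid', 'acidic', 'acidity', 'wood', 'woodsy', 'woody']

# fit_transform(raw_documents, y=None): y is accepted only for
# pipeline compatibility, so the extra 'a' argument is ignored.
X_plain = CountVectorizer().fit_transform(words)
X_with_a = CountVectorizer().fit_transform(words, 'a')

print((X_plain.toarray() == X_with_a.toarray()).all())  # True
```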
What can I do to improve the output?
My full code is below:
!pip install nltk
nltk.download('omw-1.4')
import nltk
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import WordNetLemmatizer
Data Frame:
df = pd.DataFrame()
df['text'] = ['acid', 'acidic', 'acidity', 'wood', 'woodsy', 'woody']
CountVectorizer with Stemmer:
analyzer = CountVectorizer().build_analyzer()
stemmer = nltk.stem.SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

def stemmed_words(doc):
    return (stemmer.stem(w) for w in analyzer(doc))

vectorizer = CountVectorizer(stop_words='english', analyzer=stemmed_words)
X = vectorizer.fit_transform(df['text'])
df_bow_sklearn = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
df_bow_sklearn.head()
CountVectorizer with Lemmatizer:
analyzer = CountVectorizer().build_analyzer()
stemmer = nltk.stem.SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

def lemed_words(doc):
    return (lemmatizer.lemmatize(w) for w in analyzer(doc))

vectorizer = CountVectorizer(stop_words='english', analyzer=lemed_words)
X = vectorizer.fit_transform(df['text'], 'a')
df_bow_sklearn = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
df_bow_sklearn.head()
Answer
This might simply be an under-performance issue with the WordNetLemmatizer and the stemmer.
Try different ones, for example:

Stemmers:

- Porter (from nltk.stem import PorterStemmer)
- Lancaster (from nltk.stem import LancasterStemmer)

Lemmatizers:

- spaCy (import spacy)
- IWNLP (from spacy_iwnlp import spaCyIWNLP)
- HanTa (from HanTa import HanoverTagger; note: trained mostly on German)
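Swapping the stemmer into the analyzer from the question is a one-line change; for example, with Porter (a sketch only — whether the resulting stems are actually better for these particular words is something to verify on your own data):

```python
from nltk.stem import PorterStemmer, LancasterStemmer
from sklearn.feature_extraction.text import CountVectorizer

analyzer = CountVectorizer().build_analyzer()
stemmer = PorterStemmer()  # or LancasterStemmer()

def stemmed_words(doc):
    # Same pattern as in the question, only with a different stemmer.
    return (stemmer.stem(w) for w in analyzer(doc))

vectorizer = CountVectorizer(analyzer=stemmed_words)
X = vectorizer.fit_transform(['acid', 'acidic', 'acidity',
                              'wood', 'woodsy', 'woody'])
print(sorted(vectorizer.vocabulary_))
```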
I had the same issue, and switching to a different stemmer and lemmatizer solved it. For closer instructions on how to properly implement the stemmers and lemmatizers, a quick web search turns up good examples for all of these.