Why does TfidfVectorizer.fit_transform() change the number of samples and labels for my text data?

Question

I have a data set that contains 3 columns and 310 rows. The columns are all text. One column is the text a user entered into an inquiry form, and the second column holds the labels (one of six) that say which inquiry category the input falls into. I am doing the following preprocessing to my data before I

Accepted Answer

The problem is that TfidfVectorizer() cannot be applied to three columns at once. According to the documentation:

    fit_transform(self, raw_documents, y=None)
        Learn vocabulary and idf, return term-document matrix.

        This is equivalent to fit followed by transform, but more efficiently
        implemented.

        Parameters:
            raw_documents : iterable
                an iterable which yields either str, unicode or file objects

        Returns:
            X : sparse matrix, [n_samples, n_features]
                Tf-idf-weighted document-term matrix.

Hence, it can be applied to a single column of text data only. Iterating over a DataFrame yields its column names, so in your code the vectorizer treated each column name as a document and built the transform from those. An example to show what is happening:

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer

    data = pd.DataFrame({'col1': ['this is first sentence', 'this one is the second sentence'],
                         'col2': ['this is first sentence', 'this one is the second sentence'],
                         'col3': ['this is first sentence', 'this one is the second sentence']})

    vec = TfidfVectorizer()
    vec.fit_transform(data).todense()
    # matrix([[1., 0., 0.],
    #         [0., 1., 0.],
    #         [0., 0., 1.]])

    # get_feature_names_out() replaces get_feature_names() in scikit-learn >= 1.0
    list(vec.get_feature_names_out())
    # ['col1', 'col2', 'col3']

The solution is either to join all three columns into a single column, or to apply a vectorizer to each column separately and stack the results at the end.

Approach 1

    # join the text of the three columns row by row
    data.loc[:, 'full_text'] = data.apply(lambda x: ' '.join(x), axis=1)

    vec = TfidfVectorizer()
    X = vec.fit_transform(data['full_text']).todense()
    print(X.shape)
    # (2, 7)
    print(list(vec.get_feature_names_out()))
    # ['first', 'is', 'one', 'second', 'sentence', 'the', 'this']

Approach 2

    from scipy.sparse import hstack

    vec = {}
    X = []
    for col in ['col1', 'col2', 'col3']:
        vec[col] = TfidfVectorizer()
        # collect one sparse tf-idf matrix per column
        X.append(vec[col].fit_transform(data[col]))

    # stack the per-column matrices side by side
    stacked_X = hstack(X).todense()
    stacked_X.shape
    # (2, 21)

    for col, v in vec.items():
        print(col)
        print(list(v.get_feature_names_out()))
    # col1
    # ['first', 'is', 'one', 'second', 'sentence', 'the', 'this']
    # col2
    # ['first', 'is', 'one', 'second', 'sentence', 'the', 'this']
    # col3
    # ['first', 'is', 'one', 'second', 'sentence', 'the', 'this']
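
The per-column idea in Approach 2 can probably also be expressed with scikit-learn's ColumnTransformer, which fits one transformer per column and concatenates the outputs for you. The following is only a minimal sketch, assuming scikit-learn 0.20 or later and the same toy DataFrame as above; it is not part of the original answer. Each column is passed as a plain string rather than a list so that the vectorizer receives a 1-D column of text.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Same toy data as in the example above.
data = pd.DataFrame({
    'col1': ['this is first sentence', 'this one is the second sentence'],
    'col2': ['this is first sentence', 'this one is the second sentence'],
    'col3': ['this is first sentence', 'this one is the second sentence'],
})

# Iterating over a DataFrame yields column labels, not rows,
# which is why the vectorizer originally saw three "documents".
print(list(data))
# ['col1', 'col2', 'col3']

# One independent TfidfVectorizer per column. A scalar column name
# (not ['col1']) makes ColumnTransformer hand over a 1-D Series.
ct = ColumnTransformer(
    [(col, TfidfVectorizer(), col) for col in ['col1', 'col2', 'col3']]
)
X = ct.fit_transform(data)

# Two rows (samples), 7 features from each of the three columns.
print(X.shape)
# (2, 21)
```

The number of rows stays equal to the number of samples, so the labels still line up with X. Whether X comes back sparse or dense depends on ColumnTransformer's sparse_threshold setting and the density of the stacked result.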

