Creating a new column for predicted cluster: SettingWithCopyWarning

Unfortunately this question will probably be a duplicate, but I could not fix the issue in my code even after looking at similar questions and their answers. I need to split my dataset into a train and a test dataset. However, it seems I am making a mistake when I add a new column for the predicted cluster. The warning that I get is:

/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until

There are a few questions on this warning, but I am probably doing something wrong, as I have not fixed the issue yet and still get the same message as above. The dataset is the following:

    Date        Link                     Value
0   03/15/2020  https://www.bbc.com      1
1   03/15/2020  https://www.netflix.com  4
2   03/15/2020  https://www.google.com   10
...

I have split the dataset into train and test sets as follows:

import pandas as pd
import sklearn
from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import nltk
import string as st

train_data=df.Link.tolist()
df_train=pd.DataFrame(train_data, columns = ['Review'])
X = df_train

X_train, X_test = train_test_split(
        X, test_size=0.4).copy()
X_test, X_val = train_test_split(
        X_test, test_size=0.5).copy()
print(X_train.isna().sum())
print(X_test.isna().sum())

stop_words = stopwords.words('english')

def preprocessor(t):
    # Replace every non-letter character with a space before tokenizing.
    t = re.sub(r"[^a-zA-Z]", " ", t)
    words = word_tokenize(t)
    w_lemm = [WordNetLemmatizer().lemmatize(w) for w in words if w not in stop_words]
    return w_lemm


vect =TfidfVectorizer(tokenizer= preprocessor)
vectorized_text=vect.fit_transform(X_train['Review'])
kmeans =KMeans(n_clusters=3).fit(vectorized_text)

The lines of code that trigger the warning are:

cl=kmeans.predict(vectorized_text)
X_train['Cluster']=pd.Series(cl, index=X_train.index)

I think these two questions should have been able to help me with my code:

How to add k-means predicted clusters in a column to a dataframe in Python

How to deal with SettingWithCopyWarning in Pandas?

but something is still wrong in my code.

Could you please have a look at it and help me to fix this issue before closing this question as duplicate?

Answer

IMHO, train_test_split gives you a list, and when you call copy() on it, that copy() is the list's method, not pandas'. It does not copy the DataFrames inside the list, so assigning a new column to one of them still triggers pandas' infamous copy warning.

So you only create a shallow copy of the list, not of its elements. In other words

X_train, X_test = train_test_split(X, test_size=0.4).copy()

is equivalent to:

train_test = train_test_split(X, test_size=0.4)
train_test_copy = train_test.copy()
X_train, X_test = train_test_copy[0], train_test_copy[1]
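
To make the shallow-copy point concrete, here is a minimal sketch. The toy DataFrame, its column names, and the variable names parts and parts_copy are made up for illustration; only train_test_split and list.copy() come from the discussion above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data; any DataFrame would do.
X = pd.DataFrame(np.arange(30).reshape(10, 3), columns=["a", "b", "c"])

parts = train_test_split(X, test_size=0.4, random_state=0)
parts_copy = parts.copy()  # list.copy(): a new list holding the same DataFrame objects

is_list = isinstance(parts, list)          # the splitter returns a plain list
new_container = parts_copy is not parts    # the list itself was duplicated
same_frames = all(a is b for a, b in zip(parts, parts_copy))  # the DataFrames were not

print(is_list, new_container, same_frames)
```

All three checks should come out True: the outer list is new, but X_train and X_test unpacked from it are the very same objects that train_test_split returned.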

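Since Python variables are only references to DataFrame objects, X_train and X_test are still the frames that train_test_split selected out of X, and they may or may not share data with X. That uncertainty is what pandas warns about. If you want independent DataFrames, you should explicitly call copy() on each one: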

X_train, X_test = train_test_split(X, test_size=0.4)
X_train, X_test = X_train.copy(), X_test.copy()

or, as a one-liner:

X_train, X_test = [d.copy() for d in train_test_split(X, test_size=0.4)]

Then X_train and X_test are each a new DataFrame with its own data, independent of X.

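Update: I tested this code and it runs without any warnings:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

X = pd.DataFrame(np.random.rand(100,3))
X_train, X_test = train_test_split(X, test_size=0.4)
# Copy each DataFrame individually, not the list that holds them.
X_train, X_test = X_train.copy(), X_test.copy()

# Adding a column to the independent copy raises no SettingWithCopyWarning.
X_train['abcd'] = 1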
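
The same fix can be tried on the Cluster column from the question. This is only a sketch under assumptions: random numeric features with made-up column names stand in for the TF-IDF matrix, and n_init and random_state are set just to keep the run repeatable.

```python
import warnings

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

# Hypothetical numeric features standing in for the vectorized text.
rng = np.random.default_rng(0)
features = ["a", "b", "c"]
X = pd.DataFrame(rng.random((100, 3)), columns=features)

# Copy each DataFrame returned by the split, not the list that holds them.
X_train, X_test = [d.copy() for d in train_test_split(X, test_size=0.4, random_state=0)]

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train[features])
cl = kmeans.predict(X_train[features])

# Record every warning raised by the assignment so it can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    X_train["Cluster"] = pd.Series(cl, index=X_train.index)

copy_warnings = [w for w in caught if "SettingWithCopy" in w.category.__name__]
print(len(copy_warnings), "Cluster" in X.columns)
```

If the copies are independent, no SettingWithCopyWarning should be recorded, and the original X should not gain a Cluster column.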
