Creating a new column for predicted cluster: SettingWithCopyWarning

Question

Unfortunately this question will be a duplicate, but I could not fix the issue in my code, even after looking at the other similar questions and their answers. I need to split my dataset into a train and a test dataset. However, it seems I am making a mistake when I add a new column for the predicted cluster. The

Accepted Answer

IMHO, train_test_split gives you a list, and when you call copy() on it, that copy() is the list's method, not pandas'. So you only create a shallow copy of the list, not of the elements inside it. In other words

    X_train, X_test = train_test_split(X, test_size=0.4).copy()

is equivalent to:

    train_test = train_test_split(X, test_size=0.4)
    train_test_copy = train_test.copy()
    X_train, X_test = train_test_copy[0], train_test_copy[1]

X_train and X_test are still the very DataFrames that train_test_split returned, and they may or may not share data with X. Adding a column to them is what triggers pandas' infamous SettingWithCopyWarning. If you want to copy the DataFrames, you should explicitly call copy() on each one:

    X_train, X_test = train_test_split(X, test_size=0.4)
    X_train, X_test = X_train.copy(), X_test.copy()

or

    X_train, X_test = [d.copy() for d in train_test_split(X, test_size=0.4)]

Then X_train and X_test are each a new DataFrame with its own data.

Update: I tested this code and it ran without any warnings:

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split

    X = pd.DataFrame(np.random.rand(100, 3))
    X_train, X_test = train_test_split(X, test_size=0.4)
    X_train, X_test = X_train.copy(), X_test.copy()
    X_train['abcd'] = 1
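
To see the difference between copying the container and copying the DataFrames, here is a minimal sketch. It assumes scikit-learn's train_test_split and a random 100x3 DataFrame like the one in the update. The random_state=0 argument and the column name 'cluster' are illustrative choices, not taken from the original code.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

X = pd.DataFrame(np.random.rand(100, 3))

# train_test_split returns a plain Python list of DataFrames.
parts = train_test_split(X, test_size=0.4, random_state=0)

# list.copy() builds a new list, but the elements are the same objects.
shallow = parts.copy()
same_objects = all(a is b for a, b in zip(parts, shallow))

# Calling DataFrame.copy() on each element gives independent DataFrames.
X_train, X_test = [d.copy() for d in parts]
new_objects = all(a is not b for a, b in zip(parts, [X_train, X_test]))

# 'cluster' is a hypothetical column name used only for illustration.
X_train["cluster"] = 1

print("container copy keeps the same DataFrames:", same_objects)
print("per-DataFrame copy creates new objects:", new_objects)
```

The identity checks with `is` are the point here: the copied list still holds the DataFrames produced by the split, so only the per-DataFrame copy gives you objects that are safe to modify.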

Answer