I did some computations in an IPython Notebook and ended up with a dataframe df which isn’t saved anywhere yet. In the same IPython Notebook, I want to work with this dataframe using sklearn.
df is a dataframe with 4 columns: id (string), value (int), rated (bool), score (float). I am trying to determine what influences the score the most, just like in this example. There they load a standard dataset, but I want to use my own dataframe from the notebook instead.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from matplotlib import pyplot as plt

plt.rcParams.update({'figure.figsize': (12.0, 8.0)})
plt.rcParams.update({'font.size': 14})

dataset = df
X = pd.DataFrame(dataset.data, columns=dataset.feature_names)
y = dataset.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=12)
But I get an AttributeError: 'DataFrame' object has no attribute 'data'.
Answer
OK, some clarifications first: the example never shows what the load_boston() function does; it is only imported. Whatever that function returns has an attribute called “data” (scikit-learn’s dataset loaders return a Bunch object that also carries feature_names and target).
They use this line:
X = pd.DataFrame(boston.data, columns=boston.feature_names)
to create a dataframe. Your situation is different because you already have a dataframe, and dataframes don’t have a “.data” attribute. Hence the error you’re getting: ‘DataFrame’ object has no attribute ‘data’.
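To make the difference concrete, here is a minimal sketch. The Bunch below is a hand-built, hypothetical stand-in for what a loader returns (the values and the column names "a" and "b" are made up), not the real Boston data:

```python
import pandas as pd
from sklearn.utils import Bunch

# Hypothetical stand-in for a loader's return value (made-up numbers)
loaded = Bunch(
    data=[[1.0, 2.0], [3.0, 4.0]],
    feature_names=["a", "b"],
    target=[0.5, 1.5],
)

# A Bunch exposes .data and .feature_names, so this line works
frame = pd.DataFrame(loaded.data, columns=loaded.feature_names)

# The resulting DataFrame has columns, but no .data attribute
has_data_attr = hasattr(frame, "data")
print(has_data_attr)
```

So the loader output and a dataframe are two different kinds of object, and the example’s first lines only exist to convert one into the other.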
What you need is simply
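# Features: every column except the target (drop 'id' too if it is only a label)
X = df.drop(columns=['score'])
y = df['score']

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=12)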
or, if you only need some of the columns from your dataframe:
# set data ('id' is a string label, not a numeric feature, so it is left out)
list_of_columns = ['value', 'rated']
X = df[list_of_columns]

# set target
target_column = 'score'
y = df[target_column]

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=12)
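Since the goal is to see what influences the score the most, the whole thing could look roughly like the sketch below. The dataframe here is a made-up stand-in with the same four columns as yours (in the notebook you would skip that part and use your real df), and n_estimators=100 is just an assumed setting:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the notebook's df (made-up values, same four columns)
df = pd.DataFrame({
    "id": [f"row{i}" for i in range(20)],
    "value": list(range(20)),
    "rated": [i % 2 == 0 for i in range(20)],
    "score": [0.5 * i + (1.0 if i % 2 == 0 else 0.0) for i in range(20)],
})

# Use only the numeric/boolean columns as features; cast bool to float
feature_columns = ["value", "rated"]
X = df[feature_columns].astype(float)
y = df["score"]

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=12
)

# Fit the regressor and read the impurity-based importances
model = RandomForestRegressor(n_estimators=100, random_state=12)
model.fit(X_train, y_train)

importances = pd.Series(
    model.feature_importances_, index=feature_columns
).sort_values(ascending=False)
print(importances)
```

The importances are normalized to sum to 1, so each value can be read as that column’s share of the model’s total impurity reduction.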