I have converted my dataset to a pandas DataFrame. How can I use it with scikit-learn's KMeans, or is there another k-means package available?
import csv
import codecs
import pandas as pd
from sklearn.cluster import KMeans
# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split

# Assumes a tab-separated file; only empty strings are treated as missing values
sample_df = pd.read_csv('sample.csv', sep='\t', keep_default_na=False, na_values=[""])
print(sample_df['Polarity'])
print(sample_df['Gravity'])
print(sample_df['Sense'])
print(sample_df[['Polarity', 'Gravity']])
# precompute_distances and n_jobs were removed in scikit-learn 1.0, so they are omitted here
KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, verbose=0, random_state=None, copy_x=True)
Answer
sklearn works directly with pandas DataFrames, as long as the columns you pass are numeric. Therefore, it's as simple as:
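from sklearn.cluster import KMeans
# train_test_split moved here when sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.model_selection import train_test_split

# Random split: 60% of the rows for fitting, the remaining 40% held out
sample_df_train, sample_df_test = train_test_split(sample_df, train_size=0.6)

# precompute_distances and n_jobs were removed in scikit-learn 1.0, so they are omitted here
cluster = KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, verbose=0, random_state=None, copy_x=True)
cluster.fit(sample_df_train)  # learn the centroids from the training rows
result = cluster.predict(sample_df_test)  # index of the nearest centroid for each held-out row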
That 0.6 passed as train_size means you use 60% of your data for training and 40% for testing.
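To make the split-then-cluster flow concrete, here is a minimal sketch on a synthetic DataFrame. It is not the asker's data: demo_df, features, train_df, test_df, model and test_labels are hypothetical names, the two-blob data is generated only for illustration, and the imports assume a recent scikit-learn where train_test_split lives in sklearn.model_selection.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

# Synthetic stand-in for sample_df (an assumption, not the real data):
# two numeric columns containing two well-separated groups of 50 rows each.
rng = np.random.default_rng(0)
low = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
high = rng.normal(loc=10.0, scale=0.5, size=(50, 2))
demo_df = pd.DataFrame(np.vstack([low, high]), columns=["Polarity", "Gravity"])

# KMeans needs numeric input, so keep only the numeric columns before fitting.
features = demo_df.select_dtypes(include="number")

# 60% of the rows for fitting, 40% held out; random_state makes the split repeatable.
train_df, test_df = train_test_split(features, train_size=0.6, random_state=0)

model = KMeans(n_clusters=2, n_init=10, random_state=0)
model.fit(train_df)

# predict returns one cluster index per held-out row, in row order,
# so the labels can be aligned back to the DataFrame index.
test_labels = pd.Series(model.predict(test_df), index=test_df.index, name="cluster")
print(test_labels.value_counts())
```

Keeping the DataFrame index on the predicted labels makes it easy to join the cluster assignments back onto the original rows. If your real columns are on very different scales, standardizing them before clustering is often worth considering, since k-means relies on Euclidean distance.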
More info here:
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html