I get the same output for a classifier algorithm with sklearn and pandas

Problem

I get the same output every time, regardless of the input.

Context

I have a .csv file in which each row lists the IDs of a previously formed team of 5 people, like this:

0, 1, 2, 3, 4
5, 6, 7, 3, 8
2, 5, 6, 7, 3
9, 1, 2, 6, 4
9, 0, 1, 2, 4
...

My goal with the following code is to be able to input 4 IDs and get a prediction of what the 5th member should be.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler 
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd 

file = 'People.csv'

# Read dataset without a header row:
dataset = pd.read_csv(file, header=None)

# return the first 5 rows: 
dataset.head() 

# Convert input to a two-dimensional array:
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values

# Split dataset into random train and test subsets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

# Standardize - removes mean and scales to unit variance:
scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test) 

# Use the KNN classifier to fit data:
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train) 

# Uses the classifier to predict the value of the fifth column:
y_pred = classifier.predict([[5, 6, 7, 3]])

# Print the predicted value:
print(y_pred)

Answer

Mainstream statistical machine learning assumes that it’s possible to predict an attribute of an object based on other observed attributes.

The problem presented here has no such attributes. Each row represents a previously observed team, and each column holds only the identifier of a team member. The IDs are arbitrary labels rather than measurements, so it is not clear how we would build a predictive model from them.


There’s an alternate way to frame this problem though: “Which people prefer to work together?” or “What frequent patterns exist in this data?” or “How do we expect each person to rate one another?”

“Apriori” is an algorithm that helps estimate which objects (team members) frequently appear together, and mlxtend provides an implementation:

from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
import pandas as pd

data = [
    [0, 1, 2, 3, 4],
    [5, 6, 7, 3, 8],
    [2, 5, 6, 7, 3],
    [9, 1, 2, 6, 4],
    [9, 0, 1, 2, 4],
]


te = TransactionEncoder()
te_ary = te.fit(data).transform(data)
df = pd.DataFrame(te_ary, columns=te.columns_)

print(apriori(df, min_support=0.5))

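The output includes itemsets and their support (the fraction of teams in which every member of the itemset appeared together). Because use_colnames is left at its default of False, the itemsets column shows column indices; here those coincide with the IDs, since the IDs are 0 through 9.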

   support   itemsets
0      0.6        (1)
1      0.8        (2)
2      0.6        (3)
3      0.6        (4)
4      0.6        (6)
5      0.6     (1, 2)
6      0.6     (1, 4)
7      0.6     (2, 4)
8      0.6  (1, 2, 4)

For example, user 2 appeared in 80% of the teams, and users 1, 2, and 4 appeared together in 60% of them.
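
To make the support numbers concrete, here is a minimal sketch that recomputes them by hand with plain Python. The helper name support is made up for this illustration and is not part of mlxtend; it only counts how many of the five example teams contain every member of a given set.

```python
# The five example teams from the question.
teams = [
    [0, 1, 2, 3, 4],
    [5, 6, 7, 3, 8],
    [2, 5, 6, 7, 3],
    [9, 1, 2, 6, 4],
    [9, 0, 1, 2, 4],
]


def support(itemset, transactions):
    """Fraction of transactions that contain every member of itemset.

    Hypothetical helper for illustration only; not an mlxtend function.
    """
    wanted = set(itemset)
    hits = sum(1 for team in transactions if wanted <= set(team))
    return hits / len(transactions)


print(support({2}, teams))        # 0.8 (user 2 is in 4 of 5 teams)
print(support({1, 2, 4}, teams))  # 0.6 (users 1, 2, 4 share 3 of 5 teams)
```

These values match rows 1 and 8 of the apriori output above.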

To form groups in the future, we might start from users who have worked with one another previously, then randomly add or remove people until everyone is on a team.
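
As one rough way to act on that idea for the original goal (given 4 IDs, suggest a 5th), the sketch below ranks candidates by how often they shared a past team with the given members. This is an assumption-laden heuristic, not Apriori and not a trained model: the function name suggest_member and the scoring rule (one point per shared past teammate) are invented here for illustration.

```python
from collections import Counter

# The five example teams from the question.
teams = [
    [0, 1, 2, 3, 4],
    [5, 6, 7, 3, 8],
    [2, 5, 6, 7, 3],
    [9, 1, 2, 6, 4],
    [9, 0, 1, 2, 4],
]


def suggest_member(partial_team, past_teams):
    """Rank candidate IDs by co-occurrence with the given members.

    Hypothetical heuristic: for every past team, each member outside
    partial_team earns one point per known member on that same team.
    """
    known = set(partial_team)
    scores = Counter()
    for team in past_teams:
        members = set(team)
        overlap = len(known & members)
        if overlap == 0:
            continue  # this past team tells us nothing about the known members
        for candidate in members - known:
            scores[candidate] += overlap
    # Highest score first.
    return scores.most_common()


ranked = suggest_member([5, 6, 7, 3], teams)
print(ranked[:2])  # [(2, 6), (8, 4)]
```

For the input 5, 6, 7, 3, user 2 ranks first because they previously worked with all four of those people and also appeared alongside users 3 and 6 on other teams. With only five rows of history the ranking is little more than an illustration; a real dataset would need enough teams for the counts to be meaningful.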