Problem
I get the same output every time, regardless of the input.
Context
I have a .csv file of IDs in which each row represents a previously formed team of 5 people, like this:
0, 1, 2, 3, 4
5, 6, 7, 3, 8
2, 5, 6, 7, 3
9, 1, 2, 6, 4
9, 0, 1, 2, 4
...
My goal with the following code is to be able to input 4 IDs and get a prediction of what the 5th member should be.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd

file = 'People.csv'

# Read dataset without a header row:
dataset = pd.read_csv(file, header=None)

# return the first 5 rows:
dataset.head()

# Convert input to a two-dimensional array:
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values

# Split dataset into random train and test subsets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

# Standardize - removes mean and scales to unit variance:
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Use the KNN classifier to fit data:
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)

# Uses the classifier to predict the value of the fifth column:
y_pred = classifier.predict([[5, 6, 7, 3]])

# Print the predicted value:
print(y_pred)
Answer
Mainstream statistical machine learning assumes that it’s possible to predict an attribute of an object based on other observed attributes.
In the problem presented here, there are no such attributes. Each row represents a previously observed team, and each column holds the identifier of a team member. An identifier is an arbitrary label rather than a measurement, so it is not clear how we would build a predictive model from it.
There’s an alternative way to frame this problem, though: “Which people prefer to work together?”, “What frequent patterns exist in this data?”, or “How do we expect people to rate one another?”
“Apriori” is an algorithm that helps estimate which objects (team members) frequently appear together, and mlxtend provides an implementation:
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
import pandas as pd

data = [
    [0, 1, 2, 3, 4],
    [5, 6, 7, 3, 8],
    [2, 5, 6, 7, 3],
    [9, 1, 2, 6, 4],
    [9, 0, 1, 2, 4],
]

te = TransactionEncoder()
te_ary = te.fit(data).transform(data)
df = pd.DataFrame(te_ary, columns=te.columns_)

print(apriori(df, min_support=0.5))
The output includes itemsets and their support (basically a measure of how frequently they were observed together).
   support   itemsets
0      0.6        (1)
1      0.8        (2)
2      0.6        (3)
3      0.6        (4)
4      0.6        (6)
5      0.6     (1, 2)
6      0.6     (1, 4)
7      0.6     (2, 4)
8      0.6  (1, 2, 4)
For example, this tells us that user 2 has previously appeared in 80% of the teams, and that users 1, 2, and 4 worked together in 60% of the teams.
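As a sanity check on those figures, support can be computed by hand: it is the fraction of teams that contain every member of an itemset. The sketch below uses only the standard library and the five sample teams from the question; the `support` helper is a name introduced here for illustration and is not part of mlxtend.

```python
# The five sample teams from the question.
teams = [
    [0, 1, 2, 3, 4],
    [5, 6, 7, 3, 8],
    [2, 5, 6, 7, 3],
    [9, 1, 2, 6, 4],
    [9, 0, 1, 2, 4],
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`.

    Hypothetical helper for illustration only, not an mlxtend function.
    """
    wanted = set(itemset)
    hits = sum(1 for team in transactions if wanted.issubset(team))
    return hits / len(transactions)


print(support({2}, teams))        # 0.8 (user 2 is in 4 of 5 teams)
print(support({1, 2, 4}, teams))  # 0.6 (users 1, 2, 4 share 3 of 5 teams)
```

Any itemset whose support falls below `min_support=0.5` (user 0, for instance, at 0.4) is left out of the Apriori output shown above.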
If we were trying to form groups in the future, we might sample from users who have worked with one another previously, and randomly add or remove people until everyone is on a team.
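To connect this back to the original goal (given 4 IDs, suggest a 5th), one possible sketch is to score every other person by how often they have shared a team with the given members. This is a simple co-occurrence heuristic, not Apriori and not a trained model; the function name `rank_fifth_member` and the scoring rule (one point per shared member per past team) are assumptions made for illustration.

```python
from collections import Counter

# The five sample teams from the question.
past_teams = [
    [0, 1, 2, 3, 4],
    [5, 6, 7, 3, 8],
    [2, 5, 6, 7, 3],
    [9, 1, 2, 6, 4],
    [9, 0, 1, 2, 4],
]


def rank_fifth_member(partial_team, teams):
    """Rank candidates by co-occurrence with the given members.

    Hypothetical heuristic: for each past team, every member outside
    `partial_team` earns one point per member of `partial_team` that
    was also on that team.
    """
    partial = set(partial_team)
    scores = Counter()
    for team in teams:
        shared = len(partial & set(team))
        if shared == 0:
            continue  # this team tells us nothing about the given members
        for member in team:
            if member not in partial:
                scores[member] += shared
    # Highest score first; ties keep first-seen order.
    return scores.most_common()


ranking = rank_fifth_member([5, 6, 7, 3], past_teams)
print(ranking[0])  # (2, 6): user 2 is the strongest candidate
```

Unlike the KNN approach in the question, this treats IDs as labels rather than numbers, so the suggestion changes with the input whenever the co-occurrence history does. With only a handful of past teams the scores are coarse, so the ranking is better read as a shortlist than as a prediction.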