
How to build a full trainset when loading data from predefined folds in Surprise?

I am using Surprise to evaluate various recommender system algorithms. I would like to calculate predictions and prediction coverage over all possible user-item combinations. My data is loaded from predefined splits.

My strategy to calculate prediction coverage is to

  1. build a full trainset and fit
  2. get lists of all users and items
  3. iterate through the list and make predictions
  4. count predictions flagged as impossible, to compute prediction coverage (sketched below).
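
In code, the plan looks roughly like this (a minimal sketch, assuming the dataset supports build_full_trainset(); NormalPredictor stands in for any algorithm):

from surprise import NormalPredictor

algo = NormalPredictor()
trainset = data.build_full_trainset()   # step 1: this is the call that fails below
algo.fit(trainset)

impossible = 0
for u in trainset.all_users():          # steps 2-3: inner ids, converted to raw ids
    for i in trainset.all_items():
        pred = algo.predict(trainset.to_raw_uid(u), trainset.to_raw_iid(i))
        impossible += pred.details['was_impossible']   # step 4

coverage = 1 - impossible / (trainset.n_users * trainset.n_items)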

Calling data.build_full_trainset() yields the following error:

AttributeError: 'DatasetUserFolds' object has no attribute 'build_full_trainset'

Is there a way to build a full trainset when loading data from predefined folds?

Alternatively, I could combine the data outside of Surprise into a dataframe and rebuild the dataset from that. Or is there a better approach?

Thank you.

# %% #https://surprise.readthedocs.io/en/stable/getting_started.html#basic-usage

import random
import pickle
import numpy as np
import pandas as pd

# from survey.data_cleaning import long_ratings
from surprise import NormalPredictor
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import cross_validate
from surprise.model_selection import GridSearchCV
# from surprise.model_selection import LeaveOneOut, KFold
from surprise.model_selection import PredefinedKFold

#set random seed for reproducibility
my_seed = 0
random.seed(my_seed)
np.random.seed(my_seed)

path = 'data/recommenders/'

def load_splits():
    """
    Loads splits from files load data from splits created by colab code and stored to files. used in surprise_recommenders.py

    returns splits as dataset
    """
    # path to dataset folder
    files_dir = 'data/recommenders/splits/'
    # This time, we'll use the built-in reader.
    reader = Reader(line_format='user item rating', sep=' ', skip_lines=0, rating_scale=(1, 5))

    # folds_files is a list of tuples containing file paths:
    # [(u1.base, u1.test), (u2.base, u2.test), ... (u5.base, u5.test)]
    train_file = files_dir + 'u%d.base'
    test_file = files_dir + 'u%d.test'
    folds_files = [(train_file % i, test_file % i) for i in (1, 2, 3, 4, 5)]

    data = Dataset.load_from_folds(folds_files, reader=reader)
    return data

data = load_splits()

pkf = PredefinedKFold()

algos = {
  'NormalPredictor': {'constructor': NormalPredictor,
                      'param_grid': {}
   }}

key = "stratified_5_fold"
cv_results={}
print(f"Performing {key} cross validation.")
for algo_name, v in algos.items():
    print("Working on algorithm: ", algo_name)
    gs = GridSearchCV(v['constructor'], v['param_grid'], measures=['rmse', 'mae'], cv=pkf)

    gs.fit(data)
    # best RMSE score
    print(gs.best_score['rmse'])
    
    # combination of parameters that gave the best RMSE score
    print(gs.best_params['rmse'])

    # Predict on the full dataset,
    # using the estimator that yields the best RMSE:
    algo = gs.best_estimator['rmse']
    algo.fit(data.build_full_trainset())     # predefined folds break this call


    cv_results[algo_name] = pd.DataFrame.from_dict(gs.cv_results)


Answer

TL;DR: The Surprise model_selection documentation describes a refit option that fits the best estimator on the full trainset, but it explicitly does not work with predefined folds.
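
For comparison, refit works along these lines when the dataset is not built from predefined folds (a sketch assuming data was loaded with, e.g., Dataset.load_from_df; with DatasetUserFolds it raises an error):

gs = GridSearchCV(NormalPredictor, {}, measures=['rmse', 'mae'], cv=5, refit=True)
gs.fit(data)                              # refits the best estimator on the full trainset
prediction = gs.predict('user_1', 'item_1')  # raw ids (illustrative); delegates to the refit estimator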

Another major issue: oyyablokov’s comment on this issue suggests you cannot fit a model on data that contains NaNs. So even if you had a full trainset, how would you create a full prediction matrix to calculate things like prediction coverage, which requires all user-item combinations, with or without ratings?

My workaround was to create 3 Surprise datasets.

  1. The dataset from predefined folds to compute best_params
  2. The full dataset of ratings (combining all folds outside of Surprise)
  3. The full prediction matrix dataset including all possible combinations of users and items, with or without ratings (datasets 2 and 3 are sketched below).
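
For example, datasets 2 and 3 can be built along these lines (a sketch that assumes the union of the u*.base fold files covers every rating; the paths and space-separated three-column layout follow the loading code in the question):

import itertools
import pandas as pd

files_dir = 'data/recommenders/splits/'
frames = [pd.read_csv(files_dir + 'u%d.base' % i, sep=' ', header=None)
          for i in (1, 2, 3, 4, 5)]

# 2. All ratings combined outside of Surprise (the folds overlap, so deduplicate).
ratings = pd.concat(frames).drop_duplicates(subset=[0, 1])

# 3. Every possible (user, item) pair; pairs without a rating get a NaN rating.
pairs = pd.DataFrame(list(itertools.product(ratings[0].unique(), ratings[1].unique())))
matrix = pairs.merge(ratings, on=[0, 1], how='left')
data_matrix = matrix.to_numpy()   # columns 0, 1, 2 = user, item, rating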

After you find your best parameters with grid search cross-validation, you can compute your predictions and coverage with something like this:

import pandas as pd
from surprise import Dataset, Reader

def get_pred_coverage(data_matrix, algo_constructor, best_params, verbose=False):
    """
    Calculates prediction coverage.
    inputs:
        data_matrix: NumPy matrix whose columns 0, 1, 2 hold user, item, rating
        algo_constructor: the Surprise algorithm constructor to pass the best params into
        best_params: the Surprise gs.best_params entry to pass into the algo
        verbose: if True, print diagnostic output

    returns: prediction coverage & full predictions
    """
    reader = Reader(rating_scale=(1, 5))

    full_predictions = [] #list to store prediction results
    
    df = pd.DataFrame(data_matrix)
    if verbose: print(df.info())
    df_no_nan = df.dropna(subset=[2])
    if verbose: print(df_no_nan.head())
    no_nan_dataset = Dataset.load_from_df(df_no_nan[[0,1,2]], reader)
    full_dataset = Dataset.load_from_df(df[[0, 1, 2]], reader)
    #Predict on full dataset
    # Use the weights that yields the best rmse:
    algo = algo_constructor(**best_params) # Pass the dictionary as double star keyword arguments to the algorithm constructor

    #Create a NaN-free trainset to fit on
    no_nan_trainset = no_nan_dataset.build_full_trainset()
    algo.fit(no_nan_trainset)
    if verbose: print('Number of trainset users: ', no_nan_trainset.n_users, '\n')
    if verbose: print('Number of trainset items: ', no_nan_trainset.n_items, '\n')

    pred_set = full_dataset.build_full_trainset()
    if verbose: print('Number of users: ', pred_set.n_users, '\n')
    if verbose: print('Number of items: ', pred_set.n_items, '\n')
    
    #get all item ids
    pred_set_iids = list(pred_set.all_items())
    # print(f'pred_set iids are {pred_set_iids}')
    iid_converter = lambda x: pred_set.to_raw_iid(x)
    pred_set_raw_iids = list(map(iid_converter, pred_set_iids))
    
    #get all user ids
    pred_set_uids = list(pred_set.all_users())
    uid_converter = lambda x: pred_set.to_raw_uid(x)
    pred_set_raw_uids = list(map(uid_converter, pred_set_uids))
    # print(f'pred_set uids are {pred_set_uids}')

    for user in pred_set_raw_uids:
        for item in pred_set_raw_iids:
            r_ui = df.loc[(df[0] == user) & (df[1] == item), 2].iloc[0]  #find the rating by user and item (may be NaN)
            # print(f"r_ui is type {type(r_ui)} and value {r_ui}")
            
            prediction = algo.predict(uid = user, iid = item, r_ui=r_ui)
            # print(prediction)
            full_predictions.append(prediction)
    #count predictions flagged as impossible: the 5th element of the
    #Prediction namedtuple is the "details" dictionary, keyed by "was_impossible"
    impossible_count = 0
    for prediction in full_predictions:
        impossible_count += prediction.details['was_impossible']

    if verbose: print(f"for algo {algo}, impossible_count is {impossible_count} ")

    prediction_coverage = (pred_set.n_users*pred_set.n_items - impossible_count)/(pred_set.n_users*pred_set.n_items)
    print(f"prediction_coverage is {prediction_coverage}")

    return prediction_coverage, full_predictions
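
A hypothetical call, reusing data_matrix from the earlier sketch and the grid search results from the question:

coverage, full_predictions = get_pred_coverage(
    data_matrix, NormalPredictor, gs.best_params['rmse'], verbose=True)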