Okay, I’ll say up front that I’m entirely new to scikit-learn and data science. Here is the issue and the research I’ve done on it so far. The code is at the bottom.
Summary
I’m trying to do type recognition (like digits, for example) with a BernoulliRBM, and I’m trying to find the correct parameters with GridSearchCV. However, I don’t see anything happening. Many examples that use the verbosity settings show output and progress, but mine just prints:
Fitting 3 folds for each of 15 candidates, totalling 45 fits
Then it sits there and does nothing... forever (or at least 8 hours, the longest I’ve waited with high verbosity settings).
I have a pretty large data set (1000 2D arrays, each of size 428 by 428), so this might be the problem, but I’ve also set the verbosity to 10, so I feel like I should be seeing some kind of output or progress. My “target” is just a 0 or a 1: either it is the object I’m looking for (1), or it isn’t (0).
Previous Research
- I looked into sklearn.preprocessing to see if that was necessary. It doesn’t seem to be the issue (but again, I’m entirely new to this).
- I increased verbosity.
- I switched from using a 3D list of data to using a list of scipy csr matrices.
- I waited 8 hours with high verbosity settings and still didn’t see anything happening.
- I switched from not using a pipeline to using a pipeline.
- I tampered with various parameters of GridSearchCV and tried creating fake (smaller) data sets to practice on.
def network_trainer(self, data, files):
    train_x, test_x, train_y, test_y = train_test_split(data, files, test_size=0.2, random_state=0)
    parameters = {'learning_rate':np.arange(.25, .75, .1), 'n_iter':[5, 10, 20]}
    model = BernoulliRBM(random_state=0, verbose=True)
    model.cv = 2
    model.n_components = 2
    logistic = linear_model.LogisticRegression()
    pipeline = Pipeline(steps=[('model', model), ('clf', logistic)])
    gscv = grid_search.GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=10)
    gscv.fit(train_x, train_y)
    print gscv.best_params_
I’d really appreciate a nudge in the right direction here. Thanks for considering my issue.
Answer
Okay, so just to summarize everything I’ve figured out about it over the past few days.
- On Windows 8.1 don’t set n_jobs to anything other than 1 if you still want it to be verbose.
- In my case, even though I had set n_jobs = 1, all of my processor cores were still involved in the calculations, so either this is a bug or it should be better documented.
- I made the horrible mistake of using a list of csr matrices, so basically, read the documentation and then read it again before you ask questions.
Again I’d like to thank @Barmaley.exe for the initial tip.
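For anyone landing here later, the points above can be sketched roughly as follows. This is only a minimal illustration, not the exact code I ended up with, and it rests on a few assumptions: it uses the newer sklearn.model_selection module rather than the old grid_search module from my question, the data is a small synthetic stand-in (60 random 8 by 8 arrays, not my real 428 by 428 ones), and the grid keys carry the "model__" prefix that Pipeline expects for parameters of a named step. The key ideas are that the input is one 2D array of shape (n_samples, n_features) instead of a list of csr matrices, and that n_jobs stays at 1 so the verbose output shows up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (hypothetical): 60 "images" of 8x8 values in [0, 1].
rng = np.random.RandomState(0)
images = rng.rand(60, 8, 8)
labels = np.tile([0, 1], 30)  # binary target: 1 = object of interest, 0 = not

# Flatten each image into one row, giving a single 2D array of shape
# (n_samples, n_features) instead of a list of separate matrices.
X = images.reshape(len(images), -1)

train_x, test_x, train_y, test_y = train_test_split(
    X, labels, test_size=0.2, random_state=0, stratify=labels)

pipeline = Pipeline(steps=[
    ("model", BernoulliRBM(n_components=2, random_state=0)),
    ("clf", LogisticRegression()),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
parameters = {
    "model__learning_rate": [0.25, 0.5],
    "model__n_iter": [5, 10],
}

# n_jobs=1 keeps the search in the main process so progress is printed;
# cv is passed to GridSearchCV, not set on the RBM.
gscv = GridSearchCV(pipeline, parameters, cv=3, n_jobs=1, verbose=10)
gscv.fit(train_x, train_y)
print(gscv.best_params_)
```

With real images of 428 by 428 pixels each row would have 183184 features, so it is probably worth getting the search to run on a small subset like this first and only then scaling up.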