Can I have too many features in a logistic regression?




I’m building a model to predict pedestrian casualties on the streets of New York, from a data set of 1.7 million records. I decided to build dummy features out of the ON STREET NAME column, to see what predictive power that might provide. With that, I have approximately 7500 features.

I tried fitting the model on that, and immediately got an alert that the Jupyter kernel had died. I tried again, and the same thing happened. Considering how long the model takes to fit, and how hot the computer runs, when I fit on just 100 features, I can only assume that LogisticRegression() is not meant to handle such a feature set.

Two questions:

  1. Is that the case? Is logistic regression only meant to handle smaller feature sets?
  2. Is there some way to mitigate this and apply a logistic regression model to such a feature set?

Answer

You should at least provide a log, or an example we can reproduce, so that other people can diagnose the problem.
Side note: with 7500 features and 1.7 million rows, and assuming a 4-byte float for every element, you have about 48 GB of data there (twice that with 8-byte floats), so RAM will probably be a major issue.
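
To make the memory estimate concrete, here is a rough back-of-the-envelope calculation. The row and column counts come from the question; the sparse figure rests on an assumption the question does not state, namely that each record has exactly one non-zero dummy column (one street per record) and that a CSR-style layout with 4-byte indices is used.

```python
# Back-of-the-envelope memory estimate for the dummy-encoded matrix.
n_rows = 1_700_000   # records, from the question
n_features = 7_500   # columns after dummy encoding, from the question

GIB = 1024 ** 3
MIB = 1024 ** 2

def dense_bytes(rows, cols, bytes_per_value):
    """Size in bytes of a dense rows x cols array."""
    return rows * cols * bytes_per_value

dense_f32 = dense_bytes(n_rows, n_features, 4)  # 32-bit floats
dense_f64 = dense_bytes(n_rows, n_features, 8)  # 64-bit floats

# ASSUMPTION: one non-zero dummy per row. A CSR-style sparse matrix then
# stores one 8-byte value and one 4-byte column index per row, plus a
# 4-byte row pointer per row (and one extra pointer at the end).
sparse_csr = n_rows * (8 + 4) + (n_rows + 1) * 4

print(f"dense float32: {dense_f32 / GIB:.1f} GiB")
print(f"dense float64: {dense_f64 / GIB:.1f} GiB")
print(f"sparse (CSR) : {sparse_csr / MIB:.1f} MiB")
```

If the one-non-zero-per-row assumption roughly holds, keeping the dummies in a sparse structure rather than a dense array shrinks the matrix from tens of gigabytes to tens of megabytes.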

  1. Logistic regression is a very simple model. It can handle that number of features, but it is not meant for complex data, where its performance is underwhelming. Your crash is probably a memory problem: logistic regression is fit by iterative optimization (not by least squares), and the usual batch solvers need all of the data in RAM at once.
  2. For large datasets, use a (stochastic) gradient descent variant, which trains on the data in small batches and so lets you fit a logistic regression without holding everything in memory. With this much data, you could also use more complex models to get better results.
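
As an illustration of the gradient descent approach, the following is a minimal sketch of mini-batch SGD for logistic regression, written in plain Python so that every step is visible. It is not how any particular library implements it. The data source `make_batches`, the synthetic labelling rule, and all constants are hypothetical stand-ins for reading chunks of the real data from disk. Each row is represented only by the indices of its non-zero dummy columns, so memory use depends on the batch size rather than on the full dataset. (In scikit-learn, `SGDClassifier` with its `partial_fit` method serves a similar purpose.)

```python
import math
import random

N_STREETS = 20  # hypothetical number of dummy columns (the question has ~7500)

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def make_batches(n_rows, batch_size, seed):
    """Hypothetical data source yielding batches of (active_columns, label).

    Stands in for reading chunks from disk. Synthetic rule: the first half
    of the streets has a casualty with probability 0.8, the rest with 0.2.
    """
    rng = random.Random(seed)
    batch = []
    for _ in range(n_rows):
        street = rng.randrange(N_STREETS)
        p = 0.8 if street < N_STREETS // 2 else 0.2
        batch.append(([street], 1 if rng.random() < p else 0))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

def predict(w, b, active):
    """Predicted probability for a row whose non-zero dummies are `active`."""
    return sigmoid(b + sum(w[j] for j in active))

def train(epochs=5, rows_per_epoch=20_000, batch_size=200, lr=1.0):
    w = [0.0] * N_STREETS
    b = 0.0
    for epoch in range(epochs):
        # Only one batch is held in memory at any time.
        for batch in make_batches(rows_per_epoch, batch_size, seed=epoch):
            grad_w, grad_b = {}, 0.0
            for active, y in batch:
                err = predict(w, b, active) - y  # d(log loss) / d(logit)
                grad_b += err
                for j in active:
                    grad_w[j] = grad_w.get(j, 0.0) + err
            step = lr / len(batch)
            b -= step * grad_b
            for j, g in grad_w.items():  # only columns seen in this batch
                w[j] -= step * g
    return w, b

w, b = train()
print(f"P(casualty | street 0)  = {predict(w, b, [0]):.2f}")
print(f"P(casualty | street 19) = {predict(w, b, [N_STREETS - 1]):.2f}")
```

On this synthetic data the two printed probabilities should end up on opposite sides of 0.5, reflecting the rule used to generate the labels.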

Finally, feature reduction methods such as PCA, or some feature selection method, would probably help enough that you won’t need to change the model.
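
As one very simple example of reducing the feature count before dummy encoding, the sketch below keeps only the most frequent street names and folds the rest into a single catch-all category. This is a plain frequency cutoff, not PCA, and it is only one of many possible selection methods. The sample street values, the `k` threshold, and the helper names are made up for illustration.

```python
from collections import Counter

def top_k_categories(values, k):
    """Return the set of the k most frequent category values."""
    return {v for v, _ in Counter(values).most_common(k)}

def collapse_rare(values, keep, other="OTHER"):
    """Replace every value not in `keep` with a single catch-all label."""
    return [v if v in keep else other for v in values]

# Made-up sample standing in for the ON STREET NAME column.
streets = [
    "BROADWAY", "BROADWAY", "3 AVENUE", "BROADWAY",
    "3 AVENUE", "ELM PLACE", "OAK COURT",
]

keep = top_k_categories(streets, k=2)
reduced = collapse_rare(streets, keep)

print(sorted(set(streets)))  # distinct categories before
print(sorted(set(reduced)))  # distinct categories after
```

Dummy-encoding the reduced column yields at most `k + 1` features instead of one per distinct street, at the cost of losing any signal carried by the rare streets.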



Source: stackoverflow