Can I have too many features in a logistic regression?

Question

I'm building a model to predict pedestrian casualties on the streets of New York, from a data set of 1.7 million records. I decided to build dummy features out of the ON STREET NAME column, to see what predictive power that might provide. With that, I have approximately 7500 features. I tried running th…

Accepted Answer

You should at least provide a log, or an example we can reproduce, so that other people can diagnose the problem.

Side note: with 7500 features and 1.7 million rows, assuming a 4-byte float for every element, you have about 48 GB of data, so RAM will probably be a major issue.

Logistic regression is a very simple model. While it can handle this amount of data, it is not meant for complex data, and its performance there is underwhelming. Your crash is probably happening because training uses a least-squares-based solver (for example, iteratively reweighted least squares), which requires all the data to be in RAM.

For large datasets, use the gradient descent variant instead, which lets you train on the data in batches and still fit a logistic regression. With this much data, you could also use more complex models to get better results.

Finally, feature reduction methods like PCA, or some feature selection method, would probably help enough that you won't need to change the model.
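
To make the memory estimate and the gradient descent suggestion concrete, here is a minimal sketch in plain Python. It is not the asker's code: the helper names (`dense_size_gib`, `iter_batches`, `train_sgd`), the five-street synthetic data, and the learning rate and epoch count are all assumptions chosen for illustration. It relies on the fact that a dummy-encoded street column has exactly one non-zero entry per row, so each row can be stored as a single street index instead of 7500 floats, and the model can be updated one mini-batch at a time.

```python
import math
import random

def dense_size_gib(n_rows, n_cols, bytes_per_value=4):
    """Memory for a dense n_rows x n_cols matrix, in GiB."""
    return n_rows * n_cols * bytes_per_value / 2**30

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def iter_batches(rows, batch_size):
    # Stand-in for reading chunks from disk: only one batch is used at a time.
    for i in range(0, len(rows), batch_size):
        yield rows[i:i + batch_size]

def train_sgd(make_batches, n_streets, lr=0.5, epochs=20):
    """Mini-batch gradient descent for logistic regression on one dummy-encoded
    column. Each row is (street_index, label), i.e. a one-hot vector kept sparse."""
    w = [0.0] * n_streets  # one weight per street dummy
    b = 0.0                # intercept
    for _ in range(epochs):
        for batch in make_batches():
            grad_w = {}
            grad_b = 0.0
            for street, y in batch:
                err = sigmoid(w[street] + b) - y  # gradient of the log loss
                grad_w[street] = grad_w.get(street, 0.0) + err
                grad_b += err
            # Only the weights of streets seen in this batch are touched.
            for street, g in grad_w.items():
                w[street] -= lr * g / len(batch)
            b -= lr * grad_b / len(batch)
    return w, b

# Dense float32 matrix for the sizes in the question: about 47.5 GiB.
print(round(dense_size_gib(1_700_000, 7500), 1))

# Hypothetical synthetic data: street 0 is dangerous, streets 1-4 are not.
random.seed(0)
rows = []
for _ in range(2000):
    street = random.randrange(5)
    p_casualty = 0.9 if street == 0 else 0.1
    rows.append((street, 1 if random.random() < p_casualty else 0))

w, b = train_sgd(lambda: iter_batches(rows, 100), n_streets=5)
for street in range(5):
    print(street, round(sigmoid(w[street] + b), 2))
```

In a real project the same idea is usually expressed with a sparse matrix and a library solver that supports incremental fitting, but the sketch shows why neither the full dense matrix nor the full data set has to be in RAM at once.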

Answer