
Why does sklearn MinMaxScaler() return an out-of-range value instead of an error?

When using sklearn's MinMaxScaler(), I noticed some interesting behavior, shown in the following code.

>>> from sklearn.preprocessing import MinMaxScaler
>>> data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
>>> scaler = MinMaxScaler(feature_range=(0, 1))
>>> scaler.fit(data)
MinMaxScaler(copy=True, feature_range=(0, 1))
>>> test_data = [[-22, 20], [20.5, 26], [30, 40], [19, 13]]
>>> scaler.transform(test_data)
array([[-10.5   ,   1.125 ],
       [ 10.75  ,   1.5   ],
       [ 15.5   ,   2.375 ],
       [ 10.    ,   0.6875]])

I noticed that when I transform test_data with the fitted MinMaxScaler(), it returns values beyond the defined range (0 – 1).

Here I intentionally made test_data fall outside the value range of “data”, in order to test the output of MinMaxScaler().

I thought that when “test_data” contains a value beyond the value range of “data”, the scaler would raise some error. But this is not the case: instead, I got output values beyond the defined range.

My question is: why does the function exhibit this behavior (i.e. return output values beyond the defined range when the test_data values lie beyond the value range of the data on which MinMaxScaler was fitted), instead of raising an error?


Answer

MinMaxScaler throwing an error (and thus terminating program execution) in cases where the resulting (transformed) data are outside the feature_range provided during fitting would arguably be a bad & weird design choice.
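To see where the numbers come from: with feature_range=(0, 1), MinMaxScaler simply learns each column's minimum and maximum during fit and then applies the linear map (x - min) / (max - min) at transform time. Nothing in that formula clips the result, so values outside the fitted range map outside [0, 1]. A minimal check in plain NumPy, reproducing the array from the question:

import numpy as np

data = np.array([[-1, 2], [-0.5, 6], [0, 10], [1, 18]], dtype=float)
test_data = np.array([[-22, 20], [20.5, 26], [30, 40], [19, 13]], dtype=float)

col_min = data.min(axis=0)  # [-1.,  2.], learned during fit
col_max = data.max(axis=0)  # [ 1., 18.]

# The same linear map MinMaxScaler applies for feature_range=(0, 1);
# e.g. first column, first sample: (-22 - (-1)) / (1 - (-1)) = -10.5
manual = (test_data - col_min) / (col_max - col_min)
print(manual)  # matches scaler.transform(test_data) above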

Consider a scenario of a real-world pipeline processing some hundreds of thousands of incoming data samples on a periodic basis, with such a scaler being part of it. Imagine that the scaler did indeed throw an error and stop if any transformed feature fell outside the range [0, 1]. Now consider a case where, in a batch of, say, 500K data samples, there are just a couple whose features, after transformation, do come out of the [0, 1] range. So, the whole pipeline just breaks down…

Who might be happy in such a scenario? (tentative answer: nobody).

Could the responsible data scientist or ML engineer possibly claim “but why, this is the correct thing to do, since there are obviously bad data”? No, not by a long shot…


The notion of concept drift, i.e. the unforeseeable changes in the underlying distribution of streaming data over time, is a huge ML sub-topic of great practical interest and an area of intense research. The idea here (i.e. behind such functions not throwing errors in these cases) is that, if the modeler has reasons to believe that something like that might happen in practice (it almost always does), rendering their ML results largely useless, it is their own responsibility to deal with it explicitly in their deployed systems. Leaving such a serious job on the shoulders of a (humble…) scaling function would be largely inappropriate and, at the end of the day, a mistake.

Generalizing the discussion a bit: MinMaxScaler is just a helper function; the underlying assumption of using it (as with the whole of scikit-learn and similar libraries, in fact) is that we know what we are doing, and that we are not just mindless dummies randomly turning knobs and pressing buttons until our models seem to “work”. Should Keras warn us when we try something really meaningless, like requesting the classification accuracy in a regression problem? Well, it does not – a minimum of knowledge is certainly assumed to exist when using it, and we should not really expect the frameworks themselves to protect us from such mistakes in our own modeling.

Similarly here, it is our job to be aware of the possibility of getting out-of-range values for transformed new data, and to handle the situation accordingly; it is not the job of MinMaxScaler (or of any other similar transformer) to halt the process on our behalf.
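As an aside: if clipping to the feature range is genuinely what a use case calls for, more recent scikit-learn versions (0.24+, if I am not mistaken) expose exactly that behavior through the clip argument of MinMaxScaler. Notably, it is opt-in, consistent with the design philosophy described above:

from sklearn.preprocessing import MinMaxScaler

# clip=True is available in scikit-learn >= 0.24; transformed values
# falling outside feature_range are then clipped into it.
scaler = MinMaxScaler(feature_range=(0, 1), clip=True)
scaler.fit(data)
scaler.transform(test_data)  # all values now land within [0, 1]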


Returning to your own toy example, or to my own hypothetical one: it is always possible to integrate additional logic after the transformation of new data, so that such cases are handled accordingly (see the sketch below); even just checking which (and how many) samples are problematic is arguably infinitely easier after such a transformation than before (thus providing a very first, crude alert of possible concept drift). By not throwing an error (and thus halting the whole process), scikit-learn gives you, the modeler, all the options to proceed as you see fit, provided again that you know your stuff. Just throwing an error and refusing to continue would not be productive here, and the design choice of the scikit-learn developers seems highly justified.
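Here is a rough sketch of what such post-transformation logic could look like, assuming the original (non-clipping) fitted scaler from the question; the handling strategy at the end is deliberately left open:

import numpy as np

scaled = scaler.transform(test_data)  # the fitted scaler from the question

# Boolean mask of the samples with any feature outside the fitted range
lo, hi = scaler.feature_range
out_of_range = np.any((scaled < lo) | (scaled > hi), axis=1)
print(f"{out_of_range.sum()} of {len(scaled)} samples fall outside {scaler.feature_range}")

# From here it is the modeler's call: drop the flagged rows, clip them
# with np.clip(scaled, lo, hi), or treat the count as a crude
# concept-drift alarm for the pipeline.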

User contributions licensed under: CC BY-SA