When testing a PMML object previously converted from a Pickle file (dumped from a fitted sklearn object), I am unable to reproduce the same results as with the pickle model. With sklearn I obtain [0 1 0] as the predicted classes for the input X. However, with PMML the probabilities approximate to [1 1 1]. Is there anything I am doing wrong? This model behaviour is not what I would expect after converting the pickle file.
Answer
PMML models identify input columns by name, not by position. Therefore, one should not use “anonymized” data stores (such as numpy.ndarray), because, by definition, it is impossible to identify input columns correctly with them.
Here, the order of input columns is problematic, because the RF model is trained on a small set of randomly generated data (a 3 x 4 numpy.ndarray). It is likely that the RF model uses only one or two input columns and ignores the rest (say, “x1” and “x3” are significant, while “x2” and “x4” are not). During conversion, the SkLearn2PMML package removes all redundant features. In this example, the RF model would then expect a two-column input data store, where the first column corresponds to “x1” and the second to “x3”; you are passing a four-column input data store instead, so the second column is occupied by “x2”.
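A minimal sketch of the name-based setup this implies, assuming column names “x1”…“x4” and a RandomForestClassifier matching the scenario above (the conversion step itself, via SkLearn2PMML, is only indicated in a comment):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Name the columns explicitly instead of passing an anonymous numpy.ndarray
X = pd.DataFrame(np.random.rand(3, 4), columns=["x1", "x2", "x3", "x4"])
y = pd.Series([0, 1, 0], name="y")

rf = RandomForestClassifier(n_estimators=10, random_state=13)
rf.fit(X, y)

# Some columns may carry (almost) no importance; a converter such as
# SkLearn2PMML may drop them from the PMML model's input schema.
print(dict(zip(X.columns, rf.feature_importances_)))

# Predicting with a DataFrame keeps the name-based column mapping intact
print(rf.predict(X))
```

Because the DataFrame carries column names, any columns the converter eliminates can still be matched correctly at scoring time.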
TLDR: When working with PMML models, do the following:
- Don’t use anonymized data stores! Train your model using a pandas.DataFrame (instead of a numpy.ndarray), and also make predictions using the same data store class. This way the column mapping will always come out correct, even if the SkLearn2PMML package decided to eliminate some redundant columns.
- Use the JPMML-Evaluator-Python package instead of PyPMML. Specifically, stay away from PyPMML’s predict(X: numpy.ndarray) method!
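A sketch of scoring the converted file with JPMML-Evaluator-Python along these lines (the file name model.pmml and the input values are placeholders; the package requires a Java runtime, and the exact make_evaluator signature may vary between package versions):

```python
import pandas as pd
from jpmml_evaluator import make_evaluator

# Build and verify the evaluator from the converted PMML file
# ("model.pmml" is a placeholder path)
evaluator = make_evaluator("model.pmml").verify()

# Score a pandas.DataFrame so that columns are matched by name,
# not by position
arguments = pd.DataFrame(
    [[0.1, 0.2, 0.3, 0.4]],
    columns=["x1", "x2", "x3", "x4"],
)
results = evaluator.evaluateAll(arguments)
print(results)
```

Because the evaluator receives a DataFrame, it can look up “x1” and “x3” by name even if the PMML model's input schema no longer contains “x2” and “x4”.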