
Why is my validation accuracy so much lower when I switch from doing all in-memory learning to a data generator?

I have a data set that contains 2 columns:

1.) A string column, where each string is built from an alphabet of 21 different letters.

2.) A classification column: each of these strings is associated with a number from 1-7.

Using the following code, I first perform integer encoding.

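Schematically, that step looks like this (a minimal sketch assuming scikit-learn's `LabelEncoder` and pandas; the column names and the 21-letter alphabet shown here are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative column names: "sequence" holds the strings,
# "label" holds the class number (1-7).
df = pd.read_csv("data.csv")

# Fit an encoder on the 21-letter alphabet (hypothetical letters shown).
char_encoder = LabelEncoder()
char_encoder.fit(list("ACDEFGHIKLMNPQRSTVWXY"))

# Integer encode every string as an array of values in 0..20.
df["encoded"] = df["sequence"].apply(
    lambda s: char_encoder.transform(list(s)))
```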

With this code, I then one-hot encode the integer-encoded data, still entirely in memory.

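As a sketch (assuming fixed-length sequences and `keras.utils.to_categorical`):

```python
import numpy as np
from tensorflow.keras.utils import to_categorical

# Stack the integer-encoded sequences (assumes equal lengths).
X_int = np.stack(df["encoded"].to_numpy())

# (n_samples, seq_len) -> (n_samples, seq_len, 21) one-hot tensor.
X = to_categorical(X_int, num_classes=21)

# Shift labels 1-7 to 0-6 and one-hot encode them as well.
y = to_categorical(df["label"].to_numpy() - 1, num_classes=7)
```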

Then I train my learner like so.

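Roughly like this (the architecture below is a placeholder, since the real model isn't the point here; `validation_split` carves the validation set out of the in-memory arrays):

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Flatten

# Placeholder architecture standing in for the real model.
model = Sequential([
    Flatten(input_shape=X.shape[1:]),
    Dense(128, activation="relu"),
    Dense(7, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# Keras splits off part of the in-memory arrays for validation.
model.fit(X, y, batch_size=32, epochs=10, validation_split=0.2)
```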

This gives me validation accuracies of about 86%, which is decent for a first stab.

Even in the first epoch, the validation accuracy already reaches about 77%.

But because my dataset is relatively big, training ends up consuming 50+ GB of RAM: I load the entire dataset into memory and apply all of the data transformations to it there.

To train in a more memory-efficient way, I am introducing a data generator like so:

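In outline, the generator follows the pattern from the post linked below: a `keras.utils.Sequence` subclass that builds each one-hot batch on demand (the details here are illustrative):

```python
import numpy as np
from tensorflow.keras.utils import Sequence, to_categorical

class DataGenerator(Sequence):
    """Builds one-hot encoded batches on the fly instead of holding
    the whole transformed dataset in memory."""

    def __init__(self, dataframe, char_encoder, batch_size=32, shuffle=True):
        self.df = dataframe              # the split this generator serves
        self.char_encoder = char_encoder
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.on_epoch_end()

    def __len__(self):
        # Number of batches per epoch (note: batches, not samples).
        return len(self.df) // self.batch_size

    def __getitem__(self, index):
        rows = self.indexes[index * self.batch_size:
                            (index + 1) * self.batch_size]
        batch = self.df.iloc[rows]
        # Encode and one-hot only this batch.
        X = np.stack([
            to_categorical(self.char_encoder.transform(list(s)),
                           num_classes=21)
            for s in batch["sequence"]
        ])
        y = to_categorical(batch["label"].to_numpy() - 1, num_classes=7)
        return X, y

    def on_epoch_end(self):
        # Reshuffle the row order between epochs.
        self.indexes = np.arange(len(self.df))
        if self.shuffle:
            np.random.shuffle(self.indexes)
```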

The code was adapted from here: https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly

Learning is then triggered like so:

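Sketched as follows, where `train_df` and `val_df` stand for the training and validation splits of the dataframe (names are illustrative):

```python
train_gen = DataGenerator(train_df, char_encoder, batch_size=32)
val_gen = DataGenerator(val_df, char_encoder, batch_size=32)

# fit_generator is the older Keras API; recent versions accept
# Sequence objects directly in model.fit.
model.fit_generator(generator=train_gen,
                    validation_data=val_gen,
                    epochs=10,
                    use_multiprocessing=True,
                    workers=4)
```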

The problem is that my validation accuracies never exceed 15% when I use the data generator; the first epoch reaches a validation accuracy of only about 9%.

My question is: why is this happening? One thing I cannot explain is this:

When I do all in-memory learning, I set the batch size to 32 or 64, but the per-epoch progress counter still runs up to roughly 413k (the total number of training samples). When I use the data generator, the counter is much smaller, roughly 413k samples divided by the batch size. Does this mean I am not really using the batch size parameter in the in-memory case? Explanations appreciated.


Answer

A series of stupid errors caused this discrepancy, and they are all located in this one line:

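The exact line isn't reproduced here, but given the two errors described below, the corrected batch-construction line in the `DataGenerator` sketch above looks roughly like this (illustrative reconstruction):

```python
# Fix 1: read from self.df, the dataframe passed into the generator,
#        instead of a hard-coded reference to the training dataframe.
# Fix 2: encode the raw string column exactly once, rather than running
#        the integer encoder over the already-encoded column again.
X = np.stack([
    to_categorical(self.char_encoder.transform(list(s)), num_classes=21)
    for s in self.df["sequence"].iloc[rows]
])
```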

Error 1: The generator should take the dataframe it is supposed to process as a parameter, so that I can feed in either the training or the validation data. The way I did this before, even when I thought I was passing in validation data, the generator still used the training data.

Error 2: I was integer encoding my data twice (duh!).
