
Keras – Hyper Tuning the initial state of the model

I’ve written an LSTM model that predicts sequential data.

import tensorflow as tf
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import LSTM, Dense, Dropout, BatchNormalization
from tensorflow.keras.optimizers import Adam

# `get_deep(config, 'a.b')` is my helper for reading nested config values,
# e.g. get_deep(config, 'hp.learning_rate') -> config['hp']['learning_rate']
def get_model(config, num_features, output_size):
    opt = Adam(learning_rate=get_deep(config, 'hp.learning_rate'), beta_1=get_deep(config, 'hp.beta_1'))

    inputs = Input(shape=[None, num_features], dtype=tf.float32, ragged=True)
    layers = LSTM(get_deep(config, 'hp.lstm_neurons'), activation=get_deep(config, 'hp.lstm_activation'))(
        inputs.to_tensor(), mask=tf.sequence_mask(inputs.row_lengths()))

    layers = BatchNormalization()(layers)
    if 'dropout_rate' in config['hp']:
        layers = Dropout(get_deep(config, 'hp.dropout_rate'))(layers)

    for layer in get_deep(config, 'hp.dense_layers'):
        layers = Dense(layer['neurons'], activation=layer['activation'])(layers)
        layers = BatchNormalization()(layers)
        if 'dropout_rate' in layer:
            layers = Dropout(layer['dropout_rate'])(layers)

    layers = Dense(output_size, activation='sigmoid')(layers)
    model = Model(inputs, layers)
    model.compile(loss='mse', optimizer=opt, metrics=['mse'])
    model.summary()
    return model

I’ve tuned some of the layer parameters using AWS SageMaker. While validating the model, I ran it with a specific configuration several times. Most of the time the results were similar; however, one run was much better than the others, which led me to think that the initial state of the model is probably crucial for getting the best performance.

As suggested in this video, weight initialization can provide some performance boost. I’ve googled around and found layer weight initializers, but I’m not sure what ranges I should tune.


Update: As suggested in some of the comments / answers I’m using a fixed seed to “lock” the model results:

import random
import numpy as np
import tensorflow as tf

seed_value = 42  # any fixed value; 42 here is just an example
# Set the `python` built-in pseudo-random generator to a fixed value
random.seed(seed_value)
# Set the `numpy` pseudo-random generator to a fixed value
np.random.seed(seed_value)
# Set the `tensorflow` pseudo-random generator to a fixed value
tf.random.set_seed(seed_value)

The results now replicate for each new training run; however, different seeds can produce much better results than others. So how do I find/tune the best seed?


Answer

… which led me to think that the initial state of the model is probably crucial for getting the best performance. … As suggested in this video, weight initialization can provide some performance boost. I’ve googled around and found layer weight initializers, but I’m not sure what ranges I should tune.

Firstly, in that video, apart from the state or weight initializer, all the other factors such as the learning rate, schedule, optimizer, batch size, loss function, model depth, etc. are things you should experiment with to find the best set (we will talk about the role of the seed later). Normally, we don’t need to tune the default weight or state initializer, as the current defaults are already among the best choices; and, as usual, state initialization is a research problem of its own.

Secondly, in tf.keras, the default weight initializer for Convolution, Dense and RNN-GRU/LSTM layers is glorot_uniform, also known as the Xavier uniform initializer, and the default bias initializer is zeros. If you follow the source code of LSTM (in your case), you will find them. According to the doc:

Draws samples from a uniform distribution within [-limit, limit], where limit = sqrt(6 / (fan_in + fan_out)) (fan_in is the number of input units in the weight tensor and fan_out is the number of output units).
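For instance, as a quick sanity check (a minimal sketch; the layer shape here is arbitrary), you can compute that limit yourself and confirm that GlorotUniform samples stay inside it:

import numpy as np
import tensorflow as tf

fan_in, fan_out = 128, 64                      # e.g. a Dense layer mapping 128 inputs to 64 units
limit = np.sqrt(6.0 / (fan_in + fan_out))      # the [-limit, limit] bound from the doc

initializer = tf.keras.initializers.GlorotUniform(seed=101)
weights = initializer(shape=(fan_in, fan_out))

print(limit)                                   # ~0.1768
print(float(tf.reduce_max(tf.abs(weights))))   # always <= limit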

Now, you may have already noticed that this initializer inherits from VarianceScaling; and, like GlorotUniform, others such as GlorotNormal, LecunNormal, LecunUniform, HeNormal and HeUniform also inherit from it. The supported parameters of VarianceScaling are listed here. For example, technically, the following two are the same:

# In case you want to try various initializers,
# use VarianceScaling by passing the proper parameters,
# i.e. tf.keras.layers.LSTM(..., kernel_initializer=initializer),
# but it is recommended to stick with glorot_uniform (the default).
initializer = tf.keras.initializers.VarianceScaling(scale=1., 
                                                    mode='fan_avg', seed=101,
                                                    distribution='uniform')
print(initializer(shape=(2, 2)))


initializer = tf.keras.initializers.GlorotUniform(seed=101)
print(initializer(shape=(2, 2)))

tf.Tensor(
[[-1.0027379  1.0746485]
 [-1.2234    -1.1489409]], shape=(2, 2), dtype=float32)
tf.Tensor(
[[-1.0027379  1.0746485]
 [-1.2234    -1.1489409]], shape=(2, 2), dtype=float32)
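If you do want to pass an explicit (seeded) initializer into a model like the one in the question, it goes through the layer’s kernel_initializer argument (and, for LSTM, recurrent_initializer, whose default is orthogonal). A minimal sketch, with placeholder unit counts standing in for the tuned values:

import tensorflow as tf

seed_value = 101  # placeholder seed

kernel_init = tf.keras.initializers.VarianceScaling(
    scale=1., mode='fan_avg', distribution='uniform', seed=seed_value)

lstm = tf.keras.layers.LSTM(
    64,  # placeholder for hp.lstm_neurons
    kernel_initializer=kernel_init,
    recurrent_initializer=tf.keras.initializers.Orthogonal(seed=seed_value))

dense = tf.keras.layers.Dense(32, kernel_initializer=kernel_init)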

In short, you can play with tf.keras.initializers.VarianceScaling (see the bottom of that page). Additionally, you can make your own initializer by defining a callable function or by subclassing the Initializer class. For example:

def conv_kernel_initializer(shape, dtype=None):
  # He-style normal init for conv kernels of shape (kh, kw, in_ch, out_ch),
  # with the standard deviation scaled by the kernel's fan-out.
  kernel_height, kernel_width, _, out_filters = shape
  fan_out = int(kernel_height * kernel_width * out_filters)
  return tf.random.normal(
      shape, mean=0.0, stddev=np.sqrt(2.0 / fan_out), dtype=dtype)

def dense_kernel_initializer(shape, dtype=None):
  # Uniform init in [-1/sqrt(units), 1/sqrt(units)] for dense kernels
  # of shape (input_dim, units).
  init_range = 1.0 / np.sqrt(shape[1])
  return tf.random.uniform(shape, -init_range, init_range, dtype=dtype)
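And if you prefer the subclassing route, a sketch could look like the following (RandomNormalScaled is just an illustrative name); either form is then attached to a layer via kernel_initializer:

import numpy as np
import tensorflow as tf

class RandomNormalScaled(tf.keras.initializers.Initializer):
  """Illustrative initializer: normal samples with stddev sqrt(2 / fan_out)."""

  def __init__(self, seed=None):
    self.seed = seed

  def __call__(self, shape, dtype=None):
    fan_out = int(shape[-1])
    stddev = np.sqrt(2.0 / fan_out)
    return tf.random.normal(shape, mean=0.0, stddev=stddev,
                            dtype=dtype or tf.float32, seed=self.seed)

  def get_config(self):  # so a model using it can be saved and reloaded
    return {'seed': self.seed}

# Either form is passed the same way:
dense_a = tf.keras.layers.Dense(32, kernel_initializer=dense_kernel_initializer)
dense_b = tf.keras.layers.Dense(32, kernel_initializer=RandomNormalScaled(seed=101))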

Here is one good article about initializing the weights that you may enjoy reading. But again, it is better to go with the defaults.

Thirdly, for setting different seed values, different sets of hyper-parameters, etc., I’d better leave one of my old answers here; the first diagram in particular will probably come in handy for your experiments. One approach that I follow is to keep the seed the same (say, for the first 5 experiments), change one other factor at a time, and log the results. After those 5 iterations we will hopefully have found a good set, and we can then push further from there.
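A rough sketch of that bookkeeping (train_and_evaluate is a hypothetical helper standing in for whatever builds, trains and scores your model; only the learning rate varies here while the seed stays fixed):

import random

import numpy as np
import tensorflow as tf

def run_experiments(learning_rates, seed_value=42):
    """Keep the seed fixed, vary one factor at a time, and log the results."""
    results = []
    for lr in learning_rates:
        # Same seed for every run, so differences come from `lr`, not from randomness.
        random.seed(seed_value)
        np.random.seed(seed_value)
        tf.random.set_seed(seed_value)

        score = train_and_evaluate(lr)  # hypothetical helper
        results.append({'learning_rate': lr, 'seed': seed_value, 'score': score})
    return results

# e.g. run_experiments([1e-2, 1e-3, 1e-4])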


Update

Find/Tune Seed. Before searching for a method to find the best seed, one must understand that the seed is not a hyper-parameter that needs to be tuned together with the other hyper-parameters such as the learning rate, scheduler, optimizer, etc.

Here is one scenario: let’s say you split the data randomly into two parts with seed 42, a train set (70%) and a test set (30%); after training on the train set, you evaluate the model on the test set and receive a score of 80. Then you change your seed to 101, do the same again, and now get a score of 50. This doesn’t mean that picking seed 42 is better; it simply means that your model is unstable and will most likely not do well on unseen data. This is actually a well-known issue when someone randomly splits their data set for training and testing; please also check the two closely related discussions on this topic. Why does it happen? Because when you split the data randomly, it’s possible that there is a mismatch in the class distribution between the two parts.
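A small illustration of that point (a sketch assuming scikit-learn and a hypothetical build_and_score helper): rather than hunting for the seed that gives the best single score, evaluate over several split seeds and look at the spread; a stratified split keeps the class distribution consistent between the two parts.

import numpy as np
from sklearn.model_selection import train_test_split

def estimate_stability(X, y, seeds=(0, 7, 42, 101, 2021)):
    """Score the same model over several random splits and report the spread."""
    scores = []
    for seed in seeds:
        # `stratify=y` keeps the class distribution similar in both parts
        # (meaningful for classification targets).
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.3, random_state=seed, stratify=y)
        scores.append(build_and_score(X_train, y_train, X_test, y_test))  # hypothetical helper
    return np.mean(scores), np.std(scores)  # a large std means the setup is unstable

If the spread is large, the fix is a better split (e.g. stratification) or a more robust model, not a “lucky” seed.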
