How to initialize your bias.
February 24, 2023 · 9 mins · 1638 words
tldr
Correctly initializing the bias of the last layer of your network can speed up the training process. In this post, I first show how to derive the best bias values analytically, and then I run an experiment to show the impact of using the correct bias.
In particular, the best biases are the following (a short sketch of how to compute them in practice follows the list):
- Classification problem with $K$ classes with frequencies $p_1, \dots, p_K$, such that $\sum_{i=1}^{K} p_i = 1$, using softmax activation and categorical cross-entropy loss: $b_i = \log p_i$
- Regression problem using $L_2$ penalization and linear activation: $b = \mathbb{E}[y]$, the mean of the targets
- Regression problem using $L_1$ penalization and linear activation: $b = \operatorname{median}(y)$, the median of the targets
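Below is a minimal numpy sketch (mine, not taken from the post’s notebook) of how these three values can be computed from the training labels and targets; `y_train_cls` and `y_train_reg` are placeholder arrays.

```python
import numpy as np

# Classification: bias_i = log(p_i), with p_i the frequency of class i
y_train_cls = np.random.randint(0, 10, size=50_000)   # placeholder integer labels
p = np.bincount(y_train_cls) / len(y_train_cls)       # class frequencies, sum to 1
bias_classification = np.log(p)                       # one bias per class

# Regression with L2 (MSE) loss: bias = mean of the targets
y_train_reg = np.random.lognormal(size=50_000)        # placeholder targets
bias_mse = y_train_reg.mean()

# Regression with L1 (MAE) loss: bias = median of the targets
bias_mae = np.median(y_train_reg)
```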
Motivation
These last weeks at work I’ve been tuning a neural network that is used to predict arrival times. Basically, the network receives a representation of Stuart’s platform state (where the drivers are, where the packages are, etc.) and outputs the estimated time of arrival of some drivers. We decided to use a deep learning approach to avoid doing boring and unmaintainable feature engineering, but the problem then was to choose the model architecture. If we were solving an image classification problem it would have been trivial to design the architecture; in fact, we wouldn’t need to design anything, just take ResNet50 and fine-tune it. However, our problem is not standard in the deep learning world, so we couldn’t rely on pre-trained models or copy the architecture of previously successful models. We ended up defining an architecture based on convolutions, self-attention, and some dense layers here and there. The results were pretty good (it beat the previous model by +30%), the model was deployed, and everyone was happy.
However, not everything is always that easy, and at some point we noticed that our model was overfitting. This wasn’t surprising, since the model architecture and training process had never been tuned. We just took our initial idea, ran some experiments, changed some hyper-params by hand, and called it a day. But now that the model is deployed and the stakeholders are happy, we are working on tuning the model and making it more competitive. To do so, I started with the great post by the great Karpathy here. It’s not the first time I’ve read it, but this time one of the points especially caught my attention.
verify loss @ init. Verify that your loss starts at the correct loss value. E.g. if you initialize your final layer correctly you should measure -log(1/n_classes) on a softmax at initialization. The same default values can be derived for L2 regression, Huber losses, etc.
What does Karpathy mean by verifying that your loss starts at the correct value? How can we achieve the -log(1/n_classes) loss on a softmax? What are the corresponding initializations for L2 regression, Huber loss, etc.? In this post, I’ll show how to initialize the network to fulfil these requirements and what the implications are.
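Before diving in, here is a quick PyTorch sketch (my own check, not from Karpathy’s post) of the sanity check he describes: for a balanced 10-class problem, the loss of a freshly initialized softmax classifier should be close to $-\log(1/10) \approx 2.303$.

```python
import math
import torch
import torch.nn as nn

n_classes, n_features, batch = 10, 32, 256
model = nn.Linear(n_features, n_classes)        # default init: small random weights and biases
x = torch.randn(batch, n_features)
y = torch.randint(0, n_classes, (batch,))       # balanced random labels

loss = nn.CrossEntropyLoss()(model(x), y)       # applies softmax + cross entropy on the logits
print(f"loss at init: {loss.item():.3f}  expected: {-math.log(1 / n_classes):.3f}")
```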
Problem statement
We want to solve the following problem:
What is the best initialization scheme for our network layers?
This is a broad question that has been addressed in many works, such as Glorot (Glorot & Bengio, 2010) and He (He et al., 2015). In these works, the authors initialize the weights of the layers by sampling from a distribution with carefully chosen parameters. For instance, Glorot proposes to sample the weights from $\mathcal{U}\left[-\sqrt{6/(n_{in}+n_{out})},\ \sqrt{6/(n_{in}+n_{out})}\right]$, where $n_{in}$ and $n_{out}$ are the number of input and output units of the layer.
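To make the formula concrete, here is a tiny numpy sketch of sampling a weight matrix from the Glorot uniform distribution (the function name is mine):

```python
import numpy as np

def glorot_uniform(n_in: int, n_out: int, rng=np.random.default_rng()):
    """Sample an (n_in, n_out) matrix from U[-sqrt(6/(n_in+n_out)), +sqrt(6/(n_in+n_out))]."""
    limit = np.sqrt(6 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

W = glorot_uniform(128, 64)   # weights for a 128 -> 64 dense layer
```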
In this post, however, I’m interested in a narrower question:
What is the best initialization scheme for the last layer of our network?
Solution
In this section, I will answer the above question for the two most common settings: classification and regression.
Classification
Let’s start with a classification problem with $K$ classes that appear in the data with frequencies $p_1, \dots, p_K$. We can define a neural network of depth $d$ as

$$\hat{p}(x) = \operatorname{softmax}(W_d\, h(x) + b_d)$$

where $h(x)$ is the output of the first $d-1$ layers, $W_d$ and $b_d$ are the weights and biases of the last layer, and $\hat{p}_i(x)$ is the probability assigned to class $i$. With categorical cross-entropy, the loss of a sample with true class $i$ is $-\log \hat{p}_i(x)$.

Cool, we have our first result; let’s see now how we can use it to optimize the initial values of $b_d$. At initialization the expected loss is

$$\mathbb{E}[\mathcal{L}_0] = -\sum_{i=1}^{K} p_i \log \hat{p}_i$$

where

$$\hat{p}_i = \frac{e^{b_i}}{\sum_{j=1}^{K} e^{b_j}}$$

and where we have used that at initialization the weights $W_d$ are sampled from a zero-mean distribution with small variance, so $W_d\, h(x) \approx 0$ and the predictions don’t depend on the input. This cross-entropy is minimized when the predicted probabilities match the class frequencies, $\hat{p}_i = p_i$, and its minimum value is the entropy of the class distribution, $-\sum_{i=1}^{K} p_i \log p_i$, which reduces to $-\log(1/K)$ when the classes are balanced.

Nice, now we know which value to expect for the loss of a correctly initialized last layer, but we still need to know how to set $b_d$ to get there. Now, using that the last layer is a softmax, we need to solve

$$\frac{e^{b_i}}{\sum_{j=1}^{K} e^{b_j}} = p_i$$

which has the solution, up to an additive constant,

$$b_i = \log p_i$$

Therefore, setting $b_i = \log p_i$ starts the training at the lowest loss achievable before the network has looked at the features, i.e. the entropy of the class distribution.
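In practice this just means overwriting the bias of the last layer after building the model. Here is a PyTorch sketch (the post’s notebook may use a different framework; `class_freqs` is a placeholder for the empirical $p_i$):

```python
import torch
import torch.nn as nn

n_classes, n_hidden = 10, 128
class_freqs = torch.full((n_classes,), 1 / n_classes)   # replace with the real class frequencies p_i

last_layer = nn.Linear(n_hidden, n_classes)             # outputs logits; softmax lives in the loss
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(32 * 32 * 3, n_hidden),
    nn.ReLU(),
    last_layer,
)

with torch.no_grad():
    last_layer.bias.copy_(torch.log(class_freqs))        # b_i = log(p_i)
```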
Regression
In the last section, I showed how to derive the optimal biases at initialization for a classification problem. In this section, I’ll show how to do the same for a regression problem. The main differences between the two settings are (1) the loss we are using, (2) the last layer activation, and (3) the dimension of the output. In regression, the output is usually 1-dimensional, i.e. we’re just predicting one value, so the network is

$$\hat{y}(x) = w_d \cdot h(x) + b_d$$

and the usual losses are the mean squared error ($L_2$) and the mean absolute error ($L_1$)

$$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{N}\sum_{i=1}^{N} (y_i - \hat{y}_i)^2, \qquad \mathcal{L}_{\mathrm{MAE}} = \frac{1}{N}\sum_{i=1}^{N} |y_i - \hat{y}_i|$$

Using the same rationale as before, we want to minimize these losses at initialization, when the prediction is approximately the constant $b_d$. It’s known that, without any further information, the constant that minimizes the MSE is the mean of the targets

$$b_d = \mathbb{E}[y]$$

and the constant that minimizes the MAE is the median

$$b_d = \operatorname{median}(y)$$

The expected loss at initialization for the MSE is then the variance, since

$$\mathbb{E}\left[(y - \mathbb{E}[y])^2\right] = \operatorname{Var}(y)$$

and for the MAE it is

$$\mathbb{E}\left[\,|y - \operatorname{median}(y)|\,\right]$$

which I don’t know if it has a specific name (it’s sometimes called the mean absolute deviation around the median).
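A quick numpy check of these two statements (with placeholder targets, not the post’s data): with the bias set to the mean, the initial MSE equals the variance of the targets; with the bias set to the median, the initial MAE equals the mean absolute deviation around the median.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.lognormal(size=100_000)                  # placeholder skewed targets

b_mse = y.mean()                                 # optimal constant prediction under MSE
b_mae = np.median(y)                             # optimal constant prediction under MAE

print("MSE at init:", np.mean((y - b_mse) ** 2), "| variance:", y.var())
print("MAE at init:", np.mean(np.abs(y - b_mae)))
```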
In the original post, Karpathy says that you can also derive the optimal value for the Huber loss. However, unlike with MAE and MSE, there’s no closed form for the value that minimizes the Huber loss (explanation here). We can, however, obtain the value that minimizes the Huber loss on our dataset numerically and then use it as the bias of the last layer.
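Here is a sketch of that numerical approach (my own; the delta value and names are placeholders): minimize the average Huber loss of a constant prediction over the training targets, e.g. with scipy.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def huber(residual, delta=1.0):
    """Elementwise Huber loss: quadratic for |r| <= delta, linear beyond."""
    abs_r = np.abs(residual)
    return np.where(abs_r <= delta, 0.5 * abs_r**2, delta * (abs_r - 0.5 * delta))

rng = np.random.default_rng(0)
y = rng.lognormal(size=100_000)                         # placeholder targets

# Find the constant b that minimizes the mean Huber loss on y and use it as the bias.
result = minimize_scalar(lambda b: huber(y - b).mean(),
                         bounds=(y.min(), y.max()), method="bounded")
huber_bias = result.x
```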
Results
In the previous sections, I explained how to determine the best initial bias through mathematical analysis. However, in the real world, things are not always precise, and data can show that our assumptions were incorrect. In this section, I will conduct some experiments to see the impact of initializing biases correctly.
To conduct these experiments, I will use the CIFAR-10 dataset. I have made the problem unbalanced by subsampling the classes so that they appear with different frequencies. Then, I created two CNN networks: one with the optimal bias strategy defined above and another with the standard initialization. You can find the code used to generate the models and datasets in this notebook.
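The exact subsampling code lives in the notebook; as an illustration, here is one possible way to build an unbalanced version of a dataset (a sketch, not the notebook’s code):

```python
import numpy as np

def make_unbalanced(x, y, class_fracs, seed=0):
    """Keep a different fraction of the samples of each class to create imbalance."""
    rng = np.random.default_rng(seed)
    keep = []
    for cls, frac in enumerate(class_fracs):
        idx = np.where(y == cls)[0]
        keep.append(rng.choice(idx, size=int(frac * len(idx)), replace=False))
    keep = np.concatenate(keep)
    return x[keep], y[keep]

# e.g. keep 100% of class 0 down to 10% of class 9
fracs = np.linspace(1.0, 0.1, 10)
# x_unbal, y_unbal = make_unbalanced(x_train, y_train, fracs)
```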
The results are summarized in the following plot. We can see that the network with the optimized initial bias learns faster than the one with the standard initialization. The effect disappears if we train for enough epochs, but training large models is often costly, so if we can save time and money just by setting the correct bias, it is worth doing.