from fastai.vision.all import *
Maximizing the likelihood of the labels of the data
Suppose we have \(N\) training examples and a multi-class problem such that each training example belongs to one and only one of \(K\) possible classes. Let \(C(i) \in \{1,\ldots, K\}\) be the correct class for the \(i\)-th training example and let \(o^{[C(i)]}_{i}\) be the probability assigned by a classifier to the correct class for the \(i\)-th training example. We want this classifier to maximize: \[\prod_{i=1}^{N} o^{[C(i)]}_{i}\]
If the classifier assigns a probability of \(1\) to the correct class for \(N-1\) of the training examples but a probability of \(0\) to the correct class of the \(N\)-th example, then the entire product shown above becomes zero. So to maximize this product of probabilities, the classifier has to assign a high probability to the correct class for each and every training example.
Now, since the logarithm is a monotonically increasing function, maximizing the product is equivalent to maximizing \[ln(\prod_{i=1}^{N}o^{[C(i)]}_{i}) = \sum_{i=1}^{N}ln(o^{[C(i)]}_{i})\]
This is the same as minimizing the sum of the negative log likelihoods \[-\sum_{i=1}^{N}ln(o^{[C(i)]}_{i})\]
The above can now serve as a loss function for an optimization routine.
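As a quick sanity check (using three made-up probabilities that are not part of the worked example below), the product of the probabilities and the exponential of the negated sum of negative log likelihoods agree, so driving the sum of negative log likelihoods down is the same as driving the product up:
probs = torch.tensor([0.9, 0.8, 0.7])   # hypothetical probabilities of the correct class for 3 examples
product = probs.prod()                  # 0.504
nll = -torch.log(probs).sum()           # sum of negative log likelihoods, ~0.6852
product, torch.exp(-nll)                # both ~0.504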
Recall that Cross Entropy = \(-\sum_{k=1}^{K}y^{[k]}ln(o^{[k]})\), where \(y\) is the reference distribution over the \(K\) classes and \(o\) is our predicted distribution over the \(K\) classes. Observe that this summation collapses to a single term when the reference distribution \(y\) is one-hot, i.e. when exactly one class has a probability of \(1\) and the rest have \(0\).
Thus \(-\sum_{i=1}^{N}ln(o^{[C(i)]}_{i})\) can be interpreted as the sum of cross entropy losses across all examples.
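For a small made-up illustration with \(K=3\) classes (these numbers are not from the worked example below): when the reference distribution is one-hot, summing over all \(K\) classes gives exactly the negative log probability of the correct class:
y = torch.tensor([0., 1., 0.])      # one-hot reference distribution: class 1 is correct
o = torch.tensor([0.2, 0.7, 0.1])   # predicted probabilities over the 3 classes
-(y * torch.log(o)).sum(), -torch.log(o[1])   # both equal -ln(0.7) ~ 0.3567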
Pretend the following are the activations from a multiclass classification problem with two classes. We have 6 examples, and each row holds one activation per class that the example could belong to.
activations = torch.randn((6,2))*2
activations
tensor([[-1.6453, 1.8893],
[ 1.9800, 1.7681],
[ 2.8183, 4.6643],
[-0.3635, -0.0614],
[ 0.4064, -0.4668],
[-3.3801, 3.2484]])
Suppose the correct class of each example is as follows
targets = tensor([0,1,0,1,1,0])
targets
tensor([0, 1, 0, 1, 1, 0])
Take the softmax of the activations
sm_acts = torch.softmax(activations, dim=1)
sm_acts
tensor([[0.0283, 0.9717],
[0.5528, 0.4472],
[0.1363, 0.8637],
[0.4250, 0.5750],
[0.7054, 0.2946],
[0.0013, 0.9987]])
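As a sketch of what torch.softmax does for each row (exponentiate, then normalize by the row sum), the same numbers can be reproduced manually:
exp_acts = torch.exp(activations)
manual_sm = exp_acts / exp_acts.sum(dim=1, keepdim=True)   # e^z divided by the sum of e^z per row
torch.allclose(manual_sm, sm_acts)                         # True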
Extract the probabilities predicted for the correct class.
idx = range(6)
list(idx)
[0, 1, 2, 3, 4, 5]
p_correct_class = sm_acts[idx, targets]
p_correct_class
tensor([0.0283, 0.4472, 0.1363, 0.5750, 0.2946, 0.0013])
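The indexing sm_acts[idx, targets] picks, for each row, the column given by that row's target class. For reference, the same selection can be written with torch.gather or an explicit loop (the names via_gather and via_loop are introduced here just for illustration):
via_gather = sm_acts.gather(1, targets.unsqueeze(1)).squeeze(1)
via_loop = torch.stack([sm_acts[i, targets[i]] for i in range(len(targets))])
torch.allclose(via_gather, p_correct_class), torch.allclose(via_loop, p_correct_class)   # (True, True)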
Take the log of the softmax activations
torch.log(sm_acts)
tensor([[-3.5634e+00, -2.8753e-02],
[-5.9281e-01, -8.0469e-01],
[-1.9925e+00, -1.4659e-01],
[-8.5559e-01, -5.5344e-01],
[-3.4895e-01, -1.2222e+00],
[-6.6298e+00, -1.3213e-03]])
Computing the softmax of the activations and then taking the log is equivalent to applying PyTorch's log_softmax function directly to the original activations. We want to do the latter because it is faster and more numerically accurate.
torch.log_softmax(activations, dim=1)
tensor([[-3.5634e+00, -2.8753e-02],
[-5.9281e-01, -8.0469e-01],
[-1.9925e+00, -1.4659e-01],
[-8.5559e-01, -5.5344e-01],
[-3.4895e-01, -1.2222e+00],
[-6.6298e+00, -1.3213e-03]])
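The accuracy difference shows up with large activations. In the made-up example below, the softmax probability of the second class underflows to zero, so taking its log gives -inf, while log_softmax (which works in log space) returns the correct finite value:
big = torch.tensor([[200., 0.]])
torch.log(torch.softmax(big, dim=1))   # tensor([[0., -inf]])
torch.log_softmax(big, dim=1)          # tensor([[0., -200.]])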
Let’s compute the mean of cross entropy losses across the training examples:
-1*torch.log(p_correct_class), (-1*torch.log(p_correct_class)).mean()
(tensor([3.5634, 0.8047, 1.9925, 0.5534, 1.2222, 6.6298]), tensor(2.4610))
We can just use PyTorch to compute this directly:
nn.CrossEntropyLoss(reduction='none')(activations, targets), nn.CrossEntropyLoss()(activations, targets)
(tensor([3.5634, 0.8047, 1.9925, 0.5534, 1.2222, 6.6298]), tensor(2.4610))
or by using:
F.cross_entropy(activations, targets, reduction='none'), F.cross_entropy(activations, targets)
(tensor([3.5634, 0.8047, 1.9925, 0.5534, 1.2222, 6.6298]), tensor(2.4610))
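Cross entropy in PyTorch is log_softmax followed by the negative log likelihood, so the same per-example losses and mean also come out of F.nll_loss applied to the log-softmax of the activations:
log_sm = F.log_softmax(activations, dim=1)
F.nll_loss(log_sm, targets, reduction='none'), F.nll_loss(log_sm, targets)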
Gradient of Cross Entropy
We follow the exposition in [1].
Let \(z^{[1]},\ldots, z^{[K]}\) denote the activations corresponding to the \(K\) classes. The softmax activation for each class is given by:
\[o^{[j]} = \frac{e^{z^{[j]}}}{\sum_{l=1}^{K} e^{z^{[l]}}}\]
The cross-entropy loss across the \(K\) classes is given by:
\[E=-\sum_{l=1}^{K}y^{[l]}ln(o^{[l]})\]
Partial derivative of \(o^{[j]}\) with respect to \(z^{[i]}\) for \(j \neq i\)
\[ \frac{\partial}{\partial z^{[i]}} o^{[j]} = \frac{\partial}{\partial z^{[i]}} \frac{e^{z^{[j]}}}{\sum_l e^{z^{[l]}}} = e^{z^{[j]}} \frac{\partial}{\partial z^{[i]}} \Bigg(\sum_l e^{z^{[l]}} \Bigg)^{-1} \\ \qquad = -e^{z^{[j]}} \Bigg(\sum_l e^{z^{[l]}} \Bigg)^{-2} e^{z^{[i]}} = -o^{[j]} \cdot o^{[i]} \]
Partial derivative of \(o^{[i]}\) with respect to \(z^{[i]}\)
\[ \frac{\partial}{\partial z^{[i]}} o^{[i]} = \frac{\partial}{\partial z^{[i]}} \frac{e^{z^{[i]}}}{\sum_l e^{z^{[l]}}} = \frac{e^{z^{[i]}}}{\sum_{l} e^{z^{[l]}}} + e^{z^{[i]}} \frac{\partial}{\partial z^{[i]}} \Bigg(\sum_l e^{z^{[l]}} \Bigg)^{-1}\\ \quad \qquad \qquad \qquad = o^{[i]}-e^{z^{[i]}} \Bigg(\sum_l e^{z^{[l]}} \Bigg)^{-2} e^{z^{[i]}} = o^{[i]} - o^{[i]} \cdot o^{[i]} = o^{[i]} \cdot (1 - o^{[i]}) \]
Let’s compute the gradient of the cross-entropy loss with respect to the activation \(z^{[i]}\) of the \(i\)-th class, using the two partial derivatives above:
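\[ \frac{\partial E}{\partial z^{[i]}} = -\sum_{l=1}^{K} y^{[l]} \frac{\partial}{\partial z^{[i]}} ln(o^{[l]}) = -\sum_{l=1}^{K} \frac{y^{[l]}}{o^{[l]}} \frac{\partial o^{[l]}}{\partial z^{[i]}} = -\frac{y^{[i]}}{o^{[i]}} o^{[i]} (1 - o^{[i]}) + \sum_{l \neq i} \frac{y^{[l]}}{o^{[l]}} o^{[l]} o^{[i]} \\ \qquad = -y^{[i]} + y^{[i]} o^{[i]} + \sum_{l \neq i} y^{[l]} o^{[i]} = o^{[i]} \sum_{l=1}^{K} y^{[l]} - y^{[i]} = o^{[i]} - y^{[i]} \]
where the last step uses the fact that the reference distribution sums to one, \(\sum_{l=1}^{K} y^{[l]} = 1\). So the gradient with respect to each activation is simply the predicted probability minus the reference probability for that class.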
Per the “Sylvain Says” section (page 203, Chapter 5) of [2]: “The gradient is proportional to the difference between the prediction and the target. … Because the gradient is linear we won’t see sudden jumps or exponential increases in gradients, which should lead to smoother training of models.”
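As a quick numerical check of this result (a sketch reusing the activations, targets and sm_acts tensors from above; acts and one_hot_targets are names introduced just for this check), the gradient that autograd computes for the summed cross-entropy loss matches softmax(activations) minus the one-hot encoded targets:
acts = activations.clone().requires_grad_(True)
F.cross_entropy(acts, targets, reduction='sum').backward()   # sum reduction, so each row contributes o - y
one_hot_targets = F.one_hot(targets, num_classes=2).float()
torch.allclose(acts.grad, sm_acts - one_hot_targets)         # True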