from fastai.vision.all import *
Maximizing the likelihood of the labels of the data
Suppose we have \(N\) training examples and a multi-class problem such that each training example belongs to one and only one of \(K\) possible classes. Let \(C(i) \in \{1,\ldots, K\}\) be the correct class for the \(i\)-th training example and let \(o^{[C(i)]}_{i}\) be the probability assigned by a classifier to the correct class for the \(i\)-th training example. We want this classifier to maximize: \[\prod_{i=1}^{N} o^{[C(i)]}_{i}\]
If the classifier assigns a probability of \(1\) to the correct class for \(N-1\) of the training examples but a probability of \(0\) to the correct class of the \(N\)-th example, then the entire product shown above becomes zero. So to maximize this product of probabilities, the classifier has to assign a high probability to the correct class for each and every training example.
Now, since the logarithm is a monotonically increasing function, maximizing the product is equivalent to maximizing \[ln(\prod_{i=1}^{N}o^{[C(i)]}_{i}) = \sum_{i=1}^{N}ln(o^{[C(i)]}_{i})\]
This is the same as minimizing the sum of the negative log likelihoods \[-\sum_{i=1}^{N}ln(o^{[C(i)]}_{i})\]
The above can now serve as a loss function for an optimization routine.
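As a quick sanity check (using three made-up probabilities that are not part of the worked example below), the product of the probabilities and the exponential of the negated sum of negative log likelihoods agree, so driving the sum of negative log likelihoods down is the same as driving the product up:
probs = torch.tensor([0.9, 0.8, 0.7])   # hypothetical probabilities of the correct class for 3 examples
product = probs.prod()                  # 0.504
nll = -torch.log(probs).sum()           # sum of negative log likelihoods, ~0.6852
product, torch.exp(-nll)                # both ~0.504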
Recall that Cross Entropy = \(-\sum_{k=1}^{K}y^{[k]}ln(o^{[k]})\), where \(y\) is the reference distribution over the \(K\) classes and \(o\) is our predicted distribution over the \(K\) classes. Observe that this summation collapses to a single term when the reference distribution \(y\) is one-hot, i.e. when exactly one class has a probability of \(1\) and the rest have \(0\).
Thus \(-\sum_{i=1}^{N}ln(o^{[C(i)]}_{i})\) can be interpreted as the sum of cross entropy losses across all examples.
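For a small made-up illustration with \(K=3\) classes (these numbers are not from the worked example below): when the reference distribution is one-hot, summing over all \(K\) classes gives exactly the negative log probability of the correct class:
y = torch.tensor([0., 1., 0.])      # one-hot reference distribution: class 1 is correct
o = torch.tensor([0.2, 0.7, 0.1])   # predicted probabilities over the 3 classes
-(y * torch.log(o)).sum(), -torch.log(o[1])   # both equal -ln(0.7) ~ 0.3567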
Pretend the following are the activations from a multiclass classification problem with two classes. We have 6 examples, and each row holds one activation per class that the example could belong to.
activations = torch.randn((6,2))*2
activations
tensor([[-1.6453, 1.8893],
[ 1.9800, 1.7681],
[ 2.8183, 4.6643],
[-0.3635, -0.0614],
[ 0.4064, -0.4668],
[-3.3801, 3.2484]])
Suppose the correct class of each example is as follows
targets = tensor([0,1,0,1,1,0])
targets
tensor([0, 1, 0, 1, 1, 0])
Take the softmax of the activations
sm_acts = torch.softmax(activations, dim=1)
sm_acts
tensor([[0.0283, 0.9717],
[0.5528, 0.4472],
[0.1363, 0.8637],
[0.4250, 0.5750],
[0.7054, 0.2946],
[0.0013, 0.9987]])
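As a sketch of what torch.softmax does for each row (exponentiate, then normalize by the row sum), the same numbers can be reproduced manually:
exp_acts = torch.exp(activations)
manual_sm = exp_acts / exp_acts.sum(dim=1, keepdim=True)   # e^z divided by the sum of e^z per row
torch.allclose(manual_sm, sm_acts)                         # True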
Extract the probabilities predicted for the correct class.
idx = range(6)
list(idx)
[0, 1, 2, 3, 4, 5]
p_correct_class = sm_acts[idx, targets]
p_correct_class
tensor([0.0283, 0.4472, 0.1363, 0.5750, 0.2946, 0.0013])
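The indexing sm_acts[idx, targets] picks, for each row, the column given by that row's target class. For reference, the same selection can be written with torch.gather or an explicit loop (the names via_gather and via_loop are introduced here just for illustration):
via_gather = sm_acts.gather(1, targets.unsqueeze(1)).squeeze(1)
via_loop = torch.stack([sm_acts[i, targets[i]] for i in range(len(targets))])
torch.allclose(via_gather, p_correct_class), torch.allclose(via_loop, p_correct_class)   # (True, True)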
Take the log of the softmax activations
torch.log(sm_acts)
tensor([[-3.5634e+00, -2.8753e-02],
[-5.9281e-01, -8.0469e-01],
[-1.9925e+00, -1.4659e-01],
[-8.5559e-01, -5.5344e-01],
[-3.4895e-01, -1.2222e+00],
[-6.6298e+00, -1.3213e-03]])
Computing the softmax of the activations and then taking the log is equivalent to applying PyTorch's log_softmax function directly to the original activations. We want to do the latter because it is faster and more numerically accurate.
torch.log_softmax(activations, dim=1)
tensor([[-3.5634e+00, -2.8753e-02],
[-5.9281e-01, -8.0469e-01],
[-1.9925e+00, -1.4659e-01],
[-8.5559e-01, -5.5344e-01],
[-3.4895e-01, -1.2222e+00],
[-6.6298e+00, -1.3213e-03]])
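The accuracy difference shows up with large activations. In the made-up example below, the softmax probability of the second class underflows to zero, so taking its log gives -inf, while log_softmax (which works in log space) returns the correct finite value:
big = torch.tensor([[200., 0.]])
torch.log(torch.softmax(big, dim=1))   # tensor([[0., -inf]])
torch.log_softmax(big, dim=1)          # tensor([[0., -200.]])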
Let’s compute the mean of cross entropy losses across the training examples:
-1*torch.log(p_correct_class), (-1*torch.log(p_correct_class)).mean()
(tensor([3.5634, 0.8047, 1.9925, 0.5534, 1.2222, 6.6298]), tensor(2.4610))
We can just use PyTorch to compute this directly:
nn.CrossEntropyLoss(reduction='none')(activations, targets), nn.CrossEntropyLoss()(activations, targets)
(tensor([3.5634, 0.8047, 1.9925, 0.5534, 1.2222, 6.6298]), tensor(2.4610))
or by using:
F.cross_entropy(activations, targets, reduction='none'), F.cross_entropy(activations, targets)
(tensor([3.5634, 0.8047, 1.9925, 0.5534, 1.2222, 6.6298]), tensor(2.4610))
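Cross entropy in PyTorch is log_softmax followed by the negative log likelihood, so the same per-example losses and mean also come out of F.nll_loss applied to the log-softmax of the activations:
log_sm = F.log_softmax(activations, dim=1)
F.nll_loss(log_sm, targets, reduction='none'), F.nll_loss(log_sm, targets)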
Gradient of Cross Entropy
We follow the exposition in [1].
Let \(z^{[1]},\ldots, z^{[K]}\) denote the activations corresponding to the \(K\) classes. The softmax activation for each class is given by:
\[o^{[j]} = \frac{e^{z^{[j]}}}{\sum_{l=1}^{K} e^{z^{[l]}}}\]
The cross-entropy loss across the \(K\) classes is given by:
\[E=-\sum_{l=1}^{K}y^{[l]}ln(o^{[l]})\]
Partial derivative of \(o^{[j]}\) with respect to \(z^{[i]}\) for \(j \neq i\)
\[ \frac{\partial}{\partial z^{[i]}} o^{[j]} = \frac{\partial}{\partial z^{[i]}} \frac{e^{z^{[j]}}}{\sum_l e^{z^{[l]}}} = e^{z^{[j]}} \frac{\partial}{\partial z^{[i]}} \Bigg(\sum_l e^{z^{[l]}} \Bigg)^{-1} \\ \qquad = -e^{z^{[j]}} \Bigg(\sum_l e^{z^{[l]}} \Bigg)^{-2} e^{z^{[i]}} = -o^{[j]} \cdot o^{[i]} \]
Partial derivative of \(o^{[i]}\) with respect to \(z^{[i]}\)
\[ \frac{\partial}{\partial z^{[i]}} o^{[i]} = \frac{\partial}{\partial z^{[i]}} \frac{e^{z^{[i]}}}{\sum_l e^{z^{[l]}}} = \frac{e^{z^{[i]}}}{\sum_{l} e^{z^{[l]}}} + e^{z^{[i]}} \frac{\partial}{\partial z^{[i]}} \Bigg(\sum_l e^{z^{[l]}} \Bigg)^{-1}\\ \quad \qquad \qquad \qquad = o^{[i]}-e^{z^{[i]}} \Bigg(\sum_l e^{z^{[l]}} \Bigg)^{-2} e^{z^{[i]}} = o^{[i]} - o^{[i]} \cdot o^{[i]} = o^{[i]} \cdot (1 - o^{[i]}) \]
Let’s compute the gradient of the cross-entropy loss with respect to the activation \(z^{[i]}\) of the \(i\)-th class, using the two partial derivatives above:
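\[ \frac{\partial E}{\partial z^{[i]}} = -\sum_{l=1}^{K} y^{[l]} \frac{\partial}{\partial z^{[i]}} ln(o^{[l]}) = -\sum_{l=1}^{K} \frac{y^{[l]}}{o^{[l]}} \frac{\partial o^{[l]}}{\partial z^{[i]}} = -\frac{y^{[i]}}{o^{[i]}} o^{[i]} (1 - o^{[i]}) + \sum_{l \neq i} \frac{y^{[l]}}{o^{[l]}} o^{[l]} o^{[i]} \\ \qquad = -y^{[i]} + y^{[i]} o^{[i]} + \sum_{l \neq i} y^{[l]} o^{[i]} = o^{[i]} \sum_{l=1}^{K} y^{[l]} - y^{[i]} = o^{[i]} - y^{[i]} \]
where the last step uses the fact that the reference distribution sums to one, \(\sum_{l=1}^{K} y^{[l]} = 1\). So the gradient with respect to each activation is simply the predicted probability minus the reference probability for that class.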
Per the “Sylvain Says” section (page 203, Chapter 5) of [2]: “The gradient is proportional to the difference between the prediction and the target. … Because the gradient is linear we won’t see sudden jumps or exponential increases in gradients, which should lead to smoother training of models.”
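As a quick numerical check of this result (a sketch reusing the activations, targets and sm_acts tensors from above; acts and one_hot_targets are names introduced just for this check), the gradient that autograd computes for the summed cross-entropy loss matches softmax(activations) minus the one-hot encoded targets:
acts = activations.clone().requires_grad_(True)
F.cross_entropy(acts, targets, reduction='sum').backward()   # sum reduction, so each row contributes o - y
one_hot_targets = F.one_hot(targets, num_classes=2).float()
torch.allclose(acts.grad, sm_acts - one_hot_targets)         # True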