Cost Functions
Various cost functions for various scenarios
Let $a^L = \sigma(z^L)$ be the output of the neural network and $y$ the label (expected output). Different cost functions suit different goals, and each has a "canonical" activation associated with it that simplifies the calculations.
| Scenario | Cost function | Canonical activation |
| --- | --- | --- |
| Regression (continuous, $-\infty$ to $\infty$) | MSE | Linear, $\sigma(z) = z$ |
| Binary classification | Binary cross-entropy | Sigmoid |
| Multi-class classification | Categorical cross-entropy | Softmax |
MSE (Mean Squared Error)
Also known as the half squared error, MSE is simply half the squared Euclidean distance between $a^L$ and $y$:

$$C = \frac{1}{2}\lVert a^L - y \rVert^2 = \frac{1}{2}\sum_j \left(a_j^L - y_j\right)^2$$
The purpose of the half becomes clear when we note that

$$\frac{\partial C}{\partial a_j^L} = a_j^L - y_j$$
So, in terms of vectors,

$$\nabla_a C = a^L - y$$
This makes the calculation of $\delta^L$ quite easy:

$$\delta^L = \nabla_a C \odot \sigma'(z^L) = \left(a^L - y\right) \odot \sigma'(z^L)$$
Note that with the linear activation $\sigma(z) = z$, we have $\sigma'(z) = 1$, so $\delta^L = a^L - y$. This is desirable: the error signal is simply the difference between the prediction and the label.
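The claim above is easy to sanity-check numerically. The following sketch (not part of the original notes) compares the analytic $\delta^L = a^L - y$ for MSE with a linear output against a central-difference numerical gradient:

```python
import numpy as np

# Sanity check: with C = 0.5 * ||a - y||^2 and a linear activation a = z,
# the output error delta^L = dC/dz should equal a - y.

def mse(z, y):
    a = z  # linear activation: sigma(z) = z
    return 0.5 * np.sum((a - y) ** 2)

rng = np.random.default_rng(0)
z = rng.normal(size=5)
y = rng.normal(size=5)

# Numerical gradient via central differences
eps = 1e-6
num_grad = np.array([
    (mse(z + eps * np.eye(5)[i], y) - mse(z - eps * np.eye(5)[i], y)) / (2 * eps)
    for i in range(5)
])

analytic = z - y  # delta^L = a^L - y, since a^L = z
print(np.max(np.abs(num_grad - analytic)))  # should be essentially zero
```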
Binary Cross-Entropy Loss
Given a label $y$, which is either $0$ or $1$ (hence binary), we have

$$C = -\left[\, y \ln a^L + (1 - y)\ln\left(1 - a^L\right)\right]$$
Then we have

$$\frac{\partial C}{\partial a^L} = -\frac{y}{a^L} + \frac{1 - y}{1 - a^L} = \frac{a^L - y}{a^L\left(1 - a^L\right)}$$
Beautifully, if $\sigma(z)$ is the sigmoid function then we have $\sigma'(z) = \sigma(z)\left(1 - \sigma(z)\right)$, so

$$\delta^L = \frac{\partial C}{\partial a^L}\,\sigma'(z^L) = \frac{a^L - y}{a^L\left(1 - a^L\right)} \cdot a^L\left(1 - a^L\right) = a^L - y,$$
as we'd like!
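As with MSE, this can be verified numerically. A quick sketch (an addition to the notes, not from the original) checking that binary cross-entropy composed with a sigmoid output gives $\mathrm{d}C/\mathrm{d}z = a^L - y$:

```python
import numpy as np

# Sanity check: BCE loss through a sigmoid output should give dC/dz = a - y.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(z, y):
    a = sigmoid(z)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

z, y = 0.7, 1.0  # arbitrary pre-activation and binary label

# Central-difference numerical gradient vs. the analytic a - y
eps = 1e-6
num_grad = (bce(z + eps, y) - bce(z - eps, y)) / (2 * eps)
analytic = sigmoid(z) - y  # delta^L = a^L - y
print(abs(num_grad - analytic))  # should be tiny
```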
Categorical Cross-Entropy Loss + Softmax
CE loss itself can look like a weird choice:

$$C = -\sum_i y_i \ln a_i^L$$

where $y$ is a one-hot vector encoding the correct class.
However, we can make it highly efficient by using softmax on the final layer rather than sigmoid. Softmax is simply

$$a_i^L = \frac{e^{z_i^L}}{\sum_k e^{z_k^L}}$$
When combined with softmax, we get a very nice result for $\delta$:

$$\delta_i^L = a_i^L - y_i$$
Proof
This proof is a bit more involved. Using the multivariate chain rule:

$$\delta_i^L = \frac{\partial C}{\partial z_i^L} = \sum_k \frac{\partial C}{\partial a_k^L} \frac{\partial a_k^L}{\partial z_i^L}$$
We can see, from the definition of $C$, that

$$\frac{\partial C}{\partial a_k^L} = -\frac{y_k}{a_k^L}$$
It is a bit harder to calculate $\frac{\partial a_k^L}{\partial z_i^L}$. If $k = i$, then by the quotient rule we get

$$\frac{\partial a_i^L}{\partial z_i^L} = \frac{e^{z_i^L}\sum_k e^{z_k^L} - e^{z_i^L} e^{z_i^L}}{\left(\sum_k e^{z_k^L}\right)^2} = a_i^L\left(1 - a_i^L\right)$$
If $k \ne i$, it's a bit different:

$$\frac{\partial a_k^L}{\partial z_i^L} = -\frac{e^{z_k^L}\, e^{z_i^L}}{\left(\sum_j e^{z_j^L}\right)^2} = -a_k^L a_i^L$$
Therefore

$$\delta_i^L = -\frac{y_i}{a_i^L}\, a_i^L\left(1 - a_i^L\right) + \sum_{k \ne i} \left(-\frac{y_k}{a_k^L}\right)\left(-a_k^L a_i^L\right) = -y_i + y_i a_i^L + \sum_{k \ne i} y_k a_i^L = a_i^L \sum_k y_k - y_i = a_i^L - y_i,$$

since $\sum_k y_k = 1$ for a one-hot label.
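The softmax case can be sanity-checked the same way as the other two. This sketch (an addition, not from the original notes) compares the analytic $\delta_i^L = a_i^L - y_i$ against a numerical gradient of the cross-entropy loss:

```python
import numpy as np

# Sanity check: categorical cross-entropy through a softmax output layer
# should give dC/dz_i = a_i - y_i.

def softmax(z):
    e = np.exp(z - z.max())  # shift by the max for numerical stability
    return e / e.sum()

def ce(z, y):
    return -np.sum(y * np.log(softmax(z)))

z = np.array([1.0, -0.5, 2.0])  # arbitrary pre-activations
y = np.array([0.0, 1.0, 0.0])   # one-hot label

# Central-difference numerical gradient in each coordinate
eps = 1e-6
num_grad = np.array([
    (ce(z + eps * np.eye(3)[i], y) - ce(z - eps * np.eye(3)[i], y)) / (2 * eps)
    for i in range(3)
])

analytic = softmax(z) - y  # delta_i^L = a_i^L - y_i
print(np.max(np.abs(num_grad - analytic)))  # should be essentially zero
```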