Cost Functions

Various cost functions for various scenarios

Let $a^L = \sigma(z^L)$ be the output of the neural network and $y$ the label (expected output). Different cost functions suit different goals, and each has a "canonical" activation associated with it that simplifies the gradient calculations.

| Goal | Cost Function | Activation |
| --- | --- | --- |
| Regression (continuous from $-\infty$ to $\infty$) | MSE | Linear: $\sigma(z) = z$ |
| Binary Classification | Binary Cross-Entropy | Sigmoid |
| Multi-Class Classification | Categorical Cross-Entropy | Softmax |

MSE (Mean Squared Error)

Also known as the half squared error, MSE is simply half the squared Euclidean distance between $a^L$ and $y$:

$$C(a^L,y) = \frac{1}{2} \lVert a^L-y \rVert^2 = \frac{1}{2}\sum_{i=1}^n (a^L_i - y_i)^2$$

The purpose of the half becomes clear when we note that

$$\frac{\partial C}{\partial a^L_i} = a^L_i - y_i$$

So, in terms of vectors,

$$\frac{\partial C}{\partial a^L} = a^L - y$$

This makes the calculation of $\delta^L$ quite easy:

$$\delta^L = (a^L - y) \odot \sigma^\prime(z^L)$$

Note that with linear activation $\sigma(z) = z$, we have $\sigma^\prime(z) = 1$, so $\delta^L = a^L - y$. This is desirable!
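As a quick sanity check, the MSE cost and its output-layer error under a linear activation can be sketched in NumPy (the function names here are illustrative, not from the text):

```python
import numpy as np

def mse_cost(a, y):
    """Half the squared Euclidean distance between output a and label y."""
    return 0.5 * np.sum((a - y) ** 2)

def mse_delta(a, y):
    """Output-layer error delta^L for MSE with linear activation."""
    return a - y

a = np.array([0.5, 2.0, -1.0])   # network output a^L
y = np.array([1.0, 2.0, 0.0])    # label

cost = mse_cost(a, y)    # 0.5 * (0.25 + 0 + 1) = 0.625
delta = mse_delta(a, y)  # [-0.5, 0.0, -1.0]
```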

Binary Cross-Entropy Loss

Given a label $y$, which is either $0$ or $1$ (hence binary), we have

$$C(a^L, y) = -[y\ln (a^L) + (1-y)\ln (1- a^L)]$$

Then we have

$$\frac{\partial C}{\partial a^L} = -\frac{y}{a^L} + \frac{1-y}{1-a^L} = \frac{a^L-y}{a^L(1-a^L)}$$

Beautifully, if $\sigma(z)$ is the sigmoid function then we have $\sigma^\prime(z) = \sigma(z)(1-\sigma(z))$, so

$$\delta^L = \frac{a^L-y}{a^L(1-a^L)} \cdot a^L(1-a^L) = a^L-y$$

as we'd like!
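The cancellation above can be checked numerically: a central-difference estimate of $\partial C/\partial z$ through sigmoid + BCE should match $a - y$ (a sketch; function names are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_cost(a, y):
    """Binary cross-entropy for a single output a and binary label y."""
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

z, y = 0.3, 1.0
eps = 1e-6
a = sigmoid(z)

# Central-difference estimate of dC/dz through the sigmoid.
numeric = (bce_cost(sigmoid(z + eps), y) - bce_cost(sigmoid(z - eps), y)) / (2 * eps)
analytic = a - y   # the claimed delta^L
```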

Categorical Cross-Entropy Loss + Softmax

CE loss itself can look like a weird choice:

$$C(a^L,y) = -\sum_{i=1}^n y_i\ln(a^L_i)$$

However, we can make it highly efficient by using softmax on the final neurons rather than sigmoid again. Softmax is simply

$$a_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$
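In code, softmax is usually implemented with a max-shift: subtracting $\max_j z_j$ from every $z_i$ leaves the output unchanged (numerator and denominator pick up the same factor) but avoids overflow in `exp`. A minimal sketch:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: the max-shift cancels in the ratio."""
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

z = np.array([1.0, 2.0, 3.0])
a = softmax(z)   # positive components that sum to 1
```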

When combined with softmax, we get a very nice derivation for $\delta^L$:

$$\delta^L = a^L - y$$

Proof

This proof is a bit more involved. Using multivariate chain rule:

$$\delta^L_i = \frac{\partial C}{\partial z_i} = \sum_k \frac{\partial C}{\partial a^L_k} \frac{\partial a^L_k}{\partial z_i}$$

We can see, from the definition of $C$, that

$$\frac{\partial C}{\partial a^L_i} = -\frac{y_i}{a^L_i}$$

It is a bit harder to calculate $\frac{\partial a^L_k}{\partial z_i}$. If $k = i$, then by the quotient rule we get

$$
\begin{align*}
\frac{\partial a^L_i}{\partial z_i} &= \frac{\partial}{\partial z_i}\left[\frac{e^{z_i}}{\sum_j e^{z_j}}\right] \\
&= \frac{e^{z_i}\sum_j e^{z_j} - e^{z_i}e^{z_i}}{\left(\sum_j e^{z_j}\right)^2} \\
&= \frac{e^{z_i}}{\sum_j e^{z_j}} - \left(\frac{e^{z_i}}{\sum_j e^{z_j}}\right)^2 \\
&= a^L_i - (a^L_i)^2 \\
&= a^L_i(1- a^L_i)
\end{align*}
$$

If $k \neq i$, it's a bit different:

$$
\begin{align*}
\frac{\partial a^L_k}{\partial z_i} &= \frac{\partial}{\partial z_i}\left[\frac{e^{z_k}}{\sum_j e^{z_j}}\right] \\
&= \frac{0 \cdot \sum_j e^{z_j} - e^{z_i}e^{z_k}}{\left(\sum_j e^{z_j}\right)^2} \\
&= -\frac{e^{z_i}}{\sum_j e^{z_j}} \cdot \frac{e^{z_k}}{\sum_j e^{z_j}} \\
&= -a^L_ia^L_k
\end{align*}
$$
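Both Jacobian cases can be verified with a finite-difference check: perturbing a single $z_i$ and watching how every output moves gives column $i$ of the Jacobian, which should match $a_i(1-a_i)$ on the diagonal and $-a_i a_k$ off it (a sketch, not part of the derivation):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

z = np.array([0.5, -0.3, 1.2])
a = softmax(z)
eps = 1e-6
i = 0   # which input z_i to perturb

# Column i of the softmax Jacobian via central differences.
zp, zm = z.copy(), z.copy()
zp[i] += eps
zm[i] -= eps
numeric_col = (softmax(zp) - softmax(zm)) / (2 * eps)

# Analytic column: -a_i * a_k off-diagonal, a_i (1 - a_i) on the diagonal.
analytic_col = -a[i] * a
analytic_col[i] = a[i] * (1 - a[i])
```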

Therefore

$$
\begin{align*}
\delta^L_i &= \frac{\partial C}{\partial a^L_i} \frac{\partial a^L_i}{\partial z_i} + \sum_{k\neq i} \frac{\partial C}{\partial a^L_k} \frac{\partial a^L_k}{\partial z_i} \\
&= -\frac{y_i}{a^L_i} \cdot a^L_i(1- a^L_i) + \sum_{k \neq i} -\frac{y_k}{a^L_k} \cdot (-a^L_ka^L_i) \\
&= -y_i(1-a^L_i) + \sum_{k \neq i} y_ka^L_i \\
&= -y_i + a^L_iy_i + \sum_{k \neq i} y_ka^L_i \\
&= -y_i + a^L_i\sum_{k} y_k \\
&= a^L_i - y_i
\end{align*}
$$

where the last step uses $\sum_k y_k = 1$, since $y$ is a one-hot label.
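The end-to-end result $\delta^L = a^L - y$ can also be confirmed numerically, by differencing the full cross-entropy-of-softmax pipeline against a one-hot label (a sketch; function names are my own):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def ce_cost(a, y):
    """Categorical cross-entropy for output a and one-hot label y."""
    return -np.sum(y * np.log(a))

z = np.array([0.2, -1.0, 0.5])
y = np.array([0.0, 1.0, 0.0])   # one-hot label
eps = 1e-6

analytic = softmax(z) - y       # the claimed delta^L

# Central-difference estimate of dC/dz_i through the softmax, per component.
numeric = np.zeros_like(z)
for i in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[i] += eps
    zm[i] -= eps
    numeric[i] = (ce_cost(softmax(zp), y) - ce_cost(softmax(zm), y)) / (2 * eps)
```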
