# Cost Functions

Let $$a^L = \sigma(z^L)$$ be the output of the neural network and $$y$$ the label (expected output). Different goals call for different cost functions, and each has a "canonical" output activation associated with it that simplifies the output error $$\delta^L = \frac{\partial C}{\partial z^L}$$ to just $$a^L - y$$.

<table><thead><tr><th width="332">Goal</th><th width="225">Cost Function</th><th>Activation</th></tr></thead><tbody><tr><td>Regression (continuous from <span class="math">-\infin</span> to <span class="math">\infin</span>)</td><td>MSE</td><td>Linear - <span class="math">\sigma(z) = z</span></td></tr><tr><td>Binary Classification</td><td>Binary Cross-Entropy</td><td>Sigmoid</td></tr><tr><td>Multi-Class Classification</td><td>Categorical Cross-Entropy</td><td>Softmax</td></tr></tbody></table>

## MSE (Mean Squared Error)

Also known as the **half-squared error**, MSE is simply half the squared **Euclidean distance** between $$a^L$$ and $$y$$:

$$
C(a^L,y) =\frac{1}{2} \lVert a^L-y \rVert^2 = \frac{1}{2}\sum\_{i=1}^n (a^L\_i - y\_i)^2
$$

The purpose of the half becomes clear when we note that

$$
\frac{\partial C}{\partial a^L\_i} = a^L\_i - y\_i
$$

So, in terms of vectors,

$$
\frac{\partial C}{\partial a^L} = a^L - y
$$

This makes the calculation of $$\delta^L$$ quite easy:

$$
\delta^L = (a^L - y) \odot \sigma^\prime(z^L)
$$

Note that with **linear activation** $$\sigma(z) = z$$, we have $$\sigma^\prime(z) = 1$$, so we would get $$\delta^L = a^L - y$$. This is desirable!
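As a quick sanity check, the result above can be compared against a finite-difference gradient. This is only a minimal sketch: the function names and the sample values of `z` and `y` are made up for illustration, and the output layer is assumed to be linear so that $$a^L = z^L$$.

```python
def mse_cost(a, y):
    """Half the squared Euclidean distance between output a and label y."""
    return 0.5 * sum((ai - yi) ** 2 for ai, yi in zip(a, y))


def mse_delta_linear(a, y):
    """Output error for a linear output layer, where sigma'(z) = 1."""
    return [ai - yi for ai, yi in zip(a, y)]


def numerical_grad(f, z, eps=1e-6):
    """Central finite-difference gradient of f at the point z."""
    grad = []
    for i in range(len(z)):
        z_plus = z[:i] + [z[i] + eps] + z[i + 1:]
        z_minus = z[:i] + [z[i] - eps] + z[i + 1:]
        grad.append((f(z_plus) - f(z_minus)) / (2 * eps))
    return grad


z = [0.5, -1.2, 2.0]  # pre-activations; with linear activation a = z
y = [1.0, 0.0, 1.5]   # labels (illustrative values only)

delta = mse_delta_linear(z, y)
check = numerical_grad(lambda v: mse_cost(v, y), z)

for d, c in zip(delta, check):
    print(f"analytic {d:+.6f}   numeric {c:+.6f}")
```

The two columns should agree to within floating-point error, since $$\frac{\partial C}{\partial z^L} = a^L - y$$ when the activation is the identity.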

## Binary Cross-Entropy Loss

Given a label $$y$$, which is either $$0$$ or $$1$$ (hence binary), we have

$$
C(a^L, y) = -\[y\ln (a^L) + (1-y)\ln (1- a^L)]
$$

Then we have

$$
\frac{\partial C}{\partial a^L} = -\frac{y}{a^L} + \frac{1-y}{1-a^L} = \frac{a^L-y}{a^L(1-a^L)}
$$

Beautifully, if $$\sigma(z)$$ is the sigmoid function then we have $$\sigma^\prime(z) = \sigma(z)(1-\sigma(z))$$, so

$$
\delta^L = \frac{\partial C}{\partial a^L} \cdot \sigma^\prime(z^L) = \frac{a^L-y}{a^L(1-a^L)} \cdot a^L(1-a^L) = a^L-y
$$

as we'd like!
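The cancellation can also be checked numerically. The sketch below assumes a single sigmoid output neuron; the helper names and the sample values of `z` and `y` are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_cost(a, y):
    """Binary cross-entropy for a single output a in (0, 1) and label y."""
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))


def bce_delta_chain(z, y):
    """delta via the chain rule: dC/da multiplied by sigma'(z)."""
    a = sigmoid(z)
    dC_da = (a - y) / (a * (1 - a))
    return dC_da * a * (1 - a)


def bce_delta_simplified(z, y):
    """delta after the a(1 - a) factors cancel."""
    return sigmoid(z) - y


z, y = 0.8, 1.0  # illustrative values only
eps = 1e-6

# Central finite difference of the cost with respect to z
numeric = (bce_cost(sigmoid(z + eps), y) - bce_cost(sigmoid(z - eps), y)) / (2 * eps)

print(bce_delta_chain(z, y), bce_delta_simplified(z, y), numeric)
```

All three printed values should match closely. In practice the simplified form is also safer, because it never divides by $$a^L(1-a^L)$$, which gets very small when the sigmoid saturates.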

## Categorical Cross-Entropy Loss + Softmax

CE loss itself can look like a weird choice:

$$
C(a^L,y) = -\sum\_{i=1}^n y\_i\ln(a^L\_i)
$$

However, we can make it highly efficient by using **softmax** on the final neurons rather than **sigmoid** again. Softmax is simply

$$
a\_i = \frac{e^{z\_i}}{\sum\_j e^{z\_j}}
$$

When combined with **softmax**, we get a very nice expression for $$\delta^L$$:

$$
\delta^L = a^L - y
$$
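Before going through the proof, the claim $$\delta^L = a^L - y$$ can be checked against a finite-difference gradient. This is a minimal sketch with made-up values for `z` and `y`, where `y` is assumed to be one-hot. Subtracting the maximum inside `softmax` is an addition for numerical stability that is not part of the definition above; it leaves the result unchanged because the common factor cancels in the ratio.

```python
import math


def softmax(z):
    m = max(z)  # stability shift (an assumption of this sketch); cancels in the ratio
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(a, y):
    """Categorical cross-entropy: -sum_i y_i * ln(a_i)."""
    return -sum(yi * math.log(ai) for ai, yi in zip(a, y))


def softmax_ce_delta(z, y):
    """Claimed output error for softmax combined with cross-entropy."""
    return [ai - yi for ai, yi in zip(softmax(z), y)]


z = [2.0, 1.0, 0.1]  # illustrative pre-activations
y = [1.0, 0.0, 0.0]  # one-hot label

delta = softmax_ce_delta(z, y)

# Central finite difference of the cost with respect to each z_i
eps = 1e-6
numeric = []
for i in range(len(z)):
    z_plus = list(z)
    z_plus[i] += eps
    z_minus = list(z)
    z_minus[i] -= eps
    diff = cross_entropy(softmax(z_plus), y) - cross_entropy(softmax(z_minus), y)
    numeric.append(diff / (2 * eps))

for d, n in zip(delta, numeric):
    print(f"analytic {d:+.6f}   numeric {n:+.6f}")
```

The components of `delta` sum to zero, as expected when both $$a^L$$ and $$y$$ sum to one.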

### Proof

This proof is a bit more involved. Using the multivariate chain rule:

$$
\delta^L\_i = \frac{\partial C}{\partial z\_i} = \sum\_k \frac{\partial C}{\partial a^L\_k} \frac{\partial a^L\_k}{\partial z\_i}
$$

We can see, from the definition of $$C$$, that

$$
\frac{\partial C}{\partial a^L\_i} = -\frac{y\_i}{a^L\_i}
$$

It is a bit harder to calculate $$\frac{\partial a^L\_k}{\partial z\_i}$$. If $$k = i$$, then by the quotient rule we get

$$
\begin{align\*}
\frac{\partial a^L\_i}{\partial z\_i} &= \frac{\partial}{\partial z\_i}\left\[\frac{e^{z\_i}}{\sum\_j e^{z\_j}}\right] \\
&= \frac{e^{z\_i}\sum\_j e^{z\_j} - e^{z\_i}e^{z\_i}}{\left(\sum\_j e^{z\_j}\right)^2} \\
&= \frac{e^{z\_i}}{\sum\_j e^{z\_j}} - \left(\frac{e^{z\_i}}{\sum\_j e^{z\_j}}\right)^2 \\
&= a^L\_i - (a^L\_i)^2 \\
&= a^L\_i(1- a^L\_i)
\end{align\*}
$$

If $$k \neq i$$, it's a bit different:

$$
\begin{align\*}
\frac{\partial a^L\_k}{\partial z\_i} &= \frac{\partial}{\partial z\_i}\left\[\frac{e^{z\_k}}{\sum\_j e^{z\_j}}\right] \\
&= \frac{0 \cdot \sum\_j e^{z\_j} - e^{z\_i}e^{z\_k}}{\left(\sum\_j e^{z\_j}\right)^2} \\
&= -\frac{e^{z\_i}}{\sum\_j e^{z\_j}} \cdot \frac{e^{z\_k}}{\sum\_j e^{z\_j}} \\
&= -a^L\_ia^L\_k
\end{align\*}
$$
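The two cases together give the full softmax Jacobian, which can be compared entry by entry against finite differences. The sketch below is illustrative only: the function names and the sample `z` are hypothetical, and `J[k][i]` is taken to mean $$\frac{\partial a^L\_k}{\partial z\_i}$$.

```python
import math


def softmax(z):
    exps = [math.exp(zi) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_jacobian(z):
    """J[k][i] = d a_k / d z_i, built from the two cases derived above."""
    a = softmax(z)
    n = len(a)
    return [
        [a[i] * (1 - a[i]) if k == i else -a[i] * a[k] for i in range(n)]
        for k in range(n)
    ]


def numerical_jacobian(z, eps=1e-6):
    """Central finite-difference estimate of the same Jacobian."""
    n = len(z)
    J = [[0.0] * n for _ in range(n)]
    for i in range(n):
        z_plus = list(z)
        z_plus[i] += eps
        z_minus = list(z)
        z_minus[i] -= eps
        a_plus, a_minus = softmax(z_plus), softmax(z_minus)
        for k in range(n):
            J[k][i] = (a_plus[k] - a_minus[k]) / (2 * eps)
    return J


z = [0.3, -1.0, 2.5]  # illustrative values only
J = softmax_jacobian(z)
J_num = numerical_jacobian(z)

max_err = max(abs(J[k][i] - J_num[k][i]) for k in range(3) for i in range(3))
print("largest difference:", max_err)
```

Each column of `J` sums to zero, since the outputs $$a^L\_k$$ always sum to one no matter how any single $$z\_i$$ changes.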

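Therefore, splitting the sum into the $$k = i$$ term and the rest, and using $$\sum\_k y\_k = 1$$ in the final step (the label $$y$$ is a one-hot vector), we get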

$$
\begin{align\*}
\delta^L\_i &=  \frac{\partial C}{\partial a^L\_i} \frac{\partial a^L\_i}{\partial z\_i} + \sum\_{k\neq i} \frac{\partial C}{\partial a^L\_k} \frac{\partial a^L\_k}{\partial z\_i} \\
&= -\frac{y\_i}{a^L\_i} \cdot a^L\_i(1- a^L\_i) + \sum\_{k \neq i} \left(-\frac{y\_k}{a^L\_k}\right) \cdot \left(-a^L\_ka^L\_i\right) \\
&= -y\_i(1-a^L\_i) + \sum\_{k \neq i} y\_ka^L\_i \\
&= -y\_i + a^L\_iy\_i + \sum\_{k \neq i} y\_ka^L\_i \\
&= -y\_i + a^L\_i\sum\_{k} y\_k \\
&= a^L\_i - y\_i
\end{align\*}
$$

