Cost Functions

Various cost functions for various scenarios

Let $a^L = \sigma(z^L)$ be the output of the neural network and $y$ the label (expected output). Different cost functions suit different goals, and each has a "canonical" activation associated with it that simplifies the gradient calculations.

| Goal | Cost Function | Activation |
| --- | --- | --- |
| Regression (continuous from $-\infty$ to $\infty$) | MSE | Linear: $\sigma(z) = z$ |
| Binary Classification | Binary Cross-Entropy | Sigmoid |
| Multi-Class Classification | Categorical Cross-Entropy | Softmax |

MSE (Mean Squared Error)

Also known as the half squared error, MSE is simply half the squared Euclidean distance between $a^L$ and $y$:

$$C(a^L,y) = \frac{1}{2} \lVert a^L-y \rVert^2 = \frac{1}{2}\sum_{i=1}^n (a^L_i - y_i)^2$$

The purpose of the half becomes clear when we note that

$$\frac{\partial C}{\partial a^L_i} = a^L_i - y_i$$

So, in terms of vectors,

$$\frac{\partial C}{\partial a^L} = a^L - y$$

This makes the calculation of $\delta^L$ quite easy:

$$\delta^L = (a^L - y) \odot \sigma^\prime(z^L)$$

Note that with linear activation $\sigma(z) = z$, we have $\sigma^\prime(z) = 1$, so $\delta^L = a^L - y$. This is desirable!
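As a quick sanity check, the MSE cost and its output-layer error under a linear activation can be sketched in NumPy (the function names here are illustrative, not from the text):

```python
import numpy as np

def mse_cost(a, y):
    """Half the squared Euclidean distance between output a and label y."""
    return 0.5 * np.sum((a - y) ** 2)

def mse_delta(a, y):
    """Output-layer error delta^L for MSE with linear activation."""
    return a - y

a = np.array([0.5, 2.0, -1.0])   # network output a^L
y = np.array([1.0, 2.0, 0.0])    # label

cost = mse_cost(a, y)    # 0.5 * (0.25 + 0 + 1) = 0.625
delta = mse_delta(a, y)  # [-0.5, 0.0, -1.0]
```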

Binary Cross-Entropy Loss

Given a label $y$, which is either $0$ or $1$ (hence binary), we have

$$C(a^L, y) = -[y\ln (a^L) + (1-y)\ln (1- a^L)]$$

Then we have

$$\frac{\partial C}{\partial a^L} = -\frac{y}{a^L} + \frac{1-y}{1-a^L} = \frac{a^L-y}{a^L(1-a^L)}$$

Beautifully, if $\sigma(z)$ is the sigmoid function then we have $\sigma^\prime(z) = \sigma(z)(1-\sigma(z))$, so

$$\delta^L = \frac{a^L-y}{a^L(1-a^L)} \cdot a^L(1-a^L) = a^L-y$$

as we'd like!
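The cancellation above can be checked numerically: a central-difference estimate of $\partial C/\partial z$ through sigmoid + BCE should match $a - y$ (a sketch; function names are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_cost(a, y):
    """Binary cross-entropy for a single output a and binary label y."""
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

z, y = 0.3, 1.0
eps = 1e-6
a = sigmoid(z)

# Central-difference estimate of dC/dz through the sigmoid.
numeric = (bce_cost(sigmoid(z + eps), y) - bce_cost(sigmoid(z - eps), y)) / (2 * eps)
analytic = a - y   # the claimed delta^L
```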

Categorical Cross-Entropy Loss + Softmax

CE loss itself can look like a weird choice:

$$C(a^L,y) = -\sum_{i=1}^n y_i\ln(a^L_i)$$

However, we can make it highly efficient by using softmax on the final neurons rather than sigmoid again. Softmax is simply

$$a_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$
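In code, softmax is usually implemented with a max-shift: subtracting $\max_j z_j$ from every $z_i$ leaves the output unchanged (numerator and denominator pick up the same factor) but avoids overflow in `exp`. A minimal sketch:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: the max-shift cancels in the ratio."""
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

z = np.array([1.0, 2.0, 3.0])
a = softmax(z)   # positive components that sum to 1
```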

When combined with softmax, we get a very nice derivation for $\delta^L$:

$$\delta^L = a^L - y$$

Proof

This proof is a bit more involved. Using multivariate chain rule:

$$\delta^L_i = \frac{\partial C}{\partial z_i} = \sum_k \frac{\partial C}{\partial a^L_k} \frac{\partial a^L_k}{\partial z_i}$$

We can see, from the definition of $C$, that

$$\frac{\partial C}{\partial a^L_i} = -\frac{y_i}{a^L_i}$$

It is a bit harder to calculate $\frac{\partial a^L_k}{\partial z_i}$. If $k = i$, then by the quotient rule we get

$$
\begin{align*}
\frac{\partial a^L_i}{\partial z_i} &= \frac{\partial}{\partial z_i}\left[\frac{e^{z_i}}{\sum_j e^{z_j}}\right] \\
&= \frac{e^{z_i}\sum_j e^{z_j} - e^{z_i}e^{z_i}}{\left(\sum_j e^{z_j}\right)^2} \\
&= \frac{e^{z_i}}{\sum_j e^{z_j}} - \left(\frac{e^{z_i}}{\sum_j e^{z_j}}\right)^2 \\
&= a^L_i - (a^L_i)^2 \\
&= a^L_i(1- a^L_i)
\end{align*}
$$

If $k \neq i$, it's a bit different:

$$
\begin{align*}
\frac{\partial a^L_k}{\partial z_i} &= \frac{\partial}{\partial z_i}\left[\frac{e^{z_k}}{\sum_j e^{z_j}}\right] \\
&= \frac{0 \cdot \sum_j e^{z_j} - e^{z_i}e^{z_k}}{\left(\sum_j e^{z_j}\right)^2} \\
&= -\frac{e^{z_i}}{\sum_j e^{z_j}} \cdot \frac{e^{z_k}}{\sum_j e^{z_j}} \\
&= -a^L_ia^L_k
\end{align*}
$$
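Both Jacobian cases can be verified with a finite-difference check: perturbing a single $z_i$ and watching how every output moves gives column $i$ of the Jacobian, which should match $a_i(1-a_i)$ on the diagonal and $-a_i a_k$ off it (a sketch, not part of the derivation):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

z = np.array([0.5, -0.3, 1.2])
a = softmax(z)
eps = 1e-6
i = 0   # which input z_i to perturb

# Column i of the softmax Jacobian via central differences.
zp, zm = z.copy(), z.copy()
zp[i] += eps
zm[i] -= eps
numeric_col = (softmax(zp) - softmax(zm)) / (2 * eps)

# Analytic column: -a_i * a_k off-diagonal, a_i (1 - a_i) on the diagonal.
analytic_col = -a[i] * a
analytic_col[i] = a[i] * (1 - a[i])
```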

Therefore

$$
\begin{align*}
\delta^L_i &= \frac{\partial C}{\partial a^L_i} \frac{\partial a^L_i}{\partial z_i} + \sum_{k\neq i} \frac{\partial C}{\partial a^L_k} \frac{\partial a^L_k}{\partial z_i} \\
&= -\frac{y_i}{a^L_i} \cdot a^L_i(1- a^L_i) + \sum_{k \neq i} -\frac{y_k}{a^L_k} \cdot (-a^L_ka^L_i) \\
&= -y_i(1-a^L_i) + \sum_{k \neq i} y_ka^L_i \\
&= -y_i + a^L_iy_i + \sum_{k \neq i} y_ka^L_i \\
&= -y_i + a^L_i\sum_{k} y_k \\
&= a^L_i - y_i
\end{align*}
$$

where the last step uses $\sum_k y_k = 1$, since $y$ is a one-hot label.
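The end-to-end result $\delta^L = a^L - y$ can also be confirmed numerically, by differencing the full cross-entropy-of-softmax pipeline against a one-hot label (a sketch; function names are my own):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def ce_cost(a, y):
    """Categorical cross-entropy for output a and one-hot label y."""
    return -np.sum(y * np.log(a))

z = np.array([0.2, -1.0, 0.5])
y = np.array([0.0, 1.0, 0.0])   # one-hot label
eps = 1e-6

analytic = softmax(z) - y       # the claimed delta^L

# Central-difference estimate of dC/dz_i through the softmax, per component.
numeric = np.zeros_like(z)
for i in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[i] += eps
    zm[i] -= eps
    numeric[i] = (ce_cost(softmax(zp), y) - ce_cost(softmax(zm), y)) / (2 * eps)
```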
