In linear regression tasks, the normal equation is widely used to find optimal parameters. However, Pattern Recognition and Machine Learning (PRML), one of the most popular machine learning textbooks, does not explain the details of the derivation. So, this article demonstrates how to derive the equation.

### Linear regression model

We define the linear regression model as:

$$y(\textbf{x}, \textbf{w}) = \textbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\textbf{x})$$

for an input vector $\textbf{x}$, a basis function vector $\boldsymbol{\phi}$, and an output $y$.

The main task is to find an optimal parameter $\textbf{w}$ from $N$ learning data pairs, $(\textbf{x}_1, t_1), (\textbf{x}_2, t_2), \ldots, (\textbf{x}_N, t_N)$. After this learning step, we can predict the output for any input $\textbf{x}$.
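As a concrete sketch of the model, the snippet below assumes a polynomial basis $\boldsymbol{\phi}(x) = (1, x, x^2, \ldots)^{\mathrm{T}}$ for illustration (PRML allows any basis functions; the names `phi` and `predict` are mine, not from the book):

```python
import numpy as np

def phi(x, M=3):
    """Polynomial basis functions: phi(x) = (1, x, x^2, ..., x^(M-1))."""
    return np.array([x ** j for j in range(M)])

def predict(x, w):
    """Linear regression prediction y(x, w) = w^T phi(x)."""
    return w @ phi(x, M=len(w))

w = np.array([1.0, 2.0, 0.5])  # example parameters (arbitrary)
y = predict(2.0, w)            # 1 + 2*2 + 0.5*4 = 7.0
```

The learning step below is about choosing `w`; `predict` then works for any new input.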

### Least squares method

How can we estimate an optimal parameter $\textbf{w}$? The answer is quite simple: minimize the total prediction error. Given parameters $\textbf{w}$, the total prediction error over the $N$ learning data might be computed as $\sum_{n=1}^{N} (t_n-\textbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\textbf{x}_n))$. Is this correct?

Unfortunately, this formula has two problems. First, if a data point with $t_n-\textbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\textbf{x}_n)< 0$ exists, positive and negative errors cancel, so the formula does not represent the "total error". Second, since the formula is linear in $\textbf{w}$, it has no minimum. Thus, we consider the squared error function $E(\textbf{w})$ instead:

$$E(\textbf{w}) = \sum_{n=1}^{N} \left(t_n-\textbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\textbf{x}_n)\right)^2$$

$E(\textbf{w})$ is a quadratic function of $\textbf{w}$, and it is concave up (convex). So, we can minimize it by finding the $\textbf{w}$ that satisfies $\frac{\partial E}{\partial \textbf{w}} = 0$.

Note that, in PRML, the squared error function is written as $E(\textbf{w}) = \frac{1}{2} \sum_{n=1}^{N} (t_n-\textbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\textbf{x}_n))^2$ with a seemingly mysterious $\frac{1}{2}$, but it merely cancels the factor of $2$ that appears in $\frac{\partial E}{\partial \textbf{w}}$. Hence, the coefficient is not important for understanding the normal equation.
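To make the error function concrete, here is a small sketch (the helper name `squared_error` and the toy data are my own choices for illustration) that evaluates $E(\textbf{w})$ without the $\frac{1}{2}$ factor:

```python
import numpy as np

def squared_error(w, phis, t):
    """E(w) = sum_n (t_n - w^T phi_n)^2, without PRML's 1/2 factor."""
    return sum((t_n - w @ phi_n) ** 2 for phi_n, t_n in zip(phis, t))

# toy basis vectors phi_n = (1, x_n) and targets t_n = x_n
phis = [np.array([1.0, 0.0]), np.array([1.0, 1.0]), np.array([1.0, 2.0])]
t = [0.0, 1.0, 2.0]

E = squared_error(np.array([0.0, 1.0]), phis, t)  # w = (0, 1) fits exactly, so E = 0.0
```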

### Normal equation

For the reasons mentioned above, we want to obtain $\frac{\partial E}{\partial \textbf{w}}$. For better understanding, I will first check the result of the vector derivative on a small example. With just one data point $(\textbf{x}, t)$ and a two-dimensional basis vector $\boldsymbol{\phi}(\textbf{x}) = (\phi_1, \phi_2)^{\mathrm{T}}$, the squared error function is:

$$E(\textbf{w}) = \left(t - \textbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\textbf{x})\right)^2 = \left(t - w_1\phi_1 - w_2\phi_2\right)^2$$

Also, the derivative with respect to the vector $\textbf{w}$ is defined element-wise:

$$\frac{\partial E}{\partial \textbf{w}} = \left(\frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2}\right)^{\mathrm{T}}$$

For instance,

$$\frac{\partial E}{\partial w_1} = -2\left(t - w_1\phi_1 - w_2\phi_2\right)\phi_1$$

As a consequence,

$$\frac{\partial E}{\partial \textbf{w}} = -2\left(t - \textbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\textbf{x})\right)\boldsymbol{\phi}(\textbf{x})$$

By extending this simple example to arbitrary $N$ and dimensions,

$$\frac{\partial E}{\partial \textbf{w}} = -2\sum_{n=1}^{N}\left(t_n - \textbf{w}^{\mathrm{T}}\boldsymbol{\phi}_n\right)\boldsymbol{\phi}_n$$

with $\boldsymbol{\phi}(\textbf{x}_n)=\boldsymbol{\phi}_n$. Importantly, since $\textbf{w}^{\mathrm{T}}\boldsymbol{\phi}_n$ is a scalar, the factors can be exchanged: $\textbf{w}^{\mathrm{T}}\boldsymbol{\phi}_n = \boldsymbol{\phi}_n^{\mathrm{T}}\textbf{w}$, and $(\boldsymbol{\phi}_n^{\mathrm{T}}\textbf{w})\cdot\boldsymbol{\phi}_n = \boldsymbol{\phi}_n \cdot (\boldsymbol{\phi}_n^{\mathrm{T}}\textbf{w}) = (\boldsymbol{\phi}_n\boldsymbol{\phi}_n^{\mathrm{T}})\cdot\textbf{w}$.
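This scalar-exchange step is easy to verify numerically; here is a quick sketch with random vectors (the dimension 4 and the seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(4)
phi_n = rng.standard_normal(4)

# Since phi_n^T w is a scalar, it commutes with the vector phi_n:
lhs = (phi_n @ w) * phi_n         # (phi_n^T w) * phi_n
rhs = np.outer(phi_n, phi_n) @ w  # (phi_n phi_n^T) w

assert np.allclose(lhs, rhs)
```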

Next, we set $\frac{\partial E}{\partial \textbf{w}} = 0$ and solve for $\textbf{w}$:

$$\sum_{n=1}^{N} t_n\boldsymbol{\phi}_n = \left(\sum_{n=1}^{N}\boldsymbol{\phi}_n\boldsymbol{\phi}_n^{\mathrm{T}}\right)\textbf{w} \quad\Longrightarrow\quad \textbf{w} = \left(\sum_{n=1}^{N}\boldsymbol{\phi}_n\boldsymbol{\phi}_n^{\mathrm{T}}\right)^{-1}\sum_{n=1}^{N} t_n\boldsymbol{\phi}_n$$

Additionally, PRML introduces the **design matrix** $\boldsymbol{\Phi}$ by:

$$\boldsymbol{\Phi} = \begin{pmatrix} \phi_1(\textbf{x}_1) & \phi_2(\textbf{x}_1) & \cdots & \phi_M(\textbf{x}_1) \\ \phi_1(\textbf{x}_2) & \phi_2(\textbf{x}_2) & \cdots & \phi_M(\textbf{x}_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_1(\textbf{x}_N) & \phi_2(\textbf{x}_N) & \cdots & \phi_M(\textbf{x}_N) \end{pmatrix}$$

where $M$ is the dimension of the basis-function vector $\boldsymbol{\phi}$. It can be written compactly as $\boldsymbol{\Phi}^{\mathrm{T}} = \left[\boldsymbol{\phi}_1 \ \boldsymbol{\phi}_2 \ \cdots \ \boldsymbol{\phi}_N\right]$; that is, the $n$-th row of $\boldsymbol{\Phi}$ is $\boldsymbol{\phi}_n^{\mathrm{T}}$. Therefore, we can easily confirm $\sum_{n=1}^{N} \boldsymbol{\phi}_n \boldsymbol{\phi}_n^{\mathrm{T}} = \boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi}$ and $\sum_{n=1}^{N} t_n\boldsymbol{\phi}_n = \boldsymbol{\Phi}^{\mathrm{T}} \textbf{t}$, where $\textbf{t} = (t_1, t_2, \ldots, t_N)^{\mathrm{T}}$.
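Both identities can also be confirmed with a quick numerical sketch (the toy $\boldsymbol{\phi}_n$ vectors and targets below are arbitrary illustrations):

```python
import numpy as np

# basis-function values phi_n = phi(x_n) for N = 3 data points, M = 2
phis = [np.array([1.0, 0.0]), np.array([1.0, 1.0]), np.array([1.0, 2.0])]
t = np.array([0.0, 1.0, 2.0])

Phi = np.vstack(phis)  # design matrix: the n-th row is phi_n^T, shape (N, M)

# sum_n phi_n phi_n^T  ==  Phi^T Phi
sum_outer = sum(np.outer(p, p) for p in phis)
assert np.allclose(Phi.T @ Phi, sum_outer)

# sum_n t_n phi_n  ==  Phi^T t
sum_tphi = sum(t_n * p for t_n, p in zip(t, phis))
assert np.allclose(Phi.T @ t, sum_tphi)
```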

Finally, we get the normal equation with the design matrix:

$$\textbf{w} = \left(\boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^{\mathrm{T}}\textbf{t}$$

Now, we can find an optimal parameter for any learning data $(\textbf{x}_1, t_1), (\textbf{x}_2, t_2), \dots, (\textbf{x}_N, t_N)$ by evaluating this equation.
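As a sanity check, here is a sketch that fits a quadratic model via the normal equation and compares the result with NumPy's least-squares solver (the synthetic data and the polynomial basis are illustrative assumptions, not from PRML):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(-1.0, 1.0, size=20)
t = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(20)  # noisy line, fit with a quadratic

# design matrix for the basis phi(x) = (1, x, x^2): shape (N, M) = (20, 3)
Phi = np.vstack([np.array([1.0, xi, xi ** 2]) for xi in x])

# normal equation: w = (Phi^T Phi)^{-1} Phi^T t
# (solving the linear system is numerically preferable to forming the inverse)
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# cross-check against NumPy's built-in least-squares solver
w_lstsq, *_ = np.linalg.lstsq(Phi, t, rcond=None)
assert np.allclose(w, w_lstsq)
```

Note that in practice `np.linalg.lstsq` (or a QR/SVD-based solver) is preferred, since $\boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi}$ can be ill-conditioned.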

### Conclusion

- The PRML book does not explain the details of the derivation of the normal equation.
- I derived the normal equation step by step from the definition of the linear regression model.
- Since vector derivatives are not easy to follow, checking the result with simple examples is a good idea.
