In linear regression tasks, the normal equation is widely used to find the optimal parameters. However, *Pattern Recognition and Machine Learning* (PRML), one of the most popular machine learning textbooks, does not explain the details of its derivation. So, this article demonstrates how to derive the equation step by step.

### Linear regression model

We define the linear regression model as:

$$y(\textbf{x}, \textbf{w}) = \textbf{w}^{\mathrm{T}}\phi(\textbf{x})$$

for an input vector $\textbf{x}$, a basis function $\phi$, and an output $y$.

The main task is to find an optimal parameter $\textbf{w}$ from $N$ training samples, $(\textbf{x}_1, t_1), (\textbf{x}_2, t_2), \ldots, (\textbf{x}_N, t_N)$. Once the parameter is learned, we can predict the output for any new input $\textbf{x}$.

### Least squares method

How can we estimate the optimal parameter $\textbf{w}$? The answer is quite simple: minimize the total prediction error. Given a parameter $\textbf{w}$, the total prediction error over the $N$ training samples might be computed as $\sum_{n=1}^{N} (t_n-\textbf{w}^{\mathrm{T}}\phi(\textbf{x}_n))$. Is this correct?

Unfortunately, this formula has two problems. First, if a sample with $t_n-\textbf{w}^{\mathrm{T}}\phi(\textbf{x}_n)< 0$ exists, the formula no longer represents a "total error", because positive and negative errors cancel out. Second, since the formula is linear in $\textbf{w}$, it has no minimum. Thus, we instead consider the squared error function $E(\textbf{w})$:

$$E(\textbf{w}) = \sum_{n=1}^{N} \left( t_n-\textbf{w}^{\mathrm{T}}\phi(\textbf{x}_n) \right)^2$$

$E(\textbf{w})$ is a quadratic function of $\textbf{w}$ that opens upward; that is, it is convex. So, we can minimize it by finding the $\textbf{w}$ that satisfies $\frac{\partial E}{\partial \textbf{w}} = 0$.

Note that PRML defines the squared error function as $E(\textbf{w}) = \frac{1}{2} \sum_{n=1}^{N} (t_n-\textbf{w}^{\mathrm{T}}\phi(\textbf{x}_n))^2$ with a seemingly mysterious $\frac{1}{2}$, but it simply cancels the $2$ that appears in $\frac{\partial E}{\partial \textbf{w}}$. Since a constant factor does not change the minimizer, the coefficient is not essential for understanding the normal equation.

### Normal equation

For the reasons mentioned above, we want to obtain $\frac{\partial E}{\partial \textbf{w}}$. For better understanding, let us first check the result of a vector derivative on a small example. When we have just one training sample and a two-dimensional feature vector $\phi(\textbf{x}) = (\phi_1, \phi_2)^{\mathrm{T}}$, the squared error function is:

$$E(\textbf{w}) = \left( t - \textbf{w}^{\mathrm{T}}\phi(\textbf{x}) \right)^2 = \left( t - w_1\phi_1 - w_2\phi_2 \right)^2$$

Also, the derivative with respect to the vector $\textbf{w}$ is defined element-wise:

$$\frac{\partial E}{\partial \textbf{w}} = \left( \frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2} \right)^{\mathrm{T}}$$

For instance,

$$\frac{\partial E}{\partial w_1} = -2 \left( t - w_1\phi_1 - w_2\phi_2 \right) \phi_1$$

As a consequence,

$$\frac{\partial E}{\partial \textbf{w}} = -2 \left( t - \textbf{w}^{\mathrm{T}}\phi(\textbf{x}) \right) \phi(\textbf{x})$$

By extending this simple example to arbitrary $N$ and dimensions, we obtain:

$$\frac{\partial E}{\partial \textbf{w}} = -2 \sum_{n=1}^{N} \left( t_n - \textbf{w}^{\mathrm{T}}\phi_n \right) \phi_n$$

with $\phi_n = \phi(\textbf{x}_n)$. Importantly, since $\textbf{w}^{\mathrm{T}}\phi_n$ is a scalar, its factors can be exchanged: $\textbf{w}^{\mathrm{T}}\phi_n = \phi_n^{\mathrm{T}}\textbf{w}$, and $(\phi_n^{\mathrm{T}}\textbf{w})\,\phi_n = \phi_n (\phi_n^{\mathrm{T}}\textbf{w}) = (\phi_n\phi_n^{\mathrm{T}})\,\textbf{w}$.
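Vector derivatives are easy to get wrong, so it may be worth verifying the gradient $\frac{\partial E}{\partial \textbf{w}} = -2 \sum_{n} (t_n - \textbf{w}^{\mathrm{T}}\phi_n)\phi_n$ numerically. The following NumPy sketch (with randomly generated toy data, assumed purely for illustration) compares the analytic gradient against a finite-difference approximation:

```python
import numpy as np

# Toy data (assumed for illustration): N = 5 samples, M = 3 features.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(5, 3))   # row n is phi_n^T
t = rng.normal(size=5)
w = rng.normal(size=3)

def E(w):
    """Squared error E(w) = sum_n (t_n - w^T phi_n)^2."""
    return np.sum((t - Phi @ w) ** 2)

# Analytic gradient: dE/dw = -2 * sum_n (t_n - w^T phi_n) * phi_n
grad = -2 * Phi.T @ (t - Phi @ w)

# Numerical gradient via central finite differences.
eps = 1e-6
num_grad = np.array([
    (E(w + eps * e) - E(w - eps * e)) / (2 * eps)
    for e in np.eye(3)
])

print(np.allclose(grad, num_grad, atol=1e-5))  # True
```

Because $E$ is quadratic, the central difference is exact up to floating-point rounding, so the two gradients agree closely.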

Next, we set $\frac{\partial E}{\partial \textbf{w}} = 0$ and solve the equation for $\textbf{w}$ as:

$$\sum_{n=1}^{N} (\phi_n\phi_n^{\mathrm{T}})\,\textbf{w} = \sum_{n=1}^{N} t_n\phi_n \ \Longrightarrow\ \textbf{w} = \left( \sum_{n=1}^{N} \phi_n\phi_n^{\mathrm{T}} \right)^{-1} \sum_{n=1}^{N} t_n\phi_n$$

Additionally, PRML introduces the **design matrix** $\Phi$ by:

$$\Phi = \begin{bmatrix} \phi_1^{\mathrm{T}} \\ \phi_2^{\mathrm{T}} \\ \vdots \\ \phi_N^{\mathrm{T}} \end{bmatrix}$$

an $N \times M$ matrix, where $M$ is the dimensionality of the feature vector $\phi(\textbf{x})$. It can be simply written as $\Phi = \left[\phi_1 \ \phi_2 \ \ldots \ \phi_N \right]^{\mathrm{T}}$. Therefore, we can easily confirm $\sum_{n=1}^{N} \phi_n \phi_n^{\mathrm{T}} = \Phi^{\mathrm{T}} \Phi$ and $\sum_{n=1}^{N} t_n\phi_n = \Phi^{\mathrm{T}} \textbf{t}$, where $\textbf{t} = (t_1, t_2, \ldots, t_N)^{\mathrm{T}}$.
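These matrix identities can also be confirmed numerically. A quick NumPy check (again on assumed random toy data) compares the per-sample sums against the design-matrix products:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 6, 4
Phi = rng.normal(size=(N, M))      # design matrix: row n is phi_n^T
t = rng.normal(size=N)
phis = [Phi[n] for n in range(N)]  # each phi_n as an M-vector

# sum_n phi_n phi_n^T  ==  Phi^T Phi  (an M x M matrix)
lhs_mat = sum(np.outer(p, p) for p in phis)
print(np.allclose(lhs_mat, Phi.T @ Phi))  # True

# sum_n t_n phi_n  ==  Phi^T t  (an M-vector)
lhs_vec = sum(t[n] * phis[n] for n in range(N))
print(np.allclose(lhs_vec, Phi.T @ t))  # True
```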

Finally, we get the normal equation with the design matrix:

$$\textbf{w} = \left( \Phi^{\mathrm{T}}\Phi \right)^{-1} \Phi^{\mathrm{T}}\textbf{t}$$

Now, we can find the optimal parameter for any training samples $(\mathbf{x}_1, t_1), (\mathbf{x}_2, t_2), \ldots, (\mathbf{x}_N, t_N)$ by evaluating this equation.
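Putting everything together, here is a minimal NumPy sketch that fits a linear regression model via the normal equation. The polynomial basis and the noisy toy data are assumptions made purely for illustration:

```python
import numpy as np

# An assumed polynomial basis: phi(x) = (1, x, x^2)
def phi(x):
    return np.array([1.0, x, x ** 2])

# Noisy toy samples from t = 1 + 2x - x^2 (assumed for illustration)
rng = np.random.default_rng(2)
x = np.linspace(-1, 1, 20)
t = 1 + 2 * x - x ** 2 + 0.01 * rng.normal(size=x.size)

# N x M design matrix: row n is phi(x_n)^T
Phi = np.stack([phi(xi) for xi in x])

# Normal equation: solve (Phi^T Phi) w = Phi^T t
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)
print(w)  # close to the true coefficients [1, 2, -1]
```

Note that the code solves the linear system with `np.linalg.solve` instead of explicitly computing $(\Phi^{\mathrm{T}}\Phi)^{-1}$; forming the inverse is numerically less stable and more expensive, even though it matches the written formula.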

### Conclusion

- The PRML book does not explain the details of the derivation of the normal equation.
- This article derived the normal equation step by step from the definition of the linear regression model.
- Since vector derivatives are not easy to follow, checking the result with simple examples is a good idea.



Last updated: 2022-09-02

#### Author: Takuya Kitazawa

**Takuya Kitazawa** is a freelance software developer, minimalistic traveler, ultralight hiker & runner, and craft beer enthusiast. While my area of specialty is in data & AI ethics and machine learning productization, I have worked full-stack throughout the career e.g., as a frontend/backend engineer, OSS developer, technical evangelist, solution architect, data scientist, and product manager. You can find what I am doing lately at my "**now**" page, and your inquiry is always welcome at **[email protected]**, including comments on my blog posts.

*Opinions are my own and do not represent the views of organizations I am/was belonging to.*
