In the linear regression tasks, the normal equation is widely used to find optimal parameters. However, Pattern Recognition and Machine Learning (RPML), one of the most popular machine learning textbooks, does not explain details of the derivation process. So, this article demonstrates how to derive the equation.

### Linear regression model

We define linear regression model as:

for a input vector $\textbf{x}$, base function $\phi$ and output $y$.

The main task is to find an optimal parameter $\textbf{w}$ from $N$ learning data sets, $(\textbf{x}_1, t_1), (\textbf{x}_2, t_2), \ldots, (\textbf{x}_N, t_N)$. As a result of such learning step, we can predict output for any input $\textbf{x}$.

### Least squares method

How can we estimate an optimal parameter $\textbf{w}$? The answer is quite simple: minimization of the total prediction error. When we already have parameters, the total prediction error for the $N$ learning data may be computed by $\sum_{n=1}^{N} (t_n-\textbf{w}^{\mathrm{T}}\phi(\textbf{x}_n))$. Is it correct?

Unfortunately, this formula has two problems. First, if learning data such that $t_n-\textbf{w}^{\mathrm{T}}\phi(\textbf{x}_n)< 0$ exists, above formula does not represent "total error". Second, since the formula is linear for $\textbf{w}$, we cannot minimize it. Thus, squared error function $E(\textbf{w})$ is considered as:

$E(\textbf{w})$ is a quadratic function, and it will be concave up. So, we can minimize it by finding $\textbf{w}$ which satisfies $\frac{\partial E}{\partial \textbf{w}} = 0$.

Note that, in the PRML, squared error function is represented as $E(\textbf{w}) = \frac{1}{2} \sum_{n=1}^{N} (t_n-\textbf{w}^{\mathrm{T}}\phi(\textbf{x}_n))^2$ with mysterious $\frac{1}{2}$, but it just deletes $2$ in $\frac{\partial E}{\partial \textbf{w}}$. Hence, the coefficient is not so important to understand the normal equation.

### Normal equation

For the reasons that I mentioned above, we want to obtain $\frac{\partial E}{\partial \textbf{w}}$. For better understanding, I will first check the result of vector derivation for a small example. When we have just one learning data, and input vector has two dimensions, the squared error function is:

Also,

For instance,

As a consequence,

By extending this simple example to arbitrary $N$ and dimensions,

with $\phi(\textbf{x}_n)=\phi_n$. Importantly, since $\textbf{w}^{\mathrm{T}}\phi_n$ is scalar, exchangeable parts exist as: $\textbf{w}^{\mathrm{T}}\phi_n = \phi_n^{\mathrm{T}}\textbf{w}$, and $(\phi_n^{\mathrm{T}}\textbf{w})\cdot\phi_n = \phi_n \cdot (\phi_n^{\mathrm{T}}\textbf{w}) = (\phi_n\phi_n^{\mathrm{T}})\cdot\textbf{w}$.

Next, we solve the equation for $\textbf{w}$ as:

Additionally, the PRML introduces **design matrix** by:

for $M$, dimensions of input vector. It can be simply written as $\phi = \left[\phi_1 \ \phi_2 \ldots \phi_N \right]$. Therefore, we can easily confirm $\sum_{n=1}^{N} \phi_n \phi_n^{\mathrm{T}} = \phi^{\mathrm{T}} \phi$, and $\sum_{n=1}^{N} t_n\phi_n = \phi^{\mathrm{T}} \textbf{t}$.

Finally, we get the normal equation with the design matrix:

Now, we can find an optimal parameter for any learning data $(\mathbf{x}_1, t_1), (\mathbf{x}_2, t_2), \cdots, (\mathbf{x}_N, t_N)$ by computing the equation.

### Conclusion

- The PRML book does not explain details of derivation process of the normal equation.
- I derive the normal equation step-by-step from the definition of linear regression models.
- Since vector derivation is not easy to follow, checking the result with simple examples is good idea.

#### Categories

#### See also

- 2023-04-07
- Three Perspectives on Large Language Models and Their Emerging Applications
- 2022-09-14
- Seeing Past and Present in Coursera "Machine Learning on Google Cloud" Specialization
- 2017-05-28
- Hugo meets kramdown + KaTeX #gohugo

Last updated: 2022-09-02

#### Author: Takuya Kitazawa

**Takuya Kitazawa** is a freelance software developer, minimalistic traveler, ultralight hiker & runner, and craft beer enthusiast. With a decade of experience at start-up companies and Big Tech ranging from full-stack/machine-learning engineering to data science to product management, I am currently working at the intersection of technological and social aspects of data-driven applications.

#### Disclaimer

- Opinions are my own and do not represent the views of organizations I am/was belonging to.
- I am doing my best to ensure the accuracy and fair use of the information. However, there might be some errors, outdated information, or biased subjective statements because the main purpose of this blog is to jot down my personal thoughts as soon as possible before conducting an extensive investigation. Visitors understand the limitations and rely on any information at their own risk.
- That said, if there is any issue with the content, please contact me so I can take the necessary action.