A Look at the Central Limit Theorem
When collecting data, the measurements are inevitably affected by noise. Looking deeper, noise is made up of the sum of many random and independent processes (e.g. thermal noise, electromagnetic interference, etc.). Thus, based on the Central Limit Theorem, noise should follow a normal distribution.
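As a quick illustration (this simulation is my own sketch, not part of the original derivation), we can sum many small, independent random processes and check that the result behaves like a normal distribution:

```python
import random
import statistics

random.seed(0)

def noise_sample(n_processes=100):
    # Each "process" contributes a small random value in [-1, 1];
    # the observed noise is their sum.
    return sum(random.uniform(-1, 1) for _ in range(n_processes))

samples = [noise_sample() for _ in range(10_000)]

mean = statistics.mean(samples)
stdev = statistics.stdev(samples)
# For a normal distribution, roughly 68% of samples fall within
# one standard deviation of the mean.
within_1sd = sum(abs(s - mean) < stdev for s in samples) / len(samples)
print(round(mean, 2), round(within_1sd, 2))
```

The mean comes out near 0 and about 68% of the summed samples land within one standard deviation, just as a normal distribution predicts.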
Typically, the mean ($\mu$) of the noise is assumed to be 0. Since each individual process is random in nature, it has an equal likelihood of taking a positive or negative magnitude. Thus, the noise should have no bias in either the positive or negative direction.
The probability of observing an error of value $\epsilon$ is given as

$$p(\epsilon) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{\epsilon^2}{2\sigma^2}\right)$$
The above is the normal distribution (denoted in red), where the horizontal axis is $\epsilon$. It basically states that the greater $|\epsilon|$ is, the lower the probability of its occurrence, and vice versa.
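To see this numerically, here is a small sketch (assuming $\sigma = 1$) that evaluates the density at increasing values of $\epsilon$ and checks that it keeps shrinking:

```python
import math

def p(eps, sigma=1.0):
    # Gaussian density with mean 0: p(eps) = exp(-eps^2 / (2 sigma^2)) / sqrt(2 pi sigma^2)
    return math.exp(-eps**2 / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)

densities = [p(e) for e in (0.0, 0.5, 1.0, 2.0, 3.0)]
# Larger |eps| -> smaller density: the list is strictly decreasing.
assert all(a > b for a, b in zip(densities, densities[1:]))
print([round(d, 4) for d in densities])
```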
Suppose $\hat{y}$ is a linear equation

$$\hat{y} = wx + b$$

and the observed value is $y = \hat{y} + \epsilon$, so the error is $\epsilon = y - \hat{y}$. The probability of observing $y$ given $x$ is then

$$p(y \mid x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - \hat{y})^2}{2\sigma^2}\right)$$
Optimising the weights
For independent events, the probability of multiple events occurring is simply the product of their individual probabilities.
Thus the probability that we observe the $n$ data points $(x_i, y_i)$ above is

$$L = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \hat{y}_i)^2}{2\sigma^2}\right)$$
We can rewrite it as

$$L = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^{n} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2\right)$$
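The two forms are algebraically identical; a quick numerical check (with made-up residuals and $\sigma = 1$, my own sketch) confirms it:

```python
import math
import random

random.seed(1)
sigma = 1.0
errors = [random.gauss(0, sigma) for _ in range(5)]  # stand-ins for y_i - yhat_i

def density(e):
    return math.exp(-e**2 / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)

# Product of the individual densities ...
L_product = math.prod(density(e) for e in errors)

# ... equals the single rewritten expression with the summed squared errors.
n = len(errors)
L_rewritten = (1 / math.sqrt(2 * math.pi * sigma**2))**n \
    * math.exp(-sum(e**2 for e in errors) / (2 * sigma**2))

assert math.isclose(L_product, L_rewritten)
```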
Since the smaller the squared error $(y_i - \hat{y}_i)^2$, the higher the probability of its occurrence, if we maximise $L$, that would minimise the errors $(y_i - \hat{y}_i)^2$.
Thus, minimising the error can be achieved when we maximise $L$.
$\operatorname{argmax}$ is an operation that finds the argument (or input) that yields the maximum value of a function. Thus we aim to find the input weights ($w$) or coefficients of $\hat{y}$ that maximise $L$:

$$w^* = \operatorname*{argmax}_{w} L$$
The argument that maximises $L$ is the same as the argument that maximises $\log L$, as $\log$ is monotonically increasing: if $a > b$ then $\log a > \log b$. Taking the log of $L$ gives

$$\log L = -\frac{n}{2}\log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
Hopefully you can see that taking out the constants $-\frac{n}{2}\log(2\pi\sigma^2)$ and $\frac{1}{2\sigma^2}$ will still yield the same argument that maximises $\log L$, since neither constant depends on $w$.
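This "drop the constants" step can be sanity-checked with stand-in numbers (the constants and SSE values below are purely illustrative): $\log L$ has the shape $C_1 - C_2 \cdot \mathrm{SSE}(w)$ with $C_2 > 0$, so maximising it picks the same $w$ as minimising $\mathrm{SSE}(w)$ alone.

```python
# C1, C2 are stand-ins for -n/2 * log(2*pi*sigma^2) and 1/(2*sigma^2).
C1, C2 = -4.6, 2.0
# Hypothetical sum-of-squared-errors for a few candidate weights w.
sse = {1.0: 9.0, 2.0: 4.0, 3.0: 1.0, 4.0: 6.0}

w_loglik = max(sse, key=lambda w: C1 - C2 * sse[w])  # maximise log-likelihood
w_sse = min(sse, key=lambda w: sse[w])               # minimise SSE directly
assert w_loglik == w_sse == 3.0
```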
Thus

$$w^* = \operatorname*{argmax}_{w} \log L = \operatorname*{argmax}_{w}\left(-\sum_{i=1}^{n}(y_i - \hat{y}_i)^2\right) = \operatorname*{argmin}_{w} \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
This is exactly the Mean Squared Error (MSE) formula used in textbooks. And we have shown that finding the argument $w^*$ that minimises the MSE is the best at minimising the error $\epsilon$.
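Finally, the whole chain can be verified numerically. On synthetic data (my own toy example, fitting $\hat{y} = wx$ with known $\sigma$), a grid search over $w$ picks the same weight whether we maximise the log-likelihood or minimise the MSE:

```python
import math
import random

random.seed(2)
true_w, sigma = 3.0, 0.5
xs = [random.uniform(-2, 2) for _ in range(200)]
ys = [true_w * x + random.gauss(0, sigma) for x in xs]

def mse(w):
    return sum((y - w * x)**2 for x, y in zip(xs, ys)) / len(xs)

def log_likelihood(w):
    n = len(xs)
    sse = sum((y - w * x)**2 for x, y in zip(xs, ys))
    return -n / 2 * math.log(2 * math.pi * sigma**2) - sse / (2 * sigma**2)

grid = [i / 100 for i in range(200, 401)]  # candidate weights 2.00 .. 4.00
w_mle = max(grid, key=log_likelihood)      # maximum-likelihood weight
w_mse = min(grid, key=mse)                 # minimum-MSE weight
assert w_mle == w_mse                      # same argument optimises both
print(w_mle)
```

Both objectives land on the same weight, close to the true value of 3.0, which is the equivalence the derivation above establishes.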