Rationale Behind MSE for optimisation

A Look at the Central Limit Theorem
When collecting data, the measurements are inevitably affected by noise. Looking deeper, noise is the sum of many random and independent processes (e.g. thermal noise, electromagnetic interference). Thus, by the Central Limit Theorem, noise should follow a normal distribution.

$$\epsilon \sim \mathcal{N}(0, \sigma^2)$$
$$y_{\text{real}} = y_{\text{ideal}} + \epsilon_{\text{noise}}$$

Typically, the mean ($\mu$) of the noise is assumed to be 0. Since each individual process is random in nature, it is equally likely to have a positive or a negative magnitude, so the noise should have no bias in either direction.
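This can be checked numerically: summing many independent, non-Gaussian random processes produces an approximately normal, zero-mean noise term. A minimal sketch using NumPy, with uniform random variables standing in for the individual processes (a hypothetical choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Model each noise sample as the sum of many small, independent random
# processes (uniform variables here -- a hypothetical stand-in).
n_processes = 1000
n_samples = 10_000
noise = rng.uniform(-1, 1, size=(n_samples, n_processes)).sum(axis=1)

# By the Central Limit Theorem the sums are approximately normal with
# mean 0, even though no single process is Gaussian.
print(abs(noise.mean()) < 1.0)
print(abs((noise > 0).mean() - 0.5) < 0.05)  # no positive/negative bias
```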

![[Pasted image 20250608131245.png|centre|200]]
The probability of observing an error of value $\epsilon$ is given as

$$P(y_{\text{ideal}} - y_{\text{real}}) = P(\epsilon) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{\epsilon}{\sigma}\right)^2}$$

![[Pasted image 20250608132219.png|centre|600]]

The figure above shows the normal distribution $P(\epsilon)$, denoted in red, where $w^T x$ is $y_{\text{ideal}}$. It states that the greater $\epsilon$ is, the lower the probability of its occurrence, and vice versa.
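The density above translates directly into code. A minimal sketch (assuming $\sigma = 1$ for illustration) confirming that larger errors are less probable:

```python
import math

def gaussian_pdf(eps, sigma=1.0):
    """Probability density of observing an error eps under N(0, sigma^2)."""
    return math.exp(-0.5 * (eps / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# The greater the error, the lower its probability density.
print(gaussian_pdf(0.0) > gaussian_pdf(1.0) > gaussian_pdf(2.0))  # True
```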
Suppose yideal is a linear equation

$$y_{\text{ideal}} = wx$$

Optimising the weights

For independent events, the probability of all of them occurring is the product of their individual probabilities.
![[Pasted image 20250608134650.png|centre|500]]

Thus the probability of observing the data above is

$$P(\text{Data}) = P(\epsilon_1)\,P(\epsilon_2)\cdots P(\epsilon_n)$$

We can rewrite it as

$$P(\text{Data}) = \prod_{i=1}^{N} P(\epsilon_i) = \prod_{i=1}^{N} P(y_{\text{ideal}} - y_{\text{real}}) = \prod_{i=1}^{N} P(wx_i - y_i)$$

The smaller the $\epsilon$, the higher the probability $P(\epsilon)$ of observing it, so maximising $P(\text{Data})$ minimises the overall error.

Thus $w_{\text{best}}$ is achieved when we maximise $P(\text{Data})$:

$$w_{\text{best}} = \arg\max_{w} \prod_{i=1}^{N} P(wx_i - y_i)$$

$\arg\max$ is an operation that finds the argument (or input) that yields the maximum value of a function. Here we aim to find the weights $w$, i.e. the coefficients of $y_{\text{ideal}}$, that maximise $P(\text{Data})$.
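As a sketch of what this $\arg\max$ means in practice, the brute-force search below evaluates $P(\text{Data})$ over a grid of candidate weights on hypothetical toy data (values hand-picked to lie near $y = 2x$, with $\sigma = 1$ assumed) and keeps the $w$ with the highest likelihood:

```python
import math

# Hypothetical toy data, hand-picked to lie near the line y = 2x plus noise.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
sigma = 1.0

def likelihood(w):
    """P(Data | w): product of the Gaussian probabilities of each error."""
    p = 1.0
    for x, y in zip(xs, ys):
        eps = w * x - y
        p *= math.exp(-0.5 * (eps / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
    return p

# Brute-force argmax over a coarse grid of candidate weights 0.0 .. 4.0.
w_best = max((i / 10 for i in range(41)), key=likelihood)
print(w_best)  # → 2.0, close to the true slope
```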

The argument that maximises $\prod_{i=1}^{N} P(wx_i - y_i)$ is the same as the argument that maximises $\ln \prod_{i=1}^{N} P(wx_i - y_i)$, because $\ln$ is monotonically increasing: if $a < b$ then $\ln a < \ln b$.

$$\begin{aligned}
w_{\text{best}} &= \arg\max_{w} \ln \prod_{i=1}^{N} P(wx_i - y_i) \\
&= \arg\max_{w} \sum_{i=1}^{N} \ln P(wx_i - y_i) \\
&= \arg\max_{w} \sum_{i=1}^{N} \ln\!\left[\frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{wx_i - y_i}{\sigma}\right)^2}\right] \\
&= \arg\max_{w} \sum_{i=1}^{N} \ln e^{-(wx_i - y_i)^2} \\
&= \arg\max_{w} \sum_{i=1}^{N} -(wx_i - y_i)^2
\end{aligned}$$

Note that dropping the constants $\frac{1}{\sigma\sqrt{2\pi}}$ (an additive term after taking the log) and $\frac{1}{2\sigma^2}$ (a positive scale factor) does not change which argument maximises $P(\text{Data})$.

Thus

$$w_{\text{best}} = \arg\max_{w} -\sum_{i=1}^{N}\left(y_{\text{ideal}} - y_{\text{real}}\right)^2 = \arg\min_{w} \sum_{i=1}^{N}\left(y_{\text{ideal}} - y_{\text{real}}\right)^2$$

This is exactly the mean squared error formula used in textbooks (dividing by $N$ does not change the minimiser). We have thus shown that the weight $w$ that minimises $\sum_{i=1}^{N}(wx_i - y_i)^2$ is also the one that best accounts for the noise $\epsilon$ under the Gaussian assumption.
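The equivalence can be sanity-checked numerically: the log-likelihood is a constant minus a positive multiple of the sum of squared errors, so both criteria select the same weight. A minimal sketch on hypothetical toy data ($\sigma = 1$ assumed):

```python
import math

# Hypothetical toy data lying near y = x.
xs = [1.0, 2.0, 3.0]
ys = [1.2, 1.9, 3.1]
sigma = 1.0

def log_likelihood(w):
    """ln P(Data | w) = -N*ln(sigma*sqrt(2*pi)) - SSE(w) / (2*sigma^2)."""
    return sum(
        -math.log(sigma * math.sqrt(2 * math.pi))
        - 0.5 * ((w * x - y) / sigma) ** 2
        for x, y in zip(xs, ys)
    )

def sse(w):
    """Sum of squared errors."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys))

grid = [i / 100 for i in range(201)]   # candidate weights 0.00 .. 2.00
w_ml = max(grid, key=log_likelihood)   # maximum-likelihood weight
w_ls = min(grid, key=sse)              # least-squares weight
print(w_ml == w_ls)  # True: both criteria pick the same weight
```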

From: https://www.youtube.com/watch?v=q7seckj1hwM&list=LL&index=63&t=428s