Statistics

"Statistics is vectors " ~ Joshua + a lot of statisticians

Introduction to the Geometry of Statistics

Why is population variance defined as

$$\mathrm{Var}(X)=\frac{\sum_{i=1}^{n}(x_i-\mu)^2}{n}$$

But sample variance is defined as

$$\mathrm{Var}(X)=\frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1}$$

The key difference lies in the definition of $\mu$ and $\bar{x}$.

Population Mean

$$E(X)=\mu$$

$\mu$ is assumed to be known in the population context. #Expectation

Sample mean

$$\bar{x}=\frac{1}{N}\sum_{i=1}^{N}x_i$$

where $\bar{x}$ is a point estimate of the unknown $\mu$ based on a random sample.

Residuals

Residuals have an interesting property. A residual is expressed as

$$x_i-\bar{x}$$

Now, the sum of the residuals is always 0.
Since the sample mean is $\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_i$, we have $n\bar{x}=\sum_{i=1}^{n}x_i$. Hence

$$\sum_{i=1}^{n}(x_i-\bar{x})=\sum_{i=1}^{n}x_i-n\bar{x}=0$$
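
A quick numerical check of this property (a minimal sketch assuming NumPy; the data values are arbitrary):

```python
import numpy as np

# Arbitrary sample data (any values work)
x = np.array([2.0, 5.0, 7.0, 11.0])

residuals = x - x.mean()   # x_i - x_bar
print(residuals.sum())     # 0.0 (up to floating-point rounding)
```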

Suppose we have two sets of data; let's express them as vectors of residuals. #Residuals

$$\hat{X}=\begin{pmatrix}x_1-\bar{x}\\x_2-\bar{x}\\\vdots\\x_n-\bar{x}\end{pmatrix}\qquad\hat{Y}=\begin{pmatrix}y_1-\bar{y}\\y_2-\bar{y}\\\vdots\\y_n-\bar{y}\end{pmatrix}$$

Sample mean residuals

A degree of freedom is an independent value in your dataset that can vary freely when estimating a statistic.

Data $X$ can be split into two parts, residuals ($\hat{X}$) and the sample mean.

$$X=\begin{pmatrix}x_1\\x_2\\\vdots\\x_n\end{pmatrix}=\begin{pmatrix}x_1-\bar{x}+\bar{x}\\x_2-\bar{x}+\bar{x}\\\vdots\\x_n-\bar{x}+\bar{x}\end{pmatrix}=\underbrace{\begin{pmatrix}x_1-\bar{x}\\x_2-\bar{x}\\\vdots\\x_n-\bar{x}\end{pmatrix}}_{\text{Residuals }(\hat{X})}+\underbrace{\bar{x}\begin{pmatrix}1\\1\\\vdots\\1\end{pmatrix}}_{\text{Sample mean}}$$

Let the residuals be the vector $\hat{X}$; we know that the sum of the terms in $\hat{X}$ is 0. Statistics#Residuals

$$\sum_{i=1}^{n}(x_i-\bar{x})=0$$

This limits what the residual vector can be. Suppose n=2 (2 data points).

$$\hat{X}=\begin{pmatrix}x\\-x\end{pmatrix}$$

We find that all possible values of $\hat{X}$ lie on the line $y=-x$. Thus its degree of freedom is 1 (as it only moves along a 1D line).

Suppose n=3.

$$\hat{X}=\begin{pmatrix}x\\y\\z\end{pmatrix}\qquad x+y+z=0$$

The vector $\hat{X}$ must lie on a 2D plane that satisfies $x+y+z=0$. Thus the degree of freedom of $\hat{X}$ is 2. Extrapolating this, $\hat{X}$ has $n-1$ degrees of freedom.
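
As a rough numerical illustration (a sketch assuming NumPy, with randomly generated samples), we can stack many residual vectors of length n as rows of a matrix; their rank never exceeds n - 1, because every row is orthogonal to the all-ones vector:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3                                   # length of each sample
samples = rng.normal(size=(1000, n))    # 1000 independent samples of size n

# Residual vectors: subtract each sample's own mean
residuals = samples - samples.mean(axis=1, keepdims=True)

print(np.linalg.matrix_rank(residuals))  # prints 2, i.e. n - 1, not n
```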

Sample Variance

As such, when calculating the variance of the residuals, since the degree of freedom of $\hat{X}$ is $n-1$, we divide by $n-1$ instead to find the average squared residual.

$$\mathrm{Var}(X)=\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2$$

Recall Linear Algebra#Dot Product

We can rewrite it as

$$\mathrm{Var}(X)=\frac{1}{n-1}(\hat{X}\cdot\hat{X})=\frac{1}{n-1}\lVert\hat{X}\rVert^2$$
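
A small check (a sketch assuming NumPy; the data is made up, and `np.var` needs `ddof=1` to use the n - 1 divisor) that the dot-product form matches the usual sample variance:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
x_hat = x - x.mean()                        # residual vector X_hat

var_dot   = (x_hat @ x_hat) / (len(x) - 1)  # (1/(n-1)) * ||X_hat||^2
var_numpy = np.var(x, ddof=1)               # NumPy's sample variance

print(var_dot, var_numpy)                   # identical values
```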

Sample Covariance

$$\mathrm{Cov}(X,Y)=\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})$$

We can rewrite it as

$$\mathrm{Cov}(X,Y)=\frac{1}{n-1}(\hat{X}\cdot\hat{Y})$$
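
Similarly, a sketch (assuming NumPy and made-up data; `np.cov` uses the n - 1 divisor by default) comparing the dot-product form with NumPy's sample covariance:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0])

x_hat = x - x.mean()
y_hat = y - y.mean()

cov_dot   = (x_hat @ y_hat) / (len(x) - 1)  # (1/(n-1)) * (X_hat . Y_hat)
cov_numpy = np.cov(x, y)[0, 1]              # off-diagonal entry of the 2x2 covariance matrix

print(cov_dot, cov_numpy)                   # identical values
```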

Population Mean residuals

The population mean is known before the calculation and is thus independent of $x_i$.

$$\sum_{i=1}^{n}(x_i-\mu)\neq 0$$

Thus, the residuals $x_i-\mu$, or the vector $\hat{X}$, can exist anywhere in the $n$-dimensional space, having $n$ degrees of freedom as its sum is not confined to 0.

Population Variance

$$\mathrm{Var}(X)=\frac{\sum_{i=1}^{n}(x_i-\mu)^2}{n}=\frac{1}{n}(\hat{X}\cdot\hat{X})=\frac{1}{n}\lVert\hat{X}\rVert^2$$

We can also rewrite it as

$$\mathrm{Var}(X)=\frac{\sum_{i=1}^{n}(x_i-\mu)^2}{n}=\frac{\sum_{i=1}^{n}x_i^2-2x_iE(X)+E(X)^2}{n}=E(X^2)-2E(X)^2+E(X)^2=E(X^2)-E(X)^2$$
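
A numerical sanity check of this identity (a sketch assuming NumPy; the array is treated as the whole population, so the divisor is n):

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # treat as the full population

var_direct   = np.var(x)                      # (1/n) * sum((x_i - mu)^2); ddof=0 is the default
var_identity = np.mean(x**2) - np.mean(x)**2  # E(X^2) - E(X)^2

print(var_direct, var_identity)               # both 4.0
```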

Population Covariance

$$\mathrm{Cov}(X,Y)=\frac{1}{n}\sum_{i=1}^{n}(x_i-\mu_x)(y_i-\mu_y)$$

Why the covariance of two independent variables is 0

$$\mathrm{Cov}(X,Y)=\frac{\sum_{i=1}^{n}(x_i-\mu_x)(y_i-\mu_y)}{n}$$

Recall the two residual vectors $\hat{X}$ and $\hat{Y}$ for independent variables. Based on the dot product rule, we can see that

$$\mathrm{Cov}(X,Y)=\frac{1}{n}(\hat{X}\cdot\hat{Y})$$

Thus, if $\hat{X}$ and $\hat{Y}$ are independent, they will not be similar (they do not point in a similar direction), so their dot product will be 0.
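
A rough illustration (a sketch assuming NumPy; with a finite random sample the dot product is only approximately zero, shrinking as n grows, since the statement is exact only for the population):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
x = rng.normal(size=n)      # X and Y drawn independently
y = rng.normal(size=n)

x_hat = x - x.mean()
y_hat = y - y.mean()

cov = (x_hat @ y_hat) / n   # population-style covariance estimate
print(cov)                  # close to 0 (exactly 0 only in expectation)
```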

Standard Deviation

$$\mathrm{Var}(X)=\sigma^2\implies\sigma=\sqrt{\mathrm{Var}(X)}$$

r Correlation Coefficient (A more elegant view of)

The r formula (Pearson correlation coefficient) is given as such

$$r=\frac{\sum(x-\bar{x})(y-\bar{y})}{\sqrt{\sum(x-\bar{x})^2}\sqrt{\sum(y-\bar{y})^2}}$$

Look at this monstrosity. But in a sense this is a rather simple formula.
Let's look at the formula again,

$$r=\frac{\sum(x-\bar{x})(y-\bar{y})}{\sqrt{\sum(x-\bar{x})^2}\sqrt{\sum(y-\bar{y})^2}}$$

Notice how we can rewrite it as

$$r=\frac{\hat{X}\cdot\hat{Y}}{\lVert\hat{X}\rVert\,\lVert\hat{Y}\rVert}=\cos\theta$$

The dot product determines the similarity of two vectors. If the vectors have a linear relationship, $\hat{Y}=a\hat{X}$, then $\hat{X}$ and $\hat{Y}$ point in the same (or exactly opposite) direction, and $r$ will be $\pm 1$. But if they are dissimilar (perpendicular), $r$ will be 0. This explains why $r$ tells us the correlation between two variables: it is simply $\cos\theta$.
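
A sketch (assuming NumPy and made-up data) confirming that Pearson's r equals the cosine of the angle between the two residual vectors:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0])

x_hat = x - x.mean()
y_hat = y - y.mean()

cos_theta = (x_hat @ y_hat) / (np.linalg.norm(x_hat) * np.linalg.norm(y_hat))
r         = np.corrcoef(x, y)[0, 1]   # Pearson correlation coefficient

print(cos_theta, r)                   # identical values
```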

Regression

Check this out: Regression using Linear Algebra (for univariate inputs)

The coefficients of the best fit for univariate functions can be expressed as follows, where $\tilde{X}$ is a Vandermonde matrix. #Vandermonde_Matrix

$$A=(\tilde{X}^T\tilde{X})^{-1}\tilde{X}^TY$$
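
A sketch of the normal-equation solution for a degree-2 fit (assuming NumPy; `np.vander` builds the Vandermonde matrix, `np.polyfit` is only a cross-check, and the data values are made up):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 0.9, 3.2, 8.8, 17.1])   # roughly quadratic data

degree = 2
X = np.vander(x, degree + 1)               # Vandermonde matrix (columns: x^2, x, 1)

A = np.linalg.inv(X.T @ X) @ X.T @ y       # A = (X^T X)^(-1) X^T Y
print(A)
print(np.polyfit(x, y, degree))            # same coefficients (highest degree first)
```

In practice `np.linalg.lstsq` (or `np.polyfit` itself) is preferred over forming the explicit inverse, but the normal-equation form mirrors the formula above.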

Interpolation

The coefficients of a univariate interpolant can be expressed simply, since $\tilde{X}$ is then a square matrix and has an inverse. (This holds only when $n=d+1$, where $d$ is the degree of the interpolant and $n$ is the number of data points the interpolant is supposed to pass through.)

$$A=\tilde{X}^{-1}Y$$
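
A sketch of exact interpolation (assuming NumPy and made-up points): with n = 3 points and a degree-2 interpolant the Vandermonde matrix is square and invertible, so the coefficients come from a single solve:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])   # n = 3 data points
y = np.array([2.0, 3.0, 6.0])

X = np.vander(x, len(x))        # square 3x3 Vandermonde matrix
A = np.linalg.solve(X, y)       # A = X^(-1) Y, degree n - 1 = 2 interpolant

print(A)                        # coefficients, highest degree first
print(np.polyval(A, x))         # reproduces y exactly at the data points
```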

Why use Mean Squared Error in Regression?

Check this out: Rationale Behind MSE for optimisation

$$w_{\text{best}}=\mathrm{argmax}\left(-\sum(y_{\text{ideal}}-y_{\text{real}})^2\right)=\mathrm{argmin}\sum(y_{\text{ideal}}-y_{\text{real}})^2$$

Where $y_{\text{ideal}}=wx$ and $w$ is the argument; argmin outputs the argument that minimises the sum of squared residuals. The above shows that the best function is obtained when we minimise the sum of the squared differences.
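
A brute-force sketch of this (assuming NumPy; the grid of candidate w values and the data are made up for illustration), showing that the w minimising the sum of squared residuals matches the closed-form slope:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])            # roughly y = 2x

ws = np.linspace(0.0, 4.0, 4001)              # candidate slopes
sse = [np.sum((w * x - y) ** 2) for w in ws]  # sum of squared residuals per candidate

w_best = ws[np.argmin(sse)]                   # argmin over the grid
print(w_best)                                 # close to the closed-form solution (x.y)/(x.x)
print((x @ y) / (x @ x))
```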

References

  1. https://www.youtube.com/watch?v=VDlnuO96p58
  2. https://medium.com/@andrew.chamberlain/a-more-elegant-view-of-r-squared-a0a14c177dc3
  3. https://www.youtube.com/watch?v=q7seckj1hwM&list=LL&index=63&t=428s