Statistics

"Statistics is vectors " ~ Joshua + a lot of statisticians

Introduction to the Geometry of Statistics

Why is population variance defined as

$$\mathrm{Var}(X)=\frac{\sum_{i=1}^{n}(x_i-\mu)^2}{n}$$

But sample variance is defined as

$$\mathrm{Var}(X)=\frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1}$$

The key difference lies in the definition of $\mu$ and $\bar{x}$.

Population Mean

$$E(X)=\mu$$

$\mu$ is assumed to be known in the population context. #Expectation

Sample mean

$$\bar{x}=\frac{1}{N}\sum_{i=1}^{N}x_i$$

where $\bar{x}$ is a point estimate of the unknown $\mu$ based on a random sample.

Residuals

Residuals have an interesting property. A residual is expressed as

$$x_i-\bar{x}$$

Now, the sum of the residuals is always 0.
Since the sample mean is $\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_i$, we have $n\bar{x}=\sum_{i=1}^{n}x_i$. Hence

$$\sum_{i=1}^{n}(x_i-\bar{x})=\sum_{i=1}^{n}x_i-n\bar{x}=0$$
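
A quick numerical check of this property (a minimal sketch assuming NumPy; the data values are arbitrary):

```python
import numpy as np

# Arbitrary sample data (any values work)
x = np.array([2.0, 5.0, 7.0, 11.0])

residuals = x - x.mean()   # x_i - x_bar
print(residuals.sum())     # 0.0 (up to floating-point rounding)
```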

Suppose we have two sets of data; let's express them as vectors of residuals. #Residuals

$$\hat{X}=\begin{pmatrix}x_1-\bar{x}\\x_2-\bar{x}\\\vdots\\x_n-\bar{x}\end{pmatrix}\qquad\hat{Y}=\begin{pmatrix}y_1-\bar{y}\\y_2-\bar{y}\\\vdots\\y_n-\bar{y}\end{pmatrix}$$

Sample mean residuals

A degree of freedom is an independent value in your dataset that can vary freely when estimating a statistic.

Data $X$ can be split into two parts, residuals ($\hat{X}$) and the sample mean.

$$X=\begin{pmatrix}x_1\\x_2\\\vdots\\x_n\end{pmatrix}=\begin{pmatrix}x_1-\bar{x}+\bar{x}\\x_2-\bar{x}+\bar{x}\\\vdots\\x_n-\bar{x}+\bar{x}\end{pmatrix}=\underbrace{\begin{pmatrix}x_1-\bar{x}\\x_2-\bar{x}\\\vdots\\x_n-\bar{x}\end{pmatrix}}_{\text{Residuals }(\hat{X})}+\underbrace{\bar{x}\begin{pmatrix}1\\1\\\vdots\\1\end{pmatrix}}_{\text{Sample mean}}$$

Let the residuals be the vector $\hat{X}$; we know that the sum of the terms in $\hat{X}$ is 0. Statistics#Residuals

$$\sum_{i=1}^{n}(x_i-\bar{x})=0$$

This limits what the residual vector can be. Suppose n=2 (2 data points).

$$\hat{X}=\begin{pmatrix}x\\-x\end{pmatrix}$$

We find that all possible values of $\hat{X}$ lie on the line $y=-x$. Thus its degree of freedom is 1 (as it only moves along a 1D line).

Suppose n=3.

$$\hat{X}=\begin{pmatrix}x\\y\\z\end{pmatrix}\qquad x+y+z=0$$

The vector $\hat{X}$ must lie on a 2D plane that satisfies $x+y+z=0$. Thus the degree of freedom of $\hat{X}$ is 2. Extrapolating this, $\hat{X}$ has $n-1$ degrees of freedom.
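
As a rough numerical illustration (a sketch assuming NumPy, with randomly generated samples), we can stack many residual vectors of length n as rows of a matrix; their rank never exceeds n - 1, because every row is orthogonal to the all-ones vector:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3                                   # length of each sample
samples = rng.normal(size=(1000, n))    # 1000 independent samples of size n

# Residual vectors: subtract each sample's own mean
residuals = samples - samples.mean(axis=1, keepdims=True)

print(np.linalg.matrix_rank(residuals))  # prints 2, i.e. n - 1, not n
```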

Sample Variance

As such, when calculating the variance of the residuals, since the degree of freedom of $\hat{X}$ is $n-1$, we divide by $n-1$ instead to find the average squared residual.

$$\mathrm{Var}(X)=\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2$$

Recall Linear Algebra#Dot Product

We can rewrite it as

$$\mathrm{Var}(X)=\frac{1}{n-1}(\hat{X}\cdot\hat{X})=\frac{1}{n-1}\lVert\hat{X}\rVert^2$$
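
A small check (a sketch assuming NumPy; the data is made up, and `np.var` needs `ddof=1` to use the n - 1 divisor) that the dot-product form matches the usual sample variance:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
x_hat = x - x.mean()                        # residual vector X_hat

var_dot   = (x_hat @ x_hat) / (len(x) - 1)  # (1/(n-1)) * ||X_hat||^2
var_numpy = np.var(x, ddof=1)               # NumPy's sample variance

print(var_dot, var_numpy)                   # identical values
```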

Sample Covariance

$$\mathrm{Cov}(X,Y)=\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})$$

We can rewrite it as

$$\mathrm{Cov}(X,Y)=\frac{1}{n-1}(\hat{X}\cdot\hat{Y})$$
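
Similarly, a sketch (assuming NumPy and made-up data; `np.cov` uses the n - 1 divisor by default) comparing the dot-product form with NumPy's sample covariance:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0])

x_hat = x - x.mean()
y_hat = y - y.mean()

cov_dot   = (x_hat @ y_hat) / (len(x) - 1)  # (1/(n-1)) * (X_hat . Y_hat)
cov_numpy = np.cov(x, y)[0, 1]              # off-diagonal entry of the 2x2 covariance matrix

print(cov_dot, cov_numpy)                   # identical values
```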

Population Mean residuals

The population mean is known before the calculation and is thus independent of $x_i$.

$$\sum_{i=1}^{n}(x_i-\mu)\neq 0$$

Thus, the residuals $x_i-\mu$, or the vector $\hat{X}$, can exist anywhere in the $n$-dimensional space, having $n$ degrees of freedom as its sum is not confined to 0.

Population Variance

$$\mathrm{Var}(X)=\frac{\sum_{i=1}^{n}(x_i-\mu)^2}{n}=\frac{1}{n}(\hat{X}\cdot\hat{X})=\frac{1}{n}\lVert\hat{X}\rVert^2$$

We can also rewrite it as

$$\mathrm{Var}(X)=\frac{\sum_{i=1}^{n}(x_i-\mu)^2}{n}=\frac{\sum_{i=1}^{n}x_i^2-2x_iE(X)+E(X)^2}{n}=E(X^2)-2E(X)^2+E(X)^2=E(X^2)-E(X)^2$$
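
A numerical sanity check of this identity (a sketch assuming NumPy; the array is treated as the whole population, so the divisor is n):

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # treat as the full population

var_direct   = np.var(x)                      # (1/n) * sum((x_i - mu)^2); ddof=0 is the default
var_identity = np.mean(x**2) - np.mean(x)**2  # E(X^2) - E(X)^2

print(var_direct, var_identity)               # both 4.0
```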

Population Covariance

$$\mathrm{Cov}(X,Y)=\frac{1}{n}\sum_{i=1}^{n}(x_i-\mu_x)(y_i-\mu_y)$$

Why the covariance of two independent variables is 0

$$\mathrm{Cov}(X,Y)=\frac{\sum_{i=1}^{n}(x_i-\mu_x)(y_i-\mu_y)}{n}$$

Recall the two residual vectors $\hat{X}$ and $\hat{Y}$ for independent variables. Based on the dot product rule, we can see that

$$\mathrm{Cov}(X,Y)=\frac{1}{n}(\hat{X}\cdot\hat{Y})$$

Thus, if $\hat{X}$ and $\hat{Y}$ are independent, they will not be similar (they do not point in a similar direction), so their dot product will be 0.
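
A rough illustration (a sketch assuming NumPy; with a finite random sample the dot product is only approximately zero, shrinking as n grows, since the statement is exact only for the population):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
x = rng.normal(size=n)      # X and Y drawn independently
y = rng.normal(size=n)

x_hat = x - x.mean()
y_hat = y - y.mean()

cov = (x_hat @ y_hat) / n   # population-style covariance estimate
print(cov)                  # close to 0 (exactly 0 only in expectation)
```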

Standard Deviation

$$\mathrm{Var}(X)=\sigma^2\implies\sigma=\sqrt{\mathrm{Var}(X)}$$

r Correlation Coefficient (A more elegant view of)

The r formula (Pearson correlation coefficient) is given as such

$$r=\frac{\sum(x-\bar{x})(y-\bar{y})}{\sqrt{\sum(x-\bar{x})^2}\sqrt{\sum(y-\bar{y})^2}}$$

Look at this monstrosity. But in a sense this is a rather simple formula.
Let's look at the formula again,

$$r=\frac{\sum(x-\bar{x})(y-\bar{y})}{\sqrt{\sum(x-\bar{x})^2}\sqrt{\sum(y-\bar{y})^2}}$$

Notice how we can rewrite it as

$$r=\frac{\hat{X}\cdot\hat{Y}}{\lVert\hat{X}\rVert\,\lVert\hat{Y}\rVert}=\cos\theta$$

The dot product determines the similarity of two vectors. If the vectors have a linear relationship, $\hat{Y}=a\hat{X}$, then $\hat{X}$ and $\hat{Y}$ point in the same (or exactly opposite) direction, and $r$ will be $\pm 1$. But if they are dissimilar (perpendicular), $r$ will be 0. This explains why $r$ tells us the correlation between two variables: it is simply $\cos\theta$.
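
A sketch (assuming NumPy and made-up data) confirming that Pearson's r equals the cosine of the angle between the two residual vectors:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0])

x_hat = x - x.mean()
y_hat = y - y.mean()

cos_theta = (x_hat @ y_hat) / (np.linalg.norm(x_hat) * np.linalg.norm(y_hat))
r         = np.corrcoef(x, y)[0, 1]   # Pearson correlation coefficient

print(cos_theta, r)                   # identical values
```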

Regression

Check this out: Regression using Linear Algebra (for univariate inputs)

The coefficients of the best fit for univariate functions can be expressed as follows, where $\tilde{X}$ is a Vandermonde matrix. #Vandermonde_Matrix

$$A=(\tilde{X}^T\tilde{X})^{-1}\tilde{X}^TY$$
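
A sketch of the normal-equation solution for a degree-2 fit (assuming NumPy; `np.vander` builds the Vandermonde matrix, `np.polyfit` is only a cross-check, and the data values are made up):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 0.9, 3.2, 8.8, 17.1])   # roughly quadratic data

degree = 2
X = np.vander(x, degree + 1)               # Vandermonde matrix (columns: x^2, x, 1)

A = np.linalg.inv(X.T @ X) @ X.T @ y       # A = (X^T X)^(-1) X^T Y
print(A)
print(np.polyfit(x, y, degree))            # same coefficients (highest degree first)
```

In practice `np.linalg.lstsq` (or `np.polyfit` itself) is preferred over forming the explicit inverse, but the normal-equation form mirrors the formula above.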

Interpolation

The coefficients of a univariate interpolant can be expressed simply, since $\tilde{X}$ is then a square matrix and has an inverse. (This holds only when $n=d+1$, where $d$ is the degree of the interpolant and $n$ is the number of data points the interpolant is supposed to pass through.)

$$A=\tilde{X}^{-1}Y$$
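
A sketch of exact interpolation (assuming NumPy and made-up points): with n = 3 points and a degree-2 interpolant the Vandermonde matrix is square and invertible, so the coefficients come from a single solve:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])   # n = 3 data points
y = np.array([2.0, 3.0, 6.0])

X = np.vander(x, len(x))        # square 3x3 Vandermonde matrix
A = np.linalg.solve(X, y)       # A = X^(-1) Y, degree n - 1 = 2 interpolant

print(A)                        # coefficients, highest degree first
print(np.polyval(A, x))         # reproduces y exactly at the data points
```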

Why use Mean Squared Error in Regression?

Check this out: Rationale Behind MSE for optimisation

$$w_{\text{best}}=\mathrm{argmax}\left(-\sum(y_{\text{ideal}}-y_{\text{real}})^2\right)=\mathrm{argmin}\sum(y_{\text{ideal}}-y_{\text{real}})^2$$

Where $y_{\text{ideal}}=wx$ and $w$ is the argument; argmin outputs the argument that minimises the sum of squared residuals. The above shows that the best function is obtained when we minimise the sum of the squared differences.
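
A brute-force sketch of this (assuming NumPy; the grid of candidate w values and the data are made up for illustration), showing that the w minimising the sum of squared residuals matches the closed-form slope:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])            # roughly y = 2x

ws = np.linspace(0.0, 4.0, 4001)              # candidate slopes
sse = [np.sum((w * x - y) ** 2) for w in ws]  # sum of squared residuals per candidate

w_best = ws[np.argmin(sse)]                   # argmin over the grid
print(w_best)                                 # close to the closed-form solution (x.y)/(x.x)
print((x @ y) / (x @ x))
```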

References

  1. https://www.youtube.com/watch?v=VDlnuO96p58
  2. https://medium.com/@andrew.chamberlain/a-more-elegant-view-of-r-squared-a0a14c177dc3
  3. https://www.youtube.com/watch?v=q7seckj1hwM&list=LL&index=63&t=428s