Starting point of linear regression and why we prefer matrix algebra
Ironically, in many cases, especially in linear regression, calculations become simpler when approached through multi-dimensional matrix algebra rather than one-dimensional methods. Let's explore the foundational concepts of linear regression to understand this phenomenon.
Firstly, the setup. Let us say we are given a data matrix $X \in \mathbb{R}^{n \times p}$ and a label vector $Y \in \mathbb{R}^n$. We have $Y = X\beta + \varepsilon$, where the coefficient vector $\beta$ and the noise vector $\varepsilon$ are unknowns. Further, let's say that each component of $\varepsilon$ follows a normal distribution $N(0, \sigma^2)$ i.i.d. This will be our "model assumption" (it is our choice; we can relax or strengthen it as we need).
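As a concrete illustration of the model assumption, here is a minimal simulation sketch; the dimensions, coefficient values, and noise level are arbitrary choices of mine, not part of the setup above.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 100, 3                      # sample size and number of predictors (illustrative choices)
X = rng.normal(size=(n, p))        # data matrix
beta = np.array([1.0, -2.0, 0.5])  # "true" coefficients, unknown in practice
sigma = 0.7                        # noise standard deviation, also unknown in practice

eps = rng.normal(scale=sigma, size=n)  # i.i.d. N(0, sigma^2) components: the model assumption
Y = X @ beta + eps                     # label vector generated by the linear model
```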
Now, our objective is to estimate the parameter $\beta$ based on the given data. One reasonable way to achieve this is by choosing $\hat\beta$ in such a way that the sum of squares, defined as $\mathrm{RSS}(\beta) = \|Y - X\beta\|^2$, is minimized.
In the one-dimensional case ($y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$), the calculation can be described as follows. The RSS is expressed as $\mathrm{RSS}(\beta_0, \beta_1) = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2$, and note that this is a convex function. We obtain $\hat\beta_0$ and $\hat\beta_1$ by taking derivatives of $\mathrm{RSS}$ with respect to $\beta_0$ and $\beta_1$ and setting them to zero.
We arrive at $\hat\beta_1 = \frac{\sum_i (x_i - \bar x)(y_i - \bar y)}{\sum_i (x_i - \bar x)^2}$ and $\hat\beta_0 = \bar y - \hat\beta_1 \bar x$. Additionally, we can define $\mathrm{RSS} = \sum_i (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2$ and put the estimated variance at $\hat\sigma^2 = \mathrm{RSS}/(n-2)$, where the '2' in the denominator accounts for the degrees of freedom lost to the two estimated parameters. This is already cumbersome; for example, hypothesis testing involving $\hat\beta_0$ and $\hat\beta_1$ requires the mean and variance of $\hat\beta_0$, $\hat\beta_1$, and $\hat\sigma^2$. Since the randomness enters only through the $y_i$ in the numerators, some of these calculations are feasible, but the complexity escalates when we attempt to confirm the unbiasedness of $\hat\sigma^2$.
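As a quick sanity check of these one-dimensional formulas, here is a short sketch on simulated data (the true parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=n)  # y = beta0 + beta1*x + noise

x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

rss = np.sum((y - beta0_hat - beta1_hat * x) ** 2)
sigma2_hat = rss / (n - 2)   # the '2' accounts for the two estimated parameters

print(beta0_hat, beta1_hat, sigma2_hat)  # roughly 2, 3, and 0.25
```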
In the multivariate scenario, the process becomes more streamlined, as we'll see. Employing the same trick of differentiation, we can deduce that $\hat\beta = (X^\top X)^{-1} X^\top Y$. Accordingly, we can define the estimated variance $\hat\sigma^2 = \|Y - X\hat\beta\|^2 / (n - p)$.
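These matrix formulas translate almost verbatim into code. A minimal sketch follows (in practice one would prefer np.linalg.lstsq or a QR factorization over forming the inverse explicitly):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 3
X = rng.normal(size=(n, p))
beta = np.array([1.0, -2.0, 0.5])
Y = X @ beta + rng.normal(scale=0.7, size=n)

beta_hat = np.linalg.inv(X.T @ X) @ X.T @ Y   # (X'X)^{-1} X'Y
resid = Y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)          # RSS / (n - p)

print(beta_hat, sigma2_hat)
```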
To understand why $\hat\sigma^2$ is unbiased, we note that
$$Y - X\hat\beta = Y - X(X^\top X)^{-1}X^\top Y = (I - H)Y, \qquad H = X(X^\top X)^{-1}X^\top,$$
and $H$ is an orthogonal projection matrix onto the column space of $X$, because it is idempotent and symmetric. Thus, $I - H$ is an orthogonal projection matrix onto the orthogonal complement of the column space of $X$. This means we can reduce the above formula to
$$\|Y - X\hat\beta\|^2 = \|P(X\beta + \varepsilon)\|^2 = \|P\varepsilon\|^2 = \varepsilon^\top P \varepsilon$$
for the projection matrix $P = I - H$, where the middle step uses $PX = 0$.
Now consider the following sequence of equalities:
$$\mathbb{E}[\varepsilon^\top P \varepsilon] = \mathbb{E}[\operatorname{tr}(P\varepsilon\varepsilon^\top)] = \operatorname{tr}\!\big(P\,\mathbb{E}[\varepsilon\varepsilon^\top]\big) = \sigma^2 \operatorname{tr}(P) = \sigma^2 (n - p).$$
Dividing by $n - p$ gives $\mathbb{E}[\hat\sigma^2] = \sigma^2$, which establishes our claim.
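The trace argument is easy to check numerically. Below is a minimal sketch (dimensions and parameter values are my own): $\operatorname{tr}(I - H)$ should equal $n - p$ exactly, and a Monte Carlo average of $\hat\sigma^2$ should land close to the true $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma2 = 50, 4, 2.0
X = rng.normal(size=(n, p))
H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
print(np.trace(np.eye(n) - H))         # = n - p = 46 (up to rounding error)

beta = rng.normal(size=p)
estimates = []
for _ in range(20_000):
    Y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
    resid = Y - H @ Y                  # (I - H) Y
    estimates.append(resid @ resid / (n - p))
print(np.mean(estimates))              # close to sigma2 = 2.0
```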
To understand why $\hat\beta$ and $\hat\sigma^2$ are independent, we can rely on the fact that zero covariance implies independence for the multivariate normal distribution. This essentially emerges from the unique structure of the multivariate normal density: when the covariance matrix is (block) diagonal, the density separates into distinct components (i.e. it factors). Why such a 'miracle' happens specifically for the normal distribution isn't immediately clear to me.
The reasoning unfolds as follows: to demonstrate the independence of $\hat\beta$ and $\hat\sigma^2$, it is sufficient to show that $\hat\beta = (X^\top X)^{-1}X^\top Y$ and $(I - H)Y$ are independent, because $\hat\sigma^2$ is simply the squared norm of $(I - H)Y$ divided by $n - p$. From the earlier discussion, we are only required to show that the two transformation matrices of $Y$, namely $(X^\top X)^{-1}X^\top$ and $I - H$, yield zero cross-covariance, which holds because $X^\top (I - H) = 0$ by the definition of $H$.
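This zero-covariance condition can also be checked numerically. A minimal sketch (with illustrative dimensions of my choosing): the product $(X^\top X)^{-1}X^\top(I - H)$ should be the zero matrix, and the empirical correlation between a coordinate of $\hat\beta$ and $\hat\sigma^2$ should be near zero.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 60, 3
X = rng.normal(size=(n, p))
A = np.linalg.inv(X.T @ X) @ X.T             # beta_hat = A Y
H = X @ A                                    # hat matrix
print(np.abs(A @ (np.eye(n) - H)).max())     # ~ 0: A (I - H) = 0, hence zero covariance

beta, sigma = rng.normal(size=p), 1.0
b1, s2 = [], []
for _ in range(10_000):
    Y = X @ beta + rng.normal(scale=sigma, size=n)
    b1.append((A @ Y)[0])                    # first coordinate of beta_hat
    r = Y - H @ Y
    s2.append(r @ r / (n - p))               # sigma^2 estimate
print(np.corrcoef(b1, s2)[0, 1])             # near zero, consistent with independence
```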
With these insights, we can further delve into the statistical analysis of $\hat\beta$ and $\hat\sigma^2$.
What makes the multivariate calculation simpler in our case? Firstly, the use of the trace was pivotal, particularly because directly computing the expected value of $\varepsilon^\top (I - H)\varepsilon$ is impractical (by using the trace, we could essentially swap $\varepsilon^\top$ and $(I - H)\varepsilon$). Secondly, the unique attribute of multivariate normal distributions, that zero covariance equates to independence, plays a critical role.
This is not to suggest that complex calculations are inferior to more succinct ones. In mathematics, the aim is often to glean as much intuitive understanding as possible from various approaches. For example, in the 1D case, $\hat\beta_1 = r_{xy}\,\frac{s_y}{s_x}$ can be interpreted as the sample correlation between $X$ and $Y$, scaled by the ratio of their standard deviations, reflecting the relative strength of their relationship.
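That interpretation is straightforward to confirm on simulated data. The following sketch (with made-up numbers) compares the least-squares slope with the sample correlation scaled by the ratio of standard deviations:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=500)
y = 1.0 + 0.8 * x + rng.normal(scale=0.3, size=500)

beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
r = np.corrcoef(x, y)[0, 1]
print(beta1_hat, r * y.std() / x.std())   # the two numbers agree
```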
We’ve derived some fundamental formulas for multiple regression, setting the groundwork for our analysis. This illustrates how matrix algebra can streamline the computational process in regression analysis.