Starting point of linear regression and why we prefer matrix algebra
Ironically, in many cases, especially in linear regression, calculations become simpler when approached through multi-dimensional matrix algebra rather than one-dimensional methods. Let's explore the foundational concepts of linear regression to understand this phenomenon.
Firstly, the setup. Let us say we are given a data matrix $X \in \mathbb{R}^{n \times p}$ and a label vector $Y \in \mathbb{R}^n$. We have $Y = X\beta + \varepsilon$, where the coefficient vector $\beta$ and the noise vector $\varepsilon$ are unknowns. Further, let's say that each component of $\varepsilon$ follows a normal distribution $N(0, \sigma^2)$ i.i.d. This will be our "model assumption" (it is our choice; we can relax or strengthen it as we need).
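As a concrete illustration of the model assumption, here is a minimal simulation sketch; the dimensions, coefficient values, and noise level are arbitrary choices of mine, not part of the setup above.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 100, 3                      # sample size and number of predictors (illustrative choices)
X = rng.normal(size=(n, p))        # data matrix
beta = np.array([1.0, -2.0, 0.5])  # "true" coefficients, unknown in practice
sigma = 0.7                        # noise standard deviation, also unknown in practice

eps = rng.normal(scale=sigma, size=n)  # i.i.d. N(0, sigma^2) components: the model assumption
Y = X @ beta + eps                     # label vector generated by the linear model
```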
Now, our objective is to estimate the parameter $\beta$ based on the given data. One reasonable way to achieve this is by choosing $\hat\beta$ in such a way that the sum of squares, defined as $\mathrm{RSS}(\beta) = \|Y - X\beta\|^2$, is minimized.
In the one-dimensional case ($y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$), the calculation can be described as follows. The RSS is expressed as $\mathrm{RSS}(\beta_0, \beta_1) = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2$, and note that this is a convex function. We obtain $\hat\beta_0$ and $\hat\beta_1$ by taking derivatives of $\mathrm{RSS}$ with respect to $\beta_0$ and $\beta_1$ and setting them to zero.
We arrive at $\hat\beta_1 = \frac{\sum_i (x_i - \bar x)(y_i - \bar y)}{\sum_i (x_i - \bar x)^2}$ and $\hat\beta_0 = \bar y - \hat\beta_1 \bar x$. Additionally, we can define $\mathrm{RSS} = \sum_i (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2$ and put the estimated variance at $\hat\sigma^2 = \mathrm{RSS}/(n-2)$, where the '2' in the denominator accounts for the degrees of freedom lost to the two estimated parameters. This is already cumbersome; for example, hypothesis testing involving $\hat\beta_0$ and $\hat\beta_1$ requires the mean and variance of $\hat\beta_0$, $\hat\beta_1$, and $\hat\sigma^2$. Since the randomness enters only through the $y_i$ in the numerators, some of these calculations are feasible, but the complexity escalates when we attempt to confirm the unbiasedness of $\hat\sigma^2$.
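As a quick sanity check of these one-dimensional formulas, here is a short sketch on simulated data (the true parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=n)  # y = beta0 + beta1*x + noise

x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

rss = np.sum((y - beta0_hat - beta1_hat * x) ** 2)
sigma2_hat = rss / (n - 2)   # the '2' accounts for the two estimated parameters

print(beta0_hat, beta1_hat, sigma2_hat)  # roughly 2, 3, and 0.25
```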
In the multivariate scenario, the process becomes more streamlined, as we'll see. Employing the same trick of differentiation, we can deduce that $\hat\beta = (X^\top X)^{-1} X^\top Y$. Accordingly, we can define the estimated variance $\hat\sigma^2 = \|Y - X\hat\beta\|^2 / (n - p)$.
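These matrix formulas translate almost verbatim into code. A minimal sketch follows (in practice one would prefer np.linalg.lstsq or a QR factorization over forming the inverse explicitly):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 3
X = rng.normal(size=(n, p))
beta = np.array([1.0, -2.0, 0.5])
Y = X @ beta + rng.normal(scale=0.7, size=n)

beta_hat = np.linalg.inv(X.T @ X) @ X.T @ Y   # (X'X)^{-1} X'Y
resid = Y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)          # RSS / (n - p)

print(beta_hat, sigma2_hat)
```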
To understand why $\hat\sigma^2$ is unbiased, we note that
$$Y - X\hat\beta = Y - X(X^\top X)^{-1}X^\top Y = (I - H)Y, \qquad H = X(X^\top X)^{-1}X^\top,$$
and $H$ is an orthogonal projection matrix onto the column space of $X$, because it is idempotent and symmetric. Thus, $I - H$ is an orthogonal projection matrix onto the orthogonal complement of the column space of $X$. This means we can reduce the above formula to
$$\|Y - X\hat\beta\|^2 = \|P(X\beta + \varepsilon)\|^2 = \|P\varepsilon\|^2 = \varepsilon^\top P \varepsilon$$
for the projection matrix $P = I - H$, where the middle step uses $PX = 0$.
Now consider the following sequence of equalities:
$$\mathbb{E}[\varepsilon^\top P \varepsilon] = \mathbb{E}[\operatorname{tr}(P\varepsilon\varepsilon^\top)] = \operatorname{tr}\!\big(P\,\mathbb{E}[\varepsilon\varepsilon^\top]\big) = \sigma^2 \operatorname{tr}(P) = \sigma^2 (n - p).$$
Dividing by $n - p$ gives $\mathbb{E}[\hat\sigma^2] = \sigma^2$, which establishes our claim.
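The trace argument is easy to check numerically. Below is a minimal sketch (dimensions and parameter values are my own): $\operatorname{tr}(I - H)$ should equal $n - p$ exactly, and a Monte Carlo average of $\hat\sigma^2$ should land close to the true $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma2 = 50, 4, 2.0
X = rng.normal(size=(n, p))
H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
print(np.trace(np.eye(n) - H))         # = n - p = 46 (up to rounding error)

beta = rng.normal(size=p)
estimates = []
for _ in range(20_000):
    Y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
    resid = Y - H @ Y                  # (I - H) Y
    estimates.append(resid @ resid / (n - p))
print(np.mean(estimates))              # close to sigma2 = 2.0
```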
To understand why $\hat\beta$ and $\hat\sigma^2$ are independent, we can rely on the fact that zero covariance implies independence for the multivariate normal distribution. This essentially emerges from the unique structure of the multivariate normal density: when the covariance matrix is (block) diagonal, the density separates into distinct components (i.e. it factors). Why such a 'miracle' happens specifically for the normal distribution isn't immediately clear to me.
The reasoning unfolds as follows: to demonstrate the independence of $\hat\beta$ and $\hat\sigma^2$, it is sufficient to show that $\hat\beta = (X^\top X)^{-1}X^\top Y$ and $(I - H)Y$ are independent, because $\hat\sigma^2$ is simply the squared norm of $(I - H)Y$ divided by $n - p$. From the earlier discussion, we are only required to show that the two transformation matrices of $Y$, namely $(X^\top X)^{-1}X^\top$ and $I - H$, yield zero cross-covariance, which holds because $X^\top (I - H) = 0$ by the definition of $H$.
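This zero-covariance condition can also be checked numerically. A minimal sketch (with illustrative dimensions of my choosing): the product $(X^\top X)^{-1}X^\top(I - H)$ should be the zero matrix, and the empirical correlation between a coordinate of $\hat\beta$ and $\hat\sigma^2$ should be near zero.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 60, 3
X = rng.normal(size=(n, p))
A = np.linalg.inv(X.T @ X) @ X.T             # beta_hat = A Y
H = X @ A                                    # hat matrix
print(np.abs(A @ (np.eye(n) - H)).max())     # ~ 0: A (I - H) = 0, hence zero covariance

beta, sigma = rng.normal(size=p), 1.0
b1, s2 = [], []
for _ in range(10_000):
    Y = X @ beta + rng.normal(scale=sigma, size=n)
    b1.append((A @ Y)[0])                    # first coordinate of beta_hat
    r = Y - H @ Y
    s2.append(r @ r / (n - p))               # sigma^2 estimate
print(np.corrcoef(b1, s2)[0, 1])             # near zero, consistent with independence
```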
With these insights, we can further delve into the statistical analysis of $\hat\beta$ and $\hat\sigma^2$.
What makes the multivariate calculation simpler in our case? Firstly, the use of the trace was pivotal, particularly because directly computing the expected value of $\varepsilon^\top (I - H)\varepsilon$ is impractical (by using the trace, we could essentially swap $\varepsilon^\top$ and $(I - H)\varepsilon$). Secondly, the unique attribute of multivariate normal distributions, that zero covariance equates to independence, plays a critical role.
This is not to suggest that complex calculations are inferior to more succinct ones. In mathematics, the aim is often to glean as much intuitive understanding as possible from various approaches. For example, in the 1D case, $\hat\beta_1 = r_{xy}\,\frac{s_y}{s_x}$ can be interpreted as the sample correlation between $X$ and $Y$, scaled by the ratio of their standard deviations, reflecting the relative strength of their relationship.
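That interpretation is straightforward to confirm on simulated data. The following sketch (with made-up numbers) compares the least-squares slope with the sample correlation scaled by the ratio of standard deviations:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=500)
y = 1.0 + 0.8 * x + rng.normal(scale=0.3, size=500)

beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
r = np.corrcoef(x, y)[0, 1]
print(beta1_hat, r * y.std() / x.std())   # the two numbers agree
```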
We’ve derived some fundamental formulas for multiple regression, setting the groundwork for our analysis. This illustrates how matrix algebra can streamline the computational process in regression analysis.