Want a straightforward approach to prediction? Nothing is as direct as simple linear regression (SLR). Suppose that you want to predict a quantitative response variable $Y$ on the basis of a single predictor variable $X$. SLR assumes a linear relationship between the two variables. We say:
[1] $Y \approx \beta_0 + \beta_1 X$
We can think of (1) as the regression of $Y$ on $X$, read as regressing $Y$ onto $X$. Note that $\beta_0$ and $\beta_1$ are two unknown constants that represent the intercept and the slope in the linear model in (1); collectively, these values are known as coefficients or parameters. Once we obtain some data, we can take these values a step further and produce estimates denoted as $\hat{\beta}_0$ and $\hat{\beta}_1$. Thus, (1) is now expressed as:
[2] $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$
Here, $\hat{y}$ indicates a prediction of $Y$ where $X = x$. Thus, $\hat{y}$ estimates the value of $Y$, which is unknown.
Now that we are this far, we can continue our review with a more pragmatic discussion of how to estimate the values of the coefficients. To proceed, we need data; let

$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

represent $n$ observations (pairs), each of which consists of a measurement of $X$ and a measurement of $Y$. Ultimately, we want to find a line that fits all these observations as closely as possible. This idea of closeness is interesting; there are many ways we can define how close we are to the various observations, but the most common (particularly when we are talking about SLR) is the least squares criterion.
Consider

[3] $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$

as the equation that predicts $Y$ based on the $i$th value of $X$. We can now consider the difference between the observed value $y_i$ and the prediction $\hat{y}_i$, which is denoted as the $i$th residual:

[4] $e_i = y_i - \hat{y}_i$
Ultimately, (4) will take us to the idea of the residual sum of squares and, as we will see, to the coefficients in (2) and (1). The residual sum of squares (RSS) is direct and expressed as:

$RSS = e_1^2 + e_2^2 + \cdots + e_n^2 = \sum_{i=1}^{n} \left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right)^2$

In least squares regression (LSR), we seek to MINIMIZE the RSS. This gives us the values of our coefficients, and hence the linear regression line:

$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$

with $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ as the sample means.
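As a concrete illustration, here is a minimal sketch (assuming Python with NumPy and a small made-up data set) that computes $\hat{\beta}_0$ and $\hat{\beta}_1$ directly from the formulas above:

```python
import numpy as np

# Hypothetical data: n = 6 paired measurements of X and Y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

x_bar, y_bar = x.mean(), y.mean()

# Closed-form least squares estimates
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

# Residual sum of squares for the fitted line
rss = np.sum((y - (beta0_hat + beta1_hat * x)) ** 2)

print(beta0_hat, beta1_hat, rss)
```

As a sanity check, `np.polyfit(x, y, 1)` should return the same slope and intercept (slope first).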
Understanding the minimization of RSS is essential, but the concept is not as straightforward as one would think. In fact, it's easy to understand that $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ is merely a line in the form $y = mx + b$, with $m$ and $b$ corresponding to our coefficients. The catch is explaining how the line is formed, that is, how we actually get the coefficients rather than just placing a formula out there.
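One way to see where those formulas come from (a brief sketch, using nothing beyond the definitions above) is to treat the RSS as a function of the two coefficients and set its partial derivatives to zero:

$\frac{\partial\, RSS}{\partial \hat{\beta}_0} = -2\sum_{i=1}^{n}\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right) = 0 \;\Rightarrow\; \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$

$\frac{\partial\, RSS}{\partial \hat{\beta}_1} = -2\sum_{i=1}^{n} x_i\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right) = 0$

Substituting the first result into the second and solving for $\hat{\beta}_1$ yields $\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) / \sum_{i=1}^{n}(x_i - \bar{x})^2$, so the closed-form coefficients are exactly the pair that minimizes the RSS.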
So, let’s back up. Our goal is to find a relationship that maps our observations onto some $y$. What does this look like geometrically? There is a great tool here that helps show how this works. If we pick a line that does not minimize the RSS, we get something like this:
The black line is the LSR line; the red line is some other line. The red boxes are the squares formed from each point to the red line, and their total area is not minimized. That is, the RSS was never minimized, so the red line does not have the correct values for our coefficients. The black boxes are the squares from each point to the LSR line; notice that their total area is minimized.
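To see the same idea numerically rather than geometrically, here is a small sketch (again assuming NumPy and the same hypothetical data as the earlier snippet) comparing the RSS of the least squares line against the RSS of an arbitrarily chosen line:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

def rss(b0, b1):
    """Residual sum of squares for the line y = b0 + b1 * x."""
    return np.sum((y - (b0 + b1 * x)) ** 2)

# Least squares coefficients (closed form)
b1_ls = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0_ls = y.mean() - b1_ls * x.mean()

print("RSS of least squares line:", rss(b0_ls, b1_ls))
print("RSS of an arbitrary line :", rss(1.0, 1.5))  # any other line has a larger RSS
```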
Coefficient Estimates and Accuracy
In general, LSR exhibits high interpretability and low flexibility. There are implications for model selection that must be considered. First, flexible models often require estimating a greater number of parameters and are more complex; this leads to a phenomenon known as overfitting, wherein the noise is modeled too closely (the noise is modeled rather than the signal). Second, there is the quintessential problem of spuriousness that tends to complicate all statistical models, and it is more chronic in problems where highly flexible models are appropriate (there are methods for dealing with such a problem, but we won't address that here). Third, there is a tradeoff between flexibility and interpretability that is important to consider: as flexibility increases, interpretability tends to decrease.
It is important to highlight that when estimating $\hat{\beta}_0$ and $\hat{\beta}_1$ from our data, there exists a true relationship between $X$ and $Y$ such that

$Y = \beta_0 + \beta_1 X + \epsilon$

where the error term $\epsilon$ has mean 0 and is drawn from a normal distribution. This relationship is based on the population data. In reality, we rarely have access to such an epistemic vantage point, so we are stuck working with a subset of the population data. These observations generate

$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$

as a distinct line that is an approximation of the population relationship above.
Prima facie, the idea of the estimated line being different from the population line is not clear. Why would such a difference exist when we have only one data set to work with? This is more a philosophical point about measurability, a fundamental problem of what we can observe, than anything inherent to the mathematics. Suppose we seek to understand certain characteristics of a large population; suppose further that we want to know the population mean $\mu$ of some random variable $Y$. We know that we can't really witness or obtain all the data elements of $Y$, but we can obtain a sample, and this will allow us to estimate the mean of $Y$. Since we have access to a sample of $n$ observations from $Y$, say $y_1, y_2, \ldots, y_n$, we can form a reasonable estimate $\hat{\mu} = \bar{y}$, where

$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$

is the sample mean. In general, the sample mean is a good estimator of the population mean. In the same way, $\hat{\beta}_0$ and $\hat{\beta}_1$ in (2) estimate the unknown intercept and slope, $\beta_0$ and $\beta_1$, in (1).
The analogy drawn above between the sample mean and SLR is apt because of the concept of bias. Think about it: if we use $\hat{\mu}$ to estimate $\mu$, the estimate is unbiased. In some cases the sample mean will overestimate the population mean, and in some cases it will underestimate it; however, on average, over a huge number of estimates, we expect $\hat{\mu}$ to equal the population mean. Hence, such an estimator does not systematically fail in estimating the true parameter. This property of unbiasedness holds for $\hat{\beta}_0$ and $\hat{\beta}_1$ as well.
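A quick way to convince yourself of this unbiasedness claim is a simulation. The sketch below is a made-up example assuming NumPy, a true intercept of 2, a true slope of 3, and normal errors; it repeatedly generates data sets from the same population and averages the resulting estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
beta0_true, beta1_true, n, sigma = 2.0, 3.0, 50, 1.0

slopes, intercepts = [], []
for _ in range(10_000):
    x = rng.uniform(0, 10, n)
    y = beta0_true + beta1_true * x + rng.normal(0, sigma, n)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    slopes.append(b1)
    intercepts.append(b0)

# Averaged over many data sets, the estimates land very close to the true coefficients
print(np.mean(intercepts), np.mean(slopes))
```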
Now, if we were to average these estimates over a very large number of data sets (pulled from the same population), we would obtain values equal to the “real” coefficients of the population. However, we should be asking ourselves how accurate these estimators are, even when unbiased, relative to their population equivalents. Let’s revert back to our example of the population mean of some random variable $Y$. How accurate is $\hat{\mu}$ as an estimate of $\mu$? We have asserted that the average of $\hat{\mu}$’s over many data sets will be very close to $\mu$; however, a single estimate $\hat{\mu}$ may substantially underestimate or overestimate $\mu$. How far off could we get? In general, we answer this question by computing the standard error of $\hat{\mu}$, written as $SE(\hat{\mu})$. We have the well-known formula:

[5] $Var(\hat{\mu}) = SE(\hat{\mu})^2 = \frac{\sigma^2}{n}$

where $\sigma$ is the standard deviation of each observation $y_i$. Basically, the standard error tells us the average amount that this estimate differs from $\mu$. Take note that $n$ in (5) is the number of data points, so the standard error gets smaller as $n$ gets larger; that is, the more observations, the more accurate the estimate.
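As a small worked example with made-up numbers: if the standard deviation of each observation were $\sigma = 10$, formula (5) gives the following standard errors for a few sample sizes:

```python
import math

sigma = 10.0  # assumed standard deviation of each observation
for n in (25, 100, 400, 1600):
    se = sigma / math.sqrt(n)  # SE(mu_hat) = sigma / sqrt(n)
    print(f"n = {n:5d} -> SE(mu_hat) = {se:.2f}")
```

Quadrupling the sample size halves the standard error, since $n$ appears under a square root.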
Now, we should be able to obtain a standard error for our coefficients. To compute the standard errors associated with $\hat{\beta}_0$ and $\hat{\beta}_1$, we use the following formulas (more details here):

$SE(\hat{\beta}_0)^2 = \sigma^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \right]$ and $SE(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$

where $\sigma^2 = Var(\epsilon)$. For these formulas to be valid, the errors $\epsilon_i$ must be uncorrelated with common variance $\sigma^2$. Note that we cannot know $\sigma^2$ directly, but we can estimate it; this estimate is known as the residual standard error, given by $RSE = \sqrt{RSS/(n-2)}$.
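Continuing with the hypothetical data from the earlier snippets, a minimal sketch of these computations might look like this:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])
n = len(x)

# Least squares fit
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Residual standard error: RSE = sqrt(RSS / (n - 2))
rss = np.sum((y - (b0 + b1 * x)) ** 2)
rse = np.sqrt(rss / (n - 2))

# Standard errors of the coefficients, with sigma estimated by the RSE
sxx = np.sum((x - x.mean()) ** 2)
se_b1 = np.sqrt(rse ** 2 / sxx)
se_b0 = np.sqrt(rse ** 2 * (1 / n + x.mean() ** 2 / sxx))

print(b0, b1, rse, se_b0, se_b1)
```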
Standard errors are useful for various things; one of these is computing confidence intervals. A 95% confidence interval (CI) is a range of values constructed so that, with approximately 95% probability, the range will contain the true unknown value of the parameter in question. The range is defined by upper and lower bounds; the 95% CI for $\beta_1$ is

$\hat{\beta}_1 \pm 2 \cdot SE(\hat{\beta}_1)$

which defines the interval

$\left[\, \hat{\beta}_1 - 2 \cdot SE(\hat{\beta}_1),\; \hat{\beta}_1 + 2 \cdot SE(\hat{\beta}_1) \,\right]$

That is, there is approximately a 95% chance that this interval contains the true value of $\beta_1$. We also define the CI for our intercept ($\beta_0$) as $\hat{\beta}_0 \pm 2 \cdot SE(\hat{\beta}_0)$.
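As a tiny illustration with made-up numbers (say $\hat{\beta}_1 = 2.0$ and $SE(\hat{\beta}_1) = 0.05$), the approximate 95% CI is computed like this:

```python
beta1_hat, se_beta1 = 2.0, 0.05  # hypothetical estimate and its standard error
lower = beta1_hat - 2 * se_beta1
upper = beta1_hat + 2 * se_beta1
print(f"approximate 95% CI for beta_1: [{lower:.2f}, {upper:.2f}]")  # [1.90, 2.10]
```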
Pulling It All Together
Standard errors are more than just interesting measures that help us understand SLR. We can use these tools to perform hypothesis tests on the coefficients. We proceed as follows (some great resources here); a minimal sketch of the slope test follows the list:
- Formulate a null hypothesis (think of this as the opposite of what you are guessing or looking for in your data).
- Provide an alternative hypothesis.
- Set an $\alpha$ level (this involves Type 1 and Type 2 errors; $\alpha$ corresponds to the Type 1 error rate; more on this later).
- Assuming you have correct data, calculate a test statistic (e.g., F or t).
- Determine thresholds of rejection.
- Reach a decision: reject the null or fail to reject it.
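To tie the steps together, here is a minimal sketch (assuming SciPy and the same hypothetical data used above) of the usual test of $H_0: \beta_1 = 0$ against $H_a: \beta_1 \neq 0$ using the t-statistic $t = \hat{\beta}_1 / SE(\hat{\beta}_1)$:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])
n = len(x)

# Fit and standard error of the slope
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
rse = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))
se_b1 = rse / np.sqrt(np.sum((x - x.mean()) ** 2))

# Test H0: beta_1 = 0 at alpha = 0.05
t_stat = b1 / se_b1
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # two-sided p-value

print(t_stat, p_value)
print("reject H0" if p_value < 0.05 else "fail to reject H0")
```

The same numbers can be cross-checked against `scipy.stats.linregress`, which reports the slope, its standard error, and the p-value for this same two-sided test of a zero slope.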