Simple linear regression
Regression measures the relationship between two variables. Both of the variables must be continuous — i.e., quantitative — meaning each is a numeric measurement taken across multiple samples. For example, there may be two variables (e.g. average daily water consumption and annual incidence of kidney stones) whose relationship we want to measure.
There are two types of regression,
Regression | Description |
---|---|
Simple linear regression | This measures the relationship between one predictor variable and the response variable. For example, this could explore the regression between average daily water consumption and annual incidence of kidney stones. |
Multiple linear regression | This measures the relationship between two or more predictor variables and the response variable. For example, this could explore the regression between average daily caloric intake and average daily minutes spent exercising compared to frequency of annual doctor's visits. |
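As a rough sketch of the difference, the snippet below fits both forms with NumPy's least squares solver. The data values are hypothetical, invented only to mirror the caloric intake, exercise, and doctor's visits example above.

```python
import numpy as np

# Hypothetical data (assumed for illustration): one row per patient.
calories = np.array([1800, 2200, 2500, 2700, 3000, 3400])  # avg daily caloric intake
exercise = np.array([60, 45, 30, 30, 20, 10])              # avg daily minutes exercising
visits   = np.array([1, 2, 2, 3, 4, 5])                    # annual doctor's visits

# Simple linear regression: one predictor (calories) vs. the response (visits).
X_simple = np.column_stack([np.ones_like(calories, dtype=float), calories])
coef_simple, *_ = np.linalg.lstsq(X_simple, visits, rcond=None)
print("simple:   intercept, slope =", coef_simple)

# Multiple linear regression: two predictors (calories, exercise) vs. the response.
X_multi = np.column_stack([np.ones_like(calories, dtype=float), calories, exercise])
coef_multi, *_ = np.linalg.lstsq(X_multi, visits, rcond=None)
print("multiple: intercept, slopes =", coef_multi)
```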
Variables can exhibit four aspects of their relationship,
Aspect | Description |
---|---|
Deterministic | Some variables have a deterministic relationship, meaning they are totally related to one another by simple equations to convert between one and the other. For example, regression is not appropriate for measurements of temperature in Celsius compared to Fahrenheit, nor between measurements of height in inches compared to centimeters. Plotting these variables will result in a perfect line with a certain slope. |
Statistical | Some variables have a statistical relationship, meaning they are related to one another in somewhat complex ways that can be measured but are not simply conversions from one unit to another. For example, regression can measure the statistical relationship between average daily water consumption and annual incidence of kidney stones. Plotting these variables will result in charts with scatter and trend around an estimated trend line. |
Trend | How closely the two variables gather around a central line. |
Scatter | How scattered the two variables are away from a central line. |
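A minimal illustration of the deterministic vs. statistical distinction above. The Celsius-to-Fahrenheit conversion is exact, so it correlates perfectly; the water-consumption figures are hypothetical, invented only for illustration.

```python
import numpy as np

celsius = np.array([0.0, 10.0, 20.0, 30.0, 40.0])

# Deterministic relationship: Fahrenheit is an exact function of Celsius,
# so the points fall on a perfect line and the correlation is exactly 1.
fahrenheit = celsius * 9 / 5 + 32
print(np.corrcoef(celsius, fahrenheit)[0, 1])  # 1.0

# Statistical relationship (hypothetical data): water intake vs. kidney stone
# incidents trends downward but with scatter, so |r| < 1.
water = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])  # liters per day
stones = np.array([3, 2, 2, 1, 1, 0])             # incidents per year
print(np.corrcoef(water, stones)[0, 1])           # close to -1, but not exact
```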
Here are some variables and formulas,
Var. or form. | Description |
---|---|
Experimental unit | The experimental unit is the person, thing, or entity upon which an observation is made, e.g. a single patient, in this case patient $i$. |
$x$ | Predictor variable. |
$y$ | Response variable. |
$x_i$ | Predictor value for the $i$th experimental unit, e.g. that patient's amount of daily water consumption. |
$y_i$ | Observed response, e.g. that patient's actual annual number of kidney stone incidents. |
$\hat{y}_i$ | Expected response based on the predictor value, e.g. that patient's expected annual number of kidney stone incidents. |
$e_i = y_i - \hat{y}_i$ | The residual error, i.e. the difference between the observed response and the expected response. |
Least squares criterion | An ideal trend line — i.e. an ideal prediction model — will minimize the residual errors. To quantify overall similarity or dissimilarity between predictions and observations, the "least squares criterion" is used. This measures overall similarity or dissimilarity as the sum of all squared residual errors. |
$Q = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ | The quantity minimized under the least squares criterion. |
Least squares line | The "least squares line" is a.k.a. the "least squares regression line" or "estimated regression equation" and is the formula $\hat{y} = b_0 + b_1 x$. |
$\hat{y}_i = b_0 + b_1 x_i$ | This formula is used to calculate the least squares line's prediction for the $i$th experimental unit. |
$b_1$ | This is the slope of the trend line of the scatter plot of the predictor variable and the response variable. The slope can be positive or negative. If $b_1$ is positive, the mean response increases as the predictor increases; if it is negative, the mean response decreases. In general, we can expect the mean response to change by $b_1$ for every one-unit increase in $x$. |
$b_0$ | This is a measurement of the shift of the trend line either up or down, vertically, to fit within the plot. It is the predicted mean response when $x = 0$. |
$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$, $b_0 = \bar{y} - b_1 \bar{x}$ | These formulas are used to calculate $b_1$ and $b_0$, where $\bar{x}$ and $\bar{y}$ are the means of the predictor values and the observed responses. |
$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}$ | The sample variance measures how spread out the observed results are. It is calculated by comparing each observed result to the mean of all observed results, where $n$ is the number of observations. |
$MSE = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2}$ | Mean square error is similar to the sample variance, except that it measures how spread out the observed results are around the predicted responses rather than around their mean, and it divides by $n - 2$ instead of $n - 1$. |
Scope | Sometimes, especially for hypothetical values of $x$ outside the range of the observed predictor values, the least squares line should not be used for prediction; the model is only trusted within the scope of the data. |
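To make the formulas concrete, here is a small sketch that computes $b_1$, $b_0$, the residuals, $Q$, and $MSE$ directly from the definitions above. The water-consumption sample is hypothetical.

```python
import numpy as np

# Hypothetical sample (assumed): daily water intake (liters) and annual
# kidney stone incidents for n = 6 patients.
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
y = np.array([3.0, 2.0, 2.0, 1.0, 1.0, 0.0])
n = len(x)

x_bar, y_bar = x.mean(), y.mean()

# Slope and intercept from the least squares formulas:
# b1 = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2), b0 = y_bar - b1 * x_bar
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar

y_hat = b0 + b1 * x  # predicted responses
e = y - y_hat        # residual errors
Q = np.sum(e ** 2)   # least squares criterion (the quantity minimized)
mse = Q / (n - 2)    # mean square error

print(f"b0={b0:.3f}, b1={b1:.3f}, Q={Q:.3f}, MSE={mse:.3f}")
```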
There are a few "sums of squares" measures, each with its own insights,
Sum of squares | Formula | Description |
---|---|---|
SSR | $SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$ | The "regression sum of squares" quantifies how far the regression line is from the horizontal line at $\bar{y}$ indicating there is no relationship. A lower SSR means there is less relationship between the predictor and response variables, while a higher SSR means there is a stronger relationship. |
SSE | $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ | The "error sum of squares" or "sum of residual error squares" quantifies how much the observed results vary around the regression line of the predicted results. A higher value means more scatter, and if it is substantially higher than SSR, then most of the variation is left unexplained by the model, indicating a weak correlation between the predictor and response variables. |
SSTO | $SSTO = \sum_{i=1}^{n} (y_i - \bar{y})^2$ | The "total sum of squares" quantifies how much the observed results vary around their mean. Also, $SSTO = SSR + SSE$. |
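A short sketch computing all three sums of squares on the same hypothetical sample, and checking the identity $SSTO = SSR + SSE$.

```python
import numpy as np

# Hypothetical water/kidney-stone sample reused from the sketch above.
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
y = np.array([3.0, 2.0, 2.0, 1.0, 1.0, 0.0])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

ssr = np.sum((y_hat - y.mean()) ** 2)  # regression sum of squares
sse = np.sum((y - y_hat) ** 2)         # error sum of squares
ssto = np.sum((y - y.mean()) ** 2)     # total sum of squares

print(ssr, sse, ssto)
assert np.isclose(ssr + sse, ssto)     # SSTO = SSR + SSE
```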
The "coefficient of determination"
|
Square of correlation coefficient
, known as coefficient of determination, represents the proportion of variation in one variable that is accounted for by the variation in the other variable. For example, if height and weight of a group of persons have a correlation coefficient of 0.80, one can estimate that 64% (0.80 × 0.80 = 0.64) of variation in their weights is accounted for by the variation in their heights. Aggarwal and Ranganathan, 2016
When interpreting the sums of squares, compare them to one another. If SSE is much higher than SSR, then most of the scatter in the observed responses is unexplained variation around the regression line, indicating weak correlation. If SSR is much higher than SSE, then most of the variation is explained by the regression line, indicating strong correlation.
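The same hypothetical sample can illustrate that, for simple linear regression, $SSR / SSTO$ equals the square of the correlation coefficient.

```python
import numpy as np

# Hypothetical sample reused from the sketches above.
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
y = np.array([3.0, 2.0, 2.0, 1.0, 1.0, 0.0])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
y_hat = y.mean() + b1 * (x - x.mean())  # same line as b0 + b1 * x

r2_from_sums = np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)
r2_from_corr = np.corrcoef(x, y)[0, 1] ** 2  # square of the correlation coefficient

print(r2_from_sums, r2_from_corr)  # the two agree for simple linear regression
```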
When working with an entire population, not just a sample from the population, there are some differences in notation,
Var. or form. | Description |
---|---|
Population regression line | In instances where we have an entire population in our sample — e.g. all persons living in an apartment complex, or all students at a particular school — then we can obtain the population regression line by the same techniques as those for a sample. It is equivalent to the least squares regression line, although the LSR line is an approximation of the population regression line that is used when it is not possible or feasible to obtain data for an entire population, and so only a sample is taken. |
$\mu_Y = E(Y) = \beta_0 + \beta_1 x$ | This is the formula for the population regression line, comparable to the formula for the least squares line, and can be used when we have results for the entire population. |
$\mu_Y$ | The predicted (mean) response — equivalent to $\hat{y}$ in a sample. |
$\epsilon_i$ | The error — equivalent to the residual $e_i$ in a sample. |
$\beta_0$ | The vertical shift in the trend line — equivalent to $b_0$ in a sample. |
$\beta_1$ | The slope for the trend line — equivalent to $b_1$ in a sample. |
$\sigma^2$ | The "common variance" quantifies in one number how much the observed responses vary around the predicted responses. This is useful because if we make our own predictions — e.g. forecasts for events in the future, or filling in missing data — then we can create a confidence interval with the estimated variance. In a sample, $\sigma^2$ is estimated by $MSE$. |
There are four assumptions behind the simple linear regression model, which can be remembered by the mnemonic L.I.N.E. as below,
Assumption | Description |
---|---|
Linearly related | The mean of the response variable is a linear function of the predictor variable — essentially, the response can be predicted by multiplying the predictor variable by a coefficient and adding a constant. Also, the mean of the errors — the differences between the predicted and observed values — is zero. |
Independent | The errors are independent, meaning that they are not influencing one another. |
Normally distributed | The errors are normally distributed. |
Equal variance | For each value of the predictor variable, the errors for that value have variance that is equal to the variance for errors at other values. |
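A couple of these assumptions can be spot-checked numerically. The sketch below (same hypothetical sample) checks that the residuals average to zero and compares their spread across the range of $x$; independence and normality are usually judged from the study design and from residual plots.

```python
import numpy as np

# Hypothetical sample reused from the sketches above.
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
y = np.array([3.0, 2.0, 2.0, 1.0, 1.0, 0.0])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
residuals = y - (b0 + b1 * x)

# L: residuals of a least squares fit with an intercept sum to (numerically) zero.
print("mean residual:", residuals.mean())

# E: compare residual spread in the lower vs. upper half of the x range;
# roughly equal spread is consistent with the equal-variance assumption.
lower, upper = residuals[x <= x.mean()], residuals[x > x.mean()]
print("spread (low x, high x):", lower.std(ddof=1), upper.std(ddof=1))

# I and N are usually judged from the study design and from plots of the
# residuals (e.g. a residuals-vs-fits plot and a normal probability plot).
```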
Remember that ultimately, the statistician is interested in drawing conclusions about the population as a whole, not just the observed sample. For this reason, there are separate but comparable parameters for populations and their samples. To compare population parameters with the corresponding sample statistics that estimate them,
Population parameter | Sample parameter |
---|---|
Population regression line $\mu_Y = \beta_0 + \beta_1 x$ | Least squares regression line $\hat{y} = b_0 + b_1 x$ |
$\beta_0$ | $b_0$ |
$\beta_1$ | $b_1$ |
$\epsilon_i$ | $e_i$ |
$\sigma^2$ | $MSE$ |
Endnotes
Aggarwal and Ranganathan, 2016: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5079093/
Penn State STAT 501, Lesson 1: https://online.stat.psu.edu/stat501/lesson/1