Simple linear regression

Regression measures the relationship between two variables. Both variables must be continuous, i.e. quantitative, and each must contain many observations of the same quantity. For example, there may be two variables, $x$ for a student's GPA and $y$ for a student's total savings. Computing the regression between the two variables using data from 10,000 students (so $n = 10{,}000$) tells us the strength of the relationship between these two data vectors. A vector is a matrix with one dimension, i.e. it has just one column. On the other hand, regression could not be performed between GPA and a fixed quantity such as the cost of tuition per credit, because a fixed quantity does not vary: it is a single point rather than a vector of observations.
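As a minimal sketch of this idea, the snippet below measures the relationship between two continuous vectors with SciPy's linregress. The GPA and savings figures are simulated stand-ins, not real student records.

```python
# Hypothetical sketch: the regression between two continuous data vectors.
# The data are simulated stand-ins for the GPA/savings example above.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
gpa = rng.uniform(2.0, 4.0, size=10_000)            # one continuous variable
savings = 500 * gpa + rng.normal(0, 400, 10_000)    # another, loosely related

result = stats.linregress(gpa, savings)
print(f"slope={result.slope:.1f}, r={result.rvalue:.3f}")
```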

There are two types of regression:

  • Simple linear regression: compares the relationship between one predictor variable $x$, sometimes archaically called the explanatory or independent variable, and one response variable $y$, sometimes archaically called the outcome or dependent variable. For example, this could explore the regression between average daily caloric intake and frequency of annual doctor's visits.

  • Multiple linear regression: compares the relationship between two or more predictor variables and one response variable. For example, this could explore the regression between average daily caloric intake and average daily minutes spent exercising, compared to frequency of annual doctor's visits. (A sketch contrasting the two types follows this list.)
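Here is a minimal sketch contrasting the two types, using NumPy's least squares solver. The health data are simulated, and the coefficients shown are illustrative only.

```python
# Hypothetical sketch: simple vs. multiple linear regression on simulated
# health data, fit with numpy's least squares solver.
import numpy as np

rng = np.random.default_rng(1)
n = 200
calories = rng.normal(2200, 300, n)                   # predictor 1
exercise = rng.normal(30, 10, n)                      # predictor 2 (min/day)
visits = 0.002 * calories - 0.05 * exercise + rng.normal(0, 0.5, n)

# Simple: one predictor (calories), plus a column of ones for the intercept.
X1 = np.column_stack([np.ones(n), calories])
coef1, *_ = np.linalg.lstsq(X1, visits, rcond=None)

# Multiple: two predictors (calories and exercise).
X2 = np.column_stack([np.ones(n), calories, exercise])
coef2, *_ = np.linalg.lstsq(X2, visits, rcond=None)

print("simple   b0, b1    :", coef1)
print("multiple b0, b1, b2:", coef2)
```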

Variables can exhibit four aspects to their relationship:

  • Deterministic: some variables have a deterministic relationship, meaning they are totally related to one another by simple equations that convert between one and the other. For example, regression is not appropriate for measurements of temperature in Celsius compared to Fahrenheit, nor for measurements of height in inches compared to centimeters. Plotting these variables results in a perfect line with a certain slope.

  • Statistical: some variables have a statistical relationship, meaning they are related to one another in somewhat complex ways that can be measured but are not simply conversions from one unit to another. For example, regression can measure the statistical relationship between average daily water consumption and annual incidence of kidney stones. Plotting these variables results in a chart with scatter around an estimated trend line. (The sketch after this list contrasts the two cases.)

  • Trend: how closely the plotted points of the two variables gather around a central trend line.

  • Scatter: how far the plotted points of the two variables spread away from the central trend line.
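As a small sketch of the deterministic/statistical contrast, the snippet below compares a perfect unit conversion against a noisy, simulated water-consumption relationship. The correlation is exactly 1 in the first case and below 1 in the second.

```python
# Hypothetical sketch: deterministic vs. statistical relationships.
import numpy as np

rng = np.random.default_rng(2)
celsius = np.linspace(0, 100, 50)
fahrenheit = celsius * 9 / 5 + 32            # deterministic: exact conversion

water = rng.uniform(0.5, 3.0, 50)            # liters per day (simulated)
stones = 2.0 - 0.5 * water + rng.normal(0, 0.3, 50)   # statistical: noisy

print(np.corrcoef(celsius, fahrenheit)[0, 1])   # exactly 1.0: a perfect line
print(np.corrcoef(water, stones)[0, 1])         # below 1: scatter around a trend
```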

Here are some variables and formulas:

  • $i$: the experimental unit is the person, thing, or entity upon which an observation is made, e.g. a single patient, in this case patient $i$.

  • $x$: the predictor variable.

  • $y$: the response variable.

  • $x_i$: a predictor value, e.g. that patient's amount of daily water consumption.

  • $y_i$: an observed response, e.g. that patient's actual annual number of kidney stone incidents.

  • $\hat{y}_i$: the expected response based on the predictor value, e.g. that patient's expected annual number of kidney stone incidents.

  • $e_i = y_i - \hat{y}_i$: the residual error (a.k.a. the prediction error) measures the difference between the observed response and the expected response based on the predictor variable.

An ideal trend line, i.e. an ideal prediction model, will minimize the residuals $e_i$, meaning the predictions and observations will be as close as possible. There may be different approaches to forming a trend line, and calculations involving prediction error give us meaningful indications of which trend lines are better.

To quantify overall similarity or dissimilarity between predictions and observations, the "least squares criterion" is used. This measures overall similarity or dissimilarity as the sum $Q = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$. An ideal trend line will minimize $Q$, while a larger $Q$ value corresponds to a worse trend line.

The quantity $Q$ is equivalent to the sum of the squares of the prediction errors, $\sum_{i=1}^{n} e_i^2$. It is essential to perform the squaring, because otherwise the positive and negative prediction errors would cancel each other out; for the least squares line they sum to exactly zero, so the unsquared sum would be useless as a measure of fit.
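The snippet below is a minimal sketch of the criterion: it evaluates $Q$ for two candidate lines over a small made-up data set, and the smaller $Q$ identifies the better line.

```python
# Hypothetical sketch: comparing two candidate trend lines by the least
# squares criterion Q = sum((y_i - yhat_i)**2); the smaller Q wins.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def Q(b0, b1):
    yhat = b0 + b1 * x               # predicted responses for this line
    return np.sum((y - yhat) ** 2)   # squared errors cannot cancel out

print(Q(0.0, 2.0))   # a good guess: Q is small
print(Q(1.0, 1.5))   # a worse line: Q is larger
```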

Least squares line

The "least squares line" is a.k.a. the "least squares regression line" or "estimated regression equation" and is the formula that results in the least squares . This is generally calculated automatically using software.

This formula describes the least squares line: $\hat{y} = b_0 + b_1 x$.

$b_1$ is the slope of the trend line of the scatter plot of the predictor variable and the response variable. The slope can be positive or negative. If $b_1$ is positive, then the predictor and response variables increase and decrease together. If $b_1$ is negative, then when the predictor variable increases or decreases, the response variable decreases or increases, respectively. In general, we can expect the mean response to increase or decrease by $b_1$ for every 1-unit increase in $x$.

$b_0$ is the intercept, a measurement of the shift of the trend line either up or down, vertically, to fit within the plot; it equals the expected mean response when $x = 0$.

This formula to calculate $b_1$ is generally evaluated automatically by statistical software: $b_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$, with the intercept then given by $b_0 = \bar{y} - b_1 \bar{x}$.
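As a quick sketch, the snippet below evaluates these formulas by hand on a small made-up data set and checks the result against SciPy's built-in fit.

```python
# Hypothetical sketch: computing the slope and intercept from the formulas
# above and verifying them against scipy.stats.linregress.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

fit = stats.linregress(x, y)
print(b1, fit.slope)       # the two slopes agree
print(b0, fit.intercept)   # and so do the intercepts
```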

The sample variance $s^2$ measures how spread out the observed results are. It is calculated by comparing each observed result to the mean of all observed results. Below is the formula, where $\bar{y}$ is the mean of the observed values: $s^2 = \frac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n - 1}$

Mean square error ($MSE$) is similar to $s^2$ but is based on each subpopulation in the sample, with a subpopulation being defined as all observed values for each predictor value: $MSE = \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n - 2}$

$S = \sqrt{MSE}$ is known as the "regression standard error" or the "residual standard error" and estimates the standard deviation $\sigma$ of the errors in the population.

Scope

Sometimes, especially for hypothetical values of $x$ that are far beyond the original sample, we will get nonsensical predictions for $\hat{y}$. For example, nobody has a negative age or is ten meters tall, and such estimates would be outside the scope of the data set.
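The sketch below illustrates this with an invented age/reaction-time data set: a prediction inside the sampled range of $x$ is plausible, while one far outside it goes negative.

```python
# Hypothetical sketch: extrapolating far beyond the sampled range of x can
# produce nonsensical predictions, e.g. a negative reaction time.
import numpy as np

age = np.array([20.0, 30.0, 40.0, 50.0, 60.0])            # sampled x range
reaction = np.array([210.0, 205.0, 196.0, 185.0, 180.0])  # ms (invented)

b1 = np.sum((age - age.mean()) * (reaction - reaction.mean())) \
     / np.sum((age - age.mean()) ** 2)
b0 = reaction.mean() - b1 * age.mean()

print(b0 + b1 * 45)    # inside scope: a plausible value
print(b0 + b1 * 400)   # far outside scope: negative, i.e. nonsensical
```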

There are a few "sums of squares" approaches, each with their own insights,

Sum of squares

Formula

Description

SSR

The "regression sum of squares" quantifies how far the slope of the regression line is from the horizontal line indicating there is no relationship. A lower SSR means there is less relationship between the predictor and reply variables, while a higher SSR means there is a stronger relationship. A lower value means less correlation.

SSE

The "error sum of squares" or "sum of residual error squares" quantifies how much the observed results vary around the regression line of the predicted results. A higher value means more scatter, and if it substantially higher than SSR, then it indicates high scatter but also a high degree of correlation between the predictor variable and the reply variable.

SSTO

The "total sum of squares" quantifies how much the observed results vary around their mean. Also, SSTO = SSR + SSE.

The "coefficient of determination" (a.k.a "r-squared value") is lower when two variables are less dependent on one another, and higher when two variables are more dependent on one another.

  • It is always between 0 and 1, because it is a proportion.

  • At 1, the response is totally dependent on the predictor. Changes in the predictor account for all changes seen in the responses, e.g. a conversion to Celsius from Fahrenheit.

  • At 0, the response is not at all dependent on the predictor. Changes in the predictor account for none of the changes seen in the responses, e.g. two totally unrelated sets of data like body temperature and family size.

"Square of correlation coefficient, known as coefficient of determination, represents the proportion of variation in one variable that is accounted for by the variation in the other variable. For example, if height and weight of a group of persons have a correlation coefficient of 0.80, one can estimate that 64% (0.80 × 0.80 = 0.64) of variation in their weights is accounted for by the variation in their heights." (Aggarwal and Ranganathan, 2016)

When interpreting the sums of squares, compare them to one another. If SSR is much higher than SSE, then most of the variation in the observed responses is explained by the regression line, indicating strong correlation. If SSE is much higher than SSR, then most of the variation is scatter around the regression line, indicating weak correlation.
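As a quick sketch, $r^2$ can be computed from the sums of squares and checked against the square of the correlation coefficient:

```python
# Hypothetical sketch: r-squared as SSR / SSTO, checked against the square
# of the correlation coefficient r.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
yhat = b0 + b1 * x

r_squared = np.sum((yhat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)
r = np.corrcoef(x, y)[0, 1]
print(r_squared, r ** 2)   # the two computations agree
```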

When working with an entire population, not just a sample from the population, there are some differences in notation:

  • Population regression line: in instances where we have an entire population in our sample, e.g. all persons living in an apartment complex, or all students at a particular school, we can obtain the population regression line by the same techniques as those for a sample. It is equivalent to the least squares regression line, although the least squares line is an approximation of the population regression line, used when it is not possible or feasible to obtain data for an entire population, and so only a sample is taken.

  • $\mu_Y = E(Y) = \beta_0 + \beta_1 x$: the formula for the population regression line, comparable to the formula $\hat{y} = b_0 + b_1 x$ for the least squares line; it can be used when we have all results for the entire population.

  • $\mu_Y$: the predicted value, equivalent to $\hat{y}$, based on the predictor value, when working with a whole population. This can also be written as $E(Y)$ instead.

  • $\epsilon_i$: the error, equivalent to $e_i$, between a predicted value and its observed value, when working with a whole population.

  • $\beta_0$: the vertical shift in the trend line, equivalent to $b_0$, when calculating a value for the response variable based on a given value for the predictor variable, for a whole population.

  • $\beta_1$: the slope for the trend line, equivalent to $b_1$, formed by predictor values and predicted responses, making the line a best fit for the plot of predictor values and observed responses.

  • $\sigma^2$: the "common variance" quantifies in one number how much the observed responses vary around the predicted responses. This is useful because if we make our own predictions, e.g. forecasts for events in the future, or filling in missing data, then we can create a confidence interval around the result. It is a population parameter, not a sample parameter, and as a result is seldom obtained, because we seldom have all data for a population; its sample counterpart is the mean square error $MSE$.

There are four assumptions behind the simple linear regression model, which can be remembered by the mnemonic L.I.N.E.:

  • Linearly related: this means that essentially the mean response can be calculated by multiplying the predictor variable by a coefficient and adding an intercept. Equivalently, the mean of the errors, the differences between the predicted and observed values, is zero.

  • Independent: the errors are independent, meaning that they do not influence one another.

  • Normally distributed: the errors are normally distributed.

  • Equal variance: for each value of the predictor variable, the errors for that value have a variance equal to the variance of the errors at every other value.
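The snippet below is a rough sketch of how these assumptions can be eyeballed through the residuals of a fitted line, on simulated data that satisfies them by construction.

```python
# Hypothetical sketch: rough checks of the L.I.N.E. assumptions via the
# residuals of a fitted least squares line.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 200)
y = 3.0 + 1.2 * x + rng.normal(0, 1.0, 200)   # meets the assumptions

fit = stats.linregress(x, y)
residuals = y - (fit.intercept + fit.slope * x)

print(residuals.mean())                  # L: near zero if linearly related
print(stats.shapiro(residuals).pvalue)   # N: large p-value means no evidence
                                         #    against normality
lo = residuals[x < np.median(x)].var()
hi = residuals[x >= np.median(x)].var()
print(lo, hi)                            # E: similar spread across x values
# Independence (I) usually follows from the study design, not from a test.
```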

Remember that ultimately the statistician is interested in drawing conclusions about the population as a whole, not just the observed sample. For this reason, there are separate but comparable parameters for populations and their samples. To re-compare the population parameters and the corresponding sample parameters that estimate them:

  • $\mu_Y$ or $E(Y)$ is estimated by $\hat{y}$

  • $\beta_0$ is estimated by $b_0$

  • $\beta_1$ is estimated by $b_1$

  • $\epsilon_i$ is estimated by $e_i$

  • $\sigma^2$ is estimated by $s^2$ (for the whole sample) or $MSE$ (for all subpopulations of the whole sample)

  • Population regression line is estimated by the least squares regression line




Endnotes

Aggarwal and Ranganathan, 2016: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5079093/

Penn State STAT 501, Lesson 1: https://online.stat.psu.edu/stat501/lesson/1