Analysis of Variance (ANOVA) table and F-test
Term | Short form | Description |
---|---|---|
Analysis of Variance | ANOVA | A method that analyzes the variance of the response variable y by partitioning it into components. Since a variance is an averaged sum of squares, this amounts to an analysis of mean squares. |
Analysis of Variance table | ANOVA table | A table that provides a range of numbers relating to variance and dependence. |
Degrees of freedom | DF | The number of values that are free to vary in a computation; typically the number of observations minus the number of quantities estimated from them. Each sum of squares has its own degrees of freedom. |
Sum of squares | SS | The sum of the squares of various differences: sometimes the difference between an observed value and a fitted value, other times between a fitted value and the mean of the observed values. Altogether, these measure different types of variance. |
Mean squares | MS | The sum of squares divided by its degrees of freedom, or in other words, an estimate of a variance: $MS = SS / DF$. |
Regression sum of squares | SSR | A measure of the variance (sum of squares) in y that is due to changes in the predictor variable x. If this is a large proportion of SSTO, that indicates there is indeed a linear association between the predictor and the response variables. |
Error sum of squares | SSE | A measure of the variance (sum of squares) in y that is due to random error. If this is a large proportion of SSTO, that indicates there is not a linear association between the predictor and the response variables. |
Total sum of squares | SSTO | The sum of SSR and SSE. |
F-Value | F | The test statistic of the F-test: the ratio of two mean squares, e.g. $F = MSR / MSE$. If the null hypothesis is true, this ratio follows an F-distribution, so large values are evidence against the null hypothesis. |
P-Value | P | "What is the probability that we’d get an F statistic as large as we did, if the null hypothesis is true?" |
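To make these quantities concrete, here is a minimal sketch in Python (the toy data is illustrative) that computes SSR, SSE, and SSTO for a simple linear fit and checks the identity $SSTO = SSR + SSE$,

```python
import numpy as np

# Toy data: a roughly linear relationship with some noise.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit a simple linear regression y = b0 + b1 * x by least squares.
b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x

ssr = np.sum((y_hat - y.mean()) ** 2)  # variation explained by the regression
sse = np.sum((y - y_hat) ** 2)         # variation due to random error
ssto = np.sum((y - y.mean()) ** 2)     # total variation in y

print(ssr, sse, ssto)
print(np.isclose(ssr + sse, ssto))  # SSTO is the sum of SSR and SSE
```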
Here are the conclusions we can draw from the F-test, depending on how the P-value compares to our chosen significance level α,
Conclusion if P ≥ α | Conclusion if P < α | Hypothesis |
---|---|---|
We accept the null hypothesis that... | We reject the null hypothesis that... | The null hypothesis is that the predictor variable has zero impact on the response variable, i.e. there is no linear relationship between these two variables. |
We reject the hypothesis that... | We accept the hypothesis that... | The hypothesis is that the predictor variable has an impact on the response variable, i.e. these two variables have a linear relationship. |
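As a minimal sketch of this decision rule (the F-value, degrees of freedom, and α below are illustrative), we can turn an F-value into a P-value with scipy,

```python
from scipy.stats import f

# Illustrative values: an F-value with 1 and 18 degrees of freedom.
F_value = 12.5
df_regression, df_error = 1, 18
alpha = 0.05

# P-value: the probability of an F-value at least this large
# if the null hypothesis is true.
p_value = f.sf(F_value, df_regression, df_error)

if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject the null hypothesis")
else:
    print(f"p = {p_value:.4f} >= {alpha}: accept the null hypothesis")
```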
Using software, we generally wind up with tables where all the values are calculated automatically. However, it is still important to understand what these values are composed of, and how they are calculated. Below is an "analysis of variance" (ANOVA) table,
Source of Variation | Degrees of Freedom (DF) | Sum of Squares (SS) | Mean Squares (MS) | F-Value | P-Value |
---|---|---|---|---|---|
Regression (R) | $DF_R$ | $SSR$ | $MSR = \frac{SSR}{DF_R}$ | $F = \frac{MSR}{MSE}$ | $P$ |
Residual Error (E) | $DF_E$ | $SSE$ | $MSE = \frac{SSE}{DF_E}$ | | |
Lack of Fit (LF) | $DF_{LF}$ | $SSLF$ | $MSLF = \frac{SSLF}{DF_{LF}}$ | $F = \frac{MSLF}{MSPE}$ | $P$ |
Pure Error (PE) | $DF_{PE}$ | $SSPE$ | $MSPE = \frac{SSPE}{DF_{PE}}$ | | |
Total (TO) | $DF_{TO}$ | $SSTO$ | | | |
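For instance, here is a minimal sketch using statsmodels (the toy data is illustrative). Note that the basic `anova_lm` table gives the regression and residual error rows; the lack of fit decomposition is computed by hand further below.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Toy data: a roughly linear relationship with some noise.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [2.1, 3.9, 6.2, 8.1, 9.8, 12.3],
})

# Fit y = b0 + b1 * x by ordinary least squares.
model = smf.ols("y ~ x", data=df).fit()

# ANOVA table: DF, SS, MS, F-value, and P-value for the regression.
print(sm.stats.anova_lm(model))
```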
Below are formulas related to the degrees of freedom and the sums of squares,
Source of Variation | Degrees of Freedom (DF) | Sum of Squares (SS) | Mean Squares (MS) | F-Value | P-Value |
---|---|---|---|---|---|
R | $1$ | $SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$ | | | |
E | $n - 2$ | $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ | | | |
LF | $c - 2$ | $SSLF = \sum_{i=1}^{c} n_i (\bar{y}_i - \hat{y}_i)^2 = SSE - SSPE$ | | | |
PE | $n - c$ | $SSPE = \sum_{i=1}^{c} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2$ | | | |
TO | $n - 1$ | $SSTO = \sum_{i=1}^{n} (y_i - \bar{y})^2$ | | | |
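As a minimal sketch of these formulas (the toy data is illustrative, and includes repeated x values so that the pure error terms are nonzero; SSLF is obtained as SSE minus SSPE, per the table above),

```python
import numpy as np

# Toy data with repeated x values, so the pure error terms are nonzero.
x = np.array([1.0, 1.0, 2.0, 3.0, 3.0, 4.0])
y = np.array([1.8, 2.2, 4.1, 5.9, 6.3, 8.2])

b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x

ssr = np.sum((y_hat - y.mean()) ** 2)  # regression sum of squares
sse = np.sum((y - y_hat) ** 2)         # error sum of squares
ssto = np.sum((y - y.mean()) ** 2)     # total sum of squares

# Pure error: squared deviations of each y from the mean y at its x value.
sspe = sum(np.sum((y[x == level] - y[x == level].mean()) ** 2)
           for level in np.unique(x))
sslf = sse - sspe  # lack of fit sum of squares

print(ssr, sse, sslf, sspe, ssto)
```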
The values used above are defined as follows,
Variable | Description |
---|---|
$n$ | Number of observations. For example, for observations $x = \{1, 2, 2, 3\}$, $n = 4$. |
$c$ | The number of unique observed values. For example, for observations $x = \{1, 2, 2, 3\}$, $c = 3$. |
$n_i$ | The number of observations at the $i$-th unique value of $x$. |
$y_{ij}$ | The $j$-th observed response at the $i$-th unique value of $x$. |
$\bar{y}_i$ | The mean of the observed responses at the $i$-th unique value of $x$. |
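For instance, a minimal sketch computing $n$ and $c$ for a small illustrative sample,

```python
import numpy as np

x = np.array([1.0, 1.0, 2.0, 3.0, 3.0, 4.0])

n = len(x)             # number of observations: 6
c = len(np.unique(x))  # number of unique observed values: 4

print(n, c)
```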
Below are more formulas, showing how the mean squares, F-values, and P-values are derived from the sums of squares,
Source of Variation | Degrees of Freedom (DF) | Sum of Squares (SS) | Mean Squares (MS) | F-Value | P-Value |
---|---|---|---|---|---|
R | $1$ | $SSR$ | $MSR = \frac{SSR}{1}$ | $F^* = \frac{MSR}{MSE}$ | $P(F(1, n - 2) > F^*)$ |
E | $n - 2$ | $SSE$ | $MSE = \frac{SSE}{n - 2}$ | | |
LF | $c - 2$ | $SSLF$ | $MSLF = \frac{SSLF}{c - 2}$ | $F^* = \frac{MSLF}{MSPE}$ | $P(F(c - 2, n - c) > F^*)$ |
PE | $n - c$ | $SSPE$ | $MSPE = \frac{SSPE}{n - c}$ | | |
TO | $n - 1$ | $SSTO$ | | | |
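Continuing with illustrative numbers for the sums of squares, here is a minimal sketch of the mean square, F-value, and P-value calculations,

```python
from scipy.stats import f

# Illustrative sums of squares, as if carried over from the sketch above.
ssr, sse, sspe = 30.5, 0.35, 0.10
sslf = sse - sspe
n, c = 6, 4  # observations and unique observed values

msr = ssr / 1          # mean square for regression
mse = sse / (n - 2)    # mean square for error
mslf = sslf / (c - 2)  # mean square for lack of fit
mspe = sspe / (n - c)  # mean square for pure error

F_regression = msr / mse
F_lack_of_fit = mslf / mspe

# P-values from the F-distributions with the matching degrees of freedom.
print(F_regression, f.sf(F_regression, 1, n - 2))
print(F_lack_of_fit, f.sf(F_lack_of_fit, c - 2, n - c))
```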
Calculating R-squared (R²)
Generally, we can calculate R² as follows,

$$R^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO}$$
There is also an adjusted R², which penalizes additional model parameters; with $p$ fitted parameters (for simple linear regression, $p = 2$),

$$R^2_{adj} = 1 - \frac{SSE / (n - p)}{SSTO / (n - 1)}$$

Equivalently,

$$R^2_{adj} = 1 - (1 - R^2) \frac{n - 1}{n - p}$$
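A minimal sketch of both calculations (the sums of squares, $n$, and $p$ are illustrative),

```python
# Illustrative sums of squares from a simple linear regression.
ssr, sse = 30.5, 0.35
ssto = ssr + sse
n = 6  # number of observations
p = 2  # number of fitted parameters (intercept and slope)

r_squared = ssr / ssto  # equivalently: 1 - sse / ssto
adjusted_r_squared = 1 - (sse / (n - p)) / (ssto / (n - 1))

print(r_squared, adjusted_r_squared)
```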
Lack of fit test
To develop the F-statistic for lack of fit, we divide the mean square for lack of fit by the mean square for pure error,

$$F = \frac{MSLF}{MSPE} = \frac{SSLF / (c - 2)}{SSPE / (n - c)}$$
We use an F-test based on the null hypothesis that there is no lack of fit, and the alternative hypothesis that there is lack of fit. If we get a p-value less than our significance level α, we reject the null hypothesis and conclude that the straight-line model does not adequately fit the data.
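Putting the pieces together, here is a minimal sketch of the full lack of fit test (toy data with repeated x values, and an illustrative α of 0.05),

```python
import numpy as np
from scipy.stats import f

# Toy data with repeated x values, so pure error can be estimated.
x = np.array([1.0, 1.0, 2.0, 3.0, 3.0, 4.0])
y = np.array([1.8, 2.2, 4.1, 5.9, 6.3, 8.2])
alpha = 0.05

b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x

n, c = len(x), len(np.unique(x))
sse = np.sum((y - y_hat) ** 2)
sspe = sum(np.sum((y[x == level] - y[x == level].mean()) ** 2)
           for level in np.unique(x))
sslf = sse - sspe

# F-statistic: mean square for lack of fit over mean square for pure error.
F_lf = (sslf / (c - 2)) / (sspe / (n - c))
p_value = f.sf(F_lf, c - 2, n - c)

if p_value < alpha:
    print(f"p = {p_value:.4f}: reject the null hypothesis; there is lack of fit")
else:
    print(f"p = {p_value:.4f}: no evidence of lack of fit")
```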