Influential points

Published: 2021 April 11. Modified: 2021 April 11, 23:04:04.

Some quick definitions:

Term	Description
Outlier	A data point is an outlier if it has a response that does not follow the general trend of the data.
High Leverage	A data point has high leverage if its predictor values are unusual. This may mean that one or more of its predictor values are extraordinarily high or low (e.g. a predictor value of 1 or 15 when most of the predictor values are between 5 and 10). It may also mean that it has a combination of predictor values that normally do not go together. For example, in a data set of K-12 students, being eight years old is not unusual, nor is being in the twelfth grade, but an eight-year-old in the twelfth grade would be a data point with high leverage.
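
To make the K-12 example concrete, here is a minimal Python sketch using NumPy. The student data are invented for illustration, and leverage is measured by the diagonal of the hat matrix (the leverage listed under Calculations). Each of the odd student's values is within the normal range on its own, yet the combination should give that row by far the largest leverage.

```python
import numpy as np

# Hypothetical K-12 data: two students per grade (0 = kindergarten .. 12),
# aged grade + 5 and grade + 6, so age and grade move together.
grades = np.repeat(np.arange(13), 2).astype(float)
ages = grades + np.tile([5.0, 6.0], 13)

# Append the unusual student: eight years old and in the twelfth grade.
ages = np.append(ages, 8.0)
grades = np.append(grades, 12.0)

# Design matrix with an intercept column and the two predictors.
X = np.column_stack([np.ones_like(ages), ages, grades])

# Leverages are the diagonal of the hat matrix H = X (X^T X)^{-1} X^T.
# With the reduced QR factorisation X = QR we have H = Q Q^T, so h_ii is
# the squared norm of row i of Q.
Q, _ = np.linalg.qr(X)
h = np.sum(Q**2, axis=1)

print("row with the largest leverage:", int(np.argmax(h)))
print("its leverage:", round(float(h[-1]), 3))
print("mean leverage:", round(float(h.mean()), 3))
```

Neither an age of 8 nor a grade of 12 is extreme here; only looking at both predictors together reveals the point.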

Calculations

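Term	Formula	Description
Residual	$e_i = y_i - \hat{y}_i$	As usual, this is just the difference between the observed response and the predicted response for an observation.
Deleted residual	$d_i = y_i - \hat{y}_{(i)} = \dfrac{e_i}{1 - h_{ii}}$	A deleted residual is the residual for observation $i$ based on a model fitted with that row removed, so $\hat{y}_{(i)}$ is the prediction for row $i$ from the reduced model. These are also called PRESS prediction errors, or unstandardized deleted residuals. They are at least as large in magnitude as the ordinary residuals, because each observation, and an influential one most of all, pulls the full model's predicted value towards itself.
Predicted R-squared	$R^2_{\text{pred}} = 1 - \dfrac{\text{PRESS}}{\text{SSTO}}$, where $\text{PRESS} = \sum_{i=1}^{n} d_i^2$	SSTO is the total sum of squares. This is generally a more intuitive result than working with PRESS directly. It is also a helpful way to evaluate a model without having to split the data into training and validation sets.
Leverage	$h_{ii}$, the $i$-th diagonal element of $H = X(X^{T}X)^{-1}X^{T}$	All the leverages should add up to $p$, the number of regression parameters (including the intercept).
Leverage threshold	$h_{ii} > 3\,\dfrac{p}{n}$	If a leverage is greater than three times the mean leverage $p/n$, where $n$ is the number of observations, then we can say the observation has high leverage.
Studentized residual	$r_i = \dfrac{e_i}{\sqrt{\text{MSE}\,(1 - h_{ii})}}$	In other words, this is the residual divided by an estimate of its standard deviation, where MSE is the mean squared error of the full model.
Studentized deleted residual	$t_i = \dfrac{e_i}{\sqrt{\text{MSE}_{(i)}\,(1 - h_{ii})}}$	The same ratio, but with the mean squared error $\text{MSE}_{(i)}$ taken from the model fitted without observation $i$. Minitab refers to these as deleted residuals.
Difference in fits (DFFITS)	$\text{DFFITS}_i = \dfrac{\hat{y}_i - \hat{y}_{(i)}}{\sqrt{\text{MSE}_{(i)}\,h_{ii}}}$	Roughly, the number of estimated standard deviations by which the fitted value for observation $i$ changes when that observation is removed.
DFFITS threshold		
Cook's distance	$D_i = \dfrac{e_i^2}{p\,\text{MSE}} \cdot \dfrac{h_{ii}}{(1 - h_{ii})^2}$	Cook's distance combines the size of the residual with the leverage. A large value indicates an observation is influential; as a rough guide, values above 1 are often treated as influential.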
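
The quantities in the table above can be computed from a single least squares fit. The following is a minimal Python sketch using NumPy, not a reference implementation. The function name `influence_diagnostics` and the data are made up for illustration, and the leverage cutoff is assumed to be three times the mean leverage p/n. It relies on the standard leave-one-out identities d_i = e_i / (1 - h_ii) and SSE_(i) = SSE - e_i^2 / (1 - h_ii), so no row-by-row refitting is needed.

```python
import numpy as np

def influence_diagnostics(X, y):
    """Influence measures for an ordinary least squares fit.

    X is the design matrix and must already include the intercept column.
    """
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta                       # residuals
    Q, _ = np.linalg.qr(X)
    h = np.sum(Q**2, axis=1)               # leverages h_ii (they sum to p)

    sse = float(e @ e)
    mse = sse / (n - p)
    d = e / (1 - h)                        # deleted residuals
    press = float(d @ d)
    ssto = float(np.sum((y - y.mean()) ** 2))

    r = e / np.sqrt(mse * (1 - h))         # studentized residuals
    # MSE of the fit without row i, obtained without refitting.
    mse_del = (sse - e**2 / (1 - h)) / (n - p - 1)
    t = e / np.sqrt(mse_del * (1 - h))     # studentized deleted residuals

    return {
        "residual": e,
        "leverage": h,
        "high_leverage": h > 3 * p / n,    # assumed cutoff: 3 * (p / n)
        "deleted_residual": d,
        "predicted_r2": 1 - press / ssto,
        "studentized": r,
        "studentized_deleted": t,
        "dffits": t * np.sqrt(h / (1 - h)),
        "cooks_distance": r**2 / p * h / (1 - h),
    }

# Made-up data: nine points near y = 2x, plus one point far out in x
# whose response falls well below that trend.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 20], dtype=float)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 25.0])
X = np.column_stack([np.ones_like(x), x])

diag = influence_diagnostics(X, y)
for name in ("leverage", "deleted_residual", "dffits", "cooks_distance"):
    print(name, np.round(diag[name], 3))
print("predicted R-squared:", round(diag["predicted_r2"], 3))
```

In this toy data set the last row is the only one flagged as high leverage, and its deleted residual is several times larger than its ordinary residual, because the full fit has been pulled towards it.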