Influential points
Some quick definitions:
Term | Description |
---|---|
Outlier | A data point is an outlier if it has a response that does not follow the general trend of the data. |
High Leverage | A data point has high leverage if its predictor values are unusual. This may mean that one or more of its predictor values are extraordinarily high or low (e.g. a predictor value of 1 or 15 when most of the predictor values are between 5 and 10). It may also mean that its predictor values do not normally occur together; for example, in a data set of K-12 students, being eight years old is not unusual, nor is being in the twelfth grade, but a student who is eight years old and in the twelfth grade would be a data point with high leverage. |
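To make the "unusual combination" case concrete, here is a minimal Python sketch. The student data is made up for illustration, and leverage is computed as the diagonal of the hat matrix, which is the standard definition. The variable names are assumptions of this sketch, not part of the original notes.

```python
import numpy as np

# Made-up K-12 data: two students per grade, aged grade + 5 and grade + 6.
grades = np.repeat(np.arange(1, 13), 2).astype(float)
ages = grades + np.tile([5.0, 6.0], 12)

# Append one hypothetical student: eight years old and in the twelfth grade.
ages = np.append(ages, 8.0)
grades = np.append(grades, 12.0)

# Design matrix with an intercept column, then the diagonal of the hat matrix.
X = np.column_stack([np.ones_like(ages), ages, grades])
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

# Neither value is extreme on its own: 8 and 12 both fall inside these ranges.
print("age range of the other students:", ages[:-1].min(), "to", ages[:-1].max())
print("grade range of the other students:", grades[:-1].min(), "to", grades[:-1].max())

# The combination, however, gives this row by far the largest leverage.
print("leverage of the unusual student:", round(leverage[-1], 3))
print("largest leverage among the others:", round(leverage[:-1].max(), 3))
```

Checking each predictor separately would not reveal this row, which is why leverage is computed from all predictors jointly.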
Calculations
Term | Formula | Description |
---|---|---|
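Residual | $e_i = y_i - \hat{y}_i$ | The difference between the observed response $y_i$ and the predicted response $\hat{y}_i$ for observation $i$. |
Deleted residual | $d_i = y_i - \hat{y}_{(i)}$ | The residual for observation $i$ based on a model fitted with that row removed; $\hat{y}_{(i)}$ is that model's prediction for the removed row. These are also called PRESS prediction errors, or unstandardized deleted residuals. They are usually larger in magnitude than ordinary residuals, because an observation that is included in the fit pulls the fitted values toward itself. |
Predicted R-squared | $R^2_{\text{pred}} = 1 - \frac{\text{PRESS}}{\text{SSTO}}$, where $\text{PRESS} = \sum_{i=1}^{n} d_i^2$ and $\text{SSTO} = \sum_{i=1}^{n} (y_i - \bar{y})^2$ | This is generally easier to interpret than the raw PRESS value, because it is on the same scale as R-squared. It is also a helpful way to evaluate a model without having to split the data into training and validation sets. |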
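Leverage | $h_{ii}$, the $i$th diagonal element of the hat matrix $H = X(X^{T}X)^{-1}X^{T}$ | All the leverages should add up to $p$, the number of regression parameters including the intercept. |
Leverage threshold | $h_{ii} > 3\,\frac{p}{n}$ | If leverage is greater than three times the mean leverage $p/n$ (for $n$ observations and $p$ regression parameters including the intercept), the observation is commonly flagged as having high leverage. |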
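Studentized residual | $r_i = \frac{e_i}{\sqrt{\text{MSE}\,(1 - h_{ii})}}$ | In other words, this is the residual $e_i$ divided by an estimate of its standard deviation, where MSE is the mean squared error of the full model and $h_{ii}$ is the leverage. |
Studentized deleted residual | $t_i = \frac{e_i}{\sqrt{\text{MSE}_{(i)}\,(1 - h_{ii})}}$ | The same ratio, but using $\text{MSE}_{(i)}$, the mean squared error of the model fitted with observation $i$ removed. Minitab refers to these as deleted residuals. |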
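Difference in fits (DFFITS) | $\text{DFFITS}_i = \frac{\hat{y}_i - \hat{y}_{(i)}}{\sqrt{\text{MSE}_{(i)}\,h_{ii}}}$ | The change in the fitted value for observation $i$ when that observation is removed from the fit, scaled by an estimate of the standard deviation of the fitted value. Here $\hat{y}_{(i)}$ and $\text{MSE}_{(i)}$ come from the model fitted without observation $i$, and $h_{ii}$ is the leverage. |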
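DFFITS threshold | $\lvert \text{DFFITS}_i \rvert > 2\sqrt{p/n}$ | A commonly cited rule of thumb, for $n$ observations and $p$ regression parameters including the intercept; other cutoffs are also in use. |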
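Cook's distance | $D_i = \frac{e_i^2}{p\,\text{MSE}} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}$ | Combines the size of the residual $e_i$ and the leverage $h_{ii}$ into a single measure; $p$ is the number of regression parameters including the intercept. A large Cook's distance value indicates an observation is influential. A common guideline is to investigate values above 0.5 and to treat values above 1 as likely influential. |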
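The quantities in the table can be computed together from a single least-squares fit. The following is a minimal NumPy sketch, not a reference implementation: the function name `influence_diagnostics`, the dictionary keys, and the data are all made up for illustration, and $p$ is taken to count the intercept. It relies on the standard identity $d_i = e_i / (1 - h_{ii})$, so no model is actually refitted.

```python
import numpy as np

def influence_diagnostics(x, y):
    """Case diagnostics for a least-squares fit with an intercept.

    x: (n,) or (n, k) predictors without the intercept column; y: (n,) response.
    """
    n = len(y)
    X = np.column_stack([np.ones(n), x])      # design matrix
    p = X.shape[1]                            # parameters, including intercept
    H = X @ np.linalg.inv(X.T @ X) @ X.T      # hat matrix
    h = np.diag(H)                            # leverages
    y_hat = H @ y
    e = y - y_hat                             # residuals
    mse = e @ e / (n - p)

    d = e / (1 - h)                           # deleted residuals, no refitting
    y_hat_del = y - d                         # leave-one-out predictions
    press = d @ d
    ssto = np.sum((y - y.mean()) ** 2)
    # MSE of each leave-one-out fit, again obtained without refitting
    mse_del = ((n - p) * mse - e ** 2 / (1 - h)) / (n - p - 1)

    return {
        "residual": e,
        "deleted_residual": d,
        "predicted_r2": 1 - press / ssto,
        "leverage": h,
        "high_leverage": h > 3 * p / n,
        "studentized_residual": e / np.sqrt(mse * (1 - h)),
        "studentized_deleted_residual": e / np.sqrt(mse_del * (1 - h)),
        "dffits": (y_hat - y_hat_del) / np.sqrt(mse_del * h),
        "cooks_distance": e ** 2 / (p * mse) * h / (1 - h) ** 2,
    }

# Made-up data: most x values lie between 5 and 10 and y is roughly 2x;
# the last row has x = 15 and a response well below that trend.
x = np.array([5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 15], dtype=float)
y = np.array([10.2, 11.8, 12.3, 14.1, 13.7, 16.2, 15.9, 18.1, 17.8, 20.3, 20.0])

out = influence_diagnostics(x, y)
print("rows flagged for high leverage:", np.flatnonzero(out["high_leverage"]))
print("row with the largest Cook's distance:", int(np.argmax(out["cooks_distance"])))
print("predicted R-squared:", round(out["predicted_r2"], 3))
```

In this made-up data only the last row exceeds the leverage cutoff, and it also has the largest Cook's distance. For real work, a tested library routine is usually preferable to hand-rolled formulas like these.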