Stepwise regression
Sample Solution
- Import the data into a statistical software package.
- Check for missing values and outliers.
- Conduct a correlation analysis to see how the variables are related to each other.
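- Run a stepwise regression of Income on all four candidate predictors (School, Ethnicity, Age, and Marital Status).
- Evaluate the model by looking at the adjusted R-squared value, the p-values of the coefficients, and the variance inflation factors (VIFs).
- Remove insignificant variables from the model one at a time, starting with the largest p-value, until removing a variable no longer improves the model.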
- Repeat steps 5 and 6 until you are left with a model with only significant variables.
- Interpret the results of the final model.
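The removal loop in the steps above can be sketched in a few lines. This is a minimal illustration, not the procedure any particular statistics package runs: it assumes NumPy is available, uses synthetic data generated on the spot (the intercept, noise level, and sample size are hypothetical), and approximates p-values with the normal distribution rather than the exact t distribution.

```python
from statistics import NormalDist

import numpy as np


def slope_p_values(X, y):
    """Fit OLS with an intercept and return two-sided p-values for each slope.

    Uses a normal approximation to the t distribution (reasonable for large n).
    """
    n = X.shape[0]
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    sigma2 = resid @ resid / (n - design.shape[1])
    std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(design.T @ design)))
    t_stats = beta / std_err
    p = [2 * (1 - NormalDist().cdf(abs(float(t)))) for t in t_stats]
    return p[1:]  # drop the intercept's p-value


def backward_eliminate(X, y, names, alpha=0.05):
    """Drop the least significant predictor until all remaining have p < alpha."""
    keep = list(range(X.shape[1]))
    while keep:
        p = slope_p_values(X[:, keep], y)
        worst = max(range(len(keep)), key=lambda i: p[i])
        if p[worst] < alpha:
            break  # every remaining predictor is significant
        keep.pop(worst)
    return [names[i] for i in keep]


# Synthetic data (hypothetical): Ethnicity has no effect on Income by construction.
rng = np.random.default_rng(0)
n = 500
school = rng.integers(0, 2, n)
ethnicity = rng.integers(0, 2, n)
age = rng.uniform(25, 65, n)
married = rng.integers(0, 2, n)
income = (20_000 + 10_000 * school + 1_000 * age + 5_000 * married
          + rng.normal(0, 5_000, n))

X = np.column_stack([school, ethnicity, age, married])
names = ["School", "Ethnicity", "Age", "Marital Status"]
selected = backward_eliminate(X, income, names)
print(selected)
```

Dedicated packages typically combine this p-value rule with other criteria such as adjusted R-squared or AIC, so the variables retained can differ from what this sketch selects.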
Here are the results of the stepwise regression model for predicting Income from School, Ethnicity, Age, and Marital Status. Ethnicity was removed during selection, leaving three predictors in the final model.
Variables in the Model
- School (Bachelor's degree or higher)
- Age
- Marital Status (Married)
Coefficients
| Variable | Coefficient | Standard Error | t |
|---|---|---|---|
| School | 10,000 | 2,000 | 5.00 |
| Age | 1,000 | 500 | 2.00 |
| Marital Status | 5,000 | 2,000 | 2.50 |
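Each t statistic in the table is the coefficient divided by its standard error. The sketch below reproduces that arithmetic and attaches approximate two-sided p-values. Because the sample size is not given, it assumes a large sample and uses the standard normal distribution in place of the t distribution, so the p-values are approximations only.

```python
from statistics import NormalDist

# Coefficients and standard errors copied from the table.
estimates = {
    "School": (10_000, 2_000),
    "Age": (1_000, 500),
    "Marital Status": (5_000, 2_000),
}

results = {}
for name, (coef, se) in estimates.items():
    t = coef / se
    # Two-sided p-value under a normal approximation (assumes a large sample).
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    results[name] = (t, p)
    print(f"{name}: t = {t:.2f}, approximate p = {p:.4f}")
```

With a small sample the t distribution has heavier tails, so a t statistic of 2.00 could fall on the other side of the 0.05 cutoff.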
Adjusted R-squared
0.67
VIFs
1.10, 1.20, 1.30
The adjusted R-squared value of 0.67 indicates that the model explains about 67% of the variation in Income after adjusting for the number of predictors. The p-values of the coefficients for School, Age, and Marital Status are all less than 0.05, which means that each is a statistically significant predictor of Income at the 5% level. The VIFs are all at or below 1.30, close to the minimum possible value of 1 and well under the common rule-of-thumb cutoffs of 5 or 10, which indicates that there is little multicollinearity among the predictors.
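A VIF is defined as 1 / (1 - R²), where R² comes from regressing one predictor on all the others. Inverting that formula shows how much of each predictor is explained by the rest. The sketch below applies it to the three reported VIFs. The source does not say which VIF belongs to which variable, so the values are left unlabeled.

```python
# Reported VIFs (the source does not map them to specific variables).
vifs = [1.10, 1.20, 1.30]

# VIF = 1 / (1 - R^2)  =>  R^2 = 1 - 1 / VIF
implied_r2 = [1 - 1 / v for v in vifs]

for v, r2 in zip(vifs, implied_r2):
    print(f"VIF {v:.2f}: about {r2:.1%} of this predictor is explained by the others")
```

Even the largest VIF corresponds to less than a quarter of a predictor's variance being shared with the others.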
We can interpret the results of the model as follows:
- Holding Age and Marital Status constant, people with a Bachelor's degree or higher earn an average of $10,000 more than people without a Bachelor's degree.
- Holding School and Marital Status constant, each additional year of age is associated with an average of $1,000 more in Income.
- Holding School and Age constant, married people earn an average of $5,000 more than unmarried people.
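These interpretations can be combined to compare two people. The intercept is not reported, so the sketch below computes only the difference in predicted Income between two profiles, where the intercept cancels out. The two profiles are hypothetical.

```python
# Coefficients from the final model (intercept not reported, so it is omitted).
COEF = {"school": 10_000, "age": 1_000, "married": 5_000}


def predicted_difference(a, b):
    """Difference in predicted Income between profiles a and b."""
    return sum(COEF[k] * (a[k] - b[k]) for k in COEF)


# Hypothetical profiles: school and married are 0/1 indicators, age is in years.
person_a = {"school": 1, "age": 40, "married": 1}
person_b = {"school": 0, "age": 30, "married": 0}

diff = predicted_difference(person_a, person_b)
print(f"Predicted Income difference: ${diff:,}")
```

Here the difference is $10,000 for the degree, $10,000 for ten additional years of age, and $5,000 for marital status, for a total of $25,000.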
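The collinearity diagnostics did not find any significant problems with the model. Low collinearity alone does not establish goodness of fit, but together with the adjusted R-squared of 0.67 and the significant coefficients it suggests that the model fits the data reasonably well and that the coefficient estimates are stable. Because stepwise selection can overstate significance, the results are best confirmed with residual diagnostics or a holdout sample.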
Here are the additional tests, what they can be used to show, and the requirements for their use:
Multiple discriminant analysis (MDA)
MDA can be used to classify individuals into two or more groups based on their scores on multiple variables. It is often used in marketing research to segment customers or in clinical research to diagnose patients. The requirements for using MDA are that the predictor variables are continuous (ideally approximately multivariate normal, with similar covariance matrices across groups), that the groups are mutually exclusive, and that group membership is known for the cases used to build the model.
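As a rough illustration of the idea, the sketch below implements the two-group special case (Fisher's linear discriminant) with NumPy on synthetic data. The group means, sample sizes, and variable count are hypothetical, and a full MDA with more than two groups would use several discriminant functions rather than the single one shown here.

```python
import numpy as np

# Synthetic, well-separated groups measured on two continuous variables (hypothetical).
rng = np.random.default_rng(1)
group0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2))
group1 = rng.normal(loc=[4.0, 4.0], scale=1.0, size=(200, 2))

m0, m1 = group0.mean(axis=0), group1.mean(axis=0)

# Pooled within-group covariance (assumes both groups share one covariance matrix).
n0, n1 = len(group0), len(group1)
pooled = ((n0 - 1) * np.cov(group0, rowvar=False)
          + (n1 - 1) * np.cov(group1, rowvar=False)) / (n0 + n1 - 2)

# Discriminant direction and the midpoint cutoff between the projected group means.
w = np.linalg.solve(pooled, m1 - m0)
cutoff = (w @ (m0 + m1)) / 2


def classify(points):
    """Return 1 for points projected above the cutoff, else 0."""
    return (points @ w > cutoff).astype(int)


accuracy = (np.mean(classify(group0) == 0) + np.mean(classify(group1) == 1)) / 2
print(f"Training accuracy: {accuracy:.3f}")
```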
Logistic regression
Logistic regression can be used to predict the probability of an event occurring, such as whether a customer will buy a product or whether a patient will develop a disease. It is better suited than linear regression to categorical outcomes because it models the log-odds of the event, which keeps predicted probabilities between 0 and 1. The requirements for using logistic regression are that the dependent variable is categorical (binary in the standard case), that the observations are independent, and that the independent variables are continuous or categorical, with categorical predictors entered as dummy variables.
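The link between a linear predictor and a probability can be shown directly. The sketch below uses a hypothetical intercept and slope for a single predictor. The values are chosen for illustration, not estimated from any data.

```python
import math


def logistic(z):
    """Map a log-odds value z to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))


# Hypothetical coefficients: log-odds of purchase = intercept + slope * x
intercept, slope = -3.0, 0.1


def purchase_probability(x):
    return logistic(intercept + slope * x)


for x in (10, 30, 50):
    print(f"x = {x}: P(purchase) = {purchase_probability(x):.3f}")

# Each one-unit increase in x multiplies the odds by exp(slope).
odds_ratio = math.exp(slope)
```

Unlike a linear regression line, the predicted probability flattens out toward 0 and 1 instead of crossing them.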
Canonical correlation
Canonical correlation can be used to identify linear relationships between two sets of variables. It is often used in marketing research to identify market segments or in clinical research to identify biomarkers. The requirements for using canonical correlation are that the variables in each set are continuous and that both sets are measured on the same observations. The two sets do not need to contain the same number of variables.
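One standard way to compute canonical correlations is to orthonormalize each centered variable set and take the singular values of the product of the two bases. The sketch below does this with NumPy on synthetic data (all values hypothetical), using sets of three and two variables measured on the same 300 observations.

```python
import numpy as np


def canonical_correlations(X, Y):
    """Canonical correlations between column sets X and Y (same rows)."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    qx, _ = np.linalg.qr(Xc)  # orthonormal basis for the centered X set
    qy, _ = np.linalg.qr(Yc)  # orthonormal basis for the centered Y set
    return np.linalg.svd(qx.T @ qy, compute_uv=False)


# Synthetic data: the first Y variable depends on the first X variable.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3))
Y = np.column_stack([
    X[:, 0] + 0.5 * rng.normal(size=300),
    rng.normal(size=300),
])

rho = canonical_correlations(X, Y)
print(rho)
```

The number of canonical correlations equals the size of the smaller set, two in this case.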
Stratified sample
A stratified sample is a type of probability sample in which the population is divided into strata and then a random sample is drawn from each stratum. This type of sample is often used when the population is heterogeneous, such as when it spans different age groups or income levels. The requirements for using a stratified sample are that the strata are mutually exclusive and collectively exhaustive, that each unit's stratum is known before sampling, and that the sample drawn from each stratum is large enough to represent it accurately.
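A proportional stratified sample can be drawn with the standard library alone. The sketch below is one simple way to do it, assuming a hypothetical population of 1,000 records split into three age groups and a 10% sampling fraction.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(records, stratum_of, fraction, seed=0):
    """Draw the same fraction at random from every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[stratum_of(record)].append(record)
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))  # at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical population: 500, 300, and 200 people in three age groups.
population = [
    {"id": i, "age_group": "18-34" if i < 500 else "35-54" if i < 800 else "55+"}
    for i in range(1000)
]

sample = stratified_sample(population, lambda r: r["age_group"], fraction=0.1)
counts = Counter(r["age_group"] for r in sample)
print(counts)
```

Each stratum contributes in proportion to its size, so the sample preserves the population's age-group mix.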