5.3 - Analyzing Variation In The Regression Model

Assumptions of the Regression Model. The assumptions listed below enable us to calculate unbiased estimators of the population regression function coefficients (of Y given X) and to use these in predicting values of Y. You should be aware that violation of one or more of these assumptions reduces the efficiency of the model, but a detailed discussion of this topic is beyond the purview of this text. Assume that all these assumptions have been met.

• For each value of X there is an array of possible Y values normally distributed about the regression line.
• The mean of the distribution of possible Y values is on the regression line, i.e., the expected value of the error term is zero.
• The standard deviation of the distribution of possible Y values is constant regardless of the value of X (this is called homoscedasticity).
• The error terms are statistically independent of each other, i.e., there is no serial correlation.
• The error term is statistically independent of X.

Note: These assumptions are very important, in that they enable us to construct prediction intervals around our point estimate.

Variation in the Regression Model. Recall that the purpose of regression analysis is to predict the value of a dependent variable given the value of the independent variable. The LSBF technique yields the best single line to fit the data, but you also want some method of determining how good this estimating equation is. To do this, you must first partition the variation.

• Total Variation. The sum of squares total (SST) is a measure of the total variation of Y. SST is the sum of the squared differences between the observed values of Y and the mean of Y:

SST = Σ(Yi − Ȳ)²

Where:
SST = Sum of squared differences
Yi = Observed value of Y
Ȳ = Mean value of Y

While the above formula provides a clear picture of the meaning of SST, you can use the following formula to speed SST calculation:

SST = ΣY² − (ΣY)²/n

Total variation can be partitioned into two categories: explained and unexplained. This can be expressed as:

SST = SSR + SSE

• Explained Variation. The sum of squares regression (SSR) is a measure of the variation of Y that is explained by the regression equation. SSR is the sum of the squared differences between the calculated value of Y (Yc) and the mean of Y (Ȳ). You can use the following formula to speed SSR calculation:

SSR = A(ΣY) + B(ΣXY) − (ΣY)²/n

• Unexplained Variation. The sum of squares error (SSE) is a measure of the variation of Y that is not explained by the regression equation. SSE is the sum of the squared differences between the observed values of Y and the calculated value of Y. This is the random variation of the observations around the regression line. You can use the following formula to speed SSE calculation:

SSE = ΣY² − A(ΣY) − B(ΣXY)

Analysis of Variance. Variance is equal to variation divided by degrees of freedom (df). In regression analysis, df is a statistical concept that is used to adjust for sample bias in estimating the population mean.

• Mean Square Regression (MSR). For 2-variable linear regression, the value of df for calculating MSR is always one (1). As a result, in 2-variable linear regression, you can simplify the equation for MSR to read:

MSR = SSR/1 = SSR

• Mean Square Error (MSE). In 2-variable linear regression, df for calculating MSE is always n − 2. As a result, in simple regression, you can simplify the equation for MSE to read:

MSE = SSE/(n − 2)

• Analysis of Variance Table. The terms used to analyze variation/variance in the regression model are commonly summarized in an Analysis of Variance (ANOVA) table. A sketch illustrating the partition follows the table.

ANOVA Table

Source       Sum of Squares   df      Mean Square**
Regression   SSR              1       MSR
Error        SSE              n - 2   MSE
Total        SST              n - 1

**Mean Square = Sum of Squares/df
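To make the partition concrete, the minimal sketch below computes SST, SSR, and SSE directly from their definitions and verifies that SST = SSR + SSE. The (x, y) observations are hypothetical, chosen only for illustration; they are not the manufacturing overhead data used later in this section.

```python
# Minimal sketch: partition the variation in a 2-variable LSBF regression.
# The (x, y) observations below are hypothetical, for illustration only.
x = [4, 7, 10, 13, 16, 19]
y = [30, 45, 62, 80, 95, 110]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# LSBF slope (B) and intercept (A) for the line Yc = A + B*X
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
    sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar

y_calc = [a + b * xi for xi in x]  # calculated value Yc for each observation

sst = sum((yi - y_bar) ** 2 for yi in y)                # total variation
ssr = sum((yc - y_bar) ** 2 for yc in y_calc)           # explained variation
sse = sum((yi - yc) ** 2 for yi, yc in zip(y, y_calc))  # unexplained variation

print(f"SST       = {sst:.4f}")
print(f"SSR + SSE = {ssr + sse:.4f}")  # equals SST (within rounding)
```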
Constructing an ANOVA Table for the Manufacturing Overhead Example. Before you can calculate variance and variation, you must use the observations to calculate the statistics in the table below. Since we already calculated these statistics to develop the regression equation to estimate manufacturing overhead, we will begin our calculations with the values in the table below:

Statistic   Value
ΣX          144
ΣY          846
ΣXY         22,647
ΣX²         3,872
ΣY²         133,296
X̄           24
Ȳ           141
A           5.8272
B           5.6322
n           6

Step 1. Calculate SST.

SST = ΣY² − (ΣY)²/n = 133,296 − (846)²/6 = 133,296 − 119,286 = 14,010

Step 2. Calculate SSR.

SSR = A(ΣY) + B(ΣXY) − (ΣY)²/n = 5.8272(846) + 5.6322(22,647) − 119,286 = 4,929.81 + 127,552.43 − 119,286 ≈ 13,196

Step 3. Calculate SSE.

SSE = ΣY² − A(ΣY) − B(ΣXY) = 133,296 − 4,929.81 − 127,552.43 ≈ 814

Step 4. Calculate MSR.

MSR = SSR/1 = 13,196

Step 5. Calculate MSE.

MSE = SSE/(n − 2) = 814/(6 − 2) = 203.5 ≈ 204

Step 6. Combine the calculated values into an ANOVA table.

ANOVA Table

Source       Sum of Squares   df    Mean Square**
Regression   13,196           1     13,196
Error        814              4     204
Total        14,010           5

**Mean Square = Sum of Squares/df

Step 7. Check SST. Assure that the value for SST is equal to SSR plus SSE.

SST = SSR + SSE
14,010 = 13,196 + 814
14,010 = 14,010

5.4 - Measuring How Well The Regression Equation Fits The Data

Statistics Used to Measure Goodness of Fit. How well does the equation fit the data used in developing the equation? Three statistics are commonly used to determine the "goodness of fit" of the regression equation:

• Coefficient of determination;
• Standard error of the estimate; and
• T-test for significance of the regression equation.

Calculating the Coefficient of Determination. Most computer software designed to fit a line using regression analysis will also provide the coefficient of determination for that line. The coefficient of determination (r²) measures the strength of the association between the independent and dependent variables (X and Y). The range of r² is between zero and one:

0 ≤ r² ≤ 1

An r² of zero indicates that there is no relationship between X and Y. An r² of one indicates that there is a perfect relationship between X and Y. The closer r² is to 1, the better the regression line fits the data set. In fact, r² is the ratio of explained variation (SSR) to total variation (SST). An r² of .90 indicates that 90 percent of the variation in Y has been explained by its relationship with X; that is, 90 percent of the variation in Y has been explained by the regression line. For the manufacturing overhead example:

r² = SSR/SST = 13,196/14,010 ≈ .94

This means that approximately 94 percent of the variation in manufacturing overhead (Y) can be explained by its relationship with manufacturing direct labor hours (X).

Standard Error of the Estimate. The standard error of the estimate (SEE) is a measure of the accuracy of the estimating (regression) equation. The SEE indicates the variability of the observed (actual) points around the regression line (predicted points). That is, it measures the extent to which the observed values (Yi) differ from their calculated values (Yc). Given the first two assumptions required for use of the regression model (for each value of X there is an array of possible Y values which is normally distributed about the regression line, and the mean of this distribution (Yc) is on the regression line), the SEE is interpreted in a way similar to the way in which the standard deviation is interpreted. That is, given a value for X, we would generally expect the following intervals (based on the Empirical Rule):

• Yc ± 1 SEE contains approximately 68 percent of the total observations (Yi)
• Yc ± 2 SEE contains approximately 95 percent of the total observations (Yi)
• Yc ± 3 SEE contains approximately 99 percent of the total observations (Yi)

The SEE is equal to the square root of the MSE. For the manufacturing overhead example:

SEE = √MSE = √203.5 ≈ 14.27

A short sketch computing r² and the SEE from the summary statistics follows.
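The sketch below reproduces the ANOVA values, r², and SEE from the summary statistics in the table above, using the shortcut formulas from this section. Small differences from the text's figures (e.g., SEE printing as 14.26 rather than 14.27) are rounding effects, since the text rounds intermediate values.

```python
import math

# Summary statistics from the manufacturing overhead example.
n = 6
sum_y = 846
sum_xy = 22_647
sum_y2 = 133_296
a = 5.8272   # intercept (A) of Yc = A + B*X
b = 5.6322   # slope (B)

sst = sum_y2 - sum_y ** 2 / n                  # total variation: 14,010
ssr = a * sum_y + b * sum_xy - sum_y ** 2 / n  # explained variation: ~13,196
sse = sum_y2 - a * sum_y - b * sum_xy          # unexplained variation: ~814

msr = ssr / 1        # df for regression is always 1
mse = sse / (n - 2)  # df for error is n - 2 = 4

r_squared = ssr / sst    # ratio of explained to total variation: ~.94
see = math.sqrt(mse)     # standard error of the estimate: ~14.27

print(f"SST = {sst:.0f}, SSR = {ssr:.0f}, SSE = {sse:.0f}")
print(f"r^2 = {r_squared:.4f}, SEE = {see:.2f}")
```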
Steps for Conducting the T-test for the Significance of the Regression Equation. The regression line is derived from a sample. Because of sampling error, it is possible to get a regression relationship with a rather high r² (e.g., greater than 80 percent) when there is no real relationship between X and Y; that is, when there is no statistical significance. This phenomenon will occur only when you have very small sample data sets. You can test the significance of the regression equation by applying the T-test. Applying the T-test is a 4-step process:

Step 1. Determine the significance level (α).

α = 1 − confidence level

The selection of the significance level is a management decision; that is, management decides the level of risk associated with an estimate which it will accept. In the absence of any other guidance, use a significance level of .10.

Step 2. Calculate T. Use the values of MSR and MSE from the ANOVA table:

T = √(MSR/MSE)

Step 3. Determine the table value of t. From a t table, select the t value for the appropriate degrees of freedom (df). In 2-variable linear regression:

df = n − 2

Step 4. Compare T to the t table value. Decision rules:

• If T > t, use the regression equation for prediction purposes. It is likely that the relationship is significant.
• If T < t, do not use the regression equation for prediction purposes. It is likely that the relationship is not significant.
• If T = t, a highly unlikely situation, you are theoretically indifferent and may elect to use or not use the regression equation for prediction purposes.

Conducting the T-test for the Significance of the Regression Equation for the Manufacturing Overhead Example. To demonstrate use of the T-test, we will apply the 4-step procedure to the manufacturing overhead example:

Step 1. Determine the significance level (α). Assume that we have been told to use α = .05.

Step 2. Calculate T.

T = √(MSR/MSE) = √(13,196/204) ≈ 8.043

Step 3. Determine the table value of t. The partial table below is an excerpt of a t table.

df = n − 2 = 6 − 2 = 4

Partial t Table
df    t
2     4.303
3     3.182
4     2.776
5     2.571
6     2.447

Reading from the table, the appropriate value is 2.776.

Step 4. Compare T to the t table value. Since T (8.043) > t (2.776), use the regression equation for prediction purposes. It is likely that the relationship is significant.

Note: There is not normally a conflict between the decision indicated by the T-test and the magnitude of r². If r² is high, T is normally greater than t. A conflict could occur only in a situation where there are very few data points. In those rare instances where there is a conflict, you should accept the decision indicated by the T-test. It is a better indicator than r² because it takes into account the sample size (n) through the degrees of freedom (df). A short sketch of the test follows.
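A minimal sketch of the 4-step test, using the MSR and MSE from the ANOVA table above; the critical value 2.776 is taken from the partial t table (α = .05, df = n − 2 = 4).

```python
import math

msr = 13_196     # mean square regression (from the ANOVA table)
mse = 204        # mean square error (from the ANOVA table)
t_table = 2.776  # critical t for df = n - 2 = 4 at alpha = .05

t_calc = math.sqrt(msr / mse)  # T = sqrt(MSR/MSE): ~8.043

if t_calc > t_table:
    print(f"T = {t_calc:.3f} > t = {t_table}: relationship is likely significant;")
    print("use the regression equation for prediction purposes.")
else:
    print(f"T = {t_calc:.3f} <= t = {t_table}: do not use the regression equation.")
```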
5.5 - Calculating And Using A Prediction Interval

Formulating the Prediction Interval. You can develop a regression equation and use it to calculate a point estimate for Y given any value of X. However, a point estimate alone does not provide enough information for sound negotiations. You need to be able to establish a range of values which you are confident contains the true value of the cost or price which you are trying to predict. In regression analysis, this range is known as the prediction interval. For a regression equation based on a small sample, you should develop a prediction interval, using the following equation:

Y = Yc ± t(SEE)√(1 + 1/n + (X − X̄)²/(ΣX² − nX̄²))

Constructing a Prediction Interval for the Manufacturing Overhead Example. Assume that we want to construct a 95 percent prediction interval for the manufacturing overhead estimate at 2,100 manufacturing direct labor hours (X = 21, since X is expressed in hundreds of hours). Earlier in the chapter, we calculated Yc and the other statistics in the following table:

Statistic           Value
Yc                  124.1034
t (use n − 2 df)    2.776
SEE                 14.27
X̄                   24
ΣX²                 3,872

Using the table data, you would calculate the prediction interval as follows:

Y = 124.1034 ± 2.776(14.27)√(1 + 1/6 + (21 − 24)²/(3,872 − 6(24)²))
  = 124.1034 ± 39.6135√1.1883
  ≈ 124.1034 ± 43.18

When X = 21, the prediction interval is: 80.9207 ≤ Y ≤ 167.2861.

Prediction Statement: We would be 95 percent confident that the actual manufacturing overhead will be between $80,921 and $167,286 at 2,100 manufacturing direct labor hours. A short sketch of the calculation follows.
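The sketch below reproduces the interval from the statistics above; the endpoints differ from the text's $80,921 and $167,286 only by rounding of intermediate values.

```python
import math

# Statistics from the manufacturing overhead example
# (X in hundreds of labor hours, Y in thousands of dollars).
n = 6
x_bar = 24
sum_x2 = 3_872
see = 14.27
t = 2.776               # critical t, df = n - 2 = 4, for a 95% interval
a, b = 5.8272, 5.6322   # regression coefficients: Yc = A + B*X

x = 21           # 2,100 direct labor hours
y_c = a + b * x  # point estimate: 124.1034

# Half-width of the small-sample prediction interval.
half_width = t * see * math.sqrt(1 + 1 / n
                                 + (x - x_bar) ** 2 / (sum_x2 - n * x_bar ** 2))

print(f"Yc = {y_c:.4f}")
print(f"{y_c - half_width:.4f} <= Y <= {y_c + half_width:.4f}")
# ~80.92 <= Y <= ~167.29
```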
