Verifying the Assumptions Again
We need to verify that the assumptions for regression analysis still hold, since we have removed some variables from our analysis. From the normal probability plot and the histogram, we observe that the normality assumption is still valid, and the residual plots all show that the residuals are normally distributed (see Appendix VIII). However, some outliers remain in the residual plot of opponent 3-pointers per game. In a further attempt to improve this particular model, we analyzed the data again, omitting the perceived outliers identified in the residual plots.
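As a numerical sketch of the check that the normal probability plot performs graphically: sort the residuals and compare them with the theoretical normal quantiles; a correlation near 1 supports the normality assumption. The residuals below are simulated stand-ins, not the actual study data.

```python
import random
from statistics import NormalDist, mean

random.seed(0)
# Simulated residuals standing in for the study's regression residuals.
residuals = [random.gauss(0, 0.08) for _ in range(30)]

# Normal probability plot logic: pair the i-th smallest residual with the
# ((i - 0.5)/n)-quantile of the standard normal distribution.
n = len(residuals)
sorted_res = sorted(residuals)
theo = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

# Correlation between observed and theoretical quantiles.
mx, my = mean(theo), mean(sorted_res)
num = sum((x - mx) * (y - my) for x, y in zip(theo, sorted_res))
den = (sum((x - mx) ** 2 for x in theo)
       * sum((y - my) ** 2 for y in sorted_res)) ** 0.5
corr = num / den
print(f"probability-plot correlation: {corr:.3f}")  # near 1 for normal data
```

A markedly low correlation, or isolated points far from the line, would flag the kind of outliers discussed above.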
6.4.2 Further Improvement

Performing the analysis without the observed outliers does not make our model any better at prediction – in fact, the reverse is the case, as the R-Sq value falls from 77.7% to 49.8% and the S value rises from 0.07979 to 0.118905. This may be an indication that the first model is still better at predicting the winning percentage of a team. The regression equation for this model is:

Winning percentage = 0.487 + 0.0184 Free throws per game + 0.0240 Opponent Turn-over,pg + 0.0188 Home rebound per game – 0.0303 Oppnt rebound per game – 0.0243 Opp 3-point per game

S = 0.118905   R-Sq = 49.8%   R-Sq(adj) = 45.7%

More details of this model are presented in Appendix IX. Since we have not yet improved on our first model, we continue to try.

6.4.3 A Third Model

We still search for a better model, now choosing another combination of variables from the Best Subset Regression Analysis. This one has six variables, and the regression model is presented below (more details in Appendix X):

Winning percentage = 0.565 + 0.0239 3-point per game + 0.0163 Free throws per game – 0.0630 Turn-over,pg + 0.0436 Opponent Turn-over,pg + 0.0265 Home rebound per game – 0.0310 Oppnt rebound per game

S = 0.0755690   R-Sq = 80.3%   R-Sq(adj) = 78.4%

This model appears to be close to the first one, in which all seven variables were used. The model interpretation is as we have explained before (see p 7).
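The fit statistics quoted throughout (S, R-Sq, R-Sq(adj)) follow directly from a least-squares fit. The sketch below, on synthetic stand-in data (the coefficients and noise level are loosely modeled on the six-variable equation above, not taken from the actual data set), shows how each is computed:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in data: 30 teams, 6 predictors (3-pointers, free throws,
# turnovers, opponent turnovers, home rebounds, opponent rebounds).
n, k = 30, 6
X = rng.normal(size=(n, k))
true_beta = np.array([0.024, 0.016, -0.063, 0.044, 0.027, -0.031])
y = 0.565 + X @ true_beta + rng.normal(scale=0.07, size=n)

# Least-squares fit with an intercept column prepended.
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

# Goodness-of-fit statistics as reported in Minitab-style output.
resid = y - A @ beta
sse = resid @ resid
sst = ((y - y.mean()) ** 2).sum()
r2 = 1 - sse / sst                                   # R-Sq
r2_adj = 1 - (sse / (n - k - 1)) / (sst / (n - 1))   # R-Sq(adj)
s = (sse / (n - k - 1)) ** 0.5                       # S, std. error of regression
print(f"S = {s:.4f}  R-Sq = {r2:.1%}  R-Sq(adj) = {r2_adj:.1%}")
```

Note that R-Sq(adj) penalizes the residual sum of squares by the degrees of freedom, which is why it can fall when weak predictors are added even though R-Sq never decreases.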
Moreover, with 80.3% of the variability in the system being accounted for by this analysis, we also note that the standard error of the regression is 0.0755690, which is not far from that of the seven-variable model. It is still better than the standard deviation of the explained variable (which is 0.1625). Upon examining the residual plots for each variable, we observed an outlier at team number 7 (Notre Dame) for the team's turnovers per game.
The rest of the plots do not have obvious outliers. Also, the assumption of normality is not violated, since both the histogram and the normal probability plot show a normal distribution.

6.5 The Final Model
When we carried out the multiple regression once again without the outlier we identified, we obtained a still better model. The regression equation is given below (the details are presented in Appendix XII):

Winning percentage = 0.604 + 0.0226 3-point per game + 0.0167 Free throws per game – 0.0660 Turn-over,pg + 0.0420 Opponent Turn-over,pg + 0.0256 Home rebound per game – 0.0292 Oppnt rebound per game

S = 0.0739739   R-Sq = 80.8%   R-Sq(adj) = 78.8%

The interpretation of this model is the same as we have given for the previous models (see p 7).
We note only that the model shows that for each extra turnover per game, the winning percentage should be expected to decrease, and so it is for opponent rebounds per game. On the other hand, 3-pointers per game, free throws per game, opponent turnovers per game and the team's rebounds per game should all be expected to increase a team's chance of winning in this group. We take this model to be the best because, even though the R-Sq value is 80.8% (less than it was when we included all seven variables), the adjusted R-Sq value is 78.8%, the same as that of the first model.
We believe that this takes us closer to perfection than the first model. The second consideration is the value of S: for this model we obtain S = 0.0739739. The implication of this value is that our predictions should fall within about ±2 × 0.0739739 = ±0.1479 of the actual winning percentage. None of the other models took us this close, which further convinces us that this model is the best. The third consideration is that the standard deviation of the predicted variable is even smaller when we exclude the outlier, and the mean is even higher.
The standard deviation was 0.1625 (with mean = 0.5946), but now it is 0.1608 (with mean = 0.5984). We interpret this as an improvement. Since we have excluded one observation from the data, we have a somewhat new data set; therefore we still examine the residual plots. The details are presented in Appendix XII (b). Here we do not have apparent outliers, and the normal probability plot and the histogram both exhibit the normality property, so the normality assumption is satisfied. The predictive power of the model is now indicated by the F-value of 41.99. Before the exclusion of the outlier it was 41.50. Our first model yielded 36.68, and even though the 5-variable model gave us F = 43.21, that model had other setbacks. Also, the high T-statistic values relative to the P-values all confirm our conjecture. The residual plots do not show any definite pattern that we can discern, so we believe that the assumption of homoskedasticity has also been satisfied. We thus settle on this model as our best for predicting the winning percentage of a basketball team in this group of basketball teams, for this particular basketball season.
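The overall F-statistics compared above can be recovered from R-Sq alone, given the number of predictors k and the sample size n. A small sketch of that relationship (the report does not state n explicitly, so the sample sizes below are purely illustrative):

```python
def f_from_r2(r2: float, k: int, n: int) -> float:
    """Overall regression F-statistic implied by R-squared,
    with k predictors and n observations."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))

# Illustrative values only: R-Sq = 80.8% with 6 predictors, for a few
# hypothetical sample sizes (the actual n is not given in this section).
for n in (50, 60, 70):
    print(f"n = {n}: F = {f_from_r2(0.808, 6, n):.2f}")
```

This makes the comparison in the text concrete: for a fixed n, a higher F means the explained variance per predictor is larger relative to the residual variance per degree of freedom.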