Want to learn more? Take the full course at https://learn.datacamp.com/courses/ma... at your own pace. More than a video, you'll learn hands-on coding & quickly apply skills to your daily work.
---
There are several goodness-of-fit measures used to judge a model's fit. One is the so-called coefficient of determination, or Multiple R-squared.
The value of Multiple R-squared gives the proportion of the dependent variable's variance that is explained by the regression model. Hence, if R^2 equals 0, none of the variation is explained; an R^2 of 1 corresponds to a model that explains 100% of the dependent variable's variation. (The Adjusted R-squared additionally corrects this proportion for the number of variables in the model.) In general, I want my R^2 to be as high as possible, but values above 0.9 are rarely reached.
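To see where these numbers live in R, here is a minimal sketch; the toy data and model are stand-ins for the course objects, which are not shown in this clip:

```r
# Runnable toy stand-in for the fitted model discussed above.
set.seed(1)
d <- data.frame(margin2 = rnorm(100), margin1 = rnorm(100))
fit <- lm(margin2 ~ margin1, data = d)

summary(fit)$r.squared      # Multiple R-squared
summary(fit)$adj.r.squared  # Adjusted R-squared
```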
The F-test is a test for the overall fit of the model. Its null hypothesis is that all slope coefficients are zero, which is equivalent to an R^2 of 0; rejecting it means that at least one regressor (or a set of regressors) has significant explanatory power. In our model, the p-value of the F-test is smaller than 0.05, hence the hypothesis of an R^2 of zero is rejected: the variables included in the model explain some variation of the margin in year 2.
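summary() reports this F-test in its last line; if needed, the p-value can also be recomputed from the stored statistic, as this sketch (again on a toy model) shows:

```r
# The overall F-test of a fitted lm, recomputed by hand.
set.seed(1)
d <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))
fit <- lm(y ~ x1 + x2, data = d)

fstat <- summary(fit)$fstatistic
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)
```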
So far I have considered only in-sample goodness-of-fit measures, that is to say, the model is evaluated on the same data that it was fitted on. This bears the risk of overfitting. Overfitting occurs when the model captures not only the relation between the variables - shown in blue - but also the noise in the errors - shown in red. The model then performs great when predicting the dataset it has been fitted on, but its prediction results on new data are poor. The linear model, at first glance, looks like it does not fit well, but for predictions it will be superior to the more complicated model shown in red.
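This effect is easy to reproduce; the following sketch (my own illustration, not from the course) contrasts a simple linear fit with a deliberately wiggly polynomial:

```r
# Overfitting in miniature: a degree-15 polynomial chases the
# training noise, while the linear model does not.
set.seed(1)
x <- runif(30); y <- 2 * x + rnorm(30, sd = 0.5)
lin <- lm(y ~ x)
wig <- lm(y ~ poly(x, 15))

# In-sample, the wiggly model necessarily looks better ...
c(linear = summary(lin)$r.squared, poly15 = summary(wig)$r.squared)

# ... but on fresh data its predictions are typically far worse.
xnew <- runif(30); ynew <- 2 * xnew + rnorm(30, sd = 0.5)
c(linear = mean((ynew - predict(lin, data.frame(x = xnew)))^2),
  poly15 = mean((ynew - predict(wig, data.frame(x = xnew)))^2))
```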
There are several ways to avoid overfitting. One is to keep your model lean. Some goodness-of-fit measures (for example, the AIC) penalize every additional explanatory variable, so that you can control for overfitting while developing a model. When comparing two models, the one with the lower AIC is preferred.
In R you can find the AIC value using the function AIC() from the stats package. Note that here, since I am not comparing models to each other, I cannot draw any conclusions from an AIC of 33950.45.
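A short sketch of the comparison workflow, with toy models standing in for the course's:

```r
# stats::AIC accepts several fitted models and reports one AIC each;
# the lower value indicates the preferred model.
set.seed(1)
d <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))
lean <- lm(y ~ x1, data = d)
full <- lm(y ~ x1 + x2, data = d)
AIC(lean, full)
```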
Automatic model selection can be done using stepAIC() from the MASS package. More on that in the next chapter on logistic regression.
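As a preview, a minimal stepAIC() call might look like this; the toy model is again a stand-in:

```r
# Stepwise selection with MASS::stepAIC, starting from a full model.
library(MASS)
set.seed(1)
d <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))
full <- lm(y ~ x1 + x2, data = d)
stepAIC(full, direction = "both", trace = FALSE)
```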
Other methods to avoid overfitting, like out-of-sample model validation or cross-validation, are also explained in depth in the chapter on logistic regression.
Let's turn to prediction. So far I have used explanatory variables from year one in order to explain variation in the margin of year two. Now, I will use explanatory variables from year two in order to predict the margin of year three. Therefore, I will make use of the new dataset clvData2.
Now, the prediction is fairly easy. I just hand the model multipleLM2 and our new dataset clvData2 over to the predict() function. If I store the predictions in a vector, such as predMargin, I can use them for further analysis, like calculating a mean, for example.
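In code, the step looks roughly like this; multipleLM2 and clvData2 are the course objects, so toy stand-ins are fitted here to keep the sketch runnable:

```r
# Predicting new data with a fitted lm, then summarising the result.
set.seed(1)
train <- data.frame(margin = rnorm(50), revenue = rnorm(50))
clvData2 <- data.frame(revenue = rnorm(50))        # stand-in new data
multipleLM2 <- lm(margin ~ revenue, data = train)  # stand-in model

predMargin <- predict(multipleLM2, newdata = clvData2)
mean(predMargin, na.rm = TRUE)
```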
Perfect, you made it through the whole chapter! Check out what you have learned.
And don't forget to practice.
#DataCamp #RTutorial #MarketingAnalyticsinR #StatisticalModeling #ModelValidation #ModelFitandPrediction