R tutorial: Introducing out-of-sample error measures


Learn more about machine learning with R: https://www.datacamp.com/courses/mach...

Hi! I'm Zach Deane-Mayer, and I'm one of the co-authors of the caret package. I have a passion for data science and spend most of my time working on and thinking about problems in machine learning.

This course focuses on predictive, rather than explanatory modeling. We want models that do not overfit the training data and generalize well. In other words, our primary concern when modeling is "do the models perform well on new data?"

The best way to answer this question is to test the models on new data. This simulates real world experience, in which you fit on one dataset, and then predict on new data, where you do not actually know the outcome.

Simulating this experience with a train/test split helps you make an honest assessment of yourself as a modeler.

This is one of the key insights of machine learning: error metrics should be computed on new data, because in-sample validation (or predicting on your training data) essentially guarantees overfitting.

Out-of-sample validation helps you choose models that will continue to perform well in the future.

This is the primary goal of the caret package in general and this course specifically: don’t overfit. Pick models that perform well on new data.

Let's walk through a simple example of out-of-sample validation: We start with a linear regression model, fit on the first 20 rows of the mtcars dataset.
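As a sketch of that first step, assuming mpg is modeled as a function of horsepower (the narration doesn't name the predictors), the fit might look like this:

    # Train on the first 20 rows of mtcars only
    train <- mtcars[1:20, ]
    model <- lm(mpg ~ hp, data = train)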

Next, we make predictions with this model on a NEW dataset: the last 12 observations of the mtcars dataset. The 12 cars in this test set were not used to determine the coefficients of the linear regression model, and are therefore a good test of how well we can predict on new data.
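Continuing the sketch above, the predictions on the held-out rows could be generated like so:

    # Hold out the last 12 rows as a test set the model has never seen
    test <- mtcars[21:32, ]
    predictions <- predict(model, newdata = test)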

In practice, rather than manually splitting the dataset, we'd use the createResample or createFolds function in caret, but the manual split keeps this example simple.
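For reference, a caret-based split might look something like this (a sketch; the fold count is illustrative):

    library(caret)
    # Split the rows into 5 folds of indices, stratified on the outcome
    folds <- createFolds(mtcars$mpg, k = 5)
    str(folds)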

Finally, we calculate root-mean-squared-error (or RMSE) on the test set by comparing the predictions from our model to the actual MPG values for the test set.
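Using the objects from the sketch above, the out-of-sample RMSE could be computed as:

    # Compare predictions to the actual mpg values in the test set
    actual <- test$mpg
    rmse <- sqrt(mean((predictions - actual)^2))
    rmse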

RMSE is a measure of the model's average prediction error. It is expressed in the same units as the outcome variable, miles per gallon, so our model is off by 5 to 6 miles per gallon on average.

Compared to the in-sample RMSE from a model fit on the full dataset, our model is significantly worse.
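For comparison, the in-sample RMSE could be computed like this (again a sketch, assuming the same hp-only formula):

    # Fit on all 32 rows, then predict on those same rows (in-sample)
    full_model <- lm(mpg ~ hp, data = mtcars)
    in_sample_pred <- predict(full_model, newdata = mtcars)
    sqrt(mean((in_sample_pred - mtcars$mpg)^2))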

If we had used in-sample error, we would have fooled ourselves into thinking our model is much better than it actually is.

It's hard to make predictions on new data, as this example shows. Out-of-sample error helps account for this fact, so we can focus on models that predict things we don't already know.

Let's practice this concept on some example data.
