Data sampling

This is a crucial step in model development. Never develop a score without first splitting the data into separate samples. The reason is a concept called overfitting.

Overfitting: Fitting the model to the training sample so closely that the dependent (target) variable is predicted almost perfectly on that sample. The coefficients of the variables are forced to fit the training data, including its outliers, noise and other deficiencies, rather than the underlying relationship. As a result, the same model will fit poorly when you apply it to another sample.
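
To make the definition concrete, here’s a minimal sketch on simulated data. Everything in it (the data, the `rmse` helper, the choice of a degree-10 polynomial) is my own assumption for illustration, not part of a real scorecard. It fits a simple model and a needlessly flexible one on the same training sample and compares their errors on the training sample and on a fresh sample.

```r
# Simulated data (assumption): y depends linearly on x, plus noise
set.seed(123)
train <- data.frame(x = runif(30))
train$y <- 2 * train$x + rnorm(30, sd = 0.5)
test <- data.frame(x = runif(30))
test$y <- 2 * test$x + rnorm(30, sd = 0.5)

# Hypothetical helper: root mean squared error of a fitted model on a data set
rmse <- function(fit, data) sqrt(mean((data$y - predict(fit, newdata = data))^2))

simple  <- lm(y ~ x, data = train)
complex <- lm(y ~ poly(x, 10), data = train)  # far more flexible than the data needs

# Errors on the sample the models were fitted on
c(train_simple = rmse(simple, train), train_complex = rmse(complex, train))
# Errors on a sample the models have never seen
c(test_simple = rmse(simple, test), test_complex = rmse(complex, test))
```

The flexible model can never have a higher training error than the simple one, because it contains the simple model as a special case. Its error on the fresh sample usually does not improve in the same way, and that gap is what overfitting looks like in practice.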

To guard against this, a method called k-fold cross-validation is commonly used. It splits the total sample into “k” subsamples (folds); the model is trained on k-1 folds and tested on the remaining one, rotating until every fold has served as the test set once, and the model that performs best across the folds is chosen. I’ll show a simpler version here: splitting the data into one training and one validation sample (strictly speaking a holdout split rather than a full 2-fold cross-validation). It’s important that the split is random and that the samples do not overlap.
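
For reference, the full k-fold procedure could look roughly like the sketch below in base R. The simulated `MyData`, the target name `bad`, the predictors `x1` and `x2`, the logistic model and the 0.5 cutoff are all assumptions made up for this example.

```r
# Simulated stand-in for MyData (assumption): binary target `bad`, two predictors
set.seed(42)
n <- 500
MyData <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
MyData$bad <- rbinom(n, 1, plogis(-1 + 0.8 * MyData$x1))

k <- 5
# Randomly assign every row to one of k folds of (nearly) equal size
fold <- sample(rep(1:k, length.out = n))

accuracy <- numeric(k)
for (i in 1:k) {
  train <- MyData[fold != i, ]   # k-1 folds for training
  test  <- MyData[fold == i, ]   # the remaining fold for testing
  fit   <- glm(bad ~ x1 + x2, data = train, family = binomial)
  p     <- predict(fit, newdata = test, type = "response")
  # Share of correct classifications at an illustrative 0.5 cutoff
  accuracy[i] <- mean((p > 0.5) == (test$bad == 1))
}
mean(accuracy)  # average performance across the k folds
```

Accuracy is used here only because it is short to compute; any performance measure can be averaged across the folds in the same way.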

One recommendation here: while developing and validating on the training and validation samples, keep a third, test sample aside. This test sample is often taken from a different time period than the training and validation samples, while still representing a similar population. In that case, the test sample is called an “out-of-time validation sample”.
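
If your data carries a date, an out-of-time sample can be carved out before any random splitting. The sketch below assumes a hypothetical `ApplicationDate` column, a made-up cutoff date and a 75/25 random split of the remaining data; adjust all of them to your own data.

```r
# Simulated stand-in for MyData (assumption): two years of applications
set.seed(1)
MyData <- data.frame(
  ApplicationDate = as.Date("2015-01-01") + sample(0:729, 1000, replace = TRUE),
  x1 = rnorm(1000)
)

cutoff <- as.Date("2016-07-01")  # illustrative cutoff, not a recommendation

# Everything on or after the cutoff is held back as the out-of-time sample
OutOfTimeData <- MyData[MyData$ApplicationDate >= cutoff, ]
DevData       <- MyData[MyData$ApplicationDate <  cutoff, ]

# The earlier period is split randomly into training and validation
idx <- sample(nrow(DevData), floor(nrow(DevData) * 0.75))
TrainingData   <- DevData[idx, ]
ValidationData <- DevData[-idx, ]
```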

The split percentages are entirely up to the modeller; a common choice is 60% training, 20% validation and 20% test. Here’s R code to do so:

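   # Select the training / validation / test samples (60% / 20% / 20%)
   set.seed(123)  # any fixed seed makes the random split reproducible
   d <- sort(sample(nrow(MyData), floor(nrow(MyData) * 0.6)))
   TrainingData <- MyData[d, ]
   OtherData <- MyData[-d, ]
   # Split the remaining 40% in half: validation and test
   d2 <- sort(sample(nrow(OtherData), floor(nrow(OtherData) * 0.5)))
   ValidationData <- OtherData[d2, ]
   TestData <- OtherData[-d2, ]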

The next steps are training the models on the training sample and then applying them to the validation sample. There are ways to avoid overfitting even while developing the model on the training sample, but we’ll come to those in the modelling section. The choice of the best model depends on its separation power not only on the training sample but also, and more importantly, on the validation sample.
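
As a rough illustration of that comparison, the sketch below measures separation power with the Gini coefficient (2 * AUC - 1) on both samples. The choice of Gini, the helper functions `auc` and `gini`, the simulated data and the variable names are my own assumptions for the example; other separation measures can be compared the same way.

```r
# Rank-based AUC (equivalent to the Mann-Whitney statistic)
auc <- function(score, y) {
  r  <- rank(score)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
gini <- function(score, y) 2 * auc(score, y) - 1

# Simulated stand-in for MyData (assumption): binary target `bad`
set.seed(7)
n <- 1000
MyData <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
MyData$bad <- rbinom(n, 1, plogis(-1.5 + MyData$x1))

# Random 60/40 split into training and validation
d <- sample(nrow(MyData), floor(nrow(MyData) * 0.6))
TrainingData   <- MyData[d, ]
ValidationData <- MyData[-d, ]

# Fit on the training sample only, then score both samples
fit <- glm(bad ~ x1 + x2, data = TrainingData, family = binomial)
gini_train <- gini(predict(fit, newdata = TrainingData), TrainingData$bad)
gini_valid <- gini(predict(fit, newdata = ValidationData), ValidationData$bad)
c(training = gini_train, validation = gini_valid)
```

A validation Gini that sits far below the training Gini is a warning sign of overfitting; when comparing candidate models, I would give more weight to the validation figure.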

Note: The terms “test” and “validation” are used differently, and sometimes interchangeably, across the literature.

