In-Sample vs Out-of-Sample: The Secret to Building Models that Can Predict the Future

data science Oct 03, 2023

Have you ever heard of in-sample and out-of-sample in Data Science and felt confused about what it really means? 


You might have seen data scientists splitting their data into training data and testing data, and wondered why they call these in-sample and out-of-sample. 

If that's the case, keep reading to learn more about the ins and outs of “in-sample” vs “out-of-sample” concepts.

Let's say you're a teacher preparing a lesson plan for your class. 

The part of the material you've covered in class and used to teach your students is like the in-sample data. 

On the other hand, any similar material you haven't covered in class and haven't used to teach your students is like out-of-sample data.

Simple enough? With this understanding, let's explore the following in this article.

 

What is in-sample?

In data analysis or modeling, the in-sample data is what you use to build or train your machine learning model. 

Statistically speaking, the in-sample data is simply the sample your model's parameters are estimated from, so it directly shapes everything the model learns.
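
To make this concrete, here is a minimal sketch in Python (assuming scikit-learn is installed; the numbers and variable names are made up purely for illustration) of a model being trained on in-sample data:

from sklearn.linear_model import LinearRegression

# Toy in-sample data: these are the only rows the model gets to learn from
X_train = [[1], [2], [3], [4]]   # features
y_train = [2, 4, 6, 8]           # targets

model = LinearRegression()
model.fit(X_train, y_train)      # training happens only on the in-sample data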

What is out-of-sample?

After building a machine learning model using the in-sample data, you need to test it to make sure it can make accurate predictions on new, unseen data.

This is similar to testing your students' ability to answer questions that were not exactly covered in class. It assesses whether they can generalize their understanding to unseen questions, based on the knowledge they gained from your teaching.

This is where the out-of-sample data comes in. The out-of-sample data is the set of data that the model has never been exposed to during training. 
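
Continuing the toy sketch from above (again, just an illustrative example with made-up numbers), the out-of-sample data is a handful of rows the model never saw while it was being fit:

# Out-of-sample data: the model never saw these rows during training
X_test = [[5], [6]]
y_test = [10, 12]

predictions = model.predict(X_test)             # predictions on unseen inputs
out_of_sample_r2 = model.score(X_test, y_test)  # R^2 score on out-of-sample data
print(predictions, out_of_sample_r2)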

Why do we need in-sample vs out-of-sample?

Remember that high bias (underfitting) is not desirable. Similarly, high variance (overfitting) is also not desirable, right?

So you need to apply your model to out-of-sample data to see how well it performs and to determine whether it is overfitting or generalizing well to new, unseen data. 

By testing the model on out-of-sample data, you can assess its ability to make accurate predictions for real-world scenarios.

In practice, random sampling is what defines these two subsets: you randomly split your dataset, keeping most of it for training (in-sample) and holding out the rest for testing (out-of-sample).
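
One common way to create these subsets (not the only one, but a typical starting point) is a random train/test split. Here is a quick sketch, assuming scikit-learn and using one of its built-in toy datasets:

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)   # any tabular dataset would work here

# Randomly hold out 20% of the rows as out-of-sample (test) data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)   # X_train/y_train are in-sample, X_test/y_test are out-of-sample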

In-sample vs out-of-sample in simple terms

So basically, in-sample data is the part of your data that your model is familiar with, while out-of-sample data is the part your model has never seen before. 

When is a model considered to be generalizing well?

If a student scores well on the topics you have covered but performs poorly on similar topics you have not covered, it means that student has merely memorized the material. In other words, that student is not generalizing the knowledge they gained, right?

Similarly, if your model performs well only on in-sample data and poorly on out-of-sample (unseen) data, then it's probably overfitting and not generalizing well. It won't be as useful for making accurate predictions on new data.

Having said that, a model that performs well on both in-sample and out-of-sample data is considered to have generalized the underlying patterns in the data well. 
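
Here is a rough sketch of that check in code, building on the split above (the "gap" rule of thumb in the comment is just an illustration, not a hard rule):

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)

in_sample_score = model.score(X_train, y_train)    # performance on data the model has seen
out_of_sample_score = model.score(X_test, y_test)  # performance on unseen data

print(f"In-sample R^2:     {in_sample_score:.2f}")
print(f"Out-of-sample R^2: {out_of_sample_score:.2f}")
# A much higher in-sample score than out-of-sample score usually hints at overfitting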

Clear enough!

Conclusion

Let's review the main points we covered about in-sample and out-of-sample concepts in data science:

  • Understanding In-Sample and Out-of-Sample: In-sample data is like the material you cover in class and use to teach, while out-of-sample data is like similar material you haven't covered.
  • Defining In-Sample Data: In-sample data is the information used to build or train machine learning models. It is what the model's parameters (and summary statistics such as the sample mean) are estimated from.
  • Defining Out-of-Sample Data: Out-of-sample data is what we use to test how well the model works on new, unseen information. It helps ensure the model performs accurately in practice.
  • Importance of In-Sample vs Out-of-Sample: It's crucial to check how well the model performs on out-of-sample data to make sure it isn't just memorizing the training data (overfitting). This is similar to a student being able to handle questions they haven't seen before.
  • Model Generalization Analysis: Models that perform well on both in-sample and out-of-sample data are considered good at generalizing the underlying patterns. This makes them reliable in new situations.

Question for you

Imagine you're testing Tesla's self-driving mode in a factory built for testing: an indoor, protected environment (the kind generally used for training). 

Now, you take the same Tesla car and drive to your grocery store in self-driving mode. Would the data collected during this drive to the grocery store be considered in-sample or out-of-sample data in machine learning?

Let me know in the comments!
