Missing Data Imputation : 9 Smart Imputing Techniques for Better Predictive Outcomes

data science Oct 09, 2023

Have you faced missing data in your data science projects? 

Ever wondered if there are effective strategies to handle those pesky missing values in your precious dataset? 

Perhaps you've come across the term "data imputing" in machine learning, and now you're curious about what it actually means and how it can help you fill those data gaps? 

Well, look no further! In this article, we'll delve into the world of missing data imputation: what it is, why it matters, nine techniques for doing it, and when it is better not to impute at all.



Grab a cup of coffee and let's dive in!

What is Data Imputing?

Imputing in machine learning is like filling in the missing pieces of a puzzle.


Imagine you have a dataset with some missing values. 

Here's a small sample dataset with 10 rows, showcasing some missing values:

In this example, we have a dataset with three columns: Age, Height, and Weight. 

You can see that there are missing values represented as "NaN" (Not a Number) in some cells.

These missing values can cause issues when you're trying to train a machine learning model because most machine learning algorithms require complete data to work properly.
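Before imputing anything, it helps to count how many values are missing in each column. The sketch below is a minimal illustration using only Python's standard library. The Age and Height values are the ones quoted later in this article, but the positions of the gaps are assumptions, and the Weight column is left out because its values are not listed.

```python
import math

# Observed values come from the article; the positions of the NaN gaps are assumed.
ages = [25, 30, 40, math.nan, 35, 28, 32, 27, 33, 29]
heights = [160, 165, math.nan, 168, 170, 172, math.nan, 175, 176, 180]


def count_missing(values):
    """Count the entries that are NaN."""
    return sum(1 for v in values if math.isnan(v))


missing_report = {"Age": count_missing(ages), "Height": count_missing(heights)}
print(missing_report)  # {'Age': 1, 'Height': 2}
```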



Data Imputing is the process of estimating or predicting these missing values based on the available information in your dataset.  

By imputing missing data, you can create a complete dataset that can be used to train your machine learning model effectively.

Clear enough on missing data imputation?

Importance of Imputing Missing Values

 Imputing missing values can be important because removing data points with missing values may result in a significant loss of information and reduce the effectiveness of your machine learning model.  

By imputing the missing data, you can make the most of your available data and potentially:

(1) Improve the performance of your model

(2) Make your model more robust and reliable

(3) Increase the statistical power of your analysis, because more observations are retained, which makes it easier to detect real effects

 

9 Smart Imputing Techniques: When to use what?

There are 9 smart techniques for imputing missing values. 

 They are 

  1. Mean,
  2. Median,
  3. Mode,
  4. Fixed Value,
  5. Minimum value or Maximum value,
  6. Linear Regression,
  7. Forward fill (or) Backward fill,
  8. Interpolation,
  9. K Nearest Neighbors

The first 3 methods (mean, median, and mode) are based on summary statistics, and they assume that the missing values are similar to the existing data points.

1. Mean

 Mean is one of the summary statistics methods which assumes that the missing values are similar to the existing data points. 

  • This “Mean imputation strategy” of summary statistics is suitable for numerical data that is roughly symmetric, such as normally distributed data, and has no strong outliers.

Now, let's showcase how “mean imputation strategy” works for filling in the missing values:

 

Let’s take the above example and notice that there is one missing value for the “Age” column.

For the missing “Age” value, we calculate the mean of the nine available Age values (25, 30, 40, 35, 28, 32, 27, 33, 29). Their sum is 279, so the mean is 279 / 9 = 31.

Here the mean happens to be a whole number. If it were not, we would round it, since it is better to represent age as an integer value.

Note that the missing “Age” value is replaced with the “mean” value, which is 31.
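The mean imputation described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library. The nine observed ages come from the example, while the position of the missing entry and the helper name `impute_mean` are assumptions made for the sketch.

```python
import math
import statistics

# Nine observed ages from the example; the position of the gap is assumed.
ages = [25, 30, 40, math.nan, 35, 28, 32, 27, 33, 29]


def impute_mean(values):
    """Replace every NaN with the mean of the observed (non-NaN) values."""
    observed = [v for v in values if not math.isnan(v)]
    fill = statistics.mean(observed)
    return [fill if math.isnan(v) else v for v in values]


ages_imputed = impute_mean(ages)
print(ages_imputed[3])  # 31
```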

2. Median

 Median is the second of the summary statistics methods which assumes that the missing values are similar to the existing data points. 

This “Median imputation strategy” of summary statistics is suitable when the data contains outliers or is skewed.

When the distribution of the variable is not symmetric, median imputation can be more robust than mean imputation.

For the missing Height value, we calculate the median of the available Height values (160,165,168,170,172,175,176,180), which is 171. 


Median is nothing but the middle value when the numbers are sorted in increasing order.



Here since the number of elements is even, both 170 & 172 are middle numbers.  When you take the average of both, you can come up with 171 as the median value. 

Therefore, we replace the missing Height value with 171.

Note that the missing “height” values are replaced with the “median” value, which is 171.
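The same idea works for the median. The sketch below uses the eight observed heights from the example. Where the two gaps sit in the column is an assumption, as is the helper name `impute_median`.

```python
import math
import statistics

# Eight observed heights from the example; the positions of the two gaps are assumed.
heights = [160, 165, math.nan, 168, 170, 172, math.nan, 175, 176, 180]


def impute_median(values):
    """Replace every NaN with the median of the observed (non-NaN) values."""
    observed = [v for v in values if not math.isnan(v)]
    fill = statistics.median(observed)  # averages the two middle values when the count is even
    return [fill if math.isnan(v) else v for v in values]


heights_imputed = impute_median(heights)
print(heights_imputed[2], heights_imputed[6])  # 171.0 171.0
```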


3. Mode

 Mode is the last of the summary statistics methods which assumes that the missing values are similar to the existing data points. 

The “Mode imputation strategy” of summary statistics is suitable when the data contains categorical or nominal variables.

When you deal with variables with a high frequency of a particular value, imputing the mode can be a reasonable approach.

In this example, for the missing “Gender” value, we calculate the “mode” of the available “Gender” values, which is “Male” (since there are more “Male” records than “Female” records). Hence, we replace the missing “Gender” value with “Male”.

Note that the missing “Gender” value is replaced with the “mode” value, which is “Male”.
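Mode imputation for a categorical column can be sketched as follows. The Gender records here are hypothetical, chosen only so that “Male” is the most frequent value as in the example, and `None` stands in for a missing entry.

```python
import statistics

# Hypothetical Gender column; None marks the missing entry.
genders = ["Male", "Female", "Male", None, "Male", "Female"]


def impute_mode(values):
    """Replace every None with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    fill = statistics.mode(observed)
    return [fill if v is None else v for v in values]


genders_imputed = impute_mode(genders)
print(genders_imputed[3])  # Male
```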

Note that mean, median, and mode imputation are simple approaches and may not always be the most accurate or appropriate methods, depending on the dataset and context.

They also assume that the values are missing at random or missing completely at random.

In scenarios where the missingness is not random, or where the missing values carry important information, more advanced imputation methods such as regression-based imputation are typically used.

4. Fixed Value

This technique imputes missing data with a predefined constant (fixed) value, such as zero or a certain phrase. It is useful when a domain expert knows which constant is the right replacement.

For example, suppose the domain expert notices that the “Gender” value is missing only for customers aged 40 or above, and knows that there aren’t any female customers at or above age 40. The expert then decides to replace the missing data with the constant value “Male”.
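A constant fill is the simplest technique to express in code. In this sketch the customer records are hypothetical, and the fill value “Male” mirrors the domain expert's decision in the example above.

```python
# Hypothetical (age, gender) records; None marks a missing gender.
customers = [(35, "Female"), (42, None), (51, "Male"), (47, None)]


def impute_constant(values, fill_value):
    """Replace every None with the given constant."""
    return [fill_value if v is None else v for v in values]


genders_filled = impute_constant([gender for _, gender in customers], "Male")
print(genders_filled)  # ['Female', 'Male', 'Male', 'Male']
```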

5. Minimum value or Maximum value

This technique imputes missing data with either the minimum or the maximum observed value in the dataset, depending on the context and requirements.

Suppose students in a classroom take a test, and your task is to evaluate whether each student’s score passes a threshold.

And you notice some scores are missing…

If you decide to fill in the missing scores with the lowest observed score (minimum value), you're making an assumption based on the worst outcome.

In this example, we replace the missing value with the lowest score, which is 55.

If you decide to fill in the missing scores with the highest observed score (maximum value), you're making an assumption based on the best outcome.

In this example, we replace the missing value with the highest score, which is 100.
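Both variants can be sketched with one small helper. The score list below is hypothetical. It is only constructed so that the lowest observed score is 55 and the highest is 100, matching the example.

```python
# Hypothetical test scores; None marks a missing score.
scores = [78, 55, None, 92, 100, None, 67]


def impute_extreme(values, use_max=False):
    """Replace every None with the minimum (or maximum) observed value."""
    observed = [v for v in values if v is not None]
    fill = max(observed) if use_max else min(observed)
    return [fill if v is None else v for v in values]


worst_case = impute_extreme(scores)               # pessimistic assumption
best_case = impute_extreme(scores, use_max=True)  # optimistic assumption
print(worst_case[2], best_case[2])  # 55 100
```

Running the threshold check on both versions shows how sensitive the conclusion is to the missing scores.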

6.  Linear Regression (Regression-based imputation)

As discussed, when data is not missing at random, you may need advanced techniques such as regression-based imputation. This strategy uses regression models to predict the missing values based on the other features in your dataset.

For example, if you have a dataset with information about a person's age, gender, and income, you can train a regression model to predict the missing age values based on the available gender and income information.
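Here is a minimal sketch of regression-based imputation. To keep it short, it is a simplification of the example above: it predicts age from income alone (gender is left out), fits the line with ordinary least squares written by hand, and uses hypothetical data. The helper names `fit_line` and `impute_by_regression` are also assumptions.

```python
# Hypothetical data: income in thousands, and age with one missing value (None).
incomes = [20, 40, 60, 80, 100]
ages = [25, 30, None, 40, 45]


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def impute_by_regression(xs, ys):
    """Fit on rows where y is known, then predict y for rows where it is missing."""
    known = [(x, y) for x, y in zip(xs, ys) if y is not None]
    intercept, slope = fit_line([x for x, _ in known], [y for _, y in known])
    return [intercept + slope * x if y is None else y for x, y in zip(xs, ys)]


ages_imputed = impute_by_regression(incomes, ages)
print(ages_imputed[2])  # 35.0
```

With more than one predictor, a library regression model would normally replace the hand-written fit.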


It’s essential to acknowledge another advanced technique known as multiple imputation. This method is particularly powerful when dealing with complex datasets. Multiple imputation involves creating multiple imputed datasets to estimate missing values more robustly. 

By generating several complete versions of the dataset with different imputed values, the uncertainty associated with the missing data can be better captured. This technique enhances the reliability of the imputed values and results in more accurate and stable model outcomes.

7.  Forward fill (or) Backward fill

These techniques involve propagating the last known value forward or backward in a time series or ordered dataset to fill missing values.

Think of a relay race where a baton is passed between runners. If one runner's position was never recorded (missing value), you can either carry over the last recorded position from the runner before (forward fill) or borrow the next recorded position from the runner after (backward fill).

In this example, the value is imputed using forward fill. The previous value “60” is carried forward to replace the missing value.

In this example, the value is imputed using backward fill. The next value “70” is carried backward to replace the missing value.
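Forward fill and backward fill can be sketched as below. The series is hypothetical except for the neighbours 60 and 70 from the example. Note one design choice in this sketch: a gap at the very start has nothing to carry forward, so forward fill leaves it as `None` (and likewise for a trailing gap with backward fill).

```python
# Hypothetical ordered series; 60 and 70 surround the gap as in the example.
series = [50, 60, None, 70, 80]


def forward_fill(values):
    """Carry the last observed value forward into each gap."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def backward_fill(values):
    """Carry the next observed value backward: a forward fill on the reversed list."""
    return forward_fill(values[::-1])[::-1]


print(forward_fill(series))   # [50, 60, 60, 70, 80]
print(backward_fill(series))  # [50, 60, 70, 70, 80]
```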

8.  Interpolation

Interpolation estimates missing values by considering the relationship between data points, often using mathematical methods to infer values between observed data points.

Imagine you have a series of checkpoints along a winding road, but some markers are missing. 

Interpolation is like you estimating the position of the missing markers by considering the distances and curves between the known ones, similar to estimating your location on a map between two known landmarks.


In this example, the known values immediately before and after the missing value are 60 and 94. With one missing value between them, there are two equal steps from 60 to 94, so the step size is (next value - previous value) / number of steps = (94 - 60) / 2 = 34 / 2 = 17.

Then you add 17 to the previous value, which gives you 60 + 17 = 77.

So you replace the missing value with 77.
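The linear interpolation worked through above generalizes to gaps of any length: divide the difference between the surrounding known values by the number of steps, then add one step per position. The sketch below is one possible implementation. The values 55 and 100 around the example's 60 and 94 are hypothetical, and gaps at the very start or end are left as `None` because they have only one known neighbour.

```python
def interpolate_linear(values):
    """Fill runs of None by spacing values evenly between the known neighbours."""
    result = list(values)
    n = len(result)
    i = 0
    while i < n:
        if result[i] is None:
            start = i - 1          # index of the last known value before the gap
            end = i
            while end < n and result[end] is None:
                end += 1           # index of the first known value after the gap
            if start >= 0 and end < n:
                step = (result[end] - result[start]) / (end - start)
                for j in range(i, end):
                    result[j] = result[start] + step * (j - start)
            i = end
        else:
            i += 1
    return result


print(interpolate_linear([55, 60, None, 94, 100]))  # [55, 60, 77.0, 94, 100]
```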

9.  K Nearest Neighbors

This technique identifies the 'k' nearest data points to the one with missing values and imputes the missing values based on the values of its nearest neighbors.

Picture a neighborhood where houses represent data points. If you want to estimate the missing price value for one house, you look at the price values of the nearest houses (neighbors) to get an idea of what the missing value might be, assuming houses in close proximity are similar.
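The house-price analogy can be turned into a small sketch. Everything here is hypothetical: the houses, the two features, the choice of k = 3, and the helper name `knn_impute`. The sketch uses plain Euclidean distance on unscaled features. In practice, features on very different scales are usually normalized first so that one feature does not dominate the distance.

```python
import math
import statistics

# Hypothetical houses: (size in square metres, distance to centre in km, price in thousands).
# None marks the missing price.
houses = [
    (50, 2.0, 200),
    (55, 2.5, 210),
    (60, 2.2, 220),
    (120, 10.0, 400),
    (130, 12.0, 420),
    (58, 2.1, None),
]


def knn_impute(rows, k=3):
    """Fill a missing last column with the mean of the k nearest complete rows."""
    known = [r for r in rows if r[-1] is not None]
    imputed = []
    for row in rows:
        if row[-1] is not None:
            imputed.append(row)
            continue
        # Rank complete rows by Euclidean distance on the feature columns.
        neighbors = sorted(known, key=lambda other: math.dist(row[:-1], other[:-1]))[:k]
        estimate = statistics.mean([n[-1] for n in neighbors])
        imputed.append(row[:-1] + (estimate,))
    return imputed


imputed_houses = knn_impute(houses)
print(imputed_houses[-1])  # (58, 2.1, 210)
```

The three nearest houses are the small ones close to the centre, so the two large, distant houses do not influence the estimate.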



When Imputing Missing Values May Not Be Appropriate

It is also important to note that, in certain cases, imputing missing data may not be appropriate, and simply leaving the missing values as they are may be more helpful.

Let’s see in what cases imputing data is not appropriate.

  • Significant amount of missing values: If a large portion of your dataset has missing values, data imputing may result in a substantial distortion of the original data.

  • Missing values are informative: Sometimes, the fact that a value is missing can itself convey meaningful information.

    Let's consider a survey dataset that collects information about people's income, education level, and job satisfaction. In this dataset, there is a column indicating whether the respondents have a side business or not.

    However, some respondents did not provide an answer for this column, resulting in missing data.

    In this scenario, the missing values for the "side business" column may actually be informative.



    It could indicate that the respondents either don't have a side business or they are reluctant to disclose that information.

    The missing values might reflect a certain characteristic or behavior of the respondents, such as being risk-averse or having a preference for privacy.

    In this case too, it is better to leave the missing values as they are instead of using imputing techniques.

  • Imputation introduces unrealistic patterns: Imputing missing data can introduce artificial patterns or relationships that don't exist in the actual data. 

    This can happen when you impute based on unrelated variables or when you use a simplistic imputation method.

    Let's consider a dataset that includes information about customers, such as their age, income, and purchase history. In this dataset, the “income” column has some missing values.

    Let’s say that, to impute these missing data, you use a simplistic imputation method, such as filling in the missing values with the mean income of the entire dataset.

    However, this method ignores how income relates to the other variables in the dataset.

    Because you are imputing these missing values with the “mean” income, you may introduce artificial patterns or relationships that don't reflect the true nature of the data.

    The method assumes that every missing value should equal the average income, which may not be accurate.

    It may be the case that the missing values in the income column predominantly belong to younger customers.

    Imputing their incomes with the overall mean then artificially inflates the income values for that particular age group. This can create a false relationship between age and income, suggesting that younger customers have higher incomes than they really do, when the pattern is only a result of the data imputation process.



    Such artificial patterns can lead to misleading conclusions and impact downstream analysis or modeling.

    In such cases, it's important to be cautious and evaluate the impact of data imputation on the validity of the results.
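The age and income scenario above can be checked with a small experiment. The customer records in this sketch are hypothetical, built so that the missing incomes all belong to customers under 30. It compares the average income of that group before and after filling the gaps with the overall mean.

```python
import statistics

# Hypothetical customers as (age, income in thousands); None marks a missing income.
customers = [
    (22, 30), (24, None), (25, 32), (23, None),
    (45, 80), (50, 90), (48, 85), (52, 95),
]

# Overall mean of the observed incomes, used as the fill value.
overall_mean = statistics.mean([inc for _, inc in customers if inc is not None])
imputed = [(age, overall_mean if inc is None else inc) for age, inc in customers]


def mean_income_under_30(rows):
    """Average the known incomes of customers younger than 30."""
    return statistics.mean([inc for age, inc in rows if age < 30 and inc is not None])


young_before = mean_income_under_30(customers)  # based on observed values only
young_after = mean_income_under_30(imputed)     # after overall-mean imputation
print(young_before, round(young_after, 2))
```

In this made-up data the average income of the younger group rises noticeably after imputation, even though no new information was collected. Comparing group statistics before and after imputing is one simple way to evaluate its impact.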

Conclusion

Let's wrap up what we've learned about missing data imputation:

  • Understanding Data Imputation: It's like completing a puzzle by adding missing pieces to a dataset.
  • Why It's Important: Filling in missing data is super important for making sure machine learning models work well and are strong.
  • Different Ways to Fill Data: We've explored nine smart methods to do this, each good for different situations.
  • Using Strategies Smartly: Like using averages or regression based on what the data looks like and how it behaves.
  • Lots of Techniques: We've also seen fixed values, filling in based on nearby data, and other methods to handle missing info.
  • When to Be Careful: Sometimes, it's not a good idea to fill in missing data, especially when the gaps are huge or when missing info tells us something important.

By using these methods wisely, you can make your machine learning model stronger by filling in missing data the right way.


 


Question For You 

When is it appropriate to use mean imputation for missing values in a dataset?

Let me know in the comments whether the correct answer is A, B, C, or D.

A) When the missing values are missing completely at random (MCAR)

B) When the missing values are related to other variables in the dataset

C) When a significant proportion of the dataset is missing

D) When the missing values are part of a categorical feature

 
