7 Must-Know Feature Importance Methods in Machine Learning: From Model-Agnostic to Model-Dependent


When baking a cake, not all ingredients are equally important..

Like how Cream Cheese acts as an important ingredient for a cheesecake..

And like how cocoa powder acts as an important ingredient for a brownie..

Some are stars of the show, while others are not..

I’m sure you’re with me on this one..

Likewise in the world of machine learning, some features (columns/fields) are more important than other features for predictions…

Understanding which features matter the most can be a game-changer for your machine learning model…

 Why is that?

 Keep reading and we will explore more of this!

I'll also walk you through seven feature importance methods that every aspiring data scientist like you should be familiar with..

Ready to dive in? 

We will cover quite a few topics, and we will start with what “Feature Importance” means..




What Does Feature Importance Mean in Machine Learning?

Feature importance is our super cool way of figuring out who our star players are in the dataset! 

It's like having a magnifying glass that shows which features (attributes or variables) are pulling the strings behind the scenes.

For example, see the image below..

You can see that “Job title” and “Salary” have the highest feature importance scores, whereas “First name” and “Last name” have the lowest feature importance scores..

 

So what’s the point? 

This means “Job title” & “Salary” carry more weight, and your machine learning algorithm should pay more attention to these features when predicting results, whereas “First name” & “Last name” carry the least weight, so your algorithm doesn’t need to pay much attention to them..

Now that you know what feature importance means..

You should always strive to use feature importance in machine learning modeling..

Why is that?

Let’s see why exactly feature importance is useful in machine learning..

Why Is Feature Importance Useful in Machine Learning?

(1) Faster Model Training : Let's imagine for a moment… You're cooking a lavish dinner for your friends. Now, you wouldn't want to waste time on ingredients that don't add much flavor, would you?

In the same way, in the world of data, there's a lot of... let's call it "unnecessary seasoning".

And too much of it can make your machine learning model slow and bulky.

By understanding feature importance techniques, you can trim off the excess and focus on what truly spices up our predictions. This will speed up your machine learning model training as it only has to learn the needed features and ignore the unimportant ones.

 

(2) Model Interpretability: When a machine learning model makes predictions, you would want to know on what basis it arrived at those results..

Wouldn’t you? 

Feature importance offers clarity on how a model makes decisions, making it more transparent and understandable to you..   

 

(3) Stakeholder Communication : You wanted to know how the model arrived at its predictions..

Wouldn’t stakeholders also want to know why your model is making certain decisions?

The answer, again, is feature importance..

Using feature importance, you can explain to stakeholders on what basis the machine learning model is making its predictions.

 

(4) Enhanced Predictions : When you know which ingredients matter most for a dish, you are better equipped to make it perfectly..

Likewise, knowing which features carry the most weight can improve the accuracy and reliability of a model's predictions, since the model gives more weight to the features that matter most.

 

(5) Guided Data Collection: When you know which features matter the most, it can guide future data collection, ensuring more emphasis is placed on gathering relevant data. 

When I worked on a financial project in the past, we had a ton of data to deal with – thousands of attributes/features.. 

Whenever I tried to use my machine learning program, it took forever.. 

It wasn't just the program; even collecting the data took a long time..

Then one day, my boss asked if there was a way to make things go faster. 

That got me thinking, and I remembered "feature importance." 

Using feature importance, I realized that we didn't need all those thousands of attributes..

I decided to clean up the data by removing the stuff we didn't need..

When I did that, something amazing happened.. 

The data collection process got super fast, and my machine learning program's predictions also got speedy.. 

My boss was really happy because everything started working much better..

Knowing which features are important can be a lifesaver, as you only need to bring the relevant data into machine learning training.

Now that you know why feature importance is so useful, let's move on to the 7 must-know feature importance techniques that you can easily use for your own models. 

I’ll explain each of those 7 techniques.

Keep reading..

 

Feature Importance Methods 

There are 2 broad categories of feature importance methods. 

They are

(I) Model-agnostic methods &

(II) Model-dependent methods. 

 

(I) Model-agnostic methods are like the universal remote controls of feature importance — they work with any model.  

(II) Model-dependent methods are tailored to specific model types, so the importance scores come from the inner workings of that particular kind of model.  

We will look into 3 must-know model-agnostic methods and 4 model-dependent methods.

 

I. Model-Agnostic Feature Importance Methods 

 

 (1) Correlation Criteria

  • What is it? Correlation criteria is a model-agnostic feature importance method that measures the strength and direction of the linear relationship between a feature and the target variable. 

Imagine you're out shopping for a car.. 

You are keeping track of all the car details.  

Something interesting pops up when you're comparing different cars. 

You've noticed that when a car has a heavier engine, it is also more fuel-efficient..  

This might be because those heavier engines are more powerful or work more efficiently.. 

But, here's the twist.. 

You've also seen cases where if the whole car gets heavier, the fuel efficiency drops.. It may be that the car's weight is making it use more fuel.  

So, what's the bottom line?  

When the engine weight goes up, fuel efficiency goes up too – that's a positive correlation.  

In the graph below, you can notice that “engine weight” is on the x-axis and MPG (miles per gallon) is on the y-axis.  As the engine weight increases, the fuel efficiency increases, so the two have a positive correlation. 

But when the car's total weight increases, fuel efficiency goes down – that's a negative correlation. 

In the graph below, you can notice that “car weight” is on the x-axis and MPG (miles per gallon) is on the y-axis.  As the car weight increases, the fuel efficiency decreases, so the two have a negative correlation. 

What would be the correlation between “car color” & “fuel efficiency” ?  

You guessed it.. 

It would be zero correlation, right?  

In the graph below, you can notice that “car color” is on the x-axis and MPG (miles per gallon) is on the y-axis.  As you can see, there is no relationship between the two, so they have zero correlation. 

  • Advantages: Fast and easy to compute. 
  • Drawbacks: Only captures linear relationships. Non-linear trends might be missed. 
  • When to use: When looking for initial insights and potential linear relationships between features and the target.

Python Example Code

import pandas as pd

# df is assumed to be your DataFrame; 'target_variable' is the column you are predicting
df.corr(numeric_only=True)['target_variable'].sort_values(ascending=False)
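Here is a quick sketch of the car example with made-up numbers, just to show the sign of each correlation:

import pandas as pd

# Made-up figures purely for illustration
cars = pd.DataFrame({
    'engine_weight': [200, 260, 300, 340, 400],
    'car_weight': [3650, 3400, 3200, 2800, 2500],
    'mpg': [22, 25, 26, 29, 31],
})
print(cars.corr()['mpg'])  # engine_weight comes out positive, car_weight comes out negative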

 

 

(2) Single Variable Prediction

  • What is it? Single variable prediction is also a model-agnostic feature importance method, as it can work with any model.  This method trains a model using only one feature at a time.. 

You then gauge how well each feature can predict the outcome on its own..

Imagine you're trying to predict how fast a car can go based on different factors…

Single variable prediction is like looking at one factor at a time..

For example..

Let's say you start by looking at the engine size of the car..

You train a model using just the “engine size” and see how well it predicts the car's speed..

Likewise, you then train a model with just the weight of the car and see how well it predicts speed. 

By doing this for each factor, you can gauge how good each feature is at predicting the car's speed by itself.  

So what’s the point of doing this? 

This helps you understand the importance of each feature in predicting the outcome, without considering the interactions between them. This is exactly how single variable prediction works.

  • Advantages: Gives a clear perspective of the standalone predictive power of each feature. 
  • Drawbacks: Doesn't account for how features might interact together. 
  • When to use: For an initial assessment of each feature's individual strength.

Python Example Code

from sklearn.linear_model import LinearRegression

# 'feature' and 'target' are placeholder column names in your DataFrame df
model = LinearRegression().fit(df[['feature']], df['target'])
print(model.score(df[['feature']], df['target']))  # R^2 of this single feature on its own
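To compare every feature, you simply repeat this one column at a time. A minimal sketch, assuming X is a DataFrame of features and y is the target:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Train and score a model on each feature by itself (higher R^2 = stronger standalone predictor)
single_feature_scores = {
    feature: cross_val_score(LinearRegression(), X[[feature]], y, cv=5).mean()
    for feature in X.columns
}
print(sorted(single_feature_scores.items(), key=lambda item: item[1], reverse=True))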

(3) Permutation Feature Importance 

  • What is it? Permutation feature importance is the third model-agnostic method, so it too works with any model. It involves randomly shuffling one feature's values and measuring how much that shuffling affects the model's performance. 

So, let's say you're baking a chocolate cake..  

And you're curious about how vital the cocoa powder is.  

Now, imagine you randomly decide to replace the cocoa powder with something else, like flour.  

What happens next is quite interesting. 

You taste the cake after making this swap and, oh boy, it's a disaster!  

If the cake suddenly tastes terrible, you'd probably exclaim that cocoa powder is super important, right? 

Likewise, permutation feature importance involves taking one ingredient (feature), shuffling it around, and checking how much that messes up the final result (the model's performance).  

It's kind of like running a taste test in the world of data. 

Now, here's the catch.. 

If the model's performance takes a big hit when you mess with a particular feature, it tells us that the feature we tweaked is pretty darn important in making accurate predictions. 


Another example..

You are trying to predict a student's final exam score based on two features: "Study Hours" and "Prior Test Score."  

You would shuffle the values of "Study Hours" and "Prior Test Score" separately..

And then observe how it impacts the model's ability to predict the "Final Exam Score."..  

The changes in prediction accuracy due to shuffling help us determine the importance of each feature in making accurate predictions.

In the example below, after shuffling “study hours” or “prior test score” randomly, what happens to the model's predictions of the “final exam score”?

The result? 

The predictions land much further from the actual final exam scores than they did before the shuffle.  

This tells you that both “Study Hours” and “Prior Test Score” are important features for predicting how well a student will score.

  • Advantages: Takes into account interactions with other features. 
  • Drawbacks: Requires more computational resources and can be slower. 
  • When to use: When you're interested in understanding a feature's importance in the context of the entire model, especially regarding feature interactions.

Python Example Code

from sklearn.inspection import permutation_importance

# 'model' must already be fitted on X, y before calling this
results = permutation_importance(model, X, y, scoring='accuracy', n_repeats=10)
importance = results.importances_mean  # average drop in accuracy when each feature is shuffled
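To see which feature each drop belongs to, you can pair the scores with the column names (assuming X is a DataFrame):

for feature, drop in zip(X.columns, results.importances_mean):
    print(f"{feature}: accuracy drops by {drop:.3f} when shuffled")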

II. Model-Dependent Feature Importance Methods

As we saw earlier, model-dependent feature importance methods are tailored to specific models.


 (4) Linear Regression

  • What is it? Linear Regression is a model-dependent feature importance method.  

You are trying to predict a student's exam score.. 

And you need to figure out how different features affect the prediction of the score..

We're talking about factors like the number of study hours, previous test scores, and maybe even sleep hours.

I know you might be thinking of using one of the model-agnostic methods, since you are familiar with those now..

Stick with me for now..

This linear regression method aims to find the best-fitting linear relationship (a straight line) between the features and the target variable.  

After training, the linear regression model will assign coefficients (weights) to each of the features. 

These coefficients indicate the strength and direction of the relationship between each feature and the target variable. 

The goal is to look for features that have “high coefficient” value. 

Why is that? 

A positive coefficient suggests that as the feature increases, the target variable also tends to increase.  

A negative coefficient suggests the opposite. 

The magnitude of the coefficient reflects the feature's relative importance..  

Larger magnitude coefficients have a more significant impact on the target variable.

In the example below, you can see that the “final exam score” increases as “study hours” increase. 

However, the “final exam score” doesn't change much even when the “previous test scores” are high. 

Coefficient for "Study Hours" (x1): 2.5

Coefficient for "Previous Test Scores" (x2): 0.7 

A higher coefficient value for a particular feature means that the relationship between that feature and the target variable is stronger.

In this case, since the “Study Hours” coefficient is higher, it has a stronger relationship to the “Final Exam Score”, while the relationship of “Previous Test Scores” to the “Final Exam Score” is weaker.
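Plugging those coefficients into the fitted line makes the idea concrete. A tiny illustration, assuming a made-up intercept of 10:

# Predicted score = intercept + 2.5 * study_hours + 0.7 * previous_test_score
# The intercept (10) is invented for illustration; the coefficients come from the example above
predicted_score = 10 + 2.5 * 6 + 0.7 * 80   # 6 study hours and a prior score of 80 give 81.0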

  • Advantages: Clear, transparent method with easy-to-understand coefficients. 
  • Drawbacks: Assumes a linear relationship, which may not always be the case. 
  • When to use: When working with datasets where relationships between variables are primarily linear.

Python Example Code

from sklearn.linear_model import LinearRegression

model = LinearRegression().fit(X, y)
importance = model.coef_  # one coefficient per feature: sign gives direction, magnitude gives strength
print(dict(zip(X.columns, model.coef_)))  # pair each feature with its coefficient (assumes X is a DataFrame)

(5) Logistic Regression 

  • What is it? Logistic Regression is another model-dependent method for understanding the importance of features..

 Imagine you're dealing with a different scenario now.  

You're not predicting exam scores; you're into a different game – let's say you're working on predicting whether an email is spam or not. 

Logistic regression is all about classifying things, not predicting a continuous value, even though the name has “regression” in it. 

And you want to figure out which features make an email more likely to be spam – features like the sender, the subject, the use of certain keywords, and so on. 

Now, here's where Logistic Regression steps in..  

After training a Logistic Regression model, it assigns coefficients to each feature, just like Linear Regression does.  

These coefficients tell you which features influence the probability of an email being spam. 

You're on the lookout for features with 'high coefficient' values here as well.  

A positive coefficient means that as the feature increases, the log-odds of the email being spam increase too. 

A negative coefficient means the opposite. 

And yes, the magnitude of the coefficient matters here too.  

Features with larger coefficients have a more significant impact on the log-odds, which, in turn, affects the probability of an email being spam.

  • Advantages: Direct interpretability of the coefficients. 
  • Drawbacks: Assumes a linear relationship between independent variables and the logarithm of odds. 
  • When to use: Primarily for binary classification tasks.

Python Example Code

from sklearn.linear_model import LogisticRegression

model = LogisticRegression().fit(X, y)
importance = model.coef_[0]  # one coefficient per feature (log-odds scale, binary classification)
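Because the coefficients live on the log-odds scale, it is common to exponentiate them and read them as odds ratios. A minimal sketch, assuming X is a DataFrame:

import numpy as np

odds_ratios = dict(zip(X.columns, np.exp(model.coef_[0])))
print(odds_ratios)  # a ratio above 1 pushes an email toward "spam"; below 1 pushes it away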

 

(6) Decision Tree 

  • What is it? Hey, imagine a decision tree as a kind of flowchart. You know, the kind where you start at the top and follow the branches to make a decision. 

So, in this tree, each internal node represents something like a feature or an attribute.  

Like, let's say we're trying to decide what object a given record is.. 

One internal node might ask “Is it alive?”, the next one “Does it fly?”, and so on. 

Now, the branches coming out of these nodes represent the decision rules.  

Those are the rules we use to make decisions. 

And finally, the leaf nodes at the very end of the branches represent the outcomes.  

One important thing to note is that the top nodes, the ones near the beginning of the tree, often have a more significant influence on the final decision.  

They're like the big bosses of the decision-making process.

So, when you think about decision trees, picture them as these organized structures where you start with a feature, follow the rules, and reach a decision at the end. 

It's a handy way to make choices, both in real life and in data analysis. 

  • Advantages: Offers clear visualization and understanding of how decisions are made based on features. 
  • Drawbacks: Can easily result in overfitting if not properly tuned. 
  • When to use: When you want an interpretable model that visually breaks down decisions.

Python Example Code

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier().fit(X, y)
importance = model.feature_importances_  # based on how much each feature reduces impurity; sums to 1
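Since the features near the top of the tree usually matter most, it can also help to draw the top of the tree itself. A minimal sketch, assuming X is a DataFrame and matplotlib is installed:

from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

plot_tree(model, feature_names=list(X.columns), max_depth=2, filled=True)  # show just the top levels
plt.show()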

(7) Random Forest

  • What is it? Hey, you know how sometimes a group of people can make a better decision than just one person?  

Random Forest is a bit like that in the world of decision trees. 

It's like having a bunch of friends, and each friend has their own opinion on what's important. Instead of relying on just one friend's opinion, you listen to all of them. 

Likewise Random Forest is like a group of decision trees, and each tree has its own take on feature importance.  Instead of going with one tree's opinion, Random Forest combines all these opinions to give you a more balanced and reliable view of what features matter. 

And that's why Random Forest is such a powerful tool in machine learning. 

  • Advantages: More stable than a single decision tree, offering an averaged view of feature importance.
  • Drawbacks: Can be computationally demanding. 
  • When to use: When working with complex datasets and needing insights from multiple decision trees for a more balanced view.

Python Example Code

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier().fit(X, y)
importance = model.feature_importances_  # impurity-based importance averaged across all the trees
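To rank the features by that averaged importance, you can pair the scores with the column names (assuming X is a DataFrame):

import pandas as pd

print(pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False))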

  

And there you have it! These methods will definitely come in handy as you progress in your education and career.

Now, we’ll dive a bit deeper into how feature importance is calculated.

How Is Feature Importance Calculated?

Let's use a soccer team as a metaphor. 

If you wanted to find out how vital a star player is, you might observe how the team performs without them. 

If the team still does exceptionally well, perhaps that player isn't as essential as thought, right? :-)

But if the team struggles, it's clear their star player is pretty vital. πŸ‘

This is the crux of how feature importance works in machine learning.

Let’s break it down into steps (a short code sketch follows the list):

Step-by-Step Guide to Calculating Feature Importance

  1. Measure Baseline Performance: First, measure the performance (e.g., accuracy) of the model with all features included. This gives you a baseline score.
  2. Remove Single Feature: One at a time, remove a feature from the dataset. This is akin to watching your soccer team play without their star player.
  3. Measure the Drop: After removing a feature, retrain the model and measure its performance again. If the performance drops significantly without a particular feature, it indicates that the feature is pretty crucial.
  4. Comparison: Compare the model's performance without the feature to the baseline score. The difference between the two scores indicates the importance of that feature.
  5. Repeat: Go through this process for every feature in your dataset. By the end, you'll have a list of performance drops that correspond to the importance of each feature.
  6. Ranking: Now, rank the features based on the difference in performance. The feature causing the most significant drop when removed is the most important one.
  7. Visualization: For a more intuitive understanding, you can visualize the results. Bar charts are a popular choice, with features on the x-axis and their importance score on the y-axis.
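Here is a minimal sketch of steps 1 to 6, assuming X is a DataFrame of features, y is the target, and we use a random forest with cross-validation purely as an example:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Step 1: baseline performance with all features
baseline = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5).mean()

# Steps 2-5: drop one feature at a time, retrain, and record the performance drop
importance = {}
for feature in X.columns:
    score = cross_val_score(RandomForestClassifier(random_state=0), X.drop(columns=feature), y, cv=5).mean()
    importance[feature] = baseline - score   # bigger drop = more important feature

# Step 6: rank features by the drop they cause when removed
print(sorted(importance.items(), key=lambda item: item[1], reverse=True))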

 Pretty simple, right? Let’s take a quick look at some advanced calculation methods…

Special Mention: Advanced Methods

Some techniques, like Permutation Feature Importance, don't physically remove a feature.

Instead, they shuffle its values, destroying any relationship it has with the target variable. 

Then they measure how this shuffling affects performance. 

It's a bit like asking your star player to play blindfolded instead of completely removing them from the field!

Remember: While this gives a solid idea of feature importance, keep in mind that these methods might not always capture complex interactions between features. 

Sometimes two seemingly unimportant features together can have a significant impact. 

So, always use these results as a guide, not an absolute truth!

In the next and final section, we’ll briefly explore the concept of SHAP feature importance.


SHAP Feature Importance

Let’s use another metaphor… Picture this: You're at a concert, watching your favorite band perform. 

Each musician brings their unique flavor to the overall sound. 

The drummer provides the rhythm, the guitarist gives those iconic riffs, and the vocalist belts out lyrics that give you goosebumps. 

But what if you wanted to know just how much each band member contributes to the song you love? 

Enter the world of SHAP!

What's SHAP? A Quick Intro

SHAP (SHapley Additive exPlanations) is like the music critic for machine learning. 

It doesn't just tell you that a song is good — it breaks down each musician's contribution to the song's greatness. 

In machine learning terms, SHAP values let you see how much each feature in your dataset contributes to a particular prediction.  

The Magic Behind SHAP

SHAP values are based on a concept from cooperative game theory called the Shapley value. 

Diving Deeper: How Does SHAP Work?

  1. Every Feature Gets a Turn: Imagine listening to each musician perform solo. With SHAP, we see how the model performs when each feature is present versus when it's absent, one feature at a time.
  2. Collaboration Matters: Just like how musicians collaborate, features in a model can interact. SHAP doesn't just look at features in isolation. It considers all possible combinations of features to gauge their joint contribution.
  3. Fair Attribution: If two features often work well together, SHAP ensures that they share the credit fairly, instead of attributing all the glory to one.
  4. Consistent Storytelling: The best part? SHAP values always sum up to the difference between the model's output and its average prediction, guaranteeing consistent explanations. 

It's like ensuring the sum of individual contributions of musicians always equals the total impact of the song.

In a sense, SHAP is your backstage pass to the world of machine learning predictions. 

It lets you go behind the scenes and understand the performance of every 'musician' in your 'band' (model). 

So, next time you're wondering why a model made a certain prediction, remember that SHAP's got your back, giving you all the deets on the action behind the curtains!
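If you want to try this yourself, the shap library makes it straightforward. A minimal sketch, assuming shap is installed and you already have a fitted tree-based model (say, a random forest) and a feature DataFrame X:

import shap

explainer = shap.TreeExplainer(model)      # explainer tailored to tree-based models
shap_values = explainer.shap_values(X)     # one contribution per feature, per prediction
shap.summary_plot(shap_values, X)          # ranks features by their average impact on predictions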

Conclusion

As we've journeyed through the landscape of feature importance, we've discovered seven key techniques. Let's quickly recap what we covered:

  • Understanding Feature Importance: We started by defining feature importance, the way we identify the key players in your data that really influence how your model makes predictions.
  • Why It Matters: We saw how feature importance speeds up model training, improves interpretability, helps stakeholder communication, enhances predictions, and guides data collection.
  • Model-Agnostic Methods: Correlation criteria, single variable prediction, and permutation feature importance, all of which work with any model.
  • Model-Dependent Methods: Linear regression, logistic regression, decision trees, and random forests, where the importance scores come from the model itself.
  • Choosing the Right Method: Which method fits depends on your data, the resources you have, and what you want to achieve. It's like choosing the right tools for your quest.
  • How It's Calculated: We walked through the remove-a-feature-and-compare procedure step by step, plus shuffle-based approaches like permutation importance.
  • SHAP: Finally, we looked at SHAP values, which fairly attribute each feature's contribution to an individual prediction.

So, how about a quick pop quiz? 

Question for You

Which method would you use if you needed a comprehensive, unbiased view of feature importance that isn't tied to any specific model? 

  A) Correlation Criteria
  B) SHAP
  C) Decision Tree
  D) Logistic Regression

 Drop your answers in the comments below!

(And no peeking back for the answer πŸ˜€)
