Confusion Matrix Explained: Your Key to Improving Machine Learning Model Performance

data science Oct 04, 2023

If you've ever used machine learning to solve a problem, you've likely encountered the term "confusion matrix." 

Did you know that confusion matrices are widely used in evaluating the performance of machine learning models? 

Yes, they help us understand how well our model is performing in terms of correctly predicting outputs.

Have you ever wondered how to calculate metrics like precision, recall, F1-score, and accuracy in machine learning? Confusion matrices are the building blocks for computing these metrics.

In this article, we'll explain what a confusion matrix is, how to read and interpret one, and how to use it to calculate accuracy, precision, recall, and the F1 score.



What is a Confusion Matrix?

Think of a confusion matrix as a table that shows how well your machine learning model is doing at predicting different classes. 

It's called a confusion matrix because it shows how "confused" your model is when it comes to distinguishing between different classes.

Let's say you're building a model to identify whether a credit card transaction is fraudulent or not. 

You have a dataset of 1000 transactions, with 900 legitimate transactions and 100 fraudulent transactions. You train your model on this dataset and then use it to predict the class of new transactions.

In an ideal scenario, a machine learning model would accurately predict all 900 legitimate cases as legitimate and all 100 fraudulent cases as fraudulent, right?

Unfortunately, no model is perfect, and there will always be some errors. The objective of machine learning is to achieve the highest possible accuracy while minimizing those errors.

 

How Do You Read and Interpret a Confusion Matrix?


The confusion matrix for this problem would look something like this:

                                  Predicted: Fraud (Positive)    Predicted: Legitimate (Negative)
  Actual: Fraud (Positive)        80  (True Positive)            20  (False Negative)
  Actual: Legitimate (Negative)   90  (False Positive)           810 (True Negative)

Each cell in the table represents the number of predictions that fall into a particular category. 

The rows represent the actual classes, and the columns represent the predicted classes.

Note: In this example, identifying a transaction as fraud is the “positive” outcome, since catching fraud is our goal. Similarly, identifying a transaction as not fraud (legitimate) is the “negative” outcome.

Though this may contradict the everyday meaning of the words, in the context of our machine learning task the “positive” class is simply the class we are trying to detect, so don't read “positive” as “good” here. Please keep this in mind while evaluating the scenarios below.

True Positives (TP) - This represents the number of instances where the actual class is positive and the model predicted it correctly as positive. In our example, this would be 80.

False Positives (FP) - This represents the number of instances where the actual class is negative but the model predicted it as positive. In our example, this would be 90.

False Negatives (FN) - This represents the number of instances where the actual class is positive but the model predicted it as negative. In our example, this would be 20.

True Negatives (TN) - This represents the number of instances where the actual class is negative and the model predicted it correctly as negative. In our example, this would be 810.
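To make these four counts concrete, here is a minimal Python sketch (using a small, made-up set of labels rather than the actual 1,000 transactions) showing how TP, FP, FN, and TN can be tallied by hand, with 1 standing for fraud (positive) and 0 for legitimate (negative):

```python
# Hypothetical labels: 1 = fraud (positive), 0 = legitimate (negative)
y_true = [1, 1, 0, 0, 1, 0, 0, 1]   # what the transactions actually were
y_pred = [1, 0, 0, 1, 1, 0, 0, 1]   # what the model predicted

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # actual positive, predicted positive
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # actual negative, predicted positive
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # actual positive, predicted negative
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # actual negative, predicted negative

print(tp, fp, fn, tn)  # 3 1 1 3
```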

This is how you read and interpret a confusion matrix. Using these four numbers, we can calculate the following metrics, which tell us how well our model is performing:

  • Accuracy
  • Precision
  • Recall
  • F1 Score

4 Common Metrics

1) Accuracy

Accuracy measures the overall performance of the model. It's calculated as (TP + TN) / (TP + TN + FP + FN). In our example, the accuracy would be (80 + 810) / 1000 = 0.89, or 89%.

In simple terms, accuracy is the fraction of instances your model got right (TP + TN) out of the total number of instances.
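As a quick sanity check, here is the same accuracy calculation in Python, using the counts from our fraud example:

```python
# Counts from the fraud example: TP = 80, FP = 90, FN = 20, TN = 810
TP, FP, FN, TN = 80, 90, 20, 810

accuracy = (TP + TN) / (TP + TN + FP + FN)
print(accuracy)  # 0.89
```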

 

2) Precision

Precision is the ratio of true positives to the total number of positive predictions.

In other words, it tells you how many predictions were correct among all the instances that your model predicted as positive.

The total number of instances the model predicted as positive is 80 + 90 = 170. Out of those, 80 were actually positive.

So precision is 80 / (80 + 90) = 0.47, or 47%.

In other words, TP/(TP + FP)

You can remember precision as the ratio that answers: of all the instances your model predicted as positive, how many are actually positive?
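The same precision calculation in Python, using the counts from our example:

```python
# Counts from the fraud example
TP, FP = 80, 90

precision = TP / (TP + FP)   # 80 / 170
print(round(precision, 2))   # 0.47
```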


3) Recall

 

Recall is the ratio of true positives to the total number of actual positive instances, or TP/(TP+FN). 

In this case, recall is 80/(80+20) = 0.80, or 80%. 

In other words, TP/(TP + FN)

You can remember recall as a metric that shows how many of the actual positives were correctly predicted as positive by your model.
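And the recall calculation in Python, again using the counts from our example:

```python
# Counts from the fraud example
TP, FN = 80, 20

recall = TP / (TP + FN)   # 80 / 100
print(recall)             # 0.8
```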


4) F1 Score

The F1-score is the harmonic mean of precision and recall, and it is a measure of the balance between precision and recall. 

The formula for F1 score is 2 x (precision x recall)/(precision + recall) 

In this case, the F1-score is 2 x (0.47 x 0.80)/(0.47 + 0.80) = 0.59. 

The F1-score ranges between 0 and 1, with 1 being the best possible score.
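Putting it together, a short Python snippet that computes precision, recall, and the F1 score from the counts in our example:

```python
# Counts from the fraud example
TP, FP, FN = 80, 90, 20

precision = TP / (TP + FP)                            # ~0.47
recall = TP / (TP + FN)                               # 0.80
f1 = 2 * (precision * recall) / (precision + recall)  # harmonic mean of the two
print(round(f1, 2))                                   # 0.59
```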

Why is the Confusion Matrix Important?

The confusion matrix is an important tool for evaluating machine learning model performance because it provides a detailed breakdown of how well the model predicts each class. Note also that you can generate a confusion matrix directly with Python's popular machine learning library, scikit-learn (sklearn).
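Here is a minimal sketch of that, assuming you already have two arrays of 0/1 labels (the tiny y_true / y_pred lists below are made up for illustration; in practice they would come from your test set):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = fraud (positive), 0 = legitimate (negative)
y_true = [1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1, 0, 0, 1]

# Rows are actual classes, columns are predicted classes.
# With labels sorted as [0, 1], the layout is [[TN, FP], [FN, TP]].
print(confusion_matrix(y_true, y_pred))
# [[3 1]
#  [1 3]]
```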

By looking at the metrics we calculated above, we can see which classes the model is performing well on and which ones it's struggling with.

For example, in our fraud transaction classification problem, we can see that the model is better at identifying legitimate transactions than fraudulent ones. 

  • The precision for legitimate transactions is about 98%, which means that 98% of the transactions the model labels as legitimate really are legitimate.

How we arrived at 98%: 810 / (810 + 20) ≈ 0.98

  • The recall for legitimate transactions is 90%, which means that 90% of all legitimate transactions in the dataset were correctly identified as legitimate by the model.

How we arrived at 90%: 810 / (810 + 90) = 0.90

  • On the other hand, the precision and recall for fraudulent transactions are both lower than for legitimate ones, which means that the model is having a harder time distinguishing between fraudulent and legitimate transactions. 

  • As calculated earlier, the precision for fraudulent transactions is 47%, which means that only 47% of the transactions the model flags as fraudulent are actually fraudulent.

  • As calculated earlier, the recall for fraudulent transactions is 80%, which means that only 80% of all fraudulent transactions in the dataset were correctly identified as fraudulent by the model (the short scikit-learn sketch below shows how to get these per-class figures directly).
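If you'd rather not compute the per-class figures by hand, scikit-learn's classification_report prints precision, recall, and F1 for every class at once. A minimal sketch, reconstructing label arrays that reproduce the counts from our example (80 TP, 90 FP, 20 FN, 810 TN):

```python
from sklearn.metrics import classification_report

# Rebuild label arrays matching the example's counts; in practice these come from your test set.
y_true = [1] * 80 + [0] * 90 + [1] * 20 + [0] * 810   # 1 = fraud, 0 = legitimate
y_pred = [1] * 80 + [1] * 90 + [0] * 20 + [0] * 810

# target_names maps label 0 -> "legitimate" and label 1 -> "fraud"
print(classification_report(y_true, y_pred, target_names=["legitimate", "fraud"]))
# fraud:      precision ~0.47, recall 0.80
# legitimate: precision ~0.98, recall 0.90
```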

For those using the R programming language, packages such as caret provide a confusionMatrix() function that produces the same breakdown and helps in evaluating model accuracy.

Conclusion

Concluding our exploration of the confusion matrix and its metrics, let's consolidate the key insights gained:

  • Understanding the Strengths: We first looked at the confusion matrix as a powerful tool for evaluating how well a machine learning model works. It helps us see things like true positives, true negatives, false positives, and false negatives.
  • Revealing Important Metrics: Next, we revealed the metrics found within the confusion matrix, such as accuracy, precision, and recall. Each metric gives us a different view of how well the model is performing.
  • Balancing Precision and Recall: We then delved into the balance between precision and recall, highlighting their close relationship. Precision and recall are crucial in evaluating the model, and finding the right balance is important.
  • Insights for Improvement: Finally, we emphasized the significance of understanding these metrics. This understanding gives us practical insights into where the model is strong and where it can be improved.

By embracing and understanding the details of the confusion matrix and its metrics, practitioners gain the ability to evaluate, improve, and fine-tune machine learning models for better accuracy and reliability.


 

 
Question for you

If a model identifies 90% of all relevant cases but also generates 50% false positives, would you consider this scenario an example of high recall or high precision? Why?
