It is now time to learn how to evaluate and optimize machine learning models. During the process of modeling, or even after model completion, you might want to understand how your model is performing. Each type of model has its own set of metrics that can be used to evaluate performance, and that is what you are going to study in this chapter.

Apart from model evaluation, as a data scientist, you might also need to improve your model’s performance by tuning the hyperparameters of your algorithm. You will take a look at some nuances of this modeling task.

In this chapter, the following topics will be covered:

- Introducing model evaluation
- Evaluating classification models
- Evaluating regression models
- Model optimization

Alright, time to rock it!

There are several different scenarios in which you might want to evaluate model performance. Some of them are as follows:

- You are creating a model and testing different approaches and/or algorithms. Therefore, you need to compare these models to select the best one.
- You have just completed your model and you need to document your work, which includes specifying the model’s performance metrics that you got from the modeling phase.
- Your model is running in a production environment, and you need to track its performance. If you encounter model drift, then you might want to retrain the model.

*Important note*

*The term model drift refers to the problem of model deterioration. When you build a machine learning model, you must use data to train the algorithm. This dataset is known as training data, and it reflects the business rules at a particular point in time. If these business rules change over time, your model will probably fail to adapt to those changes, because it was trained on a dataset that reflected a different business scenario. To solve this problem, you must retrain the model so that it considers the rules of the new business scenario.*

Model evaluation is commonly performed in the context of testing. You have learned about holdout validation and cross-validation before. Both testing approaches share the same requirement: they need a metric in order to evaluate performance.
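To make this concrete, here is a minimal sketch of the holdout idea: predictions on a held-out set are scored with an explicit metric. The labels and the `accuracy` helper below are illustrative, not a specific library API; in practice you would typically use a library such as scikit-learn for this.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical labels from a held-out test set
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(accuracy(y_true, y_pred))  # 0.75
```

Cross-validation uses the same building block: the metric is computed once per fold and the results are averaged.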

These metrics are specific to the problem domain. For example, there are specific metrics for regression models, classification models, clustering, natural language processing, and more. Therefore, during the design of your testing approach, you have to consider what type of model you are building in order to define the evaluation metrics.

In the following sections, you will take a look at the most important metrics and concepts that you should know to evaluate your models.

Classification is one of the most traditional classes of problems that you might face, either during the exam or during your journey as a data scientist. A very important artifact that you might want to generate during classification model evaluation is known as a confusion matrix.

A confusion matrix compares your model's predictions against the actual values of each class under evaluation. *Figure 7.1* shows what a confusion matrix looks like in a binary classification problem:

Figure 7.1 – A confusion matrix

A confusion matrix has the following components:

- TP: This is the number of true positive cases. Here, you have to count the number of cases that have been predicted as true and are, indeed, true. For example, in a fraud detection system, this would be the number of fraudulent transactions that were correctly predicted as fraud.
- TN: This is the number of true negative cases. Here, you have to count the number of cases that have been predicted as false and are, indeed, false. For example, in a fraud detection system, this would be the number of non-fraudulent transactions that were correctly predicted as not fraud.
- FN: This is the number of false negative cases. Here, you have to count the number of cases that have been predicted as false but are, instead, true. For example, in a fraud detection system, this would be the number of fraudulent transactions that were wrongly predicted as not fraud.
- FP: This is the number of false positive cases. Here, you have to count the number of cases that have been predicted as true but are, instead, false. For example, in a fraud detection system, this would be the number of non-fraudulent transactions that were wrongly predicted as fraud.
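The four cells above can be counted directly from paired labels and predictions. The sketch below uses hypothetical fraud-detection labels (1 = fraud, 0 = not fraud); the `confusion_counts` helper is illustrative, not a library function.

```python
def confusion_counts(y_true, y_pred):
    """Return the four confusion-matrix cells for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp, tn, fn, fp

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # actual labels (1 = fraud)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # model predictions

tp, tn, fn, fp = confusion_counts(y_true, y_pred)
print(tp, tn, fn, fp)  # 3 3 1 1
```

Note that TP + TN + FN + FP always equals the total number of evaluated cases, which is a useful sanity check when building the matrix by hand.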

In a perfect scenario, your confusion matrix will have only true positive and true negative cases, which means that your model has an accuracy of 100%. In practical terms, if that scenario occurs, you should be skeptical rather than happy, since your model is expected to contain some level of error. A model with no errors at all is likely suffering from overfitting, so be careful.

Since some false negatives and false positives are expected, the best you can do is prioritize between them. For example, you can reduce the number of false negatives at the cost of increasing the number of false positives, and vice versa. This is known as the precision versus recall trade-off. Let's take a look at these metrics next.
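Before looking at the formal definitions, here is a quick sketch of the trade-off in action: moving the decision threshold on a classifier's predicted scores shifts the balance between false positives and false negatives. The scores and labels below are hypothetical, and `precision_recall` is an illustrative helper, not a library API.

```python
def precision_recall(y_true, scores, threshold):
    """Compute precision and recall for a given decision threshold."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [1, 1, 1, 0, 0, 1, 0, 0]                  # actual labels
scores = [0.9, 0.8, 0.6, 0.55, 0.4, 0.35, 0.3, 0.1]  # predicted probabilities

# A high threshold favors precision (fewer false positives) ...
print(precision_recall(y_true, scores, 0.7))  # (1.0, 0.5)
# ... while a low threshold favors recall (fewer false negatives).
print(precision_recall(y_true, scores, 0.3))
```

With the threshold at 0.7, the model flags only its most confident cases, so precision is perfect but half of the positives are missed; at 0.3 every positive is caught, but several negatives are flagged along the way.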