For numerical features (discrete and continuous), you can divide transformations into those that rely on the training data and those that rely purely on the individual observation being transformed.

Those that rely on the training data will use the training set to learn the necessary parameters during **fit**, and then use them to transform any test or new data. The logic is pretty much the same as what you just learned for categorical features; however, this time, the encoder will learn different parameters.
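To make this concrete, here is a minimal sketch of the fit/transform pattern (the class name and structure are illustrative, loosely mirroring scikit-learn's convention): **fit** learns the parameters (here, mean and standard deviation) from the training data, and **transform** reuses those learned parameters on any new data.

```python
class SimpleStandardScaler:
    """Illustrative fit/transform encoder for a numerical feature."""

    def fit(self, values):
        # Learn the parameters from the training data only.
        n = len(values)
        self.mean_ = sum(values) / n
        self.std_ = (sum((v - self.mean_) ** 2 for v in values) / n) ** 0.5
        return self

    def transform(self, values):
        # Apply the learned parameters to any data (train, test, or new).
        return [(v - self.mean_) / self.std_ for v in values]


scaler = SimpleStandardScaler().fit([10.0, 20.0, 30.0])
print(scaler.transform([20.0]))  # [0.0] – the training mean maps to zero
```

The key point is that `transform` never looks at the test set's own statistics; it always reuses what was learned during `fit`, which prevents information from the test data leaking into the transformation.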

On the other hand, those that rely purely on individual observations do not depend on the training or testing sets. They simply perform a mathematical computation on an individual value. For example, you could apply a power transformation to a particular variable by squaring its value. There is no dependency on learned parameters from anywhere: you just take the value and square it.
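A stateless transformation of this kind can be sketched in a couple of lines; note that, unlike the fit/transform encoders, there is nothing to learn and nothing to store:

```python
def square(value):
    # Depends only on the observation itself;
    # no parameters are learned from any dataset.
    return value ** 2


print(square(3.0))  # 9.0
print([square(v) for v in [1.0, 2.0, 4.0]])  # [1.0, 4.0, 16.0]
```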

At this point, you might be thinking about dozens of available transformations for numerical features! Indeed, there are so many options, and you will not learn all of them here. However, you are not supposed to know all of them for the AWS Machine Learning Specialty exam. You will learn the most important ones (for the exam), but you should not limit your modeling skills: take a moment to think about the unlimited options you have by creating custom transformations according to your use case.

Applying data **normalization** means changing the scale of the data. For example, your feature may store employee salaries that range between 20,000 and 200,000 dollars/year, and you want to put this data in the range of 0 to 1, where 20,000 (the minimum observed value) will be transformed to 0 and 200,000 (the maximum observed value) will be transformed to 1.
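This min-max normalization can be sketched as follows (the function names are illustrative): the minimum and maximum are learned from the training data, and each value is then rescaled into [0, 1].

```python
def min_max_fit(values):
    # Learn the min and max from the training data.
    return min(values), max(values)


def min_max_transform(value, vmin, vmax):
    # Rescale so that vmin maps to 0 and vmax maps to 1.
    return (value - vmin) / (vmax - vmin)


salaries = [20_000, 55_000, 120_000, 200_000]
vmin, vmax = min_max_fit(salaries)
print(min_max_transform(20_000, vmin, vmax))   # 0.0
print(min_max_transform(200_000, vmin, vmax))  # 1.0
print(min_max_transform(110_000, vmin, vmax))  # 0.5 – the midpoint of the range
```

In scikit-learn, this is what `MinMaxScaler` does; note that values outside the observed training range (say, a new salary of 250,000) would map outside [0, 1] unless clipped.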

This type of technique is especially important when you want to fit your training data to certain types of algorithms that are sensitive to the scale/magnitude of the underlying data. For instance, you can think about those algorithms that use the dot product of the input variables (such as neural networks or linear regression) and those algorithms that rely on distance measures (such as **k-nearest neighbors** (**KNN**) or **k-means**).
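The following sketch shows why distance-based algorithms care about scale (the feature values and ranges are invented for illustration): with a raw salary feature and a raw years-of-experience feature, the Euclidean distance is dominated entirely by the salary, while after min-max scaling both features contribute on comparable terms.

```python
import math


def euclidean(a, b):
    # Standard Euclidean distance between two feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Raw features: [salary in dollars/year, years of experience].
a_raw = [20_000, 1]
b_raw = [21_000, 10]

# The same points after min-max scaling each feature into [0, 1]
# (assuming salary spans 20_000-200_000 and experience spans 0-40).
a_scaled = [0.0, 1 / 40]
b_scaled = [1_000 / 180_000, 10 / 40]

print(euclidean(a_raw, b_raw))        # ~1000: the salary gap drowns out experience
print(euclidean(a_scaled, b_scaled))  # < 1: both features now matter
```

A KNN or k-means model trained on the raw features would effectively ignore the experience column; after normalization, neighborhoods reflect both features.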

On the other hand, applying data normalization will not result in performance improvements for rule-based algorithms, such as decision trees, since they can assess the predictive power of each feature (via entropy or information gain analysis) regardless of the scale of the data.