Data transformation in ML — Standardization vs Normalization
Many things in life come in a variety of shapes, sizes, flavors, etc. It is this variety that is said to be “the spice of life”. Unfortunately, data scientists often have to save the variety for after hours and get the data they are working with to become rather similar.
When working with real-world data sets, it is common to find yourself handling a mix of continuous variables that span wide ranges. If you are looking at a data set of students' academic scores, the range of points for each assignment, quiz, or test might vary: a quiz might be scored out of 20 points while a test is scored out of 100. If you are comparing automobiles, you might be weighing fuel economy (15–45 mpg) against top speed (80–130 mph) and driving range on a full tank of gas (280–410 miles).
There might also be differences in scale within a single feature, or data with heavy skew. For example, exploring sale prices of homes in a given city might show that 80% of the homes are valued under $1 million, with the remaining 20% ranging between $1 million and $10 million. And those prices might sit next to a feature for lot size, with values between 0.25 and 10 acres.
Data on such widely different scales, representing such different things, can distort predictive machine learning models. Many algorithms see only the raw numbers and treat larger values as having greater influence on the outcome. Manipulating these data so that features can be compared fairly and models gain the most predictive power is called 'feature scaling', and it is a crucial step (some would argue the most crucial step) in the pre-processing stage of the data science cycle.
Two common approaches to feature scaling are standardization and normalization. Standardization makes a feature look more or less normally distributed: values are shifted so they are centered on the mean (which becomes 0) and rescaled so the distribution has unit standard deviation. Normalization shifts and rescales data so that values fall in the range [0, 1]. Scaling data before employing a distance-based algorithm lets all features contribute equally to the result, and for gradient descent algorithms, features on a similar scale allow faster convergence toward the minimum.
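The two formulas are simple enough to compute by hand. A minimal sketch with NumPy, using a hypothetical quiz-score feature:

```python
import numpy as np

# Hypothetical feature: quiz scores on a 0-20 point scale
x = np.array([8.0, 12.0, 14.0, 20.0])

# Standardization: center on the mean, rescale to unit standard deviation
standardized = (x - x.mean()) / x.std()

# Normalization (min-max): rescale so values fall in [0, 1]
normalized = (x - x.min()) / (x.max() - x.min())

print(standardized.mean())                 # ~0 (up to floating point)
print(standardized.std())                  # 1.0
print(normalized.min(), normalized.max())  # 0.0 1.0
```

Note that standardization does not bound the values to a fixed range, while normalization does; that difference drives much of the "which scaler?" discussion below.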
Scaling Methods in Python
Scikit-Learn’s preprocessing library contains a number of transformer classes:
StandardScaler
- Removes the mean and scales data to unit variance
- Cannot guarantee balanced feature scales in the presence of outliers
MinMaxScaler
- Rescales data so all values are in a 0–1 range
- Also sensitive to outliers, as inliers are often squeezed into a small range
RobustScaler
- Centering and scaling are based on percentiles: the median is removed and data are scaled according to the interquartile range
- The median and IQR are robust to outliers, unlike measures such as the min, max, mean, and standard deviation
PowerTransformer
- Implements the Yeo-Johnson and Box-Cox transforms to make the data more Gaussian-like, finding optimal scaling factors through maximum likelihood estimation to stabilize variance and minimize skewness
Normalizer
- Rescales each sample to unit norm, independent of the distribution of samples
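A quick sketch of these transformers side by side, using a hypothetical skewed feature with one outlier (home prices in thousands of dollars), shows how differently they treat extremes:

```python
import numpy as np
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, RobustScaler, PowerTransformer, Normalizer
)

# Hypothetical skewed feature with an outlier: home prices in $1000s
X = np.array([[250.0], [300.0], [410.0], [520.0], [9500.0]])

print(StandardScaler().fit_transform(X).ravel())   # outlier drags the mean/std
print(MinMaxScaler().fit_transform(X).ravel())     # inliers squeezed near 0
print(RobustScaler().fit_transform(X).ravel())     # median/IQR: bulk barely moves
print(PowerTransformer().fit_transform(X).ravel()) # more Gaussian-like result

# Normalizer works row-wise, so it is meaningful with more than one feature
print(Normalizer().fit_transform(np.array([[3.0, 4.0]])))  # unit-norm row: [0.6, 0.8]
```

Notice that MinMaxScaler pushes the four ordinary prices into a narrow band near 0, exactly the inlier-squeezing behavior described above.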
When you apply these methods to training and testing sets, you first call the transformer's fit(data) method on the training data to compute the pertinent statistics (min, max, mean, std) to be used for later scaling. Then you use transform(data) to scale both the training and testing sets with those same statistics. Fitting on the training data only keeps information from the test set from leaking into the model.
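The fit-then-transform pattern looks like this in practice (a sketch with a toy single-feature array):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy single-feature data, split into train and test sets
X = np.arange(20, dtype=float).reshape(-1, 1)
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
scaler.fit(X_train)                    # learn mean/std from the training data only
X_train_s = scaler.transform(X_train)  # training set: mean ~0, std ~1
X_test_s = scaler.transform(X_test)    # reuse the training statistics: no leakage
```

The test set is deliberately not guaranteed to have zero mean after scaling; it is transformed with the training set's statistics, which is what a deployed model would see.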
Where to scale in ML
At a high level, here are some tips on when to apply scaling. Scaling is key for machine learning methods that use distance measurements in their algorithms. K-Means uses the Euclidean distance measure, and K-Prototypes, which combines K-Means and K-Modes, therefore also requires scaling. K-Nearest Neighbors and SVMs likewise use distances between data points to determine similarity, so their inputs should be scaled. Principal Component Analysis (PCA) selects directions of maximum variance, so it is important to standardize feature variance first. Algorithms like neural networks and linear and logistic regression, which use gradient descent as an optimization technique, also benefit from scaled data because the feature value X appears in the gradient update.
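The effect on a distance-based model is easy to demonstrate. A sketch using K-Nearest Neighbors on scikit-learn's bundled wine data set, whose features span very different ranges:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)  # features range from ~0.1 to ~1600
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Without scaling, the large-magnitude features dominate the distances
raw_acc = KNeighborsClassifier().fit(X_tr, y_tr).score(X_te, y_te)

# A Pipeline fits the scaler on the training folds only, then scales both sets
scaled_acc = make_pipeline(
    StandardScaler(), KNeighborsClassifier()
).fit(X_tr, y_tr).score(X_te, y_te)

print(f"unscaled KNN accuracy: {raw_acc:.2f}, scaled: {scaled_acc:.2f}")
```

Wrapping the scaler and estimator in a Pipeline also bakes the fit-on-train-only rule into the model object itself, which is handy once cross-validation is involved.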
Tree-based models and Naive Bayes are two examples of models that are fairly insensitive to feature scaling because they are not distance-based. Tree-based models split nodes on one feature at a time, and the split point on a feature is not influenced by the scale of the other features. Naive Bayes, meanwhile, estimates a separate distribution for each feature, so differences in scale and variance between features do not distort its class probabilities.
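This insensitivity can be checked directly: fitting the same decision tree on raw and standardized versions of synthetic data (hypothetical features on wildly different scales) yields essentially identical predictions, because standardization preserves the ordering of values within each feature:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Two synthetic features on wildly different scales (second is ~1000x larger)
X = rng.normal(size=(200, 2)) * np.array([1.0, 1000.0])
y = (X[:, 0] + X[:, 1] / 1000.0 > 0).astype(int)

raw_pred = DecisionTreeClassifier(random_state=0).fit(X, y).predict(X)

X_scaled = StandardScaler().fit_transform(X)
scaled_pred = DecisionTreeClassifier(random_state=0).fit(X_scaled, y).predict(X_scaled)

# Splits depend only on the ordering within each feature, not its scale,
# so the two trees agree (up to floating-point tie-breaking)
print(np.mean(raw_pred == scaled_pred))
```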
Selecting a transformer
When data do not follow a Gaussian distribution, normalization is typically the route to take; if the data are Gaussian, standardization is the likely path. If outliers are present in the data, robust transformers or scalers (like the RobustScaler) are most appropriate.
Hopefully this serves as a good starting point for understanding why we sometimes need to scale data, when to apply these methods, and how some of the common transformations operate.