Once you have mastered some of the key modeling techniques for supervised learning, you might begin to hold a preference for a select few. However, regardless of your preference, a good data scientist understands there isn’t necessarily one perfect tool for every problem. It is the data scientist’s job to select the best model to make sense of the madness.
This is where the beauty of pipelines comes in.
Pipelines offer a streamlined technique for finding the best-performing parameters for a model fit to a given dataset with the fewest lines of code. They allow a user to apply several transformations to preprocessed data and then build the best model for the presented information. Pipelines do this by chaining together multiple estimators into one.
To construct a pipeline, you input a variable number of steps, all of which are transformers apart from the last, which must be an estimator. What are transformers? Transformers are sklearn classes that have a fit and a transform method, or a fit_transform method. Transformers are used to clean, reduce, expand, or generate feature representations. The fit method of a transformer examines a training set and extracts the parameters of the transformation. The transformer then uses the transform method to apply that transformation to unseen data. The fit_transform method performs both actions together, fitting to the training set and transforming it in a single call.
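As a minimal sketch of what that chaining means, the snippet below fits a two-step pipeline and then repeats the same work by hand. The synthetic data from make_classification, the choice of LogisticRegression as the final estimator, and the step names "scaler" and "clf" are assumptions made only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, assumed purely for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_new = X[:150], X[150:]
y_train = y[:150]

# Every step but the last is a transformer; the last step is an estimator.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_new)

# The same work done by hand: fit_transform on the training data,
# then transform (not fit) on the unseen data.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
clf = LogisticRegression().fit(X_train_scaled, y_train)
manual_pred = clf.predict(scaler.transform(X_new))

same_predictions = np.array_equal(pipe_pred, manual_pred)
print(same_predictions)
```

The pipeline version keeps the fit and transform bookkeeping in one object, which is what makes it convenient to hand to other tools later on.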
A few examples of transformers are the StandardScaler class in sklearn’s preprocessing module and the PCA (Principal Component Analysis) class from the decomposition module. StandardScaler standardizes features by removing the mean and scaling each feature to unit variance. This transformer will independently center and scale each feature of the training data and store the relevant statistics for later use on new (testing) data. Many of the estimators (that will be used in the last step of your pipeline) require data to be standardized. If an unstandardized feature has a variance that is much larger than the others, it could dominate the objective function and prevent the estimator from learning from the other features correctly. This is a class that is most useful in the early steps of a pipeline.
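The fit, transform, and fit_transform behaviour described above can be sketched on a tiny hand-made array. The numbers are made up for illustration; mean_ and scale_ are the attributes where StandardScaler stores the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (illustrative only): the second feature is on a much larger scale.
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

scaler = StandardScaler()
scaler.fit(X_train)                          # learn per-feature mean and scale
X_train_scaled = scaler.transform(X_train)   # apply them to the training data

# fit_transform does both steps in one call and gives the same result.
X_train_scaled_one_step = StandardScaler().fit_transform(X_train)

# New data is transformed with the stored training statistics, never refit.
X_test_scaled = scaler.transform(X_test)

print(scaler.mean_)    # per-feature training means
print(scaler.scale_)   # per-feature training standard deviations
print(X_test_scaled)
```

Note that the test row is scaled with the training mean and standard deviation, so its values are not guaranteed to fall in the same range as the scaled training data.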
The PCA technique is used to reduce dimensionality in a dataset. The main objective of PCA is to retain the most information about the variation present in a dataset while trimming the number of features present in the data. It does this by transforming the variables into a new set of variables, obtained by projecting the data onto the eigenvectors of its covariance matrix. Because those eigenvectors are orthogonal, the new variables are uncorrelated with one another. They are known as the Principal Components (PCs) and they are retained in descending order of their explained variance. By reducing the features present in the dataset, a shadow of the original object is created, and this technique is used to make that shadow as clear as possible by viewing it from the most informative angle. It is critical that datasets reduced by PCA are scaled first, so that no feature dominates simply because its variance is larger.
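A short sketch of those properties, assuming a synthetic four-feature dataset built only for this illustration (one feature nearly duplicates another, and one is on a far larger scale):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data (illustrative only): 200 samples, 4 features.
base = rng.normal(size=(200, 1))
X = np.hstack([
    base,                                    # feature 0
    base + 0.1 * rng.normal(size=(200, 1)),  # feature 1 closely tracks feature 0
    rng.normal(size=(200, 1)),               # feature 2, independent
    50 * rng.normal(size=(200, 1)),          # feature 3, much larger scale
])

# Scale first so the large-scale feature does not dominate the components.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                # two columns remain
print(pca.explained_variance_ratio_)  # sorted from largest to smallest

# The component directions are orthonormal, so this is close to the identity.
gram = pca.components_ @ pca.components_.T
print(np.round(gram, 6))
```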
Another technique that pairs well with pipelines and increases their efficiency is a grid search. Sklearn’s GridSearchCV allows you to run a search over specified parameter values for an estimator and then score the resulting models. This practice of evaluating the performance of multiple models that are fit with a variety of parameters is known as hyperparameter tuning. It is a significant step in modeling, as the performance of the entire model depends on the specified parameter values. By iterating through numerous parameter specifications and comparing model scores on held-out validation folds, the grid search helps guard against overfitting or underfitting the data. For example, if you are tuning the parameters of a tree classifier, the model could end up overfit if max_depth were allowed to be too high, resulting in a number of arbitrary splits. Alternatively, the model could be little more than a stump that doesn’t provide much information if the parameter were set too low.
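The max_depth example can be sketched as follows. The synthetic dataset, the candidate depths, and the choice of five folds are assumptions for illustration rather than recommended values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, assumed purely for illustration.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=0
)

# Candidate depths, from a single split up to an unrestricted tree (None).
param_grid = {"max_depth": [1, 3, 5, 10, None]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,                # 5-fold cross validation for every candidate
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)  # depth with the best mean validation accuracy
print(search.best_score_)   # that mean validation accuracy
```

The per-candidate scores are also available in search.cv_results_, which is useful for seeing how quickly performance falls off on either side of the chosen depth.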
In addition to searching multiple parameters, this technique runs k-fold cross validation on your data set. With k-fold cross validation, the training data is split into k groups. The model is trained on k-1 of the groups, with the remaining group held in reserve. The model is then introduced to the held-out group and a performance score is collected. This is repeated k times, each time refitting the model on a different k-1 groups and holding a new group in reserve. The average of the k performance scores becomes the cross-validation score and provides an estimate of model performance on new data. Following this process, test data is introduced to determine what the real performance is. It is important to note that data preparation (scaling, normalization) should be fit inside the cross-validation loop, on the CV-training set only, before that set is fit to the model being tuned. If this preparation occurs outside the loop, there is a potential for data leakage, which is when information is shared between the training and testing datasets. An added benefit of Pipelines is that data leakage is avoided because the pipeline ensures the same samples are used to train the transformers and the predictors.
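To make the loop concrete, here is a hand-written sketch of k-fold cross validation around a pipeline. The synthetic data, the LogisticRegression estimator, and k = 5 are assumptions for illustration. GridSearchCV performs the equivalent loop internally for each parameter combination.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, assumed purely for illustration.
X, y = make_classification(n_samples=250, n_features=8, random_state=1)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])

k = 5
scores = []
for train_idx, val_idx in KFold(n_splits=k, shuffle=True, random_state=1).split(X):
    # A fresh copy per fold: the scaler is fit on the k-1 training groups only,
    # so nothing about the held-out group leaks into the preprocessing.
    fold_model = clone(pipe)
    fold_model.fit(X[train_idx], y[train_idx])
    scores.append(fold_model.score(X[val_idx], y[val_idx]))

cv_score = float(np.mean(scores))  # average of the k fold scores
print(scores)
print(cv_score)
```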
Here is a look at pipelines in action:
First, import the necessary classes and functions from the modules mentioned above.
Then set the target variable to y and use the train_test_split function to generate a test set that is 25% of the original dataset.
Then complete the next steps: 1) Establish a Pipeline with two transformations (StandardScaler and PCA) and a baseline classifier with no specified parameters other than the random seed (for reproducibility). 2) Design a grid of parameters you would like to tune with the grid search. You can find a list of a classifier’s parameters in the class documentation. 3) Run the grid search cross validation, feeding in the pipeline, the grid parameters, and specifying the k for k-fold CV.
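The imports, the split, and the three steps above can be sketched in one place as follows. The original dataset and classifier are not shown here, so sklearn’s bundled breast cancer dataset, a DecisionTreeClassifier, the step names, and the grid values are all stand-ins chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset (assumption): features in X, target in y.
X, y = load_breast_cancer(return_X_y=True)

# Hold out 25% of the data as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# 1) Two transformers followed by a baseline classifier with only a seed set.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("clf", DecisionTreeClassifier(random_state=42)),
])

# 2) Parameter grid. Keys are "<step name>__<parameter name>".
param_grid = {
    "pca__n_components": [2, 5, 10],
    "clf__max_depth": [2, 4, 6],
    "clf__min_samples_leaf": [1, 5],
}

# 3) Grid search with k = 5 folds, scored on accuracy.
grid = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

print(grid.best_params_)            # best parameter combination found
print(grid.best_score_)             # its mean cross-validation accuracy
print(grid.score(X_test, y_test))   # accuracy on the held-out test set
```

Because the whole pipeline is passed to the grid search, the scaler and PCA are refit on each CV-training fold, which keeps the validation folds out of the preprocessing.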
The output will look like this:
Once the grid search has been fit to the training set, printing .best_params_ produces the bottom line of the output, identifying the parameters that provided the best score. In this case we evaluated the model’s accuracy, but a number of scoring methods can be selected.
Hopefully you can see that there is relatively little code needed and that you could duplicate the above for a number of models relatively easily.