This blog is written and maintained by students in the Professional Master's Program in the School of Computing Science at Simon Fraser University as part of their course credit. To learn more about this unique program, please visit sfu.ca/computing/pmp.
Trying to figure out the best machine learning pipeline? Here is a 10-minute crash course to help you!
Have you ever stayed up late cleaning data, or struggled to select the perfect machine learning model?
If so, Automated Machine Learning (AutoML) and the Tree-based Pipeline Optimization Tool (TPOT) can be your best friends. This article is a mini crash course introducing the basics of AutoML and one of its most prominent libraries, TPOT, in just under 10 minutes. Let's get started!
What is AutoML?
AutoML attempts to totally or partially automate a machine learning workflow. In a traditional machine learning workflow, multiple stages still require a significant amount of manual work and domain knowledge. These include data cleaning, feature preprocessing, feature selection, feature construction, model selection and (hyper)parameter optimization. AutoML tools and platforms aim to automate more than one stage of this workflow by providing a partial or full pipeline solution.
Will AutoML replace data scientists?
AutoML is quite a revolutionary concept in data science. However, there is growing debate on whether data scientists will be replaced by automated data science platforms. Given budget cuts and the difficulty many companies face in recruiting data professionals, AutoML seems to be a great alternative to hiring an in-house data science team, especially with free open-source tools such as TPOT available. However, recent experiments showed that AutoML cannot entirely outperform human data scientists, and there are certain tasks that AutoML tools cannot accomplish: understanding business problems, transforming business problems into mathematical ones and applying domain knowledge to feature engineering.
Nevertheless, AutoML will be a powerful addition to any data scientist's arsenal of tools. Given adequate computational resources, it can significantly reduce the repetitive manual work involved in traditional machine learning workflows. In addition, with the help of AutoML, non-experts can apply predictive models to solve problems without a deep understanding of the underlying math. Therefore, AutoML is not only a technology that data scientists should adopt, but also something anyone who wishes to harness the magical power of machine learning should learn.
TPOT: the genetically modified AutoML
If you are searching for an AutoML package to play around with, TPOT should be your first port of call due to its popularity and accessibility. Of all the AutoML packages available, TPOT has the most GitHub stars, forks and contributors. It is also one of the most downloaded AutoML packages according to PyPI Stats.
The goal of TPOT is to find the best tree-based machine learning pipeline, one that uses machine learning operations (feature preprocessing, feature selection, model selection and hyperparameter tuning) as the tree nodes.
Below is an illustration of a tree-based pipeline selected by TPOT:
TPOT uses a technique called genetic programming (GP) to choose the best pipeline. The nodes (circles in Figure 2) of the tree-based pipeline are called machine learning (ML) operators. In short, TPOT uses the GP algorithm to find the best tree-based pipeline composed of ML operators.
So what are ML operators?
Here are the three kinds of ML operators (a hand-built example combining them follows this list):
- Supervised classification operators: include models from scikit-learn and XGBoost
- Feature processing operators: imported from `sklearn.preprocessing`
- Feature selection operators: imported from `sklearn.feature_selection`
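To make "operators as tree nodes" concrete, here is a minimal hand-built sketch using scikit-learn's Pipeline rather than TPOT itself. The specific operators chosen (StandardScaler, SelectKBest, RandomForestClassifier) are illustrative assumptions, not what TPOT will necessarily pick:

# A hand-built pipeline chaining one operator of each kind.
# TPOT's job is to discover this structure and its hyperparameters
# automatically instead of having you write it.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler               # feature processing
from sklearn.feature_selection import SelectKBest, f_classif   # feature selection
from sklearn.ensemble import RandomForestClassifier            # supervised classification

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=5)),
    ("classify", RandomForestClassifier(n_estimators=100)),
])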
But what exactly is the genetic programming (GP) algorithm?
TPOT uses the genetic programming algorithm to select ML operators and form tree-based pipelines. The GP process consists of steps including initialization, evaluation, selection, crossover, and mutation.
- Initialization: Generates 100 tree-based pipelines randomly
- Evaluation & Selection: From the 100 pipelines, it selects the top 20 pipelines using the NSGA-II selection scheme, which aims to maximize cross-validation balanced accuracy and minimize the number of operators
- Crossover & Mutation: Each of the 20 selected pipelines produces five copies (offspring); 5% of the offspring are crossed over with another 5%, and 90% of them each have a one-in-three chance of a point, insert or shrink mutation
Step (3) is analogous to evolution: the selected chromosomes reproduce, cross over and mutate, and a new generation is created. If you need more information on this step, please refer to the link at the bottom of the page for additional visual examples.
After initialization, the algorithm runs steps (2) and (3) repeatedly for 100 iterations (generations). TPOT then returns the best-performing pipeline.
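Here is a heavily simplified, runnable sketch of that loop. Everything in it is a stand-in: a "pipeline" is just a list of operator names and the fitness function is random, whereas real TPOT evolves trees of scikit-learn operators and scores them with cross-validated balanced accuracy under NSGA-II:

import random

# Stand-in operator vocabulary; real TPOT draws from scikit-learn/XGBoost.
OPERATORS = ["scaler", "pca", "select_k_best", "random_forest", "logistic"]

def random_pipeline():
    return [random.choice(OPERATORS) for _ in range(random.randint(1, 4))]

def fitness(pipeline):
    return random.random()  # stand-in for cross-validated balanced accuracy

population = [random_pipeline() for _ in range(100)]   # (1) initialization
for generation in range(100):
    # (2) evaluation & selection: keep the 20 best (TPOT uses NSGA-II,
    # which also penalizes pipeline size)
    parents = sorted(population, key=fitness, reverse=True)[:20]
    # (3) crossover & mutation: five offspring per parent
    offspring = []
    for parent in parents:
        for _ in range(5):
            child = list(parent)
            r = random.random()
            if r < 0.05:                               # 5% crossover
                mate = random.choice(parents)
                cut = random.randint(0, min(len(child), len(mate)))
                child = child[:cut] + mate[cut:]
            elif r < 0.95:                             # 90% mutation
                kind = random.choice(["point", "insert", "shrink"])
                if kind == "point":
                    child[random.randrange(len(child))] = random.choice(OPERATORS)
                elif kind == "insert":
                    child.insert(random.randrange(len(child) + 1),
                                 random.choice(OPERATORS))
                elif len(child) > 1:                   # shrink
                    child.pop(random.randrange(len(child)))
            offspring.append(child)
    population = offspring

best_pipeline = max(population, key=fitness)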
How to make tea(?) with TPOT
So, now that you know how TPOT works, let's go through some examples to see how easy it is to use! TPOT can be used for both supervised regression and classification problems. Unlike the traditional process of hand-picking models, preprocessors and estimators and pulling your hair out to perfect the features, TPOT is a "one-click" process that finds the best pipeline and returns a runnable Python file.
First things first: before using TPOT, you will need to install it. However, since TPOT is built on several Python libraries, you will need to install them all to get the best out of TPOT (I promise the trouble is worth it!). The basic packages are: numpy, scipy, scikit-learn, DEAP, update_check, tqdm, stopit, pandas, joblib and xgboost.
There are also additional packages to install in case you are looking into parallel computing (Dask), Multifactor Dimensionality Reduction (TPOT-MDR) or neural networks (PyTorch/TPOT-NN).
Please refer to the main TPOT installation guide at the bottom of the page for detailed dependency installation. With the dependencies out of the way, let's install TPOT.
# using pip
pip install tpot

# using conda-forge
conda install -c conda-forge tpot
The general workflow of TPOT is to first split the dataset into train and test sets. The next step is to declare which type of TPOT pipeline to construct, classification or regression, and set its hyperparameters. Then training is performed to select the best pipeline through genetic programming. The last step is to either export the trained pipeline for future use, or to call additional functions to display results or predictions.
Let’s go through some of the basic hyperparameters that you can play around with while declaring the Classifier/Regressor. Refer to the link at the bottom of the page for the full list of parameters and their use.
- generations: number of iterations to run the pipeline optimization
- population_size: number of pipelines retained in each generation of the genetic programming search
- verbosity: how much information TPOT communicates while it runs
- random_state: random seed for reproducible results on the same dataset
Let’s import the libraries first. You are welcome to use other datasets and selection libraries.
from tpot import TPOTClassifier  # for classification
from tpot import TPOTRegressor   # for regression
from sklearn.datasets import load_wine, load_diabetes
from sklearn.model_selection import train_test_split
import numpy as np
The first example is a classification example using scikit-learn's "wine" dataset.
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, train_size=0.75, test_size=0.25, random_state=20)

tpot_classifier = TPOTClassifier(generations=80, population_size=80,
                                 verbosity=2, random_state=20)
tpot_classifier.fit(X_train, y_train)
The second example is a regression example using scikit-learn's "diabetes" dataset.
diabetes = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target, train_size=0.75, test_size=0.25, random_state=20)

tpot_regressor = TPOTRegressor(generations=80, population_size=80,
                               verbosity=2, random_state=20)
tpot_regressor.fit(X_train, y_train)
TPOT also ships with scoring and prediction functions that use the selected pipeline to make predictions. Note that <model> below can be substituted with either tpot_regressor or tpot_classifier from the training above.
<model>.predict(X_test)
<model>.score(X_test, y_test)
<model>.predict_proba(X_test)  # classification only
See how convenient TPOT is? By the way, don’t forget to export a copy of the pipeline using the following command.
<model>.export('tpot_pipeline.py')
Exporting generates a file like the example below. You can see that all the imports are done, the data is split into train/test sets and the highest-scoring pipeline is constructed.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR',
                        dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
    train_test_split(features, tpot_data['target'], random_state=None)

# Average CV score on the training set was: 0.9974999999999999
exported_pipeline = RandomForestClassifier(bootstrap=True, criterion="entropy",
                                           max_features=0.6000000000000001,
                                           min_samples_leaf=19,
                                           min_samples_split=18,
                                           n_estimators=100)
exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
Note that these examples do not cover TPOT neural networks, parallel training, self-defined scoring functions, operator customization or pipeline templates. Please check the TPOT documentation and article if you want to dig into those subjects.
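As a small taste of those extras, here is a hedged sketch touching three of them. scoring, template, n_jobs and use_dask are TPOTClassifier constructor parameters in recent versions, but treat the exact values as illustrative and check the documentation for your version:

# Sketch: a custom scoring metric, a structural template, and
# Dask-based parallel training in one constructor call.
from tpot import TPOTClassifier

tpot = TPOTClassifier(
    generations=5,
    population_size=20,
    scoring="f1_macro",                           # any sklearn scoring string
    template="Selector-Transformer-Classifier",   # constrain pipeline shape
    n_jobs=-1,                                    # use all CPU cores
    use_dask=True,                                # requires dask and dask-ml
)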
Why is TPOT so great?
- Increases productivity of current data scientists
- Increases availability of machine learning for non-data scientists
- No domain knowledge and no human input required for the basic models
- Ability to output pipeline code
However… it is
- Slow in the training process
- Currently available for supervised learning techniques only
- Demanding to master: unlocking the full potential of TPOT requires a lot of knowledge
No TPOT? How about an infuser, or…
If you are still not 100 percent convinced about using TPOT — here are some other interesting AutoML libraries you might want to consider:
Auto-WEKA applies simultaneous selection to choose a machine learning algorithm and its associated hyperparameters. It is built on the Java machine learning package WEKA and can automatically yield good models for a variety of datasets. Auto-WEKA also ships with a graphical user interface, so there is no need for a terminal or a programming language. It was one of the first systems to combine algorithm selection and hyperparameter optimization, in addition to preprocessing steps.
Auto-sklearn extends Auto-WEKA's approach using the Python library scikit-learn and is a drop-in replacement for regular scikit-learn classifiers and regressors. It improves search efficiency and adds post-hoc ensemble building to combine the models generated during hyperparameter optimization. Auto-sklearn also has an edge over Auto-WEKA: it uses meta-learning as a warm start for the search procedure. In other words, the search is more likely to begin with better pipelines.
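To illustrate the drop-in claim, here is a minimal sketch, assuming auto-sklearn is installed (note that it officially supports Linux only; the time budget is an arbitrary choice):

# Minimal auto-sklearn sketch: same fit/score interface as scikit-learn.
import autosklearn.classification
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    *load_wine(return_X_y=True), random_state=20)
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=300)  # search budget in seconds
automl.fit(X_train, y_train)
print(automl.score(X_test, y_test))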
The H2O machine learning and data analytics platform is a fully open-source, distributed in-memory machine learning platform with linear scalability. H2O supports the most widely used statistical and machine learning algorithms, including gradient boosted machines, generalized linear models, deep learning and many more. One of its components, H2O AutoML, performs (simple) data preprocessing, automates the training of a large selection of candidate models, tunes their hyperparameters and creates stacked ensembles.
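Here is a minimal sketch of H2O AutoML from Python, assuming the h2o package is installed; the CSV path and the "target" column name are placeholders:

# Minimal H2O AutoML sketch: start a local cluster, train, inspect leaders.
import h2o
from h2o.automl import H2OAutoML

h2o.init()                                    # starts a local H2O cluster
train = h2o.import_file("PATH/TO/TRAIN.csv")  # placeholder path
aml = H2OAutoML(max_models=20, seed=1)
aml.train(y="target", training_frame=train)   # "target" is a placeholder column
print(aml.leaderboard)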
Time to dive in!
AutoML packages should be part of every data scientist's arsenal of tools and a great starting point for non-experts to experiment with machine learning. They allow users to focus on higher-value aspects of machine learning such as data preprocessing, feature engineering and model deployment. So are you ready to give TPOT a shot?
Links:
Reference:
Figure 1: Olson, R., et al. (2016). "Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science."
Figure 2: Olson, R., et al. (2016). "TPOT: A Tree-based Pipeline Optimization Tool for Automating Machine Learning."
Figure 4: Tu, H., & Nair, V. (2018). "Is One Hyperparameter Optimizer Enough?" Proceedings of the 4th ACM SIGSOFT International Workshop on Software Analytics.
Figure 5: López, F. (2020). "Auto-Sklearn: An AutoML Tool Based on Bayesian Optimization."
Figure 6: Pandey, P. (2019). "A Deep Dive into H2O's AutoML."
Thanks for Reading!