I used TensorFlow, and in particular Estimators (TF Estimators), to showcase how to industrialise a TensorFlow machine learning pipeline on GCP.
TensorFlow Estimator is a high-level API that was created to facilitate the development and deployment of machine learning models, including neural networks. With the release of TensorFlow 2, the preferred high-level API is now Keras (the TensorFlow-integrated Keras), but it doesn't cover exactly the same scope. TF Estimator was created not only to simplify model design, but also to specify, train, evaluate, and deploy machine learning models. The key concept of TF Estimators is pre-made estimators, i.e. ready-to-use models for classification and regression, but you can also convert a custom Keras model into an estimator, so the two APIs coexist and can be used in a complementary manner. Other capabilities include the ability to run Estimator-based models:
- on a local host or on a distributed multi-server environment without changing the model
- on CPUs, GPUs, or TPUs without recoding the model.
The typical structure of a machine learning pipeline built with Estimators consists of four elements (a minimal skeleton is sketched after the list):
- an input function
- the feature columns
- an Estimator
- a call to a method for training, evaluation or inference
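Put together, the skeleton looks roughly like this; the toy data, the trip_distance column and the LinearRegressor choice are illustrative, not the pipeline's actual model:

```python
import tensorflow as tf

# 1. An input function returning a tf.data.Dataset of (features dict, labels).
def train_input_fn():
    features = {"trip_distance": [1.2, 5.4, 0.8, 9.3]}  # toy data
    labels = [6.5, 18.0, 5.0, 27.5]
    return tf.data.Dataset.from_tensor_slices((features, labels)).batch(2)

# 2. Feature columns describing how raw inputs feed the model.
feature_columns = [tf.feature_column.numeric_column("trip_distance")]

# 3. A pre-made Estimator.
estimator = tf.estimator.LinearRegressor(feature_columns=feature_columns)

# 4. A call to a method for training, evaluation or inference.
estimator.train(input_fn=train_input_fn, max_steps=100)
```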
Step 1: data ingestion with an input function
The first step of the pipeline consists of reading the data. This is done by writing an input function (a single function can handle reading data for both training and prediction, or you can write two separate functions, one per phase). An Estimator expects data to be formatted as a pair of objects: a dictionary mapping feature names to feature values, and a tensor of labels. It is recommended to return the data as a tf.data.Dataset. The input function is executed within a tf.Graph, which means that operations are optimised before execution.
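As a sketch of the single-function approach, a parameterised builder can serve both phases; the name make_input_fn and the pandas DataFrame inputs are assumptions for illustration:

```python
import tensorflow as tf

def make_input_fn(data_df, label_df, num_epochs=1, shuffle=True, batch_size=32):
    """Returns an input_fn yielding (features dict, labels) batches."""
    def input_fn():
        ds = tf.data.Dataset.from_tensor_slices((dict(data_df), label_df))
        if shuffle:
            ds = ds.shuffle(buffer_size=1000)  # shuffle only for training
        return ds.batch(batch_size).repeat(num_epochs)
    return input_fn

# train_input_fn = make_input_fn(train_df, train_labels, num_epochs=None)
# eval_input_fn = make_input_fn(eval_df, eval_labels, shuffle=False)
```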
In our case, the data are stored in BigQuery and I could access them directly thanks to TensorFlow IO, a TensorFlow extension that makes it possible to read data from external sources.
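A rough sketch of such a read, assuming the tensorflow_io package and a hypothetical project, dataset and table; the exact BigQueryClient API may vary between tensorflow_io versions:

```python
import tensorflow as tf
from tensorflow_io.bigquery import BigQueryClient

client = BigQueryClient()
read_session = client.read_session(
    parent="projects/my-gcp-project",    # hypothetical project
    project_id="my-gcp-project",
    dataset_id="taxi",                   # hypothetical dataset
    table_id="trips",                    # hypothetical table
    selected_fields=["pickup_datetime", "passenger_count", "fare_amount"],
    output_types=[tf.string, tf.int64, tf.double],
    requested_streams=2,
)
dataset = read_session.parallel_read_rows()  # yields dicts of feature tensors
```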
Step 2: feature pre-processing with feature columns
The step following data ingestion is the identification and pre-processing of features. Another TensorFlow module is used here: tf.feature_column. This module lets you perform a number of pre-processing operations on predictor variables, such as mapping categorical or string features to numerical values. It's powerful, but you soon reach its limits. The major limitation is that you can only "map" transformations (row-by-row transformations, like multiplying by a given constant); if you need a full pass over the dataset, for example to compute a scaling factor or to collect the full list of a variable's categories, you can't do it automatically and must compute those values beforehand. Available transformations include bucketing, binarisation, one-hot encoding, embedding and feature crossing, among others. The embedding is the only transformation that is learned during training.
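A few representative feature columns, sketched with illustrative names, boundaries and vocabularies (which, as noted above, must be known beforehand):

```python
import tensorflow as tf

# Row-by-row numeric transformation: bucketise a distance into ranges.
distance = tf.feature_column.numeric_column("trip_distance")
distance_buckets = tf.feature_column.bucketized_column(
    distance, boundaries=[1.0, 3.0, 10.0])

# One-hot encoding: the vocabulary has to be computed in advance.
day = tf.feature_column.categorical_column_with_vocabulary_list(
    "day_of_week",
    vocabulary_list=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"])
day_one_hot = tf.feature_column.indicator_column(day)

# Embedding: the only transformation whose weights are learned during training.
day_embedded = tf.feature_column.embedding_column(day, dimension=4)
```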
I computed features based on the initial four variables available in the dataset (pick-up location and timestamp, drop-off location, passenger number).
Pick-up timestamps were used to extract temporal features: the week of year, the hour of the day and the day of the week.
Pick-up and drop-off locations were used to compute geographical features: the approximate distance and whether the trip was to or from an airport.
The number of passengers was used to distinguish between a 5-seat taxi and a minivan.
I computed sparse and dense features (one-hot encoded and embedded) and I also combined categories together, for example hour of the day and day of the week, so as to capture the combined daily and weekly seasonality of traffic jams. These combinations are called "feature crosses" and are useful when the relationship between two features matters. The drawback is that most of the combinations may not be relevant at all, and one-hot encoding them makes the feature space very sparse. These problems are partially handled by the model, as explained in the next section.
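As a sketch, a cross of hour of day and day of week could look like this (the feature names and bucket size are illustrative); the one-hot version feeds the linear part and the embedded version the deep part of the model described in the next section:

```python
import tensorflow as tf

# Cross hour of day with day of week to capture joint daily/weekly seasonality.
hour_x_day = tf.feature_column.crossed_column(
    ["hour_of_day", "day_of_week"], hash_bucket_size=24 * 7)

# Sparse one-hot version for the linear ("wide") part of the model.
hour_x_day_one_hot = tf.feature_column.indicator_column(hour_x_day)

# Dense embedded version for the neural ("deep") part of the model.
hour_x_day_embedded = tf.feature_column.embedding_column(hour_x_day, dimension=8)
```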
Step 3: model definition with an Estimator
This is the central part of the pipeline, and the fact that the estimator is separated from the rest is key for iterative development: you can choose a very simple pre-made estimator and focus on implementing a fully functional pipeline from end to end. Only then do you return to the model and improve its performance. The two previous steps and the following one shouldn't be much impacted.
The available pre-made estimators are:
- LinearRegressor / LinearClassifier, for linear models
- DNNRegressor / DNNClassifier, for feed-forward neural networks
- BoostedTreesRegressor / BoostedTreesClassifier, for gradient boosted trees
- DNNLinearCombinedRegressor / DNNLinearCombinedClassifier, for wide and deep models
The fourth option is a hybrid model called the wide and deep model, presented in a recent paper. It is made of two parts: one connects inputs to outputs via a deep neural network, the other connects inputs to outputs via a linear model. The linear part takes in sparse columns and the DNN takes in real-valued columns. This approach can be useful when dealing with sparse features such as feature crosses, which is our case. The linear part is able to memorise previously seen interactions, while the DNN part helps generalise to new, unseen interactions. These considerations motivated the choice of such a model, and it was instantiated as follows:
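A minimal sketch of such an instantiation, assuming wide_columns and deep_columns hold the sparse and dense feature columns from the previous step, with an illustrative model directory and layer sizes:

```python
import tensorflow as tf

estimator = tf.estimator.DNNLinearCombinedRegressor(
    model_dir="gs://my-bucket/taxi-model",  # hypothetical output location
    linear_feature_columns=wide_columns,    # sparse columns (crosses, one-hot)
    dnn_feature_columns=deep_columns,       # dense columns (numeric, embeddings)
    dnn_hidden_units=[64, 32, 16],          # illustrative architecture
)
```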
The heart of every Estimator, whether pre-made or custom, is its model function, which is a method that builds graphs for training, evaluation, and prediction. When you are using a pre-made Estimator, the model function has already been implemented. When relying on a custom Estimator, you must write the model function yourself. You can convert existing Keras models to Estimators with tf.keras.estimator.model_to_estimator. Doing so enables your Keras model to access the Estimator's strengths, such as distributed training.
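A minimal sketch of the conversion (the Keras architecture here is illustrative):

```python
import tensorflow as tf

keras_model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(1),
])
keras_model.compile(optimizer="adam", loss="mse")

# The converted Estimator reuses the compiled model's graph and loss.
estimator = tf.keras.estimator.model_to_estimator(keras_model=keras_model)
```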
Step 4: train, evaluate and predict with the appropriate method
All Estimators provide train, evaluate, and predict methods. Each of them takes an input_fn as its first parameter, which tells it how to read the data. The predict method returns an iterable.
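In use, assuming the input functions sketched earlier:

```python
estimator.train(input_fn=train_input_fn, max_steps=10000)

metrics = estimator.evaluate(input_fn=eval_input_fn)  # dict of metric values
print(metrics)

# predict returns an iterable (a generator) of per-example predictions.
for prediction in estimator.predict(input_fn=eval_input_fn):
    print(prediction)
```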
Estimators export SavedModels through tf.estimator.Estimator.export_saved_model: it exports the inference graph as a SavedModel into the given directory. This method builds a new graph by first calling the serving_input_receiver_fn to obtain feature tensors, and then calling the Estimator's model_fn to generate the model graph based on those features.
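A sketch of the export call; the export path is illustrative, and serving_input_receiver_fn is discussed at the end of this step:

```python
export_path = estimator.export_saved_model(
    export_dir_base="gs://my-bucket/taxi-model/export",  # hypothetical path
    serving_input_receiver_fn=serving_input_receiver_fn,
)
```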
For automatic training and evaluation there's a specific method: tf.estimator.train_and_evaluate. This utility function trains, evaluates, and (optionally) exports the model by using the given estimator. Moreover, it provides consistent behaviour for both local (non-distributed) and distributed configurations. All training-related specification is held in train_spec, including the training input_fn and the maximum number of training steps. All evaluation- and export-related specification is held in eval_spec, including the evaluation input_fn, steps, exporter, etc. The exporter parameter is needed if you want to save the model. It is an instance of one of:
- BestExporter, which performs a model export every time the new model is better than any previously exported one
- LatestExporter, which, in addition to exporting, garbage collects stale exports
- FinalExporter, which performs a single export at the end of training
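Putting it together, a sketch of the call, assuming the estimator and input functions from earlier:

```python
import tensorflow as tf

train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=10000)

exporter = tf.estimator.BestExporter(
    name="best",
    serving_input_receiver_fn=serving_input_receiver_fn,
)
eval_spec = tf.estimator.EvalSpec(
    input_fn=eval_input_fn,
    steps=100,          # number of evaluation batches
    exporters=exporter,
)

tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
```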
Whether you use export_saved_model or an exporter, when you save an Estimator you need to create a serving_input_receiver_fn. This function builds a part of a tf.Graph that parses the raw data received by the SavedModel. It takes no argument and returns a ServingInputReceiver instance. It basically specifies the features to be passed to the model.
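A minimal sketch for raw (non-serialised) inputs, with illustrative feature names; real deployments often parse serialised tf.Example protos instead:

```python
import tensorflow as tf

def serving_input_receiver_fn():
    # Placeholders for the raw tensors the SavedModel will receive.
    receiver_tensors = {
        "trip_distance": tf.compat.v1.placeholder(tf.float32, shape=[None]),
        "hour_of_day": tf.compat.v1.placeholder(tf.int32, shape=[None]),
    }
    # No extra parsing here: the received tensors are used directly as features.
    return tf.estimator.export.ServingInputReceiver(
        features=receiver_tensors,
        receiver_tensors=receiver_tensors,
    )
```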