Using Stochastic Gradient Descent to Train Linear Classifiers

This data was captured in my house, in various locations chosen to maximize the variation in the detected objects (currently only people, dogs and cats) and in their distance and angle from the radar sensor.

The data set contains known labeling errors, mostly from the object detector mistaking my cat for my dog, which I estimate (subjectively) happens about 10% of the time. A future effort will attempt to fine-tune the object detector to reduce this error. Until then, the error propagates to the radar classifier trained from this data set.

You can use the steps below to train the model on the radar data. The complete Python code that implements these steps can be found in the train.py module of the radar-ml project.

Scale data set sample features to the [0, 1] range.
Encode data set labels as integers.
Split samples and labels up into train, validation and test sets.
Generate feature vectors from the radar projections in each set above by concatenating all or selected projections. The result is a large but sparse feature space which is a function of the radar scan volume. In this example, the feature vector has length 10,010.
Augment the training set. This increases accuracy at the expense of training time. Fortunately, using SGD as an optimizer for linear classifiers scales extremely well to large data sets.
Balance the training set.
Use the training set and Stratified K-Folds cross-validation to fit a linear classifier using SGD as an optimization technique and a grid search to find the best hyperparameters.
Calibrate the best classifier using the validation set to obtain an accurate probability estimate of the predictions. This step may not be needed for probabilistic classifiers.
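
The first three steps above can be sketched roughly as follows. This is a minimal illustration and not the code from train.py: the function name `prepare_data`, the 60/20/20 split, and the random toy data are assumptions. It also fits the scaler on the training split only, which is a common precaution against leaking test statistics and differs slightly from the step order listed above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler


def prepare_data(features, labels, seed=0):
    """Encode labels, split 60/20/20, and scale features to [0, 1]."""
    # Encode string labels (e.g. "person", "dog", "cat") as integers.
    encoder = LabelEncoder()
    y = encoder.fit_transform(labels)

    # Hold out 20% as the test set, then split the remainder 75/25
    # into training and validation sets (60/20/20 overall).
    X_rest, X_test, y_rest, y_test = train_test_split(
        features, y, test_size=0.2, stratify=y, random_state=seed)
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=seed)

    # Fit the scaler on the training split only (an assumption of this
    # sketch), then apply the same transform to the other splits.
    scaler = MinMaxScaler(feature_range=(0, 1))
    X_train = scaler.fit_transform(X_train)
    X_val = scaler.transform(X_val)
    X_test = scaler.transform(X_test)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test), encoder, scaler


# Toy stand-in for the radar samples: 150 random feature vectors.
rng = np.random.default_rng(0)
features = rng.normal(size=(150, 20))
labels = np.array(["person", "dog", "cat"] * 50)

train, val, test, encoder, scaler = prepare_data(features, labels)
print("split sizes:", len(train[0]), len(val[0]), len(test[0]))
print("label encoding:", list(encoder.classes_))
```

Because the scaler only sees the training split, validation and test values can fall slightly outside [0, 1], which is usually harmless for a linear classifier.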

The Python snippet below from radar-ml’s train.py shows the actual fitting function. It uses the sklearn linear_model.SGDClassifier API with ‘log’ loss, which gives logistic regression (scikit-learn 1.1 and later name this loss ‘log_loss’). You can see the online training aspect, which does partial fits on the best classifier using augmented data as well as novel data sets, a very computationally efficient process. You can also see that the grid search tries fits over a number of hyperparameters; getting these values right is key to an accurate classifier.

SGD Fitting Function

Note: the sklearn.svm.LinearSVC API can optimize the same cost function as SGDClassifier if you adjust its penalty and loss parameters. However, LinearSVC does not allow for online learning. LinearSVC uses the LIBLINEAR library (Fan et al., 2008).
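
As a rough illustration of this note, the sketch below fits both estimators with an L2 penalty and hinge loss on toy data, then checks which one exposes `partial_fit`. The data and parameter values are assumptions. The mapping alpha ≈ 1 / (C × n_samples) is the usual approximate correspondence between the two regularization parameters, so the fitted models should be similar rather than identical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.svm import LinearSVC

# Toy binary problem standing in for real data.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

C = 1.0
# LIBLINEAR-backed batch solver: L2 penalty with hinge loss.
svc = LinearSVC(penalty="l2", loss="hinge", C=C, dual=True,
                max_iter=10000).fit(X, y)

# SGD on the same kind of objective; alpha plays the role of 1 / (C * n).
sgd = SGDClassifier(penalty="l2", loss="hinge", alpha=1.0 / (C * len(X)),
                    max_iter=1000, tol=1e-3, random_state=0).fit(X, y)

print("LinearSVC training accuracy:", svc.score(X, y))
print("SGDClassifier training accuracy:", sgd.score(X, y))
print("LinearSVC has partial_fit:", hasattr(svc, "partial_fit"))
print("SGDClassifier has partial_fit:", hasattr(sgd, "partial_fit"))
```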

The train.py module can also fit a model using LIBSVM (Chang and Lin, 2011) via the sklearn svm.SVC API, as you can see in the Python snippet below from radar-ml’s train.py. LIBSVM implements the sequential minimal optimization (SMO) algorithm for kernelized Support Vector Machines. This is a very powerful method, but its fit time does not scale well to large data sets or long feature vectors.

SVC Fitting Function
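
A comparable sketch for the kernelized SVM is shown here. Again, this is not the train.py code: the function name `fit_svc`, the small grid over `C` and `gamma`, and the toy data are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC


def fit_svc(X_train, y_train, seed=0):
    """Grid-search an RBF-kernel SVM (LIBSVM backend in scikit-learn)."""
    base = SVC(kernel="rbf")
    # Illustrative grid only; real values depend on the data set.
    param_grid = {"C": [1, 10], "gamma": ["scale", 0.01]}
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
    search = GridSearchCV(base, param_grid, cv=cv, scoring="accuracy")
    search.fit(X_train, y_train)
    return search.best_estimator_, search.best_params_


# Toy stand-in for the radar feature vectors.
X, y = make_classification(n_samples=300, n_features=50, n_informative=10,
                           n_classes=3, random_state=0)

svc, svc_params = fit_svc(X, y)
print("best hyperparameters:", svc_params)
print("support vectors per class:", svc.n_support_)
```

Unlike SGDClassifier, SVC has no `partial_fit`, so adding augmented or new data means refitting on the full set.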

Using the test set that was split from the data set in the steps above, evaluate the performance of the final classifier. The test set was not used for either model training or calibration, so these samples are completely new to the classifier. The evaluation function is shown in the Python snippet below, which is part of radar-ml’s train.py.

Model Evaluation Function
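
An evaluation function along these lines might look like the sketch below. It assumes integer-encoded labels 0..n-1; the function name `evaluate`, the choice of metrics, and the toy data are illustrative assumptions, not the contents of train.py.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)
from sklearn.model_selection import train_test_split


def evaluate(clf, X_test, y_test, class_names):
    """Return accuracy, confusion matrix, and a per-class text report."""
    label_ids = list(range(len(class_names)))
    y_pred = clf.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    # Rows are true classes, columns are predicted classes.
    cm = confusion_matrix(y_test, y_pred, labels=label_ids)
    report = classification_report(y_test, y_pred, labels=label_ids,
                                   target_names=class_names, zero_division=0)
    return acc, cm, report


# Toy three-class problem with a held-out test split.
X, y = make_classification(n_samples=400, n_features=30, n_informative=10,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

clf = SGDClassifier(random_state=0).fit(X_train, y_train)
acc, cm, report = evaluate(clf, X_test, y_test, ["cat", "dog", "person"])
print(f"accuracy: {acc:.3f}")
print(cm)
print(report)
```

The confusion matrix is worth printing alongside the accuracy, since it shows directly how often one class (for example cat) is mistaken for another (dog).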

The evaluation results from using SGDClassifier are shown below.

SGD Training Result Summary

The evaluation results from using SVC are shown below.

SVC Training Result Summary

You can see that the SGD method gives better overall accuracy on the test set (89% vs. 84%) and, moreover, completes training in about seven minutes (including four epochs of augmentation), compared with about 75 minutes for SVC. These results are not apples-to-apples, since the SGD classifier's accuracy benefits from the data augmentation. Using augmentation with SVC is basically infeasible on my 3.4 GHz i5 machine: training time is a non-linear function of the training set size, and the fit would take many days. Note that some of the inaccuracies are likely due to the labeling errors highlighted above.

The resulting SGD-trained linear classifier takes about 250 KB of disk space, whereas SVC results in a classifier (RBF kernel) more than two orders of magnitude larger, around 40 MB. This could be an advantage if you use the SGD classifier in a resource-limited embedded system.
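
The size gap follows from what each model has to store: a linear classifier keeps one weight vector per class, while a kernel SVM keeps its support vectors. The sketch below makes this visible on toy data by comparing pickled sizes. The data set and the use of `pickle` as a size proxy are assumptions, and the absolute numbers will not match the radar models.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC

# Toy data: few samples, many features, three classes.
X, y = make_classification(n_samples=300, n_features=200, n_informative=20,
                           n_classes=3, random_state=0)

sgd = SGDClassifier(random_state=0).fit(X, y)
svc = SVC(kernel="rbf").fit(X, y)

# Serialized size in bytes, used here as a stand-in for size on disk.
sgd_bytes = len(pickle.dumps(sgd))
svc_bytes = len(pickle.dumps(svc))

print(f"SGD: {sgd_bytes} bytes, weight matrix {sgd.coef_.shape}")
print(f"SVC: {svc_bytes} bytes, support vectors {svc.support_vectors_.shape}")
```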

You can find the fitted classifiers and training results here.

Using the classifier to make predictions on new data is straightforward as you can see from the Python snippet below. This is taken from radar-ml’s predict.py.

Prediction Function

You should consider using Stochastic Gradient Descent as an optimizer to efficiently train linear classifiers if you have a large number (many thousands) of training examples or features. Also consider using it for online learning, for example in situations where it is necessary for the algorithm to dynamically adapt to new patterns in the data. Different optimization methods or classifiers may be better in other cases.

SGD classifiers are sensitive to feature scaling and require fine-tuning of a number of hyperparameters, including the regularization parameter and the number of iterations, for good performance. You should always use feature normalization and a technique like grid search to find the best hyperparameters when using this method. If you intend to use the classifier to predict both a class and a confidence level, you should calibrate it first on a data set disjoint from the training set. Always evaluate the final classifier on a test set disjoint from both the training and validation sets.
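
To make the calibration advice concrete, here is a minimal sketch of sigmoid (Platt) scaling done by hand on a binary toy problem. It is an illustration under stated assumptions rather than the method used in train.py: the hinge-loss classifier, the split sizes, and the helper name `predict_calibrated` are hypothetical. scikit-learn also packages calibration in `CalibratedClassifierCV`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import train_test_split

# Toy binary problem with disjoint training and validation sets.
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# A margin-based classifier that has no native probability output.
clf = SGDClassifier(loss="hinge", random_state=0).fit(X_train, y_train)

# Platt scaling: fit a one-dimensional logistic regression that maps the
# raw decision score to P(y = 1), using validation data only.
val_scores = clf.decision_function(X_val).reshape(-1, 1)
platt = LogisticRegression().fit(val_scores, y_val)


def predict_calibrated(X_new):
    """Return calibrated P(y = 1) for each row of X_new."""
    scores = clf.decision_function(X_new).reshape(-1, 1)
    return platt.predict_proba(scores)[:, 1]


probabilities = predict_calibrated(X_val[:5])
print(np.round(probabilities, 3))
```

Keeping the calibration data separate from the training data matters because the classifier's scores on samples it has already seen tend to be overconfident.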

You will find prediction using the SGD classifier straightforward via the sklearn APIs, and its compact size favors resource-limited embedded systems.
