Getting Started with alwaysAI Model Training

More often than not, the most difficult part of a task is simply getting started. alwaysAI’s Model Training tool is now integrated with a Jupyter Notebook interface, which makes kicking off your computer vision project a breeze and keeps things simple from end to end. The toolkit’s JupyterLab interface lets you upload your dataset, dial in your training configuration, and start training, all in just a few clicks.

To get started, you’ll need to run through a brief initial setup. We’ll go through each step below.

Install the alwaysAI CLI and Docker. The alwaysAI CLI contains almost everything you will need to get started. To install both the latest version of the CLI and Docker, follow the instructions here.
Download our sample dataset. Use the download link here to access our sample quick-start dataset (173 MB). This should only take a few seconds. The dataset is fully annotated and ready to train immediately. Save it to an easily accessible path. For the sake of this tutorial, we’ll refer to the location of your dataset as path/to/dataset.zip.

And that’s it! Now you’re ready to train your own model.

With the alwaysAI CLI and Docker installed, run

alwaysai dataset train --jupyter

The first time you run this command, it automatically starts downloading a Docker image. Once the download finishes, the container spins up, and you should see something like the text below in your terminal. This means the alwaysAI Model Training platform is running and ready to be accessed on your machine. A new browser tab pointed to the provided link should open automatically in your default browser. Wait for it to load, or manually paste the link into your browser of choice (e.g., Chrome, Firefox, Safari).

info: Model id: testuser/confident_meitner
info: Train/Validation ratio : 0.7
info: Model dimensions : 300 by 300
[W 00:40:40.559 LabApp] All authentication is disabled. Anyone who can connect to this server will be able to run code.
[I 00:40:40.861 LabApp] JupyterLab extension loaded from /usr/local/lib/python3.6/dist-packages/jupyterlab
[I 00:40:40.862 LabApp] JupyterLab application directory is /usr/local/share/jupyter/lab
[I 00:40:40.864 LabApp] Serving notebooks from local directory: /
[I 00:40:40.864 LabApp] Jupyter Notebook 6.1.5 is running at:
[I 00:40:40.864 LabApp] http://a055153d7980:8888/
[I 00:40:40.864 LabApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
------------- open http://localhost:8888 -------------

Once opened, you should be greeted with the following screen in your browser:

From here, click “Notebook.ipynb” to bring up the alwaysAI Training Environment, then click into the cell and hit the “run” (play-button) icon on the top toolbar to render the available options. You can also collapse the file explorer at this point if you wish.

Now you should see a series of buttons, sliders, and drop-down menus through which you are able to set up your training session. We’ll talk about each widget in detail in the next section, and walk through how to properly configure training on the test dataset we’re working with.

The first set of widgets that pops up enables you to select your dataset, as well as make any transformations to the images you’d like before training. You can create a dataset from local files, or using files stored on S3 so long as you have the AWS CLI installed and configured. See our training documentation for these instructions.

The ‘Train Fraction’ slider controls the train/validation split of our dataset. During training, it is important to set aside a subset of your full dataset for validation. The model never trains on the validation images; instead, they are analyzed at specified intervals during training to evaluate the model’s performance. In other words, the validation images act as new images your model hasn’t seen before, and its performance on them gives us a barometer for whether the model is learning as expected. The default value of 0.7 means that 70% of the images are used for training and the remaining 30% for validation. However, since we are dealing with a fairly small dataset and training images are at a premium, we will increase this slightly to 0.8, meaning:

0.8 × 593 ≈ 475 training images

(1 − 0.8) × 593 ≈ 118 validation images
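The split arithmetic can be sketched in a few lines of Python. This is only an illustration: the function name `split_counts` is hypothetical, and rounding the training count up is an assumption chosen to reproduce the figures in this article, since the tool's actual rounding rule is not stated here.

```python
import math


def split_counts(total_images: int, train_fraction: float) -> tuple:
    """Return (train_count, validation_count) for a given train fraction."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    # Assumption: the training count is rounded up. The inner round() guards
    # against floating-point noise such as 474.40000000000003.
    train_count = math.ceil(round(total_images * train_fraction, 6))
    return train_count, total_images - train_count


train_count, val_count = split_counts(593, 0.8)
print(train_count, val_count)  # 475 118
```

With the default fraction of 0.7, the same function would set aside a larger share of images for validation, which is why a small dataset can benefit from a slightly higher value.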

Now your dataset is uploaded and ready for training, but first let’s take a look at some key information about our dataset:

This dataset contains 584 images, some examples of which can be seen below. Each image has a corresponding .xml file that describes its characteristics: the image dimensions, the number of objects of interest present, the coordinates of the bounding box surrounding each object, and the label assigned to that object. This format, called Pascal VOC, is used for training object detection models, and it is the format the alwaysAI Model Training tool expects. You can find a description of this format here.
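To make the annotation format more concrete, here is a minimal sketch that reads a Pascal VOC annotation with Python's standard library. The sample XML and its values are made up for illustration and are not taken from the quick-start dataset; `parse_voc` is a hypothetical helper, not part of the alwaysAI tooling.

```python
import xml.etree.ElementTree as ET

# Made-up annotation following the usual Pascal VOC layout.
SAMPLE_VOC = """<annotation>
  <filename>example.jpg</filename>
  <size><width>640</width><height>480</height><depth>3</depth></size>
  <object>
    <name>vehicle</name>
    <bndbox><xmin>48</xmin><ymin>120</ymin><xmax>420</xmax><ymax>400</ymax></bndbox>
  </object>
  <object>
    <name>license_plate</name>
    <bndbox><xmin>200</xmin><ymin>330</ymin><xmax>280</xmax><ymax>360</ymax></bndbox>
  </object>
</annotation>"""


def parse_voc(xml_text: str) -> dict:
    """Extract image size and (label, bounding box) pairs from VOC XML."""
    root = ET.fromstring(xml_text)
    size = root.find("size")
    objects = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        # Bounding boxes are stored as pixel coordinates of two corners.
        coords = tuple(
            int(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        objects.append((obj.findtext("name"), coords))
    return {
        "width": int(size.findtext("width")),
        "height": int(size.findtext("height")),
        "objects": objects,
    }


info = parse_voc(SAMPLE_VOC)
print(info["width"], info["height"], len(info["objects"]))  # 640 480 2
```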

We have labeled this dataset with two classes: vehicle and license_plate. Of course, in your own dataset, the labels will be decided by you. The training tool cannot handle spaces in label names; underscores or camel case are the safest ways to write multi-word labels. You’ll notice there is nowhere to define labels in the training setup below. That is because the tool automatically detects the labels in your dataset!
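If you are preparing your own dataset, a quick pre-flight check for label names containing whitespace can save a failed training run. The helpers below are hypothetical and not part of the alwaysAI CLI; they are a sketch of one way to find and fix such labels.

```python
def find_invalid_labels(labels) -> list:
    """Return the sorted, de-duplicated labels that contain whitespace."""
    return sorted(
        {label for label in labels if any(ch.isspace() for ch in label)}
    )


def to_underscores(label: str) -> str:
    """Join the whitespace-separated words of a label with underscores."""
    return "_".join(label.split())


labels = ["vehicle", "license plate", "license_plate"]
print(find_invalid_labels(labels))      # ['license plate']
print(to_underscores("license plate"))  # license_plate
```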

This section will go over all the configuration settings for training your model.

By default, the configuration fields are filled with placeholder values. We’ll want to make some adjustments specific to our application. Let’s go through those one-by-one.

1 — Model. First, select a model from the dropdown. We offer two options: MobileNet SSD version 1 and Faster R-CNN. MobileNet is the default; it trains faster and has a faster inference speed, but it is not as accurate as Faster R-CNN. Faster R-CNN takes more time to train and runs inference more slowly, but should provide better accuracy.

2 — Input Size. You can specify the image dimensions used to train your model, choosing small, medium, large, or a custom integer-by-integer dimension (e.g. 300×300). Larger input sizes take longer to train but will likely perform better, and vice versa. The small option is 300×300; you can see the other options in the image below.

3 — Epochs. An epoch is one full pass through the dataset. In other words, no matter the size of your dataset, in one epoch your model will have seen each image in that dataset once. A good rule of thumb is to train a model for at least 20 epochs. However, for the sake of this tutorial, we’ll start with 10.

4 — Batch Size. The batch size is the number of images that is fed to your model at a time. In other words, with a batch size of one, the model will ingest training images one at a time. With a batch size of two, the model will ingest training samples two at a time, etc… The default value of 4 should be suitable for most hardware, however, feel free to experiment with larger batch sizes. The training tool will output an estimated max batch size for you, and if you need to resort to a small batch size of two or even one, don’t sweat it.

Note: There are many reasons that you might want to work with a somewhat large batch size of 32 or even 64 images — smoother convergence, a slight normalization effect, quicker training depending on hardware — however, these intricacies are out of the scope of this article. Typically, a larger batch size facilitates a higher learning rate and vice-versa for smaller batch sizes. Currently, the ability to tune learning rate is not available in the alwaysAI Model Training tool, but rest assured it is our priority to expose that functionality to our users soon.

5 — Model Name. Give your model a name using this field. If you don’t specify a name in the initial command, the tool generates a random phrase for your model. You can keep this name or specify a new one. The first part of the model name is your alwaysAI username followed by a ‘/’; the toolkit generates this part automatically. When the model is uploaded to the catalog, your alwaysAI username is prepended to the name you chose to create a unique ID. In the alwaysAI catalog, the publicly available models are prepended with “alwaysai”, and you can access any model via the alwaysAI Python API, edgeIQ, by using the model’s full name, e.g. “alwaysai/mobilenet_ssd”. In the image above, the username was “alwaysai” and the model name was “tutorial”.

6 — Plot loss every n steps. An epoch can be broken down into steps, a unit that is used to describe how many iterations it will take for one traversal of the dataset to complete. Above, we defined epoch and batch size, and these two values determine how many steps are completed during training and how often the training loss is calculated. The formula for this is as follows:

(total number of images) / (batch size) = (steps per epoch)

((total number of images) / (batch size)) * (total number of epochs) = (total number of steps)
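These two formulas can be written as a short Python sketch. Rounding up is an assumption made here (a final, partially filled batch is counted as one step); the tool may count steps slightly differently. The image count of 593 is the figure used earlier in this article.

```python
import math


def steps_per_epoch(total_images: int, batch_size: int) -> int:
    """Number of batches needed to see every image once."""
    # Assumption: a final, partially filled batch still counts as a step.
    return math.ceil(total_images / batch_size)


def total_steps(total_images: int, batch_size: int, epochs: int) -> int:
    """Steps per epoch multiplied by the number of epochs."""
    return steps_per_epoch(total_images, batch_size) * epochs


print(steps_per_epoch(593, 4))  # 149
print(total_steps(593, 4, 10))  # 1490
```

These values line up with the rounded figures of roughly 150 steps per epoch and 1500 steps in total for 10 epochs at a batch size of four.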

During training, the model may provide feedback in terms of either training loss or validation loss. Training loss is obtained as a byproduct of training after each step, whereas validation loss is calculated only after evaluation steps along with the model’s accuracy metrics. This widget controls only the frequency that the training loss is plotted. Some things to consider when choosing the frequency of plotting are:

Plotting loss at each step incurs a performance cost: the more data points plotted and the more frequently the plot updates, the slower plotting becomes over time. This is especially important for large amounts of data and long training sessions. In that case, consider plotting training loss infrequently, for instance once at the beginning of each epoch.
It is impractical and unnecessary to plot the training loss at each step. Training loss is noisy, and plotting it several thousand times will result in information that is difficult to interpret.

Depending on your preferences, you may want the training loss to be plotted more or less frequently. In this case, since we are training for approximately 1500 steps, let’s set this value to 50 and plot the training loss three times per epoch (recall that at a batch size of four, one epoch will be about 150 steps).
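To see what a given plotting interval means in practice, the sketch below lists the steps at which the training loss would be plotted. It uses the rounded figures from this article (about 150 steps per epoch, 1500 steps in total), and `plot_steps` is a hypothetical helper rather than part of the training tool.

```python
def plot_steps(total_steps: int, plot_every: int) -> list:
    """Steps at which training loss is plotted for a given interval."""
    if plot_every <= 0:
        raise ValueError("plot_every must be a positive integer")
    return list(range(plot_every, total_steps + 1, plot_every))


points = plot_steps(1500, 50)
print(len(points))                        # 30
print([s for s in points if s <= 150])    # [50, 100, 150]
print(len(plot_steps(1500, 1)))           # 1500
```

An interval of 50 yields 30 points over the whole run, compared with 1500 points when plotting every step.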

7 — Validate every n epochs. This parameter controls how often validation occurs, in terms of the number of epochs. It is good practice to validate after each epoch, so we will set this to 1. However, since evaluation is somewhat time-consuming, you may prefer to validate less frequently. Validation loss is automatically plotted after each evaluation step, and a more detailed breakdown of the model’s precision and recall can be found in the Docker container’s alwaysai/logs folder.

8 — Early Stopping Rounds. Early stopping rounds is a way of automatically stopping training after the model is sufficiently trained. By setting this value, if there is no notable decrease in loss after the specified number of rounds is completed then training is automatically cut short.
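The general idea behind early stopping can be sketched as follows. This is a generic patience-based check, not the training tool's actual implementation; the function name, the `min_delta` parameter, and the interpretation of a "round" as one recorded loss value are all assumptions for illustration.

```python
def should_stop_early(losses, patience: int, min_delta: float = 0.0) -> bool:
    """Return True if the last `patience` losses brought no improvement.

    `losses` holds one loss value per round, oldest first. Improvement
    means beating the best earlier loss by more than `min_delta`.
    """
    if patience <= 0 or len(losses) <= patience:
        return False  # Not enough history to decide yet.
    best_before = min(losses[:-patience])
    best_recent = min(losses[-patience:])
    return best_recent >= best_before - min_delta


# Loss plateaus over the last three rounds, so training would stop.
print(should_stop_early([1.0, 0.8, 0.7, 0.71, 0.72, 0.70], patience=3))  # True
# Loss is still falling, so training would continue.
print(should_stop_early([1.0, 0.8, 0.7, 0.6], patience=2))  # False
```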

9 — Begin Training. See the image below for a recap of these settings.

Now we are ready to begin training. All this takes is the click of a button and a bit of patience. In our case, after about 50 minutes, we were able to complete training at the number of steps we prescribed.

Training progress is plotted in real time in the notebook, and below is the plot for our training cycle. Note that the validation loss begins to plot only after the first validation occurs, which in this case was at step 150.

Interpreting the plot is fairly straightforward. We want to see both training loss and validation loss trending downward over time. The trend is what matters: some noise is to be expected, and if validation or training loss increases for an epoch or two, that is no cause for concern as long as both continue to trend downward. However, if validation loss consistently increases over time while training loss keeps falling, we have most likely encountered overfitting, in which the model learns the training images well but fails to generalize to images it has not trained on.
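One rough way to turn this reading of the plot into a check is to compare the recent trends of the two curves. The sketch below fits a least-squares slope to the last few loss values and flags the case where training loss is falling while validation loss is rising. It is a heuristic for illustration only, with made-up loss values and hypothetical function names, and it is not something the training tool does for you.

```python
def slope(values) -> float:
    """Least-squares slope of values plotted against their index."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to estimate a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def looks_overfit(train_losses, val_losses, window: int = 5) -> bool:
    """True if recent training loss trends down while validation trends up."""
    return slope(train_losses[-window:]) < 0 and slope(val_losses[-window:]) > 0


train = [1.0, 0.8, 0.6, 0.5, 0.4]
val_rising = [1.0, 0.9, 0.95, 1.0, 1.1]
val_falling = [1.1, 1.0, 0.9, 0.85, 0.8]
print(looks_overfit(train, val_rising))   # True
print(looks_overfit(train, val_falling))  # False
```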

In the case of our model, we can see that both training and validation loss are trending steadily downward, which lets us conclude that the model is learning properly. As we mentioned earlier, more detailed evaluation metrics can be found in the Docker container’s alwaysai/logs folder, although we plan to plot these metrics as well in the future.

10 — Shut Down Notebook. Once we’re finished with training, we are clear to shut down the notebook. The necessary model files are automatically saved to the alwaysAI cache on your local machine, where they can be used by the alwaysAI API, edgeIQ, used in projects, and published to the model catalog for your team.

Once your model is trained, you can continue training from where you left off in a new session, upload the model to our website, or test it out locally. Testing your model locally lets you see how it performs on your use case before making it available to your collaborators. Once you are satisfied with the model, you can publish it to your private model catalog so that your team members can try it out as well. However, let’s say that you tested it locally, realized that it isn’t performing quite as well as you hoped, and now want to train it for a few more epochs. The automatic save feature assigns this first iteration of the model a version number of 0.1, so we can continue from here by running:

alwaysai dataset train --name tutorial --continue-from-version 0.1 --jupyter

We had previously run training for 10 epochs. Let’s say that we would like to continue for an additional 5 epochs. We would repeat the process of configuring our training session just as we had earlier, but we would set the number of epochs to 5.

We can see from the chart above that training begins at step 0, as before. However, we can tell that the model has picked up from where we left off, because the initial loss is much lower than the one plotted during the initial training. We extend training by approximately 750 additional steps, and shutting down the notebook now saves version 0.2 of this model. You can continue training from this version just as you did from version 0.1.

To further improve the performance of your model for your particular use case, you can create your own dataset, which means collecting and annotating your own images. See our articles on data collection and data annotation for more information on these topics.

With many new features in the pipeline — image augmentation, additional models, additional hyper-parameter tuning, and more — we at alwaysAI strive to deliver a tool that is not only accessible and usable for all audiences, regardless of skill level, but one that doesn’t sacrifice the ability to fine-tune your models in order to make that possible. We understand that training the perfect model can take experimentation and time, and it is our goal to ensure that this iterative process is as streamlined and efficient as possible with our alwaysAI Model Training tool.
