Convolutional layers perform convolutions, which are operations where a filter is moved over an input image, calculating the values in a resulting feature map.
A convolutional layer is usually built up of multiple filters, each of which produces its own feature map. During training, the CNN learns the weights of these filters and, hence, which features to extract from the input images.
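A minimal sketch of this in PyTorch, assuming a batch of grayscale 28x28 images (the batch size and image dimensions are illustrative): a layer with 10 filters turns each single-channel image into 10 feature maps.

```python
import torch
import torch.nn as nn

# One convolutional layer: 1 input channel (grayscale),
# 10 filters, each 3x3 -> 10 feature maps per image.
conv = nn.Conv2d(in_channels=1, out_channels=10, kernel_size=3)

# A batch of 4 grayscale 28x28 images.
images = torch.randn(4, 1, 28, 28)

feature_maps = conv(images)

# With stride 1 and no padding, a 3x3 filter shrinks each
# spatial dimension by 2: 28 -> 26.
print(feature_maps.shape)  # torch.Size([4, 10, 26, 26])
```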
By increasing the number of convolutional layers in the CNN, the model will be able to detect more complex features in an image.
However, more layers mean longer training times and a higher likelihood of overfitting. For a fairly simple classification task, two convolutional layers are usually enough; the number of layers can then be increased if the resulting accuracy is too low.
The appropriate number of nodes is also highly dependent on the complexity of the images and the task at hand. By varying the number of nodes and evaluating the resulting accuracy, the model can be retrained multiple times until a satisfactory result is achieved.
After completing several computer vision projects, developers become better at estimating how many nodes will work for a given type of project and, hence, can reduce the number of iterations needed.
In PyTorch, convolutional layers are usually defined inside the
__init__ method of a CNN model class written by the developer. After importing
torch.nn as nn, two convolutional layers can be defined like this:

self.conv1 = nn.Conv2d(1, 10, 3)   # 1 input channel, 10 filters, 3x3 kernel
self.conv2 = nn.Conv2d(10, 32, 3)  # 10 input channels, 32 filters, 3x3 kernel
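To show how these two layers typically sit inside a complete model, here is a minimal sketch assuming grayscale 28x28 inputs and a hypothetical 10-class task; the pooling sizes, ReLU activations, and linear-layer dimensions are illustrative choices, not part of the original snippet.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # The two convolutional layers from above.
        self.conv1 = nn.Conv2d(1, 10, 3)
        self.conv2 = nn.Conv2d(10, 32, 3)
        # Fully connected head; 32 * 5 * 5 follows from the
        # conv/pool arithmetic for 28x28 inputs (see forward).
        self.fc = nn.Linear(32 * 5 * 5, 10)

    def forward(self, x):
        # 28x28 -> conv1 -> 26x26 -> max pool -> 13x13
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        # 13x13 -> conv2 -> 11x11 -> max pool -> 5x5
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = torch.flatten(x, 1)
        return self.fc(x)

model = SimpleCNN()
out = model(torch.randn(4, 1, 28, 28))
print(out.shape)  # torch.Size([4, 10])
```

Tracking the spatial dimensions through the network like this is what determines the input size of the first fully connected layer.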