While the model size is somewhat controllable during design, the runtime libraries come with the distribution and are a part of the deployment that you need to pack into your application distributable. Hence, we have been reducing the size of those libraries across regular releases of the Intel® Distribution of OpenVINO™ toolkit.
One year ago, the OpenVINO™ toolkit’s Inference Engine was a single shared library that included all of the functionality, even though some of the building blocks may go unused in a particular scenario. Since then, the library has been split into multiple smaller libraries, each representing a dedicated building block that you can choose to include or omit:
- Inference Engine library provides the core runtime functionality.
- Inference Engine Transformations library contains optimization passes for the CNN graph.
- Inference Engine Legacy library contains the old network representation and the compatibility code needed to convert from the new nGraph-based representation.
- Inference Engine IR and ONNX reader libraries are plugins that are loaded at runtime by the Inference Engine Core library when the user passes IR or ONNX files.
- Inference Engine Preprocessing library is a plugin that is loaded at runtime by Inference Engine plugins when the user configures preprocessing (e.g., color conversion or a resize algorithm); otherwise, the library is not needed and can be skipped when creating the deployment package (see the sketch after this list).
- Inference Engine Low Precision Transformations library is linked directly into plugins that support the int8 data type. For example, the FPGA, MYRIAD, and GNA plugins do not support the int8 flow and do not have to be linked against this library.
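To make this concrete, here is a minimal sketch of the calls that pull in the runtime-loadable pieces, based on the 2020.x Inference Engine C++ API (the model file names are placeholders): reading an IR file causes the Core library to load the IR reader plugin, and configuring preprocessing on an input means the preprocessing library will be loaded as well.

```cpp
#include <inference_engine.hpp>

int main() {
    InferenceEngine::Core core;

    // Reading an IR model triggers runtime loading of the IR reader plugin
    // (passing an ONNX file would load the ONNX reader instead).
    auto network = core.ReadNetwork("model.xml", "model.bin");

    // Configuring resize and color conversion on an input means the
    // preprocessing library will be loaded by the device plugin.
    auto inputInfo = network.getInputsInfo().begin()->second;
    inputInfo->getPreProcess().setResizeAlgorithm(InferenceEngine::RESIZE_BILINEAR);
    inputInfo->getPreProcess().setColorFormat(InferenceEngine::ColorFormat::RGB);

    // Without the two preprocessing calls above, the preprocessing library
    // could be omitted from the deployment package.
    auto executableNetwork = core.LoadNetwork(network, "CPU");
    return 0;
}
```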
To execute inference optimally, the Inference Engine library implements basic threading routines as well as complex task schedulers on top of the TBB library. By default, the TBB binaries contain debug information, which adds extra overhead to the total Inference Engine runtime size, even though this information is not needed for production applications. Starting from the 2020.2 release, the Intel® Distribution of OpenVINO™ toolkit ships stripped TBB binaries, so the TBB library takes only 0.39 MB, a roughly 5x size reduction compared to the original 2.12 MB with debug symbols.
The nGraph library is a key component of the OpenVINO™ toolkit and is responsible for model representation. It allows you to create or modify networks at runtime. In the 2020.4 release, we separated the ONNX importer from the nGraph library, which further reduces its size.
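As an illustration of runtime model creation, here is a minimal sketch using the nGraph C++ API as shipped with the 2020.x releases (the opset version, headers, and shapes are assumptions):

```cpp
#include <memory>

#include <ngraph/ngraph.hpp>
#include <ngraph/opsets/opset3.hpp>

int main() {
    // Network input: a 1x3x224x224 FP32 tensor.
    auto param = std::make_shared<ngraph::opset3::Parameter>(
        ngraph::element::f32, ngraph::Shape{1, 3, 224, 224});

    // A single ReLU operation applied to the input.
    auto relu = std::make_shared<ngraph::opset3::Relu>(param);

    // Wrap the graph into an nGraph Function, the runtime model representation.
    auto function = std::make_shared<ngraph::Function>(
        ngraph::NodeVector{relu}, ngraph::ParameterVector{param}, "simple_relu");

    return 0;
}
```

Such a function could then be wrapped into an InferenceEngine::CNNNetwork and loaded onto a device without any intermediate model files on disk.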
The CPU plug-in (also known as the MKL-DNN plug-in) library is another key component of the OpenVINO™ toolkit and is responsible for model inference on Intel CPUs. It contains target-specific graph optimizations, layer implementations, threading, and memory allocation logic. For several compute-intensive routines, such as Convolution and FullyConnected, the CPU plug-in uses a fork of the oneDNN library as a third-party component, which is statically linked into the main CPU plug-in library. Another common operation in DL inference workloads is matrix multiplication, which was supported via another third-party component, the Intel® Math Kernel Library (Intel MKL). Intel MKL is rather large, since it addresses a wide range of mathematical problems. To save disk space, instead of shipping the full version of the MKL libraries, the OpenVINO™ toolkit redistributed a custom dynamic library (called “mkltiny”) with a reduced list of functions, built using official Intel MKL functionality. Despite this, the Intel MKL dependency still took up a significant portion of the distribution size (see Table 2).
In the 2020.2 release, the oneDNN fork was migrated to version 0.21.3 of the original repository. This version includes optimizations for the sgemm routine that allow us to achieve performance comparable to Intel MKL. In addition, several optimizations were implemented inside the plug-in, which allowed us to finally drop the Intel MKL dependency and rely fully on the plug-in and oneDNN fork capabilities. As a result, we achieved about a 1.8x binary size reduction for the libraries responsible for CPU inference (see Table 2), while keeping the same functionality and the same (or sometimes even better) performance for all workloads in our validation and testing. However, to make sure this does not degrade a specific user’s scenario, we provide the cmake option GEMM=MKL, which allows users to build the CPU plug-in from sources with the Intel MKL dependency.
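As a sketch of how that option might be used during an out-of-source build (the source and build paths are placeholders, and depending on the environment the build may also need to be pointed at an Intel MKL installation):

```sh
# Configure the build so the CPU plug-in uses Intel MKL for GEMM
# instead of relying only on the oneDNN fork; paths are placeholders.
cd openvino
mkdir build && cd build
cmake -DGEMM=MKL ..
cmake --build .
```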
To summarize, the table below outlines the minimal runtime sizes for several target devices (see Table 3):