The pipeline is implemented as a MediaPipe graph built around a holistic landmark subgraph. The main graph throttles the images flowing downstream as a flow-control measure.
The first incoming image passes through unaltered, and the subgraphs must finish their work before the next frame is allowed through. Any frames that arrive while the subgraphs are still processing are dropped, which prevents incoming images and data from queuing up and driving up latency and memory usage: a big win for any real-time mobile application.
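As a rough illustration of this drop-while-busy behavior (a conceptual sketch only, not MediaPipe's internal flow limiter), a capture loop can refuse to queue new frames while the previous one is still in flight:
import queue

# A single-slot queue: while the landmark subgraphs are busy with one frame,
# any frame that arrives in the meantime is dropped instead of queued.
pending = queue.Queue(maxsize=1)

def on_new_frame(frame):
    try:
        pending.put_nowait(frame)   # pass the frame downstream
    except queue.Full:
        pass                        # subgraphs still busy: drop this frame

def worker(process_frame):
    while True:
        frame = pending.get()       # blocks until a frame is available
        process_frame(frame)        # run the (slow) holistic subgraphs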
The holistic landmark subgraph uses three separate models internally:
1. The pose landmark module
The MediaPipe Pose Landmark model provides high-fidelity body pose tracking, using BlazePose to infer 33 2D landmarks on the whole body from RGB video frames. It tracks a single body pose, full-body by default, but it can be configured to cover the upper body only, in which case it predicts just the first 25 landmarks. It can run on either CPU or GPU depending on the module chosen.
2. The hand landmark module
MediaPipe Hands is a high-fidelity hand and finger tracking solution. It can infer up to 21 3D landmarks of a hand from just a single frame. It is a hybrid of two models: a palm detection model, which operates on the full image and returns an oriented hand bounding box, and a hand landmark model, which operates on the cropped image region defined by the palm detector and returns high-fidelity 3D hand keypoints. It detects landmarks of a single hand or multiple hands depending on the module type. It can likewise run on either CPU or GPU.
3. The face landmark module
The MediaPipe Face Mesh estimates 468 3D face landmarks in real time on mobile devices. It employs deep neural networks to infer the 3D surface geometry, requiring only a single camera input, without the need for a dedicated depth sensor. Utilizing lightweight model architectures with optional GPU acceleration throughout the pipeline, it can track landmarks on a single face or multiple faces. Additionally, it establishes a metric 3D space and uses the face landmark screen positions to estimate face geometry within that space.
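All three landmark sets surface together through MediaPipe's Python Holistic API. A minimal sketch, assuming the mediapipe and opencv-python packages are installed and using sample.jpg as a placeholder input image:
import cv2
import mediapipe as mp

mp_holistic = mp.solutions.holistic

# Read an image and convert BGR (OpenCV's default) to RGB, as MediaPipe expects.
image = cv2.cvtColor(cv2.imread("sample.jpg"), cv2.COLOR_BGR2RGB)

with mp_holistic.Holistic(static_image_mode=True) as holistic:
    results = holistic.process(image)

# Each model in the subgraph contributes its own landmark set (None if absent).
print(results.pose_landmarks)        # 33 body pose landmarks
print(results.left_hand_landmarks)   # 21 landmarks per hand
print(results.right_hand_landmarks)
print(results.face_landmarks)        # 468 face mesh landmarks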
Performance and APIs
The ML pipeline combines highly optimized models with significantly upgraded pre- and post-processing algorithms (for example, affine transformations). This helps reduce processing time on most current mobile devices and keeps the complexity of the pipeline in check.
Furthermore, moving the pre-processing computations to the GPU yields an average pipeline speedup across all devices. As a result, MediaPipe Holistic can run at near real-time speed, even on mid-tier devices and in the browser.
For each frame, the pipeline coordinates up to 8 models: the pose detector, the pose landmark model, 3 re-crop models, and 3 keypoint models for the hands and face. Since the models are mostly independent, they can be replaced with lighter or heavier versions (or even turned off completely) depending on performance and accuracy requirements.
Also, once the pose is inferred, the pipeline knows precisely whether hands and face are within the frame, allowing it to skip inference on those body parts if they are absent.
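In the Python API, this accuracy-versus-speed trade-off is exposed through constructor options. A hedged sketch of the solution-level options (names such as model_complexity and smooth_landmarks reflect the Python solution API; check your installed mediapipe version for the exact set):
import mediapipe as mp

# Pick lighter or heavier pose landmark models and tune thresholds.
# Option names may differ slightly across mediapipe versions.
holistic = mp.solutions.holistic.Holistic(
    model_complexity=1,            # 0 = lightest/fastest, 2 = heaviest/most accurate
    smooth_landmarks=True,         # temporal filtering across video frames
    min_detection_confidence=0.5,
    min_tracking_confidence=0.5,
)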
If you want to try it out for yourself, you have quite a few options. You can build and run on a desktop using the Python, JavaScript, or C++ APIs, configured to run on either CPU or GPU; remember to install the MediaPipe package first. You can also install it on iOS or Android, where there are again multiple options, such as building from the command line or using Android Studio. Let's take a look at how to build with Bazel on the command line.
1. To build an Android example app, build against the corresponding android_binary build target. The Android target for this build is:
mediapipe/examples/android/src/java/com/google/mediapipe/apps/holistictrackinggpu:holistictrackinggpu
So we need the following command:
bazel build -c opt --config=android_arm64 mediapipe/examples/android/src/java/com/google/mediapipe/apps/holistictrackinggpu:holistictrackinggpu
2. Install on the device using:
adb install bazel-bin/mediapipe/examples/android/src/java/com/google/mediapipe/apps/holistictrackinggpu/holistictrackinggpu.apk
Alternatively, you can install this pre-built APK for ARM64 devices. It's not the greatest and, sadly, doesn't seem to support the front-facing camera, but it works decently and is a good showcase of the holistic pipeline.
Use in Practice
According to Google, integrating pose tracking, hand tracking, and face detection into the same pipeline will enable new applications such as remote gesture interfaces, full-body augmented reality, sign language recognition, sports/activity analytics, and more.
Utilizing upwards of 540 keypoints per frame (33 pose + 2 × 21 hand + 468 face landmarks), the Holistic Tracking API offers better, simultaneous perception of body language, gestures, and facial expressions. To demonstrate the power, use cases, and performance of MediaPipe Holistic, Google engineers have built a simple remote control interface that runs locally in the browser and enables highly interactive visual correspondence with the user, without any extra hardware (e.g., touchscreens, keyboards, or mice).
A user can interact with objects on the screen, type on a virtual keyboard while sitting on the sofa, point to or touch specific face regions, or carry out specific actions such as muting or turning off the camera.
Behind the scenes, these interactions rely on accurate hand detection and subsequent gesture recognition mapped to a virtual "trackpad" space, enabling remote control from up to 4 meters away. This gesture control method can enable various use cases where other human-computer interaction modalities are not convenient.
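The demo's source isn't part of this pipeline, but the core trackpad idea is easy to sketch: take a fingertip landmark from the holistic results and remap its normalized coordinates into screen space. The choice of the index fingertip and the direct screen mapping below are illustrative assumptions, not Google's demo code:
import mediapipe as mp

mp_hands = mp.solutions.hands

def fingertip_to_cursor(results, screen_w, screen_h):
    # `results` is the object returned by Holistic.process(); landmark x/y are
    # normalized to [0, 1] relative to the input image, so scaling them by the
    # screen size gives a crude cursor position. A real interface would add a
    # calibrated trackpad region, smoothing, and gesture recognition on top.
    hand = results.right_hand_landmarks
    if hand is None:                       # hand not in frame: no cursor update
        return None
    tip = hand.landmark[mp_hands.HandLandmark.INDEX_FINGER_TIP]
    return int(tip.x * screen_w), int(tip.y * screen_h)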
Google hopes that the Holistic tracking solution will inspire community members to research, develop, and build new, unique applications. They are already using it as a stepping stone for future research into challenging domains such as sign-language recognition, touchless control interfaces, and a variety of other complex use cases.