Is it possible for a technology solution to replace fitness coaches? Well, someone still has to motivate you: “Come on, even my grandma can do better!” From a technology standpoint, though, this high-level requirement led us to 3D human pose estimation.
In this article, I will describe our own experience of how 3D human pose estimation can be developed and implemented for the AI fitness coach solution.
What is Human Pose Estimation?
Human pose estimation is a computer vision technology that detects and analyzes human posture. Its main component is the modeling of the human body. The three most widely used types of human body models are skeleton-based, contour-based, and volume-based.
A skeleton-based model consists of a set of joints (keypoints) such as ankles, knees, shoulders, elbows, and wrists, together with the limb orientations that form the skeletal structure of a human body. Because of its flexibility, this model is used in both 2D and 3D human pose estimation.
A contour-based model consists of the contour and rough width of the torso and limbs, where body parts are represented by the boundaries and rectangles of a person’s silhouette.
A volume-based model represents 3D body shapes and poses with geometric meshes and shapes, normally captured with 3D scans.
Here, I focus on skeleton-based models, whose keypoints can be estimated in either 2D or 3D.
2D pose estimation detects and analyzes the X, Y coordinates of human body joints from an RGB image.
3D pose estimation detects and analyzes the X, Y, Z coordinates of human body joints from an RGB image.
For fitness applications, 3D estimation is preferable, since the added depth coordinate lets the system analyze poses during physical activity more accurately.
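To make the 2D/3D distinction concrete, here is a minimal sketch of how poses are typically represented (a `(joints, 2)` or `(joints, 3)` array) together with a joint-angle helper of the kind fitness analysis relies on. The 17-joint COCO-style layout and the coordinates are illustrative assumptions, not part of any specific library:

```python
import numpy as np

# Hypothetical 17-joint skeleton (a COCO-style layout is an assumption here).
# A 2D pose is a (17, 2) array of (x, y); a 3D pose adds depth: (17, 3).
pose_2d = np.zeros((17, 2))
pose_3d = np.zeros((17, 3))

def joint_angle(a, b, c):
    """Angle at joint b (degrees) formed by points a-b-c; works in 2D or 3D."""
    v1, v2 = a - b, c - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# Example: a knee angle from hip, knee, ankle positions (made-up coordinates).
hip = np.array([0.0, 1.0, 0.0])
knee = np.array([0.0, 0.5, 0.1])
ankle = np.array([0.0, 0.0, 0.0])
angle = joint_angle(hip, knee, ankle)
```

Note that the Z coordinate in the example is exactly what a 2D pose cannot provide: with only X, Y, a knee bending toward the camera may look identical to a straight leg.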
For AI fitness coach apps, the common flow looks as follows:
- Capture the user’s movements while they perform an exercise
- Analyze whether the exercise was performed correctly
- Display the detected mistakes in the user interface
How 3D Human Pose Estimation Works
The image below shows how 3D human pose estimation detects keypoints on a human body:
The process usually involves extracting the joints of a human body and then analyzing the pose with deep learning algorithms. If the system uses video as a data source, keypoints (joint locations) are detected from a sequence of frames rather than a single picture. This improves accuracy, since the system analyzes the actual movement of a person rather than a static position.
There are several ways to build a 3D human pose estimation system for fitness. The most practical is to train a deep learning model to extract 3D or 2D keypoints from the given images/frames.
Using video streams from several cameras with different views of the same person would give better accuracy, but multiple cameras are often unavailable, and analyzing several video streams requires more computing power.
For our research, we used a single video source and applied convolutional neural networks (CNNs) with dilated temporal convolutions (see the video below).
We analyzed the existing models and found VideoPose3D to be the best fit for fitness app purposes. As input, it takes a set of detected 2D keypoints, produced by a 2D detector pre-trained on the COCO 2017 dataset. To predict a joint’s current position accurately, it processes visual data from several frames captured at different points in time.
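To illustrate the core operation, here is a simplified, untrained sketch of a dilated temporal convolution over a sequence of flattened 2D keypoints, the building block VideoPose3D stacks to widen its temporal receptive field. The kernel weights, shapes, and layer stacking are illustrative assumptions, not the model’s actual parameters:

```python
import numpy as np

def dilated_temporal_conv(seq, kernel, dilation):
    """1D dilated convolution along the time axis of a (frames, features)
    sequence. With dilation d, the kernel taps frames d apart, so stacked
    layers see exponentially more temporal context."""
    t, f = seq.shape
    span = (len(kernel) - 1) * dilation   # temporal extent covered by the kernel
    out = np.zeros((t - span, f))
    for i in range(t - span):
        taps = seq[i : i + span + 1 : dilation]   # frames `dilation` apart
        out[i] = np.tensordot(kernel, taps, axes=1)
    return out

# 9 frames of 2D keypoints flattened to 34 features (17 joints x 2 coords)
frames = np.random.rand(9, 34)
kernel = np.array([0.25, 0.5, 0.25])      # illustrative, untrained weights
layer1 = dilated_temporal_conv(frames, kernel, dilation=1)  # shape (7, 34)
layer2 = dilated_temporal_conv(layer1, kernel, dilation=3)  # shape (1, 34)
```

After two layers, the single remaining output row depends on all 9 input frames, which is the point of dilation: depth information about a joint is inferred from how it moves over time, not from one frame.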
How 3D Human Pose Estimation System Can Be Implemented in AI Fitness Coach App
After analyzing 3D human pose estimation systems and working with them in practice, we have arrived at our own vision of how they can be implemented. Let’s review how such a system can be built to automatically analyze movements in videos of users performing physical exercises.
Assuming the goal of the system is to inspect the input video for common exercise mistakes and compare it against a reference video in which a professional athlete performs the same exercise, the flow looks as follows:
1. Cutting the input video at the start and end of the exercise
To find the start and end points, we can automatically detect the positions of body control points and apply arbitrary thresholds. For example, during a squat, we can measure the angle of the arms and the height of the hands, and then use thresholds on these values to mark where the exercise starts and ends.
Another option is to ask the user to mark the start and end of the exercise manually.
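The automatic, threshold-based cut can be sketched as follows, using a synthetic hip-height trace; the threshold value and the trace itself are illustrative assumptions:

```python
import numpy as np

def find_exercise_bounds(hip_heights, standing_height, threshold=0.95):
    """Return the first and last frame where the hips drop below a fraction
    of the standing height -- a simple arbitrary-threshold cut. Returns
    None if no movement is found."""
    below = np.flatnonzero(hip_heights < threshold * standing_height)
    if below.size == 0:
        return None
    # Pad by one frame on each side so the standing poses framing the
    # movement are kept.
    start = max(below[0] - 1, 0)
    end = min(below[-1] + 1, len(hip_heights) - 1)
    return start, end

# Synthetic hip-height trace: standing, squatting down, up, standing again.
trace = np.array([1.0, 1.0, 0.9, 0.7, 0.6, 0.7, 0.9, 1.0, 1.0])
bounds = find_exercise_bounds(trace, standing_height=1.0)  # (1, 7)
```

A production system would apply the same idea to whichever control points suit the exercise (arm angle, hand height, and so on) and smooth the signal before thresholding.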
2. Detecting 2D and 3D keypoints on the user’s body
3. Decomposing the exercise phases
Once the keypoint (joint) positions are extracted, they should be compared with the positions in the reference video. However, a direct comparison is impossible because the performance speed and the total number of repetitions may differ between the input and reference videos.
These discrepancies can be resolved by decomposing the exercise into phases. The image below illustrates this: the squat is decomposed into two primary phases, squatting down and squatting up.
The decomposition can be done by analyzing the keypoints detected in the input video frame by frame and comparing them, according to certain criteria, with the keypoints from the reference video.
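As one simple criterion, the sign of the frame-to-frame hip-height change can label each transition as a “down” or “up” phase. This is a toy sketch under the assumption that hip height has already been extracted per frame; it is an illustrative choice, not the only possible criterion:

```python
import numpy as np

def decompose_phases(hip_heights):
    """Label each frame-to-frame transition of a squat as 'down' or 'up'
    from the sign of the hip-height change. Returns one label per
    transition (len(hip_heights) - 1 labels)."""
    deltas = np.diff(hip_heights)
    return np.where(deltas < 0, "down", "up")

# One squat repetition: hips descend, then rise back up.
trace = np.array([1.0, 0.8, 0.6, 0.5, 0.6, 0.8, 1.0])
phases = decompose_phases(trace)
```

With the phases labeled, each “down” and “up” segment of the input video can be matched to the corresponding segment of the reference video regardless of speed or repetition count.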
4. Searching for common mistakes
Once the 3D keypoints and the phases of an exercise are detected, it’s time to look for common technique mistakes in the input video. For example, in a squat we can detect moments when the legs are bent where they should be straight, or when the knees are closer to the center of the torso than the feet are.
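The “knees caving inward” check can be sketched as a comparison of knee and foot x-coordinates relative to the body’s centerline. The coordinate convention (centerline midway between the feet) and the tolerance are assumptions for illustration:

```python
def knees_cave_in(left_knee_x, right_knee_x, left_foot_x, right_foot_x, tol=0.0):
    """Flag the 'knees caving inward' squat mistake: True when both knees
    sit closer to the centerline than the corresponding feet. Coordinates
    are x-positions in a frame where the centerline is assumed to lie
    midway between the feet."""
    center = (left_foot_x + right_foot_x) / 2.0
    left_in = abs(left_knee_x - center) < abs(left_foot_x - center) - tol
    right_in = abs(right_knee_x - center) < abs(right_foot_x - center) - tol
    return left_in and right_in

# Feet at +/-0.3 from center; knees pulled in to +/-0.1: mistake detected.
caving = knees_cave_in(-0.1, 0.1, -0.3, 0.3)
# Knees tracking over the feet: no mistake.
ok = knees_cave_in(-0.3, 0.3, -0.3, 0.3)
```

In practice such a check would run per frame within the relevant phase, with a tolerance tuned so small natural asymmetries are not reported as mistakes.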
5. Comparing the input video frames with the reference ones
Here we take a reference video in which the exercise is performed correctly, split it into phases, and detect the keypoints in each frame. Once the keypoints are detected and the exercise phases defined in both the input and reference videos, we can compare each phase of the exercise as performed by the user and by the professional athlete.
The step-by-step flow looks as follows:
a. Slow down/accelerate the reference video in order to match the speed of the input one.
b. Align both skeleton models of the user and a professional athlete so that their rotation angle and origins match.
c. Normalize the size of both skeletons since reference and input videos can be captured from different distances.
d. Compare keypoints frame by frame and detect motion inconsistencies.
e. Repeat the flow separately for different groups of joints (e.g., feet, knees, hands and elbows).
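Steps b through d can be sketched in simplified form: centering and scaling each skeleton makes poses captured at different distances comparable, after which a per-joint distance flags inconsistencies. Rotation alignment (the angle matching in step b) is omitted here for brevity; a full version might use a Procrustes alignment:

```python
import numpy as np

def normalize_skeleton(pose):
    """Center a (joints, 3) pose on its centroid and scale it to unit
    norm, so skeletons captured from different distances and positions
    become comparable (a simplified version of steps b and c)."""
    centered = pose - pose.mean(axis=0)
    return centered / np.linalg.norm(centered)

def pose_distance(user_pose, reference_pose):
    """Mean per-joint distance between two normalized poses (step d)."""
    u = normalize_skeleton(user_pose)
    r = normalize_skeleton(reference_pose)
    return np.linalg.norm(u - r, axis=1).mean()

rng = np.random.default_rng(0)
reference = rng.random((17, 3))      # athlete's pose (synthetic data)
user = reference * 2.0 + 5.0         # same pose, farther from the camera
dist = pose_distance(user, reference)
```

Because normalization removes translation and scale, the distance above is effectively zero for the same pose captured at a different distance; step e would apply `pose_distance` to joint subsets (feet, knees, arms) instead of the whole skeleton.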
6. Display results and generate recommendations for a user
When the whole analysis cycle is complete, the user gets the results in different formats. For example, the output may include interactive 3D reconstructions with mistake hints, so that the user can zoom in and out, step backward and forward, or pause at a specific moment. It is also possible to collect and display movement statistics such as the number of repetitions, the average speed, and the duration of one repetition.
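A toy version of the repetition statistics, counting each contiguous below-threshold run of a hip-height trace as one repetition; the threshold, frame rate, and trace are illustrative assumptions:

```python
import numpy as np

def rep_statistics(hip_heights, fps, threshold):
    """Count repetitions and average repetition duration (seconds) from a
    hip-height trace: each contiguous run of frames below the threshold
    is treated as one repetition."""
    below = hip_heights < threshold
    # A rep starts wherever `below` switches from False to True.
    starts = np.flatnonzero(np.diff(below.astype(int)) == 1)
    n_reps = int(starts.size) + (1 if below[0] else 0)
    frames_below = int(below.sum())
    avg_duration = frames_below / fps / n_reps if n_reps else 0.0
    return n_reps, avg_duration

# Two squats at 2 frames per second (synthetic trace).
trace = np.array([1.0, 0.6, 0.6, 1.0, 0.6, 0.6, 0.6, 1.0])
reps, avg = rep_statistics(trace, fps=2, threshold=0.8)
```

The same bookkeeping extends naturally to per-phase speed and to aggregating mistakes per repetition for the on-screen hints.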
Visually, the video-based 3D human pose estimation system looks as follows:
In this article, I described how a 3D human pose estimation system works through the example of AI fitness coach app development, which illustrates the approach well. Note, however, that the flow may change depending on business requirements or other factors.