Need for Stretch — a DL-based Work Break Reminder

Images are blurred so you can’t tell how messy my desk is.

For some of us, one of the changes brought by working from home is that we tend to sit for longer in front of computers. Without the need to go to a meeting room on another floor or the chance to grab a coffee from the coffee shop downstairs, we attend different meetings on the same chair and skip work breaks simply because there is nowhere to go.

During my Christmas staycation, I realized that my activity level, as tracked by my Oura ring, had been declining through the second half of the year 🙁, which pushed me to think about what I could do to reverse the trend. (We all know that simply adding “be more active” to a New Year’s resolution won’t magically solve the problem.)

Yep, it’s my actual activity data and I don’t feel proud of it.

To remind myself to take breaks, get up and stretch, I built a deep-learning-based work break reminder called “Need For Stretch”. It is a proof of concept and certainly has lots of room for improvement, so as always, comments and suggestions are welcome.

GitHub link

Language and packages: Python, PyTorch, OpenCV, sqlite3

Video demo: To be added

Now we can add “be more active” to that 2021 New Year Resolution with greater confidence.

Happy New Year!

The program does three things:

Work break reminder: the program locks the PC if the user has been sitting for too long.
Posture correction: during work, if the user's posture is bad or non-ergonomic, the program notifies the user with an alarm sound.
Get some stretch: after enough off-screen time, the user can unlock the screen with self-defined stretch poses (e.g. squat, downward dog, etc.).

The illustration below shows the complete workflow of the program. For equipment, I use a 1080p webcam (a lower resolution is also OK) to periodically take pictures of myself working at the desk. The webcam is connected to a PC with an RTX 2080 Ti, which runs the whole program.

When I start working at the desk, the webcam takes a picture every T seconds (default T=2). Each image is passed to the OpenPose model, which predicts 18 key points of the human body. Two processes then take the 18 key points as input: a neural network that classifies whether I am sitting or standing, and a rule-based test that checks whether my posture is bad. If it is, say a forward head posture that causes neck and shoulder pain, an alarm goes off. The sitting/standing status and the coordinates of the 18 key points are stored in a database.
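The storage step at the end of this loop could look roughly like the sketch below. It is not the code from the repository: the table name `frames`, its columns, and the helpers `open_db`, `store_frame` and `latest_frame` are all assumptions made for illustration, and the key points are simply serialized as JSON.

```python
import json
import sqlite3
import time

# Hypothetical schema: one row per processed webcam frame.
SCHEMA = """
CREATE TABLE IF NOT EXISTS frames (
    ts        REAL NOT NULL,   -- Unix timestamp of the capture
    status    TEXT NOT NULL,   -- 'sitting' or 'standing'
    keypoints TEXT NOT NULL    -- JSON list of 18 [x, y] pairs
)
"""


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def store_frame(conn, status, keypoints, ts=None):
    """Insert one frame's classification together with its 18 key points."""
    if len(keypoints) != 18:
        raise ValueError("expected 18 key points")
    conn.execute(
        "INSERT INTO frames (ts, status, keypoints) VALUES (?, ?, ?)",
        (time.time() if ts is None else ts, status, json.dumps(keypoints)),
    )
    conn.commit()


def latest_frame(conn):
    """Return (ts, status, keypoints) of the newest row, or None if empty."""
    row = conn.execute(
        "SELECT ts, status, keypoints FROM frames ORDER BY ts DESC LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    ts, status, kp_json = row
    return ts, status, json.loads(kp_json)
```

Keeping one row per frame makes the later "how long have I been sitting" question a simple query over timestamps and statuses.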

18 keypoints for the COCO dataset

Every time the database is updated, another process reads it and counts how many minutes I have been sitting in the current cycle. For instance, if I have been sitting and working for 30 minutes without getting up, the program locks the screen.
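A minimal sketch of that counting logic, assuming the records come back from the database as `(timestamp, status)` tuples sorted oldest first. The function names, the `max_gap_s` tolerance for missing frames and the 30-minute default are illustrative choices, not taken from the project's code.

```python
def sitting_minutes(records, max_gap_s=10.0):
    """Minutes of uninterrupted sitting at the end of `records`.

    `records` is a list of (timestamp_seconds, status) tuples, oldest first.
    A gap longer than `max_gap_s` between two frames also ends the streak.
    """
    if not records or records[-1][1] != "sitting":
        return 0.0
    end_ts = records[-1][0]
    start_ts = end_ts
    prev_ts = end_ts
    # Walk backwards from the newest record until the streak is broken.
    for ts, status in reversed(records[:-1]):
        if status != "sitting" or prev_ts - ts > max_gap_s:
            break
        start_ts = ts
        prev_ts = ts
    return (end_ts - start_ts) / 60.0


def should_lock(records, limit_min=30.0):
    """True once the current sitting streak reaches `limit_min` minutes."""
    return sitting_minutes(records) >= limit_min
```

For example, with one frame every 2 seconds, `should_lock` turns true once 30 minutes of consecutive "sitting" rows have accumulated, and a single "standing" row resets the streak.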

When the screen is locked, the webcam keeps taking pictures. They are still passed to OpenPose to check whether I have really left the screen to take a break: if any of the 18 key points is detected, the program assumes I am still in front of the screen and restarts the count. Only after enough CONTINUOUS off-screen time has passed does the program enter the next stage.
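The "continuous off-screen time" rule can be sketched as a small timer that resets whenever any key point shows up. The class name `BreakTimer` and the 300-second default are hypothetical; the post does not say how long the break has to be.

```python
class BreakTimer:
    """Tracks continuous off-screen time while the screen is locked."""

    def __init__(self, required_s=300.0):
        self.required_s = required_s   # assumed break length, in seconds
        self.absent_since = None       # timestamp when the absence started

    def update(self, ts, num_keypoints):
        """Feed one frame; return True once the break has been long enough."""
        if num_keypoints > 0:
            # Someone is still in front of the camera: restart the count.
            self.absent_since = None
            return False
        if self.absent_since is None:
            self.absent_since = ts
        return ts - self.absent_since >= self.required_s
```

Each locked-screen frame would call `update` with the capture time and the number of key points OpenPose detected.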

In the next stage, I will need to do some self-defined stretches in front of the webcam for a certain amount of time to unlock the screen. Once the screen is unlocked, a new cycle starts.

(Of course, you can always unlock the screen manually. This is by design, so that the work break reminder does not get in the way when the user is really busy or in the middle of something important.)

Workflow

Why OpenPose?

Human pose estimation is an important field of computer vision research and there are several models I could have chosen for this program. So why OpenPose?

Before designing the program, I asked myself a list of questions:

Does it need to be real-time? (Yes)
Is it for single or multi person use? (single person use)
2D or 3D? (2D is enough with a fixed camera view)

OpenPose is the first real-time, open-source, 2D multi-person human pose estimation model. It takes a bottom-up approach which first predicts confidence maps for all the joints and part affinity fields (PAF) corresponding to the limbs, and then uses a greedy parsing method to correctly connect and assign body parts to each person.

Left: stacked heat maps of all the key points; Right: stacked PAF maps (each limb has 2 maps, one for the x components and the other for the y components of its vectors)
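To make the role of the PAF maps more concrete, here is a simplified sketch of how a candidate limb between two detected joints can be scored: sample points along the segment and average the dot product between the field and the segment's direction. This is an illustration of the idea, not OpenPose's actual code; `paf_score`, its arguments and the number of samples are assumptions.

```python
import numpy as np


def paf_score(paf_x, paf_y, joint_a, joint_b, num_samples=10):
    """Average alignment between a limb's PAF and the segment joint_a -> joint_b.

    paf_x, paf_y: 2D arrays (H, W) with the x and y components of one limb's field.
    joint_a, joint_b: (x, y) pixel coordinates of two candidate joints.
    """
    a = np.asarray(joint_a, dtype=float)
    b = np.asarray(joint_b, dtype=float)
    direction = b - a
    length = np.linalg.norm(direction)
    if length == 0:
        return 0.0
    unit = direction / length
    # Evenly spaced sample points along the candidate limb.
    xs = np.linspace(a[0], b[0], num_samples).round().astype(int)
    ys = np.linspace(a[1], b[1], num_samples).round().astype(int)
    # Dot product of the field with the limb direction at each sample.
    dots = paf_x[ys, xs] * unit[0] + paf_y[ys, xs] * unit[1]
    return float(dots.mean())
```

A high score means the field points along the candidate limb, so the two joints likely belong to the same person; the greedy parsing step then keeps the best-scoring connections.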

Machine learning or rule-based algorithms?

In the current version, I use a fully connected neural network (not good practice, as explained later) to classify sitting/standing positions, and rule-based algorithms to determine whether the user is performing stretches when trying to unlock the screen.

In retrospect, learning when to use which method is one of the best things I took away from this project.

For the sitting/standing classification problem, machine learning methods are more suitable than rule-based because:

Users have a variety of sitting/standing poses, so hard-coding rules for key body parts (such as legs, hips and arms) to tell them apart would be both inefficient and incomplete.

My diverse and totally not weird sitting postures

It is cheap to get enough training data: I wrote some code to take pictures of myself sitting/standing at the desk once every few minutes. I ran the code every day for a week and got ~500 pictures for sitting and another ~500 for standing, without much additional effort.

As a Data Scientist, I should say that throwing the 18 pairs of key point coordinates directly into a neural network, without EDA or feature engineering, is not good practice. It was Christmas and I was in a hurry to play Don’t Starve Together, so I forgave myself this time. But I did make sure that there was no over-fitting: the final accuracy was above 97% on both the training and test sets.
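For reference, a fully connected classifier over the raw coordinates could be as small as the sketch below. The layer sizes, the class name `PoseClassifier` and the `train_step` helper are assumptions; the post does not describe the actual architecture or training setup.

```python
import torch
from torch import nn


class PoseClassifier(nn.Module):
    """Fully connected net: 18 (x, y) key points -> sitting/standing logits."""

    def __init__(self, num_keypoints=18, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_keypoints * 2, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),
        )

    def forward(self, keypoints):
        # keypoints: (batch, 18, 2) -> flattened to (batch, 36)
        return self.net(keypoints.flatten(start_dim=1))


def train_step(model, optimizer, keypoints, labels):
    """One gradient step with cross-entropy loss; returns the loss value."""
    model.train()
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(keypoints), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Comparing the accuracy on a held-out test split with the training accuracy, as described above, is the usual way to check that such a model is not over-fitting.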

In contrast, determining whether a pose qualifies as a “stretch pose” is better handled by a rule-based system:

The 1st reason is straightforward: there is no such thing as “standard sitting” or “standard standing”; you can sit or stand however you want. Exercise poses, however, do have standards, and only poses that meet them should be accepted to unlock the screen.

Squat or not

The 2nd reason concerns extensibility: what if one day I prefer other types of stretch poses such as downward dog? If I use a machine learning approach, I would have to collect new data to retrain the model.

With the current design, if I want to switch to a different stretch pose, I can reuse the code mentioned above to take a dozen pictures of myself in the new pose, calculate the ranges of angles of major joints and use the ranges as a new standard.
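The idea of deriving a "standard" from a dozen reference pictures can be sketched as follows: compute the angle at each major joint from three key points, record the observed range per joint, and accept a new pose only if all of its angles fall inside those ranges. The function names, the joint names and the 5-degree margin are illustrative assumptions.

```python
import math


def joint_angle(a, b, c):
    """Angle at point b, in degrees, formed by the points a-b-c (each an (x, y) pair)."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    n1 = math.hypot(*v1)
    n2 = math.hypot(*v2)
    if n1 == 0 or n2 == 0:
        raise ValueError("degenerate joint: two points coincide")
    cos = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    # Clamp to [-1, 1] to guard against rounding errors.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))


def angle_ranges(samples, margin=5.0):
    """Per-joint (low, high) range from reference poses.

    `samples` is a list of dicts mapping a joint name to its angle in degrees.
    """
    ranges = {}
    for name in samples[0]:
        values = [s[name] for s in samples]
        ranges[name] = (min(values) - margin, max(values) + margin)
    return ranges


def matches_pose(angles, ranges):
    """True if every joint angle lies within its reference range."""
    return all(lo <= angles[name] <= hi for name, (lo, hi) in ranges.items())
```

Switching to a new stretch then only means recomputing `angle_ranges` from a few pictures of the new pose, with no retraining involved.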

A very abstract illustration of how to get the “standard squat” data

Unofficial PyTorch implementation runs at sub-optimal speed

For this demo, I used an unofficial PyTorch implementation and could only reach 8 to 9 fps (with GPU, image size 320*240), instead of the 22 fps of the official implementation. I did not use the official one because it is written in C++ and Caffe, which I don’t use. There are two potential solutions: (1) learn C++ (maybe another New Year’s resolution) and (2) profile the PyTorch implementation to increase its speed.
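Before profiling in detail, a rough fps number can be obtained with a small timing helper like the one below. `measure_fps` and its arguments are hypothetical names; `process_frame` stands for whatever function runs the model on one image. When timing GPU code, the callable should wait for the GPU to finish (for example by calling `torch.cuda.synchronize()`), otherwise the measured time can be too optimistic.

```python
import time


def measure_fps(process_frame, frames, warmup=2):
    """Average frames per second of `process_frame` over `frames`.

    The first `warmup` frames are processed but not timed.
    """
    for frame in frames[:warmup]:
        process_frame(frame)
    timed = frames[warmup:]
    start = time.perf_counter()
    for frame in timed:
        process_frame(frame)
    elapsed = time.perf_counter() - start
    return len(timed) / elapsed if elapsed > 0 else float("inf")
```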

The 18-point framework can be too sparse for posture correction

For instance, there are no key points on the back of the body, so we miss the details of the whole region between the shoulders and the hips. In the example below, my back was awfully curved, but the current 18 key points cannot capture this problem.

It is not viewpoint invariant

In the current setting, the camera has to be placed at a fixed viewpoint. When the viewpoint changes, the reference angles need to be recalculated. One future improvement that would make the program more flexible and mobile (for example, running on a Jetson Nano) is to switch to a 3D model and add a viewpoint-invariant transformation.
