If you’ve made it this far, congrats! (If you’re reading ahead, also congrats! Most people don’t.) We’re more than halfway through the diagram, having created a Raw Json folder and run the Json Creation script. The truth is that we’re almost done. By now you should know which variables to replace in your scripts, and the computationally hardest parts of the data creation process are behind us. All that’s left is to combine our thousands of .json files into a single “data.json”.
The steps for static and dynamic json formation are the same, but keep their folders separate if you’re doing both. First, create a folder for your formed json files. As usual, make note of its GID while also painstakingly making subfolders in that folder for each sign you want to have. If you are doing both static and dynamic, then you are working with two folders here (one for static signs and one for dynamic). For once, though, there’s an additional subfolder: create one named “ALL” (caps required) in your formed json folder. This is where “data.json” will be uploaded.
The Json formation process is done with the Download and Form MediaPipe Character Data script. Don’t let the name “Character Data” fool you; it’ll work with any sign.
As usual, replace the RAW_JSON_DATABASE variable’s value with the GID of your “Raw Json” folder and the FORMED_JSON_DATABASE variable’s value with the GID of your “Formed Json” folder. Additionally, find the tuple called CHARACTERS inside the FormJson function and replace its characters with the signs you want to train for.
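To make the edits concrete, here is a hypothetical sketch of what those lines might look like once filled in (the GIDs and sign names below are placeholders, and the exact layout of the script may differ):

```python
# Placeholders only -- substitute the GIDs of your own Drive folders.
RAW_JSON_DATABASE = "1aBcDeFgHiJkLmNoP"     # GID of your "Raw Json" folder
FORMED_JSON_DATABASE = "1qRsTuVwXyZaBcDeF"  # GID of your "Formed Json" folder

# Inside FormJson(): the signs you want to train for.
CHARACTERS = ("A", "B", "C")
```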
Once the script is run, you should find “training_data.json” in the ALL folder. It should be a large file containing the (x, y) hand coordinates from each picture you’ve uploaded. Download it. It’s suggested to upload it to an online .json analysis tool to check that you have all the data you should, using the tool’s tree view. Some json analysis tools are poorly optimized and may crash your browser with such a large file; be patient and be willing to try a few until you find a fast one.
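Alternatively, a few lines of Python can do a rough local sanity check. This sketch assumes the file maps each sign to a list of data points, which may not match the actual layout of your training_data.json; adjust it to whatever structure you see in the tree view:

```python
import json

# Rough sanity check: count the data points recorded for each sign.
# Assumes a {sign: [data points]} layout, which may differ from yours.
with open("training_data.json") as f:
    data = json.load(f)

for sign, points in data.items():
    print(f"{sign}: {len(points)} data points")
```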
This is where the pre-made scripts end, but I can still walk you through the general steps. I suggest making a Google Colab script to do this. We’ll need to make use of TensorFlow/Keras.
Z-Scores
Firstly, you have to know that the data you have is raw, while SigNN works with z-scored data. This applies to both static and dynamic signs; however, for dynamic signs, the z-score is done on a per-video basis. Make a function that takes in a list in which every odd element is an x coordinate and every even element is a y coordinate. For each data point (42 floats long), compute the z-scores of the x coordinates and the z-scores of the y coordinates separately. I do not mean the z-score of all the data points combined: each data point should have its z-scores calculated on its own.
As an example, suppose that instead of 42 floats (21 coordinate pairs), the list held only 6 (3 coordinate pairs). Then:
Full data: [1, 0, 0, -1, 2, 1]
X coordinates: [1, 0, 2] …… X coordinates z-scored: [0, -1.22, 1.22]
Y coordinates: [0, -1, 1] …… Y coordinates z-scored: [0, -1.22, 1.22]
Full data z-scored: [0, 0, -1.22, -1.22, 1.22, 1.22]
The goal here is to make each image’s distance from the camera a non-factor. By working in z-scores, the SigNN Project was able to significantly increase the accuracy of their neural network [1]. Either way, there’s not much of a choice: the z-score requirement is baked into SigNN, and it’s much easier to write a function that translates the raw data into z-scores than it is to remove the z-score requirement from the C++ code.
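A minimal sketch of such a function, using NumPy and the population standard deviation (ddof=0, which matches scipy.stats.zscore’s default):

```python
import numpy as np

def zscore_datapoint(coords):
    # Z-score one data point of interleaved (x, y) coordinates:
    # the x and y coordinates are normalized separately,
    # then re-interleaved in their original order.
    coords = np.asarray(coords, dtype=float)
    xs, ys = coords[0::2], coords[1::2]
    xs = (xs - xs.mean()) / xs.std()
    ys = (ys - ys.mean()) / ys.std()
    out = np.empty_like(coords)
    out[0::2], out[1::2] = xs, ys
    return out.tolist()

# The example from above:
# zscore_datapoint([1, 0, 0, -1, 2, 1])
# -> [0.0, 0.0, -1.2247..., -1.2247..., 1.2247..., 1.2247...]
```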
Regulation
All data for dynamic signs must be regulated. Neural networks require a fixed number of inputs; however, videos can run at different FPS and span different amounts of time. Therefore, the frames of each video are interpolated to a set number of frames. The regulation script is already written in Python and easy to add to your Google Colab script. You can see it here: https://github.com/AriAlavi/SigNN/blob/master/scripts/regulation_python.py
By default, SigNN regulates to 60 frames. This can be changed by modifying the 60 in the following file: “SigNN/mediapipe/calculators/signn/regulation_calculator.cc”. Additionally, by default, the video collection script records over 3 seconds, so 20 FPS is the effective target of the application. However, it is recommended to experiment with the 3 seconds; as little as 1 second may work.
No matter how many frames you choose to regulate to, all data must be regulated before it is used to train the neural network and before the z-scores for each video are calculated.
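For reference, the core idea is simple linear interpolation over time. This is not the official script linked above, just an illustrative sketch, assuming each frame is a 42-float data point:

```python
import numpy as np

def regulate(frames, target_frames=60):
    # Interpolate a video of n frames (each a 42-float data point)
    # to exactly target_frames frames.
    frames = np.asarray(frames, dtype=float)      # shape: (n_frames, 42)
    old_t = np.linspace(0.0, 1.0, num=len(frames))
    new_t = np.linspace(0.0, 1.0, num=target_frames)
    # Interpolate each of the 42 coordinates independently over time.
    return np.stack(
        [np.interp(new_t, old_t, frames[:, i]) for i in range(frames.shape[1])],
        axis=1,
    )
```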
Training The Neural Network
This is where other TensorFlow/Keras tutorials should come in. This neural network should not be the first you’ve ever made. Make sure to work through a few heavily guided tutorials with pre-defined datasets first. After training one or two other networks, come back here.
The data should be split with a large percentage for training and a small percentage for validation (an 80/20 split is a common starting point). If you are working with both dynamic and static networks, train them separately, as they will become separate .tflite files. Make sure that the input layer of the neural network is 42 wide (as there are 42 values [or 21 x, y coordinate pairs] in each data point). The output layer should equal the number of possible outputs. For a static ASL neural network, there are 24 outputs (A-Y, minus J). The SigNN project found that the following neural network was most effective for static ASL alphabet translation:
Relu(x900) -> Dropout(.15) -> Relu(x400) -> Dropout(.25) -> Tanh(x200) -> Dropout(.4) -> Softmax(x24) [1]
As for dynamic, SigNN did not manage to produce a very effective neural network, so it’s really up to you how you decide to go about it.
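In Keras, the static architecture above might look like the following. The optimizer and loss are my assumptions (the source only specifies the layers), and sparse_categorical_crossentropy presumes integer class labels:

```python
import tensorflow as tf
from tensorflow.keras import layers

# The static architecture quoted above: 42 inputs (21 z-scored
# coordinate pairs) in, 24 class probabilities out.
model = tf.keras.Sequential([
    layers.Dense(900, activation="relu", input_shape=(42,)),
    layers.Dropout(0.15),
    layers.Dense(400, activation="relu"),
    layers.Dropout(0.25),
    layers.Dense(200, activation="tanh"),
    layers.Dropout(0.4),
    layers.Dense(24, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```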
Once the neural network is trained with sufficient accuracy (hopefully over 80%), it is ready to be implemented. Download the neural network as a .tflite file (it’s best to ask Google how).
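For instance, one common way to export a trained Keras model with TensorFlow 2’s built-in converter looks like this (the file name matches the renaming step described below):

```python
import tensorflow as tf

# Convert the trained Keras model to TensorFlow Lite and save it.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

with open("signn_static.tflite", "wb") as f:
    f.write(tflite_model)
```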
Implementing The Neural Network
Now that you have a .tflite file (or two, if you also did dynamic), it’s time to implement it. Rename the static model to “signn_static.tflite” and the dynamic model to “signn_dynamic.tflite”. Put the .tflite file(s) in “SigNN/mediapipe/models”, overwriting the models that are already there.
Static Implementation
Go to the file at this directory: “SigNN/mediapipe/calculators/signn/tflite_tensors_to_character_calculator.cc”
Notice the array called DATA_MAP. Replace its strings with the output labels of your static neural network, in the same order as the network’s outputs.
Dynamic Implementation
Go to the file at this directory:
“SigNN/mediapipe/calculators/signn/dynamic_tflite_tensors_to_character_calculator.cc”
There is no DATA_MAP here. You will need to use C++ to modify much of the Process function, as it assumes you only have two outputs. The Process function runs every time the calculator receives data.
By now, the .tflite file has been implemented and the project is ready to be compiled. Compile it, run it, and see how it works. (Compile/run instructions are near the top of the SigNN README.) Perhaps you got lucky and the configuration works fine; it could also be that the accuracy, in reality, does not reflect the accuracy that TensorFlow/Keras promised when training the neural network. This could be caused by bad data, but before throwing it all out, it’s important to try some modifications.
General Modifications
Notice the file at this directory:
“SigNN/mediapipe/graphs/hand_tracking/subgraphs/signn_one_hand.pbtxt”
There are 8 variables that can be changed here:
- OneHandGateCalculator::memory_in_seconds: When counting the number of hands being displayed, the last (x) seconds are taken into account.
- OneHandGateCalculator::percent_of_one_hand_required: (x)% of the frames in the last memory_in_seconds seconds must contain exactly 1 hand or the program will display an error to the screen
- FPSGateCalculator::memory_in_seconds: The last (x) seconds are taken into account when deciding if the device is too slow to host SigNN
- FPSGateCalculator::minimum_fps: If FPS is lower than (x) within the last memory_in_seconds seconds, then the program will display that the FPS is too low
- LandmarkHistoryCalculator::memory_in_seconds: The last (x) seconds of data are fed into the dynamic neural network when requested
- StaticDynamicGateCalculator::dynamic_threshold: If the change in position is greater than (x) then the dynamic neural network is used, otherwise the static neural network is used
- StaticDynamicGateCalculator::maximum_extra_dynamic_frames: If the change in position of the hand drops below the dynamic threshold, the next (x) frames will render as dynamic anyway to prevent the letter from switching too quickly
- StaticDynamicGateCalculator::velocity_history: The last (x) seconds of velocity are used to determine whether the dynamic or static neural network should be used
Static Modifications
Notice the file at this directory:
“SigNN/mediapipe/graphs/hand_tracking/subgraphs/signn_static.pbtxt”
There are 3 variables that can be changed here:
- memory_in_seconds: The last (x) seconds are averaged and fed into the neural network, not what is immediately captured by the camera
- unknown_threshold: If the probability of a sign is less than (x), unknown will be displayed to the user
- last_character_bias: This probability is added to the probability of the last sign. For example, if the last sign was “H”, then (x) is added to the probability of “H” next frame. This prevents jumping between multiple predictions
Dynamic Modifications
Notice the file at this directory:
“SigNN/mediapipe/graphs/hand_tracking/subgraphs/signn_dynamic.pbtxt”
There are 2 variables that can be changed here:
- unknown_threshold: If the probability of a sign is less than (x), unknown will be displayed to the user
- memory_length: The last (x) seconds of dynamic neural network results are averaged to determine the sign to display
After Completion
If you manage to extend the functionality of SigNN, feel free to open a pull request. While there’s no guarantee that the pull request will be accepted, if the work has already been done, why not help contribute to the open-source community?
References
- Alavi, Arian, et al. One-Handed American Sign Language Translation, With Consideration For Movement Over Time — Our Process, Successes, and Pitfalls. 2020, github.com/AriAlavi/SigNN.