An architecture for a production-ready natural speech synthesizer

Create a model handler for Tacotron2

If you look at the Tacotron2 implementation in the NVIDIA GitHub repo, you’ll find among the files a model definition for Tacotron2 written entirely in PyTorch. NVIDIA also provides a PyTorch implementation of WaveGlow, the vocoder needed to turn the spectrograms into audible waveform audio. However, to use the two models together, we still need a class, or handler, to manage the intermediate steps: data preprocessing, inference, and post-processing. We’ll organize the work by separating the concerns of each part so that the code is easy to maintain.

TTS model handler

initialize(): Load Tacotron2 and WaveGlow with their respective checkpoints.
preprocess(text_seq): Transform raw text into suitable input for the model by converting it to a sequence over a specific set of characters.
inference(data): Run inference on the preprocessed input and return the synthesized audio matching the input text.
postprocess(inference_output): Save the WAV audio file to a directory in the container file system.
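
To make the four steps above concrete, here is a minimal sketch of how such a handler could be laid out. It is not the code from the repo: the symbol table, the 22050 Hz sample rate, and the two stand-in model functions (fake_acoustic_model, fake_vocoder) are assumptions made only so the example runs without PyTorch or checkpoints.

```python
import math
import struct
import wave
from pathlib import Path

# Hypothetical symbol table for illustration; the real text frontend defines its own.
SYMBOLS = "_ abcdefghijklmnopqrstuvwxyz'.,!?-"
SYMBOL_TO_ID = {ch: idx for idx, ch in enumerate(SYMBOLS)}


class TTSHandler:
    """Skeleton handler: text -> symbol IDs -> spectrogram -> waveform -> WAV file."""

    def __init__(self, output_dir, sample_rate=22050):
        self.output_dir = Path(output_dir)
        self.sample_rate = sample_rate  # assumed output rate in Hz
        self.acoustic_model = None
        self.vocoder = None

    def initialize(self, acoustic_model, vocoder):
        # A real handler would load the Tacotron2 and WaveGlow checkpoints here;
        # this sketch just accepts two ready-made callables.
        self.acoustic_model = acoustic_model
        self.vocoder = vocoder

    def preprocess(self, text_seq):
        # Lowercase, collapse whitespace, and drop characters outside the table.
        cleaned = " ".join(text_seq.lower().split())
        return [SYMBOL_TO_ID[ch] for ch in cleaned if ch in SYMBOL_TO_ID]

    def inference(self, data):
        if self.acoustic_model is None or self.vocoder is None:
            raise RuntimeError("call initialize() before inference()")
        spectrogram = self.acoustic_model(data)
        return self.vocoder(spectrogram)

    def postprocess(self, inference_output, filename="output.wav"):
        # Clip float samples to [-1, 1] and write them as 16-bit mono PCM.
        self.output_dir.mkdir(parents=True, exist_ok=True)
        path = self.output_dir / filename
        frames = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in inference_output
        )
        with wave.open(str(path), "wb") as wav_file:
            wav_file.setnchannels(1)
            wav_file.setsampwidth(2)
            wav_file.setframerate(self.sample_rate)
            wav_file.writeframes(frames)
        return path


def fake_acoustic_model(symbol_ids):
    # Stand-in for Tacotron2: one "frame" per input symbol.
    return [float(i) for i in symbol_ids]


def fake_vocoder(spectrogram, samples_per_frame=100, sample_rate=22050):
    # Stand-in for WaveGlow: a 440 Hz sine, 100 samples per frame.
    total = len(spectrogram) * samples_per_frame
    return [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(total)]
```

Calling initialize(fake_acoustic_model, fake_vocoder), then preprocess, inference, and postprocess in order writes a short WAV file into the output directory. Swapping the two stand-ins for the real models should leave the surrounding structure unchanged, which is the point of separating the concerns.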

The details of the code can be found in the GitHub repo.

Build the Django Rest API:

Set up your Django project: django-admin startproject tacotron_tts and django-admin startapp API

If you need a thorough tour of how to begin with a Django Rest project, feel free to check out my previous article.

Per the project requirements, we’ll rely on a third-party service to store and retrieve the generated speech data through our endpoints, so the Django ORM helpers and serializers will come in handy. As their documentation puts it, Django ORM is “a pythonical way to create SQL to query and manipulate your database and get results in a pythonic fashion”.

Create your ORM model for the TTS output
Create the corresponding serializer
Build your views (POST, DELETE) and your routing.

Django models and serializers

Generate the Dockerfile for the Django app:

To package the whole API as a Docker container, we need a base image that meets the project requirements. Because the version of Tacotron2 we’re using runs entirely on the GPU, we need an image already built with CUDA support. An image with CUDA 10.2 and PyTorch 1.5.0 is available on Docker Hub, and it matches our needs.

Disclaimer: Aside from the project-specific requirements, you’ll need a CUDA-capable GPU and the nvidia-docker toolkit installed.

Copy the local folders to the image file system, install the requirements inside a virtual environment, grant the required permissions, and you’re ready to go.
Create two new directories where the static and media files will be stored.
Once the image for the Django app is fully operational, we’ll configure the Dockerfile for the Nginx proxy. There is nothing special to add to that Dockerfile except the static and media folders that will be shared between the two containers.

Django Dockerfile

Configure your Nginx Proxy:

Build your microservice architecture.
Nginx is a lightweight web server and reverse proxy that is well suited to dockerized backend environments. We use it as a proxy server that routes requests and serves static and media files. Rather than having Django’s internal server handle those requests, a best practice for production environments is to use an independent proxy server for that job. As the name microservice implies, each service runs independently and focuses on a different part of the overall infrastructure.

Microservice schema

We’ll build our Nginx service by pulling the standard Nginx docker image from the hub: Nginx-Unprivileged.

Basic configuration for our needs:

Define an upstream service
Prepare your server

URLs starting with /: Forward to Gunicorn
URLs with /static/: Forward to our media and static folders, which happen to be inside our docker file system.

Orchestrate your Architecture with Docker Compose

As previously discussed, we need to structure our code so that the containers can communicate and work closely together to run the whole service. We do this by defining two services, one for the API and one for the proxy, with a shared volume (static_data) through which both components can access the media files. And that’s it: you can now deploy the service.

docker-compose file

Run your application

1. There is one more small step before actually running the API: the static URL paths. In your settings.py, add the following locations, which match the static volumes previously defined in the Django Dockerfile.

STATIC_ROOT = '/vol/web/static'
STATIC_URL = '/static/static/'
MEDIA_ROOT = '/vol/web/media'
MEDIA_URL = '/'

2. Download Postman and start testing your API locally:

service running on port localhost:8080/API/tts/

serialized output with the Text and Audio path
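
If you prefer a script over Postman, the same request can be sent with Python's standard library. This is only a sketch: the URL comes from the step above, but the JSON field name text and the helper names build_tts_request and synthesize are assumptions, so adjust them to match your serializer.

```python
import json
import urllib.request

TTS_URL = "http://localhost:8080/API/tts/"


def build_tts_request(text, url=TTS_URL):
    # Build a JSON POST request; "text" is an assumed field name.
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, url=TTS_URL, timeout=60.0):
    # Send the request and decode the serialized JSON response.
    # Requires the service to be up and running.
    request = build_tts_request(text, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the containers running, calling synthesize("Hello world") should return the serialized record, including the path of the generated audio file.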

3. Listen to your output on port 8080: Listen to the transcription ⤵️

Input: “Every man must decide whether he will walk in the light of creative altruism or in the darkness of destructive selfishness” — Martin Luther King. Yeah I know, it’s my fancy side 😉.

TTS output

Conclusion

This article gave you a quick overview of the whole project. I strongly recommend checking the GitHub repo for more in-depth insight.

As you can see, the field of natural speech synthesis is very promising, and it will keep improving until it reaches stunning results. Conversational AI is getting closer to the point where we can talk seamlessly with intelligent systems without noticing any substantial difference from human speech.

Here are some additional resources you may want to check.

If you have any questions about the code, don’t hesitate to e-mail me at aymanehachchaming@gmail.com