
Today I read a paper about using Deep Learning for audio synthesis (from sound effects to musical instruments). The paper’s reference can be found at the bottom of this article.
As explained in my previous article (https://bit.ly/3qysWQt), content generation is now a big trend in AI, with applications ranging from creating fake videos to image resolution upscaling. This entire sub-field of Deep Learning relies on a network architecture often referred to as a GAN (Generative Adversarial Network), which was initially designed for image generation.
The authors of today’s paper explored 2 different approaches to audio synthesis, both derived from the original GAN architecture proposed by Goodfellow and his colleagues in 2014. Both techniques were designed to take into account the specificities of sound data, and both reveal promising results.
First, we need to understand how sound data differ from image data. Images are often seen as 2-dimensional objects: they have a width and a height. That’s why, when we are searching for a new wallpaper, for example, we first need to know the dimensions of our screen and then search for images matching them. It could be 1920 wide and 1080 high for a 1080p screen, or 2560 wide and 1440 high for a 1440p screen. We can also think in terms of screen inches (the diagonal length between two opposite corners of the screen) combined with a form factor or aspect ratio (like 16:9).
These different variables miss the most important measurement of an image: its resolution, that is, the number of pixels that make it up. Above, width and height were expressed in pixels: a 1080p 16:9 screen can display 1920 pixels per row and 1080 pixels per column, for a total of more than 2 million pixels. The more pixels, the more details an image can contain. But are all pixels equivalent? No. A given pixel can only take a finite number of values, which is determined by its bit depth. A bit depth of 8 means that each pixel can take one of 256 values and a bit depth of 10 means that each pixel can take one of 1,024 values.
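To make these numbers concrete, here is a tiny Python sketch of my own (the 1080p and bit-depth figures are the ones mentioned above):

```python
# Pixel count of a 1080p (1920x1080) screen and the number of values a
# single pixel can take for different bit depths.
width, height = 1920, 1080
print(width * height)                  # 2,073,600 pixels: "more than 2 million"

for bit_depth in (8, 10, 16):
    print(bit_depth, 2 ** bit_depth)   # 8 -> 256, 10 -> 1024, 16 -> 65536
```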
So how do we describe an image? We know that an image consists of pixels. We also know that each pixel can take a finite number of values (bit depth) and that an image is a 2-dimensional grid of pixels described by a width and a height. Knowing the dimensions of the image, we only need to specify the value of each pixel to be able to reconstruct it. For a 128×128 pixel image, we then need 16,384 values: one for each pixel. But we’re still missing something: how can an image be so detailed on most 8-bit 1080p monitors if each pixel can only take one of 256 values? It is possible because each pixel contains colors from 3 channels: Red, Green, and Blue, often referred to as RGB. With each channel able to take 256 values, we end up with each pixel being able to represent more than 16 million colors. All of this explains why, in Deep Learning, we represent a given image as a set of 3 sub-images, each being a 3-dimensional object with a width, a height, and a value representing the pixel values from one of the RGB channels.
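As a small illustration (my own sketch, not from the paper), this is what such an image looks like as an array in Python:

```python
import numpy as np

# A 128x128 RGB image: a grid of pixels with 3 channels per pixel,
# each channel stored on 8 bits (values 0-255).
image = np.zeros((128, 128, 3), dtype=np.uint8)

pixels_per_channel = image.shape[0] * image.shape[1]
print(pixels_per_channel)   # 16,384 values per channel
print(256 ** 3)             # 16,777,216 possible colors per pixel
```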
Now, we need to understand how to represent audio data. Audio consists of sound waves, which are compression waves: waves that travel through a material (like air) and compress it on their way. Once this compression reaches our ears, it makes our eardrum vibrate, and this mechanical effect is turned into an electric signal that our brain analyzes as a sound. There exist many different techniques to turn this compression-wave information into digital information, the main one being “Pulse-code modulation” (or PCM), which represents sound waves as numbers. It has to be said here that going from a real sound wave to a set of numbers, that is, encoding the audio information, usually means losing a little bit of information on the way. PCM represents any waveform as a 2-dimensional object consisting of (timestamp, wave amplitude) pairs which, once re-normalized by time (aka setting a precise time unit), can be represented as a 1-dimensional object, a bit like a music sheet.
Now that we’ve understood the dimensions of numerical sound objects, we need to understand their resolution. The resolution of an image depends on its bit depth and its number of pixels. On the other side, the resolution of a sound can be described by 2 values: its bit depth, which defines the amplitude range of each sample, and its sample rate, which corresponds to how many samples there are per second (and which can be seen as a temporal resolution). Typical values for audio are a bit depth of 16 bits (each sample taking one of 65,536 values) and a sample rate of 16kHz (16,000 samples per second).
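Here is a short Python sketch (my own example, not from the paper) that builds one second of 16kHz, 16-bit PCM audio:

```python
import numpy as np

sample_rate = 16_000                      # 16 kHz: 16,000 samples per second
t = np.arange(sample_rate) / sample_rate  # timestamps for one second of audio
wave = 0.5 * np.sin(2 * np.pi * 440 * t)  # a 440 Hz tone, amplitude in [-1, 1]

# Quantize to 16-bit PCM: each sample becomes one of 65,536 integer values.
pcm = np.round(wave * 32767).astype(np.int16)

print(pcm.shape)   # (16000,): a 1-dimensional object, one value per sample
print(pcm.dtype)   # int16: the 16-bit depth
```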
We can now see that one second of 16kHz audio represents 16,000 values, which is roughly equivalent to a 128×128 pixel image (16,384 values). We can also see that each sample of a 16-bit sound can take one of 65,536 values, against more than 16 million for an 8-bit image and its 3 channels. In terms of dimensions, sound can be represented as a 1-dimensional object (once normalized by time), while images can be represented as a set of three 3-dimensional objects (width, height, and a pixel value).
These differences led the authors to try 2 different approaches to audio synthesis with the GAN architecture.
The first approach consists of a GAN where the layers’ dimensions have been adapted to the 1-dimensional audio objects. In this approach, we essentially take an image-generation GAN and adapt its structure to work directly on waveforms. They named the related network structure WaveGAN.
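To give an idea of what such a 1-dimensional generator can look like, here is a minimal PyTorch sketch. It is not the authors’ implementation (which was released in TensorFlow), and the hyperparameters (latent size 100, kernel length 25, stride 4, 16,384 output samples) are assumptions in the spirit of the paper:

```python
import torch
import torch.nn as nn

class WaveGANStyleGenerator(nn.Module):
    """Sketch of a WaveGAN-style generator: 1-D transposed convolutions
    that upsample a latent vector into a ~1 second, 16 kHz waveform."""
    def __init__(self, latent_dim=100, model_size=64):
        super().__init__()
        self.model_size = model_size
        # Project the latent vector to a short, wide 1-D feature map (16 steps).
        self.fc = nn.Linear(latent_dim, 16 * model_size * 16)

        def up(in_ch, out_ch):
            # Each block multiplies the temporal length by 4
            # (kernel 25, stride 4, padding 11, output_padding 1).
            return nn.Sequential(
                nn.ConvTranspose1d(in_ch, out_ch, kernel_size=25,
                                   stride=4, padding=11, output_padding=1),
                nn.ReLU(),
            )

        self.net = nn.Sequential(
            up(16 * model_size, 8 * model_size),   # 16 -> 64 samples
            up(8 * model_size, 4 * model_size),    # 64 -> 256
            up(4 * model_size, 2 * model_size),    # 256 -> 1024
            up(2 * model_size, model_size),        # 1024 -> 4096
            nn.ConvTranspose1d(model_size, 1, kernel_size=25,
                               stride=4, padding=11, output_padding=1),
            nn.Tanh(),                             # waveform values in [-1, 1]
        )

    def forward(self, z):
        x = self.fc(z).view(-1, 16 * self.model_size, 16)
        return self.net(x)

z = torch.randn(1, 100)
audio = WaveGANStyleGenerator()(z)
print(audio.shape)   # (1, 1, 16384): about one second of 16 kHz audio
```

The key idea is simply that the convolutions slide along a single time axis instead of a 2-dimensional pixel grid.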
The second approach is more straightforward: we represent sounds as images (spectrograms) and hope that a usual “image generation” GAN will be able to learn from them. This technique, which doesn’t require such big architecture changes, has a big flaw: we first need to turn the sound into an image (and potentially lose information) and then turn the generated image back into a sound. They named the related network structure SpecGAN.
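Here is a small sketch of that “sound as image” step, using a short-time Fourier transform; the exact SpecGAN preprocessing and its inversion back to audio are only summarized here, not reproduced:

```python
import numpy as np
from scipy import signal

sample_rate = 16_000
t = np.arange(sample_rate) / sample_rate
waveform = 0.5 * np.sin(2 * np.pi * 440 * t)   # one second of a 440 Hz tone

# Short-time Fourier transform: the 1-D waveform becomes a 2-D
# time-frequency grid that can be treated like a grayscale image.
freqs, times, stft = signal.stft(waveform, fs=sample_rate,
                                 nperseg=256, noverlap=128)
spectrogram = np.log(np.abs(stft) + 1e-6)      # log magnitude, image-like

print(spectrogram.shape)   # (frequency bins, time frames)

# Going back to audio is the lossy part: the GAN only generates magnitudes,
# so the phase has to be estimated (the paper uses the Griffin-Lim algorithm).
```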
To compare their two structures, the authors used a dataset of 1-second samples of people saying the digits “zero” to “nine”. In each case, the network was trained to produce its own “human-like” spoken digits. Examples of each network’s results can be found here: https://chrisdonahue.com/wavegan_examples/.
After training their models, they asked different people to listen to sounds from 3 sources: the original dataset (aka real people), the outputs of WaveGAN (GAN on raw sound data), and the outputs of SpecGAN (GAN on sound data turned into images, that is, spectrograms). The participants were asked to guess which number was being said, and to rate how easy that task was, how diverse the outputs were, and what the overall quality of the sound was.
Real sounds (sounds from the original dataset of real humans saying numbers) got the highest scores, with an accuracy of 95% (participants recognized the number 95% of the time), against 58% for WaveGAN and 66% for SpecGAN. These results, while very far from the accuracy on the real dataset, were still way above the performance of a random model: there are 10 digits in the dataset, so if the outputs from WaveGAN and SpecGAN were unintelligible, we would expect the participants to reach only about 1/10 = 10% accuracy by guessing. The authors also noted that the higher accuracy of the two models was achieved by SpecGAN. When it comes to sound quality, ease, and diversity, WaveGAN got better results overall than SpecGAN. This means that while SpecGAN outputs were guessed correctly more often, participants preferred the sound generated by WaveGAN.
These results show that the two approaches they designed both lead to promising results in generating audio with GANs.
For a more detailed explanation of WaveGAN, I invite you to watch this great video from Henry AI Labs: https://www.youtube.com/watch?v=BA-Z0KJIyJs.
Source: Donahue, C., McAuley, J. and Puckette, M., 2018. Adversarial audio synthesis. arXiv preprint arXiv:1802.04208.