Audio Generation Using DeepMind's WaveNet

Published by: Amit Nikhade

July 7 . 2021

WaveNet by Google DeepMind, the time is now. Generating Audio with WaveNet 

Music is the language of the spirit. It opens the secret of life bringing peace, abolishing strife. -Kahlil Gibran


One of the most awaited and willful features was the interaction with machines in a natural form, similar to how human beings make conversations. Deep learning has been an efficacious field for Humanity in many ways. Specifically, the generative approach has been a turning point to the deep learning trend that unlocked several applications in Robotics, IoT, the Internet, telecommunication, and almost the software industry. The applications involved Natural language processing, Computer vision, speech to text and text to speech, Generating audio, etc. Today we will get acquainted with the same, one of the generative models for Audio generation.



WaveNet: A deep learning model, a kind of neural network that generates a raw audio waveform. It’s an autoregressive as well as a probabilistic model. It can be trained on a heavy amount of Audio samples. It offers state-of-art performance in the audio generation with a significant natural feel. It can be used to generate music. An excellent realistic musical result could be achieved if trained with extensive audio data. Actually, the secrete behind developing a human-like voice is directly modeling the waveform with a neural network. Wavenet model poses the flexibility to extract different characteristics from multiple agents with exactness, and it is also able to switch between these voices by applying conditioning for the voice identity, also known as conditional WaveNet variant.

Concatenative TTS (Text-to-speech) involves collecting data in the form of audio fragments from a single speaker and then concatenating them to produce a complete utterance. But the Concatenative TTS sounded unnatural and even was not able to modify the sound.

In Parametric TTS, the voice is created using a voice synthesizer known as a vocoder, an audio processing algorithm. The data is stored in the parameters of the model. The characteristics of the speech can be controlled via the model’s input.


Typical Convolutions

                                                 2D Convolutions

The convolutional neural network is a deep learning model that accepts an image as an input and further assigns essential learnable parameters to its various aspects, which makes differentiating an easy task. The convolution takes the following parameters:

  1. Kernel size: The size of the filter which hovers and moves over the image with a shadow-like appearance in the figure having 3×3 size is called a kernel filter. 3×3 is the commonly used size for 2D convolutions
  2. Channels: Channels are the dimensions holding values of the pixel; for a colored image, it can be represented by three channels RGB, i.e., red, green, and blue. Whereas in the case of the black and white images, we have just 2 channels, black and white
  3. Padding: As shown in the figure, the border is padded with a single box that can handle the input and output dimensions by keeping them balanced and equal. Whereas the unpadded convolution crops away from the border.
  4. Stride: The number of steps to be taken by the filter to cover the whole image, including the padding step by step, as shown in the figure

The above figure displays 2D convolutions using a kernel size of 3 padding=1 and stride=1.


Dilated Convolutions



                                             Dilated convolution

Dilated convolutions are a slight modification to the native convolutions by just adding a parameter of dilation rate. The dilation rate is nothing but the number of values to vanish between the kernel values, i.e., by how many steps the kernel should be spaced. It defines equal space between every value of the kernel.

In the figure above, the kernel has a size of 3×3, but still, it is widespread; this is due to the space created between them, where the dilation rate is 2, so every second value from the row and column gets vanished. A 3×3 kernel with a dilation rate of 2 will have the same field of view as a 5×5 kernel, while only 9 parameters in use. Dilated convolutions allow exponential expansion of the receptive field with the same computational power and without any loss.


Dilated Causal Convolutions





Dilated Causal Convolutions are one of the main components of the WaveNet model. The output of the model a particular timestep t does not depend upon the future timesteps. Causal convolutions can model sequences with long-term dependencies. They are faster as compared to the RNN’s because they don’t have recurrent connections. They follow the ordering of data in the way the data is modeled. The output is obtained sequentially and fed back to the model to get the next one. Causal convolutions require a wide receptive field, and this obstacle is overcome by the Dilation technique we saw before, which works efficiently and reduces the computational cost. Stacked dilated convolutions enable networks to have widespread receptive fields with just a few layers while preserving the input resolution.


Wavenet Architecture

Generating Audio Waveform for text to speech in a natural way can be done with WaveNet. The model is the combination of various strategies from computer vision, audio processing, and time series.


                                          WaveNet Architecture

The inputs are the audio samples passed from the causal convolution layer to the residual block. For each residual block, the output of the previous block will be fed into the next one. The Idea of Gated Activation units is implemented from the PixelCNN. The residual block produces two outputs one is the feature map which will be used back as input and the second one is the skip connection which is used to calculate the loss. The input is further passed through the dilated convolution layers, the gate, and the filter and meet each other to get multiplied element-wise. Where ⊙ denotes an element-wise multiplication operator,∗ denotes a convolution operator, σ is a sigmoid function, f and g indicate filter and gate. 



A combination of residual and parameterized skip connections is used throughout the network to speed up convergence and enable deeper models’ training. The input is further summed up with the original information and is reused by the next residual black to generate an output. The skip connections are summed up and activated through the rectified linear unit, and the final production of the Softmax is not a continuous value, but we apply the µ-law companding transformation and then quantized it into 256 possibilities. According to the paper, the reconstructed signal after quantization sounded very similar to the original.

Conditional Wavenet Comes with more modifications with the desired characteristics like feeding the voices of multiple entities; we’ll also provide their identity to the model.


Model Implementation

Python version 3.6 or higher required. Clone my WaveNet repository and run the files as instructed.

!git clone

To begin with, we’ll start with installing the requirements.

 pip install -r requirements.txt

Next, We’ll train the model with our custom data, In my case, I have used piano triads wavset dataset from Kaggle. I just trained it on 20 epochs to address you adequately. You’ll also have to define the data path.

!python3 ./WaveNet/src/ --dp piano-triads-wavset --epochs 20

Here start’s the training.

Now the time is to be a music director. Generate the music defining the save path and the model path.

!python ./WaveNet/src/ --mp src/trained_model/modelWN.h5 --fs ./

And You are done.

Try training on enough data with at least 1000 epoch, you’ll get to see outstanding results.

For complete code visit Github


Things to try

Try implementing the model for stock price prediction and much more, as we know it can also work with time-series data. Also, try producing a piece of good music by tuning the parameter manually. I hope you’ll.



Earlier TTS resulted in noisy sounds a bit robotic, but gradually and to the date, the advancements made to audio processing concluded the better voice quality. Wavenet is a much more advanced model that takes TTS to the next level, which sounds crystal clear, unreconstructed, and regular, making it a secret behind many TTS engines. WaveNet requires too much computational processing power to be used in real-world applications.


WaveNet was then used to generate Google Assistant voices for US English and Japanese across all  Google platforms. -deepmind

Hope you found the topic and explanation insightful, try to hit claps on medium and follow me, for my work if you wish. The claps and follows are my triggers of motivation.


About Me

originally published on

LinkedIn amitnikhade

Medium @amitnikhade

Also, visit the About me page for more info and collaboration.




© Copyright 2021 Amit Nikhade

Post Views: 225