Wav2Vec2 is a speech model that accepts a float array corresponding to the raw waveform of the speech signal. The Wav2Vec2 model was trained using connectionist temporal classification (CTC), so the model output has to be decoded using Wav2Vec2Tokenizer (Ref: Hugging Face).
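To follow along, the snippets below sketch each step using the Hugging Face transformers library. They assume the facebook/wav2vec2-base-960h checkpoint, which is a commonly used pre-trained Wav2Vec2 model; substitute whichever checkpoint you are working with.

```python
import torch
from transformers import Wav2Vec2Tokenizer, Wav2Vec2ForCTC

# Load the pre-trained tokenizer and the CTC model
# (the checkpoint name is an assumption; any Wav2Vec2 CTC checkpoint works)
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
```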
In this example, I have used the audio clip of Liam Neeson's famous dialogue from the movie "Taken", which says "I will look for you, I will find you and I will kill you".
Please note that the Wav2Vec2 model is pre-trained on 16 kHz audio, so we need to make sure our raw audio file is also resampled to a 16 kHz sampling rate. I used an online audio conversion tool to resample the 'Taken' audio clip to 16 kHz.
Next, we load the audio file using the librosa library, specifying a sampling rate of 16,000 Hz. This converts the audio clip into an array, which is stored in the 'audio' variable.
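A minimal sketch of this step, assuming the resampled clip is saved locally as taken_clip.wav (the filename is a placeholder):

```python
import librosa

# librosa resamples to the requested rate and returns a float32 numpy array
audio, sampling_rate = librosa.load("taken_clip.wav", sr=16000)
```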
The next step is preparing the input values. We pass the audio array into the tokenizer, and since we want our tensors in PyTorch format rather than plain Python integers, we set return_tensors = "pt", which stands for PyTorch.
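Continuing from the snippets above, the tokenizer call might look like this:

```python
# Convert the raw waveform into the PyTorch tensor the model expects
input_values = tokenizer(audio, return_tensors="pt").input_values
```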
Next, we get the logit values (non-normalized scores) from the model.
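For example:

```python
# Forward pass: the model returns one logit vector per audio frame,
# with one entry per character in the vocabulary (non-normalized scores)
with torch.no_grad():
    logits = model(input_values).logits
```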
We then pass the logit values through a softmax and take the most likely token ID for each frame to get the predictions.
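A sketch of this step:

```python
# Softmax turns the logits into probabilities; argmax picks the most likely
# token ID for each audio frame (greedy CTC decoding)
probabilities = torch.softmax(logits, dim=-1)
predicted_ids = torch.argmax(probabilities, dim=-1)
```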
The final step is to pass the predictions to the tokenizer's decode method to get the transcription.
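For example:

```python
# decode() collapses repeated tokens, removes the CTC blank token,
# and maps the remaining IDs back to characters
transcription = tokenizer.decode(predicted_ids[0])
print(transcription)
```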
The transcription matches our audio clip exactly.
In this blog, we have seen how to convert speech to text using a pre-trained Wav2Vec2 model with the transformers library. This can be very helpful for NLP projects, especially those handling audio transcript data. If you have anything to add, please feel free to leave a comment!
You can find the entire code and data in my GitHub repo.
Thanks for reading. Keep learning and stay tuned for more!