A Beginner-Friendly Guide to Audio Extraction and Classification with Librosa
In a previous article (linked below), I provided a comprehensive beginner's guide to Convolutional Neural Networks for image classification, distinguishing Malaria-infected cells from healthy ones. This time, I would like to walk through the procedure for extracting features from audio and classifying it, so that our model can tell a dog's bark from a cat's meow with a Deep Learning algorithm.
Since this article focuses on the mechanics of extracting and classifying audio files with Librosa, we will use a simple fully connected (Perceptron-style) Neural Network model. If you are interested in diving deeper into multi-layer Neural Network models, please read the article mentioned above.
Let’s get on with this project by downloading the Audio files for barking and meowing samples below:
These files contain 113 barking and 164 meowing audio samples in uncompressed WAV format, recorded with various background noises, so we will be able to assess how robust the final model is. First, let's import the basic necessary libraries such as Pandas and Numpy. This time we will also need the Glob library, which lets us gather audio files that are spread across several folders.
import pandas as pd
import numpy as np
import glob
Then we use the glob library to collect all the files, from the test and train folders down to the dogs and cats subfolders, into a single variable called X_path by calling glob.glob('path'):
Test_root = glob.glob('/content/drive/MyDrive/Colab Notebooks/audio-cats-and-dogs/cats_dogs/test')[0]
Train_root = glob.glob('/content/drive/MyDrive/Colab Notebooks/audio-cats-and-dogs/cats_dogs/train')[0]
X_path = glob.glob(Test_root + "/dogs/*")
X_path = X_path + glob.glob(Test_root + "/cats/*")
X_path = X_path + glob.glob(Train_root + "/dog/*")
X_path = X_path + glob.glob(Train_root + "/cat/*")
Next, let's label the cat and dog audio files. We use the ntpath library's ntpath.basename() function to inspect each file name inside a for loop, and stack the labels with Numpy's np.vstack().
import ntpath

y = np.empty((0, 1, ))
for f in X_path:
    # Label cats as 0 and dogs as 1 based on the file name
    if 'cat' in ntpath.basename(f):
        resp = np.array([0])
        resp = resp.reshape(1, 1, )
        y = np.vstack((y, resp))
    elif 'dog' in ntpath.basename(f):
        resp = np.array([1])
        resp = resp.reshape(1, 1, )
        y = np.vstack((y, resp))
    print(f)
Once every file has been labelled as a cat or a dog, let's split the data and labels into train and test sets with sklearn's train_test_split function, using 75% of the data for training and 25% for testing.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_path, y, test_size=0.25, random_state=42)
Let's create a function to read the audio data using the Librosa library. Librosa is a library designed specifically for analysing music and audio data; think of it as the audio counterpart of OpenCV for images. The function below wraps a single file path into a list if needed and then loads every file with librosa.load():
import librosa

def librosa_read_wav_files(wav_files):
    # Accept either a single path or a list of paths
    if not isinstance(wav_files, list):
        wav_files = [wav_files]
    return [librosa.load(f)[0] for f in wav_files]
The loaded waveforms are then assigned to the train and test lists; we also keep the sampling rate returned by librosa.load():
wav_rate = librosa.load(X_train[0])[1]
X_train = librosa_read_wav_files(X_train)
X_test = librosa_read_wav_files(X_test)
We can also visualise the audio data with Matplotlib. For instance, here we plot the first four samples of the training dataset:
import matplotlib.pyplot as plt

fig, axs = plt.subplots(2, 2, figsize=(16,7))
axs[0][0].plot(X_train[0])
axs[0][1].plot(X_train[1])
axs[1][0].plot(X_train[2])
axs[1][1].plot(X_train[3])
plt.show()
Then, we will get the four audio visualisations as follows:
The next essential step in audio data analysis is to extract features so that the computer can make sense of the sound; this step appears in almost every classification, prediction, or recommendation project on audio data. There are many techniques for extracting audio features, for instance the zero-crossing rate, the spectral centroid, the spectral rolloff, or MFCC (Mel-Frequency Cepstral Coefficients), each of which is a one-line call in librosa, as sketched below.
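As a quick illustration (this snippet is not part of the main pipeline and simply reuses the first file collected in X_path), each of these features can be computed directly with librosa:
# Illustrative only: compute each candidate feature on the first collected file
y_demo, sr_demo = librosa.load(X_path[0])

zcr = librosa.feature.zero_crossing_rate(y_demo)                    # frame-wise zero-crossing rate
centroid = librosa.feature.spectral_centroid(y=y_demo, sr=sr_demo)  # "centre of mass" of the spectrum
rolloff = librosa.feature.spectral_rolloff(y=y_demo, sr=sr_demo)    # frequency below which most of the energy lies
mfccs = librosa.feature.mfcc(y=y_demo, sr=sr_demo, n_mfcc=40)       # 40 mel-frequency cepstral coefficients

print(zcr.shape, centroid.shape, rolloff.shape, mfccs.shape)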
In this project, we are going to use zero-crossing rate and MFCC as they are the most commonly used techniques in audio classification, but first, we need to understand deeper about them.
The Zero-Crossing Rate (ZCR) of an audio frame is the rate of sign-changes of the signal during the frame. This means it is the number of times the signal changes value, from negative to positive and vice versa, divided by the length of the frame. The ZCR is defined according to the following equation:
$$\mathrm{zcr} = \frac{1}{T-1}\sum_{t=1}^{T-1}\mathbb{1}_{\mathbb{R}_{<0}}\!\left(s_t\, s_{t-1}\right)$$
Where $s$ is a signal of length $T$ and $\mathbb{1}_{\mathbb{R}_{<0}}$ is an indicator function.
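To make the definition concrete, here is a minimal sketch of the same computation in Numpy, applied to the first training waveform. The result will be close to, but not exactly equal to, librosa.feature.zero_crossing_rate(), because librosa computes the rate frame by frame:
def zero_crossing_rate_manual(s):
    # Count consecutive sample pairs whose product is negative,
    # i.e. the indicator 1_{R<0}(s_t * s_(t-1)), averaged over the signal
    T = len(s)
    sign_changes = np.sum(s[1:] * s[:-1] < 0)
    return sign_changes / (T - 1)

print("manual ZCR:", zero_crossing_rate_manual(X_train[0]))
print("librosa ZCR (frame-wise mean):", librosa.feature.zero_crossing_rate(X_train[0]).mean())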
MFCC is a crucial technique in audio feature extraction and is used in almost every audio analysis project. It mimics the mechanism of the cochlea, the part of the inner ear that resolves sound by frequency and is more sensitive to differences at low frequencies than at high ones. MFCC lets computers do something similar by moving between the time domain and the frequency domain and mapping the signal's frequency content onto the Mel scale with a bank of Mel filters.
In the computation of MFCC, the speech signal is first split into frames. Because the high-frequency formants have smaller amplitude than the low-frequency formants, the high frequencies are pre-emphasised so that all formants end up with a similar amplitude. After windowing, a Fast Fourier Transform (FFT) is applied to find the power spectrum of each frame. The power spectrum is then passed through a mel-scale filter bank, converted to the log domain, and finally a DCT is applied to obtain the MFCC coefficients. Visually, the process can be described by the picture below:
One common textbook formulation of the final step, the DCT that produces the MFCC coefficients, is as follows:
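$$c_n = \sum_{m=1}^{M}\log\!\left(S_m\right)\cos\!\left[\frac{\pi n}{M}\left(m-\frac{1}{2}\right)\right],\qquad n = 1,\ldots,N$$
where $S_m$ is the output of the $m$-th mel filter, $M$ is the number of filters in the filter bank and $N$ is the number of coefficients kept (40 in this project). This is the standard textbook formulation; librosa's implementation may differ in scaling and normalisation details.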
Another beginner-friendly article for MFCC can be found here.
Now let’s create our feature extractor function:
def extract_features(audio_samples, sample_rate):
    extracted_features = np.empty((0, 41, ))
    if not isinstance(audio_samples, list):
        audio_samples = [audio_samples]

    for sample in audio_samples:
        # Feature 1: mean zero-crossing rate of the sample
        zero_cross_feat = librosa.feature.zero_crossing_rate(sample).mean()
        # Features 2-41: 40 MFCCs averaged over time
        mfccs = librosa.feature.mfcc(y=sample, sr=sample_rate, n_mfcc=40)
        mfccsscaled = np.mean(mfccs.T, axis=0)
        mfccsscaled = np.append(mfccsscaled, zero_cross_feat)
        mfccsscaled = mfccsscaled.reshape(1, 41, )
        extracted_features = np.vstack((extracted_features, mfccsscaled))
    return extracted_features
So first we apply the zero-crossing rate extraction, which is provided by the librosa library as librosa.feature.zero_crossing_rate(), and take its mean over the whole sample. Then we use our second feature, MFCC, which librosa also provides as librosa.feature.mfcc(), passing n_mfcc=40 to keep 40 coefficients. Afterwards we average the coefficients over time with np.mean() so that every sample yields a fixed-length vector of 41 features. If you want to go one step further and bring the features onto a common scale, this can be done with sklearn's sklearn.preprocessing.scale() function.
Subsequently, we extract the features from the training and test datasets as follows:
X_train_features = extract_features(X_train, wav_rate)
X_test_features = extract_features(X_test, wav_rate)
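If you do want to standardise the extracted features, as mentioned above, a minimal sketch would be to fit a StandardScaler (the transformer counterpart of sklearn.preprocessing.scale) on the training features only and apply it to both sets. This is an optional step and is not used in the rest of this article:
from sklearn.preprocessing import StandardScaler

# Optional: standardise each of the 41 features to zero mean and unit variance,
# fitting the scaler on the training set only to avoid leaking test statistics
scaler = StandardScaler().fit(X_train_features)
X_train_features_scaled = scaler.transform(X_train_features)
X_test_features_scaled = scaler.transform(X_test_features)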
Since the heavy lifting has been done in the feature extraction phase, a small fully connected (Perceptron-style) model with only a couple of Dense layers is enough. But first, let's import the necessary Keras libraries to build it.
from keras import layers
from keras import models
from keras import optimizers
from keras import losses
from keras.callbacks import ModelCheckpoint,EarlyStopping
from keras.utils import to_categorical
However, before we start training, we need to convert the labels into one-hot encoded vectors.
train_labels = to_categorical(y_train)
test_labels = to_categorical(y_test)
Then we can build the model as follows:
model = models.Sequential()

model.add(layers.Dense(100, activation = 'relu', input_shape = (41, )))
model.add(layers.Dense(50, activation = 'relu'))
model.add(layers.Dense(2, activation = 'softmax'))

model.summary()
A detailed explanation of this Neural Network model can be found at the following link:
After we have built our Perceptron Neural Network model, there should be a summary as follows:
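The exact layer names in your summary may differ, but the parameter counts follow directly from the architecture: the first Dense layer has 41 × 100 + 100 = 4,200 parameters, the second 100 × 50 + 50 = 5,050, and the output layer 50 × 2 + 2 = 102, for a total of 9,352 trainable parameters.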
Now, let’s save our best model by using ModelCheckPoint from Keras.
best_model_weights = './base.model'

checkpoint = ModelCheckpoint(
    best_model_weights,
    monitor='val_accuracy',
    verbose=1,
    save_best_only=True,
    mode='max',
    save_weights_only=False,
    period=1
)

callbacks = [checkpoint]

model.compile(optimizer='adam',
              loss=losses.categorical_crossentropy,
              metrics=['accuracy'])
In the ModelCheckpoint function we first set the path where the model will be saved. Then we choose which quantity to monitor (validation accuracy or validation loss). verbose=1 prints a line for each epoch, while 0 stays silent. save_best_only=True keeps only the best model seen so far, and mode tells Keras whether the monitored quantity should be minimised or maximised (min for validation loss, max for validation accuracy). Finally, save_weights_only=True would save only the model's weights; with False, the whole model is saved.
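We also imported EarlyStopping earlier; it is not used in the training run below, but as an optional addition you could append it to the callbacks list to stop training once the validation accuracy stops improving. A minimal sketch (the patience value of 10 epochs is an arbitrary choice):
early_stop = EarlyStopping(
    monitor='val_accuracy',  # quantity to watch
    mode='max',              # higher accuracy is better
    patience=10,             # epochs with no improvement before stopping
    verbose=1
)

callbacks = [checkpoint, early_stop]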
Then we can train the model with Keras' model.fit function:
history = model.fit(
    X_train_features,
    train_labels,
    validation_data=(X_test_features, test_labels),
    epochs=200,
    verbose=1,
    callbacks=callbacks,
)
Let's visualise our training performance with Matplotlib:
print(history.history.keys())

acc = history.history['accuracy']
val_acc = history.history['val_accuracy']

epochs = range(1, len(acc)+1)

plt.plot(epochs, acc, 'b', label = "training accuracy")
plt.plot(epochs, val_acc, 'r', label = "validation accuracy")
plt.title('Training and validation accuracy')
plt.legend()
plt.show()
This gives us an acceptable performance from the Perceptron model:
Let's save the trained model and its weights:
model.save_weights('model_weights.h5')
model.save('model_keras.h5')
Since we have a pretty good model with above 85% accuracy, let's test whether it can tell a dog's bark from a cat's meow. First, let's give it the following sound, which is challenging because of the heavy background noise:
import IPython.display as ipd

nr_to_predict = 5
pred = model.predict(X_test_features[nr_to_predict].reshape(1, 41,))
print("Cat: {} Dog: {}".format(pred[0][0], pred[0][1]))

if (y_test[nr_to_predict] == 0):
    print("This is a cat meowing")
else:
    print("This is a dog barking")

plt.plot(X_test_features[nr_to_predict])
ipd.Audio(X_test[nr_to_predict], rate=wav_rate)
And the model can guess the sound to be:
As we can see the model is able to classify it as a dog’s bark.
However, let’s try out another sound, but this time, we are going to use this cat’s sound:
nr_to_predict = 69
pred = model.predict(X_test_features[nr_to_predict].reshape(1, 41,))
print("Cat: {} Dog: {}".format(pred[0][0], pred[0][1]))

if (y_test[nr_to_predict] == 0):
    print("This is a cat meowing")
else:
    print("This is a dog barking")

plt.plot(X_test_features[nr_to_predict])
ipd.Audio(X_test[nr_to_predict], rate=wav_rate)
Remember that the data was shuffled when we split it into train and test sets, so sound no. 5 and no. 69 might not be the same for you; feel free to try other values (there are 70 sounds in the test data).
And the model gives us a result of:
Despite the disruption from background vehicle noise, the model is still able to recognise it as a cat's meow, which gives us confidence that this model is quite robust.
Thanks for reading. I hope you have learned as much as I did in this beginner project for audio classification.