Principal Component Analysis (PCA) is one of the most popular methods for improving the performance of Machine Learning algorithms that crunch large numbers of samples and features. However, PCA can seem too complicated, too technical, or simply too tedious for its basic principles to be properly understood, so I decided to write this article, which walks through every step in a practical way that is easy for beginners to digest.
First, let's understand why we use PCA in Machine Learning:
- Get rid of noisy data: Sometimes a dataset contains so many variables that it is hard to decide which ones to include or drop in order to improve the algorithm's performance. PCA filters out the noisy data and keeps only the prominent components.
- Improve performance: Removing a large number of uncorrelated features from the dataset dramatically reduces training time in Machine Learning.
- Simplify visualisation: Too many features force us to analyse too many graphs, which can be confusing both for us and for clients trying to digest the data. By dropping the low-impact features, we can simplify the visualisation and make it more pleasant to understand.
- Reduce overfitting: Overfitting often occurs simply because too many features or dimensions are involved. PCA clearly helps mitigate this issue.
In this article, we implement PCA on a Kaggle dataset which can be downloaded here. By using pandas' head function, we can inspect the features and the first few rows of data as follows:
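For reference, here is a minimal sketch of loading the data and importing the libraries used throughout this article; the filename is an assumption, so adjust it to wherever you saved the Kaggle file:

# Sketch: load the Kaggle HR dataset and import the libraries used below.
# The filename "HR_comma_sep.csv" is an assumption based on the usual name of this dataset.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

data = pd.read_csv('HR_comma_sep.csv')
print(data.head())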
As we can see, the original dataset consists of 10 features, and some of them are still categorical. As explained in my previous Titanic dataset article here, Machine Learning algorithms cannot digest categorical data directly, so we need to convert them into ordinal data. The same applies for PCA: we need to convert the sales and salary features into ordinal data as shown below:
data['salary'] = data['salary'].map( {'low': 1, 'medium': 2, 'high':3} ).astype(int)
data['sales'] = data['sales'].map( {'accounting': 1, 'hr': 2, 'IT':3, 'management':4, 'marketing':5, 'product_mng':6, 'RandD':7, 'sales':8, 'support':9, 'technical':10} ).astype(int)
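To double-check that the conversion worked, a quick look at the column types is enough (a small sketch):

# Sanity check (sketch): after the mapping, no object (string) columns should remain,
# and the two converted columns should now contain integers.
print(data.dtypes)
print(data[['sales', 'salary']].head())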
This gives us a new dataset without any categorical data, as shown below:
As we can see, the left feature is the label, so it is more convenient to place it in the very first column for the data-splitting step coming up shortly. First, we convert the column names into a list so that we can reshuffle them.
columns = data.columns.tolist()
Then we move the left feature to the very first position.
columns.insert(0, columns.pop(columns.index('left')))
Then we reindex the DataFrame using the reordered column list.
data = data.reindex(columns=columns)
Next, we can analyse the correlation between the features through a correlation matrix, which can be plotted with the seaborn library as shown below:
correlation = data.corr()
plt.figure(figsize=(10,10))
sns.heatmap(correlation, vmax=1, square=True, annot=True, cmap='viridis')
plt.title('Correlation between features')
Since the left feature is the label, after the reshuffling above we can conveniently split the data into training data X and label data y as follows:
X = data.iloc[:,1:10].values
y = data.iloc[:,0].values
X
The training data is shown below:
Now we have 9 features in the training data.
Now we take our first step into PCA, which is data standardisation, a necessary step before performing PCA. Basically, the purpose of data standardisation is to put all the data on a comparable scale by using the mean and standard deviation. For instance, Andy from class 1-A got 80 out of 100 in his Math exam with a standard deviation of 6, while Helen from class 1-B got 320 in her Math exam because her teacher used a scale out of 450 with a standard deviation of 68. To understand who has the higher score, we put the scores on a common scale, for instance by using percentages: Andy got 80%, while Helen achieved 71%. This way, we know Andy scored higher than Helen.
In Python, we can standardise the data by using sklearn's StandardScaler as shown below:
from sklearn.preprocessing import StandardScaler
X_std = StandardScaler().fit_transform(X)
The principle behind sklearn's StandardScaler is to subtract the mean from each value and divide by the standard deviation, z = (x − µ) / σ, where:
- z: Standardised data
- x: Original data
- µ: Mean value
- σ: Standard deviation value
To find the mean, we add up all the values and divide by the number of data points N: µ = (x₁ + x₂ + … + x_N) / N.
To find the standard deviation, we take the square root of the average squared deviation from the mean: σ = √(Σ(xᵢ − µ)² / N).
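To tie these formulas back to the code, here is a short sketch that reproduces StandardScaler's output by hand, assuming the X and X_std arrays defined above:

# Sketch: standardise manually by subtracting the per-feature mean and dividing by
# the per-feature standard deviation, then compare with StandardScaler's output.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.allclose(X_manual, X_std))  # expected: True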
Now our dataset has been standardised using the sklearn function, with the following output:
After the data has been standardised, we want to look again at how the features vary together, this time through the covariance matrix: C = ((X − x̄)ᵀ(X − x̄)) / (n − 1), where x̄ is the vector of feature means and n is the number of samples. We can translate this into Python as follows:
mean_vec = np.mean(X_std, axis=0)
cov_mat = (X_std - mean_vec).T.dot((X_std - mean_vec)) / (X_std.shape[0]-1)
print('Covariance matrix', cov_mat)
Or simply by using NumPy's covariance function:
print('NumPy covariance matrix:', np.cov(X_std.T))
Both snippets give an identical output:
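If you want to check this programmatically, a one-line comparison is enough (a small sketch):

# Sketch: confirm the manual computation and np.cov agree up to floating-point precision.
print(np.allclose(cov_mat, np.cov(X_std.T)))  # expected: True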
Now that we understand the relationships between the features through the covariance matrix, we can determine its Principal Components by calculating the Eigenvectors and Eigenvalues.
In the eigenvalue equation A·v = λ·v, the covariance matrix plays the role of A; hence, we can determine the Eigenvectors and Eigenvalues by using NumPy's function below:
eig_vals, eig_vecs = np.linalg.eig(cov_mat)
print('Eigenvectors', eig_vecs)
print('\nEigenvalues', eig_vals)
This shows us both the Eigenvectors and the Eigenvalues:
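As a quick sanity check, we can verify the defining relationship A·v = λ·v directly for the first eigenpair (a small sketch):

# Sketch: verify that the covariance matrix times an eigenvector equals
# the corresponding eigenvalue times that eigenvector.
v = eig_vecs[:, 0]
lam = eig_vals[0]
print(np.allclose(cov_mat.dot(v), lam * v))  # expected: True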
Now that we have found both the Eigenvalues and Eigenvectors, we need to sort the Eigenvalues in order to determine which Eigenvectors are the most relevant for the dataset. Firstly, we combine each Eigenvalue with its Eigenvector into a list of Eigenpairs:
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]
After that, we sort the Eigenpairs from the highest to the lowest Eigenvalue:
eig_pairs.sort(key=lambda x: x[0], reverse=True)
We can obtain the ordered Eigenvalues as shown below:
print('Sorted Eigenvalues:')
for i in eig_pairs:
    print(i[0])
Now we can obtain the Principal Components from the sorted Eigenvalues. The value associated with each Principal Component, called the explained variance, indicates how prominent that component is. Hence, with the 9 features we currently have, we will produce 9 Principal Components, each with its own explained variance expressed as a percentage.
The explained variance of each Principal Component is its Eigenvalue divided by the sum of all Eigenvalues, expressed as a percentage. In Python:
tot = sum(eig_vals)
var_exp = [(i / tot)*100 for i in sorted(eig_vals, reverse=True)]
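Before plotting, it can also help to look at the cumulative explained variance, which shows how much of the total variance the first k components capture together (a small sketch):

# Sketch: running total of the explained variance, useful for deciding
# how many principal components to keep.
cum_var_exp = np.cumsum(var_exp)
print(cum_var_exp)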
Then we can plot out the value of each Principal Component by using Matplotlib:
with plt.style.context('bmh'):
    plt.figure(figsize=(6, 4))
    plt.bar(range(9), var_exp, alpha=0.5, align='center')
    plt.ylabel('Explained Variance')
    plt.xlabel('Principal components')
    plt.tight_layout()
    plt.show()
From the graph above, we can see that the largest explained variance is around 20.5%. The last 2 Principal Components clearly have very little impact compared to the rest, as each explains less than 7.5% of the variance. Therefore, we can drop both of them.
Now that we have decided to drop the last two components, we can construct a projection matrix consisting of the first 7 Eigenvectors only:
PCA_matrix = np.hstack((eig_pairs[0][1].reshape(9,1),
eig_pairs[1][1].reshape(9,1),
eig_pairs[2][1].reshape(9,1),
eig_pairs[3][1].reshape(9,1),
eig_pairs[4][1].reshape(9,1),
eig_pairs[5][1].reshape(9,1),
eig_pairs[6][1].reshape(9,1),
))
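As a side note, the same projection matrix can be built more compactly with NumPy's column_stack (a sketch that produces an identical result):

# Sketch: stack the first 7 sorted eigenvectors as columns of a 9x7 matrix.
PCA_matrix = np.column_stack([eig_pairs[i][1] for i in range(7)])
print(PCA_matrix.shape)  # expected: (9, 7)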
Then we use the dot product to project the data onto the new feature space, Y = X·W:
Y = X_std.dot(PCA_matrix)
Then we can create a new dataset that comprises the data of each Principal Component together with the label, as follows:
principalDf = pd.DataFrame(data=Y, columns=['principal component 1', 'principal component 2',
                                            'principal component 3', 'principal component 4',
                                            'principal component 5', 'principal component 6',
                                            'principal component 7'])
finalDf = pd.concat([principalDf, pd.DataFrame(y, columns=['left'])], axis=1)
This gives us a new dataset that has been crunched through PCA:
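As a cross-check, the same result can be obtained with scikit-learn's built-in PCA class; note that individual components may come out with flipped signs compared to the manual eigenvector approach, which does not change the information they carry (a sketch, assuming the standardised X_std from above):

# Sketch: project onto 7 principal components with scikit-learn and compare.
from sklearn.decomposition import PCA

pca = PCA(n_components=7)
Y_sklearn = pca.fit_transform(X_std)
print(Y_sklearn.shape)                      # expected: (number of samples, 7)
print(pca.explained_variance_ratio_ * 100)  # should match var_exp computed above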
There are many options for how many components to drop, whether it is 2, 3, or even keeping only 2 components; what we have established is that the first component is the most prominent one. The resulting data can then be processed by a Machine Learning algorithm, such as Linear Regression or Random Forest, much faster than processing the whole dataset.
I hope this article helps you understand PCA better. Thank you for reading.