Exploring some of the ways linear algebra proves to be an important component while making any data science model!
As Wikipedia defines it: linear algebra is the branch of mathematics concerning linear equations such as
a1x1 + a2x2 + … + anxn = b,
linear maps such as
(x1, x2, … , xn) ↦ a1x1 + a2x2 + … + anxn,
and their representations in vector spaces and through matrices.
In simple words, linear algebra is the study of vectors and linear functions. Vectors in linear algebra are closed under the operations of addition and scalar multiplication, and the subject includes the study of matrices, determinants, linear transformations, and vector spaces and subspaces.
If you are someone who is interested in data science, linear algebra is a key concept that you should know. You may ask why we need to study linear algebra when most tasks in Python can be easily performed using pre-existing libraries and packages. As you will soon see, even some of the most basic tasks in data science and machine learning need at least some prior knowledge of linear algebra.
While building any data science model, you may have to reduce the dimension of the data, or you may have to choose the right hyperparameters. This is where linear algebra comes in! Also, the math behind most machine learning and deep learning algorithms relies heavily on linear algebra, so it is definitely one of the building blocks of data science.
Let us look at some interesting (but definitely not exhaustive) applications of linear algebra in data science:
When you are building a machine learning model, you are most probably dealing with large data sets having multiple rows and columns. These are nothing but matrices. When you split your dataset into training and testing data, you are performing operations on these matrices.
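For instance, here is a minimal sketch (assuming NumPy and scikit-learn are available; the toy matrix X and labels y are made up purely for illustration) of how splitting a dataset is really just a row-wise operation on a matrix:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# A toy dataset: 10 samples (rows) and 3 features (columns) -- just a 10 x 3 matrix
X = np.arange(30).reshape(10, 3)
y = np.arange(10)  # one label per row

# Splitting into training and testing data partitions the rows of the matrix
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape)  # (8, 3) -- 80% of the rows
print(X_test.shape)   # (2, 3) -- 20% of the rows
```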
Matrices are the key data structures in linear algebra, which deals with the various operations performed on a matrix, such as row and column transformations, the transpose of a matrix, matrix addition, and scalar multiplication.
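A quick NumPy sketch of these basic operations on small, made-up matrices:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

print(A.T)    # transpose: rows become columns
print(A + B)  # element-wise matrix addition
print(3 * A)  # scalar multiplication
print(A @ B)  # matrix multiplication
```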
Take a look at the first few rows of a very common dataset, the iris dataset. This is nothing but a 5×4 matrix: each record is a row, indexed with the numbers 0, 1, …, 4, and each column (feature) has its name on top.
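You can reproduce this view yourself; a minimal sketch, assuming scikit-learn and pandas are installed (the iris data ships with scikit-learn):

```python
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

print(df.head())               # first 5 records, indexed 0..4
print(df.head().values.shape)  # (5, 4) -- a 5 x 4 matrix
```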
Linear Regression is mainly used for predicting continuous values. We deal with two kinds of variables: the input (or independent) variable, x, and the output (or dependent) variable, y. We model our data using a straight line that “best fits” the data. Linear Regression simply expresses the relationship between the dependent and independent variables as a linear equation. In multiple linear regression, we have multiple independent variables that influence our dependent variable.
Such equations of the type Y = MX + C can be easily solved using matrix multiplication by expressing each of the variables as a matrix: Y = {y1, y2, … , yn}, X = {x1, x2, … , xn}, M = {m1, m2, … , mn}, C = {c1, c2, … , cn}. For the step-by-step process, and for those who want to dive deep into the solution, you can check out this article.
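As a rough illustration of the idea (not the linked article's own derivation), the best-fit slope and intercept for some toy data can be found with a single least-squares routine in NumPy:

```python
import numpy as np

# Toy data roughly following y = 2x + 1 with a little noise
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Build the design matrix [x, 1] so both the slope and the intercept are estimated
X = np.column_stack([x, np.ones_like(x)])

# Solve the least-squares problem min ||Xw - y||^2 with plain linear algebra
(m, c), *_ = np.linalg.lstsq(X, y, rcond=None)
print(m, c)  # slope and intercept of the best-fit line
```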
Overfitting is one of the biggest hurdles in machine learning, especially for beginners in data science. It occurs when a model fits the available data too closely, to the point that it does not perform well on any new or unseen data. A concept called “Regularisation” is used to prevent the model from overfitting, and it makes use of linear algebra through the ‘norm’.
The ‘norm’ can be defined simply as the magnitude of a vector. This magnitude can be calculated in various ways; one popular way is the Euclidean distance, i.e., the distance from the origin.
The Wikipedia definition is:
In mathematics, a norm is a function from a real or complex vector space to the nonnegative real numbers that behaves in certain ways like the distance from the origin: it commutes with scaling, obeys a form of the triangle inequality, and is zero only at the origin.
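As a quick sketch of what the norm looks like in code, using NumPy on a made-up vector:

```python
import numpy as np

w = np.array([3.0, 4.0])

print(np.linalg.norm(w))         # Euclidean (L2) norm: sqrt(3^2 + 4^2) = 5.0
print(np.linalg.norm(w, ord=1))  # L1 norm: |3| + |4| = 7.0
```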
Regularisation prevents overfitting by adding the norm of the weight vector to the cost function. This makes sure that our model does not become overly complex: since our aim is always to reduce the cost function, we are also forced to keep this norm small. This is much better understood by someone who knows the basics of linear algebra and can apply the concept not just in Python programming, but also in theory.
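A minimal sketch of the idea, using an L2-regularised (ridge-style) cost function; the variable names and data here are purely illustrative:

```python
import numpy as np

def regularised_cost(X, y, w, lam=0.1):
    """Mean squared error plus an L2 penalty on the weight vector."""
    residuals = X @ w - y
    mse = np.mean(residuals ** 2)
    penalty = lam * np.sum(w ** 2)  # lam * ||w||^2 -- the norm term
    return mse + penalty

# Toy example
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]])
y = np.array([5.0, 4.0, 9.0])
w = np.array([1.0, 2.0])
print(regularised_cost(X, y, w))
```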
While making machine learning models, we often come across data made up of hundreds (or even thousands) of variables. Our model becomes more and more complicated as the number of variables increases.
Dimensionality Reduction is a technique that reduces the number of input variables in a data set. Since we know that datasets can be easily represented as matrices, certain matrix factorization methods can be used to decompose a matrix (and hence the dataset) into its constituent parts. Then, any operations that were to be performed on the original matrix can be performed on the smaller matrices.
Decomposition methods like LU Matrix decomposition and QR Matrix decomposition can be easily performed using Python programming.
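For example, a short sketch assuming NumPy and SciPy are installed:

```python
import numpy as np
from scipy.linalg import lu

A = np.array([[4.0, 3.0], [6.0, 3.0]])

# LU decomposition: A = P @ L @ U
P, L, U = lu(A)
print(np.allclose(A, P @ L @ U))  # True

# QR decomposition: A = Q @ R
Q, R = np.linalg.qr(A)
print(np.allclose(A, Q @ R))      # True
```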
Natural Language Processing (NLP) is the application of computational techniques to the analysis and synthesis of natural language and speech.
When we are dealing with textual data, we need a way to convert it into a numerical or statistical form so that it can be easily understood and interpreted by our model. One such way is Word Embedding.
Word Embedding
Simply described, word embedding represents words as vectors while preserving their context in the document. Neural networks are trained on large amounts of text to obtain such representations. The relationships and similarities between words can be analysed using this technique: for example, Man is related to Woman the way King is related to Queen.
Since linear algebra deals with the study of vectors, words represented as low-dimensional vectors can be easily visualised and compared with a basic understanding of linear algebra.
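As a toy sketch with hand-made 3-dimensional vectors (real embeddings would come from a trained model such as word2vec), the King/Queen analogy is just vector arithmetic:

```python
import numpy as np

# Hypothetical, hand-made embeddings purely for illustration
embeddings = {
    "king":  np.array([0.8, 0.1, 0.9]),
    "queen": np.array([0.8, 0.9, 0.9]),
    "man":   np.array([0.6, 0.1, 0.2]),
    "woman": np.array([0.6, 0.9, 0.2]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# king - man + woman should land close to queen
target = embeddings["king"] - embeddings["man"] + embeddings["woman"]
best = max(embeddings, key=lambda w: cosine(embeddings[w], target))
print(best)  # queen
```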
When implementing data science models, especially in deep learning, we come across data in the form of images. However, we cannot just pass an image to a model and expect it to understand it. We need to convert each image into something mathematical or statistical to be understood by the model. This is where linear algebra comes in.
Linear algebra deals with matrices and all the operations to be performed on a matrix. Any image is made up of pixels, which are nothing but coloured squares of varying intensities (for gray-scale images a pixel is a single intensity value, and for coloured images it could be the RGB value). If the dimensions of an image are 100 × 100 pixels, then the image can be represented by a 100 × 100 matrix, with each element holding the intensity of the corresponding pixel. Once done, this matrix can easily be interpreted by a computer and used in our model.
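A small sketch of this idea using NumPy, with a synthetic 100 × 100 gray-scale image (in practice you would load a real image, e.g. with Pillow):

```python
import numpy as np

# A synthetic 100 x 100 gray-scale "image": each entry is a pixel intensity (0-255)
image = np.random.randint(0, 256, size=(100, 100), dtype=np.uint8)

print(image.shape)  # (100, 100) -- the image is just a matrix
print(image[0, 0])  # intensity of the top-left pixel

# Flatten the matrix into a vector, a common step before feeding it to a model
flat = image.reshape(-1).astype(np.float32) / 255.0  # normalise to [0, 1]
print(flat.shape)   # (10000,)
```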