Learn about scalars, vectors, and the dot product.
Machines only understand numbers. For instance, if you want to create a spam detector, you first have to convert your text data into numbers (for example, through word embeddings). Data can then be stored in vectors, matrices, and tensors. Images, for example, are represented as matrices of values between 0 and 255, each encoding the luminosity of a color channel for one pixel. You can leverage the tools and concepts from the field of linear algebra to manipulate these vectors, matrices, and tensors.
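As a minimal sketch of this idea, here is a tiny grayscale "image" stored as a NumPy matrix (the pixel values are made up for illustration):

```python
import numpy as np

# A tiny 2x3 grayscale "image": each entry is one pixel's
# luminosity, between 0 (black) and 255 (white).
image = np.array([
    [0, 128, 255],
    [64, 192, 32],
])

print(image.shape)  # (2, 3): 2 rows of pixels, 3 columns
print(image.max())  # 255: the brightest pixel
```

A real color image would use one such matrix per color channel, but the principle is the same: the picture becomes numbers you can compute with.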
Linear algebra is the branch of mathematics that studies vector spaces. You’ll see how vectors constitute vector spaces and how linear algebra applies linear transformations to these spaces. You’ll also learn the powerful relationship between sets of linear equations and vector equations, related to important data science concepts like least squares approximation. You’ll finally learn important matrix decomposition methods: eigendecomposition and Singular Value Decomposition (SVD), important to understand unsupervised learning methods like Principal Component Analysis (PCA).
Linear algebra deals with vectors. Other mathematical entities in the field can be defined by their relationship to vectors: scalars, for example, are single numbers that scale a vector (stretching or contracting it) when the two are multiplied.
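For instance, multiplying a vector by the scalar 2 doubles its length without changing its direction. A quick sketch with NumPy (the values are arbitrary):

```python
import numpy as np

v = np.array([1.0, 2.0])

# Multiplying by a scalar stretches (|s| > 1) or contracts (|s| < 1)
# the vector: every component is scaled by the same number.
print(2 * v)    # [2. 4.]
print(0.5 * v)  # [0.5 1. ]
```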
However, vectors refer to various concepts depending on the field they are used in. In the context of data science, they are a way to store values from your data. For instance, take the height and weight of people: since these are distinct values with different meanings, you need to store them separately, for instance using two vectors. You can then do operations on the vectors to manipulate these features without losing the fact that the values correspond to different attributes.
You can also use vectors to store data samples, for instance, store the height of ten people as a vector containing ten values.
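As a sketch of this idea (the numbers are hypothetical), heights and weights of three people can be stored in two separate feature vectors:

```python
import numpy as np

# One vector per feature: the i-th entry of each vector
# corresponds to the same person (values are made up).
heights = np.array([1.72, 1.65, 1.80])  # in meters
weights = np.array([68.0, 59.5, 75.2])  # in kilograms

# Operations apply component-wise, so the features stay aligned:
bmi = weights / heights ** 2  # body mass index per person
```

Because both vectors are ordered the same way, component-wise operations like the division above combine the right values for each person.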
We’ll use lowercase, boldface letters to name vectors (such as v). As usual, refer to the Appendix of Essential Math for Data Science for a summary of the notations used in this book.
The word vector can refer to multiple concepts. Let’s learn more about geometric and coordinate vectors.
Coordinates are values describing a position. For instance, any position on earth can be specified by geographical coordinates (latitude, longitude, and elevation).
Geometric vectors, also called Euclidean vectors, are mathematical objects defined by their magnitude (their length) and their direction. These properties allow you to describe the displacement from one location to another.
For instance, Figure 1 shows that the point A has coordinates (1, 1) and the point B has coordinates (3, 2). The geometric vector v describes the displacement from A to B, but since vectors are defined by their magnitude and direction, you can also represent v as starting from the origin.
In Figure 1, we used a coordinate system called the Cartesian plane. The horizontal and vertical lines are the coordinate axes, usually labeled x and y, respectively. The intersection of the two axes is called the origin and corresponds to the coordinate 0 on each axis.
In a Cartesian plane, any position can be specified by the x and the y coordinates. The Cartesian coordinate system can be extended to more dimensions: the position of a point in an n-dimensional space is specified by n coordinates. The real coordinate n-dimensional space, containing n-tuples of real numbers, is named ℝⁿ. For instance, the space ℝ² is the two-dimensional space containing pairs of real numbers (the coordinates). In three dimensions (ℝ³), a point in space is represented by three real numbers.
Coordinate vectors are ordered lists of numbers corresponding to the vector coordinates. Since we take the initial point of a vector to be the origin, you only need to encode the coordinates of the terminal point.
For instance, let’s take the vector v represented in Figure 2. The corresponding coordinate vector is as follows:
Each value is associated with a direction: in this case, the first value corresponds to the x-axis direction and the second number to the y-axis.
As illustrated in Figure 3, these values are called components or entries of the vector.
In addition, as represented in Figure 4, you can simply plot the terminal point of the arrow: this gives a scatter plot.
Indexing refers to the process of getting a vector component (one of the values from the vector) using its position (its index).
Python uses zero-based indexing, meaning that the first index is zero. In mathematics, however, the convention is to use one-based indexing. I’ll denote the component i of the vector v with a subscript, as v_i, without boldface because a vector component is a scalar.
In NumPy, vectors are called one-dimensional arrays. You can use the function np.array() to create one:
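For example (the component values are arbitrary):

```python
import numpy as np

# Create a one-dimensional array (a vector) from a Python list.
v = np.array([1.5, -2.0, 3.1])

print(v.ndim)  # 1: a one-dimensional array
print(v[0])    # 1.5: zero-based indexing, so v[0] is the first component
```

Note how `v[0]` retrieves what a mathematician would call v₁: the code index is one less than the mathematical one.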
Let’s take the example of v, a three-dimensional vector defined as v = (3, 4, 2).
As shown in Figure 5, you can reach the endpoint of the vector by traveling 3 units on the x-axis, 4 on the y-axis, and 2 on the z-axis.
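The same vector as a NumPy array, with its magnitude computed via `np.linalg.norm()` (magnitude was introduced with geometric vectors above):

```python
import numpy as np

# The vector from Figure 5: 3 units along x, 4 along y, 2 along z.
v = np.array([3, 4, 2])

print(v.shape)            # (3,): three components, one per axis
print(np.linalg.norm(v))  # its magnitude: sqrt(3² + 4² + 2²) ≈ 5.39
```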
More generally, in an n-dimensional space, the position of a terminal point is described by n components.
You can denote the dimensionality of a vector using the set notation ℝⁿ. It expresses the real coordinate space: this is the n-dimensional space with real numbers as coordinate values.
For instance, vectors in ℝ³ have three components, like the vector v from the previous example.
In the context of data science, you can use coordinate vectors to represent your data.
You can represent data samples as vectors with each component corresponding to a feature. For instance, in a real estate dataset, you could have a vector corresponding to an apartment with its features as different components (like the number of rooms, the location, etc.).
Another way to do it is to create one vector per feature, each containing all observations.
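A minimal sketch of both layouts, using a hypothetical real estate dataset (the feature names and numbers are invented for illustration):

```python
import numpy as np

# Layout 1: one vector per data sample (apartment); each component
# is a feature (here: number of rooms, surface in m², floor).
apartment_a = np.array([3, 70, 2])
apartment_b = np.array([2, 45, 5])

# Layout 2: one vector per feature, each containing all observations.
n_rooms  = np.array([3, 2])
surfaces = np.array([70, 45])
floors   = np.array([2, 5])

# Both layouts hold the same data; one is the transpose of the other.
data = np.array([apartment_a, apartment_b])
assert np.array_equal(data.T, np.array([n_rooms, surfaces, floors]))
```

Which layout to use depends on the task: sample vectors are convenient for comparing observations, while feature vectors make per-feature statistics (means, variances) straightforward.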
Storing data in vectors allows you to leverage linear algebra tools. Note that, even if you can’t visualize vectors with a large number of components, you can still apply the same operations to them. This means that you can build intuition about linear algebra in two or three dimensions and then apply what you learn to higher-dimensional spaces.