It’s valuable to speak the language of linear algebra: it lets you communicate better with linear algebra experts and understand more advanced concepts. Below is a list of terms I extracted from Prof. Gilbert Strang’s MIT OpenCourseWare lectures and the Deep Learning book by Dr. Ian Goodfellow et al. I will provide a brief explanation of each concept and, where helpful, introduce the related NumPy function so you can play along while reading; whenever you want to know more, I refer you to Wikipedia.

Vector space, subspace, span, column space, row space, null space, left-null space, rank, basis, orthogonal matrix, symmetric matrix

Imagine we have a matrix A. In Python, `A.T` means the transpose of A, and `@` means matrix multiplication.

```python
import numpy as np

a = np.random.randn(2,3)
a
# array([[ 0.39,  0.54, -0.71],
#        [-1.84, -0.6 ,  0.53]])
a.T
# array([[ 0.39, -1.84],
#        [ 0.54, -0.6 ],
#        [-0.71,  0.53]])
b = np.random.randn(3,2)
b
# array([[ 1., -0.],
#        [ 1., -0.],
#        [-3.,  1.]])
a @ b
# array([[-1.09762048,  0.96911979],
#        [-1.97859822, -3.15604207]])
```

The **span** of a set of vectors is the set of all linear combinations of those vectors. Think about the vectors (0,1) and (1,0): their span is the whole x-y plane.

Then think about the x-y-z 3D space (the **vector space** R3). Its **basis** is (1,0,0), (0,1,0), and (0,0,1), since every element in the vector space can be written as a linear combination of the basis elements.

The span of the vectors (1,0,0) and (0,1,0) constitutes a **subspace** of the x-y-z 3D vector space R3 (a subset of a larger vector space).

The **column space** of A is the span of all column vectors of A, while the **row space** of A is the span of all row vectors of A. If A @ x = 0, the span of all solutions x constitutes the **null space** of A. If A.T @ x = 0, the span of all solutions x constitutes the **left-null space** of A.

How many linearly independent column vectors are there in matrix A? That number is the **rank** of A. Note that (1,2,3) and (10,20,30) are linearly dependent! How do we compute the rank of a matrix?

```python
x = np.random.rand(2,3)
np.linalg.matrix_rank(x)
# 2
# So, x has two linearly independent column vectors
```

If two vectors’ inner product is 0 and they are both of unit length, they are **orthonormal vectors**. If matrix A’s rows and columns are orthonormal vectors, matrix A is an **orthogonal matrix**. In math, that means A.T @ A = A @ A.T = I.

A **symmetric matrix** means A = A.T. Also, symmetric and orthogonal matrices only apply to real-valued matrices; for complex matrices we have the counterparts, namely the **hermitian matrix** and the **unitary matrix**. We will cover them in a later part.

(Make sure you understand all the concepts above; if not, google them on Wikipedia.)
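As a quick check of the orthogonal matrix property, here is a small sketch (the rotation matrix is my own example, not from the text — a 2D rotation is a classic orthogonal matrix):

```python
import numpy as np

# a 2D rotation matrix is orthogonal: its rows and columns are orthonormal
theta = 0.7
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

np.allclose(A.T @ A, np.eye(2))  # True
np.allclose(A @ A.T, np.eye(2))  # True
```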

Inverse, determinant

You can compute the **determinant** of any square matrix.

```python
x = np.random.rand(4,4)
np.linalg.det(x)
# 0.08437697073565216
```

Only an invertible square matrix (determinant != 0) has an inverse. If A @ B = I, then B is A’s **inverse matrix**.
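To see the definition in action, here is a small sketch (the matrix values are my own example, not from the text):

```python
import numpy as np

A = np.array([[2., 1.],
              [1., 3.]])
B = np.linalg.inv(A)   # works because det(A) = 5 != 0

np.allclose(A @ B, np.eye(2))  # True: B is A's inverse
```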

Gaussian Elimination, row echelon form, reduced echelon form, leading coefficient, pivot, elementary row operation

**Gaussian elimination** can be used to (a) solve a linear system Ax = b, (b) compute the inverse matrix, (c) find the rank, and (d) compute the determinant (for details, please refer to Wikipedia), via a series of **elementary row operations**, including (1) swapping rows, (2) scaling rows, and (3) adding one row to another.

**Row echelon form**: the first non-zero element from the left in each row (the **leading coefficient**, or **pivot**) is always to the right of the first non-zero element in the row above.

**Reduced row echelon form**: a row echelon form whose pivots are 1 and whose pivot columns are zero elsewhere.

Inner product, outer product, hadamard product, Projection, Gram-Schmidt process

First, let’s understand the inner product, outer product, and Hadamard product of two 1D arrays:

```python
a = np.array([4,5,6])
b = np.array([7,8,9])
np.inner(a,b)   # 122
np.outer(a,b)
# array([[28, 32, 36],
#        [35, 40, 45],
#        [42, 48, 54]])
a * b           # element-wise or hadamard product
# array([28, 40, 54])
```

How do we project one vector (a) onto another vector (b)?

```python
a = np.array([4,5,6])
b = np.array([7,8,9])
proj_b_a = np.inner(a,b) / np.inner(b,b) * b
# array([4.40206186, 5.03092784, 5.65979381])
```

The **Gram-Schmidt process** orthonormalizes a set of vectors (v1, v2, v3, …, vn) spanning the vector space Rn into a set (u1, u2, u3, …, un) spanning the same Rn, within which each vector is of unit length and the vectors are mutually orthogonal. For more detail and a geometric explanation, please refer to Wikipedia.
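The process is just repeated projection and normalization; here is a minimal sketch of classical Gram-Schmidt using the projection formula above (the input vectors are my own example and assumed linearly independent):

```python
import numpy as np

def gram_schmidt(vectors):
    """Orthonormalize a list of linearly independent vectors (classical Gram-Schmidt)."""
    basis = []
    for v in vectors:
        # subtract the projection of v onto every vector already in the basis
        w = v - sum(np.inner(v, u) * u for u in basis)
        basis.append(w / np.linalg.norm(w))
    return np.array(basis)

vs = [np.array([3., 1.]), np.array([2., 2.])]
U = gram_schmidt(vs)
np.allclose(U @ U.T, np.eye(2))  # True: rows are orthonormal
```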

LU decomposition, QR decomposition.

**LU decomposition** aims to decompose a matrix (it need not be square) into a lower triangular matrix (L, entries above the diagonal are 0) and an upper triangular matrix (U, entries below the diagonal are 0). It is related to Gaussian elimination but has better computational efficiency. To make the LU decomposition work out, we sometimes need to reorder the rows of the matrix using a permutation matrix P.

```python
import scipy.linalg

a = np.random.randn(3,4)
p, l, u = scipy.linalg.lu(a)
# p
# array([[0., 1., 0.],
#        [1., 0., 0.],
#        [0., 0., 1.]])
# l
# array([[ 1.        ,  0.        ,  0.        ],
#        [-0.71426919,  1.        ,  0.        ],
#        [-0.85039495,  0.12237106,  1.        ]])
# u
# array([[-2.09219711, -0.48959089,  0.81707073,  0.77602155],
#        [ 0.        , -1.53448255, -1.72785249,  0.04775144],
#        [ 0.        ,  0.        ,  1.5052947 , -0.27769281]])
```

**QR decomposition** aims to decompose a matrix into an orthogonal matrix (Q) and an upper triangular matrix (R). It is used in the QR algorithm and for solving the linear least squares problem.

```python
a = np.random.randn(3,4)
q, r = np.linalg.qr(a)
# q
# array([[-0.47569189,  0.71339517,  0.51457221],
#        [-0.82969021, -0.55818202,  0.00685511],
#        [ 0.29211536, -0.42367461,  0.85741964]])
# r
# array([[-1.34089268, -1.73408897, -0.07436536,  0.78464807],
#        [ 0.        , -1.66272812,  0.63477604, -1.60036506],
#        [ 0.        ,  0.        ,  0.43098896, -0.31316029]])
```

Eigen-decomposition, diagonalization, Characteristic polynomial.

I will illustrate eigen-decomposition using a 4*4 square matrix; it will have 4 eigenvalues and 4 cognate eigenvectors.

```python
ei = np.random.randn(4,4)
w, v = np.linalg.eig(ei)
# w: eigenvalues
# array([-2.21516912+1.65582705j, -2.21516912-1.65582705j,
#         1.45929568+0.99548974j,  1.45929568-0.99548974j])
# v: eigenvectors as columns, 4*4
# array([[-0.1070701 -0.40447805j, -0.1070701 +0.40447805j,
#         -0.03773179+0.56113399j, -0.03773179-0.56113399j],
#        [ 0.15294619-0.0172865j ,  0.15294619+0.0172865j ,
#          0.65763758+0.j        ,  0.65763758-0.j        ],
#        [ 0.12433426+0.47215025j,  0.12433426-0.47215025j,
#          0.3236955 +0.37453337j,  0.3236955 -0.37453337j],
#        [-0.75023815+0.j        , -0.75023815-0.j        ,
#          0.00770123+0.07813095j,  0.00770123-0.07813095j]])
```

**Diagonalization**: for a square matrix A, if there exists an invertible matrix P such that P⁻¹AP = D is a diagonal matrix, then matrix A is diagonalizable.
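When A is diagonalizable, taking P to be the matrix of eigenvectors makes D the matrix of eigenvalues; a minimal sketch (the matrix values are my own example):

```python
import numpy as np

A = np.array([[4., 1.],
              [2., 3.]])
w, v = np.linalg.eig(A)   # eigenvalues w, eigenvectors as columns of v
D = np.diag(w)

# P⁻¹AP = D with P = v
np.allclose(np.linalg.inv(v) @ A @ v, D)  # True
```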

A square matrix A (n*n) has a **characteristic polynomial**, whose roots we are trying to compute: det(A − λI) = 0. The roots λ are the eigenvalues of A.
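NumPy can return the coefficients of the characteristic polynomial via `np.poly`, and `np.roots` then recovers the eigenvalues (a sketch; the matrix values are my own example):

```python
import numpy as np

A = np.array([[4., 1.],
              [2., 3.]])
coeffs = np.poly(A)   # coefficients of det(A - λI), highest degree first
# λ² - 7λ + 10, i.e. coefficients [1., -7., 10.]
np.roots(coeffs)      # the roots are the eigenvalues of A
```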

Moore-Penrose pseudo-inverse, hermitian, conjugate transpose, full-rank matrix

We know that only an invertible square matrix has an inverse, but what about a non-square matrix? For that, we have the **pseudo-inverse**.

**Moore-Penrose** defines the **pseudo-inverse** (A+, n*m) of a rectangular matrix (A, m*n) as the matrix A+ that meets four particular conditions:

(1) AA+A = A

(2) A+AA+ = A+

(3) (AA+)* = AA+

(4) (A+A)* = A+A

Here * means the **conjugate transpose**: for a complex matrix (meaning the entries may be complex numbers), the conjugate transpose first transposes the matrix and then conjugates every entry (flips the sign of the imaginary part but leaves the real part untouched).

A **hermitian matrix** is a complex square matrix whose conjugate transpose is equal to itself. Its real-valued counterpart is the **symmetric matrix**.

If A is a **full-rank** matrix (rank = min(m,n)), A+ can be expressed as:

(1) A+ = (A*A)⁻¹A*, if A has independent columns

(2) A+ = A*(AA*)⁻¹, if A has independent rows
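NumPy computes the pseudo-inverse directly with `np.linalg.pinv`; we can check it against formula (1) for a tall matrix (a sketch — for a real matrix, * is just `.T`, and a random tall matrix has independent columns with probability 1):

```python
import numpy as np

A = np.random.randn(5, 3)      # tall matrix: independent columns
A_pinv = np.linalg.pinv(A)     # Moore-Penrose pseudo-inverse, shape (3, 5)

# formula (1): A+ = (A*A)⁻¹A*
formula = np.linalg.inv(A.T @ A) @ A.T
np.allclose(A_pinv, formula)          # True
np.allclose(A_pinv @ A, np.eye(3))    # True: A+A = I for independent columns
```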

Singular Value Decomposition (SVD), singular matrix, complex unitary matrix

Only a square matrix has eigenvalues; a rectangular matrix instead has **singular values**. A matrix whose determinant = 0 is a **singular matrix**, or in other words, it is non-invertible.

A complex matrix is **unitary** if its conjugate transpose happens to be its inverse: U*U = UU* = I. The real-valued counterpart is the **orthogonal matrix**.

So a rectangular matrix A can be decomposed into three components, U, S, and Vh, with A = U @ S @ Vh, where the diagonal of the S matrix contains the **singular values** of A. This decomposition is called **Singular Value Decomposition (SVD)**. SVD can be really convenient for computing the pseudo-inverse, matrix rank, etc.

*Singular Value Decomposition (SVD)*svd = np.random.rand(4,5)

u,s,vh = np.linalg.svd(svd)#u. shape (4,4)

array([[ 0.66581103, 0.61992005, -0.21849383, -0.3530655 ],

[ 0.47769768, 0.06749606, 0.56339472, 0.67069784],

[ 0.44673911, -0.66240419, 0.31234136, -0.51389467],

[ 0.35906096, -0.41516755, -0.73300048, 0.40177286]])#s. s is the np.diag(S), S will be of shape (4,5)

array([2.35262119, 0.87561858, 0.31537598, 0.043907 ])# vh of shape (5,5)

array([[ 0.29743919, 0.569923 , 0.3576598 , 0.52216343, 0.43144237],

[-0.61102456, 0.15085062, 0.50163189, -0.46496796, 0.36886763],

[ 0.49893385, 0.39660661, -0.29349238, -0.68317228, 0.2022525 ],

[ 0.52538762, -0.60314147, 0.5557502 , -0.15677528, 0.16355866],

[-0.11494248, -0.3624299 , -0.47481454, 0.14088043, 0.78111244]])
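We can verify the decomposition by rebuilding the rectangular S matrix from the singular values and multiplying the three factors back together (a sketch):

```python
import numpy as np

a = np.random.rand(4, 5)
u, s, vh = np.linalg.svd(a)     # u: (4,4), s: (4,), vh: (5,5)

# embed the singular values in a (4,5) matrix S, then reconstruct a
S = np.zeros((4, 5))
S[:4, :4] = np.diag(s)
np.allclose(a, u @ S @ vh)      # True
```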

Lp Norm, Frobenius norm, einsum

What are the L1 and L2 norms? Basically, they are different metrics for measuring a vector’s distance from the origin. Let’s define the more general Lp norm: ||x||p = (sum over i of |xi|^p)^(1/p).

For a matrix, we usually encounter the **Frobenius norm**, the matrix analogue of the L2 norm: ||A||F = sqrt(sum over i,j of |aij|²).

```python
a = np.array([4,5,6])
np.linalg.norm(a, ord=3)       # L3 norm of vector a
# 7.398636222991409

x = np.random.rand(2,3)
np.linalg.norm(x, ord='fro')   # frobenius norm of matrix x
# 1.309085506183174
```
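Writing the Lp definition out by hand matches NumPy’s result (a sketch, using p = 3 on the same vector):

```python
import numpy as np

a = np.array([4, 5, 6])
# the Lp definition spelled out: (sum |x_i|^p)^(1/p), with p = 3
manual = (np.abs(a) ** 3).sum() ** (1 / 3)
np.isclose(manual, np.linalg.norm(a, ord=3))  # True
```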

Finally, I’d like to introduce the **Einstein summation function** (`einsum`) in the NumPy package. This somewhat magical function provides a generic solution for nearly all the basic matrix operations you will encounter, including the inner product, outer product, matrix multiplication, trace, diagonal, Hadamard product, summation, row sums, column sums, and transpose.

Basically, this function takes two kinds of arguments: the first is called the “subscripts” string, and the rest are the “operands”. Let’s look at a matrix multiplication example to understand it:

```python
x = np.random.rand(2,3)
y = np.random.rand(5,3)
np.einsum('ij,kj -> ik', x, y)
# array([[0.3676856 , 0.33156855, 0.39793874, 0.70939856, 0.8353566 ],
#        [0.19219345, 0.19312881, 1.36739081, 1.0606612 , 1.15039307]])
```

As we can see, we use `i` and `j` to represent the dimensions of the operand `x`, and `k` and `j` to represent the dimensions of the operand `y`. We then specify the output matrix to have dimensions `i` by `k`. The `einsum` function automatically figures out that this is a matrix multiplication problem. With that understood, let’s delve into all the interesting operations `einsum` can perform. **A basic rule is that a dimension that appears in the input subscripts but not in the output subscript is summed over.**

```python
# einsum
x = np.random.rand(2,3)
np.einsum('ij -> ji', x)   # transpose
np.einsum('ij ->', x)      # sum of all elements
np.einsum('ij -> i', x)    # row sums (sum over j)
np.einsum('ij -> j', x)    # column sums (sum over i)

x = np.random.rand(2,3)
y = np.random.rand(5,3)
np.einsum('ij,kj -> ik', x, y)  # matrix multiplication

a = np.array([4,5,6])
b = np.array([7,8,9])
np.einsum('i,i ->', a, b)    # inner product
np.einsum('i,j -> ij', a, b) # outer product
np.einsum('i,i -> i', a, b)  # hadamard product

z = np.random.rand(3,3)      # diagonal and trace need a square matrix
np.einsum('ii -> i', z)      # diagonal
np.einsum('ii ->', z)        # trace
```
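Each of these `einsum` calls matches the corresponding dedicated NumPy routine, which is an easy way to convince yourself the subscripts do what you expect (a sketch):

```python
import numpy as np

x = np.random.rand(2, 3)
y = np.random.rand(5, 3)

np.allclose(np.einsum('ij -> ji', x), x.T)           # True: transpose
np.allclose(np.einsum('ij,kj -> ik', x, y), x @ y.T) # True: matrix multiplication

z = np.random.rand(3, 3)
np.allclose(np.einsum('ii -> i', z), np.diag(z))     # True: diagonal
np.isclose(np.einsum('ii ->', z), np.trace(z))       # True: trace
```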

Not everyone needs to know linear algebra, and to what extent you should understand it depends heavily on your type of work. But it is reasonable to forecast that machine learning and other statistical models will become more and more accessible, and by then, simply knowing how to run a machine learning model might not be enough to make you stand out. So I would suggest equipping yourself with the necessary knowledge now to make sure you stay competitive in the foreseeable future.

Thanks for reading! Next time I will cover statistical models in Python. If you like this article, follow me on Medium — thank you so much for your support. Connect with me on Twitter or LinkedIn, and please let me know if you have any questions or what kind of NumPy tutorials you would like to see in the future!

Github Repository: