The eigenvectors of A don’t change direction when you multiply them by A: the output Ax lies on the same line as the input vector x. The eigenvector x is just multiplied by its eigenvalue $\lambda$.

Multiplying by A again: $A^2x=\lambda^2 x$, and in general $A^nx=\lambda^n x$

$Ax=\lambda x$ means that $(A - \lambda I)x = 0$ for a nonzero x, so $A - \lambda I$ is not invertible: it is singular and its determinant is zero, $\det(A-\lambda I)=0$
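
A small 2 by 2 example with my own numbers, showing how that determinant condition produces the eigenvalues and eigenvectors:

$$
A=\begin{pmatrix}2&1\\1&2\end{pmatrix},\qquad
\det(A-\lambda I)=(2-\lambda)^2-1=(\lambda-1)(\lambda-3)=0,
$$
$$
\lambda_1=3,\ x_1=\begin{pmatrix}1\\1\end{pmatrix},\qquad
\lambda_2=1,\ x_2=\begin{pmatrix}1\\-1\end{pmatrix}.
$$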

Similar matrices

For every invertible matrix B, the eigenvalues of A are the same as the eigenvalues of $BAB^{-1}$. The eigenvector becomes $Bx$ when $Ax=\lambda x$, because $(BAB^{-1})(Bx) = BAx = \lambda Bx$

To compute the eigenvalues of A, we gradually make $BAB^{-1}$ diagonal, and the eigenvalues show up on the main diagonal
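
A minimal NumPy sketch of that idea with my own example matrix (not from the text), using the unshifted QR iteration: each step $A_{k+1}=RQ=Q^TA_kQ$ is an orthogonal similarity transform, so the eigenvalues never change, and for this symmetric A the iterates approach a diagonal matrix.

```python
import numpy as np

# Unshifted QR iteration: repeated similarity transforms Q^T A_k Q.
# Illustrative only; practical eigenvalue codes add shifts.
def qr_iteration(A, steps=200):
    Ak = np.array(A, dtype=float)
    for _ in range(steps):
        Q, R = np.linalg.qr(Ak)
        Ak = R @ Q                      # R Q = Q^T (Q R) Q = Q^T A_k Q
    return Ak

A = np.array([[2.0, 1.0], [1.0, 3.0]])
print(np.sort(np.diag(qr_iteration(A))))   # eigenvalues appear on the diagonal
print(np.linalg.eigvalsh(A))               # NumPy's answer for comparison
```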

Diagonalizing a Matrix

Suppose A has a full set of n independent eigenvectors. Put those eigenvectors into the columns of an invertible matrix X. Then $A=X\Lambda X^{-1}$. This property is used to compute powers of A: $A^k = X\Lambda^k X^{-1}$
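
A hedged sketch with a 2 by 2 matrix of my own: diagonalize A and use $X\Lambda^k X^{-1}$ to compute $A^k$.

```python
import numpy as np

# Diagonalize A = X Λ X^{-1}, then A^k = X Λ^k X^{-1}.
A = np.array([[4.0, 1.0], [2.0, 3.0]])
lam, X = np.linalg.eig(A)                        # eigenvalues and eigenvector columns
k = 5
A_k = X @ np.diag(lam**k) @ np.linalg.inv(X)     # X Λ^k X^{-1}
print(np.allclose(A_k, np.linalg.matrix_power(A, k)))   # True
```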

Symmetric positive definite matrices

If $S = S^T$, then the eigenvectors q can be chosen orthogonal

Spectral Theorem: Every real symmetric matrix has the form $S=Q\Lambda Q^T$, where Q is the orthonormal eigenvector matrix with $Q^TQ=I$
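
A quick NumPy check of the spectral theorem on a symmetric matrix of my own choosing:

```python
import numpy as np

# S = Q Λ Q^T with orthonormal eigenvector columns in Q.
S = np.array([[4.0, 1.0], [1.0, 3.0]])
lam, Q = np.linalg.eigh(S)                       # real eigenvalues, orthonormal eigenvectors
print(np.allclose(Q.T @ Q, np.eye(2)))           # Q^T Q = I
print(np.allclose(Q @ np.diag(lam) @ Q.T, S))    # S = Q Λ Q^T
```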

A positive definite matrix has all positive eigenvalues.

The energy test: S is positive definite if the energy $x^TSx$ is positive for every vector $x \neq 0$

If $Sx=\lambda x$, then $x^TSx = \lambda x^Tx$, which is positive exactly when $\lambda > 0$

If every eigenvector has positive energy, then every x has positive energy: because S is symmetric the eigenvectors can be chosen orthogonal, and every vector x is a linear combination of those eigenvectors

$S=A^TA$ is positive definite for any matrix A with independent columns, since then $x^TSx = \|Ax\|^2 > 0$ for $x \neq 0$
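
A sketch with an example matrix of my own: two checks that $S=A^TA$ is positive definite, by its eigenvalues and by the energy test.

```python
import numpy as np

# A has independent columns, so S = A^T A should be positive definite.
A = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [1.0, 0.0]])
S = A.T @ A

print(np.all(np.linalg.eigvalsh(S) > 0))     # all eigenvalues positive
x = np.array([3.0, -2.0])                    # any nonzero test vector
print(x @ S @ x, np.linalg.norm(A @ x)**2)   # energy x^T S x equals |Ax|^2 > 0
```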

For an ordinary function f(x) of one variable x, the minimum test is df/dx = 0 with second derivative > 0. For f(x, y) with two variables, the second derivatives go into a matrix, and the point is a minimum when that matrix is positive definite.
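
A small example with a function of my own choosing, $f(x,y)=x^2+xy+y^2$: the second derivatives go into the symmetric matrix

$$
S=\begin{pmatrix} f_{xx} & f_{xy}\\ f_{yx} & f_{yy}\end{pmatrix}
 =\begin{pmatrix} 2 & 1\\ 1 & 2\end{pmatrix},
$$

with eigenvalues 3 and 1. Both are positive, so S is positive definite and the stationary point $(0,0)$ is a minimum.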

There is a saddle point when S has both positive and negative eigenvalues. A saddle point matrix is “indefinite”

Optimization and Machine Learning

Gradient descent: each step takes the steepest direction toward the bottom point of the bowl. Calculus: the partial derivatives of f are all zero at that minimum. Linear algebra: the matrix S of second derivatives is positive definite.

If S is positive definite (or semidefinite) at all points x, then the function f(x) is convex

Machine learning produces “loss functions” with hundreds of thousands of variables. They measure the error, which we minimize. But computing all the second derivatives is completely impossible. We use first derivatives to tell us a direction to move: the error drops fastest in the steepest direction. Then we take another descent step in a new direction. This is the central computation in least squares, neural nets, and deep learning.
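
A minimal sketch of gradient descent on a convex quadratic of my own, $f(x)=\tfrac{1}{2}x^TSx - b^Tx$ (the matrix S, vector b, and step size are all my choices): the gradient is $Sx-b$ and the minimizer solves $Sx=b$.

```python
import numpy as np

# Gradient descent on f(x) = 1/2 x^T S x - b^T x with a fixed step size.
S = np.array([[3.0, 1.0], [1.0, 2.0]])    # positive definite
b = np.array([1.0, 1.0])
x = np.zeros(2)
step = 0.1
for _ in range(200):
    grad = S @ x - b                      # first derivatives of f
    x = x - step * grad                   # move in the steepest descent direction
print(x, np.linalg.solve(S, b))           # both approach the minimizer of f
```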

Ellipse

Stay with a positive definite matrix S. Cut through that energy bowl at height $x^TSx = 1$; the curve that goes around the cut is an ellipse

The axes of the tilted ellipse point along the eigenvectors of S. Axis length = $1/\sqrt{\lambda}$
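
A short check with a positive definite matrix of my own: the axis endpoints $q/\sqrt{\lambda}$ (eigenvector q, eigenvalue $\lambda$) lie on the curve $x^TSx=1$.

```python
import numpy as np

# Axes of the ellipse x^T S x = 1: along eigenvectors, half-length 1/sqrt(lambda).
S = np.array([[5.0, 4.0], [4.0, 5.0]])
lam, Q = np.linalg.eigh(S)                # eigenvalues 1 and 9 for this S
for l, q in zip(lam, Q.T):                # each eigenvector is a column of Q
    endpoint = q / np.sqrt(l)             # axis endpoint at distance 1/sqrt(lambda)
    print(endpoint @ S @ endpoint)        # equals 1, so it sits on the cut
```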

SVD

If A is not square, then $Ax = \lambda x$ is impossible (Ax and x have different lengths) and eigenvectors fail. The SVD handles this case

$Av_1=\sigma_1u_1, \dots, Av_r=\sigma_ru_r$: the v's are orthogonal unit vectors in $R^n$, and the u's are orthogonal unit vectors in $R^m$. That number r is the rank of A

$Av_{r+1} = 0$

$AV = U\Sigma$. V and U are square orthogonal matrices ($V^T=V^{-1}$) because their columns are orthogonal unit vectors.

SVD: $A=U\Sigma V^T$. The reduced SVD keeps only the first r columns of U and V, so that $\Sigma_r$ is square

A special property of the SVD is that the pieces $\sigma_iu_iv_i^T$ come in order of importance. $\sigma_1u_1v_1^T$ is the closest rank-1 matrix to A. The sum of the first k pieces is the best rank-k approximation of A

Eckart-Young

$A^TA=V\Sigma^T\Sigma V^T$. V contains orthonormal eigenvectors of $A^TA$. The $\sigma_i^2$ are the nonzero eigenvalues of both $A^TA$ and $AA^T$
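
A NumPy sketch on a random matrix of my own: the sum of the first k pieces is the best rank-k approximation (its 2-norm error is $\sigma_{k+1}$), and the $\sigma_i^2$ match the eigenvalues of $A^TA$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]       # sum of the first k pieces
print(np.linalg.norm(A - A_k, 2), s[k])           # 2-norm error equals sigma_{k+1}

eigs = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
print(np.allclose(eigs, s**2))                    # sigma_i^2 = eigenvalues of A^T A
```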

KL transform’s connection to SVD

KL begins with a covariance matrix V of a zero-mean random process. V is symmetric and positive definite or semidefinite. In general V could be an infinite matrix or a covariance function. Then the KL expansion will be an infinite series. The KL transform is a stochastic (random) form of Principal Component Analysis

The Geometry of the SVD

The SVD separates a matrix into (orthogonal) x (diagonal) x (orthogonal). The orthogonal matrices U and V rotate the plane. The diagonal matrix stretches it along the axes.

Finite difference

The discrete form of a derivative is a finite difference. The discrete form of an integral is a sum.

D corresponds to the backward difference $f(x) - f(x - \Delta x)$
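
A small sketch of my own construction: a backward-difference matrix D applied to samples of f(x), so that dividing $Df$ by $\Delta x$ approximates the derivative.

```python
import numpy as np

# (D f)_i = f(x_i) - f(x_i - dx); divide by dx for the backward-difference derivative.
n, dx = 100, 0.01
x = np.arange(1, n + 1) * dx
f = np.sin(x)

D = np.eye(n) - np.eye(n, k=-1)        # 1 on the diagonal, -1 on the subdiagonal
approx = (D @ f) / dx                  # backward-difference approximation to f'(x)
print(np.max(np.abs(approx[1:] - np.cos(x[1:]))))   # error is O(dx); first entry has no left neighbor
```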

The eigenvectors v of $D^TD$ are the right singular vectors of D. They are discrete sines; the u's are discrete cosines. Multiplied by $\sqrt{2}$, those sine and cosine vectors give the famous DST and DCT matrices: the Discrete Sine Transform and the Discrete Cosine Transform. The DCT matrix has been the backbone of JPEG image compression. Actually JPEG increases U to 8 by 8, which reduces the “blockiness” of the image. 8 by 8 blocks of pixels are transformed by a two-dimensional DCT, then compressed and transmitted.
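
A toy sketch of that JPEG-style step using SciPy's `scipy.fft.dctn` (assuming SciPy is available; the random block and threshold are my own, not the full JPEG pipeline):

```python
import numpy as np
from scipy.fft import dctn, idctn

# An 8x8 block of pixels is transformed by a two-dimensional DCT,
# small coefficients are dropped (the "compression"), and the block is inverted back.
rng = np.random.default_rng(1)
block = rng.integers(0, 256, size=(8, 8)).astype(float)

coeffs = dctn(block, norm="ortho")            # 2-D Discrete Cosine Transform
coeffs[np.abs(coeffs) < 10] = 0.0             # keep only the large coefficients
reconstructed = idctn(coeffs, norm="ortho")
print(np.max(np.abs(block - reconstructed)))  # error comes only from the dropped coefficients
```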