Part 1.6: Eigenvalues
The eigenvectors of A don't change direction when you multiply them by A: the output Ax lies on the same line as the input vector x. The eigenvector x is simply multiplied by its eigenvalue $\lambda$.
$Ax=\lambda x$, and therefore $A^2x=\lambda ^2 x$
$Ax=\lambda x$ means that $A - \lambda I$ is not invertible: it is singular, and its determinant is zero, $\det(A-\lambda I)=0$
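A minimal NumPy check of these facts (the 2 by 2 matrix A below is a made-up example):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])            # made-up example (eigenvalues 1 and 3)

lams, X = np.linalg.eig(A)            # eigenvalues and eigenvector matrix
lam, x = lams[0], X[:, 0]

print(np.allclose(A @ x, lam * x))                            # A x = lambda x
print(np.allclose(A @ (A @ x), lam**2 * x))                   # A^2 x = lambda^2 x
print(np.isclose(np.linalg.det(A - lam * np.eye(2)), 0.0))    # det(A - lambda I) = 0
```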
Similar matrices
For every invertible matrix B, the eigenvalues of $BAB^{-1}$ are the same as the eigenvalues of A. If $Ax=\lambda x$, then the corresponding eigenvector of $BAB^{-1}$ is $Bx$.
To compute the eigenvalues of A, we gradually make $BAB^{-1}$ diagonal; the eigenvalues then show up on the main diagonal.
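A sketch of the similarity claim, using a made-up A and a random (almost surely invertible) B:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[2.0, 1.0],
              [0.0, 3.0]])                    # made-up example, eigenvalues 2 and 3
B = rng.random((2, 2)) + 2 * np.eye(2)        # diagonally dominant, hence invertible

M = B @ A @ np.linalg.inv(B)                  # the similar matrix B A B^{-1}

lams_A, X = np.linalg.eig(A)
lams_M = np.linalg.eigvals(M)
print(np.allclose(np.sort(lams_A), np.sort(lams_M)))   # same eigenvalues

lam, x = lams_A[0], X[:, 0]
print(np.allclose(M @ (B @ x), lam * (B @ x)))          # Bx is an eigenvector of B A B^{-1}
```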
Diagonalizing a Matrix
Suppose A has a full set of n independent eigenvectors. Put those eigenvectors into an invertible matrix X. Then $A=X\Lambda X^{-1}$, where $\Lambda$ is the diagonal matrix of eigenvalues. This factorization is used to compute powers of A: $A^k=X\Lambda ^k X^{-1}$.
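A short sketch of computing $A^k$ through diagonalization, on a made-up 2 by 2 example:

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])             # made-up example with independent eigenvectors

lams, X = np.linalg.eig(A)             # A = X diag(lams) X^{-1}
k = 5
A_power = X @ np.diag(lams**k) @ np.linalg.inv(X)     # A^k = X Lambda^k X^{-1}

print(np.allclose(A_power, np.linalg.matrix_power(A, k)))
```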
Symmetric positive definite matrices
If $S = S^T$, then the eigenvalues are real and the eigenvectors q can be chosen orthogonal.
Spectral Theorem: Every real symmetric matrix has the form $S=Q\Lambda Q^T$. Q is the orthonormal eigenvector matrix, with $Q^TQ=I$.
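A quick numerical check of the spectral theorem on a small symmetric example (`numpy.linalg.eigh` is NumPy's symmetric eigenvalue routine):

```python
import numpy as np

S = np.array([[4.0, 1.0],
              [1.0, 3.0]])                       # made-up symmetric example

lams, Q = np.linalg.eigh(S)                      # eigh is for symmetric / Hermitian matrices
print(np.allclose(Q.T @ Q, np.eye(2)))           # Q^T Q = I : orthonormal eigenvectors
print(np.allclose(Q @ np.diag(lams) @ Q.T, S))   # S = Q Lambda Q^T
```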
A positive definite matrix has all positive eigenvalues.
The energy test: S is positive definite if the energy $x^TSx$ is positive for every vector $x \neq 0$.
If $Sx=\lambda x$ with $x \neq 0$, then $x^TSx = \lambda x^Tx$, which is positive exactly when $\lambda > 0$.
If every eigenvector has positive energy, then every vector does: because S is symmetric, the eigenvectors can be chosen orthogonal, and every vector x is a linear combination of those eigenvectors, so its energy is a positive combination of eigenvector energies.
Another test: $S=A^TA$ is positive definite whenever the matrix A has independent columns.
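A sketch combining these tests on a random example: build $S=A^TA$ from a random A with independent columns, then check the eigenvalue test and the energy test:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((4, 3))                      # columns are (almost surely) independent
S = A.T @ A                                 # so S = A^T A is positive definite

print(np.all(np.linalg.eigvalsh(S) > 0))    # eigenvalue test: all lambda > 0

# energy test on a few random nonzero vectors: x^T S x > 0
xs = rng.standard_normal((5, 3))
print(all(x @ S @ x > 0 for x in xs))
```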
For an ordinary function f(x) of one variable x, the minimum test is df/dx = 0 together with a positive second derivative $d^2f/dx^2 > 0$. For f(x, y) with two variables, the second derivatives go into a matrix, and a critical point is a minimum when that matrix is positive definite.
There is a saddle point when S has both positive and negative eigenvalues. A saddle point matrix is "indefinite".
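An illustrative sketch with two quadratic functions of my own choosing, $f(x,y)=x^2+xy+2y^2$ (minimum at the origin) and $g(x,y)=x^2-y^2$ (saddle at the origin); their constant second-derivative matrices show the difference:

```python
import numpy as np

# second-derivative matrix of f(x, y) = x^2 + x*y + 2*y^2  (constant, since f is quadratic)
H_f = np.array([[2.0, 1.0],
                [1.0, 4.0]])
# second-derivative matrix of g(x, y) = x^2 - y^2
H_g = np.array([[2.0, 0.0],
                [0.0, -2.0]])

print(np.linalg.eigvalsh(H_f))   # all positive  -> positive definite -> minimum
print(np.linalg.eigvalsh(H_g))   # mixed signs   -> indefinite        -> saddle point
```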
Optimization and Machine Learning
Gradient descent: each step takes the steepest direction toward the bottom point of the bowl. Calculus: the partial derivatives of f are all zero at the minimum. Linear algebra: the matrix S of second derivatives is positive definite.
If S is positive definite (or semidefinite) at all points x, then the function f(x) is convex.
Machine learning produces "loss functions" with hundreds of thousands of variables. They measure the error, which we minimize. But computing all the second derivatives is completely impossible. We use first derivatives to tell us a direction to move: the error drops fastest in the steepest direction. Then we take another descent step in a new direction. This is the central computation in least squares, neural nets, and deep learning.
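A minimal gradient-descent sketch on a convex quadratic $f(x)=\tfrac12 x^TSx - b^Tx$ (the matrix S, vector b, step size, and iteration count are all illustrative choices):

```python
import numpy as np

# minimize f(x) = 0.5 x^T S x - b^T x, whose gradient is S x - b
S = np.array([[3.0, 1.0],
              [1.0, 2.0]])           # positive definite, so f is convex
b = np.array([1.0, 1.0])

x = np.zeros(2)
lr = 0.1                             # step size (learning rate)
for _ in range(200):
    grad = S @ x - b                 # first derivatives only
    x -= lr * grad                   # step in the steepest-descent direction

print(x, np.linalg.solve(S, b))      # gradient descent approaches the true minimizer
```

A fixed step size works here because the eigenvalues of S bound the curvature; in practice the step size is tuned or adapted.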
Ellipse
Stay with a positive definite matrix S. Cut through the energy bowl at height $x^TSx = 1$; the curve that goes around the cut is an ellipse.
The axes of the tilted ellipse point along the eigenvectors of S, with axis lengths $1/\sqrt{\lambda}$.
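A short check on a made-up positive definite S: the points at distance $1/\sqrt{\lambda}$ along each eigenvector lie on the ellipse $x^TSx = 1$:

```python
import numpy as np

S = np.array([[5.0, 4.0],
              [4.0, 5.0]])                 # positive definite example (eigenvalues 1 and 9)

lams, Q = np.linalg.eigh(S)
axis_lengths = 1.0 / np.sqrt(lams)         # the ellipse x^T S x = 1 has axes of length 1/sqrt(lambda)

# the endpoint of each axis (eigenvector q scaled by 1/sqrt(lambda)) lies on the ellipse
for q, length in zip(Q.T, axis_lengths):
    x = length * q
    print(np.isclose(x @ S @ x, 1.0))
```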
SVD
If A is not square, then $Ax = \lambda x$ is impossible (Ax and x have different lengths) and eigenvectors fail. The SVD handles this case.
$Av_k=\sigma _ku_k$ for $k = 1, \dots, r$: the v's are orthonormal in $R^n$, the u's are orthonormal in $R^m$, and that number r is the rank of A.
$Av_{r+1} = \cdots = Av_n = 0$: the remaining v's lie in the nullspace of A.
$AV = U\Sigma$. V and U are square orthogonal matrices, $V^T=V^{-1}$ and $U^T=U^{-1}$, because their columns are orthonormal unit vectors.
SVD: $A=U\Sigma V^T$. The reduced SVD keeps only the first r columns of U and V, so that $\Sigma _r$ is square (r by r).
A special property of the SVD is that those pieces come in order of importance ($\sigma _1 \geq \sigma _2 \geq \cdots \geq \sigma _r > 0$): $\sigma _1u_1v_1^T$ is the closest rank-1 matrix to A, and the sum of the first k pieces is the best rank-k approximation of A.
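A NumPy sketch with a random A: the reduced SVD reproduces A, and cutting the sum after k pieces gives an approximation whose error in the 2-norm is $\sigma_{k+1}$:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3))                    # made-up 5 by 3 matrix, rank 3

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # reduced SVD: Sigma_r is 3 by 3
print(np.allclose(U @ np.diag(s) @ Vt, A))         # A = U Sigma V^T

# sum of the first k rank-1 pieces sigma_i u_i v_i^T
k = 2
A_k = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(k))
print(np.isclose(np.linalg.norm(A - A_k, 2), s[k]))   # best rank-k error equals sigma_{k+1}
```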
Eckart-Young
$A^TA=V\Sigma^T\Sigma V^T$. V contains orthonormal eigenvectors of $A^TA$, and the $\sigma_i^2$ are the nonzero eigenvalues of both $A^TA$ and $AA^T$.
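A quick check of these identities on a random 4 by 3 matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 3))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T

# the columns of V are eigenvectors of A^T A with eigenvalues sigma_i^2
print(np.allclose(A.T @ A @ V, V * s**2))

# sigma_i^2 are also the nonzero eigenvalues of A A^T (plus one zero eigenvalue here)
eig_AAT = np.sort(np.linalg.eigvalsh(A @ A.T))[::-1]
print(np.allclose(eig_AAT[:3], s**2))
print(np.isclose(eig_AAT[3], 0.0))
```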
KL transform’s connection to SVD
KL begins with a covariance matrix V of a zero-mean random process. V is symmetric and positive definite or semidefinite. In general V could be an infinite matrix or a covariance function, and then the KL expansion is an infinite series. The Karhunen-Loève (KL) transform is a stochastic (random) form of Principal Component Analysis.
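A minimal sketch of the finite, sampled version (plain PCA on a synthetic zero-mean data matrix; the numbers here are illustrative, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(4)
# synthetic zero-mean samples: 1000 draws of a 3-component random vector
X = rng.standard_normal((1000, 3)) @ np.array([[2.0, 0.0, 0.0],
                                               [0.5, 1.0, 0.0],
                                               [0.0, 0.3, 0.2]])
X -= X.mean(axis=0)                      # enforce zero mean

V = (X.T @ X) / len(X)                   # sample covariance matrix (symmetric, PSD)
lams, Q = np.linalg.eigh(V)              # KL / PCA basis: orthonormal eigenvectors of V
order = np.argsort(lams)[::-1]           # principal components, largest variance first
components, variances = Q[:, order], lams[order]

Z = X @ components                       # coordinates in the KL / PCA basis
cov_Z = (Z.T @ Z) / len(Z)
print(np.allclose(cov_Z, np.diag(variances)))   # the new coordinates are uncorrelated
```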
The Geometry of the SVD
The SVD separates a matrix into (orthogonal) x (diagonal) x (orthogonal). The orthogonal matrices U and V rotate the plane. The diagonal matrix stretches it along the axes.
Finite difference
The discrete form of a derivative is a finite difference. The discrete form of an integral is a sum.
The difference matrix D corresponds to the backward difference $f(x) - f(x - \Delta x)$.
The eigenvectors v of $D^TD$ are the right singular vectors of D. They are discrete sines; the u's (eigenvectors of $DD^T$) are discrete cosines. Suitably normalized, these are the famous DST and DCT matrices, the Discrete Sine Transform and the Discrete Cosine Transform. The DCT matrix has been the backbone of JPEG image compression. JPEG uses 8 by 8 DCT blocks, which reduces the "blockiness" of the image: 8 by 8 blocks of pixels are transformed by a two-dimensional DCT, then compressed and transmitted.
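A small NumPy sketch, under my own choice of sizes and boundary conventions (a 5 by 4 backward difference D with zero boundary values; normalizing scale factors are omitted), checking that sampled sines are eigenvectors of $D^TD$ and sampled cosines are eigenvectors of $DD^T$:

```python
import numpy as np

n = 4
# backward difference matrix D: (Df)_i = f_i - f_{i-1}, with zero boundary values
D = np.zeros((n + 1, n))
for j in range(n):
    D[j, j] = 1.0
    D[j + 1, j] = -1.0

K = D.T @ D        # fixed-fixed second difference (tridiagonal -1, 2, -1)
B = D @ D.T        # free-free second difference

# discrete sines are eigenvectors of K = D^T D (right singular vectors of D)
j = np.arange(1, n + 1)
for k in range(1, n + 1):
    v = np.sin(j * k * np.pi / (n + 1))
    lam = 2 - 2 * np.cos(k * np.pi / (n + 1))
    print(np.allclose(K @ v, lam * v))

# discrete cosines are eigenvectors of B = D D^T (left singular vectors of D)
i = np.arange(n + 1)
for k in range(n + 1):
    u = np.cos((i + 0.5) * k * np.pi / (n + 1))
    lam = 2 - 2 * np.cos(k * np.pi / (n + 1))
    print(np.allclose(B @ u, lam * u))
```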