Part 1.9 Principal Components
Principal Component Analysis (PCA) uses the largest $\sigma$'s, connected to the first $u$'s and $v$'s, to understand the information in a matrix of data. The closest rank-$k$ matrix to $A$ is $A_k$. In statistics we are identifying the pieces of $A$ with largest variance.
PCA is “unsupervised” learning. Our only instructor is linear algebra: the SVD tells us to choose $A_k$.
Eckart-Young: $A_k$ is the closest rank-$k$ matrix to $A$ for three choices of the matrix norm $\|A\|$: the spectral norm $\|A\|_2 = \sigma_1$, the Frobenius norm, and the nuclear norm.
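A minimal sketch in NumPy (the helper name `closest_rank_k` and the random test matrix are invented for illustration): build $A_k$ from the truncated SVD and check that its spectral-norm error is the first dropped singular value $\sigma_{k+1}$.

```python
import numpy as np

def closest_rank_k(A, k):
    """Best rank-k approximation A_k of A (Eckart-Young), built from the truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

A = np.random.default_rng(0).standard_normal((6, 4))
A2 = closest_rank_k(A, 2)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

print(np.linalg.matrix_rank(A2))                       # 2
# The spectral-norm error of the best rank-2 approximation is sigma_3.
print(np.isclose(np.linalg.norm(A - A2, 2), s[2]))     # True
```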
PCA
We have $n$ samples. For each sample we measure $m$ variables, so the data matrix $A_0$ has $m$ rows and $n$ columns.
The first step is to find the average (the sample mean) along each row of $A_0$. Subtract that mean from all $n$ entries in the row. Now each row of the centered matrix $A$ has mean zero.
How will linear algebra find the closest line through $(0,0)$? It is in the direction of the first singular vector $u_1$ of $A$.
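A small sketch of these two steps, assuming NumPy (the toy data is invented for illustration): center each row of $A_0$, then read the direction of the closest line from the first left singular vector $u_1$.

```python
import numpy as np

rng = np.random.default_rng(1)
# m = 2 measured variables, n = 100 samples (invented data, rows scaled differently).
A0 = rng.standard_normal((2, 100)) * np.array([[3.0], [1.0]])

# Step 1: subtract each row's mean, so every row of the centered A has mean zero.
A = A0 - A0.mean(axis=1, keepdims=True)

# The closest line through (0, 0) points along the first left singular vector u1.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
u1 = U[:, 0]
print("direction of the closest line:", u1)
```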
The Statistics Behind PCA
Subtracting those row means from $A_0$ produced the centered matrix $A$. The variances are the diagonal entries of the matrix $AA^T$ and the covariances are its off-diagonal entries (divide by $n-1$ to get the sample statistics, as in $S$ below). High covariance means that increased height goes with increased age. (Negative covariance means that one variable increases when the other decreases.)
Sample covariance matrix: $S = AA^T/(n-1)$.
Given the data in $A$, explicitly forming $S$ would be a computational mistake. For large matrices, a direct SVD of $A$ is faster and more accurate.
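A quick numerical check, assuming NumPy and invented data: the left singular vectors of the centered $A$ are the eigenvectors of $S = AA^T/(n-1)$, so the SVD route delivers the same principal directions without ever forming $S$.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 50))
A = A - A.mean(axis=1, keepdims=True)      # centered data: m = 3 variables, n = 50 samples
n = A.shape[1]

# Route 1 (avoid for large problems): form S explicitly and diagonalize it.
S = A @ A.T / (n - 1)
eigvals, eigvecs = np.linalg.eigh(S)       # ascending order

# Route 2 (preferred): SVD of A directly.
U, s, _ = np.linalg.svd(A, full_matrices=False)

# Largest eigenvalue of S equals sigma_1^2 / (n - 1); top eigenvector matches u1 up to sign.
print(np.isclose(eigvals[-1], s[0]**2 / (n - 1)))             # True
print(np.allclose(np.abs(eigvecs[:, -1]), np.abs(U[:, 0])))   # True
```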
The leading eigenvector $u_1$ of $S$ (the first left singular vector of $A$) tells us the slope of the closest line in the scatter plot.
The sum of squared distances from the data points to the $u_1$ line is a minimum.
The total variance in the data comes from the Frobenius norm (squared) of $A$: $T = \|A\|_F^2/(n-1)$. This is the trace of $S$, the sum down its diagonal, and the trace equals the sum of the eigenvalues of the sample covariance matrix $S$.
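A short check of that identity (NumPy, invented data): total variance $= \|A\|_F^2/(n-1) = \operatorname{trace}(S) = \lambda_1 + \cdots + \lambda_m$.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 30))
A = A - A.mean(axis=1, keepdims=True)      # centered: m = 4 variables, n = 30 samples
n = A.shape[1]
S = A @ A.T / (n - 1)

total_variance = np.linalg.norm(A, 'fro')**2 / (n - 1)
print(np.isclose(total_variance, np.trace(S)))                    # True
print(np.isclose(total_variance, np.linalg.eigvalsh(S).sum()))    # True
```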
Rayleigh Quotient for a Symmetric Matrix
$R(x) = x^TSx/x^Tx$. The maximum value of $R(x)$ is the largest eigenvalue $\lambda_1$ of $S$, attained at the top eigenvector; the minimum is $\lambda_n$, attained at the bottom eigenvector. The other eigenvectors give saddle points: the first derivatives are zero there, but they are not maxima or minima.
$\|A\|^2 = \max\limits_{x \neq 0} \dfrac{\|Ax\|^2}{\|x\|^2} = \max\limits_{x \neq 0} R(x) = \lambda_1(S) = \sigma_1^2(A)$, where here $S = A^TA$.
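A sketch that exercises the Rayleigh quotient numerically (NumPy, with a random symmetric test matrix): $R(x)$ equals $\lambda_1$ at the top eigenvector and never exceeds it elsewhere.

```python
import numpy as np

rng = np.random.default_rng(4)
B = rng.standard_normal((5, 5))
S = B @ B.T                                   # random symmetric test matrix

def rayleigh(S, x):
    """Rayleigh quotient R(x) = x^T S x / x^T x."""
    return (x @ S @ x) / (x @ x)

eigvals, eigvecs = np.linalg.eigh(S)          # ascending order
q1 = eigvecs[:, -1]                           # eigenvector for the largest eigenvalue

print(np.isclose(rayleigh(S, q1), eigvals[-1]))    # True: R(q1) = lambda_1
# Random vectors never beat lambda_1 (up to roundoff).
print(all(rayleigh(S, rng.standard_normal(5)) <= eigvals[-1] + 1e-12 for _ in range(1000)))
```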
Generalized Eigenvalues
Generalized Rayleigh quotient $R(x) = x^TSx / x^TMx$. M is symmetric. In dynamical problems M is often the “mass matrix” or the “inertia matrix”. In statistics M is generally the covariance matrix.
If $M$ is positive definite, the maximum of $R(x)$ is the largest eigenvalue of $M^{-1}S$.
$H = M^{-1/2}SM^{-1/2}$ is symmetric, with the same eigenvalues as $M^{-1}S$.
Substituting $x = M^{-1/2}y$ converts $Sx = \lambda Mx$ into $Hy = \lambda y$.
A crucial fact about a symmetric matrix $S$ is that any two eigenvectors (with different eigenvalues) are orthogonal. That is not true for generalized eigenvectors, but they are $M$-orthogonal: $x_1^TMx_2 = 0$.
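A sketch of the $M^{-1/2}$ conversion, assuming NumPy and SciPy are available (the random $S$ and positive definite $M$ are invented for illustration): the symmetric $H$ has the same eigenvalues as $Sx = \lambda Mx$, and the generalized eigenvectors come out $M$-orthogonal.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(5)
B = rng.standard_normal((4, 4)); S = B @ B.T                 # symmetric S
C = rng.standard_normal((4, 4)); M = C @ C.T + np.eye(4)     # symmetric positive definite M

# Build H = M^{-1/2} S M^{-1/2} from the eigendecomposition of M.
w, Q = np.linalg.eigh(M)
M_inv_half = Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T
H = M_inv_half @ S @ M_inv_half

# The symmetric problem H y = lambda y has the same eigenvalues as S x = lambda M x.
lam_H = np.linalg.eigvalsh(H)
lam_gen, X = eigh(S, M)                    # SciPy's generalized symmetric solver
print(np.allclose(lam_H, lam_gen))         # True

# Generalized eigenvectors are M-orthogonal: x_i^T M x_j = 0 for i != j.
G = X.T @ M @ X
print(np.allclose(G - np.diag(np.diag(G)), 0))   # True (off-diagonal entries vanish)
```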
Positive Semidefinite M: Not Invertible
If $M$ is only semidefinite, $x^TMx$ can be zero, and the generalized problem $Sx = \lambda Mx$ can have infinite eigenvalues (when $Mx = 0$ but $Sx \neq 0$).
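A tiny made-up example of an infinite eigenvalue:
$$S = \begin{pmatrix} 2 & 0 \\ 0 & 3 \end{pmatrix}, \qquad M = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}.$$
For $x = (1,0)$ the equation $Sx = \lambda Mx$ gives $\lambda = 2$. For $x = (0,1)$ we have $Mx = 0$ while $Sx \neq 0$, so no finite $\lambda$ works: that eigenvalue is $\lambda = \infty$.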
In statistics $M$ is often a covariance matrix. Its diagonal entries tell us the separate variances of two or more measurements. Its off-diagonal entries tell us the covariances between the measurements. If we are foolishly repeating exactly the same observations, or if one experiment is completely determined by another, then the covariance matrix $M$ is singular: its determinant is zero and it is not invertible.
Fisher’s Linear Discriminant Analysis
We are given samples from two different populations, mixed together, and we know the mean and covariance of each population. If we pick one sample from the mixture, how do we tell whether it probably came from population 1 or population 2?
Each sample has a feature vector $f$. Fisher finds a separation vector $v$: if $v^Tf > c$, the sample likely came from population 1; otherwise population 2. The vector $v$ maximizes the separation ratio $R$.
$R(x) = \dfrac{(x^Tm_1 - x^Tm_2)^2}{x^T\Sigma_1 x + x^T\Sigma_2 x}$, where $m_1, m_2$ are the population mean vectors and $\Sigma_1, \Sigma_2$ their covariance matrices.
With $S = (m_1-m_2)(m_1-m_2)^T$ and $M = \Sigma_1 + \Sigma_2$, this $R$ is a generalized Rayleigh quotient $x^TSx/x^TMx$.
The maximizing $v$ solves $Sv = \lambda Mv$.
Since $S$ has rank one, $Sv$ is always a multiple of $m_1 - m_2$, so the best vector is $v = M^{-1}(m_1 - m_2)$.
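A minimal sketch of Fisher's rule, assuming NumPy (the two Gaussian populations and the midpoint threshold $c$ are invented for illustration): estimate $m_1, m_2, \Sigma_1, \Sigma_2$ from samples, form $v = M^{-1}(m_1 - m_2)$, and classify by comparing $v^Tf$ with $c$.

```python
import numpy as np

rng = np.random.default_rng(6)
# Two made-up populations with different means, 2 features each.
pop1 = rng.multivariate_normal([2.0, 0.0], [[1.0, 0.3], [0.3, 1.0]], size=500)
pop2 = rng.multivariate_normal([-2.0, 0.0], [[1.0, -0.2], [-0.2, 1.5]], size=500)

m1, m2 = pop1.mean(axis=0), pop2.mean(axis=0)
Sigma1 = np.cov(pop1, rowvar=False)
Sigma2 = np.cov(pop2, rowvar=False)

M = Sigma1 + Sigma2
v = np.linalg.solve(M, m1 - m2)        # Fisher's separation vector v = M^{-1}(m1 - m2)
c = v @ (m1 + m2) / 2                  # threshold halfway between the projected means

# Classify: v^T f > c  ->  population 1.
pred1 = pop1 @ v > c
pred2 = pop2 @ v <= c
print("accuracy:", (pred1.sum() + pred2.sum()) / 1000)
```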
Neural networks will succeed by allowing separating surfaces that are not just planes.