Linear Algebra and Learning from Data: Deep Learning and Neural Nets
- Goal of a neural net: To construct a function that classifies the training data correctly, so it can generalize to unseen test data.
Learning function F
- One input v for each training sample. For the problem of identifying handwritten digits, each input sample is an image, a matrix of pixels
- Output: 0 to 9. Those ten numbers are the possible outputs from the learning function.
- Train a learning function on part of the MNIST set. Validate the function on unseen MNIST samples
Linear/affine learning function
The simplest learning function would be linear (affine, because of the bias term): $w = Av + b$
- The entries in the matrix A are the weights to be learned
- b is the bias vector, also to be learned
- Linearity is a very limiting requirement. What would be halfway between I and XIX? (Halfway between 1 and 19 is 10, but no linear function of the images of the Roman numerals I and XIX produces X.)
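A minimal sketch of this affine map (NumPy; the shapes assume MNIST's 28×28 images flattened to vectors in $R^{784}$, with ten outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 784)) * 0.01  # weight matrix: entries to be learned
b = np.zeros(10)                           # bias vector: also to be learned

def affine(v):
    """The simplest learning function: w = Av + b."""
    return A @ v + b

v = rng.standard_normal(784)          # stand-in for one flattened 28x28 image
w = affine(v)
predicted_digit = int(np.argmax(w))   # pick the largest of the ten outputs
```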
Neural network
$F(v) = L(R(L(R(\cdots Lv))))$
$Lv = Av + b$
R is ReLU: a nonlinear function applied to each component of $Lv$; $\mathrm{ReLU}(x) = \max(0, x)$ is the usual choice
$F(x,v)$ depends on the input $v$ and the weights $x$ (the A's and b's). $v_1 = \mathrm{ReLU}(A_1v + b_1)$ produces the first hidden layer of the neural network
The complete net starts with the input layer $v$ and ends with the output layer $w = F(v)$
$L_k(v_{k-1})= A_kv_{k-1}+b_k$
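Putting the pieces together, a minimal NumPy sketch of the forward pass (the layer widths are illustrative, not from the text):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)   # ReLU(x) = max(0, x), componentwise

rng = np.random.default_rng(0)
sizes = [784, 128, 64, 10]      # illustrative layer widths for MNIST
# One pair (A_k, b_k) per layer: L_k(v) = A_k v + b_k
params = [(rng.standard_normal((m, n)) * 0.01, np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]

def F(v, params):
    """F(v) = L(R(L(R(Lv)))): ReLU between layers, none after the last."""
    *hidden, (A_last, b_last) = params
    for A, b in hidden:
        v = relu(A @ v + b)        # v_k = ReLU(A_k v_{k-1} + b_k)
    return A_last @ v + b_last     # output layer: w = F(v)

w = F(rng.standard_normal(784), params)
```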
The goal is to minimize the total loss over all training samples by choosing the weights $A_k$ and $b_k$. Least squares is not the best loss function for deep learning; cross-entropy is the usual choice for classification (a sketch of both follows)
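A sketch of the two losses for one sample (names and shapes are illustrative):

```python
import numpy as np

def softmax(w):
    e = np.exp(w - w.max())            # shift for numerical stability
    return e / e.sum()

def squared_loss(w, one_hot_target):
    # least squares: compare raw outputs to a one-hot target
    return np.sum((w - one_hot_target) ** 2)

def cross_entropy(w, true_class):
    # cross-entropy: -log of the probability given to the true class
    return -np.log(softmax(w)[true_class])
```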
The number of weights $A_{ij}$ and $b_k$ is often larger than the number of training samples
For images, a CNN (convolutional neural network) is often appropriate: weights are shared, so the diagonals of A are constant (a Toeplitz structure; see the sketch below)
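To see the constant diagonals concretely, a 1-D sketch: convolving with three shared weights is multiplication by a Toeplitz matrix (the filter and size are illustrative):

```python
import numpy as np

w = np.array([1.0, -2.0, 1.0])   # three shared weights (illustrative filter)
n = 6
A = np.zeros((n, n))
for i in range(n):               # each row shifts the same three weights,
    for j, wj in enumerate(w):   # so every diagonal of A is constant
        if 0 <= i + j - 1 < n:
            A[i, i + j - 1] = wj

v = np.arange(n, dtype=float)
assert np.allclose(A @ v, np.convolve(v, w[::-1], mode="same"))
```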
Weights are optimized by stochastic gradient descent
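A minimal sketch of one stochastic gradient descent step, using the squared loss on a single affine layer so the gradients can be written by hand (the step size is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 784)) * 0.01
b = np.zeros(10)
lr = 0.01                           # illustrative learning rate

def sgd_step(v, target):
    """One SGD step on ||Av + b - target||^2 for a single sample."""
    global A, b
    g = 2.0 * (A @ v + b - target)  # gradient of the loss w.r.t. the output
    A -= lr * np.outer(g, v)        # dLoss/dA = g v^T
    b -= lr * g                     # dLoss/db = g
```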
Positive definite symmetric matrices S
Orthonormal eigenvectors $q_i$. Positive eigenvalues $\lambda_i$
$S = \lambda_1 q_1q_1^T + \lambda_2 q_2q_2^T + \cdots + \lambda_n q_nq_n^T$. Assuming $\lambda_1 > \lambda_2 > \cdots$, the first piece $\lambda_1 q_1q_1^T$ is the most informative part
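Checking the decomposition numerically (the matrix here is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
S = M @ M.T + 4 * np.eye(4)      # a symmetric positive definite matrix

lam, Q = np.linalg.eigh(S)       # ascending eigenvalues, orthonormal columns
lam, Q = lam[::-1], Q[:, ::-1]   # reorder so lambda_1 > lambda_2 > ...

# S = sum of lambda_i q_i q_i^T
S_rebuilt = sum(l * np.outer(q, q) for l, q in zip(lam, Q.T))
assert np.allclose(S, S_rebuilt)

rank_one = lam[0] * np.outer(Q[:, 0], Q[:, 0])  # the most informative piece
```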
Layers of F(x, v)
The nonlinearity enters through the ramp function $\mathrm{ReLU}(x) = \max(0, x)$ applied between the linear layers
Start with fully connected layers - every neuron on layer N connects to every neuron on layer N + 1.
A CNN repeats the same weights around all pixels in an image
A pooling layer reduces dimensions
Dropout randomly leaves out neurons
Batch normalization resets the mean and variance of each layer's inputs (a sketch of pooling, dropout, and batch normalization follows this list)
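A minimal NumPy sketch of these three layer types (the block size, dropout rate, and epsilon are illustrative):

```python
import numpy as np

def max_pool_2x2(x):
    """Pooling: keep the max of each 2x2 block, halving both dimensions."""
    h, w = x.shape
    x = x[:h - h % 2, :w - w % 2]   # trim odd edges
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def dropout(x, rate, rng):
    """Dropout: randomly zero a fraction `rate` of entries (training only)."""
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)  # rescale so the expected value is unchanged

def batch_norm(x, eps=1e-5):
    """Batch normalization: reset each column to mean 0, variance 1."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)
```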