• Goal of a neural net: construct a function that classifies the training data correctly and generalizes to unseen test data.

Learning function F

  • One input v for each training sample. For the problem of identifying handwritten digits, each input sample is an image: a matrix of pixels
  • Output: 0 to 9. Those ten numbers are the possible outputs from the learning function.
  • Train the learning function on part of the MNIST set; validate it on unseen MNIST samples (a minimal loading sketch follows)
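
A minimal sketch of this train/validate split. It assumes the Keras copy of MNIST is available; any other MNIST loader works the same way.

```python
# Minimal sketch: load MNIST and keep a held-out set of unseen samples.
# Assumes TensorFlow/Keras is installed; any MNIST source would do.
import numpy as np
from tensorflow.keras.datasets import mnist

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Each input v is a 28x28 image of pixels, flattened into a vector of length 784.
train_v = train_images.reshape(len(train_images), -1).astype(np.float32) / 255.0
test_v = test_images.reshape(len(test_images), -1).astype(np.float32) / 255.0

print(train_v.shape, test_v.shape)   # (60000, 784) (10000, 784)
print(np.unique(train_labels))       # the ten possible outputs 0..9
```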

Linear/affine learning function

The simplest learning function would be linear (really affine): $w = Av + b$ (a minimal sketch follows the bullets below)

  • The entries in the matrix A are the weights to be learned
  • b is the bias vector and is to be learned
  • Linearity is a very limiting requirement. What would be halfway between I and XIX?
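
A minimal sketch of the affine map, with $A$ and $b$ as hypothetical randomly initialized placeholders rather than learned weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# Affine learning function w = A v + b for a flattened 28x28 MNIST image.
# A and b are the weights to be learned; here they are random placeholders.
A = rng.standard_normal((10, 784)) * 0.01   # 10 outputs (digits), 784 pixel inputs
b = np.zeros(10)                            # bias vector

v = rng.random(784)                         # one (fake) input image
w = A @ v + b                               # ten scores, one per digit
predicted_digit = int(np.argmax(w))
```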

Neural network

$F(v) = L(R(L(R(\dots Lv))))$

$Lv = Av + b$

R is a nonlinear function applied to each component of $Lv$; the usual choice is $\mathrm{ReLU}(x) = \max(0, x)$

$F(x,v)$ depends on the input $v$ and the weights $x$ (the $A$'s and $b$'s). $v_1 = \mathrm{ReLU}(A_1 v + b_1)$ produces the first hidden layer of the neural network

The complete net starts with the input layer $v$ and ends with the output layer $w = F(v)$

$L_k(v_{k-1})= A_kv_{k-1}+b_k$
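
A minimal sketch of this layered forward pass, assuming two hidden layers with hypothetical widths:

```python
import numpy as np

def relu(x):
    # ReLU(x) = max(0, x), applied componentwise
    return np.maximum(0.0, x)

def forward(v, weights):
    # weights is a list of (A_k, b_k) pairs; ReLU is applied after every
    # layer except the last, so the output layer stays affine: w = A v + b.
    for A, b in weights[:-1]:
        v = relu(A @ v + b)
    A, b = weights[-1]
    return A @ v + b

rng = np.random.default_rng(0)
sizes = [784, 128, 64, 10]                 # hypothetical layer widths
weights = [(rng.standard_normal((m, n)) * 0.01, np.zeros(m))
           for n, m in zip(sizes[:-1], sizes[1:])]

w = forward(rng.random(784), weights)      # ten output scores F(x, v)
```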

The goal is to choose the weights $A_k$ and $b_k$ that minimize the total loss over all training samples; least squares is not the best loss function for deep learning (cross-entropy is usually preferred for classification)
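
A sketch comparing squared loss with a cross-entropy loss for the ten digit scores; the softmax step and the example scores below are illustrative:

```python
import numpy as np

def squared_loss(w, target_onehot):
    # Least squares: compare the raw outputs to a one-hot target.
    return 0.5 * np.sum((w - target_onehot) ** 2)

def cross_entropy_loss(w, target_digit):
    # Softmax turns the ten scores into probabilities; cross-entropy
    # penalizes a low probability on the correct digit.
    p = np.exp(w - np.max(w))
    p /= p.sum()
    return -np.log(p[target_digit])

w = np.array([0.1, 2.0, -1.0, 0.3, 0.0, 0.0, 0.5, -0.2, 0.1, 0.0])
onehot = np.eye(10)[1]
print(squared_loss(w, onehot), cross_entropy_loss(w, 1))
```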

The number of weights $A_{ij}$ and $b_k$ is often larger than the number of training samples

For images, a CNN (convolutional neural network) is often appropriate; weights are shared, so the diagonals of $A$ are constant

Weights are optimized by stochastic gradient descent
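
A minimal sketch of one stochastic gradient descent step for the single affine layer $w = Av + b$ with squared loss; the learning rate and batch size are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 784)) * 0.01
b = np.zeros(10)
lr = 0.1                                   # hypothetical learning rate

def sgd_step(v_batch, y_batch):
    # One SGD step on a small random batch, for the squared loss
    # 0.5 * ||A v + b - y||^2 summed over the batch.
    global A, b
    grad_A = np.zeros_like(A)
    grad_b = np.zeros_like(b)
    for v, y in zip(v_batch, y_batch):
        err = A @ v + b - y                # derivative of the squared loss
        grad_A += np.outer(err, v)
        grad_b += err
    A -= lr * grad_A / len(v_batch)
    b -= lr * grad_b / len(v_batch)

v_batch = rng.random((32, 784))            # hypothetical mini-batch
y_batch = np.eye(10)[rng.integers(0, 10, 32)]
sgd_step(v_batch, y_batch)
```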

Positive definite symmetric matrices S

Orthogonal eigenvectors $q_i$ and positive eigenvalues $\lambda_i$

$S = \lambda_1 q_1 q_1^T + \lambda_2 q_2 q_2^T + \cdots$. Assuming $\lambda_1 > \lambda_2 > \cdots$, the term $\lambda_1 q_1 q_1^T$ is the most informative part (the best rank-one approximation of $S$)
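
A sketch of this spectral decomposition with NumPy, rebuilding $S$ from its eigenvectors and pulling out the leading rank-one piece; the example matrix is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
S = B @ B.T + 5 * np.eye(5)                # symmetric positive definite

# eigh returns eigenvalues in ascending order with orthonormal eigenvectors.
eigenvalues, Q = np.linalg.eigh(S)
order = np.argsort(eigenvalues)[::-1]      # sort so lambda_1 >= lambda_2 >= ...
eigenvalues, Q = eigenvalues[order], Q[:, order]

# Rebuild S as lambda_1 q1 q1^T + lambda_2 q2 q2^T + ...
S_rebuilt = sum(lam * np.outer(q, q) for lam, q in zip(eigenvalues, Q.T))
assert np.allclose(S, S_rebuilt)

rank_one = eigenvalues[0] * np.outer(Q[:, 0], Q[:, 0])   # most informative part
```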

Layers of F(x, v)

The weights $x$ sit inside the ramp function ReLU: each hidden layer applies ReLU to $A_k v_{k-1} + b_k$

Start with fully connected layers: every neuron on layer N connects to every neuron on layer N + 1.

A CNN repeats the same weights around all pixels of an image
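
A one-dimensional sketch of that weight sharing: a convolution with a 3-weight filter is multiplication by a matrix whose diagonals are constant, the same weights repeated around all pixels. The filter values here are illustrative.

```python
import numpy as np

filter_weights = np.array([1.0, -2.0, 1.0])   # three shared weights (illustrative)
n = 8                                          # number of "pixels" in a 1-D signal

# Build the matrix A for this convolution: each row shifts the same three
# weights one position to the right, so every diagonal of A is constant.
A = np.zeros((n - 2, n))
for i in range(n - 2):
    A[i, i:i + 3] = filter_weights

v = np.arange(n, dtype=float) ** 2
assert np.allclose(A @ v, np.convolve(v, filter_weights[::-1], mode="valid"))
```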

A pooling layer reduces dimensions (e.g., max pooling keeps the largest value in each small block)
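
A sketch of 2x2 max pooling, which halves each spatial dimension:

```python
import numpy as np

def max_pool_2x2(image):
    # Take the maximum over each non-overlapping 2x2 block.
    h, w = image.shape
    blocks = image[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

image = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(image))       # a 2x2 output: dimensions are halved
```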

Dropout randomly leaves out neurons during training
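
A sketch of dropout on one hidden layer, with a hypothetical drop probability of 0.5; the surviving neurons are rescaled so the expected activation is unchanged (inverted dropout):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(hidden, p_drop=0.5, training=True):
    # Randomly zero out each neuron with probability p_drop during training;
    # rescale the survivors so the expected activation is unchanged.
    if not training:
        return hidden
    mask = rng.random(hidden.shape) >= p_drop
    return hidden * mask / (1.0 - p_drop)

hidden = rng.standard_normal(8)
print(dropout(hidden))           # roughly half the entries are set to zero
```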

Batch normalization resets the mean and variance of each layer's inputs
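
A sketch of batch normalization over a mini-batch: each feature is reset to zero mean and unit variance, then rescaled; gamma, beta, and eps are the usual extra parameters, shown here with illustrative defaults.

```python
import numpy as np

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    # batch has shape (batch_size, features); normalize each feature
    # to zero mean and unit variance, then rescale and shift.
    mean = batch.mean(axis=0)
    var = batch.var(axis=0)
    normalized = (batch - mean) / np.sqrt(var + eps)
    return gamma * normalized + beta

rng = np.random.default_rng(0)
batch = 3.0 * rng.standard_normal((32, 5)) + 7.0
out = batch_norm(batch)
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # ~0 and ~1
```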