There are a great many points in the training data, but there are far more weights to be computed in a deep network. The art of deep learning is to find, among many possible solutions, one that will generalize to new data.

The learning function is F(x, v). The training data is given by a set of feature vectors v, and the weights that allow F to classify that data are in the vector x. To optimize F, gradient descent needs the derivatives dF/dx. The weights x are the matrices A_1…A_L and bias vectors b_1…b_L. The sample data enters as v = v_0 and the output is w = v_L.

Real codes use automatic differentiation (AD) for backpropagation. Each hidden layer, with its optimized weights, learns more about the data and the population from which it comes, in order to classify new and unseen data from the same population.

The Functions of Deep Learning

So we start with M different images (the training set). An image is a set of p small pixels, or a vector v, where v_i is the greyscale value of the i-th pixel. For every v in that training set we know the digit it represents.

Continuous Piecewise Linear (CPL) functions. Linear for simplicity, continuous to model an unknown but reasonable rule, and piecewise to achieve the nonlinearity that is an absolute requirement for real images and data.

If A_1 is q by p, the input space R^p is sliced by q hyperplanes into r pieces.

Bias vs. Variance: Underfit vs. Overfit

A training set contains N vectors with m components each (the m features of each sample). For each of those N points in R^m, we are given y_i. We want to learn $y_i = f(x_i) + \epsilon$, where the noise $\epsilon$ has mean zero.

Our learning algorithm actually finds a function F(x) close to f(x)

The bias-variance tradeoff: high bias comes from underfitting, high variance from overfitting.

bias $= E[f(x) - F(x)]$, variance $= E[F(x)^2] - E[F(x)]^2$.
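For a single input x, the expected squared error then splits into these two terms plus noise (a standard decomposition, stated here as a sketch under the assumption that $\epsilon$ is independent of the learned F):

$$E\big[(y - F(x))^2\big] = \underbrace{\big(E[f(x) - F(x)]\big)^2}_{\text{bias}^2} \;+\; \underbrace{E[F(x)^2] - E[F(x)]^2}_{\text{variance}} \;+\; \underbrace{\sigma^2}_{\text{noise}}$$

where $\sigma^2$ is the variance of $\epsilon$ and the expectation is over the random training set (which determines F) and the noise.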

The Construction of Deep Neural Networks

Create a learning function F(x, v) with weights x that capture information from the training data v:

  • Key operation: Composition F = F_3(F_2(F_1(x, v)))
  • Key rule: Chain rule for the x-derivatives of F
  • Key algorithm: Stochastic gradient descent to find the best weights x
  • Key subroutine: Backpropagation to execute the chain rule and find the x-derivatives of F, to solve $\nabla F = 0$
  • Key nonlinearity: The ramp function ReLU(y) = max(y, 0). This makes the learning function F continuous and piecewise linear in v.
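A minimal numpy sketch of this construction with one internal layer, so F(x, v) = A_2 ReLU(A_1 v + b_1) + b_2; the sizes p = 4 and q = 8 are made up for illustration:

```python
import numpy as np

def relu(y):
    # Key nonlinearity: the ramp function ReLU(y) = max(y, 0)
    return np.maximum(y, 0.0)

def F(x, v0):
    # x holds the weights: matrices A1, A2 and bias vectors b1, b2
    A1, b1, A2, b2 = x
    v1 = relu(A1 @ v0 + b1)   # hidden layer, continuous piecewise linear in v0
    v2 = A2 @ v1 + b2         # output layer (linear)
    return v2

rng = np.random.default_rng(0)
p, q = 4, 8                   # p features in, q hidden neurons, one output
x = (rng.standard_normal((q, p)), np.zeros(q),
     rng.standard_normal((1, q)), np.zeros(1))
v0 = rng.standard_normal(p)
print(F(x, v0))               # classification would use the sign of this output
```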

One Internal Layer L = 2

For a classification problem each sample v_0 of the training data is assigned 1 or -1. We want the output v_2 to have that correct sign (most of the time). For a regression problem we use the numerical value (not just the sign) of v_2

Machine learning doesn’t aim to capture every detail of the numbers 0, 1, 2 … , 9. It just aims to capture enough information to decide correctly which number it is.

The Initial Weights in Gradient Descent

A proper choice of the net and of the initial x_0, with random (and independent) weights, meets two requirements:

  • x_0 has a carefully chosen variance $\sigma^2$, which controls the mean of the computed weights.
  • The hidden layers in the neural net have enough neurons (not too narrow).

Many-layered depth can reduce the loss on the training set. But if $\sigma^2$ is wrong or width is sacrificed, then gradient descent can lose control of the weights. They can explode to infinity or implode to zero.

The good choice is $\sigma^2 = 2/\text{fan-in}$, where the fan-in is the maximum number of inputs to a neuron.
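A sketch of that rule (often called He initialization) in numpy; the layer widths below are made up for illustration:

```python
import numpy as np

def init_weights(fan_in, fan_out, rng):
    # Variance 2 / fan-in keeps signal sizes roughly constant from layer to layer
    sigma = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, sigma, size=(fan_out, fan_in))

rng = np.random.default_rng(0)
widths = [784, 256, 128, 10]                   # illustrative layer widths
A = [init_weights(m, n, rng) for m, n in zip(widths, widths[1:])]
print([a.shape for a in A])
print([round(float(a.std()), 4) for a in A])   # close to sqrt(2 / fan-in)
```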

The key point: Deep learning can go wrong if it doesn’t start right.

Stride and Subsampling

Goal: Reduce the dimension from 128 to 64.

  • In two steps: Multiply the 128-component vector v by A, and then discard the odd-numbered components of the output. This is filtering followed by subsampling.
  • In one step: Discard the odd-numbered rows of the matrix A. The new matrix A_2 becomes short and wide: 64 rows and 128 columns. The “stride” of the filter is now 2. Now multiply the 128-component vector v by A_2. (Both routes give the same 64 outputs; see the check after this list.)
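A quick numpy check that the two routes agree, using a stand-in random matrix A (the sizes 128 and 64 come from the goal above):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((128, 128))   # stand-in for the filter matrix A
v = rng.standard_normal(128)

# Two steps: filter with A, then discard the odd-numbered output components
y_two_steps = (A @ v)[::2]

# One step: discard the odd-numbered rows of A first (stride 2), then multiply
A2 = A[::2, :]                        # short and wide: 64 rows, 128 columns
y_one_step = A2 @ v

print(np.allclose(y_two_steps, y_one_step))   # True: the 64 outputs agree
```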

Max-pooling

Multiply the 128-component vector v by A, as before. Then from each even-odd pair of outputs, keep the maximum.

For an image (a 2-dimensional signal) we might use max-pooling over every 2 by 2 square of pixels. This speeds up the training, since the number of neurons on a hidden layer is divided by 4.

Pooling also reduces the possibility of overfitting. Average pooling would keep the average of the numbers in each pool: now the pooling layer is linear.
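A small numpy sketch of max-pooling (over even-odd pairs and over 2 by 2 squares) and of linear average pooling; the array sizes are illustrative:

```python
import numpy as np

def max_pool_pairs(y):
    # From each even-odd pair of outputs, keep the maximum
    return y.reshape(-1, 2).max(axis=1)

def max_pool_2x2(image):
    # Max over every 2-by-2 square of pixels: divides the neuron count by 4
    H, W = image.shape
    return image.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

def avg_pool_2x2(image):
    # Average pooling keeps the mean of each pool, so this layer is linear
    H, W = image.shape
    return image.reshape(H // 2, 2, W // 2, 2).mean(axis=(1, 3))

print(max_pool_pairs(np.array([3., 1., 4., 1., 5., 9.])))   # [3. 4. 9.]
img = np.arange(16.0).reshape(4, 4)
print(max_pool_2x2(img))   # [[ 5.  7.] [13. 15.]]
print(avg_pool_2x2(img))   # [[ 2.5  4.5] [10.5 12.5]]
```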

The Graph of the Learning Function F(v)

The graph of F(v) is a surface made up of many, many flat pieces. They are planes or hyperplanes that fit together along all the folds where ReLU produced a change of slope. This is like origami, except that this graph has flat pieces going to infinity.

That polygon separating blue from orange (or plus from minus: this is classification) is the analog of a separating hyperplane in a Support Vector Machine. If we were limited to linear functions, with only a straight line to separate a blue ball from an orange ring around it, separation would be impossible.

Counting Flat Pieces in the Graph

Suppose v_0 has m components and $A_1 v_0 + b_1$ has N components. We have N linear functions of v_0. Each of those linear functions is zero along a hyperplane (dimension m - 1) in R^m. When we apply ReLU to that linear function it becomes piecewise linear, with a fold along that hyperplane. On one side of the fold its graph is sloping; on the other side (where the linear function was negative) ReLU makes the function zero.
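The count of flat pieces has a standard formula for N fold hyperplanes in general position in R^m (not written out above, so treat this as a sketch): r(N, m) = C(N, 0) + C(N, 1) + … + C(N, m). In code:

```python
from math import comb

def count_pieces(N, m):
    # Regions cut out of R^m by N hyperplanes in general position:
    # r(N, m) = C(N, 0) + C(N, 1) + ... + C(N, m)
    return sum(comb(N, k) for k in range(min(N, m) + 1))

# Example: N = 20 folds (20 ReLU neurons) acting on m = 2 inputs
print(count_pieces(20, 2))   # 1 + 20 + 190 = 211 flat pieces
```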

Convolutional Neural Nets

That fully connected net will be extremely inefficient for image recognition, because it looks for connections between faraway pixels. Almost always, the important connections in an image are local.

The search for structure is essentially the same everywhere in the image. There is normally no reason to process one part of a text or image or video differently from other parts. We can use the same weights in all parts: Share the weights. The neural net of local connections between pixels is shift-invariant: the same everywhere.

Suppose each neuron is connected to only E neurons on the next layer, and those connections are the same for all neurons. Then the matrix A between those layers has only E independent weights x.

We have time to create several different channels, each with its own E or E^2 weights. They can look for edges in different directions (horizontal, vertical, and diagonal).

In one dimension, a banded shift-invariant matrix is a Toeplitz matrix or a filter. Multiplication by that matrix A is a convolution x * v.

A Toeplitz matrix is a type of structured matrix where each descending diagonal from left to right is constant. That means all the elements along any diagonal parallel to the main diagonal are the same.

Each shift has a diagonal of 1's: $A = x_1 L + x_0 C + x_{-1} R$ (assuming stride S = 1), where L, C, R are the left shift, center (identity), and right shift matrices.

$\partial(Av)/\partial x_1 = Lv$
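A numpy sketch of $A = x_1 L + x_0 C + x_{-1} R$ and of that derivative rule; the shift convention (L on the superdiagonal) and the zero boundary are assumptions made for this illustration:

```python
import numpy as np

n = 6
L = np.diag(np.ones(n - 1), k=1)    # left shift:  (Lv)_i = v_{i+1}
C = np.eye(n)                       # center: the identity (no shift)
R = np.diag(np.ones(n - 1), k=-1)   # right shift: (Rv)_i = v_{i-1}

x1, x0, xm1 = 0.5, -1.0, 0.25
A = x1 * L + x0 * C + xm1 * R       # banded Toeplitz (filter) matrix, stride 1

v = np.arange(1.0, n + 1.0)
print(A @ v)                        # the convolution x * v (zero boundary)

# Av is linear in x1, so the derivative d(Av)/dx1 is exactly Lv
h = 1e-6
A_plus = (x1 + h) * L + x0 * C + xm1 * R
print(np.allclose((A_plus @ v - A @ v) / h, L @ v))   # True
```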

Convolutions in Two Dimensions

When the input v is an image, the convolution with x becomes two-dimensional. The 1-D weights $x_1, x_0, x_{-1}$ change to $E^2$ independent weights (here $E^2 = 9$). If v represents $(N+2)^2$ pixels, the output has only $N^2$ pixels.

The 2D convolution is $A = x_{1,1}LU + x_{0,1}CU + x_{-1,1}RU + \cdots + x_{-1,-1}RD$, combining the horizontal shifts (L, C, R) with the vertical shifts (U, D, and center).

These nine derivatives are computed inside backpropagation.
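A sketch of the 2D case with E = 3: each output pixel is the dot product of the nine weights with a 3 by 3 window of the image (written as an explicit loop for clarity; the sizes are illustrative):

```python
import numpy as np

def conv2d_valid(v, x):
    # Each output pixel is the dot product of the E-by-E weights x
    # with the matching window of the image v (no padding)
    E = x.shape[0]
    N = v.shape[0] - E + 1
    out = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            out[i, j] = np.sum(x * v[i:i+E, j:j+E])
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 3))    # E^2 = 9 independent weights
v = rng.standard_normal((6, 6))    # (N+2)^2 pixels with N = 4
print(conv2d_valid(v, x).shape)    # (4, 4): the output has N^2 pixels
```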

CNN’s can readily afford to have B parallel channels (and that number B can vary as we go deeper into the net).

A convolution is a combination of shift matrices (producing a filter or Toeplitz matrix)

A cyclic convolution is a combination of cyclic shifts (producing a circulant matrix)
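A quick numpy check that a combination of cyclic shifts (a circulant matrix) performs a cyclic convolution; here np.roll stands in for the cyclic shift:

```python
import numpy as np

n = 5
Lc = np.roll(np.eye(n), 1, axis=1)   # cyclic left shift:  (Lc v)_i = v_{(i+1) mod n}
Rc = Lc.T                            # cyclic right shift: (Rc v)_i = v_{(i-1) mod n}

x1, x0, xm1 = 0.5, -1.0, 0.25
A_cyclic = x1 * Lc + x0 * np.eye(n) + xm1 * Rc   # circulant matrix

v = np.arange(1.0, n + 1.0)
# Cyclic convolution by hand: y_i = x1*v_{i+1} + x0*v_i + xm1*v_{i-1}, indices mod n
y_check = x1 * np.roll(v, -1) + x0 * v + xm1 * np.roll(v, 1)
print(np.allclose(A_cyclic @ v, y_check))        # True
```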

In deep learning, the coefficients in the combination will be the “weights” to be learned.

In two dimensions (for images) the matrix A is block Toeplitz. Each small block is E by E

The same weights are used around all pixels (shift-invariance). The matrix produces a 2D convolution. Frequently A is called a filter

The dot products between a smooth function and a moving filter window will be smooth. But when an edge in the image lines up with a diagonal wall, we see a spike.

The difficulty with two or more dimensions is that edges can have many directions. We will need horizontal and vertical and diagonal filters for the test images. And filters have many purposes, including smoothing, gradient detection, and edge detection.

  • Smoothing. For a 2D function f, the natural smoother is convolution with a Gaussian. The filter removes noise (at a price in sharp edges).
  • Gradient detection. Image processing (as distinct from learning by a CNN) needs filters that detect the gradient. They contain specially chosen weights. We mention some simple filters just to indicate how they can find gradients, the first derivatives of f (see the sketch after this list).
  • Edge detection. We don't want smoothing, which would blur the edge. The good filters become Laplacians of Gaussians.
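As one concrete example of a gradient-detecting filter (the Sobel filter, an illustrative choice rather than one prescribed above):

```python
import numpy as np

# Sobel filters: specially chosen weights that approximate first derivatives of f
sobel_x = np.array([[-1., 0., 1.],
                    [-2., 0., 2.],
                    [-1., 0., 1.]])   # responds to horizontal change (vertical edges)
sobel_y = sobel_x.T                   # responds to vertical change (horizontal edges)

def correlate_valid(img, k):
    E = k.shape[0]
    N = img.shape[0] - E + 1
    return np.array([[np.sum(k * img[i:i+E, j:j+E]) for j in range(N)]
                     for i in range(N)])

# A tiny image with a vertical edge: dark left half, bright right half
img = np.zeros((6, 6))
img[:, 3:] = 1.0
print(correlate_valid(img, sobel_x))  # large values along the edge
print(correlate_valid(img, sobel_y))  # all zeros: no horizontal edge
```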

The Stride of a Convolutional Filter

For a larger stride, the moving window takes longer steps as it moves across the image.

The length of the output y = Av is reduced by a factor of S (the stride).

Nonzero weights like x_1 are S columns apart for stride S.

Extending the Signal

Instead of losing neurons at the edges of the image when A is not square, we can extend the input layer. We are “inventing” components beyond the image boundary. Then the output y = Av has equal dimensions for input and output.

For periodic signals, zero-padding is replaced by wraparound.

A more accurate choice is to go beyond the boundary by reflection.

This is what CNN’s usually do: Add more channels of weight matrices A to capture more features of the training sample.

Counting the Number of Inputs and Outputs

In a one-dimensional problem, suppose a layer has N neurons. We apply a convolutional matrix with E nonzero weights. The stride is S, and we pad the input signal by P zeros at each end. How many outputs (M numbers) does this filter produce?

Karpathy's formula: $M = \dfrac{N - E + 2P}{S} + 1$

In a 2D or 3D problem, this 1D formula applies in each direction.

This case 2P = E - 1 with stride S = 1 is the most common architecture for CNN’s.
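A small check of the formula with illustrative numbers:

```python
def output_size(N, E, S, P):
    # Karpathy's formula: M = (N - E + 2P) / S + 1 outputs
    return (N - E + 2 * P) // S + 1

print(output_size(N=128, E=3, S=1, P=1))   # 128: with 2P = E - 1 and S = 1, M = N
print(output_size(N=128, E=3, S=2, P=1))   # 64: stride 2 halves the output length
```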

It is very common to end a CNN with fully-connected layers.

Softmax Outputs for Multiclass Networks

“Softmax” replaces the two-output case of logistic regression. We are turning n numbers into probabilities.
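The formula is $p_i = e^{y_i} / \sum_j e^{y_j}$. A minimal numpy sketch (subtracting max(y) is a standard numerical-stability step that does not change the result):

```python
import numpy as np

def softmax(y):
    # Turn n numbers y_1 ... y_n into probabilities p_i = e^{y_i} / sum_j e^{y_j}
    e = np.exp(y - np.max(y))   # shift by max(y) for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, -1.0, 0.5])
p = softmax(scores)
print(p, p.sum())   # nonnegative, summing to 1; the largest score gets the largest p
```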

For CNN’s in computer vision, the final layer often has a special form. If the previous layers used ReLU and max-pooling (both piecewise linear), the last step can become a difference-of-convex program, and eventually a multiclass Support Vector Machine (SVM). Then optimization of the weights in a piecewise linear CNN can be one layer at a time.

Residual Networks

But depth brings dangers. Information can jam up and never reach the output. The problem of “vanishing gradients” can be serious: so many multiplications are compounded in propagating so far that the computed gradients become exponentially small. When it is well designed, depth is a good thing, but you must create paths for learning to move forward.

The remedy is “skip connections” that go directly to the next layer.

With those skip connections, entire layers can be removed without significant impact.
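A sketch of a residual layer with a skip connection (the shapes and names are illustrative): the identity path carries v forward even when the learned layer contributes nothing.

```python
import numpy as np

def relu(y):
    return np.maximum(y, 0.0)

def residual_layer(v, A, b):
    # Skip connection: identity path v plus the learned path ReLU(Av + b)
    return v + relu(A @ v + b)

rng = np.random.default_rng(0)
n = 8
v = rng.standard_normal(n)
A, b = np.zeros((n, n)), np.zeros(n)            # a "removed" layer: zero weights
print(np.allclose(residual_layer(v, A, b), v))  # True: v still flows forward
```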