VII.3 Backpropagation and the Chain Rule

system of equations to solve for the weights that minimize L: The partial derivatives of L with respect to the weights x should be zero.

Backpropagation is a method to compute derivatives quickly, using the chain rule

$ dF/dx = \frac{d}{dx} \left( F_2(F_1(x)) \right) = \frac{dF_2}{dx_1} \left(F_1(x) \right) \frac{dF_1}{dx} \left (x \right) $

For a standard net, the steps in the chain rule can correspond to layers in the neural net.

Backpropagation has been discovered many times. Another name is automatic differentiation (AD). You will see that the steps can be arranged in two basic ways: forward-mode and backward-mode. The right choice of mode can make a large difference · in the cost (a factor of thousands). That choice depends on whether you have many functions F depending on a few inputs, or few functions F depending on many inputs.

Deep learning has basically one loss function depending on many weights. The right choice is “backward-mode AD”. This is what we call backpropagation.

Derivatives of the Learning Function F

find dw_i/db_j and dw_i/dA_{jk} for all components of b + Av. When j is different from i, the ith output w_i is not affected by b_j or A_{jk}

Columns of I are 1-hot vectors.

$ dw_i/db_j = \sigma _{ij}$

$dw_i/dA_{jk} = \sigma _{ij}v_k$

Combining Weights band A into M

It is often convenient to combine the bias vector b and the matrix A into one matrix M. Matrix of weights.

For each layer of the neural net, the top entry (the zeroth entry) is now fixed at 1. After multiplying that layer by M, the zeroth component on the next layer is still 1. Then ReLU(l) = 1 preserves that entry in every hidden layer.

$dw_i/dM_{jk} = \sigma _{ij}v_k$

Derivatives for Hidden Layers

$dw/dA_1 = d(A_2R(b_1 + A_1v_0))/dA_1 = A_2R’(b_1 + A_1v_0)d(b_1 + A_1v_0)/dA_1$

Automatic backpropagation go backwards from w to v

ReLU is a limiting case of an S-shaped sigmoid function

Delta function: $\delta (x) = d^2R/dx^2$. The delta function represents an impulse. It models a finite change in an infinitesimal time. It is physically impossible but mathematically convenient.

With ReLU, a neuron could stay at zero through all the steps of deep learning. This “dying ReLU” can be avoided in several ways-it is generally not a major problem. One way that firmly avoids it is to change to a Leaky ReLU with a nonzero gradient

Computational Graphs

Forward mode requires a new graph for each input x_i, to compute dF/dx_i

The reverse mode starts with the output F. It computes the derivatives with respect to both inputs. The computations go backward through the graph.

Instead of N forward graphs from N inputs, we will have one backward graph from one output. It finds the derivative ofF with respect to every node. It starts with dF/dF = 1

The reverse mode finds all derivatives dF/dx_i by following all chains backward from output to input. Those chains all appear as paths on one graph-not as separate chain rules for exponentially many possible paths. This is the success of reverse mode.

The derivatives of F with respect to the matrices A (and the bias vectors b) are easiest for the last matrix A_L in A_Lv_{L-1} dF_i/dA_{jk} = \delta {ij}v{k} Next is the derivative A_LReLU(A_{L-1}v_{L-1}) with respect to A_{L_1}

Adjoint Methods

Hyperparameters

The stepsize/learning rate s_k in gradient descent is first and foremost: accelerated (by momentum) or adaptive (ADAM) or stochastic with a random minibatch of training data at each step k.

line search

No method is perfect. s_k is too small Then gradient descent takes too long. s_k is too large Gradient descent will jump around the minimizing x^*

After reaching weights x that are close to minimizing the loss function L( x, v) we may want to bring new v ‘s from a validation set. This is not yet production mode. The purpose of cross-validation is to confirm that the computed weights x are capable of producing accurate outputs from new data.

Cross-validation

A first step in cross-validation is to divide the available data into K subsets. If K = 2, these would essentially be the training set and test set-but we are usually aiming for more information from smaller sets before working with a big test set. K-fold cross-validation uses each of K subsets separately as a test set. In every trial, the other K - 1 subsets form the training set. We are reworking the same data (moderate size) to learn more than one optimization can teach us.

Batch Normalization of Each Layer

As training goes forward, the mean and variance of the original population can change at every layer of the network. This change in the distribution of inputs is “covariate shift”. We often have to adjust the stepsize and other hyperparameters, due to this shift in the statistics of layers. A good plan is to normalize the input to each layer.

Normalization makes the training safer and faster. The need for dropout often disappears. Fewer iterations can now give more accurate weights. And the cost can be very moderate. Often we just train two additional parameters on each layer.

The key point is to normalize the inputs y_i to each new layer. What was good for thel, original batch of vectors (at layer zero) is also good for the inputs to each hidden layer.

Dropout

Dropout is the removal of randomly selected neurons in the network. Typically hidden layer neurons might be given probability p = 0.5 of surviving, and input components might have p = 0.8 or higher. The main objective of random dropout is to avoid overfitting.

In training, each new v_0 (the feature vector of an input sample) leads to a new thinned network.

At test time, we use the full network (no dropout) with weights rescaled from the training weights. The outgoing weights from an undropped neuron are multiplied by p in the rescaling.

To compute gradients, use backpropagation for each training example in the minibatch. Then average those gradients. Stochastic gradient descent can still include acceleration (momentum added) and adaptive descent and weight decay.

LeCun emphasizes that for a multiparameter search, random sampling is the way to cover many possibilities quickly. Grid search is too slow in multiple dimensions.

Loss Functions

Quadratic cost(square loss): is not a favorite choice for deep learning. One reason is the parabolic shape for the graph of l(w, v), as we approach zero loss at w = y. The derivative also approaches zero.

A zero derivative at the minimum is normal for a smooth loss function, but it frequently leads to an unwanted result: The weights A and b change very slowly near the optimum. Learning slows down and many iterations are needed.

Cross-entropy loss: expect that the N outputs w_i from training the neural net have been normalized to z(w), with z_i in (0, 1)

Shannon’s formula for entropy

Kullback-Leibler (KL) divergence.

Regularization

It adds a penalty term to the loss function L(x)

The penalty controls the size of x. Regularization is also called weight decay.

The coefficient \lambda is a hyperparameter. Its value can be based on cross-validation. The purpose of the penalty terms is to avoid overfitting (sometimes expressed as fitting the noise). Cross-validation for a given lambda finds the minimizing x on a test set. Then it checks by using those weights on a training set. If it sees errors from overfitting, lambda is increased.

A small value of lambda tends to increase the variance of the error : overfitting. Large lambda will increase the bias: underfitting because the fitting term

b-Ax

^2 is less important

Recent experiments on MNIST make it unclear if explicit regularization is always necessary.

The Structure of AlphaGo Zero

A convolution of 256 filters of kernel size 3 x 3 with stride 1 : E = 3, S = 1
Batch normalization
ReLU
A convolution of 256 filters of kernel size 3 x 3 with stride 1
Batch normalization
A skip connection as in ResNets that adds the input t? the block
ReLU
A fully connected linear layer to a hidden layer of size 256
ReLU

Training was by stochastic gradient descent on a fixed data set that contained the final 2 million games of self-played data from a previous run of AlphaGo Zero. The CNN includes a fully connected layer that outputs a vector of size 19^2 + 1. This accounts for all positions on the 19 x 19 board, plus a pass move allowed in Go.

Recurrent Neural Networks

appropriate for data that comes in a definite order. This includes time series and natural language: speech or text or handwriting. In the network of connections from inputs v to outputs w, the new feature is the input from the previous time t - 1. This recurring input is determined by the function h(t- 1).

The inputs to h(t) are the new data v(t) and the recurrent data h( t- 1) from the previous time. The weights multiplying the data are x_in, x_recur, and x_out

Google translate

Instead of programming every word and grammatical rule and exception in both languages, let the computer find the rules. Just give it enough correct translations. If we were recognizing images, the inputs would be many examples with correct labels (the training set). The machine creates the function F(x, v).