Part VI: Optimization
Gradient Descent Toward the Minimum
Minimize a function $f(x_l … , x_n)$ - all the first derivatives are 0 at the minimum. If we have n = 20 unknowns (a small number in deep learning) then minimizing one function f produces 20 equations
The steepest direction, in which f *x) decreases fastest, is given by the gradient $-\nabla f$
$\nabla f$ represents the vector of n partial derivatives of f: its gradient. $s_k$ is the stepsize or learning rate
The neat idea of automatic differentiation-rediscovered and extended as backpropagation in machine learning-makes those cost factors much smaller in practice.
df/dx = limit of $\delta f/ \delta x$ = limit of $ (f(x + \delta x) - f(x)) / \delta x $
For n = 1, df/dx is the gradient
we get a better ratio (closer to the limiting slope df/dx) by averaging the forward difference (where $\delta x > 0$) with the backward difference (where $\delta x < 0$). The average is the more accurate centered difference
The gradient of $F(x_1, …, x_n)$ is the column vector $\nabla F = (dF/dx_1,….,dF/dx_n). Its components are the n partial derivatives of F, and it points in the steepest direction.
For a constant vector a, $F(x) = a^Tx$ has gradient a.
For a symmetric matrix S, gradient of $F(x)=x^TSx$ is 2Sx
For a positive definite symmetric S, $1/2x^TSx-a^Tx$’s gradient is Sx-a
That steepest direction is perpendicular to the level direction
Stochastic Gradient Descent and ADAM
Gradient descent: $x_{k+1}=x_k-s_k\nabla L(x_k)$ L(x) is the loss function. But for large networks with many samples in the training set, this algorithm (as it stands) is not successful
Computing $\nabla L$ at every descent step-the derivatives of the total loss L with respect to all the weights x in the network-is too expensive. Potentially millions of separate losses are computed and added in every computation of L
The number of weights is even larger. many different choices $x^$ of the weights. x^ minimized L(x) for test data v. Some of those choices can give poor results on unseen test data.
Stochastic gradient descent uses only a “minibatch” of the training data at each step. B samples will be chosen randomly. This changes L(x) to a sum of only B losses.
Computing $\nabla l_i by backpropagation on B samples is much faster. Often B = 1
The stochastic algorithm produces weights that also succeed on unseen data.
And for each sample in the training set, we know if it is “a cat or a dog”-or if. the text is “poetry or prose”. We look for a learning function F that assigns’ good weights. Then for v in a similar population, F outputs the correct classification “cat” or “poetry”.
We use this function F for unidentified test data. The features of the test data are the new inputs v. The output from F will be the correct (?) classification-provided the function has learned the training data in a way that generalizes.
Generalization by SGD is the ability to give the correct classification for unseen test data v, based on the weights x that were learned from the training data.
I compare overfitting with choosing a polynomial of degree 60 that fits exactly to 61 data points. Need early stopping
The Loss Function and the Learning Function
“loss” L(x) that our function will (approximately) minimize. This is the sum of the errors in classifying each of the training data vectors v. And we need to describe the form of the learning function F that classifies each data vector v
At the beginning of machine learning the function F was linear-a severe limitation. Now F is certainly nonlinear. Just the inclusion of one particular nonlinear function at each neuron in each layer has made a dramatic difference.
The gradient is determined by the network architecture-the “feedforward” steps whose weights we will optimize.
Square loss: sum over the training samples
Hinge loss
Cross-entropy loss: favorite for neural nets. Cross-entropy loss or “logistic loss” is preferred for logistic regression (with two choices only). For a minibatch of size B, replace N by B. And choose the B samples randomly.
Stochastic Descent Using One Sample Per Step
To simplify, suppose each minibatch contains only one sample (so B = 1). That sample is chosen randomly. The theory of stochastic descent usually assumes that the sample is replaced after use-in principle the sample could be chosen again at step k + 1. But replacement is expensive compared to starting with a random ordering of the samples. In practice, we often omit replacement and work through samples in a random order.
Each pass through the training data is one epoch of the descent algorithm. Ordinary gradient descent computes one epoch per step (batch mode). Stochastic gradient descent needs many steps (for minibatches).
Stochastic descent is more sensitive to the stepsizes than full gradient descent.
We are doing much less work per step (B inputs instead of all inputs from the training set). But we do not necessarily converge more slowly. A typical feature of stochastic gradient descent is “semi-convergence” : fast convergence at the start.
later iterations of SGD are frequently erratic. Convergence at the start changes to large oscillations near the solution. One response is to stop early and thereby we avoid overfitting the data.
If $x_k$ is outside I, then $x_{k+1}$ moves toward the interval I. Otherwise, it just bounces inside I
Randomized Kaczmarz is Stochastic Gradient Descent
We are randomly selecting row i of A at step k, and adjust x_{k+1} to solve equation i in Ax=b, i.e., $a_i^Tx_{k+1}=b_i$. x_{k+1} is the the projection of x_k onto one hyperplane $a_i^Tx=b_i$ that meets $x^*=A^{-1}b$
Adaptive Methods Using Earlier Gradients
That “memory” guides the choice of search direction D and the all-important stepsize s.
For a standard SGD iteration, D_k depends only on the current gradient $\nabla L_k$. $\nabla L_k(x_k, B)$ is evaluated only on a random minibatch B of the test data.
Adaptive Stochastic Gradient Descent: x_{k+1}=x_k - s_kD_k
Exponential moving averages in ADAM: recent gradients $\nabla L$ have greater weight than earlier gradients in both s_k and step direction D_k
Typical values are \delta = 0.9 and \beta = 0.999.
The actual computation of D_k and S_k will be a recursive combination of old and new
Generalization : Why is Deep Learning So Effective?
Often we have more free parameters in x than data in v. In that case we can expect many sets of weights (many vectors x) to be equally accurate on the training set. Those weights could be good or bad. They could generalize well or poorly. Our algorithm chooses a particular x and applies those weights to new data v_{test}