Lecture 2 & 3

2025-01-04 00:00:00 +0000

Lecture 2 - ML Refresher / Softmax Regression

Supervised learning

  • classification (e.g., spam detection, image recognition) and regression tasks (e.g., predicting house prices, stock forecasting).
  • Algorithms include decision trees, support vector machines (SVM), neural networks, and linear regression.
  • Requires a large amount of accurately labeled data for effective training.

Unsupervised learning

  • Useful for clustering (e.g., customer segmentation), association (e.g., market basket analysis), and dimensionality reduction (e.g., principal component analysis, or PCA)
  • Algorithms include k-means clustering, hierarchical clustering, and autoencoders.
  • Can work with unlabeled data, which is often more readily available and cost-effective to collect.

Multi-class classification setting

Matrix batch notation

  • $X \in R^{m \times n}$
  • Matrix ops are more efficient than many vector ops

Loss function: classification error

  • $l_{\text{err}}(h(x), y) = 0$ if $\text{argmax}_i \, h_i(x) = y$, otherwise $1$
  • bad for optimization, because it is not differentiable

Loss function: softmax/cross-entropy loss

  • Softmax(h(x)) = normalize(exp(h(x)))
  • $l_{\text{ce}}(h(x), y) = -h_y(x) + \log \sum_{j=1}^k \exp(h_j(x))$, applied to a linear hypothesis class
  • Find $\theta$ that minimizes the average loss over the training set

SGD

$\alpha$: step size/learning rate. Sample a minibatch of size B and change the weights by $\alpha/B$ times the summed gradient: $\theta \leftarrow \theta - \frac{\alpha}{B}\sum_{i \in B}\nabla_\theta \ell_i(\theta)$

gradient of cross entropy

Hack to compute the loss's derivative: pretend everything is a scalar/vector, then rearrange and transpose matrices/vectors to make the sizes work. Make sure to check the numerical answers. This approach works for the matrix batch form of the loss.
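
A minimal numpy sketch of that recipe for the matrix batch form (assuming the linear hypothesis $h(x) = \Theta^T x$, so the "make the sizes work" gradient is $X^T(Z - I_y)/B$; names are illustrative), with the numerical check the note recommends:

```python
import numpy as np

def softmax(H):
    Z = np.exp(H - H.max(axis=1, keepdims=True))   # stabilized
    return Z / Z.sum(axis=1, keepdims=True)

def ce_loss(theta, X, y):
    H = X @ theta                                   # logits, shape (B, k)
    return np.mean(np.log(np.exp(H).sum(1)) - H[np.arange(len(y)), y])

def ce_grad(theta, X, y):
    B = X.shape[0]
    Z = softmax(X @ theta)
    Z[np.arange(B), y] -= 1                         # Z - I_y
    return X.T @ Z / B                              # shape (n, k): matches theta

rng = np.random.default_rng(0)
X, y = rng.normal(size=(5, 4)), rng.integers(0, 3, size=5)
theta, eps = rng.normal(size=(4, 3)), 1e-6

E = np.zeros_like(theta); E[1, 2] = eps             # perturb a single weight
num = (ce_loss(theta + E, X, y) - ce_loss(theta - E, X, y)) / (2 * eps)
print(num, ce_grad(theta, X, y)[1, 2])              # should agree closely
```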

Lecture 3 - “Manual” Neural Networks

Need to create a feature function $\phi$ so that the classifier is more powerful than the purely linear one.

  • Traditional: manually extract features
  • Neural network: learn features from data, automated

$\theta = \{W_1, W_2\}$

A two-layer network can approximate any smooth function well

Deep networks are more efficient than a single layer at representing some functions that a shallow network struggles to learn, e.g., the parity function. Empirically, deeper networks work better when the number of parameters is fixed.

Backpropagation

two-layer network gradient w.r.t. w_2

w_1

General form

Forward pass: compute the $Z_i$

Backward pass: compute the $G_i$

vector Jacobian product
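
A minimal numpy sketch of those two passes for a two-layer ReLU network with softmax cross-entropy loss (shapes and names are illustrative; each backward step is a vector-Jacobian product):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def two_layer_grads(W1, W2, X, y):
    # forward pass: compute and cache the Z_i
    Z1 = relu(X @ W1)                     # hidden layer, (B, d)
    H = Z1 @ W2                           # logits, (B, k)
    S = np.exp(H - H.max(1, keepdims=True))
    S /= S.sum(1, keepdims=True)          # softmax probabilities
    # backward pass: compute the G_i, each step a vector-Jacobian product
    B = X.shape[0]
    S[np.arange(B), y] -= 1
    G2 = S / B                            # dLoss/dH = (S - I_y)/B
    G1 = (G2 @ W2.T) * (Z1 > 0)           # through ReLU: zero where clipped
    return X.T @ G1, Z1.T @ G2            # dLoss/dW1, dLoss/dW2
```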

VII.3 Backpropagation and the Chain Rule

2024-12-30 00:00:00 +0000

system of equations to solve for the weights that minimize L: The partial derivatives of L with respect to the weights x should be zero.

Backpropagation is a method to compute derivatives quickly, using the chain rule

$\frac{dF}{dx} = \frac{d}{dx}\left(F_2(F_1(x))\right) = \frac{dF_2}{dx_1}\left(F_1(x)\right)\frac{dF_1}{dx}(x)$, where $x_1 = F_1(x)$

For a standard net, the steps in the chain rule can correspond to layers in the neural net.

Backpropagation has been discovered many times. Another name is automatic differentiation (AD). You will see that the steps can be arranged in two basic ways: forward-mode and backward-mode. The right choice of mode can make a large difference in the cost (a factor of thousands). That choice depends on whether you have many functions F depending on a few inputs, or a few functions F depending on many inputs.

Deep learning has basically one loss function depending on many weights. The right choice is “backward-mode AD”. This is what we call backpropagation.

Derivatives of the Learning Function F

Find $\partial w_i/\partial b_j$ and $\partial w_i/\partial A_{jk}$ for all components of $w = b + Av$. When $j$ is different from $i$, the $i$th output $w_i$ is not affected by $b_j$ or $A_{jk}$.

Columns of I are 1-hot vectors.

$\partial w_i/\partial b_j = \delta_{ij}$

$\partial w_i/\partial A_{jk} = \delta_{ij}v_k$

Combining Weights b and A into M

It is often convenient to combine the bias vector b and the matrix A into one matrix M. Matrix of weights.

For each layer of the neural net, the top entry (the zeroth entry) is now fixed at 1. After multiplying that layer by M, the zeroth component on the next layer is still 1. Then ReLU(1) = 1 preserves that entry in every hidden layer.

$\partial w_i/\partial M_{jk} = \delta_{ij}v_k$

Derivatives for Hidden Layers

$\partial w/\partial A_1 = \partial(A_2 R(b_1 + A_1 v_0))/\partial A_1 = A_2\,R'(b_1 + A_1 v_0)\,\partial(b_1 + A_1 v_0)/\partial A_1$

Automatic backpropagation goes backward from w to v

ReLU is a limiting case of an S-shaped sigmoid function

Delta function: $\delta (x) = d^2R/dx^2$. The delta function represents an impulse. It models a finite change in an infinitesimal time. It is physically impossible but mathematically convenient.

With ReLU, a neuron could stay at zero through all the steps of deep learning. This “dying ReLU” can be avoided in several ways; it is generally not a major problem. One way that firmly avoids it is to change to a Leaky ReLU with a nonzero gradient.

Computational Graphs

Forward mode requires a new graph for each input x_i, to compute dF/dx_i

The reverse mode starts with the output F. It computes the derivatives with respect to both inputs. The computations go backward through the graph.

Instead of N forward graphs from N inputs, we will have one backward graph from one output. It finds the derivative of F with respect to every node. It starts with dF/dF = 1.

The reverse mode finds all derivatives dF/dx_i by following all chains backward from output to input. Those chains all appear as paths on one graph, not as separate chain rules for exponentially many possible paths. This is the success of reverse mode.

The derivatives of F with respect to the matrices A (and the bias vectors b) are easiest for the last matrix $A_L$ in $A_L v_{L-1}$: $\partial F_i/\partial (A_L)_{jk} = \delta_{ij}v_k$. Next is the derivative of $A_L\,\text{ReLU}(A_{L-1}v_{L-2})$ with respect to $A_{L-1}$.

Adjoint Methods

Hyperparameters

The stepsize/learning rate $s_k$ in gradient descent is first and foremost. The descent can be accelerated (by momentum), adaptive (ADAM), or stochastic, with a random minibatch of training data at each step k.

line search

No method is perfect. If $s_k$ is too small, gradient descent takes too long. If $s_k$ is too large, gradient descent will jump around the minimizing $x^*$.

After reaching weights x that are close to minimizing the loss function L(x, v), we may want to bring in new v's from a validation set. This is not yet production mode. The purpose of cross-validation is to confirm that the computed weights x are capable of producing accurate outputs from new data.

Cross-validation

A first step in cross-validation is to divide the available data into K subsets. If K = 2, these would essentially be the training set and test set, but we are usually aiming for more information from smaller sets before working with a big test set. K-fold cross-validation uses each of the K subsets separately as a test set. In every trial, the other K - 1 subsets form the training set. We are reworking the same data (of moderate size) to learn more than one optimization can teach us.
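
A small sketch of the K-fold split in numpy (function and variable names are my own):

```python
import numpy as np

def k_fold_indices(n_samples, K, seed=0):
    """Yield (train, test) index arrays; each fold serves once as the test set."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(idx, K)
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        yield train, test

for train_idx, test_idx in k_fold_indices(10, K=5):
    print(sorted(test_idx))   # 5 disjoint test sets covering all 10 samples
```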

Batch Normalization of Each Layer

As training goes forward, the mean and variance of the original population can change at every layer of the network. This change in the distribution of inputs is “covariate shift”. We often have to adjust the stepsize and other hyperparameters, due to this shift in the statistics of layers. A good plan is to normalize the input to each layer.

Normalization makes the training safer and faster. The need for dropout often disappears. Fewer iterations can now give more accurate weights. And the cost can be very moderate. Often we just train two additional parameters on each layer.

The key point is to normalize the inputs $y_i$ to each new layer. What was good for the original batch of vectors (at layer zero) is also good for the inputs to each hidden layer.
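
A minimal sketch of that normalization for one layer, with the two trained parameters the text mentions (a scale gamma and a shift beta; names are illustrative):

```python
import numpy as np

def batch_norm(Y, gamma, beta, eps=1e-5):
    """Normalize each feature of layer input Y over the batch, then rescale."""
    mu = Y.mean(axis=0)                    # per-feature mean over the batch
    var = Y.var(axis=0)                    # per-feature variance
    Y_hat = (Y - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * Y_hat + beta            # gamma, beta: the two trained parameters
```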

Dropout

Dropout is the removal of randomly selected neurons in the network. Typically hidden layer neurons might be given probability p = 0.5 of surviving, and input components might have p = 0.8 or higher. The main objective of random dropout is to avoid overfitting.

In training, each new v_0 (the feature vector of an input sample) leads to a new thinned network.

At test time, we use the full network (no dropout) with weights rescaled from the training weights. The outgoing weights from an undropped neuron are multiplied by p in the rescaling.
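
A minimal sketch of both phases, following the text's convention of rescaling by the survival probability p at test time (the common “inverted dropout” variant rescales by 1/p during training instead):

```python
import numpy as np

def dropout(v, p_keep, train, rng=np.random.default_rng(0)):
    if train:
        mask = rng.random(v.shape) < p_keep   # each neuron survives with prob p
        return v * mask                       # a new thinned network per sample
    return v * p_keep                         # test time: full net, rescaled by p
```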

To compute gradients, use backpropagation for each training example in the minibatch. Then average those gradients. Stochastic gradient descent can still include acceleration (momentum added) and adaptive descent and weight decay.

LeCun emphasizes that for a multiparameter search, random sampling is the way to cover many possibilities quickly. Grid search is too slow in multiple dimensions.

Loss Functions

Quadratic cost (square loss): not a favorite choice for deep learning. One reason is the parabolic shape of the graph of l(w, v) as we approach zero loss at w = y. The derivative also approaches zero.

A zero derivative at the minimum is normal for a smooth loss function, but it frequently leads to an unwanted result: The weights A and b change very slowly near the optimum. Learning slows down and many iterations are needed.

Cross-entropy loss: expects that the N outputs w_i from training the neural net have been normalized to z(w), with each z_i in (0, 1).

Shannon’s formula for entropy

Kullback-Leibler (KL) divergence.

Regularization

It adds a penalty term to the loss function L(x)

The penalty controls the size of x. Regularization is also called weight decay.

The coefficient $\lambda$ is a hyperparameter. Its value can be based on cross-validation. The purpose of the penalty term is to avoid overfitting (sometimes expressed as fitting the noise). Cross-validation for a given lambda finds the minimizing x on a training set. Then it checks by using those weights on a test set. If it sees errors from overfitting, lambda is increased.

A small value of lambda tends to increase the variance of the error: overfitting. A large lambda will increase the bias: underfitting, because the fitting term $\|b - Ax\|^2$ is less important.

Recent experiments on MNIST make it unclear if explicit regularization is always necessary.

The Structure of AlphaGo Zero

  1. A convolution of 256 filters of kernel size 3 x 3 with stride 1 : E = 3, S = 1
  2. Batch normalization
  3. ReLU
  4. A convolution of 256 filters of kernel size 3 x 3 with stride 1
  5. Batch normalization
  6. A skip connection as in ResNets that adds the input to the block
  7. ReLU
  8. A fully connected linear layer to a hidden layer of size 256
  9. ReLU

Training was by stochastic gradient descent on a fixed data set that contained the final 2 million games of self-played data from a previous run of AlphaGo Zero. The CNN includes a fully connected layer that outputs a vector of size 19^2 + 1. This accounts for all positions on the 19 x 19 board, plus a pass move allowed in Go.

Recurrent Neural Networks

Appropriate for data that comes in a definite order. This includes time series and natural language: speech or text or handwriting. In the network of connections from inputs v to outputs w, the new feature is the input from the previous time t - 1. This recurring input is determined by the function h(t - 1).

The inputs to h(t) are the new data v(t) and the recurrent data h( t- 1) from the previous time. The weights multiplying the data are x_in, x_recur, and x_out

Google translate

Instead of programming every word and grammatical rule and exception in both languages, let the computer find the rules. Just give it enough correct translations. If we were recognizing images, the inputs would be many examples with correct labels (the training set). The machine creates the function F(x, v).

Part VII : Learning from Data

2024-12-29 00:00:00 +0000

There are a great many points in the training data, but there are far more weights to be computed in a deep network. The art of deep learning is to find, among many possible solutions, one that will generalize to new data.

F(x, v): The training data is given by a set of feature vectors v. The weights that allow F to classify that data are in the vector x. To optimize F, gradient descent needs its derivatives dF/dx. The weights x are the matrices $A_1 \dots A_L$ and bias vectors $b_1 \dots b_L$. Sample data $v = v_0$, output $w = v_L$.

Real codes use automatic differentiation (AD) for backpropagation. Each hidden layer with its optimized weights learns more about the data and the population from which it comes, in order to classify new and unseen data from the same population.

The Functions of Deep Learning

So we start with M different images (the training set). An image will be a set of p small pixels, or a vector v, where $v_i$ is the greyscale of the ith pixel. For every v in that training set we know the digit it represents.

Continuous Piecewise Linear (CPL) functions. Linear for simplicity, continuous to model an unknown but reasonable rule, and piecewise to achieve the nonlinearity that is an absolute requirement for real images and data.

If $A_1$ is q by p, the input space $R^p$ is sliced by q hyperplanes into r pieces

Bias vs. Variance : Underfit vs. Overfit

A training set contains N vectors with m components each (the m features of each sample). For each of those N points in $R^m$, we are given $y_i$. We want to learn $y_i = f(x_i) + \epsilon$, where $\epsilon$ has mean 0.

Our learning algorithm actually finds a function F(x) close to f(x)

bias-variance tradeoff. High bias from underfitting, high variance from overfitting.

bias = $E[f(x) - F(x)]$, variance = $E[F(x)^2] - E[F(x)]^2$

The Construction of Deep Neural Networks

create a learning function F(x, v) with weights x that capture information from the training data v

  • Key operation: F=F_3(F_2(F_1(x, v)))
  • key rule: Chain rule for x-derivatives of F
  • Key algorithm: Stochastic gradient descent to find the best weights x
  • Key subroutine: Backpropagation to execute the chain rule, find the x-derivatives of F to solve \nabla F = 0
  • Key nonlinearity: ramp function ReLU(y). This makes the learning function F continuous and piecewise linear in v.

One Internal Layer L = 2

For a classification problem each sample v_0 of the training data is assigned 1 or -1. We want the output v_2 to have that correct sign (most of the time). For a regression problem we use the numerical value (not just the sign) of v_2

Machine learning doesn’t aim to capture every detail of the numbers 0, 1, 2 … , 9. It just aims to capture enough information to decide correctly which number it is.

The Initial Weights in Gradient Descent

A proper choice of the net and the initial x_0 has random (and independent) weights that meet two requirements.

  • x_0 has a carefully chosen variance, which controls the mean of the computed weights.
  • The hidden layers in the neural net have enough neurons (not too narrow).

Many-layered depth can reduce the loss on the training set. But if $\sigma^2$ is wrong or width is sacrificed, then gradient descent can lose control of the weights. They can explode to infinity or implode to zero.

The good choice of variance is $\sigma^2 = 2$/fan-in. The fan-in is the maximum number of inputs to a neuron.

The key point: Deep learning can go wrong if it doesn’t start right.

Stride and Subsampling

goal : Reduce the dimension from 128 to 64

  • In two steps: Multiply the 128-component vector v by A, and then discard the odd-numbered components of the output. This is filtering followed by subsampling.
  • In one step: Discard the odd-numbered rows of the matrix A. The new matrix $A_2$ becomes short and wide: 64 rows and 128 columns. The “stride” of the filter is now 2. Now multiply the 128-component vector v by $A_2$.

Max-pooling

Multiply the 128-component vector v by A, as before. Then from each even-odd pair of outputs, keep the maximum.

For an image (a 2-dimensional signal) we might use max-pooling over every 2 by 2 square of pixels. This speeds up the training, when the number of neurons on a hidden layer is divided by 4.

Pooling also reduces the possibility of overfitting. Average pooling would keep the average of the numbers in each pool : now the pooling layer is linear.

The Graph of the Learning Function F(v)

The graph of F(v) is a surface made up of many, many flat pieces; they are planes or hyperplanes that fit together along all the folds where ReLU produced a change of slope. This is like origami, except that this graph has flat pieces going to infinity.

That polygon separating blue from orange (or plus from minus : this is classification) is the analog of a separating hyperplane in a Support Vector Machine. If we were limited to linear functions and a straight line between a blue ball and an orange ring around it, separation would be impossible.

Counting Flat Pieces in the Graph

Suppose $v_0$ has m components and $A_1v_0 + b_1$ has N components. We have N linear functions of $v_0$. Each of those linear functions is zero along a hyperplane (dimension m - 1) in $R^m$. When we apply ReLU to that linear function it becomes piecewise linear, with a fold along that hyperplane. On one side of the fold its graph is sloping; on the other side the negative values are replaced by zero.

Convolutional Neural Nets

That fully connected net will be extremely inefficient for image recognition: it looks for connections between faraway pixels. Almost always, the important connections in an image are local.

the search for structure is essentially the same everywhere in the image. There is normally no reason to process one part of a text or image or video differently from other parts. We can use the same weights in all parts: Share the weights. The neural net of local connections between pixels is shift-invariant: the same everywhere.

Suppose each neuron is connected to only E neurons on the next layer, and those connections are the same for all neurons. Then the matrix A between those layers has only E independent weights x.

we have time to create several different channels with their own E or E^2 weights. They can look for edges in different directions (horizontal, vertical, and diagonal).

In one dimension, a banded shift-invariant matrix is a Toeplitz matrix or a filter. Multiplication by that matrix A is a convolution x * v.

A Toeplitz matrix is a type of structured matrix where each descending diagonal from left to right is constant. That means all the elements along any diagonal parallel to the main diagonal are the same.

Each shift has a diagonal of 1's: $A = x_1L + x_0C + x_{-1}R$, assuming S = 1

$\partial(Av)/\partial x_1 = Lv$
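
A small numpy sketch of that shift-combination view (I put L and R on the super- and subdiagonals; the orientation is just a convention). Because Av is linear in each weight, $\partial(Av)/\partial x_1 = Lv$ holds exactly:

```python
import numpy as np

n = 6
L = np.eye(n, k=1)       # shift: ones on the superdiagonal
C = np.eye(n)            # center diagonal
R = np.eye(n, k=-1)      # shift: ones on the subdiagonal

def filt(x1, x0, xm1):
    return x1 * L + x0 * C + xm1 * R    # banded Toeplitz filter, stride 1

v = np.arange(float(n))
x1, x0, xm1 = 2.0, -1.0, 0.5
dAv_dx1 = (filt(x1 + 1.0, x0, xm1) - filt(x1, x0, xm1)) @ v  # exact: A is linear in x1
print(np.allclose(dAv_dx1, L @ v))       # True
```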

Convolutions in Two Dimensions

When the input v is an image, the convolution with x becomes two-dimensional. The 1-d weights $x_1, x_0, x_{-1}$ change to $E^2$ independent weights. v represents $(N+2)^2$ pixels, and the output has only $N^2$ pixels.

The 2D convolution $A = x_{11}LU + x_{01}CU + x_{-11}RU + \dots + x_{-1,-1}RD$

These nine derivatives are computed inside backpropagation.

CNN’s can readily afford to have B parallel channels (and that number B can vary as we go deeper into the net).

A convolution is a combination of shift matrices (producing a filter or Toeplitz matrix)

A cyclic convolution is a combination of cyclic shifts (producing a circulant matrix)

In deep learning, the coefficients in the combination will be the “weights” to be learned.

In two dimensions (for images) the matrix A is block Toeplitz. Each small block is E by E

The same weights are used around all pixels (shift-invariance). The matrix produces a 2D convolution. Frequently A is called a filter

The dot products between a smooth function and a moving filter window will be smooth. But when an edge in the image lines up with a diagonal wall, we see a spike.

The difficulty with two or more dimensions is that edges can have many directions. We will need horizontal and vertical and diagonal filters for the test images. And filters have many purposes, including smoothing, gradient detection, and edge detection.

  • For a 2D function f, the natural smoother is convolution with a Gaussian. The filter removes noise (at a price in sharp edges)
  • Gradient detection. Image processing (as distinct from learning by a CNN) needs filters that detect the gradient. They contain specially chosen weights. We mention some simple filters just to indicate how they can find gradients, the first derivatives of f.
  • Edge detection. we don’t want smoothing, which would blur the edge. The good filters become Laplacians of Gaussians

The Stride of a Convolutional Filter

For a larger stride, the moving window takes longer steps as it moves across the image.

The length of the output y = Av is reduced by a factor of S.

nonzero weights like x_1 are S columns apart for stride S

Extending the Signal

Instead of losing neurons at the edges of the image when A is not square, we can extend the input layer. We are “inventing” components beyond the image boundary. Then the output y = Av has equal dimensions for input and output.

For periodic signals, zero-padding is replaced by wraparound.

A more accurate choice is to go beyond the boundary by reflection.

This is what CNN’s usually do: Add more channels of weight matrices A to capture more features of the training sample.

Counting the Number of Inputs and Outputs

In a one-dimensional problem, suppose a layer has N neurons. We apply a convolutional matrix with E nonzero weights. The stride is S, and we pad the input signal by P zeros at each end. How many outputs (M numbers) does this filter produce?

Karpathy's formula: $M = \frac{N - E + 2P}{S} + 1$

In a 2D or 3D problem, this 1D formula applies in each direction.

This case 2P = E - 1 with stride S = 1 (so M = N) is the most common architecture for CNN's.
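
A tiny helper (my own naming) to apply the formula:

```python
def conv_output_size(N, E, S, P):
    """Karpathy's formula: M = (N - E + 2P)/S + 1 outputs."""
    assert (N - E + 2 * P) % S == 0, "filter and stride must fit the padded input"
    return (N - E + 2 * P) // S + 1

print(conv_output_size(N=128, E=3, S=1, P=1))   # 2P = E - 1, S = 1 keeps size: 128
print(conv_output_size(N=128, E=2, S=2, P=0))   # stride 2 halves the length: 64
```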

it is very common to end a CNN with fully-connected layers.

Softmax Outputs for Multiclass Networks

“Softmax” replaces the two-output case of logistic regression. We are turning n numbers into probabilities.

For CNN’s in computer vision, the final layer often has a special form. If the previous layers used ReLU and max-pooling (both piecewise linear), the last step can become a difference-of-convex program, and eventually a multiclass Support Vector Machine (SVM). Then optimization of the weights in a piecewise linear CNN can be one layer at a time.

Residual Networks

But depth brings dangers. Information can jam up and never reach the output. The problem of “vanishing gradients” can be serious: so many multiplications in propagating so far, with the result that computed gradients are exponentially small. When it is well designed, depth is a good thing, but you must create paths for learning to move forward.

“skip connections” that go directly to the next layer.

entire layers can be removed without significant impact

Part VI: Optimization

2024-12-28 00:00:00 +0000

Gradient Descent Toward the Minimum

Minimize a function $f(x_1, \dots, x_n)$: all the first derivatives are 0 at the minimum. If we have n = 20 unknowns (a small number in deep learning), then minimizing one function f produces 20 equations.

The steepest direction, in which f(x) decreases fastest, is given by the negative gradient $-\nabla f$

$\nabla f$ represents the vector of n partial derivatives of f: its gradient. $s_k$ is the stepsize or learning rate

The neat idea of automatic differentiation-rediscovered and extended as backpropagation in machine learning-makes those cost factors much smaller in practice.

$df/dx = \lim \Delta f/\Delta x = \lim_{\Delta x \to 0} (f(x + \Delta x) - f(x))/\Delta x$

For n = 1, df/dx is the gradient

We get a better ratio (closer to the limiting slope df/dx) by averaging the forward difference (where $\Delta x > 0$) with the backward difference (where $\Delta x < 0$). The average is the more accurate centered difference.
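
A quick numerical illustration on $f = \sin$, where the true slope at x = 1 is $\cos(1)$:

```python
import numpy as np

f, x, true = np.sin, 1.0, np.cos(1.0)
for dx in (1e-2, 1e-4):
    fwd = (f(x + dx) - f(x)) / dx               # one-sided: O(dx) error
    ctr = (f(x + dx) - f(x - dx)) / (2 * dx)    # centered: O(dx^2) error
    print(f"dx={dx:.0e}  forward err={abs(fwd - true):.2e}  centered err={abs(ctr - true):.2e}")
```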

The gradient of $F(x_1, \dots, x_n)$ is the column vector $\nabla F = (\partial F/\partial x_1, \dots, \partial F/\partial x_n)$. Its components are the n partial derivatives of F, and it points in the steepest direction.

For a constant vector a, $F(x) = a^Tx$ has gradient a.

For a symmetric matrix S, gradient of $F(x)=x^TSx$ is 2Sx

For a positive definite symmetric S, the gradient of $\frac{1}{2}x^TSx - a^Tx$ is $Sx - a$

That steepest direction is perpendicular to the level direction

Stochastic Gradient Descent and ADAM

Gradient descent: $x_{k+1} = x_k - s_k\nabla L(x_k)$, where L(x) is the loss function. But for large networks with many samples in the training set, this algorithm (as it stands) is not successful.

Computing $\nabla L$ at every descent step-the derivatives of the total loss L with respect to all the weights x in the network-is too expensive. Potentially millions of separate losses are computed and added in every computation of L

The number of weights is even larger, so there are many different choices $x^*$ of the weights: each $x^*$ minimizes L(x) on the training data v. Some of those choices can give poor results on unseen test data.

Stochastic gradient descent uses only a “minibatch” of the training data at each step. B samples will be chosen randomly. This changes L(x) to a sum of only B losses.

Computing $\nabla \ell_i$ by backpropagation on B samples is much faster. Often B = 1.

The stochastic algorithm produces weights that also succeed on unseen data.

And for each sample in the training set, we know if it is “a cat or a dog”, or if the text is “poetry or prose”. We look for a learning function F that assigns good weights. Then for v in a similar population, F outputs the correct classification “cat” or “poetry”.

We use this function F for unidentified test data. The features of the test data are the new inputs v. The output from F will be the correct (?) classification, provided the function has learned the training data in a way that generalizes.

Generalization by SGD is the ability to give the correct classification for unseen test data v, based on the weights x that were learned from the training data.

I compare overfitting with choosing a polynomial of degree 60 that fits exactly through 61 data points. We need early stopping.

The Loss Function and the Learning Function

“loss” L(x) that our function will (approximately) minimize. This is the sum of the errors in classifying each of the training data vectors v. And we need to describe the form of the learning function F that classifies each data vector v

At the beginning of machine learning the function F was linear-a severe limitation. Now F is certainly nonlinear. Just the inclusion of one particular nonlinear function at each neuron in each layer has made a dramatic difference.

The gradient is determined by the network architecture-the “feedforward” steps whose weights we will optimize.

Square loss: sum over the training samples

Hinge loss

Cross-entropy loss: favorite for neural nets. Cross-entropy loss or “logistic loss” is preferred for logistic regression (with two choices only). For a minibatch of size B, replace N by B. And choose the B samples randomly.

Stochastic Descent Using One Sample Per Step

To simplify, suppose each minibatch contains only one sample (so B = 1). That sample is chosen randomly. The theory of stochastic descent usually assumes that the sample is replaced after use; in principle the sample could be chosen again at step k + 1. But replacement is expensive compared to starting with a random ordering of the samples. In practice, we often omit replacement and work through samples in a random order.

Each pass through the training data is one epoch of the descent algorithm. Ordinary gradient descent computes one epoch per step (batch mode). Stochastic gradient descent needs many steps (for minibatches).

Stochastic descent is more sensitive to the stepsizes than full gradient descent.

We are doing much less work per step (B inputs instead of all inputs from the training set). But we do not necessarily converge more slowly. A typical feature of stochastic gradient descent is “semi-convergence” : fast convergence at the start.

Later iterations of SGD are frequently erratic. Convergence at the start changes to large oscillations near the solution. One response is to stop early, and thereby avoid overfitting the data.

If $x_k$ is outside I, then $x_{k+1}$ moves toward the interval I. Otherwise, it just bounces inside I

Randomized Kaczmarz is Stochastic Gradient Descent

We randomly select row i of A at step k, and adjust $x_{k+1}$ to solve equation i in Ax = b, i.e., $a_i^Tx_{k+1} = b_i$. $x_{k+1}$ is the projection of $x_k$ onto the one hyperplane $a_i^Tx = b_i$ that meets $x^* = A^{-1}b$.
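
A minimal sketch of the iteration: project $x_k$ onto the hyperplane of a randomly chosen row (sampling rows in proportion to $\|a_i\|^2$ follows the Strohmer-Vershynin variant, an assumption beyond the text):

```python
import numpy as np

def randomized_kaczmarz(A, b, iters=2000, seed=0):
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n)
    p = (A ** 2).sum(axis=1)
    p /= p.sum()                           # rows sampled prop. to ||a_i||^2
    for _ in range(iters):
        i = rng.choice(m, p=p)
        a = A[i]
        x += (b[i] - a @ x) / (a @ a) * a  # project x onto {x : a_i^T x = b_i}
    return x

rng = np.random.default_rng(1)
A = rng.normal(size=(20, 5))
x_true = np.arange(5.0)
x = randomized_kaczmarz(A, A @ x_true)
print(np.linalg.norm(x - x_true))          # tiny: the iteration converges to x*
```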

Adaptive Methods Using Earlier Gradients

That “memory” guides the choice of search direction D and the all-important stepsize s.

For a standard SGD iteration, $D_k$ depends only on the current gradient $\nabla L_k$. $\nabla L_k(x_k, B)$ is evaluated only on a random minibatch B of the training data.

Adaptive Stochastic Gradient Descent: $x_{k+1} = x_k - s_kD_k$

Exponential moving averages in ADAM: recent gradients $\nabla L$ have greater weight than earlier gradients in both s_k and step direction D_k

Typical values are $\delta = 0.9$ and $\beta = 0.999$.

The actual computation of $D_k$ and $s_k$ will be a recursive combination of old and new.
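
A sketch of one ADAM step with those values (the bias-correction factors $1 - \delta^k$ and $1 - \beta^k$ are standard ADAM, an assumption beyond these notes; m and v hold the exponential moving averages):

```python
import numpy as np

def adam_step(x, g, m, v, k, s=1e-3, delta=0.9, beta=0.999, eps=1e-8):
    """One ADAM update at step k >= 1, given the gradient g at the current x."""
    m = delta * m + (1 - delta) * g       # moving average of gradients -> D_k
    v = beta * v + (1 - beta) * g * g     # moving average of squared gradients
    m_hat = m / (1 - delta ** k)          # bias corrections for the early steps
    v_hat = v / (1 - beta ** k)
    x = x - s * m_hat / (np.sqrt(v_hat) + eps)   # adaptive per-weight stepsize
    return x, m, v
```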

Generalization : Why is Deep Learning So Effective?

Often we have more free parameters in x than data in v. In that case we can expect many sets of weights (many vectors x) to be equally accurate on the training set. Those weights could be good or bad. They could generalize well or poorly. Our algorithm chooses a particular x and applies those weights to new data v_{test}

Part V: Probability and Statistics

2024-12-27 00:00:00 +0000

The first 100 or 1000 flips do affect the sample mean. But 1000 flips will not affect its limit, because you are dividing by $N \to \infty$.

F(x) is the cumulative distribution and its derivative p(x) is the probability density function. You could say that p(x) dx is the probability of a sample falling in between x and x + dx.

$E[x] = \int xp(x)\,dx$, $\sigma^2 = E[(x-m)^2] = \int p(x)(x-m)^2\,dx$

Normal distribution

Central Limit Theorem. Standard normal distribution N(0, 1).

$ F(\sigma) - F(-\sigma) \approx 2/3 $ $ F(2\sigma) - F(-2\sigma) \approx 0.95 $

The “binomial” probabilities count the number of heads in N coin flips. The binomial distribution approaches the normal.

Stirling’s formula

Monte Carlo Estimation Methods

Accepting uncertainty in the inputs and estimating the variance in the outputs. The error in approximating E[x] is usually $1/\sqrt{N}$.

2-level Monte Carlo

Probability Distributions

Binomial: Tossing a coin n times

Poisson

Rare events: $p \to 0$ with n trials and expected $\lambda = np$ successes. What is the probability of k successes in n trials? In the Poisson limit, $P(k) = \lambda^k e^{-\lambda}/k!$.

we wait a long time, or we include a large population, to produce a decent chance that one or more events will actually occur. Poisson is not for the probability that you will be struck by lightning : say one in a million, over your lifetime. It is for the probability that someone in a city of 100,000 will be struck.

Poisson assumes independent events. One of the most difficult aspects of applied probability is to estimate the dependence of one event on other events.

Exponential

The exponential distribution is continuous (not discrete). It describes the waiting time in a Poisson process. How long until lightning strikes your city? The key assumption is that the waiting time is independent of the time you have already waited.

The future is independent of the past. Is this true for a television set to fail? Is it true for a computer to fail? If failure depends on a random disaster, the waiting time has exponential distribution. The failure rate $\lambda$ is constant. If failure comes from slow decay, the independence requirement will not hold.

The exponential distribution has no memory: the probability of waiting at least y hours more for a table is unfortunately not affected by already having waited x hours.

If failures only occur at integer times t = 0, 1, 2, 3, … then the exponential distribution (which has continuous time) is replaced by the geometric distribution (discrete time).

Log-normal

When x has a normal distribution, the exponential $e^x$ has a log-normal distribution. I.e., the distribution of x is log-normal when the distribution of y = log(x) is normal.

The log-normal distribution appears when you take the product of many positive sample values. The arithmetic mean for normal compares with the geometric mean for log-normal.

Chi-squared

If x has the standard N(0, 1) distribution, then $x^2$ is chi-squared.

If $x_1….x_n$ are independent N(0, 1), then $x_1^2+…+ x_n^2$ has chi-squared distribution with n degrees of freedom

Multivariable Gaussian

When the variables are not independent, like stocks and bonds, C is symmetric and positive definite (in an extreme case, semidefinite). Then the joint distribution of Gaussian variables is “multivariate Gaussian”

Moments, Cumulants, and Inequalities

Suppose we know E[X]; we want to bound $\Pr(X \ge a)$.

Markov's inequality: $\Pr(X \ge a) \le E[X]/a$ for nonnegative X.

Chebyshev's inequality: $\Pr(|X - m| \ge a) \le \sigma^2/a^2$. The proof is to apply Markov's inequality to $(X - m)^2$.

Connecting a list of numbers to a (generating) function: $f(x) = \sum a_n x^n$

Moments and Central Moments

$m_n =E[x^n]$

central moments (around m) $\mu_n=E[(x-m)^n]$

The normalized third central moment is the skewness of the distribution.

We generally use the fourth moment of a normal distribution as a basis for comparison. Then the kurtosis of any distribution is called kappa

Generating Functions and Cumulants

Probability generating function

Characteristic function. uses the characteristic function to prove the central limit theorem

Moment generating function M(t)

Cumulant generating function K(t)

We are capturing an infinite set of coefficients in each function. The key point of cumulants is that they lead to this property: $K(t) = \log M(t)$

Chernoff’s Inequality for Sums

The mean value of a sum is the sum of the means. If those variables are independent, the same is true for the variance.

Chebyshev for a sum

Markov and Chebyshev Inequalities for Matrices

an essential step in the statistics of deep learning is to develop inequalities for the eigenvalues and the trace of random matrices.

Markov's inequality: Suppose X is a positive semidefinite (or definite) random matrix. If A is any positive definite matrix, then

$\text{Prob}\{A - X \text{ is not positive semidefinite}\} \le \text{trace of } E[X]A^{-1}$

Chebyshev’s inequality for a random matrix X with mean 0:

Matrix Chernoff Inequalities

Erdos-Renyi Random Graphs

Each edge is present with probability p. Then the question is: for which p is the graph likely to be connected?

Covariance Matrices and Joint Probabilities

Prove that independent experiments have covariance 0

Max covariance = $\sigma_1\sigma_2$ when the two variables are fully dependent

Each term $p_{ij}UU^T$ (with $U = X - E[X]$) is positive semidefinite. So the whole matrix V (the sum of those rank-1 matrices) is at least semidefinite.

The covariance matrix V is positive definite unless the experiments are dependent.

$V = E[(X-E(X))(X-E(X))^T] = \sum p_{ij}(X-E(X))(X-E(X))^T$

Variance of $c^TX$ = $c^TVc$

Diagonalizing the covariance matrix $V=Q\Lambda Q^T$ means finding M independent experiments as combinations of the original M experiments.

Another proof that covariance matrix V is positive semidefinite. V is the sum of the joint probability of each combination times the positive semidefinite matrix $(X-E(X))(X-E(X))^T$

The value of the expectation symbol E is that it also allows pdf's: probability density functions like p(x, y, z) for continuous random variables x and y and z.

The covariance matrix of $Z =AX$ is $ V_Z=AV_XA^T$

Suppose x and y are independent random variables with mean 0 and variance 1. What is the covariance matrix of Z = (x, y, ax + by)?
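
A worked answer using the rule just above (with $V_X = I$, since x and y are independent with variance 1): $Z = AX$ with $A = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ a & b \end{pmatrix}$, so $V_Z = AV_XA^T = AA^T = \begin{pmatrix} 1 & 0 & a \\ 0 & 1 & b \\ a & b & a^2+b^2 \end{pmatrix}$. This $V_Z$ is only semidefinite: the third variable is fully determined by the first two.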

Correlation

Normalize the random variable x to have variance 1: $X = x/\sigma_x$

The correlation of x and y is the covariance of X and Y

Correlation matrix R

GPS Example

Distances from four or more satellites will pinpoint the receiver position (using least squares!) The speed of light changes in the ionosphere. But the correction will be almost the same for all nearby receivers. If one receiver stays in a known position, we can take differences from that one. Differential GPS reduces the error variance by fixing one receiver

Markov chains

$y_{n+1}=Py_n$

Find eigenvalues of a square matrix through determinant.

The determinant of a square matrix can be seen as a measure of the volume spanned by its row or column vectors. If the determinant is zero, it means these vectors are linearly dependent, and thus, the volume is zero. In geometric terms, it indicates that the transformation described by the matrix flattens the space into a lower dimension.

The determinant of a matrix is calculated from a specific combination of its elements, reflecting how rows or columns interact. If these are linearly dependent (i.e., one row or column can be expressed as a combination of others), then they don’t span an “n-dimensional volume” in the space, and the determinant becomes zero. The determinant can be viewed as a scaling factor for the volume of the unit cube under the linear transformation represented by the matrix: For a 2x2 matrix, it corresponds to the area of the parallelogram spanned by its columns (or rows). From an algebraic standpoint, if there are linearly dependent rows or columns, the matrix can be row-reduced to a form where one entire row (or column) becomes zero, leading the determinant to be zero, as any zero-row matrix has a zero determinant.

steady state with eigenvalue 1

$P = X\Lambda X^{-1}$; $P^\infty$ has the same split in both columns

The requirements for a positive Markov matrix: All p_{ij} > 0 and each column of P adds to 1. P has largest eigenvalue $\lambda _1 = 1$ and $x_1 > 0$

Transition Probabilities

$p_{ij} = \Pr\{x(n+1) = i \mid x(n) = j\}$; $p_{ij}$ does not depend on n

Finite Markov Chain, Infinite State Markov Chain, Continuous Markov Chain

1 is an eigenvalue, because the rows of P - I add to the zero row. And no entry of P is negative.

Perron-Frobenius Theorem: P > 0 guarantees success. For A > 0, all numbers in $Ax = \lambda_{max}x$ are strictly positive.

If $|\lambda_2| < 1$, then $P^n$ converges. The matrix P might not have n independent eigenvectors, so we can't always diagonalize P.
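
A small numpy check on a 2-state chain (column-stochastic P; the numbers are illustrative):

```python
import numpy as np

P = np.array([[0.8, 0.3],
              [0.2, 0.7]])     # each column adds to 1

w, X = np.linalg.eig(P)        # eigenvalues 1 and 0.5
steady = X[:, np.argmax(w.real)].real
steady /= steady.sum()         # eigenvector for lambda = 1, scaled to sum to 1
print(steady)                  # [0.6 0.4]

y0 = np.array([1.0, 0.0])
print(np.linalg.matrix_power(P, 50) @ y0)   # P^n y0 approaches the steady state
```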

Gambler’s Ruin

When a player has won all the money, the game is over and there is no way to leave either of those states. With those two steady states we must expect eigenvalue 1 twice.

Shape of P. What are the four eigenvalues of P? What is the probability that the game will continue forever with no winner?


Part 1.11 Norms

2024-12-26 00:00:00 +0000

Every norm for vectors or functions or matrices must share these two properties of the absolute value |c| of a number: $\|cv\| = |c|\,\|v\|$ and $\|v + w\| \le \|v\| + \|w\|$.
Euclidean norm, 1-norm, and infinity norm for a vector are cases of $\|v\|_p = (|v_1|^p + \dots + |v_n|^p)^{1/p}$.
$v^Tw = \|v\|\,\|w\|\cos\theta$

S-norms

Choose any symmetric positive definite matrix S. S-norm: $\|v\|_S^2 = v^TSv$. S-inner product: $(v, w)_S = v^TSw$.

Norms of functions

A function f(x) is a “vector in function space”.

For a vector space to be “complete”, every converging sequence $v_n$ must have $\|v_n - v_\infty\| \to 0$: that is a Banach space. A Hilbert space is a Banach space with an inner product satisfying $(v,v) = \|v\|^2$.

The Frobenius Norm

A matrix norm follows the rules for a vector norm, plus $\|AB\| \le \|A\|\,\|B\|$.
$\|A\|_F^2 = |a_{11}|^2 + \dots + |a_{mn}|^2$
When Q is an orthogonal matrix, we know that Qx has the same $\ell^2$ length as x: $\|QB\|_F = \|B\|_F$.
$\|A\|_F^2$ = trace of $A^TA$ = sum of its eigenvalues $\sigma_i^2$

Matrix Norms from Vector Norms

If we choose the vector v with the largest growth factor, $\|A\| = \max(\|Av\|/\|v\|)$.

$\ell^2$ norm: largest singular value. $\ell^1$ norm: largest $\ell^1$ norm of the columns of A. $\ell^\infty$ norm: largest $\ell^1$ norm of the rows of A.

The Nuclear Norm

Also called the trace norm.

$\|A\|_N = \sigma_1 + \dots + \sigma_n$. It is the minimum value of $\|U\|_F\,\|V\|_F$ with UV = A.
$\|A\|_F^2 = \sigma_1^2 + \dots + \sigma_n^2$
$\|A\|_2 = \sigma_1$
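
All three of these norms come straight from the singular values; a quick numpy check (data is illustrative):

```python
import numpy as np

A = np.random.default_rng(0).normal(size=(4, 3))
s = np.linalg.svd(A, compute_uv=False)                        # singular values

print(np.isclose(np.linalg.norm(A, 2), s[0]))                 # l2 norm = sigma_1
print(np.isclose(np.linalg.norm(A, 'fro')**2, (s**2).sum()))  # Frobenius squared
print(np.isclose(np.linalg.norm(A, 'nuc'), s.sum()))          # nuclear = sum of sigmas
```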

Factoring Matrices and Tensors

A matrix is just a two-way tensor.

To compute a factorization A = UV, we introduce a simple alternating iteration. Update U with V fixed, then update V with U fixed. This UV idea also fits the famous k-means algorithm.

NMF

The goal of NMF is to approximate a nonnegative matrix A >= 0 by a lower rank product UV of two nonnegative matrices U >= 0 and V >= 0

When A >= 0 is symmetric positive definite, we hope for a matrix B >= 0 that satisfies $B^TB = A$. Often there is no solution, and we want to find a close enough $B^TB$.

SPCA: Find sparse low rank matrices B and C so that BC is close to A

Facial Feature Extraction: Each column vector of the data matrix A will represent a face. Its components are the intensities of the pixels in that image. The goal is to find a few “basic faces” in B, so that their combinations come close to the many faces in A.

Text Mining and Document Classification

each column of A represents a document. Each row of A represents a word. NMF identifies topics and classifies the whole set of documents with respect to those topics.

$a_j \approx \sum_i c_{ij}b_i$, where c gives the importance weights and b the topics

NMF is an NP-hard problem.

Computing the Factors: A central idea is alternating factorization: Hold one factor fixed, and optimize the other factor. Hold that one fixed, optimize the first factor, and repeat. Using the Frobenius norm, each step is a form of least squares.
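
As one concrete instance of that alternating idea, here is a sketch using the classic Lee-Seung multiplicative updates, which keep both factors nonnegative (a well-known NMF scheme, not necessarily the exact least-squares alternation the text describes):

```python
import numpy as np

def nmf(A, r, iters=500, seed=0, eps=1e-12):
    """Approximate nonnegative A by U @ V with U >= 0, V >= 0 of rank r."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    U, V = rng.random((m, r)), rng.random((r, n))
    for _ in range(iters):
        V *= (U.T @ A) / (U.T @ U @ V + eps)   # update V with U fixed
        U *= (A @ V.T) / (U @ V @ V.T + eps)   # update U with V fixed
    return U, V

A = np.random.default_rng(1).random((8, 6))    # nonnegative data matrix
U, V = nmf(A, r=3)
print(np.linalg.norm(A - U @ V, 'fro'))        # final Frobenius residual
```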

Tensors

A column vector is a 1-way tensor. A matrix is a 2-way tensor. Then a 3-way tensor has elements $T_{ijk}$: row number, column number, and “tube number”.

The rows and columns and tubes are the fibers of T, with only one varying index.

A Color Image is a Tensor with 3 Slices

A black and white image is just a matrix of pixels. The numbers in the matrix are the grayscales of each pixel. Normally those numbers are between zero (black) and 255 (white). A color image is a tensor. It has 3 slices corresponding to red-green-blue. Each slice shows the density of one of the colors RGB.

The Joint Probability Tensor

Suppose we measure age a in years and height h in inches and weight w in pounds. We put N children into I age groups and J height groups and K weight groups.

joint probabilities $p_{ijk}=N_{ijk}/N$. For each combination i, j, k we count only the children who are in age group i and also height group j and also weight group k.

The Norm and Rank of a Tensor

A tensor can multiply vectors, matrices, or tensors. Then it is a linear operator. A tensor can contain data. Its entries could give the brightness of pixels in an image. A color image is 3-way, stacking RGB. A color video will be a 4-way tensor.

Part 1.9 Principal Components

2024-12-24 00:00:00 +0000

Principal Component Analysis (PCA) uses the largest $\sigma$ connected to the first u’s and v’s to understand the information in a matrix of data. The closest rank k matrix to A is $A_k$. In statistics we are identifying the pieces of A with largest variance

PCA is “unsupervised” learning. Our only instructor is linear algebra. The SVD tells us to choose $A_k$.

Eckart-Young. Three choices for the matrix norm $\|A\|$.

PCA

We have n samples. For each sample we measure m variables, giving an m by n matrix (m rows, n columns).

The first step is to find the average (the sample mean) along each row of $A_0$. Subtract that mean from all n entries in the row. Now each row of the centered matrix A has mean zero.

How will linear algebra find that closest line through (0, 0)? It is in the direction of the first singular vector

The Statistics Behind PCA

Subtracting those means in each row from each row of $A_0$ produced the centered A. The variances are the diagonal entries of the matrix $AA^T$. The covariances are the off-diagonal entries of the matrix $AA^T$. High covariance means that increased height goes with increased age. (Negative covariance means that one variable increases when the other decreases.).

sample covariance matrix $ S = AA^T / (n -1)$

Given the data in A, computing S would be a computational mistake. For large matrices, a direct SVD of A is faster and more accurate.
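
A small numpy sketch of that route: take the SVD of the centered A directly instead of forming S (shapes follow the m by n convention above; data is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
A0 = rng.normal(size=(2, 100))            # m = 2 variables, n = 100 samples
A = A0 - A0.mean(axis=1, keepdims=True)   # center each row

U, s, Vt = np.linalg.svd(A, full_matrices=False)
u1 = U[:, 0]                              # direction of the closest line
var1 = s[0]**2 / (A.shape[1] - 1)         # top eigenvalue of S = AA^T/(n-1)

S = A @ A.T / (A.shape[1] - 1)            # the covariance route, for comparison
print(np.isclose(var1, np.linalg.eigvalsh(S)[-1]))   # True
```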

The leading eigenvector $u_1$ tells us slope of the closest line in the scatter plot.

The sum of squared distances from the data points to the $u_1$ line is a minimum.

The total variance in the data comes from the Frobenius norm (squared) of A. This is the trace of S, the sum down the diagonal. The trace equals the sum of the eigenvalues of the sample covariance matrix S.

Rayleigh quotient for symmetric matrix

$R(x) = x^TSx/x^Tx$. The maximum value of R(x) is the largest eigenvalue. Other eigenvectors are saddle points. Saddles have first derivatives = zero but they are not maxima or minima.

$\|A\|_2^2 = \max \|Ax\|^2/\|x\|^2 = \max R(x) = \lambda_1(S)$ with $S = A^TA$

Generalized Eigenvalues

Generalized Rayleigh quotient $R(x) = x^TSx / x^TMx$. M is symmetric. In dynamical problems M is often the “mass matrix” or the “inertia matrix”. In statistics M is generally the covariance matrix.

If M is positive definite, the maximum of R(x) is the largest eigenvalue of $M^{-1}S$

$H=M^{-1/2}SM^{-1/2}$ is symmetric

$x=M^{-1/2}y$ to convert $Sx = \lambda Mx$ into $Hy=\lambda y$

A crucial fact about a symmetric matrix S is that any two eigenvectors are orthogonal. Not true for generalized eigenvectors, but we do have M-orthogonality: $x_1^TMx_2 = 0$.

Positive Semidefinite M : Not Invertible

$x^TMx$ can be zero, so it is possible to have infinite eigenvalues.

In statistics M is often a covariance matrix. Its diagonal entries tell us the separate variances of two or more measurements. Its off-diagonal entries tell us the “covariances between the measurements”. If we are foolishly repeating exactly the same observations, or if one experiment is completely determined by another-then the covariance matrix M is singular. Its determinant is zero and it is not invertible

Fisher’s Linear Discriminant Analysis

We are given samples from two different populations, mixed together, and we know the mean and variance of each population. If all the samples are mixed together and we pick one, how do we tell if it probably came from population 1 or population 2?

Each sample has a feature vector f. We find a separation vector v: if $v^Tf > c$, the sample likely came from population 1. v maximizes the separation ratio R.

$R = (x^Tm_1 - x^Tm_2)^2/(x^T\Sigma_1x + x^T\Sigma_2x)$, where $m_1, m_2$ are the mean vectors

$S = (m_1-m_2)(m_1-m_2)^T$ $M = \Sigma _1 + \Sigma _2$

$Sv = \lambda Mv$

$v = M^{-1} (m_1 - m_2)$
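
A minimal numpy sketch on two synthetic Gaussian populations (the threshold c halfway between the projected means is a simple choice of mine, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], size=(200, 2))   # population 1
X2 = rng.normal(loc=[3.0, 1.0], size=(200, 2))   # population 2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
M = np.cov(X1.T) + np.cov(X2.T)                  # Sigma_1 + Sigma_2
v = np.linalg.solve(M, m1 - m2)                  # v = M^{-1}(m_1 - m_2)

c = v @ (m1 + m2) / 2                            # threshold between projected means
print((X1 @ v > c).mean(), (X2 @ v > c).mean())  # ~1.0 and ~0.0: good separation
```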

Neural networks will succeed by allowing separating surfaces that are not just planes.

Part 1.6: Eigenvalues

2024-12-23 00:00:00 +0000

The eigenvectors of A don’t change direction when you multiply them by A. The output Ax is on the same line as the input vector x. The eigenvector x is just multiplied by its eigenvalue

$A^2x=\lambda ^2 x$

$Ax = \lambda x$ (with $x \ne 0$) means that $A - \lambda I$ is not invertible: it is singular, and its determinant is 0

Similar matrices

For every invertible matrix B, the eigenvalues of A are the same as those of $BAB^{-1}$. The eigenvector becomes Bx, when $Ax = \lambda x$.

To compute the eigenvalues of A, we gradually make $BAB^{-1}$ diagonal, and the eigenvalues show up on the main diagonal.

Diagonalizing a Matrix

Suppose A has a full set of n independent eigenvectors. Put those eigenvectors into an invertible matrix X. Then $A = X\Lambda X^{-1}$. This property is used to compute powers of A.

Symmetric positive definite matrices

$S = S^T$: then the eigenvectors q can be chosen orthogonal

Spectral Theorem: Every real symmetric matrix has the form $S = Q\Lambda Q^T$. Q is the orthonormal eigenvector matrix, $Q^TQ = I$.

A positive definite matrix has all positive eigenvalues.

The energy test: S is positive definite if the energy $x^TSx$ is positive for all vectors $x \ne 0$

If $Sx = \lambda x$, then $x^TSx = \lambda\|x\|^2 > 0$ if $\lambda > 0$

If every eigenvector has positive energy ($\lambda > 0$), then every vector does: because S is symmetric, the eigenvectors can be chosen orthogonal, and every vector is a linear combination of those eigenvectors.

$S=A^TA$ for a matrix A with independent columns

For an ordinary function f(x) of one variable x, the minimum test is df/dx = 0 and second derivative > 0. For f(x, y) with two variables, the second derivatives go into a matrix, and there is a minimum when the matrix is positive definite.

There is a saddle point when S has both positive and negative eigenvalues. A saddle point matrix is “indefinite”.

Optimization and Machine Learning

gradient descent. Each step takes the steepest direction toward the bottom point of the bowl. Calculus: The partial derivatives of f are all zero. Linear algebra: The matrix S of second derivative is positive definite.

If S is positive definite (or semidefinite) at all points x, then the function f(x) is convex

Machine learning produces “loss functions” with hundreds of thousands of variables. They measure the error-which we minimize. But computing all the second derivatives is completely impossible. We use first derivatives to tell us a direction to move-the error drops fastest in the steepest direction. Then we take another descent step in a new direction. This is the central computation in least squares and neural nets and deep learning.

Ellipse

Stay with a positive definite matrix S. Cut through that bowl at height energy = 1, then the curve that goes around the cut is an ellipse

The axes of the tilted ellipse point along those eigenvectors of S. Axis lengths are $1/\sqrt{\lambda_i}$.

SVD

If A is not square then $Ax = \lambda x$ is impossible and eigenvectors fail. SVD helps in this case

$Av_1 = \sigma_1u_1$. The v's are orthogonal in $R^n$; the u's are perpendicular to each other in $R^m$. The number r is the rank of A.

$Av_{r+1} = 0$

$AV = U\Sigma$. V and U are square orthogonal matrices $V^T=V^{-1}$ because their columns are orthogonal unit vectors.

SVD: $A=U\Sigma V^T$. Reduced SVD so that $\Sigma _r$ is square

A special property of the SVD is that those pieces come in order of importance. $\sigma _1u_1v_1^T$ is the closest rank 1 matrix to A. The sum of the first k pieces is the best rank k approximation of A
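
A quick numerical check of this Eckart-Young property (discussed next): the best rank-k approximation in the $\ell^2$ norm leaves error $\sigma_{k+1}$.

```python
import numpy as np

A = np.random.default_rng(0).normal(size=(6, 5))
U, s, Vt = np.linalg.svd(A)

k = 2
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]              # first k pieces sigma_i u_i v_i^T
print(np.isclose(np.linalg.norm(A - Ak, 2), s[k]))   # l2 error = sigma_{k+1}: True
```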

Eckart-Young

$A^TA = V\Sigma^T\Sigma V^T$. V contains orthonormal eigenvectors of $A^TA$. The $\sigma_i^2$ are the nonzero eigenvalues of both $A^TA$ and $AA^T$.

KL transform’s connection to SVD

KL begins with a covariance matrix V of a zero-mean random process. V is symmetric and positive definite or semidefinite. In general V could be an infinite matrix or a covariance function. Then the KL expansion will be an infinite series. The KL transform is a stochastic (random) form of Principal Component Analysis

The Geometry of the SVD

The SVD separates a matrix into (orthogonal) x (diagonal) x (orthogonal). The orthogonal matrices U and V rotate the plane. The diagonal matrix stretches it along the axes.

Finite difference

The discrete form of a derivative is a finite difference. The discrete form of an integral is a sum.

D corresponds to the backward difference $f(x) - f(x - \Delta x)$

The eigenvectors v of $D^TD$ are the right singular vectors of D. They are discrete sines; the u's are discrete cosines. Scaled by $\sqrt{2}$, these are the famous DST and DCT matrices: Discrete Sine Transform and Discrete Cosine Transform. The DCT matrix has been the backbone of JPEG image compression. Actually JPEG increases U to 8 by 8, which reduces the “blockiness” of the image. 8 by 8 blocks of pixels are transformed by a two-dimensional DCT, then compressed and transmitted.

Part I: Highlights of Linear Algebra

2024-12-22 00:00:00 +0000

Ax = b : is the vector b in the column space of A?

$Ax = \lambda x$: eigenvector gives direction so that Ax keeps the direction of x. $A^2x = \lambda^2 x$

$Av = \sigma u$: SVD. Which part of that data matrix is important?

Ax is a linear combination of columns of A. Column space of A

3 independent columns in $R^3$ produce an invertible matrix.

A basis for a subspace is a full set of independent vectors. The rank of a matrix is the dimension of its column space.

A = CR, where R = rref(A). The number of independent columns equals the number of independent rows.

One column u times one row $v^T$: every nonzero matrix $uv^T$ has rank 1

Outer product approach is important in data science because we are looking for important parts of A

A = LU: elimination. L is lower triangular, and U is upper triangular.

A = QR: orthogonalizing. $Q^TQ=I$ and R is upper triangular

$S=Q\Lambda Q^T$: S is symmetric. Eigenvalues on the diagonal of lambda. Orthonormal eigenvectors in the columns of Q

$A=X\Lambda X^{-1}$: diagonalization when A is n by n with n independent eigenvectors. Eigenvalues of A on the diagonal of lambda. Eigenvectors of A in the columns of X.

$A=U\Sigma V^T$: SVD. Orthonormal singular vectors in U and V. Singular values in sigma

Orthogonal matrix: $Q^T = Q^{-1}$. $q_i^Tq_j = 1$ if i = j, otherwise 0

$Sq = \lambda q$ Eigenvector q and eigenvalue lambda. Each eigenvalue and eigenvector contribute a rank one piece to S

Columns in $Q\Lambda$ are $\lambda _1q_1$ to $\lambda _nq_n$, because lambda is a diagonal matrix

nullspace: N(A) contains all solutions x to Ax = 0

left nullspace: $N(A^T)$ contains all solutions y to $A^Ty = 0$

r independent equations Ax = 0 have n - r independent solutions

incidence matrix

A has 1 and -1 on every row, to show the end node and the start node for each edge

The r edges of a tree give r independent rows of A : rank = r = n - 1

There are m - r = m - n + 1 independent small loops in the graph

Rank

Rank of AB <= rank of A, because every column of AB is a combination of the columns of A

Rank of A + B <= (rank of A) + (rank of B)

Rank of $A^TA$ = rank of $AA^T$ = rank of A = rank of $A^T$, because $A^TA$ and A have the same nullspace.

If A is m by r and B is r by n, both with rank r, then AB also has rank r, because r = rank of ($A^TABB^T$) <= rank of (AB) <= rank of A = r. However this does not hold for BA.

Elimination factored A = LU into a lower triangular L times an upper triangular U

Zero cannot be the first pivot. If there is a nonzero number lower down in column 1, its row can be the pivot row. Good codes will choose the largest number to be the pivot.

Every invertible n by n matrix A leads to P A = LU : P = permutation. The inverse of every permutation matrix P is its transpose $P^T$

Orthogonal Matrices and Subspaces

Orthogonal vectors x and y: $x^Ty= x_1y_1 + .. + x_ny_n = 0$

Orthogonal subspaces R and N. Every vector in the space R is orthogonal to every vector in N.

Orthonormal basis: Orthogonal basis of unit vectors. Every $v_i^Tv_i = 1$

Tall thin matrices Q with orthonormal columns: $Q^TQ = I$. If this Q multiplies any vector x, the length of the vector does not change: $\|Qx\| = \|x\|$

$(Qx)^TQx = x^TQ^TQx = x^Tx$

“Orthogonal matrices” are square with orthonormal columns, $Q^T=Q^{-1}$

Pythagoras's law for right triangles (orthogonal x and y, so $x^Ty = y^Tx = 0$): $\|x-y\|^2 = (x-y)^T(x-y) = x^Tx - y^Tx - x^Ty + y^Ty = \|x\|^2 + \|y\|^2$

The row space of A is orthogonal to the nullspace of A. The column space of A is orthogonal to the nullspace of $A^T$

if $P^2 = P = P^T$, then Pb is the orthogonal projection of b onto the column space of P

Multiplying orthogonal matrices produces an orthogonal matrix. $(Q_1Q_2)^T(Q_1Q_2) = I$. Rotation times rotation = rotation. Reflection times reflection = rotation. Rotation times reflection = reflection

Suppose the n by n orthogonal matrix Q has columns $q_1 \dots q_n$. Every vector v can be written as a combination of the basis vectors, $v = Qc$. The coefficients in an orthonormal basis are just $c_i = q_i^Tv$.

$Q^Tv=Q^TQc=c$
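
A tiny worked example (mine, not from the notes): with the 45° rotation basis $q_1 = \frac{1}{\sqrt2}(1, 1)$ and $q_2 = \frac{1}{\sqrt2}(-1, 1)$, the vector $v = (1, 0)$ has coefficients $c_1 = q_1^Tv = \frac{1}{\sqrt2}$ and $c_2 = q_2^Tv = -\frac{1}{\sqrt2}$, and indeed $c_1q_1 + c_2q_2 = (1, 0) = v$.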

Deep Learning and Neural Nets

2024-12-20 00:00:00 +0000

  • Goal of a neural net: To construct a function that classifies the training data correctly, so it can generalize to unseen test data.

Learning function F

  • One input v for each training sample. For the problem of identifying handwritten digits, each input sample will be an image, a matrix of pixels
  • Output: 0 to 9. Those ten numbers are the possible outputs from the learning function.
  • Train a learning function on part of the MNIST set. Validate the function by choosing unseen MNIST samples

Linear/affine learning function

The simplest learning function would be linear $w=Av+b$

  • The entries in the matrix A are the weights to be learned
  • b is the bias vector and is to be learned
  • Linearity is a very limiting requirement. What would be halfway between I and XIX?

Neural network

$F(v) = L(R(L(R(…Lv))))$

$Lv = Av + b$

R/ReLU: nonlinear function on each component of $Lv$, ReLU(x) = max(0, x) is often picked

$F(x,v)$ depends on the input v and the weights x (the A's and b's). $v_1=ReLU(A_1v+b_1)$ produces the first hidden layer of the neural network

The complete net starts with the input layer v and output layer $w=F(v)$

$L_k(v_{k-1})= A_kv_{k-1}+b_k$
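
A minimal Go sketch of this composition for one hidden layer, $w = L_2(ReLU(L_1(v)))$ (hand-rolled and illustrative; no deep learning library is assumed):

	// affine computes Av + b for a dense matrix A.
	func affine(A [][]float64, v, b []float64) []float64 {
		w := make([]float64, len(A))
		for i, row := range A {
			s := b[i]
			for j, a := range row {
				s += a * v[j]
			}
			w[i] = s
		}
		return w
	}

	// relu applies ReLU(x) = max(0, x) componentwise.
	func relu(v []float64) []float64 {
		out := make([]float64, len(v))
		for i, x := range v {
			if x > 0 {
				out[i] = x
			}
		}
		return out
	}

	// forward computes the two-layer output w = A2*ReLU(A1*v + b1) + b2.
	func forward(A1, A2 [][]float64, b1, b2, v []float64) []float64 {
		return affine(A2, relu(affine(A1, v, b1)), b2)
	}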

The goal is to minimize the total loss over all training samples by choosing the weights in $A_k$ and $b_k$, although least squares is not the best loss function for deep learning

The number of weights $A_{ij}$ and $b_k$ is often larger than the number of training samples

For images, a CNN is often appropriate; weights are shared, so the diagonals of A are constant

Weights are optimized by stochastic gradient descent

positive definite symmetric matrices S

Orthogonal eigenvectors q. Positive eigenvalues $\lambda$

$S = \lambda_1 q_1q_1^T + \lambda_2 q_2q_2^T + \cdots$. Assuming $\lambda_1 > \lambda_2 > \cdots$, the term $\lambda_1 q_1q_1^T$ is the most informative part

Layers of F(x, v)

The weights x enter each layer before the ramp function ReLU is applied

Start with fully connected layers: neurons on layer N connect to neurons on layer N + 1.

CNN repeats the same weights around all pixels in an image

Pooling layer reduces dimensions

Dropout randomly leaves out neurons

Batch normalization resets the mean and variance

Slowly Changing Dimensions

2024-11-27 00:00:00 +0000

Motivation

  • A dimension does not change often. For example, suppliers do not move to a different state often, customers rarely update demographic info, and products don't change their description often
  • Track both current and historical data over time. An aggregate table summarizing facts should continue to reflect the historical state

Type 1: Overwrite

  • No historical data

Type 2: add a new row

  • Multiple rows for a given natural key in the dimensional tables with separate surrogate keys and/or different version numbers. Each row represents the full details at that version.
  • Ways to track versions by
    • a version number. Can use a new surrogate key for the new version
    • Start date + end date. The latest version uses a null end date or a sentinel value (see the example rows after this list)
    • Single effective date + current flag
  • If there are retroactive changes made to the contents of the dimension, or if new attributes are added to the dimension which have different effective dates from those already defined, then this can result in the existing transactions needing to be updated to reflect the new situation. This can be an expensive database operation, so Type 2 SCDs are not a good choice if the dimensional model is subject to frequent change.
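
For concreteness, a sketch of what Type 2 history might look like for a customer who moves from NY to CA (column names are illustrative, with customer_sk the surrogate key and customer_id the natural key):

	customer_sk | customer_id | state | start_date | end_date   | current_flag
	1001        | C42         | NY    | 2020-01-01 | 2023-06-30 | N
	1002        | C42         | CA    | 2023-07-01 | (null)     | Y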

Type 3: previous value column

  • Only the previous history is stored
  • No new row, and thus no change to the surrogate key

Type 4: historical table

  • The dimension table does not change, i.e., it holds only the latest version
  • The history table keeps track of create date

Skipping types 5-7 since they seem less popular

Support UNION in Squirrel

2024-11-03 00:00:00 +0000

Option: add UNION to selectData

  • PR 1
  • PR 2
  • PR 3. This PR is not feasible because it assumes
    • single union only
    • No group by/join when union is added to the builder

type selectData struct {
	PlaceholderFormat PlaceholderFormat
	RunWith           BaseRunner
	Prefixes          exprs
	Options           []string
	Columns           []Sqlizer
	From              Sqlizer
	Joins             []Sqlizer
	WhereParts        []Sqlizer
	GroupBys          []string
	HavingParts       []Sqlizer
	OrderBys          []string
	Limit             string
	Offset            string
	Union             []Sqlizer // same data type as Joins and From
	UnionAll          []Sqlizer
	Suffixes          exprs // previously used to simulate UNION
}

Inside ToSql, attach UNION and UNION ALL after suffixes

	if len(d.Suffixes) > 0 {
		sql.WriteString(" ")
		args, _ = d.Suffixes.AppendToSql(sql, " ", args)
	}

	if len(d.Union) > 0 {
		sql.WriteString(" UNION ")
		args, err = appendToSql(d.Union, sql, " UNION ", args)
		if err != nil {
			return
		}
	}
	if len(d.UnionAll) > 0 {
		sql.WriteString(" UNION ALL ")
		args, err = appendToSql(d.UnionAll, sql, " UNION ALL ", args)
		if err != nil {
			return
		}
	}

	sqlStr = sql.String()
	return
  • Union and union all should probably be before the suffix
  • Union still needs to support limit and order by
  • This code assumes no mix of UNION and UNION ALL in the same SQL

Option: UnionBuilder


type unionSelect struct {
	op       string // e.g. "UNION"
	selector squirrel.SelectBuilder
}

type unionData struct {
	Selects []*unionSelect
	Limit   string
	OrderBy []string
	PlaceholderFormat sq.PlaceholderFormat
}

type UnionBuilder builder.Builder
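
A minimal usage sketch of this option (NewUnionBuilder and the chained methods here are hypothetical names, not an existing squirrel API):

	// Hypothetical: seed the builder with the first SELECT, then append
	// further SELECTs with their set operator into unionData.Selects.
	active := sq.Select("id").From("accounts")
	archived := sq.Select("id").From("archived_accounts")
	u := NewUnionBuilder(active).Union(archived).OrderBy("id").Limit("10")
	sqlStr, args, err := u.ToSql()
	// sqlStr: SELECT id FROM accounts UNION SELECT id FROM archived_accounts ORDER BY id LIMIT 10

This shape keeps ORDER BY and LIMIT on the union as a whole, which addresses the limitation noted for the selectData option above.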

Notes on pg_stats

2024-08-16 00:00:00 +0000

  • To collect statistics, the analyzer randomly selects 300 × default_statistics_target rows (the default value is 100, so 30,000 rows in total).
  • Entries in pg_statistic are updated by the ANALYZE and VACUUM ANALYZE commands, and are always approximate even when freshly updated. After you complete an engine major version upgrade, you should run the ANALYZE operation to refresh the pg_statistic table

Statistics Used by the Planner

SELECT * FROM tenk1

  • The planner determines the cardinality of tenk1
SELECT relpages, reltuples FROM pg_class WHERE relname = 'tenk1';
  • Basic relation-level statistics are stored in the table pg_class in the system catalog.
  • reltuples: Relation’s row count
  • relpages: Relation’s size in pages

WHERE unique1 < 1000

  • The planner examines the WHERE clause condition and looks up the selectivity function for the operator < in pg_operator. This is held in the column oprrest, and the entry in this case is scalarltsel. The scalarltsel function retrieves the histogram for unique1 from pg_statistic.
  • For equality estimation the histogram is not useful; instead the list of most common values (MCVs) is used to determine the selectivity.

WHERE t1.unique1 < 50 AND t1.unique2 = t2.unique2

  • The restriction on tenk1, unique1 < 50, is evaluated before the nested-loop join
  • The restriction for the join is t2.unique2 = t1.unique2. The operator is just our familiar =, however the selectivity function is obtained from the oprjoin column of pg_operator, and is eqjoinsel. eqjoinsel looks up the statistical information for both tenk2 and tenk1, e.g., null_frac,n_distinct, most_common_vals
 selectivity = (1 - null_frac1) * (1 - null_frac2) * min(1/num_distinct1, 1/num_distinct2)
rows = (outer_cardinality * inner_cardinality) * selectivity
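
Plugging in the numbers from the tenk1/tenk2 walkthrough in the PostgreSQL docs: both unique2 columns have null_frac = 0 and num_distinct = 10000, so selectivity = 1/10000 = 0.0001; with outer cardinality 50 (the rows passing unique1 < 50) and inner cardinality 10000, rows = 50 * 10000 * 0.0001 = 50.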

pg_stats

  • pg_stats is designed to be more easily readable and is readable by all, whereas pg_statistic is only readable by a superuser.
  • For a read replica in Amazon RDS for PostgreSQL and for a reader node in Aurora PostgreSQL, these stats are the same as for the primary or writer. This is because they are stored in a relation (pg_statistic) on disk (physical blocks are the same on the replica in Amazon RDS for PostgreSQL, and in the case of Aurora, the reader is reading from the same storage). This is also the reason why it isn’t allowed (and also not logical) to run an ANALYZE on a replica or a reader node (both can read from the pg_statistic relation, but can’t update it).

ALTER TABLE SET STATISTICS

  • sets the per-column statistics-gathering target for subsequent ANALYZE operations. The target can be set in the range 0 to 10000; alternatively, set it to -1 to revert to using the system default statistics target (default_statistics_target)
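
For example, to raise the target for a single column and then refresh its stats:

	ALTER TABLE tenk1 ALTER COLUMN unique1 SET STATISTICS 500;
	ANALYZE tenk1;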

correlation

  • When the value is near -1 or +1, an index scan on the column will be estimated to be cheaper than when it’s near 0

most_common_vals(MCV) and most_common_freqs

  • MCV lists are also used for selectivity estimations of inequalities: to find the selectivity of “column < value”, the planner searches most_common_vals for all the values lower than the given value, and then adds together their frequencies from most_common_freqs.
  • Common value statistics work best when the number of distinct values is low. The maximum size of the MCV arrays is defined by default_statistics_target, the same parameter that governs row sample size during analysis.
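
E.g., with most_common_vals = {1, 2, 3} and most_common_freqs = {0.10, 0.05, 0.05} (made-up numbers), the MCV contribution to the selectivity of "column < 3" is 0.10 + 0.05 = 0.15; the histogram then supplies the estimate for the remaining, non-MCV fraction of rows.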

Histogram

  • When the number of distinct values grows too large to store them all in an array, the system starts using the histogram representation. A histogram employs several buckets to store values in. The number of buckets is limited by the same default_statistics_target parameter.
  • histograms are used to estimate selectivity of “greater than” and “less than” operations together with MCV lists.
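
E.g., with 10 equal-frequency buckets, a value that falls 40% of the way through the third bucket gets a histogram selectivity of roughly (2 + 0.4) / 10 = 0.24 (an illustration of the interpolation; the MCV fraction is handled separately).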

n_distinct

ALTER TABLE ... ALTER COLUMN ... SET (n_distinct = ...)

the only defined per-attribute options are n_distinct and n_distinct_inherited, which override the number-of-distinct-values estimates made by subsequent ANALYZE operations, i.e., the n_distinct change will not be in effect until you run ANALYZE again
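
For example (standard PostgreSQL syntax; a negative n_distinct is interpreted as a ratio of the row count, so -0.5 means half the rows are distinct):

	ALTER TABLE tenk1 ALTER COLUMN unique1 SET (n_distinct = -0.5);
	ANALYZE tenk1;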

How is pg_stats built

    FROM (((pg_statistic s                                                                                                                                     
      JOIN pg_class c ON ((c.oid = s.starelid)))                                                                                                               
      JOIN pg_attribute a ON (((c.oid = a.attrelid) AND (a.attnum = s.staattnum))))                                                                            
      LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)))  
  • tablename: pg_class.relname
  • attname: from pg_attribute, identified by oid + attnum
  • null_frac: stanullfrac
  • avg_width: stawidth
  • n_distinct: stadistinct

CREATE STATISTICS

  • Extended statistics metadata is stored in the pg_statistic_ext table in the system catalog, while the statistics data itself is stored in a separate table pg_statistic_ext_data
  • CREATE STATISTICS will create a new extended statistics object tracking data about the specified table, foreign table or materialized view. The statistics object will be created in the current database and will be owned by the user issuing the command.
  • The CREATE STATISTICS command has two basic forms. The first form allows univariate statistics for a single expression to be collected, providing benefits similar to an expression index without the overhead of index maintenance. This form does not allow the statistics kind to be specified, since the various statistics kinds refer only to multivariate statistics. The second form of the command allows multivariate statistics on multiple columns and/or expressions to be collected, optionally specifying which statistics kinds to include. This form will also automatically cause univariate statistics to be collected on any expressions included in the list.
  • If a schema name is given (for example, CREATE STATISTICS myschema.mystat …) then the statistics object is created in the specified schema. Otherwise it is created in the current schema. The name of the statistics object must be distinct from the name of any other statistics object in the same schema.
  • Extended statistics are not currently used by the planner for selectivity estimations made for table joins. This limitation will likely be removed in a future version of PostgreSQL.
  • Creation of such an object merely creates a catalog entry expressing interest in the statistics. Actual data collection is performed by ANALYZE (either a manual command, or background auto-analyze). The collected values can be examined in the pg_statistic_ext_data catalog.
  • ANALYZE computes extended statistics based on the same sample of table rows that it takes for computing regular single-column statistics.

Dependencies

  • By default, the statistics from ANALYZE are stored on a per-column per-table basis, and therefore can’t capture any knowledge about cross-column correlation. It’s common to see slow queries running bad run plans because multiple columns used in the query clauses are correlated. However, with the CREATE STATISTICS command, you can create extended statistics for correlated columns.
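
For example, a functional-dependencies statistics object on two correlated columns (the city/zip example from the PostgreSQL docs):

	CREATE STATISTICS stts (dependencies) ON city, zip FROM zipcodes;
	ANALYZE zipcodes;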

References

Notes on query plan optimization in RDBMS

2024-06-01 00:00:00 +0000

  • Steps in Query Processing: query -> parsing -> RA expressions -> optimizer with stats -> execution plan -> eval engine with data -> output
  • Query optimizers use equivalence rules to systematically generate expressions equivalent to the given expression, i.e., heuristic optimization and cost-based optimization
  • Frequently used approach: heuristic rewriting of nested block structure and aggregation. Followed by cost-based join-order optimization for each block
  • Optimizers often use simple heuristics for very cheap queries, and perform exhaustive enumeration for more expensive queries
  • Pushing projections down also reduces the number of pages read, since it eliminates columns early
  • Heuristics
  • Consider only left-deep plans
  • Avoid Cartesian products
  • Don’t optimize the entire query at once. Break query into query blocks and optimize one block at a time
    • A query block contains a single SELECT-FROM-WHERE expression as well as GROUP BY, HAVING clauses

References

Notes on Relational Algebra

2024-05-24 00:00:00 +0000

Overview

  • Relational algebra and relational calculus are formal languages associated with the relational model. Informally, relational algebra is a (high-level) procedural language and relational calculus a nonprocedural language. However, formally both are equivalent to one another.
  • Relational algebra operations work on one or more relations to define another relation without changing the original relations. Any operation on relations produces a relation. This is why we call this an algebra
  • A basic expression in the relational algebra consists of either a relation in the database or a constant relation
  • Both operands and results are relations, so output from one operation can become input to another operation.
  • Composition of relational algebra operations: Allows expressions to be nested, just as in arithmetic. This property is called closure.
  • Relational model is set-based (no duplicate tuples)
  • RA is not universal. It is impossible in relational algebra (or standard SQL) to compute a recursive relation such as Answer(Ancestor, Descendant), i.e., to traverse a graph
  • Many relational databases use relational algebra operations for representing execution plans. Relatively easy to manipulate for query optimization

Operations

Basic Operations

  • Each operation takes one or two relations as input. Produces another relation as output
  • Selection: a subset of rows. Sigma(predicate, R). Unary
  • Projection: a subset of columns. Pi(col1, … colN, R). Unary
    • Result of PROJECT operation is a set of distinct tuples, i.e., maps to a DISTINCT in SQL
  • Cartesian product (X)
  • Union
  • Set Difference: R - S => items in R but not S. R and S must be union-compatible.

  • Intersection: R - (R - S)
  • Division: defines a relation over the attributes C (those of R not in S) that consists of the tuples from R that match every tuple in S. Binary, similar to join. E.g., Completed(student, task) divided by Required(task) gives the students who completed every required task
  • Unary RENAME operator: Sometimes we want to be able to give names to intermediate results of relational algebra operations

Join

  • Equivalent to performing a Selection, using join predicate as selection formula, over Cartesian product of the two operand relations.
  • One of the most difficult operations to implement efficiently in an RDBMS and one reason why RDBMSs have intrinsic performance problems.
  • Inner join/Theta join: Defines a relation that contains tuples satisfying the predicate F from the Cartesian product of R and S
    • Can rewrite Theta join using basic Selection and Cartesian product operations.
    • If predicate F contains only equality (=), the term Equijoin is used.
  • Natural join: An Equijoin of the two relations R and S over all common attributes x. One occurrence of each common attribute is eliminated from the result.
    • No join condition
    • Equijoin on attributes having identical names followed by projection to remove duplicate (superfluous) attributes
    • Often attribute(s) in foreign keys have identical name(s) to the corresponding primary keys
  • Outer join
  • Semi join: Defines a relation that contains the tuples of R that participate in the join of R with S.
    • Can rewrite Semijoin using Projection and Join: Semijoin(R, S) = Pi(attributes of R, R Join S)
  • To push a projection operation inside a join requires that the result of the projection contain the attributes used in the join.

Aggregate Functions & Grouping

  • F(grouping attributes, function list, R)
    • grouping attributes are attributes of R
    • function list is a list of (function, attribute) pairs
  • If no renaming occurs, the attributes of the resulting relation are named by concatenating the name of the function and the attribute.
  • Can append -distinct to any aggregate function to specify elimination of duplicates, e.g., count-distinct

Relation variables

  • Refer to a specific relation: A specific set of tuples, with a particular schema
  • A named relation is technically a relation variable

Assignment

  • relvar <- E. E is an expression that evaluates to a relation
  • assign a relation-value to a relation-variable
  • the name relvar persists in the database
  • Query evaluation becomes a sequence of steps, e.g., % operator

Rename

  • Results of relational operations are unnamed. Result has a schema, but the relation itself is unnamed
  • Used to resolve ambiguities within a specific relational algebra expression
    • Allow a derived relation and its attributes to be referred to by enclosing relational-algebra operations
    • Allow a base relation to be used multiple ways in one query, e.g., self-join
  • rho(x, E): x - new name, E - named relation, relation-variable, or an expression that produces a relation
  • does not create a new relation-variable
  • The new name is only visible to enclosing relational-algebra expression

Query tree

  • Leaf nodes are base relations/operands — either variables standing for relations or particular, constant relations
  • Internal nodes are operators

Relational algebra vs SQL

  • SQL data model is a bag/multiset not a set, i.e. duplicates allowed.
  • Some operations, like projection, are more efficient on bags than sets.
  • SQL is much more on the “declarative” end of the spectrum (What you want). RA is procedural (how to do it)

Null

  • Null signifies an unknown value or that a value does not exist.
  • The result of any arithmetic expression involving null is null.
  • Aggregate functions simply ignore null values (as in SQL)
  • For duplicate elimination and grouping, null is treated like any other value, and two nulls are assumed to be the same (as in SQL)
  • (unknown or true) = true, (unknown or false) = unknown, (unknown or unknown) = unknown

References

How does CockroachDB translate an AST into execution plans

2024-05-21 00:00:00 +0000

Phases

  • Parsing → Logical Planning & Optimization → Physical Planning → Execution
  • The SQL front-end is responsible for parsing, desugaring, free simplifications and semantic analysis. The SQL middle-end is responsible for logical planning and optimization. For efficiency and engineering reasons, the front-end and middle-end are grouped together in the code
  • If the architecture calls out a thing called e.g. “query runner”, which takes as input a logical query plan (a data structure) and outputs result rows (another data structure), you’d usually expect a thing in the source code called “query runner” that looks like a class whose instances would carry the execution’s internal state providing some methods that take a logical plan as input, and returning result rows as results.
  • It is common for SQL engines to separate processing of a query into two phases: preparation and execution. This is especially valuable because the work of preparation can be performed just once for multiple executions.

Parsing

  • CockroachDB uses an LALR parser that is generated by goyacc from a Yacc-like grammar file. This file specifies CockroachDB’s SQL dialect, and was originally copied from PostgreSQL and then modified.

// Statement is the result of parsing a single statement. It contains the AST
// node along with other information.
type Statement[T any] struct {
	// AST is the root of the AST tree for the parsed statement.
	// Note that it is NOT SAFE to access this currently with statement execution,
	// as unfortunately the AST is not immutable.
	// See issue https://github.com/cockroachdb/cockroach/issues/22847 for more
	// details on this problem.
	AST T

	// Comments is the list of parsed SQL comments.
	Comments []string

	// SQL is the original SQL from which the statement was parsed. Note that this
	// is not appropriate for use in logging, as it may contain passwords and
	// other sensitive data.
	SQL string

	// NumPlaceholders indicates the number of arguments to the statement (which
	// are referenced through placeholders). This corresponds to the highest
	// argument position (i.e. the x in "$x") that appears in the query.
	//
	// Note: where there are "gaps" in the placeholder positions, this number is
	// based on the highest position encountered. For example, for `SELECT $3`,
	// NumPlaceholders is 3. These cases are malformed and will result in a
	// type-check error.
	NumPlaceholders int

	// NumAnnotations indicates the number of annotations in the tree. It is equal
	// to the maximum annotation index.
	NumAnnotations tree.AnnotationIdx
}
  • Note that the AST only represents the syntax of the query, and says nothing about how, or even if, it can be executed. For example, it has no idea whether a table exists or what datatype a column has. Figuring that out is the job of the planner.
  • mainly Parser.Parse() in sql/parser/parse.go.

Logical Planning and Optimization

  • Query planning is the process of converting the SQL query AST into a tree of relational algebra operators that will produce the desired result.
EXPLAIN
    SELECT companies.name AS company, countries.name AS country, employees
    FROM companies JOIN countries ON companies.country = countries.id
    WHERE employees > 1e6
    ORDER BY employees DESC;
  • Each node in the plan is a relational operator, and results flow upwards from one node to the next. We can see that the optimizer has decided to do a full table scan of the companies table, filter the rows on employees, perform a lookup join with the countries table, and finally sort the results by employees in descending order.
  • Scalar expressions (algebraic expressions such as tax / total * 100) are originally represented by nested tree.Expr AST nodes and implement a visitor pattern for analysis and transformation; a generic sketch of such a visitor follows.
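
A generic Go sketch of such a visitor over an expression tree (illustrative; CockroachDB's actual tree.Visitor interface may differ in details):

	// Expr is a node in a scalar-expression tree.
	type Expr interface {
		// Walk applies the visitor to this node and returns the
		// (possibly rewritten) node.
		Walk(v Visitor) Expr
	}

	// Visitor is called before and after a node's children are visited;
	// VisitPre may rewrite the node or stop the recursion.
	type Visitor interface {
		VisitPre(expr Expr) (recurse bool, newExpr Expr)
		VisitPost(expr Expr) Expr
	}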

Planner


// planner is the centerpiece of SQL statement execution combining session
// state and database state with the logic for SQL execution. It is logically
// scoped to the execution of a single statement, and should not be used to
// execute multiple statements. It is not safe to use the same planner from
// multiple goroutines concurrently.
//
// planners are usually created by using the newPlanner method on a Session.
// If one needs to be created outside of a Session, use makeInternalPlanner().
type planner struct {
	txn *kv.Txn

	// isInternalPlanner is set to true when this planner is not bound to
	// a SQL session.
	isInternalPlanner bool

	// Corresponding Statement for this query.
	stmt Statement

	instrumentation instrumentationHelper

	// Contexts for different stages of planning and execution.
	semaCtx         tree.SemaContext
	extendedEvalCtx extendedEvalContext

	// sessionDataMutator is used to mutate the session variables. Read
	// access to them is provided through evalCtx.
	sessionDataMutator *sessionDataMutator

	// execCfg is used to access the server configuration for the Executor.
	execCfg *ExecutorConfig

	preparedStatements preparedStatementsAccessor

	// avoidCachedDescriptors, when true, instructs all code that
	// accesses table/view descriptors to force reading the descriptors
	// within the transaction. This is necessary to read descriptors
	// from the store for:
	// 1. Descriptors that are part of a schema change but are not
	// modified by the schema change. (reading a table in CREATE VIEW)
	// 2. Disable the use of the table cache in tests.
	avoidCachedDescriptors bool

	// If set, the planner should skip checking for the SELECT privilege when
	// initializing plans to read from a table. This should be used with care.
	skipSelectPrivilegeChecks bool

	// autoCommit indicates whether we're planning for an implicit transaction.
	// If autoCommit is true, the plan is allowed (but not required) to commit the
	// transaction along with other KV operations. Committing the txn might be
	// beneficial because it may enable the 1PC optimization.
	//
	// NOTE: plan node must be configured appropriately to actually perform an
	// auto-commit. This is dependent on information from the optimizer.
	autoCommit bool

	// cancelChecker is used by planNodes to check for cancellation of the associated
	// query.
	cancelChecker *cancelchecker.CancelChecker

	// isPreparing is true if this planner is currently preparing.
	isPreparing bool

	// curPlan collects the properties of the current plan being prepared. This state
	// is undefined at the beginning of the planning of each new statement, and cannot
	// be reused for an old prepared statement after a new statement has been prepared.
	curPlan planTop

	// Avoid allocations by embedding commonly used objects and visitors.
	txCtx                 transform.ExprTransformContext
	nameResolutionVisitor schemaexpr.NameResolutionVisitor
	tableName             tree.TableName

	// Use a common datum allocator across all the plan nodes. This separates the
	// plan lifetime from the lifetime of returned results allowing plan nodes to
	// be pool allocated.
	alloc *rowenc.DatumAlloc

	// optPlanningCtx stores the optimizer planning context, which contains
	// data structures that can be reused between queries (for efficiency).
	optPlanningCtx optPlanningCtx

	// noticeSender allows the sending of notices.
	// Do not use this object directly; use the BufferClientNotice() method
	// instead.
	noticeSender noticeSender

	queryCacheSession querycache.Session

	// contextDatabaseID is the ID of a database. It is set during some name
	// resolution processes to disallow cross database references. In particular,
	// the type resolution steps will disallow resolution of types that have a
	// parentID != contextDatabaseID when it is set.
	contextDatabaseID descpb.ID
}
  • These normalization rules are compiled into Go code along with a norm.Factory that builds normalized memo expressions.
  • Now that we’ve built a normalized memo, buildExecMemo() goes on to optimize the plan using xform.Optimizer. This explores possible query plans by applying Optgen transformation rules to generate equivalent memo groups, and tries to pick the best combination through cost-based optimization based on table statistics.
  • opt/optbuilder
    • in-order depth-first recursive traversal of the AST;
    • invokes semantics checks on the way;
    • constructs a tree of “relational expression nodes”. This tree is also called “the memo” because of the data structure it uses internally.
    • the resulting tree is the logical plan.

References

How does TiDB translate JOIN and GROUP BY into logical plans

2024-05-13 00:00:00 +0000

Join


// Join represents table join.
type Join struct {
	node
	resultSetNode

	// Left table can be TableSource or JoinNode.
	Left ResultSetNode
	// Right table can be TableSource or JoinNode or nil.
	Right ResultSetNode
	// Tp represents join type.
	Tp JoinType
	// On represents join on condition.
	On *OnCondition
	// Using represents join using clause.
	Using []*ColumnName
	// NaturalJoin represents join is natural join
	NaturalJoin bool
}


func (b *planBuilder) buildJoin(join *ast.Join) LogicalPlan {
	b.optFlag = b.optFlag | flagPredicatePushDown
	leftPlan := b.buildResultSetNode(join.Left)
	rightPlan := b.buildResultSetNode(join.Right)
	newSchema := expression.MergeSchema(leftPlan.Schema(), rightPlan.Schema())
	joinPlan := LogicalJoin{}.init(b.ctx)
	joinPlan.SetChildren(leftPlan, rightPlan)
	joinPlan.SetSchema(newSchema)

	// Merge sub join's redundantSchema into this join plan. When handle query like
	// select t2.a from (t1 join t2 using (a)) join t3 using (a);
	// we can simply search in the top level join plan to find redundant column.
	var lRedundant, rRedundant *expression.Schema
	if left, ok := leftPlan.(*LogicalJoin); ok && left.redundantSchema != nil {
		lRedundant = left.redundantSchema
	}
	if right, ok := rightPlan.(*LogicalJoin); ok && right.redundantSchema != nil {
		rRedundant = right.redundantSchema
	}
	joinPlan.redundantSchema = expression.MergeSchema(lRedundant, rRedundant)

	b.curClause = onClause
	onExpr, newPlan, err := b.rewrite(join.On.Expr, joinPlan, nil, false)
	if err != nil {
		b.err = err
		return nil
	}
	// the rewrite may wrap the plan (e.g., for subqueries in the ON clause)
	joinPlan = newPlan.(*LogicalJoin)
	onCondition := expression.SplitCNFItems(onExpr)
	joinPlan.attachOnConds(onCondition)
	joinPlan.JoinType = InnerJoin
	return joinPlan
}

Group by


// LogicalAggregation represents an aggregate plan.
type LogicalAggregation struct {
	logicalSchemaProducer

	// AggFuncs holds the aggregation functions. AggFuncDesc describes an
	// aggregation function signature and is only used in the planner.
	AggFuncs     []*aggregation.AggFuncDesc
	GroupByItems []expression.Expression
	// groupByCols stores the columns that are group-by items.
	groupByCols []*expression.Column

	possibleProperties [][]*expression.Column
	inputCount         float64 // inputCount is the input count of this plan.
}

type AggFuncDesc struct {
	// Name represents the aggregation function name.
	Name string
	// Args represents the arguments of the aggregation function.
	Args []expression.Expression
	// RetTp represents the return type of the aggregation function.
	RetTp *types.FieldType
	// Mode represents the execution mode of the aggregation function.
	Mode AggFunctionMode
	// HasDistinct represents whether the aggregation function contains distinct attribute.
	HasDistinct bool
}

Rule-based optimization for SELECT in tidb

2024-05-12 00:00:00 +0000



func logicalOptimize(flag uint64, logic LogicalPlan) (LogicalPlan, error) {
    var err error
    for i, rule := range optRuleList {
        // The order of flags is same as the order of optRule in the list.
        // We use a bitmask to record which opt rules should be used. If the i-th bit is 1, it means we should
        // apply i-th optimizing rule.
        if flag&(1<<uint(i)) == 0 {
            continue
        }
        logic, err = rule.optimize(logic)
        if err != nil {
            return nil, errors.Trace(err)
        }
    }
    return logic, errors.Trace(err)
}


Translate SQL into logical plan


select b from t1, t2 where t1.c = t2.c and t1.a > 5

  • Projection: select b
    • Selection: where t1.a > 5
      • Join: from t1, t2 where t1.c = t2.c
        • DataSource t1
        • DataSource t2

Example 2

select min(id) from t
  • Aggregation
    • TableScan
select id from t order by id asc limit 1
  • Limit
    • Sort
      • Projection
        • TableScan

Column pruning

  • Projection
func (p *LogicalProjection) PruneColumns(parentUsedCols []*expression.Column) {
	child := p.children[0]
	used := getUsedList(parentUsedCols, p.schema)
	for i := len(used) - 1; i >= 0; i-- {
		if !used[i] && !exprHasSetVar(p.Exprs[i]) {
			p.schema.Columns = append(p.schema.Columns[:i], p.schema.Columns[i+1:]...)
			p.Exprs = append(p.Exprs[:i], p.Exprs[i+1:]...)
		}
	}
	selfUsedCols := make([]*expression.Column, 0, len(p.Exprs))
	selfUsedCols = expression.ExtractColumnsFromExpressions(selfUsedCols, p.Exprs, nil)
	child.PruneColumns(selfUsedCols)
}
  • DataSource
// PruneColumns implements LogicalPlan interface.
func (ds *DataSource) PruneColumns(parentUsedCols []*expression.Column) {
	used := getUsedList(parentUsedCols, ds.schema)
	for i := len(used) - 1; i >= 0; i-- {
		if !used[i] {
			ds.schema.Columns = append(ds.schema.Columns[:i], ds.schema.Columns[i+1:]...)
			ds.Columns = append(ds.Columns[:i], ds.Columns[i+1:]...)
		}
	}
	for k, cols := range ds.schema.TblID2Handle {
		if ds.schema.ColumnIndex(cols[0]) == -1 {
			delete(ds.schema.TblID2Handle, k)
		}
	}
}
  • Aggregation
    • columns in group by
    • columns in the aggregation function
  • Sort
func (ls *LogicalSort) PruneColumns(parentUsedCols []*expression.Column) {
	child := ls.children[0]
	for i := len(ls.ByItems) - 1; i >= 0; i-- {
		cols := expression.ExtractColumns(ls.ByItems[i].Expr)
		if len(cols) == 0 {
			ls.ByItems = append(ls.ByItems[:i], ls.ByItems[i+1:]...)
		} else {
			parentUsedCols = append(parentUsedCols, expression.ExtractColumns(ls.ByItems[i].Expr)...)
		}
	}
        // add the used columns from parents to its own required list
	child.PruneColumns(parentUsedCols)
}
  • Selection
// PruneColumns implements LogicalPlan interface.
func (p *LogicalSelection) PruneColumns(parentUsedCols []*expression.Column) {
	child := p.children[0]
	parentUsedCols = expression.ExtractColumnsFromExpressions(parentUsedCols, p.Conditions, nil)
	child.PruneColumns(parentUsedCols)
}
  • Join
    • columns in the join

Remove unnecessary projection

  • Projection(A) -> Projection(A, B, C)
  • Aggregation(A) -> Projection(A, B, C)
// eliminate eliminates the redundant projection in a logical plan.
func (pe *projectionEliminater) eliminate(p LogicalPlan, replace map[string]*expression.Column, canEliminate bool) LogicalPlan {
	proj, isProj := p.(*LogicalProjection)
	childFlag := canEliminate
	if _, isUnion := p.(*LogicalUnionAll); isUnion {
		childFlag = false
	} else if _, isAgg := p.(*LogicalAggregation); isAgg || isProj {
		childFlag = true
	}
	for i, child := range p.Children() {
		p.Children()[i] = pe.eliminate(child, replace, childFlag)
	}

	switch x := p.(type) {
	case *LogicalJoin:
		x.schema = buildLogicalJoinSchema(x.JoinType, x)
	case *LogicalApply:
		x.schema = buildLogicalJoinSchema(x.JoinType, x)
	default:
		for _, dst := range p.Schema().Columns {
			resolveColumnAndReplace(dst, replace)
		}
	}
	p.replaceExprColumns(replace)

	if !(isProj && canEliminate && canProjectionBeEliminatedLoose(proj)) {
		return p
	}
	exprs := proj.Exprs
	for i, col := range proj.Schema().Columns {
		replace[string(col.HashCode())] = exprs[i].(*expression.Column)
	}
	return p.Children()[0]
}

How does TiDB translate an AST into execution plans

2024-05-10 00:00:00 +0000


// Compile compiles an ast.StmtNode to a physical plan.
func (c *Compiler) Compile(goCtx goctx.Context, stmtNode ast.StmtNode) (*ExecStmt, error) {
	// the schema is needed first, for validation and name binding
	infoSchema := GetInfoSchema(c.Ctx)
	plan.Preprocess(c.Ctx, stmtNode, infoSchema, false)

	// build and optimize query plan
	finalPlan, err := plan.Optimize(c.Ctx, stmtNode, infoSchema)
	if err != nil {
		return nil, err
	}

	return &ExecStmt{
		InfoSchema: infoSchema,
		Plan:       finalPlan,
		Expensive:  isExpensive, // isExpensive is computed elsewhere (elided)
		Cacheable:  plan.Cacheable(stmtNode),
		Text:       stmtNode.Text(),
		StmtNode:   stmtNode,
		Ctx:        c.Ctx,
	}, nil
}

// Optimize does optimization and creates a Plan.
// The node must be prepared first.
func Optimize(ctx context.Context, node ast.Node, is infoschema.InfoSchema) (Plan, error) {
	builder := &planBuilder{
		ctx:       ctx,
		is:        is,
		colMapper: make(map[*ast.ColumnNameExpr]int),
	}
	p := builder.build(node)

        //Assuming p is a LogicalPlan, which can be optimized
	return doOptimize(builder.optFlag, p)
}


func (b *planBuilder) build(node ast.Node) Plan {
	b.optFlag = flagPrunColumns
	switch x := node.(type) {
	case *ast.InsertStmt:
		return b.buildInsert(x)
	case *ast.SelectStmt:
		return b.buildSelect(x)
	}
	b.err = ErrUnsupportedType.Gen("Unsupported type %T", node)
	return nil
}

/*

Hierarchy: ResultSet <- GroupBy <- Selection <- Aggregation <- Projection

ResultSetNode interface has a ResultFields property, represents a Node that returns result set. Implementations include SelectStmt, SubqueryExpr, TableSource, TableName and Join.

*/



func (b *planBuilder) buildSelect(sel *ast.SelectStmt) LogicalPlan {
	var (
		p        LogicalPlan
		aggFuncs []*ast.AggregateFuncExpr
		totalMap map[*ast.AggregateFuncExpr]int
		gbyCols  []expression.Expression
	)
        //base plan node
	p = b.buildResultSetNode(sel.From.TableRefs)
	originalFields := sel.Fields.Fields
        //add children
	p, gbyCols = b.resolveGbyExprs(p, sel.GroupBy, sel.Fields.Fields)

        //add children
	p = b.buildSelection(p, sel.Where, nil)

	aggFuncs, totalMap = b.extractAggFuncs(sel.Fields.Fields)
	var aggIndexMap map[int]int
        //add children
	p, aggIndexMap = b.buildAggregation(p, aggFuncs, gbyCols)
	for k, v := range totalMap {
		totalMap[k] = aggIndexMap[v]
	}
	var oldLen int

        //add children
	p, oldLen = b.buildProjection(p, sel.Fields.Fields, totalMap)
	p = b.buildDistinct(p, oldLen)
	p = b.buildSort(p, sel.OrderBy.Items, orderMap) // orderMap resolution elided
	p = b.buildLimit(p, sel.Limit)
	sel.Fields.Fields = originalFields
	if oldLen != p.Schema().Len() {
		proj := LogicalProjection{Exprs: expression.Column2Exprs(p.Schema().Columns[:oldLen])}.init(b.ctx)
		proj.SetChildren(p)
		schema := expression.NewSchema(p.Schema().Clone().Columns[:oldLen]...)
		for _, col := range schema.Columns {
			col.FromID = proj.ID()
		}
		proj.SetSchema(schema)
		return proj
	}

	return p
}

func (b *planBuilder) buildResultSetNode(node ast.ResultSetNode) LogicalPlan {
	switch x := node.(type) {
	case *ast.Join:
		return b.buildJoin(x)
	case *ast.TableSource:
		var p LogicalPlan
		switch v := x.Source.(type) {
		case *ast.SelectStmt:
			p = b.buildSelect(v)
		case *ast.TableName:
			p = b.buildDataSource(v)
		}
		if v, ok := p.(*DataSource); ok {
			v.TableAsName = &x.AsName
		}
		for _, col := range p.Schema().Columns {
			col.OrigTblName = col.TblName
			if x.AsName.L != "" {
				col.TblName = x.AsName
				col.DBName = model.NewCIStr("")
			}
		}
		// Duplicate column name in one table is not allowed.
		// "select * from (select 1, 1) as a;" is duplicate
		dupNames := make(map[string]struct{}, len(p.Schema().Columns))
		for _, col := range p.Schema().Columns {
			name := col.ColName.O
			if _, ok := dupNames[name]; ok {
				b.err = ErrDupFieldName.GenByArgs(name)
				return nil
			}
			dupNames[name] = struct{}{}
		}
		return p
	case *ast.SelectStmt:
		return b.buildSelect(x)
	}
	return nil
}



// buildProjection returns a Projection plan and non-aux columns length.
func (b *planBuilder) buildProjection(p LogicalPlan, fields []*ast.SelectField, mapper map[*ast.AggregateFuncExpr]int) (LogicalPlan, int) {
	b.optFlag |= flagEliminateProjection
	b.curClause = fieldList
	proj := LogicalProjection{Exprs: make([]expression.Expression, 0, len(fields))}.init(b.ctx)
	schema := expression.NewSchema(make([]*expression.Column, 0, len(fields))...)
	oldLen := 0
	for _, field := range fields {
		newExpr, np, err := b.rewrite(field.Expr, p, mapper, true)
		if err != nil {
			b.err = err
			return nil, 0
		}
		p = np
		proj.Exprs = append(proj.Exprs, newExpr)

		col := b.buildProjectionField(proj.id, schema.Len()+1, field, newExpr)
		schema.Append(col)

		if !field.Auxiliary {
			oldLen++
		}
	}
	proj.SetSchema(schema)
	proj.SetChildren(p)
	return proj, oldLen
}

func (b *planBuilder) buildSelection(p LogicalPlan, where ast.ExprNode, AggMapper map[*ast.AggregateFuncExpr]int) LogicalPlan {
	b.optFlag = b.optFlag | flagPredicatePushDown
	if b.curClause != havingClause {
		b.curClause = whereClause
	}
	conditions := splitWhere(where)
	expressions := make([]expression.Expression, 0, len(conditions))
	selection := LogicalSelection{}.init(b.ctx)
	for _, cond := range conditions {
		expr, np, err := b.rewrite(cond, p, AggMapper, false)
		if err != nil {
			b.err = err
			return nil
		}
		p = np
		if expr == nil {
			continue
		}
		cnfItems := expression.SplitCNFItems(expr)
		for _, item := range cnfItems {
			expressions = append(expressions, item)
		}
	}
	selection.Conditions = expressions
	selection.SetChildren(p)
	return selection
}

    // RecordSet is an abstract result set interface to help get data from Plan.
    type RecordSet interface {
        // Fields gets result fields.
        Fields() []*ResultField
        // Next returns the next row, nil row means there is no more to return.
        Next(ctx context.Context) (row types.Row, err error)
        // NextChunk reads records into chunk.
        NextChunk(ctx context.Context, chk *chunk.Chunk) error
        // NewChunk creates a new chunk with initial capacity.
        NewChunk() *chunk.Chunk
        // SupportChunk check if the RecordSet supports Chunk structure.
        SupportChunk() bool
        // Close closes the underlying iterator, call Next after Close will
        // restart the iteration.
        Close() error
    }


// Exec builds an Executor from a plan. If the Executor doesn't return result,
// like the INSERT, UPDATE statements, it executes in this function, if the Executor returns
// result, execution is done after this function returns, in the returned ast.RecordSet Next method.
func (a *ExecStmt) Exec(goCtx goctx.Context) (ast.RecordSet, error) {
	ctx := a.Ctx
	e := a.buildExecutor(ctx)

	e.Open(goCtx)

	var pi processinfoSetter
	if raw, ok := ctx.(processinfoSetter); ok {
		pi = raw
		sql := a.OriginText()
		// Update processinfo, ShowProcess() will use it.
		pi.SetProcessInfo(sql)
	}

	return &recordSet{
		executor:    e,
		stmt:        a,
		processinfo: pi,
		txnStartTS:  ctx.Txn().StartTS(),
	}, nil
}

type SelectStmt struct {
        dmlNode
        resultSetNode
        // SelectStmtOpts wraps around select hints and switches.
        *SelectStmtOpts
        // Distinct represents whether the select has distinct option.
        Distinct bool
        // From is the from clause of the query.
        From *TableRefsClause
        // Where is the where clause in select statement.
        Where ExprNode
        // Fields is the select expression list.
        Fields *FieldList
        // GroupBy is the group by expression list.
        GroupBy *GroupByClause
        // Having is the having condition.
        Having *HavingClause
        // OrderBy is the ordering expression list.
        OrderBy *OrderByClause
        // Limit is the limit clause.
        Limit *Limit
        // LockTp is the lock type
        LockTp SelectLockType
        // TableHints represents the level Optimizer Hint
        TableHints []*TableOptimizerHint
}

// LogicalSelection represents a where or having predicate.
type LogicalSelection struct {
	baseLogicalPlan

	// Originally the WHERE or ON condition is parsed into a single expression,
	// but after we converted to CNF(Conjunctive normal form), it can be
	// split into a list of AND conditions.
	Conditions []expression.Expression
}

Software Engineering at Google

2024-03-10 00:00:00 +0000

Notes from Software Engineering at Google

Differences between programming and software engineering

Time

  • On a software engineering project, engineers need to be more concerned with the passage of time and the eventual need for change.
Should I upgrade?
  • When you are fundamentally incapable of reacting to a change in underlying technology or product direction, you’re placing a high-risk bet on the hope that such a change never becomes critical.
  • For any project that didn’t plan for upgrades from the start, that transition is likely very painful.
  • You’re performing a task that hasn’t yet been done for this project; more hidden assumptions have been baked-in.
  • The engineers trying to do the upgrade are less likely to have experience in this sort of task.
  • The size of the upgrade is often larger than usual, doing several years’ worth of upgrades at once instead of a more incremental upgrade. And thus, after actually going through such an upgrade once (or giving up part way through), it’s pretty reasonable to overestimate the cost of doing a subsequent upgrade and decide “Never again.” Companies that come to this conclusion end up committing to just throwing things out and rewriting their code, or deciding to never upgrade again.
  • Expect the first upgrade for a codebase to be significantly more expensive than later upgrades, even controlling for other factors.

Scale

Hyrum’s Law

  • With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody.
  • We cannot assume perfect adherence to published contracts or best practices.
  • In practice, the complexity and difficulty of a given change also depends on how useful a user finds some observable behavior of your API. If users cannot depend on such things, your API will be easy to change. Given enough time and enough users, even the most innocuous change will break something;

How to upgrade

A new Widget has been developed. The decision is made that everyone should use the new one and stop using the old one.

Not scalable
  • To motivate this, project leads say “We’ll delete the old Widget on August 15th; make sure you’ve converted to the new Widget.”
  • Teams depend on an ever-increasing number of Widgets, and a single build break can affect a growing percentage of the company.
  • Forcing users to respond to churn means that every affected team does a worse job ramping up, solves their immediate problem, and then throws away that now-useless knowledge.
Scalable
  • Instead of pushing migration work to customers, teams can internalize it themselves, with all the economies of scale that provides.
  • Infrastructure teams must do the work to move their internal users to new versions themselves or do the update in place, in backward-compatible fashion.
  • Dependent projects are no longer spending progressively greater effort just to keep up.
  • Having a dedicated group of experts execute the change scales better than asking for more maintenance effort from every user: experts spend some time learning the whole problem in depth and then apply that expertise to every subproblem.
  • “If a product experiences outages or other problems as a result of infrastructure changes, but the issue wasn’t surfaced by tests in our Continuous Integration (CI) system, it is not the fault of the infrastructure change.”
    • “If you liked it, you should have put a CI test on it,”
    • Complicated, one-off bespoke tests that aren’t triggered by our common CI system do not count. Without this, an engineer on an infrastructure team could conceivably need to track down every team with any affected code and ask them how to run their tests.

Trade-offs

  • As software engineers, we are asked to make more complex decisions with higher-stakes outcomes, often based on imprecise estimates of time and growth.
  • We might sometimes defer maintenance changes, or even embrace policies that don’t scale well, with the knowledge that we’ll need to revisit those decisions. Those choices should be explicit and clear about the deferred costs.

Make decisions

  • It is important for there to be a decider for any topic and clear escalation paths when decisions seem to be wrong, but the goal is consensus, not unanimity. It’s fine and expected to see some instances of “I don’t agree with your metrics/valuation, but I see how you can come to that conclusion.”
  • Some of the quantities are subtle, or we don’t know how to measure them. Sometimes this manifests as “We don’t know how much engineer-time this will take.” Sometimes it is even more nebulous: how do you measure the engineering cost of a poorly designed API? Or the societal impact of a product choice?

How to Work Well on Teams

  • Ensuring that there is at least good documentation in addition to a primary and a secondary owner for each area of responsibility
  • Projects run into unpredictable design obstacles or political hazards, or we simply discover that things aren’t working as planned. Requirements morph unexpectedly. How do you get that feedback loop so that you know the instant your plans or designs need to change? Answer: by working in a team. Most engineers know the quote, “Many eyes make all bugs shallow,” but a better version might be, “Many eyes make sure your project stays relevant and on track.”
  • Group teams of four to eight people together in small rooms (or large offices) to make it easy (and non-embarrassing) for spontaneous conversation to happen.

Lose the ego

  • If you perform a root-cause analysis on almost any social conflict, you can ultimately trace it back to a lack of humility, respect, and/or trust.
  • Even if you know you’re the wisest person in the discussion, don’t wave it in people’s faces. For example, do you always feel like you need to have the first or last word on every subject? Do you feel the need to comment on every detail in a proposal or discussion?
  • “The appearance of conforming gets you a long way.” If you chose to assert your ego in any number of ways, “I am going to do it my way,” you pay a small steady price throughout the whole of your professional career. And this, over a whole lifetime, adds up to an enormous amount of needless trouble.

Learn to give and take criticism

  • For example, if you have an insecure collaborator, here’s what not to say: “Man, you totally got the control flow wrong on that method there. You should be using the standard xyzzy code pattern like everyone else.” This feedback is full of antipatterns: you’re telling someone they’re “wrong” (as if the world were black and white), demanding they change something, and accusing them of creating something that goes against what everyone else is doing (making them feel stupid). Your coworker will immediately be put on the offense, and their response is bound to be overly emotional.
  • A better way to say the same thing might be, “Hey, I’m confused by the control flow in this section here. I wonder if the xyzzy code pattern might make this clearer and easier to maintain?”

Knowledge sharing

  • Engineers have a tendency to reach for “this is bad!” far more quickly than is often warranted, especially for unfamiliar code, languages, or paradigms
  • The first time you learn something is the best time to see ways that the existing documentation and training materials can be improved

Challenges to Learning

A group of people that is split between people who know “everything” and novices, with little middle ground. This problem often reinforces itself if experts always do everything themselves and don’t take the time to develop new experts through mentoring or documentation. In this scenario, knowledge and responsibilities continue to accumulate on those who already have expertise, and new team members or novices are left to fend for themselves and ramp up more slowly.

Tribal and written knowledge complement each other. Even a perfectly expert team with perfect documentation needs to communicate with one another, coordinate with other teams, and adapt their strategies over time. No single knowledge-sharing approach is the correct solution for all types of learning,

Psychological Safety in Large Groups

  • No “well-actuallys”
    • Pedantic corrections that tend to be about grandstanding rather than precision.
  • No back-seat driving
    • Interrupting an existing discussion to offer opinions without committing to the conversation.