Part V: Probability and Statistics
The first 100 or 1000 flips do affect the sample mean. But those flips will not affect its limit, because you are dividing by $N \to \infty$.
F(x) is the cumulative distribution and its derivative p(x) is the probability density function. You could say that p(x) dx is the probability of a sample falling in between x and x + dx.
$E[x] = \int x\,p(x)\,dx \qquad \sigma^2 = E[(x-m)^2] = \int (x-m)^2\,p(x)\,dx$
Normal distribution
Central Limit Theorem. Standard normal distribution N(0, 1).
$ F(\sigma) - F(-\sigma) \approx 2/3 $ $ F(2\sigma) - F(-2\sigma) \approx 0.95 $
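A quick numerical check of those two values, built only from the standard library (a minimal sketch; `math.erf` gives the standard normal CDF):

```python
# Standard normal CDF from the error function: F(x) = (1 + erf(x/sqrt(2))) / 2.
from math import erf, sqrt

def F(x):
    """Cumulative distribution function of N(0, 1)."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

print(F(1) - F(-1))   # ~0.683, about 2/3 within one sigma
print(F(2) - F(-2))   # ~0.954, about 95% within two sigmas
```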
The “binomial” probabilities count the number of heads in N coin flips. As N grows, the binomial distribution approaches the normal distribution.
Stirling’s formula: $n! \approx \sqrt{2\pi n}\,(n/e)^n$
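A small check that Stirling's approximation tightens as n grows (standard library only):

```python
# Ratio n! / (sqrt(2*pi*n) * (n/e)^n) approaches 1 as n increases.
from math import factorial, sqrt, pi, e

for n in (5, 10, 20, 50):
    stirling = sqrt(2 * pi * n) * (n / e) ** n
    print(n, factorial(n) / stirling)
```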
Monte Carlo Estimation Methods
accepting uncertainty in the inputs and estimating the variance in the outputs. The error in approximating E[x] is usually of order $1/\sqrt{N}$.
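A minimal Monte Carlo sketch of that $1/\sqrt{N}$ behaviour, using uniform samples on [0, 1] (true mean 0.5) as a stand-in input distribution:

```python
# The error in the sample mean shrinks roughly like 1/sqrt(N).
import numpy as np

rng = np.random.default_rng(0)
for N in (100, 10_000, 1_000_000):
    samples = rng.random(N)                    # uniform on [0, 1], E[x] = 0.5
    error = abs(samples.mean() - 0.5)
    print(f"N = {N:>9}   error = {error:.5f}   1/sqrt(N) = {1/np.sqrt(N):.5f}")
```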
2-level Monte Carlo
Probability Distributions
Binomial: Tossing a coin n times
Poisson
Rare events with $p \to 0$ and $n$ trials, with expected $\lambda = np$ successes. What is the probability of $k$ successes in $n$ trials?
we wait a long time, or we include a large population, to produce a decent chance that one or more events will actually occur. Poisson is not for the probability that you will be struck by lightning: say one in a million over your lifetime. It is for the probability that someone in a city of 100,000 will be struck.
Poisson assumes independent events. One of the most difficult aspects of applied probability is to estimate the dependence of one event on other events.
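A quick comparison of Binomial(n, p) with small p against its Poisson limit with $\lambda = np$ (the values of n and p are chosen only for illustration):

```python
# P(k successes) under the binomial and under its Poisson limit with lambda = n*p.
from math import comb, exp, factorial

n, p = 100_000, 1e-5           # many trials, rare events
lam = n * p                    # expected number of successes (here 1.0)

for k in range(5):
    binom = comb(n, k) * p**k * (1 - p)**(n - k)
    poisson = exp(-lam) * lam**k / factorial(k)
    print(k, round(binom, 6), round(poisson, 6))
```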
Exponential
The exponential distribution is continuous (not discrete). It describes the waiting time in a Poisson process. How long until lightning strikes your city? The key assumption is that the waiting time is independent of the time you have already waited.
The future is independent of the past. Is this true for a television set to fail? Is it true for a computer to fail? If failure depends on a random disaster, the waiting time has an exponential distribution: the failure rate $\lambda$ is constant. If failure comes from slow decay, the independence requirement will not hold.
The exponential distribution has no memory: the probability of waiting at least y hours more for a table is unfortunately not affected by already having waited x hours.
If failures only occur at integer times t = 0, 1, 2, 3, … then the exponential distribution (which has continuous time) is replaced by the geometric distribution (discrete time).
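A numerical check of the no-memory property, sampling exponential waiting times with rate $\lambda$ (the rate and the cutoffs x, y are arbitrary choices for the demonstration):

```python
# P(T > x + y | T > x) should equal P(T > y) = exp(-lam*y), whatever x is.
import numpy as np

rng = np.random.default_rng(1)
lam = 0.5
T = rng.exponential(scale=1 / lam, size=1_000_000)

x, y = 2.0, 3.0
conditional = (T > x + y).sum() / (T > x).sum()
print(conditional, np.exp(-lam * y))   # both close to exp(-lam*y)
```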
Log-normal
When x has a normal distribution, the exponential $e^x$ has a log-normal distribution. In other words, the distribution of x is log-normal when the distribution of y = log(x) is normal.
The log-normal distribution appears when you take the product of many positive sample values. The arithmetic mean for normal compares with the geometric mean for log-normal.
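A small sketch of that product rule: the log of a product of many positive factors is a sum of logs, so the product is approximately log-normal and its natural average is the geometric mean (the factor distribution below is an arbitrary choice):

```python
# Products of positive samples: log(product) is a sum, hence approximately normal.
import numpy as np

rng = np.random.default_rng(2)
factors = rng.uniform(0.5, 1.5, size=(100_000, 50))   # 50 positive factors per product
products = factors.prod(axis=1)

logs = np.log(products)                      # roughly normal by the CLT
print("mean of log(product):", logs.mean())
print("geometric mean of the product:", np.exp(logs.mean()))
```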
Chi-squared
If x has the standard N(0, 1) distribution, then $x^2$ is chi-squared.
If $x_1, \dots, x_n$ are independent N(0, 1), then $x_1^2 + \dots + x_n^2$ has the chi-squared distribution with n degrees of freedom.
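A quick simulation consistent with that statement: the sum of squares of n standard normals has mean n and variance 2n, as the chi-squared distribution with n degrees of freedom requires:

```python
# Mean and variance of x_1^2 + ... + x_n^2 for independent N(0, 1) samples.
import numpy as np

rng = np.random.default_rng(3)
n = 5
x = rng.standard_normal(size=(200_000, n))
chi2 = (x ** 2).sum(axis=1)

print(chi2.mean(), chi2.var())   # close to n = 5 and 2n = 10
```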
Multivariable Gaussian
When the variables are not independent, like stocks and bonds, the covariance matrix C is symmetric and positive definite (in an extreme case, semidefinite). Then the joint distribution of the Gaussian variables is “multivariate Gaussian”.
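A minimal sketch of drawing samples with a chosen covariance C: multiply independent N(0, 1) samples by the Cholesky factor of C (the particular C below is just an example of a dependent pair):

```python
# Multivariate Gaussian sampling: x = L z with C = L L^T and z independent N(0, 1).
import numpy as np

rng = np.random.default_rng(4)
C = np.array([[2.0, 0.8],
              [0.8, 1.0]])           # symmetric positive definite covariance
L = np.linalg.cholesky(C)            # C = L @ L.T

z = rng.standard_normal((100_000, 2))
x = z @ L.T                          # correlated Gaussian samples

print(np.cov(x, rowvar=False))       # recovers C approximately
```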
Moments, Cumulants, and Inequalities
Suppose we know $E[X]$; we want to bound $\Pr(X \ge a)$.
Markov’s inequality: if $X \ge 0$, then $\Pr(X \ge a) \le E[X]/a$.
Chebyshev’s inequality: $\Pr(|X - m| \ge a) \le \sigma^2/a^2$. The proof is to apply Markov’s inequality to the nonnegative variable $(X - m)^2$.
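A numeric sanity check of both bounds for an exponential variable with mean 1 and variance 1 (an arbitrary test case):

```python
# Markov:    P(X >= a) <= E[X]/a          (X >= 0)
# Chebyshev: P(|X - m| >= a) <= sigma^2/a^2
import numpy as np

rng = np.random.default_rng(5)
X = rng.exponential(scale=1.0, size=1_000_000)   # mean 1, variance 1
a = 3.0

print((X >= a).mean(), "<=", 1.0 / a)                   # Markov
print((np.abs(X - 1.0) >= a).mean(), "<=", 1.0 / a**2)  # Chebyshev
```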
Connecting a list of numbers to a (generating) function $f(x) = \sum a_n x^n$
Moments and Central Moments
$m_n =E[x^n]$
central moments (around m) $\mu_n=E[(x-m)^n]$
The normalized third central moment is the skewness of the distribution.
We generally use the fourth moment of a normal distribution as a basis for comparison. The kurtosis of a distribution is then $\kappa = \mu_4/\sigma^4 - 3$, which is zero for the normal.
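A short sketch computing skewness and kurtosis directly from the central moments, assuming the excess-kurtosis convention $\kappa = \mu_4/\sigma^4 - 3$ so the normal scores zero:

```python
# Skewness mu_3/sigma^3 and excess kurtosis mu_4/sigma^4 - 3 from samples.
import numpy as np

def skew_kurtosis(x):
    m, sigma = x.mean(), x.std()
    mu3 = ((x - m) ** 3).mean()
    mu4 = ((x - m) ** 4).mean()
    return mu3 / sigma**3, mu4 / sigma**4 - 3

rng = np.random.default_rng(6)
print(skew_kurtosis(rng.standard_normal(1_000_000)))    # ~ (0, 0) for the normal
print(skew_kurtosis(rng.exponential(size=1_000_000)))   # ~ (2, 6) for the exponential
```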
Generating Functions and Cumulants
Probability generating function
Characteristic function $\phi(t) = E[e^{itx}]$. The characteristic function is used to prove the central limit theorem.
Moment generating function $M(t) = E[e^{tx}]$
Cumulant generating function K(t)
Each function captures an infinite set of coefficients. The key point of cumulants is the connection $K(t) = \log M(t)$: for independent random variables, the cumulant generating functions (and so the cumulants) simply add.
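A numerical illustration of that additivity for two independent normals, where $K(t) = mt + \sigma^2 t^2/2$ is known exactly (the means, variances, and the value of t below are arbitrary test choices):

```python
# Empirical K(t) = log M(t) = log E[exp(t*x)]; for independent x, y the K's add.
import numpy as np

rng = np.random.default_rng(7)
t = 0.3
x = rng.normal(1.0, 2.0, size=2_000_000)    # mean 1,    sigma 2
y = rng.normal(-0.5, 1.0, size=2_000_000)   # mean -0.5, sigma 1

K = lambda samples: np.log(np.mean(np.exp(t * samples)))
print(K(x) + K(y))                           # sum of the two cumulant functions
print(K(x + y))                              # K of the independent sum
print((1.0 - 0.5) * t + (4.0 + 1.0) * t**2 / 2)   # exact: (m1+m2)t + (s1^2+s2^2)t^2/2
```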
Chernoff’s Inequality for Sums
The mean value of a sum is the sum of the means. If those variables are independent, then the variance of the sum is also the sum of the variances.
Chebyshev for a sum
Markov and Chebyshev Inequalities for Matrices
an essential step in the statistics of deep learning is to develop inequalities for the eigenvalues and the trace of random matrices.
Markov’s inequality for matrices. Suppose X is a positive semidefinite (or definite) random matrix and A is any positive definite matrix. Then
Prob {A − X is not positive semidefinite} $\le$ trace of $E[X]\,A^{-1}$
Chebyshev’s inequality for a random matrix X with mean 0:
Matrix Chernoff Inequalities
Erdős–Rényi Random Graphs
Each edge is present with probability p. Then the question is: for which p is the graph likely to be connected?
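A small experiment around the classical connectivity threshold $p \approx (\log n)/n$ (this sketch assumes the networkx package; n, the number of trials, and the factors are arbitrary):

```python
# Fraction of G(n, p) samples that are connected, for p near (log n)/n.
import numpy as np
import networkx as nx

n, trials = 200, 200
threshold = np.log(n) / n

for factor in (0.5, 1.0, 1.5, 2.0):
    p = factor * threshold
    connected = sum(
        nx.is_connected(nx.erdos_renyi_graph(n, p, seed=s)) for s in range(trials)
    )
    print(f"p = {p:.3f} ({factor} x threshold): connected in {connected}/{trials} trials")
```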
Covariance Matrices and Joint Probabilities
Prove that independent experiments have covariance 0
Max covariance = $\sigma_1\sigma_2$ when the two variables are fully dependent
U is the column vector of deviations from the means; each $p_{ij}UU^T$ is positive semidefinite. So the whole matrix V (the sum of those rank-1 matrices) is at least positive semidefinite.
The covariance matrix V is positive definite unless the experiments are dependent.
$V = E[(X-E[X])(X-E[X])^T] = \sum p_{ij}\,(X-E[X])(X-E[X])^T$
Variance of $c^TX$ = $c^TVc$
Diagonalizing the covariance matrix $V=Q\Lambda Q^T$ means finding M independent experiments as combinations of the original M experiments.
Another proof that the covariance matrix V is positive semidefinite: V is the sum, over every combination of outcomes, of its joint probability times the positive semidefinite matrix $(X-E[X])(X-E[X])^T$.
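A tiny concrete example of that sum: a 2-by-2 joint distribution (probabilities chosen arbitrarily), with V built term by term from the rank-1 matrices and its eigenvalues confirming positive semidefiniteness:

```python
# V = sum over outcomes of p_ij * (X - E[X])(X - E[X])^T for a discrete joint pmf.
import numpy as np

values = [0.0, 1.0]                       # possible outcomes of x and of y
p = np.array([[0.3, 0.2],                 # joint probabilities p_ij (sum to 1)
              [0.1, 0.4]])

mean = np.zeros(2)
for i, xi in enumerate(values):
    for j, yj in enumerate(values):
        mean += p[i, j] * np.array([xi, yj])

V = np.zeros((2, 2))
for i, xi in enumerate(values):
    for j, yj in enumerate(values):
        d = np.array([xi, yj]) - mean
        V += p[i, j] * np.outer(d, d)     # each term is rank-1 and PSD

print(V)
print(np.linalg.eigvalsh(V))              # eigenvalues >= 0, so V is PSD
```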
The value of the expectation symbol E is that it also allows pdf’s: probability density functions like p(x, y, z) for continuous random variables x and y and z.
The covariance matrix of $Z =AX$ is $ V_Z=AV_XA^T$
Suppose x and y are independent random variables with mean 0 and variance 1. What is the covariance matrix of Z = (x, y, ax + by)?
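A worked answer using $V_Z = A V_X A^T$, with $V_X = I$ since x and y are independent with variance 1 (the values of a and b are arbitrary):

```python
# Covariance matrix of Z = (x, y, a*x + b*y): V_Z = A V_X A^T with V_X = I.
import numpy as np

a, b = 2.0, 3.0
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [a,   b  ]])
V_X = np.eye(2)                 # independent x, y with variance 1

V_Z = A @ V_X @ A.T
print(V_Z)                      # [[1, 0, a], [0, 1, b], [a, b, a^2 + b^2]]
```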
Correlation
Normalize the random variable x to have variance 1: $X = x/\sigma_x$
The correlation of x and y is the covariance of X and Y
Correlation matrix R
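A small sketch going from a covariance matrix V to the correlation matrix R by dividing each entry by $\sigma_i \sigma_j$ (the matrix V below is an arbitrary example):

```python
# R = D^{-1/2} V D^{-1/2}, where D holds the variances from the diagonal of V.
import numpy as np

V = np.array([[4.0, 1.2],
              [1.2, 9.0]])          # covariance matrix: sigma_x = 2, sigma_y = 3
d = np.sqrt(np.diag(V))             # standard deviations
R = V / np.outer(d, d)              # correlation matrix, ones on the diagonal
print(R)
```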
GPS Example
Distances from four or more satellites will pinpoint the receiver position (using least squares!). The speed of light changes in the ionosphere, but the correction will be almost the same for all nearby receivers. If one receiver stays in a known position, we can take differences from that one. Differential GPS reduces the error variance by fixing one receiver.
Markov chains
$y_{n+1}=Py_n$
Find the eigenvalues of a square matrix through the determinant: solve $\det(P - \lambda I) = 0$.
The determinant of a square matrix can be seen as a measure of the volume spanned by its row or column vectors. If the determinant is zero, it means these vectors are linearly dependent, and thus, the volume is zero. In geometric terms, it indicates that the transformation described by the matrix flattens the space into a lower dimension.
The determinant of a matrix is calculated from a specific combination of its elements, reflecting how rows or columns interact. If these are linearly dependent (i.e., one row or column can be expressed as a combination of others), then they don’t span an “n-dimensional volume” in the space, and the determinant becomes zero.
The determinant can be viewed as a scaling factor for the volume of the unit cube under the linear transformation represented by the matrix: for a 2x2 matrix, it corresponds to the area of the parallelogram spanned by its columns (or rows).
From an algebraic standpoint, if there are linearly dependent rows or columns, the matrix can be row-reduced to a form where one entire row (or column) becomes zero, leading the determinant to be zero, as any matrix with a zero row has a zero determinant.
steady state with eigenvalue 1
$P=X\Lambda X^{-1}$; $P^\infty$ has the same split in both columns.
The requirements for a positive Markov matrix: all $p_{ij} > 0$ and each column of P adds to 1. Then P has largest eigenvalue $\lambda_1 = 1$ and eigenvector $x_1 > 0$.
Transition Probabilities
$p_{ij}$ = probability that $x(n+1) = i$ if $x(n) = j$. $p_{ij}$ does not depend on n.
Finite Markov Chain, Infinite State Markov Chain, Continuous Markov Chain
1 is an eigenvalue, because the rows of P − I add to the zero row. And no entry of P is negative.
Perron-Frobenius Theorem: all $p_{ij} > 0$ guarantees success. For A > 0, all numbers in $Ax=\lambda_{\max}x$ are strictly positive.
If $|\lambda_2| < 1$, then $P^n$ converges. The matrix P might not have n independent eigenvectors, in which case we can’t diagonalize P.
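A minimal 2-by-2 example (a made-up positive Markov matrix) showing the eigenvalue 1, the second eigenvalue below 1 in magnitude, and $P^n$ converging to a matrix with the steady state in both columns:

```python
# Columns of P add to 1 and all entries are positive, so lambda_1 = 1.
import numpy as np

P = np.array([[0.8, 0.3],
              [0.2, 0.7]])

eigvals, eigvecs = np.linalg.eig(P)
print(eigvals)                               # 1.0 and 0.5, so |lambda_2| < 1

steady = eigvecs[:, np.isclose(eigvals, 1)].ravel()
steady = steady / steady.sum()               # normalize to a probability vector
print(steady)                                # [0.6, 0.4]

print(np.linalg.matrix_power(P, 50))         # both columns approach the steady state
```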
Gambler’s Ruin
when a player has won all the money, the game is over and there is no way to leave either of those states. With those two steady states we must expect eigenvalue 1 twice.
Shape of P. What are the four eigenvalues of P? What is the probability that the game will continue forever with no winner?
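A sketch under an assumed setup (the exact matrix is not written out here): total fortune 3, a fair coin, states 0..3 dollars with 0 and 3 absorbing, and columns of P adding to 1. The four eigenvalues come out as 1, 1, and ±1/2, and the probability of playing forever is 0 because the transient eigenvalues are below 1 in magnitude:

```python
# Gambler's ruin with absorbing states 0 and 3 (assumed fair-coin example).
import numpy as np

P = np.array([[1.0, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.5, 1.0]])

print(np.linalg.eigvals(P))                # 1, 1 (two absorbing states) and +-0.5

start = np.array([0.0, 1.0, 0.0, 0.0])     # start with 1 dollar
after = np.linalg.matrix_power(P, 100) @ start
print(after)                               # essentially no mass left in states 1, 2
```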