Part V: Probability and Statistics
The first 100 or 1000 flips do affect the sample mean. But those flips will not affect its limit, because you are dividing by $N \to \infty$.
F(x) is the cumulative distribution and its derivative p(x) is the probability density function. You could say that p(x) dx is the probability of a sample falling in between x and x + dx.
$E[x] = \int x\,p(x)\,dx \qquad \sigma^2 = E[(x-m)^2] = \int (x-m)^2\,p(x)\,dx$
Normal distribution
Central Limit Theorem. Standard normal distribution N(0, 1).
$ F(\sigma) - F(-\sigma) \approx 2/3 $ $ F(2\sigma) - F(-2\sigma) \approx 0.95 $
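A quick numerical check of those two values, built only from the standard library (a minimal sketch; `math.erf` gives the standard normal CDF):

```python
# Standard normal CDF from the error function: F(x) = (1 + erf(x/sqrt(2))) / 2.
from math import erf, sqrt

def F(x):
    """Cumulative distribution function of N(0, 1)."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

print(F(1) - F(-1))   # ~0.683, about 2/3 within one sigma
print(F(2) - F(-2))   # ~0.954, about 95% within two sigmas
```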
The “binomial” probabilities count the number of heads in N coin flips. As N grows, the binomial distribution approaches the normal distribution.
Stirling’s formula: $n! \approx \sqrt{2\pi n}\,(n/e)^n$
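A small check that Stirling's approximation tightens as n grows (standard library only):

```python
# Ratio n! / (sqrt(2*pi*n) * (n/e)^n) approaches 1 as n increases.
from math import factorial, sqrt, pi, e

for n in (5, 10, 20, 50):
    stirling = sqrt(2 * pi * n) * (n / e) ** n
    print(n, factorial(n) / stirling)
```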
Monte Carlo Estimation Methods
accepting uncertainty in the inputs and estimating the variance in the outputs. The error in approximating E[x] is usually of order $1/\sqrt{N}$.
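A minimal Monte Carlo sketch of that $1/\sqrt{N}$ behaviour, using uniform samples on [0, 1] (true mean 0.5) as a stand-in input distribution:

```python
# The error in the sample mean shrinks roughly like 1/sqrt(N).
import numpy as np

rng = np.random.default_rng(0)
for N in (100, 10_000, 1_000_000):
    samples = rng.random(N)                    # uniform on [0, 1], E[x] = 0.5
    error = abs(samples.mean() - 0.5)
    print(f"N = {N:>9}   error = {error:.5f}   1/sqrt(N) = {1/np.sqrt(N):.5f}")
```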
2-level Monte Carlo
Probability Distributions
Binomial: Tossing a coin n times
Poisson
Rare events with $p \to 0$ and $n$ trials, with expected $\lambda = np$ successes. What is the probability of $k$ successes in $n$ trials?
we wait a long time, or we include a large population, to produce a decent chance that one or more events will actually occur. Poisson is not for the probability that you will be struck by lightning: say one in a million over your lifetime. It is for the probability that someone in a city of 100,000 will be struck.
Poisson assumes independent events. One of the most difficult aspects of applied probability is to estimate the dependence of one event on other events.
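A quick comparison of Binomial(n, p) with small p against its Poisson limit with $\lambda = np$ (the values of n and p are chosen only for illustration):

```python
# P(k successes) under the binomial and under its Poisson limit with lambda = n*p.
from math import comb, exp, factorial

n, p = 100_000, 1e-5           # many trials, rare events
lam = n * p                    # expected number of successes (here 1.0)

for k in range(5):
    binom = comb(n, k) * p**k * (1 - p)**(n - k)
    poisson = exp(-lam) * lam**k / factorial(k)
    print(k, round(binom, 6), round(poisson, 6))
```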
Exponential
The exponential distribution is continuous (not discrete). It describes the waiting time in a Poisson process. How long until lightning strikes your city? The key assumption is that the waiting time is independent of the time you have already waited.
The future is independent of the past. Is this true for a television set to fail? Is it true for a computer to fail? If failure depends on a random disaster, the waiting time has an exponential distribution: the failure rate $\lambda$ is constant. If failure comes from slow decay, the independence requirement will not hold.
The exponential distribution has no memory: the probability of waiting at least y hours more for a table is unfortunately not affected by already having waited x hours.
If failures only occur at integer times t = 0, 1, 2, 3, … then the exponential distribution (which has continuous time) is replaced by the geometric distribution (discrete time).
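A numerical check of the no-memory property, sampling exponential waiting times with rate $\lambda$ (the rate and the cutoffs x, y are arbitrary choices for the demonstration):

```python
# P(T > x + y | T > x) should equal P(T > y) = exp(-lam*y), whatever x is.
import numpy as np

rng = np.random.default_rng(1)
lam = 0.5
T = rng.exponential(scale=1 / lam, size=1_000_000)

x, y = 2.0, 3.0
conditional = (T > x + y).sum() / (T > x).sum()
print(conditional, np.exp(-lam * y))   # both close to exp(-lam*y)
```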
Log-normal
When x has a normal distribution, the exponential $e^x$ has a log-normal distribution. In other words, the distribution of x is log-normal when the distribution of y = log(x) is normal.
The log-normal distribution appears when you take the product of many positive sample values. The arithmetic mean for normal compares with the geometric mean for log-normal.
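A small sketch of that product rule: the log of a product of many positive factors is a sum of logs, so the product is approximately log-normal and its natural average is the geometric mean (the factor distribution below is an arbitrary choice):

```python
# Products of positive samples: log(product) is a sum, hence approximately normal.
import numpy as np

rng = np.random.default_rng(2)
factors = rng.uniform(0.5, 1.5, size=(100_000, 50))   # 50 positive factors per product
products = factors.prod(axis=1)

logs = np.log(products)                      # roughly normal by the CLT
print("mean of log(product):", logs.mean())
print("geometric mean of the product:", np.exp(logs.mean()))
```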
Chi-squared
If x has the standard N(0, 1) distribution, then $x^2$ is chi-squared.
If $x_1, \dots, x_n$ are independent N(0, 1), then $x_1^2 + \dots + x_n^2$ has the chi-squared distribution with n degrees of freedom.
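A quick simulation consistent with that statement: the sum of squares of n standard normals has mean n and variance 2n, as the chi-squared distribution with n degrees of freedom requires:

```python
# Mean and variance of x_1^2 + ... + x_n^2 for independent N(0, 1) samples.
import numpy as np

rng = np.random.default_rng(3)
n = 5
x = rng.standard_normal(size=(200_000, n))
chi2 = (x ** 2).sum(axis=1)

print(chi2.mean(), chi2.var())   # close to n = 5 and 2n = 10
```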
Multivariable Gaussian
When the variables are not independent, like stocks and bonds, the covariance matrix C is symmetric and positive definite (in an extreme case, semidefinite). Then the joint distribution of the Gaussian variables is “multivariate Gaussian”.
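A minimal sketch of drawing samples with a chosen covariance C: multiply independent N(0, 1) samples by the Cholesky factor of C (the particular C below is just an example of a dependent pair):

```python
# Multivariate Gaussian sampling: x = L z with C = L L^T and z independent N(0, 1).
import numpy as np

rng = np.random.default_rng(4)
C = np.array([[2.0, 0.8],
              [0.8, 1.0]])           # symmetric positive definite covariance
L = np.linalg.cholesky(C)            # C = L @ L.T

z = rng.standard_normal((100_000, 2))
x = z @ L.T                          # correlated Gaussian samples

print(np.cov(x, rowvar=False))       # recovers C approximately
```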
Moments, Cumulants, and Inequalities
Suppose we know $E[X]$; we want to bound $\Pr(X \ge a)$.
Markov’s inequality: if $X \ge 0$, then $\Pr(X \ge a) \le E[X]/a$.
Chebyshev’s inequality: $\Pr(|X - m| \ge a) \le \sigma^2/a^2$. The proof is to apply Markov’s inequality to the nonnegative variable $(X - m)^2$.
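A numeric sanity check of both bounds for an exponential variable with mean 1 and variance 1 (an arbitrary test case):

```python
# Markov:    P(X >= a) <= E[X]/a          (X >= 0)
# Chebyshev: P(|X - m| >= a) <= sigma^2/a^2
import numpy as np

rng = np.random.default_rng(5)
X = rng.exponential(scale=1.0, size=1_000_000)   # mean 1, variance 1
a = 3.0

print((X >= a).mean(), "<=", 1.0 / a)                   # Markov
print((np.abs(X - 1.0) >= a).mean(), "<=", 1.0 / a**2)  # Chebyshev
```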
Connecting a list of numbers to a (generating) function $f(x) = \sum a_n x^n$
Moments and Central Moments
$m_n =E[x^n]$
central moments (around m) $\mu_n=E[(x-m)^n]$
The normalized third central moment is the skewness of the distribution.
We generally use the fourth moment of a normal distribution as a basis for comparison. The kurtosis of a distribution is then $\kappa = \mu_4/\sigma^4 - 3$, which is zero for the normal.
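A short sketch computing skewness and kurtosis directly from the central moments, assuming the excess-kurtosis convention $\kappa = \mu_4/\sigma^4 - 3$ so the normal scores zero:

```python
# Skewness mu_3/sigma^3 and excess kurtosis mu_4/sigma^4 - 3 from samples.
import numpy as np

def skew_kurtosis(x):
    m, sigma = x.mean(), x.std()
    mu3 = ((x - m) ** 3).mean()
    mu4 = ((x - m) ** 4).mean()
    return mu3 / sigma**3, mu4 / sigma**4 - 3

rng = np.random.default_rng(6)
print(skew_kurtosis(rng.standard_normal(1_000_000)))    # ~ (0, 0) for the normal
print(skew_kurtosis(rng.exponential(size=1_000_000)))   # ~ (2, 6) for the exponential
```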
Generating Functions and Cumulants
Probability generating function
Characteristic function $\phi(t) = E[e^{itx}]$. The characteristic function is used to prove the central limit theorem.
Moment generating function $M(t) = E[e^{tx}]$
Cumulant generating function K(t)
Each function captures an infinite set of coefficients. The key point of cumulants is the connection $K(t) = \log M(t)$: for independent random variables, the cumulant generating functions (and so the cumulants) simply add.
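A numerical illustration of that additivity for two independent normals, where $K(t) = mt + \sigma^2 t^2/2$ is known exactly (the means, variances, and the value of t below are arbitrary test choices):

```python
# Empirical K(t) = log M(t) = log E[exp(t*x)]; for independent x, y the K's add.
import numpy as np

rng = np.random.default_rng(7)
t = 0.3
x = rng.normal(1.0, 2.0, size=2_000_000)    # mean 1,    sigma 2
y = rng.normal(-0.5, 1.0, size=2_000_000)   # mean -0.5, sigma 1

K = lambda samples: np.log(np.mean(np.exp(t * samples)))
print(K(x) + K(y))                           # sum of the two cumulant functions
print(K(x + y))                              # K of the independent sum
print((1.0 - 0.5) * t + (4.0 + 1.0) * t**2 / 2)   # exact: (m1+m2)t + (s1^2+s2^2)t^2/2
```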
Chernoff’s Inequality for Sums
The mean value of a sum is the sum of the means. If those variables are independent, then the variance of the sum is also the sum of the variances.
Chebyshev for a sum
Markov and Chebyshev Inequalities for Matrices
an essential step in the statistics of deep learning is to develop inequalities for the eigenvalues and the trace of random matrices.
Markov’s inequality for matrices. Suppose X is a positive semidefinite (or definite) random matrix and A is any positive definite matrix. Then
Prob {A − X is not positive semidefinite} $\le$ trace of $E[X]\,A^{-1}$
Chebyshev’s inequality for a random matrix X with mean 0:
Matrix Chernoff Inequalities
Erdős–Rényi Random Graphs
Each edge is present with probability p. Then the question is: for which p is the graph likely to be connected?
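A small experiment around the classical connectivity threshold $p \approx (\log n)/n$ (this sketch assumes the networkx package; n, the number of trials, and the factors are arbitrary):

```python
# Fraction of G(n, p) samples that are connected, for p near (log n)/n.
import numpy as np
import networkx as nx

n, trials = 200, 200
threshold = np.log(n) / n

for factor in (0.5, 1.0, 1.5, 2.0):
    p = factor * threshold
    connected = sum(
        nx.is_connected(nx.erdos_renyi_graph(n, p, seed=s)) for s in range(trials)
    )
    print(f"p = {p:.3f} ({factor} x threshold): connected in {connected}/{trials} trials")
```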
Covariance Matrices and Joint Probabilities
Prove that independent experiments have covariance 0
Max covariance = $\sigma_1\sigma_2$ when the two variables are fully dependent
U is the column vector of deviations from the means; each $p_{ij}UU^T$ is positive semidefinite. So the whole matrix V (the sum of those rank-1 matrices) is at least positive semidefinite.
The covariance matrix V is positive definite unless the experiments are dependent.
$V = E[(X-E[X])(X-E[X])^T] = \sum p_{ij}\,(X-E[X])(X-E[X])^T$
Variance of $c^TX$ = $c^TVc$
Diagonalizing the covariance matrix $V=Q\Lambda Q^T$ means finding M independent experiments as combinations of the original M experiments.
Another proof that the covariance matrix V is positive semidefinite: V is the sum, over every combination of outcomes, of its joint probability times the positive semidefinite matrix $(X-E[X])(X-E[X])^T$.
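A tiny concrete example of that sum: a 2-by-2 joint distribution (probabilities chosen arbitrarily), with V built term by term from the rank-1 matrices and its eigenvalues confirming positive semidefiniteness:

```python
# V = sum over outcomes of p_ij * (X - E[X])(X - E[X])^T for a discrete joint pmf.
import numpy as np

values = [0.0, 1.0]                       # possible outcomes of x and of y
p = np.array([[0.3, 0.2],                 # joint probabilities p_ij (sum to 1)
              [0.1, 0.4]])

mean = np.zeros(2)
for i, xi in enumerate(values):
    for j, yj in enumerate(values):
        mean += p[i, j] * np.array([xi, yj])

V = np.zeros((2, 2))
for i, xi in enumerate(values):
    for j, yj in enumerate(values):
        d = np.array([xi, yj]) - mean
        V += p[i, j] * np.outer(d, d)     # each term is rank-1 and PSD

print(V)
print(np.linalg.eigvalsh(V))              # eigenvalues >= 0, so V is PSD
```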
The value of the expectation symbol E is that it also allows pdf’s: probability density functions like p(x, y, z) for continuous random variables x and y and z.
The covariance matrix of $Z =AX$ is $ V_Z=AV_XA^T$
Suppose x and y are independent random variables with mean 0 and variance 1. What is the covariance matrix of Z = (x, y, ax + by)?
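A worked answer using $V_Z = A V_X A^T$, with $V_X = I$ since x and y are independent with variance 1 (the values of a and b are arbitrary):

```python
# Covariance matrix of Z = (x, y, a*x + b*y): V_Z = A V_X A^T with V_X = I.
import numpy as np

a, b = 2.0, 3.0
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [a,   b  ]])
V_X = np.eye(2)                 # independent x, y with variance 1

V_Z = A @ V_X @ A.T
print(V_Z)                      # [[1, 0, a], [0, 1, b], [a, b, a^2 + b^2]]
```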
Correlation
Normalize the random variable x to have variance 1: $X = x/\sigma_x$
The correlation of x and y is the covariance of X and Y
Correlation matrix R
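A small sketch going from a covariance matrix V to the correlation matrix R by dividing each entry by $\sigma_i \sigma_j$ (the matrix V below is an arbitrary example):

```python
# R = D^{-1/2} V D^{-1/2}, where D holds the variances from the diagonal of V.
import numpy as np

V = np.array([[4.0, 1.2],
              [1.2, 9.0]])          # covariance matrix: sigma_x = 2, sigma_y = 3
d = np.sqrt(np.diag(V))             # standard deviations
R = V / np.outer(d, d)              # correlation matrix, ones on the diagonal
print(R)
```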
GPS Example
Distances from four or more satellites will pinpoint the receiver position (using least squares!). The speed of light changes in the ionosphere, but the correction will be almost the same for all nearby receivers. If one receiver stays in a known position, we can take differences from that one. Differential GPS reduces the error variance by fixing one receiver.
Markov chains
$y_{n+1}=Py_n$
Find the eigenvalues of a square matrix through the determinant: solve $\det(P - \lambda I) = 0$.
The determinant of a square matrix can be seen as a measure of the volume spanned by its row or column vectors. If the determinant is zero, it means these vectors are linearly dependent, and thus, the volume is zero. In geometric terms, it indicates that the transformation described by the matrix flattens the space into a lower dimension.
The determinant of a matrix is calculated from a specific combination of its elements, reflecting how rows or columns interact. If these are linearly dependent (i.e., one row or column can be expressed as a combination of others), then they don’t span an “n-dimensional volume” in the space, and the determinant becomes zero.
The determinant can be viewed as a scaling factor for the volume of the unit cube under the linear transformation represented by the matrix: for a 2x2 matrix, it corresponds to the area of the parallelogram spanned by its columns (or rows).
From an algebraic standpoint, if there are linearly dependent rows or columns, the matrix can be row-reduced to a form where one entire row (or column) becomes zero, leading the determinant to be zero, as any matrix with a zero row has a zero determinant.
steady state with eigenvalue 1
$P=X\Lambda X^{-1}$; $P^\infty$ has the same split in both columns.
The requirements for a positive Markov matrix: all $p_{ij} > 0$ and each column of P adds to 1. Then P has largest eigenvalue $\lambda_1 = 1$ and eigenvector $x_1 > 0$.
Transition Probabilities
$p_{ij}$ = probability that $x(n+1) = i$ if $x(n) = j$. $p_{ij}$ does not depend on n.
Finite Markov Chain, Infinite State Markov Chain, Continuous Markov Chain
1 is an eigenvalue, because the rows of P − I add to the zero row. And no entry of P is negative.
Perron-Frobenius Theorem: all $p_{ij} > 0$ guarantees success. For A > 0, all numbers in $Ax=\lambda_{\max}x$ are strictly positive.
If $|\lambda_2| < 1$, then $P^n$ converges. The matrix P might not have n independent eigenvectors, in which case we can’t diagonalize P.
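A minimal 2-by-2 example (a made-up positive Markov matrix) showing the eigenvalue 1, the second eigenvalue below 1 in magnitude, and $P^n$ converging to a matrix with the steady state in both columns:

```python
# Columns of P add to 1 and all entries are positive, so lambda_1 = 1.
import numpy as np

P = np.array([[0.8, 0.3],
              [0.2, 0.7]])

eigvals, eigvecs = np.linalg.eig(P)
print(eigvals)                               # 1.0 and 0.5, so |lambda_2| < 1

steady = eigvecs[:, np.isclose(eigvals, 1)].ravel()
steady = steady / steady.sum()               # normalize to a probability vector
print(steady)                                # [0.6, 0.4]

print(np.linalg.matrix_power(P, 50))         # both columns approach the steady state
```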
Gambler’s Ruin
when a player has won all the money, the game is over and there is no way to leave either of those states. With those two steady states we must expect eigenvalue 1 twice.
Shape of P. What are the four eigenvalues of P? What is the probability that the game will continue forever with no winner?
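A sketch under an assumed setup (the exact matrix is not written out here): total fortune 3, a fair coin, states 0..3 dollars with 0 and 3 absorbing, and columns of P adding to 1. The four eigenvalues come out as 1, 1, and ±1/2, and the probability of playing forever is 0 because the transient eigenvalues are below 1 in magnitude:

```python
# Gambler's ruin with absorbing states 0 and 3 (assumed fair-coin example).
import numpy as np

P = np.array([[1.0, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.5, 1.0]])

print(np.linalg.eigvals(P))                # 1, 1 (two absorbing states) and +-0.5

start = np.array([0.0, 1.0, 0.0, 0.0])     # start with 1 dollar
after = np.linalg.matrix_power(P, 100) @ start
print(after)                               # essentially no mass left in states 1, 2
```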