Probability and Random Processes

These are a bunch of useful ideas from undergraduate probability that I sometimes find useful in daily life.

Probability fundamentals

  1. Kolmogorov’s axioms:
    • Probabilities are non-negative
    • Probability of at least one possible outcome is one
    • For disjoint events A and B, P(AB)=P(A)+P(B)
  2. Conditional probability: P(A,B)=P(A|B)P(B)
    • Law of total probability: for a partition {Bi} of Ω, P(A)=iP(A|Bi)P(Bi)
  3. Independence implies uncorrelated (reverse is not necessarily true)
    • X,Y are independent if for all A,B: P(XA,YB)=P(XA)P(YB)
    • Covariance: Cov(X,Y)=E[(XE[X])(YE[Y])]
    • Correlation: Corr(X,Y)=Cov(X,Y)σ(X)σ(Y) (1Corr(X,Y)1 by Cauchy-Schwarz)

Common probability tools

  1. Expectation tricks:
    • Iterated expectation/tower rule: E[X]=E[E[X|Y]]
    • Law of total variance: Var(Y)=E[Var(Y|X)]+Var(E[Y|X])
    • Variance sum: for pairwise uncorrelated {Xi}, Var(iXi)=iVar(Xi)
    • Tail sum: E[X]=x=0P(X>x)
    • Moment generating function: MX(t)=E[etX]
      • eg. for Gaussian: MX(t)=exp(μt+12σ2t2)
      • The n-th moment of a random variable is E[Xn]=MX(n)(0)=dnMXdsn|s=0
  2. Probability manipulations:
    • Bayes’ rule: P(A|B)=P(B|A)P(A)P(B)
    • Derived distributions: for Y=g(X), fY(y)=fX(x)|g(x)|
    • Inclusion-exclusion: P(i=1nAi)=k=1n(1)k+11i1<<iknPr(Ai1Aik)
    • Order statistics: For continuous Xi, fX(i)(y)=n(n1i1)FX(y)i1(1FX(y))nifX(y)
    • Convolution: for indep. X,Y, Z=X+Y has pdf fZ(z)=t=fX(zt)fY(t)dt
      • The MGF of Z is MZ(t)=MX(at)MY(bt)
  3. Common problem solving techniques:
    • Counting: combinations, stars and bars, etc
    • Indicator variables: typically used to calculate expectation / variance
    • Symmetry: typically used to simplify calculations
    • Probabilistic method: showing proof of existence via randomly choosing

Common distributions

  1. Bernoulli: X=1 with probability p and X=0 otherwise
    • Binomial: sum of iid Bernoullis
  2. Geometric/exponential: trials until success with unique memoryless property
    • Min of exponentials is exponential with rate j=1nλj; P(Xk=miniXi)=λkj=1nλj
    • Erlang(k,λ) is the sum of k indep. exponentials with rate λ
    • Poisson(λ): the limit of a binomial as n and p0 and npλ
      • Merging: for indep. XPois(λ),YPois(μ), X+YPois(λ+μ)
      • Splitting: Poisson(λ) with arrivals dropped indep. w.p. p is Poisson(λp)
  3. Gaussian: ubiquitous distribution commonly used for modeling noise
    • For independent XN(μ1,σ12),YN(μ2,σ22), X+YN(μ1+μ2,σ12+σ22)
    • Jointly Gaussian: two random variables X,Y are J.G. if the vector (X,Y) is Gaussian
      • i.e. a J.G. vector Y can be written as Y=AZ+μ, where Z is a vector of standard iid Gaussians
      • Uncorrelated jointly Gaussian RVs are independent
    • Laplace: L1 version of Gaussian, fX(x)=12bexp(|xμ|b)
  4. Exponential family: fX(x|θ)=h(x)exp(η(θ)T(x)A(θ)) for h nonnegative
    • T(x) is the sufficient statistic of the distribution
    • η is the natural parameter
    • A(η) is the log-partition function
  5. Gamma distribution Γ(α,β); function Γ(α)=0xα1exdx:
    • Shape parameter α: Γ(α+1)=αΓ(α) (Γ(n+1)=n!)
    • Scale parameter β: if XΓ(α,1), βXΓ(α,β)
  6. Chi-square distribution with p DOF: χp2=i=1pZi2 for i.i.d. ZiN(0,1)

Concentration inequalities

  1. Union bound: P(i=1nAi)i=1nP(Ai)
  2. Markov’s inequality: P(Xa)E[X]a for nonnegative random variable X and a>0
  3. Chebyshev’s inequality: P(|XE[X]|c)Var(X)c2 (Markov’s on |XE[X]|)
  4. Chernoff’s inequality: P(Xa)=P(etXeta)E[etX][eta] (Markov’s on etX)
  5. Hoeffding’s inequality: P(SnE[Sn]t)exp(2t2i(biai)2) where aiXibi a.s.

Convergence

  1. Borel-Cantelli Lemma: if iP(Ai)<, then P(infinitely many Ai occur)=0
    • If iP(|XnX|>ϵ)<, then XnX a.s.
  2. Almost sure convergence: Xnna.s.X if P(limnXn=X)=1
    • i.e. the sequence X(n) deviates only a finite number of times from X
    • Strong Law of Large Numbers: empirical mean converges almost surely
  3. Convergence in probability: Xnni.p.X if limnP(|XnX|>ϵ)=0
    • i.e. the probability that Xn deviates only from X goes to zero (but can still deviate infinitely)
    • Weak Law of Large Numbers: empirical mean converges in probability
  4. Convergence in distribution: XnndX if x with P(X=x)=0, P(Xnx)nP(Xx)
    • i.e. Xn is modeled by the distribution X
    • Central Limit Theorem: distribution of outcomes converges to a standard normal
    • Markov Chains: state distribution converges to stationary distribution
  5. Lr (r-th mean) convergence: limnE[|XnX|r]=0
    • Dominated convergence: if XnX a.s., |Xn|<Y, and E[Y]<, then XnX in L1

Information theory

  1. Entropy: H(X)=E[log[p(X)]]
    • Chain rule for entropy: H(X,Y)=H(X)+H(Y|X)
    • Mutual information: I(X;Y)=H(X)H(X|Y)
    • Data processing inequality: for a Markov chain XYZ, I(X;Y)I(X;Z)
  2. Huffman encoding: optimally compresses X to H(X) bits with a prefix code
    • Source coding theorem: cannot compress X in less than H(X) bits
    • Kraft-McMillan inequality: w2l(w)1 for any prefix code
      • i.e. we don’t have to use a non-prefix code
  3. Asymptotic equipartition property: P(|1nlog p(X1,,Xn)H(X)|ϵ)1 as n
    • i.e. the sequence log 1p(xi) converges to H(X) by the Law of Large Numbers
  4. KL divergence: DKL(p||q)=Ep[log1q(X)]Ep[log1p(X)]
    • i.e. the number of extra bits from improper compression of p
    • Total variation distance: TVD(p||q)=maxx|p(x)q(x)|
      • Pinsker’s inequality: TVD(p||q)12DKL(p||q)
    • Jensen-Shannon divergence: JSD(p||q)=12(D(p||m)+D(q||m)),m=12(p+q)
  5. Channel coding theorem: channel capacity C=maxp(X)I(X;Y)
    • Binary erasure channel: bit erased with probability p, has capacity C=1p
    • Binary symmetric channel: bit swapped with probability p, has capacity C=1H(p)

Markov chains

Discrete (DTMC):

  1. Markov chains satisfy the Markov property: P(Xn|Xn1,Xn2,)=P(Xn|Xn1)
    • Common properties: recurrence (positive, null), transience, irreducibility, periodicity, reversibility
    • Solving Markov chains: stationary distribution (π=πP), first step equations, detailed balance
  2. Big theorem, stationary distribution, balance equations:
    • Detailed (a.k.a. local) balance equations hold if the Markov chain as a tree structure
    • Flow-in/flow-out holds for any cut, extends detailed balance equations
    • Stationary distribution exists for a class iff it is positive recurrent
      • If it exists, the stationary distribution for a communicating class is unique
    • If the whole chain is irreducible, then there is a unique stationary distribution
    • If whole chain is also aperiodic, then the chain converges a.s. to the stationary dist for any initial dist
  3. Other useful properties of Markov chains:
    • For undirected graphs, π(i)=degree(i)2E, where E is the number of edges in the graph
    • The reciprocal of the stationary dist is the expected time to return to a state, starting from that state
  4. Applications:
    • MCMC: set up a MC and sample from the stationary dist as a proxy for sampling from the original dist
    • Erdos-Renyi random graphs: n vertices, with each edge independently picked to be in the graph w.p. p

Continuous (CTMC):

  1. CTMCs have exponential transitions, rather than discrete
    • Holding time is the min of exponentials
    • Has rate matrix, detailed balance equations, recurrence, transience
    • Stationary distribution satisfies πQ=0
  2. Jump/embedded chain: create a DTMC that models the “jumps” of a CTMC
    • i.e. visitation order of the states, by considering transition probabilities as the min of exponentials
    • Transition probability from k to j is P(k,j)=λk,ji=1nλk,i
    • Crucially, no self loops, so does not take into account holding time
  3. Uniformization: create a DTMC by relating the rates in terms of a fixed discrete rate λ
    • Transition probability from k to j (for kj) is P(k,j)=λk,jλ
    • Transition probability from k to k (self-loop) is P(k,k)=1i=1,iknP(k,i)
    • Also can write transition matrix P in terms of rate matrix Q as P=I+1λQ
    • Has the same stationary distribution: πP=π(1λQ+I)=π(0+I)=π
  4. Poisson process: number of arrivals in time t is Poisson(λt) (with indep. non-overlapping intervals)
    • Merging: the sum of indep. Poisson processes with rates λ,μ is a new Poisson process with rate λ+μ
    • Splitting: for Poisson process w/ rate λ and drop arrivals w.p. p, we have a Poisson process w/ rate pλ
    • TkErlang(k,λ) is the distribution of sum of k independent exponentials with rate λ
    • Conditioned on n at time t (Nt=n), the arrivals are distributed uniformly
      • e.g. E[Ti+1Ti]=tn+1
    • Random incidence paradox: from the perspective of a point, the expected interarrival time is doubled
      • A Poisson process backwards is still a Poisson process

Hypothesis testing and statistics

  1. Neyman-Pearson: a form of frequentist hypothesis testing, where we assume no prior over parameter X
    • Suppose there are two outcomes, either X=0 (null) or X=1, the alternate hypothesis
    • Since we have no prior, there is no notion of the “most likely” outcome
    • Probability of False Alarm (PFA): P(X^=1|X=0) (“type 1” error)
    • Probability of Correct Detection (PCD): P(X^=1|X=1) (“type 2” error)
    • Goal: maximize PCD such that PFA is less than “budget” β
  2. ROC curve: maximizing PCD is equivalent to maximizing PFA subject to the PFA constraint β
    • AUC is the probability a random positive sample is ranked higher than a random negative sample
  3. p-value: given that X=0 is true, what is the probability we observed this data (X^=1)?
  4. Sufficient statistic: t=T(X) is sufficient for parameter θ if P(X|t) does not depend on θ
    • Fisher–Neyman factorization theorem: T is sufficient iff pθ(x)=gθ(T(x))h(x) for some gθ,h
      • i.e. the product of two functions where only g depends on θ (through T)
    • Minimal sufficiency: T is minimal sufficient if it can be reconstructed from any sufficient statistic
    • Completeness: T is complete if Eθ[f(T)]=0 implies f(T)=0 a.s.
      • If T is complete and sufficient, then T is minimal sufficient
    • V is ancillary if its distribution does not depend on θ
      • Basu’s theorem: if T is complete sufficient and V ancillary, they are independent

Estimation

  1. For an estimator y^=f(X):
    • Expected error: E[(f(X)Y)2]
    • Bias: E[f(X)Y]
    • Bias-variance tradeoff: E[(f(X)Y)2]=(Bias(f))2+Var(f)+σY2
  2. Maximum likelihood estimation: find parameters θ that maximize likelihood l(X|θ)
    • The MLE estimator can be biased, ex. German tank problem, variance of a Gaussian from samples
    • MLE is a special case of MAP, where the prior over θ is uniform
  3. Maximum a posteriori estimation: find parameters θ that maximize likelihood l(X|θ)f(θ)
    • ex. L(x)=i=1n(yiwxi)2+λ|w| corresponds to a Laplace prior
  4. Minimum Mean Square Estimation (MMSE): find the best function ϕ to minimize E[(Yϕ(X))2]
    • Consider Hilbert space where X,Y=E[XY], with corresponding norm |X|2=E[X2]
    • The MMSE estimator is E[Y|X] (Yϕ(X),f(X)=0 for all other f)
    • Both the LLSE and MMSE are unbiased, as E[XE[X]]=E[E[Y|X]E[Y]]=0
  5. Linear Least Squares Estimation (LLSE): best sq. error linear estimator of Y given 1,X
    • L[Y|X]=E[Y]+cov(X,Y)var(X)(XE[X])
    • Projection of Y onto both X~ and 1, where X~ is the transformed X such that X~,1=0
    • If the noise is Gaussian, the LLSE is also the MMSE
    • Kalman filtering: given observations Y1,,Yn, compute E[Xn|Y1,Y2,Yn]
      • Smoothing: given Y1,,Yn, estimate the past Xi for i<n: E[Xi|Y1,Y2,,Yn]
  6. Rao-Blackwell: for T sufficient and L convex, L(y,E[y^|T])L(y,y^) (strict if L is strict)
    • i.e., a randomized estimator y^(T) is worse than the non-randomized one E[y^|T]
  7. Cramer-Rao: for an unbiased estimator y^, Var(y^)1J(y)
    • i.e. the variance of any unbiased estimator is bounded by the reciprocal of the Fisher information
    • Fisher information: J(y)=Var(ylog(py(x))) (note: commonly use θ instead of y)
    • Efficiency: (J(y))1Var(y^) (if equals one for all y, y^ is fully efficient; may not be possible)

Addendum

At Berkeley, we had a tradition of creating “study guides” before exams, which are still useful to me today when looking up old material. Back when I was a teaching assistant, I wrote a study guide for our class on Probability and Random Processes that is the base for this webpage, which you can find here.

I plan on slowly adding material to this over time as I get around to it. It’s mostly here as a convenient resource for me to look up old content.

Notes mentioning this note

There are no notes linking to this note.