Probability and Random Processes
These are ideas from undergraduate probability that I sometimes find useful in daily life.
Probability fundamentals
- Kolmogorov’s axioms:
- Probabilities are non-negative
- The probability of the entire sample space is one ($P(\Omega) = 1$)
- Countable additivity: for disjoint events $A_1, A_2, \dots$, $P(\bigcup_i A_i) = \sum_i P(A_i)$
- Conditional probability: $P(A, B) = P(A | B) P(B)$
- Law of total probability: for a partition $\{B_i\}$ of $\Omega$, $P(A) = \sum_i P(A | B_i) P(B_i)$
- Independence implies uncorrelated; the converse is not necessarily true (see the sketch after this list)
- $X, Y$ are independent if for all $A, B$: $P(X \in A, Y \in B) = P(X \in A) P(Y \in B)$
- Covariance: $\text{Cov}(X, Y) = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])]$
- Correlation: $\text{Corr}(X, Y) = \frac{\text{Cov}(X, Y)}{\sigma(X) \sigma(Y)}$ ($-1 \leq \text{Corr}(X, Y) \leq 1$ by Cauchy-Schwarz)
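A minimal numpy sketch of the independence/correlation distinction above; the choice $Y = X^2$ is just an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
y = x**2  # a deterministic function of x, hence highly dependent

# Correlation is ~0 even though y is fully determined by x.
print(f"Corr(X, X^2) ~ {np.corrcoef(x, y)[0, 1]:.4f}")

# Independence would require P(X in A, Y in B) = P(X in A) P(Y in B);
# taking A = (-1, 1) and B = [0, 1) makes the two events coincide:
p_joint = np.mean((np.abs(x) < 1) & (y < 1))
p_prod = np.mean(np.abs(x) < 1) * np.mean(y < 1)
print(f"joint {p_joint:.3f} vs product {p_prod:.3f}")  # ~0.683 vs ~0.466
```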
Common probability tools
- Expectation tricks:
- Iterated expectation/tower rule: $\mathbb{E}[X] = \mathbb{E}[\mathbb{E}[X | Y]]$
- Law of total variance: $\text{Var}(Y) = \mathbb{E}[\text{Var}(Y|X)] + \text{Var}(\mathbb{E}[Y|X])$
- Variance sum: for pairwise uncorrelated $\{X_i\}$, $\text{Var}(\sum_i X_i) = \sum_i \text{Var}(X_i)$
- Tail sum: for nonnegative $X$, $\mathbb{E}[X] = \int_0^\infty P(X > x) \, dx$
- Moment generating function: $M_X(t) = \mathbb{E}[e^{tX}]$
- e.g. for Gaussian: $M_X(t) = \text{exp}(\mu t + \frac{1}{2} \sigma^2 t^2)$
- The $n$-th moment of a random variable is $\mathbb{E}[X^n] = M^{(n)}_X(0) = \frac{d^n M_X}{d t^n} \big|_{t=0}$
- Probability manipulations:
- Bayes’ rule: $P(A | B) = \frac{P(B | A) P(A)}{P(B)}$
- Derived distributions: for $Y = g(X)$ with $g$ invertible, $f_Y(y) = \frac{f_X(x)}{|g'(x)|}$ evaluated at $x = g^{-1}(y)$
- Inclusion-exclusion: $P(\bigcup_{i=1}^n A_i) = \sum_{k=1}^n (-1)^{k+1} \sum_{1 \leq i_1 < \dots < i_k \leq n} \Pr(A_{i_1} \cap \dots \cap A_{i_k})$
- Order statistics: For continuous $X_i$, $f_{X^{(i)}}(y) = n \binom{n-1}{i-1} F_X(y)^{i-1} (1-F_X(y))^{n-i} f_{X}(y)$
- Convolution: for indep. $X, Y$, $Z = X + Y$ has pdf $f_Z(z) = \int_{t = -\infty}^{\infty} f_X(z-t) f_Y(t) dt$
- The MGF factors over independent sums: $M_Z(t) = M_X(t) M_Y(t)$ (more generally, $M_{aX + bY}(t) = M_X(at) M_Y(bt)$)
- Common problem solving techniques:
- Counting: combinations, stars and bars, etc
- Indicator variables: typically used to calculate expectation / variance
- Symmetry: typically used to simplify calculations
- Probabilistic method: prove existence by showing a random construction succeeds with positive probability
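A quick Monte Carlo check of two of the expectation tricks above (the tower rule and the law of total variance); the specific hierarchical model is an arbitrary choice for illustration:

```python
import numpy as np

# Illustrative hierarchical model (an arbitrary assumption):
# X ~ Uniform(0, 1), then Y | X ~ Normal(X, 0.5^2).
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 1_000_000)
y = rng.normal(loc=x, scale=0.5)

# Tower rule: E[Y] = E[E[Y|X]] = E[X] = 1/2.
print(f"E[Y] ~ {y.mean():.4f} (tower rule predicts 0.5)")

# Law of total variance:
# Var(Y) = E[Var(Y|X)] + Var(E[Y|X]) = 0.5^2 + Var(X) = 0.25 + 1/12.
print(f"Var(Y) ~ {y.var():.4f} (predicted {0.25 + 1/12:.4f})")
```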
Common distributions
- Bernoulli: $X=1$ with probability $p$ and $X=0$ otherwise
- Binomial: sum of iid Bernoullis
- Geometric/exponential: number of trials (discrete) or waiting time (continuous) until success; the unique memoryless distributions
- Min of exponentials is exponential with rate $\sum_{j=1}^n \lambda_j$; $P(X_k = \min_i X_i) = \frac{\lambda_k}{\sum_{j=1}^n \lambda_j}$
- Erlang($k, \lambda$) is the sum of $k$ indep. exponentials with rate $\lambda$
- Poisson($\lambda$): the limit of a binomial as $n \to \infty$ and $p \to 0$ and $np \to \lambda$
- Merging: for indep. $X \sim \text{Pois}(\lambda), Y \sim \text{Pois}(\mu)$, $X+Y \sim \text{Pois}(\lambda + \mu)$
- Splitting: $\text{Poisson}(\lambda)$ with arrivals dropped indep. w.p. $p$ is $\text{Poisson}(\lambda p)$
- Gaussian: ubiquitous distribution commonly used for modeling noise
- For independent $X \sim \mathcal{N}(\mu_1, \sigma_1^2), Y \sim \mathcal{N}(\mu_2, \sigma_2^2)$, $X + Y \sim \mathcal{N}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$
- Jointly Gaussian: two random variables $X, Y$ are J.G. if the vector $(X, Y)$ is Gaussian
- i.e. a J.G. vector $Y$ can be written as $Y = A Z + \mu$, where $Z$ is a vector of standard iid Gaussians
- Uncorrelated jointly Gaussian RVs are independent
- Laplace: L1 version of Gaussian, $f_X(x) = \frac{1}{2b} \text{exp}(-\frac{|x - \mu|}{b})$
- Exponential family: $f_X(x | \theta) = h(x) \text{exp}(\eta(\theta) T(x) - A(\theta))$ for $h$ nonnegative
- $T(x)$ is the sufficient statistic of the distribution
- $\eta$ is the natural parameter
- $A(\theta)$ is the log-partition function (the normalizer)
- Gamma distribution $\Gamma(\alpha, \beta)$; gamma function $\Gamma(\alpha) = \int_0^\infty x^{\alpha-1} e^{-x} \, dx$:
- Shape parameter $\alpha$: $\Gamma(\alpha + 1) = \alpha \Gamma(\alpha)$ ($\Gamma(n+1) = n!$)
- Scale parameter $\beta$: if $X \sim \Gamma(\alpha, 1)$, $\beta X \sim \Gamma(\alpha, \beta)$
- Chi-square distribution with $p$ DOF: $\chi_p^2 = \sum_{i=1}^p Z_i^2$ for i.i.d. $Z_i \sim \mathcal{N}(0, 1)$
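A short simulation sketch of the min-of-exponentials facts above, with arbitrarily chosen rates:

```python
import numpy as np

# Three independent exponentials; the rates are arbitrary choices.
rates = np.array([1.0, 2.0, 3.0])
rng = np.random.default_rng(0)
samples = rng.exponential(1 / rates, size=(1_000_000, 3))

# Min of exponentials is exponential with rate sum(rates) = 6,
# so its mean should be 1/6.
mins = samples.min(axis=1)
print(f"E[min] ~ {mins.mean():.4f} (predicted {1 / rates.sum():.4f})")

# P(X_k = min) = rate_k / sum(rates) = [1/6, 2/6, 3/6].
counts = np.bincount(samples.argmin(axis=1))
print("P(X_k = min) ~", counts / counts.sum())
```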
Concentration inequalities
- Union bound: $P(\bigcup_{i=1}^n A_i) \leq \sum_{i=1}^n P(A_i)$
- Markov’s inequality: $P(X \geq a) \leq \frac{\mathbb{E}[X]}{a}$ for nonnegative random variable $X$ and $a > 0$
- Chebyshev’s inequality: $P(|X - \mathbb{E}[X]| \geq c) \leq \frac{\text{Var}(X)}{c^2}$ (Markov’s on $|X - \mathbb{E}[X]|$)
- Chernoff’s inequality: for any $t > 0$, $P(X \geq a) = P(e^{tX} \geq e^{ta}) \leq \frac{\mathbb{E}[e^{tX}]}{e^{ta}}$ (Markov’s on $e^{tX}$)
- Hoeffding’s inequality: $P(S_n - \mathbb{E}[S_n] \geq t) \leq \text{exp}(-\frac{2t^2}{\sum_i (b_i - a_i)^2})$ where $S_n = \sum_i X_i$ and $a_i \leq X_i \leq b_i$ a.s.
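To get a feel for how loose each bound is, here is a sketch comparing them against the empirical tail of a sum of fair coin flips (the parameters $n$ and $t$ are arbitrary choices):

```python
import numpy as np

# S_n = sum of n fair coin flips in {0, 1}; tail P(S_n - n/2 >= t).
n, t = 100, 10
rng = np.random.default_rng(0)
s = rng.integers(0, 2, size=(200_000, n)).sum(axis=1)
empirical = np.mean(s - n / 2 >= t)

markov = (n / 2) / (n / 2 + t)       # Markov on nonnegative S_n
chebyshev = (n / 4) / t**2           # Var(S_n) = n/4
hoeffding = np.exp(-2 * t**2 / n)    # a_i = 0, b_i = 1

print(f"empirical {empirical:.4f} | Markov {markov:.3f} | "
      f"Chebyshev {chebyshev:.3f} | Hoeffding {hoeffding:.3f}")
```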
Convergence
- Borel-Cantelli Lemma: if $\sum_i P(A_i) < \infty$, then $P(\text{infinitely many } A_i \text{ occur}) = 0$
- If $\sum_n P(|X_n - X| > \epsilon) < \infty$ for all $\epsilon > 0$, then $X_n \to X$ a.s.
- Almost sure convergence: $X_n \xrightarrow[n \to \infty]{\text{a.s.}} X$ if $P(\lim_{n \to \infty} X_n = X) = 1$
- i.e. for any $\epsilon > 0$, the sequence $X_n$ deviates from $X$ by more than $\epsilon$ only finitely many times
- Strong Law of Large Numbers: empirical mean converges almost surely
- Convergence in probability: $X_n \xrightarrow[n \to \infty]{\text{i.p.}} X$ if $\lim_{n \to \infty} P(|X_n - X| > \epsilon) = 0$
- i.e. the probability that $X_n$ deviates from $X$ by more than $\epsilon$ goes to zero (but the sequence can still deviate infinitely often)
- Weak Law of Large Numbers: empirical mean converges in probability
- Convergence in distribution: $X_n \xrightarrow[n \to \infty]{\text{d}} X$ if $P(X_n \leq x) \xrightarrow[n \to \infty]{} P(X \leq x)$ at every continuity point $x$ of the CDF of $X$
- i.e. the distribution of $X_n$ approaches that of $X$ (says nothing about the variables themselves being close)
- Central Limit Theorem: the normalized sum $\frac{S_n - n\mu}{\sigma \sqrt{n}}$ converges in distribution to a standard normal
- Markov Chains: state distribution converges to stationary distribution
- $L^r$ ($r$-th mean) convergence: $\lim_{n \to \infty} \mathbb{E}[|X_n - X|^r] = 0$
- Dominated convergence: if $X_n \to X$ a.s., $|X_n| \leq Y$ a.s., and $\mathbb{E}[Y] < \infty$, then $X_n \to X$ in $L^1$
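A small simulation sketch of the LLN and CLT from the list above, using Exponential(1) samples as an arbitrary choice:

```python
import numpy as np

# i.i.d. Exponential(1) samples: mu = 1, sigma = 1 (arbitrary choice).
rng = np.random.default_rng(0)
n, trials = 1_000, 20_000
means = rng.exponential(1.0, size=(trials, n)).mean(axis=1)

# LLN: the empirical mean concentrates around mu = 1.
print(f"E[|mean - mu|] ~ {np.abs(means - 1).mean():.4f}")

# CLT: sqrt(n) (mean - mu) / sigma is approximately N(0, 1),
# so P(Z <= 1) should be close to Phi(1) ~ 0.8413.
z = np.sqrt(n) * (means - 1.0)
print(f"P(Z <= 1) ~ {np.mean(z <= 1):.4f} (Phi(1) ~ 0.8413)")
```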
Information theory
- Entropy: $H(X) = -\mathbb{E}[\log p(X)]$
- Chain rule for entropy: $H(X,Y) = H(X) + H(Y|X)$
- Mutual information: $I(X; Y) = H(X) - H(X | Y)$
- Data processing inequality: for a Markov chain $X \to Y \to Z$, $I(X; Y) \geq I(X; Z)$
- Huffman encoding: an optimal prefix code; its expected length $L$ satisfies $H(X) \leq L < H(X) + 1$
- Source coding theorem: lossless compression cannot use fewer than $H(X)$ bits per symbol on average
- Kraft-McMillan inequality: $\sum_w 2^{-l(w)} \leq 1$ for any prefix code
- i.e. any uniquely decodable code can be matched by a prefix code with the same lengths, so we lose nothing by using prefix codes
- Asymptotic equipartition property: $P(|-\frac{1}{n} \log p(X_1, \dots, X_n) - H(X)| \leq \epsilon) \to 1$ as $n \to \infty$
- i.e. the sample average of $\log \frac{1}{p(X_i)}$ converges to $H(X)$ by the Law of Large Numbers
- KL divergence: $D_{KL}(p || q) = \mathbb{E}_p[\log \frac{1}{q(X)}] - \mathbb{E}_p[\log \frac{1}{p(X)}]$
- i.e. the expected number of extra bits from compressing $p$ with a code designed for $q$
- Total variation distance: $TVD(p || q) = \text{max}_{A}|p(A) - q(A)| = \frac{1}{2} \sum_x |p(x) - q(x)|$ (max over events, not outcomes)
- Pinsker’s inequality: $TVD(p || q) \leq \sqrt{\frac{1}{2}D_{KL}(p || q)}$
- Jensen-Shannon divergence: $JSD(p || q) = \frac{1}{2}(D(p || m) + D(q || m)), m = \frac{1}{2}(p + q)$
- Channel coding theorem: rates below the channel capacity $C = \text{max}_{p(X)} I(X; Y)$ are achievable with vanishing error probability; rates above are not
- Binary erasure channel: bit erased with probability $p$, has capacity $C = 1 - p$
- Binary symmetric channel: bit flipped with probability $p$, has capacity $C = 1 - H(p)$ (with $H$ the binary entropy)
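A minimal sketch computing entropy, KL, and total variation for two small pmfs (the pmfs are arbitrary choices), and checking Pinsker’s inequality:

```python
import numpy as np

# Two small pmfs over the same alphabet (values chosen arbitrarily).
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

entropy = -np.sum(p * np.log2(p))         # H(p), in bits
kl_bits = np.sum(p * np.log2(p / q))      # D_KL(p || q), in bits
tvd = 0.5 * np.sum(np.abs(p - q))         # total variation distance
print(f"H(p) = {entropy:.4f} bits, D_KL(p||q) = {kl_bits:.4f} bits")

# Pinsker's inequality (with D_KL in nats): TVD <= sqrt(D_KL / 2).
kl_nats = np.sum(p * np.log(p / q))
print(f"TVD = {tvd:.4f} <= {np.sqrt(kl_nats / 2):.4f}")
```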
Markov chains
Discrete (DTMC):
- Markov chains satisfy the Markov property: $P(X_n | X_{n-1}, X_{n-2}, …) = P(X_n | X_{n-1})$
- Common properties: recurrence (positive, null), transience, irreducibility, periodicity, reversibility
- Solving Markov chains: stationary distribution ($\pi = \pi P$), first step equations, detailed balance
- Big theorem, stationary distribution, balance equations:
- Detailed (a.k.a. local) balance equations hold if the Markov chain has a tree structure
- Flow-in equals flow-out across any cut; this generalizes the detailed balance equations
- Stationary distribution exists for a class iff it is positive recurrent
- If it exists, the stationary distribution for a communicating class is unique
- If the whole chain is irreducible, then there is a unique stationary distribution
- If the whole chain is also aperiodic, then the state distribution converges to the stationary distribution from any initial distribution
- Other useful properties of Markov chains:
- For a random walk on an undirected graph, $\pi(i) = \frac{\text{degree}(i)}{2 E}$, where $E$ is the number of edges in the graph (verified in the sketch after this list)
- The reciprocal of the stationary dist is the expected time to return to a state, starting from that state
- Applications:
- MCMC: set up a MC and sample from the stationary dist as a proxy for sampling from the original dist
- Erdos-Renyi random graphs: $n$ vertices, with each edge independently picked to be in the graph w.p. $p$
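A sketch of the undirected-graph formula above: power iteration on a random walk over an arbitrarily chosen small graph recovers $\pi(i) = \text{degree}(i) / 2E$:

```python
import numpy as np

# Random walk on a small undirected graph (edges chosen arbitrarily).
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n = 4
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

deg = A.sum(axis=1)
P = A / deg[:, None]          # transition matrix of the walk

# Solve pi = pi P by power iteration (the chain is aperiodic here,
# since the graph contains a triangle).
pi = np.full(n, 1 / n)
for _ in range(1_000):
    pi = pi @ P

print("power iteration:", np.round(pi, 4))
print("degree formula: ", deg / (2 * len(edges)))  # deg(i) / 2E
# The expected return time to state i is 1 / pi(i).
```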
Continuous (CTMC):
- CTMCs transition after exponentially distributed holding times, rather than at discrete time steps
- The holding time in a state is the min of the exponential clocks on the transitions out of that state
- Has rate matrix, detailed balance equations, recurrence, transience
- Stationary distribution satisfies $\pi Q = 0$
- Jump/embedded chain: create a DTMC that models the “jumps” of a CTMC
- i.e. visitation order of the states, by considering transition probabilities as the min of exponentials
- Transition probability from $k$ to $j$ is $P(k, j) = \frac{\lambda_{k,j}}{\sum_{i=1}^n \lambda_{k,i}}$
- Crucially, no self loops, so does not take into account holding time
- Uniformization: create a DTMC by relating the rates in terms of a fixed discrete rate $\lambda$
- Transition probability from $k$ to $j$ (for $k \neq j$) is $P(k, j) = \frac{\lambda_{k, j}}{\lambda}$
- Transition probability from $k$ to $k$ (self-loop) is $P(k, k) = 1 - \sum_{i=1, i \neq k}^n P(k, i)$
- Also can write transition matrix $P$ in terms of rate matrix $Q$ as $P = I + \frac{1}{\lambda} Q$
- Has the same stationary distribution: $\pi P = \pi (I + \frac{1}{\lambda} Q) = \pi + \frac{1}{\lambda} \pi Q = \pi$ (using $\pi Q = 0$)
- Poisson process: number of arrivals in time $t$ is Poisson($\lambda t$) (with indep. non-overlapping intervals)
- Merging: the sum of indep. Poisson processes with rates $\lambda, \mu$ is a new Poisson process with rate $\lambda + \mu$
- Splitting: for Poisson process w/ rate $\lambda$ and drop arrivals w.p. $p$, we have a Poisson process w/ rate $p \lambda$
- The $k$-th arrival time $T_k \sim \text{Erlang}(k, \lambda)$, the sum of $k$ independent exponentials with rate $\lambda$
- Conditioned on $n$ at time $t$ ($N_t = n$), the arrivals are distributed uniformly
- e.g. $\mathbb{E}[T_{i+1} - T_i] = \frac{t}{n+1}$
- Random incidence paradox: the interarrival interval containing a fixed point in time has expected length $\frac{2}{\lambda}$, i.e. double the typical interarrival time
- A Poisson process backwards is still a Poisson process
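A sketch of the uniformization construction above, on an arbitrary 3-state rate matrix: the stationary distribution of the CTMC ($\pi Q = 0$) is also stationary for the uniformized DTMC $P = I + \frac{1}{\lambda} Q$:

```python
import numpy as np

# An arbitrary 3-state rate matrix Q (rows sum to zero).
Q = np.array([[-3.0,  2.0,  1.0],
              [ 1.0, -4.0,  3.0],
              [ 2.0,  2.0, -4.0]])

# Stationary distribution: the left null vector of Q, normalized.
w, v = np.linalg.eig(Q.T)
pi = np.real(v[:, np.argmin(np.abs(w))])
pi = pi / pi.sum()
print("pi Q      ~", np.round(pi @ Q, 10))       # ~ 0

# Uniformize with lambda at least the max exit rate (here 4).
lam = 5.0
P = np.eye(3) + Q / lam
print("pi P - pi ~", np.round(pi @ P - pi, 10))  # ~ 0
```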
Hypothesis testing and statistics
- Neyman-Pearson: a form of frequentist hypothesis testing, where we assume no prior over parameter $X$
- Suppose there are two outcomes, either $X=0$ (null) or $X=1$, the alternate hypothesis
- Since we have no prior, there is no notion of the “most likely” outcome
- Probability of False Alarm (PFA): $P(\hat{X}=1 | X=0)$ (“type 1” error)
- Probability of Correct Detection (PCD): $P(\hat{X}=1 | X=1)$ (one minus the “type 2”, or missed detection, error)
- Goal: maximize PCD such that PFA is less than “budget” $\beta$
- ROC curve: plots the best achievable PCD as a function of the PFA budget $\beta$
- AUC is the probability a random positive sample is ranked higher than a random negative sample
- $p$-value: given that $X = 0$ is true, the probability of observing data at least as extreme as what we saw
- Sufficient statistic: $t = T(X)$ is sufficient for parameter $\theta$ if $P(X | t)$ does not depend on $\theta$
- Fisher–Neyman factorization theorem: $T$ is sufficient iff $p_\theta(x) = g_\theta(T(x)) h(x)$ for some $g_\theta, h$
- i.e. the product of two functions where only $g$ depends on $\theta$ (through $T$)
- Minimal sufficiency: $T$ is minimal sufficient if it can be reconstructed from any sufficient statistic
- Completeness: $T$ is complete if $\mathbb{E}_\theta[f(T)] = 0$ for all $\theta$ implies $f(T) = 0$ a.s.
- If $T$ is complete and sufficient, then $T$ is minimal sufficient
- $V$ is ancillary if its distribution does not depend on $\theta$
- Basu’s theorem: if $T$ is complete sufficient and $V$ ancillary, they are independent
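A minimal Neyman-Pearson sketch: under an assumed (purely illustrative) Gaussian shift model, the likelihood ratio is monotone in the observation, so the optimal test reduces to a threshold, and we can estimate the PCD for a given PFA budget:

```python
import numpy as np

# Illustrative model: under X=0 the data is N(0,1); under X=1 it is
# N(1,1). The likelihood ratio is monotone in y, so the Neyman-Pearson
# test simply thresholds the observation.
rng = np.random.default_rng(0)
null = rng.normal(0.0, 1.0, 1_000_000)
alt = rng.normal(1.0, 1.0, 1_000_000)

beta = 0.05                          # PFA budget
c = np.quantile(null, 1 - beta)      # threshold calibrated to beta
print(f"threshold {c:.3f}: PFA ~ {np.mean(null > c):.3f}, "
      f"PCD ~ {np.mean(alt > c):.3f}")
# Sweeping beta over (0, 1) traces out the ROC curve.
```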
Estimation
- For an estimator $\hat{y} = f(X)$:
- Expected error: $\mathbb{E}[(f(X) - Y)^2]$
- Bias: $\mathbb{E}[f(X) - Y]$
- Bias-variance tradeoff: $\mathbb{E}[(f(X) - Y)^2] = (\text{Bias}(f))^2 + \text{Var}(f) + \sigma_Y^2$
- Maximum likelihood estimation: find parameters $\theta$ that maximize likelihood $l(X | \theta)$
- The MLE can be biased, e.g. the German tank problem, or estimating the variance of a Gaussian from samples
- MLE is a special case of MAP, where the prior over $\theta$ is uniform
- Maximum a posteriori estimation: find parameters $\theta$ that maximize the posterior, proportional to $l(X | \theta) f(\theta)$ for prior $f$
- e.g. $L(w) = \sum_{i=1}^n (y_i - wx_i)^2 + \lambda |w|$ corresponds to a Laplace prior on $w$
- Minimum Mean Square Estimation (MMSE): find the best function $\phi$ to minimize $\mathbb{E}[(Y - \phi(X))^2]$
- Consider Hilbert space where $\langle X, Y \rangle = \mathbb{E}[XY]$, with corresponding norm $| X |^2 = \mathbb{E}[X^2]$
- The MMSE estimator is $\mathbb{E}[Y | X]$ ($\langle Y - \phi(X), f(X) \rangle = 0$ for all other $f$)
- Both the LLSE (below) and MMSE are unbiased: $\mathbb{E}[\mathbb{E}[Y|X]] = \mathbb{E}[Y]$ by the tower rule, and the LLSE has mean $\mathbb{E}[Y]$ since $\mathbb{E}[X - \mathbb{E}[X]] = 0$
- Linear Least Squares Estimation (LLSE): best sq. error linear estimator of $Y$ given $1, X$
- $L[Y|X] = \mathbb{E}[Y] + \frac{\text{cov}(X,Y)}{\text{var}(X)} (X - \mathbb{E}[X])$
- Projection of $Y$ onto both $\tilde{X}$ and $1$, where $\tilde{X}$ is the transformed $X$ such that $\langle \tilde{X}, 1 \rangle = 0$
- If $X$ and $Y$ are jointly Gaussian, the MMSE estimator is linear, so the LLSE is also the MMSE
- Kalman filtering: given observations $Y_1, …, Y_n$, compute $\mathbb{E}[X_n | Y_1, Y_2, … Y_n]$
- Smoothing: given $Y_1, …, Y_n$, estimate the past $X_i$ for $i<n$: $\mathbb{E}[X_i | Y_1, Y_2, …, Y_n]$
- Rao-Blackwell: for $T$ sufficient and convex loss $L$, $\mathbb{E}[L(y, \mathbb{E}[\hat{y}|T])] \leq \mathbb{E}[L(y, \hat{y})]$ (strict if $L$ is strictly convex)
- i.e., conditioning an estimator on a sufficient statistic can only improve it; a randomized estimator is no better than the derandomized $\mathbb{E}[\hat{y}|T]$
- Cramer-Rao: for an unbiased estimator $\hat{y}$, $\text{Var}(\hat{y}) \geq \frac{1}{J(y)}$
- i.e. the variance of any unbiased estimator is bounded by the reciprocal of the Fisher information
- Fisher information: $J(y) = \text{Var}(\nabla_y \log p_y(x))$ (note: commonly use $\theta$ instead of $y$)
- Efficiency: $\frac{(J(y))^{-1}}{\text{Var}(\hat{y})}$ (if equals one for all $y$, $\hat{y}$ is fully efficient; may not be possible)
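A sketch of the LLSE formula above on an assumed linear-Gaussian model, where it coincides with the MMSE:

```python
import numpy as np

# Assumed linear-Gaussian model: Y = 2X + N(0, 0.5^2), X ~ N(0, 1).
rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
y = 2.0 * x + rng.normal(0.0, 0.5, x.shape)

# LLSE: L[Y|X] = E[Y] + Cov(X,Y)/Var(X) * (X - E[X]).
a = np.cov(x, y)[0, 1] / np.var(x)
b = y.mean() - a * x.mean()
print(f"L[Y|X] ~ {a:.3f} X + {b:.3f}")  # close to 2X + 0

# Jointly Gaussian, so E[Y|X] is linear and LLSE = MMSE; the residual
# error should match Var(Y) - Cov(X,Y)^2 / Var(X) = 0.25.
mse = np.mean((y - (a * x + b)) ** 2)
print(f"residual MSE ~ {mse:.4f} (predicted 0.25)")
```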
Addendum
At Berkeley, we had a tradition of creating “study guides” before exams, which are still useful to me today when looking up old material. Back when I was a teaching assistant, I wrote a study guide for our class on Probability and Random Processes that is the base for this webpage, which you can find here.
I plan on slowly adding material to this over time as I get around to it. It’s mostly here as a convenient resource for me to look up old content.