Probability and Random Processes
These are ideas from undergraduate probability that I sometimes find useful in daily life.
Probability fundamentals
- Kolmogorov’s axioms:
  - Probabilities are non-negative: $P(A) \ge 0$
  - The probability of the entire sample space is one: $P(\Omega) = 1$
  - For disjoint events $A$ and $B$, $P(A \cup B) = P(A) + P(B)$ (and countable additivity for sequences of disjoint events)
- Conditional probability: $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$
- Law of total probability: for a partition $B_1, \dots, B_n$ of $\Omega$, $P(A) = \sum_i P(A \mid B_i)\,P(B_i)$
- Independence implies uncorrelated (the reverse is not necessarily true)
  - $X, Y$ are independent if for all $x, y$: $P(X = x, Y = y) = P(X = x)\,P(Y = y)$
  - Covariance: $\operatorname{Cov}(X, Y) = E[XY] - E[X]E[Y]$
  - Correlation: $\rho = \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}}$ ($|\rho| \le 1$ by Cauchy–Schwarz)
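As a quick numerical illustration that uncorrelated does not imply independent (a sketch assuming numpy; the specific distribution is a made-up example): take $X$ uniform on $\{-1, 0, 1\}$ and $Y = X^2$, which are clearly dependent yet have zero covariance.

```python
import numpy as np

# X uniform on {-1, 0, 1}, Y = X^2: dependent but uncorrelated.
x = np.array([-1.0, 0.0, 1.0])
p = np.array([1/3, 1/3, 1/3])
y = x**2

ex = np.sum(p * x)        # E[X] = 0
ey = np.sum(p * y)        # E[Y] = 2/3
exy = np.sum(p * x * y)   # E[XY] = E[X^3] = 0
cov = exy - ex * ey       # Cov(X, Y) = 0

# Dependence is obvious: P(Y = 1 | X = 1) = 1, while P(Y = 1) = 2/3.
print(cov)  # 0.0
```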
Common probability tools
- Expectation tricks:
  - Iterated expectation/tower rule: $E[X] = E[E[X \mid Y]]$
  - Law of total variance: $\operatorname{Var}(X) = E[\operatorname{Var}(X \mid Y)] + \operatorname{Var}(E[X \mid Y])$
  - Variance of a sum: for pairwise uncorrelated $X_1, \dots, X_n$, $\operatorname{Var}\left(\sum_i X_i\right) = \sum_i \operatorname{Var}(X_i)$
  - Tail sum: $E[X] = \sum_{k=1}^{\infty} P(X \ge k)$ for nonnegative integer-valued $X$
  - Moment generating function: $M_X(t) = E[e^{tX}]$
    - e.g. for Gaussian: $M_X(t) = e^{\mu t + \sigma^2 t^2 / 2}$
    - The $n$-th moment of a random variable is $E[X^n] = M_X^{(n)}(0)$, the $n$-th derivative of the MGF at zero
      - e.g. for a standard Gaussian: $E[X^2] = 1$, $E[X^4] = 3$
- Probability manipulations:
  - Bayes’ rule: $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$
  - Derived distributions: for $Y = g(X)$ with invertible $g$, $f_Y(y) = f_X(g^{-1}(y)) \left| \frac{d}{dy} g^{-1}(y) \right|$
  - Inclusion-exclusion: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
  - Order statistics: for continuous i.i.d. $X_1, \dots, X_n$, the $k$-th smallest value $X_{(k)}$ has pdf $f_{X_{(k)}}(x) = n \binom{n-1}{k-1} F(x)^{k-1} (1 - F(x))^{n-k} f(x)$
  - Convolution: for indep. $X, Y$, $Z = X + Y$ has pdf $f_Z(z) = \int f_X(x)\,f_Y(z - x)\,dx$
    - The MGF of $Z$ is $M_Z(t) = M_X(t)\,M_Y(t)$
- Common problem solving techniques:
- Counting: combinations, stars and bars, etc.
- Indicator variables: typically used to calculate expectation / variance
- Symmetry: typically used to simplify calculations
- Probabilistic method: showing proof of existence via randomly choosing
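The indicator-variable trick above can be sketched numerically (assuming numpy; the example is a classic one, not from the original notes): the expected number of fixed points of a uniform random permutation of $n$ items is $\sum_i P(\pi(i) = i) = n \cdot \frac{1}{n} = 1$ by linearity, regardless of $n$.

```python
import numpy as np

# E[# fixed points of a random permutation] = 1 by indicators + linearity.
rng = np.random.default_rng(0)
n, trials = 10, 200_000

# Sorting uniform randoms per row yields uniform random permutations.
perms = np.argsort(rng.random((trials, n)), axis=1)
fixed_points = (perms == np.arange(n)).sum(axis=1)

print(fixed_points.mean())  # close to 1
```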
Common distributions
- Bernoulli: $X = 1$ with probability $p$ and $X = 0$ otherwise
- Binomial: sum of $n$ iid Bernoullis
- Geometric/exponential: trials/time until success, with the unique memoryless property
  - Min of exponentials is exponential with rate $\lambda_1 + \lambda_2$; $P(X_1 < X_2) = \frac{\lambda_1}{\lambda_1 + \lambda_2}$
  - Erlang($k, \lambda$) is the sum of $k$ indep. exponentials with rate $\lambda$
- Poisson($\lambda$): the limit of a binomial as $n \to \infty$ and $p \to 0$ with $np \to \lambda$
  - Merging: for indep. $X \sim \text{Poisson}(\lambda)$, $Y \sim \text{Poisson}(\mu)$, $X + Y \sim \text{Poisson}(\lambda + \mu)$
  - Splitting: Poisson($\lambda$) with arrivals dropped indep. w.p. $1 - p$ is Poisson($p\lambda$)
- Gaussian: ubiquitous distribution commonly used for modeling noise
  - For independent $X \sim \mathcal{N}(\mu_X, \sigma_X^2)$ and $Y \sim \mathcal{N}(\mu_Y, \sigma_Y^2)$, $X + Y \sim \mathcal{N}(\mu_X + \mu_Y, \sigma_X^2 + \sigma_Y^2)$
  - Jointly Gaussian: two random variables $X, Y$ are J.G. if the vector $(X, Y)$ is Gaussian
    - i.e. a J.G. vector $X$ can be written as $X = \mu + AW$, where $W$ is a vector of standard iid Gaussians
  - Uncorrelated jointly Gaussian RVs are independent
- Laplace: L1 version of Gaussian, $f(x) = \frac{1}{2b} e^{-|x - \mu| / b}$
- Exponential family: $f(x \mid \theta) = h(x) \exp\left(\eta(\theta)^\top T(x) - A(\theta)\right)$ for nonnegative $h$
  - $T(x)$ is the sufficient statistic of the distribution
  - $\eta(\theta)$ is the natural parameter
  - $A(\theta)$ is the log-partition function
- Gamma distribution $\Gamma(k, \theta)$; density $f(x) = \frac{1}{\Gamma(k)\,\theta^k} x^{k-1} e^{-x/\theta}$ with the gamma function $\Gamma(\cdot)$:
  - Shape parameter $k$: for integer $k$, $\Gamma(k, \theta)$ is Erlang ($\Gamma(1, \theta)$ is exponential)
  - Scale parameter $\theta$: if $X \sim \Gamma(k, \theta)$, then $cX \sim \Gamma(k, c\theta)$
- Chi-square distribution with $k$ DOF: $\sum_{i=1}^k Z_i^2$ for i.i.d. $Z_i \sim \mathcal{N}(0, 1)$
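The min-of-exponentials facts above are easy to sanity-check by simulation (a sketch assuming numpy; the rates $a = 2$, $b = 3$ are made-up values): $\min(X_1, X_2)$ should have mean $\frac{1}{a+b}$ and $P(X_1 < X_2)$ should be $\frac{a}{a+b}$.

```python
import numpy as np

# Min of independent exponentials: min(Exp(a), Exp(b)) ~ Exp(a + b),
# and P(X1 < X2) = a / (a + b).
rng = np.random.default_rng(1)
a, b, n = 2.0, 3.0, 500_000

# numpy's exponential takes the scale (1 / rate), not the rate.
x1 = rng.exponential(1 / a, n)
x2 = rng.exponential(1 / b, n)

m = np.minimum(x1, x2)
print(m.mean())          # close to 1 / (a + b) = 0.2
print((x1 < x2).mean())  # close to a / (a + b) = 0.4
```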
Concentration inequalities
- Union bound: $P\left(\bigcup_i A_i\right) \le \sum_i P(A_i)$
- Markov’s inequality: $P(X \ge a) \le \frac{E[X]}{a}$ for a nonnegative random variable $X$ and $a > 0$
- Chebyshev’s inequality: $P(|X - E[X]| \ge a) \le \frac{\operatorname{Var}(X)}{a^2}$ (Markov’s on $(X - E[X])^2$)
- Chernoff’s inequality: $P(X \ge a) \le \frac{E[e^{tX}]}{e^{ta}}$ for any $t > 0$ (Markov’s on $e^{tX}$)
- Hoeffding’s inequality: $P\left(\frac{1}{n}\sum_i (X_i - E[X_i]) \ge t\right) \le \exp\left(-\frac{2 n^2 t^2}{\sum_i (b_i - a_i)^2}\right)$ where $a_i \le X_i \le b_i$ a.s.
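A small worked comparison of how loose these bounds can be (a sketch; the choice of $X \sim \text{Exp}(1)$ and $a = 5$ is a made-up example): the exact tail $P(X \ge a) = e^{-a}$ versus Markov's and Chebyshev's bounds.

```python
import numpy as np

# For X ~ Exp(1) (mean 1, variance 1), compare the exact tail P(X >= 5)
# with the Markov and Chebyshev bounds.
a = 5.0
exact = np.exp(-a)                 # exact tail of Exp(1)
markov = 1.0 / a                   # E[X] / a
chebyshev = 1.0 / (a - 1.0) ** 2   # Var(X) / (a - E[X])^2, via P(X>=a) <= P(|X-1|>=a-1)

print(exact, chebyshev, markov)    # 0.0067... <= 0.0625 <= 0.2
```

Chebyshev is tighter than Markov here because it uses the variance, but both are far above the true exponential tail; Chernoff would recover the exponential decay.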
Convergence
- Borel-Cantelli Lemma: if $\sum_n P(A_n) < \infty$, then $P(A_n \text{ infinitely often}) = 0$
  - If $\sum_n P(|X_n - X| > \epsilon) < \infty$ for all $\epsilon > 0$, then $X_n \to X$ a.s.
- Almost sure convergence: $X_n \xrightarrow{\text{a.s.}} X$ if $P\left(\lim_{n \to \infty} X_n = X\right) = 1$
  - i.e. the sequence $X_n$ deviates only a finite number of times from $X$ (beyond any $\epsilon$)
  - Strong Law of Large Numbers: the empirical mean converges almost surely
- Convergence in probability: $X_n \xrightarrow{p} X$ if for all $\epsilon > 0$, $\lim_{n \to \infty} P(|X_n - X| > \epsilon) = 0$
  - i.e. the probability that $X_n$ deviates from $X$ goes to zero (but a single sequence can still deviate infinitely often)
  - Weak Law of Large Numbers: the empirical mean converges in probability
- Convergence in distribution: $X_n \xrightarrow{d} X$ if $\lim_{n \to \infty} F_{X_n}(x) = F_X(x)$ at all $x$ where $F_X$ is continuous
  - i.e. $X_n$ is modeled by the distribution of $X$
  - Central Limit Theorem: $\frac{1}{\sqrt{n}} \sum_{i=1}^n \frac{X_i - \mu}{\sigma}$ converges in distribution to a standard normal
  - Markov chains: the state distribution converges to the stationary distribution
- $L^p$ ($p$-th mean) convergence: $X_n \xrightarrow{L^p} X$ if $\lim_{n \to \infty} E[|X_n - X|^p] = 0$
  - Dominated convergence: if $X_n \to X$ a.s., $|X_n| \le Y$, and $E[Y^p] < \infty$, then $X_n \to X$ in $L^p$
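The Law of Large Numbers is easy to see numerically (a sketch assuming numpy; the Uniform(0, 1) choice and sample sizes are made up for illustration): the empirical mean's deviation from $\frac{1}{2}$ shrinks as $n$ grows.

```python
import numpy as np

# Empirical mean of Uniform(0, 1) samples approaches 1/2 as n grows.
rng = np.random.default_rng(2)
devs = [abs(rng.random(n).mean() - 0.5) for n in (100, 10_000, 1_000_000)]
print(devs)  # deviations on the order of 1/sqrt(n)
```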
Information theory
- Entropy: $H(X) = -\sum_x p(x) \log_2 p(x)$
  - Chain rule for entropy: $H(X, Y) = H(X) + H(Y \mid X)$
  - Mutual information: $I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$
  - Data processing inequality: for a Markov chain $X \to Y \to Z$, $I(X; Y) \ge I(X; Z)$
- Huffman encoding: optimally compresses $X$ to within one bit of $H(X)$ per symbol with a prefix code
  - Source coding theorem: cannot losslessly compress $X$ in fewer than $H(X)$ bits per symbol on average
  - Kraft-McMillan inequality: $\sum_i 2^{-\ell_i} \le 1$ for any uniquely decodable code, and any lengths satisfying it are achievable with a prefix code
    - i.e. we don’t have to use a non-prefix code
- Asymptotic equipartition property: $-\frac{1}{n} \log_2 p(X_1, \dots, X_n) \to H(X)$ as $n \to \infty$
  - i.e. the sequence $-\frac{1}{n} \log_2 p(X_1, \dots, X_n)$ converges to $H(X)$ by the Law of Large Numbers
- KL divergence: $D(p \| q) = \sum_x p(x) \log_2 \frac{p(x)}{q(x)}$
  - i.e. the number of extra bits from compressing $p$ with a code optimized for $q$
  - Total variation distance: $\delta(p, q) = \frac{1}{2} \sum_x |p(x) - q(x)|$
    - Pinsker’s inequality: $\delta(p, q) \le \sqrt{\frac{1}{2} D(p \| q)}$ (with $D$ in nats)
  - Jensen-Shannon divergence: $\operatorname{JSD}(p \| q) = \frac{1}{2} D(p \| m) + \frac{1}{2} D(q \| m)$ for $m = \frac{1}{2}(p + q)$
- Channel coding theorem: channel capacity $C = \max_{p(x)} I(X; Y)$
  - Binary erasure channel: bit erased with probability $p$, has capacity $1 - p$
  - Binary symmetric channel: bit flipped with probability $p$, has capacity $1 - H(p)$ (with $H$ the binary entropy)
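Pinsker's inequality can be checked directly on a pair of small discrete distributions (a sketch assuming numpy; the two distributions are made-up examples, and KL is computed in nats, the base for which this form of Pinsker holds).

```python
import numpy as np

# Check delta(p, q) <= sqrt(D(p || q) / 2) for two made-up distributions.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.25, 0.25, 0.5])

kl = np.sum(p * np.log(p / q))    # D(p || q) in nats
tv = 0.5 * np.sum(np.abs(p - q))  # total variation distance

print(kl, tv)  # tv = 0.3, and sqrt(kl / 2) is slightly larger
```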
Markov chains
Discrete (DTMC):
- Markov chains satisfy the Markov property: $P(X_{n+1} = j \mid X_n = i, X_{n-1}, \dots, X_0) = P(X_{n+1} = j \mid X_n = i)$
- Common properties: recurrence (positive, null), transience, irreducibility, periodicity, reversibility
- Solving Markov chains: stationary distribution ($\pi = \pi P$), first step equations, detailed balance
- Big theorem, stationary distribution, balance equations:
  - Detailed (a.k.a. local) balance equations ($\pi_i P_{ij} = \pi_j P_{ji}$) hold if the Markov chain has a tree structure
  - Flow-in/flow-out holds for any cut, and extends the detailed balance equations
  - A stationary distribution exists for a class iff it is positive recurrent
  - If it exists, the stationary distribution for a communicating class is unique
  - If the whole chain is irreducible, then there is a unique stationary distribution
  - If the whole chain is also aperiodic, then the state distribution converges to the stationary dist for any initial dist
- Other useful properties of Markov chains:
  - For random walks on undirected graphs, $\pi_i = \frac{d_i}{2|E|}$, where $d_i$ is the degree of vertex $i$ and $|E|$ is the number of edges in the graph
  - The reciprocal of the stationary dist is the expected time to return to a state, starting from that state: $E[T_i] = \frac{1}{\pi_i}$
- Applications:
  - MCMC: set up a MC and sample from the stationary dist as a proxy for sampling from the original dist
  - Erdos-Renyi random graphs: $n$ vertices, with each edge independently picked to be in the graph w.p. $p$
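Solving $\pi = \pi P$ numerically amounts to finding the left eigenvector of $P$ with eigenvalue 1 and normalizing it (a sketch assuming numpy; the 3-state transition matrix is a made-up example).

```python
import numpy as np

# Stationary distribution of a small 3-state chain: left eigenvector of P
# for eigenvalue 1, normalized to sum to 1.
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.6, 0.2],
              [0.0, 0.5, 0.5]])

w, v = np.linalg.eig(P.T)                     # left eigenvectors of P
pi = np.real(v[:, np.argmin(np.abs(w - 1))])  # pick eigenvalue closest to 1
pi = pi / pi.sum()                            # normalize (also fixes the sign)

print(pi)  # satisfies pi @ P == pi
```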
Continuous (CTMC):
- CTMCs have exponential transitions, rather than discrete
- The holding time in state $i$ is the min of exponentials, i.e. exponential with rate $q_i = \sum_{j \ne i} q_{ij}$
- Has a rate matrix $Q$, detailed balance equations, recurrence, transience
- Stationary distribution satisfies $\pi Q = 0$
- Jump/embedded chain: create a DTMC that models the “jumps” of a CTMC
  - i.e. visitation order of the states, by considering transition probabilities as the min of exponentials
  - Transition probability from $i$ to $j$ is $\frac{q_{ij}}{\sum_{k \ne i} q_{ik}}$
  - Crucially, no self loops, so it does not take into account holding time
- Uniformization: create a DTMC by relating the rates in terms of a fixed discrete rate $\gamma \ge \max_i q_i$
  - Transition probability from $i$ to $j$ (for $j \ne i$) is $\frac{q_{ij}}{\gamma}$
  - Transition probability from $i$ to $i$ (self-loop) is $1 - \frac{q_i}{\gamma}$
  - Also can write the transition matrix $P$ in terms of the rate matrix as $P = I + \frac{1}{\gamma} Q$
  - Has the same stationary distribution: $\pi P = \pi \iff \pi Q = 0$
- Poisson process: number of arrivals in time $t$ is Poisson($\lambda t$) (with indep. non-overlapping intervals)
  - Merging: the sum of indep. Poisson processes with rates $\lambda_1, \lambda_2$ is a new Poisson process with rate $\lambda_1 + \lambda_2$
  - Splitting: for a Poisson process w/ rate $\lambda$, dropping arrivals indep. w.p. $1 - p$ gives a Poisson process w/ rate $p\lambda$
  - The time of the $k$-th arrival is the sum of $k$ independent exponentials with rate $\lambda$, i.e. Erlang($k, \lambda$)
  - Conditioned on $N_t = n$ arrivals at time $t$, the arrival times are distributed uniformly on $[0, t]$
    - e.g. given exactly one arrival by time $t$, its expected arrival time is $t/2$
  - Random incidence paradox: from the perspective of a random point in time, the expected length of the surrounding interarrival interval is doubled ($\frac{2}{\lambda}$)
  - A Poisson process backwards is still a Poisson process
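Poisson splitting can be checked by simulating counts directly (a sketch assuming numpy; the rate $\lambda = 4$, keep-probability $p = 0.25$, and horizon $t = 2$ are made-up values): thinning $N_t \sim \text{Poisson}(\lambda t)$ should leave counts with both mean and variance $p \lambda t$.

```python
import numpy as np

# Splitting: keep each arrival of a rate-lambda process w.p. p; the kept
# counts over [0, t] should look Poisson(p * lambda * t).
rng = np.random.default_rng(3)
lam, p, t, trials = 4.0, 0.25, 2.0, 200_000

n_total = rng.poisson(lam * t, trials)  # N_t ~ Poisson(lam * t)
n_kept = rng.binomial(n_total, p)       # thin each arrival independently

print(n_kept.mean(), n_kept.var())  # both close to p * lam * t = 2.0
```

Mean equal to variance is the Poisson signature; merging can be checked the same way by summing two independent Poisson counts.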
Hypothesis testing and statistics
- Neyman-Pearson: a form of frequentist hypothesis testing, where we assume no prior over the parameter
  - Suppose there are two outcomes, either $H_0$ (null) or $H_1$, the alternate hypothesis
  - Since we have no prior, there is no notion of the “most likely” outcome
  - Probability of False Alarm (PFA): $P(\text{accept } H_1 \mid H_0)$ (the “type 1” error)
  - Probability of Correct Detection (PCD): $P(\text{accept } H_1 \mid H_1)$ (one minus the “type 2” error)
  - Goal: maximize PCD such that PFA is less than a “budget”; the optimal test thresholds the likelihood ratio
- ROC curve: plots the achievable PCD against PFA as the decision threshold varies
  - AUC is the probability a random positive sample is ranked higher than a random negative sample
- $p$-value: given that $H_0$ is true, what is the probability of observing data at least as extreme as this data?
- Sufficient statistic: $T(X)$ is sufficient for parameter $\theta$ if $P(X \mid T(X))$ does not depend on $\theta$
  - Fisher-Neyman factorization theorem: $T$ is sufficient iff $f(x \mid \theta) = g(T(x), \theta)\,h(x)$ for some $g, h$
    - i.e. the product of two functions where only $g$ depends on $\theta$ (through $T(x)$)
  - Minimal sufficiency: $T$ is minimal sufficient if it can be reconstructed from any sufficient statistic
  - Completeness: $T$ is complete if $E_\theta[g(T)] = 0$ for all $\theta$ implies $g(T) = 0$ a.s.
    - If $T$ is complete and sufficient, then $T$ is minimal sufficient
  - $A(X)$ is ancillary if its distribution does not depend on $\theta$
    - Basu’s theorem: if $T$ is complete sufficient and $A$ is ancillary, they are independent
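The ranking interpretation of AUC can be computed directly (a sketch assuming numpy; the scores below are made-up numbers): count the fraction of positive-negative pairs where the positive sample scores higher, with ties counted as half.

```python
import numpy as np

# AUC as P(random positive score > random negative score), ties count half.
pos = np.array([0.9, 0.8, 0.4])  # made-up positive-sample scores
neg = np.array([0.7, 0.3, 0.2])  # made-up negative-sample scores

wins = (pos[:, None] > neg[None, :]).sum()
ties = (pos[:, None] == neg[None, :]).sum()
auc = (wins + 0.5 * ties) / (len(pos) * len(neg))

print(auc)  # 8 of 9 pairs are correctly ordered: 0.888...
```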
Estimation
- For an estimator $\hat{\theta}$ of $\theta$:
  - Expected (mean squared) error: $E[(\hat{\theta} - \theta)^2]$
  - Bias: $E[\hat{\theta}] - \theta$
  - Bias-variance tradeoff: $E[(\hat{\theta} - \theta)^2] = \operatorname{Bias}(\hat{\theta})^2 + \operatorname{Var}(\hat{\theta})$
- Maximum likelihood estimation: find the parameters $\hat{\theta}_{\text{MLE}} = \arg\max_\theta f(x \mid \theta)$ that maximize the likelihood
  - The MLE estimator can be biased, e.g. the German tank problem, or the variance of a Gaussian from samples
  - MLE is a special case of MAP, where the prior over $\theta$ is uniform
- Maximum a posteriori estimation: find the parameters $\hat{\theta}_{\text{MAP}} = \arg\max_\theta f(\theta \mid x) = \arg\max_\theta f(x \mid \theta)\,f(\theta)$ that maximize the posterior
  - e.g. L1 regularization corresponds to a Laplace prior
- Minimum Mean Square Estimation (MMSE): find the best function $g$ to minimize $E[(X - g(Y))^2]$
  - Consider the Hilbert space where $\langle X, Y \rangle = E[XY]$, with corresponding norm $\|X\| = \sqrt{E[X^2]}$
  - The MMSE estimator is $E[X \mid Y]$ ($E[(X - E[X \mid Y])\,h(Y)] = 0$ for all other functions $h(Y)$)
  - Both the LLSE and MMSE are unbiased, as $E[\hat{X}] = E[X]$
- Linear Least Squares Estimation (LLSE): best sq. error linear estimator of $X$ given $Y$: $L[X \mid Y] = E[X] + \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(Y)}(Y - E[Y])$
  - Projection of $X$ onto the span of both $1$ and $\tilde{Y}$, where $\tilde{Y} = Y - E[Y]$ is the transformed $Y$ such that $\langle 1, \tilde{Y} \rangle = 0$
  - If $X$ and $Y$ are jointly Gaussian (e.g. Gaussian noise), the LLSE is also the MMSE
- Kalman filtering: given observations $Y_1, \dots, Y_n$, recursively compute $L[X_n \mid Y_1, \dots, Y_n]$
  - Smoothing: given $Y_1, \dots, Y_n$, estimate the past $L[X_m \mid Y_1, \dots, Y_n]$ for $m < n$
- Rao-Blackwell: for $T$ sufficient and loss $L$ convex, $E[L(\theta, E[\hat{\theta} \mid T])] \le E[L(\theta, \hat{\theta})]$ (strict if $L$ is strictly convex)
  - i.e., a randomized estimator is worse than the non-randomized one
- Cramer-Rao: for an unbiased estimator $\hat{\theta}$, $\operatorname{Var}(\hat{\theta}) \ge \frac{1}{I(\theta)}$
  - i.e. the variance of any unbiased estimator is bounded by the reciprocal of the Fisher information
  - Fisher information: $I(\theta) = E\left[\left(\frac{\partial}{\partial \theta} \log f(X; \theta)\right)^2\right] = -E\left[\frac{\partial^2}{\partial \theta^2} \log f(X; \theta)\right]$ (note: the second form is commonly used instead of the first)
  - Efficiency: $e(\hat{\theta}) = \frac{1 / I(\theta)}{\operatorname{Var}(\hat{\theta})}$ (if it equals one for all $\theta$, $\hat{\theta}$ is fully efficient; this may not be possible)
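The LLSE coefficient has a clean closed form in the classic signal-plus-noise setup (a sketch assuming numpy; the variances $\operatorname{Var}(X) = 4$, $\operatorname{Var}(Z) = 1$ are made-up values): for $Y = X + Z$ with independent zero-mean $X$ and $Z$, $\frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(Y)} = \frac{\operatorname{Var}(X)}{\operatorname{Var}(X) + \operatorname{Var}(Z)}$.

```python
import numpy as np

# LLSE of X from Y = X + Z: the slope Cov(X, Y) / Var(Y) should match
# Var(X) / (Var(X) + Var(Z)) = 4 / 5 = 0.8.
rng = np.random.default_rng(4)
n, var_x, var_z = 500_000, 4.0, 1.0

x = rng.normal(0, np.sqrt(var_x), n)
y = x + rng.normal(0, np.sqrt(var_z), n)

coef = np.cov(x, y)[0, 1] / np.var(y)
print(coef)  # close to 0.8
```

Since $X$ and $Y$ here are jointly Gaussian, this linear estimator is also the MMSE $E[X \mid Y]$.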
Addendum
At Berkeley, we had a tradition of creating “study guides” before exams, which are still useful to me today when looking up old material. Back when I was a teaching assistant, I wrote a study guide for our class on Probability and Random Processes that is the base for this webpage, which you can find here.
I plan on slowly adding material to this over time as I get around to it. It’s mostly here as a convenient resource for me to look up old content.