Linear Algebra and Convex Optimization
Vector spaces
- Vector spaces: a set $V$ with addition and scalar multiplication
  - Basis: a minimal set of vectors $\{v_1, \dots, v_n\}$ where every element in $V$ is a linear combination of the $v_i$
  - Dimension: the number of vectors in the basis
- Inner product: $\langle x, y \rangle$
  - Conjugate symmetry: $\langle x, y \rangle = \overline{\langle y, x \rangle}$
  - Linearity: $\langle ax + by, z \rangle = a \langle x, z \rangle + b \langle y, z \rangle$
  - Positive definiteness: $\langle x, x \rangle > 0$ for $x \neq 0$
- Norm: a function $\|\cdot\|$ from a vector space to the non-negative reals
  - Positive definiteness: $\|x\| = 0 \iff x = 0$
  - Triangle inequality: $\|x + y\| \leq \|x\| + \|y\|$
  - Absolute homogeneity: $\|ax\| = |a| \, \|x\|$
- Common vector norms (see the NumPy sketch after this list):
  - $L_1$: $\|x\|_1 = \sum_i |x_i|$
  - $L_2$: $\|x\|_2 = \sqrt{\sum_i x_i^2}$
  - $L_\infty$: $\|x\|_\infty = \max_i |x_i|$
- Useful formulas:
  - Cauchy-Schwarz: $|\langle x, y \rangle| \leq \|x\|_2 \, \|y\|_2$
  - Vector angle: $\cos \theta = \frac{\langle x, y \rangle}{\|x\|_2 \, \|y\|_2}$
  - Cardinality (# nonzero elements): $\|x\|_0 = \#\{i : x_i \neq 0\}$ (not a true norm)
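A minimal NumPy sketch (my own illustration, not part of the original notes) that evaluates the common vector norms on a random vector and checks the Cauchy-Schwarz and angle formulas:

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.standard_normal(5), rng.standard_normal(5)

# Common vector norms of x
l1 = np.linalg.norm(x, 1)         # sum of absolute values
l2 = np.linalg.norm(x, 2)         # Euclidean length
linf = np.linalg.norm(x, np.inf)  # largest absolute entry
card = np.count_nonzero(x)        # "L0" cardinality (not a true norm)

# Cauchy-Schwarz: |<x, y>| <= ||x||_2 ||y||_2
assert abs(x @ y) <= np.linalg.norm(x) * np.linalg.norm(y)

# Vector angle from the inner product
cos_theta = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(l1, l2, linf, card, np.degrees(np.arccos(cos_theta)))
```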
Matrix algebra
- Common matrix norms:
  - $1$-norm: $\|A\|_1 = \max_j \sum_i |a_{ij}|$: max absolute column sum
  - $2$-norm (spectral): $\|A\|_2 = \sigma_{\max}(A)$
  - $\infty$-norm: $\|A\|_\infty = \max_i \sum_j |a_{ij}|$: max absolute row sum
  - Sub-multiplicative property for induced norms: $\|AB\| \leq \|A\| \, \|B\|$
  - Nuclear: $\|A\|_* = \sum_i \sigma_i(A)$
  - Frobenius: $\|A\|_F = \sqrt{\sum_{i,j} a_{ij}^2} = \sqrt{\operatorname{tr}(A^\top A)}$
  - Norm relationships for a rank-$r$ matrix $A$: $\|A\|_2 \leq \|A\|_F \leq \sqrt{r} \, \|A\|_2$ and $\|A\|_F \leq \|A\|_* \leq \sqrt{r} \, \|A\|_F$
- Common matrix properties (computed numerically in the sketch after this list):
  - Range: the span of the columns of $A$: $\mathcal{R}(A) = \{Ax : x \in \mathbb{R}^n\}$
  - Rank: number of linearly independent columns of $A$ (equals the number of linearly independent rows)
  - Symmetry: a matrix $A$ is symmetric if $A = A^\top$
  - Hermitian: complex generalization of symmetric, $A = A^H$
  - Positive-semidefinite: $A \succeq 0$ if $x^\top A x \geq 0$ for all $x$ (PD if the inequality is strict)
  - Gain: $\frac{\|Ax\|_2}{\|x\|_2}$; the maximum gain is $\|A\|_2$
  - Trace (sum of diagonal elements): $\operatorname{tr}(A) = \sum_i a_{ii} = \sum_i \lambda_i$
  - Determinant: $\det(A) = \prod_i \lambda_i$
  - Spectral radius: $\rho(A) = \max_i |\lambda_i|$
  - Condition number: $\kappa(A) = \|A\| \, \|A^{-1}\| = \sigma_{\max} / \sigma_{\min}$
- Fundamental Theorem of Linear Algebra: $\mathcal{R}(A)^\perp = \mathcal{N}(A^\top)$ and $\mathcal{N}(A)^\perp = \mathcal{R}(A^\top)$
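A short NumPy sketch (illustrative, on random data) computing these norms and properties and checking the sub-multiplicative property:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))

one_norm = np.linalg.norm(A, 1)        # max absolute column sum
two_norm = np.linalg.norm(A, 2)        # largest singular value (spectral norm)
inf_norm = np.linalg.norm(A, np.inf)   # max absolute row sum
fro = np.linalg.norm(A, "fro")
nuclear = np.linalg.norm(A, "nuc")     # sum of singular values

rank = np.linalg.matrix_rank(A)
trace = np.trace(A)
det = np.linalg.det(A)
spectral_radius = max(abs(np.linalg.eigvals(A)))
cond = np.linalg.cond(A)               # sigma_max / sigma_min for the 2-norm

# Sub-multiplicative property: ||AB|| <= ||A|| ||B||
assert np.linalg.norm(A @ B, 2) <= two_norm * np.linalg.norm(B, 2) + 1e-9
print(rank, trace, det, spectral_radius, cond)
```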
Matrix calculus
- Gradient for a function $f : \mathbb{R}^n \to \mathbb{R}^m$:
  - Scalar ($m = 1$): $\nabla f(x) = \left[ \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n} \right]^\top$ (dim $n \times 1$)
  - Hessian ($m = 1$): $[\nabla^2 f(x)]_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$ (dim $n \times n$)
  - Jacobian ($m > 1$): $[J(x)]_{ij} = \frac{\partial f_i}{\partial x_j}$ (dim $m \times n$)
- Quotient rule: for $f(x) = \frac{g(x)}{h(x)}$, $\nabla f = \frac{h \nabla g - g \nabla h}{h^2}$
- Product rule: for $f(x) = h(x) \, g(x)$ ($g$: vector, $h$: scalar), $Df = h \, Dg + g \, (\nabla h)^\top$
- Taylor's theorem: $f(x + \Delta x) \approx f(x) + \nabla f(x)^\top \Delta x + \tfrac{1}{2} \Delta x^\top \nabla^2 f(x) \, \Delta x$
- Common matrix derivatives (one is verified numerically in the sketch after this list):
  - For $f(x) = a^\top x$: $\nabla f = a$
  - For $f(x) = x^\top A x$: $\nabla f = (A + A^\top) x$
  - For $f(x) = x^\top A x$ where $A$ is symmetric: $\nabla f = 2 A x$, $\nabla^2 f = 2A$
  - For $f(x) = \|Ax - b\|_2^2$ where $A \in \mathbb{R}^{m \times n}$: $\nabla f = 2 A^\top (Ax - b)$
  - For $f(X) = \operatorname{tr}(A X)$: $\nabla_X f = A^\top$
  - For $f(X) = \log \det X$: $\nabla_X f = X^{-\top}$
  - For $f(X) = a^\top X b$ where $a \in \mathbb{R}^m$ and $b \in \mathbb{R}^n$: $\nabla_X f = a b^\top$
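As a sanity check, here is a small finite-difference test (my own sketch, using the symmetric quadratic form above) for the rule $\nabla (x^\top A x) = 2Ax$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
A = (A + A.T) / 2          # make A symmetric
x = rng.standard_normal(n)

f = lambda z: z @ A @ z    # f(x) = x^T A x
grad_analytic = 2 * A @ x  # gradient formula for symmetric A

# Central finite differences as a numerical check
eps = 1e-6
grad_numeric = np.array([
    (f(x + eps * e) - f(x - eps * e)) / (2 * eps)
    for e in np.eye(n)
])
assert np.allclose(grad_analytic, grad_numeric, atol=1e-4)
```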
Matrix decompositions
- Eigendecomposition: $A = U \Lambda U^{-1}$ (for symmetric $A$: $A = U \Lambda U^\top$)
  - Eigenvalues are the elements of the diagonal matrix $\Lambda$, satisfying $A u_i = \lambda_i u_i$
    - Characteristic equation: $\det(A - \lambda I) = 0$
    - Rayleigh quotient: $R(x) = \frac{x^\top A x}{x^\top x}$ (used for the variational definition of eigenvalues)
  - Eigenvectors are the columns of $U$ (which is orthonormal in the symmetric case, i.e. $U^\top U = I$)
  - Inverse: $A^{-1} = U \Lambda^{-1} U^{-1}$
  - Spectral theorem: any symmetric matrix is diagonalizable, with real eigenvalues and orthonormal eigenvectors
- Singular value decomposition (SVD): $A = U \Sigma V^\top$ (computed numerically in the sketch after this list)
  - The columns of $U$ and $V$ are eigenvectors of $A A^\top$ and $A^\top A$ respectively, with eigenvalues $\sigma_i^2$; $U$ and $V$ are orthogonal (hence invertible)
- Principal component analysis equivalent formulations:
  - Variance-maximizing directions: $\max_{\|w\|_2 = 1} w^\top X^\top X w$, solved by the top eigenvector of $X^\top X$
  - Least-squares min directions: $\min_{\|w\|_2 = 1} \sum_i \|x_i - w w^\top x_i\|_2^2$
  - Rank-one approximation: $\min_{\operatorname{rank}(B) = 1} \|X - B\|_F$, solved by $\sigma_1 u_1 v_1^\top$ (Eckart-Young)
- Moore-Penrose pseudoinverse: $A^\dagger$
  - $A^\dagger = (A^\top A)^{-1} A^\top$ if $A$ has full column rank (a left inverse: $A^\dagger A = I$)
  - $A^\dagger = A^\top (A A^\top)^{-1}$ if $A$ has full row rank (a right inverse: $A A^\dagger = I$)
  - If $A = U \Sigma V^\top$ is the SVD, then $A^\dagger = V \Sigma^\dagger U^\top$, where $\Sigma^\dagger$ inverts the nonzero singular values
- LU decomposition: $A = LU$, $L$ lower triangular and $U$ upper triangular
  - LDL decomposition: if $A$ is symmetric, $A = L D L^\top$ with $D$ diagonal
  - Cholesky decomposition: iff $A$ is symmetric PD, $A = L L^\top$ with $L$ lower triangular
- QR decomposition: $A = QR$, $Q$ orthonormal and $R$ upper triangular
  - Gram-Schmidt: orthonormalize the columns of $A$ to get $Q$, use inner products to fill in $R$
  - Modified G-S: more numerically stable variant; for rectangular $A$, diagonal elements of $R$ can be replaced with zeros
  - More numerically stable than LU but more expensive
- Schur complement: the block matrix $M = \begin{bmatrix} A & B \\ C & D \end{bmatrix}$ has Schur complement $M / A = D - C A^{-1} B$ (if $A$ is invertible)
- Block-triangular decomposition: decomposes a square matrix $A$ into block-triangular form
  - If $A = \begin{bmatrix} A_{11} & A_{12} \\ 0 & A_{22} \end{bmatrix}$ with each $A_{ii}$ square, then $A$ is block-triangular
  - Diagonal blocks $A_{ii}$ can be further decomposed (e.g., LU, QR, SVD)
  - Permutation matrices can symmetrically reorder rows and columns to make $A$ block-triangular
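A NumPy sketch (my own illustration on random data) exercising several of these decompositions and checking their defining identities:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

# SVD: A = U diag(s) V^T; columns of U/V are eigenvectors of AA^T / A^T A
U, s, Vt = np.linalg.svd(A, full_matrices=False)
assert np.allclose(A, U @ np.diag(s) @ Vt)

# Moore-Penrose pseudoinverse via the SVD (equals (A^T A)^{-1} A^T here,
# since this random A has full column rank)
A_pinv = Vt.T @ np.diag(1 / s) @ U.T
assert np.allclose(A_pinv, np.linalg.pinv(A))
assert np.allclose(A_pinv @ A, np.eye(3))        # acts as a left inverse

# Eigendecomposition and Cholesky of the symmetric PD matrix A^T A
S = A.T @ A
lam, Q = np.linalg.eigh(S)                       # S = Q diag(lam) Q^T
assert np.allclose(S, Q @ np.diag(lam) @ Q.T)
L = np.linalg.cholesky(S)                        # S = L L^T, L lower triangular
assert np.allclose(S, L @ L.T)

# QR decomposition: A = Q R with orthonormal Q, upper-triangular R
Q2, R = np.linalg.qr(A)
assert np.allclose(A, Q2 @ R)
```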
Convexity
- Convex set $C$: for $x, y \in C$, the line segment $\theta x + (1 - \theta) y \in C$ for all $\theta \in [0, 1]$
  - Hyperplanes, halfspaces, balls, norm balls, and polytopes are all convex
  - Convex hull: all combinations $\sum_i \theta_i x_i$ with coefficients $\theta_i \geq 0$ that sum to $1$
  - Conic hull: all combinations with coefficients $\theta_i \geq 0$ (removes the unit-sum constraint)
- Convex function $f$: for $x, y$ in the domain, $f(\theta x + (1 - \theta) y) \leq \theta f(x) + (1 - \theta) f(y)$ for $\theta \in [0, 1]$ (checked numerically in the sketch after this list)
  - Affine functions, norms, and quadratic functions $x^\top P x$ with $P \succeq 0$ are all convex
  - The epigraph (the region on or above the graph) of a convex function is a convex set
  - First-order condition: $f$ is convex iff $f(y) \geq f(x) + \nabla f(x)^\top (y - x)$ for all $x, y$
  - Second-order condition: $f$ is convex iff $\nabla^2 f(x) \succeq 0$ for all $x$
- Composition rules:
  - Jensen's inequality: for convex $f$, $f(\mathbb{E}[x]) \leq \mathbb{E}[f(x)]$
  - Affine precomposition: $f(Ax + b)$ is convex for convex $f$ and affine $Ax + b$
  - Nonnegative weighted sum: $\sum_i w_i f_i$ is convex for convex $f_i$ and $w_i \geq 0$
  - Pointwise maximum: $\max_i f_i(x)$ is convex for convex $f_i$
  - Perspective: $g(x, t) = t f(x / t)$ is convex for convex $f$ and $t > 0$
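A small NumPy sketch (illustrative, assuming the PSD quadratic $f(x) = x^\top P x$) checking the definition of convexity along a segment and the first- and second-order conditions:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
M = rng.standard_normal((n, n))
P = M.T @ M                          # P is PSD, so f(x) = x^T P x is convex

f = lambda z: z @ P @ z
grad = lambda z: 2 * P @ z

x, y = rng.standard_normal(n), rng.standard_normal(n)

# Definition of convexity along the segment between x and y
for theta in np.linspace(0, 1, 11):
    lhs = f(theta * x + (1 - theta) * y)
    rhs = theta * f(x) + (1 - theta) * f(y)
    assert lhs <= rhs + 1e-9

# First-order condition: f(y) >= f(x) + grad(x)^T (y - x)
assert f(y) >= f(x) + grad(x) @ (y - x) - 1e-9

# Second-order condition: the Hessian 2P is PSD (all eigenvalues >= 0)
assert np.all(np.linalg.eigvalsh(2 * P) >= -1e-9)
```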
Duality
- Constrained optimization problem: $\min_x f_0(x)$ s.t. $f_i(x) \leq 0$ and $h_j(x) = 0$
  - We assume the primal is convex, with the constraints $f_i$ convex and $h_j$ affine
- Dual problem: $\max_{\lambda \geq 0, \nu} g(\lambda, \nu)$ where $g(\lambda, \nu) = \inf_x L(x, \lambda, \nu)$ (a worked toy example follows this list)
  - Lagrangian: $L(x, \lambda, \nu) = f_0(x) + \sum_i \lambda_i f_i(x) + \sum_j \nu_j h_j(x)$
  - The dual function is always concave, even if the primal is not convex
  - Weak duality: dual optimum $d^\star \leq$ primal optimum $p^\star$
  - Strong duality: dual optimum $d^\star =$ primal optimum $p^\star$ if Slater's condition holds (primal strictly feasible)
- Slater's condition: if the feasible region has an interior point, strong duality holds
  - Strictly feasible: $x$ is in the relative interior of the domain of $f_0$, with $f_i(x) < 0$ for all $i$ (note the strict inequality) and $h_j(x) = 0$ for all $j$
- KKT conditions:
  - Primal feasibility: $f_i(x) \leq 0$, $h_j(x) = 0$
  - Dual feasibility: $\lambda_i \geq 0$
  - Complementary slackness: $\lambda_i f_i(x) = 0$
  - Stationarity: $\nabla f_0(x) + \sum_i \lambda_i \nabla f_i(x) + \sum_j \nu_j \nabla h_j(x) = 0$
  - If strong duality holds, then the KKT conditions are both necessary and sufficient for optimality
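A tiny worked example (my own, not from the notes): the convex problem $\min_x x^2$ s.t. $x \geq 1$, whose dual function works out to $g(\lambda) = \lambda - \lambda^2/4$. The sketch checks weak/strong duality and the KKT conditions at the optimum:

```python
import numpy as np

# Toy convex problem: minimize x^2 subject to x >= 1, written as
# f0(x) = x^2 with constraint f1(x) = 1 - x <= 0.
f0 = lambda x: x**2
f1 = lambda x: 1 - x

# Lagrangian L(x, lam) = x^2 + lam*(1 - x); minimizing over x gives x = lam/2,
# so the dual function is g(lam) = lam - lam^2 / 4 (concave in lam).
g = lambda lam: lam - lam**2 / 4

lams = np.linspace(0, 5, 501)
dual_opt = max(g(l) for l in lams)     # attained at lam* = 2
primal_opt = f0(1.0)                   # x* = 1 is optimal for the primal

# Weak duality holds for every lam >= 0; strong duality holds here
# (Slater: x = 2 is strictly feasible), so the two optima coincide.
assert all(g(l) <= primal_opt + 1e-9 for l in lams)
assert abs(dual_opt - primal_opt) < 1e-3

# KKT at (x*, lam*) = (1, 2): primal/dual feasibility, complementary
# slackness lam* f1(x*) = 0, and stationarity 2x* - lam* = 0.
x_star, lam_star = 1.0, 2.0
assert f1(x_star) <= 0 and lam_star >= 0
assert abs(lam_star * f1(x_star)) < 1e-12
assert abs(2 * x_star - lam_star) < 1e-12
```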
Linear programming
- Standard form: $\min_x c^\top x$ subject to $Ax = b$, $x \geq 0$
  - Dual: $\max_y b^\top y$ subject to $A^\top y \leq c$ (verified numerically in the sketch after this list)
- Simplex method: moves along vertices of the feasible polyhedron until the optimum is reached
  - Reduced costs: $\bar{c}_N = c_N - N^\top y$ where $y$ solves $B^\top y = c_B$ ($B$: basic columns, $N$: nonbasic columns)
  - Optimality condition: if all reduced costs are $\geq 0$, the current vertex is optimal
  - Bland's rule: choose the entering variable with the smallest index among those with negative reduced cost (prevents cycling)
- Interior point methods:
  - Central path: the set of points $(x(\tau), y(\tau), s(\tau))$ solving the perturbed KKT conditions $Ax = b$, $A^\top y + s = c$, $x, s \geq 0$, and $x_i s_i = \tau$ for all $i$ (the centering condition)
  - Path-following methods: approximately follow the central path as $\tau \to 0$
    - Short-step: take a Newton step, reduce $\tau$ by a fixed factor
    - Long-step: take a damped Newton step, reduce $\tau$ adaptively
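A SciPy sketch (illustrative; the data is made up) that solves a small standard-form LP and its dual with `scipy.optimize.linprog` and confirms that the optimal values coincide:

```python
import numpy as np
from scipy.optimize import linprog

# Standard-form LP (made-up data): min c^T x  s.t.  Ax = b, x >= 0
c = np.array([1.0, 2.0, 0.0])
A = np.array([[1.0, 1.0, 1.0]])
b = np.array([1.0])

primal = linprog(c, A_eq=A, b_eq=b, bounds=[(0, None)] * 3, method="highs")

# Dual: max b^T y  s.t.  A^T y <= c  (linprog minimizes, so negate the objective)
dual = linprog(-b, A_ub=A.T, b_ub=c, bounds=[(None, None)], method="highs")

# Strong duality: optimal values coincide for feasible, bounded LPs
assert primal.success and dual.success
assert abs(primal.fun - (-dual.fun)) < 1e-8
```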
Applications
- Least squares: $\min_x \|Ax - b\|_2^2$, solved by the normal equations $A^\top A x = A^\top b$
  - Min-norm solution: $x = A^\top (A A^\top)^{-1} b$ (when $A$ has full row rank)
  - Projection matrix: $P = A A^\dagger$ using the Moore-Penrose pseudoinverse
  - LASSO: $\min_x \|Ax - b\|_2^2 + \lambda \|x\|_1$
  - Ridge / Tikhonov: $\min_x \|Ax - b\|_2^2 + \lambda \|x\|_2^2$
  - Weighted: $\min_x (Ax - b)^\top W (Ax - b)$ for a PSD weight matrix $W$
- Unconstrained optimization: $\min_x f(x)$ for differentiable $f$ (see the sketch after this list)
  - Gradient descent: $x_{k+1} = x_k - \alpha_k \nabla f(x_k)$
    - Step size $\alpha_k$ typically chosen by line search or fixed
    - Converges linearly if $f$ is $L$-smooth and $\mu$-strongly convex with step size $\alpha \leq 1/L$
  - Newton's method: $x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k)$
    - Converges quadratically if $\nabla^2 f$ is Lipschitz and the initial point is close enough to the optimum
- Constrained optimization: $\min_x f(x)$ subject to $x \in C$
  - Projected gradient descent: $x_{k+1} = \Pi_C(x_k - \alpha_k \nabla f(x_k))$
    - Projection: $\Pi_C(y) = \arg\min_{x \in C} \|x - y\|_2$
  - Frank-Wolfe algorithm: $x_{k+1} = (1 - \gamma_k) x_k + \gamma_k s_k$ where $s_k = \arg\min_{s \in C} \nabla f(x_k)^\top s$
    - Doesn't require a projection step, good for structured constraint sets
    - Converges at rate $O(1/k)$ if $f$ is convex with Lipschitz-continuous gradient
- Support vector machines
  - Hard-margin SVM: $\min_{w, b} \frac{1}{2} \|w\|_2^2$ subject to $y_i (w^\top x_i + b) \geq 1$ for all $i$
    - Maximizes the margin $\frac{2}{\|w\|_2}$ between the hyperplanes $w^\top x + b = 1$ and $w^\top x + b = -1$
  - Soft-margin SVM: $\min_{w, b, \xi} \frac{1}{2} \|w\|_2^2 + C \sum_i \xi_i$ subject to $y_i (w^\top x_i + b) \geq 1 - \xi_i$, $\xi_i \geq 0$
    - Allows misclassifications with slack variables $\xi_i$, trade-off controlled by $C$
  - Kernel trick: replace dot products $x_i^\top x_j$ with a kernel $k(x_i, x_j)$
    - Allows learning nonlinear decision boundaries in a high-dimensional feature space
    - Common kernels:
      - Polynomial: $k(x, y) = (x^\top y + c)^d$
      - RBF: $k(x, y) = \exp(-\gamma \|x - y\|_2^2)$
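A NumPy sketch (my own illustration on random data; `project_ball` is a hypothetical helper for the Euclidean-ball constraint set) applying plain and projected gradient descent to the least-squares objective above and comparing against the closed-form solution:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

# Closed-form least-squares solution via the pseudoinverse
x_star = np.linalg.pinv(A) @ b

# Gradient descent on f(x) = ||Ax - b||_2^2 with gradient 2 A^T (Ax - b)
L = 2 * np.linalg.norm(A, 2) ** 2        # smoothness constant of f
alpha = 1.0 / L                          # fixed step size
x = np.zeros(5)
for _ in range(2000):
    x = x - alpha * 2 * A.T @ (A @ x - b)
assert np.allclose(x, x_star, atol=1e-6)

# Projected gradient descent onto the ball C = {x : ||x||_2 <= 1}
def project_ball(y, radius=1.0):
    norm = np.linalg.norm(y)
    return y if norm <= radius else radius * y / norm

x = np.zeros(5)
for _ in range(2000):
    x = project_ball(x - alpha * 2 * A.T @ (A @ x - b))
assert np.linalg.norm(x) <= 1 + 1e-9     # iterate stays feasible
```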