Linear Algebra and Convex Optimization

Vector spaces

  1. Vector spaces: a set V with addition and scalar multiplication
    • Basis: a minimal set of vectors B where every element in V is a linear combination of B
    • Dimension: the number of vectors in B
    • Inner product: $\langle \cdot, \cdot \rangle : V \times V \to F$
      • Conjugate symmetry: $\langle u, v \rangle = \overline{\langle v, u \rangle}$
      • Linearity: $\langle ax + by, z \rangle = a\langle x, z \rangle + b\langle y, z \rangle$
      • Positive definiteness: $\langle x, x \rangle > 0$ for all $x \neq 0$
    • Norm: function from a vector space to non-negative reals
      • Positive definiteness: $\|x\| = 0 \iff x = 0$
      • Triangle inequality: $\|x + y\| \leq \|x\| + \|y\|$
      • Absolute homogeneity: $\|\alpha x\| = |\alpha| \, \|x\|$
  2. Common vector norms:
    • L-p: $\|x\|_p = \left( \sum_i |x_i|^p \right)^{1/p}$
    • L-$\infty$: $\|x\|_\infty = \max_i |x_i|$
  3. Useful formulas (checked numerically in the sketch after this list):
    • Cauchy-Schwarz: $|\langle u, v \rangle|^2 \leq \langle u, u \rangle \, \langle v, v \rangle$
    • Vector angle: $\cos\theta = \frac{x^T y}{\|x\|_2 \|y\|_2}$
    • Cardinality (# nonzero elements): $\mathrm{card}(x) \geq \frac{\|x\|_1^2}{\|x\|_2^2}$
    • $\frac{1}{\sqrt{n}}\|x\|_2 \leq \|x\|_\infty \leq \|x\|_2 \leq \|x\|_1 \leq \sqrt{n}\|x\|_2 \leq n\|x\|_\infty$
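
A minimal NumPy sketch checking the formulas above on a random vector; the seed, the dimension $n = 5$, and the tolerance `1e-12` are illustrative choices, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)   # illustrative seed
n = 5
x, y = rng.standard_normal(n), rng.standard_normal(n)

# p-norms and the infinity norm
l1, l2, linf = np.linalg.norm(x, 1), np.linalg.norm(x, 2), np.linalg.norm(x, np.inf)

# Cauchy-Schwarz: |<x, y>|^2 <= <x, x> <y, y>
assert np.dot(x, y) ** 2 <= np.dot(x, x) * np.dot(y, y) + 1e-12

# Cardinality bound: card(x) >= ||x||_1^2 / ||x||_2^2
assert np.count_nonzero(x) >= l1 ** 2 / l2 ** 2 - 1e-12

# Norm equivalence chain:
# (1/sqrt(n))||x||_2 <= ||x||_inf <= ||x||_2 <= ||x||_1 <= sqrt(n)||x||_2 <= n||x||_inf
chain = [l2 / np.sqrt(n), linf, l2, l1, np.sqrt(n) * l2, n * linf]
assert all(a <= b + 1e-12 for a, b in zip(chain, chain[1:]))
print("all vector-norm identities hold for this sample")
```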

Matrix algebra

  1. Common matrix norms (see the NumPy check after this list):
    • p-norm: $\|A\|_p = \max_{x \neq 0} \frac{\|Ax\|_p}{\|x\|_p}$
      • $p = 1$: max absolute column sum
      • $p = 2$ (spectral): $\sigma_{\max}(A)$
      • $p = \infty$: max absolute row sum
      • Sub-multiplicative property for p-norm: $\|AB\|_p \leq \|A\|_p \|B\|_p$
    • Nuclear: $\|A\|_* = \sum_{i=1}^{r} \sigma_i$
    • Frobenius: $\|A\|_F = \sqrt{\mathrm{tr}(A^T A)} = \sqrt{\sum_i \sigma_i^2}$
  2. Norm relationships for an $m \times n$ matrix $A$:
    • $\frac{1}{\sqrt{n}}\|A\|_\infty \leq \|A\|_2 \leq \sqrt{m}\|A\|_\infty$
    • $\frac{1}{\sqrt{m}}\|A\|_1 \leq \|A\|_2 \leq \sqrt{n}\|A\|_1$
    • $\|A\|_2 \leq \|A\|_F \leq \sqrt{r}\|A\|_2$ (where $r = \mathrm{rank}(A)$)
  3. Common matrix properties:
    • Range: the span of the columns of $A$: $\{v \mid v = \sum_i c_i a_i,\ c_i \in \mathbb{R}\}$
    • Rank: number of linearly independent columns of A
    • Symmetry: a matrix $A$ is symmetric if $A = A^T$
      • Hermitian: complex generalization of symmetric
    • Positive-semidefinite: if $x^T A x \geq 0$ for all $x$ (PD if strict)
    • Gain: $\max_{x \neq 0} \frac{\|Ax\|}{\|x\|}$
    • Trace (sum of diagonal elements): $\mathrm{tr}(A) = \sum_i \lambda_i$
    • Determinant: $\det(A) = \prod_i \lambda_i$
    • Spectral radius: $\rho(A) = \max_i |\lambda_i(A)|$ (equals $\sigma_1$ when $A$ is normal, e.g. symmetric)
    • Condition number: $\kappa(A) = \frac{\sigma_1}{\sigma_n} = \|A\|_2 \|A^{-1}\|_2$
  4. Fundamental Theorem of Linear Algebra:
    • $\mathrm{Null}(A) \oplus \mathrm{Range}(A^T) = \mathbb{R}^n$
    • $\mathrm{Null}(A^T) \oplus \mathrm{Range}(A) = \mathbb{R}^m$
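
A quick NumPy check of the matrix-norm formulas and relationships above; the random test matrices, seed, and tolerances are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)   # illustrative seed
m, n = 4, 3
A = rng.standard_normal((m, n))
sigma = np.linalg.svd(A, compute_uv=False)

one, two, inf_, fro = (np.linalg.norm(A, p) for p in (1, 2, np.inf, "fro"))
nuc = np.linalg.norm(A, "nuc")

# induced-norm formulas
assert np.isclose(one, np.abs(A).sum(axis=0).max())   # max absolute column sum
assert np.isclose(inf_, np.abs(A).sum(axis=1).max())  # max absolute row sum
assert np.isclose(two, sigma.max())                   # spectral norm = sigma_max
assert np.isclose(nuc, sigma.sum())                   # nuclear norm = sum of sigma_i
assert np.isclose(fro, np.sqrt((sigma ** 2).sum()))   # Frobenius norm

# norm relationships for an m x n matrix of rank r
r = np.linalg.matrix_rank(A)
assert inf_ / np.sqrt(n) <= two <= np.sqrt(m) * inf_ + 1e-12
assert one / np.sqrt(m) <= two <= np.sqrt(n) * one + 1e-12
assert two <= fro <= np.sqrt(r) * two + 1e-12

# condition number of a square matrix: sigma_1 / sigma_n = ||B||_2 ||B^-1||_2
B = rng.standard_normal((n, n))
assert np.isclose(np.linalg.cond(B),
                  np.linalg.norm(B, 2) * np.linalg.norm(np.linalg.inv(B), 2))
print("matrix-norm identities hold for this sample")
```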

Matrix calculus

  1. Gradient for function g:
    • Scalar ($g: \mathbb{R}^n \to \mathbb{R}$): $(\nabla g(x))_i = \frac{\partial g}{\partial x_i}(x)$ (dim $n \times 1$)
    • Hessian ($g: \mathbb{R}^n \to \mathbb{R}$): $(\nabla^2 g(x))_{ij} = \frac{\partial^2 g}{\partial x_i \partial x_j}(x)$ (dim $n \times n$)
    • Jacobian ($g: \mathbb{R}^n \to \mathbb{R}^m$): $(Dg(x))_{ij} = \frac{\partial g_i}{\partial x_j}$ (dim $m \times n$)
  2. Quotient rule: for $h(x) = \frac{n(x)}{d(x)}$, $h'(x) = \frac{d(x)\, n'(x) - n(x)\, d'(x)}{(d(x))^2}$
  3. Product rule: for $g(x) = v(x) s(x)$ ($v$: vector, $s$: scalar), $Dg(x) = Dv(x)\, s(x) + v(x)\, Ds(x)$
  4. Taylor’s theorem: $f(x + \Delta x) = f(x) + \langle \nabla f(x), \Delta x \rangle + \frac{1}{2} (\Delta x)^T \nabla^2 f(x)\, \Delta x + \cdots$
  5. Common matrix derivatives (a finite-difference check follows this list):
    • For $g(a, b) = a^T X b$:
      • $\nabla_X g(a, b) = a b^T$
      • $\nabla_a g(a, b) = X b$
      • $\nabla_b g(a, b) = X^T a$
    • For $g(X) = \mathrm{tr}(X^T A)$:
      • $\nabla g(X) = A$
    • For $g(X) = \mathrm{tr}(X^T A X)$ where $A$ is symmetric:
      • $\nabla g(X) = 2AX$
      • $\nabla^2 g(X) = 2A$
    • For $g(\Sigma) = \mathrm{tr}(\Sigma^{-1} X)$ where $\Sigma \succ 0$:
      • $\nabla_\Sigma g = -\Sigma^{-1} X \Sigma^{-1}$
    • For $g(X) = \mathrm{tr}(X \log X)$ where $X \succ 0$:
      • $\nabla g(X) = \log X + I$
    • For $g(X) = \|AX - B\|_F^2$:
      • $\nabla g(X) = 2A^T(AX - B)$
      • $\nabla^2 g(X) = 2A^T A$
    • For $g(X) = \log\det X$ where $X \succ 0$:
      • $\nabla g(X) = (X^{-1})^T = X^{-1}$
      • $\nabla^2 g(X)[dX] = -X^{-1}\, dX\, X^{-1}$
    • For $g(x) = f(Ax)$ where $f: \mathbb{R}^m \to \mathbb{R}$ and $A \in \mathbb{R}^{m \times n}$:
      • $\nabla g(x) = A^T \nabla f(Ax)$
      • $\nabla^2 g(x) = A^T \nabla^2 f(Ax) A$
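
A finite-difference sanity check (a sketch, assuming NumPy) for two of the derivative formulas above: the gradient of $\mathrm{tr}(X^T A X)$ for symmetric $A$, and the chain rule for $g(x) = f(Ax)$ with the illustrative choice $f(z) = \|z\|_2^2$. The step size `eps` and tolerances are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)   # illustrative seed
n, k, m = 4, 3, 5
A = rng.standard_normal((n, n)); A = A + A.T        # symmetric A
X = rng.standard_normal((n, k))
eps = 1e-6

# g(X) = tr(X^T A X)  =>  grad g(X) = 2 A X (checked by central differences)
g = lambda X: np.trace(X.T @ A @ X)
num = np.zeros_like(X)
for i in range(n):
    for j in range(k):
        E = np.zeros_like(X); E[i, j] = eps
        num[i, j] = (g(X + E) - g(X - E)) / (2 * eps)
assert np.allclose(num, 2 * A @ X, atol=1e-4)

# g(x) = f(Bx) with f(z) = ||z||_2^2  =>  grad g(x) = B^T grad f(Bx) = 2 B^T B x
B = rng.standard_normal((m, n)); x = rng.standard_normal(n)
gx = lambda x: np.sum((B @ x) ** 2)
num_x = np.array([(gx(x + eps * e) - gx(x - eps * e)) / (2 * eps) for e in np.eye(n)])
assert np.allclose(num_x, 2 * B.T @ B @ x, atol=1e-4)
print("finite differences match the closed-form gradients")
```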

Matrix decompositions

  1. Eigendecomposition (for symmetric $A$): $A = U \Lambda U^T$
    • Eigenvalues are the elements of the diagonal matrix $\Lambda$, satisfying $Au = \lambda u$
      • Characteristic equation: $\det(\lambda I - A) = 0$
      • Rayleigh quotient: $R(A, x) = \frac{x^T A x}{x^T x}$ (used for variational definition)
    • Eigenvectors are the columns of $U$ (which is orthonormal, i.e. $U U^T = I$)
    • Inverse: $A^{-1} = U \Lambda^{-1} U^T$
    • Spectral theorem: any symmetric matrix is orthogonally diagonalizable with real eigenvalues
  2. Singular value decomposition (SVD): $A = U \Sigma V^T$
    • $v_i$ are eigenvectors of $A^T A$ with eigenvalues $\sigma_i^2$, and $u_i = \frac{1}{\sigma_i} A v_i$
      • $\mathrm{rank}(A) = n \leq m \iff A^T A$ is invertible
      • $\mathrm{rank}(A) = m \leq n \iff A A^T$ is invertible
    • Principal component analysis equivalent formulations:
      • Variance-maximizing directions: $\arg\max_{\|u\|_2 = 1} u^T C u$ (for covariance $C$)
      • Least-squares min directions: $\arg\min_{\|u\|_2 = 1} \sum_i \min_{v_i} \|x_i - v_i u\|_2^2$
      • Rank-one approximation: $\arg\min_{\mathrm{rank}(Y) = 1} \|X - Y\|_F$
  3. Moore-Penrose pseudoinverse: $A^\dagger = V \Sigma^\dagger U^T$
    • $A^\dagger = (A^T A)^{-1} A^T$ if $A$ has full column rank (left inverse: $A^\dagger A = I$)
    • $A^\dagger = A^T (A A^T)^{-1}$ if $A$ has full row rank (right inverse: $A A^\dagger = I$)
    • If $A = U \Sigma V^T$ is the SVD, then $A^\dagger = V \Sigma^\dagger U^T$, where $\Sigma^\dagger$ inverts the nonzero singular values
  4. LU decomposition: $A = LU$, $L$ lower triangular and $U$ upper triangular
    • LDL decomposition: if $A$ symmetric, $A = L D L^T$
    • Cholesky decomposition: iff $A$ is symmetric PD, $A = L L^T$ with $L$ lower triangular
  5. QR decomposition: $A = QR$, $Q$ orthonormal and $R$ upper triangular (see the sketch after this list)
    • Gram-Schmidt: orthonormalize columns of A to get Q, use inner products to fill in R
      • $q_1 = \frac{a_1}{\|a_1\|}$, $\tilde{q}_i = a_i - \sum_{j < i} (a_i^T q_j) q_j$, $q_i = \frac{\tilde{q}_i}{\|\tilde{q}_i\|}$, giving $R = \begin{bmatrix} \|a_1\| & a_2^T q_1 & a_3^T q_1 \\ 0 & \|\tilde{q}_2\| & a_3^T q_2 \\ 0 & 0 & \|\tilde{q}_3\| \end{bmatrix}$ (shown for $n = 3$)
      • Modified G-S: numerically more stable variant; zero diagonal entries of $R$ can flag dependent columns in rank-deficient or rectangular $A$
    • More numerically stable than LU but more expensive
  6. Schur complement: the block matrix $\begin{bmatrix} A & B \\ C & D \end{bmatrix}$ has Schur complement $D - C A^{-1} B$ (if $A$ invertible)
    • $\begin{bmatrix} A & B \\ C & D \end{bmatrix} = \begin{bmatrix} I & 0 \\ C A^{-1} & I \end{bmatrix} \begin{bmatrix} A & 0 \\ 0 & D - C A^{-1} B \end{bmatrix} \begin{bmatrix} I & A^{-1} B \\ 0 & I \end{bmatrix}$
  7. Block-triangular decomposition: decomposes a square matrix A into block-triangular form
    • If $A = \begin{bmatrix} A_{11} & A_{12} & \cdots & A_{1m} \\ 0 & A_{22} & \cdots & A_{2m} \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & A_{mm} \end{bmatrix}$ with each $A_{ii}$ square, then $A$ is block-triangular
    • Diagonal blocks Aii can be further decomposed (e.g., LU, QR, SVD)
    • Permutation matrices can symmetrically reorder rows and columns to make A block-triangular
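
A sketch tying a few of these decompositions together in NumPy: classical Gram-Schmidt QR as described above, the SVD-based pseudoinverse, and a Cholesky factorization. The helper `gram_schmidt_qr` and the test matrices are illustrative, not a library API.

```python
import numpy as np

def gram_schmidt_qr(A):
    """Classical Gram-Schmidt QR for a full-column-rank A (m >= n)."""
    m, n = A.shape
    Q = np.zeros((m, n)); R = np.zeros((n, n))
    for i in range(n):
        v = A[:, i].copy()
        for j in range(i):
            R[j, i] = Q[:, j] @ A[:, i]   # inner products fill the upper triangle of R
            v -= R[j, i] * Q[:, j]
        R[i, i] = np.linalg.norm(v)       # ||q_i tilde||
        Q[:, i] = v / R[i, i]
    return Q, R

rng = np.random.default_rng(3)   # illustrative seed
A = rng.standard_normal((6, 4))

Q, R = gram_schmidt_qr(A)
assert np.allclose(Q @ R, A) and np.allclose(Q.T @ Q, np.eye(4))

# SVD and pseudoinverse: A_dagger = V Sigma^dagger U^T
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_dag = Vt.T @ np.diag(1 / s) @ U.T
assert np.allclose(A_dag, np.linalg.pinv(A))
assert np.allclose(A_dag, np.linalg.inv(A.T @ A) @ A.T)   # left inverse (full column rank)

# Cholesky of a symmetric PD matrix
P = A.T @ A
L = np.linalg.cholesky(P)
assert np.allclose(L @ L.T, P)
print("QR, SVD/pseudoinverse, and Cholesky checks pass")
```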

Convexity

  1. Convex set $C$: for $x, y \in C$, the line segment $tx + (1 - t)y \in C$ for all $0 \leq t \leq 1$
    • Hyperplanes, halfspaces, balls, norm balls, and polytopes are all convex
    • Convex hull: all combinations with coefficients $\geq 0$ that sum to 1
    • Conic hull: all combinations with coefficients $\geq 0$ (removes the sum-to-one constraint)
  2. Convex function $f$: for $x, y$ in the domain, $f(tx + (1 - t)y) \leq t f(x) + (1 - t) f(y)$
    • Affine functions, norms, and quadratic functions with PSD Hessian are all convex
    • Epigraph (the region above the graph) of a convex function is a convex set
    • First-order condition: $f$ is convex iff $f(y) \geq f(x) + \nabla f(x)^T (y - x)$ for all $x, y$
    • Second-order condition: $f$ is convex iff $\nabla^2 f(x) \succeq 0$ for all $x$
  3. Composition rules (Jensen’s inequality and the first-order condition are checked numerically after this list):
    • Jensen’s inequality: for convex $f$ and weights $p_i \geq 0$ with $\sum_i p_i = 1$, $f\left(\sum_{i=1}^n p_i x_i\right) \leq \sum_{i=1}^n p_i f(x_i)$
    • Affine precomposition: $f(Ax + b)$ is convex for convex $f$ and affine $Ax + b$
    • Nonnegative weighted sum: $\alpha f + \beta g$ is convex for convex $f, g$ and $\alpha, \beta \geq 0$
    • Pointwise maximum: $\max\{f(x), g(x)\}$ is convex for convex $f, g$
    • Perspective: $t f(x/t)$ is convex (jointly in $(x, t)$) for convex $f$ and $t > 0$
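
A numerical spot-check of the convexity definition, the first-order condition, and Jensen’s inequality, using log-sum-exp (via `scipy.special.logsumexp`) as an example convex function; the sample counts, seed, and tolerances are arbitrary illustrative choices.

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(4)   # illustrative seed
f = lambda x: logsumexp(x)       # log-sum-exp is convex
n = 6

# definition: f(t x + (1 - t) y) <= t f(x) + (1 - t) f(y)
for _ in range(1000):
    x, y = rng.standard_normal(n), rng.standard_normal(n)
    t = rng.uniform()
    assert f(t * x + (1 - t) * y) <= t * f(x) + (1 - t) * f(y) + 1e-12

# first-order condition: f(y) >= f(x) + grad f(x)^T (y - x)
grad = lambda x: np.exp(x - logsumexp(x))   # softmax is the gradient of log-sum-exp
x, y = rng.standard_normal(n), rng.standard_normal(n)
assert f(y) >= f(x) + grad(x) @ (y - x) - 1e-12

# Jensen's inequality with random weights p >= 0 summing to 1
X = rng.standard_normal((8, n))
p = rng.dirichlet(np.ones(8))
assert f(p @ X) <= p @ np.array([f(row) for row in X]) + 1e-12
print("convexity checks pass for log-sum-exp on these samples")
```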

Duality

  1. Constrained optimization problem: $\min f(x)$ s.t. $g_i(x) \leq 0$ and $h_j(x) = 0$
    • We assume the primal $f$ is convex, with constraints $g$ convex, $h$ affine
  2. Dual problem: $\max_{\lambda \geq 0, \nu} \min_x L(x, \lambda, \nu)$
    • Lagrangian: $L(x, \lambda, \nu) = f(x) + \sum_i \lambda_i g_i(x) + \sum_j \nu_j h_j(x)$
    • The dual function $\min_x L(x, \lambda, \nu)$ is always concave in $(\lambda, \nu)$, even if the primal is not convex
    • Weak duality: dual optimum $\leq$ primal optimum (verified on a toy problem after this list)
    • Strong duality: dual optimum $=$ primal optimum if Slater’s condition holds (primal strictly feasible)
  3. Slater’s condition: if the problem has a strictly feasible point, strong duality holds
    • Strictly feasible: $x$ is in the relative interior of the problem domain $\mathcal{D}$, and:
      • $g_i(x) < 0$ for all $i$ (note strict inequality)
      • $h_j(x) = 0$ for all $j$
  4. KKT conditions:
    • Primal feasibility: $g_i(x) \leq 0$, $h_j(x) = 0$
    • Dual feasibility: $\lambda_i \geq 0$
    • Complementary slackness: $\lambda_i g_i(x) = 0$
    • Stationarity: $\nabla f(x) + \sum_i \lambda_i \nabla g_i(x) + \sum_j \nu_j \nabla h_j(x) = 0$
    • If strong duality holds, then KKT is both necessary and sufficient for optimality
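
A toy check of weak/strong duality and stationarity, using the equality-constrained quadratic program $\min \frac{1}{2}\|x\|^2$ s.t. $Ax = b$ (not from the notes, but its dual is available in closed form, which keeps the sketch short); the seed and tolerances are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)   # illustrative seed
m, n = 3, 6
A = rng.standard_normal((m, n)); b = rng.standard_normal(m)

# Primal: min (1/2)||x||^2  s.t.  Ax = b  ->  min-norm solution x* = A^T (A A^T)^{-1} b
x_star = A.T @ np.linalg.solve(A @ A.T, b)
p_star = 0.5 * x_star @ x_star

# Lagrangian L(x, nu) = (1/2)||x||^2 + nu^T (Ax - b); minimizing over x gives x = -A^T nu,
# so the dual function is g(nu) = -(1/2)||A^T nu||^2 - b^T nu  (concave in nu).
g = lambda nu: -0.5 * np.sum((A.T @ nu) ** 2) - b @ nu

# Weak duality: g(nu) <= p* for any nu
for _ in range(100):
    assert g(rng.standard_normal(m)) <= p_star + 1e-12

# Strong duality: maximizing g (gradient -A A^T nu - b = 0) attains p*
nu_star = np.linalg.solve(A @ A.T, -b)
assert np.isclose(g(nu_star), p_star)

# Stationarity (KKT): grad_x L = x* + A^T nu* = 0
assert np.allclose(x_star + A.T @ nu_star, 0)
print("weak and strong duality verified on this toy QP")
```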

Linear programming

  1. Standard form: $\min_x c^T x$ subject to $Ax = b$, $x \geq 0$
    • Dual: $\max_{y, s} b^T y$ subject to $A^T y + s = c$, $s \geq 0$ (strong duality is checked numerically after this list)
  2. Simplex method: moves along vertices of feasible polyhedron until optimum reached
    • Reduced costs: $\bar{c}_j = c_j - y^T A_j$ where $y$ solves $y^T A_B = c_B^T$
    • Optimality condition: if all reduced costs $\geq 0$, the current vertex is optimal
    • Bland’s rule: choose entering variable with smallest index among negative reduced costs
  3. Interior point methods:
    • Central path: set of points $(x(\tau), y(\tau), s(\tau))$ solving the perturbed KKT conditions:
      • $Ax = b$, $x \geq 0$
      • $A^T y + s = c$, $s \geq 0$
      • $x_i s_i = \tau$ for all $i$ (centering condition)
    • Path-following methods: approximately follow the central path as $\tau \to 0$
      • Short-step: take Newton step, reduce τ by fixed factor
      • Long-step: take damped Newton step, reduce τ adaptively
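
A small standard-form LP solved together with its dual via `scipy.optimize.linprog` (HiGHS backend); the particular $c$, $A$, $b$ are made up for illustration, and equal optimal values confirm strong duality for this feasible, bounded instance.

```python
import numpy as np
from scipy.optimize import linprog

# Standard-form primal: min c^T x  s.t.  Ax = b, x >= 0   (illustrative data)
c = np.array([2.0, 3.0, 1.0, 0.0])
A = np.array([[1.0, 1.0, 1.0, 0.0],
              [1.0, 2.0, 0.0, 1.0]])
b = np.array([4.0, 6.0])

primal = linprog(c, A_eq=A, b_eq=b, bounds=[(0, None)] * 4, method="highs")

# Dual: max b^T y  s.t.  A^T y <= c  (y free); written as minimizing -b^T y
dual = linprog(-b, A_ub=A.T, b_ub=c, bounds=[(None, None)] * 2, method="highs")

p_star = primal.fun          # c^T x*
d_star = -dual.fun           # b^T y*
assert np.isclose(p_star, d_star)   # strong duality
print(f"primal optimum {p_star:.4f} equals dual optimum {d_star:.4f}")
```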

Applications

  1. Least squares: $\min_x \|Ax - y\|_2^2$ (see the sketch at the end of this section)
    • Solution for full-column-rank $A$ (normal equations): $x^\star = (A^T A)^{-1} A^T y$; in general, the min-norm solution is $x^\star = A^\dagger y$
    • Projection matrix onto $\mathrm{Range}(A)$: $A A^\dagger$, using the Moore-Penrose pseudoinverse
    • LASSO: $\min_x \|Ax - y\|_2^2 + \lambda \|x\|_1$
    • Ridge / Tikhonov: $\min_x \|Ax - y\|_2^2 + \lambda \|x\|_2^2$
      • $x^\star = (A^T A + \lambda I)^{-1} A^T y$
    • Weighted: $\min_x (Ax - y)^T W (Ax - y)$
      • $x^\star = (A^T W A)^{-1} A^T W y$
  2. Gradient descent:
    • Unconstrained optimization: $\min_x f(x)$ for differentiable $f$
      • Gradient descent: $x^{(k+1)} = x^{(k)} - t_k \nabla f(x^{(k)})$
        • Step size $t_k$ typically chosen by line search or fixed
        • Converges if $f$ is $L$-smooth and $\mu$-strongly convex with $t_k \leq \frac{2}{L + \mu}$
      • Newton’s method: $x^{(k+1)} = x^{(k)} - t_k (\nabla^2 f(x^{(k)}))^{-1} \nabla f(x^{(k)})$
        • Converges quadratically if $\nabla^2 f(x)$ is Lipschitz and the initial point is close enough to the optimum
    • Constrained optimization: $\min_x f(x)$ subject to $x \in C$
      • Projected gradient descent: $x^{(k+1)} = \mathrm{Proj}_C(x^{(k)} - t_k \nabla f(x^{(k)}))$
        • Projection $\mathrm{Proj}_C(y) = \arg\min_{x \in C} \|x - y\|_2^2$
      • Frank-Wolfe algorithm: $x^{(k+1)} = x^{(k)} + \gamma_k (s^{(k)} - x^{(k)})$ where $s^{(k)} = \arg\min_{s \in C} \langle s, \nabla f(x^{(k)}) \rangle$
        • Doesn’t require projection, good for structured constraint sets
        • Converges at rate $O(1/k)$ if $f$ is convex and $\nabla f$ is Lipschitz
  3. Support vector machines
    • Hard-margin SVM: $\min_{w, b} \frac{1}{2}\|w\|^2$ subject to $y_i(w^T x_i + b) \geq 1$ for all $i$
      • Maximizes the margin $\frac{2}{\|w\|}$ between the hyperplanes $w^T x + b = 1$ and $w^T x + b = -1$
    • Soft-margin SVM: $\min_{w, b, \xi} \frac{1}{2}\|w\|^2 + C \sum_i \xi_i$ subject to $y_i(w^T x_i + b) \geq 1 - \xi_i$, $\xi_i \geq 0$
      • Allows misclassifications with slack variables ξi, trade-off controlled by C>0
    • Kernel trick: replace dot products $x_i^T x_j$ with a kernel $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$
      • Allows learning nonlinear decision boundaries in high-dimensional ϕ(x) space
      • Common kernels:
        • Polynomial: $K(x, z) = (1 + x^T z)^d$
        • RBF: $K(x, z) = \exp\left(-\frac{\|x - z\|^2}{2\sigma^2}\right)$
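
A closing sketch connecting least squares, ridge regression, and gradient descent: the closed-form ridge solution is compared against plain gradient descent with a fixed $1/L$ step. The problem sizes, $\lambda$, seed, iteration count, and tolerances are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(7)   # illustrative seed
m, n, lam = 50, 5, 0.1
A = rng.standard_normal((m, n))
y = A @ rng.standard_normal(n) + 0.01 * rng.standard_normal(m)

# Ordinary least squares via the normal equations (A has full column rank here)
x_ls = np.linalg.solve(A.T @ A, A.T @ y)
assert np.allclose(x_ls, np.linalg.lstsq(A, y, rcond=None)[0])

# Ridge / Tikhonov closed form: x* = (A^T A + lam I)^{-1} A^T y
x_ridge = np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ y)

# Gradient descent on f(x) = ||Ax - y||_2^2 + lam ||x||_2^2,
# which is L-smooth with L = 2 (sigma_max(A)^2 + lam); use the fixed step 1/L
L = 2 * (np.linalg.norm(A, 2) ** 2 + lam)
x = np.zeros(n)
for _ in range(2000):
    grad = 2 * A.T @ (A @ x - y) + 2 * lam * x
    x -= grad / L
assert np.allclose(x, x_ridge, atol=1e-6)
print("gradient descent converges to the closed-form ridge solution")
```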
