Linear Algebra and Convex Optimization
Vector spaces
- Vector spaces: a set $V$ with addition and scalar multiplication
- Basis: a minimal set of vectors $B$ where every element in $V$ is a linear combination of $B$
- Dimension: the number of vectors in $B$
- Inner product: $\langle \cdot, \cdot \rangle: V \times V \to F$
- Conjugate symmetry: $\langle u, v \rangle = \overline{\langle v, u \rangle}$
- Linearity: $\langle ax + by, z \rangle = a \langle x, z \rangle + b \langle y, z \rangle$
- Positive definiteness: $\langle x, x \rangle \geq 0$, with $\langle x, x \rangle = 0 \iff x = 0$
- Norm: function from a vector space to non-negative reals
- Positive definiteness: $||x|| = 0 \implies x = 0$
- Triangle inequality: $||x + y|| \leq ||x|| + ||y||$
- Absolute homogeneity: $||\alpha x|| = |\alpha| \cdot ||x||$
- Common vector norms:
- L-$p$: $||x||_p = (\sum_{i} | x_i |^p)^{\frac{1}{p}}$
- L-$\infty$: $||x||_\infty = \max_i |x_i|$
- Useful formulas:
- Cauchy-Schwarz: $|\langle u, v \rangle|^2 \leq \langle u, u \rangle \cdot \langle v, v \rangle$
- Vector angle: $\cos \theta = \frac{x^T y}{||x||_2 ||y||_2}$
- Cardinality (# nonzero elements): $\text{card}(x) \geq \frac{||x||^2_1}{||x||^2_2}$
- $\frac{1}{\sqrt{n}} ||x||_2 \leq ||x||_\infty \leq ||x||_2 \leq ||x||_1 \leq \sqrt{n}||x||_2 \leq n ||x||_\infty$
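A quick NumPy sanity check of the definitions and inequalities above (a minimal sketch; the vectors `x` and `y` are arbitrary illustrative choices):

```python
import numpy as np

# Arbitrary example vectors in R^n (any nonzero x works)
x = np.array([3.0, -4.0, 1.0, 0.0])
y = np.array([1.0, 2.0, -2.0, 0.5])
n = x.size

l1 = np.linalg.norm(x, 1)         # sum of |x_i|
l2 = np.linalg.norm(x, 2)         # Euclidean norm
linf = np.linalg.norm(x, np.inf)  # max |x_i|

# Norm chain: (1/sqrt(n))||x||_2 <= ||x||_inf <= ||x||_2 <= ||x||_1 <= sqrt(n)||x||_2 <= n||x||_inf
assert l2 / np.sqrt(n) <= linf + 1e-12
assert linf <= l2 <= l1 <= np.sqrt(n) * l2 <= n * linf + 1e-12

# Cauchy-Schwarz: |<x, y>|^2 <= <x, x> <y, y>
assert abs(x @ y) ** 2 <= (x @ x) * (y @ y) + 1e-12

# Cardinality bound: card(x) >= ||x||_1^2 / ||x||_2^2
assert np.count_nonzero(x) >= l1**2 / l2**2 - 1e-12
```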
Matrix algebra
- Common matrix norms:
- $p$-norm: \(||A||_p = \max_{x \neq 0} \frac{||Ax||_p}{||x||_p}\)
- $p = 1$: max absolute column sum
- $p = 2$ (spectral): $\sigma_{\text{max}}(A)$
- $p = \infty$: max absolute row sum
- Sub-multiplicative property for $p$-norm: \(||A B||_p \leq ||A||_p ||B||_p\)
- Nuclear: \(||A||_* = \sum_{i=1}^r \sigma_i\)
- Frobenius: \(||A||_F = \sqrt{\text{tr}(A^T A)} = \sqrt{\sum_i \sigma^2_i}\)
- Norm relationships for $m \times n$ matrix $A$:
- $\frac{1}{\sqrt{n}} ||A||_\infty \leq ||A||_2 \leq \sqrt{m}||A||_\infty$
- $\frac{1}{\sqrt{m}} ||A||_1 \leq ||A||_2 \leq \sqrt{n} ||A||_1$
- $||A||_2 \leq ||A||_F \leq \sqrt{r} ||A||_2$
- Common matrix properties:
- Range: the span of the columns of $A$: \(\{v | v = \sum_i c_i a_i, c_i \in \mathbb{R} \}\)
- Rank: number of linearly independent columns of $A$
- Symmetry: a matrix $A$ is symmetric if $A = A^T$
- Hermitian: complex generalization of symmetric ($A = A^H$, the conjugate transpose)
- Positive-semidefinite: if $x^T A x \geq 0$ for all $x$ (PD if strict)
- Gain: \(\max_x \frac{||Ax||}{||x||}\)
- Trace (sum of diagonal elements): \(\text{tr}(A) = \sum_{i} \lambda_i\)
- Determinant: \(\text{det}(A) = \prod_{i} \lambda_i\)
- Spectral radius: \(\rho(A) = \max_i | \lambda_i(A) |\) (equals $\sigma_1$ when $A$ is symmetric/normal)
- Condition number: \(\kappa(A) = \frac{\sigma_1}{\sigma_n} = ||A||_2 ||A^{-1}||_2\)
- Fundamental Theorem of Linear Algebra (for $A \in \mathbb{R}^{m \times n}$):
- $\text{Null}(A) \oplus \text{Range}(A^T) = \mathbb{R}^n$
- $\text{Null}(A^T) \oplus \text{Range}(A) = \mathbb{R}^m$
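A minimal NumPy sketch checking several of the properties above (trace and determinant via eigenvalues, induced norms, condition number) on an arbitrary random matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))   # arbitrary square matrix for illustration

eigvals = np.linalg.eigvals(A)
sigmas = np.linalg.svd(A, compute_uv=False)

# Trace = sum of eigenvalues; determinant = product of eigenvalues
assert np.isclose(np.trace(A), eigvals.sum().real)
assert np.isclose(np.linalg.det(A), np.prod(eigvals).real)

# Spectral norm ||A||_2 = sigma_max; condition number = sigma_1 / sigma_n
assert np.isclose(np.linalg.norm(A, 2), sigmas[0])
assert np.isclose(np.linalg.cond(A, 2), sigmas[0] / sigmas[-1])

# Induced 1-norm / inf-norm: max absolute column / row sums
assert np.isclose(np.linalg.norm(A, 1), np.abs(A).sum(axis=0).max())
assert np.isclose(np.linalg.norm(A, np.inf), np.abs(A).sum(axis=1).max())

# ||A||_2 <= ||A||_F <= sqrt(rank(A)) * ||A||_2
r = np.linalg.matrix_rank(A)
assert sigmas[0] <= np.linalg.norm(A, 'fro') <= np.sqrt(r) * sigmas[0] + 1e-12
```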
Matrix calculus
- Gradient for function $g$:
- Scalar ($g: \mathbb{R}^n \to \mathbb{R}$): $(\nabla g(x))_i = \frac{\partial g}{\partial x_i}(x)$ (dim $n \times 1$)
- Hessian ($g: \mathbb{R}^n \to \mathbb{R}$): $(\nabla^2 g(x))_{ij} = \frac{\partial^2 g}{\partial x_i \partial x_j}(x)$ (dim $n \times n$)
- Jacobian ($g: \mathbb{R}^n \to \mathbb{R}^m$): $(Dg(x))_{ij} = \frac{\partial g_i}{\partial x_j}$ (dim $m \times n$)
- Quotient rule: for $h(x) = \frac{n(x)}{d(x)}$, \(\nabla h(x) = \frac{d(x) \nabla n(x) - n(x) \nabla d(x)}{(d(x))^2}\)
- Product rule: for $g(x) = v(x) s(x)$ ($v$: vector, $s$: scalar), \(Dg(x) = Dv(x)s(x) + v(x) Ds(x)\)
- Taylor’s theorem: \(f(x + \Delta x) = f(x) + \langle \nabla f(x), \Delta x \rangle + \frac{1}{2} (\Delta x)^T \nabla^2 f(x) \Delta x + \ldots\)
- Common matrix derivatives:
- For $g(a, b) = a^T X b$:
- $\nabla_X g(a, b) = a b^T$
- $\nabla_a g(a, b) = X b$
- $\nabla_b g(a, b) = X^T a$
- For $g(X) = \text{tr}(X^T A)$:
- $\nabla g(X) = A$
- For $g(X) = \text{tr}(X^T A X)$ where $A$ is symmetric:
- $\nabla g(X) = 2 A X$
- $\nabla^2 g(X) = 2 A$
- For $g(\Sigma) = \text{tr}(\Sigma^{-1} X)$ where $\Sigma \succ 0$:
- $\nabla_{\Sigma} g(\Sigma) = -\Sigma^{-1} X \Sigma^{-1}$
- For $g(X) = \text{tr}(X \log X)$ where $X \succ 0$:
- $\nabla g(X) = \log X + I$
- For $g(X) = ||AX - B||^2_F$:
- $\nabla g(X) = 2 A^T (AX - B)$
- $\nabla^2 g(X) = 2 A^T A$
- For $g(X) = \log \det X$ where $X \succ 0$:
- $\nabla g(X) = (X^{-1})^T = X^{-1}$ (since $X$ is symmetric)
- $\nabla^2 g(X)[\Delta X] = -X^{-1} \Delta X \, X^{-1}$ (from the differential $d(X^{-1}) = -X^{-1} (dX) X^{-1}$)
- For $g(x) = f(Ax)$ where $f: \mathbb{R}^m \to \mathbb{R}$ and $A \in \mathbb{R}^{m \times n}$:
- $\nabla g(x) = A^T \nabla f(Ax)$
- $\nabla^2 g(x) = A^T \nabla^2 f(Ax) A$
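These identities are easy to sanity-check with finite differences. A minimal sketch (random shapes and data are arbitrary; only two identities are checked, $\nabla_X\, a^T X b = a b^T$ and the chain rule with $f(y) = \frac{1}{2}||y||^2$):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 5, 3
a, b = rng.standard_normal(m), rng.standard_normal(n)
X = rng.standard_normal((m, n))
eps = 1e-6

# Central-difference check of grad_X (a^T X b) = a b^T
g = lambda X: a @ X @ b
num_grad = np.zeros_like(X)
for i in range(m):
    for j in range(n):
        E = np.zeros_like(X)
        E[i, j] = eps
        num_grad[i, j] = (g(X + E) - g(X - E)) / (2 * eps)
assert np.allclose(num_grad, np.outer(a, b), atol=1e-6)

# Chain rule: for g(x) = f(Ax) with f(y) = 0.5*||y||^2, grad g(x) = A^T (A x)
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)
h = lambda x: 0.5 * np.linalg.norm(A @ x) ** 2
num = np.array([(h(x + eps * e) - h(x - eps * e)) / (2 * eps) for e in np.eye(n)])
assert np.allclose(num, A.T @ (A @ x), atol=1e-6)
```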
Matrix decompositions
- Eigendecomposition (for symmetric $A$): $A = U \Lambda U^T$
- Eigenvalues are the elements of the diagonal matrix $\Lambda$, satisfying $Au = \lambda u$
- Characteristic equation: $\text{det}(\lambda I - A) = 0$
- Rayleigh quotient: $R(A, x) = \frac{x^T A x}{x^T x}$ (used for the variational characterization of eigenvalues)
- Eigenvectors are the columns of $U$ (which is orthonormal, i.e. $U U^T = I$)
- Inverse: $A^{-1} = U \Lambda^{-1} U^T$
- Spectral theorem: any symmetric matrix is orthogonally diagonalizable with real eigenvalues
- Singular value decomposition (SVD): $A = U \Sigma V^T$
- $v_i$ are eigenvectors of $A^T A$ with eigenvalues $\sigma_i^2$; $u_i = \frac{1}{\sigma_i} A v_i$
- $\text{rank}(A) = n \leq m \iff A^T A$ is invertible
- $\text{rank}(A) = m \leq n \iff A A^T$ is invertible
- Principal component analysis equivalent formulations:
- Variance-maximizing directions: \(\text{argmax}_{||u||_2 = 1} u^T C u\) ($C$ the sample covariance)
- Least-squares min directions: \(\text{argmin}_{||u||_2 = 1} \sum_i \min_{v_i} ||x_i - v_i u||^2_2\)
- Rank-one approximation: \(\text{argmin}_{\text{rank}(Y) = 1} ||X - Y||_F\)
- Moore-Penrose pseudoinverse: if $A = U \Sigma V^T$ is the SVD, then $A^\dagger = V \Sigma^\dagger U^T$ (invert the nonzero singular values)
- $A^\dagger = (A^T A)^{-1} A^T$ if $A$ has full column rank (left inverse: $A^\dagger A = I$)
- $A^\dagger = A^T (A A^T)^{-1}$ if $A$ has full row rank (right inverse: $A A^\dagger = I$)
- LU decomposition: $A = LU$, $L$ lower triangular and $U$ upper triangular
- LDL decomposition: if $A$ symmetric, $A = L D L^T$
- Cholesky decomposition: $A = L L^T$ with $L$ lower triangular; exists iff $A$ is symmetric PD
- QR decomposition: $A = QR$, $Q$ orthonormal and $R$ upper triangular
- Gram-Schmidt: orthonormalize columns of $A$ to get $Q$, use inner products to fill in $R$
- $A = \begin{bmatrix} q_1 & q_2 & q_3 \end{bmatrix} \begin{bmatrix} ||a_1|| & a^T_2 q_1 & a^T_3 q_1 \\ 0 & ||\tilde{q}_2|| & a_3^T q_2 \\ 0 & 0 & ||\tilde{q}_3|| \end{bmatrix}$, where $q_1 = \frac{a_1}{||a_1||}$, $\tilde{q}_i = a_i - \sum_{j < i} (a_i^T q_j) q_j$, and $q_i = \frac{\tilde{q}_i}{||\tilde{q}_i||}$
- Modified G-S: can replace diagonal elements with zeros for rectangular $A$
- More numerically stable than LU but more expensive
- Schur complement: block matrix $\begin{bmatrix} A & B \\ C & D \end{bmatrix}$ has Schur complement $D - C A^{-1} B$ (if $A$ invertible)
- $\begin{bmatrix} A & B \\ C & D \end{bmatrix} = \begin{bmatrix} I & 0 \\ C A^{-1} & I \end{bmatrix} \begin{bmatrix} A & 0 \\ 0 & D - C A^{-1} B \end{bmatrix} \begin{bmatrix} I & A^{-1} B \\ 0 & I \end{bmatrix}$
- Block-triangular decomposition: decomposes a square matrix $A$ into block-triangular form
- If $A = \begin{bmatrix} A_{11} & A_{12} & \cdots & A_{1m} \\ 0 & A_{22} & \cdots & A_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & A_{mm} \end{bmatrix}$ with each $A_{ii}$ square, then $A$ is block-triangular
- Diagonal blocks $A_{ii}$ can be further decomposed (e.g., LU, QR, SVD)
- Permutation matrices can symmetrically reorder rows and columns to make $A$ block-triangular
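A minimal sketch tying a few of these together: classical Gram-Schmidt QR and the SVD-based pseudoinverse, checked against NumPy's built-ins (the matrix size and data are arbitrary; full column rank is assumed):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4))   # tall matrix, generically full column rank
m, n = A.shape

# Classical Gram-Schmidt: q_1 = a_1/||a_1||, q_i~ = a_i - sum_{j<i} (a_i^T q_j) q_j
Q = np.zeros((m, n))
R = np.zeros((n, n))
for i in range(n):
    q = A[:, i].copy()
    for j in range(i):
        R[j, i] = A[:, i] @ Q[:, j]   # R_{ji} = a_i^T q_j
        q -= R[j, i] * Q[:, j]
    R[i, i] = np.linalg.norm(q)       # R_{ii} = ||q_i~||
    Q[:, i] = q / R[i, i]

assert np.allclose(Q @ R, A)            # A = QR
assert np.allclose(Q.T @ Q, np.eye(n))  # orthonormal columns

# Pseudoinverse via SVD: A^dagger = V Sigma^dagger U^T
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_pinv = Vt.T @ np.diag(1.0 / s) @ U.T
assert np.allclose(A_pinv, np.linalg.pinv(A))
# Full column rank: A^dagger = (A^T A)^{-1} A^T is a left inverse
assert np.allclose(A_pinv, np.linalg.inv(A.T @ A) @ A.T)
assert np.allclose(A_pinv @ A, np.eye(n))
```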
Convexity
- Convex set $C$: for $x, y \in C$, the line segment $tx + (1-t)y \in C$ for all $0 \leq t \leq 1$
- Hyperplanes, halfspaces, norm balls, and polytopes are all convex sets
- Convex hull: all coefficients $\geq 0$ and sum to $1$
- Conic hull: all coefficients $\geq 0$ (removes unit constraint)
- Convex function $f$: for $x, y$ in the domain, $f(tx + (1-t)y) \leq tf(x) + (1-t)f(y)$
- Affine functions, norms, and quadratic forms $x^T P x$ with $P \succeq 0$ are all convex
- Epigraph (the region on or above the graph) of a convex function is a convex set
- First-order condition: $f$ is convex iff $f(y) \geq f(x) + \nabla f(x)^T (y-x)$ for all $x, y$
- Second-order condition: $f$ is convex iff $\nabla^2 f(x) \succeq 0$ for all $x$
- Composition rules:
- Jensen’s inequality: for convex $f$, \(f\left(\sum_{i=1}^n p_i x_i\right) \leq \sum_{i=1}^n p_i f(x_i)\)
- Affine precomposition: $f(Ax + b)$ is convex for convex $f$ and affine $Ax + b$
- Nonnegative weighted sum: $\alpha f + \beta g$ is convex for convex $f$, $g$ and $\alpha, \beta \geq 0$
- Pointwise maximum: $\max\{f(x), g(x)\}$ is convex for convex $f$, $g$
- Perspective: $g(x, t) = t f(x/t)$ is convex in $(x, t)$ for convex $f$ and $t > 0$
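A small numerical check of the conditions above for a quadratic $f(x) = \frac{1}{2} x^T P x + b^T x$ with $P \succeq 0$ (the random data is an arbitrary illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
M = rng.standard_normal((n, n))
P = M.T @ M + np.eye(n)          # symmetric positive definite by construction
b = rng.standard_normal(n)

f = lambda x: 0.5 * x @ P @ x + b @ x    # convex quadratic
grad = lambda x: P @ x + b               # gradient

# Second-order condition: the Hessian P is PSD (all eigenvalues >= 0)
assert np.all(np.linalg.eigvalsh(P) >= 0)

# First-order condition: f(y) >= f(x) + grad(x)^T (y - x), spot-checked at random points
for _ in range(100):
    x, y = rng.standard_normal(n), rng.standard_normal(n)
    assert f(y) >= f(x) + grad(x) @ (y - x) - 1e-9

# Jensen's inequality: f(sum p_i x_i) <= sum p_i f(x_i) for a convex combination
xs = rng.standard_normal((5, n))
p = rng.random(5); p /= p.sum()
assert f(p @ xs) <= sum(pi * f(xi) for pi, xi in zip(p, xs)) + 1e-9
```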
Duality
- Constrained optimization problem: $\min f(x)$ s.t. $g_i(x) \leq 0$ and $h_j(x) = 0$
- We assume the primal $f$ is convex, with constraints $g$ convex, $h$ affine
- Dual problem: $\max_{\lambda \geq 0, \nu} \min_x L(x, \lambda, \nu)$
- Lagrangian: $L(x, \lambda, \nu) = f(x) + \sum_i \lambda_i g_i(x) + \sum_j \nu_j h_j(x)$
- The dual function $\min_x L(x, \lambda, \nu)$ is always concave in $(\lambda, \nu)$, even if the primal is not convex
- Weak duality: dual optimum $\leq$ primal optimum
- Strong duality: dual optimum $=$ primal optimum if Slater’s condition holds (primal strictly feasible)
- Slater’s condition: if the feasible region has an interior point, strong duality holds
- Strictly feasible: $x$ is in the relative interior of the problem domain $\mathcal{D}$, and:
- $g_i(x) < 0$ for all $i$ (note strict inequality)
- $h_j(x) = 0$ for all $j$
- KKT conditions:
- Primal feasibility: $g_i(x) \leq 0$, $h_j(x) = 0$
- Dual feasibility: $\lambda_i \geq 0$
- Complementary slackness: $\lambda_i g_i(x) = 0$
- Stationarity: $\nabla f(x) + \sum_i \lambda_i \nabla g_i(x) + \sum_j \nu_j \nabla h_j(x) = 0$
- If strong duality holds, then KKT is both necessary and sufficient for optimality
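A worked KKT check on a toy problem, $\min \frac{1}{2}||x||^2$ subject to $a^T x \geq 1$ (the vector $a$ is an arbitrary choice; the closed form $x^* = a/||a||^2$, $\lambda^* = 1/||a||^2$ follows from stationarity plus the active constraint):

```python
import numpy as np

# min 0.5*||x||^2  s.t.  a^T x >= 1, written as g(x) = 1 - a^T x <= 0
# f convex, g affine, Slater holds, so the KKT conditions are necessary and sufficient.
a = np.array([2.0, -1.0, 3.0])               # arbitrary nonzero vector for illustration

# Analytic solution: stationarity x = lam * a and active constraint a^T x = 1
lam = 1.0 / (a @ a)
x_star = lam * a

assert a @ x_star >= 1 - 1e-12               # primal feasibility
assert lam >= 0                              # dual feasibility
assert abs(lam * (1 - a @ x_star)) < 1e-12   # complementary slackness
assert np.allclose(x_star - lam * a, 0)      # stationarity: grad f + lam * grad g = 0

# Strong duality: dual value lam - 0.5*lam^2*||a||^2 equals primal value 0.5*||x*||^2
dual_val = lam - 0.5 * lam**2 * (a @ a)
primal_val = 0.5 * x_star @ x_star
assert np.isclose(dual_val, primal_val)
```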
Linear programming
- Standard form: $\min_{x} c^T x$ subject to $Ax = b$, $x \geq 0$
- Dual: $\max_{y, s} b^T y$ subject to $A^T y + s = c$, $s \geq 0$
- Simplex method: moves along vertices of feasible polyhedron until optimum reached
- Reduced costs: $\bar{c}_j = c_j - y^T A_j$ where $y$ solves $y^T A_B = c_B^T$
- Optimality condition: if all reduced costs $\geq 0$, current vertex is optimal
- Bland’s rule: choose entering variable with smallest index among negative reduced costs
- Interior point methods:
- Central path: set of points $(x(\tau), y(\tau), s(\tau))$ solving perturbed KKT:
- $Ax = b$, $x \geq 0$
- $A^T y + s = c$, $s \geq 0$
- $x_i s_i = \tau$ for all $i$ (centering condition)
- Path following methods: approximately follow central path as $\tau \to 0$
- Short-step: take Newton step, reduce $\tau$ by fixed factor
- Long-step: take damped Newton step, reduce $\tau$ adaptively
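A minimal sketch using SciPy's `linprog` (HiGHS backend) to solve a small standard-form LP and its dual, confirming strong duality and complementary slackness; the LP data is an arbitrary example where the last two variables act as slacks:

```python
import numpy as np
from scipy.optimize import linprog

# Standard-form primal: min c^T x  s.t.  Ax = b, x >= 0
c = np.array([-3.0, -5.0, 0.0, 0.0])
A = np.array([[1.0, 1.0, 1.0, 0.0],
              [1.0, 2.0, 0.0, 1.0]])
b = np.array([4.0, 6.0])

primal = linprog(c, A_eq=A, b_eq=b, bounds=(0, None), method="highs")

# Dual: max b^T y  s.t.  A^T y <= c, y free; linprog minimizes, so negate the objective
dual = linprog(-b, A_ub=A.T, b_ub=c, bounds=(None, None), method="highs")

assert primal.success and dual.success
# Strong duality: optimal values coincide (dual optimum = -dual.fun due to the sign flip)
assert np.isclose(primal.fun, -dual.fun)

# Complementary slackness: x_j * s_j = 0 with reduced costs s = c - A^T y
y = dual.x
s = c - A.T @ y
assert np.allclose(primal.x * s, 0, atol=1e-8)
```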
Applications
- Least squares: \(\min_x ||Ax - y||^2_2\)
- Normal-equations solution (full column rank): $x^* = (A^T A)^{-1} A^T y$; min-norm solution (full row rank): $x^* = A^T (A A^T)^{-1} y$
- Projection matrix onto $\text{Range}(A)$: $A A^\dagger$ using the Moore-Penrose pseudoinverse
- LASSO: $\min_x ||Ax - y||^2_2 + \lambda ||x||_1$
- Ridge / Tikhonov: $\min_x ||Ax - y||^2_2 + \lambda ||x||^2_2$
- $x^* = (A^T A + \lambda I)^{-1} A^T y$
- Weighted: $\min_x (Ax - y)^T W (Ax - y)$
- $x^* = (A^T W A)^{-1} A^T W y$
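A minimal NumPy sketch of the closed-form solutions above (the design matrix and targets are random, and full column rank is assumed):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((20, 5))          # tall design matrix, generically full column rank
y = rng.standard_normal(20)

# Least squares: x* = (A^T A)^{-1} A^T y
x_ls = np.linalg.solve(A.T @ A, A.T @ y)
assert np.allclose(x_ls, np.linalg.lstsq(A, y, rcond=None)[0])
# Normal equations: the residual is orthogonal to the range of A
assert np.allclose(A.T @ (A @ x_ls - y), 0)

# Ridge / Tikhonov: x* = (A^T A + lam*I)^{-1} A^T y, which shrinks the solution norm
lam = 0.1
x_ridge = np.linalg.solve(A.T @ A + lam * np.eye(5), A.T @ y)
assert np.linalg.norm(x_ridge) <= np.linalg.norm(x_ls)

# Weighted least squares: x* = (A^T W A)^{-1} A^T W y
w = rng.random(20) + 0.1
W = np.diag(w)
x_wls = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)
```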
- Gradient descent:
- Unconstrained optimization: $\min_x f(x)$ for differentiable $f$
- Gradient descent: $x^{(k+1)} = x^{(k)} - t_k \nabla f(x^{(k)})$
- Step size $t_k$ typically chosen by line search or fixed
- Converges if $f$ is $L$-smooth and $\mu$-strongly convex with $t_k \leq \frac{2}{L+\mu}$
- Newton’s method: $x^{(k+1)} = x^{(k)} - t_k (\nabla^2 f(x^{(k)}))^{-1} \nabla f(x^{(k)})$
- Converges quadratically if $\nabla^2 f(x)$ is Lipschitz and initial point close enough to optimum
- Constrained optimization: $\min_x f(x)$ subject to $x \in C$
- Projected gradient descent: $x^{(k+1)} = \text{Proj}_C(x^{(k)} - t_k \nabla f(x^{(k)}))$
- Projection \(\text{Proj}_C(y) = \arg\min_{x \in C} ||x - y||_2^2\)
- Frank-Wolfe algorithm: $x^{(k+1)} = x^{(k)} + \gamma_k (s^{(k)} - x^{(k)})$ where $s^{(k)} = \arg\min_{s \in C} \langle s, \nabla f(x^{(k)}) \rangle$
- Doesn’t require projection, good for structured constraint sets
- Converges at rate $O(\frac{1}{k})$ if $f$ convex and $\nabla f$ Lipschitz
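A minimal sketch of projected gradient descent for a least-squares objective over the unit Euclidean ball (the ball is chosen because its projection has a simple closed form; the data, sizes, and iteration count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((30, 8))
b = rng.standard_normal(30)

f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
grad = lambda x: A.T @ (A @ x - b)

# Constraint set C = unit Euclidean ball; projection rescales points outside the ball
proj = lambda y: y / max(1.0, np.linalg.norm(y))

# Projected gradient descent with fixed step 1/L, L = ||A||_2^2 (Lipschitz constant of grad f)
L = np.linalg.norm(A, 2) ** 2
x = np.zeros(8)
for _ in range(500):
    x = proj(x - (1.0 / L) * grad(x))

# At a constrained optimum, x is a fixed point of the projected gradient step
assert np.allclose(x, proj(x - (1.0 / L) * grad(x)), atol=1e-6)
```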
- Support vector machines
- Hard-margin SVM: $\min_{w, b} \frac{1}{2} ||w||^2$ subject to $y_i(w^T x_i + b) \geq 1$ for all $i$
- Maximizes margin $\frac{2}{||w||}$ between hyperplanes $w^T x + b = 1$ and $w^T x + b = -1$
- Soft-margin SVM: $\min_{w, b, \xi} \frac{1}{2} ||w||^2 + C \sum_i \xi_i$ subject to $y_i(w^T x_i + b) \geq 1 - \xi_i$, $\xi_i \geq 0$
- Allows misclassifications with slack variables $\xi_i$, trade-off controlled by $C > 0$
- Kernel trick: replace dot products $x_i^T x_j$ with kernel $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$
- Allows learning nonlinear decision boundaries in high-dimensional $\phi(x)$ space
- Common kernels:
- Polynomial \(K(x,z) = (1 + x^T z)^d\)
- RBF \(K(x,z) = \exp(-\frac{||x-z||^2}{2\sigma^2})\)
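A minimal sketch building the polynomial and RBF Gram matrices from the kernels above and checking that they are symmetric PSD (random points and arbitrary kernel parameters, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((10, 3))          # 10 points in R^3

# Polynomial kernel K(x, z) = (1 + x^T z)^d and RBF kernel K(x, z) = exp(-||x - z||^2 / (2*sigma^2))
d, sigma = 3, 1.0
K_poly = (1.0 + X @ X.T) ** d
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_rbf = np.exp(-sq_dists / (2 * sigma**2))

# Valid kernels give symmetric PSD Gram matrices (eigenvalues >= 0 up to round-off)
for K in (K_poly, K_rbf):
    assert np.allclose(K, K.T)
    assert np.linalg.eigvalsh(K).min() >= -1e-8
```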