Linear Algebra and Convex Optimization

Vector spaces

  1. Vector spaces: a set $V$ with addition and scalar multiplication
    • Basis: a minimal set of vectors $B$ where every element in $V$ is a linear combination of $B$
    • Dimension: the number of vectors in $B$
    • Inner product: $\langle \cdot, \cdot \rangle: V \times V \to F$
      • Conjugate symmetry: $\langle u, v \rangle = \overline{\langle v, u \rangle}$
      • Linearity: $\langle ax + by, z \rangle = a \langle x, z \rangle + b \langle y, z \rangle$
      • Positive definiteness: $\langle x, x \rangle \geq 0$, with $\langle x, x \rangle = 0 \iff x = 0$
    • Norm: function from a vector space to non-negative reals
      • Positive definiteness: $||x|| = 0 \implies x = 0$
      • Triangle inequality: $||x + y|| \leq ||x|| + ||y||$
      • Absolute homogeneity: $||\alpha x|| = |\alpha| \cdot ||x||$
  2. Common vector norms:
    • L-$p$: $||x||_p = (\sum_{i} | x_i |^p)^{\frac{1}{p}}$
    • L-$\infty$: $||x||_\infty = \max_i |x_i|$
  3. Useful formulas:
    • Cauchy-Schwarz: $|\langle u, v \rangle|^2 \leq \langle u, u \rangle \cdot \langle v, v \rangle$
    • Vector angle: $\text{cos } \theta = \frac{x^T y}{||x||_2 ||y||_2}$
    • Cardinality (# nonzero elements): $\text{card}(x) \geq \frac{||x||^2_1}{||x||^2_2}$
    • $\frac{1}{\sqrt{n}} ||x||_2 \leq ||x||_\infty \leq ||x||_2 \leq ||x||_1 \leq \sqrt{n}||x||_2 \leq n ||x||_\infty$
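
A quick numeric sanity check of the formulas above (a minimal NumPy sketch; the vectors are arbitrary random examples):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(5)
y = rng.standard_normal(5)
n = x.size

# Cauchy-Schwarz: |<x, y>|^2 <= <x, x> <y, y>
assert np.dot(x, y) ** 2 <= np.dot(x, x) * np.dot(y, y)

# Cardinality bound: card(x) >= ||x||_1^2 / ||x||_2^2
assert np.count_nonzero(x) >= np.linalg.norm(x, 1) ** 2 / np.linalg.norm(x, 2) ** 2

# Norm chain: ||x||_2 / sqrt(n) <= ||x||_inf <= ||x||_2 <= ||x||_1 <= sqrt(n) ||x||_2 <= n ||x||_inf
l1, l2, linf = (np.linalg.norm(x, p) for p in (1, 2, np.inf))
assert l2 / np.sqrt(n) <= linf <= l2 <= l1 <= np.sqrt(n) * l2 <= n * linf
```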

Matrix algebra

  1. Common matrix norms:
    • $p$-norm (induced/operator norm): \(||A||_p = \max_{x \neq 0} \frac{||Ax||_p}{||x||_p}\)
      • $p = 1$: max absolute column sum
      • $p = 2$ (spectral): $\sigma_{\text{max}}(A)$
      • $p = \infty$: max absolute row sum
      • Sub-multiplicative property for $p$-norm: \(||A B||_p \leq ||A||_p ||B||_p\)
    • Nuclear: \(||A||_* = \sum_{i=1}^r \sigma_i\)
    • Frobenius: \(||A||_F = \sqrt{\text{tr}(A^T A)} = \sqrt{\sum_i \sigma^2_i}\)
  2. Norm relationships for $m \times n$ matrix $A$:
    • $\frac{1}{\sqrt{n}} ||A||_\infty \leq ||A||_2 \leq \sqrt{m}||A||_\infty$
    • $\frac{1}{\sqrt{m}} ||A||_1 \leq ||A||_2 \leq \sqrt{n} ||A||_1$
    • $||A||_2 \leq ||A||_F \leq \sqrt{r} ||A||_2$
  3. Common matrix properties:
    • Range: the span of the columns of $A$: \(\{v | v = \sum_i c_i a_i, c_i \in \mathbb{R} \}\)
    • Rank: number of linearly independent columns of $A$
    • Symmetry: a matrix $A$ is symmetric if $A = A^T$
      • Hermitian: complex generalization of symmetric
    • Positive-semidefinite: symmetric $A$ with $x^T A x \geq 0$ for all $x$ (PD if $x^T A x > 0$ for all $x \neq 0$)
    • Gain: \(\max_x \frac{||Ax||}{||x||}\)
    • Trace (sum of diagonal elements): \(\text{tr}(A) = \sum_{i} \lambda_i\), the sum of the eigenvalues
    • Determinant: \(\text{det}(A) = \prod_{i} \lambda_i\), the product of the eigenvalues (and $|\text{det}(A)| = \prod_i \sigma_i$)
    • Spectral radius: \(\rho(A) = \max_i | \lambda_i(A) |\) (equals $\sigma_1$ when $A$ is symmetric)
    • Condition number: \(\kappa(A) = \frac{\sigma_1}{\sigma_n} = ||A||_2 ||A^{-1}||_2\)
  4. Fundamental Theorem of Linear Algebra:
    • $\text{Null}(A) \oplus \text{Range}(A^T) = \mathbb{R}^n$ for $A \in \mathbb{R}^{m \times n}$
    • $\text{Null}(A^T) \oplus \text{Range}(A) = \mathbb{R}^m$
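
The matrix-norm identities and eigenvalue facts above can be checked numerically (a minimal NumPy sketch on random matrices):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))
sigma = np.linalg.svd(A, compute_uv=False)

# Induced norms: 1-norm = max abs column sum, inf-norm = max abs row sum, 2-norm = sigma_max
assert np.isclose(np.linalg.norm(A, 1), np.abs(A).sum(axis=0).max())
assert np.isclose(np.linalg.norm(A, np.inf), np.abs(A).sum(axis=1).max())
assert np.isclose(np.linalg.norm(A, 2), sigma.max())

# Frobenius norm: sqrt(tr(A^T A)) = sqrt(sum of squared singular values)
assert np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(np.trace(A.T @ A)))
assert np.isclose(np.linalg.norm(A, 'fro'), np.sqrt((sigma ** 2).sum()))

# Trace = sum of eigenvalues, determinant = product of eigenvalues (square matrix)
B = rng.standard_normal((3, 3))
eigvals = np.linalg.eigvals(B)
assert np.isclose(np.trace(B), eigvals.sum().real)
assert np.isclose(np.linalg.det(B), eigvals.prod().real)
```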

Matrix calculus

  1. Gradient for function $g$:
    • Scalar ($g: \mathbb{R}^n \to \mathbb{R}$): $(\nabla g(x))_i = \frac{\partial g}{\partial x_i}(x)$ (dim $n \times 1$)
    • Hessian ($g: \mathbb{R}^n \to \mathbb{R}$): $(\nabla^2 g(x))_{ij} = \frac{\partial^2 g}{\partial x_i \partial x_j}(x)$ (dim $n \times n$)
    • Jacobian ($g: \mathbb{R}^n \to \mathbb{R}^m$): $(Dg(x))_{ij} = \frac{\partial g_i}{\partial x_j}$ (dim $m \times n$)
  2. Quotient rule: for $h(x) = \frac{n(x)}{d(x)}$, \(\nabla h(x) = \frac{d(x) \nabla n(x) - n(x) \nabla d(x)}{(d(x))^2}\)
  3. Product rule: for $g(x) = v(x) s(x)$ ($v$: vector, $s$: scalar), \(Dg(x) = Dv(x)s(x) + v(x) Ds(x)\)
  4. Taylor’s theorem: \(f(x + \Delta x) = f(x) + \langle \nabla f(x), \Delta x \rangle + \frac{1}{2} (\Delta x)^T \nabla^2 f(x) \Delta x + \ldots\)
  5. Common matrix derivatives:
    • For $g(a, b) = a^T X b$:
      • $\nabla_X g(a, b) = a b^T$
      • $\nabla_a g(a, b) = X b$
      • $\nabla_b g(a, b) = X^T a$
    • For $g(X) = \text{tr}(X^T A)$:
      • $\nabla g(X) = A$
    • For $g(X) = \text{tr}(X^T A X)$ where $A$ is symmetric:
      • $\nabla g(X) = 2 A X$
      • $\nabla^2 g(X) = 2 A$
    • For $g(\Sigma) = \text{tr}(\Sigma^{-1} X)$ where $\Sigma \succ 0$:
      • $\nabla g(\Sigma) = -\Sigma^{-1} X^T \Sigma^{-1}$ ($= -\Sigma^{-1} X \Sigma^{-1}$ when $X$ is symmetric)
    • For $g(X) = \text{tr}(X \log X)$ where $X \succeq 0$:
      • $\nabla g(X) = \log X + I$
    • For $g(X) = ||AX - B||^2_F$:
      • $\nabla g(X) = 2 A^T (AX - B)$
      • $\nabla^2 g(X) = 2 A^T A$
    • For $g(X) = \log \det X$ where $X \succ 0$:
      • $\nabla g(X) = (X^{-1})^T = X^{-1}$ (since $X \succ 0$ is symmetric)
      • $\nabla^2 g(X)[\Delta X] = -X^{-1} \Delta X X^{-1}$ (the Hessian as a linear map acting on a direction $\Delta X$)
    • For $g(x) = f(Ax)$ where $f: \mathbb{R}^m \to \mathbb{R}$ and $A \in \mathbb{R}^{m \times n}$:
      • $\nabla g(x) = A^T \nabla f(Ax)$
      • $\nabla^2 g(x) = A^T \nabla^2 f(Ax) A$
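
As an illustration of one of the derivative formulas above, here is a finite-difference check of $\nabla g(X) = 2 A^T (AX - B)$ for $g(X) = ||AX - B||^2_F$ (a minimal NumPy sketch with arbitrary random data):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3))
B = rng.standard_normal((5, 2))
X = rng.standard_normal((3, 2))

def g(X):
    return np.linalg.norm(A @ X - B, 'fro') ** 2

grad_analytic = 2 * A.T @ (A @ X - B)

# Central finite differences, entry by entry
eps = 1e-6
grad_numeric = np.zeros_like(X)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        E = np.zeros_like(X)
        E[i, j] = eps
        grad_numeric[i, j] = (g(X + E) - g(X - E)) / (2 * eps)

assert np.allclose(grad_analytic, grad_numeric, atol=1e-4)
```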

Matrix decompositions

  1. Eigendecomposition (for symmetric $A$): $A = U \Lambda U^T$
    • Eigenvalues are the elements of the diagonal matrix $\Lambda$, satisfying $Au = \lambda u$
      • Characteristic equation: $\text{det}(\lambda I - A) = 0$
      • Rayleigh quotient: $R(A, x) = \frac{x^T A x}{x^T x}$ (used for the variational characterization of eigenvalues)
    • Eigenvectors are the columns of $U$ (which is orthonormal, i.e. $U U^T = I$)
    • Inverse: $A^{-1} = U \Lambda^{-1} U^T$
    • Spectral theorem: any real symmetric matrix is orthogonally diagonalizable with real eigenvalues
  2. Singular value decomposition (SVD): $A = U \Sigma V^T$
    • $v_i$ are the eigenvectors of $A^T A$ with eigenvalues $\sigma_i^2$, and $u_i = \frac{1}{\sigma_i} A v_i$
      • $\text{rank}(A) = n \leq m \iff A^T A$ is invertible
      • $\text{rank}(A) = m \leq n \iff A A^T$ is invertible
    • Principal component analysis equivalent formulations:
      • Variance-maximizing directions: \(\text{argmax}_{||u||_2 = 1} u^T C u\), where $C$ is the covariance matrix
      • Least-squares min directions: \(\text{argmin}_{||u||_2 = 1} \sum_i \min_{v_i} ||x_i - v_i u||^2_2\)
      • Rank-one approximation: \(\text{argmin}_{\text{rank}(Y) = 1} ||X - Y||_F\)
  3. Moore-Penrose pseudoinverse: if $A = U \Sigma V^T$ is the SVD, then $A^\dagger = V \Sigma^\dagger U^T$
    • $\Sigma^\dagger$ transposes $\Sigma$ and inverts its nonzero singular values
    • $A^\dagger = (A^T A)^{-1} A^T$ if $A$ has full column rank (left inverse: $A^\dagger A = I$)
    • $A^\dagger = A^T (A A^T)^{-1}$ if $A$ has full row rank (right inverse: $A A^\dagger = I$)
  4. LU decomposition: $A = LU$, $L$ lower triangular and $U$ upper triangular
    • LDL decomposition: if $A$ symmetric, $A = L D L^T$
    • Cholesky decomposition: iff $A$ is symmetric PD, $A = L L^T$ with $L$ lower triangular
  5. QR decomposition: $A = QR$, $Q$ orthonormal and $R$ upper triangular
    • Gram-Schmidt: orthonormalize columns of $A$ to get $Q$, use inner products to fill in $R$
      • $A = \begin{bmatrix} q_1 & q_2 & q_3 \end{bmatrix} \begin{bmatrix} ||a_1|| & a^T_2 q_1 & a^T_3 q_1 \\ 0 & ||\tilde{q}_2|| & a_3^T q_2 \\ 0 & 0 & ||\tilde{q}_3|| \end{bmatrix}$ where $q_1 = \frac{a_1}{||a_1||}$, $\tilde{q}_i = a_i - \sum_{j < i} (a_i^T q_j) q_j$, and $q_i = \frac{\tilde{q}_i}{||\tilde{q}_i||}$
      • Modified G-S: orthogonalizes the running residual against each computed $q_j$ in turn; more numerically stable than classical G-S (if a column is linearly dependent, $\tilde{q}_i = 0$ and the corresponding diagonal entry of $R$ is zero)
    • More numerically stable than LU but more expensive
  6. Schur complement: block matrix $\begin{bmatrix} A & B \\ C & D \end{bmatrix}$ has Schur complement $D - C A^{-1} B$ (if $A$ invertible)
    • $\begin{bmatrix} A & B \\ C & D \end{bmatrix} = \begin{bmatrix} I & 0 \\ C A^{-1} & I \end{bmatrix} \begin{bmatrix} A & 0 \\ 0 & D - C A^{-1} B
      \end{bmatrix} \begin{bmatrix} I & A^{-1} B \\ 0 & I \end{bmatrix}$
  7. Block-triangular decomposition: decomposes a square matrix $A$ into block-triangular form
    • If $A = \begin{bmatrix} A_{11} & A_{12} & \cdots & A_{1m} \\ 0 & A_{22} & \cdots & A_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & A_{mm} \end{bmatrix}$ with each $A_{ii}$ square, then $A$ is block-triangular
    • Diagonal blocks $A_{ii}$ can be further decomposed (e.g., LU, QR, SVD)
    • Permutation matrices can symmetrically reorder rows and columns to make $A$ block-triangular
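
A minimal sketch of classical Gram-Schmidt QR and the SVD-based pseudoinverse, with numeric checks against the identities above (NumPy, random full-column-rank $A$):

```python
import numpy as np

def gram_schmidt_qr(A):
    """Classical Gram-Schmidt QR for a matrix with full column rank."""
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for i in range(n):
        q_tilde = A[:, i].copy()
        for j in range(i):
            R[j, i] = A[:, i] @ Q[:, j]        # r_ji = a_i^T q_j
            q_tilde -= R[j, i] * Q[:, j]       # subtract the projection onto q_j
        R[i, i] = np.linalg.norm(q_tilde)      # ||q~_i||
        Q[:, i] = q_tilde / R[i, i]
    return Q, R

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 3))
Q, R = gram_schmidt_qr(A)
assert np.allclose(Q @ R, A)                   # A = QR
assert np.allclose(Q.T @ Q, np.eye(3))         # Q has orthonormal columns

# Pseudoinverse from the SVD agrees with the normal-equations form for full column rank
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_pinv = Vt.T @ np.diag(1 / s) @ U.T
assert np.allclose(A_pinv, np.linalg.inv(A.T @ A) @ A.T)
assert np.allclose(A_pinv @ A, np.eye(3))      # left inverse when A has full column rank
```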

Convexity

  1. Convex set $C$: for all $x, y \in C$, every point $tx + (1-t)y$ on the line segment between them is in $C$, for $0 \leq t \leq 1$
    • Hyperplanes, halfspaces, norm balls, and polytopes are all convex sets
    • Convex hull: all combinations $\sum_i \theta_i x_i$ with coefficients $\theta_i \geq 0$ that sum to $1$
    • Conic hull: all combinations with coefficients $\theta_i \geq 0$ (removes the sum-to-one constraint)
  2. Convex function $f$: for $x, y$ in the domain, $f(tx + (1-t)y) \leq tf(x) + (1-t)f(y)$
    • Affine functions, norms, and quadratic functions $x^T P x + q^T x + r$ with $P \succeq 0$ are all convex
    • Epigraph (the region on or above the graph) of a convex function is a convex set
    • First-order condition: differentiable $f$ is convex iff $f(y) \geq f(x) + \nabla f(x)^T (y-x)$ for all $x, y$ in the domain
    • Second-order condition: twice-differentiable $f$ is convex iff $\nabla^2 f(x) \succeq 0$ for all $x$
  3. Composition rules:
    • Jensen’s inequality: for convex $f$ and weights $p_i \geq 0$ with $\sum_i p_i = 1$, \(f\left(\sum_{i=1}^n p_i x_i\right) \leq \sum_{i=1}^n p_i f(x_i)\)
    • Affine precomposition: $f(Ax + b)$ is convex for convex $f$ and affine $Ax + b$
    • Nonnegative weighted sum: $\alpha f + \beta g$ is convex for convex $f$, $g$ and $\alpha, \beta \geq 0$
    • Pointwise maximum: $\max\{f(x), g(x)\}$ is convex for convex $f$, $g$
    • Perspective: $g(x, t) = t f(x/t)$ is convex in $(x, t)$ over $t > 0$ for convex $f$
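
A numeric illustration of the convexity conditions above, using log-sum-exp as the convex function (a minimal NumPy sketch; the test points are arbitrary):

```python
import numpy as np

def f(x):
    """Log-sum-exp, a standard convex function."""
    return np.log(np.sum(np.exp(x)))

def grad_f(x):
    return np.exp(x) / np.sum(np.exp(x))   # softmax

rng = np.random.default_rng(4)
x, y = rng.standard_normal(4), rng.standard_normal(4)

# Definition of convexity along the segment between x and y
for t in np.linspace(0, 1, 11):
    assert f(t * x + (1 - t) * y) <= t * f(x) + (1 - t) * f(y) + 1e-12

# First-order condition: f(y) >= f(x) + grad f(x)^T (y - x)
assert f(y) >= f(x) + grad_f(x) @ (y - x)

# Second-order condition: the Hessian diag(p) - p p^T is PSD (p = softmax(x))
p = grad_f(x)
H = np.diag(p) - np.outer(p, p)
assert np.all(np.linalg.eigvalsh(H) >= -1e-12)
```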

Duality

  1. Constrained optimization problem: $\min f(x)$ s.t. $g_i(x) \leq 0$ and $h_j(x) = 0$
    • We assume the primal $f$ is convex, with constraints $g$ convex, $h$ affine
  2. Dual problem: $\max_{\lambda \geq 0, \nu} \min_x L(x, \lambda, \nu)$
    • Lagrangian: $L(x, \lambda, \nu) = f(x) + \sum_i \lambda_i g_i(x) + \sum_j \nu_j h_j(x)$
    • The dual function $\min_x L(x, \lambda, \nu)$ is always concave in $(\lambda, \nu)$, even if the primal is not convex
    • Weak duality: dual optimum $\leq$ primal optimum
    • Strong duality: dual optimum $=$ primal optimum if Slater’s condition holds (primal strictly feasible)
  3. Slater’s condition: if the (convex) problem has a strictly feasible point, strong duality holds
    • Strictly feasible: $x$ is in the relative interior of the problem domain $\mathcal{D}$, and:
      • $g_i(x) < 0$ for all $i$ (note strict inequality)
      • $h_j(x) = 0$ for all $j$
  4. KKT conditions:
    • Primal feasibility: $g_i(x) \leq 0$, $h_j(x) = 0$
    • Dual feasibility: $\lambda_i \geq 0$
    • Complementary slackness: $\lambda_i g_i(x) = 0$
    • Stationarity: $\nabla f(x) + \sum_i \lambda_i \nabla g_i(x) + \sum_j \nu_j \nabla h_j(x) = 0$
    • If strong duality holds, then KKT is both necessary and sufficient for optimality
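
A worked toy example of weak/strong duality and the KKT conditions: minimize $(x-2)^2$ subject to $x \leq 1$, whose dual function works out to $q(\lambda) = \lambda - \lambda^2/4$ (a minimal NumPy sketch of a problem chosen here purely for illustration):

```python
import numpy as np

# Toy problem: minimize f(x) = (x - 2)^2  subject to  g(x) = x - 1 <= 0
f = lambda x: (x - 2) ** 2
g = lambda x: x - 1

# Primal optimum: the unconstrained minimizer x = 2 is infeasible, so the constraint is active
x_star = 1.0
p_star = f(x_star)                          # = 1

# Dual function: q(lambda) = min_x [ f(x) + lambda * g(x) ], minimized at x = 2 - lambda/2
q = lambda lam: lam - lam ** 2 / 4
lams = np.linspace(0, 10, 10001)
d_star = q(lams).max()
lam_star = lams[q(lams).argmax()]           # = 2

assert np.all(q(lams) <= p_star + 1e-12)    # weak duality: q(lambda) <= p* for all lambda >= 0
assert np.isclose(d_star, p_star)           # strong duality (Slater holds: x = 0 is strictly feasible)

# KKT conditions at (x*, lambda*)
assert g(x_star) <= 0 and lam_star >= 0            # primal and dual feasibility
assert np.isclose(lam_star * g(x_star), 0)         # complementary slackness
assert np.isclose(2 * (x_star - 2) + lam_star, 0)  # stationarity: df/dx + lambda * dg/dx = 0
```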

Linear programming

  1. Standard form: $\min_{x} c^T x$ subject to $Ax = b$, $x \geq 0$
    • Dual: $\max_{y, s} b^T y$ subject to $A^T y + s = c$, $s \geq 0$
  2. Simplex method: moves along vertices of feasible polyhedron until optimum reached
    • Reduced costs: $\bar{c}_j = c_j - y^T A_j$ where $y$ solves $y^T A_B = c_B^T$
    • Optimality condition: if all reduced costs $\geq 0$, current vertex is optimal
    • Bland’s rule: choose the entering variable with the smallest index among those with negative reduced cost (prevents cycling)
  3. Interior point methods:
    • Central path: set of points $(x(\tau), y(\tau), s(\tau))$ solving perturbed KKT:
      • $Ax = b$, $x \geq 0$
      • $A^T y + s = c$, $s \geq 0$
      • $x_i s_i = \tau$ for all $i$ (centralizing condition)
    • Path following methods: approximately follow central path as $\tau \to 0$
      • Short-step: take Newton step, reduce $\tau$ by fixed factor
      • Long-step: take damped Newton step, reduce $\tau$ adaptively
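
A small standard-form LP and its dual solved with SciPy, illustrating strong duality (the problem data below is an assumed example: maximize $3x_1 + 5x_2$ under three resource constraints, with slacks added):

```python
import numpy as np
from scipy.optimize import linprog

# Maximize 3x1 + 5x2 s.t. x1 <= 4, 2x2 <= 12, 3x1 + 2x2 <= 18,
# written in standard form  min c^T x  s.t.  Ax = b, x >= 0  using slack variables
c = np.array([-3.0, -5.0, 0.0, 0.0, 0.0])
A = np.array([[1.0, 0.0, 1.0, 0.0, 0.0],
              [0.0, 2.0, 0.0, 1.0, 0.0],
              [3.0, 2.0, 0.0, 0.0, 1.0]])
b = np.array([4.0, 12.0, 18.0])

# Primal: min c^T x  s.t.  Ax = b, x >= 0  (linprog's default bounds are x >= 0)
primal = linprog(c, A_eq=A, b_eq=b, method="highs")

# Dual: max b^T y  s.t.  A^T y <= c, y free  (equivalently A^T y + s = c, s >= 0)
dual = linprog(-b, A_ub=A.T, b_ub=c, bounds=[(None, None)] * 3, method="highs")

print("primal optimum:", primal.fun)   # -36 at x1 = 2, x2 = 6
print("dual optimum:  ", -dual.fun)    # -36: strong duality for a feasible, bounded LP
```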

Applications

  1. Least squares: \(\min_x ||Ax - y||^2_2\)
    • Normal equations (full column rank): $x^* = (A^T A)^{-1} A^T y$; min-norm solution in general: $x^* = A^\dagger y$
    • Projection matrix onto $\text{Range}(A)$: $A A^\dagger$, using the Moore-Penrose pseudoinverse
    • LASSO: $\min_x ||Ax - y||^2_2 + \lambda ||x||_1$
    • Ridge / Tikhonov: $\min_x ||Ax - y||^2_2 + \lambda ||x||^2_2$
      • $x^* = (A^T A + \lambda I)^{-1} A^T y$
    • Weighted: $\min_x (Ax - y)^T W (Ax - y)$
      • $x^* = (A^T W A)^{-1} A^T W y$
  2. Gradient descent:
    • Unconstrained optimization: $\min_x f(x)$ for differentiable $f$
      • Gradient descent: $x^{(k+1)} = x^{(k)} - t_k \nabla f(x^{(k)})$
        • Step size $t_k$ typically chosen by line search or fixed
        • Converges linearly if $f$ is $L$-smooth and $\mu$-strongly convex with $t_k \leq \frac{2}{L+\mu}$
      • Newton’s method: $x^{(k+1)} = x^{(k)} - t_k (\nabla^2 f(x^{(k)}))^{-1} \nabla f(x^{(k)})$
        • Converges quadratically if $\nabla^2 f(x)$ is Lipschitz and initial point close enough to optimum
    • Constrained optimization: $\min_x f(x)$ subject to $x \in C$
      • Projected gradient descent: $x^{(k+1)} = \text{Proj}_C(x^{(k)} - t_k \nabla f(x^{(k)}))$
        • Projection \(\text{Proj}_C(y) = \arg\min_{x \in C} ||x - y||_2^2\)
      • Frank-Wolfe algorithm: $x^{(k+1)} = x^{(k)} + \gamma_k (s^{(k)} - x^{(k)})$ where $s^{(k)} = \arg\min_{s \in C} \langle s, \nabla f(x^{(k)}) \rangle$
        • Doesn’t require projection, good for structured constraint sets
        • Converges at rate $O(\frac{1}{k})$ if $f$ convex and $\nabla f$ Lipschitz
  3. Support vector machines
    • Hard-margin SVM: $\min_{w, b} \frac{1}{2} ||w||^2$ subject to $y_i(w^T x_i + b) \geq 1$ for all $i$
      • Maximizes margin $\frac{2}{||w||}$ between hyperplanes $w^T x + b = 1$ and $w^T x + b = -1$
    • Soft-margin SVM: $\min_{w, b, \xi} \frac{1}{2} ||w||^2 + C \sum_i \xi_i$ subject to $y_i(w^T x_i + b) \geq 1 - \xi_i$, $\xi_i \geq 0$
      • Allows misclassifications with slack variables $\xi_i$, trade-off controlled by $C > 0$
    • Kernel trick: replace dot products $x_i^T x_j$ with kernel $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$
      • Allows learning nonlinear decision boundaries in high-dimensional $\phi(x)$ space
      • Common kernels:
        • Polynomial \(K(x,z) = (1 + x^T z)^d\)
        • RBF \(K(x,z) = \exp(-\frac{||x-z||^2}{2\sigma^2})\)
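
A minimal sketch tying the ridge closed form to gradient descent with a $1/L$ step size (NumPy, random data; $L$ here is the smoothness constant of the ridge objective):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((50, 10))
y = rng.standard_normal(50)
lam = 0.1

# Ridge / Tikhonov closed form: x* = (A^T A + lam I)^{-1} A^T y
x_closed = np.linalg.solve(A.T @ A + lam * np.eye(10), A.T @ y)

# Gradient descent on f(x) = ||Ax - y||_2^2 + lam ||x||_2^2 with step 1/L,
# where L = 2 (sigma_max(A)^2 + lam) is the smoothness constant of f
grad = lambda x: 2 * A.T @ (A @ x - y) + 2 * lam * x
L = 2 * (np.linalg.norm(A, 2) ** 2 + lam)
x = np.zeros(10)
for _ in range(5000):
    x -= (1 / L) * grad(x)

assert np.allclose(x, x_closed, atol=1e-6)   # gradient descent reaches the closed-form solution
```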
