Linear Algebra and Convex Optimization
Vector spaces
- Vector spaces: a set $V$ with addition and scalar multiplication
- Basis: a minimal set of vectors $B$ where every element in $V$ is a linear combination of $B$
- Dimension: the number of vectors in $B$
- Inner product: $\langle \cdot, \cdot \rangle: V \times V \to F$
- Conjugate symmetry: $\langle u, v \rangle = \overline{\langle v, u \rangle}$
- Linearity: $\langle ax + by, z \rangle = a \langle x, z \rangle + b \langle y, z \rangle$
- Positive definiteness: $\langle x, x \rangle \geq 0$, with $\langle x, x \rangle = 0 \iff x = 0$
- Norm: function from a vector space to non-negative reals
- Positive definiteness: $||x|| = 0 \implies x = 0$
- Triangle inequality: $||x + y|| \leq ||x|| + ||y||$
- Absolute homogeneity: $||\alpha x|| = |\alpha| \cdot ||x||$
- Common vector norms:
- L-$p$: $||x||_p = (\sum_{i} | x_i |^p)^{\frac{1}{p}}$
- L-$\infty$: $||x||_\infty = \max_i |x_i|$
- Useful formulas:
- Cauchy-Schwarz: $|\langle u, v \rangle|^2 \leq \langle u, u \rangle \cdot \langle v, v \rangle$
- Vector angle: $\cos \theta = \frac{x^T y}{||x||_2 ||y||_2}$
- Cardinality (# nonzero elements): $\text{card}(x) \geq \frac{||x||^2_1}{||x||^2_2}$
- $\frac{1}{\sqrt{n}} ||x||_2 \leq ||x||_\infty \leq ||x||_2 \leq ||x||_1 \leq \sqrt{n}||x||_2 \leq n ||x||_\infty$
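A quick NumPy sanity check of the definitions and inequalities above (a minimal sketch; the vectors `x` and `y` are arbitrary illustrative choices):

```python
import numpy as np

# Arbitrary example vectors in R^n (any nonzero x works)
x = np.array([3.0, -4.0, 1.0, 0.0])
y = np.array([1.0, 2.0, -2.0, 0.5])
n = x.size

l1 = np.linalg.norm(x, 1)         # sum of |x_i|
l2 = np.linalg.norm(x, 2)         # Euclidean norm
linf = np.linalg.norm(x, np.inf)  # max |x_i|

# Norm chain: (1/sqrt(n))||x||_2 <= ||x||_inf <= ||x||_2 <= ||x||_1 <= sqrt(n)||x||_2 <= n||x||_inf
assert l2 / np.sqrt(n) <= linf + 1e-12
assert linf <= l2 <= l1 <= np.sqrt(n) * l2 <= n * linf + 1e-12

# Cauchy-Schwarz: |<x, y>|^2 <= <x, x> <y, y>
assert abs(x @ y) ** 2 <= (x @ x) * (y @ y) + 1e-12

# Cardinality bound: card(x) >= ||x||_1^2 / ||x||_2^2
assert np.count_nonzero(x) >= l1**2 / l2**2 - 1e-12
```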
Matrix algebra
- Common matrix norms:
- $p$-norm: \(||A||_p = \max_{x \neq 0} \frac{||Ax||_p}{||x||_p}\)
- $p = 1$: max absolute column sum
- $p = 2$ (spectral): $\sigma_{\text{max}}(A)$
- $p = \infty$: max absolute row sum
- Sub-multiplicative property for $p$-norm: \(||A B||_p \leq ||A||_p ||B||_p\)
- Nuclear: \(||A||_* = \sum_{i=1}^r \sigma_i\)
- Frobenius: \(||A||_F = \sqrt{\text{tr}(A^T A)} = \sqrt{\sum_i \sigma^2_i}\)
- Norm relationships for $m \times n$ matrix $A$:
- $\frac{1}{\sqrt{n}} ||A||_\infty \leq ||A||_2 \leq \sqrt{m}||A||_\infty$
- $\frac{1}{\sqrt{m}} ||A||_1 \leq ||A||_2 \leq \sqrt{n} ||A||_1$
- $||A||_2 \leq ||A||_F \leq \sqrt{r} ||A||_2$
- Common matrix properties:
- Range: the span of the columns of $A$: \(\{v | v = \sum_i c_i a_i, c_i \in \mathbb{R} \}\)
- Rank: number of linearly independent columns of $A$
- Symmetry: a matrix $A$ is symmetric if $A = A^T$
- Hermitian: complex generalization of symmetric ($A = A^H$, the conjugate transpose)
- Positive-semidefinite: if $x^T A x \geq 0$ for all $x$ (PD if strict)
- Gain: \(\max_x \frac{||Ax||}{||x||}\)
- Trace (sum of diagonal elements): \(\text{tr}(A) = \sum_{i} \lambda_i\)
- Determinant: \(\text{det}(A) = \prod_{i} \lambda_i\)
- Spectral radius: \(\rho(A) = \max_i | \lambda_i(A) |\) (equals $\sigma_1$ when $A$ is symmetric/normal)
- Condition number: \(\kappa(A) = \frac{\sigma_1}{\sigma_n} = ||A||_2 ||A^{-1}||_2\)
- Fundamental Theorem of Linear Algebra (for $A \in \mathbb{R}^{m \times n}$):
- $\text{Null}(A) \oplus \text{Range}(A^T) = \mathbb{R}^n$
- $\text{Null}(A^T) \oplus \text{Range}(A) = \mathbb{R}^m$
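A minimal NumPy sketch checking several of the properties above (trace and determinant via eigenvalues, induced norms, condition number) on an arbitrary random matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))   # arbitrary square matrix for illustration

eigvals = np.linalg.eigvals(A)
sigmas = np.linalg.svd(A, compute_uv=False)

# Trace = sum of eigenvalues; determinant = product of eigenvalues
assert np.isclose(np.trace(A), eigvals.sum().real)
assert np.isclose(np.linalg.det(A), np.prod(eigvals).real)

# Spectral norm ||A||_2 = sigma_max; condition number = sigma_1 / sigma_n
assert np.isclose(np.linalg.norm(A, 2), sigmas[0])
assert np.isclose(np.linalg.cond(A, 2), sigmas[0] / sigmas[-1])

# Induced 1-norm / inf-norm: max absolute column / row sums
assert np.isclose(np.linalg.norm(A, 1), np.abs(A).sum(axis=0).max())
assert np.isclose(np.linalg.norm(A, np.inf), np.abs(A).sum(axis=1).max())

# ||A||_2 <= ||A||_F <= sqrt(rank(A)) * ||A||_2
r = np.linalg.matrix_rank(A)
assert sigmas[0] <= np.linalg.norm(A, 'fro') <= np.sqrt(r) * sigmas[0] + 1e-12
```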
Matrix calculus
- Gradient for function $g$:
- Scalar ($g: \mathbb{R}^n \to \mathbb{R}$): $(\nabla g(x))_i = \frac{\partial g}{\partial x_i}(x)$ (dim $n \times 1$)
- Hessian ($g: \mathbb{R}^n \to \mathbb{R}$): $(\nabla^2 g(x))_{ij} = \frac{\partial^2 g}{\partial x_i \partial x_j}(x)$ (dim $n \times n$)
- Jacobian ($g: \mathbb{R}^n \to \mathbb{R}^m$): $(Dg(x))_{ij} = \frac{\partial g_i}{\partial x_j}$ (dim $m \times n$)
- Quotient rule: for $h(x) = \frac{n(x)}{d(x)}$, \(\nabla h(x) = \frac{d(x) \nabla n(x) - n(x) \nabla d(x)}{(d(x))^2}\)
- Product rule: for $g(x) = v(x) s(x)$ ($v$: vector, $s$: scalar), \(Dg(x) = Dv(x)s(x) + v(x) Ds(x)\)
- Taylor’s theorem: \(f(x + \Delta x) = f(x) + \langle \nabla f(x), \Delta x \rangle + \frac{1}{2} (\Delta x)^T \nabla^2 f(x) \Delta x + \ldots\)
- Common matrix derivatives:
- For $g(a, b) = a^T X b$:
- $\nabla_X g(a, b) = a b^T$
- $\nabla_a g(a, b) = X b$
- $\nabla_b g(a, b) = X^T a$
- For $g(X) = \text{tr}(X^T A)$:
- $\nabla g(X) = A$
- For $g(X) = \text{tr}(X^T A X)$ where $A$ is symmetric:
- $\nabla g(X) = 2 A X$
- $\nabla^2 g(X) = 2 A$
- For $g(\Sigma) = \text{tr}(\Sigma^{-1} X)$ where $\Sigma \succ 0$:
- $\nabla_{\Sigma} g(\Sigma) = -\Sigma^{-1} X \Sigma^{-1}$
- For $g(X) = \text{tr}(X \log X)$ where $X \succ 0$:
- $\nabla g(X) = \log X + I$
- For $g(X) = ||AX - B||^2_F$:
- $\nabla g(X) = 2 A^T (AX - B)$
- $\nabla^2 g(X) = 2 A^T A$
- For $g(X) = \log \det X$ where $X \succ 0$:
- $\nabla g(X) = (X^{-1})^T = X^{-1}$ (since $X$ is symmetric)
- $\nabla^2 g(X)[\Delta X] = -X^{-1} \Delta X \, X^{-1}$ (from the differential $d(X^{-1}) = -X^{-1} (dX) X^{-1}$)
- For $g(x) = f(Ax)$ where $f: \mathbb{R}^m \to \mathbb{R}$ and $A \in \mathbb{R}^{m \times n}$:
- $\nabla g(x) = A^T \nabla f(Ax)$
- $\nabla^2 g(x) = A^T \nabla^2 f(Ax) A$
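These identities are easy to sanity-check with finite differences. A minimal sketch (random shapes and data are arbitrary; only two identities are checked, $\nabla_X\, a^T X b = a b^T$ and the chain rule with $f(y) = \frac{1}{2}||y||^2$):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 5, 3
a, b = rng.standard_normal(m), rng.standard_normal(n)
X = rng.standard_normal((m, n))
eps = 1e-6

# Central-difference check of grad_X (a^T X b) = a b^T
g = lambda X: a @ X @ b
num_grad = np.zeros_like(X)
for i in range(m):
    for j in range(n):
        E = np.zeros_like(X)
        E[i, j] = eps
        num_grad[i, j] = (g(X + E) - g(X - E)) / (2 * eps)
assert np.allclose(num_grad, np.outer(a, b), atol=1e-6)

# Chain rule: for g(x) = f(Ax) with f(y) = 0.5*||y||^2, grad g(x) = A^T (A x)
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)
h = lambda x: 0.5 * np.linalg.norm(A @ x) ** 2
num = np.array([(h(x + eps * e) - h(x - eps * e)) / (2 * eps) for e in np.eye(n)])
assert np.allclose(num, A.T @ (A @ x), atol=1e-6)
```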
Matrix decompositions
- Eigendecomposition (for symmetric $A$): $A = U \Lambda U^T$
- Eigenvalues are the elements of the diagonal matrix $\Lambda$, satisfying $Au = \lambda u$
- Characteristic equation: $\text{det}(\lambda I - A) = 0$
- Rayleigh quotient: $R(A, x) = \frac{x^T A x}{x^T x}$ (used for the variational characterization of eigenvalues)
- Eigenvectors are the columns of $U$ (which is orthonormal, i.e. $U U^T = I$)
- Inverse: $A^{-1} = U \Lambda^{-1} U^T$
- Spectral theorem: any symmetric matrix is orthogonally diagonalizable with real eigenvalues
- Singular value decomposition (SVD): $A = U \Sigma V^T$
- $v_i$ are eigenvectors of $A^T A$ with eigenvalues $\sigma_i^2$; $u_i = \frac{1}{\sigma_i} A v_i$
- $\text{rank}(A) = n \leq m \iff A^T A$ is invertible
- $\text{rank}(A) = m \leq n \iff A A^T$ is invertible
- Principal component analysis equivalent formulations:
- Variance-maximizing directions: \(\text{argmax}_{||u||_2 = 1} u^T C u\) ($C$ the sample covariance)
- Least-squares min directions: \(\text{argmin}_{||u||_2 = 1} \sum_i \min_{v_i} ||x_i - v_i u||^2_2\)
- Rank-one approximation: \(\text{argmin}_{\text{rank}(Y) = 1} ||X - Y||_F\)
- Moore-Penrose pseudoinverse: if $A = U \Sigma V^T$ is the SVD, then $A^\dagger = V \Sigma^\dagger U^T$ (invert the nonzero singular values)
- $A^\dagger = (A^T A)^{-1} A^T$ if $A$ has full column rank (left inverse: $A^\dagger A = I$)
- $A^\dagger = A^T (A A^T)^{-1}$ if $A$ has full row rank (right inverse: $A A^\dagger = I$)
- LU decomposition: $A = LU$, $L$ lower triangular and $U$ upper triangular
- LDL decomposition: if $A$ symmetric, $A = L D L^T$
- Cholesky decomposition: $A = L L^T$ with $L$ lower triangular; exists iff $A$ is symmetric PD
- QR decomposition: $A = QR$, $Q$ orthonormal and $R$ upper triangular
- Gram-Schmidt: orthonormalize columns of $A$ to get $Q$, use inner products to fill in $R$
- $A = \begin{bmatrix} q_1 & q_2 & q_3 \end{bmatrix} \begin{bmatrix} ||a_1|| & a^T_2 q_1 & a^T_3 q_1 \\ 0 & ||\tilde{q}_2|| & a_3^T q_2 \\ 0 & 0 & ||\tilde{q}_3|| \end{bmatrix}$, where $q_1 = \frac{a_1}{||a_1||}$, $\tilde{q}_i = a_i - \sum_{j < i} (a_i^T q_j) q_j$, and $q_i = \frac{\tilde{q}_i}{||\tilde{q}_i||}$
- Modified G-S: can replace diagonal elements with zeros for rectangular $A$
- More numerically stable than LU but more expensive
- Schur complement: block matrix $\begin{bmatrix} A & B \\ C & D \end{bmatrix}$ has Schur complement $D - C A^{-1} B$ (if $A$ invertible)
- $\begin{bmatrix} A & B \\ C & D \end{bmatrix} = \begin{bmatrix} I & 0 \\ C A^{-1} & I \end{bmatrix} \begin{bmatrix} A & 0 \\ 0 & D - C A^{-1} B \end{bmatrix} \begin{bmatrix} I & A^{-1} B \\ 0 & I \end{bmatrix}$
- Block-triangular decomposition: decomposes a square matrix $A$ into block-triangular form
- If $A = \begin{bmatrix} A_{11} & A_{12} & \cdots & A_{1m} \\ 0 & A_{22} & \cdots & A_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & A_{mm} \end{bmatrix}$ with each $A_{ii}$ square, then $A$ is block-triangular
- Diagonal blocks $A_{ii}$ can be further decomposed (e.g., LU, QR, SVD)
- Permutation matrices can symmetrically reorder rows and columns to make $A$ block-triangular
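A minimal sketch tying a few of these together: classical Gram-Schmidt QR and the SVD-based pseudoinverse, checked against NumPy's built-ins (the matrix size and data are arbitrary; full column rank is assumed):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4))   # tall matrix, generically full column rank
m, n = A.shape

# Classical Gram-Schmidt: q_1 = a_1/||a_1||, q_i~ = a_i - sum_{j<i} (a_i^T q_j) q_j
Q = np.zeros((m, n))
R = np.zeros((n, n))
for i in range(n):
    q = A[:, i].copy()
    for j in range(i):
        R[j, i] = A[:, i] @ Q[:, j]   # R_{ji} = a_i^T q_j
        q -= R[j, i] * Q[:, j]
    R[i, i] = np.linalg.norm(q)       # R_{ii} = ||q_i~||
    Q[:, i] = q / R[i, i]

assert np.allclose(Q @ R, A)            # A = QR
assert np.allclose(Q.T @ Q, np.eye(n))  # orthonormal columns

# Pseudoinverse via SVD: A^dagger = V Sigma^dagger U^T
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_pinv = Vt.T @ np.diag(1.0 / s) @ U.T
assert np.allclose(A_pinv, np.linalg.pinv(A))
# Full column rank: A^dagger = (A^T A)^{-1} A^T is a left inverse
assert np.allclose(A_pinv, np.linalg.inv(A.T @ A) @ A.T)
assert np.allclose(A_pinv @ A, np.eye(n))
```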
Convexity
- Convex set $C$: for $x, y \in C$, the line segment $tx + (1-t)y \in C$ for all $0 \leq t \leq 1$
- Hyperplanes, halfspaces, norm balls, and polytopes are all convex sets
- Convex hull: all coefficients $\geq 0$ and sum to $1$
- Conic hull: all coefficients $\geq 0$ (removes unit constraint)
- Convex function $f$: for $x, y$ in the domain, $f(tx + (1-t)y) \leq tf(x) + (1-t)f(y)$
- Affine functions, norms, and quadratic forms $x^T P x$ with $P \succeq 0$ are all convex
- Epigraph (the region on or above the graph) of a convex function is a convex set
- First-order condition: $f$ is convex iff $f(y) \geq f(x) + \nabla f(x)^T (y-x)$ for all $x, y$
- Second-order condition: $f$ is convex iff $\nabla^2 f(x) \succeq 0$ for all $x$
- Composition rules:
- Jensen’s inequality: for convex $f$, \(f\left(\sum_{i=1}^n p_i x_i\right) \leq \sum_{i=1}^n p_i f(x_i)\)
- Affine precomposition: $f(Ax + b)$ is convex for convex $f$ and affine $Ax + b$
- Nonnegative weighted sum: $\alpha f + \beta g$ is convex for convex $f$, $g$ and $\alpha, \beta \geq 0$
- Pointwise maximum: $\max\{f(x), g(x)\}$ is convex for convex $f$, $g$
- Perspective: $g(x, t) = t f(x/t)$ is convex in $(x, t)$ for convex $f$ and $t > 0$
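A small numerical check of the conditions above for a quadratic $f(x) = \frac{1}{2} x^T P x + b^T x$ with $P \succeq 0$ (the random data is an arbitrary illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
M = rng.standard_normal((n, n))
P = M.T @ M + np.eye(n)          # symmetric positive definite by construction
b = rng.standard_normal(n)

f = lambda x: 0.5 * x @ P @ x + b @ x    # convex quadratic
grad = lambda x: P @ x + b               # gradient

# Second-order condition: the Hessian P is PSD (all eigenvalues >= 0)
assert np.all(np.linalg.eigvalsh(P) >= 0)

# First-order condition: f(y) >= f(x) + grad(x)^T (y - x), spot-checked at random points
for _ in range(100):
    x, y = rng.standard_normal(n), rng.standard_normal(n)
    assert f(y) >= f(x) + grad(x) @ (y - x) - 1e-9

# Jensen's inequality: f(sum p_i x_i) <= sum p_i f(x_i) for a convex combination
xs = rng.standard_normal((5, n))
p = rng.random(5); p /= p.sum()
assert f(p @ xs) <= sum(pi * f(xi) for pi, xi in zip(p, xs)) + 1e-9
```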
Duality
- Constrained optimization problem: $\min f(x)$ s.t. $g_i(x) \leq 0$ and $h_j(x) = 0$
- We assume the primal $f$ is convex, with constraints $g$ convex, $h$ affine
- Dual problem: $\max_{\lambda \geq 0, \nu} \min_x L(x, \lambda, \nu)$
- Lagrangian: $L(x, \lambda, \nu) = f(x) + \sum_i \lambda_i g_i(x) + \sum_j \nu_j h_j(x)$
- The dual function $\min_x L(x, \lambda, \nu)$ is always concave in $(\lambda, \nu)$, even if the primal is not convex
- Weak duality: dual optimum $\leq$ primal optimum
- Strong duality: dual optimum $=$ primal optimum if Slater’s condition holds (primal strictly feasible)
- Slater’s condition: if the feasible region has an interior point, strong duality holds
- Strictly feasible: $x$ is in the relative interior of the problem domain $\mathcal{D}$, and:
- $g_i(x) < 0$ for all $i$ (note strict inequality)
- $h_j(x) = 0$ for all $j$
- KKT conditions:
- Primal feasibility: $g_i(x) \leq 0$, $h_j(x) = 0$
- Dual feasibility: $\lambda_i \geq 0$
- Complementary slackness: $\lambda_i g_i(x) = 0$
- Stationarity: $\nabla f(x) + \sum_i \lambda_i \nabla g_i(x) + \sum_j \nu_j \nabla h_j(x) = 0$
- If strong duality holds, then KKT is both necessary and sufficient for optimality
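A worked KKT check on a toy problem, $\min \frac{1}{2}||x||^2$ subject to $a^T x \geq 1$ (the vector $a$ is an arbitrary choice; the closed form $x^* = a/||a||^2$, $\lambda^* = 1/||a||^2$ follows from stationarity plus the active constraint):

```python
import numpy as np

# min 0.5*||x||^2  s.t.  a^T x >= 1, written as g(x) = 1 - a^T x <= 0
# f convex, g affine, Slater holds, so the KKT conditions are necessary and sufficient.
a = np.array([2.0, -1.0, 3.0])               # arbitrary nonzero vector for illustration

# Analytic solution: stationarity x = lam * a and active constraint a^T x = 1
lam = 1.0 / (a @ a)
x_star = lam * a

assert a @ x_star >= 1 - 1e-12               # primal feasibility
assert lam >= 0                              # dual feasibility
assert abs(lam * (1 - a @ x_star)) < 1e-12   # complementary slackness
assert np.allclose(x_star - lam * a, 0)      # stationarity: grad f + lam * grad g = 0

# Strong duality: dual value lam - 0.5*lam^2*||a||^2 equals primal value 0.5*||x*||^2
dual_val = lam - 0.5 * lam**2 * (a @ a)
primal_val = 0.5 * x_star @ x_star
assert np.isclose(dual_val, primal_val)
```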
Linear programming
- Standard form: $\min_{x} c^T x$ subject to $Ax = b$, $x \geq 0$
- Dual: $\max_{y, s} b^T y$ subject to $A^T y + s = c$, $s \geq 0$
- Simplex method: moves along vertices of feasible polyhedron until optimum reached
- Reduced costs: $\bar{c}_j = c_j - y^T A_j$ where $y$ solves $y^T A_B = c_B^T$
- Optimality condition: if all reduced costs $\geq 0$, current vertex is optimal
- Bland’s rule: choose entering variable with smallest index among negative reduced costs
- Interior point methods:
- Central path: set of points $(x(\tau), y(\tau), s(\tau))$ solving perturbed KKT:
- $Ax = b$, $x \geq 0$
- $A^T y + s = c$, $s \geq 0$
- $x_i s_i = \tau$ for all $i$ (centering condition)
- Path following methods: approximately follow central path as $\tau \to 0$
- Short-step: take Newton step, reduce $\tau$ by fixed factor
- Long-step: take damped Newton step, reduce $\tau$ adaptively
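A minimal sketch using SciPy's `linprog` (HiGHS backend) to solve a small standard-form LP and its dual, confirming strong duality and complementary slackness; the LP data is an arbitrary example where the last two variables act as slacks:

```python
import numpy as np
from scipy.optimize import linprog

# Standard-form primal: min c^T x  s.t.  Ax = b, x >= 0
c = np.array([-3.0, -5.0, 0.0, 0.0])
A = np.array([[1.0, 1.0, 1.0, 0.0],
              [1.0, 2.0, 0.0, 1.0]])
b = np.array([4.0, 6.0])

primal = linprog(c, A_eq=A, b_eq=b, bounds=(0, None), method="highs")

# Dual: max b^T y  s.t.  A^T y <= c, y free; linprog minimizes, so negate the objective
dual = linprog(-b, A_ub=A.T, b_ub=c, bounds=(None, None), method="highs")

assert primal.success and dual.success
# Strong duality: optimal values coincide (dual optimum = -dual.fun due to the sign flip)
assert np.isclose(primal.fun, -dual.fun)

# Complementary slackness: x_j * s_j = 0 with reduced costs s = c - A^T y
y = dual.x
s = c - A.T @ y
assert np.allclose(primal.x * s, 0, atol=1e-8)
```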
Applications
- Least squares: \(\min_x ||Ax - y||^2_2\)
- Normal-equations solution (full column rank): $x^* = (A^T A)^{-1} A^T y$; min-norm solution (full row rank): $x^* = A^T (A A^T)^{-1} y$
- Projection matrix onto $\text{Range}(A)$: $A A^\dagger$ using the Moore-Penrose pseudoinverse
- LASSO: $\min_x ||Ax - y||^2_2 + \lambda ||x||_1$
- Ridge / Tikhonov: $\min_x ||Ax - y||^2_2 + \lambda ||x||^2_2$
- $x^* = (A^T A + \lambda I)^{-1} A^T y$
- Weighted: $\min_x (Ax - y)^T W (Ax - y)$
- $x^* = (A^T W A)^{-1} A^T W y$
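A minimal NumPy sketch of the closed-form solutions above (the design matrix and targets are random, and full column rank is assumed):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((20, 5))          # tall design matrix, generically full column rank
y = rng.standard_normal(20)

# Least squares: x* = (A^T A)^{-1} A^T y
x_ls = np.linalg.solve(A.T @ A, A.T @ y)
assert np.allclose(x_ls, np.linalg.lstsq(A, y, rcond=None)[0])
# Normal equations: the residual is orthogonal to the range of A
assert np.allclose(A.T @ (A @ x_ls - y), 0)

# Ridge / Tikhonov: x* = (A^T A + lam*I)^{-1} A^T y, which shrinks the solution norm
lam = 0.1
x_ridge = np.linalg.solve(A.T @ A + lam * np.eye(5), A.T @ y)
assert np.linalg.norm(x_ridge) <= np.linalg.norm(x_ls)

# Weighted least squares: x* = (A^T W A)^{-1} A^T W y
w = rng.random(20) + 0.1
W = np.diag(w)
x_wls = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)
```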
- Gradient descent:
- Unconstrained optimization: $\min_x f(x)$ for differentiable $f$
- Gradient descent: $x^{(k+1)} = x^{(k)} - t_k \nabla f(x^{(k)})$
- Step size $t_k$ typically chosen by line search or fixed
- Converges if $f$ is $L$-smooth and $\mu$-strongly convex with $t_k \leq \frac{2}{L+\mu}$
- Newton’s method: $x^{(k+1)} = x^{(k)} - t_k (\nabla^2 f(x^{(k)}))^{-1} \nabla f(x^{(k)})$
- Converges quadratically if $\nabla^2 f(x)$ is Lipschitz and initial point close enough to optimum
- Constrained optimization: $\min_x f(x)$ subject to $x \in C$
- Projected gradient descent: $x^{(k+1)} = \text{Proj}_C(x^{(k)} - t_k \nabla f(x^{(k)}))$
- Projection \(\text{Proj}_C(y) = \arg\min_{x \in C} ||x - y||_2^2\)
- Frank-Wolfe algorithm: $x^{(k+1)} = x^{(k)} + \gamma_k (s^{(k)} - x^{(k)})$ where $s^{(k)} = \arg\min_{s \in C} \langle s, \nabla f(x^{(k)}) \rangle$
- Doesn’t require projection, good for structured constraint sets
- Converges at rate $O(\frac{1}{k})$ if $f$ convex and $\nabla f$ Lipschitz
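A minimal sketch of projected gradient descent for a least-squares objective over the unit Euclidean ball (the ball is chosen because its projection has a simple closed form; the data, sizes, and iteration count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((30, 8))
b = rng.standard_normal(30)

f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
grad = lambda x: A.T @ (A @ x - b)

# Constraint set C = unit Euclidean ball; projection rescales points outside the ball
proj = lambda y: y / max(1.0, np.linalg.norm(y))

# Projected gradient descent with fixed step 1/L, L = ||A||_2^2 (Lipschitz constant of grad f)
L = np.linalg.norm(A, 2) ** 2
x = np.zeros(8)
for _ in range(500):
    x = proj(x - (1.0 / L) * grad(x))

# At a constrained optimum, x is a fixed point of the projected gradient step
assert np.allclose(x, proj(x - (1.0 / L) * grad(x)), atol=1e-6)
```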
- Support vector machines
- Hard-margin SVM: $\min_{w, b} \frac{1}{2} ||w||^2$ subject to $y_i(w^T x_i + b) \geq 1$ for all $i$
- Maximizes margin $\frac{2}{||w||}$ between hyperplanes $w^T x + b = 1$ and $w^T x + b = -1$
- Soft-margin SVM: $\min_{w, b, \xi} \frac{1}{2} ||w||^2 + C \sum_i \xi_i$ subject to $y_i(w^T x_i + b) \geq 1 - \xi_i$, $\xi_i \geq 0$
- Allows misclassifications with slack variables $\xi_i$, trade-off controlled by $C > 0$
- Kernel trick: replace dot products $x_i^T x_j$ with kernel $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$
- Allows learning nonlinear decision boundaries in high-dimensional $\phi(x)$ space
- Common kernels:
- Polynomial \(K(x,z) = (1 + x^T z)^d\)
- RBF \(K(x,z) = \exp(-\frac{||x-z||^2}{2\sigma^2})\)
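A minimal sketch building the polynomial and RBF Gram matrices from the kernels above and checking that they are symmetric PSD (random points and arbitrary kernel parameters, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((10, 3))          # 10 points in R^3

# Polynomial kernel K(x, z) = (1 + x^T z)^d and RBF kernel K(x, z) = exp(-||x - z||^2 / (2*sigma^2))
d, sigma = 3, 1.0
K_poly = (1.0 + X @ X.T) ** d
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_rbf = np.exp(-sq_dists / (2 * sigma**2))

# Valid kernels give symmetric PSD Gram matrices (eigenvalues >= 0 up to round-off)
for K in (K_poly, K_rbf):
    assert np.allclose(K, K.T)
    assert np.linalg.eigvalsh(K).min() >= -1e-8
```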