Math 450 Optimization Methods in Machine Learning (Spring 2021)


Analysis of various optimization methods and their application in training common machine learning models. A brief review of gradient descent, line search, Newton’s method, and Python programming. Specific topics include: quasi-Newton method, stochastic gradient descent, momentum, and variance reduction. Coding these methods in the one of the most popular machine learning library PyTorch with GPU accelerators on Google Colab and Kaggle cloud platforms. For more info on the lecture time, formats, homework policy, Canvas site, please refer to the official syllabus.


The preparing of the several modules in this course borrows many things from Dr. Long Chen’s Math 110A and Math 110B classes at UC Irvine, and certain parts in Module 4 is designed based on Dr. Long Chen’s notes on optimization through first order ODE system solvers (personal communication).

Textbook and references:

Numerical Optimization, by Jorge Nocedal and Stephen J. Wright, Springer Series in Operations Research and Financial Engineering book series. Free e-copy at

Lectures on Convex Optimization, by Yurii Nesterov, Springer Optimization and Its Applications book series. Free e-copy at

Note: to download the electronic copy of the books from the publisher, for both you need either log into the WashU library proxy or use a campus network.

Reference: Bottou, L., Curtis, F.E. and Nocedal, J., 2018. “Optimization methods for large-scale machine learning”. SIAM Review, 60(2), pp.223-311. arXiv link

Github repository


This is only tentative, and with a fairly good chance, is subject to change due to COVID-19. Another note is that the lecture numbering may not directly correspond to the actual lecture given.

Module 1

Duration: Week 1-2

Problem: how machine learning workflows build a nonlinear function $h(\mathbf{w})$ based on the data to achieve certain tasks.

Examples: Given images of handwritten digits, building a function that can tell us which arabic number a newcoming image corresponds to.

Material: Neural networks, loss functions, cross-entropy, convolutional neural networks, regression and classification problems.

Coding: review of Python (numpy, scikit-learn), introduction to PyTorch tensor operations (matrix-vector, matrix-matrix multiplication, element-wise operation, dimension tracking), PyTorch’s simple Sequential() model building interface.

Module 2

Duration: Week 3-6

Problem: how to minimize a convex and smooth function $\displaystyle\min_{\mathbf{x}\in \mathbb{R}^N} L(\mathbf{x})$ using first order methods.

Examples: Solving a linear regression or logistic regression problem, least square problems like $\displaystyle \min_{\mathbf{w}\in \mathbb{R}^N} \Vert A\mathbf{w} - \mathbf{b}\Vert^2 + \epsilon\Vert \mathbf{w} \Vert^2$. Identifying $\displaystyle\min_{\boldsymbol{x}} \frac{1}{2} \boldsymbol{x}^{\top} A\boldsymbol{x} - \boldsymbol{b}^{\top}\boldsymbol{x}$ as solving $A\boldsymbol{x} = \boldsymbol{b}$ for a positive and symmetric matrix $A$.

Material: Review of the Math 449 second half, convergence analysis of gradient descent, conjugate gradient descent.

Coding: Porting the numpy-based implementation of optimization methods using simple PyTorch vector/matrix operations. autograd submodule of torch. Use Sequential to build a simple neural network from torch.nn submodule.

Module 3

Duration: Week 7-9

Problem: Stochastic variants of optimization methods for $\displaystyle\min_{\mathbf{w}\in \mathbb{R}^n} \frac{1}{N}\sum_{i=1}^N L_i(\mathbf{w})$ when $N$ is large.

Examples: $h(X; \mathbf{w})$ stands for the output of an underlying machine learning model (e.g., a neural network), $X$ are input data of the model with $\mathbf{w}$ representing the parameters of the model. Its corresponding least square loss function to be minimized is $\displaystyle \min_{\mathbf{w}\in \mathbb{R}^N} \Vert h(X; \mathbf{w}) - \mathbf{y}\Vert^2 + \epsilon\Vert \mathbf{w} \Vert^2$.

Material: Stochastic gradient descent (SGD), convergence analysis, step size choice, mini-batching and batch size, overfitting, validation error, $\ell^1$- and $\ell^2$- regularization, early stopping. (Coding) Writing a stochastic gradient descend optimizer using both optim interfaces in PyTorch to train a neural network. Writing classes using nn.Module.

Module 4

Duration: Week 10-11

Problem: How to accelerate the convergence of GD/SGD for $\displaystyle\min_{\mathbf{w}\in \mathbb{R}^n} \frac{1}{N}\sum_{i=1}^N L_i(\mathbf{w})$.

Examples: In the GD variants for $\displaystyle\min_{\mathbf{w}\in \mathbb{R}^n} L(\mathbf{w})$: $\displaystyle\mathbf{w}{k+1} = \mathbf{w}_k - \alpha_k \nabla{\mathbf{w}} L(\mathbf{w}k)$, making use of previous iterations’ gradient $\nabla{\mathbf{w}} L(\mathbf{w}_{k-1})$ to accelerate the convergence.

Material: Review of Math 449 on how the condition number of the Hessian affects the convergence, momentum and acceleration, momentum for quadratic optimization, convergence proof using first order ODE system. (Coding) Writing accelerated optimizer using PyTorch optim interface, manipulating the cache using param_groups, load state_dict of a model and pass it to the optimizer using a closure().

Module 5

Duration: Week 11-12

Problem: how to minimize a convex and smooth function $\displaystyle\min_{\mathbf{x}\in \mathbb{R}^N} L(\mathbf{x})$ using various second order methods (mainly quasi-Newton methods).

Examples: Same as module 2, but this time we try to exploit the information from the Hessian of an objective function.

Material: Review of the Math 449 second half, convergence analysis of Newton’s method, quasi-Newton method. (Coding) Implement the quasi-Newton method in PyTorch optim interface with autograd.

Module 6

Duration: Week 13-14

Problem: Advanced techniques to optimize $\displaystyle\min_{\mathbf{w}\in \mathbb{R}^n} \frac{1}{N}\sum_{i=1}^N L_i(\mathbf{w})$ and final project Q&A sessions.

Examples: Same with Module 5 example, but this time we try to exploit the information from more data and more iterations.

Material: variance reduction techniques such as stochastic variance reduced gradient(SVRG), SAGA, coordinate descent. (Coding) Discussions on the parameter tuning for the final project of the Kaggle in-class competition.