Anming Gu
I'm a recent graduate of Boston University and an incoming PhD student. I am currently working with Prof. Edward Chien and Kristjan Greenewald on optimal transport for machine learning.
My graduate coursework includes:
- Mathematics: Functional Analysis, Stochastic Calculus, Mathematics of Deep Learning, PDEs, Stochastic PDEs
- Computer Science: Complexity Theory, Mathematical Methods for Theoretical Computer Science
Teaching experience:
- Algorithmic Data Mining, S25
- Analysis of Algorithms, S22, F24, S25
- Algebraic Algorithms, F24
- Theory of Computation, S24
- Concepts of Programming Languages, F23
CV / Google Scholar / Github
Research
I'm interested in optimal transport, optimization/sampling, robust statistics, and differential privacy. I'm also more broadly interested in problems at the intersection of probability, theoretical computer science, and machine learning.
General research directions that seem interesting to me:
- Applications of sampling: diffusion, functional inequalities, spin glasses, and stochastic localization
- Interplay between differential privacy and robust statistics
- Langevin dynamics has been shown to be connected to sampling from the exponential mechanism in differential privacy. Are there any implications of mean-field Langevin dynamics for differential privacy?
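To make the connection above concrete, here is a minimal sketch with generic symbols (u is a score function with sensitivity Δ and ε is the privacy parameter; these are placeholders for illustration, not tied to any specific paper): the exponential mechanism samples from a Gibbs density, which is exactly the stationary law of an overdamped Langevin diffusion.

```latex
% Exponential mechanism target for a score u(D,\theta) with sensitivity \Delta:
\pi_D(\theta) \;\propto\; \exp\!\Big(\tfrac{\varepsilon}{2\Delta}\, u(D,\theta)\Big),
\qquad
% Langevin dynamics whose stationary distribution is \pi_D:
d\theta_t \;=\; \tfrac{\varepsilon}{2\Delta}\,\nabla_\theta u(D,\theta_t)\, dt \;+\; \sqrt{2}\, dB_t .
```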
(α-β) denotes alphabetical order, * denotes equal contribution, and ‡ denotes student advising
Mirror Mean-Field Langevin Dynamics
In preparation.
link to come
The mean-field Langevin dynamics minimizes an entropy-regularized nonlinear convex functional over Wasserstein space, and it has gained attention recently due to its connection to noisy gradient descent for mean-field two-layer neural networks. We extend the analysis of mean-field Langevin dynamics to the mirror setting, where the optimization is constrained to a convex subset of Euclidean space.
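For context, here is the standard unconstrained formulation written with generic symbols (E is the convex energy, λ the entropic regularization strength); the mirror-map construction for the constrained case is what the paper develops.

```latex
% Entropy-regularized objective over probability measures \mu on \mathbb{R}^d:
F(\mu) \;=\; E(\mu) \;+\; \lambda \int \log\frac{d\mu}{dx}\, d\mu,
\qquad
% Mean-field Langevin dynamics, with \mu_t = \mathrm{Law}(X_t):
dX_t \;=\; -\nabla_x \frac{\delta E}{\delta \mu}(\mu_t)(X_t)\, dt \;+\; \sqrt{2\lambda}\, dB_t .
```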
Differentially Private Wasserstein Barycenters
(α-β) Mark Bun, Edward Chien, Kristjan Greenewald, Anming Gu, Sasidhar Kunapuli‡
In preparation.
link to come / code to come
A Wasserstein barycenter is the mean of a set of probability measures under the optimal transport metric, with numerous applications in machine learning, statistics, and computer graphics. In applications, the input measures are often empirical distributions built from datasets, so the output barycenter should be privatized when those datasets contain sensitive records. We provide the first differentially private algorithms for approximately computing Wasserstein barycenters of empirical distributions.
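As a reminder of the object being privatized, here is the standard definition (the weights w_i are a modeling choice; the privacy mechanism itself is described in the paper):

```latex
% Wasserstein barycenter of \mu_1, \dots, \mu_n with weights w_i \ge 0, \sum_i w_i = 1:
\bar{\mu} \;\in\; \operatorname*{arg\,min}_{\nu \in \mathcal{P}_2(\mathbb{R}^d)}
\;\sum_{i=1}^{n} w_i\, W_2^2(\nu, \mu_i).
```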
Compute-Optimal LLMs Provably Generalize Better with Scale
Marc Anton Finzi, Sanyam Kapoor, Diego Granziol, Anming Gu, Christopher De Sa, J Zico Kolter, Andrew Gordon Wilson
International Conference on Learning Representations, 2025.
openreview
Why do larger language models generalize better? To address this question, we develop generalization bounds on the LLM pretraining objective in the compute-optimal regime. We prove a novel, fully empirical Freedman-type martingale concentration inequality that tightens existing bounds by accounting for the low loss variance. This variance decreases for larger models, so our generalization bounds can actually become tighter as models grow. We pair these findings with an analysis of the theoretically achievable quantization bitrates, based on the Hessian of the loss, which controls the other component of the bound on the generalization gap. With these results, we move towards a more complete understanding of why LLMs generalize.
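For reference, the classical Freedman inequality that the paper's fully empirical variant refines, stated in its textbook form (the constants and conditions here are the standard ones, not the paper's):

```latex
% Martingale difference sequence (X_i) with X_i \le b a.s. and predictable variance
% V_n = \sum_{i=1}^{n} \mathbb{E}[X_i^2 \mid \mathcal{F}_{i-1}]:
\Pr\!\Big( \textstyle\sum_{i=1}^{n} X_i \ge t \;\text{ and }\; V_n \le \sigma^2 \Big)
\;\le\; \exp\!\Big( -\frac{t^2}{2(\sigma^2 + b t / 3)} \Big).
```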
Partially Observed Trajectory Inference using Optimal Transport and a Dynamics Prior
Anming Gu, Edward Chien, Kristjan Greenewald
International Conference on Learning Representations, 2025.
Preliminary version in OPT Workshop on Optimization for Machine Learning, 2024. [link]
arXiv / code / thesis slides / poster
Trajectory inference is the problem of recovering a stochastic process from its temporal marginals. We consider the setting where we cannot observe the process directly but have access to a known velocity field. Using tools from optimal transport, stochastic calculus, and optimization theory, we show that a minimum entropy estimator recovers the latent trajectory of the process, and we provide theoretical guarantees that the estimator converges to the ground truth as the observations become dense in the time domain. We also provide empirical results demonstrating the robustness of our method.
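Schematically, the estimator has the following flavor (the fitting functional Fit, the observed marginals, and the reference measure W^τ induced by the prior dynamics are placeholder symbols; see the paper for the exact objective and observation model):

```latex
% Minimum-entropy estimator over path measures R, given marginal observations
% \hat{\rho}_{t_1}, \dots, \hat{\rho}_{t_N} and a reference measure W^{\tau} built from the known dynamics:
\min_{R} \;\sum_{i=1}^{N} \mathrm{Fit}\big(\hat{\rho}_{t_i},\, (e_{t_i})_{\#} R\big)
\;+\; \lambda\, H\big(R \,\|\, W^{\tau}\big).
```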
k-Mixup Regularization for Deep Learning via Optimal Transport
Kristjan Greenewald, Anming Gu, Mikhail Yurochkin, Justin Solomon, Edward Chien
Transactions on Machine Learning Research, 2023.
arXiv / code
Mixup is a regularization technique for training neural networks that perturbs each training input in the direction of another, randomly chosen training input. We propose a new variant of mixup that uses optimal transport to perturb training data in the direction of similar training data rather than arbitrary ones. We show theoretically and experimentally that our method is more effective than mixup at improving generalization performance.
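A simplified sketch of the idea, not the paper's official implementation (the batch size k, the Beta parameter alpha, and the squared-Euclidean cost are illustrative choices): match two random k-sample batches with an optimal assignment, then interpolate matched pairs as in standard mixup.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def k_mixup_batch(x1, y1, x2, y2, alpha=1.0):
    """Illustrative k-mixup step (simplified sketch).

    x1, x2: arrays of shape (k, d), two randomly drawn k-sample batches.
    y1, y2: arrays of shape (k, c), one-hot labels for each batch.
    Matches the batches under a squared-Euclidean optimal assignment,
    then convexly combines the matched pairs as in standard mixup.
    """
    # Squared-Euclidean cost between the two k-sample batches.
    cost = ((x1[:, None, :] - x2[None, :, :]) ** 2).sum(-1)
    # Both batches are uniform k-point empirical measures, so the optimal
    # transport plan is a permutation and a linear assignment solver suffices.
    rows, cols = linear_sum_assignment(cost)
    lam = np.random.beta(alpha, alpha)
    x_mix = lam * x1[rows] + (1 - lam) * x2[cols]
    y_mix = lam * y1[rows] + (1 - lam) * y2[cols]
    return x_mix, y_mix


# Example usage with random data (k=8 samples, d=32 features, c=10 classes).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x1, x2 = rng.normal(size=(8, 32)), rng.normal(size=(8, 32))
    y1 = np.eye(10)[rng.integers(0, 10, 8)]
    y2 = np.eye(10)[rng.integers(0, 10, 8)]
    x_mix, y_mix = k_mixup_batch(x1, y1, x2, y2)
    print(x_mix.shape, y_mix.shape)
```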