I completed my PhD in Mathematics and Statistics at MIT under the supervision of Philippe Rigollet, my thesis is titled "Likelihood-Free Hypothesis Testing and Applications of the Energy Distance". Prior to MIT I obtained a Master’s in Mathematics and Statistics from Corpus Christi College at the University of Oxford. I am currently working as a quant at Citadel Securities in NYC.

In my free time I enjoy playing piano.

Publications

All my work can also be found on google scholar.

Density estimation using the perceptron Patrik Róbert Gerber, Tianze Jiang, Yury Polyanskiy and Rui Sun Journal of Machine Learning Research (2025) [abstract] [arXiv]
We propose a new density estimation algorithm. Given n i.i.d. samples from a distribution belonging to a class of densities on ℝ^d, our estimator outputs any density in the class whose "perceptron discrepancy" with the empirical distribution is at most O(√d/√n). The perceptron discrepancy between two distributions is defined as the largest difference in mass that they place on any halfspace of ℝ^d. It is shown that this estimator achieves expected total variation distance to the truth that is almost minimax optimal over the class of densities with bounded Sobolev norm and Gaussian mixtures. This suggests that regularity of the prior distribution could be an explanation for the efficiency of the ubiquitous step in machine learning that replaces optimization over large function spaces with simpler parametric classes (e.g. in the discriminators of GANs).

We generalize the above to show that replacing the "perceptron discrepancy" with the generalized energy distance of Székeley-Rizzo further improves total variation loss. The generalized energy distance between empirical distributions is easily computable and differentiable, thus making it especially useful for fitting generative models. To the best of our knowledge, it is the first example of a distance with such properties for which there are minimax statistical guarantees.
Likelihood-free hypothesis testing Patrik Róbert Gerber and Yury Polyanskiy IEEE Transactions on Information Theory (2024) [abstract] [arXiv]
Consider the problem of testing Z ~ ℙᵐ vs Z ~ ℚᵐ from m samples. Generally, to achieve a small error rate it is necessary and sufficient to have m≍1/ϵ², where ϵ measures the separation between ℙ and ℚ in total variation (𝖳𝖵). Achieving this, however, requires complete knowledge of the distributions ℙ and ℚ and can be done, for example, using the Neyman-Pearson test. In this paper we consider a variation of the problem, which we call likelihood-free (or simulation-based) hypothesis testing, where access to ℙ and ℚ (which are a priori only known to belong to a large non-parametric family P) is given through n iid samples from each. We demostrate existence of a fundamental trade-off between n and m given by nm ≍ n²_𝖦𝗈𝖥(ϵ,P), where n_𝖦𝗈𝖥 is the minimax sample complexity of testing between the hypotheses H₀: ℙ=ℚ vs H₁: 𝖳𝖵(ℙ,ℚ)≥ϵ. We show this for three non-parametric families P: β-smooth densities over [0,1]ᵈ, the Gaussian sequence model over a Sobolev ellipsoid, and the collection of distributions P on a large alphabet [k] with pmfs bounded by c/k for fixed c. The test that we propose (based on the L²-distance statistic of Ingster) simultaneously achieves all points on the tradeoff curve for these families. In particular, when m≫1/ϵ² our test requires the number of simulation samples n to be orders of magnitude smaller than what is needed for density estimation with accuracy ≍ϵ (under 𝖳𝖵). This demonstrates the possibility of testing without fully estimating the distributions.
Kernel-Based Tests for Likelihood-Free Hypothesis Testing Patrik Róbert Gerber, Tianze Jiang, Yury Polyanskiy and Rui Sun NeurIPS (2023) [abstract] [arXiv]
Given n observations from two balanced classes, consider the task of labeling an additional m inputs that are known to all belong to one of the two classes. Special cases of this problem are well-known: with complete knowledge of class distributions (n=∞) the problem is solved optimally by the likelihood-ratio test; when m=1 it corresponds to binary classification; and when m≈n it is equivalent to two-sample testing. The intermediate settings occur in the field of likelihood-free inference, where labeled samples are obtained by running forward simulations and the unlabeled sample is collected experimentally. In recent work it was discovered that there is a fundamental trade-off between m and n: increasing the data sample m reduces the amount n of training/simulation data needed. In this work we (a) introduce a generalization where unlabeled samples come from a mixture of the two classes - a case often encountered in practice; (b) study the minimax sample complexity for non-parametric classes of densities under maximum mean discrepancy (MMD) separation; and (c) investigate the empirical performance of kernels parameterized by neural networks on two tasks: detection of the Higgs boson and detection of planted DDPM generated images amidst CIFAR-10 images. For both problems we confirm the existence of the theoretically predicted asymmetric m vs n trade-off.
Minimax optimal testing by classification Patrik Róbert Gerber, Yanjun Han and Yury Polyanskiy COLT (2023) [abstract] [arXiv]
This paper considers an ML inspired approach to hypothesis testing known as classifier/classification-accuracy testing (𝖢𝖠𝖳). In 𝖢𝖠𝖳, one first trains a classifier by feeding it labeled synthetic samples generated by the null and alternative distributions, which is then used to predict labels of the actual data samples. This method is widely used in practice when the null and alternative are only specified via simulators (as in many scientific experiments). We study goodness-of-fit, two-sample (𝖳𝖲) and likelihood-free hypothesis testing (𝖫𝖥𝖧𝖳), and show that 𝖢𝖠𝖳 achieves (near-)minimax optimal sample complexity in both the dependence on the total-variation (𝖳𝖵) separation ϵ and the probability of error δ in a variety of non-parametric settings, including discrete distributions, d-dimensional distributions with a smooth density, and the Gaussian sequence model. In particular, we close the high probability sample complexity of 𝖫𝖥𝖧𝖳 for each class. As another highlight, we recover the minimax optimal complexity of 𝖳𝖲 over discrete distributions, which was recently established by Diakonikolas et al. (2021). The corresponding 𝖢𝖠𝖳 simply compares empirical frequencies in the first half of the data, and rejects the null when the classification accuracy on the second half is better than random.
Fisher information lower bounds for sampling Sinho Chewi, Patrik Róbert Gerber, Holden Lee and Chen Lu ALT (2022) [abstract] [arXiv]
We prove two lower bounds for the complexity of non-log-concave sampling within the framework of Balasubramanian et al. (2022), who introduced the use of Fisher information (FI) bounds as a notion of approximate first-order stationarity in sampling. Our first lower bound shows that averaged LMC is optimal for the regime of large FI by reducing the problem of finding stationary points in non-convex optimization to sampling. Our second lower bound shows that in the regime of small FI, obtaining a FI of at most ε² from the target distribution requires poly(1/ε) queries, which is surprising as it rules out the existence of high-accuracy algorithms (e.g., algorithms using Metropolis-Hastings filters) in this context.
The query complexity of sampling from strongly log-concave distributions in one dimension Sinho Chewi, Patrik Róbert Gerber, Chen Lu, Thibaut Le Gouic and Philippe Rigollet COLT (2022) [abstract] [arXiv]
We establish the first tight lower bound of Ω(loglogκ) on the query complexity of sampling from the class of strongly log-concave and log-smooth distributions with condition number κ in one dimension. Whereas existing guarantees for MCMC-based algorithms scale polynomially in κ, we introduce a novel algorithm based on rejection sampling that closes this doubly exponential gap.
Rejection sampling from shape-constrained distributions in sublinear time Sinho Chewi, Patrik Róbert Gerber, Chen Lu, Thibaut Le Gouic and Philippe Rigollet AISTATS (2022) [abstract] [arXiv]
We consider the task of generating exact samples from a target distribution, known up to normalization, over a finite alphabet. The classical algorithm for this task is rejection sampling, and although it has been used in practice for decades, there is surprisingly little study of its fundamental limitations. In this work, we study the query complexity of rejection sampling in a minimax framework for various classes of discrete distributions. Our results provide new algorithms for sampling whose complexity scales sublinearly with the alphabet size. When applied to adversarial bandits, we show that a slight modification of the Exp3 algorithm reduces the per-iteration complexity from O(K) to O(log²K), where K is the number of arms.
Gaussian discrepancy: a probabilistic relaxation of vector balancing Sinho Chewi, Patrik Róbert Gerber, Philippe Rigollet and Paxton Turner Discrete Applied Mathematics (2022) [abstract] [arXiv]
We introduce a novel relaxation of combinatorial discrepancy called Gaussian discrepancy, whereby binary signings are replaced with correlated standard Gaussian random variables. This relaxation effectively reformulates an optimization problem over the Boolean hypercube into one over the space of correlation matrices. We show that Gaussian discrepancy is a tighter relaxation than the previously studied vector and spherical discrepancy problems, and we construct a fast online algorithm that achieves a version of the Banaszczyk bound for Gaussian discrepancy. This work also raises new questions such as the Komlo ́s conjecture for Gaussian discrepancy, which may shed light on classical discrepancy problems.
Averaging on the Bures-Wasserstein manifold: dimension-free convergence of gradient descent Jason M Altschuler, Sinho Chewi, Patrik Róbert Gerber and Austin J Stromme NeurIPS, Spotlight (2021) [abstract] [arXiv]
We study first-order optimization algorithms for computing the barycenter of Gaussian distributions with respect to the optimal transport metric. Although the objective is geodesically non-convex, Riemannian GD empirically converges rapidly, in fact faster than off-the-shelf methods such as Euclidean GD and SDP solvers. This stands in stark contrast to the best-known theoretical results for Riemannian GD, which depend exponentially on the dimension. In this work, we prove new geodesic convexity results which provide stronger control of the iterates, yielding a dimension-free convergence rate. Our techniques also enable the analysis of two related notions of averaging, the entropically-regularized barycenter and the geometric median, providing the first convergence guarantees for Riemannian GD for these problems.

Teaching

During my Ph.D. I was a teaching assistant for the following courses:

6.3720/6.3722 — Introduction to Statistical Data Analysis (2024 Spring)
18.821 — Mathematics Project Laboratory (2023 Fall)
18.650 — Fundamentals of Statistics (2022 Fall, 2023 Spring)
18.656/9.521/IDS.160 — Mathematical Statistics - A non-asymptotic approach (2022 Spring)
15.070J/6.265J — Discrete Probability and Stochastic Processes (2021 Spring)
18.675 — Theory of Probability (2020 Fall)

I was an academic mentor at √mathroots (2020, 2022), which is a mathematical talent accelerator summer program for high-potential high school students from underrepresented backgrounds or underserved communities.