New preprint: Machine teaching!

Hot off the presses, a new paper by me, Xuezhou Zhang, and Xiaojin (Jerry) Zhu. The paper is titled “An Optimal Control Approach to Sequential Machine Teaching” and is now available on arXiv.

The paper is about “machine teaching”, which can be thought of as the dual problem to “machine learning”. The general paradigm is that we are trying to learn a model (described by a set of parameters) that characterizes some process. We observe data samples drawn from this process and the learning algorithm uses the data to refine its model parameters in an iterative fashion. Machine learning is the forward problem: using the given data to update the parameters. Different choices of algorithms can lead to different learning rates, for example. Machine teaching is the scenario where you know the model you’re trying to learn and you know the learner’s algorithm. Your task is to select the sequence of data samples that leads to the fastest possible learning.

Machine teaching comes up when investigating issues of security and robustness of machine learning algorithms. For example: how susceptible is our algorithm to a “data poisoning attack”, where false data is injected to steer our algorithm off course? Machine teaching also has potential use in designing better curricula for classroom teaching tasks.

In this paper, we apply optimal control theory to the simple case of an SGD learner with a squared loss and training examples that are constrained to satisfy a norm bound. We prove that there always exists an optimal trajectory through parameter space that lies in a 2D subspace, regardless of the ambient dimension of the parameter space. In other words, optimal teaching trajectories can be simple even when there are millions of parameters to learn. Also, we show through numerical experiments that the optimal teaching strategy can be quite counter-intuitive in many cases — optimal trajectories often take a circuitous route through parameter space and yet can perform arbitrarily better than obvious heuristics such as a “one-step greedy approach” or a “straight line” approach.

Princeton Day of Optimization 2018

I attended the inaugural Princeton Day of Optimization hosted by the Princeton Operations Research and Financial Engineering (ORFE). The theme of the day was “The intersection of optimization, control, and machine learning,” and the event did not disappoint! There were six keynote talks on a variety of topics related to the theme and a poster session over lunch showcasing the works of many of the graduate student attendees. The event was organized by my colleague Amir Ali Ahmadi and I must say he did a fantastic job. Putting together an event like this is a tremendous amount of work and everything happened smoothly. The Princeton Day of Optimization will be held every two years and I’m looking forward to the next installment! This was my first time seeing the Princeton campus and I took the opportunity to walk around and admire some of the beautiful historic buildings on the campus. Pictured right is the Princeton University Chapel.

New preprint: A canonical form for distributed optimization algorithms

We have just posted a preprint titled: “A Canonical Form for First-Order Distributed Optimization Algorithms”. It is available here and on arXiv. This is joint work with my student Akhil Sundararajan, my postdoc Bryan Van Scoy, and myself. Our work provides a canonical form for distributed optimization algorithms that will enable more streamlined analysis and design.

There has been much recent interest in the analysis and design of distributed optimization algorithms. In this context, “distributed” means that the computation happens on different computers across a network. Each computing node may have access to different local information. The goal is for each node to arrive at the solution of a global optimization problem that involves everybody’s data, whilst each node only performs local computation and communicates only with nearby nodes. All of this should happen without ever needing to store all the data in one place. This paradigm makes sense in cases where the individual nodes are remote and have limited computing capability, or we are trying to conserve power by transmitting and computing as little as possible. Also, we may not want a single node to gather and store all the available information for privacy or security reasons.

In the past 5 years, other researchers have proposed a variety of solutions to this problem. Proposed methods involve local computation, local memory updates, and communication with nearby nodes, in some prescribed order. Because there is a number of different ways to perform these actions, there is a combinatorial explosion of possible designs, analyses, and outcomes.

Our paper provides a way of streamlining research efforts in this area: we showed that in fact, there aren’t that many different possible distributed optimization algorithms. We were able to parameterize a broad sub-family of algorithms using only 5 scalar parameters. In doing so, we were able to neatly categorize existing methods and in some cases, show that different established algorithms were in fact the same algorithm!

Our hope is that this work will help streamline research efforts in distributed optimization in moving toward a more principled understanding of analysis and design.

New preprint: A general absolute stability theorem

We have just posted a preprint titled: “Unified Necessary and Sufficient Conditions for the Robust Stability of Interconnected Sector-Bounded Systems”. It is available here and on arXiv. This is joint work with my student Saman Cyrus and myself. Our work provides a new absolute stability result that generalizes several existing works and enables new analyses, specifically tight certification of exponential stability.

The problem of “absolute stability” is a classical robust control problem that has been studied for over 75 years. The basic setup is that a known system is interconnected with another system that is nonlinear, uncertain, difficult to model, or otherwise troublesome. The goal is to certify “robust” stability. That is: can we guarantee that the interconnected system will be stable for all possible instances of the troublesome system?

The most classical version of this problem is when the troublesome system is “sector-bounded”. This happens for example if our system contains a saturation nonlinearity. Most results for this setting take the form: “if the known system satisfies a certain property, then the interconnected system will be robustly stable”. In other words, they are “sufficient” conditions for stability. Classical examples in the literature include passivity theory, the small-gain theorem, and the circle criterion. There have also been many efforts to generalize or unify these results. For example, researchers have observed that if we use a particular loop transformation, the three aforementioned results are one and the same!

Our contribution in this paper was to formulate a general version of the classical robust stability theorem that has the following properties:

  • The theorem is both necessary and sufficient. So if the condition fails, there must exist a nonlinearity that yields an unstable interconnection. This parallels existing necessary conditions from the literature.
  • It is formulated in a semi-inner product space. This distills the result to its essential features. In this general space, there need not be any notion of “time”. So we can do away with the typical requirements of stability, causality, and boundedness.
  • Existing results can be recovered by simply choosing the correct vector space and associated semi-inner product. This means we can recover passivity, small-gain, circle, discrete vs continuous time, and more.
  • We also provide a computationally verifiable linear matrix inequality (LMI) that allows one to efficiently check the conditions of the theorem in the special cases of ordinary stability (a known result) and exponential stability (a new result).
Finally, and perhaps most importantly, we can apply our general result to new semi-inner product spaces to obtain new and previously unknown results! In the paper, we showcase one such example, where we formulate a new result that gives necessary and sufficient conditions for exponential stability with a prescribed rate of convergence.

ISMP’18 in Bordeaux

I attended the 23rd International Symposium on Mathematical Programming in Bordeaux, France. ISMP is an optimization conference that is held once every three years and it was my first time ever attending. It was also my first time visiting the south of France since my childhood. A nice opportunity to enjoy French cooking and do a bit of sightseeing… On the right is the Église Sainte-Croix de Bordeaux, a Roman Catholic church that dates back to the 11th century.

At the conference, I gave a 30-min talk on our recent work in distributed optimization. Distributed optimization has been a popular recent topic in the field of controls. The basic setup is that a distributed network of agents are trying to jointly optimize a cost function while only having access to a subset of the globally available data. For example, you can imagine a network of robots equipped with sensors, each of which can gather local information, perform computations, and share data with its neighbors. The goal is to have all agents eventually converge on a global model that accounts for all available data. Performing the computation in a decentralized fashion results in a system that is robust to changes such as an agent malfunctioning or communication links randomly failing.

Specifically, we used a robust control framework to analyze and compare the performance of various distributed optimization algorithms. While pure iterative optimization algorithms (with no distributed component) and pure consensus algorithms (information sharing but no optimization) have been extensively analyzed, the theory behind distributed optimization (a combination of optimization and consensus) is not as mature. Our hope is that the sorts of automated tools we are developing will help further our understanding of distributed algorithms, in the sense of analysis as well as design. This was joint work with my student Akhil Sundararajan and my postdocs Bryan Van Scoy and Bin Hu. The slides for my talk are available here.

My postdoc Bryan Van Scoy also came to ISMP and gave a talk of his own on an accelerated optimization algorithm he developed called the Triple Momentum Method. It is the fastest known accelerated method in the sense that it has the best worst-case performance over the set of strongly convex functions. For more information and for a link to slides, check out Bryan’s website!

ACC’18 in Milwaukee

I attended the 2018 American Control Conference in Milwaukee, WI. This was a unique opportunity, being a local conference (only a short bus ride away from Madison!) so I took the opportunity to bring the whole research group (photo on the right).

At the conference, my student Saman Cyrus presented our paper (also co-authored by Bin Hu and Bryan Van Scoy) titled “A robust accelerated optimization algorithm for strongly convex functions”. In this work, we proposed an accelerated algorithm with a single tuning parameter that controls the trade-off between convergence rate and robustness to multiplicative gradient noise. We call it the “Robust Momentum Method” (RMM). Existing algorithms tend to be “slow and robust” like Gradient Descent or “fast and fragile” like the Heavy-Ball method or Nesterov’s method. With RMM, the single tuning parameter directly controls this trade-off, leading to intermediate options that are both fast and robust. This was Saman’s first conference presentation and he gave an excellent talk! His slides are available here.

I enjoy attending seminars outside my field, and there were two excellent plenaries at ACC this year that fit the bill. Robert Wood from Harvard spoke about the engineering challenge of building micro-robots that mimic flying insects. It’s a truly multi-disciplinary endeavor combining micro-fabrication, fluid mechanics, structural mechanics, material science, electronics, control engineering, and origami! The second talk that really expanded my horizons was by Emory Brown from Harvard Medical School and MIT. Emory is a professor of Medical Engineering, Computational Neuroscience, Health Sciences and Technology, and Anesthesia. He is also a practicing anesthesiologist and has a PhD in statistics. The talk was about controlling one of the most complicated things in the world: the human brain.

Manuscript on Lyapunov functions for optimization algorithms

Adrien Taylor, Bryan Van Scoy, and I have recently posted a manuscript on arxiv entitled “Lyapunov Functions for First-Order Methods: Tight Automated Convergence Guarantees”.

Lyapunov functions are a mathematical tool used in the analysis of dynamical systems for implicitly certifying stability. The rough idea is that we look for some sort of “energy function” that we can prove always decreases, and then we can relate the rate of decay of energy to the rate of decay of a more meaningful quantity, such as the state of our dynamical system.

In the context of optimization algorithms, the dynamical system is the algorithm itself (a discrete-time dynamical system), and the function we are optimizing is treated as uncertainty. The function is uncertain because we don’t assume we know it precisely, just that we know some of its properties (e.g. strong convexity and Lipschitz gradients). The goal is to look for a Lyapunov function that will always decrease regardless of what our function happens to be, so long as it satisfies the aforementioned properties. Finding the best Lyapunov function is a difficult problem in general, so one typically uses a “Lyapunov candidate”, which is a parameterized family of functions that can be efficiently searched by computational means. A popular choice is to use quadratic functions because in the case of a linear system, the existence of a quadratic Lyapunov function is necessary and sufficient for stability. In the case of robust optimization, where we are uncertain about the function, there is no such guarantee of quadratic necessity.

In this paper, we provide an efficient way of searching for quadratic Lyapunov functions that prove the robust stability of algorithms. Specifically, we show that for any fixed rate rho, the feasibility of a particular (small) semidefinite program is equivalent to the existence of a quadratic Lyapunov function that certifies a linear convergence rate of rho in the worst case. The key result here is the necessity. While previous results used semidefinite programs as sufficient conditions for convergence, our result is also necessary. So if our semidefinite program is infeasible, it means that there is no quadratic Lyapunov function that certifies stability with a rate of rho.


I am thrilled to announce that I have received the CAREER award from the National Science Foundation! Here is a description of the award from the NSF website:

CAREER: The Faculty Early Career Development (CAREER) Program is a Foundation-wide activity that offers the National Science Foundation’s most prestigious awards in support of early-career faculty who have the potential to serve as academic role models in research and education and to lead advances in the mission of their department or organization. Activities pursued by early-career faculty should build a firm foundation for a lifetime of leadership in integrating education and research. NSF encourages submission of CAREER proposals from early-career faculty at all CAREER-eligible organizations and especially encourages women, members of underrepresented minority groups, and persons with disabilities to apply.

This 5-year award will enable me and my research group to pursue our work in applying tools from robust control to the analysis and design of iterative algorithms. Here is a link to the official announcement from the NSF.

I am very grateful to the National Science Foundation for their support and I am looking forward to pursuing this research in the coming years!

Manuscript on analyzing Approximate SGD

Bin Hu and I have recently posted a manuscript on arxiv entitled “Analysis of Approximate Stochastic Gradient using quadratic constraints and sequential semidefinite programs”. Approximate Stochastic Gradient refers to the standard SGD algorithm, which is widely used in solving optimization problems, but with the twist that gradient information is noisy.

In this paper, we investigate the case where noise is either additive, multiplicative, or both. We also consider different cases concerning the objective functions. For example, the case where each component function is smooth (not necessarily convex) but the sum of the component functions is strongly convex. Generally, our approach allows one to specify different constraints on the individual functions vs. the sum of the functions and obtain worst-case (and average-case) performance guarantees. Each case reduces to analyzing the feasibility of a small semidefinite program that can be evaluated numerically or analytically.

Tuning the stepsize presents interesting challenges, because you generally have to choose between converging rapidly to a large ball or converging slowly to a smaller ball. We investigate what happens when using a constant stepsize and we also look at the optimal case, i.e. optimally tuning stepsize at each iteration by solving a Bellman equation. This approach gives insight into the intricate trade-offs that couple stepsize selection, convergence rate, optimization accuracy, and robustness to gradient inaccuracy.

CMO-BIRS workshop: Beyond Convexity

I attended a week-long workshop titled “Beyond Convexity: Emerging Challenges in Data Science”, hosted by the Casa Matemática Oaxaca (CMO) in Oaxaca, Mexico.

The workshop consisted of talks, breakout sessions, and many discussions on topics including semidefinite programming, nonlinear/nonconvex optimization, deep learning, and statistics. Much time was spent brainstorming about unsolved problems and discussing emerging topics in data science. The combination of a beautiful and secluded venue, and a small size (roughly 30 attendees) led to many thought-provoking discussions. I returned to Madison with new knowledge, new ideas, and new colleagues. Couldn’t ask for more!

As part of the workshop, I gave a 30-minute talk where I presented recent work by Bin Hu and myself on using dissipativity theory to analyze and interpret the convergence properties of optimization algorithms. A video of my talk is available here and my slides are available here.

I’m grateful for the hard work put in by the organizers: Tamara Kolda (Sandia National Labs), and my colleagues at UW-Madison: Rob Nowak, Becca Willett, and Stephen Wright. Bravo! The photo above is a panorama taken at Monte Albán, one of Oaxaca’s most famous archaeological sites.