New preprint: Speed-robustness trade-off for optimization algorithms

Excited to share a new manuscript by Bryan Van Scoy and me, titled “The Speed-Robustness Trade-Off for First-Order Methods with Additive Gradient Noise”. It is now available on arXiv.

Background

When solving an optimization problem, a ubiquitous approach is to use an iterative algorithm that finds progressively better candidate solutions until a satisfactory solution is obtained. In cases where the problem instances are very large, such as in machine learning applications, gradient-based iterative methods are the most popular. Think of the task of finding the point of highest elevation in a vast landscape. At each step, you measure the slope of the terrain, and move in the direction of steepest increase.

Since we don’t know what the terrain shape will be ahead of time, we want algorithms that work well for a wide variety of terrains. For this reason, a popular way of evaluating algorithms is to look at their worst-case convergence rate on a given class of functions. For example, we might be solving a large linear regression problem, so we know ahead of time that the function we are optimizing is a convex quadratic. Much work has gone into finding algorithms that are “optimal” for a given class of functions. For example, we know that the Heavy Ball method is optimal for convex quadratics and the Triple Momentum Method is optimal for smooth strongly convex functions.

However, worst-case convergence rate is not always the best metric to use. To continue the hill-climbing analogy, the terrain might be covered with grass and rocks, so the local slope measurements we take might be misleading. In such cases, we might want to design our algorithm differently, perhaps including some form of averaging to mitigate this uncertainty in our measurements, or taking smaller steps so we don’t move too far in the wrong direction. In the context of machine learning, evaluating the gradient of the function we want to optimize might be a costly procedure that can only be done approximately. In other words, our gradients are noisy. It turns out that algorithms designed to perform “optimally” in the worst-case sense don’t usually fare well when used in noisy scenarios.

A more subtle point is that not all metrics used to quantify performance are suitable for algorithm design. For example, when the class of functions of interest is very large (everything from a desert plain to rocky canyons), it is difficult to design an algorithm that performs well in the worst case without being excessively conservative. We expect some terrains to be more challenging than others, but if we try to design a single algorithm whose worst-case performance is as good as possible, we typically end up with algorithms that always perform at their worst-case level. Even when presented with “easy” terrains, such algorithms behave as if the terrain were more difficult. In summary, algorithms that perform well in the worst case don’t necessarily perform well on average. Another analogy: if you’re constantly paranoid about missing flights so you always arrive at the airport 5 hours early, then you will never miss a flight. But your trips will take longer than they should. Is it worth it? This depends on your sensitivity to risk!

Contributions

We set out to design algorithms that are less conservative than those obtained using a worst-case performance metric, in the hope that it would lead to more practical algorithms. To this end, we investigate a second performance metric, which evaluates the sensitivity to gradient noise. Roughly, we imagined that we were very close to our optimal point (so the gradient should be zero), but that we received an incorrect gradient measurement due to the noise. How much will this incorrect gradient throw us off? If the algorithm is robust, a single bad measurement will have little effect. But if the algorithm is sensitive, then the effect will be large. Related concepts from control theory include the “gain” and the “H2 norm” of a dynamical system. For some algorithms, it is clear that worst-case convergence rate and sensitivity to gradient noise are competing objectives. Here is a simulation of Gradient Descent on convex quadratic functions (averaged over 1000 trials), where we vary the stepsize (same as the hill-climbing example, but you change how far you walk at each step). Taking larger steps gives you faster initial convergence (speed), but causes you to settle farther away from the optimal point (steady-state error) because the gradient error prevents you from getting closer. The right panel is the same as the left panel, but shown on a log scale to accentuate differences.
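To make the trade-off concrete, here is a minimal sketch of this kind of simulation (my own toy Python, with made-up noise level and stepsizes, not the exact experiment from the paper):

```python
import numpy as np

# Noisy gradient descent on a diagonal quadratic f(x) = 0.5 * x' Q x,
# averaged over many trials. Larger stepsizes converge faster at first
# but settle at a larger steady-state error.
rng = np.random.default_rng(0)
n, trials, iters, sigma = 10, 1000, 300, 0.1
eigs = np.linspace(1.0, 5.0, n)   # Hessian eigenvalues in [m, L] = [1, 5]

for alpha in (0.05, 0.15, 0.30):  # hypothetical stepsizes
    x = rng.standard_normal((trials, n))   # one row per trial
    for _ in range(iters):
        noisy_grad = x * eigs + sigma * rng.standard_normal((trials, n))
        x = x - alpha * noisy_grad
    print(f"alpha={alpha:.2f}: mean final error {np.linalg.norm(x, axis=1).mean():.4f}")
```

Running this, the largest stepsize reaches the noise floor fastest but the floor itself is highest; the smallest stepsize is the reverse.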

Although Gradient Descent makes this trade-off explicit, similar trade-offs exist for all other algorithms. The problem is that algorithms such as Heavy Ball or Triple Momentum have two or three tunable parameters, and it’s not clear how they should be tuned to get the best balance between speed and sensitivity. It’s not even clear that these are good starting points at all! What if an entirely different algorithm performed better? In our paper, we investigated three function classes:

  • $Q_{m,L}$: Strongly convex quadratic functions whose Hessian has eigenvalues in the interval $[m, L]$.
  • $F_{m,L}$: Smooth strongly convex functions whose strong convexity and Lipschitz parameters are $m$ and $L$, respectively.
  • $S_{m,L}$: Functions that satisfy the smoothness and strong convexity conditions with parameters ($m$, $L$) with respect to the optimal point, but not necessarily between all pairs of points. Such functions are not necessarily convex, but always have a unique optimal point.
These function classes are nested: $Q_{m,L} \subseteq F_{m,L} \subseteq S_{m,L}$. Therefore, we expect algorithms designed for $Q_{m,L}$ to be both faster and less sensitive than those designed for $F_{m,L}$ or $S_{m,L}$, since it is a smaller class of functions. In our work, we used different techniques to address each function class, but a common thread was that we searched for Lyapunov functions to certify robust stability.
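For concreteness, the one-point condition defining $S_{m,L}$ can be written roughly as follows (my paraphrase from memory; see the preprint for the precise definition): \[ \bigl(\nabla f(x) - m\,(x - x^\star)\bigr)^{\mathsf{T}} \bigl(\nabla f(x) - L\,(x - x^\star)\bigr) \leq 0 \quad \text{for all } x, \]where $x^\star$ is the optimal point. Every function in $F_{m,L}$ satisfies the analogous condition between all pairs of points, which is why $S_{m,L}$ is the larger class.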

For each function class, we designed a family of algorithms parameterized by a single scalar tuning parameter. This parameter explicitly trades off the worst-case convergence rate and the sensitivity to noise. We named our designs RHB, RAM, and RGD, respectively. Like the Gradient Descent example, our algorithm designs can be easily tuned depending on the desired balance. As a side note, Gradient Descent is not optimal for any of the three function classes mentioned above! Our designs are structurally similar to Triple Momentum, but with different tunings. When tuned in the traditional way (fastest worst-case convergence rate possible), we actually recover the best-known methods from the literature. We can also produce nice trade-off curves that show how each algorithm family performs. In the left panel, we used $m=1$ and $L=5$. In the right panel, we used $m=1$ and $L=50$.

The challenge was to produce algorithms that achieved a near-optimal trade-off, but also had simple algebraic forms that could be easily implemented on a computer. Our designs are not actually Pareto-optimal, but they are very close, and strike a good balance between optimality and simplicity. Our designs have the parametric form:
\[ x^{t+1} = x^t - \alpha\, \nabla\! f\bigl( x^t + \eta\,(x^t - x^{t-1})\bigr) + \beta\,(x^t - x^{t-1}) \]
For example, our RAM method, which is near-optimal for the class $F_{m,L}$, is given by:
\[ \alpha = \frac{(1+\rho)(1-\rho)^2}{m}, \quad \beta = \rho\,\frac{L\,(1-\rho+2\rho^2)-m\,(1+\rho)}{(L-m)(3-\rho)}, \quad\text{and}\quad \eta = \rho\,\frac{L\,(1-\rho^2)-m\,(1+2\rho-\rho^2)}{(L-m)(3-\rho)(1-\rho^2)} \]
Here, $1-\sqrt{\frac{m}{L}} \leq \rho < 1$ is the tuning parameter that controls the worst-case contraction factor. Smaller $\rho$ means faster worst-case convergence. When chosen as small as possible, we recover the Triple Momentum Method.
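To make the recipe concrete, here is a short Python transcription of the formulas above, run on a toy quadratic (my own illustrative sketch, not code from the paper):

```python
import numpy as np

def ram_params(m, L, rho):
    """RAM tuning, transcribed from the formulas in this post."""
    alpha = (1 + rho) * (1 - rho)**2 / m
    beta = rho * (L*(1 - rho + 2*rho**2) - m*(1 + rho)) / ((L - m)*(3 - rho))
    eta = rho * (L*(1 - rho**2) - m*(1 + 2*rho - rho**2)) / ((L - m)*(3 - rho)*(1 - rho**2))
    return alpha, beta, eta

m, L = 1.0, 10.0
rho = 1 - np.sqrt(m / L)            # smallest allowed rho: recovers Triple Momentum
alpha, beta, eta = ram_params(m, L, rho)

grad = lambda x: np.array([m, L]) * x   # toy quadratic with curvatures m and L
x_prev = x = np.array([1.0, 1.0])
for _ in range(100):
    y = x + eta * (x - x_prev)          # gradient evaluated at extrapolated point
    x, x_prev = x - alpha * grad(y) + beta * (x - x_prev), x
print(x)  # approaches the optimum at the origin
```

Choosing a larger $\rho$ (slower worst-case rate) makes the method less sensitive to gradient noise; that single knob is the whole trade-off.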

To verify near-optimality of our designs, we performed brute-force searches and produced plots such as the one below, which is for $F_{m,L}$ with $m=1$ and $L=10$. We searched through all 3-parameter algorithms with Triple-Momentum structure, each producing a single gray dot. We also included other popular algorithms such as Heavy Ball (HB), Triple Momentum (TM), Nesterov’s method (FG), and Gradient Descent (GD). Our algorithm (RAM) shows up as a curve rather than a dot because it is a one-parameter tunable family. As we can see, none of the gray dots lie below the RAM curve. We also plotted the Robust Momentum Method (RM), which was an earlier attempt at making this sort of trade-off that we presented at ACC’18 in Milwaukee.

There is a great deal more, but it’s too much to include in this blog post. If you find this interesting, take a look at our preprint! For comments/discussion, I posted a link on Twitter.

New preprint: Robust stabilization!

I’m excited to share our new preprint on robust stabilization, by Saman Cyrus and me. The title is “Generalized necessary and sufficient robust boundedness results for feedback systems”, and it is now available on arXiv. This work is the culmination of Saman’s PhD and formed the core of his thesis, which he recently successfully defended (!).

Robust stabilization is an old problem, dating back at least 70 years to the seminal work of Lur’e, Zames, and Willems. The basic idea is that a known system G is in feedback with an unknown nonlinearity H constrained to lie in some known set, and we seek sufficient conditions on G that ensure stability of the interconnection. In this work, we looked at the case where the nonlinearity is constrained to lie in a cone (also known as “sector-bounded”). Results of this type include the small-gain theorem, the circle criterion, passivity, and extended conicity. It may be possible to extend our ideas to dynamic versions of robust stability (multiplier theory, dissipativity, or integral quadratic constraints), but this is an area for future work.
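For reference, one common way to write the cone/sector condition for a nonlinearity $H$ with sector bounds $a \leq b$ is (notation mine, not the paper’s): \[ \langle H u - a\,u,\; H u - b\,u \rangle \leq 0 \quad \text{for all inputs } u. \]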

Robust stabilization results are typically formulated in the extended space L2e and can be expressed in a variety of ways, depending on which assumptions are made (causality, stability, linearity, time-invariance, etc.). Moreover, most of these results are only sufficient, while others are “necessary and sufficient”. Necessity can also mean two different things: when the conditions fail, can we construct worst-case signals only, or can we also construct worst-case nonlinearities?

Our aim in this work was to disentangle the different formulations of these results and distill them into a single unified result that makes as few assumptions as possible. We worked with relations in an arbitrary semi-inner product space, which allowed us to avoid the notions of “causality”, “stability”, “time-invariance”, and “well-posedness”. In fact, we don’t even assume a notion of “time”! In this setting, we prove a very general necessary and sufficient condition for robust boundedness that requires essentially no assumptions. When our condition fails to hold, we show how to explicitly construct worst-case signals, and an associated worst-case nonlinearity. Moreover, our construction produces a nonlinearity that happens to be linear, which is nice.

Specializing our result to causal operators on L2e, this is what happens:

  1. When our condition fails, we can still construct worst-case signals. This works even when the plant G is nonlinear. Note: although the S-procedure can be used to prove this result, it would require assuming G is linear and would be non-constructive. We provide a direct construction and no linearity assumption is needed.
  2. The second step of our construction (constructing a worst-case nonlinearity) does not work in the L2e case because it typically produces a non-causal nonlinearity.
  3. By making additional assumptions, namely that G is LTI, we can construct worst-case nonlinearities that are causal. Our construction produces an LTI nonlinearity (in fact, a pure delay).
These results coincide with what was found in the literature. For example, the necessary-and-sufficient circle criterion of Vidyasagar assumes an LTI plant and constructs a worst-case nonlinearity using a pure delay. Also, the necessary-and-sufficient small gain theorem of Zhou, Doyle, and Glover assumes an LTI plant and constructs a worst-case LTI nonlinearity. In the case where G is LTI, we also provide frequency-domain and LMI flavors of our result.

Our work unifies existing results from the literature in the following way:

  1. Existing necessary-and-sufficient results are special cases of our general result.
  2. Existing sufficient-only results follow directly from our general necessary-and-sufficient condition once we apply the appropriate relaxations. We give examples in the text of how this works.

This turned out to be a very educational research project for both Saman and myself, on a rich and interesting topic, and I hope we were able to do it justice! I also posted a Twitter thread summarizing the paper.

ACC’20 in Denver (online)

I attended the 2020 American Control Conference, which was supposed to take place in Denver, Colorado, but moved to a 100% online format as a precaution against COVID-19. I normally include a photo of the conference venue with these blog posts, but I don’t think anybody would be interested in seeing my home office setup so I’ll go with no photo this time!

At ACC, I presented a paper with Peter Seiler on the topic of using IQCs for algorithm synthesis. Iterative algorithms like gradient descent can be viewed as robust controllers. In robust control, we have an uncertain dynamical system we would like to control, and we assume the uncertainty is bounded. For example, we could have a mechanical or electrical system consisting of components that are subject to manufacturing tolerances and so we can’t know their exact parameters. Or perhaps their parameters change slowly over time as wear and tear sets in. The goal is to design a feedback controller that works for all systems in this bounded set. This robustness is desirable because if we build 100 identical robots, they won’t actually be identical. Using a robust controller means that we can design a single controller to use in all robots, and it will work even though the robots are not exactly identical. Optimization algorithms are very similar: we want to design an iterative optimization algorithm that works on a variety of different objective functions. The optimization algorithm plays the role of the controller and the objective function plays the role of the uncertain plant. In this paper, we showed how IQC synthesis tools from robust control theory can be used to design robust iterative algorithms.
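To give a flavor of the analogy (my notation, not taken from the paper): gradient descent with stepsize $\alpha$ can be written as a linear system in feedback with the gradient, \[ x^{t+1} = x^t - \alpha\, u^t, \qquad u^t = \nabla f(x^t), \]where the linear update plays the role of the controller and the gradient $\nabla f$ plays the role of the uncertain plant; IQCs then constrain how this uncertain block can behave.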

This year, ACC experimented with a new type of presentation. Rather than the standard 20-minute talk, presenters were given the option to do 3-minute “rapid interactive” talks. I opted for this new format, and since all talks had to be pre-recorded, I’m happy to share with you my 3-minute talk explaining our paper! Here is a link to the video. My short talk was also awarded one of the inaugural “Rapid Interactive People’s Choice Awards”!

Overall, I thought the conference was very well run, and I commend the organizers and staff for getting everything together under such difficult circumstances. I definitely miss the ability to meet with people face-to-face and the spontaneous encounters and discussions that happen at large conferences, but there are also advantages to an online format. There are no logistical/scheduling issues that arise from two talks happening at the same time. And the ability to pause a talk or watch it again later is very nice!

CDC’19 in Nice

I attended the 2019 Conference on Decision and Control in Nice, France. The photo above is a panoramic view of Nice taken from the top of Castle Hill. My research group had a solid showing at this year’s CDC with 3 papers. In no particular order, here are short summaries of each paper and links to the slides for each talk.

  • “Integral quadratic constraints: Exact convergence rates and worst-case trajectories” with my postdoc Bryan Van Scoy. This paper considers the problem of robust stability: in a feedback interconnection where one of the components is uncertain (but the uncertainty is bounded), when can we guarantee that the interconnection will be stable? The most general robust stability result available is the IQC theorem of Megretski and Rantzer, which roughly states that robust stability can be established by checking the feasibility of a particular linear matrix inequality (LMI). It is known that the result goes both ways: if the LMI is not feasible, then (at least in theory) there should exist some admissible choice of the uncertainty that will lead to instability. Bryan’s paper provides an explicit way of constructing such worst-case realizations by leveraging a theorem of strong alternatives for LMIs. This result is useful because it allows us to go beyond “could this system fail?” and address the question “how could this system fail?”. Here are links to the paper and the slides from Bryan’s talk.
  • “Explicit agent-level optimal cooperative controllers for dynamically decoupled systems with output feedback” with my student Mruganka Kashyap. This paper considers a cooperative control problem where a group of agents (these could be robots, drones, etc.) attempt to solve a cooperative task under limited communication. In general, problems of this type are difficult to solve. However, solving simpler cases can provide useful intuition for what might work well in more general scenarios. In this work, we assumed the agents were “dynamically decoupled” and have linear dynamics. We also assume all disturbances are Gaussian and the global objective (which may couple the agents’ states and inputs) is quadratic. Previous work has shown that these assumptions ensure that the optimal strategy is a linear one and can be computed efficiently. Our contribution takes this a step further and elucidates the exact structure (down to the level of individual agents) that is used for optimal control. The optimal structure is reminiscent of the classical LQG structure: there is an “estimator”, a “compensator”, and a separation principle too! Here are links to the paper and the slides from Mruganka’s talk. Incidentally, this was Mruganka’s first talk at an international conference! See the photo on the right!
  • “Unified necessary and sufficient conditions for the robust stability of interconnected sector-bounded systems” with my student Saman Cyrus. The Lur’e problem is perhaps one of the most fundamental and well-studied examples of a robust control problem: assessing the stability properties of a known linear time-invariant plant in feedback with an unknown sector-bounded nonlinearity. This problem is fundamental because many electrical, mechanical, and chemical systems (read: engineering systems!) are close to being linear. The departures from linearity often involve phenomena such as model errors, saturations, friction/stiction, or other hysteresis effects, which can often be well-modeled as sector-bounded nonlinearities. In the 60+ years of literature on this topic, various results have emerged: passivity theory, the small-gain theorem, the circle criterion, and more. Our paper presents an effort to unify these results (and their various incarnations) by using a very general mathematical framework. The benefit of our approach is that we can write a single robustness theorem that makes minimal assumptions yet can be specialized to recover many existing results. The advantage of having a single general theorem is that it allows us to derive new specializations! In our paper, we present one such new result: a necessary and sufficient condition for robust exponentially-weighted stability of Lur’e systems. Here are links to the paper and the slides from the talk.

One final highlight: I ran into my former PhD advisor, Sanjay Lall! Here is a photo of a small subset of my academic family: From left-to-right: Sanjay (my advisor), me, Bryan (my postdoc), and Mruganka (my student).

INFORMS’19 in Seattle

I attended the 2019 INFORMS Annual Meeting in Seattle, WA on Oct 20-23, 2019. INFORMS is the leading international association for Operations Research & Analytics professionals. It’s outside my wheelhouse, but I thought it would be nice to attend since Operations Research does touch on optimization, modeling, and data-driven decision-making, which are topics that interest me.

This was my first time attending an INFORMS conference and I was blown away by the sheer size and scope. There were well over 6,000 attendees and the conference had 100 simultaneous tracks over 4 full days. Compare this to CDC (the flagship annual controls conference), which is about 1/4 of the size.

At INFORMS, I gave two invited talks. The first was in a session organized by Jelena Diakonikolas, who will be my colleague at UW–Madison starting in Spring 2020! My talk was about the similarities and differences between discrete-time and continuous-time dynamical systems and how they relate to algorithm analysis. For example, notions of controllability/observability, stability regions, Lyapunov results, and more have direct analogs in discrete and continuous time. However, notions such as interval stability (e.g. Kharitonov’s theorem) or robust stability (e.g. Aizerman’s conjecture) are notably different in discrete and continuous time! My slides can be found here.

My second talk was in a session organized by Alireza Fallah, a PhD student in Asu Ozdaglar’s group at MIT. This talk was a short version of my talk on unified algorithm analysis and how small semidefinite programs can be used to automate the analysis and design of iterative optimization algorithms. For slides from a similar (but longer) version of this talk, see here.

Overall, I’m happy that I finally got to experience INFORMS. The two sessions I participated in were great (thanks Jelena and Alireza!). I think it’s a particularly nice conference to meet with industry contacts, since there is a substantial industry presence at this conference. That being said, I found it quite overwhelming at times because there were so many talks happening all at once. There is something to be said about smaller venues (see my recent NecSys and Allerton posts!) because the conferences tend to be more cohesive and it’s easier to find your friends when the venue isn’t the size of an airport!

Allerton’19

I attended the 57th annual Allerton Conference on Communication, Control, and Computing in Monticello, IL. At the conference, I presented an invited talk in a session organized by Bin Hu, my former postdoc who is now at UIUC — so it was nice to see Bin again!

My talk was largely about the following recent paper (arxiv link) by my student Akhil Sundararajan, my postdoc Bryan Van Scoy, and me. The manuscript should get published at some point in 2020 once we make our final revisions. The paper describes how gradient-based distributed optimization algorithms can be parameterized and analyzed (in terms of their worst-case convergence rate) using simple semidefinite programs. In the paper, we study the case where the underlying graph is possibly time-varying (with uniformly bounded spectral gap), and we characterize the worst-case performance under any adversarially-varying graph. The benefit of this approach is that it enables rapid analysis, which means it also enables efficient design! Our paper also presents an “optimal” algorithm design (optimal in the sense that it has the fastest worst-case convergence rate) and a comparison between this algorithm and existing algorithms from the literature. Finally, we show that our approach likely produces tight bounds because we can construct worst-case sequences of graphs and functions that achieve the bound.

NecSys’19 in Chicago

I attended the 8th IFAC Workshop on Distributed Estimation and Control in Networked Systems (NecSys) in Chicago on September 16-17, 2019. NecSys is a small international single-track conference that tends to attract optimization/controls/distributed systems researchers. The last time I attended NecSys was in 2012 (in Santa Barbara) while I was still a postdoc, and I have fond memories of it. It was nice for NecSys to be local again this year, as it meant that I could bring most of my research group along!

At the conference, Bryan presented a poster on his recent distributed optimization work (arxiv link). In this work, Bryan showed that near-optimal gradient-based distributed optimization algorithms can be designed by alternating between simple gradient steps and gossip steps. In the paper, Bryan shows how to tune the ratio of gradient to gossip steps based on the spectral gap of the adjacency matrix of the underlying communication graph (a measure of how “connected” the graph is) in order to achieve the fastest possible convergence rate.
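Here is a rough sketch of the alternating structure (my own toy Python with a hypothetical averaging matrix standing in for the gossip step; the actual design tunes the gradient-to-gossip ratio from the spectral gap):

```python
import numpy as np

# Toy distributed problem: agent i holds f_i(x) = 0.5 * ||x - target_i||^2,
# so the global minimizer is the mean of the targets. We alternate local
# gradient steps with gossip (averaging) steps.
rng = np.random.default_rng(0)
n_agents, dim = 5, 3
targets = rng.standard_normal((n_agents, dim))
W = np.full((n_agents, n_agents), 1.0 / n_agents)  # hypothetical gossip matrix

X = rng.standard_normal((n_agents, dim))  # one row per agent
alpha, gossip_per_grad = 0.1, 2           # ratio would be tuned in practice
for _ in range(200):
    X = X - alpha * (X - targets)         # local gradient steps
    for _ in range(gossip_per_grad):
        X = W @ X                         # gossip: mix with neighbors
print(X[0], targets.mean(axis=0))         # each agent nears the global optimum
```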

On a personal note, I had the opportunity at NecSys to reconnect with my former PhD advisor Sanjay Lall and my former postdoc advisor Ben Recht (first photo). The fact that NecSys was local also led to a little road trip back to Madison with my students Mruganka, Saman, and Akhil! We stopped for dinner in “Little India”, which is a neighborhood filled with Indian and Pakistani restaurants and shops in a suburb north of Chicago. Akhil guided us to some traditional Chaat (second photo) and it was delicious!

ACC’19 in Philadelphia

I attended the 2019 American Control Conference in Philadelphia, Pennsylvania. The conference was held in downtown Philadelphia, which is filled with beautiful historical buildings, such as the Union League building, pictured on the right.

This was a conference of firsts. Together with my former postdoc Bin Hu, who is now faculty at UIUC, I co-organized my first ACC workshop! Here is a link to the workshop website. It was a full-day workshop on the interplay between optimization, control, and machine learning. Our morning session focused on how tools from controls, differential equations, and dynamical systems could be used to better analyze and design first-order optimization algorithms, which are a key tool in machine learning. Our afternoon session focused on how ideas from machine learning, reinforcement learning, and complexity theory could be used to address large-scale and computationally challenging control problems. We were lucky to have many great speakers, and our workshop was the most popular one at ACC this year — check out the full room on the right! If you’re interested in these topics, the workshop website linked contains info for all the talks that took place (authors, titles, and slides!).

In addition to the workshop, my student Akhil Sundararajan and postdoc Bryan Van Scoy joined me at ACC, and Akhil presented a paper by the three of us titled “A canonical form for first-order distributed optimization algorithms”. In distributed optimization, the goal is for a number of agents (computer servers, robots, mobile phones, etc.) to jointly solve an optimization problem. The problem is global, but the data collection and processing happens locally. The challenge is to design an algorithm that mediates how each agent processes its local data and what is communicated with the other agents so that each agent eventually learns the solution to the global problem. There have been many such algorithms proposed in recent years, but little insight into what makes one algorithm better than another, or how to go about actually designing such an algorithm. To make matters more confusing, there are many equivalent sets of equations that can be used to describe the same algorithm. So it might happen (and it has!) that two researchers invent the same algorithm without realizing it. Our paper makes an effort to classify distributed algorithms by providing a canonical form that uses as few parameters as possible yet parameterizes a broad class of algorithms (including many existing published algorithms). Our unifying parameterization is a first step toward a more systematic and principled analysis and design of distributed optimization algorithms.

My student Akhil, who is also first author on the paper, presented our work. As a side note, this was Akhil’s first conference presentation, and he did a fantastic job! (see photo on the right). If you’re interested in seeing Akhil’s slides, you can download them here.

L4DC’19 at MIT

I had the pleasure of attending the inaugural Conference on Learning for Dynamics and Control at MIT. The goal of this new conference is to bring together researchers at the interface between machine learning, control, optimization, and related areas, with a focus on addressing scientific and application challenges in real-time physical processes modeled by dynamical or control systems. The conference format is still in flux, but this year’s version was single-track, with diverse high-profile speakers presenting on a range of topics including model-predictive control, deep learning, simulating dynamical systems, robotics, traffic systems, and more. There were also two excellent poster sessions. Overall, the conference was a huge success. Over 400 people attended, and all on very short notice! Only about 3 months elapsed between when the idea of this conference was conceived and when the conference actually occurred.

My two highlights of the conference were: (1) Getting to see all the Berkeley folks I knew through my postdoc but haven’t had much of a chance to interact with since then. Many of them are now moving on to exciting careers in academia and industry. (2) Having the opportunity to meet Alexandre Megretski for the first time and chat about robust control and IQCs. Megretski wrote some of the seminal work in these areas and it was an honor to meet him in person.

Kudos to the organizers (Ali Jadbabaie, George Pappas, Pablo Parrilo, Ben Recht, and Melanie Zeilinger) for putting together such an amazing conference on such short notice! I’m looking forward to attending again next year!

8th Midwest Workshop on Control and Game Theory

I recently attended the 8th Midwest Workshop on Control and Game Theory, hosted by Washington University in St. Louis. The two-day workshop brought together faculty, postdocs, and grad students in the fields of controls, optimization, game theory, and economics. This was my first time visiting Wash U, and it’s a beautiful campus! The buildings are all built of sandstone, and it reminds me of Queen’s University in Kingston, Canada.

I quite like smaller workshop formats such as this one; they’re much less hectic than the large conferences I attend like CDC or ICML. It’s also nice to meet and network with other researchers in the Midwest!

At the workshop, I gave a 30-minute talk on using a robust-control perspective to analyze and design optimization algorithms. The video of my talk is available on YouTube.