# Optimization, Control and Reinforcement Learning

### Optimization, Control and Reinforcement Learning Session

#### 9:00am-12:00pm, February 25 on Zoom

Optimization is the engine of the modern “data-knowledge-decision” pipeline. Control theory and reinforcement learning make it possible to act in rapidly changing, dynamic environments that demand sequential decision making. The ubiquity of systems requiring automatic control, together with the many successes of reinforcement learning, underscores the need to develop the underlying theory further.

The Coordinated Science Laboratory (originally the Control Systems Laboratory) continues to conduct groundbreaking research in these fields, a success that grows out of CSL's collaborative and interdisciplinary environment. The Optimization, Control, and Reinforcement Learning session will feature a keynote speech by Prof. Mahyar Fazlyab, a prominent researcher in the area.

### Keynote Speaker – Prof. Mahyar Fazlyab, Johns Hopkins University

Time: 11:00 am – 12:00 pm, February 25

Talk Abstract: Neural Networks (NNs) have become increasingly effective at many difficult machine-learning and control tasks. However, the nonlinear and large-scale nature of neural networks makes them hard to analyze and, therefore, they are mostly used as black-box models without formal guarantees. This issue becomes even more complicated when NNs are used in learning-enabled closed-loop systems, where a small perturbation can substantially impact the system being controlled. In the first part of this talk, we present a convex optimization framework, inspired by robust control, that can provide useful certificates of stability, safety, and robustness for NN-driven systems. In the second part of the talk, we address the problem of incorporating the safety guarantees of robust control into NN-driven uncertain dynamical systems. Specifically, we will develop scalable methods that integrate custom projection layers into a neural network-based policy to enforce robust control specifications (stability, safety, etc.) during both training and runtime.
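To make the idea of a robustness certificate concrete, here is a minimal sketch using interval bound propagation (IBP), a much cruder stand-in for the SDP-based certificates discussed in the talk. The weights and the two-layer ReLU network are entirely made up for illustration; the point is that sound output bounds over an input box constitute a formal guarantee.

```python
# Hypothetical 2-2-1 ReLU network; all weights are illustrative, not from the talk.
W1 = [[1.0, -1.0], [0.5, 2.0]]
b1 = [0.0, 0.1]
W2 = [[1.0, 1.0]]
b2 = [0.0]

def affine_bounds(W, b, lo, hi):
    """Propagate an interval box [lo, hi] through y = W x + b."""
    out_lo, out_hi = [], []
    for row, bias in zip(W, b):
        l = bias + sum(w * (lo[j] if w >= 0 else hi[j]) for j, w in enumerate(row))
        h = bias + sum(w * (hi[j] if w >= 0 else lo[j]) for j, w in enumerate(row))
        out_lo.append(l)
        out_hi.append(h)
    return out_lo, out_hi

def certify(x, eps):
    """Sound output bounds over the box ||x' - x||_inf <= eps."""
    lo = [xi - eps for xi in x]
    hi = [xi + eps for xi in x]
    lo, hi = affine_bounds(W1, b1, lo, hi)
    lo, hi = [max(0.0, v) for v in lo], [max(0.0, v) for v in hi]  # ReLU is monotone
    return affine_bounds(W2, b2, lo, hi)

lo, hi = certify([1.0, 1.0], eps=0.1)
# Every input in the eps-box provably maps to an output in [lo[0], hi[0]].
```

IBP bounds are often loose; the convex (SDP) relaxations the talk describes trade computation for much tighter certificates on the same question.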

### Invited Student Speaker – Glen Chou, University of Michigan

“Overcoming Uncertainty in Learned Dynamics and Perception Modules for Safe Motion Planning”

Time: 10:20 am – 11:00 am, February 25

Talk Abstract: Trustworthy robots must be able to complete tasks reliably while obeying safety constraints. While traditional motion planners can achieve this when the environment is carefully modeled a priori, these guarantees fall apart in more realistic unstructured scenarios, where uncertainty runs rampant and robots invariably need data to refine their understanding of the world. Though machine learning provides a means to obtain perception and dynamics models from data, blindly trusting these potentially unreliable models when planning can cause unsafe and unpredictable behavior at runtime. This talk will present our recent work in determining where learned models can be trusted for reliable motion planning, and to what extent. Specifically, we estimate bounds on the error of the learned models inside a domain around the training data. Using tools from contraction theory, we propagate this model error bound into a trajectory tracking error bound. This tracking bound is used to constrain the planner to only return plans that can be safely tracked in spite of the errors in the perception and dynamics. We demonstrate in simulation that these theoretical guarantees translate to empirical success, enabling safe task completion at runtime on a variety of challenging high-dimensional, underactuated systems using rich sensor observations (e.g. RGB-D images) in the feedback control loop.

### Student Speakers

#### Xingang Guo

“Convex Programs and Lyapunov Functions for Reinforcement Learning: A Unified Perspective on the Analysis of Value-Based Methods”

Time: 9:00 am – 9:20 am, February 25

Value-based methods play a fundamental role in Markov Decision Processes (MDPs) and Reinforcement Learning (RL). In this work, we present a unified convergence analysis framework for a large class of value-based methods including Value Computation (VC), Value Iteration (VI), and Temporal Difference (TD) learning with linear function approximation. First, we show an intrinsic connection between value-based methods and dynamical systems. We then apply existing dynamical systems theory to derive necessary and sufficient conditions for convergence certifications of the aforementioned methods. The derived conditions are convex programs, either linear programs (LPs) or semidefinite programs (SDPs). In particular, we solve these convex programs to provide analytical proofs of convergence of value-based methods and obtain explicit Lyapunov functions. This simple routine unifies the analysis of value-based reinforcement learning algorithms, revealing some intriguing connections between feedback control systems and reinforcement learning algorithms. It is our hope that such a connection can inspire more work at the intersection of system/control theory and reinforcement learning.
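The dynamical-systems view can be illustrated with the simplest case. The sketch below (a toy two-state example of my own construction, not the paper's framework) treats value computation V_{k+1} = r + γPV_k as a linear dynamical system: the sup-norm error to the fixed point is a Lyapunov function that contracts by a factor γ each step, certifying geometric convergence.

```python
# Toy example (made-up MDP): value computation as a contracting dynamical system.
gamma = 0.9
r = [1.0, 0.0]
P = [[0.9, 0.1], [0.2, 0.8]]          # row-stochastic transition matrix

def bellman(V):
    """One step of the linear system V <- r + gamma * P V."""
    return [r[s] + gamma * sum(P[s][t] * V[t] for t in range(2))
            for s in range(2)]

# Approximate the fixed point V* = r + gamma P V* by iterating to convergence.
V_star = [0.0, 0.0]
for _ in range(10_000):
    V_star = bellman(V_star)

# Track the Lyapunov function ||V_k - V*||_inf along a fresh trajectory.
V = [0.0, 0.0]
errors = []
for _ in range(20):
    errors.append(max(abs(V[s] - V_star[s]) for s in range(2)))
    V = bellman(V)

# The sup-norm error contracts by at least gamma at every step.
assert all(e1 <= gamma * e0 + 1e-12 for e0, e1 in zip(errors, errors[1:]))
```

The paper's contribution is to search for such Lyapunov certificates systematically via LPs/SDPs, covering VI and TD learning as well; this toy only shows the contraction phenomenon being certified.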

#### Arjun Gupta

“Learning Value Functions from Undirected State-only Experience”

Time: 9:20 am – 9:40 am, February 25

This paper tackles the problem of learning value functions from undirected state-only experience (state transitions without action labels, i.e., (s, s′, r) tuples). We first theoretically characterize the applicability of Q-learning in this setting. We show that tabular Q-learning in discrete Markov decision processes (MDPs) learns the same value function under any arbitrary refinement of the action space. This theoretical result motivates the design of Latent Action Q-learning (LAQ), an offline RL method that can learn effective value functions from state-only experience. LAQ learns value functions using Q-learning on discrete latent actions obtained through a latent-variable future prediction model. We show that LAQ can recover value functions that have high correlation with value functions learned using ground-truth actions. Value functions learned using LAQ lead to sample-efficient acquisition of goal-directed behavior, can be used with domain-specific low-level controllers, and facilitate transfer across embodiments. Our experiments in 5 environments ranging from 2D grid world to 3D visual navigation in realistic environments demonstrate the benefits of LAQ over simpler alternatives, imitation learning oracles, and competing methods.
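The refinement-invariance result admits a tiny numerical check. The sketch below (a made-up two-state MDP, not LAQ itself, which additionally learns latent actions from a future-prediction model) duplicates one action into two indistinguishable copies and verifies that the optimal value function V(s) = max_a Q(s, a) is unchanged.

```python
# Toy illustration (invented MDP): refining the action space leaves V(s) unchanged.
gamma = 0.9

def q_iteration(P, R, n_states, n_actions, iters=2000):
    """Q-value iteration: Q <- R + gamma * sum_t P[s][a][t] * max_a' Q[t][a']."""
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(iters):
        Q = [[R[s][a] + gamma * sum(P[s][a][t] * max(Q[t])
                                    for t in range(n_states))
              for a in range(n_actions)] for s in range(n_states)]
    return Q

# Base MDP: 2 states, 2 actions. P[s][a] is the next-state distribution.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.5, 0.5], [0.0, 1.0]]]
R = [[1.0, 0.0], [0.0, 0.5]]

# Refined MDP: action 0 split into two identical copies (3 actions total).
P_ref = [[sa[0], sa[0], sa[1]] for sa in P]
R_ref = [[sr[0], sr[0], sr[1]] for sr in R]

V = [max(q) for q in q_iteration(P, R, 2, 2)]
V_ref = [max(q) for q in q_iteration(P_ref, R_ref, 2, 3)]
assert all(abs(a - b) < 1e-9 for a, b in zip(V, V_ref))
```

This is why Q-learning over *latent* actions can work: as long as the latent actions refine the true ones, the induced value function is the same.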

#### Dawei Sun

“Multi-agent Motion Planning from Signal Temporal Logic Specifications”

Time: 9:40 am – 10:00 am, February 25

We tackle the challenging problem of multi-agent cooperative motion planning for complex tasks described using signal temporal logic (STL), where robots can have nonlinear and nonholonomic dynamics. Existing methods in multi-agent motion planning, especially those based on discrete abstractions and model predictive control (MPC), suffer from limited scalability with respect to the complexity of the task, the size of the workspace, and the planning horizon. We present a method based on timed waypoints to address this issue. We show that timed waypoints can help abstract nonlinear behaviors of the system as safety envelopes around the reference path defined by those waypoints. Then the search for waypoints satisfying the STL specifications can be inductively encoded as a mixed-integer linear program. The agents following the synthesized timed waypoints have their tasks automatically allocated, and are guaranteed to satisfy the STL specifications while avoiding collisions. We evaluate the algorithm on a wide variety of benchmarks. Results show that it supports multi-agent planning from complex specifications over long planning horizons, and significantly outperforms state-of-the-art abstraction-based and MPC-based motion planning methods. The implementation is available on GitHub.

The problem of robust mean estimation in high dimensions is studied, in which a certain fraction (less than half) of the datapoints can be arbitrarily corrupted. Motivated by compressive sensing, the robust mean estimation problem is formulated as the minimization of the $\ell_0$-norm of an *outlier indicator vector*, under a second moment constraint on the datapoints. The $\ell_0$-norm is then relaxed to the $\ell_p$-norm ($0<p\leq 1$) in the objective, and it is shown that the global minima for each of these objectives are order-optimal and have optimal breakdown point for the robust mean estimation problem. Furthermore, a computationally tractable iterative $\ell_p$-minimization and hard thresholding algorithm is proposed that outputs an order-optimal robust estimate of the population mean. The proposed algorithm (with breakdown point $\approx 0.3$) does not require prior knowledge of the fraction of outliers, in contrast with most existing algorithms, and for $p=1$ it has near-linear time complexity. Both synthetic and real data experiments demonstrate that the proposed algorithm outperforms state-of-the-art robust mean estimation methods.
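The outlier-indicator idea can be sketched in one dimension. The code below is my own heavily simplified stand-in, not the paper's algorithm: it alternates between estimating the mean from the points currently marked as inliers and hard-thresholding, re-marking the points with the largest residuals as outliers; the trimming fraction and data are made up.

```python
import random

# Simplified 1-D illustration of the alternate-estimate-and-threshold idea.
random.seed(0)
inliers = [random.gauss(0.0, 1.0) for _ in range(200)]
outliers = [random.gauss(50.0, 1.0) for _ in range(40)]   # ~17% corrupted
data = inliers + outliers

def robust_mean(x, frac=0.3, iters=20):
    """Alternate: (1) mean over current inlier set, (2) hard-threshold
    the frac of points with the largest residuals as outliers."""
    keep = set(range(len(x)))               # indicator: i in keep <=> inlier
    for _ in range(iters):
        mu = sum(x[i] for i in keep) / len(keep)
        by_resid = sorted(range(len(x)), key=lambda i: abs(x[i] - mu))
        keep = set(by_resid[: int((1 - frac) * len(x))])
    return sum(x[i] for i in keep) / len(keep)

naive = sum(data) / len(data)   # pulled far from 0 by the corrupted points
robust = robust_mean(data)      # recovers a value near the true mean 0
assert abs(robust) < 1.0 and abs(naive) > 5.0
```

Unlike this toy, the actual algorithm works in high dimensions under a second-moment constraint and, as the abstract notes, does not need the outlier fraction as an input.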