Iterated offline reinforcement learning and contextual bandits: motivations and research themes
Joint work with Merwan Barlier, Igor Colin, Ludovic Dos Santos, Gabriel Hurtado, Cedric Malherbe, and Albert Thomas.
One of our main research objectives at the Noah’s Ark Lab in Huawei France is to build autopilots for engineering systems using AI.
We can improve various metrics: making the systems better, cheaper, more reliable, safer, or more energy efficient. Typical systems we are working on include the cooling system of data centers and the systems that manage the connections in wireless antennas or in your local Wi-Fi network. The applications of the technology we are developing are countless: given that engineering systems are the backbone of most of industry and transportation, making AI useful in this domain is arguably a multi-trillion dollar endeavor.
The control view: systems and agents
The abstraction of a controlled engineering system contains a system (plane) and an engineer (pilot). The system provides observables (instrument readouts) and performance metrics (aka rewards: e.g., speed, consumption, safety alarms) at each discrete timestep, while the engineer steers the system using control actions, attempting to optimize the metrics or to keep them within limits. The goal of reinforcement learning and bandits is to “learn the engineer”, leading to an autopilot. More precisely, we learn the policy that maps the observable and reward sequence to optimized control actions, using historical system logs (traces) and/or by interacting with the system.
Why offline/batch reinforcement learning (RL)?
You may ask: why don’t we just use the beautiful RL algorithms that, among other feats, helped to beat the Go world champion in 2016? The answer is simple: these algorithms need to interact with the system over millions or billions of time steps, losing many times along the way while they learn. Engineering systems are tightly controlled by systems engineers whose responsibility is to run them safely. They rarely allow the data scientist to experiment directly with the system. This means that we need to learn a good control policy from a small offline data set and iterate slowly with the systems engineers. Below is the offline RL loop embedded into a typical project management process.
The process iterates the following steps:
- The systems engineer, besides doing her work of controlling the system, logs the trace of (action, observable, reward) tuples.
- She sends the data set to the data scientist.
- He designs a controller based on the collected data with the goal of improving the system (mean reward).
- She executes the policy, logs the trace again and sends it back to the data scientist.
Policy validation and data collection are slow: a single iteration may take weeks to months. This puts our setup somewhere between offline and online RL; we can call it slowly growing batch RL or slow iterated offline RL.
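The loop above can be sketched on a toy problem. This is only an illustration under strong assumptions, not our actual pipeline: the "system" is a hypothetical one-dimensional simulator (`run_system`), the policy is a single gain, and `design_gain`, `TRUE_GAIN`, and the small exploration noise are all names and choices invented for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
TRUE_GAIN = 0.5  # hidden property of the toy system: the best action is TRUE_GAIN * obs


def run_system(gain, n_steps, explore_std=0.3):
    """Engineer's side: run the policy action = gain * obs (plus a small
    exploration noise) and log (action, observable, reward) tuples."""
    obs = rng.normal(size=n_steps)
    action = gain * obs + explore_std * rng.normal(size=n_steps)
    reward = -(action - TRUE_GAIN * obs) ** 2 + 0.1 * rng.normal(size=n_steps)
    return np.column_stack([action, obs, reward])


def design_gain(log):
    """Data scientist's side: fit a quadratic reward model on all logged data
    and return the gain that maximizes the predicted reward."""
    a, o, r = log[:, 0], log[:, 1], log[:, 2]
    features = np.column_stack([a**2, a * o, o**2, np.ones_like(a)])
    w, *_ = np.linalg.lstsq(features, r, rcond=None)
    # reward ~ w[0]*a^2 + w[1]*a*o + ...; if w[0] < 0 the maximizer is a = -w[1]/(2*w[0]) * o
    return -w[1] / (2.0 * w[0]) if w[0] < 0 else 0.0


gain = 0.2                  # the engineer's initial hand-tuned policy
log = np.empty((0, 3))      # the slowly growing batch
for iteration in range(3):  # each pass stands for weeks or months in practice
    trace = run_system(gain, n_steps=200)
    log = np.vstack([log, trace])
    gain = design_gain(log)
    print(f"iteration {iteration}: mean logged reward {trace[:, 2].mean():.3f}, new gain {gain:.3f}")
```

The point of the sketch is the structure: the data set grows by one small trace per iteration, and the policy is redesigned each time on everything logged so far.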
Why micro-data RL?
Physical systems are slow and are not getting faster with time (unlike computers). We often need to start with a static system log containing a few hundred or a few thousand time steps, and we are not allowed to hot test a new policy on a data center cooling system before the systems engineer is certain that it is safe. Micro-data RL is the term for using RL on systems where the main bottleneck or source of cost is access to data (as opposed to, for example, computational power). The term was introduced in robotics research. This regime requires performance metrics that put as much emphasis on sample complexity (learning speed with respect to sample size) as on asymptotic performance, and algorithms that are designed to make efficient use of small data.
The goal of RL is to learn a control policy. Offline model-free RL (MFRL) does this in a single shot on the collected offline data set. In model-based RL (MBRL) we first learn a model of the environment, essentially a multivariate time-series predictor that forecasts the evolution of the system given its history of observables and control actions. We can then use this model, often in the form of a simulator, to learn the control policy. Aircraft simulators have long been used to train pilots. The idea is the same here, except that we do not have a physical model, so we learn it from data.
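A minimal sketch of the model-based recipe (fit a model on the log, then plan on the model), under loud assumptions: the "real" system `true_step` is a made-up one-dimensional linear system, the model is a plain least-squares fit rather than a probabilistic time-series predictor, and the planner is random shooting, chosen here only because it is one of the simplest planners.

```python
import numpy as np

rng = np.random.default_rng(0)


def true_step(state, action):
    """Toy 'real' system, unknown to the learner (an assumption of this sketch)."""
    return 0.9 * state + 0.5 * action


# 1) Offline log: random actions in [-1, 1] recorded on the real system.
states = rng.normal(size=200)
actions = rng.uniform(-1.0, 1.0, size=200)
next_states = true_step(states, actions) + 0.01 * rng.normal(size=200)

# 2) Model learning: least-squares fit of next_state ~ w[0]*state + w[1]*action.
inputs = np.column_stack([states, actions])
w, *_ = np.linalg.lstsq(inputs, next_states, rcond=None)


def model_step(state, action):
    """Learned one-step predictor, used as a simulator by the planner."""
    return w[0] * state + w[1] * action


# 3) Planning on the learned model with random shooting.
def random_shooting(state, horizon=5, n_candidates=500):
    """Sample action sequences, roll them out in the learned model, and return
    the first action of the sequence with the highest predicted return."""
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon))
    sim_state = np.full(n_candidates, state, dtype=float)
    returns = np.zeros(n_candidates)
    for t in range(horizon):
        sim_state = model_step(sim_state, candidates[:, t])
        returns += -sim_state**2  # toy reward: keep the state close to zero
    return candidates[np.argmax(returns), 0]


state = 2.0
for _ in range(10):  # replan at every step, apply only the first action
    state = true_step(state, random_shooting(state))
final_state = state
```

Only the model ever sees the planner's candidate actions; the real system receives a single action per step, which is what makes this approach attractive when data is the bottleneck.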
We have reasons to explore both model-free and model-based approaches.
Why MFRL?
- MFRL seems to achieve better asymptotic performance.
- MFRL is better researched and the existing algorithms can serve as strong baselines.
- We need to plan on the model learned in MBRL, and these planners are essentially MFRL algorithms.
Why MBRL?
- It is considered the best approach for the micro-data regime.
- A common argument for the supremacy of model-free RL over model-based RL is that the latter wastes predictive power on predicting all system observables, not only those that matter for optimizing the return. But variables in a system log are usually crafted to aid the systems engineer, so they are arguably all relevant for optimizing the policy.
- System models (simulators) are useful on their own. First, they can be validated by the systems engineer. She will trust an opaquely learned controller much less than a controller learned and demonstrated on a simulator which she can check against her experience of how the system works. Second, they can serve for consistency checks and predictive maintenance. When the system does not respond according to the model, we can set up an alarm to wake up the systems engineer. Third, they can be used to validate policies learned offline.
Why contextual bandits?
One of the features that makes engineering systems easier to handle (compared to, for example, games or even robots) is that most of the time we receive a reward at every time step, and the effect of an action has little delay. In some circumstances, it is even possible to control these systems using bandits, which optimize the immediate reward. The goal of a policy is then to find the action leading to the largest reward, without caring about the future. Bandits are essentially model-free, but models learned on held-out traces can be used to validate the policies.
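As an illustration of "find the action leading to the largest immediate reward", here is a sketch of a greedy contextual bandit learned from a log. Everything specific is an assumption made for the example: three discrete actions, a two-dimensional context, linear rewards (`true_weights`), a uniformly random logging policy, and one ridge regression per action as the reward model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, dim = 3, 2
# Toy system, hidden from the learner: expected reward = context @ true_weights[action].
true_weights = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])

# Logged data: contexts, uniformly random actions, noisy immediate rewards.
contexts = rng.normal(size=(300, dim))
actions = rng.integers(n_actions, size=300)
rewards = np.einsum("ij,ij->i", contexts, true_weights[actions]) + 0.1 * rng.normal(size=300)


def fit_reward_models(contexts, actions, rewards, ridge=1.0):
    """One ridge regression per action: reward ~ context @ weights[action]."""
    weights = np.zeros((n_actions, dim))
    for a in range(n_actions):
        x, r = contexts[actions == a], rewards[actions == a]
        weights[a] = np.linalg.solve(x.T @ x + ridge * np.eye(dim), x.T @ r)
    return weights


def greedy_policy(weights, context):
    """Pick the action with the largest predicted immediate reward."""
    return int(np.argmax(weights @ context))


weights = fit_reward_models(contexts, actions, rewards)
chosen = greedy_policy(weights, np.array([2.0, 0.1]))  # action 0 is best for this context
```

The greedy rule ignores the future entirely, which is only reasonable when, as described above, actions have little delayed effect.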
Models for dynamic systems
This theme covers the questions around dynamic system models, which are essentially multi-variate time series predictors, learned on non-iid and nonstationary system traces.
- Which models to choose and based on what criteria?
In this blog post (summarizing this paper) we compare the most popular generative models in a rigorous experimental setup: we fix the planning agent to random shooting and evaluate the models using two dynamical metrics, four metrics evaluated on static traces, and seven requirements which practicing data scientists may find important. We establish a new state-of-the-art sample complexity on the well-known Acrobot system, and declare deep autoregressive mixture density nets (DARMDN) the most preferable model for their ability to model multivariate and heteroscedastic posterior predictives, and for their robustness and flexibility to model heterogeneous system observables.
- On noisy systems, separating epistemic and aleatoric uncertainties is considered a good practice. Can we verify this? What are the best approaches to do it and what is the experimental framework in which the different approaches can be compared? What are the best environments to use in the experiments?
- Heteroscedasticity at training time proved to be crucial in this paper. Why? The objective is to reproduce the phenomenon on the smallest toy system possible and study it both experimentally and theoretically.
- In noisy engineering systems it happens that the control action has little effect on the reward, compared to the effect of the state or context. Learning a joint model sometimes “shadows” the effect of the action which is detrimental to learning a good control agent. Finding this (unbiased) action sensitivity of the reward explicitly in the system modelling phase is thus crucial, leading to questions similar to experimental design, causal learning and learning treatment effects in healthcare.
- On complex systems constructing the summary of the history (the context) may be nontrivial. Using prior knowledge obtained from the systems engineer is one direction, using attention-type neural architectures is another.
- In our setup, we have to learn new models for each new feedback from the real system. Instead of doing it from scratch, it can be interesting to leverage transfer learning techniques to ensure an easier and smoother training process.
- One of the side effects of learning a system model is that we can check the behavior of the system against the model. If there is a discrepancy, we can act, for example, by triggering a maintenance action or alerting the systems engineer. We can design both on-line checks, “fear reactions” of the autopilot, and off-line data checks, inserted in the slow offline RL iteration.
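The model-versus-system check in the last item can be sketched as a standardized-residual test. This assumes the model outputs a predictive mean and standard deviation for each readout; the function name `discrepancy_alarm`, the threshold of four standard deviations, and the readout values are all hypothetical choices for this example.

```python
import numpy as np


def discrepancy_alarm(observed, predicted_mean, predicted_std, threshold=4.0):
    """Return a boolean array that is True wherever the observation lies more
    than `threshold` predicted standard deviations from the predictive mean."""
    z = np.abs(observed - predicted_mean) / np.maximum(predicted_std, 1e-8)
    return z > threshold


# Hypothetical readouts: the model predicts 20.0 +/- 0.5 at each of five time steps.
predicted_mean = np.full(5, 20.0)
predicted_std = np.full(5, 0.5)
observed = np.array([20.1, 19.6, 20.4, 23.5, 20.2])  # the fourth readout is off

alarms = discrepancy_alarm(observed, predicted_mean, predicted_std)
```

The same function could serve both uses mentioned above: applied step by step as an on-line "fear reaction", or applied to a whole returned trace as an off-line data check.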
Model-free RL
Model-free RL algorithms can be either used in the offline RL loop as standalone approaches, or in the model-based RL loop as planning agents.
- Which are the best model-free RL algorithms, especially in terms of sample complexity, crucial in the micro-data regime?
- Which are the best contextual bandit algorithms?
- Once we learn a system model, the control agent can be trained on the simulator using any model-free RL or CB technique. Of course, the performance of this controller will depend not only on the model-free RL/CB technique but also on how well the simulator mimics the real system. The questions around this theme are: i) Which model-free or planning agents should we choose? ii) How do we make the control agent trained on the model robust to the covariate shift caused by learning the model offline? iii) What are the criteria for choosing the planning agent in the iterated offline MBRL/CB setup?
- How to explore? Pure offline RL has no notion of exploration, but in our slow iterated offline setup exploration is crucial.
Safety
One of the crucial issues on engineering systems is that we are not allowed to break them either when we learn or when we deploy the learned policy. In some cases there exist good first-principle (physical) simulators or test systems equipped with a “red button”, in which case safety is a lesser issue at learning time.
- How to formulate and enforce safety when learning and when deploying the learned agent?
- Even with well-formulated safety metrics, we end up with a bi-objective problem. How do we compare policies that represent different operating points on the safety/reward plane? Which algorithms let the systems engineer set the desired safety level flexibly?
- How to add safety to the exploration policy, so the systems engineer can limit the number of safety violations while gathering as much information as possible?
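One way to make the bi-objective comparison concrete is to keep only the non-dominated policies and let the systems engineer pick an operating point by setting a violation limit. The sketch below assumes each policy has already been evaluated to a (mean reward, violation rate) pair; the policy names and numbers are invented for illustration.

```python
def pareto_front(policies):
    """policies: dict name -> (mean_reward, violation_rate).
    Keep the policies that no other policy dominates (higher reward is better,
    lower violation rate is better)."""
    front = []
    for name, (reward, violations) in policies.items():
        dominated = any(
            other_reward >= reward
            and other_violations <= violations
            and (other_reward > reward or other_violations < violations)
            for other, (other_reward, other_violations) in policies.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


def best_under_safety_level(policies, max_violation_rate):
    """Highest-reward policy whose violation rate is within the engineer's limit."""
    safe = {n: rv for n, rv in policies.items() if rv[1] <= max_violation_rate}
    return max(safe, key=lambda n: safe[n][0]) if safe else None


# Hypothetical evaluation results: (mean reward, fraction of steps with a violation).
policies = {
    "cautious": (0.60, 0.00),
    "balanced": (0.80, 0.02),
    "aggressive": (0.90, 0.10),
    "poor": (0.70, 0.05),
}
front = pareto_front(policies)  # "poor" is dominated by "balanced"
choice = best_under_safety_level(policies, max_violation_rate=0.03)
```

Policies on the front are not comparable without a preference between safety and reward, which is exactly why the desired safety level has to come from the systems engineer.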
Multi-agent control
It often happens that the system does not exist in isolation; rather, it has multiple copies, either interacting or not. The most well-known example is the self-driving car, but we also have spatially distributed and connected systems in telecommunications.
- When we collect data from multiple non-interacting systems, sharing their experience is a great way to accelerate learning.
- In case the systems are built in sequence, transferring the learned model from one system to another, perhaps parameterized differently, is the principal challenge.
- In a connected network of interacting systems, we may use both experience sharing and transfer, but we also need to take into consideration the interaction between the systems and the control agents when we design the RL or CB algorithm.
- When dealing with multiple systems, we also need to deal with multiple rewards. One way to adapt any technique to this setting is to optimize the average of these rewards. This may lead to overly focusing on some systems while ignoring others. One could instead focus on optimizing the rewards in a fair way, and improve the policies on most of the systems.
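The difference between averaging rewards and treating systems fairly can be seen on a tiny table. The numbers are hypothetical, and max-min (optimizing the worst-off system) is used here as just one possible notion of fairness.

```python
import numpy as np

# Hypothetical per-system mean rewards (columns) under three candidate policies (rows).
rewards = np.array([
    [0.9, 0.9, 0.1],  # policy 0: great on two systems, neglects the third
    [0.6, 0.6, 0.6],  # policy 1: the same on every system
    [0.7, 0.5, 0.5],  # policy 2
])

best_on_average = int(np.argmax(rewards.mean(axis=1)))  # average-reward criterion
best_worst_case = int(np.argmax(rewards.min(axis=1)))   # max-min fairness criterion
```

Here the average favors policy 0 even though it nearly abandons one system, while the max-min criterion selects policy 1.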
Policy evaluation and AutoML
Policy evaluation is a critical point in the iterated offline RL loop. It is needed for selecting the best algorithms and to tune the models. Our ultimate goal is not only automating the pilot, but also automating the process that learns the autopilot. As an intermediate step we aim at providing easy-to-use tools to the systems engineers so they can debug and develop both system models and control agents.