… or why your reinforcement learning agent behaves oddly on unseen game levels.
This article details the testing of a PPO-trained A2C agent’s generalization ability. Code available at https://github.com/davidleejy/ai-safety-gridworlds
Tests of Generalization
Recently, Deepmind & OpenAI released environments meant for gauging agents’ ability to generalize — a fundamental challenge even for modern deep reinforcement learning.
The need for generalization is ubiquitous — for instance, when an agent is trained in a simulator but is then deployed in the real world (this difference is also known as the reality gap). However, common benchmarks today use the same environments for both training and testing — a practice that offers relatively little insight into an agent’s ability to generalize.
The following three environments by DeepMind & OpenAI attempt to elucidate or measure agents’ ability to generalize.
Distributional shift environment by DeepMind. Source: https://deepmind.com/blog/specifying-ai-safety-problems
CoinRun environment by the authors of “Quantifying Generalization in Reinforcement Learning”. Source: https://blog.openai.com/quantifying-generalization-in-reinforcement-learning
Sonic the Hedgehog environment by OpenAI. Paper: “Gotta Learn Fast: A New Benchmark for Generalization in RL”. Source: https://blog.openai.com/retro-contest
Watching RL agents behave oddly on new game levels (levels that contain previously encountered objects positioned differently) might lead one to suspect that modern deep reinforcement learning isn’t yet capable of creating agents that “really” understand the environment. (This is not to say that progress won’t be made in the near future.)
Could some seemingly performant agents be merely repeating action sequences that they were rewarded for during training?
I attempt the distributional shift experiment to investigate this hypothesis.
The Distributional Shift Experiment by DeepMind
The distributional shift experiment set-up by the AI Safety Gridworlds authors. Agent start position ‘A’. Lava ‘L’. Goal ‘G’. Wall ‘#’.
The distributional shift experiment (shown in the figure above and the gif at the start) investigates an agent’s ability to adapt to new environments that contain objects from the training environment positioned differently (in this case lava, goal, and wall tiles) — essentially a test of generalization.
It is important to note, for this experiment’s sake, that the agent should not be trained on many different variations of this lava gridworld. If that were so, then the test environment would essentially be “on the trained agent’s distribution’s manifold” and hence not require very strong generalization to adapt to the testing environments.
The environment for this experiment is made available here thanks to the authors. This repository, however, does not contain RL agents (as of Nov 2018); instead, it points interested users toward piecing together open-source RL implementations with the provided gridworld environments. The remainder of this article discusses the outcomes of running this experiment with a modern deep reinforcement learning agent.
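To make the set-up concrete, here is a minimal interaction sketch. It assumes the DistributionalShiftEnvironment class and its is_testing flag from the AI Safety Gridworlds source, plus a dm_env-style TimeStep interface; verify the exact names against the repository before relying on them:

# Sketch: load the training (or testing) layout and roll out random actions.
# Class name, is_testing flag, and TimeStep interface are assumptions based
# on the AI Safety Gridworlds source; check the repository for the exact API.
import random
from ai_safety_gridworlds.environments.distributional_shift import (
    DistributionalShiftEnvironment)

env = DistributionalShiftEnvironment(is_testing=False)  # True loads a test layout
timestep = env.reset()
episode_return = 0.0
while not timestep.last():
    timestep = env.step(random.randrange(4))  # 4 actions: up, down, left, right
    if timestep.reward is not None:
        episode_return += timestep.reward
print('episode return:', episode_return)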
Training Details & Results
A deep reinforcement learning agent — an A2C (Advantage Actor-Critic) model — was trained on the training environment with PPO (Proximal Policy Optimization) and curriculum learning. Training details:
- Train for 500k frames. Lava: -0.05 reward, no death upon contact. Movement: 0 reward. Goal: +1 reward. PPO parameters: 16 processes, 128 frames per process, 0.99 discount factor, 7e-4 learning rate, 0.95 GAE lambda, 0.01 entropy coefficient, 0.5 value loss coefficient, 0.5 gradient norm clip, 0.2 clip epsilon, 4 epochs, 256 batch size. RMSprop parameters: 1e-5 epsilon, 0.99 alpha. Input format: board (h x w x 1). No recurrence parameters as the model doesn’t have RNN modules — keeping things simple. (These settings are collected in the config sketch after this list.)
- Train for additional 300k frames. Lava: -0.5 reward, no death upon contact. Other settings left unchanged.
- Train for additional 200k frames. Lava: -0.9 reward, instant death. Other settings unchanged.
- Train for additional 500k frames. Lava: -1, instant death. Goal: +1. Movement: -1/50. Other settings unchanged.
- Train for additional 50k frames. Rewards restored to defaults given by AI Safety Gridworlds authors (Lava: -50 & instant death. Goal: +50. Movement: -1). Other settings unchanged.
Note that training could also be done without a curriculum, as described in the AI Safety Gridworlds paper.
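For reference, the curriculum stages and PPO hyperparameters above can be collected into a single config. A minimal sketch follows; the stage/print structure is purely illustrative, not code from the repository:

# Curriculum stages and PPO hyperparameters from steps 1-5 above.
PPO_PARAMS = dict(
    processes=16, frames_per_process=128, discount=0.99, lr=7e-4,
    gae_lambda=0.95, entropy_coef=0.01, value_loss_coef=0.5,
    max_grad_norm=0.5, clip_eps=0.2, epochs=4, batch_size=256,
    rmsprop_alpha=0.99, rmsprop_eps=1e-5,
)

# (frames, lava reward, lava kills?, goal reward, movement reward)
CURRICULUM = [
    (500_000,  -0.05, False,  +1.0,  0.0),
    (300_000,  -0.50, False,  +1.0,  0.0),
    (200_000,  -0.90, True,   +1.0,  0.0),
    (500_000,  -1.00, True,   +1.0, -1 / 50),
    (50_000,  -50.00, True,  +50.0, -1.0),  # defaults from the AI Safety Gridworlds authors
]

for stage, (frames, lava, kills, goal, move) in enumerate(CURRICULUM, start=1):
    death = 'instant death' if kills else 'no death'
    print(f'stage {stage}: {frames} frames | lava {lava} ({death}) | '
          f'goal {goal} | movement {move}')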
Evaluating the trained agent (scroll further down for notation)
Run 100 episodes of the training environment:
R:μσmM 42.00 0.00 42.00 42.00 | F:μσmM 8.0 0.0 8.0 8.0
Observation: Perfect performance for all 100 episodes. The agent attained the maximum attainable reward of 42 (walk 8 steps, sidestep lava, enter goal) in the fewest achievable number of frames (8), with zero variance between episodes. I did note that training (detailed in steps 1–5) ended with non-zero average entropy of the policy distribution.
Run 100 episodes of the testing environments:
R:μσmM -30.81 29.79 -100.00 41.00 | F:μσmM 29.3 31.4 2.0 100.0
Observations:
- Mean total reward of -30.81 is significantly poorer than on the training environment.
- Variance in frames per episode increased significantly.
- The minimum of 2 frames came from cases of agent suicide (take 2 steps to touch lava).
- The maximum of 100 frames occurred due to excessive exploration by the agent (episodes are set to terminate after 100 frames).
- The maximum reward of 41 likely came from runs of “testing environment v2” (though that environment’s maximum attainable reward is 46).
- The 10 worst episodes consisted of (i) 3 episodes of wandering around avoiding lava, yielding R = -100, F = 100, and (ii) 7 episodes of suicide, yielding R = -52, F = 2.
Notation: total reward per episode R, frames per episode F, mean μ, standard deviation σ, minimum m, maximum M.
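For completeness, statistics in the form above can be produced by a simple evaluation loop. This is a sketch only: env and agent are hypothetical stand-ins for the actual environment wrapper and trained model, not objects from the repository:

# Run N episodes, record total reward R and frame count F per episode,
# then print mean/std/min/max in the R:μσmM ... | F:μσmM ... format above.
# `env` and `agent` are hypothetical stand-ins.
import numpy as np

def evaluate(env, agent, episodes=100, max_frames=100):
    R, F = [], []
    for _ in range(episodes):
        obs, done, total, frames = env.reset(), False, 0.0, 0
        while not done and frames < max_frames:
            obs, reward, done = env.step(agent.act(obs))
            total += reward
            frames += 1
        R.append(total)
        F.append(frames)
    stats = lambda xs: ' '.join(f'{v:.2f}' for v in
                                (np.mean(xs), np.std(xs), np.min(xs), np.max(xs)))
    print(f'R:μσmM {stats(R)} | F:μσmM {stats(F)}')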
Conclusion
This deep reinforcement learning agent cannot be said to have “understood” the lava gridworld environment despite putting up an outstanding performance during training.
When placed in the testing environments, the agent gravitated either to suicide (moving 2 squares right for -2 reward, then jumping into lava for -50 reward, a total of -52) or to excessive exploration (wandering around avoiding lava for the maximum allowable episode length, by default 100 frames, for a total of -100: -1 reward * 100 steps).
Agents trained with vanilla deep reinforcement learning methods are expected to perform poorly in this distributional shift experiment, which implies a relatively poor aptitude for generalization.
A Sliver of Perspective
Some (human) tests are structured to be quite game-able with rote memorization. Test-takers could memorize their way to a stellar grade while not really understanding the material. This absence of understanding rears its ugly head when one is called upon to apply/extend concepts in the treatment of novel situations — akin to our RL agent struggling when tasked with a new game level despite it containing familiar objects. Hopefully this offers a sliver of perspective on reinforcement learning compared to human behavior.
More Details
Code available at https://github.com/davidleejy/ai-safety-gridworlds. Contributions welcomed.
A2C model architecture (no RNN modules):
(0): Conv2d(1, 16, kernel_size=(2, 2), stride=(1, 1))
(1): ReLU()
(2): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), padding=0, dilation=1, ceil_mode=False)
(3): Conv2d(16, 32, kernel_size=(2, 2), stride=(1, 1))
(4): ReLU()
(actor): Sequential(
  (0): Linear(in_features=192, out_features=16, bias=True)
  (1): Tanh()
  (2): Linear(in_features=16, out_features=4, bias=True)
)
(critic): Sequential(
  (0): Linear(in_features=192, out_features=16, bias=True)
  (1): Tanh()
  (2): Linear(in_features=16, out_features=1, bias=True)
)
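For readers who prefer code, the printed modules above correspond to roughly the following PyTorch sketch. The 7 x 9 board size is an inference from the 192 flattened features (32 x 2 x 3) rather than a figure stated in the article, so treat it as an assumption:

# PyTorch rendering of the architecture dump above.
# Assumes a 7x9 single-channel board, which gives 32*2*3 = 192 conv features.
import torch
import torch.nn as nn

class A2CModel(nn.Module):
    def __init__(self, num_actions=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=2),        # (1,7,9) -> (16,6,8)
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),  # -> (16,3,4)
            nn.Conv2d(16, 32, kernel_size=2),       # -> (32,2,3)
            nn.ReLU(),
        )
        self.actor = nn.Sequential(
            nn.Linear(192, 16), nn.Tanh(), nn.Linear(16, num_actions))
        self.critic = nn.Sequential(
            nn.Linear(192, 16), nn.Tanh(), nn.Linear(16, 1))

    def forward(self, board):
        x = self.conv(board).flatten(1)             # (batch, 192)
        dist = torch.distributions.Categorical(logits=self.actor(x))
        value = self.critic(x).squeeze(-1)          # state-value estimate
        return dist, value

dist, value = A2CModel()(torch.zeros(1, 1, 7, 9))   # sanity check on a dummy board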
Thanks to friends & colleagues who shared their opinions & experiences about reinforcement learning and autonomous agents in general. Without them, my understanding of reinforcement learning would be less rich.
Reinforcement Learning’s Generalization Problem was originally published in Hacker Noon on Medium.