OpenAI Gym Tutorial (6): Basic Steps for Building a Custom Environment

Although OpenAI Gym ships with a large collection of ready-made environments to experiment with, sooner or later we will need an environment of our own. This post walks through the basic steps of building a custom environment.

We will borrow a simple example: a number-guessing game. It is much like the familiar guessing game, with a small twist. The game picks a random floating-point number (the range it can take is not known in advance, although the corresponding value space (Spaces) can be declared), and you get at most 200 guesses. After each guess the game returns one of four possible observations (Observation):

  • 0 — initial value, returned only right after reset()
  • 1 — the guess is lower than the target
  • 2 — the guess equals the target; since the values are floating point, a guess within the 1% tolerance is treated as a hit through the reward (see below) rather than through exact equality
  • 3 — the guess is higher than the target

The rewards are:

  • 1 if the guess hits the target, within the 1% tolerance
  • 0 otherwise, i.e. the deviation exceeds the 1% tolerance

Based on these rules we can define the value spaces of the environment's Action and Observation. The observation space is simple; a discrete integer type is enough:

observation_space = spaces.Discrete(4)

The action space can be defined in several ways. The simplest is to use the guessed number itself as the action, for example:

self.bounds = 10000

self.action_space = spaces.Box(low=np.array([-self.bounds]), high=np.array([self.bounds]),
                                       dtype=np.float32)
self.observation_space = spaces.Discrete(4)

Alternatively, a Tuple space can be used, combining a Discrete component with a Box component: the first element says whether to increase or decrease the guess, and the second gives the amount of the change (see the sketch below). This article uses the first approach.
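For reference, a Tuple action space along those lines might look like the following sketch. It is not used in the rest of the article and simply reuses self.bounds and the spaces/np imports from the example above; note that the original text suggests Discrete(1), but two discrete values are needed to encode increase vs. decrease, so the sketch uses Discrete(2):

self.action_space = spaces.Tuple((
    spaces.Discrete(2),  # 0: decrease the guess, 1: increase the guess
    spaces.Box(low=np.array([0.0]), high=np.array([self.bounds]),
               dtype=np.float32),  # amount to change the guess by
))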

With this analysis in place, we can design the custom Gym environment. A Gym environment project is typically structured as follows:

.
├── LICENSE
├── README.md
├── guessing_number
│   ├── __init__.py
│   └── envs
│       ├── __init__.py
│       └── guessing_number_env.py
├── setup.py
└── test.py

setup.py is usually:

from setuptools import find_packages, setup

setup(
    name="guessing_number",
    version="0.0.1",
    install_requires=["gym>=0.2.3", "numpy"],
    packages=find_packages(),
)


It defines the project name, version, and the other packages it depends on.

guessing_number/__init__.py usually looks like this:

from gym.envs.registration import register

register(id="GuessingNumber-v0", entry_point="guessing_number.envs:GuessingNumberEnv")


The environment name follows the format GuessingNumber-v0; the suffix v0, v1, ... is conventionally used to distinguish different versions of an environment. entry_point gives the entry point of the environment's main class.
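The project tree also contains guessing_number/envs/__init__.py. Its content is not shown in the original article, but for the entry_point above to resolve it presumably just re-exports the environment class, roughly:

from guessing_number.envs.guessing_number_env import GuessingNumberEnv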

Following OpenAI Gym Tutorial (3): Environments (Env), we can define GuessingNumberEnv as follows:

import numpy as np

import gym
from gym import spaces
from gym.utils import seeding


class GuessingNumberEnv(gym.Env):
    """Number guessing game

    The object of the game is to guess within 1% of the randomly chosen number
    within 200 time steps

    After each step the agent is provided with one of four possible observations
    which indicate where the guess is in relation to the randomly chosen number

    0 - No guess yet submitted (only after reset)
    1 - Guess is lower than the target
    2 - Guess is equal to the target
    3 - Guess is higher than the target

    The rewards are:
    0 if the agent's guess is outside of 1% of the target
    1 if the agent's guess is inside 1% of the target

    The episode terminates after the agent guesses within 1% of the target or
    200 steps have been taken

    The agent will need to use a memory of previously submitted actions and observations
    in order to efficiently explore the available actions

    The purpose is to have agents optimise their exploration parameters (e.g. how far to
    explore from previous actions) based on previous experience. Because the goal changes
    each episode a state-value or action-value function isn't able to provide any additional
    benefit apart from being able to tell whether to increase or decrease the next guess.

    The perfect agent would likely learn the bounds of the action space (without referring
    to them explicitly) and then follow binary tree style exploration towards the goal number
    """
    def __init__(self):
        self.range = 1000  # Randomly selected number is within +/- this value
        self.bounds = 10000

        self.action_space = spaces.Box(low=np.array([-self.bounds]), high=np.array([self.bounds]),
                                       dtype=np.float32)
        self.observation_space = spaces.Discrete(4)

        self.number = 0
        self.guess_count = 0
        self.guess_max = 200
        self.observation = 0

        self.seed()
        self.reset()

    def seed(self, seed=None):
        self.np_random, seed = seeding.np_random(seed)
        return [seed]

    def step(self, action):
        assert self.action_space.contains(action)

        if action < self.number:
            self.observation = 1

        elif action == self.number:
            self.observation = 2

        elif action > self.number:
            self.observation = 3

        reward = 0
        done = False

        # A guess counts as a hit when it falls within 1% of self.range of the target (±10 when range=1000)
        if (self.number - self.range * 0.01) < action < (self.number + self.range * 0.01):
            reward = 1
            done = True

        self.guess_count += 1
        if self.guess_count >= self.guess_max:
            done = True

        return self.observation, reward, done, {"number": self.number, "guesses": self.guess_count}

    def reset(self):
        self.number = self.np_random.uniform(-self.range, self.range)
        self.guess_count = 0
        self.observation = 0
        return self.observation

Once the project is in place, it can be installed with

pip install -e .

to install the environment in editable (development) mode.
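A quick sanity check after installation (a minimal sketch; nothing here beyond the environment id comes from the original article):

import gym

import guessing_number  # importing the package runs the register() call above

env = gym.make('GuessingNumber-v0')
print(env.action_space)       # the Box action space defined in __init__
print(env.observation_space)  # Discrete(4)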

Finally, we can write two different agents to play this guessing game. The first one guesses purely at random:

class RandomAgent(object):
    """The world's simplest agent!"""

    def __init__(self, action_space):
        self.action_space = action_space

    def act(self, observation, reward, done):
        return self.action_space.sample()


The second is an improved random guesser: based on the observation returned by step() (1 = too low, 3 = too high), it adjusts its next guess accordingly:

class BetterRandomAgent(object):
    """The world's 2nd simplest agent!"""

    def __init__(self, action_space):
        self.action_space = action_space

    def act(self, observation, last_action):
        new_action = last_action
        if observation == 1:
            new_action = last_action + abs(last_action / 2)

        elif observation == 3:
            new_action = last_action - abs(last_action / 2)
        if abs(last_action - new_action) < 1e-1:
            new_action = self.action_space.sample()
        return new_action


With these two agents we can test the environment. Run each of them for 100 episodes and compare how many times each agent hits the target and how many guesses it needs on average:

import gym

import guessing_number

class RandomAgent(object):
    """The world's simplest agent!"""

    def __init__(self, action_space):
        self.action_space = action_space

    def act(self, observation, reward, done):
        return self.action_space.sample()


class BetterRandomAgent(object):
    """The world's 2nd simplest agent!"""

    def __init__(self, action_space):
        self.action_space = action_space

    def act(self, observation, last_action):
        new_action = last_action
        if observation == 1:
            new_action = last_action + abs(last_action / 2)

        elif observation == 3:
            new_action = last_action - abs(last_action / 2)
        if abs(last_action - new_action) < 1e-1:
            new_action = self.action_space.sample()
        return new_action


if __name__ == '__main__':

    env = gym.make('GuessingNumber-v0')
    env.seed(0)
    agent = BetterRandomAgent(env.action_space)

    episode_count = 100
    reward = 0
    done = False

    total_reward = 0
    total_guesses = 0
    for i in range(episode_count):
        last_action = env.action_space.sample()
        ob = env.reset()
        while True:
            action = agent.act(ob, last_action)
            ob, reward, done, info = env.step(action)
            last_action = action

            # print(f'count={info["guesses"]},number={info["number"]},guess={action},ob={ob},reward={reward}')
            if done:
                total_reward += reward
                total_guesses += int(info["guesses"])
                break

    print(f'Total better random reward {total_reward}, average guess {round(total_guesses / 100, 1)}')

    env.seed(0)
    agent = RandomAgent(env.action_space)
    reward = 0
    done = False

    total_reward = 0
    total_guesses = 0

    for i in range(episode_count):
        ob = env.reset()
        while True:
            action = agent.act(ob, reward, done)
            ob, reward, done, info = env.step(action)

            if done:
                total_reward += reward
                total_guesses += int(info["guesses"])
                break

    # Close the environment
    env.close()
    print(f'Total random reward {total_reward}, average guess {round(total_guesses / 100, 1)}')


Some results from a few different runs:

Total better random reward 100, average guess 35.9
Total random reward 15, average guess 180.6
-----
Total better random reward 100, average guess 39.2
Total random reward 20, average guess 175.9
--
Total better random reward 100, average guess 38.2
Total random reward 24, average guess 177.2
--
Total better random reward 100, average guess 38.6
Total random reward 18, average guess 180.4


As the numbers show, the improved agent hits the target in essentially every episode, needing under 40 guesses on average, while the purely random agent succeeds only around 20 times out of 100 and uses roughly 180 guesses per episode.

If we instead design an agent based on binary search, the catch is that the target's range is not known in advance. The agent can therefore first bracket the target: start from, say, 100 and double the guess each time until the observation flips from "too low" to "too high" (or vice versa). Since the action space in our example is (-10000, 10000), at most 8 doublings reach a bracket inside [-25600, 25600]; bisecting that bracket then takes only on the order of another ten guesses to land within the 1% tolerance. A rough sketch of such an agent is given below.
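The following sketch is not from the original article; the class name, the zero starting guess, and the step-doubling bookkeeping are illustrative choices. It uses the same act(observation, last_action) interface as BetterRandomAgent above, so it can be dropped into the same test loop:

import numpy as np


class BinarySearchAgent(object):
    """Exponential bracketing followed by bisection (illustrative sketch)."""

    def __init__(self, action_space):
        self.action_space = action_space
        self.low = None          # best known lower bound on the target
        self.high = None         # best known upper bound on the target
        self.step_size = 100.0   # exponential step used while no bracket exists

    def act(self, observation, last_action):
        if observation == 0:
            # fresh episode: forget the old bracket and start from zero
            self.low, self.high = None, None
            self.step_size = 100.0
            return np.array([0.0], dtype=np.float32)

        if observation == 1:      # last guess was too low -> it is a lower bound
            self.low = float(last_action)
        elif observation == 3:    # last guess was too high -> it is an upper bound
            self.high = float(last_action)

        if self.low is not None and self.high is not None:
            # bracket established: bisect it
            guess = (self.low + self.high) / 2.0
        elif self.low is not None:
            # only a lower bound so far: step upwards, doubling the step each time
            guess = self.low + self.step_size
            self.step_size *= 2.0
        else:
            # only an upper bound so far: step downwards, doubling the step each time
            guess = self.high - self.step_size
            self.step_size *= 2.0

        # keep the guess inside the declared action bounds
        guess = float(np.clip(guess, self.action_space.low[0], self.action_space.high[0]))
        return np.array([guess], dtype=np.float32)

With this scheme an episode spends a handful of steps bracketing the target and roughly another ten bisecting, which stays well under the 200-step limit.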

本篇源码 https://github.com/guidebee/guessing_number