Policies

Copyright 2023 The TF-Agents Authors.

Introduction

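In reinforcement learning terminology, a policy maps an observation from the environment to an action or to a distribution over actions. In TF-Agents, observations from the environment are contained in a named tuple TimeStep('step_type', 'discount', 'reward', 'observation'), and policies map time steps to actions or distributions over actions. Most policies use timestep.observation, some policies use timestep.step_type (e.g. to reset the state at the beginning of an episode in stateful policies), but timestep.discount and timestep.reward are usually ignored.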
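To make the mapping concrete, here is a minimal plain-Python sketch. The TimeStep tuple below is a hypothetical stand-in that only mirrors the field names mentioned above (it is not the TF-Agents class), and threshold_policy and the integer step types are assumptions made for illustration.

```python
import collections

# Hypothetical stand-in for the TimeStep named tuple; the field names follow
# the text above, but this is not the TF-Agents class.
TimeStep = collections.namedtuple(
    'TimeStep', ['step_type', 'discount', 'reward', 'observation'])

FIRST, MID, LAST = 0, 1, 2  # Assumed integer encoding of step types.


def threshold_policy(time_step):
  """Maps a time step to a discrete action using only the observation."""
  # discount and reward are ignored, as most policies do.
  return 1 if sum(time_step.observation) > 0 else 0


first_step = TimeStep(step_type=FIRST, discount=1.0, reward=0.0,
                      observation=[0.5, -0.2])
chosen_action = threshold_policy(first_step)
print(chosen_action)
```

The function reads only time_step.observation, which is the common case described above.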

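Policies relate to other components in TF-Agents as follows. Most policies have a neural network that computes actions and/or distributions over actions from TimeSteps. Agents can contain one or more policies for different purposes, e.g. a main policy that is being trained for deployment, and a noisy policy for data collection. Policies can be saved and restored, and can be used independently of the agent for data collection, evaluation, etc.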

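Some policies are easier to write in TensorFlow (e.g. those with a neural network), whereas others are easier to write in Python (e.g. those that follow a script of actions). TF-Agents therefore supports both Python and TensorFlow policies. Moreover, a policy written in TensorFlow might have to be used in a Python environment, or vice versa, e.g. a TensorFlow policy is used for training but later deployed in a production Python environment. To make this easier, TF-Agents provides wrappers for converting between Python and TensorFlow policies.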

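Another interesting class of policies is policy wrappers, which modify a given policy in a certain way, e.g. add a particular type of noise, make a greedy or epsilon-greedy version of a stochastic policy, randomly mix multiple policies, etc.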

Setup

If you haven't installed tf-agents yet, run:

pip install tf-agents
pip install tf-keras

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os
# Keep using keras-2 (tf-keras) rather than keras-3 (keras).
os.environ['TF_USE_LEGACY_KERAS'] = '1'

import abc
import tensorflow as tf
import tensorflow_probability as tfp
import numpy as np

from tf_agents.specs import array_spec
from tf_agents.specs import tensor_spec
from tf_agents.networks import network

from tf_agents.policies import py_policy
from tf_agents.policies import random_py_policy
from tf_agents.policies import scripted_py_policy

from tf_agents.policies import tf_policy
from tf_agents.policies import random_tf_policy
from tf_agents.policies import actor_policy
from tf_agents.policies import q_policy
from tf_agents.policies import greedy_policy

from tf_agents.trajectories import time_step as ts

Python Policies

The interface for Python policies is defined in policies/py_policy.PyPolicy. The main methods are:

class Base(object):

  @abc.abstractmethod
  def __init__(self, time_step_spec, action_spec, policy_state_spec=()):
    self._time_step_spec = time_step_spec
    self._action_spec = action_spec
    self._policy_state_spec = policy_state_spec

  @abc.abstractmethod
  def reset(self, policy_state=()):
    # return initial_policy_state.
    pass

  @abc.abstractmethod
  def action(self, time_step, policy_state=()):
    # return a PolicyStep(action, state, info) named tuple.
    pass

  @abc.abstractmethod
  def distribution(self, time_step, policy_state=()):
    # Not implemented in python, only for TF policies.
    pass

  @abc.abstractmethod
  def update(self, policy):
    # update self to be similar to the input `policy`.
    pass

  @property
  def time_step_spec(self):
    return self._time_step_spec

  @property
  def action_spec(self):
    return self._action_spec

  @property
  def policy_state_spec(self):
    return self._policy_state_spec

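The most important method is action(time_step), which maps a time_step containing an observation from the environment to a PolicyStep named tuple with the following attributes:

action: The action to be applied to the environment.
state: The state of the policy (e.g. RNN state) to be fed into the next call to action.
info: Optional side information such as action log probabilities.

The time_step_spec and action_spec are specifications for the input time step and the output action. Policies also have a reset function, which is typically used for resetting the state in stateful policies. The update(new_policy) function updates self towards new_policy.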
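The way state is threaded from one action call to the next can be sketched in plain Python. PolicyStep below is a hypothetical stand-in with the same three fields, and AlternatingPolicy is a toy class invented for this illustration; neither is a TF-Agents class.

```python
import collections

# Hypothetical stand-in for PolicyStep; not the TF-Agents class.
PolicyStep = collections.namedtuple('PolicyStep', ['action', 'state', 'info'])


class AlternatingPolicy(object):
  """Toy stateful policy that alternates between actions 0 and 1."""

  def get_initial_state(self):
    return 0  # Number of actions taken so far.

  def action(self, time_step, policy_state=0):
    del time_step  # This toy policy ignores the observation.
    return PolicyStep(action=policy_state % 2, state=policy_state + 1, info=())


policy = AlternatingPolicy()
state = policy.get_initial_state()
actions = []
for _ in range(4):
  step = policy.action(None, state)
  actions.append(step.action)
  state = step.state  # Feed the returned state into the next call.
print(actions)
```

The caller owns the state: it passes the previous PolicyStep.state back in, and starting again from get_initial_state() restarts the sequence.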

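Now, let us look at a couple of examples of Python policies.

Example 1: Random Python Policy

A simple example of a PyPolicy is the RandomPyPolicy, which generates random actions for the given discrete/continuous action_spec. The input time_step is ignored.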

action_spec = array_spec.BoundedArraySpec((2,), np.int32, -10, 10)
my_random_py_policy = random_py_policy.RandomPyPolicy(time_step_spec=None,
    action_spec=action_spec)
time_step = None
action_step = my_random_py_policy.action(time_step)
print(action_step)
action_step = my_random_py_policy.action(time_step)
print(action_step)

PolicyStep(action=array([5, 3], dtype=int32), state=(), info=())
PolicyStep(action=array([-4,  3], dtype=int32), state=(), info=())

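Example 2: Scripted Python Policy

A scripted policy plays back a script of actions represented as a list of (num_repeats, action) tuples. Every time the action function is called, it returns the next action from the list until the specified number of repeats is done, and then moves on to the next action in the list. The reset method can be called to start executing from the beginning of the list.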

action_spec = array_spec.BoundedArraySpec((2,), np.int32, -10, 10)
action_script = [(1, np.array([5, 2], dtype=np.int32)),
                 (0, np.array([0, 0], dtype=np.int32)), # Setting `num_repeats` to 0 will skip this action.
                 (2, np.array([1, 2], dtype=np.int32)),
                 (1, np.array([3, 4], dtype=np.int32))]

my_scripted_py_policy = scripted_py_policy.ScriptedPyPolicy(
    time_step_spec=None, action_spec=action_spec, action_script=action_script)

policy_state = my_scripted_py_policy.get_initial_state()
time_step = None
print('Executing scripted policy...')
action_step = my_scripted_py_policy.action(time_step, policy_state)
print(action_step)
action_step = my_scripted_py_policy.action(time_step, action_step.state)
print(action_step)
action_step = my_scripted_py_policy.action(time_step, action_step.state)
print(action_step)

print('Resetting my_scripted_py_policy...')
policy_state = my_scripted_py_policy.get_initial_state()
action_step = my_scripted_py_policy.action(time_step, policy_state)
print(action_step)

Executing scripted policy...
PolicyStep(action=array([5, 2], dtype=int32), state=[0, 1], info=())
PolicyStep(action=array([1, 2], dtype=int32), state=[2, 1], info=())
PolicyStep(action=array([1, 2], dtype=int32), state=[2, 2], info=())
Resetting my_scripted_py_policy...
PolicyStep(action=array([5, 2], dtype=int32), state=[0, 1], info=())
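The playback rule itself, including the effect of a num_repeats of 0, can be sketched with a small generator. This is an illustration of the rule only, not how ScriptedPyPolicy is implemented; play_script and toy_script are hypothetical names.

```python
def play_script(action_script):
  """Yields each action `num_repeats` times, in script order."""
  for num_repeats, action in action_script:
    for _ in range(num_repeats):  # A count of 0 yields nothing.
      yield action


toy_script = [(1, 'a'), (0, 'skipped'), (2, 'b'), (1, 'c')]
played = list(play_script(toy_script))
print(played)
```

The entry with a repeat count of 0 never appears, which matches the output above, where the second script entry is skipped.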

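TensorFlow Policies

TensorFlow policies follow the same interface as Python policies. Let us look at a few examples.

Example 1: Random TF Policy

A RandomTFPolicy can be used to generate random actions according to a given discrete/continuous action_spec. The input time_step is ignored.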

action_spec = tensor_spec.BoundedTensorSpec(
    (2,), tf.float32, minimum=-1, maximum=3)
input_tensor_spec = tensor_spec.TensorSpec((2,), tf.float32)
time_step_spec = ts.time_step_spec(input_tensor_spec)

my_random_tf_policy = random_tf_policy.RandomTFPolicy(
    action_spec=action_spec, time_step_spec=time_step_spec)
observation = tf.ones(time_step_spec.observation.shape)
time_step = ts.restart(observation)
action_step = my_random_tf_policy.action(time_step)

print('Action:')
print(action_step.action)

Action:
tf.Tensor([9.8276138e-04 2.8761353e+00], shape=(2,), dtype=float32)

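Example 2: Actor Policy

An actor policy can be created using either a network that maps time_steps to actions or a network that maps time_steps to distributions over actions.

Using an action network

Let us define a network as follows: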

class ActionNet(network.Network):

  def __init__(self, input_tensor_spec, output_tensor_spec):
    super(ActionNet, self).__init__(
        input_tensor_spec=input_tensor_spec,
        state_spec=(),
        name='ActionNet')
    self._output_tensor_spec = output_tensor_spec
    self._sub_layers = [
        tf.keras.layers.Dense(
            output_tensor_spec.shape.num_elements(), activation=tf.nn.tanh),
    ]

  def call(self, observations, step_type, network_state):
    del step_type

    output = tf.cast(observations, dtype=tf.float32)
    for layer in self._sub_layers:
      output = layer(output)
    actions = tf.reshape(output, [-1] + self._output_tensor_spec.shape.as_list())

    # Scale and shift actions to the correct range if necessary.
    return actions, network_state

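In TensorFlow, most network layers are designed for batch operations, so we expect the input time_steps to be batched, and the output of the network will be batched as well. The network is also responsible for producing actions in the correct range of the given action_spec. This is conventionally done by using, e.g., a tanh activation for the final layer to produce actions in [-1, 1], and then scaling and shifting them to the range of the input action_spec (e.g. see tf_agents/agents/ddpg/networks.actor_network()).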
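The scale-and-shift step can be sketched in NumPy as a linear map from [-1, 1] onto [minimum, maximum]. scale_to_spec is a hypothetical helper written for this illustration, not a TF-Agents function.

```python
import numpy as np


def scale_to_spec(tanh_output, minimum, maximum):
  """Maps values in [-1, 1] linearly onto [minimum, maximum]."""
  mean = (maximum + minimum) / 2.0
  half_range = (maximum - minimum) / 2.0
  return mean + half_range * tanh_output


# tanh saturates near -1 and 1 for large-magnitude inputs.
raw = np.tanh(np.array([-10.0, 0.0, 10.0]))
scaled = scale_to_spec(raw, minimum=-1.0, maximum=3.0)
print(scaled)
```

With minimum=-1 and maximum=1 the map is the identity, which is why the ActionNet in this example can return the tanh output unchanged.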

Now, we can create an actor policy using the above network.

input_tensor_spec = tensor_spec.TensorSpec((4,), tf.float32)
time_step_spec = ts.time_step_spec(input_tensor_spec)
action_spec = tensor_spec.BoundedTensorSpec((3,),
                                            tf.float32,
                                            minimum=-1,
                                            maximum=1)

action_net = ActionNet(input_tensor_spec, action_spec)

my_actor_policy = actor_policy.ActorPolicy(
    time_step_spec=time_step_spec,
    action_spec=action_spec,
    actor_network=action_net)

We can apply it to any batch of time_steps that follows time_step_spec:

batch_size = 2
observations = tf.ones([batch_size] + time_step_spec.observation.shape.as_list())

time_step = ts.restart(observations, batch_size)

action_step = my_actor_policy.action(time_step)
print('Action:')
print(action_step.action)

distribution_step = my_actor_policy.distribution(time_step)
print('Action distribution:')
print(distribution_step.action)

Action:
tf.Tensor(
[[ 0.85880756 -0.74206954 -0.7772715 ]
 [ 0.85880756 -0.74206954 -0.7772715 ]], shape=(2, 3), dtype=float32)
Action distribution:
tfp.distributions.Deterministic("Deterministic", batch_shape=[2, 3], event_shape=[], dtype=float32)

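In the above example, we created the policy using an action network that produces an action tensor. In this case, policy.distribution(time_step) is a deterministic (delta) distribution around the output of policy.action(time_step). One way to produce a stochastic policy is to wrap the actor policy in a policy wrapper that adds noise to the actions. Another way is to create the actor policy using an action distribution network instead of an action network, as shown below.

Using an action distribution network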

class ActionDistributionNet(ActionNet):

  def call(self, observations, step_type, network_state):
    action_means, network_state = super(ActionDistributionNet, self).call(
        observations, step_type, network_state)

    action_std = tf.ones_like(action_means)
    return tfp.distributions.MultivariateNormalDiag(action_means, action_std), network_state


action_distribution_net = ActionDistributionNet(input_tensor_spec, action_spec)

my_actor_policy = actor_policy.ActorPolicy(
    time_step_spec=time_step_spec,
    action_spec=action_spec,
    actor_network=action_distribution_net)

action_step = my_actor_policy.action(time_step)
print('Action:')
print(action_step.action)
distribution_step = my_actor_policy.distribution(time_step)
print('Action distribution:')
print(distribution_step.action)

Action:
tf.Tensor(
[[-1.          1.          1.        ]
 [-0.7074561   0.1602813   0.34091526]], shape=(2, 3), dtype=float32)
Action distribution:
tfp.distributions.MultivariateNormalDiag("MultivariateNormalDiag", batch_shape=[2], event_shape=[3], dtype=float32)

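Note that in the above, actions are clipped to the range of the given action spec, [-1, 1]. This is because the ActorPolicy constructor argument clip defaults to True. Setting it to False returns the unclipped actions produced by the network.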
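The effect of clipping on sampled actions can be illustrated with NumPy. This is a sketch of the idea only, under the assumption of unit-variance Gaussian samples as in the network above; it does not reproduce ActorPolicy internals.

```python
import numpy as np

rng = np.random.default_rng(0)
means = np.zeros((2, 3))  # Stand-in for a batch of action means.
samples = rng.normal(loc=means, scale=1.0)  # Unit standard deviation.
# Any sample outside the spec range is moved to the nearest bound.
clipped = np.clip(samples, -1.0, 1.0)
print(clipped)
```

Samples that fall outside the range land exactly on a bound, which is why values of exactly -1 and 1 appear in the output above.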

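Stochastic policies can be converted to deterministic policies using, for example, a GreedyPolicy wrapper, which chooses stochastic_policy.distribution().mode() as its action, and uses a deterministic/delta distribution around this greedy action as its distribution().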

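Example 3: Q Policy

A Q policy is used in agents like DQN and is based on a Q network that predicts a Q value for each discrete action. For a given time step, the action distribution in the Q policy is a categorical distribution created using the Q values as logits.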

input_tensor_spec = tensor_spec.TensorSpec((4,), tf.float32)
time_step_spec = ts.time_step_spec(input_tensor_spec)
action_spec = tensor_spec.BoundedTensorSpec((),
                                            tf.int32,
                                            minimum=0,
                                            maximum=2)
num_actions = action_spec.maximum - action_spec.minimum + 1


class QNetwork(network.Network):

  def __init__(self, input_tensor_spec, action_spec, num_actions=num_actions, name=None):
    super(QNetwork, self).__init__(
        input_tensor_spec=input_tensor_spec,
        state_spec=(),
        name=name)
    self._sub_layers = [
        tf.keras.layers.Dense(num_actions),
    ]

  def call(self, inputs, step_type=None, network_state=()):
    del step_type
    inputs = tf.cast(inputs, tf.float32)
    for layer in self._sub_layers:
      inputs = layer(inputs)
    return inputs, network_state


batch_size = 2
observation = tf.ones([batch_size] + time_step_spec.observation.shape.as_list())
time_steps = ts.restart(observation, batch_size=batch_size)

my_q_network = QNetwork(
    input_tensor_spec=input_tensor_spec,
    action_spec=action_spec)
my_q_policy = q_policy.QPolicy(
    time_step_spec, action_spec, q_network=my_q_network)
action_step = my_q_policy.action(time_steps)
distribution_step = my_q_policy.distribution(time_steps)

print('Action:')
print(action_step.action)

print('Action distribution:')
print(distribution_step.action)

Action:
tf.Tensor([2 0], shape=(2,), dtype=int32)
Action distribution:
tfp.distributions.Categorical("Categorical", batch_shape=[2], event_shape=[], dtype=int32)
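Using Q values as logits means the action probabilities are a softmax over the Q values. The NumPy sketch below illustrates this with made-up Q values; categorical_probs and toy_q_values are hypothetical names, and the code does not call QPolicy.

```python
import numpy as np


def categorical_probs(q_values):
  """Softmax over the last axis, treating Q values as logits."""
  shifted = q_values - np.max(q_values, axis=-1, keepdims=True)  # Stability.
  exp = np.exp(shifted)
  return exp / np.sum(exp, axis=-1, keepdims=True)


# Batch of 2 time steps with 3 discrete actions each (made-up values).
toy_q_values = np.array([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]])
probs = categorical_probs(toy_q_values)

rng = np.random.default_rng(0)
sampled = [int(rng.choice(3, p=p)) for p in probs]
print(probs)
print(sampled)
```

Equal Q values give a uniform distribution, and larger Q values make an action more likely without making it certain, so sampled actions can differ between calls.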

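Policy Wrappers

A policy wrapper can be used to wrap and modify a given policy, e.g. to add noise. Policy wrappers are a subclass of Policy (Python/TensorFlow) and can therefore be used just like any other policy.

Example: Greedy Policy

A greedy wrapper can be used to wrap any TensorFlow policy that implements distribution(). GreedyPolicy.action() returns wrapped_policy.distribution().mode(), and GreedyPolicy.distribution() is a deterministic/delta distribution around GreedyPolicy.action():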

my_greedy_policy = greedy_policy.GreedyPolicy(my_q_policy)

action_step = my_greedy_policy.action(time_steps)
print('Action:')
print(action_step.action)

distribution_step = my_greedy_policy.distribution(time_steps)
print('Action distribution:')
print(distribution_step.action)

Action:
tf.Tensor([0 0], shape=(2,), dtype=int32)
Action distribution:
tfp.distributions.Deterministic("Deterministic", batch_shape=[2], event_shape=[], dtype=int32)
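For a categorical distribution, the mode is the action with the largest logit, so the greedy action is an argmax over the Q values. The NumPy sketch below shows this, together with the epsilon-greedy variant mentioned in the introduction. Both functions are hypothetical helpers for illustration and do not use the TF-Agents wrappers.

```python
import numpy as np


def greedy_action(q_values):
  """Mode of the categorical distribution whose logits are q_values."""
  return int(np.argmax(q_values))


def epsilon_greedy_action(q_values, epsilon, rng):
  """Picks a uniform random action with probability epsilon, else greedy."""
  if rng.random() < epsilon:
    return int(rng.integers(len(q_values)))
  return greedy_action(q_values)


toy_q_values = np.array([0.1, 0.7, 0.2])  # Made-up Q values for 3 actions.
rng = np.random.default_rng(0)
greedy = greedy_action(toy_q_values)
always_greedy = epsilon_greedy_action(toy_q_values, epsilon=0.0, rng=rng)
explored = epsilon_greedy_action(toy_q_values, epsilon=1.0, rng=rng)
print(greedy, always_greedy, explored)
```

With epsilon=0 the wrapper reduces to the greedy policy, and with epsilon=1 every action is drawn uniformly at random.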