Migrate multi-worker CPU/GPU training

This guide demonstrates how to migrate a multi-worker distributed training workflow from TensorFlow 1 to TensorFlow 2.

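To perform multi-worker training with CPUs/GPUs:

In TensorFlow 1, you typically use the tf.estimator.train_and_evaluate and tf.estimator.Estimator APIs.
In TensorFlow 2, use the Keras APIs to write the model, the loss function, the optimizer, and the metrics. Then distribute the training across multiple workers with the Keras Model.fit API or a custom training loop (with tf.GradientTape), combined with either tf.distribute.experimental.ParameterServerStrategy or tf.distribute.MultiWorkerMirroredStrategy. For more details, refer to the tutorials listed at the end of this guide.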

Setup

Start with some necessary imports and a simple dataset for demonstration purposes:

# The notebook uses a dataset instance for `Model.fit` with
# `ParameterServerStrategy`, which depends on symbols in TF 2.7.
# Install a utility needed for this demonstration
!pip install portpicker

import tensorflow as tf
import tensorflow.compat.v1 as tf1

features = [[1., 1.5], [2., 2.5], [3., 3.5]]
labels = [[0.3], [0.5], [0.7]]
eval_features = [[4., 4.5], [5., 5.5], [6., 6.5]]
eval_labels = [[0.8], [0.9], [1.]]

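To train on multiple machines in TensorFlow, you need the 'TF_CONFIG' configuration environment variable. Use 'TF_CONFIG' to specify the addresses of the 'cluster' and of the current 'task'. (Learn more in the Distributed training guide.)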

import json
import os

tf_config = {
    'cluster': {
        'chief': ['localhost:11111'],
        'worker': ['localhost:12345', 'localhost:23456', 'localhost:21212'],
        'ps': ['localhost:12121', 'localhost:13131'],
    },
    'task': {'type': 'chief', 'index': 0}
}

os.environ['TF_CONFIG'] = json.dumps(tf_config)
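
To make the relationship between 'cluster' and 'task' concrete, here is a minimal sketch that looks up the address of the current task in a 'TF_CONFIG' value. The helper name `task_address` is hypothetical and is not part of TensorFlow; it only uses the standard library and the same configuration layout as above.

```python
import json

# Same layout as the 'TF_CONFIG' value defined above.
raw_tf_config = json.dumps({
    'cluster': {
        'chief': ['localhost:11111'],
        'worker': ['localhost:12345', 'localhost:23456', 'localhost:21212'],
        'ps': ['localhost:12121', 'localhost:13131'],
    },
    'task': {'type': 'chief', 'index': 0}
})


def task_address(raw):
  """Hypothetical helper: returns the 'host:port' of the task named in 'task'."""
  config = json.loads(raw)
  cluster, task = config['cluster'], config['task']
  addresses = cluster.get(task['type'])
  if addresses is None:
    raise ValueError("Task type %r is not listed in 'cluster'" % task['type'])
  if not 0 <= task['index'] < len(addresses):
    raise ValueError("Task index %r is out of range" % task['index'])
  return addresses[task['index']]


print(task_address(raw_tf_config))
```

Every task in a cluster receives the same 'cluster' dictionary; only the 'task' entry differs from one process to the next.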

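Use the del statement to remove the 'TF_CONFIG' variable so that the code below runs locally in this notebook (in real-world multi-worker training in TensorFlow 1, you would not need to do this):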

del os.environ['TF_CONFIG']

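TensorFlow 1: Multi-worker distributed training with tf.estimator APIs

The following code snippet demonstrates the canonical workflow of multi-worker training in TF1: you use the tf.estimator.Estimator, tf.estimator.TrainSpec, tf.estimator.EvalSpec, and tf.estimator.train_and_evaluate APIs to distribute the training: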

def _input_fn():
  return tf1.data.Dataset.from_tensor_slices((features, labels)).batch(1)

def _eval_input_fn():
  return tf1.data.Dataset.from_tensor_slices(
      (eval_features, eval_labels)).batch(1)

def _model_fn(features, labels, mode):
  logits = tf1.layers.Dense(1)(features)
  loss = tf1.losses.mean_squared_error(labels=labels, predictions=logits)
  optimizer = tf1.train.AdagradOptimizer(0.05)
  train_op = optimizer.minimize(loss, global_step=tf1.train.get_global_step())
  return tf1.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

estimator = tf1.estimator.Estimator(model_fn=_model_fn)
train_spec = tf1.estimator.TrainSpec(input_fn=_input_fn)
eval_spec = tf1.estimator.EvalSpec(input_fn=_eval_input_fn)
tf1.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

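TensorFlow 2: Multi-worker training with distribution strategies

In TensorFlow 2, distributed training across multiple workers with CPUs, GPUs, and TPUs is done through tf.distribute.Strategy.

The following example demonstrates how to use two such strategies: tf.distribute.experimental.ParameterServerStrategy and tf.distribute.MultiWorkerMirroredStrategy. Both are designed for CPU/GPU training with multiple workers.

ParameterServerStrategy uses a coordinator ('chief'), which makes it a better fit for the environment of this Colab notebook. You will use a few utilities to set up the supporting pieces needed to make the example runnable here: you will create an in-process cluster in which threads simulate the parameter servers ('ps') and the workers ('worker'). For more information about parameter server training, refer to the Parameter server training with ParameterServerStrategy tutorial.

In this example, first define the 'TF_CONFIG' environment variable, which a tf.distribute.cluster_resolver.TFConfigClusterResolver then reads to provide the cluster information. If you use a cluster management system for distributed training, check whether it already provides 'TF_CONFIG' for you, in which case you do not need to set this environment variable explicitly. (Learn more in the Setting up the 'TF_CONFIG' environment variable section of the Distributed training with TensorFlow guide.)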

# Find ports that are available for the `'chief'` (the coordinator),
# `'worker'`s, and `'ps'` (parameter servers).
import portpicker

chief_port = portpicker.pick_unused_port()
worker_ports = [portpicker.pick_unused_port() for _ in range(3)]
ps_ports = [portpicker.pick_unused_port() for _ in range(2)]

# Dump the cluster information to `'TF_CONFIG'`.
tf_config = {
    'cluster': {
        'chief': ["localhost:%s" % chief_port],
        'worker': ["localhost:%s" % port for port in worker_ports],
        'ps':  ["localhost:%s" % port for port in ps_ports],
    },
    'task': {'type': 'chief', 'index': 0}
}
os.environ['TF_CONFIG'] = json.dumps(tf_config)

# Use a cluster resolver to bridge the information to the strategy created below.
cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()

Then, create a tf.distribute.Server for each worker and each parameter server in turn:

# Workers need some inter_ops threads to work properly.
# This is only needed for this notebook to demo. Real servers
# should not need this.
worker_config = tf.compat.v1.ConfigProto()
worker_config.inter_op_parallelism_threads = 4

for i in range(3):
  tf.distribute.Server(
      cluster_resolver.cluster_spec(),
      job_name="worker",
      task_index=i,
      config=worker_config)

for i in range(2):
  tf.distribute.Server(
      cluster_resolver.cluster_spec(),
      job_name="ps",
      task_index=i)

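In real-world distributed training, instead of starting all the tf.distribute.Servers on the coordinator, you use multiple machines, and each machine designated as a "worker" or a "ps" (parameter server) runs its own tf.distribute.Server. Refer to the Clusters in the real world section of the Parameter server training tutorial for more details.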
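
As a rough sketch of that setup, the script below is what each machine might run, assuming 'TF_CONFIG' has already been set on it. The names `server_role` and `run_server_forever` are hypothetical; the TensorFlow calls used are tf.distribute.cluster_resolver.TFConfigClusterResolver and tf.distribute.Server. TensorFlow is imported inside the function only so that the small helper stays usable on its own.

```python
import json
import os

# Task types that only host a server and wait for work from the coordinator.
SERVER_JOBS = ('worker', 'ps')


def server_role(raw_tf_config):
  """Hypothetical helper: returns (job_name, task_index) for server tasks, else None."""
  task = json.loads(raw_tf_config)['task']
  if task['type'] in SERVER_JOBS:
    return task['type'], task['index']
  return None


def run_server_forever():
  """Starts a server for this machine's task and blocks (sketch, not called here)."""
  import tensorflow as tf  # Imported lazily; see the note above.

  role = server_role(os.environ['TF_CONFIG'])
  if role is None:
    return  # The coordinator goes on to create the strategy instead.

  job_name, task_index = role
  resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
  server = tf.distribute.Server(
      resolver.cluster_spec(),
      job_name=job_name,
      task_index=task_index,
      protocol=resolver.rpc_layer or 'grpc',
      start=True)
  server.join()  # Blocks until the process is terminated.
```

On a worker or parameter server machine you would call run_server_forever() as the last statement of the script; on the coordinator the function returns immediately.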

With everything ready, create the ParameterServerStrategy object:

strategy = tf.distribute.experimental.ParameterServerStrategy(cluster_resolver)

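Once you have created the strategy object, define the model, the optimizer, and other variables, and call the Keras Model.compile within the Strategy.scope API to distribute the training. (Refer to the Strategy.scope API docs for more information.)

If you prefer to customize your training, for instance by defining the forward and backward passes yourself, refer to the Training with a custom training loop section of the Parameter server training tutorial for more details.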

dataset = tf.data.Dataset.from_tensor_slices(
      (features, labels)).shuffle(10).repeat().batch(64)

eval_dataset = tf.data.Dataset.from_tensor_slices(
      (eval_features, eval_labels)).repeat().batch(1)

with strategy.scope():
  model = tf.keras.models.Sequential([tf.keras.layers.Dense(1)])
  optimizer = tf.keras.optimizers.legacy.Adagrad(learning_rate=0.05)
  model.compile(optimizer, "mse")

model.fit(dataset, epochs=5, steps_per_epoch=10)

model.evaluate(eval_dataset, steps=10, return_dict=True)

Partitioners (tf.distribute.experimental.partitioners)

ParameterServerStrategy in TensorFlow 2 supports variable partitioning and offers the same partitioners as TensorFlow 1, with less confusing names:

- tf.compat.v1.variable_axis_size_partitioner -> tf.distribute.experimental.partitioners.MaxSizePartitioner: a partitioner that keeps shards under a maximum size.
- tf.compat.v1.min_max_variable_partitioner -> tf.distribute.experimental.partitioners.MinSizePartitioner: a partitioner that allocates a minimum size per shard.
- tf.compat.v1.fixed_size_partitioner -> tf.distribute.experimental.partitioners.FixedShardsPartitioner: a partitioner that allocates a fixed number of shards.
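
As a small illustration, a partitioner can be called directly with a shape and a dtype to see how many shards it would create along each axis. The shape and the byte thresholds below are arbitrary values chosen for the example, not recommendations. In training, you would pass the partitioner to ParameterServerStrategy through its variable_partitioner argument.

```python
import tensorflow as tf

partitioners = tf.distribute.experimental.partitioners

# Arbitrary example variable: 6 rows x 1 column of float32 (24 bytes in total).
shape = tf.TensorShape([6, 1])

# Always ask for 2 shards along the first axis.
fixed = partitioners.FixedShardsPartitioner(num_shards=2)
fixed_partitions = fixed(shape, tf.float32)

# Give each shard at least 4 bytes, but never create more than 2 shards.
min_size = partitioners.MinSizePartitioner(min_shard_bytes=4, max_shards=2)
min_size_partitions = min_size(shape, tf.float32)

# Keep each shard at or below 4 bytes.
max_size = partitioners.MaxSizePartitioner(max_shard_bytes=4)
max_size_partitions = max_size(shape, tf.float32)

# Each result lists the number of shards per axis of the variable.
print(fixed_partitions, min_size_partitions, max_size_partitions)
```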

Alternatively, you can use a MultiWorkerMirroredStrategy object:

# To clean up the `TF_CONFIG` used for `ParameterServerStrategy`.
del os.environ['TF_CONFIG']
strategy = tf.distribute.MultiWorkerMirroredStrategy()

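You can replace the strategy used above with a MultiWorkerMirroredStrategy object to perform training with this strategy.

As with the tf.estimator APIs, since MultiWorkerMirroredStrategy is a multi-client strategy, there is no easy way to run distributed training in this Colab notebook. Replacing the code above with this strategy therefore ends up running things locally. The Multi-worker training with Keras Model.fit/a custom training loop tutorials demonstrate how to run multi-worker training with the 'TF_CONFIG' variable set up, using two workers on localhost in Colab. In practice, you would create multiple workers on external IP addresses/ports and use the 'TF_CONFIG' variable to specify the cluster configuration for each worker.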
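
A minimal sketch of that last step might look like the following. The host names are placeholders and `worker_tf_configs` is a hypothetical helper, not a TensorFlow API. It assumes a cluster made up only of 'worker' tasks, in which case the worker with index 0 takes on the chief's responsibilities by default.

```python
import json


def worker_tf_configs(worker_hosts):
  """Hypothetical helper: builds one 'TF_CONFIG' JSON string per worker."""
  cluster = {'worker': list(worker_hosts)}
  return [
      json.dumps({'cluster': cluster, 'task': {'type': 'worker', 'index': i}})
      for i in range(len(worker_hosts))
  ]


# Placeholder addresses; replace them with the real IP addresses/ports.
configs = worker_tf_configs(
    ['host1.example.com:12345', 'host2.example.com:12345'])

for config in configs:
  print(config)
```

Each string would be exported as 'TF_CONFIG' on the matching machine before the training script creates the MultiWorkerMirroredStrategy object.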

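Next steps

To learn more about multi-worker distributed training with tf.distribute.experimental.ParameterServerStrategy and tf.distribute.MultiWorkerMirroredStrategy in TensorFlow 2, consider the following resources:

Tutorial: Parameter server training with ParameterServerStrategy and Keras Model.fit/a custom training loop
Tutorial: Multi-worker training with MultiWorkerMirroredStrategy and Keras Model.fit
Tutorial: Multi-worker training with MultiWorkerMirroredStrategy and a custom training loop
Guide: Distributed training with TensorFlow
Guide: Optimize TensorFlow GPU performance with the TensorFlow Profiler
Guide: Use a GPU (the Using multiple GPUs section)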