NumPy API on TensorFlow

Overview

TensorFlow implements a subset of the NumPy API, available as tf.experimental.numpy. This lets you run NumPy code accelerated by TensorFlow while keeping access to all of TensorFlow's APIs.

Setup

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
import tensorflow.experimental.numpy as tnp
import timeit

print("Using TensorFlow version %s" % tf.__version__)

Enabling NumPy behavior

To use tnp as NumPy, enable NumPy behavior for TensorFlow:

tnp.experimental_enable_numpy_behavior()

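This call enables type promotion in TensorFlow. It also changes type inference, when converting literals to tensors, to follow the NumPy standard more strictly.

TensorFlow NumPy ND array

An instance of tf.experimental.numpy.ndarray, called an ND array, represents a multidimensional dense array of a given dtype placed on a certain device. It is an alias to tf.Tensor. Check out the ND array class for useful methods such as ndarray.T, ndarray.reshape, and ndarray.ravel.

First create an ND array object, and then invoke different methods.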

# Create an ND array and check out different attributes.
ones = tnp.ones([5, 3], dtype=tnp.float32)
print("Created ND array with shape = %s, rank = %s, "
      "dtype = %s on device = %s\n" % (
          ones.shape, ones.ndim, ones.dtype, ones.device))

# `ndarray` is just an alias to `tf.Tensor`.
print("Is `ones` an instance of tf.Tensor: %s\n" % isinstance(ones, tf.Tensor))

# Try commonly used member functions.
print("ndarray.T has shape %s" % str(ones.T.shape))
print("ndarray.reshape(-1) has shape %s" % ones.reshape(-1).shape)

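Type promotion

There are 4 options for type promotion in TensorFlow.

- By default, TensorFlow raises errors instead of promoting types for mixed-type operations.
- Running tf.experimental.numpy.experimental_enable_numpy_behavior() switches TensorFlow to use NumPy type promotion rules (described below).
- After TensorFlow 2.15, there are two new options (see TF NumPy Type Promotion for details):
  - tf.experimental.numpy.experimental_enable_numpy_behavior(dtype_conversion_mode="all") uses JAX type promotion rules.
  - tf.experimental.numpy.experimental_enable_numpy_behavior(dtype_conversion_mode="safe") uses JAX type promotion rules, but disallows certain unsafe promotions.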

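NumPy type promotion

TensorFlow NumPy APIs have well-defined semantics for converting literals to ND arrays and for performing type promotion on ND array inputs. See np.result_type for more details.

TensorFlow APIs leave tf.Tensor inputs unchanged and do not perform type promotion on them, while TensorFlow NumPy APIs promote all inputs according to NumPy type promotion rules. In the next example, you will perform type promotion. First, run addition on ND array inputs of different types and note the output types. TensorFlow APIs would allow none of these type promotions.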

print("Type promotion for operations")
values = [tnp.asarray(1, dtype=d) for d in
          (tnp.int32, tnp.int64, tnp.float32, tnp.float64)]
for i, v1 in enumerate(values):
  for v2 in values[i + 1:]:
    print("%s + %s => %s" %
          (v1.dtype.name, v2.dtype.name, (v1 + v2).dtype.name))
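The promotions printed by the loop above can be cross-checked against plain NumPy with np.result_type, which is the reference the text points to. The following is a minimal sketch that uses only NumPy; the dictionary name promoted is illustrative and not part of any API.

```python
import itertools
import numpy as np

dtypes = (np.int32, np.int64, np.float32, np.float64)

# Map each unordered pair of dtype names to the dtype NumPy promotes it to.
promoted = {
    (np.dtype(a).name, np.dtype(b).name): np.result_type(a, b).name
    for a, b in itertools.combinations(dtypes, 2)
}

for (a, b), result in promoted.items():
  print("%s + %s => %s" % (a, b, result))
```

Note that mixing a 32-bit integer with a 32-bit float promotes to float64 under these rules, because float32 cannot represent every int32 value exactly.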

Finally, convert literals to ND arrays using tnp.asarray and note the resulting type.

print("Type inference during array creation")
print("tnp.asarray(1).dtype == tnp.%s" % tnp.asarray(1).dtype.name)
print("tnp.asarray(1.).dtype == tnp.%s\n" % tnp.asarray(1.).dtype.name)

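When converting literals to ND arrays, NumPy prefers wide types such as tnp.int64 and tnp.float64. In contrast, tf.convert_to_tensor prefers tf.int32 and tf.float32 when converting constants to tf.Tensor. TensorFlow NumPy APIs follow the NumPy behavior for integers. For floats, the prefer_float32 argument of experimental_enable_numpy_behavior lets you control whether tf.float32 is preferred over tf.float64 (the default is False). For example: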

tnp.experimental_enable_numpy_behavior(prefer_float32=True)
print("When prefer_float32 is True:")
print("tnp.asarray(1.).dtype == tnp.%s" % tnp.asarray(1.).dtype.name)
print("tnp.add(1., 2.).dtype == tnp.%s" % tnp.add(1., 2.).dtype.name)

tnp.experimental_enable_numpy_behavior(prefer_float32=False)
print("When prefer_float32 is False:")
print("tnp.asarray(1.).dtype == tnp.%s" % tnp.asarray(1.).dtype.name)
print("tnp.add(1., 2.).dtype == tnp.%s" % tnp.add(1., 2.).dtype.name)

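Broadcasting

Similar to TensorFlow, NumPy defines rich semantics for "broadcasting" values. You can check out the NumPy broadcasting guide for more information and compare it with TensorFlow broadcasting semantics.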

x = tnp.ones([2, 3])
y = tnp.ones([3])
z = tnp.ones([1, 2, 1])
print("Broadcasting shapes %s, %s and %s gives shape %s" % (
    x.shape, y.shape, z.shape, (x + y + z).shape))
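To make the rule behind the result shape explicit, here is a small pure-Python sketch of the NumPy broadcasting rule: shapes are aligned from the trailing dimension, missing dimensions count as 1, and each aligned group of sizes must agree once the 1s are ignored. The helper broadcast_shape is a hypothetical name written for this illustration; it is not a TensorFlow or NumPy function.

```python
import itertools

def broadcast_shape(*shapes):
  """Returns the broadcast result shape, or raises ValueError."""
  result = []
  # Walk all shapes from the trailing dimension; missing dims count as 1.
  reversed_shapes = (reversed(s) for s in shapes)
  for dims in itertools.zip_longest(*reversed_shapes, fillvalue=1):
    sizes = {d for d in dims if d != 1}
    if len(sizes) > 1:
      raise ValueError("Incompatible dimensions: %s" % (dims,))
    result.append(sizes.pop() if sizes else 1)
  return tuple(reversed(result))

print(broadcast_shape((2, 3), (3,), (1, 2, 1)))  # (1, 2, 3)
```

The printed shape matches the shapes used in the example above: [2, 3], [3] and [1, 2, 1] broadcast to (1, 2, 3).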

Indexing

NumPy defines very sophisticated indexing rules. See the NumPy Indexing guide. Note the use of ND arrays as indices below.

x = tnp.arange(24).reshape(2, 3, 4)

print("Basic indexing")
print(x[1, tnp.newaxis, 1:3, ...], "\n")

print("Boolean indexing")
print(x[:, (True, False, True)], "\n")

print("Advanced indexing")
print(x[1, (0, 0, 1), tnp.asarray([0, 1, 1])])

# Mutation is currently not supported
try:
  tnp.arange(6)[1] = -1
except TypeError:
  print("Currently, TensorFlow NumPy does not support mutation.")
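Since in-place assignment is not available, one possible workaround is a functional update that returns a new tensor. The sketch below uses tf.tensor_scatter_nd_update for this; it is one option among several, not something this guide prescribes, and the variable names are illustrative.

```python
import tensorflow as tf
import tensorflow.experimental.numpy as tnp

x = tnp.arange(6)

# Each row of `indices` addresses one element of `x`.
indices = tf.constant([[1]])
# Match the dtype of `x` explicitly to avoid a dtype mismatch.
updates = tf.constant([-1], dtype=x.dtype)

# Returns a new tensor; `x` itself is left unchanged.
y = tf.tensor_scatter_nd_update(x, indices, updates)
print(x)
print(y)
```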

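Example model

Next, you can see how to create a model and run inference on it. This simple model applies a ReLU layer followed by a linear projection. Later sections show how to compute gradients for this model using TensorFlow's GradientTape.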

class Model:
  """Model with a dense and a linear layer."""

  def __init__(self):
    self.weights = None

  def predict(self, inputs):
    if self.weights is None:
      size = inputs.shape[1]
      # Note that type `tnp.float32` is used for performance.
      stddev = tnp.sqrt(size).astype(tnp.float32)
      w1 = tnp.random.randn(size, 64).astype(tnp.float32) / stddev
      bias = tnp.random.randn(64).astype(tnp.float32)
      w2 = tnp.random.randn(64, 2).astype(tnp.float32) / 8
      self.weights = (w1, bias, w2)
    else:
      w1, bias, w2 = self.weights
    y = tnp.matmul(inputs, w1) + bias
    y = tnp.maximum(y, 0)  # Relu
    return tnp.matmul(y, w2)  # Linear projection

model = Model()
# Create input data and compute predictions.
print(model.predict(tnp.ones([2, 32], dtype=tnp.float32)))

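TensorFlow NumPy and NumPy

TensorFlow NumPy implements a subset of the full NumPy spec. While more symbols will be added over time, some systematic features will not be supported in the near future. These include NumPy C API support, Swig integration, Fortran storage order, views and stride_tricks, and some dtypes (such as np.recarray and np.object_). For more details, see the TensorFlow NumPy API documentation.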

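NumPy interoperability

TensorFlow ND arrays can interoperate with NumPy functions. These objects implement the __array__ interface. NumPy uses this interface to convert function arguments to np.ndarray values before processing them.

Similarly, TensorFlow NumPy functions can accept inputs of different types, including np.ndarray. These inputs are converted to an ND array by calling tnp.asarray on them.

Converting an ND array to or from np.ndarray may trigger actual data copies. See the section on buffer copies for more details.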

# ND array passed into NumPy function.
np_sum = np.sum(tnp.ones([2, 3]))
print("sum = %s. Class: %s" % (float(np_sum), np_sum.__class__))

# `np.ndarray` passed into TensorFlow NumPy function.
tnp_sum = tnp.sum(np.ones([2, 3]))
print("sum = %s. Class: %s" % (float(tnp_sum), tnp_sum.__class__))

# It is easy to plot ND arrays, given the __array__ interface.
labels = 15 + 2 * tnp.random.randn(1, 1000)
_ = plt.hist(labels)
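To illustrate the __array__ mechanism itself, independent of TensorFlow, here is a minimal NumPy-only sketch. The class Celsius is hypothetical and exists only for this example; any object that defines __array__ is converted by NumPy in the same way.

```python
import numpy as np

class Celsius:
  """Hypothetical container that exposes its data through __array__."""

  def __init__(self, values):
    self._values = list(values)

  def __array__(self, dtype=None, copy=None):
    # NumPy calls this hook when it needs an np.ndarray for the object.
    return np.array(self._values, dtype=dtype)

readings = Celsius([1.0, 2.0, 3.0])
as_array = np.asarray(readings)  # Conversion goes through __array__.
total = np.sum(readings)         # So does any NumPy function call.
print(type(as_array), float(total))
```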

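Buffer copies

Intermixing TensorFlow NumPy with NumPy code may trigger data copies. This is because TensorFlow NumPy has stricter memory alignment requirements than NumPy.

When an np.ndarray is passed to TensorFlow NumPy, it checks the alignment requirements and triggers a copy if needed. When an ND array CPU buffer is passed to NumPy, the buffer generally satisfies the alignment requirements and NumPy does not need to create a copy.

ND arrays can refer to buffers placed on devices other than local CPU memory. In such cases, invoking a NumPy function triggers copies across the network or device as needed.

Given this, intermix NumPy API calls with caution and watch out for the overhead of copying data. Interleaving TensorFlow NumPy calls with TensorFlow calls is generally safe and avoids copying data. See the section on TensorFlow interoperability for more details.

Operator precedence

TensorFlow NumPy defines an __array_priority__ higher than NumPy's. This means that for operators involving both an ND array and an np.ndarray, the former takes precedence: the np.ndarray input is converted to an ND array and the TensorFlow NumPy implementation of the operator is invoked.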

x = tnp.ones([2]) + np.ones([2])
print("x = %s\nclass = %s" % (x, x.__class__))
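The deferral mechanism can be seen with plain NumPy as well. In the sketch below, a hypothetical class HighPriority sets __array_priority__ above the np.ndarray default of 0, so ndarray.__add__ returns NotImplemented and Python falls back to the reflected operator of the other operand. The class and the value 100 are assumptions made for illustration.

```python
import numpy as np

class HighPriority:
  """Hypothetical type whose operators should win over np.ndarray's."""
  __array_priority__ = 100  # Higher than np.ndarray's default of 0.

  def __init__(self, values):
    self.values = np.asarray(values)

  def __add__(self, other):
    # This sketch only handles np.ndarray and scalar operands.
    return HighPriority(self.values + np.asarray(other))

  # Called for `np.ndarray + HighPriority` once ndarray.__add__ defers.
  __radd__ = __add__

result = np.ones([2]) + HighPriority([1.0, 2.0])
print(type(result).__name__, result.values)
```

The result is a HighPriority instance rather than an np.ndarray, which mirrors how an ND array wins over np.ndarray in the example above.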

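TF NumPy and TensorFlow

TensorFlow NumPy is built on top of TensorFlow and therefore interoperates seamlessly with TensorFlow.

`tf.Tensor` and ND array

ND array is an alias to tf.Tensor, so the two can be intermixed without triggering actual data copies.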

x = tf.constant([1, 2])
print(x)

# `asarray` and `convert_to_tensor` here are no-ops.
tnp_x = tnp.asarray(x)
print(tnp_x)
print(tf.convert_to_tensor(tnp_x))

# Note that tf.Tensor.numpy() will continue to return `np.ndarray`.
print(x.numpy(), x.numpy().__class__)

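TensorFlow interoperability

An ND array can be passed to TensorFlow APIs, since ND array is just an alias to tf.Tensor. As mentioned earlier, this interoperation does not copy data, even for data placed on accelerators or remote devices.

Conversely, tf.Tensor objects can be passed to tf.experimental.numpy APIs without performing data copies.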

# ND array passed into TensorFlow function.
tf_sum = tf.reduce_sum(tnp.ones([2, 3], tnp.float32))
print("Output = %s" % tf_sum)

# `tf.Tensor` passed into TensorFlow NumPy function.
tnp_sum = tnp.sum(tf.ones([2, 3]))
print("Output = %s" % tnp_sum)

Gradients and Jacobians: tf.GradientTape

TensorFlow's GradientTape can be used for backpropagation through TensorFlow and TensorFlow NumPy code.

Use the model created in the Example model section, and compute gradients and Jacobians.

def create_batch(batch_size=32):
  """Creates a batch of input and labels."""
  return (tnp.random.randn(batch_size, 32).astype(tnp.float32),
          tnp.random.randn(batch_size, 2).astype(tnp.float32))

def compute_gradients(model, inputs, labels):
  """Computes gradients of squared loss between model prediction and labels."""
  with tf.GradientTape() as tape:
    assert model.weights is not None
    # Note that `model.weights` need to be explicitly watched since they
    # are not tf.Variables.
    tape.watch(model.weights)
    # Compute prediction and loss
    prediction = model.predict(inputs)
    loss = tnp.sum(tnp.square(prediction - labels))
  # This call computes the gradient through the computation above.
  return tape.gradient(loss, model.weights)

inputs, labels = create_batch()
gradients = compute_gradients(model, inputs, labels)

# Inspect the shapes of returned gradients to verify they match the
# parameter shapes.
print("Parameter shapes:", [w.shape for w in model.weights])
print("Gradient shapes:", [g.shape for g in gradients])
# Verify that gradients are of type ND array.
assert isinstance(gradients[0], tnp.ndarray)

# Computes a batch of jacobians. Each row is the jacobian of an element in the
# batch of outputs w.r.t. the corresponding input batch element.
def prediction_batch_jacobian(inputs):
  with tf.GradientTape() as tape:
    tape.watch(inputs)
    prediction = model.predict(inputs)
  return prediction, tape.batch_jacobian(prediction, inputs)

inp_batch = tnp.ones([16, 32], tnp.float32)
output, batch_jacobian = prediction_batch_jacobian(inp_batch)
# Note how the batch jacobian shape relates to the input and output shapes.
print("Output shape: %s, input shape: %s" % (output.shape, inp_batch.shape))
print("Batch jacobian shape:", batch_jacobian.shape)
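For a case small enough to check by hand, the sketch below differentiates sum(x^2) written with TensorFlow NumPy ops; the analytic gradient is 2x. It is a standalone illustration and does not use the model above.

```python
import tensorflow as tf
import tensorflow.experimental.numpy as tnp

x = tnp.asarray([1.0, 2.0, 3.0], dtype=tnp.float32)

with tf.GradientTape() as tape:
  # `x` is a plain tensor, not a tf.Variable, so watch it explicitly.
  tape.watch(x)
  loss = tnp.sum(tnp.square(x))

# d/dx sum(x^2) = 2x
grad = tape.gradient(loss, x)
print(grad)
```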

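Trace compilation: tf.function

TensorFlow's tf.function works by "trace compiling" the code and then optimizing these traces for much faster performance. See the Introduction to Graphs and Functions.

tf.function can also be used to optimize TensorFlow NumPy code. Here is a simple example that demonstrates the speedups. Note that the body of the tf.function code includes calls to TensorFlow NumPy APIs.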

inputs, labels = create_batch(512)
print("Eager performance")
compute_gradients(model, inputs, labels)
# timeit returns the total seconds for 10 runs; * 100 converts that to
# milliseconds per run.
print(timeit.timeit(lambda: compute_gradients(model, inputs, labels),
                    number=10) * 100, "ms")

print("\ntf.function compiled performance")
compiled_compute_gradients = tf.function(compute_gradients)
compiled_compute_gradients(model, inputs, labels)  # warmup
print(timeit.timeit(lambda: compiled_compute_gradients(model, inputs, labels),
                    number=10) * 100, "ms")
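One way to observe tracing is to put a Python side effect inside the function: it runs while the function is being traced, but not when the compiled trace is reused. The sketch below is illustrative; the names trace_log and sum_of_squares are made up for this example.

```python
import tensorflow as tf
import tensorflow.experimental.numpy as tnp

trace_log = []

@tf.function
def sum_of_squares(x):
  # Python side effect: executed only while the function is being traced.
  trace_log.append("traced")
  return tnp.sum(tnp.square(x))

# Both inputs share shape and dtype, so the second call reuses the trace.
a = sum_of_squares(tnp.ones([4], dtype=tnp.float32))
b = sum_of_squares(tnp.zeros([4], dtype=tnp.float32))
print("number of traces:", len(trace_log))
```

This is also why the benchmark above calls the compiled function once as a warmup: the first call pays the tracing cost.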

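Vectorization: tf.vectorized_map

TensorFlow has built-in support for vectorizing parallel loops, which allows speedups of one to two orders of magnitude. These speedups are accessible through the tf.vectorized_map API and apply to TensorFlow NumPy code as well.

It is sometimes useful to compute the gradient of each output in a batch with respect to the corresponding input batch element. Such computation can be done efficiently using tf.vectorized_map, as shown below.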

@tf.function
def vectorized_per_example_gradients(inputs, labels):
  def single_example_gradient(arg):
    inp, label = arg
    return compute_gradients(model,
                             tnp.expand_dims(inp, 0),
                             tnp.expand_dims(label, 0))
  # Note that a call to `tf.vectorized_map` semantically maps
  # `single_example_gradient` over each row of `inputs` and `labels`.
  # The interface is similar to `tf.map_fn`.
  # The underlying machinery vectorizes away this map loop which gives
  # nice speedups.
  return tf.vectorized_map(single_example_gradient, (inputs, labels))

batch_size = 128
inputs, labels = create_batch(batch_size)

per_example_gradients = vectorized_per_example_gradients(inputs, labels)
for w, p in zip(model.weights, per_example_gradients):
  print("Weight shape: %s, batch size: %s, per example gradient shape: %s " % (
      w.shape, batch_size, p.shape))

# Benchmark the vectorized computation above and compare with
# unvectorized sequential computation using `tf.map_fn`.
@tf.function
def unvectorized_per_example_gradients(inputs, labels):
  def single_example_gradient(arg):
    inp, label = arg
    return compute_gradients(model,
                             tnp.expand_dims(inp, 0),
                             tnp.expand_dims(label, 0))

  return tf.map_fn(single_example_gradient, (inputs, labels),
                   fn_output_signature=(tf.float32, tf.float32, tf.float32))

print("Running vectorized computation")
print(timeit.timeit(lambda: vectorized_per_example_gradients(inputs, labels),
                    number=10) * 100, "ms")

print("\nRunning unvectorized computation")
per_example_gradients = unvectorized_per_example_gradients(inputs, labels)
print(timeit.timeit(lambda: unvectorized_per_example_gradients(inputs, labels),
                    number=10) * 100, "ms")
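On a smaller scale, the semantics of the two mapping APIs can be compared directly. The sketch below maps a hypothetical per-row function, row_norm_squared, over a 2x3 batch with both tf.vectorized_map and tf.map_fn; the two are expected to agree, and only the execution strategy differs.

```python
import tensorflow as tf
import tensorflow.experimental.numpy as tnp

def row_norm_squared(row):
  # `row` is one slice along the leading axis of the mapped input.
  return tnp.sum(tnp.square(row))

# Rows are [0, 1, 2] and [3, 4, 5].
batch = tnp.reshape(tnp.arange(6, dtype=tnp.float32), (2, 3))

vectorized = tf.vectorized_map(row_norm_squared, batch)
sequential = tf.map_fn(row_norm_squared, batch)
print(vectorized)
print(sequential)
```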

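Device placement

TensorFlow NumPy can place operations on CPUs, GPUs, TPUs and remote devices. It uses standard TensorFlow mechanisms for device placement. The simple example below shows how to list all devices and then place some computation on a particular device.

TensorFlow also has APIs for replicating computation across devices and performing collective reductions, which are not covered here.

List devices

tf.config.list_logical_devices and tf.config.list_physical_devices can be used to find which devices to use.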

print("All logical devices:", tf.config.list_logical_devices())
print("All physical devices:", tf.config.list_physical_devices())

# Try to get the GPU device. If unavailable, fallback to CPU.
try:
  device = tf.config.list_logical_devices(device_type="GPU")[0]
except IndexError:
  device = "/device:CPU:0"

Placing operations: `tf.device`

Operations can be placed on a device by calling them in a tf.device scope.

print("Using device: %s" % str(device))
# Run operations in the `tf.device` scope.
# If a GPU is available, these operations execute on the GPU and outputs are
# placed on the GPU memory.
with tf.device(device):
  prediction = model.predict(create_batch(5)[0])

print("prediction is placed on %s" % prediction.device)

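Copying ND arrays across devices: `tnp.copy`

A call to tnp.copy, placed in a certain device scope, copies the data to that device, unless the data is already on that device.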

with tf.device("/device:CPU:0"):
  prediction_cpu = tnp.copy(prediction)
print(prediction.device)
print(prediction_cpu.device)

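Performance comparisons

TensorFlow NumPy uses highly optimized TensorFlow kernels that can be dispatched on CPUs, GPUs and TPUs. TensorFlow also performs many compiler optimizations, such as operation fusion, which translate to performance and memory improvements. See TensorFlow graph optimization with Grappler to learn more.

However, TensorFlow has higher overheads for dispatching operations than NumPy. For workloads composed of small operations (less than about 10 microseconds), these overheads can dominate the runtime and NumPy may provide better performance. In other cases, TensorFlow should generally provide better performance.

Run the benchmark below to compare NumPy and TensorFlow NumPy performance for different input sizes.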

def benchmark(f, inputs, number=30, force_gpu_sync=False):
  """Utility to benchmark `f` on each value in `inputs`."""
  times = []
  for inp in inputs:
    def _g():
      if force_gpu_sync:
        one = tnp.asarray(1)
      f(inp)
      if force_gpu_sync:
        with tf.device("CPU:0"):
          tnp.copy(one)  # Force a sync for GPU case

    _g()  # warmup
    t = timeit.timeit(_g, number=number)
    times.append(t * 1000. / number)
  return times


def plot(np_times, tnp_times, compiled_tnp_times, has_gpu, tnp_times_gpu):
  """Plot the different runtimes."""
  plt.xlabel("size")
  plt.ylabel("time (ms)")
  plt.title("Sigmoid benchmark: TF NumPy vs NumPy")
  plt.plot(sizes, np_times, label="NumPy")
  plt.plot(sizes, tnp_times, label="TF NumPy (CPU)")
  plt.plot(sizes, compiled_tnp_times, label="Compiled TF NumPy (CPU)")
  if has_gpu:
    plt.plot(sizes, tnp_times_gpu, label="TF NumPy (GPU)")
  plt.legend()

# Define a simple implementation of `sigmoid`, and benchmark it using
# NumPy and TensorFlow NumPy for different input sizes.

def np_sigmoid(y):
  return 1. / (1. + np.exp(-y))

def tnp_sigmoid(y):
  return 1. / (1. + tnp.exp(-y))

@tf.function
def compiled_tnp_sigmoid(y):
  return tnp_sigmoid(y)

sizes = (2 ** 0, 2 ** 5, 2 ** 10, 2 ** 15, 2 ** 20)
np_inputs = [np.random.randn(size).astype(np.float32) for size in sizes]
np_times = benchmark(np_sigmoid, np_inputs)

with tf.device("/device:CPU:0"):
  tnp_inputs = [tnp.random.randn(size).astype(np.float32) for size in sizes]
  tnp_times = benchmark(tnp_sigmoid, tnp_inputs)
  compiled_tnp_times = benchmark(compiled_tnp_sigmoid, tnp_inputs)

has_gpu = len(tf.config.list_logical_devices("GPU"))
if has_gpu:
  with tf.device("/device:GPU:0"):
    tnp_inputs = [tnp.random.randn(size).astype(np.float32) for size in sizes]
    tnp_times_gpu = benchmark(compiled_tnp_sigmoid, tnp_inputs, 100, True)
else:
  tnp_times_gpu = None
plot(np_times, tnp_times, compiled_tnp_times, has_gpu, tnp_times_gpu)