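Advanced automatic differentiation

The Introduction to gradients and automatic differentiation guide includes everything required to calculate gradients in TensorFlow. This guide focuses on deeper, less common features of the tf.GradientTape API.

Setup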

import tensorflow as tf

import matplotlib as mpl
import matplotlib.pyplot as plt

mpl.rcParams['figure.figsize'] = (8, 6)

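Controlling gradient recording

In the automatic differentiation guide you saw how to control which variables and tensors are watched by the tape while building the gradient calculation.

The tape also has methods to manipulate the recording.

Stop recording

If you wish to stop recording gradients, you can use tf.GradientTape.stop_recording to temporarily suspend recording.

This can reduce overhead when you do not need to differentiate a complicated operation in the middle of your model, such as the calculation of a metric or an intermediate result: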

x = tf.Variable(2.0)
y = tf.Variable(3.0)

with tf.GradientTape() as t:
  x_sq = x * x
  with t.stop_recording():
    y_sq = y * y
  z = x_sq + y_sq

grad = t.gradient(z, {'x': x, 'y': y})

print('dz/dx:', grad['x'])  # 2*x => 4
print('dz/dy:', grad['y'])
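
The metric use case mentioned above can be sketched as follows. This is a minimal illustration, not part of the original guide: the data, the weight w, and the mean absolute error metric are hypothetical, and persistent=True is used only so that gradient can be called twice.

```python
import tensorflow as tf

# Hypothetical toy data and weights, chosen only for illustration.
x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
y_true = tf.constant([[4.0], [12.0]])
w = tf.Variable([[1.0], [2.0]])

# persistent=True only so that gradient() can be called twice below.
with tf.GradientTape(persistent=True) as tape:
  pred = x @ w
  loss = tf.reduce_mean((pred - y_true) ** 2)
  with tape.stop_recording():
    # Monitoring metric: computed normally, but not recorded on the tape.
    mae = tf.reduce_mean(tf.abs(pred - y_true))

loss_grad = tape.gradient(loss, w)  # a tensor with the shape of w
mae_grad = tape.gradient(mae, w)    # None: mae was never recorded

print('loss_grad:', loss_grad.numpy())
print('mae_grad:', mae_grad)
```

The metric still has a value in the forward pass; it simply has no path back to w on the tape.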

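Reset/start recording from scratch

If you wish to start over entirely, use tf.GradientTape.reset. Simply exiting the gradient tape block and starting a new one is usually easier to read, but you can use the reset method when exiting the tape block is difficult or impossible.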

x = tf.Variable(2.0)
y = tf.Variable(3.0)
reset = True

with tf.GradientTape() as t:
  y_sq = y * y
  if reset:
    # Throw out all the tape recorded so far.
    t.reset()
  z = x * x + y_sq

grad = t.gradient(z, {'x': x, 'y': y})

print('dz/dx:', grad['x'])  # 2*x => 4
print('dz/dy:', grad['y'])

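Stop gradient flow with precision

In contrast to the global tape controls above, the tf.stop_gradient function is much more precise. It can be used to stop gradients from flowing along a particular path, without needing access to the tape itself: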

x = tf.Variable(2.0)
y = tf.Variable(3.0)

with tf.GradientTape() as t:
  y_sq = y**2
  z = x**2 + tf.stop_gradient(y_sq)

grad = t.gradient(z, {'x': x, 'y': y})

print('dz/dx:', grad['x'])  # 2*x => 4
print('dz/dy:', grad['y'])
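
Because tf.stop_gradient acts on a single path, it can also be used to change a value in the forward pass while leaving the gradient untouched. The sketch below is an illustration under assumed inputs (the values of x and the per-element weights are hypothetical): it rounds x in the forward pass but backpropagates as if rounding were the identity, which is useful because rounding is piecewise constant and its derivative is zero almost everywhere.

```python
import tensorflow as tf

# Hypothetical inputs, chosen so the arithmetic is exact in float32.
x = tf.Variable([0.25, 1.75, -1.25])
weights = tf.constant([1.0, 2.0, 3.0])

with tf.GradientTape() as tape:
  # Forward value equals tf.round(x); the rounding residual is treated
  # as a constant, so the backward pass sees the identity function.
  x_rounded = x + tf.stop_gradient(tf.round(x) - x)
  loss = tf.reduce_sum(x_rounded * weights)

grad = tape.gradient(loss, x)

print('x_rounded:', x_rounded.numpy())
print('dloss/dx:', grad.numpy())  # equals `weights`
```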

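Custom gradients

In some cases, you may want to control exactly how gradients are calculated rather than using the default. These situations include:

There is no defined gradient for a new op you are writing.
The default calculations are numerically unstable.
You wish to cache an expensive computation from the forward pass.
You want to modify a value (for example, using tf.clip_by_value or tf.math.round) without modifying the gradient.

For the first case, to write a new op you can use tf.RegisterGradient to set up your own (refer to the API docs for details). Note that the gradient registry is global, so change it with caution.

For the latter three cases, you can use tf.custom_gradient.

Here is an example that applies tf.clip_by_norm to the intermediate gradient: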

# Establish an identity operation, but clip during the gradient pass.
@tf.custom_gradient
def clip_gradients(y):
  def backward(dy):
    return tf.clip_by_norm(dy, 0.5)
  return y, backward

v = tf.Variable(2.0)
with tf.GradientTape() as t:
  output = clip_gradients(v * v)
print(t.gradient(output, v))  # calls "backward": upstream 1.0 is clipped to 0.5, then 0.5 * 2*v => 2.0
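
The numerical-stability case from the list above can be sketched in the same way. The example below is an illustration rather than part of the original guide: the function names log1pexp_naive and log1pexp are chosen here, and the input 100.0 is picked because exp(100) overflows in float32. The custom gradient replaces the default chain-rule product with an analytically simplified expression.

```python
import tensorflow as tf

def log1pexp_naive(x):
  # Default autodiff multiplies 1 / (1 + e) by e, which breaks down
  # once e = exp(x) overflows.
  return tf.math.log(1 + tf.exp(x))

@tf.custom_gradient
def log1pexp(x):
  e = tf.exp(x)
  def grad(upstream):
    # Analytically simplified derivative: e / (1 + e) == 1 - 1 / (1 + e).
    return upstream * (1 - 1 / (1 + e))
  return tf.math.log(1 + e), grad

x = tf.constant(100.0)
# persistent=True only so that gradient() can be called twice below.
with tf.GradientTape(persistent=True) as tape:
  tape.watch(x)
  y_naive = log1pexp_naive(x)
  y_stable = log1pexp(x)

naive_grad = tape.gradient(y_naive, x)
stable_grad = tape.gradient(y_stable, x)

print('naive gradient:', naive_grad.numpy())    # not a usable value
print('stable gradient:', stable_grad.numpy())  # 1.0
```

Note that only the gradient is repaired here: the forward value still overflows in float32 for this input.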

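Refer to the tf.custom_gradient decorator API docs for more details.

Custom gradients in SavedModel

Custom gradients can be saved to SavedModel by using the option tf.saved_model.SaveOptions(experimental_custom_gradients=True).

To be saved into the SavedModel, the gradient function must be traceable (to learn more, check out the Better performance with tf.function guide).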

class MyModule(tf.Module):

  @tf.function(input_signature=[tf.TensorSpec(None)])
  def call_custom_grad(self, x):
    return clip_gradients(x)

model = MyModule()

tf.saved_model.save(
    model,
    'saved_model',
    options=tf.saved_model.SaveOptions(experimental_custom_gradients=True))

# The loaded gradients will be the same as the above example.
v = tf.Variable(2.0)
loaded = tf.saved_model.load('saved_model')
with tf.GradientTape() as t:
  output = loaded.call_custom_grad(v * v)
print(t.gradient(output, v))

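A note about the above example: if you replace the save options above with tf.saved_model.SaveOptions(experimental_custom_gradients=False), the gradient will still produce the same result on loading. The reason is that the gradient registry still contains the custom gradient used in the function call_custom_grad. However, if you restart the runtime after saving without custom gradients, running the loaded model under a tf.GradientTape will throw the error: LookupError: No gradient defined for operation 'IdentityN' (op type: IdentityN).

Multiple tapes

Multiple tapes interact seamlessly.

For example, here each tape watches a different set of tensors: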

x0 = tf.constant(0.0)
x1 = tf.constant(0.0)

with tf.GradientTape() as tape0, tf.GradientTape() as tape1:
  tape0.watch(x0)
  tape1.watch(x1)

  y0 = tf.math.sin(x0)
  y1 = tf.nn.sigmoid(x1)

  y = y0 + y1

  ys = tf.reduce_sum(y)

tape0.gradient(ys, x0).numpy()   # cos(x) => 1.0

tape1.gradient(ys, x1).numpy()   # sigmoid(x1)*(1-sigmoid(x1)) => 0.25

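Higher-order gradients

Operations inside of the tf.GradientTape context manager are recorded for automatic differentiation. If gradients are computed in that context, then the gradient computation is recorded as well. As a result, the exact same API works for higher-order gradients.

For example: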

x = tf.Variable(1.0)  # Create a TensorFlow variable initialized to 1.0

with tf.GradientTape() as t2:
  with tf.GradientTape() as t1:
    y = x * x * x

  # Compute the gradient inside the outer `t2` context manager
  # which means the gradient computation is differentiable as well.
  dy_dx = t1.gradient(y, x)
d2y_dx2 = t2.gradient(dy_dx, x)

print('dy_dx:', dy_dx.numpy())  # 3 * x**2 => 3.0
print('d2y_dx2:', d2y_dx2.numpy())  # 6 * x => 6.0

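While that does give you the second derivative of a scalar function, this pattern does not generalize to produce a Hessian matrix, since tf.GradientTape.gradient only computes the gradient of a scalar. To construct a Hessian matrix, go to the Hessian example under the Jacobian section.

"Nested calls to tf.GradientTape.gradient" is a good pattern when you are calculating a scalar from a gradient, and the resulting scalar then acts as a source for a second gradient calculation, as in the following example.

Example: Input gradient regularization

Many models are susceptible to "adversarial examples". This collection of techniques modifies the model's input to confuse the model's output. The simplest implementation, such as the Adversarial example using the Fast Gradient Signed Method attack, takes a single step along the gradient of the output with respect to the input: the "input gradient".

One technique to increase robustness to adversarial examples is input gradient regularization (Finlay & Oberman, 2019), which attempts to minimize the magnitude of the input gradient. If the input gradient is small, then the change in the output should be small too.

Below is a naive implementation of input gradient regularization. The implementation is:

1. Calculate the gradient of the output with respect to the input using an inner tape.
2. Calculate the magnitude of that input gradient.
3. Calculate the gradient of that magnitude with respect to the model.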

x = tf.random.normal([7, 5])

layer = tf.keras.layers.Dense(10, activation=tf.nn.relu)

with tf.GradientTape() as t2:
  # The inner tape only takes the gradient with respect to the input,
  # not the variables.
  with tf.GradientTape(watch_accessed_variables=False) as t1:
    t1.watch(x)
    y = layer(x)
    out = tf.reduce_sum(y**2)
  # 1. Calculate the input gradient.
  g1 = t1.gradient(out, x)
  # 2. Calculate the magnitude of the input gradient.
  g1_mag = tf.norm(g1)

# 3. Calculate the gradient of the magnitude with respect to the model.
dg1_mag = t2.gradient(g1_mag, layer.trainable_variables)

[var.shape for var in dg1_mag]
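
The three steps above only produce the gradient of the penalty. A full training step would typically add the penalty to a task loss and update the weights. The following is a minimal sketch under stated assumptions, not the method from the cited paper: the regression target, penalty_weight, learning_rate, the hand-written dense layer (w, b), and the plain gradient descent update are all hypothetical, and the squared L2 norm is used for the penalty instead of tf.norm.

```python
import tensorflow as tf

tf.random.set_seed(0)

# Hypothetical data, weights, and hyperparameters for illustration only.
x = tf.random.normal([7, 5])
target = tf.random.normal([7, 10])
w = tf.Variable(tf.random.normal([5, 10]) * 0.1)
b = tf.Variable(tf.zeros([10]))
penalty_weight = 0.1
learning_rate = 0.01

def model(inputs):
  # A hand-written dense layer with ReLU activation.
  return tf.nn.relu(inputs @ w + b)

with tf.GradientTape() as outer:
  # The inner tape only differentiates with respect to the input.
  with tf.GradientTape(watch_accessed_variables=False) as inner:
    inner.watch(x)
    task_loss = tf.reduce_mean((model(x) - target) ** 2)
  # Computed inside `outer`, so this gradient is itself differentiable.
  input_grad = inner.gradient(task_loss, x)
  penalty = tf.reduce_sum(input_grad ** 2)
  total_loss = task_loss + penalty_weight * penalty

# Gradient of the combined objective with respect to the weights.
grads = outer.gradient(total_loss, [w, b])

# One plain gradient descent update.
for var, grad in zip([w, b], grads):
  var.assign_sub(learning_rate * grad)

print('total loss:', total_loss.numpy())
print([g.shape for g in grads])
```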

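Jacobians

All the previous examples took the gradients of a scalar target with respect to some source tensor(s).

The Jacobian matrix represents the gradients of a vector-valued function. Each row contains the gradient of one of the vector's elements.

The tf.GradientTape.jacobian method allows you to efficiently calculate a Jacobian matrix.

Note that:

Like gradient: the sources argument can be a tensor or a container of tensors.
Unlike gradient: the target tensor must be a single tensor.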
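
The two notes above can be illustrated with a short sketch. The tensors x, w, and b below are hypothetical values chosen for illustration: a single target tensor is differentiated with respect to a list of two sources, and one Jacobian comes back per source.

```python
import tensorflow as tf

# Hypothetical values, chosen only for illustration.
x = tf.constant([[1.0, 2.0]])      # shape (1, 2)
w = tf.Variable(tf.ones([2, 3]))   # shape (2, 3)
b = tf.Variable(tf.zeros([3]))     # shape (3,)

with tf.GradientTape() as tape:
  y = x @ w + b                    # single target tensor, shape (1, 3)

# A list of sources yields a list of Jacobians, one per source.
jac_w, jac_b = tape.jacobian(y, [w, b])

print(jac_w.shape)  # target shape followed by the shape of w
print(jac_b.shape)  # target shape followed by the shape of b
```

Each result has the target's shape followed by the corresponding source's shape.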

Scalar source

As a first example, here is the Jacobian of a vector target with respect to a scalar source.

x = tf.linspace(-10.0, 10.0, 200+1)
delta = tf.Variable(0.0)

with tf.GradientTape() as tape:
  y = tf.nn.sigmoid(x+delta)

dy_dx = tape.jacobian(y, delta)

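When you take the Jacobian with respect to a scalar, the result has the shape of the target and gives the gradient of each element with respect to the source: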

print(y.shape)
print(dy_dx.shape)

plt.plot(x.numpy(), y, label='y')
plt.plot(x.numpy(), dy_dx, label='dy/dx')
plt.legend()
_ = plt.xlabel('x')

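Tensor source

Whether the source is a scalar or a tensor, tf.GradientTape.jacobian efficiently calculates the gradient of each element of the target with respect to each element of the source(s).

For example, the output of this layer has a shape of (7, 10):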

x = tf.random.normal([7, 5])
layer = tf.keras.layers.Dense(10, activation=tf.nn.relu)

with tf.GradientTape(persistent=True) as tape:
  y = layer(x)

y.shape

And the layer's kernel has a shape of (5, 10):

layer.kernel.shape

The shape of the Jacobian of the output with respect to the kernel is those two shapes concatenated together, (7, 10, 5, 10):

j = tape.jacobian(y, layer.kernel)
j.shape

If you sum over the target's dimensions, you are left with the gradient of the sum that tf.GradientTape.gradient would have calculated:

g = tape.gradient(y, layer.kernel)
print('g.shape:', g.shape)

j_sum = tf.reduce_sum(j, axis=[0, 1])
delta = tf.reduce_max(abs(g - j_sum)).numpy()
assert delta < 1e-3
print('delta:', delta)

Example: Hessian

While tf.GradientTape does not give an explicit method for constructing a Hessian matrix, it is possible to build one using the tf.GradientTape.jacobian method.

x = tf.random.normal([7, 5])
layer1 = tf.keras.layers.Dense(8, activation=tf.nn.relu)
layer2 = tf.keras.layers.Dense(6, activation=tf.nn.relu)

with tf.GradientTape() as t2:
  with tf.GradientTape() as t1:
    x = layer1(x)
    x = layer2(x)
    loss = tf.reduce_mean(x**2)

  g = t1.gradient(loss, layer1.kernel)

h = t2.jacobian(g, layer1.kernel)

print(f'layer1.kernel.shape: {layer1.kernel.shape}')
print(f'h.shape: {h.shape}')

To use this Hessian for a Newton's method step, you would first flatten out its axes into a matrix, and flatten out the gradient into a vector:

n_params = tf.reduce_prod(layer1.kernel.shape)

g_vec = tf.reshape(g, [n_params, 1])
h_mat = tf.reshape(h, [n_params, n_params])

The Hessian matrix should be symmetric:

def imshow_zero_center(image, **kwargs):
  lim = tf.reduce_max(abs(image))
  plt.imshow(image, vmin=-lim, vmax=lim, cmap='seismic', **kwargs)
  plt.colorbar()

imshow_zero_center(h_mat)

The Newton's method update step is shown below:

eps = 1e-3
eye_eps = tf.eye(h_mat.shape[0])*eps

# X(k+1) = X(k) - (∇²f(X(k)))^-1 @ ∇f(X(k))
# h_mat = ∇²f(X(k))
# g_vec = ∇f(X(k))
update = tf.linalg.solve(h_mat + eye_eps, g_vec)

# Reshape the update and apply it to the variable.
_ = layer1.kernel.assign_sub(tf.reshape(update, layer1.kernel.shape))
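
As a sanity check of the update rule, it can help to try it on a problem with a known answer. The sketch below uses a hypothetical quadratic f(x) = xᵀAx, whose Hessian is A + Aᵀ and whose minimum is at the origin; the matrix a and the starting point are assumptions made for illustration, and no damping term is added because this Hessian is positive definite. For a quadratic, a single Newton step should land on the minimum.

```python
import tensorflow as tf

# Hypothetical quadratic f(x) = x^T A x; A is deliberately not symmetric.
a = tf.constant([[1.0, 2.0],
                 [0.0, 3.0]])
x = tf.Variable([1.0, -2.0])

with tf.GradientTape() as t2:
  with tf.GradientTape() as t1:
    f = tf.reduce_sum(x * tf.linalg.matvec(a, x))
  g = t1.gradient(f, x)  # equals (A + A^T) x
h = t2.jacobian(g, x)    # equals A + A^T

# The Hessian is symmetric even though A is not.
asymmetry = tf.reduce_max(tf.abs(h - tf.transpose(h)))

# One undamped Newton step: x <- x - H^{-1} g.
update = tf.linalg.solve(h, g[:, tf.newaxis])
x.assign_sub(tf.reshape(update, x.shape))

print('hessian:\n', h.numpy())
print('asymmetry:', asymmetry.numpy())
print('x after one step:', x.numpy())  # approximately the origin
```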

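While this is relatively simple for a single tf.Variable, applying this to a non-trivial model would require careful concatenation and slicing to produce a full Hessian across multiple variables.

Batch Jacobian

In some cases, you want to take the Jacobian of each of a stack of targets with respect to a stack of sources, where the Jacobians for each target-source pair are independent.

For example, here the input x is shaped (batch, ins) and the output y is shaped (batch, outs):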

x = tf.random.normal([7, 5])

layer1 = tf.keras.layers.Dense(8, activation=tf.nn.elu)
layer2 = tf.keras.layers.Dense(6, activation=tf.nn.elu)

with tf.GradientTape(persistent=True, watch_accessed_variables=False) as tape:
  tape.watch(x)
  y = layer1(x)
  y = layer2(y)

y.shape

The full Jacobian of y with respect to x has a shape of (batch, outs, batch, ins), even if you only want (batch, outs, ins):

j = tape.jacobian(y, x)
j.shape

If the gradients of each item in the stack are independent, then every (batch, batch) slice of this tensor is a diagonal matrix:

imshow_zero_center(j[:, 0, :, 0])
_ = plt.title('A (batch, batch) slice')

def plot_as_patches(j):
  # Reorder axes so the diagonals will each form a contiguous patch.
  j = tf.transpose(j, [1, 0, 3, 2])
  # Pad in between each patch.
  lim = tf.reduce_max(abs(j))
  j = tf.pad(j, [[0, 0], [1, 1], [0, 0], [1, 1]],
             constant_values=-lim)
  # Reshape to form a single image.
  s = j.shape
  j = tf.reshape(j, [s[0]*s[1], s[2]*s[3]])
  imshow_zero_center(j, extent=[-0.5, s[2]-0.5, s[0]-0.5, -0.5])

plot_as_patches(j)
_ = plt.title('All (batch, batch) slices are diagonal')

To get the desired result, you can sum over the duplicate batch dimension, or else select the diagonals using tf.einsum:

j_sum = tf.reduce_sum(j, axis=2)
print(j_sum.shape)
j_select = tf.einsum('bxby->bxy', j)
print(j_select.shape)

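It would be much more efficient to do the calculation without the extra dimension in the first place. The tf.GradientTape.batch_jacobian method does exactly that: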

jb = tape.batch_jacobian(y, x)
jb.shape

error = tf.reduce_max(abs(jb - j_sum))
assert error < 1e-3
print(error.numpy())
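
To make the axis order of the result concrete, here is a small sketch with a purely linear map, where the answer is known in closed form. The values of x and w are hypothetical. Since y = x @ w, every per-example Jacobian is the transpose of w.

```python
import tensorflow as tf

# Hypothetical values, chosen only for illustration.
x = tf.constant([[1.0, 2.0],
                 [3.0, 4.0],
                 [5.0, 6.0]])             # (batch=3, ins=2)
w = tf.constant([[1.0, 0.0, 2.0, 0.5],
                 [0.0, 1.0, -1.0, 0.5]])  # (ins=2, outs=4)

with tf.GradientTape() as tape:
  tape.watch(x)
  y = x @ w                               # (batch=3, outs=4)

jb = tape.batch_jacobian(y, x)

print(jb.shape)       # (batch, outs, ins)
print(jb[0].numpy())  # the transpose of w
```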

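Caution: tf.GradientTape.batch_jacobian only verifies that the first dimension of the source and target match. It does not check that the gradients are actually independent, so it is up to you to use it only where that holds. For example, adding a tf.keras.layers.BatchNormalization layer destroys the independence, since it normalizes across the batch dimension:

x = tf.random.normal([7, 5])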

layer1 = tf.keras.layers.Dense(8, activation=tf.nn.elu)
bn = tf.keras.layers.BatchNormalization()
layer2 = tf.keras.layers.Dense(6, activation=tf.nn.elu)

with tf.GradientTape(persistent=True, watch_accessed_variables=False) as tape:
  tape.watch(x)
  y = layer1(x)
  y = bn(y, training=True)
  y = layer2(y)

j = tape.jacobian(y, x)
print(f'j.shape: {j.shape}')

plot_as_patches(j)

_ = plt.title('These slices are not diagonal')
_ = plt.xlabel("Don't use `batch_jacobian`")

In this case, batch_jacobian still runs and returns something with the expected shape, but its contents have no clear meaning:

jb = tape.batch_jacobian(y, x)
print(f'jb.shape: {jb.shape}')