TensorFlow Constrained Optimization Example Using the CelebA Dataset

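This notebook demonstrates an easy way to create and optimize constrained problems using the TFCO library. This method can be useful for improving models when we find that they do not perform equally well across different slices of our data, which we can identify using Fairness Indicators. The second of Google's AI principles states that our technology should avoid creating or reinforcing unfair bias, and we believe this technique can help improve model fairness in some situations. In particular, this notebook will:

Train a simple, unconstrained neural network model to detect a person's smile in images using tf.keras and the large-scale CelebFaces Attributes (CelebA) dataset.
Evaluate model performance against a commonly used fairness metric across age groups, using Fairness Indicators.
Set up a simple constrained optimization problem to achieve fairer performance across age groups.
Retrain the now constrained model and evaluate performance again, ensuring that our chosen fairness metric has improved.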

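Last updated: February 11, 2020

Installation

This notebook was created in Colaboratory, connected to the Python 3 Google Compute Engine backend. If you wish to host this notebook in a different environment, you should not experience any major issues provided you include all the required packages in the cells below.

Note that the very first time you run the pip installs, you may be asked to restart the runtime because the preinstalled packages are out of date. Once you do so, the correct packages will be used.

Pip installs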

!pip install -q -U pip==20.2

!pip install git+https://github.com/google-research/tensorflow_constrained_optimization
!pip install -q tensorflow-datasets tensorflow
!pip install fairness-indicators \
  "absl-py==0.12.0" \
  "apache-beam<3,>=2.40" \
  "avro-python3==1.9.1" \
  "pyzmq==17.0.0"

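Note that depending on when you run the cell below, you may receive a warning that the default version of TensorFlow in Colab will soon switch to TensorFlow 2.X. You can safely ignore that warning, as this notebook was designed to be compatible with TensorFlow 1.X and 2.X.

Import modules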

import os
import sys
import tempfile
import urllib.request

import tensorflow as tf
from tensorflow import keras

import tensorflow_datasets as tfds
tfds.disable_progress_bar()

import numpy as np

import tensorflow_constrained_optimization as tfco

from tensorflow_metadata.proto.v0 import schema_pb2
from tfx_bsl.tfxio import tensor_adapter
from tfx_bsl.tfxio import tf_example_record

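Additionally, we add a few imports that are specific to Fairness Indicators, which we will use to evaluate and visualize the model's performance.

Fairness Indicators related imports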

import tensorflow_model_analysis as tfma
import fairness_indicators as fi
from google.protobuf import text_format
import apache_beam as beam

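Although TFCO is compatible with both eager and graph execution, this notebook assumes that eager execution is enabled by default, as it is in TensorFlow 2.x. To ensure that nothing breaks, eager execution is enabled in the cell below.

Enable eager execution and print versions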

# Compare the major version numerically; comparing version strings is lexicographic.
if int(tf.__version__.split(".")[0]) < 2:
  tf.compat.v1.enable_eager_execution()
  print("Eager execution enabled.")
else:
  print("Eager execution enabled by default.")

print("TensorFlow " + tf.__version__)
print("TFMA " + tfma.VERSION_STRING)
print("TFDS " + tfds.version.__version__)
print("FI " + fi.version.__version__)

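CelebA dataset

CelebA is a large-scale face attributes dataset with more than 200,000 celebrity images, each with 40 attribute annotations (such as hair type, fashion accessories, and facial features) and 5 landmark locations (eyes, mouth, and nose positions). For more details, see the paper. With the permission of the owners, we have stored this dataset on Google Cloud Storage and mostly access it via TensorFlow Datasets (tfds).

In this notebook:

Our model will attempt to classify whether the subject of the image is smiling, as represented by the "Smiling" attribute^*.
Images will be resized from 218x178 to 28x28 to reduce execution time and memory use during training.
Our model's performance will be evaluated across age groups, using the binary "Young" attribute. We will call this "age group" in this notebook.

^* While there is little information available about the labeling methodology for this dataset, we will assume that the "Smiling" attribute was determined by a pleased, kind, or amused expression on the subject's face. For the purpose of this case study, we will take these labels as ground truth.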
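
As a rough illustration of why the resize matters, the short calculation below counts how many values a single RGB image holds before and after resizing. It is only back-of-the-envelope arithmetic based on the sizes stated above; the constant names are made up for this sketch.

```python
# Number of values per RGB image before and after resizing.
ORIGINAL_HEIGHT, ORIGINAL_WIDTH = 218, 178  # CelebA image size stated above
RESIZED_SIDE = 28                            # target size used in this notebook
CHANNELS = 3                                 # RGB

original_values = ORIGINAL_HEIGHT * ORIGINAL_WIDTH * CHANNELS
resized_values = RESIZED_SIDE * RESIZED_SIDE * CHANNELS
reduction_factor = original_values / resized_values

print(original_values)             # 116412
print(resized_values)              # 2352
print(round(reduction_factor, 1))  # 49.5
```

Each resized image therefore carries roughly 50 times fewer values than the original, at the cost of discarding fine detail.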

gcs_base_dir = "gs://celeb_a_dataset/"
celeb_a_builder = tfds.builder("celeb_a", data_dir=gcs_base_dir, version='2.0.0')

celeb_a_builder.download_and_prepare()

num_test_shards_dict = {'0.3.0': 4, '2.0.0': 2} # Used because we download the test dataset separately
version = str(celeb_a_builder.info.version)
print('Celeb_A dataset version: %s' % version)

Test dataset helper functions

local_root = tempfile.mkdtemp(prefix='test-data')
def local_test_filename_base():
  return local_root

def local_test_file_full_prefix():
  return os.path.join(local_test_filename_base(), "celeb_a-test.tfrecord")

def copy_test_files_to_local():
  filename_base = local_test_file_full_prefix()
  num_test_shards = num_test_shards_dict[version]
  for shard in range(num_test_shards):
    url = "https://storage.googleapis.com/celeb_a_dataset/celeb_a/%s/celeb_a-test.tfrecord-0000%s-of-0000%s" % (version, shard, num_test_shards)
    filename = "%s-0000%s-of-0000%s" % (filename_base, shard, num_test_shards)
    res = urllib.request.urlretrieve(url, filename)
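
To make the shard naming concrete, the sketch below builds the URLs with the same format string as the helper above, without downloading anything. The function name shard_urls is hypothetical. Note that the fixed "0000" prefix only yields valid five-digit shard numbers while the shard count stays below 10, which holds for the two versions listed in num_test_shards_dict.

```python
def shard_urls(version, num_shards):
    """Builds test-shard URLs with the same format string as the helper above."""
    template = ("https://storage.googleapis.com/celeb_a_dataset/celeb_a/"
                "%s/celeb_a-test.tfrecord-0000%s-of-0000%s")
    return [template % (version, shard, num_shards) for shard in range(num_shards)]

urls = shard_urls("2.0.0", 2)
for url in urls:
    print(url)
# https://storage.googleapis.com/celeb_a_dataset/celeb_a/2.0.0/celeb_a-test.tfrecord-00000-of-00002
# https://storage.googleapis.com/celeb_a_dataset/celeb_a/2.0.0/celeb_a-test.tfrecord-00001-of-00002
```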

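Caveats

Before moving forward, there are several considerations to keep in mind when using CelebA:

Although in principle this notebook could use any dataset of face images, CelebA was chosen because it contains public domain images of public figures.
All of the attribute annotations in CelebA are operationalized as binary categories. For example, the "Young" attribute (as determined by the dataset labelers) is denoted as either present or absent in the image.
CelebA's categorizations do not reflect the real human diversity of attributes.
For the purposes of this notebook, the feature containing the "Young" attribute is referred to as "age group", where the presence of the "Young" attribute in an image is labeled as a member of the "Young" age group and the absence of the "Young" attribute is labeled as a member of the "Not Young" age group. These are assumptions, as this information is not mentioned in the original paper.
As such, the performance of the models trained in this notebook is tied to the ways the attributes have been operationalized and annotated by the authors of CelebA.
This model should not be used for commercial purposes, as that would violate CelebA's non-commercial research agreement.

Setting up input functions

The subsequent cells will help streamline the input pipeline as well as visualize performance.

First, we define some data-related variables and the requisite preprocessing functions.

Define variables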

ATTR_KEY = "attributes"
IMAGE_KEY = "image"
LABEL_KEY = "Smiling"
GROUP_KEY = "Young"
IMAGE_SIZE = 28

Define preprocessing functions

def preprocess_input_dict(feat_dict):
  # Separate out the image and target variable from the feature dictionary.
  image = feat_dict[IMAGE_KEY]
  label = feat_dict[ATTR_KEY][LABEL_KEY]
  group = feat_dict[ATTR_KEY][GROUP_KEY]

  # Resize and normalize image.
  image = tf.cast(image, tf.float32)
  image = tf.image.resize(image, [IMAGE_SIZE, IMAGE_SIZE])
  image /= 255.0

  # Cast label and group to float32.
  label = tf.cast(label, tf.float32)
  group = tf.cast(group, tf.float32)

  feat_dict[IMAGE_KEY] = image
  feat_dict[ATTR_KEY][LABEL_KEY] = label
  feat_dict[ATTR_KEY][GROUP_KEY] = group

  return feat_dict

get_image_and_label = lambda feat_dict: (feat_dict[IMAGE_KEY], feat_dict[ATTR_KEY][LABEL_KEY])
get_image_label_and_group = lambda feat_dict: (feat_dict[IMAGE_KEY], feat_dict[ATTR_KEY][LABEL_KEY], feat_dict[ATTR_KEY][GROUP_KEY])
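
The following pure-Python sketch mirrors the scaling and casting steps above on a plain dictionary, so the shape of the data can be inspected without TensorFlow. It is a simplified stand-in, not the notebook's pipeline: preprocess_plain is a hypothetical name, the "image" is just three pixel values, and the resize step is omitted.

```python
ATTR_KEY, IMAGE_KEY, LABEL_KEY, GROUP_KEY = "attributes", "image", "Smiling", "Young"

def preprocess_plain(feat_dict):
    """Pure-Python stand-in: scale pixels to [0, 1] and cast attributes to float."""
    return {
        IMAGE_KEY: [pixel / 255.0 for pixel in feat_dict[IMAGE_KEY]],
        ATTR_KEY: {
            LABEL_KEY: float(feat_dict[ATTR_KEY][LABEL_KEY]),
            GROUP_KEY: float(feat_dict[ATTR_KEY][GROUP_KEY]),
        },
    }

# Hypothetical example: a three-value "image", smiling, not young.
example = {IMAGE_KEY: [0, 51, 255], ATTR_KEY: {LABEL_KEY: True, GROUP_KEY: False}}
processed = preprocess_plain(example)

# Same extraction as get_image_label_and_group above.
image = processed[IMAGE_KEY]
label = processed[ATTR_KEY][LABEL_KEY]
group = processed[ATTR_KEY][GROUP_KEY]
print(image, label, group)  # [0.0, 0.2, 1.0] 1.0 0.0
```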

Then, we build out the data functions we need in the rest of the Colab.

# Train data returning either 2 or 3 elements (the third element being the group)
def celeb_a_train_data_wo_group(batch_size):
  celeb_a_train_data = celeb_a_builder.as_dataset(split='train').shuffle(1024).repeat().batch(batch_size).map(preprocess_input_dict)
  return celeb_a_train_data.map(get_image_and_label)
def celeb_a_train_data_w_group(batch_size):
  celeb_a_train_data = celeb_a_builder.as_dataset(split='train').shuffle(1024).repeat().batch(batch_size).map(preprocess_input_dict)
  return celeb_a_train_data.map(get_image_label_and_group)

# Test data for the overall evaluation
celeb_a_test_data = celeb_a_builder.as_dataset(split='test').batch(1).map(preprocess_input_dict).map(get_image_label_and_group)
# Copy test data locally to be able to read it into tfma
copy_test_files_to_local()
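
One consequence of calling repeat() before batch() in the training functions above is that the stream of batches never ends, so training has to be bounded explicitly (by steps_per_epoch or by a break in a loop). The sketch below imitates that behavior with itertools; it is an analogy rather than tf.data itself, it leaves out shuffling, and repeat_and_batch is a hypothetical name.

```python
import itertools

def repeat_and_batch(examples, batch_size):
    """Yields batches forever from an endlessly repeated list of examples."""
    stream = itertools.cycle(examples)  # plays the role of .repeat()
    while True:
        # Plays the role of .batch(): take the next batch_size examples.
        yield list(itertools.islice(stream, batch_size))

batches = repeat_and_batch(["a", "b", "c", "d", "e"], batch_size=2)
first_four = list(itertools.islice(batches, 4))
print(first_four)  # [['a', 'b'], ['c', 'd'], ['e', 'a'], ['b', 'c']]
```

The third batch mixes the end of one pass with the start of the next, which is what repeating before batching implies.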

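Build a simple DNN model

Because this notebook focuses on TFCO, we will assemble a simple, unconstrained tf.keras.Sequential model.

We may be able to greatly improve model performance by adding some complexity (e.g., more densely connected layers, different activation functions, a larger image size), but that may distract from the goal of demonstrating how easy it is to apply the TFCO library when working with Keras. For that reason, the model will be kept simple, but feel free to explore this space.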

def create_model():
  # For this notebook, accuracy will be used to evaluate performance.
  METRICS = [
    tf.keras.metrics.BinaryAccuracy(name='accuracy')
  ]

  # The model consists of:
  # 1. An input layer that flattens the 28x28x3 image.
  # 2. A fully connected layer with 64 units activated by a ReLU function.
  # 3. A single-unit readout layer to output real-scores instead of probabilities.
  model = keras.Sequential([
      keras.layers.Flatten(input_shape=(IMAGE_SIZE, IMAGE_SIZE, 3), name='image'),
      keras.layers.Dense(64, activation='relu'),
      keras.layers.Dense(1, activation=None)
  ])

  # TFCO uses hinge loss by default, so the model is compiled with it as well.
  model.compile(
      optimizer=tf.keras.optimizers.Adam(0.001),
      loss='hinge',
      metrics=METRICS)
  return model
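
Since the model outputs real-valued scores and is compiled with hinge loss, it may help to see that loss computed by hand. The sketch below assumes the usual convention of mapping 0/1 labels to -1/+1 before applying max(0, 1 - y * score); the function name and the toy numbers are made up for illustration.

```python
def hinge_loss(labels, scores):
    """Mean hinge loss for 0/1 labels and real-valued scores."""
    total = 0.0
    for label, score in zip(labels, scores):
        signed_label = 2.0 * label - 1.0  # map {0, 1} to {-1, +1}
        total += max(0.0, 1.0 - signed_label * score)
    return total / len(labels)

labels = [1.0, 0.0, 1.0, 0.0]
scores = [2.0, -0.5, 0.25, 1.0]
print(hinge_loss(labels, scores))  # 0.8125
```

Only the first example incurs no loss: its score is on the correct side with a margin of at least 1. The last example is confidently wrong and contributes the most.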

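We also define a function to set seeds to ensure reproducible results. Note that this Colab is meant as an educational tool and does not have the stability of a finely tuned production pipeline. Running without setting a seed may lead to varied results.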

def set_seeds():
  np.random.seed(121212)
  tf.compat.v1.set_random_seed(212121)
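
The effect of seeding can be seen with Python's standard random module alone. This is only an analogy for the NumPy and TensorFlow seeds set above, and sample_with_seed is a hypothetical helper.

```python
import random

def sample_with_seed(seed, count=3):
    """Returns `count` pseudo-random floats drawn from a freshly seeded generator."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(count)]

first_run = sample_with_seed(121212)
second_run = sample_with_seed(121212)
other_seed = sample_with_seed(212121)

print(first_run == second_run)  # True: same seed, same sequence
print(first_run == other_seed)  # False: different seed, different sequence
```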

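Fairness Indicators helper functions

Before training our model, we define a number of helper functions that will allow us to evaluate the model's performance via Fairness Indicators.

First, we create a helper function to save our model once we train it.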

def save_model(model, subdir):
  base_dir = tempfile.mkdtemp(prefix='saved_models')
  model_location = os.path.join(base_dir, subdir)
  model.save(model_location, save_format='tf')
  return model_location

Next, we define functions used to preprocess the data so that it is correctly passed through to TFMA.

Data preprocessing functions

def tfds_filepattern_for_split(dataset_name, split):
  return f"{local_test_file_full_prefix()}*"

class PreprocessCelebA(object):
  """Class that deserializes, decodes and applies additional preprocessing for CelebA input."""
  def __init__(self, dataset_name):
    builder = tfds.builder(dataset_name)
    self.features = builder.info.features
    example_specs = self.features.get_serialized_info()
    self.parser = tfds.core.example_parser.ExampleParser(example_specs)

  def __call__(self, serialized_example):
    # Deserialize
    deserialized_example = self.parser.parse_example(serialized_example)
    # Decode
    decoded_example = self.features.decode_example(deserialized_example)
    # Additional preprocessing
    image = decoded_example[IMAGE_KEY]
    label = decoded_example[ATTR_KEY][LABEL_KEY]
    # Resize and scale image.
    image = tf.cast(image, tf.float32)
    image = tf.image.resize(image, [IMAGE_SIZE, IMAGE_SIZE])
    image /= 255.0
    image = tf.reshape(image, [-1])
    # Cast label to float32.
    label = tf.cast(label, tf.float32)

    group = decoded_example[ATTR_KEY][GROUP_KEY]

    output = tf.train.Example()
    output.features.feature[IMAGE_KEY].float_list.value.extend(image.numpy().tolist())
    output.features.feature[LABEL_KEY].float_list.value.append(label.numpy())
    output.features.feature[GROUP_KEY].bytes_list.value.append(b"Young" if group.numpy() else b'Not Young')
    return output.SerializeToString()

def tfds_as_pcollection(beam_pipeline, dataset_name, split):
  return (
      beam_pipeline
   | 'Read records' >> beam.io.ReadFromTFRecord(tfds_filepattern_for_split(dataset_name, split))
   | 'Preprocess' >> beam.Map(PreprocessCelebA(dataset_name))
  )
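
Two small transformations in PreprocessCelebA are easy to miss: the image is flattened into a single list of floats, and the binary "Young" attribute becomes a bytes slice name. The sketch below reproduces both on plain Python values; the helper names and the tiny 1x2x3 "image" are invented for illustration.

```python
def flatten(image):
    """Flattens a nested height x width x channels list into one flat list."""
    return [value for row in image for pixel in row for value in pixel]

def group_name(young_attribute):
    """Maps the binary "Young" attribute to the slice name used for evaluation."""
    return b"Young" if young_attribute else b"Not Young"

tiny_image = [[[0.0, 0.1, 0.2], [0.3, 0.4, 0.5]]]  # hypothetical 1x2x3 image
flat = flatten(tiny_image)
print(flat)                                 # [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
print(group_name(True), group_name(False))  # b'Young' b'Not Young'
```

For a real 28x28x3 image the flat list would hold 28 * 28 * 3 = 2352 values.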

Finally, we define a function that evaluates the results in TFMA.

def get_eval_results(model_location, eval_subdir):
  base_dir = tempfile.mkdtemp(prefix='saved_eval_results')
  tfma_eval_result_path = os.path.join(base_dir, eval_subdir)

  eval_config_pbtxt = """
        model_specs {
          label_key: "%s"
        }
        metrics_specs {
          metrics {
            class_name: "FairnessIndicators"
            config: '{ "thresholds": [0.22, 0.5, 0.75] }'
          }
          metrics {
            class_name: "ExampleCount"
          }
        }
        slicing_specs {}
        slicing_specs { feature_keys: "%s" }
        options {
          compute_confidence_intervals { value: False }
          disabled_outputs{values: "analysis"}
        }
      """ % (LABEL_KEY, GROUP_KEY)

  eval_config = text_format.Parse(eval_config_pbtxt, tfma.EvalConfig())

  eval_shared_model = tfma.default_eval_shared_model(
        eval_saved_model_path=model_location, tags=[tf.saved_model.SERVING])

  schema_pbtxt = """
        tensor_representation_group {
          key: ""
          value {
            tensor_representation {
              key: "%s"
              value {
                dense_tensor {
                  column_name: "%s"
                  shape {
                    dim { size: 28 }
                    dim { size: 28 }
                    dim { size: 3 }
                  }
                }
              }
            }
          }
        }
        feature {
          name: "%s"
          type: FLOAT
        }
        feature {
          name: "%s"
          type: FLOAT
        }
        feature {
          name: "%s"
          type: BYTES
        }
        """ % (IMAGE_KEY, IMAGE_KEY, IMAGE_KEY, LABEL_KEY, GROUP_KEY)
  schema = text_format.Parse(schema_pbtxt, schema_pb2.Schema())
  coder = tf_example_record.TFExampleBeamRecord(
      physical_format='inmem', schema=schema,
      raw_record_column_name=tfma.ARROW_INPUT_COLUMN)
  tensor_adapter_config = tensor_adapter.TensorAdapterConfig(
    arrow_schema=coder.ArrowSchema(),
    tensor_representations=coder.TensorRepresentations())
  # Run the fairness evaluation.
  with beam.Pipeline() as pipeline:
    _ = (
          tfds_as_pcollection(pipeline, 'celeb_a', 'test')
          | 'ExamplesToRecordBatch' >> coder.BeamSource()
          | 'ExtractEvaluateAndWriteResults' >>
          tfma.ExtractEvaluateAndWriteResults(
              eval_config=eval_config,
              eval_shared_model=eval_shared_model,
              output_path=tfma_eval_result_path,
              tensor_adapter_config=tensor_adapter_config)
    )
  return tfma.load_eval_result(output_path=tfma_eval_result_path)

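Train and evaluate the unconstrained model

With the model now defined and the input pipeline in place, we are ready to train our model. To cut back on execution time and memory, we will train the model by slicing the data into small batches with only a few repeated iterations.

Note that running this notebook in TensorFlow < 2.0.0 may result in a deprecation warning for np.where. You can safely ignore this warning, as TensorFlow addresses this in 2.X by using tf.where in place of np.where.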

BATCH_SIZE = 32

# Set seeds to get reproducible results
set_seeds()

model_unconstrained = create_model()
model_unconstrained.fit(celeb_a_train_data_wo_group(BATCH_SIZE), epochs=5, steps_per_epoch=1000)

Evaluating the model on the test data should result in a final accuracy score of just over 85%. Not bad for a simple model with no fine-tuning.

print('Overall Results, Unconstrained')
celeb_a_test_data = celeb_a_builder.as_dataset(split='test').batch(1).map(preprocess_input_dict).map(get_image_label_and_group)
results = model_unconstrained.evaluate(celeb_a_test_data)

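However, performance evaluated across age groups may reveal some shortcomings.

To explore this further, we evaluate the model with Fairness Indicators (via TFMA). In particular, we are interested in seeing whether there is a significant gap in performance between the "Young" and "Not Young" categories when evaluated on false positive rate.

A false positive error occurs when the model incorrectly predicts the positive class. In this context, a false positive outcome occurs when the ground truth is an image of a celebrity "Not Smiling" and the model predicts "Smiling". By extension, the false positive rate, which is used in the visualization below, is the fraction of actual negatives ("Not Smiling" images) that the model incorrectly predicts as positive. While this is a relatively mundane error in this context, false positive errors can sometimes cause more problematic behaviors. For instance, a false positive error in a spam classifier could cause a user to miss an important email.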

model_location = save_model(model_unconstrained, 'model_export_unconstrained')
eval_results_unconstrained = get_eval_results(model_location, 'eval_results_unconstrained')
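
The false positive rate described above is FP / (FP + TN), computed separately for each slice. The sketch below works through it on a handful of invented labels and predictions; it is plain Python for illustration and not how TFMA computes the metric internally.

```python
def false_positive_rate(labels, predictions):
    """FP / (FP + TN): the share of actual negatives predicted as positive."""
    false_positives = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    negatives = sum(1 for y in labels if y == 0)
    return false_positives / negatives if negatives else float("nan")

# Hypothetical toy data: 1 = "Smiling", 0 = "Not Smiling".
labels      = [0, 0, 0, 0, 1, 1, 0, 0, 1, 0]
predictions = [1, 0, 0, 0, 1, 0, 1, 1, 1, 0]
groups      = ["Young"] * 5 + ["Not Young"] * 5

fpr_by_group = {}
for name in ("Young", "Not Young"):
    idx = [i for i, g in enumerate(groups) if g == name]
    fpr_by_group[name] = false_positive_rate(
        [labels[i] for i in idx], [predictions[i] for i in idx])

print(fpr_by_group["Young"])                # 0.25
print(round(fpr_by_group["Not Young"], 3))  # 0.667
```

In this toy data the "Not Young" slice has a much higher false positive rate than the "Young" slice, which is the kind of gap the evaluation is meant to surface.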

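As mentioned above, we are concentrating on the false positive rate. The current version of Fairness Indicators (0.1.2) selects false negative rate by default. After running the line below, deselect false_negative_rate and select false_positive_rate to look at the metric we are interested in.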

tfma.addons.fairness.view.widget_view.render_fairness_indicator(eval_results_unconstrained)

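As the results above show, we do see a disproportionate gap between the "Young" and "Not Young" categories.

This is where TFCO can help, by constraining the false positive rate to be within a more acceptable criterion.

Constrained model setup

As documented in TFCO's library, there are several helpers that make it easier to constrain the problem:

tfco.rate_context() – This is what will be used to construct a constraint for each age group category.
tfco.RateMinimizationProblem() – This ties the objective and the constraints together. Here, the rate expression to be minimized is the overall error rate, subject to a constraint on the false positive rate of the "Not Young" age group. For this demonstration, that false positive rate is constrained to be less than or equal to 5%.
tfco.ProxyLagrangianOptimizerV2() – This is the helper that will actually solve the rate constraint problem.

The cell below will call on these helpers to set up model training with the fairness constraint.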

# The batch size is needed to create the input, labels and group tensors.
# These tensors are initialized with all 0's and are later assigned the
# content of each batch. The batch size should be large enough that each
# batch contains both "Young" and "Not Young" examples.
set_seeds()
model_constrained = create_model()
BATCH_SIZE = 32

# Create input tensor.
input_tensor = tf.Variable(
    np.zeros((BATCH_SIZE, IMAGE_SIZE, IMAGE_SIZE, 3), dtype="float32"),
    name="input")

# Create labels and group tensors (assuming both labels and groups are binary).
labels_tensor = tf.Variable(
    np.zeros(BATCH_SIZE, dtype="float32"), name="labels")
groups_tensor = tf.Variable(
    np.zeros(BATCH_SIZE, dtype="float32"), name="groups")

# Create a function that returns the applied 'model' to the input tensor
# and generates constrained predictions.
def predictions():
  return model_constrained(input_tensor)

# Create overall context and subsetted context.
# The subsetted context contains subset of examples where group attribute < 1
# (i.e. the subset of "Not Young" celebrity images).
# "groups_tensor < 1" is used instead of "groups_tensor == 0" as the former
# would be a comparison on the tensor value, while the latter would be a
# comparison on the Tensor object.
context = tfco.rate_context(predictions, labels=lambda:labels_tensor)
context_subset = context.subset(lambda:groups_tensor < 1)

# Setup list of constraints.
# In this notebook, the only constraint is: FPR less than or equal to 5%.
constraints = [tfco.false_positive_rate(context_subset) <= 0.05]

# Setup rate minimization problem: minimize overall error rate s.t. constraints.
problem = tfco.RateMinimizationProblem(tfco.error_rate(context), constraints)

# Create constrained optimizer and obtain train_op.
# Separate optimizers are specified for the objective and constraints
optimizer = tfco.ProxyLagrangianOptimizerV2(
      optimizer=tf.keras.optimizers.legacy.Adam(learning_rate=0.001),
      constraint_optimizer=tf.keras.optimizers.legacy.Adam(learning_rate=0.001),
      num_constraints=problem.num_constraints)

# A list of all trainable variables is also needed to use TFCO.
var_list = (model_constrained.trainable_weights + list(problem.trainable_variables) +
            optimizer.trainable_variables())
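
To build intuition for what a constrained optimizer does, here is a toy problem solved with a plain Lagrangian: gradient descent on the variable and gradient ascent on a non-negative multiplier. This is a deliberately simplified sketch and not the proxy-Lagrangian formulation that tfco.ProxyLagrangianOptimizerV2 implements; the problem, the function name, and the hyperparameters are all assumptions chosen for illustration.

```python
def solve_toy_problem(learning_rate=0.1, num_steps=2000):
    """Minimizes (x - 2)^2 subject to x <= 1 using a plain Lagrangian."""
    x, multiplier = 0.0, 0.0
    for _ in range(num_steps):
        violation = x - 1.0                    # constraint holds when violation <= 0
        grad_x = 2.0 * (x - 2.0) + multiplier  # d/dx of (x - 2)^2 + multiplier * (x - 1)
        x -= learning_rate * grad_x            # descent step on the variable
        # Ascent step on the multiplier, kept non-negative.
        multiplier = max(0.0, multiplier + learning_rate * violation)
    return x, multiplier

x, multiplier = solve_toy_problem()
print(round(x, 4), round(multiplier, 4))  # 1.0 2.0
```

Without the constraint the minimum would be at x = 2. The multiplier grows while the constraint is violated and settles at the value that holds x at the boundary x = 1.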

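Now that the model is set up, it is ready to be trained with the false positive rate constraint across age groups.

Because the last iterate of the constrained model is not necessarily the best-performing one with respect to the defined constraint, the TFCO library comes equipped with tfco.find_best_candidate_index(), which can help choose the best iterate out of the ones recorded during training. Think of tfco.find_best_candidate_index() as an added heuristic that ranks each of the outcomes based on accuracy and the fairness constraint (in this case, the false positive rate across age groups) separately, with respect to the training data. That way, it can search for a better trade-off between overall accuracy and the fairness constraint.

The following cell will start training with the constraint and then select the best-performing snapshot among the recorded iterations.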

# Obtain train set batches.

NUM_ITERATIONS = 100  # Number of training iterations.
SKIP_ITERATIONS = 10  # Print training stats once in this many iterations.

# Create temp directory for saving snapshots of models.
temp_directory = tempfile.mkdtemp()

# List of objective and constraints across iterations.
objective_list = []
violations_list = []

# Training iterations.
iteration_count = 0
for (image, label, group) in celeb_a_train_data_w_group(BATCH_SIZE):
  # Assign current batch to input, labels and groups tensors.
  input_tensor.assign(image)
  labels_tensor.assign(label)
  groups_tensor.assign(group)

  # Run gradient update.
  optimizer.minimize(problem, var_list=var_list)

  # Record objective and violations.
  objective = problem.objective()
  violations = problem.constraints()

  sys.stdout.write(
      "\r Iteration %d: Hinge Loss = %.3f, Max. Constraint Violation = %.3f"
      % (iteration_count + 1, objective, max(violations)))

  # Snapshot model once in SKIP_ITERATIONS iterations.
  if iteration_count % SKIP_ITERATIONS == 0:
    objective_list.append(objective)
    violations_list.append(violations)

    # Save snapshot of model weights, named by snapshot index.
    model_constrained.save_weights(
        temp_directory + "/celeb_a_constrained_" +
        str(iteration_count // SKIP_ITERATIONS) + ".h5")

  iteration_count += 1
  if iteration_count >= NUM_ITERATIONS:
    break

# Choose best model from recorded iterates and load that model.
best_index = tfco.find_best_candidate_index(
    np.array(objective_list), np.array(violations_list))

model_constrained.load_weights(
    temp_directory + "/celeb_a_constrained_" + str(best_index) + ".h5")

# Remove temp directory.
os.system("rm -r " + temp_directory)
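
The idea of choosing among recorded snapshots can be illustrated with a much simpler rule than the ranking heuristic in tfco.find_best_candidate_index(). The sketch below is a stand-in, not TFCO's algorithm: it prefers the lowest objective among snapshots that satisfy every constraint and otherwise falls back to the least-violating one. The function name pick_candidate and the numbers are hypothetical.

```python
def pick_candidate(objectives, violations, tolerance=0.0):
    """Feasibility-first selection over recorded snapshots.

    objectives: one objective value per snapshot (lower is better).
    violations: one list of constraint violations per snapshot (<= 0 is satisfied).
    """
    worst = [max(row) for row in violations]
    feasible = [i for i, v in enumerate(worst) if v <= tolerance]
    if feasible:
        # Among snapshots that satisfy every constraint, prefer the lowest objective.
        return min(feasible, key=lambda i: objectives[i])
    # Otherwise fall back to the snapshot that violates the constraints the least.
    return min(range(len(worst)), key=lambda i: worst[i])

objectives = [0.30, 0.20, 0.25, 0.22]
violations = [[0.00], [0.10], [-0.01], [0.03]]
best = pick_candidate(objectives, violations)
print(best)                                            # 2
print(pick_candidate([0.30, 0.20], [[0.20], [0.10]]))  # 1
```

Snapshot 1 has the lowest objective but violates the constraint, so the rule picks snapshot 2, the better of the two feasible ones.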

After having applied the constraint, we evaluate the results once again using Fairness Indicators.

model_location = save_model(model_constrained, 'model_export_constrained')
eval_result_constrained = get_eval_results(model_location, 'eval_results_constrained')

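As with the previous time we used Fairness Indicators, deselect false_negative_rate and select false_positive_rate to look at the metric we are interested in.

Note that to fairly compare the two versions of our model, it is important to use thresholds that set the overall false positive rate to be roughly equal. This ensures that we are looking at an actual change, as opposed to just a shift in the model equivalent to simply moving the threshold boundary. In our case, comparing the unconstrained model at 0.5 and the constrained model at 0.22 provides a fair comparison for the models.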

eval_results_dict = {
    'constrained': eval_result_constrained,
    'unconstrained': eval_results_unconstrained,
}
tfma.addons.fairness.view.widget_view.render_fairness_indicator(multi_eval_results=eval_results_dict)
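
One way to think about matching thresholds is to compute the overall false positive rate of one model at its threshold, then pick the threshold for the other model whose overall rate is closest. The sketch below does this on invented scores; the function names are hypothetical and the data was chosen only so that the arithmetic is easy to follow.

```python
def overall_fpr(labels, scores, threshold):
    """False positive rate when scores above `threshold` count as positive."""
    negative_scores = [s for y, s in zip(labels, scores) if y == 0]
    return sum(1 for s in negative_scores if s > threshold) / len(negative_scores)

def closest_threshold(labels, scores, target_fpr, candidates):
    """Returns the candidate threshold whose overall FPR is nearest the target."""
    return min(candidates,
               key=lambda t: abs(overall_fpr(labels, scores, t) - target_fpr))

# Hypothetical scores from two models on the same six examples.
labels   = [0, 0, 0, 0, 1, 1]
scores_a = [0.6, 0.4, 0.3, 0.1, 0.9, 0.7]    # stands in for the unconstrained model
scores_b = [0.3, 0.2, 0.15, 0.05, 0.8, 0.4]  # stands in for the constrained model

target = overall_fpr(labels, scores_a, 0.5)
match = closest_threshold(labels, scores_b, target, [0.22, 0.5, 0.75])
print(target, match)  # 0.25 0.22
```

With both models held at the same overall false positive rate, any remaining difference between the per-group rates reflects a change in the model rather than a shifted threshold.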

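With TFCO's ability to express a more complex requirement as a rate constraint, we helped this model achieve a more desirable outcome with little impact on overall performance. There is, of course, still room for improvement, but at least TFCO was able to find a model that gets close to satisfying the constraint and reduces the disparity between the groups as much as possible.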