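TensorFlow Model Analysis Metrics and Plots

Overview

TFMA supports the following metrics and plots:

Standard keras metrics (tf.keras.metrics.*)
- Note that you do not need a keras model to use keras metrics. Metrics are computed outside of the graph in beam, using the metrics classes directly.
Standard TFMA metrics and plots (tfma.metrics.*)
Custom keras metrics (metrics derived from tf.keras.metrics.Metric)
Custom TFMA metrics (metrics derived from tfma.metrics.Metric, using custom beam combiners or metrics derived from other metrics).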

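TFMA also provides built-in support for converting binary classification metrics for use with multi-class/multi-label problems:

Binarization based on class ID, top K, etc.
Aggregated metrics based on micro averaging, macro averaging, etc.

TFMA also provides built-in support for query/ranking based metrics, where the examples are grouped by a query key automatically in the pipeline.

Combined, there are over 50 standard metrics and plots available for a variety of problems, including regression, binary classification, multi-class/multi-label classification, and ranking.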

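Configuration

There are two ways to configure metrics in TFMA: (1) using tfma.MetricsSpec, or (2) by creating instances of tf.keras.metrics.* and/or tfma.metrics.* classes in python and using tfma.metrics.specs_from_metrics to convert them to a list of tfma.MetricsSpec.

The following sections describe example configurations for different types of machine learning problems.

Regression Metrics

The following is an example configuration for a regression problem. Consult the tf.keras.metrics.* and tfma.metrics.* modules for other metrics that may be supported.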

from google.protobuf import text_format

metrics_specs = text_format.Parse("""
  metrics_specs {
    metrics { class_name: "ExampleCount" }
    metrics { class_name: "MeanSquaredError" }
    metrics { class_name: "Accuracy" }
    metrics { class_name: "MeanLabel" }
    metrics { class_name: "MeanPrediction" }
    metrics { class_name: "Calibration" }
    metrics {
      class_name: "CalibrationPlot"
      config: '"min_value": 0, "max_value": 10'
    }
  }
""", tfma.EvalConfig()).metrics_specs

The same setup can be created using the following python code:

metrics = [
    tfma.metrics.ExampleCount(name='example_count'),
    tf.keras.metrics.MeanSquaredError(name='mse'),
    tf.keras.metrics.Accuracy(name='accuracy'),
    tfma.metrics.MeanLabel(name='mean_label'),
    tfma.metrics.MeanPrediction(name='mean_prediction'),
    tfma.metrics.Calibration(name='calibration'),
    tfma.metrics.CalibrationPlot(
        name='calibration', min_value=0, max_value=10)
]
metrics_specs = tfma.metrics.specs_from_metrics(metrics)

Note that this setup is also available by calling tfma.metrics.default_regression_specs.
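
To make the quantities in the regression setup above concrete, the following is a minimal plain-Python sketch that computes unweighted versions of them for a tiny batch. It does not use TFMA. The function name regression_summary is hypothetical, and calibration is taken here as mean prediction divided by mean label, which is an assumption about that metric's definition rather than something stated in this guide.

```python
from typing import Dict, Sequence


def regression_summary(labels: Sequence[float],
                       predictions: Sequence[float]) -> Dict[str, float]:
    """Computes unweighted regression summaries over a small batch."""
    n = len(labels)
    if n == 0 or n != len(predictions):
        raise ValueError('labels and predictions must be non-empty and aligned')
    mean_label = sum(labels) / n
    mean_prediction = sum(predictions) / n
    mse = sum((y - p) ** 2 for y, p in zip(labels, predictions)) / n
    return {
        'example_count': float(n),
        'mse': mse,
        'mean_label': mean_label,
        'mean_prediction': mean_prediction,
        # Assumed definition: mean prediction divided by mean label.
        # This raises ZeroDivisionError when the mean label is zero.
        'calibration': mean_prediction / mean_label,
    }


summary = regression_summary(labels=[1.0, 2.0, 3.0],
                             predictions=[1.0, 2.0, 6.0])
```

A calibration value above 1.0 in this sketch means the predictions are, on average, larger than the labels.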

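Binary Classification Metrics

The following is an example configuration for a binary classification problem. Consult the tf.keras.metrics.* and tfma.metrics.* modules for other metrics that may be supported.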

from google.protobuf import text_format

metrics_specs = text_format.Parse("""
  metrics_specs {
    metrics { class_name: "ExampleCount" }
    metrics { class_name: "BinaryCrossentropy" }
    metrics { class_name: "BinaryAccuracy" }
    metrics { class_name: "AUC" }
    metrics { class_name: "AUCPrecisionRecall" }
    metrics { class_name: "Precision" }
    metrics { class_name: "Recall" }
    metrics { class_name: "MeanLabel" }
    metrics { class_name: "MeanPrediction" }
    metrics { class_name: "Calibration" }
    metrics { class_name: "ConfusionMatrixPlot" }
    metrics { class_name: "CalibrationPlot" }
  }
""", tfma.EvalConfig()).metrics_specs

The same setup can be created using the following python code:

metrics = [
    tfma.metrics.ExampleCount(name='example_count'),
    tf.keras.metrics.BinaryCrossentropy(name='binary_crossentropy'),
    tf.keras.metrics.BinaryAccuracy(name='accuracy'),
    tf.keras.metrics.AUC(name='auc', num_thresholds=10000),
    tf.keras.metrics.AUC(
        name='auc_precision_recall', curve='PR', num_thresholds=10000),
    tf.keras.metrics.Precision(name='precision'),
    tf.keras.metrics.Recall(name='recall'),
    tfma.metrics.MeanLabel(name='mean_label'),
    tfma.metrics.MeanPrediction(name='mean_prediction'),
    tfma.metrics.Calibration(name='calibration'),
    tfma.metrics.ConfusionMatrixPlot(name='confusion_matrix_plot'),
    tfma.metrics.CalibrationPlot(name='calibration_plot')
]
metrics_specs = tfma.metrics.specs_from_metrics(metrics)

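Note that this setup is also available by calling tfma.metrics.default_binary_classification_specs.

Multi-class/Multi-label Classification Metrics

The following is an example configuration for a multi-class classification problem. Consult the tf.keras.metrics.* and tfma.metrics.* modules for other metrics that may be supported.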

from google.protobuf import text_format

metrics_specs = text_format.Parse("""
  metrics_specs {
    metrics { class_name: "ExampleCount" }
    metrics { class_name: "SparseCategoricalCrossentropy" }
    metrics { class_name: "SparseCategoricalAccuracy" }
    metrics { class_name: "Precision" config: '"top_k": 1' }
    metrics { class_name: "Precision" config: '"top_k": 3' }
    metrics { class_name: "Recall" config: '"top_k": 1' }
    metrics { class_name: "Recall" config: '"top_k": 3' }
    metrics { class_name: "MultiClassConfusionMatrixPlot" }
  }
""", tfma.EvalConfig()).metrics_specs

The same setup can be created using the following python code:

metrics = [
    tfma.metrics.ExampleCount(name='example_count'),
    tf.keras.metrics.SparseCategoricalCrossentropy(
        name='sparse_categorical_crossentropy'),
    tf.keras.metrics.SparseCategoricalAccuracy(name='accuracy'),
    tf.keras.metrics.Precision(name='precision', top_k=1),
    tf.keras.metrics.Precision(name='precision', top_k=3),
    tf.keras.metrics.Recall(name='recall', top_k=1),
    tf.keras.metrics.Recall(name='recall', top_k=3),
    tfma.metrics.MultiClassConfusionMatrixPlot(
        name='multi_class_confusion_matrix_plot'),
]
metrics_specs = tfma.metrics.specs_from_metrics(metrics)

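Note that this setup is also available by calling tfma.metrics.default_multi_class_classification_specs.

Multi-class/Multi-label Binarized Metrics

Multi-class/multi-label metrics can be binarized to produce metrics per class, per top_k, etc. using tfma.BinarizationOptions. For example: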

from google.protobuf import text_format

metrics_specs = text_format.Parse("""
  metrics_specs {
    binarize: { class_ids: { values: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] } }
    # Metrics to binarize
    metrics { class_name: "AUC" }
    ...
  }
""", tfma.EvalConfig()).metrics_specs

The same setup can be created using the following python code:

metrics = [
    # Metrics to binarize
    tf.keras.metrics.AUC(name='auc', num_thresholds=10000),
    ...
]
metrics_specs = tfma.metrics.specs_from_metrics(
    metrics, binarize=tfma.BinarizationOptions(
        class_ids={'values': [0,1,2,3,4,5,6,7,8,9]}))
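
As a rough illustration of what binarizing by class ID means, the sketch below turns multi-class labels and per-class scores into one binary problem per class, which a binary metric such as AUC could then consume. It is plain Python and does not reproduce TFMA's internals; the function name binarize_by_class_id and the sample data are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def binarize_by_class_id(
    labels: Sequence[int],
    predictions: Sequence[Sequence[float]],
    class_ids: Sequence[int],
) -> Dict[int, List[Tuple[int, float]]]:
    """Returns, per class ID, a list of (binary_label, score) pairs."""
    result: Dict[int, List[Tuple[int, float]]] = {c: [] for c in class_ids}
    for label, scores in zip(labels, predictions):
        for class_id in class_ids:
            # The binary label is 1 only when the example belongs to the class;
            # the score is the model's predicted value for that class.
            result[class_id].append((int(label == class_id), scores[class_id]))
    return result


binary_views = binarize_by_class_id(
    labels=[0, 2],
    predictions=[[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]],
    class_ids=[0, 1, 2])
```

Each entry of binary_views is an ordinary binary classification dataset, so one metric value per class ID can be reported.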

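Multi-class/Multi-label Aggregate Metrics

Multi-class/multi-label metrics can be aggregated to produce a single aggregated value for a binary classification metric by using tfma.AggregationOptions.

Note that aggregation settings are independent of binarization settings, so you can use tfma.AggregationOptions and tfma.BinarizationOptions at the same time.

Micro Average

Micro averaging can be performed by using the micro_average option within tfma.AggregationOptions. For example: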

from google.protobuf import text_format

metrics_specs = text_format.Parse("""
  metrics_specs {
    aggregate: { micro_average: true }
    # Metrics to aggregate
    metrics { class_name: "AUC" }
    ...
  }
""", tfma.EvalConfig()).metrics_specs

The same setup can be created using the following python code:

metrics = [
    # Metrics to aggregate
    tf.keras.metrics.AUC(name='auc', num_thresholds=10000),
    ...
]
metrics_specs = tfma.metrics.specs_from_metrics(
    metrics, aggregate=tfma.AggregationOptions(micro_average=True))

Micro averaging also supports setting top_k, where only the top k values are used in the computation. For example:

from google.protobuf import text_format

metrics_specs = text_format.Parse("""
  metrics_specs {
    aggregate: {
      micro_average: true
      top_k_list: { values: [1, 3] }
    }
    # Metrics to aggregate
    metrics { class_name: "AUC" }
    ...
  }
""", tfma.EvalConfig()).metrics_specs

The same setup can be created using the following python code:

metrics = [
    # Metrics to aggregate
    tf.keras.metrics.AUC(name='auc', num_thresholds=10000),
    ...
]
metrics_specs = tfma.metrics.specs_from_metrics(
    metrics,
    aggregate=tfma.AggregationOptions(micro_average=True,
                                      top_k_list={'values': [1, 3]}))

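Macro / Weighted Macro Average

Macro averaging can be performed by using the macro_average or weighted_macro_average options within tfma.AggregationOptions. Unless top_k settings are used, macro averaging requires setting class_weights in order to know which classes to compute the average for. If a class_weight is not provided, 0.0 is assumed. For example: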

from google.protobuf import text_format

metrics_specs = text_format.Parse("""
  metrics_specs {
    aggregate: {
      macro_average: true
      class_weights: { key: 0 value: 1.0 }
      class_weights: { key: 1 value: 1.0 }
      class_weights: { key: 2 value: 1.0 }
      class_weights: { key: 3 value: 1.0 }
      class_weights: { key: 4 value: 1.0 }
      class_weights: { key: 5 value: 1.0 }
      class_weights: { key: 6 value: 1.0 }
      class_weights: { key: 7 value: 1.0 }
      class_weights: { key: 8 value: 1.0 }
      class_weights: { key: 9 value: 1.0 }
    }
    # Metrics to aggregate
    metrics { class_name: "AUC" }
    ...
  }
""", tfma.EvalConfig()).metrics_specs

The same setup can be created using the following python code:

metrics = [
    # Metrics to aggregate
    tf.keras.metrics.AUC(name='auc', num_thresholds=10000),
    ...
]
metrics_specs = tfma.metrics.specs_from_metrics(
    metrics,
    aggregate=tfma.AggregationOptions(
        macro_average=True, class_weights={i: 1.0 for i in range(10)}))
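
The difference between micro and macro averaging can be seen with a little arithmetic. The sketch below uses precision as the metric and hypothetical per-class (true positive, false positive) counts. It is plain Python for illustration only and makes no claim about how TFMA computes its aggregates internally. Classes without an entry in class_weights get weight 0.0, mirroring the default described above.

```python
from typing import Dict, Tuple

# Hypothetical per-class (true_positives, false_positives) counts.
Counts = Dict[int, Tuple[int, int]]


def precision(true_positives: int, false_positives: int) -> float:
    total = true_positives + false_positives
    return true_positives / total if total else 0.0


def micro_precision(counts: Counts) -> float:
    """Pools the counts of all classes, then computes one precision."""
    pooled_tp = sum(c[0] for c in counts.values())
    pooled_fp = sum(c[1] for c in counts.values())
    return precision(pooled_tp, pooled_fp)


def macro_precision(counts: Counts, class_weights: Dict[int, float]) -> float:
    """Computes precision per class, then takes the weighted mean."""
    weights = {c: class_weights.get(c, 0.0) for c in counts}
    total_weight = sum(weights.values())
    if total_weight == 0.0:
        raise ValueError('at least one class needs a non-zero weight')
    weighted = sum(w * precision(*counts[c]) for c, w in weights.items())
    return weighted / total_weight


counts = {0: (9, 3), 1: (1, 3)}
micro = micro_precision(counts)
macro = macro_precision(counts, {0: 1.0, 1: 1.0})
only_class_0 = macro_precision(counts, {0: 1.0})
```

With these counts the larger class dominates the micro average, while the macro average treats both classes equally.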

Like micro averaging, macro averaging also supports setting top_k, where only the top k values are used in the computation. For example:

from google.protobuf import text_format

metrics_specs = text_format.Parse("""
  metrics_specs {
    aggregate: {
      macro_average: true
      top_k_list: { values: [1, 3] }
    }
    # Metrics to aggregate
    metrics { class_name: "AUC" }
    ...
  }
""", tfma.EvalConfig()).metrics_specs

The same setup can be created using the following python code:

metrics = [
    # Metrics to aggregate
    tf.keras.metrics.AUC(name='auc', num_thresholds=10000),
    ...
]
metrics_specs = tfma.metrics.specs_from_metrics(
    metrics,
    aggregate=tfma.AggregationOptions(macro_average=True,
                                      top_k_list={'values': [1, 3]}))

Query / Ranking Based Metrics

Query/ranking based metrics are enabled by specifying the query_key option in the metrics specs. For example:

from google.protobuf import text_format

metrics_specs = text_format.Parse("""
  metrics_specs {
    query_key: "doc_id"
    metrics {
      class_name: "NDCG"
      config: '"gain_key": "gain", "top_k_list": [1, 2]'
    }
    metrics { class_name: "MinLabelPosition" }
  }
""", tfma.EvalConfig()).metrics_specs

The same setup can be created using the following python code:

metrics = [
    tfma.metrics.NDCG(name='ndcg', gain_key='gain', top_k_list=[1, 2]),
    tfma.metrics.MinLabelPosition(name='min_label_position')
]
metrics_specs = tfma.metrics.specs_from_metrics(metrics, query_key='doc_id')
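
To show what grouping by a query key involves, here is a small plain-Python sketch that groups examples by query, ranks them by predicted score, and computes NDCG at a cutoff. It uses the common log2 rank discount, which is an assumption; TFMA's exact NDCG formula may differ. The function names and sample data are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# Each example is (query key, predicted score, gain).
Example = Tuple[str, float, float]


def dcg(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_by_query(examples: Sequence[Example], k: int) -> Dict[str, float]:
    grouped: Dict[str, List[Tuple[float, float]]] = defaultdict(list)
    for query, score, gain in examples:
        grouped[query].append((score, gain))
    result: Dict[str, float] = {}
    for query, rows in grouped.items():
        # Rank by predicted score, highest first, and keep the gains.
        ranked = [g for _, g in sorted(rows, key=lambda r: r[0], reverse=True)]
        ideal_dcg = dcg(sorted(ranked, reverse=True), k)
        result[query] = dcg(ranked, k) / ideal_dcg if ideal_dcg > 0 else 0.0
    return result


examples = [
    ('q1', 0.9, 1.0), ('q1', 0.1, 0.0),  # relevant item ranked first
    ('q2', 0.8, 0.0), ('q2', 0.2, 1.0),  # relevant item ranked second
]
ndcg_at_1 = ndcg_by_query(examples, k=1)
ndcg_at_2 = ndcg_by_query(examples, k=2)
```

Per-query values like these would then be averaged over queries to get a single number for the slice.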

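Multi-model Evaluation Metrics

TFMA supports evaluating multiple models at the same time. When multi-model evaluation is performed, metrics are calculated for each model. For example: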

from google.protobuf import text_format

metrics_specs = text_format.Parse("""
  metrics_specs {
    # no model_names means all models
    ...
  }
""", tfma.EvalConfig()).metrics_specs

If metrics need to be computed for a subset of models, set model_names in the metrics_specs. For example:

from google.protobuf import text_format

metrics_specs = text_format.Parse("""
  metrics_specs {
    model_names: ["my-model1"]
    ...
  }
""", tfma.EvalConfig()).metrics_specs

The specs_from_metrics API also supports passing model names:

metrics = [
    ...
]
metrics_specs = tfma.metrics.specs_from_metrics(
    metrics, model_names=['my-model1'])

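Model Comparison Metrics

TFMA supports evaluating comparison metrics for a candidate model against a baseline model. A simple way to set up the candidate and baseline model pair is to pass along an eval_shared_model with the proper model names (tfma.BASELINE_KEY and tfma.CANDIDATE_KEY):

from google.protobuf import text_format

eval_config = text_format.Parse("""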
  model_specs {
    # ... model_spec without names ...
  }
  metrics_specs {
    # ... metrics ...
  }
""", tfma.EvalConfig())

eval_shared_models = [
  tfma.default_eval_shared_model(
      model_name=tfma.CANDIDATE_KEY,
      eval_saved_model_path='/path/to/saved/candidate/model',
      eval_config=eval_config),
  tfma.default_eval_shared_model(
      model_name=tfma.BASELINE_KEY,
      eval_saved_model_path='/path/to/saved/baseline/model',
      eval_config=eval_config),
]

eval_result = tfma.run_model_analysis(
    eval_shared_models,
    eval_config=eval_config,
    # This assumes your data is a TFRecords file containing records in the
    # tf.train.Example format.
    data_location="/path/to/file/containing/tfrecords",
    output_path="/path/for/output")

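Comparison metrics are computed automatically for all of the diff-able metrics (currently only scalar value metrics such as accuracy and AUC).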
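
Conceptually, a comparison metric for a scalar is the candidate's value minus the baseline's value. The sketch below illustrates only that arithmetic in plain Python; the function name diff_metrics and the sample values are hypothetical, and the way TFMA names or stores its diff results is not shown here.

```python
from typing import Any, Dict


def diff_metrics(candidate: Dict[str, Any],
                 baseline: Dict[str, Any]) -> Dict[str, float]:
    """Returns candidate minus baseline for metrics that are scalar in both."""
    diffs: Dict[str, float] = {}
    for name, candidate_value in candidate.items():
        baseline_value = baseline.get(name)
        # Only scalar values are treated as diff-able in this sketch.
        if isinstance(candidate_value, (int, float)) and isinstance(
                baseline_value, (int, float)):
            diffs[name] = candidate_value - baseline_value
    return diffs


diffs = diff_metrics(
    candidate={'auc': 0.75, 'accuracy': 0.5, 'confusion': [[1, 2], [3, 4]]},
    baseline={'auc': 0.5, 'accuracy': 0.75, 'confusion': [[2, 1], [3, 4]]})
```

The non-scalar confusion entry is skipped, matching the statement that only scalar value metrics are currently diff-able.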

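Multi-output Model Metrics

TFMA supports evaluating metrics on models that have different outputs. Multi-output models store their output predictions in the form of a dict keyed by output name. When multi-output models are used, the names of the outputs associated with a set of metrics must be specified in the output_names section of the MetricsSpec. For example: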

from google.protobuf import text_format

metrics_specs = text_format.Parse("""
  metrics_specs {
    output_names: ["my-output"]
    ...
  }
""", tfma.EvalConfig()).metrics_specs

The specs_from_metrics API also supports passing output names:

metrics = [
    ...
]
metrics_specs = tfma.metrics.specs_from_metrics(
    metrics, output_names=['my-output'])

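Customizing Metric Settings

TFMA allows customizing the settings that are used with different metrics. For example, you might want to change the name, set thresholds, etc. This is done by adding a config section to the metric config. The config is specified using the JSON string version of the parameters that would be passed to the metric's __init__ method (for ease of use, the leading and trailing '{' and '}' brackets may be omitted). For example: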

from google.protobuf import text_format

metrics_specs = text_format.Parse("""
  metrics_specs {
    metrics {
      class_name: "ConfusionMatrixAtThresholds"
      config: '"thresholds": [0.3, 0.5, 0.8]'
    }
  }
""", tfma.EvalConfig()).metrics_specs

This customization is of course also supported directly:

metrics = [
   tfma.metrics.ConfusionMatrixAtThresholds(thresholds=[0.3, 0.5, 0.8]),
]
metrics_specs = tfma.metrics.specs_from_metrics(metrics)
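
The relationship between the config string and the __init__ keyword arguments can be sketched as follows. This plain-Python helper restores the optional outer braces and decodes the JSON into a kwargs dict. The name parse_metric_config is hypothetical, and this is only an illustration of the convention described above, not TFMA's actual parsing code.

```python
import json
from typing import Any, Dict


def parse_metric_config(config: str) -> Dict[str, Any]:
    """Decodes a metric config string into keyword arguments."""
    text = config.strip()
    if not text:
        return {}
    # The outer braces may be omitted for convenience, so restore them.
    if not text.startswith('{'):
        text = '{' + text + '}'
    return json.loads(text)


kwargs_without_braces = parse_metric_config('"thresholds": [0.3, 0.5, 0.8]')
kwargs_with_braces = parse_metric_config('{"thresholds": [0.3, 0.5, 0.8]}')
```

Both calls produce the same dict, which corresponds to the keyword arguments passed to the metric class in the python example.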

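Outputs

The output of a metric evaluation is a series of metric keys/values and/or plot keys/values, depending on the configuration used.

Metric Keys

MetricKeys are defined using a structured key type. This key uniquely identifies each of the following aspects of a metric:

Metric name (auc, mean_label, etc.)
Model name (only used for multi-model evaluation)
Output name (only used when evaluating multi-output models)
Sub key (e.g. the class ID if a multi-class model is binarized)

Metric Values

MetricValues are defined using a proto that encapsulates the different value types supported by the different metrics (e.g. double, ConfusionMatrixAtThresholds, etc.).

Below are the supported metric value types: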

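double_value - A wrapper for a double type.
bytes_value - A bytes value.
bounded_value - Represents a real value, which could be a pointwise estimate, optionally with approximate bounds of some sort. Has properties value, lower_bound, and upper_bound.
value_at_cutoffs - Value at cutoffs (e.g. precision@K, recall@K). Has property values, each of which has properties cutoff and value.
confusion_matrix_at_thresholds - Confusion matrix at thresholds. Has property matrices, each of which has properties for threshold, precision, recall, and confusion matrix values such as false_negatives.
array_value - For metrics that return an array of values.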

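Plot Keys

PlotKeys are similar to metric keys except that, for historical reasons, all the plot values are stored in a single proto, so the plot key does not have a name.

Plot Values

All the supported plots are stored in a single proto called PlotData.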

EvalResult

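The return value of an evaluation run is a tfma.EvalResult. This record contains slicing_metrics, which encodes the metric key as a multi-level dict where the levels correspond to output name, class ID, metric name, and metric value respectively. This is intended for UI display in a Jupyter notebook. If access to the underlying data is needed, the metrics result file should be used instead (see metrics_for_slice.proto).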
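
The multi-level dict described above can be walked with a few nested loops. The sketch below flattens one such dict into rows. The sample dict is hand-written for illustration: its exact keys and value layout (for example 'classId:1' and 'doubleValue') are assumptions, not output copied from TFMA, and the function name flatten_metrics is hypothetical.

```python
from typing import Any, Dict, Iterator, Tuple

# Levels: output name -> sub key (e.g. class ID) -> metric name -> value.
NestedMetrics = Dict[str, Dict[str, Dict[str, Any]]]


def flatten_metrics(
        nested: NestedMetrics) -> Iterator[Tuple[str, str, str, Any]]:
    """Yields (output_name, sub_key, metric_name, value) rows."""
    for output_name, by_sub_key in nested.items():
        for sub_key, by_metric in by_sub_key.items():
            for metric_name, value in by_metric.items():
                yield output_name, sub_key, metric_name, value


# Hand-written sample; empty strings stand for "no output name / no sub key".
sample: NestedMetrics = {
    '': {
        '': {'auc': {'doubleValue': 0.75}},
        'classId:1': {'auc': {'doubleValue': 0.5}},
    },
}
rows = list(flatten_metrics(sample))
```

For anything beyond display purposes, reading the metrics result file remains the approach recommended above.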

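Customization

In addition to custom metrics that are added as part of a saved keras model (or legacy EvalSavedModel), there are two ways to customize metrics in TFMA after saving: (1) by defining a custom keras metric class, and (2) by defining a custom TFMA metrics class backed by a beam combiner.

In both cases, the metrics are configured by specifying the name of the metric class and its associated module. For example: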

from google.protobuf import text_format

metrics_specs = text_format.Parse("""
  metrics_specs {
    metrics { class_name: "MyMetric" module: "my.module"}
  }
""", tfma.EvalConfig()).metrics_specs

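Custom Keras Metrics

To create a custom keras metric, users need to extend tf.keras.metrics.Metric with their implementation and then make sure the metric's module is available at evaluation time.

Note that for metrics added after the model is saved, TFMA only supports metrics that take label (i.e. y_true), prediction (y_pred), and example weight (sample_weight) as parameters to the update_state method.

Keras Metric Example

The following is an example of a custom keras metric: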

class MyMetric(tf.keras.metrics.Mean):

  def __init__(self, name='my_metric', dtype=None):
    super(MyMetric, self).__init__(name=name, dtype=dtype)

  def update_state(self, y_true, y_pred, sample_weight=None):
    return super(MyMetric, self).update_state(
        y_pred, sample_weight=sample_weight)

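Custom TFMA Metrics

To create a custom TFMA metric, users need to extend tfma.metrics.Metric with their implementation and then make sure the metric's module is available at evaluation time.

Metric

A tfma.metrics.Metric implementation is made up of a set of kwargs that define the metric's configuration, along with a function for creating the computations (possibly multiple) needed to calculate the metric's value. There are two main computation types that can be used: tfma.metrics.MetricComputation and tfma.metrics.DerivedMetricComputation, which are described in the sections below. The function that creates these computations will be passed the following parameters as input: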

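eval_config: tfma.EvalConfig
- The eval config passed to the evaluator (useful for looking up model spec settings such as the prediction key to use).
model_names: List[Text]
- List of model names to compute metrics for (None if single-model).
output_names: List[Text]
- List of output names to compute metrics for (None if single-output).
sub_keys: List[tfma.SubKey]
- List of sub keys (class ID, top K, etc.) to compute metrics for (or None).
aggregation_type: tfma.AggregationType
- Type of aggregation, if computing an aggregation metric.
class_weights: Dict[int, float]
- Class weights to use, if computing an aggregation metric.
query_key: Text
- Query key used, if computing a query/ranking based metric.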

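If a metric is not associated with one or more of these settings, it may leave those parameters out of its signature definition.

If a metric is computed the same way for each model, output, and sub key, the utility tfma.metrics.merge_per_key_computations can be used to perform the same computations for each of these inputs separately.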

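MetricComputation

A MetricComputation is made up of a combination of preprocessors and a combiner. The preprocessors argument is a list of preprocessors, each of which is a beam.DoFn that takes extracts as its input and outputs the initial state that will be used by the combiner (see architecture for more info on what extracts are). All preprocessors are executed sequentially in the order of the list. If preprocessors is empty, the combiner is passed StandardMetricInputs (standard metric inputs contain labels, predictions, and example_weights). The combiner is a beam.CombineFn that takes a tuple of (slice key, preprocessor output) as its input and outputs a tuple of (slice_key, metric results dict) as its result.

Note that slicing happens between the preprocessors and the combiner.

Note that if a metric computation wants to use the standard metric inputs but augment them with a few of the features from the features extracts, the special FeaturePreprocessor can be used. It merges the features requested by multiple combiners into a single shared StandardMetricsInputs value that is passed to all the combiners (the combiners are responsible for reading the features they are interested in and ignoring the rest).

Example

The following is a very simple example of a TFMA metric definition for computing the ExampleCount: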

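from typing import Dict, Iterable, Text

import apache_beam as beam
import tensorflow_model_analysis as tfma


class ExampleCount(tfma.metrics.Metric):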

  def __init__(self, name: Text = 'example_count'):
    super(ExampleCount, self).__init__(_example_count, name=name)


def _example_count(
    name: Text = 'example_count') -> tfma.metrics.MetricComputations:
  key = tfma.metrics.MetricKey(name=name)
  return [
      tfma.metrics.MetricComputation(
          keys=[key],
          preprocessors=[_ExampleCountPreprocessor()],
          combiner=_ExampleCountCombiner(key))
  ]


class _ExampleCountPreprocessor(beam.DoFn):

  def process(self, extracts: tfma.Extracts) -> Iterable[int]:
    yield 1


class _ExampleCountCombiner(beam.CombineFn):

  def __init__(self, metric_key: tfma.metrics.MetricKey):
    self._metric_key = metric_key

  def create_accumulator(self) -> int:
    return 0

  def add_input(self, accumulator: int, state: int) -> int:
    return accumulator + state

  def merge_accumulators(self, accumulators: Iterable[int]) -> int:
    accumulators = iter(accumulators)
    result = next(accumulators)
    for accumulator in accumulators:
      result += accumulator
    return result

  def extract_output(self,
                     accumulator: int) -> Dict[tfma.metrics.MetricKey, int]:
    return {self._metric_key: accumulator}
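
To see how the four combiner methods fit together, the sketch below drives a counting combiner by hand over two shards of preprocessor outputs, the way a runner might. It is plain Python with no beam dependency: CountCombiner only mimics the CombineFn method names, and run_combiner is a hypothetical helper, not part of beam or TFMA.

```python
from typing import Dict, Iterable, List


class CountCombiner:
    """Plain-Python stand-in exposing CombineFn-style methods."""

    def create_accumulator(self) -> int:
        return 0

    def add_input(self, accumulator: int, state: int) -> int:
        return accumulator + state

    def merge_accumulators(self, accumulators: Iterable[int]) -> int:
        return sum(accumulators)

    def extract_output(self, accumulator: int) -> Dict[str, int]:
        return {'example_count': accumulator}


def run_combiner(combiner: CountCombiner,
                 shards: List[List[int]]) -> Dict[str, int]:
    """Accumulates each shard separately, then merges the partial results."""
    partials = []
    for shard in shards:
        accumulator = combiner.create_accumulator()
        for state in shard:
            accumulator = combiner.add_input(accumulator, state)
        partials.append(accumulator)
    return combiner.extract_output(combiner.merge_accumulators(partials))


# Each 1 is what the preprocessor above yields for one example.
result = run_combiner(CountCombiner(), shards=[[1, 1, 1], [1, 1]])
```

Because shards are accumulated independently and merged afterwards, the accumulator type has to support merging in any grouping, which a simple integer sum does.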

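DerivedMetricComputation

A DerivedMetricComputation is made up of a result function that is used to calculate metric values based on the output of other metric computations. The result function takes a dict of computed values as its input and outputs a dict of additional metric results.

Note that it is acceptable (recommended) to include the computations that a derived computation depends on in the list of computations created by a metric. This avoids having to pre-create and pass computations that are shared between multiple metrics. The evaluator automatically de-duplicates computations that have the same definition, so only one computation is actually run.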

Example

The TJUR metrics provide a good example of derived metrics.
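
The shape of a result function can be sketched in plain Python. The example below derives a Tjur-style discrimination value (mean prediction on positive examples minus mean prediction on negative examples) from totals that other computations are assumed to have produced already. The key names and the function name are hypothetical, and this is not TFMA's TJUR implementation.

```python
from typing import Dict


def tjur_discrimination_result(
        computed: Dict[str, float]) -> Dict[str, float]:
    """Derives one extra metric from already-computed values."""
    positive_mean = (
        computed['total_positive_predictions'] / computed['positive_count'])
    negative_mean = (
        computed['total_negative_predictions'] / computed['negative_count'])
    return {'tjur_discrimination': positive_mean - negative_mean}


# Hypothetical outputs of the computations this derived metric depends on.
computed_values = {
    'total_positive_predictions': 3.0,
    'positive_count': 4.0,
    'total_negative_predictions': 1.0,
    'negative_count': 4.0,
}
derived = tjur_discrimination_result(computed_values)
```

The result function never touches the examples themselves; it only reads values that the upstream computations have already aggregated.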