Note: TensorFlow Lite is now part of Google AI Edge. The latest documentation is at ai.google.dev/edge/lite.

Post-training integer quantization with int16 activations

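Overview

TensorFlow Lite now supports converting activations to 16-bit integer values and weights to 8-bit integer values when a model is converted from TensorFlow to the TensorFlow Lite FlatBuffer format. We refer to this mode as the "16x8 quantization mode". When activations are sensitive to quantization, this mode can significantly improve the accuracy of the quantized model while still achieving an almost 3-4x reduction in model size. In addition, the fully quantized model can be consumed by integer-only hardware accelerators.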
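
To see why 16-bit activations can help, the following sketch quantizes a synthetic activation tensor symmetrically to 8 and 16 bits and compares the round-trip error. It is an illustration only: the helper `fake_quantize` and the random data are assumptions made for this example, not part of the TensorFlow Lite API, and the converter's actual quantization scheme is more involved.

```python
import numpy as np

def fake_quantize(x, num_bits):
    """Symmetrically quantize x to signed num_bits integers, then dequantize."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for int8, 32767 for int16
    scale = np.max(np.abs(x)) / qmax        # one scale for the whole tensor
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale

rng = np.random.default_rng(0)
# Hypothetical activations with a wide dynamic range (heavy-tailed values).
activations = rng.standard_normal(10_000).astype(np.float32) ** 3

err_int8 = np.mean(np.abs(activations - fake_quantize(activations, 8)))
err_int16 = np.mean(np.abs(activations - fake_quantize(activations, 16)))
print(f"mean abs error  int8: {err_int8:.6f}  int16: {err_int16:.6f}")
```

The int16 grid is far finer than the int8 grid over the same range, so small activation values that int8 rounds to zero are still represented.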

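Examples of models that benefit from this post-training quantization mode include:

super-resolution,
audio signal processing such as noise cancelling and beamforming,
image de-noising,
HDR reconstruction from a single image.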

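In this tutorial, you train an MNIST model from scratch, check its accuracy in TensorFlow, and then convert the model into a TensorFlow Lite FlatBuffer using this mode. Finally, you check the accuracy of the converted model and compare it to the original float32 model. Note that this example demonstrates how to use this mode; it does not show its benefits over the other quantization techniques available in TensorFlow Lite.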

Build an MNIST model

Setup

import logging
logging.getLogger("tensorflow").setLevel(logging.DEBUG)

import tensorflow as tf
from tensorflow import keras
import numpy as np
import pathlib

Check that the 16x8 quantization mode is available

tf.lite.OpsSet.EXPERIMENTAL_TFLITE_BUILTINS_ACTIVATIONS_INT16_WEIGHTS_INT8

Train and export the model

# Load MNIST dataset
mnist = keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Normalize the input images so that each pixel value is between 0 and 1.
train_images = train_images / 255.0
test_images = test_images / 255.0

# Define the model architecture
model = keras.Sequential([
  keras.layers.InputLayer(input_shape=(28, 28)),
  keras.layers.Reshape(target_shape=(28, 28, 1)),
  keras.layers.Conv2D(filters=12, kernel_size=(3, 3), activation=tf.nn.relu),
  keras.layers.MaxPooling2D(pool_size=(2, 2)),
  keras.layers.Flatten(),
  keras.layers.Dense(10)
])

# Train the digit classification model
model.compile(optimizer='adam',
              loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
model.fit(
  train_images,
  train_labels,
  epochs=1,
  validation_data=(test_images, test_labels)
)

In this example, you trained the model for only a single epoch, so it reaches only about 96% accuracy.

Convert to a TensorFlow Lite model

Using the TensorFlow Lite Converter, you can now convert the trained model into a TensorFlow Lite model.

First, convert the model with TFLiteConverter into the default float32 format:

converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

Write it out to a .tflite file:

tflite_models_dir = pathlib.Path("/tmp/mnist_tflite_models/")
tflite_models_dir.mkdir(exist_ok=True, parents=True)

tflite_model_file = tflite_models_dir/"mnist_model.tflite"
tflite_model_file.write_bytes(tflite_model)

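To instead quantize the model in 16x8 quantization mode, first set the optimizations flag to use the default optimizations. Then specify the 16x8 quantization mode as the required supported operation set in the target specification: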

converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [tf.lite.OpsSet.EXPERIMENTAL_TFLITE_BUILTINS_ACTIVATIONS_INT16_WEIGHTS_INT8]

As with int8 post-training quantization, you can produce a fully integer quantized model by setting the converter options inference_input_type and inference_output_type to tf.int16.
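
With integer inputs and outputs, the caller is responsible for converting between float values and int16. A minimal sketch of that conversion follows. It assumes the affine mapping real = scale * (q - zero_point) and uses made-up `scale` and `zero_point` values; in practice they would be read from the interpreter's input and output details (the `"quantization"` entry). The helper names are introduced for this example only.

```python
import numpy as np

def quantize_to_int16(x, scale, zero_point):
    """Map float values to int16 using real = scale * (q - zero_point)."""
    q = np.round(x / scale) + zero_point
    info = np.iinfo(np.int16)
    return np.clip(q, info.min, info.max).astype(np.int16)

def dequantize_from_int16(q, scale, zero_point):
    """Map int16 values back to float32."""
    return (q.astype(np.float32) - zero_point) * scale

# Hypothetical parameters for an input normalized to [0, 1].
scale, zero_point = 1.0 / 32767, 0

image = np.linspace(0.0, 1.0, 28 * 28, dtype=np.float32).reshape(1, 28, 28)
q_image = quantize_to_int16(image, scale, zero_point)
restored = dequantize_from_int16(q_image, scale, zero_point)
print(q_image.dtype, float(np.max(np.abs(image - restored))))
```

The same dequantization step would apply to an int16 output tensor before taking `argmax` or comparing against float results.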

Set the calibration data:

mnist_train, _ = tf.keras.datasets.mnist.load_data()
images = tf.cast(mnist_train[0], tf.float32) / 255.0
mnist_ds = tf.data.Dataset.from_tensor_slices((images)).batch(1)
def representative_data_gen():
  for input_value in mnist_ds.take(100):
    # Model has only one input so each data point has one element.
    yield [input_value]
converter.representative_dataset = representative_data_gen
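
The representative dataset lets the converter observe typical activation values and choose quantization ranges. As a rough illustration of the idea only (not the converter's actual calibration algorithm), the sketch below tracks the largest absolute value seen across batches and derives a symmetric int16 scale from it. The generator, the random stand-in data, and the helper name `calibrate_symmetric_scale` are assumptions made for this example.

```python
import numpy as np

def representative_batches(images, num_samples=100):
    """Yield single-image float32 batches, one list per model input."""
    for image in images[:num_samples]:
        yield [np.expand_dims(image, axis=0).astype(np.float32)]

def calibrate_symmetric_scale(batches, num_bits=16):
    """Derive a scale from the largest magnitude observed during calibration."""
    max_abs = 0.0
    for (batch,) in batches:
        max_abs = max(max_abs, float(np.max(np.abs(batch))))
    qmax = 2 ** (num_bits - 1) - 1
    return max_abs / qmax

# Random stand-in for normalized MNIST images with values in [0, 1).
rng = np.random.default_rng(0)
fake_images = rng.random((200, 28, 28))

scale = calibrate_symmetric_scale(representative_batches(fake_images))
print(f"calibrated int16 scale: {scale:.8f}")
```

A sample that covers the range of values seen at inference time matters more than its size, since values outside the calibrated range are clipped.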

Finally, convert the model as usual. Note that, by default, the converted model still uses float inputs and outputs for invocation convenience.

tflite_16x8_model = converter.convert()
tflite_model_16x8_file = tflite_models_dir/"mnist_model_quant_16x8.tflite"
tflite_model_16x8_file.write_bytes(tflite_16x8_model)

Note that the resulting file is approximately 1/3 the size of the float model.

!ls -lh {tflite_models_dir}
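
The size comparison can also be done programmatically. The sketch below is one way to do it with `pathlib`; the helper `size_ratio` is a name chosen for this example, and the demo uses two temporary stand-in files rather than the real models. With the files written earlier, the call would be `size_ratio(tflite_model_file, tflite_model_16x8_file)`.

```python
import pathlib
import tempfile

def size_ratio(original_path, quantized_path):
    """Return how many times larger the original file is than the quantized one."""
    original_size = pathlib.Path(original_path).stat().st_size
    quantized_size = pathlib.Path(quantized_path).stat().st_size
    return original_size / quantized_size

# Demo with stand-in files; the byte counts here are arbitrary.
with tempfile.TemporaryDirectory() as tmp_dir:
    float_file = pathlib.Path(tmp_dir) / "float.bin"
    quant_file = pathlib.Path(tmp_dir) / "quant.bin"
    float_file.write_bytes(b"\0" * 3000)
    quant_file.write_bytes(b"\0" * 1000)
    ratio = size_ratio(float_file, quant_file)

print(f"float model is {ratio:.1f}x the size of the quantized model")
```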

Run the TensorFlow Lite models

Run the TensorFlow Lite models using the Python TensorFlow Lite Interpreter.

Load the models into interpreters

interpreter = tf.lite.Interpreter(model_path=str(tflite_model_file))
interpreter.allocate_tensors()

interpreter_16x8 = tf.lite.Interpreter(model_path=str(tflite_model_16x8_file))
interpreter_16x8.allocate_tensors()

Test the models on one image

test_image = np.expand_dims(test_images[0], axis=0).astype(np.float32)

input_index = interpreter.get_input_details()[0]["index"]
output_index = interpreter.get_output_details()[0]["index"]

interpreter.set_tensor(input_index, test_image)
interpreter.invoke()
predictions = interpreter.get_tensor(output_index)

import matplotlib.pylab as plt

plt.imshow(test_images[0])
template = "True:{true}, predicted:{predict}"
_ = plt.title(template.format(true= str(test_labels[0]),
                              predict=str(np.argmax(predictions[0]))))
plt.grid(False)

test_image = np.expand_dims(test_images[0], axis=0).astype(np.float32)

input_index = interpreter_16x8.get_input_details()[0]["index"]
output_index = interpreter_16x8.get_output_details()[0]["index"]

interpreter_16x8.set_tensor(input_index, test_image)
interpreter_16x8.invoke()
predictions = interpreter_16x8.get_tensor(output_index)

plt.imshow(test_images[0])
template = "True:{true}, predicted:{predict}"
_ = plt.title(template.format(true= str(test_labels[0]),
                              predict=str(np.argmax(predictions[0]))))
plt.grid(False)
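
The model's final `Dense(10)` layer has no activation, so `predictions` holds raw logits rather than probabilities. Taking `argmax` gives the same digit either way; if probabilities are wanted, a softmax can be applied afterwards. A minimal NumPy sketch, using made-up logits in place of real interpreter output:

```python
import numpy as np

def softmax(logits):
    """Convert logits to probabilities along the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

# Hypothetical logits for one image, shaped (1, 10) like the model output.
example_logits = np.array(
    [[-2.1, 0.3, 1.0, 0.2, -1.5, -0.7, -3.0, 6.4, 0.1, 1.2]], dtype=np.float32
)
probabilities = softmax(example_logits)

print("predicted digit:", int(np.argmax(probabilities[0])))
print("confidence:", float(probabilities[0].max()))
```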

Evaluate the models

# A helper function to evaluate the TF Lite model using "test" dataset.
def evaluate_model(interpreter):
  input_index = interpreter.get_input_details()[0]["index"]
  output_index = interpreter.get_output_details()[0]["index"]

  # Run predictions on every image in the "test" dataset.
  prediction_digits = []
  for test_image in test_images:
    # Pre-processing: add batch dimension and convert to float32 to match with
    # the model's input data format.
    test_image = np.expand_dims(test_image, axis=0).astype(np.float32)
    interpreter.set_tensor(input_index, test_image)

    # Run inference.
    interpreter.invoke()

    # Post-processing: remove batch dimension and find the digit with the highest
    # score (the model outputs logits, not probabilities).
    output = interpreter.tensor(output_index)
    digit = np.argmax(output()[0])
    prediction_digits.append(digit)

  # Compare prediction results with ground truth labels to calculate accuracy.
  accurate_count = 0
  for index in range(len(prediction_digits)):
    if prediction_digits[index] == test_labels[index]:
      accurate_count += 1
  accuracy = accurate_count * 1.0 / len(prediction_digits)

  return accuracy

print(evaluate_model(interpreter))
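
The counting loop at the end of `evaluate_model` can also be written as a single NumPy comparison. The sketch below shows the equivalent computation on made-up predictions and labels; `accuracy_from_predictions` is a helper name introduced for this example.

```python
import numpy as np

def accuracy_from_predictions(prediction_digits, labels):
    """Fraction of predictions that match the ground-truth labels."""
    predictions = np.asarray(prediction_digits)
    labels = np.asarray(labels)
    return float(np.mean(predictions == labels))

# Made-up values: 4 of the 5 predictions match.
example_predictions = [7, 2, 1, 0, 4]
example_labels = np.array([7, 2, 1, 0, 9], dtype=np.uint8)

accuracy = accuracy_from_predictions(example_predictions, example_labels)
print(accuracy)
```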

Repeat the evaluation on the 16x8 quantized model:

# NOTE: This quantization mode is an experimental post-training mode.
# It does not yet have optimized kernel implementations or support from
# specialized machine learning hardware accelerators, so it
# could be slower than the float interpreter.
print(evaluate_model(interpreter_16x8))

In this example, you quantized a model in 16x8 mode with no difference in accuracy, while reducing its size by about 3x.