Note: TensorFlow Lite is now part of Google AI Edge. The latest documentation is at ai.google.dev/edge/lite.

Retrain a speech recognition model with TensorFlow Lite Model Maker

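In this Colab notebook, you'll learn how to use TensorFlow Lite Model Maker to train a speech recognition model that classifies spoken words or short phrases from one-second sound samples. The Model Maker library uses transfer learning to retrain an existing TensorFlow model with a new dataset, which reduces the amount of sample data and time required for training.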

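By default, this notebook retrains the model (BrowserFft, from the TFJS Speech Command Recognizer) using a subset of words from the speech commands dataset (such as "up", "down", "left", and "right"). It then exports a TFLite model that you can run on a mobile device or embedded system (such as a Raspberry Pi). It also exports the trained model as a TensorFlow SavedModel.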

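This notebook is also designed to accept a custom dataset of WAV files, uploaded to Colab as a ZIP file. More samples per class generally yield better accuracy, but because the transfer learning process uses feature embeddings from the pre-trained model, you can still get a fairly accurate model with only a few dozen samples per class.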

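If you want to run the notebook with the default speech dataset, you can run the whole notebook now by clicking Runtime > Run all in the Colab toolbar. However, if you want to use your own dataset, continue to Prepare the dataset and follow the instructions there.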

Import the required packages

You'll need TensorFlow, TFLite Model Maker, and some modules for audio manipulation, playback, and visualization.

!sudo apt -y install libportaudio2
!pip install tflite-model-maker

import os
import glob
import random
import shutil

import librosa
import soundfile as sf
from IPython.display import Audio
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import tensorflow as tf
import tflite_model_maker as mm
from tflite_model_maker import audio_classifier
from tflite_model_maker.config import ExportFormat

print(f"TensorFlow Version: {tf.__version__}")
print(f"Model Maker Version: {mm.__version__}")

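Prepare the dataset

To train with the default speech dataset, just run all the code below as is.

But if you want to train with your own speech dataset, follow these steps:

Make sure each sample in your dataset is a WAV file about one second long. Then create a ZIP file containing all your WAV files, organized into a separate subfolder for each class. For example, every sample of the speech command "yes" should be in a subfolder named "yes". Even if you have only one class, the samples must be saved in a subdirectory named after the class. (This script assumes your dataset is not split into train/validation/test sets and performs that split for you.)
Click the Files tab in the left panel and drag your ZIP file there to upload it.
Use the following drop-down option to set use_custom_dataset to True.
Then skip to Prepare a custom dataset to specify your ZIP filename and dataset directory name.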

use_custom_dataset = False
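
Before zipping a custom dataset, it can help to confirm that it follows the layout described above: one subfolder per class, each holding WAV files of about one second. The following is a minimal sketch that uses only the Python standard library. The helper names `summarize_wav_dataset` and `write_silent_wav` and the demo folders are hypothetical and not part of Model Maker, and the sketch assumes uncompressed PCM WAV files, which is what the `wave` module reads.

```python
import os
import tempfile
import wave


def summarize_wav_dataset(root):
  """Maps each class subfolder of `root` to the durations (s) of its WAV files."""
  summary = {}
  for class_name in sorted(os.listdir(root)):
    class_dir = os.path.join(root, class_name)
    if not os.path.isdir(class_dir):
      continue  # Files at the top level do not belong to any class.
    durations = []
    for filename in sorted(os.listdir(class_dir)):
      if filename.lower().endswith('.wav'):
        with wave.open(os.path.join(class_dir, filename), 'rb') as wav:
          durations.append(wav.getnframes() / wav.getframerate())
    summary[class_name] = durations
  return summary


def write_silent_wav(path, seconds=1.0, rate=16000):
  """Writes a mono 16-bit WAV of silence (demo data only)."""
  frame_count = int(seconds * rate)
  with wave.open(path, 'wb') as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
    wav.setframerate(rate)
    wav.writeframes(b'\x00\x00' * frame_count)


# Build a tiny throwaway dataset with two classes and inspect it.
with tempfile.TemporaryDirectory() as demo_root:
  for class_name in ('yes', 'no'):
    os.makedirs(os.path.join(demo_root, class_name))
    write_silent_wav(os.path.join(demo_root, class_name, 'sample0.wav'))
  demo_summary = summarize_wav_dataset(demo_root)

print(demo_summary)
```

Any class whose durations stray far from one second, or any class missing from the summary, is worth fixing before you upload the ZIP file.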

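Generate a background noise dataset

Whether you use the default speech dataset or a custom dataset, you should have a good set of background noises so your model can distinguish speech from other noises (including silence).

Because the following background samples are provided as WAV files that are a minute long or longer, we need to split them into smaller one-second samples so we can reserve some for our test dataset. We'll also combine a couple of different sample sources to build a comprehensive set of background noises and silence: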

tf.keras.utils.get_file('speech_commands_v0.01.tar.gz',
                        'http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz',
                        cache_dir='./',
                        cache_subdir='dataset-speech',
                        extract=True)
tf.keras.utils.get_file('background_audio.zip',
                        'https://storage.googleapis.com/download.tensorflow.org/models/tflite/sound_classification/background_audio.zip',
                        cache_dir='./',
                        cache_subdir='dataset-background',
                        extract=True)

# Create a list of all the background wav files
files = glob.glob(os.path.join('./dataset-speech/_background_noise_', '*.wav'))
files = files + glob.glob(os.path.join('./dataset-background', '*.wav'))

background_dir = './background'
os.makedirs(background_dir, exist_ok=True)

# Loop through all files and split each into several one-second wav files
for file in files:
  filename = os.path.basename(os.path.normpath(file))
  print('Splitting', filename)
  name = os.path.splitext(filename)[0]
  rate = librosa.get_samplerate(file)
  length = round(librosa.get_duration(filename=file))
  for i in range(length - 1):
    start = i * rate
    stop = (i * rate) + rate
    data, _ = sf.read(file, start=start, stop=stop)
    sf.write(os.path.join(background_dir, name + str(i) + '.wav'), data, rate)
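
The frame arithmetic in the loop above can be checked in isolation. This is a small sketch with a hypothetical helper, `one_second_bounds`, that reproduces only the start and stop indices. The 16 kHz rate and 60-second length are illustrative values, not measurements of the downloaded files.

```python
def one_second_bounds(rate, length_seconds):
  """Returns (start, stop) frame indices for each one-second chunk.

  Mirrors the loop above: range(length_seconds - 1) skips the final second.
  """
  return [(i * rate, i * rate + rate) for i in range(length_seconds - 1)]


bounds = one_second_bounds(rate=16000, length_seconds=60)
print(len(bounds), bounds[0], bounds[-1])
```

Because the duration is rounded to whole seconds before the loop, skipping the final second is presumably a guard against reading a chunk that runs past the end of a file whose length was rounded up.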

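Prepare the speech commands dataset

We already downloaded the speech commands dataset, so now we just need to prune the number of classes for our model.

This dataset includes over 30 speech command classes, and most of them have over 2,000 samples. But because we're using transfer learning, we don't need that many samples. So the following code does a few things:

Specify which classes we want to use, and delete the rest.
Keep only 150 samples of each class for training (to show that transfer learning works well with smaller datasets and to reduce training time).
Create a separate directory for a test dataset so we can easily run inference with those samples later.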

if not use_custom_dataset:
  commands = [ "up", "down", "left", "right", "go", "stop", "on", "off", "background"]
  dataset_dir = './dataset-speech'
  test_dir = './dataset-test'

  # Move the processed background samples
  shutil.move(background_dir, os.path.join(dataset_dir, 'background'))   

  # Delete all directories that are not in our commands list
  dirs = glob.glob(os.path.join(dataset_dir, '*/'))
  for dir in dirs:
    name = os.path.basename(os.path.normpath(dir))
    if name not in commands:
      shutil.rmtree(dir)

  # Count is per class
  sample_count = 150
  test_data_ratio = 0.2
  test_count = round(sample_count * test_data_ratio)

  # Loop through child directories (each class of wav files)
  dirs = glob.glob(os.path.join(dataset_dir, '*/'))
  for dir in dirs:
    files = glob.glob(os.path.join(dir, '*.wav'))
    random.seed(42)
    random.shuffle(files)
    # Move test samples:
    for file in files[sample_count:sample_count + test_count]:
      class_dir = os.path.basename(os.path.normpath(dir))
      os.makedirs(os.path.join(test_dir, class_dir), exist_ok=True)
      os.rename(file, os.path.join(test_dir, class_dir, os.path.basename(file)))
    # Delete remaining samples
    for file in files[sample_count + test_count:]:
      os.remove(file)
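
To see what the slicing in the cell above does to a single class folder, here is a minimal sketch that partitions a list of file names the same way without touching the file system. `partition_files` and the `clip_*.wav` names are hypothetical, and the 2,000-file class is an illustrative size.

```python
import random


def partition_files(files, sample_count, test_data_ratio, seed=42):
  """Splits a file list into (kept, test, removed), as the cell above does."""
  files = list(files)
  random.Random(seed).shuffle(files)
  test_count = round(sample_count * test_data_ratio)
  kept = files[:sample_count]                            # stay in the dataset
  test = files[sample_count:sample_count + test_count]   # moved to the test dir
  removed = files[sample_count + test_count:]            # deleted
  return kept, test, removed


# Hypothetical class folder with 2,000 recordings.
all_files = [f'clip_{i}.wav' for i in range(2000)]
kept, test, removed = partition_files(
    all_files, sample_count=150, test_data_ratio=0.2)
print(len(kept), len(test), len(removed))
```

With these settings, each class keeps 150 samples for training and validation, sets aside 30 for testing, and discards the rest.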

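Prepare a custom dataset

If you want to train the model with your own speech dataset, you need to upload your samples as WAV files in a ZIP file (as described above) and modify the following variables to specify your dataset: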

if use_custom_dataset:
  # Specify the ZIP file you uploaded:
  !unzip YOUR-FILENAME.zip
  # Specify the unzipped path to your custom dataset
  # (this path contains all the subfolders with classification names):
  dataset_dir = './YOUR-DIRNAME'

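After changing the filename and path name above, you're ready to train the model with your custom dataset. In the Colab toolbar, select Runtime > Run all to run the whole notebook.

The following code integrates our new background noise samples into your dataset and then separates a portion of all samples to create a test set.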

def move_background_dataset(dataset_dir):
  dest_dir = os.path.join(dataset_dir, 'background')
  if os.path.exists(dest_dir):
    files = glob.glob(os.path.join(background_dir, '*.wav'))
    for file in files:
      shutil.move(file, dest_dir)
  else:
    shutil.move(background_dir, dest_dir)

if use_custom_dataset:
  # Move background samples into custom dataset
  move_background_dataset(dataset_dir)

  # Now we separate some of the files that we'll use for testing:
  test_dir = './dataset-test'
  test_data_ratio = 0.2
  dirs = glob.glob(os.path.join(dataset_dir, '*/'))
  for dir in dirs:
    files = glob.glob(os.path.join(dir, '*.wav'))
    test_count = round(len(files) * test_data_ratio)
    random.seed(42)
    random.shuffle(files)
    # Resolve the class name up front so it is defined even when test_count is 0
    class_dir = os.path.basename(os.path.normpath(dir))
    # Move test samples:
    for file in files[:test_count]:
      os.makedirs(os.path.join(test_dir, class_dir), exist_ok=True)
      os.rename(file, os.path.join(test_dir, class_dir, os.path.basename(file)))
    print('Moved', test_count, 'samples from', class_dir)

Play a sample

To be sure the dataset looks correct, let's play a random sample from the test set:

def get_random_audio_file(samples_dir):
  files = os.path.abspath(os.path.join(samples_dir, '*/*.wav'))
  files_list = glob.glob(files)
  random_audio_path = random.choice(files_list)
  return random_audio_path

def show_sample(audio_path):
  audio_data, sample_rate = sf.read(audio_path)
  class_name = os.path.basename(os.path.dirname(audio_path))
  print(f'Class: {class_name}')
  print(f'File: {audio_path}')
  print(f'Sample rate: {sample_rate}')
  print(f'Sample length: {len(audio_data)}')

  plt.title(class_name)
  plt.plot(audio_data)
  display(Audio(audio_data, rate=sample_rate))

random_audio = get_random_audio_file(test_dir)
show_sample(random_audio)

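Define the model

When using Model Maker to retrain any model, you must start by defining a model spec. The spec defines the base model from which your new model will extract feature embeddings to begin learning new classes. The spec for this speech recognizer is based on the pre-trained BrowserFft model from TFJS.

The model expects input as an audio sample at 44.1 kHz that is just under one second long: the exact sample length must be 44034 frames.

You don't need to resample your training dataset. Model Maker takes care of that for you. But when you later run inference, you must be sure that your input matches that expected format.

All you need to do here is instantiate a BrowserFftSpec: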

spec = audio_classifier.BrowserFftSpec()
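
The input format described above (44.1 kHz, exactly 44034 frames) means an inference input usually has to be trimmed or padded. The following is a minimal sketch of that step on a plain Python list. `fit_to_frames` is a hypothetical helper, zero-padding is an assumption about how to fill short clips, and the sketch does not resample, so it assumes the samples are already at 44.1 kHz.

```python
EXPECTED_SAMPLE_RATE = 44100  # Hz, from the text above
EXPECTED_FRAMES = 44034       # frames, from the text above


def fit_to_frames(samples, frame_count=EXPECTED_FRAMES):
  """Trims or zero-pads a sequence of samples to exactly `frame_count` items."""
  samples = list(samples[:frame_count])
  return samples + [0.0] * (frame_count - len(samples))


short_clip = [0.5] * 40000  # shorter than the model input
long_clip = [0.5] * 48000   # longer than the model input
print(len(fit_to_frames(short_clip)), len(fit_to_frames(long_clip)))

# Duration of the model input in seconds.
input_seconds = EXPECTED_FRAMES / EXPECTED_SAMPLE_RATE
print(round(input_seconds, 4))
```

The inference code later in this notebook does the equivalent on a NumPy array, reading the expected length from the interpreter's input details instead of hard-coding it.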

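Load your dataset

Now you need to load your dataset according to the model spec. Model Maker includes the DataLoader API, which loads your dataset from a folder and ensures it's in the format the model spec expects.

We already reserved some test files by moving them to a separate directory, which makes it easier to run inference with them later. Now we'll create a DataLoader for each split: the training set, the validation set, and the test set.

Load the speech commands dataset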

if not use_custom_dataset:
  train_data_ratio = 0.8
  train_data = audio_classifier.DataLoader.from_folder(
      spec, dataset_dir, cache=True)
  train_data, validation_data = train_data.split(train_data_ratio)
  test_data = audio_classifier.DataLoader.from_folder(
      spec, test_dir, cache=True)

Load a custom dataset

if use_custom_dataset:
  train_data_ratio = 0.8
  train_data = audio_classifier.DataLoader.from_folder(
      spec, dataset_dir, cache=True)
  train_data, validation_data = train_data.split(train_data_ratio)
  test_data = audio_classifier.DataLoader.from_folder(
      spec, test_dir, cache=True)

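Train the model

Now we'll use the Model Maker create() function to create a model based on our model spec and training dataset, and begin training.

If you're using a custom dataset, you might want to change the batch size as appropriate for the number of samples in your training set.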

# If your dataset has fewer than 100 samples per class,
# you might want to try a smaller batch size
batch_size = 25
epochs = 25
model = audio_classifier.create(train_data, spec, validation_data, batch_size, epochs)
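
One rough way to reason about the batch size is to count how many batches make up an epoch. The sketch below is plain arithmetic, not a Model Maker API: `steps_per_epoch` is a hypothetical helper, the sample counts are illustrative, and it assumes the final partial batch is kept rather than dropped.

```python
import math


def steps_per_epoch(sample_count, batch_size):
  """Number of batches needed to see every training sample once."""
  return math.ceil(sample_count / batch_size)


# Illustrative numbers: 9 classes with 120 training samples each.
train_samples = 9 * 120
print(steps_per_epoch(train_samples, batch_size=25))
print(steps_per_epoch(train_samples, batch_size=8))
```

With only a few dozen samples per class, a batch size of 25 leaves very few weight updates per epoch, which is one reason a smaller batch size may be worth trying.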

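Review the model performance

Even if the accuracy/loss looks good from the training output above, it's important to also run the model using test data that the model has not seen yet, which is what the evaluate() method does here: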

model.evaluate(test_data)

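View the confusion matrix

When training a classification model such as this one, it's also useful to inspect the confusion matrix. The confusion matrix gives you a detailed visual representation of how well your classifier performs for each class in your test data.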

def show_confusion_matrix(confusion, test_labels):
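  """Plot the confusion matrix, normalized per true-label row."""
  # keepdims=True keeps the row sums as a column, so each row is divided by its own total
  confusion_normalized = confusion.astype("float") / confusion.sum(axis=1, keepdims=True)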
  sns.set(rc = {'figure.figsize':(6,6)})
  sns.heatmap(
      confusion_normalized, xticklabels=test_labels, yticklabels=test_labels,
      cmap='Blues', annot=True, fmt='.2f', square=True, cbar=False)
  plt.title("Confusion matrix")
  plt.ylabel("True label")
  plt.xlabel("Predicted label")

confusion_matrix = model.confusion_matrix(test_data)
show_confusion_matrix(confusion_matrix.numpy(), test_data.index_to_label)
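
To make the heatmap easier to read, it may help to see how such a matrix is built from individual predictions. The following is a small sketch in plain Python, with hypothetical helper names and made-up labels. Rows are true classes and columns are predicted classes, and each row is normalized so that a cell is the fraction of that true class's samples that received a given prediction.

```python
def confusion_counts(true_labels, predicted_labels, class_names):
  """Builds a count matrix: rows are true classes, columns are predictions."""
  index = {name: i for i, name in enumerate(class_names)}
  matrix = [[0] * len(class_names) for _ in class_names]
  for truth, prediction in zip(true_labels, predicted_labels):
    matrix[index[truth]][index[prediction]] += 1
  return matrix


def normalize_rows(matrix):
  """Divides each row by its sum so every true class sums to 1."""
  normalized = []
  for row in matrix:
    total = sum(row)
    normalized.append([count / total if total else 0.0 for count in row])
  return normalized


# Made-up labels for illustration only.
class_names = ['up', 'down']
true_labels = ['up', 'up', 'up', 'up', 'down', 'down']
predicted_labels = ['up', 'up', 'up', 'down', 'down', 'up']
counts = confusion_counts(true_labels, predicted_labels, class_names)
print(counts)
print(normalize_rows(counts))
```

In this toy example, the diagonal shows that 75% of the "up" samples and 50% of the "down" samples were classified correctly.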

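Export the model

The last step is exporting your model into the TensorFlow Lite format for execution on mobile/embedded devices and into the SavedModel format for execution elsewhere.

When exporting a .tflite file from Model Maker, it includes model metadata that describes various details that can later help during inference. It even includes a copy of the classification labels file, so you don't need a separate labels.txt file. (In the next section, we show how to use this metadata to run inference.)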

TFLITE_FILENAME = 'browserfft-speech.tflite'
SAVE_PATH = './models'

print(f'Exporting the model to {SAVE_PATH}')
model.export(SAVE_PATH, tflite_filename=TFLITE_FILENAME)
model.export(SAVE_PATH, export_format=[mm.ExportFormat.SAVED_MODEL, mm.ExportFormat.LABEL])

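Run inference with the TF Lite model

Now your TFLite model can be deployed and run using any of the supported inferencing libraries or with the new TFLite AudioClassifier Task API. The following code shows how you can run inference with the .tflite model in Python.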

# This library provides the TFLite metadata API
!pip install -q tflite_support

from tflite_support import metadata
import json

def get_labels(model):
  """Returns a list of labels, extracted from the model metadata."""
  displayer = metadata.MetadataDisplayer.with_model_file(model)
  labels_file = displayer.get_packed_associated_file_list()[0]
  labels = displayer.get_associated_file_buffer(labels_file).decode()
  return [line for line in labels.split('\n')]

def get_input_sample_rate(model):
  """Returns the model's expected sample rate, from the model metadata."""
  displayer = metadata.MetadataDisplayer.with_model_file(model)
  metadata_json = json.loads(displayer.get_metadata_json())
  input_tensor_metadata = metadata_json['subgraph_metadata'][0][
          'input_tensor_metadata'][0]
  input_content_props = input_tensor_metadata['content']['content_properties']
  return input_content_props['sample_rate']
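
The get_labels() function above splits the packed label file on newlines, which implies one label per line. If such a file ends with a newline, that split would leave an empty string at the end of the list. The following sketch shows a slightly more defensive parse. `parse_labels` is a hypothetical helper, and whether the label file written by Model Maker actually ends with a newline is not stated here, so treat the example text as an assumption.

```python
def parse_labels(label_text):
  """Returns one label per non-empty line of a label file's contents."""
  return [line.strip() for line in label_text.splitlines() if line.strip()]


# Hypothetical label file contents with a trailing newline.
example_labels = parse_labels('background\ndown\ngo\n')
print(example_labels)
```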

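To observe how well the model performs with real samples, run the following code block over and over. Each time, it fetches a new test sample and runs inference with it, and you can listen to the audio sample below.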

# Get a WAV file for inference and list of labels from the model
tflite_file = os.path.join(SAVE_PATH, TFLITE_FILENAME)
labels = get_labels(tflite_file)
random_audio = get_random_audio_file(test_dir)

# Ensure the audio sample fits the model input
interpreter = tf.lite.Interpreter(tflite_file)
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
input_size = input_details[0]['shape'][1]
sample_rate = get_input_sample_rate(tflite_file)
audio_data, _ = librosa.load(random_audio, sr=sample_rate)
if len(audio_data) < input_size:
  audio_data.resize(input_size)
audio_data = np.expand_dims(audio_data[:input_size], axis=0)

# Run inference
interpreter.allocate_tensors()
interpreter.set_tensor(input_details[0]['index'], audio_data)
interpreter.invoke()
output_data = interpreter.get_tensor(output_details[0]['index'])

# Display prediction and ground truth
top_index = np.argmax(output_data[0])
label = labels[top_index]
score = output_data[0][top_index]
print('---prediction---')
print(f'Class: {label}\nScore: {score}')
print('----truth----')
show_sample(random_audio)
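
The cell above reports only the single best class. When a prediction looks wrong, it can be informative to see the runners-up as well. This is a minimal sketch of ranking scores against labels. `top_k` is a hypothetical helper, and the scores are made up for illustration; in the notebook they would come from `output_data[0]` and `labels`.

```python
def top_k(scores, labels, k=3):
  """Returns the k (label, score) pairs with the highest scores."""
  ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
  return ranked[:k]


# Made-up scores for illustration; real values come from the interpreter output.
example_labels = ['background', 'down', 'go', 'left']
example_scores = [0.02, 0.10, 0.85, 0.03]
best = top_k(example_scores, example_labels, k=2)
print(best)
```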

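Download the TF Lite model

Now you can deploy the TF Lite model to your mobile or embedded device. You don't need to download the labels file because you can instead retrieve the labels from the .tflite file metadata, as shown in the previous inferencing example.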

try:
  from google.colab import files
except ImportError:
  pass
else:
  files.download(tflite_file)

Check out our end-to-end example apps that perform inference with TFLite audio models on Android and iOS.