Image captioning with visual attention

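Given an image like the example below, your goal is to generate a caption such as "a surfer riding on a wave".

A man surfing, from wikimedia

The model architecture used here is inspired by Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, but has been updated to use a 2-layer Transformer decoder. To get the most out of this tutorial you should have some experience with text generation, seq2seq models and attention, or Transformers.

The model architecture built in this tutorial is shown below. Features are extracted from the image and passed to the cross-attention layers of the Transformer decoder.

The model architecture

The Transformer decoder is mainly built from attention layers. It uses self-attention to process the sequence being generated, and it uses cross-attention to attend to the image.

By inspecting the attention weights of the cross-attention layers you will see which parts of the image the model is looking at as it generates words.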

Prediction

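This notebook is an end-to-end example. When you run the notebook, it downloads a dataset, extracts and caches the image features, and trains a decoder model. It then uses the model to generate captions on new images.

Setup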

apt install --allow-change-held-packages libcudnn8=8.6.0.163-1+cuda11.8

pip uninstall -y tensorflow estimator keras

pip install -U tensorflow_text tensorflow tensorflow_datasets

pip install einops

This tutorial uses a lot of imports, mostly for loading the datasets.

import concurrent.futures
import collections
import dataclasses
import hashlib
import itertools
import json
import math
import os
import pathlib
import random
import re
import string
import time
import urllib.request

import einops
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from PIL import Image
import requests
import tqdm

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
import tensorflow_datasets as tfds

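[Optional] Data handling

This section downloads a captions dataset and prepares it for training. It tokenizes the input text, and caches the results of running all the images through a pretrained feature-extractor model. It's not critical to understand everything in this section.

Choose a dataset

This tutorial is set up to give a choice of datasets: either Flickr8k or a small slice of the Conceptual Captions dataset. Both are downloaded and converted from scratch, but it wouldn't be hard to convert the tutorial to use the caption datasets available in TensorFlow Datasets: Coco Captions and the full Conceptual Captions.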

Flickr8k

def flickr8k(path='flickr8k'):
  path = pathlib.Path(path)

  if len(list(path.rglob('*'))) < 16197:
    tf.keras.utils.get_file(
        origin='https://github.com/jbrownlee/Datasets/releases/download/Flickr8k/Flickr8k_Dataset.zip',
        cache_dir='.',
        cache_subdir=path,
        extract=True)
    tf.keras.utils.get_file(
        origin='https://github.com/jbrownlee/Datasets/releases/download/Flickr8k/Flickr8k_text.zip',
        cache_dir='.',
        cache_subdir=path,
        extract=True)

  captions = (path/"Flickr8k.token.txt").read_text().splitlines()
  captions = (line.split('\t') for line in captions)
  captions = ((fname.split('#')[0], caption) for (fname, caption) in captions)

  cap_dict = collections.defaultdict(list)
  for fname, cap in captions:
    cap_dict[fname].append(cap)

  train_files = (path/'Flickr_8k.trainImages.txt').read_text().splitlines()
  train_captions = [(str(path/'Flicker8k_Dataset'/fname), cap_dict[fname]) for fname in train_files]

  test_files = (path/'Flickr_8k.testImages.txt').read_text().splitlines()
  test_captions = [(str(path/'Flicker8k_Dataset'/fname), cap_dict[fname]) for fname in test_files]

  train_ds = tf.data.experimental.from_list(train_captions)
  test_ds = tf.data.experimental.from_list(test_captions)

  return train_ds, test_ds

Conceptual Captions

def conceptual_captions(*, data_dir="conceptual_captions", num_train, num_val):
  def iter_index(index_path):
    with open(index_path) as f:
      for line in f:
        caption, url = line.strip().split('\t')
        yield caption, url

  def download_image_urls(data_dir, urls):
    ex = concurrent.futures.ThreadPoolExecutor(max_workers=100)
    def save_image(url):
      hash = hashlib.sha1(url.encode())
      # Name the files after the hash of the URL.
      file_path = data_dir/f'{hash.hexdigest()}.jpeg'
      if file_path.exists():
        # Only download each file once.
        return file_path

      try:
        result = requests.get(url, timeout=5)
      except Exception:
        file_path = None
      else:
        file_path.write_bytes(result.content)
      return file_path

    result = []
    out_paths = ex.map(save_image, urls)
    for file_path in tqdm.tqdm(out_paths, total=len(urls)):
      result.append(file_path)

    return result

  def ds_from_index_file(index_path, data_dir, count):
    data_dir.mkdir(exist_ok=True)
    index = list(itertools.islice(iter_index(index_path), count))
    captions = [caption for caption, url in index]
    urls = [url for caption, url in index]

    paths = download_image_urls(data_dir, urls)

    new_captions = []
    new_paths = []
    for cap, path in zip(captions, paths):
      if path is None:
        # Download failed, so skip this pair.
        continue
      new_captions.append(cap)
      new_paths.append(path)

    new_paths = [str(p) for p in new_paths]

    ds = tf.data.Dataset.from_tensor_slices((new_paths, new_captions))
    ds = ds.map(lambda path,cap: (path, cap[tf.newaxis])) # 1 caption per image
    return ds

  data_dir = pathlib.Path(data_dir)
  train_index_path = tf.keras.utils.get_file(
    origin='https://storage.googleapis.com/gcc-data/Train/GCC-training.tsv',
    cache_subdir=data_dir,
    cache_dir='.')

  val_index_path = tf.keras.utils.get_file(
    origin='https://storage.googleapis.com/gcc-data/Validation/GCC-1.1.0-Validation.tsv',
    cache_subdir=data_dir,
    cache_dir='.')

  train_raw = ds_from_index_file(train_index_path, data_dir=data_dir/'train', count=num_train)
  test_raw = ds_from_index_file(val_index_path, data_dir=data_dir/'val', count=num_val)

  return train_raw, test_raw

Download the dataset

Flickr8k is a good choice because it contains 5 captions per image, which gives more data for a smaller download.

choose = 'flickr8k'

if choose == 'flickr8k':
  train_raw, test_raw = flickr8k()
else:
  train_raw, test_raw = conceptual_captions(num_train=10000, num_val=5000)

The loaders for both datasets above return tf.data.Datasets containing (image_path, captions) pairs. The Flickr8k dataset contains 5 captions per image, while Conceptual Captions has 1:

train_raw.element_spec

for ex_path, ex_captions in train_raw.take(1):
  print(ex_path)
  print(ex_captions)

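Image feature extractor

You will use an image model (pretrained on imagenet) to extract the features from each image. The model was trained as an image classifier, but setting include_top=False returns the model without the final classification layer, so you can use the last layer of feature maps: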

IMAGE_SHAPE=(224, 224, 3)
mobilenet = tf.keras.applications.MobileNetV3Small(
    input_shape=IMAGE_SHAPE,
    include_top=False,
    include_preprocessing=True)
mobilenet.trainable=False

Here's a function to load an image and resize it for the model:

def load_image(image_path):
    img = tf.io.read_file(image_path)
    img = tf.io.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, IMAGE_SHAPE[:-1])
    return img

The model returns a feature map for each image in the input batch:

test_img_batch = load_image(ex_path)[tf.newaxis, :]

print(test_img_batch.shape)
print(mobilenet(test_img_batch).shape)

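Set up the text tokenizer/vectorizer

You will transform the text captions into integer sequences using the TextVectorization layer, with the following steps:

Use adapt to iterate over all captions, split the captions into words, and compute a vocabulary of the top words.
Tokenize all captions by mapping each word to its index in the vocabulary. Because the layer below is created with ragged=True, the output sequences keep their own lengths at this stage; they are padded later, when batches are converted to dense tensors.
Create word-to-index and index-to-word mappings to display results.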
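
The three steps above can be sketched in plain Python, independent of TensorFlow. This is only an illustration: the helper names (toy_standardize, build_vocab, toy_tokenize) and the toy captions are assumptions, and the real TextVectorization layer differs in details such as batching and ragged output.

```python
import collections
import re
import string

def toy_standardize(s):
    # Lowercase, strip punctuation, then wrap in start/end markers.
    s = s.lower()
    s = re.sub(f'[{re.escape(string.punctuation)}]', '', s)
    return f'[START] {s} [END]'

def build_vocab(captions, max_tokens):
    # Index 0 is reserved for padding and index 1 for unknown words.
    counts = collections.Counter()
    for cap in captions:
        counts.update(toy_standardize(cap).split())
    most_common = [w for w, _ in counts.most_common(max_tokens - 2)]
    return ['', '[UNK]'] + most_common

captions = ['A cat in a hat.', 'A dog in a hat!', 'A cat.']
vocab = build_vocab(captions, max_tokens=8)
word_to_idx = {w: i for i, w in enumerate(vocab)}

def toy_tokenize(s):
    # Words outside the vocabulary map to the [UNK] index.
    return [word_to_idx.get(w, 1) for w in toy_standardize(s).split()]

ids = toy_tokenize('a robot cat')
words = [vocab[i] for i in ids]
print(ids)
print(words)
```

Here "robot" never appears in the toy captions, so it comes back as [UNK], which mirrors how rare words fall outside the 5000-word vocabulary used in this tutorial.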

def standardize(s):
  s = tf.strings.lower(s)
  s = tf.strings.regex_replace(s, f'[{re.escape(string.punctuation)}]', '')
  s = tf.strings.join(['[START]', s, '[END]'], separator=' ')
  return s

# Use the top 5000 words for a vocabulary.
vocabulary_size = 5000
tokenizer = tf.keras.layers.TextVectorization(
    max_tokens=vocabulary_size,
    standardize=standardize,
    ragged=True)
# Learn the vocabulary from the caption data.

tokenizer.adapt(train_raw.map(lambda fp,txt: txt).unbatch().batch(1024))

tokenizer.get_vocabulary()[:10]

t = tokenizer([['a cat in a hat'], ['a robot dog']])
t

# Create mappings for words to indices and indices to words.
word_to_index = tf.keras.layers.StringLookup(
    mask_token="",
    vocabulary=tokenizer.get_vocabulary())
index_to_word = tf.keras.layers.StringLookup(
    mask_token="",
    vocabulary=tokenizer.get_vocabulary(),
    invert=True)

w = index_to_word(t)
w.to_list()

tf.strings.reduce_join(w, separator=' ', axis=-1).numpy()

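Prepare the datasets

The train_raw and test_raw datasets contain 1:many (image, captions) pairs.

This function replicates each image so that images and captions are paired 1:1: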

def match_shapes(images, captions):
  caption_shape = einops.parse_shape(captions, 'b c')
  captions = einops.rearrange(captions, 'b c -> (b c)')
  images = einops.repeat(
      images, 'b ... -> (b c) ...',
      c = caption_shape['c'])
  return images, captions

for ex_paths, ex_captions in train_raw.batch(32).take(1):
  break

print('image paths:', ex_paths.shape)
print('captions:', ex_captions.shape)
print()

ex_paths, ex_captions = match_shapes(images=ex_paths, captions=ex_captions)

print('image_paths:', ex_paths.shape)
print('captions:', ex_captions.shape)
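
The same replication can be written with plain Python lists, which may make the einops patterns easier to read. This is a sketch under assumptions: toy_match_shapes and the file names are hypothetical, and it only mirrors the ordering of 'b c -> (b c)', where all captions of one image stay adjacent.

```python
def toy_match_shapes(images, captions):
    # images: list of b items; captions: list of b lists with c captions each.
    flat_images, flat_captions = [], []
    for image, caps in zip(images, captions):
        for cap in caps:
            flat_images.append(image)    # repeat the image once per caption
            flat_captions.append(cap)
    return flat_images, flat_captions

imgs, caps = toy_match_shapes(
    ['img0.jpg', 'img1.jpg'],
    [['a dog runs', 'a dog plays'], ['a man surfs', 'a surfer on a wave']])

for pair in zip(imgs, caps):
    print(pair)
```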

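To be compatible with keras training, the dataset should contain (inputs, labels) pairs. For text generation the tokens are both the input and the labels, shifted by one step. This function converts an (images, texts) pair to an ((images, input_tokens), label_tokens) pair: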

def prepare_txt(imgs, txts):
  tokens = tokenizer(txts)

  input_tokens = tokens[..., :-1]
  label_tokens = tokens[..., 1:]
  return (imgs, input_tokens), label_tokens
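
As a small worked example of that one-step shift, here is the same slicing applied to a toy caption written as strings instead of token IDs (the caption itself is made up for illustration):

```python
tokens = ['[START]', 'a', 'man', 'surfing', '[END]']

input_tokens = tokens[:-1]   # everything except the last token
label_tokens = tokens[1:]    # everything except the first token

# At each position the model sees the input token (and those before it)
# and is trained to predict the label token at the same position.
pairs = list(zip(input_tokens, label_tokens))
for seen, target in pairs:
    print(f'{seen!r} -> {target!r}')
```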

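This function adds operations to a dataset. The steps are:

Load the images (and ignore images that fail to load).
Replicate images to match the number of captions.
Shuffle and rebatch the image, caption pairs.
Tokenize the text, shift the tokens and add label_tokens.
Convert the text from a RaggedTensor representation to a padded dense Tensor representation.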

def prepare_dataset(ds, tokenizer, batch_size=32, shuffle_buffer=1000):
  # Load the images and make batches.
  ds = (ds
        .shuffle(10000)
        .map(lambda path, caption: (load_image(path), caption))
        .apply(tf.data.experimental.ignore_errors())
        .batch(batch_size))

  def to_tensor(inputs, labels):
    (images, in_tok), out_tok = inputs, labels
    return (images, in_tok.to_tensor()), out_tok.to_tensor()

  return (ds
          .map(match_shapes, tf.data.AUTOTUNE)
          .unbatch()
          .shuffle(shuffle_buffer)
          .batch(batch_size)
          .map(prepare_txt, tf.data.AUTOTUNE)
          .map(to_tensor, tf.data.AUTOTUNE)
          )

You could install the feature extractor in your model and train on the datasets like this:

train_ds = prepare_dataset(train_raw, tokenizer)
train_ds.element_spec

test_ds = prepare_dataset(test_raw, tokenizer)
test_ds.element_spec

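[Optional] Cache the image features

Since the image feature extractor is not changing, and this tutorial is not using image augmentation, the image features can be cached. The same goes for the text tokenization. The time it takes to set up the cache is earned back on each epoch during training and validation. The code below defines two functions, save_dataset and load_dataset: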

def save_dataset(ds, save_path, image_model, tokenizer, shards=10, batch_size=32):
  # Load the images and make batches.
  ds = (ds
        .map(lambda path, caption: (load_image(path), caption))
        .apply(tf.data.experimental.ignore_errors())
        .batch(batch_size))

  # Run the feature extractor on each batch
  # Don't do this in a .map, because tf.data runs on the CPU. 
  def gen():
    for (images, captions) in tqdm.tqdm(ds): 
      feature_maps = image_model(images)

      feature_maps, captions = match_shapes(feature_maps, captions)
      yield feature_maps, captions

  # Wrap the generator in a new tf.data.Dataset.
  new_ds = tf.data.Dataset.from_generator(
      gen,
      output_signature=(
          tf.TensorSpec(shape=image_model.output_shape),
          tf.TensorSpec(shape=(None,), dtype=tf.string)))

  # Apply the tokenization 
  new_ds = (new_ds
            .map(prepare_txt, tf.data.AUTOTUNE)
            .unbatch()
            .shuffle(1000))

  # Save the dataset into shard files.
  def shard_func(i, item):
    return i % shards
  new_ds.enumerate().save(save_path, shard_func=shard_func)

def load_dataset(save_path, batch_size=32, shuffle=1000, cycle_length=2):
  def custom_reader_func(datasets):
    datasets = datasets.shuffle(1000)
    return datasets.interleave(lambda x: x, cycle_length=cycle_length)

  ds = tf.data.Dataset.load(save_path, reader_func=custom_reader_func)

  def drop_index(i, x):
    return x

  ds = (ds
        .map(drop_index, tf.data.AUTOTUNE)
        .shuffle(shuffle)
        .padded_batch(batch_size)
        .prefetch(tf.data.AUTOTUNE))
  return ds

save_dataset(train_raw, 'train_cache', mobilenet, tokenizer)
save_dataset(test_raw, 'test_cache', mobilenet, tokenizer)

Data ready for training

After those preprocessing steps, here are the datasets:

train_ds = load_dataset('train_cache')
test_ds = load_dataset('test_cache')

train_ds.element_spec

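The dataset now returns (input, label) pairs suitable for training with keras. The inputs are (images, input_tokens) pairs. The images have been processed with the feature-extractor model. For each location in input_tokens the model looks at the text so far and tries to predict the next token, which is lined up at the same location in the labels.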

for (inputs, ex_labels) in train_ds.take(1):
  (ex_img, ex_in_tok) = inputs

print(ex_img.shape)
print(ex_in_tok.shape)
print(ex_labels.shape)

The input tokens and the labels are the same, just shifted by 1 step:

print(ex_in_tok[0].numpy())
print(ex_labels[0].numpy())

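A Transformer decoder model

This model assumes that the pretrained image encoder is sufficient, and just focuses on building the text decoder. This tutorial uses a 2-layer Transformer decoder.

The implementations are almost identical to those in the Transformers tutorial. Refer back to it for more details.

The Transformer encoder and decoder.

The model will be implemented in three main parts:

Input - The token embedding and positional encoding (SeqEmbedding).
Decoder - A stack of Transformer decoder layers (DecoderLayer), where each contains:
1. A causal self-attention layer (CausalSelfAttention), where each output location can attend to the output so far.
2. A cross-attention layer (CrossAttention), where each output location can attend to the input image.
3. A feed-forward network (FeedForward) layer, which further processes each output location independently.
Output - A multiclass classification over the output vocabulary.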
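
To illustrate what "attend to the output so far" means for the causal self-attention layer, here is a minimal plain-Python sketch of a causal mask applied to attention scores. The function names and the uniform toy scores are assumptions made for illustration; inside Keras the equivalent masking is requested with use_causal_mask=True.

```python
import math

def causal_mask(seq_len):
    # mask[i][j] is True when query position i may attend to key position j.
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def masked_softmax(scores, mask):
    # Softmax over each row, giving zero weight to masked-out positions.
    rows = []
    for score_row, mask_row in zip(scores, mask):
        exps = [math.exp(s) if keep else 0.0
                for s, keep in zip(score_row, mask_row)]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

scores = [[0.0] * 4 for _ in range(4)]    # uniform toy attention scores
weights = masked_softmax(scores, causal_mask(4))

for row in weights:
    print([round(w, 2) for w in row])
```

Each row still sums to 1, but position i spreads its attention only over positions 0..i, so no token can look at tokens that come after it.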

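Input

The input text has already been split up into tokens and converted to sequences of IDs.

Remember that, unlike a CNN or RNN, the Transformer's attention layers are invariant to the order of the sequence. Without some positional input, the model just sees an unordered set, not a sequence. So in addition to a simple vector embedding for each token ID, the embedding layer will also include an embedding for each position in the sequence.

The SeqEmbedding layer defined below:

It looks up the embedding vector for each token.
It looks up an embedding vector for each sequence location.
It adds the two together.
It uses mask_zero=True to initialize the keras masks for the model.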
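
The following plain-Python sketch shows why the position embedding matters. All vectors here are made-up toy values, not anything learned by the model: without positions, two sentences with the same words in a different order produce the same set of vectors, while adding a per-position vector makes them distinguishable.

```python
token_emb = {'dog': [1.0, 0.0], 'bites': [0.0, 1.0], 'man': [1.0, 1.0]}
pos_emb = [[0.1, 0.0], [0.0, 0.2], [0.3, 0.3]]   # one toy vector per position

def embed(words, use_positions):
    out = []
    for pos, word in enumerate(words):
        vec = list(token_emb[word])
        if use_positions:
            # Add the position vector to the token vector, element by element.
            vec = [t + p for t, p in zip(vec, pos_emb[pos])]
        out.append(vec)
    return out

a = ['dog', 'bites', 'man']
b = ['man', 'bites', 'dog']

# Sorting discards the order, leaving only the set of vectors.
same_without = sorted(embed(a, False)) == sorted(embed(b, False))
same_with = sorted(embed(a, True)) == sorted(embed(b, True))
print(same_without, same_with)
```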

class SeqEmbedding(tf.keras.layers.Layer):
  def __init__(self, vocab_size, max_length, depth):
    super().__init__()
    self.pos_embedding = tf.keras.layers.Embedding(input_dim=max_length, output_dim=depth)

    self.token_embedding = tf.keras.layers.Embedding(
        input_dim=vocab_size,
        output_dim=depth,
        mask_zero=True)

    self.add = tf.keras.layers.Add()

  def call(self, seq):
    seq = self.token_embedding(seq) # (batch, seq, depth)

    x = tf.range(tf.shape(seq)[1])  # (seq)
    x = x[tf.newaxis, :]  # (1, seq)
    x = self.pos_embedding(x)  # (1, seq, depth)

    return self.add([seq,x])

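Decoder

The decoder is a standard Transformer decoder. It contains a stack of DecoderLayers, where each contains three sublayers: a CausalSelfAttention, a CrossAttention, and a FeedForward. The implementations are almost identical to the Transformer tutorial; refer to it for more details.

The CausalSelfAttention layer is below: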

class CausalSelfAttention(tf.keras.layers.Layer):
  def __init__(self, **kwargs):
    super().__init__()
    self.mha = tf.keras.layers.MultiHeadAttention(**kwargs)
    # Use Add instead of + so the keras mask propagates through.
    self.add = tf.keras.layers.Add() 
    self.layernorm = tf.keras.layers.LayerNormalization()

  def call(self, x):
    attn = self.mha(query=x, value=x,
                    use_causal_mask=True)
    x = self.add([x, attn])
    return self.layernorm(x)

The CrossAttention layer is below. Note the use of return_attention_scores.

class CrossAttention(tf.keras.layers.Layer):
  def __init__(self,**kwargs):
    super().__init__()
    self.mha = tf.keras.layers.MultiHeadAttention(**kwargs)
    self.add = tf.keras.layers.Add() 
    self.layernorm = tf.keras.layers.LayerNormalization()

  def call(self, x, y, **kwargs):
    attn, attention_scores = self.mha(
             query=x, value=y,
             return_attention_scores=True)

    self.last_attention_scores = attention_scores

    x = self.add([x, attn])
    return self.layernorm(x)

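The FeedForward layer is below. Remember that a layers.Dense layer is applied to the last axis of the input. The input will have a shape of (batch, sequence, channels), so it automatically applies pointwise across the batch and sequence axes.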

class FeedForward(tf.keras.layers.Layer):
  def __init__(self, units, dropout_rate=0.1):
    super().__init__()
    self.seq = tf.keras.Sequential([
        tf.keras.layers.Dense(units=2*units, activation='relu'),
        tf.keras.layers.Dense(units=units),
        tf.keras.layers.Dropout(rate=dropout_rate),
    ])

    self.layernorm = tf.keras.layers.LayerNormalization()

  def call(self, x):
    x = x + self.seq(x)
    return self.layernorm(x)

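Next, arrange these three layers into a larger DecoderLayer. Each decoder layer applies the three smaller layers in sequence. After each sublayer the shape of out_seq is (batch, sequence, channels). The decoder layer also caches the cross-attention scores in last_attention_scores for later visualization.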

class DecoderLayer(tf.keras.layers.Layer):
  def __init__(self, units, num_heads=1, dropout_rate=0.1):
    super().__init__()

    self.self_attention = CausalSelfAttention(num_heads=num_heads,
                                              key_dim=units,
                                              dropout=dropout_rate)
    self.cross_attention = CrossAttention(num_heads=num_heads,
                                          key_dim=units,
                                          dropout=dropout_rate)
    self.ff = FeedForward(units=units, dropout_rate=dropout_rate)


  def call(self, inputs, training=False):
    in_seq, out_seq = inputs

    # Text input
    out_seq = self.self_attention(out_seq)

    out_seq = self.cross_attention(out_seq, in_seq)

    self.last_attention_scores = self.cross_attention.last_attention_scores

    out_seq = self.ff(out_seq)

    return out_seq

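Output

At minimum the output layer needs a layers.Dense layer to generate logit predictions for each token at each location.

But there are a few other features you can add to make this work a little better:

Handle bad tokens: The model will be generating text. It should never generate a pad, unknown, or start token ('', '[UNK]', '[START]'). So set the bias for these to a large negative value.

Note: You'll need to ignore these tokens in the loss function as well.
Smart initialization: The default initialization of a dense layer gives a model that initially predicts each token with almost uniform likelihood. The actual token distribution is far from uniform. The optimal value for the initial bias of the output layer is the log of the probability of each token. So include an adapt method to count the tokens and set the optimal initial bias. This reduces the initial loss from the entropy of the uniform distribution (log(vocabulary_size)) to the marginal entropy of the token distribution (-sum(p*log(p))).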
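
A small worked example may help with the entropy claim. The token counts below are invented for illustration; the point is only that a softmax over log-probability biases reproduces the marginal token distribution, and that the resulting loss (the marginal entropy) is lower than the loss of a uniform guess.

```python
import math

# Hypothetical token counts from a tiny corpus.
counts = {'a': 50, 'dog': 20, 'surfer': 10, 'wave': 10, '[END]': 10}
vocab_size = len(counts)
total = sum(counts.values())

probs = {tok: c / total for tok, c in counts.items()}
bias = {tok: math.log(p) for tok, p in probs.items()}   # log-probability bias

# Expected loss when predicting every token with equal probability.
uniform_entropy = math.log(vocab_size)
# Expected loss when predicting each token with its marginal probability.
marginal_entropy = -sum(p * math.log(p) for p in probs.values())

print(f'Uniform entropy:  {uniform_entropy:0.2f}')
print(f'Marginal entropy: {marginal_entropy:0.2f}')
```

With the real vocabulary the gap is much larger, because a handful of tokens account for most of the caption text.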

class TokenOutput(tf.keras.layers.Layer):
  def __init__(self, tokenizer, banned_tokens=('', '[UNK]', '[START]'), **kwargs):
    super().__init__()

    self.dense = tf.keras.layers.Dense(
        units=tokenizer.vocabulary_size(), **kwargs)
    self.tokenizer = tokenizer
    self.banned_tokens = banned_tokens

    self.bias = None

  def adapt(self, ds):
    counts = collections.Counter()
    vocab_dict = {name: id 
                  for id, name in enumerate(self.tokenizer.get_vocabulary())}

    for tokens in tqdm.tqdm(ds):
      counts.update(tokens.numpy().flatten())

    counts_arr = np.zeros(shape=(self.tokenizer.vocabulary_size(),))
    counts_arr[np.array(list(counts.keys()), dtype=np.int32)] = list(counts.values())

    counts_arr = counts_arr[:]
    for token in self.banned_tokens:
      counts_arr[vocab_dict[token]] = 0

    total = counts_arr.sum()
    p = counts_arr/total
    p[counts_arr==0] = 1.0
    log_p = np.log(p)  # log(1) == 0

    entropy = -(log_p*p).sum()

    print()
    print(f"Uniform entropy: {np.log(self.tokenizer.vocabulary_size()):0.2f}")
    print(f"Marginal entropy: {entropy:0.2f}")

    self.bias = log_p
    self.bias[counts_arr==0] = -1e9

  def call(self, x):
    x = self.dense(x)
    # TODO(b/250038731): Fix this.
    # An Add layer doesn't work because of the different shapes.
    # This clears the mask, that's okay because it prevents keras from rescaling
    # the losses.
    return x + self.bias

The smart initialization will significantly reduce the initial loss:

output_layer = TokenOutput(tokenizer, banned_tokens=('', '[UNK]', '[START]'))
# This might run a little faster if the dataset didn't also have to load the image data.
output_layer.adapt(train_ds.map(lambda inputs, labels: labels))

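Build the model

To build the model, you need to combine several parts:

The image feature_extractor and the text tokenizer.
The seq_embedding layer, to convert batches of token IDs to vectors (batch, sequence, channels).
The stack of DecoderLayers that will process the text and image data.
The output_layer, which returns a pointwise prediction of what the next word should be.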

class Captioner(tf.keras.Model):
  @classmethod
  def add_method(cls, fun):
    setattr(cls, fun.__name__, fun)
    return fun

  def __init__(self, tokenizer, feature_extractor, output_layer, num_layers=1,
               units=256, max_length=50, num_heads=1, dropout_rate=0.1):
    super().__init__()
    self.feature_extractor = feature_extractor
    self.tokenizer = tokenizer
    self.word_to_index = tf.keras.layers.StringLookup(
        mask_token="",
        vocabulary=tokenizer.get_vocabulary())
    self.index_to_word = tf.keras.layers.StringLookup(
        mask_token="",
        vocabulary=tokenizer.get_vocabulary(),
        invert=True) 

    self.seq_embedding = SeqEmbedding(
        vocab_size=tokenizer.vocabulary_size(),
        depth=units,
        max_length=max_length)

    self.decoder_layers = [
        DecoderLayer(units, num_heads=num_heads, dropout_rate=dropout_rate)
        for n in range(num_layers)]

    self.output_layer = output_layer

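When you call the model for training, it receives an image, txt pair. To make this function more usable, be flexible about the input:

If the image has 3 channels, run it through the feature_extractor. Otherwise assume that it already has been.
Similarly, if the text has dtype tf.string, run it through the tokenizer.

After that, running the model takes only a few steps:

Flatten the extracted image features, so they can be input to the decoder layers.
Look up the token embeddings.
Run the stack of DecoderLayers on the image features and text embeddings.
Run the output layer to predict the next token at each position.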

@Captioner.add_method
def call(self, inputs):
  image, txt = inputs

  if image.shape[-1] == 3:
    # Apply the feature-extractor, if you get an RGB image.
    image = self.feature_extractor(image)

  # Flatten the feature map
  image = einops.rearrange(image, 'b h w c -> b (h w) c')

  if txt.dtype == tf.string:
    # Apply the model's own tokenizer if you get string inputs.
    txt = self.tokenizer(txt)

  txt = self.seq_embedding(txt)

  # Look at the image
  for dec_layer in self.decoder_layers:
    txt = dec_layer(inputs=(image, txt))

  txt = self.output_layer(txt)

  return txt

model = Captioner(tokenizer, feature_extractor=mobilenet, output_layer=output_layer,
                  units=256, dropout_rate=0.5, num_layers=2, num_heads=2)

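Generate captions

Before getting into training, write a bit of code to generate captions. You'll use this to see how training is progressing.

Start by downloading a test image: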

image_url = 'https://tensorflow.dev.org.tw/images/surf.jpg'
image_path = tf.keras.utils.get_file('surf.jpg', origin=image_url)
image = load_image(image_path)

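To caption an image with this model:

Extract the img_features.
Initialize the list of output tokens with a [START] token.
Pass img_features and tokens into the model.
- It returns a list of logits.
- Choose the next token based on those logits.
- Add it to the list of tokens, and continue the loop.
- If it generates an '[END]' token, break out of the loop.

So add a "simple" method to do just that: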

@Captioner.add_method
def simple_gen(self, image, temperature=1):
  initial = self.word_to_index([['[START]']]) # (batch, sequence)
  img_features = self.feature_extractor(image[tf.newaxis, ...])

  tokens = initial # (batch, sequence)
  for n in range(50):
    preds = self((img_features, tokens)).numpy()  # (batch, sequence, vocab)
    preds = preds[:,-1, :]  #(batch, vocab)
    if temperature==0:
        next = tf.argmax(preds, axis=-1)[:, tf.newaxis]  # (batch, 1)
    else:
        next = tf.random.categorical(preds/temperature, num_samples=1)  # (batch, 1)
    tokens = tf.concat([tokens, next], axis=1) # (batch, sequence) 

    if next[0] == self.word_to_index('[END]'):
      break
  words = self.index_to_word(tokens[0, 1:-1])
  result = tf.strings.reduce_join(words, axis=-1, separator=' ')
  return result.numpy().decode()
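
The control flow of that loop can be seen in isolation with a toy stand-in for the model. Everything in this sketch is hypothetical: toy_model is a hard-coded lookup table rather than a neural network, and it works on strings instead of token IDs. Only the loop structure (start token, pick next token, stop at '[END]') corresponds to simple_gen.

```python
def toy_model(tokens):
    # Hypothetical stand-in for the captioner: maps the last token
    # to scores over possible next tokens.
    table = {'[START]': {'a': 2.0, 'surfer': 0.5},
             'a': {'surfer': 2.0, '[END]': 0.1},
             'surfer': {'[END]': 3.0, 'a': 0.2}}
    return table[tokens[-1]]

def greedy_generate(model, max_len=10):
    tokens = ['[START]']
    for _ in range(max_len):
        scores = model(tokens)
        next_token = max(scores, key=scores.get)   # greedy: take the argmax
        if next_token == '[END]':
            break
        tokens.append(next_token)
    return ' '.join(tokens[1:])   # drop the [START] marker

caption = greedy_generate(toy_model)
print(caption)
```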

Here are some generated captions for that image. The model is untrained, so they don't make much sense yet:

for t in (0.0, 0.5, 1.0):
  result = model.simple_gen(image, temperature=t)
  print(result)

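The temperature parameter allows you to interpolate between 3 modes:

Greedy decoding (temperature=0.0) - Chooses the most likely next token at each step.
Random sampling according to the logits (temperature=1.0).
Uniform random sampling (temperature >> 1.0).

Since the model is untrained, and it used the frequency-based initialization, the "greedy" output (first) usually contains only the most common tokens, such as 'a' and '[END]'.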
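
To see how temperature moves between those modes, here is a minimal plain-Python sketch that divides toy logits by the temperature before a softmax. The logits are made up, and softmax_with_temperature is a helper name introduced for this illustration. Note that temperature=0 cannot be handled by division, which is why simple_gen special-cases it with an argmax.

```python
import math

def softmax_with_temperature(logits, temperature):
    scaled = [x / temperature for x in logits]
    m = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.1)    # close to greedy
plain = softmax_with_temperature(logits, 1.0)    # sample from the logits as-is
flat = softmax_with_temperature(logits, 100.0)   # close to uniform

for name, probs in [('T=0.1', sharp), ('T=1', plain), ('T=100', flat)]:
    print(name, [round(p, 3) for p in probs])
```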

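Train

To train the model you'll need several additional components:

The loss and metrics
The optimizer
Optional callbacks

Losses and metrics

Here's an implementation of a masked loss and accuracy:

When calculating the mask for the loss, note the loss < 1e8 term. This term discards the artificial, impossibly high losses for the banned_tokens.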

def masked_loss(labels, preds):  
  loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels, preds)

  mask = (labels != 0) & (loss < 1e8) 
  mask = tf.cast(mask, loss.dtype)

  loss = loss*mask
  loss = tf.reduce_sum(loss)/tf.reduce_sum(mask)
  return loss

def masked_acc(labels, preds):
  mask = tf.cast(labels!=0, tf.float32)
  preds = tf.argmax(preds, axis=-1)
  labels = tf.cast(labels, tf.int64)
  match = tf.cast(preds == labels, mask.dtype)
  acc = tf.reduce_sum(match*mask)/tf.reduce_sum(mask)
  return acc
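
The effect of the padding mask can be checked by hand on a tiny example. This sketch is a plain-Python analogue, not the function above: it takes probabilities instead of logits, the values are invented, and toy_masked_loss is a hypothetical name. It averages the cross-entropy only over the positions whose label is not the padding ID 0.

```python
import math

def toy_masked_loss(labels, probs):
    # labels: token ids; probs: one probability distribution per position.
    total, count = 0.0, 0
    for label, dist in zip(labels, probs):
        if label == 0:            # 0 is the padding id, so skip it
            continue
        total += -math.log(dist[label])
        count += 1
    return total / count

labels = [2, 1, 0, 0]             # two real tokens followed by two pads
probs = [[0.1, 0.1, 0.8],
         [0.2, 0.5, 0.3],
         [0.9, 0.05, 0.05],
         [0.9, 0.05, 0.05]]

loss = toy_masked_loss(labels, probs)
print(f'{loss:0.4f}')
```

Without the mask, the two padded positions would pull the average toward whatever the model predicts for padding, which says nothing about caption quality.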

Callbacks

For feedback during training, set up a keras.callbacks.Callback to generate some captions for the surfer image at the end of each epoch.

class GenerateText(tf.keras.callbacks.Callback):
  def __init__(self):
    image_url = 'https://tensorflow.dev.org.tw/images/surf.jpg'
    image_path = tf.keras.utils.get_file('surf.jpg', origin=image_url)
    self.image = load_image(image_path)

  def on_epoch_end(self, epochs=None, logs=None):
    print()
    print()
    for t in (0.0, 0.5, 1.0):
      result = self.model.simple_gen(self.image, temperature=t)
      print(result)
    print()

It generates three output strings, like the earlier example. As before, the first is "greedy", choosing the argmax of the logits at each step.

g = GenerateText()
g.model = model
g.on_epoch_end(0)

Also use callbacks.EarlyStopping to terminate training when the model starts to overfit.

callbacks = [
    GenerateText(),
    tf.keras.callbacks.EarlyStopping(
        patience=5, restore_best_weights=True)]

Train

Configure and execute the training.

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
           loss=masked_loss,
           metrics=[masked_acc])

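For more frequent reporting, use the Dataset.repeat() method, and set the steps_per_epoch and validation_steps arguments to Model.fit.

With this setup on Flickr8k, a full pass over the dataset is 900+ batches, but below the reporting epochs are 100 steps.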
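
The "900+ batches" figure can be reproduced with a quick calculation. The split size used here is an assumption (the Flickr8k training list is commonly reported as 6,000 images), and images that fail to load would lower the count slightly.

```python
import math

num_train_images = 6000      # assumed size of the Flickr8k training split
captions_per_image = 5
batch_size = 32              # default batch size used by load_dataset
steps_per_epoch = 100

pairs = num_train_images * captions_per_image
batches_per_pass = math.ceil(pairs / batch_size)
reports_per_pass = batches_per_pass / steps_per_epoch

print(pairs, batches_per_pass, round(reports_per_pass, 1))
```

So under these assumptions each 100-step reporting epoch covers roughly a tenth of the training data.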

history = model.fit(
    train_ds.repeat(),
    steps_per_epoch=100,
    validation_data=test_ds.repeat(),
    validation_steps=20,
    epochs=100,
    callbacks=callbacks)

Plot the loss and accuracy over the training run:

plt.plot(history.history['loss'], label='loss')
plt.plot(history.history['val_loss'], label='val_loss')
plt.ylim([0, max(plt.ylim())])
plt.xlabel('Epoch #')
plt.ylabel('CE/token')
plt.legend()

plt.plot(history.history['masked_acc'], label='accuracy')
plt.plot(history.history['val_masked_acc'], label='val_accuracy')
plt.ylim([0, max(plt.ylim())])
plt.xlabel('Epoch #')
plt.ylabel('Accuracy')
plt.legend()

Attention plots

Now, using the trained model, run that simple_gen method on the image:

result = model.simple_gen(image, temperature=0.0)
result

Split the output back into tokens:

str_tokens = result.split()
str_tokens.append('[END]')

The DecoderLayers each cache the attention scores for their CrossAttention layer. The shape of each attention map is (batch=1, heads, sequence, image):

attn_maps = [layer.last_attention_scores for layer in model.decoder_layers]
[map.shape for map in attn_maps]

So stack the maps along the batch axis, then average over the (batch, heads) axes, while splitting the image axis back into height, width:

attention_maps = tf.concat(attn_maps, axis=0)
attention_maps = einops.reduce(
    attention_maps,
    'batch heads sequence (height width) -> sequence height width',
    height=7, width=7,
    reduction='mean')

Now you have a single attention map for each sequence prediction. The values in each map should sum to 1.

einops.reduce(attention_maps, 'sequence height width -> sequence', reduction='sum')
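
The reduction can also be written out with nested lists, to show why the maps still sum to 1 after averaging. This is a sketch with invented numbers and a hypothetical helper name, using a 2x2 "image" instead of the 7x7 feature map: each head's weights sum to 1 over the image positions, so their mean does too.

```python
def mean_over_heads(attn, height, width):
    # attn[h][s] is a list of height*width weights for head h, position s.
    num_heads, seq_len = len(attn), len(attn[0])
    maps = []
    for s in range(seq_len):
        flat = [sum(attn[h][s][i] for h in range(num_heads)) / num_heads
                for i in range(height * width)]
        # Split the flat image axis back into rows of `width` values.
        maps.append([flat[r * width:(r + 1) * width] for r in range(height)])
    return maps

# Two heads, one sequence position, a 2x2 "image".
attn = [[[0.1, 0.2, 0.3, 0.4]],
        [[0.4, 0.3, 0.2, 0.1]]]

maps = mean_over_heads(attn, height=2, width=2)
map_total = sum(sum(row) for row in maps[0])
print(maps[0], map_total)
```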

So here is where the model was focusing attention while generating each token of the output:

def plot_attention_maps(image, str_tokens, attention_map):
    fig = plt.figure(figsize=(16, 9))

    len_result = len(str_tokens)

    titles = []
    for i in range(len_result):
      map = attention_map[i]
      grid_size = max(int(np.ceil(len_result/2)), 2)
      ax = fig.add_subplot(3, grid_size, i+1)
      titles.append(ax.set_title(str_tokens[i]))
      img = ax.imshow(image)
      ax.imshow(map, cmap='gray', alpha=0.6, extent=img.get_extent(),
                clim=[0.0, np.max(map)])

    plt.tight_layout()

plot_attention_maps(image/255, str_tokens, attention_maps)

Now put that together into a more usable function:

@Captioner.add_method
def run_and_show_attention(self, image, temperature=0.0):
  result_txt = self.simple_gen(image, temperature)
  str_tokens = result_txt.split()
  str_tokens.append('[END]')

  attention_maps = [layer.last_attention_scores for layer in self.decoder_layers]
  attention_maps = tf.concat(attention_maps, axis=0)
  attention_maps = einops.reduce(
      attention_maps,
      'batch heads sequence (height width) -> sequence height width',
      height=7, width=7,
      reduction='mean')

  plot_attention_maps(image/255, str_tokens, attention_maps)
  t = plt.suptitle(result_txt)
  t.set_y(1.05)

run_and_show_attention(model, image)

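Try it on your own images

For fun, below you're provided a method you can use to caption your own images with the model you've just trained. Keep in mind that it was trained on a relatively small amount of data, and your images may be different from the training data (so be prepared for strange results!)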

image_url = 'https://tensorflow.dev.org.tw/images/bedroom_hrnet_tutorial.jpg'
image_path = tf.keras.utils.get_file(origin=image_url)
image = load_image(image_path)

run_and_show_attention(model, image)