BERT Preprocessing with TF Text

Overview

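Text preprocessing is the end-to-end transformation of raw text into a model's integer inputs. NLP models are often accompanied by hundreds (if not thousands) of lines of Python code for preprocessing text. Text preprocessing is often a challenge for models because of:

Training-serving skew. It becomes increasingly difficult to ensure that the preprocessing logic for the model's inputs is consistent at all stages of model development (e.g. pretraining, fine-tuning, evaluation, inference). Using different hyperparameters, tokenization, or string preprocessing algorithms at different stages, or simply packaging model inputs inconsistently, can produce hard-to-debug and disastrous effects on the model.
Efficiency and flexibility. While preprocessing can be done offline (e.g. by writing processed outputs to files on disk and then reconsuming that preprocessed data in the input pipeline), this method incurs additional file read and write costs. Offline preprocessing is also inconvenient when preprocessing decisions need to happen dynamically, since experimenting with a different option requires regenerating the dataset.
Complex model interfaces. Text models are much easier to understand when their inputs are pure text. A model is hard to understand when its inputs require an extra, indirect encoding step. Reducing preprocessing complexity is especially valuable for model debugging, serving, and evaluation.

Additionally, simpler model interfaces make it more convenient to try the model (e.g. for inference or training) on different, unexplored datasets.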

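Text preprocessing with TF.Text

Using TF.Text's text preprocessing APIs, we can construct a preprocessing function that transforms a user's text dataset into the model's integer inputs. Users can package preprocessing directly as part of their model to alleviate the problems described above.

This tutorial shows how to use TF.Text preprocessing ops to transform text data into inputs for the BERT model and into inputs for the language masking pretraining task described in "Masked LM and Masking Procedure" of BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. The process involves tokenizing text into subword units, combining sentences, trimming content to a fixed size, and extracting labels for the masked language modeling task.

Setup

Let's first import the packages and libraries we need.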

pip install -q -U "tensorflow-text==2.11.*"

import tensorflow as tf
import tensorflow_text as text
import functools
print("TensorFlow version: ", tf.__version__)

TensorFlow version:  2.11.1

Our data contains two text features, from which we can create an example tf.data.Dataset. Our goal is to create a function that we can pass to Dataset.map() for use in training.

examples = {
    "text_a": [
      "Sponge bob Squarepants is an Avenger",
      "Marvel Avengers"
    ],
    "text_b": [
     "Barack Obama is the President.",
     "President is the highest office"
  ],
}

dataset = tf.data.Dataset.from_tensor_slices(examples)
next(iter(dataset))

{'text_a': <tf.Tensor: shape=(), dtype=string, numpy=b'Sponge bob Squarepants is an Avenger'>,
 'text_b': <tf.Tensor: shape=(), dtype=string, numpy=b'Barack Obama is the President.'>}

Tokenization

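Our first step is to run any string preprocessing and tokenize our dataset. This can be done using text.BertTokenizer, which is a text.Splitter that tokenizes sentences into subwords or wordpieces for the BERT model, given a vocabulary generated by the Wordpiece algorithm. You can learn more about the other subword tokenizers available in TF.Text in the TF.Text documentation.

The vocabulary can come from a previously generated BERT checkpoint, or you can generate one yourself from your own data. For the purposes of this example, let's create a toy vocabulary: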

_VOCAB = [
    # Special tokens
    b"[UNK]", b"[MASK]", b"[RANDOM]", b"[CLS]", b"[SEP]",
    # Suffix wordpieces (continuations of a word)
    b"##ack", b"##ama", b"##ger", b"##gers", b"##onge", b"##pants",  b"##uare",
    b"##vel", b"##ven",
    # Whole words and word-initial pieces
    b"an", b"A", b"Bar", b"Hates", b"Mar", b"Ob",
    b"Patrick", b"President", b"Sp", b"Sq", b"bob", b"box", b"has", b"highest",
    b"is", b"office", b"the",
]

_START_TOKEN = _VOCAB.index(b"[CLS]")
_END_TOKEN = _VOCAB.index(b"[SEP]")
_MASK_TOKEN = _VOCAB.index(b"[MASK]")
_RANDOM_TOKEN = _VOCAB.index(b"[RANDOM]")
_UNK_TOKEN = _VOCAB.index(b"[UNK]")
_MAX_SEQ_LEN = 8
_MAX_PREDICTIONS_PER_BATCH = 5

_VOCAB_SIZE = len(_VOCAB)

lookup_table = tf.lookup.StaticVocabularyTable(
    tf.lookup.KeyValueTensorInitializer(
        keys=_VOCAB,
        key_dtype=tf.string,
        values=tf.range(
            tf.size(_VOCAB, out_type=tf.int64), dtype=tf.int64),
        value_dtype=tf.int64,
    ),
    num_oov_buckets=1,
)
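
To make the ID assignment concrete, here is a minimal pure-Python sketch of how a static vocabulary table with out-of-vocabulary (OOV) buckets could map strings to IDs. It is an illustration only, not the TensorFlow implementation: the helper name lookup_id is hypothetical, a shortened vocabulary of str values stands in for _VOCAB, and zlib.crc32 is an assumed stand-in for TensorFlow's internal hash.

```python
import zlib

# Shortened stand-in for _VOCAB (str instead of bytes, for readability).
vocab = ["[UNK]", "[MASK]", "[RANDOM]", "[CLS]", "[SEP]", "##ack", "Bar"]
num_oov_buckets = 1

# In-vocabulary keys map to their position in the list.
token_to_id = {token: index for index, token in enumerate(vocab)}


def lookup_id(token):
    """Return the vocabulary index, or an OOV bucket ID for unknown keys."""
    if token in token_to_id:
        return token_to_id[token]
    # Unknown keys land in one of the buckets appended after the vocabulary.
    # Assumption: crc32 replaces the hash that TensorFlow uses internally.
    bucket = zlib.crc32(token.encode("utf-8")) % num_oov_buckets
    return len(vocab) + bucket


print(lookup_id("Bar"))      # 6
print(lookup_id("unknown"))  # 7, the single OOV bucket
```

With a single bucket, every unknown key receives the same ID, one past the last vocabulary index.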

Let's construct a text.BertTokenizer using the above vocabulary and tokenize the text inputs into a RaggedTensor.

bert_tokenizer = text.BertTokenizer(lookup_table, token_out_type=tf.string)
bert_tokenizer.tokenize(examples["text_a"])

<tf.RaggedTensor [[[b'Sp', b'##onge'], [b'bob'], [b'Sq', b'##uare', b'##pants'], [b'is'],
  [b'an'], [b'A', b'##ven', b'##ger']]                                  ,
 [[b'Mar', b'##vel'], [b'A', b'##ven', b'##gers']]]>

bert_tokenizer.tokenize(examples["text_b"])

<tf.RaggedTensor [[[b'Bar', b'##ack'], [b'Ob', b'##ama'], [b'is'], [b'the'], [b'President'],
  [b'[UNK]']]                                                              ,
 [[b'President'], [b'is'], [b'the'], [b'highest'], [b'office']]]>
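
The splits above are consistent with greedy longest-match-first wordpiece matching. The following is a minimal pure-Python sketch of that idea, not the text.BertTokenizer implementation. It assumes the text has already been split into words and punctuation, uses a shortened vocabulary, and the helper name wordpiece_split is hypothetical.

```python
def wordpiece_split(word, vocab, unknown_token="[UNK]", suffix_marker="##"):
    """Greedily split one word into the longest matching vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the suffix marker.
                candidate = suffix_marker + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches: the whole word becomes the unknown token.
            return [unknown_token]
        pieces.append(match)
        start = end
    return pieces


# Shortened stand-in for the toy vocabulary (str instead of bytes).
toy_vocab = {"Sp", "##onge", "bob", "Sq", "##uare", "##pants",
             "A", "##ven", "##ger", "##gers"}

print(wordpiece_split("Sponge", toy_vocab))       # ['Sp', '##onge']
print(wordpiece_split("Squarepants", toy_vocab))  # ['Sq', '##uare', '##pants']
print(wordpiece_split("Avengers", toy_vocab))     # ['A', '##ven', '##gers']
print(wordpiece_split(".", toy_vocab))            # ['[UNK]']
```

The last call mirrors the trailing period in "Barack Obama is the President.", which is not in the toy vocabulary and therefore appears as [UNK] in the output above.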

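The text output from text.BertTokenizer lets us see how the text is being tokenized, but the model requires integer IDs. We can set the token_out_type parameter to tf.int64 to obtain integer IDs (which are the indices into the vocabulary).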

bert_tokenizer = text.BertTokenizer(lookup_table, token_out_type=tf.int64)
segment_a = bert_tokenizer.tokenize(examples["text_a"])
segment_a

<tf.RaggedTensor [[[22, 9], [24], [23, 11, 10], [28], [14], [15, 13, 7]],
 [[18, 12], [15, 13, 8]]]>

segment_b = bert_tokenizer.tokenize(examples["text_b"])
segment_b

<tf.RaggedTensor [[[16, 5], [19, 6], [28], [30], [21], [0]], [[21], [28], [30], [27], [29]]]>

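text.BertTokenizer returns a RaggedTensor with shape [batch, num_tokens, num_wordpieces]. Because we don't need the extra num_tokens dimension for our current use case, we can merge the last two dimensions to obtain a RaggedTensor with shape [batch, num_wordpieces]: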

segment_a = segment_a.merge_dims(-2, -1)
segment_a

<tf.RaggedTensor [[22, 9, 24, 23, 11, 10, 28, 14, 15, 13, 7], [18, 12, 15, 13, 8]]>

segment_b = segment_b.merge_dims(-2, -1)
segment_b

<tf.RaggedTensor [[16, 5, 19, 6, 28, 30, 21, 0], [21, 28, 30, 27, 29]]>
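
As a rough analogy for what merging the last two dimensions does, the sketch below flattens nested Python lists that stand in for the RaggedTensor. The helper name flatten_last_two_dims is hypothetical, and plain lists are used only for illustration.

```python
# Nested lists standing in for the [batch, num_tokens, num_wordpieces] RaggedTensor.
segment_a_nested = [
    [[22, 9], [24], [23, 11, 10], [28], [14], [15, 13, 7]],
    [[18, 12], [15, 13, 8]],
]


def flatten_last_two_dims(batch):
    """Concatenate the wordpieces of every word within each sentence."""
    return [[piece for word in sentence for piece in word] for sentence in batch]


print(flatten_last_two_dims(segment_a_nested))
# [[22, 9, 24, 23, 11, 10, 28, 14, 15, 13, 7], [18, 12, 15, 13, 8]]
```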

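Content Trimming

The main input to BERT is a concatenation of two sentences. However, BERT requires inputs of a fixed size and shape, and our content may exceed that budget.

We can tackle this by using a text.Trimmer to trim our content down to a predetermined size (measured once the segments are concatenated along the last axis). There are different text.Trimmer types, which use different algorithms to select the content to preserve. text.RoundRobinTrimmer, for example, allocates quota equally to each segment but may trim the ends of sentences. text.WaterfallTrimmer trims starting from the end of the last sentence.

For our example, we will use RoundRobinTrimmer, which selects items from each segment in a left-to-right manner.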

trimmer = text.RoundRobinTrimmer(max_seq_length=_MAX_SEQ_LEN)
trimmed = trimmer.trim([segment_a, segment_b])
trimmed

[<tf.RaggedTensor [[22, 9, 24, 23],
  [18, 12, 15, 13]]>,
 <tf.RaggedTensor [[16, 5, 19, 6],
  [21, 28, 30, 27]]>]

trimmed now contains the segments, where each batch row holds 8 elements in total across both segments (when concatenated along axis=-1).
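
The round-robin allocation can be sketched in pure Python for a single batch row. This is a simplified illustration under the assumption that the budget is handed out one item at a time to each segment in turn. The helper name round_robin_trim is hypothetical, and the code is not the text.RoundRobinTrimmer implementation.

```python
def round_robin_trim(segments, max_seq_length):
    """Trim the segments of one batch row so that their total length fits the budget."""
    kept = [0] * len(segments)
    budget = max_seq_length
    while budget > 0:
        progressed = False
        # Hand out one slot per segment per round, left to right.
        for index, segment in enumerate(segments):
            if budget == 0:
                break
            if kept[index] < len(segment):
                kept[index] += 1
                budget -= 1
                progressed = True
        if not progressed:
            # Every segment is already fully kept.
            break
    return [segment[:count] for segment, count in zip(segments, kept)]


row_a = [22, 9, 24, 23, 11, 10, 28, 14, 15, 13, 7]
row_b = [16, 5, 19, 6, 28, 30, 21, 0]
print(round_robin_trim([row_a, row_b], max_seq_length=8))
# [[22, 9, 24, 23], [16, 5, 19, 6]]
```

When one segment is shorter than its share, the unused quota in this sketch flows to the remaining segments in later rounds.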

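Combining segments

Now that we have trimmed the segments, we can combine them into a single RaggedTensor. BERT uses special tokens to indicate the beginning ([CLS]) and the end ([SEP]) of a segment. We also need a RaggedTensor indicating which items in the combined Tensor belong to which segment. We can use text.combine_segments() to get both of these Tensors with the special tokens inserted.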

segments_combined, segments_ids = text.combine_segments(
  trimmed,
  start_of_sequence_id=_START_TOKEN, end_of_segment_id=_END_TOKEN)
segments_combined, segments_ids

(<tf.RaggedTensor [[3, 22, 9, 24, 23, 4, 16, 5, 19, 6, 4],
  [3, 18, 12, 15, 13, 4, 21, 28, 30, 27, 4]]>,
 <tf.RaggedTensor [[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
  [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]]>)
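
The layout of the combined output can be reproduced for one batch row with a short pure-Python sketch. The helper name combine_row is hypothetical and the code only mirrors the pattern visible in the output above. It is not the text.combine_segments() implementation.

```python
def combine_row(segments, start_id, end_id):
    """Join the segments of one row as [start] seg_0 [end] seg_1 [end] ..."""
    combined = [start_id]
    segment_ids = [0]  # the start token is counted as part of the first segment
    for index, segment in enumerate(segments):
        combined.extend(segment)
        combined.append(end_id)
        # Each segment's items and its closing end token share one segment ID.
        segment_ids.extend([index] * (len(segment) + 1))
    return combined, segment_ids


combined, segment_ids = combine_row([[22, 9, 24, 23], [16, 5, 19, 6]],
                                    start_id=3, end_id=4)
print(combined)     # [3, 22, 9, 24, 23, 4, 16, 5, 19, 6, 4]
print(segment_ids)  # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```

Note that the three special tokens bring each row to 11 items even though the trimmed segments held 8.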

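Masked Language Model Task

Now that we have our basic inputs, we can begin to extract the inputs needed for the "Masked LM and Masking Procedure" task described in BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.

The masked language model task has two sub-problems for us to think about: (1) which items to select for masking and (2) which values to assign to them.

Item Selection

Because we want to select items for masking at random, we will use a text.RandomItemSelector. RandomItemSelector randomly selects items in a batch subject to the given restrictions (max_selections_per_batch, selection_rate and unselectable_ids) and returns a boolean mask indicating which items were selected.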

random_selector = text.RandomItemSelector(
    max_selections_per_batch=_MAX_PREDICTIONS_PER_BATCH,
    selection_rate=0.2,
    unselectable_ids=[_START_TOKEN, _END_TOKEN, _UNK_TOKEN]
)
selected = random_selector.get_selection_mask(
    segments_combined, axis=1)
selected

<tf.RaggedTensor [[False, False, False, True, False, False, True, False, False, False,
  False],
 [False, False, False, True, False, False, False, True, False, False,
  False]]>
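
The selection constraints can be illustrated with a pure-Python sketch for one row. This is not the text.RandomItemSelector implementation. The helper name random_selection_mask is hypothetical, and rounding the selection count up with math.ceil is an assumption chosen because it is consistent with the two selections per row shown above.

```python
import math
import random


def random_selection_mask(row, selection_rate, max_selections,
                          unselectable_ids, rng):
    """Return a boolean mask marking randomly selected, selectable items."""
    selectable = [position for position, token in enumerate(row)
                  if token not in unselectable_ids]
    # Assumption: the number of selections is rounded up, then capped.
    wanted = math.ceil(len(selectable) * selection_rate)
    count = min(wanted, max_selections, len(selectable))
    chosen = set(rng.sample(selectable, count))
    return [position in chosen for position in range(len(row))]


row = [3, 22, 9, 24, 23, 4, 16, 5, 19, 6, 4]
mask = random_selection_mask(
    row,
    selection_rate=0.2,
    max_selections=5,
    unselectable_ids={3, 4, 0},  # [CLS], [SEP], [UNK]
    rng=random.Random(0),
)
print(mask)
```

The special tokens at positions 0, 5 and 10 can never be selected, so only the 8 ordinary wordpieces compete for the 2 selections.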

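Choosing the Masked Value

The methodology described in the original BERT paper for choosing the masked value is as follows:

For mask_token_rate of the time, replace the item with the [MASK] token:

"my dog is hairy" -> "my dog is [MASK]"

For random_token_rate of the time, replace the item with a random word:

"my dog is hairy" -> "my dog is apple"

For 1 - mask_token_rate - random_token_rate of the time, keep the item unchanged:

"my dog is hairy" -> "my dog is hairy"

text.MaskValuesChooser encapsulates this logic and can be used in our preprocessing function. Here's an example of what MaskValuesChooser returns given a mask_token_rate of 80% and the default random_token_rate: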

mask_values_chooser = text.MaskValuesChooser(_VOCAB_SIZE, _MASK_TOKEN, 0.8)
mask_values_chooser.get_mask_values(segments_combined)

<tf.RaggedTensor [[1, 1, 1, 1, 1, 1, 25, 1, 19, 1, 1],
 [1, 1, 21, 1, 1, 1, 16, 1, 1, 1, 20]]>

When supplied with a RaggedTensor input, text.MaskValuesChooser returns a RaggedTensor of the same shape containing either _MASK_TOKEN (1), a random ID, or the same unchanged ID.
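
The three-way choice can be sketched in pure Python for a single item. This is a simplified illustration, not the text.MaskValuesChooser implementation. The helper name choose_mask_value is hypothetical, and the random_token_rate of 0.1 used in the example is an assumption that follows the 80/10/10 split from the BERT paper.

```python
import random


def choose_mask_value(token_id, vocab_size, mask_token, mask_token_rate,
                      random_token_rate, rng):
    """Pick the replacement value for one item according to the masking rates."""
    draw = rng.random()  # uniform in [0, 1)
    if draw < mask_token_rate:
        return mask_token
    if draw < mask_token_rate + random_token_rate:
        # Any vocabulary ID may be drawn, including special tokens.
        return rng.randrange(vocab_size)
    return token_id


rng = random.Random(0)
row = [3, 22, 9, 24, 23, 4, 16, 5, 19, 6, 4]
values = [
    choose_mask_value(token, vocab_size=31, mask_token=1,
                      mask_token_rate=0.8, random_token_rate=0.1, rng=rng)
    for token in row
]
print(values)
```

As in the output above, a value is produced for every position. Which of those values are actually used depends on the selection mask.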

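Generating Inputs for Masked Language Model Task

Now that we have a RandomItemSelector to help us select items for masking and a text.MaskValuesChooser to assign the values, we can use text.mask_language_model() to assemble all the inputs of this task for our BERT model.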

masked_token_ids, masked_pos, masked_lm_ids = text.mask_language_model(
  segments_combined,
  item_selector=random_selector, mask_values_chooser=mask_values_chooser)

Let's dive deeper and examine the outputs of mask_language_model(). The output of masked_token_ids is:

masked_token_ids

<tf.RaggedTensor [[3, 22, 9, 24, 23, 4, 1, 5, 19, 6, 4],
 [3, 13, 12, 15, 13, 4, 21, 28, 30, 27, 4]]>

Remember that our input is encoded using a vocabulary. If we decode masked_token_ids using our vocabulary, we get:

tf.gather(_VOCAB, masked_token_ids)

<tf.RaggedTensor [[b'[CLS]', b'Sp', b'##onge', b'bob', b'Sq', b'[SEP]', b'[MASK]',
  b'##ack', b'Ob', b'##ama', b'[SEP]'],
 [b'[CLS]', b'##ven', b'##vel', b'A', b'##ven', b'[SEP]', b'President',
  b'is', b'the', b'highest', b'[SEP]']]>

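Notice that some wordpiece tokens have been replaced with [MASK], [RANDOM], or a different ID value. The masked_pos output gives us the indices (within the respective batch row) of the tokens that have been replaced.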

masked_pos

<tf.RaggedTensor [[2, 6],
 [1, 7]]>

masked_lm_ids gives us the original values of the tokens.

masked_lm_ids

<tf.RaggedTensor [[9, 16],
 [18, 28]]>

We can again decode the IDs here to get human-readable values.

tf.gather(_VOCAB, masked_lm_ids)

<tf.RaggedTensor [[b'##onge', b'Bar'],
 [b'Mar', b'is']]>
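
The relationship between the three outputs can be sketched in pure Python for one row, given a selection mask and the candidate mask values. The helper name mask_row is hypothetical and the code is an illustration of the bookkeeping, not the text.mask_language_model() implementation.

```python
def mask_row(token_ids, selected, mask_values):
    """Apply mask values at the selected positions and record the originals."""
    masked_token_ids = [
        mask_values[position] if selected[position] else token
        for position, token in enumerate(token_ids)
    ]
    masked_pos = [position for position, flag in enumerate(selected) if flag]
    masked_lm_ids = [token_ids[position] for position in masked_pos]
    return masked_token_ids, masked_pos, masked_lm_ids


token_ids = [3, 22, 9, 24, 23, 4, 16, 5, 19, 6, 4]
selected = [position in (2, 6) for position in range(len(token_ids))]
mask_values = [1] * len(token_ids)  # every candidate value is [MASK] here

masked_token_ids, masked_pos, masked_lm_ids = mask_row(
    token_ids, selected, mask_values)
print(masked_token_ids)  # [3, 22, 1, 24, 23, 4, 1, 5, 19, 6, 4]
print(masked_pos)        # [2, 6]
print(masked_lm_ids)     # [9, 16]
```

masked_lm_ids serves as the labels: it holds the values the model should predict at the positions listed in masked_pos.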

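Padding Model Inputs

Now that we have all the inputs for our model, the last step in our preprocessing is to package them into fixed 2-dimensional Tensors with padding, and to generate a mask Tensor indicating which values are padding. We can use text.pad_model_inputs() to help us with this task.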

# Prepare and pad combined segment inputs
input_word_ids, input_mask = text.pad_model_inputs(
  masked_token_ids, max_seq_length=_MAX_SEQ_LEN)
input_type_ids, _ = text.pad_model_inputs(
  segments_ids, max_seq_length=_MAX_SEQ_LEN)

# Prepare and pad masking task inputs
masked_lm_positions, masked_lm_weights = text.pad_model_inputs(
  masked_pos, max_seq_length=_MAX_PREDICTIONS_PER_BATCH)
masked_lm_ids, _ = text.pad_model_inputs(
  masked_lm_ids, max_seq_length=_MAX_PREDICTIONS_PER_BATCH)

model_inputs = {
    "input_word_ids": input_word_ids,
    "input_mask": input_mask,
    "input_type_ids": input_type_ids,
    "masked_lm_ids": masked_lm_ids,
    "masked_lm_positions": masked_lm_positions,
    "masked_lm_weights": masked_lm_weights,
}
model_inputs

{'input_word_ids': <tf.Tensor: shape=(2, 8), dtype=int64, numpy=
 array([[ 3, 22,  9, 24, 23,  4,  1,  5],
        [ 3, 13, 12, 15, 13,  4, 21, 28]])>,
 'input_mask': <tf.Tensor: shape=(2, 8), dtype=int64, numpy=
 array([[1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1]])>,
 'input_type_ids': <tf.Tensor: shape=(2, 8), dtype=int64, numpy=
 array([[0, 0, 0, 0, 0, 0, 1, 1],
        [0, 0, 0, 0, 0, 0, 1, 1]])>,
 'masked_lm_ids': <tf.Tensor: shape=(2, 5), dtype=int64, numpy=
 array([[ 9, 16,  0,  0,  0],
        [18, 28,  0,  0,  0]])>,
 'masked_lm_positions': <tf.Tensor: shape=(2, 5), dtype=int64, numpy=
 array([[2, 6, 0, 0, 0],
        [1, 7, 0, 0, 0]])>,
 'masked_lm_weights': <tf.Tensor: shape=(2, 5), dtype=int64, numpy=
 array([[1, 1, 0, 0, 0],
        [1, 1, 0, 0, 0]])>}
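
The padding step can be sketched in pure Python for one row. The helper name pad_row is hypothetical and the code only mirrors the behavior visible in the output above, where short rows are padded with 0 and the 11-item combined rows are cut to 8 columns. It is not the text.pad_model_inputs() implementation.

```python
def pad_row(row, max_seq_length, pad_value=0):
    """Truncate or pad one row to max_seq_length and build its mask."""
    kept = row[:max_seq_length]
    padding = max_seq_length - len(kept)
    padded = kept + [pad_value] * padding
    mask = [1] * len(kept) + [0] * padding  # 1 marks real values, 0 marks padding
    return padded, mask


print(pad_row([9, 16], max_seq_length=5))
# ([9, 16, 0, 0, 0], [1, 1, 0, 0, 0])
print(pad_row([3, 22, 9, 24, 23, 4, 1, 5, 19, 6, 4], max_seq_length=8))
# ([3, 22, 9, 24, 23, 4, 1, 5], [1, 1, 1, 1, 1, 1, 1, 1])
```

The second call shows why the trailing [SEP] is missing from input_word_ids above: the special tokens pushed the row past 8 items, so the end of the row was dropped.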

Review

Let's review what we have so far and assemble our preprocessing function. Here's what we have:

def bert_pretrain_preprocess(vocab_table, features):
  # Input is a string Tensor of documents, shape [batch, 1].
  text_a = features["text_a"]
  text_b = features["text_b"]

  # Tokenize segments to shape [num_sentences, (num_words)] each.
  tokenizer = text.BertTokenizer(
      vocab_table,
      token_out_type=tf.int64)
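  # The loop variable is named segment_text so that it does not shadow the
  # tensorflow_text module alias `text`.
  segments = [tokenizer.tokenize(segment_text).merge_dims(
      1, -1) for segment_text in (text_a, text_b)]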

  # Truncate inputs to a maximum length.
  trimmer = text.RoundRobinTrimmer(max_seq_length=6)
  trimmed_segments = trimmer.trim(segments)

  # Combine segments, get segment ids and add special tokens.
  segments_combined, segment_ids = text.combine_segments(
      trimmed_segments,
      start_of_sequence_id=_START_TOKEN,
      end_of_segment_id=_END_TOKEN)

  # Apply dynamic masking task.
  masked_input_ids, masked_lm_positions, masked_lm_ids = (
      text.mask_language_model(
        segments_combined,
        random_selector,
        mask_values_chooser,
      )
  )

  # Prepare and pad combined segment inputs
  input_word_ids, input_mask = text.pad_model_inputs(
    masked_input_ids, max_seq_length=_MAX_SEQ_LEN)
  input_type_ids, _ = text.pad_model_inputs(
    segment_ids, max_seq_length=_MAX_SEQ_LEN)

  # Prepare and pad masking task inputs
  masked_lm_positions, masked_lm_weights = text.pad_model_inputs(
    masked_lm_positions, max_seq_length=_MAX_PREDICTIONS_PER_BATCH)
  masked_lm_ids, _ = text.pad_model_inputs(
    masked_lm_ids, max_seq_length=_MAX_PREDICTIONS_PER_BATCH)

  model_inputs = {
      "input_word_ids": input_word_ids,
      "input_mask": input_mask,
      "input_type_ids": input_type_ids,
      "masked_lm_ids": masked_lm_ids,
      "masked_lm_positions": masked_lm_positions,
      "masked_lm_weights": masked_lm_weights,
  }
  return model_inputs

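We previously constructed a tf.data.Dataset, and we can now use our assembled preprocessing function bert_pretrain_preprocess() in Dataset.map(). This lets us create an input pipeline that transforms our raw string data into integer inputs and feeds them directly into our model.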

dataset = (
    tf.data.Dataset.from_tensors(examples)
    .map(functools.partial(bert_pretrain_preprocess, lookup_table))
)

next(iter(dataset))

{'input_word_ids': <tf.Tensor: shape=(2, 8), dtype=int64, numpy=
 array([[ 3, 22,  9, 24,  4,  1,  1, 19],
        [ 3, 18, 12, 15,  4, 21, 28, 30]])>,
 'input_mask': <tf.Tensor: shape=(2, 8), dtype=int64, numpy=
 array([[1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1]])>,
 'input_type_ids': <tf.Tensor: shape=(2, 8), dtype=int64, numpy=
 array([[0, 0, 0, 0, 0, 1, 1, 1],
        [0, 0, 0, 0, 0, 1, 1, 1]])>,
 'masked_lm_ids': <tf.Tensor: shape=(2, 5), dtype=int64, numpy=
 array([[16,  5,  0,  0,  0],
        [12, 15,  0,  0,  0]])>,
 'masked_lm_positions': <tf.Tensor: shape=(2, 5), dtype=int64, numpy=
 array([[5, 6, 0, 0, 0],
        [2, 3, 0, 0, 0]])>,
 'masked_lm_weights': <tf.Tensor: shape=(2, 5), dtype=int64, numpy=
 array([[1, 1, 0, 0, 0],
        [1, 1, 0, 0, 0]])>}