Tokenizing with TF Text


Overview

Tokenization is the process of breaking up a string into tokens. Commonly, these tokens are words, numbers, and/or punctuation. The tensorflow_text package provides a number of tokenizers for preprocessing the text required by your text-based models. By performing the tokenization in the TensorFlow graph, you will not need to worry about differences between the training and inference workflows or about managing preprocessing scripts.

This guide discusses the many tokenization options provided by TensorFlow Text, when you might want to use one option over another, and how these tokenizers are called from within your model.

Setup

pip install -q "tensorflow-text==2.11.*"
import requests
import tensorflow as tf
import tensorflow_text as tf_text

Splitter API

The main interfaces are Splitter and SplitterWithOffsets, which have the single methods split and split_with_offsets respectively. The SplitterWithOffsets variant (which extends Splitter) includes an option for getting byte offsets. This lets the caller know which bytes in the original string each created token came from.

Tokenizer and TokenizerWithOffsets are specialized versions of Splitter that provide the convenience methods tokenize and tokenize_with_offsets respectively.

Generally, for any N-dimensional input, the returned tokens are in an N+1-dimensional RaggedTensor, with the inner-most dimension of tokens mapping to the original individual strings.

class Splitter {
  @abstractmethod
  def split(self, input)
}

class SplitterWithOffsets(Splitter) {
  @abstractmethod
  def split_with_offsets(self, input)
}
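
To make that shape contract concrete, here is a minimal sketch (not part of the original guide) using WhitespaceTokenizer, which implements Splitter: a rank-1 batch of strings comes back as a rank-2 RaggedTensor with one row of tokens per input string.

splitter = tf_text.WhitespaceTokenizer()  # a Tokenizer, and therefore also a Splitter
tokens = splitter.split(["greatest of all time", "goat"])
print(tokens.shape)      # (2, None): one ragged token dimension was added
print(tokens.to_list())  # [[b'greatest', b'of', b'all', b'time'], [b'goat']]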

There is also a Detokenizer interface. Any tokenizer implementing this interface can accept an N-dimensional ragged tensor of tokens, and normally returns an N-1-dimensional tensor or ragged tensor that has the given tokens assembled together.

class Detokenizer {
  @abstractmethod
  def detokenize(self, input)
}

Tokenizer

Below is the suite of tokenizers provided by TensorFlow Text. String inputs are assumed to be UTF-8. Please review the Unicode guide for converting strings to UTF-8.
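
If your text arrives in a different encoding, it can be transcoded to UTF-8 before tokenizing. The following is a small sketch, not from this guide; the UTF-16-BE input is only an illustrative assumption.

utf16_docs = ["Everything not saved will be lost.".encode("UTF-16-BE")]
utf8_docs = tf.strings.unicode_transcode(utf16_docs,
                                         input_encoding="UTF-16-BE",
                                         output_encoding="UTF-8")
print(utf8_docs.numpy())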

Whole word tokenizers

These tokenizers attempt to split a string by words, and this is the most intuitive way to split text.

WhitespaceTokenizer

The text.WhitespaceTokenizer is the most basic tokenizer, which splits strings on ICU-defined whitespace characters (e.g. space, tab, newline). This is often good for quickly building out prototype models.

tokenizer = tf_text.WhitespaceTokenizer()
tokens = tokenizer.tokenize(["What you know you can't explain, but you feel it."])
print(tokens.to_list())
[[b'What', b'you', b'know', b'you', b"can't", b'explain,', b'but', b'you', b'feel', b'it.']]

You may notice a shortcoming of this tokenizer: punctuation is included with the word to make up a token. To split the words and punctuation into separate tokens, the UnicodeScriptTokenizer should be used.

UnicodeScriptTokenizer

The UnicodeScriptTokenizer splits strings based on Unicode script boundaries. The script codes used correspond to International Components for Unicode (ICU) UScriptCode values. See: http://icu-project.org/apiref/icu4c/uscript_8h.html

In practice, this is similar to the WhitespaceTokenizer, with the most apparent difference being that it splits punctuation (USCRIPT_COMMON) from language texts (e.g. USCRIPT_LATIN, USCRIPT_CYRILLIC, etc.) while also separating language texts from each other. Note that it will also split contractions into separate tokens.

tokenizer = tf_text.UnicodeScriptTokenizer()
tokens = tokenizer.tokenize(["What you know you can't explain, but you feel it."])
print(tokens.to_list())
[[b'What', b'you', b'know', b'you', b'can', b"'", b't', b'explain', b',', b'but', b'you', b'feel', b'it', b'.']]

Subword tokenizers

Subword tokenizers can be used with a smaller vocabulary, and allow the model to get some information about novel words from the subwords that make them up.

We briefly discuss the subword tokenization options below, but the Subword Tokenization tutorial goes more in depth and also explains how to generate the vocab files.
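
As a rough sketch of what vocab generation looks like, the snippet below assumes the bert_vocab_from_dataset helper shipped under tensorflow_text.tools; treat it as an outline and see the Subword Tokenization tutorial for the exact, current API.

from tensorflow_text.tools.wordpiece_vocab import bert_vocab_from_dataset as bert_vocab

# Tiny in-memory dataset used only for illustration.
dataset = tf.data.Dataset.from_tensor_slices(
    ["What you know you can't explain, but you feel it."])
vocab = bert_vocab.bert_vocab_from_dataset(
    dataset.batch(1000).prefetch(2),
    vocab_size=8000,
    reserved_tokens=["[PAD]", "[UNK]", "[START]", "[END]"],
    bert_tokenizer_params=dict(lower_case=True),
    learn_params={},
)
print(vocab[:10])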

WordpieceTokenizer

WordPiece tokenization is a data-driven tokenization scheme which generates a set of sub-tokens. These sub-tokens may correspond to linguistic morphemes, but this is often not the case.

The WordpieceTokenizer expects the input to already be split into tokens. Because of this prerequisite, you will often want to split using the WhitespaceTokenizer or UnicodeScriptTokenizer beforehand.

tokenizer = tf_text.WhitespaceTokenizer()
tokens = tokenizer.tokenize(["What you know you can't explain, but you feel it."])
print(tokens.to_list())
[[b'What', b'you', b'know', b'you', b"can't", b'explain,', b'but', b'you', b'feel', b'it.']]

After the string is split into tokens, the WordpieceTokenizer can be used to split it into subtokens.

url = "https://github.com/tensorflow/text/blob/master/tensorflow_text/python/ops/test_data/test_wp_en_vocab.txt?raw=true"
r = requests.get(url)
filepath = "vocab.txt"
open(filepath, 'wb').write(r.content)
52382
subtokenizer = tf_text.WordpieceTokenizer(filepath)
subtokens = subtokenizer.tokenize(tokens)
print(subtokens.to_list())

BertTokenizer

The BertTokenizer mirrors the original implementation of tokenization from the BERT paper. It is backed by the WordpieceTokenizer, but also performs additional tasks such as normalization and tokenizing to words first.

tokenizer = tf_text.BertTokenizer(filepath, token_out_type=tf.string, lower_case=True)
tokens = tokenizer.tokenize(["What you know you can't explain, but you feel it."])
print(tokens.to_list())
[[[b'what'], [b'you'], [b'know'], [b'you'], [b'can'], [b"'"], [b't'], [b'explain'], [b','], [b'but'], [b'you'], [b'feel'], [b'it'], [b'.']]]
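
When feeding a model, you typically want integer vocabulary ids rather than token strings. As a small sketch (reusing the vocab file downloaded above, and assuming the default token_out_type of tf.int64), simply drop the token_out_type=tf.string argument:

id_tokenizer = tf_text.BertTokenizer(filepath, lower_case=True)
ids = id_tokenizer.tokenize(["What you know you can't explain, but you feel it."])
print(ids.to_list())  # nested lists of int64 vocab ids instead of byte strings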

SentencepieceTokenizer

The SentencepieceTokenizer is a highly configurable sub-token tokenizer. It is backed by the Sentencepiece library. Like the BertTokenizer, it can include normalization and token splitting before splitting into sub-tokens.

url = "https://github.com/tensorflow/text/blob/master/tensorflow_text/python/ops/test_data/test_oss_model.model?raw=true"
sp_model = requests.get(url).content
tokenizer = tf_text.SentencepieceTokenizer(sp_model, out_type=tf.string)
tokens = tokenizer.tokenize(["What you know you can't explain, but you feel it."])
print(tokens.to_list())
[[b'\xe2\x96\x81What', b'\xe2\x96\x81you', b'\xe2\x96\x81know', b'\xe2\x96\x81you', b'\xe2\x96\x81can', b"'", b't', b'\xe2\x96\x81explain', b',', b'\xe2\x96\x81but', b'\xe2\x96\x81you', b'\xe2\x96\x81feel', b'\xe2\x96\x81it', b'.']]

Other splitters

UnicodeCharTokenizer

This splits a string into UTF-8 characters. It is useful for CJK languages that do not have spaces between words.

tokenizer = tf_text.UnicodeCharTokenizer()
tokens = tokenizer.tokenize(["What you know you can't explain, but you feel it."])
print(tokens.to_list())
[[87, 104, 97, 116, 32, 121, 111, 117, 32, 107, 110, 111, 119, 32, 121, 111, 117, 32, 99, 97, 110, 39, 116, 32, 101, 120, 112, 108, 97, 105, 110, 44, 32, 98, 117, 116, 32, 121, 111, 117, 32, 102, 101, 101, 108, 32, 105, 116, 46]]

The output is Unicode codepoints. This can also be useful for creating character ngrams, such as bigrams. To convert back into UTF-8 characters:

characters = tf.strings.unicode_encode(tf.expand_dims(tokens, -1), "UTF-8")
bigrams = tf_text.ngrams(characters, 2, reduction_type=tf_text.Reduction.STRING_JOIN, string_separator='')
print(bigrams.to_list())
[[b'Wh', b'ha', b'at', b't ', b' y', b'yo', b'ou', b'u ', b' k', b'kn', b'no', b'ow', b'w ', b' y', b'yo', b'ou', b'u ', b' c', b'ca', b'an', b"n'", b"'t", b't ', b' e', b'ex', b'xp', b'pl', b'la', b'ai', b'in', b'n,', b', ', b' b', b'bu', b'ut', b't ', b' y', b'yo', b'ou', b'u ', b' f', b'fe', b'ee', b'el', b'l ', b' i', b'it', b't.']]

HubModuleTokenizer

This is a wrapper around models deployed to TF Hub to make the calls easier, since TF Hub currently does not support ragged tensors. Having a model perform tokenization is particularly useful for CJK languages when you want to split into words but do not have spaces to provide a heuristic guide. At this time, we have a single segmentation model for Chinese.

MODEL_HANDLE = "https://tfhub.dev/google/zh_segmentation/1"
segmenter = tf_text.HubModuleTokenizer(MODEL_HANDLE)
tokens = segmenter.tokenize(["新华社北京"])
print(tokens.to_list())
[[b'\xe6\x96\xb0\xe5\x8d\x8e\xe7\xa4\xbe', b'\xe5\x8c\x97\xe4\xba\xac']]

It may be difficult to view the results as UTF-8 encoded byte strings. Decode the list values to make viewing easier.

def decode_list(x):
  if type(x) is list:
    return list(map(decode_list, x))
  return x.decode("UTF-8")

def decode_utf8_tensor(x):
  return list(map(decode_list, x.to_list()))

print(decode_utf8_tensor(tokens))
[['新华社', '北京']]

SplitMergeTokenizer

The SplitMergeTokenizer and SplitMergeFromLogitsTokenizer have the targeted purpose of splitting a string based on provided values that indicate where the string should be split. This is useful when building your own segmentation models, like the previous segmentation example.

For the SplitMergeTokenizer, a value of 0 is used to indicate the start of a new string, and a value of 1 indicates the character is part of the current string.

strings = ["新华社北京"]
labels = [[0, 1, 1, 0, 1]]
tokenizer = tf_text.SplitMergeTokenizer()
tokens = tokenizer.tokenize(strings, labels)
print(decode_utf8_tensor(tokens))
[['新华社', '北京']]

The SplitMergeFromLogitsTokenizer is similar, but it instead accepts logit value pairs from a neural network that predict whether each character should be split into a new string or merged into the current one.

strings = ["新华社北京"]
labels = [[[5.0, -3.2], [0.2, 12.0], [0.0, 11.0], [2.2, -1.0], [-3.0, 3.0]]]
tokenizer = tf_text.SplitMergeFromLogitsTokenizer()
tokens = tokenizer.tokenize(strings, labels)
print(decode_utf8_tensor(tokens))
[['新华社', '北京']]

RegexSplitter

The RegexSplitter is able to segment strings at arbitrary breakpoints defined by a provided regular expression.

splitter = tf_text.RegexSplitter(r"\s")
tokens = splitter.split(["What you know you can't explain, but you feel it."])
print(tokens.to_list())
[[b'What', b'you', b'know', b'you', b"can't", b'explain,', b'but', b'you', b'feel', b'it.']]

Offsets

When tokenizing strings, it is often desirable to know where in the original string the token originated from. For this reason, each tokenizer which implements TokenizerWithOffsets has a tokenize_with_offsets method that returns the byte offsets along with the tokens. start_offsets lists the byte in the original string at which each token starts, and end_offsets lists the byte immediately after the point where each token ends. In other words, the start offsets are inclusive and the end offsets are exclusive.

tokenizer = tf_text.UnicodeScriptTokenizer()
(tokens, start_offsets, end_offsets) = tokenizer.tokenize_with_offsets(['Everything not saved will be lost.'])
print(tokens.to_list())
print(start_offsets.to_list())
print(end_offsets.to_list())
[[b'Everything', b'not', b'saved', b'will', b'be', b'lost', b'.']]
[[0, 11, 15, 21, 26, 29, 33]]
[[10, 14, 20, 25, 28, 33, 34]]
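
As a small sketch (not part of the original guide), the offsets can be used to slice the tokens back out of the original byte string, which also illustrates the inclusive-start, exclusive-end convention:

sentence = b'Everything not saved will be lost.'
for start, end in zip(start_offsets.to_list()[0], end_offsets.to_list()[0]):
  print(sentence[start:end])  # prints each token in turn, e.g. b'Everything', b'not', ...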

Detokenization

Tokenizers which implement the Detokenizer provide a detokenize method which attempts to combine the strings. This has the chance of being lossy, so the detokenized string may not always match the original, pre-tokenized string exactly.

tokenizer = tf_text.UnicodeCharTokenizer()
tokens = tokenizer.tokenize(["What you know you can't explain, but you feel it."])
print(tokens.to_list())
strings = tokenizer.detokenize(tokens)
print(strings.numpy())
[[87, 104, 97, 116, 32, 121, 111, 117, 32, 107, 110, 111, 119, 32, 121, 111, 117, 32, 99, 97, 110, 39, 116, 32, 101, 120, 112, 108, 97, 105, 110, 44, 32, 98, 117, 116, 32, 121, 111, 117, 32, 102, 101, 101, 108, 32, 105, 116, 46]]
[b"What you know you can't explain, but you feel it."]

TF Data

TF Data is a powerful API for creating an input pipeline for training models. Tokenizers work as expected with the API.

docs = tf.data.Dataset.from_tensor_slices([['Never tell me the odds.'], ["It's a trap!"]])
tokenizer = tf_text.WhitespaceTokenizer()
tokenized_docs = docs.map(lambda x: tokenizer.tokenize(x))
iterator = iter(tokenized_docs)
print(next(iterator).to_list())
print(next(iterator).to_list())
[[b'Never', b'tell', b'me', b'the', b'odds.']]
[[b"It's", b'a', b'trap!']]