Overview
Tokenization is the process of breaking up a string into tokens. Commonly, these tokens are words, numbers, and/or punctuation. The tensorflow_text package provides a number of tokenizers for preprocessing the text required by your text-based models. By performing the tokenization in the TensorFlow graph, you will not need to worry about differences between the training and inference workflows or about managing preprocessing scripts.
This guide discusses the many tokenization options provided by TensorFlow Text, when you might want to use one option over another, and how these tokenizers are called from within your model.
Setup
pip install -q "tensorflow-text==2.11.*"
import requests
import tensorflow as tf
import tensorflow_text as tf_text
Splitter API
The main interfaces are Splitter and SplitterWithOffsets, which have the single methods split and split_with_offsets respectively. The SplitterWithOffsets variant (which extends Splitter) includes an option for getting byte offsets. This allows the caller to know which bytes in the original string the created token came from.
Tokenizer and TokenizerWithOffsets are specialized versions of Splitter that provide the convenience methods tokenize and tokenize_with_offsets respectively.
Generally, for any N-dimensional input, the returned tokens are in an N+1-dimensional RaggedTensor, with the innermost dimension of tokens mapping to the original individual strings.
class Splitter:
  @abstractmethod
  def split(self, input):
    ...

class SplitterWithOffsets(Splitter):
  @abstractmethod
  def split_with_offsets(self, input):
    ...
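For example, here is a minimal sketch (using the WhitespaceTokenizer covered below) of that shape contract: a rank-1 batch of strings yields a rank-2 RaggedTensor of tokens.
tokenizer = tf_text.WhitespaceTokenizer()
batch = ["Never tell me the odds.", "It's a trap!"]  # rank-1 input with shape [2]
tokens = tokenizer.tokenize(batch)                   # rank-2 ragged output
print(tokens.shape.rank)  # 2: one extra, ragged token dimension
print(tokens.to_list())   # one inner list of tokens per input string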
In addition, there is a Detokenizer interface. Any tokenizer implementing this interface can accept an N-dimensional ragged tensor of tokens, and normally returns an N-1-dimensional tensor or ragged tensor that has the given tokens assembled together.
class Detokenizer:
  @abstractmethod
  def detokenize(self, input):
    ...
Tokenizers
Below is the suite of tokenizers provided by TensorFlow Text. String inputs are assumed to be UTF-8. Please review the Unicode guide for converting strings to UTF-8.
Whole word tokenizers
These tokenizers attempt to split a string by words, which is the most intuitive way to split text.
WhitespaceTokenizer
text.WhitespaceTokenizer is the most basic tokenizer, splitting strings on ICU-defined whitespace characters (for example, space, tab, newline). This is often good for quickly building out prototype models.
tokenizer = tf_text.WhitespaceTokenizer()
tokens = tokenizer.tokenize(["What you know you can't explain, but you feel it."])
print(tokens.to_list())
[[b'What', b'you', b'know', b'you', b"can't", b'explain,', b'but', b'you', b'feel', b'it.']]
You may notice a shortcoming of this tokenizer: punctuation is included with the word to make up a token. To split the words and punctuation into separate tokens, the UnicodeScriptTokenizer should be used.
UnicodeScriptTokenizer
The UnicodeScriptTokenizer splits strings based on Unicode script boundaries. The script codes used correspond to International Components for Unicode (ICU) UScriptCode values. See: http://icu-project.org/apiref/icu4c/uscript_8h.html
In practice, this is similar to the WhitespaceTokenizer, with the most apparent difference being that it will split punctuation (USCRIPT_COMMON) from language texts (for example, USCRIPT_LATIN, USCRIPT_CYRILLIC, etc.) while also separating language texts from each other. Note that this will also split contraction words into separate tokens.
tokenizer = tf_text.UnicodeScriptTokenizer()
tokens = tokenizer.tokenize(["What you know you can't explain, but you feel it."])
print(tokens.to_list())
[[b'What', b'you', b'know', b'you', b'can', b"'", b't', b'explain', b',', b'but', b'you', b'feel', b'it', b'.']]
Subword tokenizers
Subword tokenizers can be used with a smaller vocabulary, and allow the model to have some information about novel words from the subwords that make them up.
We briefly discuss the subword tokenization options below, but the Subword Tokenization tutorial goes more in depth and also explains how to generate the vocab files.
WordpieceTokenizer
WordPiece tokenization is a data-driven tokenization scheme which generates a set of sub-tokens. These sub-tokens may correspond to linguistic morphemes, but this is often not the case.
The WordpieceTokenizer expects the input to already be split into tokens. Because of this prerequisite, you will often want to split using the WhitespaceTokenizer or UnicodeScriptTokenizer beforehand.
tokenizer = tf_text.WhitespaceTokenizer()
tokens = tokenizer.tokenize(["What you know you can't explain, but you feel it."])
print(tokens.to_list())
[[b'What', b'you', b'know', b'you', b"can't", b'explain,', b'but', b'you', b'feel', b'it.']]
After the string is split into tokens, the WordpieceTokenizer can be used to split it into sub-tokens.
url = "https://github.com/tensorflow/text/blob/master/tensorflow_text/python/ops/test_data/test_wp_en_vocab.txt?raw=true"
r = requests.get(url)
filepath = "vocab.txt"
open(filepath, 'wb').write(r.content)
52382
subtokenizer = tf_text.WordpieceTokenizer(filepath)
subtokens = subtokenizer.tokenize(tokens)
print(subtokens.to_list())
[[[b'What'], [b'you'], [b'know'], [b'you'], [b"can't"], [b'explain,'], [b'but'], [b'you'], [b'feel'], [b'it.']]]
BertTokenizer
The BertTokenizer mirrors the original implementation of tokenization from the BERT paper. This is backed by the WordpieceTokenizer, but also performs additional tasks such as normalization and tokenizing to words first.
tokenizer = tf_text.BertTokenizer(filepath, token_out_type=tf.string, lower_case=True)
tokens = tokenizer.tokenize(["What you know you can't explain, but you feel it."])
print(tokens.to_list())
[[[b'what'], [b'you'], [b'know'], [b'you'], [b'can'], [b"'"], [b't'], [b'explain'], [b','], [b'but'], [b'you'], [b'feel'], [b'it'], [b'.']]]
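The token_out_type=tf.string argument above is only for readability. By default, BertTokenizer returns int64 vocabulary IDs, which is usually what a downstream model consumes; a minimal sketch:
id_tokenizer = tf_text.BertTokenizer(filepath, lower_case=True)
ids = id_tokenizer.tokenize(["What you know you can't explain, but you feel it."])
print(ids.to_list())  # nested lists of int64 wordpiece ids instead of strings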
SentencepieceTokenizer
The SentencepieceTokenizer is a highly configurable sub-token tokenizer. It is backed by the Sentencepiece library. Like the BertTokenizer, it can include normalization and token splitting before splitting into sub-tokens.
url = "https://github.com/tensorflow/text/blob/master/tensorflow_text/python/ops/test_data/test_oss_model.model?raw=true"
sp_model = requests.get(url).content
tokenizer = tf_text.SentencepieceTokenizer(sp_model, out_type=tf.string)
tokens = tokenizer.tokenize(["What you know you can't explain, but you feel it."])
print(tokens.to_list())
[[b'\xe2\x96\x81What', b'\xe2\x96\x81you', b'\xe2\x96\x81know', b'\xe2\x96\x81you', b'\xe2\x96\x81can', b"'", b't', b'\xe2\x96\x81explain', b',', b'\xe2\x96\x81but', b'\xe2\x96\x81you', b'\xe2\x96\x81feel', b'\xe2\x96\x81it', b'.']]
Other splitters
UnicodeCharTokenizer
This splits a string into UTF-8 characters. It is useful for CJK languages that do not have spaces between words.
tokenizer = tf_text.UnicodeCharTokenizer()
tokens = tokenizer.tokenize(["What you know you can't explain, but you feel it."])
print(tokens.to_list())
[[87, 104, 97, 116, 32, 121, 111, 117, 32, 107, 110, 111, 119, 32, 121, 111, 117, 32, 99, 97, 110, 39, 116, 32, 101, 120, 112, 108, 97, 105, 110, 44, 32, 98, 117, 116, 32, 121, 111, 117, 32, 102, 101, 101, 108, 32, 105, 116, 46]]
The output is Unicode code points. This can also be useful for creating character n-grams, such as bigrams. To convert back into UTF-8 characters:
characters = tf.strings.unicode_encode(tf.expand_dims(tokens, -1), "UTF-8")
bigrams = tf_text.ngrams(characters, 2, reduction_type=tf_text.Reduction.STRING_JOIN, string_separator='')
print(bigrams.to_list())
[[b'Wh', b'ha', b'at', b't ', b' y', b'yo', b'ou', b'u ', b' k', b'kn', b'no', b'ow', b'w ', b' y', b'yo', b'ou', b'u ', b' c', b'ca', b'an', b"n'", b"'t", b't ', b' e', b'ex', b'xp', b'pl', b'la', b'ai', b'in', b'n,', b', ', b' b', b'bu', b'ut', b't ', b' y', b'yo', b'ou', b'u ', b' f', b'fe', b'ee', b'el', b'l ', b' i', b'it', b't.']]
HubModuleTokenizer
This is a wrapper around models deployed to TF Hub to make the calls easier, since TF Hub currently does not support ragged tensors. Having a model perform tokenization is particularly useful for CJK languages when you want to split into words but do not have spaces to provide a heuristic guide. At this time, we have a single segmentation model for Chinese.
MODEL_HANDLE = "https://tfhub.dev/google/zh_segmentation/1"
segmenter = tf_text.HubModuleTokenizer(MODEL_HANDLE)
tokens = segmenter.tokenize(["新华社北京"])
print(tokens.to_list())
[[b'\xe6\x96\xb0\xe5\x8d\x8e\xe7\xa4\xbe', b'\xe5\x8c\x97\xe4\xba\xac']]
It may be difficult to view the results of the UTF-8 encoded byte strings. Decoding the list values makes viewing easier.
def decode_list(x):
if type(x) is list:
return list(map(decode_list, x))
return x.decode("UTF-8")
def decode_utf8_tensor(x):
return list(map(decode_list, x.to_list()))
print(decode_utf8_tensor(tokens))
[['新华社', '北京']]
SplitMergeTokenizer
The SplitMergeTokenizer and SplitMergeFromLogitsTokenizer have the targeted purpose of splitting a string based on provided values that indicate where the string should be split. This is useful when building your own segmentation models, like the previous segmentation example.
For the SplitMergeTokenizer, a value of 0 is used to indicate the start of a new string, and a value of 1 indicates the character is part of the current string.
strings = ["新华社北京"]
labels = [[0, 1, 1, 0, 1]]
tokenizer = tf_text.SplitMergeTokenizer()
tokens = tokenizer.tokenize(strings, labels)
print(decode_utf8_tensor(tokens))
[['新华社', '北京']]
The SplitMergeFromLogitsTokenizer is similar, but instead accepts pairs of logit values from a neural network that predict whether each character should be split into a new string or merged into the current one.
strings = [["新华社北京"]]
labels = [[[5.0, -3.2], [0.2, 12.0], [0.0, 11.0], [2.2, -1.0], [-3.0, 3.0]]]
tokenizer = tf_text.SplitMergeFromLogitsTokenizer()
tokens = tokenizer.tokenize(strings, labels)
print(decode_utf8_tensor(tokens))
[['新华社', '北京']]
RegexSplitter
The RegexSplitter is able to segment strings at arbitrary breakpoints defined by a provided regular expression.
splitter = tf_text.RegexSplitter(r"\s")
tokens = splitter.split(["What you know you can't explain, but you feel it."])
print(tokens.to_list())
[[b'What', b'you', b'know', b'you', b"can't", b'explain,', b'but', b'you', b'feel', b'it.']]
Offsets
When tokenizing strings, it is often desired to know where in the original string the token originated from. For this reason, each tokenizer which implements TokenizerWithOffsets has a tokenize_with_offsets method that returns the byte offsets along with the tokens. The start_offsets lists the byte at which each token starts in the original string, and the end_offsets lists the byte immediately after the point where each token ends. In other words, the start offsets are inclusive and the end offsets are exclusive.
tokenizer = tf_text.UnicodeScriptTokenizer()
(tokens, start_offsets, end_offsets) = tokenizer.tokenize_with_offsets(['Everything not saved will be lost.'])
print(tokens.to_list())
print(start_offsets.to_list())
print(end_offsets.to_list())
[[b'Everything', b'not', b'saved', b'will', b'be', b'lost', b'.']] [[0, 11, 15, 21, 26, 29, 33]] [[10, 14, 20, 25, 28, 33, 34]]
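Because the start offsets are inclusive and the end offsets are exclusive, you can slice the tokens back out of the source string with them. A minimal sketch (plain Python slicing works here because the string is ASCII, so byte offsets and character offsets coincide):
sentence = 'Everything not saved will be lost.'
for start, end in zip(start_offsets.to_list()[0], end_offsets.to_list()[0]):
  print(sentence[start:end])  # prints each token: 'Everything', 'not', 'saved', ...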
Detokenization
Tokenizers which implement the Detokenizer provide a detokenize method which attempts to combine the strings. This has the chance of being lossy, so the detokenized string may not always match exactly the original, pre-tokenized string.
tokenizer = tf_text.UnicodeCharTokenizer()
tokens = tokenizer.tokenize(["What you know you can't explain, but you feel it."])
print(tokens.to_list())
strings = tokenizer.detokenize(tokens)
print(strings.numpy())
[[87, 104, 97, 116, 32, 121, 111, 117, 32, 107, 110, 111, 119, 32, 121, 111, 117, 32, 99, 97, 110, 39, 116, 32, 101, 120, 112, 108, 97, 105, 110, 44, 32, 98, 117, 116, 32, 121, 111, 117, 32, 102, 101, 101, 108, 32, 105, 116, 46]] [b"What you know you can't explain, but you feel it."]
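For a lossy example, consider a tokenizer that normalizes its input. Here is a minimal sketch, reusing the vocab.txt downloaded earlier and assuming BertTokenizer's detokenize behaviour: with lower_case=True, the original capitalization (and the exact spacing around punctuation) cannot be recovered.
bert_tokenizer = tf_text.BertTokenizer(filepath, lower_case=True)
token_ids = bert_tokenizer.tokenize(["What you know you can't explain, but you feel it."])
# detokenize reassembles the wordpieces into words, but the text was lowercased
# during tokenization, so the round trip does not reproduce the original exactly.
print(bert_tokenizer.detokenize(token_ids).to_list())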
TF Data
TF Data is a powerful API for creating an input pipeline for training models. Tokenizers work as expected with the API.
docs = tf.data.Dataset.from_tensor_slices([['Never tell me the odds.'], ["It's a trap!"]])
tokenizer = tf_text.WhitespaceTokenizer()
tokenized_docs = docs.map(lambda x: tokenizer.tokenize(x))
iterator = iter(tokenized_docs)
print(next(iterator).to_list())
print(next(iterator).to_list())
[[b'Never', b'tell', b'me', b'the', b'odds.']] [[b"It's", b'a', b'trap!']]