Text | TensorFlow

Text processing tools for TensorFlow

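TensorFlow provides two libraries for text and natural language processing: KerasNLP and TensorFlow Text. KerasNLP is a high-level natural language processing (NLP) library that includes modern transformer-based models as well as lower-level tokenization utilities. It is the recommended solution for most NLP use cases. Built on TensorFlow Text, KerasNLP abstracts low-level text processing operations into an API designed for ease of use. If you prefer not to work with the Keras API, or you need access to the lower-level text processing ops, you can use TensorFlow Text directly.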

KerasNLP

import keras_nlp
import tensorflow_datasets as tfds

imdb_train, imdb_test = tfds.load(
  "imdb_reviews",
  split=["train", "test"],
  as_supervised=True,
  batch_size=16,
)
# Load a BERT model.
classifier = keras_nlp.models.BertClassifier.from_preset("bert_base_en_uncased")
# Fine-tune on IMDb movie reviews.
classifier.fit(imdb_train, validation_data=imdb_test)
# Predict two new examples.
classifier.predict(["What an amazing movie!", "A total waste of my time."])

See the quickstart guide on GitHub.

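The easiest way to get started processing text in TensorFlow is to use KerasNLP. KerasNLP is a natural language processing library that supports workflows built from modular components with state-of-the-art preset weights and architectures. You can use KerasNLP components with their out-of-the-box configuration. If you need more control, you can easily customize components. KerasNLP emphasizes in-graph computation for all workflows, so you can expect easy productionization using the TensorFlow ecosystem.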

KerasNLP is an extension of the core Keras API, and all high-level KerasNLP modules are Layers or Models. If you are familiar with Keras, you already understand most of KerasNLP.

To learn more, see KerasNLP.

TensorFlow Text

import tensorflow as tf
import tensorflow_text as tf_text

def preprocess(vocab_lookup_table, example_text):

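  # Normalize text (NFKC by default). The op returns a new tensor, so keep
  # the result instead of discarding it.
  example_text = tf_text.normalize_utf8(example_text)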

  # Tokenize into words
  word_tokenizer = tf_text.WhitespaceTokenizer()
  tokens = word_tokenizer.tokenize(example_text)

  # Tokenize into subwords
  subword_tokenizer = tf_text.WordpieceTokenizer(
       vocab_lookup_table, token_out_type=tf.int64)
  subtokens = subword_tokenizer.tokenize(tokens).merge_dims(1, -1)

  # Apply padding. pad_model_inputs returns a (padded_ids, mask) tuple.
  padded_inputs = tf_text.pad_model_inputs(subtokens, max_seq_length=16)
  return padded_inputs

Run in a Notebook

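KerasNLP provides high-level text processing modules that are available as layers or models. If you need access to lower-level tools, you can use TensorFlow Text. TensorFlow Text provides a rich collection of ops and libraries to help you work with input in text form, such as raw text strings or documents. These libraries can perform the preprocessing regularly required by text-based models, and include other features useful for sequence modeling.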
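To make that preprocessing concrete, here is a minimal pure-Python sketch of the three steps in the preprocess function above: whitespace tokenization, greedy longest-match-first subword splitting, and padding with a mask. It does not call TensorFlow Text and is not how the library implements its ops. The toy vocabulary, the "[PAD]" and "[UNK]" entries, and the names wordpiece and preprocess_sketch are all hypothetical and exist only for illustration.

```python
import unicodedata

# Hypothetical toy vocabulary: token -> id. "##" marks a word continuation.
VOCAB = {
    "[PAD]": 0, "[UNK]": 1, "what": 2, "an": 3,
    "amaz": 4, "##ing": 5, "movie": 6,
}


def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into the longest matching vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is found in the vocab.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: treat the whole word as unknown
        pieces.append(piece)
        start = end
    return pieces


def preprocess_sketch(text, vocab, max_seq_length=8):
    """Return (padded_ids, mask) for a single string."""
    text = unicodedata.normalize("NFKC", text)  # normalize
    ids = []
    for word in text.split():                   # tokenize into words
        ids.extend(vocab[p] for p in wordpiece(word, vocab))  # into subwords
    ids = ids[:max_seq_length]                  # truncate, then pad
    pad_count = max_seq_length - len(ids)
    mask = [1] * len(ids) + [0] * pad_count
    padded = ids + [vocab["[PAD]"]] * pad_count
    return padded, mask


padded_ids, mask = preprocess_sketch("what an amazing movie", VOCAB)
print(padded_ids)
print(mask)
```

The sketch handles one string at a time with Python lists. The TensorFlow Text ops work on batched tensors and ragged tensors inside the graph, which is what allows the same preprocessing to be exported together with the model.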

You can extract powerful syntactic and semantic text features from inside the TensorFlow graph as input to your neural network.

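Integrating preprocessing with the TensorFlow graph provides the following benefits:

Facilitates a large toolkit for working with text
Allows integration with a large suite of TensorFlow tools to support projects from problem definition through training, evaluation, and launch
Reduces complexity at serving time and prevents training-serving skew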

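In addition to the above, you do not need to worry about tokenization in training differing from tokenization at inference, or about managing preprocessing scripts.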
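Training-serving skew is easiest to see with a small example. The following pure-Python sketch, under the assumption of a toy vocabulary and hypothetical function names, shows how a separately maintained serving script can drift from the training pipeline, and how sharing a single preprocessing function removes the mismatch. In TensorFlow, the comparable effect comes from keeping preprocessing in the graph so that it ships with the model; the sketch only illustrates the idea with plain functions.

```python
# Hypothetical toy vocabulary; unknown tokens map to "[UNK]".
VOCAB = {"[UNK]": 0, "great": 1, "movie": 2}


def to_ids(tokens):
    """Look up each token, falling back to the unknown id."""
    return [VOCAB.get(token, VOCAB["[UNK]"]) for token in tokens]


def train_preprocess(text):
    # Training pipeline: lowercases before splitting on whitespace.
    return to_ids(text.lower().split())


def serving_preprocess_drifted(text):
    # Separately maintained serving script that omits the lowercasing step.
    return to_ids(text.split())


query = "Great movie"
train_ids = train_preprocess(query)
drifted_ids = serving_preprocess_drifted(query)
print(train_ids)    # ids the model saw during training
print(drifted_ids)  # different ids for the same text at serving time

# Sharing one function between both paths removes the mismatch.
serving_preprocess = train_preprocess
shared_ids = serving_preprocess(query)
print(shared_ids == train_ids)
```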

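Text preprocessing

End-to-end preprocessing

Subword tokenization

Text classification

Text classification with BERT

Text classification with an RNN

Text generation

Text translation with a Transformer model

Text translation with a seq2seq model

TensorFlow Text blog posts

TensorFlow Text on GitHub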