Universal Sentence Encoder SentEval demo

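This Colab demonstrates the Universal Sentence Encoder CMLM model using the SentEval toolkit, a library for measuring the quality of sentence embeddings. SentEval includes a diverse set of downstream tasks that evaluate how well an embedding model generalizes and which linguistic properties it encodes.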

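Run the first two code blocks to set up the environment. In the third code block, you can pick a SentEval task to evaluate the model. A GPU runtime is recommended for this Colab.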

To learn more about the Universal Sentence Encoder CMLM model, see https://openreview.net/forum?id=WDVD4lUCTzU

Install dependencies

pip install --quiet "tensorflow-text==2.11.*"
pip install --quiet torch==1.8.1

Download SentEval and task data

This step downloads SentEval from GitHub and runs the data script to download the task data. It may take up to 5 minutes to complete.

Install SentEval and download task data

rm -rf ./SentEval
git clone https://github.com/facebookresearch/SentEval.git
cd $PWD/SentEval/data/downstream && bash get_transfer_data.bash > /dev/null 2>&1
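
Because the download command discards its output, a failed download is easy to miss. The following is a minimal sketch of a sanity check, assuming each task's data lands in its own subdirectory under SentEval/data/downstream. The helper name `missing_task_dirs` and the directory names passed to it are illustrative assumptions, not part of SentEval.

```python
import os


def missing_task_dirs(downstream_dir, expected_dirs):
    """Return the expected subdirectories that are absent or empty.

    Hypothetical helper: the caller supplies the directory names because
    the exact layout produced by the data script is an assumption.
    """
    missing = []
    for name in expected_dirs:
        path = os.path.join(downstream_dir, name)
        # Treat a missing directory and an empty one the same way.
        if not os.path.isdir(path) or not os.listdir(path):
            missing.append(name)
    return missing


downstream = os.path.join(os.getcwd(), "SentEval", "data", "downstream")
# Assumed directory names; adjust them to match what the script created.
print(missing_task_dirs(downstream, ["MR", "CR", "SUBJ", "MPQA"]))
```

An empty list suggests the listed directories were populated. Any names printed point to downloads worth retrying before starting an evaluation.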

Execute a SentEval evaluation task

The following code block executes a SentEval task and outputs the results. Choose one of the following tasks to evaluate the USE CMLM model:

MR  CR  SUBJ    MPQA    SST TREC    MRPC    SICK-E
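
The names in this list read as display names. As far as the SentEval source indicates, `eval()` expects identifiers such as `SST2` and `SICKEntailment` rather than `SST` and `SICK-E`. The sketch below maps one to the other and rejects unknown names before any evaluation starts. `TASK_IDS` and `resolve_task` are hypothetical names, and mapping `SST` to the binary variant `SST2` is an assumption to verify against the installed SentEval version.

```python
# Display names from the task list mapped to the identifiers that
# senteval.engine.SE.eval() is assumed to expect. "SST" -> "SST2" assumes
# the binary variant is meant.
TASK_IDS = {
    "MR": "MR",
    "CR": "CR",
    "SUBJ": "SUBJ",
    "MPQA": "MPQA",
    "SST": "SST2",
    "TREC": "TREC",
    "MRPC": "MRPC",
    "SICK-E": "SICKEntailment",
}


def resolve_task(name):
    """Map a display name (or an identifier already in use) to a task id."""
    if name in TASK_IDS:
        return TASK_IDS[name]
    if name in TASK_IDS.values():
        return name
    raise ValueError(
        "Unknown task %r; choose one of: %s" % (name, ", ".join(TASK_IDS))
    )


TASK = resolve_task("SICK-E")
print(TASK)
```

Resolving the name up front turns a typo into an immediate error instead of a failure several minutes into a run.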

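Select a model, params, and task to run. The rapid prototyping params reduce computation time and return results faster.

A task typically takes 5-15 minutes with the 'rapid prototyping' params and up to an hour with the 'slower, best performance' params.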

params = {'task_path': PATH_TO_DATA, 'usepytorch': True, 'kfold': 5}
params['classifier'] = {'nhid': 0, 'optim': 'rmsprop', 'batch_size': 128,
                        'tenacity': 3, 'epoch_size': 2}

For better results, use the 'slower, best performance' params. Computation may take up to 1 hour:

params = {'task_path': PATH_TO_DATA, 'usepytorch': True, 'kfold': 10}
params['classifier'] = {'nhid': 0, 'optim': 'adam', 'batch_size': 16,
                        'tenacity': 5, 'epoch_size': 6}
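
One way to switch between the two presets is a lookup keyed by preset name, so that a misspelled name fails loudly rather than silently selecting a default. This is only a sketch: `PRESETS` and `build_params` are hypothetical names, the values are copied from the two settings shown here, and the data path in the usage line is a placeholder.

```python
import copy

# Values copied from the two parameter settings in this notebook.
PRESETS = {
    "rapid prototyping": {
        "kfold": 5,
        "classifier": {"nhid": 0, "optim": "rmsprop", "batch_size": 128,
                       "tenacity": 3, "epoch_size": 2},
    },
    "slower, best performance": {
        "kfold": 10,
        "classifier": {"nhid": 0, "optim": "adam", "batch_size": 16,
                       "tenacity": 5, "epoch_size": 6},
    },
}


def build_params(preset, task_path):
    """Return a fresh SentEval-style params dict for the named preset."""
    if preset not in PRESETS:
        raise KeyError(
            "Unknown preset %r; choose one of %s" % (preset, sorted(PRESETS))
        )
    # Deep-copy so later edits to the result never alter the shared preset.
    params = copy.deepcopy(PRESETS[preset])
    params["task_path"] = task_path
    params["usepytorch"] = True
    return params


params = build_params("rapid prototyping", "./SentEval/data")
print(params["kfold"], params["classifier"]["optim"])
```

The trade-off between the presets is visible in the values themselves: fewer folds, larger batches, and fewer epochs cut the classifier training time at some cost in accuracy.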

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

import sys
sys.path.append(f'{os.getcwd()}/SentEval')

import tensorflow as tf

# Prevent TF from claiming all GPU memory so there is some left for pytorch.
gpus = tf.config.list_physical_devices('GPU')
if gpus:
  # Memory growth needs to be the same across GPUs.
  for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

import tensorflow_hub as hub
import tensorflow_text
import senteval
import time

PATH_TO_DATA = f'{os.getcwd()}/SentEval/data'
MODEL = 'https://tfhub.dev/google/universal-sentence-encoder-cmlm/en-base/1'
PARAMS = 'rapid prototyping'
TASK = 'CR'

params_prototyping = {'task_path': PATH_TO_DATA, 'usepytorch': True, 'kfold': 5}
params_prototyping['classifier'] = {'nhid': 0, 'optim': 'rmsprop', 'batch_size': 128,
                                    'tenacity': 3, 'epoch_size': 2}

params_best = {'task_path': PATH_TO_DATA, 'usepytorch': True, 'kfold': 10}
params_best['classifier'] = {'nhid': 0, 'optim': 'adam', 'batch_size': 16,
                             'tenacity': 5, 'epoch_size': 6}

params = params_best if PARAMS == 'slower, best performance' else params_prototyping

preprocessor = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
# Load the encoder from MODEL so that changing the constant above takes effect.
encoder = hub.KerasLayer(MODEL)

# Each example is a scalar string, so the per-example shape is ().
inputs = tf.keras.Input(shape=(), dtype=tf.string)
outputs = encoder(preprocessor(inputs))

model = tf.keras.Model(inputs=inputs, outputs=outputs)

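def prepare(params, samples):
    # No task-specific setup (such as building a vocabulary) is needed.
    return

def batcher(_, batch):
    # Join each tokenized sentence back into a string; empty sentences become '.'.
    batch = [' '.join(sent) if sent else '.' for sent in batch]
    # The encoder returns a dict; "default" holds the sentence embeddings.
    return model.predict(tf.constant(batch))["default"]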


se = senteval.engine.SE(params, batcher, prepare)
print("Evaluating task %s with %s parameters" % (TASK, PARAMS))
start = time.time()
results = se.eval(TASK)
end = time.time()
print('Task %s took %.1f seconds' % (TASK, end - start))
print(results)
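
The batcher contract can be tried without downloading a model. In the sketch below, `toy_encode` is a hypothetical stand-in for the real encoder (it only hashes characters into a fixed-size vector), and each batch is assumed to arrive as a list of token lists, which is what the join in `batcher` implies.

```python
def toy_encode(sentences, dim=8):
    """Hypothetical stand-in encoder: one fixed-size vector per sentence."""
    vectors = []
    for sentence in sentences:
        vec = [0.0] * dim
        # Spread character codes over the vector so different text differs.
        for i, ch in enumerate(sentence):
            vec[i % dim] += ord(ch) / 1000.0
        vectors.append(vec)
    return vectors


def batcher(_, batch):
    # Join tokens back into a string; an empty sentence becomes '.' so the
    # encoder never receives an empty input.
    sentences = [' '.join(sent) if sent else '.' for sent in batch]
    return toy_encode(sentences)


batch = [['a', 'great', 'phone'], [], ['battery', 'died', 'fast']]
embeddings = batcher(None, batch)
print(len(embeddings), len(embeddings[0]))  # 3 8
```

The point of the exercise is the shape: one row per input sentence, every row the same width. A real batcher should return that as a NumPy array, which is what `model.predict` produces in the code above.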

Learn more

Find more text embedding models on TensorFlow Hub
See also the Multilingual Universal Sentence Encoder CMLM model
Check out other Universal Sentence Encoder models

Reference

Ziyi Yang, Yinfei Yang, Daniel Cer, Jax Law, Eric Darve. Universal Sentence Representations Learning with Conditional Masked Language Model. November 2020.