Multilingual Universal Sentence Encoder Q&A Retrieval


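This is a demo of using the Universal Encoder Multilingual Q&A model for question-answer retrieval of text, illustrating the use of the model's question_encoder and response_encoder. We use sentences from SQuAD paragraphs as the demo dataset. Each sentence and its context (the text surrounding the sentence) is encoded into a high-dimensional embedding with the response_encoder. These embeddings are stored in an index built with the simpleneighbors library for question-answer retrieval.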

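At retrieval time, a random question is selected from the SQuAD dataset and encoded into a high-dimensional embedding with the question_encoder. That embedding is then used to query the simpleneighbors index, which returns a list of approximate nearest neighbors in semantic space.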

More models

You can find all currently hosted text embedding models here, and all models that have also been trained on SQuAD here.

Setup

Set up the environment

Set up common imports and functions

[nltk_data] Downloading package punkt to /home/kbuilder/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.

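Run the following code block to download and extract the SQuAD dataset into:

  • sentences is a list of (text, context) tuples: each paragraph from the SQuAD dataset is split into sentences using the nltk library, and each sentence paired with its paragraph text forms a (text, context) tuple.
  • questions is a list of (question, answer) tuples.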
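To make the two data structures concrete, here is a minimal sketch of how they could be derived from SQuAD-style JSON. It is not the notebook's own cell: the inline `sample_squad` dictionary is illustrative data that only mimics the SQuAD v1.1 layout, and the regex-based `split_sentences` helper is a hypothetical stand-in for the nltk sentence tokenizer so that the example runs without downloads.

```python
import re

# Illustrative data only: a tiny dictionary mimicking the SQuAD v1.1 JSON layout.
sample_squad = {
    "data": [
        {
            "title": "Oxygen",
            "paragraphs": [
                {
                    "context": (
                        "This is known as pressure swing adsorption. "
                        "Oxygen gas is increasingly obtained by these "
                        "non-cryogenic technologies."
                    ),
                    "qas": [
                        {
                            "question": "What is the process called?",
                            "answers": [
                                {"text": "pressure swing adsorption",
                                 "answer_start": 17}
                            ],
                        }
                    ],
                }
            ],
        }
    ]
}


def split_sentences(paragraph):
    """Naive splitter (assumption): a stand-in for nltk's sentence tokenizer."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph)
    return [part.strip() for part in parts if part.strip()]


def extract_sentences(squad):
    """Return a list of (text, context) tuples, one per sentence."""
    sentences = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for sentence in split_sentences(context):
                sentences.append((sentence, context))
    return sentences


def extract_questions(squad):
    """Return a list of (question, answer) tuples, using the first answer."""
    questions = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                if qa["answers"]:
                    questions.append((qa["question"], qa["answers"][0]["text"]))
    return questions


sentences = extract_sentences(sample_squad)
questions = extract_questions(sample_squad)
```

Every sentence keeps a reference to its full paragraph, so the response encoder can later see both the candidate answer sentence and its surrounding text.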

Download and extract SQuAD data

10455 sentences, 10552 questions extracted from SQuAD https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json

Example sentence and context:

sentence:

('Oxygen gas is increasingly obtained by these non-cryogenic technologies (see '
 'also the related vacuum swing adsorption).')

context:

('The other major method of producing O\n'
 '2 gas involves passing a stream of clean, dry air through one bed of a pair '
 'of identical zeolite molecular sieves, which absorbs the nitrogen and '
 'delivers a gas stream that is 90% to 93% O\n'
 '2. Simultaneously, nitrogen gas is released from the other '
 'nitrogen-saturated zeolite bed, by reducing the chamber operating pressure '
 'and diverting part of the oxygen gas from the producer bed through it, in '
 'the reverse direction of flow. After a set cycle time the operation of the '
 'two beds is interchanged, thereby allowing for a continuous supply of '
 'gaseous oxygen to be pumped through a pipeline. This is known as pressure '
 'swing adsorption. Oxygen gas is increasingly obtained by these non-cryogenic '
 'technologies (see also the related vacuum swing adsorption).')

The following code block sets up the TensorFlow graph g and session with the Universal Encoder Multilingual Q&A model's question_encoder and response_encoder signatures.

Load model from TensorFlow Hub


The following code block computes embeddings for all the (text, context) tuples using the response_encoder and stores them in a simpleneighbors index.
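The batching pattern behind this step can be sketched as follows. This is an illustration under stated assumptions rather than the notebook's code: `toy_response_encoder` is a hypothetical stand-in for the model's response_encoder signature (it returns meaningless 3-dimensional vectors), the batch size of 100 is assumed, and the resulting list of (text, embedding) pairs stands in for what would be fed to a simpleneighbors index.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def toy_response_encoder(texts, contexts):
    """Hypothetical stand-in for the response encoder.

    Returns one small vector per (text, context) pair. The real encoder
    would return high-dimensional semantic embeddings instead.
    """
    return [
        [float(len(text)), float(len(context)), float(text.count(" ") + 1)]
        for text, context in zip(texts, contexts)
    ]


def embed_all(sentences, encoder, batch_size=100):
    """Encode (text, context) tuples in batches; return (text, embedding) pairs."""
    entries = []
    for batch in batched(sentences, batch_size):
        texts = [text for text, _ in batch]
        contexts = [context for _, context in batch]
        embeddings = encoder(texts, contexts)
        entries.extend(zip(texts, embeddings))
    return entries


# Illustrative input: 250 synthetic (text, context) tuples.
sentences = [("sentence %d" % i, "context paragraph") for i in range(250)]
entries = embed_all(sentences, toy_response_encoder, batch_size=100)
```

Encoding in batches keeps memory use bounded while still amortizing the per-call overhead of the encoder. The final, shorter batch is handled by the slice in `batched`.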

Compute embeddings and build simpleneighbors index

Computing embeddings for 10455 sentences
simpleneighbors index for 10455 sentences built.

At retrieval time, the question is encoded with the question_encoder, and the question embedding is used to query the simpleneighbors index.
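The query step amounts to ranking stored response embeddings by their similarity to the question embedding. The sketch below uses an exact brute-force cosine ranking as a stand-in for the approximate nearest-neighbor search that a simpleneighbors index performs. The `nearest` helper and the hand-written 3-dimensional vectors are illustrative assumptions, not output of the real encoders.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(index_entries, query_embedding, n=3):
    """Return the `n` items whose embeddings are most similar to the query.

    Exact brute-force search (assumption): an approximate index trades
    some accuracy for speed on large collections.
    """
    ranked = sorted(
        index_entries,
        key=lambda entry: cosine_similarity(entry[1], query_embedding),
        reverse=True,
    )
    return [item for item, _ in ranked[:n]]


# Hand-written toy embeddings; real ones would come from the response encoder.
index_entries = [
    ("Oxygen is produced by pressure swing adsorption.", [0.9, 0.1, 0.0]),
    ("The Normans settled in Normandy.", [0.0, 0.2, 0.9]),
    ("Zeolite sieves absorb nitrogen.", [0.7, 0.6, 0.1]),
]

# Stand-in for the question encoder's output for a question about oxygen.
question_embedding = [1.0, 0.2, 0.0]
results = nearest(index_entries, question_embedding, n=2)
```

Because questions and responses are encoded by different encoders into the same space, a question can land near sentences that answer it even when they share few words.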

Retrieve nearest neighbors for a random question from SQuAD