Multilingual Universal Sentence Encoder Q&A Retrieval

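This is a demo of question-answer retrieval over text with the Universal Encoder Multilingual Q&A model, illustrating the use of the model's question_encoder and response_encoder. Sentences from SQuAD paragraphs serve as the demo dataset: each sentence, together with its context (the text surrounding the sentence), is encoded into a high-dimensional embedding with the response_encoder. These embeddings are stored in an index built with the simpleneighbors library for question-answer retrieval.

At retrieval time, a random question is selected from the SQuAD dataset and encoded into a high-dimensional embedding with the question_encoder. That embedding is then used to query the simpleneighbors index, which returns a list of approximate nearest neighbors in semantic space.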
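
The retrieval step can be illustrated without the model. The following is a minimal sketch, assuming hand-made 3-dimensional vectors in place of real encoder outputs and an exact brute-force search in place of the approximate simpleneighbors index. The helper names (cosine_similarity, nearest) and the toy data are hypothetical and not part of this notebook.

```python
import math

def cosine_similarity(a, b):
  """Cosine of the angle between two equal-length vectors."""
  dot = sum(x * y for x, y in zip(a, b))
  norm_a = math.sqrt(sum(x * x for x in a))
  norm_b = math.sqrt(sum(y * y for y in b))
  return dot / (norm_a * norm_b)

def nearest(query_vec, indexed, n=2):
  """Return the n stored texts whose vectors are closest in angle to query_vec."""
  ranked = sorted(
      indexed,
      key=lambda item: cosine_similarity(query_vec, item[1]),
      reverse=True)
  return [text for text, _ in ranked[:n]]

# Toy "response embeddings" (hypothetical values, not model outputs).
toy_index = [
  ("Oxygen is produced by pressure swing adsorption.", [0.9, 0.1, 0.0]),
  ("The Normans settled in France.", [0.0, 0.2, 0.9]),
  ("Zeolite sieves absorb nitrogen.", [0.7, 0.3, 0.1]),
]

# Toy "question embedding" (hypothetical).
toy_query = [1.0, 0.2, 0.0]

print(nearest(toy_query, toy_index, n=2))
```

In the notebook itself, the vectors come from the response_encoder and question_encoder signatures, and the ranking is delegated to the simpleneighbors index, which trades exactness for speed.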

Setup

Setup Environment

%%capture
# Install tensorflow-text, pinned to the 2.11 series (this pulls in a matching TensorFlow).
!pip install -q "tensorflow-text==2.11.*"
!pip install -q simpleneighbors[annoy]
!pip install -q nltk
!pip install -q tqdm

Setup common imports and functions

import json
import nltk
import os
import pprint
import random
import simpleneighbors
import urllib.request
from IPython.display import HTML, display
from tqdm.notebook import tqdm

import tensorflow.compat.v2 as tf
import tensorflow_hub as hub
# Importing tensorflow_text registers the SentencePiece ops that the model needs.
from tensorflow_text import SentencepieceTokenizer

nltk.download('punkt')


def download_squad(url):
  return json.load(urllib.request.urlopen(url))

def extract_sentences_from_squad_json(squad):
  all_sentences = []
  for data in squad['data']:
    for paragraph in data['paragraphs']:
      sentences = nltk.tokenize.sent_tokenize(paragraph['context'])
      all_sentences.extend(zip(sentences, [paragraph['context']] * len(sentences)))
  return list(set(all_sentences)) # remove duplicates

def extract_questions_from_squad_json(squad):
  questions = []
  for data in squad['data']:
    for paragraph in data['paragraphs']:
      for qas in paragraph['qas']:
        if qas['answers']:
          questions.append((qas['question'], qas['answers'][0]['text']))
  return list(set(questions))

def output_with_highlight(text, highlight):
  output = "<li> "
  i = text.find(highlight)
  while True:
    if i == -1:
      output += text
      break
    output += text[0:i]
    output += '<b>'+text[i:i+len(highlight)]+'</b>'
    text = text[i+len(highlight):]
    i = text.find(highlight)
  return output + "</li>\n"

def display_nearest_neighbors(query_text, answer_text=None):
  query_embedding = model.signatures['question_encoder'](tf.constant([query_text]))['outputs'][0]
  search_results = index.nearest(query_embedding, n=num_results)

  if answer_text:
    result_md = '''
    <p>Random Question from SQuAD:</p>
    <p>&nbsp;&nbsp;<b>%s</b></p>
    <p>Answer:</p>
    <p>&nbsp;&nbsp;<b>%s</b></p>
    ''' % (query_text , answer_text)
  else:
    result_md = '''
    <p>Question:</p>
    <p>&nbsp;&nbsp;<b>%s</b></p>
    ''' % query_text

  result_md += '''
    <p>Retrieved sentences :
    <ol>
  '''

  if answer_text:
    for s in search_results:
      result_md += output_with_highlight(s, answer_text)
  else:
    for s in search_results:
      result_md += '<li>' + s + '</li>\n'

  result_md += "</ol>"
  display(HTML(result_md))


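Run the following code block to download the SQuAD dataset and extract it into two lists:

sentences is a list of (text, context) tuples: each paragraph from the SQuAD dataset is split into sentences using the nltk library, and each sentence is paired with the paragraph text to form a (text, context) tuple.
questions is a list of (question, answer) tuples.

Download and extract SQuAD data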
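
As a rough illustration of the (text, context) structure, the sketch below pairs each sentence of a paragraph with the full paragraph. It assumes a naive regular-expression splitter (the hypothetical helper naive_sent_tokenize) in place of nltk.tokenize.sent_tokenize so that it runs without any downloads. The sample paragraph and the helper to_text_context_pairs are made up for this example.

```python
import re

def naive_sent_tokenize(text):
  """Very rough splitter: breaks after '.', '!' or '?' followed by whitespace."""
  return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def to_text_context_pairs(paragraph):
  """Pair every sentence with the paragraph it came from."""
  return [(sentence, paragraph) for sentence in naive_sent_tokenize(paragraph)]

paragraph = "Oxygen is a gas. It makes up about a fifth of air."
pairs = to_text_context_pairs(paragraph)

for text, context in pairs:
  print(repr(text), '<-', repr(context))
```

Every tuple carries the same context string, which is what lets the response_encoder see the surrounding paragraph when it embeds a single sentence.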

squad_url = 'https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json'

squad_json = download_squad(squad_url)
sentences = extract_sentences_from_squad_json(squad_json)
questions = extract_questions_from_squad_json(squad_json)
print("%s sentences, %s questions extracted from SQuAD %s" % (len(sentences), len(questions), squad_url))

print("\nExample sentence and context:\n")
sentence = random.choice(sentences)
print("sentence:\n")
pprint.pprint(sentence[0])
print("\ncontext:\n")
pprint.pprint(sentence[1])
print()

10455 sentences, 10552 questions extracted from SQuAD https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json

Example sentence and context:

sentence:

('Oxygen gas is increasingly obtained by these non-cryogenic technologies (see '
 'also the related vacuum swing adsorption).')

context:

('The other major method of producing O\n'
 '2 gas involves passing a stream of clean, dry air through one bed of a pair '
 'of identical zeolite molecular sieves, which absorbs the nitrogen and '
 'delivers a gas stream that is 90% to 93% O\n'
 '2. Simultaneously, nitrogen gas is released from the other '
 'nitrogen-saturated zeolite bed, by reducing the chamber operating pressure '
 'and diverting part of the oxygen gas from the producer bed through it, in '
 'the reverse direction of flow. After a set cycle time the operation of the '
 'two beds is interchanged, thereby allowing for a continuous supply of '
 'gaseous oxygen to be pumped through a pipeline. This is known as pressure '
 'swing adsorption. Oxygen gas is increasingly obtained by these non-cryogenic '
 'technologies (see also the related vacuum swing adsorption).')

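The following code block loads the Universal Encoder Multilingual Q&A model from TensorFlow Hub. Its question_encoder and response_encoder signatures are then available through model.signatures.

Load model from TensorFlow Hub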

module_url = "https://tfhub.dev/google/universal-sentence-encoder-multilingual-qa/3"
model = hub.load(module_url)

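The following code block computes the embeddings for all the (text, context) tuples with the response_encoder and stores them in a simpleneighbors index.

Compute embeddings and build simpleneighbors index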

batch_size = 100

encodings = model.signatures['response_encoder'](
  input=tf.constant([sentences[0][0]]),
  context=tf.constant([sentences[0][1]]))
index = simpleneighbors.SimpleNeighbors(
    len(encodings['outputs'][0]), metric='angular')

print('Computing embeddings for %s sentences' % len(sentences))
# Slice by index so that the final, partial batch is encoded as well.
batch_starts = range(0, len(sentences), batch_size)
for start in tqdm(batch_starts):
  chunk = sentences[start:start + batch_size]
  response_batch = [r for r, c in chunk]
  context_batch = [c for r, c in chunk]
  encodings = model.signatures['response_encoder'](
    input=tf.constant(response_batch),
    context=tf.constant(context_batch)
  )
  for batch_index, batch in enumerate(response_batch):
    index.add_one(batch, encodings['outputs'][batch_index])

index.build()
print('simpleneighbors index for %s sentences built.' % len(sentences))

Computing embeddings for 10455 sentences
simpleneighbors index for 10455 sentences built.
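
The batching step can be looked at in isolation. The sketch below is an illustration under simple assumptions (a hypothetical helper iter_batches and a toy list of integers). It slices by index so that a final, partial batch is still yielded, whereas grouping with zip over a repeated iterator yields only full batches.

```python
def iter_batches(items, batch_size):
  """Yield consecutive slices of items; the last slice may be shorter."""
  for start in range(0, len(items), batch_size):
    yield items[start:start + batch_size]

toy_items = list(range(7))

sliced = list(iter_batches(toy_items, 3))

# zip over the same iterator repeated batch_size times stops at the first
# incomplete group, so the trailing item is dropped.
zipped = list(zip(*(iter(toy_items),) * 3))

print(sliced)
print(zipped)
```

With 10455 sentences and a batch size of 100, index slicing gives 105 batches, the last one holding 55 sentences.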

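At retrieval time, the question is encoded with the question_encoder, and the resulting question embedding is used to query the simpleneighbors index.

Retrieve nearest neighbors for a random question from SQuAD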

num_results = 25

query = random.choice(questions)
display_nearest_neighbors(query[0], query[1])