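This tutorial shows how to generate embeddings from a TensorFlow Hub (TF-Hub) module for given input data, and how to build an approximate nearest neighbors (ANN) index from the extracted embeddings. The index can then be used for real-time similarity matching and retrieval.
When dealing with a large corpus of data, scanning the whole repository to find the items most similar to a given query in real time is inefficient. We therefore use an approximate similarity matching algorithm, which trades a little accuracy in finding the exact nearest neighbor matches for a significant boost in speed.
In this tutorial, we show an example of real-time text search over a corpus of news headlines to find the headlines most similar to a query. Unlike keyword search, this approach captures the semantic similarity encoded in the text embedding.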
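To make the cost of an exact search concrete, here is a minimal sketch of a brute-force lookup that scores every item against the query. The three-dimensional vectors and headline strings are invented for illustration and are not taken from the dataset; `cosine` and `exact_top_k` are hypothetical helper names.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def exact_top_k(query, corpus, k):
    """Scores every item against the query: O(N * d) work per lookup."""
    ranked = sorted(corpus, key=lambda name: cosine(query, corpus[name]), reverse=True)
    return ranked[:k]

# Hypothetical 3-dimensional embeddings, invented for illustration.
corpus = {
    'storm warning issued': [0.9, 0.1, 0.0],
    'cyclone alert for coast': [0.8, 0.2, 0.1],
    'team wins grand final': [0.0, 0.1, 0.9],
}
print(exact_top_k([1.0, 0.0, 0.0], corpus, k=2))
```

With a million headlines, every query repeats this full scan; an ANN index avoids that by inspecting only a small part of the corpus.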
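The steps of this tutorial are:
- Download sample data.
- Generate embeddings for the data using a TF-Hub module.
- Build an ANN index for the embeddings.
- Use the index for similarity matching.
We use Apache Beam to generate the embeddings from the TF-Hub module. We also use Spotify's ANNOY library to build the approximate nearest neighbors index.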
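More models
For models that have the same architecture but were trained on a different language, refer to this collection. Here you can find all text embeddings that are currently hosted on tfhub.dev.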
Setup
Install the required libraries.
pip install -q apache_beam
pip install -q 'scikit_learn~=0.23.0' # For gaussian_random_matrix.
pip install -q annoy
Import the required libraries
import os
import sys
import pickle
from collections import namedtuple
from datetime import datetime
import numpy as np
import apache_beam as beam
from apache_beam.transforms import util
import tensorflow as tf
import tensorflow_hub as hub
import annoy
from sklearn.random_projection import gaussian_random_matrix
print('TF version: {}'.format(tf.__version__))
print('TF-Hub version: {}'.format(hub.__version__))
print('Apache Beam version: {}'.format(beam.__version__))
TF version: 2.4.0
TF-Hub version: 0.11.0
Apache Beam version: 2.26.0
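1. Download Sample Data
The A Million News Headlines dataset contains news headlines published over a period of 15 years by the reputable Australian Broadcasting Corp. (ABC). This news dataset is a summarized historical record of noteworthy events around the globe from early 2003 to the end of 2017, with a more granular focus on Australia.
Format: tab-separated two-column data: 1) publication date and 2) headline text. We are only interested in the headline text.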
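As an illustration of the two-column layout, the sketch below parses a few sample rows with the standard `csv` module. The helper name `read_headlines` and its `skip_header` option are assumptions made for this example. The tutorial's own cell splits each line on the tab character and does not skip the header row, which is why `headline_text` shows up as the first record later on.

```python
import csv
import io

# Sample rows in the same layout as raw.tsv (header + tab-separated fields).
sample_tsv = (
    'publish_date\theadline_text\n'
    '20030219\t"aba decides against community broadcasting licence"\n'
    '20030219\t"act fire witnesses must be aware of defamation"\n'
)

def read_headlines(handle, skip_header=True):
    """Yields the headline column from a tab-separated file object."""
    reader = csv.reader(handle, delimiter='\t')
    if skip_header:
        next(reader, None)  # Drop the 'publish_date headline_text' row.
    for row in reader:
        if len(row) >= 2:
            yield row[1].strip()

headlines = list(read_headlines(io.StringIO(sample_tsv)))
print(headlines)
```

The `csv` reader removes the surrounding double quotes itself, so no extra `strip('"')` call is needed in this variant.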
wget 'https://dataverse.harvard.edu/api/access/datafile/3450625?format=tab&gbrecs=true' -O raw.tsv
wc -l raw.tsv
head raw.tsv
--2021-01-07 12:50:08-- https://dataverse.harvard.edu/api/access/datafile/3450625?format=tab&gbrecs=true
Resolving dataverse.harvard.edu (dataverse.harvard.edu)... 206.191.184.198
Connecting to dataverse.harvard.edu (dataverse.harvard.edu)|206.191.184.198|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 57600231 (55M) [text/tab-separated-values]
Saving to: ‘raw.tsv’
raw.tsv 100%[===================>] 54.93M 14.7MB/s in 4.4s
2021-01-07 12:50:14 (12.4 MB/s) - ‘raw.tsv’ saved [57600231/57600231]
1103664 raw.tsv
publish_date headline_text
20030219 "aba decides against community broadcasting licence"
20030219 "act fire witnesses must be aware of defamation"
20030219 "a g calls for infrastructure protection summit"
20030219 "air nz staff in aust strike for pay rise"
20030219 "air nz strike to affect australian travellers"
20030219 "ambitious olsson wins triple jump"
20030219 "antic delighted with record breaking barca"
20030219 "aussie qualifier stosur wastes four memphis match"
20030219 "aust addresses un security council over iraq"
For simplicity, we keep only the headline text and remove the publication date.
!rm -r corpus
!mkdir corpus

with open('corpus/text.txt', 'w') as out_file:
  with open('raw.tsv', 'r') as in_file:
    for line in in_file:
      headline = line.split('\t')[1].strip().strip('"')
      out_file.write(headline+"\n")
rm: cannot remove 'corpus': No such file or directory
tail corpus/text.txt
severe storms forecast for nye in south east queensland
snake catcher pleads for people not to kill reptiles
south australia prepares for party to welcome new year
strikers cool off the heat with big win in adelaide
stunning images from the sydney to hobart yacht
the ashes smiths warners near miss liven up boxing day test
timelapse: brisbanes new year fireworks
what 2017 meant to the kids of australia
what the papodopoulos meeting may mean for ausus
who is george papadopoulos the former trump campaign aide
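2. Generate Embeddings for the Data
In this tutorial, we use the Neural Network Language Model (NNLM) to generate embeddings for the headline data. The sentence embeddings can then be easily used to compute sentence-level semantic similarity. We run the embedding generation process using Apache Beam.
Embedding extraction method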
embed_fn = None

def generate_embeddings(text, module_url, random_projection_matrix=None):
  # Beam will run this function in different processes that need to
  # import hub and load embed_fn (if not previously loaded)
  global embed_fn
  if embed_fn is None:
    embed_fn = hub.load(module_url)
  embedding = embed_fn(text).numpy()
  if random_projection_matrix is not None:
    embedding = embedding.dot(random_projection_matrix)
  return text, embedding
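The `global embed_fn` check is a lazy, once-per-process load. The sketch below isolates that pattern using a hypothetical stand-in loader (`fake_load`) in place of `hub.load`, so it runs without TensorFlow; the names `embed_batch` and `load_calls` are likewise invented for this example.

```python
load_calls = 0
_model = None  # Cached per process, like embed_fn in the tutorial.

def fake_load(url):
    """Hypothetical stand-in for hub.load; counts how often it runs."""
    global load_calls
    load_calls += 1
    # The "model" maps each string to a 1-dimensional vector (its length).
    return lambda batch: [[float(len(s))] for s in batch]

def embed_batch(batch, url):
    """Loads the model on first use, then reuses the cached instance."""
    global _model
    if _model is None:  # Only the first call in a process pays the load cost.
        _model = fake_load(url)
    return batch, _model(batch)

for batch in (['a', 'bb'], ['ccc'], ['dddd', 'e']):
    texts, vectors = embed_batch(batch, 'placeholder-url')

print(load_calls)
```

Three batches are embedded, yet the loader runs only once, which matters when loading the module is far more expensive than embedding a batch.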
Convert to tf.Example method
def to_tf_example(entries):
  examples = []

  text_list, embedding_list = entries
  for i in range(len(text_list)):
    text = text_list[i]
    embedding = embedding_list[i]

    features = {
        'text': tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[text.encode('utf-8')])),
        'embedding': tf.train.Feature(
            float_list=tf.train.FloatList(value=embedding.tolist()))
    }

    example = tf.train.Example(
        features=tf.train.Features(
            feature=features)).SerializeToString(deterministic=True)

    examples.append(example)

  return examples
Beam pipeline
def run_hub2emb(args):
  '''Runs the embedding generation pipeline'''

  options = beam.options.pipeline_options.PipelineOptions(**args)
  args = namedtuple("options", args.keys())(*args.values())

  with beam.Pipeline(args.runner, options=options) as pipeline:
    (
        pipeline
        | 'Read sentences from files' >> beam.io.ReadFromText(
            file_pattern=args.data_dir)
        | 'Batch elements' >> util.BatchElements(
            min_batch_size=args.batch_size, max_batch_size=args.batch_size)
        | 'Generate embeddings' >> beam.Map(
            generate_embeddings, args.module_url, args.random_projection_matrix)
        | 'Encode to tf example' >> beam.FlatMap(to_tf_example)
        | 'Write to TFRecords files' >> beam.io.WriteToTFRecord(
            file_path_prefix='{}/emb'.format(args.output_dir),
            file_name_suffix='.tfrecords')
    )
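For readers new to Beam, the data flow of the pipeline above can be traced in plain Python. This is only a sketch of the batch, map, and flat-map stages, not a replacement for Beam: `batch_elements`, `toy_embed`, and `flatten` are hypothetical helpers, and the two-number "embedding" is invented for illustration.

```python
from itertools import islice

def batch_elements(items, batch_size):
    """Groups an iterable into lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def toy_embed(batch):
    """Hypothetical embedding: (length, vowel count) per sentence."""
    vectors = [[float(len(s)), float(sum(c in 'aeiou' for c in s))] for s in batch]
    return batch, vectors

def flatten(entry):
    """Mirrors the shape of to_tf_example: one record per (text, vector) pair."""
    texts, vectors = entry
    return [{'text': t, 'embedding': v} for t, v in zip(texts, vectors)]

sentences = ['air nz strike', 'ambitious olsson wins', 'antic delighted']
records = []
for batch in batch_elements(sentences, batch_size=2):  # 'Batch elements'
    entry = toy_embed(batch)                           # 'Generate embeddings' (Map)
    records.extend(flatten(entry))                     # 'Encode to tf example' (FlatMap)

print(len(records), records[0])
```

The map stage emits one output per batch, while the flat-map stage expands each batch back into one record per sentence, which is why the pipeline uses `beam.Map` for the first and `beam.FlatMap` for the second.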
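Generating the Random Projection Weight Matrix
Random projection is a simple yet powerful technique for reducing the dimensionality of a set of points that lie in Euclidean space. For the theoretical background, see the Johnson-Lindenstrauss lemma.
Reducing the dimensionality of the embeddings with random projection means that less time is needed to build and query the ANN index.
In this tutorial, we use Gaussian random projection from the Scikit-learn library.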
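The idea can be sketched without Scikit-learn. The example below draws matrix entries from N(0, 1/projected_dim), which is the scaling Scikit-learn documents for its Gaussian random projection, and checks how well one pairwise distance survives the projection. The helper names, the fixed seed, and the random test vectors are assumptions made for illustration.

```python
import math
import random

def gaussian_projection_matrix(original_dim, projected_dim, rng):
    """Returns an original_dim x projected_dim matrix with N(0, 1/projected_dim) entries."""
    scale = 1.0 / math.sqrt(projected_dim)
    return [[rng.gauss(0.0, scale) for _ in range(projected_dim)]
            for _ in range(original_dim)]

def project(vector, matrix):
    """Computes the row-vector product vector . matrix."""
    return [sum(v * row[j] for v, row in zip(vector, matrix))
            for j in range(len(matrix[0]))]

def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

rng = random.Random(0)  # Fixed seed so the run is repeatable.
original_dim, projected_dim = 128, 64
matrix = gaussian_projection_matrix(original_dim, projected_dim, rng)

u = [rng.uniform(-1, 1) for _ in range(original_dim)]
v = [rng.uniform(-1, 1) for _ in range(original_dim)]
ratio = distance(project(u, matrix), project(v, matrix)) / distance(u, v)
print(len(project(u, matrix)), round(ratio, 2))
```

The printed ratio should land near 1: the projected vectors are half as long, yet the distance between them stays roughly the same, which is the property the Johnson-Lindenstrauss lemma describes.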
def generate_random_projection_weights(original_dim, projected_dim):
  random_projection_matrix = None
  random_projection_matrix = gaussian_random_matrix(
      n_components=projected_dim, n_features=original_dim).T
  print("A Gaussian random weight matrix was creates with shape of {}".format(random_projection_matrix.shape))
  print('Storing random projection matrix to disk...')
  with open('random_projection_matrix', 'wb') as handle:
    pickle.dump(random_projection_matrix,
                handle, protocol=pickle.HIGHEST_PROTOCOL)

  return random_projection_matrix
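Set parameters
If you want to build an index using the original embedding space without random projection, set the projected_dim parameter to None. Note that this will slow down the indexing step for high-dimensional embeddings.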
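The cell that defines these parameters is not reproduced in this text. Judging from the pipeline output further down (a `module_url` of `https://tfhub.dev/google/nnlm-en-dim128/2` and a projection matrix of shape (128, 64)), it presumably amounts to something like the following; treat the exact form as an assumption.

```python
# Values inferred from the pipeline arguments printed later in this tutorial.
module_url = 'https://tfhub.dev/google/nnlm-en-dim128/2'
projected_dim = 64  # Set to None to keep the original 128-dimensional space.
```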
Run pipeline
import tempfile

output_dir = tempfile.mkdtemp()
original_dim = hub.load(module_url)(['']).shape[1]
random_projection_matrix = None

if projected_dim:
  random_projection_matrix = generate_random_projection_weights(
      original_dim, projected_dim)

args = {
    'job_name': 'hub2emb-{}'.format(datetime.utcnow().strftime('%y%m%d-%H%M%S')),
    'runner': 'DirectRunner',
    'batch_size': 1024,
    'data_dir': 'corpus/*.txt',
    'output_dir': output_dir,
    'module_url': module_url,
    'random_projection_matrix': random_projection_matrix,
}

print("Pipeline args are set.")
args
A Gaussian random weight matrix was creates with shape of (128, 64)
Storing random projection matrix to disk...
Pipeline args are set.
/tmpfs/src/tf_docs_env/lib/python3.6/site-packages/sklearn/utils/deprecation.py:86: FutureWarning: Function gaussian_random_matrix is deprecated; gaussian_random_matrix is deprecated in 0.22 and will be removed in version 0.24.
  warnings.warn(msg, category=FutureWarning)
{'job_name': 'hub2emb-210107-125029',
 'runner': 'DirectRunner',
 'batch_size': 1024,
 'data_dir': 'corpus/*.txt',
 'output_dir': '/tmp/tmp0g361gzp',
 'module_url': 'https://tfhub.dev/google/nnlm-en-dim128/2',
 'random_projection_matrix': array([[-0.1349755 , -0.12082699, 0.07092581, ..., -0.02680793, -0.0459312 , -0.20462361],
        [-0.06197901, 0.01832142, 0.21362496, ..., 0.06641898, 0.14553738, -0.117217 ],
        [ 0.03452009, 0.14239163, 0.01371371, ..., 0.10422342, 0.02966668, -0.07094185],
        ...,
        [ 0.03384223, 0.05102025, 0.01941788, ..., -0.07500625, 0.09584965, -0.08593636],
        [ 0.11010087, -0.10597793, 0.06668758, ..., -0.0518654 , -0.14681441, 0.08449293],
        [ 0.26909502, -0.0291555 , 0.04305639, ..., -0.02295843, 0.1164921 , -0.04828371]])}
print("Running pipeline...")
%time run_hub2emb(args)
print("Pipeline is done.")
WARNING:apache_beam.runners.interactive.interactive_environment:Dependencies required for Interactive Beam PCollection visualization are not available, please use: `pip install apache-beam[interactive]` to install necessary dependencies to enable all data visualization features.
Running pipeline...
Warning:tensorflow:5 out of the last 5 calls to <function recreate_function.<locals>.restored_function_body at 0x7efcac3599d8> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. For (3), please refer to https://tensorflow.dev.org.tw/guide/function#controlling_retracing and https://tensorflow.dev.org.tw/api_docs/python/tf/function for more details.
Warning:tensorflow:6 out of the last 6 calls to <function recreate_function.<locals>.restored_function_body at 0x7efcac475598> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. For (3), please refer to https://tensorflow.dev.org.tw/guide/function#controlling_retracing and https://tensorflow.dev.org.tw/api_docs/python/tf/function for more details.
WARNING:apache_beam.io.tfrecordio:Couldn't find python-snappy so the implementation of _TFRecordUtil._masked_crc32c is not as fast as it could be.
CPU times: user 9min 4s, sys: 10min 14s, total: 19min 19s
Wall time: 2min 30s
Pipeline is done.
ls {output_dir}
emb-00000-of-00001.tfrecords
Read some of the generated embeddings...
embed_file = os.path.join(output_dir, 'emb-00000-of-00001.tfrecords')
sample = 5

# Create a description of the features.
feature_description = {
    'text': tf.io.FixedLenFeature([], tf.string),
    'embedding': tf.io.FixedLenFeature([projected_dim], tf.float32)
}

def _parse_example(example):
  # Parse the input `tf.Example` proto using the dictionary above.
  return tf.io.parse_single_example(example, feature_description)

dataset = tf.data.TFRecordDataset(embed_file)
for record in dataset.take(sample).map(_parse_example):
  print("{}: {}".format(record['text'].numpy().decode('utf-8'), record['embedding'].numpy()[:10]))
headline_text: [ 0.07743962 -0.10065071 -0.03604915 0.03902601 0.02538098 -0.01991337 -0.11972483 0.03102058 0.16498186 -0.04299153]
aba decides against community broadcasting licence: [ 0.02420221 -0.07736929 0.05655728 -0.18739551 0.11344934 0.12652674 -0.18189304 0.00422473 0.13149698 0.01910412]
act fire witnesses must be aware of defamation: [-0.17413895 -0.05418579 0.07769868 0.05096476 0.08622053 0.33112594 0.04067763 0.00448784 0.15882017 0.33829722]
a g calls for infrastructure protection summit: [ 0.16939437 -0.18585566 -0.14201084 -0.21779229 -0.1374832 0.14933842 -0.19583155 0.12921487 0.09811856 0.099967 ]
air nz staff in aust strike for pay rise: [ 0.0230642 -0.03269081 0.18271443 0.23761444 -0.01575144 0.06109515 -0.01963143 -0.05211507 0.06050447 -0.20023327]
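3. Build the ANN Index for the Embeddings
ANNOY (Approximate Nearest Neighbors Oh Yeah) is a C++ library with Python bindings that searches for points in space that are close to a given query point. It also creates large read-only, file-based data structures that are memory-mapped. It is built and used by Spotify for music recommendations. If you are interested, you can also try alternatives to ANNOY such as NGT and FAISS.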
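The index in this tutorial is built with `metric='angular'`. According to the ANNOY README, that distance is the Euclidean distance between the normalized vectors, sqrt(2 * (1 - cos(u, v))). The sketch below computes it in plain Python for intuition; `angular_distance` is a hypothetical helper, not part of the ANNOY API.

```python
import math

def angular_distance(u, v):
    """sqrt(2 * (1 - cos(u, v))): Euclidean distance between unit-normalized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = dot / (norm_u * norm_v)
    # Clamp at zero to guard against tiny negative values from rounding.
    return math.sqrt(max(0.0, 2.0 * (1.0 - cosine)))

print(angular_distance([1.0, 0.0], [2.0, 0.0]))   # same direction
print(angular_distance([1.0, 0.0], [0.0, 3.0]))   # orthogonal
print(angular_distance([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

The distance depends only on direction, not on vector length, and ranges from 0 for vectors pointing the same way to 2 for opposite vectors.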
def build_index(embedding_files_pattern, index_filename, vector_length,
    metric='angular', num_trees=100):
  '''Builds an ANNOY index'''

  annoy_index = annoy.AnnoyIndex(vector_length, metric=metric)
  # Mapping between the item and its identifier in the index
  mapping = {}

  embed_files = tf.io.gfile.glob(embedding_files_pattern)
  num_files = len(embed_files)
  print('Found {} embedding file(s).'.format(num_files))

  item_counter = 0
  for i, embed_file in enumerate(embed_files):
    print('Loading embeddings in file {} of {}...'.format(i+1, num_files))
    dataset = tf.data.TFRecordDataset(embed_file)
    for record in dataset.map(_parse_example):
      text = record['text'].numpy().decode("utf-8")
      embedding = record['embedding'].numpy()
      mapping[item_counter] = text
      annoy_index.add_item(item_counter, embedding)
      item_counter += 1
      if item_counter % 100000 == 0:
        print('{} items loaded to the index'.format(item_counter))

  print('A total of {} items added to the index'.format(item_counter))

  print('Building the index with {} trees...'.format(num_trees))
  annoy_index.build(n_trees=num_trees)
  print('Index is successfully built.')

  print('Saving index to disk...')
  annoy_index.save(index_filename)
  print('Index is saved to disk.')
  print("Index file size: {} GB".format(
      round(os.path.getsize(index_filename) / float(1024 ** 3), 2)))
  annoy_index.unload()

  print('Saving mapping to disk...')
  with open(index_filename + '.mapping', 'wb') as handle:
    pickle.dump(mapping, handle, protocol=pickle.HIGHEST_PROTOCOL)
  print('Mapping is saved to disk.')
  print("Mapping file size: {} MB".format(
      round(os.path.getsize(index_filename + '.mapping') / float(1024 ** 2), 2)))
embedding_files = "{}/emb-*.tfrecords".format(output_dir)
embedding_dimension = projected_dim
index_filename = "index"
!rm {index_filename}
!rm {index_filename}.mapping
%time build_index(embedding_files, index_filename, embedding_dimension)
rm: cannot remove 'index': No such file or directory
rm: cannot remove 'index.mapping': No such file or directory
Found 1 embedding file(s).
Loading embeddings in file 1 of 1...
100000 items loaded to the index
200000 items loaded to the index
300000 items loaded to the index
400000 items loaded to the index
500000 items loaded to the index
600000 items loaded to the index
700000 items loaded to the index
800000 items loaded to the index
900000 items loaded to the index
1000000 items loaded to the index
1100000 items loaded to the index
A total of 1103664 items added to the index
Building the index with 100 trees...
Index is successfully built.
Saving index to disk...
Index is saved to disk.
Index file size: 1.61 GB
Saving mapping to disk...
Mapping is saved to disk.
Mapping file size: 50.61 MB
CPU times: user 9min 54s, sys: 53.9 s, total: 10min 48s
Wall time: 5min 5s
ls
corpus random_projection_matrix index raw.tsv index.mapping tf2_semantic_approximate_nearest_neighbors.ipynb
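4. Use the Index for Similarity Matching
Now we can use the ANN index to find news headlines that are semantically close to an input query.
Load the index and the mapping files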
index = annoy.AnnoyIndex(embedding_dimension)
index.load(index_filename, prefault=True)
print('Annoy index is loaded.')
with open(index_filename + '.mapping', 'rb') as handle:
  mapping = pickle.load(handle)
print('Mapping file is loaded.')
Annoy index is loaded.
/tmpfs/src/tf_docs_env/lib/python3.6/site-packages/ipykernel_launcher.py:1: FutureWarning: The default argument for metric will be removed in future version of Annoy. Please pass metric='angular' explicitly.
  """Entry point for launching an IPython kernel.
Mapping file is loaded.
Similarity matching method
def find_similar_items(embedding, num_matches=5):
  '''Finds similar items to a given embedding in the ANN index'''
  ids = index.get_nns_by_vector(
      embedding, num_matches, search_k=-1, include_distances=False)
  items = [mapping[i] for i in ids]
  return items
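Passing `search_k=-1` leaves the search budget at ANNOY's default. The ANNOY README describes that default as n * n_trees, where n is the number of neighbors requested; the tiny sketch below just spells out the arithmetic for the index built here with 100 trees. `default_search_k` is a hypothetical helper for illustration and is not an ANNOY function.

```python
def default_search_k(num_matches, num_trees):
    """Node budget the ANNOY README describes when search_k is left at -1."""
    return num_matches * num_trees

print(default_search_k(5, 100))   # five matches, as in the function default
print(default_search_k(10, 100))  # ten matches
```

A larger `search_k` generally gives more accurate results at the cost of slower queries, so it is the main query-time knob for the speed and accuracy trade-off.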
Extract the embedding from a given query
# Load the TF-Hub module
print("Loading the TF-Hub module...")
%time embed_fn = hub.load(module_url)
print("TF-Hub module is loaded.")

random_projection_matrix = None
if os.path.exists('random_projection_matrix'):
  print("Loading random projection matrix...")
  with open('random_projection_matrix', 'rb') as handle:
    random_projection_matrix = pickle.load(handle)
  print('random projection matrix is loaded.')

def extract_embeddings(query):
  '''Generates the embedding for the query'''
  query_embedding = embed_fn([query])[0].numpy()
  if random_projection_matrix is not None:
    query_embedding = query_embedding.dot(random_projection_matrix)
  return query_embedding
Loading the TF-Hub module...
CPU times: user 757 ms, sys: 619 ms, total: 1.38 s
Wall time: 1.37 s
TF-Hub module is loaded.
Loading random projection matrix...
random projection matrix is loaded.
extract_embeddings("Hello Machine Learning!")[:10]
array([ 0.12164804, 0.0162079 , -0.15466002, -0.14580576, 0.03926325, -0.10124508, -0.1333948 , 0.0515029 , -0.14688903, -0.09971556])
Enter a query to find the most similar items
Generating embedding for the query...
CPU times: user 5.18 ms, sys: 596 µs, total: 5.77 ms
Wall time: 2.19 ms
Finding relevant items in the index...
CPU times: user 555 µs, sys: 327 µs, total: 882 µs
Wall time: 601 µs
Results:
=========
confronting global challenges
emerging nations to help struggling global economy
g7 warns of increasing global economic crisis
world struggling to cope with global terrorism
companies health to struggle amid global crisis
external risks biggest threat to economy
asian giants unite to tackle global crisis
g7 ministers warn of slowing global growth
experts to discuss global warming threat
scientists warn of growing natural disasters
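The cell that produced this output is not reproduced in this text. In the notebook it would presumably call `extract_embeddings(query)` and then `find_similar_items(query_embedding, 10)`, since ten results are listed. The sketch below shows how the two steps compose, using hypothetical stand-ins (`toy_embed`, `toy_lookup`, `toy_corpus`) so that it runs without the index on disk; the query string is a guess based on the first result.

```python
def search(query, embed, lookup, num_matches=10):
    """Embeds the query, then asks the index for the closest items."""
    return lookup(embed(query), num_matches)

# Hypothetical stand-ins so the sketch runs without the index on disk.
toy_corpus = {
    'g7 warns of increasing global economic crisis': [1.0, 0.0],
    'experts to discuss global warming threat': [0.7, 0.7],
    'ambitious olsson wins triple jump': [0.0, 1.0],
}

def toy_embed(query):
    """Invented 2-dimensional embedding keyed on a single word."""
    return [1.0, 0.0] if 'global' in query else [0.0, 1.0]

def toy_lookup(embedding, num_matches):
    """Exact dot-product ranking standing in for index.get_nns_by_vector."""
    def score(item):
        return sum(a * b for a, b in zip(embedding, toy_corpus[item]))
    return sorted(toy_corpus, key=score, reverse=True)[:num_matches]

results = search('confronting global challenges', toy_embed, toy_lookup, num_matches=2)
for item in results:
    print(item)
```

With the real functions from this tutorial, the call would be `search(query, extract_embeddings, find_similar_items)`.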
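Want to learn more?
You can learn more about TensorFlow at tensorflow.org and see the TF-Hub API documentation at tensorflow.org/hub. Find available TensorFlow Hub modules at tfhub.dev, including more text embedding modules and image feature vector modules.
Also check out the Machine Learning Crash Course, Google's fast-paced, practical introduction to machine learning.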