跨語言相似度和語意搜尋引擎與多語言通用句子編碼器

在 TensorFlow.org 上檢視 在 Google Colab 中執行 在 GitHub 上檢視 下載筆記本 查看 TF Hub 模型

此筆記本說明如何存取多語言通用句子編碼器模組,並將其用於跨多種語言的句子相似度。此模組是原始通用編碼器模組的擴展。

筆記本的結構如下

  • 第一節顯示語言對之間句子的視覺化。這是一個更學術性的練習。
  • 在第二節中,我們將展示如何從多種語言的維基百科語料庫樣本建立語意搜尋引擎。

引用

使用本 Colab 中探索的模型的研究論文應引用

用於語意檢索的多語言通用句子編碼器

Yinfei Yang、Daniel Cer、Amin Ahmad、Mandy Guo、Jax Law、Noah Constant、Gustavo Hernandez Abrego、Steve Yuan、Chris Tar、Yun-Hsuan Sung、Brian Strope 和 Ray Kurzweil。2019 年。arXiv 預印本 arXiv:1907.04307

設定

本節設定環境以存取多語言通用句子編碼器模組,並準備一組英文句子及其翻譯。在以下各節中,多語言模組將用於計算跨語言的相似度。

設定環境

設定通用匯入和函式

這是額外的樣板程式碼,我們在其中匯入預先訓練的 ML 模型,我們將在整個筆記本中使用該模型來編碼文字。

# The 16-language multilingual module is the default but feel free
# to pick others from the list and compare the results.
module_url = 'https://tfhub.dev/google/universal-sentence-encoder-multilingual/3'

model = hub.load(module_url)

def embed_text(input):
  return model(input)

視覺化語言之間的文字相似度

現在有了句子嵌入,我們可以視覺化不同語言之間的語意相似度。

計算文字嵌入

我們首先定義一組平行翻譯成各種語言的句子。然後,我們預先計算所有句子的嵌入。

# Some texts of different lengths in different languages.
arabic_sentences = ['كلب', 'الجراء لطيفة.', 'أستمتع بالمشي لمسافات طويلة على طول الشاطئ مع كلبي.']
chinese_sentences = ['狗', '小狗很好。', '我喜欢和我的狗一起沿着海滩散步。']
english_sentences = ['dog', 'Puppies are nice.', 'I enjoy taking long walks along the beach with my dog.']
french_sentences = ['chien', 'Les chiots sont gentils.', 'J\'aime faire de longues promenades sur la plage avec mon chien.']
german_sentences = ['Hund', 'Welpen sind nett.', 'Ich genieße lange Spaziergänge am Strand entlang mit meinem Hund.']
italian_sentences = ['cane', 'I cuccioli sono carini.', 'Mi piace fare lunghe passeggiate lungo la spiaggia con il mio cane.']
japanese_sentences = ['犬', '子犬はいいです', '私は犬と一緒にビーチを散歩するのが好きです']
korean_sentences = ['개', '강아지가 좋다.', '나는 나의 개와 해변을 따라 길게 산책하는 것을 즐긴다.']
russian_sentences = ['собака', 'Милые щенки.', 'Мне нравится подолгу гулять по пляжу со своей собакой.']
spanish_sentences = ['perro', 'Los cachorros son agradables.', 'Disfruto de dar largos paseos por la playa con mi perro.']

# Multilingual example
multilingual_example = ["Willkommen zu einfachen, aber", "verrassend krachtige", "multilingüe", "compréhension du language naturel", "модели.", "大家是什么意思" , "보다 중요한", ".اللغة التي يتحدثونها"]
multilingual_example_in_en =  ["Welcome to simple yet", "surprisingly powerful", "multilingual", "natural language understanding", "models.", "What people mean", "matters more than", "the language they speak."]
# Compute embeddings.
ar_result = embed_text(arabic_sentences)
en_result = embed_text(english_sentences)
es_result = embed_text(spanish_sentences)
de_result = embed_text(german_sentences)
fr_result = embed_text(french_sentences)
it_result = embed_text(italian_sentences)
ja_result = embed_text(japanese_sentences)
ko_result = embed_text(korean_sentences)
ru_result = embed_text(russian_sentences)
zh_result = embed_text(chinese_sentences)

multilingual_result = embed_text(multilingual_example)
multilingual_in_en_result = embed_text(multilingual_example_in_en)

視覺化相似度

有了文字嵌入,我們可以取它們的點積,以視覺化語言之間句子的相似程度。顏色越深表示嵌入在語意上越相似。

多語言相似度

visualize_similarity(multilingual_in_en_result, multilingual_result,
                     multilingual_example_in_en, multilingual_example,  "Multilingual Universal Sentence Encoder for Semantic Retrieval (Yang et al., 2019)")

英文-阿拉伯文相似度

visualize_similarity(en_result, ar_result, english_sentences, arabic_sentences, 'English-Arabic Similarity')

英文-俄文相似度

visualize_similarity(en_result, ru_result, english_sentences, russian_sentences, 'English-Russian Similarity')

英文-西班牙文相似度

visualize_similarity(en_result, es_result, english_sentences, spanish_sentences, 'English-Spanish Similarity')

英文-義大利文相似度

visualize_similarity(en_result, it_result, english_sentences, italian_sentences, 'English-Italian Similarity')

義大利文-西班牙文相似度

visualize_similarity(it_result, es_result, italian_sentences, spanish_sentences, 'Italian-Spanish Similarity')

英文-中文相似度

visualize_similarity(en_result, zh_result, english_sentences, chinese_sentences, 'English-Chinese Similarity')

英文-韓文相似度

visualize_similarity(en_result, ko_result, english_sentences, korean_sentences, 'English-Korean Similarity')

中文-韓文相似度

visualize_similarity(zh_result, ko_result, chinese_sentences, korean_sentences, 'Chinese-Korean Similarity')

以及更多...

以上範例可以擴展到英文、阿拉伯文、中文、荷蘭文、法文、德文、義大利文、日文、韓文、波蘭文、葡萄牙文、俄文、西班牙文、泰文和土耳其文中的任何語言對。祝您編碼愉快!

建立多語言語意相似度搜尋引擎

在前一個範例中,我們視覺化了少數句子,在本節中,我們將從維基百科語料庫建立約 200,000 個句子的語意搜尋索引。大約一半將是英文,另一半是西班牙文,以展示通用句子編碼器的多語言功能。

下載要索引的資料

首先,我們將從 新聞評論語料庫 [1] 下載多種語言的新聞句子。在不失一般性的情況下,此方法也應適用於索引其餘支援的語言。

為了加速演示,我們將每種語言限制為 1000 個句子。

corpus_metadata = [
    ('ar', 'ar-en.txt.zip', 'News-Commentary.ar-en.ar', 'Arabic'),
    ('zh', 'en-zh.txt.zip', 'News-Commentary.en-zh.zh', 'Chinese'),
    ('en', 'en-es.txt.zip', 'News-Commentary.en-es.en', 'English'),
    ('ru', 'en-ru.txt.zip', 'News-Commentary.en-ru.ru', 'Russian'),
    ('es', 'en-es.txt.zip', 'News-Commentary.en-es.es', 'Spanish'),
]

language_to_sentences = {}
language_to_news_path = {}
for language_code, zip_file, news_file, language_name in corpus_metadata:
  zip_path = tf.keras.utils.get_file(
      fname=zip_file,
      origin='http://opus.nlpl.eu/download.php?f=News-Commentary/v11/moses/' + zip_file,
      extract=True)
  news_path = os.path.join(os.path.dirname(zip_path), news_file)
  language_to_sentences[language_code] = pd.read_csv(news_path, sep='\t', header=None)[0][:1000]
  language_to_news_path[language_code] = news_path

  print('{:,} {} sentences'.format(len(language_to_sentences[language_code]), language_name))
Downloading data from http://opus.nlpl.eu/download.php?f=News-Commentary/v11/moses/ar-en.txt.zip
24715264/24714354 [==============================] - 2s 0us/step
1,000 Arabic sentences
Downloading data from http://opus.nlpl.eu/download.php?f=News-Commentary/v11/moses/en-zh.txt.zip
18104320/18101984 [==============================] - 2s 0us/step
1,000 Chinese sentences
Downloading data from http://opus.nlpl.eu/download.php?f=News-Commentary/v11/moses/en-es.txt.zip
28106752/28106064 [==============================] - 2s 0us/step
1,000 English sentences
Downloading data from http://opus.nlpl.eu/download.php?f=News-Commentary/v11/moses/en-ru.txt.zip
24854528/24849511 [==============================] - 2s 0us/step
1,000 Russian sentences
1,000 Spanish sentences

使用預先訓練的模型將句子轉換為向量

我們分批計算嵌入,以便它們適合 GPU 的 RAM。

# Takes about 3 minutes

batch_size = 2048
language_to_embeddings = {}
for language_code, zip_file, news_file, language_name in corpus_metadata:
  print('\nComputing {} embeddings'.format(language_name))
  with tqdm(total=len(language_to_sentences[language_code])) as pbar:
    for batch in pd.read_csv(language_to_news_path[language_code], sep='\t',header=None, chunksize=batch_size):
      language_to_embeddings.setdefault(language_code, []).extend(embed_text(batch[0]))
      pbar.update(len(batch))
0%|          | 0/1000 [00:00<?, ?it/s]
Computing Arabic embeddings
83178it [00:30, 2768.60it/s]
  0%|          | 0/1000 [00:00<?, ?it/s]
Computing Chinese embeddings
69206it [00:18, 3664.60it/s]
  0%|          | 0/1000 [00:00<?, ?it/s]
Computing English embeddings
238853it [00:37, 6319.00it/s]
  0%|          | 0/1000 [00:00<?, ?it/s]
Computing Russian embeddings
190092it [00:34, 5589.16it/s]
  0%|          | 0/1000 [00:00<?, ?it/s]
Computing Spanish embeddings
238819it [00:41, 5754.02it/s]

建立語意向量索引

我們使用 SimpleNeighbors 程式庫(它是 Annoy 程式庫的封裝器)來有效地從語料庫中查找結果。

%%time

# Takes about 8 minutes

num_index_trees = 40
language_name_to_index = {}
embedding_dimensions = len(list(language_to_embeddings.values())[0][0])
for language_code, zip_file, news_file, language_name in corpus_metadata:
  print('\nAdding {} embeddings to index'.format(language_name))
  index = SimpleNeighbors(embedding_dimensions, metric='dot')

  for i in trange(len(language_to_sentences[language_code])):
    index.add_one(language_to_sentences[language_code][i], language_to_embeddings[language_code][i])

  print('Building {} index with {} trees...'.format(language_name, num_index_trees))
  index.build(n=num_index_trees)
  language_name_to_index[language_name] = index
0%|          | 1/1000 [00:00<02:21,  7.04it/s]
Adding Arabic embeddings to index
100%|██████████| 1000/1000 [02:06<00:00,  7.90it/s]
  0%|          | 1/1000 [00:00<01:53,  8.84it/s]
Building Arabic index with 40 trees...

Adding Chinese embeddings to index
100%|██████████| 1000/1000 [02:05<00:00,  7.99it/s]
  0%|          | 1/1000 [00:00<01:59,  8.39it/s]
Building Chinese index with 40 trees...

Adding English embeddings to index
100%|██████████| 1000/1000 [02:07<00:00,  7.86it/s]
  0%|          | 1/1000 [00:00<02:17,  7.26it/s]
Building English index with 40 trees...

Adding Russian embeddings to index
100%|██████████| 1000/1000 [02:06<00:00,  7.91it/s]
  0%|          | 1/1000 [00:00<02:03,  8.06it/s]
Building Russian index with 40 trees...

Adding Spanish embeddings to index
100%|██████████| 1000/1000 [02:07<00:00,  7.84it/s]
Building Spanish index with 40 trees...
CPU times: user 11min 21s, sys: 2min 14s, total: 13min 35s
Wall time: 10min 33s

%%time

# Takes about 13 minutes

num_index_trees = 60
print('Computing mixed-language index')
combined_index = SimpleNeighbors(embedding_dimensions, metric='dot')
for language_code, zip_file, news_file, language_name in corpus_metadata:
  print('Adding {} embeddings to mixed-language index'.format(language_name))
  for i in trange(len(language_to_sentences[language_code])):
    annotated_sentence = '({}) {}'.format(language_name, language_to_sentences[language_code][i])
    combined_index.add_one(annotated_sentence, language_to_embeddings[language_code][i])

print('Building mixed-language index with {} trees...'.format(num_index_trees))
combined_index.build(n=num_index_trees)
0%|          | 1/1000 [00:00<02:00,  8.29it/s]
Computing mixed-language index
Adding Arabic embeddings to mixed-language index
100%|██████████| 1000/1000 [02:06<00:00,  7.92it/s]
  0%|          | 1/1000 [00:00<02:24,  6.89it/s]
Adding Chinese embeddings to mixed-language index
100%|██████████| 1000/1000 [02:05<00:00,  7.95it/s]
  0%|          | 1/1000 [00:00<02:05,  7.98it/s]
Adding English embeddings to mixed-language index
100%|██████████| 1000/1000 [02:06<00:00,  7.88it/s]
  0%|          | 1/1000 [00:00<02:18,  7.20it/s]
Adding Russian embeddings to mixed-language index
100%|██████████| 1000/1000 [02:04<00:00,  8.03it/s]
  0%|          | 1/1000 [00:00<02:17,  7.28it/s]
Adding Spanish embeddings to mixed-language index
100%|██████████| 1000/1000 [02:06<00:00,  7.90it/s]
Building mixed-language index with 60 trees...
CPU times: user 11min 18s, sys: 2min 13s, total: 13min 32s
Wall time: 10min 30s

驗證語意相似度搜尋引擎是否運作

在本節中,我們將示範

  1. 語意搜尋功能:從語料庫中檢索在語意上與給定查詢相似的句子。
  2. 多語言功能:當查詢語言和索引語言相符時,以多種語言執行此操作
  3. 跨語言功能:以與索引語料庫不同的語言發出查詢
  4. 混合語言語料庫:對包含所有語言條目的單一索引執行上述所有操作

語意搜尋跨語言功能

在本節中,我們將展示如何檢索與一組範例英文句子相關的句子。可嘗試的事項

  • 嘗試一些不同的範例句子
  • 嘗試變更傳回結果的數量(它們按相似度順序傳回)
  • 嘗試跨語言功能,方法是以不同語言傳回結果(可能想要在某些結果上使用 Google 翻譯到您的母語進行健全性檢查)

English sentences similar to: "The stock market fell four points."
['Nobel laureate Amartya Sen attributed the European crisis to four failures – political, economic, social, and intellectual.',
 'Just last December, fellow economists Martin Feldstein and Nouriel Roubini each penned op-eds bravely questioning bullish market sentiment, sensibly pointing out gold’s risks.',
 'His ratings have dipped below 50% for the first time.',
 'As a result, markets were deregulated, making it easier to trade assets that were perceived to be safe, but were in fact not.',
 'Consider the advanced economies.',
 'But the agreement has three major flaws.',
 'This “predetermined equilibrium” thinking – reflected in the view that markets always self-correct – led to policy paralysis until the Great Depression, when John Maynard Keynes’s argument for government intervention to address unemployment and output gaps gained traction.',
 'Officials underestimated tail risks.',
 'Consider a couple of notorious examples.',
 'Stalin was content to settle for an empire in Eastern Europe.']

混合語料庫功能

我們現在將以英文發出查詢,但結果將來自任何索引語言。

English sentences similar to: "The stock market fell four points."
['Nobel laureate Amartya Sen attributed the European crisis to four failures – political, economic, social, and intellectual.',
 'It was part of the 1945 consensus.',
 'The end of the East-West ideological divide and the end of absolute faith in markets are historical turning points.',
 'Just last December, fellow economists Martin Feldstein and Nouriel Roubini each penned op-eds bravely questioning bullish market sentiment, sensibly pointing out gold’s risks.',
 'His ratings have dipped below 50% for the first time.',
 'As a result, markets were deregulated, making it easier to trade assets that were perceived to be safe, but were in fact not.',
 'Consider the advanced economies.',
 'Since their articles appeared, the price of gold has moved up still further.',
 'But the agreement has three major flaws.',
 'Gold prices even hit a record-high $1,300 recently.',
 'This “predetermined equilibrium” thinking – reflected in the view that markets always self-correct – led to policy paralysis until the Great Depression, when John Maynard Keynes’s argument for government intervention to address unemployment and output gaps gained traction.',
 'What Failed in 2008?',
 'Officials underestimated tail risks.',
 'Consider a couple of notorious examples.',
 'One of these species, orange roughy, has been caught commercially for only around a quarter-century, but already is being fished to the point of collapse.',
 'Meanwhile, policymakers were lulled into complacency by the widespread acceptance of economic theories such as the “efficient-market hypothesis,” which assumes that investors act rationally and use all available information when making their decisions.',
 'Stalin was content to settle for an empire in Eastern Europe.',
 'Intelligence assets have been redirected.',
 'A new wave of what the economist Joseph Schumpeter famously called “creative destruction” is under way: even as central banks struggle to maintain stability by flooding markets with liquidity, credit to business and households is shrinking.',
 'It all came about in a number of ways.',
 'The UN, like the dream of European unity, was also part of the 1945 consensus.',
 'The End of 1945',
 'The Global Economy’s New Path',
 'But this scenario failed to materialize.',
 'Gold prices are extremely sensitive to global interest-rate movements.',
 'Fukushima has presented the world with a far-reaching, fundamental choice.',
 'It was Japan, the high-tech country par excellence (not the latter-day Soviet Union) that proved unable to take adequate precautions to avert disaster in four reactor blocks.',
 'Some European academics tried to argue that there was no need for US-like fiscal transfers, because any desired degree of risk sharing can, in theory, be achieved through financial markets.',
 '$10,000 Gold?',
 'One answer, of course, is a complete collapse of the US dollar.',
 '1929 or 1989?',
 'The goods we made were what economists call “rival" and “excludible" commodities.',
 'This dream quickly faded when the Cold War divided the world into two hostile blocs. But in some ways the 1945 consensus, in the West, was strengthened by Cold War politics.',
 'The first flaw is that the spending reductions are badly timed: coming as they do when the US economy is weak, they risk triggering another recession.',
 'One successful gold investor recently explained to me that stock prices languished for a more than a decade before the Dow Jones index crossed the 1,000 mark in the early 1980’s.',
 'Eichengreen traces our tepid response to the crisis to the triumph of monetarist economists, the disciples of Milton Friedman, over their Keynesian and Minskyite peers – at least when it comes to interpretations of the causes and consequences of the Great Depression.',
 "However, America's unilateral options are limited.",
 'Once it was dark, a screen was set up and Mark showed home videos from space.',
 'These aspirations were often voiced in the United Nations, founded in 1945.',
 'Then I got distracted for about 40 years.']

嘗試您自己的查詢

English sentences similar to: "The stock market fell four points."
['(Chinese) 新兴市场的号角',
 '(English) It was part of the 1945 consensus.',
 '(Russian) Брюссель. Цунами, пронёсшееся по финансовым рынкам, является глобальной катастрофой.',
 '(Arabic) هناك أربعة شروط مسبقة لتحقيق النجاح الأوروبي في أفغانستان:',
 '(Spanish) Su índice de popularidad ha caído por primera vez por debajo del 50 por ciento.',
 '(English) His ratings have dipped below 50% for the first time.',
 '(Russian) Впервые его рейтинг опустился ниже 50%.',
 '(English) As a result, markets were deregulated, making it easier to trade assets that were perceived to be safe, but were in fact not.',
 '(Arabic) وكانت التطورات التي شهدتها سوق العمل أكثر تشجيعا، فهي على النقيض من أسواق الأصول تعكس النتائج وليس التوقعات. وهنا أيضاً كانت الأخبار طيبة. فقد أصبحت سوق العمل أكثر إحكاما، حيث ظلت البطالة عند مستوى 3.5% وكانت نسبة الوظائف إلى الطلبات المقدمة فوق مستوى التعادل.',
 '(Russian) Это было частью консенсуса 1945 года.',
 '(English) Consider the advanced economies.',
 '(English) Since their articles appeared, the price of gold has moved up still further.',
 '(Russian) Тогда они не только смогут накормить свои семьи, но и начать получать рыночную прибыль и откладывать деньги на будущее.',
 '(English) Gold prices even hit a record-high $1,300 recently.',
 '(Chinese) 另一种金融危机',
 '(Russian) Европейская мечта находится в кризисе.',
 '(English) What Failed in 2008?',
 '(Spanish) Pero el acuerdo alcanzado tiene tres grandes defectos.',
 '(English) Officials underestimated tail risks.',
 '(English) Consider a couple of notorious examples.',
 '(Spanish) Los mercados financieros pueden ser frágiles y ofrecen muy poca capacidad de compartir los riesgos relacionados con el ingreso de los trabajadores, que constituye la mayor parte de la renta de cualquier economía avanzada.',
 '(Chinese) 2008年败在何处?',
 '(Spanish) Consideremos las economías avanzadas.',
 '(Spanish) Los bienes producidos se caracterizaron por ser, como señalaron algunos economistas, mercancías “rivales” y “excluyentes”.',
 '(Arabic) إغلاق الفجوة الاستراتيجية في أوروبا',
 '(English) Stalin was content to settle for an empire in Eastern Europe.',
 '(English) Intelligence assets have been redirected.',
 '(Spanish) Hoy, envalentonados por la apreciación continua, algunos están sugiriendo que el oro podría llegar incluso a superar esa cifra.',
 '(Russian) Цены на золото чрезвычайно чувствительны к мировым движениям процентных ставок.',
 '(Russian) Однако у достигнутой договоренности есть три основных недостатка.']

進一步主題

多語言

最後,我們鼓勵您嘗試以任何支援的語言進行查詢:英文、阿拉伯文、中文、荷蘭文、法文、德文、義大利文、日文、韓文、波蘭文、葡萄牙文、俄文、西班牙文、泰文和土耳其文

此外,即使我們僅在部分語言中建立索引,您也可以在任何支援的語言中索引內容。

模型變化

我們提供針對記憶體、延遲和/或品質等各種事物最佳化的通用編碼器模型變體。請隨意試驗它們以找到合適的模型。

最近鄰程式庫

我們使用 Annoy 有效率地查找最近鄰。請參閱 取捨章節,以了解樹狀結構數量(取決於記憶體)和要搜尋的項目數量(取決於延遲)---SimpleNeighbors 僅允許控制樹狀結構數量,但重構程式碼以直接使用 Annoy 應該很簡單,我們只是想讓此程式碼盡可能簡單以供一般使用者使用。

如果 Annoy 無法針對您的應用程式擴展,也請查看 FAISS

祝您一切順利,建立您的多語言語意應用程式!

[1] J. Tiedemann,2012 年,OPUS 中的平行資料、工具和介面。在第 8 屆國際語言資源和評估會議 (LREC 2012) 會議記錄中