Unicode strings

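Introduction

NLP models often handle different languages with different character sets. Unicode is a standard encoding system that is used to represent characters from almost all languages. Every Unicode character is encoded using a unique integer code point between 0 and 0x10FFFF. A Unicode string is a sequence of zero or more code points.

This tutorial shows how to represent Unicode strings in TensorFlow and how to manipulate them using Unicode equivalents of standard string ops. It also separates Unicode strings into tokens based on script detection.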
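
The code-point idea can be checked in plain Python before any TensorFlow is involved. The following is a minimal sketch that uses only the built-ins ord and chr; the variable names are illustrative and are not part of the tutorial's own code.

```python
# Plain-Python illustration (no TensorFlow): a Unicode string is a sequence
# of integer code points.
text = "Thanks 😊"

# ord() maps one character to its code point.
code_points = [ord(ch) for ch in text]
print(code_points)

# Every code point lies in the range 0..0x10FFFF.
assert all(0 <= cp <= 0x10FFFF for cp in code_points)

# chr() is the inverse, so the string can be rebuilt from its code points.
rebuilt = "".join(chr(cp) for cp in code_points)
print(rebuilt == text)
```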

import tensorflow as tf
import numpy as np

The `tf.string` data type

The basic TensorFlow tf.string dtype lets you build tensors of byte strings. Unicode strings are UTF-8 encoded by default.

tf.constant(u"Thanks 😊")

<tf.Tensor: shape=(), dtype=string, numpy=b'Thanks \xf0\x9f\x98\x8a'>

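A tf.string tensor treats byte strings as atomic units. This enables it to store byte strings of varying lengths. The string length is not included in the tensor dimensions.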

tf.constant([u"You're", u"welcome!"]).shape

TensorShape([2])

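If you use Python to construct strings, note that string literals are Unicode by default (in Python 3, the u prefix used in these examples is optional).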
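
The difference between a Python string and its encoded bytes can be sketched as follows. This is a plain-Python illustration under the assumption of Python 3; the variable names are made up for the example.

```python
# Plain-Python illustration: str holds code points, bytes holds encoded bytes.
literal = "Thanks 😊"               # a Python 3 str
encoded = literal.encode("utf-8")    # the UTF-8 byte string

print(type(literal).__name__, len(literal))   # length counted in code points
print(type(encoded).__name__, len(encoded))   # length counted in bytes

# In Python 3 the u prefix does not change the value of the literal.
same_with_prefix = (u"Thanks 😊" == literal)
print(same_with_prefix)
```

The encoded value is the same byte string that tf.constant stores for this text.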

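Representing Unicode

There are two standard ways to represent a Unicode string in TensorFlow:

string scalar: the sequence of code points is encoded using a known character encoding.
int32 vector: each position contains a single code point.

For example, the following three values all represent the Unicode string "语言处理" (which means "language processing" in Chinese):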

# Unicode string, represented as a UTF-8 encoded string scalar.
text_utf8 = tf.constant(u"语言处理")
text_utf8

<tf.Tensor: shape=(), dtype=string, numpy=b'\xe8\xaf\xad\xe8\xa8\x80\xe5\xa4\x84\xe7\x90\x86'>

# Unicode string, represented as a UTF-16-BE encoded string scalar.
text_utf16be = tf.constant(u"语言处理".encode("UTF-16-BE"))
text_utf16be

<tf.Tensor: shape=(), dtype=string, numpy=b'\x8b\xed\x8a\x00Y\x04t\x06'>

# Unicode string, represented as a vector of Unicode code points.
text_chars = tf.constant([ord(char) for char in u"语言处理"])
text_chars

<tf.Tensor: shape=(4,), dtype=int32, numpy=array([35821, 35328, 22788, 29702], dtype=int32)>

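Converting between representations

TensorFlow provides operations to convert between these different representations:

tf.strings.unicode_decode: Converts an encoded string scalar to a vector of code points.
tf.strings.unicode_encode: Converts a vector of code points to an encoded string scalar.
tf.strings.unicode_transcode: Converts an encoded string scalar to a different encoding.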

tf.strings.unicode_decode(text_utf8,
                          input_encoding='UTF-8')

<tf.Tensor: shape=(4,), dtype=int32, numpy=array([35821, 35328, 22788, 29702], dtype=int32)>

tf.strings.unicode_encode(text_chars,
                          output_encoding='UTF-8')

<tf.Tensor: shape=(), dtype=string, numpy=b'\xe8\xaf\xad\xe8\xa8\x80\xe5\xa4\x84\xe7\x90\x86'>

tf.strings.unicode_transcode(text_utf8,
                             input_encoding='UTF-8',
                             output_encoding='UTF-16-BE')

<tf.Tensor: shape=(), dtype=string, numpy=b'\x8b\xed\x8a\x00Y\x04t\x06'>
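
As a cross-check, the same three conversions can be sketched with Python's built-in str and bytes methods. This is only a plain-Python analogue of the decode, encode, and transcode steps, not how the TensorFlow ops are implemented.

```python
# Plain-Python analogue of decode / encode / transcode.
text_utf8 = "语言处理".encode("utf-8")

# "Decode": encoded bytes -> list of code points.
decoded = [ord(ch) for ch in text_utf8.decode("utf-8")]

# "Encode": list of code points -> encoded bytes.
encoded = "".join(chr(cp) for cp in decoded).encode("utf-8")

# "Transcode": bytes in one encoding -> bytes in another encoding.
transcoded = text_utf8.decode("utf-8").encode("utf-16-be")

print(decoded)
print(encoded)
print(transcoded)
```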

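Batch dimensions

When decoding multiple strings, the number of characters in each string may not be equal. The result is a tf.RaggedTensor, where the length of the innermost dimension varies with the number of characters in each string.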

# A batch of Unicode strings, each represented as a UTF8-encoded string.
batch_utf8 = [s.encode('UTF-8') for s in
              [u'hÃllo', u'What is the weather tomorrow', u'Göödnight', u'😊']]
batch_chars_ragged = tf.strings.unicode_decode(batch_utf8,
                                               input_encoding='UTF-8')
for sentence_chars in batch_chars_ragged.to_list():
  print(sentence_chars)

[104, 195, 108, 108, 111]
[87, 104, 97, 116, 32, 105, 115, 32, 116, 104, 101, 32, 119, 101, 97, 116, 104, 101, 114, 32, 116, 111, 109, 111, 114, 114, 111, 119]
[71, 246, 246, 100, 110, 105, 103, 104, 116]
[128522]

You can use this tf.RaggedTensor directly, or convert it to a dense tf.Tensor with padding or to a tf.sparse.SparseTensor using the methods tf.RaggedTensor.to_tensor and tf.RaggedTensor.to_sparse.

batch_chars_padded = batch_chars_ragged.to_tensor(default_value=-1)
print(batch_chars_padded.numpy())

[[   104    195    108    108    111     -1     -1     -1     -1     -1
      -1     -1     -1     -1     -1     -1     -1     -1     -1     -1
      -1     -1     -1     -1     -1     -1     -1     -1]
 [    87    104     97    116     32    105    115     32    116    104
     101     32    119    101     97    116    104    101    114     32
     116    111    109    111    114    114    111    119]
 [    71    246    246    100    110    105    103    104    116     -1
      -1     -1     -1     -1     -1     -1     -1     -1     -1     -1
      -1     -1     -1     -1     -1     -1     -1     -1]
 [128522     -1     -1     -1     -1     -1     -1     -1     -1     -1
      -1     -1     -1     -1     -1     -1     -1     -1     -1     -1
      -1     -1     -1     -1     -1     -1     -1     -1]]

batch_chars_sparse = batch_chars_ragged.to_sparse()

nrows, ncols = batch_chars_sparse.dense_shape.numpy()
elements = [['_' for i in range(ncols)] for j in range(nrows)]
for (row, col), value in zip(batch_chars_sparse.indices.numpy(), batch_chars_sparse.values.numpy()):
  elements[row][col] = str(value)
max_width = max(len(value) for row in elements for value in row)
print('[%s]' % '\n '.join(
    '[%s]' % ', '.join(value.rjust(max_width) for value in row)
    for row in elements))

[[   104,    195,    108,    108,    111,      _,      _,      _,      _,      _,      _,      _,      _,      _,      _,      _,      _,      _,      _,      _,      _,      _,      _,      _,      _,      _,      _,      _]
 [    87,    104,     97,    116,     32,    105,    115,     32,    116,    104,    101,     32,    119,    101,     97,    116,    104,    101,    114,     32,    116,    111,    109,    111,    114,    114,    111,    119]
 [    71,    246,    246,    100,    110,    105,    103,    104,    116,      _,      _,      _,      _,      _,      _,      _,      _,      _,      _,      _,      _,      _,      _,      _,      _,      _,      _,      _]
 [128522,      _,      _,      _,      _,      _,      _,      _,      _,      _,      _,      _,      _,      _,      _,      _,      _,      _,      _,      _,      _,      _,      _,      _,      _,      _,      _,      _]]

When encoding multiple strings with the same length, use a tf.Tensor as the input.

tf.strings.unicode_encode([[99, 97, 116], [100, 111, 103], [99, 111, 119]],
                          output_encoding='UTF-8')

<tf.Tensor: shape=(3,), dtype=string, numpy=array([b'cat', b'dog', b'cow'], dtype=object)>

When encoding multiple strings with varying lengths, use a tf.RaggedTensor as the input.

tf.strings.unicode_encode(batch_chars_ragged, output_encoding='UTF-8')

<tf.Tensor: shape=(4,), dtype=string, numpy=
array([b'h\xc3\x83llo', b'What is the weather tomorrow',
       b'G\xc3\xb6\xc3\xb6dnight', b'\xf0\x9f\x98\x8a'], dtype=object)>

If you have a tensor with multiple strings in padded or sparse format, convert it first into a tf.RaggedTensor before calling tf.strings.unicode_encode.

tf.strings.unicode_encode(
    tf.RaggedTensor.from_sparse(batch_chars_sparse),
    output_encoding='UTF-8')

<tf.Tensor: shape=(4,), dtype=string, numpy=
array([b'h\xc3\x83llo', b'What is the weather tomorrow',
       b'G\xc3\xb6\xc3\xb6dnight', b'\xf0\x9f\x98\x8a'], dtype=object)>

tf.strings.unicode_encode(
    tf.RaggedTensor.from_tensor(batch_chars_padded, padding=-1),
    output_encoding='UTF-8')

<tf.Tensor: shape=(4,), dtype=string, numpy=
array([b'h\xc3\x83llo', b'What is the weather tomorrow',
       b'G\xc3\xb6\xc3\xb6dnight', b'\xf0\x9f\x98\x8a'], dtype=object)>
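
The padded round trip can be sketched in plain Python with nested lists standing in for tensors. The helper strip_padding is hypothetical and only mimics the idea of dropping trailing padding values before encoding; it is not a TensorFlow function.

```python
# Plain-Python sketch: pad ragged rows, strip the padding again, then encode.
rows = [
    [104, 195, 108, 108, 111],
    [71, 246, 246, 100, 110, 105, 103, 104, 116],
    [128522],
]

# Pad every row to the longest length with -1.
width = max(len(row) for row in rows)
padded = [row + [-1] * (width - len(row)) for row in rows]

def strip_padding(row, padding=-1):
    """Hypothetical helper: drop trailing padding values from one row."""
    end = len(row)
    while end > 0 and row[end - 1] == padding:
        end -= 1
    return row[:end]

# Recover the ragged rows, then encode each one as UTF-8.
ragged = [strip_padding(row) for row in padded]
encoded = ["".join(chr(cp) for cp in row).encode("utf-8") for row in ragged]
print(encoded)
```

Encoding the padded rows directly would fail, because -1 is not a valid code point; that is why the padding has to be removed first.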

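Unicode operations

Character length

Use the unit parameter of the tf.strings.length op to indicate how character lengths should be computed. unit defaults to "BYTE", but it can be set to other values, such as "UTF8_CHAR" or "UTF16_CHAR", to determine the number of Unicode code points in each encoded string.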

# Note that the final character takes up 4 bytes in UTF8.
thanks = u'Thanks 😊'.encode('UTF-8')
num_bytes = tf.strings.length(thanks).numpy()
num_chars = tf.strings.length(thanks, unit='UTF8_CHAR').numpy()
print('{} bytes; {} UTF-8 characters'.format(num_bytes, num_chars))

11 bytes; 8 UTF-8 characters
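
One way to see why the two counts differ is to count characters directly from the UTF-8 bytes. The sketch below assumes valid UTF-8 input and relies on the fact that continuation bytes have the bit pattern 10xxxxxx; the function utf8_char_count is illustrative and is not claimed to be how TensorFlow computes the length.

```python
def utf8_char_count(data: bytes) -> int:
    """Count code points in valid UTF-8 by skipping continuation bytes."""
    # A continuation byte has its top two bits set to 10.
    return sum(1 for b in data if (b & 0xC0) != 0x80)

thanks_bytes = "Thanks 😊".encode("utf-8")
num_bytes = len(thanks_bytes)
num_chars = utf8_char_count(thanks_bytes)
print('{} bytes; {} UTF-8 characters'.format(num_bytes, num_chars))
```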

Character substrings

The tf.strings.substr op accepts the unit parameter and uses it to determine what kind of offsets the pos and len parameters contain.

# Here, unit='BYTE' (default). Returns a single byte with len=1
tf.strings.substr(thanks, pos=7, len=1).numpy()

b'\xf0'

# Specifying unit='UTF8_CHAR', returns a single 4 byte character in this case
print(tf.strings.substr(thanks, pos=7, len=1, unit='UTF8_CHAR').numpy())

b'\xf0\x9f\x98\x8a'
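
The same distinction shows up in plain Python when slicing bytes versus slicing a str. This sketch uses illustrative variable names and shows that a byte-level slice can cut a character in half.

```python
# Plain-Python sketch: byte offsets versus character offsets.
text = "Thanks 😊"
data = text.encode("utf-8")

byte_slice = data[7:8]                    # one byte, starting at byte 7
char_slice = text[7:8].encode("utf-8")    # one character, starting at character 7

# The single byte is only the first of four bytes, so it is not valid UTF-8.
try:
    byte_slice.decode("utf-8")
    byte_slice_is_valid = True
except UnicodeDecodeError:
    byte_slice_is_valid = False

print(byte_slice, char_slice, byte_slice_is_valid)
```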

Split Unicode strings

The tf.strings.unicode_split operation splits Unicode strings into substrings of individual characters.

tf.strings.unicode_split(thanks, 'UTF-8').numpy()

array([b'T', b'h', b'a', b'n', b'k', b's', b' ', b'\xf0\x9f\x98\x8a'],
      dtype=object)

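Byte offsets for characters

To align the character tensor generated by tf.strings.unicode_decode with the original string, it's useful to know the offset at which each character begins. The method tf.strings.unicode_decode_with_offsets is similar to unicode_decode, except that it returns a second tensor containing the start offset of each character.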

codepoints, offsets = tf.strings.unicode_decode_with_offsets(u'🎈🎉🎊', 'UTF-8')

for (codepoint, offset) in zip(codepoints.numpy(), offsets.numpy()):
  print('At byte offset {}: codepoint {}'.format(offset, codepoint))

At byte offset 0: codepoint 127880
At byte offset 4: codepoint 127881
At byte offset 8: codepoint 127882
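
The relationship between characters and byte offsets can be sketched in plain Python by accumulating the encoded length of each character. The function decode_with_offsets below is a hypothetical stand-in written for this illustration, not the TensorFlow op, and it assumes valid UTF-8 input.

```python
def decode_with_offsets(data: bytes):
    """Return (code_points, byte_offsets) for a valid UTF-8 byte string."""
    code_points, offsets = [], []
    offset = 0
    for ch in data.decode("utf-8"):
        code_points.append(ord(ch))
        offsets.append(offset)
        # The next character starts after this character's encoded bytes.
        offset += len(ch.encode("utf-8"))
    return code_points, offsets

# Characters of 1, 4, and 2 bytes, so the offsets are unevenly spaced.
code_points, offsets = decode_with_offsets("a😊é".encode("utf-8"))
for codepoint, offset in zip(code_points, offsets):
    print('At byte offset {}: codepoint {}'.format(offset, codepoint))
```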

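Unicode scripts

Each Unicode code point belongs to a single collection of code points known as a script. A character's script is helpful in determining which language the character might be in. For example, knowing that 'Б' is in the Cyrillic script indicates that modern text containing that character is likely from a Slavic language such as Russian or Ukrainian.

TensorFlow provides the tf.strings.unicode_script operation to determine which script a given code point uses. The script codes are int32 values corresponding to International Components for Unicode (ICU) UScriptCode values.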

uscript = tf.strings.unicode_script([33464, 1041])  # ['芸', 'Б']

print(uscript.numpy())  # [17, 8] == [USCRIPT_HAN, USCRIPT_CYRILLIC]

[17  8]
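
Python's standard library does not expose the Unicode script property, but the idea can be approximated from character names. The helper rough_script below is hypothetical: it guesses a label from the prefix of unicodedata.name, covers only a handful of scripts, and is a rough proxy rather than the ICU classification.

```python
import unicodedata

def rough_script(ch: str) -> str:
    """Hypothetical helper: guess a script label from the character's name."""
    name = unicodedata.name(ch, "")
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "HAN"
    for label in ("LATIN", "CYRILLIC", "HIRAGANA", "KATAKANA"):
        if name.startswith(label):
            return label
    # Spaces, punctuation, and anything unrecognized fall through here.
    return "COMMON"

labels = [rough_script(chr(cp)) for cp in [33464, 1041]]  # ['芸', 'Б']
print(labels)
```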

The tf.strings.unicode_script operation can also be applied to multidimensional tf.Tensors or tf.RaggedTensors of code points:

print(tf.strings.unicode_script(batch_chars_ragged))

<tf.RaggedTensor [[25, 25, 25, 25, 25],
 [25, 25, 25, 25, 0, 25, 25, 0, 25, 25, 25, 0, 25, 25, 25, 25, 25, 25, 25,
  0, 25, 25, 25, 25, 25, 25, 25, 25]                                      ,
 [25, 25, 25, 25, 25, 25, 25, 25, 25], [0]]>

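Example: Simple segmentation

Segmentation is the task of splitting text into word-like units. This is often easy when space characters are used to separate words, but some languages (like Chinese and Japanese) do not use spaces, and some languages (like German) contain long compounds that must be split in order to analyze their meaning. In web text, different languages and scripts are frequently mixed together, as in "NY株価" (New York Stock Exchange).

We can perform very rough segmentation (without implementing any ML models) by using changes in script to approximate word boundaries. This works for strings like the "NY株価" example above. It also works for most languages that use spaces, because the space characters of various scripts are all classified as USCRIPT_COMMON, a special script code that differs from that of any actual text.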

# dtype: string; shape: [num_sentences]
#
# The sentences to process.  Edit this line to try out different inputs!
sentence_texts = [u'Hello, world.', u'世界こんにちは']

First, decode the sentences into character code points, and find the script identifier for each character.

# dtype: int32; shape: [num_sentences, (num_chars_per_sentence)]
#
# sentence_char_codepoint[i, j] is the codepoint for the j'th character in
# the i'th sentence.
sentence_char_codepoint = tf.strings.unicode_decode(sentence_texts, 'UTF-8')
print(sentence_char_codepoint)

# dtype: int32; shape: [num_sentences, (num_chars_per_sentence)]
#
# sentence_char_script[i, j] is the Unicode script of the j'th character in
# the i'th sentence.
sentence_char_script = tf.strings.unicode_script(sentence_char_codepoint)
print(sentence_char_script)

<tf.RaggedTensor [[72, 101, 108, 108, 111, 44, 32, 119, 111, 114, 108, 100, 46],
 [19990, 30028, 12371, 12435, 12395, 12385, 12399]]>
<tf.RaggedTensor [[25, 25, 25, 25, 25, 0, 0, 25, 25, 25, 25, 25, 0],
 [17, 17, 20, 20, 20, 20, 20]]>

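Use the script identifiers to determine where word boundaries should be added. Add a word boundary at the beginning of each sentence, and for each character whose script differs from that of the previous character.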

# dtype: bool; shape: [num_sentences, (num_chars_per_sentence)]
#
# sentence_char_starts_word[i, j] is True if the j'th character in the i'th
# sentence is the start of a word.
sentence_char_starts_word = tf.concat(
    [tf.fill([sentence_char_script.nrows(), 1], True),
     tf.not_equal(sentence_char_script[:, 1:], sentence_char_script[:, :-1])],
    axis=1)

# dtype: int64; shape: [num_words]
#
# word_starts[i] is the index of the character that starts the i'th word (in
# the flattened list of characters from all sentences).
word_starts = tf.squeeze(tf.where(sentence_char_starts_word.values), axis=1)
print(word_starts)

tf.Tensor([ 0  5  7 12 13 15], shape=(6,), dtype=int64)
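
The boundary rule can be traced by hand with nested Python lists. The sketch below hard-codes the script codes printed earlier for the two sentences and recomputes the flattened word start indices; it is a plain-Python walk-through of the rule, not a replacement for the tensor code.

```python
# Script codes for 'Hello, world.' and '世界こんにちは', copied from the output above.
sentence_scripts = [
    [25, 25, 25, 25, 25, 0, 0, 25, 25, 25, 25, 25, 0],
    [17, 17, 20, 20, 20, 20, 20],
]

word_starts = []
flat_index = 0  # position in the flattened list of all characters
for scripts in sentence_scripts:
    for j, script in enumerate(scripts):
        # A word starts at the beginning of a sentence or where the script changes.
        if j == 0 or script != scripts[j - 1]:
            word_starts.append(flat_index)
        flat_index += 1

print(word_starts)
```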

You can then use those start offsets to build a RaggedTensor containing the list of words from all batches.

# dtype: int32; shape: [num_words, (num_chars_per_word)]
#
# word_char_codepoint[i, j] is the codepoint for the j'th character in the
# i'th word.
word_char_codepoint = tf.RaggedTensor.from_row_starts(
    values=sentence_char_codepoint.values,
    row_starts=word_starts)
print(word_char_codepoint)

<tf.RaggedTensor [[72, 101, 108, 108, 111], [44, 32], [119, 111, 114, 108, 100], [46],
 [19990, 30028], [12371, 12435, 12395, 12385, 12399]]>
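
Building rows from start offsets amounts to slicing a flat list between consecutive starts. The function rows_from_starts below is a hypothetical plain-Python analogue written for this illustration; it assumes the starts are sorted and begin at 0.

```python
def rows_from_starts(values, row_starts):
    """Split a flat list into rows, where row i begins at row_starts[i]."""
    # Each row ends where the next one starts; the last row ends at len(values).
    row_ends = list(row_starts[1:]) + [len(values)]
    return [values[start:end] for start, end in zip(row_starts, row_ends)]

# Flattened code points of both sentences, and the word starts found above.
flat_codepoints = [ord(ch) for ch in "Hello, world." + "世界こんにちは"]
words = rows_from_starts(flat_codepoints, [0, 5, 7, 12, 13, 15])

# Turn each row of code points back into text for readability.
word_texts = ["".join(chr(cp) for cp in word) for word in words]
print(word_texts)
```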

To finish, segment the word code points RaggedTensor back into sentences and encode into UTF-8 strings for readability.

# dtype: int64; shape: [num_sentences]
#
# sentence_num_words[i] is the number of words in the i'th sentence.
sentence_num_words = tf.reduce_sum(
    tf.cast(sentence_char_starts_word, tf.int64),
    axis=1)

# dtype: int32; shape: [num_sentences, (num_words_per_sentence), (num_chars_per_word)]
#
# sentence_word_char_codepoint[i, j, k] is the codepoint for the k'th character
# in the j'th word in the i'th sentence.
sentence_word_char_codepoint = tf.RaggedTensor.from_row_lengths(
    values=word_char_codepoint,
    row_lengths=sentence_num_words)
print(sentence_word_char_codepoint)

tf.strings.unicode_encode(sentence_word_char_codepoint, 'UTF-8').to_list()

<tf.RaggedTensor [[[72, 101, 108, 108, 111], [44, 32], [119, 111, 114, 108, 100], [46]],
 [[19990, 30028], [12371, 12435, 12395, 12385, 12399]]]>
[[b'Hello', b', ', b'world', b'.'],
 [b'\xe4\xb8\x96\xe7\x95\x8c',
  b'\xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf']]
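
For comparison, the whole segmentation idea fits in a few lines of plain Python. This is a sketch under loud assumptions: rough_script is a hypothetical helper that guesses a script from unicodedata.name prefixes for a few scripts only, and itertools.groupby stands in for the script-change boundary rule. It is not the ICU classification that tf.strings.unicode_script uses.

```python
import itertools
import unicodedata

def rough_script(ch: str) -> str:
    """Hypothetical helper: guess a script label from the character's name."""
    name = unicodedata.name(ch, "")
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "HAN"
    for label in ("LATIN", "CYRILLIC", "HIRAGANA", "KATAKANA"):
        if name.startswith(label):
            return label
    return "COMMON"

def segment(sentence: str):
    """Split a sentence wherever the (rough) script changes."""
    # groupby collects consecutive characters that share the same script label.
    return ["".join(group)
            for _, group in itertools.groupby(sentence, key=rough_script)]

segments = [segment(s) for s in ["Hello, world.", "世界こんにちは", "NY株価"]]
for words in segments:
    print(words)
```

Because spaces and punctuation share one label, ", " stays together as a single unit, which matches the grouping in the output above.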