
TFDS and determinism


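This document explains:

The guarantees TFDS makes about determinism
The order in which TFDS reads examples
Various caveats and gotchas

Setup

Datasets

Some context is needed to understand how TFDS reads the data.

During generation, TFDS writes the original data into standardized .tfrecord files. For big datasets, multiple .tfrecord files are created, each containing multiple examples. We call each .tfrecord file a shard.

This guide uses imagenet, which has 1024 shards: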

import re
import tensorflow_datasets as tfds

imagenet = tfds.builder('imagenet2012')

num_shards = imagenet.info.splits['train'].num_shards
num_examples = imagenet.info.splits['train'].num_examples
print(f'imagenet has {num_shards} shards ({num_examples} examples)')

imagenet has 1024 shards (1281167 examples)

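Finding the dataset example ids

You can skip to the next section if you only want to know about determinism.

Each dataset example is uniquely identified by an id (e.g. 'imagenet2012-train.tfrecord-01023-of-01024__32'). You can recover this id by setting read_config.add_tfds_id = True, which adds a 'tfds_id' key to the dict returned by the tf.data.Dataset.

In this tutorial, we define a small util that prints the example ids of the dataset (converted to integers to be more human-readable).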

def load_dataset(builder, **as_dataset_kwargs):
  """Load the dataset with the tfds_id."""
  read_config = as_dataset_kwargs.pop('read_config', tfds.ReadConfig())
  read_config.add_tfds_id = True  # Set `True` to return the 'tfds_id' key
  return builder.as_dataset(read_config=read_config, **as_dataset_kwargs)

def print_ex_ids(
    builder,
    *,
    take: int,
    skip: int = None,
    **as_dataset_kwargs,
) -> None:
  """Print the example ids from the given dataset split."""
  ds = load_dataset(builder, **as_dataset_kwargs)
  if skip:
    ds = ds.skip(skip)
  ds = ds.take(take)
  exs = [ex['tfds_id'].numpy().decode('utf-8') for ex in ds]
  exs = [id_to_int(tfds_id, builder=builder) for tfds_id in exs]
  print(exs)

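def id_to_int(tfds_id: str, builder) -> int:
  """Format the tfds_id as a more human-readable integer."""
  # Extract the split name, the shard index and the index within the shard.
  match = re.match(r'\w+-(\w+)\.\w+-(\d+)-of-\d+__(\d+)', tfds_id)
  split_name, shard_id, ex_id = match.groups()
  split_info = builder.info.splits[split_name]
  # Global index = examples in all preceding shards + index within the shard.
  return sum(split_info.shard_lengths[:int(shard_id)]) + int(ex_id)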
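
To make the id arithmetic concrete, here is a minimal standalone sketch of the same conversion that does not need a downloaded dataset. The shard lengths and the 'toy' dataset name are made up for illustration; with TFDS, the real lengths come from builder.info.splits[split_name].shard_lengths.

```python
import re

# Hypothetical shard sizes for a tiny 3-shard split (illustration only).
TOY_SHARD_LENGTHS = [4, 5, 3]


def tfds_id_to_index(tfds_id, shard_lengths):
    """Map '<name>-<split>.<ext>-<shard>-of-<total>__<pos>' to a global index."""
    match = re.fullmatch(r'.+-\w+\.\w+-(\d+)-of-(\d+)__(\d+)', tfds_id)
    if match is None:
        raise ValueError(f'Unrecognized tfds_id: {tfds_id!r}')
    shard_id, num_shards, position = (int(g) for g in match.groups())
    if num_shards != len(shard_lengths):
        raise ValueError('Shard count does not match shard_lengths')
    # Examples in all preceding shards, plus the position inside this shard.
    return sum(shard_lengths[:shard_id]) + position


print(tfds_id_to_index('toy-train.tfrecord-00002-of-00003__1', TOY_SHARD_LENGTHS))
```

In this toy split, the example at position 1 of shard 2 comes after the 4 + 5 examples of shards 0 and 1, so its global index is 10.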

Determinism when reading

This section explains the determinism guarantees of tfds.load.

With `shuffle_files=False` (default)

By default (shuffle_files=False), TFDS yields examples deterministically.

# Same as: imagenet.as_dataset(split='train').take(20)
print_ex_ids(imagenet, split='train', take=20)
print_ex_ids(imagenet, split='train', take=20)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1251, 1252, 1253, 1254]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1251, 1252, 1253, 1254]

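For performance, TFDS reads multiple shards at the same time using tf.data.Dataset.interleave. In this example, TFDS switches to the second shard after reading 16 examples (..., 14, 15, 1251, 1252, ...). More on interleave below.

Similarly, the subsplit API is also deterministic: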

print_ex_ids(imagenet, split='train[67%:84%]', take=20)
print_ex_ids(imagenet, split='train[67%:84%]', take=20)

[858382, 858383, 858384, 858385, 858386, 858387, 858388, 858389, 858390, 858391, 858392, 858393, 858394, 858395, 858396, 858397, 859533, 859534, 859535, 859536]
[858382, 858383, 858384, 858385, 858386, 858387, 858388, 858389, 858390, 858391, 858392, 858393, 858394, 858395, 858396, 858397, 859533, 859534, 859535, 859536]

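If you are training for more than one epoch, the above setup is not recommended: every epoch reads the shards in the same order, so the only randomness comes from ds = ds.shuffle(buffer) and is limited by the buffer size.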
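
To see why a shuffle buffer alone gives limited randomness, consider this simplified pure-Python model of a buffered shuffle. It is a sketch of the general technique, not the tf.data implementation, and buffered_shuffle is a hypothetical helper name.

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in a locally shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # Still filling the buffer.
            continue
        # Emit a random buffered element and replace it with the new item.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(100), buffer_size=8, seed=0))
print(shuffled[:10])
```

In this model, the k-th output (counting from 0) can only come from the first buffer_size + k inputs. If the shard order never changes, an example stored late in the read order can therefore never appear early in an epoch.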

With `shuffle_files=True`

With shuffle_files=True, the shards are shuffled for each epoch, so reading is no longer deterministic.

print_ex_ids(imagenet, split='train', shuffle_files=True, take=20)
print_ex_ids(imagenet, split='train', shuffle_files=True, take=20)

[568017, 329050, 329051, 329052, 329053, 329054, 329056, 329055, 568019, 568020, 568021, 568022, 568023, 568018, 568025, 568024, 568026, 568028, 568030, 568031]
[43790, 43791, 43792, 43793, 43796, 43794, 43797, 43798, 43795, 43799, 43800, 43801, 43802, 43803, 43804, 43805, 43806, 43807, 43809, 43810]

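See the recipe below to get deterministic file shuffling.

Determinism caveat: interleave args

Changing read_config.interleave_cycle_length or read_config.interleave_block_length changes the example order.

TFDS relies on tf.data.Dataset.interleave to load only a few shards at once, which improves performance and reduces memory usage.

The example order is only guaranteed to stay the same for fixed values of the interleave args. See the interleave documentation to understand what cycle_length and block_length correspond to.

cycle_length=16, block_length=16 (default, same as above):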

print_ex_ids(imagenet, split='train', take=20)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1251, 1252, 1253, 1254]

cycle_length=3, block_length=2:

read_config = tfds.ReadConfig(
    interleave_cycle_length=3,
    interleave_block_length=2,
)
print_ex_ids(imagenet, split='train', read_config=read_config, take=20)

[0, 1, 1251, 1252, 2502, 2503, 2, 3, 1253, 1254, 2504, 2505, 4, 5, 1255, 1256, 2506, 2507, 6, 7]

In the second example, the dataset reads 2 (block_length=2) examples from a shard, then switches to the next shard. Every 2 * 3 (cycle_length=3) examples, it goes back to the first shard (shard0-ex0, shard0-ex1, shard1-ex0, shard1-ex1, shard2-ex0, shard2-ex1, shard0-ex2, shard0-ex3, shard1-ex2, shard1-ex3, shard2-ex2,...).
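
The read order described above can be sketched with a small pure-Python simulation. This is a simplified model of interleaving, not the tf.data implementation, and simulate_interleave is a hypothetical helper. It assumes that each of the first shards holds 1251 examples, which is consistent with the ids printed above (1251 and 2502 are the first ids of the second and third shards).

```python
from itertools import islice


def simulate_interleave(shard_lengths, cycle_length, block_length):
    """Yield global example ids in a simplified interleave order."""
    # Global id of the first example of each shard.
    offsets, total = [], 0
    for length in shard_lengths:
        offsets.append(total)
        total += length

    def open_shard(shard_id):
        start = offsets[shard_id]
        return iter(range(start, start + shard_lengths[shard_id]))

    pending = iter(range(len(shard_lengths)))  # Shards not opened yet.
    slots = [open_shard(s) for s in islice(pending, cycle_length)]
    while slots:
        next_slots = []
        for shard in slots:
            block = list(islice(shard, block_length))
            yield from block
            if len(block) == block_length:
                next_slots.append(shard)  # Shard may still hold examples.
            else:
                # Shard exhausted: its slot is taken by the next unopened shard.
                shard_id = next(pending, None)
                if shard_id is not None:
                    next_slots.append(open_shard(shard_id))
        slots = next_slots


lengths = [1251] * 20  # Assumed shard sizes, see the note above.
print(list(islice(simulate_interleave(lengths, 3, 2), 20)))
print(list(islice(simulate_interleave(lengths, 16, 16), 20)))
```

Under these assumptions, the two printed lists match the cycle_length=3, block_length=2 output and the default cycle_length=16, block_length=16 output shown above.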

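Subsplit and example order

Each example has an id 0, 1, ..., num_examples-1. The subsplit API selects a slice of examples (e.g. train[:x] selects 0, 1, ..., x-1).

However, within the subsplit, examples are not read in increasing id order (due to shards and interleave).

More specifically, ds.take(x) and split='train[:x]' are not equivalent!

This can be seen easily in the interleave example above, where examples come from different shards.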

print_ex_ids(imagenet, split='train', take=25)  # tfds.load(..., split='train').take(25)
print_ex_ids(imagenet, split='train[:25]', take=-1)  # tfds.load(..., split='train[:25]')

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1251, 1252, 1253, 1254, 1255, 1256, 1257, 1258, 1259]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]

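After 16 (block_length) examples, .take(25) switches to the next shard, while train[:25] continues reading examples from the first shard.

Recipes

Get deterministic file shuffling

There are 2 ways to get deterministic shuffling:

Setting the shuffle_seed. Note: this requires changing the seed at each epoch, otherwise the shards are read in the same order in every epoch.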

read_config = tfds.ReadConfig(
    shuffle_seed=32,
)

# Deterministic order, different from the default shuffle_files=False above
print_ex_ids(imagenet, split='train', shuffle_files=True, read_config=read_config, take=22)
print_ex_ids(imagenet, split='train', shuffle_files=True, read_config=read_config, take=22)

[176411, 176412, 176413, 176414, 176415, 176416, 176417, 176418, 176419, 176420, 176421, 176422, 176423, 176424, 176425, 176426, 710647, 710648, 710649, 710650, 710651, 710652]
[176411, 176412, 176413, 176414, 176415, 176416, 176417, 176418, 176419, 176420, 176421, 176422, 176423, 176424, 176425, 176426, 710647, 710648, 710649, 710650, 710651, 710652]

Using experimental_interleave_sort_fn: this gives full control over which shards are read and in which order, rather than relying on the ds.shuffle order.

def _reverse_order(file_instructions):
  return list(reversed(file_instructions))

read_config = tfds.ReadConfig(
    experimental_interleave_sort_fn=_reverse_order,
)

# Last shard (01023-of-01024) is read first
print_ex_ids(imagenet, split='train', read_config=read_config, take=5)

[1279916, 1279917, 1279918, 1279919, 1279920]
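
For the first approach, the note about changing the seed at each epoch can be illustrated with a small pure-Python sketch. It models shard shuffling with the standard random module rather than TFDS itself; shard_order, base_seed and the base_seed + epoch derivation are illustrative assumptions. With TFDS, the analogous step would be to build a new tfds.ReadConfig with a different shuffle_seed for each epoch.

```python
import random


def shard_order(num_shards, base_seed, epoch):
    """Return a reproducible shard order that differs from epoch to epoch."""
    # Assumption: derive one seed per epoch from a fixed base seed.
    rng = random.Random(base_seed + epoch)
    order = list(range(num_shards))
    rng.shuffle(order)
    return order


for epoch in range(3):
    print(epoch, shard_order(num_shards=1024, base_seed=32, epoch=epoch)[:5])
```

Re-running a given epoch yields the same shard order, which keeps the pipeline reproducible, while different epochs see the shards in different orders.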

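Get deterministic preemptable pipeline

This one is more complicated. There is no easy, satisfactory solution.

1. Without ds.shuffle and with deterministic shuffling, it should in theory be possible to count the examples that have been read and deduce which examples have been read within each shard (as a function of cycle_length, block_length and shard order). The skip and take for each shard could then be injected through experimental_interleave_sort_fn.
2. With ds.shuffle, this is likely impossible without replaying the full training pipeline. It would require saving the ds.shuffle buffer state to deduce which examples have been read. Examples could be non-contiguous (e.g. shard5_ex2 and shard5_ex4 read, but not shard5_ex3).
3. With ds.shuffle, one way would be to save all shard ids/example ids read (deduced from tfds_id), then deduce the file instructions from them.

For 1., the simplest case is to have .skip(x).take(y) match train[x:x+y]. This requires:

Setting cycle_length=1 (so shards are read sequentially)
Setting shuffle_files=False
Not using ds.shuffle

It should only be used on huge datasets where training lasts only 1 epoch. Examples will be read in the default shuffle order.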

read_config = tfds.ReadConfig(
    interleave_cycle_length=1,  # Read shards sequentially
)

print_ex_ids(imagenet, split='train', read_config=read_config, skip=40, take=22)
# If the job gets pre-empted, using the subsplit API will skip at most `len(shard0)`
print_ex_ids(imagenet, split='train[40:]', read_config=read_config, take=22)

[40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61]
[40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61]
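
Under the three conditions above, resuming after a preemption reduces to building a subsplit string from the number of examples already consumed. A minimal sketch, assuming the training job checkpoints its step count and uses a fixed batch size (resume_split is a hypothetical helper, not part of TFDS):

```python
def resume_split(split_name, step, batch_size):
    """Build the subsplit that skips examples consumed before a preemption."""
    if step < 0 or batch_size <= 0:
        raise ValueError('step must be >= 0 and batch_size must be > 0')
    examples_seen = step * batch_size
    # Only valid when shards are read sequentially (cycle_length=1),
    # shuffle_files=False, and no ds.shuffle is applied.
    return f'{split_name}[{examples_seen}:]'


print(resume_split('train', step=5, batch_size=8))
```

For 5 completed steps of 8 examples each, this produces 'train[40:]', the same subsplit used in the example above.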

Find which shards/examples are read for a given subsplit

With tfds.core.DatasetInfo, you have direct access to the read instructions.

imagenet.info.splits['train[44%:45%]'].file_instructions

[FileInstruction(filename='imagenet2012-train.tfrecord-00450-of-01024', skip=700, take=-1, num_examples=551),
 FileInstruction(filename='imagenet2012-train.tfrecord-00451-of-01024', skip=0, take=-1, num_examples=1251),
 FileInstruction(filename='imagenet2012-train.tfrecord-00452-of-01024', skip=0, take=-1, num_examples=1251),
 FileInstruction(filename='imagenet2012-train.tfrecord-00453-of-01024', skip=0, take=-1, num_examples=1251),
 FileInstruction(filename='imagenet2012-train.tfrecord-00454-of-01024', skip=0, take=-1, num_examples=1252),
 FileInstruction(filename='imagenet2012-train.tfrecord-00455-of-01024', skip=0, take=-1, num_examples=1251),
 FileInstruction(filename='imagenet2012-train.tfrecord-00456-of-01024', skip=0, take=-1, num_examples=1251),
 FileInstruction(filename='imagenet2012-train.tfrecord-00457-of-01024', skip=0, take=-1, num_examples=1251),
 FileInstruction(filename='imagenet2012-train.tfrecord-00458-of-01024', skip=0, take=-1, num_examples=1251),
 FileInstruction(filename='imagenet2012-train.tfrecord-00459-of-01024', skip=0, take=-1, num_examples=1251),
 FileInstruction(filename='imagenet2012-train.tfrecord-00460-of-01024', skip=0, take=1001, num_examples=1001)]
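
The skip and take values above follow from the shard lengths and the absolute example range of the subsplit. The sketch below is a simplified reimplementation of that arithmetic, not the TFDS code; ShardRead and shard_reads are hypothetical names, and it returns explicit take counts where TFDS uses take=-1 to mean "read to the end of the shard".

```python
from typing import List, NamedTuple


class ShardRead(NamedTuple):
    shard_id: int
    skip: int  # Examples to skip at the start of the shard.
    take: int  # Examples to read after skipping.


def shard_reads(shard_lengths: List[int], start: int, stop: int) -> List[ShardRead]:
    """Map the absolute example range [start, stop) to per-shard reads."""
    reads = []
    shard_start = 0
    for shard_id, length in enumerate(shard_lengths):
        shard_stop = shard_start + length
        # Overlap between the requested range and this shard's id range.
        lo, hi = max(start, shard_start), min(stop, shard_stop)
        if lo < hi:
            reads.append(ShardRead(shard_id, lo - shard_start, hi - lo))
        shard_start = shard_stop
    return reads


# Toy split with 3 shards of 4, 5 and 3 examples; read examples 2..9.
print(shard_reads([4, 5, 3], start=2, stop=10))
```

In this toy split, the range [2, 10) skips 2 examples of shard 0 and reads the remaining 2, reads all 5 examples of shard 1, and reads the first example of shard 2.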