TFDS now supports the Croissant 🥐 format! Read the documentation to learn more.

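Format-specific Dataset Builders

This guide documents all format-specific dataset builders currently available in TFDS.

Format-specific dataset builders are subclasses of tfds.core.GeneratorBasedBuilder that take care of most of the data processing for a specific data format.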

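Datasets based on `tf.data.Dataset`

If you want to create a TFDS dataset from a dataset that is in tf.data.Dataset format, you can use tfds.dataset_builders.TfDataBuilder (see the API docs).

We envision two typical uses of this class:

Creating experimental datasets in a notebook-like environment
Defining a dataset builder in code

Creating a new dataset from a notebook

Suppose you are working in a notebook, have loaded some data as a tf.data.Dataset, applied various transformations (map, filter, etc.), and now want to store this data and easily share it with teammates or load it in other notebooks. Instead of having to define a new dataset builder class, you can instantiate a tfds.dataset_builders.TfDataBuilder and call download_and_prepare to store your dataset as a TFDS dataset.

Because it is a TFDS dataset, you can version it, use configs, have different splits, and document it for easier use later. This means that you also have to tell TFDS what the features in your dataset are.

Here is a dummy example of how you can use it.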

import tensorflow as tf
import tensorflow_datasets as tfds

my_ds_train = tf.data.Dataset.from_tensor_slices({"number": [1, 2, 3]})
my_ds_test = tf.data.Dataset.from_tensor_slices({"number": [4, 5]})

# Optionally define a custom `data_dir`.
# If None, then the default data dir is used.
custom_data_dir = "/my/folder"

# Define the builder.
single_number_builder = tfds.dataset_builders.TfDataBuilder(
    name="my_dataset",
    config="single_number",
    version="1.0.0",
    data_dir=custom_data_dir,
    split_datasets={
        "train": my_ds_train,
        "test": my_ds_test,
    },
    features=tfds.features.FeaturesDict({
        "number": tfds.features.Scalar(dtype=tf.int64),
    }),
    description="My dataset with a single number.",
    release_notes={
        "1.0.0": "Initial release with numbers up to 5!",
    }
)

# Make the builder store the data as a TFDS dataset.
single_number_builder.download_and_prepare()

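The download_and_prepare method iterates over the input tf.data.Datasets and stores the corresponding TFDS dataset in /my/folder/my_dataset/single_number/1.0.0, which will contain both the train and test splits.

The config argument is optional and can be useful if you want to store different configs under the same dataset.

The data_dir argument can be used to store the generated TFDS dataset in a different folder, for example in your own sandbox if you don't want to share it with others yet. Note that when doing this, you also need to pass the data_dir to tfds.load. If the data_dir argument is not specified, the default TFDS data dir is used.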
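
The folder layout described above can be sketched in plain Python. This is only an illustration of the `<data_dir>/<name>/<config>/<version>` pattern seen in the example path; `expected_dataset_dir` is a hypothetical helper, not part of the TFDS API.

```python
from pathlib import PurePosixPath
from typing import Optional


def expected_dataset_dir(
    data_dir: str, name: str, version: str, config: Optional[str] = None
) -> PurePosixPath:
    """Builds <data_dir>/<name>[/<config>]/<version> (hypothetical helper)."""
    path = PurePosixPath(data_dir) / name
    if config is not None:
        # The config level is only added when a config was given.
        path = path / config
    return path / version


print(expected_dataset_dir("/my/folder", "my_dataset", "1.0.0", config="single_number"))
```

A helper like this can be handy for checking that the expected folder exists before handing a custom data_dir to teammates.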

Loading your dataset

After the TFDS dataset has been stored, it can be loaded from other scripts or by teammates who have access to the data:

# If no custom data dir was specified:
ds_test = tfds.load("my_dataset/single_number", split="test")

# When there are multiple versions, you can also specify the version.
ds_test = tfds.load("my_dataset/single_number:1.0.0", split="test")

# If the TFDS was stored in a custom folder, then it can be loaded as follows:
custom_data_dir = "/my/folder"
ds_test = tfds.load("my_dataset/single_number:1.0.0", split="test", data_dir=custom_data_dir)
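
The name strings passed to tfds.load above follow a `name/config:version` pattern. As a minimal sketch, assuming that pattern is all you need, the string can be assembled as follows; `tfds_name` is a hypothetical helper introduced here for illustration and is not a TFDS function.

```python
from typing import Optional


def tfds_name(
    name: str, config: Optional[str] = None, version: Optional[str] = None
) -> str:
    """Assembles a 'name[/config][:version]' string (hypothetical helper)."""
    full_name = name
    if config:
        full_name += f"/{config}"
    if version:
        full_name += f":{version}"
    return full_name


print(tfds_name("my_dataset", config="single_number", version="1.0.0"))
```

The resulting string could then be passed as the first argument to tfds.load, together with split and, if needed, data_dir.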

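Adding a new version or config

After iterating further on your dataset, you may have added or changed some of the transformations of the source data. To store and share this dataset, you can easily store it as a new version.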

def add_one(example):
  example["number"] = example["number"] + 1
  return example

my_ds_train_v2 = my_ds_train.map(add_one)
my_ds_test_v2 = my_ds_test.map(add_one)

single_number_builder_v2 = tfds.dataset_builders.TfDataBuilder(
    name="my_dataset",
    config="single_number",
    version="1.1.0",
    data_dir=custom_data_dir,
    split_datasets={
        "train": my_ds_train_v2,
        "test": my_ds_test_v2,
    },
    features=tfds.features.FeaturesDict({
        "number": tfds.features.Scalar(dtype=tf.int64, doc="Some number"),
    }),
    description="My dataset with a single number.",
    release_notes={
        "1.1.0": "Add one to every number; numbers now go up to 6!",
        "1.0.0": "Initial release with numbers up to 5!",
    }
)

# Make the builder store the data as a TFDS dataset.
single_number_builder_v2.download_and_prepare()

Defining a new dataset builder class

You can also define a new DatasetBuilder based on this class.

import tensorflow as tf
import tensorflow_datasets as tfds

class MyDatasetBuilder(tfds.dataset_builders.TfDataBuilder):
  def __init__(self):
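    # Examples are dicts keyed by "number" so that they match `features` below.
    ds_train = tf.data.Dataset.from_tensor_slices({"number": [1, 2, 3]})
    ds_test = tf.data.Dataset.from_tensor_slices({"number": [4, 5]})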
    super().__init__(
        name="my_dataset",
        version="1.0.0",
        split_datasets={
            "train": ds_train,
            "test": ds_test,
        },
        features=tfds.features.FeaturesDict({
            "number": tfds.features.Scalar(dtype=tf.int64),
        }),
        config="single_number",
        description="My dataset with a single number.",
        release_notes={
            "1.0.0": "Initial release with numbers up to 5!",
        })

CroissantBuilder

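Format

Croissant 🥐 is a high-level format for machine learning datasets that combines metadata, resource file descriptions, data structure, and default ML semantics into a single file; it works with existing datasets to make them easier to find, use, and support with tools.

Croissant builds on schema.org and its sc:Dataset vocabulary, a widely used format for representing datasets on the Web and making them searchable.

`CroissantBuilder`

A CroissantBuilder defines a TFDS dataset based on a Croissant 🥐 metadata file; each of the record_set_ids specified results in a separate ConfigBuilder.

For example, to initialize a CroissantBuilder for the MNIST dataset using its Croissant 🥐 definition: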

import tensorflow_datasets as tfds
builder = tfds.dataset_builders.CroissantBuilder(
    jsonld="https://raw.githubusercontent.com/mlcommons/croissant/main/datasets/0.8/huggingface-mnist/metadata.json",
    file_format='array_record',
)
builder.download_and_prepare()
ds = builder.as_data_source()
print(ds['default'][0])
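
To get a feel for which record sets a Croissant file declares, the JSON-LD can be inspected with the standard library alone. The sketch below rests on assumptions: it uses a tiny hand-written fragment rather than a complete Croissant file, assumes record sets sit under a top-level "recordSet" key, and reads an identifier from "@id" or, failing that, "name". The exact ids TFDS uses come from its Croissant tooling and may differ. `list_record_set_ids` is a hypothetical helper.

```python
import json

# Illustrative fragment only: a real Croissant file carries much more metadata.
CROISSANT_FRAGMENT = """
{
  "@type": "sc:Dataset",
  "name": "toy_dataset",
  "recordSet": [
    {"@type": "cr:RecordSet", "name": "default"},
    {"@type": "cr:RecordSet", "@id": "extra", "name": "extra"}
  ]
}
"""


def list_record_set_ids(jsonld_text):
    """Lists record set identifiers found in a Croissant JSON-LD string."""
    metadata = json.loads(jsonld_text)
    record_sets = metadata.get("recordSet", [])
    if isinstance(record_sets, dict):
        # JSON-LD allows a single object where a list is expected.
        record_sets = [record_sets]
    # Prefer "@id" and fall back to "name" (assumption, see lead-in).
    return [rs.get("@id", rs.get("name")) for rs in record_sets]


print(list_record_set_ids(CROISSANT_FRAGMENT))
```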

CoNLL

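Format

CoNLL is a popular format used to represent annotated textual data.

CoNLL-formatted data usually contain one token with its linguistic annotations per line; within the same line, annotations are usually separated by spaces or tabs. Empty lines represent sentence boundaries.

Consider as an example the following sentence from the conll2003 dataset, which follows the CoNLL annotation format: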

U.N. NNP I-NP I-ORG
official NN I-NP O
Ekeus NNP I-NP I-PER
heads VBZ I-VP O
for IN I-PP O
Baghdad NNP I-NP I-LOC
. . O O
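
The line-oriented layout described above (one token per line, whitespace-separated annotations, blank lines between sentences) can be read with a few lines of plain Python. This is a minimal sketch for illustration, not the parser used by TFDS; `iter_conll_sentences` is a hypothetical helper, and the sample text is split into two sentences artificially to show the blank-line boundary.

```python
from typing import Iterator, List

# Sample in CoNLL style; the blank line is added here only to demonstrate
# how a sentence boundary is detected.
SAMPLE = """\
U.N. NNP I-NP I-ORG
official NN I-NP O

Ekeus NNP I-NP I-PER
heads VBZ I-VP O
"""


def iter_conll_sentences(text: str) -> Iterator[List[List[str]]]:
    """Yields sentences as lists of rows; each row is a list of columns."""
    sentence: List[List[str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence, if there is one.
            if sentence:
                yield sentence
                sentence = []
            continue
        # Columns are separated by spaces or tabs.
        sentence.append(line.split())
    if sentence:
        yield sentence


sentences = list(iter_conll_sentences(SAMPLE))
print(len(sentences))
print(sentences[0][0])
```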

`ConllDatasetBuilder`

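To add a new CoNLL-based dataset to TFDS, you can base your dataset builder class on tfds.dataset_builders.ConllDatasetBuilder. This base class contains common code to deal with the specificities of CoNLL datasets (iterating over the column-based format, precompiled lists of features and tags, etc.).

tfds.dataset_builders.ConllDatasetBuilder implements a CoNLL-specific GeneratorBasedBuilder. Refer to the following class as a minimal example of a CoNLL dataset builder: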

from tensorflow_datasets.core.dataset_builders.conll import conll_dataset_builder_utils as conll_lib
import tensorflow_datasets.public_api as tfds

class MyCoNLLDataset(tfds.dataset_builders.ConllDatasetBuilder):
  VERSION = tfds.core.Version('1.0.0')
  RELEASE_NOTES = {'1.0.0': 'Initial release.'}

  # conll_lib contains a set of ready-to-use CoNLL-specific configs.
  BUILDER_CONFIGS = [conll_lib.CONLL_2003_CONFIG]

  def _info(self) -> tfds.core.DatasetInfo:
    return self.create_dataset_info(
        # ...
    )

  def _split_generators(self, dl_manager):
    path = dl_manager.download_and_extract('https://data-url')

    # Assumes the downloaded archive provides separate train.txt and test.txt files.
    return {
        'train': self._generate_examples(path=path / 'train.txt'),
        'test': self._generate_examples(path=path / 'test.txt'),
    }

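As for standard dataset builders, it requires overriding the class methods _info and _split_generators. Depending on the dataset, you might also need to update conll_dataset_builder_utils.py to include the features and the list of tags specific to your dataset.

The _generate_examples method should not require further overriding, unless your dataset needs a specific implementation.

Examples

Consider conll2003 as an example of a dataset implemented using the CoNLL-specific dataset builder.

CLI

The easiest way to write a new CoNLL-based dataset is to use the TFDS CLI: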

cd path/to/my/project/datasets/
tfds new my_dataset --format=conll   # Create `my_dataset/my_dataset.py` CoNLL-specific template files

CoNLL-U

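Format

CoNLL-U is a popular format used to represent annotated textual data.

CoNLL-U enhances the CoNLL format by adding a number of features, such as support for multi-token words. CoNLL-U formatted data usually contain one token with its linguistic annotations per line; within the same line, annotations are usually separated by single tab characters. Empty lines represent sentence boundaries.

Typically, each CoNLL-U annotated word line contains the following fields, as reported in the official documentation:

ID: Word index, an integer starting at 1 for each new sentence; may be a range for multiword tokens; may be a decimal number for empty nodes (decimal numbers can be lower than 1 but must be greater than 0).
FORM: Word form or punctuation symbol.
LEMMA: Lemma or stem of the word form.
UPOS: Universal part-of-speech tag.
XPOS: Language-specific part-of-speech tag; underscore if not available.
FEATS: List of morphological features from the universal feature inventory or from a defined language-specific extension; underscore if not available.
HEAD: Head of the current word, which is either a value of ID or zero (0).
DEPREL: Universal dependency relation to the HEAD (root if and only if HEAD = 0) or a defined language-specific subtype of one.
DEPS: Enhanced dependency graph in the form of a list of head-deprel pairs.
MISC: Any other annotation.

Consider as an example the following CoNLL-U annotated sentence from the official documentation: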

1-2    vámonos   _
1      vamos     ir
2      nos       nosotros
3-4    al        _
3      a         a
4      el        el
5      mar       mar
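
The ten fields listed above map naturally onto a dictionary per line. The following is a minimal sketch in plain Python, not the parser TFDS uses: `parse_conllu_line` and `id_kind` are hypothetical helpers, and the sample lines fill every column after LEMMA with an underscore placeholder instead of real annotations.

```python
from typing import Dict

CONLLU_FIELDS = (
    "ID", "FORM", "LEMMA", "UPOS", "XPOS",
    "FEATS", "HEAD", "DEPREL", "DEPS", "MISC",
)


def parse_conllu_line(line: str) -> Dict[str, str]:
    """Maps one tab-separated CoNLL-U word line onto the ten field names."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(CONLLU_FIELDS):
        raise ValueError(
            f"Expected {len(CONLLU_FIELDS)} columns, got {len(values)}"
        )
    return dict(zip(CONLLU_FIELDS, values))


def id_kind(token_id: str) -> str:
    """Classifies an ID as a multiword range, an empty node, or a plain word."""
    if "-" in token_id:
        return "multiword"   # a range such as 1-2
    if "." in token_id:
        return "empty_node"  # a decimal such as 0.1
    return "word"


# Placeholder annotations ("_") are used for all columns after LEMMA.
lines = [
    "1-2\tvámonos\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tvamos\tir\t_\t_\t_\t_\t_\t_\t_",
]
rows = [parse_conllu_line(line) for line in lines]
for row in rows:
    print(row["ID"], id_kind(row["ID"]), row["FORM"], row["LEMMA"])
```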

`ConllUDatasetBuilder`

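To add a new CoNLL-U-based dataset to TFDS, you can base your dataset builder class on tfds.dataset_builders.ConllUDatasetBuilder. This base class contains common code to deal with the specificities of CoNLL-U datasets (iterating over the column-based format, precompiled lists of features and tags, etc.).

tfds.dataset_builders.ConllUDatasetBuilder implements a CoNLL-U-specific GeneratorBasedBuilder. Refer to the following class as a minimal example of a CoNLL-U dataset builder: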

from tensorflow_datasets.core.dataset_builders.conll import conllu_dataset_builder_utils as conllu_lib
import tensorflow_datasets.public_api as tfds

class MyCoNLLUDataset(tfds.dataset_builders.ConllUDatasetBuilder):
  VERSION = tfds.core.Version('1.0.0')
  RELEASE_NOTES = {'1.0.0': 'Initial release.'}

  # conllu_lib contains a set of ready-to-use features.
  BUILDER_CONFIGS = [
      conllu_lib.get_universal_morphology_config(
          language='en',
          features=conllu_lib.UNIVERSAL_DEPENDENCIES_FEATURES,
      )
  ]

  def _info(self) -> tfds.core.DatasetInfo:
    return self.create_dataset_info(
        # ...
    )

  def _split_generators(self, dl_manager):
    path = dl_manager.download_and_extract('https://data-url')

    return {
        'train':
            self._generate_examples(
                path=path / 'train.txt',
                # If necessary, add optional custom processing (see conllu_lib
                # for examples).
                # process_example_fn=...,
            )
    }

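As for standard dataset builders, it requires overriding the class methods _info and _split_generators. Depending on the dataset, you might also need to update conllu_dataset_builder_utils.py to include the features and the list of tags specific to your dataset.

The _generate_examples method should not require further overriding, unless your dataset needs a specific implementation. Note that, if your dataset requires specific preprocessing (for example, if it considers non-classic universal dependency features), you might need to update the process_example_fn attribute of your generate_examples function (see the xtreme_pos dataset as an example).

Examples

For examples, see datasets that use the CoNLL-U-specific dataset builder, such as xtreme_pos.

CLI

The easiest way to write a new CoNLL-U-based dataset is to use the TFDS CLI: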

cd path/to/my/project/datasets/
tfds new my_dataset --format=conllu   # Create `my_dataset/my_dataset.py` CoNLL-U specific template files
