Dataset collections


Overview

Dataset collections provide a simple way to group together an arbitrary number of existing TFDS datasets, and to perform simple operations over them.

For example, they can be useful to group together different datasets related to the same task, or for easy benchmarking of models over a fixed number of different tasks.

Setup

To get started, install a few packages:

# Use tfds-nightly to ensure access to the latest features.
pip install -q tfds-nightly tensorflow
pip install -U conllu

Import TensorFlow and the TensorFlow Datasets package into your development environment:

import pprint

import tensorflow as tf
import tensorflow_datasets as tfds

Finding available dataset collections

All dataset collection builders are subclasses of tfds.core.dataset_collection_builder.DatasetCollection.
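
A new collection is defined by subclassing this class. The sketch below is a minimal, hypothetical builder, assuming the structure described in the TFDS guide on adding dataset collections; the collection name, description, and member datasets are illustrative only:

import collections
from typing import Mapping

from tensorflow_datasets.core import dataset_collection_builder
from tensorflow_datasets.core import naming


class MyCollection(dataset_collection_builder.DatasetCollection):
  """A hypothetical dataset collection builder."""

  @property
  def info(self) -> dataset_collection_builder.DatasetCollectionInfo:
    # Metadata describing the collection itself.
    return dataset_collection_builder.DatasetCollectionInfo.from_cls(
        dataset_collection_class=self.__class__,
        description="A hypothetical collection grouping two TFDS datasets.",
        release_notes={"1.0.0": "Initial release."},
    )

  @property
  def datasets(self) -> Mapping[str, Mapping[str, naming.DatasetReference]]:
    # Map each collection version to the datasets it contains,
    # given as "dataset_name[/config]:version" references.
    return collections.OrderedDict({
        "1.0.0": naming.references_for({
            "dataset_1": "natural_questions/default:0.0.2",
            "dataset_2": "media_sum:1.0.0",
        }),
    })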

To get the list of available builders, use tfds.list_dataset_collections():

tfds.list_dataset_collections()
['longt5', 'xtreme']

Loading and inspecting a dataset collection

The easiest way to load a dataset collection is to instantiate a DatasetCollectionLoader object using the tfds.dataset_collection command:

collection_loader = tfds.dataset_collection('xtreme')

Specific dataset collection versions can be loaded using the same syntax as with TFDS datasets:

collection_loader = tfds.dataset_collection('xtreme:1.0.0')

The dataset loader can display information about the collection:

collection_loader.print_info()
Dataset collection: xtreme
Version: 1.0.0
Description: # Xtreme Benchmark

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME)
benchmark is a benchmark for the evaluation of the cross-lingual generalization
ability of pre-trained multilingual models. It covers 40 typologically diverse
languages (spanning 12 language families) and includes nine tasks that
collectively require reasoning about different levels of syntax and semantics.
The languages in XTREME are selected to maximize language diversity, coverage
in existing tasks, and availability of training data. Among these are many
under-studied languages, such as the Dravidian languages Tamil (spoken in
southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken
mainly in southern India), and the Niger-Congo languages Swahili and Yoruba,
spoken in Africa.

For a full description of the benchmark,
see the [paper](https://arxiv.org/abs/2003.11080).

Citation:
@article{hu2020xtreme,
    author    = {Junjie Hu and Sebastian Ruder and Aditya Siddhant and Graham
                 Neubig and Orhan Firat and Melvin Johnson},
    title     = {XTREME: A Massively Multilingual Multi-task Benchmark for
                 Evaluating Cross-lingual Generalization},
    journal   = {CoRR},
    volume    = {abs/2003.11080},
    year      = {2020},
    archivePrefix = {arXiv},
    eprint    = {2003.11080}
}

The dataset loader can also display information about the datasets contained in the collection:

collection_loader.print_datasets()
The dataset collection xtreme (version: 1.0.0) contains the datasets:

 - xnli: DatasetReference(dataset_name='xtreme_xnli', namespace=None, config=None, version='1.1.0', data_dir=None, split_mapping=None)
 - pawsx: DatasetReference(dataset_name='xtreme_pawsx', namespace=None, config=None, version='1.0.0', data_dir=None, split_mapping=None)
 - pos: DatasetReference(dataset_name='xtreme_pos', namespace=None, config=None, version='1.0.0', data_dir=None, split_mapping=None)
 - ner: DatasetReference(dataset_name='wikiann', namespace=None, config=None, version='1.0.0', data_dir=None, split_mapping=None)
 - xquad: DatasetReference(dataset_name='xquad', namespace=None, config=None, version='3.0.0', data_dir=None, split_mapping=None)
 - mlqa: DatasetReference(dataset_name='mlqa', namespace=None, config=None, version='1.0.0', data_dir=None, split_mapping=None)
 - tydiqa: DatasetReference(dataset_name='tydi_qa', namespace=None, config=None, version='3.0.0', data_dir=None, split_mapping=None)
 - bucc: DatasetReference(dataset_name='bucc', namespace=None, config=None, version='1.0.0', data_dir=None, split_mapping=None)
 - tatoeba: DatasetReference(dataset_name='tatoeba', namespace=None, config=None, version='1.0.0', data_dir=None, split_mapping=None)

Loading datasets from a dataset collection

The easiest way to load one dataset from a collection is to use a DatasetCollectionLoader object's load_dataset method, which loads the required dataset by calling tfds.load.

This call returns a dictionary of split names and the corresponding tf.data.Datasets:

splits = collection_loader.load_dataset("ner")

pprint.pprint(splits)
{'test': <_PrefetchDataset element_spec={'langs': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'spans': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'tags': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'tokens': TensorSpec(shape=(None,), dtype=tf.string, name=None)}>,
 'train': <_PrefetchDataset element_spec={'langs': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'spans': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'tags': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'tokens': TensorSpec(shape=(None,), dtype=tf.string, name=None)}>,
 'validation': <_PrefetchDataset element_spec={'langs': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'spans': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'tags': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'tokens': TensorSpec(shape=(None,), dtype=tf.string, name=None)}>}
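
The returned objects are regular tf.data.Dataset instances and can be consumed as usual. As a minimal sketch, the following takes a single example from the train split and prints its feature keys, using tfds.as_numpy to convert the example's tensors to NumPy:

for example in tfds.as_numpy(splits["train"].take(1)):
  print(example.keys())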

load_dataset accepts the following optional parameters:

  • split: which split(s) to load. It accepts a single split (split="test") or a list of splits (split=["train", "test"]). If not specified, all the splits of the given dataset are loaded (see the example after this list).
  • loader_kwargs: keyword arguments to be passed to the tfds.load function. Refer to the tfds.load documentation for a comprehensive overview of the different loading options.
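
For example, the following call (reusing the collection_loader defined above) loads only the train split of the ner dataset:

# Load a single split of one dataset from the collection.
ner_train = collection_loader.load_dataset("ner", split="train")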

Loading multiple datasets from a dataset collection

The easiest way to load multiple datasets from a collection is to use a DatasetCollectionLoader object's load_datasets method, which loads the required datasets by calling tfds.load.

It returns a dictionary of dataset names, each one associated with a dictionary of split names and the corresponding tf.data.Datasets, as in the following example:

datasets = collection_loader.load_datasets(['xnli', 'bucc'])

pprint.pprint(datasets)
{'bucc': {'test': <_PrefetchDataset element_spec={'source_id': TensorSpec(shape=(), dtype=tf.string, name=None), 'source_sentence': TensorSpec(shape=(), dtype=tf.string, name=None), 'target_id': TensorSpec(shape=(), dtype=tf.string, name=None), 'target_sentence': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
          'validation': <_PrefetchDataset element_spec={'source_id': TensorSpec(shape=(), dtype=tf.string, name=None), 'source_sentence': TensorSpec(shape=(), dtype=tf.string, name=None), 'target_id': TensorSpec(shape=(), dtype=tf.string, name=None), 'target_sentence': TensorSpec(shape=(), dtype=tf.string, name=None)}>},
 'xnli': {'train': <_PrefetchDataset element_spec={'hypothesis': {'language': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'translation': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'label': TensorSpec(shape=(), dtype=tf.int64, name=None), 'premise': {'ar': TensorSpec(shape=(), dtype=tf.string, name=None), 'bg': TensorSpec(shape=(), dtype=tf.string, name=None), 'de': TensorSpec(shape=(), dtype=tf.string, name=None), 'el': TensorSpec(shape=(), dtype=tf.string, name=None), 'en': TensorSpec(shape=(), dtype=tf.string, name=None), 'es': TensorSpec(shape=(), dtype=tf.string, name=None), 'fr': TensorSpec(shape=(), dtype=tf.string, name=None), 'hi': TensorSpec(shape=(), dtype=tf.string, name=None), 'ru': TensorSpec(shape=(), dtype=tf.string, name=None), 'sw': TensorSpec(shape=(), dtype=tf.string, name=None), 'th': TensorSpec(shape=(), dtype=tf.string, name=None), 'tr': TensorSpec(shape=(), dtype=tf.string, name=None), 'ur': TensorSpec(shape=(), dtype=tf.string, name=None), 'vi': TensorSpec(shape=(), dtype=tf.string, name=None), 'zh': TensorSpec(shape=(), dtype=tf.string, name=None)} }>} }

The load_all_datasets method loads all available datasets for a given collection:

all_datasets = collection_loader.load_all_datasets()

pprint.pprint(all_datasets)
{'bucc': {'test': <_PrefetchDataset element_spec={'source_id': TensorSpec(shape=(), dtype=tf.string, name=None), 'source_sentence': TensorSpec(shape=(), dtype=tf.string, name=None), 'target_id': TensorSpec(shape=(), dtype=tf.string, name=None), 'target_sentence': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
          'validation': <_PrefetchDataset element_spec={'source_id': TensorSpec(shape=(), dtype=tf.string, name=None), 'source_sentence': TensorSpec(shape=(), dtype=tf.string, name=None), 'target_id': TensorSpec(shape=(), dtype=tf.string, name=None), 'target_sentence': TensorSpec(shape=(), dtype=tf.string, name=None)}>},
 'mlqa': {'test': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
          'validation': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>},
 'ner': {'test': <_PrefetchDataset element_spec={'langs': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'spans': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'tags': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'tokens': TensorSpec(shape=(None,), dtype=tf.string, name=None)}>,
         'train': <_PrefetchDataset element_spec={'langs': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'spans': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'tags': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'tokens': TensorSpec(shape=(None,), dtype=tf.string, name=None)}>,
         'validation': <_PrefetchDataset element_spec={'langs': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'spans': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'tags': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'tokens': TensorSpec(shape=(None,), dtype=tf.string, name=None)}>},
 'pawsx': {'train': <_PrefetchDataset element_spec={'label': TensorSpec(shape=(), dtype=tf.int64, name=None), 'sentence1': TensorSpec(shape=(), dtype=tf.string, name=None), 'sentence2': TensorSpec(shape=(), dtype=tf.string, name=None)}>},
 'pos': {'dev': <_PrefetchDataset element_spec={'tokens': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'upos': TensorSpec(shape=(None,), dtype=tf.int64, name=None)}>,
         'test': <_PrefetchDataset element_spec={'tokens': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'upos': TensorSpec(shape=(None,), dtype=tf.int64, name=None)}>,
         'train': <_PrefetchDataset element_spec={'tokens': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'upos': TensorSpec(shape=(None,), dtype=tf.int64, name=None)}>},
 'tatoeba': {'train': <_PrefetchDataset element_spec={'source_language': TensorSpec(shape=(), dtype=tf.string, name=None), 'source_sentence': TensorSpec(shape=(), dtype=tf.string, name=None), 'target_language': TensorSpec(shape=(), dtype=tf.string, name=None), 'target_sentence': TensorSpec(shape=(), dtype=tf.string, name=None)}>},
 'tydiqa': {'train': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'translate-train-ar': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'translate-train-bn': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'translate-train-fi': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'translate-train-id': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'translate-train-ko': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'translate-train-ru': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'translate-train-sw': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'translate-train-te': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'validation': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'validation-ar': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'validation-bn': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'validation-en': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'validation-fi': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'validation-id': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'validation-ko': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'validation-ru': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'validation-sw': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'validation-te': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>},
 'xnli': {'train': <_PrefetchDataset element_spec={'hypothesis': {'language': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'translation': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'label': TensorSpec(shape=(), dtype=tf.int64, name=None), 'premise': {'ar': TensorSpec(shape=(), dtype=tf.string, name=None), 'bg': TensorSpec(shape=(), dtype=tf.string, name=None), 'de': TensorSpec(shape=(), dtype=tf.string, name=None), 'el': TensorSpec(shape=(), dtype=tf.string, name=None), 'en': TensorSpec(shape=(), dtype=tf.string, name=None), 'es': TensorSpec(shape=(), dtype=tf.string, name=None), 'fr': TensorSpec(shape=(), dtype=tf.string, name=None), 'hi': TensorSpec(shape=(), dtype=tf.string, name=None), 'ru': TensorSpec(shape=(), dtype=tf.string, name=None), 'sw': TensorSpec(shape=(), dtype=tf.string, name=None), 'th': TensorSpec(shape=(), dtype=tf.string, name=None), 'tr': TensorSpec(shape=(), dtype=tf.string, name=None), 'ur': TensorSpec(shape=(), dtype=tf.string, name=None), 'vi': TensorSpec(shape=(), dtype=tf.string, name=None), 'zh': TensorSpec(shape=(), dtype=tf.string, name=None)} }>},
 'xquad': {'test': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
           'translate-dev': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
           'translate-test': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
           'translate-train': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>} }

The load_datasets method accepts the following optional parameters:

  • split: which split(s) to load. It accepts a single split (split="test") or a list of splits (split=["train", "test"]). If not specified, all the splits of the given datasets are loaded (see the sketch after this list).
  • loader_kwargs: keyword arguments to be passed to the tfds.load function. Refer to the tfds.load documentation for a comprehensive overview of the different loading options.
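
For example, the sketch below loads only the train splits of the xnli and tatoeba datasets:

# Load the same split for several datasets in one call.
train_datasets = collection_loader.load_datasets(["xnli", "tatoeba"], split="train")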

Specifying loader_kwargs

loader_kwargs are optional keyword arguments to be passed to the tfds.load function. They can be specified in three ways:

  1. When initializing the DatasetCollectionLoader class:
collection_loader = tfds.dataset_collection('xtreme', loader_kwargs=dict(split='train', batch_size=10, try_gcs=False))
  2. Using the DatasetCollectionLoader's set_loader_kwargs method:
collection_loader.set_loader_kwargs(dict(split='train', batch_size=10, try_gcs=False))
  3. As optional parameters to the load_dataset, load_datasets, and load_all_datasets methods:
dataset = collection_loader.load_dataset('ner', loader_kwargs=dict(split='train', batch_size=10, try_gcs=False))

Feedback

We are continuously trying to improve the dataset creation workflow, but can only do so if we are aware of the issues. Which issues or errors did you encounter while creating the dataset collection? Was there a part that was confusing, boilerplate, or not working the first time? Please share your feedback on GitHub.