![]() |
![]() |
![]() |
![]() |
總覽
資料集集合提供一種簡單的方式,將任意數量的現有 TFDS 資料集群組在一起,並對其執行簡單的作業。
舉例來說,它們可能適用於將與相同任務相關的不同資料集群組在一起,或方便針對固定數量的不同任務進行模型的基準化分析。
設定
若要開始使用,請安裝幾個套件
# Use tfds-nightly to ensure access to the latest features.
pip install -q tfds-nightly tensorflow
pip install -U conllu
將 TensorFlow 和 Tensorflow Datasets 套件匯入您的開發環境
import pprint
import tensorflow as tf
import tensorflow_datasets as tfds
2023-10-03 09:24:58.961730: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered 2023-10-03 09:24:58.961781: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered 2023-10-03 09:24:58.961817: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
資料集集合提供一種簡單的方式,將 Tensorflow Datasets (TFDS) 中任意數量的現有資料集群組在一起,並對其執行簡單的作業。
舉例來說,它們可能適用於將與相同任務相關的不同資料集群組在一起,或方便針對固定數量的不同任務進行模型的基準化分析。
尋找可用的資料集集合
所有資料集集合建構工具都是 tfds.core.dataset_collection_builder.DatasetCollection
的子類別。
若要取得可用建構工具的清單,請使用 tfds.list_dataset_collections()
。
tfds.list_dataset_collections()
['longt5', 'xtreme']
載入並檢查資料集集合
載入資料集集合最簡單的方式是使用 tfds.dataset_collection
命令,例項化 DatasetCollectionLoader
物件。
collection_loader = tfds.dataset_collection('xtreme')
特定資料集集合版本可以使用與 TFDS 資料集相同的語法載入
collection_loader = tfds.dataset_collection('xtreme:1.0.0')
資料集載入器可以顯示關於集合的資訊
collection_loader.print_info()
Dataset collection: xtreme Version: 1.0.0 Description: # Xtreme Benchmark The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages (spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks, and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil (spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the Niger-Congo languages Swahili and Yoruba, spoken in Africa. For a full description of the benchmark, see the [paper](https://arxiv.org/abs/2003.11080). Citation: @article{hu2020xtreme, author = {Junjie Hu and Sebastian Ruder and Aditya Siddhant and Graham Neubig and Orhan Firat and Melvin Johnson}, title = {XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization}, journal = {CoRR}, volume = {abs/2003.11080}, year = {2020}, archivePrefix = {arXiv}, eprint = {2003.11080} }
資料集載入器也可以顯示關於集合中所含資料集的資訊
collection_loader.print_datasets()
The dataset collection xtreme (version: 1.0.0) contains the datasets: - xnli: DatasetReference(dataset_name='xtreme_xnli', namespace=None, config=None, version='1.1.0', data_dir=None, split_mapping=None) - pawsx: DatasetReference(dataset_name='xtreme_pawsx', namespace=None, config=None, version='1.0.0', data_dir=None, split_mapping=None) - pos: DatasetReference(dataset_name='xtreme_pos', namespace=None, config=None, version='1.0.0', data_dir=None, split_mapping=None) - ner: DatasetReference(dataset_name='wikiann', namespace=None, config=None, version='1.0.0', data_dir=None, split_mapping=None) - xquad: DatasetReference(dataset_name='xquad', namespace=None, config=None, version='3.0.0', data_dir=None, split_mapping=None) - mlqa: DatasetReference(dataset_name='mlqa', namespace=None, config=None, version='1.0.0', data_dir=None, split_mapping=None) - tydiqa: DatasetReference(dataset_name='tydi_qa', namespace=None, config=None, version='3.0.0', data_dir=None, split_mapping=None) - bucc: DatasetReference(dataset_name='bucc', namespace=None, config=None, version='1.0.0', data_dir=None, split_mapping=None) - tatoeba: DatasetReference(dataset_name='tatoeba', namespace=None, config=None, version='1.0.0', data_dir=None, split_mapping=None)
從資料集集合載入資料集
從集合載入一個資料集最簡單的方式是使用 DatasetCollectionLoader
物件的 load_dataset
方法,此方法會透過呼叫 tfds.load
載入所需的資料集。
此呼叫會傳回分割名稱的字典和對應的 tf.data.Dataset
splits = collection_loader.load_dataset("ner")
pprint.pprint(splits)
{'test': <_PrefetchDataset element_spec={'langs': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'spans': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'tags': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'tokens': TensorSpec(shape=(None,), dtype=tf.string, name=None)}>, 'train': <_PrefetchDataset element_spec={'langs': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'spans': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'tags': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'tokens': TensorSpec(shape=(None,), dtype=tf.string, name=None)}>, 'validation': <_PrefetchDataset element_spec={'langs': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'spans': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'tags': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'tokens': TensorSpec(shape=(None,), dtype=tf.string, name=None)}>} 2023-10-03 09:25:02.792501: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:268] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
load_dataset
接受以下選用參數
split
:要載入哪個 (哪些) 分割。它接受單一分割 (split="test"
) 或分割清單:(split=["train", "test"]
)。如果未指定,則會載入指定資料集的所有分割。loader_kwargs
:要傳遞至tfds.load
函式的關鍵字引數。請參閱tfds.load
說明文件,以全面瞭解不同的載入選項。
從資料集集合載入多個資料集
從集合載入多個資料集最簡單的方式是使用 DatasetCollectionLoader
物件的 load_datasets
方法,此方法會透過呼叫 tfds.load
載入所需的資料集。
它會傳回資料集名稱的字典,每個名稱都與分割名稱的字典和對應的 tf.data.Dataset
相關聯,如下列範例所示
datasets = collection_loader.load_datasets(['xnli', 'bucc'])
pprint.pprint(datasets)
{'bucc': {'test': <_PrefetchDataset element_spec={'source_id': TensorSpec(shape=(), dtype=tf.string, name=None), 'source_sentence': TensorSpec(shape=(), dtype=tf.string, name=None), 'target_id': TensorSpec(shape=(), dtype=tf.string, name=None), 'target_sentence': TensorSpec(shape=(), dtype=tf.string, name=None)}>, 'validation': <_PrefetchDataset element_spec={'source_id': TensorSpec(shape=(), dtype=tf.string, name=None), 'source_sentence': TensorSpec(shape=(), dtype=tf.string, name=None), 'target_id': TensorSpec(shape=(), dtype=tf.string, name=None), 'target_sentence': TensorSpec(shape=(), dtype=tf.string, name=None)}>}, 'xnli': {'train': <_PrefetchDataset element_spec={'hypothesis': {'language': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'translation': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'label': TensorSpec(shape=(), dtype=tf.int64, name=None), 'premise': {'ar': TensorSpec(shape=(), dtype=tf.string, name=None), 'bg': TensorSpec(shape=(), dtype=tf.string, name=None), 'de': TensorSpec(shape=(), dtype=tf.string, name=None), 'el': TensorSpec(shape=(), dtype=tf.string, name=None), 'en': TensorSpec(shape=(), dtype=tf.string, name=None), 'es': TensorSpec(shape=(), dtype=tf.string, name=None), 'fr': TensorSpec(shape=(), dtype=tf.string, name=None), 'hi': TensorSpec(shape=(), dtype=tf.string, name=None), 'ru': TensorSpec(shape=(), dtype=tf.string, name=None), 'sw': TensorSpec(shape=(), dtype=tf.string, name=None), 'th': TensorSpec(shape=(), dtype=tf.string, name=None), 'tr': TensorSpec(shape=(), dtype=tf.string, name=None), 'ur': TensorSpec(shape=(), dtype=tf.string, name=None), 'vi': TensorSpec(shape=(), dtype=tf.string, name=None), 'zh': TensorSpec(shape=(), dtype=tf.string, name=None)} }>} }
load_all_datasets
方法會載入指定集合的所有可用資料集
all_datasets = collection_loader.load_all_datasets()
pprint.pprint(all_datasets)
{'bucc': {'test': <_PrefetchDataset element_spec={'source_id': TensorSpec(shape=(), dtype=tf.string, name=None), 'source_sentence': TensorSpec(shape=(), dtype=tf.string, name=None), 'target_id': TensorSpec(shape=(), dtype=tf.string, name=None), 'target_sentence': TensorSpec(shape=(), dtype=tf.string, name=None)}>, 'validation': <_PrefetchDataset element_spec={'source_id': TensorSpec(shape=(), dtype=tf.string, name=None), 'source_sentence': TensorSpec(shape=(), dtype=tf.string, name=None), 'target_id': TensorSpec(shape=(), dtype=tf.string, name=None), 'target_sentence': TensorSpec(shape=(), dtype=tf.string, name=None)}>}, 'mlqa': {'test': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>, 'validation': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>}, 'ner': {'test': <_PrefetchDataset element_spec={'langs': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'spans': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'tags': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'tokens': TensorSpec(shape=(None,), dtype=tf.string, name=None)}>, 'train': <_PrefetchDataset element_spec={'langs': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'spans': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'tags': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'tokens': TensorSpec(shape=(None,), dtype=tf.string, name=None)}>, 'validation': <_PrefetchDataset element_spec={'langs': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'spans': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'tags': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'tokens': TensorSpec(shape=(None,), dtype=tf.string, name=None)}>}, 'pawsx': {'train': <_PrefetchDataset element_spec={'label': TensorSpec(shape=(), dtype=tf.int64, name=None), 'sentence1': TensorSpec(shape=(), dtype=tf.string, name=None), 'sentence2': TensorSpec(shape=(), dtype=tf.string, name=None)}>}, 'pos': {'dev': <_PrefetchDataset element_spec={'tokens': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'upos': TensorSpec(shape=(None,), dtype=tf.int64, name=None)}>, 'test': <_PrefetchDataset element_spec={'tokens': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'upos': TensorSpec(shape=(None,), dtype=tf.int64, name=None)}>, 'train': <_PrefetchDataset element_spec={'tokens': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'upos': TensorSpec(shape=(None,), dtype=tf.int64, name=None)}>}, 'tatoeba': {'train': <_PrefetchDataset element_spec={'source_language': TensorSpec(shape=(), dtype=tf.string, name=None), 'source_sentence': TensorSpec(shape=(), dtype=tf.string, name=None), 'target_language': TensorSpec(shape=(), dtype=tf.string, name=None), 'target_sentence': TensorSpec(shape=(), dtype=tf.string, name=None)}>}, 'tydiqa': {'train': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>, 'translate-train-ar': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>, 'translate-train-bn': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>, 'translate-train-fi': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>, 'translate-train-id': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>, 'translate-train-ko': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>, 'translate-train-ru': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>, 'translate-train-sw': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>, 'translate-train-te': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>, 'validation': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>, 'validation-ar': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>, 'validation-bn': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>, 'validation-en': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>, 'validation-fi': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>, 'validation-id': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>, 'validation-ko': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>, 'validation-ru': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>, 'validation-sw': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>, 'validation-te': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>}, 'xnli': {'train': <_PrefetchDataset element_spec={'hypothesis': {'language': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'translation': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'label': TensorSpec(shape=(), dtype=tf.int64, name=None), 'premise': {'ar': TensorSpec(shape=(), dtype=tf.string, name=None), 'bg': TensorSpec(shape=(), dtype=tf.string, name=None), 'de': TensorSpec(shape=(), dtype=tf.string, name=None), 'el': TensorSpec(shape=(), dtype=tf.string, name=None), 'en': TensorSpec(shape=(), dtype=tf.string, name=None), 'es': TensorSpec(shape=(), dtype=tf.string, name=None), 'fr': TensorSpec(shape=(), dtype=tf.string, name=None), 'hi': TensorSpec(shape=(), dtype=tf.string, name=None), 'ru': TensorSpec(shape=(), dtype=tf.string, name=None), 'sw': TensorSpec(shape=(), dtype=tf.string, name=None), 'th': TensorSpec(shape=(), dtype=tf.string, name=None), 'tr': TensorSpec(shape=(), dtype=tf.string, name=None), 'ur': TensorSpec(shape=(), dtype=tf.string, name=None), 'vi': TensorSpec(shape=(), dtype=tf.string, name=None), 'zh': TensorSpec(shape=(), dtype=tf.string, name=None)} }>}, 'xquad': {'test': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>, 'translate-dev': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>, 'translate-test': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>, 'translate-train': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>} }
load_datasets
方法接受以下選用參數
split
:要載入哪個 (哪些) 分割。它接受單一分割(split="test")
或分割清單:(split=["train", "test"])
。如果未指定,則會載入指定資料集的所有分割。loader_kwargs
:要傳遞至tfds.load
函式的關鍵字引數。請參閱tfds.load
說明文件,以全面瞭解不同的載入選項。
指定 loader_kwargs
loader_kwargs
是要傳遞至 tfds.load
函式的選用關鍵字引數。它們可以使用三種方式指定
- 在初始化
DatasetCollectionLoader
類別時
collection_loader = tfds.dataset_collection('xtreme', loader_kwargs=dict(split='train', batch_size=10, try_gcs=False))
- 使用
DatasetCollectioLoader
的set_loader_kwargs
方法
collection_loader.set_loader_kwargs(dict(split='train', batch_size=10, try_gcs=False))
- 作為
load_dataset
、load_datasets
和load_all_datasets
方法的選用參數。
dataset = collection_loader.load_dataset('ner', loader_kwargs=dict(split='train', batch_size=10, try_gcs=False))
意見回饋
我們不斷嘗試改進資料集建立工作流程,但只有在我們瞭解問題時才能做到。您在建立資料集集合時遇到哪些問題、錯誤?是否有任何部分令人困惑、繁瑣或第一次無法運作?請在 GitHub 上分享您的意見回饋。