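In the previous two chapters, we used TensorFlow datasets in JAX for model training. As Andrew Ng put it, "public datasets provide the fuel for the rocket of machine learning research." They solve the "no rice to cook" problem in machine learning, that is, having nothing to train on.
Public Datasets
Public datasets are available from sources such as Kaggle, the UCI Machine Learning Repository, and Visual Data. They cover image recognition, natural language processing, clinical data, and more.
As a dataset management library, TensorFlow Datasets handles downloading and installing datasets, and it also provides ways to use them, such as splitting, batching, and iteration settings. This chapter continues to use TensorFlow Datasets as the example for working with public datasets.
TensorFlow Datasets Overview
At present, TensorFlow Datasets manages more than a thousand datasets, and the names of all of them can be retrieved in code, as follows.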
import tensorflow_datasets as tfds

def preview_datasets():
    # List the names of every dataset builder registered with TFDS.
    builders = tfds.list_builders()
    print(f"Length of builders = {len(builders)}")
    print(f"builders = {builders}")

if __name__ == "__main__":
    preview_datasets()
The output is as follows:
Length of builders = 1142
builders = ['abstract_reasoning', 'accentdb', 'aeslc', 'aflw2k3d', 'ag_news_subset', 'ai2_arc', 'ai2_arc_with_ir', 'amazon_us_reviews', 'anli', 'answer_equivalence', 'arc', 'asqa', 'asset', 'assin2', 'bair_robot_pushing_small', 'bccd', 'beans', 'bee_dataset', 'beir', 'big_patent', 'bigearthnet', 'billsum', 'binarized_mnist', 'binary_alpha_digits', 'ble_wind_field', 'blimp', 'booksum', 'bool_q', 'bucc', 'c4', 'c4_wsrs', 'caltech101', 'caltech_birds2010', 'caltech_birds2011', 'cardiotox', 'cars196', 'cassava', 'cats_vs_dogs', 'celeb_a', 'celeb_a_hq', 'cfq', 'cherry_blossoms', 'chexpert', 'cifar10', 'cifar100', 'cifar100_n', 'cifar10_1', 'cifar10_corrupted', 'cifar10_n', 'citrus_leaves', 'cityscapes', 'civil_comments', 'clevr', 'clic', 'clinc_oos', 'cmaterdb', 'cnn_dailymail', 'coco', 'coco_captions', 'coil100', 'colorectal_histology', 'colorectal_histology_large', 'common_voice', 'conll2002', 'conll2003', 'controlled_noisy_web_labels', 'coqa', 'cos_e', 'cosmos_qa', 'covid19', 'covid19sum', 'crema_d', 'criteo', 'cs_restaurants', 'curated_breast_imaging_ddsm', 'cycle_gan', 'd4rl_adroit_door', 'd4rl_adroit_hammer', 'd4rl_adroit_pen', 'd4rl_adroit_relocate', 'd4rl_antmaze', 'd4rl_mujoco_ant', 'd4rl_mujoco_halfcheetah', 'd4rl_mujoco_hopper', 'd4rl_mujoco_walker2d', 'dart', 'davis', 'deep1b', 'deep_weeds', 'definite_pronoun_resolution', 'dementiabank', 'diabetic_retinopathy_detection', 'diamonds', 'div2k', 'dmlab', 'doc_nli', 'dolphin_number_word', 'domainnet', 'downsampled_imagenet', 'drop', 'dsprites', 'dtd', 'duke_ultrasound', 'e2e_cleaned', 'efron_morris75', 'emnist', 'eraser_multi_rc', 'esnli', 'eurosat', 'fashion_mnist', 'flic', 'flores', 'food101', 'forest_fires', 'fuss', 'gap', 'geirhos_conflict_stimuli', 'gem', 'genomics_ood', 'german_credit_numeric', 'gigaword', 'glove100_angular', 'glue', 'goemotions', 'gov_report', 'gpt3', 'gref', 'groove', 'grounded_scan', 'gsm8k', 'gtzan', 'gtzan_music_speech', 'hellaswag', 'higgs', 'hillstrom', 'horses_or_humans', 'howell', 
'i_naturalist2017', 'i_naturalist2018', 'i_naturalist2021', 'imagenet2012', 'imagenet2012_corrupted', 'imagenet2012_fewshot', 'imagenet2012_multilabel', 'imagenet2012_real', 'imagenet2012_subset', 'imagenet_a', 'imagenet_lt', 'imagenet_pi', 'imagenet_r', 'imagenet_resized', 'imagenet_sketch', 'imagenet_v2', 'imagenette', 'imagewang', 'imdb_reviews', 'irc_disentanglement', 'iris', 'istella', 'kddcup99', 'kitti', 'kmnist', 'laion400m', 'lambada', 'lfw', 'librispeech', 'librispeech_lm', 'libritts', 'ljspeech', 'lm1b', 'locomotion', 'lost_and_found', 'lsun', 'lvis', 'malaria', 'math_dataset', 'math_qa', 'mctaco', 'media_sum', 'mlqa', 'mnist', 'mnist_corrupted', 'movie_lens', 'movie_rationales', 'movielens', 'moving_mnist', 'mrqa', 'mslr_web', 'mt_opt', 'mtnt', 'multi_news', 'multi_nli', 'multi_nli_mismatch', 'natural_instructions', 'natural_questions', 'natural_questions_open', 'newsroom', 'nsynth', 'nyu_depth_v2', 'ogbg_molpcba', 'omniglot', 'open_images_challenge2019_detection', 'open_images_v4', 'openbookqa', 'opinion_abstracts', 'opinosis', 'opus', 'oxford_flowers102', 'oxford_iiit_pet', 'para_crawl', 'pass', 'patch_camelyon', 'paws_wiki', 'paws_x_wiki', 'penguins', 'pet_finder', 'pg19', 'piqa', 'places365_small', 'placesfull', 'plant_leaves', 'plant_village', 'plantae_k', 'protein_net', 'q_re_cc', 'qa4mre', 'qasc', 'quac', 'quality', 'quickdraw_bitmap', 'race', 'radon', 'reddit', 'reddit_disentanglement', 'reddit_tifu', 'ref_coco', 'resisc45', 'rlu_atari', 'rlu_atari_checkpoints', 'rlu_atari_checkpoints_ordered', 'rlu_control_suite', 'rlu_dmlab_explore_object_rewards_few', 'rlu_dmlab_explore_object_rewards_many', 'rlu_dmlab_rooms_select_nonmatching_object', 'rlu_dmlab_rooms_watermaze', 'rlu_dmlab_seekavoid_arena01', 'rlu_locomotion', 'rlu_rwrl', 'robomimic_mg', 'robomimic_mh', 'robomimic_ph', 'robonet', 'robosuite_panda_pick_place_can', 'rock_paper_scissors', 'rock_you', 's3o4d', 'salient_span_wikipedia', 'samsum', 'savee', 'scan', 'scene_parse150', 
'schema_guided_dialogue', 'sci_tail', 'scicite', 'scientific_papers', 'scrolls', 'sentiment140', 'shapes3d', 'sift1m', 'simpte', 'siscore', 'smallnorb', 'smartwatch_gestures', 'snli', 'so2sat', 'speech_commands', 'spoken_digit', 'squad', 'squad_question_generation', 'stanford_dogs', 'stanford_online_products', 'star_cfq', 'starcraft_video', 'stl10', 'story_cloze', 'summscreen', 'sun397', 'super_glue', 'svhn_cropped', 'symmetric_solids', 'tao', 'tatoeba', 'ted_hrlr_translate', 'ted_multi_translate', 'tedlium', 'tf_flowers', 'the300w_lp', 'tiny_shakespeare', 'titanic', 'trec', 'trivia_qa', 'tydi_qa', 'uc_merced', 'ucf101', 'unified_qa', 'universal_dependencies', 'unnatural_instructions', 'user_libri_audio', 'user_libri_text', 'vctk', 'visual_domain_decathlon', 'voc', 'voxceleb', 'voxforge', 'waymo_open_dataset', 'web_graph', 'web_nlg', 'web_questions', 'webvid', 'wider_face', 'wiki40b', 'wiki_auto', 'wiki_bio', 'wiki_dialog', 'wiki_table_questions', 'wiki_table_text', 'wikiann', 'wikihow', 'wikipedia', 'wikipedia_toxicity_subtypes', 'wine_quality', 'winogrande', 'wit', 'wit_kaggle', 'wmt13_translate', 'wmt14_translate', 'wmt15_translate', 'wmt16_translate', 'wmt17_translate', 'wmt18_translate', 'wmt19_translate', 'wmt_t2t_translate', 'wmt_translate', 'wordnet', 'wsc273', 'xnli', 'xquad', 'xsum', 'xtreme_pawsx', 'xtreme_pos', 'xtreme_s', 'xtreme_xnli', 'yahoo_ltrc', 'yelp_polarity_reviews', 'yes_no', 'youtube_vis', 'huggingface:acronym_identification', 'huggingface:ade_corpus_v2', 'huggingface:adv_glue', 'huggingface:adversarial_qa', 'huggingface:aeslc', 'huggingface:afrikaans_ner_corpus', 'huggingface:ag_news', 'huggingface:ai2_arc', 'huggingface:air_dialogue', 'huggingface:ajgt_twitter_ar', 'huggingface:allegro_reviews', 'huggingface:allocine', 'huggingface:alt', 'huggingface:amazon_polarity', 'huggingface:amazon_reviews_multi', 'huggingface:amazon_us_reviews', 'huggingface:ambig_qa', 'huggingface:americas_nli', 'huggingface:ami', 'huggingface:amttl', 
'huggingface:anli', 'huggingface:app_reviews', 'huggingface:aqua_rat', 'huggingface:aquamuse', 'huggingface:ar_cov19', 'huggingface:ar_res_reviews', 'huggingface:ar_sarcasm', 'huggingface:arabic_billion_words', 'huggingface:arabic_pos_dialect', 'huggingface:arabic_speech_corpus', 'huggingface:arcd', 'huggingface:arsentd_lev', 'huggingface:art', 'huggingface:arxiv_dataset', 'huggingface:ascent_kb', 'huggingface:aslg_pc12', 'huggingface:asnq', 'huggingface:asset', 'huggingface:assin', 'huggingface:assin2', 'huggingface:atomic', 'huggingface:autshumato', 'huggingface:babi_qa', 'huggingface:banking77', 'huggingface:bbaw_egyptian', 'huggingface:bbc_hindi_nli', 'huggingface:bc2gm_corpus', 'huggingface:beans', 'huggingface:best2009', 'huggingface:bianet', 'huggingface:bible_para', 'huggingface:big_patent', 'huggingface:bigbench', 'huggingface:billsum', 'huggingface:bing_coronavirus_query_set', 'huggingface:biomrc', 'huggingface:biosses', 'huggingface:biwi_kinect_head_pose', 'huggingface:blbooks', 'huggingface:blbooksgenre', 'huggingface:blended_skill_talk', 'huggingface:blimp', 'huggingface:blog_authorship_corpus', 'huggingface:bn_hate_speech', 'huggingface:bnl_newspapers', 'huggingface:bookcorpus', 'huggingface:bookcorpusopen', 'huggingface:boolq', 'huggingface:bprec', 'huggingface:break_data', 'huggingface:brwac', 'huggingface:bsd_ja_en', 'huggingface:bswac', 'huggingface:c3', 'huggingface:c4', 'huggingface:cail2018', 'huggingface:caner', 'huggingface:capes', 'huggingface:casino', 'huggingface:catalonia_independence', 'huggingface:cats_vs_dogs', 'huggingface:cawac', 'huggingface:cbt', 'huggingface:cc100', 'huggingface:cc_news', 'huggingface:ccaligned_multilingual', 'huggingface:cdsc', 'huggingface:cdt', 'huggingface:cedr', 'huggingface:cfq', 'huggingface:chr_en', 'huggingface:cifar10', 'huggingface:cifar100', 'huggingface:circa', 'huggingface:civil_comments', 'huggingface:clickbait_news_bg', 'huggingface:climate_fever', 'huggingface:clinc_oos', 'huggingface:clue', 
'huggingface:cmrc2018', 'huggingface:cmu_hinglish_dog', 'huggingface:cnn_dailymail', 'huggingface:coached_conv_pref', 'huggingface:coarse_discourse', 'huggingface:codah', 'huggingface:code_search_net', 'huggingface:code_x_glue_cc_clone_detection_big_clone_bench', 'huggingface:code_x_glue_cc_clone_detection_poj104', 'huggingface:code_x_glue_cc_cloze_testing_all', 'huggingface:code_x_glue_cc_cloze_testing_maxmin', 'huggingface:code_x_glue_cc_code_completion_line', 'huggingface:code_x_glue_cc_code_completion_token', 'huggingface:code_x_glue_cc_code_refinement', 'huggingface:code_x_glue_cc_code_to_code_trans', 'huggingface:code_x_glue_cc_defect_detection', 'huggingface:code_x_glue_ct_code_to_text', 'huggingface:code_x_glue_tc_nl_code_search_adv', 'huggingface:code_x_glue_tc_text_to_code', 'huggingface:code_x_glue_tt_text_to_text', 'huggingface:com_qa', 'huggingface:common_gen', 'huggingface:common_language', 'huggingface:common_voice', 'huggingface:commonsense_qa', 'huggingface:competition_math', 'huggingface:compguesswhat', 'huggingface:conceptnet5', 'huggingface:conceptual_12m', 'huggingface:conceptual_captions', 'huggingface:conll2000', 'huggingface:conll2002', 'huggingface:conll2003', 'huggingface:conll2012_ontonotesv5', 'huggingface:conllpp', 'huggingface:consumer-finance-complaints', 'huggingface:conv_ai', 'huggingface:conv_ai_2', 'huggingface:conv_ai_3', 'huggingface:conv_questions', 'huggingface:coqa', 'huggingface:cord19', 'huggingface:cornell_movie_dialog', 'huggingface:cos_e', 'huggingface:cosmos_qa', 'huggingface:counter', 'huggingface:covid_qa_castorini', 'huggingface:covid_qa_deepset', 'huggingface:covid_qa_ucsd', 'huggingface:covid_tweets_japanese', 'huggingface:covost2', 'huggingface:cppe-5', 'huggingface:craigslist_bargains', 'huggingface:crawl_domain', 'huggingface:crd3', 'huggingface:crime_and_punish', 'huggingface:crows_pairs', 'huggingface:cryptonite', 'huggingface:cs_restaurants', 'huggingface:cuad', 'huggingface:curiosity_dialogs', 
'huggingface:daily_dialog', 'huggingface:dane', 'huggingface:danish_political_comments', 'huggingface:dart', 'huggingface:datacommons_factcheck', 'huggingface:dbpedia_14', 'huggingface:dbrd', 'huggingface:deal_or_no_dialog', 'huggingface:definite_pronoun_resolution', 'huggingface:dengue_filipino', 'huggingface:dialog_re', 'huggingface:diplomacy_detection', 'huggingface:disaster_response_messages', 'huggingface:discofuse', 'huggingface:discovery', 'huggingface:disfl_qa', 'huggingface:doc2dial', 'huggingface:docred', 'huggingface:doqa', 'huggingface:dream', 'huggingface:drop', 'huggingface:duorc', 'huggingface:dutch_social', 'huggingface:dyk', 'huggingface:e2e_nlg', 'huggingface:e2e_nlg_cleaned', 'huggingface:ecb', 'huggingface:ecthr_cases', 'huggingface:eduge', 'huggingface:ehealth_kd', 'huggingface:eitb_parcc', 'huggingface:electricity_load_diagrams', 'huggingface:eli5', 'huggingface:eli5_category', 'huggingface:elkarhizketak', 'huggingface:emea', 'huggingface:emo', 'huggingface:emotion', 'huggingface:emotone_ar', 'huggingface:empathetic_dialogues', 'huggingface:enriched_web_nlg', 'huggingface:enwik8', 'huggingface:eraser_multi_rc', 'huggingface:esnli', 'huggingface:eth_py150_open', 'huggingface:ethos', 'huggingface:ett', 'huggingface:eu_regulatory_ir', 'huggingface:eurlex', 'huggingface:euronews', 'huggingface:europa_eac_tm', 'huggingface:europa_ecdc_tm', 'huggingface:europarl_bilingual', 'huggingface:event2Mind', 'huggingface:evidence_infer_treatment', 'huggingface:exams', 'huggingface:factckbr', 'huggingface:fake_news_english', 'huggingface:fake_news_filipino', 'huggingface:farsi_news', 'huggingface:fashion_mnist', 'huggingface:fever', 'huggingface:few_rel', 'huggingface:financial_phrasebank', 'huggingface:finer', 'huggingface:flores', 'huggingface:flue', 'huggingface:food101', 'huggingface:fquad', 'huggingface:freebase_qa', 'huggingface:gap', 'huggingface:gem', 'huggingface:generated_reviews_enth', 'huggingface:generics_kb', 
'huggingface:german_legal_entity_recognition', 'huggingface:germaner', 'huggingface:germeval_14', 'huggingface:giga_fren', 'huggingface:gigaword', 'huggingface:glucose', 'huggingface:glue', 'huggingface:gnad10', 'huggingface:go_emotions', 'huggingface:gooaq', 'huggingface:google_wellformed_query', 'huggingface:grail_qa', 'huggingface:great_code', 'huggingface:greek_legal_code', 'huggingface:gsm8k', 'huggingface:guardian_authorship', 'huggingface:gutenberg_time', 'huggingface:hans', 'huggingface:hansards', 'huggingface:hard', 'huggingface:harem', 'huggingface:has_part', 'huggingface:hate_offensive', 'huggingface:hate_speech18', 'huggingface:hate_speech_filipino', 'huggingface:hate_speech_offensive', 'huggingface:hate_speech_pl', 'huggingface:hate_speech_portuguese', 'huggingface:hatexplain', 'huggingface:hausa_voa_ner', 'huggingface:hausa_voa_topics', 'huggingface:hda_nli_hindi', 'huggingface:head_qa', 'huggingface:health_fact', 'huggingface:hebrew_projectbenyehuda', 'huggingface:hebrew_sentiment', 'huggingface:hebrew_this_world', 'huggingface:hellaswag', 'huggingface:hendrycks_test', 'huggingface:hind_encorp', 'huggingface:hindi_discourse', 'huggingface:hippocorpus', 'huggingface:hkcancor', 'huggingface:hlgd', 'huggingface:hope_edi', 'huggingface:hotpot_qa', 'huggingface:hover', 'huggingface:hrenwac_para', 'huggingface:hrwac', 'huggingface:humicroedit', 'huggingface:hybrid_qa', 'huggingface:hyperpartisan_news_detection', 'huggingface:iapp_wiki_qa_squad', 'huggingface:id_clickbait', 'huggingface:id_liputan6', 'huggingface:id_nergrit_corpus', 'huggingface:id_newspapers_2018', 'huggingface:id_panl_bppt', 'huggingface:id_puisi', 'huggingface:igbo_english_machine_translation', 'huggingface:igbo_monolingual', 'huggingface:igbo_ner', 'huggingface:ilist', 'huggingface:imagenet-1k', 'huggingface:imagenet_sketch', 'huggingface:imdb', 'huggingface:imdb_urdu_reviews', 'huggingface:imppres', 'huggingface:indic_glue', 'huggingface:indonli', 'huggingface:indonlu', 
'huggingface:inquisitive_qg', 'huggingface:interpress_news_category_tr', 'huggingface:interpress_news_category_tr_lite', 'huggingface:irc_disentangle', 'huggingface:isixhosa_ner_corpus', 'huggingface:isizulu_ner_corpus', 'huggingface:iwslt2017', 'huggingface:jeopardy', 'huggingface:jfleg', 'huggingface:jigsaw_toxicity_pred', 'huggingface:jigsaw_unintended_bias', 'huggingface:jnlpba', 'huggingface:journalists_questions', 'huggingface:kan_hope', 'huggingface:kannada_news', 'huggingface:kd_conv', 'huggingface:kde4', 'huggingface:kelm', 'huggingface:kilt_tasks', 'huggingface:kilt_wikipedia', 'huggingface:kinnews_kirnews', 'huggingface:klue', 'huggingface:kor_3i4k', 'huggingface:kor_hate', 'huggingface:kor_ner', 'huggingface:kor_nli', 'huggingface:kor_nlu', 'huggingface:kor_qpair', 'huggingface:kor_sae', 'huggingface:kor_sarcasm', 'huggingface:labr', 'huggingface:lama', 'huggingface:lambada', 'huggingface:large_spanish_corpus', 'huggingface:laroseda', 'huggingface:lc_quad', 'huggingface:lccc', 'huggingface:lener_br', 'huggingface:lex_glue', 'huggingface:liar', 'huggingface:librispeech_asr', 'huggingface:librispeech_lm', 'huggingface:limit', 'huggingface:lince', 'huggingface:linnaeus', 'huggingface:liveqa', 'huggingface:lj_speech', 'huggingface:lm1b', 'huggingface:lst20', 'huggingface:m_lama', 'huggingface:mac_morpho', 'huggingface:makhzan', 'huggingface:masakhaner', 'huggingface:math_dataset', 'huggingface:math_qa', 'huggingface:matinf', 'huggingface:mbpp', 'huggingface:mc4', 'huggingface:mc_taco', 'huggingface:md_gender_bias', 'huggingface:mdd', 'huggingface:med_hop', 'huggingface:medal', 'huggingface:medical_dialog', 'huggingface:medical_questions_pairs', 'huggingface:medmcqa', 'huggingface:menyo20k_mt', 'huggingface:meta_woz', 'huggingface:metashift', 'huggingface:metooma', 'huggingface:metrec', 'huggingface:miam', 'huggingface:mkb', 'huggingface:mkqa', 'huggingface:mlqa', 'huggingface:mlsum', 'huggingface:mnist', 'huggingface:mocha', 'huggingface:monash_tsf', 
'huggingface:moroco', 'huggingface:movie_rationales', 'huggingface:mrqa', 'huggingface:ms_marco', 'huggingface:ms_terms', 'huggingface:msr_genomics_kbcomp', 'huggingface:msr_sqa', 'huggingface:msr_text_compression', 'huggingface:msr_zhen_translation_parity', 'huggingface:msra_ner', 'huggingface:mt_eng_vietnamese', 'huggingface:muchocine', 'huggingface:multi_booked', 'huggingface:multi_eurlex', 'huggingface:multi_news', 'huggingface:multi_nli', 'huggingface:multi_nli_mismatch', 'huggingface:multi_para_crawl', 'huggingface:multi_re_qa', 'huggingface:multi_woz_v22', 'huggingface:multi_x_science_sum', 'huggingface:multidoc2dial', 'huggingface:multilingual_librispeech', 'huggingface:mutual_friends', 'huggingface:mwsc', 'huggingface:myanmar_news', 'huggingface:narrativeqa', 'huggingface:narrativeqa_manual', 'huggingface:natural_questions', 'huggingface:ncbi_disease', 'huggingface:nchlt', 'huggingface:ncslgr', 'huggingface:nell', 'huggingface:neural_code_search', 'huggingface:news_commentary', 'huggingface:newsgroup', 'huggingface:newsph', 'huggingface:newsph_nli', 'huggingface:newspop', 'huggingface:newsqa', 'huggingface:newsroom', 'huggingface:nkjp-ner', 'huggingface:nli_tr', 'huggingface:nlu_evaluation_data', 'huggingface:norec', 'huggingface:norne', 'huggingface:norwegian_ner', 'huggingface:nq_open', 'huggingface:nsmc', 'huggingface:numer_sense', 'huggingface:numeric_fused_head', 'huggingface:oclar', 'huggingface:offcombr', 'huggingface:offenseval2020_tr', 'huggingface:offenseval_dravidian', 'huggingface:ofis_publik', 'huggingface:ohsumed', 'huggingface:ollie', 'huggingface:omp', 'huggingface:onestop_english', 'huggingface:onestop_qa', 'huggingface:open_subtitles', 'huggingface:openai_humaneval', 'huggingface:openbookqa', 'huggingface:openslr', 'huggingface:openwebtext', 'huggingface:opinosis', 'huggingface:opus100', 'huggingface:opus_books', 'huggingface:opus_dgt', 'huggingface:opus_dogc', 'huggingface:opus_elhuyar', 'huggingface:opus_euconst', 
'huggingface:opus_finlex', 'huggingface:opus_fiskmo', 'huggingface:opus_gnome', 'huggingface:opus_infopankki', 'huggingface:opus_memat', 'huggingface:opus_montenegrinsubs', 'huggingface:opus_openoffice', 'huggingface:opus_paracrawl', 'huggingface:opus_rf', 'huggingface:opus_tedtalks', 'huggingface:opus_ubuntu', 'huggingface:opus_wikipedia', 'huggingface:opus_xhosanavy', 'huggingface:orange_sum', 'huggingface:oscar', 'huggingface:para_crawl', 'huggingface:para_pat', 'huggingface:parsinlu_reading_comprehension', 'huggingface:pass', 'huggingface:paws', 'huggingface:paws-x', 'huggingface:pec', 'huggingface:peer_read', 'huggingface:peoples_daily_ner', 'huggingface:per_sent', 'huggingface:persian_ner', 'huggingface:pg19', 'huggingface:php', 'huggingface:piaf', 'huggingface:pib', 'huggingface:piqa', 'huggingface:pn_summary', 'huggingface:poem_sentiment', 'huggingface:polemo2', 'huggingface:poleval2019_cyberbullying', 'huggingface:poleval2019_mt', 'huggingface:polsum', 'huggingface:polyglot_ner', 'huggingface:prachathai67k', 'huggingface:pragmeval', 'huggingface:proto_qa', 'huggingface:psc', 'huggingface:ptb_text_only', 'huggingface:pubmed', 'huggingface:pubmed_qa', 'huggingface:py_ast', 'huggingface:qa4mre', 'huggingface:qa_srl', 'huggingface:qa_zre', 'huggingface:qangaroo', 'huggingface:qanta', 'huggingface:qasc', 'huggingface:qasper', 'huggingface:qed', 'huggingface:qed_amara', 'huggingface:quac', 'huggingface:quail', 'huggingface:quarel', 'huggingface:quartz', 'huggingface:quickdraw', 'huggingface:quora', 'huggingface:quoref', 'huggingface:race', 'huggingface:re_dial', 'huggingface:reasoning_bg', 'huggingface:recipe_nlg', 'huggingface:reclor', 'huggingface:red_caps', 'huggingface:reddit', 'huggingface:reddit_tifu', 'huggingface:refresd', 'huggingface:reuters21578', 'huggingface:riddle_sense', 'huggingface:ro_sent', 'huggingface:ro_sts', 'huggingface:ro_sts_parallel', 'huggingface:roman_urdu', 'huggingface:roman_urdu_hate_speech', 'huggingface:ronec', 
'huggingface:ropes', 'huggingface:rotten_tomatoes', 'huggingface:russian_super_glue', 'huggingface:rvl_cdip', 'huggingface:s2orc', 'huggingface:samsum', 'huggingface:sanskrit_classic', 'huggingface:saudinewsnet', 'huggingface:sberquad', 'huggingface:sbu_captions', 'huggingface:scan', 'huggingface:scb_mt_enth_2020', 'huggingface:scene_parse_150', 'huggingface:schema_guided_dstc8', 'huggingface:scicite', 'huggingface:scielo', 'huggingface:scientific_papers', 'huggingface:scifact', 'huggingface:sciq', 'huggingface:scitail', 'huggingface:scitldr', 'huggingface:search_qa', 'huggingface:sede', 'huggingface:selqa', 'huggingface:sem_eval_2010_task_8', 'huggingface:sem_eval_2014_task_1', 'huggingface:sem_eval_2018_task_1', 'huggingface:sem_eval_2020_task_11', 'huggingface:sent_comp', 'huggingface:senti_lex', 'huggingface:senti_ws', 'huggingface:sentiment140', 'huggingface:sepedi_ner', 'huggingface:sesotho_ner_corpus', 'huggingface:setimes', 'huggingface:setswana_ner_corpus', 'huggingface:sharc', 'huggingface:sharc_modified', 'huggingface:sick', 'huggingface:silicone', 'huggingface:simple_questions_v2', 'huggingface:siswati_ner_corpus', 'huggingface:smartdata', 'huggingface:sms_spam', 'huggingface:snips_built_in_intents', 'huggingface:snli', 'huggingface:snow_simplified_japanese_corpus', 'huggingface:so_stacksample', 'huggingface:social_bias_frames', 'huggingface:social_i_qa', 'huggingface:sofc_materials_articles', 'huggingface:sogou_news', 'huggingface:spanish_billion_words', 'huggingface:spc', 'huggingface:species_800', 'huggingface:speech_commands', 'huggingface:spider', 'huggingface:squad', 'huggingface:squad_adversarial', 'huggingface:squad_es', 'huggingface:squad_it', 'huggingface:squad_kor_v1', 'huggingface:squad_kor_v2', 'huggingface:squad_v1_pt', 'huggingface:squad_v2', 'huggingface:squadshifts', 'huggingface:srwac', 'huggingface:sst', 'huggingface:stereoset', 'huggingface:story_cloze', 'huggingface:stsb_mt_sv', 'huggingface:stsb_multi_mt', 
'huggingface:style_change_detection', 'huggingface:subjqa', 'huggingface:super_glue', 'huggingface:superb', 'huggingface:svhn', 'huggingface:swag', 'huggingface:swahili', 'huggingface:swahili_news', 'huggingface:swda', 'huggingface:swedish_medical_ner', 'huggingface:swedish_ner_corpus', 'huggingface:swedish_reviews', 'huggingface:swiss_judgment_prediction', 'huggingface:tab_fact', 'huggingface:tamilmixsentiment', 'huggingface:tanzil', 'huggingface:tapaco', 'huggingface:tashkeela', 'huggingface:taskmaster1', 'huggingface:taskmaster2', 'huggingface:taskmaster3', 'huggingface:tatoeba', 'huggingface:ted_hrlr', 'huggingface:ted_iwlst2013', 'huggingface:ted_multi', 'huggingface:ted_talks_iwslt', 'huggingface:telugu_books', 'huggingface:telugu_news', 'huggingface:tep_en_fa_para', 'huggingface:text2log', 'huggingface:textvqa', 'huggingface:thai_toxicity_tweet', 'huggingface:thainer', 'huggingface:thaiqa_squad', 'huggingface:thaisum', 'huggingface:the_pile', 'huggingface:the_pile_books3', 'huggingface:the_pile_openwebtext2', 'huggingface:the_pile_stack_exchange', 'huggingface:tilde_model', 'huggingface:time_dial', 'huggingface:times_of_india_news_headlines', 'huggingface:timit_asr', 'huggingface:tiny_shakespeare', 'huggingface:tlc', 'huggingface:tmu_gfm_dataset', 'huggingface:tne', 'huggingface:told-br', 'huggingface:totto', 'huggingface:trec', 'huggingface:trivia_qa', 'huggingface:truthful_qa', 'huggingface:tsac', 'huggingface:ttc4900', 'huggingface:tunizi', 'huggingface:tuple_ie', 'huggingface:turk', 'huggingface:turkic_xwmt', 'huggingface:turkish_movie_sentiment', 'huggingface:turkish_ner', 'huggingface:turkish_product_reviews', 'huggingface:turkish_shrinked_ner', 'huggingface:turku_ner_corpus', 'huggingface:tweet_eval', 'huggingface:tweet_qa', 'huggingface:tweets_ar_en_parallel', 'huggingface:tweets_hate_speech_detection', 'huggingface:twi_text_c3', 'huggingface:twi_wordsim353', 'huggingface:tydiqa', 'huggingface:ubuntu_dialogs_corpus', 'huggingface:udhr', 
'huggingface:um005', 'huggingface:un_ga', 'huggingface:un_multi', 'huggingface:un_pc', 'huggingface:universal_dependencies', 'huggingface:universal_morphologies', 'huggingface:urdu_fake_news', 'huggingface:urdu_sentiment_corpus', 'huggingface:vctk', 'huggingface:visual_genome', 'huggingface:vivos', 'huggingface:web_nlg', 'huggingface:web_of_science', 'huggingface:web_questions', 'huggingface:weibo_ner', 'huggingface:wi_locness', 'huggingface:wider_face', 'huggingface:wiki40b', 'huggingface:wiki_asp', 'huggingface:wiki_atomic_edits', 'huggingface:wiki_auto', 'huggingface:wiki_bio', 'huggingface:wiki_dpr', 'huggingface:wiki_hop', 'huggingface:wiki_lingua', 'huggingface:wiki_movies', 'huggingface:wiki_qa', 'huggingface:wiki_qa_ar', 'huggingface:wiki_snippets', 'huggingface:wiki_source', 'huggingface:wiki_split', 'huggingface:wiki_summary', 'huggingface:wikiann', 'huggingface:wikicorpus', 'huggingface:wikihow', 'huggingface:wikipedia', 'huggingface:wikisql', 'huggingface:wikitablequestions', 'huggingface:wikitext', 'huggingface:wikitext_tl39', 'huggingface:wili_2018', 'huggingface:wino_bias', 'huggingface:winograd_wsc', 'huggingface:winogrande', 'huggingface:wiqa', 'huggingface:wisesight1000', 'huggingface:wisesight_sentiment', 'huggingface:wmt14', 'huggingface:wmt15', 'huggingface:wmt16', 'huggingface:wmt17', 'huggingface:wmt18', 'huggingface:wmt19', 'huggingface:wmt20_mlqe_task1', 'huggingface:wmt20_mlqe_task2', 'huggingface:wmt20_mlqe_task3', 'huggingface:wmt_t2t', 'huggingface:wnut_17', 'huggingface:wongnai_reviews', 'huggingface:woz_dialogue', 'huggingface:wrbsc', 'huggingface:x_stance', 'huggingface:xcopa', 'huggingface:xcsr', 'huggingface:xed_en_fi', 'huggingface:xglue', 'huggingface:xnli', 'huggingface:xor_tydi_qa', 'huggingface:xquad', 'huggingface:xquad_r', 'huggingface:xsum', 'huggingface:xsum_factuality', 'huggingface:xtreme', 'huggingface:yahoo_answers_qa', 'huggingface:yahoo_answers_topics', 'huggingface:yelp_polarity', 'huggingface:yelp_review_full', 
'huggingface:yoruba_bbc_topics', 'huggingface:yoruba_gv_ner', 'huggingface:yoruba_text_c3', 'huggingface:yoruba_wordsim353', 'huggingface:youtube_caption_corrections', 'huggingface:zest', 'kubric:kubric_frames', 'kubric:movi_a', 'kubric:movi_b', 'kubric:movi_c', 'kubric:movi_d', 'kubric:movi_e', 'kubric:movi_f', 'kubric:msn_easy_frames', 'kubric:multi_shapenet_frames', 'kubric:nerf_synthetic_frames', 'kubric:nerf_synthetic_scenes', 'kubric:shapenet_pretraining', 'robotics:language_table', 'robotics:language_table_blocktoabsolute_oracle_sim', 'robotics:language_table_blocktoblock_4block_sim', 'robotics:language_table_blocktoblock_oracle_sim', 'robotics:language_table_blocktoblock_sim', 'robotics:language_table_blocktoblockrelative_oracle_sim', 'robotics:language_table_blocktorelative_oracle_sim', 'robotics:language_table_separate_oracle_sim', 'robotics:language_table_sim', 'robotics:mt_opt_rlds', 'robotics:mt_opt_sd']
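Names in the listing above that contain a colon, such as huggingface:mnist or kubric:movi_a, carry a namespace prefix. As a rough sketch of how such a list could be summarized, the snippet below groups names by that prefix. It works on a small hand-picked sample copied from the output rather than calling tfds, and the label "tfds" for names without a prefix is an illustrative choice, not a TFDS convention.

```python
from collections import Counter

# A hand-picked sample of names copied from the listing above (not the full list).
sample_builders = [
    "mnist",
    "cifar10",
    "huggingface:mnist",
    "huggingface:squad",
    "kubric:movi_a",
    "robotics:language_table",
]

def namespace_of(builder_name):
    """Return the part before ':'; names without one get the label 'tfds'."""
    namespace, sep, _ = builder_name.partition(":")
    # 'tfds' is a hypothetical label chosen for this example only.
    return namespace if sep else "tfds"

def count_by_namespace(builder_names):
    """Count how many builder names fall under each namespace."""
    return Counter(namespace_of(name) for name in builder_names)

counts = count_by_namespace(sample_builders)
print(dict(counts))
# prints: {'tfds': 2, 'huggingface': 2, 'kubric': 1, 'robotics': 1}
```

Passing the real result of tfds.list_builders() to count_by_namespace should give the same kind of summary for the full list.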
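With so many datasets, many of them will be unfamiliar, and there is no need to inspect and test each one. The table below lists several commonly used types, covering text, image, document summarization, natural language, object detection, recommendation, and video data.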
Category | Name | Description |
---|---|---|
Text | billsum | BillSum, summarization of US Congressional and California state bills. |
Text | NEWSROOM | NEWSROOM is a large dataset for training and evaluating summarization systems. It contains 1.3 million articles and summaries written by authors and editors in the newsrooms of 38 major publications. |
Text | samsum | SAMSum Corpus contains over 16k chat dialogues with manually annotated summaries. |
Image | imagenet_v2 | ImageNet-v2 is an ImageNet test set (10 per class) collected by closely following the original labelling protocol. |
Image | Cityscapes | Cityscapes is a dataset consisting of diverse urban street scenes across 50 different cities at varying times of the year as well as ground truths for several vision tasks including semantic segmentation, instance level segmentation (TODO), and stereo pair disparity inference. |
Image | cifar100 | This dataset is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). |
Image | food101 | This dataset consists of 101 food categories, with 101,000 images. For each class, 250 manually reviewed test images are provided as well as 750 training images. |
Image | mnist | The MNIST database of handwritten digits. |
Document summarization | opinion_abstracts | The movie critics and consensus crawled. |
Document summarization | booksum | BookSum: A Collection of Datasets for Long-form Narrative Summarization. |
Natural language | natural_questions | The NQ corpus contains questions from real users, and it requires QA systems to read and comprehend an entire Wikipedia article that may or may not contain the answer to the question. |
Natural language | math_qa | A large-scale dataset of math word problems and an interpretable neural math problem solver that learns to map problems to operation programs. |
Object detection | waymo_open_dataset | The Waymo Open Dataset is comprised of high resolution sensor data collected by Waymo self-driving cars in a wide variety of conditions. This data is licensed for non-commercial use. |
Object detection | coco | COCO is a large-scale object detection, segmentation, and captioning dataset. |
Recommendation | hillstrom | This dataset contains 64,000 customers who last purchased within twelve months. The customers were involved in an e-mail test. |
Video | bair_robot_pushing_small | This dataset contains roughly 44,000 examples of robot pushing motions, including one training set (train) and two test sets of previously seen (testseen) and unseen (testnovel) objects. This is the small 64x64 version. |
Video | TAO | The TAO dataset is a large video object detection dataset consisting of 2,907 high resolution videos and 833 object categories. |
Video | WebVid | WebVid is a large-scale dataset of short videos with textual descriptions sourced from the web. The videos are diverse and rich in their content. |
The full list is available at https://www.tensorflow.org/datasets/catalog/overview
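TensorFlow Datasets is published as its own package, separate from TensorFlow. Some preconfigured environments already include it; if yours does not, install it with the following command: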
pip install tensorflow_datasets
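To confirm that the package is visible to the current interpreter before running the examples, a small check along these lines may help. This is only a sketch: is_installed is a hypothetical helper written for this example, built on the standard library's importlib.util.find_spec, which looks a module up without importing it.

```python
import importlib.util

def is_installed(module_name):
    """Return True if a top-level module can be found without importing it."""
    # is_installed is a hypothetical helper name used only in this example.
    return importlib.util.find_spec(module_name) is not None

if is_installed("tensorflow_datasets"):
    print("tensorflow_datasets is available")
else:
    print("tensorflow_datasets is missing; install it with the pip command above")
```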
Using TensorFlow Datasets
The following uses the MNIST dataset as an example to introduce the basic usage of TensorFlow Datasets. The code is as follows:
import tensorflow as tf
import numpy as np
import tensorflow_datasets as tfds

def setup():
    # Download (on first use) and load MNIST; the result is a dict keyed by split name.
    mnist = tfds.load(name="mnist", data_dir="/tmp")
    trains, tests = mnist["train"], mnist["test"]
    assert isinstance(trains, tf.data.Dataset)

def main():
    setup()

if __name__ == "__main__":
    main()
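The code first imports tensorflow_datasets as the entry point for loading data, then calls the load function with name and data_dir (optional; it defaults to ~/tensorflow_datasets under the current user's home directory) to load the dataset. It then takes the training set and the test set out of the result as trains and tests. The output is as follows: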
Downloading and preparing dataset 11.06 MiB (download: 11.06 MiB, generated: 21.00 MiB, total: 32.06 MiB) to /tmp/mnist/3.0.1...
Dl Completed...: 100%|#########################| 5/5 [00:07<00:00, 1.58s/ file]
Dataset mnist downloaded and prepared to /tmp/mnist/3.0.1. Subsequent calls will reuse this data.
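Because this is the first download, tfds connects to the dataset's download location to obtain the download URLs and content, so wait patiently for the download to finish. Next, modify the code to print trains and tests, then run it again.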
import tensorflow as tf
import numpy as np
import tensorflow_datasets as tfds

def setup():
    # The data is already in /tmp, so this call reuses it instead of downloading again.
    mnist = tfds.load(name="mnist", data_dir="/tmp")
    trains, tests = mnist["train"], mnist["test"]
    assert isinstance(trains, tf.data.Dataset)
    # Print both datasets to inspect their element specs.
    print(trains, tests)

def main():
    setup()

if __name__ == "__main__":
    main()
The output is as follows:
<_PrefetchDataset element_spec={'image': TensorSpec(shape=(28, 28, 1), dtype=tf.uint8, name=None), 'label': TensorSpec(shape=(), dtype=tf.int64, name=None)}> <_PrefetchDataset element_spec={'image': TensorSpec(shape=(28, 28, 1), dtype=tf.uint8, name=None), 'label': TensorSpec(shape=(), dtype=tf.int64, name=None)}>
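The run shows that the dataset has already been downloaded and will not be downloaded again unless it is deleted from the directory. The dataset has also been arranged into the corresponding dimensions and data format. According to the printed output, each MNIST example is a 3-dimensional image of size [28, 28, 1] with data type uint8, while the label type is int64.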
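Whether a later call will reuse the cached copy can be estimated by looking at the directory layout shown in the download log (/tmp/mnist/3.0.1). The helper below is only a rough heuristic written for illustration, not part of the TensorFlow Datasets API: the name is_prepared is hypothetical, and it assumes that a prepared dataset appears as data_dir/name/version. A partially downloaded directory could still fool it.

```python
import os


def is_prepared(data_dir: str, name: str) -> bool:
    """Heuristic: True if data_dir/name holds at least one sub-directory."""
    dataset_dir = os.path.join(data_dir, name)
    if not os.path.isdir(dataset_dir):
        return False
    # Assumption: each sub-directory is a prepared version such as "3.0.1".
    return any(
        os.path.isdir(os.path.join(dataset_dir, entry))
        for entry in os.listdir(dataset_dir)
    )


print(is_prepared("/tmp", "mnist"))
```

Deleting the dataset directory makes the check return False, which matches the case where tfds would download the data again.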
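The load function of TensorFlow Datasets offers a quick and convenient way to build and load a tf.data.Dataset. When no split is given, as in the code above, it returns a dictionary, and each dataset is retrieved by its key, such as "train" or "test".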
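In JAX, however, the data being processed is almost always of a float type, which does not match uint8 and int64. For users who need NumPy arrays in their programs, tfds.as_numpy returns a generator that yields NumPy arrays. Note that tfds.as_numpy keeps the original dtypes, so casting to float remains a separate step. The example code is as follows: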
import tensorflow_datasets as tfds

def setup():
    # Load only the training split as a tf.data.Dataset.
    trains = tfds.load(name="mnist", split=tfds.Split.TRAIN, data_dir="/tmp")
    # Shuffle, group into batches of 128, repeat for 5 epochs, and prefetch.
    trains = trains.shuffle(1024).batch(128).repeat(5).prefetch(10)
    # as_numpy yields each batch as a dict of NumPy arrays.
    for i, item in enumerate(tfds.as_numpy(trains)):
        images, labels = item["image"], item["label"]
        print(f"i = {i}, images.shape = {images.shape}, labels.shape = {labels.shape}")

def main():
    setup()

if __name__ == "__main__":
    main()
The output is as follows:
…
i = 2326, images.shape = (128, 28, 28, 1), labels.shape = (128,)
i = 2327, images.shape = (128, 28, 28, 1), labels.shape = (128,)
i = 2328, images.shape = (128, 28, 28, 1), labels.shape = (128,)
i = 2329, images.shape = (128, 28, 28, 1), labels.shape = (128,)
i = 2330, images.shape = (128, 28, 28, 1), labels.shape = (128,)
i = 2331, images.shape = (128, 28, 28, 1), labels.shape = (128,)
i = 2332, images.shape = (128, 28, 28, 1), labels.shape = (128,)
i = 2333, images.shape = (128, 28, 28, 1), labels.shape = (128,)
i = 2334, images.shape = (128, 28, 28, 1), labels.shape = (128,)
i = 2335, images.shape = (128, 28, 28, 1), labels.shape = (128,)
i = 2336, images.shape = (128, 28, 28, 1), labels.shape = (128,)
i = 2337, images.shape = (128, 28, 28, 1), labels.shape = (128,)
i = 2338, images.shape = (128, 28, 28, 1), labels.shape = (128,)
i = 2339, images.shape = (128, 28, 28, 1), labels.shape = (128,)
i = 2340, images.shape = (128, 28, 28, 1), labels.shape = (128,)
i = 2341, images.shape = (128, 28, 28, 1), labels.shape = (128,)
i = 2342, images.shape = (128, 28, 28, 1), labels.shape = (128,)
i = 2343, images.shape = (128, 28, 28, 1), labels.shape = (128,)
i = 2344, images.shape = (96, 28, 28, 1), labels.shape = (96,)
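The final index (2344) and the smaller final batch (96) follow from simple arithmetic on the 60,000 training examples. As a small worked sketch (the helper name batch_plan is made up for illustration), the numbers can be reproduced without loading any data:

```python
import math


def batch_plan(num_examples: int, batch_size: int, epochs: int):
    """Return (total_batches, last_batch_size) for batch(...) followed by repeat(...)."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    remainder = num_examples % batch_size
    # Each epoch ends with a partial batch unless the sizes divide evenly.
    last_batch_size = remainder if remainder else batch_size
    return batches_per_epoch * epochs, last_batch_size


total, last = batch_plan(num_examples=60000, batch_size=128, epochs=5)
print(f"total batches = {total}, last index = {total - 1}, last batch size = {last}")
# total batches = 2345, last index = 2344, last batch size = 96
```

Because batch is applied before repeat, every epoch ends with one partial batch of 96 examples, not only the last one.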
The tfds.load function also accepts batch_size = -1, which returns the complete dataset as tf.Tensor objects. The modified code is as follows:
import tensorflow_datasets as tfds

def setup():
    # Batched pipeline from the previous example.
    trains = tfds.load(name="mnist", split=tfds.Split.TRAIN, data_dir="/tmp")
    trains = trains.shuffle(1024).batch(128).repeat(5).prefetch(10)
    for i, item in enumerate(tfds.as_numpy(trains)):
        images, labels = item["image"], item["label"]
        print(f"i = {i}, images.shape = {images.shape}, labels.shape = {labels.shape}")

def setup_():
    # batch_size=-1 loads the whole split as a single batch of tensors.
    trains = tfds.load(name="mnist", batch_size=-1, split=tfds.Split.TRAIN, data_dir="/tmp")
    # Convert the dict of tensors into a dict of NumPy arrays.
    trains = tfds.as_numpy(trains)
    train_images, train_labels = trains["image"], trains["label"]
    print(f"train_images.shape = {train_images.shape}, train_labels.shape = {train_labels.shape}")

def main():
    # setup()
    setup_()

if __name__ == "__main__":
    main()
The output is as follows:
train_images.shape = (60000, 28, 28, 1), train_labels.shape = (60000,)
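Once the full arrays are in memory, they still have the uint8 and int64 types reported earlier, so a conversion step is usually needed before handing them to a JAX model. The following is a minimal sketch under stated assumptions: the helper name prepare_for_jax, the scaling to [0, 1], the flattening to 784 features, and the one-hot labels are illustrative choices, not something the original code prescribes. Synthetic arrays stand in for MNIST so the snippet runs without a download.

```python
import numpy as np


def prepare_for_jax(images: np.ndarray, labels: np.ndarray, num_classes: int = 10):
    """Scale uint8 images to float32 in [0, 1] and one-hot encode the labels."""
    x = images.astype(np.float32) / 255.0
    x = x.reshape(x.shape[0], -1)  # flatten (N, 28, 28, 1) to (N, 784)
    y = np.eye(num_classes, dtype=np.float32)[labels]
    return x, y


# Synthetic stand-in with the same dtypes and per-example shape as MNIST.
fake_images = np.random.randint(0, 256, size=(4, 28, 28, 1), dtype=np.uint8)
fake_labels = np.array([0, 3, 9, 1], dtype=np.int64)
x, y = prepare_for_jax(fake_images, fake_labels)
print(x.shape, x.dtype, y.shape, y.dtype)
```

The same function can be applied to train_images and train_labels from the example above, and the resulting float32 arrays can be passed to jax.numpy.asarray.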
When load is called, the split parameter specifies how the data is divided. If a finer division is needed, the data can be split by weight into a training set, a validation set, and a test set. The code is as follows:
import tensorflow_datasets as tfds

def setup():
    # Consecutive, non-overlapping slices of the original train split.
    splits = ["train[:50%]", "train[50%:75%]", "train[75%:]"]
    (trains, validations, tests), metas = tfds.load(name="mnist", data_dir="/tmp/", split=splits, with_info=True, as_supervised=True)
    print(f"trains = {trains}, validations = {validations}, tests = {tests}), metas = {metas}")

def main():
    setup()

if __name__ == "__main__":
    main()
Here splits = ["train[:50%]", "train[50%:75%]", "train[75%:]"] is passed as the split argument, dividing the original training split into a training set, a validation set, and a test set of 50%, 25%, and 25%. Each slice starts where the previous one ends, so the three sets do not overlap. The with_info argument also returns basic information about the MNIST dataset, including its features, size, and format. The output is as follows:
trains = <_PrefetchDataset element_spec=(TensorSpec(shape=(28, 28, 1), dtype=tf.uint8, name=None), TensorSpec(shape=(), dtype=tf.int64, name=None))>, validations = <_PrefetchDataset element_spec=(TensorSpec(shape=(28, 28, 1), dtype=tf.uint8, name=None), TensorSpec(shape=(), dtype=tf.int64, name=None))>, tests = <_PrefetchDataset element_spec=(TensorSpec(shape=(28, 28, 1), dtype=tf.uint8, name=None), TensorSpec(shape=(), dtype=tf.int64, name=None))>), metas = tfds.core.DatasetInfo(
    name='mnist',
    full_name='mnist/3.0.1',
    description="""
    The MNIST database of handwritten digits.
    """,
    homepage='http://yann.lecun.com/exdb/mnist/',
    data_path='/tmp/mnist/3.0.1',
    file_format=tfrecord,
    download_size=11.06 MiB,
    dataset_size=21.00 MiB,
    features=FeaturesDict({
        'image': Image(shape=(28, 28, 1), dtype=uint8),
        'label': ClassLabel(shape=(), dtype=int64, num_classes=10),
    }),
    supervised_keys=('image', 'label'),
    disable_shuffling=False,
    splits={
        'test': <SplitInfo num_examples=10000, num_shards=1>,
        'train': <SplitInfo num_examples=60000, num_shards=1>,
    },
    citation="""@article{lecun2010mnist,
      title={MNIST handwritten digit database},
      author={LeCun, Yann and Cortes, Corinna and Burges, CJ},
      journal={ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist},
      volume={2},
      year={2010}
    }""",
)
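When the shares are meant to be 50%, 25%, and 25% without any overlap, each percent slice has to begin where the previous one ends. The sketch below builds such slice strings from a list of weights. The helper name percent_splits is hypothetical, and it assumes that the TFDS slicing syntax accepts explicit 0% and 100% bounds, as in "train[0%:50%]".

```python
def percent_splits(split: str, weights):
    """Build consecutive, non-overlapping percent slices such as 'train[0%:50%]'."""
    if sum(weights) != 100:
        raise ValueError("weights must sum to 100")
    specs, start = [], 0
    for weight in weights:
        end = start + weight
        specs.append(f"{split}[{start}%:{end}%]")
        start = end
    return specs


print(percent_splits("train", [50, 25, 25]))
# ['train[0%:50%]', 'train[50%:75%]', 'train[75%:100%]']
```

The returned list can be passed as the split argument of tfds.load. For the 60,000 MNIST training examples, these slices correspond to 30,000, 15,000, and 15,000 examples.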
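Conclusion
These two chapters used TensorFlow Datasets as an example to introduce public datasets and how to use them in JAX. With a library such as TensorFlow Datasets, JAX programs can draw on public datasets for training, which spares users the difficulty of finding data on their own.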