I am using HuggingFace Transformers with PyTorch. My modus operandi is to download a pre-trained model and save it in a local project folder.
While doing so, I can see that a .bin file, which contains the model weights, is saved locally. However, I am also downloading and saving a tokenizer, for which I cannot see any associated file.
So, how do I check whether a tokenizer is saved locally before downloading? Secondly, apart from the usual os.path.isfile(...) check, is there a better way to prioritize using a local copy from a given location before downloading?
Answer
I’ve used this code in the past for this purpose. You can adapt it to your setting.
```python
import os
import urllib.request

from transformers import AutoTokenizer


def download_vocab_files_for_tokenizer(tokenizer, model_type, output_path, vocab_exist_bool=False):
    """Download the tokenizer's vocab files to output_path, unless they already exist locally."""
    vocab_files_map = tokenizer.pretrained_vocab_files_map
    vocab_files = {}
    for resource in vocab_files_map.keys():
        download_location = vocab_files_map[resource][model_type]
        f_path = os.path.join(output_path, os.path.basename(download_location))
        if not vocab_exist_bool:
            urllib.request.urlretrieve(download_location, f_path)
        vocab_files[resource] = f_path
    return vocab_files


model_type = 'bert-base-uncased'

# initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_type)
# will do this part later

# retrieve the vocab file if it's not there
output_path = os.path.join(os.getcwd(), 'vocab_files')
os.makedirs(output_path, exist_ok=True)
vocab_file_name = 'bert-base-uncased-vocab.txt'
vocab_exist_bool = os.path.exists(os.path.join(output_path, vocab_file_name))

# get vocab files, downloading only when they are missing locally
vocab_files = download_vocab_files_for_tokenizer(tokenizer, model_type, output_path, vocab_exist_bool=vocab_exist_bool)
```