How do I check if a tokenizer/model is already saved?

I am using HuggingFace Transformers with PyTorch. My modus operandi is to download a pre-trained model and save it in a local project folder.

While doing so, I can see that a .bin file is saved locally, which holds the model weights. However, I am also downloading and saving a tokenizer, for which I cannot see any associated file.

So, how do I check whether a tokenizer is saved locally before downloading? Secondly, apart from the usual os.path.isfile(...) check, is there a better way to prefer a local copy at a given location over downloading?

Answer

I’ve used this code in the past for this purpose. You can adapt it to your setting.

import os
import urllib.request  # a bare "import urllib" does not reliably expose urllib.request
from tokenizers import BertWordPieceTokenizer
from transformers import AutoTokenizer

def download_vocab_files_for_tokenizer(tokenizer, model_type, output_path, vocab_exist_bool=False):
    # pretrained_vocab_files_map maps each resource (e.g. 'vocab_file') to
    # {model name: download URL}; it may not exist in newer transformers releases
    vocab_files_map = tokenizer.pretrained_vocab_files_map
    vocab_files = {}
    for resource in vocab_files_map.keys():
        download_location = vocab_files_map[resource][model_type]
        f_path = os.path.join(output_path, os.path.basename(download_location))
        # download only when the file is not already present locally
        if not vocab_exist_bool:
            urllib.request.urlretrieve(download_location, f_path)
        vocab_files[resource] = f_path
    return vocab_files

model_type = 'bert-base-uncased'
#initialized tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_type)
#will do this part later

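#retrieve vocab file if it's not there
output_path = os.path.join(os.getcwd(), 'vocab_files')
os.makedirs(output_path, exist_ok=True)  # urlretrieve fails if the folder is missing
vocab_file_name = 'bert-base-uncased-vocab.txt'
vocab_exist_bool = os.path.exists(os.path.join(output_path, vocab_file_name))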

#get vocab files
vocab_files = download_vocab_files_for_tokenizer(tokenizer, model_type, output_path, vocab_exist_bool=vocab_exist_bool)
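
On the second part of the question (preferring a local copy), a simpler route that may work is to let the library write and read the files itself: save_pretrained() writes the tokenizer files into a folder, and from_pretrained() accepts that folder in place of a model name. The following is only a minimal sketch. The helper names has_local_copy and load_tokenizer are made up for illustration, and it assumes that tokenizer_config.json is among the files save_pretrained() writes, so check what actually appears in your folder.

```python
import os

def has_local_copy(local_dir, expected_files=("tokenizer_config.json",)):
    """Return True if local_dir exists and contains every expected file.

    The default file name is an assumption about what save_pretrained() writes.
    """
    return os.path.isdir(local_dir) and all(
        os.path.isfile(os.path.join(local_dir, name)) for name in expected_files
    )

def load_tokenizer(model_name, local_dir):
    """Load from local_dir when a copy is present; otherwise download once and save it."""
    # Imported lazily so the helper above can be used without transformers installed.
    from transformers import AutoTokenizer

    if has_local_copy(local_dir):
        # local_files_only=True asks the library not to contact the hub
        return AutoTokenizer.from_pretrained(local_dir, local_files_only=True)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.save_pretrained(local_dir)  # writes the tokenizer files into local_dir
    return tokenizer
```

Usage would look like tokenizer = load_tokenizer('bert-base-uncased', os.path.join(os.getcwd(), 'tokenizer_files')). The first call downloads and saves, and later calls should read from the folder. The same pattern should carry over to the model via AutoModel.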
User contributions licensed under: CC BY-SA