I am using HuggingFace Transformers with PyTorch. My modus operandi is to download a pre-trained model and save it in a local project folder.
While doing so, I can see that a .bin file is saved locally, which contains the model weights. However, I am also downloading and saving a tokenizer, for which I cannot see any associated file.
So, how do I check whether a tokenizer has already been saved locally before downloading it? Secondly, apart from the usual os.path.isfile(...)
check, is there a better way to prioritize using a local copy from a given location before downloading?
Answer
I’ve used this code in the past for this purpose. You can adapt it to your setting.
import os
import urllib.request

from transformers import AutoTokenizer

def download_vocab_files_for_tokenizer(tokenizer, model_type, output_path, vocab_exist_bool=False):
    vocab_files_map = tokenizer.pretrained_vocab_files_map
    vocab_files = {}
    for resource in vocab_files_map.keys():
        download_location = vocab_files_map[resource][model_type]
        f_path = os.path.join(output_path, os.path.basename(download_location))
        # Only download if the file is not already present locally
        if not vocab_exist_bool:
            urllib.request.urlretrieve(download_location, f_path)
        vocab_files[resource] = f_path
    return vocab_files

model_type = 'bert-base-uncased'

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_type)

# Retrieve the vocab file only if it is not already there
output_path = os.path.join(os.getcwd(), 'vocab_files')
os.makedirs(output_path, exist_ok=True)  # make sure the target folder exists
vocab_file_name = 'bert-base-uncased-vocab.txt'
vocab_exist_bool = os.path.exists(os.path.join(output_path, vocab_file_name))

# Get the vocab files
vocab_files = download_vocab_files_for_tokenizer(tokenizer, model_type, output_path, vocab_exist_bool=vocab_exist_bool)
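As for checking whether a tokenizer is already on disk before downloading, a simple pattern is to test for the files that `tokenizer.save_pretrained(save_dir)` writes and only download when they are missing. Below is a minimal, standard-library-only sketch of that check; the file names used are illustrative assumptions, since the exact set (`tokenizer_config.json`, `vocab.txt`, `tokenizer.json`, etc.) varies with the tokenizer type and the transformers version. The demo touches empty files in a scratch directory to simulate a previous `save_pretrained` call instead of performing a real download.

```python
import os
import tempfile

def tokenizer_saved_locally(save_dir, required=("tokenizer_config.json", "vocab.txt")):
    """Return True if every expected tokenizer file already exists in save_dir.

    `required` is an assumption: list here whichever files your tokenizer's
    save_pretrained() actually produces for your model type.
    """
    return all(os.path.isfile(os.path.join(save_dir, name)) for name in required)

# Demonstrate with a scratch directory instead of a real download.
save_dir = tempfile.mkdtemp()
before = tokenizer_saved_locally(save_dir)           # nothing saved yet -> False
for name in ("tokenizer_config.json", "vocab.txt"):
    open(os.path.join(save_dir, name), "w").close()  # simulate save_pretrained()
after = tokenizer_saved_locally(save_dir)            # files present -> True
```

In practice you would pair this with `AutoTokenizer.from_pretrained(save_dir)` when the check passes and `AutoTokenizer.from_pretrained(model_type)` followed by `tokenizer.save_pretrained(save_dir)` when it fails. `from_pretrained` also accepts a `local_files_only=True` argument that raises instead of downloading, which you can catch as an alternative to checking paths yourself.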