I am using HuggingFace Transformers with PyTorch. My modus operandi is to download a pre-trained model and save it in a local project folder.
While doing so, I can see that a .bin file, which contains the model weights, is saved locally. However, I am also downloading and saving a tokenizer, for which I cannot see any associated file.
So, how do I check whether a tokenizer is saved locally before downloading? Secondly, apart from the usual os.path.isfile(...) check, is there a better way to prioritize using a local copy from a given location before downloading?
Answer
I’ve used this code in the past for this purpose. You can adapt it to your setting.
```python
import os
import urllib.request

from transformers import AutoTokenizer


def download_vocab_files_for_tokenizer(tokenizer, model_type, output_path, vocab_exist_bool=False):
    """Download the tokenizer's vocab files to output_path, unless they already exist locally."""
    vocab_files_map = tokenizer.pretrained_vocab_files_map
    vocab_files = {}
    for resource in vocab_files_map.keys():
        download_location = vocab_files_map[resource][model_type]
        f_path = os.path.join(output_path, os.path.basename(download_location))
        if not vocab_exist_bool:
            urllib.request.urlretrieve(download_location, f_path)
        vocab_files[resource] = f_path
    return vocab_files


model_type = 'bert-base-uncased'

# initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_type)
# will do this part later

# retrieve the vocab file if it's not there
output_path = os.path.join(os.getcwd(), 'vocab_files')
os.makedirs(output_path, exist_ok=True)
vocab_file_name = 'bert-base-uncased-vocab.txt'
vocab_exist_bool = os.path.exists(os.path.join(output_path, vocab_file_name))

# get vocab files, downloading only when they are missing locally
vocab_files = download_vocab_files_for_tokenizer(tokenizer, model_type, output_path, vocab_exist_bool=vocab_exist_bool)
```