I’d like to create some ridiculously-easy-to-use pip packages for loading common machine-learning datasets in Python. (Yes, some stuff already exists, but I want it to be even simpler.)
What I’d like to achieve is this:
- User runs

      pip install dataset

- pip downloads the dataset, say via

      wget http://mydata.com/data.tar.gz

  Note that the data does not reside in the Python package itself, but is downloaded from somewhere else.
- pip extracts the data from this file and puts it in the directory that the package is installed in. (This isn't ideal, but the datasets are pretty small, so let's assume storing the data here isn't a big deal.)
- Later, when the user imports my module, the module automatically loads the data from that specific location.
This question is about bullets 2 and 3. Is there a way to do this with setuptools?
Answer
As alluded to by Kevin, Python package installs should be completely reproducible, and any potential external-download issues should be pushed to runtime. This therefore shouldn’t be handled with setuptools.
Instead, to avoid burdening the user, consider downloading the data in a lazy way, upon load. Example:
    import os

    def download_data(url='http://...'):
        # Download; extract data to disk.
        # Raise an exception if the link is bad, or we can't connect, etc.
        ...

    def load_data():
        if not os.path.exists(DATA_DIR):
            download_data()
        data = read_data_from_disk(DATA_DIR)
        return data
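A more concrete download_data might use the standard library's urllib and tarfile. Everything specific here is an assumption for illustration: the URL is the hypothetical one from the question, and DATA_DIR is just one plausible choice of cache location next to the module.

```python
import os
import tarfile
import tempfile
import urllib.request

# Assumed locations -- adjust for your package. The URL is the
# hypothetical one from the question, not a real endpoint.
DATA_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), "data")
DATA_URL = "http://mydata.com/data.tar.gz"

def download_data(url=DATA_URL, dest=DATA_DIR):
    """Download the archive at url and extract it into dest.

    urllib raises URLError (can't connect) or HTTPError (bad link),
    which simply propagate to the caller.
    """
    os.makedirs(dest, exist_ok=True)
    fd, archive = tempfile.mkstemp(suffix=".tar.gz")
    os.close(fd)
    try:
        # Fetch the archive to a temporary file, then unpack it.
        urllib.request.urlretrieve(url, archive)
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(dest)
    finally:
        os.remove(archive)
```

Because urlretrieve also accepts file:// URLs, this is easy to exercise locally against a tarball on disk before pointing it at the real server.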
We could then describe download_data in the docs, but the majority of users would never need to bother with it. This is somewhat similar to the behavior of the imageio module, which downloads necessary decoders at runtime rather than making the user manage the external downloads themselves.