I have a dataset composed as:
dataset = [{"sample":[numpy array (2048,3) shape], "category":"Cat"}, ....]
Each element of the list is a dictionary: the "sample" key holds a NumPy array of shape (2048, 3), and "category" is the class of that sample. The dataset has 8000 elements.
I tried to save it as JSON, but it failed because NumPy arrays can't be serialized.
What's the best way to save this list? I can't use np.save("file", dataset) because the list contains dictionaries, and I can't use JSON because of the NumPy arrays. Should I use HDF5? What format should I use if the dataset is for machine learning?
Thanks!
Answer
Creating an example specific to your data requires more details about the dictionaries in the list. I created an example that assumes every dictionary has:
- A unique value for the category key. The value is used for the dataset name.
- A sample key with the array you want to save.
The code below creates some data, writes it to an HDF5 file with the h5py package, then reads the data back into a new list of dictionaries. It is a good starting point for your problem.
import numpy as np
import h5py

# Create three small example arrays
a0, a1 = 10, 5
arr1 = np.arange(a0*a1).reshape(a0,a1)
arr2 = np.arange(a0*a1,2*a0*a1).reshape(a0,a1)
arr3 = np.arange(2*a0*a1,3*a0*a1).reshape(a0,a1)

dataset = [{"sample":arr1, "category":"Cat"},
           {"sample":arr2, "category":"Dog"},
           {"sample":arr3, "category":"Fish"},
          ]

# Create the HDF5 file with "category" as dataset name and "sample" as the data
with h5py.File('SO_73499414.h5', 'w') as h5f:
    for ds_dict in dataset:
        h5f.create_dataset(ds_dict["category"], data=ds_dict["sample"])

# Retrieve the HDF5 data with "category" as dataset name and "sample" as the data
ds_list = []
with h5py.File('SO_73499414.h5', 'r') as h5f:
    for ds_name in h5f:
        print(ds_name,'\n',h5f[ds_name])  # prints name and dataset summary (shape, dtype)
        print(h5f[ds_name][()])  # prints the dataset values (as an array)
        # add data and name to list
        ds_list.append({"sample":h5f[ds_name][()], "category":ds_name})
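Here is a second method for when the category values aren't unique. Each dataset is named with a counter, and the category is saved as an attribute of that dataset.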
import numpy as np
import h5py

# Create four small example arrays; categories repeat
a0, a1 = 10, 5
arr1 = np.arange(a0*a1).reshape(a0,a1)
arr2 = np.arange(a0*a1,2*a0*a1).reshape(a0,a1)
arr3 = np.arange(2*a0*a1,3*a0*a1).reshape(a0,a1)
arr4 = np.arange(3*a0*a1,4*a0*a1).reshape(a0,a1)

dataset = [{"sample":arr1, "category":"Cat"},
           {"sample":arr2, "category":"Dog"},
           {"sample":arr3, "category":"Cat"},
           {"sample":arr4, "category":"Dog"}
          ]

# Create the HDF5 file with dataset name using counter and "sample" as the data
# "category" is saved as a dataset attribute
with h5py.File('SO_73499414.h5', 'w') as h5f:
    for i, ds_dict in enumerate(dataset):
        ds = h5f.create_dataset(f'ds_{i:04}', data=ds_dict["sample"])
        ds.attrs["category"] = ds_dict["category"]

# Retrieve the HDF5 data with "sample" as the data and "category" from the attribute
ds_list = []
with h5py.File('SO_73499414.h5', 'r') as h5f:
    for ds_name in h5f:
        print(ds_name,'\n',h5f[ds_name])  # prints name and dataset summary (shape, dtype)
        print(h5f[ds_name].attrs["category"])  # prints the category attribute
        print(h5f[ds_name][()])  # prints the dataset values (as an array)
        # add data and category to list
        ds_list.append({"sample":h5f[ds_name][()], "category":h5f[ds_name].attrs["category"]})
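Since every sample in the question has the same shape, (2048, 3), another layout that may suit machine learning is to stack all samples into a single array and keep the categories as a parallel array of integer labels. The sketch below is only one possible arrangement, not part of the original answer: the file name stacked_samples.h5 and the dataset names "samples", "labels" and "class_names" are made up for illustration, and the four random arrays stand in for the real 8000 samples.

```python
import os
import tempfile

import h5py
import numpy as np

# Small stand-in for the real data (assumption: every sample has shape (2048, 3))
rng = np.random.default_rng(0)
dataset = [
    {"sample": rng.random((2048, 3)), "category": cat}
    for cat in ["Cat", "Dog", "Cat", "Fish"]
]

# Stack the samples into one (N, 2048, 3) array
samples = np.stack([d["sample"] for d in dataset])

# Encode each category as an integer index into a sorted list of class names
class_names = sorted({d["category"] for d in dataset})
labels = np.array([class_names.index(d["category"]) for d in dataset], dtype=np.int64)

# Hypothetical file name, written to the temp directory for this example
h5_path = os.path.join(tempfile.gettempdir(), "stacked_samples.h5")

with h5py.File(h5_path, "w") as h5f:
    h5f.create_dataset("samples", data=samples, compression="gzip")
    h5f.create_dataset("labels", data=labels)
    # Class names stored as fixed-length byte strings (ASCII names assumed)
    h5f.create_dataset("class_names", data=np.array(class_names, dtype="S"))

# Read everything back and rebuild the original list of dictionaries
with h5py.File(h5_path, "r") as h5f:
    samples_back = h5f["samples"][()]
    labels_back = h5f["labels"][()]
    names_back = [name.decode("utf-8") for name in h5f["class_names"][()]]

ds_list = [
    {"sample": s, "category": names_back[i]}
    for s, i in zip(samples_back, labels_back)
]
```

With this layout the samples array and the labels array can usually be passed to a training loop directly, and slicing h5f["samples"] reads only the requested rows instead of loading the whole file.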