How to use HDF5 dimension scales in h5py

HDF5 has the concept of dimension scales, as explained on the HDF5 and h5py websites. However, the explanations both use terse or generic examples and so I don’t really understand how to use dimension scales. Namely, given a dataset f['coordinates'] in some HDF5 file f = h5py.File('data.h5'):

>>> f['coordinates'][()]
array([[ 52.60636111,   4.38963889],
       [ 52.57877778,   4.43422222],
       [ 52.58319444,   4.42811111],
       ...,
       [ 52.62269444,   4.43130556],
       [ 52.62711111,   4.42519444],
       [ 52.63152778,   4.41905556]])

I’d like to make it clear that the first column is the latitude and the second is the longitude. Are dimension scales used for this? Or are they used to indicate that the unit is degrees? Or both?

Perhaps another concrete example can illustrate the use of dimension scales better? If you have one, please share it, even if you are not using h5py.

Answer

Specifically for this question, the best answer is probably to use attributes:

f['coordinates'].attrs['columns'] = ['latitude', 'longitude']

But dimension scales are useful for other things. I’ll show what they’re for, how you could use them in a way similar to attributes, and how you might actually use your f['coordinates'] dataset as a scale for some other dataset.
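For the attribute approach mentioned above, here is a minimal sketch of writing the column names and reading them back to select a column. The file location, the two sample rows, and the extra "units" attribute are assumptions made for illustration. Passing an explicit string dtype through attrs.create is also my choice here, on the assumption that it behaves more predictably across h5py versions than assigning a plain list.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical sample rows; the real dataset has many more.
sample = np.array([[52.60636111, 4.38963889], [52.57877778, 4.43422222]])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.h5")

    with h5py.File(path, "w") as f:
        coords = f.create_dataset("coordinates", data=sample)
        # Store the column names as variable-length UTF-8 strings.
        coords.attrs.create(
            "columns", data=["latitude", "longitude"], dtype=h5py.string_dtype()
        )
        # "units" is an illustrative attribute name, not a fixed convention.
        coords.attrs["units"] = "degrees"

    with h5py.File(path, "r") as f:
        coords = f["coordinates"]
        # Decode defensively in case an older h5py returns bytes.
        columns = [
            c.decode() if isinstance(c, bytes) else str(c)
            for c in coords.attrs["columns"]
        ]
        units = coords.attrs["units"]
        latitude = coords[:, columns.index("latitude")]

print(columns, units)
print(latitude)
```

Attributes like these describe the dataset itself, which is why they suit the "which column is which" question better than a scale does.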

Dimension scales

I agree that those documentation pages are not as clear as they could be, because they launch into complicated possibilities and get mired in technical details before they actually explain the basic concepts. I think some simple examples should make things clear.

First, suppose you’ve kept track of the temperature outside over the course of a day — maybe measuring it every hour on the hour, for a total of 24 measurements. You might think of this as two columns of data: one for the hour, and one for the temperature. You could store this as a single dataset of shape 24×2. But time and temperature have different units, and are really different datatypes. So it might make more sense to store time and temperature as separate datasets — probably named "time" and "temperature", each of shape 24. But you’d also need to be a little more clear about what these are and how they’re related to each other. That relationship is what “dimension scales” are really for.

If you imagine plotting the temperature as a function of time, you might label the horizontal axis as “Time (hour of day)”, and the scale for the horizontal axis would be the hours themselves, telling you the horizontal position at which to plot each temperature. You could store this information through h5py like this:

import h5py

with h5py.File("temperatures.h5", "w") as f:
    time = f.create_dataset("time", data=...)
    time.make_scale("hour of day")  # mark "time" as a dimension scale
    temp = f.create_dataset("temperature", data=...)
    temp.dims[0].label = "Time"  # general name for this axis
    temp.dims[0].attach_scale(time)  # link the scale to axis 0

Note that the argument to make_scale is specific information about that particular time dataset — in this case, the units we used to measure time — whereas the label is the more general concept of that dimension. Also note that it’s actually more standard to attach unit information as attributes, but I like this approach more for specifying the unit of a scale because of its simplicity.
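To try the single-location example end to end, the sketch below fills in the elided data with made-up readings and then reads the label and scale back. The sine-shaped temperatures and the temporary file location are assumptions for illustration only; the h5py calls are the same ones used above.

```python
import os
import tempfile

import h5py
import numpy as np

# Made-up readings, purely for illustration.
hours = np.arange(24)
readings = 15.0 + 5.0 * np.sin((hours - 9) * np.pi / 12)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "temperatures.h5")

    with h5py.File(path, "w") as f:
        time = f.create_dataset("time", data=hours)
        time.make_scale("hour of day")
        temp = f.create_dataset("temperature", data=readings)
        temp.dims[0].label = "Time"
        temp.dims[0].attach_scale(time)

    with h5py.File(path, "r") as f:
        dim = f["temperature"].dims[0]
        label = dim.label
        scale_names = list(dim.keys())
        # The attached scale is an ordinary dataset, looked up by scale name.
        x = dim["hour of day"][:]
        y = f["temperature"][:]

print(label, scale_names)
print(x[:3], y[:3])
```

The arrays x and y are exactly what you would pass to a plotting function, with the label as the axis title.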

Now, suppose we measured the temperatures in three different places: say, Los Angeles, Chicago, and New York. Our array of temperatures would then have shape 24×3. We would still need the time scale for dims[0], but now we also have dims[1] to deal with:

with h5py.File("temperatures.h5", "w") as f:
    time = f.create_dataset("time", data=...)
    time.make_scale("hour of day")
    cities = f.create_dataset("cities",
        data=[s.encode() for s in ["Los Angeles", "Chicago", "New York"]]
    )
    cities.make_scale("city")
    temp = f.create_dataset("temperature", data=...)
    temp.dims[0].label = "Time"
    temp.dims[0].attach_scale(time)
    temp.dims[1].label = "Location"
    temp.dims[1].attach_scale(cities)

It might be more useful to store the latitude and longitude, instead of city names. You can actually attach both types of scale to the same dimension. Just add code like this at the bottom of that last code block:

    latlong = f.create_dataset("latlong",
        data=[[34.0522, 118.2437], [41.8781, 87.6298], [40.7128, 74.0060]]
    )
    latlong.make_scale("latitude and longitude (degrees)")
    temp.dims[1].attach_scale(latlong)

Finally, you can access these labels and scales like this:

with h5py.File("temperatures.h5", "r") as f:
    print('Datasets:', list(f))
    print('Temperature dimension labels:', [dim.label for dim in f['temperature'].dims])
    print('Temperature dim[1] scales:', f['temperature'].dims[1].keys())
    latlong = f['temperature'].dims[1]['latitude and longitude (degrees)'][:]
    print(latlong)

The output looks like this:

Datasets: ['cities', 'latlong', 'temperature', 'time']
Temperature dimension labels: ['Time', 'Location']
Temperature dim[1] scales: ['city', 'latitude and longitude (degrees)']
[[ 34.0522 118.2437]
 [ 41.8781  87.6298]
 [ 40.7128  74.006 ]]
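Because attached scales are ordinary datasets, they can also be used to translate a meaningful coordinate (an hour, a city name) into an array index. The following is a rough sketch of that idea, not something h5py does for you automatically. The placeholder temperature values, the temporary file location, and the variable names are all assumptions for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

city_names = ["Los Angeles", "Chicago", "New York"]
# Placeholder values: row = hour of day, column = city.
values = np.arange(24 * 3, dtype=float).reshape(24, 3)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "temperatures.h5")

    with h5py.File(path, "w") as f:
        time = f.create_dataset("time", data=np.arange(24))
        time.make_scale("hour of day")
        cities = f.create_dataset("cities", data=[s.encode() for s in city_names])
        cities.make_scale("city")
        temp = f.create_dataset("temperature", data=values)
        temp.dims[0].attach_scale(time)
        temp.dims[1].attach_scale(cities)

    with h5py.File(path, "r") as f:
        temp = f["temperature"]
        # Read both scales through the dimensions they are attached to.
        hour_scale = temp.dims[0]["hour of day"][:]
        city_scale = [c.decode() for c in temp.dims[1]["city"][:]]
        # Turn coordinates into indices, then index the data.
        row = int(np.flatnonzero(hour_scale == 12)[0])
        col = city_scale.index("Chicago")
        chicago_at_noon = float(temp[row, col])

print(row, col, chicago_at_noon)
```

The lookup itself is plain NumPy and list code; the scales only record which dataset describes which axis.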
User contributions licensed under: CC BY-SA