
Efficiently reading small pieces from multiple HDF5 files?

I have one HDF5 file per day, each containing compressed data for many assets. Specifically, each h5 file contains 5000 assets and is organized in a key-value structure such as

{'asset1': pd.DataFrame(), 'asset2': pd.DataFrame(), ...}

The data for each asset has the same format and size, and altogether I have around 1000 days of data.
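For reference, here is a minimal sketch of the layout described above, assuming pandas' HDFStore and a one-file-per-day naming scheme (the file name, keys, and compression settings are illustrative, not my actual setup):

import pandas as pd

# One file per day, one key per asset (illustrative names)
day_file = '2020-01-01.h5'

# Writing: each asset's DataFrame is stored under its own key
with pd.HDFStore(day_file, mode='w', complevel=9, complib='blosc') as store:
    for asset in ('asset1', 'asset2'):
        store.put(asset, pd.DataFrame({'price': [1.0, 2.0]}))

# Reading a single asset back without touching the other keys
df = pd.read_hdf(day_file, key='asset1')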

Now the task is to do ad-hoc analysis of different assets across different days. That is, I might want to process data for 100 random assets on 100 random days, or all the assets for a particular day; the tasks can be quite arbitrary. I’m working on a server with 64 cores, so naturally I want to leverage multiprocessing to make the analysis faster.

I’m thinking about doing it in one of two ways, given 8 processes:

  1. Each process reads one asset from one day and does the analysis, so there could be many processes reading at the same time (sketched in code after this list).

  2. Load all the data using 1 process, and then split the analysis across 8 processes.
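To make the comparison concrete, here is a minimal sketch of method 1 using multiprocessing.Pool; the file names, asset keys, and the analyze() body are placeholders rather than my actual code:

import multiprocessing as mp
import pandas as pd

def analyze(df):
    # placeholder for the real analysis
    return df.mean().mean()

def worker(task):
    # method 1: each worker process reads its own (day, asset) slice, then analyzes it
    day_file, asset = task
    df = pd.read_hdf(day_file, key=asset)
    return (day_file, asset, analyze(df))

if __name__ == '__main__':
    # e.g. 100 random (day, asset) pairs would go here
    tasks = [('2020-01-01.h5', 'asset1'), ('2020-01-02.h5', 'asset2')]
    with mp.Pool(processes=8) as pool:
        results = pool.map(worker, tasks)

Method 2 would instead call pd.read_hdf for every task in the parent process and pass the resulting DataFrames to pool.map, so only one process touches the storage.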

The asset data is not big, so the IO for each single asset is quick. The time the analysis takes varies depending on the exact analysis, but we can assume it generally takes much longer than reading a single asset’s data. There are also other people using the same framework to do their analysis.

I’m mostly concerned that with method 1 the IO load grows quickly when many people spawn many processes to read at the same time. However, method 2 faces the same problem with multiple users, though less severely.

For now I’m using method 1 and it works fine. I’m wondering whether it is the right approach and what problems it might cause. Should I read many small pieces this way stochastically, or should I only let 1 process do the reading?

Edit: I’m reading all the data from a NAS connected to the server with a 10 Gbit/s NIC, so it may be a bit different from reading directly from local disk. The data size for a single asset is less than 100 MB.


Answer

I’m mostly concerned that with method 1 the IO load grows quickly when many people spawn many processes to read at the same time.

In some testing, I had a Cisco S3260 Storage Server with 256 GB of RAM and 56 SAS drives of 18 TB each. I used two clients, both connected via 10 GbE Intel X550 SFP+ NICs to a MikroTik CRS317-1G-16S+ Layer 3 switch.

The server had 4,428 files of at least 100 MB each, so I created a crude script to read 100 MB from a random file. Throughput peaked at about 4.5 Gb/s on the server side, and I didn’t notice performance issues with the existing load on the server.
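Roughly the kind of crude test I mean, reading 100 MB from a randomly chosen file; the mount point and file list here are illustrative, not the actual script:

import os, random, time

# illustrative path; the real test ran against files exported by the storage server
root = '/mnt/storage'
files = [os.path.join(root, f) for f in os.listdir(root)]
chunk = 100 * 1024 * 1024  # read 100 MB per attempt

path = random.choice(files)
start = time.time()
with open(path, 'rb') as fh:
    data = fh.read(chunk)
elapsed = time.time() - start
print('read %.0f MB from %s in %.2f s (%.2f Gb/s)'
      % (len(data) / 1e6, path, elapsed, len(data) * 8 / elapsed / 1e9))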

So I have some recommendations:

  • Use as many drives as possible, with each drive being as fast as possible. I used spinning disks, but SSDs would have been better.
  • Add as much RAM as you can, ideally at least 128 GB.
  • Prefer 10 GbE fiber connections over RJ-45 copper.

Should I read many small pieces this way stochastically or should I only let 1 process do the reading?

If your data store is sufficiently wide (e.g. similar to my example of 56 spinning disks), then you should be able to have multiple readers working simultaneously.

Your file system also matters greatly here. Any file system with enough redundancy to service reads from multiple disks helps. In my case, we use ZFS with two mirrored vdevs, so there are at least two drives holding each piece of data and ZFS queues each read to the least busy disk.


User contributions licensed under: CC BY-SA