I have an HDF5 file for every day, which contains compressed data for many assets. Specifically, each H5 file contains 5000 assets and is organized as a key-value structure such as
{'asset1': pd.DataFrame(), 'asset2': pd.DataFrame(), ...}
The data for each asset has the same format and size, and all together I have around 1000 days of data.
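For concreteness, assuming each daily file was written with pandas' HDFStore (the path and key names below are hypothetical), a single asset can be read without loading the whole file:

```python
import pandas as pd

# Hypothetical layout: one HDF5 file per day, one key per asset,
# written with pandas' HDFStore (e.g. df.to_hdf(path, key='asset1')).
path = "/nas/data/20240101.h5"
with pd.HDFStore(path, mode="r") as store:
    keys = store.keys()      # ['/asset1', '/asset2', ...]
    df = store["asset1"]     # DataFrame for a single asset
```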
Now the task is to do ad-hoc analysis of different assets across different days. That is, I might want to process data for 100 random assets on 100 random days, or all the assets for a particular day; the tasks can be quite random. I'm working on a server with 64 cores, so naturally I want to leverage multiprocessing to make the analysis faster.
I'm thinking about doing it in two ways, given 8 processes:
1. Each process reads one asset from one day and does the analysis, so there could be many processes reading at the same time (see the sketch after this list).
2. Load all the data using 1 process, and then split the analysis across 8 processes.
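A minimal sketch of method 1, assuming one HDF5 file per day keyed by asset (the paths, day/asset names, and the analysis function are placeholders):

```python
import multiprocessing as mp
import pandas as pd

def path_for_day(day):
    # Assumed naming scheme for the daily files on the NAS.
    return f"/nas/data/{day}.h5"

def analyze(df):
    # Stand-in for the real per-asset analysis.
    return df.mean()

def worker(task):
    day, asset = task
    # Each worker reads only the asset it needs and analyzes it,
    # so many processes may hit the NAS at the same time.
    df = pd.read_hdf(path_for_day(day), key=asset)
    return day, asset, analyze(df)

if __name__ == "__main__":
    days = ["20240101", "20240102"]    # e.g. 100 random days
    assets = ["asset1", "asset2"]      # e.g. 100 random assets
    tasks = [(d, a) for d in days for a in assets]
    with mp.Pool(processes=8) as pool:
        results = pool.map(worker, tasks)
```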
The asset data is not big, so the IO for each single asset is quick. The time for the analysis varies depending on the exact analysis, but we can assume it generally takes much longer than reading the data for a single asset. There are also other people using the same framework to do analysis.
I'm mostly concerned that with method 1 the IO burden grows quickly when many people spawn many processes to read at the same time. Method 2 faces the same problem with multiple users, though less seriously.
For now I'm using method 1 and it works fine. I'm wondering whether it is the correct way to do it, or what problems it might cause. Should I read many small pieces this way stochastically, or should I only let 1 process do the reading?
Edit: I'm reading all the data from a NAS connected to the server with a 10,000 Mbit/s NIC, so maybe it is a bit different from reading directly from disk. The data size for a single asset is less than 100 MB.
Answer
I'm mostly concerned that with method 1, the IO burden grows quickly?
In some testing, I had a Cisco S3260 Storage Server with 256 GB of RAM and 56 SAS drives at 18 TB each. I used two clients, both connected via a 10 GbE Intel X550 SFP+ to a Mikrotik CRS317-1G-16S+ layer 3 switch.
The server had 4,428 files of at least 100 MB in size. So I created a crude script to read 100 MB from a random file. I found the performance peaked at about 4.5 Gb/s at the server side. I didn’t notice performance issues with the existing load on the server.
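Roughly the kind of test I mean (this is a sketch, not the original script; the share path is made up): pick a random file, read 100 MB from it, and print the throughput.

```python
import os
import random
import time

SHARE = "/mnt/share"          # hypothetical mount point of the file server
CHUNK = 100 * 1024 * 1024     # read 100 MB per iteration

files = [os.path.join(SHARE, f) for f in os.listdir(SHARE)]
start = time.time()
with open(random.choice(files), "rb") as f:
    data = f.read(CHUNK)
elapsed = time.time() - start
print(f"read {len(data)} bytes at {len(data) / elapsed / 1e6:.0f} MB/s")
```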
So I have some recommendations:
- Use as many drives as possible, with each drive being as fast as possible. I used spinning disks, but SSD would have been better.
- Add as much ram as you can, ideally at least 128 GB.
- Use 10 GbE fiber connections rather than RJ-45 copper.
Should I read many small pieces this way stochastically or should I only let 1 process do the reading?
If your data store is sufficiently wide (e.g. similar to my example of 56 spinning disks), then you should be able to have multiple readers operating simultaneously.
Your file system also matters greatly here. Any file system with enough redundancy that it can service reads from multiple disks helps. In my case, we use ZFS with two mirrored vdevs, so there are at least two drives holding the data and ZFS queues each read to the least busy disk.