How to save a list in a pandas dataframe cell to a HDF5 table format?

Question

I have a dataframe that I want to save in the appendable format to an HDF5 file. The dataframe looks like this: And the code that replicates the issue is: Unfortunately, it returns this error: I am aware that I can save each value in a separate column. This does not help my extended use case, as there might be

Accepted Answer

Python lists present a challenge when writing to HDF5 because they may contain different types. For example, this is a perfectly valid list: [1, 'two', 3.0]. Also, if I understand the 'column1' column of your dataframe correctly, it may contain lists of different lengths. There is no (simple) way to represent this as an HDF5 dataset.

That's why you got the [mixed] object dtype message. The conversion of the dataframe creates an intermediate object that is written as a dataset. The dtype of the converted list data is "O" (object), and HDF5 doesn't support this type.

However, all is not lost. If we can make some assumptions about your data, we can wrangle it into an HDF5 dataset. Assumptions: 1) all df list entities are the same type (int in this case), and 2) all df lists are the same length. (We can handle different length lists, but it is more complicated.) Also, you will need to use a different package to write the HDF5 data (either PyTables or h5py). PyTables is the underlying package for Pandas HDF5 support and h5py is widely used. The choice is yours.

Before I post the code, here is an outline of the process:

1. Create a NumPy record array (aka recarray) from the dataframe.
2. Define the desired type and shape for the HDF5 dataset (as an Atom for PyTables, or a dtype for h5py).
3. Create the dataset with the Atom/dtype definition above (could do on 1 line, but easier to read this way).
4. Loop over rows of the recarray (from Step 1), and write data to rows of the dataset. This converts the list to the equivalent array.

Code to create recarray (adds 2 rows to your dataframe):

    import pandas as pd

    test = pd.DataFrame({"column1": [list(range(0, 5)), list(range(10, 15)), list(range(100, 105))]})
    # create recarray from the dataframe (index=False leaves the index out,
    # so the records hold only column1)
    rec_arr = test.to_records(index=False)

PyTables specific code to export data:

    import tables as tb

    with tb.open_file('74489101_tb.h5', 'w') as h5f:
        # define "atom" with type and shape of column1 data
        df_atom = tb.Atom.from_type('int32', shape=(len(rec_arr[0]['column1']),))
        # create the dataset
        ds = h5f.create_array('/', 'test', shape=rec_arr.shape, atom=df_atom)
        # loop over recarray and populate dataset
        for i in range(rec_arr.shape[0]):
            ds[i] = rec_arr[i]['column1']
        print(ds[:])

h5py specific code to export data:

    import h5py

    with h5py.File('74489101_h5py.h5', 'w') as h5f:
        # define dtype with type and shape of column1 data
        df_dt = (int, (len(rec_arr[0]['column1']),))
        # create the dataset
        ds = h5f.create_dataset('test', shape=rec_arr.shape, dtype=df_dt)
        # loop over recarray and populate dataset
        for i in range(rec_arr.shape[0]):
            ds[i] = rec_arr[i]['column1']
        print(ds[:])
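Both export variants rely on the two assumptions stated earlier (one element type, one list length). A minimal sketch of a pre-flight check is shown here; the helper name `check_uniform_lists` is hypothetical and not part of pandas, PyTables, or h5py. It uses only the standard library and works on any iterable of lists.

```python
def check_uniform_lists(rows):
    """Return (element_type, length) shared by every list in rows.

    Hypothetical helper: raises ValueError when the lists differ in
    length or element type, i.e. when either assumption does not hold.
    """
    rows = [list(row) for row in rows]
    if not rows:
        raise ValueError("no rows to inspect")

    # Assumption 2: every list has the same length.
    lengths = {len(row) for row in rows}
    if len(lengths) != 1:
        raise ValueError(f"lists have different lengths: {sorted(lengths)}")

    # Assumption 1: every element has the same type.
    # (Empty lists yield no types at all, which is also rejected.)
    types = {type(value) for row in rows for value in row}
    if len(types) != 1:
        names = sorted(t.__name__ for t in types)
        raise ValueError(f"expected exactly one element type, found: {names}")

    return types.pop(), lengths.pop()


# Same values as the three-row example dataframe, as plain lists.
rows = [list(range(0, 5)), list(range(10, 15)), list(range(100, 105))]
elem_type, length = check_uniform_lists(rows)
```

With a dataframe `df`, the same check could be run as `check_uniform_lists(df["column1"])` before building the Atom or dtype, and the returned length can replace the `len(rec_arr[0]['column1'])` expression.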
