Convert dataframe to a rec array (and objects to strings)

I have a pandas dataframe with a mix of datatypes (dtypes) that I wish to convert to a numpy structured array (or record array, which is basically the same thing in this case). For purely numeric dataframes, this is easy to do with the to_records() method. I also need the object-dtype columns converted to string dtypes so that I can use the numpy method tofile(), which writes numbers and fixed-width strings to a binary file but will not write objects.

In a nutshell, I need to convert pandas columns with dtype=object into fields of string or unicode dtype in a numpy structured array.
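
To make the constraint above concrete, here is a minimal sketch. The field names x and s and the helper can_write are made up for illustration. It assumes the behaviour described above: tofile() in binary mode rejects a structured array that has an object field, and accepts the same data once that field is a fixed-width unicode dtype.

```python
import os
import tempfile

import numpy as np

# Hypothetical two-field array: one float field and one object field.
obj_arr = np.array([(1.0, 'a'), (2.0, 'bb')],
                   dtype=[('x', '<f8'), ('s', 'O')])

# Same data with the object field cast to a fixed-width unicode field.
str_arr = obj_arr.astype([('x', '<f8'), ('s', '<U2')])


def can_write(arr):
    """Return True if arr.tofile() writes arr.nbytes bytes without raising."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'out.bin')
        try:
            arr.tofile(path)
        except (OSError, ValueError):
            # numpy refuses to dump Python objects as raw bytes
            return False
        return os.path.getsize(path) == arr.nbytes


print(can_write(obj_arr))
print(can_write(str_arr))
```

The fixed width matters because tofile() writes raw bytes with no metadata, so every record must have the same size on disk.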

Here’s an example, with code that would be sufficient if all columns had a numerical (float or int) dtype.

import pandas as pd
df=pd.DataFrame({'f_num': [1.,2.,3.], 'i_num':[1,2,3], 
                 'char': ['a','bb','ccc'], 'mixed':['a','bb',1]})

struct_arr=df.to_records(index=False)

print('struct_arr',struct_arr.dtype,'\n')

# struct_arr (numpy.record, [('f_num', '<f8'), ('i_num', '<i8'), 
#                            ('char', 'O'), ('mixed', 'O')]) 

But because I want to end up with string dtypes, I need to add this additional and somewhat involved code:

lst=[]
for col in struct_arr.dtype.names:  # this was the only iterator I 
                                    # could find for the column labels
    dt=struct_arr[col].dtype

    if dt == 'O':   # this is 'O', meaning 'object'

        # it appears an explicit string length is required
        # so I calculate with pandas len & max methods
        dt = 'U' + str( df[col].astype(str).str.len().max() )
       
    lst.append((col,dt))

struct_arr = struct_arr.astype(lst)
        
print('struct_arr',struct_arr.dtype)

# struct_arr (numpy.record, [('f_num', '<f8'), ('i_num', '<i8'), 
#                            ('char', '<U3'), ('mixed', '<U2')])

See also: How to change the dtype of certain columns of a numpy recarray?

This seems to work, as the character and mixed dtypes are now <U3 and <U2 rather than ‘O’ or ‘object’. I’m just checking whether there is a simpler or more elegant approach. Since pandas, unlike numpy, has no native fixed-width string type, maybe there is not?


Answer

Combining suggestions from @jpp (list comp for conciseness) & @hpaulj (cannibalize to_records for speed), I came up with the following, which is cleaner code and also about 5x faster than my original code (tested by expanding the sample dataframe above to 10,000 rows):

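import numpy as np

names = list(df.columns)  # plain list of column labels
# get_values() was removed in pandas 1.0; to_numpy() replaces it
arrays = [ df[col].to_numpy() for col in names ]

# keep numeric dtypes; for object columns let astype(str) pick the unicode width
formats = [ array.dtype if array.dtype != 'O' 
            else f'{array.astype(str).dtype}' for array in arrays ] 

rec_array = np.rec.fromarrays( arrays, dtype={'names': names, 'formats': formats} )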

The above produces unicode (U) fields rather than byte-string (S) fields, which is probably better in general. In my case, though, I need byte strings because I’m reading the binary file in Fortran, where they seem to read in more easily. Hence, it may be better to replace the “formats” line above with this:

formats = [ array.dtype if array.dtype != 'O' 
            else array.astype(str).dtype.str.replace('<U','S') for array in arrays ]

E.g. a dtype of <U4 becomes S4.
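
As a possible alternative, to_records() itself accepts a column_dtypes mapping in newer pandas (0.24 or later, as far as I know), so the per-column dtype choice can be passed straight to it. The sketch below is only an illustration: the helper name to_fixed_width_records and its kind parameter are hypothetical, the width is measured the same way as in the question, and the 'S' variant assumes ASCII-only text.

```python
import pandas as pd

df = pd.DataFrame({'f_num': [1., 2., 3.], 'i_num': [1, 2, 3],
                   'char': ['a', 'bb', 'ccc'], 'mixed': ['a', 'bb', 1]})


def to_fixed_width_records(frame, kind='U'):
    """Hypothetical helper: object-like columns become 'U<n>' or 'S<n>' fields."""
    column_dtypes = {}
    for col in frame.columns:
        # to_numpy() yields an object array for string or mixed columns
        if frame[col].to_numpy().dtype == object:
            width = int(frame[col].astype(str).str.len().max())
            column_dtypes[col] = f'{kind}{width}'
    # columns missing from the mapping keep their own dtype
    return frame.to_records(index=False, column_dtypes=column_dtypes)


rec_u = to_fixed_width_records(df, kind='U')
rec_s = to_fixed_width_records(df, kind='S')  # assumes ASCII-only text

print(rec_u.dtype)
print(rec_s.dtype)
# a unicode field stores 4 bytes per character, a byte-string field stores 1
print(rec_u.dtype['char'].itemsize, rec_s.dtype['char'].itemsize)
```

The item sizes show why the S variant is easier to read from another language: each character occupies one byte, so a Fortran character field of the same length lines up with it directly.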

User contributions licensed under: CC BY-SA