tl;dr: Need help cleaning up my downcast_int(df) function below.
Hello, I’m trying to write my own downcasting functions to reduce memory usage. I am curious about alternatives to my (frankly, quite messy, but functioning) code that would make it more readable and, perhaps, faster.
The downcasting function directly modifies my dataframe, something I am not sure I should be doing.
Any help is appreciated.
Example df
import pandas as pd

df = pd.DataFrame({
    'first': [1_000, 200_000],
    'second': [-30, -40_000],
    'third': ["some", "string"],
    'fourth': [4.5, 6.1],
    'fifth': [-6, -8]
})
    first  second   third  fourth  fifth
0    1000     -30    some     4.5     -6
1  200000  -40000  string     6.1     -8
df.info()
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   first   2 non-null      int64
 1   second  2 non-null      int64
 2   third   2 non-null      object
 3   fourth  2 non-null      float64
 4   fifth   2 non-null      int64
dtypes: float64(1), int64(3), object(1)
Downcasting function
def downcast_int(df):
    """Select all int columns. Convert them to unsigned or signed types."""
    cols = df.select_dtypes(include=['int64']).columns
    cols_unsigned = None

    # There is at least one negative number in a column.
    if (df[cols] < 0).any().any():
        df_unsigned = (df[cols] < 0).any()
        cols_unsigned = df_unsigned[df_unsigned == True].index
        df[cols_unsigned] = df[cols_unsigned].apply(pd.to_numeric, downcast='signed')

    # If there were any changed columns, remove them.
    if cols_unsigned is not None:
        cols = cols.drop(cols_unsigned)

    # Turn the remaining columns into unsigned integers.
    df[cols] = df[cols].apply(pd.to_numeric, downcast='unsigned')
df.info() after downcasting
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   first   2 non-null      uint32
 1   second  2 non-null      int32
 2   third   2 non-null      object
 3   fourth  2 non-null      float64
 4   fifth   2 non-null      int8
dtypes: float64(1), int32(1), int8(1), object(1), uint32(1)
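One way to check how much memory the downcast saves is to compare `memory_usage()` before and after. The sketch below is illustrative only: it rebuilds the example frame and casts the integer columns with `astype` to the dtypes reported in the `df.info()` output above, rather than calling `downcast_int`. The names `int_cols`, `small`, `before`, and `after` are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'first': [1_000, 200_000],
    'second': [-30, -40_000],
    'third': ["some", "string"],
    'fourth': [4.5, 6.1],
    'fifth': [-6, -8],
})

# Hypothetical helper names, used only for this comparison.
int_cols = ['first', 'second', 'fifth']

# Bytes used by the integer columns before downcasting (index excluded).
before = df[int_cols].memory_usage(index=False).sum()

# Cast to the dtypes shown in the df.info() output after downcasting.
small = df[int_cols].astype({'first': 'uint32', 'second': 'int32', 'fifth': 'int8'})
after = small.memory_usage(index=False).sum()

print(f"before: {before} bytes, after: {after} bytes")
```

On a two-row frame the difference is tiny, but the same comparison scales to real data.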
Answer
Just apply to_numeric() twice: once to downcast each column to the smallest signed type, then a second time to downcast to an unsigned type where that is possible.
import numpy as np

df2 = df.select_dtypes(include=[np.number]).apply(pd.to_numeric, downcast='signed')
df2 = df2.select_dtypes(include=[np.number]).apply(pd.to_numeric, downcast='unsigned')
df[df2.columns] = df2
Same output as your method:
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   first   2 non-null      uint32
 1   second  2 non-null      int32
 2   third   2 non-null      object
 3   fourth  2 non-null      float64
 4   fifth   2 non-null      int8
dtypes: float64(1), int32(1), int8(1), object(1), uint32(1)
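On the concern about modifying the dataframe in place: a possible variant, sketched below, applies the same two-pass idea to a copy and returns it, leaving the original untouched. It is restricted to integer columns to match the scope of the original `downcast_int`. Selecting on `np.number` also passes float columns through `to_numeric`, and as far as I know a float column holding only whole numbers may then be converted to an integer type. The function name `downcast_int_copy` is made up for this example.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype


def downcast_int_copy(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with integer columns downcast; df itself is not modified."""
    out = df.copy()
    int_cols = [c for c in out.columns if is_integer_dtype(out[c])]
    for col in int_cols:
        # First pass: smallest signed type that can hold the values.
        signed = pd.to_numeric(out[col], downcast='signed')
        # Second pass: switch to an unsigned type when no value is negative;
        # columns with negative values keep their signed type.
        out[col] = pd.to_numeric(signed, downcast='unsigned')
    return out


df = pd.DataFrame({
    'first': [1_000, 200_000],
    'second': [-30, -40_000],
    'third': ["some", "string"],
    'fourth': [4.5, 6.1],
    'fifth': [-6, -8],
})

small = downcast_int_copy(df)
print(small.dtypes)
```

Returning a new frame makes the call explicit (`df = downcast_int_copy(df)`) at the cost of a temporary copy. If memory is tight, the in-place version may still be preferable.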