tl;dr: Need help cleaning up my downcast_int(df) function below.
Hello, I’m trying to write my own downcasting functions to reduce memory usage. My code works but is frankly quite messy, so I am curious about alternatives that would make it more readable and, perhaps, faster.
The downcasting function directly modifies my dataframe, something I am not sure I should be doing.
Any help is appreciated.
Example df
import pandas as pd

df = pd.DataFrame({
    'first': [1_000, 200_000],
    'second': [-30, -40_000],
    'third': ["some", "string"],
    'fourth': [4.5, 6.1],
    'fifth': [-6, -8]
})
    first  second   third  fourth  fifth
0    1000     -30    some     4.5     -6
1  200000  -40000  string     6.1     -8
df.info()
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   first   2 non-null      int64
 1   second  2 non-null      int64
 2   third   2 non-null      object
 3   fourth  2 non-null      float64
 4   fifth   2 non-null      int64
dtypes: float64(1), int64(3), object(1)
Downcasting function
def downcast_int(df):
    """Select all int64 columns. Convert them to smaller signed or unsigned types, in place."""
    cols = df.select_dtypes(include=['int64']).columns
    cols_signed = None
    # There is at least one negative number in a column.
    if (df[cols] < 0).any().any():
        has_negative = (df[cols] < 0).any()
        # Columns with negative values need a signed type.
        cols_signed = has_negative[has_negative].index
        df[cols_signed] = df[cols_signed].apply(pd.to_numeric, downcast='signed')
    # If there were any changed columns, remove them.
    if cols_signed is not None:
        cols = cols.drop(cols_signed)
    # Turn the remaining columns into unsigned integers.
    df[cols] = df[cols].apply(pd.to_numeric, downcast='unsigned')
df.info() after downcasting
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   first   2 non-null      uint32
 1   second  2 non-null      int32
 2   third   2 non-null      object
 3   fourth  2 non-null      float64
 4   fifth   2 non-null      int8
dtypes: float64(1), int32(1), int8(1), object(1), uint32(1)
Answer
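Just apply to_numeric() twice: once with downcast='signed' to get each column down to its smallest signed type, then a second time with downcast='unsigned' to switch the columns that have no negative values to an unsigned type.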
import numpy as np

df2 = df.select_dtypes(include=[np.number]).apply(pd.to_numeric, downcast='signed')
df2 = df2.select_dtypes(include=[np.number]).apply(pd.to_numeric, downcast='unsigned')
df[df2.columns] = df2
Same output as your method:
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   first   2 non-null      uint32
 1   second  2 non-null      int32
 2   third   2 non-null      object
 3   fourth  2 non-null      float64
 4   fifth   2 non-null      int8
dtypes: float64(1), int32(1), int8(1), object(1), uint32(1)
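On the question of modifying the dataframe directly: if you would rather not touch the caller's frame, one option is to work on a copy and return it. Below is a minimal sketch of that idea using the same two-pass to_numeric() approach. The name downcast_int_copy is hypothetical (it is not a pandas function), and I am assuming you only want integer columns touched. That is why it selects np.integer rather than np.number: with np.number, a float column that happens to hold only whole numbers could be converted to an integer type as well.

```python
import numpy as np
import pandas as pd


def downcast_int_copy(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with integer columns downcast.

    Hypothetical helper for illustration; the input frame is not modified.
    """
    out = df.copy()
    # Assumption: only integer columns should be downcast.
    int_cols = out.select_dtypes(include=[np.integer]).columns
    for col in int_cols:
        # Pass 1: smallest signed type that can hold the values.
        smaller = pd.to_numeric(out[col], downcast="signed")
        # Pass 2: unsigned type; pandas skips this if any value is negative.
        out[col] = pd.to_numeric(smaller, downcast="unsigned")
    return out


df = pd.DataFrame({
    'first': [1_000, 200_000],
    'second': [-30, -40_000],
    'third': ["some", "string"],
    'fourth': [4.5, 6.1],
    'fifth': [-6, -8]
})

small = downcast_int_copy(df)
print(small.dtypes)
# Compare total memory before and after downcasting.
print(df.memory_usage(deep=True).sum(), "->", small.memory_usage(deep=True).sum())
```

On the example df this should give the same integer dtypes as the df.info() output above, while df itself keeps its original int64 columns. Whether a copy is worth it depends on your data: for a very large frame, copying temporarily costs extra memory, so modifying in place may still be the better trade-off.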