Is there a more efficient way to find and downgrade int64 columns with to_numeric() in Python Pandas?

tl;dr: Need help cleaning up my downcast_int(df) function below.

Hello, I'm trying to write my own downcasting functions to reduce memory usage. I'm curious about alternatives to my (frankly quite messy, but functioning) code that would make it more readable and, perhaps, faster.

The downcasting function directly modifies my dataframe, something I am not sure I should be doing.

Any help is appreciated.

Example df

import pandas as pd

df = pd.DataFrame({
    'first': [1_000, 200_000],
    'second': [-30, -40_000],
    'third': ["some", "string"],
    'fourth': [4.5, 6.1],
    'fifth': [-6, -8]
})

    first   second  third   fourth  fifth
0   1000    -30     some    4.5     -6
1   200000  -40000  string  6.1     -8

df.info()

 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   first   2 non-null      int64  
 1   second  2 non-null      int64  
 2   third   2 non-null      object 
 3   fourth  2 non-null      float64
 4   fifth   2 non-null      int64  
dtypes: float64(1), int64(3), object(1)

Downcasting function

def downcast_int(df):
  """Select all int64 columns and downcast them, in place, to signed or unsigned types."""
  cols = df.select_dtypes(include=['int64']).columns
  cols_signed = None

  # At least one column contains a negative number.
  if (df[cols] < 0).any().any():
    has_negative = (df[cols] < 0).any()
    # Columns with negative values need a signed type.
    cols_signed = has_negative[has_negative].index
    df[cols_signed] = df[cols_signed].apply(pd.to_numeric, downcast='signed')

  # Drop the columns that were already converted to signed types.
  if cols_signed is not None:
    cols = cols.drop(cols_signed)

  # Turn the remaining columns into unsigned integers.
  df[cols] = df[cols].apply(pd.to_numeric, downcast='unsigned')

df.info() after downcasting

 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   first   2 non-null      uint32 
 1   second  2 non-null      int32  
 2   third   2 non-null      object 
 3   fourth  2 non-null      float64
 4   fifth   2 non-null      int8   
dtypes: float64(1), int32(1), int8(1), object(1), uint32(1)


Answer

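Just apply to_numeric() twice: once with downcast='signed' to get the smallest signed type for each column, then a second time with downcast='unsigned' to switch the columns without negative values to unsigned types. The second pass leaves columns that contain negative values unchanged.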

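import numpy as np
# First pass: smallest signed type for each numeric column.
df2 = df.select_dtypes(include=[np.number]).apply(pd.to_numeric, downcast='signed')
# Second pass: unsigned type for the columns that have no negative values.
df2 = df2.select_dtypes(include=[np.number]).apply(pd.to_numeric, downcast='unsigned')
df[df2.columns] = df2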

Same output as your method:

 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   first   2 non-null      uint32 
 1   second  2 non-null      int32  
 2   third   2 non-null      object 
 3   fourth  2 non-null      float64
 4   fifth   2 non-null      int8   
dtypes: float64(1), int32(1), int8(1), object(1), uint32(1)
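
On the point about modifying the dataframe directly: one possible variant, sketched here rather than offered as the definitive way, applies the same two passes to a copy and returns it. The name downcast_ints is hypothetical. The sketch also assumes you only want to touch int64 columns, as in the question. Selecting np.number also picks up float columns, and to_numeric would turn a float column that holds only whole numbers into an integer type.

```python
import pandas as pd


def downcast_ints(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with its int64 columns downcast (hypothetical helper)."""
    out = df.copy()
    int_cols = out.select_dtypes(include=["int64"]).columns
    for col in int_cols:
        # First pass: smallest signed integer type that holds the values.
        out[col] = pd.to_numeric(out[col], downcast="signed")
        # Second pass: unsigned type when no value is negative;
        # columns with negative values are left as they are.
        out[col] = pd.to_numeric(out[col], downcast="unsigned")
    return out


df = pd.DataFrame({
    "first": [1_000, 200_000],
    "second": [-30, -40_000],
    "third": ["some", "string"],
    "fourth": [4.5, 6.1],
    "fifth": [-6, -8],
})

small = downcast_ints(df)
print(small.dtypes)  # downcast copy
print(df.dtypes)     # original frame, left untouched
```

With the example df this should give the same dtypes as the listing above, while df itself keeps its int64 columns. Whether the extra copy is worth it depends on how large the frame is.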
User contributions licensed under: CC BY-SA