tl;dr: Need help cleaning up my downcast_int(df) function below.
Hello, I'm trying to write my own downcasting functions to reduce memory usage. My code works but is frankly quite messy, so I am curious about alternatives that would make it more readable and, perhaps, faster.
The downcasting function directly modifies my dataframe, something I am not sure I should be doing.
Any help is appreciated.
Example df
```python
import pandas as pd

df = pd.DataFrame({
    'first': [1_000, 200_000],
    'second': [-30, -40_000],
    'third': ["some", "string"],
    'fourth': [4.5, 6.1],
    'fifth': [-6, -8]
})
```
```
    first  second   third  fourth  fifth
0    1000     -30    some     4.5     -6
1  200000  -40000  string     6.1     -8
```
df.info()
```
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   first   2 non-null      int64
 1   second  2 non-null      int64
 2   third   2 non-null      object
 3   fourth  2 non-null      float64
 4   fifth   2 non-null      int64
dtypes: float64(1), int64(3), object(1)
```
Downcasting function
```python
def downcast_int(df):
    """Select all int columns. Convert them to unsigned or signed types."""
    cols = df.select_dtypes(include=['int64']).columns
    cols_unsigned = None

    # There is at least one negative number in a column.
    if (df[cols] < 0).any().any():
        df_unsigned = (df[cols] < 0).any()
        cols_unsigned = df_unsigned[df_unsigned == True].index
        df[cols_unsigned] = df[cols_unsigned].apply(pd.to_numeric, downcast='signed')

    # If there were any changed columns, remove them.
    if cols_unsigned is not None:
        cols = cols.drop(cols_unsigned)

    # Turn the remaining columns into unsigned integers.
    df[cols] = df[cols].apply(pd.to_numeric, downcast='unsigned')
```
df.info() after downcasting
```
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   first   2 non-null      uint32
 1   second  2 non-null      int32
 2   third   2 non-null      object
 3   fourth  2 non-null      float64
 4   fifth   2 non-null      int8
dtypes: float64(1), int32(1), int8(1), object(1), uint32(1)
```
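To check how much memory a downcast actually saves, the per-column footprint can be compared with `DataFrame.memory_usage`. The snippet below is only a rough sketch on the integer columns of the example df; the names `before`, `after` and `small` are made up for illustration, and it uses a plain signed downcast rather than the function above.

```python
import pandas as pd

df = pd.DataFrame({
    "first": [1_000, 200_000],
    "second": [-30, -40_000],
    "fifth": [-6, -8],
})

# Bytes used per column, excluding the index.
before = df.memory_usage(index=False)

# Downcast each column to the smallest signed integer type that fits.
small = df.apply(pd.to_numeric, downcast="signed")
after = small.memory_usage(index=False)

# Side-by-side comparison of bytes used per column.
print(pd.DataFrame({"before": before, "after": after}))
```

On a two-row frame the savings are tiny, but the same comparison scales to real data.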
Answer
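Just apply pd.to_numeric() twice: once to downcast every column to the smallest signed type that fits, then a second time to convert the columns without negative values to the smallest unsigned type. Columns that contain negative values are left unchanged by the second pass.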
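```python
import numpy as np
# Pass 1: smallest signed type that fits each numeric column.
df2 = df.select_dtypes(include=[np.number]).apply(pd.to_numeric, downcast='signed')
# Pass 2: smallest unsigned type for columns without negative values.
df2 = df2.select_dtypes(include=[np.number]).apply(pd.to_numeric, downcast='unsigned')
df[df2.columns] = df2
```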
Same output as your method:
```
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   first   2 non-null      uint32
 1   second  2 non-null      int32
 2   third   2 non-null      object
 3   fourth  2 non-null      float64
 4   fifth   2 non-null      int8
dtypes: float64(1), int32(1), int8(1), object(1), uint32(1)
```
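On the question of modifying the dataframe in place: one possible variation is to wrap the two passes in a function that works on a copy and returns it, so the caller decides whether to overwrite the original. The sketch below is not part of the answer above; the name `downcast_ints` is hypothetical, and it deliberately selects only integer columns (`np.integer`) instead of `np.number`, because `pd.to_numeric` with `downcast='signed'` can turn a float column holding only whole numbers into an integer column.

```python
import numpy as np
import pandas as pd


def downcast_ints(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with its integer columns downcast (hypothetical helper)."""
    out = df.copy()
    # Only integer columns are selected, so float and object columns stay as they are.
    int_cols = out.select_dtypes(include=[np.integer]).columns
    for col in int_cols:
        # First pass: smallest signed type that holds the values.
        out[col] = pd.to_numeric(out[col], downcast="signed")
        # Second pass: switch to an unsigned type where possible;
        # columns with negative values are left unchanged.
        out[col] = pd.to_numeric(out[col], downcast="unsigned")
    return out


df = pd.DataFrame({
    "first": [1_000, 200_000],
    "second": [-30, -40_000],
    "third": ["some", "string"],
    "fourth": [4.5, 6.1],
    "fifth": [-6, -8],
})
result = downcast_ints(df)
print(result.dtypes)
```

The original `df` keeps its int64 columns; writing `df = downcast_ints(df)` gives the same effect as the in-place version when that is what you want.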