Skip to content
Advertisement

In Pandas, how do I convert a number column to discrete values quickly?

I have a dataframe with an Integer column, which I need to convert to discrete value bands, I am currently doing this using apply with my own function, as seen bellow, however this is rather slow, is there any way to quickly do this?

def CalDiscrete(value, open):
    # Alot = >1%
    # Normal = 0.2% ~ 1%
    # Alittle = 0.01% ~ 0.2%
    # Flat = 0% ~ 0.01%
    if(value >= open*0.01):
        return 'up_alot'
    elif(value <= open*-0.01):
        return 'down_alot'
    elif(value >= open*0.002):
        return 'up'
    elif(value <= open*-0.002):
        return 'down'
    elif(value >= open*0.0001):
        return 'up_alittle'
    elif(value <= open*-0.0001):
        return 'down_alittle'
    else:
        return 'flat'

Caller function here:

df_input['movement'] = df_input["close"].diff()
df_input['discrete'] = df_input.apply(lambda row: CalDiscrete(row['movement'], row["close"]), axis=1) 

My data look like this:

date,open,high,low,close,adj close,volume
2000-01-03,148.25,148.25,143.875,145.4375,97.82567596435547,8164300
2000-01-04,143.53125,144.0625,139.640625,139.75,94.00010681152344,8089800
2000-01-05,139.9375,141.53125,137.25,140.0,94.16825866699219,12177900
2000-01-06,139.625,141.5,137.75,137.75,92.65486145019531,6227200
2000-01-07,140.3125,145.75,140.0625,145.75,98.03589630126953,8066500
2000-01-10,146.25,146.90625,145.03125,146.25,98.3722152709961,5741700
2000-01-11,145.8125,146.09375,143.5,144.5,97.19509887695312,7503700
2000-01-12,144.59375,144.59375,142.875,143.0625,96.22821044921875,6907700
2000-01-13,144.46875,145.75,143.28125,145.0,97.53144073486328,5158300
2000-01-14,146.53125,147.46875,145.96875,146.96875,98.85567474365234,7437300
2000-01-18,145.34375,146.625,145.1875,145.8125,98.07793426513672,6488500
2000-01-19,145.3125,147.0,145.0,147.0,98.87665557861328,6157900
2000-01-20,146.96875,146.96875,143.8125,144.75,97.36325073242188,5800100
2000-01-21,145.5,145.5,144.0625,144.4375,97.15299224853516,6244800
2000-01-24,145.65625,145.84375,139.40625,140.34375,94.39948272705078,7896900
2000-01-25,140.515625,141.9375,139.0,141.9375,95.47146606445312,9942500
2000-01-26,141.0,141.546875,140.09375,140.8125,94.71478271484375,5158100
2000-01-27,141.84375,142.21875,138.125,140.25,94.33642578125,10922700
2000-01-28,139.4375,140.0625,135.53125,135.875,91.39366912841797,11916200
2000-01-31,135.8125,139.671875,135.0,139.5625,93.87398529052734,10768700
2000-02-01,139.75,141.6875,138.53125,140.9375,94.79885864257812,8419900
2000-02-02,141.28125,142.25,140.375,141.0625,94.88296508789062,6205900
2000-02-03,140.875,143.25,140.0,143.1875,96.31224822998047,7997500
2000-02-04,143.1875,144.0,142.125,142.59375,95.91292572021484,4925400
2000-02-07,142.5625,142.78125,141.4375,142.375,95.76575469970703,5845800
2000-02-08,143.96875,144.5625,143.625,144.3125,97.0689468383789,4936400
2000-02-09,144.46875,144.46875,141.265625,141.28125,95.03006744384766,8511500
2000-02-10,141.625,142.5625,140.875,141.5625,95.21924591064453,6690600
2000-02-11,141.84375,141.9375,138.03125,138.6875,93.28549194335938,9849800
2000-02-14,139.78125,139.78125,138.3125,139.5,93.83197021484375,8528800

Advertisement

Answer

IIUC and given the mix of types of conditions I’d recommend np.select:

conditions = [
    df['movement'] >= df['open'] * 0.01,
    df['movement'] <= df['open'] * -0.01,
    df['movement'] >= df['open'] * 0.002,
    df['movement'] <= df['open'] * -0.002,
    df['movement'] >= df['open'] * 0.0001,
    df['movement'] <= df['open'] * -0.0001,
]

labels = [
    'up_alot',
    'down_alot',
    'up',
    'down',
    'up_alittle',
    'down_alittle'
]

df['discrete'] = np.select(conditions, labels, default='flat')

By comparing against pct_change instead of diff the conditions can be simplified:

s = df["close"].pct_change()
conditions = [
    s >= 0.01,
    s <= -0.01,
    s >= 0.002,
    s <= -0.002,
    s >= 0.0001,
    s <= -0.0001,
]
                  open        high  ...  movement      discrete
date                                ...                        
2000-01-03  148.250000  148.250000  ...       NaN          flat
2000-01-04  143.531250  144.062500  ...  -5.68750     down_alot
2000-01-05  139.937500  141.531250  ...   0.25000    up_alittle
2000-01-06  139.625000  141.500000  ...  -2.25000     down_alot
2000-01-07  140.312500  145.750000  ...   8.00000       up_alot
2000-01-10  146.250000  146.906250  ...   0.50000            up
2000-01-11  145.812500  146.093750  ...  -1.75000     down_alot
2000-01-12  144.593750  144.593750  ...  -1.43750          down
2000-01-13  144.468750  145.750000  ...   1.93750       up_alot
2000-01-14  146.531250  147.468750  ...   1.96875       up_alot
2000-01-18  145.343750  146.625000  ...  -1.15625          down
2000-01-19  145.312500  147.000000  ...   1.18750            up
2000-01-20  146.968750  146.968750  ...  -2.25000     down_alot
2000-01-21  145.500000  145.500000  ...  -0.31250          down
2000-01-24  145.656250  145.843750  ...  -4.09375     down_alot
User contributions licensed under: CC BY-SA
9 People found this is helpful
Advertisement