I have a dataframe with an Integer column, which I need to convert to discrete value bands, I am currently doing this using apply with my own function, as seen bellow, however this is rather slow, is there any way to quickly do this?
def CalDiscrete(value, open): # Alot = >1% # Normal = 0.2% ~ 1% # Alittle = 0.01% ~ 0.2% # Flat = 0% ~ 0.01% if(value >= open*0.01): return 'up_alot' elif(value <= open*-0.01): return 'down_alot' elif(value >= open*0.002): return 'up' elif(value <= open*-0.002): return 'down' elif(value >= open*0.0001): return 'up_alittle' elif(value <= open*-0.0001): return 'down_alittle' else: return 'flat'
Caller function here:
df_input['movement'] = df_input["close"].diff() df_input['discrete'] = df_input.apply(lambda row: CalDiscrete(row['movement'], row["close"]), axis=1)
My data look like this:
date,open,high,low,close,adj close,volume 2000-01-03,148.25,148.25,143.875,145.4375,97.82567596435547,8164300 2000-01-04,143.53125,144.0625,139.640625,139.75,94.00010681152344,8089800 2000-01-05,139.9375,141.53125,137.25,140.0,94.16825866699219,12177900 2000-01-06,139.625,141.5,137.75,137.75,92.65486145019531,6227200 2000-01-07,140.3125,145.75,140.0625,145.75,98.03589630126953,8066500 2000-01-10,146.25,146.90625,145.03125,146.25,98.3722152709961,5741700 2000-01-11,145.8125,146.09375,143.5,144.5,97.19509887695312,7503700 2000-01-12,144.59375,144.59375,142.875,143.0625,96.22821044921875,6907700 2000-01-13,144.46875,145.75,143.28125,145.0,97.53144073486328,5158300 2000-01-14,146.53125,147.46875,145.96875,146.96875,98.85567474365234,7437300 2000-01-18,145.34375,146.625,145.1875,145.8125,98.07793426513672,6488500 2000-01-19,145.3125,147.0,145.0,147.0,98.87665557861328,6157900 2000-01-20,146.96875,146.96875,143.8125,144.75,97.36325073242188,5800100 2000-01-21,145.5,145.5,144.0625,144.4375,97.15299224853516,6244800 2000-01-24,145.65625,145.84375,139.40625,140.34375,94.39948272705078,7896900 2000-01-25,140.515625,141.9375,139.0,141.9375,95.47146606445312,9942500 2000-01-26,141.0,141.546875,140.09375,140.8125,94.71478271484375,5158100 2000-01-27,141.84375,142.21875,138.125,140.25,94.33642578125,10922700 2000-01-28,139.4375,140.0625,135.53125,135.875,91.39366912841797,11916200 2000-01-31,135.8125,139.671875,135.0,139.5625,93.87398529052734,10768700 2000-02-01,139.75,141.6875,138.53125,140.9375,94.79885864257812,8419900 2000-02-02,141.28125,142.25,140.375,141.0625,94.88296508789062,6205900 2000-02-03,140.875,143.25,140.0,143.1875,96.31224822998047,7997500 2000-02-04,143.1875,144.0,142.125,142.59375,95.91292572021484,4925400 2000-02-07,142.5625,142.78125,141.4375,142.375,95.76575469970703,5845800 2000-02-08,143.96875,144.5625,143.625,144.3125,97.0689468383789,4936400 2000-02-09,144.46875,144.46875,141.265625,141.28125,95.03006744384766,8511500 2000-02-10,141.625,142.5625,140.875,141.5625,95.21924591064453,6690600 2000-02-11,141.84375,141.9375,138.03125,138.6875,93.28549194335938,9849800 2000-02-14,139.78125,139.78125,138.3125,139.5,93.83197021484375,8528800
Advertisement
Answer
IIUC and given the mix of types of conditions I’d recommend np.select
:
conditions = [ df['movement'] >= df['open'] * 0.01, df['movement'] <= df['open'] * -0.01, df['movement'] >= df['open'] * 0.002, df['movement'] <= df['open'] * -0.002, df['movement'] >= df['open'] * 0.0001, df['movement'] <= df['open'] * -0.0001, ] labels = [ 'up_alot', 'down_alot', 'up', 'down', 'up_alittle', 'down_alittle' ] df['discrete'] = np.select(conditions, labels, default='flat')
By comparing against pct_change
instead of diff
the conditions can be simplified:
s = df["close"].pct_change() conditions = [ s >= 0.01, s <= -0.01, s >= 0.002, s <= -0.002, s >= 0.0001, s <= -0.0001, ]
open high ... movement discrete date ... 2000-01-03 148.250000 148.250000 ... NaN flat 2000-01-04 143.531250 144.062500 ... -5.68750 down_alot 2000-01-05 139.937500 141.531250 ... 0.25000 up_alittle 2000-01-06 139.625000 141.500000 ... -2.25000 down_alot 2000-01-07 140.312500 145.750000 ... 8.00000 up_alot 2000-01-10 146.250000 146.906250 ... 0.50000 up 2000-01-11 145.812500 146.093750 ... -1.75000 down_alot 2000-01-12 144.593750 144.593750 ... -1.43750 down 2000-01-13 144.468750 145.750000 ... 1.93750 up_alot 2000-01-14 146.531250 147.468750 ... 1.96875 up_alot 2000-01-18 145.343750 146.625000 ... -1.15625 down 2000-01-19 145.312500 147.000000 ... 1.18750 up 2000-01-20 146.968750 146.968750 ... -2.25000 down_alot 2000-01-21 145.500000 145.500000 ... -0.31250 down 2000-01-24 145.656250 145.843750 ... -4.09375 down_alot