I have a dataframe with an Integer column, which I need to convert to discrete value bands, I am currently doing this using apply with my own function, as seen bellow, however this is rather slow, is there any way to quickly do this?
JavaScript
x
20
20
1
def CalDiscrete(value, open):
2
# Alot = >1%
3
# Normal = 0.2% ~ 1%
4
# Alittle = 0.01% ~ 0.2%
5
# Flat = 0% ~ 0.01%
6
if(value >= open*0.01):
7
return 'up_alot'
8
elif(value <= open*-0.01):
9
return 'down_alot'
10
elif(value >= open*0.002):
11
return 'up'
12
elif(value <= open*-0.002):
13
return 'down'
14
elif(value >= open*0.0001):
15
return 'up_alittle'
16
elif(value <= open*-0.0001):
17
return 'down_alittle'
18
else:
19
return 'flat'
20
Caller function here:
JavaScript
1
3
1
df_input['movement'] = df_input["close"].diff()
2
df_input['discrete'] = df_input.apply(lambda row: CalDiscrete(row['movement'], row["close"]), axis=1)
3
My data look like this:
JavaScript
1
32
32
1
date,open,high,low,close,adj close,volume
2
2000-01-03,148.25,148.25,143.875,145.4375,97.82567596435547,8164300
3
2000-01-04,143.53125,144.0625,139.640625,139.75,94.00010681152344,8089800
4
2000-01-05,139.9375,141.53125,137.25,140.0,94.16825866699219,12177900
5
2000-01-06,139.625,141.5,137.75,137.75,92.65486145019531,6227200
6
2000-01-07,140.3125,145.75,140.0625,145.75,98.03589630126953,8066500
7
2000-01-10,146.25,146.90625,145.03125,146.25,98.3722152709961,5741700
8
2000-01-11,145.8125,146.09375,143.5,144.5,97.19509887695312,7503700
9
2000-01-12,144.59375,144.59375,142.875,143.0625,96.22821044921875,6907700
10
2000-01-13,144.46875,145.75,143.28125,145.0,97.53144073486328,5158300
11
2000-01-14,146.53125,147.46875,145.96875,146.96875,98.85567474365234,7437300
12
2000-01-18,145.34375,146.625,145.1875,145.8125,98.07793426513672,6488500
13
2000-01-19,145.3125,147.0,145.0,147.0,98.87665557861328,6157900
14
2000-01-20,146.96875,146.96875,143.8125,144.75,97.36325073242188,5800100
15
2000-01-21,145.5,145.5,144.0625,144.4375,97.15299224853516,6244800
16
2000-01-24,145.65625,145.84375,139.40625,140.34375,94.39948272705078,7896900
17
2000-01-25,140.515625,141.9375,139.0,141.9375,95.47146606445312,9942500
18
2000-01-26,141.0,141.546875,140.09375,140.8125,94.71478271484375,5158100
19
2000-01-27,141.84375,142.21875,138.125,140.25,94.33642578125,10922700
20
2000-01-28,139.4375,140.0625,135.53125,135.875,91.39366912841797,11916200
21
2000-01-31,135.8125,139.671875,135.0,139.5625,93.87398529052734,10768700
22
2000-02-01,139.75,141.6875,138.53125,140.9375,94.79885864257812,8419900
23
2000-02-02,141.28125,142.25,140.375,141.0625,94.88296508789062,6205900
24
2000-02-03,140.875,143.25,140.0,143.1875,96.31224822998047,7997500
25
2000-02-04,143.1875,144.0,142.125,142.59375,95.91292572021484,4925400
26
2000-02-07,142.5625,142.78125,141.4375,142.375,95.76575469970703,5845800
27
2000-02-08,143.96875,144.5625,143.625,144.3125,97.0689468383789,4936400
28
2000-02-09,144.46875,144.46875,141.265625,141.28125,95.03006744384766,8511500
29
2000-02-10,141.625,142.5625,140.875,141.5625,95.21924591064453,6690600
30
2000-02-11,141.84375,141.9375,138.03125,138.6875,93.28549194335938,9849800
31
2000-02-14,139.78125,139.78125,138.3125,139.5,93.83197021484375,8528800
32
Advertisement
Answer
IIUC and given the mix of types of conditions I’d recommend np.select
:
JavaScript
1
20
20
1
conditions = [
2
df['movement'] >= df['open'] * 0.01,
3
df['movement'] <= df['open'] * -0.01,
4
df['movement'] >= df['open'] * 0.002,
5
df['movement'] <= df['open'] * -0.002,
6
df['movement'] >= df['open'] * 0.0001,
7
df['movement'] <= df['open'] * -0.0001,
8
]
9
10
labels = [
11
'up_alot',
12
'down_alot',
13
'up',
14
'down',
15
'up_alittle',
16
'down_alittle'
17
]
18
19
df['discrete'] = np.select(conditions, labels, default='flat')
20
By comparing against pct_change
instead of diff
the conditions can be simplified:
JavaScript
1
10
10
1
s = df["close"].pct_change()
2
conditions = [
3
s >= 0.01,
4
s <= -0.01,
5
s >= 0.002,
6
s <= -0.002,
7
s >= 0.0001,
8
s <= -0.0001,
9
]
10
JavaScript
1
18
18
1
open high movement discrete
2
date
3
2000-01-03 148.250000 148.250000 NaN flat
4
2000-01-04 143.531250 144.062500 -5.68750 down_alot
5
2000-01-05 139.937500 141.531250 0.25000 up_alittle
6
2000-01-06 139.625000 141.500000 -2.25000 down_alot
7
2000-01-07 140.312500 145.750000 8.00000 up_alot
8
2000-01-10 146.250000 146.906250 0.50000 up
9
2000-01-11 145.812500 146.093750 -1.75000 down_alot
10
2000-01-12 144.593750 144.593750 -1.43750 down
11
2000-01-13 144.468750 145.750000 1.93750 up_alot
12
2000-01-14 146.531250 147.468750 1.96875 up_alot
13
2000-01-18 145.343750 146.625000 -1.15625 down
14
2000-01-19 145.312500 147.000000 1.18750 up
15
2000-01-20 146.968750 146.968750 -2.25000 down_alot
16
2000-01-21 145.500000 145.500000 -0.31250 down
17
2000-01-24 145.656250 145.843750 -4.09375 down_alot
18