A few operations with df.groupby()

I am working with a forex dataset and trying to fill my dataframe with open, high, low and close values that are updated on every tick.

Here is my code:

import pandas as pd

# pandas settings
pd.set_option('display.max_columns', 320)
pd.set_option('display.max_rows', 320)
pd.set_option('display.width', 320)

# creating dataframe
df = pd.read_csv('https://www.dropbox.com/s/tcek3kmleklgxm5/eur_usd_lastweek.csv?dl=1', names=['timestamp', 'ask', 'bid', 'avol', 'bvol'], parse_dates=[0], header=0)
df['spread'] = df.ask - df.bid
df['symbol'] = 'EURUSD'
times = pd.DatetimeIndex(df.timestamp)

# parameters for df.groupby()
df['date'] = times.date
df['hour'] = times.hour

# 1h candles updated every tick
df['candle_number'] = '...'
df['1h_open'] = '...'
df['1h_high'] = '...'
df['1h_low'] = '...'
df['1h_close'] = '...'

# print(df)

grouped = df.groupby(['date', 'hour'])

for idx, x in enumerate(grouped):
    print(idx)
    print(x)

As you can see, the for loop gives me the groups.

Now I want to fill the following columns in my dataframe:

df['candle_number'] should be idx, the number of the group.
df['1h_open'] should equal the very first df.bid in the group.
df['1h_high'] should be the highest df.bid up to the current row. For instance, if there are 350 rows in the group, the value at row 20 is the maximum over rows 0-20, and the value at row 215 is the maximum over rows 0-215, which can be completely different.
df['1h_low'] should be the lowest df.bid up to the current row (same approach as above).

I hope it’s not too confusing =) Cheers

Answer

It's convenient to set date and hour as the index:

df_new = df.set_index(['date', 'hour'])

Then group on the two index levels and compute each column per group:

df_new['candle_number'] = df_new.groupby(level=[0,1]).ngroup()
df_new['1h_open'] = df_new.groupby(level=[0,1])['bid'].first()
df_new['1h_high'] = df_new.groupby(level=[0,1])['bid'].cummax()
df_new['1h_low']  = df_new.groupby(level=[0,1])['bid'].cummin()
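
To check this without downloading the CSV, here is a minimal sketch on a few made-up ticks. The `ticks` and `demo` names and the bid values are assumptions for illustration only. It also swaps `first()` for `transform("first")`, which returns one value per row instead of one per group, so the assignment does not depend on index alignment.

```python
import pandas as pd

# Synthetic ticks standing in for the EUR/USD file (values are made up).
ticks = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2020-01-06 09:00:01", "2020-01-06 09:15:00", "2020-01-06 09:59:59",
        "2020-01-06 10:00:00", "2020-01-06 10:30:00",
    ]),
    "bid": [1.1150, 1.1170, 1.1140, 1.1160, 1.1155],
})
ticks["date"] = ticks["timestamp"].dt.date
ticks["hour"] = ticks["timestamp"].dt.hour

demo = ticks.set_index(["date", "hour"])

# Number each (date, hour) group: 0 for the first hour, 1 for the next, ...
candle_number = demo.groupby(level=[0, 1]).ngroup()

# Per-hour view of the bid column, reused for the three price columns.
bid_by_hour = demo.groupby(level=[0, 1])["bid"]

demo["candle_number"] = candle_number
demo["1h_open"] = bid_by_hour.transform("first")  # first bid of the hour, on every row
demo["1h_high"] = bid_by_hour.cummax()            # running maximum within the hour
demo["1h_low"] = bid_by_hour.cummin()             # running minimum within the hour

print(demo.reset_index())
```

The running high and low restart at the first tick of each hour, which is the "up to the current row" behaviour asked for in the question.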

Afterwards, you can call reset_index() to get back to a flat dataframe.
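
If you would rather skip the set_index() / reset_index() round trip, grouping on the columns directly should give the same result. The sketch below is an assumption-laden variant, not part of the original answer: the `flat` frame and its bid values are made up, and `1h_close` is filled with the current bid on the assumption that the close of a candle still in progress is simply the latest tick (the question does not define it).

```python
import pandas as pd

# Made-up ticks; in the question these come from the EUR/USD CSV.
flat = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2020-01-06 23:10:00", "2020-01-06 23:50:00", "2020-01-07 00:05:00",
    ]),
    "bid": [1.1200, 1.1190, 1.1210],
})
flat["date"] = flat["timestamp"].dt.date
flat["hour"] = flat["timestamp"].dt.hour

keys = ["date", "hour"]

# Group number per (date, hour) pair, in sorted key order.
flat["candle_number"] = flat.groupby(keys).ngroup()

bid_by_hour = flat.groupby(keys)["bid"]
flat["1h_open"] = bid_by_hour.transform("first")
flat["1h_high"] = bid_by_hour.cummax()
flat["1h_low"] = bid_by_hour.cummin()

# Assumption: the close of the still-forming candle is the current bid.
flat["1h_close"] = flat["bid"]

print(flat)
```

The frame keeps its default integer index throughout, so no reset is needed at the end.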
