Sorry for the noob question. I have a bunch of stocks in a sqlite3 database:
import pandas as pd
import sqlite3, config

connection = sqlite3.connect(config.db_file)
connection.row_factory = sqlite3.Row

df = pd.read_sql('SELECT * FROM stock_price', connection)

# sort the dataframe
df.sort_values(by='stock_id', inplace=True)

# set the index to be this and don't drop
df.set_index(keys=['stock_id'], drop=False, inplace=True)
When I print the df, it gives me the following (where each stock_id refers to a unique stock, e.g. AAPL):
           id  stock_id        date    open     high     low    close   volume
stock_id
1           1         1  2022-08-02    9.83    9.845    9.83    9.830   584772
1           2         1  2022-08-03    9.84    9.860    9.84    9.820     7711
4           3         4  2022-08-03   10.38   10.380   10.38   10.380      199
5          46         5  2022-08-03   34.75   35.200   34.75   35.200     1007
5          45         5  2022-08-02   34.32   34.550   34.32   34.442     1252
...       ...       ...         ...     ...      ...     ...      ...      ...
98          8        98  2022-08-02   28.00   28.095   27.90   28.000     2417
99         71        99  2022-08-02   88.19   88.940   87.15   88.370  1045596
99         72        99  2022-08-03   88.34   88.550   87.65   88.410   982710
100       171       100  2022-08-02  117.58  120.010  117.08  119.270    67795
100       172       100  2022-08-03  119.80  121.940  120.60  121.440     4237

[178 rows x 8 columns]
I need to target each unique stock_id individually and get the prior close.
I know if each stock was in its own separate dataframe, I could do something like this:
final_df['previous close'] = final_df['c'].shift()
But when I’ve tried that, because everything is in one dataframe, one stock ends up getting the previous close of an entirely different stock, which isn’t what I want.
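To illustrate with a toy example (made-up values, not my real data), a plain shift() on the combined dataframe bleeds across stocks:

import pandas as pd

# two stocks stacked in one dataframe (made-up values)
toy = pd.DataFrame({
    'stock_id': [1, 1, 2, 2],
    'close':    [9.83, 9.82, 10.38, 10.40],
})

# shift() ignores stock_id, so stock 2's first row picks up
# stock 1's last close (9.82) as its "previous close"
toy['previous close'] = toy['close'].shift()
print(toy)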
So my question:
What’s the best way to split all these different stocks out of one single dataframe so that I can target them individually and get the previous close price of each stock?
Answer
If I understand the question correctly, you just want the close of the previous row within each stock group. You can do that easily:
# level=0 refers to the stock_id index set above
newdf = df.assign(prev_close=df.groupby(level=0)['close'].shift())
or, before setting the index to stock_id:
newdf = df.assign(prev_close=df.groupby('stock_id')['close'].shift())
On your data sample:
>>> newdf
           id  stock_id        date    open     high     low    close   volume  prev_close
stock_id
1           1         1  2022-08-02    9.83    9.845    9.83    9.830   584772         NaN
1           2         1  2022-08-03    9.84    9.860    9.84    9.820     7711        9.83
4           3         4  2022-08-03   10.38   10.380   10.38   10.380      199         NaN
5          46         5  2022-08-03   34.75   35.200   34.75   35.200     1007         NaN
5          45         5  2022-08-02   34.32   34.550   34.32   34.442     1252       35.20
98          8        98  2022-08-02   28.00   28.095   27.90   28.000     2417         NaN
99         71        99  2022-08-02   88.19   88.940   87.15   88.370  1045596         NaN
99         72        99  2022-08-03   88.34   88.550   87.65   88.410   982710       88.37
100       171       100  2022-08-02  117.58  120.010  117.08  119.270    67795         NaN
100       172       100  2022-08-03  119.80  121.940  120.60  121.440     4237      119.27
Side note: try not to use inplace=True all over the place. It will cause you trouble in the long run (see e.g. here).
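For example, the loading code could be written with method chaining instead (a sketch, reusing your config.db_file and stock_price table):

import pandas as pd
import sqlite3, config

connection = sqlite3.connect(config.db_file)

# chain the steps instead of mutating df in place
df = (
    pd.read_sql('SELECT * FROM stock_price', connection)
      .sort_values(by='stock_id')
      .set_index('stock_id', drop=False)
)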
Personally, I would prefer this format, which also sorts each stock by date so that prev_close is always the chronologically previous close:
cleandf = df.assign(
    date=pd.to_datetime(df['date'])
).set_index(['stock_id', 'date']).sort_index()

newdf = cleandf.assign(prev_close=cleandf.groupby('stock_id')['close'].shift())

>>> newdf
                      id    open     high     low    close   volume  prev_close
stock_id date
1        2022-08-02    1    9.83    9.845    9.83    9.830   584772         NaN
         2022-08-03    2    9.84    9.860    9.84    9.820     7711       9.830
4        2022-08-03    3   10.38   10.380   10.38   10.380      199         NaN
5        2022-08-02   45   34.32   34.550   34.32   34.442     1252         NaN
         2022-08-03   46   34.75   35.200   34.75   35.200     1007      34.442
98       2022-08-02    8   28.00   28.095   27.90   28.000     2417         NaN
99       2022-08-02   71   88.19   88.940   87.15   88.370  1045596         NaN
         2022-08-03   72   88.34   88.550   87.65   88.410   982710      88.370
100      2022-08-02  171  117.58  120.010  117.08  119.270    67795         NaN
         2022-08-03  172  119.80  121.940  120.60  121.440     4237     119.270
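With the ['stock_id', 'date'] index in place you can also target any single stock directly; a sketch, using stock_id 5 from the sample:

# all rows for one stock, keeping the date index
one_stock = newdf.xs(5, level='stock_id')

# e.g. the most recent previous close for that stock
last_prev_close = one_stock['prev_close'].iloc[-1]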