
Pandasql Exception with OVER

I tried to use these lines of code:

import pandasql as ps
df2["transaction per 30 minutes"] = ps.sqldf("SELECT COUNT(*) FROM df OVER (PARTITION BY 'Card Number' ORDER BY 'Date' RANGE BETWEEN '30 minutes' PRECEDING AND '0 minutes' FOLLOWING)", locals())
df2["avg_transaction_30"] = ps.sqldf("SELECT AVG(Amount) FROM df OVER (PARTITION BY 'Card Number' ORDER BY Date RANGE BETWEEN '30 minutes' PRECEDING AND  '0 minutes' FOLLOWING)")

NB: deleting 'locals()' or replacing it with 'globals()' doesn't solve the problem.

But I get this error:

PandaSQLException: (sqlite3.OperationalError) near "(": syntax error
[SQL: SELECT COUNT(*) FROM df2 OVER (PARTITION BY 'Card Number' ORDER BY date RANGE BETWEEN '30 minutes' PRECEDING AND '0 minutes' FOLLOWING)]
(Background on this error at: http://sqlalche.me/e/14/e3q8)

(I get the same error with the AVG query.)

I load my dataframe like this:

df2 = pd.read_csv('df.csv')

My file is in the same directory as my Jupyter notebook. I followed the link, but I still don’t understand what’s wrong.

And this line works:

ps.sqldf("SELECT * FROM df2")

So I guess it’s not a problem with pandasql itself.

My initial goal with this code is:

I want to know how many transactions each card number makes in every 30-minute interval, and also to calculate, on average, how many transactions a client makes per 30 minutes.

For example, I want to have something like this (ordered by Card Number):

Card Number       interval                         #transactions_30min  AVG_transactions_30min
536518******2108  2015-05-11 00:00:01 – 00:30:00                    2                       5
536518******2108  2015-05-11 00:30:01 – 01:00:00                    8                       5
536518******2197  2015-05-11 00:00:01 – 00:30:00                    6                       8
536518******2197  2015-05-11 00:30:01 – 01:00:00                   10                       8

My database looks like this:

id  Card Number       Date                 Amount  Fraud
0   536518******2108  2015-05-01 00:01:54   36.54  No
1   536518******2108  2015-05-01 00:03:46   36.54  No
2   453211******1239  2015-05-01 00:08:50   69.00  No
3   548827******1705  2015-05-01 00:27:00  193.43  No
4   531681******9778  2015-05-01 01:32:46  132.00  No
5   515117******4107  2015-05-01 02:10:26  161.00  No
6   432032******9111  2015-05-01 08:09:15  110.00  No
7   544540******7141  2015-05-01 08:30:16  159.50  No
8   554906******0358  2015-05-01 09:13:51  126.50  Yes
9   554906******0358  2015-05-01 09:15:28  126.50  Yes

So, is it possible to use pandasql in my case? If yes, how can I correct my query to make it work?
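As a side note on the error itself: in SQLite (the engine pandasql runs on), OVER must follow an aggregate in the SELECT list, never the FROM clause, and single-quoted names like 'Card Number' are string literals, not column references. A minimal sketch of valid window-function syntax, shown with the stdlib sqlite3 module and toy data (table and column names assumed from the question):

```python
# Sketch: valid SQLite window-function placement. Window functions need
# SQLite >= 3.25 (bundled with Python 3.8+).
import sqlite3

con = sqlite3.connect(":memory:")
con.execute('CREATE TABLE df2 ("Card Number" TEXT, "Date" TEXT, "Amount" REAL)')
con.executemany("INSERT INTO df2 VALUES (?, ?, ?)", [
    ("536518******2108", "2015-05-01 00:01:54", 36.54),
    ("536518******2108", "2015-05-01 00:03:46", 36.54),
    ("453211******1239", "2015-05-01 00:08:50", 69.00),
])
rows = con.execute("""
    SELECT "Card Number",
           COUNT(*)      OVER (PARTITION BY "Card Number") AS n_tx,
           AVG("Amount") OVER (PARTITION BY "Card Number") AS avg_amount
    FROM df2
    ORDER BY "Card Number", "Date"
""").fetchall()
```

Note that a time-based frame like RANGE BETWEEN '30 minutes' PRECEDING is not supported by SQLite (frame bounds must be numeric), which is one reason the pure-pandas approach in the answer is simpler here.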


Answer

If you don’t mind using pandas for all calculations, here is one approach:

# Parse Date so transactions can be binned into 30-minute intervals
df['Date'] = pd.to_datetime(df['Date'])

# Group by card number and 30-minute time bins, then count the
# transactions and average the amounts within each (card, interval) group
g = df.groupby(["Card Number", pd.Grouper(key='Date', freq='30min')], sort=False)
df_out = g['Amount'].agg(['count', 'mean']).add_prefix('transactions30min_').reset_index()

Result:

print(df_out)

        Card Number                Date  transactions30min_count  transactions30min_mean
0  536518******2108 2015-05-01 00:00:00                        2                   36.54
1  453211******1239 2015-05-01 00:00:00                        1                   69.00
2  548827******1705 2015-05-01 00:00:00                        1                  193.43
3  531681******9778 2015-05-01 01:30:00                        1                  132.00
4  515117******4107 2015-05-01 02:00:00                        1                  161.00
5  432032******9111 2015-05-01 08:00:00                        1                  110.00
6  544540******7141 2015-05-01 08:30:00                        1                  159.50
7  554906******0358 2015-05-01 09:00:00                        2                  126.50
User contributions licensed under: CC BY-SA