Given a Spark dataframe with the following columns I am trying to construct an incremental/running count for each id
based on when the contents of the event
column evaluate to True
.
JavaScript
x
26
26
1
spark.createDataFrame([
2
('2018-01-01', 'A', False),
3
('2018-01-02', 'A', False),
4
('2018-01-03', 'A', True),
5
('2018-01-04', 'A', False),
6
('2018-01-05', 'A', True),
7
('2018-01-06', 'A', False),
8
('2018-01-01', 'B', False),
9
('2018-01-02', 'B', True),
10
('2018-01-03', 'B', False),
11
], ['date', 'id', 'event']).show()
12
13
+----------+---+-----+
14
| date| id|event|
15
+----------+---+-----+
16
|2018-01-01| A|false|
17
|2018-01-02| A|false|
18
|2018-01-03| A| true|
19
|2018-01-04| A|false|
20
|2018-01-05| A| true|
21
|2018-01-06| A|false|
22
|2018-01-01| B|false|
23
|2018-01-02| B| true|
24
|2018-01-03| B|false|
25
+----------+---+-----+
26
Here a new column called results
would be created that contained the incremental count.
JavaScript
1
14
14
1
+----------+---+-----+------+
2
| date| id|event|result|
3
+----------+---+-----+------+
4
|2018-01-01| A|false| 0|
5
|2018-01-02| A|false| 0|
6
|2018-01-03| A| true| 1|
7
|2018-01-04| A|false| 1|
8
|2018-01-05| A| true| 2|
9
|2018-01-06| A|false| 2|
10
|2018-01-01| B|false| 0|
11
|2018-01-02| B| true| 1|
12
|2018-01-03| B|false| 1|
13
+----------+---+-----+------+
14
I’ve tried using window functions but am stumped at this point. Ideally, the solution would increment the count row-wise without and group by or aggregation functions.
Thanks in advance.
Advertisement
Answer
You can use sum
function, casting your event
as an int:
JavaScript
1
19
19
1
import pyspark.sql.functions as f
2
from pyspark.sql import Window
3
4
df = df.withColumn('result', f.sum(f.col('event').cast('int')).over((Window.partitionBy('id').orderBy('date'))))
5
df.sort('id', 'date').show()
6
# +----------+---+-----+------+
7
# | date| id|event|result|
8
# +----------+---+-----+------+
9
# |2018-01-01| A|false| 0|
10
# |2018-01-02| A|false| 0|
11
# |2018-01-03| A| true| 1|
12
# |2018-01-04| A|false| 1|
13
# |2018-01-05| A| true| 2|
14
# |2018-01-06| A|false| 2|
15
# |2018-01-01| B|false| 0|
16
# |2018-01-02| B| true| 1|
17
# |2018-01-03| B|false| 1|
18
# +----------+---+-----+------+
19