I have searched for this problem in many places and couldn’t find the right tool.
I have a simple time series data,
print(anom)
0 0
1 0
2 0
3 0
4 0
..
52777 1
52778 1
52779 0
52780 1
For any sequence of consecutive values equal to 1 that spans at least a given number of time instances (for example 1000), I want to mark those values as anomalies (True). All other values should be ignored (False).
How do I achieve this with pandas or numpy?
I also want to plot those anomalies, in red for example. How do I do that?
Answer
It is not exactly clear which output you expect. Yet, let’s consider the following dataset similar to yours:
import numpy as np
import pandas as pd
s = pd.Series(np.random.choice([0,1], size=100, p=[0.7, 0.3]), name='anom')
print(s)
0 1
1 0
2 0
3 0
4 0
..
95 0
96 1
97 1
98 0
99 1
Name: anom, Length: 100, dtype: int64
Plotted, the data looks like this: [figure]
filtering based on consecutive values
First, we calculate the length of each stretch of 1s:
length = s.groupby(((s-s.shift().fillna(0)).eq(1).cumsum()*s)).transform(len)*s
This works by identifying the first element of each stretch with (s-s.shift().fillna(0)).eq(1): the difference between an element and the preceding one is 1 only for a 1 preceded by a 0 (see graph #2 below). The cumulative sum then produces increasing group numbers (graph #3), each covering a stretch of 1s and the following stretch of 0s. Multiplying by s keeps only the 1s in each group (graph #4). We can now group the data per stretch and calculate the length of each one (graph #5). All the 0s end up in a single group, so finally we remove them by multiplying by s again (graph #6).
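To make the one-liner easier to follow, here is a sketch that splits it into its intermediate steps. The small hand-made series and the variable names (starts, group_id, size) are assumptions chosen for illustration; they are not part of the answer's random data.

```python
import pandas as pd

# Hand-made example series (an assumption for illustration)
s = pd.Series([0, 1, 1, 0, 0, 1, 1, 1, 0, 1], name='anom')

# Step 1: True where a stretch of 1s starts (a 1 preceded by a 0, or by nothing)
starts = (s - s.shift().fillna(0)).eq(1)

# Step 2: running count of starts; a stretch of 1s and the 0s after it share a number
group_id = starts.cumsum()

# Step 3: multiply by s so that every 0 falls into group 0
group_id = group_id * s

# Step 4: size of the group each element belongs to
size = s.groupby(group_id).transform(len)

# Step 5: multiply by s again so that the 0s get a length of 0
length = size * s

print(length.tolist())  # [0, 2, 2, 0, 0, 3, 3, 3, 0, 1]
```

Each 1 ends up labelled with the length of the stretch it belongs to, which is what the threshold comparison needs.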
Here is the visual representation of the successive steps, where (…) denotes the previous step in each graph: [figure]
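# All 0s and the stretches of 1s shorter than the threshold (10 here) are valid
s_valid = s.loc[length<10]
# Everything else is an anomaly
s_anom = s.drop(s_valid.index)
# Option 1: valid points and anomalies as separate markers, anomalies in red
ax = s_valid.plot(marker='o', ls='')
s_anom.plot(marker='o', ls='', ax=ax, color='r')
# Option 2: the whole series as a line, with the anomalies overlaid in red
ax = s.plot()
s_anom.plot(marker='o', ls='', ax=ax, color='r')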
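Since the question also mentions numpy and asks for a True/False marking, here is a minimal numpy-only sketch of the same idea. It is a different technique from the groupby above (it locates run boundaries with np.diff), and the function name mark_long_runs and its min_len parameter are hypothetical names introduced for this example.

```python
import numpy as np

def mark_long_runs(values, min_len):
    """Return a boolean array, True where a 1 belongs to a run of at least min_len 1s."""
    x = np.asarray(values).astype(bool)
    # Pad with False on both sides so that every run has a start and an end
    padded = np.concatenate(([False], x, [False])).astype(int)
    change = np.diff(padded)
    starts = np.flatnonzero(change == 1)   # index of the first 1 of each run
    ends = np.flatnonzero(change == -1)    # index just past the last 1 of each run
    mask = np.zeros(x.size, dtype=bool)
    for start, end in zip(starts, ends):
        if end - start >= min_len:
            mask[start:end] = True
    return mask

demo = [0, 1, 1, 0, 1, 1, 1, 1, 0, 1]
print(mark_long_runs(demo, min_len=3).tolist())
# [False, False, False, False, True, True, True, True, False, False]
```

Assuming anom from the question is a one-dimensional sequence of 0s and 1s, mark_long_runs(anom, min_len=1000) should give the requested marking for the 1000-instance threshold.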
Another example, with 7 as the threshold: [figure]
original answer
You can easily convert to bool to get the anomalies:
>>> s.astype(bool)
0 True
1 False
2 False
3 False
4 False
..
95 False
96 True
97 True
98 False
99 True
Name: anom, Length: 100, dtype: bool
Regarding the plot, depending on what you expect you can do:
s_valid = s.loc[~s.astype(bool)]
s_anom = s.loc[s.astype(bool)]
ax = s_valid.plot(marker='o', ls='')
s_anom.plot(marker='o', ls='', ax=ax, color='r')
Or, to draw the whole series as a line with the anomalies overlaid in red:
s_anom = s.loc[s.astype(bool)]
ax = s.plot()
s_anom.plot(marker='o', ls='', ax=ax, color='r')