I searched for this problem in many places and couldn't find the right tools.
I have a simple time series:
print(anom)
0        0
1        0
2        0
3        0
4        0
        ..
52777    1
52778    1
52779    0
52780    1
Any sequence of consecutive values equal to 1 that spans a long enough stretch (for example 1000 time instances) should be marked as an anomaly (True). Everything else should be ignored (False).
How do I achieve this with pandas or numpy?
I also want to plot those anomalies (values = 1 that span around 1000 time instances) in a distinct colour, red for example. How do I do that?
Answer
It is not exactly clear which output you expect. Still, let's consider the following dataset, similar to yours:
import numpy as np
import pandas as pd
s = pd.Series(np.random.choice([0,1], size=100, p=[0.7, 0.3]), name='anom')
0     1
1     0
2     0
3     0
4     0
     ..
95    0
96    1
97    1
98    0
99    1
Name: anom, Length: 100, dtype: int64
Looking like:
filtering based on consecutive values
First we calculate the length of the stretches of 1s
length = s.groupby(((s-s.shift().fillna(0)).eq(1).cumsum()*s)).transform(len)*s
This works by identifying the first element of each stretch with (s-s.shift().fillna(0)).eq(1): the difference between an element and the preceding one is 1 only for a 1 preceded by a 0 (see graph #2 below). The cumulative sum then builds increasing group numbers (graph #3), each covering a stretch of 1s and the stretch of 0s that follows it. Multiplying by s keeps only the 1s in each group (graph #4). We can now group the data per stretch and calculate the length of each one (graph #5). All the 0s end up in a single group, so finally we remove them by multiplying by s again (graph #6).
Here is the visual representation of the successive steps, where (…) denotes the previous step in each graph:
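The same steps can be traced on a small series that is easy to check by hand. This is only a sketch: the ten-value series and the intermediate names (starts, group_id, ones_only, size) are illustrative assumptions, not part of the original answer. The operations themselves are the ones chained in the one-liner above.

```python
import pandas as pd

# Small hypothetical series, chosen so the result can be verified by hand
s = pd.Series([0, 1, 1, 0, 1, 1, 1, 0, 0, 1], name='anom')

# Step 1: True where a stretch of 1s starts (a 1 preceded by a 0)
starts = (s - s.shift().fillna(0)).eq(1)

# Step 2: running count of starts; a stretch of 1s and the 0s after it share an id
group_id = starts.cumsum()

# Step 3: multiplying by s sends every 0 into group 0
ones_only = group_id * s

# Step 4: size of each group, broadcast back to every row
size = s.groupby(ones_only).transform(len)

# Step 5: multiplying by s again zeroes out the rows that belong to the group of 0s
length = size * s

print(length.tolist())  # [0, 2, 2, 0, 3, 3, 3, 0, 0, 1]
```

Each 1 ends up carrying the length of the stretch it belongs to, which is what the threshold comparison relies on.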
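# stretches shorter than 10 are considered valid, everything else is an anomaly
s_valid = s.loc[length<10]
s_anom = s.drop(s_valid.index)
# plot the valid points first, then overlay the anomalies in red on the same axes
ax = s_valid.plot(marker='o', ls='')
s_anom.plot(marker='o', ls='', ax=ax, color='r')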
# alternative: draw the whole series as a line and overlay the anomalies in red
ax = s.plot()
s_anom.plot(marker='o', ls='', ax=ax, color='r')
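Since the question asks for a True/False marking, the filtering can also be wrapped in a function that returns a boolean mask. The sketch below is an assumption-laden variant, not the method used above: the name flag_long_runs and its min_len parameter are hypothetical, and the run ids come from a "value changed" comparison rather than from the difference trick. For the data in the question, min_len would be 1000.

```python
import pandas as pd

def flag_long_runs(s: pd.Series, min_len: int) -> pd.Series:
    """Return True for every 1 that belongs to a run of at least min_len 1s.

    Hypothetical helper, written for illustration only.
    """
    # A new run id starts whenever the value differs from the previous one
    run_id = s.ne(s.shift()).cumsum()
    # Length of the run each row belongs to (runs of 0s are measured too)
    run_len = s.groupby(run_id).transform('size')
    # Keep only rows that are 1 and sit in a long enough run
    return s.eq(1) & run_len.ge(min_len)

s = pd.Series([0, 1, 1, 0, 1, 1, 1, 0, 0, 1], name='anom')
mask = flag_long_runs(s, min_len=3)
print(mask.tolist())
```

The mask can then be used as s.loc[mask] for the anomalies and s.loc[~mask] for the rest, in the same way as s_anom and s_valid above.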
Another example, with 7 as the threshold:
original answer
You can easily convert to bool to get the anomalies:
>>> s.astype(bool)
0      True
1     False
2     False
3     False
4     False
      ...
95    False
96     True
97     True
98    False
99     True
Name: anom, Length: 100, dtype: bool
Regarding the plot, depending on what you expect you can do:
s_valid = s.loc[~s.astype(bool)]
s_anom = s.loc[s.astype(bool)]
ax = s_valid.plot(marker='o', ls='')
s_anom.plot(marker='o', ls='', ax=ax, color='r')
output:
s_anom = s.loc[s.astype(bool)]
ax = s.plot()
s_anom.plot(marker='o', ls='', ax=ax, color='r')