
How to mark data as anomalies based on specific condition in each interval

I have searched for a solution to this problem in many places and couldn't find the right tool.

I have a simple time series:

print(anom)
0        0
1        0
2        0
3        0
4        0
        ..
52777    1
52778    1
52779    0
52780    1

For any run of consecutive values equal to 1 that spans a long interval (for example, 1000 time instances), I want to mark those values as anomalies (True). Otherwise they should be ignored (False).

How do I achieve this with pandas or numpy?

I also want to plot those anomalies, in red for example. How can that be done?

How do I mark those anomalies (values = 1 that extend over around 1000 time instances) in red?


Answer

It is not exactly clear which output you expect. Still, let's consider the following dataset, similar to yours:

import numpy as np
import pandas as pd

s = pd.Series(np.random.choice([0,1], size=100, p=[0.7, 0.3]), name='anom')
0     1
1     0
2     0
3     0
4     0
     ..
95    0
96    1
97    1
98    0
99    1
Name: anom, Length: 100, dtype: int64

Looking like:

input data

filtering based on consecutive values

First we calculate the length of each stretch of consecutive 1s:

length = s.groupby(((s-s.shift().fillna(0)).eq(1).cumsum()*s)).transform(len)*s

This works by identifying the first element of each stretch with (s-s.shift().fillna(0)).eq(1): the difference between an element and the preceding one is 1 only when a 1 follows a 0 (see graph #2 below). Taking the cumulative sum then produces increasing group numbers (graph #3), where each group contains one stretch of 1s and the stretch of 0s that follows it. Multiplying by s sets the group number of every 0 to 0, so only the 1s stay in their numbered groups (graph #4). Now we can group the data per stretch and calculate each group's length (graph #5). The 0s all end up in a single group, so finally we set their length to 0 by multiplying by s again (graph #6).

Here is the visual representation of the successive steps where (…) denotes the previous step in each graph:

breakdown of stretches length calculation
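The same steps can also be followed on a small series where the result is easy to check by hand. This is only a sketch: the data below is made up for illustration, and the intermediate names (starts, group_id, size) are not part of the one-liner above.

```python
import pandas as pd

# Small hand-made series (hypothetical data, not the random series above)
s = pd.Series([0, 1, 1, 0, 1, 1, 1, 0, 0, 1], name='anom')

# Step 1: True where a stretch of 1s starts (a 1 preceded by a 0 or by nothing)
starts = (s - s.shift().fillna(0)).eq(1)

# Step 2: running count of starts; a stretch of 1s and the 0s after it share an id
group_id = starts.cumsum()

# Step 3: multiply by s so that all the 0s fall into group 0
group_id = group_id * s

# Step 4: size of each group, broadcast back to every row
size = s.groupby(group_id).transform(len)

# Step 5: multiply by s again so that the 0s get a length of 0
length = size * s

print(length.tolist())
# [0, 2, 2, 0, 3, 3, 3, 0, 0, 1]
```

Each 1 now carries the length of the stretch it belongs to, and each 0 carries 0.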

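Then we split the data on the stretch length. Here stretches of 10 or more consecutive 1s count as anomalies; everything else, including the 0s, is kept as valid:

s_valid = s.loc[length<10]
s_anom = s.drop(s_valid.index)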

ax = s_valid.plot(marker='o', ls='')
s_anom.plot(marker='o', ls='', ax=ax, color='r')

line+dots

ax = s.plot()
s_anom.plot(marker='o', ls='', ax=ax, color='r')

line+dots
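Since the question also mentions numpy, here is a possible alternative that returns the start and stop positions of each long stretch instead of a per-row length. It is a sketch under assumptions: find_runs is a hypothetical helper name, and the positions are 0-based with an exclusive stop.

```python
import numpy as np

def find_runs(values, min_length):
    """Return (start, stop) pairs, stop exclusive, for runs of 1s of at least min_length."""
    is_one = np.asarray(values) == 1
    # Pad with False on both sides so that runs touching the edges are closed
    padded = np.concatenate(([False], is_one, [False]))
    # +1 marks the first element of a run, -1 marks the position just after a run
    change = np.diff(padded.astype(int))
    starts = np.flatnonzero(change == 1)
    stops = np.flatnonzero(change == -1)
    # Keep only the runs that are long enough
    keep = (stops - starts) >= min_length
    return list(zip(starts[keep].tolist(), stops[keep].tolist()))

print(find_runs([0, 1, 1, 0, 1, 1, 1, 0, 0, 1], min_length=2))
# [(1, 3), (4, 7)]
```

If shaded intervals are preferred over red dots, each pair could be passed to matplotlib's ax.axvspan, for example ax.axvspan(start, stop - 1, color='r', alpha=0.3).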

Another example, with 7 as the threshold:

line+dots ; 7 as threshold
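To change the threshold easily (10, 7, or the 1000 mentioned in the question) and get the True/False marking asked for, the logic could be wrapped in a small function. This is a sketch, not the only way to do it: mark_long_runs and min_length are hypothetical names, and the grouping is a slight variation on the one-liner above.

```python
import pandas as pd

def mark_long_runs(s, min_length):
    """Return a boolean Series: True for each 1 in a run of at least min_length 1s."""
    is_one = s.eq(1)
    # A run starts where a 1 is not preceded by another 1
    starts = is_one & s.shift(fill_value=0).ne(1)
    # Each run of 1s and the 0s that follow it share one id
    run_id = starts.cumsum()
    # Count the 1s in each group, i.e. the length of its run
    run_length = is_one.astype(int).groupby(run_id).transform('sum')
    return is_one & run_length.ge(min_length)

s = pd.Series([0, 1, 1, 0, 1, 1, 1, 0, 0, 1], name='anom')
is_anom = mark_long_runs(s, min_length=3)
print(is_anom.tolist())
# [False, False, False, False, True, True, True, False, False, False]
```

The anomalies can then be selected with s[is_anom] and plotted in red in the same way as s_anom.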

original answer


You can simply convert to bool to get the anomalies:

>>> s.astype(bool)
0      True
1     False
2     False
3     False
4     False
      ...  
95    False
96     True
97     True
98    False
99     True
Name: anom, Length: 100, dtype: bool

Regarding the plot, depending on what you expect, you can do:

s_valid = s.loc[~s.astype(bool)]
s_anom = s.loc[s.astype(bool)]

ax = s_valid.plot(marker='o', ls='')
s_anom.plot(marker='o', ls='', ax=ax, color='r')

output:

data as dots anomalies in red

s_anom = s.loc[s.astype(bool)]
ax = s.plot()
s_anom.plot(marker='o', ls='', ax=ax, color='r')

data as lines anomalies as red dots

User contributions licensed under: CC BY-SA