How to mark data as anomalies based on a specific condition in each interval

Question

I have searched for this problem in many places and couldn't find the right tools. I have a simple time series. For any sequence of values equal to 1 that spans a given length (for example, 1000 time instances), I want to mark those points as anomalies (True). Everything else should be ignored (False). How do I achieve this?

Accepted Answer

It is not exactly clear which output you expect. Still, let's consider the following dataset, similar to yours:

    import numpy as np
    import pandas as pd

    s = pd.Series(np.random.choice([0,1], size=100, p=[0.7, 0.3]), name='anom')

    0     1
    1     0
    2     0
    3     0
    4     0
         ..
    95    0
    96    1
    97    1
    98    0
    99    1
    Name: anom, Length: 100, dtype: int64

It looks like this:

[figure]

filtering based on consecutive values

First we calculate the length of the stretches of 1s:

    length = s.groupby(((s-s.shift().fillna(0)).eq(1).cumsum()*s)).transform(len)*s

This works by identifying the first element of each stretch with (s-s.shift().fillna(0)).eq(1): the difference between an element and the preceding one is 1 only for a 1 preceded by a 0 (see graph #2 below). A cumulative sum then produces increasing group numbers (graph #3), each covering a stretch of 1s and the stretch of 0s that follows it. Multiplying by s keeps only the 1s in each group (graph #4). Now we can group the data per stretch and calculate each one's length (graph #5). The 0s all end up in a single group, so finally we remove them by multiplying by s again (graph #6).

Here is the visual representation of the successive steps, where (…) denotes the previous step in each graph:

[figure]

Then split the data on a threshold (here 10) and plot the anomalies in red, either as markers only:

    s_valid = s.loc[length<10]      # points in stretches shorter than the threshold
    s_anom = s.drop(s_valid.index)  # everything else is an anomaly
    ax = s_valid.plot(marker='o', ls='')
    s_anom.plot(marker='o', ls='', ax=ax, color='r')

or on top of the full line plot:

    ax = s.plot()
    s_anom.plot(marker='o', ls='', ax=ax, color='r')

Another example, with 7 as the threshold:

[figure]

original answer

You can easily convert to bool to get anomalies:

    >>> s.astype(bool)
    0      True
    1     False
    2     False
    3     False
    4     False
          ...
    95    False
    96     True
    97     True
    98    False
    99     True
    Name: anom, Length: 100, dtype: bool

Regarding the plot, depending on what you expect, you can do:

    s_valid = s.loc[~s.astype(bool)]
    s_anom = s.loc[s.astype(bool)]
    ax = s_valid.plot(marker='o', ls='')
    s_anom.plot(marker='o', ls='', ax=ax, color='r')

output:

[figure]

or:

    s_anom = s.loc[s.astype(bool)]
    ax = s.plot()
    s_anom.plot(marker='o', ls='', ax=ax, color='r')
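
As a minimal sketch, the run-length idea above could be wrapped into a helper that returns the True/False flags the question asks for. The function name `mark_long_runs` and the parameter `min_len` are hypothetical, not part of pandas or of the answer above. The sketch also uses a slightly different grouping trick, `s.ne(s.shift()).cumsum()`, which starts a new group whenever the value changes, and it assumes the series contains only 0s and 1s.

```python
import pandas as pd


def mark_long_runs(s: pd.Series, min_len: int) -> pd.Series:
    """Flag 1s that sit in a run of at least `min_len` consecutive 1s.

    Hypothetical helper for illustration; assumes `s` holds only 0s and 1s.
    """
    # A new run starts wherever the value differs from the previous one.
    run_id = s.ne(s.shift()).cumsum()
    # Length of the run that each element belongs to (runs of 0s included).
    run_len = s.groupby(run_id).transform("size")
    # Keep only long runs, and only where the value is actually 1.
    return s.eq(1) & run_len.ge(min_len)


# Small hand-made series with runs of 1s of length 1, 3 and 2.
s = pd.Series([1, 0, 1, 1, 1, 0, 0, 1, 1], name="anom")
flags = mark_long_runs(s, min_len=3)
print(flags.tolist())
# [False, False, True, True, True, False, False, False, False]
```

The resulting boolean Series can be used directly as a mask, for example `s[flags]` for the anomalous points and `s[~flags]` for the rest, in place of `s_anom` and `s_valid` in the plotting snippets.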
