Add option to enforce minimum segment length #150
StephanDollberg wants to merge 1 commit into apache:master
One problem we often see in practice is that single spikes cause changepoints. These are false positives that we want to avoid.

This patch adds a config option that disallows changepoints enclosing segments shorter than a given minimum length. That way we can filter out these one-event changepoints and reduce noise.

Of course, this will also mute true changepoints that bound a short segment, but that's fine if they are followed by another true changepoint. For example, [100, 100, 130, 130, 150, 150, 150 ...] would only report the change to 150. This is fine: a single alert is good enough to get someone to look at the data.

Default behaviour is unchanged.
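To illustrate the idea, here is a minimal sketch of a single split search that honours a minimum segment length. It is not the code from this patch: `best_split` and `min_segment_len` are hypothetical names, a plain mean-shift score stands in for the e-divisive statistic, and there is no significance test. The series is the example above, with the trailing `...` assumed to be a few more 150s.

```python
from statistics import fmean


def best_split(series, min_segment_len=1):
    """Return the index that best splits `series` in two, or None.

    Hypothetical helper: a simple mean-shift score is used as a stand-in
    for the e-divisive statistic, and no significance test is applied.
    """
    if min_segment_len < 1:
        raise ValueError("min_segment_len must be at least 1")
    n = len(series)
    best_index, best_score = None, 0.0
    # Only consider splits that leave at least min_segment_len points
    # on each side of the candidate changepoint.
    for i in range(min_segment_len, n - min_segment_len + 1):
        left, right = series[:i], series[i:]
        score = len(left) * len(right) / n * (fmean(left) - fmean(right)) ** 2
        if score > best_score:
            best_index, best_score = i, score
    return best_index


series = [100, 100, 130, 130, 150, 150, 150, 150, 150]
unconstrained = best_split(series)                     # split at the step to 130
constrained = best_split(series, min_segment_len=3)    # split at the step to 150
```

With no constraint the best split sits at the step to 130; requiring three points per segment rules that candidate out, and the step to 150 is picked instead.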
Hi Stephan,

The original e-divisive implementation (in R) actually included such an option, and I think MongoDB's signal_processing_algorithms likewise required 2 points at each end of the segment. This meant that it was only possible to find a change point in segments that had at least 5 points, and in that case it would have to be point #3 that is the change point. The Hunter implementation then modified this to allow any point in arbitrarily short segments to be a change point. "In practice the minimum segment is 3, since with only 2 points it is not possible to establish the "normal" range that the other point would be a change from."

The Hunter modifications were introduced specifically to correctly find two nearby change points, since a common use case (in Cassandra development, anyway) was that they would observe a regression and then immediately fix it in a nearby commit.

The DataStax team observed that they could make the algorithm more sensitive by dividing a long series into smaller windows. A byproduct of this is that individual outlier points get marked as change points much more easily than in the original e-divisive. The relevant parameter is window_len, and by default it is set to 50.

Before adding a new parameter, it would be interesting to hear from you whether you get "better" behavior by increasing this parameter. In principle you should be able to get the original e-divisive behavior by setting window_len to a really large value, that is, larger than the length of your time series. If you do this and still observe individual outliers getting marked as change points (or even as two changepoints), could you please share a data sample?
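As a small illustration of the "2 points at each end" rule described above, the following sketch lists which points of a segment could be a change point. The function name `allowed_change_points` and the `margin` parameter are made up for this example and are not taken from any of the libraries mentioned.

```python
def allowed_change_points(segment_len, margin=2):
    """Return the 1-based point numbers that may be a change point.

    Hypothetical helper: `margin` points at each end of the segment
    are excluded from being a change point.
    """
    return list(range(margin + 1, segment_len - margin + 1))


five = allowed_change_points(5)   # only the middle point qualifies
four = allowed_change_points(4)   # too short: no candidates at all
```

A 5-point segment leaves only point #3 as a candidate, and anything shorter leaves none.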
Thanks for the background, interesting to know! Sure, I can give that a try. I had tried without specifying window_len in the past and wasn't entirely happy (not a very quantitative statement, I know, haha), but I also didn't know until yesterday that 50 was the default; I had assumed it was a lot bigger. Let me try again with that.
Let me know what you think.