Add option to enforce minimum segment length #150

Open
StephanDollberg wants to merge 1 commit into apache:master from StephanDollberg:stephan/min-segment-len

@StephanDollberg

One problem we often see in practice is that single spikes cause
changepoints. These are just false positive noise which we want to
avoid.

This patch adds a config option that disallows changepoints which would
enclose a segment shorter than a configured minimum length.

That way we can filter out these one-event changepoints and avoid
noise.

Of course, this will mute true changepoints in a short segment, but that's
fine if they are followed by another true changepoint. For example, [100,
100, 130, 130, 150, 150, 150, ...] would only report the change to 150. This
is fine: a single alert is good enough to get someone to look at the data.

Default behaviour is unchanged.
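
To illustrate the rule described above, here is a minimal sketch in Python. It is not the code in this patch: it is a simplified post-filter over already-detected changepoint indices, and the names `filter_short_segments` and `min_segment_len` are hypothetical. It assumes a changepoint is dropped when the segment it starts (up to the next changepoint or the end of the series) is shorter than the minimum.

```python
from typing import List


def filter_short_segments(
    change_points: List[int], series_len: int, min_segment_len: int
) -> List[int]:
    """Keep only change points whose following segment is long enough.

    Hypothetical helper for illustration. A change point at index i starts
    a segment that runs until the next change point (or the series end).
    """
    bounds = sorted(change_points) + [series_len]
    kept = []
    for cp, next_bound in zip(bounds, bounds[1:]):
        # Length of the segment that begins at this change point.
        if next_bound - cp >= min_segment_len:
            kept.append(cp)
    return kept


series = [100, 100, 130, 130, 150, 150, 150]
candidates = [2, 4]  # indices where the level changes to 130 and to 150

# With a minimum of 3, the two-point 130 segment is muted.
print(filter_short_segments(candidates, len(series), min_segment_len=3))
# With a minimum of 1, nothing is filtered.
print(filter_short_segments(candidates, len(series), min_segment_len=1))
```

With a minimum of 3 only the change to 150 (index 4) survives, matching the example above; with a minimum of 1 both changepoints are kept.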

Let me know what you think.

@henrikingo
Contributor

Hi Stephan

The original e-divisive implementation (in R) actually included such an option, and I think MongoDB's signal_processing_algorithms likewise required 2 points at each end of the segment. This meant that it was only possible to find a change point in segments that had at least 5 points, and in that case it would have to be point #3 that is the change point.

The Hunter implementation then modified this to allow any point in arbitrarily short segments to be a change point. "In practice the minimum segment is 3, since with only 2 points it is not possible to establish the 'normal' range that the other point would be a change from." The Hunter modifications were introduced specifically to correctly find two nearby change points, since a common use case (in Cassandra development, anyway) was that they would observe a regression and then immediately fix it in a nearby commit. The DataStax team observed that they could make the algorithm more sensitive by dividing a long series into smaller windows. A byproduct of this is that individual outlier points get marked as change points much more easily than in the original e-divisive.
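
The arithmetic behind these minimum sizes can be sketched as follows. This is an illustration only, not code from any of the libraries mentioned: `candidate_indices` and `min_points_per_side` are hypothetical names, and treating the Hunter behavior as "1 point on each side" is this sketch's interpretation of the description above.

```python
from typing import List


def candidate_indices(segment_len: int, min_points_per_side: int) -> List[int]:
    """Return 0-based indices that may be a change point in a segment.

    Hypothetical helper: a candidate needs at least min_points_per_side
    points before it and the same number after it.
    """
    return list(range(min_points_per_side, segment_len - min_points_per_side))


# 2 points required at each end: a 5-point segment has exactly one
# candidate, index 2 (point #3 when counting from 1).
print(candidate_indices(5, 2))
# A 4-point segment has no valid candidate under that rule.
print(candidate_indices(4, 2))
# 1 point required at each end: the smallest usable segment has 3 points.
print(candidate_indices(3, 1))
```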

The relevant parameter is window_len, and by default it is set to 50. Before adding a new parameter, it would be interesting to hear from you whether you get "better" behavior by increasing this parameter. In principle you should be able to get the original e-divisive behavior by setting window_len to a really large value, i.e. larger than the length of your time series.

If you do this and still observe individual outliers getting marked as change points (or even as two change points), could you please share a data sample?
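
As a rough illustration of why a very large window_len should fall back to whole-series analysis, the sketch below splits a series into windows. It is not Hunter's actual windowing code: `split_into_windows` is a hypothetical helper, and the half-window stride is an assumption made only for this example.

```python
from typing import List, Tuple


def split_into_windows(series_len: int, window_len: int) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs covering the series.

    Hypothetical helper. Windows overlap by half their length (an
    assumption of this sketch); the last window runs to the series end.
    """
    if window_len >= series_len:
        # The window is at least as long as the data: analyze it in one piece.
        return [(0, series_len)]
    step = max(1, window_len // 2)
    windows = []
    start = 0
    while start + window_len < series_len:
        windows.append((start, start + window_len))
        start += step
    windows.append((start, series_len))
    return windows


# Default-sized windows split a 120-point series into several pieces.
print(split_into_windows(120, 50))
# A window longer than the series yields a single window over all points.
print(split_into_windows(120, 1000))
```

With several short windows, each window is tested on its own, which is where the extra sensitivity to individual outliers comes from; with a single window the whole series is tested at once.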

@StephanDollberg
Author

Thanks for the background, interesting to know!

Sure, I can give that a try. I had tried without specifying window_len in the past and wasn't entirely happy (I know, not a very quantitative statement, haha), but I also didn't know 50 was the default until yesterday and assumed it was actually a lot bigger. Let me try again with that.
