HEAD-2283

The registrar health checker allows you to specify a "period" and "threshold". These appear to be based on the same-named parameters for amon probes:
https://github.com/joyent/sdc-amon/blob/master/docs/index.md

That is, a health check is considered to have failed only if "threshold" failures occur within the time period "period". However, the implementation does not appear to do this. When the checker starts its first check, it sets a timer for "period". When that timer fires, it sets another timer for "period", and when that second timer fires, it clears the array of failures that we've detected so far. Nothing re-arms either timer after that. So the net effect of the entire mechanism is that the list of failures seen so far is cleared exactly once, "2 * period" milliseconds after the first check starts. There are several things wrong with this:

The first period is twice as long as it's supposed to be. So we can incorrectly mark the component up even if it's exceeded the configured threshold.
The second period is effectively infinite because we never again clear the number of failures we've seen so far. As a result, we can incorrectly mark the component down even if it hasn't exceeded the configured threshold.
A second consequence of that: if the health check causes us to mark the component down, and it subsequently comes back up, we'll never mark it down again. That's because when a check fails, we only mark the component "down" if the current number of failures seen is exactly the threshold. Since we never decrement or reset that number, it stays above the threshold and can never equal it again.
Even if the timers worked as intended, this whole scheme of keeping a failure counter that gets cleared periodically is at best an approximation. It does the wrong thing if the health check fails often enough to exceed the threshold within one "period", but those failures are divided across two sampling periods, which seems quite possible.
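
One way to avoid the periodic-reset problems described above would be to remember the timestamp of each failure and count only the failures that fall within the last "period". The following is a minimal sketch of that idea, not code from registrar: createFailureWindow and recordFailure are hypothetical names, and it assumes that "period" is in milliseconds and that the component should be considered down once at least "threshold" failures fall inside the window.

```javascript
// Hypothetical sketch of a sliding failure window. None of these names
// come from the registrar source.
function createFailureWindow(opts) {
    var period = opts.period;       // window length, assumed milliseconds
    var threshold = opts.threshold; // failures within the window meaning "down"
    var failures = [];              // timestamps of recent failures, oldest first

    // Drop failures that happened one full period or more before "now".
    function prune(now) {
        while (failures.length > 0 && now - failures[0] >= period) {
            failures.shift();
        }
    }

    return {
        // Record a failed check at time "now". Returns true if at least
        // "threshold" failures (including this one) fall within the window.
        recordFailure: function (now) {
            prune(now);
            failures.push(now);
            return (failures.length >= threshold);
        }
    };
}

// Three failures within 300ms of each other. With fixed periods starting
// at 0 and 1000 they would be split across two periods.
var win = createFailureWindow({ period: 1000, threshold: 3 });
win.recordFailure(800);   // false: 1 failure in the window
win.recordFailure(900);   // false: 2 failures in the window
win.recordFailure(1100);  // true: 3 failures in the window

// Much later the component fails again. The old failures have aged out,
// so the threshold can be reached a second time.
win.recordFailure(5000);  // false
win.recordFailure(5100);  // false
win.recordFailure(5200);  // true
```

Because the comparison is ">=" rather than an exact match and old failures age out of the window, this sketch can report the threshold being reached more than once, and failures that straddle what would have been a period boundary are still counted together.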

I have not tested any of this – this is just my reading of the code.

registrar: health checker threshold/period management doesn't work
