In the last post I mentioned how I set up my Rackspace Cloud Monitoring system to notify me when my blog fails or performs badly.
I have since tweaked the configuration a little: I deactivated the performance check.
Why is that? Because it was not measuring my blog's overall performance, just the quality of the transatlantic wires. And I can tell you: that quality is very volatile.
While response times from the check zone in London are consistently good, both U.S. zones go from okay to bad to critical and back to okay within a few minutes. So I was spamming myself with that check.
I'd love to have a redundant performance check in place, but there are currently no two check zones in Europe, and so far I have not found a way to restrict the performance check to the London values only. I think I'll do some more research on that later. For now, I'm fine with the Code 200 'Up and running' check.
Check thresholds can be difficult: we constantly see network blips while monitoring Rackspace Cloud Monitoring itself! If you want some help tuning the performance (or other) checks, feel free to reach out to me with an email to the address I entered. We would love to help!
Disclaimer: I work for Rackspace on the Cloud Monitoring Team.
As you’ve discovered, Cloud Monitoring doesn’t provide any way to alert on metrics from a single monitoring zone while the check is being run in multiple zones. There are a couple of things you could try, though, to keep the noise-to-signal ratio down.
Everything I mention is found on the alarm language reference if you are interested in reading more.
The first thing I would consider is setting the alarm consistency level. This defaults to QUORUM which requires that a majority of your monitoring zones agree on the state of the alarm. This is calculated with N / 2 + 1, where N is the number of monitoring zones configured on the check. You can set this to ALL, and it would require *every* monitoring zone to agree on the state. That way you’d only get alerted if London *and* the US monitoring zones detect a slow HTTP response. Be sure to read about the pros and cons of the various consistency levels in the reference guide I mentioned above. See an example of this below:
:set consecutiveLevel=ALL
if (metric['duration'] > 20000) {
return new AlarmStatus(CRITICAL, "Things are slow!");
}
return new AlarmStatus(OK, "Things are good");
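To make the difference between QUORUM and ALL concrete, here is a rough Python sketch of the agreement rule described above. It is not Cloud Monitoring code: the function names are made up for illustration, N / 2 + 1 is assumed to use integer division, and returning None when the zones do not agree is just a placeholder for "no state change".

```python
from collections import Counter


def quorum_size(n_zones):
    # N / 2 + 1 from the text, assuming integer division (3 zones -> 2).
    return n_zones // 2 + 1


def agreed_state(zone_states, level="QUORUM"):
    """Return the state enough zones agree on, or None if they do not agree.

    zone_states: one alarm state per monitoring zone, e.g. ["OK", "CRITICAL"].
    level: "QUORUM" (majority) or "ALL" (every zone). Hypothetical helper.
    """
    if not zone_states:
        return None
    n = len(zone_states)
    needed = n if level == "ALL" else quorum_size(n)
    state, votes = Counter(zone_states).most_common(1)[0]
    return state if votes >= needed else None


# One London zone that is fine, two U.S. zones that report a slow response.
states = ["OK", "CRITICAL", "CRITICAL"]
print(agreed_state(states, "QUORUM"))
print(agreed_state(states, "ALL"))
```

With three zones, two slow U.S. zones are enough to reach a quorum, which matches the noisy behaviour described in the post. Under ALL, the healthy London zone blocks the CRITICAL state.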
The other thing you can think about is setting the ‘consecutive count’ on the alarm. This requires the state to be evaluated x times consecutively before the alarm is triggered. The example below would require a QUORUM of monitoring zones to evaluate the same alarm status in 3 consecutive polling windows.
:set consecutiveCount=3
if (metric['duration'] > 20000) {
return new AlarmStatus(CRITICAL, "Things are slow!");
}
return new AlarmStatus(OK, "Things are good");
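The effect of the consecutive count can be sketched in the same way. The Python below is only an illustration of the idea, not how the service is implemented: `debounce` is a hypothetical name, the alarm is assumed to start in OK, and each list entry stands for the state the zones agreed on in one polling window.

```python
def debounce(window_states, consecutive_count=3, initial="OK"):
    """Return the reported alarm state after each polling window.

    The reported state only changes once the same new state has been seen
    in `consecutive_count` windows in a row (illustrative assumption).
    """
    current = initial
    run_state, run_len = None, 0
    reported = []
    for state in window_states:
        if state == run_state:
            run_len += 1
        else:
            run_state, run_len = state, 1
        if run_len >= consecutive_count and run_state != current:
            current = run_state
        reported.append(current)
    return reported


# Short blips never last three windows, so nothing is reported.
print(debounce(["OK", "CRITICAL", "OK", "CRITICAL", "CRITICAL", "OK"]))
# A sustained slowdown is reported on its third consecutive window.
print(debounce(["OK", "CRITICAL", "CRITICAL", "CRITICAL", "OK"]))
```

The trade-off is latency: a real outage is also reported a few polling windows later than it would be with a count of 1.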
If you have any other questions, please don’t hesitate to email me (justin.gallardo at rackspace.com).
Happy monitoring!
Doh, I just noticed a typo. The first alarm example should look more like:
:set consistencyLevel=ALL
if (metric['duration'] > 20000) {
return new AlarmStatus(CRITICAL, "Things are slow!");
}
return new AlarmStatus(OK, "Things are good");
I had swapped ‘consistencyLevel’ for ‘consecutiveLevel’.
Cheers!