Node Monitoring

This page is an attempt to bring together information on the monitoring of nodes. The idea is to do something like MississippiMonitoring for the rest of our nodes so that we can get at least some information about how much they are used. Ideally, we would also have some indication of when they break, and some information to help us diagnose why. It would also be nice if most of this information lived in one place, since we are very lazy.

Historical Discussion

In the past, we have used ["Nagios"] and sometimes ["SNMP"] to monitor nodes, although neither is currently in use (both are still listed on ToDoList, for instance). Some information can be garnered from the NoCat status page on port 5280, and a bit more by manually logging into the box, but this is tedious and gives only a snapshot. SystemWideStatistics details one idea that was never followed through. NodeMississippi monitors its nodes using cacti and snmpd, and it works quite well. WifiDog uses heartbeats to monitor whether a host is up and to keep some statistics; while this is a great advance, it is only useful on our few nodes that run WifiDog, and it isn't quite as information-rich or flexible as we would like.

Current Approach

The current approach will be to run snmpd with SNMPv3 (net-snmp, read-only, and with authentication) on our Internet-accessible NuCabs and then aggregate the information using [http://cacti.net Cacti]. We already have a Cacti instance on chevy for MississippiMonitoring, so it is a good candidate for initial experimentation.
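
As a rough illustration of the read-only, authenticated SNMPv3 setup described above, the following sketch prints the two net-snmp snmpd.conf directives that such a setup would typically need: createUser to define the user and rouser to grant it read-only access with authentication required. The function name snmpd_v3_conf and the user name in the usage note are made up for this example, and SHA is only an assumed choice of authentication type; none of this has been settled for our nodes.

```shell
#!/bin/sh
# Sketch only: snmpd_v3_conf is a hypothetical helper, not part of net-snmp.
# It prints a minimal snmpd.conf fragment for one read-only SNMPv3 user.
snmpd_v3_conf() {
    user=$1
    pass=$2
    # SNMPv3 passphrases must be at least 8 characters long.
    if [ "${#pass}" -lt 8 ]; then
        echo "passphrase must be at least 8 characters" >&2
        return 1
    fi
    # createUser defines the user and its authentication passphrase
    # (SHA is an assumption here).
    printf 'createUser %s SHA "%s"\n' "$user" "$pass"
    # rouser grants read-only access and requires authentication.
    printf 'rouser %s auth\n' "$user"
}
```

For example, snmpd_v3_conf cactiuser 'a-long-passphrase' prints the fragment for a placeholder user called cactiuser. Where the output belongs depends on how net-snmp is packaged on the node, so check the local snmpd.conf documentation before appending it anywhere.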

Here is a tiny script that you can run on a NuCab to get a count of currently associated clients:

iptables -t mangle -L NoCat -n | grep -c "MARK set 0x3"
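
To get this count into Cacti rather than reading it by hand, one option is to wrap the one-liner above in a small script that snmpd can call. The sketch below is only an illustration: the function names count_marked and nocat_clients are made up here, and it assumes, as the one-liner does, that each client shows up as one "MARK set 0x3" rule in the NoCat chain of the mangle table.

```shell
#!/bin/sh
# Sketch only: these function names are hypothetical, not part of NoCat.

# Count the lines on stdin that carry the 0x3 mark.
count_marked() {
    # grep -c prints 0 but exits non-zero when nothing matches;
    # "|| true" keeps that case from looking like a failure.
    grep -c "MARK set 0x3" || true
}

# List the NoCat chain of the mangle table (needs root) and count marked rules.
# -n skips reverse DNS lookups, so the listing returns quickly.
nocat_clients() {
    iptables -t mangle -L NoCat -n | count_marked
}
```

Calling nocat_clients as root prints the count. Splitting out count_marked means the counting can be checked against saved iptables output without root. net-snmp's snmpd can run a script through an extend line in snmpd.conf, which might be one way to let Cacti poll this value, although whether snmpd has the privileges to run iptables on a NuCab would need checking.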