Ever have one of those bugs that just drives you crazy? I squashed one today that had been haunting me for awhile.
It involved a node losing a service, followed immediately by the node being marked down. The node would then be marked up and the original service that failed would remain down.
This was especially painful when it happened during one of my training classes. Since apache is almost always installed on Red Hat machines I come across, but not running, it is usually a nice exercise to have students start the httpd service, perform a rescan, and see the service added to the WebUI.
Then you have a service you can play with. Kill it and watch the events, set up notifications, etc.
However, it was real embarrassing when students would stop httpd, and see a node down event instead of the node lost service event.
It finally happened in an environment I could capture, and I found the issue. Service polls on a node usually occur at about the same time. So OpenNMS would poll ICMP, DNS, HTTP, etc. at close to the same moment. Well, when a service poll fails, the poller checks the critical service, by default ICMP.
As of 1.1, part of the new poller is the storage of the response time in an RRD file. A second ICMP poll on the heels of the first can result in a second RRD update command with the same time stamp.
This throws a java RuntimeException, which is effectively a failed poll.
So the scenario is as follows:
- OpenNMS polls ICMP on a node, and updates an RRD.
- OpenNMS polls HTTP on a node, which fails.
- ICMP, as the critical service, is immediately polled again.
- This results in another update to the RRD. If it is quick enough, the second update fails and throws a RuntimeException.
- The critical service poll is seen to fail, and thus the node is marked down.
This is now fixed in CVS by catching the exception and simply throwing a log when this happens. It is tracked as Bug 748.