One of our clients in Australia was asking me about the best way to monitor applications on his Linux servers. He wanted to be alerted when they died (or if a particular number of them were not running, etc.)
OpenNMS has a monitor based on the host resources MIB which can do this, but the downside is that there is no corresponding capsd plugin to do the discovery portion of it, so it can be a real pain to set up. I thought that Net-SNMP should be able to do this quite simply, but I found out that it isn’t nearly as easy as I thought it would be.
There is a directive within the Net-SNMP configuration file (snmpd.conf) called “proc”. For example:
proc testing 3 1
Configuring this line will cause the agent to look through the running processes for the string “testing”. The two numbers are the maximum and the minimum count, in that order: if there are at least 1 and no more than 3 matches, the test is considered to pass.
This is reflected in the process table:
$ snmpwalk -v1 -c public localhost .1.3.6.1.4.1.2021.2.1
UCD-SNMP-MIB::prIndex.1 = INTEGER: 1
UCD-SNMP-MIB::prNames.1 = STRING: testing
UCD-SNMP-MIB::prMin.1 = INTEGER: 1
UCD-SNMP-MIB::prMax.1 = INTEGER: 3
UCD-SNMP-MIB::prCount.1 = INTEGER: 1
UCD-SNMP-MIB::prErrorFlag.1 = INTEGER: 0
UCD-SNMP-MIB::prErrMessage.1 = STRING:
UCD-SNMP-MIB::prErrFix.1 = INTEGER: 0
UCD-SNMP-MIB::prErrFixCmd.1 = STRING:
As you can see, the prErrorFlag is set to “0” which means there is no error. This makes sense since the prCount is 1 and that is within the min/max range.
[Note: Leaving off both max and min results in a max value of infinity and a min value of 1. Just listing a max value results in a min value of 0]
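The pass/fail rule and the defaults in the note can be written down in a few lines of Python. This is only a sketch of the behaviour described here, not the agent's actual code: the function names `proc_limits` and `proc_check` are made up for illustration, and the "Too many" message wording is an assumption (only the "Too few" message appears in the output in this post).

```python
import math


def proc_limits(max_count=None, min_count=None):
    """Return (min, max) after applying the defaults from the note above."""
    if max_count is None:
        # "proc NAME": at least one instance, no upper bound.
        return (1 if min_count is None else min_count), math.inf
    # "proc NAME MAX": the minimum falls back to 0.
    return (0 if min_count is None else min_count), max_count


def proc_check(name, count, max_count=None, min_count=None):
    """Return (prErrorFlag, prErrMessage) for a given number of matches."""
    low, high = proc_limits(max_count, min_count)
    if count < low:
        return 1, f"Too few {name} running (# = {count})"
    if count > high:
        # Message wording assumed; it is not shown anywhere in this post.
        return 1, f"Too many {name} running (# = {count})"
    return 0, ""


# "proc testing 3 1" with one and then zero matching processes
print(proc_check("testing", 1, max_count=3, min_count=1))
print(proc_check("testing", 0, max_count=3, min_count=1))
```

The first call models the healthy case above (flag 0, empty message); the second models what happens once the process is stopped.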
If the “testing” process is stopped, this table changes:
$ snmpwalk -v1 -c public localhost .1.3.6.1.4.1.2021.2.1
UCD-SNMP-MIB::prIndex.1 = INTEGER: 1
UCD-SNMP-MIB::prNames.1 = STRING: testing
UCD-SNMP-MIB::prMin.1 = INTEGER: 1
UCD-SNMP-MIB::prMax.1 = INTEGER: 3
UCD-SNMP-MIB::prCount.1 = INTEGER: 0
UCD-SNMP-MIB::prErrorFlag.1 = INTEGER: 1
UCD-SNMP-MIB::prErrMessage.1 = STRING: Too few testing running (# = 0)
UCD-SNMP-MIB::prErrFix.1 = INTEGER: 0
UCD-SNMP-MIB::prErrFixCmd.1 = STRING:
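If you just want to poll this table from a script before any traps are set up, the walk output is easy to pick apart. Here is a rough Python sketch that parses text in the format shown above and lists the entries whose prErrorFlag is set. The helper names `parse_pr_table` and `failing` are hypothetical, the sample text is copied from the walk above, and I am assuming values print as plain integers (the parser also tolerates an enum-style value such as `error(1)` by taking the first number it finds).

```python
import re
from collections import defaultdict

# Matches lines such as "UCD-SNMP-MIB::prCount.1 = INTEGER: 0"
LINE_RE = re.compile(r"^UCD-SNMP-MIB::(pr\w+)\.(\d+) = (\w+):\s?(.*)$")

SAMPLE = """\
UCD-SNMP-MIB::prIndex.1 = INTEGER: 1
UCD-SNMP-MIB::prNames.1 = STRING: testing
UCD-SNMP-MIB::prMin.1 = INTEGER: 1
UCD-SNMP-MIB::prMax.1 = INTEGER: 3
UCD-SNMP-MIB::prCount.1 = INTEGER: 0
UCD-SNMP-MIB::prErrorFlag.1 = INTEGER: 1
UCD-SNMP-MIB::prErrMessage.1 = STRING: Too few testing running (# = 0)
UCD-SNMP-MIB::prErrFix.1 = INTEGER: 0
UCD-SNMP-MIB::prErrFixCmd.1 = STRING:
"""


def parse_pr_table(text):
    """Group the walk output into {row index: {column name: value}}."""
    rows = defaultdict(dict)
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip the command line itself and anything unexpected
        column, index, kind, value = match.groups()
        if kind == "INTEGER":
            number = re.search(r"-?\d+", value)
            value = int(number.group()) if number else None
        rows[int(index)][column] = value
    return dict(rows)


def failing(rows):
    """Return (name, message) for every row with a non-zero prErrorFlag."""
    return [
        (row.get("prNames"), row.get("prErrMessage"))
        for _, row in sorted(rows.items())
        if row.get("prErrorFlag")
    ]


print(failing(parse_pr_table(SAMPLE)))
```

Polling like this works, but it means something has to run the walk on a schedule, which is exactly what a trap avoids.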
This is all well and good, but the problem is how do we get the state change to generate a trap?
That was the hard part.
On reading the documentation, something like this should work. First, set the trap destination by adding a “trapsink” entry:
trapsink 172.20.1.11 public
and then add a “monitor” line:
monitor -r 15 "procTable" prErrorFlag 0 1
This should monitor the value of “prErrorFlag” and generate an error if it is “1” or greater. You can also use expressions, so something like
monitor -r 15 "procTable" prErrorFlag > 0
should do the same thing. The “-r 15” says to check for errors every 15 seconds.
Unfortunately, I was unable to get this to work. It took a lot of digging, but I found out that Net-SNMP needs a valid SNMPv3 user for its internal queries so that “monitor” can read these values from the tables. Adding:
createUser snmpdInternalUser
rouser snmpdInternalUser noauth .1
iquerySecName snmpdInternalUser
enabled me to start getting traps.
This was a good start, but it still wasn’t perfect. Net-SNMP uses the Distributed Management (DISMAN) Event MIB to send the traps, so I was getting something like this:
It was a little generic and didn’t really tell me what I needed to know (namely, what was the process and what was the error). Also, I found the DISMAN events were hidden in the Cisco2.events.xml file, so I broke them out into their own file (which will be included in the releases in December).
A little more research uncovered that I could add other varbinds to the generic trap with just a few options to the “monitor” directive in the configuration file:
monitor -r 15 -o prNames -o prErrMessage "procTable" prErrorFlag 0 1
This would add the name and the error message to the trap. Finally, I noticed that while I got the rising or “down” traps, I wasn’t getting the falling or “up” traps. It turns out that I needed the “-t” option:
monitor -t -r 15 -o prNames -o prErrMessage "procTable" prErrorFlag 0 1
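To make the rising/falling idea concrete, here is a small Python model of a threshold pair like the “0 1” at the end of that line. It reflects my reading of how Event MIB thresholds behave (a rising event when the value reaches the high mark after being below it, a falling event when it drops back to the low mark) and it ignores what happens on the very first sample. The function name `threshold_events` is made up, and none of this is Net-SNMP's own code.

```python
def threshold_events(samples, low=0, high=1):
    """Return the events a low/high threshold pair would raise for a
    sequence of polled values (one value per polling interval)."""
    events = []
    previous = None
    for value in samples:
        if previous is not None:
            if value >= high and previous < high:
                events.append("rising")   # process went into error: "down"
            elif value <= low and previous > low:
                events.append("falling")  # error cleared: "up"
        previous = value
    return events


# prErrorFlag sampled every 15 seconds: fine, fine, broken, broken, fine
print(threshold_events([0, 0, 1, 1, 0]))
```

Note that a value which stays at 1 produces only one rising event, so a process that remains down does not generate a trap on every poll in this model.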
With a little added configuration to the new DISMAN.events.xml file, I was getting the proper rising events:
and the proper falling events:
So, with a little configuration it becomes quite easy to set up Net-SNMP to send traps into OpenNMS.
Here’s a summary:
Add the following lines to your Net-SNMP configuration file (usually /etc/snmp/snmpd.conf or /etc/snmpd.conf):
# Set up a V3 security name for internal queries
createUser snmpdInternalUser
rouser snmpdInternalUser noauth .1
iquerySecName snmpdInternalUser
agentSecName snmpdInternalUser
trapsink 172.20.1.11 public
proc testing 3 1
monitor -t -r 15 -o prNames -o prErrMessage "procTable" prErrorFlag 0 1
You can have multiple “proc” entries while you only need one “monitor” entry.
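If you have a lot of servers, you may prefer to generate these lines rather than type them. The following Python sketch renders the same configuration for a list of processes, with one “proc” line each and a single “monitor” line at the end. The function `render_snmpd_conf` and the process names `httpd` and `sshd` are purely illustrative; the directives themselves are the ones from the summary above.

```python
def render_snmpd_conf(procs, trap_host, community="public",
                      interval=15, user="snmpdInternalUser"):
    """Build the snmpd.conf fragment from this post.

    procs is a list of (name, max_count, min_count) tuples, matching the
    argument order of the "proc" directive.
    """
    lines = [
        "# Set up a V3 security name for internal queries",
        f"createUser {user}",
        f"rouser {user} noauth .1",
        f"iquerySecName {user}",
        f"agentSecName {user}",
        f"trapsink {trap_host} {community}",
    ]
    # One "proc" line per monitored process.
    for name, max_count, min_count in procs:
        lines.append(f"proc {name} {max_count} {min_count}")
    # A single "monitor" line covers every row of the process table.
    lines.append(
        f'monitor -t -r {interval} -o prNames -o prErrMessage '
        f'"procTable" prErrorFlag 0 1'
    )
    return "\n".join(lines) + "\n"


# Hypothetical process list; replace with whatever you need to watch.
print(render_snmpd_conf([("httpd", 10, 1), ("sshd", 5, 1)], "172.20.1.11"))
```

Review the output before appending it to your snmpd.conf, and restart the agent afterwards so the new lines are read.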
Next, you’ll need to get the latest DISMAN.events.xml file from SourceForge. Place it in your OpenNMS “etc/events” directory and be sure to add it to the list of include files at the bottom of the eventconf.xml file (add it before the Cisco2.events.xml file).
Hi, thank you for this great article. I guess there are a lot of guys who install additional agents on their systems. Net-SNMP can do so much in a cheap and easy way.
A small correction: the HostResourceSwRunPlugin is available in 1.6.7 and can be used instead of the LoopPlugin. I don't know in which version the plugin was added. This makes a lot of things easier for process monitoring. I will correct the OpenNMS Wiki immediately 😉
Have a nice day
Ronny
Wiki update done 😉
Heh, what do you know. That’s awesome. I love it when OpenNMS gets so big that I don’t know everything it does.