Choose the Right Thermometer

Okay, so I have a love/hate relationship with Centurylink. Centurylink provides a DSL circuit to my house. I love the fact that I have something resembling broadband with 10Mbps down and about 1Mbps up. Now that doesn’t even qualify as broadband according to the FCC, but it beats the heck out of the alternatives (and I am jealous of my friends with cable who have 100Mbps down or even 300Mbps).

The hate part comes from reliability, which lately has been crap. This post is actually focused on OpenNMS so I won’t go into all of my issues, but I’ve been struggling with long outages in my service.

The latest issue is a new one: packet loss. Usually the circuit is either up or completely down, but for the last three days I’ve been having issues with a large percentage of dropped packets. Of course I monitor my home network from the office OpenNMS instance, and this will usually manifest itself with multiple nodeLostService events around HTTP since I have a personal web server that I monitor.

The default ICMP monitor does not measure packet loss. As long as at least one ping reply makes it, ICMP is considered up, so the node itself remains up. OpenNMS does have a monitor for packet loss called Strafeping. It sends out 20 pings in a short amount of time and then measures how long they take to come back. So I added it to the node for my home and I saw something unusual: a consistent 19 out of 20 lost packets.

Strafeping Graph

Power cycling the DSL modem seems to correct the problem, and the command line ping was reporting no lost packets, so why was I seeing such packet loss from the monitor? Was Strafeping broken?

While it is always a possibility, I didn’t think that Strafeping was broken, but I did check a number of graphs for other circuits and they looked fine. Thus it had to be something else.

This brings up a touchy subject for me: false positives. Is OpenNMS reporting false problems?

It reminds me of an event happened when I was studying physics back in the late 1980s. I was working with some newly discovered ceramic material that exhibited superconductivity at relatively high temperatures (around 92K). That temperature can be reached using liquid nitrogen, which was relatively easy to source compared to cooler liquids like liquid helium.

I needed to measure the temperature of the ceramic, but mercury (used in most common thermometers) is a solid at those temperatures, so I went to my advisor for suggestions. His first question to me was “What does a thermometer measure?”

I thought it was a trick question, so I answered “temperature” (“thermo” meaning temperature and meter meaning “to measure”). He replied, “Okay, smart guy, the temperature of what?”

That was harder to answer exactly, so I said vague things like the ambient environment, whatever it was next to, etc. He interrupted me and said “No, a thermometer measures one thing: the temperature of the thermometer”.

This was an important lesson, even though it seems obvious. In the case of the ceramic it meant a lot of extra steps to make sure the thermometer we were using (which was based on changes in resistance) was as close to the temperature of the material as possible.

What does that have to do with OpenNMS? Well, OpenNMS is like that thermometer. It is up to us to make sure that the way we decide to use it for monitoring is as close to our criteria as possible. A “false positive” usually indicates a problem with the method versus the tool – OpenNMS is behaving exactly as it should but we need to match it better to what we expect.

In my case I found out the router I use was limited by default to responding 1 ping per second (to avoid DDoS attacks I assume), so last night when I upped that to allow 20 pings per second Strafeping started to work as expected (as you can see in the graph above).

This allowed me to detect when my DSL circuit packet loss started again today. A little after 14:00 the system detected high packet loss. When this happened before, power cycling the modem seemed to fix it, so I headed home to do just that.

While I was on the way, around 15:30, the packet loss seemed to improve, but as you can see from the graph the ping times were all over the place (the line is green but there is a lot of extra “smoke” around it indicating a variance in the response times). I proactively power cycled the modem and things settled down. The Centurylink agent agreed to send me a new modem.

The point of this post is to stress that you need to understand how your monitoring tools actually work and you can often correct issues that make a monitor unusable and turn it into to something useful. Choose the right thermometer.

Nextcloud, Never Stop Nexting!

It’s been awhile since I’ve posted a long, navel-gazing rant about the business of open source software. I’ve been trying to focus more on our business than spending time talking about it, but yesterday an announcement was made that brought all of it back to the fore.

TL;DR; Yesterday the Nextcloud project was announced as a fork of the popular ownCloud project. It was founded by many of the core developers of ownCloud. On the same day, the US corporation behind ownCloud shut it doors, citing Nextcloud as the reason. Is this a good thing? Only time will tell, but it represents the (still) ongoing friction between open source software and traditional software business models.

I was looking over my Google+ stream yesterday when I saw a post by Bryan Lunduke announcing a special “secret” broadcast coming at 1pm (10am Pacific). As I am a Lundookie, I made a point to watch it. I missed the start of it but when I joined it turned out to be an interview with the technical team behind a new project called Nextcloud, which was for the most part the same team behind ownCloud.

Nextcloud is a fork, and in the open source world a “fork” is the nuclear option. When a project’s community becomes so divided that they can’t work things out, or they don’t want to work things out for whatever reasons, there is the option to take the code and start a new project. It always represents a failure but sometimes it can’t be helped. The two forks I can think of off hand, Joomla from Mambo and Icinga from Nagios, both resulted in stronger projects and better software, so maybe this will happen here.

In part I blame the VC model for financing software companies for the fork. In the traditional software model, a bunch of money is poured into a company to create software, but once that software is created the cost of reproducing it is near zero, so the business model is to sell licenses to the software to the end users in order to generate revenue in the future. This model breaks when it comes to free and open source software, since once the software is created there is no way to force the end users to pay for it.

That still doesn’t keep companies from trying. This resulted in a trend (which is dying out) called “open core” – the idea that some software is available under an open source license but certain features are kept proprietary. As Brian Prentice at Gartner pointed out, there is little difference between this and just plain old proprietary software. You end up with the same lack of freedom and same vendor lock in.

Those of us who support free software tend to be bothered by this. Few things get me angrier than to be at a conference and have someone go “Oh, this OpenNMS looks nice – how much is the enterprise version?”. We only have the enterprise version and every bit of code we produce is available under an open source license.

Perhaps this happened at ownCloud. When one of the founders was on Bad Voltage awhile back, I had this to say about the interview:

The only thing that wasn’t clear to me was the business model. The founder Frank Karlitschek states that ownCloud is not “open core” (or as we like to call it “fauxpensource“) but I’m not clear on their “enterprise” vs. “community” features. My gut tells me that they are on the side of good.

Frank seemed really to be on the side of freedom, and I could see this being a problem if the rest of the ownCloud team wasn’t so dedicated.

On the interview yesterday I asked if Nextcloud was going to have a proprietary (or “enterprise”) version. As you can imagine I am pretty strongly against that.

The reason I asked was from this article on the new company that stated:

There will be two editions of Nextcloud: the free of cost community edition and the paid enterprise edition. The enterprise edition will have some additional features suited for enterprise customers, but unlike ownCloud, the community and enterprise editions for Nextcloud will borrow features from each other more freely.

Frank wouldn’t commit to making all of Nextcloud open, but he does seem genuinely determined to make as much of it open as possible.

Which leads me to wonder, what’s stopping him?

It’s got to be the money guys, right? Look, nothing says that open source companies can’t make money, it’s just you have to do it differently than you would with proprietary software. I can’t stress this enough – if your “open source” business model involves selling proprietary software you are not an open source company.

This is one of the reasons my blood pressure goes up whenever I visit Silicon Valley. Seriously, when I watch the HBO show to me it isn’t a comedy, it’s a documentary (and the fact that I most closely identify with the character of Erlich doesn’t make me feel all that better about myself).

I want to make things. I want to make things that last. I can remember the first true vacation I took, several years after taking over the OpenNMS project when it had grown it to the point that it didn’t need me all the time. I was so happy that it had reached that point. I want OpenNMS to be around well after I’m gone.

It seems, however, that Silicon Valley is more interested in making money rather than making things. They hunt “unicorns” – startups with more than a $1 billion valuation – and frequently no one can really determine how they arrive at that valuation. They are so consumed with jargon that quite often you can’t even figure out what some of these companies do, and many of them fade in value after the IPO.

I can remember a keynote at OSCON by Martin Mickos about Eucalyptus, and how it was “open source” but of course would have proprietary code because “well, we need to make money”. He is one of those Silicon Valley darlings who just doesn’t get open source, and it’s why we now have OpenStack.

The biggest challenge to making money in open source is educating the consumer that free software doesn’t mean free solution. Free software can be very powerful but it comes with a certain level of complexity, and to get the most out of it you have to invest in it. The companies focused on free and open source software make money by providing products that address this complexity.

Traditionally, this has been service and support. I like to say at OpenNMS we don’t sell software, we sell time. Since we do little marketing, all of our users are self selecting (which makes them incredibly intelligent and usually quite physically beautiful) and most of them have the ability to figure out their own issues. But by working with us we can greatly shorten the time to deploy as well as make them aware of options they may not know exist.

In more recent times, there is also the option to offer open source software as a service. Take WordPress, one of my favorite examples. While I find it incredibly easy to install an instance of WordPress, if you don’t want to or if you find it difficult, you can always pay them to host it for you. Change your mind later? You can export it to an instance you control.

The market is always changing and with it there is opportunity. As OpenNMS is a network monitoring platform and the network keeps getting larger, we are focusing on moving it to OpenStack for ultimate scalability, and then coupled with our Minions we’ll have the ability to handle an “Internet of Things” amount of devices. At each point there are revenue opportunities as we can help our clients get it set up in their private cloud, or help them by letting them outsource some or all of it, such as Newts storage. The beauty is that the end user gets to own their solution and they always have the option of bringing it back in house.

None of these models involves requiring a license purchase as part of the business plan. In fact, I can foresee a time in the near future where purchasing a proprietary software product without fully exploring open source alternatives will be considered a breach of fiduciary responsibility.

And these consumers will be savvy enough to demand pure open source solutions. That is why I think Nextcloud, if they are able to focus their revenue efforts on things such as an appliance, has a better chance of success than a company like ownCloud that relies on revenue from software licensing sales. The fact that most of the creators have left doesn’t help them, either.

The lack of revenue from licenses sales makes most VCs panic, and it looks like that’s exactly what happened with the US division of ownCloud:

Unfortunately, the announcement has consequences for ownCloud, Inc. based in Lexington, MA. Our main lenders in the US have cancelled our credit. Following American law, we are forced to close the doors of ownCloud, Inc. with immediate effect and terminate the contracts of 8 employees. The ownCloud GmbH is not directly affected by this and the growth of the ownCloud Foundation will remain a key priority.

I look forward to the time in the not too distant future when the open core model is seen as quaint as selling software on floppy disks at the local electronics store, and I eagerly await the first release of Nextcloud.

Emley Moor, Kirklees, West Yorkshire

I spent last week back in the United Kingdom. I always find it odd to travel to the UK. When I’m in, say, Germany or Spain, I know I’m in a different country. With the UK I sometimes forget and hijinks ensue. As Shaw may have once said, we are two countries separated by a common language.

Usually I spend time in the South, mainly Hampshire, but this trip was in Yorkshire, specifically West Yorkshire. I was looking forward to this for a number of reasons. For example, I love Yorkshire Pudding, and the Four Yorkshiremen is my favorite Monty Python routine.

Also, it meant that I could fly into Manchester Airport and miss Heathrow. Well, I didn’t exactly miss it.

I was visiting a big client that most people have never heard of, even though they are probably an integral part of your life if you live in the UK. Arqiva provides the broadcast infrastructure for much of the television and mobile phone industry in the country, as well as being involved in deploying networks for projects such as smart metering and the Internet of Things.

We were working at the Emley Moor location, which is home to the Emley Moor Mast. This is the largest freestanding structure in Britain (and third in the European Union). With a total height of 1084 feet, it is higher than the Eiffel Tower and almost twice as high as the Washington Monument.

Emily Moor Mast View

The mast was built in 1971 to replace a metal lattice tower that fell, due to a combination of ice and wind, in 1969. I love the excerpt from the log book mentioned in the Wikipedia article:

  • Day: Lee, Caffell, Vander Byl
  • Ice hazard – Packed ice beginning to fall from mast & stays. Roads close to station temporarily closed by Councils. Please notify councils when roads are safe (!)
  • Pye monitor – no frame lock – V10 replaced (low ins). Monitor overheating due to fan choked up with dust- cleaned out, motor lubricated and fan blades reset.
  • Evening :- Glendenning, Bottom, Redgrove
  • 1,265 ft (386 m) Mast :- Fell down across Jagger Lane (corner of Common Lane) at 17:01:45. Police, I.T.A. HQ, R.O., etc., all notified.
  • Mast Power Isolator :- Fuses removed & isolator locked in the “OFF” position. All isolators in basement feeding mast stump also switched off. Dehydrators & TXs switched off.

They still have that log book, open to that page.

Emily Moor Log Book

If you have 20 minutes, there is a great old documentary on the fall of the old tower and the construction of the new mast.

On my last day there we got to go up into the structure. It’s pretty impressive:

Emily Moor Mast Up Close

and the inside looks like something from a 1970s sci-fi movie:

Emily Moor Mast Inside

The article stated that it takes seven minutes to ride the lift to the top. I timed it at six minutes, fifty-seven seconds, so that’s about right (it’s fifteen seconds quicker going down). I was working with Dr. Craig Gallen who remembers going up in the open lift carriage, but we were in an enclosed car. It’s very small and with five of us in it I will admit to a small amount of claustrophobia on the way up.

But getting to the top is worth it. The view is amazing:

View from Emily Moor Mast

It was a calm day but you could still feel the tower sway a bit. They have a plumb bob set up to measure the drift, and it was barely moving while we were up there. Toby, our host, told of a time he had to spend seven hours installing equipment when the bob was moving four to five inches side to side. They had to move around on their hands and knees to avoid falling over.

Plumb Bob

I’m glad I wasn’t there on that day, but our day was fantastic. Here is a shot of the parking lot where the first picture (above) was taken.

View of Emily Moor Parking Lot

I had a really great time on this trip. The client was amazing, and I really like the area. It reminds me a bit of the North Carolina mountains. I did get my Yorkshire Pudding in Yorkshire (bucket list item):

Yorkshire Pudding in Yorkshire

and one evening Craig and I got to meet up with Keith Spragg.

Keith Spragg and Craig Gallen

Keith is a regular on the OpenNMS IRC channel (#opennms on freenode.net), and he works for Southway Housing Trust. They are a non-profit that manages several thousand homes, and part of that involves providing certain IT services to their tenants. They are mainly a Windows/Citrix shop but OpenNMS is running on one of the two Linux machines in their environment. He tried out a number of solutions before finding that OpenNMS met his needs, and he pays it forward by helping people via IRC. It always warms my heart to see OpenNMS being used in such places.

I hope to return to the area, although I was glad I was there in May. It’s around 53 degrees north latitude, which puts it level with the southern Alaskan islands. It would get light around 4am, and in the winter ice has been known to fall in sheets from the Mast (the walkways are covered to help protect the people who work there).

I bet Yorkshire Pudding really hits the spot on a cold winter’s day.

OpenNMS Horizon 18 “Tardigrade” Is Now Available

I am extremely happy to announce the availability of Horizon 18, codenamed “Tardigrade”. Ben is responsible for naming our releases and he’s decided that the theme for Horizon 18 will be animals. The name “Tardigrade” was suggested in the IRC channel by Uberpenguin, and while they aren’t the prettiest things, Wikipedia describes them as “perhaps the most durable of known organisms” so in the context of OpenNMS that is appropriate.

OpenNMS Horizon 18

I am also happy to see the Horizon program working. When we split OpenNMS into Horizon and Meridian, the main reason was to drive faster development. Now instead of a new stable release every 18 months, we are getting them out every 3 to 4 months. And these are great releases – not just major releases in name only.

The first thing you’ll notice if you log in to Horizon 18 as a user in the admin role is that we’ve added a new “opt-in” feature that let’s us know a little bit about how OpenNMS is being used by people. We hope that most of you will choose to send us this information, and in the spirit of the Open Source Way we’ve made all of the statistics available publicly.

OpenNMS Opt-In Screen

One of the key things we are looking for is the list of SNMP Object IDs. This will let us know what devices are being monitored by our users and to increase their level of support. Of course, this requires that your OpenNMS instance be able to reach the stats server on the Internet, and you can change your choice at any time on the Configuration admin page under “Data Choices”. It will only send this information once every 24 hours, so we don’t expect it to impact network traffic at all.

Once you’ve opted in, the next thing you’ll probably notice is new problem lists on the home page listing “services” and “applications”.

OpenNMS BSM Problem Lists

This related to the major feature addition in Horizon 18 of the Business Service Monitor (BSM).

OpenNMS BSM OpenDaylight

As people move from treating servers as pets to treating them like cattle, the emphasis has shifted to understanding how well applications and microservices are running as a whole instead of focusing on individual devices. The BSM allows you to configure these services and then leverage all the usual OpenNMS crunchy goodness as you would a legacy service like HTTP running on a particular box. The above screenshot comes from some prototype work Jesse has been doing with integrating OpenNMS with OpenDaylight. As you can see at a glance, while the ICMP service is down on a particular device, the overall Network Fabric is still functioning perfectly.

Another thing I’m extremely proud of is the increase in the quality of documentation. Ronny and the rest of the documentation team are doing a great job, and we’ve made it a requirement that new features aren’t complete without documentation. Please check out the release notes as an example. It contains a pretty comprehensive lists of changes in 18.

A few I’d like to point out:

Horizon 17 is one of the most powerful and stable releases of OpenNMS ever, and we hope to continue that tradition with Horizon 18. Hats off to the team for such great work.

Here is a list of all the issues addressed in Horizon 18:

Release Notes – OpenNMS – Version 18.0.0

Bug

  • [NMS-3489] – "ADD NODE" produces "too much" config
  • [NMS-4845] – RrdUtils.createRRD log message is unclear
  • [NMS-5788] – model-importer.properties should be deprecated and removed
  • [NMS-5839] – Bring WaterfallExecutor logging on par with RunnableConsumerThreadPool
  • [NMS-5915] – The retry handler used with HttpClient is not going to do what we expect
  • [NMS-5970] – No HTML title on Topology Map
  • [NMS-6344] – provision.pl does not import requisitions with spaces in the name
  • [NMS-6549] – Eventd does not honor reloadDaemonConfig event
  • [NMS-6623] – Update JNA.jar library to support ARM based systems
  • [NMS-7263] – jaxb.properties not included in jar
  • [NMS-7471] – SNMP Plugin tests regularly failing
  • [NMS-7525] – ArrayOutOfBounds Exception in Topology Map when selecting bridge-port
  • [NMS-7582] – non RFC conform behaviour of SmtpMonitor
  • [NMS-7731] – Remote poller dies when trying to use the PageSequenceMonitor
  • [NMS-7763] – Bridge Data is not Collected on Cisco Nexus
  • [NMS-7792] – NPE in JmxRrdMigratorOffline
  • [NMS-7846] – Slow LinkdTopologyProvider/EnhancedLinkdTopologyProvider in bigger enviroments
  • [NMS-7871] – Enlinkd bridge discovery creates erroneous entries in the Bridge Forwarding Tables of unrelated switches when host is a kvm virtual host
  • [NMS-7872] – 303 See Other on requisitions response breaks the usage of the Requisitions ReST API
  • [NMS-7880] – Integration tests in org.opennms.core.test-api.karaf have incomplete dependencies
  • [NMS-7918] – Slow BridgeBridgeTopologie discovery with enlinkd.
  • [NMS-7922] – Null pointer exceptions with whitespace in requisition name
  • [NMS-7959] – Bouncycastle JARs break large-key crypto operations
  • [NMS-7967] – XML namespace locations are not set correctly for namespaces cm, and ext
  • [NMS-7975] – Rest API v2 returns http-404 (not found) for http-204 (no content) cases
  • [NMS-8003] – Topology-UI shows LLDP links not correct
  • [NMS-8018] – Vacuumd sends automation events before transaction is closed
  • [NMS-8056] – opennms-setup.karaf shouldn't try to start ActiveMQ
  • [NMS-8057] – Add the org.opennms.features.activemq.broker .xml and .cfg files to the Minion repo webapp
  • [NMS-8058] – Poll all interface w/o critical service is incorrect
  • [NMS-8072] – NullPointerException for NodeDiscoveryBridge
  • [NMS-8079] – The OnmsDaoContainer does not update its cache correctly, leading to a NumberFormatException
  • [NMS-8080] – VLAN name is not displayed
  • [NMS-8086] – Provisioning Requisitions with spaces in their name.
  • [NMS-8096] – JMX detector connection errors use wrong log level
  • [NMS-8098] – PageSequenceMonitor sometimes gives poor failure reasons
  • [NMS-8104] – init script checkXmlFiles() fails to pick up errors
  • [NMS-8116] – Heat map Alarms/Categories do not show all categories
  • [NMS-8118] – CXF returning 204 on NULL responses, rather than 404
  • [NMS-8125] – Memory leak when using Groovy + BSF
  • [NMS-8128] – NPE if provisioning requisition name has spaces
  • [NMS-8137] – OpenNMS incorrectly discovers VLANs
  • [NMS-8146] – "Show interfaces" link forgets the filters in some circumstances
  • [NMS-8167] – Cannot search by MAC address
  • [NMS-8168] – Vaadin Applications do not show OpenNMS favicon
  • [NMS-8189] – Wrong interface status color on node detail page
  • [NMS-8194] – Return an HTTP 303 for PUT/POST request on a ReST API is a bad practice
  • [NMS-8198] – Provisioning UI indication for changed nodes is too bright
  • [NMS-8208] – Upgrade maven-bundle-plugin to v3.0.1
  • [NMS-8214] – AlarmdIT.testPersistManyAlarmsAtOnce() test ordering issue?
  • [NMS-8215] – Chart servlet reloads Notifd config instead of Charts config
  • [NMS-8216] – Discovery config screen problems in latest code
  • [NMS-8221] – Operation "Refresh Now" and "Automatic Refresh" referesh the UI differently
  • [NMS-8224] – JasperReports measurements data-source step returning null
  • [NMS-8235] – Jaspersoft Studio cannot be used anymore to debug/create new reports
  • [NMS-8240] – Requisition synchronization is failing due to space in requisition name
  • [NMS-8248] – Many Rcsript (RScript) files in OPENNMS_DATA/tmp
  • [NMS-8257] – Test flapping: ForeignSourceRestServiceIT.testForeignSources()
  • [NMS-8272] – snmp4j does not process agent responses
  • [NMS-8273] – %post error when Minion host.key already exists
  • [NMS-8274] – All the defined Statsd's reports are being executed even if they are disabled.
  • [NMS-8277] – %post failure in opennms-minion-features-core: sed not found
  • [NMS-8293] – Config Tester Tool doesn't check some of the core configuration files
  • [NMS-8298] – Label of Vertex is too short in some cases
  • [NMS-8299] – Topology UI recenters even if Manual Layout is selected
  • [NMS-8300] – Center on Selection no longer works in STUI
  • [NMS-8301] – v2 Rest Services are deployed twice to the WEB-INF/lib directory
  • [NMS-8302] – Json deserialization throws "unknown property" exception due to usage of wrong Jax-rs Provider
  • [NMS-8304] – An error on threshd-configuration.xml breaks Collectd when reloading thresholds configuration
  • [NMS-8313] – Pan moving in Topology UI automatically recenters
  • [NMS-8314] – Weird zoom behavior in Topology UI using mouse wheel
  • [NMS-8320] – Ping is available for HTTP services
  • [NMS-8324] – Friendly name of an IP service is never shown in BSM
  • [NMS-8330] – Switching Topology Providers causes Exception
  • [NMS-8335] – Focal points are no longer persisted
  • [NMS-8337] – Non-existing resources or attributes break JasperReports when using the Measurements API
  • [NMS-8353] – Plugin Manager fails to load
  • [NMS-8361] – Incorrect documentation for org.opennms.newts.query.heartbeat
  • [NMS-8371] – The contents of the info panel should refresh when the vertices and edges are refreshed
  • [NMS-8373] – The placeholder {diffTime} is not supported by Backshift.
  • [NMS-8374] – The logic to find event definitions confuses the Event Translator when translating SNMP Traps
  • [NMS-8375] – License / copyright situation in release notes introduction needs simplifying
  • [NMS-8379] – Sluggish performance with Cassandra driver
  • [NMS-8383] – jmxconfiggenerator feature has unnecessary includes
  • [NMS-8386] – Requisitioning UI fails to load in modern browsers if used behind a proxy
  • [NMS-8388] – Document resources ReST service
  • [NMS-8389] – Heatmap is not showing
  • [NMS-8394] – NoSuchElement exception when loading the TopologyUI
  • [NMS-8395] – Logging improvements to Notifd
  • [NMS-8401] – There are errors on the graph definitions for OpenNMS JMX statistics
  • [NMS-8403] – Document styles of identifying nodes in resource IDs

Enhancement

  • [NMS-2504] – Create a better landing page for Configure Discovery aftermath
  • [NMS-4229] – Detect tables with Provisiond SNMP detector
  • [NMS-5077] – Allow other services to work with Path Outages other than ICMP
  • [NMS-5905] – Add ifAlias to bridge Link Interface Info
  • [NMS-5979] – Make the Provisioning Requisitions "Node Quick-Add" look pretty
  • [NMS-7123] – Expose SNMP4J 2.x noGetBulk and allowSnmpV2cInV1 capabilities
  • [NMS-7446] – Enhance Bridge Link Object Model
  • [NMS-7447] – Update BridgeTopology to use the new Object Model
  • [NMS-7448] – Update Bridge Topology Discovery Strategy
  • [NMS-7756] – Change icon for Dell PowerConnector switch
  • [NMS-7798] – Add Sonicwall Firewall Events
  • [NMS-7903] – Elasticsearch event and alarm forwarder
  • [NMS-7950] – Create an overview for the developers guide
  • [NMS-7965] – Add support for setting system properties via user supplied .properties files
  • [NMS-7976] – Merge OSGi Plugin Manager into Admin UI
  • [NMS-7980] – provide HTTPS Quicklaunch into node page
  • [NMS-8015] – Remove Dependencies on RXTX
  • [NMS-8041] – Refactor Enhanced Linkd Topology
  • [NMS-8044] – Provide link for Microsoft RDP connections
  • [NMS-8063] – Update asciidoc dependencies to latest 1.5.3
  • [NMS-8076] – Allow user to access local documentation from OpenNMS Jetty Webapp
  • [NMS-8077] – Add NetGear Prosafe Smart switch SNMP trap events and syslog events
  • [NMS-8092] – Add OpenWrt syslog and related event definitions
  • [NMS-8129] – Disallow restricted characters from foreign source and foreign ID
  • [NMS-8149] – Update asciidoctorj to 1.5.4 and asciidoctorjPdf to 1.5.0-alpha.11
  • [NMS-8152] – Collect and publish anonymous statistics to stats.opennms.org
  • [NMS-8160] – Remove Quick-Add node to avoid confusions and avoid breaking the ReST API
  • [NMS-8163] – Requisitions UI Enhancements
  • [NMS-8179] – ifIndex >= 2^31
  • [NMS-8182] – Add HTTPS as quick-link on the node page
  • [NMS-8205] – Generate events for alarm lifecycle changes
  • [NMS-8209] – Upgrade junit to v4.12
  • [NMS-8210] – Add support for calculating the derivative with a Measurements API Filter
  • [NMS-8211] – Add support for retrieving nodes with a filter expression via the ReST API
  • [NMS-8218] – External event source tweaks to admin guide
  • [NMS-8219] – Copyright bump on asciidoc docs
  • [NMS-8225] – Integrate the Minion container and packages into the mainline OpenNMS build
  • [NMS-8226] – Upgrade SNMP4J to version 2.4
  • [NMS-8238] – Topology providers should provide a description for display
  • [NMS-8251] – Parameterize product name in asciidoc docs
  • [NMS-8259] – Cleanup testdata in SnmpDetector tests
  • [NMS-8265] – SNMP collection systemDefs for Cisco ASA5525-X, ASA5515-X
  • [NMS-8266] – SNMP collection systemDefs for Juniper SRX210he2, SRX100h
  • [NMS-8267] – Create documentation for SNMP detector
  • [NMS-8271] – Enable correlation engines to register for all events
  • [NMS-8296] – Be able to re-order the policies on a requisition through the UI
  • [NMS-8334] – Implement org.opennms.timeseries.strategy=evaluate to facilitate the sizing process
  • [NMS-8336] – Set the required fields when not specified while adding events through ReST
  • [NMS-8349] – Update screenshots with 18 theme in user documentation
  • [NMS-8365] – Add metric counter for drop counts when the ring buffer is full
  • [NMS-8377] – Applying some organizational changes on the Requisitions UI (Grunt, JSHint, Dist)

Story

Task

  • [NMS-8236] – Move the "vaadin-extender-service" module to opennms code base

Welcome Ecuador (Country 29)

It is with mixed emotions that I get to announce that we now have a customer in Ecuador, our 29th country.

My emotions are mixed as my excitement at having a new customer in a new country is offset by the tragedy that country suffered recently. Everyone at OpenNMS is sending out our best thoughts and we hope things settle down (quite literally) soon.

They join the following countries:

Australia, Canada, Chile, China, Costa Rica, Denmark, Egypt, Finland, France, Germany, Honduras, India, Ireland, Israel, Italy, Japan, Malta, Mexico, The Netherlands, Portugal, Singapore, Spain, Sweden, Switzerland, Trinidad, the UAE, the UK and the US.

Agent Provocateur

I’ve been involved with the monitoring of computer networks for a long time, two decades actually, and I’m seeing an alarming trend. Every new monitoring application seems to be insisting on software agents. Basically, in order to get any value out of the application, you have to go out and install additional software on each server in your network.

Now there was a time when this was necessary. BMC Software made a lot of money with its PATROL series of agents, yet people hated them then as much as they hate agents now. Why? Well, first there was the cost, both in terms of licensing and in continuing to maintain them (upgrades, etc.). Next there was the fact that you had to add software to already overloaded systems. I can remember the first time the company I worked for back then deployed a PATROL agent on an Oracle database. When it was started up it took the database down as it slammed the system with requests. Which leads me to the final point, outside of security issues that arise with an increase in the number of applications running on a system, the moment the system experiences a problem the blame will fall on the agent.

Despite that, agents still seem to proliferate. In part I think it is political. Downloading and installing agents looks like useful work. “Hey, I’m busy monitoring the network with these here agents”. Also in part, it is laziness. I have never met a programmer who liked working on someone else’s code, so why not come up with a proprietary protocol and write agents to implement it?

But what bothers me the most is that it is so unnecessary. The information you need for monitoring, with the possible exception of Windows, is already there. Modern operating systems (again, with the exception of Windows) ship with an SNMP agent, usually based on Net-SNMP. This is a secure, powerful extensible agent that has been tried and tested for many years, and it is maintained directly on server itself. You can use SNMPv3 for secure communications, and the “extend” and “pass” directives to make it easy to customize.

Heck, even Windows ships with an extensible SNMP agent, and you can also access data via WMI and PowerShell.

But what about applications? Don’t you need an agent for that?

Not really. Modern applications tend to have an API, usually based on ReST, that can be queried by a management station for important information. Java applications support JMX, databases support ODBC, and when all that fails you can usually use good ol’ HTTP to query the application directly. And the best part is that the application itself can be written to guard against a monitoring query causing undue load on the system.

At OpenNMS we work with a lot of large customers, and they are loathe to install new software on all of their servers. Plus, many of our customers have devices that can’t support additional agents, such as routers and switches, and IoT devices such as thermostats and door locks. This is the main reason why the OpenNMS monitoring platform is, by design, agentless.

A critic might point out that OpenNMS does have an agent in the remote poller, as well as in the upcoming Minion feature set. True, but those act as “user agents”, giving OpenNMS a view into networks as if it was a user of those networks. The software is not installed on every server but instead it just needs the same access as a user would have. So, it can be installed on an existing system or on a small system purchased for that purpose, at a minimum just one for each network to be monitored.

While some new IT fields may require agents, most successful solutions try to avoid them. Even in newer fields such as IT automation, the best solutions are agentless. They are not necessary, and I strongly suggest that anyone who is asked to install an agent for monitoring question that requirement.

OpenNMS is Sweet Sixteen

It was sixteen years ago today that the first code for OpenNMS was published on Sourceforge. While the project was started in the summer of 1999, no one seems to remember the exact date, so we use March 30th to mark the birthday of the OpenNMS project.

OpenNMS Project Details

While I’ve been closely associated with OpenNMS for a very long time, I didn’t start it. It was started by Steve Giles, Luke Rindfuss and Brian Weaver. They were soon joined by Shane O’Donnell, and while none of them are associated with the project today, they are the reason it exists.

Their company was called Oculan, and I joined them in 2001. They built management appliances marketed as “purple boxes” based on OpenNMS and I was brought on to build a business around just the OpenNMS piece of the solution.

As far as I know, this is the only surviving picture of most of the original team, taken at the OpenNMS 1.0 Release party:

OpenNMS 1.0 Release Team

In 2002 Oculan decided to close source all future work on their product, thus ending their involvement with OpenNMS. I saw the potential, so I talked with Steve Giles and soon left the company to become the OpenNMS project maintainer. When it comes to writing code I am very poorly suited to the job, but my one true talent is getting great people to work with me, and judging by the quality of people involved in OpenNMS, it is almost a superpower.

I worked out of my house and helped maintain the community mainly through the #opennms IRC channel on freenode, and surprisingly the project managed not only to survive, but to grow. When I found out that Steve Giles was leaving Oculan, I applied to be their new CEO, which I’ve been told was the source of a lot of humor among the executives. The man they hired had a track record of snuffing out all potential from a number of startups, but he had the proper credentials that VCs seem to like so he got the job. I have to admit to a bit of schadenfreude when Oculan closed its doors in 2004.

But on a good note, if you look at the two guys in the above picture right next to the cake, Seth Leger and Ben Reed, they still work for OpenNMS today. We’re still here. In fact we have the greatest team I’ve every worked with in my life, and the OpenNMS project has grown tremendously in the last 18 months. This July we’ll have our eleventh (!) annual developers conference, Dev-Jam, which will bring together people dedicated to OpenNMS, both old and new, for a week of hacking and camaraderie.

Our goal is nothing short of making OpenNMS the de facto management platform of choice for everyone, and while we still have a long way to go, we keep getting closer. My heartfelt thanks go out to everyone who made OpenNMS possible, and I look forward to writing many more of these notes in the future.

OpenNMS Horizon 17.1.1 Released

Probably the last Horizon 17 version, 17.1.1, has been released. According to TWiO, the next release will be Horizon 18 at the end of the month, with Horizon 19 following at the end of May.

This release is mainly a maintenance release. It does contain one fix I used (NMS-8199), which allows for the state names in the Jira Trouble Ticketing plugin to be configured. This helps a lot if Jira is not in English.

If you are running Horizon 17, this should help it run a bit smoother.

Bug

  • [NMS-7936] – Chart Servlet Outages model exception
  • [NMS-8010] – Groups config rolled back after deleting a user in web UI
  • [NMS-8034] – Adding com.sun.management.jmxremote.authenticate=true on opennms.conf is ignored by the opennms script
  • [NMS-8048] – org.hibernate.exception.SQLGrammarException with ACLs on V17
  • [NMS-8075] – vacuumd-configuration.xml — Database error executing statement
  • [NMS-8113] – Overview about major releases in the release notes
  • [NMS-8153] – Can't modify the Foreign ID on the Requisitions UI when adding a new node
  • [NMS-8159] – When altering the SNMP Trap NBI config, the externally referenced mapping groups are persisted into the main file.
  • [NMS-8161] – Tooltips are not working on the new Requisitions UI
  • [NMS-8165] – OutageDao ACL support is broken causing web UI failures
  • [NMS-8177] – Install guide should use postgres admin for schema updates
  • [NMS-8199] – Allows state names to be configured in the JIRA Ticketer Plugin

Enhancement

  • [NMS-6404] – Allow send events through ReST
  • [NMS-8148] – Create pull request and contribution template to GitHub project

Task

  • [NMS-8151] – Remove all jersey artifacts from lib classpath

Speeding Up OpenNMS Requisition Imports

One thing that differentiates OpenNMS from other applications is the strong focus on tools for provisioning the system. If you want to monitor hundred of thousands of devices, to ultimately millions, the ordinary methods just don’t work.

Users of OpenNMS often create large requisitions from external database sources, and sometimes it can take awhile for the import to complete. Delays can happen if the Foreign Source used for the requisition has a large number of service detectors that won’t exist on most devices.

For example, the default Foreign Source for Horizon 17 has about 15 detectors. Of those, only about 4 will exist on networking equipment (ICMP, SSH, HTTP and HTTPS). When scanning, this can add a lot of time per interface. Assuming 2 retries and a 3 second timeout, that would be 9 seconds for each non-existent service. With just 1000 interfaces, that’s 99000 seconds (9 seconds x 11 services x 1000 interfaces) of time just spent waiting, which translates to 27.5 hours.

Now, granted, the importer has multiple threads so the actual wait time will be less, but you can see how this can impact the time needed to import a requisition. This can be reduced significantly by tuning service detection to the bare minimum needed and perhaps adding other services later on a per device basis without scanning.

Review: System 76 Wild Dog Pro

Recently, I was trying to work on the desktop at my office and it kept rebooting. Nothing in the logs nor anything to indicate that there might be a software issue, so I just assumed that my now 5 year old machine was probably at its end of life.

Without hesitation I decided to order a new desktop from System 76. I really liked the Sables we bought from them last year, so I figured it would be simple to order a Linux-compatible machine from them.

I went to their “Desktops” page and without much thought decided on the “Wild Dog Pro“. I don’t have huge requirements, so the big monster with wheels the “Silverback” (probably named after the gorilla) was right out. I picked more of a middle of the road machine with the following specs:

  • Case: Black brushed aluminum
  • CPU: 4.2 GHz i7-6700K (4.0 up to 4.2 GHz – 8MB Cache – 4 Cores – 8 Threads) + Liquid Cooling
  • Memory: 32 GB Dual-channel DDR4 at 2133 MHz (4× 8 GB)
  • Graphics: 2 GB GTX 960 Superclocked with 1024 CUDA Cores
  • Storage: 1 TB 2.5″ Solid State Drive
  • Dual Layer CD-RW / DVD-RW
  • WiFi up to 867 Mbps + Bluetooth
  • 3 Year Limited Parts and Labor Warranty

I also ordered two day shipping, since I thought I would need it fast.

I got an order confirmation almost immediately with an estimate of 2 to 6 days to ship. Soon after that I got a note stating that the Wild Dog was running toward the latter end of that range. I figured I could just use my laptop until the new machine arrived if necessary, and I waited.

While I was waiting, I still continued to use my old desktop. I noticed the rebooting issue happened toward the end of the day. It finally dawned on me (I’m a little thick) that it might be heat related. I crawled under the desk to find that the power supply fan wasn’t working. I ordered a new one of those to see if it would help.

Since the new power supply arrived before the Wild Dog shipped, and it fixed my issue, I contacted System 76 to see if I could change the shipping from “speedy” to something more like “camel”. They were happy to do it and refunded the difference in price.

Anyway, the new machine finally arrived (I ordered this on 29 January and got it on 16 February – a little slow but faster than Lenovo and Dell have been for me in the past). It showed up in a standard brown box:

Wild Dog Pro Box

The unit was minimally packaged inside (which I like):

Wild Dog Pro Box Open

Pretty case with a minimalist look:

Wild Dog Pro Front

with all of the “business” being on the back:

Wild Dog Pro Back

This has USB 2.0, USB 3.0 and USB 3.1 ports, including a USB-C connector should you be into such things.

I like the case, but they tape a letter on the top that, when removed, you can still see the marks left by the tape. I haven’t hit this with goo cleaner since it is going under my desk, but it did detract from the overall look of the unit. The letter contained a welcome note and some stickers, as well as a little cut-out dude called the “Desktop Sentinel” and named “M3lvin”. Not quite sure what that is, and a quick Google search turned up nothing.

Wild Dog Pro Letter

Of course, the first thing I did was open it up. The case is nice, although I’ve grown used to captive screws to remove side panels and was surprised when the two I took off came completely off. The system is well laid out inside with room for expansion (I wanted to put in a backup SATA drive to go with the SSD).

Wild Dog Pro Right

Wild Dog Pro Left

I can’t tell you much about the performance. It seems plenty fast, and I downloaded the test suites from Phoronix but just didn’t have the hours to run them for benchmarks. While it ships with Ubuntu 15.10, I’m a Linuxmint guy so I immediately went to install Mint on the machine.

This was harder than I thought it would be. I could not get the BIOS to boot off of the USB stick no matter what I tried (it saw the stick in the boot menu but wouldn’t boot to it). I ended up burning the image to DVD and, while slower, worked fine.

Then it dawned on me that they probably shipped with Ubuntu 15.10 because it has one of those fancy “Skylake” processors which benefits from later kernels. Luckily I had run into this with my Dell laptop, so I installed gcc-4.9 and the 4.4 kernel and everything worked but the wireless card. Turns out you need to install the latest ndiswrapper and you’ll be good to go.

Needless to say I’m eager for Mint 18 to come out with support for the later kernels.

Overall, I’m happy with my purchase. There is room for improvement on the speed of producing it and shipping it out, but my decision to use Mint was totally on me. I look forward to getting many hours of use out of this machine.