During the first week of my co-op at TrustPoint Innovation, I was tasked with improving our monitoring solution. The existing monitoring project was a comprehensive, multi-threaded NodeJS application, complete with graphing and email notifications. It could also be configured entirely through a web interface, and it stored log data in a MongoDB database.
To Nagios or not to Nagios
After making a number of improvements to the project, I was made aware of an open source monitoring service called “Nagios”. At first I was worried about replacing a project that so much time and effort had gone into; however, after thinking it over, I realized that I had learned a lot about Docker containers, debugging with node inspector, and multi-threading in NodeJS, and had received some useful feedback during code reviews. All of this made me a better developer, and I came to see that the time spent would not be wasted even if a better solution existed.
After a little research, I found that Nagios could bring a lot to the table. It has a very straightforward plugin system for tracking all kinds of services that a machine can provide, as well as a comprehensive notification and escalation system that first notifies the people who most need to know and, if the problem persists, escalates to someone else.
With Nagios, any executable file can be used to test a service. Nagios refers to these files as “plugins”, and creating your own is as easy as printing some text that describes the state of the service, for example “OK - PING 0% packet loss” or “WARNING - 30% packet loss”, and then exiting with a code that represents that state: 0 for “OK”, 1 for “WARNING”, or 2 for “CRITICAL”.
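To make that contract concrete, here is a minimal sketch of a plugin in Python. It is not the plugin we actually deployed: `measure_packet_loss` is a hypothetical placeholder for a real measurement, and the function names are my own.

```python
import sys

# Exit codes that Nagios maps to service states.
OK, WARNING, CRITICAL = 0, 1, 2
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def measure_packet_loss():
    """Hypothetical stand-in for a real measurement; returns percent lost."""
    return 0.0


def check(loss_percent):
    """Return (exit_code, status_line) for a packet-loss percentage."""
    # Simplest possible rule for this sketch: any loss at all is a warning.
    state = OK if loss_percent == 0 else WARNING
    line = f"{STATE_NAMES[state]} - {loss_percent:.0f}% packet loss"
    return state, line


def main():
    """Print the status line, then exit with the matching state code."""
    state, line = check(measure_packet_loss())
    print(line)
    sys.exit(state)
```

A script installed as a plugin would call `main()` as its last line, so that Nagios sees both the printed text and the exit code.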
Once your plugin can do this, you can extend it to take arguments that change how these states are defined. For example, you may decide that 5% packet loss constitutes “CRITICAL” for one service, while another service on a wireless network with a lot of interference should tolerate up to 30% packet loss and remain “OK”. Your plugin can also return an additional string of performance data for generating graphs. The Nagios Plugin Development Guidelines are a fantastic resource for designing and developing your plugins.
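As a rough sketch of both ideas, the Python below reads warning and critical thresholds from the command line and appends performance data after a `|` character in the `label=value;warn;crit;min;max` form described in the plugin guidelines. The `-w`/`-c` flags follow the usual plugin convention; the default values and function names are assumptions made for this example.

```python
import argparse

OK, WARNING, CRITICAL = 0, 1, 2
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def parse_args(argv=None):
    """Parse thresholds; the defaults here are arbitrary example values."""
    parser = argparse.ArgumentParser(description="Packet-loss check (sketch)")
    parser.add_argument("-w", "--warning", type=float, default=5.0,
                        help="packet loss (percent) at which to warn")
    parser.add_argument("-c", "--critical", type=float, default=30.0,
                        help="packet loss (percent) treated as critical")
    return parser.parse_args(argv)


def evaluate(loss, warning, critical):
    """Return (exit_code, output_line) for a measured loss percentage."""
    if loss >= critical:
        state = CRITICAL
    elif loss >= warning:
        state = WARNING
    else:
        state = OK
    # Performance data: label=value[unit];warn;crit;min;max
    perfdata = f"loss={loss:g}%;{warning:g};{critical:g};0;100"
    return state, f"{STATE_NAMES[state]} - {loss:g}% packet loss | {perfdata}"
```

With `-w 2 -c 5`, a measured loss of 5% comes back as CRITICAL, while a service checked with `-w 31 -c 50` stays OK at 30% loss.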
We had considered adding support for UPS monitoring in our current solution; however, it would have required significant modification. With Nagios' plugin system it was easy to write a plugin in Python that accomplishes this.
Our current solution sends a message to all developers for any failed test, regardless of the time of day or whose responsibility that service is. This isn't just a nuisance; it could be a real problem if a genuine issue's notification were missed because it faded into a background of false positives.
Nagios can send notifications to a specific person (or group of people) first, giving them the chance to acknowledge the problem or fix it. If a set number of notifications go out without the problem being acknowledged in the web portal, Nagios can escalate the issue to a different user or group. This is fantastic for someone who is responsible for a large number of services.
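The escalation behaviour can be modelled in a few lines of Python. This is a simplified sketch of the idea, not Nagios' implementation: the field names mirror the `first_notification` and `last_notification` directives of a Nagios escalation definition, the contact names are made up, and I am assuming that a matching escalation replaces the default contacts and that acknowledged problems send nothing further.

```python
from dataclasses import dataclass


@dataclass
class Escalation:
    first_notification: int  # first notification number this rule applies to
    last_notification: int   # 0 means "no upper limit"
    contacts: tuple


def recipients(notification_number, acknowledged, default_contacts, escalations):
    """Return who should receive the Nth notification for a problem."""
    if acknowledged:
        # Assumption: an acknowledged problem stops further notifications.
        return ()
    matched = [
        e for e in escalations
        if e.first_notification <= notification_number
        and (e.last_notification == 0
             or notification_number <= e.last_notification)
    ]
    if not matched:
        return tuple(default_contacts)
    # Collect contacts from every matching rule, without duplicates.
    result = []
    for rule in matched:
        for contact in rule.contacts:
            if contact not in result:
                result.append(contact)
    return tuple(result)
```

For example, with a single rule `Escalation(3, 0, ("team-lead",))`, the first two notifications go to the default contact and the third goes to the team lead, unless someone acknowledges the problem first.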
Nagios can also take defined work hours or holidays into consideration before sending alerts. For a service that is not mission critical, like a build server, alerts may not need to be sent right away; however, for a UPS we could choose to be notified 24/7 if it were to switch to battery power.
Sometimes network issues can cause a number of services to be unreachable, and it would be incorrect to log those services as “down”. Not only would this spoil the data used to generate reports, it would also make identifying the root of a problem difficult when you are faced with a massive number of failed services. Nagios lets us handle these situations by recording the parent hosts through which we reach each device. This way, when the switch that lets us talk to a web server, or the Pi that is monitoring our UPS hardware, goes down, the devices on the other side are listed in a special “UNREACHABLE” state. A quick look at Nagios' network map would reveal the source of such a problem.
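The distinction between “down” and “unreachable” can be illustrated with a small Python sketch. It is a simplified model of the idea rather than Nagios' actual check logic, and the host names are hypothetical: a host that does not respond counts as DOWN only if it has no parents or at least one parent is up; otherwise it is UNREACHABLE.

```python
def classify(host, parents, responds):
    """Classify a host as "UP", "DOWN", or "UNREACHABLE".

    parents:  dict mapping a host to the list of hosts it is reached through
    responds: dict mapping a host to True if its own check succeeded
    """
    if responds[host]:
        return "UP"
    host_parents = parents.get(host, [])
    if not host_parents:
        # Nothing sits between the monitor and this host, so it is really down.
        return "DOWN"
    if any(classify(p, parents, responds) == "UP" for p in host_parents):
        # The path to the host works, so the fault is on the host itself.
        return "DOWN"
    # Every route to the host is broken; we cannot tell what state it is in.
    return "UNREACHABLE"


# Hypothetical topology: the web server and the Pi both sit behind one switch.
parents = {"web-server": ["switch"], "pi": ["switch"]}
```

If the switch stops responding, only the switch is reported as DOWN and everything behind it becomes UNREACHABLE, which points straight at the root cause.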
Unlike our current solution, Nagios cannot be configured through its default web interface. We decided that this was not really a problem because changes would rarely need to be made. In addition, our build script can update the configuration of a live container, which is useful because restarting Nagios introduces undesirable artifacts into the event log.
Another apparent downside was that we would need to manage new credentials for Nagios. The web interface that comes with Nagios 3 uses HTTP Basic Authentication to sign in, which sends usernames and passwords in the clear unless the connection is encrypted. While I was trying to think of a better way, my co-worker Mark mentioned during a daily stand-up meeting that he had implemented OAuth authentication in another project.
This inspired me to research implementing OAuth in Nagios, and I quickly came across oauth2_proxy, a project by bitly. It authenticates a user before forwarding them to a specified upstream web address. I was able to quickly configure it to be the only way of reaching the Nagios 3 web interface and to log in on a user's behalf.
It would appear that Nagios surpasses our current solution in nearly every way, and it also gave us some features that we didn't even know we wanted! We are now running both Nagios and our own service until we are confident that Nagios meets all of our needs. I am confident that this is going to be a useful tool for us here at TrustPoint Innovation. If you don't already have a monitoring solution, or if you find that the one you have today isn't easily extensible, I would highly recommend that you try Nagios.