System monitoring: summoning the beast of a thousand eyes
Yanick Champoux (@yenzie)
July 31st, 2018
System monitoring. A pretty vital part of any network management. That is, unless you’re one of the few who live for the visceral thrill of flying blind. For the rest of us partial to our lack of heart conditions, an ounce of prevention is worth ten thousand gallons of Saturday morning intervention.
In this blog entry, I’ll go through the exercise of putting together a simple but working and easily extensible system monitoring setup leveraging common pieces of technology.
setting up the pieces
Before anything else, it bears mentioning that no two infrastructures are alike, and that monitoring is not a one-size-fits-all problem space. Network scale can range from massive server farms to a handful of machines. Those machines can all be on the same operating system, or can be wildly heterogeneous. They can be cloud-based virtual machines, or beige physicalities warming up a basement. We might be interested in the usual metrics (disk and CPU usage, bandwidth, etc.), or maybe something more exotic.
The system presented in this article is based on my private network. This means:
- a network of very modest scale and pretty homogeneous make;
- I’m interested in some of the standard metrics, but also in unconventional, tailor-suited measurements;
- I love setting up software solutions for fun, but there is a limit on my free time — the system must be fairly quick to set up and low-maintenance;
- total budget for the whole system: firmly nailed to zilch.
Put more succinctly, “frugality” is the order of the day.
With those in mind, I rummaged in my toolbox and settled on a three-part solution.
For the backend, I’ll use the time-series database InfluxDB. Implemented in Go, the installation of the database can literally be: download, unarchive, run. That’s pretty much as pain-free as software installation can get.
Dealing with the database — via its CLI client or REST interface — is also a low hurdle; the language it uses is very close to dear ubiquitous SQL, and the creation of new metrics/time-series happens automagically when new measurements are pushed. And while the commercially-supported edition of InfluxDB is geared toward big data with database replication, sharding, and flexible retention policies, the free standalone version is perfectly fine for my diminutive, non-mission-critical group of machines.
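To give an idea of how SQL-esque its query language is, here is the kind of query one could type in the influx CLI client (assuming an oculi database already populated with the webpage series we’ll build later in this article):

$ influx -database oculi -execute \
    'SELECT MEAN("response_time") FROM "webpage" WHERE time > now() - 1d GROUP BY time(1h)'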
On top of its own qualities, InfluxDB also has a little surrounding ecosystem providing bonus value: there is Telegraf (a metric-collecting agent), Chronograf (a data visualizer), and Kapacitor (an alert manager).
Telegraf is well worth looking into: it comes with a vast array of plugins to monitor disk/CPU/bandwidth usage, and can gather metrics for most databases and common services like mail servers. And whatever it doesn’t have out of the box can be added via an HTTP interface, or by running external programs.
Chronograf has made tremendous progress in the last year. But I personally prefer and use Grafana for graphing the time-series. It supports InfluxDB out of the box, is more mature, has a much larger user-base, more plugins, and, frankly, looks prettier.
For the purpose of this article, this is all I’m going to say about the storing and the visualization of the measurements. Both because there is not much to be said about them beyond “install’em and run’em”, and because they are fairly interchangeable pieces of the solution: InfluxDB and Grafana could easily be replaced by Prometheus without much change to the beating core of the system, which is the collection of metrics.
For that gathering of metrics, I’ll bring together a few choice Perl modules and create a (mini-)framework abstracting the minutiae around the collection, formatting, and pushing of metrics, making the work required to implement new metrics as minimal as possible.
invocation spells
Before diving into the how, let me show you what endgame we’re gunning for.
What we’ll have is a metric collection program named oculi, and each metric we’ll want to collect will be implemented as a sub-command (à la git). For now, let’s take the example of a metric monitoring the health of a webpage.
As a base invocation, we’ll be able to pass the configuration parameters to the metric from the command line and get back the measurements.
$ oculi webpage --url http://techblog.babyl.ca --content Hacking
webpage,url=http://techblog.babyl.ca is_live=TRUE,response_time=0.163129,status=200i 1527525733776843000
The measurement output is in the format that InfluxDB understands. If we want to record it, we could pipe it to a curl command hitting the right REST endpoint. But that’s onerous. So we’ll be able to do it directly from oculi instead:
$ oculi webpage --url=http://techblog.babyl.ca --content=Hacking --influxdb=oculi
webpage,url=http://techblog.babyl.ca is_live=TRUE,response_time=0.177384,status=200i 1527525897821353000
sending data to http://localhost:8086/write?db=oculi...
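(Incidentally, the curl incantation that --influxdb spares us would look something like this, assuming a local InfluxDB instance listening on its default port:)

$ oculi webpage --url http://techblog.babyl.ca --content Hacking \
    | curl -XPOST 'http://localhost:8086/write?db=oculi' --data-binary @-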
Passing parameters via the command line is excellent for exploration, but less so once we want to set up checks. Once we’re at that stage, we’ll be able to use configuration files as well.
$ cat checks/website/techblog.yml
---
webpage:
    url: http://techblog.babyl.ca
    content: [ Hacking ]

$ oculi webpage --config checks/website/techblog.yml
webpage,url=http://techblog.babyl.ca is_live=TRUE,response_time=0.152118,status=200i 1527526098328207000
In fact, we’ll also be able to just pass the configuration file to oculi and let it figure out the rest.
$ oculi run checks/website/techblog.yml
webpage,url=http://techblog.babyl.ca is_live=TRUE,response_time=0.15781,status=200i 1527526197279020000
For maximal ease and ad-hoc tweaks, it’ll be possible to use the configuration file as a baseline and override some of the parameters on the command line.
$ oculi run checks/website/techblog.yml --influxdb=oculi
webpage,url=http://techblog.babyl.ca is_live=TRUE,response_time=0.15781,status=200i 1527526197279020000
sending data to http://localhost:8086/write?db=oculi...
And just because it’s no fun until it gets ludicrous, we’ll make it possible to invoke the configuration file as its own program.
$ cat ./checks/website/techblog.yml
#!/usr/bin/env oculi
---
webpage:
    url: http://techblog.babyl.ca
    content: [ Hacking ]
$ ./checks/website/techblog.yml
webpage,url=http://techblog.babyl.ca is_live=TRUE,response_time=8e-06,status=200i 1527526371742747000
implementing a simple metric
So, how would one implement an oculi metric? Let’s walk through it, using the webpage healthcheck example of the previous section.
Simply put, a metric is expected to accept configuration arguments, and output measurements. With InfluxDB, measurements have four components: the metric name (“webpage”), tags providing details on the observed system (e.g., the url of the webpage), one or more measured values (e.g., is_live, response_time, status), and of course a timestamp.
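Mapped onto the line protocol output of the previous section, that gives us:

webpage                                           <- metric name
url=http://techblog.babyl.ca                      <- tag(s)
is_live=TRUE,response_time=0.163129,status=200i   <- field(s)
1527525733776843000                               <- timestamp (nanoseconds)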
In our example we have two configuration elements: the url of the webpage (which is also a measurement tag), and an optional list of regular expressions we want to assert against the content of the page. Code-wise, this translates to:
has option tag url => (
    documentation => 'url of the monitored webpage',
    required      => 1,
);

has option content => (
    documentation => 'list of regular expressions to be found in the page',
    isa           => 'ArrayRef',
    lazy          => sub { [] },
);
Next in line is the name of the metric. In most cases, the name of the Perl module implementing it is going to match the metric name, so by default giving a relevant name to the module is going to be sufficient:
package App::Oculi::Metric::Webpage;
Now the outputs. Just like for the inputs, we declare attributes labelled as fields. How to find them, and what to do with them, is a job the main program will address later on.
has field status => sub($self) {
    $self->response->{status};
};

has field is_live => sub($self) {
    $self->error ? 'FALSE' : 'TRUE';
};

has field response_time => sub($self) {
    $self->timer->elapsed;
};

has field error => sub($self) {
    my $response = $self->response;

    return 'GET failed' unless $response->{success};

    for my $re ( $self->content->@* ) {
        return "body doesn't match $re" unless $response->{content} =~ /$re/;
    }

    return;
};
Of course, we also need to implement the actual querying of the webpage.
use HTTP::Tiny;
use Timer::Simple;

has ro agent => sub { HTTP::Tiny->new };
has ro timer => sub { Timer::Simple->new };

# make sure the page has been fetched (and hence timed)
# before we read the elapsed time
before response_time => sub { $_[0]->response };

has ro response => sub ($self) {
    $self->timer;   # instantiating the timer starts it
    $self->agent->get( $self->url );
};
Minus some boilerplate, that’s the entirety of the metric module:
package App::Oculi::Metric::Webpage;

use 5.20.0;
use warnings;

use HTTP::Tiny;
use Timer::Simple;

use App::Oculi::Has qw/ ro tag option field /;

use Moose;
extends 'App::Oculi::Metric';

use experimental 'signatures', 'postderef';

has option tag url => (
    documentation => 'url of the monitored webpage',
    required      => 1,
);

has option content => (
    documentation => 'list of regular expressions to be found in the page',
    isa           => 'ArrayRef',
    lazy          => sub { [] },
);

has field status => sub($self) {
    $self->response->{status};
};

has field is_live => sub($self) {
    $self->error ? 'FALSE' : 'TRUE';
};

has field response_time => sub($self) {
    $self->timer->elapsed;
};

has field error => sub($self) {
    my $response = $self->response;

    return 'GET failed' unless $response->{success};

    for my $re ( $self->content->@* ) {
        return "body doesn't match $re" unless $response->{content} =~ /$re/;
    }

    return;
};

has ro agent => sub { HTTP::Tiny->new };
has ro timer => sub { Timer::Simple->new };

before response_time => sub { $_[0]->response };

has ro response => sub ($self) {
    $self->timer;
    $self->agent->get( $self->url );
};

1;
a metric with several measurements
Often a single metric invocation will generate more than one set of measurements. For example, monitoring a printer could generate one set of measurements for each color cartridge it has.
To cater to those cases, oculi allows a module to return a list of tags/fields pairs.
For that printer example, it would look like:
package App::Oculi::Metric::PrinterInk;

use 5.20.0;
use warnings;

use Web::Query;
use List::Util qw/ pairmap /;

use App::Oculi::Has qw/ ro field fields tag option /;

use Moose;
extends 'App::Oculi::Metric';

use experimental 'signatures', 'postderef';

has option tag printer => (
    required      => 1,
    documentation => 'name of the printer',
);

has option address => (
    lazy          => sub($self) { $self->printer },
    documentation =>
        "address of the printer, defaults to the 'printer' value",
);

has ro url => sub($self) {
    sprintf "http://%s/general/status.html", $self->address;
};

has ro toner_levels => sub($self) {
    no warnings;  # height = 88px so =/= numerical

    return +{
        wq( $self->url )
            ->find( 'img.tonerremain' )
            ->map(sub {
                lc( $_->attr('alt') ) =>
                    sprintf "%.2f", $_->attr('height') / 55,
            })->@*
    };
};

has '+entries' => default => sub ($self) {
    return [
        pairmap { [ { color => $a } => { level => $b } ] }
            $self->toner_levels->%*
    ];
};

1;
When running, this metric produces one measurement line per color, with each line having the merged set of metric-wide and measurement-specific tags:
$ oculi printer_ink --config checks/printer.yml
printer_ink,color=magenta,printer=nibada level=0.71 1527530252911374000
printer_ink,color=black,printer=nibada level=0.53 1527530252911410000
printer_ink,color=yellow,printer=nibada level=0.16 1527530252911430000
printer_ink,color=cyan,printer=nibada level=0.00 1527530252911448000
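For clarity, here is (roughly) the entries structure that the pairmap built for that run, each entry pairing its measurement-specific tags with its fields:

[
    [ { color => 'magenta' } => { level => '0.71' } ],
    [ { color => 'black'   } => { level => '0.53' } ],
    [ { color => 'yellow'  } => { level => '0.16' } ],
    [ { color => 'cyan'    } => { level => '0.00' } ],
]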
peeking under the hood
Reinventing the wheel is long, tedious, error-prone, and often a display of hubris over practicality. Which is why oculi, for all the mechanics related to running it as a command-line app, sits on the firm base provided by MooseX::App.
Indeed, the main App::Oculi module is less than 10 lines long:
package App::Oculi;

use MooseX::App qw/ Config /;

app_namespace 'App::Oculi::Metric';

# empty marker roles used to earmark the metrics' attributes
package App::Oculi::Trait::Tag    { use Moose::Role }
package App::Oculi::Trait::Field  { use Moose::Role }
package App::Oculi::Trait::Fields { use Moose::Role }

1;
The oculi script itself is only slightly more complex, and that purely because I wanted to impress the crowd with the run-the-config-as-a-script gimmick:
#!/usr/bin/env perl

use File::Serialize;
use App::Oculi;

use experimental 'postderef';

# invoked as the interpreter of a config file?
# then it's an implicit 'run'
unshift @ARGV, 'run' if -f $ARGV[0];

my %defaults;

if( $ARGV[0] eq 'run' ) {
    my $file;
    ( undef, $file, @ARGV ) = @ARGV;

    # the config holds a single key (the metric name),
    # with the metric's parameters as its value
    my( $check, $defaults ) = deserialize_file($file)->%*;

    unshift @ARGV, $check;
    %defaults = %$defaults;
}

App::Oculi->new_with_command(%defaults)->run;
The brevity of those two files belies the number of features they bring to the table. Not only will the app find all available metric modules by itself and know how to parse their configuration from the command line and/or configuration files, but help menus and basic documentation are also generated without any extra effort on our part:
$ oculi --help
usage:
    oculi <command> [long options...]
    oculi help
    oculi <command> --help

global options:
    --config              Path to command config file
    --help -h --usage -?  Prints this usage information. [Flag]

available commands:
    disk_usage
    email_backlog
    help            Prints this usage information
    printer_ink
    taskwarrior
    webpage

$ oculi printer_ink --help
usage:
    oculi printer_ink [long options...]
    oculi help
    oculi printer_ink --help

options:
    --address             address of the printer, defaults to the 'printer'
                          value
    --config              Path to command config file
    --help -h --usage -?  Prints this usage information. [Flag]
    --influxdb            push the data to this influx instance
    --printer             name of the printer [Required]
the metric magic
Once the command line mechanics are delegated out of our way, what is left is to collect measurement tags and fields, and format them to be consumable by InfluxDB.
For that gathering of tags and fields, we use Moose’s introspection. Class introspection and meta-programming are often seen as scary, but in this instance we only dabble in minor sorcery. In the previous sections, we had attributes declared as:
has option tag url => (
    documentation => 'url of the monitored webpage',
    required      => 1,
);
What I didn’t point out then is that the option and tag keywords are helper functions imported from App::Oculi::Has. They are DSL sugar that allows us to write the class’ attributes in a shorter, more expressive way. Without them, we would declare the attribute as:
has url => (
    is     => 'ro',
    traits => [ qw/
        App::Oculi::Trait::Tag
        MooseX::App::Meta::Role::Attribute::Option
    /],
    documentation => 'url of the monitored webpage',
    required      => 1,
);
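Presumably, the field keyword desugars along the same lines; here is a sketch of what has field status => sub {...} could expand to (an assumption on my part, not the actual App::Oculi::Has output):

has status => (
    is      => 'ro',
    traits  => [ 'App::Oculi::Trait::Field' ],
    lazy    => 1,
    default => sub($self) { $self->response->{status} },
);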
With the attributes properly earmarked via those traits, gathering them is only a question of iterating over all attributes and picking the ones we want.
has tags => (
    is     => 'rw',
    traits => [ 'Hash' ],
    lazy   => 1,
    default => sub($self) {
        return +{
            map  { $_->name => $_->get_value($self) }
            grep { $_->does('App::Oculi::Trait::Tag') }
                 $self->meta->get_all_attributes
        }
    },
    handles => { all_tags => 'elements' },
);
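The fields are gathered the same way; a sketch of the twin accessor, swapping one trait for the other (an approximation, not the verbatim App::Oculi code):

has fields => (
    is     => 'rw',
    traits => [ 'Hash' ],
    lazy   => 1,
    default => sub($self) {
        return +{
            map  { $_->name => $_->get_value($self) }
            grep { $_->does('App::Oculi::Trait::Field') }
                 $self->meta->get_all_attributes
        }
    },
    handles => { all_fields => 'elements' },
);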
The real code crosses a few more Ts and dots a handful of Is (and can be perused here), but for the bits that really matter, that’s it.
running the checks
My own needs don’t go beyond a few checks per day, so for me a cronjob is perfectly acceptable:
1 */6 * * * find checks -name '*.yml' \
-exec perl -Ilib bin/oculi \{\} --influxdb=oculi \;
But I sold you Telegraf pretty hard a few sections ago, so I should show how to run the checks using it as well:
[[inputs.exec]]
  commands = [
    "/usr/bin/oculi run /path/to/checks/disk_usage.yml",
    "/usr/bin/oculi run /path/to/checks/printer_ink.yml",
  ]
  data_format = "influx"
Or, if one is willing to use the yml-as-script trick, it can be crunched into:
[[inputs.exec]]
  commands = [
    "/path/to/checks/*.yml",
  ]
  data_format = "influx"
in conclusion
The goal of this blog entry was to demonstrate that monitoring systems doesn’t need to be exceedingly complex to let the measurements flow, and that a little bit of structure can lead to smaller, simpler, faster-to-write, easier-to-maintain libraries of metrics. Hopefully, it’ll inspire people. Maybe to adopt and extend oculi, maybe to write their own version in the language of their choice.
Either way, eyes will be watching.
Tags: technology monitoring influxdb perl