Alarm management is a fundamental part of network monitoring. The motivation for defining a standard alarm interface for network devices isn’t new. In the early 90s, ITU-T standardized X.733 (OSI model). This continued in mobile networks with the standardization of Alarm IRP (Integration Reference Point) by 3GPP. In TCP/IP networks, SNMP is the preferred choice for network management, along with ad hoc tools (usually command-line scripts). In SNMP, object information is described in MIBs (Management Information Bases), formal descriptions of the network objects that can be managed. MIBs are usually organized as a tree.

The IETF didn’t standardize an alarm MIB early on. Instead, management systems interpreted the enterprise-specific traps of each MIB to build an alarm list. When RFC 3877 (Alarm Management Information Base (MIB)) was finally published, it had to account for these existing enterprise traps and map them into alarms. This requirement led to a MIB that was not easy to use.

Introducing NETCONF and YANG

SNMP is still the dominant protocol for network management, although it has started to show its age. In recent years, several alternatives have been proposed with the goal of replacing it. Among these proposals, the most promising is NETCONF (RFC 6241: Network Configuration Protocol). NETCONF is, like SNMP, a network management protocol. It provides mechanisms to install, manipulate, and delete the configuration of network devices. NETCONF uses an RPC mechanism to execute its operations, and its protocol messages are encoded in XML.

The NETMOD WG (NETCONF Data Modeling Language Working Group) defines the semantics of operational data, configuration data, notifications and operations, using a data modeling language called YANG (see RFC 6020 and RFC 6021).

YANG is a very rich language. It allows defining much more complex data structures than other modeling languages such as DTD or XML Schema. For instance, YANG features a wide range of primitive data types (uint32, string, boolean, decimal64, etc.), simple data (leaf), structured data elements (container, list, leaf-list), customized types (typedef), remote procedure calls, references (instance-identifier, leafref), notifications, etc.

Take the following model as example:

container students {
   list student {
      key "name";
      leaf name {
         type string;
      }
      leaf date-of-birth {
         type yang:date;
      }
   }
}

Data conforming to this model could look like this:

students {
   student { name "Jane"; date-of-birth "01-01-1995"; }
   student { name "John"; date-of-birth "31-03-1995"; }
}

That very same model could be written in DTD/XML form as:

<?xml version="1.0"?>
<!DOCTYPE students [
<!ELEMENT students (student*)>
<!ELEMENT student (name,date-of-birth)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT date-of-birth (#PCDATA)>
]>
<students>
   <student>
      <name>Jane</name>
      <date-of-birth>01-01-1995</date-of-birth>
   </student>
   <student>
      <name>John</name>
      <date-of-birth>31-03-1995</date-of-birth>
   </student>
</students>

One obvious difference is that the date-of-birth field is encoded as a plain string in the DTD/XML model. By contrast, it’s defined as a date in the YANG model. Supporting dates as a native data type in the language improves value checking: if date-of-birth is not a valid date, our YANG library will report the error.
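As a rough illustration of what typed validation buys, the Python sketch below checks two student records. The validate_student helper and the DD-MM-YYYY format are assumptions made for this example, taken from the sample data; they are not part of any YANG library.

```python
from datetime import datetime

# Assumed date format, matching the sample data (DD-MM-YYYY).
DATE_FORMAT = "%d-%m-%Y"

def validate_student(student):
    """Return a list of error messages for one student record (hypothetical helper)."""
    errors = []
    if not isinstance(student.get("name"), str):
        errors.append("name must be a string")
    try:
        # strptime rejects malformed strings and impossible calendar dates.
        datetime.strptime(student.get("date-of-birth", ""), DATE_FORMAT)
    except ValueError:
        errors.append("date-of-birth is not a valid date")
    return errors

students = [
    {"name": "Jane", "date-of-birth": "01-01-1995"},
    {"name": "John", "date-of-birth": "31-02-1995"},  # deliberately invalid: no 31 February
]

for student in students:
    print(student["name"], validate_student(student) or "ok")
```

A string-typed field would accept both records. A date-typed field rejects the second one at validation time.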

YANG also allows composing several YANG modules into a single document. Data types from a different module can be accessed through that module’s prefix, as with yang:date in the example above.

Covering all aspects of YANG would require a blog post of its own, so I will end this introduction here. To summarize, the two main ideas I’d like to highlight are the following:

  • NETCONF is a relatively new network management protocol, intended to replace SNMP and tightly coupled with YANG.
  • YANG is the data modeling language used to define NETCONF’s data models.

YANG alarms

The YANG alarms module is defined in draft-vallin-netmod-alarm-module-02.txt. Its implementation in Snabb was sponsored, like most lwAFTR-related work, by Deutsche Telekom. The module specification is still a draft, but even in this state it features enough functionality to make an implementation valuable.

The implementation uses Snabb’s native YANG library and Snabb’s config tool, a simple implementation of NETCONF. Both tools were mostly developed by Igalia (more precisely by my colleagues Andy Wingo and Jessica Tallon), also as part of the work on Snabb’s lwAFTR.

At a high level, the YANG alarms module is organized in two parts:

  • Configuration data: stores all the attributes and variables that control how the module should operate. For example, max-alarm-status-changes controls the size of an alarm’s status-change list (default: 32); notify-status-changes controls whether notifications are sent on alarm status updates.
  • State data: stores the actual alarm information and consists of four containers: alarm-list, alarm-inventory, shelved-alarms and summary.

The main component of the state data container is the alarm-list container:

list alarm {
   key "resource alarm-type-id alarm-type-qualifier";

   uses common-alarm-parameters;
}

grouping common-alarm-parameters {
   leaf resource {
      type resource;
      mandatory true;
   }
   leaf alarm-type-id {
      type alarm-type-id;
      mandatory true;
   }
   leaf alarm-type-qualifier {
      type alarm-type-qualifier;
   }
}

The alarm-list container stores all the active alarms managed in the system. But before going any further, we should define what an alarm is. Basically, an alarm is a persistent indication of a fault that clears only when its triggering condition has been resolved. An alarm in the list is always in one of at least two states: raised or cleared.

When an alarm is raised, a new entry is created in alarm-list. An alarm is identified by the triple {resource, alarm-type-id, alarm-type-qualifier}, which describes the affected resource, an identifier for the type of alarm, and a qualifier that carries other optional information. Besides this, an alarm also stores other information (omitted in the example for simplicity) such as whether the alarm is-cleared, its last-changed timestamp, its perceived-severity and a list of status changes. When an alarm is created, a new item is added to this list. If the alarm later increases or decreases its severity, or changes some other property as defined in the standard, a new status change is added to this list.
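To make the data structure more concrete, here is a minimal Python sketch of an alarm list keyed by that triple, with a bounded status-change history. The names (Alarm, raise_alarm, alarm_list) are invented for this illustration and do not mirror the Snabb implementation; the limit of 32 is the max-alarm-status-changes default mentioned earlier.

```python
from collections import deque
from dataclasses import dataclass, field

MAX_STATUS_CHANGES = 32  # default of max-alarm-status-changes

@dataclass
class Alarm:
    perceived_severity: str
    last_changed: str
    is_cleared: bool = False
    # Bounded history: the oldest status change is dropped once the limit is reached.
    status_changes: deque = field(
        default_factory=lambda: deque(maxlen=MAX_STATUS_CHANGES))

# (resource, alarm-type-id, alarm-type-qualifier) -> Alarm
alarm_list = {}

def raise_alarm(resource, type_id, qualifier, severity, now):
    """Create the alarm, or record a status change if it was cleared or its severity changed."""
    key = (resource, type_id, qualifier)
    alarm = alarm_list.get(key)
    if alarm is None:
        alarm = alarm_list[key] = Alarm(severity, now)
        alarm.status_changes.append((now, severity))
    elif alarm.is_cleared or alarm.perceived_severity != severity:
        alarm.is_cleared = False
        alarm.perceived_severity = severity
        alarm.last_changed = now
        alarm.status_changes.append((now, severity))
    return alarm

alarm = raise_alarm("16446", "arp-resolution", "", "critical", "2018-06-18T14:57:40Z")
print(len(alarm_list), list(alarm.status_changes))
```

Raising the same alarm again with the same severity leaves the entry untouched in this sketch; only a real change adds to the status-change list.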

Most of the YANG alarms module’s business logic is implemented in lib/yang/alarms.lua. This library provides an API to define alarms and to control when to raise or clear them. To monitor a particular condition, we just need to import the alarms module and create a check point. For instance:

function ARP:maybe_send_arp_request (output)
   -- Nothing to do once the next-hop MAC address is known.
   if self.next_mac then return end
   self.next_arp_request_time = self.next_arp_request_time or engine.now()
   if self.next_arp_request_time <= engine.now() then
      self:arp_resolving(self.next_ip)
      ...
   end
end

function ARP:arp_resolving (ip)
   print(("ARP: Resolving '%s'"):format(ipv4:ntop(ip)))
   -- Raise the alarm only if alarm notifications are enabled.
   if self.alarm_notification then
      arp_alarm:raise()
   end
end

When the condition is not met (self.next_mac hasn’t been resolved yet and self.next_arp_request_time has expired), an alarm is raised. But what if this check point runs repeatedly, for instance every second until an operator fixes the alarm condition? To avoid saturating the alarm list, the standard specifies an interval of 2 minutes before the same alarm can be raised again. The alarms library manages this interval.
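The throttling idea can be sketched in a few lines of Python. This is only an illustration of the technique, not the code in lib/yang/alarms.lua: the ThrottledAlarm class and its injectable clock are hypothetical.

```python
import time

RERAISE_INTERVAL = 120.0  # seconds: the 2-minute interval described in the text

class ThrottledAlarm:
    """Hypothetical wrapper that suppresses repeated raises of the same alarm."""

    def __init__(self, name, clock=time.monotonic):
        self.name = name
        self.clock = clock          # injectable so the example can fake time
        self.last_raised = None

    def raise_(self):
        now = self.clock()
        if self.last_raised is not None and now - self.last_raised < RERAISE_INTERVAL:
            return False            # raised too recently: ignore this call
        self.last_raised = now
        return True                 # a real library would update the alarm list here

fake_now = [0.0]
alarm = ThrottledAlarm("arp-resolution", clock=lambda: fake_now[0])
print(alarm.raise_())   # first raise is accepted
fake_now[0] = 1.0
print(alarm.raise_())   # one second later: suppressed
fake_now[0] = 121.0
print(alarm.raise_())   # after the interval has passed: accepted again
```

With this in place the check point can call raise_ on every iteration, and the caller does not need to track time itself.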

Besides a list of alarms, the module also defines these other containers:

  • alarm-inventory: It contains all possible alarm types for the system.
  • summary: Summary of numbers of alarms and shelved alarms.
  • shelved-alarms: A shelved alarm is ignored and won’t emit raise or clear events. Shelved alarms don’t emit notifications either. Shelving an alarm is a convenient way to silence it.

When an alarm is raised, cleared or changes its status, a notification is sent. The alarms module specifies three types of notifications:

  • alarm-notification: Used to report a state change for an alarm. This notification is emitted when an alarm is raised, is cleared, or changes its status.
  • alarm-inventory-changed: Used to report that the list of possible alarms has changed.
  • operator-action: Used to report that an operator acted upon an alarm.

Continuing with the ARP alarm example, here’s what a notification looks like when that alarm is raised:

$ sudo ./snabb alarms listen lwaftr
{"event":"alarm-notification",
 "resource":"16446", "alarm_type_id":"arp-resolution", "alarm_type_qualifier":"",
 "perceived_severity":"critical", "alarm_text":"Make sure you can resolve..."}

Upon receiving a notification, an operator or an external program can act on the affected resource signaled by the alarm and fix the condition that triggered it. For instance, when the lwAFTR is unable to resolve the next-hop IPv4 address, the alarm indicates that the host isn’t reachable (the host is down, or there’s no route to that address).
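An external program consuming these notifications could look roughly like the Python sketch below. It assumes each notification arrives as one complete JSON object with the field names shown in the sample output; the describe helper is made up for this example.

```python
import json

# Sample notification copied from the `snabb alarms listen` output.
line = ('{"event":"alarm-notification",'
        ' "resource":"16446", "alarm_type_id":"arp-resolution", "alarm_type_qualifier":"",'
        ' "perceived_severity":"critical", "alarm_text":"Make sure you can resolve..."}')

def describe(notification):
    """Turn one decoded notification into a one-line summary (hypothetical helper)."""
    if notification.get("event") != "alarm-notification":
        return "ignored event: %s" % notification.get("event")
    return "[%s] %s on resource %s: %s" % (
        notification["perceived_severity"].upper(),
        notification["alarm_type_id"],
        notification["resource"],
        notification["alarm_text"],
    )

print(describe(json.loads(line)))
```

From here, a monitoring script could page an operator for critical alarms or trigger an automated fix.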

Lastly, the module also specifies one YANG action and two YANG RPCs:

  • set-operator-state: Allows an operator to change the state of an alarm. The specification defines 4 possible operator states: cleared-not-closed, cleared-closed, not-cleared-closed, not-cleared-not-closed.
  • purge-alarms: Deletes entries from the alarm list according to the supplied criteria. It can be used to delete alarms that are in a closed state or are older than a specified time.
  • compress-alarms: Compresses entries in the alarm list by removing all but the latest state change for all alarms.
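The semantics of purge-alarms and compress-alarms can be sketched in Python as follows. The dictionary layout, the parameter names (closed, older_than) and the numeric timestamps are assumptions for illustration only; the real RPCs take their criteria as YANG input parameters.

```python
# Operator states, from the list above, that count as "closed".
CLOSED_STATES = {"cleared-closed", "not-cleared-closed"}

def purge_alarms(alarms, closed=False, older_than=None):
    """Remove alarms matching every supplied criterion; return how many were removed."""
    def matches(alarm):
        if closed and alarm["operator_state"] not in CLOSED_STATES:
            return False
        if older_than is not None and alarm["last_changed"] >= older_than:
            return False
        return True

    doomed = [key for key, alarm in alarms.items() if matches(alarm)]
    for key in doomed:
        del alarms[key]
    return len(doomed)

def compress_alarms(alarms):
    """Keep only the latest status change of each alarm; return how many were compressed."""
    compressed = 0
    for alarm in alarms.values():
        if len(alarm["status_changes"]) > 1:
            alarm["status_changes"] = alarm["status_changes"][-1:]
            compressed += 1
    return compressed

alarms = {
    ("16446", "arp-resolution", ""): {
        "operator_state": "not-cleared-not-closed", "last_changed": 300,
        "status_changes": [(100, "major"), (300, "critical")]},
    ("16446", "example-alarm", ""): {   # hypothetical second alarm type
        "operator_state": "cleared-closed", "last_changed": 50,
        "status_changes": [(50, "critical")]},
}
print(compress_alarms(alarms), purge_alarms(alarms, closed=True), list(alarms))
```

Note that in this sketch calling purge_alarms with no criteria removes every alarm, since nothing filters the list.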

NETCONF side

Adding alarms support to Snabb, and more precisely to the lwAFTR, has brought in many good things. First of all, Snabb’s YANG library has gained support for more data types such as empty, identityref and leafref. It has also improved parsing and validation of other data types such as ipv4-prefix, ipv6-prefix and enum, in addition to other minor improvements and bug fixes. For the moment, the lwAFTR is the poster child for alarms, but the mechanism is generic enough to be used by other data-planes.

A new program has been added to Snabb, unsurprisingly called alarms. It consists of five sub-commands:

  • listen: Listens to a Snabb instance that provides alarms support. The subprogram can send RPC requests to the server program or listen to notifications.
  • get-state: Sends an XPath request to a target Snabb instance that provides alarms state information.
  • set-operator-state: User interface to set-operator-state action.
  • purge-alarms: User interface to purge-alarms.
  • compress-alarms: User interface to compress-alarms.

Below is an example invocation of the get-state subprogram and an excerpt of its output:

$ sudo ./snabb alarms get-state lwaftr /
alarm-list {
   alarm {
      alarm-type-id arp-resolution;
      alarm-type-qualifier '';
      resource 21385;
      alarm-text
         "Make sure you can resolve external-interface.next-hop.ip address manually."
         "If it cannot be resolved, consider setting the MAC address of the next-hop directly."
         "To do it so, set external-interface.next-hop.mac to the value of the MAC address.";
      is-cleared false;
      last-changed 2018-06-18T14:57:40Z;
      perceived-severity critical;
      status-change {
         time 2018-06-18T14:57:40Z;
         alarm-text 
            "Make sure you can resolve external-interface.next-hop.ip address manually."
            "If it cannot be resolved, consider setting the MAC address of the next-hop directly."
            "To do it so, set external-interface.next-hop.mac to the value of the MAC address.";
         perceived-severity critical;
      }
      time-created 2018-06-18T14:57:40Z;
   }
   last-changed 2018-06-18T14:57:40Z;
   number-of-alarms 1;
}

The alarms module keeps all its state in one Snabb instance, the leader process. As a reminder, since v3.0 the lwAFTR runs in a multiprocess architecture which consists of:

  • One leader, which manages changes in the lwAFTR configuration, for instance changes in softwires (add, remove, update).
  • One or more workers, each of which runs a lwAFTR data-plane.

The leader and the workers communicate via an IPC (inter-process communication) mechanism, in this case a message channel implemented using sockets. When a worker raises an alarm, it sends a message to the leader through this channel. The leader polls the alarms channel periodically, consuming all the stored messages. The result of processing a message is an action that alters the alarms state: for instance, adding a new alarm to the inventory, raising an alarm, clearing it, etc. All this logic is coded in lib/ptree/ptree.lua and lib/ptree/alarm_codec.lua.
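The pattern is easier to see in code. The following Python sketch replaces the socket with an in-memory queue and uses invented message names, so it is only a model of the flow described above, not a transcription of the ptree code.

```python
from collections import deque

# Stand-in for the socket-based channel between workers and the leader.
alarms_channel = deque()

# Alarm state, kept by the leader only.
state = {"inventory": set(), "alarm_list": {}}

def add_to_inventory(alarm_type_id):
    state["inventory"].add(alarm_type_id)

def raise_alarm(resource, alarm_type_id):
    state["alarm_list"][(resource, alarm_type_id)] = {"is_cleared": False}

def clear_alarm(resource, alarm_type_id):
    alarm = state["alarm_list"].get((resource, alarm_type_id))
    if alarm is not None:
        alarm["is_cleared"] = True

# Maps a message name to the action that alters the alarm state.
ACTIONS = {
    "add_to_inventory": add_to_inventory,
    "raise_alarm": raise_alarm,
    "clear_alarm": clear_alarm,
}

def worker_send(name, *args):
    """Worker side: queue a message for the leader."""
    alarms_channel.append((name, args))

def leader_poll():
    """Leader side, called periodically: consume and apply all queued messages."""
    handled = 0
    while alarms_channel:
        name, args = alarms_channel.popleft()
        ACTIONS[name](*args)
        handled += 1
    return handled

worker_send("add_to_inventory", "arp-resolution")
worker_send("raise_alarm", "16446", "arp-resolution")
print(leader_poll(), state["alarm_list"])
```

Keeping the state in a single process means workers never need to synchronize with each other; they only describe what happened and let the leader apply it.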

Besides alarms, there are also notifications. A notification is a simple message that is emitted under certain circumstances: when an alarm is raised, when its status changes or when a new alarm-type is added to the inventory. Notifications are a native YANG element, not particular to alarms.

In Snabb, the notifications mechanism is also implemented via sockets. In this case, a socket connects a lwAFTR leader to a series of peers that listen on the socket. When a notification is triggered, it is added to the leader’s list of notifications. The leader process runs a fiber that constantly polls this list. If it finds new entries, the notifications are serialized to JSON objects and sent through the socket. Once a notification is sent, it’s removed from the alarms state. This logic is implemented in lib/ptree/ptree.lua and lib/yang/alarms.lua.
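A simplified model of this flush loop, written in Python, might look like the following. Peers are represented by plain callables instead of sockets, and the function names are hypothetical.

```python
import json
from collections import deque

# The leader's list of pending notifications.
pending = deque()

def notify(event, **fields):
    """Queue a notification; called whenever an alarm changes state."""
    pending.append({"event": event, **fields})

def flush(peers):
    """Serialize each pending notification to JSON, send it to every peer, then drop it."""
    sent = 0
    while pending:
        line = json.dumps(pending.popleft())
        for send in peers:
            send(line)
        sent += 1
    return sent

received = []                      # stands in for one listening peer
notify("alarm-notification", resource="16446",
       alarm_type_id="arp-resolution", perceived_severity="critical")
print(flush([received.append]), received)
```

Because a notification is removed as soon as it is sent, a peer that connects later only sees events raised after it started listening; the alarm list itself remains the place to query current state.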

Summary and conclusions

YANG alarms is a simple mechanism to report error conditions. The main strengths of this module are:

  • It’s encoded as a YANG module, with all the advantages that brings (common vocabulary and semantics, reusability).
  • Signaling errors by simply printing messages to stdout is not reliable, as they can easily be missed. Alarms are stored in memory and keep state that can be consulted later on demand.
  • Active notifications for the most important state changes. This allows hooking in external programs, which don’t need to constantly poll the current state to check whether something changed.

On the downside, I personally think that the amount of information tracked per alarm is excessive, making the YANG specification more complex than one might think at first. Fortunately, programs interested in supporting this module do not need to implement every feature specified; a subset of the module’s features is enough. At the time of writing, the YANG alarms proposal is still a draft, but hopefully it will become a standard after several revisions.