Logs and alarms

Introduction

This document describes the alarms that a platform integrating Cygnus should raise when an incident happens. It is addressed to professional operators and to the administrators of such platforms.

The Cygnus log message types are explained first, followed by the alarm conditions derived from those messages.

For each alarm, the following information is given:

  • Alarm identifier. A unique numerical identifier, starting at 1.
  • Severity. CRITICAL or WARNING.
  • Detection strategy. An example log trace that identifies the alarm.
  • Stop condition. An example log trace indicating that the problem is no longer active.
  • Description. A detailed explanation of the situation that triggers the alarm.
  • Action. A detailed plan to cope with the situation (e.g. reboots, connectivity checks, etc.).
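
As an illustration only, the six fields above could be held in a small record type on the monitoring side. The sketch below is a minimal one in Python; the names AlarmDefinition and Severity are hypothetical and are not part of Cygnus.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Severity(Enum):
    CRITICAL = "CRITICAL"
    WARNING = "WARNING"


@dataclass(frozen=True)
class AlarmDefinition:
    """One alarm entry; a hypothetical container, not a Cygnus type."""
    alarm_id: int                  # unique numerical identifier, starting at 1
    severity: Severity             # CRITICAL or WARNING
    detection: str                 # log text that identifies the alarm
    stop_condition: Optional[str]  # log text that clears it, None when N/A
    description: str               # situation that triggers the alarm
    action: str                    # plan to cope with the situation


# Alarm 2 from the alarm conditions in this document, as an example entry.
runtime_error = AlarmDefinition(
    alarm_id=2,
    severity=Severity.CRITICAL,
    detection="Runtime error",
    stop_condition=None,
    description="A runtime error has happened.",
    action="Restart Cygnus; escalate to the development team if it persists.",
)
```

A frozen dataclass is used here so that alarm definitions behave as read-only configuration.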

Log message types

Cygnus logs are categorized under seven message types, each one identified by a tag in the custom message part of the trace. These are the tags:

  • Fatal error (FATAL level). This kind of error may cause Cygnus to stop, and thus must be reported to the development team through stackoverflow.com (please tag it with fiware).

    Example: Fatal error (SSL cannot be used, no such algorithm. Details=...)

  • Runtime error (ERROR level). This kind of error may cause Cygnus to fail, and thus must be reported to the development team through stackoverflow.com (please tag it with fiware).

    Example: Runtime error (The Hive table cannot be created. Hive query=.... Details="...)

  • Bad configuration (ERROR level). This kind of error concerns a bad configuration parameter, and may eventually cause Cygnus to fail.

    Example: Bad configuration (Unrecognized HDFS API. The sink can start, but the data is not going to be persisted!)

  • Bad HTTP notification (WARN level). This kind of error is related to notifications whose HTTP message is malformed: unsupported REST method, target, user agent or content type, or an empty body. These errors are thrown exclusively by the NGSIRestHandler component.

    Example: Bad HTTP notification (aggregation target not supported)

  • Bad context data (WARN level). This kind of error is related to semantic inconsistencies within the notified context data: an anomalous number of attributes, or a nonexistent attribute (even when the number of attributes matches) for an already known instance. These errors are thrown exclusively by the sinks.

    Example: Bad context data (The markup in the document following the root element must be well-formed)

  • Channel error (ERROR level). This kind of error reports problems with the internal channel of the agent. This channel is used as part of the failover mechanisms of Flume, storing those events that cannot be processed by the sinks. Nevertheless, the channel itself may fail, either because the HTTP source is not able to put the event (because of a channel error, or simply because the channel is full) or because the sink cannot get a new event.

    Example: Channel error (The event could not be got. Details=...)

  • Persistence error (ERROR level). This kind of error reports problems with the persistence backend: unable to connect, or a nonexistent entity (when the backend needs a container to be provisioned for that entity, e.g. entity-related tables in MySQL or CKAN). These errors are thrown exclusively by the sinks. Please note that Cygnus itself may solve the problem thanks to the channel-based failover mechanism of Flume and the Flume Failover Sink Processor, which switches to a passive sink (if configured).

    Example: Persistence error (Could not connect to the MySQL server)

Debug messages are labeled as Debug, with a logging level of DEBUG. Informational messages, such as the Cygnus version, transaction start/end and others, are labeled as Informational, with a logging level of INFO.
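
The tag-to-level mapping above can be sketched as a small lookup, for instance in a monitoring script. This is a minimal sketch in Python, assuming the tag appears at the start of the custom message part (as in the examples above) and that the message part has already been extracted from the trace; the names MESSAGE_TYPES and classify are hypothetical.

```python
# Tag -> logging level for the seven message types listed in this section.
MESSAGE_TYPES = {
    "Fatal error": "FATAL",
    "Runtime error": "ERROR",
    "Bad configuration": "ERROR",
    "Bad HTTP notification": "WARN",
    "Bad context data": "WARN",
    "Channel error": "ERROR",
    "Persistence error": "ERROR",
}


def classify(message):
    """Return (tag, level) for a custom message, or None if it is untagged.

    Assumes the tag is the first thing in the message part of the trace.
    """
    for tag, level in MESSAGE_TYPES.items():
        if message.startswith(tag):
            return tag, level
    return None


print(classify("Persistence error (Could not connect to the MySQL server)"))
# ('Persistence error', 'ERROR')
print(classify("Startup completed"))
# None
```

Messages that match none of the seven tags (for instance debug or informational ones) return None and can be ignored by the alarm logic.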

Alarm conditions

Alarm 1

  • Severity. CRITICAL
  • Detection strategy. A FATAL trace is found.
  • Stop condition. For each configured Cygnus component (i.e. NGSIRestHandler, NGSIHDFSSink, NGSIMySQLSink and NGSICKANSink), the following trace is found: Startup completed.
  • Description. A problem has happened at Cygnus startup. The msg field details the particular problem.
  • Action. Fix the issue that is preventing Cygnus startup, e.g. if the listening port of a certain source is already in use, change that listening port or stop the process using it.

Alarm 2

  • Severity. CRITICAL
  • Detection strategy. The following ERROR trace is found: Runtime error.
  • Stop condition. N/A
  • Description. A runtime error has happened. The msg field contains the detailed information.
  • Action. Restart Cygnus. If the error persists (e.g. new Runtime errors appear within the next hour), escalate the problem to the development team.

Alarm 3

  • Severity. CRITICAL
  • Detection strategy. The following ERROR trace is found: Bad configuration.
  • Stop condition. For each configured Cygnus component (i.e. NGSIRestHandler, NGSIHDFSSink, NGSIMySQLSink and NGSICKANSink), the following INFO trace is found: Startup completed.
  • Description. A Cygnus component has not been configured appropriately.
  • Action. Configure the component appropriately.

Alarm 4

  • Severity. CRITICAL
  • Detection strategy. The following ERROR trace is found: Channel error.
  • Stop condition. The following INFO traces are found: Event got from the channel.
  • Description. Flume events, put by the sources, cannot be got by the sinks from the internal channel due to a problem with the channel (most probably) or with the sink itself.
  • Action. A runtime error has happened. The msg field contains the detailed information.

Alarm 5

  • Severity. WARNING
  • Detection strategy. The following WARN trace is found: Bad HTTP notification.
  • Stop condition. The following INFO traces are found: Event put in the channel.
  • Description. The HTTP notification sent by Orion is not properly formed: the target, the method, the user agent and/or the content type are anomalous.
  • Action. Nothing has to be done at Cygnus. Check why the sender (Orion Context Broker) is building the notification in such an anomalous way.

Alarm 6

  • Severity. WARNING
  • Detection strategy. The following WARN trace is found: Bad context data in sink_name, where sink_name is NGSIHDFSSink, NGSIMySQLSink or NGSICKANSink.
  • Stop condition. The following INFO traces are found: Persisting data in sink_name, where sink_name is the same sink that raised the alarm.
  • Description. The context data within the notification is wrong: it refers to a nonexistent entity, shows an abnormal number of attributes, or contains a nonexistent attribute.
  • Action. Nothing has to be done at Cygnus. Check the provisioning of the data containers (e.g. tables when using MySQL) and fix any inconsistency that may exist.

Alarm 7

  • Severity. WARNING
  • Detection strategy. The following ERROR trace is found: Persistence error in sink_name, where sink_name is NGSIHDFSSink, NGSIMySQLSink or NGSICKANSink.
  • Stop condition. The following INFO traces are found: Persisting data in sink_name, where sink_name is the same sink that raised the alarm.
  • Description. One of the sinks is not able to persist the context data in the final storage (HDFS, MySQL or CKAN), due to a connection problem or a storage crash/shutdown.
  • Action. Once the problem with the storage is solved, Cygnus should be able to recover from this kind of error automatically by means of the internal channel, which works as a temporary buffer for Flume events not yet processed (containing context data to be persisted).
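
The detection and stop conditions above could be evaluated by a monitoring script along the following lines. This is a minimal sketch in Python under several simplifying assumptions: the logging level and the message text have already been extracted from each trace, every stop trace is treated as an INFO trace (the alarm 1 stop condition does not state a level), and per-component or per-sink matching ("for each configured component", "the same sink that raised the alarm") is not modeled. RULES and update_alarms are hypothetical names, and the sample messages are illustrative only.

```python
# (alarm_id, severity, detection level, detection text, stop text or None)
RULES = [
    (1, "CRITICAL", "FATAL", "", "Startup completed"),  # any FATAL trace
    (2, "CRITICAL", "ERROR", "Runtime error", None),    # no stop condition
    (3, "CRITICAL", "ERROR", "Bad configuration", "Startup completed"),
    (4, "CRITICAL", "ERROR", "Channel error", "Event got from the channel"),
    (5, "WARNING", "WARN", "Bad HTTP notification", "Event put in the channel"),
    (6, "WARNING", "WARN", "Bad context data", "Persisting data"),
    (7, "WARNING", "ERROR", "Persistence error", "Persisting data"),
]


def update_alarms(active, level, message):
    """Update the set of active alarm IDs in place for one log trace."""
    for alarm_id, _severity, det_level, det_text, stop_text in RULES:
        if level == det_level and det_text in message:
            active.add(alarm_id)      # detection strategy matched
        elif stop_text is not None and level == "INFO" and stop_text in message:
            active.discard(alarm_id)  # stop condition matched
    return active


active = set()
update_alarms(active, "ERROR", "Persistence error (Could not connect to the MySQL server)")
print(active)  # {7}
update_alarms(active, "INFO", "Persisting data in NGSIMySQLSink")
print(active)  # set()
```

Alarm 2 has no stop condition, so in this sketch it stays active until an operator clears it by other means.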
