Logs and alarms

Content:

  • Introduction
  • Log message types
  • Alarm conditions

Introduction

This document describes the alarms a platform integrating Cygnus-twitter should raise when an incident happens. It is therefore addressed to the operators and administrators of such a platform.

Cygnus log messages are explained first; then the alarm conditions derived from those messages are described.

For each alarm, the following information is given:

  • Alarm identifier. A unique numerical identifier starting at 1.
  • Severity. Either CRITICAL or WARNING.
  • Detection strategy. An example log trace that identifies the related alarm.
  • Stop condition. An example log trace indicating that the related problem is no longer active.
  • Description. A detailed explanation of the situation that triggers the alarm.
  • Action. A detailed plan to cope with the situation (e.g. reboots, connectivity checks, etc.).
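
As an illustration only, the following minimal Python sketch shows one way an integrating platform could model these per-alarm fields; the class, field and variable names are hypothetical and not part of Cygnus.

```python
# Minimal sketch (hypothetical, not part of Cygnus): modeling the alarm
# metadata fields listed above so that a monitoring platform can register them.
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AlarmDefinition:
    identifier: int              # unique numerical identifier, starting at 1
    severity: str                # "CRITICAL" or "WARNING"
    detection_trace: str         # log fragment whose presence raises the alarm
    stop_trace: Optional[str]    # log fragment that clears the alarm (None if N/A)
    description: str             # what the situation means
    action: str                  # recommended operator reaction


# Example entry, taken from alarm 1 in the "Alarm conditions" section below.
ALARM_1 = AlarmDefinition(
    identifier=1,
    severity="CRITICAL",
    detection_trace="Fatal error",
    stop_trace="Startup completed",
    description="A problem has happened at Cygnus startup.",
    action="Fix the issue that is precluding Cygnus startup.",
)
```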

Log message types

Cygnus logs are categorized under six message types, each one identified by a tag in the custom message part of the trace. These are the tags:

  • Fatal error (FATAL level). These kinds of errors may cause Cygnus to stop, and thus must be reported to the development team through stackoverflow.com (please, tag it with fiware).

    Example: Fatal error (SSL cannot be used, no such algorithm. Details=...)

  • Runtime error (ERROR level). These kinds of errors may cause Cygnus to fail, and thus must be reported to the development team through stackoverflow.com (please, tag it with fiware).

    Example: Runtime error (The Hive table cannot be created. Hive query=... Details=...)

  • Bad configuration (ERROR level). These kinds of errors refer to a bad configuration parameter and may eventually lead to a Cygnus failure.

    Example: Bad configuration (Unrecognized HDFS API. The sink can start, but the data is not going to be persisted!)

  • Channel error (ERROR level). These kinds of errors report problems with the internal channel of the agent. This channel is used as part of the failover mechanism of Flume, storing those events that cannot be processed by the sinks. Nevertheless, the channel may fail itself, either because the HTTP source is not able to put the event into it (channel error, or simply the channel is full), or because the sink cannot get a new event from it.

    Example: Channel error (The event could not be got. Details=...)

  • Persistence error (ERROR level). These kinds of errors report problems with the persistence backend: unable to connect, or a non-existent folder (when the backend needs a container provisioned for the data, e.g. a folder in HDFS). They are exclusively thrown by the sinks. Please observe that Cygnus itself may solve the problem thanks to the channel-based failover mechanism of Flume and the Flume Failover Sink Processor, which switches to a passive sink (if configured).

    Example: Persistence error (Could not connect to the HDFS)

  • Streaming error (ERROR level). These kinds of errors report problems with the Twitter API: unable to connect to the API because of invalid credentials or a temporary unavailability of Twitter. They are exclusively thrown by the TwitterSource.

    Example: Exception while streaming tweets

Debug messages are labeled as Debug, with a logging level of DEBUG. Informational messages, such as the Cygnus version or transaction start/end, are labeled as Informational, with a logging level of INFO.
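
As a hedged illustration of how the tag-based detection described above could be scripted, the following minimal Python sketch classifies traces by looking for the literal tags in the message part of each trace. The sample traces are the ones quoted above; the function and variable names are hypothetical, and in a real deployment the traces would be read from the Cygnus log file of your installation.

```python
# Minimal sketch (hypothetical, not part of Cygnus): map the tags listed above
# to their logging level and classify incoming traces accordingly.
from typing import Optional, Tuple

TAG_TO_LEVEL = {
    "Fatal error": "FATAL",
    "Runtime error": "ERROR",
    "Bad configuration": "ERROR",
    "Channel error": "ERROR",
    "Persistence error": "ERROR",
    "Exception while streaming tweets": "ERROR",  # streaming error
}


def classify(trace: str) -> Optional[Tuple[str, str]]:
    """Return (tag, level) for the first known tag found in the trace, or None."""
    for tag, level in TAG_TO_LEVEL.items():
        if tag in trace:
            return tag, level
    return None


if __name__ == "__main__":
    # Sample traces taken from the examples in this section.
    samples = [
        "Fatal error (SSL cannot be used, no such algorithm. Details=...)",
        "Persistence error (Could not connect to the HDFS)",
        "Exception while streaming tweets",
    ]
    for trace in samples:
        print(classify(trace), "<-", trace)
```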

Alarm conditions

Alarm 1

  • Severity: CRITICAL
  • Detection strategy: A FATAL trace is found.
  • Stop condition: For each configured Cygnus-twitter component (i.e. TwitterSource and TwitterHDFSSink), the following INFO trace is found: Startup completed.
  • Description: A problem has happened at Cygnus startup. The msg field details the particular problem.
  • Action: Fix the issue that is precluding Cygnus startup, e.g. if the problem was due to an invalid Twitter API key or invalid coordinates for the geoquery, then change such values.

Alarm 2

  • Severity: CRITICAL
  • Detection strategy: The following ERROR trace is found: Runtime error.
  • Stop condition: N/A
  • Description: A runtime error has happened. The msg field contains the detailed information.
  • Action: Restart Cygnus. If the error persists (e.g. new runtime errors appear within the next hour), scale up the problem to the development team.

Alarm 3

  • Severity: CRITICAL
  • Detection strategy: The following ERROR trace is found: Bad configuration.
  • Stop condition: For each configured Cygnus component (i.e. TwitterSource and TwitterHDFSSink), the following INFO trace is found: Startup completed.
  • Description: A Cygnus component has not been configured in the appropriate way.
  • Action: Configure the component in the appropriate way.

Alarm 4

  • Severity: CRITICAL
  • Detection strategy: The following ERROR trace is found: Channel error.
  • Stop condition: The following INFO trace is found: Event got from the channel.
  • Description: Flume events, put by the sources, cannot be got by the sinks from the internal channel, due to a problem with the channel (most probably) or the sink itself. The msg field contains the detailed information.
  • Action: Restart Cygnus. If the error persists (e.g. new channel errors appear within the next hour), scale up the problem to the development team.

Alarm 5

  • Severity: WARNING
  • Detection strategy: The following ERROR trace is found: Persistence error in TwitterHDFSSink.
  • Stop condition: The following INFO trace is found: Persisting data in TwitterHDFSSink.
  • Description: TwitterHDFSSink is not able to persist the context data in the final storage (HDFS), due to a connection problem or a storage crash/shutdown.
  • Action: Once the problem with the storage is solved, Cygnus should be able to fix this kind of error automatically by means of the internal channel, which works as a temporary buffer for not-yet-processed Flume events (containing the context data to be persisted).

Alarm 6

  • Severity: WARNING
  • Detection strategy: The following ERROR trace is found: Exception while streaming tweets.
  • Stop condition: N/A
  • Description: TwitterSource is not able to retrieve tweets from the Twitter API, due to a connection problem or invalid credentials.
  • Action: Once you have checked that the problem is not due to an external Twitter unavailability, ensure your API credentials (consumerKey, consumerSecret, accessToken and accessTokenSecret) are valid and active.
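
The following minimal Python sketch illustrates, under simplifying assumptions, how a monitoring platform might turn the detection and stop traces above into raise/clear events. The trace fragments are the ones quoted in this section; the rule table, state handling and function names are hypothetical, and the per-component check required by the stop conditions of alarms 1 and 3 is deliberately simplified to a single trace match.

```python
# Minimal sketch (hypothetical, not part of Cygnus): raise and clear alarms
# according to the detection and stop traces described in this section.
RULES = [
    # (alarm id, severity, detection fragment, stop fragment or None)
    (1, "CRITICAL", "Fatal error",                      "Startup completed"),
    (2, "CRITICAL", "Runtime error",                    None),
    (3, "CRITICAL", "Bad configuration",                "Startup completed"),
    (4, "CRITICAL", "Channel error",                    "Event got from the channel"),
    (5, "WARNING",  "Persistence error",                "Persisting data in TwitterHDFSSink"),
    (6, "WARNING",  "Exception while streaming tweets", None),
]

active = set()  # identifiers of the alarms currently raised


def process_trace(trace: str) -> None:
    """Raise or clear alarms depending on the fragments found in a log trace."""
    for alarm_id, severity, detection, stop in RULES:
        if detection in trace and alarm_id not in active:
            active.add(alarm_id)
            print(f"RAISE alarm {alarm_id} ({severity}): {trace}")
        elif stop is not None and stop in trace and alarm_id in active:
            active.discard(alarm_id)
            print(f"CLEAR alarm {alarm_id}")


if __name__ == "__main__":
    # Example traces taken from this section.
    process_trace("Channel error (The event could not be got. Details=...)")
    process_trace("Event got from the channel")
```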
