NGSIHDFSSink

Content:

Functionality
Administration guide
Programmers guide

Functionality

com.telefonica.iot.cygnus.sinks.NGSIHDFSSink, or simply NGSIHDFSSink, is a sink designed to persist NGSI-like context data events within an HDFS deployment. Usually, such context data is notified by an Orion Context Broker instance, but it could come from any other system speaking the NGSI language.

Independently of the data generator, NGSI context data is always transformed into internal NGSIEvent objects at Cygnus sources. In the end, the information within these events must be mapped into specific HDFS data structures at the Cygnus sinks.

The next sections explain this in detail.

Parameter	Mandatory	Default value	Comments
type	yes	N/A	Must be com.telefonica.iot.cygnus.sinks.NGSIHDFSSink
channel	yes	N/A
enable_encoding	no	false	true or false; true applies the new encoding, false applies the old encoding.
enable_grouping	no	false	true or false. Check this link for more details.
enable_name_mappings	no	false	true or false. Check this link for more details.
enable_lowercase	no	false	true or false.
data_model	no	dm-by-entity	Always dm-by-entity, even if not configured.
file_format	no	json-row	json-row, json-column, csv-row or csv-column.
backend.impl	no	rest	rest, if a WebHDFS/HttpFS-based implementation is used when interacting with HDFS; or binary, if a Hadoop API-based implementation is used when interacting with HDFS.
backend.max_conns	no	500	Maximum number of connections allowed for an HTTP-based HDFS backend. Ignored if using a binary backend implementation.
backend.max_conns_per_route	no	100	Maximum number of connections per route allowed for an HTTP-based HDFS backend. Ignored if using a binary backend implementation.
hdfs_host	no	localhost	FQDN/IP address where HDFS Namenode runs, or comma-separated list of FQDN/IP addresses where HDFS HA Namenodes run.
hdfs_port	no	14000	14000 if using HttpFS (rest), 50070 if using WebHDFS (rest), 8020 if using the Hadoop API (binary).
hdfs_username	yes	N/A	If `service_as_namespace=false` then it must be an existing user in HDFS. If `service_as_namespace=true` then it must be an HDFS superuser.
hdfs_password	yes	N/A	Password for the above `hdfs_username`; this is only required for Hive authentication.
oauth2_token	yes	N/A	OAuth2 token required for the HDFS authentication.
service_as_namespace	no	false	If configured as true then the `fiware-service` (or the default one) is used as the HDFS namespace instead of `hdfs_username`, which in this case must be an HDFS superuser.
csv_separator	no	,
batch_size	no	1	Number of events accumulated before persistence.
batch_timeout	no	30	Number of seconds the batch will be building before it is persisted as it is.
batch_ttl	no	10	Number of retries when a batch cannot be persisted. Use `0` for no retries, `-1` for infinite retries. Note that an infinite TTL (or even a very large one) may consume all the sink's channel capacity very quickly.
batch_retry_intervals	no	5000	Comma-separated list of intervals (in milliseconds) at which retries of non-persisted batches are attempted. The first retry is done after the first interval, the second retry after the second interval, and so on. If `batch_ttl` is greater than the number of intervals, the last interval is repeated.
hive	no	true	true or false.
hive.server_version	no	2	`1` if the remote Hive server runs HiveServer1 or `2` if the remote Hive server runs HiveServer2.
hive.host	no	localhost
hive.port	no	10000
hive.db_type	no	default-db	default-db or namespace-db. If `hive.db_type=default-db` then the default Hive database is used. If `hive.db_type=namespace-db` and `service_as_namespace=false` then the `hdfs_username` is used as the Hive database. If `hive.db_type=namespace-db` and `service_as_namespace=true` then the notified fiware-service is used as the Hive database.
krb5_auth	no	false	true or false.
krb5_user	yes	empty	Ignored if `krb5_auth=false`, mandatory otherwise.
krb5_password	yes	empty	Ignored if `krb5_auth=false`, mandatory otherwise.
krb5_login_conf_file	no	/usr/cygnus/conf/krb5_login.conf	Ignored if `krb5_auth=false`.
krb5_conf_file	no	/usr/cygnus/conf/krb5.conf	Ignored if `krb5_auth=false`.
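
The retry rules (`batch_ttl`, `batch_retry_intervals`) and the Hive database selection (`hive.db_type`, `service_as_namespace`) described in the table can be sketched in Python. This is only an illustration of the documented behaviour, not Cygnus code: the function names `parse_intervals`, `retry_delays` and `hive_database` are hypothetical, the sample user and service names are made up, and the sketch assumes that "the default Hive database" means the database named `default`.

```python
from itertools import islice
from typing import Iterator, List


def parse_intervals(raw: str) -> List[int]:
    """Parse a batch_retry_intervals value such as "1000,5000" into milliseconds."""
    intervals = [int(part.strip()) for part in raw.split(",") if part.strip()]
    if not intervals:
        raise ValueError("batch_retry_intervals must contain at least one value")
    return intervals


def retry_delays(batch_ttl: int, intervals_ms: List[int]) -> Iterator[int]:
    """Yield the wait (in ms) before each retry of a batch that was not persisted.

    batch_ttl = 0 yields nothing (no retries), batch_ttl = -1 yields forever.
    When there are more retries than intervals, the last interval is repeated.
    """
    if not intervals_ms:
        raise ValueError("at least one retry interval is required")
    attempt = 0
    while batch_ttl == -1 or attempt < batch_ttl:
        yield intervals_ms[min(attempt, len(intervals_ms) - 1)]
        attempt += 1


def hive_database(db_type: str, service_as_namespace: bool,
                  hdfs_username: str, fiware_service: str) -> str:
    """Return the Hive database name implied by hive.db_type (hypothetical helper)."""
    if db_type == "default-db":
        return "default"  # assumption: Hive's built-in database named "default"
    if db_type == "namespace-db":
        return fiware_service if service_as_namespace else hdfs_username
    raise ValueError("hive.db_type must be default-db or namespace-db")


if __name__ == "__main__":
    # Four retries with two configured intervals: the last interval repeats.
    print(list(retry_delays(4, parse_intervals("1000,5000"))))
    # Infinite TTL: take only the first three delays to avoid looping forever.
    print(list(islice(retry_delays(-1, [5000]), 3)))
    print(hive_database("namespace-db", True, "hdfs_user", "smartcity"))
```

Modelling the infinite TTL as an unbounded generator shows why the table warns about it: nothing in the schedule itself ever stops, so batches that keep failing stay in the channel until they are persisted.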
