OrionHDFSSink

Content:

Functionality
Administration guide
Programmers guide

Functionality

`com.telefonica.iot.cygnus.sinks.OrionHDFSSink`, or simply `OrionHDFSSink`, is a sink designed to persist NGSI-like context data events within an HDFS deployment. Usually, such context data is notified by an Orion Context Broker instance, but it could come from any other system speaking the NGSI language.

Independently of the data generator, NGSI context data is always transformed into internal Flume events at Cygnus sources. In the end, the information within these Flume events must be mapped into specific HDFS data structures at the Cygnus sinks.

The next sections explain this in detail.

Parameter	Mandatory	Default value	Comments
type	yes	N/A	Must be com.telefonica.iot.cygnus.sinks.OrionHDFSSink
channel	yes	N/A
enable_grouping	no	false	true or false.
enable_lowercase	no	false	true or false.
data_model	no	dm-by-entity	Always dm-by-entity, even if not configured.
file_format	no	json-row	json-row, json-column, csv-row or csv-column.
backend_impl	no	rest	rest, if a WebHDFS/HttpFS-based implementation is used when interacting with HDFS; or binary, if a Hadoop API-based implementation is used when interacting with HDFS.
hdfs_host	no	localhost	FQDN/IP address where HDFS Namenode runs, or comma-separated list of FQDN/IP addresses where HDFS HA Namenodes run.
cosmos_host (deprecated)	no	localhost	FQDN/IP address where HDFS Namenode runs, or comma-separated list of FQDN/IP addresses where HDFS HA Namenodes run. Still usable; if both are configured, `hdfs_host` is preferred.
hdfs_port	no	14000	14000 if using HttpFS (rest), 50070 if using WebHDFS (rest), 8020 if using the Hadoop API (binary).
cosmos_port (deprecated)	no	14000	14000 if using HttpFS (rest), 50070 if using WebHDFS (rest), 8020 if using the Hadoop API (binary). Still usable; if both are configured, `hdfs_port` is preferred.
hdfs_username	yes	N/A	If `service_as_namespace=false` then it must be an already existing user in HDFS. If `service_as_namespace=true` then it must be an HDFS superuser.
cosmos_default_username (deprecated)	yes	N/A	If `service_as_namespace=false` then it must be an already existing user in HDFS. If `service_as_namespace=true` then it must be an HDFS superuser. Still usable; if both are configured, `hdfs_username` is preferred.
hdfs_password	yes	N/A	Password for the above `hdfs_username`/`cosmos_default_username`; this is only required for Hive authentication.
oauth2_token	yes	N/A	OAuth2 token required for the HDFS authentication.
service_as_namespace	no	false	If configured as true then the `fiware-service` (or the default one) is used as the HDFS namespace instead of `hdfs_username`/`cosmos_default_username`, which in this case must be an HDFS superuser.
csv_separator	no	,
batch_size	no	1	Number of events accumulated before persistence.
batch_timeout	no	30	Number of seconds a batch is allowed to accumulate before it is persisted as is.
batch_ttl	no	10	Number of retries when a batch cannot be persisted. Use `0` for no retries, `-1` for infinite retries. Note that an infinite TTL (or even a very large one) may consume all the sink's channel capacity very quickly.
hive	no	true	true or false.
hive_server_version (deprecated)	no	2	`1` if the remote Hive server runs HiveServer1 or `2` if the remote Hive server runs HiveServer2. Still usable; if both are configured, `hive.server_version` is preferred.
hive.server_version	no	2	`1` if the remote Hive server runs HiveServer1 or `2` if the remote Hive server runs HiveServer2.
hive_host (deprecated)	no	localhost	Still usable; if both are configured, `hive.host` is preferred.
hive.host	no	localhost
hive_port (deprecated)	no	10000	Still usable; if both are configured, `hive.port` is preferred.
hive.port	no	10000
hive.db_type	no	default-db	default-db or namespace-db. If `hive.db_type=default-db` then the default Hive database is used. If `hive.db_type=namespace-db` and `service_as_namespace=false` then the `hdfs_username` is used as Hive database. If `hive.db_type=namespace-db` and `service_as_namespace=true` then the notified fiware-service is used as Hive database.
krb5_auth	no	false	true or false.
krb5_user	yes	empty	Ignored if `krb5_auth=false`, mandatory otherwise.
krb5_password	yes	empty	Ignored if `krb5_auth=false`, mandatory otherwise.
krb5_login_conf_file	no	/usr/cygnus/conf/krb5_login.conf	Ignored if `krb5_auth=false`.
krb5_conf_file	no	/usr/cygnus/conf/krb5.conf	Ignored if `krb5_auth=false`.
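
As a rough illustration of how the parameters above might appear in `conf/cygnus.conf`, the following is a minimal sketch rather than a reference configuration. The agent name `cygnusagent`, the sink name `hdfs-sink`, the channel name `hdfs-channel`, and every host, user and credential value are placeholders assumed for this example; only the parameter names and the sink class name come from the table. Source and channel definitions are omitted.

```properties
# Hypothetical agent (cygnusagent), sink (hdfs-sink) and channel (hdfs-channel) names.
cygnusagent.sinks.hdfs-sink.type = com.telefonica.iot.cygnus.sinks.OrionHDFSSink
cygnusagent.sinks.hdfs-sink.channel = hdfs-channel

# REST backend through HttpFS; host and port values are placeholders.
cygnusagent.sinks.hdfs-sink.backend_impl = rest
cygnusagent.sinks.hdfs-sink.hdfs_host = namenode.example.org
cygnusagent.sinks.hdfs-sink.hdfs_port = 14000

# Placeholder credentials; the user must already exist in HDFS
# because service_as_namespace is false.
cygnusagent.sinks.hdfs-sink.hdfs_username = myuser
cygnusagent.sinks.hdfs-sink.hdfs_password = mypassword
cygnusagent.sinks.hdfs-sink.oauth2_token = mytoken
cygnusagent.sinks.hdfs-sink.service_as_namespace = false

# Persistence format and batching; batch_size is an example value (default is 1).
cygnusagent.sinks.hdfs-sink.file_format = json-row
cygnusagent.sinks.hdfs-sink.batch_size = 100
cygnusagent.sinks.hdfs-sink.batch_timeout = 30
cygnusagent.sinks.hdfs-sink.batch_ttl = 10

# Hive and Kerberos are disabled in this sketch.
cygnusagent.sinks.hdfs-sink.hive = false
cygnusagent.sinks.hdfs-sink.krb5_auth = false
```

Per the table, `hdfs_port` would typically change to 50070 when WebHDFS is used, or to 8020 together with `backend_impl = binary`. Comments are kept on their own lines because a Flume properties file treats trailing text on a value line as part of the value.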

OrionHDFSSink

Functionality

Mapping NGSI events to Flume events

Mapping Flume events to HDFS data structures

Hive

Example

Administration guide

Configuration

Use cases

Important notes

About the persistence mode

About the binary backend

About batching

Programmers guide

OrionHDFSSink class

HDFSBackendImpl class

OAuth2 authentication

Kerberos authentication

conf/cygnus.conf

conf/krb5_login.conf

conf/krb5.conf
