Spark Streaming with Kafka and HDFS
There are a lot of resources on Spark Streaming consuming messages from Kafka while running on top of HDFS. It gets slightly tricky when both Kafka and HDFS are secured with Kerberos and the two belong to different realms. If cross-realm trust is set up between the Kafka and HDFS realms, we can use a single principal/keytab for both. If not, Spark must be configured as follows.
HDFS auth:
In YARN cluster deploy mode, the Spark configs spark.yarn.principal and spark.yarn.keytab must be set with the Hadoop user's credentials so that YARN can periodically obtain Kerberos tickets to access HDFS. This copies the keytab over to HDFS for YARN to access. YARN gets a new ticket once the current ticket reaches 80% of its lifetime. Since this involves copying the keytab to HDFS, we must make sure HDFS has a proper access policy in place so that unauthorised access to the keytab file is prevented. These two configs can also be passed to spark-submit with the --principal and --keytab options, as shown below.
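For instance, a submit command with these options might look like the following sketch (the principal, keytab path, class, and jar names are placeholders):

    # placeholder principal/keytab; substitute your Hadoop user's credentials
    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --principal sparkuser@HADOOP.EXAMPLE.COM \
      --keytab /etc/security/keytabs/sparkuser.keytab \
      --class com.example.StreamingJob \
      streaming-job.jar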
Otherwise, if we don't want to ship the confidential keytab to HDFS, we can leave these Spark configs unset. This works by obtaining a Kerberos ticket just before submitting the Spark job and letting Spark use that ticket to get a DELEGATION_TOKEN for HDFS access. The disadvantage of this approach is that each ticket has a maximum lifetime; once that is reached, the ticket is purged. The DELEGATION_TOKEN based on the original ticket then becomes invalid as well, and the Spark job fails. This flow is sketched below.
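As a sketch (principal and paths again placeholders), the keytab-free flow is a kinit followed by a plain spark-submit; Spark picks up the ticket from the default credential cache and exchanges it for an HDFS DELEGATION_TOKEN at submit time:

    # obtain a Kerberos ticket in the local credential cache
    kinit -kt /etc/security/keytabs/sparkuser.keytab sparkuser@HADOOP.EXAMPLE.COM

    # no --principal/--keytab: nothing confidential is copied to HDFS,
    # but the job dies once the ticket's max lifetime expires
    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --class com.example.StreamingJob \
      streaming-job.jar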
Kafka auth:
The Kafka Java client supports JAAS configuration for accessing a secured Kafka cluster. We need to provide the Kafka principal, its keytab, krb5.conf, and jaas.conf. jaas.conf specifies which auth method to use and provides the credentials it requires. krb5.conf holds the Kerberos-related config and tells the JVM how to contact the KDC servers.
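A minimal jaas.conf might look like the following (realm, principal, and keytab path are placeholders); KafkaClient is the login context name the Kafka client looks up by default:

    // login context used by the Kafka Java client
    KafkaClient {
      com.sun.security.auth.module.Krb5LoginModule required
      useKeyTab=true
      storeKey=true
      keyTab="/etc/security/keytabs/kafkaclient.keytab"
      principal="kafkaclient@KAFKA.EXAMPLE.COM";
    };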
Please note that I merged the Hadoop KDC server and Kafka KDC server locations into a single krb5.conf, since the JVM accepts only one such file. If Spark supported a JAAS config of its own, we could have provided the Hadoop auth mechanism in jaas.conf as well, but it doesn't support that yet.
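For reference, a merged krb5.conf covering both realms could look like this sketch (realm names and KDC hosts are placeholders):

    [libdefaults]
      default_realm = HADOOP.EXAMPLE.COM

    [realms]
      # Hadoop/HDFS realm KDC
      HADOOP.EXAMPLE.COM = {
        kdc = kdc.hadoop.example.com
      }
      # Kafka realm KDC
      KAFKA.EXAMPLE.COM = {
        kdc = kdc.kafka.example.com
      }

    [domain_realm]
      .hadoop.example.com = HADOOP.EXAMPLE.COM
      .kafka.example.com = KAFKA.EXAMPLE.COM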
Finally, putting it all together, the spark-submit invocation looks as below.
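A sketch under the placeholder names used above: --files ships jaas.conf, krb5.conf, and the Kafka keytab into each container's working directory, and the extra Java options point the driver and executor JVMs at them:

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --principal sparkuser@HADOOP.EXAMPLE.COM \
      --keytab /etc/security/keytabs/sparkuser.keytab \
      --files jaas.conf,krb5.conf,kafkaclient.keytab \
      --driver-java-options "-Djava.security.auth.login.config=jaas.conf -Djava.security.krb5.conf=krb5.conf" \
      --conf spark.executor.extraJavaOptions="-Djava.security.auth.login.config=jaas.conf -Djava.security.krb5.conf=krb5.conf" \
      --class com.example.StreamingJob \
      streaming-job.jar

Since --files lands the keytab in the container's working directory, the keyTab entry in jaas.conf should then reference the relative name (keyTab="./kafkaclient.keytab") rather than an absolute local path.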