Spark Streaming with Kafka and HDFS
There are a lot of resources on using Spark Streaming to consume messages from Kafka while running on top of HDFS. It gets slightly tricky, however, when both Kafka and HDFS are secured with Kerberos and the two belong to different realms.

If cross-realm trust is set up between the Kafka and HDFS realms, a single principal/keytab can be used for both. If not, Spark must be configured as follows.

HDFS auth: In YARN cluster deploy mode, the Spark configs spark.yarn.principal and spark.yarn.keytab must be set with the Hadoop user's credentials so that YARN can periodically obtain Kerberos tickets to access HDFS. This copies the keytab over to HDFS for YARN to access, and YARN will obtain a new ticket once the current ticket reaches 80% of its lifetime. Since this involves copying the keytab to HDFS, we must make sure HDFS has a proper access-control policy in place so that unauthorised access to the keytab file is prevented. These two configs can also be passed to spark-submit with --princ...
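As a minimal sketch of the HDFS side of this setup, a submission in YARN cluster mode with keytab-based authentication might look like the following. The principal name, keytab path, and application jar here are placeholders, not values from any particular cluster:

```shell
# Sketch: YARN cluster mode with keytab-based HDFS authentication.
# Principal, keytab path, and jar name below are hypothetical placeholders.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.principal=sparkuser@HDFS.EXAMPLE.COM \
  --conf spark.yarn.keytab=/etc/security/keytabs/sparkuser.keytab \
  my-streaming-app.jar
```

Because the keytab ends up on HDFS, the local copy should also be readable only by the submitting user (e.g. `chmod 400` on the keytab file).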