Posts

Showing posts from January, 2019

Spark streaming with Kafka and HDFS

There are a lot of resources on Spark streaming that consumes messages from Kafka while running on top of HDFS. It is slightly tricky to get it working when both Kafka and HDFS are secured with Kerberos and they belong to different realms. If cross-realm trust is set up between Kafka and HDFS, we can use a single principal/keytab for both realms. If not, Spark must be configured as follows. HDFS auth: In YARN cluster deploy mode, the Spark configs spark.yarn.principal and spark.yarn.keytab must be set with the Hadoop user's credentials so that YARN can periodically obtain Kerberos tickets to access HDFS. This copies the keytab over to HDFS for YARN to access. YARN will get a new ticket once the current ticket reaches 80% of its lifetime. Since this involves copying the keytab to HDFS, we must make sure HDFS has a proper access security policy and that unauthorised access to the keytab file is prevented. These two configs can also be passed to spark-submit with --princ...
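
As a rough sketch (the principal, keytab path, class and jar names below are placeholders), the credentials can be handed to spark-submit for a YARN cluster-mode job; the same values can also be set as spark.yarn.principal / spark.yarn.keytab via --conf or spark-defaults.conf:

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --principal hdfs-user@HADOOP.REALM \
      --keytab /etc/security/keytabs/hdfs-user.keytab \
      --class com.example.KafkaToHdfsStreaming \
      kafka-to-hdfs-streaming.jar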

Spring Data MongoDB secondary node reads

Reading from a secondary MongoDB node is one of the simplest ways to scale the number of reads from the application. By default, all reads and writes go to the primary. Although the MongoDB docs caution against this option, there are certain scenarios where it's useful:
- data is mostly static and changes, if at all, rarely
- reporting systems, where lagging data is fine
- a high number of reads with relatively few writes
The risk is that data read from a secondary might not be the latest, so we have to be careful in choosing which queries are sent to the secondary. It's best to consider setting SecondaryPreferred with maxStaleness and let the driver determine whether to send a query to a secondary or fall back to the primary. We can achieve this behaviour by making use of Read Preference. Spring Data MongoDB: Spring provides two APIs to set the Read Preference. One is via MongoTemplate and the other is through query flags (slave ok). Option 1: MongoTemplate: This approach is slightly tedio...
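
A minimal sketch of the MongoTemplate option, assuming the MongoDB Java driver 3.4+ (needed for maxStaleness) and a Spring Data MongoDB version that exposes MongoDbFactory; the bean name and the 120-second staleness bound are illustrative:

    import java.util.concurrent.TimeUnit;

    import com.mongodb.ReadPreference;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.data.mongodb.MongoDbFactory;
    import org.springframework.data.mongodb.core.MongoTemplate;

    @Configuration
    public class SecondaryReadConfig {

        @Bean
        public MongoTemplate reportingMongoTemplate(MongoDbFactory mongoDbFactory) {
            MongoTemplate template = new MongoTemplate(mongoDbFactory);
            // Prefer secondaries, but let the driver fall back to the primary
            // when every secondary is more than 120 seconds behind.
            template.setReadPreference(ReadPreference.secondaryPreferred(120, TimeUnit.SECONDS));
            return template;
        }
    }

With this in place, reads issued through this template go to a secondary when one is fresh enough, and to the primary otherwise, which matches the SecondaryPreferred-with-maxStaleness behaviour described above.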

Troubleshooting & monitoring MongoDB performance

Trouble: In our production systems we started to receive alerts that the number of current Apache connections was above the threshold (6k) for a few minutes. It went from 1k up to a maximum of 17k in 5 minutes and then returned to 1k in about 10 minutes. It was triggered by a spike in traffic from 500 req/sec to 1500 req/sec in under 2 minutes. This was puzzling to us because it's usual in our products for traffic to almost triple for a few minutes during peak time and then come back to normal levels, and it had never caused system performance degradation before. Architecture: MongoDB 3.2 on CentOS 6, with one primary and two secondaries in a replica set. Data profile: It hosts two databases, one with app-related config collections and the other with app-related KPI log collections:
- config - very low write, high read; total data size is < 1GB
- KPI logs - constant write load from log forwarders (Fluentd), occasional reads from a reporting system; total data size is around 500GB
Troubleshooting: I started...
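
As a side note on the monitoring part, one way to sample the mongod-side connection counters while such a spike is in progress is the serverStatus command; a small sketch with the MongoDB Java driver (host and port are placeholders):

    import org.bson.Document;

    import com.mongodb.MongoClient;

    public class ConnectionMonitor {
        public static void main(String[] args) {
            MongoClient client = new MongoClient("localhost", 27017);
            try {
                // serverStatus reports, among other things, the connection counters.
                Document status = client.getDatabase("admin")
                        .runCommand(new Document("serverStatus", 1));
                Document connections = status.get("connections", Document.class);
                // "current" is the number of open connections; "available" is what
                // remains before the configured/ulimit-derived maximum is reached.
                System.out.println("current:   " + connections.getInteger("current"));
                System.out.println("available: " + connections.getInteger("available"));
            } finally {
                client.close();
            }
        }
    }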