Troubleshooting & monitoring MongoDB performance
Trouble:
In our production systems we started to receive alerts that the number of current Apache connections was above the threshold (6k) for a few minutes. It went from 1k up to a maximum of 17k in 5 minutes and then returned to 1k in about 10 minutes. It was triggered by a spike in traffic from 500 req/sec to 1500 req/sec in under 2 minutes. This was puzzling to us because, in our products, it is usual for traffic to almost triple for a few minutes during peak time and then come back to normal, but it had never caused system performance degradation before.
Architecture:
MongoDB 3.2 on CentOS 6, deployed as a replica set with one primary and two secondaries.
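As a quick sanity check, the replica set topology can be confirmed from the mongo shell. This is only a sketch; the member names in the output will of course be whatever is in your replica set config.

    // Print each replica set member with its role and health.
    rs.status().members.forEach(function (m) {
        print(m.name + "  " + m.stateStr + "  health=" + m.health);
    });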
Data profile:
It hosts two databases: one holds app-related config collections, the other holds app-related KPI log collections.
config - very low write, high read. Total data size is < 1 GB.
KPI logs - constant write load from log forwarders (Fluentd), occasional reads from the reporting system. Total data size is around 500 GB (a quick check for these sizes is sketched below).
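A quick way to confirm this data profile from the mongo shell is to scale the db.stats() output; the database names below ("app_config" and "kpi_logs") are placeholders for our real ones.

    // Print data and storage size of both databases, in MB.
    ["app_config", "kpi_logs"].forEach(function (name) {
        var s = db.getSiblingDB(name).stats(1024 * 1024);
        print(name + ": dataSize=" + s.dataSize + " MB, storageSize=" + s.storageSize + " MB");
    });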
Troubleshooting:
I started to look at the metrics pages of all the connected systems and the DB servers. It was apparent that the issue was caused by high load on the dedicated MongoDB servers. Although total CPU usage was only around 45%, the load average reached 150. Interestingly, while the load average was high, the CPU idle percentage was about 45% and there was no I/O wait.
The next day I used the Linux command sar -q to see what happened to the CPU run queue during the issue. For a short while the run queue built up to 100-150, which is what drove up the load average. It is strange that the CPU was free for almost 50% of the time, yet threads were waiting in the run queue. These threads were definitely not waiting for I/O, since the I/O wait percentage was 0; there was contention among them, probably for locks or some other shared resource.
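The run queue numbers above came from the OS side (sar -q). A related thing to watch on the MongoDB side is whether operations are queuing up inside mongod while waiting for locks; a minimal sample from the mongo shell looks like the sketch below (it only shows an instantaneous snapshot, so it has to be run repeatedly during the spike).

    // Operations queued waiting for a lock, and currently active readers/writers.
    var gl = db.serverStatus().globalLock;
    print("queued: readers=" + gl.currentQueue.readers + " writers=" + gl.currentQueue.writers);
    print("active: readers=" + gl.activeClients.readers + " writers=" + gl.activeClients.writers);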
In the meantime, to avoid service impact, I started deleting some of the old KPI log data from MongoDB as a workaround, but unfortunately it didn't help. As a reminder, of the two databases, the config db is small with few writes and lots of reads, while the KPI logs db is huge with constant writes and few reads. Next, as a temporary measure, I added an in-memory cache for the config db in the API servers themselves; this reduced the traffic to MongoDB, which in turn reduced the load-average spikes and the service impact.
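For the record, the old-data cleanup was just a ranged delete on the KPI logs collection, roughly like the sketch below; the collection name and timestamp field are hypothetical, and the delete itself adds write load of its own, which may be part of why it did not help.

    // Remove KPI log documents older than 30 days (names are placeholders).
    var cutoff = new Date(Date.now() - 30 * 24 * 60 * 60 * 1000);
    db.getSiblingDB("kpi_logs").events.deleteMany({ createdAt: { $lt: cutoff } });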
I had collected the output of some MongoDB diagnostic commands to help debug this issue. First I looked at db.serverStatus().wiredTiger.concurrentTransactions to check whether there was contention for read/write tickets. That looked fine: there were ample tickets available for concurrent reads and writes. Next I found something interesting in db.serverStatus().wiredTiger.cache: the WiredTiger cache was 80% full, and the counters for pages read into cache and unmodified pages evicted were high. This means the cache was almost full and MongoDB was busy finding pages to evict in order to make room for newer pages. Effectively, the bulk of the data was being cycled through the cache, and the eviction process was becoming the bottleneck. Now it made sense: the constant high-volume writes to the KPI logs db were affecting the relatively low-write, high-read config db hosted on the same MongoDB server. It might also be related to a WiredTiger eviction issue.
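These are roughly the fields I was looking at; the sketch below simply pulls them out of db.serverStatus() in the mongo shell.

    // Read/write ticket availability and cache pressure indicators.
    var ss = db.serverStatus();
    var ct = ss.wiredTiger.concurrentTransactions;
    var c  = ss.wiredTiger.cache;
    print("read tickets available:   " + ct.read.available + "/" + ct.read.totalTickets);
    print("write tickets available:  " + ct.write.available + "/" + ct.write.totalTickets);
    print("cache used: " + (100 * c["bytes currently in the cache"] / c["maximum bytes configured"]).toFixed(1) + "%");
    print("pages read into cache:    " + c["pages read into cache"]);
    print("unmodified pages evicted: " + c["unmodified pages evicted"]);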
Solution:
For the KPI logs we started to use Hadoop instead of MongoDB, so the KPI logs db was taken out of MongoDB completely and all of its data was dropped. This freed a lot of space, both on disk and in the memory cache. Now the cache usage is only 20% and there are no more evictions. There is no load-average spike during peak any more, but I can observe an increasing CPU user-time percentage. That is expected: when traffic increases, the CPU user percentage increases (up to 35%), there is no disk I/O, and the load average stays well below 1. Now that the whole dataset fits in the memory cache, MongoDB has become CPU bound. In the short term, to scale this setup, I plan to send reads for the least-changing collections to the secondary nodes, as sketched below. In the long term it can be scaled up by adding more CPUs.
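A minimal sketch of that short-term plan, assuming a shell/driver-level read preference is acceptable for those rarely-changing collections (database and collection names are placeholders):

    // Route reads for rarely-changing config collections to the secondaries.
    // Per connection:
    db.getMongo().setReadPref("secondaryPreferred");
    // Or per query:
    db.getSiblingDB("app_config").settings.find({ enabled: true }).readPref("secondaryPreferred");

Since these collections change rarely, reading slightly stale data from a secondary is an acceptable trade-off.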