Why you should (generally) avoid using include_docs in Cloudant and CouchDB view queries

One of my most often repeated pieces of performance advice when using CouchDB or Cloudant is to avoid using include_docs=true in view queries. When you look at the work CouchDB needs to do, the reason for the recommendation becomes obvious.

During a normal view query, CouchDB only needs to read a single file on disk, the view index, and it streams results directly from that file. The index is stored as a B-tree. Therefore, if you are reading the entire index or doing a range query with startkey and endkey, CouchDB can find the appropriate starting point in the file and then read until it reaches the end of the index or the endkey. Not much data needs to be held in memory, as it can go straight onto the wire. Because any data emitted by the map function is stored inline in the B-tree, it’s very fast to stream this as part of the response.
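To make the range-scan idea concrete, here is a minimal sketch in JavaScript. It is a toy model, not CouchDB’s implementation: a sorted array stands in for the on-disk B-tree, plain string comparison stands in for CouchDB’s collation rules, and the names index, lowerBound and rangeQuery are made up for illustration.

```javascript
// Toy stand-in for a view index: rows kept sorted by key, with the
// emitted value stored inline alongside the key and document ID.
const index = [
  { key: "apple", id: "doc3", value: 1 },
  { key: "banana", id: "doc1", value: 2 },
  { key: "cherry", id: "doc5", value: 3 },
  { key: "damson", id: "doc2", value: 4 },
  { key: "elder", id: "doc4", value: 5 },
];

// Binary search for the position of the first row whose key is >= startkey.
function lowerBound(rows, startkey) {
  let lo = 0;
  let hi = rows.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (rows[mid].key < startkey) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// Yield rows one at a time from startkey up to and including endkey.
// Only the current row is held in memory, mimicking streaming to the wire.
function* rangeQuery(rows, startkey, endkey) {
  for (let i = lowerBound(rows, startkey); i < rows.length; i++) {
    if (rows[i].key > endkey) break;
    yield rows[i];
  }
}

const result = [...rangeQuery(index, "banana", "damson")].map((r) => r.key);
console.log(result);
```

The point of the sketch is that one seek finds the start and everything after that is a sequential read of data already sitting in the index.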

When you use include_docs=true, CouchDB has a lot more work to do. In addition to streaming out the view row data as above, CouchDB has to read each and every document referenced by the view rows it returns. Briefly, this involves:

Loading up the database’s primary data file.
Using the document ID index contained in that file to find the offset within that file where the document data resides.
Finally, reading the document data itself before returning the row to the client.

Given the view is ordered by some field in the document rather than by doc ID, this is essentially a lot of random document lookups within the main data file. That’s a lot of extra tree traversals and data read from disk.
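A small sketch of why those lookups end up scattered. This is a toy model with assumed names (dataFile, byId, viewRows) and a hypothetical score field: an array plays the primary data file, a Map plays the document ID index, and the view is ordered by score rather than by ID.

```javascript
// Toy primary data file: each array position stands for a file offset.
const dataFile = [
  { _id: "a", score: 30 },
  { _id: "b", score: 10 },
  { _id: "c", score: 50 },
  { _id: "d", score: 20 },
  { _id: "e", score: 40 },
];

// Toy document ID index: doc ID -> offset in the data file.
const byId = new Map(dataFile.map((doc, offset) => [doc._id, offset]));

// View rows ordered by the emitted key (score), not by document ID.
const viewRows = dataFile
  .map((doc) => ({ key: doc.score, id: doc._id }))
  .sort((x, y) => x.key - y.key);

// include_docs=true: for every row, one ID-index lookup plus one document read.
const offsetsVisited = viewRows.map((row) => byId.get(row.id));
const rowsWithDocs = viewRows.map((row) => ({
  ...row,
  doc: dataFile[byId.get(row.id)],
}));

console.log(offsetsVisited);
```

Walking the view in key order visits offsets 1, 3, 0, 4, 2 in this toy data, which is the jumping-around pattern described above, repeated once per row.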

While in theory this is going to be much slower, and many people I trust have told me so, I’d not done a simple benchmark to get a feel for the difference myself. So I finally got around to doing a quick experiment to see what kind of effect this has. It was just on my MacBook Air (Mid-2012, 2GHz i7, 8GB RAM, SSD), using CouchDB 1.6.1 as a single-node instance, so the specific values are fairly meaningless. The process:

I uploaded 100,000 identical tiny documents to the CouchDB instance. The tiny document hopefully minimises the actual document data read time vs. the lookups involved in reading data.
I created two views, one which saved the document data into the index and one which emitted null instead.
I pre-warmed the views by retrieving each to make sure that CouchDB had built them.
I did a few timed runs of retrieving every row in each view in a single call. For the null view, I timed both include_docs=true and include_docs=false.

The design document holding both views was simply:

{
   "_id": "_design/test",
   "language": "javascript",
   "views": {
       "emitdoc": {
           "map": "function(doc) {\n  emit(doc.id, doc);\n}"
       },
       "emitnull": {
           "map": "function(doc) {\n  emit(doc.id, null);\n}"
       }
   }
}

And each document looked like:

{
   "_id": "0d469cdd8a7c054bf5eed0c954000ba4",
   "value1": "sjntwpacttftohms"
}

I then called each view and read through the whole thing, all 100,000 rows. I timed the calls using curl. It’s not very statistically rigorous, but I don’t think you need to be for this magnitude of difference. For kicks, I also eye-balled the CPU usage in top during each call and guessed an average.

Test	Time, seconds	Eye-balled CPU
`emitdoc`	5.821	105%
`emitnull`	4.502	99%
`emitnull?include_docs=true`	48.492	140%

The headline result is that reading the document from the view index itself (emitdoc) was just over 8x faster than using include_docs. It’s also significantly less computationally expensive. There’s also a difference between reading emitnull and emitdoc, though far less pronounced.
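For reference, the ratios behind those statements can be worked out from the table. This is just arithmetic on the three timings above; the variable names are made up for the sketch.

```javascript
// Timings in seconds, copied from the results table.
const timings = { emitdoc: 5.821, emitnull: 4.502, includeDocs: 48.492 };

// How many times slower `slow` is than `fast`, to two decimal places.
const ratio = (slow, fast) => (slow / fast).toFixed(2);

const includeDocsVsEmitdoc = ratio(timings.includeDocs, timings.emitdoc);
const includeDocsVsEmitnull = ratio(timings.includeDocs, timings.emitnull);
const emitdocVsEmitnull = ratio(timings.emitdoc, timings.emitnull);

console.log(includeDocsVsEmitdoc);  // "8.33"
console.log(includeDocsVsEmitnull); // "10.77"
console.log(emitdocVsEmitnull);     // "1.29"
```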

This was done on CouchDB 1.6.1 on my laptop. So while it wasn’t a Cloudant or CouchDB cluster, given clustered query processing and clustered read behaviour, I would say that the results there would be similar or worse.

While this is a read of 100,000 documents, which you might say is unusual, over the many calls an application will make for smaller numbers of documents this kind of difference will add up. In addition, it adds a lot of load to your CouchDB instance, and likely screws around with disk caches and the like.

So, broadly, it seems pretty sound advice to avoid include_docs=true in practice as well as in theory.

As a bonus, here’s how to time things using curl.
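The same kind of measurement can also be scripted. Here is a rough sketch in JavaScript, assuming Node 18 or later (for the built-in fetch), a local CouchDB on the default port 5984, and a database called testdb, which is a made-up name. The helper names viewUrl and timeView are likewise just for illustration, and this is not the method used for the numbers above.

```javascript
// Build the URL for a view in a design document, optionally with include_docs=true.
function viewUrl(base, db, ddoc, view, includeDocs) {
  const url = new URL(`${base}/${db}/_design/${ddoc}/_view/${view}`);
  if (includeDocs) url.searchParams.set("include_docs", "true");
  return url.toString();
}

// Fetch a URL, read the whole body, and return the elapsed wall-clock seconds.
async function timeView(url) {
  const start = performance.now();
  const response = await fetch(url);
  await response.text(); // consume the full response, as curl would
  return (performance.now() - start) / 1000;
}

// The three requests from the experiment ("testdb" is an assumed database name).
const base = "http://localhost:5984";
const urls = [
  viewUrl(base, "testdb", "test", "emitdoc", false),
  viewUrl(base, "testdb", "test", "emitnull", false),
  viewUrl(base, "testdb", "test", "emitnull", true),
];

// Against a live instance, something like this would print one timing per view:
// for (const u of urls) timeView(u).then((s) => console.log(u, s.toFixed(3)));
console.log(urls);
```

Timings taken this way include the client’s own overhead, so they are best compared against each other rather than against curl’s numbers.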

Addendum

I was asked on Twitter what I’d recommend overall for when to use include_docs. It’s a bit of a judgement call.

The core trade-off is one of space vs. query latency. Emitting entire documents into the index is expensive in terms of space. But, as shown above, it speeds up retrieval significantly. Ideally, of course, you’d be able to emit a subset of the fields from a document, but often that’s not possible.
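As an illustration of the subset option, a map function can emit just the fields a query needs rather than the whole document. The sketch below is hypothetical: the title, price and description fields are invented for the example, and the emit stub exists only so the function can be run outside CouchDB.

```javascript
// Outside CouchDB there is no emit(), so this stub collects rows for inspection.
const rows = [];
function emit(key, value) {
  rows.push({ key, value });
}

// Map function that stores only two small fields in the index,
// leaving the bulky description in the primary data file.
function map(doc) {
  emit(doc.value1, { title: doc.title, price: doc.price });
}

// A hypothetical document with one large field the query does not need.
map({
  _id: "0d469cdd8a7c054bf5eed0c954000ba4",
  value1: "sjntwpacttftohms",
  title: "Widget",
  price: 5,
  description: "A long block of text that would bloat the index if emitted.",
});

console.log(rows);
```

In a real design document only the body of map would be stored, as a string, in the same way as the emitdoc and emitnull views earlier.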

My decision tree would start something like this:

If the query happens often – many times a minute for example – it’s worth emitting the documents into the index. The query latency being lower will help your overall throughput.
For any query running more than once every ten minutes or so, when retrieving a lot of documents – many hundreds – I’d consider emitting the documents regardless of latency requirements. Reading many documents from the primary data file will chew up disk caches and internal CouchDB caches.
If the query is rare and need not run at optimal speeds, go ahead and use include_docs. For rarely used queries, you might as well read the documents at query time and save the disk space.
For relatively rare queries, a few a minute, if speed is important (e.g. it’s for a UI), I’d consider the number of documents retrieved. If it’s just one or two, the extra latency in using include_docs probably isn’t going to matter. If it’s a lot of documents, the delay may become unacceptable. This one is particularly application dependent.
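The rules of thumb above can be written down as a small function. Treat this strictly as a sketch: the numeric thresholds are guesses standing in for phrases like “many times a minute” and “many hundreds”, and chooseStrategy and its parameter names are invented for the example.

```javascript
// Illustrative thresholds only; these are assumptions, not CouchDB settings.
const OFTEN_PER_MINUTE = 10;    // "many times a minute"
const REGULAR_PER_MINUTE = 0.1; // "more than once every ten minutes"
const MANY_DOCS = 300;          // "many hundreds" of documents
const FEW_DOCS = 2;             // "just one or two" documents

// Returns "emit-doc" (store the data in the index) or "include_docs".
function chooseStrategy({ queriesPerMinute, docsPerQuery, latencySensitive }) {
  // Frequent queries: lower latency helps overall throughput.
  if (queriesPerMinute >= OFTEN_PER_MINUTE) return "emit-doc";

  // Regular queries reading many documents: protect the caches.
  if (queriesPerMinute > REGULAR_PER_MINUTE && docsPerQuery >= MANY_DOCS) {
    return "emit-doc";
  }

  // A few queries a minute where speed matters: depends on the row count.
  if (latencySensitive && queriesPerMinute >= 1) {
    return docsPerQuery <= FEW_DOCS ? "include_docs" : "emit-doc";
  }

  // Rare queries that need not be fast: save the disk space.
  return "include_docs";
}

console.log(
  chooseStrategy({ queriesPerMinute: 3, docsPerQuery: 50, latencySensitive: true })
);
```

Writing it out this way mostly shows how many of the boundaries are judgement calls, which is why measuring against a real workload still matters.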

This would help me decide what the first iteration of my index would look like, but I’d want to monitor and tweak the index over time if it appeared slow. As always, testing different configurations is the best strategy, but hopefully the above saves a little time.