Querying Cloudant: what are stale, update and stable?

tl;dr If you are using stale=ok in queries to Cloudant or CouchDB 2.x, you most likely want to be using update=false instead. If you are using stale=update_after, use update=lazy instead.

This question has come up a few times, so here’s a reference to what the situation is with these parameters to query requests in Cloudant and CouchDB 2.x.

CouchDB originally used stale=ok on the query string to specify that you were okay with receiving out-of-date results. By default, CouchDB lazily updates indexes upon querying them rather than when JSON data is changed or added. If up to date results are not strictly required, using stale=ok provides a latency improvement for queries as the request does not have to wait for indexes to be updated before returning results. This is particularly useful for databases with a high write rate.

As an aside, Cloudant automatically enqueues indexes for update when primary data changes, so this problem isn’t so acute. However, in the face of high update rate bursts, it’s still possible for indexing to fall behind so a delay may occur.

When using a single node, as in CouchDB 1.x, this parameter behaved as you’d expect. However, when clustering was added to CouchDB, a second meaning was added to stale=ok: also use the same set of shard replicas to retrieve the results.

Recall that Cloudant and CouchDB 2.x store three copies of each shard and by default will use the shard replica that starts returning results fastest for a query request. This behaviour helps even out load across the cluster: heavily loaded nodes will likely return results more slowly and so won’t be picked to respond to a given query. When using stale=ok, the database will instead always use the same shard replicas for every request to that index. The use of the same replicas to answer queries has two effects:

  1. Using stale=ok could drive load unevenly across the nodes in your database cluster because certain shard replicas would always be used for the queries to the index that specify stale=ok. This means a set of nodes could receive outsize numbers of requests.
  2. If one of the replicas was hosted on a heavily loaded node in the cluster, this would slow down all queries to that index using stale=ok. This is compounded by the tendency of stale=ok to drive imbalanced load.

The end result is that using stale=ok can, counter-intuitively, cause queries to become slower. Worse, they may become unavailable during cluster split-brain scenarios because of the forced use of a certain set of replicas. Given that people mostly use stale=ok to improve performance, this wasn’t a great state to be in.

As stale=ok’s existing behaviour needed to be maintained for backwards compatibility, the fix for this problem was to introduce two new query string parameters which set each of the two stale=ok behaviours independently:

  1. update=true/false/lazy: controls whether the index should be up to date before the query is executed.
    1. true: the index will be updated first.
    2. false: the index will not be updated.
    3. lazy: the index will not be updated before the query, but enqueued for update after the query is completed.
  2. stable=true/false: controls whether the same set of shard replicas is used to answer the query.

The main use of stable=true is that queries are more likely to appear to “go forward in time”: each shard replica may update its indexes in a different order, so switching between replicas can make results appear to jump backwards. However, even stable=true doesn’t guarantee this, so the availability and performance trade-offs are likely not worth it.

The end result is that virtually all applications using stale=ok should move to instead use update=false.
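
To make the switch concrete, here’s roughly what the change looks like on the query string. This is a sketch: the account URL, database, design document and view names are placeholders.

# Old style: accept stale results, but also pin the query to fixed replicas
curl "$ACCOUNT/mydb/_design/myddoc/_view/myview?stale=ok"

# New style: accept stale results while letting the cluster pick replicas freely
curl "$ACCOUNT/mydb/_design/myddoc/_view/myview?update=false"

# Or return immediately and enqueue an index update for afterwards
curl "$ACCOUNT/mydb/_design/myddoc/_view/myview?update=lazy"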

What is docker?

When I first came across docker a few years ago, probably late 2014 – a year or so after it was introduced at PyCon in 2013 – I found it a confusing concept. “Like GitHub, but for containers” was a phrase I recall from that period, and I think it ended up causing a lot of my confusion: I conflated Docker Hub with docker the tool.

Since then, I’ve learned more about docker, particularly in the last year. Things started to click around a year ago, and over the past few months, as I’ve looked further into Kubernetes and written my own pieces of software destined for container deployment, I’ve formed my own mental model of where docker fits into my world. This post is my attempt to write that model down and check its coherency.

I tend towards understanding systems like this bottom-up, so let’s start at the beginning, which is also conveniently the bottom.

cgroups

Cgroups, or control groups to give them their full name, were introduced into the mainline Linux kernel in 2.6.24, released in January 2008. What cgroups allow is for processes running on a system to be hierarchically grouped in such a way that various controls and boundaries can be applied to a process hierarchy.

Cgroups are a necessary but not sufficient part of a container solution, and they are also used for lots of things other than containers. Systemd, for example, uses cgroups when defining resource limits on the processes it manages.

Like many things within the Linux kernel, cgroups are exposed within the file hierarchy. A system administrator writes and reads from files within the mounted cgroups filesystem to define cgroups and their properties. A process is added to a cgroup by writing its PID to a file within the cgroups hierarchy; the process is automatically removed from its previous cgroup.
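
As a rough sketch of what driving that filesystem interface looks like – this assumes a cgroups v1 hierarchy mounted at /sys/fs/cgroup, a made-up group name and PID, and needs to be run as root:

# Create a new cgroup under the memory controller
mkdir /sys/fs/cgroup/memory/myapp

# Give the group a hard memory limit of 256MB
echo 268435456 > /sys/fs/cgroup/memory/myapp/memory.limit_in_bytes

# Move an existing process into the group by writing its PID
echo 1234 > /sys/fs/cgroup/memory/myapp/tasks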

Overall, cgroups provide docker with a simple(ish) way to control the resources a process hierarchy uses (like CPU) and has access to (like networks and part of the filesystem).

Cgroups provide control of various resources, but the main ones to consider for docker containers are:

  • CPU controller – using cpu shares, CPU time can be divided up between processes to ensure a process gets a share of CPU time to run in.
  • Memory controller – a process can be given its own chunk of memory which has a hard limit on its size.

From this, it’s relatively easy to see how docker can assign resources to a container – put the process running within the container in a cgroup and set up the resource constraints for it.
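
At the docker command line this shows up as resource flags on docker run which, as far as I understand it, are translated into cgroup settings for the container’s process hierarchy:

# Limit the container to half a CPU and 256MB of memory
docker run --cpus=0.5 --memory=256m couchdb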

Beyond controlling scarce resources like CPU and memory, cgroups provide a way to assign a namespace to a process. Namespaces are the next piece of the puzzle, so let’s move on to them.

Kernel namespaces

Putting a process within a namespace is a means to define what the process has access to. A namespace is the boundary, whereas cgroups are the control plane that puts a process within the namespace’s boundary.

Also in 2.6.24 came the core of network namespaces. This and future patchsets enable processes to be presented with their own view of the network stack, covering network functions such as interfaces, routing tables and so on.

The Wikipedia article on kernel namespaces has a list of the current resources that can be isolated using namespaces. We can form a basic view of how containers are run by docker (and any other container management software) using just a couple of these:

  1. The Mount (MNT) namespace
  2. The Network (NET) namespace

Things like the PID and User namespaces provide extra isolation, but I’m not going to cover them here.

I confess here I’m making some guesses as to what’s going on, but the mental model has served me okay so I’ll reproduce it here. Broadly I consider these two namespaces to be the basis of docker’s ability to run what amounts to “pre-packaged” software.

Mount

Mount namespaces define what the filesystem looks like to the process running within the namespace. So different processes can see entirely different views of the filesystem.

My general assumption here is that docker is using MNT namespaces to provide the running container with a unique view of the filesystem, both its own “root image” that we’ll talk about later and the parts of the host filesystem mounted into the running container using the --mount option.
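
You can get a feel for mount namespaces outside of docker with the unshare tool; a minimal sketch, run as root:

# Start a shell in its own mount namespace
unshare --mount --fork /bin/bash

# Mounts made inside the namespace are not visible to the rest of the system
mount -t tmpfs tmpfs /mnt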

Network

As NET namespaces provide processes with a custom view of the network stack, and provide ways for processes in different namespaces to poke holes through to each other via the network, I assume this is the basis for docker’s bridge network type, which sets up a private network between processes running in containers. When one runs a container with the host network type, my basic layman’s assumption is that the container’s process is not placed within its own network namespace (or that it lives within the default namespace).
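
Again, this is only my mental model, but the underlying mechanism can be poked at directly with the ip tool, which is roughly the sort of primitive that bridge networking is built from:

# Create a named network namespace
ip netns add demo

# It starts with only a loopback interface, which is initially down
ip netns exec demo ip link list
ip netns exec demo ip link set lo up

# Processes inside the namespace see that private stack, not the host's interfaces
ip netns exec demo ping -c 1 127.0.0.1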

Union filesystems

A union filesystem essentially takes several filesystem images and layers them on top of each other. Images “above” override values from images “below”. For any file read, the union filesystem traverses the image stack from top to bottom and returns the file content from the first image containing the file. For writes, either the write just fails (for a read-only union filesystem) or the write goes to the top-most layer. Often this top-most layer is initially an empty image created specifically for the writes of files to the mounted union filesystem.

An important point to note is that two or more mounted union filesystems can share images, meaning that two union filesystems could have, say, the first five images in their respective stacks shared but each with different images stacked on top to provide very different resultant filesystems.
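
Linux’s overlayfs is one implementation of this idea; a minimal sketch of mounting two read-only layers with a writable layer on top (all the directory names are placeholders):

# lowerdir layers are read-only; writes land in upperdir
mount -t overlay overlay \
  -o lowerdir=/layers/app:/layers/base,upperdir=/layers/rw,workdir=/layers/work \
  /merged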

Docker images and layers

When running a docker container, one specifies an image to “run” via a command like:

docker run couchdb

The couchdb part of this command specifies the image to download. I find the naming gets a bit confusing here, because essentially the “image” is actually a pointer to the top image of a stack of images which together form the union filesystem that ends up being the root filesystem of the running container.

While the above command reads “run the couchdb container”, it’s really more like “create and start a new container using the couchdb image as the base image for the container”. In fact, the docker run documentation describes this as:

The docker run command must specify an IMAGE to derive the container from.

The word derive is key here. By my understanding, docker adds a further image to the top of the stack which is where writes that the running container makes are written to. This image is saved under the name of the newly started container, and persists after the container is stopped under the container’s name. This is what allows docker to essentially stop and start containers while maintaining the files changed within the container – behind the scenes it’s managing an image used as the top image on the container’s union filesystem.

Docker calls the images that stack up to form the union filesystem “layers” and the ordered collection of layers an “image”. This concept is key to how dockerfiles work – to a first approximation, each line in a dockerfile adds a new image to the image stack used as the base image for a container. So, for example, a command that runs apt to install software creates a new image containing the changes to the filesystem made while installing the software.

It’s also obvious how this allows for dockerfile FROM lines to work – it just points back to the image at the top of the stack and then further dockerfile commands layer more images onto that stack to form a new docker image.

In addition, the fact that a union filesystem is able to share images at the lower levels means that docker is able to share lower level base images across many containers but only ever have a single copy on disk. These base images are read-only and so can be safely used in many containers’ union filesystem image stacks.
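
You can see this layer sharing on your own machine: docker history lists the layers making up an image, and docker system df summarises how much of the local image storage is shared.

# One row per layer in the image's stack
docker history couchdb

# Reports image, container and volume disk usage, including shared space
docker system df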

Putting it together

So basically what docker does when we use the docker run command is:

  1. Download the base image to derive the container from.
  2. Create a union filesystem consisting of the layers in the base image and a new layer at the top of the stack for the container to write its own files to.
  3. Set up a network namespace for the container.
  4. Set up a cgroup with appropriate mount and network namespaces such that the process has the union filesystem mounted as its root filesystem and a private view of the host’s network stack. Mount other volumes into this namespace as specified on the docker run command.
  5. Start up a process within the union filesystem within the cgroup.

Image repositories

This is where the “GitHub for containers” thing comes from. A docker daemon manages a local collection of union filesystem images on your machine called a repository – which contains all the base images and other images used to form the union filesystems for containers on the system (including the top-of-the-stack writable images containers use).

But Docker also manages a large central collection of images which can be used as base images, either for running directly via the docker run command or as the starting point for other docker images via the FROM dockerfile command. When used, the docker daemon downloads the image from the remote repository and uses it to derive a new container.

There’s some naming stuff here that I never quite got my head around, in that the couchdb bit in the docker run command is actually a repository itself, making Docker Hub more of a collection of repositories. The actual image used by the docker tool on your computer is chosen by the “tag” you select, and each repository has a set of tags defining the images you can use. There’s a default tag specified for each image repository, which is used when the docker run command just specifies a repository name and misses out the tag.
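
In practice this means the following two commands may pull different images from the same repository; the specific version tag here is just illustrative:

# Uses the repository's default tag (conventionally "latest")
docker run couchdb

# Uses an explicitly tagged image from the same repository
docker run couchdb:2.3.1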

So I guess this use of Docker Hub as a collection of repos containing tagged images maps, imperfectly, to the way GitHub works as a collection of Git repos. However, the terminology match is far from exact, which definitely caused me problems when trying to understand docker.

A key thing I found confusing is that because a repository is really just a collection of arbitrary images, your local repository can – and almost certainly does! – contain the base images for lots of different pieces of software as well as many, many intermediate layers, whereas the repos on Docker Hub typically contain several tagged versions of a single piece of software. A Docker Hub repo could presumably therefore also contain many disparate pieces of software, but convention dictates that is not what happens, at least in public repos.

Summary

Thinking of docker merely as the repository concept misses out a lot of useful context for what docker means for your machines – the level of access required to create and maintain the cgroups and namespaces is high, and hopefully this post makes it a bit clearer why the docker daemon requires it.

The v2 interface for cgroups provides for delegation of a portion of the cgroups hierarchy to a non-privileged process, which at a first scan read suggests a route to a less privileged docker daemon – or perhaps it’s already possible to use this. We’ve reached the boundaries of my knowledge and mental model here, so it’s time to stop for now.

As noted at the beginning, this post is a synthesis of my current understanding of how containers and docker work. On writing it down, my model does seem coherent and logical, but there is quite a bit of guesswork going on. I’d therefore be very pleased to receive corrections to my descriptions and explanations.

Cloudant replication topologies and failover

Cloudant’s (and CouchDB’s) replication feature allows you to keep databases in sync across countries and continents. However, sometimes it’s not obvious how to use this basic pair-wise feature in order to create more complicated replication topologies, like three or more geographical replicas, and then how to do disaster recovery between them. Let’s discuss these in turn.

Throughout the following, it’s important to remember that replication is an asynchronous, best-effort process in which a change is propagated to peers sometime after the client receives the response to its write request. This means that longer replication chains don’t directly affect document write latency, but also that discrepancies between peers will exist for some small period of time (typically low single digit seconds maximum) after a write to one peer completes.

More complicated topologies

Firstly it’s important to understand that Cloudant’s replication creates synchronised copies of databases between peers once a two-way replication is set up. Each replication flows changes only in one direction, which means that a two-way replication involves setting up two separate replications, one in each direction. Visualising this as a directed graph works well. Each node in the graph is a Cloudant database and the arrows are directed edges showing which way changes are flowing.

Using this style, a basic two-peer setup between databases A and B looks like this:

A basic two peer replication

There are two arrows because to have synchronised peers a replication needs to be set up in each direction. After a very short time – milliseconds to seconds – peer A will know about any changes to B, and, similarly, B will know about any changes to A.
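
Setting this up amounts to creating two continuous replications, one in each direction; a sketch using the _replicator database, with placeholder URLs and authentication omitted for brevity:

# Replicate changes from A to B...
curl -X POST "https://a.example.com/_replicator" -H "Content-Type: application/json" \
  -d '{"source": "https://a.example.com/mydb", "target": "https://b.example.com/mydb", "continuous": true}'

# ...and from B back to A
curl -X POST "https://b.example.com/_replicator" -H "Content-Type: application/json" \
  -d '{"source": "https://b.example.com/mydb", "target": "https://a.example.com/mydb", "continuous": true}'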

These peers can then further replicate with other peers to support more complicated scenarios than a single pair of peers. The key point is that, by setting up replications, one is setting up a graph that changes traverse to get from one peer to another. In order for a change to get from peer A to peer Z, at least one directed path must exist from A to Z. This is the foundation for creating more complicated topologies, because either A or B can now replicate the changes elsewhere.

So A can propagate the changes from B to another database, C. In this scenario, A is the primary database and the others could be considered replicas.

A basic three peer replication with primary

This obviously introduces a single point of failure. The replication process is smart enough that you can add a replication from B to C to this set, which means that A is no longer a single point of failure. In graph terms, it’s safe to set up a cyclic graph.

A three peer cyclic replication

As more database replicas are added to the set, however, having a fully connected mesh starts to add undue load to each database, and it’s not necessary as each peer is able to act as a “stepping stone” to push changes through the network.

Here, a change at peer B is replicated to E via A then C:

A more complicated six peer replication

In the diagram I only have one stepping stone in each path for simplicity of diagramming, but one could add redundant steps to ensure at least two paths through the network for any given change.

Using One-Way Replications for Synchronisation between Peers

Finally, it’s worth revisiting the point that all we require for synchronisation is that there is a directed path for a change to follow from one peer to another. This means that two-way replications between peers are not strictly required. Instead, one alternative is to set up a circle topology:

A more complicated circular replication

Here, there is still a path for changes to follow between any pair of nodes. Again, setting up redundant links to provide two paths may be useful. Using one-way replications in this way further allows you to decrease the load upon each peer database while still maintaining acceptable replication latency.

Failing over between databases

After setting up the synchronised peers in whatever topology works for your needs, you’re ready to set up failover between the database replicas.

The important takeaway point in this section is that, while you might be tempted to manage failover elsewhere, the only way to reliably failover is from within the application itself. The reason for this is simple: the application is the only place you can be sure whether a given replica is contactable from the application.

An application may be unable to contact a database peer for several reasons, such as:

  • The database servers are actually offline.
  • The application cannot route to the preferred database servers because of network disruption.
  • Very high latency is suddenly introduced on the network path between the application and database clusters.
  • The application’s DNS cache is out of date so it’s resolving the database location to the wrong IP address.
  • Key database requests made by the application have a latency above a given threshold.

The last condition is an example of how the application can use its own measurements to ensure failover happens before users become aware of the problem and how a failover strategy can benefit from being application performance-indicator aware.

The only thing you care about is whether the application can reach the database; not whether, for example, a third-party health-checking service or your DNS provider can contact it.

The basic steps are:

  • Configure each application instance with a prioritised list of database peers.
  • Use an approach like the circuit breaker pattern to guide failover.
  • Progress down the prioritised list.
  • Have a background task checking for recovery of higher priority peers, failing back when they are reliably available to the application again.

This approach is simple to understand, not too complicated to implement and gives your application the best chance of surviving the largest number of possible failure modes.
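
As a toy illustration of the first couple of steps – a real implementation lives inside the application, uses its own latency measurements and something like a proper circuit breaker rather than a one-shot check, and the URLs here are placeholders:

# Try peers in priority order and use the first one reachable from here
for peer in https://a.example.com https://b.example.com https://c.example.com; do
  if curl -sf -m 2 -o /dev/null "$peer"; then
    echo "Using $peer"
    break
  fi
done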

Avoiding Cloudant's HTTP Siren's Call

I was browsing Cloudant questions on Stack Overflow last night and came across a question about how to securely access Cloudant directly from a browser. Summarising:

How do I securely pass my database credentials to the browser without the user being able to see them?

Over my time at Cloudant I’ve been asked variants of this question many times. It makes me think that Cloudant and CouchDB’s HTTP interface is a bit of a siren’s call, luring unwary travellers onto security rocks.

Let’s cut to the chase: sending credentials to an untrusted device means you have to assume those credentials are compromised. By compromised I mean that in any reasonable security model you have to assume the credentials are public and anyone can use them. Your data is available to anyone.

The above question also misses the point that it’s not just the user themselves who must be unable to see the credentials, but everything else along the way:

  • Their browser may have malicious addons sending passwords to an attacker.
  • Their laptop may have malware, sending passwords to an attacker.
  • Their home network may have other compromised machines or routers, sending pass— you get the point.
  • And so on.

Everything that ever sees the credentials is able to leak them to the world. The actual person using the application is just one part of that (albeit one that’s easier to phish). The only way to prevent this is to never allow the credentials to leave an environment controlled by you, the web application developer.

Over time, I’ve come to the conclusion that both Cloudant and CouchDB are best suited to being used like any other database, using a three-tier architecture. No one would consider having a browser connect to a Postgres database or SQL Server – hopefully because it seems weird rather than just because it’d be difficult. Cloudant’s HTTP interface makes it simple to connect from a browser, which can be misleading.

The HTTP interface was originally intended to enable couchapps, and can work well if you can live within the very tight constraints required to do this securely. Those constraints make expressing many, if not most, applications as couchapps impossible. In addition, web application frameworks do a lot of work on your behalf to make your applications secure from a wide variety of attacks. CouchDB provides little help in this regard, which makes creating a secure application practically difficult even when it’s theoretically possible.

For most applications, therefore, the temptation to directly connect from the browser should be avoided. It’s only suitable for a pretty small subset of applications and then is hard to do securely when compared with a traditional architecture with an application in front of the database.

This of course isn’t to say CouchDB’s HTTP interface isn’t secure. It’s very secure when accessed over HTTPS; at least as much as any other database, and possibly more so, given the wide real-world testing of the security properties of HTTPS. However, the security of a protocol is only useful while its secrets are protected, and sending credentials to any device you don’t control is almost certain to undermine this.

Why you should (generally) avoid using include_docs in Cloudant and CouchDB view queries

One of my most often repeated pieces of performance advice when using CouchDB or Cloudant is to avoid using include_docs=true in view queries. When you look at the work CouchDB needs to do, the reason for the recommendation becomes obvious.

During a normal view query, CouchDB only needs to read a single file on disk. It streams results directly from that file – I guess it’s a b-tree style thing under the hood. Therefore, if you are reading the entire index or doing a range query with startkey and endkey, CouchDB can just find the appropriate starting point in the file on disk and then read until it reaches the end of the index or the endkey. Not much data needs to be held in memory as it can go straight onto the wire. As any data emitted by the map function is stored inline in the b-tree, it’s very fast to stream this as part of the response.

When you use include_docs=true, CouchDB has a lot more work to do. In addition to streaming out the view row data as above, Couch has to read each and every document referenced by the view rows it returns. Briefly, this involves:

  1. Loading up the database’s primary data file.
  2. Using the document ID index contained in that file to find the offset within that file where the document data resides.
  3. Finally, reading the document data itself before returning the row to the client.

Given that the view is ordered by some field in the document rather than by document ID, this is essentially a lot of random document lookups within the main data file. That’s a lot of extra tree traversals and data read from disk.

While in theory this is going to be much slower – and many people I trust had told me this – I’d not done a simple benchmark to get a feel for the difference myself. So I finally got around to doing a quick experiment to see what kind of effect this has. It was just on my MacBook Air (Mid-2012, 2GHz i7, 8GB RAM, SSD), using CouchDB 1.6.1 in a single node instance, so the specific values are fairly meaningless. The process:

  1. I uploaded 100,000 identical tiny documents to the CouchDB instance. The tiny document hopefully minimises the actual document data read time vs. the lookups involved in reading data.
  2. I created two views, one which saved the document data into the index and one which emitted null instead.
  3. I pre-warmed the views by retrieving each to make sure that CouchDB had built them.
  4. I did a few timed runs of retrieving every row in each view in a single call. For the null view, I timed both include_docs=true and include_docs=false.

The view was simply:

{
   "_id": "_design/test",
   "language": "javascript",
   "views": {
       "emitdoc": {
           "map": "function(doc) {\n  emit(doc.id, doc);\n}"
       },
       "emitnull": {
           "map": "function(doc) {\n  emit(doc.id, null);\n}"
       }
   }
}

And each document looked like:

{
   "_id": "0d469cdd8a7c054bf5eed0c954000ba4",
   "value1": "sjntwpacttftohms"
}

I then called each view and read through the whole thing, all 100,000 rows. I timed the calls using curl. It’s not very statistically rigorous, but I don’t think you need to be for this magnitude of difference. For kicks, I also eye-balled the CPU usage in top during each call and guessed an average.

Test                          Time, seconds   Eye-balled CPU
emitdoc                       5.821           105%
emitnull                      4.502           99%
emitnull?include_docs=true    48.492          140%

The headline result is that reading the document from the view index itself (emitdoc) was just over 8x faster than using include_docs. It’s also significantly less computationally expensive. There’s also a difference between reading emitnull and emitdoc, though far less pronounced.

This was done on CouchDB 1.6.1 on my laptop. So while it wasn’t a Cloudant or CouchDB cluster, given clustered query processing and clustered read behaviour, I would say that the results there would be similar or worse.

While this is a read of 100,000 documents, which you might say is unusual, over the many calls an application will make for smaller numbers of documents this kind of difference will add up. In addition, it adds a lot of load to your CouchDB instance, and likely screws around with disk caches and the like.

So, broadly, it seems pretty sound advice to avoid include_docs=true in practice as well as in theory.

As a bonus, here’s how to time things using curl.
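
Something like this, using curl’s -w option to print the total request time; the URL is a placeholder and authentication is omitted:

curl -s -o /dev/null -w "%{time_total}\n" \
  "http://localhost:5984/mydb/_design/test/_view/emitnull?include_docs=true"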

Addendum

I was asked on Twitter what I’d recommend overall for when to use include_docs. It’s a bit of a judgement call.

The core trade off is one of space vs. query latency. Emitting entire documents into the index is expensive in terms of space. But, as shown above, it speeds up retrieval significantly. Ideally, of course, you’d be able to emit a subset of the fields from a document, but often that’s not possible.

My decision tree would start something like this:

  • If the query happens often – many times a minute for example – it’s worth emitting the documents into the index. The query latency being lower will help your overall throughput.
  • For any query running more than once every ten minutes or so, when retrieving a lot of documents – many hundreds – I’d consider emitting the documents regardless of latency requirements. Reading many documents from the primary data file will chew up disk caches and internal CouchDB caches.
  • If the query is rare and need not run at optimal speeds, go ahead and use include_docs. For rarely used queries, you might as well read the documents at query time and save the disk space.
  • For relatively rare queries, a few a minute, if speed is important (e.g. it’s for a UI), I’d consider the number of documents retrieved. If it’s just one or two, the extra latency in using include_docs probably isn’t going to matter. If it’s a lot of documents, the delay may become unacceptable. This one is particularly application dependent.

This would help me decide what the first iteration of my index would look like, but I’d want to monitor and tweak the index over time if it appeared slow. As always, testing different configurations is the best strategy, but hopefully the above saves a little time.