Cloudant replication topologies and failover

Cloudant's (and CouchDB's) replication feature allows you to keep databases in sync across countries and continents. However, it's not always obvious how to build on this basic pair-wise feature to create more complicated replication topologies, like three or more geographical replicas, and then how to handle disaster recovery between them. Let's discuss these in turn.

Throughout the following, it's important to remember that replication is an asynchronous, best-effort process in which a change is propagated to peers sometime after the client receives the response to its write request. This means that longer replication chains don't directly affect document write latency, but also that discrepancies between peers will exist for some small period of time (typically low single digit seconds maximum) after a write to one peer completes.

More complicated topologies

Firstly, it's important to understand that Cloudant's replication creates synchronised copies of databases between peers once a two-way replication is set up. Each replication flows changes in only one direction, which means that a two-way replication involves setting up two separate replications, one in each direction. Visualising this as a directed graph works well: each node in the graph is a Cloudant database and each directed edge shows which way changes flow.

Using this style, a basic two-peer setup between databases A and B looks like this:

A basic two peer replication

There are two arrows because to have synchronised peers a replication needs to be set up in each direction. After a very short time -- milliseconds to seconds -- peer A will know about any changes to B, and, similarly, B will know about any changes to A.
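Concretely, keeping A and B in sync means two replication documents, one per direction. Here's a minimal sketch of the pair of documents you might create in the _replicator database -- the document IDs, database names and hosts are illustrative:

```json
[
  {
    "_id": "a-to-b",
    "source": "https://a.example.com/db-a",
    "target": "https://b.example.com/db-b",
    "continuous": true
  },
  {
    "_id": "b-to-a",
    "source": "https://b.example.com/db-b",
    "target": "https://a.example.com/db-a",
    "continuous": true
  }
]
```

Setting continuous to true keeps each replication running indefinitely, so changes flow as they happen rather than in one-off batches.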

These peers can then replicate further with other peers to support more complicated scenarios than a single pair. The key point is that, by setting up replications, one is defining a graph that changes traverse to get from one peer to another. For a change to get from peer A to peer Z, at least one directed path must exist between A and Z. This is the foundation for creating more complicated topologies, because either A or B can now replicate the changes elsewhere.
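Whether a change can get from one peer to another is an ordinary graph-reachability question over the replication edges. A small sketch, with a made-up topology:

```python
from collections import deque

def can_reach(replications, source, dest):
    """True if a change at `source` can propagate to `dest` along the
    directed replication edges in `replications`."""
    seen, queue = {source}, deque([source])
    while queue:
        node = queue.popleft()
        if node == dest:
            return True
        # Follow every replication flowing out of this node.
        for src, tgt in replications:
            if src == node and tgt not in seen:
                seen.add(tgt)
                queue.append(tgt)
    return False

# A circle topology: A -> B -> C -> A.
circle = [("A", "B"), ("B", "C"), ("C", "A")]
print(can_reach(circle, "B", "A"))  # True: B -> C -> A
```

Running a check like this over your planned topology before setting up the replications is a cheap way to confirm every peer can eventually see every change.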

So A can propagate the changes from B to another database, C. In this scenario, A is the primary database and the others could be considered replicas.

A basic three peer replication with primary

This obviously introduces a single point of failure. The replication process is smart enough that you can add a replication from B to C to this set, which means that A is no longer a single point of failure. In graph terms, it's safe to set up a cyclic graph.

A three peer cyclic replication

As more database replicas are added to the set, however, having a fully connected mesh starts to add undue load to each database, and it's not necessary as each peer is able to act as a "stepping stone" to push changes through the network.

Here, a change at peer B is replicated to E via A then C:

A more complicated six peer replication

In the diagram I only have one stepping stone in each path for simplicity of diagramming, but one could add redundant steps to ensure at least two paths through the network for any given change.

Using One-Way Replications for Synchronisation between Peers

Finally, it's worth revisiting the point that all we require for synchronisation is that there is a directed path for a change to follow from one peer to another. This means that two-way replications between peers are not strictly required. Instead, one alternative is to set up a circle topology:

A more complicated circular replication

Here, there is still a path for changes to follow between any pair of nodes. Again, setting up redundant links to provide two paths may be useful. Using one-way replications in this way further allows you to decrease the load upon each peer database while still maintaining acceptable replication latency.

Failing over between databases

After setting up the synchronised peers in whatever topology works for your needs, you're ready to set up failover between the database replicas.

The important takeaway in this section is that, while you might be tempted to manage failover elsewhere, the only way to fail over reliably is from within the application itself. The reason is simple: the application is the only place you can be sure whether a given replica is reachable by the application.

An application may be unable to contact a database peer for several reasons, such as:

  • The database servers are actually offline.
  • The application cannot route to the preferred database servers because of network disruption.
  • Very high latency is suddenly introduced on the network path between the application and database clusters.
  • The application's DNS cache is out of date so it's resolving the database location to the wrong IP address.
  • Key database requests made by the application have a latency above a given threshold.

The last condition is an example of how the application can use its own measurements to trigger failover before users become aware of the problem, and of how a failover strategy benefits from being aware of application-level performance indicators.

The only thing you care about is whether the application can reach the database -- not whether, for example, a third-party health-checking service or your DNS provider can contact it.

The basic steps are:

  • Configure each application instance with a prioritised list of database peers.
  • Use an approach like the circuit breaker pattern to guide failover.
  • Progress down the prioritised list.
  • Have a background task checking for recovery of higher priority peers, failing back when they are reliably available to the application again.
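Those steps can be sketched as follows. The peer URLs and cooldown are placeholder assumptions, and a real implementation would add request timeouts, health checks and a fuller circuit breaker; the cooldown expiry here stands in for the background recovery check:

```python
import time

class FailoverPeers:
    """Walk a prioritised peer list, skipping peers that recently failed."""

    def __init__(self, peers, cooldown_secs=30):
        self.peers = peers              # highest priority first
        self.cooldown = cooldown_secs   # how long a failed peer is avoided
        self.failed_at = {}             # peer -> time of last failure

    def current(self):
        """Highest-priority peer not inside its failure cooldown window.
        A peer whose cooldown has expired is retried automatically,
        which is how we fail back to recovered peers."""
        now = time.monotonic()
        for peer in self.peers:
            failed = self.failed_at.get(peer)
            if failed is None or now - failed >= self.cooldown:
                return peer
        return self.peers[-1]           # everything failing: least-bad option

    def mark_failed(self, peer):
        self.failed_at[peer] = time.monotonic()

peers = FailoverPeers(["https://db-eu.example.com", "https://db-us.example.com"])
assert peers.current() == "https://db-eu.example.com"
peers.mark_failed("https://db-eu.example.com")
assert peers.current() == "https://db-us.example.com"
```

Because the failure bookkeeping lives inside the application process, the decision is driven by what the application itself can and cannot reach.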

This approach is simple to understand, not too complicated to implement, and gives your application the best chance of surviving the widest range of failure modes.

Avoiding Cloudant's HTTP Siren's Call

I was browsing Cloudant questions on Stack Overflow last night and came across a question about how to securely access Cloudant directly from a browser. Summarising:

How do I securely pass my database credentials to the browser without the user being able to see them?

Over my time at Cloudant I've been asked variants of this question many times. It makes me think that Cloudant and CouchDB's HTTP interface is a bit of a siren's call, luring unwary travellers onto security rocks.

Let's cut to the chase: sending credentials to an untrusted device means you have to assume those credentials are compromised. By compromised I mean that in any reasonable security model you have to assume the credentials are public and anyone can use them. Your data is available to anyone.

The above question also misses the point that it's not just the user themselves that must be unable to see the credentials, but everything else:

  • Their browser may have malicious addons sending passwords to an attacker.
  • Their laptop may have malware, sending passwords to an attacker.
  • Their home network may have other compromised machines or routers, sending pass--- you get the point.
  • And so on.

Everything that ever sees the credentials is able to leak them to the world. The actual person using the application is just one part of that (albeit one that's easier to phish). The only way to prevent this is to never allow the credentials to leave an environment controlled by you, the web application developer.

Over time, I've come to the conclusion that both Cloudant and CouchDB are best suited to being used like any other database, using a three-tier architecture. No one would consider having a browser connect to a Postgres database or SQL Server -- hopefully because it seems weird rather than just because it'd be difficult. Cloudant's HTTP interface makes it simple to connect from a browser, which can be misleading.

The HTTP interface was originally intended to enable CouchApps, and it can work well if you can live within the very tight constraints required to do this securely. Those constraints make expressing many, if not most, applications as CouchApps impossible. In addition, web application frameworks do a lot of work on your behalf to make your applications secure against a wide variety of attacks. CouchDB provides little help in this regard, which makes creating a secure application practically difficult even if it's theoretically possible.

For most applications, therefore, the temptation to connect directly from the browser should be avoided. It's suitable only for a pretty small subset of applications, and even then it's hard to do securely when compared with a traditional architecture with an application tier in front of the database.

This of course isn't to say CouchDB's HTTP interface isn't secure. It's very secure when accessed over HTTPS; at least as much as any other database, and possibly more so, given the wide real-world testing of the security properties of HTTPS. However, a protocol's security is only useful while its secrets stay protected, and sending credentials to any device you don't control is almost certain to undermine this.
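To make the three-tier point concrete, here's a minimal sketch of the piece that matters: an application tier that holds the credentials and forwards only requests it explicitly allows. The upstream URL, credentials and allowed paths are all illustrative assumptions, and a real application would use its framework's routing and authentication rather than this bare function:

```python
import base64
import urllib.request

# All of these names and values are illustrative assumptions.
COUCH_URL = "http://localhost:5984"              # never sent to the browser
COUCH_USER, COUCH_PASS = "appserver", "s3cret"   # held only in this tier

# The application tier forwards only paths it explicitly allows.
ALLOWED_PATHS = {"/mydb/_design/test/_view/emitnull"}

def build_upstream_request(path):
    """Map a browser-facing path onto an authenticated upstream request,
    so the database credentials stay inside the application tier."""
    if path not in ALLOWED_PATHS:
        raise PermissionError("path not allowed: " + path)
    req = urllib.request.Request(COUCH_URL + path)
    token = base64.b64encode(
        (COUCH_USER + ":" + COUCH_PASS).encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    return req
```

The browser only ever sees your application's endpoints; the Basic Auth header is attached server-side just before the request is forwarded, so the credentials never reach an untrusted device.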

Why you should (generally) avoid using include_docs in Cloudant and CouchDB view queries

One of my most often repeated pieces of performance advice when using CouchDB or Cloudant is to avoid using include_docs=true in view queries. When you look at the work CouchDB needs to do, the reason for the recommendation becomes obvious.

During a normal view query, CouchDB only needs to read a single file on disk: the view index, which is stored as a B-tree. It streams results directly from that file. Therefore, if you are reading the entire index or doing a range query with startkey and endkey, CouchDB can find the appropriate starting point in the file and then read until it reaches the end of the index or the endkey. Not much data needs to be held in memory, as it can go straight onto the wire. Because any data emitted by the map function is stored inline in the B-tree, it's very fast to stream this as part of the response.

When you use include_docs=true, CouchDB has a lot more work to do. In addition to streaming out the view row data as above, Couch has to read each and every document referenced by the view rows it returns. Briefly, this involves:

  1. Loading up the database's primary data file.
  2. Using the document ID index contained in that file to find the offset within that file where the document data resides.
  3. Finally, reading the document data itself before returning the row to the client.

Given the view is ordered by some field in the document rather than by doc ID, this is essentially a lot of random document lookups within the main data file. That's a lot of extra tree traversals and data read from disk.

While in theory this is going to be much slower -- and many people I trust have told me so -- I'd not done a simple benchmark to get a feel for the difference myself. So I finally got around to running a quick experiment to see what kind of effect this has. It was just on my MacBook Air (Mid-2012, 2GHz i7, 8GB RAM, SSD), using CouchDB 1.6.1 as a single-node instance, so the specific values are fairly meaningless. The process:

  1. I uploaded 100,000 identical tiny documents to the CouchDB instance. The tiny document hopefully minimises the actual document data read time vs. the lookups involved in reading data.
  2. I created two views, one which saved the document data into the index and one which emitted null instead.
  3. I pre-warmed the views by retrieving each to make sure that CouchDB had built them.
  4. I did a few timed runs of retrieving every row in each view in a single call. For the null view, I timed both include_docs=true and include_docs=false.

The view was simply:

{
   "_id": "_design/test",
   "language": "javascript",
   "views": {
       "emitdoc": {
           "map": "function(doc) {\n  emit(doc.id, doc);\n}"
       },
       "emitnull": {
           "map": "function(doc) {\n  emit(doc.id, null);\n}"
       }
   }
}

And each document looked like:

{
   "_id": "0d469cdd8a7c054bf5eed0c954000ba4",
   "value1": "sjntwpacttftohms"
}

I then called each view and read through the whole thing, all 100,000 rows. I timed the calls using curl. It's not very statistically rigorous, but I don't think you need to be for this magnitude of difference. For kicks, I also eye-balled the CPU usage in top during each call and guessed an average.

Test                          Time, seconds    Eye-balled CPU
emitdoc                       5.821            105%
emitnull                      4.502            99%
emitnull?include_docs=true    48.492           140%

The headline result is that reading the document from the view index itself (emitdoc) was just over 8x faster than using include_docs. It's also significantly less computationally expensive. There's also a difference between reading emitnull and emitdoc, though far less pronounced.

This was done on CouchDB 1.6.1 on my laptop, rather than a Cloudant or CouchDB cluster. Given the extra query processing and read fan-out a cluster involves, I'd expect the results there to be similar or worse.

While this is a read of 100,000 documents, which you might say is unusual, this kind of difference adds up over the many smaller calls an application will make. In addition, it adds a lot of load to your CouchDB instance, and likely screws around with disk caches and the like.

So, broadly, it seems pretty sound advice to avoid include_docs=true in practice as well as in theory.

As a bonus, here's how to time things using curl.
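The flag that does the work is curl's -w/--write-out option, which prints timing variables once the transfer finishes. A sketch against one of the experiment's views -- the database name and path here are illustrative:

```shell
# -s silences the progress meter, -o /dev/null throws away the body,
# and -w prints the total wall-clock time once the request completes.
# The database name 'perftest' is illustrative.
curl -s -o /dev/null -w 'total: %{time_total}s\n' \
  'http://localhost:5984/perftest/_design/test/_view/emitnull?include_docs=true'
```

Other useful --write-out variables include %{time_connect} and %{time_starttransfer}, which help separate connection setup from the server's response time.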

Addendum

I was asked on Twitter what I'd recommend overall for when to use include_docs. It's a bit of a judgement call.

The core trade-off is one of space vs. query latency. Emitting entire documents into the index is expensive in terms of space but, as shown above, speeds up retrieval significantly. Ideally, of course, you'd emit just a subset of the fields from a document, but often that's not possible.

My decision tree would start something like this:

  • If the query happens often -- many times a minute, for example -- it's worth emitting the documents into the index. The lower query latency will help your overall throughput.
  • For any query running more than once every ten minutes or so, when retrieving a lot of documents -- many hundreds -- I'd consider emitting the documents regardless of latency requirements. Reading many documents from the primary data file will chew up disk caches and internal CouchDB caches.
  • If the query is rare and need not run at optimal speeds, go ahead and use include_docs. For rarely used queries, you might as well read the documents at query time and save the disk space.
  • For relatively rare queries, a few a minute, if speed is important (e.g. it's for a UI), I'd consider the number of documents retrieved. If it's just one or two, the extra latency in using include_docs probably isn't going to matter. If it's a lot of documents, the delay may become unacceptable. This one is particularly application dependent.
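One way to encode those rules of thumb, with the caveat that every threshold below is a ballpark taken from the bullets above rather than a measured cut-off:

```python
def should_emit_docs(queries_per_minute, docs_per_query,
                     latency_sensitive=False):
    """Rough encoding of the decision tree above. Returns True when it's
    probably worth emitting documents into the index, False when
    include_docs at query time is an acceptable trade."""
    if queries_per_minute >= 1:
        return True          # runs often: the index space is worth it
    if docs_per_query >= 200 and queries_per_minute >= 0.1:
        return True          # bulk reads would chew up disk caches
    if latency_sensitive and docs_per_query > 2:
        return True          # UI-facing and more than a doc or two
    return False             # rare and small: save the disk space

assert should_emit_docs(10, 1) is True       # hot query path
assert should_emit_docs(0.01, 1) is False    # rare, tiny result
```

This is only a first-iteration guess; as the article says, monitoring the index in production is what actually settles the question.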

This would help me decide what the first iteration of my index would look like, but I'd want to monitor and tweak the index over time if it appeared slow. As always, testing different configurations is the best strategy, but hopefully the above saves a little time.

Note

Apple's 2016 in review, on the new Macbook Pro lineup:

Apple has built some laptops that nicely handle the needs of the vast majority of its users. It’s easy to look at the volume numbers on what sells and convince yourself that these edge cases aren’t worth building a device for. That seems to be what Apple’s done with these new laptops.

But here’s the problem: sitting in this niche of excluded users are some of Apple’s strongest supporters, the influencers that create word of mouth, and to me, most importantly it includes a significant number of the developers Apple depends on to create it’s Mac and IOS apps.

Couldn't agree more.

.:.

The tree behind Cloudant's documents and how to use it

On the surface, Cloudant's document API appears to be a reasonably simple way to upload JSON data to a key value store and update the JSON data for a particular key. However, the reality is more complicated -- isn't it always? -- and understanding what's happening is important in using Cloudant effectively and becoming an advanced user of the database.

The takeaway point is this: a document in Cloudant is a tree and Cloudant's document HTTP API is essentially a tree manipulation API.

Below, we'll create and manipulate these trees to get a feel for what this means. The default behaviours of many calls allow you to ignore the tree so that Cloudant appears more like a typical database, and we'll explore why and how this works as we go.

Aside: why a tree? A document's tree representation is key to Cloudant's replication, which is essentially transferring tree nodes from one location to another in order to synchronise data robustly.

All this information is valid for CouchDB, but I've chosen to refer to Cloudant throughout for simplicity. This mini-tutorial assumes a basic knowledge of the Cloudant document API. Let's get started.

Building a simple document tree

Uploading a new document to Cloudant gives you something like this:

> curl -XPUT \
    'http://localhost:5984/tree-demo/mydoc' \
    -d '{"foo":"bar"}'
{"ok":true,"id":"mydoc","rev":"1-4c6114c65e295552ab1019e2b046b10e"}

Updating that document gives you this:

> curl -XPUT \
    'http://localhost:5984/tree-demo/mydoc?rev=1-4c6114c65e295552ab1019e2b046b10e' \
    -d '{"foo":"baz"}'
{"ok":true,"id":"mydoc","rev":"2-cfcd6781f13994bde69a1c3320bfdadb"}

Each call has returned a new rev ID in the response. If we didn't already know the two rev IDs for this document, we could find them out by making a request for the document using the revs=true querystring parameter:

> curl -XGET \
    'http://localhost:5984/tree-demo/mydoc?revs=true' | jq .
{
  "_id": "mydoc",
  "_rev": "2-cfcd6781f13994bde69a1c3320bfdadb",
  "foo": "baz",
  "_revisions": {
    "start": 2,
    "ids": [
      "cfcd6781f13994bde69a1c3320bfdadb",
      "4c6114c65e295552ab1019e2b046b10e"
    ]
  }
}

The special thing about this request is that the response lists the rev IDs in order -- which defines the history of the document. From this call, we can construct a tree for the document with a single branch and two nodes. We use the _revisions field, and construct the tree backwards using the start and ids array:

A tree representation of the document, with two nodes

The tree has two nodes. Cloudant labels each node with its rev ID. Cloudant also labels one node winner; we'll come to that in a moment.

Note: Cloudant calls these tree nodes revisions but I'm going to stick with node to emphasise the tree semantics.

Now we've seen the tree, let's see how we can think about the three calls above in terms of trees rather than updates to a JSON document.

  1. The first call creates a new document tree and its root node, 1-4c6114c65e295552ab1019e2b046b10e.
  2. The second call adds a node to the tree, parented by the node we pass in via the rev querystring parameter. The call returns the rev ID of the new node, 2-cfcd6781f13994bde69a1c3320bfdadb.
  3. The final call uses the revs=true parameter to return the history of the 2- node along with its content.

One other thing is worth mentioning at this point: a rev ID can be used to retrieve any node of the tree at any time. This is done by using the rev querystring parameter when making a GET request for the document. While that node will be returned, the JSON content of the document at that point may not: Cloudant cleans up the content of old tree nodes to conserve space.

The winner node

The winner abstraction is key to Cloudant's database semantics. It's used to give the appearance of a single value for a document rather than a tree. It does this by freeing us from having to specify the rev ID of the node we want with every call -- though some calls, like updates, always need one. Let's look at how this works.

Cloudant applies the winner label to the node at the tip of the tree as shown in the diagram above. This node is the one that Cloudant will refer to when responding to document retrieval requests that do not have any particular rev ID specified. The rev ID for this node is included in the response.

While the node that Cloudant chooses to label the winner for any given document is an implementation detail, it's essentially decided by choosing the longest path in the document tree that's not terminated by a deletion node. This will be more obvious later when we introduce a branch to the tree.

An essential property is that Cloudant chooses which node to label winner in a deterministic way. This means that when the database is replicated, Cloudant will label the same node as winner at all replicas, given the same document tree.
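A sketch of that deterministic choice, treating each branch tip as a (rev position, rev hash, deleted) triple. The tie-break shown -- preferring the lexicographically greater hash -- captures the "same answer at every replica" property, though the precise algorithm inside Cloudant is, as noted, an implementation detail:

```python
def pick_winner(leaves):
    """leaves: one (rev_position, rev_hash, deleted) triple per branch tip.
    Returns the leaf the winner label points at."""
    active = [leaf for leaf in leaves if not leaf[2]]
    candidates = active or leaves   # an all-deleted doc still has a winner
    # Longest branch wins; ties break on the greater rev hash, so every
    # replica picks the same node given the same tree.
    return max(candidates, key=lambda leaf: (leaf[0], leaf[1]))

# The two branch tips from the conflicted document later in this article:
leaves = [
    (4, "a5be949eeb7296747cc271766e9a498b", False),
    (3, "917fa2381192822767f010b95b45325b", False),
]
print(pick_winner(leaves)[0])  # 4: the longer branch wins
```

Note what happens when the 4- tip is replaced by a tombstone: the 3- branch becomes the only active one, and the winner label moves -- exactly the behaviour we'll see in the conflict-resolution section.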

Consider the last call above:

> curl -XGET \
    'http://localhost:5984/tree-demo/mydoc?revs=true'

By not specifying a rev ID to return, we're instructing Cloudant to return information about the tree node pointed to by the winner label.

Updates are tree manipulation

We've now seen that an update to a document can be viewed as adding a new node to the document's tree. And we have seen that the update request's rev parameter is specifying which existing node in the document tree should be used as the parent for this new node.

If we update twice more, we can see the tree grow. In the background, Cloudant will be updating the winner pointer to a new rev ID each time. It sends that back in the rev field of the response:

> curl -XPUT \
    'http://localhost:5984/tree-demo/mydoc?rev=2-cfcd6781f13994bde69a1c3320bfdadb' \
    -d '{"foo":"bop"}'
{"ok":true,"id":"mydoc","rev":"3-2766344359f70192d3a68bf205c37743"}

> curl -XPUT \
    'http://localhost:5984/tree-demo/mydoc?rev=3-2766344359f70192d3a68bf205c37743' \
    -d '{"foo":"bloop"}'
{"ok":true,"id":"mydoc","rev":"4-a5be949eeb7296747cc271766e9a498b"}

Retrieving the tree description again using revs=true, we can see this has changed the JSON description of the document tree history to include the new rev IDs (node labels):

> curl -XGET 'http://localhost:5984/tree-demo/mydoc?revs=true' | jq .
{
  "_id": "mydoc",
  "_rev": "4-a5be949eeb7296747cc271766e9a498b",
  "foo": "bloop",
  "_revisions": {
    "start": 4,
    "ids": [
      "a5be949eeb7296747cc271766e9a498b",
      "2766344359f70192d3a68bf205c37743",
      "cfcd6781f13994bde69a1c3320bfdadb",
      "4c6114c65e295552ab1019e2b046b10e"
    ]
  }
}

The tree now has four nodes, as shown by the _revisions array. As we didn't specify a rev in the request, Cloudant has again returned the node content and history for the winner node, as we'd expect. As with any direct request to retrieve a document, we could have supplied a rev parameter to receive information about a different node.

A tree representation of the document, with four nodes

Branching in the document tree

To show how the tree can branch, we need to make this document a conflicted document. A conflicted document is a document where there are two or more active branches. An active branch is one that ends in a node that is not deleted. This often happens during replication, where a document with a common history may have been updated in different ways in the databases being replicated.

If you've been using Cloudant for a while, you'll be familiar with Cloudant's behaviour when you send the "wrong" rev ID with an update: the 409 Conflict response. With our new tree knowledge, we can see that Cloudant is rejecting a request that either (a) tries to add a node to a parent which already has a child, or (b) tries to use a parent node that doesn't exist. It does this so the tree remains a single stem, which makes sense for a database. We can see this if we try to add a second child to our 1- revision:

> curl -XPUT \
    'http://localhost:5984/tree-demo/mydoc?rev=1-4c6114c65e295552ab1019e2b046b10e' \
    -d '{"foo":"baz"}' -v
[...]
> PUT /tree-demo/mydoc?rev=1-4c6114c65e295552ab1019e2b046b10e HTTP/1.1
[...]
< HTTP/1.1 409 Conflict
[...]
{"error":"conflict","reason":"Document update conflict."}

However, we can use the _bulk_docs call to bypass these protections (which is what Cloudant's replicator does). Clearly, this is something you'd never use day-to-day, but it's instructive to see the effects. In the following call, we show this bypassing in action by creating a new branch rooted at the first update we made above.

The _bulk_docs request takes a JSON body describing the updates to make. First, we create the body of the request in bulk.json:

{
    "docs": [
        {
            "_id": "mydoc",
            "_rev": "3-917fa2381192822767f010b95b45325b",
            "_revisions": {
                "ids": [
                    "917fa2381192822767f010b95b45325b",
                    "cfcd6781f13994bde69a1c3320bfdadb",
                    "4c6114c65e295552ab1019e2b046b10e"
                ],
                "start": 3
            },
            "bar": "baz"
        }
    ],
    "new_edits": false
}

This document describes a new node for the document mydoc that we created above, 3-917fa2381192822767f010b95b45325b. This node has a history that diverges from the existing document. The new node's history is described using the _revisions field, using the same format as we received in revs=true calls above. In this history, the first two rev IDs are the same, but the third is different. This will cause a branch to happen at rev 2-.

By including new_edits=false in the JSON, we tell Cloudant to graft this node into the document tree, creating any parent nodes required. Without new_edits=false, Cloudant would reject this update as the history is incompatible with the winner node.

Next, send the request using this body:

> curl -XPOST \
    'http://localhost:5984/tree-demo/_bulk_docs' \
    -d @bulk.json \
    -H "Content-Type:application/json"

We now have a tree which branches. To check this, we can make a request to Cloudant that will return the content of the nodes at the tip of each branch, along with the history of each of these nodes.

The open_revs parameter is specifically used for retrieving the content of multiple tip nodes at once. To return all the branches' tip nodes, we pass open_revs=all into the GET for the document. This is a second way to get non-winner nodes, along with specifying rev.

As above, we also include revs=true so that the history of each tip node is also returned. We also specify Accept: application/json to get a simpler return format: all the tip nodes in a JSON array.

> curl -XGET \
    'http://localhost:5984/tree-demo/mydoc?open_revs=all&revs=true' \
    -H "Accept:application/json" | jq .
[
  {
    "ok": {
      "_id": "mydoc",
      "_rev": "4-a5be949eeb7296747cc271766e9a498b",
      "foo": "bloop",
      "_revisions": {
        "start": 4,
        "ids": [
          "a5be949eeb7296747cc271766e9a498b",
          "2766344359f70192d3a68bf205c37743",
          "cfcd6781f13994bde69a1c3320bfdadb",
          "4c6114c65e295552ab1019e2b046b10e"
        ]
      }
    }
  },
  {
    "ok": {
      "_id": "mydoc",
      "_rev": "3-917fa2381192822767f010b95b45325b",
      "bar": "baz",
      "_revisions": {
        "start": 3,
        "ids": [
          "917fa2381192822767f010b95b45325b",
          "cfcd6781f13994bde69a1c3320bfdadb",
          "4c6114c65e295552ab1019e2b046b10e"
        ]
      }
    }
  }
]

Which represents the following tree:

A tree representation of the conflicted document, with a branch

Resolving the conflicted state

A document is conflicted so long as it has two or more branches which don't end in a deletion node. A conflicted document is clearly in a bad state for a database: some data is effectively hidden. So before we finish, let's resolve the conflict. In tree terms, resolving means updating the document's tree so that there is only one active branch.

As we have two active branches, we need to update one branch with a deletion node, called a tombstone, to make that branch inactive. We do this by making a DELETE request to the document's resource. This request specifies the parent rev ID (rev) as the node at the tip of the branch we wish to mark as a dead end.

Before this, we need to choose which branch to keep. To do this, we need to retrieve both. Instead of using open_revs=all, we can retrieve each individually.

First, let's look at what Cloudant considers the winner document node. This corresponds to the data it returns when a simple document lookup is carried out:

> curl -XGET \
    'http://localhost:5984/tree-demo/mydoc' | jq .
{
  "_id": "mydoc",
  "_rev": "4-a5be949eeb7296747cc271766e9a498b",
  "foo": "bloop"
}

And, as stated earlier, we can get the other branch's data by specifying the rev ID of the node at the tip of the other branch:

> curl -XGET \
    'http://localhost:5984/tree-demo/mydoc?rev=3-917fa2381192822767f010b95b45325b' | jq .
{
  "_id": "mydoc",
  "_rev": "3-917fa2381192822767f010b95b45325b",
  "bar": "baz"
}

Let's say that the non-winning node, 3-917fa2381192822767f010b95b45325b, contains the JSON data we actually want for this document. We issue a delete request specifying the unwanted branch's tip node rev ID as the parent:

> curl -XDELETE \
    'http://localhost:5984/tree-demo/mydoc?rev=4-a5be949eeb7296747cc271766e9a498b'
{"ok":true,"id":"mydoc","rev":"5-ab21cb5ac4c8da916c47c45330d8a655"}

Recapping what we're doing here, this request instructs Cloudant to add a tombstone node to the parent node specified by the rev parameter, thereby rendering that branch inactive.

Recall the algorithm to select the node the winner label points to is to choose the longest active path in a document tree. Therefore, after this operation, Cloudant will have moved the winner label to point at the tip of the remaining active branch. A lookup of the doc ID without specifying a rev ID therefore now returns the other node:

> curl -XGET \
    'http://localhost:5984/tree-demo/mydoc' | jq .
{
  "_id": "mydoc",
  "_rev": "3-917fa2381192822767f010b95b45325b",
  "bar": "baz"
}

A corollary of this is that a document is not considered deleted -- that is, a request for that document will not return 404 Not Found -- until all branches are inactive. This can be a source of great confusion if one doesn't know beforehand that a document is conflicted: deleting the document appears to bring another version of it back to life!

If a document has many active branches, it will take several requests to add a tombstone to every branch tip to make the document actually appear deleted. Using the open_revs=all argument will show how many active branch tips there are.
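Given the parsed JSON array from an open_revs=all request like those above, counting the active tips is straightforward. A small sketch:

```python
def active_tips(open_revs_response):
    """Return the rev IDs of branch tips that are not tombstones."""
    return [
        row["ok"]["_rev"]
        for row in open_revs_response
        if "ok" in row and not row["ok"].get("_deleted", False)
    ]

# The shape of the response after we tombstoned the 4- branch:
response = [
    {"ok": {"_id": "mydoc", "_rev": "5-ab21cb5ac4c8da916c47c45330d8a655",
            "_deleted": True}},
    {"ok": {"_id": "mydoc", "_rev": "3-917fa2381192822767f010b95b45325b",
            "bar": "baz"}},
]
assert active_tips(response) == ["3-917fa2381192822767f010b95b45325b"]
```

A document needs one more DELETE per rev ID this returns before it will actually report 404 Not Found.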

Returning to our example, we can study the resulting tree using the request that we've made a few times now:

> curl -XGET \
    'http://localhost:5984/tree-demo/mydoc?open_revs=all&revs=true' \
    -H "Accept:application/json" | jq .
[
  {
    "ok": {
      "_id": "mydoc",
      "_rev": "5-ab21cb5ac4c8da916c47c45330d8a655",
      "_deleted": true,
      "_revisions": {
        "start": 5,
        "ids": [
          "ab21cb5ac4c8da916c47c45330d8a655",
          "a5be949eeb7296747cc271766e9a498b",
          "2766344359f70192d3a68bf205c37743",
          "cfcd6781f13994bde69a1c3320bfdadb",
          "4c6114c65e295552ab1019e2b046b10e"
        ]
      }
    }
  },
  {
    "ok": {
      "_id": "mydoc",
      "_rev": "3-917fa2381192822767f010b95b45325b",
      "bar": "baz",
      "_revisions": {
        "start": 3,
        "ids": [
          "917fa2381192822767f010b95b45325b",
          "cfcd6781f13994bde69a1c3320bfdadb",
          "4c6114c65e295552ab1019e2b046b10e"
        ]
      }
    }
  }
]

The response still shows both branches. This is expected, as the branches still exist! However, we can see that the branch ending in 5- is deleted and so the 3- branch is the only active branch. Which therefore gives us this final tree:

A tree representation of the resolved document, with an active branch and a deleted branch

Conclusion

Hopefully, this has shown how the Cloudant API for documents is really a tree manipulation API. There are convenience wrappers and behaviours for common operations which make the API more database-like, most of which make use of the tree node labelled winner.

It's useful to think this way, as it explains what effect requests are having on stored data, which helps with understanding how to make Cloudant work effectively. It also helps in understanding replication, as the replication process is just using calls like the ones above to retrieve and duplicate document tree nodes from one database to another.

Before finishing, it's worth repeating that the content of tree nodes that are not at the tip of a branch is discarded periodically by Cloudant -- so don't rely on it sticking around!