Working effectively with CouchDB Mango indexes

Because you work with CouchDB indexes using JSON and JavaScript, it’s tempting to imagine there is something JSON or JavaScript-y about how you use them. In the end, there isn’t: they end up on disk as B+ Trees, like those in pretty much every other database. To create appropriate indexes for your queries, it’s important to understand how these work. We can use tables as an easy mental model for indexes, and this article shows how that works for CouchDB’s Mango feature (also called Cloudant Query).

Our data

Let’s make a simple data model for people, and add three people:

{
    "_id": "123",
    "name": "mike",
    "age": 36
}
{
    "_id": "456",
    "name": "mike",
    "age": 22
}
{
    "_id": "abc",
    "name": "dave",
    "age": 29
}

Now we’ll look at how we can index these, and how that affects our queries.

Single field indexes

Let’s take a simple query first: what’s an effective index for finding all people with a given name? This one feels easy: index on name. Here’s how this is indexed in Mango:

"index": {
    "fields": ["name"]
}

This creates an index on a single field, name. This field is the key in the index. Conceptually, a good representation for this is a table:

key (name)    doc ID
dave          abc
mike          123
mike          456

The doc ID is included as a tie-breaker for entries with equal keys.

This ends up on disk as a B+ Tree. In a similar way to how it’s easy to visually scan the table above from top to bottom (or bottom to top), a B+ Tree makes it fast to scan a file on disk in the same way. Therefore the table and B+ Tree can be considered somewhat equivalent when imagining how a query performs.

Specifically, for a name == "mike" query, we can see that it’s fast to search the first column of the table for "mike" and return data about those entries. This same inference holds for the on-disk B+ Tree, so from here we’ll just talk about the tables.
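To make the table model concrete, here is a small Python sketch (purely illustrative, not CouchDB code) of how keeping (key, doc ID) pairs in sorted order makes this lookup fast:

```python
import bisect

# The index: (key, doc ID) pairs held in sorted order, just like
# the rows of the table above.
index = [
    ("dave", "abc"),
    ("mike", "123"),
    ("mike", "456"),
]

def find(index, name):
    """Return the doc IDs of every entry whose key equals name."""
    # Binary search jumps straight to the first matching row,
    # much as the on-disk B+ Tree is traversed to the first key.
    i = bisect.bisect_left(index, (name,))
    results = []
    # Then scan forward until the key changes.
    while i < len(index) and index[i][0] == name:
        results.append(index[i][1])
        i += 1
    return results

print(find(index, "mike"))  # ['123', '456']
```

The binary search plus short forward scan is the whole trick; everything below builds on this shape.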

Most queries involve more fields, however, so let’s now look at how multi-field indexes work.

Two field indexes

Say we wanted to search by age and name. We can create an index to help with this. In CouchDB we can index both fields in two ways. We’ll get to this below, but before we get started we need to understand that:

  • Indexing an array of fields as a key creates a table with several columns, one per entry in the key array.
  • The entries are ordered by the first column, then the second column, and so on in the overall table.
  • Therefore, an index can only be searched in column order from left to right, because this is the only efficient way to scan the table (or on disk tree).
    • If an index cannot be used for a query, the database has to resort to loading every document in the database and checking it against the query selector; this is the worst case, and can take a very long time. This is called a table scan.

The key we choose dictates how we can search, and therefore how efficient a given query can be. It’s very important to get your indexes right if you want results to arrive quickly in your application. In particular, avoiding table scans for frequently used queries is vital.

Let’s look at the two ways we can index these two fields, and how that ordering shows up in how we can query the indexes.

Firstly, we could use by age, then name:

"index": {
    "fields": ["age", "name"]
}

Giving us the index:

key (age, name)    doc ID
22   mike          456
29   dave          abc
36   mike          123

So a query for name == "mike", age == 36 will initially efficiently search the first column until it finds the first entry for 36. It will then scan each entry with 36 in the first column until it finds the first entry with the value mike. When it reaches the end of the entries with age == 36, the query can stop reading the index because it knows every row in the table after that will have an age greater than 36.

This index can also be used to query on just the age field. In particular, it’s great for queries like age > 20 or age > 20, age < 30 because the query engine can efficiently search for the lower bound, and then scan and return entries until it reaches the upper bound.

The way a query like age > 20, age < 30, name == "mike" works is that the first column is searched for the lower-bound age, then the index is scanned for entries where the name column is "mike". When the search encounters an entry in the first column – the age column – of 30 or greater, it can stop reading the index.

This is important: the first column is searched, but the second column is checked via a slower scan operation. Therefore, for any query, the best index is the one where the first column reduces the search space the most, which reduces the number of rows that need to be scanned through to match entries in the second and further columns of the key.
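The search-then-scan pattern can be sketched in the same spirit. This illustrative Python models the (age, name) index above and runs the age > 20, age < 30, name == "mike" query against it:

```python
import bisect

# The (age, name) index from the table above, sorted by key.
index = [
    ((22, "mike"), "456"),
    ((29, "dave"), "abc"),
    ((36, "mike"), "123"),
]

def query(index, age_gt, age_lt, name):
    """Docs matching age_gt < age < age_lt and the given name."""
    # Efficient search: find the lower bound in the first column.
    # (A real B+ Tree does this directly; we build a key list here
    # only because Python's bisect wants a flat sequence.)
    first_column = [key[0] for key, _ in index]
    i = bisect.bisect_right(first_column, age_gt)
    results = []
    # Slower scan: walk the rows, checking the name column as we go...
    while i < len(index) and index[i][0][0] < age_lt:
        if index[i][0][1] == name:
            results.append(index[i][1])
        i += 1
    # ...stopping as soon as the age column passes the upper bound.
    return results

print(query(index, 20, 30, "mike"))  # ['456']
```

The first column is searched with a binary search; the second column is only ever checked row by row during the scan.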

This index cannot, however, be used for the query name == "mike": there is no overall ordering for the name column, so it cannot be scanned efficiently by name. Entries are all jiggled around as they are ordered by the age field. The query engine would therefore need to scan every entry in the index – not very efficient, though this index scan can actually be cheaper than a full table scan. However, CouchDB doesn’t support index scans at this time, so it will fall back to the table scan.

Secondly, we could use by name, then age:

"index": {
    "fields": ["name", "age"]
}

Giving us the index:

key (name, age)    doc ID
dave  29           abc
mike  22           456
mike  36           123

So a query for name == "mike", age == 36 will first efficiently search the first column until it finds the first entry for mike. It will then scan each entry with mike in the first column until it finds the first entry with 36 in the second column.

This index can also be used to query on just the name field. It can also be used to effectively answer questions like name == "mike", age > 30 because the first column narrows down the name quickly and the scan for age can be fast.

It might help to imagine we have many millions of entries: it’s likely there are lots of people with a certain age, but far fewer with a certain name. Therefore, the initial search for name == "mike" will constrain the search space far more than a search for age > 30, and so we end up scanning far fewer rows for the age value in the second column.
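A quick toy experiment (illustrative Python, not CouchDB) shows the selectivity difference, counting how many rows each key ordering would scan for a query of this shape over a larger dataset:

```python
import random

random.seed(42)

# Toy data: ages cluster into a narrow range, but any given name is rare.
docs = [("mike" if random.random() < 0.01 else "person%d" % i,
         random.randint(20, 40))
        for i in range(10_000)]

# Index keyed (name, age): the search narrows straight to the "mike"
# block, so only those rows need scanning for the age condition.
rows_scanned_by_name = sum(1 for name, age in docs if name == "mike")

# Index keyed (age, name): the search narrows to age > 30, but every
# one of those rows must then be scanned to check the name column.
rows_scanned_by_age = sum(1 for name, age in docs if age > 30)

print(rows_scanned_by_name, "rows vs", rows_scanned_by_age, "rows")
```

With the name-first key the scan touches a few dozen rows; with the age-first key it touches thousands.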

This index cannot be used for queries on the age field for the same reason as above; it just wouldn’t be efficient to scan the whole table.

Summary

The above logic holds for indexes of three, four or any number of fields. The first column can be efficiently searched, and then we can reasonably efficiently scan for matching entries in the second, third and later columns, provided the first-column search narrows down the search space enough.

Creating appropriate indexes is key for the performance of CouchDB applications making use of Mango (or Cloudant Query on Cloudant). Hopefully this article helps show that it’s relatively straightforward to generate effective indexes once you have worked out the queries they need to service, and that it is possible to create indexes that can serve more than one query’s need by judicious use of multi-field indexes.

How docker build args expose passwords

Avoiding docker build --build-arg as a way to inject secrets or passwords into Docker image builds is established wisdom within the Docker community. Here’s why.

TLDR: Using build args for secrets exposes the secret to users of your image via docker history.

Take the following Dockerfile:

FROM alpine:latest
ARG password
RUN echo hello world

This looks pretty innocent – we’re not even using the password during the build!

Let’s build the image, using the password secretsquirrel:

> docker build --build-arg password=secretsquirrel .
Sending build context to Docker daemon  2.048kB
Step 1/3 : FROM alpine:latest
latest: Pulling from library/alpine
bdf0201b3a05: Pull complete
Digest: sha256:28ef97b86[...]
Status: Downloaded newer image for alpine:latest
 ---> cdf98d1859c1
Step 2/3 : ARG password
 ---> Running in 38993dbd0f31
Removing intermediate container 38993dbd0f31
 ---> 8bef9d60eae8
Step 3/3 : RUN echo hello world
 ---> Running in 0c4214ebfce8
hello world
Removing intermediate container 0c4214ebfce8
 ---> 2fd2a25cfdb3
Successfully built 2fd2a25cfdb3

Again, looks pretty safe – the password doesn’t appear in the output.

However, let’s take a look at this using docker history:

> docker history 2fd2a25cfdb3
IMAGE         ...  CREATED BY                                      ...
2fd2a25cfdb3  ...  |1 password=secretsquirrel /bin/sh -c echo h…   ...
8bef9d60eae8  ...  /bin/sh -c #(nop)  ARG password                 ...
cdf98d1859c1  ...  /bin/sh -c #(nop)  CMD ["/bin/sh"]              ...
<missing>     ...  /bin/sh -c #(nop) ADD file:2e3a37883f56a4a27…   ...

There we go, our password is right there for anyone with access to the image. The docker build command passes ARG values to all RUN steps as environment variables which appear in the history output 😭

Using sed to extract HTTP headers

Today I needed to take an HTTP response and extract the etag header; the etag was used as part of an MVCC implementation in a service I was using and I wanted to script an update to a resource. I was doing this in a Makefile, so wanted to do it without firing up a scripting language.

It turns out this is the domain of tools like sed. sed stands for stream editor. It applies scripts to text streams which edit the content of the stream. When you watch someone using sed, the scripts look super-cryptic, but in fact they’re not too bad. Like a regular expression, they benefit from reading left to right; when viewed as a whole they are just a mess. In fact, half of a sed script is often a regular expression!

The sample headers

First, we’ll get the HTTP headers to work with. I found a new curl option, -D <filename> that will do this for you. So to get the headers for dx13.co.uk:

curl -D headers.txt https://dx13.co.uk

There are quite a lot of headers that come with a call to dx13.co.uk, so I trimmed most of them from the end to leave something a bit shorter to work with, which doesn’t affect the sed commands at all. That left us with:

> cat headers.txt
HTTP/2 200
server: GitHub.com
content-type: text/html; charset=utf-8
last-modified: Tue, 06 Nov 2018 15:58:30 GMT
etag: "5be1ba26-a9dd"
access-control-allow-origin: *
expires: Fri, 22 Mar 2019 14:03:49 GMT
cache-control: max-age=600
x-github-request-id: 6F9E:2F59:86E637:B2E922:5C94E8ED

A sed primer

We’ll come to executing scripts in a minute. First, we’ll get familiar with what a script looks like. The basic form is:

[addr]X[options]
  • addr selects a set of lines to operate on. It can be a single line, a line range or a regular expression.
    • A single line is just the line number, 12.
    • A regex is delimited using slashes, /regex/.
    • A range is comma-separated, 12,16.
    • Matching can be inverted using ! at the end of the address.
    • If there is no addr, the command is executed on all lines of the file.
    • The documentation for addresses.
  • X is a command (like d or s).
  • options are options to the command.
    • s has the option /foo/bar/.

So in:

  • '14d': the range is line 14; and then d removes the line; no options are used. This removes line 14 of the input.
  • '/:/d': the range is the regex :; and then d removes the lines; no options are used. This will remove lines containing : from the input.
  • 's/^.*: /foo! /': the range is all lines; the command is s; the option is the find/replace specification. We’ll see what this does later.

I found the s command familiar – it’s just like vim’s.

Using sed to get the etag

By default, sed applies its first argument as a script and its second as the input file, and outputs to stdout.

Substitution

A simple script is a vim-like search and replace. Here, we replace the header names with foo!:

> sed 's/^.*: /foo! /' headers.txt
HTTP/2 200
foo! GitHub.com
foo! text/html; charset=utf-8
foo! Tue, 06 Nov 2018 15:58:30 GMT
foo! "5be1ba26-a9dd"
foo! *
foo! Fri, 22 Mar 2019 14:03:49 GMT
foo! max-age=600
foo! 6F9E:2F59:86E637:B2E922:5C94E8ED

As we head straight to the s command and don’t specify an address, the command is executed on all lines of the file.

Chaining

By using the -e flag, multiple scripts can be chained. You can also use one big script string with semi-colons, but I find multiple -e flags easier to read.

Replace header names with foo! as above, then replace foo with bar:

> sed -e 's/^.*: /foo! /' -e 's/foo/bar/' headers.txt
HTTP/2 200
bar! GitHub.com
bar! text/html; charset=utf-8
bar! Tue, 06 Nov 2018 15:58:30 GMT
bar! "5be1ba26-a9dd"
bar! *
bar! Fri, 22 Mar 2019 14:03:49 GMT
bar! max-age=600
bar! 6F9E:2F59:86E637:B2E922:5C94E8ED

Removing lines

As mentioned in the primer, removing lines is done using a command within the script, d. !d is used to invert the behaviour.

Remove all the lines containing a colon:

> sed '/:/d' headers.txt
HTTP/2 200

Note that we use the address /:/ which is a regex that matches all lines with a colon. The rest of the script executes on these lines.

Remove all the lines without a colon:

> sed '/:/!d' headers.txt
server: GitHub.com
content-type: text/html; charset=utf-8
last-modified: Tue, 06 Nov 2018 15:58:30 GMT
etag: "5be1ba26-a9dd"
access-control-allow-origin: *
expires: Fri, 22 Mar 2019 14:03:49 GMT
cache-control: max-age=600
x-github-request-id: 6F9E:2F59:86E637:B2E922:5C94E8ED

Here we use /:/! as the address – this causes the command to be executed on the lines that don’t match the regex.

Getting the etag

Finally we’re ready!

Combining the above, we can retrieve the ETag header using a chain of three scripts:

> sed -e '/etag/!d' -e 's/^etag: //' -e 's/"//g' headers.txt
5be1ba26-a9dd

That is:

  1. Remove the lines not containing etag.
    • This passes just one line to the next script: etag: "5be1ba26-a9dd"
  2. Remove the header name from the remaining line.
    • This leaves: "5be1ba26-a9dd"
  3. Remove the quotes. The g in s/"//g means global; leaving it out means that sed would replace only the first instance of " that it found. Making the replacement global means that all instances on the line are replaced.
    • Giving us: 5be1ba26-a9dd
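For comparison, here is the same three-step extraction written with Python's re module, just to spell out what the sed chain is doing (using a trimmed copy of the headers):

```python
import re

headers = '''HTTP/2 200
server: GitHub.com
etag: "5be1ba26-a9dd"
cache-control: max-age=600'''

# 1. Keep only lines containing "etag" (sed: /etag/!d).
lines = [l for l in headers.splitlines() if "etag" in l]
# 2. Strip the header name (sed: s/^etag: //).
line = re.sub(r"^etag: ", "", lines[0])
# 3. Remove every double quote (sed: s/"//g).
etag = line.replace('"', "")
print(etag)  # 5be1ba26-a9dd
```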

In the end, it feels like a bit of an anti-climax. However, it’s now much clearer to me where I’d try to make use of sed, and I feel I’ve learned enough to be dangerous!

Querying Cloudant: what are stale, update and stable?

tl;dr If you are using stale=ok in queries to Cloudant or CouchDB 2.x, you most likely want to be using update=false instead. If you are using stale=update_after, use update=lazy instead.

This question has come up a few times, so here’s a reference to what the situation is with these parameters to query requests in Cloudant and CouchDB 2.x.

CouchDB originally used stale=ok on the query string to specify that you were okay with receiving out-of-date results. By default, CouchDB lazily updates indexes upon querying them rather than when JSON data is changed or added. If up to date results are not strictly required, using stale=ok provides a latency improvement for queries as the request does not have to wait for indexes to be updated before returning results. This is particularly useful for databases with a high write rate.

As an aside, Cloudant automatically enqueues indexes for update when primary data changes, so this problem isn’t so acute. However, in the face of high update rate bursts, it’s still possible for indexing to fall behind so a delay may occur.

When using a single node, as in CouchDB 1.x, this parameter behaved as you’d expect. However, when clustering was added to CouchDB, a second meaning was added to stale=ok: also use the same set of shard replicas to retrieve the results.

Recall that Cloudant and CouchDB 2.x store three copies of each shard and by default will use the shard replica that starts returning results fastest for a query request. This latter fact helps even out load across the cluster. Heavily loaded nodes will likely return slower and so won’t be picked to respond to a given query. When using stale=ok, the database will instead always use the same shard replicas for every request to that index. The use of the same replicas to answer queries has two effects:

  1. Using stale=ok could drive load unevenly across the nodes in your database cluster because certain shard replicas would always be used for the queries to the index that specify stale=ok. This means a set of nodes could receive outsize numbers of requests.
  2. If one of the replicas was hosted on a heavily loaded node in the cluster, this would slow down all queries to that index using stale=ok. This is compounded by the tendency of stale=ok to drive imbalanced load.

The end result is that using stale=ok can, counter-intuitively, cause queries to become slower. Worse, they may become unavailable during cluster split-brain scenarios because of the forced use of a certain set of replicas. Given that mostly people use stale=ok to improve performance, this wasn’t a great state to be in.

As stale=ok’s existing behaviour needed to be maintained for backwards compatibility, the fix for this problem was to introduce two new query string parameters, which set each of the two stale=ok behaviours independently:

  1. update=true/false/lazy: controls whether the index should be up to date before the query is executed.
    1. true: the index will be updated first.
    2. false: the index will not be updated.
    3. lazy: the index will not be updated before the query, but enqueued for update after the query is completed.
  2. stable=true/false: controls whether the same set of shard replicas is used for every query.

The main benefit of stable=true is that queries are more likely to appear to “go forward in time”, because each shard replica may update its indexes in different orders. However, this isn’t guaranteed, so the availability and performance trade-offs are likely not worth it.

The end result is that virtually all applications using stale=ok should move to instead use update=false.
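To make the mapping concrete, here is a hypothetical helper (not part of any CouchDB client library) that rewrites the legacy parameter into the new ones. It preserves the literal stale=ok semantics, including the replica pinning that, as argued above, you probably want to drop:

```python
def translate_stale(params):
    """Rewrite the legacy `stale` query parameter into the newer
    `update`/`stable` parameters. A hypothetical helper for
    illustration, not part of any CouchDB client library."""
    stale = params.pop("stale", None)
    if stale == "ok":
        # stale=ok meant both "don't update the index first"
        # and "pin the query to the same shard replicas".
        params["update"] = "false"
        params["stable"] = "true"
    elif stale == "update_after":
        # Update the index after the query has returned.
        params["update"] = "lazy"
    return params

print(translate_stale({"stale": "ok", "limit": "10"}))
# {'limit': '10', 'update': 'false', 'stable': 'true'}
```

In practice, most applications are better off leaving stable at its default and sending update=false alone.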

What is docker?

When I first came across docker a few years ago, probably late 2014, so a year after it was introduced at PyCon during 2013, I found it a confusing concept. “Like GitHub, but for containers” was a phrase that I recall from that period, which I think ended up causing a lot of my confusion – I conflated Docker Hub with docker the tool.

Since then, I’ve learned more about docker, particularly in the last year. I think that things started to click around a year ago, and over the past few months as I’ve looked further into Kubernetes and written my own pieces of software destined for container deployment I’ve formed my own mental model of where docker fits into my world. This post is about my writing that down to understand its coherency.

I tend towards understanding systems like this bottom-up, so let’s start at the beginning, which is also conveniently the bottom.

cgroups

Cgroups, or control groups to give them their full name, were introduced into the mainline Linux kernel in 2.6.24, released in January 2008. What cgroups allow is for processes running on a system to be hierarchically grouped in such a way that various controls and boundaries can be applied to a process hierarchy.

Cgroups are a necessary but not sufficient part of a container solution, and they are also used for lots of things other than containers. Systemd, for example, uses cgroups when defining resource limits on the processes it manages.

Like many things within the Linux kernel, cgroups are exposed within the file hierarchy. A system administrator writes and reads from files within the mounted cgroups filesystem to define cgroups and their properties. A process is added to a cgroup by writing its PID to a file within the cgroups hierarchy; the process is automatically removed from its previous cgroup.
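The file interface can be sketched in a few lines of Python. This is purely illustrative: a temporary directory stands in for the real /sys/fs/cgroup mount (which needs root to modify), and we create the control files ourselves, whereas the kernel provides them for real cgroups. The memory.max file name is from the cgroups v2 interface.

```python
import os
import tempfile

# Stand-in for the mounted cgroups filesystem (normally /sys/fs/cgroup).
cgroup_root = tempfile.mkdtemp()

# A cgroup is defined by creating a directory in the hierarchy...
mygroup = os.path.join(cgroup_root, "mygroup")
os.mkdir(mygroup)

# ...its properties are set by writing to control files within it
# (the kernel creates these files for real cgroups)...
with open(os.path.join(mygroup, "memory.max"), "w") as f:
    f.write("134217728")  # a 128 MiB hard memory limit

# ...and a process joins the cgroup by having its PID written to
# cgroup.procs; the kernel removes it from its previous cgroup.
with open(os.path.join(mygroup, "cgroup.procs"), "w") as f:
    f.write(str(os.getpid()))
```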

Overall, cgroups provide docker with a simple(ish) way to control the resources a process hierarchy uses (like CPU) and has access to (like networks and part of the filesystem).

Cgroups provide control of various resources, but the main ones to consider for docker containers are:

  • CPU controller – using cpu shares, CPU time can be divided up between processes to ensure a process gets a share of CPU time to run in.
  • Memory controller – a process can be given its own chunk of memory which has a hard limit on its size.

From this, it’s relatively easy to see how docker can assign resources to a container – put the process running within the container in a cgroup and set up the resource constraints for it.

Beyond controlling scarce resources like CPU and memory, the kernel also provides namespaces, which control what a process can see. Namespaces are the next piece of the puzzle and we move to them next.

Kernel namespaces

Putting a process within a namespace is a means to define what the process has access to. A namespace is the boundary defining what a process can see, while cgroups control how much of each resource the processes inside a hierarchy can use.

Also in 2.6.24 came the core of network namespaces. This and future patchsets enable processes to be presented with their own view of the network stack, covering network functions such as interfaces, routing tables and so on.

The Wikipedia article on kernel namespaces has a list of the current resources that can be isolated using namespaces. We can form a basic view of how containers are run by docker (and any other container management software) using just a couple of these:

  1. The Mount (MNT) namespace
  2. The Network (NET) namespace

Things like the PID and User namespaces provide extra isolation, but I’m not going to cover them here.

I confess here I’m making some guesses as to what’s going on, but the mental model has served me okay so I’ll reproduce it here. Broadly I consider these two namespaces to be the basis of docker’s ability to run what amounts to “pre-packaged” software.

Mount

Mount namespaces define what the filesystem looks like to the process running within the namespace. So different processes can see entirely different views of the filesystem.

My general assumption here is that docker is using MNT namespaces to provide the running container with a unique view of the filesystem, both its own “root image” that we’ll talk about later and the parts of the host filesystem mounted into the running container using the --mount option.

Network

As NET namespaces provide processes with a custom view of the network stack and provide ways for processes in different namespaces to poke holes to each other via the network, I assume this is the basis for docker’s bridge network type which sets up a private network between processes running in containers. When one runs a container with the host network type, my basic layman’s assumption is that the container’s process is not placed within its own network namespace (or it lives within the default namespace).

Union filesystems

A union filesystem essentially takes several filesystem images and layers them on top of each other. Images “above” override values from images “below”. For any file read, the union filesystem traverses the image stack from top to bottom and returns the file content from the first image containing the file. For writes, either the write just fails (for a read-only union filesystem) or the write goes to the top-most layer. Often this top-most layer is initially an empty image created specifically for the writes of files to the mounted union filesystem.

An important point to note is that two or more mounted union filesystems can share images, meaning that two union filesystems could have, say, the first five images in their respective stacks shared but each with different images stacked on top to provide very different resultant filesystems.
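A union filesystem is straightforward to model. In this toy Python sketch, dicts stand in for on-disk images; it shows reads traversing top to bottom, writes landing in the top layer, and two "containers" safely sharing their lower layers:

```python
# Toy model of a union filesystem: each image layer is a dict
# mapping paths to file contents, stacked bottom to top.

class UnionFS:
    def __init__(self, *layers):
        # Lower (read-only) layers first; an empty writable
        # layer is created on top, as for a new container.
        self.layers = list(layers) + [{}]

    def read(self, path):
        # Traverse the stack top to bottom; the first layer
        # containing the path wins.
        for layer in reversed(self.layers):
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def write(self, path, content):
        # Writes always land in the top-most, writable layer.
        self.layers[-1][path] = content

base = {"/etc/os-release": "alpine"}   # shared read-only base image
app = {"/app/run.sh": "echo hi"}       # an image layered on top

# Two "containers" share the same lower layers.
c1 = UnionFS(base, app)
c2 = UnionFS(base, app)

c1.write("/etc/os-release", "patched")  # shadows the base copy
print(c1.read("/etc/os-release"))       # patched
print(c2.read("/etc/os-release"))       # alpine, c2 is unaffected
```

Note that the shared base layer is never modified; each container's writes are isolated in its own top layer.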

Docker images and layers

When running a docker container, one specifies an image to “run” via a command like:

docker run couchdb

The couchdb part of this command specifies the image to download. I find the naming gets a bit confusing here, because essentially the “image” is actually a pointer to the top image of a stack of images which together form the union filesystem that ends up being the root filesystem of the running container.

While the above command reads “run the couchdb container”, it’s really more like “create and start a new container using the couchdb image as the base image for the container”. In fact, the docker run documentation describes this as:

The docker run command must specify an IMAGE to derive the container from.

The word derive is key here. By my understanding, docker adds a further image to the top of the stack which is where writes that the running container makes are written to. This image is saved under the name of the newly started container, and persists after the container is stopped under the container’s name. This is what allows docker to essentially stop and start containers while maintaining the files changed within the container – behind the scenes it’s managing an image used as the top image on the container’s union filesystem.

Docker calls the images that stack up to form the union filesystem “layers” and the ordered collection of layers an “image”. This concept is key to how dockerfiles work – at a first approximation, each line in a dockerfile adds a new image to the image stack used as the base image for a container. So, for example, a command that runs apt to install software creates a new image containing the changes to the filesystem made while installing the software.

It’s also obvious how this allows for dockerfile FROM lines to work – it just points back to the image at the top of the stack and then further dockerfile commands layer more images onto that stack to form a new docker image.

In addition, the fact that a union filesystem is able to share images at the lower levels means that docker is able to share lower level base images across many containers but only ever have a single copy on disk. These base images are read-only and so can be safely used in many containers’ union filesystem image stacks.

Putting it together

So basically what docker does when we use the docker run command is:

  1. Download the base image to derive the container from.
  2. Create a union filesystem consisting of the layers in the base image and a new layer at the top of the stack for the container to write its own files to.
  3. Set up a network namespace for the container.
  4. Set up a cgroup for resource limits, and a mount namespace such that the process has the union filesystem mounted as its root filesystem; the network namespace from the previous step provides its private view of the host’s network stack. Mount other volumes into the mount namespace as specified on the docker run command.
  5. Start up a process within the union filesystem within the cgroup.

Image repositories

This is where the “GitHub for containers” thing comes from. A docker daemon manages a local collection of union filesystem images on your machine called a repository – which contains all the base images and other images used to form the union filesystems for containers on the system (including the top-of-the-stack writable images containers use).

But Docker also manages a large central collection of images which can be used as base images for either direct running via the docker run command or used as the start point for other docker images using the FROM dockerfile command. When used, the docker daemon downloads the image from the remote repository and uses it to derive a new container.

There’s some naming stuff here that I never quite got my head around, in that the couchdb bit in the docker run command is actually a repository itself, making Docker Hub more of a collection of repositories. The actual image used by the docker tool on your computer is chosen by the “tag” you select, and each repository has a set of tags defining the images you can use. There’s a default tag specified for each image repository, which is used when the docker run command just specifies a repository name and misses out the tag.

So I guess this use of Docker Hub as a collection of repos, which contain tagged images, can be mapped imperfectly to the way GitHub works, GitHub being a collection of Git repos. However, the terminology match is far from exact, which definitely caused me problems when trying to understand docker.

A key thing I found confusing is that because a repository is really just a collection of arbitrary images, your local repository can – and almost certainly does! – contain the base images for lots of different pieces of software as well as many, many intermediate layers, whereas the repos on Docker Hub typically contain several tagged versions of a single piece of software. A Docker Hub repo could presumably therefore also contain many disparate pieces of software, but convention dictates that is not what happens, at least in public repos.

Summary

Thinking of docker merely as the repository concept misses out a lot of useful context for what docker means for your machines – the level of access required to create and maintain the cgroups and namespaces is high, and hopefully it’s a bit clearer why the docker daemon requires it from this post.

The v2 interface for cgroups provides for delegation of a portion of the cgroups hierarchy to a non-privileged process, which at a first scan read suggests a route to a less privileged docker daemon, or perhaps it’s already possible to use this. We’ve reached the boundaries of my knowledge and mental model here, so it’s time to stop for now.

As noted at the beginning, this post is a synthesis of my current understanding of how containers and docker work. Written down, my model does seem coherent and logical, but there is quite a bit of guesswork going on. I’d therefore be very pleased to receive corrections to my descriptions and explanations.