Today I needed to take an HTTP response and extract the ETag header; the
service I was using relied on the ETag, and I wanted to script an update to a
resource. I was doing this in a Makefile so wanted to do this without firing
up a scripting language.
It turns out this is the domain of tools like sed. sed stands for stream
editor. It applies scripts to text streams which edit the content of the
stream. When you watch someone using sed, the scripts look super-cryptic,
but in fact they’re not too bad. Like a regular expression, they benefit from
reading left to right; when viewed as a whole they are just a mess. In fact,
half of a sed script is often a regular expression!
First, we’ll get the HTTP headers to work with. I found a new (to me) curl
option, -D <filename>, that will do this for you. So to get the headers for dx13.co.uk:
curl -D headers.txt https://dx13.co.uk
There are quite a lot of headers that come with a call to dx13.co.uk, so I
trimmed most of them from the end to leave something a bit shorter to work
with, which doesn’t affect the sed commands at all. That left us with:
> cat headers.txt
HTTP/2 200
server: GitHub.com
content-type: text/html; charset=utf-8
last-modified: Tue, 06 Nov 2018 15:58:30 GMT
etag: "5be1ba26-a9dd"
access-control-allow-origin: *
expires: Fri, 22 Mar 2019 14:03:49 GMT
cache-control: max-age=600
x-github-request-id: 6F9E:2F59:86E637:B2E922:5C94E8ED
We’ll come to executing scripts in a minute. First, we’ll get familiar with what a script looks like. The basic form is:
[addr]X[options]
addr selects a set of lines to operate on. It can be a single line, a line range or a regular expression.
The selection is inverted by putting ! at the end of the address.
If addr is left out, the command is executed on all file lines.
X is a command (like s or d).
options are options to the command.
For example, s has the find/replace specification as its option.
'14d': the range is line 14; and then d removes the line; no options are used. This removes line 14 of the input.
'/:/d': the range is the regex :; and then d removes the lines; no options are used. This will remove lines containing : from the input.
's/^.*: /foo! /': the range is all lines; the command is s; the option is the find/replace specification. We’ll see what this does later.
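To make the three kinds of address concrete, here is a small sketch that runs the d command against a line number, a line range and a regex, plus an inverted regex. The sample file and its contents are made up for illustration; nothing beyond a standard sed is assumed.

```shell
#!/bin/sh
# Build a tiny, made-up input file to experiment on.
sample=$(mktemp)
printf '%s\n' 'one' 'two: b' 'three' 'four: d' > "$sample"

# Single-line address: delete line 2 only.
single=$(sed '2d' "$sample")

# Range address: delete lines 2 through 3.
range=$(sed '2,3d' "$sample")

# Regex address: delete every line containing a colon.
regex=$(sed '/:/d' "$sample")

# Inverted regex address: delete every line that does NOT contain a colon.
inverted=$(sed '/:/!d' "$sample")

printf 'single:\n%s\nrange:\n%s\nregex:\n%s\ninverted:\n%s\n' \
  "$single" "$range" "$regex" "$inverted"
rm -f "$sample"
```

In each case the part before d is the address and d itself never changes, which is what makes the scripts readable left to right.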
I found the s command familiar: it’s just like vim’s.
By default, sed applies its first argument as a script and second as the input
file, and outputs to stdout.
A simple script is a vim-like search and replace. Here, we replace the header
names with foo!:
> sed 's/^.*: /foo! /' headers.txt
HTTP/2 200
foo! GitHub.com
foo! text/html; charset=utf-8
foo! Tue, 06 Nov 2018 15:58:30 GMT
foo! "5be1ba26-a9dd"
foo! *
foo! Fri, 22 Mar 2019 14:03:49 GMT
foo! max-age=600
foo! 6F9E:2F59:86E637:B2E922:5C94E8ED
As we head straight to the
s command and don’t specify an address, the command
is executed on all lines of the file.
By using the -e flag, multiple scripts can be chained. You can also use one
big script string with semi-colons, but I find multiple -e flags easier to
read.
Replace header names with foo! as above, then replace foo with bar:
> sed -e 's/^.*: /foo! /' -e 's/foo/bar/' headers.txt
HTTP/2 200
bar! GitHub.com
bar! text/html; charset=utf-8
bar! Tue, 06 Nov 2018 15:58:30 GMT
bar! "5be1ba26-a9dd"
bar! *
bar! Fri, 22 Mar 2019 14:03:49 GMT
bar! max-age=600
bar! 6F9E:2F59:86E637:B2E922:5C94E8ED
As mentioned in the primer, removing lines is done using a command within the
script: d. Adding ! to the address, as in !d, inverts the behaviour.
Remove all the lines containing a colon:
> sed '/:/d' headers.txt
HTTP/2 200
Note that we use the address
/:/ which is a regex that matches all lines
with a colon. The rest of the script executes on these lines.
Remove all the lines without a colon:
> sed '/:/!d' headers.txt
server: GitHub.com
content-type: text/html; charset=utf-8
last-modified: Tue, 06 Nov 2018 15:58:30 GMT
etag: "5be1ba26-a9dd"
access-control-allow-origin: *
expires: Fri, 22 Mar 2019 14:03:49 GMT
cache-control: max-age=600
x-github-request-id: 6F9E:2F59:86E637:B2E922:5C94E8ED
Here we use
/:/! as the address – this causes the command to be executed
on the lines that don’t match the regex.
Finally we’re ready!
Combining the above, we can retrieve the ETag header using a chain of three scripts:
> sed -e '/etag/!d' -e 's/^etag: //' -e 's/"//g' headers.txt
5be1ba26-a9dd
The g at the end of s/"//g means global; leaving it out means that sed would
replace only the first instance of " that it found on each line. Making the
replacement global means that all instances on the line are replaced.
In the end, it feels like a bit of an anti-climax. However, it’s now much
clearer to me where I’d try to make use of
sed, and I feel I’ve learned
enough to be dangerous!
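As a postscript, the same extraction can be written a couple of other ways, which may be handy inside a Makefile recipe. This sketch recreates a cut-down headers.txt in a scratch file and assumes the header name is lower-case, as in the output above. The -n option and the p flag are standard sed features that this post doesn't otherwise cover.

```shell
#!/bin/sh
# Recreate a shortened version of the headers file used in this post.
headers=$(mktemp)
cat > "$headers" <<'EOF'
HTTP/2 200
server: GitHub.com
etag: "5be1ba26-a9dd"
cache-control: max-age=600
EOF

# One script string, with the three commands separated by semi-colons.
etag_chain=$(sed '/etag/!d;s/^etag: //;s/"//g' "$headers")

# A single substitution: -n suppresses the default output, \(...\) captures
# the text between the quotes, and the p flag prints only lines where the
# substitution matched.
etag_single=$(sed -n 's/^etag: "\(.*\)"$/\1/p' "$headers")

echo "$etag_chain"
echo "$etag_single"
rm -f "$headers"
```

Two caveats if you try this for real: in a Makefile recipe each $ needs doubling as $$, and header files written by curl may carry a trailing carriage return on each line, in which case the "$ anchor in the second form would need loosening.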
tl;dr If you are using stale=ok in queries to Cloudant or CouchDB 2.x, you
most likely want to be using update=false instead. If you are using
stale=update_after, the equivalent is update=lazy.
CouchDB originally used
stale=ok on the query string to specify that you were
okay with receiving out-of-date results. By default, CouchDB lazily updates
indexes upon querying them rather than when JSON data is changed or added. If up
to date results are not strictly required, using
stale=ok provides a latency
improvement for queries as the request does not have to wait for indexes to be
updated before returning results. This is particularly useful for databases with
a high write rate.
As an aside, Cloudant automatically enqueues indexes for update when primary data changes, so this problem isn’t so acute. However, in the face of high update rate bursts, it’s still possible for indexing to fall behind so a delay may occur.
When using a single node, as in CouchDB 1.x, this parameter behaved as you’d
expect. However, when clustering was added to CouchDB, a second meaning was
added to stale=ok: also use the same set of shard replicas to retrieve the
results.
Recall that Cloudant and CouchDB 2.x store three copies of each shard and
by default will use the shard replica that starts returning results fastest for
a query request. This latter fact helps even out load across the cluster.
Heavily loaded nodes will likely return slower and so won’t be picked to respond
to a given query. When using
stale=ok, the database will instead always use
the same shard replicas for every request to that index. The use of the same
replica to answer queries has two effects:
stale=ok could drive load unevenly across the nodes in your database cluster because certain shard replicas would always be used for the queries to the index that specify stale=ok. This means a set of nodes could receive outsized numbers of requests.
stale=ok. This is compounded by the tendency of stale=ok to drive imbalanced load.
The end result is that using
stale=ok can, counter-intuitively, cause queries
to become slower. Worse, they may become unavailable during cluster split-brain
scenarios because of the forced use of a certain set of replicas. Given that
people mostly use
stale=ok to improve performance, this wasn’t a great state
to be in.
As stale=ok’s existing behaviour needed to be maintained for backwards
compatibility, the fix for this problem was to introduce two new query string
parameters, which set each of the two stale=ok behaviours separately:
update=true/false/lazy: controls whether the index should be up to date before the query is executed.
true: the index will be updated first.
false: the index will not be updated.
lazy: the index will not be updated before the query, but enqueued for update after the query is completed.
stable=true/false: controls whether the same set of shard replicas is used for every request.
The main use of stable=true is that queries are more likely to appear to “go
forward in time”: each shard replica may update its indexes in a different
order, so sticking to one set of replicas avoids bouncing between them.
However, this isn’t guaranteed, so the availability and performance
trade offs are likely not worth it.
The end result is that virtually all applications using stale=ok should move
to instead use update=false.
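As a rough illustration of the change, this sketch builds a view query URL that uses update=false rather than stale=ok. The account URL, database, design document and view names are all hypothetical; only the /{db}/_design/{ddoc}/_view/{view} path shape and the update parameter come from CouchDB itself.

```shell
#!/bin/sh
# Hypothetical account URL, for illustration only.
BASE_URL="https://example.cloudant.com"

# Build a view query URL that accepts possibly out-of-date results
# without pinning the query to a fixed set of shard replicas.
view_url() {
  # $1 = database, $2 = design document, $3 = view name
  printf '%s/%s/_design/%s/_view/%s?update=false\n' "$BASE_URL" "$1" "$2" "$3"
}

url=$(view_url orders reports by_day)
echo "$url"

# To actually run the query (needs a real database and credentials):
#   curl "$url"
```

Leaving stable at its default means the database stays free to pick whichever shard replica answers fastest.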
When I first came across docker a few years ago, probably late 2014, so a year after it was introduced at PyCon during 2013, I found it a confusing concept. “Like GitHub, but for containers” was a phrase that I recall from that period, which I think ended up causing a lot of my confusion – I conflated Docker Hub with docker the tool.
Since then, I’ve learned more about docker, particularly in the last year. I think that things started to click around a year ago, and over the past few months as I’ve looked further into Kubernetes and written my own pieces of software destined for container deployment I’ve formed my own mental model of where docker fits into my world. This post is about my writing that down to understand its coherency.
I tend towards understanding systems like this bottom-up, so let’s start at the beginning, which is also conveniently the bottom.
Cgroups, or control groups to give them their full name, were introduced into the mainline Linux kernel in 2.6.24, released in January 2008. What cgroups allow is for processes running on a system to be hierarchically grouped in such a way that various controls and boundaries can be applied to a process hierarchy.
Cgroups are a necessary but not sufficient part of a container solution, and they are also used for lots of things other than containers. Systemd, for example, uses cgroups when defining resource limits on the processes it manages.
Like many things within the Linux kernel, cgroups are exposed within the file hierarchy. A system administrator writes and reads from files within the mounted cgroups filesystem to define cgroups and their properties. A process is added to a cgroup by writing its PID to a file within the cgroups hierarchy; the process is automatically removed from its previous cgroup.
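The file-based interface can be sketched in a few lines of shell. To keep this runnable without root, CGROUP_ROOT below defaults to a scratch directory, so it is only a mock of the real thing: on an actual cgroup filesystem the kernel creates the control files when you mkdir the group. The file names memory.max and cgroup.procs are the cgroup v2 ones; the group name demo and the 100 MiB limit are made up for illustration.

```shell
#!/bin/sh
# Assumption: a scratch directory stands in for the cgroup mount point.
# On a real cgroup v2 system this would be /sys/fs/cgroup, and writing to
# it needs root plus the memory controller enabled for the parent group.
CGROUP_ROOT="${CGROUP_ROOT:-$(mktemp -d)}"
group="$CGROUP_ROOT/demo"

# Creating a directory defines a new cgroup.
mkdir -p "$group"

# Writing to a control file sets a property: here a 100 MiB memory limit.
echo "104857600" > "$group/memory.max"

# Writing a PID to cgroup.procs moves that process into the group.
echo "$$" > "$group/cgroup.procs"

echo "memory limit: $(cat "$group/memory.max")"
echo "member PIDs:  $(cat "$group/cgroup.procs")"
```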
Overall, cgroups provide docker with a simple(ish) way to control the resources a process hierarchy uses (like CPU) and has access to (like networks and part of the filesystem).
Cgroups provide control of various resources, but the main ones to consider for docker containers are:
From this, it’s relatively easy to see how docker can assign resources to a container: put the process running within the container in a cgroup and set up the resource constraints for it.
Beyond controlling scarce resources like CPU and memory, a container also needs limits on what a process can see. Namespaces are the next piece of the puzzle and we move to them next.
Putting a process within a namespace is a means to define what the process has access to. A namespace is the boundary, whereas cgroups govern how much of a resource the process can consume. Strictly speaking, a process is placed within a namespace via the clone, unshare or setns system calls rather than via cgroups, but container software uses the two together.
Also in 2.6.24 came the core of network namespaces. This and future patchsets enable processes to be presented with their own view of the network stack, covering network functions such as interfaces, routing tables and so on.
The Wikipedia article on kernel namespaces has a list of the current resources that can be isolated using namespaces. We can form a basic view of how containers are run by docker (and any other container management software) using just a couple of these:
Mount (mnt)
Network (net)
Things like the PID and User namespaces provide extra isolation, but I’m not going to cover them here.
I confess here I’m making some guesses as to what’s going on, but the mental model has served me okay so I’ll reproduce it here. Broadly I consider these two namespaces to be the basis of docker’s ability to run what amounts to “pre-packaged” software.
Mount namespaces define what the filesystem looks like to the process running within the namespace. So different processes can see entirely different views of the filesystem.
My general assumption here is that docker is using MNT namespaces to provide the
running container with a unique view of the filesystem, both its own “root
image” that we’ll talk about later and the parts of the host filesystem mounted
into the running container using the -v flag.
As NET namespaces provide processes with a custom view of the network stack and
provide ways for processes in different namespaces to poke holes to each other
via the network, I assume this is the basis for docker’s
bridge network type
which sets up a private network between processes running in containers. When
one runs a container with the
host network type, my basic layman’s assumption
is that the container’s process is not placed within its own network namespace
(or it lives within the default namespace).
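One way to poke at this model: on Linux, the namespaces a process belongs to show up as symlinks under /proc/<pid>/ns, and two processes share a namespace when their links have the same target. The sketch below prints the mount and network namespaces of the current shell, and falls back to a placeholder on systems without that directory.

```shell
#!/bin/sh
# Report the mount and network namespaces of this shell, if visible.
ns_report=$(
  for ns in mnt net; do
    link="/proc/$$/ns/$ns"
    if [ -e "$link" ]; then
      # The link target looks like "net:[4026531992]"; the number
      # identifies the namespace.
      printf '%s -> %s\n' "$ns" "$(readlink "$link")"
    else
      printf '%s -> unavailable\n' "$ns"
    fi
  done
)
echo "$ns_report"
```

If the guesses above are right, running the same loop inside a container on the bridge network should show a different net value from the host, while a container using the host network type should show the same one. I haven't included that output here.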
A union filesystem essentially takes several filesystem images and layers them on top of each other. Images “above” override values from images “below”. For any file read, the union filesystem traverses the image stack from top to bottom and returns the file content from the first image containing the file. For writes, either the write just fails (for a read-only union filesystem) or the write goes to the top-most layer. Often this top-most layer is initially an empty image created specifically for the writes of files to the mounted union filesystem.
An important point to note is that two or more mounted union filesystems can share images, meaning that two union filesystems could have, say, the first five images in their respective stacks shared but each with different images stacked on top to provide very different resultant filesystems.
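The read and write rules can be imitated with plain directories. This is only a toy model of a union filesystem, not how any real driver is implemented: each directory stands in for one image, and the layer names (base, app, rw) and file contents are made up.

```shell
#!/bin/sh
# Three directories stand in for three stacked images.
work=$(mktemp -d)
mkdir -p "$work/base/etc" "$work/app/etc" "$work/rw"
echo "base config" > "$work/base/etc/app.conf"
echo "base motd"   > "$work/base/etc/motd"
echo "app config"  > "$work/app/etc/app.conf"

# Layers listed from top (writable) to bottom (base).
# Assumes the scratch path contains no spaces.
LAYERS="$work/rw $work/app $work/base"

# Read: walk the stack top to bottom and return the first copy found.
union_read() {
  for layer in $LAYERS; do
    if [ -f "$layer/$1" ]; then
      cat "$layer/$1"
      return 0
    fi
  done
  echo "union_read: $1: no such file" >&2
  return 1
}

# Write: always goes to the top-most layer, leaving lower layers untouched.
union_write() {
  top=${LAYERS%% *}
  mkdir -p "$(dirname "$top/$1")"
  printf '%s\n' "$2" > "$top/$1"
}

union_read etc/app.conf            # the app layer overrides the base layer
union_read etc/motd                # falls through to the base layer
union_write etc/motd "container motd"
union_read etc/motd                # now served from the writable layer
cat "$work/base/etc/motd"          # the base layer's copy is unchanged
```

A second stack could list the same base directory under a different top layer, which is the sharing described above.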
When running a docker container, one specifies an image to “run” via a command like:
docker run couchdb
The couchdb part of this command specifies the image to download. I find the
naming gets a bit confusing here, because essentially the “image” is actually a
pointer to the top image of a stack of images which together form the union
filesystem that ends up being the root filesystem of the running container.
While the above command reads “run the couchdb container”, it’s really more like “create and start a new container using the couchdb image as the base image for the container”. In fact, the docker run documentation describes this as:
The docker run command must specify an IMAGE to derive the container from.
The word derive is key here. By my understanding, docker adds a further image to the top of the stack which is where writes that the running container makes are written to. This image is saved under the name of the newly started container, and persists after the container is stopped under the container’s name. This is what allows docker to essentially stop and start containers while maintaining the files changed within the container – behind the scenes it’s managing an image used as the top image on the container’s union filesystem.
Docker calls the images that stack up to form the union filesystem “layers” and
the ordered collection of layers an “image”. This concept is key to how
dockerfiles work – at a first approximation, each line in a dockerfile adds a
new image to the image stack used as the base image for a container. So, for
example, a command that runs
apt to install software creates a new image
containing the changes to the filesystem made while installing the software.
It’s also obvious how this allows for dockerfile FROM lines to work – it just points back to the image at the top of the stack and then further dockerfile commands layer more images onto that stack to form a new docker image.
In addition, the fact that a union filesystem is able to share images at the lower levels means that docker is able to share lower level base images across many containers but only ever have a single copy on disk. These base images are read-only and so can be safely used in many containers’ union filesystem image stacks.
So basically what docker does when we use the docker run command is:
This is where the “GitHub for containers” thing comes from. A docker daemon manages a local collection of union filesystem images on your machine called a repository – which contains all the base images and other images used to form the union filesystems for containers on the system (including the top-of-the-stack writable images containers use).
But Docker also manages a large central collection of images which can be used
as base images for either direct running via the
docker run command or used as
the start point for other docker images using the
FROM dockerfile command.
When used, the docker daemon downloads the image from the remote repository and
uses it to derive a new container.
There’s some naming stuff here that I never quite got my head around, in that
the couchdb bit in the docker run command is actually a repository itself,
making Docker Hub more of a collection of repositories. The actual image used by
the docker tool on your computer is chosen by the “tag” you select, and each
repository has a set of tags defining the images you can use. There’s a default
tag specified for each image repository, which is used when the docker run
command just specifies a repository name and misses out the tag.
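The repository-plus-tag naming can be illustrated with a small helper that fills in the default tag when one is missing. The function name with_default_tag is hypothetical and this is not docker's actual reference parser; it relies only on the convention that docker's default tag is latest.

```shell
#!/bin/sh
# Append ":latest" to an image reference that has no tag or digest.
with_default_tag() {
  ref=$1
  # Look only at the last path component, so that a registry port such as
  # "localhost:5000/couchdb" is not mistaken for a tag.
  name=${ref##*/}
  case "$name" in
    *:*|*@*) printf '%s\n' "$ref" ;;         # already has a tag or digest
    *)       printf '%s:latest\n' "$ref" ;;  # fall back to the default tag
  esac
}

with_default_tag couchdb
with_default_tag couchdb:2.3
with_default_tag localhost:5000/couchdb
```

So the couchdb in docker run couchdb names a repository, and the tag, given or defaulted, picks the image within it.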
So I guess this use of Docker Hub as a collection of repos, which contain tagged images, can be mapped imperfectly to the way GitHub works, GitHub being a collection of Git repos. However, the terminology match is far from exact, which definitely caused me problems when trying to understand docker.
A key thing I found confusing is that because a repository is really just a collection of arbitrary images, your local repository can – and almost certainly does! – contain the base images for lots of different pieces of software as well as many, many intermediate layers, whereas the repos on Docker Hub typically contain several tagged versions of a single piece of software. A Docker Hub repo could presumably therefore also contain many disparate pieces of software, but convention dictates that is not what happens, at least in public repos.
Thinking of docker merely as the repository concept misses out a lot of useful context for what docker means for your machines – the level of access required to create and maintain the cgroups and namespaces is high, and hopefully it’s a bit clearer why the docker daemon requires it from this post.
The v2 interface for cgroups provides for delegation of a portion of the cgroups hierarchy to a non-privileged process, which on a first read suggests a route to a less privileged docker daemon, or perhaps it’s already possible to use this. We’ve reached the boundaries of my knowledge and mental model here, so it’s time to stop for now.
As noted at the beginning, this post is a synthesis of my current understanding of how containers and docker work. On writing it down, my model does seem coherent and logical, but there is quite a bit of guesswork going on. I’d therefore be very pleased to receive corrections to my descriptions and explanations.
Cloudant’s (and CouchDB’s) replication feature allows you to keep databases in sync across countries and continents. However, sometimes it’s not obvious how to use this basic pair-wise feature in order to create more complicated replication topologies, like three or more geographical replicas, and then how to do disaster recovery between them. Let’s discuss these in turn.
Throughout the following, it’s important to remember that replication is an asynchronous, best-effort process in which a change is propagated to peers sometime after the client receives the response to its write request. This means that longer replication chains don’t directly affect document write latency, but also that discrepancies between peers will exist for some small period of time (typically low single digit seconds maximum) after a write to one peer completes.
Firstly it’s important to understand that Cloudant’s replication creates synchronised copies of databases between peers once a two-way replication is set up. Each replication flows changes only in one direction, which means that a two-way replication involves setting up two separate replications, one in each direction. Visualising this as a directed graph works well. Each node in the graph is a Cloudant database and the arrows are directed edges showing which way changes are flowing.
Using this style, a basic two-peer setup between databases A and B looks like this:
There are two arrows because to have synchronised peers a replication needs to be set up in each direction. After a very short time – milliseconds to seconds – peer A will know about any changes to B, and, similarly, B will know about any changes to A.
These peers can then further replicate with other peers to support more complicated scenarios than a single pair of peers. The key point is that, by setting up replications, one is setting up a graph that changes traverse to get from one peer to another. In order for a change to get from peer A to peer Z, at least one directed path must exist from A to Z. This is the foundation piece to creating more complicated topologies because either A or B can now replicate the changes elsewhere.
So A can propagate the changes from B to another database, C. In this scenario, A is the primary database and the others could be considered replicas.
This obviously introduces a single point of failure. The replication process is smart enough that you can add a replication from B to C to this set, which means that A is no longer a single point of failure. In graph terms, it’s safe to set up a cyclic graph.
As more database replicas are added to the set, however, having a fully connected mesh starts to add undue load to each database, and it’s not necessary as each peer is able to act as a “stepping stone” to push changes through the network.
Here, a change at peer B is replicated to E via A then C:
In the diagram I only have one stepping stone in each path for simplicity of diagramming, but one could add redundant steps to ensure at least two paths through the network for any given change.
Finally, it’s worth revisiting the point that all we require for synchronisation is that there is a directed path for a change to follow from one peer to another. This means that two-way replications between peers are not strictly required. Instead, one alternative is to set up a circle topology:
Here, there is still a path for changes to follow between any pair of nodes. Again, setting up redundant links to provide two paths may be useful. Using one-way replications in this way further allows you to decrease the load upon each peer database while still maintaining acceptable replication latency.
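A circle of one-way replications is easy to generate mechanically. This sketch prints one replication document per edge of the ring; the peer URLs are made up, and the source, target and continuous fields follow the shape of CouchDB replication documents. Nothing is sent anywhere: each line would still need creating in a _replicator database with real credentials.

```shell
#!/bin/sh
# Print one replication document per directed edge of a ring:
# peer1 -> peer2 -> ... -> peerN -> peer1.
ring_replications() {
  first=$1
  prev=$1
  shift
  for peer in "$@"; do
    printf '{"source": "%s", "target": "%s", "continuous": true}\n' "$prev" "$peer"
    prev=$peer
  done
  # Close the circle so changes at the last peer reach the first.
  printf '{"source": "%s", "target": "%s", "continuous": true}\n' "$prev" "$first"
}

ring_replications \
  https://a.example.com/db \
  https://b.example.com/db \
  https://c.example.com/db
```

With N peers this sets up N replications, compared with N×(N-1) for a fully connected two-way mesh.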
After setting up the synchronised peers in whatever topology works for your needs, you’re ready to set up failover between the database replicas.
The important takeaway point in this section is that, while you might be tempted to manage failover elsewhere, the only way to reliably failover is from within the application itself. The reason for this is simple: the application is the only place you can be sure whether a given replica is contactable from the application.
An application may be unable to contact a database peer for several reasons, such as:
The last condition is an example of how the application can use its own measurements to ensure failover happens before users become aware of the problem and how a failover strategy can benefit from being application performance-indicator aware.
The only thing you care about is whether the application can reach the database; not whether, for example, the third-party health-checking service you might use can contact it, or your DNS provider.
The basic steps are:
This approach is simple to understand, not too complicated to implement and gives your application the best chance of surviving the largest number of possible failure modes.
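A minimal sketch of application-side failover, under my own assumptions about what the check looks like: the application holds an ordered list of peers, probes each from its own host with a short timeout, and uses the first one that answers. The function names probe and pick_peer and the two-second timeout are hypothetical; a real application would probably also feed its own latency measurements into the decision.

```shell
#!/bin/sh
# Probe a peer from the application's own host. Succeeds only if the
# peer answers with a non-error HTTP status within two seconds.
probe() {
  curl --silent --fail --max-time 2 --output /dev/null "$1"
}

# Print the first peer that passes the probe, in preference order.
# Returns non-zero if no peer is reachable.
pick_peer() {
  for peer in "$@"; do
    if probe "$peer"; then
      printf '%s\n' "$peer"
      return 0
    fi
  done
  return 1
}

# Example use (makes real HTTP requests, so left commented out):
#   active=$(pick_peer https://a.example.com/db https://b.example.com/db) \
#     || echo "no database peer reachable" >&2
```

Because the probe runs inside the application, it tests the path that matters, the one from the application to the database, rather than the view from a third-party health checker.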
I was browsing Cloudant questions on Stack Overflow last night and came across a question about how to securely access Cloudant directly from a browser. Summarising:
How do I securely pass my database credentials to the browser without the user being able to see them?
Over my time at Cloudant I’ve been asked variants of this question many times. It makes me think that Cloudant and CouchDB’s HTTP interface is a bit of a siren’s call, luring unwary travellers onto security rocks.
Let’s cut to the chase: sending credentials to an untrusted device means you have to assume those credentials are compromised. By compromised I mean that in any reasonable security model you have to assume the credentials are public and anyone can use them. Your data is available to anyone.
The above question also misses the point that it’s not just the user themselves that must be unable to see the credentials, but everything else:
Everything that ever sees the credentials is able to leak them to the world. The actual person using the application is just one part of that (albeit one that’s easier to phish). The only way to prevent this is to never allow the credentials to leave an environment controlled by you, the web application developer.
Over time, I’ve come to the conclusion that both Cloudant and CouchDB are best suited to being used like any other database, using a three-tier architecture. No one would consider having a browser connect to a Postgres database or SQL Server – hopefully because it seems weird rather than just because it’d be difficult. Cloudant’s HTTP interface makes it simple to connect from a browser, which can be misleading.
The HTTP interface was originally intended to enable CouchApps, and can work well if you can live within the very tight constraints required in order to do this securely. Those constraints make it impossible to express many, if not most, applications as CouchApps. In addition, web application frameworks do a lot of work on your behalf to make your applications secure from a wide variety of attacks. CouchDB provides little help in this regard, which makes creating a secure application practically difficult even if it’s theoretically possible.
For most applications, therefore, the temptation to directly connect from the browser should be avoided. It’s only suitable for a pretty small subset of applications and then is hard to do securely when compared with a traditional architecture with an application in front of the database.
This of course isn’t to say CouchDB’s HTTP interface isn’t secure. It’s very secure when accessed over HTTPS; at least as much as any other database. Possibly more so, given the wide real world testing of the security properties of HTTPS. However, the security of a protocol is only useful when its secrets are protected, and sending credentials to any device you don’t control is almost certain to undermine this.