Using sed to extract HTTP headers

Today I needed to take a HTTP request and extract the etag header; the etag was used as part of an MVCC implementation in a service I was using and I wanted to script an update to a resource. I was doing this in a Makefile so wanted to do this without firing up a scripting language.

It turns out this is the domain of tools like sed. sed stands for stream editor. It applies scripts to text streams which edit the content of the stream. When you watch someone using sed, the scripts look super-cryptic, but in fact they’re not too bad. Like a regular expression, they benefit from reading left to right; when viewed as a whole they are just a mess. In fact, half of a sed script is often a regular expression!

The sample headers

First, we’ll get the HTTP headers to work with. I found a new curl option, -D <filename> that will do this for you. So to get the headers for dx13.co.uk:

curl -D headers.txt https://dx13.co.uk

There’s quite a lot of headers that come with a call to dx13.co.uk, so I trimmed most of them from the end to leave something a bit shorter to work with, which doesn’t affect the sed commands at all. I left us with:

> cat headers.txt
HTTP/2 200
server: GitHub.com
content-type: text/html; charset=utf-8
last-modified: Tue, 06 Nov 2018 15:58:30 GMT
etag: "5be1ba26-a9dd"
access-control-allow-origin: *
expires: Fri, 22 Mar 2019 14:03:49 GMT
cache-control: max-age=600
x-github-request-id: 6F9E:2F59:86E637:B2E922:5C94E8ED

A sed primer

We’ll come to executing scripts in a minute. First, we’ll get familiar with what a script looks like. The basic form is:

[addr]X[options]
  • addr selects a set of lines to operate on. It can be a single line, a line range or a regular expression.
    • A single line is just the line number, 12.
    • A regex is delimited using backslashes, /regex/.
    • A range is comma-separated, 12,16.
    • Matching can be inverted using ! at the end of the address.
    • If there is no addr, command is executed on all file lines.
    • The documentation for addresses.
  • X is a command (like d or s).
  • options are options to the command.
    • s has the option /foo/bar/.

So in:

  • '14d': the range is line 14; and then d removes the line; no options are used. This removes line 14 of the input.
  • '/:/d': the range is the regex :; and then d removes the lines; no options are used. This will remove lines containing : from the input.
  • 's/^.*: /foo! /': the range is all lines; the command is s; the option is the find/replace specification. We’ll see what this does later.

I found the s command familiar – it’s just like vim’s.

Using sed to get the etag

By default, sed applies its first argument as a script and second as the input file, and outputs to stdout.

Substitution

A simple script is a vim-like search and replace. Here, we replace the header names with foo!:

> sed 's/^.*: /foo! /' headers.txt
HTTP/2 200
foo! GitHub.com
foo! text/html; charset=utf-8
foo! Tue, 06 Nov 2018 15:58:30 GMT
foo! "5be1ba26-a9dd"
foo! *
foo! Fri, 22 Mar 2019 14:03:49 GMT
foo! max-age=600
foo! 6F9E:2F59:86E637:B2E922:5C94E8ED

As we head straight to the s command and don’t specify an address, the command is executed on all lines of the file.

Chaining

By using the -e flag, multiple scripts can be chained. You can also use one big script string with semi-colons, but I find multiple -e flags easier to read.

Replace header names with foo! as above, then replace foo with bar:

> sed -e 's/^.*: /foo! /' -e 's/foo/bar/' headers.txt
HTTP/2 200
bar! GitHub.com
bar! text/html; charset=utf-8
bar! Tue, 06 Nov 2018 15:58:30 GMT
bar! "5be1ba26-a9dd"
bar! *
bar! Fri, 22 Mar 2019 14:03:49 GMT
bar! max-age=600
bar! 6F9E:2F59:86E637:B2E922:5C94E8ED

Removing lines

As mentioned in the primer, removing lines is done using a command within the script, d. !d is used to invert the behaviour.

Remove all the lines containing a colon:

> sed '/:/d' headers.txt
HTTP/2 200

Note that we use the address /:/ which is a regex that matches all lines with a colon. The rest of the script executes on these lines.

Remove all the lines without a colon:

> sed '/:/!d' headers.txt
server: GitHub.com
content-type: text/html; charset=utf-8
last-modified: Tue, 06 Nov 2018 15:58:30 GMT
etag: "5be1ba26-a9dd"
access-control-allow-origin: *
expires: Fri, 22 Mar 2019 14:03:49 GMT
cache-control: max-age=600
x-github-request-id: 6F9E:2F59:86E637:B2E922:5C94E8ED

Here we use /:/! as the address – this causes the command to be executed on the lines that don’t match the regex.

Getting the etag

Finally we’re ready!

Combining the above, we can retrieve the ETag header using a chain of three scripts:

> sed -e '/etag/!d' -e 's/^etag: //' -e 's/"//g' headers.txt
5be1ba26-a9dd

That is:

  1. Remove the lines not containing etag.
    • This passes just one line to the next script: etag: "5be1ba26-a9dd"
  2. Remove the header name from the remaining line.
    • This leaves: "5be1ba26-a9dd"
  3. Remove the quotes. The g in s/"//g means global; leaving it out means that sed would replace only the first instance of " that it found. Making the replacement global means that all instances on the line are replaced.
    • Giving us: 5be1ba26-a9dd

In the end, it feels like a bit of an anti-climax. However, it’s now much clearer to me where I’d try to make use of sed, and I feel I’ve learned enough to be dangerous!

References:

Querying Cloudant: what are stale, update and stable?

tl;dr If you are using stale=ok in queries to Cloudant or CouchDB 2.x, you most likely want to be using update=false instead. If you are using stale=update_after, use update=lazy instead.

This question has come up a few times, so here’s a reference to what the situation is with these parameters to query requests in Cloudant and CouchDB 2.x.

CouchDB originally used stale=ok on the query string to specify that you were okay with receiving out-of-date results. By default, CouchDB lazily updates indexes upon querying them rather than when JSON data is changed or added. If up to date results are not strictly required, using stale=ok provides a latency improvement for queries as the request does not have to wait for indexes to be updated before returning results. This is particularly useful for databases with a high write rate.

As an aside, Cloudant automatically enqueues indexes for update when primary data changes, so this problem isn’t so acute. However, in the face of high update rate bursts, it’s still possible for indexing to fall behind so a delay may occur.

When using a single node, as in CouchDB 1.x, this parameter behaved as you’d expect. However, when clustering was added to CouchDB, a second meaning was added to stale=ok: also use the same set of shard replicas to retrieve the results.

Recall that Cloudant and CouchDB 2.x stores three copies of each shard and by default will use the shard replica that starts returning results fastest for a query request. This latter fact helps even out load across the cluster. Heavily loaded nodes will likely return slower and so won’t be picked to respond to a given query. When using stale=ok, the database will instead always use the same shard replicas for every request to that index. The use of the same replica to answer queries has two effects:

  1. Using stale=ok could drive load unevenly across the nodes in your database cluster because certain shard replicas would always be used for the queries to the index that specify stale=ok. This means a set of nodes could receive outside numbers of requests.
  2. If one of the replicas was hosted on a heavily loaded node in the cluster, this would slow down all queries to that index using stale=ok. This is compounded by the tendency of stale=ok to drive imbalanced load.

The end result is that using stale=ok can, counter-intuitively, cause queries to become slower. Worse, they may become unavailable during cluster split-brain scenarios because of the forced use of a certain set of replicas. Given that mostly people use stale=ok to improve performance, this wasn’t a great state to be in.

As stale=ok’s existing behaviour needed to be maintained for backwards compatibility, the fix for this problem was to introduce two new query string parameters were introduced which set each of the two stale=ok behaviours independently:

  1. update=true/false/lazy: controls whether the index should be up to date before the query is executed.
    1. true: the index will be updated first.
    2. false: the index will not be updated.
    3. lazy: the index will not be updated before the query, but enqueued for update after the query is completed.
  2. stable=true/false: controls the use of the certain shard replicas.

The main use of stable=true is that queries are more likely to appear to “go forward in time” because each shard replica may update its indexes in different orders. However, this isn’t guaranteed, so the availability and performance trade offs are likely not worth it.

The end result is that virtually all applications using stale=ok should move to instead use update=false.

What is docker?

When I first came across docker a few years ago, probably late 2014, so a year after it was introduced at PyCon during 2013, I found it a confusing concept. “Like GitHub, but for containers” was a phrase that I recall from that period, which I think ended up causing a lot of my confusion – I conflated Docker Hub with docker the tool.

Since then, I’ve learned more about docker, particularly in the last year. I think that things started to click around a year ago, and over the past few months as I’ve looked further into Kubernetes and written my own pieces of software destined for container deployment I’ve formed my own mental model of where docker fits into my world. This post is about my writing that down to understand its coherency.

I tend towards understanding systems like this bottom-up, so let’s start at the beginning, which is also conveniently the bottom.

cgroups

Cgroups, or control groups to give them their full name, were introduced into the mainline Linux kernel in 2.6.24, released in January

  1. What cgroups allow is for processes running on a system to be hierarchically grouped in such a way that various controls and boundaries can be applied to a process hierarchy.

Cgroups are a necessary but not sufficient part of a container solution, and they are also used for lots of things other than containers. Systemd, for example, uses cgroups when defining resource limits on the processes it manages.

Like many things within the Linux kernel, cgroups are exposed within the file hierarchy. A system administrator writes and reads from files within the mounted cgroups filesystem to define cgroups and their properties. A process is added to a cgroup by writing its PID to a file within the cgroups hierarchy; the process is automatically removed from its previous cgroup.

Overall, cgroups provide docker with a simple(ish) way to control the resources a process hierarchy uses (like CPU) and has access to (like networks and part of the filesystem).

Cgroups provide control of various resources, but the main ones to consider for docker containers are:

  • CPU controller – using cpu shares, CPU time can be divided up between processes to ensure a process gets a share of CPU time to run in.
  • Memory controller – a process can be given its own chunk of memory which has a hard limit on its size.

From this, it’s relatively easy to see how docker can assign resources to a container – put the process running within in the container in a cgroup and set up the resource constraints for it.

Beyond controlling scarce resources like CPU and memory, cgroups provide a way to assign a namespace to a process. Namespaces are the next piece of the puzzle and we move to them next.

Kernel namespaces

Putting a process within a namespace is a means to define what the process has access to. A namespace is the boundary, whereas cgroups is the control plane that puts a process within the namespace’s boundary.

Also in 2.6.24 came the core of network namespaces. This and future patchsets enable processes to be presented with their own view of the network stack, covering network functions such as interfaces, routing tables and so on.

The Wikipedia article on kernel namespaces has a list of the current resources that can be isolated using namespaces. We can form a basic view of how containers are run by docker (and any other container management software) using just a couple of these:

  1. The Mount (MNT) namespace
  2. The Network (NET) namespace

Things like the PID and User namespaces provide extra isolation, but I’m not going to cover them here.

I confess here I’m making some guesses as to what’s going on, but the mental model has served me okay so I’ll reproduce it here. Broadly I consider these two namespaces to be the basis of docker’s ability to run what amounts to “pre-packaged” software.

Mount

Mount namespaces define what the filesystem looks like to the process running within the namespace. So different processes can see entirely different views of the filesystem.

My general assumption here is that docker is using MNT namespaces to provide the running container with a unique view of the filesystem, both its own “root image” that we’ll talk about later and the parts of the host filesystem mounted into the running container using the --mount option.

Network

As NET namespaces provide processes with a custom view of the network stack and provide ways for processes in different namespaces to poke holes to each other via the network, I assume this is the basis for docker’s bridge network type which sets up a private network between processes running in containers. When one runs a container with the host network type, my basic layman’s assumption is that the container’s process is not placed within its own network namespace (or it lives within the default namespace).

Union filesystems

A union filesystem essentially takes several filesystem images and layers them on top of each other. Images “above” override values from images “below”. For any file read, the union filesystem traverses the image stack from top to bottom and returns the file content from the first image containing the file. For writes, either the write just fails (for a read-only union filesystem) or the write goes to the top-most layer. Often this top-most layer is initially an empty image created specifically for the writes of files to the mounted union filesystem.

An important point to note is that two or more mounted union I filesystems can share images, meaning that two union filesystems could have, say, the first five images in their respective stacks shared but each with different images stacked on top to provide very different resultant filesystems.

Docker images and layers

When running a docker container, one specifies an image to “run” via a command like:

docker run couchdb

The couchdb part of this command specifies the image to download. I find the naming gets a bit confusing here, because essentially the “image” is actually a pointer to the top image of a stack of images which together form the union filesystem that ends up being the root filesystem of the running container.

While the above command reads “run the couchdb container”, it’s really more like “create and start a new container using the couchdb image as the base image for the container”. In fact, the docker run documentation describes this as:

The docker run command must specify an IMAGE to derive the container from.

The word derive is key here. By my understanding, docker adds a further image to the top of the stack which is where writes that the running container makes are written to. This image is saved under the name of the newly started container, and persists after the container is stopped under the container’s name. This is what allows docker to essentially stop and start containers while maintaining the files changed within the container – behind the scenes it’s managing an image used as the top image on the container’s union filesystem.

Docker calls the images that stack up to form the union filesystem “layers” and the ordered collection of layers an “image”. This concept is key to how dockerfiles work – at a first approximation, each line in a docker file adds a new image to the image stack used as the base image for a container. So, for example, a command that runs apt to install software creates a new image containing the changes to the filesystem made while installing the software.

It’s also obvious how this allows for dockerfile FROM lines to work – it just points back to the image at the top of the stack and then further dockerfile commands layer more images onto that stack to form a new docker image.

In addition, the fact that a union filesystem is able to share images at the lower levels means that docker is able to share lower level base images across many containers but only ever have a single copy on disk. These base images are read-only and so can be safely used in many containers’ union filesystem image stacks.

Putting it together

So basically what docker does when we use the docker run command is:

  1. Download the base image to derive the container from.
  2. Create a union filesystem consisting of the layers in the base image and a new layer at the top of the stack for the container to write its own files to.
  3. Set up a network namespace for the container.
  4. Set up a cgroup that has appropriate mount and network namespaces set up such that the process has the union filesystem mounted as its root filesystem and a private view of the host’s network stack. Mount other volumes into this namespace as specified on the docker run command.
  5. Start up a process within the union filesystem within the cgroup.

Image repositories

This is where the “GitHub for containers” thing comes from. A docker daemon manages a local collection of union filesystem images on your machine called a repository – which contains all the base images and other images used to form the union filesystems for containers on the system (including the top-of-the-stack writable images containers use).

But Docker also manages a large central collection of images which can be used as base images for either direct running via the docker run command or used as the start point for other docker images using the FROM dockerfile command. When used, the docker daemon downloads the image from the remote repository and uses it to derive a new container.

There’s some naming stuff here that I never quite got my head around, in that the couchdb bit in the docker run command is actually a repository itself, making Docker Hub more of a collection of repositories. The actual image used by the docker tool on your computer is chosen by the “tag” you select, and each repository has a set of tags defining the images you can use. There’s a default tag specified for each image repository, which is used when the docker run command just specifies a repository name and misses out the tag.

So I guess this use of Docker Hub as a collection of repos, which contain tagged images, can be mapped imperfectly to the way GitHub works, GitHub being a collection of Git repos. However, the terminology match is far from exact, which definitely caused me problems when trying to understand docker.

A key thing I found confusing is that because a repository is really just a collection of arbitrary images, your local repository can – and almost certainly does! – contain the base images for lots of different pieces of software as well as many, many intermediate layers, whereas the repos on Docker Hub typically contain several tagged versions of a single piece of software. A Docker Hub repo could presumably therefore also contain many disparate pieces of software, but convention dictates that is not what happens, at least in public repos.

Summary

Thinking of docker merely as the repository concept misses out a lot of useful context for what docker means for your machines – the level of access required to create and maintain the cgroups and namespaces is high, and hopefully it’s a bit clearer why the docker daemon requires it from this post.

The v2 interface for cgroups provides for delegation of a portion of the cgroups heirarchy to a non-privileged process, which at a first scan read suggests a route to a less privileged docker daemon, or perhaps it’s already possible to use this. We’ve reached the boundaries of my knowledge and mental model here, so it’s time to stop for now.

As noted at the beginning, this post is a synthesis of my current understanding of how containers and docker work. While on writing it down, my model does seem coherent and logical, but there is quite a bit of guesswork going on. I’d therefore be very pleased to receive corrections to my descriptions and explanations.

Cloudant replication topologies and failover

Cloudant’s (and CouchDB’s) replication feature allows you to keep databases in sync across countries and continents. However, sometimes it’s not obvious how to use this basic pair-wise feature in order to create more complicated replication topologies, like three or more geographical replicas, and then how to do disaster recovery between them. Let’s discuss these in turn.

Throughout the following, it’s important to remember that replication is an asynchronous, best-effort process in which a change is propagated to peers sometime after the client receives the response to its write request. This means that longer replication chains don’t directly affect document write latency, but also that discrepancies between peers will exist for some small period of time (typically low single digit seconds maximum) after a write to one peer completes.

More complicated topologies

Firstly it’s important to understand that Cloudant’s replication creates synchronised copies of databases between peers once a two-way replication is set up. Each replication flows changes only in one direction, which means that a two-way replication involves setting up two separate replications, one in each direction. Visualising this as a directed graph works well. Each node in the graph is a Cloudant database and the arrows are directed edges showing which way changes are flowing.

Using this style, a basic two-peer setup between databases A and B looks like this:

A basic two peer replication

There are two arrows because to have synchronised peers a replication needs to be set up in each direction. After a very short time – milliseconds to seconds – peer A will know about any changes to B, and, similarly, B will know about any changes to A.

These peers can then further replicate with other peers to support more complicated scenarios than a single pair of peers. The key point is that, by setting up replications, one is setting up a graph that changes traverse to get from one peer to another. In order for a change to get from peer A to peer Z, at least one directed link must exist between A and Z. This is the foundation piece to creating more complicated topologies because either A or B can now replicate the changes elsewhere.

So A can propagate the changes from B to another database, C. In this scenario, A is the primary database and the others could be considered replicas.

A basic three peer replication with primary

This obviously introduces a single point of failure. The replication process is smart enough that you can add a replication from B to C to this set, which means that A is no longer a single point of failure. In graph terms, it’s safe to set up a cyclic graph.

A three peer cyclic replication

As more database replicas are added to the set, however, having a fully connected mesh starts to add undue load to each database, and it’s not necessary as each peer is able to act as a “stepping stone” to push changes through the network.

Here, a change at peer B is replicated to E via A then C:

A more complicated six peer replication

In the diagram I only have one stepping stone in each path for simplicity of diagramming, but one could add redundant steps to ensure at least two paths through the network for any given change.

Using One-Way Replications for Synchronisation between Peers

Finally, it’s worth revisiting the point that all we require for synchronisation is that there is a directed path for a change to follow from one peer to another. This means that two-way replications between peers are not strictly required. Instead, one alternative is to set up a circle topology:

A more complicated circular replication

Here, there is still a path for changes to follow between any pair of nodes. Again, setting up redundant links to provide two paths may be useful. Using one-way replications in this way further allows you to decrease the load upon each peer database while still maintaining acceptable replication latency.

Failing over between databases

After setting up the synchronised peers in whatever topology works for your needs, you’re ready to set up failover between the database replicas.

The important takeaway point in this section is that, while you might be tempted to manage failover elsewhere, the only way to reliably failover is from within the application itself. The reason for this is simple: the application is the only place you can be sure whether a given replica is contactable from the application.

An application may be unable to contact a database peer for several reasons, such as:

  • The database servers are actually offline.
  • The application cannot route to the preferred database servers because of network disruption.
  • There is very high latency suddenly introduced on the network path between the application and database clusters.
  • The application’s DNS cache is out of date so it’s resolving the database location to the wrong IP address.
  • Key database requests made by the application have a latency above a given threshold.

The last condition is an example of how the application can use its own measurements to ensure failover happens before users become aware of the problem and how a failover strategy can benefit from being application performance-indicator aware.

The only thing you care about is whether the application can reach the database; not whether, for example, the third-party health-checking service you might use can contact it, or your DNS provider.

The basic steps are:

  • Configure each application instance with a prioritised list of database peers.
  • Use an approach like the circuit breaker pattern to guide failover.
  • Progress down the prioritised list.
  • Have a background task checking for recovery of higher priority peers, failing back when they are reliably available to the application again.

This approach is simple to understand, not too complicated to implement and gives your application the best chance of surviving the most number of possible failure modes.

Avoiding Cloudant's HTTP Siren's Call

I was browsing Cloudant questions on Stackoverflow last night and came across a question about how to securely access Cloudant from directly from a browser. Summarising:

How do I securely pass my database credentials to the browser without the user being able to see them?

Over my time at Cloudant I’ve been asked variants of this questions many times. It makes me think that Cloudant and CouchDB’s HTTP interface is a bit of a siren’s call, luring unwary travellers onto security rocks.

Let’s cut to the chase: sending credentials to an untrusted device means you have to assume those credentials are compromised. By compromised I mean that in any reasonable security model you have to assume the credentials are public and anyone can use them. Your data is available to anyone.

The above question also misses the point that it’s not just the user themselves that must be unable to see the credentials, but everything else:

  • Their browser may have malicious addons sending passwords to an attacker.
  • Their laptop may have malware, sending passwords to an attacker.
  • Their home network may have other compromised machines or routers, sending pass— you get the point.
  • And so on.

Everything that ever sees the credentials is able to leak them to the world. The actual person using the application is just one part of that (albeit one that’s easier to phish). The only way to prevent this is to never allow the credentials to leave an environment controlled by you, the web application developer.

Over time, I’ve come to the conclusion that both Cloudant and CouchDB are best suited to being used like any other database, using a three-tier architecture. No one would consider having a browser connect to a Postgres database or SQL Server – hopefully because it seems weird rather than just because it’d be difficult. Cloudant’s HTTP interface makes it simple to connect from a browser, which can be misleading.

The HTTP interface was originally intended to enable couchapps, and can work well if you can live within the very tight constraints required in order to do this securely. This makes expressing many to most applications as CouchApps impossible. In addition, web application frameworks do a lot of work on your behalf to make your applications secure from a wide variety of attacks. CouchDB provides little help in this regard, which makes creating a secure application practically difficult even if it’s theoretically possible.

For most applications, therefore, the temptation to directly connect from the browser should be avoided. It’s only suitable for a pretty small subset of applications and then is hard to do securely when compared with a traditional architecture with an application in front of the database.

This of course isn’t to say CouchDB’s HTTP interface isn’t secure. It’s very secure when accessed over HTTPS; at least as much as any other database. Possibly more so, given the wide real world testing of the security properties of HTTPS. However, the security of a protocol is only as useful when its secrets are protected, and sending credentials to any device you don’t control is almost certain to undermine this.