Python Packaging in 2020

For a long time, I’ve kind of existed with a barely-there understanding of Python packaging. Just enough to copy a requirements.txt file from an old project and write a Makefile with pip install -r requirements.txt. A few years ago, I started using pipenv, and again learned just-enough to make it work.

Over the past year, I became frustrated with this situation:

  • pipenv became increasingly hard to work with, thanks to breakage across upgrades and its kitchen-sink approach.
  • I started building docker images for Python applications, and understanding packaging in more detail became essential to build secure and performant images.

Last year (2019), I started to look at tools like poetry, which essentially start the whole process from scratch, including new dependency resolution and package-building code. When figuring out how to use these in Dockerfiles, I realised I needed to understand a bunch more about both packaging and virtual environments. The good news was that this area actually progressed a lot in the 2018–19 time frame. The bad news was that this meant there was a lot to learn, and a lot of the existing advice was out of date.

In the beginning, there was the source distribution

Until 2013, when PEP 427 defined the wheel (.whl) archive format for Python packages, whenever a package was installed via pip install it was always built from source via a distribution format called sdist. For pure-python packages this wasn’t typically much of a problem, but for any package making use of C extensions it meant that the machine where pip install was run needed a compiler toolchain, python development headers and so on.

This situation is more than a little painful. As PEP 427’s rationale states:

Python’s sdist packages are defined by and require the distutils and setuptools build systems, running arbitrary code to build-and-install, and re-compile, code just so it can be installed into a new virtualenv.
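
To make the quote concrete, here’s a minimal sketch of the sort of setup.py an sdist-era package ships. The package name and C extension are hypothetical; the shape is what setuptools expects:

# setup.py -- a minimal setuptools-era package definition (hypothetical package).
from setuptools import setup, Extension

setup(
    name="mypackage",
    version="0.1.0",
    packages=["mypackage"],
    # A C extension like this is why installing from an sdist needs a
    # compiler toolchain and the Python development headers available
    # on the machine running `pip install`.
    ext_modules=[Extension("mypackage._speedups", ["mypackage/_speedups.c"])],
)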

After PEP 427, packages could also be distributed as so-called binary packages or wheels.

When I first started to see python binary packages, I was confused and even somewhat alarmed by the term binary package, because I had never looked in depth into python packaging and was quite used to source distributions by 2013. But in general they are a big win:

  • For pure python packages the term is a slight misnomer: the wheel format is really just about how the files are laid out inside an archive. Typically these packages ship a single .whl per python version they support, named like Flask-1.1.1-py2.py3-none-any.whl, where none and any specify the python ABI (relevant for C extensions) and the target platform respectively. As pure python packages have no C extensions, they have no target ABI or platform, but they will often have a python version requirement – though this example supports both python 2 and 3.
    • The tags, such as none, in filenames are defined in PEP 425.
  • For packages including C extensions which are linked to the Python C runtime during compilation, the name does make sense because the build process pre-compiles the extension into a binary, unlike in the sdist world where C extensions were compiled during package installation. This results in several different .whl files, as a separate .whl file must be created for each target system and python version. For example, cryptography-2.8-cp34-abi3-manylinux2010_x86_64.whl is a package with binaries built against CPython 3.4 at ABI level 3, for a Linux (manylinux2010) machine with an x86_64 processor. The sketch after this list pulls these tags apart.
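
Here’s a rough sketch of how those filename tags break apart. The real rules live in PEPs 425 and 427 (and in the packaging library); this simplified split ignores build tags and compressed tag sets:

# A simplified illustration of the tag fields in a wheel filename.
def describe_wheel(filename):
    stem = filename[:-len(".whl")]
    name, version, python_tag, abi_tag, platform_tag = stem.split("-")
    return {
        "name": name,
        "version": version,
        "python": python_tag,      # e.g. py2.py3 or cp34
        "abi": abi_tag,            # e.g. none or abi3
        "platform": platform_tag,  # e.g. any or manylinux2010_x86_64
    }

print(describe_wheel("Flask-1.1.1-py2.py3-none-any.whl"))
print(describe_wheel("cryptography-2.8-cp34-abi3-manylinux2010_x86_64.whl"))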

In the end, wheels provide a much simpler and more reliable install experience, as users are no longer forced to compile packages themselves, with all the tooling and security concerns inherent in that approach.

Stepping back to how wheels are built

Wheels soon started taking over the python packaging ecosystem, though there are still hold-outs even today that ship source packages rather than binary packages (often for good reasons).

However, all python packages were still defined via setup.py, an opaque standard defined purely by the distutils and setuptools source code. While there was now a binary standard for built packages, in practice there was only one way of building them. pip, for example, hardcoded the calls to setup.py into its pip wheel command, so using other build systems was very difficult, making implementing them a somewhat thankless task. Before poetry, it doesn’t look like anyone much attempted it.

The distutils module was shipped with Python, so it was natural that it came to be the de facto standard, and including a packaging tool was a good decision from the python maintainers. distutils wasn’t that easy to use on its own, however, so setuptools was built as a package to improve that. Over time, setuptools also grew to be somewhat gnarly itself.

Tools like flit were then created to tame this complexity by wrapping distutils and setuptools in another, more opinionated layer. Flit became pretty popular because its workflow is simple and understandable, but in the end it was still using distutils and setuptools under the hood (per this flit source code); the generation of the files used by distutils happens behind the scenes, so far as I can tell (I didn’t actually try flit out, so may have made some errors here).

Poetry and PEPs 517 & 518

In 2018 development of poetry started, at least per the earliest commits in the github repository. Poetry is an ambitious rebuild of python packaging pretty much from scratch. It’s able to resolve dependencies and build wheels without any use of distutils and setuptools. The main problem with poetry is that, to be accepted into development and CI pipelines, it needs to re-implement a lot of functionality already present in tools like pip.

At a similar time, the python community came up with PEPs 517 and 518.

  • PEP 517 (status Provisional, 2015-2018) is about a standard way to specify alternative build backends that pip can use when building wheels – for example, using Poetry or flit’s build engine rather than going directly to distutils. A build backend is a Python module with a standard interface that is used to take a python package source tree and spit out a wheel.
  • PEP 518 (status Provisional, 2016) works in tandem with PEP 517 and specifies a way for a tool like pip to know how to install the build backend specified by PEP 517 when pip is building packages. Specifically, it describes how to create an isolated python environment with just the needed requirements to build the package (that is, the packages to install the build backend, not the package’s dependencies).

Both PEPs 517 and 518 use a new file called pyproject.toml to describe their settings:

[build-system]
# Defined by PEP 518, what the build environment requires:
requires = ["poetry>=0.12"]
# Defined by PEP 517, how to kick off the build:
build-backend = "poetry.masonry.api"
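
For a feel of what build-backend points at: a backend is just an importable module exposing the hook functions named by PEP 517. Here’s a minimal sketch – the hook names come from the PEP, while the module name and bodies are made up for illustration:

# my_backend.py -- a sketch of the PEP 517 hook interface, not a real backend.

def get_requires_for_build_wheel(config_settings=None):
    # Extra build-time requirements beyond pyproject.toml's `requires` list.
    return []

def build_wheel(wheel_directory, config_settings=None, metadata_directory=None):
    # Build a .whl into wheel_directory and return its filename.
    raise NotImplementedError("a real backend builds the wheel here")

def build_sdist(sdist_directory, config_settings=None):
    # Build an sdist tarball into sdist_directory and return its filename.
    raise NotImplementedError("a real backend builds the sdist here")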

Both poetry and flit work with pyproject.toml via its support for namespacing tool-specific settings. An example using poetry:

[tool.poetry]
name = "my-package"
version = "0.1.0"
description = "The description of the package"

[tool.poetry.dependencies]
python = "^3.7"
flask-hookserver = "==1.1.0"
requests = "==2.22.0"

While both PEPs 517 and 518 were started a while ago, it’s only from pip 19.1 (early 2019) that pip started supporting the use of build backends specified via PEP 517.

pip enters “PEP 517 mode” when pip wheel is called if pip finds a pyproject.toml file in the package it is building. When in this mode, pip acts as a build frontend, a term defined by PEP 517 for the application that is used from the command line and is making calls into a build backend, such as poetry. As a build frontend, the job for pip here is to:

  1. Create an isolated python environment.
  2. Install the build backend into this environment via the PEP 518 requirements (requires = ["poetry>=0.12"]).
  3. Get the package ready for building in this environment.
  4. Invoke the build backend, for example poetry, using the entrypoint defined by PEP 517 (build-backend = "poetry.masonry.api") within the created isolated environment.

The build backend then must create a wheel from the source folder or source distribution and put it in the place that pip tells it to.

For me, this seems like big news for projects like poetry that do a lot from scratch and end up with laundry lists of feature requirements before they can be integrated into full development and CI pipelines. If they can instead be integrated into CI via existing tools like pip, then they are much easier to adopt in development for the features that make them attractive there, such as poetry’s virtual environment management. In particular, both flit and poetry will use the information defined in their respective sections of pyproject.toml to build the wheel and requirement wheels just as they would on a developer’s machine (to an extent anyway; my experiments indicate poetry ignores its .lock file when resolving requirements).

In this way, PEPs 517 and 518 close the loop in allowing tools like poetry to concentrate on what they want to concentrate on, rather than needing to build out a whole set of functions before they can be accepted into developers’ toolboxes.

An example Dockerfile shows this in action, for building the myapp package into a wheel along with its dependencies, and then copying the app and dependency wheels into the production image and installing them:

# Stage 1 build to allow pulling from private repos requiring creds
FROM python:3.8.0-buster AS builder
RUN mkdir -p /build/dist /build/myapp
# pyproject.toml has deps for the `myapp` package
COPY pyproject.toml /build
# Our project source code
COPY myapp/*.py /build/myapp/
# This line installs and uses the build backend defined in
# pyproject.toml to build the application wheels from the source
# code we copy in, outputting the app and dependency wheels
# to /build/dist.
RUN pip wheel -w /build/dist /build

# Stage 2 build: copy and install wheels from stage 1 (`builder`).
FROM python:3.8.0-slim-buster as production-image
COPY --from=builder [ "/build/dist/*.whl", "/install/" ]
RUN pip install --no-index /install/*.whl \
    && rm -rf /install
CMD [ "my-package-script" ]

And this is what I now understand about the state of python packaging as we enter 2020. The future looks bright.

Kubernetes by Types

It’s relatively easy to find articles online about the basics of Kubernetes that talk about how Kubernetes looks on your servers. That a Kubernetes cluster consists of master nodes (where Kubernetes book-keeping takes place) and worker nodes (where your applications and some system applications run). And that to run more stuff, you provision more workers, and that each pod looks like its own machine. And so on.

But for me, I found a disconnect between that mental image of relatively clean looking things running on servers and the reams and reams of YAML one must write to seemingly do anything with Kubernetes. Recently, I found the Kubernetes API overview pages. Somehow I’d not really internalised before that the reams of YAML are just compositions of types, like programming in any class-based language.

But they are, because in the end all the YAML you pass into kubectl is just getting kubectl to work with a data model inside the Kubernetes master node somewhere. The types described in the Kubernetes API documentation are the building blocks of that data model, and learning them unlocked a new level of understanding Kubernetes for me.

The data model is built using object composition, and I found a nice way to discover it was to start from a single container object and build out to a running deployment, using the API documentation as much as I could but returning to the prose documentation for examples when I got stuck or, as we’ll see with ConfigMaps, when the API documentation just can’t describe everything you need to know.

Containers

This is our starting point. While the smallest thing that Kubernetes will schedule on a worker is a Pod, the basic entity is the Container, which encapsulates (usually) a single process running on a machine. Looking at the API definition, we can easily see what the allowed values are – for me this was the point where what had previously been seemingly arbitrary YAML fields started to slot together into a type system! Just like other API documentation, suddenly there’s a place where I can see what goes in the YAML rather than copy-pasting things from the Kubernetes prose documentation, tweaking it and then just having to 🤞.

Let’s take a quick look at some fields:

  • The most important thing for a Container is, of course, the image that it will run. From the Container API documentation, we can look through the table of fields within the Container and see that a string is required for this field.
  • The documentation says that a name is also required.
  • Another field that crops up a lot in my copy-pasted YAML is imagePullPolicy. If we look at imagePullPolicy, we can see that it’s also a string, but here the documentation also states the acceptable values: Always, Never and IfNotPresent. If YAML allowed enums, I’m sure this would be an enum. Either way, we can immediately see what the allowed values are – much easier than trying to find this within the prose documentation!
  • Finally, let’s take a look at volumeMounts, which is a little more complicated: it’s a field of a new type rather than a primitive value. The new type is VolumeMount, and the documentation tells us that this field is an array of VolumeMount objects and links us to the appropriate API docs. This was the real moment when I stopped having to use copy-paste and instead was really able to start constructing my YAML – 💪!

The documentation is also super-helpful in telling us where we can put things. Right at the top of the Container API spec, it tells us:

Containers are only ever created within the context of a Pod. This is usually done using a Controller. See Controllers: Deployment, Job, or StatefulSet.

Totally awesome, we now know that we need to put the Container within something else for it to be useful!

So let’s make ourselves a minimal container:

name: haproxy
image: haproxy:2.1.0
imagePullPolicy: IfNotPresent
volumeMounts:
- name: HAProxyConfigVolume  # References a containing PodSpec
  mountPath: /usr/local/etc/haproxy/
  readOnly: true

We can build all this from the API documentation – and it’s easy to avoid the unneeded settings that often come along with copy-pasted examples from random websites on the internet. By reading the documentation for each field, we can also get a much better feel for how this container will behave, making it easier to debug problems later.

Pods

So now we have our Container we need to make a Pod so that Kubernetes can schedule HAProxy onto our nodes. From the Container docs, we have a link direct to the PodSpec documentation. Awesome, we can follow that up to our next building block.

A PodSpec has way more fields than a Container! But we can see that the first one we need to look at is containers which we’re told is an array of Container objects. And hey we have a Container object already, so let’s start our PodSpec with that:

containers:
- name: haproxy
  image: haproxy:2.1.0
  imagePullPolicy: IfNotPresent
  volumeMounts:
  - name: HAProxyConfigVolume  # References a containing PodSpec
    mountPath: /usr/local/etc/haproxy/
    readOnly: true

Now, we also have that VolumeMount object in our HAProxy container that’s expecting a Volume from the PodSpec. So let’s add that. The Volume API spec should help and from the PodSpec docs we can see that a PodSpec has a volumes field which should have an array of Volume objects.

Looking at the Volume spec, we can see that it’s mostly a huge list of the different types of volumes that we can use, each of which links off to yet another type which describes that particular volume. One thing to note is that the name of the Volume object we create needs to match the name of the VolumeMount in the Container object. Kubernetes has a lot of implied coupling like that; it’s just something to get used to.

We’ll use a configMap volume (ConfigMapVolumeSource docs) to mount a HAProxy config. We assume that the ConfigMap contains whatever files HAProxy needs. Here’s the PodSpec with the volumes field:

containers:
- name: haproxy
  image: haproxy:2.1.0
  imagePullPolicy: IfNotPresent
  volumeMounts:
  - mountPath: /usr/local/etc/haproxy/
    name: HAProxyConfigVolume  # This name comes from the PodSpec
    readOnly: true
volumes:
- name: HAProxyConfigVolume
  configMap:
    name: HAProxyConfigMap  # References a ConfigMap in the cluster

So now what we have is a PodSpec object which is composed from an array of Container objects and an array of Volume objects. To Kubernetes, our PodSpec object is a “template” for making Pods out of — we further need to embed this object inside another object which describes how we want to use this template to deploy one or more Pods to our Kubernetes cluster.

Deployments

There are several ways to get our PodSpec template actually made into a running process on the Kubernetes cluster. The ones mentioned all the way back in the Container docs are the most common:

  • Deployment: run a given number of Pod resources, with upgrade semantics and other useful things.
  • Job and CronJob: run a one-time or periodic job that uses the Pod as its executable task.
  • StatefulSet: a special-case thing where Pods get stable identities.

Deployment resources are most common, so we’ll build one of those. As always, we’ll look to the Deployment API spec to help. An interesting thing to note about Deployment resources is that the docs have a new set of options in the sidebar underneath the Deployment heading – links to the API calls in the Kubernetes API that we can use to manage our Deployment objects. Suddenly we’ve found that Kubernetes has an HTTP API we can use rather than kubectl if we want — time for our 🤖 overlords to take over!
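
As a taster, here’s a sketch that lists Deployment resources through that API using the official Python client for Kubernetes – assuming the kubernetes package is installed and ~/.kube/config points at a cluster:

# List Deployment resources through the Kubernetes API rather than kubectl.
from kubernetes import client, config

config.load_kube_config()        # use the same credentials kubectl uses
apps = client.AppsV1Api()        # the apps/v1 API group, as in apiVersion
for deployment in apps.list_namespaced_deployment(namespace="default").items:
    print(deployment.metadata.name, deployment.spec.replicas)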

Anyway, for now let’s keep looking at the API spec for what our Deployments need to look like, whether we choose to pass them to kubectl or to these shiny new API endpoints we just found out about.

Deployment resources are top-level things, meaning that we can create, delete and modify them using the Kubernetes API — up until now we’ve been working with definitions that need to be composed into higher level types to be useful. Top level types all have some standard fields:

  • apiVersion: this allows us to tell Kubernetes what version of the API we are using to manage this Deployment resource; as in any API, different API versions have different fields and behaviours.
  • kind: this specifies the kind of the resource, in this case Deployment.
  • metadata: this field contains lots of standard Kubernetes metadata, and it has a type of its own, ObjectMeta. The key thing we need here is the name field, which is a string.

Specific to a deployment we have just one field to look at:

  • spec: this describes how the Deployment will operate (e.g., how upgrades will be handled) and the Pod objects it will manage.

If we click kubectl example in the API spec, the API docs show a basic Deployment. From this, we can see the values we need to use for apiVersion, kind and metadata to get us started. A first version of our Deployment looks like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: haproxy-load-balancer
spec:
  # TODO

Next we’ll need to look at the DeploymentSpec API docs to see what we need to put into there. From experience, the most common fields here are:

  • template: a PodTemplateSpec which contains a standard metadata field containing ObjectMeta (the same type as at the top level of the Deployment!) and a spec field where we finally find a place to put the PodSpec we made earlier. This field is vital, as without it the Deployment has nothing to run!
  • selector: this field works with the metadata in the template field to tell the Deployment’s controller (the code within Kubernetes that manages Deployment resources) which Pods are related to this Deployment. Typically it references labels within the PodTemplateSpec’s metadata field. The selector documentation talks more about how selectors work; they are used widely within Kubernetes.
  • replicas: optional, but almost all Deployments have this field; it states how many Pods matching the selector should exist at all times. 3 is a common value as it works well for rolling reboots during upgrades.

We can add a basic DeploymentSpec with three replicas that uses the app label to tell the Deployment what Pods it is managing:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: haproxy-load-balancer
spec:
  replicas: 3
  selector:
    matchLabels:
      app: haproxy
  template:
    metadata:
      labels:
        app: haproxy
    spec:
      # PodSpec goes here

Finally, here is the complete Deployment built from scratch using the API documentation. While I think it would be pretty impossible to get here from the API documentation alone, once one has a basic grasp of concepts like “I need a Deployment to get some Pods running”, reading the API docs alongside copy-pasting YAML into kubectl is most likely a really fast way of getting up to speed; I certainly wish I’d dived into the API docs a few months before I did!

apiVersion: apps/v1
kind: Deployment
metadata:
  name: haproxy-load-balancer
spec:
  replicas: 3
  selector:
    matchLabels:
      app: haproxy
  template:
    metadata:
      labels:
        app: haproxy
    spec:
      containers:
      - name: haproxy
        image: haproxy:2.1.0
        imagePullPolicy: IfNotPresent
        volumeMounts:
        - mountPath: /usr/local/etc/haproxy/
          name: HAProxyConfigVolume
          readOnly: true
      volumes:
      - name: HAProxyConfigVolume
        configMap:
          name: HAProxyConfigMap

ConfigMaps

For completeness, let’s get a trivial HAProxy configuration and put it inside a ConfigMap resource so this demonstration is runnable. The API documentation for ConfigMap is less helpful than we’ve seen so far, frankly.

We can see ConfigMap objects can be worked with directly via the API, as they have the standard apiVersion, kind and metadata fields we saw on Deployment objects.

HAProxy configuration is a text file, so we can see that it probably goes in the data field rather than the binaryData field, as data can hold any UTF-8 sequence. We can see that data is an object, but beyond that there isn’t much detail about what should be in that object.

In the end, we need to go and check out the prose documentation on how to use a ConfigMap to understand what to do. Essentially what we find is that the keys used in the data object are used in different ways based on how we are using the ConfigMap. If we choose to mount the ConfigMap into a container — as we do in the PodSpec above — then the keys of the data object become filenames within the mounted filesystem. If, instead, we set up the ConfigMap to be used via environment variables, the keys would become the variable names. So we need to know this extra information before we can figure out what to put in that data field.

The API documentation often requires reading alongside the prose documentation in this manner as many Kubernetes primitives have this use-dependent aspect to them.

So in this case, we add a haproxy.cfg key to the data object, as the HAProxy image we are using by default will look to /usr/local/etc/haproxy/haproxy.cfg for its configuration.

apiVersion: v1
kind: ConfigMap
metadata:
    name: HAProxyConfigMap  # Matches the name in the Volume's configMap field
data:
    haproxy.cfg: |
        defaults
            mode http

        frontend normal
            bind *:80
            default_backend normal

        backend normal
            server app webapp:8081  # Assumes webapp Service

Recall from Just enough YAML that starting an object value with a | character makes all the indented text that comes below into a single string, so this ConfigMap ends up with a haproxy.cfg key whose value is the complete HAProxy configuration.

Summary

So we now have a simple HAProxy deployment in Kubernetes which we’ve mostly been able to build from reading the API documentation rather than blindly copy-pasting YAML from the internet. We — at least I — better understand what’s going on with all the bits of YAML and it’s starting to feel much less arbitrary. I feel now like I might actually stand a chance of writing some code that calls the Kubernetes API rather than relying on YAML and kubectl. And what’s that code called? An operator! I’d heard the name bandied about a lot, but had presumed some black magic was involved — but nope, it’s just about calls that manipulate objects within the Kubernetes API using the types we’ve talked about above, along with about a zillion other ones, including ones you make up yourself! Obviously you need to figure out how best to manage the objects, but when all is said and done that’s what you are doing.

Anyway, hopefully this has de-mystified some more of Kubernetes for you, dear reader; as I mentioned understanding these pieces helped me go from a copy-paste-hope workflow towards a much less frustrating experience building up my Kubernetes resources.

Just enough YAML to understand Kubernetes manifests

When we talk about Kubernetes, we should really be talking about the fact that when you, as an administrator, interact with a cluster using kubectl, you are using kubectl to manipulate the state of data within Kubernetes via its API.

But when you use kubectl, the way you tend to tell kubectl what to do with the Kubernetes API is using YAML. A lot of freakin’ YAML. So while I hope to write more about the actual Kubernetes API sometime soon, first we’ll have to talk a bit about YAML; just enough to get going. Being frank, I don’t get on well with YAML. I do get on with JSON, because in JSON there is a single way to write anything. You don’t even get to choose between double and single quotes for your strings in JSON, whereas I overheard a colleague say that there are over sixty ways to write a string in YAML. Sixty ways to write a string! I think they were being serious.

Thankfully, idiomatic Kubernetes YAML doesn’t do much with the silly end of YAML, and even sticks to just three ways to represent strings 💪.

Though not required for Kubernetes, while writing this I found some even stranger corners of YAML than I’d come across before. I thought I’d note these down for amusement’s sake, even though I think they just come from over-applying the grammar rather than anyone seriously believing they are sensible.

Below I’ve included YAML and the JSON equivalents, simply because I find JSON a conveniently unambiguous representation, and one that I expect to be familiar to most readers (including myself).
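
If you want to check what a piece of YAML means for yourself, a couple of lines of Python (assuming the PyYAML package is installed) will print the JSON equivalent:

# Print the JSON equivalent of a YAML snippet to see what it really means.
import json
import yaml  # PyYAML

snippet = """
object:
  key: value
  null_value:
  integer: 1
"""
print(json.dumps(yaml.safe_load(snippet), indent=2))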

Objects (maps)

In JSON, you write an object like this:

"object": {
    "key": "value",
    "boolean": true,
    "null_value": null,
    "integer": 1,
    "anotherobject": {
        "hello": "world"
    }
}

Unlike JSON, in YAML the spacing makes a difference. We write an object like this:

object:
  key: value
  boolean: true
  null_value:
  integer: 1
  anotherobject:
    hello: world

If you bugger up the indenting, you’ll get a different value. So this YAML:

object2:
key: value
boolean: true
null_value:
integer: 1

Means this JSON:

"object2": null,
"key": "value",
"boolean": true,
"null_value": null,
"integer": 1

When combined with a de facto standard of two-space indenting, I find YAML objects pretty hard to read. Particularly in a long sequence of objects, it’s very easy to miss where one object stops and another begins. It’s also easy to paste something with a slightly wrong indent, changing its semantics, in a way that just isn’t possible in JSON.

You can actually just write an object with braces and everything in YAML, just like JSON. In fact JSON is a subset of YAML so any JSON document is also a YAML document. When I learned this it was essentially 🤯 combined with 😭. However, no-one ever writes JSON into YAML documents, so in the end this fact is purely academic.

Well apart from sometimes you see JSON arrays.

Arrays

Arrays look like (un-numbered) lists:

array1:
- mike
- fred

You can indent all the list items how you like, so this is the same:

array1:
        - mike
        - fred

Both translate to:

"array1": ["mike", "fred"]

But it’s easy to make a mistake. This YAML with its accidental indent:

array1:
  - mike
    - fred
  - john

Means this:

"array1": [
    "mike - fred",
    "john"
]

Which I find a bit too silently weird for my tastes.

Objects in arrays

The main thing I get wrong here is when writing arrays of objects. It’s very easy to misplace a -.

So this is a list of two objects:

array:
- key: value
  boolean: true
  null_value:
  integer: 1
- foo: bar
  hello: world

Which becomes:

"array": [
    {
        "key": "value",
        "boolean": true,
        "null_value": null,
        "integer": 1
    },
    {
        "foo": "bar",
        "hello": "world"
    }
]

But I find it very easy to miss the -, particularly in lists of objects with sub-objects. In addition, YAML’s permissiveness enables one to mistype syntactically valid but semantically different constructs, like here where we want to create an object but end up with an extra list item:

array:
- object:
-   foo: bar
    hello: world
    baz: world
- key: value
  boolean: true
  null_value:
  integer: 1

Which gives the JSON:

"array": [
    {
        "object": null
    },
    {
        "foo": "bar",
        "hello": "world",
        "baz": "world"
    },
    {
        "key": "value",
        "boolean": true,
        "null_value": null,
        "integer": 1
    }
]

Particularly when reviewing complex structures, it’s easy to start to lose the thread of which - and which indent belongs to which object.

Arrays of arrays

I find this perhaps the best example of where YAML goes off the rails. It’s easy and (I find) clear to represent arrays of arrays in JSON:

[
    1,
    [1,2],
    [3, [4]],
    5
]

This is… pretty wild by default in YAML:

- 1
- - 1
  - 2
- - 3
  - - 4
- 5

I suspect this is reducing to the absurd for effect; if you really do need nested arrays, perhaps the best thing is to regress to inline JSON.

Strings

Anyway, let’s get back to those sixty ways to represent strings. The three ways you’ll commonly see used in Kubernetes manifest YAML files are as follows:

array:
- "mike"
- mike
- |
    mike

These all mean (nearly) the same thing – note that the | form keeps a trailing newline:

"array": [
    "mike",
    "mike",
    "mike\n"
]

The first form is always a string. The second form is also a string – unless it’s a reserved word like true or null, as the example below shows. The third form allows you to insert multi-line strings, as long as you indent appropriately. This third form is most seen in ConfigMap and Secret objects as it is very convenient for multi-line text files.

array:
- true
- "mike"
- |
  mike
  fred
  john

Which gives the JSON:

"array": [
    true,
    "mike",
    "mike\nfred\njohn\n"
]

A digression into wacky strings

Thankfully I’ve not seen them in Kubernetes YAML, but YAML contains at least two further forms that look remarkably similar to the | form. The first, which starts with >, only inserts a newline where the source has a blank line, and for some reason (almost) always appears to insert a newline at the end. The second has no indicator character at the start at all, but looks identical in passing; in this variant the newlines embedded in the YAML disappear from the actual string.

In this example, I include the | form, the > and the prefix-less form using the same words and newline patterns to show how similar-looking YAML gives different strings:

array:
- |
  mike
  fred
  john
- >
  mike
  fred
  john
- mike
  fred
  john

Giving the JSON:

"array": [
    "mike\nfred\njohn\n",
    "mike fred john\n",
    "mike fred john"
]

I find the YAML definitely looks cleaner, but the JSON is better at spelling out what it means.

While experimenting, I found an odd edge case with the > prefix. Where I used it at the end of a file, the trailing \n ended up being dropped:

names: >
    mike
    fred
    john
names2: >
    mike
    fred
    john

Ends up with the \n going missing in names2:

"names": "mike fred john\n",
"names2": "mike fred john"

Just 🤷‍♀️ and move on.

Multiple documents in one file

Finally, you will often see --- in Kubernetes YAML files. All this means is that what follows the --- is the start of a new YAML document; it’s a way of putting multiple YAML documents inside one file. This is actually pretty nice, although again it’s pretty minimal and easy to miss when scanning a file.
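
If you’re curious how tooling deals with this, PyYAML parses such a stream into a sequence of documents – a small sketch, again assuming PyYAML is installed:

# Each ---separated document in the stream parses to its own object.
import yaml  # PyYAML

stream = """
kind: ConfigMap
---
kind: Deployment
"""
for document in yaml.safe_load_all(stream):
    print(document)
# {'kind': 'ConfigMap'}
# {'kind': 'Deployment'}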

And that’s about enough YAML to understand Kubernetes manifests 🎉.

AirPods Pro: first impressions

I’ve been using a pair of AirPods Pro for just under a week now. I use headphones in three main environments, and up until now have used three separate pairs, each of which works best in one of them. As they combine true-wireless comfort, noise-cancelling, a promising transparency mode and a closed-back design, I wondered whether the AirPods Pro could possibly replace at least a couple of my existing sets. Here we go.

Commute. My go-to headphones for my commute were a pair of first-gen AirPods that I’ve had nearly three years. I walk my commute, so I like to be able to hear what’s going on around me on the street; the open-backed AirPods work great for this. This is obviously a place where transparency mode comes into play. However, both the Sony and Bose pairs mentioned below have transparency modes that, well, just don’t feel transparent. They make it feel like the outside world is coming through water. The AirPods Pro, however, while they do seem to have minor trouble with sibilants in spoken-word audio, feel much closer to super-imposing your audio on the surroundings than any other transparency mode I’ve used. It’s surprisingly close to the experience of using the original AirPods. On top of this, you can obviously turn on noise-cancelling on busy streets rather than turning up the volume. These two combined are a game-changer; right now I’m not tempted to swap back.

In the office. The original AirPods are essentially useless in the office for blocking out chatter. So I’ve been using a pair of WI-1000X for a couple of years, which block out background chatter really well, especially when used with the foam tips they come with. However, here too the AirPods Pro still work okay even without foam tips, and the lack of neckband and wires is just as noticeable an improvement as on my walk into the office. In addition, the AirPods Pro charging case is just easier to use than the somewhat fiddly charger of the WI-1000X. At the moment, I’m grabbing for the AirPods Pro in the office. They block out enough chatter, and true-wireless is just way more comfortable.

Flying. For drowning out engine noise on flights, I have found the (wired) Bose QC20 beat the WI-1000X (the reverse is true for office chatter, strangely). The noise-cancelling is better on the Bose pair, and they fit into a very small carrying pouch compared to the neckband-saddled WI-1000X; much easier to chuck into a bag. I would say the AirPods Pro have about the same noise-cancelling effectiveness as the Sony headphones. I’ve yet to fly with them, so time will tell whether the convenience of the wireless headphones beats out the (likely) better noise cancelling of the Bose pair. I’ll certainly be taking both to try them out, as I feel it’ll be a close call.

Overall I’ve been surprised by how close the AirPods Pro have come to replacing the three pairs I used previously. Time will tell how I end up settling long term, but Apple have hit a good balance with these headphones. I suspect the convenience of the true wireless, good-enough noise-cancelling and compact size may make these my go-to headphones most of the time. Oh, and they sound good enough too – but you’d expect that for the price.

Working effectively with CouchDB Mango indexes

Because you work with CouchDB indexes using JSON and Javascript, it’s tempting to imagine there is something JSON or Javascript-y about how you use them. In the end, there isn’t: they end up on disk as B+ Trees, like pretty much every other database. In order to create appropriate indexes for your queries, it’s important to understand how these work. We can use tables as an easy mental model for indexes, and this article shows how that works for CouchDB’s Mango feature (also called Cloudant Query).

Our data

Let’s make a simple data model for people, and add three people:

{
    "_id": "123",
    "name": "Mike",
    "age": 36
}
{
    "_id": "456",
    "name": "Mike",
    "age": 22
}
{
    "_id": "abc",
    "name": "Dave",
    "age": 29
}

Now we’ll look at how we can index these, and how that affects our queries.

Single field indexes

Let’s take a simple query first: what’s an effective index for finding all people with a given name? This one feels easy: index on name. Here’s how this is indexed in Mango:

"index": {
    "fields": ["name"]
}

This creates an index on a single field, name. This field is the key in the index. Conceptually, a good representation for this is a table:

key (name) | doc ID
-----------+-------
dave       | abc
mike       | 123
mike       | 456

The doc ID is included as a tie-breaker for entries with equal keys.

This ends up on disk as a B+ Tree. Just as it’s easy to visually scan the table above from top to bottom (or bottom to top), a B+ Tree makes it fast to scan the index file on disk in the same order. Therefore the table and the B+ Tree can be considered somewhat equivalent when imagining how a query performs.

Specifically, for a name == "mike" query, we can see that it’s fast to search the first column of the table for "mike" and return data about those entries. This same inference holds for the on-disk B+ Tree, so from here we’ll just talk about the tables.
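
In practice, that query is a POST to CouchDB’s _find endpoint with a selector. Here’s a hedged sketch, assuming a local CouchDB holding our three documents in a people database and the requests package installed:

# Run the name query against Mango's /_find endpoint.
# (The documents above store the name as "Mike", capitalised.)
import requests

resp = requests.post(
    "http://localhost:5984/people/_find",
    json={"selector": {"name": "Mike"}},
)
print(resp.json()["docs"])  # the two "Mike" documents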

Most queries involve more fields, however, so let’s now look at how multi-field indexes work.

Two field indexes

Say we wanted to search by age and name. We can create an index to help with this. In CouchDB we can index both fields in two ways. We’ll get to this below, but before we get started we need to understand that:

  • Indexing an array of fields as a key creates a table with several columns, one per entry in the key array.
  • The entries are ordered by the first column, then the second column, and so on in the overall table.
  • Therefore, an index can only be searched in column order from left to right, because this is the only efficient way to scan the table (or the on-disk tree).
    • If an index cannot be used for a query, the database has to resort to loading every document in the database and checking it against the query selector; this is the worst case, and can take a very long time. This is called a table scan.

The key we choose dictates how we can search, and therefore how efficient a given query can be. So it’s very important to get your indexes right if you want results to arrive quickly in your application. In particular, avoiding table scans for frequently run queries is vital.

Let’s look at the two ways we can index these two fields, and how that ordering shows up in how we can query the indexes.

Firstly, we could index by age, then name:

"index": {
    "fields": ["age", "name"]
}

Giving us the index:

key (age, name) | doc ID
----------------+-------
22, mike        | 456
29, dave        | abc
36, mike        | 123

So a query for name == "mike", age == 36 will first efficiently search the first column until it finds the first entry for 36. It will then scan each entry with 36 in the first column until it finds the first entry with the value mike. When it reaches the end of the entries with age == 36, the query can stop reading the index because it knows every row in the table after that will have an age greater than 36.

This index can also be used to query on just the age field. In particular, it’s great for queries like age > 20 or age > 20, age < 30 because the query engine can efficiently search for the lower bound, and then scan and return entries until it reaches the upper bound.

The way a query like age > 20, age < 30, name == "mike" works is that the first column is searched for the lower bound on age, then the index is scanned for entries where the name column is "mike". When the search encounters an entry in the first column – the age column – that is greater than 30, it can stop reading the index.

This is important: the first column is searched, but the second column is checked via a slower scan operation. Therefore, for any query, the best index is the one where the first column reduces the search space the most, which reduces the number of rows that need to be scanned through to match entries in the second and further columns of the key.

This index cannot, however, be used for the query name == "mike", because it cannot be efficiently searched by name: there is no overall ordering for the name column, as entries are all jiggled around by being ordered on the age field first. The query engine would therefore need to scan every entry in the index – not very efficient. This is called an index scan, and it can actually be more efficient than a full table scan; however, CouchDB doesn’t support index scans at this time, so it will fall back to the table scan.

Secondly, we could index by name, then age:

"index": {
    "fields": ["name", "age"]
}

Giving us the index:

key (name, age) | doc ID
----------------+-------
dave, 29        | abc
mike, 22        | 456
mike, 36        | 123

So a query for name == "mike", age == 36 will initially search the first column until it finds the first entry for mike. It will then scan each entry with mike in the first column until it finds the first entry with the value 36 in the second column.

This index can also be used to query on just the name field. It can also be used to effectively answer questions like name == "mike", age > 30 because the first column narrows down the name quickly and the scan for age can be fast.

It might help to imagine we have many millions of entries: it’s likely there are lots of people with a certain age, but far fewer with a certain name. Therefore, the initial search for name == "mike" will constrain the search space far more than a search for age > 30, and so we end up scanning far fewer rows for the age value in the second column.
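
To make that concrete, here’s a sketch of creating this index and running the combined query over HTTP, again assuming a local CouchDB with a people database and the requests package:

# Create the ["name", "age"] index, then query on name plus an age range.
import requests

base = "http://localhost:5984/people"

requests.post(base + "/_index", json={
    "index": {"fields": ["name", "age"]},
    "name": "name-age-index",  # a hypothetical index name
    "type": "json",
})

resp = requests.post(base + "/_find", json={
    # The first column (name) narrows the search; age is then scanned.
    "selector": {"name": "Mike", "age": {"$gt": 30}},
})
print(resp.json()["docs"])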

This index cannot be used for queries on the age field for the same reason as above; it just wouldn’t be efficient to scan the whole table.

Summary

The above logic holds for indexes of three, four or any number of further fields. The first column can be efficiently searched, and then we can reasonably efficiently scan for matching entries in the second, third and so on columns presuming the first column search narrows down the search space enough.

Creating appropriate indexes is key for the performance of CouchDB applications making use of Mango (or Cloudant Query on Cloudant). Hopefully this article helps show that it’s relatively straightforward to generate effective indexes once you have worked out the queries they need to service, and that it is possible to create indexes that can serve more than one query’s need by judicious use of multi-field indexes.