What is docker?
When I first came across docker a few years ago, probably late 2014 – a year or so after it was introduced at PyCon in 2013 – I found it a confusing concept. “Like GitHub, but for containers” was a phrase I recall from that period, and I think it ended up causing a lot of my confusion: I conflated Docker Hub with docker the tool.
Since then, I’ve learned more about docker, particularly in the last year. Things started to click around a year ago, and over the past few months, as I’ve looked further into Kubernetes and written my own pieces of software destined for container deployment, I’ve formed my own mental model of where docker fits into my world. This post is my attempt to write that model down and check its coherence.
I tend towards understanding systems like this bottom-up, so let’s start at the beginning, which is also conveniently the bottom.
Cgroups, or control groups to give them their full name, were introduced into the mainline Linux kernel in 2.6.24, released in January 2008. What cgroups allow is for processes running on a system to be hierarchically grouped in such a way that various controls and boundaries can be applied to a process hierarchy.
Cgroups are a necessary but not sufficient part of a container solution, and they are also used for lots of things other than containers. Systemd, for example, uses cgroups when defining resource limits on the processes it manages.
Like many things within the Linux kernel, cgroups are exposed within the file hierarchy. A system administrator writes and reads from files within the mounted cgroups filesystem to define cgroups and their properties. A process is added to a cgroup by writing its PID to a file within the cgroups hierarchy; the process is automatically removed from its previous cgroup.
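As a sketch of how that interface looks in practice (this assumes a cgroup v1 hierarchy mounted at /sys/fs/cgroup and root privileges; cgroup v2 uses slightly different file names, and the `demo` group name is made up for the example):

```shell
# Create a new cgroup under the memory controller (v1 layout assumed)
mkdir /sys/fs/cgroup/memory/demo

# Give it a 256 MiB hard memory limit by writing to a control file
echo $((256 * 1024 * 1024)) > /sys/fs/cgroup/memory/demo/memory.limit_in_bytes

# Move the current shell into the cgroup; the kernel removes it from
# its previous cgroup automatically
echo $$ > /sys/fs/cgroup/memory/demo/cgroup.procs
```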
Overall, cgroups provide docker with a simple(ish) way to control the resources a process hierarchy uses (like CPU and memory); controlling what the hierarchy has access to (like networks and parts of the filesystem) falls mainly to namespaces, which we’ll get to shortly.
Cgroups provide control of various resources, but the main ones to consider for docker containers are:
- CPU controller – using cpu shares, CPU time can be divided up between processes to ensure a process gets a share of CPU time to run in.
- Memory controller – a process can be given its own chunk of memory which has a hard limit on its size.
From this, it’s relatively easy to see how docker can assign resources to a container – put the process running within in the container in a cgroup and set up the resource constraints for it.
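These two controllers surface directly as `docker run` flags (the limit values below are just illustrative):

```shell
# --cpus caps CPU time via the CPU controller; --memory sets a hard
# limit via the memory controller. 0.5 and 256m are example values.
docker run --cpus 0.5 --memory 256m couchdb
```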
Beyond controlling scarce resources like CPU and memory, the kernel provides namespaces, the next piece of the puzzle.
Putting a process within a namespace is a means to define what the process has access to. Where a cgroup limits how much of a resource a process may use, a namespace draws a boundary around which resources the process can see at all. Processes are placed into namespaces via the clone(), unshare() and setns() system calls, rather than through the cgroups machinery.
Also in 2.6.24 came the core of network namespaces. This and future patchsets enable processes to be presented with their own view of the network stack, covering network functions such as interfaces, routing tables and so on.
The Wikipedia article on kernel namespaces has a list of the current resources that can be isolated using namespaces. We can form a basic view of how containers are run by docker (and any other container management software) using just a couple of these:
- The Mount (MNT) namespace
- The Network (NET) namespace
Things like the PID and User namespaces provide extra isolation, but I’m not going to cover them here.
I confess here I’m making some guesses as to what’s going on, but the mental model has served me okay so I’ll reproduce it here. Broadly I consider these two namespaces to be the basis of docker’s ability to run what amounts to “pre-packaged” software.
Mount namespaces define what the filesystem looks like to the process running within the namespace. So different processes can see entirely different views of the filesystem.
My general assumption here is that docker is using MNT namespaces to provide the running container with a unique view of the filesystem: both its own “root image” that we’ll talk about later, and the parts of the host filesystem mounted into the running container using the `-v`/`--volume` options.
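As a sketch (the host path is made up for the example; /opt/couchdb/data is where I believe the official couchdb image keeps its data, but check the image’s documentation):

```shell
# Bind-mount a host directory into the container's mount namespace,
# so writes to the container path land on the host filesystem
docker run -v /srv/couchdb-data:/opt/couchdb/data couchdb
```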
As NET namespaces provide processes with a custom view of the network stack, and provide ways for processes in different namespaces to poke holes through to each other via the network, I assume this is the basis for docker’s bridge network type, which sets up a private network between processes running in containers. When one runs a container with the host network type, my basic layman’s assumption is that the container’s process is not placed within its own network namespace (or that it lives within the default namespace).
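The two network types correspond to `docker run` flags:

```shell
# Default: the container gets its own network namespace, attached
# to docker's private bridge network
docker run --network bridge couchdb

# Host networking: the container shares the host's network stack
# (no separate namespace, as far as I can tell)
docker run --network host couchdb
```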
A union filesystem essentially takes several filesystem images and layers them on top of each other. Images “above” override values from images “below”. For any file read, the union filesystem traverses the image stack from top to bottom and returns the file content from the first image containing the file. For writes, either the write just fails (for a read-only union filesystem) or the write goes to the top-most layer. Often this top-most layer is initially an empty image created specifically for the writes of files to the mounted union filesystem.
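This layering can be seen with overlayfs, the union filesystem behind docker’s default overlay2 storage driver. A minimal sketch (needs root; the directory names are examples):

```shell
mkdir -p /tmp/demo/lower /tmp/demo/upper /tmp/demo/work /tmp/demo/merged
echo "from the lower layer" > /tmp/demo/lower/greeting

# Stack lower (read-only) under upper (writable); reads fall through
# the stack, writes land in upperdir
mount -t overlay overlay \
  -o lowerdir=/tmp/demo/lower,upperdir=/tmp/demo/upper,workdir=/tmp/demo/work \
  /tmp/demo/merged

cat /tmp/demo/merged/greeting          # served from the lower layer
echo "overridden" > /tmp/demo/merged/greeting
cat /tmp/demo/lower/greeting           # the lower layer is untouched
ls /tmp/demo/upper                     # the write landed here instead
```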
An important point to note is that two or more mounted union filesystems can share images, meaning that two union filesystems could have, say, the first five images in their respective stacks shared but each with different images stacked on top to provide very different resultant filesystems.
Docker images and layers
When running a docker container, one specifies an image to “run” via a command like:
docker run couchdb
The `couchdb` part of this command specifies the image to download. I find the naming gets a bit confusing here, because essentially the “image” is actually a pointer to the top image of a stack of images which together form the union filesystem that ends up being the root filesystem of the running container.
While the above command reads “run the couchdb container”, it’s really more like “create and start a new container using the couchdb image as the base image for the container”. In fact, the docker run documentation describes this as:
The docker run command must specify an IMAGE to derive the container from.
The word derive is key here. By my understanding, docker adds a further image to the top of the stack which is where writes that the running container makes are written to. This image is saved under the name of the newly started container, and persists after the container is stopped under the container’s name. This is what allows docker to essentially stop and start containers while maintaining the files changed within the container – behind the scenes it’s managing an image used as the top image on the container’s union filesystem.
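One way to observe this writable top layer (the container and file names are made up for the example):

```shell
docker run --name demo -d couchdb
docker exec demo touch /i-was-here

# docker diff lists filesystem changes relative to the image's
# layers -- i.e. the contents of the container's writable top layer
docker diff demo

docker stop demo
docker start demo
docker exec demo ls /i-was-here    # still present after a stop/start
```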
Docker calls the images that stack up to form the union filesystem “layers”, and the ordered collection of layers an “image”. This concept is key to how dockerfiles work – at a first approximation, each line in a dockerfile adds a new image to the image stack used as the base image for a container. So, for example, a command that runs `apt` to install software creates a new image containing the changes to the filesystem made while installing the software.
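A hypothetical dockerfile to illustrate (the base image, package and file names are made up for the example):

```dockerfile
# FROM points at an existing image (stack of layers) to build upon
FROM debian:bookworm

# This RUN line produces a new layer holding the filesystem changes
# made by the package install
RUN apt-get update && apt-get install -y curl

# Another layer, containing just the copied file
COPY app.sh /usr/local/bin/app.sh

CMD ["/usr/local/bin/app.sh"]
```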
It’s also obvious how this allows for dockerfile FROM lines to work – it just points back to the image at the top of the stack and then further dockerfile commands layer more images onto that stack to form a new docker image.
In addition, the fact that a union filesystem is able to share images at the lower levels means that docker is able to share lower level base images across many containers but only ever have a single copy on disk. These base images are read-only and so can be safely used in many containers’ union filesystem image stacks.
Putting it together
So basically what docker does when we use the docker run command is:
- Download the base image to derive the container from.
- Create a union filesystem consisting of the layers in the base image and a new layer at the top of the stack for the container to write its own files to.
- Set up a network namespace for the container.
- Set up a cgroup, along with mount and network namespaces, such that the process has the union filesystem mounted as its root filesystem and a private view of the host’s network stack. Mount other volumes into this namespace as specified on the command line.
- Start up a process within the union filesystem within the cgroup.
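These steps map loosely onto docker’s own subcommands, which `docker run` bundles together (a sketch, not an exact equivalence):

```shell
docker pull couchdb              # fetch the image's layers
id=$(docker create couchdb)      # new writable top layer + container config
docker start "$id"               # namespaces/cgroup set up, process started
```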
This is where the “GitHub for containers” thing comes from. A docker daemon manages a local collection of union filesystem images on your machine called a repository – which contains all the base images and other images used to form the union filesystems for containers on the system (including the top-of-the-stack writable images containers use).
But Docker also manages a large central collection of images (Docker Hub) which can be used as base images, either for running directly via the `docker run` command or as the start point for other docker images via the `FROM` dockerfile command. When used, the docker daemon downloads the image from the remote repository and uses it to derive a new container.
There’s some naming stuff here that I never quite got my head around, in that the `couchdb` bit in the `docker run` command is actually a repository itself, making Docker Hub more of a collection of repositories. The actual image used by the docker tool on your computer is chosen by the “tag” you select, and each repository has a set of tags defining the images you can use. There’s a default tag, `latest`, which is used when the `docker run` command just specifies a repository name and misses out the tag.
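So, concretely (the `3.3` tag is just an example of a tag the couchdb repository might carry):

```shell
docker run couchdb        # implicitly couchdb:latest, the default tag
docker run couchdb:3.3    # a specific tagged image from the same repository
```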
So I guess this use of Docker Hub as a collection of repos, which contain tagged images, can be mapped imperfectly to the way GitHub works, GitHub being a collection of Git repos. However, the terminology match is far from exact, which definitely caused me problems when trying to understand docker.
A key thing I found confusing is that because a repository is really just a collection of arbitrary images, your local repository can – and almost certainly does! – contain the base images for lots of different pieces of software as well as many, many intermediate layers, whereas the repos on Docker Hub typically contain several tagged versions of a single piece of software. A Docker Hub repo could presumably therefore also contain many disparate pieces of software, but convention dictates that is not what happens, at least in public repos.
Thinking of docker merely as the repository concept misses out a lot of useful context for what docker means for your machines – the level of access required to create and maintain the cgroups and namespaces is high, and hopefully this post makes it a bit clearer why the docker daemon requires that access.
The v2 interface for cgroups provides for delegation of a portion of the cgroups hierarchy to a non-privileged process, which on a first read suggests a route to a less privileged docker daemon – or perhaps it’s already possible to use this. We’ve reached the boundaries of my knowledge and mental model here, so it’s time to stop for now.
As noted at the beginning, this post is a synthesis of my current understanding of how containers and docker work. Having written it down, my model does seem coherent and logical, but there is quite a bit of guesswork going on. I’d therefore be very pleased to receive corrections to my descriptions and explanations.