Python Packaging in 2020

For a long time, I’ve kind of existed with a barely-there understanding of Python packaging. Just enough to copy a requirements.txt file from an old project and write a Makefile with pip install -r requirements.txt. A few years ago, I started using pipenv, and again learned just enough to make it work.

Over the past year, I became frustrated with this situation:

  • pipenv became increasingly hard to make work, both through upgrades and through its kitchen-sink approach.
  • I started building docker images for Python applications, and understanding packaging in more detail became essential to build secure and performant images.

Last year (2019), I started to look at tools like poetry, which essentially start the whole process from scratch, including new dependency resolution and package-building code. When figuring out how to use these in Dockerfiles, I realised I needed to understand a bunch more about both packaging and virtual environments. The good news was that this area had actually progressed a lot in the 2018-2019 time frame. The bad news was that meant there was a lot to learn, and a bunch of existing material was out of date.

In the beginning, there was the source distribution

Until 2013, when PEP 427 defined the wheel archive format (.whl) for Python packages, whenever a package was installed via pip install it was built from source via a distribution format called sdist. For pure-python packages this wasn’t typically much of a problem, but for any package making use of C extensions it meant that the machine where pip install was run needed a compiler toolchain, python development headers and so on.

This situation is more than a little painful. As PEP 427’s rationale states:

Python’s sdist packages are defined by and require the distutils and setuptools build systems, running arbitrary code to build-and-install, and re-compile, code just so it can be installed into a new virtualenv.

After PEP 427, packages could also be distributed as so-called binary packages or wheels.

Because I had never looked in depth into python packaging, when I first started seeing python binary packages I was confused and even somewhat alarmed by the term, particularly as I was quite used to source distributions by 2013. But in general they are a big win:

  • For pure python packages the term is a slight misnomer: the wheel format is just about how the files are laid out inside an archive. A pure python package typically ships a single .whl covering all the python versions it supports, named like Flask-1.1.1-py2.py3-none-any.whl, where none and any specify the python ABI (relevant for C extensions) and the target platform respectively. As pure python packages have no C extensions, they have no target ABI or platform, but will often have a python version requirement – this example supports both python 2 and 3.
    • The tags, such as none, in filenames are defined in PEP 425.
  • For packages including C extensions, which are linked to the Python C runtime during compilation, the name does make sense: the build process pre-compiles the extension into a binary, unlike in the sdist world where C extensions were compiled during package installation. This results in several different .whl files, as a separate .whl file must be created for each target system and python version. For example, cryptography-2.8-cp34-abi3-manylinux2010_x86_64.whl is a package with binaries built for CPython 3.4, version 3 of the stable ABI, on Linux (manylinux2010) for the x86-64 processor architecture.
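These filename tags can be pulled apart mechanically. A minimal sketch – parse_wheel_filename is a hypothetical helper, not a real packaging API, and it ignores the optional build tag a wheel filename may also carry:

```python
def parse_wheel_filename(filename):
    """Split a .whl filename into its name/version and PEP 425 tags."""
    stem = filename[: -len(".whl")]
    name, version, python_tag, abi_tag, platform_tag = stem.split("-")
    return {
        "name": name,
        "version": version,
        "python": python_tag,      # e.g. py2.py3 (pure) or cp34 (CPython 3.4)
        "abi": abi_tag,            # none for pure python, abi3 for the stable ABI
        "platform": platform_tag,  # any for pure python, else e.g. manylinux2010_x86_64
    }

print(parse_wheel_filename("Flask-1.1.1-py2.py3-none-any.whl")["abi"])  # none
print(parse_wheel_filename(
    "cryptography-2.8-cp34-abi3-manylinux2010_x86_64.whl")["platform"])  # manylinux2010_x86_64
```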

In the end, wheels provide a much simpler and more reliable install experience, as users are no longer forced to compile packages themselves, with all the tooling and security concerns inherent in that approach.

Stepping back to how wheels are built

Wheels soon started taking over the python packaging ecosystem, though there are still hold-outs even today that ship source packages rather than binary packages (often for good reasons).

However, all python packages were still defined via setup.py, an opaque standard defined purely by the distutils and setuptools source code. While there was now a binary standard for built packages, in practice there was only one way of building them: pip, for example, hardcoded its calls to setup.py into its pip wheel command, so using other build systems was very difficult, making implementing one a somewhat thankless task. Before poetry, it doesn’t look like anyone much attempted it.

The distutils module was shipped with Python, so it was natural that it came to be the de facto standard, and including a packaging tool was a good decision from the python maintainers. distutils wasn’t that easy to use on its own, however, so setuptools was built as a package to improve it. Over time, setuptools grew to be somewhat gnarly itself.

Tools like flit (started 2015) were then created to tame this complexity, wrapping distutils and setuptools in another, opinionated layer. Flit became pretty popular, as its workflow is simple and understandable, but in the end it was still using distutils and setuptools under the hood (per this flit source code), with generation of the setup.py files used by distutils happening behind the scenes so far as I can tell (I didn’t actually try flit out, so may have made some errors here).

Poetry and PEPs 517 & 518

In 2018 development of poetry started, at least per the earliest commits in the github repository. Poetry is an ambitious rebuild of python packaging pretty much from scratch. It’s able to resolve dependencies and build wheels without any use of distutils or setuptools. The main problem with poetry is that it needs to re-implement a lot of functionality that already exists in tools like pip before it can be accepted into development and CI pipelines.

At a similar time, the python community came up with PEP 517 and 518.

  • PEP 517 (status Provisional, 2015-2018) is about a standard way to specify alternative build backends that pip can use when building wheels – for example, using Poetry or flit’s build engine rather than going directly to distutils. A build backend is a Python module with a standard interface that is used to take a python package source tree and spit out a wheel.
  • PEP 518 (status Provisional, 2016) works in tandem with PEP 517 and specifies a way for a tool like pip to know how to install the build backend specified by PEP 517 when pip is building packages. Specifically, it describes how to create an isolated python environment with just the needed requirements to build the package (that is, the packages to install the build backend, not the package’s dependencies).

Both PEPs 517 and 518 use a new file called pyproject.toml to describe their settings:

[build-system]
# Defined by PEP 518, what the build environment requires:
requires = ["poetry>=0.12"]
# Defined by PEP 517, how to kick off the build:
build-backend = "poetry.masonry.api"
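The build-backend value is a dotted module path, optionally followed by a colon and an object path within that module – so "poetry.masonry.api" names a module whose attributes are the build hooks. A frontend can resolve such a string with a small helper; this is only a sketch, as pip's real implementation handles more edge cases:

```python
import importlib

def resolve_backend(spec):
    """Resolve a PEP 517 build-backend string like "module.sub" or
    "module.sub:obj.attr" to the module/object holding the build hooks."""
    module_path, _, object_path = spec.partition(":")
    backend = importlib.import_module(module_path)
    # Walk any attribute path after the colon (empty for plain module specs).
    for attr in filter(None, object_path.split(".")):
        backend = getattr(backend, attr)
    return backend

# Standard-library stand-ins, since poetry may not be installed here:
resolve_backend("os.path")       # the os.path module itself
resolve_backend("os.path:join")  # the join function inside it
```

With poetry installed, resolve_backend("poetry.masonry.api") would return the module whose build hooks pip then calls.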

Both poetry and flit work with pyproject.toml via its support for namespacing tool-specific settings. An example using poetry:

[tool.poetry]
name = "my-package"
version = "0.1.0"
description = "The description of the package"

[tool.poetry.dependencies]
python = "^3.7"
flask-hookserver = "==1.1.0"
requests = "==2.22.0"

While both PEPs 517 and 518 were started a while ago, it’s only from pip 19.1 (early 2019) that pip started supporting the use of build backends specified via PEP 517.

pip enters “PEP 517 mode” when pip wheel is called and pip finds a pyproject.toml file in the package it is building. When in this mode, pip acts as a build frontend – a term defined by PEP 517 for the application that is invoked from the command line and makes calls into a build backend, such as poetry. As a build frontend, pip’s job here is to:

  1. Create an isolated python environment.
  2. Install the build backend into this environment via the PEP 518 requirements (requires = ["poetry>=0.12"]).
  3. Get the package ready for building in this environment.
  4. Invoke the build backend, for example poetry, using the entrypoint defined by PEP 517 (build-backend = "poetry.masonry.api") within the created isolated environment.

The build backend then must create a wheel from the source folder or source distribution and put it in the place that pip tells it to.
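On the backend's side of that handover, a build backend is just a module exposing hooks with signatures fixed by PEP 517. Here is a toy sketch of the mandatory build_wheel hook – not a real backend (a valid wheel also needs *.dist-info metadata, which real backends like poetry and flit generate), and the project name and version are hypothetical:

```python
import os
import zipfile

def build_wheel(wheel_directory, config_settings=None, metadata_directory=None):
    """Toy PEP 517 hook: pack the current directory's .py files into a
    wheel-named zip archive. The contract is that the backend creates a
    wheel inside wheel_directory and returns its filename to the frontend."""
    name = "myapp-0.1.0-py3-none-any.whl"  # hypothetical project name/version
    path = os.path.join(wheel_directory, name)
    with zipfile.ZipFile(path, "w") as whl:
        for fname in os.listdir("."):
            if fname.endswith(".py"):
                whl.write(fname)
    return name  # this is how the frontend (pip) learns the wheel's filename
```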

For me, this seems like big news for projects like poetry that do a lot from scratch and end up with laundry lists of feature requirements before they can be integrated into full development and CI pipelines. If they can instead be integrated into CI via existing tools like pip, then they are much easier to adopt in development for their useful features there, such as poetry’s virtual environment management. In particular, both flit and poetry will use the information defined in their respective sections of pyproject.toml to build the package wheel and requirement wheels just as they would on a developer’s machine (to an extent anyway – my experiments indicate poetry ignores its .lock file when resolving requirements).

In this way, PEPs 517 and 518 close the loop in allowing tools like poetry to concentrate on what they want to concentrate on, rather than needing to build out a whole set of functions before they can be accepted into developers' toolboxes.

An example Dockerfile shows this in action, for building the myapp package into a wheel along with its dependencies, and then copying the app and dependency wheels into the production image and installing them:

# Stage 1 build to allow pulling from private repos requiring creds
FROM python:3.8.0-buster AS builder
RUN mkdir -p /build/dist /build/myapp
# pyproject.toml has deps for the `myapp` package
COPY pyproject.toml /build
# Our project source code
COPY myapp/*.py /build/myapp/
# This line installs and uses the build backend defined in
# pyproject.toml to build the application wheels from the source
# code we copy in, outputting the app and dependency wheels
# to /build/dist.
RUN pip wheel -w /build/dist /build

# Stage 2 build: copy and install wheels from stage 1 (`builder`).
FROM python:3.8.0-slim-buster as production-image
COPY --from=builder [ "/build/dist/*.whl", "/install/" ]
RUN pip install --no-index /install/*.whl \
    && rm -rf /install
CMD [ "my-package-script" ]

And this is what I now understand about the state of python packaging as we enter 2020. The future looks bright.
