Using sed to extract HTTP headers

Today I needed to take the response to an HTTP request and extract the etag header; the etag was used as part of an MVCC implementation in a service I was using, and I wanted to script an update to a resource. I was doing this in a Makefile, so I wanted to do it without firing up a scripting language.

It turns out this is the domain of tools like sed. sed stands for stream editor: it applies a script to a text stream, editing the content of the stream as it passes through. When you watch someone using sed, the scripts look super-cryptic, but in fact they’re not too bad. Like regular expressions, they are best read left to right; viewed as a whole they just look like a mess. In fact, half of a sed script is often a regular expression!

The sample headers

First, we’ll get the HTTP headers to work with. I found a curl option that was new to me, -D <filename>, which writes the response headers to the named file. So to get the headers for dx13.co.uk:

curl -D headers.txt https://dx13.co.uk

There are quite a lot of headers that come with a call to dx13.co.uk, so I trimmed most of them from the end to leave something shorter to work with. This doesn’t affect the sed commands at all. I left us with:

> cat headers.txt
HTTP/2 200
server: GitHub.com
content-type: text/html; charset=utf-8
last-modified: Tue, 06 Nov 2018 15:58:30 GMT
etag: "5be1ba26-a9dd"
access-control-allow-origin: *
expires: Fri, 22 Mar 2019 14:03:49 GMT
cache-control: max-age=600
x-github-request-id: 6F9E:2F59:86E637:B2E922:5C94E8ED
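
If you'd rather follow along without hitting the network, the trimmed file can be recreated by hand. This is only a sketch: the heredoc writes plain LF line endings, whereas curl normally writes header lines with CRLF endings, so the result is not byte-identical to what curl produces.

```shell
# Recreate the trimmed sample headers locally.
# Assumption: plain LF line endings, chosen for simplicity.
cat > headers.txt <<'EOF'
HTTP/2 200
server: GitHub.com
content-type: text/html; charset=utf-8
last-modified: Tue, 06 Nov 2018 15:58:30 GMT
etag: "5be1ba26-a9dd"
access-control-allow-origin: *
expires: Fri, 22 Mar 2019 14:03:49 GMT
cache-control: max-age=600
x-github-request-id: 6F9E:2F59:86E637:B2E922:5C94E8ED
EOF

# Count the lines as a quick sanity check.
line_count=$(wc -l < headers.txt | tr -d ' ')
echo "$line_count"   # 9
```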

A sed primer

We’ll come to executing scripts in a minute. First, we’ll get familiar with what a script looks like. The basic form is:

[addr]X[options]
  • addr selects a set of lines to operate on. It can be a single line, a line range or a regular expression.
    • A single line is just the line number, 12.
    • A regex is delimited using forward slashes, /regex/.
    • A range is comma-separated, 12,16.
    • Matching can be inverted using ! at the end of the address.
    • If there is no addr, the command is executed on every line of the input.
    • See the sed documentation on addresses for the full details.
  • X is a command (like d or s).
  • options are options to the command.
    • s has the option /foo/bar/.

So in:

  • '14d': the address is line 14; the command is d, which removes the line; no options are used. This removes line 14 of the input.
  • '/:/d': the address is the regex :; the command is d, which removes the matching lines; no options are used. This removes every line containing : from the input.
  • 's/^.*: /foo! /': there is no address, so the command applies to all lines; the command is s; the option is the find/replace specification. We’ll see what this does later.
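
To see each address form in action, here is a small sketch run against a made-up five-line input. The letters are placeholders and have nothing to do with the headers file.

```shell
# A throwaway five-line input: a, b, c, d, e (one per line).
input=$(printf 'a\nb\nc\nd\ne\n')

# Single-line address: delete line 2.
single=$(printf '%s\n' "$input" | sed '2d')

# Range address: delete lines 2 to 4.
range=$(printf '%s\n' "$input" | sed '2,4d')

# Regex address: delete lines matching /c/.
regex=$(printf '%s\n' "$input" | sed '/c/d')

# Inverted regex address: delete every line that does NOT match /c/.
inverted=$(printf '%s\n' "$input" | sed '/c/!d')

printf '%s\n' "$single" "---" "$range" "---" "$regex" "---" "$inverted"
```

The four results should be a c d e, then a e, then a b d e, and finally just c.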

I found the s command familiar – it’s just like vim’s.

Using sed to get the etag

By default, sed treats its first argument as the script and its second as the input file, and writes the result to stdout.

Substitution

A simple script is a vim-like search and replace. Here, we replace the header names with foo!:

> sed 's/^.*: /foo! /' headers.txt
HTTP/2 200
foo! GitHub.com
foo! text/html; charset=utf-8
foo! Tue, 06 Nov 2018 15:58:30 GMT
foo! "5be1ba26-a9dd"
foo! *
foo! Fri, 22 Mar 2019 14:03:49 GMT
foo! max-age=600
foo! 6F9E:2F59:86E637:B2E922:5C94E8ED

As we head straight to the s command and don’t specify an address, the command is executed on all lines of the file.
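
One thing worth knowing about this pattern: .* is greedy, so ^.*: followed by a space matches up to the last colon-plus-space on the line, not the first. None of the sample headers has a second colon-plus-space, so the output above is unaffected. The sketch below uses a made-up header line to show the difference, alongside a stricter pattern using [^:]* that stops at the first colon.

```shell
# A hypothetical header whose value itself contains ": ".
line='x-example: key: value'

# Greedy: .* runs to the LAST ": " on the line.
greedy=$(printf '%s\n' "$line" | sed 's/^.*: /foo! /')

# Stricter: [^:]* cannot cross a colon, so it stops at the FIRST one.
strict=$(printf '%s\n' "$line" | sed 's/^[^:]*: /foo! /')

echo "$greedy"   # foo! value
echo "$strict"   # foo! key: value
```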

Chaining

By using the -e flag, multiple scripts can be chained. You can also use one big script string with semi-colons, but I find multiple -e flags easier to read.

Replace header names with foo! as above, then replace foo with bar:

> sed -e 's/^.*: /foo! /' -e 's/foo/bar/' headers.txt
HTTP/2 200
bar! GitHub.com
bar! text/html; charset=utf-8
bar! Tue, 06 Nov 2018 15:58:30 GMT
bar! "5be1ba26-a9dd"
bar! *
bar! Fri, 22 Mar 2019 14:03:49 GMT
bar! max-age=600
bar! 6F9E:2F59:86E637:B2E922:5C94E8ED
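
For comparison, here is the same chain written both ways against a one-line stand-in for the headers file. The two forms should produce identical output; which one reads better is a matter of taste.

```shell
# One sample header line, standing in for headers.txt.
line='server: GitHub.com'

# Two scripts, each with its own -e flag.
with_e=$(printf '%s\n' "$line" | sed -e 's/^.*: /foo! /' -e 's/foo/bar/')

# The same two commands in one script string, separated by a semicolon.
with_semicolon=$(printf '%s\n' "$line" | sed 's/^.*: /foo! /; s/foo/bar/')

echo "$with_e"          # bar! GitHub.com
echo "$with_semicolon"  # bar! GitHub.com
```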

Removing lines

As mentioned in the primer, removing lines is done using a command within the script, d. Adding ! after the address inverts the match, so d then removes the lines the address doesn’t select.

Remove all the lines containing a colon:

> sed '/:/d' headers.txt
HTTP/2 200

Note that we use the address /:/ which is a regex that matches all lines with a colon. The rest of the script executes on these lines.

Remove all the lines without a colon:

> sed '/:/!d' headers.txt
server: GitHub.com
content-type: text/html; charset=utf-8
last-modified: Tue, 06 Nov 2018 15:58:30 GMT
etag: "5be1ba26-a9dd"
access-control-allow-origin: *
expires: Fri, 22 Mar 2019 14:03:49 GMT
cache-control: max-age=600
x-github-request-id: 6F9E:2F59:86E637:B2E922:5C94E8ED

Here we use /:/! as the address – this causes the command to be executed on the lines that don’t match the regex.
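
Another way to get the same result, not used in this post but common in the wild, is to flip the logic: suppress sed's automatic printing with -n and print only the matching lines with the p command. A sketch using a cut-down, two-line version of the headers:

```shell
# Two lines: the status line (no colon) and one header (has a colon).
input=$(printf 'HTTP/2 200\nserver: GitHub.com\n')

# Delete everything that does not match the address.
deleted=$(printf '%s\n' "$input" | sed '/:/!d')

# Print nothing by default (-n), then print only the lines that match (p).
printed=$(printf '%s\n' "$input" | sed -n '/:/p')

echo "$deleted"   # server: GitHub.com
echo "$printed"   # server: GitHub.com
```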

Getting the etag

Finally we’re ready!

Combining the above, we can retrieve the ETag header using a chain of three scripts:

> sed -e '/etag/!d' -e 's/^etag: //' -e 's/"//g' headers.txt
5be1ba26-a9dd

That is:

  1. Remove the lines not containing etag.
    • This passes just one line to the next script: etag: "5be1ba26-a9dd"
  2. Remove the header name from the remaining line.
    • This leaves: "5be1ba26-a9dd"
  3. Remove the quotes. The g in s/"//g means global; leaving it out means that sed would replace only the first instance of " that it found. Making the replacement global means that all instances on the line are replaced.
    • Giving us: 5be1ba26-a9dd
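
Since the goal was to use the value in a script, here is a sketch of capturing it in a shell variable. It adds two things that aren't in the command above, both of them assumptions about real-world input rather than anything the sample file needs. First, the address is anchored as /^etag:/ so that another header which merely mentions etag can't match. Second, tr -d '\r' strips carriage returns before sed runs, because curl normally writes header lines with CRLF endings and the stray carriage return would otherwise end up on the end of the value. The file name headers-crlf.txt is made up for the demonstration.

```shell
# Build a small sample with CRLF line endings, as curl -D typically writes them.
printf 'HTTP/2 200\r\nserver: GitHub.com\r\netag: "5be1ba26-a9dd"\r\n' > headers-crlf.txt

# Strip carriage returns, keep only the etag line, then drop the name and quotes.
etag=$(tr -d '\r' < headers-crlf.txt |
  sed -e '/^etag:/!d' -e 's/^etag: //' -e 's/"//g')

echo "$etag"   # 5be1ba26-a9dd
```

Inside a Makefile recipe, each $ in the snippet would need to be doubled to $$ so that make passes it through to the shell.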

In the end, it feels like a bit of an anti-climax. However, it’s now much clearer to me where I’d try to make use of sed, and I feel I’ve learned enough to be dangerous!
