Using sed to extract HTTP headers
Today I needed to take an HTTP response and extract the etag header; the etag
was used as part of an MVCC implementation in a service I was using, and I
wanted to script an update to a resource. I was doing this in a Makefile, so I
wanted to avoid firing up a scripting language.
It turns out this is the domain of tools like sed. sed stands for stream
editor. It applies scripts to a text stream, editing the content of the stream
as it passes through. When you watch someone using sed, the scripts look
super-cryptic, but in fact they’re not too bad. Like a regular expression, they
benefit from reading left to right; when viewed as a whole they are just a
mess. In fact, half of a sed script is often a regular expression!
The sample headers
First, we’ll get the HTTP headers to work with. I found a curl option that was
new to me, -D <filename>, which writes the response headers to a file. So to
get the headers for dx13.co.uk:
curl -D headers.txt https://dx13.co.uk
There are quite a lot of headers that come with a call to dx13.co.uk, so I
trimmed most of them from the end to leave something a bit shorter to work
with, which doesn’t affect the sed commands at all. That left us with:
> cat headers.txt
HTTP/2 200
server: GitHub.com
content-type: text/html; charset=utf-8
last-modified: Tue, 06 Nov 2018 15:58:30 GMT
etag: "5be1ba26-a9dd"
access-control-allow-origin: *
expires: Fri, 22 Mar 2019 14:03:49 GMT
cache-control: max-age=600
x-github-request-id: 6F9E:2F59:86E637:B2E922:5C94E8ED
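To follow along without calling the site, the trimmed file can be recreated locally. This is only a convenience sketch: it writes the sample shown above into headers.txt in the current directory, using plain LF line endings.

```shell
# Recreate the trimmed sample headers so the sed commands can be run as-is.
# The quoted 'EOF' stops the shell from expanding anything inside the heredoc.
cat > headers.txt <<'EOF'
HTTP/2 200
server: GitHub.com
content-type: text/html; charset=utf-8
last-modified: Tue, 06 Nov 2018 15:58:30 GMT
etag: "5be1ba26-a9dd"
access-control-allow-origin: *
expires: Fri, 22 Mar 2019 14:03:49 GMT
cache-control: max-age=600
x-github-request-id: 6F9E:2F59:86E637:B2E922:5C94E8ED
EOF

# Quick check: count the lines that were written.
wc -l < headers.txt
```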
A sed primer
We’ll come to executing scripts in a minute. First, we’ll get familiar with what a script looks like. The basic form is:
[addr]X[options]
- addr selects a set of lines to operate on. It can be a single line, a line range or a regular expression.
  - A single line is just the line number, 12.
  - A regex is delimited using slashes, /regex/.
  - A range is comma-separated, 12,16.
  - Matching can be inverted using ! at the end of the address.
  - If there is no addr, the command is executed on all lines of the input.
  - See the documentation for addresses.
- X is a command (like d or s).
- options are options to the command. s has the option /foo/bar/.
So in:
- '14d': the address is line 14; the d command then removes the line; no options are used. This removes line 14 of the input.
- '/:/d': the address is the regex :; the d command then removes the matching lines; no options are used. This removes lines containing : from the input.
- 's/^.*: /foo! /': there is no address, so all lines are selected; the command is s; the option is the find/replace specification. We’ll see what this does later.
I found the s command familiar: it’s just like vim’s.
Using sed to get the etag
By default, sed treats its first argument as the script and its second as the
input file, and writes the result to stdout.
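When no input file is given, sed reads standard input instead, which is handy in a pipeline. A minimal sketch, using a made-up one-line input held in a variable called `line`:

```shell
# Hypothetical one-line input, used only for illustration.
line='etag: "5be1ba26-a9dd"'

# With no file argument, sed reads standard input and writes to standard output.
printf '%s\n' "$line" | sed 's/^etag: //'

# The output can be captured like any other command's; here into a variable.
value=$(printf '%s\n' "$line" | sed 's/^etag: //')
printf 'value is %s\n' "$value"
```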
Substitution
A simple script is a vim-like search and replace. Here, we replace the header
names with foo!:
> sed 's/^.*: /foo! /' headers.txt
HTTP/2 200
foo! GitHub.com
foo! text/html; charset=utf-8
foo! Tue, 06 Nov 2018 15:58:30 GMT
foo! "5be1ba26-a9dd"
foo! *
foo! Fri, 22 Mar 2019 14:03:49 GMT
foo! max-age=600
foo! 6F9E:2F59:86E637:B2E922:5C94E8ED
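As we head straight to the s command and don’t specify an address, the command
is executed on all lines of the file. The status line contains no colon
followed by a space, so the pattern doesn’t match and it passes through
unchanged.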
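One thing worth knowing about the pattern ^.*: is that .* is greedy, so it runs to the last colon-plus-space on the line. None of the sample headers has a second one, but a value that did would lose part of its content. The sketch below uses a made-up header (x-note is not a real header from the response) to show the difference between that pattern and one that stops at the first colon.

```shell
# Hypothetical header whose value itself contains ": ".
line='x-note: see also: this'

# Greedy .* runs to the LAST ": " on the line, eating part of the value.
printf '%s\n' "$line" | sed 's/^.*: /foo! /'
# prints: foo! this

# [^:]* cannot cross a colon, so only the header name is replaced.
printf '%s\n' "$line" | sed 's/^[^:]*: /foo! /'
# prints: foo! see also: this
```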
Chaining
By using the -e flag, multiple scripts can be chained. You can also use one
big script string with semi-colons, but I find multiple -e flags easier to
read.
Replace header names with foo! as above, then replace foo with bar:
> sed -e 's/^.*: /foo! /' -e 's/foo/bar/' headers.txt
HTTP/2 200
bar! GitHub.com
bar! text/html; charset=utf-8
bar! Tue, 06 Nov 2018 15:58:30 GMT
bar! "5be1ba26-a9dd"
bar! *
bar! Fri, 22 Mar 2019 14:03:49 GMT
bar! max-age=600
bar! 6F9E:2F59:86E637:B2E922:5C94E8ED
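For comparison, here is the same chain written both ways on a single header line (the `input` variable is just an illustrative stand-in for the file). Both forms should print the same result.

```shell
# One header line from the sample, held in a variable for illustration.
input='server: GitHub.com'

# Two scripts chained with -e flags.
printf '%s\n' "$input" | sed -e 's/^.*: /foo! /' -e 's/foo/bar/'

# The same two commands in one script string, separated by a semicolon.
printf '%s\n' "$input" | sed 's/^.*: /foo! /; s/foo/bar/'
```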
Removing lines
As mentioned in the primer, removing lines is done using a command within the
script, d. Adding ! to the end of the address inverts the match, so d then
removes the lines that don’t match.
Remove all the lines containing a colon:
> sed '/:/d' headers.txt
HTTP/2 200
Note that we use the address /:/, which is a regex that matches all lines
with a colon. The rest of the script executes on these lines.
Remove all the lines without a colon:
> sed '/:/!d' headers.txt
server: GitHub.com
content-type: text/html; charset=utf-8
last-modified: Tue, 06 Nov 2018 15:58:30 GMT
etag: "5be1ba26-a9dd"
access-control-allow-origin: *
expires: Fri, 22 Mar 2019 14:03:49 GMT
cache-control: max-age=600
x-github-request-id: 6F9E:2F59:86E637:B2E922:5C94E8ED
Here we use /:/! as the address; this causes the command to be executed
on the lines that don’t match the regex.
Getting the etag
Finally we’re ready!
Combining the above, we can retrieve the ETag header using a chain of three scripts:
> sed -e '/etag/!d' -e 's/^etag: //' -e 's/"//g' headers.txt
5be1ba26-a9dd
That is:
- Remove the lines not containing etag.
  - This passes just one line to the next script: etag: "5be1ba26-a9dd"
- Remove the header name from the remaining line.
  - This leaves: "5be1ba26-a9dd"
- Remove the quotes. The g in s/"//g means global; leaving it out means that sed would replace only the first instance of " that it found. Making the replacement global means that all instances on the line are replaced.
  - Giving us: 5be1ba26-a9dd
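One caveat when running this against a freshly downloaded header file rather than a hand-trimmed one: HTTP header lines end in CRLF, and a file written by curl -D will typically keep those carriage returns, so the extracted value can carry an invisible trailing \r. The sketch below is one way to handle that, not part of the original chain: it simulates such a file in a temporary location and strips the carriage returns with tr before the three scripts run. The `headers` and `etag` variable names are illustrative.

```shell
# Simulate a header file with CRLF line endings, as used on the wire.
headers=$(mktemp)
printf 'HTTP/2 200\r\netag: "5be1ba26-a9dd"\r\ncache-control: max-age=600\r\n' > "$headers"

# Strip carriage returns first, then apply the same three sed scripts.
etag=$(tr -d '\r' < "$headers" | sed -e '/etag/!d' -e 's/^etag: //' -e 's/"//g')

# The captured value is now safe to reuse later in a script.
printf '%s\n' "$etag"

rm -f "$headers"
```

Using tr for the carriage returns keeps the sed scripts unchanged and avoids relying on escape sequences that not every sed implementation accepts.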
In the end, it feels like a bit of an anti-climax. However, it’s now much
clearer to me where I’d try to make use of sed, and I feel I’ve learned
enough to be dangerous!
References:
- The sed manual.