Build your own Database Index: part 1

I’ve been quiet in November because I’ve been working on a small toy project. Now that it’s become interesting, I want to write about it, in part to prove to myself that I actually understand what I’ve built by showing I can explain it in words. I’ve been working through creating a simple database-like index, to understand the concepts involved more concretely.

(Where’s the code, you ask. It… requires some clean-up first. I’ll get there sometime soon!)

I’ve been reading Database Internals as part of a book club for the last few weeks. One of the noticeable things in the first few chapters is that nearly everything is based on a storage layer with a relatively simple interface, that of a key-value store where both key and value are byte arrays.

The stores are ordered by key and are used for both storing primary data (eg, key = primary key; value = relational row data) and for indexes (key = indexed value; value = referenced row’s primary key). Inspired by my reading, I decided to have a go at writing a simple index for JSON data.

I wanted to concentrate on the index part, so I used an existing key-value store. Any would do, but I chose PebbleDB, the underlying store used by CockroachDB, because it’s written in Go. As I was going to write my code in Go, using a Go storage layer meant that I could more easily peek at how the storage layer worked if I needed to.

The plan

I wanted to create a generic JSON index. Generic, meaning that adding an arbitrary JSON document should index each field automatically, without having a pre-existing schema. This means that given a document like:

{ "a": { "b": 12, "c": "foo" } }

We should be able to find it using queries like:

"a.b" == 12
"a.c" == "foo"
"a.b" < 400
"a.c" > "e"

And we should be able to use the same index, queried in slightly different ways, to answer these queries.

Steps

To get there we need to be able to do a few things:

  1. Encode primitive values. We’ll cover that in this post. JSON has several primitive types we should be able to index: null, false, true, numbers and strings. We will need to be able to order these within the index to support searches like > 400.
  2. Add field names to our index. We need to be able to search on field names, like a or a.b. This means they need encoding in our index somehow.
  3. Index JSON documents. Next we need to be able to take a JSON document and output a list of the key-value pairs to insert into the index, and then insert them.
  4. Update and delete documents from the index. We need to be able to update documents we’ve previously inserted into the index. I noticed that not many “create your own index” tutorials on the web cover updating, so I’ll try to do a decent job of that as it adds a chunk of complexity to an implementation.
  5. Find documents in the index. And finally, of course, we need to support queries over the index. To start with, we will support AND queries, eg "a.c" = "foo" AND "a.b" > 10.
  6. Extract a query planner and do an optimisation pass. I ended up kinda skipping the previous step and going straight to this one. It all got slightly (only slightly) fancy.

We will not support:

  • JSON field names with dots in. This is just to allow us to have the intuitive dotted notation for descending into the sub-objects within a JSON document, a.b as shown earlier, without worrying about escaping dots in field names.
  • Arrays in the JSON. We’ll ignore those fields. There are ways to cover this, and perhaps we’ll come back to this later.
  • OR queries like "name": "foo" OR "name": "bar".

Encoding primitive values as orderable byte arrays

In this first part, we’ll just look at how to encode primitive types. We have a key-value store that supports storing keys in order, where the keys are byte arrays.

Therefore, we need to encode each primitive value as a byte array and store those as keys in our index. The key-value store will then order these for us, allowing fast equality and range queries over the index.

Our encodings are representations of the underlying primitive values. To make our index work, these representations must order in the same way as the primitive values. That is, the byte array for 45.3 should order before the byte array for 400 in the key-value store. This turned out to be more complicated than I thought, particularly when it came to numbers.

Strings

First let’s consider strings. Thankfully, we can do something very simple here:

func encodeStringKey(s string) []byte {
   return []byte(s)
}

// encodeStringKey("foo") = []byte{0x66, 0x6f, 0x6f}

Per the Go conversion specification, []byte(string) converts the string into its UTF-8 byte encoding. This probably won’t order multibyte Unicode characters 100% correctly, but for now we can leave it like that; it will correctly let us have e sorted before foo. In a production index, we’d have to define this more tightly, probably using something like ICU sort keys.

Numbers

Numbers are more complicated. First we need to support both integers and floating point numbers (ie, 1 and 1.1). We also want to be able to compare all the different types of numbers against each other in a “natural” way. So, for example, a floating point query such as "a.b" > 12.34 should find the document {"a": {"b": 13}} which has an integer in the field.

My first go at this just supported unsigned 32-bit integers. We can create byte arrays that sort correctly from these relatively easily: by encoding them as big-endian byte strings:

35   => []byte{0x00, 0x00, 0x00, 0x23}
1234 => []byte{0x00, 0x00, 0x04, 0xd2}
1235 => []byte{0x00, 0x00, 0x04, 0xd3}

This orders them just fine for unsigned integers — trivially extended to 8 bytes for 64-bit unsigned integers — but it doesn’t work for other number types, even signed integers.

Instead, I figured that to do this we need to do two things:

  1. All numbers will need representing in the same way, so they all compare against each other.
  2. We need to ensure that the encoding we use for that representation actually does order the numbers “naturally”.

To represent numbers in the same way, we will cast every type of number to Go’s 64-bit floating point type:

var f float64
switch t := value.(type) {
   case float64:
      f = t
   case float32:
      f = float64(t)
   case uint:
      f = float64(t)
   // ... cases covering all of Go's other "number" primitives ...
}

Now that we have every numerical type represented as a float64, we need to encode that float64 as a byte array in a manner that orders the byte arrays the same way the original floats were ordered.

This turns out to be not completely obvious, so, given my low-rate bit-twiddling skills, I turned to StackOverflow. It turns out there’s a standard way to do this, which I flagrantly copied:

func encodeFloatKey(value float64) []byte {
    // This StackOverflow answer shows how to
    // encode a float64 into a byte array that
    // has the same sort order as the floats.
    // https://stackoverflow.com/a/54557561
    var buf [8]byte
    bits := math.Float64bits(value)
    if value >= 0 {
        bits ^= 0x8000000000000000
    } else {
        bits ^= 0xffffffffffffffff
    }
    binary.BigEndian.PutUint64(buf[:], bits)
    return buf[:]
}

// encodeFloatKey(12) = []byte{0xc0, 0x28, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}

If this was for production, I might dig further to understand what was going on here. Instead, to be confident in my copy-pasted answer, I wrote a fuzz test that checked the ordering of two numbers’ encodings and ran a million iterations. Frankly, perhaps that’s enough even for production 😬

This always uses eight bytes, which isn’t terribly space efficient. I didn’t look into whether there are other encodings which could do better. Databases with a schema can do much better, as they can be told that a given column only needs the range of a byte or two. But here we have to assume any number could be stored, and so use the format that’s most general, a float64. Strictly speaking JSON can store larger numbers, but in practice this should work okay.

We will come to null, false and true in a minute, but first we’ll turn to the question of how to impose an ordering between data types.

Ordering by data type

When we talk about ordering by data type, we mean that we want, for example, every string value to sort after every number value. If we look at the way we’re storing strings and numbers above, strings and numbers will interleave when we put them into the key-value store, because we will have byte arrays for strings whose first entry is larger than some float64 binary encoded byte array’s first entry.

That is, we could have these keys, which sort a number in between two strings:

encodeStringKey("?A") = []byte{0x3f, 0x41}
encodeFloatKey(-1) =    []byte{0x40, 0x0f, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff}
encodeStringKey("AA") = []byte{0x41, 0x41}

(When ordering byte arrays, we compare each byte left-to-right as with text).

We don’t want that, so we need some way to prevent it. It turns out this is quite simple: we prefix every value with a “tag” for its type. Each tag is a single byte, and the tags are defined so that our chosen total ordering over types is enforced.

const (
	// printable character makes easier debugging
	JSONTagNull   = 0x28 // char: (
	JSONTagFalse  = 0x29 // char: )
	JSONTagTrue   = 0x2a // char: *
	JSONTagNumber = 0x2b // char: +
	JSONTagString = 0x2c // char: ,
)

These tags create the ordering: null, false, true, number, string. There’s nothing special about this ordering, others would be perfectly valid. The important thing is that there is an ordering — any ordering — defined.

(There’s also nothing special about the specific values, 0x28 and so on. They can’t be the same, and they need to increase in value, but the specific values are arbitrary.)

To create the key, we prefix the encoding of the value with the tag:

numKey := append([]byte{JSONTagNumber}, encodeFloatKey(-1)...)
strKey := append([]byte{JSONTagString}, encodeStringKey("?A")...)

In this way, whatever value the encoded float or string is, the tag prefix ensures they order as we want them to in the key-value store. If we look at our above example where a number was interleaved with two strings, by adding the tag prefix, we instead have a correct ordering:

-1   = []byte{0x2b, 0x40, 0x0f, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff}
"?A" = []byte{0x2c, 0x3f, 0x41}
"AA" = []byte{0x2c, 0x41, 0x41}

The first byte, JSONTagNumber or JSONTagString, has forced an ordering between the types.

False, True, Null

Initially, I encoded false as 0 and true as 1. I wasn’t quite sure what to use for null.

After I’d defined the tags, however, it became clear: I could use only the tag byte to define null, false and true. The tag alone defines both the type and the value: JSONTagNull, JSONTagFalse, JSONTagTrue. There’s no need for any more information in the key.

(I could have created a JSONTagBool rather than the true and false variants. Then I’d have needed an extra byte (eg, 0, 1) to store the false or true value. But it turns out that by representing true and false directly in the tag, we can avoid this and save a byte).

This leaves us with a fully ordered byte array encoding for the primitive JSON types. null, false and true are a single byte, their tag; numbers and strings are their tag followed by their encoded value:

nullKey  := []byte{JSONTagNull}
falseKey := []byte{JSONTagFalse}
trueKey  := []byte{JSONTagTrue}
numKey   := append([]byte{JSONTagNumber}, encodeFloatKey(12.5)...)
strKey   := append([]byte{JSONTagString}, encodeStringKey("foo")...)

At this point, we can encode all the primitive JSON types into byte arrays that will be ordered correctly within the key-value store when used as keys. In addition to each value being ordered correctly within its type (eg, numbers all order correctly with respect to other numbers), we also have a total ordering over types (eg, all string values sort after all number values, which sort after true etc.). This is necessary to define for any generic JSON index, which allows different data types for the same field.

Right now, we could build an index that could quickly find keys equal to "foo" or 42, and do range queries like between 42 and 54. This might work if we were creating an index against a single field, like a standard database index, so the field name is encoded as part of the index definition rather than in each row. But for our experiment, we want to be able to index all the values in a JSON document in the same index, and yet still query on individual fields.

So, next time, we’ll look at how to add the field name to the key alongside the value. This will allow us to issue queries like "name": "mike". See you then.
