Build your own Database Index: part 3

This is part 3 in a short series about writing a simple index for JSON documents. Previously we talked about things in the abstract: how we’ll encode field names and values into our key-value store so they are easy to look up. This time we make things a bit more real.

By the end of this post, we’ll have worked up to extracting the fields from our JSON documents, then inserting those fields into our PebbleDB key-value store. And we’ll look at querying for particular values with an equality query. Finally, the bare bones of a useful thing!

The series outline:

  1. Encode primitive values.
  2. Add field names to our index.
  3. Index JSON documents. (That’s this part)
  4. Update and delete documents from the index
  5. Find documents in the index.

Recap on storing field paths and values

This is a recap of part 1 and part 2, so it might be confusing if you’ve not read those. If you haven’t, take the time to read through them so that this post will make sense!

We store paths using a dotted notation which we encode directly to bytes, []byte("a.b"). Go makes this conversion from string to []byte straightforward.

Values are stored as tagged values. Each JSON data type gets a tag, which is a single byte value that identifies it as null, false, true, number or string. We used this originally to ensure a total ordering on values by prefixing the value with the tag in the key. Later we will see that we need that tag to unpack the key we retrieve from the index more intelligently, but let’s not get ahead of ourselves.

Finally, we pack the field and value byte slices together, separated by a null separator.

A key might look like:

[ a . f o o d 00 2c a p p l e ]
  ----------- -- -- ---------
    field     |  |   value
              |  ` JSONTagString
              ` null separator

(Field and value bytes rendered as characters for easier reading. We’ll do this throughout this post. But remember they are just bytes!)

Storing doc IDs in the index

Reading around the subject, I found two basic approaches for storing the doc IDs in an index:

  1. A postings list, which is where the value in our key-value pair stores a list of doc IDs where the (path, value) pair is found.

    [ a . f o o d 00 2c a p p l e ] = "doc1,doc4,doc17"
    [ a . f o o d 00 2c o r a n g e ] = "doc3"
    

    Lucene uses a postings list approach. We can choose to order the postings list at write time if we want, which can improve querying performance.

  2. Store the doc ID in the key, meaning that we have a key for each (path, value, doc id) tuple.

    [ a . f o o d 00 2c a p p l e 00 d o c 1 ]
    [ a . f o o d 00 2c a p p l e 00 d o c 1 7 ]
    [ a . f o o d 00 2c a p p l e 00 d o c 4 ]
    [ a . f o o d 00 2c o r a n g e 00 d o c 3 ]
    

    Here the doc IDs are always ordered within a given field/value pair, because the key-value store orders the keys.

    I found this approach in CockroachDB’s notes on inverted indexes (scroll to Encoding the Index).

I originally forked my code from eatonphil/docdb: Basic document db from scratch in Go. This used the first (postings list) approach. So I decided to have a go at the second approach, to see how it was done.

Indexing

Before we get to the nuts and bolts of encoding the fields, values and document IDs, let’s outline the couple of things an indexing function needs to do. We will create an indexing method with this signature:

func index(id string, data map[string]any)

That is, we will associate a document identified by id with a set of index entries derived from data. We need two things to do that:

  1. We’ll quickly cover a function to extract the paths and values from the data.
  2. Then we’ll get to the meat of encoding the key when indexing, and after that, decoding the key when querying.

Extracting the fields and values

The first task our index function needs to do is to extract all the pairs of field path and values. Here’s an example:

{ "a": { "b": "foo", "c": { "d": "bar", "e": "baz" } } }

This results in the following path, value pairs:

"a.b"     => "foo"
"a.b.c.d" => "bar"
"a.b.c.e" => "baz"

This is a relatively common thing to do, and I was able to use the version in the original repository that I forked, with a couple of changes: first, a fix to allow more than two levels of JSON nesting to be correctly extracted; and second, altering the return type to byte slices rather than strings, given we need byte slices for our index.

The function is called by the index function, which then iterates over the []pathValue returned.

type pathValue struct {
	path, taggedValue []byte
}

// getPathValues returns all path value keys for obj, using prefix as
// key prefix for the path part of the key.
func getPathValues(obj map[string]any, prefix string) []pathValue {
	var pvs []pathValue
	for key, val := range obj {

		// The prefix extends as we recurse through nested objects
		if prefix != "" {
			key = prefix + "." + key
		}

		// Handle nested objects by recursing down into them.
		// Handle arrays by ... skipping them because they are hard.
		switch t := val.(type) {
		case map[string]any:
			pvs = append(pvs, getPathValues(t, key)...)
			continue
		case []any:
			// Can't handle arrays
			continue
		}

		// For all other types, we can encode the value and add a pathValue
		// to our accumulator.
		pvk := pathValue{[]byte(key), encodeTaggedValue(val)}
		pvs = append(pvs, pvk)
	}

	return pvs
}

The encodeTaggedValue function returns a byte slice for whatever JSON value we pass it, adding the tag prefix byte for the type.

You’ll have spotted the gap around arrays. Sorry arrays, no indexing for you. Some day, I might figure out what to do with them, but not today.

Encoding the key

When querying, we still need to be able to efficiently prefix match by the field path and the value. Therefore, when we create the key during indexing, the document ID needs to be at the end of the key.

In part 2, we wrote the code to create a key with field path and value:

[ a . f o o d 00 2c a p p l e ]

We used a packTuple function to create this, which takes a slice of byte arrays and joins them with 00 separators.

As shown in the examples above, we can use the same 00 separator approach to append the document ID to our key:

[ a . f o o d 00 2c a p p l e 00 d o c 1 ]

In code, this is:

var sep []byte = []byte{0}

func packTuple(components ...[]byte) []byte {
	return bytes.Join(components, sep)
}

// later ...
const JSONTagString byte = 0x2c

fieldPath := []byte("a.food")
value := append([]byte{JSONTagString}, []byte("apple")...)
docID := []byte("doc1") // <=== new
key := packTuple(fieldPath, value, docID)

// key = [ a . f o o d 00 2c a p p l e 00 d o c 1 ]

Putting indexing together

And we’re done for indexing time: we can insert this key into our key-value store just as it pops out of packTuple. Our complete index function becomes:

// index adds document to the index, associated with id.
func index(indexDB *pebble.DB, id string, document map[string]any) {
	docID := []byte(id)
	b := indexDB.NewBatch() // Add indexed items atomically

	// Now, index the new values for the document
	pvs := getPathValues(document, "")

	for _, pathValue := range pvs {
		invIdxKey := packTuple(
			pathValue.path, pathValue.taggedValue, docID)

		if err := b.Set(invIdxKey, nil, pebble.Sync); err != nil {
			log.Printf("Could not update inverted index: %s", err)
		}
	}

	if err := indexDB.Apply(b, pebble.Sync); err != nil {
		log.Printf("Could not index %q: %v", id, err)
		return
	}
}

PebbleDB’s batch feature ensures that all the entries for the newly indexed document are added to the key-value store atomically. If we later allow for concurrent access, this will mean that readers don’t see a half-added document.

Querying

The more complicated side of this “doc IDs in the keys” method of storing our document IDs occurs during querying, when we need to extract the document IDs from our keys. Before we get to that, we’ll talk about how we do queries on the index in the abstract, then we’ll see the issue and work through the code to solve it.

A reminder of the format of our keys:

[ a . f o o d 00 2c o r a n g e 00 d o c 3 ]
  ----------- -- -- ----------- -- -------
    field     |  |   value       |  docId
              |  |               |
              |  ` JSONTagString |
              ` null separator   ` null separator

Say we have the four entry index from way up above where we were talking about possible ways to encode document IDs into the indexed data:

[ a . f o o d 00 2c a p p l e 00 d o c 1 ]
[ a . f o o d 00 2c a p p l e 00 d o c 1 7 ]
[ a . f o o d 00 2c a p p l e 00 d o c 4 ]
[ a . f o o d 00 2c o r a n g e 00 d o c 3 ]

To execute the query "a.food" == "apple" we need to do a range scan to find all the keys in the key-value store that start with the prefix:

[ a . f o o d 00 2c a p p l e ]

For this, we set an upper and lower bound at the minimum and maximum keys for this prefix. As we have the 00 separator between the path/value prefix and the document ID, we know that we can set that second separator byte position to 01 for the upper bound of this range:

lower bound: [ a . f o o d 00 2c a p p l e 00 ]
upper bound: [ a . f o o d 00 2c a p p l e 01 ]

This will return the keys:

[ a . f o o d 00 2c a p p l e 00 d o c 1 ]
[ a . f o o d 00 2c a p p l e 00 d o c 1 7 ]
[ a . f o o d 00 2c a p p l e 00 d o c 4 ]

These keys contain all the details we need to respond to the caller of our as-yet-unwritten query method — we don’t need the value from the key-value pairs in our store at all!

But how do we get the document IDs out of the keys? This is where we need to be a bit careful.

At first blush, we might imagine we can use unpackTuple, that is, just split on 00 bytes:

var sep []byte = []byte{0}
// unpackTuple unpacks packed into its components.
func unpackTuple(packed []byte) [][]byte {
	return bytes.Split(packed, sep)
}

And, indeed, this will work fine for null, true, false and strings, because, for these types, the encoded tagged value can never contain any 00 bytes. Calling unpackTuple on one of our rows above gives us unpacked = ["a.food", "apple", "doc1"] (with the tag byte still attached to the value), and we know unpacked[2] will always give us the doc ID.
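Here’s that happy path as a runnable check. Note that the 0x2c string tag rides along with the value component, and happens to be ASCII ',':

```go
package main

import (
	"bytes"
	"fmt"
)

var sep = []byte{0}

// unpackTuple splits packed into its 00-separated components.
func unpackTuple(packed []byte) [][]byte {
	return bytes.Split(packed, sep)
}

func main() {
	// A string-valued key: "a.food" 00 2c "apple" 00 "doc1".
	key := []byte("a.food\x00\x2capple\x00doc1")
	for i, part := range unpackTuple(key) {
		fmt.Printf("%d: %q\n", i, part)
	}
	// 0: "a.food"
	// 1: ",apple"   (the tag byte is still attached)
	// 2: "doc1"
}
```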

Numbers, however, are different. To understand why, here’s the encoding of {"a": 12} in our index, and what happens if we call unpackTuple on it:

// encode:
// path  = "a"
// value = 12
// id = doc1
[]byte{
	0x61,                                     // a
	0x0,                                      // null separator
	0x2b,                                     // JSONTagNumber
	0xc0, 0x28, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, // float64 12
	0x0,                                      // null separator
	0x64, 0x6f, 0x63, 0x31,                   // doc1
}

unpackTuple(key12) = [][]byte{
	[]byte{0x61},                   // a
	[]byte{0x2b, 0xc0, 0x28},       // tag + first two bytes of our number
	[]byte{},                       // oops!
	[]byte{},                       //     lots
	[]byte{},                       //        and
	[]byte{},                       //          lots
	[]byte{},                       //             of
	[]byte{},                       //               empty slices!
	[]byte{0x64, 0x6f, 0x63, 0x31}, // doc1
}

That is, we get a ton of empty slices from “between” each of those 00 bytes in the number’s encoded representation.

Now, we could say, well, the doc ID is the last item, screw the fact we’ve got a weird blob of stuff in the middle. But, in the end, that’s probably going to turn around and bite us sometime with some weird bug. And, let’s be honest, it basically feels like a messy hack, and we’re not those people — at least not right now 😬 — are we?

So instead, let’s figure out how to reliably decode this new key format into its components — the path, tagged value and document ID.

Recalling the fact that during query we only need to extract the document ID, one useful simplification to our decoding code is that we just need to be able to accurately skip over the path and the encoded value. We know that the next component will be the document ID. (If we later pack in another component, we can null-separate the doc ID and this new component). In case we need them for debugging, we’ll extract the path and values as we go, rather than silently dropping them.

Therefore, in our decoding method we’ll:

  1. Extract the path by pulling everything to the first 00 we find.
  2. Read the tag from the next part of the key, then use the tag to figure out how to find the end of the value.
  3. Pull the rest of the key into the document ID and add that doc ID to the result.

See part 1 for how we encoded each value. Knowing this, we know that once we get to our value in the key, we need to do the following to get past the value to the doc ID:

  • For null, true and false, we know that the value is only a single byte — the tag byte — so we can hard code skipping two bytes: the tag and the 00 separator.
  • For numbers, we converted them all to float64 values, so we know their length is 9 bytes — the tag byte plus the 8 bytes of the encoded float64. So we skip ten bytes in total: the tag, the 8-byte number encoding and the separator.
  • Strings require us to walk forward in the key until we find a 00 byte.

In the following code, we return each component of the key to the caller as a []byte, and the caller has to decode the components if it needs to. The main subtlety in the code is that, as the code finds each component of the key k, it advances k to the start of the next component. bytes.Index() is used to find the next instance of a []byte in another []byte.

As can be seen, the hard part is extracting the path and the value. Once we’ve parsed those out of the key, getting the doc ID is just one line of code to collect the remaining bytes of the key into the DocID field of our returned struct.

// Return type for decodeInvIdxKey
type InvIndexKey struct {
	Path, TaggedValue, DocID []byte
}

// decodeInvIndexKey unpacks the key into its components.
// NB: it returns the value with its tag still attached.
func decodeInvIndexKey(k []byte) (InvIndexKey, error) {
	// [ path 00 tag value 00 docid ]
	iik := InvIndexKey{}
	sep := []byte{0}

	// path
	m := bytes.Index(k, sep)
	if m < 0 {
		return iik, errors.New("No path found in inverted index key")
	}
	iik.Path = k[:m:m]
	k = k[m+len(sep):] // advance k past the path and sep

	// value
	if len(k) == 0 {
		return iik, errors.New("No value found in inverted index key")
	}
	switch k[0] {
	// These first three are just the one-byte tag
	case JSONTagNull, JSONTagTrue, JSONTagFalse:
		iik.TaggedValue = []byte{k[0]}
		k = k[1+len(sep):] // advance k past the tag and sep
	case JSONTagNumber:
		n := 9 // tag + 8 byte float encoding
		iik.TaggedValue = k[0:n]
		k = k[n+len(sep):] // advance k past the tagged float value and sep
	case JSONTagString:
		m = bytes.Index(k, sep)
		if m < 0 {
			return iik, errors.New(
				"String type with no value found in inverted index key")
		}
		iik.TaggedValue = k[:m:m]
		k = k[m+len(sep):] // advance k past the string value and sep
	default:
		return iik, errors.New(
			"Unrecognised type tag for inverted index key")
	}

	// docID is the remainder of the key, which is now k
	iik.DocID = k[:]

	return iik, nil
}

We will have to do this for every key returned by our index scans when servicing a query. Once we’ve done that, we can return the document IDs to our user.

Once I’d written this, I remembered that FoundationDB has a tuple layer that does something similar. Having just worked through the problem myself made it really easy to understand how the tuple layer code worked:

  • It follows the exact same “tagged value” concept I landed on — neat!
  • The decoding of the tuple is generic — it can’t assume any value in the tuple is a particular type, like our code can for the path and document ID components — but it follows the same “figure out the type from the tag, then decode based on that type” approach as the code above.

The people who wrote FoundationDB definitely know what they are doing, so as my approach was similar, I decided I’d probably done a decent job when making up how to encode and decode index keys. After reading the FoundationDB code, I considered reworking my code to be more general. But I decided that, as I knew the format of the keys, it wasn’t the right choice.

Example equality query execution

For completeness, let’s write out all that we’ve done into the code that will perform an equality query on the index.

Recall (yet again!) our tiny example of a set of four indexed documents, containing either {"a": {"food": "apple"}} or {"a": {"food": "orange"}}:

[ a . f o o d 00 2c a p p l e 00 d o c 1 ]
[ a . f o o d 00 2c a p p l e 00 d o c 1 7 ]
[ a . f o o d 00 2c a p p l e 00 d o c 4 ]
[ a . f o o d 00 2c o r a n g e 00 d o c 3 ]

Here’s an example of an equality search: "a.food" = "apple". To find the documents matching this predicate, first we generate the lower and upper bounds from the query as described above. Then we pass those into an iteration over the keys in PebbleDB. As we iterate the keys, we decode them and add the IDs to a slice to return to the caller.

// We're calling this with:
// indexDB - a PebbleDB database (key-value store) containing our index.
// path    - "a.food"
// value   - "apple"
func lookupEq(
	indexDb *pebble.DB, 
	path string, 
	value interface{},
) ([]string, error) {
	ids := []string{}

	// First we have to encode our path and value into a start and end key
	// (lower and upper bound).

	// This returns our lower bound, eg: [ a . f o o d 00 2c a p p l e 00 ]
	startKey := pathValueStartKey(path, value)
	// This returns our upper bound, eg: [ a . f o o d 00 2c a p p l e 01 ]
	endKey := pathValueEndKey(path, value)
	readOptions := &pebble.IterOptions{LowerBound: startKey, UpperBound: endKey}

	// Second, get an iterator from the database, and iterate our
	// lower to upper bound. The upper bound is exclusive, so
	// this lower/upper bound pair will return exactly the items
	// that match our path/value pair: the three "apple" documents.
	iter := indexDb.NewIter(readOptions)
	for iter.SeekGE(startKey); iter.Valid(); iter.Next() {
		id, err := decodeInvIndexKey(iter.Key())
		if err != nil {
			continue // skip bad keys for now
		}
		ids = append(ids, string(id.DocID))
	}
	return ids, iter.Close() // return doc1, doc17, doc4
}

Now what?

Ace! We can execute simple queries! But what else do we need for something that does everything we might expect of an index?

  1. It’d be nice if we could update and delete documents we previously added to the index. This turns out to be trickier than I expected.
  2. We can round out our set of operators (equals, greater than, greater than or equal to, less than, less than or equal to) and see how we might combine more than one operator in a single query (what if we want documents describing people with favourite food “apple” and favourite pet “cat”?)

I’m hoping to get those last two articles written in the next few weeks; I just need to work the code into shape and make sure I’m happy with what’s going on in it.

When I take a step back, I realise that, frankly, database indexing has about a bazillion papers dedicated to the topic, so we’re just scratching the surface in these posts. This code isn’t very performant nor is it very robust. It’s nowhere close to a real system.

But even so, writing this much code has allowed me to grasp the problem space a bunch better than I did. Both papers and other people’s code are starting to make a lot more sense than they would’ve before I started playing with this. For example, three months ago, that tuple layer code I mentioned would’ve taken me a while to grok, but, this time reading it, I recognized the pattern quickly. For me, this is a big win and makes the time totally worthwhile — even if it is really quite a shallow understanding 🌟.
