Build your own Database Index: part 4

In this fourth part of a series about writing a simple JSON document index, we’ll talk about the ways we can update and delete documents in the index.

Most — all? — tutorials I could find online about making simple indexes only covered adding documents to the index and basic querying. It turns out that updating the index is quite an interesting problem, with various approaches depending on the data format used to represent the index and the strategy used to keep it up to date. So I think it’s useful to expand our series to talk more about this.

As this is a spare-time learning exercise for me, we’ll end up with a pretty simple system. It wouldn’t be impractical for real world use, however, given a bit more thought put into efficiency. And error handling, of course 🙃. But even this simple system shows a few tradeoffs we have to make in real world systems, in particular using more disk space to make updates faster.

Time to dig in.

Series outline

There are several parts to this series. Each builds on the previous, so it’s pretty vital to read each of them before getting started with this one.

  1. Encode primitive values.
  2. Add field names to our index.
  3. Index JSON documents (and basic queries)
  4. Update and delete documents from the index. (That’s this part)
  5. Find documents, extract a query planner and do an optimisation pass.

Working out the shape of the answer

We’re working with an existing key-value store, PebbleDB. The API that PebbleDB provides shapes the space of options that we can use. To a first approximation, the API is getting and setting individual key/value pairs, and querying over ranges. The question we need to answer is, when we update a document, how do we get rid of the old index entries for that document before adding the new entries?

As with the whole of this series, I wanted to work through this problem myself rather than implement someone else’s solution. I find this helps me bump into the rough edges of a problem, and so understand it better. Given this is my first rodeo in this area, I looked at two of the simplest schemes (at least, in the sense of “two of the first ways I thought to do this that were not immediately impractical”).

As a reminder, the core of this problem is finding the old index entries to remove. This problem differs based on whether you update the index within the same transaction you are using to update the document, or whether the index is eventually consistent and is updated at some point after the document write.

  1. If you are updating the index transactionally, then you have both the old and new versions of the document to hand when you update the index. You can process an update by passing both versions of the document to the index function, and:

    • Generating the old index rows from the old version of the document;
    • Deleting those rows from the index;
    • Generating the new index rows from the new version of the document;
    • Inserting the new index rows into the index.
  2. If you are not updating the index transactionally with primary data, the problem changes because several document updates might happen between the times the index is updated. Therefore, you can’t guarantee that you have the version of the document that was previously indexed to hand.

    The best approach I came up with here is pretty naive. It is to create a forward index, that is, one that maps document IDs to index rows. You can use this index to look up the last set of index entries you created for the document, ie, for the old version of the document.

    In this approach, when updating a document, rather than generating the index entries to remove by processing the old version of the document, you instead derive that list of entries from the forward index. Of course, when you are inserting any new rows into the inverted index, you have to mirror them into the forward index to be used during the next update.

    CouchDB views work like this, I believe.

    This means data duplication and significant write amplification. At the level of a KV store, however, where one can only update individual keys and values, most other approaches I came up with that initially seemed more efficient ended up being even worse (a little more thought mostly led to the realisation that, while they were a little more space efficient, they had catastrophic performance characteristics, if they even worked at all).

I decided to look at (2) as I liked the idea of creating the second (forward) index and understanding what I needed to do to keep it in-sync with the main (inverted) index.

Both approaches allow optimisations. Later in the article we’ll take a quick look at one optimisation: calculating the delta between the old and new index rows to decrease the number of writes to the underlying data store.

The plan

The plan, then, is to maintain a forward index alongside our inverted index. Let’s first think about what the keys in the forward index should look like.

The job of the forward index key is to allow us to find the inverted index keys that we last inserted for a document. Therefore we need to be able to look up all inverted index keys for a given document. The simplest way to do this is if we can issue a single range scan that will give us all the inverted index keys for that document. What this tells us is that the document ID needs to be part of the key prefix, so we can range scan on it.

First, let’s recall our inverted index keys, those that we want to be able to find. We use an approach where each (field, value, document) tuple has a key in the key-value store. We don’t use the value for anything, so I don’t show it below.

# structure: <field path> 00 <tag> <value> 00 <docID>
[ n a m e 00 2c t i d d l e s 00 d o c 1 ]
[ o w n e r . c i t y 00 2c b r i s t o l 00 d o c 1 ]
[ o w n e r . n a m e 00 2c m i k e 00 d o c 1 ]
[ s p e c i e s 00 2c c a t 00 d o c 1 ]

To create the forward index keys, I quickly realised that all we need to do is flip this around to put the document ID at the front, so we can range scan on it. This felt too simple, but it really is pretty much all you need, so far as I can tell 🤷:

# structure: <docID> 00 <field path> 00 <tag> <value>
[ d o c 1 00 n a m e 00 2c t i d d l e s ]
[ d o c 1 00 o w n e r . c i t y 00 2c b r i s t o l ]
[ d o c 1 00 o w n e r . n a m e 00 2c m i k e ]
[ d o c 1 00 s p e c i e s 00 2c c a t ]

This isn’t the only way to do it. An obvious alternative is to use the document ID alone as the key and pack all the inverted index keys into the forward index’s value. But I’d already decided to use a multiple-key approach for the inverted index, so I stuck with it.

Putting them in the same store

One thing that requires a small tweak to both our existing inverted index keys and our proposed forward index keys is that we want to put both of these index structures into the same PebbleDB store. This allows us to atomically update both the inverted and forward indexes together, which is needed for correctness.

The problem this leaves us with is how we differentiate forward index entries from inverted index entries, given that the keys are just byte arrays in the end. The way to do this is to use a “tag” again — I gave inverted index entries an i prefix and forward index entries an f prefix.

var invIdxNamespace byte = 'i'
var fwdIdxNamespace byte = 'f'

This is just put as the first byte in the key, so for a forward index key:

[ f 00 d o c 1 00 n a m e 00 2c t i d d l e s ]

The 00 separator isn’t strictly needed, but it allowed me to reuse the packTuple method to encode the keys (which just packs components into a byte array with 00 separators):

type fwdIdxKey struct {
	id          []byte
	path        []byte
	taggedValue []byte
}

// encodeFwdIdxKey serialises k to a byte slice.
func encodeFwdIdxKey(k fwdIdxKey) []byte {
	return packTuple([]byte{fwdIdxNamespace}, k.id, k.path, k.taggedValue)
}
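
As a reminder, packTuple comes from earlier parts of the series. A minimal sketch of it, which just joins its components with 00 separators and glosses over components that might themselves contain 00 bytes, might look something like this (the real version may differ in detail):

// packTuple joins its components into a single key, with a 00 byte
// between each component. This is a minimal sketch only.
func packTuple(components ...[]byte) []byte {
	var out []byte
	for i, c := range components {
		if i > 0 {
			out = append(out, 0x00)
		}
		out = append(out, c...)
	}
	return out
}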

Indexing

Our indexing steps are:

  1. Get the old inverted index entries via the forward index.
  2. Remove those index entries from the inverted index.
  3. Remove the document’s entries in the forward index; we don’t need them any more.
  4. Add the inverted index entries for the new document.
  5. Add the forward index entries for the new document.

Threading note: I’ve not discussed it so far, but this code only works in a single-writer-thread environment. Even though we apply the changes to the data store atomically, we might:

  1. Execute the index writes from index(A), then execute index writes for index(B) (our database would now show just write B).
  2. Then return from index(B) before index(A).

In effect, this tells the application that A was the write that won, even though any subsequent read will see the state after B’s write.

While I think writers shouldn’t execute in parallel, readers can definitely proceed concurrently to each other. I am not sure whether they can safely proceed in parallel to a writer, however. In particular, a multiple-predicate query might execute multiple range reads while it’s being processed. Even assuming atomic writes of the whole index update, I’d need to better understand PebbleDB’s snapshots — both for iterators and explicit snapshots — to understand whether the different range reads could see different views of the database if it’s being modified concurrently.
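
If we did want to enforce that single-writer constraint in code, one simple option (a hypothetical sketch, not something the code below does) would be to serialise calls to the index function behind a mutex:

// writeMu serialises writers; readers don't take it and can still
// proceed concurrently with each other.
var writeMu sync.Mutex

// indexOneAtATime is a hypothetical wrapper around the index function
// shown below, ensuring only one writer runs at a time.
func indexOneAtATime(indexDB *pebble.DB, id string, document map[string]any) {
	writeMu.Lock()
	defer writeMu.Unlock()
	index(indexDB, id, document)
}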

Adding documents

This is easier to explain if we start from the end: it’s simpler to understand what we’re working with in the forward index if we first look at how entries get into it.

So we will begin by adding code to populate the forward index as we index documents. For now, we assume that the code we’re about to look at only ever adds documents to the index, as we’re not yet handling delete/update! Also, we will pretend that errors never happen (I told you this wasn’t ready for the real world 😬).

Anyway, the indexing code becomes:

func index(indexDB *pebble.DB, id string, document map[string]any) {

	// ...
	// Existing code: convert id to a []byte and other house-keeping
	// ...

	// Create a batch of changes to apply atomically
	b := indexDB.NewIndexedBatch()

	// Existing code: 
	// Get each field in the document as dotted-path + value
	pv := getPathValues(document, "")

	for _, pathValue := range pv {
		// Existing code: 
		// Update inverted index
		invIdxKey := encodeInvIdxKey(
			pathValue.path, pathValue.taggedValue, docID)
		b.Set(invIdxKey, nil, pebble.Sync)

		// NEW CODE: 
		// Create the fwd index entries for this field of the document
		fwdIdxKey := encodeFwdIdxKey(fwdIdxKey{
			id:          docID,
			path:        pathValue.path,
			taggedValue: pathValue.taggedValue,
		})
		b.Set(fwdIdxKey, []byte{}, pebble.Sync)
	}

	// Atomically write the batch of changes
	indexDB.Apply(b, pebble.Sync)
}
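
As an example, indexing the pet document whose keys we saw earlier would look something like this (the document itself is hypothetical, chosen to match those keys):

index(indexDB, "doc1", map[string]any{
	"name":    "tiddles",
	"species": "cat",
	"owner": map[string]any{
		"name": "mike",
		"city": "bristol",
	},
})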

Updating documents

Now let’s look at the more complicated side: what happens when we update a document?

First we need to remove the existing inverted index keys by using the forward index to look them up. I broke this out into an unindex function, which takes the PebbleDB Batch (for atomic updates) and the document ID to remove from the index. The function removes both inverted and forward index entries.

There’s only one mildly confusing bit here. The inverted index key that we are looking up has the document ID at the end. But, to avoid wasting space by repeating the document’s ID in the forward index key, we don’t embed the inverted index key there verbatim (ie, with the document ID). So to recreate the inverted index key from the forward index key, we have to decode the forward index key and re-encode it into the inverted index key.

func unindex(b *pebble.Batch, docID []byte) error {
	// To unindex, we use the forward index (id -> path-values) to
	// find all the keys in the inverted index to remove. After removing
	// those, we clean up the forward index.

	// 1. Get the range for all the keys for the doc id from the 
	//    forward index. Everything is encoded into the keys.
	startKey := packTuple([]byte{fwdIdxNamespace}, docID)
	endKey := packTuple([]byte{fwdIdxNamespace}, docID)
	endKey = append(endKey, 1) // this trailing 1 is greater-than the 0-separator

	// 2. Read all the keys. Deserialise each key to find the
	//    path-value that is in the inverted index, and delete
	//    that from the inverted index.
	readOptions := &pebble.IterOptions{LowerBound: startKey, UpperBound: endKey}
	iter := b.NewIter(readOptions)
	for iter.SeekGE(startKey); iter.Valid(); iter.Next() {
		fik := decodeFwdIdxKey(iter.Key())
		invIdxKey := encodeInvIdxKey(fik.path, fik.taggedValue, fik.id)
		err := b.Delete(invIdxKey, pebble.Sync)
		if err != nil {
			log.Printf(
				"Couldn't delete/set invIdxKey %v in index: %v",
				invIdxKey, err)
		}
	}
	iter.Close()

	// 3. Clean up: remove all the entries for id in the forward index.
	b.DeleteRange(startKey, endKey, pebble.Sync)

	return nil
}
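
For completeness, decodeFwdIdxKey is just the inverse of encodeFwdIdxKey. We didn’t look at it above, but a minimal sketch, assuming (as this toy code does throughout) that document IDs and field paths don’t themselves contain 00 bytes, might be:

// decodeFwdIdxKey deserialises a forward index key back into its
// parts. It assumes the layout written by encodeFwdIdxKey:
// [ f 00 <docID> 00 <field path> 00 <tag> <value> ]
// Uses the standard library bytes package.
func decodeFwdIdxKey(k []byte) fwdIdxKey {
	rest := k[2:] // strip the namespace byte and its 00 separator
	parts := bytes.SplitN(rest, []byte{0x00}, 3)
	return fwdIdxKey{
		id:          parts[0],
		path:        parts[1],
		taggedValue: parts[2],
	}
}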

For updating documents, all that remains is to call unindex from the index function:

func index(indexDB *pebble.DB, id string, document map[string]any) {

	// ...
	// Existing code: convert id to a []byte and other house-keeping
	// ...

	// Create a batch of changes to apply atomically
	b := indexDB.NewIndexedBatch()

	// NEW CODE
	unindex(b, docID)

	// Existing code: Get each field in the document as dotted-path + value
	pv := getPathValues(document, "")

	for _, pathValue := range pv {
		// ...
		// code we saw earlier for adding inverted and forward
		// index entries for new document.
		// ...
	}
}

Deleting documents

We can just call the unindex function on its own to delete documents from the index. That was easy!
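
In code, that’s just unindex wrapped in its own batch. A hypothetical helper (not shown earlier in the series) might look like:

// deleteDocument removes a document's entries from both the inverted
// and forward indexes.
func deleteDocument(indexDB *pebble.DB, id string) error {
	docID := []byte(id)

	// Create a batch of changes to apply atomically
	b := indexDB.NewIndexedBatch()
	if err := unindex(b, docID); err != nil {
		return err
	}

	// Atomically write the batch of changes
	return indexDB.Apply(b, pebble.Sync)
}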

Optimisations

The above code is naive. In particular, we create a lot of unnecessary writes.

Think of the case where we are updating a single field in a document. Our code will delete all the previous index entries related to that document and then write all the new index entries, to both the inverted and forward indexes. But it is only necessary to update the entries related to the field that has changed.

Updating one field only strictly requires four writes (a delete then an add in the inverted index, and a delete then an add in the forward index). With the code above, however, if the document has N indexed fields, changing one field results in 4 * N writes instead of 4.

The solution is to get the old inverted index keys from the forward index and the new inverted index keys from the document, and then, before doing any writes, compare the old and new key sets to find the ones that have changed. We then perform only the changes needed for those changed keys, taking the number of writes back down to only those strictly required.
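
A sketch of that delta, written as two set differences over the encoded inverted index keys (the applyDelta helper is hypothetical and not part of the code above; the same delta would also need to be mirrored into the forward index keys):

// applyDelta deletes keys that appear only in the old set and adds
// keys that appear only in the new set; keys present in both are
// untouched.
func applyDelta(b *pebble.Batch, oldKeys, newKeys [][]byte) error {
	oldSet := make(map[string]struct{}, len(oldKeys))
	for _, k := range oldKeys {
		oldSet[string(k)] = struct{}{}
	}
	newSet := make(map[string]struct{}, len(newKeys))
	for _, k := range newKeys {
		newSet[string(k)] = struct{}{}
	}

	// Delete keys that no longer exist in the new version of the document
	for k := range oldSet {
		if _, ok := newSet[k]; !ok {
			if err := b.Delete([]byte(k), pebble.Sync); err != nil {
				return err
			}
		}
	}

	// Add keys that are new in this version of the document
	for k := range newSet {
		if _, ok := oldSet[k]; !ok {
			if err := b.Set([]byte(k), nil, pebble.Sync); err != nil {
				return err
			}
		}
	}
	return nil
}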

Computing this delta is an important optimisation — even if a document only has two indexed fields, the naive approach makes twice as many writes to the underlying key-value store as are required, and it only gets worse with real world documents that have tens or hundreds of fields!

(Perhaps PebbleDB is able to spot these redundant changes when it applies the Batch, perhaps not. Either way, we’d probably want to do this ourselves in real code, so we have better control over performance.)

Summary

That was quite a long one! But now we’ve got a pretty decent index that we can add, update and remove documents from. Part 3 also covered doing simple equality queries on the entries in the index, so we can do some basic index searches too.

The main thing we’ve got left to explore is more complicated querying. That’s a deep topic. So we’ll stick to a relatively shallow exploration. We’ll expand from equality to range queries, and perhaps combining several range queries with an AND operator to see how we can start to build more complicated structures.

In real databases there’s also a big chunk of code called a query planner which takes the user’s query (whether as a SQL query or in the query language of a NoSQL database) and decides on one or more “internal” queries to execute using its internal indexes (ie, the ones like ours). While I hope to learn more about this layer over time, and how it uses smart algorithms to optimise the execution of the user’s query, for now it’s going to be outside the scope of this series; the next part will stick to the abstraction level that our index sits at.

’til next time 👋
