BYODI 5: plan, optimise, execute 🚀

Welcome to part 5 of the build your own database index series! This was supposed to be about finding documents, but I kind of got ahead of myself in the code, so we will skim over some of the details of finding documents and instead talk about moving from a naive query execution to a (very slightly) more sophisticated two-stage process:

  1. The query planner takes the caller’s query and creates physical operations from it. Physical operations in this case are index scans — instructions to return document IDs between a start and end key in the index.
  2. The query executor takes the list of index scans produced by the planner and executes them, creating a final list of document IDs by intersecting the results from each scan.

The original search_index method tangled these two stages together. It also performed no pre-processing of the query, naively executing each predicate exactly as it was passed in.

We can do better than that. Once we have extracted the planning and execution phases into their own methods, we’ll look at performing a simple optimisation of the index scans we do when performing a query.

Series outline

There are several parts to this series. I’ll do a short recap of the important points before starting with this post proper, so it’s probably not necessary to revisit these posts if you’ve already read them. But if you need a deeper dive, pop over to these in order to get the full picture.

  1. Encode primitive values.
  2. Add field names to our index.
  3. Index JSON documents (and basic queries).
  4. Update and delete documents from the index.
  5. Extract a query planner and do an optimisation pass (That’s this part).

Our index

As a refresher, our toy database stores JSON documents and automatically indexes all their fields, including arrays, in one big underlying key-value store.

The index stores all its data in the keys, leaving values blank. It does this by concatenating the JSON field path, the value and a doc ID, then encoding all three into the key. We’ve used various encodings for the field path over the course of the project, but at this point each key “component” (path, value, doc ID) is separated by the byte 00, while the parts within the path are separated by 01.

Furthermore, each path part and the value is “tagged” with its type, to aid in decoding the key. The tag is stored as a single-byte prefix of the tagged value. In the following examples, the tag byte is always 2c, which is the tag for string.

Warning! Previous entries in this series describe an index key format where only the field’s value is tagged; somewhere along the way I changed that decision to make everything tagged.

In practice, when looking at the examples, you’ll see that the separators and tagging mean that we typically have either 01 2c or 00 2c between components of the key. If our values were different types — like numbers or booleans — the value of the tag byte would differ.

The document {"name": "tiddles", "owner": {"name": "mike", "city": "bristol" } } would be indexed like this:

[ 2c n a m e 00 2c t i d d l e s 00 2c d o c 1 ]
[ 2c o w n e r 01 2c c i t y 00 2c b r i s t o l 00 2c d o c 1 ]
[ 2c o w n e r 01 2c n a m e 00 2c m i k e 00 2c d o c 1 ]
  --------------------------    ----------    ----------
            path                  value         doc ID
Note the way that the fields are stored in lexicographical byte order rather than in the order they are found in the document, and that we have the embedded doc ID, doc1, at the end of the key.

If multiple documents have the same value in the same field, we store that as multiple keys:

[ 2c s p e c i e s 00 2c c a t 00 2c t i d d l e s ]
[ 2c s p e c i e s 00 2c c a t 00 2c s o o t y ]
[ 2c s p e c i e s 00 2c c a t 00 2c m a v i s ]
  ----------------    --------    ------------
     path              value         doc ID
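The key layout above can be sketched in a few lines of Rust. This is a hedged approximation, assuming the tag and separator bytes described (2c for strings, 00 between components, 01 between path parts); index_key is a hypothetical helper for illustration, not docdb’s real encoder.

```rust
// Sketch of the index key layout: tag byte 0x2c before each string,
// 0x01 between path parts, 0x00 between path, value and doc ID.
fn index_key(path: &[&str], value: &str, doc_id: &str) -> Vec<u8> {
    let mut key = Vec::new();
    for (i, part) in path.iter().enumerate() {
        if i > 0 {
            key.push(0x01); // path part separator
        }
        key.push(0x2c); // string tag
        key.extend_from_slice(part.as_bytes());
    }
    key.push(0x00); // component separator
    key.push(0x2c);
    key.extend_from_slice(value.as_bytes());
    key.push(0x00);
    key.push(0x2c);
    key.extend_from_slice(doc_id.as_bytes());
    key
}
```

Calling index_key(&["owner", "name"], "mike", "doc1") reproduces the [ 2c o w n e r 01 2c n a m e 00 2c m i k e 00 2c d o c 1 ] key shown above.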

Query language

Our query language is very simple. We have five predicates:

  • E: equality. Documents with a path matching a value; a.b == "foo".
  • GT/GTE/LT/LTE: greater than, greater than or equal to, less than, less than or equal to; a.b > 12, a.b <= 24 and so on.

Queries are expressed directly in Rust code. The caller constructs a vec of QP (Query Predicate) enum values to represent its predicates, and these are treated as a conjunction, ie AND.

This Rust code expresses a.name == "mike" AND a.years > 12:

    QP::E { path: keypath!("a", "name"), v: tv("mike") },
    QP::GT { path: keypath!("a", "years"), v: tv(12) },
The keypath! macro and the tv function are helpers; the API to define queries is … probably not the best I’ve ever written.
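For readers following along without the codebase, here’s a minimal sketch of what the query types might look like. TaggableValue, tv and the enum shapes here are simplified stand-ins inferred from the snippets in this post, not docdb’s real definitions.

```rust
// Simplified stand-ins for docdb's query types: a query is a vec of
// predicates over a (path, value) pair, treated as a conjunction.
#[derive(Debug, Clone, PartialEq)]
enum TaggableValue {
    Str(String),
    Num(f64),
}

#[allow(dead_code)]
enum QP {
    E { path: Vec<TaggableValue>, v: TaggableValue },
    GT { path: Vec<TaggableValue>, v: TaggableValue },
    GTE { path: Vec<TaggableValue>, v: TaggableValue },
    LT { path: Vec<TaggableValue>, v: TaggableValue },
    LTE { path: Vec<TaggableValue>, v: TaggableValue },
}

// tv stand-in for strings only; the real helper handles more types.
fn tv(s: &str) -> TaggableValue {
    TaggableValue::Str(s.to_string())
}

// a.name == "mike" AND a.years > 12
fn example_query() -> Vec<QP> {
    vec![
        QP::E { path: vec![tv("a"), tv("name")], v: tv("mike") },
        QP::GT { path: vec![tv("a"), tv("years")], v: TaggableValue::Num(12.0) },
    ]
}
```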

Creating scan ranges

Internally, a query predicate is converted into a scan range for the index.

The query a.name >= "mike" is transformed to this range:

                  path               value
             ------------------    ----------
start key: [ 2c a 01 2c n a m e 00 2c m i k e 00 ]
end key:   [ 2c a 01 2c n a m e 02 ]

The start key sets the lower bound of the range to the field path and the value mike. The upper bound is “greater than” every value for the field path in the index. The trailing 02 on the end key is so the byte sorts after both the 00 key component separator and the 01 path separator.
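Constructing that range is mechanical. Here’s a sketch that assumes the encoded path bytes are already in hand; scan_range_gte here is a simplified stand-in for the real encoding functions, not docdb’s actual API.

```rust
// Build the (start, end) scan range for path >= value, per the layout
// above: start is path + 00 + tagged value + 00; end is path + 02,
// which sorts after both the 00 and 01 separators.
fn scan_range_gte(path_key: &[u8], value: &str) -> (Vec<u8>, Vec<u8>) {
    let mut start = path_key.to_vec();
    start.push(0x00); // component separator
    start.push(0x2c); // string tag
    start.extend_from_slice(value.as_bytes());
    start.push(0x00);

    let mut end = path_key.to_vec();
    end.push(0x02); // sorts after every key for this path
    (start, end)
}
```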

This query would scan rows 2, 3, 4 and 5 (start and end keys included to show how they define the range; they are obviously not in the index itself!):

1: [ 2c a 01 2c n a m e 00 2c b o b 00 2c d o c 1 ]
   [ 2c a 01 2c n a m e 00 2c m i k e 00 ]        // start key location
2: [ 2c a 01 2c n a m e 00 2c m i k e 00 2c d o c 2 ]
3: [ 2c a 01 2c n a m e 00 2c n i n a 00 2c d o c 6 ]
4: [ 2c a 01 2c n a m e 00 2c o s c a r 00 2c d o c 8 ]
5: [ 2c a 01 2c n a m e 00 2c p e t e 00 2c d o c 3 ]
   [ 2c a 01 2c n a m e 02 ]                      // end key location
6: [ 2c a 01 2c s p e c i e s 00 2c d o g 00 2c d o c 1 ]
     ------------------------    --------    ----------
           path                    value        doc ID

The doc IDs are then parsed from the scanned keys. After all the scans have been executed, the ID sets are intersected, because we only want to return the IDs that matched all query predicates. The result of the intersection is returned to the caller.
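The intersect-by-counting step can be sketched on its own. This assumes each scan returns any given doc ID at most once, which holds here because a document has a single index key per field and value; intersect is an illustrative name, not the real code.

```rust
use std::collections::BTreeMap;

// Count how many scans each doc ID appeared in; keep only the IDs
// seen in every scan. BTreeMap keeps the returned IDs ordered.
fn intersect(scan_results: Vec<Vec<String>>) -> Vec<String> {
    let n_scans = scan_results.len();
    let mut counts: BTreeMap<String, usize> = BTreeMap::new();
    for ids in scan_results {
        for id in ids {
            *counts.entry(id).or_insert(0) += 1;
        }
    }
    counts
        .into_iter()
        .filter(|&(_, c)| c == n_scans)
        .map(|(id, _)| id)
        .collect()
}
```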


Now we’ve shown how a query executes in theory, let’s have a look at the search code that we’ll start off with. This code is naive. It takes the query the caller passes to it and executes it exactly as given — an index scan per query predicate, executed in the order of the predicates in the passed vec — regardless of how efficient this set of index scans might end up being.

pub fn search_index(db: &Db, q: Query) -> Result<Vec<String>, DocDbError> {
    // BTreeMap so we return IDs to caller in order
    let mut result_ids = BTreeMap::new();
    let mut n_preds = 0;

    // Loop over each predicate in the query and look up
    // the IDs that match each in turn. Count the instances
    // of each ID in a map.
    for qp in q {
        n_preds += 1;
        let ids = match qp {
            QP::E { p, v } => lookup_eq(db, p, v)?,
            QP::GT { p, v } => lookup_gt(db, p, v)?,
            QP::GTE { p, v } => lookup_gte(db, p, v)?,
            QP::LT { p, v } => lookup_lt(db, p, v)?,
            QP::LTE { p, v } => lookup_lte(db, p, v)?,
        };
        for id in ids {
            let count = result_ids.entry(id).or_insert(0);
            *count += 1;
        }
    }

    // Return the IDs which had a result in every scan.
    let mut result = vec![];
    for (id, n) in result_ids {
        if n == n_preds {
            result.push(id);
        }
    }
    Ok(result)
}

(rust-docdb/src/ at e9ac36fbce774b16d492f558d35909eb5d083877 · mikerhodes/rust-docdb)

The lookup_* functions both “plan” and “execute”: they generate the physical operation (start/end key) for the predicate, then pass it to the scan function, which runs the scan on the index and returns the doc IDs extracted from the resulting keys.

fn lookup_eq(
    db: &Db,
    path: Vec<TaggableValue>,
    v: TaggableValue,
) -> Result<Vec<String>, DocDbError> {
    let start_key = encoding::encode_index_query_pv_start_key(&path, &v);
    let end_key = encoding::encode_index_query_pv_end_key(&path, &v);
    scan(&db, &start_key, &end_key)
}
This is our first problem: we need to break these functions apart so we can put the “plan” part — extracting the scan range — into our query planner, and the scan call into our query executor.

Let’s do that next.

Extracting a planner

First, we need to pull apart the lookup_* functions as noted above.

After we’ve done that, we can “plan”: loop over the QP::* query predicates generating the start/end key pairs for all the predicates.

Once we have carried out that transformation, we can execute: loop over the scan ranges and execute the scans. Each scan will return IDs. So, exactly as before, we count the number of times each ID appears in the scan results, and return those IDs that were in every scan.

Pulling apart the lookup functions

This is quite simple. Instead of carrying out the scan in the lookup function, we return the start and end keys. We can rename the functions to scan_range_* for good measure. Here’s the new scan_range_eq function.

fn scan_range_eq(path: Vec<TaggableValue>, v: TaggableValue) -> (Vec<u8>, Vec<u8>) {
    let start_key = encoding::encode_index_query_pv_start_key(&path, &v);
    let end_key = encoding::encode_index_query_pv_end_key(&path, &v);
    (start_key, end_key)
}

🌟 I should note that if you head over to the rust-docdb codebase, you’ll not see neat commits for the upcoming code; I worked through this in a series of messy commits that mixed and matched various ideas before settling on this approach.

Next we rewrite search_index to use these new functions to get our scan ranges, and then execute them using the scan function that used to be called in lookup_*.

First, we define a Scan struct to hold our physical operation. Then we extract the query planner and executor into their own functions for clarity, query_plan and query_execute. The query_plan function returns a vec of Scan structs. These are passed to the query_execute function for execution against the database.

// A physical operation that defines scanning the index between
// skey and ekey, returning the doc IDs found.
struct Scan {
    skey: Vec<u8>,
    ekey: Vec<u8>,
}

pub fn search_index(db: &Db, q: Query) -> Result<Vec<String>, DocDbError> {
    let scans = query_plan(q)?;
    query_execute(scans, db)
}

// query_plan returns a list of physical operations to carry out
// to execute the query q.
fn query_plan(q: Vec<QP>) -> Result<Vec<Scan>, DocDbError> {
    let mut scans: Vec<Scan> = vec![];
    for qp in q {
        let (skey, ekey) = match qp {
            QP::E { p, v } => scan_range_eq(p, v),
            QP::GT { p, v } => scan_range_gt(p, v),
            QP::GTE { p, v } => scan_range_gte(p, v),
            QP::LT { p, v } => scan_range_lt(p, v),
            QP::LTE { p, v } => scan_range_lte(p, v),
        };
        scans.push(Scan { skey, ekey });
    }
    Ok(scans)
}

// query_execute executes a set of physical operations (Scans) against
// a database, and returns the matching document IDs.
fn query_execute(scans: Vec<Scan>, db: &Db) -> Result<Vec<String>, DocDbError> {
    // BTreeMap so we return IDs to caller in order
    let mut result_ids = BTreeMap::new();
    let mut n_preds = 0;
    for s in scans {
        n_preds += 1;
        let ids = scan(&db, &s.skey, &s.ekey)?;
        for id in ids {
            let count = result_ids.entry(id).or_insert(0);
            *count += 1;
        }
    }

    // Return the IDs that we saw as a result for every
    // predicate by checking the count in the map.
    let mut result = vec![];
    for (id, n) in result_ids {
        if n == n_preds {
            result.push(id);
        }
    }
    Ok(result)
}

Now, this code is just as naive, but also more complicated. Well done us! However, this structure puts us in a better place to work with the query to make it more efficient before it’s executed. Broadly speaking, the place to carry out optimisations is once we’ve translated the query into index scans. Now we have an obvious place to put this code: in query_plan, after the for loop that does the predicate to scan transformation.

Our next goal, then, is to make our planner a little smarter, rather than merely regurgitating the exact input query as scan operations. To do this, we will narrow our focus to the query_plan method. Its goal becomes: produce the quickest-to-execute list of scan operations 🚀

Let’s optimise! (Just a little bit)

There are many possible optimisation steps we can do in query_plan. One of the most obvious is to collapse predicates over the same field into a single range scan.

As an example, in our query language, 12 <= a < 46 needs to be expressed as two predicates: a >= 12 AND a < 46. The existing query planner takes these and executes them as two range scans:

  1. From the start key [ a 00 12 00 ] to the end key [ a 02 ]. (Ie, from a == 12 until the last index key for the field a).
  2. From the start key [ a 00 ] to the end key [ a 00 46 00 ]. (Ie, from the first index key for a, stopping before a == 46, as the end key is exclusive).

Clearly we should be able to express this as just one scan, [ a 00 12 00 ] to [ a 00 46 00 ]. Given that 12 to 46 is quite a small range, even if we’re only considering people’s ages, say, the naive way of executing this query likely reads a lot of data that we don’t need.

A note on equality

Before we get started, there’s one subtlety to take note of: an equality operation is just a range scan over a single path and value. That is, a == 12, written in code as:

QP::E { path: keypath!("a"), v: tv(12) },

is actually expressed as a range scan operation — necessarily, because we store an index key for every document whose field a is 12.

Scan {
    // the trailing 00 ensures that our skey sorts less-than all
    // document keys for a == 12; recall our keys are
    // [ a 00 12 00 d o c 1 ], [ a 00 12 00 d o c 2 ]
    // and so on.
    skey: [ a 00 12 00 ],
    // the trailing 02 in ekey ensures that our end key sorts
    // greater-than all keys for documents with a == 12.
    ekey: [ a 00 12 02 ],
}

This means that we don’t have to treat equality separately in what is about to unfold. It’s just another range scan like everything else. See what I said about it being nice to do this after we’d transformed everything to scans?

Back to the story

Our query 12 <= a < 46 is represented in Rust code by the two query predicates:

let query = vec![
    QP::GTE { path: keypath!("a"), v: tv(12) },
    QP::LT { path: keypath!("a"), v: tv(46) },
];

Our naive query planner currently generates two scan operations:

Scan { skey: [ a 00 12 00 ], ekey: [ a 02 ] } // a >= 12
Scan { skey: [ a 00 ], ekey: [ a 00 46 00 ] } // a < 46

As noted, we should do better, and instead do this scan:

Scan { skey: [ a 00 12 00 ], ekey: [ a 00 46 00 ] }

But how do we get to that more efficient scan? And what happens if we have lots of predicates for a specific field? And what if the predicates describe a range that isn’t valid in a conjunction, like a < 10 AND a > 50?

This is something our query planner can work out!

It turns out that when we are working with a conjunction, so far as I can tell with a quick pen-and-paper exercise, we can work out the range that satisfies all predicates with a simple method:

  1. First, group the predicates by their field. Then, for each field group:
    1. Generate all the scan ranges (skey,ekey) from the predicates.
    2. Find the largest skey.
    3. Find the smallest ekey.
    4. If skey <= ekey, add that range to our planned scans.
    5. If skey > ekey, our ranges don’t overlap, so return an error.
The last step is rather handy: it shows that, for some special cases, we can work out that the query can return no results before we’ve even executed a single database operation. That’s about as optimised as you can get!
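The per-field steps above can be sketched directly on (start, end) byte-vector pairs, relying on Rust’s lexicographic ordering of Vec<u8>. The names here are illustrative only, not the real docdb code.

```rust
// Collapse several scan ranges over one field into a single range:
// take the largest start key and the smallest end key. If they
// cross, the conjunction can match nothing, so signal an error.
// Assumes at least one range (unwrap panics on an empty slice).
fn collapse_ranges(
    ranges: &[(Vec<u8>, Vec<u8>)],
) -> Result<(Vec<u8>, Vec<u8>), &'static str> {
    let skey = ranges.iter().map(|r| r.0.clone()).max().unwrap();
    let ekey = ranges.iter().map(|r| r.1.clone()).min().unwrap();
    if skey <= ekey {
        Ok((skey, ekey))
    } else {
        Err("ranges do not overlap; query can return no results")
    }
}
```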

The key thing about this code is that we first convert the predicates into their corresponding range scans, and only then start to apply optimisations. This is because the different query types all reduce to the same basic operation: a range scan on the database. Once we have a set of scans to work with, finding the smallest range scan that satisfies all predicates is straightforward.

In code, this isn’t quite so clean. In particular, the grouping-by-field code is a bit gnarly due to having to use a pretty awkward match to extract the path for the grouping key. While it’s not awful, it does make me wonder whether there’s a better way to represent the query. Something to think on separately.

fn query_plan(q: Vec<QP>) -> Result<Vec<Scan>, DocDbError> {
    // Map the query predicates into individual range scans
    // while grouping them by field. This is a nice structure
    // for later optimisations.
    let mut groups: BTreeMap<Vec<u8>, Vec<Scan>> = BTreeMap::new();
    for qp in q {
        let path = match &qp {
            QP::E { p, .. }
            | QP::GT { p, .. }
            | QP::GTE { p, .. }
            | QP::LT { p, .. }
            | QP::LTE { p, .. } => p.clone(),
        };
        let (skey, ekey) = match qp {
            QP::E { p, v } => scan_range_eq(p, v),
            QP::GT { p, v } => scan_range_gt(p, v),
            QP::GTE { p, v } => scan_range_gte(p, v),
            QP::LT { p, v } => scan_range_lt(p, v),
            QP::LTE { p, v } => scan_range_lte(p, v),
        };
        let x = (&path).encode(); // encode our field path to use as map key

        // Add this predicate's Scan to our grouping map
        groups.entry(x).or_insert(vec![]).push(Scan { skey, ekey });
    }

    // Now we have a map that will be something like this:
    // groups["name"] = [Scan{}, Scan{}, Scan{}]
    // groups["age"] = [Scan{}, Scan{}]

    // And now collapse each grouped set of Scans into one scan.
    // Note we allow this to create invalid scan ranges; we
    // check for this later.
    let mut collapsed_scans = vec![];
    for (_, scans) in groups {
        // max() and min() get us our max-skey and min-ekey
        let skey = scans.iter().map(|x| x.skey.clone()).max().unwrap();
        let ekey = scans.iter().map(|x| x.ekey.clone()).min().unwrap();
        collapsed_scans.push(Scan { skey, ekey });
    }

    // collapsed_scans now contains just one Scan{} for each field. But
    // some might be invalid: the skey could be greater than the ekey.
    // So check for these invalid ranges. If we find any, return an
    // error: the query doesn't quite make sense!
    match collapsed_scans.iter().find(|&s| s.skey > s.ekey) {
        Some(_) => Err(DocDbError::InvalidQuery),
        None => Ok(collapsed_scans),
    }
}

And there you have it. You’ll have it after studying the code for a bit, anyway. Hopefully the comments help decipher it.


That’s quite a bit of code. But if we wrote more optimisations, we’d likely see the same general pattern repeat itself:

  • First, take the query we are given and generate the range scans we need for it.
  • Next, perform some operation over those range scans to reduce their number, or re-order the scans based on heuristics like the number of results they might return.

Or perhaps, like the above, we can even work out that a query will return no results before we even try to execute it!

From this we start to see why databases traditionally have several stages in their query execution pipelines. By separating out planning and executing we were able to find an obvious place to insert an important optimisation into our pipeline. Obvious places are good; they likely mean that the code will be easier to maintain later. This reminds me a lot of the way a compiler will first create an intermediate, simpler representation of our code before applying its optimisations.

I learned a lot writing this part of my toy database and index. I’m not sure how much further I’ll push this project. I’m enjoying it but I feel ready for a slightly different challenge. I’d like to get OR in, and perhaps code that’s able to check whether a document pulled from the database matches a query. But I also have a few other ideas for database related projects.

We’ll see what grabs me next!

โ† Older
Arboreal Labyrinth
โ†’ Newer
ToyKV teaser