BYODI 5: plan, optimise, execute
Welcome to part 5 of the build your own database index series! This was supposed to be about finding documents, but I kind of got ahead of myself in the code, so we will skim over some of the details of finding documents and instead talk about moving from a naive query execution to a (very slightly) more sophisticated two-stage process:
- The query planner takes the caller’s query and creates physical operations from it. Physical operations in this case are index scans — instructions to return document IDs between a start and end key in the index.
- The query executor takes the list of index scans produced by the planner and executes them, creating a final list of document IDs by intersecting the results from each scan.
The original search_index method tangled these two stages together. It also performed no pre-processing of the query, and naively executed each predicate in the query exactly as it was passed in.
We can do better than that. Once we have extracted the planning and execution phases into their own methods, we'll look at a simple optimisation of the index scans we carry out when executing a query.
Series outline
There are several parts to this series. I’ll do a short recap of the important points before starting with this post proper, so it’s probably not necessary to revisit these posts if you’ve already read them. But if you need a deeper dive, pop over to these in order to get the full picture.
- Encode primitive values.
- Add field names to our index.
- Index JSON documents (and basic queries).
- Update and delete documents from the index.
- Extract a query planner and do an optimisation pass (That’s this part).
Our index
As a refresher, our toy database stores JSON documents and automatically indexes all their fields, including arrays, in one big underlying key-value store.
The index stores all its data in the keys, leaving values blank. It does this by concatenating the JSON field path, the value and a doc ID, then encoding all three into the key. We've used various encodings for the field path over the course of the project, but at this point each key "component" is separated by the byte 00, and the parts within the path are separated by 01.
Furthermore, each path part and the value is "tagged" with its type, to aid in decoding the key. The tag is stored as a single-byte prefix of the tagged value. For the following examples, the tag byte is always 2c, which is the tag for string.
Warning! Previous entries in this series describe an index key format where only the field’s value is tagged; somewhere along the way I changed that decision to make everything tagged.
In practice, when looking at the examples, you’ll see that the separators and
tagging mean that we typically have either 01 2c
or 00 2c
between components
of the key. If our values were different types — like numbers or booleans —
the value of the tag byte would differ.
The document
{"name": "tiddles", "owner": {"name": "mike", "city": "bristol" } }
would be
indexed like this:
[ 2c n a m e 00 2c t i d d l e s 00 2c d o c 1 ]
[ 2c o w n e r 01 2c c i t y 00 2c b r i s t o l 00 2c d o c 1 ]
[ 2c o w n e r 01 2c n a m e 00 2c m i k e 00 2c d o c 1 ]
  --------------------------    ----------    ----------
             path                 value         doc ID
Each of these keys carries the document's ID, doc1, at the end of the key. If multiple documents have the same value in the same field, we store that as multiple keys:
[ 2c s p e c i e s 00 2c c a t 00 2c t i d d l e s ]
[ 2c s p e c i e s 00 2c c a t 00 2c s o o t y ]
[ 2c s p e c i e s 00 2c c a t 00 2c m a v i s ]
  ----------------    --------    ------------
        path           value         doc ID
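To make the layout concrete, here's a minimal sketch of how one of these keys could be assembled. It's illustrative only — example_index_key is a made-up name, not the project's real encoding code, and it only handles string values:
fn example_index_key(path: &[&str], value: &str, doc_id: &str) -> Vec<u8> {
    // Illustrative constants; the real encoding lives in rust-docdb's
    // encoding module and handles more types than just strings.
    const STR_TAG: u8 = 0x2c; // type tag for string
    const COMPONENT_SEP: u8 = 0x00; // separates path, value and doc ID
    const PATH_SEP: u8 = 0x01; // separates parts within the path
    let mut key = vec![];
    for (i, part) in path.iter().enumerate() {
        if i > 0 {
            key.push(PATH_SEP);
        }
        key.push(STR_TAG);
        key.extend_from_slice(part.as_bytes());
    }
    for component in [value, doc_id] {
        key.push(COMPONENT_SEP);
        key.push(STR_TAG);
        key.extend_from_slice(component.as_bytes());
    }
    key
}
Calling example_index_key(&["owner", "city"], "bristol", "doc1") produces the second key of the first example above.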
Query language
Our query language is very simple. We have five predicates:
- E: equality. Documents with a path matching a value; a.b == "foo".
- GT/GTE/LT/LTE: greater than, greater than or equal to, less than, less than or equal to; a.b > 12, a.b <= 24 and so on.
Queries are expressed directly in Rust code. The caller constructs a vec of QP (Query Predicate) enum values to represent its predicates, and these are treated as a conjunction, ie AND.
This Rust code expresses a.name == mike AND a.years > 12:
vec![
QP::E { path: keypath!("a", "name"), v: tv("mike") },
QP::GT { path: keypath!("a", "years"), v: tv(12) },
];
The keypath! macro and the tv function are helpers; the API to define queries is … probably not the best I've ever written.
Creating scan ranges
Internally, a query predicate is converted into a scan range for the index.
The query a.name >= mike
is transformed to this range:
                    path         value
             ------------------    ----------
start key: [ 2c a 01 2c n a m e 00 2c m i k e 00 ]
end key:   [ 2c a 01 2c n a m e 02 ]
The start key sets the lower bound of the range to the field path a.name and the value mike. The upper bound is "greater than" every value for the field path a.name in the index. The trailing 02 on the end key is so the byte sorts after both the 00 key component separator and the 01 path separator.
This query would scan rows 2, 3, 4 and 5 (the start and end keys are included to show how they define the range; they are obviously not in the index itself!):
1: [ 2c a 01 2c n a m e 00 2c b o b 00 2c d o c 1 ]
[ 2c a 01 2c n a m e 00 2c m i k e 00 ] // start key location
2: [ 2c a 01 2c n a m e 00 2c m i k e 00 2c d o c 2 ]
3: [ 2c a 01 2c n a m e 00 2c j a n e 00 2c d o c 6 ]
4: [ 2c a 01 2c n a m e 00 2c f r a n c i s 00 2c d o c 8 ]
5: [ 2c a 01 2c n a m e 00 2c k i r k 00 2c d o c 3 ]
[ 2c a 01 2c n a m e 02 ] // end key location
6: [ 2c a 01 2c s p e c i e s 00 2c d o g 00 2c d o c 1 ]
     ------------------------    --------    ----------
               path                value       doc ID
The doc IDs are then parsed from the scanned keys. After all the scans have been executed, the doc IDs extracted from the keys are intersected (because we only want to return the IDs that matched all query predicates/scans). The result of the intersection is returned to the caller.
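To tie that back to bytes, here's a small illustrative sketch — not the project's real encode_index_query_* functions — that assembles the start and end keys for a.name >= mike exactly as drawn above:
fn scan_range_for_a_name_gte_mike() -> (Vec<u8>, Vec<u8>) {
    // Illustrative constants matching the examples in this post.
    const STR_TAG: u8 = 0x2c;
    const COMPONENT_SEP: u8 = 0x00;
    const PATH_SEP: u8 = 0x01;
    const PATH_END: u8 = 0x02; // sorts after both separators

    // start key: [ 2c a 01 2c n a m e 00 2c m i k e 00 ]
    let mut skey = vec![STR_TAG, b'a', PATH_SEP, STR_TAG];
    skey.extend_from_slice(b"name");
    skey.extend_from_slice(&[COMPONENT_SEP, STR_TAG]);
    skey.extend_from_slice(b"mike");
    skey.push(COMPONENT_SEP);

    // end key: [ 2c a 01 2c n a m e 02 ]
    let mut ekey = vec![STR_TAG, b'a', PATH_SEP, STR_TAG];
    ekey.extend_from_slice(b"name");
    ekey.push(PATH_END);

    (skey, ekey)
}
Any key for the a.name field with a value sorting at or after mike falls between these two keys, which is exactly the property the scan relies on.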
Searching
Now we've shown how a query executes in theory, let's have a look at the search code that we'll start off with. This code is naive. It takes the query the caller passes to it and executes it exactly as given — an index scan per query predicate, executed in the order of the predicates in the passed vec — regardless of how efficient this set of index scans might end up being.
pub fn search_index(db: &Db, q: Query) -> Result<Vec<String>, DocDbError> {
// BTreeMap so we return IDs to caller in order
let mut result_ids = BTreeMap::new();
let mut n_preds = 0;
// Loop over each predicate in the query and lookup
// the IDs that match each in turn. Count the instances
// of each ID in a map.
for qp in q {
n_preds += 1;
let ids = match qp {
QP::E { p, v } => lookup_eq(db, p, v)?,
QP::GT { p, v } => lookup_gt(db, p, v)?,
QP::GTE { p, v } => lookup_gte(db, p, v)?,
QP::LT { p, v } => lookup_lt(db, p, v)?,
QP::LTE { p, v } => lookup_lte(db, p, v)?,
};
for id in ids {
let count = result_ids.entry(id).or_insert(0);
*count += 1;
}
}
// Return the IDs which had a result in every scan.
let mut result = vec![];
for (id, n) in result_ids {
if n == n_preds {
result.push(id);
}
}
Ok(result)
}
(rust-docdb/src/query.rs at e9ac36fbce774b16d492f558d35909eb5d083877 · mikerhodes/rust-docdb)
The lookup_* functions both "plan" and "execute": they generate the physical operation (start/end key) for the predicate, then execute it using the scan function, which scans the index and returns the doc IDs extracted from the resulting keys.
fn lookup_eq(
db: &Db,
path: Vec<TaggableValue>,
v: TaggableValue,
) -> Result<Vec<String>, DocDbError> {
let start_key = encoding::encode_index_query_pv_start_key(&path, &v);
let end_key = encoding::encode_index_query_pv_end_key(&path, &v);
scan(&db, &start_key, &end_key)
}
This will be our first problem: we need to break these functions apart so we can put the "plan" bit — extracting the scan range — into our query planner and the scan call into our query executor.
Let’s do that next.
Extracting a planner
First, we need to pull apart the lookup_*
functions as noted above.
After we’ve done that, we can “plan”: loop over the QP::*
query predicates
generating the start/end key pairs for all the predicates.
Once we have carried out that transformation, we can execute: loop over the scan ranges and execute the scans. Each scan will return IDs. So, exactly as before, we count the number of times each ID appears in the scan results, and return those IDs that were in every scan.
Pulling apart the lookup functions
This is quite simple. Instead of carrying out the scan in the lookup function, we now just return the start and end keys. We can rename the functions to scan_range_* for good measure. Here's the new scan_range_eq function.
fn scan_range_eq(path: Vec<TaggableValue>, v: TaggableValue) -> (Vec<u8>, Vec<u8>) {
let start_key = encoding::encode_index_query_pv_start_key(&path, &v);
let end_key = encoding::encode_index_query_pv_end_key(&path, &v);
(start_key, end_key)
}
Refactor the search
Next we rewrite search_index to use these new functions to get our scan ranges, and then execute them using the scan function that used to be called in lookup_*.
First, we define a Scan struct to hold our physical operation. Then we extract the query planner and executor into their own functions for clarity, query_plan and query_execute. The query_plan function returns a vec of Scan structs. These are passed to the query_execute function for execution against the database.
// A physical operation that defines scanning the index between
// skey and ekey, returning the doc IDs found.
struct Scan {
skey: Vec<u8>,
ekey: Vec<u8>,
}
pub fn search_index(db: &Db, q: Query) -> Result<Vec<String>, DocDbError> {
let scans = query_plan(q)?;
let results = query_execute(scans, db)?;
Ok(results)
}
// query_plan returns a list of physical operations to carry out
// to execute the query q.
fn query_plan(q: Vec<QP>) -> Result<Vec<Scan>, DocDbError> {
let mut scans: Vec<Scan> = vec![];
for qp in q {
let (skey, ekey) = match qp {
QP::E { p, v } => scan_range_eq(p, v),
QP::GT { p, v } => scan_range_gt(p, v),
QP::GTE { p, v } => scan_range_gte(p, v),
QP::LT { p, v } => scan_range_lt(p, v),
QP::LTE { p, v } => scan_range_lte(p, v),
};
scans.push(Scan { skey, ekey });
}
Ok(scans)
}
// query_execute executes a set of physical operations (Scans) against
// a database, and returns the matching document IDs.
fn query_execute(scans: Vec<Scan>, db: &Db) -> Result<Vec<String>, DocDbError> {
// BTreeMap so we return IDs to caller in order
let mut result_ids = BTreeMap::new();
let mut n_preds = 0;
for s in scans {
n_preds += 1;
let ids = scan(&db, &s.skey, &s.ekey)?;
for id in ids {
let count = result_ids.entry(id).or_insert(0);
*count += 1;
}
}
// Return the IDs that we saw as a result for every
// predicate by checking the count in the map.
let mut result = vec![];
for (id, n) in result_ids {
if n == n_preds {
result.push(id);
}
}
Ok(result)
}
Now, this code is just as naive, but also more complicated. Well done us!
However, this structure puts us in a better place to work with the query to make it more efficient before it's executed. Broadly speaking, the place to carry out optimisations is once we've translated the query into index scans. Now we have an obvious place to put this code: in query_plan, after the for loop that does the predicate-to-scan transformation.
Our next goal, then, is to make our planner a little smarter, rather than merely regurgitating the exact input query as scan operations. To do this, we will narrow our focus to the query_plan method. Its goal becomes: produce the quickest-to-execute list of scan operations.
Let’s optimise! (Just a little bit)
There are many possible optimisation steps we can do in query_plan. One of the most obvious is to collapse predicates over the same field into a single range scan.
As an example, in our query language, 12 <= a < 46 needs to be expressed as two predicates: a >= 12 AND a < 46. The existing query planner takes these and executes them as two range scans:
- From the start key [ a 00 12 00 ] to the end key [ a 02 ]. (Ie, from a == 12 until the last index key for the field a.)
- From the start key [ a 00 ] to the end key [ a 00 46 ]. (Ie, from the first index key for a, stopping when we find a == 46, as the end key is exclusive.)
Clearly, in theory we should be able to express this as just one scan, [ a 00 12 00 ] to [ a 00 46 ]. Given that 12 to 46 is quite a small range — even if we're only considering people's ages, say — the naive way of executing this query likely ends up reading a lot of data that we don't need.
A note on equality
Before we get started, there's one subtlety to take note of. It's that an equality operation is just a range scan of a single path and value. That is, a == 12, written in code as:
QP::E { path: keypath!("a"), v: tv(12) },
is actually expressed as a range scan operation — necessarily, because we are storing an index key for every document with the field a being 12.
Scan {
// the trailing 00 ensures that our skey sorts less-than all
// document keys for a == 12, recall our keys are
// [ a 00 12 00 d o c 1 ], [ a 00 12 00 d o c 2 ]
// and so on.
skey: [ a 00 12 00 ],
// the trailing 02 in ekey ensures that our end key sorts
// greater-than all keys for documents with a == 12.
ekey: [ a 00 12 02 ]
}
This means that we don’t have to treat equality separately in what is about to unfold. It’s just another range scan like everything else. See what I said about it being nice to do this after we’d transformed everything to scans?
Back to the story
Our query 12 <= a < 46
is represented in Rust code by the two query
predicates:
let query = vec![
QP::GTE { path: keypath!("a"), v: tv(12) },
QP::LT { path: keypath!("a"), v: tv(46) },
];
Our naive query planner currently generates two scan operations:
Scan { skey: [ a 00 12 00 ], ekey: [ a 02 ] } // a >= 12
Scan { skey: [ a 00 ], ekey: [ a 00 46 00 ] } // a < 46
As noted, we should do better, and instead do this scan:
Scan { skey: [ a 00 12 00 ], ekey: [ a 00 46 00 ] }
But how do we get to that more efficient scan? And what happens if we have lots of predicates for a specific field? And what if the predicates describe a range that isn't valid in a conjunction, like a < 10 AND a > 50?
This is something our query planner can work out!
It turns out that when we are working with a conjunction, so far as I can tell with a quick pen-and-paper exercise, we can work out the range that satisfies all predicates with a simple method:
- First, group the predicates by their field. Then, for each field group:
  - Generate all the scan ranges (skey, ekey) from the predicates.
  - Find the largest skey.
  - Find the smallest ekey.
  - If skey <= ekey, add that range to our planned scans (there's a small sketch of this comparison below).
  - If skey > ekey, our ranges don't overlap, so return an error.
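To see why taking the largest skey and smallest ekey works, here's a tiny standalone sketch using the 12 <= a < 46 ranges from above. It leans on the fact that Rust compares Vec<u8> lexicographically, which matches the index's key ordering (values are shown untagged, as in the diagrams, rather than in their real encoded form):
fn main() {
    // The two scans our planner generates for a >= 12 and a < 46.
    let skeys: Vec<Vec<u8>> = vec![
        vec![b'a', 0x00, 12, 0x00], // start of a >= 12
        vec![b'a', 0x00],           // start of a < 46
    ];
    let ekeys: Vec<Vec<u8>> = vec![
        vec![b'a', 0x02],           // end of a >= 12
        vec![b'a', 0x00, 46, 0x00], // end of a < 46
    ];
    // Largest start key and smallest end key give the tightest range.
    let skey = skeys.into_iter().max().unwrap();
    let ekey = ekeys.into_iter().min().unwrap();
    assert_eq!(skey, vec![b'a', 0x00, 12, 0x00]);
    assert_eq!(ekey, vec![b'a', 0x00, 46, 0x00]);
    // skey <= ekey, so the collapsed scan range is valid.
    assert!(skey <= ekey);
}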
The key thing about this approach is that we first convert the predicates into their corresponding range scans, and only then do we start to apply optimisations. This is because the different query types all reduce to the same basic operation, a range scan on the database. Once we have a set of scans to work with, the optimisation of finding the smallest range scan that satisfies all predicates is straightforward.
In code, this isn’t quite so clean. In particular, the grouping-by-field code is
a bit gnarly due to having to use a pretty awkward match
to extract the path
for the grouping key. While it’s not awful, it does make me wonder whether
there’s a better way to represent the query. Something to think on separately.
fn query_plan(q: Vec<QP>) -> Result<Vec<Scan>, DocDbError> {
// Map the query predicates into individual range scans
// while grouping them by field. This is a nice structure
// for later optimisations.
let mut groups: BTreeMap<Vec<u8>, Vec<Scan>> = BTreeMap::new();
for qp in q {
let path = match &qp {
QP::E { p, .. }
| QP::GT { p, .. }
| QP::GTE { p, .. }
| QP::LT { p, .. }
| QP::LTE { p, .. } => p.clone(),
};
let (skey, ekey) = match qp {
QP::E { p, v } => scan_range_eq(p, v),
QP::GT { p, v } => scan_range_gt(p, v),
QP::GTE { p, v } => scan_range_gte(p, v),
QP::LT { p, v } => scan_range_lt(p, v),
QP::LTE { p, v } => scan_range_lte(p, v),
};
let x = (&path).encode(); // encode our field path to use as map key
// Add this predicate's Scan to our grouping map
groups.entry(x).or_insert(vec![]).push(Scan { skey, ekey })
}
// Now we have a map that will be something like this:
// groups["name"] = [Scan{}, Scan{}, Scan{}]
// groups["age"] = [Scan{}, Scan{}]
// And now collapse each grouped set of Scans into one scan.
// Note we allow this to create invalid scan ranges; we
// check for this later.
let mut collapsed_scans = vec![];
for (_, scans) in groups {
let skey = scans.iter().map(|x| x.skey.clone()).max().unwrap();
let ekey = scans.iter().map(|x| x.ekey.clone()).min().unwrap();
// this method gets our max-skey or min-ekey ---^^^
collapsed_scans.push(Scan { skey, ekey });
}
// collapsed_scans now contains just one Scan{} for every field. But
// some might be invalid: the skey could be greater than the ekey.
// So check for these invalid ranges. If we find any, return an
// error, the query doesn't quite make sense!
match collapsed_scans.iter().find(|&s| s.skey > s.ekey) {
Some(_) => Err(DocDbError::InvalidQuery),
None => Ok(collapsed_scans),
}
}
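As a quick sanity check, here's roughly what a test for the new planner might look like — a sketch only, reusing the query-building style from earlier in the post and assuming the types above:
#[test]
fn collapses_predicates_on_the_same_field() {
    // 12 <= a < 46, written as two predicates, should plan to one scan.
    let q = vec![
        QP::GTE { path: keypath!("a"), v: tv(12) },
        QP::LT { path: keypath!("a"), v: tv(46) },
    ];
    let scans = match query_plan(q) {
        Ok(s) => s,
        Err(_) => panic!("expected a valid plan"),
    };
    assert_eq!(scans.len(), 1);

    // A contradictory conjunction, a < 10 AND a > 50, should be rejected.
    let q = vec![
        QP::LT { path: keypath!("a"), v: tv(10) },
        QP::GT { path: keypath!("a"), v: tv(50) },
    ];
    assert!(query_plan(q).is_err());
}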
And there you have it. You’ll have it after studying the code for a bit, anyway. Hopefully the comments help decipher it.
Summary
That's quite a bit of code. But if we wrote more optimisations, we'd likely see this pattern repeat itself, because it's the general shape we'd use for a lot of them:
- First, take the query we are given and generate the range scans we need for it.
- Next, perform some operation over those range scans to reduce their number, or re-order the scans based on heuristics like the number of results they might return.
Or perhaps, like the above, we can even work out that a query will return no results before we even try to execute it!
From this we start to see why databases traditionally have several stages in their query execution pipelines. By separating out planning and executing we were able to find an obvious place to insert an important optimisation into our pipeline. Obvious places are good; they likely mean that the code will be easier to maintain later. This reminds me a lot of the way a compiler will first create an intermediate, simpler representation of our code before applying its optimisations.
I learned a lot writing this part of my toy database and index. I’m not sure how
much further I’ll push this project. I’m enjoying it but I feel ready for a
slightly different challenge. I’d like to get OR
in, and perhaps code that’s
able to check whether a document pulled from the database matches a query. But I
also have a few other ideas for database related projects.
We’ll see what grabs me next!