Working effectively with CouchDB Mango indexes
Because you work with CouchDB indexes using JSON and Javascript, it’s tempting to imagine there is something JSON or Javascript-y about how you use them. In the end, there isn’t: they end up on disk as B+ Trees, like pretty much every other database. In order to create appropriate indexes for your queries, it’s important to understand how these work. We can use tables as an easy mental model for indexes, and this article shows how that works for CouchDB’s Mango feature (also called Cloudant Query).
Our data
Let’s make a simple data model for people, and add three people:
{
"_id": "123",
"name": "Mike",
"age": 36
}
{
"_id": "456",
"name": "Mike",
"age": 22
}
{
"_id": "abc",
"name": "Dave",
"age": 29
}
Now we’ll look at how we can index these, and how that affects our queries.
Single field indexes
Let’s take a simple query first: what’s an effective index for finding all
people with a given name? This one feels easy: index on name
. Here’s how
this is indexed in Mango:
"index": {
"fields": ["name"]
}
This creates an index on a single field, name
. This field is the key
in
the index. Conceptually, a good representation for this is a table:
key | doc ID |
---|---|
name | |
mike | 123 |
mike | 456 |
dave | abc |
The doc ID is included as a tie-breaker for entries with equal keys.
This ends up on disk as a B+ Tree. In a similar way to how it’s easy to visually scan the table above from top to bottom (or bottom to top), a B+ Tree makes it fast to scan a file on disk in the same way. Therefore the table and B+ Tree can be considered somewhat equivalent when imagining how a query performs.
Specifically, for a query name == "mike"
query, we can see that it’s fast to
search the first column of the table for "mike"
and return data about those
entries. This same inference holds for the on-disk B+ Tree, so from here we’ll
just talk about the tables.
Most queries involve more fields, however, so lets now look at how multi-field indexes work.
Two field indexes
Say we wanted to search by age
and name
. We can create an index to help with
this. In CouchDB we can index both fields in two ways. We’ll get to this below,
but before we get started we need to understand that:
- Indexing an array of fields as a
key
creates a table with several columns, one per entry in thekey
array. - The entries are ordered by the first column, then the second column, and so on in the overall table.
- Therefore, an index can only be searched in column order from left to right,
because this is the only efficient way to scan the table (or on disk tree).
- If an index cannot be used for a query, the database has to resort to loading every document in the database and check it against the query selector; this is the worst case, and can take a very long time. This is called a table scan.
The key
we choose dictates how we can search, and therefore how efficient a
given query can be. Therefore it’s very important to get your indexes right if
you want results to arrive quickly to your application. In particular, avoiding
table scans for queries used a lot is vital.
Let’s look at the two ways we can index these two fields, and how that ordering shows up in how we can query the indexes.
Firstly, we could use by age
, then name
:
"index": {
"fields": ["age", "name"]
}
Giving us the index:
key | doc ID | |
---|---|---|
age | name | |
22 | mike | 456 |
29 | dave | abc |
36 | mike | 123 |
So a query for name == "mike", age == 36
will initially efficiently search the
first column until it finds the first entry for 36
. It will then scan each
entry with 36
in the first column until it finds the first entry with the
value mike
. When it reaches the end of the entries with age == 36
, the query
can stop reading the index because it knows every row in the table after that
will have an age greater than 36
.
This index can also be used to query on just the age
field. In particular,
it’s great for queries like age > 20
or age > 20, age < 30
because the query
engine can efficiently search for the lower bound, and then scan and return
entries until it reaches the upper bound.
The way a query like age > 20, age < 30, name == "mike"
works is the first
column is searched for the lowerbound age, then the index is scanned for entries
where the name
column is "mike"
. When the search encounters an entry in the
first column – the age
column – that is greater than 30, it can stop reading
the index.
This is important: the first column is searched, but the second column is checked via a slower scan operation. Therefore, for any query, the best index is the one where the first column reduces the search space the most, which reduces the number of rows that need to be scanned through to match entries in the second and further columns of the key.
This index cannot, however, be used for the query name == "mike"
because it
cannot be efficiently scanned by name
because there is no overall ordering for
the name
column. Entries are all jiggled around as they are ordered by the
age
field. Therefore the query engine would need to scan every entry in the
index – not very efficient. This is called an index scan and can actually be
more efficient than a full table scan. However, CouchDB doesn’t support
index scans at this time, so will fall back to the table scan.
Secondly, we could use by name
, then age
:
"index": {
"fields": ["name", "age"]
}
Giving us the index:
key | doc ID | |
---|---|---|
name | age | |
mike | 22 | 456 |
mike | 36 | 123 |
dave | 29 | abc |
So a query for name == "mike", age == 36
will initially scan the first column
until it finds the first entry for mike
. It will then scan each entry with
mike
in the first column until it finds the first entry with the value 36
.
This index can also be used to query on just the name
field. It can also be
used to effectively answer questions like name == "mike", age > 30
because the
first column narrows down the name
quickly and the scan for age
can be fast.
It might help to imagine we have many millions of entries: it’s likely there are
lots of people with a certain age, but far fewer with a certain name. Therefore,
the initial search for name == "mike"
will constrain the search space far more
than a search for age > 30
, and so we end up scanning far fewer rows for the
age
value in the second column.
This index cannot be used for queries on the age
field for the same reason
as above; it just wouldn’t be efficient to scan the whole table.
Summary
The above logic holds for indexes of three, four or any number of further fields. The first column can be efficiently searched, and then we can reasonably efficiently scan for matching entries in the second, third and so on columns presuming the first column search narrows down the search space enough.
Creating appropriate indexes is key for the performance of CouchDB applications making use of Mango (or Cloudant Query on Cloudant). Hopefully this article helps show that it’s relatively straightforward to generate effective indexes once you have worked out the queries they need to service, and that it is possible to create indexes that can serve more than one query’s need by judicious use of multi-field indexes.