Working effectively with CouchDB Mango indexes

Because you work with CouchDB indexes using JSON and Javascript, it’s tempting to imagine there is something JSON or Javascript-y about how you use them. In the end, there isn’t: they end up on disk as B+ Trees, like pretty much every other database. In order to create appropriate indexes for your queries, it’s important to understand how these work. We can use tables as an easy mental model for indexes, and this article shows how that works for CouchDB’s Mango feature (also called Cloudant Query).

Our data

Let’s make a simple data model for people, and add three people:

{
    "_id": "123",
    "name": "Mike",
    "age": 36
}
{
    "_id": "456",
    "name": "Mike",
    "age": 22
}
{
    "_id": "abc",
    "name": "Dave",
    "age": 29
}

Now we’ll look at how we can index these, and how that affects our queries.

Single field indexes

Let’s take a simple query first: what’s an effective index for finding all people with a given name? This one feels easy: index on name. Here’s how this is indexed in Mango:

"index": {
    "fields": ["name"]
}

This creates an index on a single field, name. This field is the key in the index. Conceptually, a good representation for this is a table:

key doc ID
name
mike 123
mike 456
dave abc

The doc ID is included as a tie-breaker for entries with equal keys.

This ends up on disk as a B+ Tree. In a similar way to how it’s easy to visually scan the table above from top to bottom (or bottom to top), a B+ Tree makes it fast to scan a file on disk in the same way. Therefore the table and B+ Tree can be considered somewhat equivalent when imagining how a query performs.

Specifically, for a query name == "mike" query, we can see that it’s fast to search the first column of the table for "mike" and return data about those entries. This same inference holds for the on-disk B+ Tree, so from here we’ll just talk about the tables.

Most queries involve more fields, however, so lets now look at how multi-field indexes work.

Two field indexes

Say we wanted to search by age and name. We can create an index to help with this. In CouchDB we can index both fields in two ways. We’ll get to this below, but before we get started we need to understand that:

  • Indexing an array of fields as a key creates a table with several columns, one per entry in the key array.
  • The entries are ordered by the first column, then the second column, and so on in the overall table.
  • Therefore, an index can only be searched in column order from left to right, because this is the only efficient way to scan the table (or on disk tree).
    • If an index cannot be used for a query, the database has to resort to loading every document in the database and check it against the query selector; this is the worst case, and can take a very long time. This is called a table scan.

The key we choose dictates how we can search, and therefore how efficient a given query can be. Therefore it’s very important to get your indexes right if you want results to arrive quickly to your application. In particular, avoiding table scans for queries used a lot is vital.

Let’s look at the two ways we can index these two fields, and how that ordering shows up in how we can query the indexes.

Firstly, we could use by age, then name:

"index": {
    "fields": ["age", "name"]
}

Giving us the index:

key doc ID
age name
22 mike 456
29 dave abc
36 mike 123

So a query for name == "mike", age == 36 will initially efficiently search the first column until it finds the first entry for 36. It will then scan each entry with 36 in the first column until it finds the first entry with the value mike. When it reaches the end of the entries with age == 36, the query can stop reading the index because it knows every row in the table after that will have an age greater than 36.

This index can also be used to query on just the age field. In particular, it’s great for queries like age > 20 or age > 20, age < 30 because the query engine can efficiently search for the lower bound, and then scan and return entries until it reaches the upper bound.

The way a query like age > 20, age < 30, name == "mike" works is the first column is searched for the lowerbound age, then the index is scanned for entries where the name column is "mike". When the search encounters an entry in the first column – the age column – that is greater than 30, it can stop reading the index.

This is important: the first column is searched, but the second column is checked via a slower scan operation. Therefore, for any query, the best index is the one where the first column reduces the search space the most, which reduces the number of rows that need to be scanned through to match entries in the second and further columns of the key.

This index cannot, however, be used for the query name == "mike" because it cannot be efficiently scanned by name because there is no overall ordering for the name column. Entries are all jiggled around as they are ordered by the age field. Therefore the query engine would need to scan every entry in the index – not very efficient. This is called an index scan and can actually be more efficient than a full table scan. However, CouchDB doesn’t support index scans at this time, so will fall back to the table scan.

Secondly, we could use by name, then age:

"index": {
    "fields": ["name", "age"]
}

Giving us the index:

key doc ID
name age
mike 22 456
mike 36 123
dave 29 abc

So a query for name == "mike", age == 36 will initially scan the first column until it finds the first entry for mike. It will then scan each entry with mike in the first column until it finds the first entry with the value 36.

This index can also be used to query on just the name field. It can also be used to effectively answer questions like name == "mike", age > 30 because the first column narrows down the name quickly and the scan for age can be fast.

It might help to imagine we have many millions of entries: it’s likely there are lots of people with a certain age, but far fewer with a certain name. Therefore, the initial search for name == "mike" will constrain the search space far more than a search for age > 30, and so we end up scanning far fewer rows for the age value in the second column.

This index cannot be used for queries on the age field for the same reason as above; it just wouldn’t be efficient to scan the whole table.

Summary

The above logic holds for indexes of three, four or any number of further fields. The first column can be efficiently searched, and then we can reasonably efficiently scan for matching entries in the second, third and so on columns presuming the first column search narrows down the search space enough.

Creating appropriate indexes is key for the performance of CouchDB applications making use of Mango (or Cloudant Query on Cloudant). Hopefully this article helps show that it’s relatively straightforward to generate effective indexes once you have worked out the queries they need to service, and that it is possible to create indexes that can serve more than one query’s need by judicious use of multi-field indexes.