Build your own Database Index: part 2

This is part 2 in a short series about writing a simple index for JSON documents. This index should automatically index JSON documents and enable a relatively simple set of queries over those documents, purely so I can learn more about the problems one encounters. The planned parts:

  1. Encode primitive values.
  2. Add field names to our index. (That’s this part).
  3. Index JSON documents
  4. Update and delete documents from the index
  5. Find documents, extract a query planner and do an optimisation pass.

As a reminder, we want to take this JSON document:

{ "a": { "b": 12, "c": "foo" } }

And be able to find it using queries like:

"a.b" == 12
"a.c" == "foo"
"a.b" < 400
"a.c" > "e"

In part 1, we talked about how to take the primitive values in a JSON document — null, booleans, numbers and strings — and encode them into byte arrays in a way that a natural ordering between the values is maintained. This is the first step in creating the keys we need to be able to search for documents in our key-value store, PebbleDB.

In this part, we’ll see how to add the field name to this encoding. Then we’ll briefly cover how that enables equality and range queries. Let’s get going.

Read More…

Build your own Database Index: part 1

I’ve been quiet in November because I’ve been working on a small toy project. Now it’s become interesting, I want to write about it. In part, to prove to myself I actually understand what I’ve built, by showing I can explain it in words. I’ve been working through creating a simple database-like index, to understand the concepts involved more concretely.

(Where’s the code, you ask. It… requires some clean-up first. I’ll get there sometime soon!)

I’ve been reading Database Internals as part of a book club for the last few weeks. One of the noticeable things in the first few chapters is that nearly everything is based on a storage layer with a relatively simple interface, that of a key-value store where both key and value are byte arrays.

The stores are ordered by key and are used for both storing primary data (eg, key = primary key; value = relational row data) and for indexes (key = indexed value; value = referenced row’s primary key). Inspired by my reading, I decided to have a go at writing a simple index for JSON data.

I wanted to concentrate on the index part, so used an existing key-value store. Any would do, but I decided to use PebbleDB which is the underlying store used by CockroachDB. I chose PebbleDB because it was written in Go. As I was going to write my code in Go, using a Go storage layer meant that I could more easily peak at how the storage layer worked if I needed to.

Read More…

Journal: first month with my iPhone 15 Pro

I upgraded from an iPhone 13 Pro to a 15 Pro about a month ago. I enjoy looking back at older first impressions pieces, such as AirPods Pro: first impressions. So herewith, first impressions of the iPhone 15 Pro.

  • I really like the update to the shape. Rounded rather than chamfered edges feel a lot better in my hand. I ended up using a case for the 13 Pro to round off its almost sharp edges. I’d never used a case before, and being able to go caseless once again makes me very happy.

  • Titanium’s slightly lower conductivity of heat also contributes to a more pleasant hand-feel. It doesn’t take the heat from your skin so quickly as aluminium or steel. A friend commented this felt a little plastic-like, and I can see why. But I like it.

  • The weight difference is surprisingly noticeable. It took me by surprise when I picked the phone up in store. This discussion around moment of inertia rings true for me — it feels more than 10% lighter.

  • The Dynamic Island has really grown on me. I love the way it anchors animations to the hardware, bringing software and hardware together in a way I’ve not felt before. Some of the animations still seem a little “look at me using this new feature!”; maybe they will be moderated over time. Anyhow, a much larger difference than I anticipated.

  • I found choosing a colour harder than usual. None of the colours really grabbed me, but I did feel something pulling me towards the Natural titanium colour. It’s grown on me over the month I’ve had it; I’m now rather pleased with the choice.

  • Like the always-on face for Apple Watch, I’ve quickly grown used to the always-on screen. One touch that I find almost endearing is the way the phone sneakily turns off the screen when no-one’s looking. When I catch the phone doing this, and it quickly re-illuminates the screen like nothing happened, I like pretending it’s a little embarrassed to be caught out.

    Clearly there’s a lot of continual environment scanning going on. It’s impressive this is less power-hungry than a screen nowadays.

While the camera is improved and the screen is creeping ever closer to being completely flush with the edge, in the end it is the feel of the hardware that really pulls me to this phone. It’s one of the nicest iPhones in the hand since the iPhone 4 or even perhaps the 3GS, with its sublime curved back.

I’m really happy with this phone. To the extent I plan to keep it three years rather than my usual two. We’ll see if I can stick to that.

Read More…

Links: October 2023

  • Butterick’s Practical Typography

    Good writing is enhanced by good typography. It helps the reader’s eyes efficiently serve up the words to conscious attention. I enjoyed the writing here, as well as the advice.

  • Age of Invention: Does History have a Replication Crisis?

    A different take in a similar vein to last month’s link to a critique on the effectiveness of peer review.

    I’ve become increasingly worried that science’s replication crises might pale in comparison to what happens all the time in history, which is not just a replication crisis but a reproducibility crisis. Replication is when you can repeat an experiment with new data or new materials and get the same result. Reproducibility is when you use exactly the same evidence as another person and still get the same result — so it has a much, much lower bar for success, which is what makes the lack of it in history all the more worrying.

  • Designing a New Old Home

    I loved this series of articles on designing and building a new house that feels like it’s been part of the landscape for many years.

    We wanted to give attention to aesthetic and functional features that are absent in many newer homes, such as thoughtful use of light, space, ventilation, and some passive cooling. It seemed like it would be possible to design something more attractive and with fewer complications than most new homes. We did not have a large budget, certainly not a typical “custom house” budget, but we figured that with some careful tradeoffs, and by doing some of the finishing ourselves, we could build something modestly beautiful.

Read More…

What Is an Inverted Index?

I’d always been slightly confused by the term “inverted index”. The term comes from the world of text search. In that domain, a “forward” index is used to look up a document and retrieve its words. The index that goes the other way — from the words to the documents — is an inversion of that forward index, so the practitioners in that environment called it an inverted index.

But in databases, all indexes are pointing in that direction. When a database indexes a field, what happens is the database creates a large ordered mapping of all the values of the field to the original row they came from — which is identical to the “words (field value) to documents (row)” direction in text search.

So why don’t we call all indexes inverted in database world? In the end, I don’t know why that meaning didn’t get adopted for all indexes in databases, and this distinction between “normal” and “inverted” indexes has been niggling at me for years. A few days ago I decided to try to figure it out once and for all, so I wouldn’t be sitting on the sidelines wondering whether I was just stupid, as people made grand statements about inverted indexes being a perfect solution to this or that problem.

So anyway, what I found was that you don’t need a huge brain to comprehend the difference. It seems that database people just took the phrase from text search but kind of did it wrong, because they use it to talk about a different property of the index than the text search people do. I think this was the root of my confusion.

In the database world, instead of using “inverted” to indicate whether the index is doc -> words (forwards) or word -> docs (inverted) like text search people do, database people use the term “inverted” to refer to indexes that break up the indexed value to generate one or more keys before indexing it. Text search inverted indexes break up the document into words before indexing the separate words, and this “breaking up” meaning was taken into the database world, rather than the direction meaning.

This seems to be all it is: inverted indexes are those that take apart a field’s value to make multiple values, and then use each of those new values as a key in the index.

Unlike text search, where you pretty much only break into words, in a database, you could do this in many ways. For example:

  • Like text search, break up a value into words and use the words as keys.
  • For an array, index each value in the array as a key.
  • For a JSON document, index each path and its value as a key.

I think it took me longer than most to get this distinction because I work on CouchDB. CouchDB’s normal index type, views, are pretty much inverted by this definition, because, to index a document, you transform the document into one or more keys to index using JavaScript functions written to the database. So it’s very natural to, for example, emit a key for each entry in an array. As CouchDB just calls these “indexes”, I figured there must be some other magic to inverted indexes.

(It turns out, of course, that there can be. As an example, Lucene is an advanced text search library based on inverted indexes. It has all sorts of tricks up its sleeves to make a search for documents containing “hat AND cat AND mat” super-fast by doing fancy things when intersecting the documents associated with each of hat, cat and mat. But I don’t think that’s a fundamental of inverted indexes for the purposes of this discussion.)

Many thanks to CockroachDB’s great guide-level explanation of inverted indexes and this discussion of filtering with inverted indexes for prompting this realization. In part, because the filtering with inverted indexes article talked about inverted indexes like they were obviously different, which set me on a quest to find out just what the distinguishing factor was.

Read More…