What Is an Inverted Index?

I’d always been slightly confused by the term “inverted index”. The term comes from the world of text search. In that domain, a “forward” index is used to look up a document and retrieve its words. The index that goes the other way — from the words to the documents — is an inversion of that forward index, so the practitioners in that environment called it an inverted index.

But in databases, all indexes are pointing in that direction. When a database indexes a field, what happens is the database creates a large ordered mapping of all the values of the field to the original row they came from — which is identical to the “words (field value) to documents (row)” direction in text search.

So why don’t we call all indexes inverted in database world? In the end, I don’t know why that meaning didn’t get adopted for all indexes in databases, and this distinction between “normal” and “inverted” indexes has been niggling at me for years. A few days ago I decided to try to figure it out once and for all, so I wouldn’t be sitting on the sidelines wondering whether I was just stupid, as people made grand statements about inverted indexes being a perfect solution to this or that problem.

So anyway, what I found was that you don’t need a huge brain to comprehend the difference. It seems that database people just took the phrase from text search but kind of did it wrong, because they use it to talk about a different property of the index than the text search people do. I think this was the root of my confusion.

In the database world, instead of using “inverted” to indicate whether the index is doc -> words (forwards) or word -> docs (inverted) like text search people do, database people use the term “inverted” to refer to indexes that break up the indexed value to generate one or more keys before indexing it. Text search inverted indexes break up the document into words before indexing the separate words, and this “breaking up” meaning was taken into the database world, rather than the direction meaning.

This seems to be all it is: inverted indexes are those that take apart a field’s value to make multiple values, and then use each of those new values as a key in the index.

Unlike text search, where you pretty much only break into words, in a database, you could do this in many ways. For example:

  • Like text search, break up a value into words and use the words as keys.
  • For an array, index each value in the array as a key.
  • For a JSON document, index each path and its value as a key.

I think it took me longer than most to get this distinction because I work on CouchDB. CouchDB’s normal index type, views, are pretty much inverted by this definition, because, to index a document, you transform the document into one or more keys to index using JavaScript functions written to the database. So it’s very natural to, for example, emit a key for each entry in an array. As CouchDB just calls these “indexes”, I figured there must be some other magic to inverted indexes.

(It turns out, of course, that there can be. As an example, Lucene is an advanced text search library based on inverted indexes. It has all sorts of tricks up its sleeves to make a search for documents containing “hat AND cat AND mat” super-fast by doing fancy things when intersecting the documents associated with each of hat, cat and mat. But I don’t think that’s a fundamental of inverted indexes for the purposes of this discussion.)

Many thanks to CockroachDB’s great guide-level explanation of inverted indexes and this discussion of filtering with inverted indexes for prompting this realization. In part, because the filtering with inverted indexes article talked about inverted indexes like they were obviously different, which set me on a quest to find out just what the distinguishing factor was.

Read More…

Link: Worst Practices of Software Development

Nothing earth-shattering, but I enjoyed Worst Practices of Software Development.

Brett Slatkin, author of Effective Python, briefly interviews several famous faces of software development — like Julia Evans and Mitchell Hashimoto — about a “worst practice” that they are happy to embody. It’s good to have people go on record to remind everyone that they need not do all — or even most of — the “right things” to be successful.

Read More…

Journal: watchOS 10

Apple flagged watchOS 10 as “the biggest update since the introduction of Apple Watch”. And it is quite a large set of changes. This is a quick post about how one change made a big difference to the day-to-day UX of my watch.

Read More…

Big Trees of Bristol

Just south of Bristol, next to the small village of Failand, is a patch of woodland. We were surprised to find that in the middle of this woodland is “Big Tree Grove”, a stand of ten or so sequoia trees (“Giant Redwoods”). While these are not so giant as those in California, they are still impressive to be amongst.

Read More…

Links: September 2023

The challenges of building things with language models, problems with peer reviewing of scientific papers, MongoDB’s new query execution engine and comparing operational automation with Iron Man.

Enjoy.

  • Squish Meets Structure: Designing with Language Models

    I really enjoyed Maggie Appleton’s talk about designing with large language models. It’s a fascinating discussion of how we are trying to understand how to work with these new tools, which work in different ways to how we are used to.

    We’re trying to make an unpredictable and opaque system adhere to our rigid expectations for how computers behave. We currently have a mismatch between our old and new mental models for computer systems.

    And the slides are somehow beautiful.

  • The rise and fall of peer review - by Adam Mastroianni

    Peer review in the sciences has come in for a bit of a bludgeoning recently. This piece argues that it’s adding little value. If anything, that peer review is actively hostile to new ideas and serves mostly to entrench existing hierarchies (and a large publishing industry).

    Can we do better?

  • Inside New Query Engine of MongoDB | Nikita Lapkov

    In my day job at Cloudant, I think a lot about how we could make our database better. I enjoy any and all deep dives into how other systems work. I find it helps create a library of patterns that I can later match against as I dig into problems.

    Here we learn how MongoDB created an idea called slots, which they used to significantly improve the efficiency of moving data values through their query execution pipeline.

  • Automation Should Be Like Iron Man, Not Ultron - ACM Queue

    Also at Cloudant, we’ve found automation that increases and magnifies the abilities of our operators to be the best kind of automation. It allows us to manage more and more machines, while also increasing the flexibility we have in managing load across the system. Because it is magnifying our abilities, instead of replacing them, it allows us to maintain understanding of the system. Our understanding comes in very useful when things go wrong that automation does not cover.

    This article sits well with this experience, and I can see myself referring to it over time to explain our approach.

Read More…