What Is an Inverted Index?
I’d always been slightly confused by the term “inverted index”. The term comes from the world of text search. In that domain, a “forward” index is used to look up a document and retrieve its words. The index that goes the other way — from the words to the documents — is an inversion of that forward index, so the practitioners in that environment called it an inverted index.
But in databases, all indexes are pointing in that direction. When a database indexes a field, what happens is the database creates a large ordered mapping of all the values of the field to the original row they came from — which is identical to the “words (field value) to documents (row)” direction in text search.
So why don’t we call all indexes inverted in database world? In the end, I don’t know why that meaning didn’t get adopted for all indexes in databases, and this distinction between “normal” and “inverted” indexes has been niggling at me for years. A few days ago I decided to try to figure it out once and for all, so I wouldn’t be sitting on the sidelines wondering whether I was just stupid, as people made grand statements about inverted indexes being a perfect solution to this or that problem.
So anyway, what I found was that you don’t need a huge brain to comprehend the difference. It seems that database people just took the phrase from text search but kind of did it wrong, because they use it to talk about a different property of the index than the text search people do. I think this was the root of my confusion.
In the database world, instead of using “inverted” to indicate whether the index
is doc -> words
(forwards) or word -> docs
(inverted) like text search
people do, database people use the term “inverted” to refer to indexes that
break up the indexed value to generate one or more keys before indexing it. Text
search inverted indexes break up the document into words before indexing the
separate words, and this “breaking up” meaning was taken into the database
world, rather than the direction meaning.
This seems to be all it is: inverted indexes are those that take apart a field’s value to make multiple values, and then use each of those new values as a key in the index.
Unlike text search, where you pretty much only break into words, in a database, you could do this in many ways. For example:
- Like text search, break up a value into words and use the words as keys.
- For an array, index each value in the array as a key.
- For a JSON document, index each path and its value as a key.
I think it took me longer than most to get this distinction because I work on CouchDB. CouchDB’s normal index type, views, are pretty much inverted by this definition, because, to index a document, you transform the document into one or more keys to index using JavaScript functions written to the database. So it’s very natural to, for example, emit a key for each entry in an array. As CouchDB just calls these “indexes”, I figured there must be some other magic to inverted indexes.
(It turns out, of course, that there can be. As an example, Lucene is an advanced text search library based on inverted indexes. It has all sorts of tricks up its sleeves to make a search for documents containing “hat AND cat AND mat” super-fast by doing fancy things when intersecting the documents associated with each of hat, cat and mat. But I don’t think that’s a fundamental of inverted indexes for the purposes of this discussion.)
Many thanks to CockroachDB’s great guide-level explanation of inverted indexes and this discussion of filtering with inverted indexes for prompting this realization. In part, because the filtering with inverted indexes article talked about inverted indexes like they were obviously different, which set me on a quest to find out just what the distinguishing factor was.