I have written several posts about this project, and collected them under the database-diy tag.
During 2023-24, my ongoing side project has been slowly building a very simple, naive database more or less from scratch. This is purely a learning exercise: I wanted some experience of writing a database from the storage layer up.
The code I’ve written during this project, ToyKV in particular, has helped me understand far more deeply what I read in papers and in other databases’ source code and documentation. By diving down through levels of abstraction, I’ve vastly improved the mental models I use to understand and predict the behaviour of all types of databases.
There are three main codebases I’ve written as part of this:
The first was a Go codebase that implements all-field indexing for JSON data and supports simple queries over that data. It’s inefficient; the next stage would be a query planner. I mostly wrote this in late 2023.
In early 2024, I ported that Go codebase to Rust. It has similar features.
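To make "all-field indexing" concrete, here is a minimal Rust sketch. It is not the code from either docdb codebase: the `Json` enum, `FieldIndex`, the dotted-path naming and the type-tagged value encoding are all assumptions made for illustration, arrays are left out, and the index lives in an in-memory `BTreeMap` rather than in an underlying key-value store. The idea it shows is that every leaf field of a document becomes an index entry, so an equality query on any field is a single lookup.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// A tiny stand-in for a JSON value, so the sketch needs no external crates.
/// (Hypothetical type; arrays are omitted to keep the example short.)
#[derive(Debug, Clone)]
enum Json {
    Null,
    Bool(bool),
    Num(f64),
    Str(String),
    Object(Vec<(String, Json)>),
}

/// Inverted index: (field path, encoded value) -> ids of matching documents.
#[derive(Default)]
struct FieldIndex {
    entries: BTreeMap<(String, String), BTreeSet<u64>>,
}

impl FieldIndex {
    /// Index every leaf field of `doc` under its dotted path.
    fn insert(&mut self, doc_id: u64, doc: &Json) {
        let mut leaves = Vec::new();
        flatten("", doc, &mut leaves);
        for key in leaves {
            self.entries.entry(key).or_default().insert(doc_id);
        }
    }

    /// Equality query: which documents have `path == value`?
    fn lookup(&self, path: &str, value: &Json) -> BTreeSet<u64> {
        let key = (path.to_string(), encode(value));
        self.entries.get(&key).cloned().unwrap_or_default()
    }
}

/// Walk the document, emitting one (path, encoded value) pair per leaf.
fn flatten(prefix: &str, value: &Json, out: &mut Vec<(String, String)>) {
    match value {
        Json::Object(fields) => {
            for (name, child) in fields {
                let path = if prefix.is_empty() {
                    name.clone()
                } else {
                    format!("{}.{}", prefix, name)
                };
                flatten(&path, child, out);
            }
        }
        leaf => out.push((prefix.to_string(), encode(leaf))),
    }
}

/// Encode a leaf with a type tag so the string "1" and the number 1 differ.
fn encode(value: &Json) -> String {
    match value {
        Json::Null => "n:".to_string(),
        Json::Bool(b) => format!("b:{}", b),
        Json::Num(n) => format!("f:{}", n),
        Json::Str(s) => format!("s:{}", s),
        // Objects are never stored as leaves, so this key matches nothing.
        Json::Object(_) => "o:".to_string(),
    }
}
```

Indexing every field makes lookups cheap but writes and storage expensive, and a query with several conditions still has to pick which index entries to read first. That choice is the job a query planner would take on.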
Both the Go and Rust docdb versions used someone else’s underlying data storage. The largest codebase in this project is toykv, where I have started to build a super-naive storage engine. This is the part that’s newest to me.
Toykv starts by defining a data format for individual key-value records, each of which is a sequence of bytes. From that it builds up an LSM-like storage format based on an in-memory memtable and on-disk sstables. I did a bunch of work on this in early 2024. Recently, in late 2024, I’ve picked it up for an hour here and there to work on a scan method, which is key to both range searching and the compaction operation that is central to LSM efficiency.
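The relationship between the memtable, sstables, scan and compaction can be sketched in a few dozen lines of Rust. This is an illustrative toy under stated assumptions, not toykv's actual code or API: `ToyLsm` and its methods are hypothetical names, the "sstables" are plain sorted vectors held in memory instead of files, the byte-level record format is skipped entirely, and the merge is done naively by collecting into a `BTreeMap` rather than by streaming a k-way merge.

```rust
use std::collections::BTreeMap;

/// `None` marks a deletion (a tombstone) that hides older values.
type Entry = Option<Vec<u8>>;

/// A toy LSM store: one mutable memtable plus immutable sorted runs.
#[derive(Default)]
struct ToyLsm {
    memtable: BTreeMap<Vec<u8>, Entry>,
    /// Stand-ins for on-disk sstables, oldest first, each sorted by key.
    sstables: Vec<Vec<(Vec<u8>, Entry)>>,
}

impl ToyLsm {
    fn put(&mut self, key: &[u8], value: &[u8]) {
        self.memtable.insert(key.to_vec(), Some(value.to_vec()));
    }

    fn delete(&mut self, key: &[u8]) {
        self.memtable.insert(key.to_vec(), None);
    }

    /// Freeze the memtable into a new sorted run, as a flush to disk would.
    fn flush(&mut self) {
        let run: Vec<_> = std::mem::take(&mut self.memtable).into_iter().collect();
        if !run.is_empty() {
            self.sstables.push(run);
        }
    }

    /// Return live key-value pairs with `start <= key < end`, in key order.
    fn scan(&self, start: &[u8], end: &[u8]) -> Vec<(Vec<u8>, Vec<u8>)> {
        let in_range = |k: &[u8]| start <= k && k < end;
        let mut newest: BTreeMap<&[u8], &Entry> = BTreeMap::new();
        // Apply sources oldest to newest so newer entries overwrite older ones.
        for run in &self.sstables {
            for (k, v) in run {
                if in_range(k.as_slice()) {
                    newest.insert(k.as_slice(), v);
                }
            }
        }
        for (k, v) in &self.memtable {
            if in_range(k.as_slice()) {
                newest.insert(k.as_slice(), v);
            }
        }
        // Drop tombstones: a deleted key must not appear in the result.
        newest
            .into_iter()
            .filter_map(|(k, v)| v.as_ref().map(|val| (k.to_vec(), val.clone())))
            .collect()
    }

    /// Full compaction: merge every run into one, keeping only the newest
    /// entry per key. Tombstones can be dropped because no older run remains.
    fn compact(&mut self) {
        let mut merged: BTreeMap<Vec<u8>, Entry> = BTreeMap::new();
        for run in self.sstables.drain(..) {
            merged.extend(run); // later (newer) runs overwrite earlier ones
        }
        let live: Vec<_> = merged.into_iter().filter(|(_, v)| v.is_some()).collect();
        self.sstables = if live.is_empty() { Vec::new() } else { vec![live] };
    }
}
```

Both `scan` and `compact` are the same ordered, newest-wins merge over sorted sources; `scan` returns the result to the caller, while `compact` writes it back as a single run. Without compaction, every read has to consult an ever-growing pile of runs, which is why the merge matters so much for LSM efficiency.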
Overall, all these database DIY projects progress very slowly. I’ve probably spent 30-50 hours in total, spread over at least a year so far. Still, it’s one of my favourite projects, as I’ve learned a ton (including Rust!).