This is a brief tale, told mostly through links, about subtlety. And fsync, though perhaps the two are synonymous.

While I’m writing about this in September, the events actually happened around March; I intended to write them up then, but somehow never did.

Earlier this year, I read NULL BITMAP Builds a Database #2: Enter the Memtable. At the end, Justin Jaffray mentions a potential sad path when the database you are coding up (as one does) crashes. Here, we are talking about whether the database can accidentally lie to a reader about whether a write is on-disk (durable):

I do a write, and it goes into the log, and then the database crashes before we fsync. We come back up, and the reader, having not gotten an acknowledgment that their write succeeded, must do a read to see if it did or not. They do a read, and then the write, having made it to the OS’s in-memory buffers, is returned. Now the reader would be justified in believing that the write is durable: they saw it, after all. But now we hard crash, and the whole server goes down, losing the contents of the file buffers. Now the write is lost, even though we served it!

The solution is easy: just fsync the log on startup so that any reads we do are based off of data that has made it to disk.
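To make the ordering concrete, here is a minimal sketch of "fsync the log on startup, then read", written in Python purely for illustration. The function name `open_log_for_reads` and the idea of returning the raw bytes are my assumptions; they are not taken from Justin's database.

```python
import os


def open_log_for_reads(path):
    """Open the log, fsync it, and only then read its contents.

    Hypothetical helper: the name and return shape are illustrative.
    """
    # Open read-write; some platforms refuse fsync on a read-only descriptor.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Ask the OS to flush the file before we serve anything from it.
        os.fsync(fd)
        chunks = []
        while True:
            chunk = os.read(fd, 65536)
            if not chunk:
                break
            chunks.append(chunk)
        return fd, b"".join(chunks)
    except BaseException:
        os.close(fd)
        raise
```

The caller gets back the descriptor (to keep appending to the log) and the bytes that were read only after the fsync returned. Whether that fsync actually covers writes made by the previous, crashed process is exactly the question the rest of this post is about.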

If you’re anything like me, that will take you at least three reads to get the order of events straight in your head. But once I did, it felt right to me. As I work on a database, I thought I’d ask the team whether we did that. I was pretty sure we did, but it’s part of my job to double-check these things when I come across them.

Herewith, the story and the warning about subtlety.

Cloudant and CouchDB

While I work at Cloudant, the database underlying the Cloudant service is CouchDB, so the question went to the CouchDB Slack. Code was read, and a little while later we found that, yes, we do this; we fsync when we open the underlying data files.

Jan Lehnardt, a primary CouchDB committer, wrote about it: How CouchDB Prevents Data Corruption: fsync. All good, you might think. And in the main, if fsync succeeds, you have written your data to disk.

(One thing we all learned a few years ago, in PostgreSQL’s fsync() surprise [LWN.net], is that if fsync fails, your only hope is to crash immediately.)

But what does fsync really do?

Next I read fsync() after open() is an elaborate no-op. This post talks about the POSIX definition of fsync:

The fsync() function shall request that all data for the open file descriptor named by fildes is to be transferred to the storage device associated with the file described by fildes.

The key part here is that it is for a particular file descriptor. And when you restart your database, it will open the same file, but with a different descriptor, and so (per the fsync guarantees above) there’s no contract that your application has with the kernel that anything other than the writes from your (new) file descriptor end up on disk.

Which, given you only just opened the file, means no changes at all. So perhaps this fsync-on-startup procedure is just theatre.

But wait!

I came back to write this up a month or so later, and was searching for the post above about fsync being a no-op, and found it through its page on lobste.rs, fsync() after open() is an elaborate no-op | Lobsters.

On the lobste.rs page, a commenter notes that the man page for Linux tightens up what should happen on fsync:

fsync() transfers (“flushes”) all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device (or other permanent storage device) so that all changed information can be retrieved even if the system crashes or is rebooted.

So this definition says that anything in the file the file descriptor refers to should be written, which in this case should mean that our dirty page from way up above where we crashed should still end up on disk. Because our new file descriptor is opened on the same file.
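The sequence in question can be sketched as follows, again in Python and under stated assumptions: closing the first descriptor stands in for the crash, and the function name is made up. Note that nothing here can observe whether the dirty page really reached the disk; it only shows that the second descriptor refers to the same file (same inode) and that the fsync call goes through it.

```python
import os


def fsync_via_second_descriptor(path, payload):
    """Write through one descriptor, then fsync through a fresh one.

    Illustrative only: returns True if both descriptors referred to
    the same inode. It cannot prove the data reached the disk.
    """
    writer = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    os.write(writer, payload)  # lands in the page cache; no fsync here
    writer_ino = os.fstat(writer).st_ino
    os.close(writer)  # stand-in for the database process crashing

    # "Restart": a new descriptor on the same file.
    reopened = os.open(path, os.O_RDWR)
    try:
        same_file = os.fstat(reopened).st_ino == writer_ino
        # Per the Linux man page wording, this should flush the file's
        # dirty pages; per the POSIX wording, that is not clearly promised.
        os.fsync(reopened)
        return same_file
    finally:
        os.close(reopened)
```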

So, at least on some file systems on some operating systems, the fsync after open isn’t such a bad idea after all.

What have I learned?

First, that even super-smart people like Justin Jaffray can misunderstand just the same things that I misunderstand. Second, there’s a lot of detail underneath fsync. Third, that in the end, what actually happens is … implementation dependent.

My previous understanding was “fsync writes the dirty pages of the file to disk”, not limited to the particular file descriptor’s writes. And while it turns out that’s not a stupid thing to believe, because lots of other people believe it, it’s not true everywhere, and perhaps you shouldn’t rely on it. Which means that, for some weird edge-cases, your database code possibly can’t be 100% safe.

But you’re probably okay on Linux, although frankly you’d need to wade through the source code for your filesystem to be sure.

Mostly, I was reminded that the OS does a lot to hide the unpredictability of the physical hardware from you, but at some point, it can only do so much.

Further reading

If you enjoyed this post, you’ll also want to read these, which will leave you with a far better understanding of things than you got from this post. Afterwards you might be confident in the write path for that database you are coding up in your spare time just for fun.

Speaking of databases one might write just for fun, I’m not yet confident in toykv’s write path per all this information. For example, perhaps I should use an assert here and here so that when fsync fails, it forces a crash. I’m not sure whether that’s something I’d want a library to do — crash my app — but I guess if it’s a data safety issue, then perhaps I would want it to 🤔
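As a rough idea of what "force a crash when fsync fails" could look like, here is a small Python sketch. It is not toykv's code; `fsync_or_crash` and its `crash` parameter are hypothetical, and the parameter exists only so the behaviour can be exercised without actually killing the process. I've used an explicit check rather than `assert`, because Python strips asserts when run with `-O`.

```python
import os
import sys


def fsync_or_crash(fd, crash=os.abort):
    """fsync the descriptor; if that fails, crash rather than carry on.

    Hypothetical helper. The default `crash` is os.abort, which ends
    the process immediately without running cleanup handlers.
    """
    try:
        os.fsync(fd)
    except OSError as err:
        # After a failed fsync we can no longer trust what is on disk,
        # so do not retry and do not continue serving.
        sys.stderr.write(f"fsync failed: {err}; aborting\n")
        crash()
```

A library might prefer to surface the error and let the application decide, but the sketch shows the "crash immediately" option in its bluntest form.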

So, yes. Computers are hard.
