This is a brief tale, told mostly through links, about subtlety. And fsync, though perhaps the two are synonymous.
While I’m writing about this in September, the events actually happened back around March; I intended to write this up back then, but somehow it just never happened.
Earlier this year, I read NULL BITMAP Builds a Database #2: Enter the Memtable. At the end, Justin Jaffray mentions a potential sad path when the database you are coding up (as one does) crashes. Here, we are talking about whether the database can accidentally lie to a reader about whether a write is on-disk (durable):
I do a write, and it goes into the log, and then the database crashes before we fsync. We come back up, and the reader, having not gotten an acknowledgment that their write succeeded, must do a read to see if it did or not. They do a read, and then the write, having made it to the OS’s in-memory buffers, is returned. Now the reader would be justified in believing that the write is durable: they saw it, after all. But now we hard crash, and the whole server goes down, losing the contents of the file buffers. Now the write is lost, even though we served it!
The solution is easy: just fsync the log on startup so that any reads we do are based off of data that has made it to disk.
If you’re anything like me, that will take you at least three reads to get the order of events straight in your head. But once I did, it felt right to me. As I work on a database, I thought I’d ask the team whether we did that. I was pretty sure we did, but it’s part of my job to double-check these things when I come across them.
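For concreteness, here is a minimal sketch of the fsync-on-startup fix from the quote, written in Python. The function name open_log_for_recovery and the idea of a single log file are made up for illustration; this is not how CouchDB or any other particular database does it.

```python
import os


def open_log_for_recovery(path):
    """Open the log, fsync it, and only then read its contents.

    Hypothetical helper: returns (fd, data) so the caller can keep
    appending through the same descriptor.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Ask the kernel to flush the file before anything read from it
        # is served to a client. Whether this covers writes made through
        # an earlier descriptor is the question the rest of the post asks.
        os.fsync(fd)
        chunks = []
        while True:
            chunk = os.read(fd, 65536)
            if not chunk:
                break
            chunks.append(chunk)
    except BaseException:
        os.close(fd)
        raise
    return fd, b"".join(chunks)
```

The ordering is the whole point: the fsync comes before the first read, so a reader should never be shown data that the database has not at least tried to make durable.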
Herewith, the story and the warning about subtlety.
Cloudant and CouchDB
While I work at Cloudant, the database underlying the Cloudant service is CouchDB, so the question went to the CouchDB Slack. Code was read, and a little while later we found that, yes, we do this; we fsync when we open the underlying data files.
Jan Lehnardt, a primary CouchDB committer, wrote about it in How CouchDB Prevents Data Corruption: fsync.
All good, you might think. And in the main, if fsync succeeds, you have written your data to disk.
(One thing we all learned a few years ago, in PostgreSQL’s fsync() surprise [LWN.net], is that if fsync fails, your only hope is to crash immediately.)
But what does fsync really do?
Next I read fsync() after open() is an elaborate no-op. This post talks about the POSIX definition of fsync:
The fsync() function shall request that all data for the open file descriptor named by fildes is to be transferred to the storage device associated with the file described by fildes.
The key part here is that it is for a particular file descriptor. And when you restart your database, it will open the same file, but with a different descriptor, and so (per the fsync guarantees above) there’s no contract that your application has with the kernel that anything other than the writes from your (new) file descriptor end up on disk.
Which, given you only just opened the file, means no changes at all. So perhaps this read-then-fsync procedure is just theatre.
But wait!
I came back to write this up a month or so later, and was searching for the post above about fsync being a no-op, and found it through its page on lobste.rs, fsync() after open() is an elaborate no-op | Lobsters.
On the lobste.rs page, a commenter notes that the man page for Linux tightens up what should happen on fsync:
fsync() transfers (“flushes”) all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device (or other permanent storage device) so that all changed information can be retrieved even if the system crashes or is rebooted.
So this definition says that anything in the file the file descriptor refers to should be written. In our case that should mean the dirty page from way up above, where we crashed, still ends up on disk, because our new file descriptor is opened on the same file.
So, at least on some file systems on some operating systems, the fsync after read isn’t such a bad idea after all.
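The "same file, different descriptor" distinction that both definitions hinge on is easy to see for yourself. The sketch below is only an illustration, with a made-up function name. It opens one path twice and compares what the kernel reports for each descriptor. It cannot show whether the final fsync actually flushes the earlier write; that depends on the operating system and file system, as discussed above.

```python
import os


def same_file_different_descriptor(path):
    """Return (same_file, different_fd) for two opens of one path."""
    # Stand-in for the database before the crash: write, never fsync.
    old_fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(old_fd, b"unsynced write\n")
        # Stand-in for the restarted database: a fresh open() of the path.
        new_fd = os.open(path, os.O_RDWR)
        try:
            old_stat = os.fstat(old_fd)
            new_stat = os.fstat(new_fd)
            # Same device and inode means both descriptors name one file.
            same_file = (old_stat.st_dev, old_stat.st_ino) == (
                new_stat.st_dev,
                new_stat.st_ino,
            )
            different_fd = old_fd != new_fd
            # The disputed call. Read the POSIX wording narrowly and this
            # covers data "for the open file descriptor"; the Linux man
            # page says it covers modified data of the file itself.
            os.fsync(new_fd)
            return same_file, different_fd
        finally:
            os.close(new_fd)
    finally:
        os.close(old_fd)
```

In a real restart the old descriptor is gone along with the crashed process; both are kept open here only so they can be compared side by side.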
What have I learned?
First, that even super-smart people like Justin Jaffray can misunderstand just the same things that I misunderstand. Second, there’s a lot of detail underneath fsync. Third, that in the end, what actually happens is … implementation dependent.
My previous understanding was “fsync writes the dirty pages of the file to disk”, not limited to the particular file descriptor’s writes. And while it turns out that’s not a stupid thing to believe, because lots of other people believe it, it’s not true everywhere, and perhaps you shouldn’t rely on it.
Which means that, for some weird edge-cases, your database code possibly can’t
be 100% safe.
But you’re probably okay on Linux, although frankly you’d need to wade through the source code for your filesystem to be sure.
Mostly, I was reminded that the OS does a lot to hide the unpredictability of the physical hardware from you, but at some point, it can only do so much.
Further reading
If you enjoyed this post, you’ll also want to read these, which will leave you with a far better understanding of things than you got from this post. Afterwards you might be confident in the write path for that database you are coding up in your spare time just for fun.
- Files are fraught with peril – an excellent article that’s worth your time.
- PostgreSQL’s fsync() surprise [LWN.net] – you probably didn’t read it when it was embedded in the text above, but it’s worth digesting because of the ramifications of what fsync does in the real world when it encounters an error.
- Userland Disk I/O – a good set of things to know and consider if you care about writes.
- Darwin’s Deceptive Durability – Same blog as above, more interesting tidbits.
Speaking of databases one might write just for fun, I’m not yet confident in toykv’s write path per all this information. For example, perhaps I should use an assert here and here so that when fsync fails, it forces a crash. I’m not sure whether that’s something I’d want a library to do (crash my app), but I guess if it’s a data safety issue, then perhaps I would want it to 🤔
So, yes. Computers are hard.