Journal July 2024: Helix, dprint, ClickHouse and tree shapes

A few quick-fire notes which might be of interest to others now, and myself in the future.

When I started using the Helix editor a year ago, I wasn’t sure how long I’d stick with it. And yet here I am writing this post in Helix, and still using it at work. It’s crashed four or five times — in a year — but has overall proven very stable and capable. Its development moves a bit more slowly than I’d like, but really, I’m very happy with the editor. It starts instantly, LSP plus tree-sitter still proves a winning combination, and the improvements that have arrived are solid.

One thing I’ve been searching for is a fast formatter for web languages, specifically the ones used in Hugo and Jekyll sites: Markdown, templated HTML and CSS in the main. Tools like Prettier tend to be noticeably slow to kick in and format unless one is willing to pay the price of a constantly running server. I’ve been using deno fmt for a while for Markdown, but it doesn’t handle CSS or HTML. So now I’m trying dprint, which has built-in formatting for all three languages I wanted. It turns out that deno fmt actually uses some dprint formatters under the hood, specifically for Markdown. I like finally having a CSS formatter, although since I moved to Tailwind this has been less important. (I still really like Tailwind.)

In August 2023’s journal, I mentioned using ClickHouse in a PoC. That PoC recently went to production, and following our pre-production ramp-up we now have over 100TB of data stored in ClickHouse, ingesting more than a billion rows a day. Throughout our build-out, ClickHouse has continued to impress me, coping smoothly with each bump in data volume, and querying has remained efficient. We may need to beef up our hardware a little as we start using it more in earnest, but the simple, vertically scaled, replicated architecture we are using seems solid 🤞

We went for a walk in a small piece of woodland near Bristol today, Leigh Woods. I loved the shapes within the branches of one tree in particular.

A neat way of thinking about generative AI for your products

I think this is one of the more concise tellings of how to think about where the current crop of LLM-based AI models can be useful in a product:

There is something of a trend for people (often drawing parallels with crypto and NFTs) to presume that [incorrect answers] means these things are useless. That is a misunderstanding. Rather, a useful way to think about generative AI models is that they are extremely good at telling you what a good answer to a question like [the one you asked] would probably look like. There are some use-cases where ‘looks like a good answer’ is exactly what you want, and there are some where ‘roughly right’ is ‘precisely wrong’.

Building AI products — Benedict Evans

So, the question to ask when looking for a good product fit: is the upside from shortening the time it takes to get something that looks right (and can be fixed up) greater than the downside of inaccuracy or downright falsehoods?

Some other recommended reading from the same author:

  • Apple intelligence and AI maximalism — Benedict Evans

    But meanwhile, if you step back from the demos and screenshots and look at what Apple is really trying to do, Apple is pointing to most of the key questions and points of leverage in generative AI, and proposing a thesis for how this is going to work that looks very different to all the hype and evangelism.

  • Looking for AI use-cases — Benedict Evans

    We’ve had ChatGPT for 18 months, but what’s it for? What are the use-cases? Why isn’t it useful for everyone, right now? Do Large Language Models become universal tools that can do ‘any’ task, or do we wrap them in single-purpose apps, and build thousands of new companies around that?

How SSH works, under the hood

I have used SSH as a way to get a shell on a remote machine for over twenty years, but I’ve never given that much thought to how the protocol works. In retrospect, I find this a little surprising as I tend to love this stuff.

But I got a chance to dig into it at work recently. In doing so, I found that my remote shells were using a significantly more sophisticated protocol than I’d imagined. Rather than being specific to shells, SSH turns out to be a general-purpose, multiplexing, secure connection protocol, whose killer app appears to have been remote shells. I wanted to write a bit about it, to cement my understanding and give an introduction to the power SSH has.

The aim of this post is to give a working understanding of how SSH works one level down from how we typically see it. We won’t cover setting up the SSH connection itself, but we will cover how the SSH client asks the server to do things like open a shell or run a program, and how data moves between the two.
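
To make ‘asks the server to do things’ concrete: once a session channel is open, running a command amounts to the client sending a channel request of type exec naming the command (RFC 4254, section 6.5). Here’s a sketch of the bytes inside that message. It’s my own illustration (in Rust, though below we’ll use Go), and it shows only the plaintext message body, not the encrypted packet framing that wraps it on the wire.

```rust
// Sketch of an SSH "exec" channel request (RFC 4254, section 6.5).
// Illustrative only: a real client encrypts and packet-frames this.

/// SSH wire "string": a uint32 big-endian length, then the bytes.
fn put_string(buf: &mut Vec<u8>, s: &[u8]) {
    buf.extend_from_slice(&(s.len() as u32).to_be_bytes());
    buf.extend_from_slice(s);
}

/// Build an SSH_MSG_CHANNEL_REQUEST asking the server to run `command`.
fn exec_request(recipient_channel: u32, command: &str) -> Vec<u8> {
    let mut msg = Vec::new();
    msg.push(98); // SSH_MSG_CHANNEL_REQUEST
    msg.extend_from_slice(&recipient_channel.to_be_bytes());
    put_string(&mut msg, b"exec"); // request type
    msg.push(1); // want-reply: ask the server to confirm
    put_string(&mut msg, command.as_bytes());
    msg
}

fn main() {
    // Roughly what `ssh mike@myserver.com ls -l /` sends once its
    // session channel is open.
    println!("{:?}", exec_request(0, "ls -l /"));
}
```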

Go has an SSH client and server in its extended standard library, golang.org/x/crypto/ssh. We can use this to explore the SSH protocol in more detail. We’ll do that by building a simple SSH server that can run a single command, as when we run ssh mike@myserver.com ls -l / — i.e., list the root directory of the remote server. As we do this, we’ll log activity around SSH’s underlying primitives to peek under the covers.

Read More…

toykv updates

Since my last post, I’ve committed a couple of updates to toykv. The first is a nice functionality update; the second was enabled by my learning a little more Rust.

  • Implement delete · mikerhodes/toykv@2325ff1

    With this update, toykv gains a delete(key) method. I did this in essentially the way I outlined previously: by adding a KVValue enum that holds either a value or Deleted, then threading this through the library, including serialisation. (There’s a rough sketch of the idea in code just after this list.)

  • Use generics with Into<Vec<u8>> for API ergonomics · mikerhodes/toykv@53dc863

    This is a nice update that reduces the amount of boilerplate, especially in tests.

    Previously there was quite a bit of get("foo".as_bytes().to_vec()) code, which has now been reduced to get("foo") by making the get, set and delete methods generic.

    I think this could be further improved using Cow, which would avoid some unneeded cloning of data. But that is for later.
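
To make those two changes concrete, here’s a small self-contained sketch of the resulting API shape. This is my own illustrative reconstruction rather than toykv’s actual code: it uses an in-memory map in place of real on-disk persistence, and apart from KVValue the names are approximations of the real API.

```rust
use std::collections::HashMap;

// A deleted key is recorded explicitly as a tombstone, so that the
// delete survives serialisation just like any other record.
enum KVValue {
    Some(Vec<u8>),
    Deleted,
}

// Stand-in for toykv's real persistent storage.
struct ToyKV {
    data: HashMap<Vec<u8>, KVValue>,
}

impl ToyKV {
    fn new() -> Self {
        ToyKV { data: HashMap::new() }
    }

    // `Into<Vec<u8>>` lets callers pass &str, String, &[u8] or Vec<u8>,
    // which is what removes the `.as_bytes().to_vec()` boilerplate.
    fn set<K: Into<Vec<u8>>, V: Into<Vec<u8>>>(&mut self, k: K, v: V) {
        self.data.insert(k.into(), KVValue::Some(v.into()));
    }

    fn delete<K: Into<Vec<u8>>>(&mut self, k: K) {
        // Write a tombstone rather than removing the entry.
        self.data.insert(k.into(), KVValue::Deleted);
    }

    fn get<K: Into<Vec<u8>>>(&self, k: K) -> Option<Vec<u8>> {
        match self.data.get(&k.into()) {
            Some(KVValue::Some(v)) => Some(v.clone()),
            _ => None, // never written, or deleted
        }
    }
}

fn main() {
    let mut db = ToyKV::new();
    db.set("foo", "bar"); // rather than set("foo".as_bytes().to_vec(), ...)
    assert_eq!(db.get("foo"), Some(b"bar".to_vec()));
    db.delete("foo");
    assert_eq!(db.get("foo"), None);
}
```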

Obviously the library is still a learning aid, but it’s getting closer to at least having the functionality you would want in a real storage layer.

Writing bits to disk: toykv

In the early part of my experiments in writing a very simple document database, I thought that writing the bits to disk would be too difficult for me. And perhaps it was, back then. But I think things have changed.

As I wrote more of the document database, I got more experience transforming JSON data into key and value byte arrays that could be saved, retrieved and indexed. And as I got more comfortable with the tiny data formats I was creating, particularly for the indexes, I began to think, “why not, I bet I could write the code that organises these on disk too”.
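
To give a flavour of what I mean by tiny data formats, here is a hypothetical sketch of the kind of index-key encoding involved. This is purely illustrative and not the database’s actual format: index_key and the tag value are names and numbers I’ve made up for the example.

```rust
// Purely illustrative index-key encoding; not the database's real format.
// Keys in a key-value store sort lexicographically as raw bytes, so the
// encoding must make byte order agree with the value order we want.
fn index_key(path: &str, type_tag: u8, value: &[u8], doc_id: &str) -> Vec<u8> {
    let mut key = Vec::new();
    key.extend_from_slice(path.as_bytes());
    key.push(0x00); // separator; assumes 0x00 never appears in a path
    key.push(type_tag); // tags can order nulls < booleans < numbers < strings
    key.extend_from_slice(value);
    key.push(0x00); // separator before the document ID
    key.extend_from_slice(doc_id.as_bytes());
    key
}

fn main() {
    // The index entry for `name == "alice"` in document doc-1 becomes one
    // opaque byte-array key that sorts alongside other "name" values.
    let key = index_key("name", 0x2b, b"alice", "doc-1");
    println!("{:?}", key);
}
```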

Of course, just as my document database is nothing like Elasticsearch, my on-disk formats would be nowhere near as advanced as those of sled, the key-value storage layer I had been using in my document database.

(I still think that I’d need a bunch more practice in this space to be able to write something production ready; but I am starting to think that I need not be a Better Programmer, but instead merely More Experienced In This Subject. Perhaps, in the end, they mean the same thing).

But I thought it would have a lot of value, to me at least, because I believed I’d learn a lot from writing even the simplest, silliest key-value storage engine. And I think I have. To try to cement that knowledge in my head, I’m going to write a whistlestop tour of that code in this post.

Read More…