March Journal: it seemed to be all about AI again
It’s now April, so I can write my journal for March. Overall, I’m not sure whether that’s really the right thing — should I be writing the March journal as March progresses? — but it’s how things are this time around.
March was a second “AI month”:
- I added a bunch of stuff to rapport.
- I started a second AI project, ai-toys.
- I finished reading AI Engineering.
- I started reading Build a Large Language Model (from scratch).
- I wrote and launched an (internal-facing) AI app at work.
Let’s talk about each of these projects.
Rapport
I use Rapport daily with Anthropic’s models.
I spent $5.27 during March, and would have spent much less if I hadn’t been experimenting with writing `codeexplorer.py`, described below, which eats tokens like nobody’s business. Rapport is a cheap way to access Claude, and I feel I can recommend you try it out if you want a simple but nice app, where the code base is small enough that you can submit PRs to scratch your own itches.
I did a bunch of work during March:
- Changed the name to Rapport.
- Added support for images when using Anthropic.
- Interface cleanup:
  - Used Streamlit’s new file attachment function in the chat widget.
  - Cleaned up the somewhat intrusive `/include` docs that sat in the main UX by moving them into their own Help page.
  - Fixed a few places where Streamlit’s UI would “ghost”, where leftover UI components would hang about for a while.
- Made it into a proper package with an entrypoint so you can `uv run rapport` rather than `uv run streamlit run st_entrypoint.py`. I haven’t decided whether to upload to PyPI yet.
- Started to use Pydantic to serialise chats in the local chat storage (there’s a small sketch of this after the list). Although my particular serialisation problem is trivial, Pydantic seems to be the de facto serialisation library in Python, so learning it felt valuable.
- Added functions to copy chat to clipboard, save to gist and sync to Obsidian.
- Implemented a ton of bug fixes and refactoring.
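As a taste of the Pydantic serialisation mentioned above, here’s a minimal sketch of the idea. The model names and fields are my own guesses for illustration, not Rapport’s actual chat schema.

```python
# Minimal sketch: Pydantic models for a chat, plus round-tripping to JSON.
# These names/fields are illustrative, not Rapport's actual schema.
from datetime import datetime
from pathlib import Path

from pydantic import BaseModel


class ChatMessage(BaseModel):
    role: str  # "user" or "assistant"
    content: str


class Chat(BaseModel):
    title: str
    created_at: datetime
    messages: list[ChatMessage]


def save_chat(chat: Chat, path: Path) -> None:
    # model_dump_json handles datetimes and nested models for free.
    path.write_text(chat.model_dump_json(indent=2))


def load_chat(path: Path) -> Chat:
    # model_validate_json re-validates the stored data on the way back in.
    return Chat.model_validate_json(path.read_text())
```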
I’m very happy with how Rapport is growing in terms of features, and it’s still a great learning vehicle. However, now I’ve settled on Claude as a model, the Ollama support is atrophying somewhat. I might pick that back up at some point, but perhaps not, because the models I can run locally are rather limited by the 16GB of RAM in this laptop. I hadn’t planned on running models locally when I bought it, so I didn’t shell out for the extra RAM.
Right now, I’ve become more interested in what the cutting edge can do, and so I’m focused on features that help me explore that using Anthropic.
ai-toys
While chatbots are a common delivery mechanism for AI, many more use-cases place the AI behind a more traditional application interface. And, of course, there are more advanced AI use-cases and functions that don’t lend themselves to a general purpose chat interface like Rapport.
So I created the ai-toys repository to give me a place to experiment outside the chatbot UX. All of these toys have extremely bare bones interfaces. While Rapport is partly about building the UX for a daily tool, these are all about playing and experimentation, without much consideration for regular use, or encouraging other people to use them. They are about trying things out, and seeing what emerges.
So far I have three things:
- `copyeditor.py` - sends off your markdown to Claude for evaluation. The way I built it into my workflow is that when you start it, it scans for the most recently edited markdown file in the passed directory and loads that up automatically. In doing so, it streamlines getting feedback on dx13 posts to a single command.
- `cvevaluator.py` - an experiment in playing with prompts. You put in a markdown CV on the one side and a job description on the other. I added prompts for evaluation by busy hiring managers with tons of applicants and compared them with simpler “personality free” prompts that just said “Evaluate this CV against the job description”. By experimenting with data I readily understand — my tech CV, various job descriptions — I could vary the prompts and see how closely the results mapped to my own opinions on the matter.
- `codeexplorer.py` - this approaches the cutting edge a little more. In it I provide Claude with two tools: one to list directory contents and a second to read files. Then I ask for an explanation of the code it can see inside a directory, and watch Claude explore the code. While the other apps are built with Streamlit, this one is pure CLI. It prints out Claude’s use of the tools, and its eventual evaluation of the code (there’s a sketch of the loop after this list).

  This is my favourite toy. The code in `codeexplorer.py` in the repository is Anthropic-specific, but I’ve hacked in a few other models to see how they go about things. Right now, it’s clear that Sonnet 3.7 — Anthropic’s cutting edge — is significantly better than either Anthropic’s Haiku (smaller, older, cheaper) or my other go-to model, Llama 3.3 70B. Sonnet’s explanations are better, and it seems better at navigating the repository. Currently I’ve tried `codeexplorer.py` on rapport, but I’m eager to try larger codebases.

  In this toy, my new skills were tool use and cost improvements with prompt caching (which makes things both way faster and way cheaper). I also found a wonderful library for CLI applications in Python, Rich. It’s particularly good for AI apps because it will render markdown to the terminal.
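Here’s a stripped-down sketch of that tool-use loop, assuming the Anthropic Python SDK and Rich. The tool names, prompt and model alias are illustrative rather than the repository’s actual code.

```python
# Sketch of a codeexplorer-style loop: Claude gets two tools (list a
# directory, read a file) and explores a codebase until it's done.
# Assumes the `anthropic` and `rich` packages; names are illustrative.
from pathlib import Path

import anthropic
from rich.console import Console
from rich.markdown import Markdown

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
console = Console()

TOOLS = [
    {
        "name": "list_directory",
        "description": "List the files and directories at a path.",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
    {
        # In the real toy, adding "cache_control": {"type": "ephemeral"} to
        # the last tool definition enables Anthropic's prompt caching, which
        # is where most of the speed/cost win comes from.
        "name": "read_file",
        "description": "Return the contents of a file.",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
]


def run_tool(name: str, args: dict) -> str:
    if name == "list_directory":
        return "\n".join(p.name for p in Path(args["path"]).iterdir())
    return Path(args["path"]).read_text()


messages = [{"role": "user", "content": "Explain the code in ./rapport"}]
while True:
    response = client.messages.create(
        model="claude-3-7-sonnet-latest",  # model alias is an assumption
        max_tokens=4096,
        tools=TOOLS,
        messages=messages,
    )
    messages.append({"role": "assistant", "content": response.content})
    if response.stop_reason != "tool_use":
        break
    # Run every tool call Claude asked for and send the results back.
    results = []
    for block in response.content:
        if block.type == "tool_use":
            console.print(f"[dim]tool: {block.name}({block.input})[/dim]")
            results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": run_tool(block.name, block.input),
            })
    messages.append({"role": "user", "content": results})

# Rich renders Claude's final markdown explanation nicely in the terminal.
final_text = "".join(b.text for b in response.content if b.type == "text")
console.print(Markdown(final_text))
```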
Here’s `codeexplorer.py` with its evaluation of rapport:
Books
My favourite chapter in AI Engineering was chapter 9, about hosting an AI service for inference and training. I got to combine my knowledge/love for large infrastructure with my new fascination with LLMs.
One thing I ended up doing while reading was having conversations with Claude to help me expand on topics where the book didn’t go into as much detail as I’d hoped. Here’s an example, when the penny dropped for me about why prompt caching is good and what it involves, during a discussion where we were going deep(ish) on how inference prefill and decode worked:
(I find Claude’s “way to go!” manner of speaking a bit wearing sometimes, but generally like it.)
While AI Engineering was a great book, it was focused on making things with AI models, rather than making AI models. So I’ve started Build a Large Language Model (from scratch), which I suspect will be a much harder undertaking. When I’d finished AI Engineering, and discovered that my greatest interest was in that intersection between models and infrastructure, I wanted to start understanding how the models worked to a much greater extent, or, at least, in enough detail that I could visualise how to build the systems that host and train them. I think this is the right book for that, but I think it will take me quite a while to work through it as there are a lot of new concepts.
I really like the hands-on approach the book takes, and think that, if I can stick with it, I’ll come out the other side with a solid idea of how to train and run LLMs. That feels worth knowing.
Work
Over a couple of months, as a kind of “spare time” project given AI isn’t on my critical path work, I’ve taken ideas from Rapport and built an AI chatbot that’s focused on helping engineers at Cloudant. I wrote this because I believe that LLMs are capable enough without further tuning that they can be helpful in day-to-day work, and I wanted to build something to validate that belief.
In particular, I believe that it’s important to treat AI as part of a toolbelt to solve real problems (as opposed to treating AI as a solution and trying to find or create a “problem” to apply it to). So this project was born of the problem of helping engineers adopt our new ClickHouse-based observability system. SQL was unfamiliar to many members of the team, especially when you throw in aggregating into p99 latency metrics and the like.
It’s a run-of-the-mill text-to-SQL implementation using a chatbox user experience. Its only model customisation is a prompt that includes phrases like “You are an expert in ClickHouse” and “Generate a SQL query to solve the user’s question” alongside our table schemas. For bonus points I included a separate chat experience with a generic “you’re helping a programmer get their work done” prompt, to provide help with CLI tools, Python, etc. While simple, it’s still impressive how quickly I could put a useful solution together using Streamlit and a prompt.
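For a flavour of what that customisation amounts to, here’s a rough sketch of the prompt assembly. The schema and exact wording are placeholders rather than the real app’s prompt, and the watsonx call itself is left out.

```python
# Rough sketch of a text-to-SQL prompt; the schema below is a made-up
# placeholder, not the real observability tables.
TABLE_SCHEMAS = """
CREATE TABLE request_logs (
    ts DateTime,
    account String,
    verb LowCardinality(String),
    latency_ms Float64
) ENGINE = MergeTree ORDER BY ts;
"""

SYSTEM_PROMPT = f"""You are an expert in ClickHouse.
Generate a SQL query to solve the user's question.
Only use the tables described below.

{TABLE_SCHEMAS}
"""


def build_messages(question: str) -> list[dict]:
    # In the app, the Streamlit chat history would be spliced in between
    # the system prompt and the latest question before calling the model.
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]
```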
The app uses Llama 3.3 70B hosted on watsonx (because I work at IBM). Smaller models seemed significantly worse at SQL; I suspect the larger model has just encoded more “knowledge” about SQL during its training.
I only got this deployed on a server in the last few days (learning some new Chef and a lot of new systemd features in the process). I’m waiting to see what uptake, if any, it gets. If it’s popular, it might be a useful jumping off point in my wider quest at work to show that we should be starting to treat AI functionality as another baseline tool, rather than feeling fancy about using it.
Overall I got a lot done and learned a lot in March. I wonder what April will bring.