How long is a sentence?

As with many things in computer science, the answer is: it depends.

Back in the depths of time, computing as we know it was invented by scientists who didn’t think much about the unlikely possibility that the lives of people the world over would become inextricably entwined with computing devices. Computers were mostly thought to be useful as glorified calculating machines, and so representing text wasn’t thought to be so important. We therefore ended up stuck with the letters that would fit comfortably into just under a byte: 7 bits, or 128 different characters. As much of this work was done in the US, this was fine, because the English alphabet needs few characters. And so we ended up with a laughably bad way of representing characters for most of the world.

In 1992, the Unicode Consortium just about wrapped this one up when it published an enormous tome detailing how the characters from all the writing systems of the world should be represented within a computing system. However, programmers still need to think about this every day, as most programming languages have decidedly broken representations of strings of characters, or strings for short. Twenty years on, you’d have hoped high-level languages would have abstracted this away.

Unicode describes each character as a code point. It then describes how to represent sequences of characters, or code points, as sequences of bytes — bytes being what computers tend to exchange between each other. Various different methods of representation — encoding schemes — have pros and cons, mostly around the speed at which strings can be processed and the amount of space they take up in computer memory.
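To make that concrete, here is a minimal Python 3 sketch showing one string as code points and then as bytes under three encoding schemes. The sample text is my own choice for illustration; the sizes simply fall out of how each scheme works.

```python
# "héllo 世界", written with escapes so the source file's own encoding cannot matter.
text = "h\u00e9llo \u4e16\u754c"

# Each character is a code point, which is just a number.
code_points = [ord(ch) for ch in text]

# The same code points become different byte sequences under each encoding scheme.
utf8 = text.encode("utf-8")       # 1 to 4 bytes per code point
utf16 = text.encode("utf-16-le")  # 2 or 4 bytes per code point
utf32 = text.encode("utf-32-le")  # always 4 bytes per code point

print(len(text), len(utf8), len(utf16), len(utf32))
```

The trade-off shows up straight away: the fixed-width scheme is the simplest to index into but the largest, while UTF-8 is compact for Latin text and grows for everything else.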

Unhelpfully, different programming languages support different representations internally, so much work still needs to be done by the programmer. A common question: what encoding scheme does the program I am talking to expect and produce? If you are lucky, this will be well defined somewhere or the program will let you know along with each message. Otherwise, the only answer is to try out some combinations in order of most to least likely, and accept the first encoding that doesn’t appear to have an error. Not exactly a pleasing solution. The core point is that characters are represented by one or more bytes in sequence in a computer’s memory.
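The try-it-and-see approach can be sketched in a few lines of Python 3. The function name `guess_decode` and the candidate list are assumptions of mine for illustration, not a recommended ordering for any particular system.

```python
def guess_decode(data, candidates=("ascii", "utf-8", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes without error.

    Hypothetical helper: the candidate order is an assumption, most likely first.
    """
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this guess was wrong, try the next one
    raise ValueError("none of the candidate encodings could decode the data")


print(guess_decode("caf\u00e9".encode("utf-8")))
print(guess_decode("caf\u00e9".encode("latin-1")))
```

Note the weakness: Latin-1 maps every possible byte to a character, so it never raises an error. Decoding without an error is not the same as decoding correctly, which is why this remains a last resort.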

And so to the length. As a string is ultimately an array of bytes, the fastest way to measure it is to count the number of bytes. As many characters take more than one byte, this will often give an incorrect number of characters. Counting characters is slower, but more accurate. Mixing the two up is problematic: a method that expects a length in bytes being called by a method that provides the length in characters is both common and potentially data destroying. Sobering.
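Here is a small Python 3 sketch of that mix-up, assuming the text is stored as UTF-8. The variable names are mine; the bug is the slice that uses a character count where a byte count was expected.

```python
text = "\u4e16\u754c"        # two Chinese characters
data = text.encode("utf-8")  # the same text as bytes

char_count = len(text)  # counts code points
byte_count = len(data)  # counts bytes

# The bug: a length in characters is handed to code that works in bytes.
truncated = data[:char_count]

try:
    truncated.decode("utf-8")
    survived = True
except UnicodeDecodeError:
    # The slice cut the first character in half, so the bytes are no longer valid UTF-8.
    survived = False

print(char_count, byte_count, survived)
```

With plain English text the two counts happen to agree, which is exactly why this kind of bug can sit unnoticed until the first non-English user turns up.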

Where this comes into contact with day-to-day life most vividly is Twitter. Twitter’s 140 character limit for tweets brings into focus how important the counting of characters is. Most Chinese characters take up three bytes in UTF-8, so if Twitter chose the easy route of counting bytes, Chinese users could enter only a third as many characters as users from the UK or US! Fortunately Twitter don’t do that; in fact they have an extensive explanation of how they count.
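The difference between the two routes is easy to sketch in Python 3. To be clear, this is not Twitter's actual counting algorithm, which has more rules than this; `fits_by_bytes` and `fits_by_code_points` are hypothetical helpers that only contrast the two approaches.

```python
LIMIT = 140  # the limit from the text above


def fits_by_bytes(message):
    """The easy route: count UTF-8 bytes (hypothetical, for contrast only)."""
    return len(message.encode("utf-8")) <= LIMIT


def fits_by_code_points(message):
    """The fairer route: count code points."""
    return len(message) <= LIMIT


english = "a" * 100       # 100 characters, 100 bytes in UTF-8
chinese = "\u4e16" * 100  # 100 characters, 300 bytes in UTF-8

print(fits_by_bytes(english), fits_by_code_points(english))
print(fits_by_bytes(chinese), fits_by_code_points(chinese))
```

Both messages are 100 characters long, but only the English one gets through when the limit is counted in bytes.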

In the end, this is gradually being settled. The next version of Python abstracts bytes away in its default string type. The length of a string will always be the number of code points. Most languages, however, still make programmers think too much about how the characters they are dealing with are represented in bytes; the only time one should have to deal with this is on the fringes of one’s system, where the wild characters of the outside world need taming.

.:.