Language processing in Objective-C

This is an aide-mémoire, but I thought it worth pointing to. Objective-C’s Foundation and Core Foundation frameworks offer several amazing APIs for language processing, far superior to the built-in functionality I’ve seen elsewhere. The two central players are CFStringTokenizer and NSLinguisticTagger. Apple’s documentation describes the first like this:


CFStringTokenizer allows you to tokenize strings into words, sentences or paragraphs in a language-neutral way. It supports languages such as Japanese and Chinese that do not delimit words by spaces, as well as de-compounding German compounds. You can obtain Latin transcription for tokens. It also provides language identification API.
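To make that concrete for my future self, here is a minimal sketch of word tokenization and language guessing with the Core Foundation calls. It assumes ARC, and the helper names `WordsInString` and `BestLanguageForString` are my own invention, not part of the API.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: split text into word tokens with CFStringTokenizer.
static NSArray<NSString *> *WordsInString(NSString *text) {
    NSMutableArray<NSString *> *words = [NSMutableArray array];
    CFStringRef string = (__bridge CFStringRef)text;
    CFLocaleRef locale = CFLocaleCopyCurrent();
    CFStringTokenizerRef tokenizer = CFStringTokenizerCreate(
        kCFAllocatorDefault, string,
        CFRangeMake(0, CFStringGetLength(string)),
        kCFStringTokenizerUnitWord, locale);

    // Advance until the tokenizer reports that no tokens are left.
    while (CFStringTokenizerAdvanceToNextToken(tokenizer) != kCFStringTokenizerTokenNone) {
        CFRange range = CFStringTokenizerGetCurrentTokenRange(tokenizer);
        [words addObject:[text substringWithRange:NSMakeRange((NSUInteger)range.location,
                                                              (NSUInteger)range.length)]];
    }

    CFRelease(tokenizer);
    CFRelease(locale);
    return words;
}

// Hypothetical helper: guess the dominant language of a string.
// Returns nil when the tokenizer cannot make a guess (for example, very short input).
static NSString *BestLanguageForString(NSString *text) {
    CFStringRef string = (__bridge CFStringRef)text;
    CFStringRef language = CFStringTokenizerCopyBestStringLanguage(
        string, CFRangeMake(0, CFStringGetLength(string)));
    return (__bridge_transfer NSString *)language;
}
```

Calling `WordsInString(@"The quick brown fox")` should give back the four words; swapping `kCFStringTokenizerUnitWord` for `kCFStringTokenizerUnitSentence` or `kCFStringTokenizerUnitParagraph` changes the granularity.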

And NSLinguisticTagger:

The NSLinguisticTagger class is used to automatically segment natural-language text and tag it with information, such as parts of speech. It can also tag languages, scripts, stem forms of words, etc.
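A rough sketch of part-of-speech tagging, again assuming ARC. `TaggedWords` is a made-up helper name, and the exact tags you get back depend on the language models available on the device, so treat the output as illustrative.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: return "word/LexicalClass" pairs for each word in text.
static NSArray<NSString *> *TaggedWords(NSString *text) {
    NSMutableArray<NSString *> *result = [NSMutableArray array];
    NSLinguisticTagger *tagger =
        [[NSLinguisticTagger alloc] initWithTagSchemes:@[NSLinguisticTagSchemeLexicalClass]
                                               options:0];
    tagger.string = text;

    // Skip whitespace and punctuation so only word tokens are reported.
    NSLinguisticTaggerOptions options =
        NSLinguisticTaggerOmitWhitespace | NSLinguisticTaggerOmitPunctuation;

    [tagger enumerateTagsInRange:NSMakeRange(0, text.length)
                          scheme:NSLinguisticTagSchemeLexicalClass
                         options:options
                      usingBlock:^(NSLinguisticTag tag, NSRange tokenRange,
                                   NSRange sentenceRange, BOOL *stop) {
        NSString *word = [text substringWithRange:tokenRange];
        // The tag can be nil if the tagger has nothing to say about a token.
        [result addObject:[NSString stringWithFormat:@"%@/%@", word, tag ?: @"Unknown"]];
    }];
    return result;
}
```

Passing `NSLinguisticTagSchemeLemma` or `NSLinguisticTagSchemeLanguage` instead of the lexical class scheme is how you would get at the stem forms and languages mentioned above.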

Then there’s CFStringTransform, a function that lets you normalise strings: remove diacritic marks, perform ICU normalisations and so on, which is essential for a flexible search feature.
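For the search use case, something along these lines is what I have in mind. It is only a sketch under ARC; `SearchKeyForString` is a hypothetical helper, and lowercasing is my own addition on top of the transform.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: build a diacritic-free, lowercase key for loose matching.
static NSString *SearchKeyForString(NSString *text) {
    // CFStringTransform works in place, so start from a mutable copy.
    NSMutableString *key = [text mutableCopy];

    // Passing NULL for the range transforms the whole string.
    CFStringTransform((__bridge CFMutableStringRef)key, NULL,
                      kCFStringTransformStripCombiningMarks, false);

    return key.lowercaseString;
}
```

Index your content and the user’s query with the same key and "cafe" will find "Café". Other transform identifiers, such as `kCFStringTransformToLatin`, can be applied the same way before stripping marks.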

Holy crap, those APIs are powerful.

See also: NSHipster on NSLinguisticTagger.

Finally, here’s a presentation which brings things together a bit.