Language processing in Objective-C
This is an aide-mémoire, but I thought it worth pointing to. The Foundation and Core Foundation frameworks available to Objective-C offer several amazing APIs for language processing, far superior to the built-in functionality I've seen elsewhere. The two central players are CFStringTokenizer and NSLinguisticTagger.
CFStringTokenizer allows you to tokenize strings into words, sentences or paragraphs in a language-neutral way. It supports languages such as Japanese and Chinese that do not delimit words by spaces, as well as de-compounding German compounds. You can obtain Latin transcription for tokens. It also provides language identification API.
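To make that concrete, here is a minimal sketch of word tokenization and language guessing with CFStringTokenizer. The helper names `WordsInString` and `GuessLanguage` are made up for this note, and the code assumes ARC and a current Apple SDK; treat it as an illustration rather than a reference implementation.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: split `text` into word tokens.
static NSArray<NSString *> *WordsInString(NSString *text) {
    NSMutableArray<NSString *> *words = [NSMutableArray array];
    CFStringRef string = (__bridge CFStringRef)text;
    CFLocaleRef locale = CFLocaleCopyCurrent();

    // kCFStringTokenizerUnitWord asks for word-level tokens.
    CFStringTokenizerRef tokenizer = CFStringTokenizerCreate(
        kCFAllocatorDefault, string,
        CFRangeMake(0, CFStringGetLength(string)),
        kCFStringTokenizerUnitWord, locale);

    // Advance until the tokenizer reports that no token is left.
    while (CFStringTokenizerAdvanceToNextToken(tokenizer) != kCFStringTokenizerTokenNone) {
        CFRange range = CFStringTokenizerGetCurrentTokenRange(tokenizer);
        NSRange nsRange = NSMakeRange((NSUInteger)range.location, (NSUInteger)range.length);
        [words addObject:[text substringWithRange:nsRange]];
    }

    CFRelease(tokenizer);
    CFRelease(locale);
    return words;
}

// Hypothetical helper: best-guess language code for `text`, or nil if
// the tokenizer cannot decide (short strings often give poor guesses).
static NSString *GuessLanguage(NSString *text) {
    CFStringRef string = (__bridge CFStringRef)text;
    CFStringRef language = CFStringTokenizerCopyBestStringLanguage(
        string, CFRangeMake(0, CFStringGetLength(string)));
    // "Copy" in the name means we own the result; hand it to ARC.
    return (__bridge_transfer NSString *)language;
}
```

Calling `WordsInString(@"Hello world")` should give back the two words as separate strings. Swapping `kCFStringTokenizerUnitWord` for `kCFStringTokenizerUnitSentence` or `kCFStringTokenizerUnitParagraph` changes the granularity.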
And NSLinguisticTagger:
The NSLinguisticTagger class is used to automatically segment natural-language text and tag it with information, such as parts of speech. It can also tag languages, scripts, stem forms of words, etc.
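As a rough sketch of what that looks like in practice, the snippet below tags each word with its lexical class (noun, verb and so on). `TaggedWords` is a name invented for this note and the code assumes ARC. Newer SDKs mark NSLinguisticTagger as deprecated in favour of the Natural Language framework, so expect deprecation warnings there.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: return "word/tag" strings for each word in `text`.
static NSArray<NSString *> *TaggedWords(NSString *text) {
    NSMutableArray<NSString *> *result = [NSMutableArray array];

    // Ask only for the lexical-class scheme (parts of speech).
    NSLinguisticTagger *tagger = [[NSLinguisticTagger alloc]
        initWithTagSchemes:@[NSLinguisticTagSchemeLexicalClass]
                   options:0];
    tagger.string = text;

    // Skip whitespace and punctuation so only words are reported.
    NSLinguisticTaggerOptions options =
        NSLinguisticTaggerOmitWhitespace | NSLinguisticTaggerOmitPunctuation;

    [tagger enumerateTagsInRange:NSMakeRange(0, text.length)
                          scheme:NSLinguisticTagSchemeLexicalClass
                         options:options
                      usingBlock:^(NSString *tag, NSRange tokenRange,
                                   NSRange sentenceRange, BOOL *stop) {
        NSString *token = [text substringWithRange:tokenRange];
        // The tag can be nil when the tagger has no answer for a token.
        [result addObject:[NSString stringWithFormat:@"%@/%@", token, tag ?: @"?"]];
    }];
    return result;
}
```

The exact tags depend on the language model installed on the system, so the output is best treated as a hint rather than ground truth. Other schemes, such as `NSLinguisticTagSchemeLemma` or `NSLinguisticTagSchemeLanguage`, can be requested the same way.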
Finally, CFStringTransform allows you to normalise strings: remove diacritic marks, perform ICU normalisations and so on — essential for a flexible search feature.
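For the search use case, a minimal sketch might fold a string into a plain lowercase key before comparing it. `SearchKey` is a hypothetical helper name, and the choice of transforms (transliterate to Latin, strip combining marks, lowercase) is one reasonable assumption among several, not the only way to do it.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: fold `input` into a simplified key for matching.
static NSString *SearchKey(NSString *input) {
    NSMutableString *key = [input mutableCopy];
    CFMutableStringRef cfKey = (__bridge CFMutableStringRef)key;

    // Passing NULL as the range transforms the whole string in place.
    // First transliterate non-Latin scripts into Latin characters...
    CFStringTransform(cfKey, NULL, kCFStringTransformToLatin, false);
    // ...then drop accents and other combining marks.
    CFStringTransform(cfKey, NULL, kCFStringTransformStripCombiningMarks, false);

    return [key lowercaseString];
}
```

Running both the indexed text and the user's query through the same function means that a query typed without accents can still match accented text.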
Holy crap, those APIs are powerful.
NSHipster on NSLinguisticTagger.
Lastly, here’s a presentation which brings things together a bit.