I’ve spent a couple of weeks reading about document classification, keyword recognition and other methods of automatically working out what documents are about. It seems that my initial idea was over-ambitious. I had intended to take groups of weblog posts, work out what each was about, and then present the user with a list of the topics found. They could then choose the topics that interested them rather than having to read all the posts to find the interesting ones.
It appears that working out topics is hard when you don’t have a large corpus of items already tagged with topics; in other words, having the program work out topics on its own is hard. Instead I am going to aim for a lower bar: the user labels articles that interest them with topics, and then we try to classify new posts into the topics they have chosen.
This is similar to spam filtering: we are trying to filter into predefined topics rather than making up our own. However, it is somewhat more complicated, because a document can belong to more than one topic (rather than being just spam or not-spam). In addition, there is more variation in weblog posts about the same topic than in spam emails, which tend to contain similar words (cheap!, free!, etc.).
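To make the spam-filter comparison concrete, here is a minimal sketch of one way it could work: a separate naive Bayes "in this topic / not in this topic" filter for each topic the user has chosen, so a post can land in several topics at once. This is an illustration only, not the project's code. It is in Python for brevity, and the names `tokenize`, `TopicClassifier`, `label`, `topic_scores` and `classify`, along with the sample posts, are all made up for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class TopicClassifier:
    """Hypothetical sketch: one binary naive Bayes filter per user-chosen topic."""

    def __init__(self):
        self.posts = []  # list of (tokens, set of topics) labelled by the user

    def label(self, text, topics):
        """Record a post the user has tagged with one or more topics."""
        self.posts.append((tokenize(text), set(topics)))

    def topic_scores(self, text):
        """Return {topic: log-odds that the text belongs to that topic}."""
        vocabulary = {word for tokens, _ in self.posts for word in tokens}
        all_topics = set().union(*(topics for _, topics in self.posts))
        # Words never seen in a labelled post carry no evidence, so skip them.
        words = [w for w in tokenize(text) if w in vocabulary]
        scores = {}
        for topic in all_topics:
            counts = {True: Counter(), False: Counter()}
            n_posts = {True: 0, False: 0}
            for tokens, topics in self.posts:
                side = topic in topics  # True: post has this topic
                counts[side].update(tokens)
                n_posts[side] += 1
            totals = {s: sum(counts[s].values()) for s in (True, False)}
            # Smoothed prior odds, then add-one smoothed word likelihoods.
            log_odds = math.log((n_posts[True] + 1) / (n_posts[False] + 1))
            for w in words:
                p_in = (counts[True][w] + 1) / (totals[True] + len(vocabulary))
                p_out = (counts[False][w] + 1) / (totals[False] + len(vocabulary))
                log_odds += math.log(p_in / p_out)
            scores[topic] = log_odds
        return scores

    def classify(self, text):
        """Return every topic whose filter says 'more likely in than out'."""
        return sorted(t for t, s in self.topic_scores(text).items() if s > 0)


clf = TopicClassifier()
clf.label("Compiling the linux kernel with gcc", ["linux"])
clf.label("New kernel release fixes linux driver bugs", ["linux"])
clf.label("Baking sourdough bread with rye flour", ["cooking"])
clf.label("Slow roast recipe with bread sauce", ["cooking"])
clf.label("Recipe manager app for linux written with gtk", ["linux", "cooking"])
print(clf.classify("linux kernel driver update"))
```

Because each topic gets its own yes/no decision, nothing forces a post into exactly one topic, which is the main difference from a plain spam/not-spam filter. The sketch recounts words on every call to keep it short; a real version would keep running counts.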
I still think that this will be very useful if it does work, perhaps more so than if the program itself chose topics: it would choose topics regardless of whether you were interested in them. That would probably result in a huge list of topics, most of which you didn’t care about or which were too general or too specific to be of use.
It also still leaves quite a wide scope of things to do; it should fill my time between now and when the project is due in. I’m probably going to code it in C# and Gtk#, because I am fairly comfortable with them and so should be able to get stuff up and running in a reasonable time.
Time to get my hands dirty with some code…