404 - The-University-Project

Someone asked about it, so here’s the “I’m not going to explain terms, no nonsense” version. The idea is to combine a natural language parsing engine with an RSS/Atom feed reader and see what useful things come out the other end.

I want to see whether feed posts can be clustered into topics, where the topics are chosen automatically from the contents of the posts themselves. The NLP engine will hopefully help pull out the interesting bits of feeds. For example, nouns are probably the most likely candidates for topics, so we can try to extract them.

After this, we can apply some form of statistical analysis to pick out terms that appear in multiple documents but are still uncommon enough to make good topic headings, i.e. topics that don’t each contain too many articles. Then we “label” each article with a list of its topics.
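To make the "multiple documents, but still uncommon" idea a bit more concrete, here is a rough Python sketch. Everything in it is an assumption for illustration rather than the project's actual design: the function names, the `min_docs` and `max_fraction` thresholds, and the tiny stop-word list standing in for the noun extraction that a real NLP engine would do.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for proper noun extraction: drop a few common words.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "for", "on", "with"}

def tokenize(text):
    """Return the set of lower-cased candidate terms in one post."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def label_posts(posts, min_docs=2, max_fraction=0.5):
    """Label each post with terms that are shared but not too widespread.

    A term becomes a topic if it appears in at least `min_docs` posts and in
    at most `max_fraction` of all posts (both thresholds are assumptions).
    """
    doc_terms = [tokenize(p) for p in posts]

    # Document frequency: the number of posts each term appears in.
    df = defaultdict(int)
    for terms in doc_terms:
        for term in terms:
            df[term] += 1

    limit = max_fraction * len(posts)
    topics = {t for t, n in df.items() if min_docs <= n <= limit}

    # Each post is labelled with the topics it contains.
    return [sorted(terms & topics) for terms in doc_terms]

posts = [
    "Python release adds new parser",
    "New Python parser benchmarks",
    "Football results: new league table",
    "Football transfer news",
]
labels = label_posts(posts)
```

With these four made-up posts, "new" appears in three of them and is rejected as too common, while "python", "parser" and "football" each appear in exactly two and survive as topics. Words that appear in only one post never become topics at all.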

If this topic thing works, then the next plan is to figure out how to display this in a useful way, perhaps with some visualisation of how popular the topics are.

As you can tell, the idea is fairly woolly at the moment. I’m reading about various techniques to see which ones are useful. Many NLP techniques either take a long time or require a large corpus of documents; neither of these traits is particularly desirable in an application that attempts to classify weblog posts/news articles (small corpus) in near real-time as they arrive.

We shall see, as mentioned earlier, what useful things come out the other end…