Bayesian weblog detector?
Sat Oct 23 12:20:38 BST 2004
Guessing if a link leads to a weblog or not?
Technical weblog research question:
I have a list of links and I'd like to find out which of them lead to weblogs. Is there a way of doing this automatically?
Things that I thought about:
- guessing from the URL - would work for weblogs hosted on the most popular platforms
- checking whether there is an RSS/Atom feed - would exclude weblogs without feeds and include general sites with RSS feeds
- matching the URL against the database of a weblog indexing site - would include only a subset of weblogs, and you have to get the database first
- ...
Do you have any suggestions?
This post also appears on channel weblog research
[Mathemagenic]
I thought of various approaches to this one, involving looking for tags pointing to RSS feeds (nope: the BBC correctly do that on their news pages), looking for author information in the RSS, and so on. None of them would be foolproof, and all would be a pain to implement, with lots of edge cases.
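For what it's worth, the "tags pointing to RSS feeds" check is the easy part. Here is a minimal sketch in Python, assuming the page HTML has already been fetched and that feeds are advertised with the usual `<link rel="alternate">` autodiscovery tag. The names `FeedLinkFinder` and `find_feed_links` are made up for illustration. As noted above, a hit only means the site has a feed, not that it is a weblog.

```python
from html.parser import HTMLParser

# MIME types commonly used in feed autodiscovery <link> tags.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collects the href of every <link rel="alternate"> that points at a feed."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        # Attribute values can be None (e.g. <link rel>), so normalise to "".
        a = {name: (value or "") for name, value in attrs}
        rels = a.get("rel", "").lower().split()
        is_feed = a.get("type", "").lower() in FEED_TYPES
        if "alternate" in rels and is_feed and a.get("href"):
            self.feeds.append(a["href"])


def find_feed_links(html):
    """Return the feed URLs advertised in the given HTML string."""
    finder = FeedLinkFinder()
    finder.feed(html)
    return finder.feeds
```
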
I think that if it were my problem then I would make like Jon Udell and rig up a Bayesian categorizer: train it on some weblogs and some likely-looking non-weblogs, then feed it the full data set and see what happens.
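To give a rough idea of what such a categorizer might look like, here is a small naive Bayes sketch in Python. It is not Jon Udell's code, just an illustration under some assumptions: the text of each page has already been fetched and stripped of markup, words are the only features, and the labels "weblog" and "other" are my own choice. The `NaiveBayes` class and `tokenize` helper are hypothetical names.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes over words, with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of word frequencies
        self.doc_counts = Counter()  # label -> number of training documents
        self.vocab = set()           # every word seen during training

    def train(self, text, label):
        words = tokenize(text)
        self.word_counts.setdefault(label, Counter()).update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        """Return the label with the highest log-probability for the text."""
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Log prior: share of training documents carrying this label.
            score = math.log(self.doc_counts[label] / total_docs)
            # Smoothed denominator, so unseen words never give log(0).
            denom = sum(counts.values()) + len(self.vocab)
            for word in words:
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Usage would be along the lines of calling `train(page_text, "weblog")` and `train(page_text, "other")` on a hand-picked sample, then `classify(page_text)` on every link in the list. How well it works depends entirely on the training sample, so the results would need eyeballing.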

