2004-09-23

Getting the blogroll links out of Technorati

Thu Sep 23 15:20:16 BST 2004

Something that bugs me every time I use Technorati is the way the listings get clogged up with blogroll links. I wouldn't mind so much but they're usually the same links over and over again.

What I'd like is for those links to be displayed separately. I have in mind some kind of reverse blogroll box down the right-hand side, with the remaining entries listed pretty much as they are now.

I'm not sure what the best way of flagging blogroll links is, but a simple answer that occurs to me is to use a common CSS class name, for example _linkdescriptor_blogroll. Any harvester that extracts a link with this class would know it's a blogroll link and could treat it differently.
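To make the idea concrete, here is a minimal sketch of what such a harvester might do, using only Python's standard library. The class name is the one suggested above; the LinkHarvester class and the harvest function are hypothetical names made up for illustration, and this is not how Technorati actually works.

```python
from html.parser import HTMLParser

# Class name proposed in the post; nothing here is an existing standard.
BLOGROLL_CLASS = "_linkdescriptor_blogroll"


class LinkHarvester(HTMLParser):
    """Collects link targets, keeping blogroll links apart from the rest."""

    def __init__(self):
        super().__init__()
        self.blogroll = []  # hrefs of links flagged with BLOGROLL_CLASS
        self.other = []     # hrefs of every other link

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        # The class attribute may hold several space-separated names.
        classes = (attr.get("class") or "").split()
        if BLOGROLL_CLASS in classes:
            self.blogroll.append(href)
        else:
            self.other.append(href)


def harvest(html):
    """Return (blogroll_links, other_links) found in an HTML string."""
    parser = LinkHarvester()
    parser.feed(html)
    parser.close()
    return parser.blogroll, parser.other
```

A harvester built like this could list the second group as it does now and push the first into the reverse blogroll box. Weblogs that don't use the class would simply be treated as they are today.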

I'd lead the way but I chopped my blogroll in the great purge. Maybe I should bring it back to try and bootstrap the idea. I could swap it for the calendar...

Update: In conversation Phil rebutted me, saying it would be easier for Technorati to change their code to detect blogroll links than to update millions of weblogs. I guess he's right, although I think the problem of reliably detecting a blogroll link may not be as simple as it sounds.

I also think there is a knowledge ownership perspective. As the blog holder I am the best person to decide how to mark up my information and links. And, by doing so, applications other than Technorati can benefit (without having to duplicate Technorati's tricky heuristics for detecting every variation of a blogroll).

Although my way might be slower, I still think it could solve 80% of the problem in 6 months or so.

What's your handle?

Thu Sep 23 10:43:36 BST 2004

Mine is 1030.72/mmower. I'm looking into something called the Handle System which is like PURLs on steroids. I'm also getting interested in OpenURL.
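For anyone wondering what a handle looks like in practice, here is a rough sketch of splitting one into its prefix (the naming authority) and suffix (the local name) and turning it into a URL for a resolving proxy. The function names are made up for illustration, and the use of the hdl.handle.net proxy as the default is an assumption; whether any particular handle still resolves there is not something this sketch checks.

```python
from urllib.parse import quote


def parse_handle(handle):
    """Split 'prefix/suffix' at the first slash; the suffix may contain more slashes."""
    prefix, sep, suffix = handle.partition("/")
    if not sep or not prefix or not suffix:
        raise ValueError("expected a handle of the form prefix/suffix: %r" % handle)
    return prefix, suffix


def proxy_url(handle, proxy="https://hdl.handle.net/"):
    """Build a proxy URL for a handle (the default proxy is an assumption)."""
    prefix, suffix = parse_handle(handle)
    # Percent-encode anything that is not safe in a URL path.
    return proxy + quote(prefix, safe=".") + "/" + quote(suffix, safe="/")


print(proxy_url("1030.72/mmower"))
```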

Starting to pull the threads of RDF

Thu Sep 23 10:36:52 BST 2004

So I'm trying to puzzle out RDF. I've read most of Practical RDF, which has given me a leg up on the basics (as well as making me realise how much I don't like looking at RDF/XML). I'm also trying to grok RDQL. My goal here is to understand enough to start to figure out OWL, its significance and its applications. Already, though, I'm having flashbacks to a conversation with David Weinberger where he said he was of the opinion that most people would be too lazy to add simple topic metadata to posts... the idea that people are going to invest the time in complex, structured information... well, I'm a little skeptical. Hopefully it will become clear to me with time.

Harvesting PDFs in Java

Thu Sep 23 09:39:43 BST 2004

Mark Stephens has released JPedal 2.25, a pure Java library for extracting content from PDF files and rasterizing them. Text fragments are extracted as XML elements with font and location information. Images are extracted in both their raw formats and their clipped and scaled formats as TIFF, PNG, or JPEG files. According to Stephens, "version 2.25 adds a number of significant features, the most major being support for outlines and thumbnails of the screens and ability to highlight text onscreen.... This is a major release which also includes a large number of fixes and substantial speed improvements." JPedal is published under the GPL. [Cafe au Lait Java News and Resources]

JPedal sounds very useful.  Especially so if you use Lucene.