July 19, 2006

Norvig the Heretic

Peter Norvig, whose challenge to Tim Berners-Lee regarding the Semantic Web was reported by CNET News only because he works for Google, listed three important problems for the meta-utopian dream.

"We deal with millions of Web masters who can't configure a server, can't write HTML. It's hard for them to go to the next step. The second problem is competition. Some commercial providers say, 'I'm the leader. Why should I standardize?' The third problem is one of deception. We deal every day with people who try to rank higher in the results and then try to sell someone Viagra when that's not what they are looking for. With less human oversight with the Semantic Web, we are worried about it being easier to be deceptive," Norvig said.

While we're somewhat shielded from that last problem in the enterprise search world, the first two, incompetence and laziness, are enough to keep us busy. If you have multiple authors to work with, prepare to spend significant resources correcting all the different ways they devise to screw up structured metadata.

One of my favorites is the "template" method. You may never see the original document it comes from, but you'll suddenly get lots of Word documents that all have the same title (unrelated to the visible content of the document), or lots of HTML pages with the same meta tag block. You either have to fix these or exclude them from your index, or you ruin the good times for everyone.
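For the HTML case, here is a rough sketch of how you might flag these template clones before they reach the index. It assumes you already have the pages in memory as a dict of URL to HTML text; the names `metadata_fingerprint`, `find_template_clones`, and the `min_group_size` threshold are made up for illustration and not taken from any particular search product.

```python
from collections import defaultdict
from html.parser import HTMLParser


class HeadMetadataParser(HTMLParser):
    """Collects the <title> text and (name, content) pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""
        self.meta = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr_map = dict(attrs)
            name = attr_map.get("name")
            if name:
                self.meta.append((name.lower(), attr_map.get("content") or ""))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def metadata_fingerprint(html_text):
    """Return a hashable summary of a page's title and meta tag block."""
    parser = HeadMetadataParser()
    parser.feed(html_text)
    parser.close()
    # Collapse whitespace and case so trivial differences don't hide a clone.
    title = " ".join(parser.title.split()).lower()
    return (title, tuple(sorted(parser.meta)))


def find_template_clones(pages, min_group_size=3):
    """Group URLs whose metadata is identical; return the suspicious groups.

    `pages` maps URL -> HTML text. `min_group_size` is an arbitrary
    threshold (an assumption): tune it for your own collection.
    """
    groups = defaultdict(list)
    for url, html_text in pages.items():
        groups[metadata_fingerprint(html_text)].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) >= min_group_size]
```

Each group it returns is a set of pages sharing one title and meta block, which you can then hand to a human for fixing or drop from the index. It only compares metadata, so it will not tell you which page, if any, is the legitimate original.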

The most effective solution I've seen to this problem so far is better authoring tools, combined with as much automation (adding metadata without pestering the user for it) as possible. While entity extraction is improving and can add some value, it would sure be nice if it were implemented like spell-checking software, in cooperation with the author. So far I've only seen it used too far downstream.

I can relate to the way Norvig began his comments. "What I get a lot is: 'Why are you against the Semantic Web?' I am not against the Semantic Web."

I get the same crap about similar stuff.

Them: "We're authoring everything in such-and-such XML schema."

Me: "Great, here's the best way to transform it to standards-compliant HTML and publish it so it's accessible and useful to people and search engines."

Them: "What do you have against XML?"

Me: "garrrh!"

Posted July 19, 2006 11:53 AM