June 29, 2006

Clustering Search Results

Greg Linden linked to an interview with Dr. Raul Valdes-Perez of Vivisimo.

I've done some trial searching both on the Internet, using Vivisimo's Clusty search engine, and in an intranet environment, and frankly I don't get it. That said, I've heard from a small number of people who say that they use Vivisimo all the time and find it really valuable for getting the "lay of the land" on a particular topic.

A couple of statements in the interview bugged me. First, there was the obligatory Google-is-good-and-all-but-we're-better triangulation.

Google's overwhelmingly dominant search engine ranks a Web page based largely on how many other Web pages are linked to it, much as a scientist is sometimes ranked by how often his research is cited by other scientists.

Having a site with PageRank™ approaching zero, I can tell you that actual query relevance is still very much at play, or my pages wouldn't show up for any search.
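To make that point concrete, here is a toy sketch in Python. It is not Google's algorithm; the pages, the link scores, the `link_weight` parameter, and the term-overlap relevance measure are all invented for illustration. It only shows how a page with a near-zero link score can still come out on top when it actually matches the query.

```python
# Hypothetical toy corpus: (url, link_score, page_text).
# The link scores are made-up stand-ins for a link-based measure.
PAGES = [
    ("bigsite.example/home", 0.90, "welcome to our portal for news and weather"),
    ("tinyblog.example/clustering", 0.01, "clustering search results in enterprise search"),
    ("midsite.example/search", 0.40, "enterprise search products compared"),
]


def text_relevance(query, text):
    """Fraction of query terms that appear in the page text (a crude assumption)."""
    terms = query.lower().split()
    words = set(text.lower().split())
    if not terms:
        return 0.0
    return sum(term in words for term in terms) / len(terms)


def rank(query, pages, link_weight=0.3):
    """Blend text relevance with link score; pages with no matching terms are dropped."""
    scored = []
    for url, link_score, text in pages:
        relevance = text_relevance(query, text)
        if relevance == 0:
            continue  # no query match, so link popularity alone does not help
        score = (1 - link_weight) * relevance + link_weight * link_score
        scored.append((score, url))
    scored.sort(reverse=True)
    return [url for _, url in scored]


results = rank("clustering search results", PAGES)
print(results)
```

In this sketch the heavily linked portal never appears for the query, and the tiny blog ranks first because it matches every query term.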

Then there's the Serendipity Demo, which usually works to get people excited about a discovery tool:

But Dr. Valdes-Perez said that by clustering Web pages into themes, Vivisimo can sometimes reveal connections that people wouldn't have seen otherwise.

To demonstrate that, he recently used the search terms "Osama bin Laden" and "Madonna" for a group in Washington D.C.

One of the themes that was generated was "niece," he said, and when he opened that folder, it revealed Web sites about a niece of the terrorist "who actually hates him but has aspirations to be a pop singer like Madonna," Dr. Valdes-Perez said.

Interesting. So bin Laden has a niece ... who has a life ... and doesn't like him. And that helps us how?

Part of the challenge in enterprise search is guiding people to the right tool for the question at hand. What are the types of questions that are best answered by a clustering engine (making it worth the investment)? What are the requirements for a successful implementation (size of collection, type of information, etc.)? I'm still trying to figure this one out.

#000000 is the new black.

What did we do before the Web? Now when you think of something clever to say, you Google it and find someone selling the T-shirt on CafePress already.

June 1, 2006

The Dark Side of Tuning in Enterprise Search

Tony Byrne rips the Google enterprise search product again. This time, he's unhappy with the fact that you can't tune the search results. In my experience, this really is a feature. Why? Politics. When The Management finds out that they can change the ranking of everyone's search results, all kinds of bad things happen. In one Dilbert-worthy episode I witnessed, a committee of The Management decided to "reward" those authors and webmasters who had tagged their documents with metadata by giving a bump to their relevance ranking. The assumption was that metadata = relevance. It was a rather unpopular change among The Users. The complaints (from users) about relevance ranking errors went away when we switched to the Google appliance. No tuning feature, no temptation.
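A rough sketch of why that kind of bump goes wrong, in Python. The documents, the base relevance scores, and the size of the boost are all hypothetical, and no real search product is being modeled here; the point is only that a flat reward for having metadata can push less relevant documents above better ones.

```python
# Hypothetical results for one query: (title, base_relevance, has_metadata).
# All numbers are invented for illustration.
DOCS = [
    ("Travel expense policy (current)", 0.82, False),
    ("Expense report form", 0.74, False),
    ("Webmaster style guide", 0.55, True),
    ("2003 travel memo (archived)", 0.50, True),
]


def rank(docs, metadata_boost=0.0):
    """Order documents by base relevance plus a flat boost for tagged documents."""
    def score(doc):
        _, base, has_metadata = doc
        return base + (metadata_boost if has_metadata else 0.0)

    return [doc[0] for doc in sorted(docs, key=score, reverse=True)]


untuned = rank(DOCS)                     # ranking by relevance alone
tuned = rank(DOCS, metadata_boost=0.3)   # the committee's "reward"

print("untuned:", untuned)
print("tuned:  ", tuned)
```

With the boost applied, the style guide jumps to the top and the archived memo overtakes the expense form, even though neither became any more relevant to the query.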