The user end of any search engine is just a web interface for digging through a database, and there are many
other web-searchable databases: 411 directories,
lexical databases (like LexFN), and etexts, for example.
A metasearch just adds another layer to the onion, acting as an interpreter between us and the databases. (Or rather,
between us and the web interfaces to the databases, an interpreter talking to an interpreter, so sometimes you don't
get the joke.) Perl or PHP gives more flexibility,
but front-ends can also be javascripted; see web scripting secrets,
Bombastic Search Engine Front-end, or
snooz and the all-in-one searches on fravia's.
It's easy to make your own two-dimensional (linear, read: slow) metasearches. Anyone with even basic programming skills can use LWP::Simple and HTML::Parser to automate web retrieval. I think REBOL is particularly well suited to this, with its built-in HTML parser and document retriever. A 3d one will take some experience with multi-threading/tasking (which I don't have yet, other than playing with fork()s ;)). There are several other search-engine front-ends on the web: Oingo tries to add natural language recognition to AltaVista and dmoz, as did electricmonk.com, which seems to be closed as of this writing.
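To make the "two-dimensional versus 3d" distinction concrete, here is a minimal sketch in Python (chosen for brevity; the same idea works in Perl with threads or fork()). The engine names and the fetch() body are made-up placeholders standing in for real LWP::Simple-style page grabs; the point is only the shape: the linear version hits one engine after another, the parallel one hits them all at once.

```python
# Sketch of a parallel ("3d") metasearch using a thread pool.
# fetch() is a placeholder for a real HTTP GET against an engine's
# search CGI; engine names here are purely illustrative.
from concurrent.futures import ThreadPoolExecutor

def fetch(engine, query):
    # stand-in for retrieving and parsing the engine's result page
    return f"{engine}: results for {query!r}"

def metasearch(engines, query):
    # linear version would be: [fetch(e, query) for e in engines]
    # parallel version: all engines queried at once
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        return list(pool.map(lambda e: fetch(e, query), engines))

results = metasearch(["altavista", "dmoz"], "archetypal figure")
```

pool.map keeps the results in engine order, so merging them afterwards is no harder than in the linear case.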
Other possible directions to take metasearching (or front-ending?):
Hmm, let's try it real quick, using, say, +%archetypal figure +Jung, 'coz I'm interested in that stuff right
now :) Entering that query in the synonyms metasearcher will return the following url:
http://www.altavista.com/cgi-bin/query?sc=on&hl=on&kl=en&pg=q&text=yes&q=%2b%28archetypal+|+archetypical+|+prototypal+|+prototypic+|+prototypical%29+figure+%2bJung&search=Search
And, wow, that came out much better than I thought it would, heh, good example :)
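The transformation from the expanded query to that URL is plain percent-encoding: '+' becomes %2b, the parentheses become %28/%29, and spaces become '+'. A quick sketch in Python (the '|' is left unescaped, as AltaVista's own interface did; hex case may differ from the URL above):

```python
# Percent-encode the synonym-expanded query into AltaVista's q parameter.
from urllib.parse import quote_plus

q = '+(archetypal | archetypical | prototypal | prototypic | prototypical) figure +Jung'
encoded = quote_plus(q, safe='|')   # keep '|' literal, encode the rest
url = ('http://www.altavista.com/cgi-bin/query?sc=on&hl=on&kl=en&pg=q'
       '&text=yes&q=' + encoded + '&search=Search')
```

This is exactly what the last few substitutions in the CGI below do by hand with s/// on the joined token list.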
Well, here is the relevant part of the CGI (the rest is just HTML stuff):
    use LWP::Simple;   # no need for anything more in-depth

    # parse the query string
    @in = split(/&/, $ENV{'QUERY_STRING'});
    foreach $i (@in) {
        $i =~ s/\+/ /g;
        $i =~ s/%(..)/chr(hex($1))/ge;
        @key_val = split(/=/, $i, 2);
        $in{$key_val[0]} = $key_val[1];
    }
    $in{'q'} =~ s/[^\w()|+\-~"% ]//g;   # strip bad chars

    open(L, ">>$logf");   # append to the log
    print L time() . "\n$ENV{'REMOTE_ADDR'}\n$ENV{'HTTP_USER_AGENT'}\n$in{'q'}\n";
    close(L);

    # protect quoted phrases by joining their words with underscores
    # (there _must_ be a better way to do this! am I just stupid or what?!)
    $b = $in{'q'};
    while ($b =~ /^.*?"(.*?)"(.*)$/s) {
        $a = $1;
        $b = $2;
        $c = $a;
        $c =~ tr/ /_/s;
        $in{'q'} =~ s/$a/$c/s;
    }
    @tokens = split(/ /, $in{'q'});
    foreach $token (@tokens) { $token =~ tr/_/ /; }

    foreach $token (@tokens) {
        if ($token =~ /^([+\-~|]*\(?)%/) {
            $t = "$1(";
            $token =~ s/[^\w ]//g;
            @syns = ($token);
            $token =~ s/ /+/g;
            # grabbing and parsing out the synonyms
            $p = get("http://www.raisch.com/cgi-bin/lexfn/lexfn-cuff.cgi?sWord=$token&tWord=&query=show&maxReach=2&ASYN=on&ABAK=on") or last;
            $p =~ /^.*<\/form>(.*)<font/is;
            $p = $1;
            while ($p =~ /<b><a.*?>(.*?)<\/a>(.*)$/is) {
                push @syns, $1;
                $p = $2;
            }
            # quote spaced synonyms
            foreach $s (@syns) {
                if ($s =~ / /) { $s = '"' . $s . '"'; }
            }
            $token = $t . join(' | ', @syns) . ')';
        }
    }
    $query = join(' ', @tokens);
    $query =~ s/\+/%2b/g;
    $query =~ s/ /+/g;
    $url = "http://www.altavista.com/cgi-bin/query?sc=on&hl=on&kl=$in{'kl'}&pg=q&text=yes&q=$query&search=Search";
    print "Location: $url\n";
    print "Content-type: text/html\n\n";

Are these kinds of search-engine additions useful? Or maybe more can be got from Lexical FreeNet?
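About the underscore dance that protects quoted phrases: there is indeed a shorter way. A single regex can grab either a quoted phrase or a run of non-space characters in one pass. Sketched in Python; the same pattern works in Perl as @tokens = $query =~ /"[^"]*"|\S+/g;

```python
# Split a query into tokens, keeping quoted phrases intact,
# instead of swapping spaces for underscores and back.
import re

def tokenize(query):
    # a quoted phrase, or else a run of non-space characters
    return re.findall(r'"[^"]*"|\S+', query)

tokens = tokenize('+%archetypal "red king" +Jung')
# tokens: ['+%archetypal', '"red king"', '+Jung']
```

The alternation is tried left to right, so a token starting with a quote is consumed as a whole phrase before \S+ gets a chance to split it.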