Inktomi's search syntax

Published @ Searchlores in February 2004 | Version 0.01 | By Nemo

There is a new version of this essay at inktomi.html!


Introduction

Inktomi is one of the best search engines out there. Unfortunately its search syntax is not well documented, which is a pity, because Inktomi offers one of the richest search syntaxes, with lots of unique features and a ranking algo which often works quite well. The purpose of this essay is precisely to document Inktomi's search syntax and to provide examples showing its usefulness. To that end I read old HotBot's search FAQs and the search FAQs of Inktomi's other web partners. The core syntax found in them was then expanded using search engines and the WayBack Machine. Finally, additional search syntax was guessed from the source code of old HotBot's advanced search pages: feature:homepage, originurlextension: and stem:.

Inktomi unveiled

Inktomi doesn't provide a public search engine the way AltaVista or Google do. Instead, it licenses its search technology and database to other portals and sites on the Internet. While some of them display results taken directly from Inktomi's database, others display results from Inktomi only after displaying results from another source.

Inktomi also differs from other search engines in a major way: it is built from two databases, WebMap and Web Search 9.

WebMap: This is not a searchable database. Inktomi claims that it contains three billion documents (cf. [1]). This database is used 'to analyze the characteristics of each page and the link structure that connects the pages' (cf. [1]); in other words, they use it to distill link popularity, as you can infer from the SERPs for feature:title and depth:4, since no keywords were used in those queries.
Web Search 9: From the WebMap database Inktomi chooses the highest quality documents to build the Web Search 9 database, which contains the former databases: Eurocluster (110 M documents), the Asia Pacific cluster (55 M documents), the Best of Web cluster (110 M documents) and GEN3? (500 M documents) (cf. [2] and [3]). This database hasn't grown much since 2002, as you can infer from the queries feature:title and depth:4, which give an estimate of roughly 800 M documents...

Inktomi relies on the title, the keywords meta tag and the text to sort search results (cf. [4] page 23). The use of meta tags is a good idea: webmasters know for sure what their pages are all about! Nevertheless Inktomi's use of titles and keywords meta tags is prone to abuse, and examples are easy to find... just do a title search for any commercial keyword in singular and plural:

title:hotel title:hotels

where lonnnnnnnnnnnnng titles and keywords meta tags abound... let's take the very first search result: originurl:http://www.nyc-realestate.com/, which has no body text! (disable javascript if you don't want to be redirected duh:). Even worse, Inktomi still merrily matches this document for the keyword manhattan, which occupies chars 377 - 386 of the title's string... I wonder why Inktomi allows this free ride to spammers, given that these annoyances are so easy to correct:

Inktomi should index only the first ten keywords from the page's title and only the first ten keywords from the keywords meta tag, ok, ok, twenty! If webmasters aren't able to describe their pages' content with these thirty keywords, they should move to another job where they would be more productive!
Inktomi should provide a text: search field, since searchers may want to restrict their search to pages containing the real thing...
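
The first proposal can be sketched in a few lines of Python. This is only an illustration of the idea, not anything Inktomi does: the function name, the whitespace and comma splitting, and the spammy title are all made up for the example, and the limits (ten title words, twenty meta keywords) are the ones suggested above.

```python
def indexable_keywords(title, meta_keywords, title_limit=10, meta_limit=20):
    """Keep only the first few title words and meta keywords, as proposed above."""
    # Assumption: title words are separated by whitespace, meta keywords by commas.
    title_terms = title.lower().split()[:title_limit]
    meta_terms = [k.strip().lower() for k in meta_keywords.split(",") if k.strip()]
    return title_terms, meta_terms[:meta_limit]


# Hypothetical spammy title: the keyword we care about only appears after 40 filler words.
spam_title = " ".join(["hotel"] * 40 + ["manhattan"])
title_terms, meta_terms = indexable_keywords(spam_title, "hotel, hotels, cheap hotel")

print("manhattan" in title_terms)  # False: the keyword sits far beyond the tenth word
```

With such a cut-off, a keyword buried hundreds of characters into a title would simply never be matched by a title search.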

Inktomi doesn't show its copy of a document, aka the cache. This lack of transparency is a shot in its own foot, because it prevents spammed pages from being easily exposed and getting the punishment they deserve.

Inktomi's database is queried by four main search engines: MSN (advanced search), HotBot, PositionTech and BluWin. Each one of these has its own advantages and drawbacks:

MSN (advanced search) Shows 15 results per page. Your query must be shorter than 150 characters. If you use MSN simple search you'll first get all the paid crap matching your query string from the directories Zeal and Looksmart, and only after that Inktomi's search results. If you use MSN advanced search you'll get only Inktomi's search results. MSN clusters search results and only shows the two/three most relevant pages for each site. MSN is a puritan search engine, read more here.
HotBot Shows 10 results per page. Your query must be shorter than 300 characters. You may need some time to find the search results among the 'SPONSORED LINKS' aka paid crap. Don't take HotBot's answer 'Sorry, your search had no web results' at face value: hit your browser's reload button a couple of times and often your search results will materialize! I think this behavior is due to HotBot caching queries to speed things up, because it often happens for 'unusual' queries. HotBot clusters search results and only shows the two/three most relevant pages for each site.
PositionTech Shows 20 results per page. You get a clean search results page without sponsored links. PositionTech doesn't have an advanced search page and your query must be shorter than 64 characters. PositionTech doesn't cluster search results: it shows all documents matching your query.
BluWin (Erweiterte Suche) (Recherche avancée) (Ricerca avanzata) Shows 15, 25, 50 or 100 results per page. Displays its Sponsorlinks first and Inktomi's search results after them. Your query must be shorter than 100 characters. BluWin clusters search results and only shows the two/three most relevant pages for each site.
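
Since each front end has its own query length limit, it can be handy to check a long query before pasting it. The following Python sketch uses only the limits quoted above; the function name is made up, and "shorter than" is read as strictly less than.

```python
# Query length limits quoted above (characters); a query must be strictly shorter.
QUERY_LIMITS = {
    "MSN": 150,
    "HotBot": 300,
    "PositionTech": 64,
    "BluWin": 100,
}


def engines_accepting(query):
    """Return the front ends whose length limit the query respects."""
    return [name for name, limit in QUERY_LIMITS.items() if len(query) < limit]


query = "title:german exercises feature:javascript feature:form"
print(len(query), engines_accepting(query))
```

Long exclusion lists (lots of -stem: terms) are the typical case where PositionTech's limit gets in the way first.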

Inktomi willingly prostitutes itself by accepting paid search results, and merrily gives them a boost:

'While the pay-for-placement search model-in which a marketer pays to have certain keywords land high in search results-has been gaining steam over the last few years, Inktomi says that represents only 30% of the paid search model. Inktomi says it will tap into the remaining 70% through its paid inclusion model, which places sites into the results of relevant searches.' (cf. [5])

Nevertheless these pages are easy to spot, because they were crawled in the past two days (cf. [6] and [7]) and, if you are using MSN, HotBot or PositionTech, the status bar shows rdrw1.inktomi.com/click?u=http when you put your mouse over the link (disable javascript so as not to be fooled duh:).
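
If you save a results page and want to sort the links automatically, the redirect prefix mentioned above is enough for a rough filter. A minimal Python sketch, assuming that every link routed through that click tracker is a paid inclusion (the function name is hypothetical):

```python
from urllib.parse import urlparse

# Redirect prefix quoted above; treating every such link as paid inclusion is an assumption.
TRACKER_HOST = "rdrw1.inktomi.com"
TRACKER_PATH = "/click"


def looks_paid(href):
    """True if the result link goes through Inktomi's click tracker."""
    if not href.startswith(("http://", "https://")):
        href = "http://" + href  # the status bar may show the link without a scheme
    parts = urlparse(href)
    return (
        parts.netloc.lower() == TRACKER_HOST
        and parts.path == TRACKER_PATH
        and parts.query.startswith("u=")
    )


print(looks_paid("http://rdrw1.inktomi.com/click?u=http://www.example.com/"))
print(looks_paid("http://www.searchlores.org/news.htm"))
```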

Inktomi's syntax

Default search Multiple search terms are processed as an AND operation.

Boolean search Inktomi offers full Boolean searching with the operators AND, OR and NOT. You can use - instead of NOT, and expressions can be nested using parentheses (). Operators must be in upper case. You are well advised not to use the OR operator for keyword variants, because your query will attract irrelevant search results (Inktomi gives a higher rank to documents containing all ORed keywords); in those cases you should use stemming whenever you can. Example, compare:
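
Nested Boolean queries are easy to get wrong by hand (a lower-case operator, a missing parenthesis), so here is a small Python sketch that assembles them. The helper names are invented for this example; the query it rebuilds is the one used further down in this essay for feature:image.

```python
def phrase(text):
    """Wrap a multi-word phrase in double quotes."""
    return '"' + text + '"'


def any_of(*terms):
    """OR group, wrapped in parentheses so it can be nested."""
    return "(" + " OR ".join(terms) + ")"


def all_of(*terms):
    """AND chain; operators are written in upper case, as Inktomi requires."""
    return " AND ".join(terms)


def exclude(term):
    """Shorthand for NOT."""
    return "-" + term


query = all_of(
    any_of(phrase("bird of paradise"), phrase("birds of paradise")),
    any_of("papua", phrase("new guinea")),
    "feature:image",
)
query += " " + " ".join(exclude(t) for t in ("stem:travel", "stem:hotel"))
print(query)
```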

Case Inktomi has no case-sensitive searching. Using either lower or upper case results in the same hits.

Truncation No truncation (*, ?) is currently available, but you can use word stemming (stem:).

Stop words All words are searched. There are no known stop words.

Ranking Inktomi is the only 'search engine' which lets you change its ranking algorithm; this is done by giving each keyword a weight. Weight factors can vary between 0.0 and 9.9 and the syntax is weight*keyword. By default each keyword has weight 1.0, as you can see by comparing these two queries: 1.0*fravia and fravia. This is a very useful feature which lets you fine tune how loud each keyword is allowed to talk: you reduce the noise level produced by noisier keywords by multiplying them by a factor lower than 1.0, and you give an opportunity to more bona fide keywords by multiplying them by a factor greater than 1.0. Example:

Strategy: keep reducing the weight factor of the noisier keywords until you start to see a degradation of search results, and then take the previous weighting factor. For bona fide keywords do the inverse thing (multiplying by a factor greater than 1.0 duh! :). This feature also helps to partially fix Inktomi's broken title search, using query strings of the following type: 0.0*title:keyword1 0.0*title:keyword2 keyword1 keyword2, which means that keyword1 and keyword2 must be in the title, but documents are ranked for the keywords being located elsewhere.
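
The weight syntax and the title trick above can be generated mechanically. A minimal Python sketch, with made-up function names; writing the weight with exactly one decimal is my assumption, based on the 0.0 to 9.9 range.

```python
def weighted(term, weight=1.0):
    """Format a term as weight*term; weights outside 0.0 to 9.9 are rejected."""
    if not 0.0 <= weight <= 9.9:
        raise ValueError("weight must be between 0.0 and 9.9")
    return f"{weight:.1f}*{term}"


def title_filter_query(*keywords):
    """Require each keyword in the title (weight 0.0) but rank on the plain keywords."""
    filters = [weighted("title:" + k, 0.0) for k in keywords]
    return " ".join(filters + list(keywords))


print(weighted("fravia"))                            # 1.0*fravia
print(title_filter_query("keyword1", "keyword2"))
```

To follow the strategy above you would call weighted() with a slightly smaller factor on each round for the noisy keyword and compare the SERPs by eye.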

depth:[number] Designates how deep in a site's directory structure pages will be searched. The number (0, 1, 2, 3, 4) indicates the maximum number of subdirectories, relative to the host's root directory, which may appear in the URL. As a general rule (not universal! duh:) a webpage's content increases with directory depth and, besides, spammers think that webpages in the root directory get a ranking boost and are more likely to be indexed, so they often put their doorway pages there. This useful feature offers a handy way of getting rid of those annoyances... excluding the root directory's pages!

Example: title:german hear feature:audio -depth:0

As most pages are located in directories with a depth of four or less, this feature gives a good estimate of how many documents are in Inktomi's database: depth:4.
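
To make the notion of depth concrete, here is a Python sketch that counts the subdirectories in a URL. How Inktomi really counts is not documented; in particular, treating a last path segment without a trailing slash as a document rather than a directory is my assumption, and the function names are hypothetical.

```python
from urllib.parse import urlparse


def url_depth(url):
    """Number of directories between the host's root and the document."""
    path = urlparse(url).path
    segments = [s for s in path.split("/") if s]
    if segments and not path.endswith("/"):
        segments.pop()  # assumption: the last segment is the document, not a directory
    return len(segments)


def matches_depth(url, max_depth):
    """Would depth:max_depth keep this URL? (the number is a maximum)"""
    return url_depth(url) <= max_depth


for url in (
    "http://www.searchlores.org/news.htm",
    "http://birds.cornell.edu/BRP/SoundQuake.html",
    "http://www.nyc-realestate.com/",
):
    print(url_depth(url), url)
```

Under this reading, -depth:0 throws away exactly the pages for which url_depth() returns 0.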

domain: Restricts a search to the selected domain. Domains can be specified up to three levels deep. Examples: domain:org, domain:searchlores.org, domain:www.searchlores.org. Don't take HotBot's numbers at face value... for each query HotBot only shows two/three pages per site, which is a good antispam measure. If you want to see all indexed pages from a given site, you must do your search at PositionTech: domain:searchlores.org.
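
One plausible reading of domain: is a suffix match on the host name, label by label. The Python sketch below illustrates that reading only; it is an assumption about Inktomi's behavior, not a documented rule, and the function name is invented.

```python
def matches_domain(host, domain):
    """Suffix match on whole labels, with at most three levels in the domain."""
    host = host.lower().rstrip(".")
    domain = domain.lower().strip(".")
    if len(domain.split(".")) > 3:
        raise ValueError("domains can be specified up to three levels deep")
    return host == domain or host.endswith("." + domain)


for domain in ("org", "searchlores.org", "www.searchlores.org"):
    print(domain, matches_domain("www.searchlores.org", domain))
```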

feature:acrobat Searches for pages that link to a PDF file. Compare the queries:

Example: "link structure" feature:acrobat. As PDF files may not have been indexed for some reason (examples: robots.txt, robots meta tags), this feature may provide some interesting results.

feature:activex Detects pages containing embedded activex, i.e. the presence of the tag <object ... classid="clsid:... >, compare:

feature:applet Detects <applet ...> tag in HTML, compare:

The tags <object ...> (for Internet Explorer) and <embed ...> (for Netscape) can also be used to embed applets, but Inktomi doesn't detect applets embedded this way. Compare:

Documents containing links to .class or .java files aren't taken into account either, compare:

Example of use: +feature:applet +title:play +title:chess.

feature:audio Detects if a page links to an audio file. Audio files can be, among others: wav, mp3, m3u, mid, midi, au, snd, ... The link can be in a:

<a href=...> tag, compare:
- +"http://birds.cornell.edu/BRP/SoundQuake.html"
- +"http://birds.cornell.edu/BRP/SoundQuake.html" +feature:audio
<area href=...> tag, compare:
- +"http://www.roxbury.org/staffdev/whales"
- +"http://www.roxbury.org/staffdev/whales" +feature:audio

Inktomi doesn't detect embedded audio files:

Example: +title:whales +feature:audio
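
The behavior described for feature:audio (links in <a> and <area> count, embedded files don't) can be imitated with Python's standard HTML parser. This is a sketch of the observed behavior, not Inktomi's code: the class and function names are made up, and the extension list only contains the examples named above.

```python
import posixpath
from html.parser import HTMLParser
from urllib.parse import urlparse

# Only the extensions named above; the real list is unknown.
AUDIO_EXTENSIONS = {"wav", "mp3", "m3u", "mid", "midi", "au", "snd"}


class AudioLinkFinder(HTMLParser):
    """Sets .found when an <a> or <area> href points to an audio file."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "area"):
            return  # <embed> and friends are ignored, matching what was observed
        href = dict(attrs).get("href") or ""
        ext = posixpath.splitext(urlparse(href).path)[1].lstrip(".").lower()
        if ext in AUDIO_EXTENSIONS:
            self.found = True


def has_audio_feature(html):
    finder = AudioLinkFinder()
    finder.feed(html)
    return finder.found


print(has_audio_feature('<a href="sounds/humpback.wav">listen</a>'))
print(has_audio_feature('<embed src="sounds/humpback.wav">'))
```

The same skeleton, with a different extension set, would mimic what is described for the other link-based features such as feature:video or feature:vrml.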

feature:flash Contrary to what we might expect, Inktomi detects neither the <embed ...> tag, compare:

nor the <object ...> tag, compare:

For Inktomi, feature:flash means webpages linking to files with the extensions fla, spl or swf, compare:

feature:form Inktomi's crown jewel. Detects the <form> tag in HTML. Inktomi may not index the hidden web, but it offers you a way of knowing where the front doors are! For instance you can use Inktomi to find law databases, translation services (dutch english translate url feature:form), etc.

feature:frame Detects pages containing frames.

feature:homepage Restricts your search to personal pages (identified by ~). Very useful, because the ~ is still the convention for personal pages on educational sites. Example: web search feature:acrobat feature:homepage.

feature:image Detects <img src=...> tag in HTML or a link to an image.

<img src=...> tag, compare:
- "http://www.gnu.org"
- "http://www.gnu.org" feature:image
Link, compare:
- "http://www.me.mtu.edu/~djhilber/tractor.html"
- "http://www.me.mtu.edu/~djhilber/tractor.html" feature:image

Example:

("bird of paradise" OR "birds of paradise") AND (papua OR "new guinea") AND feature:image -stem:travel -stem:hotel

Images are widely used for aesthetic reasons. If an HTML webpage doesn't contain images you may wonder if there's a hidden agenda... probably it's a cloaked/spammed page by a spammer putting in only keywords n' links and not taking the trouble to build a real webpage. You can often trash those annoyances using this useful feature!

feature:index Restricts your search results to the host's top page. Very useful to find sites about a given theme! The host's homepage is the site's most valuable real estate: there the site's owner should put a summary of what his site is all about and provide links to his most important pages. Example, searching for FTP search engines: ftp search feature:index feature:form. Inktomi indexes approximately 27,927,941 webhosts, cf.: feature:index.

feature:javascript Detects pages containing the <script ...> tag with the attribute language="javascript", compare:

Inktomi doesn't recognize javascript embedded in other tags' attributes, compare:

Webpages linking to javascript files (extension .js) are not considered as containing javascript, compare:

Javascript, with the help of forms, is a cheap, yet powerful, way of providing interactive pages. Sometimes it is the right tool to cut out all the bragging pages that do not offer the interactive content they promise. Example:

title:german exercises feature:javascript feature:form

Spitze!

feature:meta Detects <meta ...> tags in HTML.

feature:shockwave Detects pages containing links to files with extension dcr, dir, fla, spl or swf, compare:

feature:script Detects <script ...> tag in HTML, in particular detects other script languages than javascript (for instance VB script), compare:

feature:table Searches for pages containing the <table ...> tag. Tables are widely used to control a page's layout and... ahem to build tables! If an HTML webpage doesn't contain tables you may wonder if there's a hidden agenda... probably it's a cloaked/spammed page by a SEO fearing that some search engines may not fully support tables, or by a spammer putting in only keywords n' links and not taking the trouble to build a real webpage. You can often trash those annoyances using this useful feature!

feature:title Detects pages containing the <title> tag. As almost all webpages contain a title, this feature gives a good estimate of how many documents are in Inktomi's database. Cf.: feature:title.

feature:video Search for pages linking to video files (file extensions: avi, mpg, mpeg, mov, etc.). Example: title:chaplin feature:video. Videos embedded with the tags <embed src=...> or <img dynsrc=...> are not detected. Compare:

feature:vrml Search for pages containing a link to a vrml file (wrl, wrz, vrml). Compare:

Inktomi is unable to see embedded vrml files. Compare:

Example: web AND (link OR links) AND graph AND feature:vrml

link: Finds pages which contain hypertext links to the exact URL specified. Example: link:http://www.searchlores.org/news.htm. The target can also be a link to a parent directory: link:../.

linkdomain: Search for pages linking to any page in a given site, example: linkdomain:www.searchlores.org -domain:searchlores.org.

linkextension: Very useful. Searches for pages linking to a file with a specified extension. As what characterizes a file is its extension, this feature provides a way of getting only the pages which link to the real thing.

originurl: Webmasters use this operator to see if a page is in Inktomi's database. Example: originurl:http://www.searchlores.org/news.htm (don't forget the http:// part! duh:). Can also be used to anchor a page and study what went wrong with Inktomi's ranking algo.

originurlextension: Very useful, restricts your search to documents with a given extension. Example: web search originurlextension:pdf.

originurlpath: The same as path:.

outgoingurltype:[url_type] Search for pages linking to a certain mime type. Example: outgoingurltype:image/jpeg. Does more or less the same as linkextension:.

path: Searches for words in the URL's path. You can also search for phrases, but the syntax isn't the one we would expect, path:"keyword1 keyword2"; instead it is "path:keyword1 path:keyword2". Example: path:fravia.

region:name Restricts your search to a geographical region (africa, asia, centralamerica, downunder (Oceania), europe, mediterranean, mideast (Middle East), northamerica, southamerica, southeastasia). Example: stem:laws noise stem:levels region:europe. You can find which countries are included in each region here. This field can also be used to get an estimate of how many documents are in Inktomi's database:

Region                     Documents
region:africa              1,821,859
region:asia               35,037,863
region:centralamerica        651,599
region:downunder          11,232,997
region:europe            190,119,720
region:mediterranean       1,611,480
region:mideast             2,351,964
region:northamerica      524,553,025
region:southamerica       13,688,268
region:southeastasia       2,113,385
Total:                   783,182,160
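
The total can be checked by adding up the regional counts listed above, and the same numbers show how lopsided the database is. A small Python sketch using only those figures (the variable names are mine):

```python
# Document counts per region, as reported by the region: queries above.
REGION_COUNTS = {
    "africa": 1_821_859,
    "asia": 35_037_863,
    "centralamerica": 651_599,
    "downunder": 11_232_997,
    "europe": 190_119_720,
    "mediterranean": 1_611_480,
    "mideast": 2_351_964,
    "northamerica": 524_553_025,
    "southamerica": 13_688_268,
    "southeastasia": 2_113_385,
}

total = sum(REGION_COUNTS.values())
print(f"{total:,}")  # 783,182,160

# Share of each region, largest first.
for region, count in sorted(REGION_COUNTS.items(), key=lambda item: -item[1]):
    print(f"{region:15s} {count / total:6.1%}")
```

The sum agrees with the estimate of roughly 800 M documents obtained earlier from feature:title and depth:4.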

stem: Search for documents containing grammatical word variants including plural, singular, and tense. Example: web search -stem:advertise -stem:business -stem:christ -stem:game -stem:genealogy -stem:host -stem:hotel -stem:job -stem:offer -stem:position -stem:product -stem:service -stem:shop -stem:travel.

title: Searches for words in the title. You can also search for phrases, but the syntax isn't the one we would expect, title:"keyword1 keyword2"; instead it is "title:keyword1 title:keyword2". Example: "title:index title:of" -originurlextension:htm -originurlextension:html
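
Because the phrase syntax for title: (and for path:) is so counter-intuitive, a tiny helper avoids typing it wrong. A Python sketch with a made-up function name, which simply prefixes every word of the phrase with the field and wraps the lot in quotes, as described above:

```python
def field_phrase(field, phrase):
    """Build Inktomi's phrase syntax for a field, e.g. "title:index title:of"."""
    return '"' + " ".join(f"{field}:{word}" for word in phrase.split()) + '"'


query = field_phrase("title", "index of") + " -originurlextension:htm -originurlextension:html"
print(query)
```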

References