June 2000

~ Altavista ~
Let's have a look at the STRUCTURE of Altavista.
Searchers should reverse their tools, eh... :-)


(this is still in fieri, of course)


[Basic recall]    [Advanced insight]    [AV querystrings]    [Altavista forms]


Basic recall

Write it down and remember it: for Altavista's queries the best method is ALWAYS advanced search with boolean logic (and field limiting) and then heavy use of the sort box.

Advanced insight

For any search engine, ergo for Altavista as well, it is critically important to assign location values properly during indexing. Inside Altavista, the assigning of locations is fully automatic in the simplest case... where a function called avs_addword does all the work.
In this case the words of the document are laid out end to end and are numbered sequentially, starting with the value returned by other functions (avs_newdoc or avs_startdoc). The same is true for field boundaries and for values (indexed quantities, like dates, that can be range-searched). The following diagram shows how two very short documents would be stored inside Altavista's index database.

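location   word                   document
1          here                   document 1
2          you                    document 1
3          have                   document 1
4          a                      document 1
5          short                  document 1
6          page                   document 1
7          Thisnotwithstanding    document 2
7          thisnotwithstanding    document 2
8          here                   document 2
9          you                    document 2
10         have                   document 2
11         another                document 2
12         short                  document 2
13         page                   document 2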

As the figure illustrates, each word is actually stored as a word-location pair. The index also contains information about the beginning and ending locations of each document. Document1 starts at location 1, and Document2 starts at location 7. In Document2, the first word contains an uppercase letter, so the word is indexed twice: once with case preserved and once in all lowercase. Both versions of the word are at the same location, so that the word would be found appropriately regardless of whether a query is case sensitive or case-insensitive.
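
To make the layout concrete, here is a minimal Python sketch of such a word-location index. It is not Altavista's code: build_index and doc_for_location are hypothetical names, and the first word of document 2 is read as the single token Thisnotwithstanding, which is what the 13 locations of the diagram suggest.

```python
from collections import defaultdict

def build_index(documents, first_location=1):
    """Lay documents end to end; return (postings, doc_ranges)."""
    postings = defaultdict(list)   # word -> list of locations
    doc_ranges = []                # (doc_number, start, end), end inclusive
    loc = first_location
    for number, words in enumerate(documents, start=1):
        start = loc
        for word in words:
            postings[word].append(loc)
            # A word with uppercase letters is also stored in lowercase,
            # at the very same location.
            if word != word.lower():
                postings[word.lower()].append(loc)
            loc += 1
        doc_ranges.append((number, start, loc - 1))
    return postings, doc_ranges

def doc_for_location(doc_ranges, loc):
    """Map a location back to the document that contains it."""
    for number, start, end in doc_ranges:
        if start <= loc <= end:
            return number
    return None

doc1 = "here you have a short page".split()
doc2 = "Thisnotwithstanding here you have another short page".split()
postings, doc_ranges = build_index([doc1, doc2])
print(doc_ranges)
print(postings["here"])
```

Since every hit is just a location, the document boundaries are all you need to turn a hit back into a document.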

The words are added sequentially. Every so many documents, or when the last document of a linked bunch has been processed, the actual update to the index is made, using avs_makestable.

The avs_newdoc procedure defines a block of text as a document and establishes an identifier with which the document can be found in the index. The avs_newdoc procedure also defines a filter, which does the bulk of the work of preparing the document to be indexed. It is at the filter stage where any necessary document type conversion takes place. The filter function is called using the following arguments:
IN avshdl_t idx (index handle)
IN void *pFname (information sufficient for the filter to access
                 the document contents)
IN unsigned long startloc (starting location for adding words)
OUT unsigned long *pNumWords (number of words added to the index)
Once the filter is finished processing a block of text, it can pass the text (in the form of a line, a paragraph, or even the entire document), to the avs_addword procedure. The avs_addword procedure parses the text into words and adds those words to the index. It interprets as a word any sequence of letters or digits that is surrounded by spaces or other non-alphanumeric characters. When it adds a word to the index, the avs_addword procedure preserves the case of the word as it appears in the document. If the word contains any uppercase letters, the software also indexes a lowercase version of the word, to support case-insensitive searching.
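
The parsing rule just described can be imitated in a few lines of Python. This is only a sketch of the behaviour, not the SDK: add_words is a hypothetical stand-in for avs_addword, the index is a plain list of pairs, and restricting "letters or digits" to ASCII is an assumption.

```python
import re

# Letters or digits (ASCII only, an assumption); anything else is a separator.
WORD = re.compile(r"[A-Za-z0-9]+")

def add_words(index, text, startloc):
    """Parse text into words, add (word, location) pairs, return the word count."""
    num_words = 0
    for match in WORD.finditer(text):
        word = match.group()
        loc = startloc + num_words
        index.append((word, loc))            # case preserved
        if any(ch.isupper() for ch in word):
            index.append((word.lower(), loc))  # lowercase copy, same location
        num_words += 1
    return num_words

index = []
n = add_words(index, "Hello, world: it's 2000!", startloc=1)
print(n, index)
```

Note how the apostrophe splits "it's" into two words: any non-alphanumeric character counts as a boundary.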
That was it... re-read the snippet above and you will know more about search engines than many self-proclaimed experts do.
Of course there is MUCH more to learn, though: knowledge is a never ending downhill run on your sledge, eh.


In fact many more "menial" tasks are performed, for instance the following ones: Set a date for the document ~ Specify a data string to be returned as a search result ~ Set a date and time for the document ~ Identify certain words to be indexed as fields ~ Add a single word exactly as entered to a document index ~ Index the supplied date at the specified location ~ Index the supplied value at the specified location ~ Add a numeric value to a document index that can be used for custom ranking.

Ranking values are very important when retrieving results. For example, suppose you want a value type called rlines to order search results by the number of lines per document. You must supply the name (rlines) and the lowest and highest possible values. The following code example defines the value type for extended ranking of search results, in this case the number of lines per document.
error = avs_define_valtype("rlines", 0, 10000, NULL, &rlinesvaltype);
if (error != AVS_OK) {
    printf("avs_define_valtype returned %s\n", avs_errmsg(error));
    return 1;
}
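
What such a value type buys you can be sketched in Python. Everything here is hypothetical (define_valtype and sort_by_value are illustrative names, not SDK calls), and rejecting values outside the declared range is an assumption about how the bounds are used.

```python
def define_valtype(name, low, high):
    """Describe a value type by name and by its lowest and highest values."""
    if low > high:
        raise ValueError("low must not exceed high")
    return {"name": name, "low": low, "high": high}

def sort_by_value(docs, valtype, descending=True):
    """Order (doc_id, value) pairs by value; out-of-range values are rejected."""
    for doc_id, value in docs:
        if not valtype["low"] <= value <= valtype["high"]:
            raise ValueError(f"{doc_id}: {value} outside {valtype['name']} range")
    return sorted(docs, key=lambda pair: pair[1], reverse=descending)

rlines = define_valtype("rlines", 0, 10000)
results = sort_by_value([("a.htm", 120), ("b.htm", 15), ("c.htm", 980)], rlines)
print(results)
```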
Dates
When indexing documents, a date can be set for each one through the avs_setdocdate or the avs_setdocdatetime procedures. Once the dates are in the index, it is possible to use dates or date ranges to limit searches. The date is returned in the search results.
Altavista is capable of storing dates from 01/01/0100 through 31/12/2148 (dd/mm/yyyy).
Searchers can limit a query with a date range added as an extra Boolean term. The format of the date range is [dd/mm/yyyy-dd/mm/yyyy]. If a searcher omits the beginning date, the query will return everything in the index with a date before the end date. If a searcher omits the end date, the query result will contain all documents with dates after the beginning date. If a searcher wants only the documents indexed on one date, he should use the same beginning and ending dates. The end dates are part of the range.
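
The range rules can be sketched as follows. This is only an illustration in Python: parse_range and in_range are hypothetical names, and treating both endpoints as inclusive follows the last sentence above.

```python
from datetime import date, datetime

def parse_range(text):
    """Parse '[dd/mm/yyyy-dd/mm/yyyy]'; either side may be left empty."""
    begin_text, end_text = text.strip("[]").split("-")

    def to_date(s):
        return datetime.strptime(s, "%d/%m/%Y").date() if s else None

    return to_date(begin_text), to_date(end_text)

def in_range(doc_date, begin, end):
    """Both endpoints are inclusive; None means no limit on that side."""
    return ((begin is None or begin <= doc_date)
            and (end is None or doc_date <= end))

begin, end = parse_range("[01/06/2000-30/06/2000]")
print(in_range(date(2000, 6, 30), begin, end))   # last day still counts
print(parse_range("[-30/06/2000]"))              # open beginning
```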

There are various types of possible searches. The search engine ranks the results of a search based on a weight value assigned to each word in the query, and on the resulting overall relevance rating of each document that meets the search criteria.
A document earns a relevance rating based on the number of words in the search query that it contains, and the weight value of each of those words. The document containing the most words with the highest weight value is considered most relevant. The closer the relevance rating is to a value of 1, the more likely it is that a document meets the search criteria.
The weight of a word is determined by the number of occurrences of that word in the entire index. A word that occurs less frequently in the index earns a higher weight, based on the assumption that it is more precise and specific than a word that occurs frequently. For example, the word "searching" might occur many times in an index, whereas the word "combing" would probably occur less frequently. "combing" would be given a higher weight than "searching" in a search query containing both words, because a document containing only the word "combing" would be more likely to match the searcher's interest than a document containing only the word "searching." A document containing both "combing" and "searching" would earn the highest relevancy ranking.
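
The text gives no formula, so the following Python sketch simply assumes one: a word's weight is 1 divided by its number of occurrences in the index, and a document's rating is the share of the total query weight it covers. The functions and the occurrence counts are hypothetical, chosen only to reproduce the "combing" versus "searching" reasoning.

```python
def word_weights(query_words, occurrences):
    """Rarer words get a higher weight (assumed here: 1 / occurrence count)."""
    return {w: 1.0 / occurrences[w] for w in query_words if occurrences.get(w)}

def relevance(doc_words, weights):
    """Share of the total query weight that the document covers (0 to 1)."""
    total = sum(weights.values())
    if total == 0:
        return 0.0
    found = sum(wt for w, wt in weights.items() if w in doc_words)
    return found / total

occurrences = {"searching": 1000, "combing": 10}   # made-up index counts
weights = word_weights(["searching", "combing"], occurrences)
only_combing = relevance({"combing", "hair"}, weights)
only_searching = relevance({"searching", "web"}, weights)
both = relevance({"combing", "searching"}, weights)
print(only_searching, only_combing, both)
```

Whatever the real formula is, the ordering is the point: the rare word alone beats the common word alone, and both together come closest to 1.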

The position of the word in the document, and the frequency of occurrence of the word in a single document, have some bearing on the ranking of a document. The most significant factor in determining ranking is the combined weight of words in the search query. Also, the search engine considers only words without an operator preceding them when it does ranking. If operators precede all words in the search query, the results are returned in no particular order.

Basic searches
As you know, to perform a basic search, a seeker uses the operators plus (+) and minus (-) to indicate words or phrases that are required or prohibited in the search results. For example, the following query expression requests documents that must contain the word hints and can also contain the phrase how to search:
"how to search" +hints

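A rough Python sketch of how the plus and minus operators filter results. The names are hypothetical and the matching is deliberately naive (case-insensitive substring tests), so take it as an illustration of the rule, not of the engine.

```python
import shlex

def parse_basic(query):
    """Split a basic query into required, prohibited and optional terms."""
    required, prohibited, optional = [], [], []
    for token in shlex.split(query):      # keeps "quoted phrases" together
        if token.startswith("+"):
            required.append(token[1:])
        elif token.startswith("-"):
            prohibited.append(token[1:])
        else:
            optional.append(token)
    return required, prohibited, optional

def matches(text, required, prohibited):
    """Keep a document only if it has every required and no prohibited term."""
    text = text.lower()
    return (all(t.lower() in text for t in required)
            and not any(t.lower() in text for t in prohibited))

required, prohibited, optional = parse_basic('"how to search" +hints')
print(required, prohibited, optional)
print(matches("Some hints on how to search", required, prohibited))
```

The optional phrase plays no part in the filtering; it only matters for ranking.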

Boolean
Boolean Query Syntax
For Boolean searches, use the logic operators AND, OR, NOT, NEAR, and WITHIN. For example, the following query requests that either of the words find or target appear in the same document with either of the words search or seek.
(find OR target) AND (search OR seek)
The following query requests that both the words search and seek appear in a document's title: field.
title:(search AND seek)
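
Boolean operators map directly onto set operations over the lists of documents that contain each word. A small Python sketch with made-up postings (the document ids and the separate title table are assumptions for illustration):

```python
# Hypothetical postings: word -> set of ids of the documents containing it.
body = {"find": {1, 2}, "target": {3}, "search": {2, 3, 4}, "seek": {1}}
# The same, but only for words found in the title: field.
title = {"search": {2}, "seek": {2, 5}}

def docs(postings, word):
    """Documents containing the word (empty set if the word is unknown)."""
    return postings.get(word, set())

# (find OR target) AND (search OR seek)
either = ((docs(body, "find") | docs(body, "target"))
          & (docs(body, "search") | docs(body, "seek")))

# title:(search AND seek)
in_title = docs(title, "search") & docs(title, "seek")

# find AND NOT seek
without_seek = docs(body, "find") - docs(body, "seek")

print(either, in_title, without_seek)
```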

Rules for Query Processing
Both the ranking and Boolean search procedures follow the same basic rules for processing queries: Altavista provides support for Boolean searches, including AND, OR, NOT and NEAR (proximity) searches. This -as you know- allows for phrase searching and proximity searching to be performed on indexed documents.

Note that you can use the WITHIN ## command (where ## is a number of words) to control how far apart the words in your query string can be. For example, if you want to find the word fravia within 5 words of searchlores, use the Boolean query string:
fravia WITHIN 5 searchlores
This query will bring a result for fravia and searchlores when they are not more than 5 words apart, instead of the default of 10 words apart.
Thus using NEAR in your search is the same as using WITHIN 10.
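
Since every word is stored with its location, a proximity test boils down to comparing locations. A minimal Python sketch (within and near are hypothetical helper names; the locations are invented):

```python
def within(positions_a, positions_b, n):
    """True if some occurrence of a lies at most n words from some occurrence of b."""
    return any(abs(a - b) <= n for a in positions_a for b in positions_b)

def near(positions_a, positions_b):
    """NEAR is the same as WITHIN 10."""
    return within(positions_a, positions_b, 10)

fravia = [3, 40]        # locations of "fravia" in one document
searchlores = [9]       # locations of "searchlores" in the same document
print(within(fravia, searchlores, 5))   # 6 words apart: too far for WITHIN 5
print(near(fravia, searchlores))        # but close enough for NEAR
```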

How the Public AltaVista Search Site Sets the Virtual Memory Attributes
The AltaVista Search site on the web has the following setting for its virtual memory attributes:
vm-mapentries = 1000
vm-maxvas = 1337438953472
ubc-maxpercent = 70

The following are settings for processes:
max-per-proc-address-space = 137438953472
max-per-proc-data-size = 17179869184
per-proc-address-space = 137438953472
per-proc-data-size = 17179869184
max-proc-per-user = 256
max-threads-per-user = 2048

Typically these machines are larger than average: 8-processor, 6-8 GB.

Note that there are limits to the "Ranking word maximum frequency": the ignore_thresh parameter is expressed in hundredths of a percent (for example, 1000 = 10%). Any word that occurs in the index more frequently than this percentage is not counted for ranking purposes (but the word is still counted for Boolean ranking purposes).
This is intended to be a performance optimization: if this value is set lower than the default (1000), ranked searches run faster but the ranking is less precise. If the value is set higher than the default, ranked searches are slower but the ranking is more precise. The range for this parameter is 1-1000.
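
The arithmetic of the threshold is easy to get wrong, so here is a small Python sketch. Only the parameter name ignore_thresh and its unit come from the text; rankable_words is a hypothetical helper, and measuring a word's frequency as its share of all word occurrences in the index is an assumption.

```python
def rankable_words(query_words, occurrences, total_words, ignore_thresh=1000):
    """Drop words whose share of the index exceeds ignore_thresh / 100 percent."""
    limit = ignore_thresh / 10000.0      # 1000 -> 0.10, i.e. 10%
    return [w for w in query_words
            if occurrences.get(w, 0) / total_words <= limit]

occurrences = {"the": 150, "search": 50, "combing": 2}   # made-up counts
query = ["the", "search", "combing"]
print(rankable_words(query, occurrences, total_words=1000))
print(rankable_words(query, occurrences, total_words=1000, ignore_thresh=100))
```

Lowering the threshold leaves fewer words to weigh, which is where the speed gain (and the loss of precision) comes from.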

AV querystrings


Shoot an Altavista query and then look at the resulting URL you asked for.
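
If you want to take such a URL apart quickly, Python's standard library does the splitting for you. The URL below is a placeholder on a reserved example domain, and its parameter names (q, sort, from) are invented for illustration: paste the URL of your own query instead.

```python
from urllib.parse import urlsplit, parse_qs

def query_params(url):
    """Return the querystring of a URL as {name: [values]}, blanks included."""
    return parse_qs(urlsplit(url).query, keep_blank_values=True)

# Placeholder URL with made-up parameter names, NOT a real Altavista query.
url = "http://search.example/query?q=fravia+AND+searchlores&sort=&from=01%2F06%2F00"
params = query_params(url)
for name, values in params.items():
    print(name, "=", values)
```

Empty parameters are kept on purpose: a field the form sends with no value is as interesting to a reverser as one that is filled in.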

This is only an appetizer :-) Serious seekers may enjoy having a look at a special classroom:
[c_fourth.htm]: Spelunking altavista's acronyms by Humphrey P., Gregor Samsa & Iefaf, June 2000, part of the [classroo.htm] section: A fundamental 'search engines reversing' classroom.

Altavista forms

ALTAVISTA ADVANCED SEARCH
Very quick! Text-only version, of course!
Form fields: Boolean query ~ Sort by ~ Language ~ Show one result per Web site ~ From / To dates (e.g. 31/12/99)
Links: Simple search ~ Graphic Version


ALTAVISTA SIMPLE SEARCH
Very quick! Text-only version, of course!
Simple search: no boolean! defaults to OR, use advanced instead!
Form prompt: "Ask AltaVista a question. Or enter a few words in" ~ Buttons: search, refine
Links: Search ~ Advanced ~ Usenet



Still quite in fieri, I'm afraid...


(c) 2000: [fravia+], all rights reserved