xthought.htm: How to search the web, by fravia+, Some thoughts on CSE results

~ Some thoughts on CSE results ~

How search engines work

Version October 2001

[ a database addition by rõnin]
Eh, this is still in fieri, and is just a first attempt, ya know...

Some thoughts on CSE results

by ~X~, October 2001

Well, I've been doing a little bit of thinking about the results of the Commercial Search Engines (CSE). How do they judge which web page is the right one for our queries, and by what criteria is one web page placed above another? I will not go into deep details, you can find them in the "analyzing search engines" part of searchlores; I'm just putting some arguments on the table here.

Well, it seems that we either have a Pagerank system (or something similar) or a simple you-submit+we-index-it. I couldn't get much information on either system from the official commercial sites of some well known Search Engines. Actually I checked google, alltheweb and av, and looked in their help pages, but nothing interesting popped up. Only google provides some pretty vague information on how it ranks the pages and then pushes them to us. On the other hand I didn't find anything on av or alltheweb (couldn't reach topclick.com at the time..). I am not talking about using the engines to search for information on how they rank pages; I just checked their own (official) help files, if any.

Well, it seems to me that they do not want to provide much information on how they do it. But most of them are said to use a system similar to google's pagerank, so I'll stick to that. From what google says, pagerank is an almost perfect system to rank sites, and above all it cannot be tampered with by humans (well, actually it's "extremely difficult", which doesn't exclude the tampering factor totally, does it?) and that's a very good thing. They tell us that not even their own engineers can lay a hand on the ranking system. Above all, the pagerank system is based on the democratic (??!?!?) nature of the web.

And I thought the web had an anarchistic structure, silly me. It's purely democratic, they say, and that means that _high-quality_ sites carry a heavier weight when voting a site higher up the ranking. Ok, we have democracy, and already high-quality (a.k.a. better than the rest, like the aristocracy) sites are popping up, and their voting (i.e. linking) means much more than your site's voting. It goes like this: if my high-quality site has a link to fravia's site, and your low-quality site has a link to fravia's as well, my link will represent a much more worthy "vote" than yours does. See? Democracy is not about equality after all, google says so (copying&pasting: "votes cast by pages that are themselves "important" weigh more heavily and help to make other pages "important""). That could mean that if you get your site linked from a high-quality site, your site will go up in the ranking faster than if it got a vote from an everyday ordinary site THEY do not consider high-quality. After all it's who you know, not who you are, don't you agree? Google does.
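To make the "weighted vote" idea concrete, here is a minimal sketch of a pagerank-style computation. It is NOT google's code (nobody outside has seen that); it is only the textbook iterative scheme, and the page names, the tiny link graph, the damping value and the number of iterations are all my own assumptions for illustration. Both site_a and site_b receive exactly one link, but site_a's comes from a page that is itself heavily linked.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook-style iterative rank computation (illustrative only).

    links maps each page name to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal rank everywhere
    for _ in range(iterations):
        # every page gets a small base share, whatever its links
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # a page splits its own rank among the pages it "votes" for
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # a page with no outgoing links spreads its rank evenly
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical link graph: names are made up for this example.
links = {
    "fan1": ["big_portal"],
    "fan2": ["big_portal"],
    "fan3": ["big_portal"],
    "big_portal": ["site_a"],   # one vote from a heavily linked page
    "small_page": ["site_b"],   # one vote from a page nobody links to
    "site_a": [],
    "site_b": [],
}

ranks = pagerank(links)
for name in sorted(ranks, key=ranks.get, reverse=True):
    print(f"{name:12s} {ranks[name]:.4f}")
```

In this toy graph site_a ends up ranked above site_b although each has a single incoming link, which is exactly the "who you know" effect described above.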

NOTE: In the name of google I express comments about ALL ENGINES THAT USE SIMILAR WAYS IN RANKING SITES. Nothing specific against google. It's just my scapegoat :)

So, dear webmaster, what your site depends on, if it is to be in the first 30 results on google (90% of usual users don't go further than the first 1-2 pages, except those annoying seekers that have the tendency to jump over our first placed clients, err, I meant democratically ranked (pay) sites), is having affiliates who have quality sites themselves. Now can someone PLEASE define what a quality site is? And it seems google has some quality sites BY DEFAULT, which in their turn transform other petty sites into high-quality ones as well. Now how do they do that? What kind of magic do they use to get the high-quality sites? I mean, damn, I wanna get my hands on them too! Don't you? Why don't they just provide us with an index of their high-quality sites instead of filling our screens with (what they perceive as) low quality sites for our search terms? Oh wait, that's how they are supposed to be working, what we get are ONLY high-quality sites. No garbitch, they don't present garbitch results. Hmm ok..but..

What about their whole index? If the first ranked sites are supposed to be the high-quality ones, then why bother indexing more than (say) 50 sites (per search string)? Aren't the ones beyond the 50th supposed to be low quality ones? Why don't they just keep the 30 or 50 best (most wanted!) quality sites and delete the rest? That way they could lower their server load and make their index smaller. Ieek!!! Smaller? Did I say smaller? No man, we don't want small indexes, we want big ones, huge ones, because big indexes mean a lot of pages and we can make the suckers think that with more pages you'll find what you want more easily, ehehe, as if it were up to them and their search queries, eheheh, it's just a matter of prestige. You see, there is no relation between content and a huge index when your results are _judged_ by a ranking system. You won't get more than a few hundred results (even if you try). So what's the big fuss about indexing billions of sites unless they are all high quality ones? Which they can't be, since the engines are ad hoc dividing their indexed sites into high quality sites and on-evaluation (to put it nicely) sites. See my point? Searching in a huge index that will give you (without your interference) a predetermined number of results is pointless, or at least keeping a huge index after checking all the pages for "quality" is pointless as well. Why not just keep the quality ones and serve those to us? They are supposed to be high quality, thus with super content and stuffed with educational information.
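The arithmetic behind this can be sketched in a few lines. This is only an illustration under my own assumptions: the cap of 300 visible results, the index sizes and the random "quality" scores are all invented, and no real engine is claimed to work like this internally. The point is simply that once results are cut off after a fixed number, a bigger pile of matches does not put more results within the seeker's reach.

```python
import heapq
import random


def visible_results(scored_pages, cap):
    """Return the `cap` best (page, score) pairs, highest score first."""
    return heapq.nlargest(cap, scored_pages, key=lambda item: item[1])


random.seed(0)
CAP = 300  # hypothetical limit on how many results a user may page through

for index_size in (1_000, 100_000):
    # pretend every page in the index matches the query, with a made-up score
    matches = [(f"page{i}", random.random()) for i in range(index_size)]
    shown = visible_results(matches, CAP)
    print(f"matches: {index_size:>7}  reachable: {len(shown)}  "
          f"share of matches you can see: {len(shown) / index_size:.2%}")
```

Whatever the index size, the number of reachable results stays at the cap; only the share of the index you are allowed to see shrinks.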

I really think that huge indexes are a lure for the naive. It's like "come over here, we've got 100 women for you to pick from, but you can only pick 1, and she will be among the 5 WE have already decided you can pick from; those 5 were chosen ad hoc to be offered to you, while the other 95 are, one by one, ugly and mere window dressing, and we don't intend to give them to you or anyone else anyway, ever". Well, there is always the chance that one of the low quality sites will reach high standards at some point, but even so there is no need to advertise a huge index to us: they can keep their index and evaluate it without virtually providing us with it, since either way we can't get our hands on it. So why brag about it? (Could it be that they are luring potential customers that way? Shhh!!! I'll wash my mouth with soap, bad me.)

What's the point in advertising something you cannot provide, or do not intend to provide? It's the exact point of advertising! Luring us into coming to their business and buying from them (crap as always)! Why don't they let us browse to the millionth result of our query? Yes, they've got a good excuse, it's called "99% of people would never bother looking at the millionth result". Then why the heck do you claim to provide it?

I do not intend to provide a high-quality analysis of how good the pagerank system is or isn't. I'm just supporting the idea that it is not what it seems, and that it has some really big holes for something they want to present as so politically correct. Even with this small post I believe I touch on some aspects that need more consideration, especially when we try to tie content to the number of indexed pages. It is obvious that one doesn't imply the other under the CSE terms of use or, if you like, limitations of usage.

It all boils down to this, I think: content and information have nothing to do with how big a search engine's index is, as long as they limit your ability to use all the given results as you like. Yes, the bigger the index is, the more probable it is that you will find what you are searching for, but that cannot happen while THEY have already determined WHAT quality you want, and THAT is the quality you will get, because they believe so, and you can't get any more than they WANT you to. See? It's not so democratic after all (or I'll have to read about political systems again).

Sure, we usually find what we seek with CSE, no one doubts that. They work conveniently enough for our liking and we usually get a hand on our target. But since this is a money based business (as it seems), how long can we rely on it? And how late will it be when we happen to realize that the real quality sites have been deleted from their index for the sake of being politically correct and conforming to the fashion? It's not that the CSE do not work, or do not provide results, it's just that when you try to find "weird" things they are not so handy, which suggests that something must be wrong. I mean, man, how can they index 1.6 billion sites and not have what I am looking for? How can it be that the more special the search becomes, the less usable (not just in the number of results) and the uglier the CSEs get? That's why we often rely on other means of searching to get our meat, especially when we have weird tastes. Who would have thought that, since the web is supposed to be THE means to search (at least that's what fashion suggests), we would have to go back to archie and gopher to get some scraps of truth?

Don't let this slip by: by having a pagerank system which is not so democratic after all, they can provide the public with superb propaganda and news coverage that has nothing to do with..news.. Just imagine how much CNN's "voting" (and its ugly propaganda) meant after the recent WTC attack for the relevant results that would pop up when you searched about it. See what I mean? The problem is much bigger than simply not finding something, it's about finding the wrong things as well while thinking you got the right ones. Ok, the bulk of users, who are not interested in delving into the lore of searching, will not be bothered, not now, not ever. But we are bothered, because they are keeping us away from the good meat with stupid excuses and trying to feed us dirt! We want transparency in what we are looking at, not a well camouflaged trap.

Maybe I am exaggerating a little, but you have to exaggerate sometimes to make your point clearer and sharper.

Another thing about noble google. It says that in no way can a higher ranking be bought. Why not? What protects us from it? The "democracy" of their pageranking? (That's even worse!) I mean, they've got the pie and they've got the knife, and no one knows the code internals of their search engine well enough to prove that they can't associate specific words in search queries with specific results, putting their affiliates' sites at No 1 on the results page. Can we trust something we can't know? Can we believe in a God we cannot see? Some can, I can't. It's a matter of good will, and I tend to lack that part in my character. At least when something reaaaally fishy is going on.

By the way, I have noticed that several people prefer google to altavista because google is banner free. Ok, I'll give you a hint: take a look at their source code! No need to bother with not-so-sic ads when they direct all the traffic data to their, ahem, associate at once. Pretty cool, huh?

The whole Commercial Search Engine scene is turning from bad to worse: data is siphoned off in front of our eyes, bleaching banners fill our screens and pre-chewed food/results are dumped onto our monitors, and we are happy! Well, you can't expect anything better from something that has a commercial eye, it's them and us. We've got to react if we want a really democratic, ad free system! The first move was made with the "scrolls", which are perfect (congratulations to their "fathers" and to all those who assisted; I'm trying to compile an ftp scroll to send you, but I am veeery slow at it. Got the stuff but have to organize it into .def files. Will eventually do it). Searching bots are ideal, but we could use something more "public".

They work for money, everyone (?) works for money, but they, it seems, work for money only. That's sad. You see, money is good and can get you much, but it cannot give you a reason to live, since all those reasons (glory, love, romance, happiness) cannot be bought. Heh, it's funny how life itself proves money wrong, just when people think money can make everything right.

Work well!

P.S. I intend to post a couple of other parts as well, not entirely relevant to this part but very well connected to it. I am hoping for your feedback so I can rework this small post and maybe even create an essay from it. But I want it to be teamwork, not just some of my mumbo-jumbo :)

Thank you for spending your precious time to read this.
~X~

A database addition by rõnin

I was thinking about how someone could find out the algorithm used by Search Engines to rank results, and I landed here after a quick search:

http://www.searchengineworld.com/newsletter/2000/algo_cracking.htm
SEO: Zen and the Art of Algo Cracking

Which is indeed a nice article; it reminded me of the essays written by Humphrey on altavista.

Then, a hole ("there is a crack, a crack in everything, that's how the light gets in") :)

Clicked here :
http://www.searchengineworld.com/misc/resource.htm

And clicked again to land on a conference called: The Eighth International World Wide Web Conference.
I don't know if it was already posted here. Anyway, I had a look ... and a deep one in fact :) I found out that the 9th and 10th ones had already taken place. And all the publications can be found on CD, with the content available online.

Fwwwuuuuuu .... lots of papers here. Many domains available.
So, may I propose a little database built from these sites?
I've tried to dig out the papers related to searching, focusing especially on _ranking_ the results. But this stuff could be used in many of our works. I'm especially interested in the ones concerning metaphors and mapping (MappeR ? Wake-up ! ;))
I've put the company name where companies were directly involved through their research centers. The other researchers are generally from various universities and institutes. (The company labs are easier to find in the partners part of the search engine sites.)
I'm sure I've not collected ALL the stuff that is valuable for our works. Feel free to dive into the www conference websites to verify.
All papers are available in pdf. I think this page needs a bit of formatting to permit easier searching within it.

-------------------

I've extracted 3 papers from that DB, all related to a project named Mercator, to show how we could use all these papers to dive into _hidden_ search engine knowledge.
In effect, we have here people who seem to be at the top of Search Engine research, especially in the ranking field. Even if their papers aren't easily understandable, we can extract useful information from them: keywords, project names, names of SE experts, bibliographies, which Search Engine works with which lab, etc ...
We could spend months following the bibliographic paths they offer us, and I'm sure we'll find gems and invaluable treasures.

Back to the Mercator path. You'll see that it can indeed be quite interesting, because this project seems to be A PART of the altavista 3.0 solution (the current one):

"Welcome to the Mercator home page. Mercator is a web crawler built by researchers at Compaq's Systems Research Center.
We have received a number of requests for the Mercator sources. Mercator has been transferred to AltaVista, and is part of
AltaVista Search Engine 3. You should contact AltaVista for product and pricing information on Search Engine 3."
http://www.research.compaq.com/SRC/mercator/

-------------------
Wanna go even deeper? Just perform a search for the sites that link to the Mercator project (or to another project mentioned in this DB).

See, for example, what you can fish out:

http://www.cwi.nl/InfoVisu/links.html
Information Visualization (big links collection)

Okay, enough for tonight. Hope you won't lose yourselves in the nodes ;)

rõnin