Luxembourg_2006.htm: fravia's talk at the Hack.lu 2006

Version 0.33, Updated 21/OCT/2006

How to find anything underneath the commercial web
"Powersearching without google"
Fravia's talk at the Hack.lu (Luxembourg, Luxembourg - 19-21 October 2006)
This file dwells @ http://www.searchlores.org/Luxembourg_2006.htm

Introduction
Recall and Precision
Let's analyze some simple music queries
Today's targets (python & proximity galore)
Finding the three targets
Potpourri searches
Look ma, no google!
Let's search elsewhere
Spamming (& "popularity")
Conclusions
Assignment
Forms

Introduction


Well, I didn't know *anything* was already in the public domain!

Please excuse my English, and please excuse my excessive lack of political correctness. That -and my disdain for all commercial aspects- is probably also one of the reasons I'm the only one here, together with my good old friend: The Grugq, who prefers to use a pseudonym :-)

Thank-you for inviting me. I kinda like Luxembourg: klein aber fein.
I also like reversers, and crackers and hackers. Always did. Sons of the light: deserve some cosmic searching power.

We'll examine some 'alternative' searching techniques: On today's menu: music and books à la carte.
As a proof of concept we will search some mp3s and some books (today about the python programming language).

Searching such simple targets (music and books) we will, maybe, demonstrate

that everything is on the web;
that you can search rather effectively without using google;
that some "alternative searching paths" can be quite useful for seekers.

Of course our searches are not limited to just music and books.
Seekers can always find anything: whatever. Any image, any journal, any film, any software you fancy is somewhere on the nethervoid at your disposal.
In fact anything that may have been digitized is indeed somewhere out there (one of the 'optimistic' laws of searching).

Even more important is the fact that not only "tangible" and "digitized" targets are available to anyone, but also all kinds of solutions are there, at your disposal. And I don't mean just messageboard solutions on -say- how to port a proprietary driver to GNU/Linux.
I mean real concrete solutions:

You want to counter a specific politician?
Find out and document all his nasty deeds during the last 10 years. A little stalking and a moderate amount of searching skills are probably all you need to send him to jail.
You want to counter and diminish those annoying ubiquitous advertisement panels?
Someone, somewhere has posted experiences, best practices and ideas.
You want to free once for all a nice square of your town from those stinking private cars?
Someone, somewhere has already done it. Find the stuff.

Synergical cosmic power: Ideas, Methods, Techniques, Lines of attack, Tactics, Approaches, Strategies...

A propos mp3 and books: a small caveat: Somebody told me long ago -politely- that we should not download songs or books that are patented, or have not yet been released in the public domain. OK. I can understand that, as strange as it sounds (because, silly me, I always understood copyright as the right to copy).
Anyway we will just have an innocent look at our targets on line. If your local political clowns forbid it, don't download forbidden fruits, please (or if you do it, you should practice some basic anonymity precautions: Torpark, Ubuntu on USB ---> Casper Cow, wardriving and downloading...).

Your browser (should be Opera btw) is of paramount importance for searching purposes, and can do wonders, if correctly trimmed...

(HOST file, Proxomitron...).

And now let's begin...

Recall and Precision


Recall	Really comprehensive retrieval...		...or just a few but very relevant data?	Precision
Recall	Almost everything but also a lot of junk...		...or just a few mostly relevant results?	Precision

The real problem -when searching- is not even anonymity and it is not even speed: it is the relevance, coherence and reliability of our searching results.

In order to obtain such relevance, coherence and reliability you need some sound evaluation techniques, but a more methodological approach to searching can help as well.
Two important concepts you must be aware of, when searching, are those of RECALL and PRECISION ("precision" is the accepted terminology, but "relevance" would probably be a better term).

Recall is the ratio (%) of the number of relevant records retrieved to the total number of relevant records that might exist (hard to calculate how many might really exist, of course, but we are just using indicative parameters).
Hence recall is relevant retrieved documents / relevant documents.

Precision (or "Relevance") is the ratio (%) of the number of relevant records retrieved to the total number of all (irrelevant and relevant) records retrieved (this might be calculated, anyway note that in reality there is not such a clean "objective" distinction between relevant and irrelevant, a lot depends on your taste).
Hence precision is relevant retrieved documents / retrieved documents.
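
The two ratios defined above can be sketched in a few lines of Python. This is only an illustration: the document ids and the sample query result are invented, and in real searching the set of "all relevant documents" is of course never known exactly.

```python
def recall_and_precision(retrieved, relevant):
    """Return (recall, precision) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant records that were actually retrieved
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical data: 10 relevant documents exist somewhere on the web;
# our query brings back 8 documents, 4 of which are relevant.
relevant_docs = set(range(10))
query_results = {0, 1, 2, 3, 20, 21, 22, 23}

recall, precision = recall_and_precision(query_results, relevant_docs)
print("recall = %.0f%%, precision = %.0f%%" % (recall * 100, precision * 100))
```

Both ratios share the same numerator (the relevant records retrieved); only the denominator changes.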

Here are two graphic attempts to explain this: in the void of the irrelevant blue results, there are some relevant red results.
Different kinds of queries, here imagined as a transparent layer (but you can see some real life examples -for mp3 searching- below), will give either recall or precision.

Broad (Recall) Query		Narrow (Precision) Query

Add synonyms to your query (or use our synecdochical searching method) and your precision will suffer (but your recall will increase). Use a proximity operator (like the NEAR operator in altavista) and your precision (may/will) increase (but also, of course, your recall will/might suffer).
There are differences between the main search engines as well: for instance google has a (slightly) higher precision than yahoo, yahoo has a (slightly) higher recall than google.

Also always consider that the main search engines DO NOT overlap too much and yet that they cover together (at best) only 1/4 of the web, this may be quite significant when deciding your search strategies.
Many clueless zombies consider "searching the web" tantamount to typing one term into google and then hitting enter. Such a simplistic approach is wrong, and not only for the "one-termness" of it. The real problem is that google covers only a small part of the web.
In order to access a bigger part of it you will need to use techniques that go from stalking to social engineering, through trolling and passwords breaking.

Back to Recall and Precision: as a seeker, you must decide beforehand the aim of your search: do you want to find everything on a given topic, or do you want just the most relevant texts? Or do you instead want almost everything? Or just some texts?
This will determine what searching strategies you'll have to use. Many search techniques may be used to gain either more recall or more precision (in fact there are even techniques that will allow you to increase BOTH, despite the fact that recall and precision are -usually- inversely related).

This said, as the two graphics above point out, there's absolutely no way you can cover ALL relevant results, due to the fact that even using the best searching techniques (remember that the main search engines cover just about a third of the indexed web) you will never be able to cover the whole web.

Our aim as seekers is to have BOTH precision and recall. We want (ideally) to retrieve everything that is relevant, and everything that has been retrieved should (ideally) be relevant. We'll now use some tricks to (try to) do this.

Let's analyze some simple music queries


Pulling some MP3-webbits out of the web
(a "Webbit" is a "Querystring Rabbit" out of a magician's hat)

A quick list of the most important operators:
Yahoo operators:
site: hostname: link: linkdomain: (links that points to one domain) url: intitle: inurl: (a specific keyword as part of indexed urls, example: inurl:searching)
intitle & inurl are VERY important parameters... nomen est omen:

images giotto5.jpg...

Google's operators:
site: allintitle: (all of the query words in the title) intitle: (that word in the title) allinURL: (all of the query words in the URL) inURL: (that word in the URL) cache: link: related: (pages that are "similar" to a specified web page) info: (google's info)

Altavista's most important operator:
NEAR (more on this later)

MSN Live's operators:
contains: Restricts results to sites that have links to the file type(s) you specify. For example, to search for websites that contain links to mp3 files, type music contains:mp3.
filetype: Returns only web pages created in the file format you specify. Live Search recognizes html, txt, and pdf extensions. Live Search also recognizes the extensions for primary Office document types. For example, to find reports created in PDF format, type your subject, followed by filetype:pdf. For example, type information filetype:pdf.
inanchor:, inbody:, intitle:, inurl: Returns pages that contain the specified term in the anchor, body, title, or web address of the site, respectively. Specify only one term per keyword. You can string multiple keyword entries as needed. For example, to find pages that contain google in the anchor, and the terms black and blue in the body, type inanchor:google inbody:black inbody:blue.
ip: Finds sites that are hosted by a specific IP address. The IP address must be a dotted quad address. Type the IP: keyword, followed by the IP address of the website. For example, type IP:80.83.47.151.
language: Returns web pages for a specific language. Specify the language code directly after the language: keyword.
link: Finds sites that have links to the specified website or domain. This is useful for determining who links to whom. Do not add a space between link: and the web address. For example, to find pages that contain the word games and that link to searchlores.org, type games link:searchlores.org
linkdomain: Finds sites that link to any page within the specified domain. Use this keyword to determine how many links are being made to a specific page, as well as how those links are made. For example, to see pages that link to searchlores, type linkdomain:searchlores.org.
linkfromdomain: Finds sites that are linked from the specified domain. Use this keyword to determine how many links are being made from a specific page, as well as how those links are made. For example, to see pages that are linked from my site, type linkfromdomain:fravia.com
loc:, location: Returns web pages from a specific country or region. Specify the country or region code directly after the loc: keyword. To focus on two or more languages, use a logical OR and group the languages. For example, "core python" (loc:RU OR loc:CN)
prefer: Adds emphasis on either a word or another operator. For example, type searching prefer:internet
site: Returns web pages that belong to the specified site. To focus on two or more domains, use a logical OR and group the domains. Do not add a space after the colon (:). You can use site search for web domains, top level domains, and directories that are not more than two levels deep. For example, to see web pages about media reporting from the BBC or CNN websites, type "media reporting" (site:bbc.co.uk OR site:cnn.com). You can also search for web pages that contain a specific search word on a site. For example, to find the library pages on searchlores, type site:www.searchlores.org/library
feed: Finds RSS or Atom feeds on a website. For example, to find RSS or Atom feeds about web searching, type feed:"web searching"
hasfeed: Finds web pages that contain an RSS or Atom feed on a website. You can add search words to narrow your search. For example, to find web pages on the Guardian website that contain RSS or Atom feeds about google, type site:www.guardian.co.uk hasfeed:google
url: Checks whether the listed domain or web address is in the Live Search index. Do not add a space between url: and the domain or web address. For example, to verify that searchlores is in the index, type url:searchlores.org
Most important MSNLive operator:
linkfromdomain: (an outbound links operator)

We'll now use as an example the intitle: operator.
The structure of the following old -and already "blunt"- mp3s webbit has various interesting characteristics that may be used to exemplify general webbits' structures and purposes.

Click to try	1	2	3	4	5	6	Try s.e. swap!
High precision	beatles	imagine	mp3 OR m4a OR ogg	intitle:"Index of"	-metallica	+"4.2M"	On google
High recall	lavigne		mp3 OR m4a OR ogg	intitle:Index.of	-beatles	+"4.4M"	On Yahoo
	group	title	format variants	index of in title	spamkiller	variable parameter (guarantees length)

The "group" (or singer name) is mandatory.
Simply specifying a "title" adds precision and loses recall (precision and recall are -most of the time- inversely proportional). (This means that if you add your target's title you diminish excessive noise but may miss some target sites).
The "format variants" will guarantee a broader spectrum. If the search engine you are using is heavily censored (as happens more and more often) just eliminate the mp3 parameter. Chances are that some (yet) uncensored m4a or some ogg file will be present inside our "real target" (mp3 censored music lists), and that these "ogg oddballs" will allow their retrieval. When they censor m4as we'll invent something else :-)
The intitle:"index of" (or intitle:index.of, which is the same but avoids two key-presses) is mandatory, and -spammers notwithstanding- still allows fairly decent results. Of course the intitle: operator is to be used with google and yahoo; check the different operators for the other search engines, or just use a simpler (and more spammed) "index of" string snippet.
The -metallica (or -beatles, or whatnots) serves as a spamkiller, because many clowns still try to fish zombies out of the knowledge web by uploading huge lists of groups' names. If you're going for high recall, then re-launch the same query with a different singer acting as spamkiller.
Finally we come to the LENGTH parameter, which not only guarantees the presence of at least some Megabyte heavy mp3, thus cutting away all the irrelevant noise of those bogus "index of" spammer sites with "small snippets" of music, but also can be varied ad libitum and will thus guarantee you hours of fishing pleasure. You may try for instance all variations in the range 1.6M - 6.4M. I suggest starting in the range 3.5-4.5 (which is a good signal ratio for mp3s) and moving upward if you are an optimist and downward if you are a pessimist.
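
The six slots of this webbit can be assembled mechanically, which makes it easy to re-launch the same query with a different length parameter. The following is only a sketch: the function names `mp3_webbit` and `size_variants` are invented for illustration, the default values come from the table above, and the format list uses the m4a spelling.

```python
def mp3_webbit(group, title="", formats=("mp3", "m4a", "ogg"),
               spamkiller="metallica", size="4.2M"):
    """Assemble the webbit: group, title, formats, index-of, spamkiller, size."""
    parts = [group]                        # 1: group or singer (mandatory)
    if title:
        parts.append(title)                # 2: title (adds precision, loses recall)
    parts.append(" OR ".join(formats))     # 3: format variants
    parts.append('intitle:"Index of"')     # 4: open directory listings only
    parts.append("-" + spamkiller)         # 5: spamkiller
    parts.append('+"%s"' % size)           # 6: variable length parameter
    return " ".join(parts)


def size_variants(low_tenths=35, high_tenths=45):
    """Length strings from 3.5M to 4.5M (by default) in steps of 0.1M."""
    return ["%d.%dM" % divmod(t, 10) for t in range(low_tenths, high_tenths + 1)]


# One query per length in the suggested 3.5-4.5 range.
for size in size_variants():
    print(mp3_webbit("beatles", "imagine", size=size))
```

Widening the range to `size_variants(16, 64)` covers the whole 1.6M - 6.4M span mentioned above.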

A more recent webbit:

-inurl:htm -inurl:html -inurl:jsp -inurl:php -inurl:pdf -inurl:asp -inurl:txt -inurl:shtml -inurl:phtml -inurl:cgi -intitle:free -intitle:download -intitle:archive +intitle:index+of/ +parent-directory +name +"last modified" +size +description (oasis OR shakira) (mp3 OR wma OR m4a) -download

Note the useful spamkiller -download and the fact that we search only for (mp3 OR wma OR m4a). You may add ogg files as well. In fact there's no point in searching those infamous Ipod's m4p. However, should you happen to download some of those Ipod's "fairplay infected" AAC files, you may use a program like JHymn to play them wherever and whenever despite their ridiculous patents, and/or convert them on the fly to a more useful (and unprotected) m4a format.

Here a very simple, extremely short and yet quite useful query. You can launch this (or a very similar one) with any good search engine... even with google...
shakira "4.6m * snd"
"Grown up" humans will probably substitute "shakira" with -say- "bach" :-)
Note -again- the added "variable" parameter "4.6m", which is quite important in order to reduce noise and spam (again: you can modulate as much as you fancy: 4.5, 4.6, 6.2 etc.)

Today's targets (python & proximity galore)


How to get inside libraries at night

Ah, Python! Maybe the most powerful programming language for the web.
The following books about python will today serve as "example queries".

Note that finding books on the web is extremely easy.

In fact there's even a direct relation between the "celebrity" of a target and how easy it will be to find it on the web.
For example: if you want hic et nunc "The Lord of the Rings", you just search for a passage of it: At last the voice of Faramir ordered them to be uncovered. See? Even the heavily censored MSNSearch gives us some results. This is true for all books: "But few of any sort and none of name" and you immediately have Shakespeare's "Much ado about nothing" (and, I may add, you discover en passant a lot of book repositories as well).

Hence finding books, patented or not, is extremely easy. Yet today's "examples" will give us -I hope- an opportunity to investigate some alternative searching methods... some different ways of cutting the noise and getting the signal without just using google (or any one of the many main search engines).

The following three books will be our 'virtual targets':

Core Python Programming by Wesley J. Chun
First Edition, Prentice Hall, December 14, 2000 ISBN: 0-13-026036-3, 816 pages.
Note that the new 1120 pages - September 2006 - second edition (Prentice Hall, ISBN: 0132269937) is too new to have already "percolated" the web, it's already out there all the same, of course, and you'll have to find it yourself (see the assignment section), but keep in mind that for a book it takes usually a couple of months after its publication to "sink in" the web.
A good introduction to python: explains the language, and does put it in a wider context. (Incidentally: reversers should note the usual tricks with the font sizes: here so big that the book could be condensed to half its size -or less- if the type was reduced to a normal level :-)
Python Cookbook by Alex Martelli and David Ascher (O'Reilly, ISBN: 0596001673), 2002
Advanced: a collection of problems, solutions, and practical examples for Python programmers, written by Python programmers. In principle this book is for those that already know Python. In practice it's quite useful for anyone.
Python Essential Reference by David Beazley, (Sams, ISBN: 0672328623), 2006
Syntax, functions, operators, classes, and libraries. This is first and foremost a reference, yet quite useful if you have some experience with other programming languages.

Now, of course we all know that 'printed' books on such matters are often both obsolete as soon as printed and quite superficial vis-a-vis what you can find for free on the Web, because we also know that the best knowledge (and material) is to be found instead in some 'gray areas' of the web, or visiting ad hoc messageboards or perusing Usenet emails.
Moreover there is such a wealth of free useful texts on the web that this fact alone (not the fact that you can easily find all these patented books for free) makes one wonder whether nowadays it makes any sense at all to "buy" a book.

Again a caveat: the following queries are just a proof of concept, showing how we could search for books... the specific "quarries" (in this case our three python books) don't matter that much: you'll be able to adapt the following approaches to OTHER, different, targets of yours... replace for instance "python" with -say- "assembly" or "digital photography" and you'll obtain a different complete library instead.
The approaches and the techniques we are examining together are important, the targets themselves are irrelevant.
Proximity galore
Another interesting side effect of a correct web-seeking approach is that often, when searching for something, you will find on the same servers many other targets related to your topic, targets that you did not even know existed. While this is true for many different targets and not only for books, we call this the "being inside the library" effect. Imagine you are not filling out a request form at the counter, imagine you are physically retrieving a book inside a library, with shelves and shelves of books around you and within your immediate reach. Thus you can scan with your eyes all other books in the proximity: books, more or less related to the topic you are searching for, that are, imagine, located on the same shelf, next to your target book and that you can also pick up at leisure. Gee... the amazing power of knowing how to search!

Enough theory! A good webbit for our python books? Here a "regional" one (say: China): python site:.cn "index+of"

Now let's leave theory and enter practice...

Finding the three targets (the simple way)


"A posse ad esse" :-)

Yep: searching for books on the web is -most of the time- extremely simple, especially if you have the exact title, the name of the author, the ISBN number and/or some snippets of text.

So let's now find the three targets listed above.

Core Python Programming by Wesley J. Chun

Please note that -in general- you find most books just inputting a title "tel quel" in any search engine: for instance a simple query on the good (and still commercially not infected) compound search engine seekz "core_python_programming" (note the underscores between the words) will almost immediately give us plenty of locations with the complete version of Chun's nice book.
On this same repository we register the "inside the library" phenomenon we discussed before: 19 further books on python, some of them crap, some of them good.

Alas! This "Core" target is the old "2000" edition. The new 1120 pages - September 2006 - second edition (Prentice Hall, ISBN: 0132269937) is more difficult to find, because it is too new to have already percolated the web: that takes (at most) a couple of months. You'll easily find it anyway around December... but you may want to find it right now in order to solve today's assignment.

Python Cookbook by Alex Martelli and David Ascher
Let's try good ole Altavista with "Python Cookbook" NEAR rapidshare and choose results only in Russian, Persian, Chinese, Bulgarian, Arabic and other interesting non-copyright obsessed languages :-)

We'll reap good results aplenty, and the first one (an Iranian messageboard) looks promising: in fact it gives us another Iranian URL, where we could download our target BOTH in chm and pdf format. As an added bonus, due -again- to that powerful "inside the library" effect, we could fetch the following books:
In fact -its title notwithstanding- this last book is not *that* bad either.
(Caveat: again, don't download pirated stuff: as a seeker you'll always be able to find on the fly these books somewhere).
Please note that even if we DID apparently search for a book inside an online porn-depot à la rapidshare (more on this later), we found our target -in fact- inside an OPEN directory, and not on rapidshare itself... simply because the word 'rapidshare' was somewhere else on that first target page. Also NOTE that this specific URL was not -until now- indexed by the main search engines, yet we found it nevertheless combing the web.

Python Essential Reference by David Beazley
We start with a simple yahoo query: "Python Essential Reference" rapidshare and we almost immediately get (inter many alia) a promising Russian URL, http://doci.nnm.ru/it_ebooks_digest, that gives us on a silver plate our third target (and many others as well, btw).
Bingo. Done. Finished. We could stop here.
As many among you will know, rapidshare is (an important) one among HUNDREDS of file repository services which make money hosting porn crap and patented material anyone can upload (they will 'remove it due to complaints' if needed), fooling idiots into buying "special accounts" in order to download this pirated and patented stuff/porn at leisure.
However, the logic is that in order to fool the zombies they HAVE TO be known and HAVE TO lure people into using them.
Therefore they HAVE TO provide an (often penalized and slower) free download opportunity for everyone and his leeching dog. Apart from the many simple possibilities to bypass these simple penalizing scripts (mostly written in php), the fact that there's now a huge industry thriving on the selling of patented software/porn/film/music material is very bad news for the poor & suffering patent holders. Not that anyone would give a dead rat, eh :-)

Potpourri searches


All together now!

We can also search all our titles together with our nice "potpourri" approach.
At times simply guessing that interesting places MUST have all your targets on the same page can be useful... in order to find these interesting places and also your targets :-)
Here a "potpourri" example: on yahoo "Core Python Programming" "Python Cookbook" "Python Essential Reference" (note the &vst=.org&vs=.org&n=100 snippet in the search string)
and on google: "Core Python Programming" "Python Cookbook" "Python Essential Reference" (note the &as_sitesearch=.org&num=100 snippet in the search string)
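
A potpourri querystring is just all the exact titles, each one quoted, followed by the restricting parameters. Here is a minimal sketch for the google variant: the as_sitesearch and num parameters are the ones quoted above, while the name of the query parameter (q) and the helper name `potpourri_query` are assumptions made for this illustration.

```python
from urllib.parse import urlencode


def potpourri_query(titles):
    """Quote every title so that each one must appear as an exact phrase."""
    return " ".join('"%s"' % title for title in titles)


titles = ["Core Python Programming", "Python Cookbook", "Python Essential Reference"]
query = potpourri_query(titles)

# "q" is assumed to be the query parameter; the other two come from the text.
params = urlencode({"q": query, "as_sitesearch": ".org", "num": 100})
print(query)
print(params)
```

Swapping ".org" for ".edu" gives the other "non commercial" restriction with no further changes.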

When you search the web, the biggest problem is noise. Your target, your signal, will be often half-drowned underneath it.
And today's web has a lot of commercial noise.
If you search for an image for instance, say a picture of a famous painter, you will immediately find a gazillion spammers who want to sell you those very images you could easily find for free.
So you'll find thousands of low resolution images, or images defaced with an ugly watermark, put on line by "snake oil" sellers. So "cutting the noise" is crucial.
This holds true for everything, and for books as well. Hence we must "clean up" our queries a little, and, as the title of this conference: "searching underneath the commercial web" implies, a simple approach is for instance to limit our searches to "edu" and "org" sites.

The first query above will give us this link that brings us in this subdirectory, and the second query will give us this "blocked link", yet through google's cached copy we will still be able to find this nice Russian site.
(Note in this example the importance of cached copies. In fact most search engines offer them: Ask, MSNSearch, Yahoo, Google, Alexa, Baidu, Gigablast...).

So, that's it: as you have seen, any and every book is at seekers' disposal.

(Caveat: Download your targets only if you are positively sure they have been released into the public domain. Real seekers do not need to waste harddisk space downloading doubtful stuff. This could even be construed as 'illegal' by the patent holders and their political lackeys. Downloading is not necessary! Seekers will always find again and again on the fly -and consult on line- whatever they fancy).

Ok, ok. It was too easy. Much too easy. In order to continue, let's imagine for a moment that it would NOT have been so incredibly easy to find these books through such simple searches. Imagine we didn't find them. Imagine we will not find them again on those URLs.
After all, the web is a quicksand, and the specific locations where we found our targets after today's talk probably will disappear.

Look Ma: no google!
Go for the format, go for the name, do it like the lamers or search elsewhere


Sprinkles of cosmic searching power

Let's find again the same three targets WITHOUT using the same simple querystrings

We could go for the format.
For obvious reasons, given that our quarry are books, the most promising formats will be .chm, .pdf, .rar, .zip or .ace.
Of course we could take advantage of this with a simple webbit (on google this time): -inurl:htm -inurl:html intitle:"index of" +("/ebooks"|"/book") +(chm|pdf|zip) +"python".
Here a similar query for yahoo: +("/ebooks"|"/book") +"python" intitle:"index of" (note the &vf=pdf inside this querystring).
Let's examine these two querystrings, step by step...
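
One way to examine the google querystring step by step is to rebuild it from its pieces. This is just a sketch: the function name `format_webbit` and its parameters are invented here, and the quotes are written as plain ASCII double quotes.

```python
def format_webbit(topic, folders=("/ebooks", "/book"),
                  formats=("chm", "pdf", "zip"),
                  excluded_url_words=("htm", "html")):
    """Rebuild the "go for the format" webbit from its components."""
    # Step 1: throw away ordinary web pages (urls containing htm or html).
    parts = ["-inurl:%s" % word for word in excluded_url_words]
    # Step 2: keep only open directory listings.
    parts.append('intitle:"index of"')
    # Step 3: the listing should sit in a folder that smells of books.
    parts.append("+(" + "|".join('"%s"' % folder for folder in folders) + ")")
    # Step 4: one of the promising formats must be present.
    parts.append("+(" + "|".join(formats) + ")")
    # Step 5: finally the topic itself.
    parts.append('+"%s"' % topic)
    return " ".join(parts)


print(format_webbit("python"))
```

Changing the topic to -say- "assembly", or adding rar and ace to the formats, gives a different webbit without touching the structure.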

We could go for the name.
The very simple yahoo/rapidshare query: "Core Python Programming" rapidshare will give us not only some locations, but also much more valuable names:
- OReilly - Core Python Programming [miex.org].pdf
  (hxxp://www.miex.org/book.html)
- Core_Python_Programming[Wesley J. Chun](Prentice Hall PTR).pdf
  (hxxp://noc.unila.ac.id/~gigih/ebooks/)
- Core Python Programming.pdf
  (hxxp://e-books.amagrammer.net/Python/)
Now, even if these specific URLs disappear (and they will) you already have the names, i.e. the keys to find these files again and again whenever you want. Note also how some guessing can be useful when searching a target on the web.
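
The "guessing" can be partly automated: the file names seen so far differ mostly in the separator between the words (spaces, underscores, dots) and in the extension. A small sketch, with an invented helper name `filename_guesses`, that turns a title into quoted file name guesses to feed to any search engine:

```python
def filename_guesses(title, extensions=("pdf", "chm", "rar"),
                     separators=(" ", "_", ".")):
    """Return quoted file name guesses for a title, one per separator and extension."""
    words = title.split()
    guesses = []
    for separator in separators:
        stem = separator.join(words)
        for extension in extensions:
            guesses.append('"%s.%s"' % (stem, extension))
    return guesses


for guess in filename_guesses("Core Python Programming"):
    print(guess)
```

Real names often carry extra decorations (publisher, author, site tags), so these guesses are a starting point and not an exhaustive list.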

We could do it like the lamers.
The lamers' slow way is simple: using P2P and torrents
Using P2P specific search engines is a somewhat slow, but mostly very simple way to find your targets.
Let's exempli gratia give filedonkey a chance: let's just input core python and see what we fish. See?
Another "lame but valid" possibility is of course to check what the torrents can deliver when searching -say- python.
A lot of monty python as you can see, but also all our targets as well. In order to reduce the monty noise you can make the same search and limit the max size of the file to, say, 150 mega (note the &maxf=150 snippet in the searchstring).
Anyhow real seekers deprecate P2P and Torrents "searches": they are considered too slow and vulgar. It is always much quicker and more elegant to fetch your targets from locations nobody is visiting, than to slowly download them from overwhelmed servers together with a zillion other zombies doing the same thing.

Let's search elsewhere
You can search ftp, you can go local, or even better: regional. You can zap IRC channels and explore uncommon search engines


Aut inveniam viam aut faciam

FTP
Using ftp specific search engines is in many cases a useful way to quickly find your targets. Let's give our Lithuanian friends a chance: let's just input python and choose as format pdf.
Here we have only three results: A.D.S.Lessa - Python Developer's Handbook.pdf, Apress.Beginning.Python.From.Novice.to.Professional.Sep.2005.pdf and John.Wiley.and.Sons.Making.Use.of.Python.pdf.
BUT, if we repeat the same search for the rar format:
ftp://anonymous@202.96.64.144/pub/books/%28ebook%20-%20HTML%20-%20Python%29%20O%27Reilly%20-%20Core%20Python%20%5Bfound%20via%20ww.rar
(and of course all the other books as well).
So, when searching, choosing the "right" format is quite important.
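
A percent-encoded ftp location like the one above is easier to read (and to reuse as a name) once decoded. A minimal sketch using only the Python standard library; the variable names are arbitrary:

```python
import posixpath
from urllib.parse import unquote, urlsplit

url = ("ftp://anonymous@202.96.64.144/pub/books/"
       "%28ebook%20-%20HTML%20-%20Python%29%20O%27Reilly%20-%20Core%20Python"
       "%20%5Bfound%20via%20ww.rar")

parts = urlsplit(url)
# Split the path first and decode afterwards, so an encoded "/" cannot confuse us.
filename = unquote(posixpath.basename(parts.path))
extension = posixpath.splitext(filename)[1]
# The parent directory is where the "inside the library" effect awaits.
shelf = "%s://%s%s/" % (parts.scheme, parts.netloc, posixpath.dirname(parts.path))

print(filename)
print(extension)
print(shelf)
```

The decoded name is the key to finding the same file elsewhere, and the parent directory is the shelf with all the other books.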

Going local (Homepages, Usenet, topic-related messageboards, blogs, forums & webrings)
At times it can be useful to "throw a narrow net" inside the searchscape, going "specific".
Searching webrings gives no results, but on usenet you have comp.lang.python (a high-volume Usenet open newsgroup for general discussions and questions about Python), comp.lang.python.announce (a low-volume moderated forum for Python-related announcements, useful to keep up to date) and quite a lot of local or regional python newsgroups.

We could also search private homepages like geocities or Tripod for "python".

A further very good idea is to spot relevant and authoritative topic-related messageboards, conference proceedings and/or blogs. There you may even find topic-related fora, like for instance python-forum.org and programmers' weblogs.
It is also worth repeating ad nauseam that on the web much knowledge can be found in "grey areas" that are completely outside the "academic circuits". Indeed, if you are interested in python, you'll soon realise that the books we have chosen as targets today are already quite obsolete, and almost irrelevant, if compared to the free huge knowledge you can gather by visiting the messageboards, usenet groups and fora pointed out above.

Going regional

Going "regional" is ALWAYS a very good idea when searching. We have already seen how adding a simple .ru to our queries can help. But why Russia? WHERE should we search? Which are the, how should I say? the "less copyright-obsessed" countries? Here you can see an interesting "piracy subdivision" published this summer by The Economist.

We may as well use these 'scarecrow' data (produced by US-lobbyist Robert Holleyman's "Business Software Alliance" in order to scrape up some money) for our own purposes...
And look! As you can see, Vietnam, Zimbabwe, Indonesia, China, Pakistan, Kazakhstan, Ukraine, Cameroon, Russia, Bolivia, Paraguay and Algeria seem to have a more relaxed attitude towards patent holders. Good to know :-)
Here the relevant country codes: .vn, .zw, .id, .cn, .pk, .kz, .ua, .cm, .ru, .bo, .py and .dz, codes, that we could use to restrict searches only and/or especially to such relaxed places.
Of course some of these countries are just tiny local niches, with next to no activity and extremely weak signals, and can be ignored: throwing our clever queries in -say- Zimbabwe or Cameroon we'll probably just wasting our (or our bots) precious searching time..
Let's say that -in general- .vn(Vietnam), .id (Indonesia), .cn (China), .pk (Pakistan), .ua (Ukraine) and .ru (Russia) look promising enough. We may add -out of our experience- Iran, Korea, Bulgaria and India (.ir, .kr, .bg and .in).
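For those who prefer to let a script do the typing, here is a minimal sketch of how the promising country codes could be turned into a batch of regionally restricted queries. The function name and the sample search string are hypothetical, and it assumes that the engine you feed the queries to understands a site: operator that accepts a bare country code.

```python
# Country codes the talk singles out as promising enough.
PROMISING_TLDS = ("vn", "id", "cn", "pk", "ua", "ru", "ir", "kr", "bg", "in")


def regional_queries(terms, tlds=PROMISING_TLDS):
    """Return one query per country code, each narrowed with site:<tld>.

    Assumption: the target engine supports a site: operator that takes
    a bare top-level domain. Check the engine's own syntax first.
    """
    return [f"{terms} site:{tld}" for tld in tlds]


if __name__ == "__main__":
    # The search string is only an example; use your own target.
    for query in regional_queries('"core python programming"'):
        print(query)
```

Each resulting string can then be pasted (or fed by a bot) into whichever engine you are using for that region.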
So let's go local: let's visit China, where we can find, among hundreds of others, for instance this link, which requires just some guessing capacity (or some understanding of Chinese :-)
Of course we should also have a look in Vietnam, in Russia/Ukraine (where we will at once retrieve our Target and as many other programming books as you fancy), and here is how you would search in KOREA or in RUSSIA using MSN Search.

Caveat: this was all just academically speaking, duh. Once again: seekers don't need to download anything from the web, since they can always find their targets again and again if and when needed :-P

IRC channels and blogs
Searching through IRC channels and blogs can be -for specific targets- quite useful. However the signal-to-noise ratio is quite poor on these channels, and therefore IRC-searching and blog-searching are -in many cases- a waste of time if compared to more effective searching techniques.
After all, and behind the hype, blogs are just messageboards where only the Author can start a thread, and IRC channels need, in order to be useful, a lot of social engineering.
I'll just direct you to some blog search engines and to some IRC search engines like this one. Nuff said.

Various "uncommon" search engines
At times simply switching to less known (but quite interesting) search engines can cut the mustard.
Here's a related search with kartoo
and here's another search using gigablast.
Finally, since we are speaking of a programming language, we may also have a look at the recent google codesearch:
return lang:python gives 283000 scripts, enough for some serious studying. Same with MSNsearch macros.

So we found our targets again and again using a palette of different searching colours. These are all paths that lead into the forest, and you'll be able to find many more on your own. Now let's go back to the theory.

Search engines' spamming (& "popularity")

Google alone and you're never done

Google's results, after many years of legendary quality, are now being spammed more and more.
Using a plethora of methods (cloaking, doorway pages, hidden text, blog-farms, you name it) the SEO beasts ("search engine optimizers" they have the cheek to call themselves) deny everybody the possibility of gathering real knowledge, by pushing their crap commercial sites up into the first positions of the SERPs.

Of course you can apply some countermeasures: users -rather than spammers- should be able to influence the ranking of search results, and some search engines (MSNsearch had beautiful sliders until a few months ago -now MSNlive has macros though- and Yahoo still has philtron) give users the opportunity to influence -at least rudimentarily- the ranking algorithms.
A notorious case is the infamous 'popularity' ranking criterion. This you should by all means avoid or slide to insignificance, since it is eo ipso tantamount to crappiness.
Contrary to what search engines' algo designers still seem to believe, TURNING POPULARITY DOWN, if given half a chance, is very important, since for humans with brains what -say- some idiots in Idaho are massively looking for has no relevance whatsoever. "Sites popular for zombies" are exactly those sites that you will never need. Remember the beautiful old trick of adding a site:edu or site:org specification to all your searchstrings per default: all the "com" sites will disappear: good riddance.
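The "site:edu per default" trick is easy to automate. What follows is only a sketch: the helper name is hypothetical, and it assumes an engine that honours the site: operator. It appends the restriction to any searchstring that does not already carry a site: filter of its own.

```python
def restrict_to(query, domain="edu"):
    """Append site:<domain> unless the query already has a site: filter.

    Assumption: the engine honours a site: operator with a bare
    top-level domain such as edu or org.
    """
    if "site:" in query.lower():
        # Respect a restriction the seeker has typed in by hand.
        return query
    return f"{query} site:{domain}"


if __name__ == "__main__":
    print(restrict_to("python tutorial"))         # default: edu
    print(restrict_to("python tutorial", "org"))  # explicit domain
    print(restrict_to("python tutorial site:ru")) # left untouched
```

Leaving hand-typed site: filters alone is a deliberate choice here: a default should never override a regional restriction you chose on purpose.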

For search engines that do not allow any algo fine-tuning, a possible defensive approach is the "yo-yo" approach: jumping from the start onto lower SERPs and then going slowly back up.
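The "yo-yo" order of visiting result pages can be sketched in a few lines. Everything here is illustrative: the function name is hypothetical, and the figure of 10 results per page (used to compute a result offset) is an assumption that varies from engine to engine.

```python
def yo_yo_pages(deepest_page, per_page=10):
    """List (page, offset) pairs, starting deep and climbing back to page 1.

    deepest_page: the lower SERP you jump to first.
    per_page:     results per page (assumed; engines differ).
    """
    if deepest_page < 1:
        raise ValueError("deepest_page must be at least 1")
    # Offset is the zero-based index of the first result on that page.
    return [(page, (page - 1) * per_page)
            for page in range(deepest_page, 0, -1)]


if __name__ == "__main__":
    for page, offset in yo_yo_pages(5):
        print(f"page {page}: results starting at offset {offset}")
```

How an engine expects that offset to be passed (if at all) is specific to each engine, so the pairs are left as plain numbers.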

Such methods can soon prove even more crucial for Internet searching purposes: while google may not be yet a sinking boat, anyone can see how much water is already leaking through its many spammed holes.

So we have to refine our seeking techniques.
Instead of just using google again and again, every time we begin a search, we should carefully consider how and where we start our searches, delve a little more inside our own specific requirements, and avoid wasting too much time on irrelevant side paths...

Conclusions

Quaeras ut possis, quando non quis ut velis

There are various strategies you can use when searching the web. Some are more relevant for LONG TERM searching, some, on the contrary, for SHORT TERM searching.
But even the various simple techniques we used today (searching for mp3s and books) can and should be used together with the main search engines. On the ever moving web-quicksands it does not make much sense to give a list of links to places where you can "search alternatives". It is better, I believe, to (try to) show directly how to "search alternatively".

Of course there are various important non-commercial databases, like Infomine (http://infomine.ucr.edu), Librarians' Internet Index (http://lii.org), The Internet Public Library (http://www.ipl.org/), Resource Discovery Network (http://rdn.ac.uk), Academic Info (http://www.academicinfo.net/), The Front (for journals: http://www.arxiv.org/multi?group=math&%2Ffind=Search) and finally the best one of all: The Open Directory Project (http://dmoz.org).
These are all possible alternatives to a single approach limited to the main search engines.

Yet lists of links are and remain just that: lists of links. Bound to decay into obsolescence.
We have seen some alternative approaches. Practice them on your own subjects and interests. Once you learn how to seek, the world is yours. Cosmic power for free.

Assignment

Nil perpetuum, pauca diuturna sunt

An easy assignment for this evening (just in order to practice the various techniques explained today, lest you forget everything): find the new 1120-page second edition (September 2006) of our target, Core Python Programming by Wesley J. Chun (Prentice Hall, ISBN: 0132269937).
This search should take you at most 10 minutes if done now (and just a few seconds in a couple of months, once the book has percolated through the Web).

And now I'm finished.
Thank-you for your patience. Any questions?

FORMS

SEARCHING THE PAST (DISAPPEARED SITES)

http://webdev.archive.org/ ~ The 'Wayback' machine at Alexa: explore the Net as it was!

Visit The 'Wayback' machine at Alexa, or try your luck with the form below.

Alternatively, learn how to navigate through [Google's cache]!

Alternatively, a new "preservation" project is coming along from Webcapture: the International Internet Preservation Consortium.
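If you would rather build a Wayback query by hand than rely on a form, a small sketch follows. Note the assumptions: the address pattern used below (https://web.archive.org/web/<timestamp>/<url>) is the one served by the present-day Wayback Machine and is not taken from this talk, and the function name is hypothetical.

```python
def wayback_url(page_url, timestamp="2006"):
    """Build a Wayback Machine address for page_url near the given time.

    timestamp: 4 to 14 digits, read as YYYY[MM[DD[hh[mm[ss]]]]].
    Assumption: the web.archive.org/web/<timestamp>/<url> pattern,
    which is not stated in the talk itself.
    """
    if not (timestamp.isdigit() and 4 <= len(timestamp) <= 14):
        raise ValueError("timestamp must be 4 to 14 digits (YYYY...)")
    return f"https://web.archive.org/web/{timestamp}/{page_url}"


if __name__ == "__main__":
    # Look for a disappeared page as it stood around October 2006.
    print(wayback_url("http://www.searchlores.org/", "200610"))
```

A shorter timestamp simply asks for a coarser point in time, which is usually what you want when you do not know exactly when a site disappeared.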

All the various main search engines.

A quick tour of the main search engines...

Quick forms

√	T	Always 100 rez, safe off
√	T	No more sliders, but macros
√	T	Da biggest index?
√	T	You can personalize it
√	T	ex-teoma
√	T	Quite good
√	T	You can set your preferences
√	T	With its wondrous NEAR operator!
√	T	fastsearching for:
√		Da übercompound
√		Exalead!
√		inktomi type
√		the ALPHA & OMEGA of searching
√ = advanced; T = tools; visit the main search engines page. Learning the advanced specs and operators of a few major search engines (not only google's), knowing their common commands and their specific features, will improve your chances to fetch relevant information when seeking the web.

Almost forgot...

Uhhh.. almost forgot, a small book-searching present for those who solved the assignment (all others shouldn't look):
finding Ubuntu books
