Version 0.33, Updated 21/OCT/2006
 
 
How to find *anything*
underneath the commercial web


"Powersearching without google"



Fravia's talk at the Hack.lu (Luxembourg, Luxembourg - 19-21 October 2006)
This file dwells @ http://www.searchlores.org/Luxembourg_2006.htm


Introduction
Recall and Precision
Let's analyze some simple music queries
Today's targets (python & proximity galore)
Finding the three targets
Potpourri searches
Look ma, no google!
Let's search elsewhere
Spamming (& "popularity")
Conclusions
Assignment
Forms


Introduction

Well, I didn't know *anything* was already in the public domain!

Please excuse my English, and please excuse my excessive lack of political correctness. That -and my disdain for all commercial aspects- is probably also one of the reasons I'm the only one here, together with my good old friend: The Grugq, who prefers to use a pseudonym :-)

Thank-you for inviting me. I kinda like Luxembourg: klein aber fein.
I also like reversers, and crackers and hackers. Always did. Sons of the light: deserve some cosmic searching power.

We'll examine some 'alternative' searching techniques: On today's menu: music and books à la carte.
As a proof of concept we will search some mp3s and some books (today about the python programming language).

Searching such simple targets (music and books) we will, maybe, demonstrate
  1. that everything is on the web;
  2. that you can search rather effectively without using google;
  3. that some "alternative searching paths" can be quite useful for seekers.
Of course our searches are not limited to just music and books.
Seekers can always find anything: whatever. Any image, any journal, any film, any software you fancy is somewhere on the nethervoid at your disposal.
In fact anything that may have been digitized is indeed somewhere out there (one of the 'optimistic' laws of searching).

Even more important is the fact that not only "tangible" and "digitized" targets are available to anyone, but also all kinds of solutions are there, at your disposal. And I don't mean just messageboard solutions on -say- how to port a proprietary driver to GNU/Linux.
I mean real concrete solutions:
  • You want to counter a specific politician?
    Find out and document all his nasty deeds during the last 10 years. A little stalking and a moderate amount of searching skills are probably all you need to send him to jail.
  • You want to counter and diminish those annoying ubiquitous advertisement panels?
    Someone, somewhere has posted experiences, best practices and ideas.
  • You want to free once and for all a nice square of your town from those stinking private cars?
    Someone, somewhere has already done it. Find the stuff.
Synergical cosmic power: Ideas, Methods, Techniques, Lines of attack, Tactics, Approaches, Strategies...

A propos mp3 and books: a small caveat: Somebody told me long ago -politely- that we should not download songs or books that are patented, or have not yet been released in the public domain. OK. I can understand that, as strange as it sounds (because, silly me, I always understood copyright as the right to copy).
Anyway we will just have an innocent look at our targets on line. If your local political clowns forbid it, don't download forbidden fruits, please (or, if you do, you should practice some basic anonymity precautions: Torpark, Ubuntu on USB ---> Casper Cow, wardriving and downloading...).

Your browser (should be Opera btw) is of paramount importance for searching purposes, and can do wonders, if correctly trimmed (HOSTS file, Proxomitron...).

And now let's begin...


Recall and Precision


Recall
  Really comprehensive retrieval...
  ...or just a few but very relevant data?

Precision
  Almost everything but also a lot of junk...
  ...or just a few mostly relevant results?


The real problem -when searching- is not even anonymity and it is not even speed: it is the relevance, coherence and reliability of our searching results.

In order to obtain such relevance, coherence and reliability you need some sound evaluation techniques, but a more methodological approach to searching can help as well.
Two important concepts you must be aware of, when searching, are those of RECALL and PRECISION ("precision" is the accepted terminology, but "relevance" would probably be a better term).

Recall is the ratio (%) of the number of relevant records retrieved to the total number of relevant records that might exist (hard to calculate how many might really exist, of course, but we are just using indicative parameters).
Hence recall is relevant retrieved documents / relevant documents.

Precision (or "Relevance") is the ratio (%) of the number of relevant records retrieved to the total number of all (irrelevant and relevant) records retrieved (this might be calculated, anyway note that in reality there is not such a clean "objective" distinction between relevant and irrelevant, a lot depends on your taste).
Hence precision is relevant retrieved documents / retrieved documents.
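
The two ratios above can be sketched in a few lines of Python. This is only an illustration: the document names and the sizes of the "broad" and "narrow" result sets are invented, and in real life you never know the full set of relevant documents.

```python
def recall_and_precision(retrieved, relevant):
    """Return (recall, precision) for a set of retrieved documents."""
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant retrieved documents
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical universe: 10 relevant documents exist somewhere on the web.
relevant = {"doc%d" % i for i in range(1, 11)}

# A broad query: finds 8 of the 10, buried under 32 junk results.
broad = {"doc%d" % i for i in range(1, 9)} | {"junk%d" % i for i in range(1, 33)}

# A narrow query: only 4 results, 3 of them relevant.
narrow = {"doc1", "doc2", "doc3", "junk1"}

for name, results in (("broad", broad), ("narrow", narrow)):
    r, p = recall_and_precision(results, relevant)
    print("%-6s recall=%.2f precision=%.2f" % (name, r, p))
```

With these made-up numbers the broad query has the better recall and the narrow query the better precision, which is the usual trade-off.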

Here are two graphic attempts to explain this: in the void of the irrelevant blue results, there are some relevant red results.
Different kinds of queries, here imagined as a transparent layer (but you can see some real life examples -for mp3 searching- below), will give either recall or precision.

(Figure: Broad (Recall) Query)
(Figure: Narrow (Precision) Query)
     

Add synonyms to your query (or use our synecdochical searching method) and your precision will suffer (but your recall will increase). Use a proximity operator (like the NEAR operator in altavista) and your precision (may/will) increase (but also, of course, your recall will/might suffer).
There are differences between the main search engines as well: for instance google has a (slightly) higher precision than yahoo, yahoo has a (slightly) higher recall than google.


Also always consider that the main search engines DO NOT overlap much, and yet that together they cover (at best) only 1/4 of the web: this may be quite significant when deciding your search strategies.
Many clueless zombies consider "searching the web" tantamount to typing one term into google and then pressing enter. Such a simplistic approach is wrong, and not only for the "one-termness" of it. The real problem is that google covers only a small part of the web.
In order to access a bigger part of it you will need to use techniques that go from stalking to social engineering, through trolling and password breaking.
     
(Figure: coverage)



Back to Recall and Precision: as a seeker, you must decide beforehand the aim of your search: do you want to find everything on a given topic, or do you want just the most relevant texts? Or do you instead want almost everything? Or just some texts?
This will determine what searching strategies you'll have to use. Many search techniques may be used to gain either more recall or more precision (in fact there are even techniques that will allow you to increase BOTH, despite the fact that recall and precision are -usually- inversely related).

This said, as the two graphics above point out, there's absolutely no way you can cover ALL relevant results, due to the fact that even using the best searching techniques (remember that the main search engines cover just about a third of the indexed web) you will never be able to cover the whole web.

Our aim as seekers is to have BOTH precision and recall. We want (ideally) to retrieve everything that is relevant, and everything that has been retrieved should (ideally) be relevant. We'll now use some tricks to (try to) do this.


Let's analyze some simple music queries


Pulling some MP3-webbits out of the web
(a "Webbit" is a "Querystring Rabbit" out of a magician's hat)

A quick list of the most important operators:
Yahoo operators:
site:, hostname:, link:, linkdomain: (links that point to one domain), url:, intitle:, inurl: (a specific keyword as part of indexed urls, example: inurl:searching)
intitle & inurl are VERY important parameters... nomen est omen: images giotto5.jpg...

Google's operators:
site:, allintitle: (all of the query words in the title), intitle: (that word in the title), allinurl: (all of the query words in the URL), inurl: (that word in the URL), cache:, link:, related: (pages that are "similar" to a specified web page), info: (google's info)

Altavista's most important operator:
NEAR (more on this later)

MSN Live's operators:
  • contains: Restricts results to sites that have links to the file type(s) you specify. For example, to search for websites that contain links to mp3 files, type music contains:mp3.
  • filetype: Returns only web pages created in the file format you specify. Live Search recognizes html, txt, and pdf extensions. Live Search also recognizes the extensions for primary Office document types. For example, to find reports created in PDF format, type your subject, followed by filetype:pdf. For example, type information filetype:pdf.
  • inanchor:, inbody:, intitle:, inurl: Return pages that contain the specified term in the anchor, body, title, or web address of the site, respectively. Specify only one term per keyword. You can string multiple keyword entries as needed. For example, to find pages that contain google in the anchor, and the terms black and blue in the body, type inanchor:google inbody:black inbody:blue.
  • ip: Finds sites that are hosted by a specific IP address. The IP address must be a dotted quad address. Type the ip: keyword, followed by the IP address of the website. For example, type IP:80.83.47.151.
  • language: Returns web pages for a specific language. Specify the language code directly after the language: keyword.
  • link: Finds sites that have links to the specified website or domain. This is useful for determining who links to whom. Do not add a space between link: and the web address. For example, to find pages that contain the word games and that link to searchlores.org, type games link:searchlores.org
  • linkdomain: Finds sites that link to any page within the specified domain. Use this keyword to determine how many links are being made to a specific page, as well as how those links are made. For example, to see pages that link to searchlores, type linkdomain:searchlores.org.
  • linkfromdomain: Finds sites that are linked from the specified domain. Use this keyword to determine how many links are being made from a specific page, as well as how those links are made. For example, to see pages that are linked from my site, type linkfromdomain:fravia.com
  • loc:, location: Return web pages from a specific country or region. Specify the country or region code directly after the loc: keyword. To focus on two or more countries or regions, use a logical OR and group them. For example, "core python" (loc:RU OR loc:CN)
  • prefer: Adds emphasis on either a word or another operator. For example, type searching prefer:internet
  • site: Returns web pages that belong to the specified site. To focus on two or more domains, use a logical OR and group the domains. Do not add a space after the colon (:). You can use site search for web domains, top level domains, and directories that are not more than two levels deep. For example, to see web pages about media reporting from the BBC or CNN websites, type "media reporting" (site:bbc.co.uk OR site:cnn.com). You can also search for web pages that contain a specific search word on a site. For example, to find the library pages on searchlores, type site:www.searchlores.org/library
  • feed: Finds RSS or Atom feeds on a website. For example, to find RSS or Atom feeds about web searching, type feed:"web searching"
  • hasfeed: Finds web pages that contain an RSS or Atom feed on a website. You can add search words to narrow your search. For example, to find web pages on the Guardian website that contain RSS or Atom feeds about google, type site:www.guardian.co.uk hasfeed:google
  • url: Checks whether the listed domain or web address is in the Live Search index. Do not add a space between url: and the domain or web address. For example, to verify that searchlores is in the index, type url:searchlores.org
Most important MSNLive operator:
linkfromdomain: (an outbound links operator)



We'll now use as an example the intitle: operator.
The structure of the following old -and already "blunt"- mp3s webbit has various interesting characteristics that may be used to exemplify general webbits' structures and purposes.

                 1         2         3                   4                    5            6                                        s.e.
                 group     title     format variants     index of in title    spamkiller   variable parameter (guarantees length)
High precision   beatles   imagine   mp3 OR m4a OR ogg   intitle:"Index of"   -metallica   +"4.2M"                                  google
High recall      lavigne   (none)    mp3 OR m4a OR ogg   intitle:Index.of     -beatles     +"4.4M"                                  Yahoo

  1. The "group" (or singer name) is mandatory.
  2. Simply specifying a "title" adds precision and loses recall (precision and recall are -most of the time- inversely proportional). (This means that if you add your target's title you diminish excessive noise but may miss some target sites).
  3. The "format variants" will guarantee a broader spectrum. If the search engine you are using is heavily censored (as happens more and more often) just eliminate the mp3 parameter. Chances are that some (yet) uncensored m4a or some ogg file will be present inside our "real target" (mp3 censored music lists), and that these "ogg oddballs" will allow their retrieval. When they censor m4as we'll invent something else :-)
  4. The intitle:"index of" (or intitle:index.of, which is the same but avoids two key-presses) is mandatory, and -spammers notwithstanding- still allows fairly decent results. Of course the intitle: operator is to be used with google and yahoo: check the different operators for the other search engines, or just use a simpler (and more spammed) "index of" string snippet.
  5. The -metallica (or -beatles, or whatnots) serves as a spamkiller, because many clowns still try to fish zombies out of the knowledge web by uploading huge lists of groups' names. If you're going for high recall, then re-launch the same query with a different singer acting as spamkiller.
  6. Finally we come to the LENGTH parameter, which not only guarantees the presence of at least some Megabyte heavy mp3, thus cutting away all the irrelevant noise of those bogus "index of" spammer sites with "small snippets" of music, but also can be varied ad libitum and will thus guarantee you hours of fishing pleasure. You may try for instance all variations in the range 1.6M - 6.4M. I suggest starting in the range 3.5-4.5 (which is a good signal ratio for mp3s) and moving upward if you are an optimist and downward if you are a pessimist.
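
The six-part structure described above can be assembled mechanically. The following is just a minimal sketch: the function name build_webbit and its parameters are invented for illustration, and the resulting string still has to be pasted into (and adapted to the operators of) whatever search engine you use.

```python
def build_webbit(group, title=None, formats=("mp3", "m4a", "ogg"),
                 spamkiller=None, size=None):
    """Assemble an mp3 'webbit' from the six parts described above."""
    parts = [group]                          # 1. group (mandatory)
    if title:
        parts.append(title)                  # 2. title: more precision, less recall
    parts.append(" OR ".join(formats))       # 3. format variants
    parts.append('intitle:"index of"')       # 4. open directory listings only
    if spamkiller:
        parts.append("-" + spamkiller)       # 5. spamkiller
    if size:
        parts.append('+"%s"' % size)         # 6. variable length parameter
    return " ".join(parts)


# The two rows of the table above, rebuilt from their parts.
high_precision = build_webbit("beatles", title="imagine",
                              spamkiller="metallica", size="4.2M")
high_recall = build_webbit("lavigne", spamkiller="beatles", size="4.4M")

print(high_precision)
print(high_recall)
```

Leaving out the title gives the high recall variant; swapping the spamkiller or the size gives you the "hours of fishing pleasure" without retyping anything.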


A more recent webbit:

-inurl:htm -inurl:html -inurl:jsp -inurl:php -inurl:pdf -inurl:asp -inurl:txt -inurl:shtml -inurl:phtml -inurl:cgi -intitle:free -intitle:download -intitle:archive +intitle:index+of/ +parent-directory +name +"last modified" +size +description (oasis OR shakira) (mp3 OR wma OR m4a) -download

Note the useful spamkiller -download and the fact that we search only for (mp3 OR wma OR m4a). You may add ogg files as well. In fact there's no point in searching for those infamous iPod m4p files. However, should you happen to download some of those iPod "FairPlay infected" AAC files, you may use a program like JHymn to play them wherever and whenever despite their ridiculous patents, and/or convert them on the fly to a more useful (and unprotected) m4a format.

Here is a very simple, extremely short and yet quite useful query. You can launch this (or a very similar one) with any good search engine... even with google...
shakira "4.6m * snd"
"Grown up" humans will probably substitute "shakira" with -say- "bach" :-)
Note -again- the added "variable" parameter "4.6m", which is quite important in order to reduce noise and spam (again: you can modulate as much as you fancy: 4.5, 4.6, 6.2 etc.).
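
Modulating the size parameter by hand gets boring quickly, so here is a small sketch that generates the whole family of queries. The helper names (size_variants, size_queries) are invented; the query template is the one shown above, and the default range 1.6 - 6.4 is the one suggested earlier for mp3s.

```python
def size_variants(low_tenths=16, high_tenths=64):
    """Return size strings from 1.6m to 6.4m (bounds given in tenths of a megabyte)."""
    # Integers avoid floating point drift when stepping by 0.1.
    return ["%d.%dm" % divmod(t, 10) for t in range(low_tenths, high_tenths + 1)]


def size_queries(name, sizes):
    """Build one short query per size, following the template above."""
    return ['%s "%s * snd"' % (name, size) for size in sizes]


# Start in the 3.5 - 4.5 range, as suggested, then widen if needed.
for query in size_queries("shakira", size_variants(35, 45)):
    print(query)
```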


Today's targets (python & proximity galore)


How to get inside libraries at night

Ah, Python! Maybe the most powerful programming language for the web.
The following books about python will today serve as "example queries".

Note that finding books on the web is extremely easy.

In fact there's even a direct relation between the "celebrity" of a target and how easy it will be to find it on the web.
For example: if you want hic et nunc the "Lord of the Rings", you just search for a passage of it: At last the voice of Faramir ordered them to be uncovered. See? Even the heavily censored MSNSearch gives us some results. This is true for all books: "But few of any sort and none of name" and you immediately have Shakespeare's "Much ado about nothing" (and, I may add, you discover en passant a lot of book repositories as well).

Hence finding books, patented or not, is extremely easy. Yet today's "examples" will give us -I hope- an opportunity to investigate some alternative searching methods... some different ways of cutting the noise and getting the signal without just using google (or any one of the many main search engines).

The following three books will be our 'virtual targets':
  1. Core Python Programming by Wesley J. Chun
    First Edition, Prentice Hall, December 14, 2000 ISBN: 0-13-026036-3, 816 pages.
    Note that the new second edition (September 2006, 1120 pages; Prentice Hall, ISBN: 0132269937) is too new to have already "percolated" through the web. It is already out there all the same, of course, and you'll have to find it yourself (see the assignment section), but keep in mind that it usually takes a book a couple of months after its publication to "sink in" the web.
    A good introduction to python: explains the language, and does put it in a wider context. (Incidentally: reversers should note the usual tricks with the font sizes: here so big that the book could be condensed to half its size -or less- if the type were reduced to a normal level :-)
  2. Python Cookbook by Alex Martelli and David Ascher (O'Reilly, ISBN: 0596001673), 2002
    Advanced: a collection of problems, solutions, and practical examples for Python programmers, written by Python programmers. In principle this book is for those that already know Python. In practice it's quite useful for anyone.
  3. Python Essential Reference by David Beazley, (Sams, ISBN: 0672328623), 2006
    Syntax, functions, operators, classes, and libraries. This is first and foremost a reference, yet quite useful if you have some experience with other programming languages.
Now, of course we all know that 'printed' books on such matters are often both obsolete as soon as printed and quite superficial vis-a-vis what you can find for free on the Web, because we also know that the best knowledge (and material) is to be found instead in some 'gray areas' of the web, or visiting ad hoc messageboards or perusing Usenet emails.
Moreover there is such a wealth of free useful texts on the web that this fact alone (not the fact that you can easily find all these patented books for free) makes one wonder whether nowadays it makes any sense at all to "buy" a book.

Again a caveat: the following queries are just a proof of concept, showing how we could search for books... the specific "quarries" (in this case our three python books) don't matter that much: you'll be able to adapt the following approaches to OTHER, different, targets of yours... replace for instance "python" with -say- "assembly" or "digital photography" and you'll obtain a different complete library instead.
The approaches and the techniques we are examining together are important, the targets themselves are irrelevant.

Proximity galore

Another interesting side effect of a correct web-seeking approach is that often, when searching for something, you will find on the same servers many other targets related to your topic, targets that you did not even know existed. While this is true for many different targets and not only for books, we call this the "being inside the library" effect. Imagine you are not filling out a request form at the counter: imagine you are physically retrieving a book inside a library, with shelves and shelves of books around you and within your immediate reach. Thus you can scan with your eyes all the other books in the proximity: books, more or less related to the topic you are searching for, that are, imagine, located on the same shelf, next to your target book, and that you can also pick up at leisure. Gee... the amazing power of knowing how to search!

Enough theory! A good webbit for our python books? Here is a "regional" one (say: China): python site:.cn "index+of"

Now let's leave theory and enter practice...

Finding the three targets (the simple way)


"A posse ad esse" :-)


Yep: searching for books on the web is -most of the time- extremely simple, especially if you have the exact title, the name of the author, the ISBN number and/or some snippets of text.

So let's now find the three targets listed above.
  1. Core Python Programming by Wesley J. Chun

    Please note that -in general- you find most books just inputting a title "tel quel" in any search engine: for instance a simple query on the good (and still commercially not infected) compound search engine seekz "core_python_programming" (note the underscores between the words) will almost immediately give us plenty of locations with the complete version of Chun's nice book.
    On this same repository we register the "inside the library" phenomenon we discussed before: 19 further books on python, some of them crap, some of them good.

    Alas! This "Core" target is the old "2000" edition. The new 1120 pages - September 2006 - second edition (Prentice Hall, ISBN: 0132269937) is more difficult to find, because it is too new to have already percolated the web: that takes (at most) a couple of months. You'll anyway be able to find it easily around December... but you may want to find it right now in order to solve today's assignment.


  2. Python Cookbook by Alex Martelli and David Ascher
    Let's try good ole Altavista with "Python Cookbook" NEAR rapidshare and choose results only in Russian, Persian, Chinese, Bulgarian, Arabic and other interesting non-copyright obsessed languages :-)

    We'll reap good results aplenty, and the first one (an Iranian messageboard) looks promising: in fact it gives us another Iranian URL, where we could download our target BOTH in chm and pdf format. As an added bonus, due -again- to that powerful "inside the library" effect, we could fetch the following books:
    In fact -its title notwithstanding- this last book is not *that* bad either.
    (Caveat: again, don't download pirated stuff: as a seeker you'll always be able to find on the fly these books somewhere).
    Please note that even if we DID apparently search for a book inside an online porn-depot à la rapidshare (more on this later), we found our target -in fact- inside an OPEN directory, and not on rapidshare itself... simply because the word 'rapidshare' was somewhere else on that first target page. Also NOTE that this specific URL was not -until now- indexed by the main search engines, yet we found it nevertheless combing the web.


  3. Python Essential Reference by David Beazley
    We start with a simple yahoo query: "Python Essential Reference" rapidshare and we almost immediately get (inter many alia) a promising Russian URL, http://doci.nnm.ru/it_ebooks_digest, that gives us on a silver plate our third target (and many others as well, btw).
    Bingo. Done. Finished. We could stop here.
    As many among you will know, rapidshare is (an important one) among HUNDREDS of file repository services which make money hosting porn crap and patented material anyone can upload (they will 'remove it due to complaints' if needed), fooling idiots into buying "special accounts" in order to download this pirated and patented stuff/porn at leisure.
    However, the logic is that in order to fool the zombies they HAVE TO be known and HAVE TO lure people into using them.
    Therefore they HAVE TO provide an (often penalized and slower) free download opportunity for everyone and his leeching dog. Apart from the many simple possibilities to bypass these simple penalizing scripts (mostly written in php), the fact that there's now a huge industry thriving on the selling of patented software/porn/film/music material is very bad news for the poor & suffering patent holders. Not that anyone would give a dead rat, eh :-)


Potpourri searches


All together now!



We can also search all our titles together with our nice "potpourri" approach.
At times simply guessing that interesting places MUST have all your targets on the same page can be useful... in order to find these interesting places and also your targets :-)
Here is a "potpourri" example: on yahoo "Core Python Programming" "Python Cookbook" "Python Essential Reference" (note the &vst=.org&vs=.org&n=100 snippet in the search string)
and on google: "Core Python Programming" "Python Cookbook" "Python Essential Reference" (note the &as_sitesearch=.org&num=100 snippet in the search string)
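
A potpourri search string of this kind can be put together with a few lines of Python. Take this as a sketch under assumptions: the parameter snippets (as_sitesearch, num, vst, vs, n) are the ones quoted above, but the base addresses and the names of the query parameters (q for google, p for yahoo) are my assumption, and search engines change their URL formats whenever they fancy.

```python
from urllib.parse import urlencode

TITLES = [
    "Core Python Programming",
    "Python Cookbook",
    "Python Essential Reference",
]


def potpourri_query(titles):
    """Quote every title so that each must appear as an exact phrase."""
    return " ".join('"%s"' % title for title in titles)


def google_url(titles, domain=".org", results=100):
    # Base address and the 'q' parameter are assumptions; as_sitesearch and num come from the text.
    params = {"q": potpourri_query(titles), "as_sitesearch": domain, "num": results}
    return "https://www.google.com/search?" + urlencode(params)


def yahoo_url(titles, domain=".org", results=100):
    # Base address and the 'p' parameter are assumptions; vst, vs and n come from the text.
    params = {"p": potpourri_query(titles), "vst": domain, "vs": domain, "n": results}
    return "https://search.yahoo.com/search?" + urlencode(params)


print(google_url(TITLES))
print(yahoo_url(TITLES))
```

Change domain to ".edu" (or to a regional one) and you have a different slice of the web with the same targets.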

When you search the web, the biggest problem is noise. Your target, your signal, will be often half-drowned underneath it.
And today's web has a lot of commercial noise.
If you search for an image for instance, say a picture of a famous painter, you will immediately find a gazillion spammers who want to sell you those very images you could easily find for free.
So you'll find thousands of low resolution images, or images defaced with an ugly watermark, put on line by "snake oil" sellers. So "cutting the noise" is crucial.
This holds true for everything, and for books as well. Hence we must "clean up" our queries a little, and, as the title of this conference, "searching underneath the commercial web", implies, a simple approach is for instance to limit our searches to "edu" and "org" sites.

The first query above will give us this link that brings us in this subdirectory, and the second query will give us this "blocked link", yet through google's cached copy we will still be able to find this nice Russian site.
(Note in this example the importance of cached copies. In fact most search engines offer them: Ask, MSNSearch, Yahoo, Google, Alexa, Baidu, Gigablast...).

So, that's it: as you have seen, any and every book is at seekers' disposal.

(Caveat: Download your targets only if you are positively sure they have been released into the public domain. Real seekers do not need to waste harddisk space downloading doubtful stuff. This could even be construed as 'illegal' by the patent holders and their political lackeys. Downloading is not necessary! Seekers will always find again and again on the fly -and consult on line- whatever they fancy).

Ok, ok. It was too easy. Much too easy. In order to continue, let's imagine for a moment that it would NOT have been so incredibly easy to find these books through such simple searches. Imagine we didn't find them. Imagine we will not find them again on those URLs.
After all, the web is a quicksand, and the specific locations where we found our targets will probably disappear after today's talk.


Look Ma: no google!
Go for the format, go for the name, do it like the lamers or search elsewhere


Sprinkles of cosmic searching power

Let's find again the same three targets WITHOUT using the same simple querystrings