History of the Internet, Chapter Four

Chapter Four: Search Engines
Richard T. Griffiths (Leiden University)

Much of the information for this paper was taken from the superb Search Engine Watch, which is the best place for further information. Much of the rest came from the companies themselves. All the data was current in September 2001.

"There is nothing worthwhile on the web, and you will never find it anyway".

This complaint has been voiced since the beginning of the web... and usually by people who had never even tried. From the start of the internet, there have been directories and lists on various topics maintained by enthusiasts (who were often experts in their fields). My own favourite used to be a list maintained by Scott Yanoff. He started a small list for his personal use in 1991 and found himself being snowballed by e-mail suggestions from grateful users until he became almost an institution, and author of a couple of internet guides. He seems to have given up in 1995. This is what it looked like then.

http://sunsite.iisc.ernet.in/virlib/html/spled/f3-5.gif

By the late 1980s, however, the amount of data was getting too large to rely on helpful hints from other users. From its start in 1983, the internet had grown to 1000 hosts in 1984, to 10,000 in 1987, to 100,000 in 1990 and to 1,000,000 in 1992. Information retrieval was becoming a bottleneck and a cluster of innovations took place to resolve the problem.

In the beginning

1990 Archie, developed at McGill University (Montreal), was the first search engine for finding and retrieving computer files. At the time, large institutional computers placed their data and program files into two categories: open and closed. When you 'logged-in' to another computer, you could access the 'open' files by identifying yourself as "anonymous" and using your e-mail address as the password. Then you could browse through its archive and download any files you wanted.

What Archie did was to visit, automatically and at night (when there was less traffic), all the archives it knew about and to copy their file lists into a searchable database (the piece of software that did this was known as a spider). When you logged into an Archie site (by telnet) it would tell you where any file was, and you could view the results and e-mail them to yourself... and then go through the entire 'log-in and retrieve' procedure for each computer yourself. It was a striking comment on the state of the art at the time that McGill soon discovered that half of all US-Canada traffic was running through its Archie server, so it shut down public access. By then, however, there were many alternative sites hosting the service.
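The idea of copying file lists into one searchable database can be sketched in a few lines of Python. This is only an illustration of the principle, not Archie's actual code: the host names, the paths and the function names `build_file_index` and `find_file` are all invented for the example.

```python
from collections import defaultdict

def build_file_index(listings):
    """Map each file name to every (host, path) where it was seen."""
    index = defaultdict(list)
    for host, paths in listings.items():
        for path in paths:
            filename = path.rsplit("/", 1)[-1]
            index[filename.lower()].append((host, path))
    return index

def find_file(index, query):
    """Return every (host, path) whose file name contains the query."""
    query = query.lower()
    hits = []
    for filename, locations in index.items():
        if query in filename:
            hits.extend(locations)
    return sorted(hits)

# Hypothetical listings, as a nightly visit to two archives might collect them.
listings = {
    "ftp.example.edu": ["/pub/tools/gzip.tar", "/pub/docs/readme.txt"],
    "archive.example.org": ["/mirror/gzip.tar", "/mirror/games/rogue.zip"],
}
index = build_file_index(listings)

for host, path in find_file(index, "gzip"):
    print(host, path)
```

The search only tells you where a file is; fetching it from each host is still left to the user, just as it was with Archie.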

1991 The Gopher system, developed at the University of Minnesota (whose mascot was a golden gopher), represented an improvement on ftp retrieval.

http://cit.evitech.fi/internet/images/gopher.jpg

The host computers (servers) put their files in a 'menu' form and the menus of the different servers were merged. Now you logged into any gopher server and you could query it for information by typing in keywords and, again like Archie, you would get a list of items. But now, instead of sending yourself the list and individually looking up the items, you scrolled down the list, pressed 'enter' and you were transferred directly to the relevant 'gopher' address, where you could read the contents. Then, if you wanted, you sent the file to yourself via e-mail. Since 'gopher' was a useful way of storing data, the system caught on very rapidly. Within 'gopherspace', a search-engine called Veronica (supposedly Very Easy Rodent-Oriented Network Index to Computerised Archives), developed at the University of Nevada, operated on the same principle as Archie, but it also allowed you to distinguish between a search for 'directories' and an undifferentiated search combining directories and files (the latter was much larger and more time-consuming). Again, having located something, you e-mailed it to yourself.

http://www.jsu.edu/depart/psychology/sebac/fac-sch/internet/gopher-h.shtml

1991 also saw the birth of WAIS (Wide Area Information Server) developed by Thinking Machines Corp.

http://www.artemis.jussieu.fr/wwwos2/html/dess/memoire/promo94/zaoui/chap15.htm

WAIS was also logged into separately. It searched through information on the basis of its contents. So, if using Archie and Veronica was like searching through a card index of book titles, using WAIS was like using a book index. WAIS's database was smaller than the other two but, even so, searching through the lot was daunting and time-consuming. So WAIS broke down its databases into separate subject indices and the researcher could then restrict the word search to the relevant category. At its peak WAIS linked up 600 databases around the world. WAIS ranked the results by the frequency with which the search terms appeared and, since it was gopher-based, you could click through to the document and read its contents (and e-mail it to yourself if so desired).
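Ranking by frequency is easy to illustrate. The sketch below simply counts how often a term occurs in each document and sorts on that count; it assumes nothing about how WAIS itself was implemented, and the documents and the function name `rank_by_frequency` are made up for the example.

```python
import re

def rank_by_frequency(documents, term):
    """Return document names ordered by how often the term occurs in them."""
    term = term.lower()
    scores = []
    for name, text in documents.items():
        words = re.findall(r"[a-z]+", text.lower())
        count = words.count(term)
        if count:  # documents that never mention the term are left out
            scores.append((count, name))
    # Highest count first; ties are broken alphabetically by name.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for count, name in scores]

# Hypothetical mini-collection.
documents = {
    "trade.txt": "Medieval history and medieval trade routes",
    "modern.txt": "Modern history since 1945",
    "towns.txt": "Life in medieval towns",
}
ranking = rank_by_frequency(documents, "medieval")
print(ranking)
```

A raw count favours long documents, which is one reason later engines combined frequency with other signals.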

Thus,

these early search-engines had a spider
built up databases either of directories or of web-pages, or
built up directories (specifically limited in ambition and range, but supposedly limited to better sites), and
they could also rank by terms within a document

None of the first-generation search-engines mentioned above has survived... but, like dinosaurs, they live on in better adapted versions. The principles they developed, since refined and made more powerful, underpin the design of almost all subsequent search-engines.

This was the situation in the early 1990s. Then in 1991 the WWW was developed and two years later the Mosaic graphics browser. These contributed to an enormous expansion of the net, but they also made possible the development of a new generation of user-friendly search-engines. If, in 1992, the number of hosts had reached 1,000,000, by 1996 the number had surpassed 10,000,000. Moreover, the number of web-sites was beginning to increase exponentially. Two years later there were 36 million hosts and 4 million web-sites.

There is more information on the web than ever before, but in many ways it is easier to locate the information we want... if we work systematically and intelligently, and have a little patience.

How do we find information on the net?

Directories and Search Engines

With the exception of the WWW Virtual Library, which is really a set of clickable categories, all the following directories and search engines have a SEARCH function.
If you type in a single word there are no problems, other than the fact that single words, such as 'history', will probably produce an unmanageable number of hits. So try to be more specific.
NOTE: many search engines are sensitive to CAPITAL LETTERS. If you use lower case letters, all the search engines below will match them to capitals as well. If you use capitals, Alta Vista and Infoseek will not bother looking for lower case matches (this could be useful, for example, if you are looking for WHO as in World Health Organisation). As a rule, except for names, use lower case only (and if something that should be there doesn't show up, try a capital letter).
All the search engines mentioned here recognize the following 'search engine maths'. NOTE: do not leave spaces.

use +history+medieval for documents mentioning both

use +history+medieval+women for documents mentioning all three

use +history+medieval-women for documents mentioning the first two but excluding the third

use +history-women for documents including the first but excluding the second

use "medieval history" for documents containing the exact phrase, or words in that order
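To make the rules above concrete, here is a minimal sketch of how such a query might be interpreted. It is not taken from any real search engine: the functions `parse_query` and `matches` are hypothetical, and the phrase test is a plain substring check, which is cruder than what a real engine would do.

```python
import re

def parse_query(query):
    """Split a query into required words, excluded words and quoted phrases."""
    phrases = re.findall(r'"([^"]+)"', query)
    rest = re.sub(r'"[^"]+"', "", query)      # drop the quoted parts
    required = re.findall(r"\+(\w+)", rest)   # words prefixed with +
    excluded = re.findall(r"-(\w+)", rest)    # words prefixed with -
    return required, excluded, phrases

def matches(text, query):
    """Check whether a document text satisfies the query."""
    required, excluded, phrases = parse_query(query.lower())
    text = text.lower()
    words = set(re.findall(r"\w+", text))
    return (all(word in words for word in required)
            and not any(word in words for word in excluded)
            and all(phrase in text for phrase in phrases))

print(matches("Medieval history of trade", "+history+medieval"))
print(matches("Women in medieval history", "+history+medieval-women"))
print(matches("A history of medieval towns", '"medieval history"'))
```

The last query fails on purpose: both words are present, but not as the exact phrase.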

Some of the search engines (in advanced search mode) use 'Boolean' logic. NOTE: this does not apply to Yahoo!, Infoseek or Google, and in Alta Vista it is available only in 'advanced mode'. The principle is the same as above, but it is a little more powerful. For example:

use history AND medieval for documents mentioning both

use history NOT women for documents mentioning the first but excluding the second

use history OR geschiedenis for documents mentioning either word

use history NEAR medieval to stipulate that the words present need to be close to each other

use history AND (medieval OR renaissance) for building up more complex searches.

NOTE: you must use CAPITALS for these instructions. Check the 'detailed instructions' option for the detail in each case.
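Boolean operators map neatly onto set operations, which is probably the easiest way to see what they do. The sketch below is an illustration only: the four documents are invented, the helper names `build_index`, `hits` and `near` are hypothetical, and the window of three words used for NEAR is an arbitrary assumption (each engine sets its own distance).

```python
def build_index(documents):
    """Map each word to the set of document numbers that contain it."""
    index = {}
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index

def near(text, first, second, window=3):
    """True if the two words occur within `window` words of each other."""
    words = text.lower().split()
    first_pos = [i for i, w in enumerate(words) if w == first]
    second_pos = [i for i, w in enumerate(words) if w == second]
    return any(abs(i - j) <= window for i in first_pos for j in second_pos)

docs = {
    1: "medieval history of trade",
    2: "renaissance history and art",
    3: "history of women in medieval towns",
    4: "geschiedenis van de middeleeuwen",
}
index = build_index(docs)

def hits(word):
    return index.get(word, set())

both = hits("history") & hits("medieval")          # history AND medieval
without = hits("history") - hits("women")          # history NOT women
either = hits("history") | hits("geschiedenis")    # history OR geschiedenis
nested = hits("history") & (hits("medieval") | hits("renaissance"))

print(both, without, either, nested)
print(near(docs[1], "history", "medieval"), near(docs[3], "history", "medieval"))
```

Document 3 mentions both words but fails the NEAR test, because they are four words apart.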

Some search engines allow you to search within a category. Use this facility. If you get down to 'history' through categories such as education/research/academic etc. you will obviously escape the life story of the family's pet hamster. Then follow the steps above.

Directories

Directories are lists of sites, chosen by human beings (they still exist).

WWW Virtual Library was set up by Tim Berners-Lee, the founder of the WWW. It is non-commercial and is run by a federation of volunteer institutions which follow certain rules and which try to ensure that the links are relevant and up-to-date. It might be worth looking at the home-page if you want a theme which might equally fall under economics or law etc., but it does have a history library, most of which is hosted by the University of Kansas. This can be accessed in two ways:

alphabetically
thematically and geographically [http://www.ukans.edu/history/VL/]

I recommend the latter and you can see which bits are done by Kansas and which bits by other organisations (eg 'labour and business history' is coordinated by the IISG, Amsterdam). The central index has grown enormously from

July 1998: 2500 sites

July 1999: 4000 sites (plus about 1000 more in sub-sites)

January 2000: 5800 sites

The library is good... in parts. Peter Doorn, writing the notes for this course four years ago, was scathing: links were not kept up-to-date, the classification was often misleading, and some of the collections were pathetic. The situation has now improved, but there are still areas where you wonder why you bothered. On the other hand, some of the sections are really excellent.

Yahoo! (supposedly an acronym for 'Yet Another Hierarchical Officious Oracle', but this is now denied by its creators) is a commercial directory established in late 1994 by two PhD students at Stanford University (David Filo and Jerry Yang, who also developed the software). In 1995 Marc Andreessen invited them to use the more powerful computers at Netscape, but Yahoo! maintains a separate commercial identity. The principle is one of user registration, but it also uses an advanced 'spider' engine.

It is the largest of the directories, employing 80 editors and manipulating a database of over 1 million links. It also has a history section which has grown from:

July 1998: 10,500 sites

July 1999: 17,800 sites (directories) linked into almost 1000 categories

January 2000: 20,000 sites

July 2001: 27,000 sites

It also has a 'search within category' function.

Search Engines

Directories are rather like a library book catalogue, telling you the titles available. At the end of each book, there is often an index, telling you exactly where to look within the book to find a mention of a particular name or topic. Imagine if, instead of looking through each book, someone had torn out all the indexes from all the books and rearranged them so that the names and topics were put together. And then imagine that it can sort through all those pages... a stack one hundred miles high... and give you the results (ranked in an approximate order of potential relevance) in under one second. You've got the picture. This is the phenomenal power behind today's search engines. This is the power of Google. One suggestion: although there are several meta-search-engines available (which simultaneously submit your search term to different search engines) you are better off looking through the first ten pages on one top-quality search engine than looking through the first page on ten separate engines.

KEY: GG=Google, FAST=FAST, AV=AltaVista, INK=Inktomi, WT=WebTop.com,
NL=Northern Light, EX=Excite. Also use this key for charts below.

http://searchenginewatch.com/reports/sizes.html

There are thousands of search-engines, but we have selected three of the largest and most flexible. Search-engines operate by selecting individual web-pages or documents. Although some give you the option of selecting sites, their coverage is far smaller than that of the main directories. Many search engines use the same 'spiders' to compile their indices, so the difference lies in the way they interpret the data and how they allow you to manipulate the results. Keep in mind that when a search-engine gives you a couple of hundred hits, on twenty pages, many people do not bother to look beyond the first three pages. It is worth pausing to reflect on what these 'spiders' are looking for. Web-page makers offer the spiders four information sources:

Page title or, by default, the first words (like 'welcome to my page')
Description (written in meta-text, which means you can't see it on the screen) which tries to pack in as many keywords as possible
Keywords (again in meta-text)
The text of the document itself

Moreover, most engines allow page-makers to submit their locations for inclusion. Of the four sources, the title is the most important determinant of content. Some search engines (including Alta Vista and Infoseek) incorporate the meta-text when making rankings (looking at how frequently a word appears or how near the top it is) but others use the first paragraphs of the document text itself. So, even if they use the same spider, they will not necessarily give you the same documents in the same order. Moreover, the way they categorise their information and the search functions they provide also influence their usefulness.
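The four information sources can be pulled out of a page with very little code. The sketch below uses the `html.parser` module from Python's standard library; the class name `PageSummary` and the sample page are invented for the illustration, and a real spider would of course do far more (fetching, following links, coping with broken mark-up).

```python
from html.parser import HTMLParser

class PageSummary(HTMLParser):
    """Collect the title, the description and keywords meta-text, and the visible text."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.body_text = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") in ("description", "keywords"):
            self.meta[attrs["name"]] = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif data.strip():
            self.body_text.append(data.strip())

# A made-up page of the kind described above.
page = """<html><head><title>Medieval History Links</title>
<meta name="description" content="A guide to medieval history on the web">
<meta name="keywords" content="history, medieval, europe">
</head><body><p>Welcome to my page</p></body></html>"""

summary = PageSummary()
summary.feed(page)
print(summary.title)
print(summary.meta)
print(summary.body_text)
```

How much weight each of the four sources then receives in the ranking is exactly where the engines differ.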

Google I tipped this search-engine when it first started as one to watch. Google is a deliberate misspelling of googol (10 to the power of 100)... but it was really chosen because the name sounded 'cool'. It was formed by two Stanford graduates in April 1998. The principle behind it is that it monitors other indices to see who links to what (and ranks these) in order to locate the real 'authorities' on a topic, and it uses this to rank the results. It had superb, clean looks and it now also has the largest coverage of any search engine (1.6 thousand million web-pages), and it still delivers plenty of useable references within the first hundred or so results (it depends... but you can easily see when the quality tails off). If you use the 'cached' pages (kept on its own computers) you can see your search terms highlighted in the document. It has recently introduced an 'advanced search' category and an image search. In total, a fabulous resource.
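The idea of ranking pages by who links to them can be illustrated with a toy calculation. What follows is a simplified link-analysis iteration written for this chapter, not Google's actual method: the function name `link_rank`, the damping value of 0.85, the number of rounds and the four-page 'web' are all assumptions made for the example.

```python
def link_rank(links, damping=0.85, rounds=50):
    """Score pages so that links from well-linked pages count for more."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(rounds):
        # Every page starts each round with a small baseline score.
        new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page shares its score equally among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                share = damping * rank[page] / len(pages)
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank

# Hypothetical mini-web: three pages link to "c", and "c" links back to "a".
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
rank = link_rank(links)

for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 3))
```

Page "c" comes out on top because everyone links to it, and "a" comes second because it is the one page that the 'authority' itself links to.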

AltaVista (meaning 'view from above') opened in December 1995 as an offshoot of Digital Equipment Corporation. It has an index to 550 million web-pages, divided into 24,000 categories. It was also the first site to include a translation service (the same one behind some of our history web-sites) and a search facility for images and sound files. I used to rely a great deal on the image search function, but I fear that it has now been surpassed by Google, which gives more returns and which also allows you easily to see the images in their text context.

Northern Light (named after a record-breaking 19th-century American schooner) is a private company, established in 1995 and employing 40 researchers. It has a reported 350 million web-pages in its index. It is particularly proud of its 'special collection' documents which mostly comprise journal articles that usually evade search-engine spiders (and some can be ordered for a small charge).

Databases

We saw that Northern Light was very proud of the fact that it integrated journal contents into its indices, but its coverage is far from complete. Yet Northern Light has highlighted a long-standing problem. We can hope to find books in library catalogues, but where do we find recent journal articles? There was always one solution, and that was to locate journals that published indices (and abstracts) of scientific journals and (after locating the relevant sub-categories) to plough through them by hand. The most popular among historians was Historical Abstracts (available on diskette in Leiden University's library) whilst economic historians could rely on the Journal of Economic Literature. Some of these are now available on-line, all with search functions, but most are only available within Local Area Networks (and you should check your own library's resources). Staff and students of Leiden University should go here and search further in ERL (Silver Platter).

Another way to access journal articles electronically is through on-line journal guides. Here are some suggestions:

Many of these have links through to contents of past issues.

A final route through to journal publications is through on-line bibliographies. The best way to access these is to enter the name of your field of interest into a search engine and add the term 'bibliography' to your search. It might be intellectually satisfying to rediscover the wheel, but most of us have better things to do with our time.

Library Catalogues

Finally we come to good, old-fashioned library catalogues. These are still the best way of finding printed information which is classified and catalogued by the data on its cover. In other words, we can use library catalogues to locate books, journals and working papers... but not to locate chapters or articles. Most major libraries began computerising their catalogues in the 1970s and it was only a small step to making them available to the on-line community. However, this is not always possible through the WWW... many sites are reached through telnetting (not that you need to do anything) and they have an old-fashioned feel about them. But, especially for historians, there is still far more information available behind these catalogues than there is on the internet (and there will be for some considerable time to come). We assume that you are already familiar with our own university catalogue, but most national (copyright) libraries have now put their catalogues on-line. You can find the list of addresses via Yahoo here.

R.T. Griffiths
Last update: 11 October 2002