BOX 2.2 | Youth, Pornography, and the Internet | Dick Thornburgh and Herbert S. Lin, Editors | Committee to Study Tools and Strategies for Protecting Kids from Pornography and Their Applicability to Other Inappropriate Internet Content | Computer Science and Telecommunications Board | National Research Council


Box 2.2
How Search Engines Work


Search engines help users find information stored in Web pages on the Internet. Typically, a user types some words (the "search query") into a search engine, and the search engine returns a number of "links" on its results page. To reach any of these results, the user clicks on the link, which takes the user away from the search engine to the Web page at the uniform resource locator (URL) corresponding to that link.

A search engine works by matching the user's query against an index of Web pages (documents) on the Internet that it has stored in a database. An index is necessary because with over 2 billion pages on the Internet, a real-time search of all of them when someone makes an information request would be prohibitively expensive and time-consuming. An index allows a search to be completed in a much smaller amount of time (seconds rather than days or weeks), though at the cost of some incompleteness and inaccuracy (because pages may have changed or been added since the index was created).
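The trade-off described above can be illustrated with a minimal sketch of an inverted index, a common structure for this purpose. The box does not say which data structure any particular search engine uses, so this is an assumption made for illustration; the function names `build_index` and `search` and the example.com URLs are hypothetical.

```python
from collections import defaultdict


def build_index(pages):
    """Map each word to the set of URLs whose text contains it.

    `pages` is a dict of URL -> page text, standing in for pages
    that a spider has already retrieved.
    """
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Start with the pages matching the first word, then narrow.
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical pages; a real index would cover far more documents.
pages = {
    "http://example.com/a": "search engines index web pages",
    "http://example.com/b": "spiders retrieve web pages",
    "http://example.com/c": "an index makes search fast",
}
index = build_index(pages)
print(sorted(search(index, "web pages")))
```

Each query is answered from the stored index rather than by rereading the pages, which is why a lookup is fast but reflects the pages only as they were when the index was built.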

No search engine indexes (or even could index) all of the pages on the Web, and each search engine indexes a different set of pages. For this reason, and because of the dynamic nature of the Web, all search engines are inherently "incomplete," and the contents of their indexes (and thus search results) differ from one another.

A search engine builds its index of Web pages by sending out a "spider" to retrieve the pages from Web sites. Spiders retrieve only static pages, not pages that are hidden in databases or are dynamically generated. Most spiders also obey the robots.txt file on a Web site; if the file says, "Do not index this site," they do not index that site. The indexes built this way can store millions of words and hundreds of thousands of sites.
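As a rough sketch of how a spider might honor a site's robots.txt file, the example below uses Python's standard `urllib.robotparser` module. The robots.txt contents, the user-agent name "ExampleSpider", the helper `allowed_urls`, and the URLs are all hypothetical; a real spider would download the file from the site rather than read it from a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking all spiders to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_urls(robots_txt, user_agent, urls):
    """Return only the URLs that the robots.txt rules permit."""
    parser = RobotFileParser()
    # parse() accepts the lines of an already-downloaded robots.txt file.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


urls = [
    "http://example.com/index.html",
    "http://example.com/private/notes.html",
]
crawlable = allowed_urls(ROBOTS_TXT, "ExampleSpider", urls)
print(crawlable)
```

A file containing `Disallow: /` for all user agents corresponds to the "Do not index this site" case: with that rule, the helper returns no URLs for the site.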

A paper published in Nature in 1999 estimated the types of material indexed, excluding commercial sites. "Scientific and educational" sites formed the largest category. Health sites, personal sites, and the sites of societies (scholarly or other) each accounted for a larger share than pornography; that is, a few percent of Web pages contained material that could reasonably be characterized as adult-oriented, sexually explicit material.1

For a more detailed description of how search engines index Web pages, interpret queries, and search their databases, see Appendix C.

1 Steve Lawrence and Lee Giles. 1999. "Accessibility of Information on the Web," Nature 400: 107-109.




Copyright 2002 by the National Academy of Sciences  


