Search Engines

A search engine is a computer program that allows the user to enter a series of keywords, usually called a “query,” and that responds with a list of results from a database that match the query. Major search engines, such as Google, Yahoo Search, and Microsoft Live Search, provide the most widely used method of finding information on the world wide web. Search engine websites are the most visited in the world, with estimates showing that in 2006 the Google and Yahoo websites reached approximately 80 percent of worldwide Internet users, or approximately 500 million people each month. As more and more information and entertainment becomes digital and is stored in databases, search engines are becoming increasingly important. The ability to effectively use a search engine is becoming widely recognized as a key skill of digital or information literacy. On the Internet, there is often a strong incentive for companies to place highly in search engine results, and search engine companies have become very powerful economic actors in online media.

Research on search engines is not an integrated field and there are at least four different perspectives: first, information retrieval, where the search engine is studied as a complex programming problem; second, information literacy, where the interactions between search engines and user skills are of paramount concern; third, online marketing, where the search engine’s effectiveness as a marketing tool is of central concern; and finally, media law and policy, where regulations relating to search engines are being debated.

The Technology Of Information Retrieval

A search engine has three core technological elements: the index, the crawler, and the search algorithm. The index is a database that functions much as an index does in a book: it contains references and pointers to the information on the web, much as the words and page numbers in a printed index refer to text in a book. In order to obtain the references for the index, search engines on the web use another computer program, called a crawler or a spider, to automatically browse pages by traversing hyperlinks between and within websites. The search algorithm has the complex task of matching the terms the user types into the search box (the query) with the references in the index and displaying them in ranked order.
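The three elements can be illustrated with a minimal sketch in Python. Everything here is an assumption made for illustration: the in-memory `TOY_WEB`, the example URLs, and the ranking rule (total occurrences of the query terms) are hypothetical and far simpler than anything a production search engine uses.

```python
from collections import defaultdict, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing hyperlinks).
TOY_WEB = {
    "a.example/home": ("search engines index the web",
                       ["a.example/about", "b.example/news"]),
    "a.example/about": ("a crawler follows hyperlinks", ["a.example/home"]),
    "b.example/news": ("web search news and search tips", []),
}

def crawl(web, seed):
    """Crawler: breadth-first traversal of hyperlinks from a seed URL."""
    seen, queue, order = {seed}, deque([seed]), []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in web[url][1]:
            if link in web and link not in seen:
                seen.add(link)
                queue.append(link)
    return order

def build_index(web, urls):
    """Index: maps each term to the pages containing it, with counts."""
    index = defaultdict(dict)
    for url in urls:
        for term in web[url][0].lower().split():
            index[term][url] = index[term].get(url, 0) + 1
    return index

def search(index, query):
    """Search algorithm: pages containing every query term, ranked by
    total term occurrences (ties broken alphabetically by URL)."""
    terms = query.lower().split()
    if not terms:
        return []
    candidates = set(index.get(terms[0], {}))
    for term in terms[1:]:
        candidates &= set(index.get(term, {}))
    scored = [(sum(index[t][u] for t in terms), u) for u in candidates]
    return [u for _, u in sorted(scored, key=lambda p: (-p[0], p[1]))]

pages = crawl(TOY_WEB, "a.example/home")
index = build_index(TOY_WEB, pages)
print(search(index, "web search"))
```

The index is built once from the crawled pages and then consulted for every query, which is why the search itself never needs to revisit the pages.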

The classical problems in information retrieval, the branch of computer science in which search engines are studied, revolve around the search algorithm and how to ensure that its results have sufficient precision (defined as the fraction of retrieved results that are relevant to the query) and recall (defined as the fraction of all documents relevant to the query that are in fact retrieved) (Singhal 2001). With the world wide web, new challenges have been introduced. First, the web demands search engines on a scale never seen before – web search indices contain references to billions of documents, many times the size of any other database. Second, the web demands speed – hundreds of thousands of queries must be processed per second. Third, the web introduces the problem of stability – it is constantly changing and it has been estimated that a complete index would be out of date within a year without constant crawling and refreshing of the index. Fourth, a construct of authority must be developed for web search engines – unlike a closed database, where the contents have been vetted for inclusion, the web includes many documents of all types, some of which may be fraudulent or criminal. Finally, the web contains many types of non-textual multimedia files, such as audio files, videos, and pictures, and much effort is being directed toward creating effective search engines for these items. In general, search engine companies are trying to increase both the types and the number of items that can be searched.
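Precision and recall can be made concrete with a short worked example in Python. The document identifiers and the relevance judgments below are invented for illustration; in practice, deciding which documents are relevant is itself a difficult problem.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant retrieved documents / all retrieved documents
    recall    = relevant retrieved documents / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: four documents returned, five relevant ones exist.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5", "d6", "d7"]
print(precision_recall(retrieved, relevant))  # (0.5, 0.4)
```

Two of the four retrieved documents are relevant (precision 0.5), but only two of the five relevant documents were found (recall 0.4). Returning more results tends to raise recall at the expense of precision.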

Search Engines And Information Literacy

Search engines derive much of their importance as an object of study from their centrality in the everyday practice of Internet users. Research on the usage patterns of individuals shows that most people spend most of their time at a very small number of sites, typically including their email provider and news provider and very often a search engine. The search engine is the primary mechanism by which the user reaches unfamiliar websites. Search engines are therefore central not only to users but to operators of other websites, since they are the primary method of attracting new visitors. It is estimated that 80–90 percent of people who are online have used a search engine, and they are used in perhaps 20 percent of online sessions, or discrete periods of online activity. Usage of search engines appears to be associated with the intensity and diversity of web usage, leading to the conclusion that the abilities to use search engines effectively for navigation and to rapidly comprehend and evaluate the results they display are key skills in information literacy.

Marketing And Advertising On Search Engines

Because of their central position in directing users to other websites, search engines are valuable sources of customers for online businesses. Search engines fund themselves primarily through advertising. Advertising does not typically appear in the main search engine listings; instead, the search engine operates a separate index of advertisements that are returned along with the main results when a user types in a query. These paid-for results are indicated as “sponsored links” or “recommendations” on the main search engine results page and are often set off in a separate area, for example in a column on the right-hand side of the page or in a box at the top. However, some smaller search engines do mix advertiser results with results from the main index, and some deliver only advertising-based results. Therefore it can sometimes be difficult for ordinary users to distinguish between advertising and non-advertising results. Search engines have also developed a sophisticated system of advertising syndication. In syndicated search advertising, another website displays the search engine’s ads for a share of the price charged to the advertisers. In 2005, search engines accounted for perhaps 70 percent of all online advertising, and some 30–40 percent of that was revenue coming from syndicated advertisements (PriceWaterhouseCoopers 2006). From a marketing perspective, search engine ads are particularly attractive compared to other advertising vehicles because they are charged on a cost-per-click (CPC) basis, meaning that the advertiser pays only when someone clicks the ad and visits their website, rather than on a cost-per-impression basis (conventionally priced per thousand impressions, hence CPM), where the advertiser pays whenever the ad is displayed. CPC advertising, therefore, is seen as a more efficient means of advertising.
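The difference between the two pricing models can be shown with simple arithmetic. The sketch below assumes CPM is quoted per thousand impressions, the usual convention; all prices, impression counts, and click rates are invented for illustration and are not taken from any real campaign.

```python
def cpc_cost(clicks, cents_per_click):
    """Cost-per-click: the advertiser pays only for actual visits."""
    return clicks * cents_per_click

def cpm_cost(impressions, cents_per_thousand):
    """Cost-per-impression: the advertiser pays for every display,
    priced per thousand impressions."""
    return impressions * cents_per_thousand / 1000

def cost_per_visit(total_cents, clicks):
    """Effective price of one visitor, in cents."""
    return total_cents / clicks if clicks else float("inf")

IMPRESSIONS = 100_000          # hypothetical number of times the ad is shown
for click_rate in (0.01, 0.001):
    clicks = round(IMPRESSIONS * click_rate)
    cpc = cpc_cost(clicks, cents_per_click=50)
    cpm = cpm_cost(IMPRESSIONS, cents_per_thousand=200)
    print(f"click rate {click_rate:.1%}: "
          f"CPC pays {cost_per_visit(cpc, clicks):.0f} cents per visit, "
          f"CPM pays {cost_per_visit(cpm, clicks):.0f} cents per visit")
```

Under CPC the price of a visit stays fixed however few people click, whereas under CPM the advertiser bears the risk of a poorly performing ad: the same spend buys fewer visits as the click rate falls.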

Another element of marketing on search engines is search-engine optimization (SEO). In SEO, the marketer tries to achieve a good placement in the main index (rather than purchasing results in the advertising index) by “optimizing” their web pages so that they match the criteria used by the search engine’s ranking algorithm. These criteria typically include both elements of the web page itself, including text, title, descriptive meta-data, and the number of hyperlinks on the page, and a consideration of the position of the web page in relation to the rest of the web, often determined by the number of hyperlinks from other websites to the page. In addition to legitimate SEO, the value of search engine traffic means that some marketers try to boost their traffic by artificially inflating some indicators; for example, by creating a series of sites whose only purpose is to link to each other (often called a link farm) (Perkins 2001). While not illegal, these techniques (known collectively as spam) are frowned on by the search engine operators, and even a suspicion of spamming may result in delisting from the search engine results.
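A minimal sketch shows why a link farm can inflate a link-based indicator. It assumes, purely for illustration, that popularity is measured by counting the distinct other sites that link to a site; the site names are hypothetical, and real ranking algorithms weigh links in far more elaborate ways.

```python
from collections import defaultdict

def inlink_counts(links):
    """Count the distinct other sites linking to each site.

    `links` is an iterable of (source_site, target_site) pairs;
    links from a site to itself are ignored.
    """
    sources = defaultdict(set)
    for src, dst in links:
        if src != dst:
            sources[dst].add(src)
    return {site: len(srcs) for site, srcs in sources.items()}

# Hypothetical organic links between independent sites.
organic = [("blog", "shop"), ("news", "shop"), ("news", "blog")]

# Hypothetical link farm: three sites that exist only to link to
# each other and to the site being promoted.
farm_sites = ["farm1", "farm2", "farm3"]
farm = [(a, b) for a in farm_sites for b in farm_sites if a != b]
farm += [(a, "promoted") for a in farm_sites]

print(inlink_counts(organic))
print(inlink_counts(organic + farm))
```

With the farm added, the promoted site has more inbound links than the organically linked shop, although no independent site has endorsed it. Detecting such tightly interlinked clusters is one reason search engine operators treat the practice as spam.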

Legal And Policy Debates

Search engine companies are also increasingly involved in a series of legal controversies (Gasser 2006). A major focus of concern is a debate over censorship and free speech. The large search engines are all based in the United States and fall primarily under United States law. Under the current provisions, search engines are entitled to the protections of freedom of the press, with the normal exceptions made for illegal content, such as child pornography. However, search engine owners have been criticized for censoring results in other countries, notably China, in accordance with government wishes. Restrictions also apply in a variety of other countries; for example, Germany restricts websites that promote Nazi views or deny the Holocaust. Whether this represents censorship or legitimate public control of illegal content is still being debated.

A second issue is user privacy. Search engines routinely collect data on their users, including their queries and usage patterns (and in some cases much more), although many users are unaware of this. Governments, notably those of the United States and China, have formally requested this data. In China, dissidents were jailed partly on the basis of evidence supplied by Yahoo. In the United States, Google resisted the government’s request but was forced to comply by the courts. Even data that has been made anonymous by stripping out users’ names can be controversial, as AOL discovered when it released a collection of search query data for academic study, only to find that newspapers had been able to locate and interview specific individuals. Search engine logs therefore potentially represent a significant risk to personal privacy.

A third legal controversy concerns intellectual property rights online. As search engines have expanded to include new forms of content, such as video, pictures, music, news reports, and even the full text of books held in libraries, concern has been growing on the part of copyright holders around the world about how to protect their works from illegal use, and search engines have been the target of a number of lawsuits alleging copyright infringement.

Finally, a small number of scholars have been concerned about the ethics of search engines (Introna & Nissenbaum 2000; Machill et al. 2003; Van Couvering 2004). These scholars argue that the industry structure, which consists of a few powerful companies – in other words, decision-makers constituting an oligarchy – combined with the issues of access to the web and concerns about censorship, privacy, and reports of the overrepresentation of commercial companies in search results (probably due in part to SEO), does not favor the public interest in easy, safe access to the whole of the web. A consensus on what remedies might be appropriate has not been reached. One potential option might be along the lines of the voluntary code of conduct adopted in Germany (FSM 2004), which commits signatories to clarifying how their search engines’ crawling and ranking algorithms operate, stating what actions may result in a website’s removal from the search engine listings, clearly designating advertising, acting to protect minors from harmful content, removing references to undesirable content as specified in German law, and protecting the data they gather on users.

Methodological Considerations

In academic work, search engines are sometimes used to locate or sample resources on the web. Research on the quality of search engine results suggests that scholars should be cautious about relying on these alone to provide representative samples or complete coverage of a particular online object of study (Cothey 2004). Search engine results are likely to be unstable over even very short periods of time, resulting in different samples at different times (Bar-Ilan & Peritz 1999).
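One simple way for a researcher to gauge this instability is to compare the result sets obtained for the same query at two different times. The sketch below uses the Jaccard coefficient (the size of the intersection divided by the size of the union), which is a choice made here for illustration rather than the measure used in the studies cited; the result lists are invented.

```python
def jaccard_overlap(results_a, results_b):
    """Share of URLs common to both samples, from 0.0 (disjoint)
    to 1.0 (identical sets). Ranking order is ignored."""
    a, b = set(results_a), set(results_b)
    if not a and not b:
        return 1.0  # two empty samples are trivially identical
    return len(a & b) / len(a | b)

# Hypothetical results for the same query, sampled a week apart.
week_1 = ["site-a", "site-b", "site-c", "site-d"]
week_2 = ["site-b", "site-c", "site-e", "site-f"]

print(f"overlap: {jaccard_overlap(week_1, week_2):.2f}")
```

A low overlap between samples taken only days apart would suggest that any single sample should not be treated as representative.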

Search engines seem to overrepresent websites based in the United States, websites that are popular (that is, that have many links to them), and websites that are older (Mowshowitz & Kawaguchi 2005). Different regulatory regimes and differing algorithms between countries may result in a different set of results for the same query in different geographic regions. Finally, some terms are extremely popular among marketers and this may skew the results toward commercial organizations undertaking an organized SEO program. In many cases it would be appropriate to supplement search engine results with other methods of sampling online content. In all cases it is appropriate to reflect on the bias that search engine sampling may introduce.

References:

Bar-Ilan, J., & Peritz, B. C. (1999). The lifespan of a specific topic on the web – the case of “informetrics”: A quantitative analysis. Scientometrics, 46(3), 371–382.
Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30, 107–117.
Cothey, V. (2004). Web-crawling reliability. Journal of the American Society for Information Science and Technology, 55(14), 1228–1238.
FSM (Freiwillige Selbstkontrolle Multimedia-Diensteanbieter) (2004). Subcode of conduct for search engine providers. At www.fsm.de/en/SubCoC_Search_Engines, accessed January 11, 2007.
Gasser, U. (2006). Regulating search engines: Taking stock and looking ahead. Yale Journal of Law and Technology, 9, 124–157.
Hargittai, E. (2002). Beyond logs and surveys: In-depth measures of people’s web use skills. Journal of the American Society for Information Science and Technology, 53(14), 1239–1244.
Introna, L. D., & Nissenbaum, H. (2000). Shaping the web: Why the politics of search engines matters. Information Society, 16(1), 169–185.
Machill, M., & Bieler, M. (eds.) (2007). The power of search engines/Die Macht der Suchmaschinen. Cologne: Herbert von Halem.
Machill, M., Neuberger, C., & Schindler, F. (2003). Transparency on the net: Functions and deficiencies of Internet search engines. Info: The Journal of Policy, Regulation and Strategy for Telecommunications, 5(1), 52–74.
Mowshowitz, A., & Kawaguchi, A. (2005). Measuring search engine bias. Information Processing and Management, 41(5), 1193–1205.
Perkins, A. (2001). The classification of search engine spam. At www.silverdisc.co.uk/articles/spamclassification, accessed February 17, 2007.
PriceWaterhouseCoopers (2006). IAB Internet advertising revenue report: 2005 full-year results. At www.iab.net/resources/adrevenue/pdf/IAB_PwC_2005.pdf, accessed February 17, 2007.
Singhal, A. (2001). Modern information retrieval: A brief overview. IEEE Data Engineering Bulletin, 24(4), 35–43.
Van Couvering, E. (2004). New media? A political economy of search engines. Paper presented at the International Association of Media and Communications Researchers, Porto Alegre, Brazil, July 24–30.