Lesson 3	Information retrieval services
Objective	Describe the different categories of information retrieval services.

Information Retrieval Services

The information available online today is incredibly diverse, varying not only in subject matter, but also in format. Accordingly, the search services available reflect this diversity. Some are automated, and some rely on you to work your way through categories. The first step, then, is to learn what kinds of information retrieval services are available to you. Let us take a quick tour of representative sites in the main categories of information retrieval services before discussing how to search them:

Directories or Web Catalogs: Directories, also known as Web catalogs, are compiled and maintained by human editors and researchers. Another characteristic of a directory is a list of categories in hierarchical order (most broad to most specific). You can "search" by clicking on a category and making selections from categories that are increasingly specific. Directories carry weight and can hurt your rankings. This section will mainly focus on general directories, but the end of this section does include a few tips on how to find niche directories (or even other general directories) and how to determine if they are worth getting a listing on. Some examples of directories are:
Yahoo Directory: The Yahoo Directory was started in 1994 under the name "Jerry and David's Guide to the World Wide Web" but in 1996 became Yahoo. At the time, Yahoo was primarily a directory with search functionality and (interestingly) neither SEO nor Internet Marketing were even categories. Through the late 1990s Yahoo pushed to become a web portal and in 2000 even signed a deal with Google that would see Google power Yahoo's search functionality. Their focus at the time was to acquire users through acquisitions such as GeoCities, bringing more people into their portal and keeping them there. Unfortunately, Yahoo did not have the same user loyalty that Apple does, and the walled-garden approach failed as users Googled their way out of the Yahoo network of sites.

Directories are discussed in more detail in the next module. For now, take a quick look at the main page (or home page) of each directory and note the number and type of categories that each offers.
Questions:

Do any seem more inviting to use?
Is one directory organized in a manner that makes more sense to you?
Or do they seem visually identical to each other?
Clicking on any of these links will open the Web site in a separate browser window, so you can switch between the lesson and the website.

System-Computed Relevance and Ranking

An information retrieval system which will rank and order the records or their surrogates in a retrieved set needs a mechanism for calculating the closeness of a match between a user query and a document. The result of this calculation can be used to determine the order of presentation of members of the set to the searcher. That is to say, this calculation provides the system’s estimate of the relevance of the document, and the goal is that this estimate should be strongly correlated to the user’s judgment of the relevance of the document. The result of this calculation, the value given to the closeness of the match between the query and the document, has been called the retrieval status value, or rsv.
In a strict Boolean query system, one that specifies attribute values that must be present if a record is to be selected, each term present in the query or document could only have a weight of 0 or 1 and the resulting rsv^[1] of a document could only have a value of 1 (accept) or 0 (reject) resulting in the traditional unranked, but assumed relevant, subset of the database. If weighted terms are used, a document’s rsv, computed from their values, can range anywhere from 0 to 1 and is therefore potentially much more useful.
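
The contrast between a Boolean rsv and a weighted rsv can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular system: the function names, the example weights, and the choice of a normalized weighted sum as the scoring formula are all assumptions made for this sketch.

```python
def boolean_rsv(query_terms, doc_terms):
    """Strict Boolean AND match: 1 if every query term is in the document, else 0."""
    return 1 if set(query_terms) <= set(doc_terms) else 0


def weighted_rsv(query_weights, doc_weights):
    """Weighted match: normalized sum of query weight times document weight.

    Both arguments map terms to weights in [0, 1], so the result is also in [0, 1].
    The formula is an assumption chosen for illustration.
    """
    total = sum(query_weights.values())
    if total == 0:
        return 0.0
    matched = sum(w * doc_weights.get(term, 0.0) for term, w in query_weights.items())
    return matched / total


# Hypothetical term weights for one document and one query.
doc = {"retrieval": 0.9, "ranking": 0.5, "boolean": 0.2}
query = {"retrieval": 1.0, "ranking": 0.5, "index": 0.5}

print(boolean_rsv(query, doc))   # the query term "index" is missing, so the document is rejected
print(weighted_rsv(query, doc))  # partial credit for the terms that do match
```

The Boolean version rejects the document outright because one query term is absent, while the weighted version still assigns it an intermediate value that can be compared with the values of other documents.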

Ranking

Since the purpose of the retrieval status value (rsv) is to provide a mechanism for evaluating the match between a document and a query, it allows the system to rank documents in descending order of rsv. This means that the system can go down the ranked list and present the user with either a complete, ordered list of all documents that have a positive rsv or the top-ranking n documents of the list, where n can be set by the user. These would be the documents the system judges most likely to be deemed relevant by the user. This is what is called mathematically a weak ordering^[2], meaning that ties are allowed. If the rsv is binary, there is no choice but to present all documents that meet the formal requirements of the query, an option often frustrating to users. Increasingly, IR systems are providing relevance ranking options, and on the Web, where precise queries may not be possible and document attributes are not explicit, all search engines utilize such rankings.
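
As a rough sketch of ranking by rsv with ties allowed, the following Python groups documents that share the same value and optionally stops after the top n. The function name `rank`, the document ids, and the scores are hypothetical; the tie-handling policy (a whole tie group is shown together) is one possible choice, not the only one.

```python
from itertools import groupby


def rank(rsv_by_doc, n=None):
    """Return documents with positive rsv, best first, as a list of tie groups.

    Each group is (rsv, [doc_ids]); documents in the same group are tied,
    which is what makes the result a weak ordering rather than a strict one.
    If n is given, stop once at least n documents have been listed.
    """
    positive = [(doc, rsv) for doc, rsv in rsv_by_doc.items() if rsv > 0]
    # Sort by descending rsv; break ties by id only to make the output stable.
    positive.sort(key=lambda pair: (-pair[1], pair[0]))
    groups = []
    shown = 0
    for rsv, members in groupby(positive, key=lambda pair: pair[1]):
        if n is not None and shown >= n:
            break
        ids = [doc for doc, _ in members]
        groups.append((rsv, ids))
        shown += len(ids)
    return groups


# Hypothetical retrieval status values.
scores = {"d1": 0.8, "d2": 0.3, "d3": 0.8, "d4": 0.0, "d5": 0.5}
print(rank(scores))       # all documents with a positive rsv
print(rank(scores, n=2))  # only the top-ranking documents
```

Here d1 and d3 are tied at the top, and d4 is never shown because its rsv is not positive.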

Challenge with Ranking
A difficulty with ranking is that users are not usually told what the system's basis for the calculation is. Where users have been polled for their reactions, they seem to like it. Would it make any significant difference if they were told the basis or given an opportunity to contribute to the method, perhaps to emphasize words occurring in the text, the name of the author, or the source? There is no research on this question to date, although systems exist that give the user the opportunity to supply terms to be used for ranking separate from those used in the search, e.g., the AltaVista "Sort By" box. Asking users to make such choices calls for more involvement on their part, necessitating more knowledge of the system, something not all users want to invest in. But it could lead to better retrieval outcomes.

Search Engine Marketing

Automated Information Retrieval (IR)

Automated information retrieval (IR) systems were originally developed to help manage the huge scientific literature that has developed since the 1940s. Many university and public libraries now use IR systems to provide access to books, journals, and other documents. Commercial IR systems offer databases containing millions of documents in myriad subject areas. Dictionary and encyclopedia databases are now widely available for PCs. Information retrieval has been found useful in such disparate areas as office automation and software engineering. Indeed, any discipline that relies on documents to do its work could potentially use and benefit from IR.
An IR system matches user queries to documents stored in a database. A document is a data object, usually textual, though it may also contain other types of data such as photographs and graphs. Often, the documents themselves are not stored directly in the IR system, but are represented in the system by document surrogates. This web page is a document and could be stored in its entirety in an IR database. One might instead, however, choose to create a document surrogate for it consisting of the title, author, and abstract. This is typically done for efficiency, to reduce the size of the database and the searching time.
An IR system must support certain basic operations. There must be a way to enter documents into a database, change the documents, and delete them. There must also be some way to search for documents and present them to a user. IR systems vary greatly in the ways they accomplish these tasks.
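
To make the idea of a document surrogate and the basic operations concrete, here is a minimal in-memory sketch in Python. The class names `Surrogate` and `SurrogateStore`, the identifiers, and the substring-based search are all assumptions made for illustration; real IR systems accomplish these tasks in very different ways.

```python
from dataclasses import dataclass


@dataclass
class Surrogate:
    """Stand-in for a full document: title, author, and abstract only."""
    title: str
    author: str
    abstract: str


class SurrogateStore:
    """Toy in-memory database keyed by a document identifier."""

    def __init__(self):
        self._records = {}

    def add(self, doc_id, surrogate):
        """Enter a document surrogate into the database."""
        self._records[doc_id] = surrogate

    def change(self, doc_id, **fields):
        """Update one or more fields of an existing surrogate."""
        record = self._records[doc_id]
        for name, value in fields.items():
            setattr(record, name, value)

    def delete(self, doc_id):
        """Remove a surrogate from the database."""
        del self._records[doc_id]

    def search(self, term):
        """Return ids of surrogates whose title or abstract contains the term."""
        term = term.lower()
        return sorted(
            doc_id
            for doc_id, rec in self._records.items()
            if term in rec.title.lower() or term in rec.abstract.lower()
        )


# Hypothetical records.
store = SurrogateStore()
store.add("p1", Surrogate("Information retrieval services", "A. Author", "Directories and search engines."))
store.add("p2", Surrogate("Cooking basics", "B. Author", "How to boil an egg."))
store.change("p2", abstract="How to search a pantry.")
print(store.search("search"))
store.delete("p2")
print(store.search("search"))
```

Because only the title, author, and abstract are stored, the database stays small, at the cost of not being able to match words that appear only in the full text.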

Tour Search Engines
Tour Metasearch Engines
Tour Subject pages and Link pages

Meta Search Engines

A metasearch engine provides one solution to the problem of having to query many search services one at a time: a unified interface to multiple search services. Some provide a single search form that, once you have composed your query, will submit it to several different search engines. Others simply provide a list of different search engines, each with its own text field for starting a search on that specific engine.

Meta Search Engines: If you search for an identical term on various spider-based search engines, chances are you will get different search engine results. The basic premise of meta search engines is to aggregate these search results from many different crawler-based search engines, thereby improving the quality of the search results. The other benefit is that web users need to visit only one meta search engine instead of multiple spider-based search engines. Meta search engines will save you time in getting to the search engine results you need. As shown in Figure 2-3, a meta-search engine compiles its results from several sources, including Google, Bing, and Ask.com. One thing to note about meta search engines is that aside from caching frequently used queries for performance purposes, they usually do not hold an index database of their own.

Figure 2-3. Component parts of a meta search engine
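
The aggregation step described above can be sketched as follows. This is a simplified illustration that assumes each underlying engine has already returned a ranked list of URLs; the engine names and URLs are placeholders, and the reciprocal-rank scoring used to merge the lists is a technique chosen for this sketch, not necessarily what any real meta search engine does.

```python
from collections import defaultdict


def merge_results(results_by_engine, top_k=5):
    """Merge ranked URL lists from several engines into one ranked list.

    A URL scores 1/rank for each engine that returned it (rank starts at 1),
    so pages ranked highly by several engines rise to the top.
    """
    scores = defaultdict(float)
    for ranked_urls in results_by_engine.values():
        for position, url in enumerate(ranked_urls, start=1):
            scores[url] += 1.0 / position
    # Highest combined score first; ties are broken alphabetically by URL.
    ordered = sorted(scores, key=lambda url: (-scores[url], url))
    return ordered[:top_k]


# Hypothetical result lists; a real meta search engine would fetch these live.
results = {
    "engine_a": ["example.com/a", "example.com/b", "example.com/c"],
    "engine_b": ["example.com/b", "example.com/d"],
    "engine_c": ["example.com/b", "example.com/a"],
}
print(merge_results(results))
```

Note that the sketch keeps no index of its own: it only combines what the underlying engines return, which matches the point that meta search engines usually do not hold an index database.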

Some examples of metasearch engines are:

Search Engine Targeting Strategy: A search engine targeting strategy can mean several things. First, what search engines will you be targeting? This includes targeting regional as well as major search engines. There are search engines besides Google, Yahoo!, and Bing. If you are concerned about your presence overseas, there are many other search engines you need to worry about. Big search engines also operate on several different search engine verticals^[3]. Do not confuse search engine verticals with vertical search engines (which specialize in specific areas or data). The reference is to the Blended Search results shown on Google, Yahoo!, and Bing. These are additional avenues that you may want to explore.

SEM Strategy

Using PPC in parallel with SEO can be helpful. The benefits are multifold, especially if the site in question is brand new. PPC can provide accurate forecasts for targeted keywords. For example, within the Google AdWords platform you can target the same keywords in your ads that you are currently targeting on specific pages. You can then accurately forecast how your pages will convert for the same keywords once you start getting the equivalent SEO traffic.
You will have the opportunity to perform searches with metasearch engines in the next module. For now, take a quick look at the main page (or "home page") of startpage listed above. Note which, if any, allow you to choose the group of search engines and/or directories that will be searched before you begin the search. Also note other options that you can set to control your search. Clicking on any of these links will open the Web site in a separate browser window, so you can switch between this lesson and the Web site.

Meta Search: Meta search engines are search engines that aggregate results from multiple search engines and present them to the user. The best-known meta search engine is Dogpile.com. However, its search volume is quite small, and meta search engines do not factor into SEO strategies.
More specialized Vertical Search Engines: Vertical search can also come from third parties. Here are some examples:
1. Comparison shopping engines, such as PriceGrabber, Shopzilla, and NexTag
2. Travel search engines, such as Expedia, Travelocity, Kayak, and Utake
3. Real estate search engines, such as Trulia and Zillow
4. People search engines, such as Spock and Wink
5. Job search engines, such as Indeed, CareerBuilder, and SimplyHired
6. Music search engines, such as iTunes Music Store
7. B2B search engines, such as Business.com, KnowledgeStorm, Kellysearch, and ThomasNet
Importance of Subject Pages and Link Pages in SEO
This course focuses on Web directories and search engines, but subject pages and Link pages can help you find information and should be considered as part of your searching resources.
- Subject Pages
  A subject page is a topical collection of information, references, and links to other Web sites. Subject pages are also known as
  1. collection pages,
  2. compendium pages, and
  3. index pages.
  Typically, these pages are maintained by individuals with a great passion and commitment for their specific subject matter; some are maintained by organizations. Because they focus on a single topic, they generally do not have the detail or hierarchical organization that you would find in a directory. However, subject pages put a significant amount of information in a single spot. For example, if you are looking for information on JavaScript, but you are uncertain of what keywords to use in your query, a subject page like JavaScript is a great place to start.
  This is an example of a Web site that might turn up in a search; you can bookmark it (or add it to your Favorites) and go back to it over and over again for information. Take a moment to view this site and note that it can provide information on a variety of related topics.
- Link Pages
  The best example I know of what I call a Link page is RefDesk.
  This site consists of nothing but links to a great number and variety of other sites, grouped into categories. There are dictionaries, encyclopedias, conversion calculators, online editions of newspapers and magazines, and hundreds of different search engines (over 200 at last count). Take a moment to scroll down the main page of this Web site. If you do not want to bookmark the site itself and scroll around to the link you want every time, you can point to the particular links you are most interested in and bookmark them (or add them to Favorites) individually.
  Pages of links: Having links leading to and away from your site is an essential way to ensure that crawlers find you. However, having pages of links seems suspicious to a search crawler, and it may classify your site as a spam site. Instead of having pages that are all links, break links up with descriptions and text. If that is not possible, block the link pages from being indexed by crawlers.
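
One common way to keep crawlers away from link pages is a robots.txt rule. The sketch below uses Python's standard `urllib.robotparser` module to check what such a rule would permit. The `/links/` path, the `example.com` URLs, and the rule itself are hypothetical. Note also that robots.txt asks compliant crawlers not to fetch a page; whether that alone keeps the page out of an index depends on the search engine, so treat this as one possible approach.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that asks all crawlers to stay out of /links/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /links/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# example.com and the paths below are placeholders.
print(parser.can_fetch("*", "https://example.com/links/resources.html"))
print(parser.can_fetch("*", "https://example.com/articles/intro.html"))
```

The first check reports that a compliant crawler should not fetch the link page, while the second shows that ordinary content pages remain open to crawling.
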
Registering with Directories

Register your website with the major directories and second-tier general directories. Try to register with about 6-12 of the better general directories if you are targeting Google. If you are targeting the other engines first and can wait on Google, you may want to register with about twenty to fifty general directories.
Register with at least a couple of local or niche-specific directories. Niche-specific directories can be found via search engines, and some are listed at http://www.isedb.com. Check that they provide static links before spending money to register your sites, although directories that rank well may deliver quality traffic even if they do not provide direct links. Search for things like

<my keywords> + <add URL>
to find other niche directories.
Oftentimes I do not mind spending hundreds of dollars getting links from different sites (or directories) across many different IP ranges. Many of the second-tier directories charge a one-time fee for listing, and some of them allow you to add your websites free if you become an editor. In my directory of directories, I have 50-100 general directories listed in the general directory categories. Most top-ranking sites in mildly competitive fields do not have text links from fifty different sites pointing to them, so if you can afford it, doing this offers a huge advantage to you for your Yahoo! and MSN rankings, but you need to choose directories carefully when considering how TrustRank may affect Google. If you are in more competitive fields and rent some powerful links, these listings in various directories can help stabilize your rankings when search engine algorithms shift. Some directories I highly recommend are Yahoo!, DMOZ, Business.com, JoeAnt, Best of the Web, and Gimpsy.
The terms search engine and engine refer to any service that allows you to compose your own search query. Any service that provides a compiled directory or allows you to perform searches is called a search service or information retrieval service. As you are now aware, very often the question you will ask before beginning a search is not "Where do I find a search site?" but, rather, "Which one of all these services do I start with?" This is not a trivial question. A directory, with its smaller number of hand-selected sites, may be more immediately useful than a search engine if you are searching for beginning-level information on a popular topic. A search engine, by its continual automated Web-roaming, may be more useful if you are looking for very specific information or an obscure topic.

Information Retrieval Models

Modeling in Information Retrieval is a complex process aimed at producing a ranking function. A ranking function is a function that assigns scores to documents with regard to a given query. This process consists of two main tasks:

The conception of a logical framework for representing documents and queries
The definition of a ranking function that allows quantifying the similarities among documents and queries
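
The two tasks above can be illustrated with a small Python sketch. It assumes one particular logical framework (each text is represented as a bag of word counts) and one particular ranking function (cosine similarity between those counts). Both are choices made for illustration; the names `represent` and `cosine_score` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def represent(text):
    """Logical framework: a text becomes a bag of lowercase word counts."""
    return Counter(text.lower().split())


def cosine_score(query_vec, doc_vec):
    """Ranking function: cosine of the angle between the two count vectors."""
    dot = sum(count * doc_vec[term] for term, count in query_vec.items())
    norm_q = math.sqrt(sum(c * c for c in query_vec.values()))
    norm_d = math.sqrt(sum(c * c for c in doc_vec.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)


# Hypothetical documents and query.
docs = {
    "d1": "ranking documents by relevance",
    "d2": "boolean retrieval of documents",
    "d3": "cooking with garlic",
}
query = represent("ranking documents")
ranked = sorted(docs, key=lambda d: cosine_score(query, represent(docs[d])), reverse=True)
print(ranked)
```

The same representation is applied to documents and queries, which is what allows a single function to quantify the similarity between them.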

Modeling and Ranking: Information retrieval systems usually adopt index terms to index and retrieve documents. An index term can be understood in two ways:
1. In a restricted sense: it is a keyword that has some meaning on its own and usually plays the role of a noun
2. In a more general form: it is any word that appears in a document
Index terms are widely used for the following reasons:
1. Retrieval based on index terms can be implemented efficiently
2. Also, index terms are simple to refer to in a query
3. Simplicity is important because it reduces the effort of query formulation
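
One reason retrieval based on index terms can be implemented efficiently is that the terms can be stored in an inverted index, a mapping from each term to the documents that contain it. The following Python is a minimal sketch that treats every word as an index term (the more general sense); the function names and sample documents are hypothetical.

```python
from collections import defaultdict


def build_index(docs):
    """Map each index term (here, any lowercase word) to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def lookup(index, query):
    """Return ids of documents that contain every term in the query."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    if not postings:
        return set()
    return set.intersection(*postings)


# Hypothetical documents.
docs = {
    "d1": "web directories list sites by category",
    "d2": "search engines crawl web sites",
    "d3": "metasearch engines query other engines",
}
index = build_index(docs)
print(sorted(lookup(index, "web sites")))
print(sorted(lookup(index, "engines")))
```

A query is answered by looking up a few terms and intersecting their document sets, so the cost depends on the query terms rather than on scanning every document.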

Retrieval Service Categories

Click the link below to review the categories of information retrieval services just viewed.
Retrieval Service Categories

[1] Retrieval status value (rsv): The retrieved documents are ranked according to their retrieval status values if these are monotonically increasing with the probability of relevance of documents.

[2] Weak ordering: A weak ordering is a mathematical formalization of the intuitive notion of a ranking of a set, some of whose members may be tied with each other. Weak orders are a generalization of totally ordered sets (rankings without ties) and are in turn generalized by partially ordered sets (posets) and preorders.

[3] Search engine verticals: Examples of different verticals include local search, image search, video search, product search, and realtime search for news.