Lesson 3 | Information retrieval services |
Objective | Describe the different categories of information retrieval services. |
Information Retrieval Services
System-Computed Relevance and Ranking
An information retrieval system which will rank and order the records or their surrogates in a retrieved set needs a mechanism for calculating the closeness of a match between a user query and a document. The result of this calculation can be used to determine the order of presentation of members of the set to the searcher. That is to say, this calculation provides the system’s estimate of the relevance of the document, and the goal is that this estimate should be strongly correlated to the user’s judgment of the relevance of the document. The result of this calculation, the value given to the closeness of the match between the query and the document, has been called the retrieval status value, or rsv.
In a strict Boolean query system, one that specifies attribute values that must be present if a record is to be selected, each term present in the query or document can only have a weight of 0 or 1, and the resulting rsv[1] of a document can only have a value of 1 (accept) or 0 (reject). The result is the traditional unranked, but assumed relevant, subset of the database. If weighted terms are used, a document’s rsv, computed from their values, can range anywhere from 0 to 1 and is therefore potentially much more useful.
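The contrast between a binary and a weighted rsv can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the Boolean query is treated as a simple AND of its terms, and the weighted score is a weight-normalized overlap, which is one of many possible ways to keep the value between 0 and 1.

```python
def boolean_rsv(query_terms, doc_terms):
    """Strict Boolean (AND) match: 1 if every query term is present, else 0."""
    return 1 if set(query_terms) <= set(doc_terms) else 0


def weighted_rsv(query_weights, doc_weights):
    """Weighted match between 0 and 1 (assumed formula, not a standard one).

    Both arguments map a term to a weight in the range 0 to 1.
    """
    total = sum(query_weights.values())
    if total == 0:
        return 0.0
    matched = sum(
        q_weight * doc_weights.get(term, 0.0)
        for term, q_weight in query_weights.items()
    )
    return matched / total


# Hypothetical query and document, each given as term -> weight.
query = {"retrieval": 1.0, "ranking": 0.5}
doc = {"retrieval": 0.8, "boolean": 0.6}

print(boolean_rsv(query, doc))   # rejected: the term "ranking" is absent
print(weighted_rsv(query, doc))  # a partial match, strictly between 0 and 1
```

With the Boolean function the document is simply rejected, while the weighted function still gives it partial credit for the term it does contain.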
Ranking
Since the purpose of the retrieval status value (rsv) is to provide a mechanism for evaluating the match between a document and a query, it allows the system to rank documents in descending order of their rsv. This means that the system can go down the ranked list and present the user with either a complete, ordered list of all documents that have a positive rsv or the top-ranking n documents of the list, where n can be set by the user. These would be the documents the system judges most likely to be deemed relevant by the user. This is what is called mathematically a weak ordering[2], meaning that ties are allowed. If the rsv is binary, there is no choice but to present all documents that meet the formal requirements of the query, an option often frustrating to users. Increasingly, IR systems are providing relevance ranking options, and on the Web, where precise queries may not be possible and document attributes are not explicit, all search engines use such rankings.
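The ranking step described above can be sketched as follows. The function name and the sample scores are hypothetical; the sketch assumes the rsv of every document has already been computed and simply orders the documents, keeping tied documents next to each other.

```python
def rank_documents(rsv_by_doc, n=None):
    """Return (doc_id, rsv) pairs with a positive rsv, highest rsv first.

    If n is given, only the top-ranking n documents are returned.
    """
    positive = [(doc_id, rsv) for doc_id, rsv in rsv_by_doc.items() if rsv > 0]
    # sorted() is stable, so tied documents keep their original relative order.
    ranked = sorted(positive, key=lambda pair: pair[1], reverse=True)
    return ranked if n is None else ranked[:n]


# Hypothetical scores; d2 and d4 are tied, and d3 has an rsv of 0.
scores = {"d1": 0.2, "d2": 0.9, "d3": 0.0, "d4": 0.9, "d5": 0.5}

print(rank_documents(scores))       # every document with a positive rsv
print(rank_documents(scores, n=2))  # only the two top-ranking documents
```

Because ties are allowed, the position of d2 relative to d4 carries no meaning; the sketch just keeps them in input order.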
- Challenge with Ranking
A difficulty with ranking is that users are not usually told the basis for the system's calculation.
Where users have been polled for their reactions, they seem to like ranking. Would it make any significant difference if they were told the basis or given an opportunity to contribute to the method, perhaps to emphasize words occurring in the text, the name of the author, or the source? There is no research on this question to date, although systems exist that give the user the opportunity to supply terms to be used for ranking separate from those used in the search, e.g., the AltaVista "Sort By" box. Asking users to make such choices calls for more involvement on their part and more knowledge of the system, something not all users want to invest in. However, it could lead to better retrieval outcomes.
Search Engine Marketing
Automated Information Retrieval (IR)
Automated information retrieval (IR) systems were originally developed to help manage the huge scientific literature that has developed since the 1940s. Many university and public libraries now use IR systems to provide access to books, journals, and other documents. Commercial IR systems offer databases containing millions of documents in myriad subject areas. Dictionary and encyclopedia databases are now widely available for PCs. Information retrieval has been found useful in such disparate areas as office automation and software engineering. Indeed, any discipline that relies on documents to do its work could potentially use and benefit from IR.
An IR system matches user queries to documents stored in a database. A document is a data object, usually textual, though it may also contain other types of data such as photographs and graphs. Often, the documents themselves are not stored directly in the IR system, but are represented in the system by document surrogates. This web page is a document and could be stored in its entirety in an IR database. One might instead, however, choose to create a document surrogate for it consisting of the title, author, and abstract. This is typically done for efficiency, to reduce the size of the database and the searching time.
An IR system must support certain basic operations. There must be a way to enter documents into a database, change the documents, and delete them. There must also be some way to search for documents and present them to a user. IR systems vary greatly in the ways they accomplish these tasks.
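The basic operations just listed (enter, change, delete, search) can be illustrated with a toy in-memory system that stores document surrogates rather than full documents. Everything here is a hypothetical sketch: the class and method names are invented for illustration, and the search is a plain substring match, not how any particular IR product works.

```python
from dataclasses import dataclass


@dataclass
class Surrogate:
    """A document surrogate: title, author, and abstract instead of full text."""
    title: str
    author: str
    abstract: str


class TinyIRSystem:
    """Toy IR system holding surrogates in a dictionary keyed by document id."""

    def __init__(self):
        self._docs = {}

    def add(self, doc_id, surrogate):
        self._docs[doc_id] = surrogate

    def change(self, doc_id, **fields):
        # Update only the named fields of an existing surrogate.
        for name, value in fields.items():
            setattr(self._docs[doc_id], name, value)

    def delete(self, doc_id):
        del self._docs[doc_id]

    def search(self, term):
        """Return ids of surrogates whose title or abstract contains the term."""
        term = term.lower()
        return [
            doc_id
            for doc_id, s in self._docs.items()
            if term in s.title.lower() or term in s.abstract.lower()
        ]


ir = TinyIRSystem()
ir.add("doc1", Surrogate("Ranking basics", "A. Author", "How systems order results."))
ir.add("doc2", Surrogate("Web directories", "B. Author", "Hand-selected sites."))
ir.change("doc2", abstract="Hand-selected sites and ranking notes.")
print(ir.search("ranking"))  # both surrogates now mention ranking
ir.delete("doc1")
print(ir.search("ranking"))  # only doc2 remains
```

Storing only the title, author, and abstract keeps the database small, at the cost of not being able to match words that appear only in the full text.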
- Tour Search Engines
- Tour Metasearch Engines
- Tour Subject pages and Link pages
Meta Search Engines
SEM Strategy
Using PPC in parallel with SEO can be helpful. The benefits are multifold, especially if the site in question is brand new.
PPC can provide accurate forecasts for targeted keywords. For example, within the Google AdWords platform you can target the same keywords in your ads that you are currently targeting on specific pages. You can then accurately forecast how your pages will convert for the same keywords once you start getting the equivalent SEO traffic.
You will have the opportunity to perform searches with metasearch engines in the next module. For now, take a quick look at the main page (or "home page") of startpage listed above. Note which, if any, allow you to choose the group of search engines and/or directories that will be searched before you begin the search. Also note other options that you can set to control your search. Clicking on any of these links will open the Web site in a separate browser window, so you can switch between this lesson and the Web site.
- Meta Search: Meta search engines are search engines that aggregate results from multiple search engines and present them to the user. The best-known meta search engine is Dogpile.com. However, its search volume is quite small, and meta search engines do not factor into SEO strategies.
- More specialized Vertical Search Engines: Vertical search can also come from third parties. Here are some examples:
- Comparison shopping engines, such as PriceGrabber, Shopzilla, and NexTag
- Travel search engines, such as Expedia, Travelocity, Kayak, and Utake
- Real estate search engines, such as Trulia and Zillow
- People search engines, such as Spock and Wink
- Job search engines, such as Indeed, CareerBuilder, and SimplyHired
- Music search engines, such as iTunes Music Store
- B2B search engines, such as Business.com, KnowledgeStorm, Kellysearch, and ThomasNet
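As a rough illustration of the meta search idea described at the top of this list, the sketch below queries several engines, merges their results, and removes duplicates. The engines are hypothetical stand-in functions returning fixed example.com URLs, and the merge rule (keep each URL at its best position across engines) is an assumption; real meta search engines may combine results quite differently.

```python
def meta_search(query, engines, limit=10):
    """Query every engine, merge the result lists, and drop duplicate URLs."""
    best_position = {}
    for engine in engines:
        for position, url in enumerate(engine(query)):
            # Remember the best (lowest) position any engine gave this URL.
            if url not in best_position or position < best_position[url]:
                best_position[url] = position
    merged = sorted(best_position, key=lambda url: best_position[url])
    return merged[:limit]


# Hypothetical engines that return canned results for any query.
def engine_a(query):
    return ["https://example.com/a", "https://example.com/b"]


def engine_b(query):
    return ["https://example.com/b", "https://example.com/c"]


print(meta_search("information retrieval", [engine_a, engine_b]))
```

The shared URL appears only once in the merged list, which is the main benefit a user sees from aggregation.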
Importance of Subject Pages and Link Pages in SEO
This course focuses on Web directories and search engines, but subject pages and Link pages can help you find information and should be considered as part of your searching resources.
- Subject Pages
A subject page is a topical collection of information, references, and links to other Web sites. Subject pages are also known as
- collection pages,
- compendium pages, and
- index pages.
Typically, these pages are maintained by individuals with a great passion and commitment for their specific subject matter; some are maintained by organizations. Because they focus on a single topic, they generally do not have the detail or hierarchical organization that you would find in a directory. However, subject pages put a significant amount of information in a single spot. For example, if you are looking for information on JavaScript, but you are uncertain of what keywords to use in your query, a subject page like JavaScript is a great place to start.
This is an example of a Web site that might turn up in a search; you can bookmark it (or add it to your Favorites) and go back to it over and over again for information. Take a moment to view this site and note that it can provide information on a variety of related topics.
- Link Pages
The best example I know of what I call a Link page is RefDesk.
This site consists of nothing but links to a great number and variety of other sites, grouped into categories.
There are dictionaries, encyclopedias, conversion calculators, online editions of newspapers and magazines, and hundreds of different Search Engines (over 200 at last count). Take a moment to scroll down the main page of this Web site. If you do not want to bookmark the site itself and scroll around to the link you want every time, you can point to the particular links you are most interested in and bookmark them (or add them to Favorites) individually.
Pages of links. Having links leading to and away from your site is an essential way to ensure that crawlers find you. However, having pages of links seems suspicious to a search crawler, and it may classify your site as a spam site. Instead of having pages that are all links, break links up with descriptions and text. If that is not possible, block the link pages from being indexed by crawlers.
Registering with Directories
Register your website with the major directories and second-tier general directories.
Try to register with about 6-12 of the better general directories if you are targeting Google. If you are targeting the other engines first and can wait on Google, you may want to register with about twenty to fifty general directories.
Register with at least a couple of local or niche-specific directories. Niche-specific directories can be found via search engines, and some are listed at http://www.isedb.com. Check that they provide static links before spending money registering your sites, although directories that rank well may deliver quality traffic even if they do not provide direct links. Search for things like
<my keywords> + <add URL>
to find other niche directories.
Oftentimes I do not mind spending hundreds of dollars getting links from different sites (or directories) across many different IP ranges. Many of the second-tier directories charge a one-time fee for listing, and some of them allow you to add your websites free if you become an editor. In my directory of directories, I have 50-100 general directories listed in the general directory categories. Most top-ranking sites in mildly competitive fields do not have text links from fifty different sites pointing to them, so if you can afford it, doing this offers a huge advantage for your Yahoo! and MSN rankings, but you need to choose directories carefully when considering how TrustRank may affect Google. If you are in more competitive fields and rent some powerful links, these listings in various directories can help stabilize your rankings when search engine algorithms shift. Some directories I highly recommend are Yahoo!, DMOZ, Business.com, JoeAnt, Best of the Web, and Gimpsy.
The terms search engine and engine refer to any service that allows you to compose your own search query. Any service that provides a compiled directory or allows you to perform searches is called a search service or information retrieval service. As you are now aware, very often the question you will ask before beginning a search is not "Where do I find a search site?" but, rather, "Which one of all these services do I start with?" This is not a trivial question. A directory, with its smaller number of hand-selected sites, may be more immediately useful than a search engine if you are searching for beginning-level information on a popular topic. A search engine, with its continual automated Web-roaming, may be more useful if you are looking for very specific information or an obscure topic.
Information Retrieval Models
Modeling in Information Retrieval is a complex process aimed at producing a ranking function. A ranking function is a function that assigns scores to documents with regard to a given query. This process consists of two main tasks:
- The conception of a logical framework for representing documents and queries
- The definition of a ranking function that allows quantifying the similarities among documents and queries
- Modeling and Ranking: Information Retrieval systems usually adopt index terms to index and retrieve documents
- In a restricted sense, an index term is a keyword that has some meaning on its own; it usually plays the role of a noun
- In a more general form, an index term is any word that appears in a document
- Retrieval based on index terms can be implemented efficiently
- Also, index terms are simple to refer to in a query
- Simplicity is important because it reduces the effort of query formulation
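The points above can be made concrete with a small sketch of retrieval based on index terms. It assumes the more general definition (every word in a document is an index term), uses an inverted index as the document representation, and uses a deliberately simple ranking function that counts matching query terms. The function names and sample documents are hypothetical.

```python
from collections import defaultdict


def build_index(documents):
    """Map each index term (any lowercase word) to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def score(query, index):
    """Ranking function: count how many distinct query terms each document contains."""
    scores = defaultdict(int)
    for term in set(query.lower().split()):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    return dict(scores)


# Hypothetical documents, reduced to short strings of words.
docs = {
    "d1": "boolean retrieval model",
    "d2": "vector space retrieval model",
    "d3": "search engine marketing",
}
index = build_index(docs)
print(score("vector retrieval", index))  # d2 matches both terms, d1 matches one
```

The lookup touches only the index entries for the query terms, which is why retrieval based on index terms can be implemented efficiently.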
Retrieval Service Categories
[1](rsv) Retrieval Status Value: The retrieved documents are ranked according to their retrieval status values if these are monotonically increasing with the probability of relevance of documents.
[2]weak ordering: A weak ordering is a mathematical formalization of the intuitive notion of a ranking of a set, some of whose members may be tied with each other. Weak orders are a generalization of totally ordered sets (rankings without ties) and are in turn generalized by partially ordered sets (posets) and preorders.
[3]Search engine verticals: Examples of different verticals include local search, image search, video search, product search and realtime search for news.
