Lesson 1
How Search Engines and Directories work
Search engine directories operate on a fundamentally different premise compared to search engines that utilize web crawlers. The theory underpinning search engine directories is predicated on human oversight and hierarchical categorization rather than algorithmic web crawling and indexing. Below is an exploration of this theory:
Human-Curated Categorization:
At the core of search engine directories is the principle of human-mediated categorization. Websites are not automatically crawled but are instead submitted by site owners or identified by human editors. These editors review submissions and categorize them based on content and subject matter. This process aims to ensure a high level of quality and relevance because human judgment is used to screen for authority, accuracy, and value.
Hierarchical Organization:
Search engine directories employ a hierarchical structure to organize information, akin to a digital library's system of classification. Websites are arranged into categories and subcategories. This taxonomy facilitates a more intuitive search process for users who can navigate through layers of categories to find the type of sites they are interested in, from general to specific topics.
Quality Control:
Unlike algorithm-driven search engines, directories maintain quality by being selective about the websites they include. Human editors can assess the credibility and relevance of a site, excluding those of poor quality or those that do not meet the directory's guidelines. This vetting process is designed to provide a directory of websites that are trustworthy and substantive.
Search Methodology:
When a user queries a search engine directory, the system does not dynamically crawl the web to find new content. Instead, it searches its pre-defined categories to find matches within its curated list of sites. This means the results are limited to what has been reviewed and included, emphasizing quality over quantity.
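The lookup described above can be sketched in a few lines of Python. This is only an illustrative sketch: the `DIRECTORY` contents, the `example.org` URLs, and the function names `browse` and `search_directory` are hypothetical and do not describe any real directory service.

```python
# Hypothetical curated directory: categories -> subcategories -> reviewed entries.
DIRECTORY = {
    "Science": {
        "Astronomy": [
            ("Star Charts Weekly", "https://example.org/star-charts"),
            ("Backyard Telescope Guide", "https://example.org/telescopes"),
        ],
        "Biology": [
            ("Field Guide to Pond Life", "https://example.org/pond-life"),
        ],
    },
    "Recreation": {
        "Cooking": [
            ("Apple Juice Recipes", "https://example.org/apple-juice"),
        ],
    },
}


def browse(directory, path):
    """Walk from general to specific categories, e.g. ["Science", "Astronomy"]."""
    node = directory
    for category in path:
        node = node[category]
    return node


def search_directory(directory, keyword, path=()):
    """Return (category path, title, url) for curated entries matching keyword.

    Only entries already listed in the directory are searched; nothing is crawled.
    """
    keyword = keyword.lower()
    results = []
    for name, child in directory.items():
        current = path + (name,)
        if isinstance(child, dict):
            # A subcategory level: descend into it.
            results.extend(search_directory(child, keyword, current))
        else:
            # A leaf category: match on the entry title or the category name.
            for title, url in child:
                if keyword in title.lower() or keyword in name.lower():
                    results.append((" > ".join(current), title, url))
    return results


print(browse(DIRECTORY, ["Science", "Biology"]))
print(search_directory(DIRECTORY, "telescope"))
```

A query for a term that no editor has listed returns an empty list, which mirrors the point that directory results are limited to what has been reviewed and included.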
Directory-Based Ranking:
In search engine directories, the concept of 'ranking' differs from search engines that use complex algorithms. The placement of a website within a directory is more static, based on the category it has been assigned to rather than a continuously updated ranking score. Some directories may prioritize sites within categories by additional criteria, such as user ratings or editorial preference, but these are typically less fluid than algorithmic rankings.
Evolution and Integration:
It is important to note that the strict division between search engine directories and crawling search engines has evolved over time. Many traditional directories have integrated algorithmic search capabilities to enhance the breadth of their search services, while algorithmic search engines have adopted aspects of human curation for certain functions, such as featured snippets or verified listings.
Search engine directories are grounded in a theory of structured, human-mediated content curation and organization. While their prominence has declined with the rise of powerful algorithm-based search engines, the principles of the directory approach (human curation, hierarchical classification, and an emphasis on quality) remain relevant, particularly in niche or specialized search applications where the trustworthiness and quality of content are paramount.
Traditional Web Searching
Up to this point in the course, you have not done any searching (unless you have tried a search with some of the sites you visited in the last module). Now that you have been introduced to different search services and to some challenges of searching, you can begin to practice some searches.
The searching exercises in this module let you compare different categories of information retrieval services and different services in the same category. This module will also discuss in more detail the concepts and functions of directories and search engines, including their advantages and disadvantages compared to other information retrieval service categories, how they can complement each other, and why one may be more appropriate than the other in a particular search.
An understanding of how each type of search service functions will help you to create more effective search strategies.
After completing this module, you will be able to:
- Describe how directories are created and organized, their advantages and limitations
- Describe how a search engine creates and maintains its database of sites
- Ask a search engine to find information with a search query
- Explain how a search engine's database affects your results
Search Engine Functions
Search engines fundamentally do three things:
- ingest content,
- return content matching incoming queries, and
- sort the returned content based upon some measure of how well it matches the query.
Relevance is the term used to describe this notion of "how well the content matches the query".
Most of the time the matched content is documents, and the returned and ranked content is those matched documents along with some corresponding metadata describing the documents.
In most search engines, the default relevance sorting is based upon a score indicating how well each keyword in a query matches the same keyword in each document, with the best matches yielding the highest relevance scores and appearing at the top of the search results. The relevance calculation is highly configurable, however, and can be adjusted on a per-query basis to enable very sophisticated ranking behavior.
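The three functions (ingest, match, sort) can be illustrated with a toy in-memory engine. This is a minimal sketch under stated assumptions, not how any production search engine is implemented: the class name `TinySearchEngine`, the whitespace tokenizer, and the scoring rule (sum of matching term counts) are all simplifications chosen for illustration.

```python
from collections import defaultdict


class TinySearchEngine:
    """Toy engine: ingest documents, match query terms, sort by a simple score."""

    def __init__(self):
        self.documents = {}             # doc_id -> original text
        self.index = defaultdict(dict)  # term -> {doc_id: occurrences of term}

    @staticmethod
    def tokenize(text):
        # Assumption: lowercase and split on whitespace; real analyzers do far more.
        return text.lower().split()

    def ingest(self, doc_id, text):
        """Ingest content: store the document and update the inverted index."""
        self.documents[doc_id] = text
        for term in self.tokenize(text):
            self.index[term][doc_id] = self.index[term].get(doc_id, 0) + 1

    def search(self, query):
        """Match and sort: score each document by how often query terms occur."""
        scores = defaultdict(int)
        for term in self.tokenize(query):
            for doc_id, count in self.index.get(term, {}).items():
                scores[doc_id] += count
        # Highest score first; ties are broken by doc_id for a stable order.
        return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


engine = TinySearchEngine()
engine.ingest("doc1", "apple juice and apple pie")
engine.ingest("doc2", "orange juice")
engine.ingest("doc3", "green tea")
print(engine.search("apple juice"))  # [('doc1', 3), ('doc2', 1)]
```

Here `doc1` ranks first because it matches both query terms and contains "apple" twice, while `doc3` matches nothing and is not returned at all.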
In this module, we will provide an overview of how relevance is calculated, how the relevance function can be easily controlled and adjusted through function queries, and how to implement popular domain-specific and user-specific relevance ranking features. We’ll start by looking at how ranking actually works.
Click the link below to consider what makes using a search engine or directory an easy or a difficult experience.
How search engines work
Scoring query and document vectors with cosine similarity
Previously, we demonstrated the idea of measuring the similarity of two vectors by calculating the cosine between them.
We created vectors (lists of numbers) in which each number represents the strength of some feature, for example features describing different food items. We then calculated the cosine of the angle between two vectors to determine their similarity: the smaller the angle, the closer the cosine is to 1 and the more similar the items.
We will expand upon that technique in this section, discussing how text queries and documents can be mapped into vectors for ranking purposes. We’ll also look at some popular text-based feature weighting techniques and how they can be combined to create an improved relevance ranking formula.
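As a reminder of the cosine calculation, here is a short Python sketch. The cosine is the dot product of the two vectors divided by the product of their lengths. The feature names (sweetness, fruitiness, caffeine) and all feature values are hypothetical, chosen only to illustrate the idea.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Convention for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical feature order: [sweetness, fruitiness, caffeine]
apple_juice = [0.8, 0.9, 0.0]
orange_juice = [0.7, 0.9, 0.0]
espresso = [0.1, 0.0, 1.0]

print(cosine_similarity(apple_juice, orange_juice))  # close to 1: very similar
print(cosine_similarity(apple_juice, espresso))      # close to 0: dissimilar
```

Because the score depends only on the angle between the vectors, two items with the same proportions of each feature score as highly similar even if one vector is longer than the other.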
Mapping text to vectors
In a typical search problem, we start with a collection of documents and we then try to rank documents based upon how well they match some user’s query. In this section, we’ll walk through the process of mapping the text of queries and documents into vectors.
In the last chapter, we used the example of a search for food and beverage items, like apple juice, so let’s reuse that example here.
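One simple way to perform this mapping is to give every distinct term its own dimension and count how often the term occurs in each text. The sketch below assumes a whitespace tokenizer and a small hypothetical document collection; the function names `build_vocabulary` and `to_vector` are illustrative and not taken from any library.

```python
def build_vocabulary(texts):
    """Collect every distinct term, sorted so each term has a fixed dimension."""
    return sorted({term for text in texts for term in text.lower().split()})


def to_vector(text, vocabulary):
    """Map text to a vector holding the count of each vocabulary term."""
    terms = text.lower().split()
    return [terms.count(term) for term in vocabulary]


# Hypothetical query and document collection.
query = "apple juice"
documents = ["apple juice", "orange juice", "green tea"]

vocabulary = build_vocabulary(documents + [query])
print(vocabulary)                    # ['apple', 'green', 'juice', 'orange', 'tea']
print(to_vector(query, vocabulary))  # [1, 0, 1, 0, 0]
for doc in documents:
    print(doc, to_vector(doc, vocabulary))
```

Once the query and every document live in the same vector space, they can be compared with a similarity measure such as the cosine.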
How search engines and directories work
If you have experience in searching, you may recall searches that were successful very quickly and returned high-quality results that were just what you wanted. You may also recall search experiences that went on for a long time and many tries, but still did not find useful, specific, or current enough results.
If you are new to searching, it may seem premature to ask you about your experiences, but you are aware by now of the scale of the Web and some of the challenges in finding what you want.
Spend a minute or two reflecting on or writing about how the "perfect" search engine or directory would help you in your initial search effort and further help you if your first search was unsuccessful. Some characteristics of this ideal search engine might be:
- Giving you a "natural language" method of describing exactly what you want to find
- Offering easy-to-select menus or check boxes to add advanced searching syntax to narrow or expand your search keywords
- Refining your search by performing a second search through the initial results
- Pointing to a particular result and indicating that you want more like that result
As you perform the searching exercises in this course, and in your own explorations, refer to (or recall) your list as you look around the search site's main page (you may have to go to the Advanced Search page) to see if any of your ideal search engine characteristics are present. You will likely find at least one search site in each category that has some of what you desire.
Search Engine Optimization (SEO) is the activity of optimizing web pages or whole sites in order to make them search engine friendly, thus getting higher positions in search results. This tutorial explains simple SEO techniques to improve the visibility of your web pages for different search engines, especially for Google, Yahoo, and Bing.
How does a Search Engine Work?
Search engines perform several activities in order to deliver search results.
- Crawling: The process of fetching all the web pages linked to a website. This task is performed by software called a crawler or a spider (Googlebot, in the case of Google).
- Indexing: The process of creating an index of all the fetched web pages and storing it in a giant database from which pages can later be retrieved. Essentially, indexing means identifying the words and expressions that best describe a page and assigning the page to particular keywords.
- Processing: When a search request comes in, the search engine processes it, i.e., it compares the search string in the request with the indexed pages in the database.
- Calculating Relevancy: It is likely that more than one page contains the search string, so the search engine starts calculating the relevancy of each of the pages in its index to the search string.
- Retrieving Results: The last step in search engine activities is retrieving the best matched results. Basically, it is nothing more than simply displaying them in the browser.
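The activities listed above can be sketched end to end in Python. This is a toy illustration under heavy assumptions: the `WEB` dictionary stands in for the real web (no network requests are made), the page names and texts are hypothetical, and relevancy is reduced to counting how often query words occur on a page.

```python
from collections import defaultdict, deque

# Hypothetical in-memory "web": url -> (page text, outgoing links).
WEB = {
    "site/home": ("welcome to our juice shop", ["site/apple", "site/orange"]),
    "site/apple": ("fresh apple juice pressed daily", ["site/home"]),
    "site/orange": ("orange juice and orange marmalade", ["site/home", "site/tea"]),
    "site/tea": ("green tea selection", []),
}


def crawl(start_url, web):
    """Crawling: fetch every page reachable from start_url by following links."""
    seen, queue, pages = {start_url}, deque([start_url]), {}
    while queue:
        url = queue.popleft()
        text, links = web[url]
        pages[url] = text
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


def build_index(pages):
    """Indexing: map each word to the pages containing it, with counts."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word][url] = index[word].get(url, 0) + 1
    return index


def search(query, index, limit=10):
    """Processing, relevancy, retrieval: match words, score pages, return the best."""
    scores = defaultdict(int)
    for word in query.lower().split():
        for url, count in index.get(word, {}).items():
            scores[url] += count
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [url for url, _ in ranked[:limit]]


pages = crawl("site/home", WEB)
index = build_index(pages)
print(search("orange juice", index))  # ['site/orange', 'site/apple', 'site/home']
```

Real search engines use far richer relevancy signals than word counts, but the order of the stages is the same: pages must be crawled and indexed before any query can be answered.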
Search engines such as Google often update their search algorithms several times per month.
When you see changes in your rankings, an algorithm update is one likely cause, although changes to your own pages or to competing pages can also shift your position.