SharePoint PitStop: SharePoint 2007 Search Terminology and Architecture

Enterprise Search enables the collection, indexing, and querying of content. It supports full-text searches through a SQL syntax, enables extensive crawling and indexing of content, and provides a keyword syntax for keyword searches. Keywords are words or phrases that administrators have identified as important. The underlying search service in Enterprise Search is the same as the search service in Windows SharePoint Services.

Some parts of the Enterprise Search architecture include:

Query engine: The query engine executes keyword and SQL syntax queries against the content index and the search configuration data.

Index engine: The index engine processes the blocks of text and properties that are filtered from content sources and stores them in the content index and the property store.

Content index: The content index stores information about words and their locations in a content item.

Content source: A content source is a collection of start addresses that represent the content to be crawled by the search index component. A content source also defines the crawl schedule and the behavior of the crawl.
Enterprise Search supports various types of content sources, such as SharePoint, web, file share, Exchange folder, and business data content.

Protocol handlers: Protocol handlers open content sources in their native protocols and expose the documents and other items that are to be filtered.

IFilters: IFilters open documents and other content source items in their original formats and filter them into blocks of text and properties.
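To make the relationship between content sources, start addresses, and protocol handlers more concrete, here is a minimal Python sketch. It is an illustration only, not the product's API: the `ContentSource` class, the `PROTOCOL_HANDLERS` table, the handler names, and the example addresses are all hypothetical. It assumes a handler can be chosen from the scheme of a start address.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical scheme-to-handler table (illustrative, not the real registry).
PROTOCOL_HANDLERS = {
    "http": "web handler",
    "https": "web handler",
    "file": "file share handler",
}


@dataclass
class ContentSource:
    """A named collection of start addresses plus a crawl schedule."""
    name: str
    start_addresses: list
    schedule: str  # free-form description, e.g. "daily"


def pick_handler(start_address):
    """Choose a handler from the scheme of the start address."""
    scheme = urlparse(start_address).scheme.lower()
    handler = PROTOCOL_HANDLERS.get(scheme)
    if handler is None:
        raise LookupError(f"no protocol handler for scheme {scheme!r}")
    return handler


def plan_crawl(source):
    """Pair every start address with the handler that would open it."""
    return [(addr, pick_handler(addr)) for addr in source.start_addresses]


intranet = ContentSource(
    name="Intranet",
    start_addresses=["http://intranet.example.com/", "file://fileserver/docs"],
    schedule="daily",
)
crawl_plan = plan_crawl(intranet)
```

An address with no matching handler raises an error here, which mirrors the requirement that a content source needs an associated protocol handler before it can be crawled.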

Other parts of the Enterprise Search architecture include:

Search configuration data: The search configuration data stores information that is used by the Search service. This includes crawl configuration, property schema, and scopes.

Crawl log: The crawl log stores the status of crawled content, including the current status of every item in the content index.

Search scopes: A search scope provides a way to group several search items together. The grouping is based on an element that is common to all the items in the search scope. Scopes enhance the search process by letting users concentrate their search on a smaller group of content in the index instead of searching through the entire index.
After a search scope has been created, you can add scope rules to it. Scope rules define the content that is included in the search scope and can be based on an address, a property query, or a content source.

Property store: The property store stores a table of properties and their associated values.

Wordbreakers: The query and index engines use wordbreakers to break compound words and phrases into individual words or tokens.
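The three kinds of scope rules described under Search scopes (address, property query, and content source) can be sketched in Python as simple predicates over crawled items. This is a simplified model under stated assumptions: the `Item` class, the rule helpers, and the sample data are hypothetical, and an item is treated as in scope when any rule matches, which may not reflect every rule behavior the product offers.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """A crawled item, reduced to the fields the scope rules inspect."""
    url: str
    content_source: str
    properties: dict = field(default_factory=dict)


def address_rule(prefix):
    """Match items whose address starts with the given prefix."""
    return lambda item: item.url.startswith(prefix)


def property_rule(name, value):
    """Match items whose property has exactly the given value."""
    return lambda item: item.properties.get(name) == value


def content_source_rule(source_name):
    """Match items that were crawled from the named content source."""
    return lambda item: item.content_source == source_name


def in_scope(item, rules):
    """Assumption: an item belongs to the scope if any rule matches."""
    return any(rule(item) for rule in rules)


items = [
    Item("http://intranet.example.com/hr/policy.doc", "Intranet"),
    Item("http://intranet.example.com/it/setup.doc", "Intranet",
         {"Department": "HR"}),
    Item("file://fileserver/docs/plan.doc", "File shares"),
]

hr_scope = [
    address_rule("http://intranet.example.com/hr/"),
    property_rule("Department", "HR"),
]
hr_urls = [item.url for item in items if in_scope(item, hr_scope)]

share_scope = [content_source_rule("File shares")]
share_urls = [item.url for item in items if in_scope(item, share_scope)]
```

Searching within `hr_scope` would then consider only the two intranet items, rather than everything in the index.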

How Search works in SharePoint 2007

The index engine requests the Filter Daemon to start filtering the content source. To make this request, the index engine uses a shared-memory pipe. Crawling requires the content source to have an associated protocol handler that can read its protocol. The index engine provides the start address of the content source, and the Filter Daemon uses it to invoke the appropriate protocol handler. Protocol handlers and IFilters extract and filter the individual items in the content source. The Filter Daemon passes the extracted and filtered data back to the index engine as text through the pipe.

The index engine saves the properties of each document in the property store, from which they can be retrieved and sorted. The property store consists of tables, and each row in a table corresponds to an individual document in the full-text index. The actual text of a content item is stored in the content index, so it can be used for content queries. Document-level security is gathered when a document is crawled and is maintained and enforced by the property store.

During the crawl process, the index engine uses wordbreakers and stemmers to process the text and the properties further. The stemming component generates inflected forms of a given word. The index engine also removes noise words and creates an inverted index for full-text searching.

At query time, the query engine passes the query through a language-specific wordbreaker. If the query is in a language for which no specific wordbreaker is available, a neutral wordbreaker is used. The neutral wordbreaker performs whitespace-style wordbreaking, that is, it splits words and phrases at whitespace. The words from the wordbreaking process are passed through the stemmer, which generates language-specific inflected forms of a given word.
The use of wordbreakers and stemmers enhances the search process, because they generate relevant alternatives to a user's query phrasing.

To execute a property-value query, the query engine first checks the index to obtain a list of all possible matches. It then loads the properties of the matching documents from the property store and rechecks them to ensure a match against the search query. The query result is a list of all matches, displayed in order of relevance to the query words. If a user does not have permission to access a particular search result, it is filtered out of the list before the list is displayed.
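The flow described in this section can be approximated with a toy Python model. Everything in it is an assumption made for illustration: the noise-word list, the whitespace wordbreaker, the naive `inflect` stand-in for a stemmer, the `readers` property used for security trimming, and ranking by hit count as a stand-in for relevance. It is not how the product is implemented.

```python
from collections import defaultdict

NOISE_WORDS = {"the", "a", "of", "and"}  # illustrative list only


def wordbreak(text):
    """Neutral, whitespace-style wordbreaker."""
    return text.lower().split()


def inflect(word):
    """Toy stemmer stand-in: add or strip a plural 's'."""
    forms = {word, word + "s"}
    if word.endswith("s") and len(word) > 1:
        forms.add(word[:-1])
    return forms


def build_index(documents):
    """Inverted index: word -> document id -> list of word positions."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for position, word in enumerate(wordbreak(text)):
            if word not in NOISE_WORDS:
                index[word][doc_id].append(position)
    return index


def search(query, index, property_store, user, required=None):
    """Look up candidates, recheck properties, then trim by permission."""
    required = required or {}
    scores = defaultdict(int)
    for word in wordbreak(query):
        if word in NOISE_WORDS:
            continue
        for form in inflect(word):
            for doc_id, positions in index.get(form, {}).items():
                scores[doc_id] += len(positions)
    results = []
    for doc_id, score in scores.items():
        props = property_store[doc_id]
        if any(props.get(k) != v for k, v in required.items()):
            continue  # property recheck failed
        if user not in props["readers"]:
            continue  # security trimming
        results.append((doc_id, score))
    results.sort(key=lambda pair: (-pair[1], pair[0]))
    return [doc_id for doc_id, _ in results]


documents = {
    "doc1": "The annual budget report",
    "doc2": "Budget reports and budget forecasts",
    "doc3": "Confidential budget of the board",
}
property_store = {
    "doc1": {"author": "alice", "readers": {"alice", "bob"}},
    "doc2": {"author": "bob", "readers": {"alice", "bob"}},
    "doc3": {"author": "alice", "readers": {"alice"}},
}
index = build_index(documents)
bob_results = search("budget report", index, property_store, "bob")
alice_results = search("budget", index, property_store, "alice",
                       required={"author": "alice"})
```

In this sketch `bob_results` omits `doc3` because bob is not among its readers, and `alice_results` keeps only the documents whose `author` property passes the recheck.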