Monday, November 2, 2009

Working with Lucene Search Index in Sitecore 6. Part II - How it works

Here is the second part of the Lucene search index overview for Sitecore 6. In this part we'll take a look at the configuration settings and talk about how the index works.

Sitecore 5 has the Lucene engine as well. Let's step back one Sitecore version and see how Lucene works there. In the web.config file there is a /sitecore/indexes section that contains the Lucene index configuration. Once an index is configured, it should be added to the /sitecore/databases/database/indexes section.
The web database does not have a search index by default. Even if you add one to the aforementioned section, it won't work. Why? Because the index relies on HistoryEngine functionality, and by default the web database does not have it. It's easy to add, though: just add the HistoryEngine configuration section to the database.
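As a rough sketch of what that section looks like, the fragment below mirrors the HistoryEngine settings a reader posted in the comments on this post. It assumes the elements are placed inside the web database definition under /sitecore/databases; the 30-day EntryLifeTime is simply the value from that comment, not a recommendation.

```xml
<!-- Assumed placement: inside the web database definition under /sitecore/databases -->
<Engines.HistoryEngine.Storage>
  <obj type="Sitecore.Data.$(database).$(database)HistoryStorage, Sitecore.Kernel">
    <param connectionStringName="$(id)" />
    <!-- How long history entries are kept (days.hh:mm:ss) -->
    <EntryLifeTime>30.00:00:00</EntryLifeTime>
  </obj>
</Engines.HistoryEngine.Storage>
<Engines.HistoryEngine.SaveDotNetCallStack>false</Engines.HistoryEngine.SaveDotNetCallStack>
```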
You can find more configuration details in this article on SDN.
This index has the same configuration in Sitecore 6.
In addition, Sitecore 6 has new Lucene index functionality, which is more reliable and has a Sitecore-level API on top of the Lucene one. In some cases you will still have to use the Lucene API directly, for instance to create range queries.
Configuration settings for the new search index are located under the /sitecore/search section.
The analyzer section defines a Lucene analyzer that is used to analyze and index content.
The categories section is used to categorize search results. It's used by the content tree search introduced in Sitecore 6. The search box is located right above the content tree in the Content Editor.
The configuration section holds the index definitions with their configurations. An index definition should be created under the /sitecore/search/configuration/indexes node.
The first two parameters describe the index name and the name of the folder where it should be stored:
<param desc="name">$(id)</param>
<param desc="folder">my_index_folderName</param>
The next setting is the analyzer that should be used for the index:
<Analyzer ref="search/analyzer" />
The Lucene StandardAnalyzer covers most scenarios, but it's possible to use any other analyzer if needed.
The following setting defines locations for the index:
<locations hint="list:AddCrawler">
It's possible to have multiple locations for one index. Moreover, it's even possible to have content from different databases in the same index. Every child of the locations node has its own configuration for a particular part of the content. The name of a location node is not predefined; you're welcome to name it any way you want. For example:
<locations hint="list:AddCrawler">
<sdn-site type="Sitecore.Search.Crawlers.DatabaseCrawler, Sitecore.Kernel">
...
</sdn-site>
</locations>
Every location has a database section. It defines the database to index for the location.
Then comes the root section. The database crawler will index content beneath this path.
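To illustrate, here is a minimal sketch of one locations node that pulls content from two databases. The location names (master-content, web-media) are made up for this example, and the Database and Root element names follow the configurations readers posted in the comments; adjust the paths to your own content tree.

```xml
<locations hint="list:AddCrawler">
  <!-- Hypothetical location name: content from the master database -->
  <master-content type="Sitecore.Search.Crawlers.DatabaseCrawler, Sitecore.Kernel">
    <Database>master</Database>
    <Root>/sitecore/content</Root>
  </master-content>
  <!-- Hypothetical location name: media items from the web database -->
  <web-media type="Sitecore.Search.Crawlers.DatabaseCrawler, Sitecore.Kernel">
    <Database>web</Database>
    <Root>/sitecore/media library</Root>
  </web-media>
</locations>
```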
The next sibling node is the include section. Here you can list the templates whose items should be included in the index or excluded from it.
Example:
<include hint="list:IncludeTemplate">
<sampleItem>{76036F5E-CBCE-46D1-AF0A-4143F9B557AA}</sampleItem>
</include>
<include hint="list:ExcludeTemplate">
<layout>{3A45A723-64EE-4919-9D41-02FD40FD1466}</layout>
</include>
It does not make sense to use both of these settings for one location. Use only one of them.
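A sketch of a location that uses only the exclude list might look like this. The child element names (templateA, templateB) are hypothetical, and the GUIDs are placeholders borrowed from a reader's configuration in the comments. Each child gets a distinct name because, as discussed in the comments, reusing the same element name for several templates keeps the exclusion from working.

```xml
<include hint="list:ExcludeTemplate">
  <!-- Hypothetical element names; keep each one unique -->
  <templateA>{2352AF64-B6ED-4715-8C59-94B6F6A12741}</templateA>
  <templateB>{C68F04E2-D93D-4D1B-9C81-9AF26E9DB84B}</templateB>
</include>
```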
The next location setting is the tags section. Here you can tag indexed content and use the tags during search.
The last setting is the boost section. It lets you boost indexed content relative to content that belongs to other locations.
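Putting the location settings together, a complete index definition could look roughly like the sketch below. The index id, folder name and location name are hypothetical. The Boost element name and its value of 2.0 are assumptions, so check them against the default index definition in your own web.config.

```xml
<!-- Assumed placement: under /sitecore/search/configuration/indexes -->
<index id="my_index" type="Sitecore.Search.Index, Sitecore.Kernel">
  <param desc="name">$(id)</param>
  <param desc="folder">my_index_folderName</param>
  <Analyzer ref="search/analyzer" />
  <locations hint="list:AddCrawler">
    <!-- Hypothetical location name -->
    <my-content type="Sitecore.Search.Crawlers.DatabaseCrawler, Sitecore.Kernel">
      <Database>master</Database>
      <Root>/sitecore/content</Root>
      <include hint="list:IncludeTemplate">
        <sampleItem>{76036F5E-CBCE-46D1-AF0A-4143F9B557AA}</sampleItem>
      </include>
      <!-- Tag attached to everything indexed from this location -->
      <Tags>content</Tags>
      <!-- Assumed element name and value: weights this location above others -->
      <Boost>2.0</Boost>
    </my-content>
  </locations>
</index>
```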
Last but not least, this search index uses the same HistoryEngine mechanism as the old one. So don't forget to copy the HistoryEngine configuration section from the master database to the database where you want to add search index facilities.

How does it all work?
When an action is performed on an item, the database crawler updates the entries for that item in the search index, so that the information in the index stays in sync with the database. How does this happen if "item:saved", "item:deleted", "item:renamed", "item:copied" and "item:moved" do not have event handlers that trigger a search index update? Thanks to HistoryEngine, which was mentioned several times already.
It is HistoryEngine that tracks any changes made to the item and fires the appropriate event handler to process them.
IndexingManager is responsible for all operations on the search index. It subscribes to the AddEntry event of HistoryEngine, and as soon as an entry is added to the History table, it triggers a job that updates the search index(es).
In the web.config file there are a few settings that belong to the indexing functionality.
  • Indexing.UpdateInterval - sets the interval at which the IndexingManager checks its queue for pending actions. The default value is 5 minutes.
    What does it mean? If for whatever reason a pending job was not executed, the IndexingManager will re-run it if it still finds it in the pending state after 5 minutes.
  • Indexing.UpdateJobThrottle - sets the minimum time to wait between individual index update jobs. The default value is 1 second.
    When some operation is performed on an item, you can see this entry in the Sitecore log file:
    INFO Starting update of index for the database 'databaseName' ( pending).
    This setting sets the interval between jobs like this, so that indexing does not consume all CPU time if you're making massive changes to items.
  • Indexing.ServerSpecificProperties - indicates whether server-specific keys should be used for property values (such as 'last updated'). It's off by default.
    This setting is designed for content delivery environments in web farms. As the web database is shared, there could be a situation where one server has updated its search indexes and changed the History table in the database. Other servers then won't update their indexes because HistoryEngine wouldn't indicate there was a change. This setting prevents situations like this.
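For reference, these settings would appear in web.config roughly as follows. This is a sketch that assumes the usual setting element format under /sitecore/settings and uses the default values described above, with time spans written as hh:mm:ss.

```xml
<!-- Assumed placement: inside /sitecore/settings in web.config -->
<!-- How often the IndexingManager checks its queue for pending actions -->
<setting name="Indexing.UpdateInterval" value="00:05:00" />
<!-- Minimum wait between individual index update jobs -->
<setting name="Indexing.UpdateJobThrottle" value="00:00:01" />
<!-- Set to true on content delivery servers that share a web database -->
<setting name="Indexing.ServerSpecificProperties" value="false" />
```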
Well... that's it for now. In the next part we will take a look at the Sitecore Lucene API and create some search queries with it.
Enjoy!

17 comments:

Seeni said...

Nice article...waiting for the next one.

Seenivasan said...

is the part 3 coming up any time soon?

Ivan said...

It's almost finished. I will publish it sometime this week.

Anonymous said...

Great article, very helpful as I was confused about the index vs search section.

Am trying to use the new search index to index the web database. Added something like this search/indexes:
[index id="webIndex" type="Sitecore.Search.Index, Sitecore.Kernel"]
[param desc="name"]$(id)[/param]
[param desc="folder"]__web[/param]
[Analyzer ref="search/analyzer" /]
[locations hint="list:AddCrawler"]
[web type="Sitecore.Search.Crawlers.DatabaseCrawler,Sitecore.Kernel"]
[Database]web[/Database]
[Root]/sitecore/content[/Root]
[Tags]content[/Tags]
[/web]
[/locations]
[locations hint="list:AddCrawler"]
[web type="Sitecore.Search.Crawlers.DatabaseCrawler,Sitecore.Kernel"]
[Database]web[/Database]
[Root]/sitecore/media library[/Root]
[Tags]media[/Tags]
[/web]
[/locations]
[/index]
I dont have to add anything to the indexes/index section as its not needed right?

Also was wondering if like in the previous version can I add specific fields to the index? Like adding a "tag" field which is a multilist so I can search content tagged by a certain term.

Thanks!

Ivan said...

>>I dont have to add anything to the indexes/index section as its not needed right?

Nope. But don't forget to add the HistoryEngine section to the web database definition. The new search does not work without it.


>>Also was wondering if like in the previous version can I add specific fields to the index?

By default this search adds all fields to the index. So if you have a multilist (or treelist, lookup, etc.), its data will be in the index and you can use it to find references to the items it points to.

mT said...

Hi Ivan,

I want to make specific fields in search index as we used to do it in older versions. Can you plz give me some hint on this. Currently I am using older indexes in my code but it indexes std. value items of a template as well which I don't want. Now I want to use newer indexes but it doesn't supports custom field names by default. Putting more constraint - I don't want to write custom indexer for indexes because I have around 10 indexes in my app. :(

thanks,
mT

Anonymous said...

Ivan,
Having some trouble with excluding items belonging to certain templates from the search.

Here is what the config looks like:
[index id="webIndex" type="Sitecore.Search.Index, Sitecore.Kernel"]
[param desc="name"]$(id)[/param]
[param desc="folder"]__web[/param]
[Analyzer ref="search/analyzer" /]
[locations hint="list:AddCrawler"]
[web type="MyWeb.DatabaseCrawler,MyWeb"]
[Database]web[/Database]
[Root]/sitecore/content/Home[/Root]
[include hint="list:ExcludeTemplate"]
[layout]{2352AF64-B6ED-4715-8C59-94B6F6A12741}[/layout]
[layout]{C68F04E2-D93D-4D1B-9C81-9AF26E9DB84B}[/layout]
[/include]
[Tags]webcontent[/Tags]
[/web]
[/locations]
[/index]

The templates included in the ExcludeTemplate tag are also getting indexed. Appreciate any help.
Thanks.

Ivan said...

You use the same tag name for both templates that you're trying to exclude from the index. Try using the template name as the tag name, so that no duplicate names appear.
For instance:
<layout1>...</layout1>
<layout2>...</layout2>

mT said...

Hi Ivan,

I saw the sitecore.kernel.dll in reflector and found that there is no code in Sitecore.search.index for reading specific fields only. Though it exists there in Sitecore.data.Indexing.Index. Its quite surprising because newer implementation doesn't covers the older one completely. Is it possible to do it any other way? Did I miss something while analyzing the dis-assembled code?

Thanks,
mT

Ivan said...

mT,

I just published a new article that covers custom fields question: http://sitecoregadgets.blogspot.com/2010/08/adding-custom-fields-to-index.html

--Ivan

Anonymous said...

Hi Ivan,
We have a web farm environment with an authoring server behind the firewall publishing to a web db outside the firewall and 2 public webservers out side the firewall.

In this case do we need to do anything different on the public servers:
Right now we just have the HistoryEngine configured for the web db and Indexing.ServerSpecificProperties set to true.

Is this enough? Since we don't have the authoring interface here whats the best practice in terms of forcing a reindex?
<Engines.HistoryEngine.Storage>
<obj type="Sitecore.Data.$(database).$(database)HistoryStorage, Sitecore.Kernel">
<param connectionStringName="$(id)" />
<EntryLifeTime>30.00:00:00</EntryLifeTime>
</obj>
</Engines.HistoryEngine.Storage>
<Engines.HistoryEngine.SaveDotNetCallStack>false</Engines.HistoryEngine.SaveDotNetCallStack>

Ivan said...

That's all you need to do to set up indexing on the public servers.
In terms of rebuilding, you need to use custom code. Create an aspx page with code that rebuilds the index.
Code example:
// Look up the index by the id used in its definition, then rebuild it from scratch.
Sitecore.Search.Index index = Sitecore.Search.SearchManager.GetIndex("index_name");
index.Rebuild();

PaulG said...

You can also invoke a rebuild from within the shell using the Index Viewer module - http://trac.sitecore.net/IndexViewer

Unknown said...

IVAN!!!

this whole new Index Configuration makes me very confuse...

Can someone just post a simple configuration that includes 2 templates and perhaps 3 fields.

and what's with [layout] tag?

Please help

Azzy said...

Hi Ivan,

How can i index PDF files from media library?

Ivan said...

Azzy,

Check out the Sitecore Search and Indexing document on SDN: http://sdn.sitecore.net/Reference/Sitecore%206/Sitecore%20Search%20and%20Indexing.aspx
It has an example of a file crawler, which will give you an idea of how to index PDFs.
You can index any document from the media library by getting its content through the Sitecore API and passing the textual value of the content to the Lucene API for indexing. In the past I played with ABCPdf for .NET to index PDF files. You can try to find similar open source software if you don't have the budget for a licensed one.

Azzy said...

Thanks Ivan,

That was of really great help.