Friday, October 2, 2009

Working with Lucene Search Index in Sitecore 6. Part I - Why/When to use

How often did you stress yourself thinking on the question "What's the most efficient way to retrieve Sitecore items?"?
Here are possible ways to do it:
  • Sitecore API
  • Sitecore Query
  • SQL custom stored procedures
  • Lucene index
Let's take a look at each option briefly.
Sitecore API - the most popular way to get an item. All you need is an item ID and simple call of GetItem() method of database instance will get you the latest item version in current language. If you want an item in specific language, not a problem: GetItem(, ). Want specify version, not a problem either: GetItem(, , ). There are many ways to get an item through Sitecore API. My point is that it's the most easiest way to do it.
Sitecore Query - easy way to get a bunch of items filtered by some criteria. Build a string query, which kinda look like XPath query, and run it on a database instance. Read more about Sitecore Query.
SQL custom stored procedures - we all know how to create a SQL stored procedure. Connections strings already exist in Sitecore solution. All you need is to use some SQL management studio to create a stored procedure for the database. The question is why would you do it if API already exists? It makes sense only if you run complex query with lots of filtering conditions so that SQL will return you only items that you're searching for.
Lucene index - why would I use separate data storage/data index to get an item from Sitecore database? It does not make sense. Agree, for an item it does not but for a collection of items it has a perfect sense to do it. Why? Maybe because performance is much better when you are working with big amount of data. Moreover that's a search index. Which means that data are perfectly organized to be searched.

Let's talk about performance characteristics for each of these options and compare them.
Sitecore API - uses Sitecore data provider to pull out information from the database. If item that we're looking for is cached the call to SQL database is avoided and item gets retrieved from one of Sitecore caches. Using Sitecore caches gives you huge performance benefit. That's why we always worried about caches configuration in Sitecore solution. What if you need to get a number of items by some criteria, let's say template ID and belonging to some category (from here on I'm gonna use this condition for all item retrieval options). You run foreach or for or any other logic that goes through the content tree and checks for a TemplateID property and category field of every single item. If item resides in cache, it will be quick enough but still the code will have to request every single item from the content tree.
Sitecore Query - the picture with Sitecore Query is very similar to Sitecore API but it get's more slower when number of items is growing. Kim has a good article that explains Sitecore Query performance.
NOTE: I'm not talking about fast Sitecore Query introduced in Sitecore 6.
SQL stored procedures - it seems to be a good approach to go with if you're searching for items through the content tree. The thing is there will be only one stored procedure that executes query by specified criteria. Also you will have to consider caching option for retrieved items if you have very deep content tree and items of some specific time are not gathered under one branch. My point is that it can cost you more then the benefits that you will get. I would go with this approach only if I cannot do it with Lucene resources.
Lucene index - again it's search index. It perfectly fits searching options. When you search for items with specified criteria it neither requests an item instance nor runs query over database tables. It goes through the fields that you have in your index and compares their values to your search conditions. The only thing if search query uses custom fields to filter data by, the fields must be added to your index. Otherwise you will get zero results. It's very easy to do though. I will describe it in the next part of the article.

Conclusion: Let's answer those questions in the topic of this part.
Why to use lucene index?
The approach is one of the most (maybe event the best) efficient ones. Search is quick enough that in most cases you don't even need to implement custom caching for the retrieved data.
When to use lucene index?
When you need to run search queries with specific criteria on huge number of data.

Don't forget the rule - the more complex search query gets the slower it works ;).

In next part we will look into Lucene search index introduced in Sitecore 6.
Information about "old" Lucene index (it was introduced in Sitecore 5) you can find here.


Anonymous said...

How reliable do you feel the search index is? Are the master and web indexes up to date as soon as content is added or deleted?

Ivan said...

The index introduced in Sitecore 6 is quite reliable.
Sitecore creates a job that updates Lucene indexes every time when an item gets updated/deleted/created. Since it's a job it may take a couple of seconds to update indexes. It depends on how hard your server loaded with other tasks. Usually index is updated by the end of publish action.

Unknown said...

I have a code for Searching through stored procedure

Anonymous said...

Can we see the stored proc example, please?