Wednesday, October 20, 2010

Sitecore Lucene index does not remove old data

Interest in Sitecore's implementation of the Lucene index seems to have risen since the Dream Core event, and developers have run into an issue with old data being kept in the index repository. In this article I want to show you how to work around this issue.
First of all, let's see why it happens. I ran into this issue when I started playing with the new implementation of the Lucene index in Sitecore 6. When I printed the search results, I saw duplicates of my data. I started debugging my code and found that Lucene does not index raw GUIDs as-is, which breaks the search criteria that Sitecore uses to find items during the update/delete procedure.
To solve this issue I had to create an additional field in the Lucene index (_shorttemplateid) and store the item's template ID there in lower-case ShortID format (Sitecore.Data.ShortID.Encode(item.TemplateID).ToLowerInvariant()). I then overrode the AddMatchCriteria method and its dependent properties to use the short template GUID in the matching criteria. Below is the code example.

Code Snippet
// Namespaces inferred from the types used below.
using System;
using System.Collections;
using System.Collections.Generic;
using System.Xml;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Sitecore.Data.Items;
using Sitecore.Diagnostics;
using Sitecore.Search;

namespace LuceneExamples
{
   // Note: the CustomField helper class used below is not shown in this article.
   public class DatabaseCrawler : Sitecore.Search.Crawlers.DatabaseCrawler
   {
      #region Fields

      private bool _hasIncludes;
      private bool _hasExcludes;
      private Dictionary<string, bool> _templateFilter;
      private ArrayList _customFields;

      #endregion Fields

      #region ctor

      public DatabaseCrawler()
      {
         _templateFilter = new Dictionary<string, bool>();
         _customFields = new ArrayList();
      }

      #endregion ctor

      #region Base class methods

      // Overridden to add date fields in "yyyyMMddHHmmss" format. Otherwise it's not possible to create range queries for date values.
      // Also adds the _shorttemplateid field, which holds the template id in ShortID format.
      protected override void AddAllFields(Document document, Item item, bool versionSpecific)
      {
         Assert.ArgumentNotNull(document, "document");
         Assert.ArgumentNotNull(item, "item");
         Sitecore.Collections.FieldCollection fields = item.Fields;
         fields.ReadAll();
         foreach (Sitecore.Data.Fields.Field field in fields)
         {
            if (!string.IsNullOrEmpty(field.Key) && (field.Shared != versionSpecific))
            {
               bool tokenize = base.IsTextField(field);
               if (IndexAllFields)
               {
                  if (field.TypeKey == "date" || field.TypeKey == "datetime")
                  {
                     IndexDateFields(document, field.Key, field.Value);
                  }
                  else
                  {
                     document.Add(CreateField(field.Key, field.Value, tokenize, 1f));
                  }
               }
               if (tokenize)
               {
                  document.Add(CreateField(BuiltinFields.Content, field.Value, true, 1f));
               }
            }
         }
         AddShortTemplateId(document, item);
         AddCustomFields(document, item);
      }

      /// <summary>
      /// Loops through the collection of custom fields and adds them to the fields collection of each indexed item.
      /// </summary>
      /// <param name="document">Lucene document</param>
      /// <param name="item">Sitecore data item</param>
      private void AddCustomFields(Document document, Item item)
      {
         foreach (CustomField field in _customFields)
         {
            document.Add(CreateField(field.LuceneFieldName, field.GetFieldValue(item), field.StorageType, field.IndexType, Boost));
         }
      }

      /// <summary>
      /// Creates a Lucene field.
      /// </summary>
      /// <param name="fieldKey">Field name</param>
      /// <param name="fieldValue">Field value</param>
      /// <param name="storeType">Storage option</param>
      /// <param name="indexType">Index type</param>
      /// <param name="boost">Boosting parameter</param>
      /// <returns>The created Lucene field</returns>
      private Fieldable CreateField(string fieldKey, string fieldValue, Field.Store storeType, Field.Index indexType, float boost)
      {
         Field field = new Field(fieldKey, fieldValue, storeType, indexType);
         field.SetBoost(boost);
         return field;
      }

      /// <summary>
      /// Parses a configuration entry for a custom field and adds it to the collection of custom fields.
      /// </summary>
      /// <param name="node">Configuration entry</param>
      public void AddCustomField(XmlNode node)
      {
         CustomField field = CustomField.ParseConfigNode(node);
         if (field == null)
         {
            throw new InvalidOperationException("Could not parse custom field entry: " + node.OuterXml);
         }
         _customFields.Add(field);
      }

      // Uses _shorttemplateid so that combined/boolean search queries can reference a template id.
      // Also used to create the matching criteria for update/delete actions.
      protected override void AddMatchCriteria(BooleanQuery query)
      {
         query.Add(new TermQuery(new Term(BuiltinFields.Database, Database)), BooleanClause.Occur.MUST);
         query.Add(new TermQuery(new Term(BuiltinFields.Path, Sitecore.Data.ShortID.Encode(Root).ToLowerInvariant())), BooleanClause.Occur.MUST);
         if (HasIncludes || HasExcludes)
         {
            foreach (KeyValuePair<string, bool> pair in TemplateFilter)
            {
               query.Add(new TermQuery(new Term(Constants.ShortTemplate, Sitecore.Data.ShortID.Encode(pair.Key).ToLowerInvariant())), pair.Value ? BooleanClause.Occur.SHOULD : BooleanClause.Occur.MUST_NOT);
            }
         }
      }

      // Overridden because the _hasIncludes and _hasExcludes variables were introduced in this class.
      protected override bool IsMatch(Item item)
      {
         bool flag;
         Assert.ArgumentNotNull(item, "item");
         if (!RootItem.Axes.IsAncestorOf(item))
         {
            return false;
         }
         if (!HasIncludes && !HasExcludes)
         {
            return true;
         }
         if (!TemplateFilter.TryGetValue(item.TemplateID.ToString(), out flag))
         {
            return !HasIncludes;
         }
         return flag;
      }

      // Hides the base method so that the template filter used by AddMatchCriteria is filled in.
      new public void IncludeTemplate(string templateId)
      {
         Assert.ArgumentNotNullOrEmpty(templateId, "templateId");
         _hasIncludes = true;
         _templateFilter[templateId] = true;
      }

      // Hides the base method so that the template filter used by AddMatchCriteria is filled in.
      new public void ExcludeTemplate(string templateId)
      {
         Assert.ArgumentNotNullOrEmpty(templateId, "templateId");
         _hasExcludes = true;
         _templateFilter[templateId] = false;
      }

      #endregion Base class methods

      /// <summary>
      /// Converts Sitecore date and datetime fields to a format that the Lucene API can use.
      /// </summary>
      /// <param name="doc">Lucene document object</param>
      /// <param name="fieldKey">Field name</param>
      /// <param name="fieldValue">Field value</param>
      private void IndexDateFields(Document doc, string fieldKey, string fieldValue)
      {
         DateTime dateTime = Sitecore.DateUtil.IsoDateToDateTime(fieldValue);
         string luceneDate = "";
         if (dateTime != DateTime.MinValue)
         {
            luceneDate = dateTime.ToString(Constants.DateTimeFormat);
         }
         doc.Add(CreateField(fieldKey, luceneDate, false, 1f));
      }

      /// <summary>
      /// Adds the template id in lower-case ShortID format.
      /// </summary>
      /// <param name="doc">Lucene document object</param>
      /// <param name="item">Sitecore item</param>
      private void AddShortTemplateId(Document doc, Item item)
      {
         doc.Add(CreateField(Constants.ShortTemplate, Sitecore.Data.ShortID.Encode(item.TemplateID).ToLowerInvariant(), false, 1f));
      }

      #region Properties

      protected bool HasIncludes
      {
         get
         {
            return _hasIncludes;
         }
         set
         {
            _hasIncludes = value;
         }
      }

      protected bool HasExcludes
      {
         get
         {
            return _hasExcludes;
         }
         set
         {
            _hasExcludes = value;
         }
      }

      protected Dictionary<string, bool> TemplateFilter
      {
         get
         {
            return _templateFilter;
         }
      }

      protected Item RootItem
      {
         get
         {
            return Sitecore.Data.Managers.ItemManager.GetItem(Root, Sitecore.Globalization.Language.Invariant,
                                                              Sitecore.Data.Version.Latest,
                                                              Sitecore.Data.Database.GetDatabase(Database),
                                                              Sitecore.SecurityModel.SecurityCheck.Disable);
         }
      }

      #endregion Properties

   }
}

This should solve the issue and also index Sitecore date and datetime fields in a format that Lucene can use for range queries. It also allows you to build combined and Boolean search queries.
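To make the GUID normalization concrete, here is a minimal sketch that uses only the plain .NET Guid type, not the Sitecore API. GuidTerms and its method names are hypothetical, and it assumes that the lower-cased ShortID form is the GUID's 32 hex digits without braces or hyphens. The point is that the short form has no punctuation or upper-case letters for an analyzer to alter, so the term written at index time and the term used in the match query stay identical.

```csharp
using System;

// Hypothetical helper for illustration only; not part of Sitecore or Lucene.
public static class GuidTerms
{
    // Braced, upper-case form, similar to what item.TemplateID.ToString() returns.
    public static string ToRawTerm(Guid id)
    {
        return id.ToString("B").ToUpperInvariant();
    }

    // Assumed equivalent of ShortID.Encode(...).ToLowerInvariant():
    // 32 hex digits, lower case, no braces or hyphens.
    public static string ToShortTerm(Guid id)
    {
        return id.ToString("N").ToLowerInvariant();
    }
}
```

For example, calling GuidTerms.ToShortTerm with the GUID {89783047-026C-45B5-AB5B-338E4A22446C} yields 89783047026c45b5ab5b338e4a22446c.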

Update. Code for the Constants class:

namespace LuceneExamples
{
   public class Constants
   {
      // Special field for the template id in ShortID format
      public const string ShortTemplate = "_shorttemplateid";

      // Searchable date-time format. All date and datetime field values are indexed in this format.
      public const string DateTimeFormat = "yyyyMMddHHmmss";

      // Path to the Lucene settings items: /sitecore/system/Settings/Lucene
      public const string LuceneSettingsPath = "{89783047-026C-45B5-AB5B-338E4A22446C}";
   }
}
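
The reason the "yyyyMMddHHmmss" format works for range queries is that these fixed-width strings sort in the same order as the dates they represent. The sketch below shows that with plain .NET string comparison rather than a Lucene range query. LuceneDates, ToTerm and InRange are hypothetical names used only for this illustration.

```csharp
using System;
using System.Globalization;

// Hypothetical helper for illustration only; it does not call Lucene.
public static class LuceneDates
{
    // Same format string as Constants.DateTimeFormat in the article.
    public const string DateTimeFormat = "yyyyMMddHHmmss";

    // Formats a date as a fixed-width, sortable term.
    public static string ToTerm(DateTime value)
    {
        return value.ToString(DateTimeFormat, CultureInfo.InvariantCulture);
    }

    // Inclusive range check using ordinal string comparison,
    // which is how a term range behaves on fixed-width values.
    public static bool InRange(string term, string lower, string upper)
    {
        return string.CompareOrdinal(lower, term) <= 0
            && string.CompareOrdinal(term, upper) <= 0;
    }
}
```

With this helper, the term for 20 October 2010 09:05:00 is 20101020090500, and it falls inside the range from 20101001000000 to 20101031235959.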


Hope it saves someone a minute or two.

16 comments:

Anonymous said...

Hi Ivan,

We've just run into this problem and the code you posted did the trick! Thank you so much!

One question -- you say the root cause of the issue is that "Lucene somehow recognizes raw GUID's". Could you explain a bit more what you mean by that? How were the old GUIDs breaking Lucene?

Thanks!
rusty

Gabriel Boys said...

Thanks a lot Ivan, this solved our duplicate indexing issue.

Ivan said...

Rusty,

DatabaseCrawler uses item.TemplateID.ToString() to add the template GUID. This API returns the GUID string in upper case. Lucene's StandardAnalyzer (which is used by default) parses the upper-case value in a specific way. In other words, the outcome is not a GUID anymore. That's why it breaks the search.
If you convert the GUID into a lower-case string, then it should work fine.

--Ivan

Anonymous said...

hi,

Great Article. I tried to enhance the crawler by padding the numeric values with "00" on left so that I can successfuly execute range queries on numeric fields. So to do the same, I used the same code that you have shown for DateFields, but it don't work. While debugging the code I can see field value is padded. But when I see IndexViewer the value goes unpadded. I tested this more by adding several characters at the end of the field and they also don't appear in the indexViewer. Please help me, I am stuck :(

mT said...

Hi Ivan,

Great Article, helped a lot otherwise we could have dumped Lucene approach for our work.

Posting my question again.
I tried to enhance the crawler by padding the numeric values with "00" on left so that I can successfuly execute range queries on numeric fields. So to do the same, I used the same code that you have shown for DateFields, but it don't work. While debugging the code I can see field value is padded. But when I see IndexViewer the value goes unpadded. I tested this more by adding several characters at the end of the field and they also don't appear in the indexViewer. Please help me, I am stuck :(

Thanks,
mT

Ivan said...

Try using Luke to see whether the value for the field is padded. When you add a numeric field to the index, make sure that you use the UN_TOKENIZED option.
Here is an article that explains how to treat numeric values: http://wiki.apache.org/lucene-java/SearchNumericalFields

--Ivan

mT said...

Thanks Ivan for suggesting Luke. Seems like things are working fine, I downloaded Luke and updated IndexViewer and yes things work.

I went through your article on creating Lucene indexes (Part-2) and then came to know about two kind of indexes of sitecore. Again thanks for clearing the confusion.

Thanks,
mT

Per said...

Is this issue corrected with the Sitecore CMS 6.2.0 rev.100831 (Update-4) release?

Ivan said...

Yes, it was one of the fixes of 6.2.0 Update-4.

Anonymous said...

Hi Ivan,
The moment I add the IncludeTemplate and ExcludeTemplate methods to my DatabaseCrawler it stops excluding items whose template ID's are included in the include hint="list:ExcludeTemplate" section in the web.config for the index.

Any reason that might be happening?

Thanks,
Asif

Ivan said...

Asif,

There was one method missing from the provided code. The IsMatch method should use the customized _hasIncludes, _hasExcludes and _templateFilter variables.
I've added the missing method to the code. Try it out.

joshjs said...

Hey, Ivan. I'm trying to put this code into my project and I'm seeing errors for some of the constants:

'Sitecore.Constants' does not contain a definition for 'ShortTemplate'

'Sitecore.Constants' does not contain a definition for 'DateTimeFormat'

Any idea what I might be missing?

Thanks in advance.

joshjs

Ivan said...

Hi Josh,
I updated the article with the missing code sample.
I'd recommend taking a look at the Advanced Database Crawler shared source component (http://trac.sitecore.net/AdvancedDatabaseCrawler). It's a complete solution that is quite extensible and is based on the most stable version of the Sitecore Lucene API.

joshjs said...

Many thanks. :)
