Tuesday, August 31, 2010

Adding custom fields to the index

In this post I want to show how to restore a feature that was part of the “old” Lucene index implementation but is missing from the new one. The example below customizes the Lucene search configuration so that custom fields can be added to the index.

First off, let’s create a configuration that would allow us to add additional fields to the indexed data.

<index id="News" type="Sitecore.Search.Index, Sitecore.Kernel">
  <param desc="name">$(id)</param>
  <param desc="folder">_news</param>
  <Analyzer ref="search/analyzer" />
  <locations hint="list:AddCrawler">
    <examples-news type="LuceneExamples.DatabaseCrawler,LuceneExamples">
      <Database>web</Database>
      <Root>/sitecore/content</Root>
      <IndexAllFields>true</IndexAllFields>
      <include hint="list:IncludeTemplate">
        <news>{788EF1BE-B71E-4D59-9276-50519BD4F641}</news>
        <tag>{4DD970FB-2695-4E50-96F3-A766F7D6CAF1}</tag>
      </include>
      <fields hint="raw:AddCustomField">
        <field luceneName="author" storageType="no" indexType="tokenized">__updated by</field>
        <field luceneName="changed" storageType="yes" indexType="untokenized">__updated</field>
      </fields>
    </examples-news>
  </locations>
</index>


There is a new configuration section in this example: the <fields> section, which introduces two fields, “author” and “changed”. These fields are added to the field collection of each indexed item. The AddCustomField method is called once for every <field> configuration entry and registers the custom field that will be added to that collection.


Description of configuration attributes:


  • luceneName is the field name that appears in the Lucene index.
  • storageType is the storage type of the Lucene field. It can have the following values:
    • no
    • yes
    • compress
  • indexType is the index type of the Lucene field. It can have the following values:
    • no
    • tokenized
    • untokenized
    • nonorms

Refer to the Lucene documentation to find out what each of these options means: see Field.Store for the storage options and Field.Index for the index options.
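To make the attribute handling concrete, here is a minimal, self-contained sketch of how the <field> entries above could be read and how missing or unknown values fall back to defaults (storage "no", index "tokenized", the same defaults the CustomField class in this post uses). The StoreOption and IndexOption enums and the FieldConfigDemo class are stand-ins invented for this illustration so that it runs without Sitecore or Lucene.Net; the real code maps to Field.Store and Field.Index instead.

```csharp
using System;
using System.Xml;

// Hypothetical stand-ins for Lucene.Net's Field.Store / Field.Index.
public enum StoreOption { No, Yes, Compress }
public enum IndexOption { No, Tokenized, Untokenized, NoNorms }

public static class FieldConfigDemo
{
    // Maps the storageType attribute; anything missing or unknown means "no".
    public static StoreOption ParseStore(string value)
    {
        switch ((value ?? string.Empty).ToLowerInvariant())
        {
            case "yes": return StoreOption.Yes;
            case "compress": return StoreOption.Compress;
            default: return StoreOption.No;
        }
    }

    // Maps the indexType attribute; anything missing or unknown means "tokenized".
    public static IndexOption ParseIndex(string value)
    {
        switch ((value ?? string.Empty).ToLowerInvariant())
        {
            case "no": return IndexOption.No;
            case "untokenized": return IndexOption.Untokenized;
            case "nonorms": return IndexOption.NoNorms;
            default: return IndexOption.Tokenized;
        }
    }

    // Summarizes one <field> entry: Lucene name, source field, and options.
    public static string Describe(XmlElement field)
    {
        // GetAttribute returns an empty string when the attribute is absent.
        return string.Format("{0} <- {1} (store={2}, index={3})",
            field.GetAttribute("luceneName"),
            field.InnerText.Trim(),
            ParseStore(field.GetAttribute("storageType")),
            ParseIndex(field.GetAttribute("indexType")));
    }

    public static void Main()
    {
        var doc = new XmlDocument();
        doc.LoadXml(
            "<fields>" +
            "<field luceneName=\"author\" storageType=\"no\" indexType=\"tokenized\">__updated by</field>" +
            "<field luceneName=\"changed\" storageType=\"yes\" indexType=\"untokenized\">__updated</field>" +
            "</fields>");

        // Prints:
        // author <- __updated by (store=No, index=Tokenized)
        // changed <- __updated (store=Yes, index=Untokenized)
        foreach (XmlElement field in doc.DocumentElement.SelectNodes("field"))
        {
            Console.WriteLine(Describe(field));
        }
    }
}
```

Attribute values are compared case-insensitively, so storageType="YES" and storageType="yes" are treated the same.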


Now all you need to do is loop through the collection of custom fields in the overridden AddAllFields method and add them to the indexed data.


I created a custom class called CustomField that helps to manage custom field entries. Below is the code for this class, as well as the additional methods for the extended DatabaseCrawler. Since the code for the DatabaseCrawler itself was already published in an earlier blog post, I’m not going to duplicate it here.


Here is the code for the CustomField class.


using System.Xml;
using Sitecore.Data;
using Sitecore.Data.Items;
using Sitecore.Xml;
using Lucene.Net.Documents;

namespace LuceneExamples
{
    public class CustomField
    {
        public CustomField()
        {
            FieldID = ID.Null;
            FieldName = "";
            LuceneFieldName = "";
        }

        // ID of the Sitecore field, set when the config entry contains an ID.
        public ID FieldID { get; private set; }

        // Name of the Sitecore field, set when the config entry contains a name.
        public string FieldName { get; private set; }

        public Field.Store StorageType { get; set; }

        public Field.Index IndexType { get; set; }

        // Name of the field as it appears in the Lucene index.
        public string LuceneFieldName { get; private set; }

        // Builds a CustomField from a <field> config node; returns null if the entry is invalid.
        public static CustomField ParseConfigNode(XmlNode node)
        {
            CustomField field = new CustomField();
            string fieldName = XmlUtil.GetValue(node);
            if (ID.IsID(fieldName))
            {
                field.FieldID = ID.Parse(fieldName);
            }
            else
            {
                field.FieldName = fieldName;
            }
            field.LuceneFieldName = XmlUtil.GetAttribute("luceneName", node);
            field.StorageType = GetStorageType(node);
            field.IndexType = GetIndexType(node);

            if (!IsValidField(field))
            {
                return null;
            }

            return field;
        }

        // Reads the field value from the item, by ID if available, otherwise by name.
        public string GetFieldValue(Item item)
        {
            if (!ID.IsNullOrEmpty(FieldID))
            {
                // FieldID is already an ID, so it can be used directly.
                return item[FieldID];
            }
            if (!string.IsNullOrEmpty(FieldName))
            {
                return item[FieldName];
            }
            return string.Empty;
        }

        // A field is valid when it has a source (name or ID) and a Lucene name.
        private static bool IsValidField(CustomField field)
        {
            if ((!string.IsNullOrEmpty(field.FieldName) || !ID.IsNullOrEmpty(field.FieldID)) && !string.IsNullOrEmpty(field.LuceneFieldName))
            {
                return true;
            }
            return false;
        }

        // Maps the indexType attribute to Field.Index; defaults to TOKENIZED.
        private static Field.Index GetIndexType(XmlNode node)
        {
            string indexType = XmlUtil.GetAttribute("indexType", node);
            if (!string.IsNullOrEmpty(indexType))
            {
                switch (indexType.ToLowerInvariant())
                {
                    case "no":
                        return Field.Index.NO;
                    case "tokenized":
                        return Field.Index.TOKENIZED;
                    case "untokenized":
                        return Field.Index.UN_TOKENIZED;
                    case "nonorms":
                        return Field.Index.NO_NORMS;
                }
            }
            return Field.Index.TOKENIZED;
        }

        // Maps the storageType attribute to Field.Store; defaults to NO.
        private static Field.Store GetStorageType(XmlNode node)
        {
            string storage = XmlUtil.GetAttribute("storageType", node);
            if (!string.IsNullOrEmpty(storage))
            {
                switch (storage.ToLowerInvariant())
                {
                    case "no":
                        return Field.Store.NO;
                    case "yes":
                        return Field.Store.YES;
                    case "compress":
                        return Field.Store.COMPRESS;
                }
            }
            return Field.Store.NO;
        }
    }
}


And here is the code for the additional DatabaseCrawler methods.


// These members go into the extended DatabaseCrawler class. They need:
// using System; using System.Collections.Generic; using System.Xml;
// using Lucene.Net.Documents; using Sitecore.Data.Items;

// Custom fields parsed from configuration. The declaration is not shown in the
// original listing; a List<CustomField> is assumed here.
private readonly List<CustomField> _customFields = new List<CustomField>();

/// <summary>
/// Loops through the collection of custom fields and adds them to fields collection of each indexed item.
/// </summary>
/// <param name="document">Lucene document</param>
/// <param name="item">Sitecore data item</param>
private void AddCustomFields(Document document, Item item)
{
    foreach (CustomField field in _customFields)
    {
        document.Add(CreateField(field.LuceneFieldName, field.GetFieldValue(item), field.StorageType, field.IndexType, Boost));
    }
}

/// <summary>
/// Creates a Lucene field.
/// </summary>
/// <param name="fieldKey">Field name</param>
/// <param name="fieldValue">Field value</param>
/// <param name="storeType">Storage option</param>
/// <param name="indexType">Index type</param>
/// <param name="boost">Boosting parameter</param>
/// <returns>The created Lucene field</returns>
private Fieldable CreateField(string fieldKey, string fieldValue, Field.Store storeType, Field.Index indexType, float boost)
{
    Field field = new Field(fieldKey, fieldValue, storeType, indexType);
    field.SetBoost(boost);
    return field;
}

/// <summary>
/// Parses a configuration entry for a custom field and adds it to a collection of custom fields.
/// </summary>
/// <param name="node">Configuration entry</param>
public void AddCustomField(XmlNode node)
{
    CustomField field = CustomField.ParseConfigNode(node);
    if (field == null)
    {
        throw new InvalidOperationException("Could not parse custom field entry: " + node.OuterXml);
    }
    _customFields.Add(field);
}


The last thing left to do is to call the AddCustomFields method from the overridden AddAllFields method.


protected override void AddAllFields(Document document, Item item, bool versionSpecific)
{
    ………………………………………
    AddCustomFields(document, item);
}
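Once the index has been rebuilt, the custom fields can be queried by their Lucene names. The following is only a sketch of how that might look with the Sitecore.Search API (SearchManager, IndexSearchContext, FieldQuery). The class name, method name and result limit are made up for this illustration, and the calls should be checked against your Sitecore version.

```csharp
using System.Collections.Generic;
using Sitecore.Data.Items;
using Sitecore.Search;

public static class CustomFieldSearchExample // hypothetical helper, not part of the crawler
{
    // Returns up to maxResults items whose "author" index field matches the given term.
    public static List<Item> FindByAuthor(string authorTerm, int maxResults)
    {
        var items = new List<Item>();

        // "News" is the index id from the configuration shown earlier.
        Sitecore.Search.Index index = SearchManager.GetIndex("News");
        if (index == null)
        {
            return items;
        }

        using (IndexSearchContext context = index.CreateSearchContext())
        {
            // "author" is the luceneName of the custom field.
            SearchHits hits = context.Search(new FieldQuery("author", authorTerm));
            foreach (SearchResult result in hits.FetchResults(0, maxResults))
            {
                Item item = result.GetObject<Item>();
                if (item != null)
                {
                    items.Add(item);
                }
            }
        }

        return items;
    }
}
```

Note that a full-text query would not find these values, because it searches the "_content" field rather than the custom fields, so a field-level query is needed.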


You can take it even further and add support for a field interpreter for each field configuration entry.


Hope you'll find it useful.

24 comments:

adeneys said...

Nice post Ivan. Have you found an easy way to rebuild your custom index? I've found the new style search indexes (which you've used) don't get built when you rebuild database indexes through the control panel. They only rebuild the old style indexes.
I'm currently using the new style indexes in a project and have had to handle the rebuilding of the index myself. I'd be interested to know how you deal with this.

Ivan said...

Hi Alistair,

Somehow "Rebuild the Search index" tool does not pick up custom lucene indexes from new implementation.
Good thing it's simple enough to address this issue by writing a couple of lines of code.
You can create an aspx page or even extend standard "Rebuild the Search index" app with the following code:
Sitecore.Search.Index index = SearchManager.GetIndex("index_name");
if (index != null)
{
    index.Rebuild();
}

mT said...

Ivan,
There is almost no documentation of new search indexes on sdn (if you know, plz post the link). The only hint is found in your blogs. I am using Lucene as a base for my app and I feel new search indexes in sitecore are still not matured. I am afraid these new indexes won't work on Web-farm environment. My project will be deployed on a web-farm env. and seems like I had to follow the old one only. Old is gold? How reliable are the new indexes, in your opinion?

mT

Ivan said...

Manish,

Not too long ago we created a shared source component that shows how new index works. It has pretty good documentation on it. You can find it here: http://trac.sitecore.net/BidirectionalLuceneSearch

From my experience, the new Lucene search implementation is more consistent and stable. It's easier to work with and to customize in case you need it.

It works with a web-farm setup as well. Every node in the farm will have its own copy of the indexes. It should even be possible to maintain the index on one node and share it with the others in read-only mode.

mT said...

Ivan,
From customization perspective new indexes definitely looks promising but I still can't find a good documentation on the complete setup. BidirectionalLuceneSearch module has a good documentation and can be used for reference while coding new indexes. My biggest fear still lies in the web-farm environment. Please point me to a link where I can read about web-farm configuration for new indexes.

Ivan said...

Manish,

We don't have any official article on this topic yet. I'll try to put things together and publish some information on my blog.
In short, it works exactly the same as the old index. Re-indexing is done on the CD server through the HistoryEngine functionality.

Anonymous said...

What version of Sitecore is this for? I have version 6.1. When I use the config entry in this article, it hangs the website. When I comment out the crawler section, it throws an error: "Could not find add method: AddIndex (type: Sitecore.Data.Database)".

Ivan said...

It was developed and tested on Sitecore 6.2 but it should work on any 6 version higher than 6.0.x. Make sure that you added index configuration under /sitecore/search/configuration/indexes section.

Eugene Novikov said...

Ivan,
I am looking for advice related to crawler/indexing in SiteCore 6.0.
We have a multi-language web site in SiteCore 6.0, and our existing crawler (based on the Lucene API) was written by another company. This crawler runs once per day as a scheduled application, but it takes more than 1 hour and a lot of resources.
Question is - can we use in our SiteCore 6.0 internal Lucene indexing/crawling mechanism for our multi-language web site or can you suggest any other solutions?

Eugene

Ivan said...

Hi Eugene,

I would question why do you need your app to re-index all data completely once a day. If all information that gets indexed comes from Sitecore items, then there is no need to rebuild index entirely (at least for Sitecore 6.2 Update-5, 6.3.1, 6.4.1 or higher). All prior versions of Sitecore had some indexing issues that kept information in the index after it got removed from within Sitecore.
I would recommend you to estimate upgrade process to lift your Sitecore solution up to at least 6.2 Update-5 version.

Another option that you can use with your current implementation is to configure crawler to index only required data. If you run it on the whole content tree and there are items that are not required in the index, you can omit them by tweaking index configuration. This post will give you an idea how to configure it: http://sitecoregadgets.blogspot.com/2009/11/working-with-lucene-search-index-in.html.
Note that I'm referring to "new" Lucene index integration introduced in Sitecore 6.
There is an example with proper documentation that explains how to work with this integration: http://trac.sitecore.net/BidirectionalLuceneSearch
Also Alex made a great project that extends standard database crawler: http://trac.sitecore.net/AdvancedDatabaseCrawler

Eugene Novikov said...

Ivan, can you clarify please, I am missing something.
Lets say content manager updated some page with new text, on publishing all this text will be stored in SiteCore DB and incremental indexing "immediately" will update some indexes, so external user who searching our web site will see new text in his search result.
If it is correct I do not understand why RoundCube build external crawler/indexer and this crawler request every our page, parse content and indexing it (Lucene IndexWriter), so it takes long time.
We have code of this crawler but can not explain why it was done like that. Unfortunately people who did it long time gone. One of our theory is because SiteCore 6.0 does not have this capabilities for our multi-language environment?

Ivan said...

One of the reasons why it could be done that way is that there might be some external content showing up on the page that does not come from Sitecore but has to be a part of the indexed data. Another thing that I mentioned before is that Sitecore 6.0 has some issues with keeping old data in the index and the only way to get rid of those is to rebuild index entirely. Still if the first case does not apply, I don't know why would they want to request a page externally and index it.
There should be a reason for that if it was done that way.
If there is no external data on the pages, try to setup a test environment and configure indexing through default Sitecore functionality. Test it out and see how it performs and whether all data are in there. Just use the "new" index that is located at /sitecore/search/configuration section in web.config file.

Eugene Novikov said...

Thank you Ivan, I will try internal indexing.

Liz at Verndale said...

Hi Ivan, we are using the Advanced DB Crawler for a site we are building for a client. It has been working great so far.

A need came up to be able to search the contents of a pdf as well as the sitecore fields, which is something we had done in the past with the older way of Lucene indexing. With the Advanced DB Crawler, we knew we would have to implement it in a different way, which is how we ended up here.

We tried implementing pretty much exactly what you have here, with the contents of the custom field coming from contents of the pdf attached to the sitecore item. All of the code appears to run fine when we rebuild the index. In the Sitecore.SharedSource.Search.config file, we simply added this line:


pdfContents

And when I open up the .cfs file, I can find the keyword that is in the pdf.

But when I try to run a Full Text Query for the keyword using DemoPage1, I get no results. And if I try to search the custom field specifically, using DemoPage3, entering pdfContents and the keyword into the Field 1 search parameters, I also get no results.

Do you have any suggestions for troubleshooting this or pointers as to what I might be missing?

Thanks in advance!
Liz

Liz at Verndale said...

<fields hint="raw:AddCustomField"><field luceneName="pdfContents" storageType="YES" indexType="TOKENIZED">pdfContents</field></fields>

Sorry, trying again to upload the line from the web.config.

Ivan said...

Hi Liz,

As far as I understood you used a custom field to store PDF content in it. If so, in this case FullTextQuery cannot be used as it runs search against the "_content" field.
If you store PDF content in "pdfContents" field, you have to use FieldQuery or a Lucene query (e.g. TermQuery). You can also test your queries in Luke tool which you can get from here: http://code.google.com/p/luke/.
One more thing, when you search "pdfContents" field try to spell it in lower-case. DatabaseCrawler uses field.Key property when it adds the field to the search index document. I don't expect it to cause an issue but I never tried it before. So, it could be.

Liz at Verndale said...

Ivan, thanks so much for your response!

Finding out that FullTextQuery searches only the _content field is good to know and the Luke tool is really helpful. I was able to use the tool to see that the pdfContents field was definitely being populated correctly.

That allowed me to focus on figuring out why querying on the pdfContents field directly with FieldQuery via the DemoPage3 wasn't working. Instead of just having to change 'pdfContents' to lowercase when searching, I also had to change the name I give it in the config file. This is because the Searcher.cs file ApplyRefinements method does this to the search parameter:

var fieldName = refinement.Key.ToLowerInvariant();

so no matter what I searched on, it changed it to lower, eg: 'pdfcontents', and therefore wasn't finding it in the index, as the custom field was set to 'pdfContents' in the config file. Once I changed the custom field to be all lowercase in the config file and rebuilt the index, the field query search worked fine.

Two quick questions about all of this:
1. I found that I could also add the pdf contents to '_content' and that it would then find it using the FullTextQuery. It seems this simply appends the pdf contents onto what is already in there, thereby allowing me to continue to search on the regular content values as well. Do you foresee any problems by doing this?

2. Since pdfs typically have a lot of data in them, do you think the storagetype should be changed to 'compress'? I find that if I do this, it still seems to search successfully, however the Luke tool has some problems opening up the indexes and searching.

Thanks, Liz

Ivan said...

Good to hear that it helped you solving the problem.
1. I can't see any issues with adding PDF content to the _content field. If you are OK with having PDF data merged with other text fields from the media item, then go for it.
2. Compress option will minimize space required for the indexed data. Keep in mind that searching compressed data requires the data to be uncompressed first. Run tests to see if searching compressed fields does not compromise performance.
I noticed that Luke util doesn't show compressed fields properly. I think it was compiled with older DLL that had different compression API or did not have it at all. It's still possible to see hits in Luke when you search compressed data.

Michael said...

Hi Liz

can you describe how you added the contents of the pdf file? In my case I'd like to add it to the _content field.. but I just like to know in general how you got the contents out of the pdf and into the index...

cheers
Michael

Ivan said...

Michael,

If you store your PDF files in Sitecore database, as items of the media library, then you can get a stream of the media item that contains media content.
Example:
Stream content = mediaItem.GetMediaStream();

Then you can extract the textual information from the media stream using one of PDF readers. I used ABCpdf .NET library in the past. I'm sure there is a bunch of alternative libraries including open source ones.

--Ivan

Michael said...

thanks but I might not have been clear enough.. I asked myself what the right changes to the config file would be, so the field gets added to the _content.

I found a solution and just document it here - so others can reuse it...

1. set root above media library (/sitecore)
2. add a fieldcrawler for the fieldType Attribute
3. add a fieldtype with name=attachment storageType=NO indexType=TOKENIZED vectorType=NO
4. in the fieldcrawler class override GetValue and use any kind of pdf parser to return the contents as string

or am I missing something? seems to work ;)

thanks anyway for the tip with the parsers!

cheers
Michael

GW said...

Hi

I'd like to develop a geocoded/proximity 'find my nearest' type search. I see that this is possible with Lucene.net directly so I am assuming that it's possible via Sitecore also.

Can you provide any hints and tips on how to achieve this functionality?

Thanks

Anonymous said...

Any chance you can update your style sheet to avoid the dark blue on black - almost impossible to read.

Good content though - thanks!

Egemen Bagislar said...

Hi,
nice post!
I have a question. Is it possible to exclude the html tags of the rich text field from the index. For example, a search for "background-color" should not retrieve a result because of the html attribute in a rich text field.

Thanx