Comments on Sitecore Gadgets: Adding custom fields to the index
Feed author: Ivan (http://www.blogger.com/profile/09998430037866709466)
Feed updated: 2024-03-27T18:06:31.934-07:00

[2013-08-19T05:27:02.346-07:00]
Hi,
nice post!
I have a question. Is it possible to exclude the html tags of the rich text field from the index. For example, a search for "background-color" should not retrieve a result because of the html attribute in a rich text field.

Thanx
Posted by Egemen Bagislar

[2012-07-11T03:49:51.077-07:00]
Any chance you can update your style sheet to avoid the dark blue on black - almost impossible to read.

Good content though - thanks!
Posted by Anonymous

[2011-09-22T07:42:11.406-07:00]
Hi

I'd like to develop a geocoded/proximity 'find my nearest' type search. I see that this is possible with Lucene.net directly so I am assuming that it's possible via Sitecore also.

Can you provide any hints and tips on how to achieve this functionality?

Thanks
Posted by GW

[2011-09-07T00:32:03.389-07:00]
thanks but I might not have been clear enough.. I asked myself what the right changes to the config file would be, so the field gets added to the _content.

I found a solution and just document it here - so others can reuse it...

1. set root above media library (/sitecore)
2. add a fieldcrawler for the fieldType Attribute
3. add a fieldtype with name=attachment storageType=NO indexType=TOKENIZED vectorType=NO
4. in the fieldcrawler class override GetValue and use any kind of pdf parser to return the contents as string

or am I missing something? seems to work ;)

thanks anyway for the tip with the parsers!

cheers
Michael
Posted by Michael

[2011-09-06T14:50:17.737-07:00]
Michael,

If you store your PDF files in Sitecore database, as items of the media library, then you can get a stream of the media item that contains media content.
Example:
 Stream content = mediaItem.GetMediaStream();

Then you can extract the textual information from the media stream using one of PDF readers. I used ABCpdf .NET library in the past. I'm sure there is a bunch of alternative libraries including open source ones.

--Ivan
Posted by Ivan

[2011-09-06T07:59:41.907-07:00]
Hi Liz

can you describe how you added the contents of the pdf file? In my case I'd like to add it to the _content field.. but I just like to know in general how you got the contents out of the pdf and into the index...

cheers
Michael
Posted by Michael

[2011-08-04T14:08:24.658-07:00]
Good to hear that it helped you solve the problem.
1. I can't see any issues with adding PDF content to the _content field. If you are OK with having PDF data merged with other text fields from the media item, then go for it.
2. The Compress option will minimize the space required for the indexed data. Keep in mind that searching compressed data requires the data to be uncompressed first. Run tests to see whether searching compressed fields compromises performance.
I noticed that the Luke util doesn't show compressed fields properly. I think it was compiled with an older DLL that had a different compression API or did not have it at all. It's still possible to see hits in Luke when you search compressed data.
Posted by Ivan

[2011-08-04T08:24:45.934-07:00]
Ivan, thanks so much for your response!

Finding out that FullTextQuery searches only the _content field is good to know and the Luke tool is really helpful. I was able to use the tool to see that the pdfContents field was definitely being populated correctly.

That allowed me to focus on figuring out why querying on the pdfContents field directly with FieldQuery via the DemoPage3 wasn't working. Instead of just having to change 'pdfContents' to lowercase when searching, I also had to change the name I give it in the config file. This is because the Searcher.cs file ApplyRefinements method does this to the search parameter:

var fieldName = refinement.Key.ToLowerInvariant();

so no matter what I searched on, it changed it to lower, eg: 'pdfcontents', and therefore wasn't finding it in the index, as the custom field was set to 'pdfContents' in the config file. Once I changed the custom field to be all lowercase in the config file and rebuilt the index, the field query search worked fine.

Two quick questions about all of this:
1. I found that I could also add the pdf contents to '_content' and that it would then find it using the FullTextQuery. It seems this simply appends the pdf contents onto what is already in there, thereby allowing me to continue to search on the regular content values as well. Do you foresee any problems by doing this?

2. Since pdfs typically have a lot of data in them, do you think the storagetype should be changed to 'compress'? I find that if I do this, it still seems to search successfully, however the Luke tool has some problems opening up the indexes and searching.

Thanks, Liz
Posted by Liz at Verndale

[2011-08-02T19:33:32.419-07:00]
Hi Liz,

As far as I understood you used a custom field to store PDF content in it. If so, FullTextQuery cannot be used in this case, as it runs the search against the "_content" field.
If you store PDF content in the "pdfContents" field, you have to use FieldQuery or a Lucene query (e.g. TermQuery). You can also test your queries in the Luke tool which you can get from here: http://code.google.com/p/luke/.
One more thing, when you search the "pdfContents" field try to spell it in lower-case. DatabaseCrawler uses the field.Key property when it adds the field to the search index document. I don't expect it to cause an issue but I never tried it before. So, it could be.
Posted by Ivan

[2011-08-02T07:52:12.377-07:00]
<fields hint="raw:AddCustomField"><field luceneName="pdfContents" storageType="YES" indexType="TOKENIZED">pdfContents</field></fields>

Sorry, trying again to upload the line from the web.config.
Posted by Liz at Verndale

[2011-08-02T07:49:48.954-07:00]
Hi Ivan, we are using the Advanced DB Crawler for a site we are building for a client. It has been working great so far.

A need came up to be able to search the contents of a pdf as well as the sitecore fields, which is something we had done in the past with the older way of Lucene indexing.
With the Advanced DB Crawler, we knew we would have to implement it in a different way, which is how we ended up here.

We tried implementing pretty much exactly what you have here, with the contents of the custom field coming from contents of the pdf attached to the sitecore item. All of the code appears to run fine when we rebuild the index. In the Sitecore.SharedSource.Search.config file, we simply added this line:

pdfContents

And when I open up the .cfs file, I can find the keyword that is in the pdf.

But when I try to run a Full Text Query for the keyword using DemoPage1, I get no results. And if I try to search the custom field specifically, using DemoPage3, entering pdfContents and the keyword into the Field 1 search parameters, I also get no results.

Do you have any suggestions for troubleshooting this or pointers as to what I might be missing?

Thanks in advance!
Liz
Posted by Liz at Verndale

[2011-02-10T18:26:42.970-08:00]
Thank you Ivan, I will try internal indexing.
Posted by Eugene Novikov

[2011-02-10T16:56:47.144-08:00]
One of the reasons why it could be done that way is that there might be some external content showing up on the page that does not come from Sitecore but has to be a part of the indexed data. Another thing that I mentioned before is that Sitecore 6.0 has some issues with keeping old data in the index and the only way to get rid of those is to rebuild index entirely.
Still, if the first case does not apply, I don't know why they would want to request a page externally and index it.
There should be a reason for that if it was done that way.
If there is no external data on the pages, try to set up a test environment and configure indexing through default Sitecore functionality. Test it out and see how it performs and whether all data are in there. Just use the "new" index that is located at /sitecore/search/configuration section in web.config file.
Posted by Ivan

[2011-02-10T08:06:27.022-08:00]
Ivan, can you clarify please, I am missing something.
Let's say a content manager updated some page with new text; on publishing, all this text will be stored in the Sitecore DB and incremental indexing will "immediately" update some indexes, so an external user who is searching our web site will see the new text in his search results.
If that is correct, I do not understand why RoundCube built an external crawler/indexer that requests every one of our pages, parses the content and indexes it (Lucene IndexWriter), so it takes a long time.
We have the code of this crawler but cannot explain why it was done like that. Unfortunately the people who did it are long gone. One of our theories is that Sitecore 6.0 does not have these capabilities for our multi-language environment?
Posted by Eugene Novikov

[2011-02-09T22:18:07.771-08:00]
Hi Eugene,

I would question why do you need your app to re-index all data completely once a day. If all information that gets indexed comes from Sitecore items, then there is no need to rebuild index entirely (at least for Sitecore 6.2 Update-5, 6.3.1, 6.4.1 or higher). All prior versions of Sitecore had some indexing issues that kept information in the index after it got removed from within Sitecore.
I would recommend you to estimate upgrade process to lift your Sitecore solution up to at least 6.2 Update-5 version.

Another option that you can use with your current implementation is to configure crawler to index only required data. If you run it on the whole content tree and there are items that are not required in the index, you can omit them by tweaking index configuration. This post will give you an idea how to configure it: http://sitecoregadgets.blogspot.com/2009/11/working-with-lucene-search-index-in.html.
Note that I'm referring to "new" Lucene index integration introduced in Sitecore 6.
There is an example with proper documentation that explains how to work with this integration: http://trac.sitecore.net/BidirectionalLuceneSearch
Also Alex made a great project that extends standard database crawler: http://trac.sitecore.net/AdvancedDatabaseCrawler
Posted by Ivan

[2011-02-08T21:40:37.718-08:00]
Ivan,
I am looking for advice related to crawler/indexing in Sitecore 6.0.
We have a multi-language web site in Sitecore 6.0 and our existing crawler (based on the Lucene API) was written by another company. This crawler runs once per day as a scheduled application, but it takes more than 1 hour and a lot of resources.
Question is - can we use the internal Lucene indexing/crawling mechanism of our Sitecore 6.0 for our multi-language web site, or can you suggest any other solutions?

Eugene
Posted by Eugene Novikov

[2010-11-02T18:03:57.371-07:00]
It was developed and tested on Sitecore 6.2 but it should work on any 6 version higher than 6.0.x. Make sure that you added index configuration under /sitecore/search/configuration/indexes section.
Posted by Ivan

[2010-11-02T09:38:31.173-07:00]
What version of Sitecore is this for? I have version 6.1. When I use the config entry in this article, it hangs the website. When I comment out the crawler section, it throws an error "Could not find add method: AddIndex (type: Sitecore.Data.Database)"
Posted by Anonymous

[2010-09-15T14:38:24.793-07:00]
Manish,

We don't have any official article on this topic yet. I'll try to put things together and publish some information on my blog.
In short, it works exactly the same as the old index. Re-indexing is done on the CD server through HistoryEngine functionality.
Posted by Ivan

[2010-09-02T20:49:24.470-07:00]
Ivan,
From customization perspective new indexes definitely looks promising but I still can't find a good documentation on the complete setup. BidirectionalLuceneSearch module has a good documentation and can be used for reference while coding new indexes. My biggest fear still lies in the web-farm environment. Please point me to a link where I can read about web-farm configuration for new indexes.
Posted by mT

[2010-09-02T09:21:14.489-07:00]
Manish,

Not too long ago we created a shared source component that shows how the new index works. It has pretty good documentation on it. You can find it here: http://trac.sitecore.net/BidirectionalLuceneSearch

From my experience the new Lucene search implementation is more consistent and stable. It's easier to work with and to customize in case you need it.

It works with a web-farm setup as well. Every node in the farm will have its own copy of the indexes. It should even be possible to maintain the index on one node and share it with others in read-only mode.
Posted by Ivan

[2010-09-02T02:48:45.195-07:00]
Ivan,
There is almost no documentation of new search indexes on sdn (if you know, plz post the link). The only hint is found in your blogs. I am using Lucene as a base for my app and I feel new search indexes in sitecore are still not matured. I am afraid these new indexes won't work on Web-farm environment. My project will be deployed on a web-farm env. and seems like I had to follow the old one only. Old is gold? How reliable are the new indexes, in your opinion?

mT
Posted by mT

[2010-09-01T15:59:20.765-07:00]
Hi Alistair,

Somehow the "Rebuild the Search index" tool does not pick up custom Lucene indexes from the new implementation.
Good thing it's simple enough to address this issue by writing a couple of lines of code.
You can create an aspx page or even extend the standard "Rebuild the Search index" app with the following code:
Sitecore.Search.Index index = SearchManager.GetIndex("index_name");
if (index != null)
{
    index.Rebuild();
}
Posted by Ivan

[2010-09-01T15:08:24.409-07:00]
Nice post Ivan. Have you found an easy way to rebuild your custom index? I've found the new style search indexes (which you've used) don't get built when you rebuild database indexes through the control panel. They only rebuild the old style indexes.
I'm currently using the new style indexes in a project and have had to handle the rebuilding of the index myself. I'd be interested to know how you deal with this.
Posted by Anonymous