The Typed Index
Besides the challenges of scaling search, there are important aspects of search quality that should not be neglected. Search quality is mainly controlled by analysis and query parsing. I will discuss frequent problems in Lucene analysis and show with several examples that a single analyzer (one normal form) is usually not sufficient, even in a monolingual environment and especially in multilingual ones. I will show how our typed-index approach allows us to solve many of these problems, and how we plan to properly handle even mixed-language documents based on this approach.
If you want to search in a multilingual environment with high-quality, language-specific word normalization, you soon realize that you need different types of terms. For example, you cannot use morphologically normalized or stemmed terms for every kind of search: wildcard and fuzzy search need terms that have not been normalized at all, or only slightly. Phonetic search adds yet another normal form. Sometimes search should be case-sensitive (e.g. to distinguish a search for the company "MAN" from the word "man"), sometimes it shouldn't. Semantic search (search for persons, organizations and places) may add still another type of term.

Usually, different fields are used to separate different types of terms. I will show that putting all types of terms into the same field, marking each type with e.g. a prefix, has an important advantage: it lets you use the important information about their relative positions. I even challenge the standard approach of having different fields for different languages, because it requires considerable configuration effort and because it prevents a reasonable treatment of mixed-language documents.

At IntraFind we decided to implement a language chunker based on Ted Dunning's paper "Statistical Identification of Language" and to include it in our next-generation linguistic analyzer for Lucene/Solr/Elasticsearch. It identifies chunks of the same language within a text and delegates their analysis to a specified language-specific analyzer, which might be a high-quality IntraFind morphological analyzer, a third-party analyzer, or one of the high-quality Lucene open-source analyzers, e.g. for Chinese and Japanese. Terms of different languages are distinguished by different prefixes (types). In this way we can provide an easy-to-use and very powerful linguistic analyzer that requires almost no configuration effort.
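To make the typed-index idea concrete, here is a minimal, language-agnostic sketch in Python. The prefixes (`x#`, `n#`, `p#`) and the toy normalizers are illustrative assumptions, not IntraFind's actual scheme; the point is that several term types share one field and one position, so phrase and proximity queries keep working across types.

```python
# Sketch of the typed-index idea: one field, several term types per token,
# each type marked by a short prefix. Prefixes and normalizers are
# hypothetical stand-ins, not IntraFind's real implementation.

def exact(token):
    # unnormalized form, e.g. for wildcard/fuzzy and case-sensitive search
    return "x#" + token

def normalized(token):
    # trivial stand-in for morphological normalization / stemming
    return "n#" + token.lower()

def phonetic(token):
    # trivial stand-in for a phonetic code: uppercase, vowels dropped
    return "p#" + "".join(c for c in token.upper() if c not in "AEIOU")

def analyze(text):
    """Emit (position, term) pairs. All term types of a token share the
    same position, so their relative positions are preserved."""
    out = []
    for pos, token in enumerate(text.split()):
        for term in (exact(token), normalized(token), phonetic(token)):
            out.append((pos, term))
    return out

terms = analyze("MAN builds trucks")
```

A case-sensitive query would match against the `x#` terms (distinguishing "MAN" from "man"), while a normal query uses the `n#` terms; because both live at the same position, a phrase query can even mix types.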
You simply index your content and don't have to care about the language. Time permitting, I would like to add a deep dive into Lucene analysis, covering position increments and position length and why they are important, e.g. for analyzers that decompose terms, such as Lucene's WordDelimiterFilter and IntraFind's German analyzer, which also performs word decompounding.
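The role of position increment and position length for decompounding can be sketched as follows. This is a simplified model of the attributes a Lucene token stream carries, with a one-entry toy lexicon standing in for real German decompounding; it is not the actual analyzer code.

```python
# Sketch: why decompounding needs position increment and position length.
# Each emitted token is (term, pos_inc, pos_len). The lexicon below is a
# hypothetical one-entry stand-in for a real decompounding dictionary.

PARTS = {"Haustür": ["Haus", "Tür"]}  # "front door" = house + door

def decompound(tokens):
    """The whole compound advances the position by 1 and spans its parts
    via pos_len; the first part gets pos_inc=0 (same position as the
    compound), later parts pos_inc=1. Phrase queries over the parts
    ("Haus Tür") then match at the right positions."""
    out = []
    for token in tokens:
        parts = PARTS.get(token)
        if parts:
            out.append((token, 1, len(parts)))          # compound spans parts
            for i, part in enumerate(parts):
                out.append((part, 0 if i == 0 else 1, 1))
        else:
            out.append((token, 1, 1))
    return out

stream = decompound(["die", "Haustür"])
```

Getting these two attributes wrong is a common source of broken phrase and highlighting behavior with decomposing filters, which is why they deserve a closer look.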