Morphology preprocessors can be applied to words during indexing to normalize different forms of the same word and improve segmentation. For example, an English stemmer can normalize “dogs” and “dog” to “dog”, resulting in identical search results for both keywords.
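Manticore has four built-in kinds of morphology preprocessors: lemmatizers, stemmers, phonetic algorithms (Soundex and Metaphone), and word-breaking algorithms (Chinese segmentation via ICU).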
morphology = morphology1[, morphology2, ...]
The morphology directive specifies a list of morphology preprocessors to apply to the words being indexed. This is an optional setting, with the default being no preprocessor applied.
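Manticore comes with built-in morphological preprocessors for:
* Russian, English, and German lemmatization (a Ukrainian lemmatizer is available as a separate install)
* English, Russian, Czech, and Arabic stemming
* Soundex and Metaphone phonetic algorithms
* Chinese word segmentation via ICU
* additional languages through libstemmer stemmers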
Lemmatizers require dictionary .pak files, which can be downloaded from the Manticore website. Place the dictionaries in the directory specified by lemmatizer_base. Additionally, the lemmatizer_cache setting can be used to speed up lemmatization by spending more RAM on an uncompressed dictionary cache.
Chinese text segmentation can be performed using ICU. It provides more precise segmentation than n-grams but is slightly slower. The charset_table must include all Chinese characters, which can be done by using the “cjk” alias. When “morphology=icu_chinese” is specified, documents are first pre-processed by ICU. The result is then processed by the tokenizer according to the charset_table, and finally any other morphology processors listed in the “morphology” option are applied. Only the parts of the text that contain Chinese are passed to ICU for segmentation; the rest can be handled by other means, such as other morphology processors or the charset_table.
The built-in English and Russian stemmers are faster than their libstemmer counterparts but may produce slightly different results.
The Soundex implementation matches that of MySQL. The Metaphone implementation is based on the Double Metaphone algorithm and indexes the primary code.
To use the morphology option, specify one or more of the built-in options:
* none - do not perform any morphology processing
* lemmatize_ru - apply the Russian lemmatizer and pick a single root form
* lemmatize_uk - apply the Ukrainian lemmatizer and pick a single root form (install it first in CentOS or Ubuntu/Debian). For the lemmatizer to work correctly, make sure the Ukrainian-specific characters are preserved in your charset_table, since by default they are not. To do that, override them like this: charset_table='non_cjk,U+0406->U+0456,U+0456,U+0407->U+0457,U+0457,U+0490->U+0491,U+0491'. Here is an interactive course about how to install and use the Ukrainian lemmatizer.
* lemmatize_en - apply the English lemmatizer and pick a single root form
* lemmatize_de - apply the German lemmatizer and pick a single root form
* lemmatize_ru_all - apply the Russian lemmatizer and index all possible root forms
* lemmatize_uk_all - apply the Ukrainian lemmatizer and index all possible root forms. Find the installation links above and take care of the charset_table.
* lemmatize_en_all - apply the English lemmatizer and index all possible root forms
* lemmatize_de_all - apply the German lemmatizer and index all possible root forms
* stem_en - apply Porter’s English stemmer
* stem_ru - apply Porter’s Russian stemmer
* stem_enru - apply Porter’s English and Russian stemmers
* stem_cz - apply the Czech stemmer
* stem_ar - apply the Arabic stemmer
* soundex - replace keywords with their SOUNDEX code
* metaphone - replace keywords with their METAPHONE code
* icu_chinese - apply Chinese text segmentation using ICU
* libstemmer_* - refer to the list of supported languages for details
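To see which letters the charset_table override for the Ukrainian lemmatizer preserves, the rules can be decoded with a few lines of Python. This is only an illustrative helper: `decode_charset_rules` is a hypothetical name, not part of Manticore, and it handles just the `U+XXXX` and `U+XXXX->U+YYYY` forms used in that example, skipping aliases such as `non_cjk`.

```python
def decode_charset_rules(charset_table: str):
    """Decode `U+XXXX` and `U+XXXX->U+YYYY` rules into (source, target) pairs.

    Aliases such as `non_cjk` are skipped; ranges are not handled in this sketch.
    """
    pairs = []
    for rule in charset_table.split(","):
        rule = rule.strip()
        if not rule.startswith("U+"):
            continue  # an alias, e.g. non_cjk
        src, _, dst = rule.partition("->")
        dst = dst or src  # a bare code point maps to itself
        pairs.append((chr(int(src[2:], 16)), chr(int(dst[2:], 16))))
    return pairs


rules = "non_cjk,U+0406->U+0456,U+0456,U+0407->U+0457,U+0457,U+0490->U+0491,U+0491"
for src, dst in decode_charset_rules(rules):
    print(f"{src} -> {dst}")
```

The decoded pairs show that the override folds the uppercase letters І, Ї, and Ґ to their lowercase forms and keeps the lowercase forms as they are.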
Multiple stemmers can be specified, separated by commas. They are applied to incoming words in the order they are listed, and processing stops as soon as one of the stemmers modifies the word. Additionally, when the wordforms feature is enabled, the word is looked up in the word forms dictionary first. If there is a matching entry in the dictionary, stemmers are not applied at all, so wordforms can be used to implement stemming exceptions.
CREATE TABLE products(title text, price float) morphology = 'stem_en, libstemmer_sv'
POST /cli -d "CREATE TABLE products(title text, price float) morphology = 'stem_en, libstemmer_sv'"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'morphology' => 'stem_en, libstemmer_sv'
]);
utilsApi.sql('CREATE TABLE products(title text, price float) morphology = \'stem_en, libstemmer_sv\'')
res = await utilsApi.sql('CREATE TABLE products(title text, price float) morphology = \'stem_en, libstemmer_sv\'');
utilsApi.sql("CREATE TABLE products(title text, price float) morphology = 'stem_en, libstemmer_sv'");
table products {
morphology = stem_en, libstemmer_sv
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}
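The ordering rule for multiple stemmers (wordforms first, then stemmers in the listed order until one modifies the word) can be sketched in Python. This is a conceptual model only: the two toy stemmers are hypothetical stand-ins with a single made-up rule each, not the real stem_en or libstemmer_sv, and `normalize` is not a Manticore function.

```python
from typing import Callable, Dict, List

Stemmer = Callable[[str], str]


def toy_stem_en(word: str) -> str:
    """Toy English rule: strip a trailing 's' (not the real Porter stemmer)."""
    return word[:-1] if word.endswith("s") and len(word) > 2 else word


def toy_stem_sv(word: str) -> str:
    """Toy Swedish rule: strip a trailing 'arna' (not the real libstemmer)."""
    return word[:-4] if word.endswith("arna") else word


def normalize(word: str, stemmers: List[Stemmer], wordforms: Dict[str, str]) -> str:
    # Wordforms are consulted first; a hit bypasses all stemmers.
    if word in wordforms:
        return wordforms[word]
    # Stemmers run in the listed order; the first one to change the word wins.
    for stem in stemmers:
        result = stem(word)
        if result != word:
            return result
    return word


chain = [toy_stem_en, toy_stem_sv]
wordforms = {"gps": "gps"}  # stemming exception: keep "gps" untouched

print(normalize("dogs", chain, wordforms))      # dog  (first stemmer changed it)
print(normalize("hundarna", chain, wordforms))  # hund (first stemmer left it alone)
print(normalize("gps", chain, wordforms))       # gps  (wordforms entry wins)
```

Because processing stops at the first stemmer that changes a word, the order in the morphology list matters when two stemmers could both modify the same word.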
morphology_skip_fields = field1[, field2, ...]
A list of fields for which morphology preprocessing is skipped. Optional, default is empty (apply preprocessors to all fields).
CREATE TABLE products(title text, name text, price float) morphology_skip_fields = 'name' morphology = 'stem_en'
POST /cli -d "
CREATE TABLE products(title text, name text, price float) morphology_skip_fields = 'name' morphology = 'stem_en'"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'name'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'morphology_skip_fields' => 'name',
'morphology' => 'stem_en'
]);
utilsApi.sql('CREATE TABLE products(title text, name text, price float) morphology_skip_fields = \'name\' morphology = \'stem_en\'')
res = await utilsApi.sql('CREATE TABLE products(title text, name text, price float) morphology_skip_fields = \'name\' morphology = \'stem_en\'');
utilsApi.sql("CREATE TABLE products(title text, name text, price float) morphology_skip_fields = 'name' morphology = 'stem_en'");
table products {
morphology_skip_fields = name
morphology = stem_en
type = rt
path = tbl
rt_field = title
rt_field = name
rt_attr_uint = price
}
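The effect of morphology_skip_fields can be illustrated with a small Python model. It assumes a toy stemmer with a single made-up rule and whitespace tokenization; `index_terms` is a hypothetical helper that only shows which terms would differ per field, not how Manticore indexes internally.

```python
def toy_stem(word: str) -> str:
    """Toy rule standing in for a real stemmer: strip a trailing 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 2 else word


def index_terms(doc: dict, skip_fields: set) -> dict:
    """Return the terms that would be indexed for each field."""
    terms = {}
    for field, text in doc.items():
        words = text.lower().split()
        if field in skip_fields:
            terms[field] = words  # morphology skipped: words kept as-is
        else:
            terms[field] = [toy_stem(w) for w in words]
    return terms


doc = {"title": "Running shoes", "name": "Adidas Originals"}
print(index_terms(doc, skip_fields={"name"}))
print(index_terms(doc, skip_fields=set()))
```

With `name` skipped, the brand name is indexed unchanged, while without the skip the toy stemmer would truncate both words in it. This is the typical reason to exclude fields that hold names or codes.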
min_stemming_len = length
Minimum word length at which to enable stemming. Optional, default is 1 (stem everything).
Stemmers are not perfect and sometimes produce undesired results. For instance, running the keyword “gps” through the Porter stemmer for English results in “gp”, which is not the intent. The min_stemming_len feature lets you suppress stemming based on the source word length, i.e. avoid stemming words that are too short. Keywords shorter than the given threshold will not be stemmed. Note that keywords that are exactly as long as the threshold will be stemmed, so to avoid stemming 3-character keywords you should specify 4 as the value. For more fine-grained control, refer to the wordforms feature.
CREATE TABLE products(title text, price float) min_stemming_len = '4' morphology = 'stem_en'
POST /cli -d "
CREATE TABLE products(title text, price float) min_stemming_len = '4' morphology = 'stem_en'"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'min_stemming_len' => '4',
'morphology' => 'stem_en'
]);
utilsApi.sql('CREATE TABLE products(title text, price float) min_stemming_len = \'4\' morphology = \'stem_en\'')
res = await utilsApi.sql('CREATE TABLE products(title text, price float) min_stemming_len = \'4\' morphology = \'stem_en\'');
utilsApi.sql("CREATE TABLE products(title text, price float) min_stemming_len = '4' morphology = 'stem_en'");
table products {
min_stemming_len = 4
morphology = stem_en
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}
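The threshold rule for min_stemming_len (shorter words are left alone, words of exactly that length are stemmed) can be sketched as follows. The stemmer here is a hypothetical one-rule stand-in, not the Porter stemmer, and `stem_with_min_len` is an illustrative name rather than a Manticore function.

```python
def toy_stem(word: str) -> str:
    """Toy rule standing in for a real stemmer: strip a trailing 's'."""
    return word[:-1] if word.endswith("s") else word


def stem_with_min_len(word: str, min_stemming_len: int = 1) -> str:
    # Words shorter than the threshold are left as-is;
    # words of exactly the threshold length are still stemmed.
    if len(word) < min_stemming_len:
        return word
    return toy_stem(word)


print(stem_with_min_len("gps"))      # gp  (default of 1 stems everything)
print(stem_with_min_len("gps", 3))   # gp  (length equals the threshold)
print(stem_with_min_len("gps", 4))   # gps (shorter than the threshold)
print(stem_with_min_len("dogs", 4))  # dog
```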
index_exact_words = {0|1}
This option enables indexing of the original keywords along with the stemmed or remapped versions. The default value is 0, which means this feature is disabled.
When index_exact_words is enabled, the raw keywords are added to the full-text index along with the stemmed or remapped versions. This allows the use of the exact form operator in the query language. Keep in mind that enabling this feature will increase the full-text index size and indexing time, but will not impact search performance.
CREATE TABLE products(title text, price float) index_exact_words = '1' morphology = 'stem_en'
POST /cli -d "
CREATE TABLE products(title text, price float) index_exact_words = '1' morphology = 'stem_en'"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'index_exact_words' => '1',
'morphology' => 'stem_en'
]);
utilsApi.sql('CREATE TABLE products(title text, price float) index_exact_words = \'1\' morphology = \'stem_en\'')
res = await utilsApi.sql('CREATE TABLE products(title text, price float) index_exact_words = \'1\' morphology = \'stem_en\'');
utilsApi.sql("CREATE TABLE products(title text, price float) index_exact_words = '1' morphology = 'stem_en'");
table products {
index_exact_words = 1
morphology = stem_en
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}
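Why index_exact_words makes the exact form operator possible, and why it grows the index, can be shown with a toy inverted index in Python. This is a conceptual sketch under stated assumptions: the one-rule stemmer, whitespace tokenization, and the `=` prefix on raw terms are conventions of this example, not a description of Manticore's on-disk format.

```python
from collections import defaultdict


def toy_stem(word: str) -> str:
    """Toy rule standing in for a real stemmer: strip a trailing 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 2 else word


def build_index(docs: dict, index_exact_words: bool) -> dict:
    """Map each term to the set of document ids containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            postings[toy_stem(word)].add(doc_id)
            if index_exact_words:
                # Also store the raw form, marked with a prefix.
                postings["=" + word].add(doc_id)
    return postings


def search(postings: dict, term: str) -> list:
    if term.startswith("="):
        return sorted(postings.get(term, set()))  # exact form lookup
    return sorted(postings.get(toy_stem(term), set()))


docs = {1: "red dog", 2: "red dogs"}
index = build_index(docs, index_exact_words=True)
print(search(index, "dogs"))   # [1, 2]  stemmed match
print(search(index, "=dogs"))  # [2]     exact form only
print(len(index), len(build_index(docs, index_exact_words=False)))
```

The last line prints the number of distinct terms with and without the option, which reflects the index size increase mentioned above.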