Miscellaneous tools

indextool

indextool is a utility that dumps various information about a physical table (template and distributed tables are excluded). The general syntax for using indextool is:

indextool <command> [options]

Options effective for all commands:

The commands are as follows:

spelldump

spelldump extracts the contents of a dictionary file in ispell or MySpell format. This is useful for building word lists for wordforms, because all of the possible forms are pre-built for you.

The general syntax is:

spelldump [options] <dictionary> <affix> [result] [locale-name]

The two main parameters are the dictionary’s main file and its affix file; these are usually named [language-prefix].dict and [language-prefix].aff and can be found in most common Linux distributions and various online sources.

[result] is where the extracted dictionary data will be output, and [locale-name] specifies the locale details you wish to use.

The optional -c [file] option specifies a file containing case conversion details.

Examples of usage are:

spelldump en.dict en.aff
spelldump ru.dict ru.aff ru.txt ru_RU.CP1251
spelldump ru.dict ru.aff ru.txt .1251

The result file will contain a list of all the words in the dictionary, sorted alphabetically, in the format of a wordforms file. You can then edit this file to suit your specific needs. An example of what the result file could look like:

zone > zone
zoned > zoned
zoning > zoning
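
As one way of tailoring the result file, the sketch below keeps only the entries whose source word appears in a vocabulary you care about. It assumes one `source > destination` mapping per line, as in the sample above; the function names and the `vocabulary` set are hypothetical and not part of spelldump.

```python
def parse_wordforms(text):
    """Parse 'source > destination' lines into (source, destination) pairs."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line or ">" not in line:
            continue  # skip blank or malformed lines
        source, destination = (part.strip() for part in line.split(">", 1))
        pairs.append((source, destination))
    return pairs


def filter_wordforms(pairs, vocabulary):
    """Keep only the mappings whose source word is in the given vocabulary."""
    return [(s, d) for s, d in pairs if s in vocabulary]


# Sample data copied from the result file example above.
sample = "zone > zone\nzoned > zoned\nzoning > zoning\n"
# Hypothetical vocabulary: the words you actually want to keep.
vocabulary = {"zone", "zoning"}

kept = filter_wordforms(parse_wordforms(sample), vocabulary)
print("\n".join(f"{s} > {d}" for s, d in kept))
```

In practice the sample string would be replaced by the contents of the [result] file, and the filtered lines written back out as the wordforms file.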

wordbreaker

wordbreaker is used to split compound words, such as those commonly found in URLs, into their component words. For example, this tool can split “lordoftherings” into its four component words, or http://manofsteel.warnerbros.com into “man of steel warner bros”. This helps in searching, as it eliminates the need for prefixes or infixes. For example, searching for “sphinx” would not match “sphinxsearch”, but if you break the compound word and index the separate components, you would get a match without the increased file sizes that come with using prefixes and infixes in full-text indexing.

Examples of usage include:

echo manofsteel | bin/wordbreaker -dict dict.txt split
man of steel

The input stream will be separated into words using the -dict dictionary file. If no dictionary is specified, wordbreaker looks in the working folder for a wordbreaker-dict.txt file. (The dictionary should match the language of the compound word.) The split command breaks words from the standard input and outputs the result to the standard output. There are also test and bench commands that allow you to test the splitting quality and benchmark the splitting functionality.
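
This section does not say how wordbreaker treats the punctuation in a URL. If you prefer to feed it bare compound strings, a small preprocessing step can pull them out of the host name first. The sketch below is an assumption about how you might do that, not part of wordbreaker: `compound_candidates` is a hypothetical helper, and it naively treats the last host label as a top-level domain to discard.

```python
from urllib.parse import urlsplit


def compound_candidates(url):
    """Return host-name labels that may be compound words worth splitting."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    # Naive assumption: the last label is a top-level domain such as "com".
    if len(labels) > 1:
        labels = labels[:-1]
    # Drop empty labels and the conventional "www" prefix.
    return [label for label in labels if label and label != "www"]


candidates = compound_candidates("http://manofsteel.warnerbros.com")
print("\n".join(candidates))
```

Each printed line can then be piped to the split command in the same way as the echo example above.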

wordbreaker requires a dictionary to recognize individual substrings within a string. To choose between competing splits, it uses the relative frequency of each word in the dictionary: the higher a word's frequency, the more probable a split that contains it. You can generate such a file using the indexer tool:

indexer --buildstops dict.txt 100000 --buildfreqs myindex -c /path/to/sphinx.conf

This writes the 100,000 most frequent words from myindex, along with their counts, into dict.txt. The output is a plain text file, so you can edit it by hand to add or remove words.
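
To illustrate how word frequencies can decide between competing splits, here is a minimal sketch of frequency-weighted segmentation using dynamic programming. It is not wordbreaker's actual implementation. It assumes the dictionary holds one `word count` pair per line; the sample counts and the function names are made up for illustration.

```python
import math


def load_dict(text):
    """Parse 'word count' lines into a {word: count} mapping."""
    freqs = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # skip blank or malformed lines
        count = int(parts[1])
        if count > 0:
            freqs[parts[0]] = count
    return freqs


def split_compound(text, freqs, max_word_len=20):
    """Return the split of text with the highest total log-probability."""
    total = sum(freqs.values())
    n = len(text)
    # best[i] holds (score, start of last word) for the prefix text[:i].
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            if word in freqs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(freqs[word] / total)
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return [text]  # no complete split exists; return the input unchanged
    # Walk the back-pointers from the end to recover the words.
    words = []
    end = n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


# Hypothetical dictionary contents; real counts come from your own table.
sample_dict = "man 500\nof 9000\nsteel 300\nmano 2\nfsteel 1\n"
freqs = load_dict(sample_dict)
print(" ".join(split_compound("manofsteel", freqs)))  # man of steel
```

With these counts, "man of steel" beats "mano fsteel" because its words are far more frequent, which is the kind of preference the frequency data is meant to express.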