Sorting and ranking

Full-text queries return matches sorted by default. If nothing is specified, they are sorted by relevance, which is equivalent to ORDER BY weight() DESC in SQL format.

Non-full-text queries do not perform any sorting by default.

Extended mode

ORDER BY weight() DESC, price ASC, id DESC

Extended mode is automatically enabled when you explicitly provide sorting rules by adding the ORDER BY clause in SQL format or using the sort option via HTTP JSON.

In the sort clause, you can use any combination of up to 5 columns, each followed by asc or desc. Functions and expressions are not allowed as arguments for the sort clause, except for the weight() and random() functions (the latter can only be used via SQL in the form of ORDER BY random()). However, you can use any expression in the SELECT list and sort by its alias.

select *, a + b alias from test order by alias desc;
+------+------+------+----------+-------+
| id   | a    | b    | f        | alias |
+------+------+------+----------+-------+
|    1 |    2 |    3 | document |     5 |
+------+------+------+----------+-------+

HTTP

Sorting by attributes

Query results can be sorted by one or more attributes. Example:

{
  "index":"test",
  "query":
  {
    "match": { "title": "what was" }
  },
  "sort": [ "_score", "id" ]
}

"sort" specifies an array of attributes and/or additional properties. Each element of the array can be an attribute name or "_score" if you want to sort by match weights. In that case, the sort order defaults to ascending for attributes and descending for _score.

You can also specify the sort order explicitly. Example:

"sort":
[
  { "price":"asc" },
  "id"
]

You can also use another syntax and specify the sort order via the order property:

"sort":
[
  { "gid": { "order":"desc" } }
]

Sorting by MVA attributes is also supported in JSON queries. Sorting mode can be set via the mode property. The following modes are supported:

Example:

"sort":
[
  { "attr_mva": { "order":"desc", "mode":"max" } }
]

When sorting on an attribute, match weight (score) calculation is disabled by default (no ranker is used). You can enable weight calculation by setting the track_scores property to true:

{
  "index":"test",
  "track_scores":true,
  "query": { "match": { "title": "what was" } },
  "sort": [ { "gid": { "order":"desc" } } ]
}

Ranking overview

Ranking (also known as weighting) of search results can be defined as a process of computing a so-called relevance (weight) for every given matched document regards to a given query that matched it. So relevance is, in the end, just a number attached to every document that estimates how relevant the document is to the query. Search results can then be sorted based on this number and/or some additional parameters, so that the most sought-after results would appear higher on the results page.

There is no single standard one-size-fits-all way to rank any document in any scenario. Moreover, there can never be such a way, because relevance is subjective. As in, what seems relevant to you might not seem relevant to me. Hence, in general cases, it’s not just hard to compute; it’s theoretically impossible.

So ranking in Manticore is configurable. It has a notion of a so-called ranker. A ranker can formally be defined as a function that takes a document and a query as its input and produces a relevance value as output. In layman’s terms, a ranker controls exactly how (using which specific algorithm) Manticore will assign weights to the documents.

Available built-in rankers

Manticore ships with several built-in rankers suited for different purposes. Many of them use two factors: phrase proximity (also known as LCS) and BM25. Phrase proximity works on keyword positions, while BM25 works on keyword frequencies. Essentially, the better the degree of phrase match between the document body and the query, the higher the phrase proximity (it maxes out when the document contains the entire query as a verbatim quote). And BM25 is higher when the document contains more rare words. We’ll save the detailed discussion for later.

The currently implemented rankers are:

The ranker name is case-insensitive. Example:

SELECT ... OPTION ranker=sph04;

Quick summary of the ranking factors

Name Level Type Summary
max_lcs query int maximum possible LCS value for the current query
bm25 document int quick estimate of BM25(1.2, 0)
bm25a(k1, b) document int precise BM25() value with configurable K1, B constants and syntax support
bm25f(k1, b, {field=weight, …}) document int precise BM25F() value with extra configurable field weights
field_mask document int bit mask of matched fields
query_word_count document int number of unique inclusive keywords in a query
doc_word_count document int number of unique keywords matched in the document
lcs field int Longest Common Subsequence between query and document, in words
user_weight field int user field weight
hit_count field int total number of keyword occurrences
word_count field int number of unique matched keywords
tf_idf field float sum(tf*idf) over matched keywords == sum(idf) over occurrences
min_hit_pos field int first matched occurrence position, in words, 1-based
min_best_span_pos field int first maximum LCS span position, in words, 1-based
exact_hit field bool whether query == field
min_idf field float min(idf) over matched keywords
max_idf field float max(idf) over matched keywords
sum_idf field float sum(idf) over matched keywords
exact_order field bool whether all query keywords were a) matched and b) in query order
min_gaps field int minimum number of gaps between the matched keywords over the matching spans
lccs field int Longest Common Contiguous Subsequence between query and document, in words
wlccs field float Weighted Longest Common Contiguous Subsequence, sum(idf) over contiguous keyword spans
atc field float Aggregate Term Closeness, log(1+sum(idf1idf2pow(distance, -1.75)) over the best pairs of keywords

Document-level Ranking Factors

A document-level factor is a numeric value computed by the ranking engine for every matched document with regards to the current query. So it differs from a plain document attribute in that the attribute does not depend on the full text query, while factors might. These factors can be used anywhere in the ranking expression. Currently implemented document-level factors are:

Field-level Ranking Factors

A field-level factor is a numeric value computed by the ranking engine for every matched in-document text field regards to the current query. As more than one field can be matched by a query, but the final weight needs to be a single integer value, these values need to be folded into a single one. To achieve that, field-level factors can only be used within a field aggregation function, they can not be used anywhere in the expression. For example, you cannot use (lcs+bm25) as your ranking expression, as lcs takes multiple values (one in every matched field). You should use (sum(lcs)+bm25) instead, that expression sums lcs over all matching fields, and then adds bm25 to that per-field sum. Currently implemented field-level factors are:

In other words, we process the best (closest) matched keyword pairs in the document, and compute pairwise “closenesses” as the product of their IDFs scaled by the distance coefficient:

pair_tc = idf(pair_word1) * idf(pair_word2) * pow(pair_distance, -1.75)

We then sum such closenesses, and compute the final, log-dampened ATC value:

atc = log(1+sum(pair_tc))

Note that this final dampening logarithm is precisely the reason you should use OPTION idf=plain because, without it, the expression inside the log() could be negative.

Having closer keyword occurrences contributes much more to ATC than having more frequent keywords. Indeed, when the keywords are right next to each other, distance=1 and k=1; when there’s just one word in between them, distance=2 and k=0.297, with two words between, distance=3 and k=0.146, and so on. At the same time, IDF attenuates somewhat slower. For example, in a 1 million document collection, the IDF values for keywords that match in 10, 100, and 1000 documents would be respectively 0.833, 0.667, and 0.500. So a keyword pair with two rather rare keywords that occur in just 10 documents each but with 2 other words in between would yield pair_tc = 0.101 and thus barely outweigh a pair with a 100-doc and a 1000-doc keyword with 1 other word between them and pair_tc = 0.099. Moreover, a pair of two unique, 1-doc keywords with 3 words between them would get a pair_tc = 0.088 and lose to a pair of two 1000-doc keywords located right next to each other and yielding a pair_tc = 0.25. So, basically, while ATC does combine both keyword frequency and proximity, it still somewhat favors proximity.

Ranking factor aggregation functions

A field aggregation function is a single-argument function that accepts an expression with field-level factors, iterates over all matched fields, and computes the final results. The currently implemented field aggregation functions include:

Formula expressions for all the built-in rankers

Most other rankers can actually be emulated using the expression-based ranker. You just need to provide an appropriate expression. While this emulation will likely be slower than using the built-in, compiled ranker, it may still be interesting if you want to fine-tune your ranking formula starting with one of the existing ones. Additionally, the formulas describe the ranker details in a clear, readable manner.

Configuration of IDF formula

The historically default IDF (Inverse Document Frequency) in Manticore is equivalent to OPTION idf='normalized,tfidf_normalized', and those normalizations may cause several undesired effects.

First, idf=normalized causes keyword penalization. For instance, if you search for the | something and the occurs in more than 50% of the documents, then documents with both keywords the and[something will get less weight than documents with just one keyword something. Using OPTION idf=plain avoids this.

Plain IDF varies in [0, log(N)] range, and keywords are never penalized; while the normalized IDF varies in [-log(N), log(N)] range, and too frequent keywords are penalized.

Second, idf=tfidf_normalized causes IDF drift over queries. Historically, we additionally divided IDF by query keyword count, so that the entire sum(tf*idf) over all keywords would still fit into [0,1] range. However, that means that queries word1 and word1 | nonmatchingword2 would assign different weights to the exactly same result set, because the IDFs for both word1 and nonmatchingword2 would be divided by 2. OPTION idf='tfidf_unnormalized' fixes that. Note that BM25, BM25A, BM25F() ranking factors will be scale accordingly once you disable this normalization.

IDF flags can be mixed; plain and normalized are mutually exclusive;tfidf_unnormalized and tfidf_normalized are mutually exclusive; and unspecified flags in such a mutually exclusive group take their defaults. That means that OPTION idf=plain is equivalent to a complete OPTION idf='plain,tfidf_normalized' specification.