Grouping search results is often helpful for obtaining per-group match counts or other aggregations. For example, it’s useful for creating a graph illustrating the number of matching blog posts per month or grouping web search results by site or forum posts by author, etc.
Manticore supports the grouping of search results by single or multiple columns and computed expressions. The results can:
The general syntax is:
General syntax
SELECT {* | SELECT_expr [, SELECT_expr ...]}
...
GROUP BY {field_name | alias } [, ...]
HAVING where_condition]
[GROUP ORDER BY field_name {ASC | DESC} [, ...]]
[WITHIN ...
...) }
SELECT_expr: { field_name | function_name(COUNT(*)} where_condition: {aggregation expression alias |
JSON query format currently supports a basic grouping that can retrieve aggregate values and their count(*).
{
"index": "<index_name>",
"limit": 0,
"aggs": {
"<aggr_name>": {
"terms": {
"field": "<attribute>",
"size": <int value>
}
}
}
}
The standard query output returns the result set without grouping, which can be hidden using limit
(or size
). The aggregation requires setting a size
for the group’s result set size.
Grouping is quite simple - just add “GROUP BY smth” to the end of your SELECT
query. The something can be:
SELECT
list, you can GROUP BY it tooYou can omit any aggregation functions in the SELECT
list and it will still work:
SELECT release_year FROM films GROUP BY release_year LIMIT 5;
+--------------+
| release_year |+--------------+
2004 |
| 2002 |
| 2001 |
| 2005 |
| 2000 |
| +--------------+
In most cases, however, you’ll want to obtain some aggregated data for each group, such as:
COUNT(*)
to simply get the number of elements in each groupAVG(field)
to calculate the average value of the field within the groupSELECT release_year, count(*) FROM films GROUP BY release_year LIMIT 5;
+--------------+----------+
count(*) |
| release_year | +--------------+----------+
2004 | 108 |
| 2002 | 108 |
| 2001 | 91 |
| 2005 | 93 |
| 2000 | 97 |
| +--------------+----------+
SELECT release_year, AVG(rental_rate) FROM films GROUP BY release_year LIMIT 5;
+--------------+------------------+
avg(rental_rate) |
| release_year | +--------------+------------------+
2004 | 2.78629661 |
| 2002 | 3.08259249 |
| 2001 | 3.09989142 |
| 2005 | 2.90397978 |
| 2000 | 3.17556739 |
| +--------------+------------------+
POST /search -d '
{
"index" : "films",
"limit": 0,
"aggs" :
{
"release_year" :
{
"terms" :
{
"field":"release_year",
"size":100
}
}
}
}
'
{
"took": 2,
"timed_out": false,
"hits": {
"total": 10000,
"hits": [
]
},
"release_year": {
"group_brand_id": {
"buckets": [
{
"key": 2004,
"doc_count": 108
},
{
"key": 2002,
"doc_count": 108
},
{
"key": 2000,
"doc_count": 97
},
{
"key": 2005,
"doc_count": 93
},
{
"key": 2001,
"doc_count": 91
}
]
}
}
}
$index->setName('films');
$search = $index->search('');
$search->limit(0);
$search->facet('release_year','release_year',100);
$results = $search->get();
print_r($results->getFacets());
Array
(
[release_year] => Array
(
[buckets] => Array
(
[0] => Array
(
[key] => 2009
[doc_count] => 99
)
[1] => Array
(
[key] => 2008
[doc_count] => 102
)
[2] => Array
(
[key] => 2007
[doc_count] => 93
)
[3] => Array
(
[key] => 2006
[doc_count] => 103
)
[4] => Array
(
[key] => 2005
[doc_count] => 93
)
[5] => Array
(
[key] => 2004
[doc_count] => 108
)
[6] => Array
(
[key] => 2003
[doc_count] => 106
)
[7] => Array
(
[key] => 2002
[doc_count] => 108
)
[8] => Array
(
[key] => 2001
[doc_count] => 91
)
[9] => Array
(
[key] => 2000
[doc_count] => 97
)
)
)
)
=searchApi.search({"index":"films","limit":0,"aggs":{"release_year":{"terms":{"field":"release_year","size":100}}}}) res
'aggregations': {u'release_year': {u'buckets': [{u'doc_count': 99,
{u'key': 2009},
u'doc_count': 102,
{u'key': 2008},
u'doc_count': 93,
{u'key': 2007},
u'doc_count': 103,
{u'key': 2006},
u'doc_count': 93,
{u'key': 2005},
u'doc_count': 108,
{u'key': 2004},
u'doc_count': 106,
{u'key': 2003},
u'doc_count': 108,
{u'key': 2002},
u'doc_count': 91,
{u'key': 2001},
u'doc_count': 97,
{u'key': 2000}]}},
'hits': {'hits': [], 'max_score': None, 'total': 1000},
'profile': None,
'timed_out': False,
'took': 0}
= await searchApi.search({"index":"films","limit":0,"aggs":{"release_year":{"terms":{"field":"release_year","size":100}}}}); res
"took":0,"timed_out":false,"aggregations":{"release_year":{"buckets":[{"key":2009,"doc_count":99},{"key":2008,"doc_count":102},{"key":2007,"doc_count":93},{"key":2006,"doc_count":103},{"key":2005,"doc_count":93},{"key":2004,"doc_count":108},{"key":2003,"doc_count":106},{"key":2002,"doc_count":108},{"key":2001,"doc_count":91},{"key":2000,"doc_count":97}]}},"hits":{"total":1000,"hits":[]}} {
HashMap<String,Object> aggs = new HashMap<String,Object>(){{
put("release_year", new HashMap<String,Object>(){{
put("terms", new HashMap<String,Object>(){{
put("field","release_year");
put("size",100);
}});
}});
}};
new SearchRequest();
searchRequest = setIndex("films");
searchRequest.setLimit(0);
searchRequest.new HashMap<String,Object>();
query = put("match_all",null);
query.setQuery(query);
searchRequest.setAggs(aggs);
searchRequest.search(searchRequest); searchResponse = searchApi.
class SearchResponse {
0
took: false
timedOut: 2009, doc_count=99}, {key=2008, doc_count=102}, {key=2007, doc_count=93}, {key=2006, doc_count=103}, {key=2005, doc_count=93}, {key=2004, doc_count=108}, {key=2003, doc_count=106}, {key=2002, doc_count=108}, {key=2001, doc_count=91}, {key=2000, doc_count=97}]}}
aggregations: {release_year={buckets=[{key=class SearchResponseHits {
hits: null
maxScore: 1000
total:
hits: []
}null
profile: }
By default, groups are not sorted, and the next thing you typically want to do is order them by something, like the field you’re grouping by:
SELECT release_year, count(*) from films GROUP BY release_year ORDER BY release_year asc limit 5;
+--------------+----------+
count(*) |
| release_year | +--------------+----------+
2000 | 97 |
| 2001 | 91 |
| 2002 | 108 |
| 2003 | 106 |
| 2004 | 108 |
| +--------------+----------+
Alternatively, you can sort by the aggregation:
count(*)
to display groups with the most elements firstavg(rental_rate)
to show the highest-rated movies first. Note that in the example, it’s done via an alias: avg(rental_rate)
is first mapped to avg
in the SELECT
list, and then we simply do ORDER BY avg
SELECT release_year, count(*) FROM films GROUP BY release_year ORDER BY count(*) desc LIMIT 5;
+--------------+----------+
count(*) |
| release_year | +--------------+----------+
2004 | 108 |
| 2002 | 108 |
| 2003 | 106 |
| 2006 | 103 |
| 2008 | 102 |
| +--------------+----------+
SELECT release_year, AVG(rental_rate) avg FROM films GROUP BY release_year ORDER BY avg desc LIMIT 5;
+--------------+------------+
avg |
| release_year | +--------------+------------+
2006 | 3.26184368 |
| 2000 | 3.17556739 |
| 2001 | 3.09989142 |
| 2002 | 3.08259249 |
| 2008 | 2.99000049 |
| +--------------+------------+
In some cases, you might want to group not just by a single field, but by multiple fields at once, such as a movie’s category and year:
SELECT category_id, release_year, count(*) FROM films GROUP BY category_id, release_year ORDER BY category_id ASC, release_year ASC;
+-------------+--------------+----------+
count(*) |
| category_id | release_year | +-------------+--------------+----------+
1 | 2000 | 5 |
| 1 | 2001 | 2 |
| 1 | 2002 | 6 |
| 1 | 2003 | 6 |
| 1 | 2004 | 5 |
| 1 | 2005 | 10 |
| 1 | 2006 | 4 |
| 1 | 2007 | 5 |
| 1 | 2008 | 7 |
| 1 | 2009 | 14 |
| 2 | 2000 | 10 |
| 2 | 2001 | 5 |
| 2 | 2002 | 6 |
| 2 | 2003 | 6 |
| 2 | 2004 | 10 |
| 2 | 2005 | 4 |
| 2 | 2006 | 5 |
| 2 | 2007 | 8 |
| 2 | 2008 | 8 |
| 2 | 2009 | 4 |
| +-------------+--------------+----------+
Sometimes it’s useful to see not just a single element per group, but multiple. This can be easily achieved with the help of GROUP N BY
. For example, in the following case, we get two movies for each year rather than just one, which a simple GROUP BY release_year
would have returned.
SELECT release_year, title FROM films GROUP 2 BY release_year ORDER BY release_year DESC LIMIT 6;
+--------------+-----------------------------+
| release_year | title |+--------------+-----------------------------+
2009 | ALICE FANTASIA |
| 2009 | ALIEN CENTER |
| 2008 | AMADEUS HOLY |
| 2008 | ANACONDA CONFESSIONS |
| 2007 | ANGELS LIFE |
| 2007 | ARACHNOPHOBIA ROLLERCOASTER |
| +--------------+-----------------------------+
Another crucial analytics requirement is to sort elements within a group. To achieve this, use the WITHIN GROUP ORDER BY ... {ASC|DESC}
clause. For example, let’s get the highest-rated film for each year. Note that it works in parallel with just ORDER BY
:
WITHIN GROUP ORDER BY
sorts results inside a groupGROUP BY
sorts the groups themselvesThese two work entirely independently.
SELECT release_year, title, rental_rate FROM films GROUP BY release_year WITHIN GROUP ORDER BY rental_rate DESC ORDER BY release_year DESC LIMIT 5;
+--------------+------------------+-------------+
| release_year | title | rental_rate |+--------------+------------------+-------------+
2009 | AMERICAN CIRCUS | 4.990000 |
| 2008 | ANTHEM LUKE | 4.990000 |
| 2007 | ATTACKS HATE | 4.990000 |
| 2006 | ALADDIN CALENDAR | 4.990000 |
| 2005 | AIRPLANE SIERRA | 4.990000 |
| +--------------+------------------+-------------+
HAVING expression
is a helpful clause for filtering groups. While WHERE
is applied before grouping, HAVING
works with the groups. For example, let’s keep only those years when the average rental rate of the films for that year was higher than 3. We get only four years:
SELECT release_year, avg(rental_rate) avg FROM films GROUP BY release_year HAVING avg > 3;
+--------------+------------+
avg |
| release_year | +--------------+------------+
2002 | 3.08259249 |
| 2001 | 3.09989142 |
| 2000 | 3.17556739 |
| 2006 | 3.26184368 |
| +--------------+------------+
There is a function GROUPBY()
which returns the key of the current group. It’s useful in many cases, especially when you GROUP BY an MVA or a JSON value.
It can also be used in HAVING
, for example, to keep only years 2000 and 2002.
Note that GROUPBY()
is not recommended for use when you GROUP BY multiple fields at once. It will still work, but since the group key in this case is a compound of field values, it may not appear the way you expect.
SELECT release_year, count(*) FROM films GROUP BY release_year HAVING GROUPBY() IN (2000, 2002);
+--------------+----------+
count(*) |
| release_year | +--------------+----------+
2002 | 108 |
| 2000 | 97 |
| +--------------+----------+
Manticore supports grouping by MVA. To demonstrate how it works, let’s create a table “shoes” with MVA “sizes” and insert a few documents into it:
create table shoes(title text, sizes multi);
insert into shoes values(0,'nike',(40,41,42)),(0,'adidas',(41,43)),(0,'reebook',(42,43));
so we have:
SELECT * FROM shoes;
+---------------------+----------+---------+
id | sizes | title |
| +---------------------+----------+---------+
1657851069130080265 | 40,41,42 | nike |
| 1657851069130080266 | 41,43 | adidas |
| 1657851069130080267 | 42,43 | reebook |
| +---------------------+----------+---------+
If we now GROUP BY “sizes”, it will process all our multi-value attributes and return an aggregation for each, in this case just the count:
SELECT groupby() gb, count(*) FROM shoes GROUP BY sizes ORDER BY gb asc;
+------+----------+
count(*) |
| gb | +------+----------+
40 | 1 |
| 41 | 2 |
| 42 | 2 |
| 43 | 2 |
| +------+----------+
POST /search -d '
{
"index" : "shoes",
"limit": 0,
"aggs" :
{
"sizes" :
{
"terms" :
{
"field":"sizes",
"size":100
}
}
}
}
'
{
"took": 0,
"timed_out": false,
"hits": {
"total": 3,
"hits": [
]
},
"aggregations": {
"sizes": {
"buckets": [
{
"key": 43,
"doc_count": 2
},
{
"key": 42,
"doc_count": 2
},
{
"key": 41,
"doc_count": 2
},
{
"key": 40,
"doc_count": 1
}
]
}
}
}
$index->setName('shoes');
$search = $index->search('');
$search->limit(0);
$search->facet('sizes','sizes',100);
$results = $search->get();
print_r($results->getFacets());
Array
(
[sizes] => Array
(
[buckets] => Array
(
[0] => Array
(
[key] => 43
[doc_count] => 2
)
[1] => Array
(
[key] => 42
[doc_count] => 2
)
[2] => Array
(
[key] => 41
[doc_count] => 2
)
[3] => Array
(
[key] => 40
[doc_count] => 1
)
)
)
)
=searchApi.search({"index":"shoes","limit":0,"aggs":{"sizes":{"terms":{"field":"sizes","size":100}}}}) res
'aggregations': {u'sizes': {u'buckets': [{u'doc_count': 2, u'key': 43},
{u'doc_count': 2, u'key': 42},
{u'doc_count': 2, u'key': 41},
{u'doc_count': 1, u'key': 40}]}},
{'hits': {'hits': [], 'max_score': None, 'total': 3},
'profile': None,
'timed_out': False,
'took': 0}
= await searchApi.search({"index":"shoes","limit":0,"aggs":{"sizes":{"terms":{"field":"sizes","size":100}}}}); res
"took":0,"timed_out":false,"aggregations":{"sizes":{"buckets":[{"key":43,"doc_count":2},{"key":42,"doc_count":2},{"key":41,"doc_count":2},{"key":40,"doc_count":1}]}},"hits":{"total":3,"hits":[]}} {
HashMap<String,Object> aggs = new HashMap<String,Object>(){{
put("release_year", new HashMap<String,Object>(){{
put("terms", new HashMap<String,Object>(){{
put("field","release_year");
put("size",100);
}});
}});
}};
new SearchRequest();
searchRequest = setIndex("films");
searchRequest.setLimit(0);
searchRequest.new HashMap<String,Object>();
query = put("match_all",null);
query.setQuery(query);
searchRequest.setAggs(aggs);
searchRequest.search(searchRequest); searchResponse = searchApi.
class SearchResponse {
0
took: false
timedOut: 43, doc_count=2}, {key=42, doc_count=2}, {key=41, doc_count=2}, {key=40, doc_count=1}]}}
aggregations: {release_year={buckets=[{key=class SearchResponseHits {
hits: null
maxScore: 3
total:
hits: []
}null
profile: }
If you have a field of type JSON, you can GROUP BY any node from it. To demonstrate this, let’s create a table “products” with a few documents, each having a color in the “meta” JSON field:
create table products(title text, meta json);
insert into products values(0,'nike','{"color":"red"}'),(0,'adidas','{"color":"red"}'),(0,'puma','{"color":"green"}');
This gives us:
SELECT * FROM products;
+---------------------+-------------------+--------+
id | meta | title |
| +---------------------+-------------------+--------+
1657851069130080268 | {"color":"red"} | nike |
| 1657851069130080269 | {"color":"red"} | adidas |
| 1657851069130080270 | {"color":"green"} | puma |
| +---------------------+-------------------+--------+
To group the products by color, we can simply use GROUP BY meta.color
, and to display the corresponding group key in the SELECT
list, we can use GROUPBY()
:
SELECT groupby() color, count(*) from products GROUP BY meta.color;
+-------+----------+
count(*) |
| color | +-------+----------+
2 |
| red | 1 |
| green | +-------+----------+
POST /search -d '
{
"index" : "products",
"limit": 0,
"aggs" :
{
"color" :
{
"terms" :
{
"field":"meta.color",
"size":100
}
}
}
}
'
{
"took": 0,
"timed_out": false,
"hits": {
"total": 3,
"hits": [
]
},
"aggregations": {
"color": {
"buckets": [
{
"key": "green",
"doc_count": 1
},
{
"key": "red",
"doc_count": 2
}
]
}
}
}
$index->setName('products');
$search = $index->search('');
$search->limit(0);
$search->facet('meta.color','color',100);
$results = $search->get();
print_r($results->getFacets());
Array
(
[color] => Array
(
[buckets] => Array
(
[0] => Array
(
[key] => green
[doc_count] => 1
)
[1] => Array
(
[key] => red
[doc_count] => 2
)
)
)
)
=searchApi.search({"index":"products","limit":0,"aggs":{"color":{"terms":{"field":"meta.color","size":100}}}}) res
'aggregations': {u'color': {u'buckets': [{u'doc_count': 1,
{u'key': u'green'},
u'doc_count': 2, u'key': u'red'}]}},
{'hits': {'hits': [], 'max_score': None, 'total': 3},
'profile': None,
'timed_out': False,
'took': 0}
= await searchApi.search({"index":"products","limit":0,"aggs":{"color":{"terms":{"field":"meta.color","size":100}}}}); res
"took":0,"timed_out":false,"aggregations":{"color":{"buckets":[{"key":"green","doc_count":1},{"key":"red","doc_count":2}]}},"hits":{"total":3,"hits":[]}} {
HashMap<String,Object> aggs = new HashMap<String,Object>(){{
put("color", new HashMap<String,Object>(){{
put("terms", new HashMap<String,Object>(){{
put("field","meta.color");
put("size",100);
}});
}});
}};
new SearchRequest();
searchRequest = setIndex("products");
searchRequest.setLimit(0);
searchRequest.new HashMap<String,Object>();
query = put("match_all",null);
query.setQuery(query);
searchRequest.setAggs(aggs);
searchRequest.search(searchRequest); searchResponse = searchApi.
class SearchResponse {
0
took: false
timedOut: 1}, {key=red, doc_count=2}]}}
aggregations: {color={buckets=[{key=green, doc_count=class SearchResponseHits {
hits: null
maxScore: 3
total:
hits: []
}null
profile: }
Besides COUNT(*)
, which returns the number of elements in each group, you can use various other aggregation functions: ##### COUNT(DISTINCT field) While COUNT(*)
returns the number of all elements in the group, COUNT(DISTINCT field)
returns the number of unique values of the field in the group, which may be completely different from the total count. For instance, you can have 100 elements in the group, but all with the same value for a certain field. COUNT(DISTINCT field)
helps to determine that. To demonstrate this, let’s create a table “students” with the student’s name, age, and major:
CREATE TABLE students(name text, age int, major string);
INSERT INTO students values(0,'John',21,'arts'),(0,'William',22,'business'),(0,'Richard',21,'cs'),(0,'Rebecca',22,'cs'),(0,'Monica',21,'arts');
so we have:
none)]> SELECT * from students;
MySQL [(+---------------------+------+----------+---------+
id | age | major | name |
| +---------------------+------+----------+---------+
1657851069130080271 | 21 | arts | John |
| 1657851069130080272 | 22 | business | William |
| 1657851069130080273 | 21 | cs | Richard |
| 1657851069130080274 | 22 | cs | Rebecca |
| 1657851069130080275 | 21 | arts | Monica |
| +---------------------+------+----------+---------+
In the example, you can see that if we GROUP BY major and display both COUNT(*)
and COUNT(DISTINCT age)
, it becomes clear that there are two students who chose the major “cs” with two unique ages, but for the major “arts”, there are also two students, yet only one unique age.
There can be at most one COUNT(DISTINCT)
per query.
COUNT(DISTINCT)
against a distributed table or a real-time table consisting of multiple disk chunks may return inaccurate results, but the result should be accurate for a distributed table consisting of local plain or real-time tables with the same schema (identical set/order of fields, but may have different tokenization settings).
SELECT major, count(*), count(distinct age) FROM students GROUP BY major;
+----------+----------+---------------------+
count(*) | count(distinct age) |
| major | +----------+----------+---------------------+
2 | 1 |
| arts | 1 | 1 |
| business | 2 | 2 |
| cs | +----------+----------+---------------------+
Often, you want to better understand the contents of each group. You can use GROUP N BY for that, but it would return additional rows you might not want in the output. GROUP_CONCAT()
enriches your grouping by concatenating values of a specific field in the group. Let’s take the previous example and improve it by displaying all the ages in each group.
GROUP_CONCAT(field)
returns the list as comma-separated values.
SELECT major, count(*), count(distinct age), group_concat(age) FROM students GROUP BY major
+----------+----------+---------------------+-------------------+
count(*) | count(distinct age) | group_concat(age) |
| major | +----------+----------+---------------------+-------------------+
2 | 1 | 21,21 |
| arts | 1 | 1 | 22 |
| business | 2 | 2 | 21,22 |
| cs | +----------+----------+---------------------+-------------------+
Of course, you can also obtain the sum, average, minimum, and maximum values within a group.
SELECT release_year year, sum(rental_rate) sum, min(rental_rate) min, max(rental_rate) max, avg(rental_rate) avg FROM films GROUP BY release_year ORDER BY year asc LIMIT 5;
+------+------------+----------+----------+------------+
year | sum | min | max | avg |
| +------+------------+----------+----------+------------+
2000 | 308.030029 | 0.990000 | 4.990000 | 3.17556739 |
| 2001 | 282.090118 | 0.990000 | 4.990000 | 3.09989142 |
| 2002 | 332.919983 | 0.990000 | 4.990000 | 3.08259249 |
| 2003 | 310.940063 | 0.990000 | 4.990000 | 2.93339682 |
| 2004 | 300.920044 | 0.990000 | 4.990000 | 2.78629661 |
| +------+------------+----------+----------+------------+
Grouping is done in fixed memory, which depends on the max_matches setting. If max_matches
allows for storage of all found groups, the results will be 100% accurate. However, if the value of max_matches
is lower, the results will be less accurate.
When parallel processing is involved, it can become more complicated. When pseudo_sharding
is enabled and/or when using an RT index with several disk chunks, each chunk or pseudo shard gets a result set that is no larger than max_matches
. This can lead to inaccuracies in aggregates and group counts when the result sets from different threads are merged. To fix this, either a larger max_matches
value or disabling parallel processing can be used.
Manticore will try to increase max_matches
up to max_matches_increase_threshold if it detects that groupby may return inaccurate results. Detection is based on the number of unique values of the groupby attribute, which is retrieved from secondary indexes (if present).
To ensure accurate aggregates and/or group counts when using RT indexes or pseudo_sharding
, accurate_aggregation
can be enabled. This will try to increase max_matches
up to the threshold, and if the threshold is not high enough, Manticore will disable parallel processing for the query.
none)]> SELECT release_year year, count(*) FROM films GROUP BY year limit 5;
MySQL [(+------+----------+
year | count(*) |
| +------+----------+
2004 | 108 |
| 2002 | 108 |
| 2001 | 91 |
| 2005 | 93 |
| 2000 | 97 |
| +------+----------+
none)]> SELECT release_year year, count(*) FROM films GROUP BY year limit 5 option max_matches=1;
MySQL [(+------+----------+
year | count(*) |
| +------+----------+
2004 | 76 |
| +------+----------+
none)]> SELECT release_year year, count(*) FROM films GROUP BY year limit 5 option max_matches=2;
MySQL [(+------+----------+
year | count(*) |
| +------+----------+
2004 | 76 |
| 2002 | 74 |
| +------+----------+
none)]> SELECT release_year year, count(*) FROM films GROUP BY year limit 5 option max_matches=3;
MySQL [(+------+----------+
year | count(*) |
| +------+----------+
2004 | 108 |
| 2002 | 108 |
| 2001 | 91 |
| +------+----------+