Aggregating


Bluge supports a powerful framework for computing aggregated values over the set of DocumentMatches.

Terminology

In Bluge, the aggregation framework relies heavily on two concepts: Buckets and Metrics.

A Bucket is simply a set of documents matching some criteria.

A Metric is some value (or set of values) computed over a Bucket.

There is one implicit bucket defined, which is the entire result set of your search.

Some aggregations (which we refer to as bucketing aggregations) define new sub-buckets inside this top-level bucket. These sub-buckets can either be statically defined at search time, or dynamically defined based on the data.

Other aggregations (which we refer to as metric aggregations) compute values on buckets.

Bucketing Aggregations

Terms Aggregation

The terms aggregation typically operates on field data. Each term seen becomes its own bucket, and by default the count metric is applied to each bucket. At the conclusion of the search, these buckets are sorted by count in descending order, and the top N buckets are returned as part of the result.

For example, consider a set of documents describing products. Each product has a keyword field named category, indexed with the sortable option. When a user searches the products, we can compute a terms aggregation on the category field and show the user the top 5 categories within their search results, along with a count of how many products fall in each category. This is often used to let users drill deeper into the results by refining their search filter interactively.
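
The counting and top-N selection described above can be sketched in plain Go. This is a minimal standalone illustration, not Bluge's API: the TermBucket type and topTerms function are hypothetical names, and breaking ties alphabetically is an assumption made here so the output is deterministic.

```go
package main

import (
	"fmt"
	"sort"
)

// TermBucket pairs a term with the number of matching documents that carried it.
// Hypothetical type for illustration only.
type TermBucket struct {
	Term  string
	Count int
}

// topTerms gives each distinct term its own bucket, counts occurrences,
// sorts buckets by count descending (ties broken by term, an assumption),
// and returns at most n buckets.
func topTerms(terms []string, n int) []TermBucket {
	counts := make(map[string]int)
	for _, t := range terms {
		counts[t]++
	}
	buckets := make([]TermBucket, 0, len(counts))
	for t, c := range counts {
		buckets = append(buckets, TermBucket{Term: t, Count: c})
	}
	sort.Slice(buckets, func(i, j int) bool {
		if buckets[i].Count != buckets[j].Count {
			return buckets[i].Count > buckets[j].Count
		}
		return buckets[i].Term < buckets[j].Term
	})
	if len(buckets) > n {
		buckets = buckets[:n]
	}
	return buckets
}

func main() {
	// One category value per matching product.
	categories := []string{"books", "toys", "books", "garden", "toys", "books"}
	for _, b := range topTerms(categories, 2) {
		fmt.Printf("%s: %d\n", b.Term, b.Count)
	}
}
```

With the sample data, the two largest buckets are books (3) and toys (2); garden is counted but falls outside the top 2.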

Numeric Range Aggregation

The numeric range aggregation also typically operates on field data. At query time, a set of buckets is statically defined, each describing a numeric range of interest. By default the aggregation includes the count metric, which tracks how many documents had a numeric field value within each range.
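
As a rough standalone sketch of statically defined range buckets, the following Go program counts values per range. The NumericRange type and countRanges function are hypothetical, and treating the lower bound as inclusive and the upper bound as exclusive is an assumption of this sketch, not a statement about Bluge's behavior.

```go
package main

import "fmt"

// NumericRange is a hypothetical bucket definition with a running count.
type NumericRange struct {
	Name      string
	Low, High float64 // assumed semantics: Low inclusive, High exclusive
	Count     int
}

// countRanges returns a copy of the ranges with Count set to the number of
// values falling inside each one. A value may land in several overlapping ranges.
func countRanges(ranges []NumericRange, values []float64) []NumericRange {
	out := make([]NumericRange, len(ranges))
	copy(out, ranges)
	for _, v := range values {
		for i := range out {
			if v >= out[i].Low && v < out[i].High {
				out[i].Count++
			}
		}
	}
	return out
}

func main() {
	ranges := []NumericRange{
		{Name: "cheap", Low: 0, High: 10},
		{Name: "mid", Low: 10, High: 50},
		{Name: "premium", Low: 50, High: 1000},
	}
	prices := []float64{4.99, 12.5, 30, 75, 9.99}
	for _, r := range countRanges(ranges, prices) {
		fmt.Printf("%s: %d\n", r.Name, r.Count)
	}
}
```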

Date Range Aggregation

The date range aggregation also typically operates on field data. At query time, a set of buckets is statically defined, each describing a date range of interest. By default the aggregation includes the count metric, which tracks how many documents had a date time field value within each range.
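
The same idea applies to dates. The sketch below is a standalone illustration using the standard time package; DateRange, countDateRanges, day, and demo are hypothetical names, and the inclusive start and exclusive end are assumptions of this example.

```go
package main

import (
	"fmt"
	"time"
)

// DateRange is a hypothetical bucket definition with a running count.
type DateRange struct {
	Name       string
	Start, End time.Time // assumed semantics: Start inclusive, End exclusive
	Count      int
}

// countDateRanges returns a copy of the ranges with Count set to the number
// of timestamps falling inside each one.
func countDateRanges(ranges []DateRange, stamps []time.Time) []DateRange {
	out := make([]DateRange, len(ranges))
	copy(out, ranges)
	for _, ts := range stamps {
		for i := range out {
			if !ts.Before(out[i].Start) && ts.Before(out[i].End) {
				out[i].Count++
			}
		}
	}
	return out
}

// day builds a UTC midnight timestamp for the given calendar date.
func day(y int, m time.Month, d int) time.Time {
	return time.Date(y, m, d, 0, 0, 0, 0, time.UTC)
}

// demo buckets three sample timestamps into two calendar years.
func demo() []DateRange {
	ranges := []DateRange{
		{Name: "2023", Start: day(2023, 1, 1), End: day(2024, 1, 1)},
		{Name: "2024", Start: day(2024, 1, 1), End: day(2025, 1, 1)},
	}
	stamps := []time.Time{day(2023, 6, 15), day(2024, 1, 1), day(2024, 12, 31)}
	return countDateRanges(ranges, stamps)
}

func main() {
	for _, r := range demo() {
		fmt.Printf("%s: %d\n", r.Name, r.Count)
	}
}
```

Because the end bound is exclusive here, the timestamp at midnight on 1 January 2024 is counted in the 2024 bucket only.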

Metric Aggregations

Basic

The following basic single-value metrics are supported:

  • sum
  • min
  • max
  • avg
  • weighted avg
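
To make these metrics concrete, here is a minimal standalone sketch of how each can be computed in a single streaming pass over a bucket's values. The Stats type and weightedAvg function are hypothetical and are not Bluge types; folding several metrics into one struct is a simplification made for brevity.

```go
package main

import (
	"fmt"
	"math"
)

// Stats accumulates sum, min, max, and count as values are consumed.
type Stats struct {
	count int
	sum   float64
	min   float64
	max   float64
}

// NewStats starts min and max at +Inf and -Inf so the first value replaces both.
func NewStats() *Stats {
	return &Stats{min: math.Inf(1), max: math.Inf(-1)}
}

// Consume folds one value into the running metrics.
func (s *Stats) Consume(v float64) {
	s.count++
	s.sum += v
	s.min = math.Min(s.min, v)
	s.max = math.Max(s.max, v)
}

// Avg returns sum/count, or 0 for an empty bucket (a choice made for this sketch).
func (s *Stats) Avg() float64 {
	if s.count == 0 {
		return 0
	}
	return s.sum / float64(s.count)
}

// weightedAvg returns sum(v*w)/sum(w). It assumes both slices have the same
// length and returns 0 when the weights sum to zero.
func weightedAvg(values, weights []float64) float64 {
	var num, den float64
	for i, v := range values {
		num += v * weights[i]
		den += weights[i]
	}
	if den == 0 {
		return 0
	}
	return num / den
}

func main() {
	s := NewStats()
	for _, v := range []float64{2, 4, 6} {
		s.Consume(v)
	}
	fmt.Println(s.sum, s.min, s.max, s.Avg())
	fmt.Println(weightedAvg([]float64{2, 4, 6}, []float64{1, 1, 2}))
}
```

For the values 2, 4, 6 this yields sum 12, min 2, max 6, and avg 4; with weights 1, 1, 2 the weighted average is 18/4 = 4.5.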

Special

A few special case aggregations are supported:

  • count (sum of 1 per document)
  • duration (time.Duration computed since the start of the search)

Cardinality Estimation

The cardinality estimation metric can be used to estimate the number of distinct values seen, in a memory-efficient way.

Quantile Approximation

The quantile approximation metric can be used to approximate quantiles in a memory-efficient way.

Nesting

Buckets and Metrics can be nested in arbitrary and powerful ways.

For example, imagine we have a set of documents describing beers. Each beer has a field named style describing the style (lager, ale, lambic, etc.). Each beer also has a numeric field named abv describing the beer’s alcohol by volume. One could run a MatchAll query across the beers, compute a Terms Aggregation on the style field, and then nest the Quantile Approximation metric inside each of those buckets. The result is that we can report the median (50th percentile) and 99th percentile ABV for each style of beer.
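
The shape of that nested result can be sketched in standalone Go. Note the swap: this sketch computes exact nearest-rank quantiles by sorting each bucket, whereas the metric described above is an approximation designed to use less memory. Beer, quantile, and abvQuantilesByStyle are hypothetical names, and none of this is Bluge's API.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Beer is a hypothetical document with a style term and a numeric abv value.
type Beer struct {
	Style string
	ABV   float64
}

// quantile returns the exact nearest-rank quantile q (0 < q <= 1) of an
// ascending-sorted slice, or NaN for an empty slice.
func quantile(sorted []float64, q float64) float64 {
	if len(sorted) == 0 {
		return math.NaN()
	}
	rank := int(math.Ceil(q * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

// abvQuantilesByStyle first buckets beers by style (the outer aggregation),
// then computes the requested quantiles inside each bucket (the nested metric).
func abvQuantilesByStyle(beers []Beer, qs ...float64) map[string][]float64 {
	byStyle := make(map[string][]float64)
	for _, b := range beers {
		byStyle[b.Style] = append(byStyle[b.Style], b.ABV)
	}
	out := make(map[string][]float64, len(byStyle))
	for style, abvs := range byStyle {
		sort.Float64s(abvs)
		for _, q := range qs {
			out[style] = append(out[style], quantile(abvs, q))
		}
	}
	return out
}

func main() {
	beers := []Beer{
		{"lager", 4.2}, {"lager", 5.0}, {"lager", 4.8},
		{"ale", 5.5}, {"ale", 6.5}, {"ale", 7.0}, {"ale", 9.0}, {"ale", 6.0},
	}
	result := abvQuantilesByStyle(beers, 0.5, 0.99)
	styles := make([]string, 0, len(result))
	for s := range result {
		styles = append(styles, s)
	}
	sort.Strings(styles) // map iteration order is random, so sort for stable output
	for _, s := range styles {
		fmt.Printf("%s: median=%.1f p99=%.1f\n", s, result[s][0], result[s][1])
	}
}
```

Sorting every bucket keeps all values in memory, which is exactly the cost an approximate quantile metric is meant to avoid on large result sets.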

Custom Sources

All the aggregations discussed thus far operate on extendable interfaces, not directly on field values.

This allows aggregations to work on custom values computed by your application, which can themselves use field values as inputs.

It also allows for filtering out undesirable values, or replacing missing values with alternates.
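
A minimal standalone sketch of this idea follows. Every name in it (Doc, NumericSource, Field, Computed, Missing, Filter, sumOf, and the helper functions) is hypothetical and chosen for illustration; Bluge's actual interfaces may differ in names and signatures. The point is only that a metric written against an interface works unchanged with raw fields, computed values, filters, and defaults.

```go
package main

import "fmt"

// Doc is a hypothetical stand-in for a matched document: field name to values.
type Doc map[string][]float64

// NumericSource yields zero or more numbers for a document.
type NumericSource interface {
	Numbers(d Doc) []float64
}

// Field reads values directly from a named field.
type Field string

func (f Field) Numbers(d Doc) []float64 { return d[string(f)] }

// Computed derives values with application code.
type Computed func(d Doc) []float64

func (c Computed) Numbers(d Doc) []float64 { return c(d) }

// Missing substitutes Default when the wrapped source yields nothing.
type Missing struct {
	Src     NumericSource
	Default float64
}

func (m Missing) Numbers(d Doc) []float64 {
	if vs := m.Src.Numbers(d); len(vs) > 0 {
		return vs
	}
	return []float64{m.Default}
}

// Filter drops values for which Keep returns false.
type Filter struct {
	Src  NumericSource
	Keep func(float64) bool
}

func (f Filter) Numbers(d Doc) []float64 {
	var out []float64
	for _, v := range f.Src.Numbers(d) {
		if f.Keep(v) {
			out = append(out, v)
		}
	}
	return out
}

// sumOf is a metric that only knows about the interface, not about fields.
func sumOf(src NumericSource, docs []Doc) float64 {
	var total float64
	for _, d := range docs {
		for _, v := range src.Numbers(d) {
			total += v
		}
	}
	return total
}

func nonNegative(v float64) bool { return v >= 0 }

// cleanPrice defaults a missing price to 5, then drops negative prices.
func cleanPrice() NumericSource {
	return Filter{Src: Missing{Src: Field("price"), Default: 5}, Keep: nonNegative}
}

// revenue computes price*qty when both fields are present, dropping negatives.
func revenue() NumericSource {
	calc := Computed(func(d Doc) []float64 {
		p, q := d["price"], d["qty"]
		if len(p) == 0 || len(q) == 0 {
			return nil
		}
		return []float64{p[0] * q[0]}
	})
	return Filter{Src: calc, Keep: nonNegative}
}

func exampleDocs() []Doc {
	return []Doc{
		{"price": {10}, "qty": {2}},
		{"price": {-1}, "qty": {1}},
		{"qty": {3}},
	}
}

func main() {
	fmt.Println(sumOf(cleanPrice(), exampleDocs()))
	fmt.Println(sumOf(revenue(), exampleDocs()))
}
```

The order of wrapping matters: here the default is applied before the filter, so a document with a negative price contributes nothing rather than falling back to the default.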

Extending the Framework

The core Aggregation and Calculator types used to define all of this functionality are exposed as interfaces, giving your application the full power to define its own behavior.
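
As a rough sketch of what a custom metric might look like, the standalone program below defines simplified, hypothetical versions of the two interfaces and a custom "spread" (max minus min) calculator. The method sets shown are assumptions for illustration; consult the Bluge source for the real interface definitions before implementing your own.

```go
package main

import "fmt"

// Match is a hypothetical stand-in for a search hit exposing one numeric value.
type Match struct{ Value float64 }

// Calculator (simplified, assumed shape) consumes matches one at a time and
// is finalized once when the search completes.
type Calculator interface {
	Consume(m Match)
	Finish()
}

// Aggregation (simplified, assumed shape) acts as a factory so that every
// bucket can get its own fresh Calculator.
type Aggregation interface {
	Calculator() Calculator
}

// Spread is a custom metric: the difference between the largest and smallest
// value seen. Result is only meaningful after Finish has been called.
type Spread struct {
	seen     bool
	min, max float64
	Result   float64
}

func (s *Spread) Consume(m Match) {
	if !s.seen {
		s.min, s.max, s.seen = m.Value, m.Value, true
		return
	}
	if m.Value < s.min {
		s.min = m.Value
	}
	if m.Value > s.max {
		s.max = m.Value
	}
}

func (s *Spread) Finish() { s.Result = s.max - s.min }

// SpreadAggregation hands out a new Spread calculator on each call.
type SpreadAggregation struct{}

func (SpreadAggregation) Calculator() Calculator { return &Spread{} }

// run drives one calculator over a set of matches, as a search loop might.
func run(agg Aggregation, matches []Match) Calculator {
	calc := agg.Calculator()
	for _, m := range matches {
		calc.Consume(m)
	}
	calc.Finish()
	return calc
}

func main() {
	calc := run(SpreadAggregation{}, []Match{{4.5}, {9.0}, {5.5}})
	fmt.Println(calc.(*Spread).Result)
}
```

Separating the factory from the calculator is what lets the same aggregation definition be reused inside many buckets, each with independent state.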