User-Defined Aggregate Functions (UDAFs) and non-aggregate (scalar) functions (UDFs) for BigQuery SQL engine.
Please visit the main Apache DataSketches website for more information about DataSketches library.
If you are interested in making contributions to this project please see our Community page for how to contact us.
- Requires Emscripten (emcc compiler)
- Requires a link to datasketches-cpp in this repository
- Requires make utility
- Requires Google Cloud CLI
- Requires npm and @dataform/cli package
npm install -g @dataform/cli
- Requires setting environment variables
- JS_BUCKET: to hold compiled artifacts (must include gs://)
- BQ_PROJECT: location of stored SQL functions (routines)
- BQ_DATASET: location of stored SQL functions (routines)
- BQ_LOCATION: location of BQ_DATASET
gcloud auth application-default login # for authentication
make # performs compilation
make install # upload to $JS_BUCKET & create functions in $BQ_PROJECT.$BQ_DATASET
make test # runs predefined tests in BQ
The above steps can be executed in the root directory to install everything, or can be run from an individual sketch directory to install only that particular sketch.
DataSketches are probabilistic data structures that can process massive amounts of data and return very accurate results with a small memory footprint. Because of this, DataSketches are particularly useful for "big data" use cases such as streaming analytics and data warehousing.
This package includes BigQuery UD(A)Fs for the following Sketch types:
Sketch Type | Description |
---|---|
Frequent Items Sketch | Estimates the frequency of items in a stream to find the most frequent ones (e.g., top-selling products, most active users). |
CPC Sketch | A very compact sketch for estimating the number of unique items, especially in distributed environments. |
HLL Sketch | Memory-efficient sketch for estimating the number of unique items, optimized for high accuracy. |
KLL Sketch | Estimates the distribution of values, allowing you to find quantiles (like median, percentiles) without storing all the data. |
Theta Sketch | Estimates unique items and supports set operations (union, intersection, difference) on those items. |
Tuple Sketch | Similar to Theta Sketch but allows associating values with each unique item, enabling operations like sum, min, max on those values. |
TDigest | Another algorithm and very compact data structure for estimating quantiles and ranks of numeric values. |
Description: CPC sketches are a compact and efficient way to estimate the cardinality of a dataset, especially in distributed environments. They provide accurate estimates with low memory usage and are particularly useful for applications like counting unique users, analyzing website traffic, or tracking distinct events.
For more information: CPC Sketches
Function Name | Function Type | Signature | Description |
---|---|---|---|
cpc_sketch_agg_union | AGGREGATE | (sketch BYTES) -> BYTES | Creates a sketch that represents the union of the given column of sketches. Param sketch: the column of sketches. Each as BYTES. Defaults: lg_k = 12, seed = 9001. Returns: a Compact, Compressed CPC Sketch, as BYTES. |
cpc_sketch_agg_string | AGGREGATE | (str STRING) -> BYTES | Creates a sketch that represents the cardinality of the given STRING column. Param str: the STRING column of identifiers. Defaults: lg_k = 12, seed = 9001. Returns: a Compact, Compressed CPC Sketch, as BYTES |
cpc_sketch_agg_int64 | AGGREGATE | (value INT64) -> BYTES | Creates a sketch that represents the cardinality of the given INT64 column. Param value: the INT64 column of identifiers. Defaults: lg_k = 12, seed = 9001. Returns: a Compact, Compressed CPC Sketch, as BYTES |
cpc_sketch_agg_string_lgk_seed | AGGREGATE | (str STRING, params STRUCT<lg_k BYTEINT, seed INT64> NOT AGGREGATE) -> BYTES | Creates a sketch that represents the cardinality of the given STRING column. Param str: the STRING column of identifiers. Param lg_k: the sketch accuracy/size parameter as an integer in the range [4, 26]. Param seed: the seed to be used by the underlying hash function. Returns: a Compact, Compressed CPC Sketch, as BYTES |
cpc_sketch_agg_union_lgk_seed | AGGREGATE | (sketch BYTES, params STRUCT<lg_k BYTEINT, seed INT64> NOT AGGREGATE) -> BYTES | Creates a sketch that represents the union of the given column of sketches. Param sketch: the column of sketches. Each as BYTES. Param lg_k: the sketch accuracy/size parameter as an integer in the range [4, 26]. Param seed: This is used to confirm that the given sketches were configured with the correct seed. Returns: a Compact, Compressed CPC Sketch, as BYTES. |
cpc_sketch_agg_int64_lgk_seed | AGGREGATE | (value INT64, params STRUCT<lg_k BYTEINT, seed INT64> NOT AGGREGATE) -> BYTES | Creates a sketch that represents the cardinality of the given INT64 column. Param value: the INT64 column of identifiers. Param lg_k: the sketch accuracy/size parameter as an integer in the range [4, 26]. Param seed: the seed to be used by the underlying hash function. Returns: a Compact, Compressed CPC Sketch, as BYTES |
cpc_sketch_get_estimate | SCALAR | (sketch BYTES) -> FLOAT64 | Gets cardinality estimate and bounds from given sketch. Param sketch: The given sketch to query as BYTES. Defaults: seed = 9001. Returns: a FLOAT64 value as the cardinality estimate. |
cpc_sketch_to_string | SCALAR | (sketch BYTES) -> STRING | Returns a summary string that represents the state of the given sketch. Param sketch the given sketch as BYTES. Defaults: seed = 9001. Returns: a STRING that represents the state of the given sketch. |
cpc_sketch_get_estimate_seed | SCALAR | (sketch BYTES, seed INT64) -> FLOAT64 | Gets cardinality estimate and bounds from given sketch. Param sketch: The given sketch to query as BYTES. Param seed: This is used to confirm that the given sketch was configured with the correct seed. Returns: a FLOAT64 value as the cardinality estimate. |
cpc_sketch_to_string_seed | SCALAR | (sketch BYTES, seed INT64) -> STRING | Returns a summary string that represents the state of the given sketch. Param sketch the given sketch as BYTES. Param seed: This is used to confirm that the given sketch was configured with the correct seed. Returns: a STRING that represents the state of the given sketch. |
cpc_sketch_union | SCALAR | (sketchA BYTES, sketchB BYTES) -> BYTES | Computes a sketch that represents the scalar union of the two given sketches. Param sketchA: the first sketch as BYTES. Param sketchB: the second sketch as BYTES. Defaults: lg_k = 12, seed = 9001. Returns: a CPC Sketch, as BYTES. |
cpc_sketch_get_estimate_and_bounds | SCALAR | (sketch BYTES, num_std_devs BYTEINT) -> STRUCT<estimate FLOAT64, lower_bound FLOAT64, upper_bound FLOAT64> | Gets cardinality estimate and bounds from given sketch. Param sketch: The given sketch to query as bytes. Param num_std_devs: The returned bounds will be based on the statistical confidence interval determined by the given number of standard deviations from the returned estimate. This number may be one of {1,2,3}, where 1 represents 68% confidence, 2 represents 95% confidence and 3 represents 99.7% confidence. For example, if the given num_std_devs = 2 and the returned values are {1000, 990, 1010} that means that with 95% confidence, the true value lies within the range [990, 1010]. Defaults: seed = 9001. Returns: a STRUCT with 3 FLOAT64 values as {estimate, lower_bound, upper_bound}. |
cpc_sketch_union_lgk_seed | SCALAR | (sketchA BYTES, sketchB BYTES, lg_k BYTEINT, seed INT64) -> BYTES | Computes a sketch that represents the scalar union of the two given sketches. Param sketchA: the first sketch as BYTES. Param sketchB: the second sketch as BYTES. Param lg_k: the sketch accuracy/size parameter as an integer in the range [4, 26]. Param seed: This is used to confirm that the given sketches were configured with the correct seed. Returns: a CPC Sketch, as BYTES. |
cpc_sketch_get_estimate_and_bounds_seed | SCALAR | (sketch BYTES, num_std_devs BYTEINT, seed INT64) -> STRUCT<estimate FLOAT64, lower_bound FLOAT64, upper_bound FLOAT64> | Gets cardinality estimate and bounds from given sketch. Param sketch: The given sketch to query as bytes. Param num_std_devs: The returned bounds will be based on the statistical confidence interval determined by the given number of standard deviations from the returned estimate. This number may be one of {1,2,3}, where 1 represents 68% confidence, 2 represents 95% confidence and 3 represents 99.7% confidence. For example, if the given num_std_devs = 2 and the returned values are {1000, 990, 1010} that means that with 95% confidence, the true value lies within the range [990, 1010]. Param seed: This is used to confirm that the given sketch was configured with the correct seed. Returns: a STRUCT with 3 FLOAT64 values as {estimate, lower_bound, upper_bound}. |
Examples:
# using defaults
# expected 5
select `$BQ_DATASET`.cpc_sketch_get_estimate(
`$BQ_DATASET`.cpc_sketch_union(
(select `$BQ_DATASET`.cpc_sketch_agg_string(str) from unnest(["a", "b", "c"]) as str),
(select `$BQ_DATASET`.cpc_sketch_agg_string(str) from unnest(["c", "d", "e"]) as str)
)
);
# using full signatures
# expected 5
select `$BQ_DATASET`.cpc_sketch_get_estimate_seed(
`$BQ_DATASET`.cpc_sketch_union_lgk_seed(
(select `$BQ_DATASET`.cpc_sketch_agg_string_lgk_seed(str, struct<byteint, int64>(10, 111)) from unnest(["a", "b", "c"]) as str),
(select `$BQ_DATASET`.cpc_sketch_agg_string_lgk_seed(str, struct<byteint, int64>(10, 111)) from unnest(["c", "d", "e"]) as str),
10,
111
),
111
);
# using defaults
create or replace table `$BQ_DATASET`.cpc_sketch(sketch bytes);
insert into `$BQ_DATASET`.cpc_sketch
(select `$BQ_DATASET`.cpc_sketch_agg_int64(value) from unnest(GENERATE_ARRAY(1, 10000, 1)) as value);
insert into `$BQ_DATASET`.cpc_sketch
(select `$BQ_DATASET`.cpc_sketch_agg_int64(value) from unnest(GENERATE_ARRAY(100000, 110000, 1)) as value);
select `$BQ_DATASET`.cpc_sketch_to_string(sketch) from `$BQ_DATASET`.cpc_sketch;
# expected about 20000
select `$BQ_DATASET`.cpc_sketch_get_estimate(
`$BQ_DATASET`.cpc_sketch_agg_union(sketch)
) from `$BQ_DATASET`.cpc_sketch;
select `$BQ_DATASET`.cpc_sketch_get_estimate_and_bounds(
`$BQ_DATASET`.cpc_sketch_agg_union(sketch),
3
) from `$BQ_DATASET`.cpc_sketch;
drop table `$BQ_DATASET`.cpc_sketch;
# using full signatures
create or replace table `$BQ_DATASET`.cpc_sketch(sketch bytes);
insert into `$BQ_DATASET`.cpc_sketch
(select `$BQ_DATASET`.cpc_sketch_agg_int64_lgk_seed(value, struct<byteint, int64>(10, 111)) from unnest(GENERATE_ARRAY(1, 10000, 1)) as value);
insert into `$BQ_DATASET`.cpc_sketch
(select `$BQ_DATASET`.cpc_sketch_agg_int64_lgk_seed(value, struct<byteint, int64>(10, 111)) from unnest(GENERATE_ARRAY(100000, 110000, 1)) as value);
select `$BQ_DATASET`.cpc_sketch_to_string_seed(sketch, 111) from `$BQ_DATASET`.cpc_sketch;
# expected about 20000
select `$BQ_DATASET`.cpc_sketch_get_estimate_seed(
`$BQ_DATASET`.cpc_sketch_agg_union_lgk_seed(sketch, struct<byteint, int64>(10, 111)),
111
) from `$BQ_DATASET`.cpc_sketch;
select `$BQ_DATASET`.cpc_sketch_get_estimate_and_bounds_seed(
`$BQ_DATASET`.cpc_sketch_agg_union_lgk_seed(sketch, struct<byteint, int64>(10, 111)),
3,
111
) from `$BQ_DATASET`.cpc_sketch;
drop table `$BQ_DATASET`.cpc_sketch;
Description: Frequent Items (FI) sketches are used to estimate the frequencies of items in a dataset. They are effective for identifying the most frequent items, such as the top products purchased or the most popular search queries.
For more information: Frequency Sketches
Function Name | Function Type | Signature | Description |
---|---|---|---|
frequent_strings_sketch_merge | AGGREGATE | (sketch BYTES, lg_max_map_size BYTEINT NOT AGGREGATE) -> BYTES | Merges sketches from the given column. Param sketch: the column of values. Param lg_max_map_size: the sketch accuracy/size parameter as an integer not less than 3. Returns: a serialized Frequent Strings sketch as BYTES. |
frequent_strings_sketch_build | AGGREGATE | (item STRING, weight INT64, lg_max_map_size BYTEINT NOT AGGREGATE) -> BYTES | Creates a sketch that represents frequencies of the given column. Param item: the column of STRING values. Param weight: the amount by which the weight of the item should be increased. Param lg_max_map_size: the sketch accuracy/size parameter as a BYTEINT not less than 3. Returns: a Frequent Strings Sketch, as bytes. |
frequent_strings_sketch_to_string | SCALAR | (sketch BYTES) -> STRING | Returns a summary string that represents the state of the given sketch. Param sketch: the given sketch as sketch encoded bytes. Returns: a string that represents the state of the given sketch. |
frequent_strings_sketch_get_result | SCALAR | (sketch BYTES, error_type STRING, threshold INT64) -> ARRAY<STRUCT<item STRING, estimate INT64, lower_bound INT64, upper_bound INT64>> | Returns an array of rows that include frequent items, estimates, lower and upper bounds given an error_type and a threshold. Param sketch: the given sketch as sketch encoded bytes. Param error_type: determines whether no false positives or no false negatives are desired. Param threshold: a threshold to include items in the result list. If NULL, the maximum error of the sketch is used as a threshold. Returns: an array of frequent items with frequency estimates, lower and upper bounds. |
Examples:
select `$BQ_DATASET`.frequent_strings_sketch_to_string(`$BQ_DATASET`.frequent_strings_sketch_build(str, 1, 5)) from unnest(["a", "b", "c"]) as str;
create or replace table `$BQ_DATASET`.fs_sketch(sketch bytes);
insert into `$BQ_DATASET`.fs_sketch
(select `$BQ_DATASET`.frequent_strings_sketch_build(str, 1, 5) from unnest(["a", "b", "c", "d"]) as str);
insert into `$BQ_DATASET`.fs_sketch
(select `$BQ_DATASET`.frequent_strings_sketch_build(str, 1, 5) from unnest(["a", "a", "c"]) as str);
select `$BQ_DATASET`.frequent_strings_sketch_get_result(`$BQ_DATASET`.frequent_strings_sketch_merge(sketch, 5), "NO_FALSE_NEGATIVES", null) from `$BQ_DATASET`.fs_sketch;
drop table `$BQ_DATASET`.fs_sketch;
Description: HyperLogLog (HLL) sketches are another type of cardinality estimation sketch. They are known for their high accuracy and low memory consumption, making them suitable for large datasets and real-time analytics.
For more information: HLL Sketches
Function Name | Function Type | Signature | Description |
---|---|---|---|
hll_sketch_agg_string | AGGREGATE | (str STRING) -> BYTES | Creates a sketch that represents the cardinality of the given STRING column. Param str: the STRING column of identifiers. Defaults: lg_k = 12, tgt_type = HLL_4. Returns: an HLL Sketch, as BYTES. |
hll_sketch_agg_union | AGGREGATE | (sketch BYTES) -> BYTES | Creates a sketch that represents the union of the given column of sketches. Param sketch: the column of sketches. Each as BYTES. Defaults: lg_k = 12, tgt_type = HLL_4. Returns: an HLL Sketch, as BYTES. |
hll_sketch_agg_int64 | AGGREGATE | (value INT64) -> BYTES | Creates a sketch that represents the cardinality of the given INT64 column. Param value: the INT64 column of identifiers. Defaults: lg_k = 12, tgt_type = HLL_4. Returns: an HLL Sketch, as BYTES. |
hll_sketch_agg_string_lgk_type | AGGREGATE | (str STRING, params STRUCT<lg_k BYTEINT, tgt_type STRING> NOT AGGREGATE) -> BYTES | Creates a sketch that represents the cardinality of the given STRING column. Param str: the STRING column of identifiers. Param lg_k: the sketch accuracy/size parameter as an integer in the range [4, 21]. Param tgt_type: The HLL type to use: one of {"HLL_4", "HLL_6", "HLL_8"}. Returns: an HLL Sketch, as BYTES. |
hll_sketch_agg_union_lgk_type | AGGREGATE | (sketch BYTES, params STRUCT<lg_k BYTEINT, tgt_type STRING> NOT AGGREGATE) -> BYTES | Creates a sketch that represents the union of the given column of sketches. Param sketch: the column of sketches. Each as BYTES. Param lg_k: the sketch accuracy/size parameter as an integer in the range [4, 21]. Param tgt_type: The HLL type to use: one of {"HLL_4", "HLL_6", "HLL_8"}. Returns: an HLL Sketch, as BYTES. |
hll_sketch_agg_int64_lgk_type | AGGREGATE | (value INT64, params STRUCT<lg_k BYTEINT, tgt_type STRING> NOT AGGREGATE) -> BYTES | Creates a sketch that represents the cardinality of the given INT64 column. Param value: the INT64 column of identifiers. Param lg_k: the sketch accuracy/size parameter as an integer in the range [4, 21]. Param tgt_type: The HLL type to use: one of {"HLL_4", "HLL_6", "HLL_8"}. Returns: an HLL Sketch, as BYTES. |
hll_sketch_get_estimate | SCALAR | (sketch BYTES) -> FLOAT64 | Returns a summary string that represents the state of the given sketch. Param sketch: the given sketch as BYTES. Returns: the cardinality estimate as FLOAT64 value. |
hll_sketch_to_string | SCALAR | (sketch BYTES) -> STRING | Returns a summary string that represents the state of the given sketch. Param sketch: the given sketch as BYTES. Returns: a STRING that represents the state of the given sketch. |
hll_sketch_union | SCALAR | (sketchA BYTES, sketchB BYTES) -> BYTES | Computes a sketch that represents the union of the two given sketches. Param sketchA: the first sketch as bytes. Param sketchB: the second sketch as bytes. Defaults: lg_k = 12, tgt_type = HLL_4. Returns: an HLL Sketch, as BYTES. |
hll_sketch_union_lgk_type | SCALAR | (sketchA BYTES, sketchB BYTES, lg_k BYTEINT, tgt_type STRING) -> BYTES | Computes a sketch that represents the union of the two given sketches. Param sketchA: the first sketch as bytes. Param sketchB: the second sketch as bytes. Param lg_k: the sketch accuracy/size parameter as an integer in the range [4, 21]. Param tgt_type: The HLL type to use: one of {"HLL_4", "HLL_6", "HLL_8"}. Returns: an HLL Sketch, as BYTES. |
hll_sketch_get_estimate_and_bounds | SCALAR | (sketch BYTES, num_std_devs BYTEINT) -> STRUCT<estimate FLOAT64, lower_bound FLOAT64, upper_bound FLOAT64> | Gets cardinality estimate and bounds from given sketch. Param sketch: The given sketch to query as BYTES. Param num_std_devs: The returned bounds will be based on the statistical confidence interval determined by the given number of standard deviations from the returned estimate. This number may be one of {1,2,3}, where 1 represents 68% confidence, 2 represents 95% confidence and 3 represents 99.7% confidence. For example, if the given num_std_devs = 2 and the returned values are {1000, 990, 1010} that means that with 95% confidence, the true value lies within the range [990, 1010]. Returns: a struct with 3 FLOAT64 values as {estimate, lower_bound, upper_bound}. |
Examples:
# expected 3
select `$BQ_DATASET`.hll_sketch_get_estimate(`$BQ_DATASET`.hll_sketch_agg_string(s)) from unnest(["a", "b", "c"]) as s;
select `$BQ_DATASET`.hll_sketch_to_string(`$BQ_DATASET`.hll_sketch_agg_string(s)) from unnest(["a", "b", "c"]) as s;
# expected 5
select `$BQ_DATASET`.hll_sketch_get_estimate_and_bounds(
`$BQ_DATASET`.hll_sketch_union_lgk_type(
(select `$BQ_DATASET`.hll_sketch_agg_string_lgk_type(str, struct<byteint, string>(10, "HLL_8")) from unnest(["a", "b", "c"]) as str),
(select `$BQ_DATASET`.hll_sketch_agg_string_lgk_type(str, struct<byteint, string>(10, "HLL_8")) from unnest(["c", "d", "e"]) as str),
10,
"HLL_8"
),
2
);
select `$BQ_DATASET`.hll_sketch_to_string(
`$BQ_DATASET`.hll_sketch_union_lgk_type(
(select `$BQ_DATASET`.hll_sketch_agg_string_lgk_type(str, struct<byteint, string>(10, "HLL_8")) from unnest(["a", "b", "c"]) as str),
(select `$BQ_DATASET`.hll_sketch_agg_string_lgk_type(str, struct<byteint, string>(10, "HLL_8")) from unnest(["c", "d", "e"]) as str),
10,
"HLL_8"
)
);
create or replace table `$BQ_DATASET`.hll_sketch(sketch bytes);
insert into `$BQ_DATASET`.hll_sketch
(select `$BQ_DATASET`.hll_sketch_agg_int64(value) from unnest(GENERATE_ARRAY(1, 10000, 1)) as value);
insert into `$BQ_DATASET`.hll_sketch
(select `$BQ_DATASET`.hll_sketch_agg_int64(value) from unnest(GENERATE_ARRAY(100000, 110000, 1)) as value);
# expected estimate about 20000
select `$BQ_DATASET`.hll_sketch_to_string(
`$BQ_DATASET`.hll_sketch_agg_union(sketch)
) from `$BQ_DATASET`.hll_sketch;
select `$BQ_DATASET`.hll_sketch_to_string(
`$BQ_DATASET`.hll_sketch_agg_union_lgk_type(sketch, struct<byteint, string>(10, "HLL_6"))
) from `$BQ_DATASET`.hll_sketch;
drop table `$BQ_DATASET`.hll_sketch;
create or replace table `$BQ_DATASET`.hll_sketch(sketch bytes);
insert into `$BQ_DATASET`.hll_sketch
(select `$BQ_DATASET`.hll_sketch_agg_int64_lgk_type(value, struct<int, string>(8, "HLL_6")) from unnest(GENERATE_ARRAY(1, 10000, 1)) as value);
insert into `$BQ_DATASET`.hll_sketch
(select `$BQ_DATASET`.hll_sketch_agg_int64_lgk_type(value, struct<int, string>(8, "HLL_6")) from unnest(GENERATE_ARRAY(100000, 110000, 1)) as value);
# expected estimate about 20000
select `$BQ_DATASET`.hll_sketch_to_string(
`$BQ_DATASET`.hll_sketch_agg_union_lgk_type(sketch, struct<byteint, string>(8, "HLL_6"))
) from `$BQ_DATASET`.hll_sketch;
drop table `$BQ_DATASET`.hll_sketch;
Description: KLL sketches are quantile sketches that provide approximate quantiles for a dataset. They are useful for understanding the distribution of data and calculating percentiles, such as the median or 95th percentile.
For more information: KLL Sketches
Function Name | Function Type | Signature | Description |
---|---|---|---|
kll_sketch_float_build | AGGREGATE | (value FLOAT64) -> BYTES | Creates a sketch that represents the distribution of the given column. Param value: the column of FLOAT64 values. Defaults: k = 200. Returns: a KLL Sketch, as bytes. |
kll_sketch_float_merge | AGGREGATE | (sketch BYTES) -> BYTES | Merges sketches from the given column. Param sketch: the column of values. Defaluts: k = 200. Returns: a serialized KLL sketch as BYTES. |
kll_sketch_float_merge_k | AGGREGATE | (sketch BYTES, k INT NOT AGGREGATE) -> BYTES | Merges sketches from the given column. Param sketch: the column of values. Param k: the sketch accuracy/size parameter as an integer in the range [8, 65535]. Returns: a serialized KLL sketch as BYTES. |
kll_sketch_float_build_k | AGGREGATE | (value FLOAT64, k INT NOT AGGREGATE) -> BYTES | Creates a sketch that represents the distribution of the given column. Param value: the column of FLOAT64 values. Param k: the sketch accuracy/size parameter as an INT in the range [8, 65535]. Returns: a KLL Sketch, as bytes. |
kll_sketch_float_get_n | SCALAR | (sketch BYTES) -> INT64 | Returns the length of the input stream. Param sketch: the given sketch as BYTES. Returns: stream length as INT64 |
kll_sketch_float_get_min_value | SCALAR | (sketch BYTES) -> FLOAT64 | Returns the minimum value of the input stream. Param sketch: the given sketch as BYTES. Returns: min value as FLOAT64 |
kll_sketch_float_to_string | SCALAR | (sketch BYTES) -> STRING | Returns a summary string that represents the state of the given sketch. Param sketch: the given sketch as sketch encoded bytes. Returns: a string that represents the state of the given sketch. |
kll_sketch_float_get_num_retained | SCALAR | (sketch BYTES) -> INT64 | Returns the number of retained items (samples) in the sketch. Param sketch: the given sketch as BYTES. Returns: number of retained items as INT64 |
kll_sketch_float_get_max_value | SCALAR | (sketch BYTES) -> FLOAT64 | Returns the maximum value of the input stream. Param sketch: the given sketch as BYTES. Returns: max value as FLOAT64 |
kll_sketch_float_get_normalized_rank_error | SCALAR | (sketch BYTES, pmf BOOL) -> FLOAT64 | Returns the approximate rank error of the given sketch normalized as a fraction between zero and one. Param sketch: the given sketch as BYTES. Param pmf: if true, returns the "double-sided" normalized rank error for the get_PMF() function. Otherwise, it is the "single-sided" normalized rank error for all the other queries. Returns: normalized rank error as FLOAT64 |
kll_sketch_float_get_rank | SCALAR | (sketch BYTES, value FLOAT64, inclusive BOOL) -> FLOAT64 | Returns an approximation to the normalized rank, on the interval [0.0, 1.0], of the given value. Param sketch: the given sketch in serialized form. Param value: value to be ranked. Param inclusive: if true the weight of the given value is included into the rank. Returns: an approximate rank of the given value. |
kll_sketch_float_get_pmf | SCALAR | (sketch BYTES, split_points ARRAY, inclusive BOOL) -> ARRAY | Returns an approximation to the Probability Mass Function (PMF) of the input stream as an array of probability masses defined by the given split_points. Param sketch: the given sketch as BYTES. Param split_points: an array of M unique, monotonically increasing values (of the same type as the input values) that divide the input value domain into M+1 non-overlapping intervals. Each interval except for the end intervals starts with a split-point and ends with the next split-point in sequence. The first interval starts below the minimum value of the stream (corresponding to a zero rank or zero probability), and ends with the first split-point The last (m+1)th interval starts with the last split-point and ends above the maximum value of the stream (corresponding to a rank or probability of 1.0). Param inclusive: if true and the upper boundary of an interval equals a value retained by the sketch, the interval will include that value. If the lower boundary of an interval equals a value retained by the sketch, the interval will exclude that value. If false and the upper boundary of an interval equals a value retained by the sketch, the interval will exclude that value. If the lower boundary of an interval equals a value retained by the sketch, the interval will include that value. Returns: the PMF as a FLOAT64 array of M+1 probability masses on the interval [0.0, 1.0]. The sum of the probability masses of all (m+1) intervals is 1.0. |
kll_sketch_float_kolmogorov_smirnov | SCALAR | (sketchA BYTES, sketchB BYTES, pvalue FLOAT64) -> BOOL | Performs the Kolmogorov-Smirnov Test between two KLL sketches of type FLOAT64. If the given sketches have insufficient data or if the sketch sizes are too small, this will return false. Param sketchA: sketch A in serialized form. Param sketchB: sketch B in serialized form. Param pvalue: Target p-value. Typically 0.001 to 0.1, e.g. 0.05. Returns: boolean indicating whether we can reject the null hypothesis (that the sketches reflect the same underlying distribution) using the provided p-value. |
kll_sketch_float_get_cdf | SCALAR | (sketch BYTES, split_points ARRAY, inclusive BOOL) -> ARRAY | Returns an approximation to the Cumulative Distribution Function (CDF) of the input stream as an array of cumulative probabilities defined by the given split_points. Param sketch: the given sketch as BYTES. Param split_points: an array of M unique, monotonically increasing values (of the same type as the input values to the sketch) that divide the input value domain into M+1 overlapping intervals. The start of each interval is below the lowest input value retained by the sketch (corresponding to a zero rank or zero probability). The end of each interval is the associated split-point except for the top interval where the end is the maximum input value of the stream. Param inclusive: if true and the upper boundary of an interval equals a value retained by the sketch, the interval will include that value. If the lower boundary of an interval equals a value retained by the sketch, the interval will exclude that value. If false and the upper boundary of an interval equals a value retained by the sketch, the interval will exclude that value. If the lower boundary of an interval equals a value retained by the sketch, the interval will include that value. Returns: the CDF as a monotonically increasing FLOAT64 array of M+1 cumulative probablities on the interval [0.0, 1.0]. The top-most probability of the returned array is always 1.0. |
kll_sketch_float_get_quantile | SCALAR | (sketch BYTES, rank FLOAT64, inclusive BOOL) -> FLOAT64 | Returns a value from the sketch that is the best approximation to a value from the original stream with the given rank. Param sketch: the given sketch in serialized form. Param rank: rank of a value in the hypothetical sorted stream. Param inclusive: if true, the given rank is considered inclusive (includes weight of a value) Returns: an approximate quantile associated with the given rank. |
Examples:
create or replace table `$BQ_DATASET`.kll_sketch(sketch bytes);
# using default
insert into `$BQ_DATASET`.kll_sketch
(select `$BQ_DATASET`.kll_sketch_float_build(value) from unnest([1,2,3,4,5,6,7,8,9,10]) as value);
# using full signature
insert into `$BQ_DATASET`.kll_sketch
(select `$BQ_DATASET`.kll_sketch_float_build_k(value, 100) from unnest([11,12,13,14,15,16,17,18,19,20]) as value);
select `$BQ_DATASET`.kll_sketch_float_to_string(sketch) from `$BQ_DATASET`.kll_sketch;
# using default
select `$BQ_DATASET`.kll_sketch_float_to_string(`$BQ_DATASET`.kll_sketch_float_merge(sketch)) from `$BQ_DATASET`.kll_sketch;
# using full signature
select `$BQ_DATASET`.kll_sketch_float_to_string(`$BQ_DATASET`.kll_sketch_float_merge_k(sketch, 100)) from `$BQ_DATASET`.kll_sketch;
# expected 0.5
select `$BQ_DATASET`.kll_sketch_float_get_rank(`$BQ_DATASET`.kll_sketch_float_merge(sketch), 10, true) from `$BQ_DATASET`.kll_sketch;
# expected 10
select `$BQ_DATASET`.kll_sketch_float_get_quantile(`$BQ_DATASET`.kll_sketch_float_merge(sketch), 0.5, true) from `$BQ_DATASET`.kll_sketch;
# expected 20
select `$BQ_DATASET`.kll_sketch_float_get_n(`$BQ_DATASET`.kll_sketch_float_merge(sketch)) from `$BQ_DATASET`.kll_sketch;
# expected 0.5, 0.5
select `$BQ_DATASET`.kll_sketch_float_get_pmf(`$BQ_DATASET`.kll_sketch_float_merge(sketch), [10.0], true) from `$BQ_DATASET`.kll_sketch;
# expected 0.5, 1
select `$BQ_DATASET`.kll_sketch_float_get_cdf(`$BQ_DATASET`.kll_sketch_float_merge(sketch), [10.0], true) from `$BQ_DATASET`.kll_sketch;
# expected 1
select `$BQ_DATASET`.kll_sketch_float_get_min_value(`$BQ_DATASET`.kll_sketch_float_merge(sketch)) from `$BQ_DATASET`.kll_sketch;
# expected 20
select `$BQ_DATASET`.kll_sketch_float_get_max_value(`$BQ_DATASET`.kll_sketch_float_merge(sketch)) from `$BQ_DATASET`.kll_sketch;
drop table `$BQ_DATASET`.kll_sketch;
# expected about 1.3%
select `$BQ_DATASET`.kll_sketch_float_get_normalized_rank_error(`$BQ_DATASET`.kll_sketch_float_build(value), false) from unnest(generate_array(1, 10000)) as value;
select `$BQ_DATASET`.kll_sketch_float_get_num_retained(`$BQ_DATASET`.kll_sketch_float_build(value)) from unnest(generate_array(1, 10000)) as value;
# expected false
select `$BQ_DATASET`.kll_sketch_float_kolmogorov_smirnov(
(select `$BQ_DATASET`.kll_sketch_float_build(value) from unnest([1,2,3,4,5,6,7,8,9,10]) as value),
(select `$BQ_DATASET`.kll_sketch_float_build(value) from unnest([1,2,3,4,5,6,7,8,9,10]) as value),
0.05
);
# expected true
select `$BQ_DATASET`.kll_sketch_float_kolmogorov_smirnov(
(select `$BQ_DATASET`.kll_sketch_float_build(value) from unnest([1,2,3,4,5,6,7,8,9,10]) as value),
(select `$BQ_DATASET`.kll_sketch_float_build(value) from unnest([11,12,13,14,15,16,17,18,19,20]) as value),
0.05
);
Description: Theta sketches are used for set operations like union, intersection, and difference. They are efficient for estimating the size of these operations on large datasets, enabling applications like analyzing user overlap or comparing different groups.
For more information: Theta sketches
Function Name | Function Type | Signature | Description |
---|---|---|---|
theta_sketch_agg_int64 | AGGREGATE | (value INT64) -> BYTES | Creates a sketch that represents the cardinality of the given INT64 column. Param value: the INT64 column of identifiers. Defaults: lg_k = 12, seed = 9001, p = 1.0. Returns: a Compact, Compressed Theta Sketch, as BYTES. |
theta_sketch_agg_union | AGGREGATE | (sketch BYTES) -> BYTES | Creates a sketch that represents the union of the given column of sketches. Param sketch: the column of sketches. Each as BYTES. Defaults: lg_k = 12, seed = 9001. Returns: a Compact, Compressed Theta Sketch, as BYTES. |
theta_sketch_agg_string | AGGREGATE | (str STRING) -> BYTES | Creates a sketch that represents the cardinality of the given STRING column. Param str: the STRING column of identifiers. Defaults: lg_k = 12, seed = 9001, p = 1.0. Returns: a Compact, Compressed Theta Sketch, as BYTES. |
theta_sketch_agg_union_lgk_seed | AGGREGATE | (sketch BYTES, params STRUCT<lg_k BYTEINT, seed INT64> NOT AGGREGATE) -> BYTES | Creates a sketch that represents the union of the given column of sketches. Param sketch: the column of sketches. Each as BYTES. Param lg_k: the sketch accuracy/size parameter as a BYTEINT in the range [4, 26]. Param seed: This is used to confirm that the given sketches were configured with the correct seed. Returns: a Compact, Compressed Theta Sketch, as BYTES. |
theta_sketch_agg_int64_lgk_seed_p | AGGREGATE | (value INT64, params STRUCT<lg_k BYTEINT, seed INT64, p FLOAT64> NOT AGGREGATE) -> BYTES | Creates a sketch that represents the cardinality of the given INT64 column. Param value: the INT64 column of identifiers. Param lg_k: the sketch accuracy/size parameter as a BYTEINT in the range [4, 26]. A NULL specifies the default of 12. Param seed: the seed to be used by the underlying hash function. A NULL specifies the default of 9001. Param p: up-front sampling probability. A NULL specifies the default of 1.0. Returns: a Compact, Compressed Theta Sketch, as BYTES. |
theta_sketch_agg_string_lgk_seed_p | AGGREGATE | (str STRING, params STRUCT<lg_k BYTEINT, seed INT64, p FLOAT64> NOT AGGREGATE) -> BYTES | Creates a sketch that represents the cardinality of the given STRING column. Param str: the STRING column of identifiers. Param lg_k: the sketch accuracy/size parameter as a BYTEINT in the range [4, 26]. A NULL specifies the default of 12. Param seed: the seed to be used by the underlying hash function. A NULL specifies the default of 9001. Param p: up-front sampling probability. A NULL specifies the default of 1.0. Returns: a Compact, Compressed Theta Sketch, as BYTES. |
theta_sketch_get_estimate | SCALAR | (sketch BYTES) -> FLOAT64 | Gets cardinality estimate and bounds from given sketch. Param sketch: The given sketch to query as BYTES. Defaults: seed = 9001. Returns: a FLOAT64 value as the cardinality estimate. |
theta_sketch_to_string | SCALAR | (sketch BYTES) -> STRING | Returns a summary string that represents the state of the given sketch. Param sketch: the given sketch as BYTES. Defaults: seed = 9001. Returns: a STRING that represents the state of the given sketch. |
theta_sketch_get_num_retained | SCALAR | (sketch BYTES) -> INT | Returns the number of retained entries in the given sketch. Param sketch: The given sketch to query as BYTES. Defaults: seed = 9001. Returns: number of retained entries as INT. |
theta_sketch_get_theta | SCALAR | (sketch BYTES) -> FLOAT64 | Returns theta (effective sampling rate) as a fraction from 0 to 1. Param sketch: The given sketch to query as BYTES. Defaults: seed = 9001. Returns: theta as FLOAT64. |
theta_sketch_get_num_retained_seed | SCALAR | (sketch BYTES, seed INT64) -> INT | Returns the number of retained entries in the given sketch. Param sketch: The given sketch to query as BYTES. Param seed: This is used to confirm that the given sketch was configured with the correct seed. Returns: number of retained entries as INT. |
theta_sketch_get_estimate_seed | SCALAR | (sketch BYTES, seed INT64) -> FLOAT64 | Gets cardinality estimate and bounds from given sketch. Param sketch: The given sketch to query as BYTES. Param seed: This is used to confirm that the given sketch was configured with the correct seed. Returns: a FLOAT64 value as the cardinality estimate. |
theta_sketch_to_string_seed | SCALAR | (sketch BYTES, seed INT64) -> STRING | Returns a summary string that represents the state of the given sketch. Param sketch: the given sketch as BYTES. Param seed: This is used to confirm that the given sketch was configured with the correct seed. Returns: a STRING that represents the state of the given sketch. |
theta_sketch_get_theta_seed | SCALAR | (sketch BYTES, seed INT64) -> FLOAT64 | Returns theta (effective sampling rate) as a fraction from 0 to 1. Param sketch: The given sketch to query as BYTES. Param seed: This is used to confirm that the given sketch was configured with the correct seed. Returns: theta as FLOAT64. |
theta_sketch_intersection | SCALAR | (sketchA BYTES, sketchB BYTES) -> BYTES | Computes a sketch that represents the scalar intersection of the two given sketches. Param sketchA: the first sketch as BYTES. Param sketchB: the second sketch as BYTES. Defaults: seed = 9001. Returns: a Compact, Compressed Theta Sketch, as BYTES. |
theta_sketch_union | SCALAR | (sketchA BYTES, sketchB BYTES) -> BYTES | Computes a sketch that represents the scalar union of the two given sketches. Param sketchA: the first sketch as BYTES. Param sketchB: the second sketch as BYTES. Defaults: lg_k = 12, seed = 9001. Returns: a Compact, Compressed Theta Sketch, as BYTES. |
theta_sketch_a_not_b | SCALAR | (sketchA BYTES, sketchB BYTES) -> BYTES | Computes a sketch that represents the scalar set difference: sketchA and not sketchB. Param sketchA: the first sketch "A" as bytes. Param sketchB: the second sketch "B" as bytes. Defaults: seed = 9001. Returns: a Compact, Compressed Theta Sketch, as BYTES. |
theta_sketch_intersection_seed | SCALAR | (sketchA BYTES, sketchB BYTES, seed INT64) -> BYTES | Computes a sketch that represents the scalar intersection of the two given sketches. Param sketchA: the first sketch as BYTES. Param sketchB: the second sketch as BYTES. Param seed: This is used to confirm that the given sketches were configured with the correct seed. Returns: a Compact, Compressed Theta Sketch, as BYTES. |
theta_sketch_a_not_b_seed | SCALAR | (sketchA BYTES, sketchB BYTES, seed INT64) -> BYTES | Computes a sketch that represents the scalar set difference: sketchA and not sketchB. Param sketchA: the first sketch "A" as bytes. Param sketchB: the second sketch "B" as bytes. Param seed: This is used to confirm that the given sketches were configured with the correct seed. Returns: a Compact, Compressed Theta Sketch, as BYTES. |
theta_sketch_union_lgk_seed | SCALAR | (sketchA BYTES, sketchB BYTES, lg_k BYTEINT, seed INT64) -> BYTES | Computes a sketch that represents the scalar union of the two given sketches. Param sketchA: the first sketch as BYTES. Param sketchB: the second sketch as BYTES. Param lg_k: the sketch accuracy/size parameter as an integer in the range [4, 26]. Param seed: This is used to confirm that the given sketches were configured with the correct seed. Returns: a Compact, Compressed Theta Sketch, as BYTES. |
theta_sketch_get_estimate_and_bounds | SCALAR | (sketch BYTES, num_std_devs BYTEINT) -> STRUCT<estimate FLOAT64, lower_bound FLOAT64, upper_bound FLOAT64> | Gets cardinality estimate and bounds from given sketch. Param sketch: The given sketch to query as BYTES. Param num_std_devs: The returned bounds will be based on the statistical confidence interval determined by the given number of standard deviations from the returned estimate. This number may be one of {1,2,3}, where 1 represents 68% confidence, 2 represents 95% confidence and 3 represents 99.7% confidence. For example, if the given num_std_devs = 2 and the returned values are {1000, 990, 1010} that means that with 95% confidence, the true value lies within the range [990, 1010]. Defaults: seed = 9001. Returns: a STRUCT with three FLOAT64 values as {estimate, lower_bound, upper_bound}. |
theta_sketch_jaccard_similarity | SCALAR | (sketchA BYTES, sketchB BYTES) -> STRUCT<lower_bound FLOAT64, estimate FLOAT64, upper_bound FLOAT64> | Computes the Jaccard similarity index with upper and lower bounds. The Jaccard similarity index J(A,B) = (A ^ B)/(A U B) is used to measure how similar the two sketches are to each other. If J = 1.0, the sketches are considered equal. If J = 0, the two sketches are disjoint. A Jaccard of .95 means the overlap between the two sets is 95% of the union of the two sets. Param sketchA: the first sketch as bytes. Param sketchB: the second sketch as bytes. Defaults: seed = 9001. Returns: a STRUCT with three FLOAT64 values {lower_bound, estimate, upper_bound} of the Jaccard index. |
theta_sketch_get_estimate_and_bounds_seed | SCALAR | (sketch BYTES, num_std_devs BYTEINT, seed INT64) -> STRUCT<estimate FLOAT64, lower_bound FLOAT64, upper_bound FLOAT64> | Gets cardinality estimate and bounds from given sketch. Param sketch: The given sketch to query as BYTES. Param num_std_devs: The returned bounds will be based on the statistical confidence interval determined by the given number of standard deviations from the returned estimate. This number may be one of {1,2,3}, where 1 represents 68% confidence, 2 represents 95% confidence and 3 represents 99.7% confidence. For example, if the given num_std_devs = 2 and the returned values are {1000, 990, 1010} that means that with 95% confidence, the true value lies within the range [990, 1010]. Param seed: This is used to confirm that the given sketch was configured with the correct seed. Returns: a STRUCT with three FLOAT64 values as {estimate, lower_bound, upper_bound}. |
theta_sketch_jaccard_similarity_seed | SCALAR | (sketchA BYTES, sketchB BYTES, seed INT64) -> STRUCT<lower_bound FLOAT64, estimate FLOAT64, upper_bound FLOAT64> | Computes the Jaccard similarity index with upper and lower bounds. The Jaccard similarity index J(A,B) = (A ^ B)/(A U B) is used to measure how similar the two sketches are to each other. If J = 1.0, the sketches are considered equal. If J = 0, the two sketches are disjoint. A Jaccard of .95 means the overlap between the two sets is 95% of the union of the two sets. Param sketchA: the first sketch as bytes. Param sketchB: the second sketch as bytes. Param seed: This is used to confirm that the given sketches were configured with the correct seed. Returns: a STRUCT with three FLOAT64 values {lower_bound, estimate, upper_bound} of the Jaccard index. |
Examples:
# using defaults
create or replace table `$BQ_DATASET`.theta_sketch(sketch bytes);
insert into `$BQ_DATASET`.theta_sketch
(select `$BQ_DATASET`.theta_sketch_agg_int64(value) from unnest(GENERATE_ARRAY(1, 10000, 1)) as value);
insert into `$BQ_DATASET`.theta_sketch
(select `$BQ_DATASET`.theta_sketch_agg_int64(value) from unnest(GENERATE_ARRAY(100000, 110000, 1)) as value);
# expected about 20000
select `$BQ_DATASET`.theta_sketch_get_estimate_and_bounds(
`$BQ_DATASET`.theta_sketch_agg_union(sketch),
2
) from `$BQ_DATASET`.theta_sketch;
# expected estimate about 20000
select `$BQ_DATASET`.theta_sketch_to_string(
`$BQ_DATASET`.theta_sketch_agg_union(sketch)
) from `$BQ_DATASET`.theta_sketch;
select `$BQ_DATASET`.theta_sketch_get_theta(
`$BQ_DATASET`.theta_sketch_agg_union(sketch)
) from `$BQ_DATASET`.theta_sketch;
select `$BQ_DATASET`.theta_sketch_get_num_retained(
`$BQ_DATASET`.theta_sketch_agg_union(sketch)
) from `$BQ_DATASET`.theta_sketch;
drop table `$BQ_DATASET`.theta_sketch;
# using full signatures
create or replace table `$BQ_DATASET`.theta_sketch(sketch bytes);
insert into `$BQ_DATASET`.theta_sketch
(select `$BQ_DATASET`.theta_sketch_agg_int64_lgk_seed_p(value, struct<int, int, float64>(14, 111, 0.9)) from unnest(GENERATE_ARRAY(1, 10000, 1)) as value);
insert into `$BQ_DATASET`.theta_sketch
(select `$BQ_DATASET`.theta_sketch_agg_int64_lgk_seed_p(value, struct<int, int, float64>(14, 111, 0.9)) from unnest(GENERATE_ARRAY(100000, 110000, 1)) as value);
# expected about 20000
select `$BQ_DATASET`.theta_sketch_get_estimate_and_bounds_seed(
`$BQ_DATASET`.theta_sketch_agg_union_lgk_seed(sketch, struct<int, int>(10, 111)),
2,
111
) from `$BQ_DATASET`.theta_sketch;
# expected estimate about 20000
select `$BQ_DATASET`.theta_sketch_to_string_seed(
`$BQ_DATASET`.theta_sketch_agg_union_lgk_seed(sketch, struct<int, int>(10, 111)),
111
) from `$BQ_DATASET`.theta_sketch;
select `$BQ_DATASET`.theta_sketch_get_theta_seed(
`$BQ_DATASET`.theta_sketch_agg_union_lgk_seed(sketch, struct<int, int>(10, 111)),
111
) from `$BQ_DATASET`.theta_sketch;
select `$BQ_DATASET`.theta_sketch_get_num_retained_seed(
`$BQ_DATASET`.theta_sketch_agg_union_lgk_seed(sketch, struct<int, int>(10, 111)),
111
) from `$BQ_DATASET`.theta_sketch;
drop table `$BQ_DATASET`.theta_sketch;
# using defaults
# expected 5
select `$BQ_DATASET`.theta_sketch_get_estimate(
`$BQ_DATASET`.theta_sketch_union(
(select `$BQ_DATASET`.theta_sketch_agg_string(str) from unnest(["a", "b", "c"]) as str),
(select `$BQ_DATASET`.theta_sketch_agg_string(str) from unnest(["c", "d", "e"]) as str)
)
);
# full signatures
# expected 5
select `$BQ_DATASET`.theta_sketch_get_estimate_seed(
`$BQ_DATASET`.theta_sketch_union_lgk_seed(
(select `$BQ_DATASET`.theta_sketch_agg_string_lgk_seed_p(str, struct<int, int, float64>(10, 111, 0.999)) from unnest(["a", "b", "c"]) as str),
(select `$BQ_DATASET`.theta_sketch_agg_string_lgk_seed_p(str, struct<int, int, float64>(10, 111, 0.999)) from unnest(["c", "d", "e"]) as str),
10,
111
),
111
);
# using defaults
# expected 1
select `$BQ_DATASET`.theta_sketch_get_estimate(
`$BQ_DATASET`.theta_sketch_intersection(
(select `$BQ_DATASET`.theta_sketch_agg_string(str) from unnest(["a", "b", "c"]) as str),
(select `$BQ_DATASET`.theta_sketch_agg_string(str) from unnest(["c", "d", "e"]) as str)
)
);
# full signatures
# expected 1
select `$BQ_DATASET`.theta_sketch_get_estimate_seed(
`$BQ_DATASET`.theta_sketch_intersection_seed(
(select `$BQ_DATASET`.theta_sketch_agg_string_lgk_seed_p(str, struct<int, int, float64>(10, 111, 0.999)) from unnest(["a", "b", "c"]) as str),
(select `$BQ_DATASET`.theta_sketch_agg_string_lgk_seed_p(str, struct<int, int, float64>(10, 111, 0.999)) from unnest(["c", "d", "e"]) as str),
111
),
111
);
# using defaults
# expected 2
select `$BQ_DATASET`.theta_sketch_get_estimate(
`$BQ_DATASET`.theta_sketch_a_not_b(
(select `$BQ_DATASET`.theta_sketch_agg_string(str) from unnest(["a", "b", "c"]) as str),
(select `$BQ_DATASET`.theta_sketch_agg_string(str) from unnest(["c", "d", "e"]) as str)
)
);
# full signatures
# expected 2
select `$BQ_DATASET`.theta_sketch_get_estimate_seed(
`$BQ_DATASET`.theta_sketch_a_not_b_seed(
(select `$BQ_DATASET`.theta_sketch_agg_string_lgk_seed_p(str, struct<int, int, float64>(10, 111, 0.999)) from unnest(["a", "b", "c"]) as str),
(select `$BQ_DATASET`.theta_sketch_agg_string_lgk_seed_p(str, struct<int, int, float64>(10, 111, 0.999)) from unnest(["c", "d", "e"]) as str),
111
),
111
);
# using defaults
# expected 0.2
select `$BQ_DATASET`.theta_sketch_jaccard_similarity(
(select `$BQ_DATASET`.theta_sketch_agg_string(str) from unnest(["a", "b", "c"]) as str),
(select `$BQ_DATASET`.theta_sketch_agg_string(str) from unnest(["c", "d", "e"]) as str)
);
# using full signatures
# expected 0.2
select `$BQ_DATASET`.theta_sketch_jaccard_similarity_seed(
(select `$BQ_DATASET`.theta_sketch_agg_string_lgk_seed_p(str, struct<int, int, float64>(10, 111, 0.999)) from unnest(["a", "b", "c"]) as str),
(select `$BQ_DATASET`.theta_sketch_agg_string_lgk_seed_p(str, struct<int, int, float64>(10, 111, 0.999)) from unnest(["c", "d", "e"]) as str),
111
);
Description: Tuple sketches extend the functionality of Theta sketches by allowing you to associate a summary value with each item in the set. This enables calculations like the sum, minimum, or maximum of values associated with the distinct items.
For more information: Tuple sketches
Function Name | Function Type | Signature | Description |
---|---|---|---|
tuple_sketch_int64_agg_union | AGGREGATE | (sketch BYTES) -> BYTES | Builds a Tuple Sketch that represents the UNION of the given column of Tuple Sketches. Note that cardinality estimation accuracy, plots, and error tables are the same as the Theta Sketch. This function only applies to Tuple Sketches with an INT64 summary column. Param sketch: the given column of Tuple Sketches with an INT64 summary column. This may not be NULL. Defaults: lg_k = 12, seed = 9001, mode = SUM. Returns: a Compact Tuple Sketch as BYTES. |
tuple_sketch_int64_agg_string | AGGREGATE | (key STRING, value INT64) -> BYTES | Builds a Tuple Sketch from an STRING Key column and an INT64 value column. Multiple values for the same key are aggregated using the default mode. Note that cardinality estimation accuracy, plots, error tables, and sampling probability p are the same as the Theta Sketch. This function only applies to Tuple Sketches with an STRING Key column and an INT64 summary column. Param key: the STRING column of identifiers. This may not be NULL. Param value: the INT64 value column associated with each key. This may not be NULL. Defaults: lg_k = 12, seed = 9001, p = 1.0, mode = SUM. Returns: a Compact Tuple Sketch as BYTES. |
tuple_sketch_int64_agg_int64 | AGGREGATE | (key INT64, value INT64) -> BYTES | Builds a Tuple Sketch from an INT64 Key column and an INT64 value column. Multiple values for the same key are aggregated using the default mode. Note that cardinality estimation accuracy, plots, error tables, and sampling probability p are the same as the Theta Sketch. This function only applies to Tuple Sketches with an INT64 Key column and an INT64 summary column. Param key: the INT64 key column of identifiers. This may not be NULL. Param value: the INT64 value column associated with each key. This may not be NULL. Defaults: lg_k = 12, seed = 9001, p = 1.0, mode = SUM. Returns: a Compact Tuple Sketch as BYTES. |
tuple_sketch_int64_agg_union_lgk_seed_mode | AGGREGATE | (sketch BYTES, params STRUCT<lg_k BYTEINT, seed INT64, mode STRING> NOT AGGREGATE) -> BYTES | Builds a Tuple Sketch that represents the UNION of the given column of Tuple Sketches. Note that cardinality estimation accuracy, plots, and error tables are the same as the Theta Sketch. This function only applies to Tuple Sketches with an INT64 summary column. Param sketch: the given column of Tuple Sketches with an INT64 summary column. This may not be NULL. Param lg_k: the sketch accuracy/size parameter as an integer in the range [4, 26]. A NULL specifies the default lg_k of 12. Param seed: the seed to be used by the underlying hash function. A NULL specifies the default seed of 9001. Param mode: aggregation mode for the summary field: one of { SUM, MIN, MAX, ONE (constant 1) }. A NULL specifies the default = SUM. Returns: a Compact Tuple Sketch as BYTES. |
tuple_sketch_int64_agg_int64_lgk_seed_p_mode | AGGREGATE | (key INT64, value INT64, params STRUCT<lg_k BYTEINT, seed INT64, p FLOAT64, mode STRING> NOT AGGREGATE) -> BYTES | Builds a Tuple Sketch from an INT64 Key column and an INT64 value column. Multiple values for the same key are aggregated using one of the selectable operations: { SUM, MIN, MAX, ONE (constant 1) }. Note that cardinality estimation accuracy, plots, error tables, and sampling probability p are the same as the Theta Sketch. This function only applies to Tuple Sketches with an INT64 Key column and an INT64 summary column. Param key: the INT64 key column of identifiers. This may not be NULL. Param value: the INT64 value column associated with each key. This may not be NULL. Param lg_k: the sketch accuracy/size parameter as an integer in the range [4, 26]. A NULL specifies the default lg_k of 12. Param seed: the seed to be used by the underlying hash function. A NULL specifies the default seed of 9001. Param p: up-front sampling probability. A NULL specifies the default of 1.0. Param mode: aggregation mode for the summary field: one of { SUM, MIN, MAX, ONE (constant 1) }. A NULL specifies the default = SUM. Returns: a Compact Tuple Sketch as BYTES. |
tuple_sketch_int64_agg_string_lgk_seed_p_mode | AGGREGATE | (key STRING, value INT64, params STRUCT<lg_k BYTEINT, seed INT64, p FLOAT64, mode STRING> NOT AGGREGATE) -> BYTES | Builds a Tuple Sketch from an STRING Key column and an INT64 value column. Multiple values for the same key are aggregated using one of the selectable operations: SUM, MIN, MAX, ONE. Note that cardinality estimation accuracy, plots, error tables, and sampling probability p are the same as the Theta Sketch. This function only applies to Tuple Sketches with an STRING Key column and an INT64 summary column. Param key: the STRING key column of identifiers. This may not be NULL. Param value: the INT64 value column associated with each key. This may not be NULL. Param lg_k: the sketch accuracy/size parameter as an integer in the range [4, 26]. A NULL specifies the default lg_k of 12. Param seed: the seed to be used by the underlying hash function. A NULL specifies the default seed of 9001. Param p: up-front sampling probability. A NULL specifies the default of 1.0. Param mode: aggregation mode for the summary field: one of { SUM, MIN, MAX, ONE (constant 1) }. A NULL specifies the default = SUM. Returns: a Compact Tuple Sketch as BYTES. |
tuple_sketch_int64_to_string | SCALAR | (sketch BYTES) -> STRING | Returns a human readable STRING that is a short summary of the state of this sketch. Note that cardinality estimation accuracy, plots, and error tables are the same as the Theta Sketch. This function only applies to Tuple Sketches with an INT64 summary column. Param sketch: the sketch to be summarized. This may not be NULL. Defaults: seed = 9001. Returns: A human readable STRING that is a short summary of the state of this sketch. |
tuple_sketch_int64_get_estimate | SCALAR | (sketch BYTES) -> FLOAT64 | Returns the cardinality estimate of the given Tuple Sketch. Note that cardinality estimation accuracy, plots, and error tables are the same as the Theta Sketch. This function only applies to Tuple Sketches with an INT64 summary column. Param sketch: the given Tuple Sketch. This may not be NULL. Defaults: seed = 9001. Returns: the cardinality estimate of the given Tuple Sketch |
tuple_sketch_int64_get_theta | SCALAR | (sketch BYTES) -> FLOAT64 | Returns theta (effective sampling rate) as a fraction from 0 to 1. Note that cardinality estimation accuracy, plots, and error tables are the same as the Theta Sketch. This function only applies to Tuple Sketches with an INT64 summary column. Param sketch: the given Tuple Sketch. This may not be NULL. Defaults: seed = 9001. Returns: theta as FLOAT64. |
tuple_sketch_int64_get_num_retained | SCALAR | (sketch BYTES) -> INT | Returns the number of retained entries in the given sketch. Note that cardinality estimation accuracy, plots, and error tables are the same as the Theta Sketch. This function only applies to Tuple Sketches with an INT64 summary column. Param sketch: the given Tuple Sketch. This may not be NULL. Defaults: seed = 9001. Returns: number of retained entries as INT. |
tuple_sketch_int64_get_theta_seed | SCALAR | (sketch BYTES, seed INT64) -> FLOAT64 | Returns theta (effective sampling rate) as a fraction from 0 to 1. Note that cardinality estimation accuracy, plots, and error tables are the same as the Theta Sketch. This function only applies to Tuple Sketches with an INT64 summary column. Param sketch: the given Tuple Sketch. This may not be NULL. Param seed: This is used to confirm that the given sketches were configured with the correct seed. A NULL specifies the default seed = 9001. Returns: theta as FLOAT64. |
tuple_sketch_int64_get_num_retained_seed | SCALAR | (sketch BYTES, seed INT64) -> INT | Returns the number of retained entries in the given sketch. Note that cardinality estimation accuracy, plots, and error tables are the same as the Theta Sketch. This function only applies to Tuple Sketches with an INT64 summary column. Param sketch: the given Tuple Sketch. This may not be NULL. Param seed: This is used to confirm that the given sketches were configured with the correct seed. A NULL specifies the default seed = 9001. Returns: number of retained entries as INT. |
tuple_sketch_int64_to_string_seed | SCALAR | (sketch BYTES, seed INT64) -> STRING | Returns a human readable STRING that is a short summary of the state of this sketch. Note that cardinality estimation accuracy, plots, and error tables are the same as the Theta Sketch. This function only applies to Tuple Sketches with an INT64 summary column. Param sketch: the sketch to be summarized. This may not be NULL. Param seed: This is used to confirm that the given sketches were configured with the correct seed. A NULL specifies the default seed = 9001. Returns: A human readable STRING that is a short summary of the state of this sketch. |
tuple_sketch_int64_a_not_b | SCALAR | (sketchA BYTES, sketchB BYTES) -> BYTES | Computes a sketch that represents the set difference of sketchA and not sketchB. Note that cardinality estimation accuracy, plots, and error tables are the same as the Theta Sketch. This function only applies to Tuple Sketches with an INT64 summary column. Param sketchA: the first sketch "A" as BYTES. This may not be NULL. Param sketchB: the second sketch "B" as BYTES. This may not be NULL. Defaults: seed = 9001. Returns: a Compact Tuple Sketch as BYTES. |
tuple_sketch_int64_from_theta_sketch | SCALAR | (sketch BYTES, value INT64) -> BYTES | Converts the given Theta Sketch into a Tuple Sketch with a INT64 summary column set to the given INT64 value. Note that cardinality estimation accuracy, plots, and error tables are the same as the Theta Sketch. Param sketch: the given Theta Sketch. This may not be NULL. Param value: the given INT64 value. This may not be NULL. Defaults: seed = 9001. Returns: a Tuple Sketch with an INT64 summary column as BYTES. |
tuple_sketch_int64_get_estimate_seed | SCALAR | (sketch BYTES, seed INT64) -> FLOAT64 | Returns the cardinality estimate of the given Tuple Sketch. Note that cardinality estimation accuracy, plots, and error tables are the same as the Theta Sketch. This function only applies to Tuple Sketches with an INT64 summary column. Param sketch: the given Tuple Sketch. This may not be NULL. Param seed: This is used to confirm that the given sketches were configured with the correct seed. A NULL specifies the default seed = 9001. Returns: the cardinality estimate of the given Tuple Sketch |
tuple_sketch_int64_intersection | SCALAR | (sketchA BYTES, sketchB BYTES) -> BYTES | Computes a sketch that represents the scalar intersection of sketchA and sketchB. Note that cardinality estimation accuracy, plots, and error tables are the same as the Theta Sketch. This function only applies to Tuple Sketches with an INT64 summary column. Param sketchA: the first sketch "A" as BYTES. Param sketchB: the second sketch "B" as BYTES. Defaults: seed = 9001. Returns: a Compact Tuple Sketch as BYTES. |
tuple_sketch_int64_union | SCALAR | (sketchA BYTES, sketchB BYTES) -> BYTES | Computes a Tuple Sketch that represents the UNION of sketchA and sketchB. Note that cardinality estimation accuracy, plots, and error tables are the same as the Theta Sketch. This function only applies to Tuple Sketches with an INT64 summary column. Param sketchA: the first sketch "A" as BYTES. This may not be NULL. Param sketchB: the second sketch "B" as BYTES. This may not be NULL. Defaults: seed = 9001. Returns: a Compact Tuple Sketch as BYTES. |
tuple_sketch_int64_from_theta_sketch_seed | SCALAR | (sketch BYTES, value INT64, seed INT64) -> BYTES | Converts the given Theta Sketch into a Tuple Sketch with a INT64 summary column set to the given INT64 value. Note that cardinality estimation accuracy, plots, and error tables are the same as the Theta Sketch. Param sketch: the given Theta Sketch. This may not be NULL. Param value: the given INT64 value. This may not be NULL. Param seed: This is used to confirm that the given sketches were configured with the correct seed. A NULL specifies the default seed = 9001. Returns: a Tuple Sketch with an INT64 summary column as BYTES. |
tuple_sketch_int64_a_not_b_seed | SCALAR | (sketchA BYTES, sketchB BYTES, seed INT64) -> BYTES | Computes a sketch that represents the scalar set difference of sketchA and not sketchB. Note that cardinality estimation accuracy, plots, and error tables are the same as the Theta Sketch. This function only applies to Tuple Sketches with an INT64 summary column. Param sketchA: the first sketch "A" as BYTES. This may not be NULL. Param sketchB: the second sketch "B" as BYTES. This may not be NULL. Param seed: This is used to confirm that the given sketches were configured with the correct seed. A NULL specifies the default seed = 9001. Returns: a Compact Tuple Sketch as BYTES. |
tuple_sketch_int64_filter_low_high | SCALAR | (sketch BYTES, low INT64, high INT64) -> BYTES | Returns a Tuple Sketch computed from the given sketch filtered by the given low and high values. This returns a compact tuple sketch that contains the subset of rows of the give sketch where the summary column is greater-than or equal to the given low and less-than or equal to the given high. Note that cardinality estimation accuracy, plots, and error tables are the same as the Theta Sketch. This function only applies to Tuple Sketches with an INT64 summary column. Param sketch: the given Tuple Sketch. This may not be NULL. Param low: the given low INT64. This may not be NULL. Param high: the given high INT64. This may not be NULL. Defaults: seed = 9001. Returns: a Compact Tuple Sketch as BYTES. |
tuple_sketch_int64_get_estimate_and_bounds | SCALAR | (sketch BYTES, num_std_devs BYTEINT) -> STRUCT<estimate FLOAT64, lower_bound FLOAT64, upper_bound FLOAT64> | Returns the cardinality estimate and bounds from the given Tuple Sketch. Note that cardinality estimation accuracy, plots, and error tables are the same as the Theta Sketch. This function only applies to Tuple Sketches with an INT64 summary column. Param sketch: the given Tuple Sketch. This may not be NULL. Param num_std_devs: The returned bounds will be based on the statistical confidence interval determined by the given number of standard deviations from the returned estimate. This number may be one of {1,2,3}, where 1 represents 68% confidence, 2 represents 95% confidence and 3 represents 99.7% confidence. For example, if the given num_std_devs = 2 and the returned values are {1000, 990, 1010} that means that with 95% confidence, the true value lies within the range [990, 1010]. Defaults: seed = 9001. Returns: a STRUCT with three FLOAT64 values as {estimate, lower_bound, upper_bound}. |
tuple_sketch_int64_filter_low_high_seed | SCALAR | (sketch BYTES, low INT64, high INT64, seed INT64) -> BYTES | Returns a Tuple Sketch computed from the given sketch filtered by the given low and high values. This returns a compact tuple sketch that contains the subset of rows of the give sketch where the summary column is greater-than or equal to the given low and less-than or equal to the given high. Note that cardinality estimation accuracy, plots, and error tables are the same as the Theta Sketch. This function only applies to Tuple Sketches with an INT64 summary column. Param sketch: the given Tuple Sketch. This may not be NULL. Param low: the given low INT64. This may not be NULL. Param high: the given high INT64. This may not be NULL. Param seed: This is used to confirm that the given sketches were configured with the correct seed. A NULL specifies the default seed = 9001. Returns: a Compact Tuple Sketch as BYTES. |
tuple_sketch_int64_jaccard_similarity | SCALAR | (sketchA BYTES, sketchB BYTES) -> STRUCT<lower_bound FLOAT64, estimate FLOAT64, upper_bound FLOAT64> | Computes the Jaccard similarity index with upper and lower bounds. The Jaccard similarity index J(A,B) = (A ^ B)/(A U B) is used to measure how similar the two sketches are to each other. If J = 1.0, the sketches are considered equal. If J = 0, the two sketches are disjoint. A Jaccard of .95 means the overlap between the two sets is 95% of the union of the two sets. This function only applies to Tuple Sketches with an INT64 summary column. Param sketchA: the first sketch as bytes. This may not be NULL. Param sketchB: the second sketch as bytes. This may not be NULL. Defaults: seed = 9001. Returns: a STRUCT with three FLOAT64 values {lower_bound, estimate, upper_bound} of the Jaccard index. |
tuple_sketch_int64_get_sum_estimate_and_bounds | SCALAR | (sketch BYTES, num_std_devs BYTEINT) -> STRUCT<sum_estimate FLOAT64, sum_lower_bound FLOAT64, sum_upper_bound FLOAT64> | Returns the estimate and bounds for the sum of the INT64 summary column scaled to the original population from the given Tuple Sketch. Note that cardinality estimation accuracy, plots, and error tables are the same as the Theta Sketch. This function only applies to Tuple Sketches with an INT64 summary column. Param sketch: the given Tuple Sketch. This may not be NULL. Param num_std_devs: The returned bounds will be based on the statistical confidence interval determined by the given number of standard deviations from the returned estimate. This number may be one of {1,2,3}, where 1 represents 68% confidence, 2 represents 95% confidence and 3 represents 99.7% confidence. For example, if the given num_std_devs = 2 and the returned values are {1000, 990, 1010} that means that with 95% confidence, the true value lies within the range [990, 1010]. Defaults: seed = 9001. Returns: a STRUCT with three FLOAT64 values as {sum_estimate, sum_lower_bound, sum_upper_bound}. |
tuple_sketch_int64_intersection_seed_mode | SCALAR | (sketchA BYTES, sketchB BYTES, seed INT64, mode STRING) -> BYTES | Computes a sketch that represents the scalar intersection of sketchA and sketchB. Note that cardinality estimation accuracy, plots, and error tables are the same as the Theta Sketch. This function only applies to Tuple Sketches with an INT64 summary column. Param sketchA: the first sketch "A" as BYTES. Param sketchB: the second sketch "B" as BYTES. Param seed: This is used to confirm that the given sketches were configured with the correct seed. A NULL specifies the default seed = 9001. Returns: a Compact Tuple Sketch as BYTES. |
tuple_sketch_int64_get_sum_estimate_and_bounds_seed | SCALAR | (sketch BYTES, num_std_devs BYTEINT, seed INT64) -> STRUCT<sum_estimate FLOAT64, sum_lower_bound FLOAT64, sum_upper_bound FLOAT64> | Returns the estimate and bounds for the sum of the INT64 summary column scaled to the original population from the given Tuple Sketch. Note that cardinality estimation accuracy, plots, and error tables are the same as the Theta Sketch. This function only applies to Tuple Sketches with an INT64 summary column. Param sketch: the given Tuple Sketch. This may not be NULL. Param num_std_devs: The returned bounds will be based on the statistical confidence interval determined by the given number of standard deviations from the returned estimate. This number may be one of {1,2,3}, where 1 represents 68% confidence, 2 represents 95% confidence and 3 represents 99.7% confidence. For example, if the given num_std_devs = 2 and the returned values are {1000, 990, 1010} that means that with 95% confidence, the true value lies within the range [990, 1010]. Param seed: This is used to confirm that the given sketches were configured with the correct seed. A NULL specifies the default seed = 9001. Returns: a STRUCT with three FLOAT64 values as {sum_estimate, sum_lower_bound, sum_upper_bound}. |
tuple_sketch_int64_union_lgk_seed_mode | SCALAR | (sketchA BYTES, sketchB BYTES, lg_k BYTEINT, seed INT64, mode STRING) -> BYTES | Computes a Tuple Sketch that represents the UNION of sketchA and sketchB. Note that cardinality estimation accuracy, plots, and error tables are the same as the Theta Sketch. This function only applies to Tuple Sketches with an INT64 summary column. Param sketchA: the first sketch "A" as BYTES. This may not be NULL. Param sketchB: the second sketch "B" as BYTES. This may not be NULL. Param seed: This is used to confirm that the given sketches were configured with the correct seed. A NULL specifies the default seed = 9001. Returns: a Compact Tuple Sketch as BYTES. |
tuple_sketch_int64_get_estimate_and_bounds_seed | SCALAR | (sketch BYTES, num_std_devs BYTEINT, seed INT64) -> STRUCT<estimate FLOAT64, lower_bound FLOAT64, upper_bound FLOAT64> | Returns the cardinality estimate and bounds from the given Tuple Sketch. Note that cardinality estimation accuracy, plots, and error tables are the same as the Theta Sketch. This function only applies to Tuple Sketches with an INT64 summary column. Param sketch: the given Tuple Sketch. This may not be NULL. Param num_std_devs: The returned bounds will be based on the statistical confidence interval determined by the given number of standard deviations from the returned estimate. This number may be one of {1,2,3}, where 1 represents 68% confidence, 2 represents 95% confidence and 3 represents 99.7% confidence. For example, if the given num_std_devs = 2 and the returned values are {1000, 990, 1010} that means that with 95% confidence, the true value lies within the range [990, 1010]. Param seed: This is used to confirm that the given sketches were configured with the correct seed. A NULL specifies the default seed = 9001. Returns: a STRUCT with three FLOAT64 values as {estimate, lower_bound, upper_bound}. |
tuple_sketch_int64_jaccard_similarity_seed | SCALAR | (sketchA BYTES, sketchB BYTES, seed INT64) -> STRUCT<lower_bound FLOAT64, estimate FLOAT64, upper_bound FLOAT64> | Computes the Jaccard similarity index with upper and lower bounds. The Jaccard similarity index J(A,B) = (A ^ B)/(A U B) is used to measure how similar the two sketches are to each other. If J = 1.0, the sketches are considered equal. If J = 0, the two sketches are disjoint. A Jaccard of .95 means the overlap between the two sets is 95% of the union of the two sets. This function only applies to Tuple Sketches with an INT64 summary column. Param sketchA: the first sketch as bytes. This may not be NULL. Param sketchB: the second sketch as bytes. This may not be NULL. Param seed: This is used to confirm that the given sketches were configured with the correct seed. A NULL specifies the default seed = 9001. Returns: a STRUCT with three FLOAT64 values {lower_bound, estimate, upper_bound} of the Jaccard index. |
Examples:
# using defaults
create or replace table `$BQ_DATASET`.tuple_sketch(sketch bytes);
insert into `$BQ_DATASET`.tuple_sketch
(select `$BQ_DATASET`.tuple_sketch_int64_from_theta_sketch(`$BQ_DATASET`.theta_sketch_agg_string(cast(value as string)), 1) from unnest(GENERATE_ARRAY(1, 10000, 1)) as value);
insert into `$BQ_DATASET`.tuple_sketch
(select `$BQ_DATASET`.tuple_sketch_int64_from_theta_sketch(`$BQ_DATASET`.theta_sketch_agg_string(cast(value as string)), 1) from unnest(GENERATE_ARRAY(100000, 110000, 1)) as value);
# expected about 20000
select `$BQ_DATASET`.tuple_sketch_int64_get_estimate(
`$BQ_DATASET`.tuple_sketch_int64_agg_union(sketch)
) from `$BQ_DATASET`.tuple_sketch;
select `$BQ_DATASET`.tuple_sketch_int64_get_estimate_and_bounds(
`$BQ_DATASET`.tuple_sketch_int64_agg_union(sketch),
2
) from `$BQ_DATASET`.tuple_sketch;
select `$BQ_DATASET`.tuple_sketch_int64_get_sum_estimate_and_bounds(
`$BQ_DATASET`.tuple_sketch_int64_agg_union(sketch),
2
) from `$BQ_DATASET`.tuple_sketch;
# expected estimate about 20000
select `$BQ_DATASET`.tuple_sketch_int64_to_string(
`$BQ_DATASET`.tuple_sketch_int64_agg_union(sketch)
) from `$BQ_DATASET`.tuple_sketch;
select `$BQ_DATASET`.tuple_sketch_int64_get_theta(
`$BQ_DATASET`.tuple_sketch_int64_agg_union(sketch)
) from `$BQ_DATASET`.tuple_sketch;
select `$BQ_DATASET`.tuple_sketch_int64_get_num_retained(
`$BQ_DATASET`.tuple_sketch_int64_agg_union(sketch)
) from `$BQ_DATASET`.tuple_sketch;
drop table `$BQ_DATASET`.tuple_sketch;
# using full signatures
create or replace table `$BQ_DATASET`.tuple_sketch(sketch bytes);
insert into `$BQ_DATASET`.tuple_sketch
(select `$BQ_DATASET`.tuple_sketch_int64_from_theta_sketch_seed(
`$BQ_DATASET`.theta_sketch_agg_string_lgk_seed_p(cast(value as string), STRUCT<BYTEINT, INT64, FLOAT64>(10, 111, 0.999)),
1,
111
) from unnest(GENERATE_ARRAY(1, 10000, 1)) as value);
insert into `$BQ_DATASET`.tuple_sketch
(select `$BQ_DATASET`.tuple_sketch_int64_from_theta_sketch_seed(
`$BQ_DATASET`.theta_sketch_agg_string_lgk_seed_p(cast(value as string), STRUCT<BYTEINT, INT64, FLOAT64>(10, 111, 0.999)),
1,
111
) from unnest(GENERATE_ARRAY(100000, 110000, 1)) as value);
# expected about 20000
select `$BQ_DATASET`.tuple_sketch_int64_get_estimate_seed(
`$BQ_DATASET`.tuple_sketch_int64_agg_union_lgk_seed_mode(sketch, STRUCT<BYTEINT, INT64, STRING>(10, 111, "NOP")),
111
) from `$BQ_DATASET`.tuple_sketch;
select `$BQ_DATASET`.tuple_sketch_int64_get_estimate_and_bounds_seed(
`$BQ_DATASET`.tuple_sketch_int64_agg_union_lgk_seed_mode(sketch, STRUCT<BYTEINT, INT64, STRING>(10, 111, "NOP")),
2,
111
) from `$BQ_DATASET`.tuple_sketch;
select `$BQ_DATASET`.tuple_sketch_int64_get_sum_estimate_and_bounds_seed(
`$BQ_DATASET`.tuple_sketch_int64_agg_union_lgk_seed_mode(sketch, STRUCT<BYTEINT, INT64, STRING>(10, 111, "NOP")),
2,
111
) from `$BQ_DATASET`.tuple_sketch;
# expected estimate about 20000
select `$BQ_DATASET`.tuple_sketch_int64_to_string_seed(
`$BQ_DATASET`.tuple_sketch_int64_agg_union_lgk_seed_mode(sketch, STRUCT<BYTEINT, INT64, STRING>(10, 111, "NOP")),
111
) from `$BQ_DATASET`.tuple_sketch;
select `$BQ_DATASET`.tuple_sketch_int64_get_theta_seed(
`$BQ_DATASET`.tuple_sketch_int64_agg_union_lgk_seed_mode(sketch, STRUCT<BYTEINT, INT64, STRING>(10, 111, "NOP")),
111
) from `$BQ_DATASET`.tuple_sketch;
select `$BQ_DATASET`.tuple_sketch_int64_get_num_retained_seed(
`$BQ_DATASET`.tuple_sketch_int64_agg_union_lgk_seed_mode(sketch, STRUCT<BYTEINT, INT64, STRING>(10, 111, "NOP")),
111
) from `$BQ_DATASET`.tuple_sketch;
drop table `$BQ_DATASET`.tuple_sketch;
# using defaults
# expected 5
select `$BQ_DATASET`.tuple_sketch_int64_get_estimate(
`$BQ_DATASET`.tuple_sketch_int64_union(
(select `$BQ_DATASET`.tuple_sketch_int64_agg_int64(key, 1) from unnest([1, 2, 3]) as key),
(select `$BQ_DATASET`.tuple_sketch_int64_agg_int64(key, 1) from unnest([3, 4, 5]) as key)
)
);
# using full signatures
# expected 5
select `$BQ_DATASET`.tuple_sketch_int64_get_estimate_seed(
`$BQ_DATASET`.tuple_sketch_int64_union_lgk_seed_mode(
(select `$BQ_DATASET`.tuple_sketch_int64_agg_int64_lgk_seed_p_mode(key, 1, STRUCT<BYTEINT, INT64, FLOAT64, STRING>(10, 111, 0.999, "MIN")) from unnest([1, 2, 3]) as key),
(select `$BQ_DATASET`.tuple_sketch_int64_agg_int64_lgk_seed_p_mode(key, 1, STRUCT<BYTEINT, INT64, FLOAT64, STRING>(10, 111, 0.999, "MIN")) from unnest([3, 4, 5]) as key),
10,
111,
"MIN"
),
111
);
# using defaults
# expected 1
select `$BQ_DATASET`.tuple_sketch_int64_get_estimate(
`$BQ_DATASET`.tuple_sketch_int64_intersection(
(select `$BQ_DATASET`.tuple_sketch_int64_agg_string(str, 1) from unnest(["a", "b", "c"]) as str),
(select `$BQ_DATASET`.tuple_sketch_int64_agg_string(str, 1) from unnest(["c", "d", "e"]) as str)
)
);
# using full signatures
# expected 1
select `$BQ_DATASET`.tuple_sketch_int64_get_estimate_seed(
`$BQ_DATASET`.tuple_sketch_int64_intersection_seed_mode(
(select `$BQ_DATASET`.tuple_sketch_int64_agg_string_lgk_seed_p_mode(str, 1, STRUCT<BYTEINT, INT64, FLOAT64, STRING>(10, 111, 0.999, "MIN")) from unnest(["a", "b", "c"]) as str),
(select `$BQ_DATASET`.tuple_sketch_int64_agg_string_lgk_seed_p_mode(str, 1, STRUCT<BYTEINT, INT64, FLOAT64, STRING>(10, 111, 0.999, "MIN")) from unnest(["c", "d", "e"]) as str),
111,
"MIN"
),
111
);
# using defaults
# expected 2
select `$BQ_DATASET`.tuple_sketch_int64_get_estimate(
`$BQ_DATASET`.tuple_sketch_int64_a_not_b(
(select `$BQ_DATASET`.tuple_sketch_int64_agg_string(str, 1) from unnest(["a", "b", "c"]) as str),
(select `$BQ_DATASET`.tuple_sketch_int64_agg_string(str, 1) from unnest(["c", "d", "e"]) as str)
)
);
# using full signatures
# expected 2
select `$BQ_DATASET`.tuple_sketch_int64_get_estimate_seed(
`$BQ_DATASET`.tuple_sketch_int64_a_not_b_seed(
(select `$BQ_DATASET`.tuple_sketch_int64_agg_string_lgk_seed_p_mode(str, 1, STRUCT<BYTEINT, INT64, FLOAT64, STRING>(10, 111, 0.999, "MIN")) from unnest(["a", "b", "c"]) as str),
(select `$BQ_DATASET`.tuple_sketch_int64_agg_string_lgk_seed_p_mode(str, 1, STRUCT<BYTEINT, INT64, FLOAT64, STRING>(10, 111, 0.999, "MIN")) from unnest(["c", "d", "e"]) as str),
111
),
111
);
# using defaults
# expected 0.2
select `$BQ_DATASET`.tuple_sketch_int64_jaccard_similarity(
(select `$BQ_DATASET`.tuple_sketch_int64_agg_string(str, 1) from unnest(["a", "b", "c"]) as str),
(select `$BQ_DATASET`.tuple_sketch_int64_agg_string(str, 1) from unnest(["c", "d", "e"]) as str)
);
# using full signatures
# expected 0.2
select `$BQ_DATASET`.tuple_sketch_int64_jaccard_similarity_seed(
(select `$BQ_DATASET`.tuple_sketch_int64_agg_string_lgk_seed_p_mode(str, 1, STRUCT<BYTEINT, INT64, FLOAT64, STRING>(10, 111, 0.999, "NOP")) from unnest(["a", "b", "c"]) as str),
(select `$BQ_DATASET`.tuple_sketch_int64_agg_string_lgk_seed_p_mode(str, 1, STRUCT<BYTEINT, INT64, FLOAT64, STRING>(10, 111, 0.999, "NOP")) from unnest(["c", "d", "e"]) as str),
111
);
# using defaults
# expected 1 entry
select `$BQ_DATASET`.tuple_sketch_int64_to_string(
`$BQ_DATASET`.tuple_sketch_int64_filter_low_high(
`$BQ_DATASET`.tuple_sketch_int64_agg_string(key, 1),
2,
2
)
) from unnest(["a", "b", "c", "c"]) as key;
# using full signatures
# expected 1 entry
select `$BQ_DATASET`.tuple_sketch_int64_to_string_seed(
`$BQ_DATASET`.tuple_sketch_int64_filter_low_high_seed(
`$BQ_DATASET`.tuple_sketch_int64_agg_string_lgk_seed_p_mode(key, 1, STRUCT<BYTEINT, INT64, FLOAT64, STRING>(10, 111, 0.999, "SUM")),
2,
2,
111
),
111
) from unnest(["a", "b", "c", "c"]) as key;
Description: Similar to KLL sketch, estimates distributions of numeric values, provides approximate quantiles and ranks.
For more information: t-digest
Function Name | Function Type | Signature | Description |
---|---|---|---|
tdigest_double_build | AGGREGATE | (value FLOAT64) -> BYTES | Creates a sketch that represents the distribution of the given column. Param value: the column of FLOAT64 values. Defaults: k = 200. Returns: a t-Digest, as bytes. |
tdigest_double_merge | AGGREGATE | (sketch BYTES) -> BYTES | Merges sketches from the given column. Param sketch: the column of values. Defaults: k = 200. Returns: a serialized t-Digest as BYTES. |
tdigest_double_merge_k | AGGREGATE | (sketch BYTES, k INT NOT AGGREGATE) -> BYTES | Merges sketches from the given column. Param sketch: the column of values. Param k: the sketch accuracy/size parameter as an integer in the range [10, 65535]. Returns: a serialized t-Digest as BYTES. |
tdigest_double_build_k | AGGREGATE | (value FLOAT64, k INT NOT AGGREGATE) -> BYTES | Creates a sketch that represents the distribution of the given column. Param value: the column of FLOAT64 values. Param k: the sketch accuracy/size parameter as an INT in the range [10, 65535]. Returns: a t-Digest, as bytes. |
tdigest_double_get_max_value | SCALAR | (sketch BYTES) -> FLOAT64 | Returns the maximum value of the input stream. Param sketch: the given sketch as BYTES. Returns: max value as FLOAT64 |
tdigest_double_to_string | SCALAR | (sketch BYTES) -> STRING | Returns a summary string that represents the state of the given sketch. Param sketch: the given sketch as sketch encoded bytes. Returns: a string that represents the state of the given sketch. |
tdigest_double_get_total_weight | SCALAR | (sketch BYTES) -> INT64 | Returns the total weight of the input stream. Param sketch: the given sketch as BYTES. Returns: total weight as INT64 |
tdigest_double_get_min_value | SCALAR | (sketch BYTES) -> FLOAT64 | Returns the minimum value of the input stream. Param sketch: the given sketch as BYTES. Returns: min value as FLOAT64 |
tdigest_double_get_rank | SCALAR | (sketch BYTES, value FLOAT64) -> FLOAT64 | Returns an approximation to the normalized rank, on the interval [0.0, 1.0], of the given value. Param sketch: the given sketch in serialized form. Param value: value to be ranked. Returns: an approximate rank of the given value. |
tdigest_double_get_quantile | SCALAR | (sketch BYTES, rank FLOAT64) -> FLOAT64 | Returns a value from the sketch that is the best approximation to a value from the original stream with the given rank. Param sketch: the given sketch in serialized form. Param rank: rank of a value in the hypothetical sorted stream. Returns: an approximate quantile associated with the given rank. |
Examples:
create or replace table `$BQ_DATASET`.tdigest_double(sketch bytes);
# using default
insert into `$BQ_DATASET`.tdigest_double
(select `$BQ_DATASET`.tdigest_double_build(value) from unnest([1,2,3,4,5,6,7,8,9,10]) as value);
# using full signature
insert into `$BQ_DATASET`.tdigest_double
(select `$BQ_DATASET`.tdigest_double_build_k(value, 100) from unnest([11,12,13,14,15,16,17,18,19,20]) as value);
select `$BQ_DATASET`.tdigest_double_to_string(sketch) from `$BQ_DATASET`.tdigest_double;
# using default
select `$BQ_DATASET`.tdigest_double_to_string(`$BQ_DATASET`.tdigest_double_merge(sketch)) from `$BQ_DATASET`.tdigest_double;
# using full signature
select `$BQ_DATASET`.tdigest_double_to_string(`$BQ_DATASET`.tdigest_double_merge_k(sketch, 100)) from `$BQ_DATASET`.tdigest_double;
# expected 0.5
select `$BQ_DATASET`.tdigest_double_get_rank(`$BQ_DATASET`.tdigest_double_merge(sketch), 10) from `$BQ_DATASET`.tdigest_double;
# expected 10
select `$BQ_DATASET`.tdigest_double_get_quantile(`$BQ_DATASET`.tdigest_double_merge(sketch), 0.5) from `$BQ_DATASET`.tdigest_double;
# expected 20
select `$BQ_DATASET`.tdigest_double_get_total_weight(`$BQ_DATASET`.tdigest_double_merge(sketch)) from `$BQ_DATASET`.tdigest_double;
# expected 1
select `$BQ_DATASET`.tdigest_double_get_min_value(`$BQ_DATASET`.tdigest_double_merge(sketch)) from `$BQ_DATASET`.tdigest_double;
# expected 20
select `$BQ_DATASET`.tdigest_double_get_max_value(`$BQ_DATASET`.tdigest_double_merge(sketch)) from `$BQ_DATASET`.tdigest_double;
drop table `$BQ_DATASET`.tdigest_double;