[FEATURE]PPL Support approximation count for improved performance #882

YANG-DB · 2024-11-08T18:19:52Z

Is your feature request related to a problem?
In large-volume and/or high-cardinality datasets, counting is an expensive computation.
Spark has a list of built-in approximation functions, such as:

approx_count_distinct which returns the estimated number of distinct values in expr within the group.
```
SELECT approx_count_distinct(column_name) FROM table_name;
```
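To illustrate why an approximate distinct count is cheaper than an exact one, here is a minimal sketch of a basic HyperLogLog estimator. Spark documents `approx_count_distinct` as using HyperLogLog++; the code below is the plain HyperLogLog variant with a simple small-range correction, not Spark's implementation. The function name `hll_estimate` and the precision parameter `p` are illustrative assumptions.

```python
import hashlib
import math


def hll_estimate(values, p=12):
    """Estimate the number of distinct items with a basic HyperLogLog sketch.

    Illustrative only: `hll_estimate` and `p` are hypothetical names,
    not part of Spark or PPL.
    """
    m = 1 << p                      # number of registers (2^p)
    registers = [0] * m
    for v in values:
        # Derive a 64-bit hash from the value's string form.
        digest = hashlib.sha256(str(v).encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - p)                  # top p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)     # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in `rest`.
        rank = (64 - p) - rest.bit_length() + 1
        registers[idx] = max(registers[idx], rank)

    alpha = 0.7213 / (1 + 1.079 / m)         # bias constant, valid for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)

    # Small-range correction: fall back to linear counting.
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        return m * math.log(m / zeros)
    return raw


# 100,000 rows, 20,000 distinct values.
rows = [i % 20000 for i in range(100000)]
print(round(hll_estimate(rows)))
```

The sketch keeps only `2^p` small registers no matter how many rows it scans, whereas an exact distinct count has to remember every distinct value it has seen. With `p=12` the expected relative error is roughly 1.6%.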

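approx_percentile, which returns the approximate percentile of the numeric column col: the smallest value in the ordered col values (sorted from least to greatest) such that no more than the given percentage of col values is less than or equal to that value.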

    SELECT
        col_name,
        approx_percentile(col_name, 0.5) AS median_value,
        approx_percentile(col_name, array(0.25, 0.5, 0.75)) AS quartiles
    FROM
        table_name
    GROUP BY
        col_name

What solution would you like?
We would like to add these approximation capabilities to every PPL function that can offer counting:

Top & Rare commands

... | top_approx 5 values by country 
... | rare_approx age by gender

Stats

... | stats count_distinct_approx(c) by b | head 5

Do you have any additional context?

https://spark.apache.org/docs/3.5.2/sql-ref-functions-builtin.html


YANG-DB added enhancement New feature or request untriaged Lang:PPL Pipe Processing Language support 0.6 labels Nov 8, 2024

YANG-DB self-assigned this Nov 8, 2024

YANG-DB removed the untriaged label Nov 8, 2024

YANG-DB mentioned this issue Nov 9, 2024

Ppl count approximate support #884

Merged

5 tasks

YANG-DB closed this as completed Nov 12, 2024
