[FEATURE]PPL Support approximation count for improved performance #882

Closed
YANG-DB opened this issue Nov 8, 2024 · 0 comments
Assignees
Labels
0.6, enhancement (New feature or request), Lang:PPL (Pipe Processing Language support)

Comments


YANG-DB commented Nov 8, 2024

Is your feature request related to a problem?
On large-volume or high-cardinality datasets, exact counting is an expensive computation.
Spark provides several built-in approximation functions, such as:

  • approx_count_distinct which returns the estimated number of distinct values in expr within the group.

    •   SELECT approx_count_distinct(column_name) FROM table_name;
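  • approx_percentile, which returns the approximate percentile of the numeric column col: the smallest value in the ordered col values (sorted from least to greatest) such that no more than the given percentage of col values is less than or equal to that value.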

    •   SELECT
            col_name,
            approx_percentile(col_name, 0.5) AS median_value,
            approx_percentile(col_name, array(0.25, 0.5, 0.75)) AS quartiles
        FROM
            table_name
        GROUP BY
            col_name;
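To illustrate why an approximate distinct count is cheaper than an exact one, here is a minimal HyperLogLog sketch in Python. Spark's approx_count_distinct is based on HyperLogLog++; the code below is a simplified illustration of the idea, not Spark's implementation, and the class name, the SHA-1 hash, and the default precision are assumptions chosen for the example.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog estimator (illustrative only, not Spark's HyperLogLog++)."""

    def __init__(self, p: int = 12):
        self.p = p                      # number of hash bits used as the register index
        self.m = 1 << p                 # number of registers (4096 for p=12)
        self.registers = [0] * self.m   # fixed memory, independent of input size
        # Standard bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, value) -> None:
        # Derive a 64-bit hash from the value (SHA-1 is an arbitrary choice here).
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                   # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self) -> float:
        # Harmonic mean of 2^register across all registers.
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return estimate


hll = HyperLogLog()
for i in range(100_000):
    hll.add(f"user-{i % 50_000}")   # 50,000 distinct values, each seen twice
print(round(hll.count()))           # an estimate near 50,000, not an exact count
```

The sketch keeps only 4096 small registers regardless of how many rows it sees, whereas an exact distinct count has to track every distinct value. The expected relative error is roughly 1.04 / sqrt(m), about 1.6% for p=12.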

What solution would you like?
We would like to add these approximation capabilities to every PPL command or function that performs counting:

  • Top & Rare commands
... | top_approx 5 values by country 
... | rare_approx age by gender
  • Stats
... | stats count_distinct_approx(c) by b | head 5
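One way to read the stats example is as a thin mapping onto the Spark function mentioned earlier. The Python sketch below builds the Spark SQL that such a pipeline might translate to. The function name stats_approx_to_sql, its parameters, the column alias, and the mapping itself are hypothetical, used only to illustrate the intended semantics; they are not taken from the project's implementation.

```python
def stats_approx_to_sql(table: str, field: str, group_by: str, limit: int) -> str:
    """Build Spark SQL for a hypothetical
    `source=<table> | stats count_distinct_approx(<field>) by <group_by> | head <limit>`
    pipeline. Assumes count_distinct_approx maps to Spark's approx_count_distinct.
    """
    return (
        f"SELECT approx_count_distinct({field}) AS approx_distinct, {group_by} "
        f"FROM {table} "
        f"GROUP BY {group_by} "
        f"LIMIT {limit}"
    )


# Mirrors the example above: ... | stats count_distinct_approx(c) by b | head 5
print(stats_approx_to_sql("table_name", "c", "b", 5))
```

The sketch covers only the stats case; how top_approx and rare_approx would rank values by an approximate count is left open here.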

Do you have any additional context?

YANG-DB added the enhancement (New feature or request), untriaged, Lang:PPL (Pipe Processing Language support), and 0.6 labels Nov 8, 2024
YANG-DB self-assigned this Nov 8, 2024
YANG-DB removed the untriaged label Nov 8, 2024
YANG-DB closed this as completed Nov 12, 2024