Is your feature request related to a problem?
In large-volume and/or high-cardinality datasets, counting is an expensive computation.
Spark has a list of built-in approximation functions, such as:
approx_count_distinct, which returns the estimated number of distinct values in expr within the group.
SELECT approx_count_distinct(column_name) FROM table_name;
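Spark also accepts an optional relativeSD argument (the maximum relative standard deviation allowed in the estimate), so accuracy can be traded against cost. A minimal sketch reusing the placeholder table_name and column_name, with 0.01 chosen purely for illustration:
-- allow at most ~1% relative standard deviation in the estimate (illustrative value)
SELECT approx_count_distinct(column_name, 0.01) FROM table_name;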
approx_percentile, which returns the approximate percentile of the numeric column col, i.e. the smallest value in the ordered col values such that no more than the requested percentage of col values is less than or equal to it.
SELECT
  group_col,  -- hypothetical grouping key; grouping by col_name itself would make the percentiles trivial
  approx_percentile(col_name, 0.5) AS median_value,
  approx_percentile(col_name, array(0.25, 0.5, 0.75)) AS quartiles
FROM
  table_name
GROUP BY
  group_col
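Similarly, approx_percentile accepts an optional accuracy argument (a positive literal; roughly, 1.0/accuracy is the relative error of the approximation), so callers can trade memory for precision. A minimal sketch with the same placeholder names, 1000 being an illustrative value:
-- higher accuracy lowers the relative error at the cost of memory
SELECT approx_percentile(col_name, 0.5, 1000) AS approx_median
FROM table_name;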
What solution would you like?
We would like to add these approximation capabilities to every PPL command that performs counting (a sketch of a possible Spark SQL translation follows the examples below):
Top & Rare commands
... | top_approx 5 values by country
... | rare_approx age by gender
Stats
... | stats count_distinct_approx(c) by b | head 5
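As a rough sketch of the intent (the exact push-down is an implementation detail, and t is a placeholder table name; b and c come from the PPL example above), the stats example could translate to Spark's approximate aggregate:
-- hypothetical translation of: ... | stats count_distinct_approx(c) by b | head 5
SELECT b, approx_count_distinct(c) AS count_distinct_approx_c
FROM t
GROUP BY b
LIMIT 5;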
Do you have any additional context?