Histogram error estimator #458

dvadym · 2023-06-20T14:08:50Z

This PR implements estimation of RMSE from DatasetHistogram for a given l0_bound and linf_bound.

The algorithm is as follows:

From l0_bound and l0_contributions_histogram, the ratio of data dropped by l0 contribution bounding, ratio_data_dropped_from_l0, is computed.
From linf_bound and linf_contributions_histogram, the ratio of data dropped by linf contribution bounding, ratio_data_dropped_from_linf, is computed.
The total 'ratio_data_dropped' for contribution bounding is estimated from ratio_data_dropped_from_l0 and ratio_data_dropped_from_linf.
Then, under the assumption that contribution bounding drops data uniformly across all partitions, a partition of size n is assumed to lose n*ratio_data_dropped data points to contribution bounding. The RMSE for this partition is computed as sqrt((n*ratio_data_dropped)**2 + noise_std**2).
The RMSEs are averaged across all partitions.
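
The last three steps can be illustrated with a minimal sketch. It is not the code from this PR: the function names are hypothetical, and the way the two ratios are combined (treating l0 and linf bounding as independent) is an assumption, since the description only says the total is "estimated from" them.

```python
import math
from typing import Sequence


def combine_ratios_dropped(ratio_l0: float, ratio_linf: float) -> float:
    # Assumption (not stated in the PR description): the two bounding steps
    # act independently, so the kept fractions multiply.
    return 1 - (1 - ratio_l0) * (1 - ratio_linf)


def partition_rmse(n: float, ratio_data_dropped: float,
                   noise_std: float) -> float:
    # Bias from contribution bounding: n * ratio_data_dropped points dropped.
    bias = n * ratio_data_dropped
    # RMSE = sqrt(bias**2 + noise_std**2), as in the description.
    return math.sqrt(bias**2 + noise_std**2)


def estimate_rmse(partition_sizes: Sequence[float], ratio_data_dropped: float,
                  noise_std: float) -> float:
    # Average the per-partition RMSEs across all partitions.
    rmses = [
        partition_rmse(n, ratio_data_dropped, noise_std)
        for n in partition_sizes
    ]
    return sum(rmses) / len(rmses)
```

For example, with partitions of size 12 and 0, ratio_data_dropped=0.25 and noise_std=4.0, the per-partition RMSEs are sqrt(3**2 + 4**2) = 5 and sqrt(0 + 4**2) = 4, so estimate_rmse returns 4.5.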

pipeline_dp/dataset_histograms/histogram_error_estimator.py

tests/dataset_histograms/histogram_error_estimator_test.py

dvadym

Thanks a lot for the comments! I've addressed them. PTAL.

RamSaw · 2023-07-07T10:23:27Z

examples/restaurant_visits/run_without_frameworks_tuning.py

@@ -118,7 +118,7 @@ def tune_parameters():
    restaurant_visits_rows = load_data(FLAGS.input_file)
    # Create aggregate_params, data_extractors and public partitions.
    aggregate_params = get_aggregate_params()
-    public_partitions = list(range(1, 8)) if FLAGS.public_partitions else None
+    public_partitions = list(range(2, 9)) if FLAGS.public_partitions else None


Just want to check that this is an intentional change.

RamSaw · 2023-07-07T10:31:11Z

pipeline_dp/dataset_histograms/histogram_error_estimator.py

            return ratios_dropped[index][1]

-        x1, y1 = ratios_dropped[index]
-        x2, y2 = ratios_dropped[index + 1]
+        x1, y1 = ratios_dropped[index - 1]


Should we check that (index - 1) >= 0 and, if not, set y1 = 1?

pipeline_dp/dataset_histograms/histograms.py

pipeline_dp/dataset_histograms/histogram_error_estimator.py

RamSaw · 2023-07-07T10:45:06Z

tests/dataset_histograms/histogram_error_estimator_test.py

@@ -92,9 +92,28 @@ def test_sum_not_supported(self):
                ValueError, "Only COUNT and PRIVACY_ID_COUNT are supported"):
            self._get_estimator(pipeline_dp.Metrics.SUM)

-    def test_get_ratio_dropped_l0(self):
+    @parameterized.parameters((0, 1), (1, 0.818181818181818),


How is this number obtained? If l0=1, then since user 1 contributes to 10 partitions, we drop 9 of them, i.e. 9 data points; for user 2 we keep everything. So 9 / (20 + 10) = 0.3, not 0.(81).
Maybe add a comment or formula somewhere.

Also, in the test for estimate_rmse it is hard to verify that the expected numbers are correct. Is it possible to write a short formula? We iterate over partitions there, 10 of them, so maybe the formula will not be very short...

dvadym

Thanks a lot for the review!

dvadym · 2023-07-07T13:26:32Z

examples/restaurant_visits/run_without_frameworks_tuning.py

Thanks, reverted this change.

dvadym · 2023-07-13T10:01:00Z

tests/dataset_histograms/histogram_error_estimator_test.py

Good point. Done.

The numbers for the L0 histogram are correct, since it is not about rows but about (privacy_unit, partition) pairs.
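
To make the pair-based counting concrete, here is a small sketch. It is not the estimator from this PR, which works from histograms: the helper name is hypothetical, and the input is assumed to be one privacy unit contributing to 10 partitions and a second contributing to 1, a configuration chosen because it reproduces the expected value 9/11.

```python
from typing import Sequence


def ratio_dropped_l0(partitions_per_privacy_unit: Sequence[int],
                     l0_bound: int) -> float:
    # Each element is the number of (privacy_unit, partition) pairs that one
    # privacy unit contributes, i.e. the number of partitions it touches.
    total_pairs = sum(partitions_per_privacy_unit)
    # With l0 bounding, a privacy unit keeps at most l0_bound pairs.
    dropped_pairs = sum(
        max(0, k - l0_bound) for k in partitions_per_privacy_unit)
    return dropped_pairs / total_pairs


# Hypothetical data: privacy unit 1 -> 10 partitions, privacy unit 2 -> 1.
print(ratio_dropped_l0([10, 1], l0_bound=1))  # 9 of 11 pairs dropped
print(ratio_dropped_l0([10, 1], l0_bound=0))  # all pairs dropped
```

Under these assumptions l0_bound=1 gives 9/11 ≈ 0.8182 and l0_bound=0 gives 1, matching the test parameters (1, 0.818181818181818) and (0, 1). The number of rows each privacy unit has does not enter the computation.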

dvadym · 2023-07-13T10:09:15Z

pipeline_dp/dataset_histograms/histogram_error_estimator.py

Thanks, good point. I've added a comment and made a small change to ensure that bound > 0.

No, we don't need to check: ratios_dropped starts from 0, while here bound > 0.
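
The argument can be checked on a simplified sketch of the lookup. This is an illustration, not the code from the PR: the function name is hypothetical, ratios_dropped is assumed to be a list of (bound, ratio) pairs sorted by bound with first entry (0, 1), and the behaviour above the largest bound is an assumption.

```python
import bisect
from typing import List, Tuple


def interpolate_ratio_dropped(ratios_dropped: List[Tuple[float, float]],
                              bound: float) -> float:
    if bound <= 0:
        return 1.0  # With a zero bound all data is dropped.
    xs = [x for x, _ in ratios_dropped]
    # First position whose bound is >= the requested bound.
    index = bisect.bisect_left(xs, bound)
    if index == len(ratios_dropped):
        # Assumption: above the largest known bound, reuse the last ratio.
        return ratios_dropped[-1][1]
    if xs[index] == bound:
        return ratios_dropped[index][1]
    # Here index >= 1: xs[0] == 0 < bound, so bisect_left cannot return 0.
    x1, y1 = ratios_dropped[index - 1]
    x2, y2 = ratios_dropped[index]
    # Linear interpolation between the two neighbouring points.
    return y1 + (y2 - y1) * (bound - x1) / (x2 - x1)
```

With ratios_dropped = [(0, 1.0), (2, 0.5), (4, 0.0)], a bound of 1 falls between the first two points and yields 0.75. Because the list starts at bound 0 and the requested bound is positive, index - 1 is never negative, which is the point made in the reply above.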

dvadym added 7 commits June 12, 2023 15:43

wip

803593b

wip

994c759

Merge branch 'main' into error_estimator

e8aaad0

wip

2921030

Merge branch 'main' into error_estimator

02aa8b5

wip

e7c86fd

wip

8e0b10b

dvadym changed the title ~~(WIP) Histogram error estimator~~ Histogram error estimator Jun 22, 2023

dvadym requested a review from RamSaw June 22, 2023 12:14

RamSaw reviewed Jun 22, 2023

dvadym added 5 commits June 23, 2023 19:28

wip

70e9f51

Merge branch 'main' into error_estimator

9b538cc

wip

0b2ccfc

Merge branch 'main' into error_estimator

b98e1d8

addressed comments

7f72605

dvadym commented Jul 6, 2023

RamSaw reviewed Jul 7, 2023

RamSaw approved these changes Jul 7, 2023

addressed comments

0f0e148

dvadym commented Jul 13, 2023

dvadym merged commit 45c6acc into main Jul 13, 2023
10 of 11 checks passed

delete-merged-branch bot deleted the error_estimator branch July 13, 2023 10:25
