Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Histogram error estimator #458

Merged
merged 13 commits into from
Jul 13, 2023
Merged

Histogram error estimator #458

merged 13 commits into from
Jul 13, 2023

Conversation

dvadym
Copy link
Collaborator

@dvadym dvadym commented Jun 20, 2023

This PR implements estimation of RMSE from DatasetHistogram for l0_bound and linf_bound

The algorithm is the following

  1. From l0_bound and l0_contributions_histogram the ratio data_dropped_from_l0 contribution bounding is computed.
  2. From linf_bound and linf_contributions_histogram the ratio_data_dropped_from_linf contribution bounding is computed.
  3. The total 'ratio_data_dropped' for contribution bounding is estimated from data_dropped_from_l0 and ratio_data_dropped_from_linf.
  4. Then under the assumption that contribution bounding drops data uniformly on all partitions, for a partition of the size n, it is assumed that n*ratio_data_dropped data points are dropped with contribution bounding. And RMSE for this partition is computed as sqrt((n*ratio_data_dropped)**2 + noise_std**2)
  5. RMSEs are averaged across all partitions.

@dvadym dvadym changed the title (WIP) Histogram error estimator Histogram error estimator Jun 22, 2023
@dvadym dvadym requested a review from RamSaw June 22, 2023 12:14
Copy link
Collaborator Author

@dvadym dvadym left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for comments! I've addressed them. PTAL

@@ -118,7 +118,7 @@ def tune_parameters():
restaurant_visits_rows = load_data(FLAGS.input_file)
# Create aggregate_params, data_extractors and public partitions.
aggregate_params = get_aggregate_params()
public_partitions = list(range(1, 8)) if FLAGS.public_partitions else None
public_partitions = list(range(2, 9)) if FLAGS.public_partitions else None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just want to check that it is an intentional change

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks reverted this change

return ratios_dropped[index][1]

x1, y1 = ratios_dropped[index]
x2, y2 = ratios_dropped[index + 1]
x1, y1 = ratios_dropped[index - 1]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we check that (index - 1) >= 0 and if not, then make y1 = 1?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, good point. I've added a comment and made a small change to ensure that bound > 0.

No we don't need to check, since ratio_dropped start from 0, but here bound > 0

pipeline_dp/dataset_histograms/histograms.py Outdated Show resolved Hide resolved
@@ -92,9 +92,28 @@ def test_sum_not_supported(self):
ValueError, "Only COUNT and PRIVACY_ID_COUNT are supported"):
self._get_estimator(pipeline_dp.Metrics.SUM)

def test_get_ratio_dropped_l0(self):
@parameterized.parameters((0, 1), (1, 0.818181818181818),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how's this number? if l0=1, then since user 1 contributes to 10 partitions, we drop 9 of them, i.e. 9 data points, for user 2 everything we keep, 9 / (20 + 10) = 0.3, not 0.(81)
maybe add a comment or formula somewhere

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, in test for estimate_rmse it is hard to verify that the expected numbers are correct, is it possible to write a short formula? we have iteration over partitions there, 10 of them, so maybe it will be not very small formula...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Done

The numbers for L0 histogram is correct, since it's not about rows, but about (privacy_unit, partition) pairs.

Copy link
Collaborator Author

@dvadym dvadym left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for review!

@@ -118,7 +118,7 @@ def tune_parameters():
restaurant_visits_rows = load_data(FLAGS.input_file)
# Create aggregate_params, data_extractors and public partitions.
aggregate_params = get_aggregate_params()
public_partitions = list(range(1, 8)) if FLAGS.public_partitions else None
public_partitions = list(range(2, 9)) if FLAGS.public_partitions else None
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks reverted this change

@@ -92,9 +92,28 @@ def test_sum_not_supported(self):
ValueError, "Only COUNT and PRIVACY_ID_COUNT are supported"):
self._get_estimator(pipeline_dp.Metrics.SUM)

def test_get_ratio_dropped_l0(self):
@parameterized.parameters((0, 1), (1, 0.818181818181818),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Done

The numbers for L0 histogram is correct, since it's not about rows, but about (privacy_unit, partition) pairs.

return ratios_dropped[index][1]

x1, y1 = ratios_dropped[index]
x2, y2 = ratios_dropped[index + 1]
x1, y1 = ratios_dropped[index - 1]
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, good point. I've added a comment and made a small change to ensure that bound > 0.

No we don't need to check, since ratio_dropped start from 0, but here bound > 0

@dvadym dvadym merged commit 45c6acc into main Jul 13, 2023
10 of 11 checks passed
@delete-merged-branch delete-merged-branch bot deleted the error_estimator branch July 13, 2023 10:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants