-
-
Notifications
You must be signed in to change notification settings - Fork 77
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Histogram error estimator #458
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for comments! I've addressed them. PTAL
@@ -118,7 +118,7 @@ def tune_parameters(): | |||
restaurant_visits_rows = load_data(FLAGS.input_file) | |||
# Create aggregate_params, data_extractors and public partitions. | |||
aggregate_params = get_aggregate_params() | |||
public_partitions = list(range(1, 8)) if FLAGS.public_partitions else None | |||
public_partitions = list(range(2, 9)) if FLAGS.public_partitions else None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just want to check that it is an intentional change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks reverted this change
return ratios_dropped[index][1] | ||
|
||
x1, y1 = ratios_dropped[index] | ||
x2, y2 = ratios_dropped[index + 1] | ||
x1, y1 = ratios_dropped[index - 1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we check that (index - 1) >= 0 and if not, then make y1 = 1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, good point. I've added a comment and made a small change to ensure that bound > 0
.
No we don't need to check, since ratio_dropped
start from 0
, but here bound > 0
@@ -92,9 +92,28 @@ def test_sum_not_supported(self): | |||
ValueError, "Only COUNT and PRIVACY_ID_COUNT are supported"): | |||
self._get_estimator(pipeline_dp.Metrics.SUM) | |||
|
|||
def test_get_ratio_dropped_l0(self): | |||
@parameterized.parameters((0, 1), (1, 0.818181818181818), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how's this number? if l0=1, then since user 1 contributes to 10 partitions, we drop 9 of them, i.e. 9 data points, for user 2 everything we keep, 9 / (20 + 10) = 0.3, not 0.(81)
maybe add a comment or formula somewhere
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, in test for estimate_rmse it is hard to verify that the expected numbers are correct, is it possible to write a short formula? we have iteration over partitions there, 10 of them, so maybe it will be not very small formula...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Done
The numbers for L0 histogram is correct, since it's not about rows, but about (privacy_unit, partition) pairs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for review!
@@ -118,7 +118,7 @@ def tune_parameters(): | |||
restaurant_visits_rows = load_data(FLAGS.input_file) | |||
# Create aggregate_params, data_extractors and public partitions. | |||
aggregate_params = get_aggregate_params() | |||
public_partitions = list(range(1, 8)) if FLAGS.public_partitions else None | |||
public_partitions = list(range(2, 9)) if FLAGS.public_partitions else None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks reverted this change
@@ -92,9 +92,28 @@ def test_sum_not_supported(self): | |||
ValueError, "Only COUNT and PRIVACY_ID_COUNT are supported"): | |||
self._get_estimator(pipeline_dp.Metrics.SUM) | |||
|
|||
def test_get_ratio_dropped_l0(self): | |||
@parameterized.parameters((0, 1), (1, 0.818181818181818), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Done
The numbers for L0 histogram is correct, since it's not about rows, but about (privacy_unit, partition) pairs.
return ratios_dropped[index][1] | ||
|
||
x1, y1 = ratios_dropped[index] | ||
x2, y2 = ratios_dropped[index + 1] | ||
x1, y1 = ratios_dropped[index - 1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, good point. I've added a comment and made a small change to ensure that bound > 0
.
No we don't need to check, since ratio_dropped
start from 0
, but here bound > 0
This PR implements estimation of RMSE from
DatasetHistogram
forl0_bound
andlinf_bound
The algorithm is the following
l0_bound
andl0_contributions_histogram
the ratiodata_dropped_from_l0
contribution bounding is computed.linf_bound
andlinf_contributions_histogram
theratio_data_dropped_from_linf
contribution bounding is computed.data_dropped_from_l0
andratio_data_dropped_from_linf
.n
, it is assumed thatn*ratio_data_dropped
data points are dropped with contribution bounding. And RMSE for this partition is computed assqrt((n*ratio_data_dropped)**2 + noise_std**2)