Datasets

bengetch edited this page Feb 5, 2021 · 3 revisions

Congregation tracks ownership of data at both the column level and the
dataset level, and the semantics of each differ slightly.

* Note that as of 1/14, hybrid/public protocols are not yet implemented, and
trust / plaintext annotations that would normally trigger certain performance
improvements currently have no effect.

Column level annotations

Columns may be manually annotated with both a trust_set and a
plaintext_set. For example, the following:

a = create_column("a", "INTEGER", trust_set={1,2}, plaintext_set={1})

defines a column with the following attributes:

  • Column name is a
  • Stores integer data
  • Held in plaintext by party with ID 1
  • May be safely revealed to party with ID 2
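
One way to read the two annotations in the example above is sketched below.
The ColumnAnnotation class and its methods are hypothetical stand-ins written
for illustration, not part of congregation's API. The sketch assumes that
plaintext_set lists the parties holding the column in the clear and that
trust_set lists every party the column may be revealed to.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ColumnAnnotation:
    """Hypothetical stand-in for an annotated column (not congregation's API)."""
    name: str
    type_str: str
    trust_set: frozenset = field(default_factory=frozenset)
    plaintext_set: frozenset = field(default_factory=frozenset)

    def holders(self):
        # Parties assumed to hold the column in plaintext.
        return set(self.plaintext_set)

    def revealable_to(self):
        # Parties assumed to be trusted with the column
        # without already holding it in plaintext.
        return set(self.trust_set) - set(self.plaintext_set)


# Mirrors create_column("a", "INTEGER", trust_set={1,2}, plaintext_set={1})
a = ColumnAnnotation("a", "INTEGER", frozenset({1, 2}), frozenset({1}))
print(a.holders())
print(a.revealable_to())
```

Under these assumptions, party 1 is the only holder and party 2 is the only
additional party the column may be revealed to, matching the list above.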

This information is used by congregation to determine which performance
improvements can be applied to the compiled workflow. If you choose not
to annotate columns as you define them, congregation will work out default
annotations based on the ownership definitions you provide at the dataset
level.

Dataset level annotations

A dataset can represent data stored in either plaintext or secret-shared
form. A plaintext dataset looks like the following:

a = create_column("a", "INTEGER")
b = create_column("b", "INTEGER")
c = create_column("c", "INTEGER")

rel_one = create("in1", [a, b, c], {1})

Optionally, a create() call can specify an input_path value, which tells
congregation exactly where to find the input dataset. If an input_path value
is not provided, congregation will expect the dataset to be located at
$data_path/$node_name.csv. Here, $data_path corresponds to the value from
your config input file with the same name, while $node_name corresponds to
the name specified in the create() call. The create() call above could
equivalently be written as:

rel_one = create("in1", [a, b, c], {1}, input_path=f"{config['general']['data_path']}/in1.csv")

Also note that if you specify an input_path, the name of the dataset file
does not need to match the name you give the node.
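
The lookup rule described above can be sketched as a small function. The
name resolve_input_path is hypothetical and this is not congregation's own
code; the config layout (config['general']['data_path']) is taken from the
example above, and the paths used are placeholder values.

```python
def resolve_input_path(config, node_name, input_path=None):
    """Hypothetical helper mirroring the input_path rule described above."""
    if input_path is not None:
        # An explicit path wins; the file name need not match the node name.
        return input_path
    # Otherwise fall back to $data_path/$node_name.csv.
    return f"{config['general']['data_path']}/{node_name}.csv"


config = {"general": {"data_path": "/tmp/data"}}  # placeholder config

default_path = resolve_input_path(config, "in1")
custom_path = resolve_input_path(
    config, "in1", input_path="/tmp/other/party_one.csv"
)
print(default_path)
print(custom_path)
```

With no input_path, the node name decides the file name. With an explicit
input_path, the node name plays no part in locating the file.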

To define a dataset that is stored as secret shares distributed across
several parties (e.g. parties 1, 2, and 3), pass the IDs of all of those
parties to create():

a = create_column("a", "INTEGER")
b = create_column("b", "INTEGER")
c = create_column("c", "INTEGER")

rel_one = create("in1", [a, b, c], {1, 2, 3})

The same rules for the optional input_path argument apply in this case as
well.
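
In the two examples above, the only difference between the plaintext and
secret-shared definitions is the set of party IDs passed to create(). A
minimal sketch of that distinction follows. The storage_form function is
hypothetical, and it assumes that a single owner always means plaintext and
that more than one owner always means secret shares.

```python
def storage_form(owners):
    """Hypothetical classifier: assumes the owner-set size decides the form."""
    owners = set(owners)
    if not owners:
        raise ValueError("a dataset needs at least one owning party")
    # One owner: that party holds the data in plaintext.
    # Several owners: the data is assumed to be secret-shared among them.
    return "plaintext" if len(owners) == 1 else "secret-shared"


print(storage_form({1}))
print(storage_form({1, 2, 3}))
```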
