
Proposal for execution engine allowing runtime languages (like c#, Java) #66

fhoering commented Jul 17, 2024

Context

One key design concept of the key/value server is that it should be side-effect free. This means any software running on it should not be able to do logging or any form of inbound or outbound RPC, as this could be used to exfiltrate cross-site user data. In particular this means: no network, disk access, timers, or logging.

Google’s implementation of the key/value server relies on a sandboxed VM using the open-source V8 engine inside Sandbox2, which is able to reset the context at each request. This has the upside of allowing custom business logic for each ad tech, but the effectively mandatory backend is JavaScript (#43).

Having to use a JavaScript backend is a major constraint for any ad tech that doesn’t currently run on a Node.js platform, because it means

  • having to migrate a big part of the code base to JavaScript
  • having to maintain the existing code base for other browsers that don’t support the Privacy Sandbox, like Safari and Firefox.

At Criteo we are particularly interested in managed runtimes because our code base is in C#, so we use C# in the examples below. The same mechanisms would apply to every other managed language, like Java or Python.

We consider an execution engine that supports many programming languages a key feature, as it would likely also reduce other ad techs’ migration efforts to the Bidding and Auction server. A generic solution that can be reapplied to many programming languages also seems beneficial compared to the development effort of patching and sandboxing each language independently.

Process-based sandboxing

We support the design principles mentioned above, but believe they can be attained with process-based sandboxing. nsjail looks interesting because it leverages Linux kernel mechanisms (namespaces, chroot, seccomp) to lock down disk and network access, which means we could provide an image that runs any process, as long as this process is protected by those kernel mechanisms.

The process would be started with its disk and network access isolated, as shown in the following schema:
[image: schema of a worker process isolated from disk and network by the sandbox]
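As an illustration, the trusted supervisor could launch the untrusted worker under nsjail roughly as follows. This is a minimal sketch: the paths, uid/gid, limits, and the udf-worker binary name are assumptions, and the exact flag set would need tuning. nsjail runs the child in fresh kernel namespaces by default, so the jailed process has no usable network interface.

```csharp
// Minimal sketch (paths, uid/gid and limits are assumptions): the trusted
// supervisor launches the untrusted worker under nsjail so that Linux kernel
// mechanisms, not the language runtime, enforce the isolation.
using System.Diagnostics;

var startInfo = new ProcessStartInfo
{
    FileName  = "/usr/bin/nsjail",
    Arguments = string.Join(" ",
        "-Mo",                        // standalone mode: run the command once
        "--chroot /sandbox/rootfs",   // worker only sees this minimal tree
        "--user 99999 --group 99999", // drop to an unprivileged uid/gid
        "--disable_proc",             // no /proc inside the jail
        "--time_limit 3600",          // hard wall-clock limit, in seconds
        "--", "/bin/udf-worker"),     // hypothetical ad tech worker binary
    RedirectStandardInput  = true,    // IPC with the jailed worker over pipes
    RedirectStandardOutput = true,
};

using var worker = Process.Start(startInfo)!;
```

In this setup the supervisor, not the managed runtime, owns the isolation guarantees, so the same image could host a C#, Java, or Python worker unchanged.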

We see this as a mechanism very similar to what Google already proposed for ML inference: instead of a PyTorch process, we propose to run a custom process in a sandboxed environment (see protected-auction-services-discussion/resources/bidding-auction-inference-presentation-2024-04-24.pdf, slide 16).

It also looks very similar to the proposed gVisor sandboxing mechanism, which is currently not supported for managed runtimes.

Threat model and potential solutions

One remaining attack on this system would be to use memory access to construct a timeline across requests and use that information for future bidding across interest groups, or to take the information out in some form via the bid value of the generateBid function. It should be noted that the key/value server mostly handles user IG data and very limited contextual data, and that the first-party IG user data is already known to the ad tech that created the IG.

We propose here some potential mitigation techniques that take into account the data that is actually processed. They should be considered as first elements for discussion; we would be happy to consider other proposals as well.

Restart the process after each request

The simplest workaround would be to restart the server after each request to reinitialize memory state. But it takes time and resources to initialize a new worker process, in particular for languages with managed runtimes like C# or Java. Given the real-time bidding use case and the QPS constraints and scale that this service needs to handle, this will very likely be prohibitive (see the benchmarks in #43).

Restart the process every n minutes

Besides IG user data, the KV server already receives some contextual data today, like the publisher hostname and optionally experimentGroupId / ad sizes.

This data would allow attaching some auction data to a single first-party interest group, but also learning that the user has participated in some auctions.

To mitigate this we propose to apply the same trick that will likely also be applied to PyTorch ML inference: regularly restart the process and clean up memory state.

Akshay: “We sandbox the model the same way we sandbox UDFs. For PyTorch, we periodically reset the model, as we want it to be stateless. For privacy, we’ll reset the model with some probability to ensure statelessness.”

We estimate that such a process would need to run for at least 1h to keep infra cost requirements reasonable. This seems enough given that the input contextual data is very limited: it would allow combining the IG activity for bidding only within a 1h window, after which the process restarts from scratch, and this data never gets out of the TEE for further usage.
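A sketch of that restart policy could look like the following. The 1h lifetime comes from the estimate above; the per-request reset probability is a hypothetical parameter in the spirit of the PyTorch model reset quoted above, and the line-based stdin/stdout IPC is an assumption.

```csharp
// Sketch of the periodic + probabilistic restart policy.
using System;
using System.Diagnostics;

sealed class WorkerSupervisor
{
    static readonly TimeSpan MaxLifetime = TimeSpan.FromHours(1); // estimate above
    const double ResetProbability = 0.001;                        // assumption, to be tuned
    readonly Random _rng = new();
    Process _worker = SpawnSandboxedWorker();
    Stopwatch _age = Stopwatch.StartNew();

    public string Handle(string request)
    {
        // Recycle the worker if it is too old, or with a small probability,
        // so memory state can never accumulate beyond a bounded window.
        if (_age.Elapsed > MaxLifetime || _rng.NextDouble() < ResetProbability)
        {
            _worker.Kill();
            _worker = SpawnSandboxedWorker();
            _age = Stopwatch.StartNew();
        }
        // Assumption: simple line-based IPC with the jailed worker over pipes.
        _worker.StandardInput.WriteLine(request);
        return _worker.StandardOutput.ReadLine() ?? "";
    }

    static Process SpawnSandboxedWorker() => Process.Start(new ProcessStartInfo
    {
        FileName = "/usr/bin/nsjail", // same nsjail launch as sketched above
        Arguments = "-Mo --chroot /sandbox/rootfs -- /bin/udf-worker",
        RedirectStandardInput = true,
        RedirectStandardOutput = true,
    })!;
}
```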

Let each worker handle only one IG for one user

For performance reasons the Chrome browser combines all user interest groups and sends them in a batched way to the KV server. We propose to keep this mechanism, as it is very important not to send out too many network calls. However, one could parse the keys server side and redirect each IG’s keys in a separate call to a TEE sandbox container (or handle them on the same node with multiple worker processes), as sketched below.

[image: schema of per-IG keys dispatched to separate sandboxed workers]
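A sketch of the server-side fan-out follows. The “igName/key” prefix convention and the callWorker delegate are assumptions about how keys are laid out and how a per-IG sandbox is invoked.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

static class BatchDispatcher
{
    // Keep Chrome's single batched request, but fan the keys out so each
    // sandboxed worker only ever sees the keys of one interest group.
    public static async Task<Dictionary<string, string>> HandleBatchAsync(
        IReadOnlyList<string> keys,
        // Hypothetical delegate: one lookup inside a per-IG sandboxed worker.
        Func<string, IReadOnlyList<string>, Task<Dictionary<string, string>>> callWorker)
    {
        // Assumption: each key carries its IG as a prefix, e.g. "igName/key".
        var perIg = keys.GroupBy(k => k.Split('/')[0]);

        // One isolated worker (separate TEE container, or a separate worker
        // process on the same node) per interest group.
        var lookups = perIg.Select(g => callWorker(g.Key, g.ToList()));
        var results = await Task.WhenAll(lookups);

        // Merge the per-IG answers back into the single batched response.
        return results.SelectMany(r => r)
                      .ToDictionary(kv => kv.Key, kv => kv.Value);
    }
}
```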

Sanity check for stateless behavior

We must prevent ad techs’ untrusted code from leaking state. One option could be to have the TEE trusted code execute regular sanity checks of the untrusted code.

For example, in 10% of the cases, execute the same call twice on two instances and compare the results. If the results are identical, we know the response did not depend on accumulated state, so no information has been leaked through it.

If the results are different, we can reset the state (by restarting the processes) and make the rogue DSP pay a penalty (e.g. by replying no-bid for a minute instead of calling the untrusted code).

With the right parameters, we can be confident that very little data can be leaked.
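A sketch of this check as trusted-code logic: the 10% sampling rate and the one-minute penalty come from the text above, while the worker delegates and the reset hook are hypothetical.

```csharp
using System;
using System.Threading.Tasks;

sealed class StatelessnessChecker
{
    const double SampleRate = 0.10;                             // check 10% of calls
    static readonly TimeSpan Penalty = TimeSpan.FromMinutes(1); // no-bid window
    readonly Random _rng = new();
    DateTime _penaltyUntil = DateTime.MinValue;

    // workerA/workerB are hypothetical hooks that run the untrusted UDF on two
    // independent sandboxed instances; resetWorkers restarts both processes.
    public async Task<string?> InvokeAsync(
        string request,
        Func<string, Task<string>> workerA,
        Func<string, Task<string>> workerB,
        Action resetWorkers)
    {
        if (DateTime.UtcNow < _penaltyUntil)
            return null; // penalty active: reply no-bid, skip the untrusted code

        if (_rng.NextDouble() >= SampleRate)
            return await workerA(request); // not sampled: single normal call

        // Sampled: run the same request on two instances and compare.
        var (a, b) = (await workerA(request), await workerB(request));
        if (a == b)
            return a; // identical answers: response carried no hidden state

        resetWorkers();                            // wipe any leaked state
        _penaltyUntil = DateTime.UtcNow + Penalty; // rogue DSP pays a penalty
        return null;
    }
}
```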
