Proposal: Query Planner #1573
Comments
Nice write-up, @bradengroom! A query planner is an idea that we've thrown around a few times but has never made it into a concrete proposal. This is a great place to start jotting ideas. Other places we've talked about query planning being potentially valuable:
Ideally we could decompose query plans into small subproblems that would be shared between many types of queries / plans so that this is not as much of an issue.
Have you considered starting in the middle and going outward? 😄
If that's true then that's much more interesting. You could potentially change your course of action halfway through a query. "I started traversing from the subject, but my estimate was off by an order of magnitude. Maybe I should start traversing from the resource and try to meet myself where I've made it so far." That also opens the door for request-time query hints (if we like the idea of query hints at all).
Thanks for the feedback! As I think on this, this might be too large a change to hash out in a single thread. I'm seeing 3 distinct threads that this work can be split into:
There's also likely a thread to spin up around an optimization layer. How do we take a physical plan and transform it into something more efficient? That can likely come later though, once the code base has the concept of a query plan in place at all. Since it sounds like the idea of a query planner already has buy-in from your internal discussions, I'd like to take a piece and start designing it with more concrete details if that's alright. Do y'all have any issues with me spinning up a new, concrete proposal for the stats module? I don't want to overwhelm folks with proposal docs if this isn't the usual flow. I'm just eager to contribute 😄 I'm still exploring the code, so this will likely take some time.
Designs have been thrown around on an IR to do just that, but it's still very much in the ideation phase.
No issue with doing so, but it is not a priority right now either, so just be aware that response times can and likely will be slower on any such designs or PRs. Stats is probably the best place to start, with a goal of getting good stats without impacting read or write performance, which likely means a background worker updating them asynchronously.
@bradengroom Any updates on the stats side? Just curious :)
I have not, unfortunately. The demands of my day job have picked up as we've gone deeper into Q4.
There are two types of operations that query planners can do: heuristic/rule-based rewrites and cost-based optimizations.
So far we've mostly been discussing cost-based planning, but actually I think it makes sense to first consolidate the few places where we're performing heuristic/rules rewrites already into a clean interface that can be the foundation of the planner.
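For illustration, here's a minimal sketch of what such a consolidated rewrite-rule interface could look like; every name below is invented for the sketch and is not SpiceDB's actual internal API:

```go
// Hypothetical sketch only: these names and this package do not exist in
// SpiceDB today; they just illustrate a rule-based rewrite interface.
package planner

// Plan is a placeholder for whatever logical/physical plan node type the
// planner would operate on.
type Plan interface {
	Children() []Plan
}

// RewriteRule is a heuristic transformation that either returns a
// semantically equivalent (but hopefully cheaper) plan or reports that it
// does not apply.
type RewriteRule interface {
	Name() string
	Apply(p Plan) (rewritten Plan, applied bool)
}

// ApplyRules runs every rule repeatedly until no rule fires, bounded by
// maxPasses to avoid rules that undo each other forever.
func ApplyRules(p Plan, rules []RewriteRule, maxPasses int) Plan {
	for pass := 0; pass < maxPasses; pass++ {
		changed := false
		for _, r := range rules {
			if next, ok := r.Apply(p); ok {
				p = next
				changed = true
			}
		}
		if !changed {
			break
		}
	}
	return p
}
```

The existing hard-coded rewrites could then each become one RewriteRule, which is the clean interface a later cost-based planner could build on.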
@jzelinskie agreed that might be a good first step to refactor towards the larger change. Is there a good way of hunting these rewrites down? Or does someone just need to trace the code and compile a list of such rewrites? I assume it’d all be in the check dispatch path. |
This would be a big win for us. Nearly all of our evaluations, especially […]. We loosely followed the https://authzed.com/blog/google-cloud-iam-modeling example when creating our model. In some cases we have hundreds of thousands or millions of […]
Problem Statement
Please redirect me if this is not the best place to gather feedback on an idea that is not fully formed yet. I'm looking for feedback from deeper SpiceDB experts on this idea, which I'd be interested in exploring further if there's interest.
The first question worth answering is "why would SpiceDB need a query planner at all?". At the end of the day, SpiceDB's queries are just trying to determine if a path exists between two points. Currently, SpiceDB performs all path finding starting from resource->subject even though there may be some queries that are more performant when traversing subject->resource. While SpiceDB tuples represent a directed graph, that does not preclude SpiceDB's query engine from traversing the graph backwards via database indexes.
Why would we ever want to traverse backwards? (subject->resource)
I think it comes down to cardinality of relationships and how much data must be loaded from the data store. Let's consider a simple scenario where there's a many-to-many relationship between entities A, B, and C:
We want to know if there's a path from `A:1` to `C:1` through `B`. We have two options for fulfilling this query:
1. Traverse A->B
2. Traverse B->C
3. Access granted if `b_to_c_tuples` is not empty.
But what happens if `A:1` has many `a_to_b_relation`s? What if there are a million? Then step 2 will chug along making many requests (with some concurrency factor) to sift through the many A->B tuples.
What if the cardinality happens to be much better going the other way? Each `A` has many `B`s, but each `B` only has 1 `A`. It could potentially be much faster to:
1. Traverse (backwards) C->B
2. Traverse (backwards) B->A
3. Access granted if `a_to_b_tuples` is not empty.
This could potentially be much faster if the relationships have lower cardinality when traversing backwards.
If a proper query planning layer existed, it could try to solve this problem, and it could potentially be extended to perform other optimizations that we think of later.
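To make that concrete, here's a rough sketch of how a planner could pick a traversal direction from per-relation cardinality estimates. Every type and function name below is invented for illustration, and the fanout ratio is just one possible heuristic built on the stats idea described in the next section:

```go
// Hypothetical sketch only: none of these types exist in SpiceDB; they just
// show how cardinality estimates could drive the choice of direction.
package planner

// Direction describes which way a plan walks the relation graph.
type Direction int

const (
	ResourceToSubject Direction = iota // today's behavior: A -> B -> C
	SubjectToResource                  // reverse traversal: C -> B -> A
)

// RelationStats holds estimated distinct counts for each side of a relation
// (e.g. produced by the HLL-based stats sketched in the next section).
type RelationStats struct {
	DistinctResources uint64 // distinct left-hand objects seen, e.g. distinct A's in a_to_b_relation
	DistinctSubjects  uint64 // distinct right-hand subjects seen, e.g. distinct B's
}

// fanout is a rough proxy for how many edges we expect to expand per node
// when walking the hop in the given direction: the more distinct entries on
// the far side per entry on the near side, the wider the traversal.
func (s RelationStats) fanout(d Direction) float64 {
	if d == ResourceToSubject {
		if s.DistinctResources == 0 {
			return 0
		}
		return float64(s.DistinctSubjects) / float64(s.DistinctResources)
	}
	if s.DistinctSubjects == 0 {
		return 0
	}
	return float64(s.DistinctResources) / float64(s.DistinctSubjects)
}

// ChooseDirection multiplies the per-hop fanout estimates along the whole
// path (A->B, then B->C in the example above) and picks the cheaper side.
func ChooseDirection(path []RelationStats) Direction {
	forward, reverse := 1.0, 1.0
	for _, hop := range path {
		forward *= hop.fanout(ResourceToSubject)
		reverse *= hop.fanout(SubjectToResource)
	}
	if reverse < forward {
		return SubjectToResource
	}
	return ResourceToSubject
}
```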
Open Questions / Problems
Determining Relation Cardinality
One idea would be to use HLLs to simply count and estimate how many unique entries we see on each side of a relation. So if two tuples are written, `A:1#a_to_b_relation@B:1` and `A:1#a_to_b_relation@B:2`, we would end up with two HLLs:
`a_to_b_relation_a_hll.Len() => 1`
`a_to_b_relation_b_hll.Len() => 2`
At scale, these numbers could provide insight on the cardinality between the two and hint at an optimal traversal direction. Using these cardinality values between relations, we can weigh our options. Some heuristic would need to be established if there are multiple paths from A->C.
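As a sketch of the bookkeeping (the estimator below is an exact, map-based stand-in purely to keep the example dependency-free; a real implementation would presumably swap in an HLL sketch so the counters stay small and can be merged and persisted):

```go
// Hypothetical sketch only: a per-relation cardinality tracker.
package stats

import "sync"

// cardinalityEstimator is the minimal behavior the planner needs from a
// counter. In practice this would be an HLL sketch so it stays small, can be
// merged across nodes, and can be persisted in the datastore.
type cardinalityEstimator interface {
	Insert(id string)
	Estimate() uint64
}

// exactEstimator counts exactly with a map; fine for an example, not for scale.
type exactEstimator struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func newExactEstimator() *exactEstimator {
	return &exactEstimator{seen: make(map[string]struct{})}
}

func (e *exactEstimator) Insert(id string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.seen[id] = struct{}{}
}

func (e *exactEstimator) Estimate() uint64 {
	e.mu.Lock()
	defer e.mu.Unlock()
	return uint64(len(e.seen))
}

// RelationCardinality tracks distinct resources and subjects for a single
// relation, e.g. a_to_b_relation from the example above.
type RelationCardinality struct {
	Resources cardinalityEstimator
	Subjects  cardinalityEstimator
}

func NewRelationCardinality() *RelationCardinality {
	return &RelationCardinality{
		Resources: newExactEstimator(),
		Subjects:  newExactEstimator(),
	}
}

// ObserveTuple is called for each tuple such as A:1#a_to_b_relation@B:2,
// ideally from an async background worker rather than the write path.
func (rc *RelationCardinality) ObserveTuple(resourceID, subjectID string) {
	rc.Resources.Insert(resourceID)
	rc.Subjects.Insert(subjectID)
}
```

Feeding the two example tuples above through ObserveTuple would give Resources.Estimate() == 1 and Subjects.Estimate() == 2, matching the HLL counts sketched earlier.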
You'd want to persist this information in the datastore somehow and have nodes only leverage the information once a substantial number of data points have been gathered. All nodes need to be on the same page.
Query Planning Latency
How would a query planning step be introduced without adding significant latency to the queries? We have to be able to pay off the time we spend planning!
Fortunately, SpiceDB is purpose-built and doesn't have an infinite number of queries to serve up. A given schema only captures so many paths, and clients likely only use a subset of the paths that a schema is capable of. The first time SpiceDB encounters a query for some `A:X#a_to_b_relation@B:Y` path, the plan may be cached and reused for all queries of its shape.
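One possible shape-keyed cache, again with invented types for illustration (a real version would also need invalidation when the schema or the stats change):

```go
// Hypothetical sketch only: caching the chosen plan per query "shape" so the
// planning cost is paid once, not per request.
package planner

import "sync"

// Plan is a placeholder for whatever the planner ultimately produces.
type Plan any

// QueryShape identifies a class of queries that can share a plan; the concrete
// object IDs (the X and Y in A:X / B:Y) are deliberately excluded.
type QueryShape struct {
	ResourceType string // e.g. "A"
	Relation     string // e.g. "a_to_b_relation"
	SubjectType  string // e.g. "B"
}

// PlanCache memoizes one plan per shape.
type PlanCache struct {
	mu    sync.RWMutex
	plans map[QueryShape]Plan
}

func NewPlanCache() *PlanCache {
	return &PlanCache{plans: make(map[QueryShape]Plan)}
}

// GetOrPlan returns the cached plan for the shape, planning it on first use.
func (c *PlanCache) GetOrPlan(shape QueryShape, plan func() Plan) Plan {
	c.mu.RLock()
	if p, ok := c.plans[shape]; ok {
		c.mu.RUnlock()
		return p
	}
	c.mu.RUnlock()

	c.mu.Lock()
	defer c.mu.Unlock()
	if p, ok := c.plans[shape]; ok { // re-check: another goroutine may have planned it
		return p
	}
	p := plan()
	c.plans[shape] = p
	return p
}
```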
Cache Hit Rate
This is likely the biggest concern. Once a query plan is chosen for a particular query type, we likely need to stick to that query plan (unless there's a REALLY good reason not to). Otherwise cache hit rate will likely be terrible. We can't be too dynamic in our execution paths. Even still, this would require some experimentation to evaluate how this idea behaves at all at scale.
Indexing & Data Size
I would need to investigate further, but my hope is that much of the indexing required may already exist, since SpiceDB already offers APIs for traversing in both directions (`LookupResources` & `LookupSubjects`). If not, the data-size increase from additional indexes could be a concern.
Query Hints?
An MVP of this could involve allowing customers to specify query hints in their schema somehow (don't accept query hints at query time because of the cache hit rate problem; we don't want people flipping back and forth).
Other Crazy Ideas
Wouldn't it be crazy if we traversed from both sides at once and tried to meet in the middle? I have no idea how that might work right now, but the idea is an interesting future convo. 🙂 (out of scope here)
Other optimizations?...