Data inconsistencies in Octopoes: a proposal to retire celery and validate the model continuously #3585
Comments
If I understand it correctly, the problem is that an affirmation and a deletion can happen at the same time and the affirmation can overwrite the deletion. Deletion is probably only part of the problem, because as far as I can see this can also happen when an affirmation and an update conflict: an affirmation saves the whole object and can therefore overwrite the data of the update. This is a pretty common problem with databases and concurrency, and the usual solution is to use transactions to make sure the saved data is consistent. With XTDB we can do that using `match` in v1 or using `ASSERT` in v2. The match/assert should guard against an earlier or concurrent transaction making conflicting changes. This should prevent saving an affirmation for an already deleted object.

Other than that, I disagree that what Celery currently does can easily be done with a thread pool, because we also need to take race conditions, resilience against crashes, and scalability into account. Maybe it can be done with a thread pool, but I don't think we should treat it as something that is easy to do. Also note that a "fast thread pool that can work in parallel" does not exist in Python, if what is meant is executing Python code in parallel, because of the GIL. And it will still take a few years before there is a GIL-free Python that we can use...
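For illustration, a minimal sketch of such a `match` guard against XTDB v1's HTTP API. The endpoint and JSON payload shape are assumptions based on the xtdb-http-server documentation, and `affirm_guarded` is a hypothetical helper, not Octopoes code:

```python
# Hypothetical helper (not Octopoes code): guard an affirmation with an
# XTDB v1 `match` op, so the put only applies if the OOI document still
# equals `expected` -- e.g. the tx aborts if the OOI was deleted concurrently.
import requests

XTDB_URL = "http://localhost:3000"  # assumption: default xtdb-http-server port


def affirm_guarded(ooi_id: str, expected: dict | None, affirmed: dict) -> dict:
    tx_ops = [
        # With expected=None, match instead asserts the doc does NOT exist.
        ["match", ooi_id, expected],
        ["put", affirmed],  # the document must carry its "xt/id"
    ]
    resp = requests.post(f"{XTDB_URL}/_xtdb/submit-tx", json={"tx-ops": tx_ops})
    resp.raise_for_status()
    return resp.json()  # contains the txId; submit-tx itself is asynchronous
```

Since submit-tx is asynchronous, whether the match actually held has to be checked afterwards (XTDB v1 exposes a tx-committed check for this).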
Thanks @dekkers for your comment and concerns.
"At the same time" could be somewhat misleading; the point is rather that causality, i.e. the order of transactions, is not preserved by the mix of various mechanisms launched by Octopoes. Indeed, affirmations resave the whole OOI.
I am aware of the various "atomic" methods one can apply to prevent data being modified in parallel. I do not see, however, how this solves our problem. Note that in this case the OOI is retroactively deleted in the past (from the future, if that makes sense). It is more a problem of logic within Octopoes than of putting a simple lock on a transaction, because an object can be legitimately deleted and then reintroduced. Fundamentally, this logic has to be assessed by Octopoes.
I agree that it is a terrible idea to write anything of this sort in Python, as stated many times before. Still, Celery has been a source of frustration throughout the Octopoes project; besides my own experience, this is something I also gathered from various developers in the team. Apart from that, I do not see how we can reduce the overhead in calls and the long delays in execution, query the queue, and manage the queue's execution priority by transaction type (as alluded to above) with Celery. Thanks.
Making the following changes:
Still yields these problems:
Where the time lapse seems consistently smaller.
As @Donnype proposed, for now, a good solution for this particular problem will be to have affirmations not modify the
The behavior is even funnier than we imagined...
Remedies:
∴ in this case, ideally, we want "to be affirmed" OOIs not to be created again from another source, so the correct mode of operation is then not to recreate OOIs that don't yield new information (this does not solve expiration problems). Of course, this only holds for KATFindingTypes. For a more generic approach, we either affirm before object creation and assess whether the new information adds anything (or needs to be updated), or we could introduce something like shadow OOIs that wait for affirmation, where only the information yield of a merge is assessed afterwards; a sketch of the first rule follows below. Thoughts @underdarknl @Donnype @dekkers @noamblitz? Thanks!
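A hypothetical sketch of that first rule. The field names in `BOOKKEEPING_FIELDS` are made up for illustration; the real OOI schema will differ:

```python
# Hypothetical sketch: only (re)save a KATFindingType-like OOI when it
# actually yields new information; bookkeeping fields are ignored in the
# comparison so a pure re-observation does not count as an update.
BOOKKEEPING_FIELDS = {"last_seen", "last_affirmed"}  # assumption


def should_save(existing: dict | None, candidate: dict) -> bool:
    if existing is None:
        return True  # genuinely new object: save it

    def strip(doc: dict) -> dict:
        return {k: v for k, v in doc.items() if k not in BOOKKEEPING_FIELDS}

    # Identical payloads add no information, so skip the put entirely.
    return strip(existing) != strip(candidate)
```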
Example with 1s
To emphasize the issue with this solution:
Where XTDB reports that the OOI (at a
Given the following expectations for an OOI lifecycle:

insert t1 (declaration)

To make this work, we probably need to check (on t2.5) if there are any future affirmations that we need to 'undo', up to the point where we have a new declaration. From t4, any affirmation is valid again.

Another scenario and its expectation:

== on t3.1 does not exist.

To make this work, we probably need to assert on t2.5 that the object does in fact exist, which in this case it does not(ish?), or we should figure out how to make XTDB follow the T timeline instead of the transaction log when consolidating these transactions.
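A hypothetical sketch of the first scenario's "undo future affirmations" step. All client methods here are made up; the real Octopoes XTDB client will differ:

```python
# Hypothetical client methods throughout -- this only illustrates the logic.
def delete_with_undo(client, ooi_id: str, valid_time) -> None:
    # Look at the entity's versions *after* the deletion's valid time.
    for version in client.entity_history(ooi_id, start=valid_time):
        if version.source == "affirmation":
            # Affirmations issued after the delete must be undone...
            client.retract(ooi_id, at=version.valid_time)
        elif version.source == "declaration":
            # ...but from a new declaration (t4) onwards they are valid again.
            break
    client.delete(ooi_id, at=valid_time)
```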
XTDB2 seems to do this differently than XTDB1 (if I understand Benny's first example correctly). In XTDB2 the delete will cause the row to not exist anymore after t2.5:
Yes, this is the whole point: we need to find the delete event in the queue, because it has not been processed yet. Hence this:

nl-kat-coordination/octopoes/octopoes/core/service.py, lines 180 to 191 in 8d40832
Ok... so with #3624 merged, let us study this scenario, and let us make it a multiple-choice question. With the following script:
Does the node "ShaCheDeChungKe" have fries? The answer might look familiar:
This issue will be addressed in bulk with the transition to XTDB2, the reformulation of the data model, the introduction of nibbles, etc., and is therefore currently blocked.
Data inconsistencies in Octopoes: a proposal to retire celery and validate the model continuously.
Describe the bug
Since VisualOctopoesStudio, several bugs regarding Octopoes' data model have come to light.
A subset of these bugs (#3498, #3564, and #3577) has addressed the various mechanisms by which dangling self-proving OOI's can occur, like:
causing all kinds of bugs, like #3205.
With fixes for these bugs merged, we still sporadically obtain such a self-proving OOI on the current main:
With its history:
And the Origin's history (as there is only one transaction, we show the Origin here implicitly):

(Note that an XTDB transaction can contain multiple entities.)
In the history of the OOI there is something odd, namely that there is a 9-second lag between its `validTime` and its `txTime`. This is caused by several factors playing together:

What seems to be happening graphically is:
where the timing of the deletion event and the affirmation is such that, after the deletion is queued (for the given validTime), the OOI is affirmed (and by the affirmation implicitly recreated), and only after that is the deletion executed (for that previously mentioned validTime).
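To make the interleaving concrete, here is a toy reconstruction of that sequence (the timestamps are invented; only the ordering matters):

```python
# Invented timestamps; only the ordering matters. The affirmation writes the
# OOI at a valid time later than v, so the deletion executed *at* v cannot
# remove it: the OOI survives as a dangling, self-proving object.
events = [
    (0.0, "queue deletion", "OOI X scheduled for deletion at validTime v"),
    (4.0, "affirm", "affirmation re-puts the full OOI X document"),
    (9.0, "execute deletion", "deletion runs, but only at validTime v"),
]
for t, op, effect in events:
    print(f"t={t:>3}s  {op:<16} {effect}")
```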
Proposed resolution(s)
Retire Celery
The event manager in Octopoes uses Celery as a worker thread pool. Celery has been a source of issues within Octopoes, see for instance Slow clearance level aggregation #2171, where the upstream Celery/Billiard issue Long hangs when `os.sysconf('SC_OPEN_MAX')` is large celery/billiard#399 remains untouched. While Celery has nice features, it seems overkill for our case and a source of delay, in this case accumulating up to 9s. To mitigate this behavior we would like to have a fast thread pool that can work in parallel but does not change the order of creation and deletion events on a similar "inference-spacetimeline", as that would violate causality. In addition, we would like Octopoes to be able to query the event queue, so it can block or reject certain findings based on issued deletion events; as far as we know, Celery has no trivial way to query the queue as such. This can all be easily done with a custom thread pool implementation managed by Octopoes, retiring Celery, and thus we propose to do so.
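A minimal sketch of what such a custom pool could look like (all names are hypothetical, and this glosses over the crash-resilience and GIL caveats raised above): events are sharded by OOI key so per-OOI ordering is preserved, while the in-process queue stays inspectable.

```python
# Hypothetical sketch, not the actual Octopoes design: events for the same
# OOI are handled in submission order by one worker, different OOIs run on
# different workers, and pending events can be inspected so that e.g. an
# affirmation can be rejected when a deletion is already queued.
import queue
import threading
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Event:
    ooi_key: str                    # primary key of the OOI this event concerns
    operation: str                  # e.g. "create", "affirm", "delete"
    handler: Callable[[], None] = field(repr=False)


class EventWorkerPool:
    """Shard events by OOI key: same key -> same worker -> causal order."""

    def __init__(self, num_workers: int = 4) -> None:
        self._queues: List[queue.Queue] = [queue.Queue() for _ in range(num_workers)]
        for q in self._queues:
            threading.Thread(target=self._run, args=(q,), daemon=True).start()

    def submit(self, event: Event) -> None:
        # Hashing the OOI key keeps per-OOI ordering intact across workers.
        self._queues[hash(event.ooi_key) % len(self._queues)].put(event)

    def pending(self, ooi_key: str) -> List[Event]:
        # Unlike with Celery, the queue is in-process and can be inspected
        # (peeking via .mutex/.queue is a CPython implementation detail).
        shard = self._queues[hash(ooi_key) % len(self._queues)]
        with shard.mutex:
            return [e for e in list(shard.queue) if e.ooi_key == ooi_key]

    def _run(self, q: queue.Queue) -> None:
        while True:
            event = q.get()
            try:
                event.handler()
            finally:
                q.task_done()
```

Usage would then be along the lines of checking `pool.pending(ooi_key)` for a queued deletion before processing an affirmation.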
Validate the model continuously

Similar to a filesystem, we ideally never have any errors, but if errors occur we would like to have the tools to detect them, and possibly fix them. Currently we have neither in Octopoes. We propose to implement a thread that validates the current Octopoes state for (logical) inconsistencies at low priority; once inconsistencies are found, a user can opt to have them fixed automatically where possible, or fix/mitigate the error manually. Such a tool within Octopoes will make OpenKAT both more reliable and more transparent; additionally, it is an excellent way for an OpenKAT system administrator to file well-documented issues should such errors occur.
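A minimal sketch of such a validator thread, assuming a hypothetical `validate()` that scans the state and returns any inconsistencies found:

```python
# Hypothetical sketch of the proposed low-priority background validator;
# validate() and report() are assumptions standing in for real Octopoes code.
import threading
import time
from typing import Callable, Iterable


def start_validator(
    validate: Callable[[], Iterable[str]],  # hypothetical: scans the state
    report: Callable[[str], None],          # hypothetical: surfaces findings
    interval: float = 300.0,                # assumption: scan every 5 minutes
) -> threading.Thread:
    """Continuously validate the Octopoes state for (logical) inconsistencies."""

    def loop() -> None:
        while True:
            for inconsistency in validate():  # e.g. self-proving OOI's
                report(inconsistency)         # operator can opt to auto-fix
            time.sleep(interval)

    thread = threading.Thread(target=loop, daemon=True, name="model-validator")
    thread.start()
    return thread
```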
OpenKAT version
main