Transient indices might lead to data corruption #373

Open
pangloss opened this issue Dec 12, 2020 · 13 comments

Comments

@pangloss

Hi, I'm encountering an issue where data is disappearing or reappearing in historical versions of my db when I do subsequent transactions against the current version. It only happens with entities which have a {:db/unique :db.unique/identity} property.

It seems to be a data structure bug in datascript.

I can run a few transactions, query a specific instance of the db history, then do additional transactions and when I query the same instance from the history, the data will have changed.

I'm running Clojure 1.10.1 on Java 11 and have confirmed it with both v1.0.1 and v1.0.2.

I'm hoping you'll have an idea what may be causing this!

Although I can reliably reproduce it in the context of my application (it happens every time), I have so far been unable to reproduce it in isolation, so unfortunately I can't provide a repro right now.

Suggestions as to how to approach reproducing the bug would be welcome. Meanwhile I think a workaround will be to isolate indexed properties into independent entities that just point to the actual data...

@pangloss
Author

pangloss commented Dec 13, 2020

Unfortunately my hoped-for workaround didn't help, so these records are being corrupted even when they do not have unique indices attached to them, and so far I do not see anything that distinguishes them. There are only 1000-2000 entities of this type created in the transaction. Other record types are creating more in some cases without trouble. The total is about 5000 entities in one of the larger transactions that I'm looking at.

Any change I make to the application changes the set of specific impacted entities, but if I make no changes and re-run, the result set is identical, including errors.

Here is a "small" sample that I hope shows what's happening a bit better. I'll try to explain it:

This is the code I'm using to identify bad records. In my system, every entity should have a :type attribute, so I can pick out records missing it.

This iterates the program one step at a time, and after each step it runs the get-weird-ids fn. That function goes through each historical db in the results vector and checks for entities missing the :type attribute. (An excerpt of the schema is included at the end.)

  (defn get-weird-ids
    "For each historical db in `results`, returns [index entities-missing-:type]."
    [results]
    (->> results
         ;; Collect every entity present in each db-after.
         (map (fn [[{:keys [db-after]}]]
                (mapv #(d/entity db-after %)
                      (distinct (map #(.e %) (d/datoms db-after :eavt))))))
         (map-indexed #(vector %1 (->> %2
                                       (filter (complement :type)) ;; every entity should have :type
                                       (map d/touch))))))

  (map second
       (reductions (fn [[results _] _]
                     ;; events/step steps the system forward and produces a single large transaction.
                     (let [results (conj results (events/step conn :tx-report))]
                       [results (get-weird-ids results)]))
                   ;; Seed the reduction with the initial sample results and their check.
                   (let [results (create-sample)]
                     [results (get-weird-ids results)])
                   (range 6)))

This is the result of the above script. You can see that for the first 8 cycles there is no problem, but after that, suddenly the data in several databases is changed/corrupted.

(([0 ()] [1 ()] [2 ()] [3 ()] [4 ()])
 ([0 ()] [1 ()] [2 ()] [3 ()] [4 ()] [5 ()])
 ([0 ()] [1 ()] [2 ()] [3 ()] [4 ()] [5 ()] [6 ()])
 ([0 ()] [1 ()] [2 ()] [3 ()] [4 ()] [5 ()] [6 ()] [7 ()])
 ([0 ()] [1 ()] [2 ()] [3 ()] [4 ()] [5 ()] [6 ()] [7 ()] [8 ()])
 ([0 ()] [1 ()] [2 (#:db{:id 16270})] [3 (#:db{:id 16270})] [4 (#:db{:id 16270})] [5 (#:db{:id 16270})] [6 (#:db{:id 16270})] [7 (#:db{:id 16270})] [8 (#:db{:id 16270})] [9 ()])
 ([0 ()] [1 ()] [2 (#:db{:id 16270})] [3 (#:db{:id 16270})] [4 (#:db{:id 16270})] [5 (#:db{:id 16270})] [6 (#:db{:id 16270})] [7 (#:db{:id 16270})] [8 (#:db{:id 16270})] [9 ()] [10 ()])
 ([0 ()]
  [1 ()]
  [2 (#:db{:id 16270})]
  [3 (#:db{:id 16270})]
  [4 (#:db{:id 16270})]
  [5 (#:db{:id 16270})]
  [6 (#:db{:id 16270})]
  [7 (#:db{:id 16270})]
  [8 (#:db{:id 16270})]
  [9 ()]
  [10 ({:item/link #:db{:id 7227}, :item/new-limit 3, :item/container #:db{:id 5220}, :item/slot-num 59, :item/type :two, :db/id 15448})]
  [11 ()])

This final structure is the data relevant to each eid found above. The structure is

{eid {:v #{versions impacted}, :t {version-number [changed datoms in the tx that created that version]}}}

You can see that the entities are not modified in the transactions, and that the entities above should all have all of their properties.

(edit: I originally included the wrong data here)

{16270
 {:v #{7 4 6 3 2 5 8},
  :t
  {1
   [#datascript/Datom [16270 :item/type :two 536871621 true]
    #datascript/Datom [16270 :item/link 5133 536871621 true]
    #datascript/Datom [16270 :item/new-limit 3 536871621 true]
    #datascript/Datom [16270 :type :item 536871621 true]],
   8 [#datascript/Datom [16270 :item/container 5209 536871628 true] #datascript/Datom [16270 :item/slot-num 10 536871628 true]],
   9
   [#datascript/Datom [16270 :type :item 536871629 false]
    #datascript/Datom [16270 :item/link 5133 536871629 false]
    #datascript/Datom [16270 :item/new-limit 3 536871629 false]
    #datascript/Datom [16270 :item/container 5209 536871629 false]
    #datascript/Datom [16270 :item/slot-num 10 536871629 false]
    #datascript/Datom [16270 :item/type :two 536871629 false]]}},
 15448
 {:v #{10},
  :t
  {1
   [#datascript/Datom [15448 :item/type :two 536871621 true]
    #datascript/Datom [15448 :item/link 7227 536871621 true]
    #datascript/Datom [15448 :item/new-limit 3 536871621 true]
    #datascript/Datom [15448 :type :item 536871621 true]],
   3 [#datascript/Datom [15448 :item/container 5220 536871623 true] #datascript/Datom [15448 :item/slot-num 59 536871623 true]],
   11
   [#datascript/Datom [15448 :type :item 536871631 false]
    #datascript/Datom [15448 :item/link 7227 536871631 false]
    #datascript/Datom [15448 :item/new-limit 3 536871631 false]
    #datascript/Datom [15448 :item/container 5220 536871631 false]
    #datascript/Datom [15448 :item/slot-num 59 536871631 false]
    #datascript/Datom [15448 :item/type :two 536871631 false]]}}}

In the above example the affected records are deleted in the 11th version, but that is not always the trigger. In the example below, from a different run, none of the records are deleted...

{14978
 {:v #{7},
  :t
  {1
   [#datascript/Datom [14978 :item/type :two 536871621 true]
    #datascript/Datom [14978 :item/link 8666 536871621 true]
    #datascript/Datom [14978 :item/new-limit 3 536871621 true]
    #datascript/Datom [14978 :type :item 536871621 true]]}},
 16068
 {:v #{6 5},
  :t
  {1
   [#datascript/Datom [16068 :item/type :two 536871621 true]
    #datascript/Datom [16068 :item/link 8187 536871621 true]
    #datascript/Datom [16068 :item/new-limit 3 536871621 true]
    #datascript/Datom [16068 :type :item 536871621 true]],
   2 [#datascript/Datom [16068 :item/container 5220 536871622 true] #datascript/Datom [16068 :item/slot-num 5 536871622 true]]}},
 16396
 {:v #{7 6 5},
  :t
  {1
   [#datascript/Datom [16396 :item/type :two 536871621 true]
    #datascript/Datom [16396 :item/link 9201 536871621 true]
    #datascript/Datom [16396 :item/new-limit 3 536871621 true]
    #datascript/Datom [16396 :type :item 536871621 true]]}},
 16388
 {:v #{7 6 5},
  :t
  {1
   [#datascript/Datom [16388 :item/type :two 536871621 true]
    #datascript/Datom [16388 :item/link 7892 536871621 true]
    #datascript/Datom [16388 :item/new-limit 3 536871621 true]
    #datascript/Datom [16388 :type :item 536871621 true]]}},
 16288
 {:v #{7 6 5},
  :t
  {1
   [#datascript/Datom [16288 :item/type :two 536871621 true]
    #datascript/Datom [16288 :item/link 6751 536871621 true]
    #datascript/Datom [16288 :item/new-limit 3 536871621 true]
    #datascript/Datom [16288 :type :item 536871621 true]],
   3 [#datascript/Datom [16288 :item/container 5249 536871623 true] #datascript/Datom [16288 :item/slot-num 108 536871623 true]]}},
 16120
 {:v #{5},
  :t
  {1
   [#datascript/Datom [16120 :item/type :two 536871621 true]
    #datascript/Datom [16120 :item/link 9296 536871621 true]
    #datascript/Datom [16120 :item/new-limit 3 536871621 true]
    #datascript/Datom [16120 :type :item 536871621 true]]}},
 14404
 {:v #{7},
  :t
  {1
   [#datascript/Datom [14404 :item/type :two 536871621 true]
    #datascript/Datom [14404 :item/link 7906 536871621 true]
    #datascript/Datom [14404 :item/new-limit 3 536871621 true]
    #datascript/Datom [14404 :type :item 536871621 true]],
   3 [#datascript/Datom [14404 :item/container 5351 536871623 true] #datascript/Datom [14404 :item/slot-num 108 536871623 true]]}},
 14974
 {:v #{7},
  :t
  {1
   [#datascript/Datom [14974 :item/type :two 536871621 true]
    #datascript/Datom [14974 :item/link 4831 536871621 true]
    #datascript/Datom [14974 :item/new-limit 3 536871621 true]
    #datascript/Datom [14974 :type :item 536871621 true]],
   2 [#datascript/Datom [14974 :item/container 5218 536871622 true] #datascript/Datom [14974 :item/slot-num 15 536871622 true]]}},
 16106
 {:v #{6 5},
  :t
  {1
   [#datascript/Datom [16106 :item/type :two 536871621 true]
    #datascript/Datom [16106 :item/link 6952 536871621 true]
    #datascript/Datom [16106 :item/new-limit 3 536871621 true]
    #datascript/Datom [16106 :type :item 536871621 true]],
   2 [#datascript/Datom [16106 :item/container 5226 536871622 true] #datascript/Datom [16106 :item/slot-num 8 536871622 true]]}},
 16270
 {:v #{4 3 2},
  :t
  {1
   [#datascript/Datom [16270 :item/type :two 536871621 true]
    #datascript/Datom [16270 :item/link 5133 536871621 true]
    #datascript/Datom [16270 :item/new-limit 3 536871621 true]
    #datascript/Datom [16270 :type :item 536871621 true]]}},
 16394
 {:v #{7 6 5},
  :t
  {1
   [#datascript/Datom [16394 :item/type :two 536871621 true]
    #datascript/Datom [16394 :item/link 8001 536871621 true]
    #datascript/Datom [16394 :item/new-limit 3 536871621 true]
    #datascript/Datom [16394 :type :item 536871621 true]]}}}

  ;; These are the only relevant pieces of the schema for the affected records. The :type property is not in the schema.
  (d/create-conn {:item/container {:db/type :db.type/ref
                                   :db/cardinality :db.cardinality/one}
                  :item/link {:db/type :db.type/ref
                              :db/cardinality :db.cardinality/one}})

@tonsky
Owner

tonsky commented Dec 13, 2020

Can you try with either this jar or this patch (both are inside the archive), whichever is easier?

Archive.zip

@pangloss
Author

I just tested it and the patch does fix the problem.

tonsky added a commit that referenced this issue Dec 15, 2020
@tonsky
Owner

tonsky commented Dec 15, 2020

Ok, thank you. For now I pushed 1.0.3, which disables transient indices. Later I’ll take a more detailed look into it. Please leave this issue open.

tonsky changed the title from "Data corruption in DB values from previous transactions" to "Transient indices might lead to data corruption" on Dec 15, 2020
@pangloss
Author

Thank you very much!

@darkleaf
Contributor

@pangloss can you provide a reproducible example?
For example, you can serialize your data by datascript/transit and attach it as an archive.
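
A minimal sketch of that kind of export, assuming the separate datascript-transit library is on the classpath. The helper names `export-db` and `import-db` and the file name are hypothetical and only for illustration:

```clojure
(require '[datascript.core :as d]
         '[datascript.transit :as dt])

;; Hypothetical helpers; names and file path are illustrative only.
(defn export-db
  "Writes a DataScript db value to `path` as a transit-encoded string."
  [db path]
  (spit path (dt/write-transit-str db)))

(defn import-db
  "Reads back a db value previously written by `export-db`."
  [path]
  (dt/read-transit-str (slurp path)))

(comment
  ;; `conn` stands for the application's own connection.
  (export-db @conn "repro-db.transit.json")
  (import-db "repro-db.transit.json"))
```

Exporting the db-after of each saved tx-report this way would let the historical values be inspected outside the application, provided the data itself can be shared.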

@pangloss
Author

Hi @darkleaf,

My previous quick attempt to make a very simple standalone repro to share didn't work. There's no way I could share any part of the actual database that I encountered the problems in, unfortunately, but I'll try to describe the conditions it appears in. It'll be a while before I have a chance to work on this since I've got a pretty big backlog at the moment, unfortunately.

In the application schema I have a few properties marked {:db/unique :db.unique/identity}. All but one of them are statically created in the first couple of transactions and are permanent.

The other, :future-timestamp, is an index to future events. On most transactions one or more future-timestamp entries are created, but sometimes none are created. Regardless, exactly one future-timestamp record is deleted on every transaction.

The future-timestamp also has a number of entities related to it. Most of those are deleted in the same transaction as the future-timestamp itself, except the ones that were created at the beginning.

In addition to the future-timestamps and related entities, every transaction has a handful of regular entities changed, added or removed. Most transactions have 50-100 datoms in the resulting tx-data, but sometimes it can be 1000-2000. The data is highly connected, so every change includes changes to attributes marked {:db/type :db.type/ref :db/cardinality :db.cardinality/many} or ...one}.

When I run it, I save all of the resulting transaction reports and eventually run analysis over them in sequential order. Most runs have 100-500 tx-results.

Hopefully that helps...

Cheers,
Darrick

@darkleaf
Contributor

Thanks, but it's still hard for me to understand.

I have a guess. You have an instance of the DB, you change it in several ways several times, and then you run a query against the initial instance and find that its state has changed. That is wrong, because a DB is an immutable value. Am I right?

@pangloss
Author

Yes, data changes and becomes invalid in past versions of the database even though the latest version always stays correct.
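
The invariant being violated can be checked with a sketch along these lines. The `snapshot` and `changed-history` helpers are hypothetical names, and the tiny transactions only stand in for the application's real ones:

```clojure
(require '[datascript.core :as d])

(defn snapshot
  "Realizes every datom of `db` as a plain [e a v] vector."
  [db]
  (mapv (fn [datom] [(:e datom) (:a datom) (:v datom)])
        (d/datoms db :eavt)))

(defn changed-history
  "Takes [db snapshot] pairs captured earlier and returns the indices of
  db values whose current contents no longer match their snapshot."
  [pairs]
  (keep-indexed (fn [i [db snap]]
                  (when (not= snap (snapshot db)) i))
                pairs))

(def conn (d/create-conn {}))
(d/transact! conn [{:db/id -1 :type :item}])

;; Remember the db value together with a copy of its contents.
(def history [[@conn (snapshot @conn)]])

;; Later transactions must not affect the remembered db value.
(d/transact! conn [{:db/id -1 :type :item}])

(changed-history history) ;; should be empty if db values are truly immutable
```

Capturing a snapshot right after every transaction and re-running `changed-history` after each later one would show which transaction first alters an older db value.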

@tonsky
Owner

tonsky commented Jan 21, 2021

@darkleaf the issue was probably in the combination of transient indices and restarting the transaction after a tempid is resolved (db/retry-with-tempid).

Note: I disabled transient indices in the latest version.
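
As a starting point for a standalone repro, here is a sketch of a transaction that is assumed to go through that restart path: a tempid is first used with a non-unique attribute and only afterwards with a unique-identity attribute that matches an existing entity. The schema and attribute names are hypothetical, and this is not a confirmed reproduction of the corruption:

```clojure
(require '[datascript.core :as d])

;; Illustrative schema: :name is a hypothetical unique-identity attribute.
(def conn (d/create-conn {:name {:db/unique :db.unique/identity}}))

(d/transact! conn [{:db/id -1 :name "a" :age 1}])

;; Keep the old db value and a realized copy of its contents.
(def before @conn)
(def before-datoms (mapv (juxt :e :a :v) (d/datoms before :eavt)))

;; The tempid -1 is used with :age first and only then with the unique
;; :name, so it has to be re-resolved to the already existing entity.
(d/transact! conn [[:db/add -1 :age 2]
                   [:db/add -1 :name "a"]])

;; The old db value should still contain exactly what it contained before.
(= before-datoms (mapv (juxt :e :a :v) (d/datoms before :eavt)))
```

Scaling this up to thousands of datoms per transaction, as in the report above, may be needed before any difference shows.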

@pangloss
Author

Possibly related to @tonsky's comment, I do have some tx functions that are explicitly scheduled at the end of the transaction so that they can query the partially committed db to resolve possible conflicts.

@filipesilva
Contributor

@tonsky would this issue also affect the ClojureScript Datascript implementation? Over at RoamResearch we were looking into it and didn't seem to find many of these blocks that had lost attributes.

We also dug a bit into asTransient and it looks like it's not doing anything in CLJS impl, but is doing something in the CLJ impl. Didn't want to assume though, so thought we'd ask.

@tonsky
Owner

tonsky commented Feb 8, 2021

@filipesilva They are not implemented in CLJS, so no need to worry. The JVM version is fine now too: I turned them off and will fix them later.
