Avoiding lossy type conversions during import #132

pavish · 2021-05-18T16:04:15Z

pavish
May 18, 2021
Maintainer

Branching off from our discussions on Matrix.

Came across a particular scenario today, which is relevant to our discussions on type inference.

There is a column called upc, which holds barcode information. This is a string column.
It has several values that start with 0, such as 012839291012 and 012875939012.

When we infer types on this column, we would most probably detect it as numeric, which changes the values to 12839291012 and 12875939012 (the leading 0 is removed).
Now, even if the user changes the type back to string, the leading 0 is lost, creating invalid data.
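
To make the failure concrete, here is a minimal Python sketch of the round trip. It only mimics the string to numeric to string conversion with plain `int` and `str`; it is not the actual inference code.

```python
# Round-tripping barcode strings through a numeric type drops the leading zero.
upc_values = ["012839291012", "012875939012"]

# What an inferred numeric column would hold.
as_numbers = [int(v) for v in upc_values]

# What the user gets after changing the type back to string.
back_to_text = [str(n) for n in as_numbers]

for original, restored in zip(upc_values, back_to_text):
    status = "unchanged" if original == restored else "changed"
    print(original, "->", restored, f"({status})")
```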

Are we handling such scenarios currently? If not, this is something we need to take into account.

We need not worry about it if it's a user action (undo would handle it). But during import, we need to ensure there is no lossy conversion.

kgodey · 2021-05-18T22:05:47Z

kgodey
May 18, 2021
Maintainer

@mathemancer I think the easiest solution here is to not suggest a number type if there are values that start with 0 in a column, just to be conservative. What do you think?
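
A rough sketch of that conservative rule, in Python. The helper names `has_leading_zero` and `suggest_numeric` and the regular expression are assumptions made for illustration, not existing project code, and the rule covers only the leading-zero case.

```python
import re

# Assumed notion of "looks numeric": optional sign, digits, optional decimals.
_NUMERIC = re.compile(r"^-?\d+(\.\d+)?$")


def has_leading_zero(value: str) -> bool:
    """True for values like '007' or '-012', but not for '0' or '0.5'."""
    digits = value.lstrip("-")
    return len(digits) > 1 and digits[0] == "0" and digits[1].isdigit()


def suggest_numeric(values) -> bool:
    """Hypothetical helper: suggest a numeric type only when every non-empty
    value looks numeric and none of them starts with a redundant zero."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return False
    return all(
        _NUMERIC.match(v) is not None and not has_leading_zero(v)
        for v in non_empty
    )
```

With this rule, `suggest_numeric(["012839291012", "12875939012"])` is false, so the upc column would stay a string column.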

0 replies

mathemancer · 2021-05-19T08:54:01Z

mathemancer
May 19, 2021
Maintainer

I'm going to give a quick definition to make things (hopefully) clearer and more precise. By "lossy" type conversion we mean any type conversion where the map f that takes a value from one type to a new value of a different type is not injective. This means we can lose information, since more than one value from the original type can be mapped to the same value of the target type. An example is both t and true being mapped to the boolean value TRUE during type conversion from string to boolean, or both of '00123' and '123' being mapped to 123 during type conversion from string to numeric.
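
As a small illustration of that definition, the Python sketch below groups the distinct values of a column by their converted value and reports every group where more than one input lands on the same output. `collisions` and `to_bool` are hypothetical names, and `to_bool` is only a stand-in for whatever string to boolean rule is actually used.

```python
from collections import defaultdict


def collisions(values, convert):
    """Return {converted value: [inputs]} for outputs hit by more than one
    distinct input, i.e. the places where `convert` is not injective."""
    groups = defaultdict(set)
    for value in values:
        groups[convert(value)].add(value)
    return {out: sorted(ins) for out, ins in groups.items() if len(ins) > 1}


def to_bool(text: str) -> bool:
    # Assumed conversion rule for illustration only.
    return text.lower() in {"t", "true"}


numeric_collisions = collisions(["00123", "123", "7"], int)
boolean_collisions = collisions(["t", "true", "f"], to_bool)
```

An empty result means the conversion is injective on the values actually present, which is weaker than being injective on the whole type.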

The problem with solving these piecemeal is that we'll keep adding these pieces and the overall system will be difficult to maintain. I'd rather have a type conversion system that guarantees that we can't lose data on import, and that the user can freely experiment with changing types without worrying that they'll somehow lose data during the conversion process. Another lossy type conversion is string to time interval. We'll map multiple strings to the same interval: both of the strings '14 days' and '2 weeks' map to the same time interval. It's possible the user is fine with this, but it may be that they had those strings written differently for a reason, and we can't recover the original strings from the resulting interval. We've also discussed lossy conversions from string to TIMESTAMP types.
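
The interval case can be shown the same way. This is a toy parser that understands only days and weeks; `parse_interval` and its unit table are assumptions for the example and do not reflect how the database parses intervals.

```python
from datetime import timedelta

# Assumed unit table: number of days per unit.
_DAYS_PER_UNIT = {"day": 1, "days": 1, "week": 7, "weeks": 7}


def parse_interval(text: str) -> timedelta:
    """Parse strings of the form '<count> <unit>' into a timedelta."""
    count, unit = text.split()
    return timedelta(days=int(count) * _DAYS_PER_UNIT[unit])


a = parse_interval("14 days")
b = parse_interval("2 weeks")
# a == b, so nothing in the interval says which string it came from.
```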

Finally, I think the proper solution to this will be usable both at import and for undo functionality. I.e., the same function can let a user undo a type conversion they chose, and also recover the lost data if we convert types during import. The more I think about this, the more I think that figuring out what we want to do about it is a prerequisite to safe type inference.

Readily apparent solutions:

1. Store the original import data from a CSV as strings in a table that's read-only. 'undo' could be implemented by storing any type conversions in a log, and re-running them up to the previous step.
   - Pros:
     - Conceptually simple implementation
     - Adds safety for all conversions (not just type). The user can always start fresh.
     - Very fast "undo" for the initial import type conversions
   - Cons:
     - Requires roughly double the storage.
     - Undo for, say, the 74th type conversion will take some time, since the earlier conversions have to be re-run.
2. Keep the original CSV on disk, and re-import if needed to undo any conversions, using a similar log to (1).
   - Pros:
     - Also conceptually simple
     - Disk storage may be cheaper than DB storage in some situations
   - Cons:
     - Basically the same as (1) with extra steps
     - We'll need two import versions: the initial import where we don't know what types to avoid, and a future import with certain things specified.
     - Slow after a number of steps for the same reason as (1).
3. Keep a log of each manipulation of each value in a type conversion. So, the log could be something like
   ```
                              text_to_numeric
   +-----------------------------------------------------------------------+
   | conversion_id | schema  | table   | item_pk | orig_text | new_numeric |
   +---------------+---------+---------+---------+-----------+-------------+
   |       1       | myapp   | numbers |    3    |   '0012'  |     12      |
   |       2       | myapp   | numbers |    7    |   '01234' |   1234      |
   +---------------+---------+---------+---------+-----------+-------------+
   ```
   - Pros:
     - Speed when we want to undo, since we know each value's trail
     - Similar to the audit table concept we've discussed previously, good for granular tracking of all data conversions.
   - Cons:
     - Conceptually difficult
     - Massive storage needs: any ALTER COLUMN TYPE statement would generate the same number of rows in the log as in the data table.
     - We need a table like the above for each kind of type conversion. That is, we need one table for text to numeric and another for text to interval. Alternatively, the table could have many columns, with null values in a row for the columns of types not involved in a given conversion. This is because we don't want to have to convert the value in question to the type of some column in the log table (that would defeat the purpose).
4. Keep a log of each manipulation of each value in a type conversion, but only if the conversion is lossy for that value. Same-ish tables as for (3), but fewer of them (and fewer entries).
   - Pros:
     - Speed when we want to undo, since we know each value's trail
     - Similar to the audit table concept we've discussed previously, good for granular tracking of all data conversions.
     - Smaller data usage than (3)
   - Cons:
     - Conceptually difficult
     - Implementation will be very complicated, since we'll need to make sure we handle any possible lossy conversion, and only store logs when data is actually being lost by the conversion.
     - Still large storage needs in many cases.
     - We still need most of the tables that we do for (3), and we need to know when we need them.
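
To give a feel for option (4), here is a minimal in-memory Python sketch. It decides that a value is lossy for a conversion when rendering the converted value back to text does not reproduce the original string. The function names and the dict-based "tables" are assumptions for illustration; a real version would live in the database.

```python
def convert_with_lossy_log(rows, convert, render):
    """rows maps primary key -> original text.

    Returns (converted, log), where log keeps the original text only for
    rows whose value does not survive the round trip text -> value -> text.
    """
    converted, log = {}, {}
    for pk, text in rows.items():
        value = convert(text)
        converted[pk] = value
        if render(value) != text:
            log[pk] = text  # information would be lost, so remember it
    return converted, log


def undo(converted, log, render):
    """Rebuild the text column: logged originals win, others are re-rendered."""
    return {pk: log.get(pk, render(value)) for pk, value in converted.items()}


rows = {3: "0012", 7: "01234", 9: "55"}
converted, log = convert_with_lossy_log(rows, int, str)
restored = undo(converted, log, str)
```

Only rows 3 and 7 end up in the log here, which is the storage saving over option (3). The hard part noted in the cons is unchanged: a suitable `render` has to exist for every conversion.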

Of these options, the least bad one (to my eye) is (1). I'm not really a big fan of it, though.
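
For comparison, option (1) could be sketched roughly as below: the raw strings are kept untouched, conversions go into a log, and the current values are always derived by replaying the log. `ImportedColumn` is a hypothetical class used only to show the idea in memory.

```python
class ImportedColumn:
    """Keeps the imported strings read-only and replays logged conversions."""

    def __init__(self, raw_values):
        self._raw = tuple(raw_values)  # original strings, never modified
        self._log = []                 # conversion functions, in order applied

    def apply(self, convert):
        """Record a conversion; nothing is overwritten."""
        self._log.append(convert)

    def undo(self):
        """Drop the most recent conversion, if any."""
        if self._log:
            self._log.pop()

    def values(self):
        """Replay every logged conversion starting from the raw strings."""
        current = list(self._raw)
        for convert in self._log:
            current = [convert(v) for v in current]
        return current


col = ImportedColumn(["0012", "01234"])
col.apply(int)
after_convert = col.values()
col.undo()
after_undo = col.values()
```

Undoing the import-time conversion just empties the log, which is why that case is fast. The replay in `values` grows with the length of the log, which is the cost mentioned for later conversions.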

0 replies

mathemancer · 2021-05-19T08:55:06Z

mathemancer
May 19, 2021
Maintainer

As far as avoiding lossy conversions, I can't think of a way to do that which isn't piecemeal, and which doesn't require noticing when any given conversion has the potential to lose information.

0 replies

mathemancer · 2021-05-19T09:05:56Z

mathemancer
May 19, 2021
Maintainer

Option 5:
Whenever an ALTER COLUMN TYPE command is issued, instead:

1. Create a new column of the desired type.
2. Copy all data to the new column using appropriate conversions.
3. Hide the old column from view somehow (though it would still be visible in actual DB queries).

This would mean more data storage than (1) over time, but be even faster than (3) or (4). Very conceptually simple, quick undo. The main irritations would be around reflecting the DB, since we'd need to somehow mark the columns as "do not show"
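
A minimal sketch of option 5, using Python's built-in `sqlite3` module purely as a stand-in so the example runs anywhere; the real target database and its cast rules will differ. The column name `code_numeric` and the `hidden_columns` set are assumptions, since how to mark a column as "do not show" is exactly the open question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE numbers (id INTEGER PRIMARY KEY, code TEXT)")
conn.executemany(
    "INSERT INTO numbers (id, code) VALUES (?, ?)",
    [(3, "0012"), (7, "01234")],
)

# Step 1: create a new column of the desired type instead of altering the old one.
conn.execute("ALTER TABLE numbers ADD COLUMN code_numeric INTEGER")

# Step 2: copy all data across using the conversion.
conn.execute("UPDATE numbers SET code_numeric = CAST(code AS INTEGER)")

# Step 3: hide the old column at the application level (assumed mechanism).
hidden_columns = {("numbers", "code")}

visible = conn.execute(
    "SELECT id, code_numeric FROM numbers ORDER BY id"
).fetchall()

# "Undo" is just reading the untouched original column again.
originals = conn.execute("SELECT id, code FROM numbers ORDER BY id").fetchall()
```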

0 replies

mathemancer · 2021-05-19T09:26:10Z

mathemancer
May 19, 2021
Maintainer

All of these ideas so far fail if data is added to the data table with the new type. Given that there isn't a proper reverse map for a given type conversion (e.g., string to numeric), what should be done if the user wants to change back to string? Should zeroes be prepended?

1 reply

mathemancer May 19, 2021
Maintainer

I suppose that isn't too much of a problem for these methods if they're employed only for import, and before any other data is entered.
