Replies: 26 comments 150 replies
-
Lua typesThe Lua type system is a major part of Tarantool’s data type system. It’s worth noting that it’s important to remember that Lua is a dynamically typed language with its own type coercion rules and data values limitations. Furthermore LuaJIT adds its own ffi integer types which have extended value ranges. Both these ranges should be taken into account during data processing in box and SQL. As it’s written in the Lua 5.1 Reference Manual - https://www.lua.org/manual/5.1/manual.html#2.2:
LuaJIT extends the list of built-in datatypes with the cdata type, with which one can implement an arbitrary custom data type via FFI (e.g. unsigned long 64-bit integer Here is the list of basic data types we have in the Lua/LuaJIT virtual machines. Highlighted are additional types introduced by LuaJIT for its own purposes.
NB! There is no values of Remember, in a longer run we want to have the least destructive conversion rules while transferring data from Lua world to box storage world to SQL world. LuaJIT typesLuaJIT internally uses a little bit different set of types than those of Lua. That was done mostly for performance reasons, as a dirty hack to squeeze everything to local variables of 8 byte size.
From the user perspective these types are not extending the list of supported types presented in the Lua types table above, and not modifying (extending or contracting) our mental type schema anyway. So, we put it here mostly for illustrative reasons. Lua vs LuaJIT typesThere is strict correspondence between internal LuaJIT types and original Lua types
But from user perspective those differences are invisible and inaccessible, thus for the purposes of current specification, while Issues with
|
Beta Was this translation helpful? Give feedback.
-
Box format field typesHistorically, in Tarantool, as a NoSQL database, programmer could work with space data even without schema/format provided. In this case, data to be serialized was stored with MessagePack format, which acts on received data values, and saves them in the most compact form possible. Space schemas, known as formats, have been introduced to Tarantool after version 1.6(?) and have been used for validation of data being inserted to the data store. SQL could not work unless a format is provided, thus we assume such a precondition from now on – each space should have format assigned to it. There is a space:format() method which might be used to define a field constraints which we enforce on data when storing to the database. https://www.tarantool.io/en/doc/latest/reference/reference_lua/box_space/format/ At the moment of writing, we have this list of supported field types:
An additional “is_nullable” attribute in the format or index definition allows saving nil values in data tuples. If we compare the list of Lua types to the list of format field types we could easily recognize differences and omissions on both sides. Number in Lua could easily end up as These incompatibilities are not that much of a problem because we rely on programmers themselves to decide when it’s necessary select which Lua values to save to which field. But there are some subtle current limitations which we want to fix in the longer run, but more on them after the messagepack introduction. Messagepack types aka mp-typesMp-types used in the Tarantool core are a set of standard MessagePack types plus a set of extended types introduced for Tarantool purposes via the MP_EXT extension mechanism.
Only two of the four extended types are user visible, and have any relation whatsoever to the scope of this this paper (MP_DECIMAL, MP_UUID), so we leave out the other ones as out-of-scope. Messagepack types supported by format field typesAcceptance table extracted from code (click to expand)Acceptance table for messagepack data saved into filed type columns as defined in conversion tables named
i.e. for example, we see that the FIELD_TYPE_NUMBER column could accept as valid values data which is represented as MP_UINT (unsigned positive values or 0), MP_INT (negative values), MP_FLOAT/MP_DOUBLE (floating values) and MP_DECIMAL (extended MP type introduced in Tarantool for decimal values support). |
Beta Was this translation helpful? Give feedback.
-
MessagePack serialization issuesSerialization of floating valuesThe table above looks reasonable at first sight, but due to the way we serialize tables to MessagePack without consideration of space format specifications, and only using incoming Lua values, we might get to the case when Lua box behaves differently from SQL:
Here is the problem, as it seems today:
*Solution A* (nothing to see here - do not expand)As a temporary work-around for our current implementation we may extend the list of accepted fields for
Solution BWe believe, that space, which does have schema attached, should not proceed with a generic MessagePack serialization routine using the value-based approach, but rather ought to use a type-based algorithm according to the space’s format. This format type would be a hint for selection of the MP_* type encoding schema necessary for the field. Serialization of
|
Beta Was this translation helpful? Give feedback.
-
SQL typesThe following type names are specified by ISO SQL: bigint, binary, binary varying, boolean, char, character varying, character, varchar, date, double precision, integer, interval, numeric, decimal, real, smallint, time (with or without time zone), timestamp (with or without time zone), xml. FYI: Despite the fact that SQL:2016 has added JSON support, it was not done in a way that was similar to XML support, and there is no currently added json data type for that. Speaking of SQL types in Tarantool we may distinguish their original, syntactic form in the SQL dialect used in Tarantool, and their internal representation form also known as “mem-types”. In the ideal world mem-types would be equivalent to a Tarantool engine’s field types, but that’s not the case at the moment, so some conversions should be applied. In the longer term we have to take several steps to commonize those types, reducing any possible conversion overhead. Let us compare space field types, SQL types, and MEM-types: [To the left we see box type systems, in the middle are standard ISO SQL types, and to the right we see SQL MEM-types] If we compare the list of supported box types with standard SQL types (with their underlying mem-types) then we could see multiple omissions on all sides involved. For example, now there is no DECIMAL type in Tarantool SQL, but there is decimal support in box. The same problem exists for UUID, MAP, or ARRAY types. (But less so for map and array, because they are supported partially, to some degree). (SQLite) Mem-typesAt the moment of writing Tarantool SQL engine uses a different type system inside of its VDBE routines, those names could be recognized by their MEM_ prefix. This type system resembles very much the type system used by box, but is rather encoded differently.
MEM_St MEM_Blob could have various modifiers, allocator related and garbage collection related (e.g. MEM_Dyn, MEM_Static, or MEM_Ephem), but we will leave them out-of-scope for the current document
At the moment there is an effort #5818 to centralize mem-types code, and extract all getters/setters to a common place. The intention is – have the ability eventually to switch to the same type system as box, avoiding any extra conversion overhead. SQL NULL and Lua nil(s)Fortunately, SQL NULL is unambiguously equal to nil and box.NULL if used for binding arguments:
One could say that there is another NULL discrepancy in box vs SQL, e.g.
But this works as expected – there is no way in SQL to compare (using ‘=’) NULL against anything, including NULL itself. If you really need to find null values in SQL then you have to use “IS [NOT] NULL” predicate, not an equality operator ‘=’. '00000000-0000-0000-0000-000000000000' and SQL NULLThere is yet another discrepancy which we would rather not fix between Lua and SQL worlds -- treatment as nil/NULL of the special UUID value '00000000-0000-0000-0000-000000000000'. In Lua we would consider it as nil (for some odd reason):
While in SQL the similar expression should be false for the `CAST(‘00000000-0000-0000-0000-000000000000’ AS UUID) IS NULL` predicate. We see not much point in extending the IS NULL predicate with predefined “nil UUIDs” semantics. SQL NULL is SQL NULL. If we need to check for “nil UUIDs” we could introduce a new built-in like ‘uuid_is_null(x)`. SQL BLOBs issues‘\0’ problemAt the moment, both blobs and strings might add a terminating \0 byte to their allocated chunks inside of the Tarantool SQL engine. We believe, that this approach for blob is unnecessary and counter intuitive. We should make sure that there is no post-processing for raw blobs, and they are passed as-is, without any extra conversion or Blobs in LuaInserting blobs via Lua is not a very easy task. There are some known dirty voodoo tricks which you need to proceed to insert binary values into tuple fields from the Lua side – see #1629 (comment) , but this recipe is a mouthful and troublesome. For consistency’s sake, the easy task in SQL should be similarly easy as in box. We need to introduce an api or some extra convention to pass unmodified binary data (cdata?) to blob fields. The current hypothesis is – this problem might be fixed and code greatly simplified once we start to use space format for MessagePack serialization (see MessagePack serialization issues) |
Beta Was this translation helpful? Give feedback.
-
Missing SQL typesThere are several data types, which are already available in box, but have still not been exported via SQL, e.g.:
We will cover inconsistency here using different approaches:
At the end of current efforts, we are supposed to provide a consistent picture for Lua and SQL, having them as much compatible as possible. |
Beta Was this translation helpful? Give feedback.
-
DECIMAL SQL typeDECIMAL is not yet supported in Tarantool SQL, while it is already supported in box. To make this consistent we would like to follow this plan:
DECIMAL is an exact numeric type according to SQL standard definitions. Depending on the SQL vendor implementation, integer literals may actually have subtypes of DECIMAL type. Comparison with inexact numeric type expressions (float or double) should implicitly convert to wider type, e.g. DECIMAL. Example: MS SQL-Server DECIMAL rules (click to expand)For example, in MS SQL-Server DECIMAL and NUMERIC are equivalent:
Example decimal constants in MS SQL-Server (click to expand)
Conversions between types
DECIMALs in Tarantool SQL – plan of actions
|
Beta Was this translation helpful? Give feedback.
-
Date, time and timestampsThere is as yet no DATE/TIME/TIMESTAMP support in Tarantool SQL, nor is there built-in date/time support in box. Past experience showed that we need date/time/timestamp support for (at least) running some industry wide benchmarks like TPC-H. To run TPC-H we have developed a set of ugly hacks, which implemented SQLite functions, thus we could run SQLite version of TPC-H scripts unmodified, and investigate performance issues (if any). The problem is – those SQLite built-ins are not SQL standard compliant at all. And they do not provide rich timezone support which might be used if timezone is included in literals or functions. Current practice shows (see tables below) that using non-standard built-ins for date/time support are not a big deal, and everybody does something like that, but World timezones support might be added though using the ICU4C library, which is already used partially in Tarantool, for the locations support. The Lua part of date/time support is already available via the NB! It is worth noting that at the moment of writing the set of older SQLite-derived date/time patches went nowhere, but they are rebased daily for running TPC-H against the current master for bench.tarantool.org purposes. Landing of those patches in the master has been blocked because of possible storage schema changes/upgrades which we would need to perform eventually, if we would start today from using date/time with strings as an underlying type (which is used in the patch). The older patch, essentially…
Thus, modified TPC-H queries would look like: -- using 1433771997 as a seed to the RNG
select
c_custkey, c_name, sum(l_extendedprice * (1 - l_discount)) as revenue,
c_acctbal, n_name, c_address, c_phone, c_comment
from
customer, orders, lineitem, nation
where
c_custkey = o_custkey
and l_orderkey = o_orderkey
and o_orderdate >= date('1994-01-01')
and o_orderdate < date('1994-01-01', '+ 3 months')
and l_returnflag = 'R'
and c_nationkey = n_nationkey
group by
c_custkey, c_name, c_acctbal, c_phone, n_name, c_address, c_comment
order by
revenue desc
limit 20; Instead of the original TPC-H syntax: -- using 1433771997 as a seed to the RNG
select
c_custkey, c_name, sum(l_extendedprice * (1 - l_discount)) as revenue,
c_acctbal, n_name, c_address, c_phone, c_comment
from
customer, orders, lineitem, nation
where
c_custkey = o_custkey
and l_orderkey = o_orderkey
and o_orderdate >= date '1994-01-01'
and o_orderdate < date '1994-01-01' + interval '3' month
and l_returnflag = 'R'
and c_nationkey = n_nationkey
group by
c_custkey, c_name, c_acctbal, c_phone, n_name, c_address, c_comment
order by
revenue desc
limit 20; See the subtle syntax difference, where instead of supporting standard data and interval expressions we have seen functions in the SQLite-based version. If we looked around to see what standard date/time/timestamp syntaxes are supported by various vendors (PostgreSQL, Oracle, MySQL, SQLite) then we would see this table. It’s easy to recognize that SQLite does not support standard SQL time-related types and corresponding syntax here, and that PostgreSQL is most advanced here – it’s the sole vendor which supports the OVERLAPS expression. But, if we compare their functions, then the pictures becomes very messy and complicated: We believe, that from practical perspective we need to be as close to PostgreSQL as possible. |
Beta Was this translation helpful? Give feedback.
-
PostgreSQL example of storage and ranges of datetime types (click to expand)As described in PostgreSQL 13 | Date/Time Types | 8.5. Date/Time Types datetime types have the following storage sizes and supported ranges:
SQL-Server example of datetime storage and ranges supported (click to expand)As described in Microsoft SQL-Server docs | Transact-SQL (T-SQL) | Reference Functions | Date &
Future Tarantool support for datetime (click to expand)So, we suggest implementing the following functionality for Tarantool NB! The list of supported functions is debatable. Tarantool DATETIME – plan of actions
|
Beta Was this translation helpful? Give feedback.
-
UUIDUUID type is just a binary type of given length (128-bit). Storage, collation and comparison should be defined in the box implementation. From the SQL perspective we introduce a new primary type UUID, which At the moment there is no plan to introduce any special handling of string literals with the form String literals and string expressions may be implicitly converted to UUID if they have proper format. Converting UUID to STRING is only possible via explicit cast. ISO SQL does not require any UUID support from SQL vendors, thus all known vendors implement it slightly differently. The worst uuid implementation, out of all those mentioned above in the table, is Oracle Both MySQL and SQLite use similar approaches – providing a set of functions for conversion of string to blob, and the reverse. The most complete implementation seemingly is in PostgreSQL, which allows various relaxed formats for string literals which may be accepted as UUID – the plan is to be as close as possible to PostgreSQL here. Tarantool UUID – plan of actionsThe suggestion is:
|
Beta Was this translation helpful? Give feedback.
-
ArrayNikita has opened an issue for array support in syntax – “sql: introduce type <ARRAY>” This issue was requesting only a set of built-in functions, and was not asking much of syntax extensions.
This proposal looks very easy to implement, but is not very SQL-ish. The There is in standard SQL:2016 a simple [] syntax for accessing elements of ARRAYs, and DB2, PostgreSQL and Oracle support it. All of them have different sets of array-supporting built-ins (and PostgreSQL has an especially long list of functions in its extension). But from a practical perspective there is not much value in supporting anything beyond ISO SQL functions. Tarantool ARRAYs – plan of actionsSo, the plan is:
Such fields will be stored as MP_ARRAY encoded tables in box. NB! An open question is – should we export TRIM_ARRAY and ARRAY_AGG to their Lua counterparts? |
Beta Was this translation helpful? Give feedback.
-
MapSimilar to Nikita’s proposal about arrays, maps support might be implemented via a set of builtin functions. Please see proposal from Nikita in “sql: introduce type <MAP> #4763” - #4763
There is no (and hardly ever could be) pure ISO SQL aggregate type similar to box map. However, it is a natural concept for NoSQL databases, and is supported by Cassandra CQL, as an example. Here is excerpt from The Cassandra Query Language (CQL) | Data Types | Collections | Maps:
MAPs in Tarantool SQL – plan of actionsJust do as Cassandra does:
|
Beta Was this translation helpful? Give feedback.
-
BITANSI SQL:99 types Despite the fact that XMLXML types support is in SQL standard since SQL:2003. Nobody actually cares about that, so we should rather not bother with XML at the moment. XML in Tarantool SQL – plan of actions
JSON[Strictly speaking this might be out-of-topic for the current paper, because we are not speaking about an SQL type per se, but for completeness…] ISO SQL:2016 adds json support, but does it differently than for XML. CREATE TABLE … ( jcol CLOB CHECK (jcol IS JSON) );
[{id: 1, name: "Marvin"},
{id: 2, name: "Arthur"}
]
SELECT jt.*
FROM t,
JSON_TABLE
( jcol,
‘$[*]’
COLUMNS
(id NUMERIC PATH '$.id',
name VARCHAR(255) PATH '$.name'
)
) jt There is no special JSON type added, but rather a lot of functions to operate with string or binary data. For implementation we could rely here on already available support of json-path in Tarantool box. But let JSON in Tarantool SQL – plan of actions
|
Beta Was this translation helpful? Give feedback.
-
ISO SQL conversion rulesISO SQL:2016 has a convenient table for explicit type conversion which might be done for different standard types. There is no drawn implicit conversion table though. [At least we didn't find it] Let us put aside user-defined types (which we not support at the moment for SQL mode), references, rows, and collections (array/map), and see the rest of the table entries. We see that string data could be conditionally converted to any other (we name them “scalar”) types. We see that integer and float types might be converted to strings and to each other. There is more complex correspondence between date/time types, but they are behaving reasonably and according to intuition. Example, Oracle conversion table (click to expand)Among other SQL vendors only Oracle and Microsoft provide human-readable conversions tables in their documentation. Please see the Oracle version above, as described in SQL Language Reference | Data Type Comparison Rules | Implicit and Explicit Data Conversion For the sake of completeness, we show an explicit conversion table from the same version of Oracle 21 from SQL Language Reference | Data Type Comparison Rules | Explicit Data Conversion Example, Microsoft SQL-Server conversion table (click to expand)SQL-Server documentation in their article Transact-SQL (T-SQL) | Reference | Data types | Conversion | Implicit and explicit conversion shows an absolutely outstanding conversion table, which compresses in single place information about whether this combination of input and NB! There are a lot of extended types which are not SQL standard (like CLR UDT, money or hierarchyid). We do not care about them at the moment. Example: integer and floating types in MS SQL-ServerExample, MS SQL-Server about integer types:
Example, Microsoft SQL-Server about real / float types
Example, MySQL conversion table (click to expand)Unfortunately, there is no drawn clear and concise conversion table for the MySQL SQL flavor. Rules are described verbally MySQL 8.0 Reference Manual / Functions and Operators / Type Conversion in Expression NB! Please let us know if you have some corrections for the above table, we still want to have correct and up-to-date information represented in our document. Example, Type conversion rules in MySQL https://dev.mysql.com/doc/refman/8.0/en/type-conversion.html
Example, PostgreSQL conversion table (click to expand)We experience similar problems when we try to visualize implicit and explicit casts in PostgreSQL, because there is no such table in the documentation. PostgreSQL 13.2 Documentation | Part II. The SQL Language | 10. Type Conversion
Ok, we’ve given up the idea to build a conversion table! |
Beta Was this translation helpful? Give feedback.
-
Implicit casts and comparisonsHere is the older picture of implicit conversions that we do in vdbe code: “Integer” (actually negative integer) and “unsigned” make up the generic “integer” type. Number is a superset of generic integer and double. (There is no smaller 4-byte float data type, as in other SQL vendors’ Here are a few comments about possible combinations in the above table:
Implicit types conversion refactoring (click to expand)There is refactoring going on for a stricter casts table, which suggests to implement this (slightly stricter) conversion table: Pay attention to the smaller number of allowed entries in string column. i.e. we do not allow to implicitly convert between brother number types or strings. That is a sudden and unexpected result, and the majority of SQL vendors do support such conversions (see tables above). Surprisingly, not much change needs to be introduced to the implicit conversation table above, for all the new SQL types that we speak about in this paper. Array and map are already there and in proper places, and omitted there – new scalar uuid type, and several datetime types. Proposed implicit cast table with newer typesWith the addition of newer SQL types (uuid, datetime, etc.) we would have the following implicit conversion table:
|
Beta Was this translation helpful? Give feedback.
-
Explicit castsIn SQL we could use several ways to explicitly cast from one argument type to another, either via `CAST(x AS T)` or via a special purpose builtin function. Example, ClickHouse explicit casts (click to expand)A good example is how ClickHouse uses both functions and CAST expressions -
Current explicit conversion tableWith the exception of Oracle and Microsoft SQL-Server we did not show explicit casts tables (for Oracle there was a separate function, and for SQL-Server it was integrated to the common table), but the principle is Here is the currently implemented table of explicit conversions for VDBE: Proposed table for explicit conversions with newer types
|
Beta Was this translation helpful? Give feedback.
-
SQL operatorsSQL - operator precedenceImportant part of an algorithm of implicit and explicit cast application is the order of operators evaluation (please see PostgreSQL algorithm referenced above). We are not yet so much flexible as PostgreSQL, and would rather prefer to have static rules in parser. It's worth to mention that once we get to the business with user-defined types and functions overload we should be ready to return to the question of operator precedence control once again. But probably it won't happen any time soon. |
Beta Was this translation helpful? Give feedback.
-
As far as I know, currently we do not have any implicit cast when we execute box operations from Lua. However, I believe that in near future we plan to use format during serialization. Does it mean that implicit cast rules defined here will be applied to operations from Lua? If we know format during serialization, then we have no reason no to do this. So, will |
Beta Was this translation helpful? Give feedback.
-
Currently we have a problem with comparison in SCALAR. Example:
I see two ways to solve the problem:
I believe the second way is better. One more reason to remove implicit cast. |
Beta Was this translation helpful? Give feedback.
-
Suggested implicit cast table:
Here For numeric types conditions are: cast from numeric type A to numeric type B is possible only if |
Beta Was this translation helpful? Give feedback.
-
Suggested explicit cast table:
Here Cast from one numeric type to another can lead to loss of precision. For example Any NUMBER value can be stored as INTEGER, DOUBLE or DECIMAL. Any SCALAR value can be stored as BOOLEAN, INTEGER, DOUBLE, DECIMAL, STRING, VARBINARY or UUID. In the following rules, NUMBER/SCALAR values use the type that is used to store the value instead of NUMBER/SCALAR. For example, SCALAR value stored as BOOLEAN will follow rules determined for BOOLEAN values. Rules for explicit casts: Explicit cast to BOOLEAN:
Explicit cast to UNSIGNED:
Explicit cast to INTEGER:
Explicit cast to DOUBLE:
Explicit cast to DECIMAL:
Explicit cast to NUMBER:
Explicit cast to UUID:
|
Beta Was this translation helpful? Give feedback.
-
Rules of arithmetic operations. Rules for
Rules for
|
Beta Was this translation helpful? Give feedback.
-
Rules of bit-wise operations.
|
Beta Was this translation helpful? Give feedback.
-
According to the description, QUOTE() returns numeric value if its argument is numeric, but currently, it returns a string in case of float argument. If we fix this problem, |
Beta Was this translation helpful? Give feedback.
-
Hi all! Below I will describe my thought about SQL built-in functions. List of types in this document: BOOLEAN, INTEGER, DOUBLE, DECIMAL, NUMBER, STRING, VARBINARY, UUID, SCALAR. I will not mention UNSIGNED type since it is essentially INTEGER type with some restrictions.
|
Beta Was this translation helpful? Give feedback.
-
I think that it'is a shame. PRINTF is a very consistent way to format strings. |
Beta Was this translation helpful? Give feedback.
-
Re ZEROBLOB(n): I read an explanation https://www.oreilly.com/library/view/using-sqlite/9781449394592/re159.html |
Beta Was this translation helpful? Give feedback.
-
NB! Result of this discussion has already been aggregated and committed as RFC to the master
Introduction
The purpose of this paper is to try to create a clear and non-contradicting description of a typing model, which covers both worlds of types in NoSQL Lua (Tarantool box), and in SQL. Wherever possible we will try to commonize types used, and, if that’s not possible, we will try to present least harmful conversion rules.
If a conversion rule is not straightforward and should employ any logic, this should be the least surprising logic.
The problem is (as usual) in details. Despite the fact that all parties involved have very similar type system - be it Lua, Tarantool storage engines (with format schema applied to spaces), MessagePack used as internal serialization format, or SQL engine (heavily influenced by SQLite design). In all those cases we could find subtle, but important differences
here and there.
The goal of the current specification is to define direction to either minimize those differences or to get rid of them altogether.
As an example, if we look into Tarantool storage then today Tarantool box uses two different categories of types:
Field types as described by format (if it’s applied to the space);
and serialization types used by messagepack also known as
mp-types.
In the longer term, we should get to the situation when SQL type system described in the terms of Tarantool field-types. At the moment there are conversion rules but we want to eliminate their use so there is no information loss.
Beta Was this translation helpful? Give feedback.
All reactions