Add string schema properties #587

bsless · 2021-12-05T09:18:25Z

Currently only implemented for JVM
What do you think?

Charset, pattern, blank

ikitommi

Looks Great! few comments attached. Also, to be complete, we need also:

JSON Schema mappings
Changes to generator

src/malli/core.cljc

ikitommi · 2021-12-06T09:59:13Z

src/malli/core.cljc

+            false))))))
+
+#?(:clj
+   (defn find-blank-method


we could lift the minimum java to 11 and remove this?

That's a huge breaking change for users. Sadly, there's still plenty of Java 8 in the world and we must accommodate

@ikitommi Definitely ~half of our work projects are still on Java 8

ikitommi · 2021-12-06T10:01:08Z

src/malli/core.cljc

+           (let [p (-charset-predicate charset)]
+             (string-char-predicate p)))
+         non-blank (when non-blank #(not (blank? %)))]
+     (-> non-blank


is the :non-blank needed? much shorter to use :min:

[:string {:non-blank true}] [:string {:min 1}]

valid min 1 string: " ", but it's invalid for non-blank

This could be solved by transformers, but users can certainly want to specify they want a string with minimum length which is not blank. Blank in this case is a superset of empty, which I almost missed in the beginning

Deraen · 2021-12-06T14:39:42Z

Charset name here is quite strange? It doesn't refer to charset really.
Looks like after additional changes this works fully on Cljs now? I would be really careful of adding JVM only features, especially if there is no schema creation time error about using non-supported properties on some env.

bsless · 2021-12-06T15:01:01Z

Charset name here is quite strange? It doesn't refer to charset really.

It's a set of characters, but I'm not a fan of the name either, if you have a better idea, I don't mind changing it

Looks like after additional changes this works fully on Cljs now? I would be really careful of adding JVM only features, especially if there is no schema creation time error about using non-supported properties on some env.

The only JVM only feature currently is regex generation, but this is not a regression

Deraen · 2021-12-07T09:00:58Z

Perhaps just :characters (or :chars)?

Charset (or even character set) means completely different thing in the same context, so it is not an option to use that: https://en.wikipedia.org/wiki/Character_encoding

Quick test to check this is faster than just using equivalent patterns:

(bench/quick-bench (validate [:string {:charset :digit}] "123"))
;; 470ns

(bench/quick-bench (validate [:string {:pattern "^\d*$"}] "123"))
;; 765ns

Deraen · 2021-12-07T09:03:58Z

src/malli/core.cljc

+(defn -charset-predicate
+  [o]
+  (case o
+    :digit #?(:clj #(Character/isDigit ^char %) :cljs -numeric-char?)


Btw. there is both char and int versions of the predicates in Java:

https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isLetter(char)
https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isLetter(int)

The int version supports unicode characters.

Okay, this is probably fine. Char supports 0-65535 so it is enough to support unicode range 0x0000 - 0xFFFF.

Clojure characters don't support supplementary char ranges either. Though strings support:

0x2F81A:

\冬
=> Unsupported character: \冬

(.codePointAt "冬" 0)
=> 194586

(char (.codePointAt "冬" 0))
=> Value out of range for char: 194586

(Character/isLetter (int 0x2F81A))
=> true

(int (.charAt "冬" 0))
=> 55422 in this case .charAt only returns first two bytes.

JS charCodeAt works the same as JVM charAt.

Problem is I'm using charAt which returns char, not int

0x2F81A (which is \⾁) is supported under the 12161 unicode: (link)

(int \⾁) => 12161

Same with all other unicode characters:

(char 8809) => \≩ (char 508) => \Ǽ (int \Ǽ) => 508 (int \≩) => 8809 (int \Θ) => 920 (char 33071) => \脯

Deraen · 2021-12-07T09:33:19Z

src/malli/json_schema.cljc

-  (merge {:type "string"} (-> schema m/properties (select-keys [:min :max]) (set/rename-keys {:min :minLength, :max :maxLength}))))
+  (let [props (-> schema m/properties)
+        pattern (case (:charset props)
+                  :digit "^[0-9]*$"


These patterns don't match the Character predicates, the predicates allow other ranges, e.g.:

'\u0030' through '\u0039', ISO-LATIN-1 digits ('0' through '9')
'\u0660' through '\u0669', Arabic-Indic digits
'\u06F0' through '\u06F9', Extended Arabic-Indic digits
'\u0966' through '\u096F', Devanagari digits
'\uFF10' through '\uFF19', Fullwidth digits

There are some classes in at least JVM Pattern which might match the predicates, but not sure if there are equivelents to all, and what does JS support. Listing all the ranges might work for some cases.

I can use unicode ranges, this works:

(re-find #"[\u0061-\u00F6]" "a")

Deraen · 2021-12-07T11:02:44Z

It is possible to rewrite the string loop and char checks to work with unicode values, using codePointAt:

diff --git a/src/malli/core.cljc b/src/malli/core.cljc
index e8ab01b..2878c12 100644
--- a/src/malli/core.cljc
+++ b/src/malli/core.cljc
@@ -583,13 +583,14 @@
 (defn -charset-predicate
   [o]
   (case o
-    :digit #?(:clj #(Character/isDigit ^char %) :cljs -numeric-char?)
-    :letter #?(:clj #(Character/isLetter ^char %) :cljs -letter?)
-    (:alphanumeric :letter-or-digit) #?(:clj #(Character/isLetterOrDigit ^char %) :cljs -alphanumeric?)
+    :digit #?(:clj #(Character/isDigit ^int %) :cljs -numeric-char?)
+    :letter #?(:clj #(Character/isLetter ^int %)
+               :cljs -letter?)
+    (:alphanumeric :letter-or-digit) #?(:clj #(Character/isLetterOrDigit ^int %) :cljs -alphanumeric?)
     :alphabetic #?(:clj #(Character/isAlphabetic (int %)) :cljs -letter?)
     (cond
       (set? o) (miu/-some-pred (mapv -charset-predicate o))
-      (char? o) #?(:clj #(= ^char o %) :cljs (let [i (.charCodeAt o 0)] #(= i %)))
+      (char? o) #?(:clj #(= ^int o %) :cljs (let [i (.charCodeAt o 0)] #(= i %)))
       :else (eval o))))

 (defn string-char-predicate
@@ -599,10 +600,13 @@
       (loop [i 0]
         (if (= i n)
           true
-          (if (p #?(:clj  (.charAt s (unchecked-int i))
-                    :cljs (.charCodeAt s (unchecked-int i))))
-            (recur (unchecked-inc i))
-            false))))))
+          (let [cur-char #?(:clj  (.codePointAt s (unchecked-int i))
+                            :cljs (.codePointAt s (unchecked-int i)))]
+            (if (p cur-char)
+              (recur (unchecked-add i #?(:clj (Character/charCount cur-char)
+                                         ;; TODO: Is there better way?
+                                         :cljs (.length (String/fromCodePoint cur-char)))))
+              false)))))))

It doesn't seem to hurt the performance either:

;; 470ns
(bench/quick-bench (validate [:string {:charset :letter}] "冬"))
;; 840ns
(bench/quick-bench (validate [:string {:pattern "^\p{L}*$"}] "冬"))

Deraen · 2021-12-07T11:03:11Z

src/malli/core.cljc

+;; string schema helpers
+;;
+
+#?(:cljs (defn -numeric-char? [c] (and (< 47 c) (< c 58))))


These checks work differently than the JVM versions.

bsless · 2021-12-07T11:09:23Z

Perhaps just :characters (or :chars)?

I think characters is good, I'll make the appropriate change

Deraen · 2021-12-07T11:13:44Z

What is the use case for this feature?

The performance is a bit better (~2x) than Patterns but I don't think that is a huge difference usually. Patterns already support unicode quite well, and both JVM and JS.

This is surprisingly complex feature due to character encodings, and what letter are included in the character classes.

[A-Za-z] and just checking for 64 < c < 91 and 96 < c < 123 break e.g. with Finnish and Scandinavian languages. I don't think providing features that work only in specific cases is that useful. And now the JVM predicates work considerably differently than Cljs predicates. At least we should be very careful to ensure these work similarly in both platforms.

bsless · 2021-12-07T11:57:18Z

Not sure why mix the constructors here:

(require '[criterium.core :as cc])
(require '[malli.core :as m])

(def v0 (m/validator [:string {:charset :digit}]))
(def v1 (m/validator [:string {:pattern #"^\d*$"}]))

(cc/quick-bench (v0 "123")) ;; 16.756693 ns
(cc/quick-bench (v0 "123456789")) ;; 48.106319 ns
(cc/quick-bench (v0 "123456789a")) ;; 53 ns

(cc/quick-bench (v1 "123")) ;; 71.737728 ns
(cc/quick-bench (v1 "123456789")) ;; 109.460351 ns
(cc/quick-bench (v1 "123456789a")) ;; 128 ns

bsless · 2021-12-07T17:29:30Z

The use case for this feature is twofold:

regular expressions are slower
expressive power - being able to say "a string is alphanumeric" is more readable than ^[A-Za-z0-9]*$
Especially when mixed with other constraints, like "it should have at least two elements"
Composition / working with data: specify a set of options instead of a regex. One of the options can be a characters set, too, or an ad-hoc validation function

I've seen too many examples at work for such use cases and I don't really want a regex on my hot path

bsless · 2021-12-13T19:00:44Z

Doing some more thinking about this, how about two properties, include-chars and exclude-chars?

bsless added 2 commits December 5, 2021 11:17

Add string schema properties

09c6258

Charset, pattern, blank

Add cljs implementation

dc8eeda

ikitommi reviewed Dec 6, 2021

View reviewed changes

Use fail instead of throw

82a3c4e

bsless added 3 commits December 6, 2021 16:55

Simplify charset default case

402034f

Collapse identical case

3df68a1

Add string charset, pattern, blank support

7973683

Add pattern and charset to string json schema

b4c04a7

bsless requested a review from ikitommi December 6, 2021 21:08

Deraen reviewed Dec 7, 2021

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add string schema properties #587

Add string schema properties #587

bsless commented Dec 5, 2021

ikitommi left a comment

ikitommi Dec 6, 2021

bsless Dec 6, 2021

Deraen Dec 6, 2021 •

edited

Loading

ikitommi Dec 6, 2021

bsless Dec 6, 2021

bsless Dec 6, 2021

Deraen commented Dec 6, 2021

bsless commented Dec 6, 2021

Deraen commented Dec 7, 2021

Deraen Dec 7, 2021

Deraen Dec 7, 2021

Deraen Dec 7, 2021

bsless Dec 7, 2021

evg-tso Dec 15, 2021

Deraen Dec 7, 2021

bsless Dec 7, 2021

Deraen commented Dec 7, 2021

Deraen Dec 7, 2021

bsless commented Dec 7, 2021

Deraen commented Dec 7, 2021

bsless commented Dec 7, 2021

bsless commented Dec 7, 2021

bsless commented Dec 13, 2021

Add string schema properties #587

Are you sure you want to change the base?

Add string schema properties #587

Conversation

bsless commented Dec 5, 2021

ikitommi left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Deraen Dec 6, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Deraen commented Dec 6, 2021

bsless commented Dec 6, 2021

Deraen commented Dec 7, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Deraen commented Dec 7, 2021

Choose a reason for hiding this comment

bsless commented Dec 7, 2021

Deraen commented Dec 7, 2021

bsless commented Dec 7, 2021

bsless commented Dec 7, 2021

bsless commented Dec 13, 2021

Deraen Dec 6, 2021 •

edited

Loading