-
Notifications
You must be signed in to change notification settings - Fork 211
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add string schema properties #587
base: master
Are you sure you want to change the base?
Conversation
Charset, pattern, blank
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks Great! few comments attached. Also, to be complete, we need also:
- JSON Schema mappings
- Changes to generator
false)))))) | ||
|
||
#?(:clj | ||
(defn find-blank-method |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could lift the minimum java to 11 and remove this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a huge breaking change for users. Sadly, there's still plenty of Java 8 in the world and we must accommodate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ikitommi Definitely ~half of our work projects are still on Java 8
(let [p (-charset-predicate charset)] | ||
(string-char-predicate p))) | ||
non-blank (when non-blank #(not (blank? %)))] | ||
(-> non-blank |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the :non-blank
needed? much shorter to use :min
:
[:string {:non-blank true}]
[:string {:min 1}]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
valid min 1 string: " "
, but it's invalid for non-blank
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be solved by transformers, but users can certainly want to specify they want a string with minimum length which is not blank. Blank in this case is a superset of empty, which I almost missed in the beginning
|
It's a set of characters, but I'm not a fan of the name either, if you have a better idea, I don't mind changing it
The only JVM only feature currently is regex generation, but this is not a regression |
Perhaps just Charset (or even character set) means completely different thing in the same context, so it is not an option to use that: https://en.wikipedia.org/wiki/Character_encoding Quick test to check this is faster than just using equivalent patterns: (bench/quick-bench (validate [:string {:charset :digit}] "123")) (bench/quick-bench (validate [:string {:pattern "^\d*$"}] "123")) |
(defn -charset-predicate | ||
[o] | ||
(case o | ||
:digit #?(:clj #(Character/isDigit ^char %) :cljs -numeric-char?) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw. there is both char and int versions of the predicates in Java:
https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isLetter(char)
https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isLetter(int)
The int version supports unicode characters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, this is probably fine. Char supports 0-65535 so it is enough to support unicode range 0x0000 - 0xFFFF.
Clojure characters don't support supplementary char ranges either. Though strings support:
0x2F81A:
\冬
=> Unsupported character: \冬
(.codePointAt "冬" 0)
=> 194586
(char (.codePointAt "冬" 0))
=> Value out of range for char: 194586
(Character/isLetter (int 0x2F81A))
=> true
(int (.charAt "冬" 0))
=> 55422 in this case .charAt only returns first two bytes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
JS charCodeAt works the same as JVM charAt.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Problem is I'm using charAt
which returns char, not int
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
0x2F81A (which is \⾁) is supported under the 12161 unicode: (link)
(int \⾁)
=> 12161
Same with all other unicode characters:
(char 8809)
=> \≩
(char 508)
=> \Ǽ
(int \Ǽ)
=> 508
(int \≩)
=> 8809
(int \Θ)
=> 920
(char 33071)
=> \脯
(merge {:type "string"} (-> schema m/properties (select-keys [:min :max]) (set/rename-keys {:min :minLength, :max :maxLength})))) | ||
(let [props (-> schema m/properties) | ||
pattern (case (:charset props) | ||
:digit "^[0-9]*$" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These patterns don't match the Character predicates, the predicates allow other ranges, e.g.:
'\u0030' through '\u0039', ISO-LATIN-1 digits ('0' through '9')
'\u0660' through '\u0669', Arabic-Indic digits
'\u06F0' through '\u06F9', Extended Arabic-Indic digits
'\u0966' through '\u096F', Devanagari digits
'\uFF10' through '\uFF19', Fullwidth digits
There are some classes in at least JVM Pattern which might match the predicates, but not sure if there are equivelents to all, and what does JS support. Listing all the ranges might work for some cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can use unicode ranges, this works:
(re-find #"[\u0061-\u00F6]" "a")
It is possible to rewrite the string loop and char checks to work with unicode values, using codePointAt:
It doesn't seem to hurt the performance either: ;; 470ns |
;; string schema helpers | ||
;; | ||
|
||
#?(:cljs (defn -numeric-char? [c] (and (< 47 c) (< c 58)))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These checks work differently than the JVM versions.
I think characters is good, I'll make the appropriate change |
What is the use case for this feature? The performance is a bit better (~2x) than Patterns but I don't think that is a huge difference usually. Patterns already support unicode quite well, and both JVM and JS. This is surprisingly complex feature due to character encodings, and what letter are included in the character classes.
|
Not sure why mix the constructors here:
|
The use case for this feature is twofold:
I've seen too many examples at work for such use cases and I don't really want a regex on my hot path |
Doing some more thinking about this, how about two properties, |
Currently only implemented for JVM
What do you think?