Kakoune’s regex syntax is inspired by ECMAScript, as defined by the ECMA-262 standard (see :doc regex compatibility).
Kakoune’s regex always runs on Unicode codepoint sequences, not on bytes.
Every character except the syntax characters \^$.*+?[]{}|() matches itself. Syntax characters can be escaped with a backslash, so that \$ will match a literal $, and \\ will match a literal \.
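As a rough illustration, the same escaping rule can be tried in Python. Its re module is a separate engine that happens to treat these two escapes the same way, and the sample strings are made up for the example.

```python
import re

# \$ matches a literal dollar sign instead of acting as an end-of-line assertion.
dollar = re.search(r"\$", "price: $5")

# \\ matches a single literal backslash.
backslash = re.search(r"\\", r"C:\temp")

print(dollar.start(), repr(backslash.group(0)))
```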
Some literals are available as escape sequences:
- \f matches the form feed character.
- \n matches the newline character.
- \r matches the carriage return character.
- \t matches the tabulation character.
- \v matches the vertical tabulation character.
- \0 matches the null character.
- \cX matches the control-X character (X can be in [A-Za-z]).
- \xXX matches the character whose codepoint is XX (in hexadecimal).
- \uXXXXXX matches the character whose codepoint is XXXXXX (in hexadecimal).
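The sketch below shows which characters these escapes denote. It is written in Python rather than Kakoune, and the helpers control_char and from_hex are hypothetical names introduced for illustration. It assumes control-X follows the ECMAScript definition (the letter's code point modulo 32).

```python
import re

def control_char(letter: str) -> str:
    # Assumption: control-X is the letter's code point modulo 32, as in ECMAScript.
    return chr(ord(letter) % 32)

def from_hex(digits: str) -> str:
    # Character whose codepoint is written in hexadecimal, as \xXX and \uXXXXXX describe.
    return chr(int(digits, 16))

ctrl_j = control_char("J")   # line feed
letter_a = from_hex("41")    # two digits, like \x41
emoji = from_hex("01F600")   # six digits, like \u01F600

# Escapes from the list that Python's re module also understands.
shared = re.fullmatch(r"\t\x41\n", "\tA\n")
```

Python's own \u escape takes four digits, so the six-digit form is only emulated here through from_hex.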
The [ character introduces a character class, matching one character from a set of characters.
A character class contains a list of literals, character ranges, and character class escapes surrounded by [ and ].
If the first character inside a character class is ^, then the character class is negated, meaning that it matches every character not specified in the character class.
Literals match themselves, including syntax characters, so ^ does not need to be escaped in a character class (except as the first character, where it negates the class). [*+] matches both the * character and the + character. Literal escape sequences are supported, so [\n\r] matches both the newline and carriage return characters.
The ] character needs to be escaped for it to match a literal ] instead of closing the character class.
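A minimal sketch of these class rules, using Python's re module as a stand-in engine. It behaves the same way for these three cases, and the sample strings are invented.

```python
import re

# A leading ^ negates the class: keep everything that is not a, b or c.
negated = re.findall(r"[^abc]", "aXbYc")

# Syntax characters are plain literals inside a class.
stars_plus = re.findall(r"[*+]", "1*2+3")

# ] must be escaped to be part of the class instead of closing it.
brackets = re.findall(r"[\]]", "a[0]]")
```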
Character ranges are written as <start character>-<end character>, so [A-Z] matches all uppercase basic letters. [A-Z0-9] will match all uppercase basic letters and all basic digits.
The - characters in a character class that are not specifying a range are treated as literal -, so [A-Z-+] matches all uppercase characters, the - character, and the + character.
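The range rules can be checked with a short Python sketch. Python's re module is a different engine, but it parses these two classes the same way.

```python
import re

sample = "aB3-d+E"

# Two ranges in one class: uppercase basic letters and basic digits.
upper_or_digit = re.findall(r"[A-Z0-9]", sample)

# The - after the A-Z range does not start a new range, so it is a literal.
with_dash_plus = re.findall(r"[A-Z-+]", sample)
```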
Supported character class escapes are:
- \d which matches digits 0-9.
- \w which matches word characters A-Z, a-z, 0-9 and underscore (ignoring the extra_word_chars option).
- \s which matches all Unicode whitespace characters.
- \h which matches whitespace except vertical tab and line breaks.
Using an upper case letter instead of a lower case one will negate the character class. For example, \D will match every non-digit character.
Character class escapes can be used outside of a character class; \d is equivalent to [\d].
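The following Python sketch mirrors \d, \D and \w. The re.ASCII flag is used so that Python's classes cover the same ASCII sets as listed above. \h has no counterpart in Python's re module and is left out.

```python
import re

text = "room 42, floor_3"

# re.ASCII restricts \d and \w to the ASCII sets listed above.
digits = re.findall(r"\d", text, re.ASCII)
non_digit_prefix = re.match(r"\D+", text, re.ASCII).group(0)
words = re.findall(r"\w+", text, re.ASCII)

# Inside or outside a class, \d means the same thing.
same = re.findall(r"\d", text) == re.findall(r"[\d]", text)
```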
. matches any character, including newlines, by default (see :doc regex modifiers on how to change it).
Regex atoms can be grouped using ( and ) or (?: and ). If ( is used, the group will be a capturing group, which means the positions from the subject strings that matched between ( and ) will be recorded.
Capture groups are numbered starting at 1. They are numbered in the order of appearance of their ( in the regex. A special capture group 0 is for the whole sequence that matched.
- (?: introduces a non-capturing group, which will not record the matching positions.
- (?<name> introduces a named capturing group, which, in addition to being referred to by number, can be, in certain contexts, referred to by the given name.
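Group numbering can be illustrated with Python's re module, with one caveat: Python spells named groups (?P<name>…) rather than (?<name>…). The date-like subject is an invented example.

```python
import re

m = re.fullmatch(r"(\d+)-(?:\d+)-(?P<year>\d+)", "31-12-1999")

whole = m.group(0)         # group 0 is the whole match
first = m.group(1)
second = m.group(2)        # the non-capturing group is skipped when numbering
by_name = m.group("year")  # the named group is also group 2

# Groups are numbered by the position of their opening parenthesis.
nested = re.fullmatch(r"((a)(b))", "ab").groups()
```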
The | character introduces an alternation, which will either match its left-hand side, or its right-hand side (preferring the left-hand side).
For example, foo|bar matches either foo or bar, and foo(bar|baz|qux) matches foo followed by either bar, baz or qux.
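A short Python sketch of alternation, including the preference for the left-hand side. Python's re module is a different engine but resolves alternations in the same order. The group is written as non-capturing only so that findall returns whole matches.

```python
import re

either = re.findall(r"foo|bar", "foo bar baz")
grouped = re.findall(r"foo(?:bar|baz|qux)", "foobar fooqux foozap")

# Both alternatives could match at position 0; the left-hand one is taken.
prefers_left = re.match(r"a|ab", "ab").group(0)
```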
Literals, character classes, the any-character atom (.), and groups can be followed by a quantifier, which specifies the number of times they can match.
- ? matches zero or one time.
- * matches zero or more times.
- + matches one or more times.
- {n} matches exactly n times.
- {n,} matches n or more times.
- {n,m} matches n to m times.
- {,m} matches zero to m times.
By default, quantifiers are greedy, which means they will prefer to match more characters if possible. Suffixing a quantifier with ? will make it non-greedy, meaning it will prefer to match as few characters as possible.
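The quantifiers and the greedy versus non-greedy distinction can be sketched in Python, whose re module accepts the same quantifier forms. The sample strings are invented.

```python
import re

optional = re.findall(r"colou?r", "color colour")
exact = re.findall(r"\b\d{4}\b", "7 42 1999 123456")
bounded = re.findall(r"\ba{2,3}\b", "a aa aaa aaaa")
at_most = re.fullmatch(r"x{,2}", "xx") is not None

# Greedy takes as much as it can; the ? suffix makes it stop at the first >.
greedy = re.search(r"<.+>", "<a><b>").group(0)
lazy = re.search(r"<.+?>", "<a><b>").group(0)
```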
Assertions do not consume any character, but they will prevent the regex from matching if not fulfilled.
- ^ matches at the start of a line; that is, just after a newline character, or at the subject’s beginning (unless it is specified that the subject’s beginning is not a start of line).
- $ matches at the end of a line; that is, just before a newline, or at the subject’s end (unless it is specified that the subject’s end is not an end of line).
- \b matches at a word boundary; that is, between the previous character and the current character, one is a word character and the other is not.
- \B matches at a non-word boundary; that is, when both the previous character and the current character are word characters, or neither is.
- \A matches at the subject string’s beginning.
- \z matches at the subject string’s end.
- \K always matches, and resets the start position of capture group 0 to the current position.
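The sketch below tries the assertions that Python's re module shares with this list. Two differences are assumptions of the example. Python needs re.MULTILINE for ^ and $ to work per line. Python also has no \K, so a capture group is used to get a comparable result.

```python
import re

text = "one two\nthree four"

# re.MULTILINE makes ^ and $ match at line starts and ends.
line_starts = re.findall(r"^\w+", text, re.MULTILINE)
line_ends = re.findall(r"\w+$", text, re.MULTILINE)

# \A only matches at the very beginning of the subject.
subject_start = re.findall(r"\A\w+", text, re.MULTILINE)

whole_words = re.findall(r"\bcat\b", "cat concat cat.")
inside_word = [m.start() for m in re.finditer(r"\Bcat", "cat concat")]

# Stand-in for foo\Kbar: report only the part after the prefix.
after_prefix = re.search(r"foo(bar)", "foobar").group(1)
```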
More complex assertions can be expressed with lookarounds:
- (?=…) is a lookahead; it will match if its content matches the text following the current position.
- (?!…) is a negative lookahead; it will match if its content does not match the text following the current position.
- (?<=…) is a lookbehind; it will match if its content matches the text preceding the current position.
- (?<!…) is a negative lookbehind; it will match if its content does not match the text preceding the current position.
For performance reasons, lookaround contents must be a sequence of literals, character classes, or any character (.); quantifiers are not supported.
For example, (?<!bar)(?=foo). will match any character which is not preceded by bar and where foo matches from the current position (which means the character has to be an f).
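The example regex can be tried unchanged in Python, whose re module supports the same fixed-length lookarounds. The subject string is made up to contain one accepted and one rejected occurrence in the middle.

```python
import re

pattern = re.compile(r"(?<!bar)(?=foo).")
subject = "foo barfoo xfoo"

# Each hit is a single f that starts a foo not preceded by bar.
hits = [(m.start(), m.group(0)) for m in pattern.finditer(subject)]
print(hits)
```

The foo at offset 7 is skipped because the three characters before it are bar.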
Some modifiers can control the matching behavior of the atoms following them:
- (?i) starts case-insensitive matching.
- (?I) starts case-sensitive matching (default).
- (?s) allows . to match newlines (default).
- (?S) prevents . from matching newlines.
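For comparison, a sketch of the (?i) and (?s) modifiers in Python's re module. Its defaults differ from the list above: . does not match newlines unless (?s) is given, and the (?I) and (?S) forms do not exist there.

```python
import re

insensitive = re.fullmatch(r"(?i)hello", "HeLLo") is not None
sensitive = re.fullmatch(r"hello", "HeLLo") is not None

# With (?s) the dot matches the newline; without it, in Python, it does not.
dot_with_s = re.fullmatch(r"(?s)a.b", "a\nb") is not None
dot_without_s = re.fullmatch(r"a.b", "a\nb") is not None
```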
\Q will start a quoted sequence, where every character is treated as a literal. That quoted sequence will continue until either the end of the regex, or the appearance of \E.
For example, .\Q.^$\E$ will match any character followed by the literal string .^$, followed by an end of line.
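Python's re module has no \Q…\E, so the sketch below emulates it with a hypothetical helper, expand_quoted, that rewrites each quoted span through re.escape. It is deliberately naive and only meant to show the quoting rule described above, not how Kakoune implements it.

```python
import re

def expand_quoted(pattern: str) -> str:
    # Hypothetical helper: rewrite each \Q...\E span as an escaped literal.
    # Naive on purpose: it does not check whether the backslash of \Q is itself escaped.
    out = []
    pos = 0
    while True:
        start = pattern.find(r"\Q", pos)
        if start == -1:
            out.append(pattern[pos:])
            break
        out.append(pattern[pos:start])
        end = pattern.find(r"\E", start + 2)
        if end == -1:
            # No \E: the quoted sequence runs to the end of the regex.
            out.append(re.escape(pattern[start + 2:]))
            break
        out.append(re.escape(pattern[start + 2:end]))
        pos = end + 2
    return "".join(out)

translated = expand_quoted(r".\Q.^$\E$")
hit = re.search(translated, "x.^$")
miss = re.search(translated, "x.^$ trailing")
```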
Kakoune’s syntax tries to follow the ECMAScript regex syntax, as defined by https://www.ecma-international.org/ecma-262/8.0/; some divergence exists for ease of use, or performance reasons:
- Lookarounds cannot contain arbitrary regexes, but lookbehind is supported.
- \K, \Q..\E, \A, \h and \z are added.
- Stricter handling of escaping, as we introduce additional escapes; identity escapes like \X with X being a non-special character are not accepted, to avoid confusion between \h meaning literal h in ECMAScript, and horizontal blank in Kakoune.
- \uXXXXXX uses 6 digits to cover all of Unicode, instead of relying on ECMAScript UTF-16 surrogate pairs with 4 digits.
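To make the last point concrete, the sketch below computes the UTF-16 surrogate pair that a four-digit scheme needs for a codepoint above U+FFFF, next to the single six-digit form. The function name to_surrogate_pair and the chosen codepoint are illustrative only.

```python
def to_surrogate_pair(codepoint: int) -> tuple:
    # Standard UTF-16 encoding of a codepoint above U+FFFF.
    offset = codepoint - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

high, low = to_surrogate_pair(0x1F600)

# One six-digit escape versus two four-digit escapes for the same character.
kakoune_style = "\\u{:06X}".format(0x1F600)
ecmascript_style = "\\u{:04X}\\u{:04X}".format(high, low)
print(kakoune_style, ecmascript_style)
```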