JS: RegExp.prototype.unicode (flag u)

By Xah Lee.

RegExp.prototype.unicode (u)

u

(new in ECMAScript 2015)

Treat the regex pattern and the text to be matched as a sequence of Unicode code points, instead of a sequence of UTF-16 code units.

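  • This is useful if you want to know whether a particular code unit or code point occurs in the text.
  • With flag u off, the text is treated as a sequence of UTF-16 code units (16 bits each). A character outside the Basic Multilingual Plane, such as 🦋, is 2 code units.
  • With flag u on, the text is treated as a sequence of Unicode code points. A surrogate pair counts as 1 character.
  • The flag makes a difference mostly when the regex contains an escape sequence for the character you seek, typically \uxxxx where xxxx is 4 hexadecimal digits, or \u{hexadecimal}.
  • If the regex contains only literal characters such as abc or a literal "🦋", it finds the same matches with or without the flag. With flag u off, the literal 🦋 is matched as its 2 UTF-16 code units.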
console.log(/\uD83E/.test("🦋") === true);
// true
console.log(/\uD83E/u.test("🦋") === false);
// true

/*
D83E is the first code unit (2 bytes) of the butterfly character in UTF-16.
D83E is a high surrogate. It is half of a surrogate pair, not a standalone Unicode character.

The code point of the butterfly character is 1F98B in hexadecimal.
*/

/*
🦋
Name: BUTTERFLY
ID 129419
HEXD 1F98B
UTF8  F0 9F A6 8B
UTF16 D83E DD8B
*/
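The difference between code units and code points can be checked directly. The following is a minimal sketch using only standard string methods. The variable name butterfly is made up for illustration.

```javascript
const butterfly = "🦋";

// string length counts UTF-16 code units, not characters
console.log(butterfly.length === 2);

// the two code units are the surrogate pair D83E DD8B
console.log(butterfly.charCodeAt(0).toString(16) === "d83e");
console.log(butterfly.charCodeAt(1).toString(16) === "dd8b");

// codePointAt reads the whole pair as one code point
console.log(butterfly.codePointAt(0).toString(16) === "1f98b");

// without flag u, the dot matches one code unit, so it cannot span the whole character
console.log(/^.$/.test(butterfly) === false);

// with flag u, the dot matches one code point
console.log(/^.$/u.test(butterfly) === true);
```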

🛑 WARNING: when you use the Unicode escape form \u{hexadecimal} in a regex literal, e.g. "🦋".match(/\u{1F98B}/), it is not interpreted as a code point unless you have the u flag. But if you use it in a string literal passed to the regex constructor, e.g. "🦋".match(RegExp("\u{1F98B}")), it works, because the string literal turns the escape into the character 🦋 before the regex constructor sees it.

/* The digit 0 has code point 48.
It is 30 in hexadecimal.
The Unicode escape form
\u{30}
stands for the char 0 in a string.
*/
console.log("\u{30}" === "0");

/*
However, in a regex literal,
/\u{30}/
means the char u repeated 30 times.
You need the unicode flag u for it to be interpreted as the digit 0.
/\u{30}/u
*/

// the regex here is interpreted as u repeated 30 times
// deno-fmt-ignore
console.log(/\u{30}/.test("0") === false);

// the regex here is interpreted as the character 0, because of the flag u
// deno-fmt-ignore
console.log(/\u{30}/u.test("0") === true);

// if you pass the escape in a string literal to the regex constructor, you don't have this problem.
console.log(RegExp("\u{30}", "").test("0") === true);
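The claim that /\u{30}/ without the flag means u repeated 30 times can be checked against a string of 30 u's. The sketch below also shows a case where the constructor does have the problem: if the backslash is doubled in the string, the escape reaches the regex engine unchanged, and the u flag is needed again. The variable name thirtyU is made up for illustration.

```javascript
const thirtyU = "u".repeat(30);

// without flag u, \u{30} is the letter u with quantifier {30}
console.log(/^\u{30}$/.test(thirtyU) === true);

// with flag u, it is the digit 0, so 30 u's do not match
console.log(/^\u{30}$/u.test(thirtyU) === false);

// "\u{30}" in a string literal is already the char 0 before RegExp sees it
console.log(RegExp("\u{30}").source === "0");

// with a doubled backslash, RegExp receives the escape itself, so flag u matters again
console.log(RegExp("\\u{30}").test("0") === false);
console.log(RegExp("\\u{30}", "u").test("0") === true);
```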

/* replace butterfly char by x */

// you don't need the regex flag u to match unicode characters
console.log("🦋".replace(/🦋/, "x") === "x");

// but having it on doesn't hurt.
console.log("🦋".replace(/🦋/u, "x") === "x");
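The flag starts to matter for a literal character outside the Basic Multilingual Plane once a character class, a quantifier, or the dot is applied to it, because without the flag these operate on single code units. A small sketch of this, not an exhaustive list:

```javascript
// character class: without flag u, the class holds two separate surrogate code units
console.log(/^[🦋]$/.test("🦋") === false);
console.log(/^[🦋]$/u.test("🦋") === true);

// quantifier: without flag u, {2} applies only to the second code unit
console.log(/^🦋{2}$/.test("🦋🦋") === false);
console.log(/^🦋{2}$/u.test("🦋🦋") === true);

// dot with flag g: one replacement per code unit, or one per code point
console.log("🦋".replace(/./g, "x") === "xx");
console.log("🦋".replace(/./gu, "x") === "x");
```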
