From a50d2742544241ed49d8d95cafe86f0361516ccf Mon Sep 17 00:00:00 2001
From: Michael Dyck
Date: Mon, 27 Sep 2021 18:34:54 -0700
Subject: [PATCH 1/3] Editorial: Introduce `IdentifierStartChar` + `IdentifierPartChar` (#2392)

Extract `IdentifierStartChar` from `IdentifierStart` and `RegExpIdentifierStart`.
Extract `IdentifierPartChar` from `IdentifierPart` and `RegExpIdentifierPart`.

This has 3 benefits:

- We eliminate some repetition between the productions for Identifiers and RegExpIdentifiers.

- We can simplify 4 Early Error rules involving escape sequences, because the constraint can now be expressed in terms of a single nonterminal, rather than a nonterminal plus some terminals.

- We can eliminate the Early Error rule for `RegularExpressionFlags` by instead expressing its constraint in the grammar: in the production for `RegularExpressionFlags`, replace `IdentifierPart` with `IdentifierPartChar`.

(As a consequence of the last point, this commit undefines the following id:
    sec-literals-regular-expression-literals-static-semantics-early-errors
There didn't seem to be a sensible place to relocate it as an oldid.)
---
 spec.html | 43 ++++++++++++++++++-------------------------
 1 file changed, 18 insertions(+), 25 deletions(-)

diff --git a/spec.html b/spec.html
index 43f39fe00a..19a030601d 100644
--- a/spec.html
+++ b/spec.html
@@ -16175,15 +16175,21 @@

Syntax

   IdentifierName IdentifierPart

 IdentifierStart ::
+  IdentifierStartChar
+  `\` UnicodeEscapeSequence
+
+IdentifierPart ::
+  IdentifierPartChar
+  `\` UnicodeEscapeSequence
+
+IdentifierStartChar ::
   UnicodeIDStart
   `$`
   `_`
-  `\` UnicodeEscapeSequence

-IdentifierPart ::
+IdentifierPartChar ::
   UnicodeIDContinue
   `$`
-  `\` UnicodeEscapeSequence
   <ZWNJ>
   <ZWJ>

@@ -16209,13 +16215,13 @@

Static Semantics: Early Errors

IdentifierStart :: `\` UnicodeEscapeSequence
IdentifierPart :: `\` UnicodeEscapeSequence
@@ -17057,22 +17063,12 @@

Syntax

 RegularExpressionFlags ::
   [empty]
-  RegularExpressionFlags IdentifierPart
+  RegularExpressionFlags IdentifierPartChar

Regular expression literals may not be empty; instead of representing an empty regular expression literal, the code unit sequence `//` starts a single-line comment. To specify an empty regular expression, use: `/(?:)/`.
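The context paragraph above can be illustrated with a minimal JavaScript sketch. It is not part of the patch; the variable names are illustrative only, and the behaviour shown is that of a conforming engine rather than anything this commit changes.

```javascript
// `/(?:)/` is the literal form of an empty regular expression.
const empty = /(?:)/;

// An empty pattern matches at position 0 of any string, including "".
const matchesEmptyString = empty.test("");
const matchesAnyString = empty.test("abc");

// The RegExp constructor accepts an empty pattern string. Its `source`
// is reported as "(?:)", so `/${source}/` never reads as the comment `//`.
const sourceOfEmptyPattern = new RegExp("").source;

// `//` begins a single-line comment, so the rest of this line is ignored:
const afterComment = 1; // /this is comment text, not a regex/

console.log(matchesEmptyString, matchesAnyString, sourceOfEmptyPattern);
```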

- -

Static Semantics: Early Errors

- RegularExpressionFlags :: RegularExpressionFlags IdentifierPart - -
-

Static Semantics: BodyText

@@ -34244,19 +34240,14 @@

Syntax

   RegExpIdentifierName[?UnicodeMode] RegExpIdentifierPart[?UnicodeMode]

 RegExpIdentifierStart[UnicodeMode] ::
-  UnicodeIDStart
-  `$`
-  `_`
+  IdentifierStartChar
   `\` RegExpUnicodeEscapeSequence[+UnicodeMode]
   [~UnicodeMode] UnicodeLeadSurrogate UnicodeTrailSurrogate

 RegExpIdentifierPart[UnicodeMode] ::
-  UnicodeIDContinue
-  `$`
+  IdentifierPartChar
   `\` RegExpUnicodeEscapeSequence[+UnicodeMode]
   [~UnicodeMode] UnicodeLeadSurrogate UnicodeTrailSurrogate
-  <ZWNJ>
-  <ZWJ>

 RegExpUnicodeEscapeSequence[UnicodeMode] ::
   [+UnicodeMode] `u` HexLeadSurrogate `\u` HexTrailSurrogate
@@ -34418,7 +34409,7 @@

Static Semantics: Early Errors

RegExpIdentifierStart :: `\` RegExpUnicodeEscapeSequence
  • - It is a Syntax Error if the CharacterValue of |RegExpUnicodeEscapeSequence| is not the code point value of *"$"*, *"_"*, or some code point matched by the |UnicodeIDStart| lexical grammar production.
    + It is a Syntax Error if the CharacterValue of |RegExpUnicodeEscapeSequence| is not the code point value of some code point matched by the |IdentifierStartChar| lexical grammar production.
RegExpIdentifierStart :: UnicodeLeadSurrogate UnicodeTrailSurrogate
@@ -34430,7 +34421,7 @@

Static Semantics: Early Errors

RegExpIdentifierPart :: `\` RegExpUnicodeEscapeSequence
  • - It is a Syntax Error if the CharacterValue of |RegExpUnicodeEscapeSequence| is not the code point value of *"$"*, *"_"*, <ZWNJ>, <ZWJ>, or some code point matched by the |UnicodeIDContinue| lexical grammar production.
    + It is a Syntax Error if the CharacterValue of |RegExpUnicodeEscapeSequence| is not the code point value of some code point matched by the |IdentifierPartChar| lexical grammar production.
RegExpIdentifierPart :: UnicodeLeadSurrogate UnicodeTrailSurrogate
@@ -46020,6 +46011,8 @@

Lexical Grammar

+
+

From 6d2ba3cdf5b28686fa68d9404df351eed8f8566b Mon Sep 17 00:00:00 2001
From: Michael Dyck
Date: Mon, 27 Sep 2021 18:35:01 -0700
Subject: [PATCH 2/3] Editorial: Introduce [RegExp]IdentifierCodePoint[s] SDOs (#2392)

This commit introduces SDOs `IdentifierCodePoints` and `IdentifierCodePoint`.

- This allows `StringValue` of _IdentifierName_ to be specified more precisely.

- It also simplifies two Early Error rules (involving _UnicodeEscapeSequence_), since they can now be expressed as constraints on a code point, rather than having to be translated into the space of String values.

----

Similarly, this commit introduces SDOs `RegExpIdentifierCodePoints` and `RegExpIdentifierCodePoint`.

- This allows `CapturingGroupName` of _RegExpIdentifierName_ to be specified more precisely.

- It also simplifies two Early Error rules (involving surrogate pairs).

(Note that the current algorithm for `CapturingGroupName` only 'normalizes' escape sequences, whereas this PR's algorithm also normalizes surrogate pairs. However, since the normalized text is immediately passed to `CodePointsToString`, the result should be the same. Given the Early Error rules for surrogate pairs, normalizing them made sense to me.)
---
 spec.html | 102 +++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 94 insertions(+), 8 deletions(-)

diff --git a/spec.html b/spec.html
index 19a030601d..ce117bf605 100644
--- a/spec.html
+++ b/spec.html
@@ -16215,16 +16215,55 @@

Static Semantics: Early Errors

IdentifierStart :: `\` UnicodeEscapeSequence
  • - It is a Syntax Error if the SV of |UnicodeEscapeSequence| is not ! UTF16EncodeCodePoint(_cp_) for some Unicode code point _cp_ matched by the |IdentifierStartChar| lexical grammar production.
    + It is a Syntax Error if IdentifierCodePoint of |UnicodeEscapeSequence| is not some Unicode code point matched by the |IdentifierStartChar| lexical grammar production.
IdentifierPart :: `\` UnicodeEscapeSequence
  • - It is a Syntax Error if the SV of |UnicodeEscapeSequence| is not ! UTF16EncodeCodePoint(_cp_) for some Unicode code point _cp_ matched by the |IdentifierPartChar| lexical grammar production.
    + It is a Syntax Error if IdentifierCodePoint of |UnicodeEscapeSequence| is not some Unicode code point matched by the |IdentifierPartChar| lexical grammar production.
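As a rough illustration of the two rewritten rules, the check can be sketched in JavaScript. This is not part of the patch: the function names are hypothetical, and the `ID_Start` / `ID_Continue` property escapes are assumed to be an adequate stand-in for |UnicodeIDStart| and |UnicodeIDContinue| in the engine running the sketch.

```javascript
const ZWNJ = 0x200c; // ZERO WIDTH NON-JOINER
const ZWJ = 0x200d;  // ZERO WIDTH JOINER

// Hypothetical counterpart of the IdentifierStartChar production.
function isIdentifierStartChar(cp) {
  const ch = String.fromCodePoint(cp);
  return ch === "$" || ch === "_" || /^\p{ID_Start}$/u.test(ch);
}

// Hypothetical counterpart of the IdentifierPartChar production.
function isIdentifierPartChar(cp) {
  const ch = String.fromCodePoint(cp);
  return ch === "$" || cp === ZWNJ || cp === ZWJ || /^\p{ID_Continue}$/u.test(ch);
}

// The early error, restated: an escaped code point is acceptable only if the
// same code point would have been accepted unescaped at that position.
function escapeIsAllowed(cp, position) {
  return position === "start" ? isIdentifierStartChar(cp) : isIdentifierPartChar(cp);
}

console.log(escapeIsAllowed(0x61, "start")); // a
console.log(escapeIsAllowed(0x30, "start")); // 0 (a digit cannot start an identifier)
console.log(escapeIsAllowed(0x30, "part"));  // 0 (but may continue one)
```

Expressing the constraint on a code point, as the `+` lines do, means the check needs no round trip through UTF-16 string values.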
+ + +

Static Semantics: IdentifierCodePoints

+
+
+ IdentifierName :: IdentifierStart
+
+   1. Let _cp_ be IdentifierCodePoint of |IdentifierStart|.
+   1. Return « _cp_ ».
+
+ IdentifierName :: IdentifierName IdentifierPart
+
+   1. Let _cps_ be IdentifierCodePoints of the derived |IdentifierName|.
+   1. Let _cp_ be IdentifierCodePoint of |IdentifierPart|.
+   1. Return the list-concatenation of _cps_ and « _cp_ ».
+
+
+ + +

Static Semantics: IdentifierCodePoint

+
+
+ IdentifierStart :: IdentifierStartChar
+
+   1. Return the code point matched by |IdentifierStartChar|.
+
+ IdentifierPart :: IdentifierPartChar
+
+   1. Return the code point matched by |IdentifierPartChar|.
+
+ UnicodeEscapeSequence :: `u` Hex4Digits
+
+   1. Return the code point whose numeric value is the MV of |Hex4Digits|.
+
+ UnicodeEscapeSequence :: `u{` CodePoint `}`
+
+   1. Return the code point whose numeric value is the MV of |CodePoint|.
+
+
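The two SDOs added above can be approximated by a small JavaScript sketch. It is not part of the patch: `identifierCodePoints` is a hypothetical helper that works on raw source text, and it assumes the text has already passed the grammar and Early Error checks, so it performs no validation of its own.

```javascript
// Returns the list of code points denoted by the source text of an
// IdentifierName, decoding `\uXXXX` and `\u{...}` escapes along the way.
function identifierCodePoints(idText) {
  const cps = [];
  let i = 0;
  while (i < idText.length) {
    if (idText[i] === "\\") {
      let hex;
      if (idText[i + 2] === "{") {
        // `\u{` CodePoint `}`: hex digits run up to the closing brace.
        const close = idText.indexOf("}", i);
        hex = idText.slice(i + 3, close);
        i = close + 1;
      } else {
        // `\u` Hex4Digits: exactly four hex digits follow the `u`.
        hex = idText.slice(i + 2, i + 6);
        i += 6;
      }
      cps.push(parseInt(hex, 16));
    } else {
      // An unescaped character contributes its own code point.
      const cp = idText.codePointAt(i);
      cps.push(cp);
      i += cp > 0xffff ? 2 : 1;
    }
  }
  return cps;
}

console.log(identifierCodePoints("\\u0061b"));      // code points of a, b
console.log(identifierCodePoints("\\u{1d49c}x"));   // U+1D49C followed by x
```

The spec defines the operation recursively over the parse tree; the loop here is only a convenient flat equivalent for already-valid text.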
@@ -17672,8 +17711,7 @@

Static Semantics: StringValue

 IdentifierName IdentifierPart
-  1. Let _idText_ be the source text matched by |IdentifierName|.
-  1. Let _idTextUnescaped_ be the result of replacing any occurrences of `\\` |UnicodeEscapeSequence| in _idText_ with the code point represented by the |UnicodeEscapeSequence|.
+  1. Let _idTextUnescaped_ be IdentifierCodePoints of |IdentifierName|.
   1. Return ! CodePointsToString(_idTextUnescaped_).
@@ -34415,7 +34453,7 @@
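The observable effect of the StringValue steps in the preceding hunk can be seen from ordinary JavaScript. The snippet below is an illustrative sketch, not part of the patch; the change is editorial, so the behaviour is the same before and after it.

```javascript
// Property names written with escapes have the unescaped StringValue.
const obj = { \u0061b: 1, c\u{64}: 2 };
const keys = Object.keys(obj);

// An escaped identifier and its unescaped spelling name the same binding.
const sameBinding = (() => {
  const \u0078 = 42;
  return x;
})();

console.log(keys, sameBinding);
```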

Static Semantics: Early Errors

RegExpIdentifierStart :: UnicodeLeadSurrogate UnicodeTrailSurrogate
  • - It is a Syntax Error if the result of performing UTF16SurrogatePairToCodePoint on the two code points matched by |UnicodeLeadSurrogate| and |UnicodeTrailSurrogate| respectively is not matched by the |UnicodeIDStart| lexical grammar production.
    + It is a Syntax Error if RegExpIdentifierCodePoint of |RegExpIdentifierStart| is not matched by the |UnicodeIDStart| lexical grammar production.
RegExpIdentifierPart :: `\` RegExpUnicodeEscapeSequence
@@ -34427,7 +34465,7 @@

Static Semantics: Early Errors

RegExpIdentifierPart :: UnicodeLeadSurrogate UnicodeTrailSurrogate
  • - It is a Syntax Error if the result of performing UTF16SurrogatePairToCodePoint on the two code points matched by |UnicodeLeadSurrogate| and |UnicodeTrailSurrogate| respectively is not matched by the |UnicodeIDContinue| lexical grammar production.
    + It is a Syntax Error if RegExpIdentifierCodePoint of |RegExpIdentifierPart| is not matched by the |UnicodeIDContinue| lexical grammar production.
UnicodePropertyValueExpression :: UnicodePropertyName `=` UnicodePropertyValue
@@ -34710,11 +34748,59 @@

Static Semantics: CapturingGroupName

RegExpIdentifierName RegExpIdentifierPart
-  1. Let _idText_ be the source text matched by |RegExpIdentifierName|.
-  1. Let _idTextUnescaped_ be the result of replacing any occurrences of `\\` |RegExpUnicodeEscapeSequence| in _idText_ with the code point represented by the |RegExpUnicodeEscapeSequence|.
+  1. Let _idTextUnescaped_ be RegExpIdentifierCodePoints of |RegExpIdentifierName|.
   1. Return ! CodePointsToString(_idTextUnescaped_).
+ + +

Static Semantics: RegExpIdentifierCodePoints

+
+
+ RegExpIdentifierName :: RegExpIdentifierStart
+
+   1. Let _cp_ be RegExpIdentifierCodePoint of |RegExpIdentifierStart|.
+   1. Return « _cp_ ».
+
+ RegExpIdentifierName :: RegExpIdentifierName RegExpIdentifierPart
+
+   1. Let _cps_ be RegExpIdentifierCodePoints of the derived |RegExpIdentifierName|.
+   1. Let _cp_ be RegExpIdentifierCodePoint of |RegExpIdentifierPart|.
+   1. Return the list-concatenation of _cps_ and « _cp_ ».
+
+
+ + +

Static Semantics: RegExpIdentifierCodePoint

+
+
+ RegExpIdentifierStart :: IdentifierStartChar
+
+   1. Return the code point matched by |IdentifierStartChar|.
+
+ RegExpIdentifierPart :: IdentifierPartChar
+
+   1. Return the code point matched by |IdentifierPartChar|.
+
+
+ RegExpIdentifierStart :: `\` RegExpUnicodeEscapeSequence
+
+ RegExpIdentifierPart :: `\` RegExpUnicodeEscapeSequence
+
+
+   1. Return the code point whose numeric value is the CharacterValue of |RegExpUnicodeEscapeSequence|.
+
+
+ RegExpIdentifierStart :: UnicodeLeadSurrogate UnicodeTrailSurrogate
+
+ RegExpIdentifierPart :: UnicodeLeadSurrogate UnicodeTrailSurrogate
+
+
+   1. Let _lead_ be the code unit whose numeric value is that of the code point matched by |UnicodeLeadSurrogate|.
+   1. Let _trail_ be the code unit whose numeric value is that of the code point matched by |UnicodeTrailSurrogate|.
+   1. Return UTF16SurrogatePairToCodePoint(_lead_, _trail_).
+
+
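The surrogate-pair case of RegExpIdentifierCodePoint can be sketched in JavaScript as follows. This is an illustration, not part of the patch: the function is a hand-written counterpart of the spec's UTF16SurrogatePairToCodePoint, and the sample pair is chosen arbitrarily.

```javascript
// Combines a lead and a trail surrogate code unit into one code point.
function utf16SurrogatePairToCodePoint(lead, trail) {
  return (lead - 0xd800) * 0x400 + (trail - 0xdc00) + 0x10000;
}

// U+1D49C MATHEMATICAL SCRIPT CAPITAL A, written as two code units.
const lead = 0xd835;
const trail = 0xdc9c;
const cp = utf16SurrogatePairToCodePoint(lead, trail);

// Cross-check against the engine's own UTF-16 decoding.
const viaString = String.fromCharCode(lead, trail).codePointAt(0);

// The rewritten Early Error rule then tests this single code point,
// here approximated with the ID_Start property escape.
const allowedAtStart = /^\p{ID_Start}$/u.test(String.fromCodePoint(cp));

console.log(cp.toString(16), viaString === cp, allowedAtStart);
```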
From 1901514ff9aaf2041fc89f9007848910db5bac9e Mon Sep 17 00:00:00 2001
From: Michael Dyck
Date: Mon, 27 Sep 2021 18:35:05 -0700
Subject: [PATCH 3/3] Editorial: Move 2 paragraphs down one level (#2392)

... from 12.6 Names and Keywords down to 12.6.1 Identifier Names.

I think this makes it clearer that this prose is mostly saying the same thing as the associated Early Error rules and SDOs.
---
 spec.html | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/spec.html b/spec.html
index ce117bf605..c02bcd844f 100644
--- a/spec.html
+++ b/spec.html
@@ -16163,8 +16163,6 @@

Names and Keywords

This standard specifies specific code point additions: U+0024 (DOLLAR SIGN) and U+005F (LOW LINE) are permitted anywhere in an |IdentifierName|, and the code points U+200C (ZERO WIDTH NON-JOINER) and U+200D (ZERO WIDTH JOINER) are permitted anywhere after the first code point of an |IdentifierName|.

-

Unicode escape sequences are permitted in an |IdentifierName|, where they contribute a single Unicode code point to the |IdentifierName|. The code point is expressed by the |CodePoint| of the |UnicodeEscapeSequence| (see ). The `\\` preceding the |UnicodeEscapeSequence| and the `u` and `{ }` code units, if they appear, do not contribute code points to the |IdentifierName|. A |UnicodeEscapeSequence| cannot be used to put a code point into an |IdentifierName| that would otherwise be illegal. In other words, if a `\\` |UnicodeEscapeSequence| sequence were replaced by the |SourceCharacter| it contributes, the result must still be a valid |IdentifierName| that has the exact same sequence of |SourceCharacter| elements as the original |IdentifierName|. All interpretations of |IdentifierName| within this specification are based upon their actual code points regardless of whether or not an escape sequence was used to contribute any particular code point.

-

Two |IdentifierName|s that are canonically equivalent according to the Unicode standard are not equal unless, after replacement of each |UnicodeEscapeSequence|, they are represented by the exact same sequence of code points.

Syntax

PrivateIdentifier ::
@@ -16209,6 +16207,8 @@

Syntax

Identifier Names

+

Unicode escape sequences are permitted in an |IdentifierName|, where they contribute a single Unicode code point to the |IdentifierName|. The code point is expressed by the |CodePoint| of the |UnicodeEscapeSequence| (see ). The `\\` preceding the |UnicodeEscapeSequence| and the `u` and `{ }` code units, if they appear, do not contribute code points to the |IdentifierName|. A |UnicodeEscapeSequence| cannot be used to put a code point into an |IdentifierName| that would otherwise be illegal. In other words, if a `\\` |UnicodeEscapeSequence| sequence were replaced by the |SourceCharacter| it contributes, the result must still be a valid |IdentifierName| that has the exact same sequence of |SourceCharacter| elements as the original |IdentifierName|. All interpretations of |IdentifierName| within this specification are based upon their actual code points regardless of whether or not an escape sequence was used to contribute any particular code point.

+

Two |IdentifierName|s that are canonically equivalent according to the Unicode standard are not equal unless, after replacement of each |UnicodeEscapeSequence|, they are represented by the exact same sequence of code points.
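The paragraph above can be illustrated with a short JavaScript sketch. It is not part of the patch; the property names are chosen only to produce a canonically equivalent pair (precomposed U+00E9 versus `e` followed by U+0301).

```javascript
// Two names that are canonically equivalent under Unicode normalization,
// written with escapes so the difference is visible in the source text.
const obj = { \u00E9: "precomposed", e\u0301: "decomposed" };
const names = Object.keys(obj);

// After NFC normalization the two names are the same string...
const canonicallyEquivalent =
  names[0].normalize("NFC") === names[1].normalize("NFC");

// ...but as identifier names they are compared code point by code point.
const equalAsIdentifiers = names[0] === names[1];

console.log(names.length, canonicallyEquivalent, equalAsIdentifiers);
```

Because the two spellings stay distinct, the object literal defines two separate properties rather than one.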

Static Semantics: Early Errors