Formal grammar

This is the canonical formal specification of Carve. It is layered, and each layer is stated in a declared, machine-interpretable formalism - no rule lives in prose alone:

Layer	Content	Formalism
PART 0	Line layout: indentation, container prefixes, lazy continuation	Deterministic line automaton
PARTS 1-8	Block and inline productions	EBNF + declared guard notation (lookahead, lookbehind classes, `where` counting guards)
PART 9	Semantic constraints a context-free production cannot carry (emphasis resolution, paragraph interruption, table span walk, attribute floating, tabs)	Operational semantics: labeled rules over declared state
PART 9R	Whole-document resolution (references, footnotes, crossrefs, numbering)	Two-pass rules over declared symbol tables
PART 10	HTML serialization	Tree-transform conventions
PART 11	Canonical source writer (`carve fmt`): round-trip invariants and the escaping rule	Invariants over `parse`/`fmt` + a decision procedure
PART 12	AST serialization: the JSON shape a parsed document exchanges as	Reference-implementation field names + a round-trip invariant

Why layers instead of one grammar?

Light markup languages are provably not context-free: fence-length matching is a counting constraint, indentation is 2D, and reference resolution needs a whole-document symbol table. No single EBNF can express Carve (or Djot, or CommonMark). What CAN be done - and what this file does - is state every rule in some exact formalism, so nothing normative rests on English prose.

Carve Core: the executable spec

Carve Core is the subset of Carve whose specification is directly executable - each spec layer exists as a machine-interpretable artifact:

Spec layer	Executable artifact
PART 0 layout automaton + list/quote structure	`scripts/spec/layout.mjs`
PART 3 inline grammar	`resources/carve-core.ohm` (Ohm/PEG)
PART 9R resolution + PART 10 serialization	`scripts/spec/html.mjs`

The executable spec covers the full conformant core: block structure (headings incl. multi-line folding and section wrapping, lists with every ordered dialect, quotes, tables with the span walk, fenced code and colon fences, definition lists, comments, frontmatter, block-attribute lines), the complete inline layer (emphasis with word-boundary guards, links, images, spans, attributes with the security hardening rules, autolinks, math, extensions, mentions/tags, editorial markup, smart typography, footnotes incl. inline notes, crossrefs, raw passthrough), and the resolution passes (references, footnote numbering and endnotes placement, numbered captions, abbreviations).

bash

npm run core:check

The gate demands byte-identical HTML for every pair in the conformance corpus. Rules a pure PEG cannot state are executed as declared predicates in the layout automaton (fence-length counting, the where guards) or as a pre-scan (the emphasis close-first delimiter-stack rule), so the pipeline never silently diverges from the delimiter-stack semantics.

Implementations should match this grammar. The case study explains the design rationale, the reference page covers parsing edge cases, and the examples show the expected HTML output for each construct.

The full grammar lives at resources/grammar.ebnf in the repository.

ebnf

(* ============================================================================
   Carve Markup Language - EBNF Grammar
   Version: 0.1

   This grammar defines the syntax of Carve, a post-Markdown lightweight markup
   language with visual mnemonics and human-centered design.

   Notation:
   - (*...*) = comment
   - 'x'     = terminal character
   - "x"     = terminal string
   - x | y   = alternation
   - x y     = concatenation
   - [x]     = optional (0 or 1)
   - {x}     = repetition (0 or more)
   - x+      = repetition (1 or more)
   - {x}+    = repetition (1 or more) -- same as x+, used where the repeated
               unit is a group
   - x - y   = exception (x but not y)
   - (x)     = grouping

   GUARD NOTATION (formal predicates -- machine-interpretable, PEG-class).
   These are the ONLY devices beyond pure context-free EBNF that the
   productions use. Each has exact, implementable semantics; none is prose.
   - &(p)      = positive lookahead: the input AHEAD must match p; consumes
                 nothing (PEG `&`). p may be a full production, so unbounded
                 lookahead such as "a matching closer exists ahead" is legal.
   - !(p)      = negative lookahead: the input ahead must NOT match p;
                 consumes nothing (PEG `!`).
   - <&(c) / <!(c) = single-character look-BEHIND: the character immediately
                 BEFORE the current position must (must not) match the
                 character class c; consumes nothing. Left context is always
                 exactly one already-scanned character or START, so this is a
                 regular condition, mechanically checkable.
   - x:name    = capture: bind the text matched by x to `name`.
   - where C   = semantic guard: C is a boolean expression over captures
                 using only len(name), char(name) (the run's character),
                 numeric comparison, and equality. A production carrying
                 `where` matches only when C holds. This is the single
                 counting device (fence-length matching); everything else
                 above is regular lookaround.
   - prod(d)   = production TEMPLATE: a production parameterized by a
                 terminal d, instantiated once per listed terminal (an
                 indexed-grammar macro; expansion is purely mechanical).
   - START, EOL = virtual zero-width terminals (start of inline run /
                 end of line).
   Character classes for guards: ws = space | tab | newline; alnum =
   letter | digit; punct = ascii_punctuation; non_ws = character - ws.

   NORMATIVITY
   - This file (resources/grammar.ebnf) is the NORMATIVE specification of
     Carve. Where any document disagrees with it, this file wins. It is
     LAYERED, each layer in a declared formalism: PART 0 (layout automaton)
     -> PARTS 1-8 (EBNF productions + the GUARD NOTATION above) -> PART 9
     (operational semantics for the non-context-free rules) -> PART 9R
     (two-pass resolution semantics) -> PART 10 (serialization). PART 9 and
     PART 9R are the authority for behavior the productions cannot encode.
   - docs/case-study/syntax.md and docs/edge-cases.md are EXPLANATORY and
     DERIVED — prose for humans, non-normative.
   - docs/examples.md and the generated tests/corpus/* pairs are the
     CONFORMANCE CONTRACT: every example is checked against the reference
     implementation in CI. A spec change is not real until a corpus pair
     pins it.
   ============================================================================ *)

(* ============================================================================
   PART 0: LAYOUT LAYER (L0) -- LINE AUTOMATON (NORMATIVE)
   ============================================================================

   The block productions (PARTS 1-2) are context-free only OVER THE OUTPUT OF
   THIS LAYER. Indentation, container prefixes, and lazy continuation are 2D
   properties no string grammar can express (they need column arithmetic and
   carried container state); L0 states them as a deterministic one-pass line
   automaton instead of prose. The three engine implementations realize
   this automaton literally.

   INPUT.  Lines, split on newline ('\n' | '\r\n'). Column arithmetic is
           PART 9 §24 C1: a space advances the column by 1, a tab to the next
           multiple of 4; tabs outside indentation are preserved verbatim.
   STATE.  open := stack of open containers, bottom-up (document ->
           innermost), plus the current leaf (none | paragraph | heading |
           ...). Container kinds and the per-line prefix each demands:
             block quote    marker `>` [+ one space]
             list item      indentation to content_column (PART 9 §24 C3)
             footnote def   indentation >= 2 spaces (PART 9 §16)
             fenced body    (code / raw / comment / admonition / div /
                            line block / hard-break block) -- no per-line
                            prefix; bounded by its where-guarded fence pair
   STEP (per line L, executed exactly once -- no backtracking, Design
         Principle 1: the contribution a line makes to block structure never
         depends on a future line, the only lookahead being the DECLARED
         &(...) closer guards on fence / frontmatter openers):
     S1 MATCH PREFIXES. Walk `open` bottom-up, consuming each container's
        prefix from L (strip `> ` for quotes; DEDENT by columns for
        indentation containers, PART 9 §24 C5). Stop at the first container
        whose prefix L does not supply.
     S2 FENCED BODY. If the innermost matched container is a fenced body, L
        is verbatim content unless it matches that fence's where-guarded
        closer (then pop). No other rule applies to L.
     S3 FULL MATCH (every prefix consumed). The residue of L continues the
        innermost container: BLANK -> close the open paragraph (recorded for
        tight/loose, PART 9 §17 L1); a BLOCK OPENER -> the interruption
        relation decides (PART 9 §10); a LIST MARKER reaching
        content_column -> sublist (PART 9 §24 C3); otherwise paragraph /
        lazy item text.
     S4 PARTIAL MATCH. The unmatched containers are candidates to close.
        LAZY CONTINUATION: if the innermost MATCHED context holds an OPEN
        PARAGRAPH and the residue is NOT an interrupting line (PART 9 §10 --
        list markers and plain text fold; visible openers interrupt), L
        folds into that paragraph and NOTHING closes. Otherwise close the
        unmatched containers and re-classify the residue in the surviving
        context (S3).
     S5 OPENERS. When S3/S4 classify the residue as a container opener (a
        quote marker; a list marker at an admissible column; a `+`
        continuation marker at the marker column, PART 9 §17 L3/L4; a fence
        opener whose closer-lookahead guard holds), push the container and
        recurse on the remaining residue.
   OUTPUT. The layout-resolved line stream (equivalently: INDENT / DEDENT /
           CONT events plus line residues). PARTS 1-2 parse blocks over it;
           inline parsing (PART 3) runs per finished block -- the two-phase
           model of PART 8.
   ============================================================================ *)

(* ============================================================================
   PART 1: DOCUMENT STRUCTURE
   ============================================================================ *)

document = [frontmatter], {block}, EOF ;

frontmatter = &(frontmatter_open, {content_line - frontmatter_close}, frontmatter_close),
              frontmatter_open, frontmatter_content, frontmatter_close, newline ;
(* FORMAL closer lookahead (was prose): the leading &(...) guard demands that
   a closing `---` line exist ahead before the opener is claimed. With no
   closer the guard fails, so a bare `---` at document start is an ordinary
   thematic break and the following lines are ordinary blocks (same lookahead
   model as PART 9 §10). *)
(* The opening delimiter may carry a format token (yaml | json | toml | neon |
   ...); a bare `---` defaults to `yaml`. The space before the token is OPTIONAL
   (lenient: both `---yaml` and `--- yaml` are accepted; `---yaml` is canonical).
   The closing delimiter is always a bare `---`. The token distinguishes a typed
   opener from a thematic break. *)
frontmatter_open = "---", [space], [frontmatter_format], newline ;
frontmatter_close = "---", newline ;
frontmatter_format = (letter | digit)+ ;
frontmatter_content = {content_line - frontmatter_close} ;
content_line = {character - newline}, newline ;
(* frontmatter_content lines are metadata in the named format, handed to that
   format's parser; the grammar treats them as opaque lines. *)

block = heading
      | thematic_break
      | code_block
      | blockquote
      | list
      | table
      | line_block
      | local_hard_break_block
      | admonition
      | div
      | comment_block
      | comment_line
      | raw_block
      | reference_definition
      | footnote_definition
      | abbreviation_definition
      | paragraph
      | block_attributes
      | blank_line ;
(* `block_attributes` is a standalone block-level line that renders
   nothing on its own; it attaches to a following block. Its precise
   reach (floating across blank lines), accumulation, and drop-if-dangling
   semantics are NOT context-free -- they are pinned in PART 9 §15.
   `comment_line`, `reference_definition`, `footnote_definition` and
   `abbreviation_definition` are the INVISIBLE blocks: they are parsed and
   consumed (collected into the definition table / dropped) and emit no
   output; like `block_attributes` they also interrupt an open paragraph
   (PART 9 §10, INVISIBLE CONSTRUCTS). *)

blank_line = {whitespace}, newline ;

(* ============================================================================
   PART 2: BLOCK ELEMENTS
   ============================================================================ *)

(* --- Headings --- *)
(* ATX (`#` prefix) only. Setext (underline) headings are intentionally
   NOT supported, matching djot: a `---` underline collides with the
   thematic break and frontmatter delimiter, reintroducing the ambiguity
   djot removed and breaking Design Principle 1 ("one syntax, one
   meaning"). *)
heading = atx_heading ;

atx_heading = heading_first_line, {heading_continuation_line} ;
heading_first_line = heading_marker, space, inline_content, newline ;
heading_marker = '#' | "##" | "###" | "####" | "#####" | "######" ;
(* COLUMN ZERO -- NORMATIVE. A heading_marker sits at COLUMN 0 of its line (no
   leading spaces or tabs); carve does NOT accept CommonMark's 0-3 space indent.
   An indented `#`-line is ordinary paragraph text: "   # H" is <p># H</p>, not
   a heading. This applies to a continuation line too -- an indented
   heading_marker is not a continuation marker, so it folds into the heading as
   literal text rather than being stripped. (Within a container the column is
   measured after the container's own markers are stripped, e.g. "> # H" is a
   quoted heading.) Matches Djot (which dropped CommonMark's indent fuzz);
   pinned by corpus 101-heading-marker-column-zero. *)

(* MULTI-LINE HEADINGS -- NORMATIVE (like Djot). A heading's text spills onto
   following lines until a blank line. There are THREE heading-specific rules:
     1. CONTINUATION: a following line carries the SAME number of `#` markers
        (stripped) or NONE, and its text is appended to the heading -- exactly
        djot ("preceded by the same number of `#` characters ... can also be
        left off"). A line with a DIFFERENT `#` count (MORE or FEWER) is NOT a
        continuation: it starts a NEW heading.
     2. A blank line ends the heading.
     3. A caption (`^ ` ...) attaches to it (§4) and so ends the text.
   Everything else that "ends" a heading is NOT a heading rule -- it is the
   general block structure (a heading is a bounded title, not an open
   paragraph): any line that begins a block (quote, table, fenced code, `:::`
   div, thematic break, a `%%%` comment) ends the heading and starts that
   block, and a LIST MARKER -- with no open paragraph in a title to fold into
   (§10) -- starts a sibling list. This is the SAME outcome the top level gives:
   "# H \n - item" is a heading PLUS a `<ul>`, exactly as "<no open paragraph>
   \n - item" starts a list. So the heading adds no per-construct enders of its
   own beyond the three above.
   Only PLAIN text (and a same-`#` marker line) folds into the heading. The
   heading id is derived from the full folded text.
   NO TRAILING ATTRIBUTES (djot-strict). A heading line carries NO trailing
   `{...}` attribute block: a `{...}` at the end of a heading line is ordinary
   inline content (literal text unless it forms an inline construct), and the
   id derives from the full literal text. Attributes attach via a PRECEDING
   block-attribute line (PART 9 §15), the uniform block rule; an explicit
   `#id` from that line lives on the `<section>` wrapper (PART 9 §13).
   Pinned by corpus 02-headings (the preceding-line and literal-trailing
   pairs). *)
heading_continuation_line = [heading_marker, space], inline_content, newline ;
   (* the optional marker must be the SAME count as the open heading's (djot);
      a different count starts a new heading -- see the NORMATIVE note above *)

(* HEADING IDENTIFIERS -- NORMATIVE. A heading without an explicit {#id} gets an
   automatic identifier from its folded plain text (symbols `:name:` and footnote
   references excluded), GitHub/SSG-style:
     - reverse smart-typography output to its ASCII source first, so the id does
       not depend on presentation (`Don’t` -> `Don-t`, `Step 1 → 2` -> `Step-1-2`);
     - replace each maximal run of non-alphanumeric ASCII characters with a single
       `-` (covers spaces, punctuation, `_`, and runs of `-`) -- the jgm/djot#393 rule;
     - trim leading/trailing `-`;
     - PRESERVE CASE (Unicode-aware): NON-ASCII characters AND letter case pass
       through unchanged (`Café` -> `Café`, `日本語` unchanged). NO Unicode
       normalization (NFC) is applied, so the slug needs no Unicode tables and is
       byte-identical across implementations;
     - a leading-digit result is prefixed `s-` (a bare leading digit is a valid HTML
       id but an invalid CSS selector); an empty result becomes a generated `s-N`;
     - duplicates within a document get a numeric `-2`, `-3`, ... suffix.
   An explicit {#id} is used verbatim (case preserved). Implementations MAY offer
   OPT-IN, orthogonal transforms: lowercasing (carve-js `lowercaseHeadingIds`;
   carve-php `LowercaseHeadingIdsExtension`; carve-rs `lowercase_heading_ids`) and
   ASCII-folding for URL/CSS-fragment portability (carve-js `asciiHeadingIds`;
   carve-php `AsciiHeadingIdsExtension`; not available in carve-rs). Neither is the
   default. NOTE: carve PRESERVES case, matching djot.js / djot-php per #393;
   cross-references resolve case-insensitively against the case-preserved id table. *)

(* --- Thematic Breaks --- *)
thematic_break = (('-', '-', '-', {'-'}) | ('*', '*', '*', {'*'}) | ('_', '_', '_', {'_'})), newline ;

(* --- Code Blocks --- *)
code_block = fenced_code_block, [caption_slot] ;
(* A trailing caption (PART 9 §4) wraps the code block in a figure: a
   numbered LISTING. The caption may carry a `#` number placeholder, and a
   `</#id>` to the block resolves to "Listing N" (PART 9 §19). *)

fenced_code_block = code_fence_open, [space], [code_fence_info], newline,
                    code_content,
                    code_fence_close, newline ;
(* The space between the fence and the info string is OPTIONAL (lenient: both
   ```php and ``` php are accepted; Markdown writes no space, Djot writes the
   space). The no-space form (```php) is canonical and is what the X->Carve
   converters emit. *)
(* CODE-FENCE INFO STRING -- NORMATIVE. After the optional language_info
   token the opener admits, in this fixed order, an optional "header" (a
   quoted_title -- the SAME token as the admonition header, PART 9 §12) and an
   optional bracketed [label]. Both are STRUCTURED METADATA -- neither is part
   of the language/class:
     - HEADER ("...") -- a human-visible title for the block. Carried by core
       as the `title` attribute on the <pre> (rendering choice A: code's
       <pre><code> is atomic preformatted text, so the title cannot be a child
       element the way an admonition title is -- it rides as an attribute, and
       the host CSS/JS decides whether to draw a filename bar or leave it as the
       native mouseover tooltip). The header uses the same quoted_title TOKEN as
       the admonition header, but because it targets an HTML attribute the text
       is LITERAL (not inline-parsed) -- only HTML-escaped -- so markup-like
       characters in a filename (`*.config.js`, `a_b.py`) survive intact. An
       empty "" is a supplied empty title (`title=""`).
     - LABEL ([...]) -- a short grouping identifier. The core renderer ignores
       it; a group extension (e.g. code-group) uses it as the tab label.
   A code fence still carries NO inline {...} attributes -- use the PRECEDING
   block-attribute line for class/id/data-* (PART 9 §15):
       {.fancy #x}
       ```php "src/Auth.php"
       ...
       ```
   which renders on the <pre> (language stays `language-…` on the <code>). If
   the preceding line ALSO sets `title`, the preceding line wins (the explicit
   attribute channel overrides the opener-header sugar).
   INVALID-FENCE FALLBACK: when the text after the language token is anything
   other than an optional "header" then an optional [label] in that order -- a
   bare second word, a key="value" pair (```js title="x"), an inline `{...}`
   block (``` php {.x}), or the wrong order (```php [l] "h") -- the line is NOT
   a fenced code block; it falls back to ordinary inline parsing (the backtick
   run typically opens an inline code span). Quotes and brackets are the only
   delimiters that admit metadata. The raw passthrough opener (```=FORMAT,
   PART 9 §20) is a SEPARATE production matched before this one (a leading `=`
   is never a language, and a raw block takes no header). *)

code_fence_open  = (backtick_fence | tilde_fence):open ;
code_fence_close = (backtick_fence | tilde_fence):close
                   where char(close) = char(open) and len(close) >= len(open) ;
(* FORMAL (PART 9 §2, now stated as a guard): the closer uses the SAME fence
   character as its opener and is at least as long. The `where` clause is the
   grammar's counting device -- see GUARD NOTATION in the header. *)
(* COLUMN-EXACT DELIMITERS -- NORMATIVE (PART 2). A fenced-code delimiter --
   the opener AND the closer -- sits exactly at its container's content column;
   after container prefixes are stripped (PART 0), that means NO leading
   whitespace (at the top level, column 0). Carve has no indented-code-block
   construct, so leading spaces carry no disambiguating meaning and the fence is
   strict like every other block opener -- an indented heading, thematic break,
   or block quote is likewise plain text (PART 9 §10). Consequences:
     - A backtick/tilde run indented at the document level is an ordinary
       paragraph line; its runs fall back to inline code spans.
     - A closing run indented PAST its opener is not a delimiter but code
       content -- which is exactly what lets an indented ``` line appear as
       sample text inside a fence.
   Earlier drafts accepted 0-3 columns of indentation on a delimiter (a Markdown
   inheritance, to disambiguate against the 4-space indented code block); Carve
   has no such construct, so that tolerance is removed. The X->Carve converters
   re-base an indented Markdown fence to its container's content column. *)

backtick_fence = '`', '`', '`', {'`'} ;
tilde_fence = '~', '~', '~', {'~'} ;

code_fence_info = ( language_info, [space+, quoted_title], [space+, label] )
                | ( quoted_title, [space+, label] )
                | label ;
(* fixed order: language, then "header", then [label]; each optional, but a
   non-empty info string matches one of these three shapes or it is an
   INVALID-FENCE FALLBACK (see the NORMATIVE note above). quoted_title is the
   shared header token defined under Admonitions. *)
language_info = (letter | digit | '-' | '_' | '+' | '#' | '.' | '/')+ ;  (* a single token; the punctuation covers real language tags like c++, c#, f#, asp.net, text/html. May start with a digit. The `/` is normative across all impls (so `text/html`, `image/svg` are language tokens, not split). A token MUST NOT start with `=`: a leading `=` is the raw_block opener (above), not a language. *)
label = '[', { character - ']' }, ']' ;  (* structured metadata, e.g. [Installation]; not part of the class *)
code_content = (* any text until matching fence, preserved literally *) ;
(* TABS IN CODE -- NORMATIVE. Literal tab characters in code content (fenced
   code blocks and inline code spans) are PRESERVED verbatim; a tab and N spaces
   are not interchangeable. Tab display width is a presentation concern (CSS
   `tab-size`). Tab-to-space expansion is NOT a default behavior -- it is opt-in
   via a tab-normalize extension (flat replacement, default 2 spaces, content
   only). Matches djot / CommonMark. Pinned by corpus tabs-in-code-preserved. *)

(* --- Blockquotes --- *)
blockquote = blockquote_line, {blockquote_line | lazy_continuation_line | continuation_marker_block}, [caption_slot] ;
(* marked and lazy lines may interleave: `> a` / lazy / `> b` is ONE quote.
   A `+` continuation_marker_block attaches a flush-left block to the quote -- see
   the CONTINUATION MARKER note below and PART 9 §17. *)
blockquote_line = '>', [' '], inline_content, newline ;

(* LAZY CONTINUATION -- NORMATIVE (CommonMark-compatible). After one or more
   blockquote_lines, a line that does NOT carry the '>' marker still continues
   the blockquote -- exactly as if it carried '>' -- provided it is:
     - not blank (a blank line ends the blockquote), and
     - not a block-opener: a heading, table, fenced code, `:::` div, thematic
       break, OR an "invisible" reference / footnote / abbreviation definition
       or comment -- each ends the blockquote and starts that block OUTSIDE it,
       and
     - not a caption ('^ ' ...), which attaches to the blockquote instead.
   A LIST MARKER (bullet or ordered) is NOT an exclusion: a quoted line ends in
   an OPEN PARAGRAPH, and a list marker folds into an open paragraph (§10), so it
   continues the quote exactly as it folds into a top-level paragraph. Thus
   "> quoted \n - item" is ONE quote whose paragraph is "quoted\n- item" -- not a
   quote plus a sibling list. (This makes the quoted paragraph obey §10 like any
   paragraph; only a HEADING, a bounded title, is ended by a list marker. To put
   a real list inside a quote, `>`-prefix it or use the `+` continuation marker,
   §17.)
   Only PLAIN text and a folded list marker continue the open paragraph. The lazy line's text
   is appended to the blockquote's inner content before that content is
   block-parsed, so a hard-wrapped quoted paragraph need not repeat '>' on every
   line.
   CONTINUATION MARKER (PART 9 §17): to attach a real BLOCK (a list, fenced code,
   table, ...) to the quote without `>`-prefixing every line, put a lone `+` at
   column 0 right after a quoted line; the following flush-left block joins the
   quote body. This is the un-prefixed analogue of the list item's `+`. *)
lazy_continuation_line = inline_content, newline ;  (* see the NORMATIVE note above for which lines qualify *)

(* --- Lists --- *)
list = unordered_list | ordered_list | definition_list ;

unordered_list = unordered_item+ ;
unordered_item = bullet_marker, [item_attributes], space, [task_marker], list_item_content ;
(* the optional task_marker makes the item a TASK item (checkbox); the
   plain-vs-task axis is a §11 same-list criterion *)
bullet_marker = '-' | '*' ;  (* `+` is NOT a bullet in Carve -- it is the list
                                continuation marker (PART 9 §17); a `+ ` line is
                                ordinary paragraph text. Deviates from djot. *)
(* MARKER REQUIRES CONTENT -- NORMATIVE. A bullet or ordered marker is a list
   item only when followed by a space AND non-empty content. A content-less
   marker line -- bare (`-`) or with trailing whitespace only (`- `, `-   `) --
   is NOT a list: it is paragraph text. The rule ignores trailing whitespace, so
   `-` and `- ` behave identically (an editor stripping the trailing space cannot
   change the meaning). Carve is stricter than CommonMark, which treats a bare
   `-` as an empty item; requiring content keeps a lone dash (a prose dash /
   placeholder) from silently becoming a list. Pinned by corpus
   content-less-marker-is-not-a-list. *)

ordered_list = ordered_item+ ;
ordered_item = ordered_marker, [item_attributes], space, list_item_content ;
(* The first item fixes the dialect (decimal / alpha / roman) and the
   delimiter (`.` or `)`); classification and the ambiguous-letter
   tie-break are in PART 9 §11. No list -- ordered OR unordered -- interrupts a
   paragraph: an ordered marker in any dialect/value (`1.`, `2.`, `1985.`, `a.`,
   `i.`) and a bullet (`- `/`* `) alike need a blank line before them (matching
   Djot, avoiding the CommonMark `1.`-only heuristic) -- the symmetric
   paragraph-interruption rule is PART 9 §10.
   The two delimiters are the trailing `.` and `)` only. djot's
   parenthesized `(1)` / `(a)` form is INTENTIONALLY NOT a marker -- it is
   too easily confused with a prose parenthetical -- so `(1) text` stays
   literal paragraph text (pinned by corpus parenthesized-ordered-marker). *)
ordered_marker = (digit+ | letter | roman_numeral), ('.' | ')') ;
roman_numeral = ('i' | 'v' | 'x' | 'l' | 'c' | 'd' | 'm')+
              | ('I' | 'V' | 'X' | 'L' | 'C' | 'D' | 'M')+ ;

(* LIST-ITEM ATTRIBUTES (Carve addition; NORMATIVE -- extends PART 9 §15). An
   attribute block ABUTTING the marker (no space between marker and `{`)
   attaches its attributes to the `<li>` itself; the marker's required space
   follows the block:
     `-{.c} text`     -> `<li class="c">text</li>`
     `3.{#x k=v} text`-> `<li id="x" k="v">text</li>`
   For task items the block abuts the marker, before the task marker:
     `-{.c} [ ] text` -> `<li class="c"><input ...> text</li>`.
   WHITESPACE IS THE DISCRIMINATOR -- NORMATIVE:
   - `-{.c} text` (attr ABUTS marker) -> the `{.c}` is part of the MARKER and
     attributes the `<li>`. This is the ONLY way to attribute the `<li>`.
   - `- {.c} text` (a space BEFORE `{`) -> the `{.c}` is ordinary item CONTENT,
     NOT a li-attribute. It then follows the normal inline/block rules: a
     same-line `{.c}` is leading inline content; a `{.c}` alone on its own line
     (inside the item) is a block-attribute line that floats to the next block
     within the item -- e.g. a `{.blue}` line preceding an indented `> quote`
     puts `.blue` on the `<blockquote>` (§15), NOT on the `<li>`. This keeps
     li-attributes and block-attributes cleanly separated.
   - DISAMBIGUATION of the abutting block mirrors the inline-span rule (§14):
     it is consumed as li-attributes only if it yields >= 1 attribute (`#id`,
     `.class`, `key=value`) OR is the blessed empty block (`-{} text` -> bare
     `<li>`). Otherwise (`-{+a+}`, `-{not attrs}`) the `-{` is not a marker and
     the line stays ordinary paragraph/inline text.
   - Free slot: `-{` and `1.{` / `1){` are NOT list markers today (bullet and
     ordered both require the space), so no existing input changes meaning.
   - Diverges from djot (which has no li-attribute form); deliberate Carve
     extension. The lazy-continuation accident -- a trailing `{…}` line folded
     onto a tight item, which carve-php attached to the `<li>` and carve-js
     dropped -- is REJECTED as the mechanism: it was position-dependent (broke
     once a sub-list intervened) and is not normative. *)
item_attributes = attributes ;

list_item_content = first_block_content
                  | (inline_content, newline, {list_continuation | nested_list | continuation_marker_block}) ;
list_continuation = indent, inline_content, newline ;
nested_list = indent, list ;
indent = whitespace, {whitespace} ;
(* Indentation is measured in VISUAL COLUMNS (a tab advances to the next
   multiple of 4 -- CommonMark tab stops); how much indent nests what is NOT
   a fixed character count -- the content-column rules are PART 9 §24. *)
(* MARKER-LINE NESTED LIST -- NORMATIVE. A sub-list whose first marker sits on
   the PARENT item's marker line (`- - A`) is an ordinary persistent nested_list,
   identical to writing the sub-marker on its own indented line. It is NOT a
   one-off lone item: following markers at the sub-list's content column MERGE
   into the same nested list, and a post-blank block indented to that column is
   ABSORBED into the open nested item as a lazy continuation. This matches
   reference djot.js (`@djot/djot`) and CommonMark. It CORRECTS a narrower
   reading carve inherited from djot-php (which did not persist the nested list);
   it is a bug fix, NOT a carve divergence. Pinned by corpus
   marker-line-nested-lists. *)
(* Continuation marker (Carve addition; see PART 9 §17): a lone `+` at the
   current container's marker column attaches the following flush-left block to
   that container -- a list item OR a block quote -- with no blank line, marker
   prefix, or indentation, keeping the container tight. *)
continuation_marker = '+', newline ;
continuation_marker_block = continuation_marker, block ;
(* First-block item (Carve addition; see PART 9 §17): a lone `+` as the sole
   content right after the marker (`- +`) opens an item whose body is the
   flush-left block(s) that follow, with no inline lead and no indentation.
   `- + text` keeps `+ text` as literal inline content -- only a bare `+`
   triggers this. *)
first_block_content = continuation_marker, {block} ;

(* Task lists are unordered lists with special markers *)
task_marker = '[', task_state, ']', space ;
(* `x`/`X` render a CHECKED checkbox; every other state (` `, `-`, `_`,
   `>`, `?`) renders an UNCHECKED checkbox. Matches djot-php. *)
task_state = ' ' | 'x' | 'X' | '-' | '_' | '>' | '?' ;

(* --- Definition Lists --- *)
definition_list = definition_entry+ ;
definition_entry = definition_term+, [blank_line], definition_body,
                   {[blank_line], definition_body} ;
(* A blank line may separate a term from its definition, and one definition
   from the next, for readability -- the blank is a separator only (djot
   parity). A blank NOT followed by a `:  ` definition ends the entry. *)
definition_term = "::", space, inline_content, newline, {term_continuation_line} ;
term_continuation_line = inline_content, newline ;
(* A term folds a following plain line like a heading (soft break); a blank
   line, a new `::`/`:  ` marker, or a block opener ends it. A term holds inline
   content only -- no block body. *)
definition_body = (':', space, space, first_block_content)
                | (':', space, space, inline_content, newline,
                   { definition_continuation
                   | (blank_line+, definition_continuation)
                   | continuation_marker_block
                   | lazy_continuation_line }) ;
(* FIRST-BLOCK (`:  +`): when the sole content is a lone `+`, the body is the
   FOLLOWING flush-left block, with no indentation -- the definition-list twin
   of the list-item `- +` opener (`first_block_content`). `:  \+` is a literal
   `+`. *)
definition_continuation = (space, space, space, inline_content, newline)
                        | (backslash, newline, space, space, space, inline_content, newline) ;
(* The term marker is TWO colons. A single-colon `: term` line is NOT a
   definition list -- it is ordinary paragraph text. (Deliberate: djot uses a
   single colon, but `::` keeps the term marker distinct from the `:::` div
   fence and from a stray `:` in prose; all three reference impls agree.)

   A definition body CONTINUES like a list item (PART 9 §17): a blank line
   followed by an indented block folds in (so a `<dd>` may hold multiple
   paragraphs -- FORM A); a lone `+` attaches the FOLLOWING flush-left block
   with no indentation (`continuation_marker_block` -- FORM B, the un-prefixed
   analogue of the list-item and block-quote `+` forms); and a flush-left line
   with no blank before it that does not start an interrupting block lazily
   continues the open paragraph (`lazy_continuation_line`, the same rule as list
   items and block quotes, matching djot). A blank line that is NOT followed by
   an indented continuation ends the body: the single-blank entry separator
   before the next `:: term` is preserved. *)

(* --- Tables --- *)
table = standard_row, [delimiter_row], {table_row}, [caption_slot] ;
(* a table BEGINS with a standard row; a continuation row only ever follows
   an existing row (its cells append to the row above -- PART 9 §5). An optional
   GFM delimiter row may appear as the SECOND line; it promotes the first row to
   a header and assigns per-column alignment (PART 9 §5, GFM DELIMITER ROW) *)

table_row = standard_row | continuation_row ;

standard_row = '|', table_cell, {'|', table_cell}, '|', [row_attributes], newline ;

continuation_row = '+', table_cell, {'|', table_cell}, '|', newline ;

(* Row-level attributes: a `{…}` attribute block GLUED to the row's closing `|`
   (no intervening space) sets the row's `<tr>` attributes -- the row-level twin
   of a cell's opening-pipe attribute block (cell_attributes). The whole payload
   must be valid attribute syntax (§15); otherwise the `{` is ordinary content
   and the line is not a row attribute (see PART 9 §5, ROW ATTRIBUTES). *)
row_attributes = attributes ;

(* GFM delimiter row (alternative header syntax to the native `|=` header cell).
   Only recognized as the row IMMEDIATELY AFTER the first row of a table; every
   cell must be a delimiter_cell. A delimiter-shaped row in any OTHER position
   (the first line, or any row after the second) is an ordinary standard_row --
   its `-` runs are inline content (smart typography may render `---` as an em
   dash). See PART 9 §5 (GFM DELIMITER ROW) for header promotion + alignment. *)
delimiter_row = '|', delimiter_cell, {'|', delimiter_cell}, '|', newline ;
delimiter_cell = {whitespace}, [':'], '-', {'-'}, [':'], {whitespace} ;
(* at least one '-'; only optional leading/trailing ':' and surrounding
   whitespace. An EMPTY cell is NOT a delimiter_cell, so a row containing one
   (`|---||`) is not a delimiter row (it is an ordinary row). *)

table_cell = header_cell | data_cell | span_cell ;

header_cell = '=', [alignment_marker], {whitespace}, cell_content, {whitespace} ;
data_cell = [cell_attributes], [alignment_marker], {whitespace}, cell_content, {whitespace} ;
(* alignment_marker is glued to the opening '|' (no preceding whitespace),
   mirroring header_cell; a whitespace-delimited lone '^'/'<' is span_cell *)
(* CELL ATTRIBUTES (NORMATIVE): an attribute block GLUED to the opening '|'
   (no preceding whitespace) sets the cell's attributes; the rest of the cell,
   after optional whitespace, is the content. A space before the brace
   (`| {.x}`) is ordinary content, NOT attributes. The whole brace payload must
   be valid attribute syntax (§15); otherwise the '{' is literal content. A
   cell carrying attributes is never a bare span_cell -- its content is literal
   even if it is just '^' or '<'. A computed rowspan/colspan/alignment is
   authoritative, so an author copy of those keys is dropped. (corpus
   97-table-cell-attributes.) *)
cell_attributes = attributes ;
span_cell = rowspan_marker | colspan_marker ;

rowspan_marker = {whitespace}, '^', {whitespace} ;
colspan_marker = {whitespace}, '<', {whitespace} ;

alignment_marker = '<' | '>' | '~' ;  (* left, right, center *)

cell_content = inline_content ;
(* SEMANTIC CONSTRAINT (PART 9 §5): cell content may not contain an
   unescaped `|` outside a code span -- the pipe is the cell separator;
   the exception is not expressible context-free. *)

(* --- Captions --- *)
caption = '^', space, inline_content, newline, {caption_continuation_line} ;
caption_continuation_line = inline_content, newline ;  (* a PLAIN line only *)
(* MULTI-LINE CAPTIONS -- NORMATIVE. A caption is multi-line inline CONTENT, so
   its text spills onto following lines exactly like a PARAGRAPH (§10), NOT like
   a heading. It ends the same way an open paragraph does:
     1. a blank line ends the caption;
     2. a line that INTERRUPTS a paragraph -- a heading, blockquote, table,
        fenced code, `:::` div, thematic break, or `%%%` comment -- ends the
        caption and starts that block;
     3. a LIST MARKER does NOT end the caption: like a paragraph, a caption
        folds a `- ` / `1.` line in as literal text (djot -- a list needs a
        blank line to interrupt). `^ cap` + `- x` yields the caption `cap\n- x`;
     4. a further `^ ` line does NOT continue the caption (there is no repeated
        marker); it ends the caption and, having no captionable block to attach
        to, is ordinary paragraph text.
   Continuation lines join with a newline, so `^ cap` + `more` yields the
   caption text `cap\nmore`. The caption id (`#` placeholder) and cross-reference
   targeting use the full folded text. *)
caption_slot = [blank_line], caption ;
(* FORMAL adjacency (PART 9 §4, now structural): a caption attaches to the
   immediately preceding captionable block, with AT MOST ONE blank line
   between them -- the optional blank_line in caption_slot is the whole
   allowance. Host productions reference caption_slot; a `^ ` line anywhere
   else is ordinary inline/paragraph content. *)

(* CAPTION NUMBER PLACEHOLDER -- NORMATIVE.
   The FIRST bare `#` in a caption's TOP-LEVEL inline_content is a number
   placeholder. "Bare" = a `#` that does NOT begin a tag (a `#`
   followed by whitespace, `:`, `.`, end-of-caption, or any
   non-tag-name character). `#word` stays a tag (PART 9 §19). `\#` is
   a literal `#`, never a placeholder. Only the first bare `#` is a
   placeholder; later bare `#` render literally. A `#` INSIDE inline
   markup (emphasis, link, span) is NOT a placeholder -- put it in the
   caption's top-level text (`^ *Figure* #:`, not `^ *Figure #*:`).
   The LABEL is the inline_content before the placeholder, trailing
   whitespace trimmed; the counter BUCKET KEY is that label's plain
   text. Numbering is per-bucket, 1-based, in document order, assigned
   in the resolution pass. A caption with no bare `#` is unchanged. *)

(* --- Admonitions --- *)
(* A fence is a run of 3+ colons. A longer opener nests shorter blocks. *)
colon_fence = ":::", {":"} ;
colon_fence_close = colon_fence:close
                    where len(close) >= len(open) ;
(* FORMAL (PART 9 §12, now a guard): `open` is the colon_fence captured by
   the enclosing block's opener (admonition_open / div_open /
   line_block_open / local_hard_break_block_open). A block is closed only
   by a bare fence of equal-or-greater length, so a `:::` inside a `::::`
   block is content. Equal-length fences do not nest. *)

admonition = admonition_open, newline,
             {block - admonition_close},
             admonition_close, newline ;
(* the body may be EMPTY, like the generic div's (PART 9 §12) *)

(* STRICT (djot): the opener line carries NO inline `{...}` attributes -- it
   is colon_fence, type, an optional quoted "header", an optional [label],
   and NOTHING else. A trailing `{...}` (or any other text not matching the
   header/label shape) makes the line an ordinary paragraph, not a fence.
   Class/id/data-* attach via a PRECEDING block-attribute line, which floats
   onto the admonition (§15).
   The "header" and [label] are the SAME two metadata tokens a code fence
   takes (PART 9 §2), in the same fixed order:
     - HEADER ("...") -- the visible admonition title (`<p
       class="admonition-title">`, or `<summary>` under the details
       extension). Unchanged role.
     - LABEL ([...]) -- a short grouping identifier; core ignores it standalone,
       a group extension (e.g. tabs) uses it as the tab label. This is the
       canonical replacement for the tabs `{label="..."}` / inner-heading
       convention (those stay supported, deprecated). The `selected`
       default-tab marker is NOT a label -- it remains a boolean attribute on
       the preceding `{...}` line. *)
admonition_open = colon_fence:open, space, admonition_type,
                  [space+, quoted_title], [space+, label] ;
admonition_close = colon_fence_close ;

admonition_type = "note" | "tip" | "warning" | "danger" | "info"
                | "success" | "example" | "quote"  (* Tier 1 canonical *)
                | identifier ;  (* Tier 2 custom types (incl. `details`)
                                   -- see PART 9 §12 for the two-tier
                                   rendering rule *)

quoted_title = '"', {character - '"'}, '"' ;
(* there is NO escape mechanism inside a quoted title -- a title cannot
   contain a `"`; a line whose title is malformed is an ordinary paragraph
   (deliberate strictness, unlike `quoted_value` which accepts escapes) *)

(* --- Generic Divs (djot generic container; PART 9 §12) --- *)
(* A BARE `:::` opener with NO type word is a generic div, NOT an
   admonition (no class added). STRICT (djot): the opener carries NO inline
   `{...}` attributes -- an inline `::: {…}` is a paragraph, not a div.
   Class/id/data-* attach via a PRECEDING block-attribute line, which floats
   onto the div (§15). A typeless div MAY still carry a [label] (e.g.
   `::: [First]` for a tab member with no semantic type); a header without a
   type is not meaningful, so `::: "x"` is a paragraph. As the FIRST token after
   the fence the [label] may sit directly against it (`:::[First]`), exactly as
   a code fence admits a bare `[label]` (a label AFTER a type word still needs a
   space -- admonition_open). Shares the colon-fence closer and the fence-length
   nesting rule (PART 9 §12). *)
div = div_open, newline, {block - admonition_close}, admonition_close, newline ;
div_open = colon_fence:open, [[space], label] ;

(* --- Line block (verse) -- a RECOGNIZED `:::` type that preserves the
   author's per-line layout. `::: |` renders as a generic
   `<div class="line-block">` (NOT an `<aside>`), but unlike an ordinary div
   its body keeps each line's LEADING WHITESPACE and turns each soft line
   break into a hard break. The type token is a bare pipe `|` on the opener --
   NOT a per-line prefix -- so it is free of the pipe/table ambiguity of the
   Pandoc/djot per-line `|` form, and uses no English keyword (jgm/djot#29).
   The behavior keys off the `|` token. STRICT (djot): the opener carries no
   inline attributes, so the inline `::: {.line-block}` form is a PARAGRAPH, not
   a line block; extra attributes attach via a PRECEDING block-attribute line.
   Normative rendering: PART 9 §23. *)
line_block = line_block_open, newline, line_block_body, admonition_close, newline ;
line_block_open = colon_fence:open, space, "|" ;
(* The body is a sequence of stanzas separated by blank lines; each stanza is
   a run of consecutive non-blank lines. Inline content parses normally; the
   per-line leading whitespace is retained and the intra-stanza newlines are
   hard breaks (PART 9 §23). *)
line_block_body = stanza, {blank_line, stanza} ;
stanza = line_block_line, {line_block_line} ;
line_block_line = {whitespace}, inline_content, newline ;
(* the leading whitespace is PRESERVED in output -- PART 9 §23 *)

(* --- Local hard-break block -- a RECOGNIZED `:::` type for authors who want
   Markdown-like visible line breaks locally without the full line-block verse
   model. `::: \` renders as `<div class="hardbreaks">` and turns soft breaks
   in DIRECT paragraph children into hard breaks. It does NOT preserve leading
   whitespace and it does NOT inherit into nested block content. *)
local_hard_break_block = local_hard_break_block_open, newline,
                         {block - admonition_close},
                         admonition_close, newline ;
local_hard_break_block_open = colon_fence:open, space, backslash ;

(* --- Comments --- *)
(* A line comment may be indented: optional leading whitespace before `%%` does
   not matter, so an indented line whose first non-whitespace content is `%%` is
   a comment line just like one in the first column. Like every block, it
   interrupts an open paragraph (the lines before and after become separate
   paragraphs) and renders nothing. *)
comment_line = [whitespace], "%%", {character}, newline ;

(* A `%%%` fence line is a DELIMITER plus an INSIGNIFICANT TAIL: only the
   leading run of `%` is structural, so `%%% TODO` opens and `%%% end` closes.
   `%%%` carries NO info string -- a raw passthrough block is a CODE fence with
   an `=FORMAT` info string -- so `%%% html` is a comment, not a raw block.
   A block opens only when a matching closer exists AHEAD; otherwise the line
   degrades to a `comment_line`. Both rules are NORMATIVE in PART 9 §28. *)
comment_block = comment_block_open, newline,
                {character | newline},
                comment_block_close, newline
                where exists(comment_block_close) ;  (* PART 9 §28: else comment_line *)

comment_block_open = ("%%%", {'%'}):open, {character - newline} ;  (* 3+ `%`, tail ignored *)
comment_block_close = ("%%%", {'%'}):close, {character - newline}
                      where len(close) = len(open) ;  (* FORMAL (PART 9 §2): exact length match *)

(* trailing (inline) line comment: from a %% marker to end of line, content
   not rendered. The marker requires whitespace or start-of-run before it and
   is not formed inside code spans / inline-raw, nor when the first % is
   escaped -- a semantic constraint stated in PART 9, not context-free. *)
inline_comment = "%%", {character - newline} ;

(* PROVENANCE MARKER (convention, not a distinct production): tooling such as
   `carve fmt --stamp` may write a trailing comment recording the spec version a
   document was last processed under and the engine that wrote it. It is an
   ordinary comment -- a `%%` line or a `%%%` block -- so it renders nothing; it
   is identified by a `carve-version:` field as its first key. Two forms:
     %% carve-version: 0.1; generated-by: carve-js 0.1.0
   or
     %%%
     carve-version: 0.1
     generated-by: carve-js 0.1.0
     %%%
   It is tool-written and deterministic (no timestamp), replace-in-place (a tool
   updates the existing marker rather than appending a second), and lives at the
   END of the document. Authors do not hand-write it. See PART 9 / docs. *)

(* --- Raw Blocks --- *)
raw_block = code_fence_open, [space], "=", format_name, newline,
            raw_content,
            code_fence_close, newline ;
(* The opener is a code fence (backticks or tildes) whose info string is `=`
   immediately followed by a format name (```=html). The leading `=` is the
   block parallel of the inline raw `{=format}` attribute; it never starts a
   language token (a code fence's language charset excludes `=`), so this is
   unambiguous against an ordinary code block. The `=` and format name must be
   adjacent -- ```= html (space after `=`) is NOT a raw block. Leading
   whitespace before the `=` is permitted (```=html and ``` =html both open
   raw). The closer follows the PART 9 §2 rule: same fence character, length
   >= opener. Content is verbatim; it is emitted UNESCAPED when the format
   matches the output format (html) and DROPPED otherwise -- the block parallel
   of raw inline (PART 9 §20). This adopts djot's raw-block syntax; the earlier
   ```raw FORMAT keyword form was removed (no English keyword; symbol-based and
   symmetric with inline `{=format}`). *)

format_name = "html" | "latex" | identifier ;
raw_content = (* any content until closing fence *) ;

(* --- Paragraphs --- *)
(* A VISIBLE block (heading, quote, table, fence, thematic break,
   admonition/div) interrupts a paragraph with no blank line before it
   (Markdown-like); invisible constructs (reference definitions, comments,
   block-attribute lines) interrupt too. A LIST marker (bullet OR ordered) is
   the exception: it does NOT interrupt -- a list needs a blank line before it
   (symmetric, Djot-like; see §10). This is the semantic constraint
   PART 9 §10, not a context-free one. A paragraph is terminated by a blank
   line, an interrupting block (§10), or end of file -- the optional trailing
   blank_line here is the paragraph's own terminator, not a precondition for
   the NEXT block.
   TRAILING WHITESPACE -- NORMATIVE. Whitespace at the END of the paragraph's
   final line (with nothing after it) is STRIPPED before rendering (CommonMark /
   Djot: "final spaces are stripped"), so "abc " renders <p>abc</p>, not
   <p>abc </p>. Only this final trailing whitespace is stripped; whitespace
   before an INTERIOR line break is governed by the line-break rules (PART 3)
   and is not touched here. (Carve has no two-trailing-space hard break -- a
   hard break is the backslash form, `\` + newline; see PART 3.) Pinned by
   corpus 102-paragraph-trailing-whitespace. *)
paragraph = inline_content+, [blank_line] ;


(* ============================================================================
   PART 3: INLINE ELEMENTS
   ============================================================================ *)

inline_content = {inline_element | text_run | literal_special}+ ;

(* A special_char that does not begin (or close) a construct under PART 8
   precedence + PART 9 conditions is literal text. Without this production
   the grammar cannot represent "a/b/c", "x = a*b", or an inner delimiter
   such as the middle '/' of /usr/local/ -- yet these are canonical outputs
   (edge-cases.md §1, corpus 01-emphasis). The "is this a construct?"
   decision is the semantic constraint in PART 9, not a context-free one. *)
literal_special = special_char ;

inline_element = escaped_char
               | raw_inline
               | literal_inline
               | code_span
               | autolink
               | auto_text_link
               | link
               | inline_span
               | image
               | math
               | emphasis
               | strong
               | bold_italic
               | underline
               | strikethrough
               | highlight
               | forced_emphasis
               | forced_strong
               | forced_underline
               | forced_strike
               | forced_super
               | forced_sub
               | forced_highlight
               | footnote
               | extension_inline
               | editorial_markup
               | mention
               | tag
               | symbol
               | smart_typography
               | inline_comment
               | hard_break
               | soft_break ;

text_run = (character - special_char)+ ;

special_char = '/' | '*' | '_' | '~' | '^' | '`' | '[' | ']'
             | '!' | '$' | '{' | '}' | '<' | '>' | '@' | '#'
             | ',' | '=' | ':' | '%' | '\'
             | '-' | '.' | '"' | "'" | '(' | '+' ;
(* the second group are the smart-typography trigger characters (dashes,
   ellipsis, quotes, `(c)`-family symbols, `+-`): they must be excluded
   from text_run so the smart_typography productions are reachable; when
   no pattern matches they are literal_special like any other special *)

(* --- Escaped Characters --- *)
escaped_char = '\', ascii_punctuation ;
ascii_punctuation = '!' | '"' | '#' | '$' | '%' | '&' | "'" | '(' | ')'
                  | '*' | '+' | ',' | '-' | '.' | '/' | ':' | ';'
                  | '<' | '=' | '>' | '?' | '@' | '[' | '\' | ']'
                  | '^' | '_' | '`' | '{' | '|' | '}' | '~' ;

(* --- Code Spans --- *)
code_span = backtick_run, code_span_content, [backtick_run], [attributes] ;
   (* the closing backtick_run is OPTIONAL: an unclosed run runs to end of
      block -- see UNCLOSED RUN below *)
backtick_run = '`'+ ;  (* the OPENER is a MAXIMAL run; the CLOSER is a run of
   the SAME count, also maximal (not part of a longer run) *)
code_span_content = (* any chars except a closing backtick_run of equal count *) ;
(* UNCLOSED RUN. An opener with no equal-length closer ahead is NOT literal
   text: it opens a verbatim span that runs to END_OF_BLOCK (the block's
   trailing whitespace is stripped; no surrounding single-space strip, which
   applies only to a closed span). Such an unclosed run is OPAQUE -- an
   emphasis delimiter or link tail after it is verbatim content, so the
   surrounding construct never closes. Matches djot upstream and carve-php.
   (Pinned as the "Inline verbatim ..." corpus pair; was a carve-js
   divergence, resolved.) *)
(* A trailing `{…}` on a code span is the generic inline-attribute
   block, NOT a language tag -- EXCEPT the exact form `{=format}`, which is
   raw inline passthrough (raw_inline below, PART 9 §20). For any OTHER
   `{…}`: both impls consume AND apply it -- `\`c\`{.x}` ->
   `<code class="x">c</code>`, and `#id`/`key=value` likewise. (Was an impl
   divergence where carve-js dropped the attributes; RESOLVED -- both now
   apply them identically.) The STRICT attribute rule (§14) applies here and
   to IMAGE trailing attrs exactly as to any other inline attribute: a
   digit-first / invalid payload makes the block literal, so `\`c\`{#1a}` and
   `![a](i){#1a}` keep the braces as text rather than parsing a bogus
   attribute. *)

(* --- Raw Inline (inline parallel of raw_block; PART 9 §20) --- *)
(* A code span whose trailing attribute block is EXACTLY `{=format}` is raw
   inline passthrough: the verbatim span content is emitted unescaped when
   `format` matches the output, else dropped. Any OTHER trailing `{…}` is a
   generic attributed code span (above), NOT raw inline. *)
raw_inline = backtick_run, code_span_content, backtick_run, '{=', format_name, '}' ;

(* --- Literal Inline (PART 9 §27) --- *)
(* A `!` PREFIX on a verbatim code span is an inline literal, mirroring
   `math_inline = '$', code_span` (§18). The span content is HTML-ESCAPED and
   emitted by EVERY renderer (never dropped), but WITHOUT the `<code>` wrapper
   -- it is prose, not code. The trailing `[attributes]` is the ORDINARY
   code-span attribute block: with attributes an emitted `<span>` carries them,
   with none bare escaped text is emitted. Unlike `{=format}` (raw_inline, §20)
   the content is escaped and never target-routed. The `!` binds to a following
   backtick run only; `!` elsewhere (and `![` opening an image) is unaffected,
   so a literal `!` immediately before a backtick run is written `\!`. *)
literal_inline = '!', code_span ;

(* --- Links --- *)
link = inline_link | reference_link | collapsed_reference_link ;
(* autolink and auto_text_link (crossref) are listed directly in
   inline_element, not here. *)

inline_link = '[', link_text, ']', '(', link_destination, [link_title], ')', [attributes] ;
reference_link = '[', link_text, ']', '[', reference_label, ']', [attributes] ;
collapsed_reference_link = '[', link_text, ']', '[', ']', [attributes] ;

link_text = inline_content ;
(* SEMANTIC CONSTRAINT: the link text ends at the matching `]` -- the close
   is balanced-bracket, escape- and LITERAL-SPAN-aware (PART 9 §16's rule for
   inline notes is the same scan); an unescaped bare `]` inside plain text
   closes it. Not expressible context-free. *)
(* WHICH SPANS THE SCAN SKIPS -- NORMATIVE. The scan skips the interior of
   every construct whose content is LITERAL, meaning the parser resolves no
   escapes inside it:

     `code`            inline code
     !`literal`        inline literal (PART 9 §27)
     {# comment #}     editorial comment

   A `]` inside any of them is content, not the close.

   The list is exactly the spans whose content is LITERAL, and that is the
   whole test: a `]` there cannot be escaped, because an escape inside them is
   not an escape. `{+ +}`, `{- -}` and `{~ ~> ~}` take `inline_content`, so
   `[{+a\]b+}](u)` already works and they are deliberately NOT in the list.

   This was previously written as "code-span-aware", which named only the
   first. The others were left to end the label early, and because their
   content is literal there is no way to write the `]` around it: an escape
   inside them is not an escape, so `[{#a\]b#}](u)` produces a link whose
   comment text really contains a backslash. A writer serializing such a
   document therefore had to choose between keeping the link and keeping the
   author's text, with no correct answer available (carve#403).

   Stated as a property rather than as three names: any future construct whose
   content is literal joins the list by construction. *)
(* SEMANTIC CONSTRAINT (links never nest): a link MUST NOT contain another
   link. The link text is inline content, so parsing it may yield a link --
   written explicitly (`[[x](y)](z)`), as a crossref/auto_text_link, or
   produced by an extension matcher (e.g. an autolink on a bare URL inside the
   text). Any such inner link is replaced by its own text content; only the
   outermost link's destination applies. So `[[x](y)](z)` links `x` to `z`,
   and `[a https://b c](/u)` (with an autolink extension) links the literal
   text `a https://b c` to `/u` rather than emitting a nested anchor. Mirrors
   CommonMark "links may not contain other links". *)
link_destination = {destination_escape | balanced_parens
                   | (url_char - '(' - ')') | '\' | unicode_url_char}+ ;
balanced_parens = '(', {destination_escape | balanced_parens
                       | (url_char - '(' - ')') | '\' | unicode_url_char}, ')' ;
destination_escape = '\', ('(' | ')' | '\') ;
(* The destination is ANY run of URL characters: absolute (`https://…`),
   relative (`/path`, `./file`), or a bare fragment (`#section`) -- there is
   no scheme requirement. It ends at the first whitespace (a following
   quoted run is the title) or at the first `)` that has NO unmatched `(`
   left to pair with. Parentheses BALANCE, to any depth, so
   `[x](http://a/b(c))` links to `http://a/b(c)` and `[y](e)f)` links to `e`
   and leaves `f)` literal. This matches djot and CommonMark, both of which
   balance destination parentheses; URLs carrying parentheses (Wikipedia,
   MDN) are thus written plainly, with no escape and no second spelling.
   The ONLY escapes inside a destination are `\(`, `\)` and `\\`, which
   stand for the literal character and do not affect nesting
   (`[t](a\)b)` links to `a)b`). A BACKSLASH before anything else is an
   ordinary destination character kept verbatim (`[t](a\b)` links to `a\b`),
   so URLs full of backslashes need no doubling. Note `url_char` itself
   (used by autolinks) still EXCLUDES `\`, so `<http://a\b>` is not an
   autolink; the backslash is a destination-only literal. There is NO
   angle-bracket-wrapped destination form -- balanced parentheses and the
   three escapes cover what one would be for, and whitespace is
   percent-encoded (`%20`). A newline counts as whitespace, so it ENDS the
   destination: a `(` whose run reaches the end of the line without a
   closing `)` is not a link and stays literal (the `[t](url` / `more)`
   two-line case). *)
unicode_url_char = (* any non-whitespace, non-ASCII Unicode character *) ;
(* WHITESPACE HERE IS UNICODE WHITESPACE -- NORMATIVE, and it always was:
   `unicode_url_char` says "non-whitespace" without qualifying it to ASCII.
   So a NARROW NO-BREAK SPACE, an IDEOGRAPHIC SPACE and a THIN SPACE end an
   inline destination exactly as a plain space does, and an unbracketed
   destination cannot contain one. Stated explicitly because two engines read
   it as ASCII-only and produced a link whose href carried an invisible
   character (carve#404).

   ZERO-WIDTH characters (U+200B, U+FEFF) are NOT whitespace and ARE ordinary
   destination characters. The test is the Unicode White_Space property, not
   "is invisible".

   THE SAME RULE APPLIES IN A REFERENCE DEFINITION, because the definition is
   built from this same `link_destination`. Whitespace ENDS the destination
   there too:

     [r]: https://e.com<U+202F>/path   ->  destination `https://e.com`

   with `/path` following the destination and, not being a quoted title,
   ignored -- exactly as `[r]: https://e.com /path` behaves. Whitespace
   between the mandatory separator space and the destination is leading
   whitespace and is skipped. A definition whose destination is EMPTY once
   that is done is NOT a definition: the line stays literal, as `[r]:`
   already does.

   There is deliberately no "trim the ends, keep the interior" variant. It
   reads as the friendlier rule and it contradicts `link_destination`, which
   admits no whitespace at all -- leaving `[r]: a b` specified two ways. *)
(* Titles accept double OR single quotes -- a deliberate enhancement
   over djot (which has no single-quote titles; it would fold `'...'`
   into the URL). The non-delimiting quote may appear inside the title. *)
(* A backslash-escaped delimiter inside the title is a literal quote
   (CommonMark-style): `\"` in a "-title and `\'` in a '-title are kept
   as the literal quote character and do NOT end the title. This is
   honored for the INLINE-link title (`[t](/url "ti\"tle")` -> title
   `ti"tle`); see the note at `reference_definition` for the ref-def slot. *)
link_title = space, ('"', {('\', '"') | (character - '"')}, '"')
           | space, ("'", {('\', "'") | (character - "'")}, "'") ;

reference_label = (character - ']' - '@'), {character - ']'} ;
reference_definition = '[', reference_label, ']', ':', space, link_destination, [link_title], newline ;
(* The marker-to-content separator is the `space` terminal (U+0020) ONLY, in
   all three definition markers: `reference_definition`, `footnote_definition`
   and `abbreviation_definition`. A tab does NOT satisfy `space`
   (`space = ' '`), so `[^a]:<TAB>x`, `[a]:<TAB>/url` and `*[HTML]:<TAB>x` are
   ordinary paragraphs, not definitions -- the marker line is preserved as
   text. This mirrors the heading, list and task markers, which likewise
   require a literal space after the marker. carve-rs is the reference here. *)
(* A leading `@` is reserved: `[@key]: ...` is never a reference definition, it
   is a citation definition (PART 9 §22, Tier-2). This parallels the `[^...]:`
   footnote-definition precedence and holds in core whether or not the citation
   extension is enabled, so a short entry like `[@a]: A.` is not mis-claimed as
   a link definition. *)
(* The reference-definition title reuses `link_title`, INCLUDING the
   backslash-escape rule above: a `\"` inside the quoted run is a literal
   quote and does not end the title (`[y]: /u "a\"b\"c"` -> title
   `a"b"c`, emitted as `title="a&quot;b&quot;c"`). This matches the
   inline-link title; the earlier carve-js divergence (ref-def title
   stopping at the first raw quote) was fixed in the canonical oracle, and
   the examples corpus now pins the ref-def escaped-quote form too. *)
(* A reference definition requires a NON-EMPTY `link_destination` (one or
   more `url_char`s after the `':' space`). A bare `[r]:` -- with nothing
   after the colon, or only trailing whitespace -- does NOT form a
   definition; the line stays literal paragraph text. *)

(* --- Inline Spans --- *)
(* A bracketed inline run immediately followed by an attribute block
   attaches those attributes to a <span>. Distinguished from links by the
   character after ']': '(' -> inline_link, '[' -> reference link, '{' ->
   inline_span. A bare '[text]' with none of those is literal text (carve
   has no shortcut reference links). See PART 9 §14. *)
inline_span = '[', inline_content, ']', attributes ;

autolink = '<', (url_autolink | email_autolink), '>', [attributes] ;
(* A trailing `[attributes]` block attaches to the autolink (generic
   inline-attribute carrier, examples.md): `<https://example.com>{.ext}`
   -> `<a href="https://example.com" class="ext">https://example.com</a>`.
   Same slot as every other inline carrier; both reference impls agree. *)
(* DISPLAY TEXT -- NORMATIVE: the visible text is the RAW content between
   `<` and `>`, verbatim. A url_autolink keeps its scheme
   (`<mailto:a@b>` -> `<a href="mailto:a@b">mailto:a@b</a>`); an
   email_autolink has no explicit scheme, so it shows the address and the
   `mailto:` is added only to the href
   (`<a@b.com>` -> `<a href="mailto:a@b.com">a@b.com</a>`). *)
url_autolink = scheme, ':', {url_char}+ ;
(* The autolink body is built from `url_char` only; `<` and `>` are NOT
   url_chars, so a `<` inside the brackets cannot be part of the body --
   `<http://a.com/<script>>` is not an autolink and renders as the fully
   escaped literal text. *)
(* The trailing `.`, {letter}+ (TLD) is MANDATORY for an email autolink:
   `<a@b.com>` is a mailto link, but `<a@b>` (no dot+TLD) and `<x@y:z>`
   (`:` is not an email_char) are NOT email autolinks and stay literal. *)
email_autolink = {email_char}+, '@', {email_char}+, '.', {letter}+ ;

auto_text_link = "</#", crossref_id, '>' ;  (* cross-reference with auto text *)
crossref_id = {character - ('>' | whitespace | newline)}+ ;
(* The crossref id token accepts NON-ASCII characters: automatic heading ids
   PRESERVE Unicode and case (`# Café Notes` -> `Café-Notes`), so `</#Café-Notes>`
   must be writable. Resolution is CASE-INSENSITIVE: the reference is folded and
   matched against the case-preserved id table, then the link emits the target's
   actual id (`</#café-notes>` resolves to `Café-Notes`). Matching is NOT
   ASCII-folded, so an ASCII spelling of a non-ASCII id (`cafe` vs `Café`) does
   NOT resolve (corpus 19-heading-ids). EXPLICIT ids (`{#id}`) remain ASCII
   `identifier`s -- the asymmetry is deliberate: authors control explicit
   ids, auto ids follow the heading text. *)

url = scheme, ':', {url_char}+ ;
scheme = letter, {letter | digit | '+' | '-' | '.'} ;  (* 1+ chars; single-letter schemes ok *)
url_char = letter | digit | '-' | '.' | '_' | '~' | ':' | '/' | '?'
         | '#' | '[' | ']' | '@' | '!' | '$' | '&' | "'" | '(' | ')'
         | '*' | '+' | ',' | ';' | '=' | '%' ;
(* `url_char` therefore EXCLUDES `"`, `\`, `` ` ``, `{`, `}`, `|`, `^`
   (and `<` / `>`). Any of these inside `<...>` breaks the autolink: the
   whole run is not an autolink and renders as escaped literal text --
   `<http://a.com/"q">` is literal (the straight quotes additionally pick
   up smart-quote typography). *)

(* --- Images --- *)
(* An image has the same three forms as a link (inline / reference / collapsed
   reference); only the leading `!` and the `<img src>` output differ. *)
image = inline_image | reference_image | collapsed_reference_image ;
inline_image = '!', '[', alt_text, ']', '(', image_source, [image_title], ')', [attributes] ;
reference_image = '!', '[', alt_text, ']', '[', reference_label, ']', [attributes] ;
collapsed_reference_image = '!', '[', alt_text, ']', '[', ']', [attributes] ;
alt_text = {character - ']'} ;
image_source = link_destination ;  (* absolute or relative, like links *)
image_title = link_title ;
(* A reference / collapsed-reference image resolves its label against the
   document's reference definitions EXACTLY as the matching reference_link does
   (case-sensitive), taking the definition's destination as `src` and its
   optional title as the image title. A collapsed `![alt][]` uses alt_text as
   the label, so alt_text must be non-empty; the full `![alt][ref]` form allows
   an empty alt_text (the label is reference_label). An UNRESOLVED reference
   image renders as literal source (the `!` plus the bracketed text) and --
   unlike a reference link -- never matches heading text. There is NO shortcut
   `![alt]` reference image, mirroring the absence of a shortcut reference
   link. Trailing [attributes] attach to the resolved <img>. *)

(* A caption (`^ ` line) may follow a standalone image paragraph on the
   next line, wrapping it in a <figure> -- caption placement is PART 9 §4;
   there is no separate grammar production for the pair. *)

(* --- Math (djot form; PART 9 §18) --- *)
(* Inline `$` + a verbatim (backtick) span; display `$$` + a verbatim
   span. The backtick span removes ambiguity with a literal `$`, so
   currency such as `$5` (no following backtick run) stays literal text.
   Both forms are inline; display math renders inside its own paragraph
   when alone on a line. There is NO bare `$…$` form and NO `\(…\)` input
   form -- those were dropped (`\(` is just an escaped paren). *)
math = math_inline | math_display ;
math_inline = '$', code_span ;
math_display = "$$", code_span ;
(* TRAILING ATTRIBUTES on math -- NORMATIVE: math reuses `code_span`, which
   carries the generic `[attributes]` slot, so `$\`x\`{.c}` / `$$\`x\`{.c}`
   parse a trailing attribute block and APPLY it to the math span, merging
   classes into the existing `math inline` / `math display` class
   (`<span class="math inline c">`); `#id` / `key=value` are applied too.
   Canonical = carve-js / djot.js (pinned, corpus Math section). The
   `{=format}` raw form is code-span-ONLY and is NOT inherited by math:
   `$\`x\`{=html}` leaves the `{=html}` literal (both impls already agree).
   carve-php currently DROPS valid math attributes -- to be fixed.
   The math content is the verbatim text of the reused code_span. *)

(* --- Emphasis (left/right word-boundary rule; PART 9 §9 is normative).
   The word-boundary restriction applies to EVERY bare emphasis delimiter
   (`/`, `*`, `_`, `~`, `=`). No bare delimiter opens or closes
   intraword: `foo*bar*baz`, `foo~bar~baz`, `snake_case`, `a/b/c` all stay
   literal. One rule, no per-character carve-outs. Every bare delimiter is
   single-char -- there is no two-char delimiter. This is STRICTER than Djot,
   whose rule is whitespace-only.

   NO BARE SUPERSCRIPT/SUBSCRIPT -- `^` and `,` are NOT bare delimiters.
   Superscript and subscript exist ONLY in the braced form
   (`{^x^}` -> <sup>, `{,x,}` -> <sub>; the PART 9 §22 brace-pair family,
   whose productions keep their historical forced_* names). Two reasons:
   (a) sub/sup attach to characters, not words -- the dominant uses
   (`H{,2,}O`, `mc{^2^}`, `10{^6^}`) are intraword, which the bare
   word-boundary form could never express anyway; the bare form covered
   only the rare whole-word case. (b) `,` is the most frequent punctuation
   character in prose and `^` is claimed by captions, header rowspan, and
   inline footnotes; a single misplaced space (`typo ,oops, happens`)
   would otherwise conjure a subscript. A bare `^` or `,` is ALWAYS
   literal text.

   FORCED INTRAWORD -- the brace-pair `{X … X}` family (PART 9 §22) is the
   explicit escape hatch: it opens a span regardless of word boundary, so
   `x{*bold*}y`, `x{/italic/}y`, `x{_under_}y` emphasize intraword. It is
   ADDITIVE -- bare delimiters at a word boundary are unchanged; the braces are
   only needed to force a span where a bare delimiter would otherwise stay
   literal.

   HIGHLIGHT is the single-char `=` delimiter. Uniform word boundary makes
   single `=` safe: `x = 5`, `key=value`, `a=b`, `=>` all stay literal -- the
   opener is followed by whitespace, or preceded/followed by an alphanumeric,
   so the word-boundary rule rejects it. A DOUBLED bare delimiter (`==x==`)
   is literal by the same-delimiter-adjacency rule (PART 9 §9), like `**x**`
   or `//x//`. The delimiter set is fully uniform: all single-char. --- *)

(* FORMAL word-boundary guards -- NORMATIVE; production templates (GUARD
   NOTATION, header) instantiated for every bare delimiter
   d in { '/', '*', '_', '~', '=' } (all single-char): *)

bare_opener(d) = <!(alnum | '_' | d | slash_if(d)), d, !(ws | d) ;
bare_closer(d) = <&(non_ws), d, !(alnum) ;

(* slash_if(d) = '/' when d in { '/', '_' }, otherwise nothing: italic and
   underline additionally never open when the char immediately before d is `/`
   (PART 9 §9 -- path protection: /a/_b_, snake_/case/, a_/_a_). For d = '/'
   this coincides with same-delimiter adjacency; for d = '_' it is the extra
   cross-delimiter guard. The other delimiters '*', '~', '=' are NOT guarded
   against a preceding '/', so a/~y~ -> a/<s>y</s>. *)

(* Reading:
   - opener: the char BEFORE d is START, whitespace, or punctuation -- but
     NOT an alphanumeric, NOT `_`, NOT the same delimiter
     (same-delimiter adjacency, PART 9 §9: a doubled delimiter never
     opens), and -- for d in { '/', '_' } only -- NOT `/`
     (slash_if(d), path protection); d is NOT followed by whitespace and NOT
     by another d.
   - closer: d is NOT preceded by whitespace and NOT followed by an
     alphanumeric. (A following same-delimiter is allowed: in `/x//` the
     first `/` after `x` closes; the trailing `/` stays literal.)
   Run disambiguation (nearest valid closer, delimiter stack, no same-type
   nesting) is PART 9 §9; the corpus pins it (tests/corpus/01-emphasis*).
   Forced `{X … X}` spans bypass BOTH guards entirely (PART 9 §22). *)

emphasis = bare_opener('/'), emphasis_content, bare_closer('/'), [attributes] ;  (* italic *)
strong = bare_opener('*'), strong_content, bare_closer('*'), [attributes] ;       (* bold *)
bold_italic = bare_opener('/'), '*', bi_content, '*', bare_closer('/'), [attributes] ;
   (* the combined `/*` opener: the boundary guards apply to the OUTER `/`;
      the inner `*` is part of the two-char token, not separately guarded *)
underline = bare_opener('_'), underline_content, bare_closer('_'), [attributes] ;
strikethrough = bare_opener('~'), strike_content, bare_closer('~'), [attributes] ;
highlight = bare_opener('='), highlight_content, bare_closer('='), [attributes] ;  (* single-char; was "==" *)
(* there is NO bare superscript/subscript production -- `^` and `,` are
   always literal outside the forced forms; see the rationale note above
   and forced_super / forced_sub below *)

(* --- Forced intraword emphasis (PART 9 §22). A brace-pair wrapping a bare
   emphasis delimiter forces a span with NO word-boundary condition, so it
   emphasizes intraword. The closing brace bounds the span; the inner delimiter
   characters are literal unless they open a nested forced span. Every mark has
   a forced form: *)
forced_emphasis   = "{/", forced_content, "/}", [attributes] ;   (* italic    *)
forced_strong     = "{*", forced_content, "*}", [attributes] ;   (* bold      *)
forced_underline  = "{_", forced_content, "_}", [attributes] ;   (* underline *)
forced_super      = "{^", forced_content, "^}", [attributes] ;   (* sup       *)
forced_sub        = "{,", forced_content, ",}", [attributes] ;   (* sub       *)
(* `{~ … ~}` is forced strikethrough UNLESS it contains a top-level `~>`, in
   which case it is editorial substitution (PART 9 §22 disambiguation). *)
forced_strike     = "{~", forced_content, "~}", [attributes] ;
(* `{= … =}` is forced highlight (see Editorial Markup below for the
   distinction from the raw-inline `{=format}` attribute). *)
forced_highlight  = "{=", forced_content, "=}", [attributes] ;
forced_content    = (inline_element | text_run | literal_special)+ ;

(* TRAILING ATTRIBUTES -- each emphasis-family span MAY carry a trailing
   `[attributes]` block: the generic inline-attribute carrier that attaches
   to the immediately-preceding inline node (the general rule, examples.md
   "Trailing attribute block edge cases"). `*x*{.real}` ->
   `<strong class="real">x</strong>`. It is the SAME `[attributes]` slot
   every other inline carrier uses (code_span, link, image, inline_span,
   extension_inline, autolink). Pinned by corpus
   75-trailing-attribute-block-edge-cases; both reference impls agree. *)

(* Content MAY contain the delimiter character: a delimiter that does not
   satisfy the close condition (PART 9 §1) is literal content, not the span
   boundary. The span ends at the matched closer only. Subtracting the
   delimiter here would make e.g. "/usr/local/" -> <em>usr/local</em>
   ungrammatical, contradicting edge-cases.md §1. The boundary is a semantic
   constraint (PART 9 §1, §9), not a context-free one. *)
emphasis_content = (inline_element | text_run | literal_special)+ ;
strong_content = (inline_element | text_run | literal_special)+ ;
bi_content = (inline_element | text_run | literal_special)+ ;
underline_content = (inline_element | text_run | literal_special)+ ;
strike_content = (inline_element | text_run | literal_special)+ ;
highlight_content = (inline_element | text_run | literal_special)+ ;

(* The former prose pseudo-productions emphasis_open_condition /
   emphasis_close_condition are SUPERSEDED by the formal bare_opener(d) /
   bare_closer(d) templates above. *)

(* --- Footnotes (PART 9 §16) --- *)
(* Both implemented forms: the REFERENCE form `[^label]` resolving against a
   `[^label]: body` definition, and the INLINE form `^[content]` (pandoc
   form; a deliberate carve extension over djot). Numbering (one shared
   document-order sequence), endnotes rendering, and the inline form's
   balanced-bracket close are PART 9 §16. A bare `^` is always literal
   (no bare superscript), so `^[` is unambiguous. These two are the only
   note forms; the once-proposed sidenote `[>content]` was DISMISSED (a
   sidenote is footnote content placed by CSS, not a separate construct) --
   see docs/dismissed-syntax.md. *)
footnote = reference_footnote | inline_footnote ;
reference_footnote = "[^", footnote_label, ']' ;
footnote_label = {character - ']'}+ ;
inline_footnote = "^[", inline_content, ']' ;
(* the closing `]` is the balanced, escape- and code-span-aware close
   (PART 9 §16); footnote recognition is DISABLED inside the content *)

footnote_definition = "[^", footnote_label, "]:", space, inline_content, newline,
                      {footnote_continuation | continuation_marker_block} ;
footnote_continuation = (space, space, {whitespace}, inline_content, newline) | blank_line ;
(* the body extends to following lines indented by >= 2 spaces -- PART 9 §16.
   A lone `+` also attaches a flush-left block (`continuation_marker_block`,
   PART 9 §17), the same continuation marker lists, block quotes and definition
   bodies use. *)

(* DISMISSED (not reserved, no production): a sidenote form
     sidenote = "[>", inline_content, ']' ;
   was proposed and declined -- margin placement is CSS over the existing
   footnote/endnote output, so it does not warrant core syntax. The `[>`
   opener is therefore UNCLAIMED and `[>foo]` is literal text (pinned by the
   corpus). Rationale: docs/dismissed-syntax.md. *)

(* --- Citations (PART 9 §22, Tier-2 extension; issue #90) --- *)
(* TIER-2, OFF BY DEFAULT. A `[...]` whose content contains a `@key` and which
   has NO link/ref/span tail (`(url)` / `[ref]` / `{attrs}`) is a citation.
   Bare `@key` stays a mention (PART 19); `\@` is literal. Items are
   `;`-separated; each is `[prefix] [-] @key [, locator]` (`-` suppresses the
   author in author-date mode). Bibliography entries are defined in-document:
   `[@key]: {author= year=}? entry` (the `{…}` is optional, feeds author-date;
   a leading `@` label is reserved here, never a link definition - see
   reference_definition). Numbered output (default) emits `[1]` + an ordered
   references list; author-date (configuration) emits `(Author Year)` + an
   alphabetical list. The references list is appended at document end or
   injected into a `::: references` block.

   INTEGRAL MARKER (issue #226): a single leading `+` immediately after `[`
   marks the whole citation cluster as integral (author-in-text). The `+` is
   a group-level flag - there is no per-item integral mode. An integral cluster
   is wrapped in `<span class="citation" data-cite-mode="integral">…</span>`.
   The vocabulary (`integral` / `non-integral`) matches CSL / Citum `CitationMode`.
   Example: `[+@smith2020]`, `[+see @smith2020, p. 12]`.

   TYPED LOCATORS (issue #226): the locator text after the first `,` in an item
   is parsed into a citeproc `{label, value}` structure plus a trailing suffix.
   Longest-match, boundary rule: a term matches only when followed by end-of-locator,
   whitespace, a digit, `§`, or `¶`. On a leading digit with no preceding term,
   the default label is `page`. Trailing `,`, `&`, `-`, `.` are trimmed from the
   value; the remainder is the suffix. Canonical labels with recognized abbreviations:
     book (bk.) | chapter (chap., chaps.) | column | figure | folio |
     issue (no.) | line (l., ll.) | note (n., nn.) | opus | page (p., pp.) |
     paragraph (para., ¶) | part | section (sec., §) | sub verbo (s.v.) |
     verse (v., vv.) | volume (vol.)

   DATA-* HTML CONTRACT (issue #226): each citation anchor carries, in order,
   only the attributes that apply: `data-cite-key`, `data-suppress-author`,
   `data-cite-prefix`, `data-locator-label`, `data-locator`, `data-suffix`.
   Prefix and suffix are flattened plain text.

   Undefined behavior (impls MAY differ, NOT corpus-pinned): same-author-year
   disambiguation letters (e.g. `2020a` / `2020b`) are out of scope for v1 -
   the bare year is emitted. A `;` inside a locator is not supported (it splits
   items); such a bracket falls back to literal text.
   Failure-mode handling IS pinned (tests/corpus-optional): a group with any
   undefined key renders the verbatim source; a `[@k]{...}` is a span, not a
   citation; a `[@k, a; b]` with a `;` in the locator is literal text.
   Pinned in tests/corpus-optional (features citations-numbered /
   citations-author-date and the failure-mode cases, plus typed-locator,
   integral, and suppress-author enrichment cases 13-24). *)
citation_group = '[', ['+'], citation_item, {';', citation_item}, ']' ;
citation_item = [inline_content], ['-'], '@', citation_key, [',', space, inline_content] ;
(* A single leading `+` immediately after `[` sets the whole citation cluster to
   integral mode (CSL/Citum CitationMode: integral = author-in-text; absent =
   non-integral/parenthetical). This is the Djot draft `[+@foo, p. 15]` form.
   Mode is a GROUP property; there is no per-item `+`. A bare `+` before `@key`
   inside a multi-item bracket (e.g. `[@a; +@b]`) is parsed as prefix text on
   that item, not as an integral marker.
   `-` before `@key` suppresses the author for that item (per-item; applies in
   both numbered and author-date modes via `data-suppress-author="true"`).
   The locator (text after the first `,` in an item) is parsed into a typed
   citeproc label + value + trailing suffix: longest-match label from the fixed
   vocabulary (book/chapter/column/figure/folio/issue/line/note/opus/page/
   paragraph/part/section/sub verbo/verse/volume, with abbreviations), matching
   only at a boundary (end of string, whitespace, ASCII digit, `§`, or `¶`); no
   label but a leading ASCII digit defaults to `page`; value is the leading run
   of digits/roman/`.,&-` with trailing separators trimmed; the remainder is the
   suffix. A roman numeral locator with no label produces no data-locator-label.
   The structure is surfaced as `data-*` attributes on the rendered citation
   anchors (`data-cite-key`, `data-suppress-author`, `data-cite-prefix`,
   `data-locator-label`, `data-locator`, `data-suffix`) and, for an integral
   cluster, a `<span class="citation" data-cite-mode="integral">` wrapper around
   the entire citation group output. *)
citation_key = (letter | digit | '_'), {citation_key_char} ;
(* Pandoc-compatible key charset (carve-js `KEY`): after the first char, any of
   these internal punctuation marks are allowed. *)
citation_key_char = letter | digit | '_' | ':' | '.' | '#' | '$' | '%' | '&'
                  | '+' | '?' | '<' | '>' | '~' | '/' | '-' ;
citation_definition = '[@', citation_key, "]:", space, [attributes], inline_content, newline ;

(* --- Extensions --- *)
extension_inline = ':', extension_name, '[', extension_content, ']', [attributes] ;
extension_name = identifier ;
(* `identifier` already permits a leading `_`, so `_` is a valid extension
   name: `:_[x]` -> `<span class="ext-_">x</span>` (an unrecognized name
   falls back to a generic `<span class="ext-NAME">`). It MUST start with a
   letter or `_`, so a digit-first name is NOT an extension: `:1[x]` stays
   literal text, while `:a1[x]` (digit only after the first letter) is valid.
   On a GENERIC extension an authored `{.class}` MERGES into the single
   `class` attribute -- the structural `ext-NAME` class comes FIRST, then the
   authored classes (`:foo[a]{.cls}` -> `class="ext-foo cls"`); never two
   `class` attributes. *)
extension_content = {character - ']'} ;
(* The content runs up to the FIRST `]`; a nested `]` closes the extension
   early and the remainder is literal: `:foo[a [b] c]` ->
   `<span class="ext-foo">a [b</span> c]`. *)

(* --- Editorial Markup (CriticMarkup-inspired) --- *)
editorial_markup = addition | deletion | substitution | editorial_comment ;

addition = "{+", inline_content, "+}", [attributes] ;
deletion = "{-", inline_content, "-}", [attributes] ;
substitution = "{~", inline_content, "~>", inline_content, "~}" ;
editorial_comment = "{#", comment_content, "#}" ;
comment_content = (* any text until the matching `#}`, preserved literally *) ;
(* CONTENT IS LITERAL -- NORMATIVE. Nothing inside is parsed as markup, spaces
   are preserved, and no escape is resolved: `{# a *b* c #}` holds the eight
   characters ` a *b* c ` and renders them, asterisks included. This says what
   all three implementations already do; the production read `inline_content`,
   which none of them implemented.

   It follows from what the construct is. An editorial comment is the author's
   aside about the text, not more text -- emphasis inside it would be markup
   the reader never sees rendered, and a formatter could not tell it from
   emphasis the author wrote outside. It is also what lets `link_text`'s scan
   skip the span: with no escape available, a `]` inside is uncloseable any
   other way.

   `addition` and `deletion` are NOT literal: they take `inline_content`, and
   `{+a *b* c+}` really does emphasize `b`.

   `substitution` is an OPEN divergence, deliberately left alone here. Its
   production says `inline_content`, and all three implementations render
   `{~*b*~>*e*~}` with the asterisks literal - so either the production or
   every engine is wrong, and which one is a behavior question rather than a
   transcription error. Not folded into this clause. *)
(* ATTRIBUTES -- NORMATIVE: an addition / deletion is an ordinary inline node,
   so a trailing `{...}` attribute block attaches to its `<ins>` / `<del>`,
   exactly like a span, code span, link, or emphasis: `{+a+}{.a}` ->
   `<ins class="a">…</ins>`. (All three impls agreed only after this was
   pinned; carve-js dropped the block, carve-rs kept it literal.) *)
(* DISAMBIGUATION (PART 9 §22). `{~ … ~}` is editorial SUBSTITUTION when it
   contains a top-level `~>`, and forced STRIKETHROUGH otherwise. `{# … #}`
   stays editorial comment (`#` is not an emphasis delimiter, no collision). *)

(* HIGHLIGHT is the single-char `=` delimiter, and `{=text=}` is its FORCED
   intraword form (forced_highlight, PART 9 §22), rendering <mark>. It is
   distinct from the raw-inline `{=format}` attribute on a code span (e.g.
   `` `x`{=html} ``), which has no trailing `=` before the `}`. *)

(* --- Mentions and Tags --- *)
mention = '@', mention_name ;
mention_name = name_word, {'.', name_word} ;

tag = '#', tag_name ;
tag_name = name_word, {'.', name_word} ;

name_word = (letter | digit | '_' | '-')+ ;
(* Dots are INTERIOR-only by construction: a dot followed by another name
   character continues the name (`@john.doe`, `#release-1.0`); a dot at the
   end of the run is sentence punctuation, never part of the name
   (`ping @markus.` -> mention `markus` + literal `.`). Pinned by corpus
   89-mention-and-tag-name-boundaries / 31-mention-ignores-email-addresses. *)

(* Boundary rules (PSEUDO-PRODUCTIONS -- prose conditions on `mention` /
   `tag` / `symbol`, normative in PART 9 §7, not reachable grammar symbols):
   the marker must be at start of content or preceded by a character that is
   NOT a word character ([A-Za-z0-9_]). Whitespace qualifies, but so does any
   punctuation: `(@user)` is a mention, `x=@user` is a mention; `a@b` is not.
   The same guard applies to `#tag` and `:symbol:`. *)
mention_boundary = (* @name: no word character directly before @ *) ;
tag_boundary = (* #name: no word character directly before # *) ;
symbol_boundary = (* :name:: no word character directly before the opening
   ':' -- `a:b:c` and `10:30:` stay literal text; `(:tada:)` is a symbol *) ;

(* --- Symbols --- *)
(* A symbol is a generic named inline placeholder with NO built-in semantics:
   the parser only records the name. Resolution is processor configuration:
   (1) a registered inline-renderer extension handler for symbol nodes wins,
   (2) else the renderer `symbols` map (name -> replacement, emitted RAW in
   the target format -- processor config is trusted, same class as the
   `renderers` map), (3) else the literal text `:name:` (escaped). Emoji
   substitution is the common use, not a language feature. Profiles may
   strip or force-literal symbols ("symbol" is the construct name in
   profile allowlists).
   ATTRIBUTES -- NORMATIVE: a trailing `{...}` attribute block attaches to
   the symbol; in HTML output attributes force a `<span>` wrapper around the
   resolved (or literal) output so they have an element to land on:
   `:rocket:{.big}` -> `<span class="big">🚀</span>` (mapped) /
   `<span class="big">:rocket:</span>` (unmapped). Without attributes no
   wrapper is emitted. *)
symbol = ':', symbol_name, ':', [attributes] ;
symbol_name = (letter | digit | '+' | '-'), {letter | digit | '_' | '+' | '-'} ;
(* The FIRST name character is a letter, a digit, `+` or `-`, so the reaction
   shortcodes `:+1:` / `:-1:` parse (they are the most common ones in the
   wild). It may NOT be `_`: `:_x_:` would otherwise steal from underline
   (`:` + `_x_` + `:`), and no real shortcode needs a leading underscore --
   so `:_x:` stays literal text. No whitespace, no second colon inside. The
   inline extension `:type[...]` is tried first; `:kbd[Ctrl]` never parses
   as a symbol because the name run must be closed by ':'.

   PRECEDENCE vs SMART TYPOGRAPHY -- NORMATIVE: a symbol is recognized BEFORE
   the typographic replacements (PART 9 §8), so a name made of typographic
   punctuation is a symbol, not a substitution: `:+-:` is the symbol `+-`
   (NOT `:` + ± + `:`), and `:--:` is the symbol `--` (NOT an en dash between
   colons). The typographic forms still apply everywhere a symbol does not
   open -- `a +- b` is `a ± b`, and `word:+-:` (no boundary, so no symbol)
   is `word:±:`. Pinned by the corpus (Symbols section). *)

(* --- Smart Typography --- *)
smart_typography = em_dash | en_dash | ellipsis | smart_quote
                 | arrow | comparison | typographic_symbol ;
(* NOTE: fractions are deliberately NOT converted (PART 9 §8 / dismissed-
   syntax.md) -- `1/2` collides with dates (`1/2/2024`) and paths. *)

em_dash = "---" ;     (* → — *)
en_dash = "--" ;      (* → – *)
ellipsis = "..." ;    (* → … *)
(* HYPHEN RUNS -- a run of 2+ hyphens between non-hyphen characters
   decomposes djot-style: length divisible by 3 -> all em dashes;
   else divisible by 2 -> all en dashes; else one em dash (3) and the
   remainder by the same rule. So `--`=–, `---`=—, `----`=––,
   `-----`=—–, `------`=——, `-------`=—–– (PART 9 §8). *)

smart_quote = '"' | "'" ;
(* Smart quotes are PER-CHARACTER contextual substitution, NOT a paired
   span: each `"` / `'` is independently mapped to a left or right
   typographic quote by its PRECEDING character. It is a LEFT (opening)
   quote when preceded by start-of-content, whitespace (incl. NBSP
   U+00A0), OR one of the opening/operator characters `( [ { = : - /`;
   OTHERWISE it is a RIGHT (closing) quote. So an apostrophe (`don't`)
   is a right single quote, while `a="b"`, `:"q"`, `-"q"`, `/"q"` and
   `("q")` all OPEN the first quote. The SAME opening set applies to BOTH
   `"` and `'` (e.g. `('q')` -> `(‘q’)`, `{'q'}` -> `{‘q’}`). Outside that
   set the quote CLOSES: after a closing bracket (`}` `)` `]`), after
   sentence punctuation (`.` `,`), or mid-word (`a"b`) it is a right quote.
   An unpaired quote still converts; an empty `""` opens both marks (the
   second `"` follows a non-opening `"` but has nothing to close against);
   the substitution never spans or pairs across other inline constructs.
   `\"` / `\'` escape to the straight character (PART 9 §8). *)

arrow = "->" | "<-" | "<->" | "=>" ;
comparison = "!=" | "<=" | ">=" ;
(* NOT the `:name:` symbol above -- these are the fixed typographic
   replacements (© ® ™ ±). Named `typographic_symbol` so the two do not
   collide: `symbol` is the `:name:` placeholder (PART 9 §7). *)
typographic_symbol = "(c)" | "(r)" | "(tm)" | "+-" ;

(* --- Breaks --- *)
hard_break = '\', newline ;  (* backslash at end of line *)
soft_break = newline ;        (* regular line ending within paragraph *)

(* ============================================================================
   PART 4: ATTRIBUTES
   ============================================================================ *)

attributes = '{', opt_ws, attribute_list, opt_ws, '}' ;
attribute_list = attribute, {whitespace+, attribute} ;
(* interior padding is accepted: `{ .a }` and `{.a  .b}` are valid inline
   attribute blocks, same whitespace handling as block_attributes (minus
   the line-continuation, which is block-level only) *)
(* RENDER ORDER -- NORMATIVE: attributes are emitted in the order they
   appear in the source, with all classes merged into a single `class`
   attribute placed at the position of the FIRST class. So `{.a #b k=c}`
   -> `class="a" id="b" k="c"` and `{k=c .a #b}` -> `k="c" class="a"
   id="b"`. (Block-attribute merge keeps each slot at its first-appearance
   position; values are last-wins, PART 9 §15.) Repeated class VALUES are
   DEDUPLICATED, keeping first-occurrence order: `{.a .a .b}` -> `class="a
   b"`, and classes accumulated across adjacent/continued attribute blocks
   dedup the same way. *)

attribute = id_attribute | class_attribute | key_value_attribute
          | boolean_attribute ;

id_attribute = '#', identifier ;
class_attribute = '.', identifier ;
key_value_attribute = identifier, '=', attribute_value ;
(* A bare identifier with no value is a BOOLEAN (value-less) attribute,
   rendered `name=""` (a carve extension beyond canonical djot, matching
   djot-php; matched after key_value_attribute so `k=v` is not read as a bare
   `k`). It mixes freely with the other forms. See PART 9 §14. *)
boolean_attribute = identifier ;

attribute_value = unquoted_value | quoted_value ;
unquoted_value = (letter | digit | '-' | '_' | '.' | ':')+ ;
(* `.` and `:` are admitted so common unquoted values -- version strings
   (`v1.2`), paths/data values (`a.b`), and namespaced/pseudo tokens
   (`xml:lang`, `sm:hover`) -- need no quoting. A value containing any OTHER
   non-identifier character must be quoted. *)
(* A backslash escapes ASCII punctuation inside a quoted value (same
   `escaped_char` rule as inline text), so a value can contain a literal
   quote: `"a\"b"` -> the value `a"b`. *)
quoted_value = '"', { escaped_char | (character - '"' - '\') }, '"'
             | "'", { escaped_char | (character - "'" - '\') }, "'" ;

(* Block-level attributes appear on their own line(s) preceding a block.
   A single attribute block may span multiple physical lines (the `}`
   need not be on the opening line): a continuation is a single line
   break optionally followed by indentation. A BLANK line (two line
   breaks) ends the block -- it is not interior padding; an attribute
   block interrupted by a blank line is not a block_attributes (both
   reference impls treat it as literal text). Merge + reach semantics
   (including float-across-blank-lines BETWEEN separate blocks): §15. *)
block_attributes = '{', opt_ws, attribute, {attr_separator, attribute}, opt_ws, '}', newline ;
(* block_attributes requires AT LEAST ONE attribute -- there is NO block-level
   blessed-empty form (unlike the inline `[text]{}` span, which is blessed).
   A bare `{}` line is therefore NOT a block-attribute block; it stays a
   literal paragraph (`<p>{}</p>`). *)
opt_ws = {whitespace} ;                  (* spaces/tabs only, no line breaks *)
attr_separator = (whitespace | continuation), opt_ws ;  (* one ws OR one line break *)
continuation = newline, opt_ws ;         (* a single line break + indent; NOT a blank line *)

(* ============================================================================
   PART 5: ABBREVIATIONS
   ============================================================================ *)

abbreviation_definition = "*[", abbreviation_term, "]:", space, abbreviation_expansion, newline ;
abbreviation_term = (letter | digit)+ ;
(* a term is a SINGLE alphanumeric word -- a bracketed term containing
   punctuation or spaces (`*[e.g.]:`, `*[HTTP API]:`) is NOT a definition;
   the line stays ordinary paragraph text *)
abbreviation_expansion = {character - newline}+ ;

(* ============================================================================
   PART 6: INCLUDES
   ============================================================================ *)

(* PROCESSOR-LEVEL (PART 9 §19): includes are NOT part of the core parser
   and the directive is deliberately NOT reachable from `block`/`inline` --
   a conformant core may leave `{{ … }}` literal. *)
include_directive = "{{", space, include_path, [include_section], [include_options], space, "}}" ;
include_path = {character - ('#' | '@' | '}' | ' ')}+ ;
include_section = '#', identifier ;
include_options = {space, '@', identifier, ':', attribute_value}+ ;

(* ============================================================================
   PART 7: LEXICAL ELEMENTS
   ============================================================================ *)

identifier = (letter | '_'), {letter | digit | '_' | '-'} ;

letter = 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i' | 'j'
       | 'k' | 'l' | 'm' | 'n' | 'o' | 'p' | 'q' | 'r' | 's' | 't'
       | 'u' | 'v' | 'w' | 'x' | 'y' | 'z'
       | 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'G' | 'H' | 'I' | 'J'
       | 'K' | 'L' | 'M' | 'N' | 'O' | 'P' | 'Q' | 'R' | 'S' | 'T'
       | 'U' | 'V' | 'W' | 'X' | 'Y' | 'Z' ;

digit = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' ;

whitespace = ' ' | '\t' ;
space = ' ' ;
newline = '\n' | '\r\n' | '\r' ;
backslash = '\' ;

character = (* any Unicode character *) ;
email_char = letter | digit | '.' | '-' | '_' | '+' ;

EOF = (* end of file *) ;

(* ============================================================================
   PART 8: PARSING PRECEDENCE
   ============================================================================ *)

(*
   Block parsing precedence (first pass):
   1. Frontmatter (--- delimited at document start)
   2. Comments (%% line, %%% block)
   3. Raw blocks (```=FORMAT -- matched before ordinary fences),
      then code blocks (``` or ~~~ fenced)
   4. Headings (# prefix)
   5. Thematic breaks (---, ***, ___)
   6. Block quotes (> prefix)
   7. Definitions (link/footnote/abbreviation definition lines) and
      block-attribute lines ({...} alone on a line, PART 9 §15)
   8. Lists (-, * bullets at any indent; ordered markers; :: definition
      lists)
   9. Tables (| prefix)
   10. Colon fences (::: -- admonition with type word, line block with |,
       local hard-break block with \, generic div when bare; longest-token
       wins: a `::: x` line is never a `:: ` definition term)
   11. Paragraphs (everything else)
   Recognition order is NOT interruption: which of these may interrupt an
   open paragraph is PART 9 §10 (list markers -- bullet and ordered -- never do).

   Inline parsing precedence (second pass):
   1. Escaped characters (\x)
   2. Code spans (`...`)
   3. Autolinks (<url>) and crossrefs (</#id>)
   4. Inline footnotes (^[content]) -- above emphasis; a bare `^` is
      always literal, so `^[` is unambiguous; see PART 9 §16
   5. Links, images, reference footnotes, inline spans
      ([text](url), ![alt](src), [^label], [text]{attrs} -- the
      single-char lookahead after `]` selects among them, PART 9 §14;
      a `[^` run is a footnote reference before bracket scanning)
   6. Math ($`…`, $$`…`)
   7. Emphasis markers (/, *, _, ~, =) and forced spans ({/.../}, {^...^},
      {,...,} etc.)
      -- EXCEPT a delimiter that begins a multi-char smart-typography
      pattern: `=>` is the arrow, never a highlight opener (the pattern
      is consumed first; PART 9 §8)
   8. Editorial markup ({+...+}, etc.; {~...~} with `~>` = substitution)
   9. Extensions (:type[content])
   10. Mentions and tags (@user, #tag)
   11. Smart typography (--, ---, etc.)
   12. Trailing line comments (%%, PART 9 §21)
   13. Plain text

   Disambiguation rule:
   - Prefer literal text over markup: a delimiter that has no valid match
     (per the word-boundary conditions, PART 9 §1) is literal.
   - An opener matches the NEAREST following valid closer of the same type.
     Delimiters of the same type between them are literal content (same-type
     spans do not nest, PART 9 §3); e.g. /usr/local/ -> <em>usr/local</em>.
   - Different-type spans nest and are resolved with a delimiter stack in a
     single left-to-right pass (NO backtracking, linear time); e.g.
     *bold /italic/* and /italic *bold*/ nests fully.

   Note: "shorter spans / earlier opening wins" does NOT apply -- it would
   truncate /usr/local/ to <em>usr</em> and break nested emphasis. The
   delimiter-stack model is what makes Design Principle 1 (no backtracking)
   hold while still producing the canonical outputs in the corpus.
*)

(* ============================================================================
   PART 9: SEMANTIC CONSTRAINTS (GUARDS + OPERATIONAL SEMANTICS; the
   rules a context-free production cannot carry, stated formally)
   ============================================================================ *)

(*
   1. WORD BOUNDARIES FOR EMPHASIS -- FORMALIZED. Stated formally by the
      bare_opener(d) / bare_closer(d) guard templates in PART 3, for all
      five single-char bare delimiters `/ * _ ~ =` (sup/sub `^` `,` are
      braced-only, `{^ ^}` / `{, ,}`, not bare); §9 holds the
      run-resolution rules (nearest closer, delimiter stack). No bare
      delimiter emphasizes intraword; the forced `{X … X}` family (§22) is
      the deliberate-intraword escape hatch.
      Canonical pinned examples: "/ not italic /" literal (ws after opener);
      "a/b/c", "foo*bar*baz", "snake_case", "x = 5", "key=value" literal
      (no left boundary / ws after opener); "/usr/local/" ->
      <em>usr/local</em> (01-emphasis-6); "x{*y*}z" -> x<strong>y</strong>z
      (forced, §22). STRICTER than Djot, whose rule is whitespace-only.

   2. MATCHING FENCES -- FORMALIZED. Stated as `where` guards on the
      productions (GUARD NOTATION, header):
      - code_fence_close: same fence character, len(close) >= len(open);
      - colon_fence_close (§12): len(close) >= len(open);
      - comment_block_close: len(close) = len(open). Only the LEADING RUN of
        `%` is compared -- trailing text on either fence line is insignificant,
        and an opener with no matching closer ahead opens nothing (§28).

   3. NO SAME-TYPE NESTING
      - /nested /italic/ here/ is invalid
      - /nested *bold* here/ is valid
      - A consequence (§9 SAME-DELIMITER ADJACENCY): a doubled bare delimiter
        never opens nested same-type emphasis -- it is literal text. So `**x**`
        -> `**x**` and `~~x~~` -> `~~x~~`, uniform with the already-literal
        `//x//` and `__x__` (corpus 74-doubled-emphasis-delimiters). The
        braced-only `^`/`,` are not bare, so `^^x^^` is likewise literal.

   4. CAPTION PLACEMENT -- FORMALIZED (caption_slot in PART 2). A `^ `
      caption attaches via the [caption_slot] slot of the captionable
      hosts: an image paragraph, a blockquote, a table, a fenced code block
      (a captioned code block is a numbered LISTING), or a standalone
      display-math block (a numbered EQUATION; the block must be solely the
      `$$`…`` span). caption_slot's single optional blank_line IS the
      "one blank line allowed between block and caption" rule -- structural
      now, not prose.

   5. TABLE MODEL -- OPERATIONAL SEMANTICS (cell split, row validity,
      header cells, span walk, continuation rows, delimiter row, row
      attributes). STATE: a parsed table is a grid of logical cells;
      consumed(r,c) marks a position absorbed by a span originating at
      another cell; avail(r,c) := a cell was emitted at (r,c) and NOT
      consumed(r,c).
      T1 CELL SPLIT (escape scan, cell-splitting-aware). Scanning a row
         left to right OUTSIDE code spans: an unescaped `|` separates
         cells; `\|` and a `|` inside a code span are content. (The
         code-span state carried across the scan is what makes this
         non-context-free.)
      T2 VALID TABLE ROW. ONE predicate, used both for the §10 I1
         interruption test and for opening a table at a block start:
         valid_row(L) := L's first non-indent character is `|` AND its last
         non-whitespace character is `|` AND at least one cell lies between.
         (T8's row attribute block is glued to that closing `|`, so a row
         ending `|{.x}` still qualifies -- strip the block, then test.)
         A line-initial `|` WITHOUT the closing `|` is prose, wherever it
         appears: `| a | b` neither interrupts a paragraph nor opens a table
         (corpus 140). This matches the `standard_row` production, which
         ends in `'|'`; there is no lenient open form.
         The two uses may never disagree: a line the block parser builds a
         table from but the §17 sub-block test calls prose would make a list
         item loose AND fill it with a table.
      T3 HEADER CELL. A leading `=` GLUED to the cell's opening `|` marks a
         HEADER cell; `\=` yields a data cell whose text starts with `=`.
      T4 SPAN MARKERS. A cell whose WHOLE content is `^` (rowspan) or `<`
         (colspan) is a span marker; ANY other content disqualifies it --
         `{.x} <` is an ordinary cell with literal text `{.x} <` (no
         attributed span marker exists).
      T5 SPAN WALK -- NORMATIVE, a total function. origin(marker) :=
           for `<` at (r,c): the nearest c' < c walking LEFT with
             avail(r,c'), skipping consumed positions;
           for `^` at (r,c): the nearest r' < r walking UP with
             avail(r',c), skipping consumed positions.
         Origin found -> extend its colspan/rowspan across the marker's
         position and mark that position consumed. A `^` MAY reach a HEADER
         cell: the rowspan crosses the thead/tbody boundary
         (`<th rowspan="N">`), for native `|=` and GFM-promoted headers
         alike (corpus 99-table-header-cell-rowspan).
         NO origin (the walk runs off the edge) -> the marker renders as an
         EMPTY cell (`<td></td>` / `<th></th>`, no content, no span); never
         dropped, never literal. Covers uniformly:
           * ORPHAN -- `^` in the FIRST row / `<` in the FIRST column
             (corpus 96-table-span-marker-in-first-column);
           * BLOCKED -- every intermediate position consumed by other spans
             (corpus 104-blocked-span-marker-renders-as-empty-cell). A
             marker that skips consumed neighbors and still reaches an
             available cell DOES extend it -- not blocked.
      T6 CONTINUATION ROWS (`+` in the first column instead of `|`).
         Appends per-column onto the row ABOVE: each non-empty cell joins
         its column's cell content with a single space (soft wrap); empty
         cells append nothing; no `<tr>` is produced. Under rowspan the
         joined content belongs to the SPANNING cell (corpus 40, 41). A
         table cannot BEGIN with a continuation row (PART 2 `table`).
      T7 GFM DELIMITER ROW -- NORMATIVE. delimiter_row(R) := R is EXACTLY
         the SECOND row of the table AND every cell matches delimiter_cell
         (optional ws, optional `:`, one or more `-`, optional `:`,
         optional ws -- PART 2; an empty cell `|---||` or other content
         `|-a-|`, `|:|` disqualifies the whole row, which is then an
         ordinary standard_row). Effects:
           1. promote the FIRST row's cells to header (`<thead>`/`<th>`);
           2. set per-column alignment from the colons (`:--` left, `--:`
              right, `:-:` center, bare `---` none), applied to the WHOLE
              column (header + body cells);
           3. CONSUME the row (no `<tr>`).
         Column-count mismatch: unmatched columns promote with default
         alignment. POSITION IS STRICT: a delimiter-shaped row anywhere
         else (first line, or after the second) is an ordinary row whose
         `-` runs are inline content (`---` may smart-render as an em
         dash). Native `|=` header cells and `<`/`>`/`~` alignment markers
         may COEXIST with a delimiter row; a single column carrying BOTH a
         native marker AND a delimiter colon has UNSPECIFIED alignment (do
         not combine on one column). Verified identical js/php/rs; pinned
         by corpus 09-tables-3.
      T8 ROW ATTRIBUTES -- NORMATIVE. A `{…}` GLUED to a row's closing `|`
         (no intervening space) sets that row's `<tr>` attributes -- the
         row-level twin of a cell's opening-pipe attribute block, same
         syntax, same validity rule. The WHOLE payload must be a valid,
         non-empty attribute block (§15); a space before the brace
         (`| a | {.x}`), an empty `{}`, or an invalid payload is NOT a row
         attribute -- the `{` is ordinary content, so the line (ending in
         `}`, not `|`) is not even a table row. Applies to header and body
         rows; composes with the delimiter row and with cell attributes:
         `|{.c} a |{.r}` -> `<tr class="r">` with `<td class="c">`. Pinned
         by corpus 98-table-row-attributes.

   6. REFERENCE RESOLUTION -- FORMAL STATEMENT MOVED TO PART 9R (rules
      R1-R3; this §-anchor stays for existing references). Summary: link
      labels match EXACTLY (case-sensitive, no whitespace folding; corpus
      73-reference-labels-are-case-sensitive); LAST link definition wins
      vs FIRST footnote definition (deliberate asymmetry, §16); collapsed
      references [text][] use their text as the label; abbreviations match
      at word boundaries only; ALL definitions (link / footnote /
      abbreviation / heading ids) are collected in the first pass (PART 8)
      before inline resolution runs -- a definition after its use still
      resolves; two-pass, O(n), not backtracking. Design Principle 2 ("no
      dependency on later references") scopes to inline
      tokenization/highlighting, which never needs the definition table;
      semantic expansion (abbr, </#id> auto-text, [ref] targets) uses it.

   7. MENTION/TAG BOUNDARIES
      - @ and # only start mentions/tags at word boundaries
      - email@domain.com is NOT a mention
      - The name runs over letters, digits, `_`, `-`, and INTERIOR dots
        (a dot followed by another name character: `@john.doe`,
        `#release-1.0`); a trailing dot is sentence punctuation, not part
        of the name

   8. SMART TYPOGRAPHY CONTEXT
      - Only applies outside code spans/blocks
      - Patterns must match exactly (not partial)
      - FRACTIONS ARE NOT CONVERTED. `1/2`, `3/4`, etc. stay literal --
        they collide with dates (`1/2/2024`) and paths, and djot has no
        fractions (recorded in dismissed-syntax.md). Converted set:
        dashes, ellipsis, smart quotes, arrows, comparisons, `+-`, and
        symbols (`(c)`,`(r)`,`(tm)`).
      - UNCONDITIONAL BY DEFAULT -- NORMATIVE. The substitution belongs to
        the default inline layer, not to an opt-in feature: a conformant
        implementation performs it with NO extension registered. A locale
        or glyph extension (a German / Swiss quote set, say) selects WHICH
        characters are emitted; it does NOT decide WHETHER the substitution
        runs, and removing such an extension does not disable the
        transform. Stated explicitly because the existence of a
        smart-quotes extension otherwise reads as "the transform is that
        extension" -- it is not (carve#338).
      - OPTIONAL OFF SWITCH -- NORMATIVE WHERE OFFERED. A host MAY expose
        one document-global switch, `smartTypography`, defaulting to TRUE
        (idiomatic spellings: `smartTypography: false` in carve-js and
        carve-php, `Options::with_smart_typography(false)` in carve-rs).
        A host that omits the switch stays conformant. Where it IS offered
        and set to FALSE:
        * PARSING IS UNCHANGED. The nodes below are still produced, so the
          AST does not depend on the switch. What changes is rendering: a
          presentation renderer emits the node's SOURCE RUN instead of its
          glyph, exactly as the canonical writer already does. Every trigger
          character therefore survives as the ASCII the author typed --
          dashes, ellipsis, quotes, arrows, comparisons, and the typographic
          symbols `(c)` `(r)` `(tm)` `+-`;
        * the `:name:` symbol production is UNAFFECTED; it is not a
          `smart_typography` production, and its precedence over the
          typographic forms is unchanged;
        * escape handling is UNAFFECTED: `\"`, `\-`, `\.` still consume the
          backslash and yield the bare character (PART 7). Turning the
          switch off changes which UNESCAPED characters convert, nothing
          about escapes;
        * heading ids are BYTE-IDENTICAL either way. Ids are computed from
          the ASCII source (the reverse pass in HEADING IDENTIFIERS above),
          so with the switch off that pass has nothing to reverse -- it is
          a no-op, not a different result;
        * the switch is DOCUMENT-GLOBAL and applies to EVERY target.
          PER-TARGET DEFAULTS ARE NON-CONFORMANT (on for HTML, off for
          Markdown / plain / ANSI): one source must carry the same text on
          every target, and a target-dependent default would let
          `to_html(x)` and `to_markdown(x)` disagree about what the
          document says.
        Pinned in tests/corpus-optional (feature `smart-typography-off`).
      - AST REPRESENTATION -- NORMATIVE. A recognized substitution is its
        own inline node, `smart_punctuation`, carrying BOTH halves: the
        resolved KIND and the author's SOURCE RUN. A presentation renderer
        (HTML, Markdown, plain text, ANSI) emits the glyph for the kind; the
        canonical Carve writer emits the source run, so `fmt` reproduces the
        document instead of normalizing its punctuation. Writing the glyph
        straight into the text buffer is NOT conformant: it discards the
        spelling, after which `...` cannot be told from a literal U+2026 and
        the writer has nothing to reproduce.
        * SOURCE OUTPUT ON THE MARKDOWN TARGET -- OPTIONAL, NORMATIVE WHERE
          OFFERED. The glyph above is the DEFAULT on every presentation
          target and stays so. An implementation MAY additionally offer, as
          the named optional feature `markdown-typography-source`, a
          Markdown-renderer setting that emits the SOURCE RUN instead. An
          implementation that omits it stays conformant. Where it IS
          offered:
          - IT IS NOT A DEFAULT AND NOT DOCUMENT-GLOBAL. It is asked for per
            render call, on the Markdown target only. The `smartTypography`
            switch above is untouched, and so is its rule that per-target
            DEFAULTS are non-conformant: nothing here changes what any
            target emits unless a caller asks;
          - PARSING IS UNCHANGED, exactly as for the off switch. The same
            nodes are produced and the AST does not depend on the setting;
          - THE OTHER PRESENTATION TARGETS MUST NOT OFFER IT. Markdown is
            re-parseable source, which is what makes the canonical writer's
            argument apply to it: the source run survives a re-parse as the
            same node, while the glyph cannot be told from a literal one.
            HTML, plain text and ANSI are terminal presentations, where the
            glyph is the only sensible output and a source run would just be
            worse typography;
          - IT REVERSES ONLY WHAT THE RENDERER SUBSTITUTED. A quote the
            author typed as a typographic character is not a
            `smart_punctuation` node and is emitted unchanged either way.
          Rationale: Markdown is the target generated for machines to read,
          and the glyph breaks exact-match search against the source - a
          quote substituted around a code-ish value (`= 'verification-link-
          sent'`) can no longer be found by searching for what the author
          wrote, and the reader cannot undo it because the glyph is all that
          survives. Pinned in tests/corpus-optional (feature
          `markdown-typography-source`, target `markdown`).
        * KIND NAMES ARE SPEC SURFACE -- an implementation inventing its own
          spelling breaks every AST consumer. The fourteen table-resolved
          kinds are `ellipsis`, `em_dash`, `en_dash`, `left_right_arrow`,
          `rightwards_arrow`, `leftwards_arrow`, `rightwards_double_arrow`,
          `less_than_or_equal`, `greater_than_or_equal`, `not_equal`,
          `plus_minus`, `copyright`, `registered`, `trademark`. The four
          quote kinds are `left_double_quote`, `right_double_quote`,
          `left_single_quote`, `right_single_quote`.
        * A QUOTE NODE ALSO CARRIES ITS RESOLVED GLYPH. Quote characters are
          locale-dependent (a German or Swiss set), and the choice is made
          during parsing, so the node records the character rather than
          leaving it to the kind table. Every other kind resolves through the
          table above, which is why the quote kinds are absent from it.
        * A DASH RUN PARTITIONS INTO ONE NODE PER RESOLVED GLYPH, each
          carrying the hyphens it came from, so `----` (two en dashes)
          round-trips to exactly four hyphens.
        * FOR PROFILES the node folds into the `text` trust class. It is
          ordinary visible prose, not a distinct capability, so it is not
          separately nameable in a profile (profiles.md).
        See divergence-from-djot.md section 12 for how this compares to
        Djot's container-node model.

   9. EMPHASIS SPAN RESOLUTION -- FORMAL GUARDS + DELIMITER-STACK
      OPERATIONAL SEMANTICS (governs the *_content productions in PART 3;
      §1 points here). The per-delimiter boundary tests ARE the
      bare_opener(d) / bare_closer(d) templates in PART 3. Run resolution
      is one left-to-right pass over the inline stream with a delimiter
      stack -- O(n), NO backtracking (Design Principle 1):
      E1 CLASSIFY. At each unescaped bare delimiter d outside code spans /
         raw inline, evaluate bare_opener(d) and bare_closer(d). Neither
         holds -> literal text.
      E2 CLOSE FIRST. A d satisfying bare_closer(d) with an open d-entry on
         the stack closes the NEAREST such entry. Spans NEST, never
         overlap: entries pushed above the closed one are popped and their
         delimiters demoted to literal (the djot model -- an opener that
         gets closed demotes the still-open candidates between).
      E3 NO SAME-TYPE NESTING (§3). While a d-span is open, a further d
         that would open does NOT push -- it is literal content. Hence
         /usr/local/ -> <em>usr/local</em> (01-emphasis-6); the/path/here
         literal (first `/` has no left boundary); unbalanced trailing runs
         stay literal: /b// -> <em>b</em>/ and /x// -> <em>x</em>/
         (edge-cases.md §1).
      E4 OPEN. Otherwise a d satisfying bare_opener(d) pushes an entry.
      E5 END OF BLOCK. Entries still open demote to literal.
      Forced `{X … X}` spans (§22) push/pop on the SAME stack; only the
      boundary tests are skipped.
      CONSEQUENCES (pinned): SAME-DELIMITER ADJACENCY is bare_opener's
      <!(d) / !(d) guards -- a doubled bare delimiter is literal, never
      nested same-type emphasis: `**x**`, `~~x~~`, `==x==` like `//x//`,
      `__x__` (corpus 74-doubled-emphasis-delimiters, 01-emphasis-11);
      longer runs (`===y===`) all-literal. (The braced-only `^`/`,` are not
      bare, so `^^x^^` and `,,x,,` are likewise literal.) SLASH-ADJACENCY
      (slash_if(d), bare_opener's extra left guard): italic `/` and underline
      `_` do NOT open when the immediately preceding char is `/`, so `a_/_a_`,
      `a/_y_` and `/a/_b_` -> <em>a</em>_b_ stay literal at the `_`; for `/`
      this is same-delimiter adjacency. The other delimiters `* ~ =` are NOT
      guarded against a preceding `/` (a/~y~ -> a/<s>y</s>, a/=y= ->
      a/<mark>y</mark>) (corpus 127-emphasis-opener-slash-adjacency). WORD BOUNDARY: (/x/) ->
      (<em>x</em>), a./b/ -> a.<em>b</em> (01-emphasis-8); foo_bar_baz,
      foo*bar*baz, snake_case, x = 5, key=value literal (01-emphasis-10,
      -11); "x /a/b y" literal -- the candidate closer is followed by `b`
      (01-emphasis-9). Intraword emphasis: forced family only, x{*y*}z ->
      x<strong>y</strong>z (§22).

   10. PARAGRAPH INTERRUPTION -- OPERATIONAL SEMANTICS (governs
      `paragraph` in PART 2). The relation interrupts(L, P) decides whether
      a line L ends an OPEN paragraph P with NO blank line between, at the
      document top level AND inside nested content (list item, block quote,
      admonition/div body). L interrupts -> L is parsed as its block and P
      ends on the preceding line (the Markdown-like rule for every block
      EXCEPT lists, which follow Djot).
      I1 VISIBLE OPENERS interrupt. interrupts(L, P) holds when L begins:
         - a heading (`#`..`######` + space);
         - a thematic break (a lone `---` / `***` / `___` line);
         - a block quote (`>`; a line-initial `>` is ALWAYS a quote marker,
           so a prose line starting `>= 5` becomes a quote -- escape it,
           `\>= 5`, to keep it prose);
         - a valid table row (valid_row(L), §5 T2; a stray `|` in prose,
           and a line-initial `|` with no closing `|`, are NOT ones);
         - a fenced code opener WITH a matching closer ahead (I4);
         - a `:::` opener WITH a matching closer ahead (I4).
         So "intro \n # H" is <p>intro</p> + <h1>; "intro \n --- \n more"
         is <p>intro</p> + <hr> + <p>more</p> (corpus
         76-paragraph-interruption).
      I2 LIST MARKERS NEVER INTERRUPT (symmetric). NEITHER a bullet
         (`- `/`* `, `- [x] ` task) NOR an ordered marker (decimal, letter,
         roman, any value) interrupts: the line FOLDS into P as lazy
         continuation; a list needs a blank line before it (matching Djot).
         "Liste: \n - eins \n - zwei" is ONE paragraph, so is
         "x = 5 \n * 3 + 17"; a blank line starts the list. (Deliberate
         change from the earlier bullet/ordered asymmetry: an ordered
         marker is too common in prose, and the symmetric rule removes the
         hard-wrapped-bullet false positive. Pinned by corpus 05-lists-12/13,
         76-paragraph-interruption, 77-blockquote-lazy-continuation,
         79-multi-line-headings.) TIGHT NESTED LISTS UNAFFECTED: an
         indented marker inside an open list ITEM opens a sublist with no
         blank line -- that is §24 C3 (content column), not this relation.
      I3 IMAGE EXCLUDED. A line that is only a bare image ("![alt](url)")
         does not interrupt: an image is not a block of its own in Carve;
         it stays an inline image inside the paragraph.
      I4 CLOSER LOOKAHEAD -- the same &(...) guards the fence / div /
         frontmatter productions carry. An UNTERMINATED ``` or `:::` opener
         does not interrupt; the line stays paragraph text (a stray ``` in
         prose then opens an unclosed inline verbatim run rendering as
         <code> to the end of the block -- the code_span maximal-run rule).
         Pinned by corpus 76-paragraph-interruption (unterminated-fence and
         unterminated-`:::` pairs).
      I5 INVISIBLE CONSTRUCTS interrupt AND are consumed: a reference
         definition (link `[r]: url`, footnote `[^r]: …`, abbreviation
         `*[A]: …`), a comment (`%%` line, `%%%` block), and a
         block-attribute line (`{…}` alone on a line, §15). "See[^m].\n
         [^m]: note" resolves the footnote; "para\n%% x" drops the comment;
         "Para\n{.x}" ends the paragraph with the attribute line floating
         forward (§15; corpus 84-block-attribute-lines-7). (A small
         deviation from djot, which keeps such a line as paragraph text.)
      I6 SCOPE + THE HEADING EXCEPTION. The relation applies to EVERY open
         paragraph, including a blockquote's lazy continuation: "> quoted
         \n - item" folds into ONE quote whose paragraph is
         "quoted\n- item" (NOT quote + <ul>); "> text \n > # H" is a quoted
         heading; a quoted bullet ("> p \n > - x") folds the same way.
         HEADING is the SOLE exception: a bounded title holds no block, so
         a list marker ENDS an open heading and starts a sibling list
         ("# H \n - item" -> heading + <ul>), any block-opener ends it, and
         only plain text (or a same-`#` marker line) folds in (PART 2,
         MULTI-LINE HEADINGS) -- matching djot.

   11. LIST MARKER CHANGE STARTS A NEW LIST -- OPERATIONAL (governs `list`
      in PART 2). §10 decides FIRST whether marker lines establish a list
      at all (the interruption gate); §11 then partitions established
      same-indent items into sibling lists. The rules layer; §11 never
      overrides §10.
      N1 SAME LIST iff two adjacent items at the same indent match on ALL
         axes: bullet character (`-` vs `*`; `+` is NOT a bullet in Carve --
         it is the continuation marker, §17), ordered_marker alternative,
         delimiter (`.` vs `)`), and plain-vs-task classification. Any axis
         differs -> a new sibling list at the same indent. Matches djot
         (minus djot's `+` bullet). Example:
           - a              produces:    <ul><li>a</li><li>b</li></ul>
           - b                            <ul><li>c</li><li>d</li></ul>
           * c
           * d
         `- a` followed by `- [x] b` is likewise two lists.
      N2 ORDERED DIALECT. A list's dialect (decimal / lower- or upper-alpha
         / lower- or upper-roman) and delimiter are fixed by its FIRST
         item; a later marker outside the dialect, or with the other
         delimiter, starts a new sibling list. The first item also sets the
         `<ol>` `type` (a/A/i/I; decimal omits it) and `start` (the
         marker's value; omitted when 1).
      N3 AMBIGUOUS-LETTER TIE-BREAK. A single roman-letter first marker
         (i/v/x/l/c/d/m) is ROMAN when the next sibling is the consecutive
         roman numeral (`iv.`+`v.`, `i.`+`ii.`) and ALPHA when the next is
         the consecutive letter (`c.`+`d.`, `v.`+`w.`); a lone `i`/`I`
         defaults to roman, any other lone letter to alpha.
      N1 looks only one item back; N2/N3 carry the FIRST item's
      classification (with N3's one-item-forward tie-break) for the rest of
      the list.

   12. FENCED ":::" BLOCK RENDERING -- NORMATIVE (governs `admonition`
      and `div` in PART 2; cross-references syntax.md §4.20 "Block
      Extensions"). A TYPED `::: word` block renders by a TWO-TIER rule
      on the type identifier (below). A BARE `:::` opener with NO type
      word is a GENERIC DIV: a plain `<div>` (no class added; an empty div
      is `<div></div>`), i.e. djot's generic container.

      INLINE OPENER ATTRIBUTES -- STRICT (djot). The opener line carries NO
      inline attributes: it is the colon fence, an optional type word, and
      an optional quoted title, and NOTHING else. Any trailing `{...}` (or
      other non-title text after the type) makes the line an ORDINARY
      PARAGRAPH, not a fence -- so `::: note {.x}`, `::: {.x}`, and
      `:::{k=v}` are all paragraphs. To attribute a div or admonition, use
      a PRECEDING block-attribute line, which floats onto the block (§15):
      `{.x #id}` then `::: note` yields `<aside class="admonition note x"
      id="id">`. This matches canonical djot, which rejects any non-class
      text on the fence line.

      FENCE LENGTH & NESTING: a fence is a run of 3+ colons; the same rule
      governs both admonitions and generic divs. A block is closed only by
      a bare fence of EQUAL-OR-GREATER colon length, so a longer opener
      nests shorter blocks (a `:::` inside a `::::` block is content, not a
      closer). Equal-length fences do not nest. A bare opener with no
      matching closer ahead is literal text.

      FENCE OPENER WITH A NESTED-LIST BODY: a `:::` opener inside a LIST
      ITEM opens its block even when the opener's body is a nested list
      (`- item` / `1. item`). The matching closer is located at the ITEM
      CONTENT COLUMN (the column the opener starts at), not at column zero;
      a `:::` at column zero is outside the item and does NOT close it. With
      a valid closer, the bullet/ordered marker lines are the admonition's
      body and the block wraps the nested `<ul>`/`<ol>`. A blank line
      between the opener and the nested list still opens the block, and an
      empty body (opener immediately followed by its closer) opens too. With
      NO matching closer, the opener degrades to literal text and the marker
      lines form an ordinary nested list. The corpus pins these
      (docs/examples.md "Fence opener with a nested-list body inside a list
      item").

      The two tiers for a typed block:

      Tier 1 -- CANONICAL ADMONITION TYPES render as a semantic
        `<aside>` with the `admonition` marker class:
            <aside class="admonition {type}">
              [<p class="admonition-title">{quoted_title}</p>]
              {body}
            </aside>
        The canonical set is the eight call-out types
        `note`, `tip`, `warning`, `danger`, `info`, `success`,
        `example`, `quote`. Consuming CSS frameworks target these exact
        class names.

      Tier 2 -- ANY OTHER (custom) type renders as a generic block-level
        `<div>` carrying the verbatim type as its class:
            <div class="{type}">
              [<p class="admonition-title">{quoted_title}</p>]
              {body}
            </div>
        This is the carve fenced-div primitive that the block-extension
        mechanism (syntax.md §4.20) builds on -- e.g. `::: tabs`,
        `::: mermaid`, `::: codepen` produce `<div class="tabs">`,
        `<div class="mermaid">`, `<div class="codepen">` respectively,
        which a registered extension may post-process. An unregistered
        custom type still renders as its generic `<div class="{type}">`
        so the document stays readable. `details` is an ordinary Tier-2
        type (`<div class="details">`); an extension that wants the
        HTML5 `<details>/<summary>` disclosure element opts in via §4.20.

      Pinned rules common to both tiers:
        a. The class is the literal type identifier (Tier 1 prefixes it
           with `admonition ` separated by a single space:
           `admonition note`, NOT `admonition-note`; Tier 2 uses the
           bare `{type}`). The type is NEVER folded together with the
           quoted title -- the title is a child element, never a class.
        b. A `<p class="admonition-title">…</p>` line is emitted ONLY
           when the author supplied a `quoted_title` after the type.
           Carve does NOT invent a default title from the type name.
           An explicitly empty `""` still counts as a supplied (empty)
           title. Applies to both tiers.

   13. HEADING SECTION WRAPPING -- NORMATIVE (governs `atx_heading` in
      PART 2 and the renderer; matches djot, see `jgm/djot` official
      `headings.test`). Every heading emits a `<section id="{id}">`
      wrapper around itself and the following content up to the next
      same-or-shallower heading. The slug-derivation rule (ASCII
      transliteration etc., see syntax.md §4.1) computes
      the id; the id lives on the `<section>` element, NOT on the
      `<h*>` element. This shifts the existing carve behavior
      (id-on-h*, no section wrapper) to match djot's structural model.
      Algorithm (one stateful pass over top-level blocks):
        let openSections : Stack of (level, sectionElement) = []
        for each top-level child node:
          if node is Heading at level N:
            while openSections is non-empty AND top.level >= N:
              emit </section>; pop openSections
            emit <section id="{slug}">; push (N, ...)
            emit <h{N}>{inline-rendered children}</h{N}>
          else:
            emit node normally
        while openSections is non-empty:
          emit </section>; pop openSections
      Properties:
      - Adjacent same-level headings: produce sibling sections at the
        same level (`# A / # B` -> two <section>s).
      - Skipped levels: a `# H1 / ### H3` sequence nests H3's section
        inside H1's, because the §11-style stack-close test (top.level
        >= N) only closes sections at level >= 3 (none open at level 3
        or 5/6). The structure mirrors djot's behavior; carve does NOT
        synthesize intermediate <h2>/<section> nodes.
      - Explicit `{#id}` on a heading: the id still lives on the
        `<section>`, not on the `<h*>`. The dedup namespace (syntax.md
        §4.1 / heading-id-tracker rules) is unchanged -- explicit ids are
        reserved before auto-ids in document order.
      - Empty document or doc with no headings: zero <section> elements
        are emitted.
      - The two-pass id resolution (collect explicit ids first, then
        auto-slug headings with collision-dedup) is unaffected by this
        rule -- only the EMISSION site of the id moves from <h*> to
        <section>. Existing crossref (`</#id>`) and implicit-heading
        ref (`[Heading][]`) resolution continue to target the same
        slug; the fragment URL `#{slug}` resolves to <section id> the
        same way browsers resolve to <h* id>.
      - Carve does NOT add `<section>` wrappers around non-heading
        top-level blocks. Headings are the only trigger.

   14. INLINE SPAN VS LINK DISAMBIGUATION -- NORMATIVE (governs `link`
      and `inline_span` in PART 3). After scanning a bracketed run
      `[ inline_content ]`, the IMMEDIATELY FOLLOWING character selects
      the construct (single-character lookahead, no backtracking):
        - `(` -> inline_link  ([text](url))
        - `[` -> reference_link or collapsed_reference_link
                 ([text][ref] / [text][])
        - `{` -> inline_span  ([text]{attrs})
        - anything else (or end of input) -> the `[`...`]` is LITERAL
          text. Carve has no shortcut reference link: a bare `[label]`
          never resolves against a `[label]: url` definition.
      An inline_span renders as
            <span {attrs}>{inline-rendered content}</span>
      where {attrs} is the attribute block applied verbatim (id, classes,
      key=value), in the same form as any other attribute carrier
      (`#id` -> id, `.x` -> class, `k=v` -> key). The bracketed content
      is full inline content and is parsed recursively, so
      `[a /b/ c]{.x}` -> `<span class="x">a <em>b</em> c</span>`.
      Because the `{` test fires only when the attribute block directly
      abuts `]`, `[text] {.x}` (with a space) is literal `[text]`
      followed by a literal `{.x}` -- the span requires adjacency.
      EMPTY OR INVALID ATTRIBUTE BLOCK -- NORMATIVE. The directly-abutting
      `{...}` is classified by its content:
        - yields at least one attribute (`[text]{.c}`) -> span with those
          attributes.
        - empty or whitespace-only (`[text]{}`, `[text]{ }`) -> a valid EMPTY
          block: a bare `<span>text</span>`. Both reference impls agree, even
          though the bare `attribute_list` production nominally needs >= 1
          attribute; the empty block is a blessed exception so a
          default-attribute processor can target the span.
        - unrecognized non-empty content (`[text]{???}`, `[text]{=y=}`) -> NOT
          an attribute block: the `]` and the `{...}` render literally. The
          inner bracket content is still inline-parsed, so `[*x*]{???}` ->
          `[<strong>x</strong>]{???}`.
        - a name (id, class, or key) that is not a grammar `identifier` (the
          `identifier` production) is also unrecognized: a DIGIT-FIRST name (`{.123}`, `{#1}`,
          `{2=v}`) or a name with a non-identifier character (`{.a!b}`).
          ONE invalid name makes the WHOLE block not an attribute block, even
          when mixed with valid attributes (`{.ok .1}`), so the run stays
          literal. (Deliberate strictness beyond djot, which accepts
          digit-first identifiers / `class="123"`; see jgm/djot issue 399.)
          A digit/`-`/`_` AFTER the first character is valid (`{.a1}`).
      BOOLEAN (value-less) ATTRIBUTES -- NORMATIVE. A bare identifier with no
      value (`[text]{kbd}`, `{.note open}`, `{disabled}` on a block line) is a
      boolean attribute, rendered `name=""`. It is matched after the
      `key_value_attribute` form (so `k=v` stays a key/value) and mixes freely
      with id/class/key=value in source order; multiple are allowed; a
      digit-first bare word is not a valid name (the block stays literal, as for
      any digit-first identifier). All impls support it (a carve extension
      beyond canonical djot, matching djot-php; corpus 95-boolean-attributes).
      The boundary of "yields an attribute" still differs at one margin:
      carve-php additionally accepts colon-bearing keys/classes
      (`{xml:lang="en"}`, `{.sm:hover}`) as spans, where carve-js treats those
      as literal. Use a plain `.class` / `#id` / `key=value` / bare word for a
      portable span. (A `{% ... %}` block is NOT a comment in Carve -- comments
      are `%%` / `%%%` only, PART 9 -- and a `%`-bearing attribute block is an
      invalid attribute that renders literally in ALL impls.)

   15. BLOCK ATTRIBUTE LINES -- OPERATIONAL SEMANTICS (governs
      `block_attributes` in PART 2; matches djot). STATE: pending := the
      ordered list of collected attribute blocks not yet attached.
      A1 COLLECT. A `{…}` alone on a line (the `}` may arrive on a later
         line, A5) that parses as a valid attribute list appends to
         pending and renders NOTHING. It also interrupts an open paragraph
         (§10 I5) and NEVER attaches backward: "Para\n{.class}" ->
         <p>Para</p> (pending then dropped, A4); "Para\n{.class}\n\nNext"
         -> <p>Para</p> then <p class="class">Next</p>.
      A2 FLOAT FORWARD. pending survives blank lines and further attribute
         lines and attaches to the NEXT block element: `{#id}\n\nText` ->
         <p id="id">Text</p>.
      A3 MERGE (over all pending blocks, source order):
         * id -> last one wins;
         * key=value -> last value for a given key wins;
         * class -> ALL classes accumulate in source order, NO
           de-duplication ({.a .b} then {.b .c} -> class="a b b c",
           matching djot and carve-php).
         Worked djot-canonical example: {#id} {key=val} {.foo .bar}
         {key=val2} {.baz} {#id2} then "Okay" ->
         <p id="id2" key="val2" class="foo bar baz">Okay</p>.
      A4 DROP IF DANGLING. pending with no following block element (end of
         document) produces no output.
      A5 MULTI-LINE BLOCK. One block may wrap across lines ({#id\n .foo} is
         one block: id `id`, class `foo`). A BLANK line inside the braces
         ends the attempt -- the text is then literal, not a
         block_attributes. A quoted value containing a literal `}` spanning
         lines is unsupported by the reference impls (pathological).
      A6 NOT AN ATTRIBUTE LIST -> NOT AN ATTRIBUTE LINE. A `{…}` line whose
         braces do not parse as an attribute list is ordinary paragraph
         content under normal inline rules (a `{#a #}` line is an editorial
         comment, §22 family).
      A7 THE ONLY BLOCK CHANNEL. NO block construct takes a trailing `{…}`
         on its own line (djot-strict; headings, fences and `:::` openers
         included) -- the preceding block-attribute line is the sole
         block-level attribute channel. A trailing `{…}` on a heading line
         is literal inline content (PART 2, headings).
      Implementation status: php, js, and rs implement §15 in full. Pinned
      by corpus 84-block-attribute-lines.

   16. FOOTNOTES -- NORMATIVE (governs `footnote` / `footnote_definition`
      in PART 3). The reference form `[^label]` + `[^label]: body` and the
      INLINE form `^[content]` are implemented, and they are the only note
      forms: the proposed sidenote (`[>content]`) was dismissed as a
      presentation concern, so `[>` is unclaimed and `[>foo]` is literal
      text (docs/dismissed-syntax.md). Numbering and resolution are formalized
      as PART 9R rule R2; the rendering rules below stay normative here.
      - A `[^label]` reference with a matching definition is NUMBERED by
        document reference order (first referenced = 1); the same label
        referenced again reuses its number.
      - Definitions may appear anywhere (order-independent); the FIRST
        definition for a label wins. A body is the def line plus any
        following lines indented by >= 2 spaces (single blank lines
        allowed between chunks), parsed as blocks. A definition is
        document-level metadata: it is COLLECTED and RESOLVED even when it
        sits INSIDE A CONTAINER (a blockquote or a list item). The reference
        resolves to the one endnotes section and the container that held the
        definition renders EMPTY (the definition is not left behind as
        content). The corpus pins this (docs/examples.md "Footnote
        definition inside a container is collected").
      - A reference with NO matching definition renders as literal source
        text `[^label]` (a trailing `{attrs}` on an unresolved ref is not
        round-tripped -- a marginal mistyped-label edge). An UNREFERENCED
        definition is dropped.
      - Rendering (djot-compatible roles):
          reference -> <a id="{refId}" href="#fn{n}" role="doc-noteref"><sup>{n}</sup></a>
          endnotes  -> ONE <section role="doc-endnotes"> appended AFTER
                       all body content (outside heading <section>s),
                       containing <hr> then an <ol> of <li id="fn{n}">;
                       each note's last paragraph gets a trailing backlink
                       <a href="#{refId}" role="doc-backlink">↩</a>.
        First reference to note n has refId `fnref{n}`; a k-th (k>1)
        repeat uses `fnref{n}-{k}` and adds another backlink. The
        backlink glyph is the plain return arrow `↩` (Carve's choice;
        djot appends a variation selector).
      - INLINE FOOTNOTE `^[content]` (pandoc form; a deliberate carve
        extension -- canonical djot has no inline footnotes). A `^`
        IMMEDIATELY before `[` opens an anonymous note whose content is the
        balanced bracket span (escape- and code-span-aware close, same as link
        text). Properties:
          * Content is INLINE-only, parsed recursively with footnote
            recognition DISABLED inside it (no `^[…]` or `[^ref]` nested in a
            note, either direction).
          * Numbered in the SAME document-order sequence as reference notes
            (an inline note always takes a fresh anonymous number; it cannot
            be re-referenced). Renders the same noteref + an `<li id="fn{n}">`
            in the one endnotes section, content as one `<p>` + backlink.
          * A trailing `{attrs}` attaches to the noteref `<a>` (like a
            reference note).
          * Empty or whitespace-only (`^[]`, `^[ ]`) is literal; an unclosed
            `^[…` is literal.
        PRECEDENCE (PART 8): `^[` opens an inline note; a bare `^` anywhere
        else is literal text (there is no bare superscript). Consequences:
        `^[x]^` is a note then a literal `^`; `\^[x]` is literal; a `^` not
        immediately followed by `[` is plain text.
      - LIMITATION: a footnote (reference `[^1]` or inline `^[…]`) inside link
        text (`[t[^1]](u)`) or inside a heading later cloned by a `</#id>`
        crossref nests an <a> in an <a>; avoid footnotes in those positions.

   17. TIGHT vs LOOSE LISTS + CONTINUATION MARKER -- OPERATIONAL (governs
      `list` rendering and the continuation_marker productions in PART 2).
      L1 LOOSE(list) := some item is followed by a blank line before the
         next sibling marker, OR some item holds a blank-line-separated
         second PARAGRAPH; otherwise TIGHT. A tight item's paragraph
         renders WITHOUT `<p>` (<li>text</li>); a loose item's paragraphs
         ARE wrapped (<li><p>text</p></li>).
      L2 COMPACT LIST BLOCKS -- Carve deviation from djot/CommonMark. A
         blank line before an item's sub-BLOCK (sub-list, block quote,
         fenced code, fenced div, heading, table) does NOT loosen: the item
         stays tight, lead text inline, block attached. Only a genuine
         second PARAGRAPH (blank + indented plain prose), or a blank line
         BETWEEN items, loosens. The blank line is still required to START
         the block, so recognition and the uniformity principle are
         unchanged -- only the tight/loose RENDERING differs (djot renders
         these loose). Pinned by corpus compact-list-blocks.
      L3 CONTINUATION MARKER `+` -- Carve addition. A line whose only
         content is `+`, at the current container's MARKER COLUMN, attaches
         the FOLLOWING flush-left block to that container -- ONE block of
         ANY kind (paragraph, list, fenced code, table, block quote, div,
         ...), up to the next blank line, sibling marker, or a further `+`
         -- with no marker prefix or indentation, keeping the container
         tight. `+` is not a Carve bullet (§11 N1), so a lone `+` is
         unambiguous; outside any container a lone `+` is literal text.
         The marker only ATTACHES, never separates: a blank line is the
         sole way to END a container, so a `+` can never break the
         following block OUT of its container -- it adds it IN.
      L4 CONTAINERS THAT TAKE THE MARKER:
         - LIST ITEM (pinned by corpus list-continuation-marker): `+` at
           the item's marker column attaches the block to the item (for
           code/tables one would rather not indent). FIRST-BLOCK form: the
           marker may sit on the marker line itself (`- +`, a bare `+`) to
           open an item whose body is the following flush-left block(s)
           with no inline lead; `- + text` keeps `+ text` as literal text.
         - BLOCK QUOTE: `+` at column 0 immediately after a blockquote line
           attaches the following flush-left block to the quote -- the
           un-prefixed alternative to repeating `>`. "> quoted \n + \n
           - item" is a quote whose body is the paragraph "quoted" PLUS a
           real <ul> (not a lazily-folded paragraph, cf. §10 I6). A blank
           line after the quote still ends it.
      L5 NOT CONTINUATION CONTAINERS (a lone `+` keeps its plain meaning):
         - fenced containers (admonition / generic div / line block) --
           their bodies already hold flush-left blocks; a lone `+` inside
           is literal text;
         - TABLE -- a first-column `+` is the continuation ROW (§5 T6), a
           distinct construct;
         - LEAF blocks (heading, thematic break, code block, raw block,
           paragraph) hold no child blocks; a heading is a bounded title.
         - FOOTNOTE DEFINITION bodies compose by indentation today; a `+`
           attach form is possible future work, deliberately OUT OF SCOPE.

   18. MATH -- NORMATIVE (governs `math` in PART 3; djot form). Inline
      `$` + verbatim backtick span; display `$$` + verbatim backtick
      span. A `$` NOT immediately followed by a backtick run is literal
      (`$5` is currency); `\$` escapes a literal `$`. Rendering:
        inline  -> <span class="math inline">\(content\)</span>
        display -> <span class="math display">\[content\]</span>
      `content` is the verbatim span text, HTML-escaped (`&`,`<`,`>`).
      Display math alone on a line renders inside its own paragraph. A
      standalone display-math block carrying a trailing caption (PART 9 §4)
      is a numbered EQUATION: the math paragraph is wrapped in a figure and a
      `</#id>` to it resolves to "Equation N" (§19).
      Author `{attrs}` after the span merge onto the <span> (the base
      `math …` class is kept; see PART 10 §1).

   19. CROSSREF AUTO-TEXT + DEFAULT-ON / PROCESSOR FEATURES -- NORMATIVE
      (resolution formalized as PART 9R rules R4/R5; behavior detail here).
      - `</#id>` clones the target heading's full inline children, so a
        heading with markup (`# *Setup*`) yields a link that KEEPS the
        markup (`<a href="#…"><strong>Setup</strong> …</a>`).
      - A `</#id>` cross-reference resolves to exactly ONE LEVEL: it links
        to the target and adopts the target's text, FLATTENING any nested
        cross-reference in that text (a `</#…>` in the cloned heading text is
        not re-expanded). One-level resolution makes a self-reference
        (`# A </#a>`) and a mutual cycle (`# A </#b>` + `# B </#a>`) SAFE --
        each side resolves once, no infinite expansion. A normal, non-cyclic
        crossref still resolves. The corpus pins this (docs/examples.md
        "Cyclic cross-reference resolves to one level").
      - `</#id>` to a NUMBERED CAPTION (a figure / table / listing / equation
        whose caption carries a number placeholder and an `{#id}`) yields the
        caption's LABEL + NUMBER ("Figure 1"), markup preserved, NOT the
        caption prose. A `</#id>` to an id that is neither a heading nor a
        numbered caption stays unresolved (renders literal), as today.
      - Mentions (`@user`), tags (`#tag`) and smart typography are ON by
        default in the conformant core; a processor MAY disable them. The
        corpus pins the default-on behavior.
      - Includes (`{{ path }}`) are a PROCESSOR-LEVEL directive, NOT part
        of the core parser; a conformant core MAY leave `{{ … }}` literal.
        An implementing processor MUST treat includes as opt-in (off for
        untrusted input) and MUST resolve include paths only to files under
        the configured project root AFTER symlink resolution (rejecting
        `..`-traversal and absolute paths that escape the root). It MUST NOT
        fetch remote URLs unless an allowlist is explicitly configured, and
        MUST bound BOTH recursion depth AND total expanded byte size to
        prevent include-bomb amplification (a file that includes another N
        times, transitively). See PART 9 §25 for the general security model.

   20. RAW PASSTHROUGH -- NORMATIVE (governs `raw_inline` in PART 3 and the
      semantics of `raw_block` in PART 2). The BLOCK form (```=FORMAT
      fence, PART 2) emits its verbatim content UNESCAPED when FORMAT
      matches the output format and drops it otherwise; the inline form
      follows. Block and inline raw share the `=FORMAT` spelling (```=html
      / `…`{=html}); the former ```raw FORMAT keyword form was removed. A code span whose trailing attribute block
      is EXACTLY `{=format}` (a single `=`-prefixed format name, with no
      other classes/ids/keys) is raw passthrough: the verbatim span content
      is emitted UNESCAPED when `format` matches the output format (`html`),
      and DROPPED otherwise. The `{=format}` tag is consumed, not rendered.
      A code span with any OTHER trailing `{…}` is a generic attributed code
      span (PART 3 `code_span` note), not raw inline. The corpus pins this
      (tests/corpus/50-raw-inline):
        `<br>`{=html}   -> <br>            (emitted verbatim)
        `\foo`{=latex}  ->                (dropped in HTML output)

   21. TRAILING LINE COMMENTS -- NORMATIVE (governs `inline_comment` in PART 3).
      - A `%%` token is a trailing comment marker only when the character
        immediately before it is whitespace (space or tab) or it starts the
        inline run (beginning of paragraph content after leading whitespace).
      - When recognized, the `%%` and ALL remaining characters up to (but not
        including) the line break are consumed and produce no output.
      - `%%` is NEVER recognized inside a code span or raw inline (PART 9 §20);
        those contexts pass both `%` characters through verbatim.
      - `\%%` (escaped first percent) is literal -- `%%` is literal text, not a
        comment marker.
      - Without preceding whitespace (e.g. `50%%` or `a%%b`) the `%%` is
        literal -- percentages and doubled-percent tokens in prose are safe.
      - The comment does NOT cross a line break; a soft-wrapped continuation on
        the next line of the same paragraph is unaffected (corpus 46-comments-6).

   22. FORCED INTRAWORD EMPHASIS -- NORMATIVE (governs the forced_* productions
      in PART 3). The brace-pair family is the escape hatch for emphasis where a
      bare delimiter would otherwise stay literal (§9 word boundary).
      - FORMS. Each emphasis mark has a forced form that renders the SAME
        element as its bare delimiter:
          {/x/}   -> <em>x</em>        {*x*}   -> <strong>x</strong>
          {_x_}   -> <u>x</u>          {~x~}   -> <s>x</s>
          {^x^}   -> <sup>x</sup>      {,x,}   -> <sub>x</sub>
          {=x=}   -> <mark>x</mark>
      - NO WORD BOUNDARY. A forced span opens and closes regardless of the
        surrounding characters, so it emphasizes intraword: foo{*bar*}baz ->
        foo<strong>bar</strong>baz; my{_path_}name -> my<u>path</u>name. This
        is the ONLY way to emphasize intraword.
      - BOUNDS = THE BRACES. The closing `X}` ends the span; a bare delimiter
        of the same kind INSIDE is literal content, like the bare-span rule
        (§9): {/a/b/} -> <em>a/b</em>. Inner content otherwise parses normally,
        so cross-type nesting works: {/italic *bold*/} -> <em>italic
        <strong>bold</strong></em>.
      - `{~ … ~}` DISAMBIGUATION. The brace pair is editorial SUBSTITUTION when
        it contains a top-level `~>` ({~old~>new~} -> <del>old</del><ins>new
        </ins>), and forced STRIKETHROUGH otherwise ({~old~} ->
        <s>old</s>). The `~>` test is the discriminator.
      - `{= … =}` is forced highlight (<mark>), distinct from the raw-inline
        `{=format}` attribute on a code span (`` `x`{=html} ``), which has no
        trailing `=` before the `}`.
      - ESCAPING. A literal `{/` (etc.) is written `\{/` -- one backslash on the
        brace is enough; the inner delimiter need not be escaped.
      - ATTRIBUTES. A forced span MAY carry a trailing `[attributes]` block like
        any bare span: {*x*}{.c} -> <strong class="c">x</strong>.
      (corpus 01-emphasis-12 forced-intraword, 01-emphasis-13 forced-nesting.)

   23. LINE BLOCK (VERSE) -- NORMATIVE (governs `line_block` in PART 2). A
      `::: |` fenced block preserves the author's per-line layout.
      - TRIGGER. A bare pipe `|` TYPE TOKEN on the opener (`::: |`) selects this
        behavior. The token is a pipe, not a word. STRICT (djot): the opener
        carries no inline attributes, so the `::: {.line-block}` class form is an
        ordinary PARAGRAPH, not a line block -- the class alone does not trigger it.
      - WRAPPER. Renders as `<div class="line-block">` (a generic div, never an
        `<aside>`); extra classes / id attach via a PRECEDING block-attribute
        line, which floats onto the div (§15).
      - LINE BREAKS. Each soft line break inside a stanza becomes a HARD break
        (`<br>`). So consecutive non-blank body lines render as one paragraph
        with `<br>` between them.
      - STANZAS. A blank line ends a stanza; each stanza is its own `<p>` inside
        the single `<div class="line-block">`.
      - LEADING WHITESPACE. Each body line's leading whitespace is PRESERVED
        (unlike ordinary paragraph content, which is trimmed). In HTML each
        leading space serializes as a NBSP (`&nbsp;`, U+00A0), so indentation is
        visible with no external CSS (matching Pandoc). Plain-text / ANSI
        renderers emit the literal spaces. A tab in the leading run follows the
        tab-stop rule (PART 9 §24).
      - REFERENCE COLUMN. Leading whitespace is measured RELATIVE TO THE FENCE
        indent, not absolute column 0: a `line-block` nested in a list has the
        container's structural indent stripped first (the list dedents to its
        content column), so only the author's intra-verse indent survives. The
        `+` first-block form (fence at column 0) is the canonical way to nest a
        `line-block` without the container consuming indentation.
      - INLINE CONTENT parses normally (emphasis, links, code, ...); only the
        whitespace and soft-break handling differ from an ordinary div.
      (corpus 88-line-blocks*.)

      LOCAL HARD-BREAK BLOCK -- NORMATIVE (governs `local_hard_break_block` in
      PART 2). A `::: \` fenced block is the smaller local opt-in for visible
      line breaks.
      - TRIGGER. A bare backslash `\` TYPE TOKEN on the opener (`::: \`) selects
        this behavior. The token is not an English type word. Extra classes / id
        attach via a PRECEDING block-attribute line.
      - WRAPPER. Renders as `<div class="hardbreaks">` (a generic div, never an
        `<aside>`).
      - DIRECT PARAGRAPHS ONLY. Each soft break in a direct paragraph child of
        the block becomes a hard break (`<br>`). Nested blocks parse normally and
        keep normal soft-break behavior; the mode is not inherited.
      - NO WHITESPACE PRESERVATION. Unlike `::: |`, leading whitespace has no
        special visual-preservation rule. Ordinary paragraph parsing trims or
        folds it as usual.
      (corpus 88-line-blocks*.)

   24. TABS AND INDENTATION -- FORMALIZED COLUMN ARITHMETIC (governs
      `indent` in PART 2 and every indentation-sensitive comparison; the
      arithmetic PART 0's layout automaton runs on. Adjudicated with the
      tab-stop work, carve #99 / Rule B carve #103; shipped in all three
      implementations).
      C1 COLUMNS, NOT CHARACTERS. col advances by 1 for a space and to the
         next multiple of 4 for a tab (CommonMark tab stops; a leading tab
         indents to column 4). Outside indentation (code content, inline
         text) tabs are PRESERVED verbatim and never expanded (PART 2,
         TABS IN CODE).
      C2 MARKER SEPARATOR IS A SPACE. The required separator after a list
         marker is a single SPACE character -- a syntax delimiter, not
         indentation. `-<TAB>a` is NOT a list item (paragraph text).
      C3 CONTENT COLUMN. content_column(item) := the visual column where
         the item's content starts (marker width + separator: `- ` -> 2,
         `1. ` -> 3, `10. ` -> 4). A child -- a LIST MARKER (bullet `-`/`*`,
         task, or ordered, SYMMETRIC) opening a sublist, OR any OTHER block
         opener -- must reach the parent item's content_column to belong to the
         item. The block-opener set is UNIFORM and closed: block quote (`>`),
         heading (`#`), thematic break, fenced code, colon fence / admonition
         (`:::`), TABLE (a `|`-delimited row), and DEFINITION LIST (a `:: term`
         opener; the two-line `:: `/`:  ` marker is recognized by the same
         look-ahead the top level uses, and only the `:: ` TERM line opens the
         block). None of these is special-cased: a def-list or table interrupts
         at column 0 and nests at content_column exactly as a quote or heading
         does (carve#295). The content_column is
         the item body's COLUMN 0: a block opener is recognized there exactly
         as a block opener is recognized only at column 0 at the top level.
         This one rule holds whether or not a blank line precedes the child;
         the blank only decides tight vs loose (§17), and there is no relaxed
         `base_column + 2` channel.
           - BELOW content_column: the line is outside the item body. A list
             marker folds as lazy item text (markers never interrupt, §10 I2);
             every other line, likewise, never nests -- with no blank it
             lazily continues the open item paragraph (§10 I2), and after a
             blank it ends the item and parses at DOCUMENT level.
           - AT content_column: dedented to the body's column 0, a block
             opener nests and a list marker opens a sublist.
           - ABOVE content_column: the residual indent means the line is no
             longer a block opener -- exactly as ` # h` with a leading space is
             a paragraph, not a heading, at the top level -- so it folds in as
             lazy paragraph text.
         This is an INTENTIONAL divergence from djot, which attaches a block
         opener at any indent past the marker; carve requires the content
         column, consistent with its column-0 block-opener rule
         (docs/divergence-from-djot, carve#295).
      C4 RULE B (TOP LEVEL, ANY INDENT). A bullet (`-`/`*` + space +
         content) opens a list at ANY indentation at the TOP LEVEL -- no
         column-0 restriction (unlike CommonMark, `    - x` is a list, not
         indented code); an ordered marker behaves the same. Scope: WHERE a
         top-level list may open -- NOT sublist nesting (C3) and NOT
         interrupting an open paragraph (§10 I2).
      C5 DEDENT. Re-parsing nested content strips the container's
         structural indent BY COLUMNS. A tab straddling the dedent boundary
         is consumed WHOLE on non-marker lines (Carve has no
         indent-sensitive block below list nesting, so the residual columns
         never matter); on a LIST-MARKER line the straddling tab's
         unconsumed columns are re-inserted as spaces, so sibling markers
         aligned with mixed tab/space indentation keep the same visual
         column and stay siblings.

   25. SECURITY REQUIREMENTS -- NORMATIVE (HTML output target). These rules
      bind any implementation that emits HTML for UNTRUSTED input. They were
      previously stated only in docs/security.md (non-normative); they are
      hoisted here so an implementation written to this grammar + the corpus
      cannot be conformant yet a trivial XSS vector. The corpus pins
      representative cases (docs/examples.md "Security hardening").
      - URL SINK SCHEME DENYLIST. On every clickable URL sink -- link
        destination (`href`), image source (`src`), and autolink -- an HTML
        renderer MUST reject any URL whose scheme is `javascript`, `vbscript`,
        `data`, or `file`, emitting an EMPTY value (`href=""` / `src=""`).
        The denylist additionally covers the OS protocol-handler and
        command-execution class (CVE-2026-20841 class): `ms-msdt`, `ms-office`,
        `ms-word`, `ms-excel`, `ms-powerpoint`, `ms-access`, `ms-visio`,
        `ms-project`, `ms-publisher`, `ms-infopath`, `ms-spd`, `ms-search`,
        `search-ms`, `ms-cxh`, `ms-cxh-full`, `shell`, `vscode`,
        `vscode-insiders`, and `jar`. These route to an operating-system
        handler that can launch a binary or open a macro-bearing document, so
        they are blanked like `javascript:`. Ordinary web and contact schemes
        -- `http`, `https`, `mailto`, `tel`, `ftp`, `sms` -- remain allowed.
        Scheme detection MUST first strip leading ASCII control characters
        AND ALL UNICODE WHITESPACE -- not just ASCII space/tab/newline but
        every Unicode space separator, including NARROW NO-BREAK SPACE
        (U+202F), NBSP (U+00A0), the U+2000..U+200A spaces, MEDIUM
        MATHEMATICAL SPACE (U+205F), IDEOGRAPHIC SPACE (U+3000), the
        line/paragraph separators (U+2028 / U+2029), and the BOM (U+FEFF) --
        before matching the scheme, so an obfuscated `<U+202F>javascript:`
        destination cannot slip past the denylist. The corpus pins the
        U+202F case (docs/examples.md "Scheme probe strips Unicode
        whitespace"). A `data:` exception for images, if offered, MUST be
        opt-in. An attribute-block `href`/`src` override (PART 4) MUST NOT
        reintroduce a rejected scheme.

        THIS APPLIES TO EVERY TARGET THAT EMITS A RESOLVABLE URL, not only to
        the HTML renderer. A Markdown target's link and image destinations are
        resolved by whatever renders that Markdown, so a scheme blanked here
        and passed through there is not blocked -- it is deferred by one step.
        Measured before this was written down: all three engines blanked
        `ms-msdt:` in HTML and emitted it verbatim in Markdown, because each
        had mirrored the four-scheme core of the denylist into that target and
        not the OS-handler class. One also failed to strip Unicode whitespace
        in the Markdown probe, so `<U+202F>javascript:` survived there while
        being blanked in its own HTML output (carve#352, corpus 119).

        The rule was scoped to "an HTML renderer" because that is where the
        sink was first identified; the scoping was not a decision that other
        targets are safe. A target that emits no URL, such as plain text, has
        nothing to blank and is unaffected.
      - ATTRIBUTE NAME/VALUE HARDENING. On EVERY rendered element -- including
        elements emitted by extensions -- an HTML renderer MUST drop
        attributes whose name begins with `on` (case-insensitive) and the
        names `srcdoc` and `formaction`; MUST blank any attribute value whose
        scheme (control/space-stripped) is `javascript`/`vbscript`/`data`/
        `file`; and MUST blank any `style` value containing `expression(`,
        `url(`, `@import`, `behavior:`, or `-moz-binding` (whitespace
        collapsed first). This applies to author attribute blocks AND to
        attributes an extension copies onto its wrapper element: an extension
        MUST route authored attributes through the same hardening and MUST NOT
        emit them raw.
      - RAW PASSTHROUGH OPT-OUT. Raw blocks (```=FORMAT, PART 2) and raw
        inline (`{=format}`, §20) emit verbatim UNESCAPED content. An
        implementation MUST provide a mode in which raw passthrough is NOT
        emitted and is instead escaped as literal text. Implementations SHOULD
        make safe handling the default for untrusted input; a processor that
        emits raw passthrough by default MUST document that its input is
        trusted.
      - URL TEMPLATE ENCODING. Any value taken from document content (mention
        name, tag, citation key, crossref id) and substituted into a
        configured URL template MUST be percent-encoded before substitution,
        and the resulting URL MUST then pass the scheme denylist above.
      - MATH + CODE are element TEXT, not raw HTML: their verbatim content MUST
        be HTML-escaped (`&`,`<`,`>`) before wrapping (§18, PART 10 §6). They
        are NOT a raw-HTML bypass.
      - FRONTMATTER MUST be parsed with a SAFE loader -- no arbitrary object
        instantiation and no custom-tag deserialization (e.g. no YAML
        `!!`-tag object construction). Any frontmatter value later rendered
        into output MUST be escaped per the output target's rules.
      - RESOURCE BOUNDS (DoS). An implementation MUST parse and render any
        input in time and space LINEAR in input size (within a constant
        factor). It MUST enforce a FINITE nesting-depth cap for block and
        inline containers, beyond which further openers degrade to literal
        text rather than recursing. The cap is a single constant
        MAX_NESTING_DEPTH applied UNIFORMLY to blockquote, list, and
        fenced-div / admonition nesting (and to footnote-body and the inline
        recursion): the reference uses MAX_NESTING_DEPTH = 200, and a
        conformant implementation SHOULD use the same value (it MAY choose a
        larger fixed cap, but never an unbounded one). Past the cap every
        container kind FLATTENS the same way -- the opener becomes literal
        paragraph text -- rather than crashing. The same ceiling MUST apply
        to recursive RENDER / RESOLVE / FILTER passes over a
        programmatically constructed tree, not only the parse path. It MUST
        bound abbreviation, reference,
        footnote, and crossref expansion so total work stays O(n): in
        particular an empty abbreviation term is rejected, a reference is not
        resolved by rescanning the whole output per occurrence, and a crossref
        target is not cloned without bound.
      - NON-HTML TARGETS. The Markdown, plain-text, and terminal (ANSI)
        renderers MUST NOT become injection vectors either: text, code, math,
        AND URL values MUST be stripped of control characters before emission
        to a terminal, and a Markdown link/image title MUST escape the `"`
        that would otherwise terminate it.

   26. TROJAN-SOURCE / INVISIBLE-UNICODE HARDENING -- NORMATIVE (CVE-2021-42574
      class). Visually deceptive or invisible Unicode MUST NOT produce
      diverging heading ids nor smuggle a live control character into the
      output. The corpus pins representative cases (docs/examples.md
      "Trojan-Source: heading ids ...", "Trojan-Source: rendered text and
      code ...").
      - HEADING IDS are NFC-normalized and stripped of bidi-override /
        isolate controls (U+202A..U+202E and U+2066..U+2069) and zero-width
        characters (U+200B, U+200C, U+200D, U+2060, U+FEFF, U+00AD) before
        slugging. Consequences: a precomposed `é` (U+00E9) and a decomposed
        `e` + U+0301 yield the SAME id (NFC); a heading containing U+202E and
        U+200B yields an id with NEITHER codepoint.
      - RENDERED TEXT and CODE-SPAN / CODE-BLOCK content strip the
        bidi-override / isolate controls (U+202A..U+202E, U+2066..U+2069):
        these are DOM-inert, and entity-encoding one would let it DECODE BACK
        to the raw control downstream, so they are REMOVED, not escaped. A
        control in a code span is likewise stripped, NOT entity-encoded.
      - PRESERVED in rendered text: the directional MARKS LRM (U+200E) and
        RLM (U+200F), and the zero-width characters (U+200B etc.) -- these are
        stripped from IDS (above) but kept in TEXT, since they can be
        legitimate content and are not an override/isolate injection. (So
        `# A<U+202E>B<U+200B>C` renders text `AB<U+200B>C` -- override gone,
        zero-width kept -- with id `ABC`.)

   27. INLINE LITERAL -- NORMATIVE (governs `literal_inline` in PART 3).
      A `!` PREFIX immediately before a verbatim code span is an INLINE
      LITERAL, mirroring how `$` before a code span is inline math (§18). It
      exists so notation that collides with the bare emphasis delimiters --
      phonemic transcription `/kaet/`, glob patterns, paths -- can be written
      without per-character escaping.
      - CONTENT is captured VERBATIM by the backtick run exactly as for a code
        span: no inline construct inside it is recognized, and SMART TYPOGRAPHY
        (§8) does NOT apply -- `--`, `...` and quotes stay as authored.
      - CONTENT is HTML-ESCAPED on output. This is the opposite of raw
        passthrough (§20), which emits its content unescaped.
      - It is EMITTED BY EVERY RENDERER and is NEVER dropped or target-routed.
        Again the opposite of §20, where a non-matching format drops the span.
      - The `<code>` WRAPPER IS DROPPED: an inline literal is prose, not code.
      - ELEMENT. The TRAILING `{…}` is the ORDINARY code-span attribute block,
        not a special form. With no attribute block the content is emitted as
        BARE escaped text, with no element at all. When an attribute block is
        present (id / class / key=value / boolean) a `<span>` is emitted
        carrying it, in the §14 render order. So the element appears exactly
        when an attribute needs somewhere to live.
      - The `!` sigil is CONSUMED, never rendered, and binds to a FOLLOWING
        BACKTICK RUN only. It is tried before `image` (`![`) and before a bare
        literal `!` in the inline order, so `!` before `[` still opens an image
        and `!` anywhere else stays literal text. A literal `!` immediately
        before a backtick run is therefore written `\!` -- the single case this
        construct reinterprets. (Chosen over the earlier trailing-`{!}` sigil:
        the `!` prefix matches the `$`-math / `!`-image family, keeps the
        attribute block ordinary, and reads as "not code". See #280.)
      - There is NO block form. A fenced code block already carries its content
        verbatim; the inline case is the one the notation problem needs, and a
        block spelling is deliberately left undefined rather than invented.
      Examples (corpus-pinned):
        !`/kaet/`        -> /kaet/
        !`/kaet/`{.ipa}  -> <span class="ipa">/kaet/</span>
        !`a<b>`          -> a&lt;b&gt;

   28. BLOCK COMMENT FENCES -- NORMATIVE (governs `comment_block`,
      `comment_block_open` and `comment_block_close` in PART 2). A `%%%` fence
      line is DELIMITER PLUS INSIGNIFICANT TAIL: the LEADING RUN of 3+ `%` is
      the delimiter, and ANY remaining text on the line is IGNORED. So
      `%%% TODO`, `%%% notes` and `%%%html` all open a block comment, and
      `%%% end` closes one. No separating space is required.
      - `%%%` HAS NO INFO STRING. A raw passthrough block (§20) is a CODE
        fence with an `=FORMAT` info string (```` ```=html ````), never a
        percent fence, so `%%% html` is a COMMENT and its body stays hidden.
      - The CLOSER matches on EXACT delimiter length (§2), so a longer opener
        nests shorter fences and a too-short line is content, not a closer.
      - A `%%%` opener with NO MATCHING CLOSER AHEAD does NOT open a block.
        The line degrades to a `comment_line` (its first two `%` are the line
        marker), so every FOLLOWING BLOCK STILL RENDERS. This mirrors the
        `:::` rule (§12: "a bare opener with no matching closer ahead is
        literal text") and exists for the same reason: an unterminated opener
        must not swallow the rest of the document. An implementation that
        surfaces diagnostics SHOULD report an unterminated comment fence;
        emitting one is OPTIONAL, since not every implementation has a
        warning channel.
      - TAIL DISPOSITION (serialization, PART 10): the OPENER's tail is kept as
        the comment body's FIRST LINE, so `carve fmt` round-trips the words
        rather than deleting them; the CLOSER's tail is DISCARDED, as a code
        fence's closing info is. Neither is ever rendered.
      Examples (corpus-pinned):
        "%%% html\nsecret\n%%%\n\nafter"  -> <p>after</p>
        "%%%\nsecret\n%%% end\n\nafter"   -> <p>after</p>
        "%%% TODO\nsecret\n\nafter"       -> <p>secret</p><p>after</p>
*)


(* ============================================================================
   PART 9R: RESOLUTION SEMANTICS (NORMATIVE) -- whole-document passes
   ============================================================================

   The two-pass model (PART 8 "first pass"): PASS 1 collects definitions
   while blocks are parsed; PASS 2 resolves inline uses. A definition may
   appear before or after its use. Complexity is O(n) -- no rescanning per
   occurrence (PART 9 §25 RESOURCE BOUNDS). Definitions are collected even
   inside containers (blockquote / list item); the container then renders
   without them (PART 9 §16; corpus "Footnote definition inside a container
   is collected"). These rules are attribute-grammar / two-pass semantics --
   deliberately NOT grammar productions: symbol-table resolution is provably
   outside context-free syntax, and stating it as rules over declared state
   keeps it machine-checkable without pretending it is EBNF.

   STATE (symbol tables, PASS 1):
     linkDefs     : label -> (url, title?, attrs?)   -- LAST definition wins
     footnoteDefs : label -> body                    -- FIRST definition wins
     abbrDefs     : term -> expansion                -- non-empty term (§25)
     headingIds   : slug -> heading node             -- case-preserving; the
                    PART 2 HEADING IDENTIFIERS rules: explicit {#id} ids are
                    reserved first in document order, then auto-slugs, with
                    -2/-3/... collision dedup
     captionIds   : id -> (kind in {figure, table, listing, equation},
                    number)                          -- from R5
     footnoteSeq  : ONE shared document-order counter for both footnote
                    forms

   RULES (PASS 2):
   R1 LINK REFERENCES. [text][label] and [text][] (label := text) resolve
      against linkDefs. Matching is EXACT: case-sensitive, no whitespace
      folding (corpus 73-reference-labels-are-case-sensitive). Definition
      attributes transfer to the link; link attributes override per key.
      Unresolved -> the bracketed run renders as literal source text. Carve
      has NO shortcut reference: a bare [label] never resolves (PART 9 §14).
   R2 FOOTNOTES (rendering in PART 9 §16). A [^label] use with a matching
      footnoteDefs entry is numbered by FIRST-REFERENCE order from
      footnoteSeq; repeat references reuse the number (k-th repeat gets
      refId fnref{n}-{k} and an extra backlink). An inline ^[content] note
      draws a fresh anonymous number from the SAME footnoteSeq (it cannot
      be re-referenced). Unresolved reference -> literal source text;
      unreferenced definition -> dropped.
   R3 ABBREVIATIONS. Each abbrDefs term matches in rendered text at WORD
      BOUNDARIES only and emits <abbr>. Expansion is bounded per PART 9 §25
      (an empty term is rejected; total work stays O(n)).
   R4 CROSSREFS (behavior detail in PART 9 §19). </#id> folds
      case-insensitively and matches against the case-preserved headingIds
      (an ASCII spelling of a non-ASCII id does NOT resolve; corpus
      19-heading-ids), then against captionIds for numbered captions (the
      link text is then LABEL + NUMBER, e.g. "Figure 1", markup preserved).
      Resolution is ONE LEVEL: cloned heading text is not re-expanded, so
      self-references and cycles are safe. Unresolved -> literal.
   R5 NUMBERED CAPTIONS. Per kind, in document order, each caption whose
      top-level text carries a bare `#` placeholder receives the next
      number of its kind (PART 2, CAPTION NUMBER PLACEHOLDER); an {#id} on
      the caption registers it in captionIds for R4.
   R6 SECTION WRAPPING consumes headingIds after dedup -- the stateful
      emission algorithm in PART 9 §13.
   ============================================================================ *)

(* ============================================================================
   PART 10: HTML SERIALIZATION CONVENTIONS (NORMATIVE)
   ============================================================================

   The corpus pins exact bytes against the reference (carve-js); these
   conventions state the rules a SECOND implementation (e.g. carve-php)
   must follow to match WITHOUT copying carve-js output. On any
   disagreement the corpus wins.

   1. ATTRIBUTE ORDER. Attributes serialize in the AUTHOR's source order
      (the recorded `order`; see the RENDER ORDER rule in PART 4 and
      PART 9 §12-style note). An Attrs built programmatically with no
      recorded order falls back to a fixed `class`, then `id`, then
      key=value order. All classes accumulate (space-joined) into one
      `class` slot at its first-appearance position. An element with no
      effective attributes emits none. (Heading ids move to <section>,
      §13.) A mandatory base class (math `math inline|display`) is
      prepended to that class slot.

   2. ATTRIBUTE QUOTING + ESCAPING. Attribute values are double-quoted.
      In an attribute value escape `&`->&amp;, `<`->&lt;, `>`->&gt;,
      `"`->&quot;, `'`->&apos;. In text/content escape `&`,`<`,`>` (NOT
      quotes). No other entities are produced. This describes only the
      QUOTING of values that survive filtering: dangerous attribute names,
      values, and URL schemes are dropped/blanked FIRST per PART 9 §25.

   3. VOID ELEMENTS are written WITHOUT a trailing slash: `<hr>`,
      `<img …>`, `<br>`. A hard break serializes as `<br>` + newline.

   4. BLOCK WHITESPACE. Block elements are newline-separated, one per
      line. Nested block structures (list, blockquote, table, admonition,
      div, figure, section, endnotes) indent their children by TWO spaces
      per nesting level. Inline content stays flat on its block's line.

   5. PARAGRAPH WRAPPING follows the tight/loose rule (§17): a tight list
      item's text has no <p>; a loose item's paragraphs are wrapped.

   6. CODE. A fenced code block is `<pre><code[ class="language-X"]>` +
      escaped content + a trailing newline + `</code></pre>`. An inline
      code span is `<code>` + escaped content + `</code>`.

   7. TABLES. Alignment renders as `style="text-align: VALUE;"` (one
      space after the colon, trailing semicolon); a cell with no effective
      alignment emits NO style attribute. `rowspan`/`colspan` are emitted
      only when their count is > 1.

   8. STABLE STRINGS. Mentions and tags render as inert `<span>`s (no link)
      when no URL template is configured; a configured template links to them. The footnote
      backlink glyph is `↩` (§16). Math wraps content in `\(…\)` /
      `\[…\]` inside `<span class="math inline|display">` (§18).

   ============================================================================
   PART 11: CANONICAL SOURCE WRITER (NORMATIVE)
   ============================================================================

   The canonical writer (`carve fmt`, the `carve` render target) serializes a
   document back to Carve source. Nothing here specified it before, so its
   behavior was defined only by three implementations happening to agree.

   1. THE INVARIANTS. For any document x:

        parse(fmt(x)) == parse(x)        -- meaning is preserved
        fmt(fmt(x))   == fmt(x)          -- the writer is idempotent

      The first is about the AST, not the bytes: `fmt` MAY normalize the
      author's spelling (indentation, marker alignment, escape form) but MUST
      NOT change what the document says. The second forbids growth -- a value
      that gains an escape per pass violates it, which is how over-escaping is
      usually first noticed.

      EQUALITY IS MODULO ESCAPING -- NORMATIVE. `escaped_text` and `text`
      compare EQUAL, and an adjacent run of them compares as the single text
      node holding the same characters in the same order. `a\-b` and `a-b`
      are the same document for the purpose of this invariant, and differ only
      in what the author wrote to get there.

      Without that clause the first invariant would be unattainable by
      construction, not merely unmet. §5 requires the writer to escape `"` and
      `'` UNCONDITIONALLY: a bare quote reaching the writer as text would
      otherwise re-derive as smart punctuation (PART 9 §8) and change the
      document. So a text node holding a quote MUST come back carrying an
      escape, which parses to `escaped_text`. Read strictly, §1 and §5 would
      contradict each other for every document containing a quote.

      This is also the comparison §4 already performs internally: the two
      renders it weighs differ precisely in their escaping, so telling them
      apart BY the escaping would escalate every document. The invariant and
      the strategy therefore ask the same question -- does dropping the escapes
      change anything ELSE?

      What escaping still MUST preserve is the rendered result: the escape is
      authoring syntax, so `to_html(fmt(x)) == to_html(x)` admits no such
      latitude, and neither does idempotence.

      KNOWN GAP -- the first invariant is not met by every implementation on
      every input. The four constructs first recorded here (a table carrying a
      colspan, a doubled alignment marker, a list-item attribute form and a
      line-block shape) are fixed; a corpus-wide sweep finds others, tracked in
      carve#369. Nothing caught any of them because every existing check
      compares rendered HTML, which is equal in all of those cases. It is
      stated rather than left implied: an implementation that satisfies §2 can
      still fail §1, and §4's comparison is built to be unaffected by that.

   2. THE ESCAPING RULE -- NORMATIVE. A character is escaped IF AND ONLY IF
      omitting the escape would change the re-parsed AST.

      Escaping MORE than that is a defect, not a safe default. It is how a
      formatter turns `50% faster: yes (ok)` into `50\% faster\: yes \(ok\)`,
      which re-parses identically -- so nothing there needed escaping.

      Escaping LESS is a correctness bug: the document changes meaning.

      THE UNIT IS THE OPENER, NOT THE CHARACTER. Where a construct opens on a
      RUN of characters, the whole run is escaped. `--` opens an en dash
      (PART 9 §8), and `\--` suppresses it just as `\-\-` does, so a
      character-at-a-time reading of the rule would call the second escape
      unnecessary and require `\--`. It does not: the escaped form of a
      suppressed opener is the whole opener escaped. A half-escaped run is a
      shape that happens to work rather than one that says what it means, it
      breaks as soon as the surrounding text shifts what the run abuts, and
      `\.\.\.` versus `\...` is the same question with the same answer.

      So the escape decision is taken per OPENER OCCURRENCE: would omitting the
      escapes on this occurrence let the construct form? Yes -> escape the run.
      No -> emit it bare.

   3. WHY A STATIC CHARACTER TABLE CANNOT IMPLEMENT §2. Whether a character
      is significant depends on the line, not on the character:

        `[`  literal alone, an opener in `[a](b)`
        `(`  literal in `[a] (b)`, an opener in `[a](b)`
        `^`  literal at column 0 before a space, an opener in `^[note]`
        `-`  literal mid-word, a list marker at column 0, an en dash doubled
        `_`  literal alone, an opener only when a closer follows on the line

      A writer that decides per character with no context must therefore
      over-escape to stay correct, which is the defect §2 forbids.

   4. A PERMITTED STRATEGY -- NOT THE PINNED OUTPUT. The output is pinned by
      §2 and by §2 alone: a character is escaped if and only if omitting the
      escape would change the re-parsed AST. This section describes ONE way to
      compute that, which an implementation MAY use and is NOT required to.
      Where the two differ, §2 wins.

      That ordering was reversed until carve#374. This section used to pin the
      output, and it yields a form §2 forbids: for a document containing one
      genuinely-escaped candidate character, the procedure below escapes EVERY
      candidate in the document, including characters whose bareness changes
      nothing. carve-php implemented this section and emitted
      `\*x\* and \(b\)`; carve-rs and carve-js implemented §2 and emitted
      `\*x\* and (b)`. Both were reading the spec correctly, which is how the
      contradiction surfaced.

      §2 is the rule worth keeping, because it is the one an author can see the
      difference in, and because §3's argument does not lead here: it shows a
      static character TABLE cannot implement §2, not that per-character
      decisions are impossible. A writer that knows whether a construct would
      START at a given position decides exactly, and two engines do.

      The procedure, then, as a strategy:

      W1 Render the document twice: the MINIMAL form, escaping only the
         UNCONDITIONAL set (§5), and the CONSERVATIVE form, additionally
         escaping every CANDIDATE-set character (§5).
      W2 If the two are byte-identical, emit either.
      W3 Otherwise parse BOTH and compare the resulting documents, ignoring
         source positions and key order.
      W4 Equal -> emit the minimal form. Different -> emit the conservative
         form.

      Two forms and one comparison, with the choice made by the parser rather
      than by a table -- so the writer cannot drift from the grammar as the
      grammar grows. What it buys is safety without a table; what it costs is
      precision, and §2 is where the precision is required.

      WHY THIS SCOPE, FOR AN IMPLEMENTATION THAT USES IT. An earlier draft
      decided per line. That cannot be implemented: a line re-parsed on its own
      has lost the document's link-reference and footnote definitions, so a
      paragraph carrying `[text][ref]` comes back with an empty destination and
      reports a difference escaping never caused. Any scope smaller than the
      document has the same defect for the same reason.

      So an implementation that computes §2 by COMPARING TWO RENDERS is stuck
      with document scope, and therefore with over-escaping. That is the trade
      it accepts, and it is why this is a strategy rather than the requirement:
      an implementation that instead asks, per position, whether a construct
      would start there needs no comparison and no scope at all.

      WHY THE TWO RENDERS ARE COMPARED WITH EACH OTHER, and not the minimal
      render with the document being written. The writer is not required to be
      AST-faithful for every construct, and in practice is not: a table with a
      colspan, a doubled alignment marker, some list-item attribute forms and
      one line-block shape all re-parse to a DIFFERENT document while rendering
      identical HTML. Comparing against the source document would inherit those
      defects, flip the escaping decision between passes and break §1's
      idempotence for a reason that has nothing to do with escaping. Comparing
      the two renders isolates the only question this decision asks: does
      dropping the candidate escapes change anything?

   5. THE TWO SETS.

      UNCONDITIONAL -- always escaped in a text node, never re-derivable:
        the backslash itself; the backtick (opens a code span in any
        position); and `"` / `'`. A BARE quote re-derives as smart
        punctuation (PART 9 §8), so a quote that reached the writer as TEXT
        is one the author escaped, and it stays escaped. A quote carried by a
        `smart_punctuation` node is NOT text: the writer emits that node's
        source run BARE, which is what makes `fmt` reproduce the author's
        spelling.

        The CARET is also unconditional, for a different reason: it opens
        nothing on its own, but its escape carries information the AST records
        separately. A text node whose LEADING caret came from an escape is
        marked as such, so an image followed by a caret line stays a paragraph
        instead of being promoted to a figure. Comparing that mark in W3 would
        escalate every document whose text begins with a caret; ignoring it
        would silently turn the image case into a figure. Escaping the caret in
        BOTH forms keeps the two renders identical on that point, at the cost
        of one escape on a character that is rare in prose.

      CANDIDATE -- escaped only when W4 selects the conservative form:
        `* _ ~ / = ,`        emphasis and the braced pair delimiters
        `[ ] ( ) { } <`      links, images, spans, attributes, autolinks
        `@ #`                mentions and tags (§7 boundary rules); `#` also
                             opens a heading at column 0
        `> + - . ) : | % !`  block openers at column 0, plus the inline forms
                             `%%` and `::` (the inline-footnote `^[` is covered
                             by the unconditional caret above)
        `$`                  math

      A character in neither set is never escaped.

      The set must cover EVERY construct opener, or W4 has nothing to fall
      back to and the document is unrecoverable: `\@user` would emit as
      `@user`, W3 would correctly see a `mention` where the source had text,
      and W4 could not fix it. When a new opener is added to the grammar, its
      character belongs here.

   6. WHAT IS NOT THE WRITER'S JOB. `fmt` does not re-wrap prose, reorder
      attributes, or respell a construct to a synonym (`*` vs `-` bullets,
      fence length, ordered-list dialect). Those are the author's choices and
      the AST records them; changing one would satisfy §1 while still
      surprising the author.

      NESTED BOLD ITALIC joins that list, and needed the AST to catch up before
      it could. `/*x*/` is a single production (`bold_italic`, PART 3) while
      `*/x/*` is ordinary nesting, and BOTH yield the same `strong` wrapping
      `emphasis`. So §1 held for either spelling and nothing pinned which one to
      emit: carve-rs reproduced the author's, carve-js and carve-php normalized
      to `*/x/*` (carve#375).

      The AST therefore RECORDS the combined form -- a `strong` parsed from
      `bold_italic` carries `boldItalic` (PART 12 §3) -- and the writer
      reproduces it. Without that mark the rule could not be stated as an
      author's choice at all, because there would be nothing to preserve.

      `/*x*/` is also the spelling Carve teaches (docs/cheatsheet.md,
      docs/migrate-from-markdown.md), so normalizing it away rewrote the
      documented form into one documented nowhere. That is the concrete harm
      behind the general rule.

   7. NO WHITESPACE-ONLY LINE -- NORMATIVE. `fmt` never emits a line whose only
      content is ASCII spaces or tabs. Such a line is emitted EMPTY.

      EXCEPT where those spaces are VERBATIM CONTENT. Inside a code block or a
      raw block the lines are data, and a line of three spaces renders as three
      spaces:

        ```           ->  <pre><code>a
        a                 [three spaces]
        [3 spaces]        b
        b                 </code></pre>
        ```

      so emptying it would change the document. The rule there applies to the
      STRUCTURAL INDENT only: a verbatim line sitting inside a list item carries
      the item's content-column indent, and when the verbatim content on that
      line is EMPTY the indent alone is what remains -- that is layout, and it is
      omitted. When the verbatim content is itself whitespace, indent and content
      are both reproduced, and the resulting line is whitespace-only by
      necessity.

      The case this decides is a blank line INSIDE a list item, where the item's
      continuation is indented. Indenting the blank line to the content column
      keeps the block structure visible in the source, and carve-rs and
      carve-php did that; carve-js emitted an empty line (carve#375).

      The empty line is required. A whitespace-only line is not stable: editors
      that strip trailing whitespace on save, `git apply --whitespace=fix`, and
      CI whitespace checks all rewrite it, so a formatter emitting one produces
      output that ordinary tooling changes behind it -- and `fmt` then reports a
      diff on a file nobody edited. A formatter whose output cannot survive the
      tools it will be stored under is not idempotent in practice, whatever §1
      says about it in isolation.

      WHAT THIS DOES NOT SAY, and said wrongly for one release of this text: it
      does NOT extend to whitespace at the end of a line that HAS content. That
      whitespace can be document content, and §1 outranks tidiness:

        `a` + SPACE + newline + `b`   renders  `<p>a \nb</p>`
        `a` + newline + `b`           renders  `<p>a\nb</p>`

      The space survives into the output, so stripping it breaks
      `to_html(fmt(x)) == to_html(x)`. carve-rs already limited its stripping to
      lines that END a block for this reason (carve#359); a writer that strips
      before a soft break corrupts the document. Where the PARSER discards
      trailing whitespace the writer may too, and nowhere else.

      A trailing NO-BREAK space is likewise content, not layout -- the author
      wrote it and it renders as `&nbsp;` -- so an implementation whose
      whitespace test treats U+00A0 as whitespace must exclude it here. Corpus
      case 139 pins that.

   8. THE MARKDOWN TARGET'S ESCAPING -- NORMATIVE. The Markdown renderer is not
      the canonical writer and does not follow §2. It has a different
      re-parser to answer to, so it escapes on a different rule:

      M1 Markdown metacharacters in text are escaped, unconditionally.
      M2 An `escaped_text` node is emitted AS AN ESCAPE, whatever the
         character. The author escaped it; the escape carries intent the
         character alone does not.
      M3 Nothing else is escaped. A character the author did not escape and
         Markdown does not read as markup is emitted bare.

      M2 is the rule that needs stating. `\-\-` was written precisely so a
      processor with smart punctuation on (SmartyPants, markdown-it with
      typographer, pandoc's default) would NOT read an en dash. Emitting `--`
      bare loses the author's intent exactly where it was explicit, and the
      characters this matters for -- `"` `'` `-` `.` -- are not Markdown
      metacharacters, so M1 does not cover them. This was divergent across
      implementations with nothing to notice it (carve#350).

      A document that escapes nothing gains no backslashes from M2, so the
      cost falls only on documents that asked for it.

      An implementation whose AST has no `escaped_text` node cannot honour M2:
      it cannot tell an escaped character from a literal one. That is a defect
      in the AST, not licence to skip the rule -- `escaped_text` is in the
      inline vocabulary (docs/profiles.md).

   9. THE MARKDOWN TARGET'S HARD BREAK -- NORMATIVE. A `hard_break` is emitted
      as a BACKSLASH before the newline, never as two trailing spaces.

      Both spellings mean the same thing to a CommonMark reader:

        `a` SPACE SPACE newline `b`   ->  <p>a<br />b</p>
        `a` BACKSLASH newline `b`     ->  <p>a<br />b</p>

      The difference is what survives handling. Trailing whitespace is removed by
      editors that strip it on save, by `git apply --whitespace=fix`, and by CI
      whitespace checks -- and removing ONE of the two spaces is enough:

        `a` SPACE newline `b`         ->  <p>a b</p>

      so the break does not degrade, it VANISHES, silently, in a file nobody
      edited. The backslash form cannot be damaged that way, and it is the only
      hard-break spelling this target emits that ordinary tooling will not touch.

      The cost is pre-CommonMark processors, which show a literal backslash
      instead of a break. That is a visible artifact in a rare consumer, against
      silent data loss in a common one.

      This is a property of the MARKDOWN target only. The canonical writer spells
      a hard break the way the Carve grammar does (`hard_break`, PART 3), and
      §7's no-whitespace-only-line rule already covers the writer's side.

  10. A LIST ITEM'S CONTINUATION LINES ARE ALIGNED -- NORMATIVE. Every
      continuation line of a list item is indented to the WIDTH OF THE MARKER,
      including one whose text would otherwise read as a list marker. Such a
      line is kept literal by ESCAPING it, per §2, not by indenting it less:

        1. outer
           1\. inner

      The alternative -- leaving the line at an indent below the nesting
      threshold so no escape is needed -- looks appealing because it removes an
      escape the author did not write. It is rejected because its correctness
      depends on the WIDTH of the marker above it. Two spaces sit below the
      threshold under `1. ` and reach it under `- `, so the same rule produces
      a continuation in one list and a nested list in the other, silently
      changing what the document says (carve#352).

      This does not conflict with §2. §2 governs whether a character is escaped
      GIVEN the line the writer emits; §10 fixes the line. A writer may not
      manufacture a layout whose only merit is dodging an escape.

  11. THE MARKDOWN TARGET'S CROSS-REFERENCES -- NORMATIVE. A resolved
      cross-reference is emitted as a Markdown link to the target heading's id,
      and the target heading gains an explicit `{#id}` suffix. Both halves are
      required, and neither is emitted without the other:

        # H                       ->  # H {#H}
        See </#h>.                ->  See [H](#H).

      A heading NOTHING references gains no suffix. Markdown has no portable
      heading-id syntax, so an id every processor must be told about is noise
      unless something links to it.

      THE ID IS THE ONE THE DOCUMENT ASSIGNED, not a fresh slug of the heading
      text. The two differ whenever the core disambiguated a repeat (`Setup`
      and `Setup-2`), and a renderer that re-derives the slug does not merely
      mislabel the heading: it fails to recognize the reference as pointing at
      a known heading at all, so it drops the `{#id}` AND degrades the link to
      bare text -- one cause presenting as two unrelated symptoms.

      A CROSS-REFERENCE CONTRIBUTES NOTHING TO A HEADING'S SLUG. By the time
      this target runs, resolution has replaced `</#a>` with a link carrying
      the target heading's text, so counting it slugs `# A </#a>` as `A-A`.

      THE SCAN INCLUDES FOOTNOTE DEFINITION BODIES. They render as block
      content, so a reference inside one is a reference; missing them leaves a
      heading without the id that the link this target also emits depends on
      (carve#352).

   ============================================================================
   PART 12: AST SERIALIZATION (NORMATIVE)
   ============================================================================

   A parsed document is exchangeable: an implementation MAY serialize its AST
   to JSON, and a consumer written against one implementation MUST be able to
   read another's output. Nothing specified this before, and the engines'
   INTERNAL field names already differ for the same node - one calls a link's
   destination `href`, another `destination` - so three dialects were the
   default outcome rather than a risk.

   1. THE SHAPE IS carve-js's. It is the reference implementation, its AST is
      already plain data, and the one serializer in the wild (carve-rb, over
      carve-rs's tree) independently arrived at the same field names. An
      implementation whose internals differ MAPS on the way out; it does not
      export its internals.

      THE TYPE NAME COMES FROM profiles.md, NOT FROM carve-js -- NORMATIVE.
      This clause is about the SHAPE of a node: which fields it carries and
      what they are called. Where carve-js's `type` string disagrees with the
      vocabulary in profiles.md, the vocabulary wins.

      Read without that limit, §1 made carve-js's spelling normative even when
      it collided with the vocabulary. An inline footnote reference is the
      case that surfaced it: carve-js called it `footnote`, which is ALREADY
      the block type for the definition, so one identifier named two different
      nodes and `footnote_ref` -- listed in profiles.md as an inline type --
      named none. carve-php emitted `footnote_ref`. Deferring to carve-js there
      would have resolved the disagreement by deleting a distinction the
      vocabulary makes on purpose (carve#405).

   2. EVERY NODE IS AN OBJECT with a `type` holding the node-type identifier
      from profiles.md - the same snake_case strings PART 9 §8 already calls
      spec surface for kind names. A document is
      `{"type":"document","children":[...]}`.

   3. FIELD NAMES ARE SPEC SURFACE, exactly as kind names are. The normative
      set per node type is the carve-js interface for that type. Notably:
        link       `href`, `children`, and `ref` when the author wrote a
                   reference link
        image      `src`, `alt`, `title`
        code       `value`
        text       `value`
        heading    `level`, `children`
        list       `ordered`, `tight`, `start`, `items`, and the AUTHOR-CHOICE
                   fields `bulletChar` and `delim` (PART 11 §6) when the author's
                   spelling is not the default
        strong     `children`, and `boldItalic` when the author wrote the
                   combined `/*…*/` form rather than nesting `*` around `/`
      An implementation MUST NOT invent a synonym, and MUST NOT expose an
      internal field the reference does not have. A consumer reading `href`
      must not have to know which engine produced the document.

   4. POSITIONS ARE REQUIRED. Every node EXCEPT the document root carries
      `pos`:

        {"startLine":1,"endLine":1,"startColumn":2,"endColumn":5,
         "startOffset":1,"endOffset":4}

      Lines are 1-based. Columns are 1-based and offsets 0-based, both counted
      in UNICODE CODEPOINTS. `endColumn` and `endOffset` are exclusive.

      The unit is stated because there is no convention to inherit: every
      implementation otherwise reports whatever its own strings are indexed by,
      and the differences are invisible on ASCII. Measured on `\u{1F600} *b*`,
      where the delimiter sits at codepoint 2, UTF-16 unit 3 and byte 5:
      djot.js and commonmark.js report 3, carve-js reported 3, and a byte-indexed
      engine would report 5 -- three implementations, three answers, all
      believing they were right.

      Codepoints rather than bytes or UTF-16 code units because a codepoint index
      always lands on a CHARACTER BOUNDARY. A byte offset can point inside a
      UTF-8 sequence and a UTF-16 offset inside a surrogate pair; either lets a
      consumer slice a document into invalid text. This also follows djot's
      reference implementation, which builds a byte-to-charpos table specifically
      so it can report characters from a byte-indexed language -- the format's
      author paid that conversion deliberately rather than exporting the host's
      unit.

      Every implementation therefore converts, and each pays it once per document
      rather than per position: the mapping is a single pass, and it is the
      identity for any document without an astral character.

      The document root is exempt because it spans the whole source by
      definition, so a span on it carries no information a consumer does not
      already have. The first draft of this section said "every node" without
      that carve-out, which made the reference implementation non-conformant
      with a rule written a day earlier - the conformance checker found it on
      its first run.

      This is the field that makes a serialized AST worth exchanging: an
      editor, a language server or a tool grounding output back to source needs
      to say WHERE, and a tree without positions can only be re-parsed rather
      than navigated. Making it optional would mean every consumer has to
      handle its absence, which in practice means not using it.

      POSITION TRACKING MAY BE OPT-IN, SERIALIZATION MAY NOT. An implementation
      MAY gate position tracking behind a parse option, and MUST enable it when
      asked to serialize. What is forbidden is a serialized document without
      positions, not a parse without them.

      This is a concession to where the cost falls. Recording a span for every
      node is not free -- carve-rs builds its line map only when the
      source-line render option asks for it, precisely so an ordinary parse does
      not pay -- and serialization is an operation most callers never perform.
      Charging every parse in the fastest engine for a feature used by
      exporters, editors and language servers would be the wrong trade, and a
      spec that demanded it would be quietly ignored or quietly unimplemented.

      The contract a consumer relies on is unchanged: JSON it is handed carries
      positions. How the producer arranged that is the producer's business.

      IMPLEMENTATION STATUS. Only carve-js records positions today. carve-php
      has none on its nodes; carve-rs has none either, though its parser keeps a
      line map for the source-line render option, so line granularity exists
      there and columns and offsets do not. Both therefore need position
      tracking before they can serialize conformantly. An implementation that
      cannot yet produce positions MUST NOT emit `pos` with invented values, and
      MUST NOT omit it silently: it does not yet serialize conformantly, and
      should say so.

   5. WHAT IS NOT IN THE SERIALIZED FORM. Formatter-internal nodes (PART 11,
      and the `raw_text` case profiles.md excludes) are not part of the
      document and are not serialized. Resolution results that a consumer can
      recompute - footnote numbering, caption numbers - ARE serialized, because
      recomputing them requires reimplementing PART 9R.

   6. ROUND TRIP. `parse(x)` serialized and deserialized MUST equal `parse(x)`.
      A serializer that loses a field is not a lossy convenience; it is a
      consumer breaking silently one document later.
*)

Formal grammar ​

Carve Core: the executable spec ​

Formal grammar

Carve Core: the executable spec