Formal grammar
This is the canonical EBNF specification of Carve. It defines block-level constructs first (frontmatter, headings, lists, tables, fenced code, captions), then inline constructs (emphasis, links, attributes, extensions, mentions, tags, CriticMarkup).
Implementations should match this grammar. The case study explains the design rationale, the reference page covers parsing edge cases, and the examples show the expected HTML output for each construct.
The full grammar lives at resources/grammar.ebnf in the repository.
ebnf
(* ============================================================================
Carve Markup Language - EBNF Grammar
Version: 0.1-draft
This grammar defines the syntax of Carve, a post-Markdown lightweight markup
language with visual mnemonics and human-centered design.
Notation:
- (*...*) = comment
- 'x' = terminal character
- "x" = terminal string
- x | y = alternation
- x y = concatenation
- [x] = optional (0 or 1)
- {x} = repetition (0 or more)
- x+ = repetition (1 or more)
- {x}+ = repetition (1 or more) -- same as x+, used where the repeated
unit is a group
- x - y = exception (x but not y)
- (x) = grouping
NORMATIVITY
- This file (resources/grammar.ebnf) is the NORMATIVE specification of
Carve. Where any document disagrees with it, this file wins. Within
it, PART 9 (semantic constraints not expressible in pure EBNF) is the
authority for behavior the grammar productions cannot encode.
- docs/case-study/syntax.md and docs/edge-cases.md are EXPLANATORY and
DERIVED -- prose for humans, non-normative.
- docs/examples.md and the generated tests/corpus/* pairs are the
CONFORMANCE CONTRACT: every example is checked against the reference
implementation in CI. A spec change is not real until a corpus pair
pins it.
============================================================================ *)
(* ============================================================================
PART 1: DOCUMENT STRUCTURE
============================================================================ *)
document = [frontmatter], {block}, EOF ;
frontmatter = frontmatter_open, frontmatter_content, frontmatter_close, newline ;
(* The opening delimiter may carry a format token (yaml | json | toml | neon |
...); a bare `---` defaults to `yaml`. The space before the token is OPTIONAL
(lenient: both `---yaml` and `--- yaml` are accepted; `---yaml` is canonical).
The closing delimiter is always a bare `---`. The token distinguishes a typed
opener from a thematic break. A BARE `---` at document start is frontmatter
only when a closing `---` line exists ahead (closer lookahead, same model as
PART 9 §10); with no closer the line is an ordinary thematic break and the
following lines are ordinary blocks. *)
frontmatter_open = "---", [space], [frontmatter_format], newline ;
frontmatter_close = "---", newline ;
frontmatter_format = (letter | digit)+ ;
frontmatter_content = (* metadata in the named format, parsed by that parser *) ;
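(* ILLUSTRATIVE -- NON-NORMATIVE. A sketch of how the opener rules above
   classify a document's first line; the metadata shown is made up for the
   example and is not part of the specification:

     ---                 bare opener, a closing `---` line exists ahead
     title: Demo         -> frontmatter, format defaults to `yaml`
     ---

     ---toml             typed opener, canonical (no space before the token)
     title = "Demo"      -> frontmatter, handed to the toml parser
     ---

     --- json            typed opener, lenient (space before the token)
     {"title": "Demo"}   -> frontmatter, handed to the json parser
     ---

     ---                 bare `---` with NO closing `---` line ahead
     Some text           -> thematic break, then an ordinary paragraph
*)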
block = heading
| thematic_break
| code_block
| blockquote
| list
| table
| line_block
| admonition
| div
| comment_block
| comment_line
| raw_block
| reference_definition
| footnote_definition
| abbreviation_definition
| paragraph
| block_attributes
| blank_line ;
(* `block_attributes` is a standalone block-level line that renders
nothing on its own; it attaches to a following block. Its precise
reach (floating across blank lines), accumulation, and drop-if-dangling
semantics are NOT context-free -- they are pinned in PART 9 §15.
`comment_line`, `reference_definition`, `footnote_definition` and
`abbreviation_definition` are the INVISIBLE blocks: they are parsed and
consumed (collected into the definition table / dropped) and emit no
output; like `block_attributes` they also interrupt an open paragraph
(PART 9 §10, INVISIBLE CONSTRUCTS). *)
blank_line = {whitespace}, newline ;
(* ============================================================================
PART 2: BLOCK ELEMENTS
============================================================================ *)
(* --- Headings --- *)
(* ATX (`#` prefix) only. Setext (underline) headings are intentionally
NOT supported, matching djot: a `---` underline collides with the
thematic break and frontmatter delimiter, reintroducing the ambiguity
djot removed and breaking Design Principle 1 ("one syntax, one
meaning"). *)
heading = atx_heading ;
atx_heading = heading_first_line, {heading_continuation_line} ;
heading_first_line = heading_marker, space, inline_content, newline ;
heading_marker = '#' | "##" | "###" | "####" | "#####" | "######" ;
(* MULTI-LINE HEADINGS -- NORMATIVE (like Djot; consistent with §10 paragraph
interruption). A heading's text spills onto following lines until a blank
line. A continuation line:
- may carry the SAME or a LOWER number of `#` markers (stripped) or none,
and its text is appended to the heading;
- a heading marker with MORE `#` than the open heading starts a NEW
heading (ends this one);
- a caption (`^ ` ...) or a fenced comment (`%%%`) ends the heading;
- a blank line ends it;
- a block-opener (list, quote, table, fenced code, `:::` div, thematic
break) ends the heading and starts that block, exactly as it interrupts
a paragraph (§10).
Only PLAIN text folds into the heading; an ordered marker still folds (it
never interrupts a paragraph either, §10/§11). The heading id is derived from
the full folded text.
NO TRAILING ATTRIBUTES (djot-strict). A heading line carries NO trailing
`{...}` attribute block: a `{...}` at the end of a heading line is ordinary
inline content (literal text unless it forms an inline construct), and the
id derives from the full literal text. Attributes attach via a PRECEDING
block-attribute line (PART 9 §15), the uniform block rule; an explicit
`#id` from that line lives on the `<section>` wrapper (PART 9 §13).
Pinned by corpus 02-headings (the preceding-line and literal-trailing
pairs). *)
heading_continuation_line = [heading_marker, space], inline_content, newline ;
(* the optional marker must carry the same number of `#` as the open
heading's marker, or fewer -- see the NORMATIVE note above *)
(* HEADING IDENTIFIERS -- NORMATIVE. A heading without an explicit {#id} gets an
automatic identifier from its folded plain text (symbols `:name:` and footnote
references excluded), GitHub/SSG-style:
- replace each maximal run of non-alphanumeric ASCII characters with a single
`-` (covers spaces, punctuation, `_`, and runs of `-`) -- the jgm/djot#393 rule;
- trim leading/trailing `-`;
- LOWERCASE (Unicode-aware): NON-ASCII characters are PRESERVED, only their case
is folded (`Café` -> `café`, `日本語` unchanged). Lowercasing makes ids and the
common cross-reference case-insensitive, matching author expectation;
- a leading-digit result is prefixed `s-` (a bare leading digit is a valid HTML
id but an invalid CSS selector); an empty result becomes a generated `s-N`;
- duplicates within a document get a numeric `-2`, `-3`, ... suffix.
An explicit {#id} is used verbatim (case preserved). Implementations MAY offer an
OPT-IN mode that ASCII-folds the identifier for URL/CSS-fragment portability
(carve-js `asciiHeadingIds` parse option; carve-php `AsciiHeadingIdsExtension`);
ASCII folding is never the default. NOTE: carve lowercases by design, deliberately
diverging from djot.js / djot-php which preserve case per #393. *)
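(* ILLUSTRATIVE -- NON-NORMATIVE. A worked reading of the automatic-id steps
   above (replace runs, trim, lowercase, prefix, de-duplicate), assuming the
   headings below appear in this order in one document with no explicit ids.
   The heading texts are made up, except `Café Notes` and `日本語`, which the
   notes in this file use themselves:

     # Hello, World!      -> `hello-world`      (`, ` -> `-`; trailing `-` trimmed)
     # snake_case name    -> `snake-case-name`  (`_` and space are non-alphanumeric)
     # C++ & Rust         -> `c-rust`           (the run `++ & ` is ONE `-`)
     # Café Notes         -> `café-notes`       (non-ASCII kept, case folded)
     # 日本語              -> `日本語`            (unchanged)
     # 2024 Roadmap       -> `s-2024-roadmap`   (leading digit gets `s-`)
     # ???                -> generated `s-N`    (empty after trimming)
     # Hello, World!      -> `hello-world-2`    (duplicate of the first heading)
*)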
(* --- Thematic Breaks --- *)
thematic_break = (('-', '-', '-', {'-'}) | ('*', '*', '*', {'*'}) | ('_', '_', '_', {'_'})), newline ;
(* --- Code Blocks --- *)
code_block = fenced_code_block, [caption] ;
(* A trailing caption (PART 9 §4) wraps the code block in a figure: a
numbered LISTING. The caption may carry a `#` number placeholder, and a
`</#id>` cross-reference to the block resolves to "Listing N" (PART 9 §19). *)
fenced_code_block = code_fence_open, [space], [code_fence_info], newline,
code_content,
code_fence_close, newline ;
(* The space between the fence and the info string is OPTIONAL (lenient: both
```php and ``` php are accepted; Markdown writes no space, Djot writes the
space). The no-space form (```php) is canonical and is what the X->Carve
converters emit. *)
(* CODE-FENCE INFO STRING -- NORMATIVE. The info string is a single
language_info token, optionally followed by a bracketed [label], or a bare
[label] (no language). The label is structured metadata: it is NOT part of
the language/class; the core renderer ignores it, an extension (e.g.
code-group) may use it. A code fence carries NO inline attributes -- use the
PRECEDING block-attribute line for those (PART 9 §15):
{.fancy #x}
```php
...
```
which renders on the <pre> (language stays `language-…` on the <code>).
INVALID-FENCE FALLBACK: when the text after the language token is anything
other than a [label] -- a bare second word, a quoted value, an inline `{...}`
block (```js title="x", ``` php {.x}) -- the line is NOT a fenced code block;
it falls back to ordinary inline parsing (the backtick run typically opens an
inline code span). The bracket is the only delimiter that admits structured
metadata. The raw passthrough opener (```=FORMAT, PART 9 §20) is a SEPARATE
production matched before this one (a leading `=` is never a language). *)
code_fence_open = backtick_fence | tilde_fence ;
code_fence_close = backtick_fence | tilde_fence ; (* same fence char; length >= opener (PART 9 §2) *)
backtick_fence = '`', '`', '`', {'`'} ;
tilde_fence = '~', '~', '~', {'~'} ;
code_fence_info = (language_info, [space+, label]) | label ;
language_info = (letter | digit | '-' | '_' | '+' | '#' | '.' | '/')+ ;
(* A single token; the punctuation covers real language tags like c++, c#,
f#, asp.net, text/html. May start with a digit. The `/` is normative across
all impls (so `text/html`, `image/svg` are language tokens, not split). A
token MUST NOT start with `=`: a leading `=` is the raw_block opener (see the
INVALID-FENCE FALLBACK note above), not a language. *)
label = '[', { any_char - ']' }, ']' ; (* structured metadata, e.g. [Installation]; not part of the class *)
code_content = (* any text until matching fence, preserved literally *) ;
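(* ILLUSTRATIVE -- NON-NORMATIVE. How the info-string rules above classify
   some opener lines. This is a sketch derived from the productions and the
   NORMATIVE note; the label text is made up:

     ```php                  language_info `php` (canonical, no space)
     ``` php                 same token, lenient space after the fence
     ```php [Installation]   language_info `php` plus label `[Installation]`
     ```[Installation]       label only, no language
     ```text/html            ONE language token (the `/` does not split it)
     ```js title="x"         NOT a fenced code block: inline parsing fallback
     ``` php {.x}            NOT a fenced code block: a fence takes no inline
                             attributes
     ```=html                raw_block opener, matched before code_block
*)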
(* TABS IN CODE -- NORMATIVE. Literal tab characters in code content (fenced
code blocks and inline code spans) are PRESERVED verbatim; a tab and N spaces
are not interchangeable. Tab display width is a presentation concern (CSS
`tab-size`). Tab-to-space expansion is NOT a default behavior -- it is opt-in
via a tab-normalize extension (flat replacement, default 2 spaces, content
only). Matches djot / CommonMark. Pinned by corpus tabs-in-code-preserved. *)
(* --- Blockquotes --- *)
blockquote = blockquote_line, {blockquote_line | lazy_continuation_line}, [caption] ;
(* marked and lazy lines may interleave: `> a` / lazy / `> b` is ONE quote *)
blockquote_line = '>', [' '], inline_content, newline ;
(* LAZY CONTINUATION -- NORMATIVE (CommonMark-compatible). After one or more
blockquote_lines, a line that does NOT carry the '>' marker still continues
the blockquote -- exactly as if it carried '>' -- provided it is:
- not blank (a blank line ends the blockquote), and
- not a block-opener that would interrupt a paragraph (§10): a heading,
list, table, fenced code, `:::` div, thematic break, OR an "invisible"
reference / footnote / abbreviation definition or comment -- each ends
the blockquote and starts that block OUTSIDE it, and
- not a caption ('^ ' ...), which attaches to the blockquote instead.
Only PLAIN text (continuing an open paragraph) folds in; an ordered marker
folds too (it never interrupts, §10/§11). The lazy line's text is appended to
the blockquote's inner content before that content is block-parsed, so a
hard-wrapped quoted paragraph need not repeat '>' on every line. *)
lazy_continuation_line = inline_content, newline ; (* see the NORMATIVE note above for which lines qualify *)
(* --- Lists --- *)
list = unordered_list | ordered_list | definition_list ;
unordered_list = unordered_item+ ;
unordered_item = bullet_marker, [item_attributes], space, [task_marker], list_item_content ;
(* the optional task_marker makes the item a TASK item (checkbox); the
plain-vs-task axis is a §11 same-list criterion *)
bullet_marker = '-' | '*' ; (* `+` is NOT a bullet in Carve -- it is the list
continuation marker (PART 9 §17); a `+ ` line is
ordinary paragraph text. Deviates from djot. *)
(* MARKER REQUIRES CONTENT -- NORMATIVE. A bullet or ordered marker is a list
item only when followed by a space AND non-empty content. A content-less
marker line -- bare (`-`) or followed only by trailing whitespace (`- `) --
is NOT a list: it is paragraph text. The rule ignores trailing whitespace, so
`-` and `- ` behave identically (an editor stripping the trailing space cannot
change the meaning). Carve is stricter than CommonMark, which treats a bare
`-` as an empty item; requiring content keeps a lone dash (a prose dash /
placeholder) from silently becoming a list. Pinned by corpus
content-less-marker-is-not-a-list. *)
ordered_list = ordered_item+ ;
ordered_item = ordered_marker, [item_attributes], space, list_item_content ;
(* The first item fixes the dialect (decimal / alpha / roman) and the
delimiter (`.` or `)`); classification and the ambiguous-letter
tie-break are in PART 9 §11. An ordered list does NOT interrupt a paragraph
in any dialect/value (`1.`, `2.`, `1985.`, `a.`, `i.`); it needs a blank
line before it (matching Djot, avoiding the CommonMark `1.`-only heuristic)
-- the paragraph-interruption rule is PART 9 §10.
The two delimiters are the trailing `.` and `)` only. djot's
parenthesized `(1)` / `(a)` form is INTENTIONALLY NOT a marker -- it is
too easily confused with a prose parenthetical -- so `(1) text` stays
literal paragraph text (pinned by corpus parenthesized-ordered-marker). *)
ordered_marker = (digit+ | letter | roman_numeral), ('.' | ')') ;
roman_numeral = ('i' | 'v' | 'x' | 'l' | 'c' | 'd' | 'm')+
| ('I' | 'V' | 'X' | 'L' | 'C' | 'D' | 'M')+ ;
(* LIST-ITEM ATTRIBUTES (Carve addition; NORMATIVE -- extends PART 9 §15). An
attribute block ABUTTING the marker (no space between marker and `{`)
attaches its attributes to the `<li>` itself; the marker's required space
follows the block:
`-{.c} text` -> `<li class="c">text</li>`
`3.{#x k=v} text` -> `<li id="x" k="v">text</li>`
For task items the block abuts the marker, before the task marker:
`-{.c} [ ] text` -> `<li class="c"><input ...> text</li>`.
WHITESPACE IS THE DISCRIMINATOR -- NORMATIVE:
- `-{.c} text` (attr ABUTS marker) -> the `{.c}` is part of the MARKER and
attributes the `<li>`. This is the ONLY way to attribute the `<li>`.
- `- {.c} text` (a space BEFORE `{`) -> the `{.c}` is ordinary item CONTENT,
NOT a li-attribute. It then follows the normal inline/block rules: a
same-line `{.c}` is leading inline content; a `{.c}` alone on its own line
(inside the item) is a block-attribute line that floats to the next block
within the item -- e.g. a `{.blue}` line preceding an indented `> quote`
puts `.blue` on the `<blockquote>` (§15), NOT on the `<li>`. This keeps
li-attributes and block-attributes cleanly separated.
- DISAMBIGUATION of the abutting block mirrors the inline-span rule (§14):
it is consumed as li-attributes only if it yields >= 1 attribute (`#id`,
`.class`, `key=value`) OR is the blessed empty block (`-{} text` -> bare
`<li>`). Otherwise (`-{+a+}`, `-{not attrs}`) the `-{` is not a marker and
the line stays ordinary paragraph/inline text.
- Free slot: `-{` and `1.{` / `1){` are NOT list markers today (bullet and
ordered both require the space), so no existing input changes meaning.
- Diverges from djot (which has no li-attribute form); deliberate Carve
extension. The lazy-continuation accident -- a trailing `{…}` line folded
onto a tight item, which carve-php attached to the `<li>` and carve-js
dropped -- is REJECTED as the mechanism: it was position-dependent (broke
once a sub-list intervened) and is not normative. *)
item_attributes = attributes ;
list_item_content = first_block_content
| (inline_content, newline, {list_continuation | nested_list | continuation_marker_block}) ;
list_continuation = indent, inline_content, newline ;
nested_list = indent, list ;
indent = whitespace, {whitespace} ;
(* Indentation is measured in VISUAL COLUMNS (a tab advances to the next
multiple of 4 -- CommonMark tab stops); how much indent nests what is NOT
a fixed character count -- the content-column rules are PART 9 §24. *)
(* List continuation marker (Carve addition; see PART 9 §17): a lone `+` at
the item's marker column attaches the following flush-left block to the
item with no blank line, keeping the list tight. *)
continuation_marker = '+', newline ;
continuation_marker_block = continuation_marker, block ;
(* First-block item (Carve addition; see PART 9 §17): a lone `+` as the sole
content right after the marker (`- +`) opens an item whose body is the
flush-left block(s) that follow, with no inline lead and no indentation.
`- + text` keeps `+ text` as literal inline content -- only a bare `+`
triggers this. *)
first_block_content = continuation_marker, {block} ;
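(* ILLUSTRATIVE -- NON-NORMATIVE. The two `+` forms side by side. This is a
   sketch of the syntax only; the item texts are made up, and the exact
   extent of the attached blocks is governed by PART 9 §17:

     - Step one           item with an inline lead
     +                    continuation_marker at the item's marker column
     > attached quote     flush-left block attached to "Step one"

     - +                  first-block item: no inline lead
     > quoted body        the item's body is this flush-left block

     - + text             NOT a first-block item: `+ text` is literal
                          inline content
*)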
(* Task lists are unordered lists with special markers *)
task_marker = '[', task_state, ']', space ;
(* `x`/`X` render a CHECKED checkbox; every other state (` `, `-`, `_`,
`>`, `?`) renders an UNCHECKED checkbox. Matches djot-php. *)
task_state = ' ' | 'x' | 'X' | '-' | '_' | '>' | '?' ;
(* --- Definition Lists --- *)
definition_list = definition_entry+ ;
definition_entry = definition_term+, definition_body+ ;
definition_term = "::", space, inline_content, newline ;
definition_body = ':', space, space, inline_content, newline, {definition_continuation} ;
definition_continuation = (space, space, space, inline_content, newline)
| (backslash, newline, space, space, space, inline_content, newline) ;
(* The term marker is TWO colons. A single-colon `: term` line is NOT a
definition list -- it is ordinary paragraph text. (Deliberate: djot uses a
single colon, but `::` keeps the term marker distinct from the `:::` div
fence and from a stray `:` in prose; all three reference impls agree.) *)
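(* ILLUSTRATIVE -- NON-NORMATIVE. A small definition list matching the
   productions above (terms and text are made up). Two definition_term lines
   share one definition_body; the body's second line is a
   definition_continuation, indented three spaces relative to the margin:

     :: Carve
     :: Carve markup
     :  A lightweight markup language
        with visual mnemonics.

   By contrast, a line such as `: Carve` (single colon) is ordinary paragraph
   text. *)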
(* --- Tables --- *)
table = standard_row, {table_row}, [caption] ;
(* a table BEGINS with a standard row; a continuation row only ever follows
an existing row (its cells append to the row above -- PART 9 §5) *)
table_row = standard_row | continuation_row ;
standard_row = '|', table_cell, {'|', table_cell}, '|', newline ;
continuation_row = '+', table_cell, {'|', table_cell}, '|', newline ;
table_cell = header_cell | data_cell | span_cell ;
header_cell = '=', [alignment_marker], {whitespace}, cell_content, {whitespace} ;
data_cell = [cell_attributes], [alignment_marker], {whitespace}, cell_content, {whitespace} ;
(* alignment_marker is glued to the opening '|' (no preceding whitespace),
mirroring header_cell; a whitespace-delimited lone '^'/'<' is span_cell *)
(* CELL ATTRIBUTES (NORMATIVE): an attribute block GLUED to the opening '|'
(no preceding whitespace) sets the cell's attributes; the rest of the cell,
after optional whitespace, is the content. A space before the brace
(`| {.x}`) is ordinary content, NOT attributes. The whole brace payload must
be valid attribute syntax (§15); otherwise the '{' is literal content. A
cell carrying attributes is never a bare span_cell -- its content is literal
even if it is just '^' or '<'. A computed rowspan/colspan/alignment is
authoritative, so an author copy of those keys is dropped. (corpus
97-table-cell-attributes.) *)
cell_attributes = attributes ;
span_cell = rowspan_marker | colspan_marker ;
rowspan_marker = {whitespace}, '^', {whitespace} ;
colspan_marker = {whitespace}, '<', {whitespace} ;
alignment_marker = '<' | '>' | '~' ; (* left, right, center *)
cell_content = inline_content ;
(* SEMANTIC CONSTRAINT (PART 9 §5): cell content may not contain an
unescaped `|` outside a code span -- the pipe is the cell separator;
the exception is not expressible context-free. *)
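(* ILLUSTRATIVE -- NON-NORMATIVE. A sketch of how the cell productions above
   read one small table. The data and the `.total` class are made up:

     |= Item       |=> Qty |
     | Apples      |> 3    |
     |{.total} Sum |> 3    |
     ^ Table #: Stock

     `= Item`         header_cell, no alignment marker
     `=> Qty`         header_cell, alignment_marker `>` (right)
     ` Apples`        data_cell, no alignment marker
     `> 3`            data_cell, alignment_marker `>` glued to the `|` (right)
     `{.total} Sum`   data_cell, cell_attributes glued to the `|`, content `Sum`
     `^ Table ...`    caption attached to the table

   By contrast, `| {.total} Sum |` (space before the brace) keeps `{.total}`
   as ordinary cell content. *)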
(* --- Captions --- *)
caption = '^', space, inline_content, newline ;
(* CAPTION NUMBER PLACEHOLDER -- NORMATIVE.
The FIRST bare `#` in a caption's TOP-LEVEL inline_content is a number
placeholder. "Bare" = a `#` that does NOT begin a tag (a `#`
followed by whitespace, `:`, `.`, end-of-caption, or any
non-tag-name character). `#word` stays a tag (PART 9 §19). `\#` is
a literal `#`, never a placeholder. Only the first bare `#` is a
placeholder; later bare `#` render literally. A `#` INSIDE inline
markup (emphasis, link, span) is NOT a placeholder -- put it in the
caption's top-level text (`^ *Figure* #:`, not `^ *Figure #*:`).
The LABEL is the inline_content before the placeholder, trailing
whitespace trimmed; the counter BUCKET KEY is that label's plain
text. Numbering is per-bucket, 1-based, in document order, assigned
in the resolution pass. A caption with no bare `#` is unchanged. *)
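(* ILLUSTRATIVE -- NON-NORMATIVE. A worked reading of the placeholder rule,
   assuming these captions appear in this order in one document, each
   attached to its own block. The caption texts are made up, except the
   emphasis case, which follows the note above:

     ^ Figure #: Overview     label `Figure`, bucket `Figure`, number 1
     ^ Table #. Results       label `Table`,  bucket `Table`,  number 1
     ^ Figure #: Detail       label `Figure`, bucket `Figure`, number 2
     ^ Figure # of #          first bare `#` -> number 3; second `#` literal
     ^ Notes on #carve        `#carve` is a tag: no placeholder, unchanged
     ^ Issue \#5              `\#` is a literal `#`: no placeholder
     ^ *Figure #*: Overview   `#` inside emphasis: no placeholder
*)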
(* --- Admonitions --- *)
(* A fence is a run of 3+ colons. A longer opener nests shorter blocks; a
block is closed only by a bare fence of equal-or-greater length, so a
`:::` inside a `::::` block is content (PART 9 §12). *)
colon_fence = ":::", {":"} ;
admonition = admonition_open, newline,
{block - admonition_close},
admonition_close, newline ;
(* the body may be EMPTY, like the generic div's (PART 9 §12) *)
(* STRICT (djot): the opener line carries NO inline attributes -- it is
colon_fence, type, and an optional quoted title, and NOTHING else. A
trailing `{...}` (or any other non-title text) makes the line an
ordinary paragraph, not a fence. Attributes attach via a PRECEDING
block-attribute line, which floats onto the admonition (§15). *)
admonition_open = colon_fence, space, admonition_type, [space, quoted_title] ;
admonition_close = colon_fence ;
admonition_type = "note" | "tip" | "warning" | "danger" | "info"
| "success" | "example" | "quote" (* Tier 1 canonical *)
| identifier ; (* Tier 2 custom types (incl. `details`)
-- see PART 9 §12 for the two-tier
rendering rule *)
quoted_title = '"', {character - '"'}, '"' ;
(* there is NO escape mechanism inside a quoted title -- a title cannot
contain a `"`; a line whose title is malformed is an ordinary paragraph
(deliberate strictness, unlike `quoted_value` which accepts escapes) *)
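(* ILLUSTRATIVE -- NON-NORMATIVE. Opener lines read against the rules above.
   The titles and the custom type name `sidebar` are made up:

     ::: note                     admonition, Tier 1 type `note`
     ::: warning "Read this"      admonition with a quoted title
     ::: sidebar                  admonition, Tier 2 custom type (identifier)
     ::: note {.x}                NOT a fence: ordinary paragraph
     ::: note "a "b" c"           malformed title: ordinary paragraph

   Nesting by fence length: the inner `:::` closes only the inner block,
   because the outer block needs a bare fence of four or more colons:

     :::: tip
     ::: note
     inner text
     :::
     ::::
*)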
(* --- Generic Divs (djot generic container; PART 9 §12) --- *)
(* A BARE `:::` opener with NO type word is a generic div, NOT an
admonition (no class added). STRICT (djot): the opener carries NO inline
attributes -- an inline `::: {…}` is a paragraph, not a div. Attributes
attach via a PRECEDING block-attribute line, which floats onto the div
(§15). Shares the colon-fence closer and the fence-length nesting rule
(PART 9 §12). *)
div = div_open, newline, {block - admonition_close}, admonition_close, newline ;
div_open = colon_fence ;
(* --- Line block (verse) -- a RECOGNIZED `:::` type that preserves the
author's per-line layout. `::: |` renders as a generic
`<div class="line-block">` (NOT an `<aside>`), but unlike an ordinary div
its body keeps each line's LEADING WHITESPACE and turns each soft line
break into a hard break. The type token is a bare pipe `|` on the opener --
NOT a per-line prefix -- so it is free of the pipe/table ambiguity of the
Pandoc/djot per-line `|` form, and uses no English keyword (jgm/djot#29).
The behavior keys off the `|` token. STRICT (djot): the opener carries no
inline attributes, so the inline `::: {.line-block}` form is a PARAGRAPH, not
a line block; extra attributes attach via a PRECEDING block-attribute line.
Normative rendering: PART 9 §23. *)
line_block = line_block_open, newline, line_block_body, admonition_close, newline ;
line_block_open = colon_fence, space, "|" ;
(* The body is a sequence of stanzas separated by blank lines; each stanza is
a run of consecutive non-blank lines. Inline content parses normally; the
per-line leading whitespace is retained and the intra-stanza newlines are
hard breaks (PART 9 §23). *)
line_block_body = stanza, {blank_line, stanza} ;
stanza = line_block_line, {line_block_line} ;
line_block_line = {whitespace}, inline_content, newline ;
(* the leading whitespace is PRESERVED in output -- PART 9 §23 *)
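(* ILLUSTRATIVE -- NON-NORMATIVE. A two-stanza line block (the verse is made
   up). Per the notes above it renders as a `<div class="line-block">`; each
   line's leading whitespace is kept and each line break inside a stanza is
   a hard break. The exact output is pinned in PART 9 §23, not here:

     ::: |
     Roses are red,
       violets are blue.

     Sugar is sweet.
     :::
*)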
(* --- Comments --- *)
comment_line = "%%", {character}, newline ;
comment_block = comment_block_open, newline,
{character | newline},
comment_block_close, newline ;
comment_block_open = "%%%", {'%'} ; (* 3+ percent signs *)
comment_block_close = "%%%", {'%'} ; (* must match opener length *)
(* trailing (inline) line comment: from a %% marker to end of line, content
not rendered. The marker requires whitespace or start-of-run before it and
is not formed inside code spans / inline-raw, nor when the first % is
escaped -- a semantic constraint stated in PART 9, not context-free. *)
inline_comment = "%%", {character - newline} ;
(* --- Raw Blocks --- *)
raw_block = code_fence_open, [space], "=", format_name, newline,
raw_content,
code_fence_close, newline ;
(* The opener is a code fence (backticks or tildes) whose info string is `=`
immediately followed by a format name (```=html). The leading `=` is the
block parallel of the inline raw `{=format}` attribute; it never starts a
language token (a code fence's language charset excludes `=`), so this is
unambiguous against an ordinary code block. The `=` and format name must be
adjacent -- ```= html (space after `=`) is NOT a raw block. Leading
whitespace before the `=` is permitted (```=html and ``` =html both open
raw). The closer follows the PART 9 §2 rule: same fence character, length
>= opener. Content is verbatim; it is emitted UNESCAPED when the format
matches the output format (html) and DROPPED otherwise -- the block parallel
of raw inline (PART 9 §20). This adopts djot's raw-block syntax; the earlier
```raw FORMAT keyword form was removed (no English keyword; symbol-based and
symmetric with inline `{=format}`). *)
format_name = "html" | "latex" | identifier ;
raw_content = (* any content until closing fence *) ;
(* --- Paragraphs --- *)
(* A VISIBLE block (heading, list, quote, table, fence, thematic break,
admonition/div) interrupts a paragraph with no blank line before it
(Markdown-like); invisible constructs (reference definitions, comments,
block-attribute lines) interrupt too. This is the semantic constraint
PART 9 §10, not a context-free one. A paragraph is terminated by a blank
line, an interrupting block (§10), or end of file -- the optional trailing
blank_line here is the paragraph's own terminator, not a precondition for
the NEXT block. *)
paragraph = inline_content+, [blank_line] ;
(* ============================================================================
PART 3: INLINE ELEMENTS
============================================================================ *)
inline_content = {inline_element | text_run | literal_special}+ ;
(* A special_char that does not begin (or close) a construct under PART 8
precedence + PART 9 conditions is literal text. Without this production
the grammar cannot represent "a/b/c", "x = a*b", or an inner delimiter
such as the middle '/' of /usr/local/ -- yet these are canonical outputs
(edge-cases.md §1, corpus 01-emphasis). The "is this a construct?"
decision is the semantic constraint in PART 9, not a context-free one. *)
literal_special = special_char ;
inline_element = escaped_char
| raw_inline
| code_span
| autolink
| auto_text_link
| link
| inline_span
| image
| math
| emphasis
| strong
| bold_italic
| underline
| strikethrough
| superscript
| subscript
| highlight
| forced_emphasis
| forced_strong
| forced_underline
| forced_strike
| forced_super
| forced_sub
| forced_highlight
| footnote
| extension_inline
| editorial_markup
| mention
| tag
| smart_typography
| inline_comment
| hard_break
| soft_break ;
text_run = (character - special_char)+ ;
special_char = '/' | '*' | '_' | '~' | '^' | '`' | '[' | ']'
| '!' | '$' | '{' | '}' | '<' | '>' | '@' | '#'
| ',' | '=' | ':' | '%' | '\'
| '-' | '.' | '"' | "'" | '(' | '+' ;
(* the final group ('-' through '+') are the smart-typography trigger characters (dashes,
ellipsis, quotes, `(c)`-family symbols, `+-`): they must be excluded
from text_run so the smart_typography productions are reachable; when
no pattern matches they are literal_special like any other special *)
(* --- Escaped Characters --- *)
escaped_char = '\', ascii_punctuation ;
ascii_punctuation = '!' | '"' | '#' | '$' | '%' | '&' | "'" | '(' | ')'
| '*' | '+' | ',' | '-' | '.' | '/' | ':' | ';'
| '<' | '=' | '>' | '?' | '@' | '[' | '\' | ']'
| '^' | '_' | '`' | '{' | '|' | '}' | '~' ;
(* --- Code Spans --- *)
code_span = backtick_run, code_span_content, [backtick_run], [attributes] ;
(* the closing backtick_run is OPTIONAL: an unclosed run runs to end of
block -- see UNCLOSED RUN below *)
backtick_run = '`'+ ; (* the OPENER is a MAXIMAL run; the CLOSER is a run of
the SAME count, also maximal (not part of a longer run) *)
code_span_content = (* any chars except a closing backtick_run of equal count *) ;
(* UNCLOSED RUN. An opener with no equal-length closer ahead is NOT literal
text: it opens a verbatim span that runs to END_OF_BLOCK (the block's
trailing whitespace is stripped; no surrounding single-space strip, which
applies only to a closed span). Such an unclosed run is OPAQUE -- an
emphasis delimiter or link tail after it is verbatim content, so the
surrounding construct never closes. Matches djot upstream and carve-php.
(Pinned as the "Inline verbatim ..." corpus pair; was a carve-js
divergence, resolved.) *)
(* A trailing `{…}` on a code span is the generic inline-attribute
block, NOT a language tag -- EXCEPT the exact form `{=format}`, which is
raw inline passthrough (raw_inline below, PART 9 §20). For any OTHER
`{…}`: both impls consume AND apply it -- `\`c\`{.x}` ->
`<code class="x">c</code>`, and `#id`/`key=value` likewise. (Was an impl
divergence where carve-js dropped the attributes; RESOLVED -- both now
apply them identically.) *)
(* --- Raw Inline (inline parallel of raw_block; PART 9 §20) --- *)
(* A code span whose trailing attribute block is EXACTLY `{=format}` is raw
inline passthrough: the verbatim span content is emitted unescaped when
`format` matches the output, else dropped. Any OTHER trailing `{…}` is a
generic attributed code span (above), NOT raw inline. *)
raw_inline = backtick_run, code_span_content, backtick_run, '{=', format_name, '}' ;
(* --- Links --- *)
link = inline_link | reference_link | collapsed_reference_link ;
(* autolink and auto_text_link (crossref) are listed directly in
inline_element, not here. *)
inline_link = '[', link_text, ']', '(', link_destination, [link_title], ')', [attributes] ;
reference_link = '[', link_text, ']', '[', reference_label, ']', [attributes] ;
collapsed_reference_link = '[', link_text, ']', '[', ']', [attributes] ;
link_text = inline_content ;
(* SEMANTIC CONSTRAINT: the link text ends at the matching `]` -- the close
is balanced-bracket, escape- and code-span-aware (PART 9 §16's rule for
inline notes is the same scan); an unescaped bare `]` inside plain text
closes it. Not expressible context-free. *)
link_destination = {(url_char - ')') | unicode_url_char}+ ;
(* The destination is ANY run of URL characters: absolute (`https://…`),
relative (`/path`, `./file`), or a bare fragment (`#section`) -- there is
no scheme requirement. It ends at the first whitespace (a following
quoted run is the title) or at the first `)`; there is NO balanced-paren
rule and NO escape inside the destination (`[x](http://a/b(c))` links to
`http://a/b(c` and leaves the second `)` literal) -- percent-encode a
literal `)` as `%29`. There is NO angle-bracket-wrapped destination
form. *)
unicode_url_char = (* any non-whitespace, non-ASCII Unicode character *) ;
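(* ILLUSTRATIVE -- NON-NORMATIVE. Where the destination ends in a few inline
   links. This is a sketch derived from the rule above; the URLs are examples
   only, except the `http://a/b(c))` case, which the note itself gives:

     [x](https://example.com/a)     destination `https://example.com/a`
     [x](./file.html "Title")       destination `./file.html`, title `Title`
     [x](#section)                  bare fragment, no scheme needed
     [x](http://a/b(c))             destination `http://a/b(c`; last `)` literal
     [x](http://a/b(c%29)           destination `http://a/b(c%29`
*)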
(* Titles accept double OR single quotes -- a deliberate enhancement
over djot (which has no single-quote titles; it would fold `'...'`
into the URL). The non-delimiting quote may appear inside the title. *)
link_title = space, ('"', {character - '"'}, '"')
| space, ("'", {character - "'"}, "'") ;
reference_label = {character - ']'}+ ;
reference_definition = '[', reference_label, ']', ':', space, link_destination, [link_title], newline ;
(* --- Inline Spans --- *)
(* A bracketed inline run immediately followed by an attribute block
attaches those attributes to a <span>. Distinguished from links by the
character after ']': '(' -> inline_link, '[' -> reference link, '{' ->
inline_span. A bare '[text]' with none of those is literal text (carve
has no shortcut reference links). See PART 9 §14. *)
inline_span = '[', inline_content, ']', attributes ;
autolink = '<', (url_autolink | email_autolink), '>', [attributes] ;
(* A trailing `[attributes]` block attaches to the autolink (generic
inline-attribute carrier, examples.md): `<https://example.com>{.ext}`
-> `<a href="https://example.com" class="ext">https://example.com</a>`.
Same slot as every other inline carrier; both reference impls agree. *)
url_autolink = scheme, ':', {url_char}+ ;
email_autolink = {email_char}+, '@', {email_char}+, '.', {letter}+ ;
auto_text_link = "</#", crossref_id, '>' ; (* cross-reference with auto text *)
crossref_id = {character - ('>' | whitespace | newline)}+ ;
(* The crossref id token accepts NON-ASCII characters: automatic heading ids
PRESERVE Unicode (`# Café Notes` -> `café-notes`), so `</#café-notes>`
must be writable. Resolution is exact (case-sensitive against the id
table); an ASCII-folded spelling of a Unicode id does NOT resolve
(corpus 19-heading-ids). EXPLICIT ids (`{#id}`) remain ASCII
`identifier`s -- the asymmetry is deliberate: authors control explicit
ids, auto ids follow the heading text. *)
url = scheme, ':', {url_char}+ ;
scheme = letter, {letter | digit | '+' | '-' | '.'} ; (* 1+ chars; single-letter schemes ok *)
url_char = letter | digit | '-' | '.' | '_' | '~' | ':' | '/' | '?'
| '#' | '[' | ']' | '@' | '!' | '$' | '&' | "'" | '(' | ')'
| '*' | '+' | ',' | ';' | '=' | '%' ;
(* --- Images --- *)
image = '!', '[', alt_text, ']', '(', image_source, [image_title], ')', [attributes] ;
alt_text = {character - ']'} ;
image_source = link_destination ; (* absolute or relative, like links *)
image_title = link_title ;
(* A caption (`^ ` line) may follow a standalone image paragraph on the
next line, wrapping it in a <figure> -- caption placement is PART 9 §4;
there is no separate grammar production for the pair. *)
(* --- Math (djot form; PART 9 §18) --- *)
(* Inline `$` + a verbatim (backtick) span; display `$$` + a verbatim
span. The backtick span removes ambiguity with a literal `$`, so
currency such as `$5` (no following backtick run) stays literal text.
Both forms are inline; display math renders inside its own paragraph
when alone on a line. There is NO bare `$…$` form and NO `\(…\)` input
form -- those were dropped (`\(` is just an escaped paren). *)
math = math_inline | math_display ;
math_inline = '$', code_span ;
math_display = "$$", code_span ;
(* TRAILING ATTRIBUTES on math -- NORMATIVE: math reuses `code_span`, which
carries the generic `[attributes]` slot, so `$\`x\`{.c}` / `$$\`x\`{.c}`
parse a trailing attribute block and APPLY it to the math span, merging
classes into the existing `math inline` / `math display` class
(`<span class="math inline c">`); `#id` / `key=value` are applied too.
Canonical = carve-js / djot.js (pinned, corpus Math section). The
`{=format}` raw form is code-span-ONLY and is NOT inherited by math:
`$\`x\`{=html}` leaves the `{=html}` literal (both impls already agree).
carve-php currently DROPS valid math attributes -- to be fixed.
The math content is the verbatim text of the reused code_span. *)
(* --- Emphasis (left/right word-boundary rule; PART 9 §9 is normative).
The word-boundary restriction applies to EVERY bare emphasis delimiter
(`/`, `*`, `_`, `~`, `^`, `=`, `,`). No bare delimiter opens or closes
intraword: `foo*bar*baz`, `foo~bar~baz`, `snake_case`, `a/b/c` all stay
literal. One rule, no per-character carve-outs. Every bare delimiter is
single-char -- there is no two-char delimiter. This is STRICTER than Djot,
whose rule is whitespace-only.
FORCED INTRAWORD -- the brace-pair `{X … X}` family (PART 9 §22) is the
explicit escape hatch: it opens a span regardless of word boundary, so
`x{*bold*}y`, `x{/italic/}y`, `x{_under_}y` emphasize intraword. It is
ADDITIVE -- bare delimiters at a word boundary are unchanged; the braces are
only needed to force a span where a bare delimiter would otherwise stay
literal.
HIGHLIGHT and SUBSCRIPT are the single-char `=` and `,` delimiters. Uniform
word boundary makes single `=` and `,` safe: `x = 5`, `key=value`, `a=b`,
`=>`, `1,2,3`, `a,b,c`, `x, y, z`, `$1,000` all stay literal -- the opener is
followed by whitespace, or preceded/followed by an alphanumeric, so the
word-boundary rule rejects it. A DOUBLED bare delimiter (`==x==`, `,,x,,`)
is literal by the same-delimiter-adjacency rule (PART 9 §9), like `**x**` or
`//x//`. The delimiter set is fully uniform: all single-char. --- *)
(* For EVERY bare delimiter (`/`, `*`, `_`, `~`, `^`, `=`, `,` -- all single-char):
Opener: NOT followed by whitespace, AND preceded by start, whitespace, or
punctuation -- but NOT by an alphanumeric, by `_`, or by the same
delimiter character.
Closer: NOT preceded by whitespace, AND NOT followed by an alphanumeric.
Exact disambiguation of delimiter runs is pinned by the conformance corpus
(tests/corpus/01-emphasis*); see PART 9 §9. Forced `{X … X}` spans bypass
these conditions entirely (PART 9 §22). *)
emphasis = '/', emphasis_content, '/', [attributes] ; (* italic *)
strong = '*', strong_content, '*', [attributes] ; (* bold *)
bold_italic = "/*", bi_content, "*/", [attributes] ; (* combined *)
underline = '_', underline_content, '_', [attributes] ;
strikethrough = '~', strike_content, '~', [attributes] ;
superscript = '^', super_content, '^', [attributes] ;
subscript = ',', sub_content, ',', [attributes] ; (* single-char; was ",," *)
highlight = '=', highlight_content, '=', [attributes] ; (* single-char; was "==" *)
(* --- Forced intraword emphasis (PART 9 §22). A brace-pair wrapping a bare
emphasis delimiter forces a span with NO word-boundary condition, so it
emphasizes intraword. The closing brace bounds the span; the inner delimiter
characters are literal unless they open a nested forced span. Every mark has
a forced form: *)
forced_emphasis = "{/", forced_content, "/}", [attributes] ; (* italic *)
forced_strong = "{*", forced_content, "*}", [attributes] ; (* bold *)
forced_underline = "{_", forced_content, "_}", [attributes] ; (* underline *)
forced_super = "{^", forced_content, "^}", [attributes] ; (* sup *)
forced_sub = "{,", forced_content, ",}", [attributes] ; (* sub *)
(* `{~ … ~}` is forced strikethrough UNLESS it contains a top-level `~>`, in
which case it is editorial substitution (PART 9 §22 disambiguation). *)
forced_strike = "{~", forced_content, "~}", [attributes] ;
(* `{= … =}` is forced highlight (see Editorial Markup below for the
distinction from the raw-inline `{=format}` attribute). *)
forced_highlight = "{=", forced_content, "=}", [attributes] ;
forced_content = (inline_element | text_run | literal_special)+ ;
(* TRAILING ATTRIBUTES -- each emphasis-family span MAY carry a trailing
`[attributes]` block: the generic inline-attribute carrier that attaches
to the immediately-preceding inline node (the general rule, examples.md
"Trailing attribute block edge cases"). `*x*{.real}` ->
`<strong class="real">x</strong>`. It is the SAME `[attributes]` slot
every other inline carrier uses (code_span, link, image, inline_span,
extension_inline, autolink). Pinned by corpus
75-trailing-attribute-block-edge-cases; both reference impls agree. *)
(* Content MAY contain the delimiter character: a delimiter that does not
satisfy the close condition (PART 9 §1) is literal content, not the span
boundary. The span ends at the matched closer only. Subtracting the
delimiter here would make e.g. "/usr/local/" -> <em>usr/local</em>
ungrammatical, contradicting edge-cases.md §1. The boundary is a semantic
constraint (PART 9 §1, §9), not a context-free one. *)
emphasis_content = (inline_element | text_run | literal_special)+ ;
strong_content = (inline_element | text_run | literal_special)+ ;
bi_content = (inline_element | text_run | literal_special)+ ;
underline_content = (inline_element | text_run | literal_special)+ ;
strike_content = (inline_element | text_run | literal_special)+ ;
super_content = (inline_element | text_run | literal_special)+ ;
sub_content = (inline_element | text_run | literal_special)+ ;
highlight_content = (inline_element | text_run | literal_special)+ ;
(* Word boundary rule for emphasis openers/closers (normative: PART 9 §9).
Applies to EVERY bare delimiter. Forced `{X … X}` spans (PART 9 §22) are
exempt. PSEUDO-PRODUCTIONS -- prose conditions, not reachable grammar
symbols. *)
emphasis_open_condition = (* preceding char is whitespace, start, or punctuation but NOT alphanumeric or '_'; following char is non-whitespace *) ;
emphasis_close_condition = (* preceding char is non-whitespace; following char is NOT alphanumeric *) ;
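(* WORKED CHECK (non-normative; the two conditions above applied by hand,
   consistent with the `/usr/local/` result cited from 01-emphasis-6).
   Testing each `/` in `/usr/local/`:
     1st `/`: preceded by start, followed by `u` (non-whitespace)
              -> satisfies the open condition.
     2nd `/`: preceded by `r` (alphanumeric) -> cannot open;
              followed by `l` (alphanumeric) -> cannot close; literal.
     3rd `/`: preceded by `l` (non-whitespace), followed by end of input
              -> satisfies the close condition.
   Result: one span, <em>usr/local</em>, with the middle `/` as content. *)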
(* --- Footnotes (PART 9 §16) --- *)
(* Both implemented forms: the REFERENCE form `[^label]` resolving against a
`[^label]: body` definition, and the INLINE form `^[content]` (pandoc
form; a deliberate carve extension over djot). Numbering (one shared
document-order sequence), endnotes rendering, and the inline form's
balanced-bracket close are PART 9 §16. PART 8 ranks `^[` ABOVE
superscript. Only the sidenote form remains deferred. *)
footnote = reference_footnote | inline_footnote ;
reference_footnote = "[^", footnote_label, ']' ;
footnote_label = {character - ']'}+ ;
inline_footnote = "^[", inline_content, ']' ;
(* the closing `]` is the balanced, escape- and code-span-aware close
(PART 9 §16); footnote recognition is DISABLED inside the content *)
footnote_definition = "[^", footnote_label, "]:", space, inline_content, newline ;
(* the body extends to following lines indented by >= 2 spaces -- PART 9 §16 *)
(* DEFERRED (reserved, not implemented):
sidenote = "[>", inline_content, ']' ; *)
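(* ILLUSTRATION (non-normative; hand-derived from the productions above,
   with a made-up label and made-up text):
     Claim[^src] and an aside^[said in passing].

     [^src]: First line of the body,
       continued on a line indented by two spaces.
   `[^src]` is a reference_footnote, `^[said in passing]` is an
   inline_footnote, and the last two lines form one footnote_definition
   (the second is a body continuation per PART 9 §16). *)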
(* --- Extensions --- *)
extension_inline = ':', extension_name, '[', extension_content, ']', [attributes] ;
extension_name = identifier ;
extension_content = {character - ']'} ;
(* --- Editorial Markup (CriticMarkup-inspired) --- *)
editorial_markup = addition | deletion | substitution | editorial_comment ;
addition = "{+", inline_content, "+}" ;
deletion = "{-", inline_content, "-}" ;
substitution = "{~", inline_content, "~>", inline_content, "~}" ;
editorial_comment = "{#", inline_content, "#}" ;
(* DISAMBIGUATION (PART 9 §22). `{~ … ~}` is editorial SUBSTITUTION when it
contains a top-level `~>`, and forced STRIKETHROUGH otherwise. `{# … #}`
stays editorial comment (`#` is not an emphasis delimiter, no collision). *)
(* HIGHLIGHT is the single-char `=` delimiter, and `{=text=}` is its FORCED
intraword form (forced_highlight, PART 9 §22), rendering <mark>. It is
distinct from the raw-inline `{=format}` attribute on a code span (e.g.
`` `x`{=html} ``), which has no trailing `=` before the `}`. *)
(* --- Mentions and Tags --- *)
mention = '@', mention_name ;
mention_name = name_word, {'.', name_word} ;
tag = '#', tag_name ;
tag_name = name_word, {'.', name_word} ;
name_word = (letter | digit | '_' | '-')+ ;
(* Dots are INTERIOR-only by construction: a dot followed by another name
character continues the name (`@john.doe`, `#release-1.0`); a dot at the
end of the run is sentence punctuation, never part of the name
(`ping @markus.` -> mention `markus` + literal `.`). Pinned by corpus
89-mention-and-tag-name-boundaries / 31-mention-ignores-email-addresses. *)
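(* ILLUSTRATION (non-normative; derived by hand from `mention_name` and
   `tag_name` above, not itself a corpus pair; names are placeholders):
     `cc @john.doe.`  -> mention `john.doe`, then a literal `.`
     `see #v1..2`     -> tag `v1`, then literal `..2` (a dot continues
                         the name only when a name character follows it
                         directly) *)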
(* Boundary rules (PSEUDO-PRODUCTIONS -- prose conditions on `mention` /
`tag`, normative in PART 9 §7, not reachable grammar symbols):
must be preceded by whitespace or start of content *)
mention_boundary = (* @name must have word boundary before @ *) ;
tag_boundary = (* #name must have word boundary before # *) ;
(* --- Smart Typography --- *)
smart_typography = em_dash | en_dash | ellipsis | smart_quote
| arrow | comparison | symbol ;
(* NOTE: fractions are deliberately NOT converted (PART 9 §8 / dismissed-
syntax.md) -- `1/2` collides with dates (`1/2/2024`) and paths. *)
em_dash = "---" ; (* → — *)
en_dash = "--" ; (* → – *)
ellipsis = "..." ; (* → … *)
(* HYPHEN RUNS -- a run of 2+ hyphens between non-hyphen characters
decomposes djot-style: length divisible by 3 -> all em dashes;
else divisible by 2 -> all en dashes; else one em dash (3) and the
remainder by the same rule. So `--`=–, `---`=—, `----`=––,
`-----`=—–, `------`=——, `-------`=—–– (PART 9 §8). *)
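(* WORKED CONTINUATION (non-normative; the same three-step rule applied by
   hand to longer runs, not pinned by a corpus pair named here):
     8 hyphens  -> not divisible by 3, divisible by 2 -> four en dashes
     9 hyphens  -> divisible by 3                     -> three em dashes
     10 hyphens -> not divisible by 3, divisible by 2 -> five en dashes
     11 hyphens -> neither: one em dash (3), 8 remain -> one em dash,
                   then four en dashes *)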
smart_quote = '"' | "'" ;
(* Smart quotes are PER-CHARACTER contextual substitution, NOT a paired
span: each `"` / `'` is independently mapped to a left or right
typographic quote by its FLANKING characters -- preceded by whitespace
or start-of-content -> left (opening) quote; otherwise -> right
(closing) quote. An apostrophe (`don't`) is therefore a right single
quote; an unpaired quote still converts; the substitution never spans
or pairs across other inline constructs. `\"` / `\'` escape to the
straight character (PART 9 §8). *)
arrow = "->" | "<-" | "<->" | "=>" ;
comparison = "!=" | "<=" | ">=" ;
symbol = "(c)" | "(r)" | "(tm)" | "+-" ;
(* --- Breaks --- *)
hard_break = '\', newline ; (* backslash at end of line *)
soft_break = newline ; (* regular line ending within paragraph *)
(* ============================================================================
PART 4: ATTRIBUTES
============================================================================ *)
attributes = '{', opt_ws, attribute_list, opt_ws, '}' ;
attribute_list = attribute, {whitespace+, attribute} ;
(* interior padding is accepted: `{ .a }` and `{.a .b}` are valid inline
attribute blocks, same whitespace handling as block_attributes (minus
the line-continuation, which is block-level only) *)
(* RENDER ORDER -- NORMATIVE: attributes are emitted in the order they
appear in the source, with all classes merged into a single `class`
attribute placed at the position of the FIRST class. So `{.a #b k=c}`
-> `class="a" id="b" k="c"` and `{k=c .a #b}` -> `k="c" class="a"
id="b"`. Matches djot and carve-php. (Block-attribute merge keeps
each slot at its first-appearance position; values are last-wins,
PART 9 §15.) *)
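(* ILLUSTRATION (non-normative; hand-derived from the RENDER ORDER rule
   above, with made-up names): two classes separated by another attribute
   still merge at the position of the first class.
     `{.a k=c .b #x}`  ->  `class="a b" k="c" id="x"`
     `{#x .a .b}`      ->  `id="x" class="a b"` *)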
attribute = id_attribute | class_attribute | key_value_attribute
| boolean_attribute ;
id_attribute = '#', identifier ;
class_attribute = '.', identifier ;
key_value_attribute = identifier, '=', attribute_value ;
(* A bare identifier with no value is a BOOLEAN (value-less) attribute,
rendered `name=""` (a carve extension beyond canonical djot, matching
djot-php; matched after key_value_attribute so `k=v` is not read as a bare
`k`). It mixes freely with the other forms. See PART 9 §14. *)
boolean_attribute = identifier ;
attribute_value = unquoted_value | quoted_value ;
unquoted_value = (letter | digit | '-' | '_')+ ;
(* A backslash escapes ASCII punctuation inside a quoted value (same
`escaped_char` rule as inline text), so a value can contain a literal
quote: `"a\"b"` -> the value `a"b`. *)
quoted_value = '"', { escaped_char | (character - '"' - '\') }, '"'
| "'", { escaped_char | (character - "'" - '\') }, "'" ;
(* Block-level attributes appear on their own line(s) preceding a block.
A single attribute block may span multiple physical lines (the `}`
need not be on the opening line): a continuation is a single line
break optionally followed by indentation. A BLANK line (two line
breaks) ends the block -- it is not interior padding; an attribute
block interrupted by a blank line is not a block_attributes (both
reference impls treat it as literal text). Merge + reach semantics
(including float-across-blank-lines BETWEEN separate blocks): §15. *)
block_attributes = '{', opt_ws, attribute, {attr_separator, attribute}, opt_ws, '}', newline ;
opt_ws = {whitespace} ; (* spaces/tabs only, no line breaks *)
attr_separator = (whitespace | continuation), opt_ws ; (* one ws OR one line break *)
continuation = newline, opt_ws ; (* a single line break + indent; NOT a blank line *)
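(* ILLUSTRATION (non-normative; derived by hand from `block_attributes`,
   `attr_separator`, and `continuation` above; names are placeholders).
   A block attribute line split across three physical lines:
     {.note
       #intro
       key="v"}
   Each line break plus indent is one `continuation`, so this is a single
   block_attributes. An empty line between `.note` and `#intro` would put
   two line breaks in a row, which no `attr_separator` matches, so the
   text would stay literal. *)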
(* ============================================================================
PART 5: ABBREVIATIONS
============================================================================ *)
abbreviation_definition = "*[", abbreviation_term, "]:", space, abbreviation_expansion, newline ;
abbreviation_term = (letter | digit)+ ;
(* a term is a SINGLE alphanumeric word -- a bracketed term containing
punctuation or spaces (`*[e.g.]:`, `*[HTTP API]:`) is NOT a definition;
the line stays ordinary paragraph text *)
abbreviation_expansion = {character - newline}+ ;
(* ============================================================================
PART 6: INCLUDES
============================================================================ *)
(* PROCESSOR-LEVEL (PART 9 §19): includes are NOT part of the core parser
and the directive is deliberately NOT reachable from `block`/`inline` --
a conformant core may leave `{{ … }}` literal. *)
include_directive = "{{", space, include_path, [include_section], [include_options], space, "}}" ;
include_path = {character - ('#' | '@' | '}' | ' ')}+ ;
include_section = '#', identifier ;
include_options = {space, '@', identifier, ':', attribute_value}+ ;
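(* ILLUSTRATION (non-normative; the file name, section id, and option name
   are hypothetical, chosen only to exercise each production):
     {{ chapters/intro.txt#setup @lang:en }}
       include_path    = chapters/intro.txt
       include_section = #setup
       include_options = ` @lang:en`  (space, `@`, identifier, `:`, value)
   Per the production, the single space after `{{` and the single space
   before `}}` are both required. *)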
(* ============================================================================
PART 7: LEXICAL ELEMENTS
============================================================================ *)
identifier = (letter | '_'), {letter | digit | '_' | '-'} ;
letter = 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i' | 'j'
| 'k' | 'l' | 'm' | 'n' | 'o' | 'p' | 'q' | 'r' | 's' | 't'
| 'u' | 'v' | 'w' | 'x' | 'y' | 'z'
| 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'G' | 'H' | 'I' | 'J'
| 'K' | 'L' | 'M' | 'N' | 'O' | 'P' | 'Q' | 'R' | 'S' | 'T'
| 'U' | 'V' | 'W' | 'X' | 'Y' | 'Z' ;
digit = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' ;
whitespace = ' ' | '\t' ;
space = ' ' ;
newline = '\n' | '\r\n' | '\r' ;
backslash = '\' ;
character = (* any Unicode character *) ;
email_char = letter | digit | '.' | '-' | '_' | '+' ;
EOF = (* end of file *) ;
(* ============================================================================
PART 8: PARSING PRECEDENCE
============================================================================ *)
(*
Block parsing precedence (first pass):
1. Frontmatter (--- delimited at document start)
2. Comments (%% line, %%% block)
3. Raw blocks (```=FORMAT -- matched before ordinary fences),
then code blocks (``` or ~~~ fenced)
4. Headings (# prefix)
5. Thematic breaks (---, ***, ___)
6. Block quotes (> prefix)
7. Definitions (link/footnote/abbreviation definition lines) and
block-attribute lines ({...} alone on a line, PART 9 §15)
8. Lists (-, * bullets at any indent; ordered markers; :: definition
lists)
9. Tables (| prefix)
10. Colon fences (::: -- admonition with type word, line block with |,
generic div when bare; longest-token wins: a `::: x` line is never
a `:: ` definition term)
11. Paragraphs (everything else)
Recognition order is NOT interruption: which of these may interrupt an
open paragraph is PART 9 §10 (ordered markers, for one, never do).
Inline parsing precedence (second pass):
1. Escaped characters (\x)
2. Code spans (`...`)
3. Autolinks (<url>) and crossrefs (</#id>)
4. Inline footnotes (^[content]) -- above emphasis, so `^[` is a note,
not superscript; see PART 9 §16
5. Links, images, reference footnotes, inline spans
([text](url), ![alt](src), [^label], [text]{attrs} -- the
single-char lookahead after `]` selects among them, PART 9 §14;
a `[^` run is a footnote reference before bracket scanning)
6. Math ($`…`, $$`…`)
7. Emphasis markers (/, *, _, ~, ^, =, ,) and forced spans ({/.../} etc.)
-- EXCEPT a delimiter that begins a multi-char smart-typography
pattern: `=>` is the arrow, never a highlight opener (the pattern
is consumed first; PART 9 §8)
8. Editorial markup ({+...+}, etc.; {~...~} with `~>` = substitution)
9. Extensions (:type[content])
10. Mentions and tags (@user, #tag)
11. Smart typography (--, ---, etc.)
12. Trailing line comments (%%, PART 9 §21)
13. Plain text
Disambiguation rule:
- Prefer literal text over markup: a delimiter that has no valid match
(per the word-boundary conditions, PART 9 §1) is literal.
- An opener matches the NEAREST following valid closer of the same type.
Delimiters of the same type between them are literal content (same-type
spans do not nest, PART 9 §3); e.g. /usr/local/ -> <em>usr/local</em>.
- Different-type spans nest and are resolved with a delimiter stack in a
single left-to-right pass (NO backtracking, linear time); e.g.
*bold /italic/* and /italic *bold*/ both nest fully.
Note: "shorter spans / earlier opening wins" does NOT apply -- it would
truncate /usr/local/ to <em>usr</em> and break nested emphasis. The
delimiter-stack model is what makes Design Principle 1 (no backtracking)
hold while still producing the canonical outputs in the corpus.
*)
(* ============================================================================
PART 9: SEMANTIC CONSTRAINTS (NOT EXPRESSIBLE IN PURE EBNF)
============================================================================ *)
(*
1. WORD BOUNDARIES FOR EMPHASIS (summary -- §9 below is NORMATIVE).
The word-boundary conditions apply to EVERY bare delimiter
(`/ * _ ~ ^ = ,` -- all single-char). No bare delimiter emphasizes
intraword. The forced `{X … X}` family (§22) is the escape hatch for
deliberate intraword emphasis.
- SAME-DELIMITER ADJACENCY applies to ALL SEVEN single-char delimiters
`/ * _ ~ ^ = ,`: a delimiter adjacent to another of the same delimiter
(before or after) never opens, so a doubled delimiter is literal
(`**x**`, `~~x~~`, `^^x^^`, `==x==`, `,,x,,` literal, like `//x//`,
`__x__`).
- Opener (all bare): NOT followed by whitespace AND preceded by
whitespace, start, or punctuation but NOT by an alphanumeric or `_`
(left word boundary)
- Closer (all bare): NOT preceded by whitespace AND NOT followed by an
alphanumeric (right word boundary).
- Example: "/ not italic /" is literal (whitespace after opener);
"a/b/c", "foo*bar*baz", "snake_case", "x = 5", "key=value" are literal
(opener has no left boundary, or is followed by whitespace);
"/usr/local/" -> <em>usr/local</em> (01-emphasis-6); "x{*y*}z" ->
x<strong>y</strong>z (forced, §22).
- This is STRICTER than Djot, whose rule is whitespace-only.
2. MATCHING FENCES
- Code fence closer must have same character and >= same length as opener
- Comment block closer must have same length as opener
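Illustration (non-normative, hand-derived from the two rules above):
an opener of three backticks is closed by a line of three OR MORE
backticks, but not by `~~~` (different character) and not by two
backticks (shorter). Assuming a comment block may open with more than
three `%`, a `%%%%` opener is closed only by `%%%%`, not by `%%%` or
`%%%%%`.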
3. NO SAME-TYPE NESTING
- /nested /italic/ here/ does NOT nest: a same-type delimiter between an
opener and its nearest valid closer is literal content (§9)
- /nested *bold* here/ is valid
- A consequence (§9 SAME-DELIMITER ADJACENCY): a doubled bare delimiter
never opens nested same-type emphasis -- it is literal text. So `**x**`
-> `**x**`, `~~x~~` -> `~~x~~`, `^^x^^` -> `^^x^^`, uniform with the
already-literal `//x//` and `__x__` (corpus 71-doubled-emphasis-
delimiters).
4. CAPTION PLACEMENT
- ^ caption must immediately follow image, blockquote, table, a fenced
code block (a captioned code block is a numbered LISTING), or a
standalone display-math block (a captioned equation is a numbered
EQUATION; its block must be solely the `$$`…`` span)
- One blank line allowed between block and caption
5. TABLE CELL CONTENT, ROW VALIDITY, CONTINUATION ROWS
- Pipes must be escaped (\|) or inside code spans. Escape processing
is cell-splitting-aware: an unescaped `|` outside a code span
separates cells; `\|` and a `|` inside a code span are content.
- Span markers (^ <) must be sole cell content. A leading `=` on a
cell (glued to the `|`) marks a HEADER cell; escape it (`\=`) for a
data cell whose text starts with `=`. Because a span marker must be the
WHOLE cell, any other content disqualifies it: a cell like `{.x} <` is
NOT a colspan marker -- it is an ordinary cell whose literal text is
`{.x} <`. (So a leading attribute block does not attach to a span
marker; there is no such thing as an attributed span marker.)
- SPAN MARKER WITH NOTHING TO MERGE (NORMATIVE): a rowspan `^` in the
FIRST row, or a colspan `<` in the FIRST column, has no cell above /
to its left to extend. Such a marker renders as an EMPTY cell (a
`<td></td>` / `<th></th>` carrying no content and no span), rather than
being dropped. (corpus 96-table-span-marker-in-first-column.)
- VALID TABLE ROW (the §10 interruption test): a line whose first
non-indent character is `|` AND whose last non-whitespace character
is `|`, with at least one cell between them. A line-initial `|`
without the trailing `|` does NOT interrupt a paragraph (it stays
prose); at BLOCK START (after a blank line) the trailing `|` is
lenient -- a `| a | b` line still opens a table.
- CONTINUATION ROWS (`+` first column instead of `|`): a continuation
row appends to the row ABOVE it -- each of its non-empty cells is
joined onto the corresponding cell of that row (separated by a
single space, as a soft wrap); empty cells append nothing. It adds
no `<tr>`. Works under rowspan: the joined content belongs to the
spanning cell (corpus 40, 41). A table cannot BEGIN with a
continuation row (PART 2 `table`).
6. REFERENCE RESOLUTION
- Reference links [text][ref] resolve against `[ref]: url` definitions
- LABEL MATCHING is EXACT: case-sensitive, no whitespace folding
(corpus 73-reference-labels-are-case-sensitive). When the same
label is defined twice, the LAST definition wins for links --
note the deliberate asymmetry with footnote definitions, where
the FIRST wins (§16).
- Collapsed references [text][] use text as reference label
- Abbreviations matched at word boundaries only
- All definitions (link/footnote/abbreviation references, heading IDs)
are collected in the first pass (PART 8, "first pass") BEFORE inline
resolution runs. A document-order definition appearing AFTER its use
is still resolved -- this is the defined two-pass model, not
backtracking, and is O(n). Design Principle 2 ("no dependency on
later references") scopes to inline *tokenization/highlighting*,
which never needs the definition table; semantic *expansion*
(abbr <abbr>, </#id> auto-text, [ref] targets) uses it.
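Illustration (non-normative, hand-derived from the bullets above; labels
and URLs are placeholders). Given the two definitions
[r]: /one
[r]: /two
the references behave as follows:
[t][r] -> targets /two (the LAST link definition wins)
[t][R] -> does not resolve against `[r]` (matching is case-sensitive)
[r][] -> collapsed form, uses the text `r` as the label, so /two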
7. MENTION/TAG BOUNDARIES
- @ and # only start mentions/tags at word boundaries
- email@domain.com is NOT a mention
- The name runs over letters, digits, `_`, `-`, and INTERIOR dots
(a dot followed by another name character: `@john.doe`,
`#release-1.0`); a trailing dot is sentence punctuation, not part
of the name
8. SMART TYPOGRAPHY CONTEXT
- Only applies outside code spans/blocks
- Patterns must match exactly (not partial)
- FRACTIONS ARE NOT CONVERTED. `1/2`, `3/4`, etc. stay literal --
they collide with dates (`1/2/2024`) and paths, and djot has no
fractions (recorded in dismissed-syntax.md). Converted set:
dashes, ellipsis, smart quotes, arrows, comparisons, `+-`, and
symbols (`(c)`,`(r)`,`(tm)`).
9. EMPHASIS SPAN BOUNDARY -- NORMATIVE (governs *_content productions in
PART 3; §1 above is a summary of this section).
- SAME-DELIMITER ADJACENCY (ALL SEVEN single-char delimiters `/ * _ ~ ^ = ,`).
A bare delimiter immediately adjacent -- before OR after -- to another
of the SAME delimiter never opens a span. A doubled delimiter is
therefore literal text, never nested same-type emphasis (§3): `**x**`,
`~~x~~`, `^^x^^`, `==x==`, `,,x,,` stay literal exactly like `//x//` and
`__x__` (corpus 71-doubled-emphasis-delimiters, 01-emphasis-11). A
longer run (`,,,y,,,`, `===y===`) is likewise all-literal (corpus
74-two-char-delimiter-runs).
- WORD BOUNDARY (ALL bare delimiters `/ * _ ~ ^ = ,`; no
bare intraword emphasis). A delimiter opens only if NOT followed by
whitespace AND preceded by whitespace, start, or punctuation -- but NOT
by an alphanumeric or `_` (left word boundary). So (/x/) -> (<em>x</em>)
and a./b/ -> a.<em>b</em> (corpus 01-emphasis-8), while foo_bar_baz,
foo*bar*baz, snake_case, x = 5, key=value stay literal (01-emphasis-10,
01-emphasis-11). For deliberate intraword emphasis use the forced
`{X … X}` family (§22): x{*y*}z -> x<strong>y</strong>z.
- A delimiter closes only if NOT preceded by whitespace AND NOT followed
by an alphanumeric (right word boundary), for every bare delimiter.
"/ not italic /" stays literal (ws after opener); "x /a/b y" stays
literal because the candidate closer is followed by `b` (corpus
01-emphasis-9).
- The span runs from the opener to its NEAREST valid closer; any
same-type delimiter in between is LITERAL content (no same-type
nesting, §3). Hence /usr/local/ -> <em>usr/local</em> (01-emphasis-6)
and the/path/here is literal (first / has no left boundary). The
trailing delimiter of an unbalanced run stays literal, so /b// ->
<em>b</em>/ and /x// -> <em>x</em>/. See edge-cases.md §1.
- Resolution uses a delimiter stack, single left-to-right pass, no
backtracking (Design Principle 1). Forced `{X … X}` spans (§22) are
pushed on the same stack; only their word-boundary test is skipped.
10. PARAGRAPH INTERRUPTION -- NORMATIVE (governs `paragraph` in PART 2).
A VISIBLE block INTERRUPTS an open paragraph with NO blank line before
it, at the document top level AND inside nested content (list item,
block quote, admonition/div body). A continuation line that begins a
block is parsed as that block; the paragraph ends on the preceding line.
This is the Markdown-like rule (CommonMark "a paragraph can be
interrupted") and the key block-level divergence from Djot, where an open
paragraph runs until a blank line so a marker line stays paragraph text.
The interrupting blocks:
- heading:           `#`..`######` + space;
- thematic break:    a lone `---` / `***` / `___` line;
- block quote:       `>` (NOTE: a line-initial `>` is ALWAYS a quote
                     marker, so a prose line starting `>= 5` becomes a
                     quote -- escape it, `\>= 5`, to keep it prose);
- unordered / task:  `- ` / `* ` (bullet + space; `- [x] ` task; `+` is
                     NOT a bullet -- it is the continuation marker, §17).
                     A bullet interrupts at ANY INDENTATION (Rule B,
                     §24): an indented ` - item` line interrupts
                     exactly like a column-0 one;
- table row:         a `|`...`|` line that is a valid table row
                     (a stray `|` in prose is NOT -- see VALID TABLE
                     ROW in §5);
- fenced code:       a ``` / ~~~ opener that HAS a matching closer
                     ahead (see CLOSER LOOKAHEAD);
- admonition / div:  a `:::` opener that HAS a matching closer ahead.
So "Liste: \n - eins \n - zwei" is <p>Liste:</p> + a <ul>; "intro
\n # H" is <p>intro</p> + <h1>; "x = 5 \n * 3 + 17" is <p>x = 5</p> +
<ul><li>3 + 17</li></ul> (corpus 05-lists-12 and the
76-paragraph-interruption family).
That last case is the ACCEPTED COST of paragraph interruption: a
hard-wrapped prose line that happens to begin with a bullet-plus-space
(or `>` / `#` / a valid table row) starts a block. Insert a blank line, or
backslash-escape the marker ("\* 3 + 17"), to keep such a line as prose.
ORDERED LISTS DO NOT INTERRUPT. An ordered-list marker -- decimal,
letter, or roman, in any value -- never interrupts a paragraph; it needs a
blank line before it (matching Djot). This is deliberate: unlike a bullet
(`- `/`* ` is unambiguous), an ordered marker is too common in prose
("see step 2.", "version 1985.", "upgrade to 1. today"), and the only way
to allow it would be the CommonMark `1.`-only heuristic -- the exact wart
Djot removed. Carve drops the heuristic by not interrupting on ordered
markers at all (corpus 54-ordered-marker-vs-prose, Paragraph interruption).
IMAGE EXCLUDED. A line that is only a bare image ("![alt](src)") does NOT
interrupt: an image is not a block of its own in Carve, so it stays an
inline image inside the paragraph.
CLOSER LOOKAHEAD (fenced code, admonition, div). A fence / `:::` opener
interrupts ONLY when a matching closer exists ahead in the same context.
An UNTERMINATED opener does NOT interrupt -- the line stays paragraph
text -- so a stray ``` or `:::` in prose never swallows the rest of the
block. An unterminated ``` is then an unclosed inline verbatim run that
renders as a `<code>` span to the end of the block (the `code_span`
maximal-run rule). Pinned by corpus 76-paragraph-interruption (the
unterminated-fence and unterminated-`:::` pairs).
INVISIBLE CONSTRUCTS. A reference definition (link `[r]: url`, footnote
`[^r]: …`, abbreviation `*[A]: …`), a comment (`%%` line, `%%%` block),
or a block-attribute line (`{…}` alone on a line, §15) also interrupts
a paragraph with no blank line: it is parsed and consumed/collected,
rather than left as literal text. So "See[^m].\n[^m]: note" resolves
the footnote, "para\n%% x" drops the comment, and
"Para\n{.x}" ends the paragraph with the attribute line floating
forward (§15; corpus 84-block-attribute-lines-7).
(Unchanged from the prior rule; a small deviation from djot, which keeps
such a line as paragraph text.)
NESTING. Interruption applies inside nested content too: a `# H` after a
prose line in a block quote ("> text \n > # H") is a heading, and an
indented sublist still nests with no blank line ("- a \n   - b" -> a list
whose first item holds a sublist). A col-0 marker with no `>` inside a
block quote terminates the quote (the marker is a sibling block), per the
block-quote lazy-continuation rule (see `blockquote`). HEADING
CONTINUATION follows the same principle (see `heading`): a block-opener
interrupts an open heading and starts that block, exactly as it interrupts
a paragraph; a bare-prose or same/lower `#` continuation line folds into
the heading text instead.
11. LIST MARKER CHANGE STARTS A NEW LIST -- NORMATIVE (governs `list`
in PART 2). §10 decides FIRST whether a sequence of marker lines
establishes a list at all (paragraph-interruption gate); §11 then
decides whether established list items belong to one list or to
adjacent sibling lists. The rules layer; §11 never overrides §10.
Given two adjacent list items at the same indent that both passed
§10, they belong to the SAME list only when their marker matches:
- same character within `-`, `*` for unordered (`+` is NOT a
bullet in Carve -- it is the continuation marker, §17);
- same `ordered_marker` alternative for ordered;
- same plain-vs-task classification.
A line whose marker differs from the immediately-preceding item
on any of those axes starts a new sibling list at the same indent.
Matches djot (minus djot's `+` bullet). Example:
    - a        produces:  <ul><li>a</li><li>b</li></ul>
    - b                   <ul><li>c</li><li>d</li></ul>
    * c
    * d
Same axis for plain-vs-task: `- a` followed by `- [x] b` is two
lists (matches djot). The marker-match rule itself looks only one
item back; the ordered DIALECT below additionally carries the
first item's classification (and its one-item-forward tie-break)
for the rest of the list.
ORDERED DIALECTS. An ordered list's dialect (decimal / lower- or
upper-alpha / lower- or upper-roman) and delimiter (`.` or `)`) are
fixed by its FIRST item; a later item whose marker is outside that
dialect, or uses the other delimiter, starts a new sibling list. The
first item also sets the `<ol>` `type` (a/A/i/I; decimal omits it)
and `start` (the marker's value; omitted when 1).
AMBIGUOUS-LETTER TIE-BREAK: a single roman-letter first marker
(i/v/x/l/c/d/m) is ROMAN when the next sibling is the consecutive
roman numeral (`iv.`+`v.`, `i.`+`ii.`) and ALPHA when the next is the
consecutive letter (`c.`+`d.`, `v.`+`w.`); a lone `i`/`I` defaults to
roman, any other lone letter to alpha.
12. FENCED ":::" BLOCK RENDERING -- NORMATIVE (governs `admonition`
and `div` in PART 2; cross-references syntax.md §4.20 "Block
Extensions"). A TYPED `::: word` block renders by a TWO-TIER rule
on the type identifier (below). A BARE `:::` opener with NO type
word is a GENERIC DIV: a plain `<div>` (no class added; an empty div
is `<div></div>`), i.e. djot's generic container.
INLINE OPENER ATTRIBUTES -- STRICT (djot). The opener line carries NO
inline attributes: it is the colon fence, an optional type word, and
an optional quoted title, and NOTHING else. Any trailing `{...}` (or
other non-title text after the type) makes the line an ORDINARY
PARAGRAPH, not a fence -- so `::: note {.x}`, `::: {.x}`, and
`:::{k=v}` are all paragraphs. To attribute a div or admonition, use
a PRECEDING block-attribute line, which floats onto the block (§15):
`{.x #id}` then `::: note` yields `<aside class="admonition note x"
id="id">`. This matches canonical djot, which rejects any non-class
text on the fence line.
FENCE LENGTH & NESTING: a fence is a run of 3+ colons; the same rule
governs both admonitions and generic divs. A block is closed only by
a bare fence of EQUAL-OR-GREATER colon length, so a longer opener
nests shorter blocks (a `:::` inside a `::::` block is content, not a
closer). Equal-length fences do not nest. A bare opener with no
matching closer ahead is literal text.
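ILLUSTRATION (non-normative). A minimal Python sketch of the closer
search described above, assuming the block's lines are already isolated
in one context. The names `fence_length` and `find_closer` are
hypothetical and not taken from any reference implementation; how a
typed opener longer than the enclosing fence is scoped is left out.

```python
import re
from typing import List, Optional

# A bare fence is a line holding nothing but 3+ colons.
BARE_FENCE = re.compile(r"^:{3,}$")

def fence_length(line: str) -> int:
    """Length of the leading colon run on an opener line."""
    stripped = line.strip()
    return len(stripped) - len(stripped.lstrip(":"))

def find_closer(lines: List[str], opener_index: int) -> Optional[int]:
    """Index of the first bare fence at least as long as the opener."""
    need = fence_length(lines[opener_index])
    for i in range(opener_index + 1, len(lines)):
        candidate = lines[i].strip()
        if BARE_FENCE.match(candidate) and len(candidate) >= need:
            return i
    return None  # no closer ahead: the opener stays literal text
```

Under these assumptions a `:::` inside a `::::` block is skipped as
content, and a result of None corresponds to the literal-text case.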
The two tiers for a typed block:
Tier 1 -- CANONICAL ADMONITION TYPES render as a semantic
`<aside>` with the `admonition` marker class:
<aside class="admonition {type}">
[<p class="admonition-title">{quoted_title}</p>]
{body}
</aside>
The canonical set is the eight call-out types
`note`, `tip`, `warning`, `danger`, `info`, `success`,
`example`, `quote`. Consuming CSS frameworks target these exact
class names.
Tier 2 -- ANY OTHER (custom) type renders as a generic block-level
`<div>` carrying the verbatim type as its class:
<div class="{type}">
[<p class="admonition-title">{quoted_title}</p>]
{body}
</div>
This is the carve fenced-div primitive that the block-extension
mechanism (syntax.md §4.20) builds on -- e.g. `::: tabs`,
`::: mermaid`, `::: codepen` produce `<div class="tabs">`,
`<div class="mermaid">`, `<div class="codepen">` respectively,
which a registered extension may post-process. An unregistered
custom type still renders as its generic `<div class="{type}">`
so the document stays readable. `details` is an ordinary Tier-2
type (`<div class="details">`); an extension that wants the
HTML5 `<details>/<summary>` disclosure element opts in via §4.20.
Pinned rules common to both tiers:
a. The class is the literal type identifier (Tier 1 prefixes it
with `admonition ` separated by a single space:
`admonition note`, NOT `admonition-note`; Tier 2 uses the
bare `{type}`). The type is NEVER folded together with the
quoted title -- the title is a child element, never a class.
b. A `<p class="admonition-title">…</p>` line is emitted ONLY
when the author supplied a `quoted_title` after the type.
Carve does NOT invent a default title from the type name.
An explicitly empty `""` still counts as a supplied (empty)
title. Applies to both tiers.
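ILLUSTRATION (non-normative). The two-tier rule and pinned rules a/b
can be sketched in Python as follows. `render_fenced_block` and its
parameters are hypothetical names; the title and body are assumed to be
already-rendered HTML, and the whitespace layout of PART 10 is ignored.

```python
from typing import Optional

# The eight canonical call-out types listed above (Tier 1).
CANONICAL_TYPES = {"note", "tip", "warning", "danger",
                   "info", "success", "example", "quote"}

def render_fenced_block(type_word: Optional[str],
                        title: Optional[str],
                        body_html: str) -> str:
    """Render one ::: block; None means 'not supplied by the author'."""
    if type_word is None:                 # bare ::: -> generic div
        tag, css_class = "div", None
    elif type_word in CANONICAL_TYPES:    # Tier 1 -> semantic aside
        tag, css_class = "aside", f"admonition {type_word}"
    else:                                 # Tier 2 -> div, verbatim type
        tag, css_class = "div", type_word
    if css_class is None:
        open_tag = f"<{tag}>"
    else:
        open_tag = f'<{tag} class="{css_class}">'
    title_html = ""
    if title is not None:                 # "" still counts as supplied
        title_html = f'<p class="admonition-title">{title}</p>'
    return f"{open_tag}{title_html}{body_html}</{tag}>"
```

Passing title=None emits no title paragraph, while title="" emits an
empty one, which is the distinction rule b draws.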
13. HEADING SECTION WRAPPING -- NORMATIVE (governs `atx_heading` in
PART 2 and the renderer; matches djot, see `jgm/djot` official
`headings.test`). Every heading emits a `<section id="{id}">`
wrapper around itself and the following content up to the next
same-or-shallower heading. The slug-derivation rule (ASCII
transliteration etc., see syntax.md §4.1) computes
the id; the id lives on the `<section>` element, NOT on the
`<h*>` element. This shifts the existing carve behavior
(id-on-h*, no section wrapper) to match djot's structural model.
Algorithm (one stateful pass over top-level blocks):
    let openSections : Stack of (level, sectionElement) = []
    for each top-level child node:
        if node is Heading at level N:
            while openSections is non-empty AND top.level >= N:
                emit </section>; pop openSections
            emit <section id="{slug}">; push (N, ...)
            emit <h{N}>{inline-rendered children}</h{N}>
        else:
            emit node normally
    while openSections is non-empty:
        emit </section>; pop openSections
Properties:
- Adjacent same-level headings: produce sibling sections at the
same level (`# A / # B` -> two <section>s).
- Skipped levels: a `# H1 / ### H3` sequence nests H3's section
inside H1's, because the stack-close test (top.level >= N, here
N = 3) closes only sections at level >= 3, and none is open at
level 3 or deeper. The structure mirrors djot's behavior; carve
does NOT synthesize intermediate <h2>/<section> nodes.
- Explicit `{#id}` on a heading: the id still lives on the
`<section>`, not on the `<h*>`. The dedup namespace (syntax.md
§4.1 / heading-id-tracker rules) is unchanged -- explicit ids are
reserved before auto-ids in document order.
- Empty document or doc with no headings: zero <section> elements
are emitted.
- The two-pass id resolution (collect explicit ids first, then
auto-slug headings with collision-dedup) is unaffected by this
rule -- only the EMISSION site of the id moves from <h*> to
<section>. Existing crossref (`</#id>`) and implicit-heading
ref (`[Heading][]`) resolution continue to target the same
slug; the fragment URL `#{slug}` resolves to <section id> the
same way browsers resolve to <h* id>.
- Carve does NOT add `<section>` wrappers around non-heading
top-level blocks. Headings are the only trigger.
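ILLUSTRATION (non-normative). The §13 stack pass, sketched in Python.
The block model is an assumption made for the example: a heading is a
hypothetical ("h", level, slug, html) tuple with its slug already
resolved, and every other block is a pre-rendered HTML string. Output
indentation (PART 10) is not modelled.

```python
from typing import List, Tuple, Union

Heading = Tuple[str, int, str, str]   # ("h", level, slug, inline html)
Block = Union[Heading, str]           # str = already-rendered block

def wrap_sections(blocks: List[Block]) -> List[str]:
    out: List[str] = []
    open_levels: List[int] = []       # stack of open section levels
    for node in blocks:
        if isinstance(node, tuple):
            _, level, slug, text = node
            # Close every open section at the same or a deeper level.
            while open_levels and open_levels[-1] >= level:
                out.append("</section>")
                open_levels.pop()
            out.append(f'<section id="{slug}">')
            open_levels.append(level)
            out.append(f"<h{level}>{text}</h{level}>")
        else:
            out.append(node)
    # Close whatever is still open at end of document.
    while open_levels:
        out.append("</section>")
        open_levels.pop()
    return out
```

With a level-1 heading followed by a level-3 heading, nothing is popped
before the second section opens, so it nests inside the first.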
14. INLINE SPAN VS LINK DISAMBIGUATION -- NORMATIVE (governs `link`
and `inline_span` in PART 3). After scanning a bracketed run
`[ inline_content ]`, the IMMEDIATELY FOLLOWING character selects
the construct (single-character lookahead, no backtracking):
- `(` -> inline_link ([text](url))
- `[` -> reference_link or collapsed_reference_link
([text][ref] / [text][])
- `{` -> inline_span ([text]{attrs})
- anything else (or end of input) -> the `[`...`]` is LITERAL
text. Carve has no shortcut reference link: a bare `[label]`
never resolves against a `[label]: url` definition.
An inline_span renders as
<span {attrs}>{inline-rendered content}</span>
where {attrs} is the attribute block applied verbatim (id, classes,
key=value), in the same form as any other attribute carrier
(`#id` -> id, `.x` -> class, `k=v` -> key). The bracketed content
is full inline content and is parsed recursively, so
`[a /b/ c]{.x}` -> `<span class="x">a <em>b</em> c</span>`.
Because the `{` test fires only when the attribute block directly
abuts `]`, `[text] {.x}` (with a space) is literal `[text]`
followed by a literal `{.x}` -- the span requires adjacency.
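ILLUSTRATION (non-normative). The single-character lookahead can be
sketched in Python as below. `classify_bracket_run` is a hypothetical
helper that assumes the position of the closing `]` is already known;
an "inline_span" result is only a candidate, still subject to the
attribute-block classification that follows.

```python
def classify_bracket_run(source: str, close_index: int) -> str:
    """Classify the construct whose closing bracket is at close_index."""
    # Slicing yields "" at end of input instead of raising IndexError.
    following = source[close_index + 1:close_index + 2]
    if following == "(":
        return "inline_link"
    if following == "[":
        return "reference_link"   # full or collapsed form
    if following == "{":
        return "inline_span"      # candidate; attrs must still be valid
    return "literal"
```

A space between `]` and `{` lands in the final branch, which is the
adjacency requirement stated above.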
EMPTY OR INVALID ATTRIBUTE BLOCK -- NORMATIVE. The directly-abutting
`{...}` is classified by its content:
- yields at least one attribute (`[text]{.c}`) -> span with those
attributes.
- empty or whitespace-only (`[text]{}`, `[text]{ }`) -> a valid EMPTY
block: a bare `<span>text</span>`. Both reference impls agree, even
though the bare `attribute_list` production nominally needs >= 1
attribute; the empty block is a blessed exception so a
default-attribute processor can target the span.
- unrecognized non-empty content (`[text]{???}`, `[text]{=y=}`) -> NOT
an attribute block: the `]` and the `{...}` render literally. The
inner bracket content is still inline-parsed, so `[*x*]{???}` ->
`[<strong>x</strong>]{???}`.
- a name (id, class, or key) that is not a grammar `identifier` (line
905) is also unrecognized: a DIGIT-FIRST name (`{.123}`, `{#1}`,
`{2=v}`) or a name with a non-identifier character (`{.a!b}`).
ONE invalid name makes the WHOLE block not an attribute block, even
when mixed with valid attributes (`{.ok .1}`), so the run stays
literal. (Deliberate strictness beyond djot, which accepts
digit-first identifiers / `class="123"`; see jgm/djot issue 399.)
A digit/`-`/`_` AFTER the first character is valid (`{.a1}`).
BOOLEAN (value-less) ATTRIBUTES -- NORMATIVE. A bare identifier with no
value (`[text]{kbd}`, `{.note open}`, `{disabled}` on a block line) is a
boolean attribute, rendered `name=""`. It is matched after the
`key_value_attribute` form (so `k=v` stays a key/value) and mixes freely
with id/class/key=value in source order; multiple are allowed; a
digit-first bare word is not a valid name (the block stays literal, as for
any digit-first identifier). All impls support it (a carve extension
beyond canonical djot, matching djot-php; corpus 95-boolean-attributes).
The boundary of "yields an attribute" still differs at the margins:
carve-php additionally accepts colon-bearing keys/classes
(`{xml:lang="en"}`, `{.sm:hover}`) and comment-only blocks (`{% c %}`) as
spans, where carve-js treats those as literal. Use a plain
`.class` / `#id` / `key=value` / bare word for a portable span.
15. BLOCK ATTRIBUTE LINES -- NORMATIVE (governs `block_attributes`
in PART 2; matches djot). A `{...}` attribute block on its own
line is a leading block attribute: it does NOT render -- it
attaches to the next block element. Semantics:
- REACH: leading attribute blocks float forward to the next block
element, across intervening blank lines. `{#id}\n\nText` yields
`<p id="id">Text</p>`. A run with no following block element
(e.g. at end of document) is dropped, producing no output.
- TRAILING A PARAGRAPH -- NORMATIVE. A `{...}` line directly after
paragraph content (no blank line) is still a leading block-attribute
line: it INTERRUPTS the paragraph (§10) and floats FORWARD like any
other. It does NOT attach backward to the paragraph it follows, and
does NOT fold into the paragraph as literal text. `Para\n{.class}`
yields `<p>Para</p>` (dropped -- no following block); and
`Para\n{.class}\n\nNext` yields `<p>Para</p>` then
`<p class="class">Next</p>`. (A trailing SAME-LINE `{...}` with no
abutting host is not a block-attribute line at all -- it stays literal
inline content, §14: `Para {.x}` -> `<p>Para {.x}</p>`.)
- ACCUMULATION across consecutive blocks: when several attribute
blocks precede the same target (adjacent or blank-separated),
they merge in source order with these rules --
* id: last one wins.
* key=value: last value for a given key wins.
* class: ALL classes accumulate in source order, with NO
de-duplication (`{.a .b}` then `{.b .c}` -> `class="a b b c"`,
matching djot and carve-php).
Worked example (the djot canonical case):
{#id}
{key=val}
{.foo .bar}
{key=val2}
{.baz}
{#id2}
Okay
-> <p id="id2" key="val2" class="foo bar baz">Okay</p>
- MULTI-LINE BLOCK: a single attribute block may wrap across lines
-- the `}` need not sit on the opening line. `{#id\n .foo}` is
one block contributing id `id` and class `foo`. A BLANK line
inside the braces ends the block (it is then literal text, not a
block_attributes). A quoted value containing a literal `}` that
spans lines is not supported by the reference impls
(pathological).
- NOT AN ATTRIBUTE LIST -> NOT AN ATTRIBUTE LINE: a `{…}` line whose
braces do not parse as an attribute list is ordinary paragraph
content and follows the normal inline rules (e.g. a `{#a #}` line
is an editorial comment, §22 family, not a block-attribute line).
- NO block construct takes a trailing `{…}` on its own line
(djot-strict; headings, fences and `:::` openers included) --
the preceding block-attribute line is the ONLY block-level
attribute channel. A trailing `{…}` on a heading line is literal
inline content (PART 2, headings).
Implementation status: carve-php, carve-js, and carve-rs implement §15
in full -- leading block-attribute lines, float-forward across blank
lines (including a line that trails a paragraph), accumulation, and
multi-line blocks. Pinned by corpus 84-block-attribute-lines.
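ILLUSTRATION (non-normative). The accumulation rules, sketched in
Python under the assumption that each attribute block is already parsed
into hypothetical (kind, name, value) triples with kind one of "id",
"class", "kv". Value escaping (PART 10 §2) is left out.

```python
from typing import Dict, List, Tuple

Attr = Tuple[str, str, str]   # (kind, name, value)

def merge_attribute_blocks(blocks: List[List[Attr]]) -> Dict[str, str]:
    merged: Dict[str, str] = {}    # dicts keep first-insertion order
    classes: List[str] = []
    for block in blocks:
        for kind, name, value in block:
            if kind == "id":
                merged["id"] = value             # last id wins
            elif kind == "class":
                merged.setdefault("class", "")   # hold the slot position
                classes.append(name)             # accumulate, no de-dup
            else:
                merged[name] = value             # last value per key wins
    if classes:
        merged["class"] = " ".join(classes)
    return merged

def render_attrs(attrs: Dict[str, str]) -> str:
    """Serialize in first-appearance order (no escaping in this sketch)."""
    return "".join(f' {key}="{value}"' for key, value in attrs.items())
```

Fed the six blocks of the worked example, the merged mapping comes out
in the order id, key, class, matching the `<p>` shown there.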
16. FOOTNOTES -- NORMATIVE (governs `footnote` / `footnote_definition`
in PART 3). The reference form `[^label]` + `[^label]: body` and the
INLINE form `^[content]` are implemented; the sidenote (`[>content]`)
remains deferred.
- A `[^label]` reference with a matching definition is NUMBERED by
document reference order (first referenced = 1); the same label
referenced again reuses its number.
- Definitions may appear anywhere (order-independent); the FIRST
definition for a label wins. A body is the def line plus any
following lines indented by >= 2 spaces (single blank lines
allowed between chunks), parsed as blocks.
- A reference with NO matching definition renders as literal source
text `[^label]` (a trailing `{attrs}` on an unresolved ref is not
round-tripped -- a marginal mistyped-label edge). An UNREFERENCED
definition is dropped.
- Rendering (djot-compatible roles):
reference -> <a id="{refId}" href="#fn{n}" role="doc-noteref"><sup>{n}</sup></a>
endnotes -> ONE <section role="doc-endnotes"> appended AFTER
all body content (outside heading <section>s),
containing <hr> then an <ol> of <li id="fn{n}">;
each note's last paragraph gets a trailing backlink
<a href="#{refId}" role="doc-backlink">↩</a>.
First reference to note n has refId `fnref{n}`; a k-th (k>1)
repeat uses `fnref{n}-{k}` and adds another backlink. The
backlink glyph is the plain return arrow `↩` (Carve's choice;
djot appends a variation selector).
- INLINE FOOTNOTE `^[content]` (pandoc form; a deliberate carve
extension -- canonical djot has no inline footnotes). A `^`
IMMEDIATELY before `[` opens an anonymous note whose content is the
balanced bracket span (escape- and code-span-aware close, same as link
text). Properties:
* Content is INLINE-only, parsed recursively with footnote
recognition DISABLED inside it (no `^[…]` or `[^ref]` nested in a
note, either direction).
* Numbered in the SAME document-order sequence as reference notes
(an inline note always takes a fresh anonymous number; it cannot
be re-referenced). Renders the same noteref + an `<li id="fn{n}">`
in the one endnotes section, content as one `<p>` + backlink.
* A trailing `{attrs}` attaches to the noteref `<a>` (like a
reference note).
* Empty or whitespace-only (`^[]`, `^[ ]`) is literal; an unclosed
`^[…` is literal.
PRECEDENCE (PART 8): `^[` is ranked ABOVE superscript. Consequences:
`^[x]^` is a note then a literal `^` (no longer superscript-of-`[x]`;
escape `\^[x]^` to keep superscript); `^^[x]` is suppressed by
same-delimiter `^` adjacency (literal); `\^[x]` is literal; a `^` not
immediately followed by `[` is superscript as before.
- LIMITATION: a footnote (reference `[^1]` or inline `^[…]`) inside link
text (`[t[^1]](u)`) or inside a heading later cloned by a `</#id>`
crossref nests an <a> in an <a>; avoid footnotes in those positions.
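ILLUSTRATION (non-normative). Reference numbering and the refId scheme,
sketched in Python for reference-form notes only (inline `^[…]` notes,
attributes, and the endnotes section are not modelled). The function
name and its inputs are hypothetical.

```python
from typing import Dict, List, Set

def number_references(labels_in_order: List[str],
                      defined: Set[str]) -> List[str]:
    """Render each [^label] occurrence in document order."""
    numbers: Dict[str, int] = {}   # label -> note number
    counts: Dict[str, int] = {}    # label -> references seen so far
    out: List[str] = []
    for label in labels_in_order:
        if label not in defined:
            out.append(f"[^{label}]")      # unresolved: literal source
            continue
        # First reference assigns the next number; repeats reuse it.
        n = numbers.setdefault(label, len(numbers) + 1)
        k = counts[label] = counts.get(label, 0) + 1
        ref_id = f"fnref{n}" if k == 1 else f"fnref{n}-{k}"
        out.append(f'<a id="{ref_id}" href="#fn{n}" '
                   f'role="doc-noteref"><sup>{n}</sup></a>')
    return out
```

Unresolved labels never consume a number, so the sequence stays dense
across the resolved notes.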
17. TIGHT vs LOOSE LISTS -- NORMATIVE (governs `list` rendering). A list is
LOOSE if any item is followed by a blank line before the next sibling
marker, OR any item holds a blank-line-separated second PARAGRAPH;
otherwise TIGHT. A tight item's paragraph renders WITHOUT a `<p>`
(`<li>text</li>`); a loose item's paragraphs ARE wrapped
(`<li><p>text</p></li>`).
COMPACT LIST BLOCKS -- Carve deviation from djot/CommonMark. A blank line
before an item's sub-BLOCK (sub-list, block quote, fenced code, fenced
div, heading, table) does NOT loosen the list: the item stays tight, lead
text inline, block attached. Only a genuine second PARAGRAPH (blank +
indented plain prose), or a blank line between items, loosens. The blank
line is still required to START the block, so block recognition and the
uniformity principle are unchanged -- only the tight/loose RENDERING
differs. djot renders these loose; Carve renders them tight. Pinned by
corpus compact-list-blocks.
LIST CONTINUATION MARKER -- Carve addition. A line whose only content is
`+`, at the item's marker column, attaches the FOLLOWING flush-left block
to the current item with no blank line, keeping the list tight (useful for
code/tables you would rather not indent). `+` is not a Carve bullet (PART 9
§11), so a lone `+` is unambiguous; outside a list a lone `+` is literal
text. It attaches one block, up to the next blank line, sibling item, or a
further `+`. The same marker may sit on the marker line itself (`- +`) to
open an item whose body is the flush-left block(s) that follow, with no
inline lead -- the FIRST-BLOCK form (a bare `+`; `- + text` keeps `+ text`
as literal text). Pinned by corpus list-continuation-marker.
18. MATH -- NORMATIVE (governs `math` in PART 3; djot form). Inline
`$` + verbatim backtick span; display `$$` + verbatim backtick
span. A `$` NOT immediately followed by a backtick run is literal
(`$5` is currency); `\$` escapes a literal `$`. Rendering:
inline -> <span class="math inline">\(content\)</span>
display -> <span class="math display">\[content\]</span>
`content` is the verbatim span text, HTML-escaped (`&`,`<`,`>`).
Display math alone on a line renders inside its own paragraph. A
standalone display-math block carrying a trailing caption (PART 9 §4)
is a numbered EQUATION: the math paragraph is wrapped in a figure and a
`</#id>` to it resolves to "Equation N" (§19).
Author `{attrs}` after the span merge onto the <span> (the base
`math …` class is kept; see PART 10 §1).
19. CROSSREF AUTO-TEXT + DEFAULT-ON / PROCESSOR FEATURES -- NORMATIVE.
- `</#id>` clones the target heading's full inline children, so a
heading with markup (`# *Setup*`) yields a link that KEEPS the
markup (`<a href="#…"><strong>Setup</strong> …</a>`).
- `</#id>` to a NUMBERED CAPTION (a figure / table / listing / equation
whose caption carries a number placeholder and an `{#id}`) yields the
caption's LABEL + NUMBER ("Figure 1"), markup preserved, NOT the
caption prose. A `</#id>` to an id that is neither a heading nor a
numbered caption stays unresolved (renders literal), as today.
- Mentions (`@user`), tags (`#tag`) and smart typography are ON by
default in the conformant core; a processor MAY disable them. The
corpus pins the default-on behavior.
- Includes (`{{ path }}`) are a PROCESSOR-LEVEL directive, NOT part
of the core parser; a conformant core MAY leave `{{ … }}` literal.
An implementing processor MUST forbid path traversal outside the
project root, bound recursion depth, and treat includes as opt-in.
20. RAW PASSTHROUGH -- NORMATIVE (governs `raw_inline` in PART 3 and the
semantics of `raw_block` in PART 2). The BLOCK form (```=FORMAT
fence, PART 2) emits its verbatim content UNESCAPED when FORMAT
matches the output format and drops it otherwise; the inline form
follows. Block and inline raw share the `=FORMAT` spelling (```=html
/ `…`{=html}); the former ```raw FORMAT keyword form was removed.
A code span whose trailing attribute block is EXACTLY `{=format}` (a
single `=`-prefixed format name, with no
other classes/ids/keys) is raw passthrough: the verbatim span content
is emitted UNESCAPED when `format` matches the output format (`html`),
and DROPPED otherwise. The `{=format}` tag is consumed, not rendered.
A code span with any OTHER trailing `{…}` is a generic attributed code
span (PART 3 `code_span` note), not raw inline. The corpus pins this
(tests/corpus/50-raw-inline):
`<br>`{=html} -> <br> (emitted verbatim)
`\foo`{=latex} -> (dropped in HTML output)
21. TRAILING LINE COMMENTS -- NORMATIVE (governs `inline_comment` in PART 3).
- A `%%` token is a trailing comment marker only when the character
immediately before it is whitespace (space or tab) or it starts the
inline run (beginning of paragraph content after leading whitespace).
- When recognized, the `%%` and ALL remaining characters up to (but not
including) the line break are consumed and produce no output.
- `%%` is NEVER recognized inside a code span or raw inline (PART 9 §20);
those contexts pass both `%` characters through verbatim.
- `\%%` (escaped first percent) is literal -- it renders as the text
`%%`, not as a comment marker.
- Without preceding whitespace (e.g. `50%%` or `a%%b`) the `%%` is
literal -- percentages and doubled-percent tokens in prose are safe.
- The comment does NOT cross a line break; a soft-wrapped continuation on
the next line of the same paragraph is unaffected (corpus 46-comments-6).
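ILLUSTRATION (non-normative). A Python sketch of the recognition test
for one source line, under a stated simplification: the line is assumed
to contain no code span or raw inline, so that exclusion is not
modelled. `strip_trailing_comment` is a hypothetical name, and whether
whitespace left before the marker is trimmed is not decided here.

```python
def strip_trailing_comment(line: str) -> str:
    """Drop a trailing %% comment from a single line, if one is present."""
    i = 0
    while i < len(line):
        if line[i] == "\\":
            i += 2                 # skip the backslash and what it escapes
            continue
        if line.startswith("%%", i):
            before = line[:i]
            # Marker counts only at the start of the inline run or
            # directly after a space or tab.
            if before.strip() == "" or before[-1] in " \t":
                return before
        i += 1
    return line
```

Because the scan works on one line at a time, a comment can never reach
into the following line of the paragraph.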
22. FORCED INTRAWORD EMPHASIS -- NORMATIVE (governs the forced_* productions
in PART 3). The brace-pair family is the escape hatch for emphasis where a
bare delimiter would otherwise stay literal (§9 word boundary).
- FORMS. Each emphasis mark has a forced form that renders the SAME
element as its bare delimiter:
    {/x/} -> <em>x</em>       {*x*} -> <strong>x</strong>
    {_x_} -> <u>x</u>         {~x~} -> <s>x</s>
    {^x^} -> <sup>x</sup>     {,x,} -> <sub>x</sub>
    {=x=} -> <mark>x</mark>
- NO WORD BOUNDARY. A forced span opens and closes regardless of the
surrounding characters, so it emphasizes intraword: foo{*bar*}baz ->
foo<strong>bar</strong>baz; my{_path_}name -> my<u>path</u>name. This
is the ONLY way to emphasize intraword.
- BOUNDS = THE BRACES. The closing `X}` ends the span; a bare delimiter
of the same kind INSIDE is literal content, like the bare-span rule
(§9): {/a/b/} -> <em>a/b</em>. Inner content otherwise parses normally,
so cross-type nesting works: {/italic *bold*/} -> <em>italic
<strong>bold</strong></em>.
- `{~ … ~}` DISAMBIGUATION. The brace pair is editorial SUBSTITUTION when
it contains a top-level `~>` ({~old~>new~} ->
<del>old</del><ins>new</ins>), and forced STRIKETHROUGH otherwise
({~old~} -> <s>old</s>). The `~>` test is the discriminator.
- `{= … =}` is forced highlight (<mark>), distinct from the raw-inline
`{=format}` attribute on a code span (`` `x`{=html} ``), which has no
trailing `=` before the `}`.
- ESCAPING. A literal `{/` (etc.) is written `\{/` -- one backslash on the
brace is enough; the inner delimiter need not be escaped.
- ATTRIBUTES. A forced span MAY carry a trailing `{attributes}` block like
any bare span: {*x*}{.c} -> <strong class="c">x</strong>.
(corpus 01-emphasis-12 forced-intraword, 01-emphasis-13 forced-nesting.)
23. LINE BLOCK (VERSE) -- NORMATIVE (governs `line_block` in PART 2). A
`::: |` fenced block preserves the author's per-line layout.
- TRIGGER. A bare pipe `|` TYPE TOKEN on the opener (`::: |`) selects this
behavior. The token is a pipe, not a word. STRICT (djot): the opener
carries no inline attributes, so the `::: {.line-block}` class form is an
ordinary PARAGRAPH, not a line block -- the class alone does not trigger it.
- WRAPPER. Renders as `<div class="line-block">` (a generic div, never an
`<aside>`); extra classes / id attach via a PRECEDING block-attribute
line, which floats onto the div (§15).
- LINE BREAKS. Each soft line break inside a stanza becomes a HARD break
(`<br>`). So consecutive non-blank body lines render as one paragraph
with `<br>` between them.
- STANZAS. A blank line ends a stanza; each stanza is its own `<p>` inside
the single `<div class="line-block">`.
- LEADING WHITESPACE. Each body line's leading whitespace is PRESERVED
(unlike ordinary paragraph content, which is trimmed). In HTML each
leading space serializes as an NBSP (`&nbsp;`, U+00A0), so indentation is
visible with no external CSS (matching Pandoc). Plain-text / ANSI
renderers emit the literal spaces. A tab in the leading run follows the
tab-stop rule (PART 9 §24).
- REFERENCE COLUMN. Leading whitespace is measured RELATIVE TO THE FENCE
indent, not absolute column 0: a `line-block` nested in a list has the
container's structural indent stripped first (the list dedents to its
content column), so only the author's intra-verse indent survives. The
`+` first-block form (fence at column 0) is the canonical way to nest a
`line-block` without the container consuming indentation.
- INLINE CONTENT parses normally (emphasis, links, code, ...); only the
whitespace and soft-break handling differ from an ordinary div.
(corpus 88-line-blocks*.)
24. TABS AND INDENTATION -- NORMATIVE (governs `indent` in PART 2 and all
indentation-sensitive comparisons). Adjudicated with the tab-stop work
(carve #99 / Rule B carve #103); shipped in all three implementations.
- COLUMNS, NOT CHARACTERS. Indentation is measured in VISUAL COLUMNS:
a tab advances to the next multiple of 4 (CommonMark tab stops), a
space advances by 1. A leading tab therefore indents to column 4.
Outside indentation (code content, inline text) tabs are PRESERVED
verbatim and never expanded (PART 2, TABS IN CODE).
- MARKER SEPARATOR IS A SPACE. The required separator after a list
marker is a single SPACE character -- a syntax delimiter, not
indentation. `-<TAB>a` is NOT a list item (paragraph text).
- CONTENT COLUMN. An item's content column is the visual column where
its content starts (marker width + separator: `- ` -> 2, `1. ` -> 3,
`10. ` -> 4). An ORDERED child marker must reach the parent item's
content column to nest; below it, the line folds as lazy item text
(ordered markers never interrupt, §10). UNORDERED/TASK bullets and
every other block opener (quote, heading, fence, table, ...) nest at
ANY indent past the parent item's base column.
- RULE B (BULLETS AT ANY INDENT). A bullet (`-`/`*` + space + content)
opens a list at ANY indentation, uniformly in every context: at the
top level, interrupting an open paragraph (§10), and nested in a
list item. There is no column-0 restriction.
- DEDENT. When nested content is re-parsed, the container's structural
indent is stripped by columns. A tab straddling the dedent boundary
is consumed WHOLE on non-marker lines (Carve has no indent-sensitive
block below list nesting, so the residual columns never matter);
on a LIST-MARKER line the straddling tab's unconsumed columns are
re-inserted as spaces, so sibling markers aligned with mixed
tab/space indentation keep the same visual column and stay siblings.
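ILLUSTRATION (non-normative). The COLUMNS, NOT CHARACTERS rule,
sketched in Python. `indent_columns` is a hypothetical helper that
measures only the leading whitespace run of one line; marker width and
dedent handling are not modelled.

```python
def indent_columns(line: str, tab_stop: int = 4) -> int:
    """Visual column reached by the leading spaces and tabs of a line."""
    col = 0
    for ch in line:
        if ch == " ":
            col += 1                              # a space advances by 1
        elif ch == "\t":
            col += tab_stop - (col % tab_stop)    # jump to next tab stop
        else:
            break                                 # indentation has ended
    return col
```

Two spaces followed by a tab measure the same as a lone tab, which is
why mixed tab/space indentation can still align sibling markers.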
*)
(* ============================================================================
PART 10: HTML SERIALIZATION CONVENTIONS (NORMATIVE)
============================================================================
The corpus pins exact bytes against the reference (carve-js); these
conventions state the rules a SECOND implementation (e.g. carve-php)
must follow to match WITHOUT copying carve-js output. On any
disagreement the corpus wins.
1. ATTRIBUTE ORDER. Attributes serialize in the AUTHOR's source order
(the recorded `order`; see the RENDER ORDER rule in PART 4 and
PART 9 §12-style note). An Attrs built programmatically with no
recorded order falls back to a fixed `class`, then `id`, then
key=value order. All classes accumulate (space-joined) into one
`class` slot at its first-appearance position. An element with no
effective attributes emits none. (Heading ids move to <section>,
§13.) A mandatory base class (math `math inline|display`) is
prepended to that class slot.
2. ATTRIBUTE QUOTING + ESCAPING. Attribute values are double-quoted.
In an attribute value escape `&`->`&amp;`, `<`->`&lt;`, `>`->`&gt;`,
`"`->`&quot;`, and `'` to its character reference (exact spelling as
pinned by the corpus). In text/content escape `&`,`<`,`>` (NOT
quotes). No other entities are produced.
3. VOID ELEMENTS are written WITHOUT a trailing slash: `<hr>`,
`<img …>`, `<br>`. A hard break serializes as `<br>` + newline.
4. BLOCK WHITESPACE. Block elements are newline-separated, one per
line. Nested block structures (list, blockquote, table, admonition,
div, figure, section, endnotes) indent their children by TWO spaces
per nesting level. Inline content stays flat on its block's line.
5. PARAGRAPH WRAPPING follows the tight/loose rule (§17): a tight list
item's text has no <p>; a loose item's paragraphs are wrapped.
6. CODE. A fenced code block is `<pre><code[ class="language-X"]>` +
escaped content + a trailing newline + `</code></pre>`. An inline
code span is `<code>` + escaped content + `</code>`.
7. TABLES. Alignment renders as `style="text-align: VALUE;"` (one
space after the colon, trailing semicolon); a cell with no effective
alignment emits NO style attribute. `rowspan`/`colspan` are emitted
only when their count is > 1.
8. STABLE STRINGS. Mentions and tags render as inert `<span>`s (no link)
when no URL template is configured; a configured template links to
them. The footnote backlink glyph is `↩` (§16). Math wraps content in
`\(…\)` / `\[…\]` inside `<span class="math inline|display">` (§18).
*)