Parsing & AST
Part 5: Parsing Rules
5.1 Block Identification (First Pass)
- Frontmatter (--- delimited at document start)
- Headings (# prefix)
- Thematic breaks (---, ***, ___)
- Code blocks (``` or ~~~ fenced)
- Block quotes (> prefix)
- Lists (-, *, or 1. prefix; + is the list-continuation marker, not a bullet)
- Tables (| prefix)
- Special blocks (::: delimited)
- Paragraphs (everything else)
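As a rough illustration of the first pass, the sketch below classifies the opening line of a block by its prefix. The function name `classify_line`, the rule names, and the exact regular expressions are assumptions made for this example (for instance, that a heading marker must be followed by whitespace); the normative patterns live in the grammar, not here.

```python
import re

# Ordered (name, pattern) pairs; the first pattern that matches wins.
# These patterns are illustrative guesses, not the normative grammar.
BLOCK_RULES = [
    ("thematic_break", re.compile(r"^(?:-{3,}|\*{3,}|_{3,})\s*$")),
    ("heading",        re.compile(r"^#+\s")),
    ("code_fence",     re.compile(r"^(?:`{3,}|~{3,})")),
    ("block_quote",    re.compile(r"^>")),
    ("special_block",  re.compile(r"^:::")),
    ("table",          re.compile(r"^\|")),
    # '+' is deliberately absent: it is the list-continuation marker.
    ("list",           re.compile(r"^(?:[-*]|\d+\.)\s")),
]


def classify_line(line: str, line_no: int) -> str:
    """Return a block-type name for the first line of a block."""
    stripped = line.rstrip("\n")
    # '---' opens frontmatter only on the very first line of the document.
    if line_no == 1 and stripped == "---":
        return "frontmatter"
    if not stripped.strip():
        return "blank"
    for name, pattern in BLOCK_RULES:
        if pattern.match(stripped):
            return name
    # Anything that matches no prefix rule is paragraph text.
    return "paragraph"
```

Note that the same `---` line is frontmatter on line 1 and a thematic break anywhere else, which is why the line number is part of the decision.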
5.2 Inline Parsing (Second Pass)
Parse in this precedence order:
- Escaped characters (\*)
- Code spans (`)
- Autolinks (<url>)
- Links, images, spans ([text](url), [text]{attrs})
- Math ($`…`, $$`…`)
- Emphasis markers (/ * _ ~ ^ , =)
- Smart typography
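One way to picture the precedence order is a scanner that, at each position, tries the rules from highest to lowest precedence and takes the first match. The sketch below covers only escapes, code spans, autolinks, and math; the `tokenize` name, the token tuples, and the simplified patterns are assumptions for illustration, and links, emphasis, and smart typography are left to later stages.

```python
import re

# Highest precedence first, mirroring the order in the list above.
# The patterns are simplified assumptions, not the normative grammar.
INLINE_RULES = [
    ("escape",   re.compile(r"\\(.)")),
    ("code",     re.compile(r"`([^`]+)`")),
    ("autolink", re.compile(r"<([a-z][a-z0-9+.-]*:[^\s<>]+)>")),
    ("math",     re.compile(r"\$\$?`([^`]+)`")),
]


def tokenize(text):
    """Split text into (kind, value) tokens; unmatched runs become 'text'."""
    tokens = []
    buf = []
    i = 0
    while i < len(text):
        for name, pattern in INLINE_RULES:
            m = pattern.match(text, i)
            if m:
                # Flush any pending literal text before the new token.
                if buf:
                    tokens.append(("text", "".join(buf)))
                    buf = []
                tokens.append((name, m.group(1)))
                i = m.end()
                break
        else:
            # No rule matched here: the character is literal for now.
            buf.append(text[i])
            i += 1
    if buf:
        tokens.append(("text", "".join(buf)))
    return tokens
```

Because code spans are consumed before emphasis is ever considered, the `*` in a span such as `a*b` stays literal and cannot open a Strong node.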
5.3 The Disambiguation Rule
- Literal over markup. A delimiter with no valid match (per the word-boundary conditions) is literal text.
- Opener → nearest valid same-type closer. Same-type delimiters between them are literal content (same-type spans do not nest), so /usr/local/ is <em>usr/local</em>, not <em>usr</em>local/.
- Different-type spans nest, resolved with a delimiter stack in a single left-to-right pass: linear time, no backtracking (Design Principle 1).
This is not "shortest span / earliest opening wins": that rule would truncate /usr/local/ to <em>usr</em> and break nested emphasis. See resources/grammar.ebnf PART 8 and PART 9 §9, and docs/edge-cases.md §1, §8.
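The first two rules can be sketched for a single delimiter type as follows. The word-boundary conditions in `is_opener` and `is_closer` are assumptions made for this example (the real conditions are defined in the grammar), and the sketch does not implement the cross-type delimiter stack; it only shows why the inner slash in /usr/local/ stays literal.

```python
def is_opener(text, i):
    """Assumed rule: not preceded by a word character, followed by non-space."""
    before_ok = i == 0 or not text[i - 1].isalnum()
    after_ok = i + 1 < len(text) and not text[i + 1].isspace()
    return before_ok and after_ok


def is_closer(text, i):
    """Assumed rule: preceded by non-space, not followed by a word character."""
    before_ok = i > 0 and not text[i - 1].isspace()
    after_ok = i + 1 == len(text) or not text[i + 1].isalnum()
    return before_ok and after_ok


def render_emphasis(text, delim="/", tag="em"):
    """Wrap each opener..nearest-valid-closer span in <tag>; the rest is literal."""
    out = []
    i = 0
    closers_left = True  # once one search fails, no later search can succeed
    while i < len(text):
        if closers_left and text[i] == delim and is_opener(text, i):
            # Scan for the nearest valid closer; same-type delimiters that
            # fail the closer test are skipped and remain literal content.
            j = i + 2  # require at least one character of content
            while j < len(text) and not (text[j] == delim and is_closer(text, j)):
                j += 1
            if j < len(text):
                out.append(f"<{tag}>{text[i + 1:j]}</{tag}>")
                i = j + 1
                continue
            closers_left = False
        # Literal over markup: an unmatched delimiter is plain text.
        out.append(text[i])
        i += 1
    return "".join(out)
```

Under these assumed conditions, `render_emphasis("/usr/local/")` yields `<em>usr/local</em>`, while `either/or` and `a / b` pass through unchanged. The `closers_left` flag keeps the scan linear for this single-type case.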
5.4 Whitespace Rules
- Line ending = soft break (default, configurable)
- Blank line = paragraph break
- Two+ blank lines = paragraph break with extra space (optional)
- Indentation: 2+ spaces for list continuation
- Tabs: Normalized to spaces (default: 4)
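A minimal sketch of the blank-line and tab rules, assuming that each tab is replaced by a fixed number of spaces (the spec could instead intend expansion to tab stops) and ignoring list continuation. The `split_paragraphs` name and its return shape are hypothetical; each inner list holds the lines of one paragraph, with a soft break implied between consecutive lines.

```python
def split_paragraphs(source: str, tab_width: int = 4) -> list[list[str]]:
    """Group lines into paragraphs separated by one or more blank lines."""
    # Assumption: every tab becomes tab_width spaces, regardless of column.
    text = source.replace("\t", " " * tab_width)
    paragraphs = []
    current = []
    for line in text.splitlines():
        if line.strip():
            # A line ending inside a paragraph is only a soft break.
            current.append(line)
        elif current:
            # A blank line closes the paragraph; further blanks add nothing here.
            paragraphs.append(current)
            current = []
    if current:
        paragraphs.append(current)
    return paragraphs
```

This version treats two or more blank lines the same as one; a renderer that supports the optional extra spacing would need to record the blank-line count as well.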
Part 6: AST Design
6.1 Node Types
Document
├── Frontmatter (optional)
└── Block+
    ├── Heading { level, content, id? }
    ├── Paragraph { content }
    ├── CodeBlock { language?, content, attributes }
    ├── BlockQuote { blocks, attribution? }
    ├── List { type, tight, items }
    │   └── ListItem { blocks, checked? }
    ├── Table { headers, rows, alignment[] }  // alignment[] = column defaults
    │   └── Cell { content, align?, colspan?, rowspan? }  // align? overrides the column default
    ├── ThematicBreak
    ├── Admonition { type, title?, blocks }
    └── RawBlock { format, content }

Inline
├── Text { content }
├── Emphasis { content }  // /text/
├── Strong { content }  // *text*
├── Underline { content }  // _text_
├── Strikethrough { content }  // ~text~
├── Superscript { content }  // ^text^
├── Subscript { content }  // ,text,
├── Highlight { content }  // =text=
├── Code { content, language? }
├── Math { content, display }
├── Link { content, url, title? }
├── Image { src, alt, caption? }
├── Footnote { content }
├── SoftBreak
├── HardBreak
└── RawInline { format, content }

(The id field on Heading is optional in the parsed AST but is always populated after the identifier-resolution pass, whether automatic or explicit, so consumers downstream of resolution can treat it as required.)
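To make the optional-then-required behaviour of the heading id concrete, here is a small sketch using Python dataclasses for a handful of the node types. Only the field names come from the tree above; the `resolve_ids` function, the slug scheme, and the numeric suffix for duplicates are hypothetical choices for this example.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Text:
    content: str


@dataclass
class Emphasis:
    content: list  # list of inline nodes


@dataclass
class Heading:
    level: int
    content: list  # list of inline nodes
    id: Optional[str] = None  # None until resolution unless given explicitly


@dataclass
class Paragraph:
    content: list


@dataclass
class Document:
    blocks: list
    frontmatter: Optional[dict] = None


def plain_text(inlines) -> str:
    """Concatenate the text of inline nodes, descending into containers."""
    parts = []
    for node in inlines:
        if isinstance(node, Text):
            parts.append(node.content)
        else:
            parts.append(plain_text(node.content))
    return "".join(parts)


def resolve_ids(doc: Document) -> None:
    """Fill in every missing Heading.id (hypothetical slug scheme)."""
    seen: set[str] = set()
    for block in doc.blocks:
        if not isinstance(block, Heading):
            continue
        if block.id is None:
            text = plain_text(block.content).lower()
            base = re.sub(r"[^a-z0-9]+", "-", text).strip("-") or "section"
            candidate, n = base, 1
            while candidate in seen:
                n += 1
                candidate = f"{base}-{n}"
            block.id = candidate
        seen.add(block.id)
```

After `resolve_ids(doc)` runs, every `Heading` in `doc.blocks` has a non-`None` id, so later stages such as a table-of-contents builder can read it without a guard. This sketch does not handle an explicit id that collides with an earlier automatic one.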
6.2 Source Mapping
Every node includes:
position: {
  start: { line, column, offset }
  end: { line, column, offset }
}
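A position of this shape can be derived from character offsets with a precomputed table of line starts. In the sketch below, the `Point` and `LineIndex` names are hypothetical, and the choice of 1-based lines and columns with a 0-based offset is an assumption; the text above does not state the indexing base.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    line: int    # 1-based (assumed)
    column: int  # 1-based (assumed)
    offset: int  # 0-based character offset (assumed)


class LineIndex:
    """Maps character offsets to (line, column) using binary search."""

    def __init__(self, source: str):
        # starts[k] is the offset of the first character of line k + 1.
        self.starts = [0]
        for i, ch in enumerate(source):
            if ch == "\n":
                self.starts.append(i + 1)

    def point(self, offset: int) -> Point:
        # Index of the last line start that is <= offset.
        row = bisect_right(self.starts, offset) - 1
        return Point(line=row + 1,
                     column=offset - self.starts[row] + 1,
                     offset=offset)


# Usage: locate the opening '/' of /world/ on the third line.
index = LineIndex("# Title\n\nHello /world/\n")
start = index.point(15)
```

Building the table once costs one pass over the source; each lookup is then logarithmic in the number of lines, so a parser can track plain offsets and convert them only when a node is finalized.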