commonmark
Pure Haskell commonmark parser.
https://github.com/jgm/commonmark-hs
LTS Haskell 23.1: | 0.2.6.1 |
Stackage Nightly 2024-12-26: | 0.2.6.1 |
Latest on Hackage: | 0.2.6.1 |
commonmark-0.2.6.1@sha256:cdc783466e4d24ed49ea95953b9e9311f066b228d72af96e816a06f536c700c6,3498
commonmark
This package provides the core parsing functionality for commonmark, together with HTML renderers.
:construction: This library is still in an experimental state. Comments on the API and implementation are very much welcome. Further changes should be expected.
The library is fully commonmark-compliant and passes the test suite for version 0.30 of the commonmark spec. It is designed to be customizable and easily extensible. To customize the output, create an AST, or support a new output format, one need only define some new typeclass instances. It is also easy to add new syntax elements or modify existing ones.
Accurate information about source positions is available for all block and inline elements. Thus the library can be used to create an accurate syntax highlighter or an editor with synced live preview.
Finally, the library has been designed for robust performance even in pathological cases. The parser behaves well on pathological cases that tend to cause stack overflows or exponential slowdowns in other parsers, with parsing speed that varies linearly with input length.
Related libraries
-
commonmark-extensions
provides a set of useful extensions to core commonmark syntax, including all GitHub-flavored Markdown extensions and many pandoc extensions. For convenience, the package of extensions defining GitHub-flavored Markdown is exported asgfmExtensions
. -
commonmark-pandoc
defines type instances for parsing commonmark as a Pandoc AST. -
commonmark-cli
is a command-line program that uses this library to convert and syntax-highlight commonmark documents.
Simple usage example
This program reads commonmark from stdin and renders HTML to stdout:
{-# LANGUAGE ScopedTypeVariables #-}
import Commonmark
import Data.Text.IO as TIO
import Data.Text.Lazy.IO as TLIO
main = do
res <- commonmark "stdin" <$> TIO.getContents
case res of
Left e -> error (show e)
Right (html :: Html ()) -> TLIO.putStr $ renderHtml html
Notes on the design
The input is a token stream ([Tok]
), which can be
be produced from a Text
using tokenize
. The Tok
elements record source positions, making these easier
to track.
Extensibility is emphasized throughout. There are two ways in which one might want to extend a commonmark converter. First, one might want to support an alternate output format, or to change the output for a given format. Second, one might want to add new syntactic elements (e.g., definition lists).
To support both kinds of extension, we export the function
parseCommonmarkWith :: (Monad m, IsBlock il bl, IsInline il)
=> SyntaxSpec m il bl -- ^ Defines syntax
-> [Tok] -- ^ Tokenized commonmark input
-> m (Either ParseError bl) -- ^ Result or error
The parser function takes two arguments: a SyntaxSpec
which
defines parsing for the various syntactic elements, and a list
of tokens. Output is polymorphic: you can
convert commonmark to any type that is an instance of the
IsBlock
typeclass. This gives tremendous flexibility.
Want to produce HTML? You can use the Html ()
type defined
in Commonmark.Types
for basic HTML, or Html SourceRange
for HTML with source range attributes on every element.
GHCI> :set -XOverloadedStrings
GHCI>
GHCI> parseCommonmarkWith defaultSyntaxSpec (tokenize "source" "Hi there") :: IO (Either ParseError (Html ()))
Right <p>Hi there</p>
> parseCommonmarkWith defaultSyntaxSpec (tokenize "source" "Hi there") :: IO (Either ParseError (Html SourceRange))
Right <p data-sourcepos="source@1:1-1:9">Hi there</p>
Want to produce a Pandoc AST? You can use the type
Cm a Text.Pandoc.Builder.Blocks
defined in commonmark-pandoc
.
GHCI> parseCommonmarkWith defaultSyntaxSpec (tokenize "source" "Hi there") :: Maybe (Either ParseError (Cm () B.Blocks))
Just (Right (Cm {unCm = Many {unMany = fromList [Para [Str "Hi",Space,Str "there"]]}}))
GHCI> parseCommonmarkWith defaultSyntaxSpec (tokenize "source" "Hi there") :: Maybe (Either ParseError (Cm SourceRange B.Blocks))
Just (Right (Cm {unCm = Many {unMany = fromList [Div ("",[],[("data-pos","source@1:1-1:9")]) [Para [Span ("",[],[("data-pos","source@1:1-1:3")]) [Str "Hi"],Span ("",[],[("data-pos","source@1:3-1:4")]) [Space],Span ("",[],[("data-pos","source@1:4-1:9")]) [Str "there"]]]]}}))
If you want to support another format (for example, Haddock’s DocH
),
just define typeclass instances of IsBlock
and IsInline
for
your type.
Supporting a new syntactic element generally requires (a) adding
a SyntaxSpec
for it and (b) defining relevant type class
instances for the element. See the examples in
Commonmark.Extensions.*
. Note that SyntaxSpec
is a Monoid,
so you can specify myNewSyntaxSpec <> defaultSyntaxSpec
.
Performance
Here are some benchmarks on real-world commonmark documents,
using make benchmark
. To get benchmark.md
, we concatenated
a number of real-world commonmark documents. The resulting file
was 355K. The
bench
tool was
used to run the benchmarks.
program | time (ms) |
---|---|
cmark | 12 |
cheapskate | 105 |
commonmark.js | 217 |
commonmark-hs | 229 |
pandoc -f commonmark | 948 |
It would be good to improve performance. I’d welcome help with this.
Changes
Changelog for commonmark
0.2.6.1
- Fix parsing of link destinations that look like
code
or HTML (#136, Michael Howell). This affects parsing of things like[link](`)`x`
.
0.2.6
-
Make list tightness match the reference implementation closer (#150, Michael Howell). This solves the problem where blank lines in the middle of a list are attributed to the list itself instead of the item, making its parent list become spuriously loose.
-
Fix bug with entities inside link destinations (#149). The bug affects cases like this:
[link](\!)
; the backslash escape was being ignored here. -
Commonmark.Entity: export
pEntity
[API change].
0.2.5.1
- Replace
source
withsearch
in list of block tags. This is a spec 0.31 change that was forgotten in the last release.
0.2.5
-
Fix HTML comment parser to conform to 0.31.2 spec.
-
Update spec.txt tests to commonmark-spec 0.31.2.
-
Match HTML declaration blocks with lowercase letters (Michael Howell).
-
Specifically track the position where enders end (Michael Howell).
0.2.4.1
-
Commonmark.Html: Add
aria-hidden
,d
, andviewBox
to allowed attributes list. -
Correctly merge list blanks with non-list blanks (#133, Michael Howell).
-
Do not look for backslashed hard breaks in link titles (#130, Michael Howell).
-
Work around ghc bug with
-K
RTS options, to set the stack space properly for tests (#129). See https://gitlab.haskell.org/ghc/ghc/-/issues/10445. -
Revert block state completely if lazy line (#126). This fixes an issue with lazily-wrapped footnotes.
-
Avoid adding trailing newline to list block if it’s already there (Michael Howell). This fixes tight/loose classification in a few cases.
-
Fix incorrectly parsing links with nested
[]
(Michael Howell).
0.2.4
-
Do not parse hard line breaks in fenced codeblock info (#116, Michael Howell). This change makes commonmark-hs conform to the spec and behave like other implementations when an info string in a code block ends with a backslash.
-
[API change] Commonmark.Inlines now exports
pEscapedSymbol
(#116, Michael Howell). -
Tokenize combining marks as WordChars not Symbol (#114).
0.2.3
- Re-export Text.Parsec.Pos from Commonmark.Types (Fraser Tweedale, #106).
0.2.2
- Blocks: export
getParentListType
[API change]. - Require unicode-data >= 0.3.
- Change
mkFormattingSpecMap
so it integrates different FormattingSpecs that use the same character (#87). Otherwise we have problems if you have one formatting spec that reacts to single delimiters and another that reacts to pairs; if the first fails to match, the fallback behavior is produced and the second never matches. - Use unicode-data’s faster versions of Data.Char functions. This speeds up benchmarks for tokenize considerably; little difference in other benchmarks. unicode-data is already a transitive dependency, via unicode-transforms.
- Increase strictness in tokenize/go.
- Remove legacy cpp needed to support ghc < 8.4.
0.2.1.1
- Fix bug in
prettyShow
forSourceRange
(#80). The bug led to an infinite loop in certain cases.
0.2.1
- Use official 0.30 spec.txt.
- Update HTML block parser for recent spec changes.
- Fix test case from commonmark/cmark#383. We need to index the list of stack bottoms not just by the length mod 3 of the closer but by whether it can be an opener, since this goes into the calculation of whether the delimiters can match.
0.2
- Commonmark.Inlines: export LinkInfo(..) [API change].
- Commonmark.Inlines: export pLink [API chage].
- Comonmark.ReferenceMap: Add linkPos field to LinkInfo [API change].
- Commonmark.Tokens: normalize unicode to NFC before tokenizing (#57). Normalization might affect detection of flankingness, recognition of reference links, etc.
- Commonmark.Html: add data-prefix to non-HTML5 attributes, as pandoc does.
- Remove unnecessary build-depends.
- Use lightweight tasty-bench instead of criterion for benchmarks.
0.1.1.4
- Fix build with GHC 9.0.1 (Simon Jakobi, #72).
0.1.1.3
- Fix bug in links with spaces at the beginning or end of link description (#67). We were putting flankingness constraints on the link delimiters, but this isn’t requried by the spec.
0.1.1.2
- Fix bug in fix to #65 (#66).
0.1.1.1
- Fixed corner case with link suffix parsing, which could result in dropped tokens in certain cases (#65).
0.1.1
- Export
reverseSubforests
fromCommonmark.Blocks
[API change] (#64).
0.1.0.2
- Fix tight/loose list detection with multiple blank lines at end (#56).
0.1.0.1
- Set source position when we add a token in gobbleSpaces (#54). This fixes a bug in gobbling indented spaces in some nested contexts.
- Drop support for ghc 7.10/base 4.8. We need StrictData. Move SCC annotations; ghc 8.0.x doesn’t support them on declarations.
0.1.0.0
- Initial release