zenacy-html
A standard compliant HTML parsing library
https://github.com/mlcfp/zenacy-html
LTS Haskell 22.43: 2.1.0
Stackage Nightly 2023-12-26: 2.1.0
Latest on Hackage: 2.1.0
zenacy-html-2.1.0@sha256:ebb91d04574499b0a89a4456d28d11d28742e9a635fc6dc2f9450be762e26563,4631
Module documentation for 2.1.0
- Zenacy
- Zenacy.HTML
- Zenacy.HTML.Internal
- Zenacy.HTML.Internal.BS
- Zenacy.HTML.Internal.Buffer
- Zenacy.HTML.Internal.Char
- Zenacy.HTML.Internal.Core
- Zenacy.HTML.Internal.DOM
- Zenacy.HTML.Internal.Entity
- Zenacy.HTML.Internal.Filter
- Zenacy.HTML.Internal.HTML
- Zenacy.HTML.Internal.Image
- Zenacy.HTML.Internal.Lexer
- Zenacy.HTML.Internal.Oper
- Zenacy.HTML.Internal.Parser
- Zenacy.HTML.Internal.Query
- Zenacy.HTML.Internal.Render
- Zenacy.HTML.Internal.Token
- Zenacy.HTML.Internal.Trie
- Zenacy.HTML.Internal.Types
- Zenacy.HTML.Internal.Zip
Zenacy HTML
Zenacy HTML is an HTML parsing and processing library that implements the WHATWG HTML parsing standard. The standard is described as a state machine, and this library implements it exactly as specified, including all the error handling, recovery, and conformance checks that make it robust when handling HTML pulled from the web. In addition to parsing, the library provides many processing features to help extract information from web pages, or to rewrite them and render the modified results.
Introduction
The Zenacy HTML parser is an implementation of the HTML parsing standard defined by the WHATWG.
https://html.spec.whatwg.org/multipage/parsing.html
The standard defines a parsing state machine, so it is very prescriptive about how HTML is handled, including many edge cases and error recovery. This library aims to follow the standard closely enough that the code can be matched back to the standard, which should make future updates straightforward.
One of the main uses of an HTML parser is extracting information from the web. Having a parser that can handle all the nuances of poorly formatted HTML helps to make this extraction as robust as possible. This was a key motivation for implementing the parser in this fashion. Additionally, the standard describes the algorithms needed to produce the correct document structure. Applications that are sensitive to the document structure, such as those that extract or rewrite large portions of a web page, may benefit from Zenacy HTML.
The library provides a wide variety of features including:
- A fully standards-compliant HTML parser
- HTML Fragment parsing
- Document rendering
- A zipper type for document traversal
- An iterator type for document walking
- Various functions for processing aspects of HTML
- Lightweight queries for rewriting
Parsing
The library is designed to be imported unqualified.
import Zenacy.HTML
The htmlParseEasy function can be used to parse an HTML document string and return the document model.
htmlParseEasy "<div>HelloWorld</div>"
Note that some of the missing elements were automatically added to the document structure, as required by the standard.
HTMLDocument ""
  [ HTMLElement "html" HTMLNamespaceHTML []
    [ HTMLElement "head" HTMLNamespaceHTML [] []
    , HTMLElement "body" HTMLNamespaceHTML []
      [ HTMLElement "div" HTMLNamespaceHTML []
        [ HTMLText "HelloWorld" ] ] ] ]
The parsed result can also be rendered using htmlRender.
htmlRender $ htmlParseEasy "<div>HelloWorld</div>"
The resulting rendered document appears like so.
<html><head></head><body><div>HelloWorld</div></body></html>
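The two calls above can be combined into a small standalone program. This is a minimal sketch, assuming the text package is available and that OverloadedStrings is enabled so the literal is read as Text. The name renderRoundTrip is introduced here for illustration only and is not part of the library.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)
import qualified Data.Text.IO as T
import Zenacy.HTML

-- | Parse a document and render it straight back to text.
-- (Hypothetical helper name, used only in this sketch.)
renderRoundTrip :: Text -> Text
renderRoundTrip = htmlRender . htmlParseEasy

main :: IO ()
main = T.putStrLn (renderRoundTrip "<div>HelloWorld</div>")
```

Because the parser inserts the missing html, head, and body elements, the rendered text is generally not identical to the input.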
Rewriting
This example illustrates a function that converts span elements to divs.
rewrite :: Text -> Text
rewrite = htmlRender . htmlMapElem f . fromJust . htmlDocHtml . htmlParseEasy
  where
    f x
      | htmlElemHasName "span" x = htmlElemRename "div" x
      | otherwise = x
rewrite "<span>Hello</span><span>World</span>"
Running the above gives the modified document.
<html><head></head><body><div>Hello</div><div>World</div></body></html>
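The rewrite function above relies on fromJust, which raises an exception if htmlDocHtml finds no html element. Where that matters, a total variant can return a Maybe instead. The following is a sketch using only the functions already shown; rewriteSafe is a hypothetical name, not a library function.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)
import Zenacy.HTML

-- | Like 'rewrite', but yields Nothing instead of failing when the
-- parsed document has no html element. (Hypothetical name.)
rewriteSafe :: Text -> Maybe Text
rewriteSafe = fmap (htmlRender . htmlMapElem f) . htmlDocHtml . htmlParseEasy
  where
    -- Rename span elements to div and leave every other node alone.
    f x
      | htmlElemHasName "span" x = htmlElemRename "div" x
      | otherwise = x
```

Callers can then decide how to handle the Nothing case, for example by falling back to the original input.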
Extraction
The next example shows one way to find all the hyperlinks in a document. This solution recurses over the document elements while ignoring fragments and templates.
-- Requires the LambdaCase and OverloadedStrings extensions.
extract :: Text -> [Text]
extract = go . htmlParseEasy
  where
    go = \case
      HTMLDocument _ c ->
        concatMap go c
      -- No whitespace around @, as required by GHC 9.
      e@(HTMLElement "a" _ _ c) ->
        case htmlElemAttrFind (htmlAttrHasName "href") e of
          Just (HTMLAttr _ v _) ->
            v : concatMap go c
          Nothing ->
            concatMap go c
      HTMLElement _ _ _ c ->
        concatMap go c
      _otherwise ->
        []
extract "<a href=\"https://example1.com\"></a><a href=\"https://example2.com\"></a>"
The extract function will give the following list.
[ "https://example1.com"
, "https://example2.com"
]
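The same recursion can be generalized to other elements and attributes, for example to collect the src of every img. The sketch below assumes the constructor shapes used in the extract example and reuses htmlElemHasName and htmlElemGetAttr from the other examples in this document. The name collectAttr is hypothetical.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Maybe (maybeToList)
import Data.Text (Text)
import Zenacy.HTML

-- | Collect the value of attribute 'attr' from every element named 'tag',
-- in document order. Fragments and templates are ignored, as in 'extract'.
-- (Hypothetical helper, not part of the library.)
collectAttr :: Text -> Text -> HTMLNode -> [Text]
collectAttr tag attr = go
  where
    go node = case node of
      HTMLDocument _ children ->
        concatMap go children
      HTMLElement _ _ _ children ->
        here node ++ concatMap go children
      _ ->
        []
    -- The attribute value of this node, if the node matches.
    here node
      | htmlElemHasName tag node = maybeToList (htmlElemGetAttr attr node)
      | otherwise = []
```

With this helper, the hyperlink extraction becomes collectAttr "a" "href" . htmlParseEasy.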
Queries
The library includes a basic query facility implemented as a thin wrapper around an HTMLZipper. Queries match patterns in HTML structures and can be used to extract information or update documents. As a first example, consider the following HTML.
<p>
  <span id="x" class="y z"></span>
  <br>
  <a href="bbb">AAA</a>
  <img>
</p>
The HTML can be parsed as normal. Note though the additional step of whitespace removal, which is often important in documents that include indentation such as above.
fromJust . htmlSpaceRemove . fromJust . htmlDocBody . htmlParseEasy
Now a query function can be defined. This function expects to be given a body element whose first child is a p element. The first child of that p must have an id of x, and the second sibling after it must be an anchor element. If all of those conditions are met, the text content of the anchor is returned.
query :: HTMLNode -> Maybe Text
query = htmlQueryExec $ do
  htmlQueryName "body"
  htmlQueryFirst
  htmlQueryName "p"
  htmlQueryFirst
  htmlQueryId "x"
  htmlQueryNext
  htmlQueryNext
  htmlQueryName "a"
  a <- htmlQueryNode
  htmlQuerySucc $
    fromMaybe "" $ htmlElemText a
Running the query on the parsed document will give the result.
Just "AAA"
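Putting the parsing, whitespace removal, and query steps together might look like the sketch below. It composes the Maybe-returning steps with >=> rather than fromJust, so a missing body yields Nothing instead of an exception. The names sampleDoc, parseBody, and linkText are introduced here for illustration, and the two-space indentation in sampleDoc is an assumption about the original layout.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Monad ((>=>))
import Data.Maybe (fromMaybe)
import Data.Text (Text)
import qualified Data.Text as T
import Zenacy.HTML

-- The example document, including its indentation whitespace.
sampleDoc :: Text
sampleDoc = T.unlines
  [ "<p>"
  , "  <span id=\"x\" class=\"y z\"></span>"
  , "  <br>"
  , "  <a href=\"bbb\">AAA</a>"
  , "  <img>"
  , "</p>"
  ]

-- | Parse, take the body element, and strip whitespace.
parseBody :: Text -> Maybe HTMLNode
parseBody = htmlDocBody . htmlParseEasy >=> htmlSpaceRemove

-- | The query from the text above, repeated so the sketch is complete.
query :: HTMLNode -> Maybe Text
query = htmlQueryExec $ do
  htmlQueryName "body"
  htmlQueryFirst
  htmlQueryName "p"
  htmlQueryFirst
  htmlQueryId "x"
  htmlQueryNext
  htmlQueryNext
  htmlQueryName "a"
  a <- htmlQueryNode
  htmlQuerySucc $
    fromMaybe "" $ htmlElemText a

-- | Full pipeline from document text to the anchor's text.
linkText :: Text -> Maybe Text
linkText = parseBody >=> query

main :: IO ()
main = print (linkText sampleDoc)
```

Skipping the htmlSpaceRemove step would leave whitespace text nodes between the elements, so the sibling steps in the query would no longer line up with the elements.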
Queries can also be used to modify documents. In the next example, suppose we would like to find any img that is the only content in a div and replace the div with a link. The document could look as follows.
<section><div><img src="aaa"></div></section>
<section><div><img src="bbb"></div></section>
<section><div><img src="ccc"></div></section>
A query function can be defined to match the desired pattern and return the modified element.
query2 :: HTMLNode -> HTMLNode
query2 = htmlQueryTry $ do
  htmlQueryName "div"
  htmlQueryOnly "img"
  a <- htmlQueryNode
  let Just b = htmlElemGetAttr "src" a
  htmlQuerySucc $
    htmlElem "a" [ htmlAttr "href" b ]
      [ htmlText b ]
The query can then be applied to the entire document using htmlMapElem.
htmlMapElem query2
Rendering the mapped query will give the updated content.
<section><a href="aaa">aaa</a></section>
<section><a href="bbb">bbb</a></section>
<section><a href="ccc">ccc</a></section>
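For completeness, the rewrite can be wrapped into a single text-to-text function. The sketch below makes two assumptions that go beyond the example: it works on the body element (so the rendered text is expected to include the body wrapper), and it replaces the partial let Just pattern with fromMaybe, so an img without a src produces a link with an empty href rather than a runtime error. The names imgToLink and rewriteImages are hypothetical.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Maybe (fromMaybe)
import Data.Text (Text)
import qualified Data.Text as T
import Zenacy.HTML

-- | Replace a div whose only content is an img with a link to the image.
-- Nodes that do not match the pattern are returned unchanged.
imgToLink :: HTMLNode -> HTMLNode
imgToLink = htmlQueryTry $ do
  htmlQueryName "div"
  htmlQueryOnly "img"
  a <- htmlQueryNode
  -- Fall back to an empty source instead of a partial pattern match.
  let src = fromMaybe T.empty (htmlElemGetAttr "src" a)
  htmlQuerySucc $
    htmlElem "a" [ htmlAttr "href" src ]
      [ htmlText src ]

-- | Parse, rewrite every matching div in the body, and render.
rewriteImages :: Text -> Maybe Text
rewriteImages =
  fmap (htmlRender . htmlMapElem imgToLink) . htmlDocBody . htmlParseEasy
```

Whether an empty href is acceptable depends on the application; a stricter variant could leave such div elements untouched.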
Samples
The unit tests include the above samples as well as many other example usages of the library.
Origin
Zenacy HTML was originally developed for Zenacy Reader Technologies LLC starting around 2015 and was used in a web reading SaaS for a few years. The need to understand and handle the wide variety and subtleties of HTML found on the web led to the development of a library that closely followed the standard. The library was tweaked and optimized a bit, and though there is room for more improvement, the result worked quite well in production (a lot of credit goes to the GHC team and the Haskell community for providing such great, fast functional programming tools).
Changes
Change Log
2.1.0
- Fix removal of earliest active format element
- Fix DOM node attribute matching
2.0.7
- Update benchmarks for ST MonadFail removal
2.0.6
- Relax bounds on mtl and vector
2.0.5
- Update for removal of ST MonadFail instance
2.0.4
- Remove whitespace around @ patterns for GHC 9
- Upgrade transformers dependency
2.0.2
- Upgrade bytestring dependency
2.0.1
- Make version one line in cabal file to make shields work
2.0.0
- Initial FOSS release
1.0.0
- Initial release