scalpel

A high level web scraping library for Haskell.

https://github.com/fimad/scalpel

Version on this page: 0.3.1
LTS Haskell 22.40: 0.6.2.2
Stackage Nightly 2024-11-04: 0.6.2.2
Latest on Hackage: 0.6.2.2

Apache-2.0 licensed by Will Coster
Maintained by [email protected]
This version can be pinned in stack with: scalpel-0.3.1@sha256:942d692314b0cc1d0c2374d78a90bda5171d8879b38c7437653b208155922b8b,2345

Scalpel

Scalpel is a web scraping library inspired by libraries like Parsec and Perl’s Web::Scraper. Scalpel builds on top of TagSoup to provide a declarative and monadic interface.

There are two general mechanisms provided by this library that are used to build web scrapers: Selectors and Scrapers.

Selectors

Selectors describe a location within an HTML DOM tree. The simplest selector that can be written is a plain string value. For example, the selector "div" matches every div node in a DOM. Selectors can be combined using tag combinators. The // operator defines nested relationships within a DOM tree. For example, the selector "div" // "a" matches all anchor tags nested arbitrarily deep within a div tag.
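The nesting behaviour can be tried without any network access. The following is a minimal sketch, assuming scalpel 0.3.x, where plain String literals are accepted as selectors (later releases are documented with OverloadedStrings enabled, so the snippet may need adjusting there). The names sampleHtml and nestedLinks and the markup itself are made up for illustration; scrapeStringLike and texts are the library's own functions.

```haskell
import Text.HTML.Scalpel

-- Hypothetical markup: two anchors inside a div (one nested deeper, inside
-- a paragraph) and one anchor outside of any div.
sampleHtml :: String
sampleHtml = "<div><p><a>one</a></p><a>two</a></div><a>three</a>"

-- Collect the text of every anchor nested at any depth within a div.
nestedLinks :: Maybe [String]
nestedLinks = scrapeStringLike sampleHtml (texts ("div" // "a"))
```

Evaluating nestedLinks in GHCi should yield the text of the two anchors inside the div and leave out the third, which has no div ancestor.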

In addition to describing the nested relationships between tags, selectors can also include predicates on the attributes of a tag. The @: operator creates a selector that matches a tag based on its name and a list of conditions on its attributes. An attribute predicate is a function that takes an attribute and returns a boolean indicating whether the attribute matches a criterion. Several attribute operators generate common predicates. The @= operator creates a predicate that matches the name and value of an attribute exactly. For example, the selector "div" @: ["id" @= "article"] matches div tags where the id attribute is equal to "article".
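As a rough illustration of attribute predicates, the sketch below selects a div once by exact id and once by class. It assumes scalpel 0.3.x with plain String literals as tag and attribute names; the markup and the names page, articleBody, and featuredBody are hypothetical.

```haskell
import Text.HTML.Scalpel

-- Hypothetical markup with two divs that differ only in their attributes.
page :: String
page = "<div id='sidebar'>links</div>"
    ++ "<div id='article' class='post featured'>body</div>"

-- Exact match on an attribute name and value using @=.
articleBody :: Maybe String
articleBody = scrapeStringLike page (text ("div" @: ["id" @= "article"]))

-- hasClass matches when the class attribute contains the given class name
-- among its whitespace-separated entries.
featuredBody :: Maybe String
featuredBody = scrapeStringLike page (text ("div" @: [hasClass "featured"]))
```

Both values should pick out the second div, since the first div fails each predicate.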

Scrapers

Scrapers are values that are parameterized over a selector and produce a value from an HTML DOM tree. The Scraper type takes two type parameters. The first is the string-like type used to store the text values within a DOM tree; any string-like type supported by Text.StringLike is valid. The second is the type of value that the scraper produces.
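To make the two type parameters concrete, here is a small sketch, again assuming scalpel 0.3.x. Both scrapers read the document as String, but they produce different result types. The names doc, title, linkCount, and the markup are invented for this example.

```haskell
import Text.HTML.Scalpel

-- Hypothetical markup: a heading followed by two links.
doc :: String
doc = "<h1>Title</h1><a href='/a'>A</a><a href='/b'>B</a>"

-- Reads String text from the DOM and produces a String.
title :: Scraper String String
title = text "h1"

-- Reads String text from the DOM but produces an Int: the number of
-- anchors that carry an href attribute.
linkCount :: Scraper String Int
linkCount = length <$> attrs "href" "a"

titleResult :: Maybe String
titleResult = scrapeStringLike doc title

linkCountResult :: Maybe Int
linkCountResult = scrapeStringLike doc linkCount
```

Because a scraper is an ordinary value, it can be transformed with fmap (as linkCount does) before it is ever run against a document.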

There are several scraper primitives that take selectors and extract content from the DOM. Each primitive defined by this library comes in two variants: singular and plural. The singular variants extract the first instance matching the given selector, while the plural variants match every instance.
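The singular and plural variants can be compared side by side. This is a minimal sketch assuming scalpel 0.3.x; text and texts are the library's primitives, while items, firstItem, and allItems are hypothetical names.

```haskell
import Text.HTML.Scalpel

-- Hypothetical markup: a list with three entries.
items :: String
items = "<ul><li>first</li><li>second</li><li>third</li></ul>"

-- Singular variant: only the first matching li contributes a value.
firstItem :: Maybe String
firstItem = scrapeStringLike items (text "li")

-- Plural variant: every matching li contributes a value, in document order.
allItems :: Maybe [String]
allItems = scrapeStringLike items (texts "li")
```

Here firstItem should be Just "first", and allItems should be Just ["first","second","third"].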

Example

Complete examples can be found in the examples folder in the scalpel git repository.

The following example demonstrates most of the features provided by this library. Suppose you have the following hypothetical HTML located at "http://example.com/article.html" and you would like to extract a list of all of the comments.

<html>
  <body>
    <div class='comments'>
      <div class='comment container'>
        <span class='comment author'>Sally</span>
        <div class='comment text'>Woo hoo!</div>
      </div>
      <div class='comment container'>
        <span class='comment author'>Bill</span>
        <img class='comment image' src='http://example.com/cat.gif' />
      </div>
      <div class='comment container'>
        <span class='comment author'>Susan</span>
        <div class='comment text'>WTF!?!</div>
      </div>
    </div>
  </body>
</html>

The following snippet defines a function, allComments, that downloads the web page and extracts all of the comments into a list:

import Control.Applicative ((<|>))
import Text.HTML.Scalpel

type Author = String

data Comment
    = TextComment Author String
    | ImageComment Author URL
    deriving (Show, Eq)

allComments :: IO (Maybe [Comment])
allComments = scrapeURL "http://example.com/article.html" comments
   where
       comments :: Scraper String [Comment]
       comments = chroots ("div" @: [hasClass "container"]) comment

       comment :: Scraper String Comment
       comment = textComment <|> imageComment

       textComment :: Scraper String Comment
       textComment = do
           author      <- text $ "span" @: [hasClass "author"]
           commentText <- text $ "div"  @: [hasClass "text"]
           return $ TextComment author commentText

       imageComment :: Scraper String Comment
       imageComment = do
           author   <- text       $ "span" @: [hasClass "author"]
           imageURL <- attr "src" $ "img"  @: [hasClass "image"]
           return $ ImageComment author imageURL
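
The same scraper logic can be exercised without downloading anything by running it over an in-memory string. The following is a sketch under a few assumptions: it targets scalpel 0.3.x, it uses scrapeStringLike in place of scrapeURL, and it inlines a shortened copy of the sample page. The names sampleHtml and offlineComments are hypothetical, and the Comment type is simplified to plain String fields.

```haskell
import Control.Applicative ((<|>))
import Text.HTML.Scalpel

data Comment
    = TextComment String String   -- author, comment text
    | ImageComment String String  -- author, image URL
    deriving (Show, Eq)

-- Hypothetical inline copy of the first two comments of the sample page.
sampleHtml :: String
sampleHtml = concat
    [ "<div class='comments'>"
    , "<div class='comment container'>"
    , "<span class='comment author'>Sally</span>"
    , "<div class='comment text'>Woo hoo!</div>"
    , "</div>"
    , "<div class='comment container'>"
    , "<span class='comment author'>Bill</span>"
    , "<img class='comment image' src='http://example.com/cat.gif' />"
    , "</div>"
    , "</div>"
    ]

-- Try the text form first and fall back to the image form if it fails.
comment :: Scraper String Comment
comment = textComment <|> imageComment
  where
    textComment = do
        author <- text ("span" @: [hasClass "author"])
        body   <- text ("div"  @: [hasClass "text"])
        return (TextComment author body)
    imageComment = do
        author <- text ("span" @: [hasClass "author"])
        url    <- attr "src" ("img" @: [hasClass "image"])
        return (ImageComment author url)

-- Run the comment scraper once per container div.
offlineComments :: Maybe [Comment]
offlineComments =
    scrapeStringLike sampleHtml (chroots ("div" @: [hasClass "container"]) comment)
```

Keeping the scraper separate from the download step in this way makes it easy to check against saved pages before pointing it at a live URL.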

Changes

Change Log

HEAD

0.3.1

  • Added the innerHTML and innerHTMLs scrapers.
  • Added the match function which allows for the creation of arbitrary attribute predicates.
  • Fixed build breakage with GHC 8.0.1.
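
As an illustration of the match function mentioned above, the sketch below builds a custom predicate from a function over the attribute's name and value. It assumes scalpel 0.3.1 and that match takes a function of type String -> String -> Bool; the names secureHref, linksHtml, and secureLinks, and the markup, are hypothetical.

```haskell
import Data.List (isPrefixOf)
import Text.HTML.Scalpel

-- Hypothetical predicate: accept an attribute named "href" whose value
-- starts with "https://".
secureHref :: AttributePredicate
secureHref = match (\key value -> key == "href" && "https://" `isPrefixOf` value)

-- Hypothetical markup: one https link and one http link.
linksHtml :: String
linksHtml = "<a href='https://example.com/'>secure</a>"
         ++ "<a href='http://example.com/'>plain</a>"

-- Collect the href of every anchor that satisfies the custom predicate.
secureLinks :: Maybe [String]
secureLinks = scrapeStringLike linksHtml (attrs "href" ("a" @: [secureHref]))
```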

0.3.0.1

  • Make tag and attribute matching case-insensitive.

0.3.0

  • Added benchmarks and many optimizations.
  • The select method is removed from the public API.
  • Many methods now have a constraint that the string type parametrizing TagSoup’s tag type must be orderable.
  • Added scrapeURLWithConfig that will hopefully put an end to multiplying scrapeURLWith* methods.
  • The default behaviour of the scrapeURL* methods is to attempt to infer the character encoding from the Content-Type header.

0.2.1.1

  • Clean up stale instance references in documentation of TagName and AttributeName.

0.2.1

  • Made Scraper an instance of MonadPlus.

0.2.0.1

  • Fixed examples in documentation and added an examples folder for ready-to-compile examples. Added travis tests to ensure that examples remain compilable.

0.2.0

  • Removed the StringLike parameter from the Selector, Selectable, AttributePredicate, AttributeName, and TagName types. Instead they are now agnostic to the underlying string type, and are only constructable with Strings and the Any type.

0.1.3.1

  • Tighten dependencies and drop download-curl altogether.

0.1.3

  • Add the html and htmls scraper primitives for extracting raw HTML.

0.1.2

  • Make scrapeURL follow redirects by default.
  • Expose a new function scrapeURLWithOpts that takes a list of curl options.
  • Fix bug (#2) where image tags that do not have a trailing “/” are not selectable.

0.1.1

  • Tighten dependencies on download-curl.

0.1.0

  • First version!